[ https://issues.apache.org/jira/browse/KAFKA-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697499#comment-14697499 ]

Jay Kreps commented on KAFKA-2367:
----------------------------------

I'm more negative on using Avro. Here's my thinking.

What we need is about 1% of what Avro does. We need a simple runtime object 
model that supports JSON-like types (String, List, Map/Record, etc.). Avro has 
this, but also umpteen bazillion other things, mostly around serialization. In 
addition to its own jar, it pulls in Jackson for its JSON handling, so users 
will also be pinned to that Jackson version (Jackson is heavily used, so there 
will be lots of conflicts).
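To make the "1%" concrete, here is a minimal sketch of the kind of JSON-like 
runtime model being described: a handful of types and a tagged value, with no 
serialization machinery. All names here (Type, TypedValue, wrap) are 
hypothetical illustrations, not the actual copycat API.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a minimal runtime data model: JSON-like types only,
// no serialization concerns. Names are illustrative, not copycat's real API.
public class MiniDataModel {
    // The small set of types a connector-facing model might need.
    enum Type { STRING, INT64, BOOLEAN, LIST, MAP }

    // A value tagged with its runtime type.
    static final class TypedValue {
        final Type type;
        final Object value;
        TypedValue(Type type, Object value) {
            this.type = type;
            this.value = value;
        }
    }

    // Infer the runtime type of a plain Java object.
    static TypedValue wrap(Object v) {
        if (v instanceof String)  return new TypedValue(Type.STRING, v);
        if (v instanceof Long)    return new TypedValue(Type.INT64, v);
        if (v instanceof Boolean) return new TypedValue(Type.BOOLEAN, v);
        if (v instanceof List)    return new TypedValue(Type.LIST, v);
        if (v instanceof Map)     return new TypedValue(Type.MAP, v);
        throw new IllegalArgumentException("Unsupported type: " + v.getClass());
    }

    public static void main(String[] args) {
        TypedValue record = wrap(Map.of("id", 42L, "name", "widget"));
        System.out.println(record.type); // prints MAP
    }
}
```

The point of a model like this is that any format (JSON, Protocol Buffers, 
Avro) can be mapped onto it, rather than making one format's object model the 
contract.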

Also, most of the world doesn't use Avro, so making it the core contract will 
discourage connector developers who aren't intending to use Avro. If I were 
evaluating ways of getting data into Kafka and saw that the core interface for 
copycat was Avro records while I was using Protocol Buffers or JSON, I would 
probably be like "oh, this clearly isn't for me, this is for Avro people". If 
we go this route, I suspect we will spend a ton of time explaining that even 
though copycat requires Avro, it can be used with other stuff.

I am also more pessimistic about Avro compatibility. Avro has been fantastic 
about data compatibility, but LinkedIn, for example, is still on Avro 1.4 after 
several major efforts to put together an upgrade plan, since the library is 
shared across several thousand artifacts that all depend on each other and Avro 
keeps breaking binary (classfile) compatibility. We should be able to support 
e.g. JSON as a first-class citizen and have that work just as well.

I also think we should be able to easily add convenience/builder methods when 
we need to.

Philosophically I think this is the core copycat contract. I think we should 
think through exactly the minimum we need to specify here and implement that 
rather than wholesale inheriting something else that is a superset.

I agree that using Avro would get us there a little quicker and with a bit less 
code. But ultimately I think we should favor a clean copycat interface over 
minimizing copycat code, since there will be (we hope) hundreds of connectors. 
So a small improvement for connector developers should be worth some pain on 
our part.



> Add Copycat runtime data API
> ----------------------------
>
>                 Key: KAFKA-2367
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2367
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: copycat
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Ewen Cheslack-Postava
>             Fix For: 0.8.3
>
>
> Design the API used for runtime data in Copycat. This API is used to 
> construct schemas and records that Copycat processes. This needs to be a 
> fairly general data model (think Avro, JSON, Protobufs, Thrift) in order to 
> support complex, varied data types that may be input from/output to many data 
> systems.
> This issue should also address the serialization interfaces used 
> within Copycat, which translate the runtime data into serialized byte[] form. 
> It is important that these be considered together because the data format can 
> be used in multiple ways (records, partition IDs, partition offsets), so it 
> and the corresponding serializers must be sufficient for all these use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
