[ 
https://issues.apache.org/jira/browse/KAFKA-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698296#comment-14698296
 ] 

Martin Kleppmann commented on KAFKA-2367:
-----------------------------------------

Just looked at this in the context of hopefully porting [Bottled 
Water|https://github.com/confluentinc/bottledwater-pg] (PostgreSQL change data 
capture to Kafka) to Copycat.

Bottled Water inspects the schema of the source database, and automatically 
generates an Avro schema from it (each PostgreSQL table definition is mapped to 
an Avro record type; each DB column becomes a field in the Avro record; record 
names and field names are taken from their names in the database). This makes 
integration quite smooth: you don't have to configure any data mappings (let 
alone write translation code), you just get a sensible data model by default.
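To make that mapping concrete, here is a hedged sketch (not Bottled Water's actual code; table, columns, and the type table are illustrative) of how a PostgreSQL table definition might be turned into an Avro record schema, with nullable columns becoming unions with "null":

```python
import json

# Hypothetical table definition: (column name, SQL type, nullable)
columns = [
    ("id", "integer", False),
    ("email", "text", True),
    ("created_at", "timestamp", False),
]

# Minimal illustrative SQL-to-Avro type mapping
SQL_TO_AVRO = {"integer": "int", "text": "string", "timestamp": "long"}

def table_to_avro_schema(table_name, columns):
    """Each column becomes a field in the Avro record; names are taken
    straight from the database, so no data mapping has to be configured."""
    fields = []
    for name, sql_type, nullable in columns:
        avro_type = SQL_TO_AVRO[sql_type]
        fields.append({
            "name": name,
            "type": ["null", avro_type] if nullable else avro_type,
        })
    return {"type": "record", "name": table_name, "fields": fields}

schema = table_to_avro_schema("users", columns)
print(json.dumps(schema, indent=2))
```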

So that background biases me towards Avro, and I'm happy for the Copycat data 
model to simply be Avro (although I'm not so keen on how the Avro code is 
currently unceremoniously copied and pasted into the Copycat patch). Here are a 
few more comments on bits of the discussion so far:

- Serialisation formats that use explicit field tags (Thrift, Protocol Buffers, 
Cap'n Proto) are painful with dynamically generated schemas, because their 
contract is that field numbers are forever. Say the schema is dynamically 
generated from a database schema, and someone drops a column from the middle of 
a table in the source database. If you don't forever keep that dropped column's 
field number reserved, you will generate invalid data in future. Avro doesn't 
have this problem, because fields are just identified by name. (Avro would only 
run into trouble if you create a new column with the same name as a column that 
previously existed and was dropped. Seems unlikely in practice.)
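A toy sketch of the difference (field names and values are made up): Avro matches writer data to the reader's schema by field name, so a dropped column simply disappears, whereas a tag-based format would need the dropped column's field number reserved forever:

```python
# Name-based resolution, Avro-style: the reader keeps only the fields it
# knows about, matched by NAME. Dropping "middle_name" from the source
# table requires nothing to be reserved.

def resolve_by_name(writer_record, reader_fields):
    # Fields the reader's schema no longer declares are simply skipped.
    return {f: writer_record[f] for f in reader_fields if f in writer_record}

writer_record = {"id": 7, "middle_name": "Q", "email": "x@example.com"}
reader_fields = ["id", "email"]  # schema after the column was dropped

print(resolve_by_name(writer_record, reader_fields))

# With explicit field tags (Thrift, Protocol Buffers), reusing the dropped
# column's tag number for a new column would decode old data as the wrong
# field -- hence "field numbers are forever".
```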

- I understand the desire to support JSON and other serialisation formats, but 
I don't think that using Avro as the internal data model precludes that. We can 
make it easy to convert Avro objects at run-time into other formats, and even 
include support for a few popular formats. Making a "neutral" run-time format 
seems to me like unnecessary [standards proliferation|https://xkcd.com/927/].
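As a hedged illustration of that run-time conversion (the converter and schema here are hypothetical, not an actual Copycat or Avro API): records held in an Avro-like internal model map directly onto JSON, with "null" unions collapsing naturally to optional JSON values:

```python
import json

def to_json(value, avro_type):
    """Convert a value held under an Avro-style schema into plain JSON data."""
    if isinstance(avro_type, list):  # union, e.g. ["null", "string"]
        return None if value is None else value
    if isinstance(avro_type, dict) and avro_type["type"] == "record":
        return {f["name"]: to_json(value.get(f["name"]), f["type"])
                for f in avro_type["fields"]}
    return value  # primitives pass through unchanged

schema = {"type": "record", "name": "users", "fields": [
    {"name": "id", "type": "int"},
    {"name": "email", "type": ["null", "string"]},
]}
print(json.dumps(to_json({"id": 7, "email": None}, schema)))
```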

- I think the claim that Copycat only needs 1% of Avro is rather exaggerated. A 
quick glance suggests that serialization is actually only about 30% of the Avro 
core code, and 70% is data model and schema management. If you start from the 
assumption that Copycat needs schemas, then you very quickly end up with 
something that looks very like Avro.

- IMHO, the problem with LinkedIn failing to upgrade from Avro 1.4 says more 
about problems with LinkedIn's dependency management than it says about Avro 
itself. Also, the Avro dependency we're talking about is only in Copycat 
connectors, so it is very localised, whereas LinkedIn is using it in every 
single application that has a Kafka client (i.e. basically everything).

To sum up, I agree with [~gwenshap]'s position.

> Add Copycat runtime data API
> ----------------------------
>
>                 Key: KAFKA-2367
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2367
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: copycat
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Ewen Cheslack-Postava
>             Fix For: 0.8.3
>
>
> Design the API used for runtime data in Copycat. This API is used to 
> construct schemas and records that Copycat processes. This needs to be a 
> fairly general data model (think Avro, JSON, Protobufs, Thrift) in order to 
> support complex, varied data types that may be input from/output to many data 
> systems.
> This issue should also address the serialization interfaces used 
> within Copycat, which translate the runtime data into serialized byte[] form. 
> It is important that these be considered together because the data format can 
> be used in multiple ways (records, partition IDs, partition offsets), so it 
> and the corresponding serializers must be sufficient for all these use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
