1. We were calling the plugins connectors (which is kind of a generic way
to say either source or sink) and the framework copycat. The pro of copycat
is that it's kind of fun. The con is that it doesn't really say what it
does. The Kafka Connector Framework would be a duller but more intuitive
name, but I suspect people would then just shorten it to KCF, which again
has no intuitive meaning.

2. Potentially. One alternative we had thought of wrt the consumer was to
have the protocol just handle the group management part and have the
partition assignment be purely driven by the client. At the time copycat
wasn't even a twinkle in our eyes so we weren't really thinking about that.
There were pros and cons to this and we decided it was better to do
partition assignment on the broker side. We could revisit this; it might
not be a massive change in the consumer, but it would definitely add work
there. I do agree that if we have learned one thing it is to keep clients
away from zk. This zk usage is more limited, though, in that unlike the
clients there is no intention of implementing copycat in different
languages.

4. I think the idea is to include the structural schema information you
have available so it can be taken advantage of. Obviously the easiest
approach would just be to have a static schema for the messages like
timestamp + string/byte[]. However this means that if the source has schema
information there is no real official way to propagate it. Having a real
built-in schema mechanism gives you a little more power to make the data
usable. So if you were publishing apache logs the low-touch generic way
would just be to have the schema be "string" since that is what apache log
entries are. However if you had the actual format string used for the log
you could use that to have a richer schema and parse out the individual
fields, which is significantly more usable. The advantage of this is that
systems like databases, Hadoop, and so on that have some notion of schemas
can take advantage of this information that is captured with the source
data. So, e.g. the JDBC plugin can map the individual fields to columns
automatically, and you can support features like projecting out particular
fields and renaming fields easily without having to write custom
source-specific code.
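
To make that concrete, here's a rough sketch of the two approaches. Avro is
used purely as a stand-in for whatever schema mechanism copycat ends up
with, and the class and field names are made up for the example; none of
this is the proposed API:

    // Illustrative sketch only: Avro is just a stand-in here for whatever
    // schema mechanism copycat ends up with, and none of these names are
    // the proposed API.
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class ApacheLogSchemaSketch {
        public static void main(String[] args) {
            // Low-touch generic approach: the schema is just "string", so the
            // whole log line is an opaque blob downstream systems must re-parse.
            Schema rawSchema = Schema.create(Schema.Type.STRING);
            String rawLine =
                "10.0.0.1 - - [16/Jun/2015:17:00:00] \"GET /index.html HTTP/1.1\" 200 5120";

            // Richer approach: if the source knows the log format string, it can
            // publish a structured schema and parse out the individual fields.
            Schema logSchema = SchemaBuilder.record("ApacheLogEntry").fields()
                .requiredString("clientIp")
                .requiredLong("timestampMs")
                .requiredString("method")
                .requiredString("path")
                .requiredInt("status")
                .requiredLong("bytes")
                .endRecord();

            GenericRecord entry = new GenericData.Record(logSchema);
            entry.put("clientIp", "10.0.0.1");
            entry.put("timestampMs", 1434499200000L);
            entry.put("method", "GET");
            entry.put("path", "/index.html");
            entry.put("status", 200);
            entry.put("bytes", 5120L);

            // A schema-aware sink (e.g. a JDBC plugin) can map these fields to
            // columns or project/rename them without source-specific code.
            System.out.println(rawSchema + ": " + rawLine);
            System.out.println(logSchema.getName() + ": " + entry);
        }
    }

The point is just that once the fields are captured in a schema alongside
the data, a schema-aware sink can do column mapping, projection, and
renaming generically instead of every pipeline writing its own parser.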

-Jay

On Tue, Jun 16, 2015 at 5:00 PM, Joe Stein <joe.st...@stealth.ly> wrote:

> Hey Ewen, very interesting!
>
> I like the idea of the connector and making one side always be Kafka, for
> all the reasons you mentioned. It makes having to build consumers (over and
> over and over (and over)) again for these types of tasks much more
> consistent for everyone.
>
> Some initial comments (will read a few more times and think more through
> it).
>
> 1) Copycat, it might be weird/hard to talk about producers, consumers,
> brokers and copycat for what and how "kafka" runs. I think the other naming
> makes sense but maybe we can call it something else? "Sinks" or whatever
> (don't really care just bringing up it might be something to consider). We
> could also just call it "connectors"...dunno.... producers, consumers,
> brokers and connectors...
>
> 2) Can we do copycat-workers without having to rely on Zookeeper? So much
> work has been done to remove this dependency; if we can do something
> without ZK let's try (or at least abstract it so it is easier to make it
> pluggable later).
>
> 3) Even though connectors being managed in project has already been
> rejected... maybe we want to have a few (or one) that are in the project
> and maintained. This makes out of the box really out of the box (if only
> file or hdfs or something).
>
> 4) "all records include schemas which describe the format of their data" I
> don't totally get this... a lot of data doesn't have the schema with it, we
> have to plug that in... so would the plugin you are talking about for
> serializer would inject the schema to use with the record when it sees the
> data?
>
>
> ~ Joe Stein
> - - - - - - - - - - - - - - - - -
>
>   http://www.stealth.ly
> - - - - - - - - - - - - - - - - -
>
> On Tue, Jun 16, 2015 at 4:33 PM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
>
> > Oops, linked the wrong thing. Here's the correct one:
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> >
> > -Ewen
> >
> > On Tue, Jun 16, 2015 at 4:32 PM, Ewen Cheslack-Postava <
> e...@confluent.io>
> > wrote:
> >
> > > Hi all,
> > >
> > > I just posted KIP-26 - Add Copycat, a connector framework for data
> > > import/export here:
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> > >
> > > This is a large KIP compared to what we've had so far, and is a bit
> > > different from most. We're proposing the addition of a fairly big new
> > > component to Kafka because we think including it as part of Kafka
> rather
> > > than as an external project is in the best interest of both Copycat and
> > > Kafka itself.
> > >
> > > The goal with this KIP is to decide whether such a tool would make
> sense
> > > in Kafka, give a high level sense of what it would entail, and scope
> what
> > > would be included vs what would be left to third-parties. I'm hoping to
> > > leave discussion of specific design and implementation details, as well
> > > logistics like how best to include it in the Kafka repository &
> project,
> > to
> > > the subsequent JIRAs or follow up KIPs.
> > >
> > > Looking forward to your feedback!
> > >
> > > -Ewen
> > >
> > > P.S. Preemptive relevant XKCD: https://xkcd.com/927/
> > >
> > >
> >
> >
> > --
> > Thanks,
> > Ewen
> >
>
