I think we want the connectors to be federated, just because trying to
maintain all the connectors centrally would be really painful. If we do
this well, we would want to have >100 of these connectors, so it really
won't make sense to maintain them within the project. I think the thought
was just to include the framework and maybe one simple connector as an
example.

Thoughts?

-Jay

On Fri, Jun 19, 2015 at 2:38 PM, Gwen Shapira <gshap...@cloudera.com> wrote:

> I think BikeShed will be a great name.
>
> Can you clarify the scope? The KIP discusses a framework and also a few
> example connectors. Does the addition include just the framework (and
> perhaps an example or two), or do we plan to start accepting connectors
> into the Apache Kafka project?
>
> Gwen
>
> On Thu, Jun 18, 2015 at 3:09 PM, Jay Kreps <j...@confluent.io> wrote:
> > I think the only problem we came up with was that Kafka KopyKat
> > abbreviates as KKK, which is not ideal in the US. Copykat would still be
> > googlable without that issue. :-)
> >
> > -Jay
> >
> > On Thu, Jun 18, 2015 at 1:20 PM, Otis Gospodnetic <
> > otis.gospodne...@gmail.com> wrote:
> >
> >> Just a comment on the name. KopyKat? More unique, easy to write,
> >> pronounce, remember...
> >>
> >> Otis
> >>
> >>
> >>
> >> > On Jun 18, 2015, at 13:36, Jay Kreps <j...@confluent.io> wrote:
> >> >
> >> > 1. We were calling the plugins connectors (which is kind of a generic
> >> > way to say either source or sink) and the framework copycat. The pro
> >> > of copycat is it is kind of fun. The con is that it doesn't really say
> >> > what it does. The Kafka Connector Framework would be a duller but more
> >> > intuitive name, but I suspect people would then just shorten it to
> >> > KCF, which again has no intuitive meaning.
> >> >
> >> > 2. Potentially. One alternative we had thought of wrt the consumer was
> >> > to have the protocol just handle the group management part and have
> >> > the partition assignment be purely driven by the client. At the time
> >> > copycat wasn't even a twinkle in our eyes, so we weren't really
> >> > thinking about that. There were pros and cons to this, and we decided
> >> > it was better to do partition assignment on the broker side. We could
> >> > revisit this; it might not be a massive change in the consumer, but it
> >> > would definitely add work there. I do agree that if we have learned
> >> > one thing, it is to keep clients away from ZK. This ZK usage is more
> >> > limited, though, in that there is no intention of having copycat in
> >> > different languages as the clients are.
> >> >
> >> > 4. I think the idea is to include the structural schema information
> >> > you have available so it can be taken advantage of. Obviously the
> >> > easiest approach would just be to have a static schema for the
> >> > messages like timestamp + string/byte[]. However, this means that if
> >> > the source has schema information there is no real official way to
> >> > propagate it. Having a real built-in schema mechanism gives you a
> >> > little more power to make the data usable. So if you were publishing
> >> > Apache logs, the low-touch generic way would just be to have the
> >> > schema be "string", since that is what Apache log entries are.
> >> > However, if you had the actual format string used for the log, you
> >> > could use that to have a richer schema and parse out the individual
> >> > fields, which is significantly more usable. The advantage of this is
> >> > that systems like databases, Hadoop, and so on that have some notion
> >> > of schemas can take advantage of this information that is captured
> >> > with the source data. So, e.g., the JDBC plugin can map the individual
> >> > fields to columns automatically, and you can support features like
> >> > projecting out particular fields and renaming fields easily, without
> >> > having to write custom source-specific code.
> >> >
> >> > -Jay
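To make point 4 above concrete: here is a rough sketch (plain Python, hand-rolled, not Copycat's actual schema API) of the difference between the low-touch "string" schema and a richer schema derived from a known Apache log format. The record and schema shapes are made up for illustration:

```python
import re

# Everything below is illustrative only -- not Copycat's real API.
# A "schema" here is just a list of (field_name, type) pairs.

# Low-touch generic schema: every log entry is a single opaque string.
RAW_SCHEMA = [("line", str)]

# If the Apache LogFormat is known (here, the Common Log Format,
# "%h %l %u %t \"%r\" %>s %b"), a richer schema with named, typed
# fields can be derived instead.
COMMON_LOG_SCHEMA = [
    ("host", str), ("ident", str), ("user", str), ("timestamp", str),
    ("request", str), ("status", int), ("size", int),
]

COMMON_LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_with_schema(line):
    """Return a structured record when the format is known; otherwise
    fall back to the generic one-string schema."""
    m = COMMON_LOG_RE.match(line)
    if m is None:
        return {"line": line}  # fallback: schema is just "string"
    record = m.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

entry = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
         '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_with_schema(entry)
# A JDBC-style sink could now map record["host"], record["status"], etc.
# to table columns instead of storing one opaque string.
```

A sink with some notion of schemas (a database, Hadoop, etc.) gets typed columns from the second form; with the plain string schema it can only store the raw line.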
> >> >
> >> >> On Tue, Jun 16, 2015 at 5:00 PM, Joe Stein <joe.st...@stealth.ly> wrote:
> >> >>
> >> >> Hey Ewen, very interesting!
> >> >>
> >> >> I like the idea of the connector and making one side always be
> >> >> Kafka, for all the reasons you mentioned. It makes having to build
> >> >> consumers (over and over and over (and over)) again for these types
> >> >> of tasks much more consistent for everyone.
> >> >>
> >> >> Some initial comments (will read a few more times and think it
> >> >> through more).
> >> >>
> >> >> 1) Copycat: it might be weird/hard to talk about producers,
> >> >> consumers, brokers, and copycat for what and how "kafka" runs. I
> >> >> think the other naming makes sense, but maybe we can call it
> >> >> something else? "Sinks" or whatever (don't really care, just bringing
> >> >> up that it might be something to consider). We could also just call
> >> >> it "connectors"...dunno.... producers, consumers, brokers, and
> >> >> connectors...
> >> >>
> >> >> 2) Can we do copycat-workers without having to rely on ZooKeeper? So
> >> >> much work has been done to remove this dependency; if we can do
> >> >> something without ZK, let's try (or at least abstract it so it is
> >> >> easier later to make it pluggable).
> >> >>
> >> >> 3) Even though connectors being managed in the project has already
> >> >> been rejected... maybe we want to have a few (or one) that are in the
> >> >> project and maintained. This makes out of the box really out of the
> >> >> box (if only file or hdfs or something).
> >> >>
> >> >> 4) "all records include schemas which describe the format of their
> >> >> data": I don't totally get this... a lot of data doesn't have the
> >> >> schema with it; we have to plug that in... so would the plugin you
> >> >> are talking about for the serializer inject the schema to use with
> >> >> the record when it sees the data?
> >> >>
> >> >>
> >> >> ~ Joe Stein
> >> >> - - - - - - - - - - - - - - - - -
> >> >>
> >> >>  http://www.stealth.ly
> >> >> - - - - - - - - - - - - - - - - -
> >> >>
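Joe's point 2 above (abstracting the coordination layer so ZK is pluggable rather than baked in) could look something like this minimal sketch; the `Coordinator` interface and all class names here are hypothetical, not anything from the KIP:

```python
from abc import ABC, abstractmethod

class Coordinator(ABC):
    """Hypothetical abstraction over worker coordination, so that a
    ZooKeeper-backed implementation is just one plugin among several
    (another could use Kafka's own group-management protocol)."""

    @abstractmethod
    def register_worker(self, worker_id: str) -> None:
        ...

    @abstractmethod
    def assign_tasks(self, tasks: list) -> dict:
        """Return a mapping of worker id -> list of assigned tasks."""
        ...

class InMemoryCoordinator(Coordinator):
    """Single-process stand-in, useful for tests; a ZkCoordinator could
    implement the same interface without callers having to change."""

    def __init__(self):
        self.workers = []

    def register_worker(self, worker_id):
        if worker_id not in self.workers:
            self.workers.append(worker_id)

    def assign_tasks(self, tasks):
        # Round-robin the tasks over the registered workers.
        assignment = {w: [] for w in self.workers}
        for i, task in enumerate(tasks):
            assignment[self.workers[i % len(self.workers)]].append(task)
        return assignment

coord = InMemoryCoordinator()
coord.register_worker("worker-1")
coord.register_worker("worker-2")
plan = coord.assign_tasks(["jdbc-0", "jdbc-1", "hdfs-0"])
```

The point is only that workers program against the interface, so swapping ZK out later is a new implementation rather than a rewrite.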
> >> >> On Tue, Jun 16, 2015 at 4:33 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >> >>
> >> >>> Oops, linked the wrong thing. Here's the correct one:
> >> >>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> >> >>>
> >> >>> -Ewen
> >> >>>
> >> >>> On Tue, Jun 16, 2015 at 4:32 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >> >>>
> >> >>>> Hi all,
> >> >>>>
> >> >>>> I just posted KIP-26 - Add Copycat, a connector framework for data
> >> >>>> import/export here:
> >> >>>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> >> >>>>
> >> >>>> This is a large KIP compared to what we've had so far, and is a bit
> >> >>>> different from most. We're proposing the addition of a fairly big
> >> >>>> new component to Kafka because we think including it as part of
> >> >>>> Kafka rather than as an external project is in the best interest of
> >> >>>> both Copycat and Kafka itself.
> >> >>>>
> >> >>>> The goal with this KIP is to decide whether such a tool would make
> >> >>>> sense in Kafka, give a high-level sense of what it would entail,
> >> >>>> and scope what would be included vs what would be left to third
> >> >>>> parties. I'm hoping to leave discussion of specific design and
> >> >>>> implementation details, as well as logistics like how best to
> >> >>>> include it in the Kafka repository & project, to the subsequent
> >> >>>> JIRAs or follow-up KIPs.
> >> >>>>
> >> >>>> Looking forward to your feedback!
> >> >>>>
> >> >>>> -Ewen
> >> >>>>
> >> >>>> P.S. Preemptive relevant XKCD: https://xkcd.com/927/
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Thanks,
> >> >>> Ewen
> >> >>
> >>
>
