+1 

I like the idea of having CopyCat be part of core Kafka, and having the 
connectors in a separate sub-project.
This would make CopyCat a new additional API for Kafka.  Also, it is 
critical to keep the CopyCat API as solid and stable as the rest of Kafka.

The separate sub-project for connectors would make it easier for people to 
contribute connectors.  At the same time, we do want to keep the quality of 
the connectors high. Hopefully we can figure out a good way to strike the 
right balance between these two conflicting goals.  The app store analogy is 
actually pretty appropriate.  Either way, I think keeping this outside of 
core Kafka is important.

Kartik
(LinkedIn)
________________________________________
From: Jay Kreps [j...@confluent.io]
Sent: Monday, June 22, 2015 4:02 PM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-26 - Add Copycat, a connector framework for data 
import/export

Hey Gwen,

That makes a lot of sense. Here was the thinking on our side.

I guess there are two questions, where does Copycat go and where do the
connectors go?

I'm in favor of Copycat being in Kafka and the connectors being federated.

Arguments for federating connectors:
- There will be well over 100 connectors, so if we keep them all in the same
repo it will be a lot to maintain.
- These plugin apis are a fantastic area for open source contribution--well
defined, bite sized, immediately useful, etc.
- If I wrote connector A I'm not particularly qualified to review connector
B. These things require basic Kafka knowledge but mostly they're very
system specific. Putting them all in one project ends up being kind of a
mess.
- Many people will have in-house systems that require custom connectors
anyway.
- You can't centrally maintain all the connectors, so you in any case need
to solve the whole "app store" experience for connectors (Ewen laughs
at me every time I say "app store for connectors"). Once you do that it
makes sense to just use that mechanism for everything.
- Many vendors we've talked to want to be able to maintain their own
connector and release it with their system, not with Kafka or another
third-party project.
- There is definitely a role for testing and certification of the
connectors but it's probably not something the main project should take on.

Federation doesn't necessarily mean that there can only be one repository
for each connector. We have a single repo for the connectors we're building
at confluent just for simplicity. It just means that regardless of where
the connector is maintained it integrates as a first-class citizen.

Basically I think really nailing federated connectors is pretty central to
having a healthy connector ecosystem which is the primary thing for making
this work.

Okay now the question of whether the copycat apis/framework should be in
Kafka or be an external project. We debated this a lot internally.

I was on the pro-kafka-inclusion side, so let me give that argument. I think
the apis for pulling data into Kafka or pushing data into a third-party
system are actually core to what Kafka is. Kafka currently provides
a push producer and pull consumer because those are the harder problems to
solve, but about half the time you need the opposite (a pull producer and
push consumer). It feels weird to include any new thing, but I actually
feel like these apis are super central and natural to include in Kafka (in
fact they are so natural that many other systems only have that style of API).
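
To make that concrete, here is a rough sketch of what I mean by a pull
producer and a push consumer (purely illustrative plain Java; not a
proposed API, and the interface and method names are made up):

    import java.util.List;

    // Hypothetical source side: the framework polls the connector for
    // records read from the external system and publishes them to Kafka.
    interface PullProducer<R> {
        List<R> poll() throws InterruptedException; // blocks until data is available
    }

    // Hypothetical sink side: the framework consumes from Kafka and pushes
    // each batch of records into the external system.
    interface PushConsumer<R> {
        void put(List<R> records); // write the batch to the destination
    }

The existing clients cover the push-into-Kafka/pull-out-of-Kafka half; apis
shaped roughly like this would cover the other half.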

I think the key question is whether we can do a good job at designing these
apis. If we can then we should really have an official set of apis. Having
official Kafka apis that are documented as part of the main docs and are
part of each release will do a ton to help foster the connector ecosystem
because it will be kind of a default way of doing Kafka integration, and all
the people building in-house from-scratch connectors will likely just use
it. If it is a separate project then it is a separate discovery and
adoption decision (this is somewhat irrational but totally true).

I think one assumption we are making is that the copycat framework won't be
huge. It should be a manageable chunk of code.

I agree with your description of some of the cons of bundling. However,
I think there are pros as well, and some of them are quite important.

The biggest is that, for some reason, things that are maintained and
documented together end up feeling and working like a single product. This
is sort of a fuzzy thing. But one complaint I have about the Hadoop
ecosystem (and it is one of the more amazing products of open source in the
history of the world, so forgive the criticism) is that it FEELs like a
loosely affiliated collection of independent things kind of bolted
together. Products that are more centralized can give a much more holistic
feel to usage (configuration, commands, monitoring, etc) and things that
aren't somehow always drift apart (maybe just because the committers are
different).

So I actually totally agree with what you said about Spark. And if we end
up trying to include a machine learning library or anything far afield I
think I would agree we would have exactly that problem.

But I think the argument I would make is that this is actually a gap in our
existing product, not a new product and so having that identity is
important.

-Jay

On Sun, Jun 21, 2015 at 9:24 PM, Gwen Shapira <gshap...@cloudera.com> wrote:

> Ah, I see this in rejected alternatives now. Sorry :)
>
> I actually prefer the idea of a separate project for framework +
> connectors over having the framework be part of Apache Kafka.
>
> Looking at nearby examples: Hadoop has created a wide ecosystem of
> projects, with Sqoop and Flume supplying connectors. Spark on the
> other hand keeps its subprojects as part of Apache Spark.
>
> When I look at both projects, I see that Flume and Sqoop created
> active communities (that was especially true a few years back when we
> were rapidly growing), with many companies contributing. Spark OTOH
> (and with all respect to my friends at Spark) has tons of
> contributors to its core, but much less activity on its sub-projects
> (for example, SparkStreaming). I strongly believe that SparkStreaming
> is under-served by being a part of Spark, especially when compared to
> Storm which is an independent project with its own community.
>
> The way I see it, connector frameworks are significantly simpler than
> distributed data stores (although they are pretty large in terms of
> code base, especially with copycat having its own distributed
> processing framework). This means that the barrier to contribution to
> connector frameworks is lower, both for contributing to the framework
> and for contributing connectors. Separate communities can also have
> different rules regarding dependencies and committership.
> Committership is the big one, and IMO what prevents SparkStreaming
> from growing - I can give someone a commit bit on Sqoop without giving
> them any power over Hadoop. Not true for Spark and SparkStreaming.
> This means that a CopyCat community (with its own sexy cat logo) will
> be able to attract more volunteers and grow at a faster pace than core
> Kafka, making it more useful to the community.
>
> The other part is that just like Kafka will be more useful with a
> connector framework, a connector framework tends to work better when
> there are lots of connectors. So if we decide to partition the Kafka /
> Connector framework / Connectors triad, I'm not sure which
> partitioning makes more sense. Giving CopyCat (I love the name. You
> can say things like "get the data into MySQL and CC Kafka") its own
> community will allow the CopyCat community to accept connector
> contributions, which is good for CopyCat and for Kafka adoption.
> Oracle and Netezza contributed connectors to Sqoop; they probably
> couldn't have contributed them at all if Sqoop were inside Hadoop, and they
> can't really opensource their own stuff through Github, so it was a
> win for our community. This doesn't negate the possibility of creating
> connectors for CopyCat and not contributing them to the community (like
> the popular Teradata connector for Sqoop).
>
> Regarding ease of use and adoption: Right now, a lot of people adopt
> Kafka as a stand-alone piece, while Hadoop usually shows up through a
> distribution. I expect that soon people will start adopting Kafka
> through distributions, so the framework and a collection of connectors
> will be part of every distribution, in the same way that no one thinks
> of Sqoop or Flume as stand-alone projects. With a bunch of Kafka
> distributions out there, people will get Kafka + Framework +
> Connectors, with a core connection portion being common to multiple
> distributions - this will allow even easier adoption, while allowing
> the Kafka community to focus on core Kafka.
>
> The point about documentation that Ewen has made in the KIP is a good
> one. We definitely want to point people to the right place for export
> / import tools. However, it sounds solvable with a few links.
>
> Sorry for the lengthy essay - I'm a bit passionate about connectors
> and want to see CopyCat off to a great start in life :)
>
> (BTW. I think Apache is a great place for CopyCat. I'll be happy to
> help with the process of incubating it)
>
>
> On Fri, Jun 19, 2015 at 2:47 PM, Jay Kreps <j...@confluent.io> wrote:
> > I think we want the connectors to be federated just because trying to
> > maintain all the connectors centrally would be really painful. I think if
> > we really do this well we would want to have >100 of these connectors, so it
> > really won't make sense to maintain them with the project. I think the
> > thought was just to include the framework and maybe one simple connector as
> > an example.
> >
> > Thoughts?
> >
> > -Jay
> >
> > On Fri, Jun 19, 2015 at 2:38 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
> >
> >> I think BikeShed will be a great name.
> >>
> >> Can you clarify the scope? The KIP discusses a framework and also a few
> >> examples of connectors. Does the addition include just the framework
> >> (and perhaps an example or two), or do we plan to start accepting
> >> connectors into the Apache Kafka project?
> >>
> >> Gwen
> >>
> >> On Thu, Jun 18, 2015 at 3:09 PM, Jay Kreps <j...@confluent.io> wrote:
> >> > I think the only problem we came up with was that Kafka KopyKat
> >> > abbreviates as KKK, which is not ideal in the US. Copykat would still be
> >> > googlable without that issue. :-)
> >> >
> >> > -Jay
> >> >
> >> > On Thu, Jun 18, 2015 at 1:20 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> >> >
> >> >> Just a comment on the name. KopyKat? More unique, easy to write,
> >> >> pronounce, remember...
> >> >>
> >> >> Otis
> >> >>
> >> >>
> >> >>
> >> >> > On Jun 18, 2015, at 13:36, Jay Kreps <j...@confluent.io> wrote:
> >> >> >
> >> >> > 1. We were calling the plugins connectors (which is kind of a generic
> >> >> > way to say either source or sink) and the framework copycat. The pro of
> >> >> > copycat is that it is kind of fun. The con is that it doesn't really say
> >> >> > what it does. The Kafka Connector Framework would be a duller but more
> >> >> > intuitive name, but I suspect people would then just shorten it to KCF,
> >> >> > which again has no intuitive meaning.
> >> >> >
> >> >> > 2. Potentially. One alternative we had thought of wrt the consumer was
> >> >> > to have the protocol just handle the group management part and have the
> >> >> > partition assignment be purely driven by the client. At the time copycat
> >> >> > wasn't even a twinkle in our eyes, so we weren't really thinking about
> >> >> > that. There were pros and cons to this, and we decided it was better to
> >> >> > do partition assignment on the broker side. We could revisit this; it
> >> >> > might not be a massive change in the consumer, but it would definitely
> >> >> > add work there. I do agree that if we have learned one thing it is to
> >> >> > keep clients away from zk. This zk usage is more limited though, in that
> >> >> > there is no intention of having copycat in different languages as the
> >> >> > clients are.
> >> >> >
> >> >> > 4. I think the idea is to include the structural schema information you
> >> >> > have available so it can be taken advantage of. Obviously the easiest
> >> >> > approach would just be to have a static schema for the messages like
> >> >> > timestamp + string/byte[]. However this means that if the source has
> >> >> > schema information there is no real official way to propagate that.
> >> >> > Having a real built-in schema mechanism gives you a little more power
> >> >> > to make the data usable. So if you were publishing apache logs the
> >> >> > low-touch generic way would just be to have the schema be "string"
> >> >> > since that is what apache log entries are. However if you had the
> >> >> > actual format string used for the log you could use that to have a
> >> >> > richer schema and parse out the individual fields, which is
> >> >> > significantly more usable. The advantage of this is that systems like
> >> >> > databases, Hadoop, and so on that have some notion of schemas can take
> >> >> > advantage of this information that is captured with the source data.
> >> >> > So, e.g. the JDBC plugin can map the individual fields to columns
> >> >> > automatically, and you can support features like projecting out
> >> >> > particular fields and renaming fields easily without having to write
> >> >> > custom source-specific code.
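> >> >> >
> >> >> > To make that concrete, here is a rough sketch of a record that carries
> >> >> > its own schema (purely illustrative plain Java; not the actual data
> >> >> > model being proposed, and the field and type names are made up):
> >> >> >
> >> >> >     import java.util.LinkedHashMap;
> >> >> >     import java.util.Map;
> >> >> >
> >> >> >     // Hypothetical schema'd record: the schema travels with the value.
> >> >> >     class SchemaedRecord {
> >> >> >         final Map<String, String> schema; // field name -> type name
> >> >> >         final Map<String, Object> value;  // field name -> field value
> >> >> >         SchemaedRecord(Map<String, String> schema,
> >> >> >                        Map<String, Object> value) {
> >> >> >             this.schema = schema;
> >> >> >             this.value = value;
> >> >> >         }
> >> >> >     }
> >> >> >
> >> >> >     class ApacheLogSchemas {
> >> >> >         public static void main(String[] args) {
> >> >> >             // Low-touch generic schema: the whole log entry is a string.
> >> >> >             Map<String, String> generic = new LinkedHashMap<>();
> >> >> >             generic.put("line", "string");
> >> >> >
> >> >> >             // With the log format string you can recover a richer
> >> >> >             // schema, which e.g. a JDBC sink could map to table columns.
> >> >> >             Map<String, String> parsed = new LinkedHashMap<>();
> >> >> >             parsed.put("client_ip", "string");
> >> >> >             parsed.put("timestamp", "timestamp");
> >> >> >             parsed.put("request", "string");
> >> >> >             parsed.put("status", "int32");
> >> >> >         }
> >> >> >     }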
> >> >> >
> >> >> > -Jay
> >> >> >
> >> >> >> On Tue, Jun 16, 2015 at 5:00 PM, Joe Stein <joe.st...@stealth.ly> wrote:
> >> >> >>
> >> >> >> Hey Ewen, very interesting!
> >> >> >>
> >> >> >> I like the idea of the connector and making one side always being
> >> >> >> Kafka, for all the reasons you mentioned. It makes having to build
> >> >> >> consumers (over and over and over (and over)) again for these types of
> >> >> >> tasks much more consistent for everyone.
> >> >> >>
> >> >> >> Some initial comments (will read a few more times and think more
> >> >> >> through it).
> >> >> >>
> >> >> >> 1) Copycat: it might be weird/hard to talk about producers, consumers,
> >> >> >> brokers and copycat when describing what and how "kafka" runs. I think
> >> >> >> the other naming makes sense, but maybe we can call it something else?
> >> >> >> "Sinks" or whatever (don't really care, just bringing up that it might
> >> >> >> be something to consider). We could also just call it
> >> >> >> "connectors"...dunno.... producers, consumers, brokers and connectors...
> >> >> >>
> >> >> >> 2) Can we do copycat-workers without having to rely on Zookeeper? So
> >> >> >> much work has been done to remove this dependency; if we can do
> >> >> >> something without ZK, let's try (or at least abstract it so it is
> >> >> >> easier later to make it pluggable).
> >> >> >>
> >> >> >> 3) Even though keeping connectors managed in the project has already
> >> >> >> been rejected... maybe we want to have a few (or one) that are in the
> >> >> >> project and maintained. This makes out-of-the-box really out of the box
> >> >> >> (even if only file or hdfs or something).
> >> >> >>
> >> >> >> 4) "all records include schemas which describe the format of their
> >> >> data" I
> >> >> >> don't totally get this... a lot of data doesn't have the schema
> with
> >> >> it, we
> >> >> >> have to plug that in... so would the plugin you are talking about
> for
> >> >> >> serializer would inject the schema to use with the record when it
> >> sees
> >> >> the
> >> >> >> data?
> >> >> >>
> >> >> >>
> >> >> >> ~ Joe Stein
> >> >> >> - - - - - - - - - - - - - - - - -
> >> >> >>
> >> >> >>  http://www.stealth.ly
> >> >> >> - - - - - - - - - - - - - - - - -
> >> >> >>
> >> >> >> On Tue, Jun 16, 2015 at 4:33 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >> >> >>
> >> >> >>> Oops, linked the wrong thing. Here's the correct one:
> >> >> >>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> >> >> >>>
> >> >> >>> -Ewen
> >> >> >>>
> >> >> >>> On Tue, Jun 16, 2015 at 4:32 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >> >> >>>
> >> >> >>>> Hi all,
> >> >> >>>>
> >> >> >>>> I just posted KIP-26 - Add Copycat, a connector framework for data
> >> >> >>>> import/export here:
> >> >> >>>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> >> >> >>>>
> >> >> >>>> This is a large KIP compared to what we've had so far, and is a bit
> >> >> >>>> different from most. We're proposing the addition of a fairly big new
> >> >> >>>> component to Kafka because we think including it as part of Kafka
> >> >> >>>> rather than as an external project is in the best interest of both
> >> >> >>>> Copycat and Kafka itself.
> >> >> >>>>
> >> >> >>>> The goal with this KIP is to decide whether such a tool would make
> >> >> >>>> sense in Kafka, give a high-level sense of what it would entail, and
> >> >> >>>> scope what would be included vs what would be left to third parties.
> >> >> >>>> I'm hoping to leave discussion of specific design and implementation
> >> >> >>>> details, as well as logistics like how best to include it in the Kafka
> >> >> >>>> repository & project, to the subsequent JIRAs or follow-up KIPs.
> >> >> >>>>
> >> >> >>>> Looking forward to your feedback!
> >> >> >>>>
> >> >> >>>> -Ewen
> >> >> >>>>
> >> >> >>>> P.S. Preemptive relevant XKCD: https://xkcd.com/927/
> >> >> >>>
> >> >> >>>
> >> >> >>> --
> >> >> >>> Thanks,
> >> >> >>> Ewen
> >> >> >>
> >> >>
> >>
>
