Re: Kafka Connect ++ Kafka Streams

2016-03-29 Thread Michal Haris
Replies from the dev list weren't getting to my original sender account. I
could create a pull request but it has some patchy code in ConnectEmbedded
and dependencies which shouldn't be there (connect-runtime) as it was
mostly intended for imagining how the integrated connect and streams
topology would look - but if you are happy for :stream:examples to have
 dependencies that can be removed after connect-api has a proper embedded
support than I can send a pull request your way.

On 25 March 2016 at 21:38, Guozhang Wang  wrote:

> I am thinking maybe we can even consider pulling the project as a whole
> into examples instead of adding the connector and streams implementation
> separately into Kafka Connect and Kafka Streams if Michal is interested to
> filing a PR: currently the examples folder only contains consumer /
> producer demos which is packaged as kafka.examples.
>
> Guozhang
>
>
> On Fri, Mar 25, 2016 at 9:28 AM, Neha Narkhede  wrote:
>
> > Michal -- This is really cool. Mind submitting a pull request?
> >
> > Also, would you like your IRC connector to be featured on the Kafka
> > Connector Hub ?
> >
> > On Fri, Mar 25, 2016 at 9:08 AM, Michal Hariš 
> > wrote:
> >
> > > So I had a go and hacked it up here: ConnectEmbedded.java
> > > <
> > >
> >
> https://github.com/amient/affinity-stack/blob/master/dev/connectors/connect-runtime/src/main/java/io/amient/kafka/connect/ConnectEmbedded.java
> > > >
> > >
> > >
> > > And this is how the wikipedia demo looks with it: hello-kafka-streams
> > > <
> > >
> >
> https://github.com/amient/affinity-stack/blob/master/dev/hello-kafka-streams/src/main/java/io/amient/kafka/streams/wikipedia/WikipediaStreamAppMain.java
> > > >
> > >
> > >
> > > As a side-effect there is a generic IRC connector too:
> kafka-connect-irc
> > > <
> > >
> >
> https://github.com/amient/affinity-stack/tree/master/dev/connectors/kafka-connect-irc/src/main/java/io/amient/kafka/connect/irc
> > > >
> > >
> > > It's kind of neat to have topology encapsulating connect and streams
> in a
> > > single instance that can just be scaled together symmetrically.
> > >
> > > Overall this was one of the most fun hack I had in a long time and the
> > > result compared to the Samza equivalent looks clean and lightweight. It
> > > also allows for zero-downtime with appropriate combination of
> deployment
> > > strategy and replication, which is something that was quite tricky with
> > > Samza and  YARN host affinity.
> > >
> > > One thing though I can't get my head around is why in Kafka Connect
> there
> > > has to be a custom internal schema format  for the in-memory runtime
> > > instead of just using Avro as the internal - the systems that talk in
> > Avro
> > > would have a performance gain and non-Avro guys would have converters
> the
> > > same way they have them now.
> > >
> > >
> > > On Thu, Mar 24, 2016 at 11:46 AM, Michal Hariš <
> michal.har...@gmail.com>
> > > wrote:
> > >
> > > > Hello Kafka people!
> > > >
> > > > Great to see Kafka Streams coming along, the design validates (and in
> > > many
> > > > way supersedes) my own findings from working with various stream
> > > processing
> > > > systems/frameworks and eventually ending-up using just a small custom
> > > > library built directly around Kafka.
> > > >
> > > > I have set out yesterday to translate Hello Samza (the wikipedia feed
> > > > example) into Kafka Streams application. Now because this workflow
> > starts
> > > > by polling wikipedia IRC and publishes to a topic from which the
> stream
> > > > processors pick-up it would be nice to have this first part done by
> > Kafka
> > > > Connect but:
> > > >
> > > > 1. IRC channels are not seekable and Kafka Connect architecture
> claims
> > > > that all sources must be seekable - is this still suitable ? (I guess
> > yes
> > > > as FileStreamSourceTask can read from stdin which is similar)
> > > >
> > > > 2. I would like to have ConnectEmbedded (as opposed to
> > ConnectStandalone
> > > > or ConnectDistributed) which is similar to ConnectDistributed, just
> > > without
> > > > the rest server - i.e. say I have the WikipediaFeedConnector and I
> want
> > > to
> > > > launch it programatically from all the instances along-side the Kafka
> > > > Streams - but reusing the connect distributed coordination so that
> only
> > > one
> > > > instance actually reads the IRC data but another instance picks up
> work
> > > if
> > > > that one dies - does it sound like a bad idea for some design reason
> ?
> > -
> > > > the only problem I see is rather technical that the coordination
> > process
> > > > uses the rest server for some actions.
> > > >
> > > > Cheers,
> > > > Michal
> > > >
> > >
> >
> >
> >
> > --
> > Thanks,
> > Neha
> >
>
>
>
> --
> -- Guozhang
>



-- 
Michal Haris
Technical Architect
direct line: +44 (0) 207 749 0229
www.visualdna.com | t: +44 (0) 207 734 7033
31 Old Nichol Street
London
E2 7HR


Re: Kafka Connect ++ Kafka Streams

2016-03-25 Thread Guozhang Wang
I am thinking maybe we can even consider pulling the project as a whole
into examples instead of adding the connector and streams implementation
separately into Kafka Connect and Kafka Streams if Michal is interested to
filing a PR: currently the examples folder only contains consumer /
producer demos which is packaged as kafka.examples.

Guozhang


On Fri, Mar 25, 2016 at 9:28 AM, Neha Narkhede  wrote:

> Michal -- This is really cool. Mind submitting a pull request?
>
> Also, would you like your IRC connector to be featured on the Kafka
> Connector Hub ?
>
> On Fri, Mar 25, 2016 at 9:08 AM, Michal Hariš 
> wrote:
>
> > So I had a go and hacked it up here: ConnectEmbedded.java
> > <
> >
> https://github.com/amient/affinity-stack/blob/master/dev/connectors/connect-runtime/src/main/java/io/amient/kafka/connect/ConnectEmbedded.java
> > >
> >
> >
> > And this is how the wikipedia demo looks with it: hello-kafka-streams
> > <
> >
> https://github.com/amient/affinity-stack/blob/master/dev/hello-kafka-streams/src/main/java/io/amient/kafka/streams/wikipedia/WikipediaStreamAppMain.java
> > >
> >
> >
> > As a side-effect there is a generic IRC connector too: kafka-connect-irc
> > <
> >
> https://github.com/amient/affinity-stack/tree/master/dev/connectors/kafka-connect-irc/src/main/java/io/amient/kafka/connect/irc
> > >
> >
> > It's kind of neat to have topology encapsulating connect and streams in a
> > single instance that can just be scaled together symmetrically.
> >
> > Overall this was one of the most fun hack I had in a long time and the
> > result compared to the Samza equivalent looks clean and lightweight. It
> > also allows for zero-downtime with appropriate combination of deployment
> > strategy and replication, which is something that was quite tricky with
> > Samza and  YARN host affinity.
> >
> > One thing though I can't get my head around is why in Kafka Connect there
> > has to be a custom internal schema format  for the in-memory runtime
> > instead of just using Avro as the internal - the systems that talk in
> Avro
> > would have a performance gain and non-Avro guys would have converters the
> > same way they have them now.
> >
> >
> > On Thu, Mar 24, 2016 at 11:46 AM, Michal Hariš 
> > wrote:
> >
> > > Hello Kafka people!
> > >
> > > Great to see Kafka Streams coming along, the design validates (and in
> > many
> > > way supersedes) my own findings from working with various stream
> > processing
> > > systems/frameworks and eventually ending-up using just a small custom
> > > library built directly around Kafka.
> > >
> > > I have set out yesterday to translate Hello Samza (the wikipedia feed
> > > example) into Kafka Streams application. Now because this workflow
> starts
> > > by polling wikipedia IRC and publishes to a topic from which the stream
> > > processors pick-up it would be nice to have this first part done by
> Kafka
> > > Connect but:
> > >
> > > 1. IRC channels are not seekable and Kafka Connect architecture claims
> > > that all sources must be seekable - is this still suitable ? (I guess
> yes
> > > as FileStreamSourceTask can read from stdin which is similar)
> > >
> > > 2. I would like to have ConnectEmbedded (as opposed to
> ConnectStandalone
> > > or ConnectDistributed) which is similar to ConnectDistributed, just
> > without
> > > the rest server - i.e. say I have the WikipediaFeedConnector and I want
> > to
> > > launch it programatically from all the instances along-side the Kafka
> > > Streams - but reusing the connect distributed coordination so that only
> > one
> > > instance actually reads the IRC data but another instance picks up work
> > if
> > > that one dies - does it sound like a bad idea for some design reason ?
> -
> > > the only problem I see is rather technical that the coordination
> process
> > > uses the rest server for some actions.
> > >
> > > Cheers,
> > > Michal
> > >
> >
>
>
>
> --
> Thanks,
> Neha
>



-- 
-- Guozhang


Re: Kafka Connect ++ Kafka Streams

2016-03-25 Thread Neha Narkhede
Michal -- This is really cool. Mind submitting a pull request?

Also, would you like your IRC connector to be featured on the Kafka
Connector Hub ?

On Fri, Mar 25, 2016 at 9:08 AM, Michal Hariš 
wrote:

> So I had a go and hacked it up here: ConnectEmbedded.java
> <
> https://github.com/amient/affinity-stack/blob/master/dev/connectors/connect-runtime/src/main/java/io/amient/kafka/connect/ConnectEmbedded.java
> >
>
>
> And this is how the wikipedia demo looks with it: hello-kafka-streams
> <
> https://github.com/amient/affinity-stack/blob/master/dev/hello-kafka-streams/src/main/java/io/amient/kafka/streams/wikipedia/WikipediaStreamAppMain.java
> >
>
>
> As a side-effect there is a generic IRC connector too: kafka-connect-irc
> <
> https://github.com/amient/affinity-stack/tree/master/dev/connectors/kafka-connect-irc/src/main/java/io/amient/kafka/connect/irc
> >
>
> It's kind of neat to have topology encapsulating connect and streams in a
> single instance that can just be scaled together symmetrically.
>
> Overall this was one of the most fun hack I had in a long time and the
> result compared to the Samza equivalent looks clean and lightweight. It
> also allows for zero-downtime with appropriate combination of deployment
> strategy and replication, which is something that was quite tricky with
> Samza and  YARN host affinity.
>
> One thing though I can't get my head around is why in Kafka Connect there
> has to be a custom internal schema format  for the in-memory runtime
> instead of just using Avro as the internal - the systems that talk in Avro
> would have a performance gain and non-Avro guys would have converters the
> same way they have them now.
>
>
> On Thu, Mar 24, 2016 at 11:46 AM, Michal Hariš 
> wrote:
>
> > Hello Kafka people!
> >
> > Great to see Kafka Streams coming along, the design validates (and in
> many
> > way supersedes) my own findings from working with various stream
> processing
> > systems/frameworks and eventually ending-up using just a small custom
> > library built directly around Kafka.
> >
> > I have set out yesterday to translate Hello Samza (the wikipedia feed
> > example) into Kafka Streams application. Now because this workflow starts
> > by polling wikipedia IRC and publishes to a topic from which the stream
> > processors pick-up it would be nice to have this first part done by Kafka
> > Connect but:
> >
> > 1. IRC channels are not seekable and Kafka Connect architecture claims
> > that all sources must be seekable - is this still suitable ? (I guess yes
> > as FileStreamSourceTask can read from stdin which is similar)
> >
> > 2. I would like to have ConnectEmbedded (as opposed to ConnectStandalone
> > or ConnectDistributed) which is similar to ConnectDistributed, just
> without
> > the rest server - i.e. say I have the WikipediaFeedConnector and I want
> to
> > launch it programatically from all the instances along-side the Kafka
> > Streams - but reusing the connect distributed coordination so that only
> one
> > instance actually reads the IRC data but another instance picks up work
> if
> > that one dies - does it sound like a bad idea for some design reason ? -
> > the only problem I see is rather technical that the coordination process
> > uses the rest server for some actions.
> >
> > Cheers,
> > Michal
> >
>



-- 
Thanks,
Neha


Re: Kafka Connect ++ Kafka Streams

2016-03-25 Thread Michal Hariš
So I had a go and hacked it up here: ConnectEmbedded.java



And this is how the wikipedia demo looks with it: hello-kafka-streams



As a side-effect there is a generic IRC connector too: kafka-connect-irc


It's kind of neat to have topology encapsulating connect and streams in a
single instance that can just be scaled together symmetrically.

Overall this was one of the most fun hack I had in a long time and the
result compared to the Samza equivalent looks clean and lightweight. It
also allows for zero-downtime with appropriate combination of deployment
strategy and replication, which is something that was quite tricky with
Samza and  YARN host affinity.

One thing though I can't get my head around is why in Kafka Connect there
has to be a custom internal schema format  for the in-memory runtime
instead of just using Avro as the internal - the systems that talk in Avro
would have a performance gain and non-Avro guys would have converters the
same way they have them now.


On Thu, Mar 24, 2016 at 11:46 AM, Michal Hariš 
wrote:

> Hello Kafka people!
>
> Great to see Kafka Streams coming along, the design validates (and in many
> way supersedes) my own findings from working with various stream processing
> systems/frameworks and eventually ending-up using just a small custom
> library built directly around Kafka.
>
> I have set out yesterday to translate Hello Samza (the wikipedia feed
> example) into Kafka Streams application. Now because this workflow starts
> by polling wikipedia IRC and publishes to a topic from which the stream
> processors pick-up it would be nice to have this first part done by Kafka
> Connect but:
>
> 1. IRC channels are not seekable and Kafka Connect architecture claims
> that all sources must be seekable - is this still suitable ? (I guess yes
> as FileStreamSourceTask can read from stdin which is similar)
>
> 2. I would like to have ConnectEmbedded (as opposed to ConnectStandalone
> or ConnectDistributed) which is similar to ConnectDistributed, just without
> the rest server - i.e. say I have the WikipediaFeedConnector and I want to
> launch it programatically from all the instances along-side the Kafka
> Streams - but reusing the connect distributed coordination so that only one
> instance actually reads the IRC data but another instance picks up work if
> that one dies - does it sound like a bad idea for some design reason ? -
> the only problem I see is rather technical that the coordination process
> uses the rest server for some actions.
>
> Cheers,
> Michal
>


Re: Kafka Connect ++ Kafka Streams

2016-03-24 Thread Ewen Cheslack-Postava
On Thu, Mar 24, 2016 at 4:46 AM, Michal Hariš 
wrote:

> Hello Kafka people!
>
> Great to see Kafka Streams coming along, the design validates (and in many
> way supersedes) my own findings from working with various stream processing
> systems/frameworks and eventually ending-up using just a small custom
> library built directly around Kafka.
>
> I have set out yesterday to translate Hello Samza (the wikipedia feed
> example) into Kafka Streams application. Now because this workflow starts
> by polling wikipedia IRC and publishes to a topic from which the stream
> processors pick-up it would be nice to have this first part done by Kafka
> Connect but:
>
> 1. IRC channels are not seekable and Kafka Connect architecture claims that
> all sources must be seekable - is this still suitable ? (I guess yes as
> FileStreamSourceTask can read from stdin which is similar)
>

They need to be seekable in order to guarantee delivery. If you're fine
with an outage causing you to miss some data, then you don't need the
source to be seekable. However, keep in mind that in distributed mode,
there will be brief periods where work is being rebalanced across workers
and data will not be processed. These are windows where you could easily
lose data if you can't track offsets and recover events that occurred
during the rebalance process. You can of course stick with standalone mode,
but then you lose some of the fault tolerance features.


>
> 2. I would like to have ConnectEmbedded (as opposed to ConnectStandalone or
> ConnectDistributed) which is similar to ConnectDistributed, just without
> the rest server - i.e. say I have the WikipediaFeedConnector and I want to
> launch it programatically from all the instances along-side the Kafka
> Streams - but reusing the connect distributed coordination so that only one
> instance actually reads the IRC data but another instance picks up work if
> that one dies - does it sound like a bad idea for some design reason ? -
> the only problem I see is rather technical that the coordination process
> uses the rest server for some actions.
>

This is planned and is described in the KIP that added Kafka Connect -
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
However, good embedded support required enough of Kafka Streams to be
defined to ensure good integration. Now that both components are available,
this is a project we'll want to start tackling (but will not be in the next
0.10.0.0 release).

-Ewen


>
> Cheers,
> Michal
>



-- 
Thanks,
Ewen


Kafka Connect ++ Kafka Streams

2016-03-24 Thread Michal Hariš
Hello Kafka people!

Great to see Kafka Streams coming along, the design validates (and in many
way supersedes) my own findings from working with various stream processing
systems/frameworks and eventually ending-up using just a small custom
library built directly around Kafka.

I have set out yesterday to translate Hello Samza (the wikipedia feed
example) into Kafka Streams application. Now because this workflow starts
by polling wikipedia IRC and publishes to a topic from which the stream
processors pick-up it would be nice to have this first part done by Kafka
Connect but:

1. IRC channels are not seekable and Kafka Connect architecture claims that
all sources must be seekable - is this still suitable ? (I guess yes as
FileStreamSourceTask can read from stdin which is similar)

2. I would like to have ConnectEmbedded (as opposed to ConnectStandalone or
ConnectDistributed) which is similar to ConnectDistributed, just without
the rest server - i.e. say I have the WikipediaFeedConnector and I want to
launch it programatically from all the instances along-side the Kafka
Streams - but reusing the connect distributed coordination so that only one
instance actually reads the IRC data but another instance picks up work if
that one dies - does it sound like a bad idea for some design reason ? -
the only problem I see is rather technical that the coordination process
uses the rest server for some actions.

Cheers,
Michal