On Thu, Mar 24, 2016 at 4:46 AM, Michal Hariš <michal.har...@gmail.com>
wrote:

> Hello Kafka people!
>
> Great to see Kafka Streams coming along. The design validates (and in many
> ways supersedes) my own findings from working with various stream processing
> systems and frameworks, and eventually ending up using just a small custom
> library built directly around Kafka.
>
> Yesterday I set out to translate Hello Samza (the Wikipedia feed example)
> into a Kafka Streams application. Because this workflow starts by polling
> Wikipedia IRC and publishing to a topic from which the stream processors
> pick up, it would be nice to have this first part done by Kafka Connect,
> but:
>
> 1. IRC channels are not seekable, and the Kafka Connect architecture claims
> that all sources must be seekable. Is Kafka Connect still suitable here? (I
> guess yes, as FileStreamSourceTask can read from stdin, which is similar.)
>

They need to be seekable in order to guarantee delivery. If you're fine
with an outage causing you to miss some data, then you don't need the
source to be seekable. However, keep in mind that in distributed mode,
there will be brief periods where work is being rebalanced across workers
and data will not be processed. These are windows where you could easily
lose data if you can't track offsets and recover events that occurred
during the rebalance process. You can of course stick with standalone mode,
but then you lose some of the fault tolerance features.
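
To make the rebalance window concrete, here is a toy simulation. It is plain
Java, not the Kafka Connect API: the class name, the method name, and the
choice of ticks 4 through 6 as the rebalance window are all made up for
illustration. A source emits one event per tick; a seekable source lets the
task replay from its last committed offset, while a non-seekable one (like an
IRC channel) only exposes the live event.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only (hypothetical names, not Kafka Connect classes).
// The source emits events 0..9, one per tick. No task is running
// during ticks 4..6, which stands in for a rebalance.
final class RebalanceSketch {

    static List<Integer> deliver(boolean seekable) {
        List<Integer> delivered = new ArrayList<>();
        int committedOffset = 0; // next event position the task still needs
        for (int tick = 0; tick < 10; tick++) {
            boolean rebalancing = tick >= 4 && tick <= 6;
            if (rebalancing) {
                continue; // the event for this tick is produced, but nobody reads it
            }
            if (seekable) {
                // Seek back to the committed offset and replay up to the current event.
                while (committedOffset <= tick) {
                    delivered.add(committedOffset++);
                }
            } else {
                // Non-seekable: only the live event can be observed.
                delivered.add(tick);
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println("seekable:     " + deliver(true));
        System.out.println("non-seekable: " + deliver(false));
    }
}
```

In this sketch the seekable run delivers all ten events, while the
non-seekable run never sees events 4, 5, and 6.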


>
> 2. I would like to have ConnectEmbedded (as opposed to ConnectStandalone or
> ConnectDistributed), which is similar to ConnectDistributed, just without
> the REST server. Say I have the WikipediaFeedConnector and I want to
> launch it programmatically from all the instances alongside Kafka
> Streams, reusing the Connect distributed coordination so that only one
> instance actually reads the IRC data but another instance picks up the work
> if that one dies. Does this sound like a bad idea for some design reason?
> The only problem I see is a technical one: the coordination process
> uses the REST server for some actions.
>

This is planned and is described in the KIP that added Kafka Connect:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
However, good embedded support required enough of Kafka Streams to be
defined to ensure good integration. Now that both components are available,
this is a project we'll want to start tackling (but it will not be in the
upcoming 0.10.0.0 release).

-Ewen


>
> Cheers,
> Michal
>



-- 
Thanks,
Ewen
