Spark's direct stream Kafka integration should take advantage of data
locality if you're running Spark executors on the same nodes as Kafka
brokers.
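
For example (a rough sketch against the 0.8 direct stream API; the broker
list and topic name are made up), each partition of the RDDs produced below
maps 1:1 to a Kafka partition and reports the leader broker's host as its
preferred location, so tasks can be scheduled broker-local:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-stream-locality")
val ssc = new StreamingContext(conf, Seconds(5))

// Hypothetical broker list and topic, for illustration only.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

// Each KafkaRDD partition corresponds to exactly one Kafka partition; its
// preferred location is that partition's leader broker, so executors
// colocated with the brokers can read without crossing the network.
stream.map(_._2).foreachRDD(rdd => println(rdd.count()))

ssc.start()
ssc.awaitTermination()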

On Wed, Nov 25, 2015 at 9:50 AM, Dave Ariens <dari...@blackberry.com> wrote:

> I just finished reading up on Kafka Connect
> <http://kafka.apache.org/documentation.html#connect> and am trying to wrap
> my head around where it fits within the big data ecosystem.
>
> Other than the high-level overview provided in the docs I haven't heard
> much about this feature. My limited understanding of it so far is that it
> includes semantics similar to Storm (sources/spouts, sinks/bolts) and
> allows for distributed processing of streams using tasks that handle data
> defined in records conforming to a schema.  Assuming that's mostly
> accurate, is anyone able to speak to why a developer would want to use
> Kafka Connect over Spark (or maybe even Storm, but to a lesser degree)?  Is
> Kafka Connect trying to address any shortcomings?  I understand it greatly
> simplifies offset persistence, but that's not terribly difficult to
> implement on top of Spark (see my offset persistence hack
> <https://gist.github.com/ariens/e6a39bc3dbeb11467e53>).  Where is Kafka
> Connect being targeted within the vast ecosystem that is big data?
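
For reference, the direct stream already exposes the offset ranges covered
by each batch, so the core of such a hack is small. A minimal sketch,
continuing the stream above, where saveOffsets is a hypothetical helper
standing in for whatever store you persist to (ZooKeeper, a database, etc.):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // RDDs from the direct stream carry the Kafka offset ranges they cover.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch here ...

  // Persist the ranges only after the batch has been processed successfully.
  saveOffsets(ranges)
}

// Placeholder implementation; swap in your own durable store.
def saveOffsets(ranges: Array[OffsetRange]): Unit =
  ranges.foreach(r =>
    println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))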
>
> Does Kafka Connect offer efficiencies 'under the hood' by taking advantage
> of data locality and the fact that it distributes workload on the actual
> Kafka cluster itself?
>
> I can see basic ETL and data warehouse bulk operations being simplified,
> where one just wants an easy way to get all data in/out of Kafka and
> reduce the network IO of having multiple compute clusters, but for any
> data science type operations (machine learning, etc.) I would expect
> working with Spark's RDDs to be more efficient.
