Spark's direct-stream Kafka integration should take advantage of data locality if you're running Spark executors on the same nodes as the Kafka brokers.
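For context, here's a minimal sketch of setting up the direct stream with the Spark 1.x / Kafka 0.8 API (broker addresses, topic name, and batch interval are placeholders for your cluster):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("direct-stream-example")
    val ssc = new StreamingContext(conf, Seconds(10))

    // metadata.broker.list and the topic are placeholders.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

Each Kafka partition maps 1:1 to an RDD partition, and the resulting RDD advertises the partition leader's host as the preferred location for the corresponding task, which is what makes co-locating executors with brokers pay off.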
On Wed, Nov 25, 2015 at 9:50 AM, Dave Ariens <dari...@blackberry.com> wrote:
> I just finished reading up on Kafka Connect
> <http://kafka.apache.org/documentation.html#connect> and am trying to wrap
> my head around where it fits within the big data ecosystem.
>
> Other than the high-level overview provided in the docs, I haven't heard
> much about this feature. My limited understanding of it so far is that it
> includes semantics similar to Storm's (sources/spouts, sinks/bolts) and
> allows for distributed processing of streams using tasks that handle data
> defined in records conforming to a schema. Assuming that's mostly
> accurate, is anyone able to speak to why a developer would want to use
> Kafka Connect over Spark (or maybe even Storm, but to a lesser degree)? Is
> Kafka Connect trying to address any shortcomings? I understand it greatly
> simplifies offset persistence, but that's not terribly difficult to
> implement on top of Spark (see my offset persistence hack
> <https://gist.github.com/ariens/e6a39bc3dbeb11467e53>). Where is Kafka
> Connect being targeted within the vast ecosystem that is big data?
>
> Does Kafka Connect offer efficiencies 'under the hood' by taking advantage
> of data locality and the fact that it distributes workload on the actual
> Kafka cluster itself?
>
> I can see basic ETL and data warehouse bulk operations being simplified,
> where one just wants an easy way to get all data in/out of Kafka and
> reduce the network IO of having multiple compute clusters, but for any
> data science type operations (machine learning, etc.) I would expect
> working with Spark's RDDs to be more efficient.
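Re the offset persistence point: with the direct stream you can also pull the exact offset ranges off each batch's RDD and persist them wherever you like. A rough sketch, assuming the stream from above; saveOffsets is a placeholder for your store of choice:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    // Placeholder persistence hook; swap in ZooKeeper, a DB, etc.
    def saveOffsets(topic: String, partition: Int, offset: Long): Unit =
      println(s"$topic-$partition -> $offset")

    stream.foreachRDD { rdd =>
      // The cast exposes the Kafka offset ranges computed for this batch.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      offsetRanges.foreach { or =>
        // Persist atomically with your output if you need exactly-once.
        saveOffsets(or.topic, or.partition, or.untilOffset)
      }
    }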