This is a good question. I really like using Kafka as a centralized source for streaming data in an organization and, with Spark 2.2, we have full support for reading and writing data to/from Kafka in both streaming and batch <https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html>. I'll focus here on what I see as the advantages of Structured Streaming over Kafka Streams (a stream processing library that reads from Kafka).
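
To make the Kafka support mentioned above concrete, here is a minimal PySpark sketch of reading one topic either as a stream or as a batch. The broker address "localhost:9092", the topic name "events", and the helper names kafka_options, read_topic and main are all placeholders I made up for illustration; it also assumes the spark-sql-kafka-0-10 package is available to the Spark job.

```python
def kafka_options(bootstrap_servers, topic):
    """Options for Spark's built-in "kafka" source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
    }


def read_topic(spark, bootstrap_servers, topic, streaming=True):
    """Return the topic as a DataFrame with string key/value columns.

    The only difference between the streaming and batch cases is whether
    we start from spark.readStream or spark.read.
    """
    reader = spark.readStream if streaming else spark.read
    df = (reader.format("kafka")
          .options(**kafka_options(bootstrap_servers, topic))
          .load())
    # Kafka keys and values arrive as binary, so cast them to strings.
    return df.selectExpr("CAST(key AS STRING) AS key",
                         "CAST(value AS STRING) AS value")


def main():
    # Imported here so the helpers above can be loaded without pyspark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-sketch").getOrCreate()
    # Placeholder broker and topic; replace with your own.
    events = read_topic(spark, "localhost:9092", "events")
    # Print each micro-batch to the console until the job is stopped.
    query = events.writeStream.format("console").start()
    query.awaitTermination()
```

Calling main() needs a running Kafka broker. Passing streaming=False to read_topic gives an ordinary batch DataFrame over the same topic, which you would write out with df.write rather than df.writeStream.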

- *High level productive APIs* - Streaming queries in Spark can be expressed using DataFrames, Datasets or even plain SQL. Streaming DataFrames/SQL are supported in Scala, Java, Python and even R. This means that for common operations like filtering, joining and aggregating, you can use built-in operations. For complicated custom logic you can use UDFs and lambda functions. In contrast, Kafka Streams mostly requires you to express your transformations using lambda functions.
- *High Performance* - Since it is built on Spark SQL, streaming queries take advantage of the Catalyst optimizer and the Tungsten execution engine. This design leads to huge performance wins <https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html>, which means you need less hardware to accomplish the same job.
- *Ecosystem* - Spark has connectors for working with all kinds of data stored in a variety of systems. This means you can join a stream with data encoded in parquet and stored in S3/HDFS. Perhaps more importantly, it also means that if you decide that you don't want to manage a Kafka cluster anymore and would rather use Kinesis, you can do that too. We recently moved a bunch of our pipelines from Kafka to Kinesis and only had to change a few lines of code! I think it's likely that in the future Spark will also have connectors for Google's PubSub and Azure's streaming offerings.

Regarding latency, there has been a lot of discussion about the inherent latencies of micro-batch. Fortunately, we were very careful to leave batching out of the user-facing API, and as we demo'ed last week, this makes it possible for Spark Streaming to achieve sub-millisecond latencies <https://www.youtube.com/watch?v=qAZ5XUz32yM>. Watch SPARK-20928 <https://issues.apache.org/jira/browse/SPARK-20928> for more on this effort to eliminate micro-batch from Spark's execution model. At the other end of the latency spectrum...
For those with jobs that run in the cloud on data that arrives sporadically, you can run streaming jobs that only execute every few hours or every few days, shutting the cluster down in between. This architecture can result in huge cost savings for some applications <https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html>.

Michael

On Sun, Jun 11, 2017 at 1:12 AM, kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> I am trying hard to figure out what is the real difference between Kafka
> Streaming vs Spark Streaming other than saying one can be used as part of
> Micro services (since Kafka streaming is just a library) and the other is a
> Standalone framework by itself.
>
> If I can accomplish same job one way or other this is a sort of a puzzling
> question for me so it would be great to know what Spark streaming can do
> that Kafka Streaming cannot do efficiently or whatever ?
>
> Thanks!