To elaborate on what Vincent wrote: Kafka Streams provides true record-at-a-time processing, whereas Spark Streaming provides micro-batching on top of Spark. Depending on your use case, you may find one better than the other. Both provide stateless and stateful stream processing capabilities.
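To make the record-at-a-time versus micro-batching distinction concrete, here is a toy sketch in plain Python. It uses neither the Kafka nor the Spark API; the event list, the function names, and the one-second batch interval are all assumptions made for illustration. It only shows when a result becomes available under each model.

```python
from collections import defaultdict

# Hypothetical input: (arrival_time_in_seconds, value) pairs, ordered by time.
EVENTS = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.6, "d")]


def record_at_a_time(events):
    """Emit one result per record as soon as that record arrives."""
    # Each output is (time_result_is_available, result).
    return [(t, value.upper()) for t, value in events]


def micro_batch(events, batch_interval):
    """Buffer records per interval and process each buffer as one batch."""
    batches = defaultdict(list)
    for t, value in events:
        # Index k of the interval [k * batch_interval, (k + 1) * batch_interval).
        batches[int(t // batch_interval)].append(value)
    # A batch's results only become available when its interval closes.
    return [
        ((k + 1) * batch_interval, [v.upper() for v in values])
        for k, values in sorted(batches.items())
    ]


print(record_at_a_time(EVENTS))
print(micro_batch(EVENTS, batch_interval=1.0))
```

In this sketch the record that arrives at 0.2 s is available immediately in the first model, but only at 1.0 s in the second. That delay is the latency cost of batching; in exchange, each batch can be processed as a group.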
A few more things to consider:

1. If you don’t already have a Spark cluster but do have a Kafka cluster, it may be easier to use Kafka Streams, since you don’t need to set up and manage another cluster.
2. On the other hand, if you already have a Spark cluster but don’t have a Kafka cluster (for example, because you are using some other messaging system), Spark Streaming is a better option.
3. If you already know and use Spark, you may find it easier to program with the Spark Streaming API even if you are using Kafka.
4. Spark Streaming may give you better throughput, so you have to decide which is more important for your stream processing application: latency or throughput.
5. Kafka Streams is relatively new and less mature than Spark Streaming.

Mohammed

From: vincent gromakowski [mailto:vincent.gromakow...@gmail.com]
Sent: Sunday, June 11, 2017 12:09 PM
To: yohann jardin <yohannjar...@hotmail.com>
Cc: kant kodali <kanth...@gmail.com>; vaquar khan <vaquar.k...@gmail.com>; user <user@spark.apache.org>
Subject: Re: What is the real difference between Kafka streaming and Spark Streaming?

I think Kafka Streams is good when the processing of each row is independent of the others (row parsing, data cleaning...). Spark is better when processing groups of rows (group by, ML, window functions...).

On Jun 11, 2017 8:15 PM, "yohann jardin" <yohannjar...@hotmail.com> wrote:

Hey,

Kafka can also do streaming on its own: https://kafka.apache.org/documentation/streams
I don’t know much about it, unfortunately. I can only repeat what I heard at conferences: that one should give Kafka Streams a try when the whole pipeline uses Kafka. I have no pros or cons to offer on this topic.

Yohann Jardin

On 6/11/2017 at 7:08 PM, vaquar khan wrote:

Hi Kant,

Kafka is a message broker that is used with producers and consumers, and Spark Streaming is used for the real-time processing. Kafka and Spark Streaming work together; they are not competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches. In simple terms, it collects data for some time, builds an RDD, and then processes these micro-batches.

Please read the docs: https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning <https://spark.apache.org/docs/latest/ml-guide.html> and graph processing <https://spark.apache.org/docs/latest/graphx-programming-guide.html> algorithms on data streams.

Regards,
Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali <kanth...@gmail.com> wrote:

Hi All,

I am trying hard to figure out the real difference between Kafka Streams and Spark Streaming, other than that one can be used as part of microservices (since Kafka Streams is just a library) and the other is a standalone framework. If I can accomplish the same job either way, this is a puzzling question for me, so it would be great to know what Spark Streaming can do that Kafka Streams cannot do efficiently.

Thanks!

--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago
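As a rough illustration of the "collect data for some time, build a batch, then process it" model and the window function mentioned in Vaquar's reply, here is a minimal plain-Python sketch of a count-by-key over a sliding window of micro-batches. It does not use the Spark API; `windowed_counts`, the sample batches, and the window length of two batches are hypothetical names and values chosen for the example.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length):
    """Yield key counts over the last `window_length` micro-batches.

    `batches` is an iterable of lists of keys, one list per batch interval.
    """
    window = deque(maxlen=window_length)  # oldest batch is dropped automatically
    for batch in batches:
        window.append(Counter(batch))     # per-batch reduce (count by key)
        total = Counter()
        for counts in window:             # combine the batches in the window
            total.update(counts)
        yield dict(total)


# Hypothetical stream already split into three micro-batches.
batches = [["spark", "kafka"], ["spark"], ["kafka", "kafka"]]
results = list(windowed_counts(batches, window_length=2))
print(results)
```

Each emitted result covers the current batch plus the one before it, which is the kind of group-of-rows computation Vincent describes as a good fit for batch-oriented processing.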