Re: What is the real difference between Kafka streaming and Spark Streaming?

Paolo Patierno Tue, 13 Jun 2017 03:46:16 -0700

I think a big advantage of not using Spark Streaming when your solution is 
already based on Kafka is that you don't have to deal with another cluster.
Imagine that your solution already uses Kafka as the ingestion system for 
your events and you then need to do some real-time analysis on the streams. 
Adding Spark means adding a new cluster with a master and one or more worker 
nodes; Spark will then distribute jobs for you. Using the lightweight Streams 
library from Kafka means just developing a new application that gets events 
from the same cluster. You can deploy more instances of the same application 
for load balancing, and the distribution of work is handled by Kafka itself.
I think that in terms of deployment this is a big advantage of using Kafka 
Streams with the existing Kafka cluster instead of adding Spark.
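
As a rough illustration of the scaling point above, the sketch below spreads 
the partitions of a topic over however many application instances are running. 
The function name, the instance names, and the round-robin rule are assumptions 
made for illustration only; the partition assignment Kafka actually performs is 
more involved.

```python
def assign_partitions(partitions, instances):
    """Spread partition ids over instance names in round-robin order.

    Simplified stand-in for illustration; not Kafka's real assignor.
    """
    assignment = {name: [] for name in instances}
    for i, partition in enumerate(partitions):
        owner = instances[i % len(instances)]
        assignment[owner].append(partition)
    return assignment


partitions = list(range(6))  # hypothetical topic with 6 partitions

# One instance owns every partition.
one = assign_partitions(partitions, ["app-1"])

# Starting two more instances splits the same partitions three ways.
three = assign_partitions(partitions, ["app-1", "app-2", "app-3"])

print(one)
print(three)
```

The point is only that adding instances of the same application divides the 
work, with no separate processing cluster to manage.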

Paolo
________________________________
From: kant kodali <kanth...@gmail.com>
Sent: Monday, June 12, 2017 12:40:37 AM
To: Mohammed Guller
Cc: vincent gromakowski; yohann jardin; vaquar khan; user
Subject: Re: What is the real difference between Kafka streaming and Spark 
Streaming?

Another difference I see is something like Spark SQL, with its logical plans, 
physical plans, code generation and all those optimizations. I don't see them 
in Kafka Streaming at this time.

On Sun, Jun 11, 2017 at 2:19 PM, kant kodali <kanth...@gmail.com> wrote:
I appreciate the responses; however, I see the other side of the argument and I 
actually feel they are now competitors in the streaming space in some sense.

Kafka Streaming can indeed do map, reduce, join and window operations, and 
likewise data can be ingested into Kafka from many sources and the results 
sent out to many sinks. Look up "Kafka Connect".

Regarding event-at-a-time vs. micro-batch: I hear one group of people saying 
Spark Streaming is real time and another group saying Kafka Streaming is the 
true real time. So do we say micro-batch is real time, or event-at-a-time is 
real time?

It is a well-known fact that Spark is more popular with data scientists who 
want to run ML algorithms and so on, but I also hear that people can use the 
H2O package along with Kafka Streaming. How efficient each of these approaches 
is, I have no clue.

The major difference I see is actually the Spark scheduler. I don't think Kafka 
Streaming has anything like it; instead it just allows you to run lambda 
expressions on a stream and write the results out to a specific 
topic/partition, and from there one can use Kafka Connect to write them out to 
any sink. In short, the optimizations built into the Spark scheduler don't seem 
to exist in Kafka Streaming, so if I were deciding which framework to use, this 
is an additional question I would think about: "Do I want my stream to go 
through the scheduler, and if so, why or why not?"

Above all, please correct me if I am wrong :)

On Sun, Jun 11, 2017 at 12:41 PM, Mohammed Guller <moham...@glassbeam.com> 
wrote:
Just to elaborate more on what Vincent wrote: Kafka streaming provides true 
record-at-a-time processing capabilities whereas Spark Streaming provides 
micro-batching capabilities on top of Spark. Depending on your use case, you 
may find one better than the other. Both provide stateless and stateful stream 
processing capabilities.
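
To make the record-at-a-time vs. micro-batch distinction concrete, here is a 
minimal plain-Python sketch. It uses neither the Kafka Streams nor the Spark 
API; the function names, the sample events, and the 1000 ms batch interval are 
all assumptions chosen for illustration.

```python
from itertools import groupby


def process_per_record(events, fn):
    """Record-at-a-time: call fn once for each event as it arrives."""
    return [fn(value) for _, value in events]


def process_micro_batches(events, batch_ms, fn):
    """Micro-batch: collect events per time interval, call fn once per batch.

    Assumes events are (timestamp_ms, value) pairs in timestamp order.
    """
    results = []
    for _, group in groupby(events, key=lambda e: e[0] // batch_ms):
        batch = [value for _, value in group]
        results.append(fn(batch))
    return results


# Hypothetical events: (timestamp in ms, value)
events = [(0, 1), (400, 2), (900, 3), (1100, 4), (2500, 5)]

per_record = process_per_record(events, lambda v: v * 10)
per_batch = process_micro_batches(events, 1000, sum)

print(per_record)  # one result per event
print(per_batch)   # one result per 1000 ms interval that contains events
```

In the first style each event can be handled as soon as it arrives; in the 
second, an event waits until its interval closes, which is where the 
latency/throughput trade-off comes from.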

A few more things to consider:

  1.  If you don’t already have a Spark cluster, but have a Kafka cluster, it 
may be easier to use Kafka streaming since you don’t need to set up and manage 
another cluster.
  2.  On the other hand, if you already have a Spark cluster, but don’t have a 
Kafka cluster (in case you are using some other messaging system), Spark 
Streaming is a better option.
  3.  If you already know and use Spark, you may find it easier to program with 
the Spark Streaming API even if you are using Kafka.
  4.  Spark Streaming may give you better throughput. So you have to decide 
what is more important for your stream processing application: latency or 
throughput?
  5.  Kafka streaming is relatively new and less mature than Spark Streaming.

Mohammed

From: vincent gromakowski [mailto:vincent.gromakow...@gmail.com]
Sent: Sunday, June 11, 2017 12:09 PM
To: yohann jardin <yohannjar...@hotmail.com>
Cc: kant kodali <kanth...@gmail.com>; vaquar khan <vaquar.k...@gmail.com>; 
user <user@spark.apache.org>
Subject: Re: What is the real difference between Kafka streaming and Spark 
Streaming?

I think Kafka Streams is good when the processing of each row is independent 
of the others (row parsing, data cleaning...).
Spark is better when processing groups of rows (group by, ML, window 
functions...).
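
The distinction being drawn here, per-row work vs. work over groups of rows, 
can be sketched in plain Python. This is not code for either framework; the 
row format ("user,amount") and the function names are hypothetical.

```python
from collections import defaultdict


def clean_row(row):
    """Per-row (stateless): the output depends only on this one row."""
    user, amount = row.split(",")
    return user.strip().lower(), float(amount)


def total_per_user(rows):
    """Grouped (stateful): a key's result depends on every row with that key."""
    totals = defaultdict(float)
    for user, amount in rows:
        totals[user] += amount
    return dict(totals)


raw = [" Alice,10.0", "bob ,5.5", "ALICE,2.5"]  # hypothetical input rows

cleaned = [clean_row(r) for r in raw]
totals = total_per_user(cleaned)

print(cleaned)
print(totals)
```

Cleaning can run on any row in isolation, whereas the per-user total needs 
state (or a shuffle) that brings all rows for a key together.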

On 11 June 2017 at 8:15 PM, "yohann jardin" <yohannjar...@hotmail.com> wrote:

Hey,
Kafka can also do streaming on its own: 
https://kafka.apache.org/documentation/streams
I don’t know much about it unfortunately. I can only repeat what I heard at 
conferences: one should give Kafka streaming a try when the whole pipeline is 
using Kafka. I have no pros/cons to argue on this topic.

Yohann Jardin
On 6/11/2017 at 7:08 PM, vaquar khan wrote:

Hi Kant,

Kafka is the message broker, used with producers and consumers, and Spark 
Streaming is used for the real-time processing. Kafka and Spark Streaming work 
together; they are not competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches. In 
simple terms, it collects data for some time, builds an RDD, and then 
processes these micro-batches.

Please read the docs: 
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables scalable, 
high-throughput, fault-tolerant stream processing of live data streams. Data 
can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, 
and can be processed using complex algorithms expressed with high-level 
functions like map, reduce, join and window. Finally, processed data can be 
pushed out to filesystems, databases, and live dashboards. In fact, you can 
apply Spark’s machine learning 
(https://spark.apache.org/docs/latest/ml-guide.html) and graph processing 
(https://spark.apache.org/docs/latest/graphx-programming-guide.html) 
algorithms on data streams.

Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali <kanth...@gmail.com> wrote:
Hi All,

I am trying hard to figure out what the real difference is between Kafka 
Streaming and Spark Streaming, other than saying one can be used as part of 
microservices (since Kafka Streaming is just a library) and the other is a 
standalone framework by itself.

If I can accomplish the same job one way or the other, this is a puzzling 
question for me, so it would be great to know what Spark Streaming can do that 
Kafka Streaming cannot do efficiently, or whatever.

Thanks!

--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago
