Continuous processing is still a work in progress. I would really like to at least have a basic version in Spark 2.3.
The announcement for 2.2 is that we are planning to remove the experimental
tag from Structured Streaming.

On Thu, Jun 15, 2017 at 11:53 AM, kant kodali <kanth...@gmail.com> wrote:

> Wow! You caught the 007! Is continuous processing mode available in 2.2?
> The ticket says the target version is 2.3, but the talk in the video says
> 2.2 and beyond, so I am just curious whether it is available in 2.2 or
> whether I should try it from the latest build.
>
> Thanks!
>
> On Wed, Jun 14, 2017 at 5:32 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> This is a good question. I really like using Kafka as a centralized
>> source for streaming data in an organization and, with Spark 2.2, we have
>> full support for reading and writing data to/from Kafka in both streaming
>> and batch
>> <https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html>.
>> I'll focus here on what I think the advantages of Structured Streaming
>> are over Kafka Streams (a stream processing library that reads from
>> Kafka).
>>
>> - *High-level, productive APIs* - Streaming queries in Spark can be
>> expressed using DataFrames, Datasets, or even plain SQL. Streaming
>> DataFrames/SQL are supported in Scala, Java, Python, and even R. This
>> means that for common operations like filtering, joining, and
>> aggregating, you can use built-in operations (a sketch follows at the end
>> of this message). For complicated custom logic you can use UDFs and
>> lambda functions. In contrast, Kafka Streams mostly requires you to
>> express your transformations using lambda functions.
>> - *High performance* - Since it is built on Spark SQL, streaming queries
>> take advantage of the Catalyst optimizer and the Tungsten execution
>> engine. This design leads to huge performance wins
>> <https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html>,
>> which means you need less hardware to accomplish the same job.
>> - *Ecosystem* - Spark has connectors for working with all kinds of data
>> stored in a variety of systems. This means you can join a stream with
>> data encoded in Parquet and stored in S3/HDFS. Perhaps more importantly,
>> it also means that if you decide you don't want to manage a Kafka cluster
>> anymore and would rather use Kinesis, you can do that too. We recently
>> moved a bunch of our pipelines from Kafka to Kinesis and only had to
>> change a few lines of code! I think it's likely that in the future Spark
>> will also have connectors for Google's Pub/Sub and Azure's streaming
>> offerings.
>>
>> Regarding latency, there has been a lot of discussion about the inherent
>> latencies of micro-batching. Fortunately, we were very careful to leave
>> batching out of the user-facing API, and as we demoed last week, this
>> makes it possible for Structured Streaming to achieve sub-millisecond
>> latencies <https://www.youtube.com/watch?v=qAZ5XUz32yM>. Watch
>> SPARK-20928 <https://issues.apache.org/jira/browse/SPARK-20928> for more
>> on this effort to eliminate micro-batching from Spark's execution model.
>>
>> At the far other end of the latency spectrum: for those with jobs that
>> run in the cloud on data that arrives sporadically, you can run streaming
>> jobs that only execute every few hours or every few days, shutting the
>> cluster down in between (a second sketch follows below). This
>> architecture can result in huge cost savings for some applications
>> <https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html>.
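>>
>> To make the API point concrete, here is a rough, untested sketch of a
>> streaming query written with the DataFrame API. The broker address, topic
>> names, S3 path, JSON schema, and checkpoint location are all made-up
>> placeholders:
>>
>>   import org.apache.spark.sql.SparkSession
>>   import org.apache.spark.sql.functions._
>>   import org.apache.spark.sql.types._
>>
>>   val spark = SparkSession.builder.appName("kafka-demo").getOrCreate()
>>   import spark.implicits._
>>
>>   // Stream of events from a Kafka topic; `value` arrives as bytes.
>>   val events = spark.readStream
>>     .format("kafka")
>>     .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
>>     .option("subscribe", "purchases")                  // placeholder topic
>>     .load()
>>     .select(from_json($"value".cast("string"),
>>       new StructType().add("user", StringType).add("amount", DoubleType)).as("e"))
>>     .select("e.*")
>>
>>   // Static dimension table stored as Parquet in S3 (placeholder path and
>>   // columns); stream-static joins are built in.
>>   val users = spark.read.parquet("s3a://my-bucket/users")
>>
>>   // Filter, join, and aggregate with ordinary DataFrame operations.
>>   val spendByCountry = events
>>     .filter($"amount" > 0)
>>     .join(users, "user")
>>     .groupBy($"country")
>>     .agg(sum($"amount").as("total"))
>>
>>   // Write the running aggregate back to Kafka as JSON.
>>   val query = spendByCountry
>>     .select(to_json(struct($"country", $"total")).as("value"))
>>     .writeStream
>>     .format("kafka")
>>     .option("kafka.bootstrap.servers", "broker:9092")       // placeholder
>>     .option("topic", "spend-by-country")                    // placeholder
>>     .option("checkpointLocation", "/tmp/checkpoints/spend") // placeholder
>>     .outputMode("update")
>>     .start()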
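>>
>> And for the sporadic-arrival case, a similarly hedged sketch using
>> Trigger.Once (available in 2.2): the query processes whatever has arrived
>> in Kafka since the last run, writes it out, and then stops so the cluster
>> can be shut down until the next scheduled run (topic and paths are again
>> placeholders):
>>
>>   import org.apache.spark.sql.streaming.Trigger
>>
>>   val query = spark.readStream
>>     .format("kafka")
>>     .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
>>     .option("subscribe", "purchases")                  // placeholder topic
>>     .load()
>>     .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
>>     .writeStream
>>     .format("parquet")
>>     .option("path", "s3a://my-bucket/events")                           // placeholder
>>     .option("checkpointLocation", "s3a://my-bucket/checkpoints/events") // placeholder
>>     .trigger(Trigger.Once())  // one incremental batch, then stop
>>     .start()
>>
>>   query.awaitTermination()  // returns once the single batch has finished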
>>
>> Michael
>>
>> On Sun, Jun 11, 2017 at 1:12 AM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am trying hard to figure out what the real difference is between Kafka
>>> Streams and Spark Streaming, other than the fact that one can be used as
>>> part of microservices (since Kafka Streams is just a library) and the
>>> other is a standalone framework by itself.
>>>
>>> If I can accomplish the same job either way, this is a somewhat puzzling
>>> question for me, so it would be great to know what Spark Streaming can do
>>> that Kafka Streams cannot do efficiently.
>>>
>>> Thanks!