Wow! You caught the 007! Is continuous processing mode available in 2.2?
The ticket says the target version is 2.3, but the talk in the video says
2.2 and beyond, so I am just curious whether it is available in 2.2 or
whether I should try it from the latest build.

Thanks!

On Wed, Jun 14, 2017 at 5:32 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> This is a good question. I really like using Kafka as a centralized source
> for streaming data in an organization and, with Spark 2.2, we have full
> support for reading and writing data to/from Kafka in both streaming and
> batch
> <https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html>.
> I'll focus here on what I think the advantages are of Structured Streaming
> over Kafka Streams (a stream processing library that reads from Kafka).
>
>  - *High level productive APIs* - Streaming queries in Spark can be
> expressed using DataFrames, Datasets or even plain SQL.  Streaming
> DataFrames/SQL are supported in Scala, Java, Python and even R.  This means
> that for common operations like filtering, joining, aggregating, you can
> use built-in operations.  For complicated custom logic you can use UDFs and
> lambda functions. In contrast, Kafka Streams mostly requires you to express
> your transformations using lambda functions.
>  - *High Performance* - Since it is built on Spark SQL, streaming queries
> take advantage of the Catalyst optimizer and the Tungsten execution engine.
> This design leads to huge performance wins
> <https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html>,
> which means you need less hardware to accomplish the same job.
>  - *Ecosystem* - Spark has connectors for working with all kinds of data
> stored in a variety of systems.  This means you can join a stream with data
> encoded in parquet and stored in S3/HDFS.  Perhaps more importantly, it
> also means that if you decide that you don't want to manage a Kafka cluster
> anymore and would rather use Kinesis, you can do that too.  We recently
> moved a bunch of our pipelines from Kafka to Kinesis and only had to change
> a few lines of code! I think it's likely that in the future Spark will also
> have connectors for Google's PubSub and Azure's streaming offerings.
>
> Regarding latency, there has been a lot of discussion about the inherent
> latencies of micro-batch.  Fortunately, we were very careful to leave
> batching out of the user facing API, and as we demo'ed last week, this
> makes it possible for Spark Streaming to achieve sub-millisecond
> latencies <https://www.youtube.com/watch?v=qAZ5XUz32yM>.  Watch
> SPARK-20928 <https://issues.apache.org/jira/browse/SPARK-20928> for more
> on this effort to eliminate micro-batch from Spark's execution model.
>
> At the other end of the latency spectrum...  For those with jobs that
> run in the cloud on data that arrives sporadically, you can run streaming
> jobs that only execute every few hours or every few days, shutting the
> cluster down in between.  This architecture can result in huge cost
> savings for some applications
> <https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html>
> .
>
> Michael
>
> On Sun, Jun 11, 2017 at 1:12 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am trying hard to figure out the real difference between Kafka
>> Streams and Spark Streaming, other than that one can be used as part of
>> microservices (since Kafka Streams is just a library) and the other is a
>> standalone framework.
>>
>> If I can accomplish the same job either way, this is a puzzling
>> question for me, so it would be great to know what Spark Streaming
>> can do that Kafka Streams cannot do efficiently.
>>
>> Thanks!
>>
>>
>
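
For anyone reading the thread later, here is a rough sketch of the two points in the quoted reply that are easiest to show in code: reading from and writing to Kafka with built-in DataFrame operations, and running a streaming query once and then stopping so the cluster can be shut down in between. It is only an illustration, not code from the thread. It assumes PySpark 2.2 or later with the spark-sql-kafka-0-10 package available; the function names, topic names, the "ERROR" filter and the checkpoint path are all made up for the example.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build the options for Spark's built-in Kafka source (format "kafka").

    The option keys are the ones the Kafka source documents; the values
    passed in by callers are placeholders chosen for this example.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_error_filter_query(spark, bootstrap_servers, in_topic, out_topic,
                             checkpoint_dir):
    """Hypothetical helper: copy records containing "ERROR" to another topic.

    `spark` is an existing SparkSession. The query uses a run-once trigger,
    so it processes whatever data is available and then stops.
    """
    # Imported here so kafka_source_options() can be used without Spark installed.
    from pyspark.sql.functions import col

    events = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, in_topic))
        .load()
    )

    # Kafka values arrive as binary; cast to string and filter with
    # built-in column operations instead of hand-written lambdas.
    errors = (
        events
        .select(col("key"), col("value").cast("string").alias("value"))
        .where(col("value").like("%ERROR%"))
    )

    return (
        errors.writeStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", out_topic)
        .option("checkpointLocation", checkpoint_dir)  # needed for recovery
        .trigger(once=True)  # process available data once, then stop
        .start()
    )
```

A call might look like start_error_filter_query(spark, "host1:9092", "events", "errors", "/tmp/checkpoints/errors").awaitTermination(), with every argument being a placeholder. Dropping the trigger(once=True) line leaves the default trigger, which keeps the query running and processes new data as it arrives.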
