Wow! you caught the 007! Is continuous processing mode available in 2.2? The ticket says the target version is 2.3, but the talk in the video says "2.2 and beyond", so I am just curious: is it available in 2.2, or should I try it from the latest build?
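(For context, a minimal sketch of what the continuous trigger looks like in the Spark 2.3 API that SPARK-20928 tracks; the broker address and topic names below are placeholders, and on a 2.2 build this trigger option simply isn't available:)

```python
# Sketch of continuous processing mode (Spark 2.3+ API).
# "localhost:9092", "events", and "events-out" are placeholder names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load())

query = (stream.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "events-out")
    .option("checkpointLocation", "/tmp/continuous-checkpoint")
    # Continuous trigger with a 1-second checkpoint interval; this is the
    # low-latency mode, as opposed to the default micro-batch trigger.
    .trigger(continuous="1 second")
    .start())
```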
Thanks!

On Wed, Jun 14, 2017 at 5:32 PM, Michael Armbrust <mich...@databricks.com> wrote:

> This is a good question. I really like using Kafka as a centralized source
> for streaming data in an organization, and with Spark 2.2 we have full
> support for reading and writing data to/from Kafka in both streaming and
> batch
> <https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html>.
> I'll focus here on what I think the advantages are of Structured Streaming
> over Kafka Streams (a stream processing library that reads from Kafka).
>
> - *High-level productive APIs* - Streaming queries in Spark can be
> expressed using DataFrames, Datasets, or even plain SQL. Streaming
> DataFrames/SQL are supported in Scala, Java, Python, and even R. This means
> that for common operations like filtering, joining, and aggregating, you
> can use built-in operations. For complicated custom logic you can use UDFs
> and lambda functions. In contrast, Kafka Streams mostly requires you to
> express your transformations using lambda functions.
> - *High performance* - Since it is built on Spark SQL, streaming queries
> take advantage of the Catalyst optimizer and the Tungsten execution engine.
> This design leads to huge performance wins
> <https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html>,
> which means you need less hardware to accomplish the same job.
> - *Ecosystem* - Spark has connectors for working with all kinds of data
> stored in a variety of systems. This means you can join a stream with data
> encoded in Parquet and stored in S3/HDFS. Perhaps more importantly, it
> also means that if you decide that you don't want to manage a Kafka cluster
> anymore and would rather use Kinesis, you can do that too. We recently
> moved a bunch of our pipelines from Kafka to Kinesis and only had to change
> a few lines of code!
> I think it's likely that in the future Spark will also
> have connectors for Google's Pub/Sub and Azure's streaming offerings.
>
> Regarding latency, there has been a lot of discussion about the inherent
> latencies of micro-batch. Fortunately, we were very careful to leave
> batching out of the user-facing API, and as we demoed last week, this
> makes it possible for Spark Streaming to achieve sub-millisecond
> latencies <https://www.youtube.com/watch?v=qAZ5XUz32yM>. Watch
> SPARK-20928 <https://issues.apache.org/jira/browse/SPARK-20928> for more
> on this effort to eliminate micro-batch from Spark's execution model.
>
> At the far other end of the latency spectrum... For those with jobs that
> run in the cloud on data that arrives sporadically, you can run streaming
> jobs that only execute every few hours or every few days, shutting the
> cluster down in between. This architecture can result in huge cost
> savings for some applications
> <https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html>.
>
> Michael
>
> On Sun, Jun 11, 2017 at 1:12 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am trying hard to figure out the real difference between Kafka
>> Streams and Spark Streaming, other than that one can be used as part of
>> microservices (since Kafka Streams is just a library) and the other is a
>> standalone framework by itself.
>>
>> If I can accomplish the same job either way, this is a sort of puzzling
>> question for me, so it would be great to know what Spark Streaming can do
>> that Kafka Streams cannot do efficiently.
>>
>> Thanks!
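To make the points in Michael's reply concrete, here is a minimal PySpark sketch tying them together: a streaming DataFrame read from Kafka, joined against static Parquet data, aggregated with built-in operations, and run with the "once" trigger that enables the run-every-few-hours pattern. All broker addresses, topic names, paths, and column names below are placeholders, not anything from the thread.

```python
# Sketch only: a Kafka stream joined with static Parquet data, aggregated
# with built-in DataFrame operations, and executed with a "once" trigger.
# "broker:9092", "clicks", the S3 path, and the column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Streaming source: a Kafka topic.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clicks")
    .load()
    .selectExpr("CAST(key AS STRING) AS user_id"))

# Static side of the join: Parquet data in S3/HDFS, as described above.
users = spark.read.parquet("s3a://my-bucket/users/")

# Built-in operations: stream-static join plus aggregation; no hand-written
# lambda functions needed for this common case.
counts = (events
    .join(users, "user_id")
    .groupBy("country")
    .agg(count("*").alias("clicks")))

# trigger(once=True) processes whatever data is new and then stops, which is
# what makes "run every few hours, shut the cluster down in between" work.
query = (counts.writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("click_counts")
    .trigger(once=True)
    .start())
```

The same `counts` query could be expressed in plain SQL against a registered streaming view, which is the "DataFrames, Datasets or even plain SQL" point above.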