Continuous processing is still a work in progress. I would really like to at least have a basic version in Spark 2.3.
The announcement for 2.2 is that we are planning to remove the experimental
tag from Structured Streaming.

On Thu, Jun 15, 2017 at 11:53 AM, kant kodali <kanth...@gmail.com> wrote:

> Wow! You caught the 007! Is continuous processing mode available in 2.2?
> The ticket says the target version is 2.3, but the talk in the video says
> 2.2 and beyond, so I am just curious whether it is available in 2.2 or
> whether I should try it from the latest build.
>
> Thanks!
>
> On Wed, Jun 14, 2017 at 5:32 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> This is a good question. I really like using Kafka as a centralized
>> source for streaming data in an organization and, with Spark 2.2, we have
>> full support for reading and writing data to/from Kafka in both streaming
>> and batch
>> <https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html>.
>> I'll focus here on what I think the advantages of Structured Streaming
>> are over Kafka Streams (a stream processing library that reads from
>> Kafka).
>>
>> - *High-level, productive APIs* - Streaming queries in Spark can be
>> expressed using DataFrames, Datasets, or even plain SQL. Streaming
>> DataFrames/SQL are supported in Scala, Java, Python, and even R. This
>> means that for common operations like filtering, joining, and
>> aggregating, you can use built-in operations (a sketch follows at the end
>> of this message). For complicated custom logic you can use UDFs and
>> lambda functions. In contrast, Kafka Streams mostly requires you to
>> express your transformations using lambda functions.
>> - *High performance* - Since it is built on Spark SQL, streaming queries
>> take advantage of the Catalyst optimizer and the Tungsten execution
>> engine. This design leads to huge performance wins
>> <https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html>,
>> which means you need less hardware to accomplish the same job.
>> - *Ecosystem* - Spark has connectors for working with all kinds of data
>> stored in a variety of systems. This means you can join a stream with
>> data encoded in Parquet and stored in S3/HDFS. Perhaps more importantly,
>> it also means that if you decide you don't want to manage a Kafka cluster
>> anymore and would rather use Kinesis, you can do that too. We recently
>> moved a bunch of our pipelines from Kafka to Kinesis and only had to
>> change a few lines of code! I think it's likely that in the future Spark
>> will also have connectors for Google's Pub/Sub and Azure's streaming
>> offerings.
>>
>> Regarding latency, there has been a lot of discussion about the inherent
>> latencies of micro-batching. Fortunately, we were very careful to leave
>> batching out of the user-facing API, and as we demoed last week, this
>> makes it possible for Structured Streaming to achieve sub-millisecond
>> latencies <https://www.youtube.com/watch?v=qAZ5XUz32yM>. Watch
>> SPARK-20928 <https://issues.apache.org/jira/browse/SPARK-20928> for more
>> on this effort to eliminate micro-batching from Spark's execution model.
>>
>> At the far other end of the latency spectrum: for those with jobs that
>> run in the cloud on data that arrives sporadically, you can run streaming
>> jobs that only execute every few hours or every few days, shutting the
>> cluster down in between (a second sketch follows below). This
>> architecture can result in huge cost savings for some applications
>> <https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html>.
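>>
>> To make the API point concrete, here is a rough, untested sketch of a
>> streaming query written with the DataFrame API. The broker address, topic
>> names, S3 path, JSON schema, and checkpoint location are all made-up
>> placeholders:
>>
>>   import org.apache.spark.sql.SparkSession
>>   import org.apache.spark.sql.functions._
>>   import org.apache.spark.sql.types._
>>
>>   val spark = SparkSession.builder.appName("kafka-demo").getOrCreate()
>>   import spark.implicits._
>>
>>   // Stream of events from a Kafka topic; `value` arrives as bytes.
>>   val events = spark.readStream
>>     .format("kafka")
>>     .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
>>     .option("subscribe", "purchases")                  // placeholder topic
>>     .load()
>>     .select(from_json($"value".cast("string"),
>>       new StructType().add("user", StringType).add("amount", DoubleType)).as("e"))
>>     .select("e.*")
>>
>>   // Static dimension table stored as Parquet in S3 (placeholder path and
>>   // columns); stream-static joins are built in.
>>   val users = spark.read.parquet("s3a://my-bucket/users")
>>
>>   // Filter, join, and aggregate with ordinary DataFrame operations.
>>   val spendByCountry = events
>>     .filter($"amount" > 0)
>>     .join(users, "user")
>>     .groupBy($"country")
>>     .agg(sum($"amount").as("total"))
>>
>>   // Write the running aggregate back to Kafka as JSON.
>>   val query = spendByCountry
>>     .select(to_json(struct($"country", $"total")).as("value"))
>>     .writeStream
>>     .format("kafka")
>>     .option("kafka.bootstrap.servers", "broker:9092")       // placeholder
>>     .option("topic", "spend-by-country")                    // placeholder
>>     .option("checkpointLocation", "/tmp/checkpoints/spend") // placeholder
>>     .outputMode("update")
>>     .start()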
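>>
>> And for the sporadic-arrival case, a similarly hedged sketch using
>> Trigger.Once (available in 2.2): the query processes whatever has arrived
>> in Kafka since the last run, writes it out, and then stops so the cluster
>> can be shut down until the next scheduled run (topic and paths are again
>> placeholders):
>>
>>   import org.apache.spark.sql.streaming.Trigger
>>
>>   val query = spark.readStream
>>     .format("kafka")
>>     .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
>>     .option("subscribe", "purchases")                  // placeholder topic
>>     .load()
>>     .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
>>     .writeStream
>>     .format("parquet")
>>     .option("path", "s3a://my-bucket/events")                           // placeholder
>>     .option("checkpointLocation", "s3a://my-bucket/checkpoints/events") // placeholder
>>     .trigger(Trigger.Once())  // one incremental batch, then stop
>>     .start()
>>
>>   query.awaitTermination()  // returns once the single batch has finished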
>>
>> Michael
>>
>> On Sun, Jun 11, 2017 at 1:12 AM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am trying hard to figure out what the real difference is between Kafka
>>> Streams and Spark Streaming, other than the fact that one can be used as
>>> part of microservices (since Kafka Streams is just a library) and the
>>> other is a standalone framework by itself.
>>>
>>> If I can accomplish the same job either way, this is a somewhat puzzling
>>> question for me, so it would be great to know what Spark Streaming can do
>>> that Kafka Streams cannot do efficiently.
>>>
>>> Thanks!