Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Dood
On 5/16/2016 9:53 AM, Yuval Itzchakov wrote: AFAIK, the underlying data represented under the DataSet[T] abstraction will be formatted in Tachyon under the hood, but as with RDDs, it will be spilled to local disk on the worker if needed. There is another option in case of
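The "keep in memory, spill to local disk if needed" behavior being discussed can be illustrated with a toy cache in plain Scala. This is a sketch of the general idea only; Spark's actual BlockManager (and Tachyon integration) is far more sophisticated, and all names here are hypothetical:

```scala
import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

// Toy cache: hold at most `maxInMemory` entries in memory; when full,
// serialize the oldest entry to a temp file on local disk. Reads fall
// back to disk transparently, so spilled data stays retrievable.
class SpillableCache[K, V <: java.io.Serializable](maxInMemory: Int) {
  private val mem  = mutable.LinkedHashMap.empty[K, V]
  private val disk = mutable.Map.empty[K, File]

  def put(k: K, v: V): Unit = {
    if (mem.size >= maxInMemory) {
      val (oldK, oldV) = mem.head      // evict the oldest in-memory entry
      mem.remove(oldK)
      val f = File.createTempFile("spill", ".bin")
      f.deleteOnExit()
      val out = new ObjectOutputStream(new FileOutputStream(f))
      try out.writeObject(oldV) finally out.close()
      disk(oldK) = f
    }
    mem(k) = v
  }

  def get(k: K): Option[V] =
    mem.get(k).orElse(disk.get(k).map { f =>
      val in = new ObjectInputStream(new FileInputStream(f))
      try in.readObject().asInstanceOf[V] finally in.close()
    })
}
```

The point of the sketch is that eviction to disk is invisible to the reader: a query that spans data not in memory still finds it.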

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Yuval Itzchakov
AFAIK, the underlying data represented under the DataSet[T] abstraction will be formatted in Tachyon under the hood, but as with RDDs, it will be spilled to local disk on the worker if needed. On Mon, May 16, 2016, 19:47 Benjamin Kim wrote: > I have a curiosity

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Benjamin Kim
I have a curiosity question. These forever/unlimited DataFrames/DataSets will persist and be queryable. I am still foggy about how this data will be stored. As far as I know, memory is finite. Will the data be spilled to disk and be retrievable if the query spans data not in memory? Is

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Yuval Itzchakov
Oh, that looks neat! Thx, will read up on that. On Mon, May 16, 2016, 14:10 Ofir Manor wrote: > Yuval, > Not sure what is in scope to land in 2.0, but there is another new infra bit > to manage state more efficiently called State Store, whose initial version > is already

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Ofir Manor
Yuval, Not sure what is in scope to land in 2.0, but there is another new infra bit to manage state more efficiently, called State Store, whose initial version is already committed: SPARK-13809 - State Store: A new framework for state management for computing Streaming Aggregates
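The idea behind the State Store, roughly, is a versioned key-value store: each micro-batch reads the last committed version of the state, applies its updates, and commits a new version, so a failed batch can retry from the previous version. A minimal sketch of those semantics in plain Scala (illustrative only, not the SPARK-13809 API):

```scala
// Versioned state for a streaming aggregate (here, a running word count).
// Each batch produces a new immutable version; earlier versions survive,
// which is what makes retrying a failed batch safe.
final case class StateVersion(version: Long, counts: Map[String, Long])

def commitBatch(prev: StateVersion, batch: Seq[String]): StateVersion = {
  val updated = batch.foldLeft(prev.counts) { (acc, key) =>
    acc.updated(key, acc.getOrElse(key, 0L) + 1L)
  }
  StateVersion(prev.version + 1, updated)
}

val v0 = StateVersion(0L, Map.empty)
val v1 = commitBatch(v0, Seq("a", "b", "a"))
val v2 = commitBatch(v1, Seq("b", "c"))
// v1 is untouched by the commit of v2, so batch 2 could be replayed from v1
```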

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Yuval Itzchakov
Also, re-reading the relevant part from the Structured Streaming documentation ( https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x ): Discretized streams (aka dstream) Unlike Storm, dstream exposes a higher level API similar to RDDs. There
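The "higher level API similar to RDDs" that the design doc attributes to DStreams comes from the micro-batch model: a DStream is essentially a series of batches, each exposed with RDD-like operations. The model can be sketched in plain Scala with a Vector standing in for each batch (illustrative only, not the DStream API):

```scala
// Micro-batch model: a "stream" is a sequence of batches; a transformation
// such as map is applied to every batch independently (like DStream.map),
// and an action such as a per-batch sum mirrors DStream.reduce.
type Batch[A] = Vector[A]

val stream: Seq[Batch[Int]] = Seq(Vector(1, 2), Vector(3, 4, 5), Vector(6))

val doubled: Seq[Batch[Int]] = stream.map(_.map(_ * 2)) // per-element transform
val sums: Seq[Int]           = stream.map(_.sum)        // per-batch aggregate
```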

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Yuval Itzchakov
Hi Ofir, Thanks for the elaborated answer. I have read both documents, where they touch lightly on infinite DataFrames/Datasets. However, they do not go into depth regarding how existing DStream transformations, for example, will be translated into the Dataset APIs. I've been browsing

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Benjamin Kim
Ofir, Thanks for the clarification. I was confused for a moment. The links will be very helpful. > On May 15, 2016, at 2:32 PM, Ofir Manor wrote: > > Ben, > I'm just a Spark user - but at least at the March Spark Summit, that was the main > term used. > Taking a step

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Ofir Manor
Ben, I'm just a Spark user - but at least at the March Spark Summit, that was the main term used. Taking a step back from the details, maybe this new post from Reynold is a better intro to Spark 2.0 highlights

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Benjamin Kim
Hi Ofir, I just recently saw the webinar with Reynold Xin. He mentioned the SparkSession unification efforts, but I don't remember him covering Datasets for Structured Streaming, aka Continuous Applications as he put it. He did mention streaming or unlimited DataFrames for Structured Streaming, so one

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Ofir Manor
Hi Yuval, let me share my understanding based on similar questions I had. First, Spark 2.x aims to replace a whole bunch of its APIs with just two main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset (a merge of Dataset and DataFrame - which is why it inherits all the SparkSQL
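For orientation, the unified entry point described above looks roughly like this in the Spark 2.0 preview API (a sketch only; it assumes a Spark 2.x runtime on the classpath and a local master, and the app name is made up):

```scala
import org.apache.spark.sql.SparkSession

// SparkSession replaces SQLContext/HiveContext and wraps SparkContext;
// DataFrame is now just an alias for Dataset[Row].
val spark = SparkSession.builder()
  .appName("unified-entry-point")   // hypothetical app name
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

val ds = Seq(("a", 1), ("b", 2)).toDS() // a typed Dataset[(String, Int)]
val df = ds.toDF("key", "value")        // an untyped DataFrame = Dataset[Row]
```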

Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Yuval.Itzchakov
I've been reading/watching videos about the upcoming Spark 2.0 release, which brings us Structured Streaming. One thing I've yet to understand is how this relates to the current way of working with streaming in Spark via the DStream abstraction. All the examples I can find, in the Spark
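The conceptual shift from DStreams that the thread circles around is Structured Streaming's "query over an unbounded table" model: after each batch, the result should be as if the query had been re-run over all input seen so far. That invariant can be sketched in plain Scala (illustrative only, not the Spark API; the "query" here is a simple grouped count):

```scala
// Batch semantics: re-run the query over the whole input table.
def runQuery(table: Seq[String]): Map[String, Int] =
  table.groupBy(identity).map { case (k, v) => k -> v.size }

// Streaming semantics: update the previous result with one batch.
def incremental(prev: Map[String, Int], batch: Seq[String]): Map[String, Int] =
  batch.foldLeft(prev)((acc, k) => acc.updated(k, acc.getOrElse(k, 0) + 1))

val batches = Seq(Seq("a", "b"), Seq("a", "c", "a"))

val full = runQuery(batches.flatten)                              // rerun over everything
val inc  = batches.foldLeft(Map.empty[String, Int])(incremental)  // batch by batch
// full == inc: the incremental streaming result matches the batch result
```

This equivalence is what lets the same Dataset/DataFrame query run unchanged on bounded and unbounded data.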