Here are my two cents; experts, please correct me if I'm wrong.

It's important to understand why you would choose one over the other, and for
what kind of use case. There may come a time when the low-level APIs are
abstracted away and become legacy, but for now the RDD API is Spark's core,
low-level API: all higher-level APIs ultimately translate to RDDs, and RDDs
are immutable.
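
A minimal sketch (Scala, assuming a local SparkSession) of what I mean: a
DataFrame exposes the RDD it translates to via .rdd, and RDD transformations
return new immutable RDDs instead of mutating the original.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-underneath")
  .getOrCreate()

// High-level Dataset/DataFrame API on top...
val df = spark.range(0, 5).toDF("id")

// ...bottoms out in an RDD[Row].
val rdd = df.rdd

// Transformations never mutate; they build a new lineage.
val doubled = rdd.map(row => row.getLong(0) * 2)
println(doubled.collect().mkString(", "))  // 0, 2, 4, 6, 8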

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations
These are the operations that are not supported; validate this list against
your own use case.
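
To make that list concrete, here is a hedged sketch of how one entry on it
surfaces, using the built-in "rate" test source for self-containedness:
sorting a streaming Dataset without a prior aggregation is unsupported, and
Spark rejects it with an AnalysisException when the query is started.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// The "rate" source emits (timestamp, value) rows for testing.
val streamDf = spark.readStream.format("rate").load()

val query = streamDf
  .sort("value")            // sorting without aggregation: unsupported
  .writeStream
  .format("console")
  .outputMode("append")
  .start()                  // AnalysisException is thrown here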

From my experience, Structured Streaming is still new, while the DStreams API
is mature. Here are some things that are missing or that need more
exploration:

Watermarking/windowing based on the number of records in a particular window
(i.e. count-based windows).

Assuming you have watermarking and windowing on the event time of the data,
the resulting DataFrame is a grouped dataset; the only thing you can do with
it is run aggregate functions. You can't simply use that output as another
DataFrame and keep manipulating it. There is a custom Aggregator, but I feel
it's limited.
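
A minimal sketch of that pattern (again on the "rate" source): after
groupBy(window(...)) you have a grouped dataset, so the only next step is an
aggregate such as count or agg, and chaining a second streaming aggregation
onto the result is itself on the unsupported list.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, count}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val events = spark.readStream.format("rate").load()  // (timestamp, value)

val counts = events
  .withWatermark("timestamp", "10 minutes")     // bound how late data can be
  .groupBy(window($"timestamp", "5 minutes"))   // event-time window
  .agg(count("*").as("events"))                 // aggregates are all you can do here

counts.writeStream.format("console").outputMode("update").start()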

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations
There is an option to do stateful operations using GroupState, where your
function gets an iterator of events for that key in each trigger. This is the
closest access to the StateStore a developer can get.
The arbitrary state that the programmer keeps across invocations has its
limitations: how much state can we keep? Is that state stored in driver
memory? If the Spark job fails, is this state checkpointed and restored?
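
For reference, a hedged sketch of that API using mapGroupsWithState; Event
and SessionState are hypothetical case classes for illustration, and the
GroupState handle (getOption/update/setTimeoutDuration) is the documented
surface over the StateStore.

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(user: String, value: Long)   // hypothetical input type
case class SessionState(count: Long)          // hypothetical per-key state

def updateSession(
    user: String,
    events: Iterator[Event],
    state: GroupState[SessionState]): (String, Long) = {
  val old = state.getOption.getOrElse(SessionState(0L))
  val updated = SessionState(old.count + events.size)
  state.update(updated)                  // kept across trigger invocations
  state.setTimeoutDuration("30 minutes") // evict idle keys
  (user, updated.count)
}

// Usage, assuming a streaming Dataset[Event] named events:
// events.groupByKey(_.user)
//   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateSession)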

thanks
Vijay


