Hi Kant,

> 1. Can we use Spark Structured Streaming for stateless transformations
> just like we would do with DStreams, or is Spark Structured Streaming only
> meant for stateful computations?
Of course you can do stateless transformations. Any map, filter, or select type of transformation is stateless. Aggregations are generally stateful. You can also perform arbitrary stateless aggregations with "flatMapGroups <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L145>" or make them stateful with "flatMapGroupsWithState <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L376>".

> 2. When we use groupBy and window operations for event-time processing and
> specify a watermark, does this mean the timestamp field in each message is
> compared to the processing time of that machine/node, and events that are
> later than the specified threshold are discarded? If we don't specify a
> watermark, I am assuming the processing time won't come into the picture.
> Is that right? Just trying to understand the interplay between processing
> time and event time when we do event-time processing.

Watermarks are tracked with respect to the event time of your data, not the processing time of the machine. Please take a look at the blog below for more details:
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html

Best,
Burak
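To make both points concrete, here is a rough sketch (the socket source, host/port, and column names `word`/`timestamp` are illustrative assumptions, not from the thread — it assumes a running Spark session):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()
import spark.implicits._

// Stateless: select/filter carry no state across micro-batches.
val words = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999).load()
  .select($"value".cast("string").as("word"),
          current_timestamp().as("timestamp"))  // stand-in event-time column
  .filter($"word" =!= "")                       // stateless filter

// Stateful: windowed aggregation over event time, bounded by a watermark.
// The watermark trails the max *event time* seen so far by 10 minutes;
// the machine's wall clock is not involved.
val counts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"), $"word")
  .count()
```

Without the `withWatermark` call, the aggregation still uses event time, but its state is never dropped, since the engine has no bound on how late data may arrive.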