Re: couple naive questions on Spark Structured Streaming

2017-05-22 Thread kant kodali
HI Burak,

My response is inline.

Thanks a lot!

On Mon, May 22, 2017 at 9:26 AM, Burak Yavuz  wrote:

> Hi Kant,
>
>>
>>
>> 1. Can we use Spark Structured Streaming for stateless transformations
>> just like we would do with DStreams or Spark Structured Streaming is only
>> meant for stateful computations?
>>
>
> Of course you can do stateless transformations. Any map, filter, select,
> type of transformation is stateless. Aggregations are generally stateful.
> You could also perform arbitrary stateless aggregations with "
> flatMapGroups
> "
> or make them stateful with "flatMapGroupsWithState
> 
> ".
>

*Got it. so Spark Structured Streaming does both Stateful and Stateless
tranformations. In that case I am assuming DStreams API will be deprecated?
  How about groupBy ? That is stateful right?*

>
>
>
>> 2. When we use groupBy and Window operations for event time processing
>> and specify a watermark does this mean the timestamp field in each message
>> is compared to the processing time of that machine/node and discard that
>> events that are late than the specified threshold? If we don't specify a
>> watermark I am assuming the processing time wont come into the picture. is
>> that right? Just trying to understand the interplay between processing time
>> and even time when we do even time processing.
>>
>> Watermarks are tracked with respect to the event time of your data, not
> the processing time of the machine. Please take a look at the blog below
> for more details
> https://databricks.com/blog/2017/05/08/event-time-
> aggregation-watermarking-apache-sparks-structured-streaming.html
>

*Thanks for this article. I am not sure if I am interpreting the article
incorrectly buy Looks Like that Article shows there is indeed a
relationship between Processing time and event time. For example*
*say I set an Watermark of 10 minutes and *

*1. I send one message which has an event time stamp of May 22 2017 1PM and
Processing Time as May 22 2017 1:02 PM*


*2. I send another message which has an event time of May 22 2017 12:55 PM
and Processing Time as May 23 2017 1PM*

*Simply put, say I am just faking my event timestamp's to meet the cut off
specified by the watermark but I am actually sending them a day or week
later. How does Spark Structured Streaming handle this case? *

>
>
> Best,
> Burak
>


Re: couple naive questions on Spark Structured Streaming

2017-05-22 Thread Burak Yavuz
Hi Kant,

>
>
> 1. Can we use Spark Structured Streaming for stateless transformations
> just like we would do with DStreams or Spark Structured Streaming is only
> meant for stateful computations?
>

Of course you can do stateless transformations. Any map, filter, select,
type of transformation is stateless. Aggregations are generally stateful.
You could also perform arbitrary stateless aggregations with "flatMapGroups
"
or make them stateful with "flatMapGroupsWithState

".



> 2. When we use groupBy and Window operations for event time processing and
> specify a watermark does this mean the timestamp field in each message is
> compared to the processing time of that machine/node and discard that
> events that are late than the specified threshold? If we don't specify a
> watermark I am assuming the processing time wont come into the picture. is
> that right? Just trying to understand the interplay between processing time
> and even time when we do even time processing.
>
> Watermarks are tracked with respect to the event time of your data, not
the processing time of the machine. Please take a look at the blog below
for more details
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html

Best,
Burak


couple naive questions on Spark Structured Streaming

2017-05-20 Thread kant kodali
Hi,

1. Can we use Spark Structured Streaming for stateless transformations just
like we would do with DStreams or Spark Structured Streaming is only meant
for stateful computations?

2. When we use groupBy and Window operations for event time processing and
specify a watermark does this mean the timestamp field in each message is
compared to the processing time of that machine/node and discard that
events that are late than the specified threshold? If we don't specify a
watermark I am assuming the processing time wont come into the picture. is
that right? Just trying to understand the interplay between processing time
and even time when we do even time processing.

Thanks!