Re: Lazy Spark Structured Streaming

2020-08-02 Thread Jungtaek Lim
SPARK-24156 runs a no-data batch to apply the updated watermark, but the
updated watermark may not be sufficient to evict all state rows (depending
on, e.g., the window size and the watermark's allowed lateness).
You'll still need to provide a dummy input record to advance the watermark,
so that all expected state rows can be evicted.



Re: Lazy Spark Structured Streaming

2020-08-02 Thread Phillip Henry
Thanks, Jungtaek. Very useful information.

Could I please trouble you with one further question - what you said makes
perfect sense, but to what exactly does SPARK-24156 refer if not fixing the
"need to add a dummy record to move watermark forward"?

Kind regards,

Phillip






Re: Lazy Spark Structured Streaming

2020-07-27 Thread Jungtaek Lim
I'm not sure what exactly your problem is, but given you've mentioned
windows and OutputMode.Append, bear in mind that append mode doesn't
produce the output of an aggregation until the watermark passes the end of
the window. If you're seeing lazier outputs on OutputMode.Append compared
to OutputMode.Update, that's expected behavior.

Unfortunately there's no mechanism in SSS to move the watermark forward
without actual input, so if you want to test behavior under
OutputMode.Append you would need to add a dummy record to move the
watermark forward.
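[Editor's note: a toy comparison of the two output modes, in plain Python rather than the Spark API, with made-up batches and a simplified same-batch watermark. Update mode emits a window's current aggregate on every batch that changes it; append mode emits a window exactly once, only after the watermark has passed the window's end - hence the dummy third record below.]

```python
# Toy model contrasting Update and Append output for a windowed count
# (NOT the Spark API; watermark handling is simplified to one batch).

def assign_window(event_time, size=10):
    start = (event_time // size) * size
    return (start, start + size)

def run(batches, lateness=5, mode="append"):
    counts, max_seen, already_emitted, out = {}, float("-inf"), set(), []
    for batch in batches:
        for t in batch:
            w = assign_window(t)
            counts[w] = counts.get(w, 0) + 1
            max_seen = max(max_seen, t)
        wm = max_seen - lateness
        if mode == "update":
            # emit the current aggregate of every window each batch
            out.append(dict(counts))
        else:
            # append: emit only windows the watermark has closed, once
            closed = {w: c for w, c in counts.items()
                      if wm >= w[1] and w not in already_emitted}
            already_emitted |= closed.keys()
            out.append(closed)
    return out

batches = [[1, 3], [12], [26]]        # third batch plays the "dummy" record
print(run(batches, mode="update"))
print(run(batches, mode="append"))    # window (0, 10) only appears after t=26
```

With these numbers, append mode produces nothing for the first two batches; only the dummy record at t=26 pushes the watermark (21) past the ends of windows (0, 10) and (10, 20), releasing both at once.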

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)



Re: Lazy Spark Structured Streaming

2020-07-27 Thread Phillip Henry
Sorry, I should have mentioned that Spark only seems reluctant to take the
last windowed, groupBy batch from Kafka when using OutputMode.Append.

I've asked on StackOverflow:
https://stackoverflow.com/questions/62915922/spark-structured-streaming-wont-pull-the-final-batch-from-kafka
but am still struggling. Can anybody please help?

How do people test their SSS code if you have to put a message on Kafka to
get Spark to consume a batch?

Kind regards,

Phillip




Lazy Spark Structured Streaming

2020-07-12 Thread Phillip Henry
Hi, folks.

I noticed that SSS won't process a waiting batch if there are no batches
after that. To put it another way, Spark must always leave one batch on
Kafka waiting to be consumed.

There is a JIRA for this at:

https://issues.apache.org/jira/browse/SPARK-24156

that says it's resolved in 2.4.0, but my code is using 2.4.2 and I still
see Spark reluctant to consume another batch from Kafka if it means there
is nothing else waiting to be processed in the topic.

Do I have to do something special to exploit the behaviour that SPARK-24156
says it has addressed?

Regards,

Phillip