Thanks, Jungtaek. Very useful information.

Could I please trouble you with one further question? What you said makes
perfect sense, but what exactly does SPARK-24156
<https://issues.apache.org/jira/browse/SPARK-24156> refer to, if not fixing
the need to "add a dummy record to move watermark forward"?

Kind regards,

Phillip
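
As an aside, for anyone following along, here is a toy model of the behaviour
discussed below (plain Scala only; this is deliberately not Spark's actual API
or implementation, and the delay/window values are made up for illustration).
The watermark trails the maximum event time seen by a fixed delay, and in
append mode a window is only emitted once the watermark passes its end, which
is why the final window needs a later "dummy" record to close it:

```scala
// Toy model of event-time watermarking (NOT Spark's implementation).
// The watermark trails the maximum event time seen by a fixed delay, and
// a window's aggregate is only emitted (append-style) once the watermark
// passes the window's end -- so with no further input, the last window
// never closes, and a later "dummy" record is needed to close it.
object WatermarkToy {
  val delayMs  = 10000L  // hypothetical watermark delay of 10 seconds
  val windowMs = 60000L  // hypothetical 1-minute tumbling windows

  def watermark(maxEventTime: Long): Long = maxEventTime - delayMs
  def windowEnd(eventTime: Long): Long = (eventTime / windowMs + 1) * windowMs

  // Window-end timestamps whose aggregates would be emitted after `events`.
  def emittedWindows(events: Seq[Long]): Seq[Long] = {
    val wm = watermark(events.max)
    events.map(windowEnd).distinct.filter(_ <= wm)
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(5000L, 30000L, 61000L) // events in windows [0,60s) and [60s,120s)
    println(emittedWindows(batch))         // only the [0,60s) window closes
    // A later "dummy" record pushes the watermark past 120s, closing the last window:
    println(emittedWindows(batch :+ 131000L))
  }
}
```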




On Mon, Jul 27, 2020 at 11:41 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> I'm not sure what exactly your problem is, but given you've mentioned
> window and OutputMode.Append, you may want to keep in mind that append mode
> doesn't produce the output of an aggregation until the watermark "passes by"
> the end of its window. It's expected behavior if you're seeing delayed
> output with OutputMode.Append compared to OutputMode.Update.
>
> Unfortunately there's no mechanism in SSS to move the watermark forward
> without actual input, so if you want to test some behavior with
> OutputMode.Append you need to add a dummy record to move the watermark
> forward.
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry <londonjava...@gmail.com>
> wrote:
>
>> Sorry, should have mentioned that Spark only seems reluctant to take the
>> last windowed, groupBy batch from Kafka when using OutputMode.Append.
>>
>> I've asked on StackOverflow:
>>
>> https://stackoverflow.com/questions/62915922/spark-structured-streaming-wont-pull-the-final-batch-from-kafka
>> but am still struggling. Can anybody please help?
>>
>> How do people test their SSS code if you have to put a message on Kafka
>> to get Spark to consume a batch?
>>
>> Kind regards,
>>
>> Phillip
>>
>>
>> On Sun, Jul 12, 2020 at 4:55 PM Phillip Henry <londonjava...@gmail.com>
>> wrote:
>>
>>> Hi, folks.
>>>
>>> I noticed that SSS won't process a waiting batch if there are no batches
>>> after that. To put it another way, Spark must always leave one batch on
>>> Kafka waiting to be consumed.
>>>
>>> There is a JIRA for this at:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-24156
>>>
>>> that says it was resolved in 2.4.0, but my code
>>> <https://github.com/PhillHenry/SSSPlayground/blob/Spark2/src/test/scala/uk/co/odinconsultants/sssplayground/windows/TimestampedStreamingSpec.scala>
>>> uses 2.4.2 and I still see Spark reluctant to consume another batch from
>>> Kafka when nothing else is waiting to be processed in the topic.
>>>
>>> Do I have to do something special to exploit the behaviour that
>>> SPARK-24156 says it has addressed?
>>>
>>> Regards,
>>>
>>> Phillip
>>>
>>>
>>>
>>>
