Re: Lazy Spark Structured Streaming

2020-08-02 Thread Jungtaek Lim
SPARK-24156 runs the no-data batch to apply the updated watermark, but the updated watermark may not be far enough along to evict all state rows (e.g. because of the window length or the watermark delay). You'll still need to provide a dummy input record to advance the watermark, so that all expected state rows can be evicted. On
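
To make the point above concrete, here is a minimal plain-Python sketch of the behaviour being described. It is a simplified model and not Spark code: the names `simulate`, `WINDOW` and `DELAY` are hypothetical, the window length and watermark delay are assumed values, and the follow-up no-data batch is folded into each step (the watermark is updated and applied right after the data is processed).

```python
WINDOW = 10  # tumbling window length in seconds (assumed value)
DELAY = 5    # watermark delay threshold in seconds (assumed value)


def simulate(batches):
    """Simplified model of a windowed count in append mode.

    `batches` is a list of lists of event times (seconds). An empty list
    stands for a trigger with no new data. Returns, for each batch, the
    list of (window_start, window_end, count) rows emitted in that batch.
    """
    state = {}             # window start -> running count
    max_event_time = None  # largest event time seen so far
    watermark = None       # max event time minus DELAY
    emitted = []

    for batch in batches:
        for t in batch:
            start = (t // WINDOW) * WINDOW
            # Drop records whose window is already behind the watermark.
            if watermark is not None and start + WINDOW <= watermark:
                continue
            state[start] = state.get(start, 0) + 1
            max_event_time = t if max_event_time is None else max(max_event_time, t)

        # Only event times can move the watermark; an empty batch cannot.
        if max_event_time is not None:
            watermark = max_event_time - DELAY

        # Append mode: a window is emitted (and its state evicted) only
        # once the watermark has reached the end of the window.
        done = sorted(s for s in state if s + WINDOW <= watermark)
        emitted.append([(s, s + WINDOW, state.pop(s)) for s in done])

    return emitted


# Two data batches, one empty trigger, then a "dummy" record at t=40.
results = simulate([[1, 4, 12], [15, 27], [], [40]])
for i, rows in enumerate(results):
    print(i, rows)
```

In this model the window [20, 30) stays in state after the second batch because the watermark is only 22, the empty trigger changes nothing, and the window is emitted only once the record at t=40 pushes the watermark to 35.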

Re: Lazy Spark Structured Streaming

2020-08-02 Thread Phillip Henry
Thanks, Jungtaek. Very useful information. Could I please trouble you with one further question - what you said makes perfect sense but to what exactly does SPARK-24156 refer if not fixing the "need to add a dummy record to move watermark

Re: Lazy Spark Structured Streaming

2020-07-27 Thread Jungtaek Lim
I'm not sure what exactly your problem is, but given you've mentioned window and OutputMode.Append, you may want to keep in mind that append mode doesn't produce the output of an aggregation until the watermark "passes by". It's expected behavior if you're seeing lazy outputs on OutputMode.Append compared

Re: Lazy Spark Structured Streaming

2020-07-27 Thread Phillip Henry
Sorry, should have mentioned that Spark only seems reluctant to take the last windowed, groupBy batch from Kafka when using OutputMode.Append. I've asked on StackOverflow: https://stackoverflow.com/questions/62915922/spark-structured-streaming-wont-pull-the-final-batch-from-kafka but am still

Lazy Spark Structured Streaming

2020-07-12 Thread Phillip Henry
Hi, folks. I noticed that Spark Structured Streaming (SSS) won't process a waiting batch if there are no batches after it. To put it another way, Spark must always leave one batch on Kafka waiting to be consumed. There is a JIRA for this at: https://issues.apache.org/jira/browse/SPARK-24156 that says it's resolved in