Hi TD,

I've also been fighting this issue, only to arrive at the exact same solution
you are suggesting.
Too bad I didn't find either the post or the issue sooner.

I'm using a 1-second batch interval with N Kafka events per batch (mapped 1-to-1
to the state objects), and only calling the updateStateByKey function.
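
For reference, here's roughly what the setup looks like; a minimal sketch, with
the ZooKeeper quorum, group id, topic name and update logic all as placeholders:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("kafka-state-sketch")
  val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval
  ssc.checkpoint("/tmp/checkpoint")                // required by updateStateByKey

  // Receiver-based Kafka stream of (key, message) pairs;
  // connection details and topic are placeholders.
  val events = KafkaUtils.createStream(
    ssc, "zk-host:2181", "my-group", Map("events" -> 1))

  // One state object per key, updated from that key's new events.
  val state = events.updateStateByKey[String] {
    (newEvents: Seq[String], current: Option[String]) =>
      newEvents.lastOption.orElse(current) // placeholder update logic
  }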

This is my interpretation, please correct me if needed: because of Spark's lazy
evaluation, the RDDs weren't being updated on each batch interval execution as
expected. My assumption was that as long as a streaming batch runs (with or
without new messages), I should get updated RDDs, which was not happening. It
seems updateStateByKey is only called for keys that received events, or for
state that is forced to compute through an output operation. I haven't run
further tests to confirm this, but that's the impression I got.

This doesn't fit our requirements, since we want to perform duration updates on
every batch interval execution, so I had to force the computation of all the
state objects through the foreachRDD function, as sketched below.
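
Concretely, the workaround looks like this (continuing the sketch above; the
no-op traversal is just one way to force evaluation):

  // Registering an output operation that touches every element forces
  // the state DStream to be computed on every batch interval, even for
  // keys that received no new events.
  state.foreachRDD { rdd =>
    rdd.foreach { _ => () } // no-op; only there to trigger computation
  }

  ssc.start()
  ssc.awaitTermination()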

I would also appreciate it if the priority of this issue could be increased. I
assume the foreachRDD workaround incurs additional, unnecessary resource usage
(although I'm not sure how much) compared to having the computation happen by
default on each batch interval execution.

Thanks,
Rod


