1. udpateStateByKey should be called on all keys even if there is not data
corresponding to that key. There is a unit test for that.
https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala#L337
2. I am increasing the priority for t
Hi TD,
I've also been fighting this issue only to find the exact same solution you
are suggesting.
Too bad I didn't find either the post or the issue sooner.
I'm using a 1 second batch with N amount of kafka events (1 to 1 with the
state objects) per batch and only calling the updatestatebykey f
TD,
We are seeing the same issue. We struggled through this until we found this
post and the work around.
A quick fix in the Spark Streaming software will help a lot for others who
are encountering this and pulling their hair out on why RDD on some
partitions are not computed (we ended up spending
Yeah, maybe I should bump the issue to major. Now that I thought about
to give my previous answer, this should be easy to fix just by doing a
foreachRDD on all the input streams within the system (rather than
explicitly doing it like I asked you to do).
Thanks Alan, for testing this out and confir
TD, it looks like your instincts were correct. I misunderstood what you meant.
If I force an eval on the inputstream using foreachRDD, the windowing will
work correctly. If I don’t do that, lazy eval somehow screws with window
batches I eventually receive. Any reason the bug is categorized a
foreachRDD is how I extracted values in the first place, so that’s not going to
make a difference. I don’t think it’s related to SPARK-1312 because I’m
generating data every second in the first place and I’m using foreachRDD right
after the window operation. The code looks something like
val
It could be related to this bug that is currently open.
https://issues.apache.org/jira/browse/SPARK-1312
Here is a workaround. Can you put a inputStream.foreachRDD(rdd => { }) and
try these combos again?
TD
On Tue, Jul 22, 2014 at 6:01 PM, Alan Ngai wrote:
> I have a sample application pumpin
I have a sample application pumping out records 1 per second. The batch
interval is set to 5 seconds. Here’s a list of “observed window intervals” vs
what was actually set
window=25, slide=25 : observed-window=25, overlapped-batches=0
window=25, slide=20 : observed-window=20, overlapped-batche