Thank you, TD. This is important information for us. Will keep an eye on
that.
Cheers,
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
On Thu, Jul 17, 2014 at 6:54 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Good to know! I am bumping the priority of this issue in my head. Thanks
for the feedback. Others seeing this thread, please comment if you think
that this is an important issue for you as well.
Not at my computer right now but I will make a Jira for this.
TD
On Jul 17, 2014 11:22 PM, Yan Fang wrote:
Hi guys,
I am sure some of you have a similar use case and would like to know how you deal with it. In our application, we want to check the previous state of some keys and compare it with their current state.
AFAIK, Spark Streaming does not have key-value access. So currently what I am
doing is storing the previous
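This kind of comparison can be sketched with an `updateStateByKey`-style update function whose state carries a (previous, current) pair. The function below is my own illustration, not code from this thread; the signature matches what PySpark Streaming passes to `updateStateByKey` (the batch's new values plus the last state):

```python
# Illustrative sketch (names are mine): keep (previous, current) in the state
# so each batch can compare a key's old value with its new one.
def update_func(new_values, last_state):
    """new_values: values for this key in the current batch;
    last_state: a (previous, current) tuple, or None for a new key."""
    if not new_values:
        return last_state               # no new data: keep the state as-is
    latest = sum(new_values)
    previous = last_state[1] if last_state is not None else None
    return (previous, latest)

# In PySpark Streaming this would be wired up as (assuming `pairs` is a
# DStream of (key, value) pairs):
#   state_stream = pairs.updateStateByKey(update_func)
```

Each key's state then holds both the last interval's value and the current one, so the comparison can happen inside a normal `map` over the state stream.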
For accessing the previous version, I would do it the same way. :)
1. Can you elaborate on what you mean by that with an example? What do you
mean by accessing keys?
2. Yeah, that is hard to do without the ability to do point lookups into an
RDD, which we don't support yet. You could try embedding the
Hi TD,
Thank you for the quick reply and for backing my approach. :)
1) The example is this:
1. In the first 2-second interval, after updateStateByKey, I get a few keys
and their states, say, (a -> 1, b -> 2, c -> 3)
2. In the following 2-second interval, I only receive c and d and their
values. But
The update function given to updateStateByKey should be called on ALL the
keys that are in the state, even if there is no new data in the batch for some
key. Is that not the behavior you see?
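That behavior (the update function running for every key in the state, every batch) is also what lets you expire state: returning None removes the key. A minimal sketch, again with made-up names and a deliberately aggressive policy (a real job would usually track an idle counter rather than dropping a key the first time it is silent):

```python
# Sketch: because the update function runs for EVERY key in the state each
# batch, it can drop idle keys by returning None (which removes the key
# from the state in updateStateByKey).
def update_with_expiry(new_values, last_count):
    if not new_values:
        return None                       # no data this batch: drop the key
    return (last_count or 0) + sum(new_values)
```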
What do you mean by "show all the existing states"? You have access to the
latest state RDD by doing
Hi TD,
Thank you. Yes, it behaves as you described. Sorry for missing this point.
Then my only concern is on the performance side - since Spark Streaming
operates on all the keys every time a new batch comes in, I think it is fine
when the state size is small. When the state size becomes big, say, a
Yes, this is the limitation of the current implementation. But this will be
improved a lot when we have IndexedRDD
https://github.com/apache/spark/pull/1297 in Spark, which allows faster
single-value updates to a key-value RDD (within each partition, without
processing the entire partition).
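The cost difference being described can be sketched in plain Python (this is illustrative only, not IndexedRDD's actual API): without an index, changing one key means rebuilding the whole partition; with a per-partition hash index, the same change is a point update.

```python
# Illustrative only (not IndexedRDD's API): contrast a full partition scan
# with an indexed point update.

def full_scan_update(partition, key, value):
    # O(n): every record in the partition is touched to change one key.
    return [(k, value if k == key else v) for k, v in partition]

def indexed_update(index, key, value):
    # O(1) expected: a point update against a per-partition hash index.
    index[key] = value
    return index
```

This is why full-pass `updateStateByKey` is fine for small state but degrades as the number of keys grows, and why indexed point lookups would help.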