Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-18 Thread Yan Fang
Thank you, TD. This is important information for us. Will keep an eye on
that.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108




Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-18 Thread Tathagata Das
Good to know! I am bumping the priority of this issue in my head. Thanks
for the feedback. Others seeing this thread, please comment if you think
that this is an important issue for you as well.

I am not at my computer right now, but I will make a JIRA for this.

TD

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
For accessing the previous version, I would do it the same way. :)

1. Can you elaborate on what you mean by that with an example? What do you
mean by "accessing keys"?

2. Yeah, that is hard to do without the ability to do point lookups into an
RDD, which we don't support yet. You could try embedding the related key's
value in the values of the keys that need it. That is, B's value is present
in the value of key A. Then put this transformed DStream through
updateStateByKey.
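The embedding trick can be sketched without Spark. Below is a minimal Python model (plain dicts stand in for a micro-batch; `embed_related` and the record layout are hypothetical names, not Spark API): before the stateful update runs, each key's value is transformed to also carry the related key's value from the same batch.

```python
def embed_related(batch, related_key):
    """Transform one micro-batch (key -> value) so that every key's
    record also carries related_key's value from the same batch.
    The per-key update function can then see it."""
    related = batch.get(related_key)  # None if absent from this batch
    return {k: {"own": v, "related": related} for k, v in batch.items()}

# one micro-batch in which key "A" needs to see key "B"'s value
batch = {"A": 1, "B": 2}
transformed = embed_related(batch, "B")
# transformed["A"] now carries B's value alongside A's own value
```

In a real job this transformation would be a `map` (or a join/broadcast, depending on how the related value is distributed) applied to the DStream before `updateStateByKey`.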

TD


Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Yan Fang
Hi TD,

Thank you for the quick reply and for backing my approach. :)

1) The example is this:

1. In the first 2-second interval, after updateStateByKey, I get a few keys
and their states, say, (a -> 1, b -> 2, c -> 3).
2. In the following 2-second interval, I only receive c and d and their
values. But I want to update/display the states of a and b accordingly.
* It seems I have no way to access a and b and get their states.
* Also, do I have a way to show all the existing states?

I guess the approach to solve this will be similar to what you mentioned
for 2). But the difficulty is that, if I want to display all the existing
states, I would need to bundle all the remaining keys into one key.

Thank you.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108



Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
The updateFunction given in updateStateByKey should be called on ALL the
keys that are in the state, even if there is no new data in the batch for
some key. Is that not the behavior you see?

What do you mean by "show all the existing states"? You have access to the
latest state RDD by doing stateStream.foreachRDD(...). There you can do
whatever operation on all the key-state pairs.
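Both points can be illustrated with a Spark-free model of updateStateByKey's semantics (a hypothetical sketch, not the real implementation): the update function is invoked for every key ever seen, with an empty list of new values for keys absent from the current batch, and the full state is available after every batch (the analogue of inspecting the state RDD in foreachRDD).

```python
def update_state_by_key(batches, update_fn):
    """Plain-Python model of updateStateByKey semantics: on each batch,
    update_fn runs for EVERY key ever seen, receiving an empty list of
    new values for keys absent from that batch."""
    state = {}
    for batch in batches:
        keys = set(state) | set(batch)
        state = {k: update_fn(batch.get(k, []), state.get(k)) for k in keys}
        # Spark drops keys whose update function returns None
        state = {k: v for k, v in state.items() if v is not None}
    return state

def running_sum(new_values, old):
    return (old or 0) + sum(new_values)

batches = [{"a": [1], "b": [2], "c": [3]},   # first 2-second interval
           {"c": [4], "d": [5]}]             # second interval: only c and d
final = update_state_by_key(batches, running_sum)
# "a" and "b" survive the second batch even though it carried no data for them
```

The `final` dict plays the role of the latest state RDD: iterating it is the model of doing whatever operation you like on all key-state pairs inside foreachRDD.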

TD





Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Yan Fang
Hi TD,

Thank you. Yes, it behaves as you described. Sorry for missing this point.

Then my only concern is on the performance side: since Spark Streaming
operates on all the keys every time a new batch comes, I think it is fine
when the state size is small. When the state size becomes big, say, a few
GBs, if we still go through the whole key list, wouldn't the operation be a
little inefficient? Maybe I am missing something in Spark Streaming that
accounts for this situation.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108



Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
Yes, this is the limitation of the current implementation. But this will be
improved a lt when we have IndexedRDD
https://github.com/apache/spark/pull/1297 in the Spark that allows faster
single value updates to a key-value (within each partition, without
processing the entire partition.
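The cost difference can be modeled outside Spark (a hypothetical sketch; the function names are mine, not Spark or IndexedRDD API): today's stateful pass touches every key in state on each batch, while an indexed structure would touch only the keys that actually received new data.

```python
def full_pass_update(state, batch):
    # current behavior: the update runs for every key in state, even keys
    # with no new data in this batch -> O(|state|) work per batch
    touched = 0
    new_state = {}
    for key in set(state) | set(batch):
        touched += 1
        new_state[key] = state.get(key, 0) + sum(batch.get(key, []))
    return new_state, touched

def point_update(state, batch):
    # indexed behavior: only keys with new data are touched -> O(|batch|)
    state = dict(state)
    touched = 0
    for key, values in batch.items():
        touched += 1
        state[key] = state.get(key, 0) + sum(values)
    return state, touched

state = {k: 1 for k in range(100_000)}  # large existing state
batch = {0: [5], 1: [7]}                # tiny batch
s_full, n_full = full_pass_update(state, batch)
s_point, n_point = point_update(state, batch)
# both produce the same state; the full pass touches 100,000 keys,
# the indexed update touches 2
```

This is the gap Yan's question points at: with a few GBs of state and small batches, the per-batch work is dominated by the state size rather than the batch size.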

Soon.

TD

