Re: Is there a way to get previous/other keys' state in Spark Streaming?
Thank you, TD. This is important information for us. Will keep an eye on that. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 6:54 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Yes, this is the limitation of the current implementation. But this will be improved a lt when we have IndexedRDD https://github.com/apache/spark/pull/1297 in the Spark that allows faster single value updates to a key-value (within each partition, without processing the entire partition. Soon. TD On Thu, Jul 17, 2014 at 5:57 PM, Yan Fang yanfang...@gmail.com wrote: Hi TD, Thank you. Yes, it behaves as you described. Sorry for missing this point. Then my only concern is in the performance side - since Spark Streaming operates on all the keys everytime a new batch comes, I think it is fine when the state size is small. When the state size becomes big, say, a few GBs, if we still go through the whole key list, would the operation be a little inefficient then? Maybe I miss some points in Spark Streaming, which consider this situation. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 1:47 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The updateFunction given in updateStateByKey should be called on ALL the keys are in the state, even if there is no new data in the batch for some key. Is that not the behavior you see? What do you mean by show all the existing states? You have access to the latest state RDD by doing stateStream.foreachRDD(...). There you can do whatever operation on all the key-state pairs. TD On Thu, Jul 17, 2014 at 11:58 AM, Yan Fang yanfang...@gmail.com wrote: Hi TD, Thank you for the quick replying and backing my approach. :) 1) The example is this: 1. In the first 2 second interval, after updateStateByKey, I get a few keys and their states, say, (a - 1, b - 2, c - 3) 2. In the following 2 second interval, I only receive c and d and their value. But I want to update/display the state of a and b accordingly. * It seems I have no way to access the a and b and get their states. * also, do I have a way to show all the existing states? I guess the approach to solve this will be similar to what you mentioned for 2). But the difficulty is that, if I want to display all the existing states, need to bundle all the rest keys to one key. Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das tathagata.das1...@gmail.com wrote: For accessing previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by accessing keys? 2. Yeah, that is hard to do with the ability to do point lookups into an RDD, which we dont support yet. You could try embedding the related key in the values of the keys that need it. That is, B will is present in the value of key A. Then put this transformed DStream through updateStateByKey. TD
Re: Is there a way to get previous/other keys' state in Spark Streaming?
Good to know! I am bumping the priority of this issue in my head. Thanks for the feedback. Others seeing this thread, please comment if you think that this is an important issue for you as well. Not at my computer right now but I will make a Jira for this. TD On Jul 17, 2014 11:22 PM, Yan Fang yanfang...@gmail.com wrote: Thank you, TD. This is important information for us. Will keep an eye on that. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 6:54 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Yes, this is the limitation of the current implementation. But this will be improved a lt when we have IndexedRDD https://github.com/apache/spark/pull/1297 in the Spark that allows faster single value updates to a key-value (within each partition, without processing the entire partition. Soon. TD On Thu, Jul 17, 2014 at 5:57 PM, Yan Fang yanfang...@gmail.com wrote: Hi TD, Thank you. Yes, it behaves as you described. Sorry for missing this point. Then my only concern is in the performance side - since Spark Streaming operates on all the keys everytime a new batch comes, I think it is fine when the state size is small. When the state size becomes big, say, a few GBs, if we still go through the whole key list, would the operation be a little inefficient then? Maybe I miss some points in Spark Streaming, which consider this situation. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 1:47 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The updateFunction given in updateStateByKey should be called on ALL the keys are in the state, even if there is no new data in the batch for some key. Is that not the behavior you see? What do you mean by show all the existing states? You have access to the latest state RDD by doing stateStream.foreachRDD(...). There you can do whatever operation on all the key-state pairs. TD On Thu, Jul 17, 2014 at 11:58 AM, Yan Fang yanfang...@gmail.com wrote: Hi TD, Thank you for the quick replying and backing my approach. :) 1) The example is this: 1. In the first 2 second interval, after updateStateByKey, I get a few keys and their states, say, (a - 1, b - 2, c - 3) 2. In the following 2 second interval, I only receive c and d and their value. But I want to update/display the state of a and b accordingly. * It seems I have no way to access the a and b and get their states. * also, do I have a way to show all the existing states? I guess the approach to solve this will be similar to what you mentioned for 2). But the difficulty is that, if I want to display all the existing states, need to bundle all the rest keys to one key. Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das tathagata.das1...@gmail.com wrote: For accessing previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by accessing keys? 2. Yeah, that is hard to do with the ability to do point lookups into an RDD, which we dont support yet. You could try embedding the related key in the values of the keys that need it. That is, B will is present in the value of key A. Then put this transformed DStream through updateStateByKey. TD
Re: Is there a way to get previous/other keys' state in Spark Streaming?
For accessing previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by accessing keys? 2. Yeah, that is hard to do with the ability to do point lookups into an RDD, which we dont support yet. You could try embedding the related key in the values of the keys that need it. That is, B will is present in the value of key A. Then put this transformed DStream through updateStateByKey. TD
Re: Is there a way to get previous/other keys' state in Spark Streaming?
Hi TD, Thank you for the quick replying and backing my approach. :) 1) The example is this: 1. In the first 2 second interval, after updateStateByKey, I get a few keys and their states, say, (a - 1, b - 2, c - 3) 2. In the following 2 second interval, I only receive c and d and their value. But I want to update/display the state of a and b accordingly. * It seems I have no way to access the a and b and get their states. * also, do I have a way to show all the existing states? I guess the approach to solve this will be similar to what you mentioned for 2). But the difficulty is that, if I want to display all the existing states, need to bundle all the rest keys to one key. Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das tathagata.das1...@gmail.com wrote: For accessing previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by accessing keys? 2. Yeah, that is hard to do with the ability to do point lookups into an RDD, which we dont support yet. You could try embedding the related key in the values of the keys that need it. That is, B will is present in the value of key A. Then put this transformed DStream through updateStateByKey. TD
Re: Is there a way to get previous/other keys' state in Spark Streaming?
The updateFunction given in updateStateByKey should be called on ALL the keys are in the state, even if there is no new data in the batch for some key. Is that not the behavior you see? What do you mean by show all the existing states? You have access to the latest state RDD by doing stateStream.foreachRDD(...). There you can do whatever operation on all the key-state pairs. TD On Thu, Jul 17, 2014 at 11:58 AM, Yan Fang yanfang...@gmail.com wrote: Hi TD, Thank you for the quick replying and backing my approach. :) 1) The example is this: 1. In the first 2 second interval, after updateStateByKey, I get a few keys and their states, say, (a - 1, b - 2, c - 3) 2. In the following 2 second interval, I only receive c and d and their value. But I want to update/display the state of a and b accordingly. * It seems I have no way to access the a and b and get their states. * also, do I have a way to show all the existing states? I guess the approach to solve this will be similar to what you mentioned for 2). But the difficulty is that, if I want to display all the existing states, need to bundle all the rest keys to one key. Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das tathagata.das1...@gmail.com wrote: For accessing previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by accessing keys? 2. Yeah, that is hard to do with the ability to do point lookups into an RDD, which we dont support yet. You could try embedding the related key in the values of the keys that need it. That is, B will is present in the value of key A. Then put this transformed DStream through updateStateByKey. TD
Re: Is there a way to get previous/other keys' state in Spark Streaming?
Hi TD, Thank you. Yes, it behaves as you described. Sorry for missing this point. Then my only concern is in the performance side - since Spark Streaming operates on all the keys everytime a new batch comes, I think it is fine when the state size is small. When the state size becomes big, say, a few GBs, if we still go through the whole key list, would the operation be a little inefficient then? Maybe I miss some points in Spark Streaming, which consider this situation. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 1:47 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The updateFunction given in updateStateByKey should be called on ALL the keys are in the state, even if there is no new data in the batch for some key. Is that not the behavior you see? What do you mean by show all the existing states? You have access to the latest state RDD by doing stateStream.foreachRDD(...). There you can do whatever operation on all the key-state pairs. TD On Thu, Jul 17, 2014 at 11:58 AM, Yan Fang yanfang...@gmail.com wrote: Hi TD, Thank you for the quick replying and backing my approach. :) 1) The example is this: 1. In the first 2 second interval, after updateStateByKey, I get a few keys and their states, say, (a - 1, b - 2, c - 3) 2. In the following 2 second interval, I only receive c and d and their value. But I want to update/display the state of a and b accordingly. * It seems I have no way to access the a and b and get their states. * also, do I have a way to show all the existing states? I guess the approach to solve this will be similar to what you mentioned for 2). But the difficulty is that, if I want to display all the existing states, need to bundle all the rest keys to one key. Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das tathagata.das1...@gmail.com wrote: For accessing previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by accessing keys? 2. Yeah, that is hard to do with the ability to do point lookups into an RDD, which we dont support yet. You could try embedding the related key in the values of the keys that need it. That is, B will is present in the value of key A. Then put this transformed DStream through updateStateByKey. TD
Re: Is there a way to get previous/other keys' state in Spark Streaming?
Yes, this is the limitation of the current implementation. But this will be improved a lt when we have IndexedRDD https://github.com/apache/spark/pull/1297 in the Spark that allows faster single value updates to a key-value (within each partition, without processing the entire partition. Soon. TD On Thu, Jul 17, 2014 at 5:57 PM, Yan Fang yanfang...@gmail.com wrote: Hi TD, Thank you. Yes, it behaves as you described. Sorry for missing this point. Then my only concern is in the performance side - since Spark Streaming operates on all the keys everytime a new batch comes, I think it is fine when the state size is small. When the state size becomes big, say, a few GBs, if we still go through the whole key list, would the operation be a little inefficient then? Maybe I miss some points in Spark Streaming, which consider this situation. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 1:47 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The updateFunction given in updateStateByKey should be called on ALL the keys are in the state, even if there is no new data in the batch for some key. Is that not the behavior you see? What do you mean by show all the existing states? You have access to the latest state RDD by doing stateStream.foreachRDD(...). There you can do whatever operation on all the key-state pairs. TD On Thu, Jul 17, 2014 at 11:58 AM, Yan Fang yanfang...@gmail.com wrote: Hi TD, Thank you for the quick replying and backing my approach. :) 1) The example is this: 1. In the first 2 second interval, after updateStateByKey, I get a few keys and their states, say, (a - 1, b - 2, c - 3) 2. In the following 2 second interval, I only receive c and d and their value. But I want to update/display the state of a and b accordingly. * It seems I have no way to access the a and b and get their states. * also, do I have a way to show all the existing states? I guess the approach to solve this will be similar to what you mentioned for 2). But the difficulty is that, if I want to display all the existing states, need to bundle all the rest keys to one key. Thank you. Cheers, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Thu, Jul 17, 2014 at 11:36 AM, Tathagata Das tathagata.das1...@gmail.com wrote: For accessing previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by accessing keys? 2. Yeah, that is hard to do with the ability to do point lookups into an RDD, which we dont support yet. You could try embedding the related key in the values of the keys that need it. That is, B will is present in the value of key A. Then put this transformed DStream through updateStateByKey. TD