Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Bin Wang
Data that are not updated should be saved earlier: while the data added to the DStream at the first time, it should be considered as updated. So save the same data again is a waste. What are the community is doing? Is there any doc or discussion that I can look for? Thanks. Shixiong Zhu

Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Shixiong Zhu
You can create connection like this: val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => { val dbConnection = create a db connection iterator.flatMap { case (key, values, stateOption) => if (values.isEmpty) { // don't access database

Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Shixiong Zhu
Could you write your update func like this? val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => { iterator.flatMap { case (key, values, stateOption) => if (values.isEmpty) { // don't access database } else { // update to new

Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Bin Wang
It seems like a work around. But I don't know how to get the database connection from the working nodes. Shixiong Zhu 于2015年9月24日周四 下午5:37写道: > Could you write your update func like this? > > val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) > => { >

Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Shixiong Zhu
For data that are not updated, where do you save? Or do you only want to avoid accessing database for those that are not updated? Besides, the community is working on optimizing "updateStateBykey"'s performance. Hope it will be delivered soon. Best Regards, Shixiong Zhu 2015-09-24 13:45

Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Bin Wang
Thanks, it seems good, though a little hack. And here is another question. updateByKey compute on all the data from the beginning, but in many situation, we just need to update the coming data. This could be a big improve on speed and resource. Would this to be support in the future? Shixiong

Get only updated RDDs from or after updateStateBykey

2015-09-23 Thread Bin Wang
I've read the source code and it seems to be impossible, but I'd like to confirm it. It is a very useful feature. For example, I need to store the state of DStream into my database, in order to recovery them from next redeploy. But I only need to save the updated ones. Save all keys into database