Re: How to initialize updateStateByKey operation

2014-09-23 Thread Soumitra Kumar
I thought I did a good job ;-)

OK, so what is the best way to initialize the updateStateByKey operation? I have
counts from a previous spark-submit job, and want to load them in the next
spark-submit job.
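
To make it concrete: persisting the counts at the end of one job is the easy
half. A rough sketch, assuming 'stateDStream' is the DStream[(String, Int)]
produced by updateStateByKey:

// Persist the running counts each batch so the next spark-submit can
// reload the most recent snapshot.
stateDStream.map { case (word, count) => word + " " + count }
            .saveAsTextFiles("stored.count")  // writes stored.count-<batchTime> directories

What I am missing is the other half: feeding such a snapshot back into
updateStateByKey in the next job.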

- Original Message -
From: Soumitra Kumar kumar.soumi...@gmail.com
To: spark users user@spark.apache.org
Sent: Sunday, September 21, 2014 10:43:01 AM
Subject: How to initialize updateStateByKey operation

I started with the StatefulNetworkWordCount example to keep a running count of the words seen.
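
For reference, the update function in that example keeps a running sum per word, roughly:

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.sum           // new occurrences in this batch
  val previousCount = state.getOrElse(0)  // running count so far
  Some(currentCount + previousCount)
}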

I have a file 'stored.count' which contains the word counts.

$ cat stored.count
a 1
b 2

I want to initialize StatefulNetworkWordCount with the values in the
'stored.count' file. How do I do that?
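
Loading the file itself is straightforward. A rough sketch, assuming the
two-column 'word count' format above and an existing SparkContext 'sc':

import org.apache.spark.rdd.RDD

val initial: RDD[(String, Int)] = sc.textFile("stored.count").map { line =>
  val Array(word, count) = line.split(" ")  // "a 1" -> ("a", 1)
  (word, count.toInt)
}

The open question is how to feed this RDD into updateStateByKey.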

I looked at the paper 'EECS-2012-259.pdf' by Matei et al. In figure 2, it would
be useful to have an initial RDD feeding into 'counts' at 't = 1', as below.

                           initial
                              |
t = 1: pageView -> ones -> counts
                              |
t = 2: pageView -> ones -> counts
...

I have also attached the modified figure 2 of 
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf .

I managed to hack the Spark code to achieve this, and am attaching the modified files.

Essentially, I added an argument 'initial: RDD[(K, S)]' to the updateStateByKey
method:

def updateStateByKey[S: ClassTag](
    initial: RDD[(K, S)],
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)]
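
With that change, StatefulNetworkWordCount could be seeded like this (a sketch
against my modified signature, so it only works with the attached patch;
'initial' and 'updateFunc' are as above, and 'wordDstream' is the (word, 1)
stream from the example):

val stateDStream = wordDstream.updateStateByKey[Int](
  initial,      // seed states from the previous job's counts
  updateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism))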

If this sounds interesting to a larger crowd, I would be happy to clean up the
code and volunteer to push it upstream. I don't know the procedure for that, though.

-Soumitra.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to initialize updateStateByKey operation

2014-09-23 Thread Tathagata Das
At a high level, the suggestion sounds good to me. However, regarding the code,
it's best to submit a Pull Request on the Spark GitHub page for community
review. You will find more information here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
