At a high level, the suggestion sounds good to me. However, regarding the code,
it's best to submit a Pull Request on the Spark GitHub page for community
review. You will find more information here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
On Tue, Sep 23, 2014 at 10:11 PM, Soumitra Kumar <kumar.soumi...@gmail.com>
wrote:
I thought I did a good job ;-)
OK, so what is the best way to initialize an updateStateByKey operation? I
have counts from a previous spark-submit run, and want to load them in the
next spark-submit job.
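The semantics being asked for can be sketched in plain Scala (no Spark, so the names and structure here are illustrative only): seed the running state with counts saved by a previous run, then apply the same (Seq[V], Option[S]) => Option[S] update function per batch, exactly as updateStateByKey would.

```scala
// Plain-Scala sketch (no Spark) of seeding a running word count with
// state from a previous run. All names here are hypothetical.
object SeededCounts {
  type State = Map[String, Long]

  // Mirrors updateStateByKey's (Seq[V], Option[S]) => Option[S] shape.
  val updateFunc: (Seq[Long], Option[Long]) => Option[Long] =
    (values, state) => Some(values.sum + state.getOrElse(0L))

  // Apply one batch of words to the current state.
  def runBatch(state: State, batch: Seq[String]): State = {
    val grouped: Map[String, Seq[Long]] =
      batch.groupBy(identity).map { case (w, ws) => w -> ws.map(_ => 1L) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }
}
```

Starting from the saved state Map("a" -> 1L, "b" -> 2L), a batch Seq("a", "c") yields Map("a" -> 2L, "b" -> 2L, "c" -> 1L): the prior counts carry over instead of being reset to zero.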
- Original Message -
From: Soumitra Kumar kumar.soumi...@gmail.com
To: spark users user@spark.apache.org
Sent: Sunday, September 21, 2014 10:43:01 AM
Subject: How to initialize updateStateByKey operation
I started with StatefulNetworkWordCount to keep a running count of the words
seen.
I have a file 'stored.count' which contains the word counts.
$ cat stored.count
a 1
b 2
I want to initialize StatefulNetworkWordCount with the values in the
'stored.count' file. How do I do that?
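Reading that file back is straightforward; a minimal plain-Scala parser for the "word count" line format shown above might look like this (the object and method names are hypothetical; in a Spark job you would do the equivalent over sc.textFile("stored.count")):

```scala
// Hypothetical parser for the 'stored.count' format: one
// whitespace-separated "<word> <count>" pair per line.
object StoredCount {
  def parse(lines: Seq[String]): Map[String, Long] =
    lines.flatMap { line =>
      line.trim.split("\\s+") match {
        case Array(word, count) => Some(word -> count.toLong)
        case _                  => None // skip blank or malformed lines
      }
    }.toMap
}
```

For the file above, parse(Seq("a 1", "b 2")) produces Map("a" -> 1L, "b" -> 2L), which is the initial state to feed into the stateful computation.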
I looked at the paper 'EECS-2012-259.pdf' by Matei et al.; in figure 2, it
would be useful to have an initial RDD feeding into 'counts' at 't = 1', as
below.
                           initial
                              |
t = 1: pageView -> ones -> counts
                              |
t = 2: pageView -> ones -> counts
...
I have also attached the modified figure 2 of
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf .
I managed to hack the Spark code to achieve this, and am attaching the
modified files.
Essentially, I added an argument 'initial: RDD[(K, S)]' to the
updateStateByKey method:

def updateStateByKey[S: ClassTag](
    initial: RDD[(K, S)],
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)]
If it sounds interesting to a larger crowd, I would be happy to clean up the
code and volunteer to push it into the codebase. I don't know the procedure
for that, though.
-Soumitra.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org