Hi!

  We have a common use case with Spark: we go out to some database, e.g.
Cassandra, crunch through all of its data, and somewhere along the RDD
pipeline we use the pipe operator to run an external script. All the data
before the pipe carries unique IDs, but inside the pipe those IDs are lost.

  The only solution we have right now is to format the IDs into the lines we
feed to the pipe, and then restore them in a map after the pipe.
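
  Roughly, what we do today looks like this (a sketch in Scala; the script
name and record shape are placeholders, and we assume the script emits
exactly one output line per input line):

    import org.apache.spark.rdd.RDD

    // assume withIds: RDD[(String, String)] of (id, payload) from Cassandra
    val piped: RDD[String] = withIds
      .map { case (id, payload) => s"$id\t$payload" }  // embed the ID in each line
      .pipe("./process.sh")                            // script must echo the ID column back
    val restored: RDD[(String, String)] = piped.map { line =>
      val Array(id, out) = line.split("\t", 2)         // strip the ID back off
      (id, out)
    }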

  It would be much nicer if we could simply join/zip the output of the pipe
back to the original RDD. We can't cache the RDDs, though, so it would be
nice to have a forkRDD of some sort that keeps only the last partition in
cache (since we're guaranteed that a zip follows and the dataflow will be
synchronized). Or maybe this is already possible in Spark?
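
  For illustration, this is what we would like to write (same placeholder
names as above; zip requires the same number of partitions and the same
number of elements per partition, which holds if the script stays
line-for-line):

    // assume withIds: RDD[(String, String)] as above
    val ids      = withIds.keys
    val piped    = withIds.values.pipe("./process.sh")
    val rejoined = ids.zip(piped)  // without caching, withIds is computed
                                   // twice, once for each side of the zip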

Thank you,
Pavel Velikhov
Chief Science Officer
TopRater
