Hello,

I have a process where I need to generate a random number for each row in an
RDD.
The resulting RDD will be used across a few iterations, and it is essential
that the numbers do not change between iterations
(i.e., if a partition gets evicted from the cache, the numbers for that
partition must be regenerated identically).
One way to solve this is to persist the RDD (after the random numbers are
generated) to disk, but it might still be evicted if we run out of disk
space, no?

My idea is to call zipWithIndex on my original RDD and, for each row, create
a new random generator seeded with the row's index.
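To illustrate the seeding idea, here is a minimal sketch in plain Scala (no Spark required): as long as the index is stable, re-deriving the random value from it is fully reproducible. The helper name `randomForIndex` is hypothetical, and the Spark usage is shown only as a comment.

```scala
import scala.util.Random

// Deterministically derive a random value from a row index:
// the same index always yields the same number, so a recomputed
// partition would reproduce identical values.
def randomForIndex(index: Long): Double =
  new Random(index).nextDouble()

// Hypothetical Spark usage (assumes an existing RDD `rdd`):
// val withRandoms = rdd.zipWithIndex().map { case (row, idx) =>
//   (row, randomForIndex(idx))
// }
```

Of course, this only works if zipWithIndex itself assigns the same indices on recomputation, which is exactly the question below.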

I would like to know whether zipWithIndex will assign the same indices if
the RDD gets evicted from the cache and recomputed.
For example:

rdd1.join(rdd2).zipWithIndex()

If the join gets recalculated, will the rows get the same indices?

Or in:

val rdd = hiveContext.sql("...").zipWithIndex()

If the partitions of the query get evicted and recomputed, will the indices
stay the same?

I'd love to hear your thoughts on the matter.

Thanks,
Lev.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Consistent-hashing-of-RDD-row-tp20820.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
