I think the robust approach is to sort the RDD and then call zipWithIndex.
Even if the RDD is recomputed, the ordering, and thus the assignment of
IDs, should be the same.
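
To illustrate why sorting first helps, here is a minimal sketch in plain
Python that models the idea without Spark. The helper name build_index and
the sample feature names are made up for illustration. It assumes the
features are distinct and have a total order; if two elements compare equal,
their relative order (and so their IDs) could still vary. In PySpark the
corresponding chain would presumably be something like
rdd.distinct().sortBy(lambda x: x).zipWithIndex(), though that wiring is an
assumption on my part and is not exercised here.

```python
# Plain-Python model of "sort, then zipWithIndex" (no Spark required).
# `build_index` is a hypothetical helper name used only for illustration.
import random


def build_index(features):
    """Map each distinct feature to a stable 0-based index.

    Sorting first makes the result independent of the order in which the
    elements happen to arrive, which is the property the cached RDD was
    meant to guarantee.
    """
    ordered = sorted(set(features))  # deduplicate, then impose a total order
    return {feature: idx for idx, feature in enumerate(ordered)}


features = ["height", "age", "weight", "income"]

# Simulate a recomputation that yields the same elements in another order.
shuffled = features[:]
random.Random(0).shuffle(shuffled)

first = build_index(features)
second = build_index(shuffled)

print(first)
print(first == second)
```

Because the indices come from the sorted order rather than the arrival
order, both runs give the same mapping, and the largest ID is exactly the
number of features minus one.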

On Mon, Dec 1, 2014 at 2:36 PM, rok <rokros...@gmail.com> wrote:
> I have an RDD that serves as a feature look-up table downstream in my
> analysis. I create it using zipWithIndex(), and because I suppose that
> the elements of the RDD could end up in a different order if it is
> regenerated at any point, I cache it to try to ensure that the
> (feature --> index) mapping remains fixed.
>
> However, I'm having trouble verifying that this is actually robust -- can
> someone comment on whether such a mapping should be stable, or is there
> another preferred method? zipWithUniqueId() isn't optimal, since the
> maximum ID generated this way can be larger than the number of features,
> so I'm trying to avoid it.
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/ensuring-RDD-indices-remain-immutable-tp20094.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>