Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

Spico Florin Fri, 24 Apr 2015 06:07:16 -0700

Hello!
  I know that HadoopRDD partitions are built based on the number of splits
in HDFS. I'm wondering whether these partitions preserve the original order
of the data in the file.
As an example, suppose I have an HDFS file (myTextFile) that has these splits:


split 0 -> line 1, ..., line k
split 1 -> line k+1, ..., line k+n
split 2 -> line k+n+1, ..., line k+n+m

and the code
val lines = sc.textFile("hdfs://myTextFile")
lines.zipWithIndex()

will the order of the lines be preserved? That is, will I get
(line 1, index 0), ..., (line k, index k-1), and so on (zipWithIndex
indices start at 0)?
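
To make the expectation concrete, here is a small plain-Scala sketch (no Spark
involved) of the index assignment as the zipWithIndex scaladoc describes it:
ordering by partition index first, then by position within each partition.
The names ZipWithIndexModel, zipWithIndexModel and splits are made up for
illustration. This is only a model of the documented behaviour, not Spark's
actual code, and it assumes that partition i holds split i with its lines in
file order, which is exactly the point I am unsure about.

```scala
// Plain-Scala model of the documented zipWithIndex ordering (an assumption-laden
// sketch, not Spark's implementation). `splits` is a hypothetical stand-in for
// the partitions of a HadoopRDD, listed in partition-index order.
object ZipWithIndexModel {
  def zipWithIndexModel[T](splits: Seq[Seq[T]]): Seq[(T, Long)] = {
    // Start offset of each partition = number of elements in all earlier partitions.
    val startOffsets = splits.map(_.size.toLong).scanLeft(0L)(_ + _)
    // zip drops the final offset (the total count), leaving one offset per partition.
    splits.zip(startOffsets).flatMap { case (part, start) =>
      part.zipWithIndex.map { case (elem, i) => (elem, start + i) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Three made-up splits of a five-line file.
    val splits = Seq(Seq("line 1", "line 2"), Seq("line 3"), Seq("line 4", "line 5"))
    zipWithIndexModel(splits).foreach(println)
  }
}
```

If the partitions really do follow the split order, the pairs come out in file
order with consecutive indices starting at 0.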

I found this question on Stack Overflow (
http://stackoverflow.com/questions/26046410/how-can-i-obtain-an-element-position-in-sparks-rdd)
whose answer intrigued me:
"Essentially, RDD's zipWithIndex() method seems to do this, but it won't
preserve the original ordering of the data the RDD was created from"

Can you please confirm whether this is the correct answer?

Thanks.
 Florin
