Hello!
  I know that HadoopRDD partitions are built based on the number of splits
in HDFS. I'm wondering whether these partitions preserve the original order of
the data in the file.
As an example, suppose I have an HDFS file (myTextFile) with these splits:

split 0 -> line 1, ..., line k
split 1 -> line k+1, ..., line k+n
split 2 -> line k+n+1, ..., line k+n+m

and the code
val lines=sc.textFile("hdfs://mytextFile")
lines.zipWithIndex()

will the order of the lines be preserved?
(line 1, zipIndex 1), ..., (line k, zipIndex k), and so on.
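
To make the question concrete, here is a minimal local sketch of the check I have in mind. It is only an illustration under assumptions: it runs against a local master, it uses sc.parallelize with 3 slices as a stand-in for the three HDFS splits rather than the real file, and the object name ZipOrderCheck and the method indexedLines are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, named only for this illustration.
object ZipOrderCheck {
  // Builds 9 "lines" spread over 3 partitions (a stand-in for 3 HDFS splits)
  // and returns each line paired with the index zipWithIndex assigned to it.
  def indexedLines(sc: SparkContext): Array[(String, Long)] = {
    val lines = sc.parallelize(1 to 9, 3).map(i => s"line $i")
    lines.zipWithIndex().collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("zip-order-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Print (index, line) pairs so the assigned order can be inspected.
      indexedLines(sc).foreach { case (line, idx) => println(s"$idx\t$line") }
    } finally {
      sc.stop()
    }
  }
}
```

With the real file, the parallelize call would be replaced by sc.textFile("hdfs://mytextFile"). As far as I can tell from the API docs, zipWithIndex numbers elements starting from 0, so the pairs to compare against would be (line 1, 0), (line 2, 1), and so on.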

I found this question on Stack Overflow (
http://stackoverflow.com/questions/26046410/how-can-i-obtain-an-element-position-in-sparks-rdd)
whose answer intrigued me:
"Essentially, RDD's zipWithIndex() method seems to do this, but it won't
preserve the original ordering of the data the RDD was created from"

Can you please confirm whether this is the correct answer?

Thanks.
 Florin
