zipWithIndex will preserve whatever order is already present in your val lines. I am not sure whether val lines = sc.textFile("hdfs://mytextFile") maintains the order of the file; if that line does, the next one will for sure.
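To make the first point concrete: as far as I know, the Spark API docs describe the zipWithIndex ordering as based first on the partition index and then on the position of each item within its partition, with indices starting at 0. Below is a minimal plain-Scala sketch of that numbering scheme. It does not use Spark at all; the object ZipWithIndexModel, the helper zipWithIndexModel, and the sample lines are hypothetical names made up for illustration, and each inner Seq simply stands in for one partition.

```scala
// Plain-Scala model (no Spark dependency) of how zipWithIndex numbers elements.
// Assumption: partitions are given in partition-index order, one inner Seq each.
object ZipWithIndexModel {
  // Hypothetical helper, not part of the Spark API.
  def zipWithIndexModel[A](partitions: Seq[Seq[A]]): Seq[(A, Long)] = {
    // Start offset of each partition = total size of all earlier partitions.
    val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).flatMap { case (part, start) =>
      // Global index = partition start offset + position inside the partition.
      part.zipWithIndex.map { case (elem, i) => (elem, start + i) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Three made-up "splits" standing in for the HDFS splits in the question.
    val splits = Seq(Seq("line 1", "line 2"), Seq("line 3"), Seq("line 4", "line 5"))
    zipWithIndexModel(splits).foreach(println)
  }
}
```

So if the partitions hold the lines in file order, the indices follow the file order as well, only numbered from 0 rather than 1. Whether textFile hands back the partitions in that order is the part I cannot confirm.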
On 24 April 2015 at 18:35, Spico Florin <spicoflo...@gmail.com> wrote:

> Hello!
> I know that HadoopRDD partitions are built based on the number of splits
> in HDFS. I'm wondering if these partitions preserve the initial order of
> the data in the file.
> As an example, if I have an HDFS file (myTextFile) that has these splits:
>
> split 0 -> line 1, ..., line k
> split 1 -> line k+1, ..., line k+n
> split 2 -> line k+n, line k+n+m
>
> and the code
> val lines = sc.textFile("hdfs://mytextFile")
> lines.zipWithIndex()
>
> will the order of the lines be preserved?
> (line 1, zipIndex 1), ..., (line k, zipIndex k), and so on.
>
> I found this question on Stack Overflow (
> http://stackoverflow.com/questions/26046410/how-can-i-obtain-an-element-position-in-sparks-rdd)
> whose answer intrigued me:
> "Essentially, RDD's zipWithIndex() method seems to do this, but it won't
> preserve the original ordering of the data the RDD was created from"
>
> Can you please confirm whether this is the correct answer?
>
> Thanks.
> Florin