Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Spico Florin
Hello! I know that HadoopRDD partitions are built based on the number of splits in HDFS. I'm wondering whether these partitions preserve the initial order of the data in the file. As an example, if I have an HDFS file (myTextFile) that has these splits: split 0: line 1, ..., line k; split 1: line k+1, ...,

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
zipWithIndex will preserve whatever order is there in your val lines. I am not sure whether val lines = sc.textFile("hdfs://mytextFile") maintains the order; if that line does, the next one will maintain it for sure. On 24 April 2015 at 18:35, Spico Florin spicoflo...@gmail.com wrote: Hello! I know that

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
I did a quick test as I was curious about it too. I created a file with numbers from 0 to 999, in order, line by line. Then I did: scala> val numbers = sc.textFile("./numbers.txt") scala> val zipped = numbers.zipWithUniqueId scala> zipped.foreach(i => println(i)) Expected result if the order was
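
Note that the test above calls zipWithUniqueId rather than the zipWithIndex named in the subject. The two assign ids differently, so a minimal sketch of the difference may help. This is a plain-Scala model with no Spark dependency, based on how the Spark API docs describe the two methods; the object name ZipSemantics and the nested-Seq representation of partitions are assumptions made for illustration only.

```scala
// Plain-Scala model of the documented id assignment; this is not Spark code.
// A "partitioned dataset" is modelled as Seq[Seq[A]], one inner Seq per partition.
object ZipSemantics {

  // zipWithIndex: ids follow partition order first, then position inside each partition.
  def zipWithIndex[A](partitions: Seq[Seq[A]]): Seq[(A, Long)] =
    partitions.flatten.zipWithIndex.map { case (a, i) => (a, i.toLong) }

  // zipWithUniqueId: the item at position i of partition k (out of n partitions)
  // gets id i * n + k, so ids are unique but not consecutive line numbers.
  def zipWithUniqueId[A](partitions: Seq[Seq[A]]): Seq[(A, Long)] = {
    val n = partitions.length
    for {
      (part, k) <- partitions.zipWithIndex
      (a, i)    <- part.zipWithIndex
    } yield (a, i.toLong * n + k)
  }
}
```

With two partitions holding "0","1","2" and "3","4","5", this model gives ids 0 to 5 in order for zipWithIndex, but 0, 2, 4 for the first partition and 1, 3, 5 for the second with zipWithUniqueId. Ids that do not match line numbers are therefore expected from zipWithUniqueId and do not by themselves show that the data was reordered.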

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Sean Owen
The order of elements in an RDD is in general not guaranteed unless you sort. You shouldn't expect to encounter the partitions of an RDD in any particular order. In practice, you will probably find the partitions come up in the order Hadoop presents them in this case. And within a partition, in this
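
To illustrate the "unless you sort" point: if every record carries an explicit position (for example, a line number written into the data itself), the original order can be recovered by sorting on it, no matter in which order the partitions are encountered. Below is a minimal plain-Scala sketch; RestoreOrder and inFileOrder are hypothetical names, and it assumes the index attached to each record is trustworthy. On a real RDD of (value, index) pairs, the analogous call would be sortBy(_._2).

```scala
// Plain-Scala sketch, no Spark dependency: recover file order from an explicit index.
object RestoreOrder {

  // Hypothetical helper: sort records by the position they carry, then drop it.
  def inFileOrder[A](indexed: Seq[(A, Long)]): Seq[A] =
    indexed.sortBy(_._2).map(_._1)
}
```

For example, RestoreOrder.inFileOrder(Seq(("line 3", 2L), ("line 1", 0L), ("line 2", 1L))) yields the lines in the order "line 1", "line 2", "line 3".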

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
...@boxever.com] Sent: Friday, April 24, 2015 11:04 AM Eastern Standard Time To: Ganelin, Ilya Cc: Spico Florin; user Subject: Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? The problem I'm facing is that I need to process lines from input

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
Eastern Standard Time *To: *Ganelin, Ilya *Cc: *Spico Florin; user *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? I read it one by one as I need to maintain the order, but it doesn't mean that I process them one by one later. Input lines

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
) -Original Message- *From: *Michal Michalski [michal.michal...@boxever.com] *Sent: *Friday, April 24, 2015 10:41 AM Eastern Standard Time *To: *Spico Florin *Cc: *user *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? Of course after

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
Eastern Standard Time *To: *Ganelin, Ilya *Cc: *Spico Florin; user *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? I read it one by one as I need to maintain the order, but it doesn't mean that I process them one by one later. Input lines refer

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Imran Rashid
Another issue is that HadoopRDD (which sc.textFile uses) might split input files, and even if it doesn't split them, it doesn't guarantee that part-file numbers map to the corresponding partition numbers in the RDD. E.g., part-0 could go to partition 27. On Apr 24, 2015 7:41 AM, Michal Michalski

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
, Ilya *Cc: *Spico Florin; user *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? The problem I'm facing is that I need to process lines from input file in the order they're stored in the file, as they define the order of updates I need to apply

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
:41 AM Eastern Standard Time *To: *Spico Florin *Cc: *user *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? Of course after you do it, you probably want to call repartition(somevalue) on your RDD to get your parallelism back. Kind regards

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
: *Michal Michalski [michal.michal...@boxever.com] *Sent: *Friday, April 24, 2015 11:18 AM Eastern Standard Time *To: *Ganelin, Ilya *Cc: *Spico Florin; user *Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? I read it one by one as I need

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
Of course after you do it, you probably want to call repartition(somevalue) on your RDD to get your parallelism back. Kind regards, Michał Michalski, michal.michal...@boxever.com On 24 April 2015 at 15:28, Michal Michalski michal.michal...@boxever.com wrote: I did a quick test as I was curious

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
Eastern Standard Time To: Spico Florin Cc: user Subject: Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? Of course after you do it, you probably want to call repartition(somevalue) on your RDD to get your parallelism back. Kind regards, Michał Michalski

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
*Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? The problem I'm facing is that I need to process lines from input file in the order they're stored in the file, as they define the order of updates I need to apply on some data and these updates

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
- From: Michal Michalski [michal.michal...@boxever.com] Sent: Friday, April 24, 2015 11:18 AM Eastern Standard Time To: Ganelin, Ilya Cc: Spico Florin; user Subject: Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop? I