Hello!
I know that HadoopRDD partitions are built based on the number of splits
in HDFS. I'm wondering whether these partitions preserve the initial order of
the data in the file.
As an example, if I have an HDFS file (myTextFile) that has these splits:
split 0 - line 1, ..., line k
split 1 - line k+1, ...,
zipWithIndex will preserve the order of whatever is in your val lines.
I am not sure whether val lines = sc.textFile("hdfs://mytextFile")
maintains the order; if that line does, the next one will maintain it for sure.
On 24 April 2015 at 18:35, Spico Florin spicoflo...@gmail.com wrote:
Hello!
I know that
I did a quick test as I was curious about it too. I created a file with
numbers from 0 to 999, in order, line by line. Then I did:
scala> val numbers = sc.textFile("./numbers.txt")
scala> val zipped = numbers.zipWithUniqueId
scala> zipped.foreach(i => println(i))
Expected result if the order was
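For anyone comparing the two methods used in this thread: the Spark API docs describe zipWithIndex as ordered first by partition index and then by position within each partition, while zipWithUniqueId gives the items of the k-th partition the ids k, n+k, 2*n+k, ... for n partitions. The following is only a plain-Scala model of those two documented schemes, not Spark's implementation; `ZipModel` and the nested `Seq` standing in for an RDD's partitions are hypothetical names made up for illustration.

```scala
// Pure-Scala model of the two documented id schemes; no Spark required.
// `partitions` is a hypothetical stand-in for an RDD's partitions, in partition order.
object ZipModel {
  // zipWithIndex: ordered by partition index, then by position inside the partition.
  def zipWithIndex[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
    // Start offset of a partition = number of elements in all earlier partitions.
    val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).map { case (part, start) =>
      part.zipWithIndex.map { case (x, i) => (x, start + i) }
    }
  }

  // zipWithUniqueId: element i of partition k gets id i * n + k (n = partition count).
  def zipWithUniqueId[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
    val n = partitions.size.toLong
    partitions.zipWithIndex.map { case (part, k) =>
      part.zipWithIndex.map { case (x, i) => (x, i * n + k) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Two partitions of unequal size, as two HDFS splits might produce.
    val parts = Seq(Seq("a", "b", "c"), Seq("d", "e"))
    println(zipWithIndex(parts).flatten)
    println(zipWithUniqueId(parts).flatten)
  }
}
```

In this model the first scheme numbers the elements 0 to 4 consecutively, while the second gives the first partition the ids 0, 2, 4 and the second partition 1, 3. So ids from zipWithUniqueId are unique but are not line numbers, and either scheme reflects file order only if the partitions themselves come up in file order.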
The order of elements in an RDD is in general not guaranteed unless
you sort. You shouldn't expect to encounter the partitions of an RDD
in any particular order.
In practice, you will probably find the partitions come up in the order
Hadoop presents them in this case. And within a partition, in this
...@boxever.com]
Sent: Friday, April 24, 2015 11:04 AM Eastern Standard Time
To: Ganelin, Ilya
Cc: Spico Florin; user
Subject: Re: Does HadoopRDD.zipWithIndex method preserve the order of the input
data from Hadoop?
The problem I'm facing is that I need to process lines from input
Eastern Standard Time
*To: *Ganelin, Ilya
*Cc: *Spico Florin; user
*Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order
of the input data from Hadoop?
I read it one by one as I need to maintain the order, but it doesn't
mean that I process them one by one later. Input lines
)
-Original Message-
*From: *Michal Michalski [michal.michal...@boxever.com]
*Sent: *Friday, April 24, 2015 10:41 AM Eastern Standard Time
*To: *Spico Florin
*Cc: *user
*Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of
the input data from Hadoop?
Of course after
Another issue is that hadooprdd (which sc.textfile uses) might split input
files and even if it doesn't split, it doesn't guarantee that part files
numbers go to the corresponding partition number in the rdd. Eg part-0
could go to partition 27
On Apr 24, 2015 7:41 AM, Michal Michalski
, Ilya
*Cc: *Spico Florin; user
*Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of
the input data from Hadoop?
The problem I'm facing is that I need to process lines from input file in
the order they're stored in the file, as they define the order of updates I
need to apply
:41 AM Eastern Standard Time
*To: *Spico Florin
*Cc: *user
*Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of
the input data from Hadoop?
Of course, after you do it, you probably want to call
repartition(somevalue) on your RDD to get your parallelism back.
Kind regards
: *Michal Michalski [michal.michal...@boxever.com]
*Sent: *Friday, April 24, 2015 11:18 AM Eastern Standard Time
*To: *Ganelin, Ilya
*Cc: *Spico Florin; user
*Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order of
the input data from Hadoop?
I read it one by one as I need
Of course, after you do it, you probably want to call repartition(somevalue)
on your RDD to get your parallelism back.
Kind regards,
Michał Michalski,
michal.michal...@boxever.com
On 24 April 2015 at 15:28, Michal Michalski michal.michal...@boxever.com
wrote:
I did a quick test as I was curious
Eastern Standard Time
To: Spico Florin
Cc: user
Subject: Re: Does HadoopRDD.zipWithIndex method preserve the order of the input
data from Hadoop?
Of course, after you do it, you probably want to call repartition(somevalue) on
your RDD to get your parallelism back.
Kind regards,
Michał Michalski
*Subject: *Re: Does HadoopRDD.zipWithIndex method preserve the order
of the input data from Hadoop?
The problem I'm facing is that I need to process lines from input file
in the order they're stored in the file, as they define the order of
updates I need to apply on some data and these updates
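One way to read the requirement above: once every line carries its position in the file, the updates can be regrouped and processed in any physical order, as long as each piece of data replays its own updates sorted by that position. Here is a minimal plain-Scala sketch of that idea. It assumes updates are keyed and are either "set" or "add" operations; `Update`, `SetTo`, `Add`, and `applyInOrder` are hypothetical names, not anything from the original job.

```scala
// Plain-Scala sketch: replay keyed updates in file order, however they arrive.
object OrderedUpdates {
  sealed trait Op
  final case class SetTo(value: Int) extends Op // overwrite the current value
  final case class Add(delta: Int) extends Op   // increment the current value

  // `index` is the line's position in the input file (hypothetical field,
  // e.g. the number a zipWithIndex-style step would attach to the line).
  final case class Update(index: Long, key: String, op: Op)

  // Group by key, sort each group by file position, then fold the ops in that order.
  def applyInOrder(updates: Seq[Update]): Map[String, Int] =
    updates.groupBy(_.key).map { case (key, us) =>
      val result = us.sortBy(_.index).foldLeft(0) {
        case (_, Update(_, _, SetTo(v))) => v
        case (acc, Update(_, _, Add(d))) => acc + d
      }
      key -> result
    }

  def main(args: Array[String]): Unit = {
    // Updates deliberately listed out of file order.
    val shuffled = Seq(
      Update(2L, "x", Add(5)),
      Update(0L, "x", SetTo(10)),
      Update(3L, "y", Add(-1)),
      Update(1L, "y", SetTo(1))
    )
    println(applyInOrder(shuffled))
  }
}
```

With the sample data, key "x" ends at 15 (set to 10, then add 5); applying the same two operations in arrival order would give 10 instead, which is why the index has to be attached before anything reshuffles the lines.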
-
From: Michal Michalski
[michal.michal...@boxever.com]
Sent: Friday, April 24, 2015 11:18 AM Eastern Standard Time
To: Ganelin, Ilya
Cc: Spico Florin; user
Subject: Re: Does HadoopRDD.zipWithIndex method preserve the order of the input
data from Hadoop?
I