Hi Ewan,

Not sure if there is a JIRA ticket (there are so many that I lose track of
them).

I chatted briefly with Aaron about this. The way we can solve it is to create
a new FileSystem implementation that overrides the listStatus method to sort
its results, and then set fs.file.impl to that class in the Hadoop
Configuration.
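The fix above lives in Hadoop's Java API, but the idea can be sketched in
Python terms (the function name and the os.listdir stand-in are illustrative,
not Hadoop's actual API):

```python
import os

# Illustrative sketch of the proposed fix: the raw directory listing
# (os.listdir here, listStatus in a Hadoop FileSystem) makes no
# ordering promise, so the override sorts entries by name to match
# what HDFS returns.
def sorted_list_status(directory):
    return sorted(os.listdir(directory))
```

In the real fix, the subclass would sort the FileStatus array returned by
listStatus by path name, and fs.file.impl would point at that class.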

Shouldn't be too hard. Would you be interested in working on it?




On Fri, Jan 16, 2015 at 3:36 PM, Ewan Higgs <ewan.hi...@ugent.be> wrote:

>  Yes, I am running on a local file system.
>
> Is there a bug open for this? Mingyu Kim reported the problem last April:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html
>
> -Ewan
>
>
> On 01/16/2015 07:41 PM, Reynold Xin wrote:
>
> You are running on a local file system, right? HDFS orders the files by
> name, but local file systems often don't. I think that's the reason for
> the difference.
>
>  We might be able to do a sort and order the partitions when we create an
> RDD to make this universal, though.
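The point about listing order can be checked quickly outside Spark (a minimal
Python demonstration, not Spark code):

```python
import os
import tempfile

# Minimal check of the point above: os.listdir gives no ordering
# guarantee, so part files can come back in any order, but sorting
# the names recovers a deterministic, HDFS-like order.
d = tempfile.mkdtemp()
for i in (3, 0, 2, 1):
    open(os.path.join(d, 'part-r-%05d' % i), 'w').close()
listing = os.listdir(d)      # order is filesystem-dependent
print(sorted(listing))       # always ordered by name
```

Sorting the partition paths before building the RDD would make the load order
independent of the underlying file system.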
>
> On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:
>
>> Hi all,
>> Quick one: when reading files, is the order of partitions guaranteed to
>> be preserved? I am seeing some weird behaviour where I run sortByKey() on
>> an RDD (which has 16-byte keys) and write it to disk. If I open a Python
>> shell and run the following:
>>
>> for part in range(29):
>>     path = '/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part)
>>     print map(ord, open(path, 'r').read(16))
>>
>> Then the partitions are in order, judging by the first value of each
>> one.
>>
>> I can also call TeraValidate.validate from TeraSort and it is happy with
>> the results. The reordering seems to happen when the file is loaded. If
>> this is expected, is there a way to ask Spark nicely to give me the RDD
>> in the order it was saved?
>>
>> This is based on trying to fix my TeraValidate code on this branch:
>> https://github.com/ehiggs/spark/tree/terasort
>>
>> Thanks,
>> Ewan
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
>
