Yes, I am running on a local file system.
Is there a bug open for this? Mingyu Kim reported the problem last April:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html
-Ewan
On 01/16/2015 07:41 PM, Reynold Xin wrote:
You are running on a local file system, right? HDFS orders the files
by name, but local file systems often don't. I think that's why you
see the difference.
We might be able to do a sort and order the partitions when we create
an RDD to make this universal, though.
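The sorting Reynold describes can be approximated on the user side by ordering the part-file names before reading them back. A minimal sketch (plain Python, not Spark internals; it assumes Hadoop's standard zero-padded `part-r-NNNNN` naming, which makes lexicographic order match partition index order):

```python
# A local file system may return directory entries in arbitrary order,
# so sort the part-file names lexicographically before reading them.
# Assumes the zero-padded part-r-NNNNN convention, under which
# lexicographic order equals partition-index order.
def ordered_parts(names):
    return sorted(n for n in names if n.startswith('part-'))

# A local FS might list entries like this:
listing = ['part-r-00002', '_SUCCESS', 'part-r-00000', 'part-r-00001']
print(ordered_parts(listing))  # ['part-r-00000', 'part-r-00001', 'part-r-00002']
```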
On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs <ewan.hi...@ugent.be
<mailto:ewan.hi...@ugent.be>> wrote:
Hi all,
Quick one: when reading files, is the order of partitions
guaranteed to be preserved? I am seeing some weird behaviour
where I run sortByKey() on an RDD (which has 16-byte keys) and
write it to disk. If I open a Python shell and run the following:
for part in range(29):
    print map(ord, open('/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part), 'r').read(16))
Then the partitions are in order, judging by the first value of each
partition.
I can also call TeraValidate.validate from TeraSort and it is
happy with the results. The reordering seems to happen when the
file is loaded. If this is expected, is there a way to ask
Spark nicely to give me the RDD in the order it was saved?
This is based on trying to fix my TeraValidate code on this branch:
https://github.com/ehiggs/spark/tree/terasort
Thanks,
Ewan
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
<mailto:dev-unsubscr...@spark.apache.org>
For additional commands, e-mail: dev-h...@spark.apache.org
<mailto:dev-h...@spark.apache.org>