RDD order guarantees

Ewan Higgs Fri, 16 Jan 2015 08:27:14 -0800

Hi all,

Quick one: when reading files, is the order of partitions guaranteed to be preserved? I am seeing some weird behaviour where I run sortByKey() on an RDD (which has 16-byte keys) and write it to disk. If I open a Python shell and run the following:


for part in range(29):
    path = '/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part)
    # Print the first 16-byte key of this partition as a list of byte values.
    print(list(bytearray(open(path, 'rb').read(16))))


Then the partitions appear in sorted order, judging by the first key of each partition file.
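
The same check can be done programmatically rather than by eye. This is only a minimal sketch: the helper names first_keys and partitions_in_order are made up for illustration, and it assumes the part-r-000NN file naming and 16-byte keys described above. It compares only the first key of each file, so it says nothing about ordering within a partition.

```python
import os


def first_keys(out_dir, num_parts, key_len=16):
    """Return the first key_len bytes of each part-r-000NN file, in file order.

    Hypothetical helper; assumes the part-r-000NN naming used above.
    """
    keys = []
    for part in range(num_parts):
        path = os.path.join(out_dir, 'part-r-000{0:02}'.format(part))
        with open(path, 'rb') as f:
            keys.append(f.read(key_len))
    return keys


def partitions_in_order(keys):
    """True if each partition's first key is >= the previous partition's first key."""
    return all(a <= b for a, b in zip(keys, keys[1:]))
```

For the output directory above, partitions_in_order(first_keys('/home/ehiggs/data/terasort_out', 29)) should be true if the files on disk are ordered as I expect.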

I can also call TeraValidate.validate from TeraSort and it is happy with the results. The reordering seems to happen when the files are loaded. If this is expected, is there a way to ask Spark nicely to give me the RDD in the order it was saved?


This is based on trying to fix my TeraValidate code on this branch:
https://github.com/ehiggs/spark/tree/terasort

Thanks,
Ewan

