Hi all,
Quick one: when reading files, is the order of partitions guaranteed
to be preserved? I am seeing some weird behaviour where I run
sortByKey() on an RDD (which has 16-byte keys) and write it to disk. If
I open a Python shell and run the following:
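    for part in range(29):
        path = '/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part)
        # Read in binary mode and print the first 16-byte key as integers.
        with open(path, 'rb') as f:
            print(list(bytearray(f.read(16))))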
Then the partition files are in order, judging by the first key of each one.
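The same check can be automated rather than eyeballed. A minimal sketch, assuming each part file starts with a 16-byte key as in the loop above; first_key, is_sorted and partitions_in_order are hypothetical helper names, not part of Spark or TeraSort:

```python
def first_key(path, key_len=16):
    """Return the first key_len bytes of a file as a list of ints."""
    with open(path, 'rb') as f:
        return list(bytearray(f.read(key_len)))


def is_sorted(keys):
    """True if every key is <= the key that follows it."""
    return all(a <= b for a, b in zip(keys, keys[1:]))


def partitions_in_order(paths):
    """Compare the first key of each partition file, in the order given."""
    return is_sorted([first_key(p) for p in paths])
```

Note that this only compares the first key of each file, so it says nothing about the ordering of records inside a partition.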
I can also call TeraValidate.validate from TeraSort and it is happy with
the results. It seems to be on loading the file that the reordering
happens. If this is expected, is there a way to ask Spark nicely to give
me the RDD in the order it was saved?
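One thing I could try, offered only as a sketch: if the reordering comes from the order in which the part files get listed (an assumption on my part, not something I have confirmed), the paths could be sorted by partition number before they are loaded. The helpers partition_index and sorted_part_files below are hypothetical names, and the pattern assumes file names ending in part-r-NNNNN:

```python
import os
import re

# Assumed naming scheme: "part-r-" followed by the partition number.
PART_RE = re.compile(r'part-r-(\d+)$')


def partition_index(path):
    """Extract the numeric partition index from a part file path."""
    match = PART_RE.search(os.path.basename(path))
    if match is None:
        raise ValueError('not a part file: {0}'.format(path))
    return int(match.group(1))


def sorted_part_files(paths):
    """Return the paths ordered by partition index, lowest first."""
    return sorted(paths, key=partition_index)
```

This does not change how Spark itself lists or loads the files; it only fixes the order of a list of paths that I would then have to pass along myself.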
This is based on trying to fix my TeraValidate code on this branch:
https://github.com/ehiggs/spark/tree/terasort
Thanks,
Ewan