Greetings! [My apologies for the repost; I'm not certain the first
message made it to the list.]
I sorted a dataset in Spark and then wrote it out in avro/parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading the data back
in, the first "partition" (i.e., as seen in the partition index of
mapPartitionsWithIndex) is not the same as implied by the names of the parquet
files (even when the number of partitions in the RDD that was read is the same
as on disk).
If I "take()" a few hundred values, they are sorted, but they are *not* the 
same as if I explicitly open "part-r-00000.parquet" and take values from that.
It seems that when opening the RDD, the "partitions" of the RDD are not in the
same order as implied by the data on disk (i.e., "part-r-00000.parquet",
"part-r-00001.parquet", etc.).
So, how might one read the data so that one maintains the sort order?
And while on the subject: after the "terasort", how did they check that the
data was actually sorted correctly? (Or did they? :-) )
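For what it's worth, one way to validate a global sort without collecting the
whole dataset is the terasort-style boundary check: verify that each partition
is internally sorted, and that each partition's last key is <= the next
partition's first key. A minimal sketch in plain Python (the per-partition
lists and values here are made up for illustration; in Spark one would gather
the (first, last) pair per partition with mapPartitionsWithIndex and compare
boundaries on the driver):

```python
def is_globally_sorted(partitions):
    """Check that every partition is sorted internally and that
    partition boundaries are non-decreasing (terasort-style check)."""
    last = None
    for part in partitions:
        for x in part:
            if last is not None and x < last:
                return False
            last = x
    return True

# Hypothetical per-partition data, as if read back file by file.
parts = [[1, 3, 5], [5, 7, 9], [10, 12]]
print(is_globally_sorted(parts))             # True
print(is_globally_sorted([[1, 3], [2, 4]]))  # False: 2 < 3 across the boundary
```

This only moves two values per partition to the driver, so it scales to large
datasets.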
Is there any way to read the data back in so as to preserve the sort, or do I 
need to "zipWithIndex" before writing it out, and write the index at that time? 
(I haven't tried the latter yet).
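In case it helps anyone following along, here is roughly what I had in mind
for the zipWithIndex approach, as a PySpark sketch (untested; `sorted_rdd` is
a placeholder for the already-sorted RDD, and the write/read steps are
elided):

```python
# Hedged sketch, assuming an existing SparkContext and a sorted RDD.
# Before writing: attach each record's global position.
indexed = sorted_rdd.zipWithIndex()  # yields (record, index) pairs
# ... write `indexed` out, with the index stored alongside the record ...

# After reading back (partition order is not guaranteed), restore the
# original order by sorting on the stored index, then drop it.
restored = indexed.sortBy(lambda pair: pair[1]).map(lambda pair: pair[0])
```

The obvious downside is that restoring the order costs another shuffle, which
is part of why I'd rather find a way to read the files back in their on-disk
order.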
Thanks!
-Mike
