Data is not (necessarily) sorted when read from disk, no. A file might
even have many blocks, and while a block generally yields one partition,
the order in which those partitions appear in the RDD is not
defined. This is why you'd sort explicitly if you need the data sorted.
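A minimal illustration of the point (plain Scala collections standing in for partitions; with Spark it is the partition order of a real read that is undefined):

```scala
// Each "partition" below is internally sorted, but the order in which
// partitions are concatenated is arbitrary, so the flattened result
// need not be globally sorted; an explicit sort restores the order.
val partitions = Seq(Seq(5, 6, 7), Seq(1, 2, 3)) // each sorted internally
val readBack   = partitions.flatten              // not globally sorted
val fixed      = readBack.sorted                 // explicit sort fixes it
```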
One approach would be to repartition the whole dataset into a single
partition (a costly operation, but it will give you a single file). You
could also try using zipWithIndex before writing it out.
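The zipWithIndex idea can be sketched in miniature (plain Scala standing in for an RDD, which has a zipWithIndex of its own; the shuffle simulates the undefined read order):

```scala
import scala.util.Random

// Attach a global index before writing; after reading back in an
// undefined partition order, re-sort by the index to recover the
// original order.
val sortedData = Seq("a", "b", "c", "d")
val indexed    = sortedData.zipWithIndex         // (value, globalIndex) pairs
val readBack   = Random.shuffle(indexed)         // simulate undefined read order
val recovered  = readBack.sortBy(_._2).map(_._1) // index restores the order
```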
Thanks
Best Regards
On Sat, Mar 21, 2015 at 4:11 AM, Michael Albert
m_albert...@yahoo.com.invalid wrote:
Thanks for the information! (to all who responded)
The code below *seems* to work. Any hidden gotchas that anyone sees?
And still, in terasort, how did they check that the data was actually sorted?
:-)
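One way such a check can be done (a sketch in plain Scala; `isGloballySorted` is a hypothetical helper, and on an RDD the per-partition summaries would come from mapPartitionsWithIndex plus a collect): each partition reports its first key, last key, and whether it is sorted internally, and the driver then verifies that consecutive partition boundaries are non-decreasing.

```scala
// Hypothetical helper: verify global sortedness from per-partition
// summaries of the form (firstKey, lastKey, sortedWithinPartition).
def isGloballySorted(partitions: Seq[Seq[Int]]): Boolean = {
  val summaries = partitions.filter(_.nonEmpty).map { p =>
    (p.head, p.last, p.zip(p.tail).forall { case (a, b) => a <= b })
  }
  summaries.forall(_._3) &&                 // sorted within each partition
  summaries.zip(summaries.drop(1)).forall { // boundaries non-decreasing
    case ((_, last, _), (first, _, _)) => last <= first
  }
}
```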
-Mike
class MyInputFormat[T] extends parquet.hadoop.ParquetInputFormat[T]{
override def
Greetings! [My apologies for this repost; I'm not certain that the first
message made it to the list.]
I sorted a dataset in Spark and then wrote it out in avro/parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading in, the first
partition (i.e., as seen in the partition index of mapPartitionsWithIndex) is
not the same as