From: Sean Owen
To: Michael Albert
Cc: User
Sent: Monday, March 23, 2015 7:31 AM
Subject: Re: How to check that a dataset is sorted after it has been written out?
Data is not (necessarily) sorted when read from disk, no. A file might
have many blocks even, and while a block yields a partition in
general, the order in which those partitions appear in the RDD is not
defined. This is why you'd sort if you need the data sorted.
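Since the partition order is not defined on read, one way to check the written data is to summarize each partition and compare the summaries without relying on the order in which they appear. What follows is only a minimal sketch in plain Python, with lists standing in for partitions. The names `summarize` and `is_globally_sortable` are hypothetical; in Spark the per-partition summary would presumably be computed inside something like mapPartitionsWithIndex.

```python
from typing import List, Optional, Tuple

# (first value, last value, locally sorted?) for one non-empty partition
Summary = Tuple[int, int, bool]


def summarize(partition: List[int]) -> Optional[Summary]:
    """Return (first, last, locally_sorted), or None for an empty partition."""
    if not partition:
        return None
    locally_sorted = all(a <= b for a, b in zip(partition, partition[1:]))
    return (partition[0], partition[-1], locally_sorted)


def is_globally_sortable(partitions: List[List[int]]) -> bool:
    """True if every partition is sorted and the partitions can be arranged,
    in some order, so that they form one sorted sequence."""
    summaries = [s for s in map(summarize, partitions) if s is not None]
    if not all(s[2] for s in summaries):
        return False
    # Ignore the order in which partitions were read: order them by key range.
    summaries.sort(key=lambda s: (s[0], s[1]))
    # Each partition must end no later than the next one begins.
    return all(prev[1] <= nxt[0] for prev, nxt in zip(summaries, summaries[1:]))
```

For example, `is_globally_sortable([[5, 6, 7], [1, 2, 3], [3, 4]])` is true even though the partitions arrive out of order, while `is_globally_sortable([[1, 5], [3, 4]])` is false because the key ranges overlap.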
I think you could conceivably mak
One approach would be to repartition the whole dataset into a single
partition (a costly operation, but it will give you a single file). You
could also try using zipWithIndex before writing it out.
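The zipWithIndex idea can be illustrated roughly as follows. This is a sketch in plain Python rather than Spark code: lists stand in for partitions, `zip_with_index` only mimics the (element, index) pairs that zipWithIndex produces, and `restore_order` is a hypothetical helper. It assumes the index is stored alongside each record when the data is written.

```python
from typing import List, Tuple

Record = Tuple[int, int]  # (value, global index)


def zip_with_index(partitions: List[List[int]]) -> List[List[Record]]:
    """Mimic zipWithIndex: number records consecutively across partitions,
    in the order the partitions have at write time."""
    indexed = []
    next_index = 0
    for part in partitions:
        indexed.append([(value, next_index + i) for i, value in enumerate(part)])
        next_index += len(part)
    return indexed


def restore_order(partitions_as_read: List[List[Record]]) -> List[int]:
    """Recover the written order from the stored index, whatever order
    the partitions come back in."""
    records = [rec for part in partitions_as_read for rec in part]
    records.sort(key=lambda rec: rec[1])
    return [value for value, _ in records]


# Usage: write-time order is [1, 2], [3, 4], [5]; read back out of order.
written = zip_with_index([[1, 2], [3, 4], [5]])
read_back = [written[2], written[0], written[1]]
restored = restore_order(read_back)  # [1, 2, 3, 4, 5]
```

The stored index makes the original order recoverable after reading, at the cost of an extra column in the output.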
Thanks
Best Regards
On Sat, Mar 21, 2015 at 4:11 AM, Michael Albert <
m_albert...@yahoo.com.invalid> wrote:
Greetings! [My apologies for this repost; I'm not certain that the first
message made it to the list.]
I sorted a dataset in Spark and then wrote it out in avro/parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading in, the first
"partition" (i.e., as seen in the partition index of mapPartitionsWithIndex) is
not the same as im