Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Michael Albert
;",-1L) } } }} From: Sean Owen To: Michael Albert Cc: User Sent: Monday, March 23, 2015 7:31 AM Subject: Re: How to check that a dataset is sorted after it has been written out? Data is not (necessarily) sorted when read from disk, no. A file might have many block

Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Sean Owen
Data is not (necessarily) sorted when read from disk, no. A file might have many blocks even, and while a block yields a partition in general, the order in which those partitions appear in the RDD is not defined. This is why you'd sort if you need the data sorted. I think you could conceivably mak
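
Since the order of partitions after reading is not defined, one way to check a written dataset is to test sortedness independently of partition order. The following is a minimal sketch in plain Python, not Spark code: partitions are modeled as lists of keys, and the names `summarize` and `sorted_up_to_partition_order` are hypothetical. In Spark, the per-partition summaries would presumably be produced inside something like mapPartitionsWithIndex and collected to the driver.

```python
def summarize(partition):
    """Return (first, last, is_sorted) for one partition, or None if it is empty."""
    if not partition:
        return None
    is_sorted = all(a <= b for a, b in zip(partition, partition[1:]))
    return (partition[0], partition[-1], is_sorted)


def sorted_up_to_partition_order(partitions):
    """True if every partition is sorted and the partitions can be
    arranged so that their key ranges do not overlap."""
    summaries = [s for s in (summarize(p) for p in partitions) if s is not None]
    # Every partition must be sorted internally.
    if not all(s[2] for s in summaries):
        return False
    # Ignore the order in which partitions were read: arrange by first key.
    summaries.sort(key=lambda s: (s[0], s[1]))
    # The last key of each partition must not exceed the first key of the next.
    return all(prev[1] <= nxt[0] for prev, nxt in zip(summaries, summaries[1:]))


# Hypothetical data: partitions come back in a different order than written.
read_back = [[5, 6, 7], [1, 2, 3], [3, 4]]
print(sorted_up_to_partition_order(read_back))
```

Only three values per partition travel to the driver, so the check stays cheap even when the dataset itself is large.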

Re: How to check that a dataset is sorted after it has been written out?

2015-03-22 Thread Akhil Das
One approach would be to repartition the whole dataset into a single partition (a costly operation, but it will give you a single file). Also, you could try using zipWithIndex before writing it out. Thanks, Best Regards. On Sat, Mar 21, 2015 at 4:11 AM, Michael Albert <m_albert...@yahoo.com.invalid> wrote: > Greet
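
The zipWithIndex suggestion can be illustrated with a small sketch in plain Python rather than Spark. It assumes the index is attached to each record while the data is still in sorted order and is written out along with it. The helper names `zip_with_index` and `sorted_by_saved_index` are hypothetical; the first one only mimics the (element, index) pairs that Spark's zipWithIndex produces.

```python
def zip_with_index(records):
    """Pair each record with its position, mimicking (element, index) pairs."""
    return [(record, index) for index, record in enumerate(records)]


def sorted_by_saved_index(pairs):
    """Restore the original order from the saved index, then check
    that the keys are non-decreasing in that order."""
    restored = sorted(pairs, key=lambda pair: pair[1])
    keys = [record for record, _ in restored]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Hypothetical data: the index is attached before "writing" ...
written = zip_with_index([10, 20, 30, 40])
# ... and the records come back in an arbitrary order when "read".
read_back = [written[2], written[3], written[0], written[1]]
print(sorted_by_saved_index(read_back))
```

Because the index travels with each record, the check does not rely on the order in which records or partitions are read back.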

How to check that a dataset is sorted after it has been written out? [repost]

2015-03-22 Thread Michael Albert
Greetings! [My apologies for this repost; I'm not certain that the first message made it to the list.] I sorted a dataset in Spark and then wrote it out in Avro/Parquet. Then I wanted to check that it was sorted. It looks like each partition has been sorted, but when reading in, the first "partit

How to check that a dataset is sorted after it has been written out?

2015-03-20 Thread Michael Albert
Greetings! I sorted a dataset in Spark and then wrote it out in Avro/Parquet. Then I wanted to check that it was sorted. It looks like each partition has been sorted, but when reading in, the first "partition" (i.e., as seen in the partition index of mapPartitionsWithIndex) is not the same as im