subject:"How to check that a dataset is sorted after it has been written out\?"

Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Sean Owen

Data is not (necessarily) sorted when read from disk, no. A file might have many blocks even, and while a block yields a partition in general, the order in which those partitions appear in the RDD is not defined. This is why you'd sort if you need the data sorted. I think you could conceivably

Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Akhil Das

One approach would be to repartition the whole dataset into a single partition (a costly operation, but it will give you a single file). Also, you could try using zipWithIndex before writing it out. Thanks, Best Regards. On Sat, Mar 21, 2015 at 4:11 AM, Michael Albert m_albert...@yahoo.com.invalid wrote:
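The zipWithIndex suggestion above can be sketched in plain Scala, without Spark, as a rough illustration only. The object `IndexCheck` and its methods are hypothetical names, and records are assumed to be `Long` keys. Note that Spark's `RDD.zipWithIndex` yields `Long` indices, whereas the collection method used here yields `Int`, hence the conversion.

```scala
object IndexCheck {
  // Tag each record with its position in the sorted output,
  // as zipWithIndex would do before the data is written out.
  def tag(sorted: Seq[Long]): Seq[(Long, Long)] =
    sorted.zipWithIndex.map { case (value, i) => (value, i.toLong) }

  // After reading the (value, index) pairs back in arbitrary order,
  // restore the original order by index and check that the values
  // are non-decreasing.
  def isSortedByIndex(readBack: Seq[(Long, Long)]): Boolean = {
    val restored = readBack.sortBy(_._2).map(_._1)
    restored.zip(restored.drop(1)).forall { case (a, b) => a <= b }
  }
}
```

This sketch re-sorts by index, which on a real cluster would itself be a full sort; it is meant only to show what the stored index buys you when the read order is undefined.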

Re: How to check that a dataset is sorted after it has been written out?

2015-03-23 Thread Michael Albert

Thanks for the information! (to all who responded) The code below *seems* to work. Any hidden gotchas that anyone sees? And still, in terasort, how did they check that the data was actually sorted? :-) -Mike class MyInputFormat[T] extends parquet.hadoop.ParquetInputFormat[T]{ override def

How to check that a dataset is sorted after it has been written out? [repost]

2015-03-22 Thread Michael Albert

Greetings! [My apologies for this repost; I'm not certain that the first message made it to the list.] I sorted a dataset in Spark and then wrote it out in avro/parquet. Then I wanted to check that it was sorted. It looks like each partition has been sorted, but when reading in, the first

How to check that a dataset is sorted after it has been written out?

2015-03-20 Thread Michael Albert

Greetings! I sorted a dataset in Spark and then wrote it out in avro/parquet. Then I wanted to check that it was sorted. It looks like each partition has been sorted, but when reading in, the first partition (i.e., as seen in the partition index of mapPartitionsWithIndex) is not the same as
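One way to test "sorted, up to the order in which partitions come back" is to summarize each partition by its first key, its last key, and whether it is internally sorted, then check that the summaries do not overlap once ordered by first key. The sketch below models partitions as in-memory sequences of `Long` keys instead of a Spark RDD. `SortCheck`, `Summary`, and both methods are hypothetical names. In Spark, the per-partition pass would presumably run inside `mapPartitionsWithIndex`, with only the small summaries collected to the driver.

```scala
object SortCheck {
  // Summary of one non-empty partition: first key seen, last key seen,
  // and whether the keys were non-decreasing within the partition.
  final case class Summary(first: Long, last: Long, sorted: Boolean)

  // Single pass over one partition; returns None for an empty partition.
  def summarize(part: Iterator[Long]): Option[Summary] =
    if (!part.hasNext) None
    else {
      val first = part.next()
      var prev = first
      var ok = true
      while (part.hasNext) {
        val x = part.next()
        if (x < prev) ok = false
        prev = x
      }
      Some(Summary(first, prev, ok))
    }

  // True if every partition is internally sorted and the partitions,
  // once ordered by (first, last) key, do not overlap.
  def sortedUpToPartitionOrder(parts: Seq[Seq[Long]]): Boolean = {
    val summaries = parts
      .flatMap(p => summarize(p.iterator))
      .sortBy(s => (s.first, s.last))
    summaries.forall(_.sorted) &&
      summaries.zip(summaries.drop(1)).forall { case (a, b) => a.last <= b.first }
  }
}
```

For example, `SortCheck.sortedUpToPartitionOrder(Seq(Seq(5L, 6L), Seq(1L, 2L), Seq(3L, 4L)))` accepts partitions that arrive out of order, while overlapping partitions such as `Seq(Seq(1L, 5L), Seq(3L, 4L))` are rejected.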

5 matches
