How to check that a dataset is sorted after it has been written out?
Greetings!

I sorted a dataset in Spark and then wrote it out in avro/parquet. Then I wanted to check that it was sorted.

It looks like each partition has been sorted, but when reading in, the first "partition" (i.e., as seen in the partition index of mapPartitionsWithIndex) is not the same as implied by the names of the parquet files (even when the number of partitions in the RDD which was read matches the number on disk).

If I "take()" a few hundred values, they are sorted, but they are *not* the same as if I explicitly open "part-r-0.parquet" and take values from that. It seems that when opening the RDD, the "partitions" of the RDD are not in the same order as implied by the data on disk (i.e., "part-r-0.parquet", "part-r-1.parquet", etc.).

So, how might one read the data so that one maintains the sort order?

And while on the subject, after the "terasort", how did they check that the data was actually sorted correctly? (or did they :-) ?)

Is there any way to read the data back in so as to preserve the sort, or do I need to "zipWithIndex" before writing it out, and write the index at that time? (I haven't tried the latter yet.)

Thanks!
-Mike
Re: How to check that a dataset is sorted after it has been written out?
One approach would be to repartition the whole dataset into a single partition (a costly operation, but it will give you one file whose order matches the sort). You could also try using zipWithIndex before writing it out, so each record carries its global position.

Thanks
Best Regards
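The zipWithIndex suggestion can be sketched with plain Scala collections (on an RDD the shape is the same: sortBy before writing, zipWithIndex to attach positions, then sortBy the stored index after reading; the names below are illustrative, not from the thread):

```scala
// Sketch of the zipWithIndex idea using plain Scala collections.
// On an RDD: data.sortBy(identity).zipWithIndex before writing,
// then sortBy(_._2) after reading to restore the original order.
object ZipWithIndexSketch {
  def main(args: Array[String]): Unit = {
    val data = Seq(30, 10, 20, 50, 40)

    // Sort, then attach each record's global position before writing out.
    val indexed: Seq[(Int, Long)] =
      data.sorted.zipWithIndex.map { case (v, i) => (v, i.toLong) }

    // Simulate reading back in arbitrary partition/file order.
    val shuffled = scala.util.Random.shuffle(indexed)

    // The stored index recovers the original sort order.
    val restored = shuffled.sortBy(_._2).map(_._1)
    println(restored)  // List(10, 20, 30, 40, 50)
  }
}
```

The index costs an extra Long per record on disk, but it makes the order explicit rather than dependent on file naming or split order.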
Re: How to check that a dataset is sorted after it has been written out?
Data is not (necessarily) sorted when read from disk, no. A file might even have many blocks, and while a block generally yields a partition, the order in which those partitions appear in the RDD is not defined. This is why you'd sort if you need the data sorted.

I think you could conceivably make some custom RDD or InputFormat that reads blocks in a well-defined order and, assuming the data is sorted in some knowable way on disk, would then have them sorted. I think that's even been brought up.

Deciding whether the data is sorted is quite different. You'd have to decide what ordering you expect (is part 0 before part 1? should each part file itself be sorted?) and then verify that externally.
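One way to "verify that externally" without collecting all the data: summarize each partition as (is it internally sorted, min, max), then check that each partition's max does not exceed the next partition's min. Below is a sketch of that logic using plain Scala collections to model the partitions; in Spark the summaries would come from mapPartitionsWithIndex and be collected to the driver (names and structure are illustrative):

```scala
// Sketch of a global-sortedness check; each inner Seq models one partition.
object SortCheckSketch {
  // Summarize one non-empty partition: internally sorted? plus its bounds.
  def summarize(part: Seq[Int]): (Boolean, Int, Int) =
    (part.zip(part.tail).forall { case (a, b) => a <= b }, part.head, part.last)

  // Globally sorted iff every partition is internally sorted and each
  // partition's max is <= the next partition's min.
  def globallySorted(parts: Seq[Seq[Int]]): Boolean = {
    val sums = parts.filter(_.nonEmpty).map(summarize)
    sums.forall(_._1) &&
      sums.zip(sums.tail).forall { case ((_, _, max), (_, min, _)) => max <= min }
  }

  def main(args: Array[String]): Unit = {
    println(globallySorted(Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))))  // true
    println(globallySorted(Seq(Seq(1, 5), Seq(3, 4))))             // false
  }
}
```

The check only moves one small triple per partition to the driver, so it scales to terasort-sized data; it does assume you already know which partition order to expect.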
Re: How to check that a dataset is sorted after it has been written out?
Thanks for the information! (to all who responded)

The code below *seems* to work. Any hidden gotchas that anyone sees?

And still, in "terasort", how did they check that the data was actually sorted? :-)

-Mike

    class MyInputFormat[T] extends parquet.hadoop.ParquetInputFormat[T] {
      override def getSplits(jobContext: org.apache.hadoop.mapreduce.JobContext)
          : java.util.List[org.apache.hadoop.mapreduce.InputSplit] = {
        import scala.collection.JavaConversions._
        val splits = super.getSplits(jobContext)
        // Sort splits by file name, then by offset within the file, so the
        // split order matches the on-disk order of the part files.
        splits.sortBy {
          case fileSplit: org.apache.hadoop.mapreduce.lib.input.FileSplit =>
            (fileSplit.getPath.getName, fileSplit.getStart)
          case _ => ("", -1L)
        }
      }
    }
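A quick sanity check on the (file name, start offset) sort key used in the getSplits override above, modeled with plain tuples instead of FileSplit objects. Note the key relies on part numbers being zero-padded (part-r-00000, part-r-00001, ...), as Hadoop's output formats produce, so that lexicographic name order matches numeric order; the file names here are hypothetical:

```scala
// Sanity check of the (fileName, startOffset) sort key: a multi-block file's
// splits are ordered by offset, and part files are ordered by name.
object SplitOrderSketch {
  def main(args: Array[String]): Unit = {
    // (fileName, startOffset) pairs in an arbitrary "as returned" order;
    // part-r-00000 has two blocks.
    val splits = Seq(
      ("part-r-00001.parquet", 0L),
      ("part-r-00000.parquet", 134217728L),
      ("part-r-00000.parquet", 0L),
      ("part-r-00010.parquet", 0L))

    val ordered = splits.sortBy { case (name, start) => (name, start) }
    ordered.foreach(println)
    // part-r-00000 (offset 0), part-r-00000 (offset 134217728),
    // part-r-00001, part-r-00010
  }
}
```

Without zero-padding, a name like "part-r-2" would sort after "part-r-10" lexicographically, so the key would need to parse the part number instead.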