How to check that a dataset is sorted after it has been written out?
Greetings!

I sorted a dataset in Spark and then wrote it out in avro/parquet. Then I wanted to check that it was sorted.

It looks like each partition has been sorted, but when reading in, the first "partition" (i.e., as seen in the partition index of mapPartitionsWithIndex) is not the same as implied by the names of the parquet files (even when the number of partitions in the RDD which was read matches the number on disk).

If I "take()" a few hundred values, they are sorted, but they are *not* the same as if I explicitly open "part-r-0.parquet" and take values from that. It seems that when opening the RDD, the "partitions" of the RDD are not in the same order as implied by the data on disk (i.e., "part-r-0.parquet", "part-r-1.parquet", etc.).

So, how might one read the data so that one maintains the sort order?

And while on the subject, after the "terasort", how did they check that the data was actually sorted correctly? (or did they :-) ?)

Is there any way to read the data back in so as to preserve the sort, or do I need to "zipWithIndex" before writing it out, and write the index at that time? (I haven't tried the latter yet.)

Thanks!
-Mike
Re: How to check that a dataset is sorted after it has been written out?
One approach would be to repartition the whole dataset into a single partition (a costly operation, but it will give you one file whose order matches the sort). You could also try using zipWithIndex before writing it out, so each record carries its global position.

Thanks
Best Regards
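The zipWithIndex suggestion can be sketched with plain Scala collections (on an RDD the shape is the same: sortBy before writing, zipWithIndex to attach positions, then sortBy the stored index after reading; the names below are illustrative, not from the thread):

```scala
// Sketch of the zipWithIndex idea using plain Scala collections.
// On an RDD: data.sortBy(identity).zipWithIndex before writing,
// then sortBy(_._2) after reading to restore the original order.
object ZipWithIndexSketch {
  def main(args: Array[String]): Unit = {
    val data = Seq(30, 10, 20, 50, 40)

    // Sort, then attach each record's global position before writing out.
    val indexed: Seq[(Int, Long)] =
      data.sorted.zipWithIndex.map { case (v, i) => (v, i.toLong) }

    // Simulate reading back in arbitrary partition/file order.
    val shuffled = scala.util.Random.shuffle(indexed)

    // The stored index recovers the original sort order.
    val restored = shuffled.sortBy(_._2).map(_._1)
    println(restored)  // List(10, 20, 30, 40, 50)
  }
}
```

The index costs an extra Long per record on disk, but it makes the order explicit rather than dependent on file naming or split order.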
Re: How to check that a dataset is sorted after it has been written out?
Data is not (necessarily) sorted when read from disk, no. A file might even have many blocks, and while a block generally yields a partition, the order in which those partitions appear in the RDD is not defined. This is why you'd sort if you need the data sorted.

I think you could conceivably make some custom RDD or InputFormat that reads blocks in a well-defined order and, assuming the data is sorted in some knowable way on disk, would then have them sorted. I think that's even been brought up.

Deciding whether the data is sorted is quite different. You'd have to decide what ordering you expect (is part 0 before part 1? should each part file itself be sorted?) and then verify that externally.
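One way to "verify that externally" without collecting all the data: summarize each partition as (is it internally sorted, min, max), then check that each partition's max does not exceed the next partition's min. Below is a sketch of that logic using plain Scala collections to model the partitions; in Spark the summaries would come from mapPartitionsWithIndex and be collected to the driver (names and structure are illustrative):

```scala
// Sketch of a global-sortedness check; each inner Seq models one partition.
object SortCheckSketch {
  // Summarize one non-empty partition: internally sorted? plus its bounds.
  def summarize(part: Seq[Int]): (Boolean, Int, Int) =
    (part.zip(part.tail).forall { case (a, b) => a <= b }, part.head, part.last)

  // Globally sorted iff every partition is internally sorted and each
  // partition's max is <= the next partition's min.
  def globallySorted(parts: Seq[Seq[Int]]): Boolean = {
    val sums = parts.filter(_.nonEmpty).map(summarize)
    sums.forall(_._1) &&
      sums.zip(sums.tail).forall { case ((_, _, max), (_, min, _)) => max <= min }
  }

  def main(args: Array[String]): Unit = {
    println(globallySorted(Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))))  // true
    println(globallySorted(Seq(Seq(1, 5), Seq(3, 4))))             // false
  }
}
```

The check only moves one small triple per partition to the driver, so it scales to terasort-sized data; it does assume you already know which partition order to expect.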
Re: How to check that a dataset is sorted after it has been written out?
Thanks for the information! (to all who responded)

The code below *seems* to work. Any hidden gotchas that anyone sees?

And still, in "terasort", how did they check that the data was actually sorted? :-)

-Mike

    class MyInputFormat[T] extends parquet.hadoop.ParquetInputFormat[T] {
      override def getSplits(jobContext: org.apache.hadoop.mapreduce.JobContext)
          : java.util.List[org.apache.hadoop.mapreduce.InputSplit] = {
        import scala.collection.JavaConversions._
        val splits = super.getSplits(jobContext)
        // Sort splits by file name, then by offset within the file, so the
        // split order matches the on-disk order of the part files.
        splits.sortBy {
          case fileSplit: org.apache.hadoop.mapreduce.lib.input.FileSplit =>
            (fileSplit.getPath.getName, fileSplit.getStart)
          case _ => ("", -1L)
        }
      }
    }
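A quick sanity check on the (file name, start offset) sort key used in the getSplits override above, modeled with plain tuples instead of FileSplit objects. Note the key relies on part numbers being zero-padded (part-r-00000, part-r-00001, ...), as Hadoop's output formats produce, so that lexicographic name order matches numeric order; the file names here are hypothetical:

```scala
// Sanity check of the (fileName, startOffset) sort key: a multi-block file's
// splits are ordered by offset, and part files are ordered by name.
object SplitOrderSketch {
  def main(args: Array[String]): Unit = {
    // (fileName, startOffset) pairs in an arbitrary "as returned" order;
    // part-r-00000 has two blocks.
    val splits = Seq(
      ("part-r-00001.parquet", 0L),
      ("part-r-00000.parquet", 134217728L),
      ("part-r-00000.parquet", 0L),
      ("part-r-00010.parquet", 0L))

    val ordered = splits.sortBy { case (name, start) => (name, start) }
    ordered.foreach(println)
    // part-r-00000 (offset 0), part-r-00000 (offset 134217728),
    // part-r-00001, part-r-00010
  }
}
```

Without zero-padding, a name like "part-r-2" would sort after "part-r-10" lexicographically, so the key would need to parse the part number instead.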