[ 
https://issues.apache.org/jira/browse/CRUNCH-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459414#comment-13459414
 ] 

Gabriel Reid commented on CRUNCH-51:
------------------------------------

I just tried running it again, and spotted the issue (it was only in the task 
tracker logs, and not shown in the log on stdout) -- the problem is a 
ClassCastException in the ReservoirRecordReader when attempting to use 
GenericRecords based on the StringWrapper schema. It looks like the 
AvroRecordReader being used in the ReservoirRecordReader isn't being properly 
set up with the appropriate Avro PType.

I looked into the patch in more detail, and I get the feeling that there's an 
easier way to do this, without needing the ReservoirRecordReader and related 
classes at all. The approach that I was thinking of was:
- Use a DoFn to handle the reservoir sampling (instead of a separate 
inputformat, source, and record reader)
- Use the same code as is used in MapsideJoin to put the reservoir-sampled data 
into the distributed cache
- Rework the CrunchTotalOrderPartitioner to read in the partition sample values 
in the same way as MapsideJoin does it

I think that the advantages of this approach are:
- Everything works with built-in Crunch functionality for handling writables 
and avro (and maybe other serialization schemes) transparently
- We can avoid adding an extra sample method to PCollection (as there is a move 
to reduce the number of methods in PCollection)

Additionally, using a DoFn for the reservoir sampling can be a lot more 
efficient, as in some (probably many) cases, the values from the PCollection 
will be being passed through the pipline regardless, so there is no additional 
IO overhead added to the pipeline, which is the case right now with the 
ReservoirFileInputFormat and friends.

One other point -- I had to remove the CrunchTotalOrderPartitionerTest class in 
order to be able to build with Maven -- I think it should be included in the 
integration tests (if it does need to be included at all).

What do you think? I know you've gotten a closer look at this stuff than I 
have, so I might be missing some important points in my reasoning.


                
> PCollection#sort relies on using a single reducer for total order sorting
> -------------------------------------------------------------------------
>
>                 Key: CRUNCH-51
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-51
>             Project: Crunch
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Gabriel Reid
>         Attachments: 0001-CRUNCH-51-Total-Order-Sort.patch, CRUNCH-51.patch, 
> CRUNCH-51.patch, SortTest.java
>
>
> The total-order sorting provided by the Sort class (and therefore 
> PCollection#sort) relies on using a single reducer in order to provide 
> total-order sorting. This is very inefficient for large datasets, and should 
> be replaced with a total order partitioner instead.
> For more information, see CRUNCH-23 (and possibly also MAPREDUCE-4574).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to