Gabriel Reid created CRUNCH-673:
-----------------------------------
Summary: Sort fails when using more reducers than records
Key: CRUNCH-673
URL: https://issues.apache.org/jira/browse/CRUNCH-673
Project: Crunch
Issue Type: Bug
Reporter: Gabriel Reid
We've run into an issue where Sort fails when it is run with more reducers than
there are records to sort.

This happens when a large PCollection is filtered down to almost nothing (say 10
records) and the filtered PCollection is then passed to Sort. Sort doesn't
realize that its input has been filtered so aggressively, so it still configures
a large number of reducers for it (for example, 20). Reservoir sampling is used
to build the partition definitions for the TotalOrderPartitioner, but because
the filtered PCollection contains only 10 records, only 10 partitions are
defined. A precondition in TotalOrderPartitioner then fails, because the number
of partitions in the partitions file doesn't match the number of configured
reducers.
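
The mismatch described above can be reproduced in isolation with a small
sketch. This is plain Java with no Crunch or Hadoop dependencies, and it is not
the actual Crunch or Hadoop code: the class SplitPointSketch and its methods are
hypothetical names used only for illustration. It assumes that the partitioner
expects exactly (number of reducers - 1) split points, and that at most one
split point can be derived per sampled record. Capping the reducer count, shown
at the end, is only one possible workaround and not necessarily the fix that
will be applied for this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class SplitPointSketch {

    // Hypothetical helper: picks evenly spaced split points from the sampled
    // keys. It can never return more points than there are distinct keys.
    public static List<Integer> splitPoints(List<Integer> sample, int numReducers) {
        List<Integer> sorted = new ArrayList<>(new TreeSet<>(sample));
        int wanted = Math.min(numReducers - 1, sorted.size());
        List<Integer> points = new ArrayList<>();
        for (int i = 0; i < wanted; i++) {
            points.add(sorted.get(i * sorted.size() / wanted));
        }
        return points;
    }

    // Assumed precondition, modeled on the failure described in this issue:
    // the split point count must be exactly numReducers - 1.
    public static void checkPartitions(int splitPointCount, int numReducers) {
        if (splitPointCount != numReducers - 1) {
            throw new IllegalStateException("Wrong number of partitions: "
                + splitPointCount + " split points for " + numReducers + " reducers");
        }
    }

    // Hypothetical workaround: never use more reducers than the available
    // split points can support, and always use at least one.
    public static int cappedReducers(int requested, int splitPointCount) {
        return Math.max(1, Math.min(requested, splitPointCount + 1));
    }

    public static void main(String[] args) {
        // 10 records left after filtering, 20 reducers requested.
        List<Integer> records = Arrays.asList(5, 3, 9, 1, 7, 2, 8, 4, 6, 0);
        int requested = 20;

        List<Integer> points = splitPoints(records, requested);
        try {
            checkPartitions(points.size(), requested);
        } catch (IllegalStateException e) {
            System.out.println("Sort would fail: " + e.getMessage());
        }

        // With the reducer count capped, the same check passes.
        int capped = cappedReducers(requested, points.size());
        checkPartitions(points.size(), capped);
        System.out.println("Capped reducers: " + capped);
    }
}
```

In this sketch the 10 filtered records yield 10 split points, which cannot
satisfy a job configured with 20 reducers; capping the reducer count at the
number of split points plus one makes the two sides agree again.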