[
https://issues.apache.org/jira/browse/CRUNCH-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143951#comment-15143951
]
Micah Whitacre commented on CRUNCH-588:
---------------------------------------
+1 to the patch.
Getting some failures trying to apply the patch to master (issues with it
applying to TotalOrderPartitioner). I'll keep on working to get it applied and
merged.
> Modify HFileUtils to flex on affected regions for hfiles, rather than all
> regions
> ---------------------------------------------------------------------------------
>
> Key: CRUNCH-588
> URL: https://issues.apache.org/jira/browse/CRUNCH-588
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Stephen Durfey
> Assignee: Stephen Durfey
> Attachments: crunch-588-0.14-SNAPSHOT.patch,
> crunch-588-0.14-SNAPSHOT_v2.patch, hfileutils_0.8.5.patch
>
>
> HFileUtils when preparing for writing HFiles sets the [number of reducers |
> https://github.com/apache/crunch/blob/master/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HFileUtils.java#L422]
> equal to the number of regions in the table, and then writes out the start
> keys for each region to a sequence file for the TotalOrderPartitioner to
> consume when partitioning data. This can result in a very large quantity of
> reducers that don't do anything due to not having any data to write to hfiles
> for the region its partition belonged to.
> My proposal is to modify HFileUtils, with an optional parameter (or a config,
> that's up for debate) to determine which regions data will be loaded into
> ahead of time, and set the number of reducers to equal the number of regions,
> and only write out the start keys for those affected regions.
> I have working code to do this on the 0.8.x branch of crunch, as that is what
> I am currently on. I can modify it to work on more recent versions, but I
> wanted to start a discussion around the viability of this code being
> contributed back to the community. I am still in process of capturing metrics
> around the impact of the change (and trying to get data large enough to test
> this out), but at least from a reducer count I have seen substantial drops in
> my limited testing so far. For example, I had a job go from 705 reduce tasks
> during the write down to 36 reduce tasks.
> I've attached what I have so far as of 0.8.4. I'm going to start working on a
> version modified for the latest version of crunch.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)