[
https://issues.apache.org/jira/browse/CRUNCH-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107880#comment-15107880
]
Micah Whitacre commented on CRUNCH-588:
---------------------------------------
Few notes:
* Should maintain API passivity in the sortAndPartition method as well.
* You have a System.out in the middle of sortAndPartition.
* Javadoc for sure on the new methods to better describe what
"limitToAffectedRegions" means.
* For getting region start key probably want to code against HTableInterface or
HBaseAdmin[1] vs HTable as it'll make changes for the "master" branch easier.
Also that should return a sorted collection so you don't need to sort in each
instance of the DetermineAffectedRegionsFn. Or at least could sort once before
passing to the fn.
* For each family (lines 362-366) we re-calculate the region keys. Is that
expensive, or do you think we should just spin through the family-filtered KVs
once and use those splits for all families?
* On lines 504-505: I'm not seeing how we guarantee the start keys do not get
repeated. Would using something like distinct to eliminate duplicates make
sense?
Offline we talked about tests being good as well.
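
To make the sorting and duplicate-key points above concrete, here is a minimal
sketch of sorting and de-duplicating start keys once, before they are handed to
any fn. The class and method names (StartKeys, sortedDistinct) are hypothetical
and are not taken from the patch; the comparator is a hand-written unsigned
lexicographic ordering, assumed here to match how HBase orders row keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of the attached patch.
final class StartKeys {
  // Unsigned lexicographic byte[] ordering (assumed to match HBase row key order).
  static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = (a[i] & 0xff) - (b[i] & 0xff);
      if (cmp != 0) {
        return cmp;
      }
    }
    return a.length - b.length;
  };

  // Sorts the keys and drops duplicates in one pass through a TreeSet.
  static List<byte[]> sortedDistinct(Iterable<byte[]> keys) {
    TreeSet<byte[]> set = new TreeSet<>(UNSIGNED_LEX);
    for (byte[] key : keys) {
      set.add(key);
    }
    return new ArrayList<>(set);
  }

  public static void main(String[] args) {
    List<byte[]> keys = Arrays.asList(
        "m".getBytes(StandardCharsets.UTF_8),
        new byte[0], // the first region of a table has an empty start key
        "c".getBytes(StandardCharsets.UTF_8),
        "m".getBytes(StandardCharsets.UTF_8)); // duplicate on purpose
    for (byte[] key : sortedDistinct(keys)) {
      System.out.println("[" + new String(key, StandardCharsets.UTF_8) + "]");
    }
  }
}
```

Doing this once up front would mean each DetermineAffectedRegionsFn instance
receives keys that are already ordered and unique, so no per-instance sort is
needed.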
[1] -
https://archive.cloudera.com/cdh5/cdh/5/hbase/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#getTableRegions(byte[])
> Modify HFileUtils to flex on affected regions for hfiles, rather than all
> regions
> ---------------------------------------------------------------------------------
>
> Key: CRUNCH-588
> URL: https://issues.apache.org/jira/browse/CRUNCH-588
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Stephen Durfey
> Assignee: Josh Wills
> Attachments: hfileutils_0.8.5.patch
>
>
> HFileUtils when preparing for writing HFiles sets the [number of reducers |
> https://github.com/apache/crunch/blob/master/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HFileUtils.java#L422]
> equal to the number of regions in the table, and then writes out the start
> keys for each region to a sequence file for the TotalOrderPartitioner to
> consume when partitioning data. This can result in a very large number of
> reducers that do nothing, because there is no data to write to hfiles for the
> regions their partitions belong to.
> My proposal is to modify HFileUtils, with an optional parameter (or a config,
> that's up for debate), to determine ahead of time which regions data will be
> loaded into, set the number of reducers equal to the number of affected
> regions, and write out the start keys for only those affected regions.
> I have working code to do this on the 0.8.x branch of crunch, as that is what
> I am currently on. I can modify it to work on more recent versions, but I
> wanted to start a discussion around the viability of this code being
> contributed back to the community. I am still in the process of capturing
> metrics around the impact of the change (and trying to get data large enough
> to test this out), but at least in terms of reducer count I have seen
> substantial drops in my limited testing so far. For example, I had a job go from 705 reduce tasks
> during the write down to 36 reduce tasks.
> I've attached what I have so far as of 0.8.4. I'm going to start working on a
> version modified for the latest version of crunch.
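
The proposal quoted above can be sketched roughly as follows: map each row key
to the region that would hold it (the greatest region start key that is less
than or equal to the row key) and keep only the distinct start keys that were
hit. This is an illustration under stated assumptions, not the attached patch:
the class and method names are hypothetical, everything is held in memory
rather than in a PCollection, and the comparator is a hand-written unsigned
lexicographic ordering assumed to match HBase row key order.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not the code from the attached patch.
final class AffectedRegions {
  // Unsigned lexicographic byte[] ordering (assumed to match HBase row key order).
  static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = (a[i] & 0xff) - (b[i] & 0xff);
      if (cmp != 0) {
        return cmp;
      }
    }
    return a.length - b.length;
  };

  // Returns, in sorted order, the start keys of regions that contain
  // at least one of the given row keys.
  static List<byte[]> affectedStartKeys(List<byte[]> regionStartKeys, List<byte[]> rowKeys) {
    TreeSet<byte[]> regions = new TreeSet<>(UNSIGNED_LEX);
    regions.addAll(regionStartKeys);
    TreeSet<byte[]> affected = new TreeSet<>(UNSIGNED_LEX);
    for (byte[] row : rowKeys) {
      byte[] start = regions.floor(row); // greatest start key <= row, or null
      if (start != null) {
        affected.add(start);
      }
    }
    return new ArrayList<>(affected);
  }

  private static byte[] key(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    List<byte[]> regions = List.of(key(""), key("c"), key("g"), key("m"), key("t"));
    List<byte[]> rows = List.of(key("h"), key("i"), key("n"));
    List<byte[]> affected = affectedStartKeys(regions, rows);
    System.out.println(affected.size() + " affected of " + regions.size() + " regions");
  }
}
```

The size of the returned list would then be the reducer count. Since
TotalOrderPartitioner expects one fewer split point than there are reduce
tasks, whatever writes the partition file would presumably need to account for
that when emitting these keys.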