[jira] [Created] (CRUNCH-588) Modify HFileUtils to flex on affected regions for hfiles, rather than all regions

Stephen Durfey (JIRA) Tue, 19 Jan 2016 08:27:19 -0800

Stephen Durfey created CRUNCH-588:
-------------------------------------

             Summary: Modify HFileUtils to flex on affected regions for hfiles, 
rather than all regions
                 Key: CRUNCH-588
                 URL: https://issues.apache.org/jira/browse/CRUNCH-588
             Project: Crunch
          Issue Type: Improvement
          Components: Core
            Reporter: Stephen Durfey
            Assignee: Josh Wills
         Attachments: hfileutils_0.8.4.patch

HFileUtils when preparing for writing HFiles sets the [number of reducers |
https://github.com/apache/crunch/blob/master/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HFileUtils.java#L422]
equal to the number of regions in the table, and then writes out the start
keys for each region to a sequence file for the TotalOrderPartitioner to
consume when partitioning data. This can result in a very large quantity of
reducers that don't do anything due to not having any data to write to hfiles
for the region its partition belonged to.

My proposal is to modify HFileUtils, with an optional parameter (or a config,
that's up for debate) to determine which regions data will be loaded into ahead
of time, and set the number of reducers to equal the number of regions, and
only write out the start keys for those affected regions.

I have working code to do this on the 0.8.x branch of crunch, as that is what I
am currently on. I can modify it to work on more recent versions, but I wanted
to start a discussion around the viability of this code being contributed back
to the community. I am still in process of capturing metrics around the impact
of the change (and trying to get data large enough to test this out), but at
least from a reducer count I have seen substantial drops in my limited testing
so far. For example, I had a job go from 705 reduce tasks during the write down
to 36 reduce tasks.

I've attached what I have so far as of 0.8.4. I'm going to start working on a
version modified for the latest version of crunch.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (CRUNCH-588) Modify HFileUtils to flex on affected regions for hfiles, rather than all regions

Reply via email to