[jira] [Updated] (PIG-2934) HBaseStorage filter optimizations

Bill Graham (JIRA) Wed, 14 Nov 2012 21:22:18 -0800

     [ 
https://issues.apache.org/jira/browse/PIG-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bill Graham updated PIG-2934:
-----------------------------

    Status: Patch Available  (was: In Progress)

We uncovered a significant performance problem with HBaseStorage in releases 
after 0.9 when it is used with a long list of columns on a tall table. The 
previous use of filters is too heavy on HBase and pegs the cluster's CPU. We 
should consider including this patch in Pig 0.11.
                
> HBaseStorage filter optimizations
> ---------------------------------
>
>                 Key: PIG-2934
>                 URL: https://issues.apache.org/jira/browse/PIG-2934
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Bill Graham
>            Assignee: Bill Graham
>              Labels: hbase
>         Attachments: PIG-2934.1.patch
>
>
> Our HBase pal/guru Gary Helmling was kind enough to do a code review of 
> HBaseStorage. He suggested some good filter optimizations:
> * when using the "lt*" and "gt*" options, set the start/stop rows on the Scan 
> instance, at least in addition to the RowFilters. Without this you're doing a 
> full table scan, regardless of the RowFilters.
> * when selecting specific columns or entire families to return, it would be 
> more efficient to set the family + columns on the Scan object (addFamily(), 
> addColumn()), instead of using a FilterList. I'm not familiar with the 
> family:prefix handling you mention, but that would still seem to require 
> filters. But if that's not being used, it would be better to avoid the 
> FilterList for columns. At minimum, we should probably call Scan.addFamily() 
> with the distinct families, so we can skip entire column families that are 
> not being used. In the case of a table with 4 CFs, if, say, only 1 is being 
> used, this could be a big gain.
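
The two suggestions above can be sketched in plain Java, without the HBase
client on the classpath. This is only an illustration of the planning logic,
not the code in the attached patch. The class ScanPlanSketch, its method
names, and the "family:qualifier" / "family:*" / "family:prefix*" spec syntax
are assumptions made for the example. The comments name the Scan calls
(addFamily(), addColumn(), start/stop rows) that each case would map to. The
sketch relies on HBase treating the start row as inclusive and the stop row
as exclusive, which is why the existing RowFilters stay in place for the
"gt" case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: decides what can go on the Scan and what still needs a filter.
public class ScanPlanSketch {

    enum Kind { WHOLE_FAMILY, EXACT_COLUMN, PREFIX }

    // Classify one column spec. The spec syntax here is an assumption.
    static Kind classify(String spec) {
        int colon = spec.indexOf(':');
        String qualifier = colon < 0 ? "" : spec.substring(colon + 1);
        if (qualifier.isEmpty() || qualifier.equals("*")) {
            return Kind.WHOLE_FAMILY;   // candidate for Scan.addFamily()
        }
        if (qualifier.endsWith("*")) {
            return Kind.PREFIX;         // would still seem to require a filter
        }
        return Kind.EXACT_COLUMN;       // candidate for Scan.addColumn()
    }

    // Distinct families in first-seen order; families not listed can be skipped.
    static List<String> distinctFamilies(List<String> specs) {
        Set<String> families = new LinkedHashSet<>();
        for (String spec : specs) {
            int colon = spec.indexOf(':');
            families.add(colon < 0 ? spec : spec.substring(0, colon));
        }
        return new ArrayList<>(families);
    }

    // True when at least one spec is a prefix, so column filters cannot be dropped.
    static boolean needsColumnFilters(List<String> specs) {
        for (String spec : specs) {
            if (classify(spec) == Kind.PREFIX) {
                return true;
            }
        }
        return false;
    }

    // Start row is inclusive: "gte" maps directly. "gt" gives a slightly wider
    // range, so the RowFilter is still needed to drop the row equal to gt.
    static byte[] startRow(byte[] gt, byte[] gte) {
        return gte != null ? gte : gt;
    }

    // Stop row is exclusive: "lt" maps directly. For "lte", append a 0x00 byte
    // so the row equal to lte falls inside the range.
    static byte[] stopRow(byte[] lt, byte[] lte) {
        if (lt != null) {
            return lt;
        }
        if (lte != null) {
            return Arrays.copyOf(lte, lte.length + 1); // copyOf pads with zero
        }
        return null; // no upper bound: scan to the end of the table
    }

    public static void main(String[] args) {
        List<String> specs = Arrays.asList("cf1:name", "cf1:age", "cf2:tag_*");
        System.out.println("families: " + distinctFamilies(specs));
        for (String spec : specs) {
            System.out.println(spec + " -> " + classify(spec));
        }
        System.out.println("needs column filters: " + needsColumnFilters(specs));
    }
}
```

With a plan like this, a load that names only exact columns and whole
families would need no FilterList for columns at all, and a load that uses
prefixes could still restrict the Scan to the distinct families it touches.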

