[ https://issues.apache.org/jira/browse/PIG-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bill Graham updated PIG-2934:
-----------------------------
Status: Patch Available (was: In Progress)
We uncovered a significant performance problem with HBaseStorage (versions
after 0.9) when it is used with a long list of columns on a tall table. The
previous filter-based approach is too heavy on HBase and pegs the cluster's
CPU. This patch should be considered for inclusion in Pig 0.11.
> HBaseStorage filter optimizations
> ---------------------------------
>
> Key: PIG-2934
> URL: https://issues.apache.org/jira/browse/PIG-2934
> Project: Pig
> Issue Type: Improvement
> Reporter: Bill Graham
> Assignee: Bill Graham
> Labels: hbase
> Attachments: PIG-2934.1.patch
>
>
> Our HBase pal/guru Gary Helmling was kind enough to do a code review of
> HBaseStorage. He suggested some good filter optimizations:
> * when using the "lt*" and "gt*" options, set the start/stop rows on the Scan
> instance, at least in addition to the RowFilters. Without this you're doing a
> full table scan, regardless of the RowFilters.
> * when selecting specific columns or entire families to return, it would be
> more efficient to set the family + columns on the Scan object (addFamily(),
> addColumn()), instead of using a FilterList. I'm not familiar with the
> family:prefix handling you mention, but that would still seem to require
> filters. But if that's not being used, it would be better to avoid the
> FilterList for columns. At minimum, we should probably call Scan.addFamily()
> with the distinct families, so we can skip entire column families that are
> not being used. In the case of a table with 4 CFs, if, say, only 1 is being
> used, this could be a big gain.
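
A minimal sketch of the two suggestions above, written in plain Java so it runs without an HBase client on the classpath. It only derives the values that would be handed to the Scan: the distinct families for Scan.addFamily(), and start/stop row keys for the "gt*"/"lt*" options. The class and method names (ScanPlanSketch, distinctFamilies, startRow, stopRow) and the "family:qualifier" spec format are assumptions for illustration, not code from the attached patch. It also assumes the classic Scan semantics in which the start row is inclusive and the stop row is exclusive.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: computes what would be passed to an HBase Scan.
class ScanPlanSketch {

    // Distinct families from specs such as "cf:col", "cf:prefix*" or "cf:".
    // Each result would be passed once to Scan.addFamily(), so families
    // that are never referenced can be skipped entirely.
    public static Set<String> distinctFamilies(List<String> columnSpecs) {
        Set<String> families = new LinkedHashSet<>();
        for (String spec : columnSpecs) {
            int sep = spec.indexOf(':');
            families.add(sep < 0 ? spec : spec.substring(0, sep));
        }
        return families;
    }

    // Smallest key strictly greater than `row` in byte order: append 0x00.
    static byte[] nextRow(byte[] row) {
        return Arrays.copyOf(row, row.length + 1); // copyOf pads with zero
    }

    // Start row (assumed inclusive) for a lower bound on the row key.
    // A strict bound must skip the key itself; an inclusive bound keeps it.
    public static byte[] startRow(String key, boolean inclusive) {
        byte[] row = key.getBytes(StandardCharsets.UTF_8);
        return inclusive ? row : nextRow(row);
    }

    // Stop row (assumed exclusive) for an upper bound on the row key.
    // An inclusive bound must extend one step past the key to include it.
    public static byte[] stopRow(String key, boolean inclusive) {
        byte[] row = key.getBytes(StandardCharsets.UTF_8);
        return inclusive ? nextRow(row) : row;
    }

    public static void main(String[] args) {
        List<String> specs = Arrays.asList("info:name", "info:age", "stats:");
        System.out.println(distinctFamilies(specs));
        System.out.println(Arrays.toString(startRow("b", false)));
        System.out.println(Arrays.toString(stopRow("m", true)));
    }
}
```

With a real client, each family returned by distinctFamilies would go to Scan.addFamily(), and the two keys would become the Scan's start and stop rows. The RowFilters can stay in place as a secondary check, as the review suggests.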