[jira] [Updated] (HBASE-4435) Add Group By functionality using Coprocessors

Aaron Tokhy (JIRA) Wed, 17 Oct 2012 12:02:07 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Aaron Tokhy updated HBASE-4435:
-------------------------------

    Attachment: HBASE-4435-v2.patch

I have a newer version of the patch:

Improvements:

1) Added implementations of ColumnInterpreter classes so that both
AggregationClient and GroupByClient can perform aggregations on Long, Short,
Integer, Double, Float, Character (treated as an unsigned short), and
BigDecimal types.

2) The GroupByStatsValues class is a Java generic whose type parameter is
bounded by the 'Number' class.  This way only numeric types are accepted, and
the constraint is checked at compile time.

3) Previously, a HashMap was returned at the end of each RPC call.  HashMap
is serialized through java.io.Serializable, which is relatively heavyweight.
All objects passed between clients and regionservers now implement the Hadoop
Writable interface instead.

4) Fixed some validateParameter bugs in the previous patch that allowed
column qualifiers not present in the Scan object to be selected.
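
Improvements 2) and 3) can be sketched roughly as follows.  This is not code
from the patch: StatsSketch and GroupStatsWire are hypothetical names, the
fields are assumptions, and the second class only mirrors the
write/readFields method pair of the Hadoop Writable interface without
importing Hadoop, so that the example compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.math.BigDecimal;

// Hypothetical stand-in for GroupByStatsValues. The bound on T makes
// something like StatsSketch<String> a compile-time error.
class StatsSketch<T extends Number & Comparable<T>> {
    private T min;
    private T max;
    private BigDecimal sum = BigDecimal.ZERO;
    private long count;

    void accumulate(T value) {
        if (min == null || value.compareTo(min) < 0) min = value;
        if (max == null || value.compareTo(max) > 0) max = value;
        // Summing through the decimal string works for every boxed numeric
        // type; NaN and infinities are not representable and would throw.
        sum = sum.add(new BigDecimal(value.toString()));
        count++;
    }

    T getMin() { return min; }
    T getMax() { return max; }
    BigDecimal getSum() { return sum; }
    long getCount() { return count; }
}

// Hypothetical wire form of one group's result. write/readFields use the
// same signatures as org.apache.hadoop.io.Writable (assumption: the real
// classes implement that interface directly).
class GroupStatsWire {
    long count;
    double sum;
    double min;
    double max;

    void write(DataOutput out) throws IOException {
        out.writeLong(count);
        out.writeDouble(sum);
        out.writeDouble(min);
        out.writeDouble(max);
    }

    void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        count = in.readLong();
        sum = in.readDouble();
        min = in.readDouble();
        max = in.readDouble();
    }

    static byte[] toBytes(GroupStatsWire s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            s.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static GroupStatsWire fromBytes(byte[] bytes) {
        GroupStatsWire s = new GroupStatsWire();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(bytes))) {
            s.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return s;
    }
}
```

In this sketch each record is four 8-byte fields, so 32 bytes on the wire,
whereas java.io.Serializable also writes class descriptors into the stream.
Note that java.lang.Character does not extend Number, so under this bound a
Character column would first have to be widened to a numeric type.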

Caveats:

1) This works well only if your result set fits into memory, because group by
values are aggregated into a HashMap on the client.  If the cardinality of
the group by keys is too high, you may get an OutOfMemoryError (OOME).

2) All aggregations are calculated by the 'GroupByStatsValues' container.
Perhaps the container could be configured at construction time to perform
only some of the aggregations instead of all of them at once.  However, this
operation is Scan (I/O) bound, so the improvement would be minimal.

3) Like all coprocessors that accept a Scan object, if the aggregation
performs a full table scan, it will run on all regionservers.  Each
region-level coprocessor call runs in an IPC handler (default of 10) on the
regionserver.  If the regionserver has more regions than IPC handlers, at
most that many group by operations (10 by default) will run at a time.
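
The memory concern in caveat 1) comes from the client-side merge step, which
can be sketched roughly as below.  ClientMergeSketch and Partial are
hypothetical names and only count and sum are tracked; the patch's actual
merge logic is not shown in this message.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ClientMergeSketch {
    // Hypothetical partial aggregate for one group, as returned by one region.
    static final class Partial {
        final long count;
        final long sum;

        Partial(long count, long sum) {
            this.count = count;
            this.sum = sum;
        }

        Partial plus(Partial other) {
            return new Partial(count + other.count, sum + other.sum);
        }
    }

    // Folds the per-region maps into a single map held on the client.
    // The merged map has one entry per distinct group key, which is why a
    // high-cardinality group by can exhaust the client heap.
    static Map<String, Partial> merge(List<Map<String, Partial>> perRegion) {
        Map<String, Partial> merged = new HashMap<>();
        for (Map<String, Partial> region : perRegion) {
            for (Map.Entry<String, Partial> e : region.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Partial::plus);
            }
        }
        return merged;
    }
}
```

Memory use on the client grows with the number of distinct group keys rather
than with the number of rows scanned.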

Depending on your table schema, region size and blockCacheHitRatio, your 
mileage may vary.  If data can be preaggregated for a group by operation, this 
patch would be handy for aggregating a single column value projection of the 
original full table.  A column oriented representation of the original table 
would work well in this case, or possibly a client/coprocessor managed 
secondary index.

The patch applies cleanly onto HBase 0.92.1 and HBase 0.94.1.
                
> Add Group By functionality using Coprocessors
> ---------------------------------------------
>
>                 Key: HBASE-4435
>                 URL: https://issues.apache.org/jira/browse/HBASE-4435
>             Project: HBase
>          Issue Type: Improvement
>          Components: Coprocessors
>            Reporter: Nichole Treadway
>            Priority: Minor
>         Attachments: HBase-4435.patch, HBASE-4435-v2.patch
>
>
> Adds in a Group By -like functionality to HBase, using the Coprocessor 
> framework. 
> It provides the ability to group the result set on one or more columns 
> (groupBy families). It computes statistics (max, min, sum, count, sum of 
> squares, number missing) for a second column, called the stats column. 
> To use, I've provided two implementations.
> 1. In the first, you specify a single group-by column and a stats field:
>       statsMap = gbc.getStats(tableName, scan, groupByFamily, 
> groupByQualifier, statsFamily, statsQualifier, statsFieldColumnInterpreter);
> The result is a map with the Group By column value (as a String) to a 
> GroupByStatsValues object. The GroupByStatsValues object has max,min,sum etc. 
> of the stats column for that group.
> 2. The second implementation allows you to specify a list of group-by columns 
> and a stats field. The List of group-by columns is expected to contain lists 
> of {column family, qualifier} pairs. 
>       statsMap = gbc.getStats(tableName, scan, listOfGroupByColumns, 
> statsFamily, statsQualifier, statsFieldColumnInterpreter);
> The GroupByStatsValues code is adapted from the Solr Stats component.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
