[jira] Commented: (HBASE-1512) Coprocessors: Support aggregate functions

Gary Helmling (JIRA) Tue, 02 Nov 2010 18:08:51 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927706#action_12927706
 ]


Gary Helmling commented on HBASE-1512:
--------------------------------------

Thanks for the patch, Himanshu!

To scope the functionality and decide which aggregation functions to cover, 
you might start from a comparison with the common SQL aggregate functions 
(e.g. http://dev.mysql.com/doc/refman/5.5/en/group-by-functions.html).  I don't 
know if you really need to implement all of them, but a good start would 
probably be:

 * COUNT
 * AVG
 * MIN
 * MAX
 * STD
 * SUM

(just my opinion of course).  All of these would need some form of server side 
function, and in some cases the client/server coordination might be a little 
tricky when you have to span regions.
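
To make the cross-region point concrete: AVG and STD cannot be rebuilt from 
per-region averages or deviations, so each region would have to return 
something mergeable instead.  Below is a rough, self-contained sketch of one 
way to do that, with each region returning (count, sum, sum of squares).  
RegionPartial and its methods are made-up names for illustration and are not 
part of the patch or of any HBase API.  STD here is the population standard 
deviation, which is what MySQL's STD() returns.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical per-region partial result; not part of any HBase API. */
final class RegionPartial {
    final long count;
    final double sum;
    final double sumOfSquares;

    RegionPartial(long count, double sum, double sumOfSquares) {
        this.count = count;
        this.sum = sum;
        this.sumOfSquares = sumOfSquares;
    }

    /** Stands in for what a server-side function could compute over one region. */
    static RegionPartial of(double... values) {
        double s = 0, sq = 0;
        for (double v : values) {
            s += v;
            sq += v * v;
        }
        return new RegionPartial(values.length, s, sq);
    }

    /** Client-side step: partials from different regions combine by addition. */
    static RegionPartial merge(List<RegionPartial> partials) {
        long n = 0;
        double s = 0, sq = 0;
        for (RegionPartial p : partials) {
            n += p.count;
            s += p.sum;
            sq += p.sumOfSquares;
        }
        return new RegionPartial(n, s, sq);
    }

    double mean() {
        return sum / count;
    }

    /** Population standard deviation: sqrt(E[x^2] - E[x]^2). */
    double std() {
        double m = mean();
        // Clamp at zero to absorb tiny negative values from rounding.
        return Math.sqrt(Math.max(0.0, sumOfSquares / count - m * m));
    }

    public static void main(String[] args) {
        RegionPartial a = RegionPartial.of(1, 2, 3); // values held by region A
        RegionPartial b = RegionPartial.of(10);      // values held by region B
        RegionPartial all = merge(Arrays.asList(a, b));
        System.out.println("merged mean = " + all.mean());
        System.out.println("merged std  = " + all.std());
        // Averaging the two per-region means weights the regions equally
        // and therefore gives a different (wrong) answer.
        System.out.println("average of averages = " + (a.mean() + b.mean()) / 2);
    }
}
```

The sum-of-squares form is the simplest thing to merge, but it can lose 
precision on large or tightly clustered values, so a real implementation might 
prefer a more numerically stable merge.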

The client side interface for these also has its own needs.  Does it make 
sense to be able to combine different client side aggregation functions with 
unmatched server side functions?  Would you want to take a client side minimum 
of the per-region maximum values returned from the row range?  As far as I can 
see, you would mostly want a single client function paired with a given 
server-side method.
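
As a rough illustration of that pairing, here is a small self-contained sketch 
in which each server-side function is tied to exactly one client-side 
combiner.  AggregatePairing and Kind are hypothetical names, not anything in 
the patch.  The one non-obvious pairing is COUNT: the per-region counts have 
to be summed on the client, not counted.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical pairing of server-side aggregates with client-side combiners. */
final class AggregatePairing {

    enum Kind {
        MIN, MAX, SUM, COUNT;

        /** Folds the per-region results into the single value a caller wants. */
        long combine(List<Long> perRegion) {
            if (perRegion.isEmpty()) {
                throw new IllegalArgumentException("no region results");
            }
            long acc = perRegion.get(0);
            for (long v : perRegion.subList(1, perRegion.size())) {
                switch (this) {
                    case MIN:
                        acc = Math.min(acc, v);
                        break;
                    case MAX:
                        acc = Math.max(acc, v);
                        break;
                    default: // SUM and COUNT both combine by addition
                        acc += v;
                        break;
                }
            }
            return acc;
        }
    }

    public static void main(String[] args) {
        // Pretend three regions each reported a row count.
        List<Long> perRegionCounts = Arrays.asList(3L, 5L, 2L);
        System.out.println("total rows = " + Kind.COUNT.combine(perRegionCounts));
        // A mismatched pairing still returns a number, just not a useful one:
        // this is the largest single-region count, not a count of anything.
        System.out.println("max of counts = " + Kind.MAX.combine(perRegionCounts));
    }
}
```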

I do see that the "raw" HTable.coprocessorExec() interface is a bit clumsy for 
these types of operations.  You really want to be able to return a single 
value, not a value per region.  But I think you can create some client helper 
methods to hide that complexity.

For the current client classes, ProcessResultsFromCP seems to have a lot of 
overlap with Batch.Callback.  The main difference is that 
HTable.processResultsFromCP() allows you to pass a list of instances (as 
opposed to a single Batch.Callback).  If using a single Callback instance is 
limiting, we could allow use of a list of Callbacks, or provide a 
Batch.callbackList() factory method that allows chaining multiple instances 
together.  But for the common cases here, it seems like you'll want a single 
client side function (min, max, etc) paired with a single server-side 
invocation (min, max, etc.), so the current Batch.Callback would probably 
suffice.
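
If chaining does turn out to be needed, the factory method could be as small 
as the sketch below.  To keep it self-contained it uses a local stand-in 
Callback interface with the same update() shape as the example in this 
comment, not the real Batch.Callback.  callbackList() is only the proposed 
name and does not exist today.

```java
import java.util.Arrays;
import java.util.List;

final class CallbackChaining {

    /** Local stand-in mirroring the update() shape; not the real Batch.Callback. */
    interface Callback<R> {
        void update(byte[] region, byte[] row, R value);
    }

    /** Hypothetical callbackList(): forwards every result to each delegate in order. */
    static <R> Callback<R> callbackList(final List<Callback<R>> delegates) {
        return new Callback<R>() {
            public void update(byte[] region, byte[] row, R value) {
                for (Callback<R> c : delegates) {
                    c.update(region, row, value);
                }
            }
        };
    }

    static final class Sum implements Callback<Long> {
        long sum;
        public void update(byte[] region, byte[] row, Long value) {
            sum += value;
        }
    }

    static final class Max implements Callback<Long> {
        long max = Long.MIN_VALUE;
        public void update(byte[] region, byte[] row, Long value) {
            max = Math.max(max, value);
        }
    }

    public static void main(String[] args) {
        Sum sum = new Sum();
        Max max = new Max();
        Callback<Long> both = callbackList(Arrays.<Callback<Long>>asList(sum, max));
        // Simulate one result arriving from each of three regions.
        long[] perRegion = {4L, 9L, 2L};
        for (long v : perRegion) {
            both.update(new byte[0], new byte[0], v);
        }
        System.out.println("sum = " + sum.sum + ", max = " + max.max);
    }
}
```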

So as an example on the client side, you could provide a client wrapper in the 
form:

{{{
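import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.coprocessor.Batch;

public class Aggregations {
    // Client-side half: adds up the partial sums returned by each region.
    private static class ClientSum implements Batch.Callback<Long> {
        private long sum;
        public void update(byte[] region, byte[] row, Long value) {
            sum += value;
        }
        public long getValue() { return sum; }
    }

    // family and col are final so the anonymous Batch.Call can capture them.
    // "throws Throwable" is a conservative assumption about what
    // coprocessorExec() may declare.
    public static long sum(HTable table, byte[] start, byte[] end,
            final byte[] family, final byte[] col) throws Throwable {
        ClientSum sum = new ClientSum();
        // Server-side half: AggFunctionProtocol.sum() is invoked per region.
        table.coprocessorExec(AggFunctionProtocol.class, start, end, 
            new Batch.Call<AggFunctionProtocol,Long>() {
                public Long call(AggFunctionProtocol instance) {
                    return instance.sum(family, col);
                }
            }, sum);
        return sum.getValue();
    }
}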
}}}

And so on for the other types of operations...  Then clients can just call 
Aggregations.sum() with the right args.

There may be better ways to do it; this is just an illustration. :)

And, please, if you see ways that HTable.coprocessorExec() can be improved to 
make this easier, comment on HBASE-2002!




> Coprocessors: Support aggregate functions
> -----------------------------------------
>
>                 Key: HBASE-1512
>                 URL: https://issues.apache.org/jira/browse/HBASE-1512
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: stack
>         Attachments: 1512.zip
>
>
> Chatting with jgray and holstad at the kitchen table about counts, sums, and 
> other aggregating facilities (generally, anywhere you want to calculate 
> some meta info on your table), it seems like it wouldn't be too hard to make 
> a filter type that could run a function server-side and return ONLY the 
> result of the aggregation or whatever.
> For example, say you just want to count rows, currently you scan, server 
> returns all data to client and count is done by client counting up row keys.  
> A bunch of time and resources have been wasted returning data that we're not 
> interested in.  With this new filter type, the counting would be done 
> server-side and then it would make up a new result that was the count only 
> (kinda like mysql when you ask it to count, it returns a 'table' with a count 
> column whose value is count of rows).   We could have it so the count was 
> just done per region and return that.  Or we could maybe make a small change 
> in scanner too so that it aggregated the per-region counts.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
