[
https://issues.apache.org/jira/browse/HBASE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927706#action_12927706
]
Gary Helmling commented on HBASE-1512:
--------------------------------------
Thanks for the patch, Himanshu!
For the scope of the functionality and which aggregation functions to cover,
you might want to start by comparing against the common SQL aggregate functions
(e.g. http://dev.mysql.com/doc/refman/5.5/en/group-by-functions.html). I don't
know if you really need to implement all of them, but a good start would
probably be:
* COUNT
* AVG
* MIN
* MAX
* STD
* SUM
(just my opinion of course). All of these would need some form of server side
function, and in some cases the client/server coordination might be a little
tricky when you have to span regions.
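
To illustrate why spanning regions gets tricky: AVG can't be computed by averaging the per-region averages, because regions hold different numbers of rows, and STD has the same problem. One way around it is for each region to return partial sums that the client merges. This is only a rough sketch under that assumption; PartialAgg and its methods are hypothetical names, not part of the patch or of HBase.

```java
/** Hypothetical per-region partial result; not an HBase class. */
final class PartialAgg {
    final long count;    // number of values seen in the region
    final double sum;    // sum of the values
    final double sumSq;  // sum of the squared values

    PartialAgg(long count, double sum, double sumSq) {
        this.count = count;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    /** Builds the partial result a single region would compute server side. */
    static PartialAgg of(double... values) {
        double s = 0, sq = 0;
        for (double v : values) {
            s += v;
            sq += v * v;
        }
        return new PartialAgg(values.length, s, sq);
    }

    /** Client-side merge of two regions' partial results. */
    PartialAgg merge(PartialAgg other) {
        return new PartialAgg(count + other.count, sum + other.sum, sumSq + other.sumSq);
    }

    /** Mean over all merged regions; NaN when no values were seen. */
    double avg() {
        return sum / count;
    }

    /** Population standard deviation over all merged regions. */
    double std() {
        double mean = avg();
        // Clamp at zero to absorb small negative values from rounding.
        return Math.sqrt(Math.max(0.0, sumSq / count - mean * mean));
    }
}
```

With regions holding {1, 2, 3} and {10}, averaging the two per-region averages would be skewed toward the small region, while merging the partials weights every row equally.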
The client side interface for these also has its own needs. Does it make
sense to be able to combine different client side aggregation functions with
unmatched server side functions? Would you want to take a client side minimum
of the per-region maximum values returned from the row range? As far as I can
see, you would mostly want a single client function paired with a given
server-side method.
I do see that the "raw" HTable.coprocessorExec() interface is a bit clumsy for
these types of operations. You really want to be able to return a single
value, not a value per region. But I think you can create some client helper
methods to hide that complexity.
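
As a rough idea of what such a helper could look like, here is a sketch that folds per-region values into a single result. It assumes the per-region results are available as a Map keyed by region name; RegionResults and reduce() are hypothetical names, not existing HBase API.

```java
import java.util.Map;
import java.util.function.BinaryOperator;

/** Hypothetical client helper; not part of HBase. */
final class RegionResults {
    private RegionResults() {}

    /**
     * Folds one value per region into a single value.
     * Returns identity when no region reported anything; null values are skipped.
     */
    static <R> R reduce(Map<byte[], R> perRegion, R identity, BinaryOperator<R> combine) {
        R acc = identity;
        for (R value : perRegion.values()) {
            if (value != null) {
                acc = combine.apply(acc, value);
            }
        }
        return acc;
    }
}
```

A sum would then be reduce(results, 0L, Long::sum) and a max would be reduce(results, Long.MIN_VALUE, Long::max), so callers never see the per-region map.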
For the current client classes, ProcessResultsFromCP seems to have a lot of
overlap with Batch.Callback. The main difference is that
HTable.processResultsFromCP() allows you to pass a list of instances (as
opposed to a single Batch.Callback). If using a single Callback instance is
limiting, we could allow use of a list of Callbacks, or provide a
Batch.callbackList() factory method that allows chaining multiple instances
together. But for the common cases here, it seems like you'll want a single
client side function (min, max, etc) paired with a single server-side
invocation (min, max, etc.), so the current Batch.Callback would probably
suffice.
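
If chaining does turn out to be needed, the factory idea could look something like the sketch below. The Callback interface here is a local stand-in that only mirrors the update(region, row, value) shape of Batch.Callback, and callbackList() is the hypothetical factory mentioned above, not something that exists today.

```java
/** Local stand-in for Batch.Callback, reduced to the one method used here. */
interface Callback<R> {
    void update(byte[] region, byte[] row, R value);
}

/** Hypothetical factory holder; not part of HBase. */
final class Callbacks {
    private Callbacks() {}

    /** Returns a single callback that forwards every update to each callback in order. */
    @SafeVarargs
    static <R> Callback<R> callbackList(final Callback<R>... callbacks) {
        // Copy defensively so later changes to the caller's array have no effect.
        final Callback<R>[] chain = callbacks.clone();
        return new Callback<R>() {
            @Override
            public void update(byte[] region, byte[] row, R value) {
                for (Callback<R> cb : chain) {
                    cb.update(region, row, value);
                }
            }
        };
    }
}
```

The combined callback could then be passed wherever a single Callback is accepted, so the existing method signatures would not need a list variant.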
So as an example on the client side, you could provide a client wrapper in the
form:
{{{
public class Aggregations {
  // Client-side callback: adds up the partial sums returned by each region.
  private static class ClientSum implements Batch.Callback<Long> {
    private long sum;
    public void update(byte[] region, byte[] row, Long value) {
      if (value != null) {  // skip regions that returned nothing
        sum += value;
      }
    }
    public long getValue() { return sum; }
  }

  // family and col are final so the anonymous Batch.Call below can use them.
  public static long sum(HTable table, byte[] start, byte[] end,
      final byte[] family, final byte[] col) {
    ClientSum sum = new ClientSum();
    table.coprocessorExec(AggFunctionProtocol.class, start, end,
        new Batch.Call<AggFunctionProtocol,Long>() {
          public Long call(AggFunctionProtocol instance) {
            // Runs against each region in the row range.
            return instance.sum(family, col);
          }
        }, sum);
    return sum.getValue();
  }
}
}}}
And so on for the other types of operations... Then clients can just call
Aggregations.sum() with the right args.
There may be better ways to do it, this is just an illustration. :)
And, please, if you see ways that HTable.coprocessorExec() can be improved to
make this easier, comment on HBASE-2002!
> Coprocessors: Support aggregate functions
> -----------------------------------------
>
> Key: HBASE-1512
> URL: https://issues.apache.org/jira/browse/HBASE-1512
> Project: HBase
> Issue Type: Sub-task
> Reporter: stack
> Attachments: 1512.zip
>
>
> Chatting with jgray and holstad at the kitchen table about counts, sums, and
> other aggregating facility, facility generally where you want to calculate
> some meta info on your table, it seems like it wouldn't be too hard making a
> filter type that could run a function server-side and return the result ONLY
> of the aggregation or whatever.
> For example, say you just want to count rows, currently you scan, server
> returns all data to client and count is done by client counting up row keys.
> A bunch of time and resources have been wasted returning data that we're not
> interested in. With this new filter type, the counting would be done
> server-side and then it would make up a new result that was the count only
> (kinda like mysql when you ask it to count, it returns a 'table' with a count
> column whose value is count of rows). We could have it so the count was
> just done per region and return that. Or we could maybe make a small change
> in scanner too so that it aggregated the per-region counts.