[jira] [Comment Edited] (CASSANDRA-7888) Decide the best way to define user-defined functions

Robert Stupp (JIRA) Fri, 05 Sep 2014 07:45:34 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123003#comment-14123003
 ]


Robert Stupp edited comment on CASSANDRA-7888 at 9/5/14 2:36 PM:
-----------------------------------------------------------------

A Java interface would work fine with 'class' UDFs (CASSANDRA-7395).
It might get complicated for 'java' UDFs (CASSANDRA-7562) and JSR223 (script) 
UDFs (CASSANDRA-7526), although the 'java' code generation could be changed.

I thought of an alternative approach: pass a "result set context" object into 
UDFs as the first parameter for aggregate functions.
That means each SELECT execution generates one "result set context" object for 
each aggregate function it uses.
The drawback, of course, is that such a "generic result set context" could not 
use primitive types ({{int}}, {{long}}, {{double}}, etc.) but only the wrapper 
types ({{Integer}}, {{Long}}, {{Double}}, etc.).
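
A minimal sketch of that idea, to make the boxing drawback concrete. The names {{ResultSetContext}}, {{get}}, {{put}} and {{sum}} are assumptions made up for illustration; nothing like them exists in the patches yet.

```java
import java.util.HashMap;
import java.util.Map;

public class ResultSetContextSketch {
    // Hypothetical generic context: one instance per aggregate function per
    // SELECT execution. State is held as Object, so primitives must be boxed.
    static final class ResultSetContext {
        private final Map<String, Object> state = new HashMap<>();

        @SuppressWarnings("unchecked")
        <T> T get(String key, T defaultValue) {
            Object v = state.get(key);
            return v == null ? defaultValue : (T) v;
        }

        void put(String key, Object value) {
            state.put(key, value);
        }
    }

    // Hypothetical aggregate UDF: the context is the first parameter,
    // the value of the current row is the second.
    static Double sum(ResultSetContext ctx, Double input) {
        Double total = ctx.get("sum", 0.0d) + input; // unbox, add, box again
        ctx.put("sum", total);
        return total;
    }

    public static void main(String[] args) {
        ResultSetContext ctx = new ResultSetContext();
        Double result = null;
        for (double v : new double[] {1.0, 2.5, 4.0}) {
            result = sum(ctx, v); // double is autoboxed to Double on every call
        }
        System.out.println(result);
    }
}
```

Every call boxes the input and the running total, which is the cost a typed, per-function context would avoid.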

Alternatively, we could let the UDF implementation return an "aggregate 
interface implementation" that gets called for each row and once more for the 
final result. For example, for 'class' UDFs:
{noformat}
class MyAggregateFunctionContext implements AggregateFunctionResultSet<Double> {
    private Double resultValue;

    public void forEachRow(SomeRowOrCell data) {
        // ... per row magic code
    }

    public Double getResult() {
        return resultValue;
    }
}
{noformat}

for 'java' UDFs:
{noformat}
CREATE FUNCTION aggregateMagic ( input double ) RETURNS double LANGUAGE java AS 
'
    return new AggregateFunctionResultSet<Double>() {
        private Double resultValue;

        public void forEachRow(SomeRowOrCell data) {
            // ... per row magic code
        }

        public Double getResult() {
            return resultValue;
        }
    };
';
{noformat}

It may be necessary to add some {{CREATE AGGREGATE FUNCTION ...}} syntax to 
distinguish between scalar and aggregate functions.

BTW: javassist has been added as part of CASSANDRA-7562.

EDIT: strike that {{SomeRowOrCell}} - should read {{Double}}
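
To show how the engine side could drive such an object, here is a rough, self-contained sketch. It uses {{Double}} as the per-row type, as the EDIT note says. {{AggregateFunctionResultSet}} is the hypothetical interface from the examples above; {{Average}} and {{aggregate}} are made-up names, and skipping null values is an assumption, not a decision.

```java
import java.util.Arrays;
import java.util.List;

public class AggregateInterfaceSketch {
    // Hypothetical interface named after the example above; not an existing API.
    interface AggregateFunctionResultSet<T> {
        void forEachRow(T data); // called once per row
        T getResult();           // called once for the final result
    }

    // Example aggregate: arithmetic mean over Double values.
    static final class Average implements AggregateFunctionResultSet<Double> {
        private double sum;
        private long count;

        @Override
        public void forEachRow(Double data) {
            if (data == null) return; // assumption: null values are skipped
            sum += data;
            count++;
        }

        @Override
        public Double getResult() {
            return count == 0 ? null : Double.valueOf(sum / count);
        }
    }

    // Stand-in for the query engine: one aggregate instance per SELECT execution.
    static <T> T aggregate(AggregateFunctionResultSet<T> agg, List<T> rows) {
        for (T row : rows) {
            agg.forEachRow(row);
        }
        return agg.getResult();
    }

    public static void main(String[] args) {
        System.out.println(aggregate(new Average(), Arrays.asList(2.0, 4.0, 9.0)));
    }
}
```

The state lives in primitive fields of the implementation, so only the per-row argument and the final result are boxed.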


> Decide the best way to define user-defined functions
> ----------------------------------------------------
>
>                 Key: CASSANDRA-7888
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7888
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Lerer
>              Labels: cql
>             Fix For: 3.0
>
>
> The goal of this ticket is to decide on the best way, in terms of ease of use 
> and performance, to define User Defined Scalar Functions and User Defined 
> Aggregate Functions.
> I would like to clarify this point before we add support for User Defined 
> Aggregate Functions as part of #4914.
> The current version of UDFs supports only scalar functions, and does so by 
> allowing a user to provide classes containing static methods that can then 
> be loaded as functions within Cassandra.
> The problem with the static method approach is that it forces us internally 
> to perform a method call via reflection for each call of the function. So if 
> the request loads 10,000 rows, the static method will be called 10,000 times 
> via reflection.
> As the Method object is cached, the HotSpot compiler will optimize the method 
> call after a certain number of iterations. Nevertheless, from a performance 
> point of view it is definitely not an optimal situation.
> Ideally, a proper solution from the performance point of view would limit the 
> impact to function loading time (when the function is first added, or at 
> startup) and have none at query time.
> The first solution to that problem would be to force the designer of a new 
> function to implement a specific interface like:
> {code}
> public interface UserDefinedScalarFunction
> {
>     Object execute(Object... args);
> }
> {code}
> or, for aggregate functions:
> {code}
> public interface UserDefinedAggregateFunction
> {
>     UserDefinedAggregate newAggregate();
>
>     public interface UserDefinedAggregate
>     {
>         void add(Object... args);
>         Object getResult();
>         void reset();
>     }
> }
> {code}
> This will allow us to create one object instance via reflection and then 
> reuse that object every time the function is called.
> The problem with that approach is that we lose the type safety of the 
> arguments and of the return type, and as a consequence we will only be able 
> to detect a problem at runtime.
> The second solution would be to force the designer of a new function to 
> create a new class and mark the method to execute with an annotation:
> {code}
> public class AbsFunction
> {
>     @Execute
>     public double abs(double d)
>     {
>         return Math.abs(d);
>     }
> }
> {code}
> The same approach for aggregate functions will give:
> {code}
> public class AvgFunction
> {
>     private double sum;
>     private int count;
>
>     @Add
>     public void addValue(double d)
>     {
>         sum += d;
>         count++;
>     }
>
>     @Get
>     public double getAvg()
>     {
>         if (count == 0)
>             return 0;
>         return sum / count;
>     }
>
>     @Reset
>     public void clear()
>     {
>         sum = 0;
>         count = 0;
>     }
> }
> {code}
> For this approach to work we need to use code generation at loading time to 
> extend the provided class with the methods needed to adapt it to our 
> framework.
> The disadvantage is that we will need to add a new library like javassist 
> to the libraries used by C*.
> The advantage is that it will allow us to detect type mismatches at creation 
> time.
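
The load-time check described for the annotation approach can be sketched roughly as follows. This is not the javassist code generation the description proposes: it swaps in {{java.lang.invoke.MethodHandle}} from the JDK so the example stays self-contained. The {{@Execute}} annotation is declared locally, and {{load}} is a made-up helper name.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Method;
import java.util.Arrays;

public class AnnotatedUdfSketch {
    // Local stand-in for the @Execute annotation from the description.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Execute {}

    public static class AbsFunction {
        @Execute
        public double abs(double d) {
            return Math.abs(d);
        }
    }

    // Load-time step (hypothetical helper): find the @Execute method, compare its
    // signature with the declared types, and bind it to one reusable instance.
    static MethodHandle load(Class<?> udfClass, Class<?> returnType, Class<?>... argTypes)
            throws Exception {
        for (Method m : udfClass.getDeclaredMethods()) {
            if (!m.isAnnotationPresent(Execute.class)) continue;
            if (m.getReturnType() != returnType
                    || !Arrays.equals(m.getParameterTypes(), argTypes)) {
                // Type mismatch is reported at creation time, not at query time.
                throw new IllegalArgumentException("type mismatch in " + m);
            }
            Object instance = udfClass.getDeclaredConstructor().newInstance();
            return MethodHandles.lookup().unreflect(m).bindTo(instance);
        }
        throw new IllegalArgumentException("no @Execute method in " + udfClass.getName());
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle abs = load(AbsFunction.class, double.class, double.class);
        double r = (double) abs.invokeExact(-3.5d); // typed call, no Object[] boxing
        System.out.println(r);
    }
}
```

Reflection is used only once, in {{load}}; each later call goes through the bound handle with primitive argument and return types.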



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)