[ 
https://issues.apache.org/jira/browse/CASSANDRA-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-7888:
--------------------------------------
    Description: 
The goal of this ticket is to decide on the best way, in terms of ease of use 
and performance, to define User Defined Scalar Functions and User Defined 
Aggregate Functions.
I would like to clarify this point before we add support for User Defined 
Aggregate Functions as part of # 
The current version of UDF created within # supports only Scalar Functions and 
does so by allowing a user to provide classes containing static methods that 
can then be loaded as functions within Cassandra.
The problem with the static method approach is that it forces us internally to 
perform a method call via reflection for each call of the function. So if the 
request loads 10 000 rows, the static method will be called 10 000 times via 
reflection.
As the Method object is cached, the HotSpot compiler will optimize the method 
call after a certain number of iterations. Nevertheless, from a performance 
point of view it is definitely not an optimal situation.
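
To make the cost concrete, here is a minimal sketch of the per-row reflective 
call described above. The names StaticAbs and ReflectiveCallSketch are 
hypothetical and only illustrate the pattern; this is not the actual Cassandra 
code.

```java
import java.lang.reflect.Method;

// Hypothetical user-supplied class exposing a static function.
class StaticAbs
{
    public static double abs(double d)
    {
        return Math.abs(d);
    }
}

class ReflectiveCallSketch
{
    // Looks the Method up once (cached), then invokes it reflectively once per row.
    static double[] applyToRows(double[] rows)
    {
        try
        {
            Method cached = StaticAbs.class.getMethod("abs", double.class);
            double[] results = new double[rows.length];
            for (int i = 0; i < rows.length; i++)
            {
                // Every call boxes the argument and unboxes the returned Object.
                results[i] = (Double) cached.invoke(null, rows[i]);
            }
            return results;
        }
        catch (ReflectiveOperationException e)
        {
            throw new IllegalStateException("Reflective call failed", e);
        }
    }
}
```

Even with the Method cached, every row still goes through Method.invoke, 
which is the overhead the ticket wants to move out of the query path.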

Ideally, a proper solution would confine the performance cost to function 
loading time (when the function is first added or at startup time) and add 
none at query time.

The first solution to that problem would be to force the designer of a new 
function to implement a specific interface like:
{code}
public interface UserDefinedScalarFunction
{
    Object execute(Object... args);
}
{code}
or, for aggregate functions:
{code}
public interface UserDefinedAggregateFunction
{
    UserDefinedAggregate newAggregate();

    public interface UserDefinedAggregate 
    {
        void add(Object... args);

        Object getResult();

        void reset();
    }
} 
{code} 

This will allow us to create one object instance via reflection and then reuse 
that object every time the function is called.
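
A minimal sketch of that loading step, assuming the function class has a 
no-argument constructor. AbsScalarFunction and InterfaceLoadingSketch are 
hypothetical names; the interface is repeated from the ticket so that the 
sketch compiles on its own.

```java
// Interface from the ticket, repeated here so the sketch is self-contained.
interface UserDefinedScalarFunction
{
    Object execute(Object... args);
}

// Hypothetical user implementation.
class AbsScalarFunction implements UserDefinedScalarFunction
{
    public Object execute(Object... args)
    {
        return Math.abs((Double) args[0]);
    }
}

class InterfaceLoadingSketch
{
    // Reflection is used only here, once, when the function is loaded.
    static UserDefinedScalarFunction load(String className)
    {
        try
        {
            Class<?> clazz = Class.forName(className);
            return (UserDefinedScalarFunction) clazz.getDeclaredConstructor().newInstance();
        }
        catch (ReflectiveOperationException e)
        {
            throw new IllegalArgumentException("Cannot load function " + className, e);
        }
    }
}
```

After loading, each row is handled by a plain interface call such as 
function.execute(value), with no reflection at query time.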

The problem with that approach is that we lose the type safety of the 
arguments and of the return type and, as a consequence, we will only be able to 
detect a problem at runtime.
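
The following sketch illustrates the kind of failure meant here. UntypedAbs 
and TypeSafetySketch are hypothetical names. The function assumes a double 
argument, but nothing stops a caller from passing text, and the mistake only 
surfaces when the function is executed.

```java
// Interface from the ticket, repeated here so the sketch is self-contained.
interface UserDefinedScalarFunction
{
    Object execute(Object... args);
}

// Hypothetical implementation that silently assumes a double argument.
class UntypedAbs implements UserDefinedScalarFunction
{
    public Object execute(Object... args)
    {
        return Math.abs((Double) args[0]);
    }
}

class TypeSafetySketch
{
    // Returns true if calling the function with a text argument fails at call time.
    static boolean failsAtRuntime()
    {
        UserDefinedScalarFunction abs = new UntypedAbs();
        try
        {
            // Compiles fine: the interface accepts any Object.
            abs.execute("not a number");
            return false;
        }
        catch (ClassCastException e)
        {
            return true;
        }
    }
}
```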

The second solution would be to force the designer of a new function to create 
a new class in which they mark the method to execute with an annotation.

{code}
public class AbsFunction
{
    @Execute
    public double abs(double d)
    {
        return Math.abs(d);
    }
}
{code}

The same approach for aggregate functions will give:
{code}
public class AvgFunction
{
    private double sum;
    private int count;

    @Add
    public void addValue(double d)
    {
        sum += d;
        count++;
    }

    @Get
    public double getAvg()
    {
        if (count == 0)
            return 0;
        return sum / count;
    }
 
    @Reset
    public void clear()
    {
        sum = 0;
        count = 0;
    }
}
{code}

For this approach to work we need to use code generation, at loading time, to 
extend the provided class with the methods needed to adapt it to our 
framework.
The disadvantage is that we will need to add a new library such as Javassist 
to the libraries used by C*.
Its advantage is that it will allow us to detect type mismatches at creation 
time.
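
The creation-time check could look roughly like the sketch below. It is only 
an illustration under several assumptions: the @Execute annotation is 
hypothetical and declared here with runtime retention, AnnotatedFunctionLoader 
is a made-up name, and a java.lang.invoke.MethodHandle stands in for the 
generated adapter class, so no bytecode generation library is involved.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Method;
import java.util.Arrays;

// Hypothetical marker annotation; runtime retention makes it visible to the loader.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Execute {}

class AbsFunction
{
    @Execute
    public double abs(double d)
    {
        return Math.abs(d);
    }
}

class AnnotatedFunctionLoader
{
    // Finds the @Execute method and rejects it if its signature differs from the declared types.
    static MethodHandle load(Object function, Class<?> returnType, Class<?>... argTypes)
    {
        for (Method m : function.getClass().getDeclaredMethods())
        {
            if (!m.isAnnotationPresent(Execute.class))
                continue;
            if (m.getReturnType() != returnType || !Arrays.equals(m.getParameterTypes(), argTypes))
                throw new IllegalArgumentException("Signature mismatch for " + m);
            try
            {
                // Bind the handle to the instance so callers pass only the arguments.
                return MethodHandles.lookup().unreflect(m).bindTo(function);
            }
            catch (IllegalAccessException e)
            {
                throw new IllegalArgumentException("Cannot access " + m, e);
            }
        }
        throw new IllegalArgumentException("No @Execute method in " + function.getClass());
    }

    // Convenience wrapper for a (double) -> double function.
    static double call(MethodHandle handle, double arg)
    {
        try
        {
            return (double) handle.invoke(arg);
        }
        catch (Throwable t)
        {
            throw new IllegalStateException(t);
        }
    }
}
```

Here, declaring the function with an int argument while the class takes a 
double is rejected by load, before any query runs.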

> Decide the best way to define user-define functions
> ---------------------------------------------------
>
>                 Key: CASSANDRA-7888
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7888
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Lerer
>              Labels: cql
>             Fix For: 3.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
