[ https://issues.apache.org/jira/browse/CASSANDRA-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123003#comment-14123003 ]
Robert Stupp edited comment on CASSANDRA-7888 at 9/5/14 2:36 PM:
-----------------------------------------------------------------

A Java interface would work fine with 'class' UDFs (CASSANDRA-7395). 'java' (CASSANDRA-7562) UDFs and JSR223 (script) UDFs from CASSANDRA-7526 might get complicated. (Although 'java' code generation could be changed.)

I thought of an alternative approach: pass some "result set context" object into UDFs as the first parameter for aggregate functions. That is, each SELECT execution generates one "result set context" object for each aggregate function it uses.
The drawback, of course, is that such a "generic result set context" could not use primitive types ({{int}}, {{long}}, {{double}}, etc.) but only the wrapped types ({{Integer}}, {{Long}}, {{Double}}, etc.).

Or we could let the UDF implementation return some "aggregate interface implementation" which gets called for each row and for the final result.

For example, for 'class' UDFs:
{noformat}
class MyAggregateFunctionContext implements AggregateFunctionResultSet<Double> {
    public void forEachRow(SomeRowOrCell data) {
        // ... per row magic code
    }
    public Double getResult() {
        return resultValue;
    }
}
{noformat}

for 'java' UDFs:
{noformat}
CREATE FUNCTION aggregateMagic ( input double )
RETURNS double
LANGUAGE java
AS '
    return new AggregateFunctionResultSet<Double>() {
        public void forEachRow(SomeRowOrCell data) {
            // ... per row magic code
        }
        public Double getResult() {
            return resultValue;
        }
    };
';
{noformat}

Maybe it's necessary to add some {{CREATE AGGREGATE FUNCTION ...}} syntax to distinguish between scalar and aggregate functions.

BTW: javassist has been added as part of CASSANDRA-7562.

EDIT: strike that {{SomeRowOrCell}} - should read {{Double}}


was (Author: snazy):
A Java interface would work fine with 'class' UDFs (CASSANDRA-7395). 'java' (CASSANDRA-7562) UDFs and JSR223 (script) UDFs from CASSANDRA-7526 might get complicated. (Although 'java' code generation could be changed.)

I thought of an alternative approach: pass some "result set context" object into UDFs as the first parameter for aggregate functions. That is, each SELECT execution generates one "result set context" object for each aggregate function it uses.
The drawback, of course, is that such a "generic result set context" could not use primitive types ({{int}}, {{long}}, {{double}}, etc.) but only the wrapped types ({{Integer}}, {{Long}}, {{Double}}, etc.).

Or we could let the UDF implementation return some "aggregate interface implementation" which gets called for each row and for the final result.

For example, for 'class' UDFs:
{noformat}
class MyAggregateFunctionContext implements AggregateFunctionResultSet<Double> {
    public void forEachRow(SomeRowOrCell data) {
        // ... per row magic code
    }
    public Double getResult() {
        return resultValue;
    }
}
{noformat}

for 'java' UDFs:
{noformat}
CREATE FUNCTION aggregateMagic ( input double )
RETURNS double
LANGUAGE java
AS '
    return new AggregateFunctionResultSet<Double>() {
        public void forEachRow(SomeRowOrCell data) {
            // ... per row magic code
        }
        public Double getResult() {
            return resultValue;
        }
    };
';
{noformat}

Maybe it's necessary to add some {{CREATE AGGREGATE FUNCTION ...}} syntax to distinguish between scalar and aggregate functions.

BTW: javassist has been added as part of CASSANDRA-7562.


> Decide the best way to define user-define functions
> ---------------------------------------------------
>
>                 Key: CASSANDRA-7888
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7888
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Lerer
>              Labels: cql
>             Fix For: 3.0
>
>
> The goal of this ticket is to define what would be the best way, from the ease
> of use and performance point of view, for defining User Defined Scalar
> Function and User Defined Aggregate Function.
> I would like to clarify this point before we add support for User Defined
> Aggregate Function as part of #4914.
> The current version of UDF supports only the addition of Scalar Functions
> and does so by allowing a user to provide some classes containing static
> methods that can then be loaded as functions within Cassandra.
> The problem with the static method approach is that it forces us internally to
> perform a method call via reflection for each call of the function. So if the
> request loads 10 000 rows, the static method will be called 10 000 times via
> reflection.
> As the Method object is cached, the HotSpot compiler will optimize the method
> call after a certain number of iterations. Nevertheless, from a performance
> point of view it is definitely not an optimal situation.
> Ideally, a proper solution from the performance point of view will limit the
> impact to the function loading time (when the function is first added or at
> startup time) and leave query time unaffected.
> The first solution to that problem would be to force the designer of a
> new function to implement a specific interface like:
> {code}
> public interface UserDefinedScalarFunction
> {
>     Object execute(Object... args);
> }
> {code}
> or for aggregate functions
> {code}
> public interface UserDefinedAggregateFunction
> {
>     UserDefinedAggregate newAggregate();
>
>     public interface UserDefinedAggregate
>     {
>         void add(Object... args);
>         Object getResult();
>         void reset();
>     }
> }
> {code}
> This will allow us to create one object instance via reflection and then
> reuse that object every time the function is called.
> The problem with that approach is that we lose the type safety of the
> arguments and of the return type, and as a consequence we will be able to detect
> a problem only at runtime.
> The second solution would be to force the designer of a new function to
> create a new class in which they mark the method to execute with an annotation.
> {code}
> public class AbsFunction
> {
>     @Execute
>     public double abs(double d)
>     {
>         return Math.abs(d);
>     }
> }
> {code}
> The same approach for aggregate functions will give:
> {code}
> public class AvgFunction
> {
>     private double sum;
>     private int count;
>
>     @Add
>     public void addValue(double d)
>     {
>         sum += d;
>         count++;
>     }
>
>     @Get
>     public double getAvg()
>     {
>         if (count == 0)
>             return 0;
>         return sum / count;
>     }
>
>     @Reset
>     public void clear()
>     {
>         sum = 0;
>         count = 0;
>     }
> }
> {code}
> For this approach to work we need to use, at loading time, code generation
> to extend the provided class with the methods needed to adapt the class to
> our framework.
> The disadvantage is that we will need to add a new library like
> javassist to the libraries used by C*.
> The advantage is that it will allow us to detect type mismatches at creation
> time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
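
For illustration only, here is a minimal sketch of the difference between the static-method approach and the first (interface) solution from the ticket description. The {{UserDefinedScalarFunction}} interface is copied from the ticket; the class names {{UdfInvocationSketch}}, {{StaticAbs}} and {{AbsFunction}} and the two helper methods are hypothetical and are not Cassandra code. In the first helper every row goes through {{Method.invoke}}; in the second, reflection is used once to create the instance and each row is then a plain interface call, at the cost of untyped {{Object}} arguments.

```java
import java.lang.reflect.Method;

public class UdfInvocationSketch {
    // Interface as proposed in the ticket description.
    public interface UserDefinedScalarFunction {
        Object execute(Object... args);
    }

    // Hypothetical UDF in the static-method style.
    public static class StaticAbs {
        public static double abs(double d) {
            return Math.abs(d);
        }
    }

    // Hypothetical UDF in the interface style; arguments are untyped.
    public static class AbsFunction implements UserDefinedScalarFunction {
        @Override
        public Object execute(Object... args) {
            return Math.abs((Double) args[0]);
        }
    }

    // Static-method approach: the Method is looked up once, but each row
    // still costs one reflective call.
    public static double sumViaReflection(double[] rows) {
        try {
            Method m = StaticAbs.class.getMethod("abs", double.class);
            double sum = 0;
            for (double r : rows)
                sum += (Double) m.invoke(null, r);
            return sum;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Interface approach: reflection only at load time, then ordinary
    // interface calls per row.
    public static double sumViaInterface(double[] rows) {
        UserDefinedScalarFunction f;
        try {
            f = (UserDefinedScalarFunction) Class.forName(AbsFunction.class.getName())
                                                 .getDeclaredConstructor()
                                                 .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        double sum = 0;
        for (double r : rows)
            sum += (Double) f.execute(r);
        return sum;
    }

    public static void main(String[] args) {
        double[] rows = { -1.5, 2.0, -3.0 };
        System.out.println(sumViaReflection(rows));
        System.out.println(sumViaInterface(rows));
    }
}
```

Note the cast in {{AbsFunction.execute}}: passing a non-{{Double}} argument would fail only when the function runs, which is the loss of type safety the ticket describes for this solution.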