Re: Some questions on UDFs

Jim Bates Sat, 04 Jul 2015 17:40:36 -0700

I still have issues finding the correct way to create and use a
RepeatedHolder, and Writers are a non-starter for Workspace values. I can
make do with creating a concatenated string in a VarCharHolder for small
data sets to get past this in the short term and finish testing the output
values I expect, but I won't be able to do anything at scale until I figure
out how to make a repeated list.
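
A rough sketch of that short-term workaround, in plain Java with no Drill
classes. ConcatWorkspaceSketch and its members are made-up names that only
model a VarCharHolder workspace carrying a delimited string between add()
calls; none of this is Drill API, and the comma delimiter is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatWorkspaceSketch {
    // Hypothetical stand-in for the text a VarCharHolder workspace would hold.
    private final StringBuilder joined = new StringBuilder();

    // Called once per incoming value, in the role of add() in an aggregate.
    public void add(long value) {
        if (joined.length() > 0) {
            joined.append(',');   // assumed delimiter
        }
        joined.append(value);
    }

    // Called at the end, in the role of output(): split the string back into a list.
    public List<Long> output() {
        List<Long> values = new ArrayList<>();
        if (joined.length() == 0) {
            return values;        // nothing was aggregated
        }
        for (String part : joined.toString().split(",")) {
            values.add(Long.parseLong(part));
        }
        return values;
    }

    public static void main(String[] args) {
        ConcatWorkspaceSketch agg = new ConcatWorkspaceSketch();
        agg.add(3);
        agg.add(-1);
        agg.add(42);
        System.out.println(agg.output()); // prints [3, -1, 42]
    }
}
```

The string grows with every value and gets re-parsed at the end, which is
why this only holds up for small data sets.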


On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates <[email protected]> wrote:

> Well... Converting from string to integers anyway... Too many 4th of July
> hot dogs. Going into nitrate overload. :)
>
> I am pulling an array of string values from json data. The string values
> are actually integers. I am converting to integers and summing each array
> entry to the final tally.
>
> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates <[email protected]> wrote:
>
>> Ted,
>>
>> Yes, I started out just getting a basic count to work. I am trying to
>> keep the workflow as close to a basic user as possible. As such, I am
>> building and using the MapR Apache Drill sandbox to test.
>>
>>
>>    1. Always look at the drillbit.log file to see if Drill had any
>>    issues loading your UDF. That was where I learned that all workspace
>>    values needed to be holders:
>>       - WARN  o.a.d.exec.expr.fn.FunctionConverter - Failure loading
>>       function class
>>       com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
>>       field xList. Aggregate function 'MyLinearRegression1' workspace
>>       variable 'xList' is of type 'interface
>>       org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
>>       Please change it to Holder type.
>>    2. Error messages:
>>       - If you get an error in this format, it means that Drill cannot
>>       find your function, so it probably didn't load it. Back to step 1:
>>          - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
>>          match found for function signature MyFunctionName(<ANY>)
>>       - If you get an error in this format, it means that the function is
>>       there but Drill could not find a signature that matched the
>>       parameter types or the number of parameters you were passing it.
>>       The exact wording will change, but "Missing function
>>       implementation" is the key phrase to look for:
>>          - Error: SYSTEM ERROR:
>>          org.apache.drill.exec.exception.SchemaChangeException: Failure
>>          while trying to materialize incoming schema.  Errors:
>>          Error in expression at index -1.  Error: Missing function
>>          implementation: [castBIGINT(VARCHAR-REPEATED)].  Full expression:
>>          --UNKNOWN EXPRESSION--
>>    3. In your function definition for aggregate functions you need to
>>    set null processing to internal and your isRandom to false. Example below:
>>       - @FunctionTemplate(name = "MyFunctionName", scope =
>>       FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
>>       FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
>>       isBinaryCommutative = false, costCategory =
>>       FunctionTemplate.FunctionCostCategory.COMPLEX)
>>
>> Below is an example from the Apache Drill tutorial data sets contained in
>> the MapR Apache Drill sandbox. I am pulling an array of string values from
>> JSON data. The string values are actually integers. I am converting to
>> integers and summing each array entry to the final tally. This in no way
>> represents what this data was for, but it did become a handy way for me to
>> peck out the "correct" way to build an aggregation UDF function.
>>
>> @FunctionTemplate(name = "MyArraySum", scope =
>> FunctionTemplate.FunctionScope.POINT_AGGREGATE, nulls =
>> FunctionTemplate.NullHandling.INTERNAL, isRandom = false,
>> isBinaryCommutative = false, costCategory =
>> FunctionTemplate.FunctionCostCategory.COMPLEX)
>> public static class MyArraySum implements DrillAggFunc {
>>
>> @Param RepeatedVarCharHolder listToSearch;
>> @Workspace NullableBigIntHolder count;
>> @Workspace NullableBigIntHolder sum;
>> @Workspace NullableVarCharHolder vc;
>> @Output BigIntHolder out;
>>
>> @Override
>> public void setup() {
>>   count.value = 0;
>>   sum.value = 0;
>> }
>>
>> @Override
>> public void add() {
>>   int val = 0;
>>   try {
>>     // start and end are offsets into the shared vector for this row's array
>>     for (int i = listToSearch.start; i < listToSearch.end; i++) {
>>       listToSearch.vector.getAccessor().get(i, vc);
>>       String inputStr =
>>         org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(vc.start,
>>           vc.end, vc.buffer);
>>       val = Integer.parseInt(inputStr);
>>       sum.value = sum.value + val;
>>     }
>>   } catch (Exception e) {
>>     // a bad entry ends processing of this row; entries already summed are kept
>>     val = 0;
>>   }
>>   count.value = count.value + 1;
>> }
>>
>> // Assumed minimal output() and reset(), which DrillAggFunc also requires.
>> @Override
>> public void output() {
>>   out.value = sum.value;
>> }
>>
>> @Override
>> public void reset() {
>>   count.value = 0;
>>   sum.value = 0;
>> }
>> }
>>
>> Example select statement:
>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
>>
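For anyone who wants to poke at the parse-and-sum logic outside of Drill,
here is a plain-Java model of the setup -> add -> output -> reset flow.
ArraySumSketch and its fields are hypothetical names, not Drill classes, and
the sketch makes one assumption that differs from the UDF: it skips only the
malformed entry instead of abandoning the rest of the row.

```java
import java.util.Arrays;
import java.util.List;

public class ArraySumSketch {
    private long sum;   // plays the role of the @Workspace sum
    private long rows;  // plays the role of the @Workspace count

    // Initialize the workspace values before any rows arrive.
    public void setup() {
        sum = 0;
        rows = 0;
    }

    // One call per row; each row carries an array of integer-like strings.
    public void add(List<String> row) {
        for (String s : row) {
            if (s == null) {
                continue; // treat a missing entry as contributing nothing
            }
            try {
                sum += Long.parseLong(s.trim());
            } catch (NumberFormatException e) {
                // assumed policy: skip only the entry that failed to parse
            }
        }
        rows++;
    }

    // Final value for the group.
    public long output() {
        return sum;
    }

    // Clear the workspace so the next group starts from zero.
    public void reset() {
        setup();
    }

    public static void main(String[] args) {
        ArraySumSketch agg = new ArraySumSketch();
        agg.setup();
        agg.add(Arrays.asList("1", "2", "x"));
        agg.add(Arrays.asList("10"));
        System.out.println(agg.output()); // prints 13
    }
}
```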
>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning <[email protected]> wrote:
>>
>>> Jim,
>>>
>>> I think that you may be having trouble with aggregators in general.
>>>
>>> Have you been able to build *any* aggregator of anything?  I haven't.
>>>
>>> When I try to build an aggregator of int's or doubles, I get a very
>>> persistent problem with Drill even seeing my aggregates:
>>>
>>> 0: jdbc:drill:zk=local> *select sum_int(employee_id) from
>>> cp.`employee.json`;*
>>>
>>> Jul 04, 2015 4:19:35 PM
>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>>>
>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
>>> found for function signature sum_int(<ANY>)
>>>
>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException
>>> <init>
>>>
>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
>>> column 8 to line 1, column 27: No match found for function signature
>>> sum_int(<ANY>)
>>>
>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match
>>> found for function signature sum_int(<ANY>)*
>>>
>>> *[Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010
>>> <http://10.0.1.2:31010>] (state=,code=0)*
>>>
>>> 0: jdbc:drill:zk=local> *select sum_int(cast(employee_id as int)) from
>>> cp.`employee.json`*;
>>>
>>> Jul 04, 2015 4:19:45 PM
>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>>>
>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
>>> found for function signature sum_int(<NUMERIC>)
>>>
>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException
>>> <init>
>>>
>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
>>> column 8 to line 1, column 40: No match found for function signature
>>> sum_int(<NUMERIC>)
>>>
>>> *Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match
>>> found for function signature sum_int(<NUMERIC>)*
>>>
>>> *[Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010
>>> <http://10.0.1.2:31010>] (state=,code=0)*
>>>
>>> 0: jdbc:drill:zk=local>
>>>
>>>
>>> It looks like there is some undocumented subtlety about how to register
>>> an aggregator.
>>>
>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates <[email protected]> wrote:
>>>
>>> > I'm working on the same thing. I want to aggregate a list of values. It
>>> > has been a search-and-guess game for the most part. I'm still stuck in
>>> > the process of getting the values all into a list. The writers look
>>> > interesting, but for aggregation functions it looks like the input is
>>> > the param and output objects can't hold the aggregation steps. The
>>> > Workspace is where that happens. If I try to use a Writer in a
>>> > workspace it won't load and tells me to change it to Holders, which was
>>> > why I was using them to start with. Maybe I'm missing the architecture
>>> > of the agg function. It looked like it was....
>>> >
>>> > @Param comes in -> initialize @Workspace vars in setup -> process data
>>> > through @Workspace vars in add -> finalize @Output in output.
>>> >
>>> > So I'm back to trying to figure out how to create a RepeatedBigIntHolder
>>> > or a RepeatedVarCharHolder...
>>> >
>>> >
>>> >
>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning <[email protected]> wrote:
>>> >
>>> > > I am working on trying to build any kind of list-constructing
>>> > > aggregator and having absolute fits.
>>> > >
>>> > > To simplify life, I decided to just build a generic list builder that
>>> > > is a scalar function that returns a list containing its argument.
>>> > > Thus zoop(3) => [3], zoop('abc') => ['abc'] and zoop([1,2,3]) =>
>>> > > [[1,2,3]].
>>> > >
>>> > > The ComplexWriter looks like the place to go. As usual, the complete
>>> > > lack of comments in most of Drill makes this very hard since I have to
>>> > > guess what works and what doesn't.
>>> > >
>>> > > In my code, I note that ComplexWriter has a nice rootAsList() method.
>>> > > I used this in zip and it works nicely to construct lists for output.
>>> > > I note that the resulting ListWriter has a method
>>> > > copyReader(FieldReader var1) which looks really good.
>>> > >
>>> > > Unfortunately, the only implementation of copyReader() is in
>>> > > AbstractFieldWriter and it looks like this:
>>> > >
>>> > > public void copyReader(FieldReader reader) {
>>> > >     this.fail("Copy FieldReader");
>>> > > }
>>> > >
>>> > > I would like to formally say at this point "WTF"?
>>> > >
>>> > > In digging in further, I see other methods that look handy like
>>> > >
>>> > > public void write(IntHolder holder) {
>>> > >     this.fail("Int");
>>> > > }
>>> > >
>>> > > And then in looking at implementations, it looks like there is a
>>> > > combinatorial explosion because every type seems to need a write
>>> > > method for every other type.
>>> > >
>>> > > What is the thought here?  How can I copy an arbitrary value into a
>>> > > list?
>>> > >
>>> > > My next thought was to build code that dispatches on type.  There is a
>>> > > method called getType() on the FieldReader.  Unfortunately, that
>>> > > drives into code generated by protoc and I see no way to dispatch on
>>> > > the type of an incoming value.
>>> > >
>>> > >
>>> > > How is this supposed to work?
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid <[email protected]> wrote:
>>> > >
>>> > > > For a detailed example on using the ComplexWriter interface you can
>>> > > > take a look at the Mappify
>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java>
>>> > > > (kvgen) function. The function itself is very simple; however, it
>>> > > > makes use of the utility methods in MappifyUtility
>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java>
>>> > > > and MapUtility
>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java>
>>> > > > which perform most of the work.
>>> > > >
>>> > > > Currently we don't have a generic infrastructure to handle errors
>>> > > > coming out of functions. However, there is UserException, which when
>>> > > > raised will make sure that Drill does not gobble up the error message
>>> > > > in that exception. So you can probably throw a UserException with the
>>> > > > failing input in your function to make sure it propagates to the
>>> > > > user.
>>> > > >
>>> > > > Thanks
>>> > > > Mehant
>>> > > >
>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau <[email protected]> wrote:
>>> > > >
>>> > > > > *Holders are for both input and output.  You can also use
>>> > > > > ComplexWriter for output and FieldReader for input if you want to
>>> > > > > write or read a complex value.
>>> > > > >
>>> > > > > I don't think we've provided a really clean way to construct a
>>> > > > > Repeated*Holder for output purposes.  You can probably do it by
>>> > > > > reaching into a bunch of internal interfaces in Drill.  However, I
>>> > > > > would recommend using the ComplexWriter output pattern for now.
>>> > > > > This will be a little less efficient but substantially less
>>> > > > > brittle.  I suggest you open up a jira for using a Repeated*Holder
>>> > > > > as an output.
>>> > > > >
>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning <[email protected]> wrote:
>>> > > > >
>>> > > > > > Holders are for input, I think.
>>> > > > > >
>>> > > > > > Try the different kinds of writers.
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates <[email protected]> wrote:
>>> > > > > >
>>> > > > > > > Using a RepeatedHolder as a @Param I've got working. I was
>>> > > > > > > working on a custom aggregator function using DrillAggFunc.
>>> > > > > > > In this I can do simple things, but if I want to build a list
>>> > > > > > > of values and do something with it in the final output method
>>> > > > > > > I think I need to use RepeatedHolders in the @Workspace. To do
>>> > > > > > > that I need to create a new one in the setup method. I can't
>>> > > > > > > get one built. They all require a BufferAllocator to be passed
>>> > > > > > > in to build it. I have not found a way to get an allocator
>>> > > > > > > yet. Any suggestions?
>>> > > > > > >
>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning <[email protected]> wrote:
>>> > > > > > >
>>> > > > > > > > If you look at the zip function in
>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions you
>>> > > > > > > > can see an example of building a structure.
>>> > > > > > > >
>>> > > > > > > > The basic idea is that your output is denoted as
>>> > > > > > > >
>>> > > > > > > >         @Output
>>> > > > > > > >         BaseWriter.ComplexWriter writer;
>>> > > > > > > >
>>> > > > > > > > The pattern for building a list of lists of integers is
>>> > > > > > > > like this:
>>> > > > > > > >
>>> > > > > > > >         writer.setValueCount(n);
>>> > > > > > > >         ...
>>> > > > > > > >         BaseWriter.ListWriter outer = writer.rootAsList();
>>> > > > > > > >         outer.start(); // [ outer list
>>> > > > > > > >         ...
>>> > > > > > > >         // for each inner list {
>>> > > > > > > >             BaseWriter.ListWriter inner = outer.list();
>>> > > > > > > >             inner.start(); // [ inner list
>>> > > > > > > >             // for each inner list element {
>>> > > > > > > >                 inner.integer().writeInt(accessor.get(i));
>>> > > > > > > >             // }
>>> > > > > > > >             inner.end();   // ] inner list
>>> > > > > > > >         // }
>>> > > > > > > >         outer.end(); // ] outer list
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates <[email protected]> wrote:
>>> > > > > > > >
>>> > > > > > > > > I have working aggregation and simple UDFs. I've been
>>> > > > > > > > > trying to document and understand each of the options
>>> > > > > > > > > available in a Drill UDF: understanding the different
>>> > > > > > > > > FunctionScopes, the ones that are allowed and the ones
>>> > > > > > > > > that are not; the impact of different cost categories;
>>> > > > > > > > > and the different steps needed to handle any of the
>>> > > > > > > > > supported data types and structures in Drill.
>>> > > > > > > > >
>>> > > > > > > > > Here are a few of my current roadblocks. Any pointers
>>> > > > > > > > > would be greatly appreciated.
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >    1. I've been trying to understand how to correctly use
>>> > > > > > > > >    RepeatedHolders of whatever type. For this discussion
>>> > > > > > > > >    let's start with a RepeatedBigIntHolder. I'm trying to
>>> > > > > > > > >    figure out the best way to create a new one. I have not
>>> > > > > > > > >    figured out where in the existing Drill code someone
>>> > > > > > > > >    does this. If I use a RepeatedBigIntHolder as a
>>> > > > > > > > >    Workspace object it is null to start with. I created a
>>> > > > > > > > >    new one in the startup section of the UDF but the
>>> > > > > > > > >    vector was null. I can find no reference on creating a
>>> > > > > > > > >    new BigIntVector. There is a way to create a
>>> > > > > > > > >    BigIntVector and I did find an example of creating a
>>> > > > > > > > >    new VarCharVector, but I can't do that using the Drill
>>> > > > > > > > >    jar files from 1.0. The
>>> > > > > > > > >    org.apache.drill.common.types.TypeProtos and
>>> > > > > > > > >    org.apache.drill.common.types.TypeProtos.MinorType
>>> > > > > > > > >    classes do not appear to be accessible from the Drill
>>> > > > > > > > >    jar files.
>>> > > > > > > > >    2. What is the best way to close out a UDF in the event
>>> > > > > > > > >    it generates an exception? Are there specific steps one
>>> > > > > > > > >    should follow to make a clean exit in a catch block
>>> > > > > > > > >    that are beneficial to Drill?
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>
