I still haven't found the correct way to create and use a RepeatedHolder, and Writers are a non-starter for @Workspace values. For small data sets I can make do by building a concatenated string in a VarCharHolder, which gets me past this in the short term and lets me finish testing the output values I expect, but I won't be able to run anything at scale until I figure out how to build a repeated list.
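Roughly, the stopgap looks like the sketch below. It is plain Java, not Drill UDF code: the class and method names are hypothetical, the StringBuilder only stands in for a VarCharHolder workspace value, and the three methods merely mirror the setup/add/output steps of an aggregate function.

```java
/**
 * Plain-Java model of the "concatenated string" stopgap.
 * NOT Drill API code; every name here is hypothetical.
 */
public class ConcatAggSketch {
    // Stand-in for a VarCharHolder workspace value: one delimited string.
    private StringBuilder joined;

    // Mirrors setup(): initialize the workspace state.
    public void setup() {
        joined = new StringBuilder();
    }

    // Mirrors add(): called once per row with that row's array of strings.
    public void add(String[] row) {
        for (String s : row) {
            if (joined.length() > 0) {
                joined.append(',');
            }
            joined.append(s);
        }
    }

    // Mirrors output(): split the string back into values and sum them.
    public long output() {
        long sum = 0;
        if (joined.length() == 0) {
            return sum;
        }
        for (String s : joined.toString().split(",")) {
            try {
                sum += Long.parseLong(s.trim());
            } catch (NumberFormatException e) {
                // Skip entries that are not integers.
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        ConcatAggSketch agg = new ConcatAggSketch();
        agg.setup();
        agg.add(new String[] {"1", "2"});
        agg.add(new String[] {"3"});
        System.out.println(agg.output()); // prints 6
    }
}
```

The string grows with every row and has to be re-parsed at the end, and any value containing the delimiter would break it, which is why this only holds up for small test sets.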
On Sat, Jul 4, 2015 at 7:12 PM, Jim Bates wrote:

> Well... Converting from string to integers anyway... Too many 4th of July
> hot dogs. Going into nitrate overload. :)
>
> I am pulling an array of string values from JSON data. The string values
> are actually integers. I am converting to integers and summing each array
> entry to the final tally.
>
> On Sat, Jul 4, 2015 at 7:04 PM, Jim Bates wrote:
>
>> Ted,
>>
>> Yes, I started out just getting a basic count to work. I am trying to
>> keep the workflow as close to a basic user as possible. As such, I am
>> building and using the MapR Apache Drill sandbox to test.
>>
>> 1. Always look at the drillbit.log file to see if Drill had any issues
>>    loading your UDF. That was where I learned that all workspace values
>>    needed to be holders:
>>    - WARN o.a.d.exec.expr.fn.FunctionConverter - Failure loading
>>      function class
>>      com.mapr.example.udfs.drill.MyDrillAggFunctions$MyLinearRegression1,
>>      field xList. Aggregate function 'MyLinearRegression1' workspace
>>      variable 'xList' is of type 'interface
>>      org.apache.drill.exec.vector.complex.writer.BaseWriter$ComplexWriter'.
>>      Please change it to Holder type.
>> 2. Error messages:
>>    - If you get an error in this format, it means that Drill cannot
>>      find your function, so it probably didn't load it. Back to step 1:
>>      - PARSE ERROR: From line 1, column 8 to line 1, column 44: No
>>        match found for function signature MyFunctionName(<ANY>)
>>    - If you get an error in this format, it means that the function is
>>      there but Drill could not find a signature that matched the param
>>      types or the number of params you were passing it. The exact
>>      wording will change, but "Missing function implementation" is the
>>      key phrase to look for:
>>      - Error: SYSTEM ERROR:
>>        org.apache.drill.exec.exception.SchemaChangeException: Failure
>>        while trying to materialize incoming schema. Errors:
>>        Error in expression at index -1. Error: Missing function
>>        implementation: [castBIGINT(VARCHAR-REPEATED)]. Full expression:
>>        --UNKNOWN EXPRESSION--
>> 3. In your function definition for aggregate functions you need to set
>>    null processing to internal and your isRandom to false. Example
>>    below:
>>    - @FunctionTemplate(name = "MyFunctionName",
>>          scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
>>          nulls = FunctionTemplate.NullHandling.INTERNAL,
>>          isRandom = false,
>>          isBinaryCommutative = false,
>>          costCategory = FunctionTemplate.FunctionCostCategory.COMPLEX)
>>
>> Below is an example from the Apache Drill tutorial data sets contained in
>> the MapR Apache Drill sandbox. I am pulling an array of string values from
>> JSON data. The string values are actually integers. I am converting to
>> string and summing each array entry to the final tally. This in no way
>> represents what this data was for, but it did become a handy way for me to
>> peck out the "correct" way to build an aggregation UDF function.
>>
>> @FunctionTemplate(name = "MyArraySum",
>>     scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
>>     nulls = FunctionTemplate.NullHandling.INTERNAL,
>>     isRandom = false,
>>     isBinaryCommutative = false,
>>     costCategory = FunctionTemplate.FunctionCostCategory.COMPLEX)
>> public static class MyArraySum implements DrillAggFunc {
>>
>>   @Param RepeatedVarCharHolder listToSearch;
>>   @Workspace NullableBigIntHolder count;
>>   @Workspace NullableBigIntHolder sum;
>>   @Workspace NullableVarCharHolder vc;
>>   @Output BigIntHolder out;
>>
>>   @Override
>>   public void setup() {
>>     count.value = 0;
>>     sum.value = 0;
>>   }
>>
>>   @Override
>>   public void add() {
>>     int c = listToSearch.end - listToSearch.start;
>>     int val = 0;
>>     try {
>>       for (int i = 0; i < c; i++) {
>>         listToSearch.vector.getAccessor().get(i, vc);
>>         String inputStr =
>>             org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(
>>                 vc.start, vc.end, vc.buffer);
>>         val = Integer.parseInt(inputStr);
>>         sum.value = sum.value + val;
>>       }
>>     } catch (Exception e) {
>>       val = 0;
>>     }
>>     count.value = count.value + 1;
>>   }
>>
>> Example select statement:
>>
>> SELECT MyArraySum(my_arrays) FROM (SELECT t.trans_info.prod_id as
>> my_arrays FROM `dfs.clicks`.`./clicks/clicks.campaign.json` t limit 5);
>>
>> On Sat, Jul 4, 2015 at 6:22 PM, Ted Dunning wrote:
>>
>>> Jim,
>>>
>>> I think that you may be having trouble with aggregators in general.
>>>
>>> Have you been able to build *any* aggregator of anything? I haven't.
>>>
>>> When I try to build an aggregator of ints or doubles, I get a very
>>> persistent problem with Drill even seeing my aggregates:
>>>
>>> 0: jdbc:drill:zk=local> select sum_int(employee_id) from
>>> cp.`employee.json`;
>>>
>>> Jul 04, 2015 4:19:35 PM
>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>>>
>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
>>> found for function signature sum_int(<ANY>)
>>>
>>> Jul 04, 2015 4:19:35 PM org.apache.calcite.runtime.CalciteException
>>> <init>
>>>
>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
>>> column 8 to line 1, column 27: No match found for function signature
>>> sum_int(<ANY>)
>>>
>>> Error: PARSE ERROR: From line 1, column 8 to line 1, column 27: No match
>>> found for function signature sum_int(<ANY>)
>>>
>>> [Error Id: 91b78fa6-6dd1-4214-a85f-c2bf2c393145 on 10.0.1.2:31010]
>>> (state=,code=0)
>>>
>>> 0: jdbc:drill:zk=local> select sum_int(cast(employee_id as int)) from
>>> cp.`employee.json`;
>>>
>>> Jul 04, 2015 4:19:45 PM
>>> org.apache.calcite.sql.validate.SqlValidatorException <init>
>>>
>>> SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: No match
>>> found for function signature sum_int(<NUMERIC>)
>>>
>>> Jul 04, 2015 4:19:45 PM org.apache.calcite.runtime.CalciteException
>>> <init>
>>>
>>> SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1,
>>> column 8 to line 1, column 40: No match found for function signature
>>> sum_int(<NUMERIC>)
>>>
>>> Error: PARSE ERROR: From line 1, column 8 to line 1, column 40: No match
>>> found for function signature sum_int(<NUMERIC>)
>>>
>>> [Error Id: f649fc85-6b6a-4468-9a4f-bfef0b23d06b on 10.0.1.2:31010]
>>> (state=,code=0)
>>>
>>> 0: jdbc:drill:zk=local>
>>>
>>> It looks like there is some undocumented subtlety about how to register
>>> an aggregator.
>>>
>>> On Sat, Jul 4, 2015 at 4:08 PM, Jim Bates wrote:
>>>
>>> > I'm working on the same thing. I want to aggregate a list of values.
>>> > It has been a search-and-guess game for the most part. I'm still stuck
>>> > in the process of getting the values all into a list. The writers look
>>> > interesting, but for aggregation functions it looks like the input is
>>> > the param and output objects can't hold the aggregation steps. The
>>> > Workspace is where that happens. If I try to use a Writer in a
>>> > workspace it won't load and tells me to change it to Holders, which was
>>> > why I was using them to start with. Maybe I'm missing the architecture
>>> > of the agg function. It looked like it was....
>>> >
>>> > @Param comes in -> initialize @Workspace vars in setup -> process data
>>> > through @Workspace vars in add -> finalize @Output in output.
>>> >
>>> > So I'm back to trying to figure out how to create a
>>> > RepeatedBigIntHolder or a RepeatedVarCharHolder...
>>> >
>>> > On Sat, Jul 4, 2015 at 4:53 PM, Ted Dunning wrote:
>>> >
>>> > > I am working on trying to build any kind of list-constructing
>>> > > aggregator and having absolute fits.
>>> > >
>>> > > To simplify life, I decided to just build a generic list builder
>>> > > that is a scalar function that returns a list containing its
>>> > > argument. Thus zoop(3) => [3], zoop('abc') => ['abc'] and
>>> > > zoop([1,2,3]) => [[1,2,3]].
>>> > >
>>> > > The ComplexWriter looks like the place to go. As usual, the complete
>>> > > lack of comments in most of Drill makes this very hard since I have
>>> > > to guess what works and what doesn't.
>>> > >
>>> > > In my code, I note that ComplexWriter has a nice rootAsList()
>>> > > method. I used this in zip and it works nicely to construct lists
>>> > > for output. I note that the resulting ListWriter has a method
>>> > > copyReader(FieldReader var1) which looks really good.
>>> > >
>>> > > Unfortunately, the only implementation of copyReader() is in
>>> > > AbstractFieldWriter and it looks like this:
>>> > >
>>> > >   public void copyReader(FieldReader reader) {
>>> > >     this.fail("Copy FieldReader");
>>> > >   }
>>> > >
>>> > > I would like to formally say at this point "WTF"?
>>> > >
>>> > > In digging in further, I see other methods that look handy like
>>> > >
>>> > >   public void write(IntHolder holder) {
>>> > >     this.fail("Int");
>>> > >   }
>>> > >
>>> > > And then in looking at implementations, it looks like there is a
>>> > > combinatorial explosion because every type seems to need a write
>>> > > method for every other type.
>>> > >
>>> > > What is the thought here? How can I copy an arbitrary value into a
>>> > > list?
>>> > >
>>> > > My next thought was to build code that dispatches on type. There is
>>> > > a method called getType() on the FieldReader. Unfortunately, that
>>> > > drives into code generated by protoc and I see no way to dispatch on
>>> > > the type of an incoming value.
>>> > >
>>> > > How is this supposed to work?
>>> > >
>>> > > On Sat, Jul 4, 2015 at 2:14 PM, mehant baid wrote:
>>> > >
>>> > > > For a detailed example of using the ComplexWriter interface you
>>> > > > can take a look at the Mappify
>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java>
>>> > > > (kvgen) function. The function itself is very simple; however, it
>>> > > > makes use of the utility methods in MappifyUtility
>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/MappifyUtility.java>
>>> > > > and MapUtility
>>> > > > <https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/MapUtility.java>
>>> > > > which perform most of the work.
>>> > > >
>>> > > > Currently we don't have a generic infrastructure to handle errors
>>> > > > coming out of functions. However, there is UserException, which
>>> > > > when raised will make sure that Drill does not gobble up the error
>>> > > > message in that exception. So you can probably throw a
>>> > > > UserException with the failing input in your function to make sure
>>> > > > it propagates to the user.
>>> > > >
>>> > > > Thanks
>>> > > > Mehant
>>> > > >
>>> > > > On Sat, Jul 4, 2015 at 1:48 PM, Jacques Nadeau wrote:
>>> > > >
>>> > > > > *Holders are for both input and output. You can also use
>>> > > > > ComplexWriter for output and FieldReader for input if you want
>>> > > > > to write or read a complex value.
>>> > > > >
>>> > > > > I don't think we've provided a really clean way to construct a
>>> > > > > Repeated*Holder for output purposes. You can probably do it by
>>> > > > > reaching into a bunch of internal interfaces in Drill. However,
>>> > > > > I would recommend using the ComplexWriter output pattern for
>>> > > > > now. This will be a little less efficient but substantially less
>>> > > > > brittle. I suggest you open up a JIRA for using a
>>> > > > > Repeated*Holder as an output.
>>> > > > >
>>> > > > > On Sat, Jul 4, 2015 at 1:38 PM, Ted Dunning wrote:
>>> > > > >
>>> > > > > > Holders are for input, I think.
>>> > > > > >
>>> > > > > > Try the different kinds of writers.
>>> > > > > >
>>> > > > > > On Sat, Jul 4, 2015 at 12:49 PM, Jim Bates wrote:
>>> > > > > >
>>> > > > > > > Using a RepeatedHolder as a @Param I've got working. I was
>>> > > > > > > working on a custom aggregator function using DrillAggFunc.
>>> > > > > > > In this I can do simple things, but if I want to build a
>>> > > > > > > list of values and do something with it in the final output
>>> > > > > > > method, I think I need to use RepeatedHolders in the
>>> > > > > > > @Workspace. To do that I need to create a new one in the
>>> > > > > > > setup method. I can't get one built. They all require a
>>> > > > > > > BufferAllocator to be passed in to build it. I have not
>>> > > > > > > found a way to get an allocator yet. Any suggestions?
>>> > > > > > >
>>> > > > > > > On Sat, Jul 4, 2015 at 1:37 PM, Ted Dunning wrote:
>>> > > > > > >
>>> > > > > > > > If you look at the zip function in
>>> > > > > > > > https://github.com/mapr-demos/simple-drill-functions you
>>> > > > > > > > can have an example of building a structure.
>>> > > > > > > >
>>> > > > > > > > The basic idea is that your output is denoted as
>>> > > > > > > >
>>> > > > > > > >   @Output
>>> > > > > > > >   BaseWriter.ComplexWriter writer;
>>> > > > > > > >
>>> > > > > > > > The pattern for building a list of lists of integers is
>>> > > > > > > > like this:
>>> > > > > > > >
>>> > > > > > > >   writer.setValueCount(n);
>>> > > > > > > >   ...
>>> > > > > > > >   BaseWriter.ListWriter outer = writer.rootAsList();
>>> > > > > > > >   outer.start(); // [ outer list
>>> > > > > > > >   ...
>>> > > > > > > >   // for each inner list
>>> > > > > > > >     BaseWriter.ListWriter inner = outer.list();
>>> > > > > > > >     inner.start();
>>> > > > > > > >     // for each inner list element
>>> > > > > > > >       inner.integer().writeInt(accessor.get(i));
>>> > > > > > > >     }
>>> > > > > > > >     inner.end(); // ] inner list
>>> > > > > > > >   }
>>> > > > > > > >   outer.end(); // ] outer list
>>> > > > > > > >
>>> > > > > > > > On Sat, Jul 4, 2015 at 10:29 AM, Jim Bates wrote:
>>> > > > > > > >
>>> > > > > > > > > I have working aggregation and simple UDFs. I've been
>>> > > > > > > > > trying to document and understand each of the options
>>> > > > > > > > > available in a Drill UDF: the different FunctionScopes,
>>> > > > > > > > > the ones that are allowed and the ones that are not;
>>> > > > > > > > > the impact of different cost categories; the different
>>> > > > > > > > > steps needed to handle any of the supported data types
>>> > > > > > > > > and structures in Drill.
>>> > > > > > > > >
>>> > > > > > > > > Here are a few of my current roadblocks. Any pointers
>>> > > > > > > > > would be greatly appreciated.
>>> > > > > > > > >
>>> > > > > > > > > 1. I've been trying to understand how to correctly use
>>> > > > > > > > >    RepeatedHolders of whatever type. For this
>>> > > > > > > > >    discussion let's start with a RepeatedBigIntHolder.
>>> > > > > > > > >    I'm trying to figure out the best way to create a
>>> > > > > > > > >    new one. I have not figured out where in the
>>> > > > > > > > >    existing Drill code someone does this. If I use a
>>> > > > > > > > >    RepeatedBigIntHolder as a Workspace object it is
>>> > > > > > > > >    null to start with. I created a new one in the
>>> > > > > > > > >    startup section of the UDF but the vector was null.
>>> > > > > > > > >    I can find no reference for creating a new
>>> > > > > > > > >    BigIntVector. There is a way to create a
>>> > > > > > > > >    BigIntVector and I did find an example of creating a
>>> > > > > > > > >    new VarCharVector, but I can't do that using the
>>> > > > > > > > >    Drill jar files from 1.0. The
>>> > > > > > > > >    org.apache.drill.common.types.TypeProtos and the
>>> > > > > > > > >    org.apache.drill.common.types.TypeProtos.MinorType
>>> > > > > > > > >    classes do not appear to be accessible from the
>>> > > > > > > > >    Drill jar files.
>>> > > > > > > > > 2. What is the best way to close out a UDF in the event
>>> > > > > > > > >    it generates an exception? Are there specific steps
>>> > > > > > > > >    one should follow to make a clean exit in a catch
>>> > > > > > > > >    block that are beneficial to Drill?
