#3 is incorrectly labelled and should be called "3. since it seems unclear what type UDFs return then Drill will, under some circumstances, assume arithmetic comparison (=, <>, etc.) rather than string comparison even though the UDF returns VarChar
On Sat, Jul 25, 2015 at 4:18 PM, Stefán Baxter <ste...@activitystream.com> wrote: > Hi, > > I'm still working on our evaluation and now focusing on directory based > queries and mixing directory and parquet based partitions. > This is also a continued trip down the UDF rabbit hole :) (some pun > intended) > > I continue to come across things that surprise me and I would like to > share them with both the developers, that might want to address some of > them, and other newcomers that might benefit from them. > > The UDF code referenced here van be found at ( > https://github.com/acmeguy/asdrill) and the documents and the directory > structure used in these examples are included in the tiny attachment. > > I will try to keep this as brief as possible. > > *What you need to know is:* > > - there are 33 records in the 19 files in a mixed directory structure > - see zip for details > > - each record contains a date that is valid within that directory > structure > > - the dirInRange function is a UDF that takes a date range and > directory information to determine if a directory contains target data for > the provided date range > - see github for details > > - the dirInRange function should be able to accept all null values or > missing parameters for everything other than the time range starts > parameter > > - the dirInRange function returns a string with a number that > represents the number of parameters used (function variant) > - has no other purpose/function at this point > - will return the value of the last dirN paramater that is not null > (dir0, dir1 or dir2) > > *Observations* > > 1. The UDF function (dirInRange) is not called if dir0, dir1 or dir2 are > missing (missing is not the same as null here) > > - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from > dfs.tmp.`/analytics/processed/test/events` as t order by occurred_at; > - return 33 records > > - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from > dfs.tmp.`/analytics/processed/test/events` as t where > dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as > timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-')) > '0' > order by occurred_at; > - returns 33 record (Coalesce handles the missing values are replaces > them with '-') > > - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from > dfs.tmp.`/analytics/processed/test/events` as t where > dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as > timestamp),dir0,dir1,dir2) > '0' order by occurred_at; > - returns 13 records (only those in the deepest directories > where dir0, dir1, dir2 are all set > > - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay, > dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as > timestamp),dir0,dir1,dir2) inRange from > dfs.tmp.`/analytics/processed/test/events` as t order by occurred_at; > - return 33 records but 20 of the records will have inRange set to > null (the UDF never returns null so it's being ignored completely) > > - *Lesson:* It's not enough to use Nullable*Holder in UDF and have all > permutations covered > - Drill will not call the function and fails silently, evaluating the > outcome of the function to null, if any of the dirN parameters are not > initialized > > 2. System.out.print out is the way to get information from within the > Drillbit > > - It would be good to know which Drillbit instance, if many, is > responsible for the println > - I don't know how to get the parent drillbit injected into the UDF > > 3. If directories have numeric names then Drill starts to insist they are > all numeric (in the where condition) event though dirInRange always returns > a varchar. > > - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from > dfs.tmp.`/analytics/processed/test/events` as t where dir0 = > dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as > timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-')) order > by occurred_at; > "2011,2012" is the name of the directory (same happens with > directories ("Q1" and "W1" etc.) > java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR: > NumberFormatException: 2011,2012 > Fragment 0:0 > [Error Id: 0c3e1370-ccc5-4288-b6c9-ea0ef4884f1e on localhost:31010] > > This seems to fail on the other side where Drill thinks that the > outcome of the dirInRange function is numeric and that the "=" expression > is a numerical one. > > - this runs though: select occurred_at, dir0 dYear, dir1 dMonth, dir2 > dDay from dfs.tmp.`/analytics/processed/test/events` as t where dir0 = > trim(dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as > timestamp),COALESCE(dir2,'-'),COALESCE(dir2,'-'),COALESCE(dir2,'-'))) order > by occurred_at; > - here the trim() function takes care of making sure that the UDF > returns a string (I think) even though the return type of the UDF is always > a varcharholder. > > 4. Directories do not fail and all files and all their records are > evaluated > > - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from > dfs.tmp.`/analytics/processed/test/events` as t where dir0 = > trim(dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as > timestamp),COALESCE(dir2,'-'),COALESCE(dir2,'-'),COALESCE(dir2,'-'))) order > by occurred_at; > - correctly returns no records (return value of dirInRange never > matches dir0) > > - This still evaluates* all the records in all the files in all the > directories* > - no partition pruning is available for selecting from directories > - massive performance/optimization gain to be had > > 5. File name is not available > > - It might be good to have dynamic filename variable available (just > as the directory variables) > > > Point number 4 bugs the living daylights out of me and I hope to share > something constructive on that soon. > > There are 3+ Jira items in there somewhere and I will create what I > believe to be correct tasks but please feel free to correct them for me. > > Regards, > -Stefan > > >