#3 is incorrectly labelled and should be called "3. since it seems unclear
what type UDFs return then Drill will, under some circumstances, assume
arithmetic comparison (=, <>, etc.) rather than string comparison even
though the UDF returns VarChar

On Sat, Jul 25, 2015 at 4:18 PM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Hi,
>
> I'm still working on our evaluation and now focusing on directory based
> queries and mixing directory and parquet based partitions.
> This is also a continued trip down the UDF rabbit hole :) (some pun
> intended)
>
> I continue to come across things that surprise me and I would like to
> share them with both the developers, that might want to address some of
> them, and other newcomers that might benefit from them.
>
> The UDF code referenced here van be found at (
> https://github.com/acmeguy/asdrill) and the documents and the directory
> structure used in these examples are included in the tiny attachment.
>
> I will try to keep this as brief as possible.
>
> *What you need to know is:*
>
>    - there are 33 records in the 19 files in a mixed directory structure
>    - see zip for details
>
>    - each record contains a date that is valid within that directory
>    structure
>
>    - the dirInRange function is a UDF that takes a date range and
>    directory information to determine if a directory contains target data for
>    the provided date range
>    - see github for details
>
>    - the dirInRange function should be able to accept all null values or
>    missing parameters for everything other than the time range starts 
> parameter
>
>    - the dirInRange function returns a string with a number that
>    represents the number of parameters used (function variant)
>    - has no other purpose/function at this point
>    - will return the value of the last dirN paramater that is not null
>    (dir0, dir1 or dir2)
>
> *Observations*
>
> 1. The UDF function (dirInRange) is not called if dir0, dir1 or dir2 are
> missing (missing is not the same as null here)
>
>    - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
>    dfs.tmp.`/analytics/processed/test/events` as t order by occurred_at;
>    - return 33 records
>
>    - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
>    dfs.tmp.`/analytics/processed/test/events` as t where
>    dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
>    timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-')) > '0'
>    order by occurred_at;
>    - returns 33 record (Coalesce handles the missing values are replaces
>    them with '-')
>
>    - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
>    dfs.tmp.`/analytics/processed/test/events` as t where
>    dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
>    timestamp),dir0,dir1,dir2) > '0' order by occurred_at;
>    - returns 13 records (only those in the deepest directories
>    where dir0, dir1, dir2 are all set
>
>    - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay,
>    dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
>    timestamp),dir0,dir1,dir2) inRange from
>    dfs.tmp.`/analytics/processed/test/events` as t order by occurred_at;
>    - return 33 records but 20 of the records will have inRange set to
>    null (the UDF never returns null so it's being ignored completely)
>
>    - *Lesson:* It's not enough to use Nullable*Holder in UDF and have all
>    permutations covered
>    - Drill will not call the function and fails silently, evaluating the
>    outcome of the function to null, if any of the dirN parameters are not
>    initialized
>
> 2. System.out.print out is the way to get information from within the
> Drillbit
>
>    - It would be good to know which Drillbit instance, if many, is
>    responsible for the println
>    - I don't know how to get the parent drillbit injected into the UDF
>
> 3. If directories have numeric names then Drill starts to insist they are
> all numeric (in the where condition) event though dirInRange always returns
> a varchar.
>
>    - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
>    dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
>    dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
>    timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-')) order
>    by occurred_at;
>    "2011,2012" is the name of the directory (same happens with
>    directories ("Q1" and "W1" etc.)
>    java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR:
>    NumberFormatException: 2011,2012
>    Fragment 0:0
>    [Error Id: 0c3e1370-ccc5-4288-b6c9-ea0ef4884f1e on localhost:31010]
>
>    This seems to fail on the other side where Drill thinks that the
>    outcome of the dirInRange function is numeric and that the "=" expression
>    is a numerical one.
>
>    - this runs though: select occurred_at, dir0 dYear, dir1 dMonth, dir2
>    dDay from dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
>    trim(dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
>    timestamp),COALESCE(dir2,'-'),COALESCE(dir2,'-'),COALESCE(dir2,'-'))) order
>    by occurred_at;
>    - here the trim() function takes care of making sure that the UDF
>    returns a string (I think) even though the return type of the UDF is always
>    a varcharholder.
>
> 4. Directories do not fail and all files and all their records are
> evaluated
>
>    - select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
>    dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
>    trim(dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
>    timestamp),COALESCE(dir2,'-'),COALESCE(dir2,'-'),COALESCE(dir2,'-'))) order
>    by occurred_at;
>    - correctly returns no records (return value of dirInRange never
>    matches dir0)
>
>    - This still evaluates* all the records in all the files in all the
>    directories*
>    - no partition pruning is available for selecting from directories
>    - massive performance/optimization gain to be had
>
> 5. File name is not available
>
>    - It might be good to have dynamic filename variable available (just
>    as the directory variables)
>
>
> Point number 4 bugs the living daylights out of me and I hope to share
> something constructive on that soon.
>
> There are 3+ Jira items in there somewhere and I will create what I
> believe to be correct tasks but please feel free to correct them for me.
>
> Regards,
>   -Stefan
>
>
>

Reply via email to