Stefán Baxter created DRILL-3561:
------------------------------------
Summary: The return type of UDF seems ambigious and varchar
results can force arithmetic comparison
Key: DRILL-3561
URL: https://issues.apache.org/jira/browse/DRILL-3561
Project: Apache Drill
Issue Type: Bug
Components: SQL Parser
Reporter: Stefán Baxter
Assignee: Aman Sinha
Please see information in the following user group email where dir0, containing
"2011-2012" is being compared to a varchar/null result of a UDF and Drill tries
to convert dir0 to a number and fails.
======================================================
I'm still working on our evaluation and now focusing on directory based queries
and mixing directory and parquet based partitions.
This is also a continued trip down the UDF rabbit hole :) (some pun intended)
I continue to come across things that surprise me and I would like to share
them with both the developers, that might want to address some of them, and
other newcomers that might benefit from them.
The UDF code referenced here van be found at
(https://github.com/acmeguy/asdrill) and the documents and the directory
structure used in these examples are included in the tiny attachment.
I will try to keep this as brief as possible.
What you need to know is:
there are 33 records in the 19 files in a mixed directory structure
- see zip for details
each record contains a date that is valid within that directory structure
the dirInRange function is a UDF that takes a date range and directory
information to determine if a directory contains target data for the provided
date range
- see github for details
the dirInRange function should be able to accept all null values or missing
parameters for everything other than the time range starts parameter
the dirInRange function returns a string with a number that represents the
number of parameters used (function variant)
- has no other purpose/function at this point
- will return the value of the last dirN paramater that is not null (dir0, dir1
or dir2)
Observations
1. The UDF function (dirInRange) is not called if dir0, dir1 or dir2 are
missing (missing is not the same as null here)
select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
dfs.tmp.`/analytics/processed/test/events` as t order by occurred_at;
- return 33 records
select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
dfs.tmp.`/analytics/processed/test/events` as t where
dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-')) > '0'
order by occurred_at;
- returns 33 record (Coalesce handles the missing values are replaces them with
'-')
select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
dfs.tmp.`/analytics/processed/test/events` as t where
dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
timestamp),dir0,dir1,dir2) > '0' order by occurred_at;
- returns 13 records (only those in the deepest directories where dir0, dir1,
dir2 are all set
select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay,
dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
timestamp),dir0,dir1,dir2) inRange from
dfs.tmp.`/analytics/processed/test/events` as t order by occurred_at;
- return 33 records but 20 of the records will have inRange set to null (the
UDF never returns null so it's being ignored completely)
Lesson: It's not enough to use Nullable*Holder in UDF and have all permutations
covered
- Drill will not call the function and fails silently, evaluating the outcome
of the function to null, if any of the dirN parameters are not initialized
2. System.out.print out is the way to get information from within the Drillbit
It would be good to know which Drillbit instance, if many, is responsible for
the println
- I don't know how to get the parent drillbit injected into the UDF
3. If directories have numeric names then Drill starts to insist they are all
numeric (in the where condition) event though dirInRange always returns a
varchar.
select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-')) order by
occurred_at;
"2011,2012" is the name of the directory (same happens with directories ("Q1"
and "W1" etc.)
java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR:
NumberFormatException: 2011,2012
Fragment 0:0
[Error Id: 0c3e1370-ccc5-4288-b6c9-ea0ef4884f1e on localhost:31010]
This seems to fail on the other side where Drill thinks that the outcome of the
dirInRange function is numeric and that the "=" expression is a numerical one.
this runs though: select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
trim(dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
timestamp),COALESCE(dir2,'-'),COALESCE(dir2,'-'),COALESCE(dir2,'-'))) order by
occurred_at;
- here the trim() function takes care of making sure that the UDF returns a
string (I think) even though the return type of the UDF is always a
varcharholder.
4. Directories do not fail and all files and all their records are evaluated
select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from
dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
trim(dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
timestamp),COALESCE(dir2,'-'),COALESCE(dir2,'-'),COALESCE(dir2,'-'))) order by
occurred_at;
- correctly returns no records (return value of dirInRange never matches dir0)
This still evaluates all the records in all the files in all the directories
- no partition pruning is available for selecting from directories
- massive performance/optimization gain to be had
5. File name is not available
It might be good to have dynamic filename variable available (just as the
directory variables)
Point number 4 bugs the living daylights out of me and I hope to share
something constructive on that soon.
There are 3+ Jira items in there somewhere and I will create what I believe to
be correct tasks but please feel free to correct them for me.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)