There is an annotation attribute on the function template for this.  I
don't have a laptop handy, but I believe it is something like isRandom. It
tells Drill that the function is nondeterministic. I will be more specific
once I get back to my machine, if you don't find it sooner.
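
As a rough illustration of the idea (not Drill code), here is a
self-contained Java sketch of how a flag on a function annotation could tell
a planner to leave a call alone. SketchTemplate, RandomUniform, Add and
mayFold are all hypothetical names defined locally; the attribute name
isRandom is only the guess from above and should be checked against the
Drill source.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class AnnotationSketch {
    // A planner could consult the flag before folding a call whose
    // arguments are all constants. Unannotated classes are never folded.
    static boolean mayFold(Class<?> fn) {
        SketchTemplate t = fn.getAnnotation(SketchTemplate.class);
        return t != null && !t.isRandom();
    }

    public static void main(String[] args) {
        System.out.println("random foldable: " + mayFold(RandomUniform.class));
        System.out.println("add foldable:    " + mayFold(Add.class));
    }
}

// Hypothetical stand-in for a function-template annotation. This is NOT
// Drill's annotation; it only mimics the shape of one.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SketchTemplate {
    String name();
    boolean isRandom() default false; // assumed attribute name
}

// Marked as nondeterministic, so it must be evaluated for every row.
@SketchTemplate(name = "random", isRandom = true)
class RandomUniform {
    double eval(double min, double max) {
        return min + Math.random() * (max - min);
    }
}

// A pure function: same inputs always give the same output.
@SketchTemplate(name = "add")
class Add {
    double eval(double a, double b) {
        return a + b;
    }
}
```

In this sketch only the unflagged function is reported as safe to fold; the
one marked isRandom is not.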

Jacques

*Summary:*

Drill is very aggressive about optimizing away calls to functions with
constant arguments. I worry that this could extend to per-record-batch
optimization if I accidentally have constant values. Even if that doesn't
happen, it is a pain in the ass now, largely because Drill is clever enough
to see through my attempts to hide the constant nature of my parameters.

*Question:*

Is there a way to mark a UDF as not being a pure function?

*Details:*

I have written a UDF to generate a random number.  It takes parameters that
define the distribution.  All seems well and good.

I find, however, that the function is only called once (twice, actually,
apparently due to pipeline warmup) and then Drill optimizes away later
calls, apparently because the parameters to the function are constant and
Drill thinks my function is a pure function.  If I make up some bogus data
to pass in as a parameter, all is well and the function is called as often
as I want.
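
To make the suspected mechanism concrete, here is a toy Java model of
constant folding. It is a sketch under assumptions, not Drill's planner:
FoldingSketch, project and the two boolean flags are hypothetical, and the
extra warm-up call is not modeled. When every argument is a literal and the
function is treated as pure, the body runs once and the result is reused for
all rows; otherwise it runs once per row.

```java
import java.util.Arrays;
import java.util.Random;

class FoldingSketch {
    static int calls = 0; // counts how often the function body runs

    // Uniform random value in [min, max); stands in for the UDF body.
    static double randomUniform(Random rng, double min, double max) {
        calls++;
        return min + rng.nextDouble() * (max - min);
    }

    // Evaluate random(min, max) for rowCount rows. If the arguments are all
    // literals and the function is treated as pure, fold the call: evaluate
    // it once and copy the result into every row.
    static double[] project(int rowCount, double min, double max,
                            boolean argsAreLiterals, boolean treatedAsPure,
                            Random rng) {
        double[] out = new double[rowCount];
        if (argsAreLiterals && treatedAsPure) {
            Arrays.fill(out, randomUniform(rng, min, max));
        } else {
            for (int i = 0; i < rowCount; i++) {
                out[i] = randomUniform(rng, min, max);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        calls = 0;
        double[] folded = project(4, 0, 10, true, true, new Random());
        System.out.println(Arrays.toString(folded) + " calls=" + calls);

        calls = 0;
        double[] perRow = project(4, 0, 10, true, false, new Random());
        System.out.println(Arrays.toString(perRow) + " calls=" + calls);
    }
}
```

The first call mirrors the repeated value in the constant-argument query;
the second is the per-row behavior a random function needs.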

For instance, with the uniform distribution, my function takes two
arguments, those being the minimum and maximum value to return.  Here is
what I see with constants for the min and max:

0: jdbc:drill:zk=local> select random(0,10) from (values 5,5,5,5) as tbl(x);
into eval
into eval
+---------------------+
|       EXPR$0        |
+---------------------+
| 1.7787372583008298  |
| 1.7787372583008298  |
| 1.7787372583008298  |
| 1.7787372583008298  |
+---------------------+


If I include an actual value, we see more interesting behavior even if the
value is effectively constant:

0: jdbc:drill:zk=local> select random(0,x) from (values 5,5,5,5) as tbl(x);
into eval
into eval
into eval
into eval
+----------------------+
|        EXPR$0        |
+----------------------+
| 3.688377805419459    |
| 0.2827056410711032   |
| 2.3107479622644918   |
| 0.10813788169218574  |
+----------------------+
4 rows selected (0.088 seconds)


Even if I make the max value come from the sub-query, I still get the evil
behavior, although, surprisingly, the function is now actually called three
times, apparently something to do with warming up the pipeline:

0: jdbc:drill:zk=local> select random(0,max_value) from (select 14 as
max_value,x from (values 5,5,5,5) as tbl(x)) foo;
into eval
into eval
into eval
+---------------------+
|       EXPR$0        |
+---------------------+
| 13.404462063773702  |
| 13.404462063773702  |
| 13.404462063773702  |
| 13.404462063773702  |
+---------------------+
4 rows selected (0.121 seconds)

The UDF itself is boring and can be found at
https://gist.github.com/tdunning/0c2cc2089e6cd8c030c0

So how can I defeat this behavior?
