Re: question about UDF optimization

Ted Dunning Mon, 20 Jul 2015 18:45:06 -0700

I found the option.  Works a charm.  I will add it to the FAQ list.
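
For the archive, here is a toy sketch of the behavior discussed in the quoted thread below: a planner that evaluates a call with constant arguments once and reuses the result, unless the function is flagged as nondeterministic. It is plain Java and does not use Drill's classes. The annotation `ToyFunctionTemplate`, its `isRandom` attribute (named after the option Jacques recalls below), and the `evaluate` method are all hypothetical stand-ins for illustration, not Drill's actual API or planner logic.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;
import java.util.Random;

public class ConstantFoldingSketch {

    /** Hypothetical stand-in for a UDF annotation; this is NOT Drill's class. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface ToyFunctionTemplate {
        boolean isRandom() default false;
    }

    /** Uniform random function that counts how often it is evaluated. */
    public static class Uniform {
        private final Random rng = new Random();
        public int calls = 0;

        public double eval(double min, double max) {
            calls++;
            return min + (max - min) * rng.nextDouble();
        }
    }

    /** Not flagged, so the toy planner treats it as a pure function. */
    @ToyFunctionTemplate
    public static class UnmarkedUniform extends Uniform { }

    /** Flagged as nondeterministic, so the toy planner never folds it. */
    @ToyFunctionTemplate(isRandom = true)
    public static class MarkedUniform extends Uniform { }

    /**
     * Toy "planner": the arguments are constants, so an unflagged function
     * is evaluated once and the value is copied to every row. A flagged
     * function is evaluated once per row.
     */
    public static double[] evaluate(Uniform fn, double min, double max, int rows) {
        ToyFunctionTemplate template =
                fn.getClass().getAnnotation(ToyFunctionTemplate.class);
        boolean mayFold = template == null || !template.isRandom();
        double[] out = new double[rows];
        if (mayFold) {
            Arrays.fill(out, fn.eval(min, max));
        } else {
            for (int i = 0; i < rows; i++) {
                out[i] = fn.eval(min, max);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Uniform unmarked = new UnmarkedUniform();
        Uniform marked = new MarkedUniform();
        System.out.println(Arrays.toString(evaluate(unmarked, 0, 10, 4))
                + " calls=" + unmarked.calls);
        System.out.println(Arrays.toString(evaluate(marked, 0, 10, 4))
                + " calls=" + marked.calls);
    }
}
```

The unflagged variant repeats one value across all four rows, much like the first query in the quoted message. The toy evaluates it exactly once and does not model the extra warm-up calls seen there. The flagged variant is evaluated per row.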

On Mon, Jul 20, 2015 at 5:11 PM, Jacques Nadeau <[email protected]> wrote:


> There is an annotation on the function template.  I don't have a laptop
> close but I believe it is something similar to isRandom. It basically tells
> Drill that this is a nondeterministic function. I will be more specific
> once I get back to my machine if you don't find it sooner.
>
> Jacques
>
> *Summary:*
>
> Drill is very aggressive about optimizing away calls to functions with
> constant arguments. I worry that this could extend to per-record-batch
> optimization if I accidentally have constant values. Even if that doesn't
> happen, it is a pain in the ass right now, largely because Drill is clever
> enough to see through my attempts to hide the constant nature of my
> parameters.
>
> *Question:*
>
> Is there a way to mark a UDF as not being a pure function?
>
> *Details:*
>
> I have written a UDF to generate a random number.  It takes parameters that
> define the distribution.  All seems well and good.
>
> I find, however, that the function is only called once (twice, actually,
> apparently due to pipeline warmup) and then Drill optimizes away later
> calls, apparently because the parameters to the function are constant and
> Drill thinks my function is a pure function.  If I make up some bogus data
> to pass in as a parameter, all is well and the function is called as often
> as I want.
>
> For instance, with the uniform distribution, my function takes two
> arguments, those being the minimum and maximum value to return.  Here is
> what I see with constants for the min and max:
>
> 0: jdbc:drill:zk=local> select random(0,10) from (values 5,5,5,5) as tbl(x);
> into eval
> into eval
> +---------------------+
> |       EXPR$0        |
> +---------------------+
> | 1.7787372583008298  |
> | 1.7787372583008298  |
> | 1.7787372583008298  |
> | 1.7787372583008298  |
> +---------------------+
>
>
> If I include an actual value, we see more interesting behavior even if the
> value is effectively constant:
>
> 0: jdbc:drill:zk=local> select random(0,x) from (values 5,5,5,5) as tbl(x);
> into eval
> into eval
> into eval
> into eval
> +----------------------+
> |        EXPR$0        |
> +----------------------+
> | 3.688377805419459    |
> | 0.2827056410711032   |
> | 2.3107479622644918   |
> | 0.10813788169218574  |
> +----------------------+
> 4 rows selected (0.088 seconds)
>
>
> Even if I make the max value come from a sub-query, I still get the evil
> behavior, although the function is now, surprisingly, called three times,
> apparently again because of pipeline warmup:
>
> 0: jdbc:drill:zk=local> select random(0,max_value) from (select 14 as max_value,x from (values 5,5,5,5) as tbl(x)) foo;
> into eval
> into eval
> into eval
> +---------------------+
> |       EXPR$0        |
> +---------------------+
> | 13.404462063773702  |
> | 13.404462063773702  |
> | 13.404462063773702  |
> | 13.404462063773702  |
> +---------------------+
> 4 rows selected (0.121 seconds)
>
> The UDF itself is boring and can be found at
> https://gist.github.com/tdunning/0c2cc2089e6cd8c030c0
>
> So how can I defeat this behavior?
>
