I found the option. Works a charm.

I will add it to the FAQ list.

On Mon, Jul 20, 2015 at 5:11 PM, Jacques Nadeau <[email protected]> wrote:
> There is an annotation on the function template. I don't have a laptop
> close but I believe it is something similar to isRandom. It basically tells
> Drill that this is a nondeterministic function. I will be more specific
> once I get back to my machine if you don't find it sooner.
>
> Jacques
>
> *Summary:*
>
> Drill is very aggressive about optimizing away calls to functions with
> constant arguments. I worry that could extend to per record batch
> optimization if I accidentally have constant values and even if that
> doesn't happen, it is a pain in the ass now largely because Drill is clever
> enough to see through my attempt to hide the constant nature of my
> parameters.
>
> *Question:*
>
> Is there a way to mark a UDF as not being a pure function?
>
> *Details:*
>
> I have written a UDF to generate a random number. It takes parameters that
> define the distribution. All seems well and good.
>
> I find, however, that the function is only called once (twice, actually
> apparently due to pipeline warmup) and then Drill optimizes away later
> calls, apparently because the parameters to the function are constant and
> Drill thinks my function is a pure function. If I make up some bogus data
> to pass in as a parameter, all is well and the function is called as much
> as I wanted.
>
> For instance, with the uniform distribution, my function takes two
> arguments, those being the minimum and maximum value to return. Here is
> what I see with constants for the min and max:
>
> 0: jdbc:drill:zk=local> select random(0,10) from (values 5,5,5,5) as tbl(x);
> into eval
> into eval
> +---------------------+
> |       EXPR$0        |
> +---------------------+
> | 1.7787372583008298  |
> | 1.7787372583008298  |
> | 1.7787372583008298  |
> | 1.7787372583008298  |
> +---------------------+
>
> If I include an actual value, we see more interesting behavior even if the
> value is effectively constant:
>
> 0: jdbc:drill:zk=local> select random(0,x) from (values 5,5,5,5) as tbl(x);
> into eval
> into eval
> into eval
> into eval
> +----------------------+
> |        EXPR$0        |
> +----------------------+
> | 3.688377805419459    |
> | 0.2827056410711032   |
> | 2.3107479622644918   |
> | 0.10813788169218574  |
> +----------------------+
> 4 rows selected (0.088 seconds)
>
> Even if I make the max value come along from the sub-query, I get the evil
> behavior although the function is now surprisingly actually called three
> times, apparently to do with warming up the pipeline:
>
> 0: jdbc:drill:zk=local> select random(0,max_value) from (select 14 as
> max_value,x from (values 5,5,5,5) as tbl(x)) foo;
> into eval
> into eval
> into eval
> +---------------------+
> |       EXPR$0        |
> +---------------------+
> | 13.404462063773702  |
> | 13.404462063773702  |
> | 13.404462063773702  |
> | 13.404462063773702  |
> +---------------------+
> 4 rows selected (0.121 seconds)
>
> The UDF itself is boring and can be found at
> https://gist.github.com/tdunning/0c2cc2089e6cd8c030c0
>
> So how can I defeat this behavior?
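
For the FAQ entry, the behavior described in the quoted message can be illustrated with a toy model. This is a minimal sketch, not Drill's planner or its UDF API: a function assumed to be pure is evaluated once when all of its arguments are constants, while one flagged as nondeterministic is evaluated for every row. The class `ConstantFoldingSketch`, the `Fn` wrapper, and its `isRandom` field are hypothetical names made up for this example; the thread itself only says the real annotation attribute is "something similar to isRandom".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.DoubleBinaryOperator;

// Toy model only: none of these names come from Drill.
final class ConstantFoldingSketch {

    /** Hypothetical stand-in for a two-argument function registered with a query engine. */
    static final class Fn {
        final DoubleBinaryOperator body;
        final boolean isRandom; // hypothetical flag: true means "nondeterministic, never fold"
        int calls = 0;          // how many times the body actually ran

        Fn(DoubleBinaryOperator body, boolean isRandom) {
            this.body = body;
            this.isRandom = isRandom;
        }

        double eval(double a, double b) {
            calls++;
            return body.applyAsDouble(a, b);
        }
    }

    /**
     * Produces fn(min, max) for each of `rows` rows. Both arguments are
     * constants, so a function assumed to be pure is evaluated once and
     * its result is reused for every row.
     */
    static List<Double> evaluate(Fn fn, double min, double max, int rows) {
        List<Double> out = new ArrayList<>();
        if (!fn.isRandom) {
            double folded = fn.eval(min, max); // constant folding: a single call
            for (int i = 0; i < rows; i++) {
                out.add(folded);
            }
        } else {
            for (int i = 0; i < rows; i++) {
                out.add(fn.eval(min, max));    // one call per row
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        DoubleBinaryOperator uniform = (lo, hi) -> lo + (hi - lo) * rng.nextDouble();

        Fn assumedPure = new Fn(uniform, false);
        Fn markedRandom = new Fn(uniform, true);

        System.out.println(evaluate(assumedPure, 0, 10, 4) + " calls=" + assumedPure.calls);
        System.out.println(evaluate(markedRandom, 0, 10, 4) + " calls=" + markedRandom.calls);
    }
}
```

Running `main` prints one value repeated four times with `calls=1` for the function assumed to be pure, and four independently drawn values with `calls=4` for the flagged one. That loosely mirrors the difference between the first two query outputs quoted above, minus the extra warm-up calls seen in Drill.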
