Even in my own warped experience, the vast majority of UDFs I have written
or considered writing have been pure.
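
For anyone skimming the thread, here is a toy sketch in plain Java (not Drill
code) of the behavior under discussion: a planner that folds a call with
constant arguments into a single evaluation unless the function is flagged as
nondeterministic. Everything in it is an assumption made for illustration.
ConstantFoldingSketch, Fn, project and the nonDeterministic flag are
hypothetical names, and the flag only plays the role of the annotation Jacques
mentions; it is not Drill's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.DoubleBinaryOperator;

public class ConstantFoldingSketch {

    /** Hypothetical stand-in for a registered UDF plus its purity flag. */
    static final class Fn {
        final DoubleBinaryOperator body;
        final boolean nonDeterministic; // plays the role of the annotation (assumption)
        int calls = 0;                  // counts real evaluations of the body

        Fn(DoubleBinaryOperator body, boolean nonDeterministic) {
            this.body = body;
            this.nonDeterministic = nonDeterministic;
        }

        double eval(double a, double b) {
            calls++;
            return body.applyAsDouble(a, b);
        }
    }

    /** Toy projection of fn(min, max) over `rows` rows with constant arguments. */
    static List<Double> project(Fn fn, double min, double max, int rows) {
        List<Double> out = new ArrayList<>();
        if (!fn.nonDeterministic) {
            // Assumed pure: evaluate once and reuse the result for every row.
            double folded = fn.eval(min, max);
            for (int i = 0; i < rows; i++) {
                out.add(folded);
            }
        } else {
            // Declared nondeterministic: evaluate once per row.
            for (int i = 0; i < rows; i++) {
                out.add(fn.eval(min, max));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        DoubleBinaryOperator uniform = (lo, hi) -> lo + (hi - lo) * rng.nextDouble();

        Fn assumedPure = new Fn(uniform, false);
        Fn declaredRandom = new Fn(uniform, true);

        System.out.println(project(assumedPure, 0, 10, 4) + " calls=" + assumedPure.calls);
        System.out.println(project(declaredRandom, 0, 10, 4) + " calls=" + declaredRandom.calls);
    }
}
```

The first line printed repeats one value four times with a single call, which
is the symptom in the original report; the second shows four independent draws
from four calls. Flipping the default would amount to inverting the flag: a
forgotten annotation would then cost extra evaluations instead of producing
wrong results.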



On Tue, Jul 21, 2015 at 4:27 PM, Jacques Nadeau <[email protected]> wrote:

> I don't think so.  There are something like 1500 functions that are pure
> (the default) and only one or two that are not.
>
> On Tue, Jul 21, 2015 at 4:25 PM, Daniel Barclay <[email protected]>
> wrote:
>
> >
> > Should Drill be defaulting the other way?
> >
> > That is, instead of assuming pure unless declared otherwise (leading to
> > wrong results in the case that the assumption is wrong (or the annotation
> > was forgotten)), should Drill be assuming not pure unless declared pure
> > (leading to only lower performance in the wrong-assumption case)?
> >
> > Daniel
> >
> >
> >
> >
> > Jacques Nadeau wrote:
> >
> >> There is an annotation on the function template.  I don't have a laptop
> >> close but I believe it is something similar to isRandom. It basically
> >> tells Drill that this is a nondeterministic function. I will be more specific
> >> once I get back to my machine if you don't find it sooner.
> >>
> >> Jacques
> >>
> >> *Summary:*
> >>
> >> Drill is very aggressive about optimizing away calls to functions with
> >> constant arguments. I worry that could extend to per record batch
> >> optimization if I accidentally have constant values and even if that
> >> doesn't happen, it is a pain in the ass now largely because Drill is
> >> clever enough to see through my attempt to hide the constant nature of my
> >> parameters.
> >>
> >> *Question:*
> >>
> >> Is there a way to mark a UDF as not being a pure function?
> >>
> >> *Details:*
> >>
> >> I have written a UDF to generate a random number.  It takes parameters
> >> that define the distribution.  All seems well and good.
> >>
> >> I find, however, that the function is only called once (twice, actually
> >> apparently due to pipeline warmup) and then Drill optimizes away later
> >> calls, apparently because the parameters to the function are constant and
> >> Drill thinks my function is a pure function.  If I make up some bogus data
> >> to pass in as a parameter, all is well and the function is called as much
> >> as I wanted.
> >>
> >> For instance, with the uniform distribution, my function takes two
> >> arguments, those being the minimum and maximum value to return.  Here is
> >> what I see with constants for the min and max:
> >>
> >> 0: jdbc:drill:zk=local> select random(0,10) from (values 5,5,5,5) as tbl(x);
> >> into eval
> >> into eval
> >> +---------------------+
> >> |       EXPR$0        |
> >> +---------------------+
> >> | 1.7787372583008298  |
> >> | 1.7787372583008298  |
> >> | 1.7787372583008298  |
> >> | 1.7787372583008298  |
> >> +---------------------+
> >>
> >>
> >> If I include an actual value, we see more interesting behavior even if the
> >> value is effectively constant:
> >>
> >> 0: jdbc:drill:zk=local> select random(0,x) from (values 5,5,5,5) as tbl(x);
> >> into eval
> >> into eval
> >> into eval
> >> into eval
> >> +----------------------+
> >> |        EXPR$0        |
> >> +----------------------+
> >> | 3.688377805419459    |
> >> | 0.2827056410711032   |
> >> | 2.3107479622644918   |
> >> | 0.10813788169218574  |
> >> +----------------------+
> >> 4 rows selected (0.088 seconds)
> >>
> >>
> >> Even if I make the max value come along from the sub-query, I get the evil
> >> behavior although the function is now surprisingly actually called three
> >> times, apparently to do with warming up the pipeline:
> >>
> >> 0: jdbc:drill:zk=local> select random(0,max_value) from (select 14 as
> >> max_value,x from (values 5,5,5,5) as tbl(x)) foo;
> >> into eval
> >> into eval
> >> into eval
> >> +---------------------+
> >> |       EXPR$0        |
> >> +---------------------+
> >> | 13.404462063773702  |
> >> | 13.404462063773702  |
> >> | 13.404462063773702  |
> >> | 13.404462063773702  |
> >> +---------------------+
> >> 4 rows selected (0.121 seconds)
> >>
> >> The UDF itself is boring and can be found at
> >> https://gist.github.com/tdunning/0c2cc2089e6cd8c030c0
> >>
> >> So how can I defeat this behavior?
> >>
> >>
> >
> > --
> > Daniel Barclay
> > MapR Technologies
> >
>
