Yep, I would expect pure functions to be the majority and the default. That
makes sense, because these functions are not class members with an
(implicit) "this" pointer referencing member variables whose state could
change, which is what leads to implementations with side effects (and
impure results).
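
To make the pure/impure distinction concrete, here is a minimal Java sketch
of what constant folding does to a function that is wrongly assumed pure.
This is not Drill code: PurityDemo, Udf, and evaluate are hypothetical names
made up for illustration, and the real planner is far more involved (Ted's
log below shows two calls rather than one, apparently due to pipeline
warmup).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.DoubleBinaryOperator;

public class PurityDemo {
    /** Hypothetical stand-in for a UDF together with its purity flag. */
    static final class Udf {
        final DoubleBinaryOperator body;
        final boolean pure;   // what the engine *assumes*, not what is true
        int calls = 0;        // counts real invocations of the body

        Udf(DoubleBinaryOperator body, boolean pure) {
            this.body = body;
            this.pure = pure;
        }

        double eval(double a, double b) {
            calls++;
            return body.applyAsDouble(a, b);
        }
    }

    /** Evaluates udf(min, max) for each of `rows` rows; both arguments are constants. */
    static List<Double> evaluate(Udf udf, double min, double max, int rows) {
        List<Double> out = new ArrayList<>();
        if (udf.pure) {
            // Constant folding: constant arguments and an assumed-pure
            // function, so evaluate once and reuse the result for every row.
            double folded = udf.eval(min, max);
            for (int i = 0; i < rows; i++) {
                out.add(folded);
            }
        } else {
            // Declared impure: the call must be made for every row.
            for (int i = 0; i < rows; i++) {
                out.add(udf.eval(min, max));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        DoubleBinaryOperator uniform = (lo, hi) -> lo + (hi - lo) * rng.nextDouble();

        Udf assumedPure = new Udf(uniform, true);
        Udf declaredImpure = new Udf(uniform, false);

        // Same value repeated four times, like random(0,10) in the log.
        System.out.println(evaluate(assumedPure, 0, 10, 4));
        // Four independent draws.
        System.out.println(evaluate(declaredImpure, 0, 10, 4));
    }
}
```

The sketch also shows the trade-off Daniel raised: a wrong "pure" assumption
changes the results, while a wrong "impure" assumption only costs extra
calls.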

On Tue, Jul 21, 2015 at 5:25 PM, Ted Dunning <[email protected]> wrote:

> Even in my own warped experience, the vast majority of UDFs I have written
> or considered writing have been pure.
>
>
>
> On Tue, Jul 21, 2015 at 4:27 PM, Jacques Nadeau <[email protected]>
> wrote:
>
> > I don't think so.  There are something like 1500 functions where this isn't
> > true (default) and one or two where it is.
> >
> > On Tue, Jul 21, 2015 at 4:25 PM, Daniel Barclay <[email protected]>
> > wrote:
> >
> > >
> > > Should Drill be defaulting the other way?
> > >
> > > > That is, instead of assuming pure unless declared otherwise (leading to
> > > > wrong results in the case that the assumption is wrong (or the annotation
> > > > was forgotten)), should Drill be assuming not pure unless declared pure
> > > > (leading to only lower performance in the wrong-assumption case)?
> > >
> > > Daniel
> > >
> > >
> > >
> > >
> > > Jacques Nadeau wrote:
> > >
> > > >> There is an annotation on the function template.  I don't have a laptop
> > > >> close but I believe it is something similar to isRandom. It basically
> > > >> tells Drill that this is a nondeterministic function. I will be more
> > > >> specific once I get back to my machine if you don't find it sooner.
> > >>
> > >> Jacques
> > >> *Summary:*
> > >>
> > > >> Drill is very aggressive about optimizing away calls to functions with
> > > >> constant arguments. I worry that could extend to per record batch
> > > >> optimization if I accidentally have constant values and even if that
> > > >> doesn't happen, it is a pain in the ass now largely because Drill is
> > > >> clever enough to see through my attempt to hide the constant nature of
> > > >> my parameters.
> > >>
> > >> *Question:*
> > >>
> > >> Is there a way to mark a UDF as not being a pure function?
> > >>
> > >> *Details:*
> > >>
> > > >> I have written a UDF to generate a random number.  It takes parameters
> > > >> that define the distribution.  All seems well and good.
> > > >>
> > > >> I find, however, that the function is only called once (twice, actually
> > > >> apparently due to pipeline warmup) and then Drill optimizes away later
> > > >> calls, apparently because the parameters to the function are constant
> > > >> and Drill thinks my function is a pure function.  If I make up some
> > > >> bogus data to pass in as a parameter, all is well and the function is
> > > >> called as much as I wanted.
> > > >>
> > > >> For instance, with the uniform distribution, my function takes two
> > > >> arguments, those being the minimum and maximum value to return.  Here
> > > >> is what I see with constants for the min and max:
> > >>
> > >> 0: jdbc:drill:zk=local> select random(0,10) from (values 5,5,5,5) as
> > >> tbl(x);
> > >> into eval
> > >> into eval
> > >> +---------------------+
> > >> |       EXPR$0        |
> > >> +---------------------+
> > >> | 1.7787372583008298  |
> > >> | 1.7787372583008298  |
> > >> | 1.7787372583008298  |
> > >> | 1.7787372583008298  |
> > >> +---------------------+
> > >>
> > >>
> > > >> If I include an actual value, we see more interesting behavior even if
> > > >> the value is effectively constant:
> > >>
> > >> 0: jdbc:drill:zk=local> select random(0,x) from (values 5,5,5,5) as
> > >> tbl(x);
> > >> into eval
> > >> into eval
> > >> into eval
> > >> into eval
> > >> +----------------------+
> > >> |        EXPR$0        |
> > >> +----------------------+
> > >> | 3.688377805419459    |
> > >> | 0.2827056410711032   |
> > >> | 2.3107479622644918   |
> > >> | 0.10813788169218574  |
> > >> +----------------------+
> > >> 4 rows selected (0.088 seconds)
> > >>
> > >>
> > > >> Even if I make the max value come along from the sub-query, I get the
> > > >> evil behavior although the function is now surprisingly actually called
> > > >> three times, apparently to do with warming up the pipeline:
> > >>
> > >> 0: jdbc:drill:zk=local> select random(0,max_value) from (select 14 as
> > >> max_value,x from (values 5,5,5,5) as tbl(x)) foo;
> > >> into eval
> > >> into eval
> > >> into eval
> > >> +---------------------+
> > >> |       EXPR$0        |
> > >> +---------------------+
> > >> | 13.404462063773702  |
> > >> | 13.404462063773702  |
> > >> | 13.404462063773702  |
> > >> | 13.404462063773702  |
> > >> +---------------------+
> > >> 4 rows selected (0.121 seconds)
> > >>
> > >> The UDF itself is boring and can be found at
> > >> https://gist.github.com/tdunning/0c2cc2089e6cd8c030c0
> > >>
> > >> So how can I defeat this behavior?
> > >>
> > >>
> > >
> > > --
> > > Daniel Barclay
> > > MapR Technologies
> > >
> >
>
