Hi Aman & Gautam,

FWIW, here is my understanding of UDAF functions based on a write-up I did a 
few months back. [1]

All Drill functions are implemented using the UDF and UDAF protocol. (The "U" 
(User) is a bit of a misnomer, internal and user-defined functions follow the 
same protocol.) Every UDF (including UDAF) is strongly typed in its arguments 
and return value. To create the ANY_VALUE implementation, you must create a 
separate implementation for each and every combination of (type, mode). That 
is, we need a REQUIRED INT and OPTIONAL INT implementation for INT types.

In this case, the incoming argument is passed using the argument annotation, 
and the return value via the return annotation. The generated code sets the 
incoming argument and copies the return value from the return variable. (There 
is an example of the generated code in the write-up.)

For a Map, there is no annotation to say "set this value to a map" for either 
input or output. Instead, we pass in a complex reader for input (I believe) and 
a complex writer for output. (Here I am a bit hazy as I never had time to 
experiment with maps and arrays in UDFs.)

So, you'll need a Map implementation. (Maps can only be REQUIRED, never 
OPTIONAL, unless they are in a UNION or LIST...)

Moreover, the code generator must understand that code generated for a Map UDAF 
must be different than that for a scalar UDAF. Presumably we must have that 
code, since the UDF mechanism supports maps.

Have you worked out how to handle arrays (REPEATED cardinality?) It was not 
clear from my exploration of UDFs how we handle REPEATED types. The UDAF must 
pass in one array, which the UDAF copies to its output, which is then written 
to the output repeated vector. Since values must arrive in Holders, it is not 
clear how this would be done for arrays. Perhaps there is an annotation that 
lets us use some form of complex writer for arrays as is done for MAPs? Again, 
sorry, I didn't have time to learn that bit. Would be great to understand that 
so we can add it to the write-up.

This chain mentions a MAP type. Drill also includes other complex types: 
REPEATED MAP, (non-repeated) LIST, (repeated) LIST, and UNION. It is not at all 
clear how UDAFs work for these types.

One other thing to consider: ANY_VALUE can never work for the hash agg because 
output values are updated in random order. It can only ever work for streaming 
agg because the streaming agg only appends output values. Fortunately, this 
chain is about the streaming agg. For Hash Agg, intermediate variable-width 
values are stored in an Object Vector, but those values won't survive 
serialization. As a result, only fixed-width types can be updated in random 
order. DRILL-6087 describes this issue.

Thanks,
- Paul

[1] https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information




 

    On Wednesday, April 11, 2018, 4:09:47 PM PDT, Aman Sinha 
<amansi...@apache.org> wrote:  
 
 Here's some background on what Gautam is trying to do:  Currently, SQL does
not have a standard way to do a DISTINCT on a subset of the columns in the
SELECT list.  Suppose there are 2 columns:
  a:  INTEGER
  b:  MAP
Suppose I want to only do DISTINCT on 'a' and I don't really care about the
column 'b' .. I just want the first or any value of 'b' within a single
group of 'a'.    Postgres actually has a 'DISTINCT ON(a), b' syntax but
based on our discussion on the Calcite mailing list, we want to avoid that
syntax.  So, there's an alternative proposal to do the following:

    SELECT a, ANY_VALUE(b) FROM table GROUP BY a

This means, ANY_VALUE will essentially be treated as an Aggregate function
and from a code-gen perspective, we want to read 1 item (a MapHolder) from
the incoming MapVector and write it to a particular index in the output
MapVector.    This is where  it would be useful to  have
MapVector.setSafe()  since the StreamingAgg and HashAgg both generate
setSafe()  for normal aggregate functions.

However, it seems the better (or perhaps only) way to do this is through
the MapOrListWriter (ComplexWriter) as long as there's a way to instruct
the writer to write to a specific output index (the output index is needed
because there are several groups in the output container and we want to
write to a specific one).

-Aman


On Wed, Apr 11, 2018 at 2:13 PM, Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> What semantics are wanted? SetSafe sets a single value in a vector. What
> does it mean to set a single map or array value? What would we pass as an
> argument?
> For non-simple types, something needs to iterate over the values: be they
> elements of a map, elements in an array, elements of an array of maps, then
> over the map members, etc.
> I believe that you are hitting a fundamental difference between simple
> scale values and complex (composite) values.
> This is for an aggregate. There is no meaningful aggregate of a map or an
> array. Once could aggregate over a scalar that is a member of a map to
> produce a scalar result. Or, one could iterate over the members of an array
> to produce, say, an average or sum.
> You are dealing with aggregate UDFs (even built in functions implement the
> UDAF protocol.) A quick check of the source code does not find a
> "AnyValueComplexFunctions" class, so this may perhaps be something new you
> are developing. What are the desired semantics?
> The UDAF protocol can include a complex writer for maps. I've not played
> with that yet. But, it does not seem meaningful to aggregate a map to
> produce a map or to aggregate an array to produce an array. The idea is
> that the UDAF figures out what to do with maps, then uses the complex
> writer to produce the desired result. This makes sense since there is no
> way to store a map as a simple value passed to setSafe().
> Can you provide additional details of what you are trying to do?
> Thanks,
> - Paul
>
>
>
>    On Wednesday, April 11, 2018, 1:53:12 PM PDT, Padma Penumarthy <
> ppenumar...@mapr.com> wrote:
>
>  I guess you can add a setSafe method which recursively does setSafe for
> all children.
>
> Thanks
> Padma
>
>
> > On Apr 11, 2018, at 1:19 PM, Gautam Parai <gpa...@mapr.com> wrote:
> >
> > Hi Paul/Padma,
> >
> >
> > Thank you so much for the responses. This function is supposed to return
> `any value` from the batch of incoming rows. Hence, the need to handle
> maps/lists.
> >
> >
> > This codegen is for the StreamingAggregator for Complex type(e.g. maps)
> in the incoming batch. It is trying to assign the values in the
> ComplexHolder to the outgoing MapVector.
> >
> >
> > MapVector vv9; // Output VV of StreamingAgg
> >
> > ....
> >
> >
> >    public void outputRecordValues(int outIndex)
> >
> >        throws SchemaChangeException
> >
> >    {
> >
> >        {
> >
> >            ComplexHolder out8;
> >
> >            {
> >
> >                final ComplexHolder out = new ComplexHolder();
> >
> >                FieldReader fr = work0;
> >
> >                MapHolder value = work1;
> >
> >                BigIntHolder nonNullCount = work2;
> >
> >
> >
> > AnyValueComplexFunctions$MapAnyValue_output: {
> >
> >    out.reader = fr;
> >
> > }
> >
> >
> >
> >                work0 = fr;
> >
> >                work1 = value;
> >
> >                work2 = nonNullCount;
> >
> >                out8 = out;
> >
> >            }
> >
> >            vv9 .getMutator().setSafe((outIndex), out8); //Don't have
> setSafe for MapVector
> >
> >        }
> >
> >    }
> >
> >
> > Please let me know your thoughts.
> >
> >
> > Gautam
> >
> >
> >
> > ________________________________
> > From: Paul Rogers <par0...@yahoo.com.INVALID>
> > Sent: Wednesday, April 11, 2018 12:40:15 PM
> > To: dev@drill.apache.org
> > Subject: Re: [DISCUSS] Regarding mutator interface
> >
> > Note that, for maps and lists, there is nothing to set. Maps are purely
> containers for other vectors. Lists (you didn't mention whether "repeated"
> or "non-repeated") are also containers. Non-repeated lists are containers
> for unions, repeated-lists are containers for arrays.
> > Any setting should be done on the contained vectors. For lists, only the
> offset vector is updated.
> > So, another question is: what is the generated code trying to set?
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Wednesday, April 11, 2018, 12:33:52 PM PDT, Padma Penumarthy <
> ppenumar...@mapr.com> wrote:
> >
> > Can you explain how aggregation on complex type works (or supposed to
> work).
> >
> > Thanks
> > Padma
> >
> >
> >> On Apr 11, 2018, at 12:15 PM, Gautam Parai <gpa...@mapr.com> wrote:
> >>
> >> Hi all,
> >>
> >>
> >> I am implementing a new aggregate function which also handles Complex
> types (map and list). However, the codegen barfs with
> >>
> >>
> >> CompileException: Line 104, Column 39: A method named "setSafe" is not
> declared in any enclosing class nor any supertype, nor through a static
> import
> >>
> >>
> >> It looks like we do not have set()/ setSafe() methods for
> MapVector/ListVector mutators.
> >>
> >>
> >> Should we add these methods to the Mutator interface to ensure all
> mutators implement them? Is these a reason we chose not to do so?
> >>
> >>
> >> Please let me know your thoughts. Thanks!
> >>
> >>
> >> Gautam
> >
>
>
  

Reply via email to