Re: UDAFs have an inefficiency problem

Erik Erlandson Mon, 30 Sep 2019 08:40:19 -0700

On the PR review, there were questions about adding a new aggregating
class, and whether or not Aggregator[IN,BUF,OUT] could be used.  I added a
proof of concept solution based on enhancing Aggregator to the pull-req:
https://github.com/apache/spark/pull/25024/

I wrote up my findings on the PR but the gist is that Aggregator is a
feasible option, however it does not provide *total* feature parity with
UDAF.  Note that this PR now includes two candidate solutions, for
comparison purposes, as well as an extra test file (tdigest.scala).
Eventually one of these solutions will be removed, depending on what option
is selected.

I'm pushing this forward now with the goal of getting a solution into the
upcoming 3.0 branch cut

On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson <eerla...@redhat.com> wrote:

> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating data structures (UDTs)
> used by UDAFs are serialized to a Row object, and de-serialized, for every
> row in a data frame.
> Cheers,
> Erik
>
>

Re: UDAFs have an inefficiency problem

Reply via email to