BTW, if this is known, is there an existing JIRA I should link to?

On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson <eerla...@redhat.com> wrote:
> At a high level, some candidate strategies are:
> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to the UDAF trait itself) so that the update method can do the right thing.
> 2. Expose TypedImperativeAggregate to users for defining their own, since it already does the right thing.
> 3. As a workaround, allow users to define their own sub-classes of DataType. It would essentially allow one to define the sqlType of the UDT to be the aggregating object itself and make ser/de a no-op. I tried doing this and it will compile, but Spark's internals only consider a predefined universe of DataType classes.
>
> All of these options are likely to have implications for the Catalyst systems. I'm not sure whether they are minor or more substantial.
>
> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Yes, this is known and an issue for performance. Do you have any thoughts on how to fix this?
>>
>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson <eerla...@redhat.com> wrote:
>>
>>> I describe some of the details here:
>>> https://issues.apache.org/jira/browse/SPARK-27296
>>>
>>> The short version of the story is that the aggregating data structures (UDTs) used by UDAFs are serialized to a Row object, and deserialized, for every row in a data frame.
>>>
>>> Cheers,
>>> Erik
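To make the per-row round-trip concrete, a minimal UDAF along these lines shows where the cost lands (illustrative only: CollectValues is a toy name, and a plain array buffer stands in here for a UDT-backed sketch, so the example stays self-contained):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // A toy aggregator whose buffer is a growing collection. Every call to
    // update() round-trips the whole buffer through its Row encoding.
    class CollectValues extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("x", DoubleType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("vals", ArrayType(DoubleType)) :: Nil)
      def dataType: DataType = ArrayType(DoubleType)
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit = {
        buffer(0) = Seq.empty[Double]
      }

      def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        // getSeq deserializes the entire buffer from the Row representation,
        // and the assignment below re-serializes it -- once per input row,
        // which is the overhead described in SPARK-27296.
        val vals = buffer.getSeq[Double](0)
        buffer(0) = vals :+ input.getDouble(0)
      }

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        buffer1(0) = buffer1.getSeq[Double](0) ++ buffer2.getSeq[Double](0)
      }

      def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
    }

For a UDT-backed structure the same pattern forces a full UDT deserialize/serialize cycle per row in update().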
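For strategy 2, a user aggregator written against TypedImperativeAggregate would look roughly like the following. Caveat: this is the internal Catalyst API as of 2.4, so treat it as a sketch whose signatures may drift; TypedSum is a made-up example.

    import java.nio.ByteBuffer

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.Expression
    import org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate
    import org.apache.spark.sql.types.{DataType, DoubleType}

    // A trivial sum, but the buffer (a boxed Double here; in practice a sketch
    // object) lives in memory as-is between rows.
    case class TypedSum(
        child: Expression,
        mutableAggBufferOffset: Int = 0,
        inputAggBufferOffset: Int = 0)
      extends TypedImperativeAggregate[java.lang.Double] {

      override def createAggregationBuffer(): java.lang.Double = 0.0

      // update receives the live buffer object; no per-row Row round-trip
      override def update(buffer: java.lang.Double, input: InternalRow): java.lang.Double = {
        val v = child.eval(input)
        if (v == null) buffer
        else java.lang.Double.valueOf(buffer + v.asInstanceOf[Double])
      }

      override def merge(b1: java.lang.Double, b2: java.lang.Double): java.lang.Double =
        java.lang.Double.valueOf(b1 + b2)

      override def eval(buffer: java.lang.Double): Any = buffer.doubleValue()

      // ser/de only runs when partial aggregates cross a shuffle boundary
      override def serialize(buffer: java.lang.Double): Array[Byte] =
        ByteBuffer.allocate(8).putDouble(buffer).array()

      override def deserialize(bytes: Array[Byte]): java.lang.Double =
        java.lang.Double.valueOf(ByteBuffer.wrap(bytes).getDouble)

      override def withNewMutableAggBufferOffset(newOffset: Int): TypedSum =
        copy(mutableAggBufferOffset = newOffset)
      override def withNewInputAggBufferOffset(newOffset: Int): TypedSum =
        copy(inputAggBufferOffset = newOffset)

      override def children: Seq[Expression] = child :: Nil
      override def nullable: Boolean = false
      override def dataType: DataType = DoubleType
    }

Invoking it today also requires internal plumbing, something like new Column(TypedSum(df("x").expr).toAggregateExpression()), which is part of why exposing this cleanly would need some API design.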
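And the strategy-3 experiment I mentioned was along these lines (Sketch and SketchUDT are stand-in names; since the UDT API is private[spark] in 2.x, this only compiles from inside an org.apache.spark package):

    import org.apache.spark.sql.types.{DataType, UserDefinedType}

    // Hypothetical aggregating structure, e.g. a sketch updated in place
    class Sketch extends Serializable

    // A UDT whose sqlType is the UDT itself, so serialize/deserialize become
    // no-ops and the buffer slot would hold the live Sketch object directly.
    // This compiles, but Catalyst matches against a closed set of DataType
    // classes, so it is rejected at runtime.
    class SketchUDT extends UserDefinedType[Sketch] {
      override def sqlType: DataType = this
      override def serialize(obj: Sketch): Any = obj
      override def deserialize(datum: Any): Sketch = datum.asInstanceOf[Sketch]
      override def userClass: Class[Sketch] = classOf[Sketch]
    }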