RE: [DataFusion] Promoting GroupByScalar to public API

Ruan Pearce-Authers Tue, 23 Feb 2021 10:51:31 -0800

Hi Daniël,

That makes sense! Especially from a performance perspective for future and 
revised aggregate implementations. I'll keep my copy-pasted GroupByScalar 
implementation for my prototype work (where perf isn't super critical yet) and 
try to get involved in any future discussions on this topic.


Cheers,
Ruan

> -----Original Message-----
> From: Daniël Heres <danielhe...@gmail.com>
> Sent: 23 February 2021 18:19
> To: dev@arrow.apache.org
> Subject: Re: [DataFusion] Promoting GroupByScalar to public API
> 
> Hi Ruan,
> 
> I am not sure about any stability guidelines, I didn't hear of it.
> 
> Aside from this, I think there might be some arguments against making
> GroupByScalar public for a number of reasons:
> 
> * We have both ScalarValue and GroupByValue, with some
> duplication/mapping between them. I think it makes sense to try to reduce
> duplication there.
> * The default Hash implementation of GroupByScalar is currently AFAIK
> *only* used in "DistinctScalarValues" (for count distinct) which is a current
> implementation detail - hashing on GroupByScalar values (and having them
> as a key in a HashMap) is not efficient and could be optimized.
> * For aggregates (where they are currently used), I think DataFusion has to
> step away at some point keeping state in the accumulators and aggregates
> *for each unique scalar*, as there is a high associated overhead (both
> memory and processing) of doing so. In more extreme case when each group
> by values are (almost) unique per row, DataFusion uses much more data per
> input row than necessary. As an example, we can't run some more
> challenging queries of this benchmark https://github.com/h2oai/db-
> benchmark as we run OOM there
> - for ~0.5GB of input, with 32GB of RAM). Much more efficient would be to
> store the accumulated data in (typed) arrays, keep offsets to values in those
> arrays and get rid of using per-row scalar values in those cases.
> 
> Best,
> 
> Daniël
> 
> Op di 23 feb. 2021 om 18:15 schreef Ruan Pearce-Authers <
> r...@reservoirdb.com>:
> 
> > Hey all,
> >
> > Whilst working on some UDAFs, I noticed I essentially had to
> > reimplement GroupByScalar to use scalars as HashMap keys inside
> > accumulator struct state, as ScalarValue (correctly!) doesn't implement
> Eq/Hash.
> >
> > A simple fix to ease this process would be to remove the crate-only
> > access qualifier so that it can be reused in user code.
> >
> > Do we have any stability guidelines for public types that need
> > considering before doing so? Otherwise, if there's general approval
> > for this, can I just create an issue in JIRA myself to associate with
> > the future PR, or is there a process to be followed? Arrow/general ASF
> > newb here :)
> >
> > Cheers,
> > Ruan
> >
> 
> 
> --
> Daniël Heres

RE: [DataFusion] Promoting GroupByScalar to public API

Reply via email to