Hi Daniël, That makes sense! Especially from a performance perspective for future and revised aggregate implementations. I'll keep my copy-pasted GroupByScalar implementation for my prototype work (where perf isn't super critical yet) and try to get involved in any future discussions on this topic.
Cheers, Ruan > -----Original Message----- > From: Daniël Heres <[email protected]> > Sent: 23 February 2021 18:19 > To: [email protected] > Subject: Re: [DataFusion] Promoting GroupByScalar to public API > > Hi Ruan, > > I am not sure about any stability guidelines, I didn't hear of it. > > Aside from this, I think there might be some arguments against making > GroupByScalar public for a number of reasons: > > * We have both ScalarValue and GroupByValue, with some > duplication/mapping between them. I think it makes sense to try to reduce > duplication there. > * The default Hash implementation of GroupByScalar is currently AFAIK > *only* used in "DistinctScalarValues" (for count distinct) which is a current > implementation detail - hashing on GroupByScalar values (and having them > as a key in a HashMap) is not efficient and could be optimized. > * For aggregates (where they are currently used), I think DataFusion has to > step away at some point keeping state in the accumulators and aggregates > *for each unique scalar*, as there is a high associated overhead (both > memory and processing) of doing so. In more extreme case when each group > by values are (almost) unique per row, DataFusion uses much more data per > input row than necessary. As an example, we can't run some more > challenging queries of this benchmark https://github.com/h2oai/db- > benchmark as we run OOM there > - for ~0.5GB of input, with 32GB of RAM). Much more efficient would be to > store the accumulated data in (typed) arrays, keep offsets to values in those > arrays and get rid of using per-row scalar values in those cases. > > Best, > > Daniël > > Op di 23 feb. 2021 om 18:15 schreef Ruan Pearce-Authers < > [email protected]>: > > > Hey all, > > > > Whilst working on some UDAFs, I noticed I essentially had to > > reimplement GroupByScalar to use scalars as HashMap keys inside > > accumulator struct state, as ScalarValue (correctly!) doesn't implement > Eq/Hash. > > > > A simple fix to ease this process would be to remove the crate-only > > access qualifier so that it can be reused in user code. > > > > Do we have any stability guidelines for public types that need > > considering before doing so? Otherwise, if there's general approval > > for this, can I just create an issue in JIRA myself to associate with > > the future PR, or is there a process to be followed? Arrow/general ASF > > newb here :) > > > > Cheers, > > Ruan > > > > > -- > Daniël Heres
