RE: [DataFusion] Promoting GroupByScalar to public API

2021-02-23 Thread Ruan Pearce-Authers
Hi Daniël,

That makes sense! Especially from a performance perspective for future and 
revised aggregate implementations. I'll keep my copy-pasted GroupByScalar 
implementation for my prototype work (where perf isn't super critical yet) and 
try to get involved in any future discussions on this topic.

Cheers,
Ruan

> -Original Message-
> From: Daniël Heres 
> Sent: 23 February 2021 18:19
> To: dev@arrow.apache.org
> Subject: Re: [DataFusion] Promoting GroupByScalar to public API
> 
> Hi Ruan,
> 
> I am not sure about any stability guidelines, I didn't hear of it.
> 
> Aside from this, I think there might be some arguments against making
> GroupByScalar public for a number of reasons:
> 
> * We have both ScalarValue and GroupByValue, with some
> duplication/mapping between them. I think it makes sense to try to reduce
> duplication there.
> * The default Hash implementation of GroupByScalar is currently AFAIK
> *only* used in "DistinctScalarValues" (for count distinct) which is a current
> implementation detail - hashing on GroupByScalar values (and having them
> as a key in a HashMap) is not efficient and could be optimized.
> * For aggregates (where they are currently used), I think DataFusion has to
> step away at some point keeping state in the accumulators and aggregates
> *for each unique scalar*, as there is a high associated overhead (both
> memory and processing) of doing so. In more extreme case when each group
> by values are (almost) unique per row, DataFusion uses much more data per
> input row than necessary. As an example, we can't run some more
> challenging queries of this benchmark https://github.com/h2oai/db-
> benchmark as we run OOM there
> - for ~0.5GB of input, with 32GB of RAM). Much more efficient would be to
> store the accumulated data in (typed) arrays, keep offsets to values in those
> arrays and get rid of using per-row scalar values in those cases.
> 
> Best,
> 
> Daniël
> 
> Op di 23 feb. 2021 om 18:15 schreef Ruan Pearce-Authers <
> r...@reservoirdb.com>:
> 
> > Hey all,
> >
> > Whilst working on some UDAFs, I noticed I essentially had to
> > reimplement GroupByScalar to use scalars as HashMap keys inside
> > accumulator struct state, as ScalarValue (correctly!) doesn't implement
> Eq/Hash.
> >
> > A simple fix to ease this process would be to remove the crate-only
> > access qualifier so that it can be reused in user code.
> >
> > Do we have any stability guidelines for public types that need
> > considering before doing so? Otherwise, if there's general approval
> > for this, can I just create an issue in JIRA myself to associate with
> > the future PR, or is there a process to be followed? Arrow/general ASF
> > newb here :)
> >
> > Cheers,
> > Ruan
> >
> 
> 
> --
> Daniël Heres


Re: [DataFusion] Promoting GroupByScalar to public API

2021-02-23 Thread Daniël Heres
Hi Ruan,

I am not sure about any stability guidelines, I didn't hear of it.

Aside from this, I think there might be some arguments against making
GroupByScalar public for a number of reasons:

* We have both ScalarValue and GroupByValue, with some duplication/mapping
between them. I think it makes sense to try to reduce duplication there.
* The default Hash implementation of GroupByScalar is currently AFAIK *only*
used in "DistinctScalarValues" (for count distinct) which is a current
implementation detail - hashing on GroupByScalar values (and having them as
a key in a HashMap) is not efficient and could be optimized.
* For aggregates (where they are currently used), I think DataFusion has to
step away at some point keeping state in the accumulators and aggregates *for
each unique scalar*, as there is a high associated overhead (both memory
and processing) of doing so. In more extreme case when each group by values
are (almost) unique per row, DataFusion uses much more data per input row
than necessary. As an example, we can't run some more challenging queries
of this benchmark https://github.com/h2oai/db-benchmark as we run OOM there
- for ~0.5GB of input, with 32GB of RAM). Much more efficient would be to
store the accumulated data in (typed) arrays, keep offsets to values in
those arrays and get rid of using per-row scalar values in those cases.

Best,

Daniël

Op di 23 feb. 2021 om 18:15 schreef Ruan Pearce-Authers <
r...@reservoirdb.com>:

> Hey all,
>
> Whilst working on some UDAFs, I noticed I essentially had to reimplement
> GroupByScalar to use scalars as HashMap keys inside accumulator struct
> state, as ScalarValue (correctly!) doesn't implement Eq/Hash.
>
> A simple fix to ease this process would be to remove the crate-only access
> qualifier so that it can be reused in user code.
>
> Do we have any stability guidelines for public types that need considering
> before doing so? Otherwise, if there's general approval for this, can I
> just create an issue in JIRA myself to associate with the future PR, or is
> there a process to be followed? Arrow/general ASF newb here :)
>
> Cheers,
> Ruan
>


-- 
Daniël Heres