On Wed, 23 Jun 2021 07:37:09 -0500
Wes McKinney <wesmck...@gmail.com> wrote:
> On Wed, Jun 23, 2021 at 3:03 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > On Tue, 22 Jun 2021 19:04:49 -0500
> > Wes McKinney <wesmck...@gmail.com> wrote:  
> > > Some on this list might be interested in a new paper out of CMU/MIT
> > > about the use of selection vectors and bitmaps for handling the
> > > intermediate results of filters:
> > >
> > > https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf
> > >
> > > The research was done in the context of NoisePage which uses Arrow as
> > > its memory format. I found some of the observations related to AVX512
> > > to be interesting.  
> >
> > Too bad they didn't compare with the simple strategy of materializing
> > filtered results.  
> 
> I think this strategy has been rejected consistently in vectorized
> query engines on empirical performance grounds. "Pushing down" the
> filter into aggregate or elementwise kernels (to avoid a temporary
> materialization / memory allocation) is the way that the systems I'm
> aware of work.

Yet it seems this would depend both on the filter selectivity and on
the length of the downstream pipeline.

If the filter selectivity is close to 0 (i.e. only a small subset is
selected) then 1) materializing the filtered data should be cheap, and
2) the filtered data is much smaller and contiguous, hence more
cache-friendly.

(note that if the result set is temporary and private, materializing
the filtered result might be done in place, but that doesn't fit the
Arrow model of immutability very well)
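
As a rough sketch of what I mean (again plain C++ with made-up names,
not Arrow APIs): the materializing strategy pays for one small copy and
then lets every downstream operator run over a dense buffer.

#include <cstdint>
#include <vector>

// Hypothetical "materialize first" strategy: copy the selected values
// into a contiguous buffer once; with low selectivity the copy is cheap
// and downstream operators then scan a dense, cache-friendly array.
std::vector<int64_t> MaterializeFiltered(
    const std::vector<int64_t>& values,
    const std::vector<uint32_t>& selection) {
  std::vector<int64_t> out;
  out.reserve(selection.size());
  for (uint32_t idx : selection) {
    out.push_back(values[idx]);
  }
  return out;
}

int64_t Sum(const std::vector<int64_t>& dense) {
  int64_t total = 0;
  for (int64_t v : dense) total += v;  // sequential scan of the dense copy
  return total;
}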

Regards

Antoine.
