>
> > Re-stating the points so that scrolling is not needed.
> > 1.  Change of integer encoding (see debate in this thread on FOR vs
> > Delta).  We also want to get fast lanes in at some point.


> I'm not sure why we "want to get fast lanes in at some point". I don't
> think we want to obsess about performance here, given that current
> performance is already very good, and real-world bottlenecks will
> probably be elsewhere.


I think we can maybe debate this further, if/when someone proposes fast
lanes.  My main point here is that there are valid use-cases for making
integer encoding configurable (or at least there is uncertainty here).


> This highlights that *unless you can use cpu vector opcodes, adding more
> options can hurt branch prediction and so make overall performance worse*.
> It's a good argument for simplicity in compression and encoding choices.


It looks like the actual issue described for ORC in the paper is that it
has multiple sub-encodings in a batch.  This is different than the design
proposed here, where there is still a fixed encoding per page in Parquet.
Given reasonably sized pages, I don't think branch misprediction should be a
big issue for new encodings.  I agree that we should be conservative in
general for adding new encodings.
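
To make the distinction concrete, here is a rough sketch of what "fixed
encoding per page" means for a decoder. It is only an illustration, not
the Parquet or ORC implementation: the page tuples, the `DECODERS` table
and both decoder functions are hypothetical, and the FOR and delta
layouts are simplified. The point is that the choice of decoder is made
once per page, and the per-value loop inside each decoder never has to
re-check the encoding.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical page layout for illustration: (encoding name, base value, payload).
Page = Tuple[str, int, List[int]]


def decode_for(base: int, payload: List[int]) -> List[int]:
    # Frame-of-reference: each stored value is an offset from a shared base.
    return [base + v for v in payload]


def decode_delta(base: int, payload: List[int]) -> List[int]:
    # Delta: each stored value is the difference from the previous value.
    out: List[int] = []
    cur = base
    for d in payload:
        cur += d
        out.append(cur)
    return out


DECODERS: Dict[str, Callable[[int, List[int]], List[int]]] = {
    "FOR": decode_for,
    "DELTA": decode_delta,
}


def decode_column(pages: List[Page]) -> List[int]:
    out: List[int] = []
    for encoding, base, payload in pages:
        decoder = DECODERS[encoding]  # one dispatch per page, not per value
        out.extend(decoder(base, payload))
    return out


pages: List[Page] = [("FOR", 100, [0, 3, 1]), ("DELTA", 100, [5, 5, 5])]
decoded = decode_column(pages)
```

With multiple sub-encodings inside one batch, the equivalent of the
`DECODERS[encoding]` lookup would move into the inner loop, which is
where I understand the misprediction cost to come from.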

Regards,
Micah

On Thu, Feb 5, 2026 at 6:28 AM Antoine Pitrou <[email protected]> wrote:

>
> hi Prateek,
>
> On 03/02/2026 at 23:39, PRATEEK GAUR wrote:
> > Hi Antoine and Micah,
> >
> > Apologies for getting back on this a little late.
> >
> > *Running Perf tests*
> > @Antoine Pitrou <[email protected]> were you able to figure out the steps
> > to run the tests?
>
> Yes, I finally did that, results below on an AMD Zen 2 CPU:
> https://gist.github.com/pitrou/1f4aefb7034657ce018231d87993f437
>
> > *Sampling Frequency*
> > We want to pick the right parameters to encode the values with. That is
> > what the Spec requires.
> > From the implementation perspective you raise a good point that did cross
> > my mind that 'practically we don't want to sample for every page', for
> > performance reasons. My thinking is each engine is free to decide this.
> > 1) Do it at page level if data is changing often
> > 2) Provide fixed presets via config
> > 3) Do it once per encoder (per column, as Micah pointed out)
> > 4) Provide a fancy config.
>
> Ok, that would sound fine to me.
>
> > Re-stating the points so that scrolling is not needed.
> > 1.  Change of integer encoding (see debate in this thread on FOR vs
> > Delta).  We also want to get fast lanes in at some point.
>
> I'm not sure why we "want to get fast lanes in at some point". I don't
> think we want to obsess about performance here, given that current
> performance is already very good, and real-world bottlenecks will
> probably be elsewhere.
>
> > For example, at this point I do see that both bitpacking of exceptions, as
> > pointed out by Antoine, or plain ub2 encoding should work equally well.
>
> Well, since 16 bits are actually enough for the current vector size, I'd
> say we can keep things simple.
>
> Regards
>
> Antoine.
>
>
>
