> > Re-stating the points so that scrolling is not needed.
> > 1. Change of integer encoding (see debate in this thread on FOR vs
> > Delta). We also want to get fast lanes in at some point.
>
> I'm not sure why we "want to get fast lanes in at some point". I don't
> think we want to obsess about performance here, given that current
> performance is already very good, and real-world bottlenecks will
> probably be elsewhere.

I think we can debate this further if/when someone proposes fast lanes. My main point here is that there are valid use-cases for making integer encoding configurable (or at least there is uncertainty here).

> This highlights that *unless you can use cpu vector opcodes, adding more
> options can hurt branch prediction and so make overall performance worse*.
> It's a good argument for simplicity in compression and encoding choices.

It looks like the actual issue described for ORC in the paper is that it has multiple sub-encodings in a batch. This is different from the design proposed here, where there is still a fixed encoding per page in Parquet. Given reasonably sized pages, I don't think branch misprediction should be a big issue for new encodings. I agree that we should be conservative in general about adding new encodings.

Regards,
Micah

On Thu, Feb 5, 2026 at 6:28 AM Antoine Pitrou <[email protected]> wrote:

> hi Prateek,
>
> On 03/02/2026 at 23:39, PRATEEK GAUR wrote:
> > Hi Antoine and Micah,
> >
> > Apologies for getting back on this a little late.
> >
> > *Running Perf tests*
> > @Antoine Pitrou <[email protected]> were you able to figure out the
> > steps to run the tests?
>
> Yes, I finally did that, results below on an AMD Zen 2 CPU:
> https://gist.github.com/pitrou/1f4aefb7034657ce018231d87993f437
>
> > *Sampling Frequency*
> > We want to pick the right parameters to encode the values with. That is
> > what the Spec requires.
> > From the implementation perspective you raise a good point that did
> > cross my mind, that 'practically we don't want to sample for every
> > page', for performance reasons. My thinking is each engine is free to
> > decide this.
> > 1) Do it at page level if data is changing often
> > 2) Provide fixed presets via config
> > 3) Do it once per encoder (per column, as Micah pointed out)
> > 4) Provide a fancy config.
>
> Ok, that would sound fine to me.
>
> > Re-stating the points so that scrolling is not needed.
> > 1. Change of integer encoding (see debate in this thread on FOR vs
> > Delta). We also want to get fast lanes in at some point.
>
> I'm not sure why we "want to get fast lanes in at some point". I don't
> think we want to obsess about performance here, given that current
> performance is already very good, and real-world bottlenecks will
> probably be elsewhere.
>
> > For example, at this point I do see that both bitpacking of exceptions,
> > as pointed out by Antoine, or plain ub2 encoding should work equally
> > well.
>
> Well, since 16 bits are actually enough for the current vector size, I'd
> say we can keep things simple.
>
> Regards
>
> Antoine.
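
For readers following the exceptions discussion, here is a rough sketch of the two options being compared for exception positions: plain unsigned 16-bit values versus bit-packing. It assumes "ub2" means unsigned 2-byte integers and uses a hypothetical vector size of 1024, since the thread states neither. All function names are made up for illustration and are not part of any Parquet implementation.

```python
import struct

# Assumption: the thread does not state the vector size; 1024 is illustrative.
VECTOR_SIZE = 1024


def bits_needed(vector_size: int) -> int:
    """Bits required to address any position in [0, vector_size)."""
    return max(1, (vector_size - 1).bit_length())


def encode_plain_u16(positions):
    """Store each position as a little-endian unsigned 16-bit integer."""
    return struct.pack(f"<{len(positions)}H", *positions)


def decode_plain_u16(data):
    """Inverse of encode_plain_u16."""
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def encode_bitpacked(positions, width):
    """Pack positions back to back, `width` bits each, lowest bits first."""
    acc = 0
    for i, pos in enumerate(positions):
        if pos >> width:
            raise ValueError(f"position {pos} does not fit in {width} bits")
        acc |= pos << (i * width)
    nbytes = (len(positions) * width + 7) // 8
    return acc.to_bytes(nbytes, "little")


def decode_bitpacked(data, count, width):
    """Inverse of encode_bitpacked; `count` must be stored separately."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


positions = [3, 17, 512, 1023]          # hypothetical exception positions
width = bits_needed(VECTOR_SIZE)        # 10 bits under the 1024 assumption
plain = encode_plain_u16(positions)
packed = encode_bitpacked(positions, width)
print(width, len(plain), len(packed))
```

Under these assumptions, four exceptions take 8 bytes as plain 16-bit values and 5 bytes bit-packed. Since exceptions are expected to be rare, the saving per vector is small, which is consistent with the "keep things simple" conclusion above.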
