On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
> In trying to wrap the SIMD code behind layers of abstraction, the latest
> patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
> we need to talk about how vectorized code deals with the "tail" that is too
> small for the register:
>
> 1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
> 2. Read some junk into the register and mask off false positives from the
> result.
>
> There are advantages to both depending on the situation.
>
> Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
> elements or less, it will iterate over them one-by-one exactly like a
> node4. Only when full with 16 will the vector path be taken. When another
> entry is added, the elements are copied to the next bigger node, so there's
> a *small* window where it's fast.
>
> In short, this code needs to be lower level so that we still have full
> control while being portable. I will work on this, and also the related
> code for node dispatch.

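To make the two tail-handling options concrete, here is a rough sketch in
plain C. It is only an illustration under stated assumptions: the loop in
search_masked() stands in for a single vector compare, and the names
(NODE16_SLOTS, search_one_by_one, search_masked) are hypothetical, not
taken from the patch.

```c
#include <stdint.h>

/* Hypothetical: number of chunk slots allocated in a node16. */
#define NODE16_SLOTS 16

/* Approach #1: scan only the valid entries, one by one. */
static int
search_one_by_one(const uint8_t *chunks, int count, uint8_t key)
{
	for (int i = 0; i < count; i++)
	{
		if (chunks[i] == key)
			return i;
	}
	return -1;
}

/*
 * Approach #2: compare every allocated slot, including junk past "count",
 * then mask off false positives. The loop below emulates what a single
 * vector compare plus a "movemask"-style step would produce.
 */
static int
search_masked(const uint8_t *chunks, int count, uint8_t key)
{
	uint32_t	bitfield = 0;
	int			idx = 0;

	/* One bit per slot: set if that slot equals the key. */
	for (int i = 0; i < NODE16_SLOTS; i++)
		bitfield |= (uint32_t) (chunks[i] == key) << i;

	/* Discard matches at positions >= count. */
	bitfield &= (UINT32_C(1) << count) - 1;

	if (bitfield == 0)
		return -1;

	/* Return the index of the lowest set bit. */
	while ((bitfield & 1) == 0)
	{
		bitfield >>= 1;
		idx++;
	}
	return idx;
}
```

Both functions should return the same index for any key. The second one
reads all NODE16_SLOTS bytes, so it is only safe when that many slots are
actually allocated.
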
Is it possible to use approach #2 here, too?  AFAICT space is allocated for
all of the chunks, so there wouldn't be any danger in searching all of them
and discarding any results >= node->count.  Granted, we're depending on the
number of chunks always being a multiple of elements-per-vector in order to
avoid the tail path, but that seems like a reasonably safe assumption that
can be covered with comments.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com