Hi,

Thank you for sharing your thoughts!

On Sat, 7 Feb 2026 at 00:29, Nathan Bossart <[email protected]> wrote:
>
> It looks like a lot of energy has been put into benchmarking and refining
> the heuristic for deciding when to use the SIMD path so that we avoid large
> regressions when there are special characters.  I think this is all
> valuable work, but I'm a bit concerned that we are putting the cart before
> the horse.  IMHO it would be better to first get the SIMD code committed
> with the absolute simplest heuristic we can think of (e.g., as soon as we
> see a special character, switch to the scalar path for the remainder of
> COPY FROM).  My hope is that would be far easier to reason about from a
> performance angle.  If we immediately fall back to the existing code path,
> we don't need to worry about how many special characters there are and
> whether they are sparse or clustered or whatever.  We just need to measure
> the overhead of the new branches and ensure they don't produce meaningful
> regressions.  Assuming that all looks good, we can then focus on the SIMD
> code itself and make sure that is correct and optimal.  And once we get
> that portion committed, we could then consider more sophisticated
> heuristics.

I have three possible approaches in mind; they are quite similar to
each other.

1- After encountering a special character, disable SIMD for the rest
of the current line and also for the rest of the data.

2- This is a mix of the current heuristic and #1. After encountering a
special character, skip SIMD for the current line (let's say line 1)
and for the next line (line 2). Then try SIMD again on the following
line (line 3): if there is no special character, continue to run SIMD;
if there is one, skip SIMD for two lines this time. And so on: every
time a special character is encountered during a SIMD run, the number
of lines to skip is doubled.

3- This version is a bit different from #2. Instead of calculating the
number of lines to skip dynamically, skip a constant number of lines
(N) and then try SIMD again after those lines. N could be something
like 100, 1000, or 10000. Actually, you and Andrew suggested this
approach before [1].
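
To make the differences concrete, here is a rough sketch of the three
policies as a per-line decision. It is only an illustration, not code
from the patch: all names (SimdHeuristic, simd_should_try,
simd_saw_special, etc.) are hypothetical, and I assumed that in #2 the
skip count is never reset after a clean line, which is an open choice.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; none of these exist in the patch. */
typedef enum SimdPolicy
{
	SIMD_DISABLE_FOREVER,		/* approach #1 */
	SIMD_BACKOFF_DOUBLING,		/* approach #2 */
	SIMD_SKIP_CONSTANT			/* approach #3 */
} SimdPolicy;

typedef struct SimdHeuristic
{
	SimdPolicy	policy;
	bool		disabled;		/* #1: SIMD is off for the rest of the data */
	uint64_t	skip_remaining; /* upcoming lines that must use scalar path */
	uint64_t	next_skip;		/* #2: lines to skip on the next hit */
	uint64_t	constant_skip;	/* #3: the constant N */
} SimdHeuristic;

static void
simd_heuristic_init(SimdHeuristic *h, SimdPolicy policy, uint64_t n)
{
	h->policy = policy;
	h->disabled = false;
	h->skip_remaining = 0;
	h->next_skip = 1;			/* first hit skips one following line */
	h->constant_skip = n;
}

/* Call once at the start of each line: should this line try SIMD? */
static bool
simd_should_try(SimdHeuristic *h)
{
	if (h->disabled)
		return false;
	if (h->skip_remaining > 0)
	{
		h->skip_remaining--;
		return false;
	}
	return true;
}

/*
 * Call when the SIMD path meets a special character.  The caller is
 * assumed to finish the current line with the scalar path in all cases.
 */
static void
simd_saw_special(SimdHeuristic *h)
{
	switch (h->policy)
	{
		case SIMD_DISABLE_FOREVER:
			h->disabled = true;
			break;
		case SIMD_BACKOFF_DOUBLING:
			h->skip_remaining = h->next_skip;
			if (h->next_skip <= UINT64_MAX / 2)
				h->next_skip *= 2;	/* double, guarding against overflow */
			break;
		case SIMD_SKIP_CONSTANT:
			h->skip_remaining = h->constant_skip;
			break;
	}
}
```

With this shape, #1 is just the degenerate case where the state never
recovers, so measuring it first only involves the cost of the
simd_should_try() branch per line.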

I think what you suggested is closer to #1 or #3. I would like to hear
your opinions on whether any of these approaches is worth
implementing.

[1] https://postgr.es/m/aR4wDwNdLc5TmcQq%40nathan

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

