Sorry for disappearing from this thread for a while. It looks like a lot of energy has been put into benchmarking and refining the heuristic for deciding when to use the SIMD path so that we avoid large regressions when there are special characters. I think this is all valuable work, but I'm a bit concerned that we are putting the cart before the horse. IMHO it would be better to first get the SIMD code committed with the absolute simplest heuristic we can think of (e.g., as soon as we see a special character, switch to the scalar path for the remainder of COPY FROM). My hope is that would be far easier to reason about from a performance angle. If we immediately fall back to the existing code path, we don't need to worry about how many special characters there are and whether they are sparse or clustered or whatever. We just need to measure the overhead of the new branches and ensure they don't produce meaningful regressions. Assuming that all looks good, we can then focus on the SIMD code itself and make sure that is correct and optimal. And once we get that portion committed, we could then consider more sophisticated heuristics.
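To make the "simplest heuristic" concrete, here is a rough sketch of a sticky fallback. It is not taken from any posted patch. All names (LineScanState, chunk_is_plain, scan_scalar, find_line_end) are hypothetical, the byte loop over a 16-byte chunk merely stands in for a real vector compare, and treating only backslash as "special" is a simplifying assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_SIZE 16			/* stand-in for one vector register width */

/* Hypothetical per-COPY state: once cleared, the flag is never set again. */
typedef struct LineScanState
{
	bool		fast_path_ok;
} LineScanState;

/*
 * Stand-in for the SIMD check: true if the chunk holds neither a newline
 * nor a backslash.  A real implementation would use vector compares.
 */
static bool
chunk_is_plain(const char *p)
{
	for (int i = 0; i < CHUNK_SIZE; i++)
	{
		if (p[i] == '\n' || p[i] == '\\')
			return false;
	}
	return true;
}

/*
 * Scalar scan: returns the offset of the first unescaped newline at or
 * after pos, or len if there is none.  Sets *saw_special on any backslash.
 */
static size_t
scan_scalar(const char *buf, size_t pos, size_t len, bool *saw_special)
{
	while (pos < len)
	{
		char		c = buf[pos];

		if (c == '\\')
		{
			*saw_special = true;
			pos += 2;			/* skip the backslash and the escaped byte */
			continue;
		}
		if (c == '\n')
			return pos;
		pos++;
	}
	return len;
}

/*
 * Find the end of the current line.  While the fast path is enabled, skip
 * whole chunks that contain nothing interesting, then let the scalar code
 * finish.  The first backslash disables the fast path for all later calls.
 */
static size_t
find_line_end(LineScanState *st, const char *buf, size_t len)
{
	size_t		pos = 0;
	size_t		end;
	bool		saw_special = false;

	if (st->fast_path_ok)
	{
		while (pos + CHUNK_SIZE <= len && chunk_is_plain(buf + pos))
			pos += CHUNK_SIZE;
	}

	end = scan_scalar(buf, pos, len, &saw_special);

	if (saw_special)
		st->fast_path_ok = false;	/* sticky: scalar for the rest of the load */

	return end;
}
```

With this shape, the only new cost on the fallback side is the single fast_path_ok test per call, so that branch is the overhead to measure. How many special characters there are, or how they are distributed, does not matter.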
FWIW I'm hoping to get something in this area committed for v19, and IMO now is a good time to start thinking about how to get things over the finish line. Thanks for working on it.

--
nathan
