On Thu, Jan 8, 2026 at 2:49 PM Manni Wood <[email protected]>
wrote:

>
>
> On Wed, Jan 7, 2026 at 1:13 PM Manni Wood <[email protected]>
> wrote:
>
>>
>>
>> On Tue, Jan 6, 2026 at 2:05 PM Manni Wood <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Wed, Dec 31, 2025 at 7:04 AM Nazir Bilal Yavuz <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> On Wed, 24 Dec 2025 at 18:08, KAZAR Ayoub <[email protected]> wrote:
>>>> >
>>>> > Hello,
>>>> > Following the same path of optimizing COPY FROM using SIMD, I found
>>>> that COPY TO can also benefit from this.
>>>> >
>>>> > I attached a small patch that uses SIMD to skip data and advance as
>>>> far as the first special character is found, then fallback to scalar
>>>> processing for that character and re-enter the SIMD path again...
>>>> > There are two ways to do this:
>>>> > 1) Essentially we do SIMD until we find a special character, then
>>>> continue on the scalar path without re-entering SIMD again.
>>>> > - This gives 10% to 30% speedups depending on the weight of special
>>>> characters in the attribute; we don't lose anything here since it
>>>> advances with SIMD until it can't (using the previous scripts: 1/3, 2/3
>>>> special chars).
>>>> >
>>>> > 2) Do the SIMD path, then use the scalar path when we hit a special
>>>> character, re-entering the SIMD path each time.
>>>> > - This is equivalent to the COPY FROM story; we'll need to find the
>>>> same heuristic to use for both COPY FROM/TO to reduce the regressions (same
>>>> regressions: around 20% to 30% with 1/3, 2/3 special chars).
>>>> >
>>>> > Something else to note is that the scalar path for COPY TO isn't as
>>>> heavy as the state machine in COPY FROM.
>>>> >
>>>> > So if we find the sweet spot for the heuristic, doing the same for
>>>> COPY TO will be trivial and always beneficial.
>>>> > Attached is 0004, which is option 1 (SIMD without re-entering); 0005
>>>> is option 2.
>>>>
>>>> Patches look correct to me. I think we could move these SIMD code
>>>> portions into a shared function to remove duplication, although that
>>>> might have a performance impact. I have not benchmarked these patches
>>>> yet.
>>>>
>>>> Another consideration is that these patches might need their own
>>>> thread, though I am not completely sure about this yet.
>>>>
>>>> One question: what do you think about having a 0004-style approach for
>>>> COPY FROM? What I have in mind is running SIMD for each line & column,
>>>> stopping SIMD once it can no longer skip an entire chunk, and then
>>>> continuing with the next line & column.
>>>>
>>>> --
>>>> Regards,
>>>> Nazir Bilal Yavuz
>>>> Microsoft
>>>>
>>>
>>> Hello, Nazir, I tried your suggested cpupower commands as well as
>>> disabling turbo, and my results are indeed more uniform. (see attached
>>> screenshot of my spreadsheet).
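
For the archives, since "the cpupower tips" come up several times in this
thread: on my machine they boiled down to pinning the frequency governor and
switching off turbo boost, roughly like the following (the exact commands
were in Nazir's earlier message, and the turbo knob lives somewhere else on
machines that don't use the intel_pstate driver):

sudo cpupower frequency-set --governor performance
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo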
>>>
>>> This time, I ran the tests on my Tower PC instead of on my laptop.
>>>
>>> I also followed Mark Wong's advice and used the taskset command to pin
>>> my postgres postmaster (and all of its children) to a single cpu core.
>>>
>>> So when I start postgres, I do this to pin it to core 27:
>>>
>>> ${PGHOME}/bin/pg_ctl -D ${PGHOME}/data -l ${PGHOME}/logfile.txt start
>>> PGPID=$(head -1 ${PGHOME}/data/postmaster.pid)
>>> taskset --cpu-list -p 27 ${PGPID}
>>>
>>>
>>> My results seem similar to yours:
>>>
>>> master: Nazir 85ddcc2f4c | Manni 877ae5db
>>>
>>> text, no special: 102294 | 302651
>>> text, 1/3 special: 108946 | 326208
>>> csv, no special: 121831 | 348930
>>> csv, 1/3 special: 140063 | 439786
>>>
>>> v3
>>>
>>> text, no special: 88890 (13.1% speedup) | 227874 (24.7% speedup)
>>> text, 1/3 special: 110463 (1.4% regression) | 322637 (1.1% speedup)
>>> csv, no special: 89781 (26.3% speedup) | 226525 (35.1% speedup)
>>> csv, 1/3 special: 147094 (5.0% regression) | 461501 (4.9% regression)
>>>
>>> v4.2
>>>
>>> text, no special: 87785 (14.2% speedup) | 225702 (25.4% speedup)
>>> text, 1/3 special: 127008 (16.6% regression) | 343480 (5.3% regression)
>>> csv, no special: 88093 (27.7% speedup) | 226633 (35.0% speedup)
>>> csv, 1/3 special: 164487 (17.4% regression) | 510954 (16.2% regression)
>>>
>>> It would seem that both your results and mine show a more serious
>>> worst-case regression for the v4.2 patches than for the v3 patches. It
>>> also seems that the speedups for v4.2 and v3 are similar.
>>>
>>> I'm currently working with Mark Wong to see if his results continue to
>>> be dissimilar (as they currently are) and, if so, why.
>>> --
>>> -- Manni Wood EDB: https://www.enterprisedb.com
>>>
>>
>> Hello, all.
>>
>> Now that I am following Nazir's advice on how to configure my CPU for
>> performance test runs, and now that I am following Mark's advice on pinning
>> the postmaster to a particular CPU core, I figured I would share the
>> scripts I have been using to build, run, and test Postgres with various
>> patches applied: https://github.com/manniwood/copysimdperf
>>
>> With Nazir's and Mark's tips, I have seen more consistent numbers on my
>> tower PC, as shared in a previous e-mail. But Mark and I saw rather
>> variable results on a different Linux system he has access to, so this has
>> inspired me to spin up an AWS EC2 instance and test there when I find the
>> time, and maybe re-test on my Linux laptop.
>>
>> If anybody else is inspired to test on different setups, that would be
>> great.
>> --
>> -- Manni Wood EDB: https://www.enterprisedb.com
>>
>
> I tested master (bfb335d) and the v3 and v4.2 patches on an Amazon EC2
> instance (t2.small) and, with Mark's help, confirmed that on such a small
> system with default storage, IO is the bottleneck and the v3 and v4.2
> patches show no significant difference over master because the CPU is
> always waiting on IO. This is presumably the experience Postgres users
> will have when running on systems with IO so slow that the CPU is always
> waiting for data.
>
> I went in the other direction and tested an all-RAM setup on my tower PC.
> I put the entire data dir in RAM for each postgres instance (master, v3
> patch, v4.2 patch), and wrote and copied the test copy files from RAM. On
> Linux, /run/user/<myuserid> is tmpfs (a ramdisk), so I just put everything
> there. I had to shrink the data sizes compared to previous runs (so as not
> to run out of ramdisk space), but Nazir's cpupower tips are making all of
> my test runs much more uniform, so I no longer feel that I need huge data
> sizes to get good results.
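
In case anyone wants to reproduce the RAM-only setup: it is just a matter of
pointing initdb and the test copy files at a tmpfs mount. The paths below are
illustrative, and the exact scripts are in the copysimdperf repo linked above:

RAMDIR=/run/user/$(id -u)/pgram   # /run/user/<uid> is tmpfs, so this is RAM
${PGHOME}/bin/initdb -D ${RAMDIR}/data
${PGHOME}/bin/pg_ctl -D ${RAMDIR}/data -l ${RAMDIR}/logfile.txt start
# the COPY test files are also written under ${RAMDIR}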
>
> Here are the results when all of the files are on RAM disks:
>
> master: bfb335df
>
> text, no special: 30372
> text, 1/3 special: 32665
> csv, no special: 34925
> csv, 1/3 special: 44044
>
> v3
>
> text, no special: 22840 (24.7% speedup)
> text, 1/3 special: 32448 (0.6% speedup)
> csv, no special: 22642 (35.1% speedup)
> csv, 1/3 special: 46280 (5.1% regression)
>
> v4.2
>
> text, no special: 22677 (25.3% speedup)
> text, 1/3 special: 34512 (6.5% regression)
> csv, no special: 22686 (35.0% speedup)
> csv, 1/3 special: 51411 (16.7% regression)
>
> Assuming all-storage-is-RAM setups get us closer to the theoretical limit
> of each patch, it looks like v3 holds up quite well against v4.2 in the
> best-case scenarios, while v3 performs better than v4.2 in the worst-case
> scenarios.
>
> Let me know what you think!
> --
> -- Manni Wood EDB: https://www.enterprisedb.com
>

Ayoub Kazar, I tested your v4 "copy to" patch, doing everything in RAM and
using the cpupower tips from above. (I wanted to test your v5, but `git
apply --check` gave me an error, so I'll look at that another day.)
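
For anyone skimming the thread, the core of the approach (option 1 as Ayoub
described it above: skip ahead one SIMD register at a time until a chunk
contains a special byte, then fall back to the scalar path) has roughly this
shape. This is only my sketch using the Vector8 helpers from
src/include/port/simd.h, not code from the patch, and the function name and
argument list are made up:

#include "port/simd.h"

/*
 * Sketch only: advance p over bytes that COPY TO can emit verbatim, one
 * Vector8 at a time, stopping at the first chunk that might contain a byte
 * needing special handling (delimiter, quote/backslash, CR, LF).
 */
static inline const char *
skip_plain_bytes(const char *p, const char *end, char delimc, char quotec)
{
	while (end - p >= (ptrdiff_t) sizeof(Vector8))
	{
		Vector8		chunk;

		vector8_load(&chunk, (const uint8 *) p);

		if (vector8_has(chunk, (uint8) delimc) ||
			vector8_has(chunk, (uint8) quotec) ||
			vector8_has(chunk, (uint8) '\n') ||
			vector8_has(chunk, (uint8) '\r'))
			break;				/* a special byte is somewhere in this chunk */

		p += sizeof(Vector8);	/* whole chunk is plain: emit it as-is */
	}
	return p;					/* caller continues with the scalar code */
}

The caller copies everything up to the returned pointer straight into the
output buffer and then handles the next character with the existing scalar
path (and, in the 0005 variant, re-enters this loop afterward).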

The results look great:

master: (forgot to get commit hash)

text, no special: 8165
text, 1/3 special: 22662
csv, no special: 9619
csv, 1/3 special: 23213

v4 (copy to)

text, no special: 4577 (43.9% speedup)
text, 1/3 special: 22847 (0.8% regression)
csv, no special: 4720 (50.9% speedup)
csv, 1/3 special: 23195 (0.08% speedup)

Seems like a very clear win to me!
-- 
-- Manni Wood EDB: https://www.enterprisedb.com
