On Thu, Jan 8, 2026 at 2:49 PM Manni Wood <[email protected]> wrote:
> On Wed, Jan 7, 2026 at 1:13 PM Manni Wood <[email protected]> wrote:
>
>> On Tue, Jan 6, 2026 at 2:05 PM Manni Wood <[email protected]> wrote:
>>
>>> On Wed, Dec 31, 2025 at 7:04 AM Nazir Bilal Yavuz <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> On Wed, 24 Dec 2025 at 18:08, KAZAR Ayoub <[email protected]> wrote:
>>>> >
>>>> > Hello,
>>>> > Following the same path of optimizing COPY FROM using SIMD, I found that COPY TO can also benefit from this.
>>>> >
>>>> > I attached a small patch that uses SIMD to skip data and advance as far as the first special character, then fall back to scalar processing for that character and re-enter the SIMD path again.
>>>> > There are two ways to do this:
>>>> > 1) Essentially, we do SIMD until we find a special character, then continue on the scalar path without re-entering SIMD.
>>>> > - This gives 10% to 30% speedups depending on the proportion of special characters in the attribute; we don't lose anything here, since it advances with SIMD until it can't (using the previous scripts: 1/3 and 2/3 special chars).
>>>> >
>>>> > 2) Do the SIMD path, then use the scalar path when we hit a special character, and keep re-entering the SIMD path each time.
>>>> > - This is equivalent to the COPY FROM story: we'll need to find the same heuristic to use for both COPY FROM/TO to reduce the regressions (same regressions: around 20% to 30% with 1/3 and 2/3 special chars).
>>>> >
>>>> > Something else to note is that the scalar path for COPY TO isn't as heavy as the state machine in COPY FROM.
>>>> >
>>>> > So if we find the sweet spot for the heuristic, doing the same for COPY TO will be trivial and always beneficial.
>>>> > Attached is 0004, which is option 1 (SIMD without re-entering); 0005 is the second one.
>>>>
>>>> Patches look correct to me. I think we could move these SIMD code portions into a shared function to remove duplication, although that might have a performance impact. I have not benchmarked these patches yet.
>>>>
>>>> Another consideration is that these patches might need their own thread, though I am not completely sure about this yet.
>>>>
>>>> One question: what do you think about having a 0004-style approach for COPY FROM? What I have in mind is running SIMD for each line & column, stopping SIMD once it can no longer skip an entire chunk, and then continuing with the next line & column.
>>>>
>>>> --
>>>> Regards,
>>>> Nazir Bilal Yavuz
>>>> Microsoft
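
(For readers following along: the skip loop Ayoub describes above is roughly the shape sketched below. This is only an illustration of the technique, not the attached 0004/0005 patches; it uses raw SSE2 intrinsics and a made-up helper name, whereas the real code presumably builds on the portable helpers in src/include/port/simd.h.)

/*
 * Illustrative sketch only: skip over "ordinary" bytes 16 at a time,
 * stopping at the first byte that COPY TO's text format has to treat
 * specially (delimiter, backslash, CR, LF).  Assumes x86-64 with SSE2.
 */
#include <emmintrin.h>
#include <stddef.h>

static size_t
skip_plain_bytes(const char *s, size_t len, char delim)
{
    size_t      i = 0;
    const __m128i v_delim = _mm_set1_epi8(delim);
    const __m128i v_bslash = _mm_set1_epi8('\\');
    const __m128i v_cr = _mm_set1_epi8('\r');
    const __m128i v_lf = _mm_set1_epi8('\n');

    /* SIMD path: advance while a whole 16-byte chunk has no special bytes */
    for (; i + sizeof(__m128i) <= len; i += sizeof(__m128i))
    {
        __m128i     chunk = _mm_loadu_si128((const __m128i *) (s + i));
        __m128i     hits;

        hits = _mm_or_si128(_mm_or_si128(_mm_cmpeq_epi8(chunk, v_delim),
                                         _mm_cmpeq_epi8(chunk, v_bslash)),
                            _mm_or_si128(_mm_cmpeq_epi8(chunk, v_cr),
                                         _mm_cmpeq_epi8(chunk, v_lf)));
        if (_mm_movemask_epi8(hits) != 0)
            break;              /* a special byte lives in this chunk */
    }

    /*
     * Option 1 (0004): return here and let the scalar loop finish the rest
     * of the attribute.  Option 2 (0005): the caller escapes the special
     * byte(s) on the scalar path and then calls this again.
     */
    return i;
}

(Whether to stop vectorizing after the first hit, as in option 1, or to keep re-entering the vector loop, as in option 2, is exactly the trade-off the benchmark numbers below are probing.)
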
>>>
>>> Hello, Nazir, I tried your suggested cpupower commands as well as disabling turbo, and my results are indeed more uniform (see the attached screenshot of my spreadsheet).
>>>
>>> This time, I ran the tests on my tower PC instead of on my laptop.
>>>
>>> I also followed Mark Wong's advice and used the taskset command to pin my postgres postmaster (and all of its children) to a single CPU core.
>>>
>>> So when I start postgres, I do this to pin it to core 27:
>>>
>>> ${PGHOME}/bin/pg_ctl -D ${PGHOME}/data -l ${PGHOME}/logfile.txt start
>>> PGPID=$(head -1 ${PGHOME}/data/postmaster.pid)
>>> taskset --cpu-list -p 27 ${PGPID}
>>>
>>> My results seem similar to yours:
>>>
>>> master: Nazir 85ddcc2f4c | Manni 877ae5db
>>>
>>> text, no special: 102294 | 302651
>>> text, 1/3 special: 108946 | 326208
>>> csv, no special: 121831 | 348930
>>> csv, 1/3 special: 140063 | 439786
>>>
>>> v3
>>>
>>> text, no special: 88890 (13.1% speedup) | 227874 (24.7% speedup)
>>> text, 1/3 special: 110463 (1.4% regression) | 322637 (1.1% speedup)
>>> csv, no special: 89781 (26.3% speedup) | 226525 (35.1% speedup)
>>> csv, 1/3 special: 147094 (5.0% regression) | 461501 (4.9% regression)
>>>
>>> v4.2
>>>
>>> text, no special: 87785 (14.2% speedup) | 225702 (25.4% speedup)
>>> text, 1/3 special: 127008 (16.6% regression) | 343480 (5.3% regression)
>>> csv, no special: 88093 (27.7% speedup) | 226633 (35.0% speedup)
>>> csv, 1/3 special: 164487 (17.4% regression) | 510954 (16.2% regression)
>>>
>>> It would seem that both your results and mine show a more serious worst-case regression for the v4.2 patches than for the v3 patches. It also seems that the speedups for v4.2 and v3 are similar.
>>>
>>> I'm currently working with Mark Wong to see if his results continue to be dissimilar (as they currently are) and, if so, why.
>>> --
>>> -- Manni Wood
>>> EDB: https://www.enterprisedb.com
>>
>> Hello, all.
>>
>> Now that I am following Nazir's advice on how to configure my CPU for performance test runs, and now that I am following Mark's advice on pinning the postmaster to a particular CPU core, I figured I would share the scripts I have been using to build, run, and test Postgres with various patches applied: https://github.com/manniwood/copysimdperf
>>
>> With Nazir's and Mark's tips, I have seen more consistent numbers on my tower PC, as shared in a previous e-mail. But Mark and I saw rather variable results on a different Linux system he has access to. So this has inspired me to spin up an AWS EC2 instance and test that when I find the time. And maybe re-test on my Linux laptop.
>>
>> If anybody else is inspired to test on different setups, that would be great.
>> --
>> -- Manni Wood
>> EDB: https://www.enterprisedb.com
>
> I tested master (bfb335d) and the v3 and v4.2 patches on an Amazon EC2 instance (t2.small) and, with Mark's help, showed that on such a small system with default storage configured, IO is the bottleneck: the v3 and v4.2 patches show no significant differences over master because the CPU is always waiting on IO. This is presumably the experience Postgres users will have when running on systems with IO so slow that the CPU is always waiting for data.
>
> I went in the other direction and tested an all-RAM setup on my tower PC. I put the entire data dir in RAM for each postgres instance (master, v3 patch, v4.2 patch), and wrote and copied the test copy files from RAM. On Linux, /run/user/<myuserid> is tmpfs (a RAM disk), so I just put everything there. I had to shrink the data sizes compared to previous runs (so as not to run out of RAM-disk space), but Nazir's cpupower tips are making all of my test runs much more uniform, so I no longer feel that I need huge data sizes to get good results.
>
> Here are the results when all of the files are on RAM disks:
>
> master: bfb335df
>
> text, no special: 30372
> text, 1/3 special: 32665
> csv, no special: 34925
> csv, 1/3 special: 44044
>
> v3
>
> text, no special: 22840 (24.7% speedup)
> text, 1/3 special: 32448 (0.6% speedup)
> csv, no special: 22642 (35.1% speedup)
> csv, 1/3 special: 46280 (5.1% regression)
>
> v4.2
>
> text, no special: 22677 (25.3% speedup)
> text, 1/3 special: 34512 (6.5% regression)
> csv, no special: 22686 (35.0% speedup)
> csv, 1/3 special: 51411 (16.7% regression)
>
> Assuming that all-storage-is-RAM setups get us closer to the theoretical limit of each patch, it looks like v3 holds up quite well against v4.2 in the best-case scenarios, while v3 has better performance than v4.2 in the worst-case scenarios.
>
> Let me know what you think!
> --
> -- Manni Wood
> EDB: https://www.enterprisedb.com

Ayoub Kazar, I tested your v4 "copy to" patch, doing everything in RAM and using the cpupower tips from above. (I wanted to test your v5, but `git apply --check` gave me an error, so I can look at that another day.)

The results look great:

master: (forgot to get the commit hash)

text, no special: 8165
text, 1/3 special: 22662
csv, no special: 9619
csv, 1/3 special: 23213

v4 (copy to)

text, no special: 4577 (43.9% speedup)
text, 1/3 special: 22847 (0.8% regression)
csv, no special: 4720 (50.9% speedup)
csv, 1/3 special: 23195 (0.07% regression)

Seems like a very clear win to me!

--
-- Manni Wood
EDB: https://www.enterprisedb.com
