RE: [PoC] Non-volatile WAL buffer
Hi Takashi,

There are some differences between our HW/SW configurations and test steps. I attached the postgresql.conf I used for your reference. I would like to try the postgresql.conf and steps you provided in the coming days to see if I can find the cause.

I also ran pgbench and the postgres server on the same machine but on different NUMA nodes, and ensured the server process and PMEM were on the same NUMA node. I used steps similar to yours from step 1 to 9, but with some differences in the later steps, the major ones being:

In step 10), I created a database and table for the test by:

#create database:
psql -c "create database insert_bench;"

#create table:
psql -d insert_bench -c "create table test(crt_time timestamp, info text default '75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d79a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');"

In step 15), I did not use pg_prewarm, but just ran pgbench for 180 seconds to warm up.

In step 16), I ran pgbench using the command: pgbench -M prepared -n -r -P 10 -f ./test.sql -T 600 -c _ -j _ insert_bench (test.sql can be found in the attachment).

For the HW/SW configuration, the major differences are:

CPU: Xeon 8268 (24c@2.9GHz, HT enabled)
OS Distro: CentOS 8.2.2004
Kernel: 4.18.0-193.6.3.el8_2.x86_64
GCC: 8.3.1

Best regards,
Gang

-----Original Message-----
From: Takashi Menjo
Sent: Tuesday, October 6, 2020 4:49 PM
To: Deng, Gang
Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo'
Subject: RE: [PoC] Non-volatile WAL buffer

Hi Gang,

I have tried but cannot yet reproduce the performance degradation you reported when inserting 328-byte records. So I think your conditions and mine must differ in some way, such as the steps to reproduce, postgresql.conf, installation setup, and so on. My results and conditions are as follows. May I have your conditions in more detail?
Note that I refer to your "Storage over App Direct" as my "Original (PMEM)" and your "NVWAL patch" as my "Non-volatile WAL buffer."

Best regards,
Takashi

# Results
See the attached figure. In short, Non-volatile WAL buffer got better performance than Original (PMEM).

# Steps
Note that I ran the postgres server and pgbench on a single-machine system but on two separate NUMA nodes. The PMEM and PCIe SSD for the server process are on the server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM, then mount it with the DAX option (sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0)
03) Make another ext4 filesystem for the PCIe SSD, then mount it (sudo mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
04) Make a /mnt/pmem0/pg_wal directory for WAL
05) Make a /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
    - Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile WAL buffer
07) Edit postgresql.conf as attached
    - Please remove the nvwal_* lines in the case of Original (PMEM)
08) Start the postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Change the number of characters of the "filler" column of the "pgbench_history" table to 300 (ALTER TABLE pgbench_history ALTER filler TYPE character(300);)
    - This would make the row size of the table 328 bytes
12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
13) Remount the PMEM and the PCIe SSD
14) Start the postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
15) Run pg_prewarm for all four pgbench_* tables
16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r -M prepared -T 1800 -c __ -j __)
    - It executes the default tpcb-like transactions

I repeated all the steps three times for each (c,j), then took the median "tps = __ (including connections establishing)" of the three as throughput and the "latency average = __ ms" of that run as average latency.

# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 sockets (256 GiB per channel x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C com
RE: [PoC] Non-volatile WAL buffer
Hi Takashi,

Thank you for the patch and the work on accelerating PG performance with NVM. I applied the patch and ran some performance tests based on patch v4. I stored the database data files on an NVMe SSD and the WAL file on Intel PMem (NVM). I used two methods to store the WAL file(s):

1. Leverage your patch to access PMem with libpmem (NVWAL patch).
2. Access PMem through the legacy filesystem interface, that is, use PMem as an ordinary block device; no PG patch is required to access PMem (Storage over App Direct).

I tried two insert scenarios:

A. Insert small records (the length of each record to be inserted is 24 bytes); I think this is similar to your test.
B. Insert large records (the length of each record to be inserted is 328 bytes).

My original purpose was to see a higher performance gain in scenario B, as it is more write-intensive on WAL. But I observed that the NVWAL patch method had a ~5% performance improvement compared with the Storage over App Direct method in scenario A, while it had a ~20% performance degradation in scenario B.

I investigated the test further. I found that the NVWAL patch can improve the performance of the XLogFlush function, but it may hurt the performance of the CopyXlogRecordToWAL function. This may be related to the higher latency of memcpy to Intel PMem compared with DRAM. Here are the key data in my test:

Scenario A (length of record to be inserted: 24 bytes per record):
==================================================================
                                     NVWAL    SoAD
                                     -----    -----
Throughput (10^3 TPS)                310.5    296.0
CPU Time % of CopyXlogRecordToWAL      0.4      0.2
CPU Time % of XLogInsertRecord         1.5      0.8
CPU Time % of XLogFlush                2.1      9.6

Scenario B (length of record to be inserted: 328 bytes per record):
==================================================================
                                     NVWAL    SoAD
                                     -----    -----
Throughput (10^3 TPS)                 13.0     16.9
CPU Time % of CopyXlogRecordToWAL      3.0      1.6
CPU Time % of XLogInsertRecord        23.0     16.4
CPU Time % of XLogFlush                2.3      5.9

Best Regards,
Gang

From: Takashi Menjo
Sent: Thursday, September 10, 2020 4:01 PM
To: Takashi Menjo
Cc: pgsql-hack...@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

Rebased.
Wed, Jun 24, 2020 at 16:44 Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>:

Dear hackers,

I updated my non-volatile WAL buffer's patchset to v3. Now we can use it in streaming replication mode.

Updates from v2:
- walreceiver supports non-volatile WAL buffer
  Now walreceiver stores received records directly into the non-volatile WAL buffer if applicable.
- pg_basebackup supports non-volatile WAL buffer
  Now pg_basebackup copies received WAL segments onto the non-volatile WAL buffer if you run it in "nvwal" mode (-Fn). You should specify a new NVWAL path with the --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the master's.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
NTT Software Innovation Center

> -----Original Message-----
> From: Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> Sent: Wednesday, March 18, 2020 5:59 PM
> To: 'PostgreSQL-development' <pgsql-hack...@postgresql.org>
> Cc: 'Robert Haas' <robertmh...@gmail.com>; 'Heikki Linnakangas' <hlinn...@iki.fi>; 'Amit Langote' <amitlangot...@gmail.com>
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Dear hackers,
>
> I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.
>
> I also measured performance before and after the patchset, varying the -c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50 or 1000. The results are presented in the following tables and the attached charts. Conditions, steps, and other details will be shown later.
>
>
> Results (s=50)
> ==============
>           Throughput [10^3 TPS]     Average latency [ms]
> ( c, j)    before  after            before  after
> -------   ---------------------    ----------------------
> ( 8, 8)    35.7   37.1 (+3.9%)     0.224   0.216 (-3.6%)
> (18,18)    70.9   74.7 (+5.3%)     0.254   0.241 (-5.1%)
> (36,18)    76.0   80.8 (+6.3%)     0.473   0.446 (-5.7%)
> (54,18)
RE: [PATCH] Resolve Parallel Hash Join Performance Issue
Regarding why setting the bit is no longer cheap in a parallel join: as I explained in my original mail, it is because of "false sharing" in cache coherence. In short, setting the bit dirties the whole cache line (64 bytes), so all CPU cores holding that cache line have to load it again, which wastes much CPU time. The article https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads explains this in more detail.

-----Original Message-----
From: Tom Lane
Sent: Thursday, January 9, 2020 10:43 PM
To: Thomas Munro
Cc: Deng, Gang ; pgsql-hack...@postgresql.org
Subject: Re: [PATCH] Resolve Parallel Hash Join Performance Issue

Thomas Munro writes:
> Right, I see. The funny thing is that the match bit is not even used
> in this query (it's used for right and full hash join, and those
> aren't supported for parallel joins yet). Hmm. So, instead of the
> test you proposed, an alternative would be to use if (!parallel).
> That's a value that will be constant-folded, so that there will be no
> branch in the generated code (see the pg_attribute_always_inline
> trick). If, in a future release, we need the match bit for parallel
> hash join because we add parallel right/full hash join support, we
> could do it the way you showed, but only if it's one of those join
> types, using another constant parameter.

Can we base the test off the match type today, and avoid leaving something that will need to be fixed later?

I'm pretty sure that the existing coding is my fault, and that it's like that because I reasoned that setting the bit was too cheap to justify having a test-and-branch around it. Apparently that's not true anymore in a parallel join, but I have to say that it's unclear why. In any case, the reasoning probably still holds good in non-parallel cases, so it'd be a shame to introduce a run-time test if we can avoid it.

regards, tom lane
RE: [PATCH] Resolve Parallel Hash Join Performance Issue
Thank you for the comment. Yes, I agree with the alternative of using '(!parallel)', so there is no need to test the bit. Will someone submit a patch for it accordingly?

-----Original Message-----
From: Thomas Munro
Sent: Thursday, January 9, 2020 6:04 PM
To: Deng, Gang
Cc: pgsql-hack...@postgresql.org
Subject: Re: [PATCH] Resolve Parallel Hash Join Performance Issue

On Thu, Jan 9, 2020 at 10:04 PM Deng, Gang wrote:
> Attached is a patch to resolve a parallel hash join performance issue. This is my first time contributing a patch to the PostgreSQL community; I referred to one of the previous threads as a template to report the issue and patch. Please let me know if you need more information on the problem and patch.

Thank you very much for investigating this and for your report.

>     HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
> changed to:
>
>     if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
>     {
>         HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>     }
>
> Compared with the original code, the modified code can avoid an unnecessary write to memory/cache.

Right, I see. The funny thing is that the match bit is not even used in this query (it's used for right and full hash join, and those aren't supported for parallel joins yet). Hmm. So, instead of the test you proposed, an alternative would be to use if (!parallel). That's a value that will be constant-folded, so that there will be no branch in the generated code (see the pg_attribute_always_inline trick). If, in a future release, we need the match bit for parallel hash join because we add parallel right/full hash join support, we could do it the way you showed, but only if it's one of those join types, using another constant parameter.

> D. Result
>
> With the modified code, the performance of the hash join operation scales better with the number of threads. Here is the result of query02 after the patch. For example, performance improved ~2.5x when run with 28 threads.
> number of threads:    1      4      8     16     28
> time used (sec):    465.1  193.1   97.9   55.9   41

Wow. That is a very nice improvement.