RE: [PoC] Non-volatile WAL buffer

2020-10-09 Thread Deng, Gang
Hi Takashi,

There are some differences between our HW/SW configurations and test steps. I
have attached the postgresql.conf I used for your reference. I would like to try
the postgresql.conf and steps you provided in the coming days to see if I can
find the cause.

I also ran pgbench and the postgres server on the same machine but on different
NUMA nodes, and ensured that the server process and the PMEM were on the same
NUMA node. I used steps similar to yours from step 1 to 9, but with some
differences in the later steps, the major ones being:

In step 10), I created a database and table for test by:
#create database:
psql -c "create database insert_bench;"
#create table:
psql -d insert_bench -c "create table test(crt_time timestamp, info text 
default 
'75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d79a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');"

In step 15), I did not use pg_prewarm, but instead ran pgbench for 180 seconds
to warm up.
In step 16), I ran pgbench using the command: pgbench -M prepared -n -r -P 10 -f
./test.sql -T 600 -c _ -j _ insert_bench. (test.sql can be found in the
attachment)

For the HW/SW configuration, the major differences are:
CPU: I used Xeon 8268 (24 cores @ 2.9 GHz, HT enabled)
OS Distro: CentOS 8.2.2004 
Kernel: 4.18.0-193.6.3.el8_2.x86_64
GCC: 8.3.1

Best regards
Gang

-----Original Message-----
From: Takashi Menjo  
Sent: Tuesday, October 6, 2020 4:49 PM
To: Deng, Gang 
Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo' 
Subject: RE: [PoC] Non-volatile WAL buffer

Hi Gang,

I have tried but have not yet been able to reproduce the performance degradation
you reported when inserting 328-byte records. So I think your conditions and
mine must differ somewhere, such as in the steps to reproduce, postgresql.conf,
installation setup, and so on.

My results and conditions are as follows. May I have your conditions in more
detail? Note that I refer to your "Storage over App Direct" as my "Original
(PMEM)" and to your "NVWAL patch" as my "Non-volatile WAL buffer."

Best regards,
Takashi


# Results
See the attached figure. In short, Non-volatile WAL buffer got better 
performance than Original (PMEM).

# Steps
Note that I ran the postgres server and pgbench on a single machine but on two
separate NUMA nodes. The PMEM and the PCIe SSD for the server process are on the
server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M 
dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo 
mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0)
03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 -q 
-F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
04) Make /mnt/pmem0/pg_wal directory for WAL
05) Make /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile 
WAL buffer
07) Edit postgresql.conf as the attached one
- Please remove nvwal_* lines in the case of Original (PMEM)
08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl 
-l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Change the width of the "filler" column of the "pgbench_history" table to
300 characters (ALTER TABLE pgbench_history ALTER filler TYPE character(300);)
- This would make the row size of the table 328 bytes
12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
13) Remount the PMEM and the PCIe SSD
14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- 
pg_ctl -l pg.log start)
15) Run pg_prewarm for all four pgbench_* tables
16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r 
-M prepared -T 1800 -c __ -j __)
- It executes the default tpcb-like transactions

I repeated all the steps three times for each (c,j), then took the median "tps =
__ (including connections establishing)" of the three runs as throughput and the
"latency average = __ ms" of that run as average latency.

# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by 
BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels 
per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 
sockets (256 GiB per channel x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C com

RE: [PoC] Non-volatile WAL buffer

2020-09-20 Thread Deng, Gang
Hi Takashi,

Thank you for the patch and your work on accelerating PG performance with NVM.
I applied the patch and ran some performance tests based on patch v4. I stored
the database data files on NVMe SSD and the WAL file on Intel PMem (NVM). I used
two methods to store the WAL file(s):

1.  Leverage your patch to access PMem with libpmem (NVWAL patch).

2.  Access PMem through the legacy filesystem interface, that is, use PMem as
an ordinary block device; no PG patch is required to access PMem (Storage over
App Direct).

I tried two insert scenarios:

A.  Insert a small record (the length of the record to be inserted is 24
bytes); I think this is similar to your test

B.  Insert a large record (the length of the record to be inserted is 328 bytes)

My original expectation was to see a higher performance gain in scenario B,
since it is more write-intensive on WAL. Instead, I observed that the NVWAL
patch method had a ~5% performance improvement over the Storage over App Direct
method in scenario A, but a ~20% performance degradation in scenario B.

I investigated the test further. I found that the NVWAL patch improves the
performance of the XLogFlush function, but it can hurt the performance of the
CopyXlogRecordToWAL function. This may be related to the higher latency of
memcpy to Intel PMem compared with DRAM; a sketch of this tradeoff follows the
tables below. Here are the key data from my test:

Scenario A (length of record to be inserted: 24 bytes per record):
===================================================================
                                    NVWAL     SoAD
                                    -----     ----
Throughput (10^3 TPS)               310.5    296.0
CPU Time % of CopyXlogRecordToWAL     0.4      0.2
CPU Time % of XLogInsertRecord        1.5      0.8
CPU Time % of XLogFlush               2.1      9.6

Scenario B (length of record to be inserted: 328 bytes per record):
===================================================================
                                    NVWAL     SoAD
                                    -----     ----
Throughput (10^3 TPS)                13.0     16.9
CPU Time % of CopyXlogRecordToWAL     3.0      1.6
CPU Time % of XLogInsertRecord       23.0     16.4
CPU Time % of XLogFlush               2.3      5.9
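
To illustrate the tradeoff behind these numbers, here is a minimal sketch. It
assumes PMDK's libpmem (pmem_memcpy_persist, pmem_memcpy_nodrain, and
pmem_drain are real libpmem calls); the two helper functions are hypothetical
and are not the actual patch code. With WAL buffers mapped on PMEM, each record
copy already pays the PMEM write latency, so the later flush only drains
pending stores; with DRAM buffers, the copy is cheap but the flush pays the
full PMEM write cost.

#include <string.h>
#include <libpmem.h>

/* Conventional path: copy each record into a DRAM WAL buffer (cheap,
 * as in CopyXlogRecordToWAL), then pay the PMEM write cost later in
 * the flush (as in XLogFlush). */
static void
dram_buffer_path(char *pmem_wal, char *dram_buf, const char *rec, size_t len)
{
    memcpy(dram_buf, rec, len);                   /* fast DRAM copy */
    pmem_memcpy_persist(pmem_wal, dram_buf, len); /* costly flush to PMEM */
}

/* Non-volatile WAL buffer path: the copy itself goes to PMEM (slower
 * copy), so the "flush" only has to drain pending stores. */
static void
nvwal_path(char *pmem_wal, const char *rec, size_t len)
{
    pmem_memcpy_nodrain(pmem_wal, rec, len);      /* copy pays PMEM latency */
    pmem_drain();                                 /* cheap flush: drain only */
}

This matches the profile above: under NVWAL, CPU time moves out of XLogFlush
and into CopyXlogRecordToWAL and XLogInsertRecord.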

Best Regards,
Gang

From: Takashi Menjo 
Sent: Thursday, September 10, 2020 4:01 PM
To: Takashi Menjo 
Cc: pgsql-hack...@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

Rebased.


On Wed, Jun 24, 2020 at 16:44 Takashi Menjo
<takashi.menjou...@hco.ntt.co.jp> wrote:
Dear hackers,

I have updated my non-volatile WAL buffer patchset to v3.  Now we can use it in
streaming replication mode.

Updates from v2:

- walreceiver supports the non-volatile WAL buffer
Now walreceiver stores received records directly into the non-volatile WAL
buffer if applicable.

- pg_basebackup supports the non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto the non-volatile WAL buffer
if you run it in "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option.  The path will
be written to postgresql.auto.conf or recovery.conf.  The size of the new NVWAL
is the same as the master's.
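
(For example, an invocation might look like the following, with hypothetical
paths, based on the options described above:
pg_basebackup -D /path/to/backup -Fn --nvwal-path=/mnt/pmem0/pg_wal/nvwal)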


Best regards,
Takashi

--
Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
NTT Software Innovation Center

> -----Original Message-----
> From: Takashi Menjo <takashi.menjou...@hco.ntt.co.jp>
> Sent: Wednesday, March 18, 2020 5:59 PM
> To: 'PostgreSQL-development' <pgsql-hack...@postgresql.org>
> Cc: 'Robert Haas' <robertmh...@gmail.com>; 'Heikki Linnakangas'
> <hlinn...@iki.fi>; 'Amit Langote' <amitlangot...@gmail.com>
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Dear hackers,
>
> I rebased my non-volatile WAL buffer's patchset onto master.  A new v2 
> patchset is attached to this mail.
>
> I also measured performance before and after the patchset, varying the
> -c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50
> or 1000.  The results are presented in the following tables and the attached
> charts.  Conditions, steps, and other details will be shown later.
>
>
> Results (s=50)
> ==
>           Throughput [10^3 TPS]    Average latency [ms]
> ( c, j)   before  after            before  after
> -------   ------  -------------    ------  -------------
> ( 8, 8)     35.7  37.1 (+3.9%)      0.224  0.216 (-3.6%)
> (18,18)     70.9  74.7 (+5.3%)      0.254  0.241 (-5.1%)
> (36,18)     76.0  80.8 (+6.3%)      0.473  0.446 (-5.7%)
> (54,18)  

RE: [PATCH] Resolve Parallel Hash Join Performance Issue

2020-01-09 Thread Deng, Gang
Regarding why setting the bit is no longer cheap in a parallel join: as I
explained in my original mail, it is because of 'false sharing' cache coherence.
In short, setting the bit makes the whole cache line (64 bytes) dirty, so every
CPU core holding that cache line has to load it again, which wastes a lot of CPU
time. The article
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
explains this in more detail.
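
To make this concrete, here is a minimal sketch of the test-before-set idiom
(the types and names are hypothetical, not the patch itself): an unconditional
store dirties the whole 64-byte cache line and invalidates it on every other
core, while reading an already-set bit leaves the line in the shared state.

/* Hypothetical simplified tuple header; in the real case the flag word
 * shares its cache line with fields that other workers read. */
typedef struct
{
    unsigned int infomask;
} TupleHeader;

#define MATCH_BIT 0x0001

/* Unconditional write: dirties the cache line every time, even when the
 * bit is already set, forcing other cores to reload the line. */
static inline void
set_match_bit(TupleHeader *tup)
{
    tup->infomask |= MATCH_BIT;
}

/* Test first: when the bit is already set, this is a pure read, so the
 * cache line stays shared and no invalidation traffic is generated. */
static inline void
set_match_bit_if_needed(TupleHeader *tup)
{
    if ((tup->infomask & MATCH_BIT) == 0)
        tup->infomask |= MATCH_BIT;
}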

-----Original Message-----
From: Tom Lane  
Sent: Thursday, January 9, 2020 10:43 PM
To: Thomas Munro 
Cc: Deng, Gang ; pgsql-hack...@postgresql.org
Subject: Re: [PATCH] Resolve Parallel Hash Join Performance Issue

Thomas Munro  writes:
> Right, I see.  The funny thing is that the match bit is not even used 
> in this query (it's used for right and full hash join, and those 
> aren't supported for parallel joins yet).  Hmm.  So, instead of the 
> test you proposed, an alternative would be to use if (!parallel).
> That's a value that will be constant-folded, so that there will be no 
> branch in the generated code (see the pg_attribute_always_inline 
> trick).  If, in a future release, we need the match bit for parallel 
> hash join because we add parallel right/full hash join support, we 
> could do it the way you showed, but only if it's one of those join 
> types, using another constant parameter.

Can we base the test off the match type today, and avoid leaving something that 
will need to be fixed later?

I'm pretty sure that the existing coding is my fault, and that it's like that 
because I reasoned that setting the bit was too cheap to justify having a 
test-and-branch around it.  Apparently that's not true anymore in a parallel 
join, but I have to say that it's unclear why.  In any case, the reasoning 
probably still holds good in non-parallel cases, so it'd be a shame to 
introduce a run-time test if we can avoid it.

regards, tom lane




RE: [PATCH] Resolve Parallel Hash Join Performance Issue

2020-01-09 Thread Deng, Gang
Thank you for the comment. Yes, I agree with the alternative of using
'if (!parallel)', so that there is no need to test the bit. Will someone submit
a patch for it accordingly?

-----Original Message-----
From: Thomas Munro  
Sent: Thursday, January 9, 2020 6:04 PM
To: Deng, Gang 
Cc: pgsql-hack...@postgresql.org
Subject: Re: [PATCH] Resolve Parallel Hash Join Performance Issue

On Thu, Jan 9, 2020 at 10:04 PM Deng, Gang  wrote:
> Attached is a patch to resolve a parallel hash join performance issue. This
> is my first time contributing a patch to the PostgreSQL community; I used a
> previous thread as a template to report the issue and the patch. Please let me
> know if you need more information about the problem or the patch.

Thank you very much for investigating this and for your report.

> HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
> changed to:
>
> if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
> {
>     HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
> }
>
> Compared with the original code, the modified code avoids an unnecessary
> write to memory/cache.

Right, I see.  The funny thing is that the match bit is not even used in this 
query (it's used for right and full hash join, and those aren't supported for 
parallel joins yet).  Hmm.  So, instead of the test you proposed, an 
alternative would be to use if (!parallel).
That's a value that will be constant-folded, so that there will be no branch in 
the generated code (see the pg_attribute_always_inline trick).  If, in a future 
release, we need the match bit for parallel hash join because we add parallel 
right/full hash join support, we could do it the way you showed, but only if 
it's one of those join types, using another constant parameter.
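
For illustration, here is a minimal sketch of that trick (hypothetical and
simplified: TupleHeader and the mark_match* names are made up, and
pg_attribute_always_inline is redefined here only to keep the sketch
self-contained): the always-inline worker takes a constant bool, each thin
wrapper passes a literal, and the compiler folds the branch away.

#include <stdbool.h>

/* stand-in for PostgreSQL's pg_attribute_always_inline */
#define pg_attribute_always_inline __attribute__((always_inline)) inline

typedef struct
{
    unsigned int infomask;
} TupleHeader;

#define MATCH_BIT 0x0001

static pg_attribute_always_inline void
mark_match_impl(TupleHeader *tup, bool parallel)
{
    /* 'parallel' is a compile-time constant at both call sites below,
     * so this branch is constant-folded and no run-time test is emitted. */
    if (!parallel)
        tup->infomask |= MATCH_BIT;
}

static void
mark_match(TupleHeader *tup)
{
    mark_match_impl(tup, false);  /* folds to the unconditional set */
}

static void
mark_match_parallel(TupleHeader *tup)
{
    mark_match_impl(tup, true);   /* folds to a no-op: bit not yet needed */
}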

> D. Result
>
> With the modified code, the performance of the hash join operation scales
> better with the number of threads. Here is the result of query02 after the
> patch. For example, performance improved ~2.5x when running 28 threads.
>
> number of threads:      1      4      8     16     28
> time used (sec):    465.1  193.1   97.9   55.9     41

Wow.  That is a very nice improvement.