Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-10-13 11:30:35 -0700, Andres Freund wrote: > On 2023-10-13 10:39:10 -0700, Andres Freund wrote: > > On 2023-10-12 09:24:19 -0700, Andres Freund wrote: > > > I kind of had hoped somebody would comment on the approach. Given that > > > nobody > > > has, I'll push the minimal fix of resetting the state in > > > ReleaseBulkInsertStatePin(), even though I think architecturally that's > > > not > > > great. > > > > I spent some time working on a test that shows the problem more cheaply than > > the case upthread. I think it'd be desirable to have a test that's likely to > > catch an issue like this fairly quickly. We've had other problems in this > > realm before - there's only a single test that fails if I remove the > > ReleaseBulkInsertStatePin() call, and I don't think that's guaranteed at > > all. > > > > I'm a bit on the fence on how large to make the relation. For me the bug > > triggers when filling both relations up to the 3rd block, but how many rows > > that takes is somewhat dependent on space utilization on the page and stuff. > > > > Right now the test uses data/desc.data and ends up with 328kB and 312kB in > > two > > partitions. Alternatively I could make the test create a new file to load > > with > > copy that has fewer rows than data/desc.data - I didn't see another data > > file > > that works conveniently and has fewer rows. The copy is reasonably fast, > > even > > under something as expensive as rr (~60ms). So I'm inclined to just go with > > that? > > Patch with fix and test attached (0001). Pushed that. Greetings, Andres
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-10-13 10:39:10 -0700, Andres Freund wrote: > On 2023-10-12 09:24:19 -0700, Andres Freund wrote: > > I kind of had hoped somebody would comment on the approach. Given that > > nobody > > has, I'll push the minimal fix of resetting the state in > > ReleaseBulkInsertStatePin(), even though I think architecturally that's not > > great. > > I spent some time working on a test that shows the problem more cheaply than > the case upthread. I think it'd be desirable to have a test that's likely to > catch an issue like this fairly quickly. We've had other problems in this > realm before - there's only a single test that fails if I remove the > ReleaseBulkInsertStatePin() call, and I don't think that's guaranteed at all. > > I'm a bit on the fence on how large to make the relation. For me the bug > triggers when filling both relations up to the 3rd block, but how many rows > that takes is somewhat dependent on space utilization on the page and stuff. > > Right now the test uses data/desc.data and ends up with 328kB and 312kB in two > partitions. Alternatively I could make the test create a new file to load with > copy that has fewer rows than data/desc.data - I didn't see another data file > that works conveniently and has fewer rows. The copy is reasonably fast, even > under something as expensive as rr (~60ms). So I'm inclined to just go with > that? Patch with fix and test attached (0001). Given how easy a missing ReleaseBulkInsertStatePin() can cause corruption (not due to this bug, but the issue fixed in b1ecb9b3fcf), I think we should consider adding an assertion along the lines of 0002 to HEAD. Perhaps adding a new bufmgr.c function to avoid having to get the fields in the buffer tag we don't care about. Or perhaps we should just promote the check to an elog, we already call BufferGetBlockNumber(), using BufferGetTag() instead doesn't cost much more. Greetings, Andres Freund >From e202cf5c3cae0551ae8fabd2629b4f9c6a0734d5 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 13 Oct 2023 10:54:16 -0700 Subject: [PATCH v1 1/2] Fix bulk table extension when copying into multiple partitions When COPYing into a partitioned table that does now permit the use of table_multi_insert(), we could error out with ERROR: could not read block NN in file "base/...": read only 0 of 8192 bytes because BulkInsertState->next_free was not reset between partitions. This problem occured only when not able to use table_multi_insert(), as a dedicated BulkInsertState for each partition is used in that case. The bug was introduced in 00d1e02be24, but it was hard to hit at that point as commonly bulk relation extension is not used when not using table_multi_insert(). It became more likely after 82a4edabd27, which expanded the use of bulk extension. To fix the bug, reset the bulk relation extension state in BulkInsertState in ReleaseBulkInsertStatePin(). That was added (in b1ecb9b3fcf) to tackle a very similar issue. Obviously the name is not quite correct, but there might be external callers, and bulk insert state needs to be reset in precisely in the situations that ReleaseBulkInsertStatePin() already needed to be called. Medium term the better fix likely is to disallow reusing BulkInsertState across relations. Add a test that, without the fix, reproduces #18130 in most configurations. The test also catches the problem fixed in b1ecb9b3fcf when run with small shared_buffers. Reported-by: Ivan Kolombet Analyzed-by: Tom Lane Analyzed-by: Andres Freund Bug: #18130 Discussion: https://postgr.es/m/18130-7a86a7356a75209d%40postgresql.org Discussion: https://postgr.es/m/257696.1695670946%40sss.pgh.pa.us Backpatch: 16- --- src/backend/access/heap/heapam.c | 11 + src/test/regress/expected/copy.out | 37 ++ src/test/regress/sql/copy.sql | 37 ++ 3 files changed, 85 insertions(+) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 88a123d38a6..51eceaca6cf 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -1792,6 +1792,17 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate) if (bistate->current_buf != InvalidBuffer) ReleaseBuffer(bistate->current_buf); bistate->current_buf = InvalidBuffer; + + /* + * Despite the name, we also release information about bulk + * inserts. Otherwise we can end up erroring out due to looking for free + * space in ->next_free of one partition, even though ->next_free was set + * when extending another partition. It's obviously also could be bad for + * efficiency to look at existing blocks at offsets from another + * partition, even if we don't error out. + */ + bistate->next_free = InvalidBlockNumber; + bistate->last_free = InvalidBlockNumber; } diff --git a/src/test/regress/expected/copy.out b/src/test/regress/expected/copy.out index a5912c13a8c..b48365ec981 100644 --- a/src/test/regress/expected/copy.out +++ b/src/
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-10-12 09:24:19 -0700, Andres Freund wrote: > On 2023-10-12 11:44:09 -0400, Tom Lane wrote: > > Andres Freund writes: > > >> On 2023-09-25 15:42:26 -0400, Tom Lane wrote: > > >>> I just did a git bisect run to discover when the failure documented > > >>> in bug #18130 [1] started. And the answer is commit 82a4edabd. > > > > > Uh, huh. The problem is that COPY uses a single BulkInsertState for > > > multiple > > > partitions. Which to me seems to run counter to the following comment: > > > *The caller can also provide a BulkInsertState object to > > > optimize many > > > *insertions into the same relation. This keeps a pin on the > > > current > > > *insertion target page (to save pin/unpin cycles) and also > > > passes a > > > *BULKWRITE buffer selection strategy object to the buffer > > > manager. > > > *Passing NULL for bistate selects the default behavior. > > > > > The reason this doesn't cause straight up corruption due to reusing a pin > > > from > > > another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() > > > and a > > > call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk > > > insertion state, which is what leads to the errors from the bug report. > > > > > Resetting the relevant BulkInsertState fields fixes the problem. But I'm > > > not > > > sure that's the right fix. ISTM that independent of whether we fix this > > > via > > > ReleaseBulkInsertStatePin() resetting the fields or via not reusing > > > BulkInsertState, we should add assertions defending against future issues > > > like > > > this (e.g. by adding a relation field to BulkInsertState in cassert > > > builds, > > > and asserting that the relation is the same as in prior calls unless > > > ReleaseBulkInsertStatePin() has been called). > > > > Ping? We really ought to have a fix for this committed in time for > > 16.1. > > I kind of had hoped somebody would comment on the approach. Given that nobody > has, I'll push the minimal fix of resetting the state in > ReleaseBulkInsertStatePin(), even though I think architecturally that's not > great. I spent some time working on a test that shows the problem more cheaply than the case upthread. I think it'd be desirable to have a test that's likely to catch an issue like this fairly quickly. We've had other problems in this realm before - there's only a single test that fails if I remove the ReleaseBulkInsertStatePin() call, and I don't think that's guaranteed at all. I'm a bit on the fence on how large to make the relation. For me the bug triggers when filling both relations up to the 3rd block, but how many rows that takes is somewhat dependent on space utilization on the page and stuff. Right now the test uses data/desc.data and ends up with 328kB and 312kB in two partitions. Alternatively I could make the test create a new file to load with copy that has fewer rows than data/desc.data - I didn't see another data file that works conveniently and has fewer rows. The copy is reasonably fast, even under something as expensive as rr (~60ms). So I'm inclined to just go with that? Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-10-12 11:44:09 -0400, Tom Lane wrote: > Andres Freund writes: > >> On 2023-09-25 15:42:26 -0400, Tom Lane wrote: > >>> I just did a git bisect run to discover when the failure documented > >>> in bug #18130 [1] started. And the answer is commit 82a4edabd. > > > Uh, huh. The problem is that COPY uses a single BulkInsertState for > > multiple > > partitions. Which to me seems to run counter to the following comment: > > * The caller can also provide a BulkInsertState object to optimize many > > * insertions into the same relation. This keeps a pin on the current > > * insertion target page (to save pin/unpin cycles) and also passes a > > * BULKWRITE buffer selection strategy object to the buffer manager. > > * Passing NULL for bistate selects the default behavior. > > > The reason this doesn't cause straight up corruption due to reusing a pin > > from > > another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and > > a > > call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk > > insertion state, which is what leads to the errors from the bug report. > > > Resetting the relevant BulkInsertState fields fixes the problem. But I'm not > > sure that's the right fix. ISTM that independent of whether we fix this via > > ReleaseBulkInsertStatePin() resetting the fields or via not reusing > > BulkInsertState, we should add assertions defending against future issues > > like > > this (e.g. by adding a relation field to BulkInsertState in cassert builds, > > and asserting that the relation is the same as in prior calls unless > > ReleaseBulkInsertStatePin() has been called). > > Ping? We really ought to have a fix for this committed in time for > 16.1. I kind of had hoped somebody would comment on the approach. Given that nobody has, I'll push the minimal fix of resetting the state in ReleaseBulkInsertStatePin(), even though I think architecturally that's not great. Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Andres Freund writes: >> On 2023-09-25 15:42:26 -0400, Tom Lane wrote: >>> I just did a git bisect run to discover when the failure documented >>> in bug #18130 [1] started. And the answer is commit 82a4edabd. > Uh, huh. The problem is that COPY uses a single BulkInsertState for multiple > partitions. Which to me seems to run counter to the following comment: > *The caller can also provide a BulkInsertState object to optimize many > *insertions into the same relation. This keeps a pin on the current > *insertion target page (to save pin/unpin cycles) and also passes a > *BULKWRITE buffer selection strategy object to the buffer manager. > *Passing NULL for bistate selects the default behavior. > The reason this doesn't cause straight up corruption due to reusing a pin from > another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and a > call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk > insertion state, which is what leads to the errors from the bug report. > Resetting the relevant BulkInsertState fields fixes the problem. But I'm not > sure that's the right fix. ISTM that independent of whether we fix this via > ReleaseBulkInsertStatePin() resetting the fields or via not reusing > BulkInsertState, we should add assertions defending against future issues like > this (e.g. by adding a relation field to BulkInsertState in cassert builds, > and asserting that the relation is the same as in prior calls unless > ReleaseBulkInsertStatePin() has been called). Ping? We really ought to have a fix for this committed in time for 16.1. regards, tom lane
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-09-25 12:48:30 -0700, Andres Freund wrote: > On 2023-09-25 15:42:26 -0400, Tom Lane wrote: > > I just did a git bisect run to discover when the failure documented > > in bug #18130 [1] started. And the answer is commit 82a4edabd. > > Now, it's pretty obvious that that commit didn't in itself cause > > problems like this: > > > > postgres=# \copy test from 'bug18130.csv' csv > > ERROR: could not read block 5 in file "base/5/17005": read only 0 of 8192 > > bytes > > CONTEXT: COPY test, line 472: > > "0,185647715,222655,489637,2,2023-07-31,9100.000,302110385,2023-07-30 > > 14:16:36.750981+00,14026347..." > > Ugh. > > > > IMO there must be some very nasty bug lurking in the new > > multiple-block extension logic, that happens to be exposed by this > > test case with 82a4edabd's adjustments to the when-to-extend choices > > but wasn't before that. > > > To save other people the trouble of extracting the in-line data > > in the bug submission, I've attached the test files I was using. > > Thanks, looking at this now. (had to switch locations in between) Uh, huh. The problem is that COPY uses a single BulkInsertState for multiple partitions. Which to me seems to run counter to the following comment: * The caller can also provide a BulkInsertState object to optimize many * insertions into the same relation. This keeps a pin on the current * insertion target page (to save pin/unpin cycles) and also passes a * BULKWRITE buffer selection strategy object to the buffer manager. * Passing NULL for bistate selects the default behavior. The reason this doesn't cause straight up corruption due to reusing a pin from another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and a call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk insertion state, which is what leads to the errors from the bug report. Resetting the relevant BulkInsertState fields fixes the problem. But I'm not sure that's the right fix. ISTM that independent of whether we fix this via ReleaseBulkInsertStatePin() resetting the fields or via not reusing BulkInsertState, we should add assertions defending against future issues like this (e.g. by adding a relation field to BulkInsertState in cassert builds, and asserting that the relation is the same as in prior calls unless ReleaseBulkInsertStatePin() has been called). Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-09-25 15:42:26 -0400, Tom Lane wrote: > I just did a git bisect run to discover when the failure documented > in bug #18130 [1] started. And the answer is commit 82a4edabd. > Now, it's pretty obvious that that commit didn't in itself cause > problems like this: > > postgres=# \copy test from 'bug18130.csv' csv > ERROR: could not read block 5 in file "base/5/17005": read only 0 of 8192 > bytes > CONTEXT: COPY test, line 472: > "0,185647715,222655,489637,2,2023-07-31,9100.000,302110385,2023-07-30 > 14:16:36.750981+00,14026347..." Ugh. > IMO there must be some very nasty bug lurking in the new > multiple-block extension logic, that happens to be exposed by this > test case with 82a4edabd's adjustments to the when-to-extend choices > but wasn't before that. > To save other people the trouble of extracting the in-line data > in the bug submission, I've attached the test files I was using. Thanks, looking at this now. > The DDL is simplified slightly from what was submitted. I'm not > entirely sure why a no-op trigger is needed to provoke the bug... A trigger might prevent using the multi-insert API, which would lead to different execution paths... Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Andres Freund writes: > On 2023-09-06 18:01:53 -0400, Tom Lane wrote: >> It turns out that this patch is what's making buildfarm member >> chipmunk fail in contrib/pg_visibility [1]. That's easily reproduced >> by running the test with shared_buffers = 10MB. I didn't dig further >> than the "git bisect" result: > At first I was a bit confounded by not being able to reproduce this. My test > environment had max_connections=110 for some other reason - and the problem > doesn't reproduce with that setting... I just did a git bisect run to discover when the failure documented in bug #18130 [1] started. And the answer is commit 82a4edabd. Now, it's pretty obvious that that commit didn't in itself cause problems like this: postgres=# \copy test from 'bug18130.csv' csv ERROR: could not read block 5 in file "base/5/17005": read only 0 of 8192 bytes CONTEXT: COPY test, line 472: "0,185647715,222655,489637,2,2023-07-31,9100.000,302110385,2023-07-30 14:16:36.750981+00,14026347..." IMO there must be some very nasty bug lurking in the new multiple-block extension logic, that happens to be exposed by this test case with 82a4edabd's adjustments to the when-to-extend choices but wasn't before that. To save other people the trouble of extracting the in-line data in the bug submission, I've attached the test files I was using. The DDL is simplified slightly from what was submitted. I'm not entirely sure why a no-op trigger is needed to provoke the bug... regards, tom lane [1] https://www.postgresql.org/message-id/18130-7a86a7356a75209d%40postgresql.org -- run this, then \copy test from 'bug18130.csv' csv create table if not exists test ( a smallint, b bigint, c bigint, d bigint, e smallint, plan_date date, g numeric(18,7), h bigint, i timestamptz, j bigint, k integer, l smallint, primary key (a, b, c, d, e, plan_date) ) partition by list(a); /* this is just to generate partition structure automatically */ create or replace function test_add_partitions(a integer, year integer) returns void volatile strict as $$ declare parent text; root text; begin root := 'test_' || a::text; execute 'create table if not exists ' || root || ' partition of test for values in (' || a::text || ') partition by hash (b);'; for b in 0..7 loop parent := root || '_' || b::text; execute 'create table if not exists ' || parent || ' partition of ' || root || ' for values with (modulus 8, remainder ' || b::text || ')' || ' partition by range (plan_date);'; execute 'create table if not exists ' || parent || '_y' || year::text || 'm1 partition of ' || parent || ' for values from (''' || year::text || '-01-01'') to (''' || year::text || '-02-01'');'; execute 'create table if not exists ' || parent || '_y' || year::text || 'm2 partition of ' || parent || ' for values from (''' || year::text || '-02-01'') to (''' || year::text || '-03-01'');'; execute 'create table if not exists ' || parent || '_y' || year::text || 'm3 partition of ' || parent || ' for values from (''' || year::text || '-03-01'') to (''' || year::text || '-04-01'');'; execute 'create table if not exists ' || parent || '_y' || year::text || 'm4 partition of ' || parent || ' for values from (''' || year::text || '-04-01'') to (''' || year::text || '-05-01'');'; execute 'create table if not exists ' || parent || '_y' || year::text || 'm5 partition of ' || parent || ' for values from (''' || year::text || '-05-01'') to (''' || year::text || '-06-01'');'; execute 'create table if not exists ' || parent || '_y' || year::text || 'm6 partition of ' || parent || ' for values from (''' || year::text || '-06-01'') to (''' || year::text || '-07-01'');'; execute 'create table if not exists ' || parent || '_y' || year::text || 'm7 partition of ' || parent || ' for values from (''' || year::text || '-07-01'') to (''' || year::text || '-08-01'');'; execute 'create table if not exists ' || parent || '_y' || year::text || 'm8 partition of ' || parent || ' for values from (''' || year::text || '-08-01'') to (''' || year::text || '-09-01'');'; execute 'create table if not exists ' || parent || '_y' || year::text || 'm9 partition of ' || parent || ' for values from (''' || year::text || '-09-01'') to ('''
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-09-06 18:01:53 -0400, Tom Lane wrote: > It turns out that this patch is what's making buildfarm member > chipmunk fail in contrib/pg_visibility [1]. That's easily reproduced > by running the test with shared_buffers = 10MB. I didn't dig further > than the "git bisect" result: At first I was a bit confounded by not being able to reproduce this. My test environment had max_connections=110 for some other reason - and the problem doesn't reproduce with that setting... > $ git bisect bad > 82a4edabd272f70d044faec8cf7fd1eab92d9991 is the first bad commit > commit 82a4edabd272f70d044faec8cf7fd1eab92d9991 > Author: Andres Freund > Date: Mon Aug 14 09:54:03 2023 -0700 > > hio: Take number of prior relation extensions into account > > but I imagine the problem is that the patch's more aggressive > relation-extension heuristic is causing the table to have more > pages than the test case expects. Is it really a good idea > for such a heuristic to depend on shared_buffers? The heuristic doesn't directly depend on shared buffers. However, the amount we extend by is limited by needing to pin shared buffers covering all the newly extended buffers. That's what ends up limiting things here - shared_buffers = 10MB and max_connections = 10 doesn't allow for a lot of buffers to be pinned concurrently in each backend. Although perhaps LimitAdditionalPins() is a bit too conservative, due to not checking the private refcount array and just assuming REFCOUNT_ARRAY_ENTRIES. > If you don't want to change the heuristic then we'll have to find a way to > tweak this test to avoid it. We could tweak LimitAdditionalPins() by checking PrivateRefCountArray instead of assuming the worst-case REFCOUNT_ARRAY_ENTRIES. However, it seems that the logic in the test is pretty fragile independent of this issue? Different alignment, block size or an optimization of the page layout could also break the test? Unfortunately a query that doesn't falsely alert in this case is a bit ugly, due to needing to deal with the corner case of an empty page at the end: select * from pg_visibility_map('copyfreeze') where (not all_visible or not all_frozen) -- deal with trailing empty pages due to potentially bulk-extending too aggressively and exists(SELECT * FROM copyfreeze WHERE ctid >= ('('||blkno||', 0)')::tid) ; Before 82a4edabd27 this situation was rare - you'd have needed contended extensions. But after it has become more common. I worry that that might cause other issues :(. OTOH, I think we'll need to extend way more aggressively at some point... Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Andres Freund writes: > On 2023-08-16 13:15:46 +0200, Alvaro Herrera wrote: >> Since the wins from this patch were replicated and it has been pushed, I >> understand that this open item can be marked as closed, so I've done >> that. > Thanks! It turns out that this patch is what's making buildfarm member chipmunk fail in contrib/pg_visibility [1]. That's easily reproduced by running the test with shared_buffers = 10MB. I didn't dig further than the "git bisect" result: $ git bisect bad 82a4edabd272f70d044faec8cf7fd1eab92d9991 is the first bad commit commit 82a4edabd272f70d044faec8cf7fd1eab92d9991 Author: Andres Freund Date: Mon Aug 14 09:54:03 2023 -0700 hio: Take number of prior relation extensions into account but I imagine the problem is that the patch's more aggressive relation-extension heuristic is causing the table to have more pages than the test case expects. Is it really a good idea for such a heuristic to depend on shared_buffers? If you don't want to change the heuristic then we'll have to find a way to tweak this test to avoid it. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-09-06%2014%3A14%3A51
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On 2023-08-16 13:15:46 +0200, Alvaro Herrera wrote: > Since the wins from this patch were replicated and it has been pushed, I > understand that this open item can be marked as closed, so I've done > that. Thanks!
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hello, On 2023-Aug-12, Andres Freund wrote: > On 2023-08-08 12:45:05 +0900, Masahiko Sawada wrote: > > > Any chance you could your benchmark? I don't see as much of a regression > > > vs 16 > > > as you... > > > > Sure. The results are promising for me too: > > > > nclients = 1, execution time = 13.743 > > nclients = 2, execution time = 7.552 > > nclients = 4, execution time = 4.758 > > nclients = 8, execution time = 3.035 > > nclients = 16, execution time = 2.172 > > nclients = 32, execution time = 1.959 > > nclients = 64, execution time = 1.819 > > nclients = 128, execution time = 1.583 > > nclients = 256, execution time = 1.631 > > Nice. We are consistently better than both 15 and "post integer parsing 16". Since the wins from this patch were replicated and it has been pushed, I understand that this open item can be marked as closed, so I've done that. Thanks, -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "Hay dos momentos en la vida de un hombre en los que no debería especular: cuando puede permitírselo y cuando no puede" (Mark Twain)
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-08-08 12:45:05 +0900, Masahiko Sawada wrote: > > I think there could be a quite simple fix: Track by how much we've extended > > the relation previously in the same bistate. If we already extended by many > > blocks, it's very likey that we'll do so further. > > > > A simple prototype patch attached. The results for me are promising. I > > copied > > a smaller file [1], to have more accurate throughput results at shorter runs > > (15s). > > Thank you for the patch! Attached is a mildly updated version of that patch. No functional changes, just polished comments and added a commit message. > > Any chance you could your benchmark? I don't see as much of a regression vs > > 16 > > as you... > > Sure. The results are promising for me too: > > nclients = 1, execution time = 13.743 > nclients = 2, execution time = 7.552 > nclients = 4, execution time = 4.758 > nclients = 8, execution time = 3.035 > nclients = 16, execution time = 2.172 > nclients = 32, execution time = 1.959 > nclients = 64, execution time = 1.819 > nclients = 128, execution time = 1.583 > nclients = 256, execution time = 1.631 Nice. We are consistently better than both 15 and "post integer parsing 16". I'm really a bit baffled at myself for not using this approach from the get go. This also would make it much more beneficial to use a BulkInsertState in nodeModifyTable.c, even without converting to table_multi_insert(). I was tempted to optimize RelationAddBlocks() a bit, by not calling RelationExtensionLockWaiterCount() if we are already extending by MAX_BUFFERS_TO_EXTEND_BY. Before this patch, it was pretty much impossible to reach that case, because of the MAX_BUFFERED_* limits in copyfrom.c, but now it's more common. But that probably ought to be done only HEAD, not 16. A related oddity: RelationExtensionLockWaiterCount()->LockWaiterCount() uses an exclusive lock on the lock partition - which seems not at all necessary? Unless somebody sees a reason not to, I'm planning to push this? Greetings, Andres Freund >From 26e7faad65273beefeb7b40a169c1b631e204501 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Sat, 12 Aug 2023 12:31:43 -0700 Subject: [PATCH v2] hio: Take number of prior relation extensions into account The new relation extension logic, introduced in 00d1e02be24, could lead to slowdowns in some scenarios. E.g., when loading narrow rows into a table using COPY, the caller of RelationGetBufferForTuple() will only request a small number of pages. Without concurrency, we just extended using pwritev() in that case. However, if there is *some* concurrency, we switched between extending by a small number of pages and a larger number of pages, depending on the number of waiters for the relation extension logic. However, some filesystems, XFS in particular, do not perform well when switching between extending files using fallocate() and pwritev(). To avoid that issue, remember the number of prior relation extensions in BulkInsertState and extend more aggressively if there were prior relation extensions. That not just avoids the aforementioned slowdown, but also leads to noticeable performance gains in other situations, primarily due to extending more aggressively when there is no concurrency. I should have done it this way from the get go. Reported-by: Masahiko Sawada Author: Andres Freund Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=kp6mszngk3bq9yr...@mail.gmail.com Backpatch: 16-, where the new relation extension code was added --- src/include/access/hio.h | 13 ++--- src/backend/access/heap/heapam.c | 1 + src/backend/access/heap/hio.c| 19 +++ 3 files changed, 30 insertions(+), 3 deletions(-) diff --git a/src/include/access/hio.h b/src/include/access/hio.h index 228433ee4a2..9bc563b7628 100644 --- a/src/include/access/hio.h +++ b/src/include/access/hio.h @@ -32,15 +32,22 @@ typedef struct BulkInsertStateData Buffer current_buf; /* current insertion target page */ /* - * State for bulk extensions. Further pages that were unused at the time - * of the extension. They might be in use by the time we use them though, - * so rechecks are needed. + * State for bulk extensions. + * + * last_free..next_free are further pages that were unused at the time of + * the last extension. They might be in use by the time we use them + * though, so rechecks are needed. * * XXX: Eventually these should probably live in RelationData instead, * alongside targetblock. + * + * already_extended_by is the number of pages that this bulk inserted + * extended by. If we already extended by a significant number of pages, + * we can be more aggressive about extending going forward. */ BlockNumber next_free; BlockNumber last_free; + uint32 already_extended_by; } BulkInsertStateData; diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 7ed72abe597..6a66214a580 100644 --- a/src/backend/access/he
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Tue, Aug 8, 2023 at 3:10 AM Andres Freund wrote: > > Hi, > > On 2023-08-07 23:05:39 +0900, Masahiko Sawada wrote: > > On Mon, Aug 7, 2023 at 3:16 PM David Rowley wrote: > > > > > > On Wed, 2 Aug 2023 at 13:35, David Rowley wrote: > > > > So, it looks like this item can be closed off. I'll hold off from > > > > doing that for a few days just in case anyone else wants to give > > > > feedback or test themselves. > > > > > > Alright, closed. > > > > IIUC the problem with multiple concurrent COPY is not resolved yet. > > Yea - it was just hard to analyze until the other regressions were fixed. > > > > The result of nclients = 1 became better thanks to recent fixes, but > > there still seems to be the performance regression at nclient = 2~16 > > (on RHEL 8 and 9). Andres reported[1] that after changing > > MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it > > would not be the solution, as he mentioned. > > I think there could be a quite simple fix: Track by how much we've extended > the relation previously in the same bistate. If we already extended by many > blocks, it's very likey that we'll do so further. > > A simple prototype patch attached. The results for me are promising. I copied > a smaller file [1], to have more accurate throughput results at shorter runs > (15s). Thank you for the patch! > > HEAD before: > clients tps > 1 41 > 2 76 > 4136 > 8248 > 16 360 > 32 375 > 64 317 > > > HEAD after: > clients tps > 1 43 > 2 80 > 4155 > 8280 > 16 369 > 32 405 > 64 344 > > Any chance you could your benchmark? I don't see as much of a regression vs 16 > as you... Sure. The results are promising for me too: nclients = 1, execution time = 13.743 nclients = 2, execution time = 7.552 nclients = 4, execution time = 4.758 nclients = 8, execution time = 3.035 nclients = 16, execution time = 2.172 nclients = 32, execution time = 1.959 nclients = 64, execution time = 1.819 nclients = 128, execution time = 1.583 nclients = 256, execution time = 1.631 Here are results of the same benchmark test you used: w/o patch: clientstps 1 66.702 2 59.696 4 73.783 8 168.115 16 400.134 32 574.098 64 565.373 128 526.303 256 591.751 w/ patch: clients tps 1 67.735 2 122.534 4 240.707 8 398.944 16 541.097 32 643.083 64 614.775 128 616.007 256 577.885 Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-08-07 23:05:39 +0900, Masahiko Sawada wrote: > On Mon, Aug 7, 2023 at 3:16 PM David Rowley wrote: > > > > On Wed, 2 Aug 2023 at 13:35, David Rowley wrote: > > > So, it looks like this item can be closed off. I'll hold off from > > > doing that for a few days just in case anyone else wants to give > > > feedback or test themselves. > > > > Alright, closed. > > IIUC the problem with multiple concurrent COPY is not resolved yet. Yea - it was just hard to analyze until the other regressions were fixed. > The result of nclients = 1 became better thanks to recent fixes, but > there still seems to be the performance regression at nclient = 2~16 > (on RHEL 8 and 9). Andres reported[1] that after changing > MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it > would not be the solution, as he mentioned. I think there could be a quite simple fix: Track by how much we've extended the relation previously in the same bistate. If we already extended by many blocks, it's very likey that we'll do so further. A simple prototype patch attached. The results for me are promising. I copied a smaller file [1], to have more accurate throughput results at shorter runs (15s). HEAD before: clients tps 1 41 2 76 4136 8248 16 360 32 375 64 317 HEAD after: clients tps 1 43 2 80 4155 8280 16 369 32 405 64 344 Any chance you could your benchmark? I don't see as much of a regression vs 16 as you... Greetings, Andres Freund [1] COPY (SELECT generate_series(1, 10)) TO '/tmp/data.copy'; diff --git i/src/include/access/hio.h w/src/include/access/hio.h index 228433ee4a2..5ae39aec7f8 100644 --- i/src/include/access/hio.h +++ w/src/include/access/hio.h @@ -41,6 +41,7 @@ typedef struct BulkInsertStateData */ BlockNumber next_free; BlockNumber last_free; + uint32 already_extended_by; } BulkInsertStateData; diff --git i/src/backend/access/heap/heapam.c w/src/backend/access/heap/heapam.c index 7ed72abe597..6a66214a580 100644 --- i/src/backend/access/heap/heapam.c +++ w/src/backend/access/heap/heapam.c @@ -1776,6 +1776,7 @@ GetBulkInsertState(void) bistate->current_buf = InvalidBuffer; bistate->next_free = InvalidBlockNumber; bistate->last_free = InvalidBlockNumber; + bistate->already_extended_by = 0; return bistate; } diff --git i/src/backend/access/heap/hio.c w/src/backend/access/heap/hio.c index c275b08494d..a2be4273df1 100644 --- i/src/backend/access/heap/hio.c +++ w/src/backend/access/heap/hio.c @@ -283,6 +283,13 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate, */ extend_by_pages += extend_by_pages * waitcount; + /* + * If we previously extended using the same bistate, it's very likely + * we'll extend some more. Try to extend by as many pages. + */ + if (bistate) + extend_by_pages = Max(extend_by_pages, bistate->already_extended_by); + /* * Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin * them all concurrently. @@ -409,6 +416,7 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate, /* maintain bistate->current_buf */ IncrBufferRefCount(buffer); bistate->current_buf = buffer; + bistate->already_extended_by += extend_by_pages; } return buffer;
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Mon, Aug 7, 2023 at 3:16 PM David Rowley wrote: > > On Wed, 2 Aug 2023 at 13:35, David Rowley wrote: > > So, it looks like this item can be closed off. I'll hold off from > > doing that for a few days just in case anyone else wants to give > > feedback or test themselves. > > Alright, closed. IIUC the problem with multiple concurrent COPY is not resolved yet. I've run the same benchmark that I used for the first report: * PG15 (cb2ae5741f) nclients = 1, execution time = 15.213 nclients = 2, execution time = 9.470 nclients = 4, execution time = 6.508 nclients = 8, execution time = 4.526 nclients = 16, execution time = 3.788 nclients = 32, execution time = 3.837 nclients = 64, execution time = 4.286 nclients = 128, execution time = 4.388 nclients = 256, execution time = 4.333 * PG16 (67a007dc0c) nclients = 1, execution time = 14.494 nclients = 2, execution time = 12.962 nclients = 4, execution time = 17.757 nclients = 8, execution time = 10.865 nclients = 16, execution time = 7.371 nclients = 32, execution time = 4.929 nclients = 64, execution time = 2.212 nclients = 128, execution time = 2.020 nclients = 256, execution time = 2.196 The result of nclients = 1 became better thanks to recent fixes, but there still seems to be the performance regression at nclient = 2~16 (on RHEL 8 and 9). Andres reported[1] that after changing MAX_BUFFERED_TUPLES to 5000 the numbers became a lot better but it would not be the solution, as he mentioned. Regards, [1] https://www.postgresql.org/message-id/20230711185159.v2j5vnyrtodnwhgz%40awork3.anarazel.de -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Wed, 2 Aug 2023 at 13:35, David Rowley wrote: > So, it looks like this item can be closed off. I'll hold off from > doing that for a few days just in case anyone else wants to give > feedback or test themselves. Alright, closed. David
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Wed, 2 Aug 2023 at 12:25, David Rowley wrote: > master @ 3845577cb > latency average = 1575.879 ms > >6.79% postgres [.] pg_strtoint32_safe > > master~1 > latency average = 1968.004 ms > > 14.28% postgres [.] pg_strtoint32_safe > > REL_16_STABLE > latency average = 1735.163 ms > >6.04% postgres [.] pg_strtoint32_safe > > REL_16_STABLE~1 > latency average = 2188.186 ms > > 13.83% postgres [.] pg_strtoint32_safe And just to complete that set, here's the REL_15_STABLE performance using the same test: latency average = 1829.108 ms 15.46% postgres [.] pg_strtoint32 So, it looks like this item can be closed off. I'll hold off from doing that for a few days just in case anyone else wants to give feedback or test themselves. David
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Wed, 2 Aug 2023 at 07:38, Dean Rasheed wrote: > Running the new test on slightly older Intel hardware (i9-9900K, gcc > 11.3), I get the following: Thanks for running those tests. I've now pushed the fastpath4.patch after making a few adjustments to the header comments to mention the new stuff that was added in v16. > I don't know why HEAD and v16 aren't consistent, but it's seems to be > quite reproducible, even though the numutils source is the same in > both branches, and using gdb to dump the disassembly for > pg_strtoint32_safe() shows that it's also the same. I also see it's inconsistent, but the other way around. Here are some fresh tests with master and REL_16_STABLE with the committed code and the version directly prior to the commit: master @ 3845577cb latency average = 1575.879 ms 6.79% postgres [.] pg_strtoint32_safe master~1 latency average = 1968.004 ms 14.28% postgres [.] pg_strtoint32_safe REL_16_STABLE latency average = 1735.163 ms 6.04% postgres [.] pg_strtoint32_safe REL_16_STABLE~1 latency average = 2188.186 ms 13.83% postgres [.] pg_strtoint32_safe David
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Tue, 1 Aug 2023 at 15:01, David Rowley wrote: > > Here's a patch with an else condition when the first digit check fails. > > master + fastpath4.patch: > latency average = 1579.576 ms > latency average = 1572.716 ms > latency average = 1563.398 ms > > (appears slightly faster than fastpath3.patch) > Running the new test on slightly older Intel hardware (i9-9900K, gcc 11.3), I get the following: REL_15_STABLE latency average = 1687.695 ms latency stddev = 3.068 ms REL_16_STABLE latency average = 1931.756 ms latency stddev = 2.065 ms REL_16_STABLE + pg_strtoint_fastpath1.patch latency average = 1635.731 ms latency stddev = 3.074 ms REL_16_STABLE + pg_strtoint_fastpath2.patch latency average = 1687.303 ms latency stddev = 4.243 ms REL_16_STABLE + pg_strtoint_fastpath3.patch latency average = 1610.307 ms latency stddev = 2.193 ms REL_16_STABLE + pg_strtoint_fastpath4.patch latency average = 1577.604 ms latency stddev = 4.060 ms HEAD latency average = 1868.737 ms latency stddev = 6.114 ms HEAD + pg_strtoint_fastpath1.patch latency average = 1683.215 ms latency stddev = 2.322 ms HEAD + pg_strtoint_fastpath2.patch latency average = 1650.014 ms latency stddev = 3.920 ms HEAD + pg_strtoint_fastpath3.patch latency average = 1670.337 ms latency stddev = 5.074 ms HEAD + pg_strtoint_fastpath4.patch latency average = 1653.568 ms latency stddev = 8.224 ms I don't know why HEAD and v16 aren't consistent, but it's seems to be quite reproducible, even though the numutils source is the same in both branches, and using gdb to dump the disassembly for pg_strtoint32_safe() shows that it's also the same. Anyway, insofar as these results can be trusted, fastpath4.patch looks good. Regards, Dean
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Wed, 2 Aug 2023 at 01:26, Dean Rasheed wrote: > > On Tue, 1 Aug 2023 at 13:55, David Rowley wrote: > > > > I tried adding the "at least 1 digit check" by adding an else { goto > > slow; } in the above code, but it seems to generate slower code than > > just checking if (unlikely(ptr == s)) { goto slow; } after the loop. > > > > That check isn't quite right, because "ptr" will not equal "s" if > there is a sign character, so it won't detect an input with no digits > in that case. Ah, yeah. Thanks. Here's a patch with an else condition when the first digit check fails. master + fastpath4.patch: latency average = 1579.576 ms latency average = 1572.716 ms latency average = 1563.398 ms (appears slightly faster than fastpath3.patch) David pg_strtoint_fastpath4.patch Description: Binary data
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Tue, 1 Aug 2023 at 13:55, David Rowley wrote: > > I tried adding the "at least 1 digit check" by adding an else { goto > slow; } in the above code, but it seems to generate slower code than > just checking if (unlikely(ptr == s)) { goto slow; } after the loop. > That check isn't quite right, because "ptr" will not equal "s" if there is a sign character, so it won't detect an input with no digits in that case. Regards, Dean
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Tue, 1 Aug 2023 at 13:25, Andres Freund wrote: > There's a lot of larger numbers in the file, which likely reduces the impact > some. And there's the overhead of actually inserting the rows into the table, > making the difference appear smaller than it is. It might be worth special casing the first digit as we can save doing the multiplication by 10 and the overflow checks on the first digit. That should make it slightly faster to parse smaller numbers. > COPY (SELECT generate_series(1, 1000) a, 10 b, 20 c, 30 d, 40 e, 50 f FROM > generate_series(1, 1)) TO '/tmp/lotsaints_wide.copy'; > > psql -c 'DROP TABLE IF EXISTS lotsaints_wide; CREATE UNLOGGED TABLE > lotsaints_wide(a int, b int, c int, d int, e int, f int);' && \ > pgbench -n -P1 -f <( echo "COPY lotsaints_wide FROM > '/tmp/lotsaints_wide.copy' WHERE false") -t 5 > > 15: 2992.605 > HEAD: 3325.201 > fastpath1.patch 2932.606 > fastpath2.patch 2783.915 > > Interestingly fastpath1 is slower now, even though it wasn't with earlier > patches (which still is repeatable). I do not have the foggiest as to why. I'm glad to see that. I've adjusted the patch to add the fast path for the 16 and 64-bit versions of the function. I also added the special case for processing the first digit, which looks like: /* process the first digit */ digit = (*ptr - '0'); if (likely(digit < 10)) { ptr++; tmp = digit; } /* process remaining digits */ for (;;) I tried adding the "at least 1 digit check" by adding an else { goto slow; } in the above code, but it seems to generate slower code than just checking if (unlikely(ptr == s)) { goto slow; } after the loop. I also noticed that I wasn't getting the same performance after adjusting the 16 and 64 bit versions. I assume that's down to code alignment, but unsure of that. I ended up adjusting all the "while (*ptr)" loops into "for (;;)" loops since the NUL char check is handled by the "else break;". I also removed the needless NUL char check in the isspace loops. It can't be isspace and '\0'. I also replaced the isdigit() function call and replaced it for manually checking the digit range. I see my compiler (gcc12.2) effectively generates the same code as the unsigned char fast path version checking if (digit < 10). Once I did that, I got the performance back again. With your new test with the small-sized ints, I get: REL_15_STABLE: latency average = 1696.390 ms master @ d3a38318a latency average = 1928.803 ms master + fastpath1.patch: latency average = 1634.736 ms master + fastpath2.patch: latency average = 1628.704 ms master + fastpath3.patch latency average = 1590.417 ms I see no particular reason not to go ahead with the attached patch and get this thread closed off. Any objections? David pg_strtoint_fastpath3.patch Description: Binary data
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Mon, 31 Jul 2023 at 21:39, John Naylor wrote: > master + pg_strtoint_fastpath1.patch > latency average = 938.146 ms > latency stddev = 9.354 ms > > master + pg_strtoint_fastpath2.patch > latency average = 902.808 ms > latency stddev = 3.957 ms Thanks for checking those two on your machine. I'm glad to see fastpath2 faster on yours too. David
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-27 20:53:16 +1200, David Rowley wrote: > To summarise, REL_15_STABLE can run this benchmark in 526.014 ms on my > AMD 3990x machine. Today's REL_16_STABLE takes 530.344 ms. We're > talking about another patch to speed up the pg_strtoint functions > which gets this down to 483.790 ms. Do we need to do this for v16 or > can we just leave this as it is already? The slowdown does not seem > to be much above what we'd ordinarily classify as noise using this > test on my machine. I think we need to do something for 16 - it appears on recent-ish AMD the regression is quite a bit smaller than on intel. You see something < 1%, I see more like 4%. I think there's also other cases where the slowdown is more substantial. Besides intel vs amd, it also looks like the gcc version might make a difference. The code generated by 13 is noticeably slower than 12 for me... > Benchmark setup: > > COPY (SELECT generate_series(1, 200) a, (random() * 10 - > 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy'; > DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int); There's a lot of larger numbers in the file, which likely reduces the impact some. And there's the overhead of actually inserting the rows into the table, making the difference appear smaller than it is. If I avoid the actual insert into the table and use more columns, I see an about 10% regression here. COPY (SELECT generate_series(1, 1000) a, 10 b, 20 c, 30 d, 40 e, 50 f FROM generate_series(1, 1)) TO '/tmp/lotsaints_wide.copy'; psql -c 'DROP TABLE IF EXISTS lotsaints_wide; CREATE UNLOGGED TABLE lotsaints_wide(a int, b int, c int, d int, e int, f int);' && \ pgbench -n -P1 -f <( echo "COPY lotsaints_wide FROM '/tmp/lotsaints_wide.copy' WHERE false") -t 5 15: 2992.605 HEAD: 3325.201 fastpath1.patch 2932.606 fastpath2.patch 2783.915 Interestingly fastpath1 is slower now, even though it wasn't with earlier patches (which still is repeatable). I do not have the foggiest as to why. Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Thu, Jul 27, 2023 at 7:17 AM David Rowley wrote: > > It would be really good if someone with another a newish intel CPU > could test this too. I ran the lotsaints test from last email on an i7-10750H (~3 years old) and got these results (gcc 13.1 , turbo off): REL_15_STABLE: latency average = 956.453 ms latency stddev = 4.854 ms REL_16_STABLE @ 695f5deb7902 (28-JUL-2023): latency average = 999.354 ms latency stddev = 3.611 ms master @ 39055cb4cc (31-JUL-2023): latency average = 995.310 ms latency stddev = 5.176 ms master + revert c1308ce2d (the replace-palloc0 fix) latency average = 1080.909 ms latency stddev = 8.707 ms master + pg_strtoint_fastpath1.patch latency average = 938.146 ms latency stddev = 9.354 ms master + pg_strtoint_fastpath2.patch latency average = 902.808 ms latency stddev = 3.957 ms For me, PG16 seems to regress from PG15, and the second patch seems faster than the first. -- John Naylor EDB: http://www.enterprisedb.com
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Thu, 27 Jul 2023 at 14:51, David Rowley wrote: > Just to keep this moving and to make it easier for people to test the > pg_strtoint patches, I've pushed the fix_COPY_DEFAULT.patch patch. > The only thing I changed was to move the line that was allocating the > array to a location more aligned with the order that the fields are > defined in the struct. I just did another round of benchmarking to see where we're at after fox_COPY_DEFAULT.patch has been committed. Below I've benchmarked REL_15_STABLE up to REL_16_STABLE with some hand-selected key commits, many of which have been mentioned on this thread already because they seem to affect the performance of COPY. To summarise, REL_15_STABLE can run this benchmark in 526.014 ms on my AMD 3990x machine. Today's REL_16_STABLE takes 530.344 ms. We're talking about another patch to speed up the pg_strtoint functions which gets this down to 483.790 ms. Do we need to do this for v16 or can we just leave this as it is already? The slowdown does not seem to be much above what we'd ordinarily classify as noise using this test on my machine. Benchmark setup: COPY (SELECT generate_series(1, 200) a, (random() * 10 - 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy'; DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int); Benchmark: psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints FROM '/tmp/lotsaints.copy';") -t 15 2864eb977 REL_15_STABLE latency average = 526.014 ms 8.84% postgres [.] pg_strtoint32 29452de73 Sat Dec 3 10:50:39 2022 -0500 The commit before "Improve performance of pg_strtointNN functions" latency average = 508.453 ms 10.21% postgres [.] pg_strtoint32 6b423ec67 Sun Dec 4 16:18:18 2022 +1300 Improve performance of pg_strtointNN functions latency average = 492.943 ms 7.73% postgres [.] pg_strtoint32 1939d2628 Fri Dec 9 10:08:44 2022 -0500 The commit before "Convert a few datatype input functions to use "soft" error reporting." latency average = 485.016 ms 8.43% postgres [.] pg_strtoint32 ccff2d20e Fri Dec 9 10:14:53 2022 -0500 Convert a few datatype input functions to use "soft" error reporting. latency average = 501.325 ms 6.90% postgres [.] pg_strtoint32_safe 60684dd83 Tue Dec 13 17:33:28 2022 -0800 The commit before "Non-decimal integer literals" latency average = 500.889 ms 8.27% postgres [.] pg_strtoint32_safe 6fcda9aba Wed Dec 14 05:40:38 2022 +0100 Non-decimal integer literals latency average = 521.904 ms 9.02% postgres [.] pg_strtoint32_safe 1b6f632a3 Sat Feb 4 07:56:09 2023 +0100 The commit before "Allow underscores in integer and numeric constants." latency average = 523.195 ms 9.21% postgres [.] pg_strtoint32_safe faff8f8e4 Sat Feb 4 09:48:51 2023 + Allow underscores in integer and numeric constants. latency average = 493.064 ms 10.25% postgres [.] pg_strtoint32_safe 9f8377f7a Mon Mar 13 10:01:56 2023 -0400 Add a DEFAULT option to COPY FROM latency average = 597.617 ms 12.91% postgres [.] CopyReadLine 10.62% postgres [.] CopyReadAttributesText 10.51% postgres [.] pg_strtoint32_safe 7.97% postgres [.] NextCopyFrom REL_16_STABLE @ c1308ce2d Thu Jul 27 14:48:44 2023 +1200 Fix performance problem with new COPY DEFAULT code latency average = 530.344 ms 13.51% postgres [.] CopyReadLine 9.62% postgres [.] pg_strtoint32_safe 8.97% postgres [.] CopyReadAttributesText 8.43% postgres [.] NextCopyFrom REL_16_STABLE + pg_strtoint_fastpath1.patch latency average = 493.136 ms 13.79% postgres [.] CopyReadLine 11.82% postgres [.] CopyReadAttributesText 7.07% postgres [.] NextCopyFrom 6.81% postgres [.] pg_strtoint32_safe REL_16_STABLE + pg_strtoint_fastpath2.patch latency average = 483.790 ms 13.87% postgres [.] CopyReadLine 10.40% postgres [.] CopyReadAttributesText 8.22% postgres [.] NextCopyFrom 5.52% postgres [.] pg_strtoint32_safe David
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Wed, 26 Jul 2023 at 03:50, Andres Freund wrote: > On 2023-07-25 23:37:08 +1200, David Rowley wrote: > > On Tue, 25 Jul 2023 at 17:34, Andres Freund wrote: > > I've not really studied the fix_COPY_DEFAULT.patch patch. Is there a > > reason to delay committing that? It would be good to eliminate that > > as a variable for the current performance regression. > > Yea, I don't think there's a reason to hold off on that. Sawada-san, do you > want to commit it? Or Andrew? Just to keep this moving and to make it easier for people to test the pg_strtoint patches, I've pushed the fix_COPY_DEFAULT.patch patch. The only thing I changed was to move the line that was allocating the array to a location more aligned with the order that the fields are defined in the struct. David
Re: Performance degradation on concurrent COPY into a single relation in PG16.
> On 2023-07-25 23:37:08 +1200, David Rowley wrote: > > On Tue, 25 Jul 2023 at 17:34, Andres Freund wrote: > > > HEAD: 812.690 > > > > > > your patch: 821.354 > > > > > > strtoint from 8692f6644e7: 824.543 > > > > > > strtoint from 6b423ec677d^: 806.678 > > > > I'm surprised to see the imul version is faster. It's certainly not > > what we found when working on 6b423ec67. > > What CPUs did you test it on? I'd not be surprised if this were heavily > dependent on the microarchitecture. This was on AMD 3990x. > One idea I had was to add a fastpath that won't parse all strings, but will > parse the strings that we would generate, and fall back to the more general > variant if it fails. See the attached, rough, prototype: There were a couple of problems with fastpath.patch. You need to reset the position of ptr at the start of the slow path and also you were using tmp in the if (neg) part instead of tmp_s in the fast path section. I fixed that up and made two versions of the patch, one using the overflow functions (pg_strtoint_fastpath1.patch) and one testing if the number is going to overflow (same as current master) (pg_strtoint_fastpath2.patch) AMD 3990x: master + fix_COPY_DEFAULT.patch: latency average = 525.226 ms master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch: latency average = 488.171 ms master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch: latency average = 481.827 ms Apple M2 Pro: master + fix_COPY_DEFAULT.patch: latency average = 348.433 ms master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch: latency average = 336.778 ms master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch: latency average = 335.992 ms Zen 4 7945HX CPU: master + fix_COPY_DEFAULT.patch: latency average = 296.881 ms master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch: latency average = 287.052 ms master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch: latency average = 280.742 ms The M2 chip does not seem to be clearly faster with the fastpath2 method of overflow checking, but both AMD CPUs seem pretty set on fastpath2 being faster. It would be really good if someone with another a newish intel CPU could test this too. David diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c index 471fbb7ee6..a50c5ccd4a 100644 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -298,9 +298,70 @@ pg_strtoint32_safe(const char *s, Node *escontext) { const char *ptr = s; const char *firstdigit; - uint32 tmp = 0; + uint32 tmp; + int32 tmp_s = 0; boolneg = false; + /* +* The majority of cases are likely to be base-10 digits without any +* underscore separator characters. We'll first try to parse the string with +* the assumption that's the case and only fallback on a slower +* implementation which handles hex, octal and binary strings and +* underscores. +*/ + + /* leave it up to the slow path to look for leading spaces */ + + if (*ptr == '-') + { + ptr++; + neg = true; + } + + /* a leading '+' is uncommon so leave that for the slow path */ + + /* process digits */ + for (;;) + { + unsigned char digit = (*ptr - '0'); + + /* +* Exploit unsigned arithmetic to save having to check both the upper +* and lower bounds of the digit +*/ + if (digit >= 10) + break; + + ptr++; + + if (unlikely(pg_mul_s32_overflow(tmp_s, 10, &tmp_s)) || + unlikely(pg_sub_s32_overflow(tmp_s, digit, &tmp_s))) + goto out_of_range; + } + + /* we need at least one digit */ + if (unlikely(ptr == s)) + goto slow; + + /* when the string does not end in a digit, let the slow path handle it */ + if (unlikely(*ptr != '\0')) + goto slow; + + if (!neg) + { + /* could fail if input is most negative number */ + if (unlikely(tmp_s == PG_INT32_MIN)) + goto out_of_range; + tmp_s = -tmp_s; + } + + return tmp_s; + +slow: + tmp = 0; + ptr = s; + /* no need to reset neg */ + /* skip leading spaces */ while (likely(*ptr) && isspace((unsigned char) *ptr)) ptr++; diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c index 471fbb7ee6..bb48e5157e 100644 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -301,6 +301,70 @@ pg_strtoint32_safe(const char *s, Node *escontext) uint32 tmp = 0; boolneg = false; + /* +* The
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-25 08:50:19 -0700, Andres Freund wrote: > One idea I had was to add a fastpath that won't parse all strings, but will > parse the strings that we would generate, and fall back to the more general > variant if it fails. See the attached, rough, prototype: > > fix_COPY_DEFAULT.patch + fastpath.patch: > 746.971 > > fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch: > 715.570 > > Now, the precise contents of this fastpath are not yet clear (wrt imul or > not), but I think the idea has promise. Btw, I strongly suspect that fastpath wants to be branchless SSE when it grows up. Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-25 23:37:08 +1200, David Rowley wrote: > On Tue, 25 Jul 2023 at 17:34, Andres Freund wrote: > I've not really studied the fix_COPY_DEFAULT.patch patch. Is there a > reason to delay committing that? It would be good to eliminate that > as a variable for the current performance regression. Yea, I don't think there's a reason to hold off on that. Sawada-san, do you want to commit it? Or Andrew? > > prep: > > COPY (SELECT generate_series(1, 200) a, (random() * 10 - > > 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy'; > > DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int); > > > > benchmark: > > psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY > > lotsaints FROM '/tmp/lotsaints.copy';") -t 15 > > > > I disabled turbo mode, pinned the server to a single core of my Xeon Gold > > 5215: > > > > HEAD: 812.690 > > > > your patch: 821.354 > > > > strtoint from 8692f6644e7: 824.543 > > > > strtoint from 6b423ec677d^: 806.678 > > I'm surprised to see the imul version is faster. It's certainly not > what we found when working on 6b423ec67. What CPUs did you test it on? I'd not be surprised if this were heavily dependent on the microarchitecture. > I've been fooling around a bit to try to learn what might be going on. > I wrote 2 patches: > > 1) pg_strtoint_imul.patch: Effectively reverts 6b423ec67 and puts the > code how it likely would have looked had I not done that work at all. > (I've assumed that the hex, octal, binary parsing would have been > added using the overflow multiplication functions just as the base 10 > parsing). > 2) pg_strtoint_optimize.patch: I've made another pass over the > functions with the current overflow checks and optimized the digit > checking code so that it can be done in a single check like: if (digit > < 10). This can be done by assigning the result of *ptr - '0' to an > unsigned char. Anything less than '0' will wrap around and anything > above '9' will remain so. I've also removed a few inefficiencies with > the isspace checking. We didn't need to do "while (*ptr && > isspace(*ptr)). It's fine just to do while (isspace(*ptr)) since '\0' > isn't a space char. I also got rid of the isxdigit call. The > hexlookup array contains -1 for non-hex chars. We may as well check > the digit value is >= 0. > > Here are the results I get using your test as quoted above: > > master @ 62e9af4c63f + fix_COPY_DEFAULT.patch > > latency average = 568.102 ms > > master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch > > latency average = 531.238 ms > > master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch > > latency average = 559.333 ms (surprisingly also faster on my machine) > > The optimized version of the pg_strtoint functions wins over the imul > patch. Could you test to see if this is also the case in your Xeon > machine? (these are the numbers with turbo disabled, if I enable it they're all in the 6xx ms range, but the variance is higher) fix_COPY_DEFAULT.patch 774.344 fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch 777.673 fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch 777.545 fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch 795.298 fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch + likely(isdigit()) 773.477 fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + pg_strtoint_imul.patch 774.443 fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + pg_strtoint_imul.patch + likely(isdigit()) 774.513 fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + pg_strtoint_imul.patch + likely(isdigit()) + unlikely(*ptr == '_') 763.879 One idea I had was to add a fastpath that won't parse all strings, but will parse the strings that we would generate, and fall back to the more general variant if it fails. See the attached, rough, prototype: fix_COPY_DEFAULT.patch + fastpath.patch: 746.971 fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch: 715.570 Now, the precise contents of this fastpath are not yet clear (wrt imul or not), but I think the idea has promise. > > (when I say strtoint from, I did not replace the goto labels, so that part > > is > > unchanged and unrelated) > > I didn't quite follow this. I meant that I didn't undo ereport()->ereturn(). Greetings, Andres Freund diff --git c/src/backend/utils/adt/numutils.c i/src/backend/utils/adt/numutils.c index 471fbb7ee63..52bbff801d0 100644 --- c/src/backend/utils/adt/numutils.c +++ i/src/backend/utils/adt/numutils.c @@ -301,6 +301,53 @@ pg_strtoint32_safe(const char *s, Node *escontext) uint32 tmp = 0; bool neg = false; + /* + * Fast path recognizing the most common output. + */ + if (1) + { + int32 tmp_s = 0; + + /* leading spaces are uncommon => slow path */ + + /* handle - sign, + is uncommon => slow path */ + if (*ptr == '-') + { + ptr++; + neg = true; + } + + /* requ
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Tue, 25 Jul 2023 at 17:34, Andres Freund wrote: > prep: > COPY (SELECT generate_series(1, 200) a, (random() * 10 - 5)::int > b, 3243423 c) TO '/tmp/lotsaints.copy'; > DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int); > > benchmark: > psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints > FROM '/tmp/lotsaints.copy';") -t 15 > > I disabled turbo mode, pinned the server to a single core of my Xeon Gold > 5215: > > HEAD: 812.690 > > your patch: 821.354 > > strtoint from 8692f6644e7: 824.543 > > strtoint from 6b423ec677d^: 806.678 I'm surprised to see the imul version is faster. It's certainly not what we found when working on 6b423ec67. I've been fooling around a bit to try to learn what might be going on. I wrote 2 patches: 1) pg_strtoint_imul.patch: Effectively reverts 6b423ec67 and puts the code how it likely would have looked had I not done that work at all. (I've assumed that the hex, octal, binary parsing would have been added using the overflow multiplication functions just as the base 10 parsing). 2) pg_strtoint_optimize.patch: I've made another pass over the functions with the current overflow checks and optimized the digit checking code so that it can be done in a single check like: if (digit < 10). This can be done by assigning the result of *ptr - '0' to an unsigned char. Anything less than '0' will wrap around and anything above '9' will remain so. I've also removed a few inefficiencies with the isspace checking. We didn't need to do "while (*ptr && isspace(*ptr)). It's fine just to do while (isspace(*ptr)) since '\0' isn't a space char. I also got rid of the isxdigit call. The hexlookup array contains -1 for non-hex chars. We may as well check the digit value is >= 0. Here are the results I get using your test as quoted above: master @ 62e9af4c63f + fix_COPY_DEFAULT.patch latency average = 568.102 ms master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch latency average = 531.238 ms master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch latency average = 559.333 ms (surprisingly also faster on my machine) The optimized version of the pg_strtoint functions wins over the imul patch. Could you test to see if this is also the case in your Xeon machine? > (when I say strtoint from, I did not replace the goto labels, so that part is > unchanged and unrelated) I didn't quite follow this. I've not really studied the fix_COPY_DEFAULT.patch patch. Is there a reason to delay committing that? It would be good to eliminate that as a variable for the current performance regression. David diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c index 471fbb7ee6..a6e9d00727 100644 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -104,10 +104,10 @@ static const int8 hexlookup[128] = { * pg_strtoint16() will throw ereport() upon bad input format or overflow; * while pg_strtoint16_safe() instead returns such complaints in *escontext, * if it's an ErrorSaveContext. -* - * NB: Accumulate input as an unsigned number, to deal with two's complement + * + * NB: Accumulate input as a negative number, to deal with two's complement * representation of the most negative number, which can't be represented as a - * signed positive number. + * positive number. */ int16 pg_strtoint16(const char *s) @@ -120,7 +120,7 @@ pg_strtoint16_safe(const char *s, Node *escontext) { const char *ptr = s; const char *firstdigit; - uint16 tmp = 0; + int16 tmp = 0; boolneg = false; /* skip leading spaces */ @@ -145,10 +145,11 @@ pg_strtoint16_safe(const char *s, Node *escontext) { if (isxdigit((unsigned char) *ptr)) { - if (unlikely(tmp > -(PG_INT16_MIN / 16))) - goto out_of_range; + int8 digit = hexlookup[(unsigned char) *ptr++]; - tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++]; + if (unlikely(pg_mul_s16_overflow(tmp, 16, &tmp)) || + unlikely(pg_sub_s16_overflow(tmp, digit, &tmp))) + goto out_of_range; } else if (*ptr == '_') { @@ -169,10 +170,11 @@ pg_strtoint16_safe(const char *s, Node *escontext) { if (*ptr >= '0' && *ptr <= '7') { - if (unlikely(tmp > -(PG_INT16_MIN / 8))) - goto out_of_range; + int8 digit = (*ptr++ - '0'); - tmp = tmp * 8 + (*ptr++ -
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, Hm, in some cases your patch is better, but in others both the old code (8692f6644e7) and HEAD beat yours on my machine. TBH, not entirely sure why. prep: COPY (SELECT generate_series(1, 200) a, (random() * 10 - 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy'; DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int); benchmark: psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints FROM '/tmp/lotsaints.copy';") -t 15 I disabled turbo mode, pinned the server to a single core of my Xeon Gold 5215: HEAD: 812.690 your patch: 821.354 strtoint from 8692f6644e7: 824.543 strtoint from 6b423ec677d^: 806.678 (when I say strtoint from, I did not replace the goto labels, so that part is unchanged and unrelated) IOW, for me the code from 15 is the fastest by a good bit... There's an imul, sure, but the fact that it sets a flag makes it faster than having to add more tests and branches. Looking at a profile reminded me of how silly it is that we are wasting a good chunk of the time in these isdigit() checks, even though we already rely on on the ascii values via (via *ptr++ - '0'). I think that's done in the headers for some platforms, but not others (glibc). And we've even done already for octal and binary! Open coding isdigit() gives us: HEAD: 797.434 your patch: 803.570 strtoint from 8692f6644e7: 778.943 strtoint from 6b423ec677d^: 777.741 It's somewhat odd that HEAD and your patch switch position here... > - else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O')) > + /* process hex digits */ > + else if (ptr[1] == 'x' || ptr[1] == 'X') > { > > firstdigit = ptr += 2; I find this unnecessarily hard to read. I realize it's been added in 6fcda9aba83, but I don't see a reason to use such a construct here. I find it somewhat grating how much duplication there now is in this code due to being repeated for all the bases... Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Thu, 20 Jul 2023 at 20:37, Dean Rasheed wrote: > > On Thu, 20 Jul 2023 at 00:56, David Rowley wrote: > I agree with the principal though. In the attached updated patch, I > replaced that test with a simpler one: > > +/* > + * Process the number's digits. We optimize for decimal input (expected > to > + * be the most common case) first. Anything that doesn't start with a > base > + * prefix indicator must be decimal. > + */ > + > + /* process decimal digits */ > + if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1]))) > > So hopefully any compiler should only need to do the one comparison > against '0' for any non-zero decimal input. > > Doing that didn't give any measurable performance improvement for me, > but it did at least not make it noticeably worse, and seems more > likely to generate better code with not-so-smart compilers. I'd be > interested to know how that performs for you (and if your compiler > really does generate 3 "first digit is '0'" tests for the unpatched > code). That seems better. I benchmarked it on two machines: gcc12.2/linux/amd3990x create table a (a int); insert into a select x from generate_series(1,1000)x; copy a to '/tmp/a.dump'; master @ ab29a7a9c postgres=# truncate a; copy a from '/tmp/a.dump'; Time: 2226.912 ms (00:02.227) Time: 2214.168 ms (00:02.214) Time: 2206.481 ms (00:02.206) Time: 2219.640 ms (00:02.220) Time: 2205.004 ms (00:02.205) master + pg_strtoint32_base_10_first.v2.patch postgres=# truncate a; copy a from '/tmp/a.dump'; Time: 2067.108 ms (00:02.067) Time: 2070.401 ms (00:02.070) Time: 2073.423 ms (00:02.073) Time: 2065.407 ms (00:02.065) Time: 2066.536 ms (00:02.067) (~7% faster) apple m2 pro/clang master @ 9089287a postgres=# truncate a; copy a from '/tmp/a.dump'; Time: 1286.369 ms (00:01.286) Time: 1300.534 ms (00:01.301) Time: 1295.661 ms (00:01.296) Time: 1296.404 ms (00:01.296) Time: 1268.361 ms (00:01.268) Time: 1259.321 ms (00:01.259) master + pg_strtoint32_base_10_first.v2.patch postgres=# truncate a; copy a from '/tmp/a.dump'; Time: 1253.519 ms (00:01.254) Time: 1235.256 ms (00:01.235) Time: 1269.501 ms (00:01.270) Time: 1267.801 ms (00:01.268) Time: 1275.758 ms (00:01.276) Time: 1261.478 ms (00:01.261) (a bit noisy but avg of ~1.8% faster) David
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Thu, 20 Jul 2023 at 00:56, David Rowley wrote: > > I noticed that 6fcda9aba added quite a lot of conditions that need to > be checked before we process a normal decimal integer string. I think > we could likely do better and code it to assume that most strings will > be decimal and put that case first rather than last. That sounds sensible, but ... > In the attached, I've changed that for the 32-bit version only. A > more complete patch would need to do the 16 and 64-bit versions too. > > Without that, we need to check if the first digit is '0' a total of 3 > times and also check if the 2nd digit is any of 'x', 'X', 'o', 'O', > 'b' or 'B'. That's not what I see. For me, the compiler (gcc 11, using either -O2 or -O3) is smart enough to spot that the first part of each of the 3 checks is the same, and it only tests whether the first digit is '0' once, before immediately jumping to the decimal parsing code. I didn't test other compilers though, so I can believe that different compilers might not be so smart. OTOH, this test in your patch: +/* process decimal digits */ +if (isdigit((unsigned char) ptr[0]) && +(isdigit((unsigned char) ptr[1]) || ptr[1] == '_' || ptr[1] == '\0' || isspace(ptr[1]))) is doing more work than it needs to, and actually makes things noticeably worse for me. It needs to do a minimum of 2 isdigit() checks before it will parse the input as a decimal, whereas before (for me at least) it just did one simple comparison of ptr[0] against '0'. I agree with the principal though. In the attached updated patch, I replaced that test with a simpler one: +/* + * Process the number's digits. We optimize for decimal input (expected to + * be the most common case) first. Anything that doesn't start with a base + * prefix indicator must be decimal. + */ + + /* process decimal digits */ + if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1]))) So hopefully any compiler should only need to do the one comparison against '0' for any non-zero decimal input. Doing that didn't give any measurable performance improvement for me, but it did at least not make it noticeably worse, and seems more likely to generate better code with not-so-smart compilers. I'd be interested to know how that performs for you (and if your compiler really does generate 3 "first digit is '0'" tests for the unpatched code). Regards, Dean --- Here were my test results (where P1 is the "fix_COPY_DEFAULT.patch"), and I tested COPY loading 50M rows: HEAD + P1 = 7137.966 ms 7193.190 ms 7094.491 ms 7123.520 ms HEAD + P1 + pg_strtoint32_base_10_first.patch = 7561.658 ms 7548.282 ms 7551.360 ms 7560.671 ms HEAD + P1 + pg_strtoint32_base_10_first.v2.patch 7238.775 ms 7184.937 ms 7123.257 ms 7159.618 ms diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c new file mode 100644 index 471fbb7..e2320e6 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -314,113 +314,124 @@ pg_strtoint32_safe(const char *s, Node * else if (*ptr == '+') ptr++; - /* process digits */ - if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X')) + /* + * Process the number's digits. We optimize for decimal input (expected to + * be the most common case) first. Anything that doesn't start with a base + * prefix indicator must be decimal. + */ + + /* process decimal digits */ + if (likely(ptr[0] != '0' || !isalpha((unsigned char) ptr[1]))) { - firstdigit = ptr += 2; + firstdigit = ptr; while (*ptr) { - if (isxdigit((unsigned char) *ptr)) + if (isdigit((unsigned char) *ptr)) { -if (unlikely(tmp > -(PG_INT32_MIN / 16))) +if (unlikely(tmp > -(PG_INT32_MIN / 10))) goto out_of_range; -tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++]; +tmp = tmp * 10 + (*ptr++ - '0'); } else if (*ptr == '_') { -/* underscore must be followed by more digits */ +/* underscore may not be first */ +if (unlikely(ptr == firstdigit)) + goto invalid_syntax; +/* and it must be followed by more digits */ ptr++; -if (*ptr == '\0' || !isxdigit((unsigned char) *ptr)) +if (*ptr == '\0' || !isdigit((unsigned char) *ptr)) goto invalid_syntax; } else break; } } - else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O')) + /* process hex digits */ + else if (ptr[1] == 'x' || ptr[1] == 'X') { firstdigit = ptr += 2; while (*ptr) { - if (*ptr >= '0' && *ptr <= '7') + if (isxdigit((unsigned char) *ptr)) { -if (unlikely(tmp > -(PG_INT32_MIN / 8))) +if (unlikely(tmp > -(PG_INT32_MIN / 16))) goto out_of_range; -tmp = tmp * 8 + (*ptr++ - '0'); +tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++]; } else if (*ptr == '_') { /* underscore must be followed by more digits */ ptr++; -if (*ptr == '\0' || *ptr < '
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Wed, 19 Jul 2023 at 23:14, Dean Rasheed wrote: > Hmm, I'm somewhat sceptical about this second patch. It's not obvious > why adding such tests would speed it up, and indeed, testing on my > machine with 50M rows, I see a noticeable speed-up from patch 1, and a > slow-down from patch 2: I noticed that 6fcda9aba added quite a lot of conditions that need to be checked before we process a normal decimal integer string. I think we could likely do better and code it to assume that most strings will be decimal and put that case first rather than last. In the attached, I've changed that for the 32-bit version only. A more complete patch would need to do the 16 and 64-bit versions too. -- setup create table a (a int); insert into a select x from generate_series(1,1000)x; copy a to '~/a.dump'; -- test truncate a; copy a from '/tmp/a.dump'; master @ ab29a7a9c Time: 2152.633 ms (00:02.153) Time: 2121.091 ms (00:02.121) Time: 2100.995 ms (00:02.101) Time: 2101.724 ms (00:02.102) Time: 2103.949 ms (00:02.104) master + pg_strtoint32_base_10_first.patch Time: 2061.464 ms (00:02.061) Time: 2035.513 ms (00:02.036) Time: 2028.356 ms (00:02.028) Time: 2043.218 ms (00:02.043) Time: 2037.035 ms (00:02.037) (~3.6% faster) Without that, we need to check if the first digit is '0' a total of 3 times and also check if the 2nd digit is any of 'x', 'X', 'o', 'O', 'b' or 'B'. It seems to be coded with the assumption that hex strings are the most likely. I think decimals are the most likely by far. David diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c index 471fbb7ee6..60c90f7252 100644 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -314,113 +314,119 @@ pg_strtoint32_safe(const char *s, Node *escontext) else if (*ptr == '+') ptr++; - /* process digits */ - if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X')) + /* process decimal digits */ + if (isdigit((unsigned char) ptr[0]) && + (isdigit((unsigned char) ptr[1]) || ptr[1] == '_' || ptr[1] == '\0' || isspace(ptr[1]))) { - firstdigit = ptr += 2; + firstdigit = ptr; while (*ptr) { - if (isxdigit((unsigned char) *ptr)) + if (isdigit((unsigned char) *ptr)) { - if (unlikely(tmp > -(PG_INT32_MIN / 16))) + if (unlikely(tmp > -(PG_INT32_MIN / 10))) goto out_of_range; - tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++]; + tmp = tmp * 10 + (*ptr++ - '0'); } else if (*ptr == '_') { - /* underscore must be followed by more digits */ + /* underscore may not be first */ + if (unlikely(ptr == firstdigit)) + goto invalid_syntax; + /* and it must be followed by more digits */ ptr++; - if (*ptr == '\0' || !isxdigit((unsigned char) *ptr)) + if (*ptr == '\0' || !isdigit((unsigned char) *ptr)) goto invalid_syntax; } else break; } } - else if (ptr[0] == '0' && (ptr[1] == 'o' || ptr[1] == 'O')) + /* process hex digits */ + else if (ptr[0] == '0' && (ptr[1] == 'x' || ptr[1] == 'X')) { firstdigit = ptr += 2; while (*ptr) { - if (*ptr >= '0' && *ptr <= '7') + if (isxdigit((unsigned char) *ptr)) { - if (unlikely(tmp > -(PG_INT32_MIN / 8))) + if (unlikely(tmp > -(PG_INT32_MIN / 16))) goto out_of_range; - tmp = tmp * 8 + (*ptr++ - '0'); + tmp = tmp * 16 + hexlookup[(unsigned char) *ptr++]; } else if (*ptr == '_') { /* underscore must be followed by more digits */ ptr++; - if (*ptr == '\0' || *ptr < '0' || *ptr > '7') + if (*ptr == '\0' || !isxdigit((unsigned char) *ptr)) goto invalid_syntax; } else break; } } - else if (ptr[0] == '0' && (ptr[1] == 'b' || ptr[1] == 'B'))
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Wed, 19 Jul 2023 at 09:24, Masahiko Sawada wrote: > > > > 2) pg_strtoint32_safe() got substantially slower, mainly due > > >to > > > faff8f8e47f Allow underscores in integer and numeric constants. > > > 6fcda9aba83 Non-decimal integer literals > > > > Agreed. > > > I have made some progress on dealing with performance regression on > single client COPY. I've attached a patch to fix 2). With the patch I > shared[1] to deal with 1), single client COPY performance seems to be > now as good as (or slightly better than) PG15 . Here are the results > (averages of 5 times) of loading 50M rows via COPY: > Hmm, I'm somewhat sceptical about this second patch. It's not obvious why adding such tests would speed it up, and indeed, testing on my machine with 50M rows, I see a noticeable speed-up from patch 1, and a slow-down from patch 2: PG15 7390.461 ms 7497.655 ms 7485.850 ms 7406.336 ms HEAD 8388.707 ms 8283.484 ms 8391.638 ms 8363.306 ms HEAD + P1 = 7255.128 ms 7185.319 ms 7197.822 ms 7191.176 ms HEAD + P2 = 8687.164 ms 8654.907 ms 8641.493 ms 8668.865 ms HEAD + P1 + P2 == 7780.126 ms 7786.427 ms 7775.047 ms 7785.938 ms So for me at least, just applying patch 1 gives the best results, and makes it slightly faster than PG15 (possibly due to 6b423ec677). Regards, Dean
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Wed, Jul 12, 2023 at 5:40 PM Masahiko Sawada wrote: > > On Wed, Jul 12, 2023 at 3:52 AM Andres Freund wrote: > > > > Hi, > > > > On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote: > > > While testing PG16, I observed that in PG16 there is a big performance > > > degradation in concurrent COPY into a single relation with 2 - 16 > > > clients in my environment. I've attached a test script that measures > > > the execution time of COPYing 5GB data in total to the single relation > > > while changing the number of concurrent insertions, in PG16 and PG15. > > > Here are the results on my environment (EC2 instance, RHEL 8.6, 128 > > > vCPUs, 512GB RAM): > > > > > > * PG15 (4b15868b69) > > > PG15: nclients = 1, execution time = 14.181 > > > > > > * PG16 (c24e9ef330) > > > PG16: nclients = 1, execution time = 17.112 > > > > > The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to > > > extend tables more efficiently". With commit 1cbbee0338 (the previous > > > commit of 00d1e02be2), I got a better numbers, it didn't have a better > > > scalability, though: > > > > > > PG16: nclients = 1, execution time = 17.444 > > > > I think the single client case is indicative of an independent regression, > > or > > rather regressions - it can't have anything to do with the fallocate() issue > > and reproduces before that too in your numbers. > > Right. > > > > > 1) COPY got slower, due to: > > 9f8377f7a27 Add a DEFAULT option to COPY FROM > > > > This added a new palloc()/free() to every call to NextCopyFrom(). It's not > > at > > all clear to me why this needs to happen in NextCopyFrom(), particularly > > because it's already stored in CopyFromState? > > Yeah, it seems to me that we can palloc the bool array once and use it > for the entire COPY FROM. With the attached small patch, the > performance becomes much better: > > 15: > 14.70500 sec > > 16: > 17.42900 sec > > 16 w/ patch: > 14.85600 sec > > > > > > > 2) pg_strtoint32_safe() got substantially slower, mainly due > >to > > faff8f8e47f Allow underscores in integer and numeric constants. > > 6fcda9aba83 Non-decimal integer literals > > Agreed. > > > > > pinned to one cpu, turbo mode disabled, I get the following best-of-three > > times for > > copy test from '/tmp/tmp_4.data' > > (too impatient to use the larger file every time) > > > > 15: > > 6281.107 ms > > > > HEAD: > > 7000.469 ms > > > > backing out 9f8377f7a27: > > 6433.516 ms > > > > also backing out faff8f8e47f, 6fcda9aba83: > > 6235.453 ms > > > > > > I suspect 1) can relatively easily be fixed properly. But 2) seems much > > harder. The changes increased the number of branches substantially, that's > > gonna cost in something as (previously) tight as pg_strtoint32(). > > I'll look at how to fix 2). I have made some progress on dealing with performance regression on single client COPY. I've attached a patch to fix 2). With the patch I shared[1] to deal with 1), single client COPY performance seems to be now as good as (or slightly better than) PG15 . Here are the results (averages of 5 times) of loading 50M rows via COPY: 15: 7.609 sec 16: 8.637 sec 16 w/ two patches: 7.179 sec Regards, [1] https://www.postgresql.org/message-id/CAD21AoBb9Sbddh%2BnQk1BxUFaRHaa%2BfL8fCzO-Lvxqb6ZcpAHqw%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com fix_numeric.patch Description: Binary data
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Wed, Jul 12, 2023 at 3:52 AM Andres Freund wrote: > > Hi, > > On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote: > > While testing PG16, I observed that in PG16 there is a big performance > > degradation in concurrent COPY into a single relation with 2 - 16 > > clients in my environment. I've attached a test script that measures > > the execution time of COPYing 5GB data in total to the single relation > > while changing the number of concurrent insertions, in PG16 and PG15. > > Here are the results on my environment (EC2 instance, RHEL 8.6, 128 > > vCPUs, 512GB RAM): > > > > * PG15 (4b15868b69) > > PG15: nclients = 1, execution time = 14.181 > > > > * PG16 (c24e9ef330) > > PG16: nclients = 1, execution time = 17.112 > > > The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to > > extend tables more efficiently". With commit 1cbbee0338 (the previous > > commit of 00d1e02be2), I got a better numbers, it didn't have a better > > scalability, though: > > > > PG16: nclients = 1, execution time = 17.444 > > I think the single client case is indicative of an independent regression, or > rather regressions - it can't have anything to do with the fallocate() issue > and reproduces before that too in your numbers. Right. > > 1) COPY got slower, due to: > 9f8377f7a27 Add a DEFAULT option to COPY FROM > > This added a new palloc()/free() to every call to NextCopyFrom(). It's not at > all clear to me why this needs to happen in NextCopyFrom(), particularly > because it's already stored in CopyFromState? Yeah, it seems to me that we can palloc the bool array once and use it for the entire COPY FROM. With the attached small patch, the performance becomes much better: 15: 14.70500 sec 16: 17.42900 sec 16 w/ patch: 14.85600 sec > > > 2) pg_strtoint32_safe() got substantially slower, mainly due >to > faff8f8e47f Allow underscores in integer and numeric constants. > 6fcda9aba83 Non-decimal integer literals Agreed. > > pinned to one cpu, turbo mode disabled, I get the following best-of-three > times for > copy test from '/tmp/tmp_4.data' > (too impatient to use the larger file every time) > > 15: > 6281.107 ms > > HEAD: > 7000.469 ms > > backing out 9f8377f7a27: > 6433.516 ms > > also backing out faff8f8e47f, 6fcda9aba83: > 6235.453 ms > > > I suspect 1) can relatively easily be fixed properly. But 2) seems much > harder. The changes increased the number of branches substantially, that's > gonna cost in something as (previously) tight as pg_strtoint32(). I'll look at how to fix 2). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com fix_COPY_DEFAULT.patch Description: Binary data
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote: > While testing PG16, I observed that in PG16 there is a big performance > degradation in concurrent COPY into a single relation with 2 - 16 > clients in my environment. I've attached a test script that measures > the execution time of COPYing 5GB data in total to the single relation > while changing the number of concurrent insertions, in PG16 and PG15. > Here are the results on my environment (EC2 instance, RHEL 8.6, 128 > vCPUs, 512GB RAM): > > * PG15 (4b15868b69) > PG15: nclients = 1, execution time = 14.181 > > * PG16 (c24e9ef330) > PG16: nclients = 1, execution time = 17.112 > The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to > extend tables more efficiently". With commit 1cbbee0338 (the previous > commit of 00d1e02be2), I got a better numbers, it didn't have a better > scalability, though: > > PG16: nclients = 1, execution time = 17.444 I think the single client case is indicative of an independent regression, or rather regressions - it can't have anything to do with the fallocate() issue and reproduces before that too in your numbers. 1) COPY got slower, due to: 9f8377f7a27 Add a DEFAULT option to COPY FROM This added a new palloc()/free() to every call to NextCopyFrom(). It's not at all clear to me why this needs to happen in NextCopyFrom(), particularly because it's already stored in CopyFromState? 2) pg_strtoint32_safe() got substantially slower, mainly due to faff8f8e47f Allow underscores in integer and numeric constants. 6fcda9aba83 Non-decimal integer literals pinned to one cpu, turbo mode disabled, I get the following best-of-three times for copy test from '/tmp/tmp_4.data' (too impatient to use the larger file every time) 15: 6281.107 ms HEAD: 7000.469 ms backing out 9f8377f7a27: 6433.516 ms also backing out faff8f8e47f, 6fcda9aba83: 6235.453 ms I suspect 1) can relatively easily be fixed properly. But 2) seems much harder. The changes increased the number of branches substantially, that's gonna cost in something as (previously) tight as pg_strtoint32(). For higher concurrency numbers, I now was able to reproduce the regression, to a smaller degree. Much smaller after fixing the above. The reason we run into the issue here is basically that the rows in the test are very narrow and reach #define MAX_BUFFERED_TUPLES 1000 at a small number of pages, so we go back and forth between extending with fallocate() and not. I'm *not* saying that that is the solution, but after changing that to 5000, the numbers look a lot better (with the other regressions "worked around"): (this is again with turboboost disabled, to get more reproducible numbers) clients 1 2 4 8 16 32 15,buffered=100025725 13211 9232563948624700 15,buffered=500026107 14550 8644605049434766 HEAD+fixes,buffered=100025875 14505 8200490035653433 HEAD+fixes,buffered=500025830 12975 6527359427392642 Greetings, Andres Freund [1] https://postgr.es/m/CAD21AoAEwHTLYhuQ6PaBRPXKWN-CgW9iw%2B4hm%3D2EOFXbJQ3tOg%40mail.gmail.com
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-11 09:09:43 +0200, Jakub Wartak wrote: > On Mon, Jul 10, 2023 at 6:24 PM Andres Freund wrote: > > > > Hi, > > > > On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote: > > > Out of curiosity I've tried and it is reproducible as you have stated : > > > XFS > > > @ 4.18.0-425.10.1.el8_7.x86_64: > > >... > > > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output , > > > the XFS issues sync writes while ext4 does not, xfs looks like constant > > > loop of sync writes (D) by kworker/2:1H-kblockd: > > > > That clearly won't go well. It's not reproducible on newer systems, > > unfortunately :(. Or well, fortunately maybe. > > > > > > I wonder if a trick to avoid this could be to memorialize the fact that we > > bulk extended before and extend by that much going forward? That'd avoid the > > swapping back and forth. > > I haven't seen this thread [1] "Question on slow fallocate", from XFS > mailing list being mentioned here (it was started by Masahiko), but I > do feel it contains very important hints even challenging the whole > idea of zeroing out files (or posix_fallocate()). Please especially > see Dave's reply. I think that's just due to the reproducer being a bit too minimal and the actual problem being addressed not being explained. > He also argues that posix_fallocate() != fallocate(). What's interesting is > that it's by design and newer kernel versions should not prevent such > behaviour, see my testing result below. I think the problem there was that I was not targetting a different file between the different runs, somehow assuming the test program would be taking care of that. I don't think the test program actually tests things in a particularly useful way - it does fallocate()s in 8k chunks - which postgres never does. > All I can add is that this those kernel versions (4.18.0) seem to very > popular across customers (RHEL, Rocky) right now and that I've tested > on most recent available one (4.18.0-477.15.1.el8_8.x86_64) using > Masahiko test.c and still got 6-7x slower time when using XFS on that > kernel. After installing kernel-ml (6.4.2) the test.c result seems to > be the same (note it it occurs only when 1st allocating space, but of > course it doesnt if the same file is rewritten/"reallocated"): test.c really doesn't reproduce postgres behaviour in any meaningful way, using it as a benchmark is a bad idea. Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Mon, Jul 10, 2023 at 6:24 PM Andres Freund wrote: > > Hi, > > On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote: > > Out of curiosity I've tried and it is reproducible as you have stated : XFS > > @ 4.18.0-425.10.1.el8_7.x86_64: > >... > > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output , > > the XFS issues sync writes while ext4 does not, xfs looks like constant > > loop of sync writes (D) by kworker/2:1H-kblockd: > > That clearly won't go well. It's not reproducible on newer systems, > unfortunately :(. Or well, fortunately maybe. > > > I wonder if a trick to avoid this could be to memorialize the fact that we > bulk extended before and extend by that much going forward? That'd avoid the > swapping back and forth. I haven't seen this thread [1] "Question on slow fallocate", from XFS mailing list being mentioned here (it was started by Masahiko), but I do feel it contains very important hints even challenging the whole idea of zeroing out files (or posix_fallocate()). Please especially see Dave's reply. He also argues that posix_fallocate() != fallocate(). What's interesting is that it's by design and newer kernel versions should not prevent such behaviour, see my testing result below. All I can add is that this those kernel versions (4.18.0) seem to very popular across customers (RHEL, Rocky) right now and that I've tested on most recent available one (4.18.0-477.15.1.el8_8.x86_64) using Masahiko test.c and still got 6-7x slower time when using XFS on that kernel. After installing kernel-ml (6.4.2) the test.c result seems to be the same (note it it occurs only when 1st allocating space, but of course it doesnt if the same file is rewritten/"reallocated"): [root@rockyora ~]# uname -r 6.4.2-1.el8.elrepo.x86_64 [root@rockyora ~]# time ./test test.0 0 total 20 fallocate 0 filewrite 20 real0m0.405s user0m0.006s sys 0m0.391s [root@rockyora ~]# time ./test test.0 1 total 20 fallocate 20 filewrite 0 real0m0.137s user0m0.005s sys 0m0.132s [root@rockyora ~]# time ./test test.1 1 total 20 fallocate 20 filewrite 0 real0m0.968s user0m0.020s sys 0m0.928s [root@rockyora ~]# time ./test test.2 2 total 20 fallocate 10 filewrite 10 real0m6.059s user0m0.000s sys 0m0.788s [root@rockyora ~]# time ./test test.2 2 total 20 fallocate 10 filewrite 10 real0m0.598s user0m0.003s sys 0m0.225s [root@rockyora ~]# iostat -x reports during first "time ./test test.2 2" (as you can see w_awiat is not that high but it accumulates): Devicer/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sda 0.00 15394.00 0.00122.02 0.0013.00 0.00 0.080.000.05 0.75 0.00 8.12 0.06 100.00 dm-0 0.00 15407.00 0.00122.02 0.00 0.00 0.00 0.000.000.06 0.98 0.00 8.11 0.06 100.00 So maybe that's just a hint that you should try on slower storage instead? (I think that on NVMe this issue would be hardly noticeable due to low IO latency, not like here) -J. [1] - https://www.spinics.net/lists/linux-xfs/msg73035.html
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Tue, Jul 11, 2023 at 1:24 AM Andres Freund wrote: > > Hi, > > On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote: > > Out of curiosity I've tried and it is reproducible as you have stated : XFS > > @ 4.18.0-425.10.1.el8_7.x86_64: > >... > > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output , > > the XFS issues sync writes while ext4 does not, xfs looks like constant > > loop of sync writes (D) by kworker/2:1H-kblockd: > > That clearly won't go well. It's not reproducible on newer systems, > unfortunately :(. Or well, fortunately maybe. > > > I wonder if a trick to avoid this could be to memorialize the fact that we > bulk extended before and extend by that much going forward? That'd avoid the > swapping back and forth. > > > One thing that confuses me is that Sawada-san observed a regression at a > single client - yet from what I can tell, there's actually not a single > fallocate() being generated in that case, because the table is so narrow that > we never end up extending by a sufficient number of blocks in > RelationAddBlocks() to reach that path. Yet there's a regression at 1 client. > > I don't yet have a RHEL8 system at hand, could either of you send the result > of a > strace -c -p $pid_of_backend_doing_copy > Here are the results: * PG16: nclients = 1, execution time = 23.758 % time seconds usecs/call callserrors syscall -- --- --- - - 53.180.308675 0358898 pwrite64 33.920.196900 2 81202 pwritev 7.780.045148 0 81378 lseek 5.060.029371 2 11141 read 0.040.000250 291 openat 0.020.94 189 close 0.000.00 0 1 munmap 0.000.00 084 brk 0.000.00 0 1 sendto 0.000.00 0 2 1 recvfrom 0.000.00 0 2 kill 0.000.00 0 1 futex 0.000.00 0 1 epoll_wait -- --- --- - - 100.000.580438 1532891 1 total * PG16: nclients = 2, execution time = 18.045 % time seconds usecs/call callserrors syscall -- --- --- - - 54.190.218721 1187803 pwrite64 29.240.118002 2 40242 pwritev 6.230.025128 0 41239 lseek 5.030.020285 2 9133 read 2.040.008229 9 855 fallocate 1.280.005151 0 5598 1000 futex 1.120.004516 1 3965 kill 0.780.003141 1 3058 1 epoll_wait 0.060.000224 2 100 openat 0.030.000136 198 close 0.010.50 084 brk 0.000.13 022 setitimer 0.000.06 015 1 rt_sigreturn 0.000.02 2 1 munmap 0.000.02 2 1 sendto 0.000.02 0 3 2 recvfrom -- --- --- - - 100.000.403608 1292217 1004 total * PG16: nclients = 4, execution time = 18.807 % time seconds usecs/call callserrors syscall -- --- --- - - 64.610.240171 2 93868 pwrite64 15.110.056173 4 11337 pwritev 7.290.027100 7 3465 fallocate 4.050.015048 2 5179 read 3.550.013188 0 14941 lseek 2.650.009858 1 8675 2527 futex 1.760.006536 1 4190 kill 0.880.003268 1 2199 epoll_wait 0.060.000213 2 101 openat 0.030.000130 199 close 0.010.31 118 rt_sigreturn 0.010.27 117 setitimer 0.000.00 0 1 munmap 0.000.00 084 brk 0.000.00 0 1 sendto 0.000.00 0 1 recvfrom -- --- --- - - 100.000.371743 2144176 2527 total * PG16: nclients = 8, execution time = 11.982 % time seconds usecs/call callserrors syscall -- --- --- - - 73.16
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Tue, Jul 11, 2023 at 12:34 AM Andres Freund wrote: > > Hi, > > On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote: > > While testing PG16, I observed that in PG16 there is a big performance > > degradation in concurrent COPY into a single relation with 2 - 16 > > clients in my environment. I've attached a test script that measures > > the execution time of COPYing 5GB data in total to the single relation > > while changing the number of concurrent insertions, in PG16 and PG15. > > Here are the results on my environment (EC2 instance, RHEL 8.6, 128 > > vCPUs, 512GB RAM): > > Gah, RHEL with its frankenkernels, the bane of my existance. > > FWIW, I had extensively tested this with XFS, just with a newer kernel. Have > you tested this on RHEL9 as well by any chance? I've tested this on RHEL9 (m5.24xlarge; 96vCPUs, 384GB RAM), and it seems to be reproducible on RHEL9 too. $ uname -rms Linux 5.14.0-284.11.1.el9_2.x86_64 x86_64 $ cat /etc/redhat-release Red Hat Enterprise Linux release 9.2 (Plow) PG15: nclients = 1, execution time = 22.193 PG15: nclients = 2, execution time = 12.430 PG15: nclients = 4, execution time = 8.677 PG15: nclients = 8, execution time = 6.144 PG15: nclients = 16, execution time = 5.938 PG15: nclients = 32, execution time = 5.775 PG15: nclients = 64, execution time = 5.928 PG15: nclients = 128, execution time = 6.346 PG15: nclients = 256, execution time = 6.641 PG16: nclients = 1, execution time = 24.601 PG16: nclients = 2, execution time = 27.845 PG16: nclients = 4, execution time = 40.575 PG16: nclients = 8, execution time = 24.379 PG16: nclients = 16, execution time = 15.835 PG16: nclients = 32, execution time = 8.370 PG16: nclients = 64, execution time = 4.042 PG16: nclients = 128, execution time = 2.956 PG16: nclients = 256, execution time = 2.591 Tests with test.c program: $ rm -f test.data; time ./test test.data 0 total 20 fallocate 0 filewrite 20 real0m0.709s user0m0.057s sys 0m0.649s $ rm -f test.data; time ./test test.data 1 total 20 fallocate 20 filewrite 0 real0m0.853s user0m0.058s sys 0m0.791s $ rm -f test.data; time ./test test.data 2 total 20 fallocate 10 filewrite 10 real2m10.002s user0m0.366s sys 0m6.649s Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-03 11:53:56 +0200, Jakub Wartak wrote: > Out of curiosity I've tried and it is reproducible as you have stated : XFS > @ 4.18.0-425.10.1.el8_7.x86_64: >... > According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output , > the XFS issues sync writes while ext4 does not, xfs looks like constant > loop of sync writes (D) by kworker/2:1H-kblockd: That clearly won't go well. It's not reproducible on newer systems, unfortunately :(. Or well, fortunately maybe. I wonder if a trick to avoid this could be to memorialize the fact that we bulk extended before and extend by that much going forward? That'd avoid the swapping back and forth. One thing that confuses me is that Sawada-san observed a regression at a single client - yet from what I can tell, there's actually not a single fallocate() being generated in that case, because the table is so narrow that we never end up extending by a sufficient number of blocks in RelationAddBlocks() to reach that path. Yet there's a regression at 1 client. I don't yet have a RHEL8 system at hand, could either of you send the result of a strace -c -p $pid_of_backend_doing_copy Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-03 11:59:38 +0900, Masahiko Sawada wrote: > On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada wrote: > > > > After further investigation, the performance degradation comes from > > calling posix_fallocate() (called via FileFallocate()) and pwritev() > > (called via FileZero) alternatively depending on how many blocks we > > extend by. And it happens only on the xfs filesystem. > > FYI, the attached simple C program proves the fact that calling > alternatively posix_fallocate() and pwrite() causes slow performance > on posix_fallocate(): > > $ gcc -o test test.c > $ time ./test test.1 1 > total 20 > fallocate 20 > filewrite 0 > > real0m1.305s > user0m0.050s > sys 0m1.255s > > $ time ./test test.2 2 > total 20 > fallocate 10 > filewrite 10 > > real1m29.222s > user0m0.139s > sys 0m3.139s On an xfs filesystem, with a very recent kernel: time /tmp/msw_test /srv/dev/fio/msw 0 total 20 fallocate 0 filewrite 20 real0m0.456s user0m0.017s sys 0m0.439s time /tmp/msw_test /srv/dev/fio/msw 1 total 20 fallocate 20 filewrite 0 real0m0.141s user0m0.010s sys 0m0.131s time /tmp/msw_test /srv/dev/fio/msw 2 total 20 fallocate 10 filewrite 10 real0m0.297s user0m0.017s sys 0m0.280s So I don't think I can reproduce your problem on that system... I also tried adding a fdatasync() into the loop, but that just made things uniformly slow. I guess I'll try to dig up whether this is a problem in older upstream kernels, or whether it's been introduced in RHEL. Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-03 11:55:13 +0900, Masahiko Sawada wrote: > While testing PG16, I observed that in PG16 there is a big performance > degradation in concurrent COPY into a single relation with 2 - 16 > clients in my environment. I've attached a test script that measures > the execution time of COPYing 5GB data in total to the single relation > while changing the number of concurrent insertions, in PG16 and PG15. > Here are the results on my environment (EC2 instance, RHEL 8.6, 128 > vCPUs, 512GB RAM): Gah, RHEL with its frankenkernels, the bane of my existance. FWIW, I had extensively tested this with XFS, just with a newer kernel. Have you tested this on RHEL9 as well by any chance? Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-07-10 15:25:41 +0200, Alvaro Herrera wrote: > On 2023-Jul-03, Masahiko Sawada wrote: > > > While testing PG16, I observed that in PG16 there is a big performance > > degradation in concurrent COPY into a single relation with 2 - 16 > > clients in my environment. I've attached a test script that measures > > the execution time of COPYing 5GB data in total to the single relation > > while changing the number of concurrent insertions, in PG16 and PG15. > > This item came up in the RMT meeting. Andres, I think this item belongs > to you, because of commit 00d1e02be2. I'll take a look - I wasn't even aware of this thread until now. Greetings, Andres Freund
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hello, On 2023-Jul-03, Masahiko Sawada wrote: > While testing PG16, I observed that in PG16 there is a big performance > degradation in concurrent COPY into a single relation with 2 - 16 > clients in my environment. I've attached a test script that measures > the execution time of COPYing 5GB data in total to the single relation > while changing the number of concurrent insertions, in PG16 and PG15. This item came up in the RMT meeting. Andres, I think this item belongs to you, because of commit 00d1e02be2. The regression seems serious enough at low client counts: > * PG15 (4b15868b69) > PG15: nclients = 1, execution time = 14.181 > PG15: nclients = 2, execution time = 9.319 > PG15: nclients = 4, execution time = 5.872 > PG15: nclients = 8, execution time = 3.773 > PG15: nclients = 16, execution time = 3.202 > * PG16 (c24e9ef330) > PG16: nclients = 1, execution time = 17.112 > PG16: nclients = 2, execution time = 14.084 > PG16: nclients = 4, execution time = 27.997 > PG16: nclients = 8, execution time = 10.554 > PG16: nclients = 16, execution time = 7.074 So the fact that the speed has clearly gone up at larger client counts is not an excuse for not getting it fixed, XFS-specificity notwithstanding. > The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to > extend tables more efficiently". With commit 1cbbee0338 (the previous > commit of 00d1e02be2), I got a better numbers, it didn't have a better > scalability, though: > > PG16: nclients = 1, execution time = 17.444 > PG16: nclients = 2, execution time = 10.690 > PG16: nclients = 4, execution time = 7.010 > PG16: nclients = 8, execution time = 4.282 > PG16: nclients = 16, execution time = 3.373 Well, these numbers are better, but they still look worse than PG15. I suppose there are other commits that share blame. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La virtud es el justo medio entre dos defectos" (Aristóteles)
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi Masahiko, Out of curiosity I've tried and it is reproducible as you have stated : XFS @ 4.18.0-425.10.1.el8_7.x86_64: [root@rockyora ~]# time ./test test.1 1 total 20 fallocate 20 filewrite 0 real0m5.868s user0m0.035s sys 0m5.716s [root@rockyora ~]# time ./test test.2 2 total 20 fallocate 10 filewrite 10 real0m25.858s user0m0.108s sys 0m3.596s [root@rockyora ~]# time ./test test.3 2 total 20 fallocate 10 filewrite 10 real0m25.927s user0m0.091s sys 0m3.621s [root@rockyora ~]# time ./test test.4 1 total 20 fallocate 20 filewrite 0 real0m3.044s user0m0.043s sys 0m2.934s According to iostat and blktrace -d /dev/sda -o - | blkparse -i - output , the XFS issues sync writes while ext4 does not, xfs looks like constant loop of sync writes (D) by kworker/2:1H-kblockd: [..] 8,023417224.115364875 312 D WS 44624928 + 16 [kworker/2:1H] 8,023417324.115482679 0 C WS 44624928 + 16 [0] 8,023417424.115548251 6501 A WS 42525760 + 16 <- (253,0) 34225216 8,023417524.115548660 6501 A WS 44624960 + 16 <- (8,2) 42525760 8,023417624.115549111 6501 Q WS 44624960 + 16 [test] 8,023417724.115551351 6501 G WS 44624960 + 16 [test] 8,023417824.115552111 6501 I WS 44624960 + 16 [test] 8,023417924.115559713 312 D WS 44624960 + 16 [kworker/2:1H] 8,023418024.115677217 0 C WS 44624960 + 16 [0] 8,023418124.115743150 6501 A WS 42525792 + 16 <- (253,0) 34225248 8,023418224.115743502 6501 A WS 44624992 + 16 <- (8,2) 42525792 8,023418324.115743949 6501 Q WS 44624992 + 16 [test] 8,023418424.115746175 6501 G WS 44624992 + 16 [test] 8,023418524.115746918 6501 I WS 44624992 + 16 [test] 8,023418624.115754492 312 D WS 44624992 + 16 [kworker/2:1H] So it looks like you are onto something. Regards, -J.
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Mon, Jul 3, 2023 at 4:36 PM Heikki Linnakangas wrote: > > On 03/07/2023 05:59, Masahiko Sawada wrote: > > On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada > > wrote: > >> > >> After further investigation, the performance degradation comes from > >> calling posix_fallocate() (called via FileFallocate()) and pwritev() > >> (called via FileZero) alternatively depending on how many blocks we > >> extend by. And it happens only on the xfs filesystem. > > > > FYI, the attached simple C program proves the fact that calling > > alternatively posix_fallocate() and pwrite() causes slow performance > > on posix_fallocate(): > > > > $ gcc -o test test.c > > $ time ./test test.1 1 > > total 20 > > fallocate 20 > > filewrite 0 > > > > real0m1.305s > > user0m0.050s > > sys 0m1.255s > > > > $ time ./test test.2 2 > > total 20 > > fallocate 10 > > filewrite 10 > > > > real1m29.222s > > user0m0.139s > > sys 0m3.139s > > This must be highly dependent on the underlying OS and filesystem. Right. The above were the result where I created the file on the xfs filesystem. The kernel version and the xfs filesystem version are: % uname -rms Linux 4.18.0-372.9.1.el8.x86_64 x86_64 % sudo xfs_db -r /dev/nvme4n1p2 xfs_db> version versionnum [0xb4b5+0x18a] = V5,NLINK,DIRV2,ATTR,ALIGN,LOGV2,EXTFLG,MOREBITS,ATTR2,LAZYSBCOUNT,PROJID32BIT,CRC,FTYPE,FINOBT,SPARSE_INODES,REFLINK As far as I tested, it happens only on the xfs filesystem (at least the above version) and doesn't happen on ext4 and ext3 filesystems. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On 03/07/2023 05:59, Masahiko Sawada wrote: On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada wrote: After further investigation, the performance degradation comes from calling posix_fallocate() (called via FileFallocate()) and pwritev() (called via FileZero) alternatively depending on how many blocks we extend by. And it happens only on the xfs filesystem. FYI, the attached simple C program proves the fact that calling alternatively posix_fallocate() and pwrite() causes slow performance on posix_fallocate(): $ gcc -o test test.c $ time ./test test.1 1 total 20 fallocate 20 filewrite 0 real0m1.305s user0m0.050s sys 0m1.255s $ time ./test test.2 2 total 20 fallocate 10 filewrite 10 real1m29.222s user0m0.139s sys 0m3.139s This must be highly dependent on the underlying OS and filesystem. I'm not seeing that effect on my laptop: /data$ time /tmp/test test.0 0 total 20 fallocate 0 filewrite 20 real0m1.856s user0m0.140s sys 0m1.688s /data$ time /tmp/test test.1 1 total 20 fallocate 20 filewrite 0 real0m1.335s user0m0.156s sys 0m1.179s /data$ time /tmp/test test.2 2 total 20 fallocate 10 filewrite 10 real0m2.159s user0m0.165s sys 0m1.880s /data$ uname -a Linux heikkilaptop 6.0.0-6-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.0.12-1 (2022-12-09) x86_64 GNU/Linux /data is an nvme drive with ext4 filesystem. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Performance degradation on concurrent COPY into a single relation in PG16.
On Mon, Jul 3, 2023 at 11:55 AM Masahiko Sawada wrote: > > After further investigation, the performance degradation comes from > calling posix_fallocate() (called via FileFallocate()) and pwritev() > (called via FileZero) alternatively depending on how many blocks we > extend by. And it happens only on the xfs filesystem. FYI, the attached simple C program proves the fact that calling alternatively posix_fallocate() and pwrite() causes slow performance on posix_fallocate(): $ gcc -o test test.c $ time ./test test.1 1 total 20 fallocate 20 filewrite 0 real0m1.305s user0m0.050s sys 0m1.255s $ time ./test test.2 2 total 20 fallocate 10 filewrite 10 real1m29.222s user0m0.139s sys 0m3.139s Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com #include #include #include #include int main(int argc, char **argv) { char *filename = argv[1]; int ratio = atoi(argv[2]); char block[8192] = {0}; int fd; int total_len = 0; int n_fallocate = 0; int n_filewrite = 0; int i; fd = open(filename, O_RDWR | O_CREAT, S_IRWXU); if (fd < 0) { fprintf(stderr, "could not open file %s: %m\n", filename); return 1; } for (i = 0; i < 20; i++) { int ret; if (ratio != 0 && i % ratio == 0) { posix_fallocate(fd, total_len, 8192); n_fallocate++; } else { pwrite(fd, block, 8192, total_len); n_filewrite++; } total_len += 8192; } printf("total\t%d\n", i); printf("fallocate\t%d\n", n_fallocate); printf("filewrite\t%d\n", n_filewrite); close(fd); return 0; }
Performance degradation on concurrent COPY into a single relation in PG16.
Hi all, While testing PG16, I observed that in PG16 there is a big performance degradation in concurrent COPY into a single relation with 2 - 16 clients in my environment. I've attached a test script that measures the execution time of COPYing 5GB data in total to the single relation while changing the number of concurrent insertions, in PG16 and PG15. Here are the results on my environment (EC2 instance, RHEL 8.6, 128 vCPUs, 512GB RAM): * PG15 (4b15868b69) PG15: nclients = 1, execution time = 14.181 PG15: nclients = 2, execution time = 9.319 PG15: nclients = 4, execution time = 5.872 PG15: nclients = 8, execution time = 3.773 PG15: nclients = 16, execution time = 3.202 PG15: nclients = 32, execution time = 3.023 PG15: nclients = 64, execution time = 3.829 PG15: nclients = 128, execution time = 4.111 PG15: nclients = 256, execution time = 4.158 * PG16 (c24e9ef330) PG16: nclients = 1, execution time = 17.112 PG16: nclients = 2, execution time = 14.084 PG16: nclients = 4, execution time = 27.997 PG16: nclients = 8, execution time = 10.554 PG16: nclients = 16, execution time = 7.074 PG16: nclients = 32, execution time = 4.607 PG16: nclients = 64, execution time = 2.093 PG16: nclients = 128, execution time = 2.141 PG16: nclients = 256, execution time = 2.202 PG16 has better scalability (more than 64 clients) but it took much more time than PG15, especially at 1 - 16 clients. The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to extend tables more efficiently". With commit 1cbbee0338 (the previous commit of 00d1e02be2), I got a better numbers, it didn't have a better scalability, though: PG16: nclients = 1, execution time = 17.444 PG16: nclients = 2, execution time = 10.690 PG16: nclients = 4, execution time = 7.010 PG16: nclients = 8, execution time = 4.282 PG16: nclients = 16, execution time = 3.373 PG16: nclients = 32, execution time = 3.205 PG16: nclients = 64, execution time = 3.705 PG16: nclients = 128, execution time = 4.196 PG16: nclients = 256, execution time = 4.201 While investigating the cause, I found an interesting fact that in mdzeroextend if I use only either FileFallocate() or FileZero, we can get better numbers. For example, If I always use FileZero with the following change: @@ -574,7 +574,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum, * that decision should be made though? For now just use a cutoff of * 8, anything between 4 and 8 worked OK in some local testing. */ - if (numblocks > 8) + if (false) { int ret; I got: PG16: nclients = 1, execution time = 16.898 PG16: nclients = 2, execution time = 8.740 PG16: nclients = 4, execution time = 4.656 PG16: nclients = 8, execution time = 2.733 PG16: nclients = 16, execution time = 2.021 PG16: nclients = 32, execution time = 1.693 PG16: nclients = 64, execution time = 1.742 PG16: nclients = 128, execution time = 2.180 PG16: nclients = 256, execution time = 2.296 After further investigation, the performance degradation comes from calling posix_fallocate() (called via FileFallocate()) and pwritev() (called via FileZero) alternatively depending on how many blocks we extend by. And it happens only on the xfs filesystem. Does anyone observe a similar performance issue with the attached benchmark script? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com PG15_PGDATA="/home/masahiko/pgsql/15.s/data" PG16_PGDATA="/home/masahiko/pgsql/16.s/data" PG15_BIN="/home/masahiko/pgsql/15.s/bin" PG16_BIN="/home/masahiko/pgsql/16.s/bin" ROWS=$((100 * 1000 * 1000)) CLIENTS="1 2 4 8 16 32 64 128 256" ${PG15_BIN}/pg_ctl stop -D ${PG15_PGDATA} -mi ${PG16_BIN}/pg_ctl stop -D ${PG16_PGDATA} -mi rm -rf $PG15_PGDATA $PG16_PGDATA echo "initializing clusters ..." ${PG15_BIN}/initdb -D $PG15_PGDATA -E UTF8 --no-locale ${PG16_BIN}/initdb -D $PG16_PGDATA -E UTF8 --no-locale cat <> ${PG15_PGDATA}/postgresql.conf port = 5515 max_wal_size = 50GB shared_buffers = 20GB max_connections = 500 EOF cat <> ${PG16_PGDATA}/postgresql.conf port = 5516 max_wal_size = 50GB shared_buffers = 20GB max_connections = 500 EOF if [ "$1" != "skip_file_init" ]; then echo "prepare load files..." ${PG16_BIN}/pg_ctl start -D ${PG16_PGDATA} for c in $CLIENTS do rm -f /tmp/tmp_${c}.data ${PG16_BIN}/psql -d postgres -p 5516 -X -c "copy (select generate_series(1, $ROWS / $c)) to '/tmp/tmp_${c}.data'" done ${PG16_BIN}/pg_ctl stop -D ${PG16_PGDATA} fi echo "start benchmark ..." #for version in PG15 PG16 for version in PG16 do PSQL="" if [ "$version" == "PG15" ]; then PSQL="${PG15_BIN}/psql -p 5515 -d postgres" ${PG15_BIN}/pg_ctl start -D ${PG15_PGDATA} else PSQL="${PG16_BIN}/psql -p 5516 -d postgres" ${PG16_BIN}/pg_ctl start -D ${PG16_PGDATA} fi ${PSQL} -c "create unlogged table test (c int) with (autovacuum_enabled = off)" for c in $CLIENTS do ${PSQL} -c "truncate test" > /dev/null 2>&1 chileren=() start=`date +%s.%3N`