Re: Polyphase merge is obsolete
On 16/01/2023 12:23, Peter Eisentraut wrote:
> On 21.11.22 10:29, Peter Eisentraut wrote:
>> On 21.11.22 00:57, Heikki Linnakangas wrote:
>>> On 19/11/2022 13:00, Peter Eisentraut wrote:
>>>> AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
>>>> but did not adequately update the function header comment. The comment
>>>> still mentions the "shared" argument, which has been removed. There is
>>>> a new "preallocate" argument that is not mentioned at all. Also, it's
>>>> not easy to match the actual callers to the call variants described in
>>>> the comment. Could someone who remembers this work perhaps look this
>>>> over and update the comment?
>>>
>>> Is the attached more readable?
>>
>> That looks better, thanks.
>>
>>> I'm not 100% sure of the "preallocate" argument. If I understand
>>> correctly, you should pass true if you are writing multiple tapes at
>>> the same time, and false otherwise. HashAgg passes true, tuplesort
>>> passes false. However, it's not clear to me why we couldn't just always
>>> do the preallocation. It seems pretty harmless even if it's not
>>> helpful. Or do it when there are multiple writer tapes, and not
>>> otherwise. The parameter was added in commit 0758964963 so presumably
>>> there was a reason, but at a quick glance at the thread that led to
>>> that commit, I couldn't see what it was.
>>
>> Right, these are the kinds of questions such a comment ought to answer.
>> Let's see if anyone chimes in here, otherwise let's complain in the
>> original thread, since it had nothing to do with this one.
>
> So nothing has happened. Let's get your changes about the removed
> "shared" argument committed, and we can figure out the "preallocated"
> thing separately.

Ok, pushed. I included the brief comment on "preallocated", but it would be nice to improve it further, per the above discussion.

Thanks!

- Heikki
Re: Polyphase merge is obsolete
On 21.11.22 10:29, Peter Eisentraut wrote:
> On 21.11.22 00:57, Heikki Linnakangas wrote:
>> On 19/11/2022 13:00, Peter Eisentraut wrote:
>>> On 18.10.21 14:15, Heikki Linnakangas wrote:
>>>> On 05/10/2021 20:24, John Naylor wrote:
>>>>> I've had a chance to review and test out the v5 patches.
>>>>
>>>> Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong
>>>> mentioned, and pushed.
>>>
>>> AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
>>> but did not adequately update the function header comment. The comment
>>> still mentions the "shared" argument, which has been removed. There is
>>> a new "preallocate" argument that is not mentioned at all. Also, it's
>>> not easy to match the actual callers to the call variants described in
>>> the comment. Could someone who remembers this work perhaps look this
>>> over and update the comment?
>>
>> Is the attached more readable?
>
> That looks better, thanks.
>
>> I'm not 100% sure of the "preallocate" argument. If I understand
>> correctly, you should pass true if you are writing multiple tapes at
>> the same time, and false otherwise. HashAgg passes true, tuplesort
>> passes false. However, it's not clear to me why we couldn't just always
>> do the preallocation. It seems pretty harmless even if it's not
>> helpful. Or do it when there are multiple writer tapes, and not
>> otherwise. The parameter was added in commit 0758964963 so presumably
>> there was a reason, but at a quick glance at the thread that led to
>> that commit, I couldn't see what it was.
>
> Right, these are the kinds of questions such a comment ought to answer.
> Let's see if anyone chimes in here, otherwise let's complain in the
> original thread, since it had nothing to do with this one.

So nothing has happened. Let's get your changes about the removed "shared" argument committed, and we can figure out the "preallocated" thing separately.
Re: Polyphase merge is obsolete
On 21.11.22 00:57, Heikki Linnakangas wrote:
> On 19/11/2022 13:00, Peter Eisentraut wrote:
>> On 18.10.21 14:15, Heikki Linnakangas wrote:
>>> On 05/10/2021 20:24, John Naylor wrote:
>>>> I've had a chance to review and test out the v5 patches.
>>>
>>> Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong
>>> mentioned, and pushed.
>>
>> AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
>> but did not adequately update the function header comment. The comment
>> still mentions the "shared" argument, which has been removed. There is
>> a new "preallocate" argument that is not mentioned at all. Also, it's
>> not easy to match the actual callers to the call variants described in
>> the comment. Could someone who remembers this work perhaps look this
>> over and update the comment?
>
> Is the attached more readable?

That looks better, thanks.

> I'm not 100% sure of the "preallocate" argument. If I understand
> correctly, you should pass true if you are writing multiple tapes at
> the same time, and false otherwise. HashAgg passes true, tuplesort
> passes false. However, it's not clear to me why we couldn't just always
> do the preallocation. It seems pretty harmless even if it's not
> helpful. Or do it when there are multiple writer tapes, and not
> otherwise. The parameter was added in commit 0758964963 so presumably
> there was a reason, but at a quick glance at the thread that led to
> that commit, I couldn't see what it was.

Right, these are the kinds of questions such a comment ought to answer. Let's see if anyone chimes in here, otherwise let's complain in the original thread, since it had nothing to do with this one.
Re: Polyphase merge is obsolete
On 19/11/2022 13:00, Peter Eisentraut wrote:
> On 18.10.21 14:15, Heikki Linnakangas wrote:
>> On 05/10/2021 20:24, John Naylor wrote:
>>> I've had a chance to review and test out the v5 patches.
>>
>> Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong
>> mentioned, and pushed.
>
> AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
> but did not adequately update the function header comment. The comment
> still mentions the "shared" argument, which has been removed. There is
> a new "preallocate" argument that is not mentioned at all. Also, it's
> not easy to match the actual callers to the call variants described in
> the comment. Could someone who remembers this work perhaps look this
> over and update the comment?

Is the attached more readable?

I'm not 100% sure of the "preallocate" argument. If I understand correctly, you should pass true if you are writing multiple tapes at the same time, and false otherwise. HashAgg passes true, tuplesort passes false. However, it's not clear to me why we couldn't just always do the preallocation. It seems pretty harmless even if it's not helpful. Or do it when there are multiple writer tapes, and not otherwise. The parameter was added in commit 0758964963 so presumably there was a reason, but at a quick glance at the thread that led to that commit, I couldn't see what it was.

- Heikki

From 1c5d4e28a21f6052d22667e79b5bd443aec3a5b5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Mon, 21 Nov 2022 01:55:05 +0200
Subject: [PATCH 1/1] Fix and clarify function comment on LogicalTapeSetCreate.

Commit c4649cce39 removed the "shared" and "ntapes" arguments, but the
comment still talked about "shared". It also talked about "a shared file
handle", which was technically correct, because even before commit
c4649cce39 the "shared file handle" referred to the "fileset" argument,
not "shared". But it was very confusing. Improve the comment.

Also add a comment on what the "preallocate" argument does.
Discussion: https://www.postgresql.org/message-id/af989685-91d5-aad4-8f60-1d066b5ec...@enterprisedb.com
---
 src/backend/utils/sort/logtape.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index c384f98e13..b518a4b9c5 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -544,14 +544,20 @@ ltsInitReadBuffer(LogicalTape *lt)
  * The tape set is initially empty. Use LogicalTapeCreate() to create
  * tapes in it.
  *
- * Serial callers pass NULL argument for shared, and -1 for worker. Parallel
- * worker callers pass a shared file handle and their own worker number.
+ * In a single-process sort, pass NULL argument for fileset, and -1 for
+ * worker.
  *
- * Leader callers pass a shared file handle and -1 for worker. After creating
- * the tape set, use LogicalTapeImport() to import the worker tapes into it.
+ * In a parallel sort, parallel workers pass the shared fileset handle and
+ * their own worker number. After the workers have finished, create the
+ * tape set in the leader, passing the shared fileset handle and -1 for
+ * worker, and use LogicalTapeImport() to import the worker tapes into it.
  *
  * Currently, the leader will only import worker tapes into the set, it does
  * not create tapes of its own, although in principle that should work.
+ *
+ * If preallocate is true, blocks for each individual tape are allocated in
+ * batches. This avoids fragmentation when writing multiple tapes at the
+ * same time.
  */
 LogicalTapeSet *
 LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker)
-- 
2.30.2
Re: Polyphase merge is obsolete
On 18.10.21 14:15, Heikki Linnakangas wrote:
> On 05/10/2021 20:24, John Naylor wrote:
>> I've had a chance to review and test out the v5 patches.
>
> Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong
> mentioned, and pushed.

AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15, but did not adequately update the function header comment. The comment still mentions the "shared" argument, which has been removed. There is a new "preallocate" argument that is not mentioned at all. Also, it's not easy to match the actual callers to the call variants described in the comment. Could someone who remembers this work perhaps look this over and update the comment?
Re: Polyphase merge is obsolete
On 05/10/2021 20:24, John Naylor wrote:
> I've had a chance to review and test out the v5 patches.

Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong mentioned, and pushed.

> I've done some performance testing of master versus both patches
> applied. The full results and test script are attached, but I'll give a
> summary here. A variety of value distributions were tested, with
> work_mem from 1MB to 16MB, plus 2GB which will not use external sort at
> all. I settled on 2 million records for the sort, to have something
> large enough to work with but also keep the test time reasonable. That
> works out to about 130MB on disk. We have recent improvements to datum
> sort, so I used both single values and all values in the SELECT list.
> The system was on a Westmere-era Xeon with gcc 4.8. pg_prewarm was run
> on the input tables. The raw measurements were reduced to the minimum
> of five runs.
>
> I can confirm that sort performance is improved with small values of
> work_mem. That was not the motivating reason for the patch, but it's a
> nice bonus. Even as high as 16MB work_mem, it's possible some of the
> 4-6% differences represent real improvement and not just noise or
> binary effects, but it's much more convincing at 4MB and below, with
> 25-30% faster with non-datum integer sorts at 1MB work_mem.
>
> The nominal regressions seem within the noise level, with one exception
> that only showed up in one set of measurements (-10.89% in the
> spreadsheet). I'm not sure what to make of that since it only happens
> in one combination of factors and nowhere else.

That's a bit odd, but given how many data points there are, I think we can write it off as random noise.

- Heikki
Re: Polyphase merge is obsolete
On Tue, Oct 5, 2021 at 10:25 AM John Naylor wrote:
> int64 is used elsewhere in this file, and I see no reason to do
> otherwise.

Right. The point of using int64 in tuplesort.c is that the values may become negative in certain edge-cases. The whole LACKMEM() concept that tuplesort.c uses to enforce work_mem limits sometimes allows the implementation to use slightly more memory than theoretically allowable. That's how we get negative values.

-- 
Peter Geoghegan
Re: Polyphase merge is obsolete
On Wed, Sep 15, 2021 at 5:35 PM Heikki Linnakangas wrote:
> Thanks, here's another rebase.
>
> - Heikki

I've had a chance to review and test out the v5 patches. 0001 is a useful simplification. Nothing in 0002 stood out as needing comment.

I've done some performance testing of master versus both patches applied. The full results and test script are attached, but I'll give a summary here.

A variety of value distributions were tested, with work_mem from 1MB to 16MB, plus 2GB which will not use external sort at all. I settled on 2 million records for the sort, to have something large enough to work with but also keep the test time reasonable. That works out to about 130MB on disk. We have recent improvements to datum sort, so I used both single values and all values in the SELECT list. The system was on a Westmere-era Xeon with gcc 4.8. pg_prewarm was run on the input tables. The raw measurements were reduced to the minimum of five runs.

I can confirm that sort performance is improved with small values of work_mem. That was not the motivating reason for the patch, but it's a nice bonus. Even as high as 16MB work_mem, it's possible some of the 4-6% differences represent real improvement and not just noise or binary effects, but it's much more convincing at 4MB and below, with 25-30% faster with non-datum integer sorts at 1MB work_mem.

The nominal regressions seem within the noise level, with one exception that only showed up in one set of measurements (-10.89% in the spreadsheet). I'm not sure what to make of that since it only happens in one combination of factors and nowhere else.

On Sat, Sep 11, 2021 at 4:28 AM Zhihong Yu wrote:
> + * Before PostgreSQL 14, we used the polyphase merge algorithm (Knuth's
> + * Algorithm 5.4.2D),
>
> I think the above 'Before PostgreSQL 14' should be 'Before PostgreSQL
> 15' now that PostgreSQL 14 has been released.
> +static int64
> +merge_read_buffer_size(int64 avail_mem, int nInputTapes, int nInputRuns,
> +                       int maxOutputTapes)
>
> For memory to allocate, I think uint64 can be used (instead of int64).

int64 is used elsewhere in this file, and I see no reason to do otherwise.

Aside from s/14/15/ for the version as noted above, I've marked it Ready for Committer.

-- 
John Naylor
EDB: http://www.enterprisedb.com

[Attachment: benchmark driver shell script. The psql here-document bodies were lost when the message was archived; what remains shows functions create_tables, truncate_tables, load_random, load_sorted, load_almost_sorted, load_reversed, load_organ_pipe, load_0_1, prewarm, vacuum_analyze, run_query, and run_queries, which loops work_mem over 1MB, 2MB, 4MB, 8MB, 16MB, and 2GB and times ORDER BY queries on int_test and txt_test, logging EXPLAIN output to explains.log and timings to results.csv.]

[Attachment: kway-merge-test-1.ods (OpenDocument spreadsheet)]
Re: Polyphase merge is obsolete
On 16/09/2021 00:12, Jaime Casanova wrote:
> On Sat, Sep 11, 2021 at 01:35:27AM -0500, Jaime Casanova wrote:
>> On Wed, Jul 14, 2021 at 06:04:14PM +0300, Heikki Linnakangas wrote:
>>> On 14/07/2021 15:12, vignesh C wrote:
>>>> On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas wrote:
>>>>> Here's an updated version that fixes one bug:
>>>>>
>>>>> The CFBot was reporting a failure on the FreeBSD system [1]. It
>>>>> turned out to be an out-of-memory issue caused by an underflow bug
>>>>> in the calculation of the tape read buffer size. With a small
>>>>> work_mem size, the memory left for tape buffers was negative, and
>>>>> that wrapped around to a very large number. I believe that was not
>>>>> caught by the other systems, because the other ones had enough
>>>>> memory for the incorrectly-sized buffers anyway. That was the case
>>>>> on my laptop at least. It did cause a big slowdown in the
>>>>> 'tuplesort' regression test though, which I hadn't noticed.
>>>>>
>>>>> The fix for that bug is here as a separate patch for easier review,
>>>>> but I'll squash it before committing.
>>>>
>>>> The patch does not apply on Head anymore, could you rebase and post
>>>> a patch. I'm changing the status to "Waiting for Author".
>>>
>>> Here's a rebased version. I also squashed that little bug fix from
>>> previous patch set.
>>
>> Hi,
>>
>> This patch does not apply, can you submit a rebased version?
>
> BTW, I'm marking this one as "waiting on author"

Thanks, here's another rebase.

- Heikki

>From 501d73488159af3e172e117439054bc06dd09edf Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Wed, 21 Oct 2020 21:57:42 +0300
Subject: [PATCH v5 1/2] Refactor LogicalTapeSet/LogicalTape interface.

All the tape functions, like LogicalTapeRead and LogicalTapeWrite, now
take a LogicalTape as argument, instead of LogicalTapeSet+tape number.
You can create any number of LogicalTapes in a single LogicalTapeSet,
and you don't need to decide the number upfront, when you create the
tape set.

This makes the tape management in hash agg spilling in nodeAgg.c
simpler.
---
 src/backend/executor/nodeAgg.c     | 187
 src/backend/utils/sort/logtape.c   | 456 -
 src/backend/utils/sort/tuplesort.c | 229 +++
 src/include/nodes/execnodes.h      |   3 +-
 src/include/utils/logtape.h        |  37 ++-
 5 files changed, 359 insertions(+), 553 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 39bea204d16..c99a0de4ddb 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -208,7 +208,16 @@
  *
  * Spilled data is written to logical tapes. These provide better control
  * over memory usage, disk space, and the number of files than if we were
- * to use a BufFile for each spill.
+ * to use a BufFile for each spill.  We don't know the number of tapes needed
+ * at the start of the algorithm (because it can recurse), so a tape set is
+ * allocated at the beginning, and individual tapes are created as needed.
+ * As a particular tape is read, logtape.c recycles its disk space. When a
+ * tape is read to completion, it is destroyed entirely.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
+ * requiring a buffer of size BLCKSZ) for each partition.
  *
  * Note that it's possible for transition states to start small but then
  * grow very large; for instance in the case of ARRAY_AGG. In such cases,
@@ -311,27 +320,6 @@
  */
 #define CHUNKHDRSZ 16
 
-/*
- * Track all tapes needed for a HashAgg that spills. We don't know the maximum
- * number of tapes needed at the start of the algorithm (because it can
- * recurse), so one tape set is allocated and extended as needed for new
- * tapes. When a particular tape is already read, rewind it for write mode and
- * put it in the free list.
- *
- * Tapes' buffers can take up substantial memory when many tapes are open at
- * once. We only need one tape open at a time in read mode (using a buffer
- * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
- * requiring a buffer of size BLCKSZ) for each partition.
- */
-typedef struct HashTapeInfo
-{
-	LogicalTapeSet *tapeset;
-	int			ntapes;
-	int		   *freetapes;
-	int			nfreetapes;
-	int			freetapes_alloc;
-} HashTapeInfo;
-
 /*
  * Represents partitioned spill data for a single hashtable. Contains the
  * necessary information to route tuples to the correct partition, and to
@@ -343,9 +331,8 @@ typedef struct HashTapeInfo
  */
 typedef struct HashAggSpill
 {
-	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
 	int			npartitions;	/* number of partitions */
-	int		   *partitions;		/* spill partition tape numbers */
+	LogicalTape **partitions;	/* spill partition tapes */
 	int64	   *ntuples;		/* number of tuples in each partition */
 	uint32		mask;			/* mask to find partition from hash value */
Re: Polyphase merge is obsolete
On Sat, Sep 11, 2021 at 01:35:27AM -0500, Jaime Casanova wrote:
> On Wed, Jul 14, 2021 at 06:04:14PM +0300, Heikki Linnakangas wrote:
> > On 14/07/2021 15:12, vignesh C wrote:
> > > On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas wrote:
> > > > Here's an updated version that fixes one bug:
> > > >
> > > > The CFBot was reporting a failure on the FreeBSD system [1]. It
> > > > turned out to be an out-of-memory issue caused by an underflow bug
> > > > in the calculation of the tape read buffer size. With a small
> > > > work_mem size, the memory left for tape buffers was negative, and
> > > > that wrapped around to a very large number. I believe that was not
> > > > caught by the other systems, because the other ones had enough
> > > > memory for the incorrectly-sized buffers anyway. That was the case
> > > > on my laptop at least. It did cause a big slowdown in the
> > > > 'tuplesort' regression test though, which I hadn't noticed.
> > > >
> > > > The fix for that bug is here as a separate patch for easier
> > > > review, but I'll squash it before committing.
> > >
> > > The patch does not apply on Head anymore, could you rebase and post
> > > a patch. I'm changing the status to "Waiting for Author".
> >
> > Here's a rebased version. I also squashed that little bug fix from
> > previous patch set.
>
> Hi,
>
> This patch does not apply, can you submit a rebased version?

BTW, I'm marking this one as "waiting on author"

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
Re: Polyphase merge is obsolete
On Fri, Sep 10, 2021 at 11:35 PM Jaime Casanova <jcasa...@systemguards.com.ec> wrote:
> On Wed, Jul 14, 2021 at 06:04:14PM +0300, Heikki Linnakangas wrote:
> > On 14/07/2021 15:12, vignesh C wrote:
> > > On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas wrote:
> > > > Here's an updated version that fixes one bug:
> > > >
> > > > The CFBot was reporting a failure on the FreeBSD system [1]. It
> > > > turned out to be an out-of-memory issue caused by an underflow bug
> > > > in the calculation of the tape read buffer size. With a small
> > > > work_mem size, the memory left for tape buffers was negative, and
> > > > that wrapped around to a very large number. I believe that was not
> > > > caught by the other systems, because the other ones had enough
> > > > memory for the incorrectly-sized buffers anyway. That was the case
> > > > on my laptop at least. It did cause a big slowdown in the
> > > > 'tuplesort' regression test though, which I hadn't noticed.
> > > >
> > > > The fix for that bug is here as a separate patch for easier
> > > > review, but I'll squash it before committing.
> > >
> > > The patch does not apply on Head anymore, could you rebase and post
> > > a patch. I'm changing the status to "Waiting for Author".
> >
> > Here's a rebased version. I also squashed that little bug fix from
> > previous patch set.
>
> Hi,
>
> This patch does not apply, can you submit a rebased version?
>
> --
> Jaime Casanova
> Director de Servicios Profesionales
> SystemGuards - Consultores de PostgreSQL

Hi,

+ * Before PostgreSQL 14, we used the polyphase merge algorithm (Knuth's
+ * Algorithm 5.4.2D),

I think the above 'Before PostgreSQL 14' should be 'Before PostgreSQL 15' now that PostgreSQL 14 has been released.

+static int64
+merge_read_buffer_size(int64 avail_mem, int nInputTapes, int nInputRuns,
+                       int maxOutputTapes)

For memory to allocate, I think uint64 can be used (instead of int64).

Cheers
Re: Polyphase merge is obsolete
On Wed, Jul 14, 2021 at 06:04:14PM +0300, Heikki Linnakangas wrote:
> On 14/07/2021 15:12, vignesh C wrote:
> > On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas wrote:
> > > Here's an updated version that fixes one bug:
> > >
> > > The CFBot was reporting a failure on the FreeBSD system [1]. It
> > > turned out to be an out-of-memory issue caused by an underflow bug
> > > in the calculation of the tape read buffer size. With a small
> > > work_mem size, the memory left for tape buffers was negative, and
> > > that wrapped around to a very large number. I believe that was not
> > > caught by the other systems, because the other ones had enough
> > > memory for the incorrectly-sized buffers anyway. That was the case
> > > on my laptop at least. It did cause a big slowdown in the
> > > 'tuplesort' regression test though, which I hadn't noticed.
> > >
> > > The fix for that bug is here as a separate patch for easier review,
> > > but I'll squash it before committing.
> >
> > The patch does not apply on Head anymore, could you rebase and post a
> > patch. I'm changing the status to "Waiting for Author".
>
> Here's a rebased version. I also squashed that little bug fix from
> previous patch set.

Hi,

This patch does not apply, can you submit a rebased version?

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
Re: Polyphase merge is obsolete
On 14/07/2021 15:12, vignesh C wrote:
> On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas wrote:
>> Here's an updated version that fixes one bug:
>>
>> The CFBot was reporting a failure on the FreeBSD system [1]. It turned
>> out to be an out-of-memory issue caused by an underflow bug in the
>> calculation of the tape read buffer size. With a small work_mem size,
>> the memory left for tape buffers was negative, and that wrapped around
>> to a very large number. I believe that was not caught by the other
>> systems, because the other ones had enough memory for the
>> incorrectly-sized buffers anyway. That was the case on my laptop at
>> least. It did cause a big slowdown in the 'tuplesort' regression test
>> though, which I hadn't noticed.
>>
>> The fix for that bug is here as a separate patch for easier review,
>> but I'll squash it before committing.
>
> The patch does not apply on Head anymore, could you rebase and post a
> patch. I'm changing the status to "Waiting for Author".

Here's a rebased version. I also squashed that little bug fix from previous patch set.

- Heikki

>From 027b20e74b3d8b09144cffe6d071706231e496d7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Wed, 21 Oct 2020 21:57:42 +0300
Subject: [PATCH 1/2] Refactor LogicalTapeSet/LogicalTape interface.

All the tape functions, like LogicalTapeRead and LogicalTapeWrite, now
take a LogicalTape as argument, instead of LogicalTapeSet+tape number.
You can create any number of LogicalTapes in a single LogicalTapeSet,
and you don't need to decide the number upfront, when you create the
tape set.

This makes the tape management in hash agg spilling in nodeAgg.c
simpler.
---
 src/backend/executor/nodeAgg.c     | 187
 src/backend/utils/sort/logtape.c   | 457 -
 src/backend/utils/sort/tuplesort.c | 229 +++
 src/include/nodes/execnodes.h      |   3 +-
 src/include/utils/logtape.h        |  37 ++-
 5 files changed, 360 insertions(+), 553 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 914b02ceee4..80025e5f1f3 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -208,7 +208,16 @@
  *
  * Spilled data is written to logical tapes. These provide better control
  * over memory usage, disk space, and the number of files than if we were
- * to use a BufFile for each spill.
+ * to use a BufFile for each spill.  We don't know the number of tapes needed
+ * at the start of the algorithm (because it can recurse), so a tape set is
+ * allocated at the beginning, and individual tapes are created as needed.
+ * As a particular tape is read, logtape.c recycles its disk space. When a
+ * tape is read to completion, it is destroyed entirely.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
+ * requiring a buffer of size BLCKSZ) for each partition.
  *
  * Note that it's possible for transition states to start small but then
  * grow very large; for instance in the case of ARRAY_AGG. In such cases,
@@ -311,27 +320,6 @@
  */
 #define CHUNKHDRSZ 16
 
-/*
- * Track all tapes needed for a HashAgg that spills. We don't know the maximum
- * number of tapes needed at the start of the algorithm (because it can
- * recurse), so one tape set is allocated and extended as needed for new
- * tapes. When a particular tape is already read, rewind it for write mode and
- * put it in the free list.
- *
- * Tapes' buffers can take up substantial memory when many tapes are open at
- * once. We only need one tape open at a time in read mode (using a buffer
- * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
- * requiring a buffer of size BLCKSZ) for each partition.
- */
-typedef struct HashTapeInfo
-{
-	LogicalTapeSet *tapeset;
-	int			ntapes;
-	int		   *freetapes;
-	int			nfreetapes;
-	int			freetapes_alloc;
-} HashTapeInfo;
-
 /*
  * Represents partitioned spill data for a single hashtable. Contains the
  * necessary information to route tuples to the correct partition, and to
@@ -343,9 +331,8 @@ typedef struct HashTapeInfo
  */
 typedef struct HashAggSpill
 {
-	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
 	int			npartitions;	/* number of partitions */
-	int		   *partitions;		/* spill partition tape numbers */
+	LogicalTape **partitions;	/* spill partition tapes */
 	int64	   *ntuples;		/* number of tuples in each partition */
 	uint32		mask;			/* mask to find partition from hash value */
 	int			shift;			/* after masking, shift by this amount */
@@ -365,8 +352,7 @@ typedef struct HashAggBatch
 {
 	int			setno;			/* grouping set */
 	int			used_bits;		/* number of bits of hash already used */
-	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
-	int			input_tapenum;	/* input partition tape */
+
Re: Polyphase merge is obsolete
On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas wrote:
>
> On 22/10/2020 14:48, Heikki Linnakangas wrote:
> > On 11/09/2017 13:37, Tomas Vondra wrote:
> >> I planned to do some benchmarking on this patch, but apparently the
> >> patch no longer applies. Rebase please?
> >
> > Here's a rebase of this. Sorry to keep you waiting :-).
>
> Here's an updated version that fixes one bug:
>
> The CFBot was reporting a failure on the FreeBSD system [1]. It turned
> out to be an out-of-memory issue caused by an underflow bug in the
> calculation of the tape read buffer size. With a small work_mem size,
> the memory left for tape buffers was negative, and that wrapped around
> to a very large number. I believe that was not caught by the other
> systems, because the other ones had enough memory for the
> incorrectly-sized buffers anyway. That was the case on my laptop at
> least. It did cause a big slowdown in the 'tuplesort' regression test
> though, which I hadn't noticed.
>
> The fix for that bug is here as a separate patch for easier review, but
> I'll squash it before committing.

The patch does not apply on Head anymore, could you rebase and post a patch. I'm changing the status to "Waiting for Author".

Regards,
Vignesh
Re: Polyphase merge is obsolete
On 3/25/21 9:41 AM, David Steele wrote:
> On 1/22/21 5:19 PM, Heikki Linnakangas wrote:
>> On 22/10/2020 14:48, Heikki Linnakangas wrote:
>>> On 11/09/2017 13:37, Tomas Vondra wrote:
>>>> I planned to do some benchmarking on this patch, but apparently the
>>>> patch no longer applies. Rebase please?
>>>
>>> Here's a rebase of this. Sorry to keep you waiting :-).
>
> I am amused that the wait here was 3+ years!
>
>> Here's an updated version that fixes one bug:
>
> We're near the end of the CF and this has not really seen any review
> after a long hiatus. I'll move this to the next CF on April 30 unless I
> hear objections.

Oops, I meant March 30.

Regards,
-- 
-David
da...@pgmasters.net
Re: Polyphase merge is obsolete
On 1/22/21 5:19 PM, Heikki Linnakangas wrote:
> On 22/10/2020 14:48, Heikki Linnakangas wrote:
>> On 11/09/2017 13:37, Tomas Vondra wrote:
>>> I planned to do some benchmarking on this patch, but apparently the
>>> patch no longer applies. Rebase please?
>>
>> Here's a rebase of this. Sorry to keep you waiting :-).

I am amused that the wait here was 3+ years!

> Here's an updated version that fixes one bug:

We're near the end of the CF and this has not really seen any review after a long hiatus. I'll move this to the next CF on April 30 unless I hear objections.

Regards,
--
-David
da...@pgmasters.net
Re: Polyphase merge is obsolete
On 22/10/2020 14:48, Heikki Linnakangas wrote:
> On 11/09/2017 13:37, Tomas Vondra wrote:
>> I planned to do some benchmarking on this patch, but apparently the
>> patch no longer applies. Rebase please?
>
> Here's a rebase of this. Sorry to keep you waiting :-).

Here's an updated version that fixes one bug:

The CFBot was reporting a failure on the FreeBSD system [1]. It turned out to be an out-of-memory issue caused by an underflow bug in the calculation of the tape read buffer size. With a small work_mem size, the memory left for tape buffers was negative, and that wrapped around to a very large number. I believe that was not caught by the other systems, because the other ones had enough memory for the incorrectly-sized buffers anyway. That was the case on my laptop at least. It did cause a big slowdown in the 'tuplesort' regression test though, which I hadn't noticed.

The fix for that bug is here as a separate patch for easier review, but I'll squash it before committing.

[1] https://cirrus-ci.com/task/6699842091089920

- Heikki

>From 8fd75cfc8574e3a5885f88f13f11b0676c99f39f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Wed, 21 Oct 2020 21:57:42 +0300
Subject: [PATCH v3 1/3] Refactor LogicalTapeSet/LogicalTape interface.

All the tape functions, like LogicalTapeRead and LogicalTapeWrite, now take a LogicalTape as argument, instead of LogicalTapeSet+tape number. You can create any number of LogicalTapes in a single LogicalTapeSet, and you don't need to decide the number upfront, when you create the tape set.

This makes the tape management in hash agg spilling in nodeAgg.c simpler.
---
 src/backend/executor/nodeAgg.c     | 187
 src/backend/utils/sort/logtape.c   | 456 -
 src/backend/utils/sort/tuplesort.c | 229 +++
 src/include/nodes/execnodes.h      |   3 +-
 src/include/utils/logtape.h        |  37 ++-
 5 files changed, 360 insertions(+), 552 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 601b6dab03f..13f6f188668 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -208,7 +208,16 @@
  *
  * Spilled data is written to logical tapes. These provide better control
  * over memory usage, disk space, and the number of files than if we were
- * to use a BufFile for each spill.
+ * to use a BufFile for each spill. We don't know the number of tapes needed
+ * at the start of the algorithm (because it can recurse), so a tape set is
+ * allocated at the beginning, and individual tapes are created as needed.
+ * As a particular tape is read, logtape.c recycles its disk space. When a
+ * tape is read to completion, it is destroyed entirely.
+ *
+ * Tapes' buffers can take up substantial memory when many tapes are open at
+ * once. We only need one tape open at a time in read mode (using a buffer
+ * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
+ * requiring a buffer of size BLCKSZ) for each partition.
  *
  * Note that it's possible for transition states to start small but then
  * grow very large; for instance in the case of ARRAY_AGG. In such cases,
@@ -311,27 +320,6 @@
  */
 #define CHUNKHDRSZ 16

-/*
- * Track all tapes needed for a HashAgg that spills. We don't know the maximum
- * number of tapes needed at the start of the algorithm (because it can
- * recurse), so one tape set is allocated and extended as needed for new
- * tapes. When a particular tape is already read, rewind it for write mode and
- * put it in the free list.
- *
- * Tapes' buffers can take up substantial memory when many tapes are open at
- * once. We only need one tape open at a time in read mode (using a buffer
- * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
- * requiring a buffer of size BLCKSZ) for each partition.
- */
-typedef struct HashTapeInfo
-{
-	LogicalTapeSet *tapeset;
-	int			ntapes;
-	int		   *freetapes;
-	int			nfreetapes;
-	int			freetapes_alloc;
-} HashTapeInfo;
-
 /*
  * Represents partitioned spill data for a single hashtable. Contains the
  * necessary information to route tuples to the correct partition, and to
@@ -343,9 +331,8 @@ typedef struct HashTapeInfo
  */
 typedef struct HashAggSpill
 {
-	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
 	int			npartitions;	/* number of partitions */
-	int		   *partitions;		/* spill partition tape numbers */
+	LogicalTape **partitions;	/* spill partition tapes */
 	int64	   *ntuples;		/* number of tuples in each partition */
 	uint32		mask;			/* mask to find partition from hash value */
 	int			shift;			/* after masking, shift by this amount */
@@ -365,8 +352,7 @@ typedef struct HashAggBatch
 {
 	int			setno;			/* grouping set */
 	int			used_bits;		/* number of bits of hash already used */
-	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
-	int			input_tapenum;	/* input partition tape */
+