Re: Polyphase merge is obsolete

2023-01-23 Thread Heikki Linnakangas

On 16/01/2023 12:23, Peter Eisentraut wrote:

On 21.11.22 10:29, Peter Eisentraut wrote:

On 21.11.22 00:57, Heikki Linnakangas wrote:

On 19/11/2022 13:00, Peter Eisentraut wrote:

AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
but did not adequately update the function header comment.  The comment
still mentions the "shared" argument, which has been removed.  There is
a new "preallocate" argument that is not mentioned at all.  Also, it's
not easy to match the actual callers to the call variants described in
the comment.  Could someone who remembers this work perhaps look this
over and update the comment?


Is the attached more readable?


That looks better, thanks.


I'm not 100% sure of the "preallocate" argument. If I understand
correctly, you should pass true if you are writing multiple tapes at
the same time, and false otherwise. HashAgg passes true, tuplesort
passes false. However, it's not clear to me why we couldn't just
always do the preallocation. It seems pretty harmless even if it's not
helpful. Or do it when there are multiple writer tapes, and not
otherwise. The parameter was added in commit 0758964963 so presumably
there was a reason, but at a quick glance at the thread that led to
that commit, I couldn't see what it was.


Right, these are the kinds of questions such a comment ought to answer.

Let's see if anyone chimes in here, otherwise let's complain in the
original thread, since it had nothing to do with this one.


So nothing has happened.  Let's get your changes about the removed
"shared" argument committed, and we can figure out the "preallocated"
thing separately.


Ok, pushed. I included the brief comment on "preallocated", but it would 
be nice to improve it further, per the above discussion.
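
For reference, the two in-tree call patterns discussed above look roughly
like this. This is only an illustrative sketch, assuming the PG15 signature
LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker);
the variable names are mine, not taken from the actual callers:

    /* tuplesort.c writes one run at a time, so it skips preallocation */
    LogicalTapeSet *sort_tapeset = LogicalTapeSetCreate(false, NULL, -1);

    /* HashAgg spilling writes many partition tapes concurrently, so it
     * asks for batched block preallocation to reduce fragmentation */
    LogicalTapeSet *spill_tapeset = LogicalTapeSetCreate(true, NULL, -1);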


Thanks!

- Heikki





Re: Polyphase merge is obsolete

2023-01-16 Thread Peter Eisentraut

On 21.11.22 10:29, Peter Eisentraut wrote:

On 21.11.22 00:57, Heikki Linnakangas wrote:

On 19/11/2022 13:00, Peter Eisentraut wrote:

On 18.10.21 14:15, Heikki Linnakangas wrote:

On 05/10/2021 20:24, John Naylor wrote:

I've had a chance to review and test out the v5 patches.


Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong
mentioned, and pushed.


AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
but did not adequately update the function header comment.  The comment
still mentions the "shared" argument, which has been removed.  There is
a new "preallocate" argument that is not mentioned at all.  Also, it's
not easy to match the actual callers to the call variants described in
the comment.  Could someone who remembers this work perhaps look this
over and update the comment?


Is the attached more readable?


That looks better, thanks.

I'm not 100% sure of the "preallocate" argument. If I understand 
correctly, you should pass true if you are writing multiple tapes at 
the same time, and false otherwise. HashAgg passes true, tuplesort 
passes false. However, it's not clear to me why we couldn't just 
always do the preallocation. It seems pretty harmless even if it's not 
helpful. Or do it when there are multiple writer tapes, and not 
otherwise. The parameter was added in commit 0758964963 so presumably 
there was a reason, but at a quick glance at the thread that led to 
that commit, I couldn't see what it was.


Right, these are the kinds of questions such a comment ought to answer.

Let's see if anyone chimes in here, otherwise let's complain in the 
original thread, since it had nothing to do with this one.


So nothing has happened.  Let's get your changes about the removed 
"shared" argument committed, and we can figure out the "preallocated" 
thing separately.






Re: Polyphase merge is obsolete

2022-11-21 Thread Peter Eisentraut

On 21.11.22 00:57, Heikki Linnakangas wrote:

On 19/11/2022 13:00, Peter Eisentraut wrote:

On 18.10.21 14:15, Heikki Linnakangas wrote:

On 05/10/2021 20:24, John Naylor wrote:

I've had a chance to review and test out the v5 patches.


Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong
mentioned, and pushed.


AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
but did not adequately update the function header comment.  The comment
still mentions the "shared" argument, which has been removed.  There is
a new "preallocate" argument that is not mentioned at all.  Also, it's
not easy to match the actual callers to the call variants described in
the comment.  Could someone who remembers this work perhaps look this
over and update the comment?


Is the attached more readable?


That looks better, thanks.

I'm not 100% sure of the "preallocate" argument. If I understand 
correctly, you should pass true if you are writing multiple tapes at the 
same time, and false otherwise. HashAgg passes true, tuplesort passes 
false. However, it's not clear to me why we couldn't just always do the 
preallocation. It seems pretty harmless even if it's not helpful. Or do 
it when there are multiple writer tapes, and not otherwise. The 
parameter was added in commit 0758964963 so presumably there was a 
reason, but at a quick glance at the thread that led to that commit, I 
couldn't see what it was.


Right, these are the kinds of questions such a comment ought to answer.

Let's see if anyone chimes in here, otherwise let's complain in the 
original thread, since it had nothing to do with this one.






Re: Polyphase merge is obsolete

2022-11-20 Thread Heikki Linnakangas

On 19/11/2022 13:00, Peter Eisentraut wrote:

On 18.10.21 14:15, Heikki Linnakangas wrote:

On 05/10/2021 20:24, John Naylor wrote:

I've had a chance to review and test out the v5 patches.


Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong
mentioned, and pushed.


AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15,
but did not adequately update the function header comment.  The comment
still mentions the "shared" argument, which has been removed.  There is
a new "preallocate" argument that is not mentioned at all.  Also, it's
not easy to match the actual callers to the call variants described in
the comment.  Could someone who remembers this work perhaps look this
over and update the comment?


Is the attached more readable?

I'm not 100% sure of the "preallocate" argument. If I understand 
correctly, you should pass true if you are writing multiple tapes at the 
same time, and false otherwise. HashAgg passes true, tuplesort passes 
false. However, it's not clear to me why we couldn't just always do the 
preallocation. It seems pretty harmless even if it's not helpful. Or do 
it when there are multiple writer tapes, and not otherwise. The 
parameter was added in commit 0758964963 so presumably there was a 
reason, but at a quick glance at the thread that led to that commit, I 
couldn't see what it was.


- Heikki
From 1c5d4e28a21f6052d22667e79b5bd443aec3a5b5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 21 Nov 2022 01:55:05 +0200
Subject: [PATCH 1/1] Fix and clarify function comment on LogicalTapeSetCreate.

Commit c4649cce39 removed the "shared" and "ntapes" arguments, but the
comment still talked about "shared". It also talked about "a shared
file handle", which was technically correct because it even before
commit c4649cce39, the "shared file handle" referred to the "fileset"
argument, not "shared". But it was very confusing. Improve the
comment.

Also add a comment on what the "preallocate" argument does.

Discussion: https://www.postgresql.org/message-id/af989685-91d5-aad4-8f60-1d066b5ec...@enterprisedb.com
---
 src/backend/utils/sort/logtape.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index c384f98e13..b518a4b9c5 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -544,14 +544,20 @@ ltsInitReadBuffer(LogicalTape *lt)
  * The tape set is initially empty. Use LogicalTapeCreate() to create
  * tapes in it.
  *
- * Serial callers pass NULL argument for shared, and -1 for worker.  Parallel
- * worker callers pass a shared file handle and their own worker number.
+ * In a single-process sort, pass NULL argument for fileset, and -1 for
+ * worker.
  *
- * Leader callers pass a shared file handle and -1 for worker. After creating
- * the tape set, use LogicalTapeImport() to import the worker tapes into it.
+ * In a parallel sort, parallel workers pass the shared fileset handle and
+ * their own worker number.  After the workers have finished, create the
+ * tape set in the leader, passing the shared fileset handle and -1 for
+ * worker, and use LogicalTapeImport() to import the worker tapes into it.
  *
  * Currently, the leader will only import worker tapes into the set, it does
  * not create tapes of its own, although in principle that should work.
+ *
+ * If preallocate is true, blocks for each individual tape are allocated in
+ * batches.  This avoids fragmentation when writing multiple tapes at the
+ * same time.
  */
 LogicalTapeSet *
 LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker)
-- 
2.30.2
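
To make the comment's call variants concrete, here is a minimal sketch
(illustrative only; it assumes the signature above, and "fileset" and
"worker_number" stand in for whatever the caller already has at hand):

    /* single-process sort: no shared fileset, worker = -1 */
    LogicalTapeSet *lts = LogicalTapeSetCreate(false, NULL, -1);
    LogicalTape *tape = LogicalTapeCreate(lts);   /* then write/read as usual */

    /* parallel worker: shared fileset plus this worker's own number */
    LogicalTapeSet *worker_lts = LogicalTapeSetCreate(false, fileset, worker_number);

    /* leader, after the workers have finished: shared fileset, worker = -1,
     * then LogicalTapeImport() for each worker tape instead of creating new ones */
    LogicalTapeSet *leader_lts = LogicalTapeSetCreate(false, fileset, -1);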



Re: Polyphase merge is obsolete

2022-11-19 Thread Peter Eisentraut

On 18.10.21 14:15, Heikki Linnakangas wrote:

On 05/10/2021 20:24, John Naylor wrote:

I've had a chance to review and test out the v5 patches.


Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong 
mentioned, and pushed.


AFAICT, this thread updated the API of LogicalTapeSetCreate() in PG15, 
but did not adequately update the function header comment.  The comment 
still mentions the "shared" argument, which has been removed.  There is 
a new "preallocate" argument that is not mentioned at all.  Also, it's 
not easy to match the actual callers to the call variants described in 
the comment.  Could someone who remembers this work perhaps look this 
over and update the comment?






Re: Polyphase merge is obsolete

2021-10-18 Thread Heikki Linnakangas

On 05/10/2021 20:24, John Naylor wrote:

I've had a chance to review and test out the v5 patches.


Thanks! I fixed the stray reference to PostgreSQL 14 that Zhihong 
mentioned, and pushed.


I've done some performance testing of master versus both patches 
applied. The full results and test script are attached, but I'll give a 
summary here. A variety of value distributions were tested, with 
work_mem from 1MB to 16MB, plus 2GB which will not use external sort at 
all. I settled on 2 million records for the sort, to have something 
large enough to work with but also keep the test time reasonable. That 
works out to about 130MB on disk. We have recent improvements to datum 
sort, so I used both single values and all values in the SELECT list.


The system was on a Westmere-era Xeon with gcc 4.8. pg_prewarm was run 
on the input tables. The raw measurements were reduced to the minimum of 
five runs.


I can confirm that sort performance is improved with small values of 
work_mem. That was not the motivating reason for the patch, but it's a 
nice bonus. Even as high as 16MB work_mem, it's possible some of the 
4-6% differences represent real improvement and not just noise or binary 
effects, but it's much more convincing at 4MB and below, with 25-30% 
faster with non-datum integer sorts at 1MB work_mem. The nominal 
regressions seem within the noise level, with one exception that only 
showed up in one set of measurements (-10.89% in the spreadsheet). I'm 
not sure what to make of that since it only happens in one combination 
of factors and nowhere else.


That's a bit odd, but given how many data points there are, I think we 
can write it off as random noise.


- Heikki




Re: Polyphase merge is obsolete

2021-10-05 Thread Peter Geoghegan
On Tue, Oct 5, 2021 at 10:25 AM John Naylor
 wrote:
> int64 is used elsewhere in this file, and I see no reason to do otherwise.

Right. The point of using int64 in tuplesort.c is that the values may
become negative in certain edge-cases. The whole LACKMEM() concept
that tuplesort.c uses to enforce work_mem limits sometimes allows the
implementation to use slightly more memory than theoretically
allowable. That's how we get negative values.
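
A simplified, hypothetical sketch of why the counter has to be signed (these
are not the real tuplesort.c definitions, just the shape of the idea):

    #include <stdint.h>

    /* running "available memory" budget for a sort */
    typedef struct SortBudget { int64_t availMem; } SortBudget;

    #define USE_MEM(b, amt)  ((b)->availMem -= (amt))  /* may overshoot work_mem */
    #define LACK_MEM(b)      ((b)->availMem < 0)       /* hence the signed test */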

-- 
Peter Geoghegan




Re: Polyphase merge is obsolete

2021-10-05 Thread John Naylor
On Wed, Sep 15, 2021 at 5:35 PM Heikki Linnakangas  wrote:
> Thanks, here's another rebase.
>
> - Heikki

I've had a chance to review and test out the v5 patches.

0001 is a useful simplification. Nothing in 0002 stood out as needing
comment.

I've done some performance testing of master versus both patches applied.
The full results and test script are attached, but I'll give a summary
here. A variety of value distributions were tested, with work_mem from 1MB
to 16MB, plus 2GB which will not use external sort at all. I settled on 2
million records for the sort, to have something large enough to work with
but also keep the test time reasonable. That works out to about 130MB on
disk. We have recent improvements to datum sort, so I used both single
values and all values in the SELECT list.

The system was on a Westmere-era Xeon with gcc 4.8. pg_prewarm was run on
the input tables. The raw measurements were reduced to the minimum of five
runs.

I can confirm that sort performance is improved with small values of
work_mem. That was not the motivating reason for the patch, but it's a nice
bonus. Even as high as 16MB work_mem, it's possible some of the 4-6%
differences represent real improvement and not just noise or binary
effects, but it's much more convincing at 4MB and below, with 25-30% faster
with non-datum integer sorts at 1MB work_mem. The nominal regressions seem
within the noise level, with one exception that only showed up in one set
of measurements (-10.89% in the spreadsheet). I'm not sure what to make of
that since it only happens in one combination of factors and nowhere else.

On Sat, Sep 11, 2021 at 4:28 AM Zhihong Yu  wrote:
> + * Before PostgreSQL 14, we used the polyphase merge algorithm (Knuth's
> + * Algorithm 5.4.2D),
>
> I think the above 'Before PostgreSQL 14' should be 'Before PostgreSQL 15'
now that PostgreSQL 14 has been released.
>
> +static int64
> +merge_read_buffer_size(int64 avail_mem, int nInputTapes, int nInputRuns,
> +  int maxOutputTapes)
>
> For memory to allocate, I think uint64 can be used (instead of int64).

int64 is used elsewhere in this file, and I see no reason to do otherwise.

Aside from s/14/15/ for the version as noted above, I've marked it Ready
for Committer.

--
John Naylor
EDB: http://www.enterprisedb.com
set -e

ROWS=$1


function log {
	echo `date +%s` [`date +'%Y-%m-%d %H:%M:%S'`] $1
}

function create_tables {

	./inst/bin/psql > /dev/null  < /dev/null  < /dev/null  < /dev/null  < 16384 AND relkind = 'r';
EOF

}

# load test tables

function load_random {

	truncate_tables

	log "loading random"

	./inst/bin/psql > /dev/null  < /dev/null  < /dev/null  < /dev/null  < /dev/null  < /dev/null  < 0.5 
	then ''
	else ''
	end
FROM data_int;
EOF

	prewarm

	vacuum_analyze

}

# run

function run_query {

	times=""
	group=$1
	wmem=$2
	workers=$3
	query=$4

	echo "--" `date` [`date +%s`] >> explains.log
	echo "-- $group rows=$ROWS work_mem=$wmem workers=$workers" >> explains.log

	./inst/bin/psql >> explains.log 2>&1 < /dev/null <> results.csv
}


function run_queries {

	for wm in '1MB' '2MB' '4MB' '8MB' '16MB' '2GB'; do

		log "running queries work_mem=$wm noparallel"

		run_query $1 $wm 0 'SELECT a FROM int_test ORDER BY 1 OFFSET '$ROWS''
		run_query $1 $wm 0 'SELECT * FROM int_test ORDER BY 1 OFFSET '$ROWS''

		run_query $1 $wm 0 'SELECT a FROM txt_test ORDER BY 1 OFFSET '$ROWS''
		run_query $1 $wm 0 'SELECT * FROM txt_test ORDER BY 1 OFFSET '$ROWS''

	done

}

create_tables

log "sort benchmark $ROWS"

load_random
run_queries "random"

load_sorted
run_queries "sorted"

load_almost_sorted
run_queries "almost_sorted"

load_reversed
run_queries "reversed"

load_organ_pipe
run_queries "organ_pipe"

load_0_1
run_queries "0_1"


kway-merge-test-1.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: Polyphase merge is obsolete

2021-09-15 Thread Heikki Linnakangas

On 16/09/2021 00:12, Jaime Casanova wrote:

On Sat, Sep 11, 2021 at 01:35:27AM -0500, Jaime Casanova wrote:

On Wed, Jul 14, 2021 at 06:04:14PM +0300, Heikki Linnakangas wrote:

On 14/07/2021 15:12, vignesh C wrote:

On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas  wrote:

Here's an updated version that fixes one bug:

The CFBot was reporting a failure on the FreeBSD system [1]. It turned
out to be an out-of-memory issue caused by an underflow bug in the
calculation of the tape read buffer size. With a small
work_mem size, the memory left for tape buffers was negative, and that
wrapped around to a very large number. I believe that was not caught by
the other systems, because the other ones had enough memory for the
incorrectly-sized buffers anyway. That was the case on my laptop at
least. It did cause a big slowdown in the 'tuplesort' regression test
though, which I hadn't noticed.

The fix for that bug is here as a separate patch for easier review, but
I'll squash it before committing.


The patch does not apply on Head anymore, could you rebase and post a
patch. I'm changing the status to "Waiting for Author".


Here's a rebased version. I also squashed that little bug fix from previous
patch set.



Hi,

This patch does not apply, can you submit a rebased version?


BTW, I'm marking this one as "waiting on author"


Thanks, here's another rebase.

- Heikki
>From 501d73488159af3e172e117439054bc06dd09edf Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 21 Oct 2020 21:57:42 +0300
Subject: [PATCH v5 1/2] Refactor LogicalTapeSet/LogicalTape interface.

All the tape functions, like LogicalTapeRead and LogicalTapeWrite, now
take a LogicalTape as argument, instead of LogicalTapeSet+tape number.
You can create any number of LogicalTapes in a single LogicalTapeSet, and
you don't need to decide the number upfront, when you create the tape set.

This makes the tape management in hash agg spilling in nodeAgg.c simpler.
---
 src/backend/executor/nodeAgg.c | 187 
 src/backend/utils/sort/logtape.c   | 456 -
 src/backend/utils/sort/tuplesort.c | 229 +++
 src/include/nodes/execnodes.h  |   3 +-
 src/include/utils/logtape.h|  37 ++-
 5 files changed, 359 insertions(+), 553 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 39bea204d16..c99a0de4ddb 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -208,7 +208,16 @@
  *
  *	  Spilled data is written to logical tapes. These provide better control
  *	  over memory usage, disk space, and the number of files than if we were
- *	  to use a BufFile for each spill.
+ *	  to use a BufFile for each spill.  We don't know the number of tapes needed
+ *	  at the start of the algorithm (because it can recurse), so a tape set is
+ *	  allocated at the beginning, and individual tapes are created as needed.
+ *	  As a particular tape is read, logtape.c recycles its disk space. When a
+ *	  tape is read to completion, it is destroyed entirely.
+ *
+ *	  Tapes' buffers can take up substantial memory when many tapes are open at
+ *	  once. We only need one tape open at a time in read mode (using a buffer
+ *	  that's a multiple of BLCKSZ); but we need one tape open in write mode (each
+ *	  requiring a buffer of size BLCKSZ) for each partition.
  *
  *	  Note that it's possible for transition states to start small but then
  *	  grow very large; for instance in the case of ARRAY_AGG. In such cases,
@@ -311,27 +320,6 @@
  */
 #define CHUNKHDRSZ 16
 
-/*
- * Track all tapes needed for a HashAgg that spills. We don't know the maximum
- * number of tapes needed at the start of the algorithm (because it can
- * recurse), so one tape set is allocated and extended as needed for new
- * tapes. When a particular tape is already read, rewind it for write mode and
- * put it in the free list.
- *
- * Tapes' buffers can take up substantial memory when many tapes are open at
- * once. We only need one tape open at a time in read mode (using a buffer
- * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
- * requiring a buffer of size BLCKSZ) for each partition.
- */
-typedef struct HashTapeInfo
-{
-	LogicalTapeSet *tapeset;
-	int			ntapes;
-	int		   *freetapes;
-	int			nfreetapes;
-	int			freetapes_alloc;
-} HashTapeInfo;
-
 /*
  * Represents partitioned spill data for a single hashtable. Contains the
  * necessary information to route tuples to the correct partition, and to
@@ -343,9 +331,8 @@ typedef struct HashTapeInfo
  */
 typedef struct HashAggSpill
 {
-	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
 	int			npartitions;	/* number of partitions */
-	int		   *partitions;		/* spill partition tape numbers */
+	LogicalTape **partitions;	/* spill partition tapes */
 	int64	   *ntuples;		/* number of tuples in each partition */
 	uint32		mask;			/* mask to find partition from hash value */
 

Re: Polyphase merge is obsolete

2021-09-15 Thread Jaime Casanova
On Sat, Sep 11, 2021 at 01:35:27AM -0500, Jaime Casanova wrote:
> On Wed, Jul 14, 2021 at 06:04:14PM +0300, Heikki Linnakangas wrote:
> > On 14/07/2021 15:12, vignesh C wrote:
> > > On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas  
> > > wrote:
> > > > Here's an updated version that fixes one bug:
> > > > 
> > > > The CFBot was reporting a failure on the FreeBSD system [1]. It turned
> > > > out to be an out-of-memory issue caused by an underflow bug in the
> > > > calculation of the tape read buffer size. With a small
> > > > work_mem size, the memory left for tape buffers was negative, and that
> > > > wrapped around to a very large number. I believe that was not caught by
> > > > the other systems, because the other ones had enough memory for the
> > > > incorrectly-sized buffers anyway. That was the case on my laptop at
> > > > least. It did cause a big slowdown in the 'tuplesort' regression test
> > > > though, which I hadn't noticed.
> > > > 
> > > > The fix for that bug is here as a separate patch for easier review, but
> > > > I'll squash it before committing.
> > > 
> > > The patch does not apply on Head anymore, could you rebase and post a
> > > patch. I'm changing the status to "Waiting for Author".
> > 
> > Here's a rebased version. I also squashed that little bug fix from previous
> > patch set.
> > 
> 
> Hi,
> 
> This patch does not apply, can you submit a rebased version?
> 

BTW, I'm marking this one as "waiting on author"

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL




Re: Polyphase merge is obsolete

2021-09-11 Thread Zhihong Yu
On Fri, Sep 10, 2021 at 11:35 PM Jaime Casanova <
jcasa...@systemguards.com.ec> wrote:

> On Wed, Jul 14, 2021 at 06:04:14PM +0300, Heikki Linnakangas wrote:
> > On 14/07/2021 15:12, vignesh C wrote:
> > > On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas 
> wrote:
> > > > Here's an updated version that fixes one bug:
> > > >
> > > > The CFBot was reporting a failure on the FreeBSD system [1]. It
> turned
> > > > out to be an out-of-memory issue caused by an underflow bug in the
> > > > calculation of the tape read buffer size. With a small
> > > > work_mem size, the memory left for tape buffers was negative, and
> that
> > > > wrapped around to a very large number. I believe that was not caught
> by
> > > > the other systems, because the other ones had enough memory for the
> > > > incorrectly-sized buffers anyway. That was the case on my laptop at
> > > > least. It did cause a big slowdown in the 'tuplesort' regression test
> > > > though, which I hadn't noticed.
> > > >
> > > > The fix for that bug is here as a separate patch for easier review,
> but
> > > > I'll squash it before committing.
> > >
> > > The patch does not apply on Head anymore, could you rebase and post a
> > > patch. I'm changing the status to "Waiting for Author".
> >
> > Here's a rebased version. I also squashed that little bug fix from
> previous
> > patch set.
> >
>
> Hi,
>
> This patch does not apply, can you submit a rebased version?
>
> --
> Jaime Casanova
> Director de Servicios Profesionales
> SystemGuards - Consultores de PostgreSQL
>
>
> Hi,

+ * Before PostgreSQL 14, we used the polyphase merge algorithm (Knuth's
+ * Algorithm 5.4.2D),

I think the above 'Before PostgreSQL 14' should be 'Before PostgreSQL 15'
now that PostgreSQL 14 has been released.

+static int64
+merge_read_buffer_size(int64 avail_mem, int nInputTapes, int nInputRuns,
+  int maxOutputTapes)

For memory to allocate, I think uint64 can be used (instead of int64).

Cheers


Re: Polyphase merge is obsolete

2021-09-11 Thread Jaime Casanova
On Wed, Jul 14, 2021 at 06:04:14PM +0300, Heikki Linnakangas wrote:
> On 14/07/2021 15:12, vignesh C wrote:
> > On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas  wrote:
> > > Here's an updated version that fixes one bug:
> > > 
> > > The CFBot was reporting a failure on the FreeBSD system [1]. It turned
> > > out to be an out-of-memory issue caused by an underflow bug in the
> > > calculation of the tape read buffer size. With a small
> > > work_mem size, the memory left for tape buffers was negative, and that
> > > wrapped around to a very large number. I believe that was not caught by
> > > the other systems, because the other ones had enough memory for the
> > > incorrectly-sized buffers anyway. That was the case on my laptop at
> > > least. It did cause a big slowdown in the 'tuplesort' regression test
> > > though, which I hadn't noticed.
> > > 
> > > The fix for that bug is here as a separate patch for easier review, but
> > > I'll squash it before committing.
> > 
> > The patch does not apply on Head anymore, could you rebase and post a
> > patch. I'm changing the status to "Waiting for Author".
> 
> Here's a rebased version. I also squashed that little bug fix from previous
> patch set.
> 

Hi,

This patch does not apply, can you submit a rebased version?

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL




Re: Polyphase merge is obsolete

2021-07-14 Thread Heikki Linnakangas

On 14/07/2021 15:12, vignesh C wrote:

On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas  wrote:

Here's an updated version that fixes one bug:

The CFBot was reporting a failure on the FreeBSD system [1]. It turned
out to be an out-of-memory issue caused by an underflow bug in the
calculation of the tape read buffer size. With a small
work_mem size, the memory left for tape buffers was negative, and that
wrapped around to a very large number. I believe that was not caught by
the other systems, because the other ones had enough memory for the
incorrectly-sized buffers anyway. That was the case on my laptop at
least. It did cause a big slowdown in the 'tuplesort' regression test
though, which I hadn't noticed.

The fix for that bug is here as a separate patch for easier review, but
I'll squash it before committing.


The patch does not apply on Head anymore, could you rebase and post a
patch. I'm changing the status to "Waiting for Author".


Here's a rebased version. I also squashed that little bug fix from 
previous patch set.


- Heikki
>From 027b20e74b3d8b09144cffe6d071706231e496d7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 21 Oct 2020 21:57:42 +0300
Subject: [PATCH 1/2] Refactor LogicalTapeSet/LogicalTape interface.

All the tape functions, like LogicalTapeRead and LogicalTapeWrite, now
take a LogicalTape as argument, instead of LogicalTapeSet+tape number.
You can create any number of LogicalTapes in a single LogicalTapeSet, and
you don't need to decide the number upfront, when you create the tape set.

This makes the tape management in hash agg spilling in nodeAgg.c simpler.
---
 src/backend/executor/nodeAgg.c | 187 
 src/backend/utils/sort/logtape.c   | 457 -
 src/backend/utils/sort/tuplesort.c | 229 +++
 src/include/nodes/execnodes.h  |   3 +-
 src/include/utils/logtape.h|  37 ++-
 5 files changed, 360 insertions(+), 553 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 914b02ceee4..80025e5f1f3 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -208,7 +208,16 @@
  *
  *	  Spilled data is written to logical tapes. These provide better control
  *	  over memory usage, disk space, and the number of files than if we were
- *	  to use a BufFile for each spill.
+ *	  to use a BufFile for each spill.  We don't know the number of tapes needed
+ *	  at the start of the algorithm (because it can recurse), so a tape set is
+ *	  allocated at the beginning, and individual tapes are created as needed.
+ *	  As a particular tape is read, logtape.c recycles its disk space. When a
+ *	  tape is read to completion, it is destroyed entirely.
+ *
+ *	  Tapes' buffers can take up substantial memory when many tapes are open at
+ *	  once. We only need one tape open at a time in read mode (using a buffer
+ *	  that's a multiple of BLCKSZ); but we need one tape open in write mode (each
+ *	  requiring a buffer of size BLCKSZ) for each partition.
  *
  *	  Note that it's possible for transition states to start small but then
  *	  grow very large; for instance in the case of ARRAY_AGG. In such cases,
@@ -311,27 +320,6 @@
  */
 #define CHUNKHDRSZ 16
 
-/*
- * Track all tapes needed for a HashAgg that spills. We don't know the maximum
- * number of tapes needed at the start of the algorithm (because it can
- * recurse), so one tape set is allocated and extended as needed for new
- * tapes. When a particular tape is already read, rewind it for write mode and
- * put it in the free list.
- *
- * Tapes' buffers can take up substantial memory when many tapes are open at
- * once. We only need one tape open at a time in read mode (using a buffer
- * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
- * requiring a buffer of size BLCKSZ) for each partition.
- */
-typedef struct HashTapeInfo
-{
-	LogicalTapeSet *tapeset;
-	int			ntapes;
-	int		   *freetapes;
-	int			nfreetapes;
-	int			freetapes_alloc;
-} HashTapeInfo;
-
 /*
  * Represents partitioned spill data for a single hashtable. Contains the
  * necessary information to route tuples to the correct partition, and to
@@ -343,9 +331,8 @@ typedef struct HashTapeInfo
  */
 typedef struct HashAggSpill
 {
-	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
 	int			npartitions;	/* number of partitions */
-	int		   *partitions;		/* spill partition tape numbers */
+	LogicalTape **partitions;	/* spill partition tapes */
 	int64	   *ntuples;		/* number of tuples in each partition */
 	uint32		mask;			/* mask to find partition from hash value */
 	int			shift;			/* after masking, shift by this amount */
@@ -365,8 +352,7 @@ typedef struct HashAggBatch
 {
 	int			setno;			/* grouping set */
 	int			used_bits;		/* number of bits of hash already used */
-	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
-	int			input_tapenum;	/* input partition tape */
+	

Re: Polyphase merge is obsolete

2021-07-14 Thread vignesh C
On Sat, Jan 23, 2021 at 3:49 AM Heikki Linnakangas  wrote:
>
> On 22/10/2020 14:48, Heikki Linnakangas wrote:
> > On 11/09/2017 13:37, Tomas Vondra wrote:
> >> I planned to do some benchmarking on this patch, but apparently the
> >> patch no longer applies. Rebase please?
> >
> > Here's a rebase of this. Sorry to keep you waiting :-).
>
> Here's an updated version that fixes one bug:
>
> The CFBot was reporting a failure on the FreeBSD system [1]. It turned
> out to be an out-of-memory issue caused by an underflow bug in the
> calculation of the tape read buffer size. With a small
> work_mem size, the memory left for tape buffers was negative, and that
> wrapped around to a very large number. I believe that was not caught by
> the other systems, because the other ones had enough memory for the
> incorrectly-sized buffers anyway. That was the case on my laptop at
> least. It did cause a big slowdown in the 'tuplesort' regression test
> though, which I hadn't noticed.
>
> The fix for that bug is here as a separate patch for easier review, but
> I'll squash it before committing.

The patch does not apply on Head anymore, could you rebase and post a
patch. I'm changing the status to "Waiting for Author".

Regards,
Vignesh




Re: Polyphase merge is obsolete

2021-03-25 Thread David Steele

On 3/25/21 9:41 AM, David Steele wrote:

On 1/22/21 5:19 PM, Heikki Linnakangas wrote:

On 22/10/2020 14:48, Heikki Linnakangas wrote:

On 11/09/2017 13:37, Tomas Vondra wrote:

I planned to do some benchmarking on this patch, but apparently the
patch no longer applies. Rebase please?


Here's a rebase of this. Sorry to keep you waiting :-).


I am amused that the wait here was 3+ years!


Here's an updated version that fixes one bug:


We're near the end of the CF and this has not really seen any review 
after a long hiatus. I'll move this to the next CF on April 30 unless I 
hear objections.


Oops, I meant March 30.

Regards,
--
-David
da...@pgmasters.net




Re: Polyphase merge is obsolete

2021-03-25 Thread David Steele

On 1/22/21 5:19 PM, Heikki Linnakangas wrote:

On 22/10/2020 14:48, Heikki Linnakangas wrote:

On 11/09/2017 13:37, Tomas Vondra wrote:

I planned to do some benchmarking on this patch, but apparently the
patch no longer applies. Rebase please?


Here's a rebase of this. Sorry to keep you waiting :-).


I am amused that the wait here was 3+ years!


Here's an updated version that fixes one bug:


We're near the end of the CF and this has not really seen any review 
after a long hiatus. I'll move this to the next CF on April 30 unless I 
hear objections.


Regards,
--
-David
da...@pgmasters.net




Re: Polyphase merge is obsolete

2021-01-22 Thread Heikki Linnakangas

On 22/10/2020 14:48, Heikki Linnakangas wrote:

On 11/09/2017 13:37, Tomas Vondra wrote:

I planned to do some benchmarking on this patch, but apparently the
patch no longer applies. Rebase please?


Here's a rebase of this. Sorry to keep you waiting :-).


Here's an updated version that fixes one bug:

The CFBot was reporting a failure on the FreeBSD system [1]. It turned 
out to be an out-of-memory issue caused by an underflow bug in the 
calculation of the tape read buffer size. With a small 
work_mem size, the memory left for tape buffers was negative, and that 
wrapped around to a very large number. I believe that was not caught by 
the other systems, because the other ones had enough memory for the 
incorrectly-sized buffers anyway. That was the case on my laptop at 
least. It did cause a big slowdown in the 'tuplesort' regression test 
though, which I hadn't noticed.
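
As a standalone illustration of that failure mode (not the actual logtape.c
code, just the arithmetic):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* budget minus per-tape overhead goes negative with a tiny work_mem */
        int64_t mem_for_buffers = 64 * 1024 - 256 * 1024;    /* -196608 */
        /* treating that as an unsigned size wraps to a huge allocation */
        size_t read_buffer_size = (size_t) mem_for_buffers;  /* ~1.8e19 on 64-bit */
        printf("%lld -> %zu\n", (long long) mem_for_buffers, read_buffer_size);
        return 0;
    }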


The fix for that bug is here as a separate patch for easier review, but 
I'll squash it before committing.


[1] https://cirrus-ci.com/task/6699842091089920

- Heikki

>From 8fd75cfc8574e3a5885f88f13f11b0676c99f39f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 21 Oct 2020 21:57:42 +0300
Subject: [PATCH v3 1/3] Refactor LogicalTapeSet/LogicalTape interface.

All the tape functions, like LogicalTapeRead and LogicalTapeWrite, now
take a LogicalTape as argument, instead of LogicalTapeSet+tape number.
You can create any number of LogicalTapes in a single LogicalTapeSet, and
you don't need to decide the number upfront, when you create the tape set.

This makes the tape management in hash agg spilling in nodeAgg.c simpler.
---
 src/backend/executor/nodeAgg.c | 187 
 src/backend/utils/sort/logtape.c   | 456 -
 src/backend/utils/sort/tuplesort.c | 229 +++
 src/include/nodes/execnodes.h  |   3 +-
 src/include/utils/logtape.h|  37 ++-
 5 files changed, 360 insertions(+), 552 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 601b6dab03f..13f6f188668 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -208,7 +208,16 @@
  *
  *	  Spilled data is written to logical tapes. These provide better control
  *	  over memory usage, disk space, and the number of files than if we were
- *	  to use a BufFile for each spill.
+ *	  to use a BufFile for each spill.  We don't know the number of tapes needed
+ *	  at the start of the algorithm (because it can recurse), so a tape set is
+ *	  allocated at the beginning, and individual tapes are created as needed.
+ *	  As a particular tape is read, logtape.c recycles its disk space. When a
+ *	  tape is read to completion, it is destroyed entirely.
+ *
+ *	  Tapes' buffers can take up substantial memory when many tapes are open at
+ *	  once. We only need one tape open at a time in read mode (using a buffer
+ *	  that's a multiple of BLCKSZ); but we need one tape open in write mode (each
+ *	  requiring a buffer of size BLCKSZ) for each partition.
  *
  *	  Note that it's possible for transition states to start small but then
  *	  grow very large; for instance in the case of ARRAY_AGG. In such cases,
@@ -311,27 +320,6 @@
  */
 #define CHUNKHDRSZ 16
 
-/*
- * Track all tapes needed for a HashAgg that spills. We don't know the maximum
- * number of tapes needed at the start of the algorithm (because it can
- * recurse), so one tape set is allocated and extended as needed for new
- * tapes. When a particular tape is already read, rewind it for write mode and
- * put it in the free list.
- *
- * Tapes' buffers can take up substantial memory when many tapes are open at
- * once. We only need one tape open at a time in read mode (using a buffer
- * that's a multiple of BLCKSZ); but we need one tape open in write mode (each
- * requiring a buffer of size BLCKSZ) for each partition.
- */
-typedef struct HashTapeInfo
-{
-	LogicalTapeSet *tapeset;
-	int			ntapes;
-	int		   *freetapes;
-	int			nfreetapes;
-	int			freetapes_alloc;
-} HashTapeInfo;
-
 /*
  * Represents partitioned spill data for a single hashtable. Contains the
  * necessary information to route tuples to the correct partition, and to
@@ -343,9 +331,8 @@ typedef struct HashTapeInfo
  */
 typedef struct HashAggSpill
 {
-	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
 	int			npartitions;	/* number of partitions */
-	int		   *partitions;		/* spill partition tape numbers */
+	LogicalTape **partitions;	/* spill partition tapes */
 	int64	   *ntuples;		/* number of tuples in each partition */
 	uint32		mask;			/* mask to find partition from hash value */
 	int			shift;			/* after masking, shift by this amount */
@@ -365,8 +352,7 @@ typedef struct HashAggBatch
 {
 	int			setno;			/* grouping set */
 	int			used_bits;		/* number of bits of hash already used */
-	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
-	int			input_tapenum;	/* input partition tape */
+