Re: Memory leak from ExecutorState context?

2023-05-22 Thread Jehan-Guillaume de Rorthais
On Fri, 19 May 2023 17:23:56 +0200
Tomas Vondra  wrote:
> On 5/19/23 00:27, Melanie Plageman wrote:
> > v10 LGTM.  
> 
> Thanks!
> 
> I've pushed 0002 and 0003, after some general bikeshedding and minor
> rewording (a bit audacious, admittedly).

Thank you Melanie and Tomas for your help, review and commit!





Re: Memory leak from ExecutorState context?

2023-05-19 Thread Tom Lane
Tomas Vondra  writes:
> I didn't push 0001; I don't think we generally do separate pgindent patches
> like this (I only run pgindent on large patches to ensure it doesn't
> cause massive breakage, not separately like this, but YMMV).

It's especially pointless when the main pgindent run for v16 is going
to happen today (as soon as I get done clearing out my other queue
items).

regards, tom lane




Re: Memory leak from ExecutorState context?

2023-05-19 Thread Tomas Vondra
On 5/19/23 00:27, Melanie Plageman wrote:
> On Wed, May 17, 2023 at 6:35 PM Jehan-Guillaume de Rorthais
>  wrote:
>>
>> On Wed, 17 May 2023 13:46:35 -0400
>> Melanie Plageman  wrote:
>>
>>> On Wed, May 17, 2023 at 07:10:08PM +0200, Jehan-Guillaume de Rorthais wrote:
 On Tue, 16 May 2023 16:00:52 -0400
 Melanie Plageman  wrote:
> ...
> There are some existing indentation issues in these files, but you can
> leave those or put them in a separate commit.

 Reindented in v9.

 I put existing indentation issues in a separate commit to keep the actual
 patches clean from distractions.
>>>
>>> It is a matter of opinion, but I tend to prefer the commit with the fix
>>> for the existing indentation issues to be first in the patch set.
>>
>> OK. moved in v10 patch set.
> 
> v10 LGTM.

Thanks!

I've pushed 0002 and 0003, after some general bikeshedding and minor
rewording (a bit audacious, admittedly).

I didn't push 0001; I don't think we generally do separate pgindent patches
like this (I only run pgindent on large patches to ensure it doesn't
cause massive breakage, not separately like this, but YMMV).

Anyway, that's it for PG16. Let's see if we can do more in this area for
PG17.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-05-18 Thread Melanie Plageman
On Wed, May 17, 2023 at 6:35 PM Jehan-Guillaume de Rorthais
 wrote:
>
> On Wed, 17 May 2023 13:46:35 -0400
> Melanie Plageman  wrote:
>
> > On Wed, May 17, 2023 at 07:10:08PM +0200, Jehan-Guillaume de Rorthais wrote:
> > > On Tue, 16 May 2023 16:00:52 -0400
> > > Melanie Plageman  wrote:
> > > > ...
> > > > There are some existing indentation issues in these files, but you can
> > > > leave those or put them in a separate commit.
> > >
> > > Reindented in v9.
> > >
> > > I put existing indentation issues in a separate commit to keep the actual
> > > patches clean from distractions.
> >
> > It is a matter of opinion, but I tend to prefer the commit with the fix
> > for the existing indentation issues to be first in the patch set.
>
> OK. moved in v10 patch set.

v10 LGTM.




Re: Memory leak from ExecutorState context?

2023-05-17 Thread Jehan-Guillaume de Rorthais
On Wed, 17 May 2023 13:46:35 -0400
Melanie Plageman  wrote:

> On Wed, May 17, 2023 at 07:10:08PM +0200, Jehan-Guillaume de Rorthais wrote:
> > On Tue, 16 May 2023 16:00:52 -0400
> > Melanie Plageman  wrote:  
> > > ...
> > > There are some existing indentation issues in these files, but you can
> > > leave those or put them in a separate commit.  
> > 
> > Reindented in v9.
> > 
> > I put existing indentation issues in a separate commit to keep the actual
> > patches clean from distractions.  
> 
> It is a matter of opinion, but I tend to prefer the commit with the fix
> for the existing indentation issues to be first in the patch set.

OK. moved in v10 patch set.

> ...
> > > >  void
> > > >  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> > > > -                     BufFile **fileptr)
> > > > +                     BufFile **fileptr, MemoryContext cxt)
> > > >  {
> > 
> > Note that I changed this to:
> > 
> >  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> >BufFile **fileptr, HashJoinTable hashtable) {
> > 
> > As this function must allocate the BufFile buffer in spillCxt, I suppose
> > we should force it explicitly in the function code.
> > 
> > Plus, having the hashtable locally could be useful for a later patch to,
> > e.g., track allocated memory in spaceUsed more precisely.
> 
> If you want to pass the hashtable instead of the memory context, I think
> you'll need to explain why you still pass the BufFile pointer (because
> ExecHashJoinSaveTuple() is called for both inner and outer batch files)
> instead of getting it from the hashtable's arrays of BufFile pointers.

Comment added in v10

> > @@ -1310,22 +1311,38 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
> ...
> 
> Suggested small edit to the second paragraph:
> 
>   During the build phase, buffered files are created for inner batches.
>   Each batch's buffered file is closed (and its buffer freed) after the
>   batch is loaded into memory during the outer side scan. Therefore, it
>   is necessary to allocate the batch file buffer in a memory context
>   which outlives the batch itself.

Changed.

> I'd also mention the reason for passing the buffile pointer above the
> function.

Added.

Regards,
>From 25ef004b31e07956a018dd08170ee1588b7719b8 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Wed, 17 May 2023 19:01:06 +0200
Subject: [PATCH v10 1/3] Run pgindent on nodeHash.c and nodeHashjoin.c

---
 src/backend/executor/nodeHash.c | 6 +++---
 src/backend/executor/nodeHashjoin.c | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 5fd1c5553b..ac3eb32d97 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1327,7 +1327,7 @@ ExecParallelHashRepartitionFirst(HashJoinTable hashtable)
 			else
 			{
 size_t		tuple_size =
-MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
+	MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
 
 /* It belongs in a later batch. */
 hashtable->batches[batchno].estimated_size += tuple_size;
@@ -1369,7 +1369,7 @@ ExecParallelHashRepartitionRest(HashJoinTable hashtable)
 	for (i = 1; i < old_nbatch; ++i)
 	{
 		ParallelHashJoinBatch *shared =
-		NthParallelHashJoinBatch(old_batches, i);
+			NthParallelHashJoinBatch(old_batches, i);
 
 		old_inner_tuples[i] = sts_attach(ParallelHashJoinBatchInner(shared),
 		 ParallelWorkerNumber + 1,
@@ -3317,7 +3317,7 @@ ExecHashTableDetachBatch(HashJoinTable hashtable)
 			while (DsaPointerIsValid(batch->chunks))
 			{
 HashMemoryChunk chunk =
-dsa_get_address(hashtable->area, batch->chunks);
+	dsa_get_address(hashtable->area, batch->chunks);
 dsa_pointer next = chunk->next.shared;
 
 dsa_free(hashtable->area, batch->chunks);
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 0a3f32f731..b29a8ff48b 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1170,7 +1170,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 		{
 			SharedTuplestoreAccessor *inner_tuples;
 			Barrier*batch_barrier =
-			&hashtable->batches[batchno].shared->batch_barrier;
+				&hashtable->batches[batchno].shared->batch_barrier;
 
 			switch (BarrierAttach(batch_barrier))
 			{
@@ -1558,7 +1558,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 {
 	int			plan_node_id = state->js.ps.plan->plan_node_id;
 	ParallelHashJoinState *pstate =
-	shm_toc_lookup(pcxt->toc, plan_node_id, false);
+		shm_toc_lookup(pcxt->toc, plan_node_id, false);
 
 	/*
 	 * It would be possible to reuse the shared hash table in single-batch
@@ -1593,7 +1593,7 @@ ExecHashJoinInitializeWorker(HashJoinState *state,
 	HashState  *hashNode;
 	int			plan_node_id = state->js.ps.plan->plan_node_id;
 	ParallelHashJoinState *pstate =
-	shm_toc_lookup(pwcxt->toc, 

Re: Memory leak from ExecutorState context?

2023-05-17 Thread Melanie Plageman
On Wed, May 17, 2023 at 07:10:08PM +0200, Jehan-Guillaume de Rorthais wrote:
> On Tue, 16 May 2023 16:00:52 -0400
> Melanie Plageman  wrote:
> > > From 309ad354b7a9e4dfa01b2985bd883829f5e0eba0 Mon Sep 17 00:00:00 2001
> > > From: Jehan-Guillaume de Rorthais 
> > > Date: Tue, 16 May 2023 15:42:14 +0200
> > > Subject: [PATCH v8 2/3] Allocate hash batches related BufFile in a 
> > > dedicated
> > There are some existing indentation issues in these files, but you can
> > leave those or put them in a separate commit.
> 
> Reindented in v9.
> 
> I put existing indentation issues in a separate commit to keep the actual
> patches clean from distractions.

It is a matter of opinion, but I tend to prefer the commit with the fix
for the existing indentation issues to be first in the patch set.

> > > diff --git a/src/backend/executor/nodeHashjoin.c
> > > b/src/backend/executor/nodeHashjoin.c @@ -1313,21 +1314,30 @@
> > > ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
> > >   * The data recorded in the file for each tuple is its hash value,
> > >   * then the tuple in MinimalTuple format.
> > >   *
> > > - * Note: it is important always to call this in the regular executor
> > > - * context, not in a shorter-lived context; else the temp file buffers
> > > - * will get messed up.
> > > + * If this is the first write to the batch file, this function first
> > > + * create it.  The associated BufFile buffer is allocated in the given
> > > + * context.  It is important to always give the HashSpillContext
> > > + * context.  First to avoid a shorter-lived context, else the temp file
> > > + * buffers will get messed up.  Second, for a better accounting of the
> > > + * spilling memory consumption.
> > > + *
> > >   */  
> > 
> > Here is my suggested wording for this block comment:
> > 
> > The batch file is lazily created. If this is the first tuple written to
> > this batch, the batch file is created and its buffer is allocated in the
> > given context. It is important to pass in a memory context which will
> > live for the entirety of the lifespan of the batch.
> 
> Reworded. The context must actually survive the batch itself, not just live
> during the lifespan of the batch.

I've added a small recommended change to this inline.

> > >  void
> > >  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> > > -                  BufFile **fileptr)
> > > +                  BufFile **fileptr, MemoryContext cxt)
> > >  {
> 
> Note that I changed this to:
> 
>  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
>BufFile **fileptr, HashJoinTable hashtable) {
> 
> As this function must allocate the BufFile buffer in spillCxt, I suppose
> we should force it explicitly in the function code.
> 
> Plus, having the hashtable locally could be useful for a later patch to,
> e.g., track allocated memory in spaceUsed more precisely.

If you want to pass the hashtable instead of the memory context, I think
you'll need to explain why you still pass the BufFile pointer (because
ExecHashJoinSaveTuple() is called for both inner and outer batch files)
instead of getting it from the hashtable's arrays of BufFile pointers.
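
For illustration, here is a rough sketch of what the revised function could
look like under that proposal (a sketch only, not necessarily the final code;
the spillCxt switch and the reason for keeping fileptr are the points at
issue):

  /*
   * Sketch: fileptr is still passed in because the caller chooses either
   * hashtable->innerBatchFile[batchno] or hashtable->outerBatchFile[batchno];
   * the hashtable is passed so the BufFile and its buffer are created in
   * hashtable->spillCxt.
   */
  void
  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
                        BufFile **fileptr, HashJoinTable hashtable)
  {
      BufFile    *file = *fileptr;

      if (file == NULL)
      {
          /* First write to this batch file, so create it in spillCxt. */
          MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt);

          file = BufFileCreateTemp(false);
          *fileptr = file;

          MemoryContextSwitchTo(oldctx);
      }

      BufFileWrite(file, &hashvalue, sizeof(uint32));
      BufFileWrite(file, tuple, tuple->t_len);
  }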

> From c7b70dec3f4c162ea590b53a407c39dfd7ade873 Mon Sep 17 00:00:00 2001
> From: Jehan-Guillaume de Rorthais 
> Date: Tue, 16 May 2023 15:42:14 +0200
> Subject: [PATCH v9 2/3] Dedicated memory context for hash join spill buffers

> @@ -1310,22 +1311,38 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
>   *
>   * The data recorded in the file for each tuple is its hash value,
>   * then the tuple in MinimalTuple format.
> - *
> - * Note: it is important always to call this in the regular executor
> - * context, not in a shorter-lived context; else the temp file buffers
> - * will get messed up.
>   */
>  void
>  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> -   BufFile **fileptr)
> +   BufFile **fileptr, HashJoinTable hashtable)
>  {
>   BufFile*file = *fileptr;
>  
>   if (file == NULL)
>   {
> - /* First write to this batch file, so open it. */
> + MemoryContext oldctx;
> +
> + /*
> +  * The batch file is lazily created. If this is the first tuple
> +  * written to this batch, the batch file is created and its buffer is
> +  * allocated in the spillCxt context, NOT in the batchCxt.
> +  *
> +  * During the building phase, inner batch are created with their temp
> +  * file buffers. These buffers are released later, after the batch is
> +  * loaded back to memory during the outer side scan. That explains why
> +  * it is important to use a memory context which live longer than the
> +  * batch itself or some temp file buffers will get messed up.
> +  *
> +  * Also, we use 

Re: Memory leak from ExecutorState context?

2023-05-17 Thread Jehan-Guillaume de Rorthais
On Tue, 16 May 2023 16:00:52 -0400
Melanie Plageman  wrote:

> On Tue, May 16, 2023 at 04:00:51PM +0200, Jehan-Guillaume de Rorthais wrote:
> 
> > From e5ecd466172b7bae2f1be294c1a5e70ce2b43ed8 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Thu, 30 Apr 2020 07:16:28 -0700
> > Subject: [PATCH v8 1/3] Describe hash join implementation
> > 
> > This is just a draft to spark conversation on what a good comment might
> > be like in this file on how the hybrid hash join algorithm is
> > implemented in Postgres. I'm pretty sure this is the accepted term for
> > this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join  
> 
> I recommend changing the commit message to something like this:
> 
>   Describe hash join implementation
> 
>   Add details to the block comment in nodeHashjoin.c describing the
>   hybrid hash join algorithm at a high level.
> 
>   Author: Melanie Plageman 
>   Author: Jehan-Guillaume de Rorthais 
>   Reviewed-by: Tomas Vondra 
>   Discussion: https://postgr.es/m/20230516160051.4267a800%40karst

Done, but assigning myself as a reviewer as I don't remember having authored
anything in this but the reference to the Hybrid hash page, which is
questionable according to Tomas :)

> > diff --git a/src/backend/executor/nodeHashjoin.c
> > b/src/backend/executor/nodeHashjoin.c index 0a3f32f731..93a78f6f74 100644
> > --- a/src/backend/executor/nodeHashjoin.c
> > ...
> > + * TODO: should we discuss that tuples can only spill forward?  
> 
> I would just cut this for now since we haven't started on an agreed-upon
> wording.

Removed in v9.

> > From 309ad354b7a9e4dfa01b2985bd883829f5e0eba0 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais 
> > Date: Tue, 16 May 2023 15:42:14 +0200
> > Subject: [PATCH v8 2/3] Allocate hash batches related BufFile in a dedicated
> >  context  
> 
> Here is a draft commit message for the second patch:
> 
> ...

Thanks. Adopted with some minor rewording... hopefully it's OK.

> I recommend editing and adding links to the various discussions on this
> topic from your research.

Done in v9.

> As for the patch itself, I noticed that there are few things needing
> pgindenting.
>
> I usually do the following to run pgindent (in case you haven't done
> this recently).
> 
> ...

Thank you for your recipe.

> There are some existing indentation issues in these files, but you can
> leave those or put them in a separate commit.

Reindented in v9.

I put existing indentation issues in a separate commit to keep the actual
patches clean from distractions.

> ...
> Add a period at the end of this comment.
> 
> > +   /*
> > +* Use hash join spill memory context to allocate accessors and
> > their
> > +* buffers
> > +*/

Fixed in v9.

> Add a period at the end of this comment.
> 
> > +   /* Use hash join spill memory context to allocate accessors */

Fixed in v9.

> > diff --git a/src/backend/executor/nodeHashjoin.c
> > b/src/backend/executor/nodeHashjoin.c @@ -1313,21 +1314,30 @@
> > ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
> >   * The data recorded in the file for each tuple is its hash value,
> >   * then the tuple in MinimalTuple format.
> >   *
> > - * Note: it is important always to call this in the regular executor
> > - * context, not in a shorter-lived context; else the temp file buffers
> > - * will get messed up.
> > + * If this is the first write to the batch file, this function first
> > + * create it.  The associated BufFile buffer is allocated in the given
> > + * context.  It is important to always give the HashSpillContext
> > + * context.  First to avoid a shorter-lived context, else the temp file
> > + * buffers will get messed up.  Second, for a better accounting of the
> > + * spilling memory consumption.
> > + *
> >   */  
> 
> Here is my suggested wording for this block comment:
> 
> The batch file is lazily created. If this is the first tuple written to
> this batch, the batch file is created and its buffer is allocated in the
> given context. It is important to pass in a memory context which will
> live for the entirety of the lifespan of the batch.

Reworded. The context must actually survive the batch itself, not just live
during the lifespan of the batch.

> >  void
> >  ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
> > - BufFile **fileptr)
> > +                 BufFile **fileptr, MemoryContext cxt)
> >  {

Note that I changed this to:

 ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
   BufFile **fileptr, HashJoinTable hashtable) {

As this function must allocate the BufFile buffer in spillCxt, I suppose
we should force it explicitly in the function code.

Plus, having the hashtable locally could be useful for a later patch to,
e.g., track allocated memory in spaceUsed more precisely.

> > diff --git a/src/include/executor/hashjoin.h
> > b/src/include/executor/hashjoin.h index 

Re: Memory leak from ExecutorState context?

2023-05-16 Thread Melanie Plageman
On Sun, May 14, 2023 at 12:10:00AM +0200, Tomas Vondra wrote:
> On 5/12/23 23:36, Melanie Plageman wrote:
> > Thanks for continuing to work on this.
> > 
> > Are you planning to modify what is displayed for memory usage in
> > EXPLAIN ANALYZE?
> > 
> 
> We could do that, but we can do that separately - it's a separate and
> independent improvement, I think.
> 
> Also, do you have a proposal how to change the explain output? In
> principle we already have the number of batches, so people can calculate
> the "peak" amount of memory (assuming they realize what it means).

I don't know that we can expect people looking at the EXPLAIN output to
know how much space the different file buffers are taking up. Not to
mention that it is different for parallel hash join.

I like Jehan-Guillaume's idea in his email responding to this point:

> We could add the batch memory consumption with the number of batches. Eg.:

>  Buckets: 4096 (originally 4096) 
>  Batches: 32768 (originally 8192) using 256MB
>  Memory Usage: 192kB

However, I think we can discuss this for 17.

> I think the main problem with adding this info to EXPLAIN is that I'm
> not sure it's very useful in practice. I've only really heard about this
> memory explosion issue when the query dies with OOM or takes forever,
> but EXPLAIN ANALYZE requires the query to complete.
> 
> A separate memory context (which  pg_log_backend_memory_contexts can
> dump to server log) is more valuable, I think.

Yes, I'm satisfied with scoping this to only the patch with the
dedicated memory context for now.

> > Also, since that won't help a user who OOMs, I wondered if the spillCxt
> > is helpful on its own or if we need some kind of logging message for
> > users to discover that this is what led them to running out of memory.
> > 
> 
> I think the separate memory context is definitely an improvement,
> helpful on its own, as it makes it clear *what* allocated the memory. It
> requires having the memory context stats, but we may already dump them
> into the server log if malloc returns NULL. Granted, it depends on how
> the system is configured and it won't help when the OOM killer hits :-(

Right. I suppose if someone had an OOM and the OOM killer ran, they may
be motivated to disable vm overcommit and then perhaps the memory
context name will show up somewhere in an error message or log?

> I wouldn't object to having some sort of log message, but when exactly
> would we emit it? Presumably after exceeding some amount of memory, but
> what would it be? It can't be too low (not to trigger it too often) or
> too high (failing to report the issue). Also, do you think it should go
> to the user or just to the server log?

I think where the log is delivered depends on the conditions under which
we log -- if it is fairly preemptive, then doing so in the server log is
enough.

However, I think we can discuss this in the future. You are right that
the dedicated memory context by itself is an improvement.
Determining when to emit the log message seems like it will be too
difficult to accomplish in a day or so.

- Melanie




Re: Memory leak from ExecutorState context?

2023-05-16 Thread Melanie Plageman
On Tue, May 16, 2023 at 04:00:51PM +0200, Jehan-Guillaume de Rorthais wrote:

> From e5ecd466172b7bae2f1be294c1a5e70ce2b43ed8 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Thu, 30 Apr 2020 07:16:28 -0700
> Subject: [PATCH v8 1/3] Describe hash join implementation
> 
> This is just a draft to spark conversation on what a good comment might
> be like in this file on how the hybrid hash join algorithm is
> implemented in Postgres. I'm pretty sure this is the accepted term for
> this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join

I recommend changing the commit message to something like this:

Describe hash join implementation

Add details to the block comment in nodeHashjoin.c describing the
hybrid hash join algorithm at a high level.

Author: Melanie Plageman 
Author: Jehan-Guillaume de Rorthais 
Reviewed-by: Tomas Vondra 
Discussion: https://postgr.es/m/20230516160051.4267a800%40karst

> diff --git a/src/backend/executor/nodeHashjoin.c 
> b/src/backend/executor/nodeHashjoin.c
> index 0a3f32f731..93a78f6f74 100644
> --- a/src/backend/executor/nodeHashjoin.c
> +++ b/src/backend/executor/nodeHashjoin.c
> @@ -10,6 +10,47 @@
>   * IDENTIFICATION
>   * src/backend/executor/nodeHashjoin.c
>   *
> + * HASH JOIN
> + *
> + * This is based on the "hybrid hash join" algorithm described shortly in the
> + * following page, and in length in the pdf in page's notes:
> + *
> + *   https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
> + *
> + * If the inner side tuples of a hash join do not fit in memory, the hash join
> + * can be executed in multiple batches.
> + *
> + * If the statistics on the inner side relation are accurate, planner chooses a
> + * multi-batch strategy and estimates the number of batches.
> + *
> + * The query executor measures the real size of the hashtable and increases the
> + * number of batches if the hashtable grows too large.
> + *
> + * The number of batches is always a power of two, so an increase in the number
> + * of batches doubles it.
> + *
> + * Serial hash join measures batch size lazily -- waiting until it is loading a
> + * batch to determine if it will fit in memory. While inserting tuples into the
> + * hashtable, serial hash join will, if that tuple were to exceed work_mem,
> + * dump out the hashtable and reassign them either to other batch files or the
> + * current batch resident in the hashtable.
> + *
> + * Parallel hash join, on the other hand, completes all changes to the number
> + * of batches during the build phase. If it increases the number of batches, it
> + * dumps out all the tuples from all batches and reassigns them to entirely new
> + * batch files. Then it checks every batch to ensure it will fit in the space
> + * budget for the query.
> + *
> + * In both parallel and serial hash join, the executor currently makes a best
> + * effort. If a particular batch will not fit in memory, it tries doubling the
> + * number of batches. If after a batch increase, there is a batch which
> + * retained all or none of its tuples, the executor disables growth in the
> + * number of batches globally. After growth is disabled, all batches that would
> + * have previously triggered an increase in the number of batches instead
> + * exceed the space allowed.
> + *
> + * TODO: should we discuss that tuples can only spill forward?

I would just cut this for now since we haven't started on an agreed-upon
wording.
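
(For context, the "spill forward" property follows from how the batch number
is derived from the hash value: since nbatch is always a power of two, the
batch mask only ever gains bits when nbatch doubles. A rough standalone toy
sketch of that argument; it uses a plain shift, while the real
ExecHashGetBucketAndBatch() rotates the hash instead, but the doubling
argument is the same:

  #include <stdint.h>
  #include <stdio.h>

  /* Toy batch assignment: take the hash bits above the bucket bits. */
  static uint32_t
  batch_of(uint32_t hashvalue, int log2_nbuckets, uint32_t nbatch)
  {
      /* nbatch is always a power of two, so (nbatch - 1) is a bit mask */
      return (hashvalue >> log2_nbuckets) & (nbatch - 1);
  }

  int
  main(void)
  {
      uint32_t  hash = 0x9E3779B9;
      int       log2_nbuckets = 10;

      /*
       * When nbatch doubles, the mask gains one high bit, so a tuple in
       * batch b either stays in b or moves to b + old_nbatch -- it can only
       * move to a higher-numbered ("later") batch, never backwards.
       */
      for (uint32_t nbatch = 8; nbatch <= 64; nbatch *= 2)
          printf("nbatch = %2u -> batch %u\n",
                 (unsigned) nbatch,
                 (unsigned) batch_of(hash, log2_nbuckets, nbatch));
      return 0;
  }
)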

> From 309ad354b7a9e4dfa01b2985bd883829f5e0eba0 Mon Sep 17 00:00:00 2001
> From: Jehan-Guillaume de Rorthais 
> Date: Tue, 16 May 2023 15:42:14 +0200
> Subject: [PATCH v8 2/3] Allocate hash batches related BufFile in a dedicated
>  context

Here is a draft commit message for the second patch:

Dedicated memory context for hash join spill metadata

A hash join's hashtable may be split up into multiple batches if it
would otherwise exceed work_mem. The number of batches is doubled each
time a given batch is determined not to fit in memory. Each batch file
is allocated with a block-sized buffer for buffering tuples (parallel
hash join has additional sharedtuplestore accessor buffers).

In some cases, often with skewed data, bad stats, or very large
datasets, users can run out of memory while attempting to fit an
oversized batch in memory solely from the memory overhead of all the
batch files' buffers.

Batch files were allocated in the ExecutorState memory context, making
it very hard to identify when this batch explosion was the source of an
OOM. By allocating the batch files in a dedicated memory context, it
should be easier for users to identify the cause of an OOM and work to
avoid it.

I recommend editing and adding links to the various discussions on this
topic from your research.

As for the patch itself, I noticed that there are few things needing
pgindenting.

I 

Re: Memory leak from ExecutorState context?

2023-05-16 Thread Jehan-Guillaume de Rorthais
Hi,

On Tue, 16 May 2023 12:01:51 +0200 Tomas Vondra 
wrote:

> On 5/16/23 00:15, Jehan-Guillaume de Rorthais wrote:
> > On Sat, 13 May 2023 23:47:53 +0200
> > Tomas Vondra  wrote:
> ...
> >> I'm not really sure about calling this "hybrid hash-join". What does it
> >> even mean? The new comments simply describe the existing batching, and
> >> how we increment number of batches, etc.  
> > 
> > Unless I'm wrong, I believe it comes from the "Hybrid hash join" algorithm
> > as described here (+ see the pdf referenced in this page's notes):
> > 
> >   https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
> > 
> > I added the ref in the v7 documentation to avoid future confusion; is it OK?
> 
> Ah, I see. I'd just leave out the "hybrid" entirely. We've lived without
> it until now, we know what the implementation does ...

I changed the title but kept the reference. There are still two other uses
of "hybrid hash join algorithm" in function and code comments. Keeping the ref
in this header doesn't cost much and helps newcomers.

> >> 0002
> >> ...
> >> The modified comment in hashjoin.h has a some alignment issues.  
> > 
> > I see no alignment issue. I suppose what bothers you might be the spaces
> > before spillCxt and batchCxt to show they are children of hashCxt? Should I
> > remove them?
> 
> It didn't occur to me this is intentional to show the contexts are
> children of hashCxt, so maybe it's not a good way to document that. I'd
> remove it, and if you think it's something worth mentioning, I'd add an
> explicit comment.

Changed.

Thanks,
>From e5ecd466172b7bae2f1be294c1a5e70ce2b43ed8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH v8 1/3] Describe hash join implementation

This is just a draft to spark conversation on what a good comment might
be like in this file on how the hybrid hash join algorithm is
implemented in Postgres. I'm pretty sure this is the accepted term for
this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
---
 src/backend/executor/nodeHashjoin.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 0a3f32f731..93a78f6f74 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -10,6 +10,47 @@
  * IDENTIFICATION
  *	  src/backend/executor/nodeHashjoin.c
  *
+ * HASH JOIN
+ *
+ * This is based on the "hybrid hash join" algorithm described shortly in the
+ * following page, and in length in the pdf in page's notes:
+ *
+ *   https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
+ *
+ * If the inner side tuples of a hash join do not fit in memory, the hash join
+ * can be executed in multiple batches.
+ *
+ * If the statistics on the inner side relation are accurate, planner chooses a
+ * multi-batch strategy and estimates the number of batches.
+ *
+ * The query executor measures the real size of the hashtable and increases the
+ * number of batches if the hashtable grows too large.
+ *
+ * The number of batches is always a power of two, so an increase in the number
+ * of batches doubles it.
+ *
+ * Serial hash join measures batch size lazily -- waiting until it is loading a
+ * batch to determine if it will fit in memory. While inserting tuples into the
+ * hashtable, serial hash join will, if that tuple were to exceed work_mem,
+ * dump out the hashtable and reassign them either to other batch files or the
+ * current batch resident in the hashtable.
+ *
+ * Parallel hash join, on the other hand, completes all changes to the number
+ * of batches during the build phase. If it increases the number of batches, it
+ * dumps out all the tuples from all batches and reassigns them to entirely new
+ * batch files. Then it checks every batch to ensure it will fit in the space
+ * budget for the query.
+ *
+ * In both parallel and serial hash join, the executor currently makes a best
+ * effort. If a particular batch will not fit in memory, it tries doubling the
+ * number of batches. If after a batch increase, there is a batch which
+ * retained all or none of its tuples, the executor disables growth in the
+ * number of batches globally. After growth is disabled, all batches that would
+ * have previously triggered an increase in the number of batches instead
+ * exceed the space allowed.
+ *
+ * TODO: should we discuss that tuples can only spill forward?
+ *
  * PARALLELISM
  *
  * Hash joins can participate in parallel query execution in several ways.  A
-- 
2.40.1

>From 309ad354b7a9e4dfa01b2985bd883829f5e0eba0 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Tue, 16 May 2023 15:42:14 +0200
Subject: [PATCH v8 2/3] Allocate hash batches related BufFile in a dedicated
 context

---
 src/backend/executor/nodeHash.c   | 43 ---
 src/backend/executor/nodeHashjoin.c   | 22 
 src/backend/utils/sort/sharedtuplestore.c |  8 +
 

Re: Memory leak from ExecutorState context?

2023-05-16 Thread Tomas Vondra
On 5/16/23 00:15, Jehan-Guillaume de Rorthais wrote:
> Hi,
> 
> Thanks for your review!
> 
> On Sat, 13 May 2023 23:47:53 +0200
> Tomas Vondra  wrote:
> 
>> Thanks for the patches. A couple mostly minor comments, to complement
>> Melanie's review:
>>
>> 0001
>>
>> I'm not really sure about calling this "hybrid hash-join". What does it
>> even mean? The new comments simply describe the existing batching, and
>> how we increment number of batches, etc.
> 
> Unless I'm wrong, I believe it comes from the "Hybrid hash join" algorithm as
> described here (+ see the pdf referenced in this page's notes):
> 
>   https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
> 
> I added the ref in the v7 documentation to avoid future confusion; is it OK?
> 

Ah, I see. I'd just leave out the "hybrid" entirely. We've lived without
it until now, we know what the implementation does ...

>> 0002
>>
>> I wouldn't call the new ExecHashJoinSaveTuple parameter spillcxt but
>> just something generic (e.g. cxt). Yes, we're passing spillCxt, but from
>> the function's POV it's just a pointer.
> 
> changed in v7.
> 
>> I also wouldn't move the ExecHashJoinSaveTuple comment inside - it just
>> needs to be reworded that we're expecting the context to be with the
>> right lifespan. The function comment is the right place to document such
>> expectations, people are less likely to read the function body.
> 
> moved and reworded in v7.
> 
>> The modified comment in hashjoin.h has a some alignment issues.
> 
> I see no alignment issue. I suppose what bothers you might be the spaces
> before spillCxt and batchCxt to show they are children of hashCxt? Should I
> remove them?
> 

It didn't occur to me this is intentional to show the contexts are
children of hashCxt, so maybe it's not a good way to document that. I'd
remove it, and if you think it's something worth mentioning, I'd add an
explicit comment.
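
If an explicit comment is preferred, it could take roughly this shape (a
sketch only, not settled wording; the note that batchCxt is reset between
batches is an assumption about what such a comment would say):

  /*
   * hashCxt  - lives for the duration of the entire hash join; parent of
   *            the two contexts below
   * batchCxt - holds the current batch's hash table contents; reset when
   *            advancing to the next batch
   * spillCxt - holds the batch spill files' buffers, which must outlive
   *            any individual batch
   */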


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-05-15 Thread Jehan-Guillaume de Rorthais
Hi,

Thanks for your review!

On Sat, 13 May 2023 23:47:53 +0200
Tomas Vondra  wrote:

> Thanks for the patches. A couple mostly minor comments, to complement
> Melanie's review:
> 
> 0001
> 
> I'm not really sure about calling this "hybrid hash-join". What does it
> even mean? The new comments simply describe the existing batching, and
> how we increment number of batches, etc.

Unless I'm wrong, I believe it comes from the "Hybrid hash join" algorithm as
described here (+ see the pdf referenced in this page's notes):

  https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join

I added the ref in the v7 documentation to avoid future confusion; is it OK?

> 0002
> 
> I wouldn't call the new ExecHashJoinSaveTuple parameter spillcxt but
> just something generic (e.g. cxt). Yes, we're passing spillCxt, but from
> the function's POV it's just a pointer.

changed in v7.

> I also wouldn't move the ExecHashJoinSaveTuple comment inside - it just
> needs to be reworded that we're expecting the context to be with the
> right lifespan. The function comment is the right place to document such
> expectations, people are less likely to read the function body.

moved and reworded in v7.

> The modified comment in hashjoin.h has a some alignment issues.

I see no alignment issue. I suppose what bothers you might be the spaces
before spillCxt and batchCxt to show they are children of hashCxt? Should I
remove them?

Regards,
>From 997a82a0ee9eda1774b5210401b2d9bf281318da Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH 1/3] Describe hybrid hash join implementation

This is just a draft to spark conversation on what a good comment might
be like in this file on how the hybrid hash join algorithm is
implemented in Postgres. I'm pretty sure this is the accepted term for
this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
---
 src/backend/executor/nodeHashjoin.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 0a3f32f731..a39164c15d 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -10,6 +10,47 @@
  * IDENTIFICATION
  *	  src/backend/executor/nodeHashjoin.c
  *
+ *   HYBRID HASH JOIN
+ *
+ *  This is based on the hybrid hash join algorithm described shortly in the
+ *  following page, and in length in the pdf in page's notes:
+ *
+ *https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
+ *
+ *  If the inner side tuples of a hash join do not fit in memory, the hash join
+ *  can be executed in multiple batches.
+ *
+ *  If the statistics on the inner side relation are accurate, planner chooses a
+ *  multi-batch strategy and estimates the number of batches.
+ *
+ *  The query executor measures the real size of the hashtable and increases the
+ *  number of batches if the hashtable grows too large.
+ *
+ *  The number of batches is always a power of two, so an increase in the number
+ *  of batches doubles it.
+ *
+ *  Serial hash join measures batch size lazily -- waiting until it is loading a
+ *  batch to determine if it will fit in memory. While inserting tuples into the
+ *  hashtable, serial hash join will, if that tuple were to exceed work_mem,
+ *  dump out the hashtable and reassign them either to other batch files or the
+ *  current batch resident in the hashtable.
+ *
+ *  Parallel hash join, on the other hand, completes all changes to the number
+ *  of batches during the build phase. If it increases the number of batches, it
+ *  dumps out all the tuples from all batches and reassigns them to entirely new
+ *  batch files. Then it checks every batch to ensure it will fit in the space
+ *  budget for the query.
+ *
+ *  In both parallel and serial hash join, the executor currently makes a best
+ *  effort. If a particular batch will not fit in memory, it tries doubling the
+ *  number of batches. If after a batch increase, there is a batch which
+ *  retained all or none of its tuples, the executor disables growth in the
+ *  number of batches globally. After growth is disabled, all batches that would
+ *  have previously triggered an increase in the number of batches instead
+ *  exceed the space allowed.
+ *
+ *  TODO: should we discuss that tuples can only spill forward?
+ *
  * PARALLELISM
  *
  * Hash joins can participate in parallel query execution in several ways.  A
-- 
2.40.1

>From 52e989d5eb6405e86e0d460dffbfaca292bc4274 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Mon, 27 Mar 2023 15:54:39 +0200
Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
 context

---
 src/backend/executor/nodeHash.c   | 43 ---
 src/backend/executor/nodeHashjoin.c   | 22 
 src/backend/utils/sort/sharedtuplestore.c |  8 +
 src/include/executor/hashjoin.h   | 30 ++--
 src/include/executor/nodeHashjoin.h  

Re: Memory leak from ExecutorState context?

2023-05-15 Thread Jehan-Guillaume de Rorthais
On Sun, 14 May 2023 00:10:00 +0200
Tomas Vondra  wrote:

> On 5/12/23 23:36, Melanie Plageman wrote:
> > Thanks for continuing to work on this.
> > 
> > Are you planning to modify what is displayed for memory usage in
> > EXPLAIN ANALYZE?

Yes, I have already started to work on this. Tracking spilling memory in
spaceUsed/spacePeak changes the current behavior of the serial HJ because it
will increase the number of batches much faster, so this is a no-go for v16.

I'll try to accumulate the total allocated (used + not used) spill context memory
in the instrumentation. This is gross, but it avoids tracking the spilling memory
in its own structure entry.

> We could do that, but we can do that separately - it's a separate and
> independent improvement, I think.

+1

> Also, do you have a proposal how to change the explain output? In
> principle we already have the number of batches, so people can calculate
> the "peak" amount of memory (assuming they realize what it means).

We could add the batch memory consumption with the number of batches. Eg.:

  Buckets: 4096 (originally 4096)  
  Batches: 32768 (originally 8192) using 256MB
  Memory Usage: 192kB

> I think the main problem with adding this info to EXPLAIN is that I'm
> not sure it's very useful in practice. I've only really heard about this
> memory explosion issue when the query dies with OOM or takes forever,
> but EXPLAIN ANALYZE requires the query to complete.

It could be useful to help admins tuning their queries realize that the current
number of batches is consuming much more memory than the join itself.

This could help them fix the issue before an OOM happens.

Regards,




Re: Memory leak from ExecutorState context?

2023-05-13 Thread Tomas Vondra
On 5/12/23 23:36, Melanie Plageman wrote:
> Thanks for continuing to work on this.
> 
> Are you planning to modify what is displayed for memory usage in
> EXPLAIN ANALYZE?
> 

We could do that, but we can do that separately - it's a separate and
independent improvement, I think.

Also, do you have a proposal how to change the explain output? In
principle we already have the number of batches, so people can calculate
the "peak" amount of memory (assuming they realize what it means).

I think the main problem with adding this info to EXPLAIN is that I'm
not sure it's very useful in practice. I've only really heard about this
memory explosion issue when the query dies with OOM or takes forever,
but EXPLAIN ANALYZE requires the query to complete.

A separate memory context (which  pg_log_backend_memory_contexts can
dump to server log) is more valuable, I think.

> Also, since that won't help a user who OOMs, I wondered if the spillCxt
> is helpful on its own or if we need some kind of logging message for
> users to discover that this is what led them to running out of memory.
> 

I think the separate memory context is definitely an improvement,
helpful on its own, as it makes it clear *what* allocated the memory. It
requires having the memory context stats, but we may already dump them
into the server log if malloc returns NULL. Granted, it depends on how
the system is configured and it won't help when the OOM killer hits :-(

I wouldn't object to having some sort of log message, but when exactly
would we emit it? Presumably after exceeding some amount of memory, but
what would it be? It can't be too low (not to trigger it too often) or
too high (failing to report the issue). Also, do you think it should go
to the user or just to the server log?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-05-13 Thread Tomas Vondra
Hi,

Thanks for the patches. A couple mostly minor comments, to complement
Melanie's review:

0001

I'm not really sure about calling this "hybrid hash-join". What does it
even mean? The new comments simply describe the existing batching, and
how we increment number of batches, etc.

When someone says "hybrid" it usually means a combination of two very
different approaches. Say, a join switching between NL and HJ would be
hybrid. But this is not it.

I'm not against describing the behavior, but the comment either needs to
explain why it's hybrid or it should not be called that.


0002

I wouldn't call the new ExecHashJoinSaveTuple parameter spillcxt but
just something generic (e.g. cxt). Yes, we're passing spillCxt, but from
the function's POV it's just a pointer.

I also wouldn't move the ExecHashJoinSaveTuple comment inside - it just
needs to be reworded that we're expecting the context to be with the
right lifespan. The function comment is the right place to document such
expectations, people are less likely to read the function body.

The modified comment in hashjoin.h has a some alignment issues.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-05-12 Thread Melanie Plageman
Thanks for continuing to work on this.

Are you planning to modify what is displayed for memory usage in
EXPLAIN ANALYZE?

Also, since that won't help a user who OOMs, I wondered if the spillCxt
is helpful on its own or if we need some kind of logging message for
users to discover that this is what led them to running out of memory.

On Wed, May 10, 2023 at 02:24:19PM +0200, Jehan-Guillaume de Rorthais wrote:
> On Mon, 8 May 2023 11:56:48 -0400 Melanie Plageman 
>  wrote:
> 
> > > @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
> > >  hashtable, nbatch, hashtable->spaceUsed);
> > >  #endif
> > >  
> > > - oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
> > > -
> > >   if (hashtable->innerBatchFile == NULL)
> > >   {
> > > + MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);
> > > +
> > >   /* we had no file arrays before */
> > >   hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
> > >   hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> > > +  
> > > + MemoryContextSwitchTo(oldcxt);
> > > +
> > >   /* time to establish the temp tablespaces, too */
> > >   PrepareTempTablespaces();
> > >   }
> > > @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)  
> > 
> > I don't see a reason to call repalloc0_array() in a different context
> > than the initial palloc0_array().
> 
> Unless I'm wrong, wrapping the whole if/else block in the hashtable->fileCxt
> memory context seemed useless, as repalloc() actually reallocates in the
> context where the original allocation occurred. So I only wrapped the
> palloc() calls.

My objection is less about correctness and more about the diff. The
existing memory context switch is around the whole if/else block. If you
want to move it to only wrap the if statement, I would do that in a
separate commit with a message describing the rationale. It doesn't seem
to save us much and it makes the diff a bit more confusing. I don't feel
strongly enough about this to protest much more, though.

> > > diff --git a/src/include/executor/hashjoin.h
> > > b/src/include/executor/hashjoin.h index 8ee59d2c71..74867c3e40 100644
> > > --- a/src/include/executor/hashjoin.h
> > > +++ b/src/include/executor/hashjoin.h
> > > @@ -25,10 +25,14 @@
> > >   *
> > >   * Each active hashjoin has a HashJoinTable control block, which is
> > >   * palloc'd in the executor's per-query context.  All other storage needed
> > > - * for the hashjoin is kept in private memory contexts, two for each hashjoin.
> > > + * for the hashjoin is kept in private memory contexts, three for each
> > > + * hashjoin:  
> > 
> > Maybe "hash table control block". I know the phrase "control block" is
> > used elsewhere in the comments, but it is a bit confusing. Also, I wish
> > there was a way to make it clear this is for the hashtable but relevant
> > to all batches.
> 
> I tried to reword the comment with this additional info in mind in v6. Does it
> match what you had in mind?

Review of that below.

> > So, if we are going to allocate the array of pointers to the spill files
> > in the fileCxt, we should revise this comment. As I mentioned above, I
> > agree with you that if the SharedTupleStore-related buffers are also
> > allocated in this context, perhaps it shouldn't be called the fileCxt.
> > 
> > One idea I had is calling it the spillCxt. Almost all memory allocated in 
> > this
> > context is related to needing to spill to permanent storage during 
> > execution.
> 
> Agree
> 
> > The one potential confusing part of this is batch 0 for parallel hash
> > join. I would have to dig back into it again, but from a cursory look at
> > ExecParallelHashJoinSetUpBatches() it seems like batch 0 also gets a
> > shared tuplestore with associated accessors and files even if it is a
> > single batch parallel hashjoin.
> > 
> > Are the parallel hash join read_buffer and write_chunk also used for a
> > single batch parallel hash join?
> 
> I don't think so.
> 
> For the inner side, there are various Assert()s around the batchno==0 special
> case. Plus, it always has its own block when inserting in a batch, to directly
> write in shared memory calling ExecParallelHashPushTuple().
> 
> The outer side of the join actually creates all batches using the shared tuple
> storage mechanism, including batch 0, **only** if the number of batches is
> greater than 1. See in ExecParallelHashJoinOuterGetTuple:
> 
>   /*
>* In the Parallel Hash case we only run the outer plan directly for
>* single-batch hash joins. Otherwise we have to go to batch files, even
>* for batch 0.
>*/
>   if (curbatch == 0 && hashtable->nbatch == 1)
>   {
>   slot = ExecProcNode(outerNode);
> 
> So, for a single batch PHJ, it seems there are no temp files involved.

spill context seems appropriate, then.

> 
> > Though, perhaps spillCxt still makes sense? It doesn't specify
> > multi-batch...
> 
> I'm not sure the see 

Re: Memory leak from ExecutorState context?

2023-05-10 Thread Jehan-Guillaume de Rorthais
Thank you for your review!

On Mon, 8 May 2023 11:56:48 -0400
Melanie Plageman  wrote:

> ...
> > 4. accessor->read_buffer is currently allocated in accessor->context as
> > well.
> > 
> >This buffer holds tuple read from the fileset. This is still a buffer,
> > but not related to any file anymore...
> > 
> > Because of 3 and 4, and the gross context granularity of SharedTuplestore
> > related-code, I'm now wondering if "fileCxt" shouldn't be renamed "bufCxt".
> 
> I think bufCxt is a potentially confusing name. The context contains
> pointers to the batch files and saying those are related to buffers is
> confusing. It also sounds like it could include any kind of buffer
> related to the hashtable or hash join.
> 
> Perhaps we could call this memory context the "spillCxt"--indicating it
> is for the memory required for spilling to permanent storage while
> executing hash joins.

"Spilling" seems fair and a large enough net to grab everything around temp
files and accessing them.

> I discuss this more in my code review below.
> 
> > From c5ed2ae2c2749af4f5058b012dc5e8a9e1529127 Mon Sep 17 00:00:00 2001
> > From: Jehan-Guillaume de Rorthais 
> > Date: Mon, 27 Mar 2023 15:54:39 +0200
> > Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
> >  context  
> 
> > diff --git a/src/backend/executor/nodeHash.c
> > b/src/backend/executor/nodeHash.c index 5fd1c5553b..a4fbf29301 100644
> > --- a/src/backend/executor/nodeHash.c
> > +++ b/src/backend/executor/nodeHash.c
> > @@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List
> > *hashOperators, List *hashCollations, 
> > if (nbatch > 1 && hashtable->parallel_state == NULL)
> > {
> > +   MemoryContext oldctx;
> > +
> > /*
> >  * allocate and initialize the file arrays in hashCxt (not
> > needed for
> >  * parallel case which uses shared tuplestores instead of
> > raw files) */
> > +   oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
> > +
> > hashtable->innerBatchFile = palloc0_array(BufFile *,
> > nbatch); hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> > /* The files will not be opened until needed... */
> > /* ... but make sure we have temp tablespaces established
> > for them */  
> 
> I haven't studied it closely, but I wonder if we shouldn't switch back
> into the oldctx before calling PrepareTempTablespaces().
> PrepareTempTablespaces() is explicit about what memory context it is
> allocating in, however, I'm not sure it makes sense to call it in the
> fileCxt. If you have a reason, you should add a comment and do the same
> in ExecHashIncreaseNumBatches().

I had no reason. I caught it in ExecHashIncreaseNumBatches() while reviewing
myself, but not here.

Line moved in v6.

> > @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
> >hashtable, nbatch, hashtable->spaceUsed);
> >  #endif
> >  
> > -   oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
> > -
> > if (hashtable->innerBatchFile == NULL)
> > {
> > +   MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);
> > +
> > /* we had no file arrays before */
> > hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
> > hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> > +  
> > +   MemoryContextSwitchTo(oldcxt);
> > +
> > /* time to establish the temp tablespaces, too */
> > PrepareTempTablespaces();
> > }
> > @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)  
> 
> I don't see a reason to call repalloc0_array() in a different context
> than the initial palloc0_array().

Unless I'm wrong, wrapping the whole if/else block in the hashtable->fileCxt
memory context seemed useless, as repalloc() actually reallocates in the context
where the original allocation occurred. So I only wrapped the palloc() calls.
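
Concretely, since repalloc() (and repalloc0_array()) keeps a chunk in the
context it was originally allocated in, only the initial allocation needs the
switch. A minimal sketch of the resulting pattern (illustrative only, not the
exact patch hunk):

  if (hashtable->innerBatchFile == NULL)
  {
      MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);

      /* first allocation: the file arrays land in fileCxt */
      hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
      hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);

      MemoryContextSwitchTo(oldcxt);

      /* time to establish the temp tablespaces, too */
      PrepareTempTablespaces();
  }
  else
  {
      /*
       * The arrays already live in fileCxt; repalloc0_array() keeps them
       * there, so no context switch is needed.
       */
      hashtable->innerBatchFile = repalloc0_array(hashtable->innerBatchFile,
                                                  BufFile *, oldnbatch, nbatch);
      hashtable->outerBatchFile = repalloc0_array(hashtable->outerBatchFile,
                                                  BufFile *, oldnbatch, nbatch);
  }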

> > diff --git a/src/backend/executor/nodeHashjoin.c
> > ...
> > @@ -1308,21 +1310,27 @@ ExecParallelHashJoinNewBatch(HashJoinState
> > *hjstate)
> >   * The data recorded in the file for each tuple is its hash value,  
> 
> It doesn't sound accurate to me to say that it should be called *in* the
> HashBatchFiles context. We now switch into that context, so you can
> probably remove this comment and add a comment above the switch into the
> filecxt which explains that the temp file buffers should be allocated in
> the filecxt (both because they need to be allocated in a sufficiently
> long-lived context and to increase visibility of their memory overhead).

Indeed. Comment moved and reworded in v6.

> > diff --git a/src/include/executor/hashjoin.h
> > b/src/include/executor/hashjoin.h index 8ee59d2c71..74867c3e40 100644
> > --- a/src/include/executor/hashjoin.h
> > +++ b/src/include/executor/hashjoin.h
> > @@ -25,10 +25,14 @@
> >   *
> >   * Each active hashjoin has a HashJoinTable control block, which is

Re: Memory leak from ExecutorState context?

2023-05-08 Thread Melanie Plageman
Thanks for continuing to work on this!

On Thu, May 04, 2023 at 07:30:06PM +0200, Jehan-Guillaume de Rorthais wrote:
> On Fri, 21 Apr 2023 16:44:48 -0400 Melanie Plageman 
>  wrote:
...
> > I think the biggest change that is needed is to implement this memory
> > context usage for parallel hash join.
> 
> Indeed. 

...

> 4. accessor->read_buffer is currently allocated in accessor->context as well.
> 
>This buffer holds tuple read from the fileset. This is still a buffer, but
>not related to any file anymore...
> 
> Because of 3 and 4, and the gross context granularity of SharedTuplestore
> related-code, I'm now wondering if "fileCxt" shouldn't be renamed "bufCxt".

I think bufCxt is a potentially confusing name. The context contains
pointers to the batch files and saying those are related to buffers is
confusing. It also sounds like it could include any kind of buffer
related to the hashtable or hash join.

Perhaps we could call this memory context the "spillCxt"--indicating it
is for the memory required for spilling to permanent storage while
executing hash joins.
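
For what it's worth, such a context would presumably be created as a child of
hashCxt, something like the following sketch (names are illustrative; the
patch at this point in the thread still called it fileCxt):

  /* Sketch: create the spill context as a child of hashCxt. */
  hashtable->spillCxt = AllocSetContextCreate(hashtable->hashCxt,
                                              "HashSpillContext",
                                              ALLOCSET_DEFAULT_SIZES);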

I discuss this more in my code review below.

> From c5ed2ae2c2749af4f5058b012dc5e8a9e1529127 Mon Sep 17 00:00:00 2001
> From: Jehan-Guillaume de Rorthais 
> Date: Mon, 27 Mar 2023 15:54:39 +0200
> Subject: [PATCH 2/3] Allocate hash batches related BufFile in a dedicated
>  context

> diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
> index 5fd1c5553b..a4fbf29301 100644
> --- a/src/backend/executor/nodeHash.c
> +++ b/src/backend/executor/nodeHash.c
> @@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List 
> *hashOperators, List *hashCollations,
>  
>   if (nbatch > 1 && hashtable->parallel_state == NULL)
>   {
> + MemoryContext oldctx;
> +
>   /*
>* allocate and initialize the file arrays in hashCxt (not 
> needed for
>* parallel case which uses shared tuplestores instead of raw 
> files)
>*/
> + oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
> +
>   hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
>   hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
>   /* The files will not be opened until needed... */
>   /* ... but make sure we have temp tablespaces established for 
> them */

I haven't studied it closely, but I wonder if we shouldn't switch back
into the oldctx before calling PrepareTempTablespaces().
PrepareTempTablespaces() is explicit about what memory context it is
allocating in, however, I'm not sure it makes sense to call it in the
fileCxt. If you have a reason, you should add a comment and do the same
in ExecHashIncreaseNumBatches().

>   PrepareTempTablespaces();
> +
> + MemoryContextSwitchTo(oldctx);
>   }
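
Concretely, the suggestion is to scope the context switch to the array
allocations only and leave PrepareTempTablespaces() outside it, roughly like
this (a sketch of the suggested shape, not a literal hunk):

  if (nbatch > 1 && hashtable->parallel_state == NULL)
  {
      MemoryContext oldctx = MemoryContextSwitchTo(hashtable->fileCxt);

      /*
       * Allocate and initialize the file arrays in fileCxt (not needed for
       * the parallel case, which uses shared tuplestores instead of raw
       * files).
       */
      hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
      hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);

      MemoryContextSwitchTo(oldctx);

      /*
       * The files will not be opened until needed, but establish the temp
       * tablespaces now, back in the caller's context.
       */
      PrepareTempTablespaces();
  }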

> @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
>  hashtable, nbatch, hashtable->spaceUsed);
>  #endif
>  
> - oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
> -
>   if (hashtable->innerBatchFile == NULL)
>   {
> + MemoryContext oldcxt = 
> MemoryContextSwitchTo(hashtable->fileCxt);
> +
>   /* we had no file arrays before */
>   hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
>   hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> +

As mentioned above, you should likely make ExecHashTableCreate()
consistent with this.

> + MemoryContextSwitchTo(oldcxt);
> +
>   /* time to establish the temp tablespaces, too */
>   PrepareTempTablespaces();
>   }
> @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)

I don't see a reason to call repalloc0_array() in a different context
than the initial palloc0_array().

>   hashtable->outerBatchFile = 
> repalloc0_array(hashtable->outerBatchFile, BufFile *, oldnbatch, nbatch);
>   }
>  
> - MemoryContextSwitchTo(oldcxt);
> -
>   hashtable->nbatch = nbatch;
>  
>   /*

> diff --git a/src/backend/executor/nodeHashjoin.c 
> b/src/backend/executor/nodeHashjoin.c
> index 920d1831c2..ac72fbfbb6 100644
> --- a/src/backend/executor/nodeHashjoin.c
> +++ b/src/backend/executor/nodeHashjoin.c
> @@ -485,8 +485,10 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
>*/
>   Assert(parallel_state == NULL);
>   Assert(batchno > hashtable->curbatch);
> +
>   ExecHashJoinSaveTuple(mintuple, 
> hashvalue,
> - 
>   >outerBatchFile[batchno]);
> + 
>   >outerBatchFile[batchno],
> + 

Re: Memory leak from ExecutorState context?

2023-05-04 Thread Jehan-Guillaume de Rorthais
Hi,

On Fri, 21 Apr 2023 16:44:48 -0400
Melanie Plageman  wrote:

> On Fri, Apr 7, 2023 at 8:01 PM Jehan-Guillaume de Rorthais
>  wrote:
> >
> > On Fri, 31 Mar 2023 14:06:11 +0200
> > Jehan-Guillaume de Rorthais  wrote:
> >  
> > > > [...]  
> > > > >> Hmmm, not sure is WARNING is a good approach, but I don't have a
> > > > >> better idea at the moment.  
> > > > >
> > > > > I stepped it down to NOTICE and added some more infos.
> > > > >
> > > > > [...]
> > > > >   NOTICE:  Growing number of hash batch to 32768 is exhausting allowed
> > > > > memory (137494656 > 2097152)  
> > > > [...]
> > > >
> > > > OK, although NOTICE that may actually make it less useful - the default
> > > > level is WARNING, and regular users are unable to change the level. So
> > > > very few people will actually see these messages.  
> > [...]  
> > > Anyway, maybe this should be added in the light of next patch, balancing
> > > between increasing batches and allowed memory. The WARNING/LOG/NOTICE
> > > message could appears when we actually break memory rules because of some
> > > bad HJ situation.  
> >
> > So I did some more minor editions to the memory context patch and start
> > working on the balancing memory patch. Please, find in attachment the v4
> > patch set:
> >
> > * 0001-v4-Describe-hybrid-hash-join-implementation.patch:
> >   Adds documentation written by Melanie few years ago
> > * 0002-v4-Allocate-hash-batches-related-BufFile-in-a-dedicated.patch:
> >   The batches' BufFile dedicated memory context patch  
> 
> This is only a review of the code in patch 0002 (the patch to use a more
> granular memory context for multi-batch hash join batch files). I have
> not reviewed the changes to comments in detail either.

Ok.

> I think the biggest change that is needed is to implement this memory
> context usage for parallel hash join.

Indeed. 

> To implement a file buffer memory context for multi-patch parallel hash
> join [...]

Thank you for your review and pointers!

After (some days off and then) studying the parallel code, I ended up with this:

1. As explained by Melanie, the v5 patch sets accessor->context to fileCxt.

2. BufFile buffers were wrongly allocated in the ExecutorState context for
   accessor->read_file|write_file, from sts_puttuple and
   sts_parallel_scan_next, when they first need to work with the accessor's
   FileSet.

   The v5 patch now allocates them in accessor->context, directly in
   sts_puttuple and sts_parallel_scan_next (a rough sketch follows after this
   list). This avoids wrapping each call of these functions in
   MemoryContextSwitchTo calls. I suppose this is correct, as the comment
   about accessor->context says
   "/* Memory context for **buffers**. */" in struct SharedTuplestoreAccessor.

3. accessor->write_chunk is currently already allocated in accessor->context.

   As a consequence, this buffer is now allocated in the fileCxt instead of
   the hashCxt.

   This is a bit far-fetched, but I suppose this is OK, as it acts as a
   second-level buffer on top of the BufFile.

4. accessor->read_buffer is currently allocated in accessor->context as well.

   This buffer holds tuples read from the fileset. It is still a buffer, but
   no longer related to any file...

Because of 3 and 4, and the coarse context granularity of the
SharedTuplestore-related code, I'm now wondering if "fileCxt" shouldn't be
renamed "bufCxt".
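
To illustrate point 2, here is roughly what the v5 patch does in sts_puttuple
(simplified sketch, field names from memory; sts_parallel_scan_next is
handled the same way for read_file):

    if (accessor->write_file == NULL)
    {
        MemoryContext oldcxt = MemoryContextSwitchTo(accessor->context);

        /* "name" is built by the existing code just before this call; the
         * BufFile and its buffer are now accounted in accessor->context */
        accessor->write_file = BufFileCreateFileSet(&accessor->fileset->fs,
                                                    name);

        MemoryContextSwitchTo(oldcxt);
    }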

Regards,
>From acaa94fab35ab966366a29a092780e54a68de0f7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH 1/3] Describe hybrid hash join implementation

This is just a draft to spark conversation on what a good comment might
be like in this file on how the hybrid hash join algorithm is
implemented in Postgres. I'm pretty sure this is the accepted term for
this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
---
 src/backend/executor/nodeHashjoin.c | 36 +
 1 file changed, 36 insertions(+)

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 0a3f32f731..920d1831c2 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -10,6 +10,42 @@
  * IDENTIFICATION
  *	  src/backend/executor/nodeHashjoin.c
  *
+ *   HYBRID HASH JOIN
+ *
+ *  If the inner side tuples of a hash join do not fit in memory, the hash join
+ *  can be executed in multiple batches.
+ *
+ *  If the statistics on the inner side relation are accurate, planner chooses a
+ *  multi-batch strategy and estimates the number of batches.
+ *
+ *  The query executor measures the real size of the hashtable and increases the
+ *  number of batches if the hashtable grows too large.
+ *
+ *  The number of batches is always a power of two, so an increase in the number
+ *  of batches doubles it.
+ *
+ *  Serial hash join measures batch size lazily -- waiting until it is loading a
+ *  batch to determine if it will fit in memory. While inserting tuples into the
+ *  hashtable, serial hash join 

Re: Memory leak from ExecutorState context?

2023-04-21 Thread Melanie Plageman
On Fri, Apr 7, 2023 at 8:01 PM Jehan-Guillaume de Rorthais
 wrote:
>
> On Fri, 31 Mar 2023 14:06:11 +0200
> Jehan-Guillaume de Rorthais  wrote:
>
> > > [...]
> > > >> Hmmm, not sure is WARNING is a good approach, but I don't have a better
> > > >> idea at the moment.
> > > >
> > > > I stepped it down to NOTICE and added some more infos.
> > > >
> > > > [...]
> > > >   NOTICE:  Growing number of hash batch to 32768 is exhausting allowed
> > > > memory (137494656 > 2097152)
> > > [...]
> > >
> > > OK, although NOTICE that may actually make it less useful - the default
> > > level is WARNING, and regular users are unable to change the level. So
> > > very few people will actually see these messages.
> [...]
> > Anyway, maybe this should be added in the light of next patch, balancing
> > between increasing batches and allowed memory. The WARNING/LOG/NOTICE 
> > message
> > could appears when we actually break memory rules because of some bad HJ
> > situation.
>
> So I did some more minor editions to the memory context patch and start 
> working
> on the balancing memory patch. Please, find in attachment the v4 patch set:
>
> * 0001-v4-Describe-hybrid-hash-join-implementation.patch:
>   Adds documentation written by Melanie few years ago
> * 0002-v4-Allocate-hash-batches-related-BufFile-in-a-dedicated.patch:
>   The batches' BufFile dedicated memory context patch

This is only a review of the code in patch 0002 (the patch to use a more
granular memory context for multi-batch hash join batch files). I have
not reviewed the changes to comments in detail either.

I think the biggest change that is needed is to implement this memory
context usage for parallel hash join.

To implement a file buffer memory context for multi-batch parallel hash
join, we would need, at a minimum, to switch into the proposed fileCxt
memory context in sts_puttuple() before BufFileCreateFileSet().

We should also consider changing the SharedTuplestoreAccessor->context
from HashTableContext to the proposed fileCxt.

In parallel hash join code, the SharedTuplestoreAccessor->context is
only used when allocating the SharedTuplestoreAccessor->write_chunk and
read_chunk. Those are the buffers for writing out and reading from the
SharedTuplestore and are part of the memory overhead of file buffers for
a multi-batch hash join. Note that we will allocate STS_CHUNK_PAGES *
BLCKSZ size buffer for every batch -- this is on top of the BufFile
buffer per batch.

sts_initialize() and sts_attach() set the
SharedTuplestoreAccessor->context to the CurrentMemoryContext. We could
change into the fileCxt before calling those functions from the hash
join code. That would mean that the SharedTuplestoreAccessor data
structure would also be counted in the fileCxt (and probably
hashtable->batches). This is probably what we want anyway.
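
For example (very rough sketch; argument lists elided since I don't have the
code in front of me):

    /* in the hash join code that sets up the shared tuplestores */
    oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);
    accessor = sts_initialize(...);     /* or sts_attach(...) */
    MemoryContextSwitchTo(oldcxt);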

As for this patch's current implementation and use of the fileCxt, I
think you are going to need to switch into the fileCxt before calling
BufFileClose() with the batch files throughout the hash join code (e.g.
in ExecHashJoinNewBatch()) if you want the mem_allocated to be accurate
(since it frees the BufFile buffer). Once you figure out if that makes
sense and implement it, I think we will have to revisit if it still
makes sense to pass the fileCxt as an argument to
ExecHashJoinSaveTuple().
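
For instance, in ExecHashJoinNewBatch(), something like this (sketch only,
not the exact surrounding code):

    if (hashtable->outerBatchFile[curbatch] != NULL)
    {
        MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->fileCxt);

        BufFileClose(hashtable->outerBatchFile[curbatch]);
        hashtable->outerBatchFile[curbatch] = NULL;
        MemoryContextSwitchTo(oldcxt);
    }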

- Melanie




Re: Memory leak from ExecutorState context?

2023-04-21 Thread Konstantin Knizhnik



On 21.04.2023 1:51 AM, Melanie Plageman wrote:

On Thu, Apr 20, 2023 at 12:42 PM Konstantin Knizhnik  wrote:

On 11.04.2023 8:14 PM, Jehan-Guillaume de Rorthais wrote:

On Sat, 8 Apr 2023 02:01:19 +0200
Jehan-Guillaume de Rorthais  wrote:


On Fri, 31 Mar 2023 14:06:11 +0200
Jehan-Guillaume de Rorthais  wrote:

   [...]

After rebasing Tomas' memory balancing patch, I did some memory measures
to answer some of my questions. Please, find in attachment the resulting
charts "HJ-HEAD.png" and "balancing-v3.png" to compare memory consumption
between HEAD and Tomas' patch. They shows an alternance of numbers
before/after calling ExecHashIncreaseNumBatches (see the debug patch). I
didn't try to find the exact last total peak of memory consumption during the
join phase and before all the BufFiles are destroyed. So the last number
might be underestimated.

I did some more analysis about the total memory consumption in filecxt of HEAD,
v3 and v4 patches. My previous debug numbers only prints memory metrics during
batch increments or hash table destruction. That means:

* for HEAD: we miss the batches consumed during the outer scan
* for v3: adds twice nbatch in spaceUsed, which is a rough estimation
* for v4: batches are tracked in spaceUsed, so they are reflected in spacePeak

Using a breakpoint in ExecHashJoinSaveTuple to print "filecxt->mem_allocated"
from there, here are the maximum allocated memory for bufFile context for each
branch:

  batches   max bufFiles   total  spaceAllowed rise
HEAD16384  199966960  ~194MB
v3   4096   65419456   ~78MB
v4(*3)   2048   3427328048MB  nbatch*sizeof(PGAlignedBlock)*3
v4(*4)   1024   17170160  60.6MB  nbatch*sizeof(PGAlignedBlock)*4
v4(*5)   2048   34273280  42.5MB  nbatch*sizeof(PGAlignedBlock)*5

It seems account for bufFile in spaceUsed allows a better memory balancing and
management. The precise factor to rise spaceAllowed is yet to be defined. *3 or
*4 looks good, but this is based on a single artificial test case.

Also, note that HEAD is currently reporting ~4MB of memory usage. This is by
far wrong with the reality. So even if we don't commit the balancing memory
patch in v16, maybe we could account for filecxt in spaceUsed as a bugfix?

Regards,

Thank you for the patch.
I  faced with the same problem (OOM caused by hash join).
I tried to create simplest test reproducing the problem:

create table t(pk int, val int);
insert into t values (generate_series(1,1),0);
set work_mem='64kB';
explain (analyze,buffers) select count(*) from t t1 join t t2 on
(t1.pk=t2.pk);


There are three workers and size of each exceeds 1.3Gb.

Plan is the following:

   Finalize Aggregate  (cost=355905977972.87..355905977972.88 rows=1
width=8) (actual time=2
12961.033..226097.513 rows=1 loops=1)
 Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374,
temp read=944407 w
ritten=1130380
 ->  Gather  (cost=355905977972.65..355905977972.86 rows=2 width=8)
(actual time=212943.
505..226097.497 rows=3 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=32644 read=852474 dirtied=437947
written=426374, temp read=94
4407 written=1130380
   ->  Partial Aggregate (cost=355905976972.65..355905976972.66
rows=1 width=8) (ac
tual time=212938.410..212940.035 rows=1 loops=3)
 Buffers: shared hit=32644 read=852474 dirtied=437947
written=426374, temp r
ead=944407 written=1130380
 ->  Parallel Hash Join (cost=1542739.26..303822614472.65
rows=2083334502 width=0) (actual time=163268.274..207829.524
rows= loops=3)
   Hash Cond: (t1.pk = t2.pk)
   Buffers: shared hit=32644 read=852474
dirtied=437947 written=426374, temp read=944407 written=1130380
   ->  Parallel Seq Scan on t t1
(cost=0.00..859144.78 rows=4178 width=4) (actual
time=0.045..30828.051 rows= loops=3)
 Buffers: shared hit=16389 read=426089 written=87
   ->  Parallel Hash (cost=859144.78..859144.78
rows=4178 width=4) (actual time=82202.445..82202.447 rows=
loops=3)
 Buckets: 4096 (originally 4096)  Batches:
32768 (originally 8192)  Memory Usage: 192kB
 Buffers: shared hit=16095 read=426383
dirtied=437947 written=426287, temp read=267898 written=737164
 ->  Parallel Seq Scan on t t2
(cost=0.00..859144.78 rows=4178 width=4) (actual
time=0.054..12647.534 rows= loops=3)
   Buffers: shared hit=16095 read=426383
dirtied=437947 writ
ten=426287
   Planning:
 Buffers: shared hit=69 read=38
   Planning Time: 2.819 ms
   Execution Time: 226113.292 ms
(22 rows)



-

So we have increased number of batches to 32k.
I applied your patches 0001-0004 but unfortunately them have not reduced

Re: Memory leak from ExecutorState context?

2023-04-20 Thread Melanie Plageman
On Thu, Apr 20, 2023 at 12:42 PM Konstantin Knizhnik  wrote:
> On 11.04.2023 8:14 PM, Jehan-Guillaume de Rorthais wrote:
> > On Sat, 8 Apr 2023 02:01:19 +0200
> > Jehan-Guillaume de Rorthais  wrote:
> >
> >> On Fri, 31 Mar 2023 14:06:11 +0200
> >> Jehan-Guillaume de Rorthais  wrote:
> >>
> >>   [...]
> >>
> >> After rebasing Tomas' memory balancing patch, I did some memory measures
> >> to answer some of my questions. Please, find in attachment the resulting
> >> charts "HJ-HEAD.png" and "balancing-v3.png" to compare memory consumption
> >> between HEAD and Tomas' patch. They shows an alternance of numbers
> >> before/after calling ExecHashIncreaseNumBatches (see the debug patch). I
> >> didn't try to find the exact last total peak of memory consumption during 
> >> the
> >> join phase and before all the BufFiles are destroyed. So the last number
> >> might be underestimated.
> > I did some more analysis about the total memory consumption in filecxt of 
> > HEAD,
> > v3 and v4 patches. My previous debug numbers only prints memory metrics 
> > during
> > batch increments or hash table destruction. That means:
> >
> > * for HEAD: we miss the batches consumed during the outer scan
> > * for v3: adds twice nbatch in spaceUsed, which is a rough estimation
> > * for v4: batches are tracked in spaceUsed, so they are reflected in 
> > spacePeak
> >
> > Using a breakpoint in ExecHashJoinSaveTuple to print 
> > "filecxt->mem_allocated"
> > from there, here are the maximum allocated memory for bufFile context for 
> > each
> > branch:
> >
> >  batches   max bufFiles   total  spaceAllowed rise
> >HEAD16384  199966960  ~194MB
> >v3   4096   65419456   ~78MB
> >v4(*3)   2048   3427328048MB  nbatch*sizeof(PGAlignedBlock)*3
> >v4(*4)   1024   17170160  60.6MB  nbatch*sizeof(PGAlignedBlock)*4
> >v4(*5)   2048   34273280  42.5MB  nbatch*sizeof(PGAlignedBlock)*5
> >
> > It seems account for bufFile in spaceUsed allows a better memory balancing 
> > and
> > management. The precise factor to rise spaceAllowed is yet to be defined. 
> > *3 or
> > *4 looks good, but this is based on a single artificial test case.
> >
> > Also, note that HEAD is currently reporting ~4MB of memory usage. This is by
> > far wrong with the reality. So even if we don't commit the balancing memory
> > patch in v16, maybe we could account for filecxt in spaceUsed as a bugfix?
> >
> > Regards,
>
> Thank you for the patch.
> I  faced with the same problem (OOM caused by hash join).
> I tried to create simplest test reproducing the problem:
>
> create table t(pk int, val int);
> insert into t values (generate_series(1,1),0);
> set work_mem='64kB';
> explain (analyze,buffers) select count(*) from t t1 join t t2 on
> (t1.pk=t2.pk);
>
>
> There are three workers and size of each exceeds 1.3Gb.
>
> Plan is the following:
>
>   Finalize Aggregate  (cost=355905977972.87..355905977972.88 rows=1
> width=8) (actual time=2
> 12961.033..226097.513 rows=1 loops=1)
> Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374,
> temp read=944407 w
> ritten=1130380
> ->  Gather  (cost=355905977972.65..355905977972.86 rows=2 width=8)
> (actual time=212943.
> 505..226097.497 rows=3 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   Buffers: shared hit=32644 read=852474 dirtied=437947
> written=426374, temp read=94
> 4407 written=1130380
>   ->  Partial Aggregate (cost=355905976972.65..355905976972.66
> rows=1 width=8) (ac
> tual time=212938.410..212940.035 rows=1 loops=3)
> Buffers: shared hit=32644 read=852474 dirtied=437947
> written=426374, temp r
> ead=944407 written=1130380
> ->  Parallel Hash Join (cost=1542739.26..303822614472.65
> rows=2083334502 width=0) (actual time=163268.274..207829.524
> rows= loops=3)
>   Hash Cond: (t1.pk = t2.pk)
>   Buffers: shared hit=32644 read=852474
> dirtied=437947 written=426374, temp read=944407 written=1130380
>   ->  Parallel Seq Scan on t t1
> (cost=0.00..859144.78 rows=4178 width=4) (actual
> time=0.045..30828.051 rows= loops=3)
> Buffers: shared hit=16389 read=426089 written=87
>   ->  Parallel Hash (cost=859144.78..859144.78
> rows=4178 width=4) (actual time=82202.445..82202.447 rows=
> loops=3)
> Buckets: 4096 (originally 4096)  Batches:
> 32768 (originally 8192)  Memory Usage: 192kB
> Buffers: shared hit=16095 read=426383
> dirtied=437947 written=426287, temp read=267898 written=737164
> ->  Parallel Seq Scan on t t2
> (cost=0.00..859144.78 rows=4178 width=4) (actual
> time=0.054..12647.534 rows= loops=3)
>   Buffers: shared hit=16095 read=426383
> dirtied=437947 writ
> ten=426287
>   Planning:
> 

Re: Memory leak from ExecutorState context?

2023-04-20 Thread Konstantin Knizhnik




On 11.04.2023 8:14 PM, Jehan-Guillaume de Rorthais wrote:

On Sat, 8 Apr 2023 02:01:19 +0200
Jehan-Guillaume de Rorthais  wrote:


On Fri, 31 Mar 2023 14:06:11 +0200
Jehan-Guillaume de Rorthais  wrote:

  [...]

After rebasing Tomas' memory balancing patch, I did some memory measures
to answer some of my questions. Please, find in attachment the resulting
charts "HJ-HEAD.png" and "balancing-v3.png" to compare memory consumption
between HEAD and Tomas' patch. They shows an alternance of numbers
before/after calling ExecHashIncreaseNumBatches (see the debug patch). I
didn't try to find the exact last total peak of memory consumption during the
join phase and before all the BufFiles are destroyed. So the last number
might be underestimated.

I did some more analysis about the total memory consumption in filecxt of HEAD,
v3 and v4 patches. My previous debug numbers only prints memory metrics during
batch increments or hash table destruction. That means:

* for HEAD: we miss the batches consumed during the outer scan
* for v3: adds twice nbatch in spaceUsed, which is a rough estimation
* for v4: batches are tracked in spaceUsed, so they are reflected in spacePeak

Using a breakpoint in ExecHashJoinSaveTuple to print "filecxt->mem_allocated"
from there, here are the maximum allocated memory for bufFile context for each
branch:

 batches   max bufFiles   total  spaceAllowed rise
   HEAD16384  199966960  ~194MB
   v3   4096   65419456   ~78MB
   v4(*3)   2048   3427328048MB  nbatch*sizeof(PGAlignedBlock)*3
   v4(*4)   1024   17170160  60.6MB  nbatch*sizeof(PGAlignedBlock)*4
   v4(*5)   2048   34273280  42.5MB  nbatch*sizeof(PGAlignedBlock)*5

It seems account for bufFile in spaceUsed allows a better memory balancing and
management. The precise factor to rise spaceAllowed is yet to be defined. *3 or
*4 looks good, but this is based on a single artificial test case.

Also, note that HEAD is currently reporting ~4MB of memory usage. This is by
far wrong with the reality. So even if we don't commit the balancing memory
patch in v16, maybe we could account for filecxt in spaceUsed as a bugfix?

Regards,


Thank you for the patch.
I faced the same problem (OOM caused by hash join).
I tried to create the simplest test reproducing the problem:

create table t(pk int, val int);
insert into t values (generate_series(1,1),0);
set work_mem='64kB';
explain (analyze,buffers) select count(*) from t t1 join t t2 on 
(t1.pk=t2.pk);



There are three workers, and the size of each exceeds 1.3Gb.

Plan is the following:

 Finalize Aggregate  (cost=355905977972.87..355905977972.88 rows=1 width=8) (actual time=212961.033..226097.513 rows=1 loops=1)
   Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
   ->  Gather  (cost=355905977972.65..355905977972.86 rows=2 width=8) (actual time=212943.505..226097.497 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
         ->  Partial Aggregate  (cost=355905976972.65..355905976972.66 rows=1 width=8) (actual time=212938.410..212940.035 rows=1 loops=3)
               Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
               ->  Parallel Hash Join  (cost=1542739.26..303822614472.65 rows=2083334502 width=0) (actual time=163268.274..207829.524 rows= loops=3)
                     Hash Cond: (t1.pk = t2.pk)
                     Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
                     ->  Parallel Seq Scan on t t1  (cost=0.00..859144.78 rows=4178 width=4) (actual time=0.045..30828.051 rows= loops=3)
                           Buffers: shared hit=16389 read=426089 written=87
                     ->  Parallel Hash  (cost=859144.78..859144.78 rows=4178 width=4) (actual time=82202.445..82202.447 rows= loops=3)
                           Buckets: 4096 (originally 4096)  Batches: 32768 (originally 8192)  Memory Usage: 192kB
                           Buffers: shared hit=16095 read=426383 dirtied=437947 written=426287, temp read=267898 written=737164
                           ->  Parallel Seq Scan on t t2  (cost=0.00..859144.78 rows=4178 width=4) (actual time=0.054..12647.534 rows= loops=3)
                                 Buffers: shared hit=16095 read=426383 dirtied=437947 written=426287
 Planning:
   Buffers: shared hit=69 read=38
 Planning Time: 2.819 ms
 Execution Time: 226113.292 ms
(22 rows)



-

So we have increased the number of batches to 32k.
I applied your patches 0001-0004, but unfortunately they have not reduced
memory consumption - the size of each backend is still more than 1.3Gb.


I wonder what the preferred solution to this problem could be?
We 

Re: Memory leak from ExecutorState context?

2023-04-11 Thread Jehan-Guillaume de Rorthais
On Sat, 8 Apr 2023 02:01:19 +0200
Jehan-Guillaume de Rorthais  wrote:

> On Fri, 31 Mar 2023 14:06:11 +0200
> Jehan-Guillaume de Rorthais  wrote:
> 
>  [...]  
> 
> After rebasing Tomas' memory balancing patch, I did some memory measures
> to answer some of my questions. Please, find in attachment the resulting
> charts "HJ-HEAD.png" and "balancing-v3.png" to compare memory consumption
> between HEAD and Tomas' patch. They shows an alternance of numbers
> before/after calling ExecHashIncreaseNumBatches (see the debug patch). I
> didn't try to find the exact last total peak of memory consumption during the
> join phase and before all the BufFiles are destroyed. So the last number
> might be underestimated.

I did some more analysis of the total memory consumption in the filecxt for
HEAD and the v3 and v4 patches. My previous debug numbers only printed memory
metrics during batch increments or hash table destruction. That means:

* for HEAD: we miss the batches consumed during the outer scan
* for v3: adds twice nbatch in spaceUsed, which is a rough estimation
* for v4: batches are tracked in spaceUsed, so they are reflected in spacePeak

Using a breakpoint in ExecHashJoinSaveTuple to print "filecxt->mem_allocated"
from there, here are the maximum allocated memory for bufFile context for each
branch:

          batches   max bufFiles   total  spaceAllowed rise
  HEAD      16384      199966960  ~194MB
  v3         4096       65419456   ~78MB
  v4(*3)     2048       34273280    48MB  nbatch*sizeof(PGAlignedBlock)*3
  v4(*4)     1024       17170160  60.6MB  nbatch*sizeof(PGAlignedBlock)*4
  v4(*5)     2048       34273280  42.5MB  nbatch*sizeof(PGAlignedBlock)*5

It seems accounting for bufFile in spaceUsed allows better memory balancing
and management. The precise factor by which to raise spaceAllowed is yet to be
defined. *3 or *4 look good, but this is based on a single artificial test
case.

Also, note that HEAD is currently reporting ~4MB of memory usage. This is far
from reality. So even if we don't commit the memory balancing patch in v16,
maybe we could account for the filecxt in spaceUsed as a bugfix?
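
For the record, the kind of accounting I have in mind for the bugfix part
would be something like this (sketch only, to be done where nbatch is
doubled):

    /* account for the 8kB buffer of each inner and outer batch file */
    hashtable->spaceUsed += nbatch * sizeof(PGAlignedBlock) * 2;
    if (hashtable->spaceUsed > hashtable->spacePeak)
        hashtable->spacePeak = hashtable->spaceUsed;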

Regards,




Re: Memory leak from ExecutorState context?

2023-03-31 Thread Jehan-Guillaume de Rorthais
On Tue, 28 Mar 2023 17:25:49 +0200
Tomas Vondra  wrote:
...
>  * Note that BufFile structs are allocated with palloc(), and therefore
>  * will go away automatically at query/transaction end.  Since the
> underlying
>  * virtual Files are made with OpenTemporaryFile, all resources for
>  * the file are certain to be cleaned up even if processing is aborted
>  * by ereport(ERROR).  The data structures required are made in the
>  * palloc context that was current when the BufFile was created, and
>  * any external resources such as temp files are owned by the ResourceOwner
>  * that was current at that time.
> 
> which I take as confirmation that it's legal to allocate BufFile in any
> memory context, and that cleanup is handled by the cache in fd.c.

OK. I just over-interpreted the comments and was overly cautious.

> [...]
> >> Hmmm, not sure is WARNING is a good approach, but I don't have a better
> >> idea at the moment.  
> > 
> > I stepped it down to NOTICE and added some more infos.
> > 
> > [...]
> >   NOTICE:  Growing number of hash batch to 32768 is exhausting allowed
> > memory (137494656 > 2097152)
> [...]
> 
> OK, although NOTICE that may actually make it less useful - the default
> level is WARNING, and regular users are unable to change the level. So
> very few people will actually see these messages.

The main purpose of NOTICE was to notify users/devs, as client_min_messages=notice
is the default.

But while playing with it, I wonder if the message is in the right place
anyway. It is probably pessimistic, as it shows memory consumption when
increasing the number of batches, but before the actual BufFiles are (lazily)
allocated. The message should probably appear a bit sooner, with better
numbers.

Anyway, maybe this should be added in light of the next patch, balancing
between increasing batches and allowed memory. The WARNING/LOG/NOTICE message
could appear when we actually break the memory rules because of some bad HJ
situation.

Another way to expose the bad memory consumption would be to add memory info
to the HJ node in the explain output, or maybe to collect some min/max/mean
for pg_stat_statements, but I suspect tracking memory spikes per query is
another challenge altogether...

In the meantime, find attached v3 of the patch, with the debug and NOTICE
messages removed. Given the same plan from my previous email, here are the
memory contexts close to the end of the query:

  ExecutorState: 32768 total in 3 blocks; 15512 free (6 chunks); 17256 used
    HashTableContext: 8192 total in 1 blocks; 7720 free (0 chunks); 472 used
      HashBatchFiles: 28605680 total in 3256 blocks; 970128 free (38180 chunks); 27635552 used
      HashBatchContext: 960544 total in 23 blocks; 7928 free (0 chunks); 952616 used


Regards,
>From 6814994fa0576a8ba6458412ac5f944135fc3813 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Mon, 27 Mar 2023 15:54:39 +0200
Subject: [PATCH] Allocate hash batches related BufFile in a dedicated context

---
 src/backend/executor/nodeHash.c | 27 ++-
 src/backend/executor/nodeHashjoin.c | 18 +-
 src/include/executor/hashjoin.h | 15 ---
 src/include/executor/nodeHashjoin.h |  2 +-
 4 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 748c9b0024..0c0d5b4a3c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -484,7 +484,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	 *
 	 * The hashtable control block is just palloc'd from the executor's
 	 * per-query memory context.  Everything else should be kept inside the
-	 * subsidiary hashCxt or batchCxt.
+	 * subsidiary hashCxt, batchCxt or fileCxt.
 	 */
 	hashtable = palloc_object(HashJoinTableData);
 	hashtable->nbuckets = nbuckets;
@@ -538,6 +538,10 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 "HashBatchContext",
 ALLOCSET_DEFAULT_SIZES);
 
+	hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
+ "HashBatchFiles",
+ ALLOCSET_DEFAULT_SIZES);
+
 	/* Allocate data that will live for the life of the hashjoin */
 
 	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
@@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 
 	if (nbatch > 1 && hashtable->parallel_state == NULL)
 	{
+		MemoryContext oldctx;
+
 		/*
 		 * allocate and initialize the file arrays in hashCxt (not needed for
 		 * parallel case which uses shared tuplestores instead of raw files)
 		 */
+		oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
+
 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
 		hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		

Re: Memory leak from ExecutorState context?

2023-03-28 Thread Tomas Vondra



On 3/28/23 15:17, Jehan-Guillaume de Rorthais wrote:
> On Tue, 28 Mar 2023 00:43:34 +0200
> Tomas Vondra  wrote:
> 
>> On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote:
>>> Please, find in attachment a patch to allocate bufFiles in a dedicated
>>> context. I picked up your patch, backpatch'd it, went through it and did
>>> some minor changes to it. I have some comment/questions thought.
>>>
>>>   1. I'm not sure why we must allocate the "HashBatchFiles" new context
>>>   under ExecutorState and not under hashtable->hashCxt?
>>>
>>>   The only references I could find was in hashjoin.h:30:
>>>
>>>/* [...]
>>> * [...] (Exception: data associated with the temp files lives in the
>>> * per-query context too, since we always call buffile.c in that
>>> context.)
>>>
>>>   And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
>>>   original comment in the patch):
>>>
>>>/* [...]
>>> * Note: it is important always to call this in the regular executor
>>> * context, not in a shorter-lived context; else the temp file buffers
>>> * will get messed up.
>>>
>>>
>>>   But these are not explanation of why BufFile related allocations must be
>>> under a per-query context. 
>>>   
>>
>> Doesn't that simply describe the current (unpatched) behavior where
>> BufFile is allocated in the per-query context? 
> 
> I wasn't sure. The first quote from hashjoin.h seems to describe a stronger
> rule about «**always** call buffile.c in per-query context». But maybe it 
> ought
> to be «always call buffile.c from one of the sub-query context»? I assume the
> aim is to enforce the tmp files removal on query end/error?
> 

I don't think we need this info for tempfile cleanup - CleanupTempFiles
relies on the VfdCache, which does malloc/realloc (so it's not using
memory contexts at all).

I'm not very familiar with this part of the code, so I might be missing
something. But you can try that - just stick an elog(ERROR) somewhere
into the hashjoin code, and see if that breaks the cleanup.
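
Something as simple as this, somewhere after the batch files get created,
should be enough to check that (throwaway debugging hack, not meant to be
kept):

    /* throwaway: simulate a failure once we've spilled to batch files */
    if (hashtable->nbatch > 1 && hashtable->innerBatchFile[1] != NULL)
        elog(ERROR, "simulated failure to test temp file cleanup");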

Not explicit proof, but if there were some hard-wired requirement about which
memory context to allocate BufFile stuff in, I'd expect it to be documented
in buffile.c. But that actually says this:

 * Note that BufFile structs are allocated with palloc(), and therefore
 * will go away automatically at query/transaction end.  Since the
underlying
 * virtual Files are made with OpenTemporaryFile, all resources for
 * the file are certain to be cleaned up even if processing is aborted
 * by ereport(ERROR).  The data structures required are made in the
 * palloc context that was current when the BufFile was created, and
 * any external resources such as temp files are owned by the ResourceOwner
 * that was current at that time.

which I take as confirmation that it's legal to allocate BufFile in any
memory context, and that cleanup is handled by the cache in fd.c.


>> I mean, the current code calls BufFileCreateTemp() without switching the
>> context, so it's in the ExecutorState. But with the patch it very clearly is
>> not.
>>
>> And I'm pretty sure the patch should do
>>
>> hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
>>  "HashBatchFiles",
>>  ALLOCSET_DEFAULT_SIZES);
>>
>> and it'd still work. Or why do you think we *must* allocate it under
>> ExecutorState?
> 
> That was actually my very first patch and it indeed worked. But I was confused
> about the previous quoted code comments. That's why I kept your original code
> and decided to rise the discussion here.
> 

IIRC I was just lazy when writing the experimental patch; there was not
much thought given to stuff like this.

> Fixed in new patch in attachment.
> 
>> FWIW The comment in hashjoin.h needs updating to reflect the change.
> 
> Done in the last patch. Is my rewording accurate?
> 
>>>   2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context
>>> switch seems fragile as it could be forgotten in futur code path/changes.
>>> So I added an Assert() in the function to make sure the current memory
>>> context is "HashBatchFiles" as expected.
>>>   Another way to tie this up might be to pass the memory context as
>>> argument to the function.
>>>   ... Or maybe I'm over precautionary.
>>>   
>>
>> I'm not sure I'd call that fragile, we have plenty other code that
>> expects the memory context to be set correctly. Not sure about the
>> assert, but we don't have similar asserts anywhere else.
> 
> I mostly sticked it there to stimulate the discussion around this as I needed
> to scratch that itch.
> 
>> But I think it's just ugly and overly verbose
> 
> +1
> 
> Your patch was just a demo/debug patch by the time. It needed some cleanup now
> :)
> 
>> it'd be much nicer to e.g. pass the memory context as a parameter, and do
>> the switch inside.
> 
> That was a proposition in my previous mail, so I did it in the new patch. 
> Let's
> see what other 

Re: Memory leak from ExecutorState context?

2023-03-28 Thread Jehan-Guillaume de Rorthais
Hi,

Sorry for the late answer; I was reviewing the first patch, and it took me
some time to study and dig around.

On Thu, 23 Mar 2023 08:07:04 -0400
Melanie Plageman  wrote:

> On Fri, Mar 10, 2023 at 1:51 PM Jehan-Guillaume de Rorthais
>  wrote:
> > > So I guess the best thing would be to go through these threads, see what
> > > the status is, restart the discussion and propose what to do. If you do
> > > that, I'm happy to rebase the patches, and maybe see if I could improve
> > > them in some way.  
> >
> > OK! It took me some time, but I did it. I'll try to sum up the situation as
> > simply as possible.  
> 
> Wow, so many memories!
> 
> I'm excited that someone looked at this old work (though it is sad that
> a customer faced this issue). And, Jehan, I really appreciate your great
> summarization of all these threads. This will be a useful reference.

Thank you!

> > 1. "move BufFile stuff into separate context"
> > [...]
> >I suppose this simple one has been forgotten in the fog of all other
> >discussions. Also, this probably worth to be backpatched.  
> 
> I agree with Jehan-Guillaume and Tomas that this seems fine to commit
> alone.

This is a WIP.

> > 2. "account for size of BatchFile structure in hashJoin"
> > [...] 
> 
> I think I would have to see a modern version of a patch which does this
> to assess if it makes sense. But, I probably still agree with 2019
> Melanie :)

I volunteer to work on this after the memory context patch, unless someone
grabs it in the meantime.

> [...]
> On Mon, Mar 20, 2023 at 10:12 AM Jehan-Guillaume de Rorthais
>  wrote:
> > BNJL and/or other considerations are for 17 or even after. In the meantime,
> > Melanie, who authored BNLJ, +1 the balancing patch as it can coexists with
> > other discussed solutions. No one down vote since then. Melanie, what is
> > your opinion today on this patch? Did you change your mind as you worked
> > for many months on BNLJ since then?  
> 
> So, in order to avoid deadlock, my design of adaptive hash join/block
> nested loop hash join required a new parallelism concept not yet in
> Postgres at the time -- the idea of a lone worker remaining around to do
> work when others have left.
> 
> See: BarrierArriveAndDetachExceptLast()
> introduced in 7888b09994
> 
> Thomas Munro had suggested we needed to battle test this concept in a
> more straightforward feature first, so I implemented parallel full outer
> hash join and parallel right outer hash join with it.
> 
> https://commitfest.postgresql.org/42/2903/
> 
> This has been stalled ready-for-committer for two years. It happened to
> change timing such that it made an existing rarely hit parallel hash
> join bug more likely to be hit. Thomas recently committed our fix for
> this in 8d578b9b2e37a4d (last week). It is my great hope that parallel
> full outer hash join goes in before the 16 feature freeze.

This is really interesting to follow. I kinda feel/remember how this could
be useful for your BNLJ patch. It's good to see things are moving, step by
step.

Thanks for the pointers.

> If it does, I think it could make sense to try and find committable
> smaller pieces of the adaptive hash join work. As it is today, parallel
> hash join does not respect work_mem, and, in some sense, is a bit broken.
> 
> I would be happy to work on this feature again, or, if you were
> interested in picking it up, to provide review and any help I can if for
> you to work on it.

I don't think I would be able to pick up such a large and complex patch. But
I'm interested in helping, testing and reviewing, as far as I can!

Regards,




Re: Memory leak from ExecutorState context?

2023-03-28 Thread Jehan-Guillaume de Rorthais
On Tue, 28 Mar 2023 00:43:34 +0200
Tomas Vondra  wrote:

> On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote:
> > Please, find in attachment a patch to allocate bufFiles in a dedicated
> > context. I picked up your patch, backpatch'd it, went through it and did
> > some minor changes to it. I have some comment/questions thought.
> > 
> >   1. I'm not sure why we must allocate the "HashBatchFiles" new context
> >   under ExecutorState and not under hashtable->hashCxt?
> > 
> >   The only references I could find was in hashjoin.h:30:
> > 
> >/* [...]
> > * [...] (Exception: data associated with the temp files lives in the
> > * per-query context too, since we always call buffile.c in that
> > context.)
> > 
> >   And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
> >   original comment in the patch):
> > 
> >/* [...]
> > * Note: it is important always to call this in the regular executor
> > * context, not in a shorter-lived context; else the temp file buffers
> > * will get messed up.
> > 
> > 
> >   But these are not explanation of why BufFile related allocations must be
> > under a per-query context. 
> >   
> 
> Doesn't that simply describe the current (unpatched) behavior where
> BufFile is allocated in the per-query context? 

I wasn't sure. The first quote from hashjoin.h seems to describe a stronger
rule about «**always** call buffile.c in per-query context». But maybe it
ought to be «always call buffile.c from one of the sub-query contexts»? I
assume the aim is to enforce the removal of temp files on query end/error?

> I mean, the current code calls BufFileCreateTemp() without switching the
> context, so it's in the ExecutorState. But with the patch it very clearly is
> not.
> 
> And I'm pretty sure the patch should do
> 
> hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
>  "HashBatchFiles",
>  ALLOCSET_DEFAULT_SIZES);
> 
> and it'd still work. Or why do you think we *must* allocate it under
> ExecutorState?

That was actually my very first patch, and it indeed worked. But I was
confused by the previously quoted code comments. That's why I kept your
original code and decided to raise the discussion here.

Fixed in new patch in attachment.

> FWIW The comment in hashjoin.h needs updating to reflect the change.

Done in the last patch. Is my rewording accurate?

> >   2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context
> > switch seems fragile as it could be forgotten in futur code path/changes.
> > So I added an Assert() in the function to make sure the current memory
> > context is "HashBatchFiles" as expected.
> >   Another way to tie this up might be to pass the memory context as
> > argument to the function.
> >   ... Or maybe I'm over precautionary.
> >   
> 
> I'm not sure I'd call that fragile, we have plenty other code that
> expects the memory context to be set correctly. Not sure about the
> assert, but we don't have similar asserts anywhere else.

I mostly stuck it there to stimulate the discussion around this, as I needed
to scratch that itch.

> But I think it's just ugly and overly verbose

+1

Your patch was just a demo/debug patch at the time. It needed some cleanup
now :)

> it'd be much nicer to e.g. pass the memory context as a parameter, and do
> the switch inside.

That was a suggestion in my previous mail, so I did it in the new patch.
Let's see what other reviewers think.

> >   3. You wrote:
> >   
> >>> A separate BufFile memory context helps, although people won't see it
> >>> unless they attach a debugger, I think. Better than nothing, but I was
> >>> wondering if we could maybe print some warnings when the number of batch
> >>> files gets too high ...  
> > 
> >   So I added a WARNING when batches memory are exhausting the memory size
> >   allowed.
> > 
> >+   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
> >+   elog(WARNING, "Growing number of hash batch is exhausting
> > memory");
> > 
> >   This is repeated on each call of ExecHashIncreaseNumBatches when BufFile
> >   overflows the memory budget. I realize now I should probably add the
> > memory limit, the number of current batch and their memory consumption.
> >   The message is probably too cryptic for a user. It could probably be
> >   reworded, but some doc or additionnal hint around this message might help.
> >   
> 
> Hmmm, not sure is WARNING is a good approach, but I don't have a better
> idea at the moment.

I stepped it down to NOTICE and added some more info.

Here is the output of the last patch with a 1MB work_mem:

  =# explain analyze select * from small join large using (id);
  WARNING:  increasing number of batches from 1 to 2
  WARNING:  increasing number of batches from 2 to 4
  WARNING:  increasing number of batches from 4 to 8
  WARNING:  increasing number of batches from 8 to 16
  WARNING:  increasing 

Re: Memory leak from ExecutorState context?

2023-03-27 Thread Tomas Vondra
On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote:
> Hi,
> 
> On Mon, 20 Mar 2023 15:12:34 +0100
> Jehan-Guillaume de Rorthais  wrote:
> 
>> On Mon, 20 Mar 2023 09:32:17 +0100
>> Tomas Vondra  wrote:
>>
> * Patch 1 could be rebased/applied/backpatched

 Would it help if I rebase Patch 1 ("move BufFile stuff into separate
 context")?   
>>>
>>> Yeah, I think this is something we'd want to do. It doesn't change the
>>> behavior, but it makes it easier to track the memory consumption etc.  
>>
>> Will do this week.
> 
> Please, find in attachment a patch to allocate bufFiles in a dedicated 
> context.
> I picked up your patch, backpatch'd it, went through it and did some minor
> changes to it. I have some comment/questions thought.
> 
>   1. I'm not sure why we must allocate the "HashBatchFiles" new context
>   under ExecutorState and not under hashtable->hashCxt?
> 
>   The only references I could find was in hashjoin.h:30:
> 
>/* [...]
> * [...] (Exception: data associated with the temp files lives in the
> * per-query context too, since we always call buffile.c in that context.)
> 
>   And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
>   original comment in the patch):
> 
>/* [...]
> * Note: it is important always to call this in the regular executor
> * context, not in a shorter-lived context; else the temp file buffers
> * will get messed up.
> 
> 
>   But these are not explanation of why BufFile related allocations must be 
> under
>   a per-query context. 
> 

Doesn't that simply describe the current (unpatched) behavior where
BufFile is allocated in the per-query context? I mean, the current code
calls BufFileCreateTemp() without switching the context, so it's in the
ExecutorState. But with the patch it very clearly is not.

And I'm pretty sure the patch should do

hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
 "HashBatchFiles",
 ALLOCSET_DEFAULT_SIZES);

and it'd still work. Or why do you think we *must* allocate it under
ExecutorState?

FWIW The comment in hashjoin.h needs updating to reflect the change.

> 
>   2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context 
> switch
>   seems fragile as it could be forgotten in futur code path/changes. So I
>   added an Assert() in the function to make sure the current memory context is
>   "HashBatchFiles" as expected.
>   Another way to tie this up might be to pass the memory context as argument 
> to
>   the function.
>   ... Or maybe I'm over precautionary.
> 

I'm not sure I'd call that fragile, we have plenty other code that
expects the memory context to be set correctly. Not sure about the
assert, but we don't have similar asserts anywhere else.

But I think it's just ugly and overly verbose - it'd be much nicer to
e.g. pass the memory context as a parameter, and do the switch inside.
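
I mean something like this (untested sketch, based on the current function,
not the actual final code):

    void
    ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
                          BufFile **fileptr, MemoryContext filecxt)
    {
        BufFile    *file = *fileptr;

        if (file == NULL)
        {
            /* First write to this batch file, so create it in filecxt. */
            MemoryContext oldctx = MemoryContextSwitchTo(filecxt);

            file = BufFileCreateTemp(false);
            *fileptr = file;

            MemoryContextSwitchTo(oldctx);
        }

        BufFileWrite(file, &hashvalue, sizeof(uint32));
        BufFileWrite(file, tuple, tuple->t_len);
    }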

> 
>   3. You wrote:
> 
>>> A separate BufFile memory context helps, although people won't see it
>>> unless they attach a debugger, I think. Better than nothing, but I was
>>> wondering if we could maybe print some warnings when the number of batch
>>> files gets too high ...
> 
>   So I added a WARNING when batches memory are exhausting the memory size
>   allowed.
> 
>+   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
>+   elog(WARNING, "Growing number of hash batch is exhausting memory");
> 
>   This is repeated on each call of ExecHashIncreaseNumBatches when BufFile
>   overflows the memory budget. I realize now I should probably add the memory
>   limit, the number of current batch and their memory consumption.
>   The message is probably too cryptic for a user. It could probably be
>   reworded, but some doc or additionnal hint around this message might help.
> 

Hmmm, not sure if WARNING is a good approach, but I don't have a better
idea at the moment.

>   4. I left the debug messages for some more review rounds
> 

OK


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-03-27 Thread Jehan-Guillaume de Rorthais
Hi,

On Mon, 20 Mar 2023 15:12:34 +0100
Jehan-Guillaume de Rorthais  wrote:

> On Mon, 20 Mar 2023 09:32:17 +0100
> Tomas Vondra  wrote:
> 
> > >> * Patch 1 could be rebased/applied/backpatched
> > > 
> > > Would it help if I rebase Patch 1 ("move BufFile stuff into separate
> > > context")?   
> > 
> > Yeah, I think this is something we'd want to do. It doesn't change the
> > behavior, but it makes it easier to track the memory consumption etc.  
> 
> Will do this week.

Please, find in attachment a patch to allocate bufFiles in a dedicated context.
I picked up your patch, backpatch'd it, went through it and did some minor
changes to it. I have some comment/questions thought.

  1. I'm not sure why we must allocate the "HashBatchFiles" new context
  under ExecutorState and not under hashtable->hashCxt?

  The only references I could find was in hashjoin.h:30:

   /* [...]
* [...] (Exception: data associated with the temp files lives in the
* per-query context too, since we always call buffile.c in that context.)

  And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
  original comment in the patch):

   /* [...]
* Note: it is important always to call this in the regular executor
* context, not in a shorter-lived context; else the temp file buffers
* will get messed up.


  But these are not explanations of why BufFile-related allocations must be
  under a per-query context.


  2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context switch
  seems fragile, as it could be forgotten in future code paths/changes. So I
  added an Assert() in the function to make sure the current memory context is
  "HashBatchFiles" as expected.
  Another way to tie this up might be to pass the memory context as an
  argument to the function.
  ... Or maybe I'm being overly cautious.


  3. You wrote:

>> A separate BufFile memory context helps, although people won't see it
>> unless they attach a debugger, I think. Better than nothing, but I was
>> wondering if we could maybe print some warnings when the number of batch
>> files gets too high ...

  So I added a WARNING when the batches' memory exhausts the allowed memory
  size.

   +   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
   +   elog(WARNING, "Growing number of hash batch is exhausting memory");

  This is repeated on each call of ExecHashIncreaseNumBatches when the BufFiles
  overflow the memory budget. I realize now I should probably add the memory
  limit, the current number of batches and their memory consumption (see the
  sketch after this list).
  The message is probably too cryptic for a user. It could probably be
  reworded, but some doc or an additional hint around this message might help.

  4. I left the debug messages for some more review rounds
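
  For point 3, the reworded message I have in mind would be something like
  this (just a sketch, wording to be bikeshedded):

    if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
        elog(WARNING,
             "growing number of hash batches to %d is exhausting allowed memory (" INT64_FORMAT " > %zu)",
             nbatch,
             hashtable->fileCxt->mem_allocated,
             hashtable->spaceAllowed);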

Regards,
>From a0dd541ad4e47735fc388d0c553e935f0956f254 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Mon, 27 Mar 2023 15:54:39 +0200
Subject: [PATCH] Allocate hash batches related BufFile in a dedicated context

---
 src/backend/executor/nodeHash.c | 39 +++--
 src/backend/executor/nodeHashjoin.c | 14 ---
 src/include/executor/hashjoin.h |  1 +
 3 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 748c9b0024..3f1ae9755e 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -484,7 +484,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 	 *
 	 * The hashtable control block is just palloc'd from the executor's
 	 * per-query memory context.  Everything else should be kept inside the
-	 * subsidiary hashCxt or batchCxt.
+	 * subsidiary hashCxt, batchCxt or fileCxt.
 	 */
 	hashtable = palloc_object(HashJoinTableData);
 	hashtable->nbuckets = nbuckets;
@@ -538,6 +538,10 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 "HashBatchContext",
 ALLOCSET_DEFAULT_SIZES);
 
+	hashtable->fileCxt = AllocSetContextCreate(CurrentMemoryContext,
+ "HashBatchFiles",
+ ALLOCSET_DEFAULT_SIZES);
+
 	/* Allocate data that will live for the life of the hashjoin */
 
 	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
@@ -570,15 +574,21 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 
 	if (nbatch > 1 && hashtable->parallel_state == NULL)
 	{
+		MemoryContext oldctx;
+
 		/*
 		 * allocate and initialize the file arrays in hashCxt (not needed for
 		 * parallel case which uses shared tuplestores instead of raw files)
 		 */
+		oldctx = MemoryContextSwitchTo(hashtable->fileCxt);
+
 		hashtable->innerBatchFile = palloc0_array(BufFile *, nbatch);
 		hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
 		/* The files will not be opened until needed... */
 		/* ... but make sure we have temp tablespaces established for them */
 		PrepareTempTablespaces();
+
+		

Re: Memory leak from ExecutorState context?

2023-03-27 Thread Melanie Plageman
On Thu, Mar 23, 2023 at 2:49 PM Tomas Vondra
 wrote:
> > On Mon, Mar 20, 2023 at 10:12 AM Jehan-Guillaume de Rorthais
> >  wrote:
> >> BNJL and/or other considerations are for 17 or even after. In the meantime,
> >> Melanie, who authored BNLJ, +1 the balancing patch as it can coexists with 
> >> other
> >> discussed solutions. No one down vote since then. Melanie, what is your
> >> opinion today on this patch? Did you change your mind as you worked for 
> >> many
> >> months on BNLJ since then?
> >
> > So, in order to avoid deadlock, my design of adaptive hash join/block
> > nested loop hash join required a new parallelism concept not yet in
> > Postgres at the time -- the idea of a lone worker remaining around to do
> > work when others have left.
> >
> > See: BarrierArriveAndDetachExceptLast()
> > introduced in 7888b09994
> >
> > Thomas Munro had suggested we needed to battle test this concept in a
> > more straightforward feature first, so I implemented parallel full outer
> > hash join and parallel right outer hash join with it.
> >
> > https://commitfest.postgresql.org/42/2903/
> >
> > This has been stalled ready-for-committer for two years. It happened to
> > change timing such that it made an existing rarely hit parallel hash
> > join bug more likely to be hit. Thomas recently committed our fix for
> > this in 8d578b9b2e37a4d (last week). It is my great hope that parallel
> > full outer hash join goes in before the 16 feature freeze.
> >
>
> Good to hear this is moving forward.
>
> > If it does, I think it could make sense to try and find committable
> > smaller pieces of the adaptive hash join work. As it is today, parallel
> > hash join does not respect work_mem, and, in some sense, is a bit broken.
> >
> > I would be happy to work on this feature again, or, if you were
> > interested in picking it up, to provide review and any help I can if for
> > you to work on it.
> >
>
> I'm no expert in parallel hashjoin expert, but I'm willing to take a
> look a rebased patch. I'd however recommend breaking the patch into
> smaller pieces - the last version I see in the thread is ~250kB, which
> is rather daunting, and difficult to review without interruption. (I
> don't remember the details of the patch, so maybe that's not possible
> for some reason.)

Great! I will rebase and take a stab at splitting up the patch into
smaller commits, with a focus on finding pieces that may have standalone
benefits, in the 17 dev cycle.

- Melanie




Re: Memory leak from ExecutorState context?

2023-03-23 Thread Tomas Vondra



On 3/23/23 13:07, Melanie Plageman wrote:
> On Fri, Mar 10, 2023 at 1:51 PM Jehan-Guillaume de Rorthais
>  wrote:
>>> So I guess the best thing would be to go through these threads, see what
>>> the status is, restart the discussion and propose what to do. If you do
>>> that, I'm happy to rebase the patches, and maybe see if I could improve
>>> them in some way.
>>
>> OK! It took me some time, but I did it. I'll try to sum up the situation as
>> simply as possible.
> 
> Wow, so many memories!
> 
> I'm excited that someone looked at this old work (though it is sad that
> a customer faced this issue). And, Jehan, I really appreciate your great
> summarization of all these threads. This will be a useful reference.
> >> 1. "move BufFile stuff into separate context"
>>last found patch: 2019-04-21
>>
>> https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development
>>
>> https://www.postgresql.org/message-id/attachment/100822/0001-move-BufFile-stuff-into-separate-context.patch
>>
>>This patch helps with observability/debug by allocating the bufFiles in 
>> the
>>appropriate context instead of the "ExecutorState" one.
>>
>>I suppose this simple one has been forgotten in the fog of all other
>>discussions. Also, this probably worth to be backpatched.
> 
> I agree with Jehan-Guillaume and Tomas that this seems fine to commit
> alone.
> 

+1 to that. I think the separate memory contexts would be
non-controversial for backpatching too.

> Tomas:
>> Yeah, I think this is something we'd want to do. It doesn't change the
>> behavior, but it makes it easier to track the memory consumption etc.
> 
>> 2. "account for size of BatchFile structure in hashJoin"
>>last found patch: 2019-04-22
>>
>> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>>
>> https://www.postgresql.org/message-id/attachment/100951/v2-simple-rebalance.patch
>>
>>This patch seems like a good first step:
>>
>>* it definitely helps older versions where other patches discussed are way
>>  too invasive to be backpatched
>>* it doesn't step on the way of other discussed patches
>>
>>While looking at the discussions around this patch, I was wondering if the
>>planner considers the memory allocation of bufFiles. But of course, 
>> Melanie
>>already noticed that long before I was aware of this problem and 
>> discussion:
>>
>>2019-07-10: «I do think that accounting for Buffile overhead when 
>> estimating
>>the size of the hashtable during ExecChooseHashTableSize() so it can be
>>  used during planning is a worthwhile patch by itself (though I know it
>>  is not even part of this patch).»
>>
>> https://www.postgresql.org/message-id/CAAKRu_Yiam-%3D06L%2BR8FR%2BVaceb-ozQzzMqRiY2pDYku1VdZ%3DEw%40mail.gmail.com
>>
>>Tomas Vondra agreed with this in his answer, but no new version of the 
>> patch
>>where produced.
>>
>>Finally, Melanie was pushing the idea to commit this patch no matter other
>>pending patches/ideas:
>>
>>2019-09-05: «If Tomas or someone else has time to pick up and modify 
>> BufFile
>>  accounting patch, committing that still seems like the nest logical
>>  step.»
>>
>> https://www.postgresql.org/message-id/CAAKRu_b6%2BjC93WP%2BpWxqK5KAZJC5Rmxm8uquKtEf-KQ%2B%2B1Li6Q%40mail.gmail.com
> 
> I think I would have to see a modern version of a patch which does this
> to assess if it makes sense. But, I probably still agree with 2019
> Melanie :)
> Overall, I think anything that makes it faster to identify customer
> cases of this bug is good (which, I would think granular memory contexts
> and accounting would do). Even if it doesn't fix it, we can determine
> more easily how often customers are hitting this issue, which helps
> justify an admittedly large design change to hash join.
> 

Agreed. This issue is quite rare (we only get a report once a year or
two), just enough to forget about it and have to rediscover it's there.
So having something that clearly identifies it would be good.

A separate BufFile memory context helps, although people won't see it
unless they attach a debugger, I think. Better than nothing, but I was
wondering if we could maybe print some warnings when the number of batch
files gets too high ...
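
Just to illustrate the kind of thing I mean (the threshold and the wording
are completely arbitrary, this is not a proposal):

  /* hypothetical check, e.g. right after nbatch gets doubled */
  if (hashtable->nbatch >= 65536)
      ereport(WARNING,
              (errmsg("hash join reached %d batches", hashtable->nbatch),
               errdetail("Each batch keeps two BufFiles with ~8kB buffers, so memory use may greatly exceed work_mem.")));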

> On Mon, Mar 20, 2023 at 10:12 AM Jehan-Guillaume de Rorthais
>  wrote:
>> BNJL and/or other considerations are for 17 or even after. In the meantime,
>> Melanie, who authored BNLJ, +1 the balancing patch as it can coexists with 
>> other
>> discussed solutions. No one down vote since then. Melanie, what is your
>> opinion today on this patch? Did you change your mind as you worked for many
>> months on BNLJ since then?
> 
> So, in order to avoid deadlock, my design of adaptive hash join/block
> nested loop hash join required a new parallelism concept not yet in
> Postgres at the time -- the idea of a lone worker remaining around to do
> work when others 

Re: Memory leak from ExecutorState context?

2023-03-23 Thread Melanie Plageman
On Fri, Mar 10, 2023 at 1:51 PM Jehan-Guillaume de Rorthais
 wrote:
> > So I guess the best thing would be to go through these threads, see what
> > the status is, restart the discussion and propose what to do. If you do
> > that, I'm happy to rebase the patches, and maybe see if I could improve
> > them in some way.
>
> OK! It took me some time, but I did it. I'll try to sum up the situation as
> simply as possible.

Wow, so many memories!

I'm excited that someone looked at this old work (though it is sad that
a customer faced this issue). And, Jehan, I really appreciate your great
summarization of all these threads. This will be a useful reference.

> 1. "move BufFile stuff into separate context"
>last found patch: 2019-04-21
>
> https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development
>
> https://www.postgresql.org/message-id/attachment/100822/0001-move-BufFile-stuff-into-separate-context.patch
>
>This patch helps with observability/debug by allocating the bufFiles in the
>appropriate context instead of the "ExecutorState" one.
>
>I suppose this simple one has been forgotten in the fog of all other
>discussions. Also, this probably worth to be backpatched.

I agree with Jehan-Guillaume and Tomas that this seems fine to commit
alone.

Tomas:
> Yeah, I think this is something we'd want to do. It doesn't change the
> behavior, but it makes it easier to track the memory consumption etc.

> 2. "account for size of BatchFile structure in hashJoin"
>last found patch: 2019-04-22
>
> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>
> https://www.postgresql.org/message-id/attachment/100951/v2-simple-rebalance.patch
>
>This patch seems like a good first step:
>
>* it definitely helps older versions where other patches discussed are way
>  too invasive to be backpatched
>* it doesn't step on the way of other discussed patches
>
>While looking at the discussions around this patch, I was wondering if the
>planner considers the memory allocation of bufFiles. But of course, Melanie
>already noticed that long before I was aware of this problem and 
> discussion:
>
>2019-07-10: «I do think that accounting for Buffile overhead when 
> estimating
>the size of the hashtable during ExecChooseHashTableSize() so it can be
>  used during planning is a worthwhile patch by itself (though I know it
>  is not even part of this patch).»
>
> https://www.postgresql.org/message-id/CAAKRu_Yiam-%3D06L%2BR8FR%2BVaceb-ozQzzMqRiY2pDYku1VdZ%3DEw%40mail.gmail.com
>
>Tomas Vondra agreed with this in his answer, but no new version of the 
> patch
>where produced.
>
>Finally, Melanie was pushing the idea to commit this patch no matter other
>pending patches/ideas:
>
>2019-09-05: «If Tomas or someone else has time to pick up and modify 
> BufFile
>  accounting patch, committing that still seems like the nest logical
>  step.»
>
> https://www.postgresql.org/message-id/CAAKRu_b6%2BjC93WP%2BpWxqK5KAZJC5Rmxm8uquKtEf-KQ%2B%2B1Li6Q%40mail.gmail.com

I think I would have to see a modern version of a patch which does this
to assess if it makes sense. But, I probably still agree with 2019
Melanie :)
Overall, I think anything that makes it faster to identify customer
cases of this bug is good (which, I would think granular memory contexts
and accounting would do). Even if it doesn't fix it, we can determine
more easily how often customers are hitting this issue, which helps
justify an admittedly large design change to hash join.

On Mon, Mar 20, 2023 at 10:12 AM Jehan-Guillaume de Rorthais
 wrote:
> BNJL and/or other considerations are for 17 or even after. In the meantime,
> Melanie, who authored BNLJ, +1 the balancing patch as it can coexists with 
> other
> discussed solutions. No one down vote since then. Melanie, what is your
> opinion today on this patch? Did you change your mind as you worked for many
> months on BNLJ since then?

So, in order to avoid deadlock, my design of adaptive hash join/block
nested loop hash join required a new parallelism concept not yet in
Postgres at the time -- the idea of a lone worker remaining around to do
work when others have left.

See: BarrierArriveAndDetachExceptLast()
introduced in 7888b09994
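
Roughly, the pattern it enables looks like this (simplified and from memory;
the barrier field name is approximate):

  /* everyone arrives, but only one participant stays attached */
  if (BarrierArriveAndDetachExceptLast(&batch->batch_barrier))
  {
      /* we are the last worker: finish this batch's work serially,
       * without ever waiting on anyone (so no deadlock risk) */
  }
  else
  {
      /* we detached: move on and help elsewhere instead of waiting */
  }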

Thomas Munro had suggested we needed to battle test this concept in a
more straightforward feature first, so I implemented parallel full outer
hash join and parallel right outer hash join with it.

https://commitfest.postgresql.org/42/2903/

This has been stalled ready-for-committer for two years. It happened to
change timing in a way that made an existing, rarely hit parallel hash
join bug more likely to be hit. Thomas recently committed our fix for
this in 8d578b9b2e37a4d (last week). It is my great hope that parallel
full outer hash join goes in before the 16 feature freeze.

If it does, I think it could make sense to try and find 

Re: Memory leak from ExecutorState context?

2023-03-20 Thread Jehan-Guillaume de Rorthais
On Mon, 20 Mar 2023 09:32:17 +0100
Tomas Vondra  wrote:

> >> * Patch 1 could be rebased/applied/backpatched  
> > 
> > Would it help if I rebase Patch 1 ("move BufFile stuff into separate
> > context")? 
> 
> Yeah, I think this is something we'd want to do. It doesn't change the
> behavior, but it makes it easier to track the memory consumption etc.

Will do this week.

> I think focusing on the backpatchability is not particularly good
> approach. It's probably be better to fist check if there's any hope of
> restarting the work on BNLJ, which seems like a "proper" solution for
> the future.

Backpatching is worth some consideration. Balancing is almost there and could
help in 16. As time flies, it might well miss release 16, but maybe it could be
backpatched in 16.1 or a later minor release? But what if it is not eligible for
backpatching? We would stall another year without having at least an imperfect
stop-gap solution (sorry for picking your words ;)).

BNLJ and/or other considerations are for 17 or even after. In the meantime,
Melanie, who authored BNLJ, +1'd the balancing patch as it can coexist with
other discussed solutions. No one has downvoted it since then. Melanie, what is
your opinion today on this patch? Did you change your mind as you worked for
many months on BNLJ since then?

> On 3/19/23 20:31, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 05:41:11PM +0100, Tomas Vondra wrote:  
>  * Patch 2 is worth considering to backpatch  
> >>
> >> I'm not quite sure what exactly are the numbered patches, as some of the
> >> threads had a number of different patch ideas, and I'm not sure which
> >> one was/is the most promising one.  
> > 
> > patch 2 is referring to the list of patches that was compiled
> > https://www.postgresql.org/message-id/20230310195114.6d0c5406%40karst  
> 
> Ah, I see - it's just the "rebalancing" patch which minimizes the total
> amount of memory used (i.e. grow work_mem a bit, so that we don't
> allocate too many files).
> 
> Yeah, I think that's the best we can do without reworking how we spill
> data (slicing or whatever).

Indeed, that's why I was speaking about backpatching for this one:

* it surely helps 16 (and maybe previous releases) in such skewed situations
* it's constrained
* it's not too invasive, it doesn't shake the whole algorithm all over the place
* Melanie was +1 for it, no one has downvoted it.

What needs to be discussed/worked on:

* any regressions for existing queries currently running fine or without OOM?
* should bufFile memory consumption be added to the planner's considerations?

> I think the spilling is "nicer" in that it actually enforces work_mem
> more strictly than (a), 

Sure it enforces work_mem more strictly, but this is a discussion for 17 or
later in my humble opinion.

> but we should probably spend a bit more time on
> the exact spilling strategy. I only really tried two trivial ideas, but
> maybe we can be smarter about allocating / sizing the files? Maybe
> instead of slices of constant size we could/should make them larger,
> similarly to what log-structured storage does.
> 
> For example we might double the number of batches a file represents, so
> the first file would be for batch 1, the next one for batches 2+3, then
> 4+5+6+7, then 8-15, then 16-31, ...
> 
> We should have some model calculating (i) amount of disk space we would
> need for the spilled files, and (ii) the amount of I/O we do when
> writing the data between temp files.

And:

* what about Robert's discussion on uneven batch distribution? Why is it
  ignored? Maybe there were some IRL or off-list discussions? Or did I miss
  some mails?
* what about dealing with all the normal batches first, then reprocessing the
  skewed ones in the freshly emptied batches, splitting them if needed, then
  rinse & repeat? At some point, we would probably still need something like
  slicing and/or BNLJ though...

> let's revive the rebalance and/or spilling patches. Imperfect but better than
> nothing.

+1 for rebalance. I'm not even sure it could make it to 16 as we are running
out of time, but it's worth trying as it's the closest one that could be
reviewed and marked ready-for-committer.

I might lack ambition, but the spilling patches seem too much to make it for
16. They seem to belong with other larger patches/ideas (patches 4 and 5 in my
sum-up). But this is just my humble feeling.

> And then in the end we can talk about if/what can be backpatched.
> 
> FWIW I don't think there's a lot of rush, considering this is clearly a
> matter for PG17. So the summer CF at the earliest, people are going to
> be busy until then.

100% agree, there's no rush for patches 3, 4 ... and 5.

Thanks!




Re: Memory leak from ExecutorState context?

2023-03-20 Thread Tomas Vondra
On 3/19/23 20:31, Justin Pryzby wrote:
> On Fri, Mar 17, 2023 at 05:41:11PM +0100, Tomas Vondra wrote:
 * Patch 2 is worth considering to backpatch
>>
>> I'm not quite sure what exactly are the numbered patches, as some of the
>> threads had a number of different patch ideas, and I'm not sure which
>> one was/is the most promising one.
> 
> patch 2 is referring to the list of patches that was compiled
> https://www.postgresql.org/message-id/20230310195114.6d0c5406%40karst

Ah, I see - it's just the "rebalancing" patch which minimizes the total
amount of memory used (i.e. grow work_mem a bit, so that we don't
allocate too many files).

Yeah, I think that's the best we can do without reworking how we spill
data (slicing or whatever).
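
The heart of it is just a heuristic at the point where we'd otherwise double
nbatch - a rough sketch of the idea (not the actual patch):

  /* each batch needs two BufFiles, each keeping an ~8kB buffer */
  if ((size_t) hashtable->nbatch * 2 * 8192 > hashtable->spaceAllowed)
      hashtable->spaceAllowed *= 2;           /* growing the hash table is cheaper */
  else
      ExecHashIncreaseNumBatches(hashtable);  /* doubling batches still pays off */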



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-03-19 Thread Justin Pryzby
On Fri, Mar 17, 2023 at 05:41:11PM +0100, Tomas Vondra wrote:
> >> * Patch 2 is worth considering to backpatch
> 
> I'm not quite sure what exactly are the numbered patches, as some of the
> threads had a number of different patch ideas, and I'm not sure which
> one was/is the most promising one.

patch 2 is referring to the list of patches that was compiled
https://www.postgresql.org/message-id/20230310195114.6d0c5406%40karst




Re: Memory leak from ExecutorState context?

2023-03-17 Thread Tomas Vondra


On 3/17/23 09:18, Jehan-Guillaume de Rorthais wrote:
> Hi there,
> 
> On Fri, 10 Mar 2023 19:51:14 +0100
> Jehan-Guillaume de Rorthais  wrote:
> 
>>> So I guess the best thing would be to go through these threads, see what
>>> the status is, restart the discussion and propose what to do. If you do
>>> that, I'm happy to rebase the patches, and maybe see if I could improve
>>> them in some way.  
>>
>> [...]
>>
>>> I was hoping we'd solve this by the BNL, but if we didn't get that in 4
>>> years, maybe we shouldn't stall and get at least an imperfect stop-gap
>>> solution ...  
>>
>> Indeed. So, to sum-up:
>>
>> * Patch 1 could be rebased/applied/backpatched
> 
> Would it help if I rebase Patch 1 ("move BufFile stuff into separate 
> context")?
> 

Yeah, I think this is something we'd want to do. It doesn't change the
behavior, but it makes it easier to track the memory consumption etc.

>> * Patch 2 is worth considering to backpatch
> 

I'm not quite sure what exactly are the numbered patches, as some of the
threads had a number of different patch ideas, and I'm not sure which
one was/is the most promising one.

IIRC there were two directions:

a) "balancing" i.e. increasing work_mem to minimize the total memory
consumption (best effort, but allowing exceeding work_mem)

b) limiting the number of BufFiles, and combining data from "future"
batches into a single file

I think the spilling is "nicer" in that it actually enforces work_mem
more strictly than (a), but we should probably spend a bit more time on
the exact spilling strategy. I only really tried two trivial ideas, but
maybe we can be smarter about allocating / sizing the files? Maybe
instead of slices of constant size we could/should make them larger,
similarly to what log-structured storage does.

For example we might double the number of batches a file represents, so
the first file would be for batch 1, the next one for batches 2+3, then
4+5+6+7, then 8-15, then 16-31, ...

We should have some model calculating (i) amount of disk space we would
need for the spilled files, and (ii) the amount of I/O we do when
writing the data between temp files.
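
To make the mapping concrete, a throwaway sketch (purely illustrative):

  /*
   * Each successive file covers twice as many batches (1, 2-3, 4-7, ...),
   * so nbatch batches need only ~log2(nbatch) files per side.
   */
  static int
  slice_for_batch(int batchno)      /* batchno >= 1 */
  {
      int     slice = 0;

      while (batchno > 1)
      {
          batchno >>= 1;
          slice++;
      }
      return slice;                 /* 1 -> 0, 2-3 -> 1, 4-7 -> 2, 8-15 -> 3 */
  }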

> Same question.
> 
>> * Patch 3 seemed withdrawn in favor of BNLJ
>> * Patch 4 is waiting for some more review and has some TODO
>> * discussion 5 worth few minutes to discuss before jumping on previous topics
> 
> These other patches needs more discussions and hacking. They have a low
> priority compare to other discussions and running commitfest. However, how can
> avoid losing them in limbo again?
> 

I think focusing on the backpatchability is not a particularly good
approach. It's probably better to first check if there's any hope of
restarting the work on BNLJ, which seems like a "proper" solution for
the future.

If getting that soon (in PG17) is unlikely, let's revive the rebalance
and/or spilling patches. Imperfect but better than nothing.

And then in the end we can talk about if/what can be backpatched.


FWIW I don't think there's a lot of rush, considering this is clearly a
matter for PG17. So the summer CF at the earliest, people are going to
be busy until then.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-03-17 Thread Jehan-Guillaume de Rorthais
Hi there,

On Fri, 10 Mar 2023 19:51:14 +0100
Jehan-Guillaume de Rorthais  wrote:

> > So I guess the best thing would be to go through these threads, see what
> > the status is, restart the discussion and propose what to do. If you do
> > that, I'm happy to rebase the patches, and maybe see if I could improve
> > them in some way.  
> 
> [...]
> 
> > I was hoping we'd solve this by the BNL, but if we didn't get that in 4
> > years, maybe we shouldn't stall and get at least an imperfect stop-gap
> > solution ...  
> 
> Indeed. So, to sum-up:
> 
> * Patch 1 could be rebased/applied/backpatched

Would it help if I rebase Patch 1 ("move BufFile stuff into separate context")?

> * Patch 2 is worth considering to backpatch

Same question.

> * Patch 3 seemed withdrawn in favor of BNLJ
> * Patch 4 is waiting for some more review and has some TODO
> * discussion 5 worth few minutes to discuss before jumping on previous topics

These other patches need more discussion and hacking. They have a low
priority compared to other discussions and the running commitfest. However, how
can we avoid losing them in limbo again?

Regards,




Re: Memory leak from ExecutorState context?

2023-03-10 Thread Jehan-Guillaume de Rorthais
Hi,

> So I guess the best thing would be to go through these threads, see what
> the status is, restart the discussion and propose what to do. If you do
> that, I'm happy to rebase the patches, and maybe see if I could improve
> them in some way.

OK! It took me some time, but I did it. I'll try to sum up the situation as
simply as possible.

I reviewed the following threads:

* Out of Memory errors are frustrating as heck!
  2019-04-14 -> 2019-04-28
  
https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net

  This discussion stalled, waiting for the OP, but the ideas there ignited all
  other discussions.

* accounting for memory used for BufFile during hash joins
  2019-05-04 -> 2019-09-10
  
https://www.postgresql.org/message-id/flat/20190504003414.bulcbnge3rhwhcsh%40development

  This was supposed to push forward a patch discussed on the previous thread,
  but it actually took over it and more ideas popped up from there.

* Replace hashtable growEnable flag
  2019-05-15 -> 2019-05-16
  
https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com

  This one quickly merged to the next one.

* Avoiding hash join batch explosions with extreme skew and weird stats
  2019-05-16 -> 2020-09-24
  
https://www.postgresql.org/message-id/flat/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

  Another thread discussing another facet of the problem, but it eventually
  ended up discussing / reviewing the BNLJ implementation.
  
Five possible fixes/ideas were discussed all over these threads:


1. "move BufFile stuff into separate context"
   last found patch: 2019-04-21
   
https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development
   
https://www.postgresql.org/message-id/attachment/100822/0001-move-BufFile-stuff-into-separate-context.patch

   This patch helps with observability/debug by allocating the bufFiles in the
   appropriate context instead of the "ExecutorState" one.

   I suppose this simple one has been forgotten in the fog of all other
   discussions. Also, this is probably worth backpatching.

2. "account for size of BatchFile structure in hashJoin"
   last found patch: 2019-04-22
   
https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
   
https://www.postgresql.org/message-id/attachment/100951/v2-simple-rebalance.patch

   This patch seems like a good first step:

   * it definitely helps older versions where other patches discussed are way
 too invasive to be backpatched
   * it doesn't get in the way of other discussed patches

   While looking at the discussions around this patch, I was wondering if the
   planner considers the memory allocation of bufFiles. But of course, Melanie
   already noticed that long before I was aware of this problem and discussion:

   2019-07-10: «I do think that accounting for Buffile overhead when estimating
   the size of the hashtable during ExecChooseHashTableSize() so it can be
 used during planning is a worthwhile patch by itself (though I know it
 is not even part of this patch).»
   
https://www.postgresql.org/message-id/CAAKRu_Yiam-%3D06L%2BR8FR%2BVaceb-ozQzzMqRiY2pDYku1VdZ%3DEw%40mail.gmail.com
 
   Tomas Vondra agreed with this in his answer, but no new version of the patch
   was produced.

   Finally, Melanie was pushing the idea of committing this patch regardless of
   other pending patches/ideas:

   2019-09-05: «If Tomas or someone else has time to pick up and modify BufFile
 accounting patch, committing that still seems like the nest logical
 step.»
   
https://www.postgresql.org/message-id/CAAKRu_b6%2BjC93WP%2BpWxqK5KAZJC5Rmxm8uquKtEf-KQ%2B%2B1Li6Q%40mail.gmail.com

   Unless I'm wrong, no one downvoted this.

3. "per slice overflow file"
   last found patch: 2019-05-08
   
https://www.postgresql.org/message-id/20190508150844.rij36rtuk4lhvztw%40development
   
https://www.postgresql.org/message-id/attachment/101080/v4-per-slice-overflow-file-20190508.patch

   This patch was withdrawn after an off-list discussion with Thomas Munro
   because of a missing parallel hashJoin implementation. Plus, before any
   effort started on the parallel implementation, the BNLJ idea appeared and
   seemed more appealing.

   See:
   
https://www.postgresql.org/message-id/20190529145517.sj2poqmb3cr4cg6w%40development

   At the time, it still seemed to have some interest despite the BNLJ patch:

   2019-07-10: «If slicing is made to work for parallel-aware hashjoin and the
   code is in a committable state (and probably has the threshold I mentioned
 above), then I think that this patch should go in.»
   
https://www.postgresql.org/message-id/CAAKRu_Yiam-%3D06L%2BR8FR%2BVaceb-ozQzzMqRiY2pDYku1VdZ%3DEw%40mail.gmail.com

   But Tomas might have disapproved of this later:

   2019-09-10: «I have to admit I kinda lost track [...] My feeling is that we
 should get the BNLJ committed 

Re: Memory leak from ExecutorState context?

2023-03-02 Thread Tomas Vondra



On 3/2/23 23:57, Jehan-Guillaume de Rorthais wrote:
> On Thu, 2 Mar 2023 19:53:14 +0100
> Tomas Vondra  wrote:
>> On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
> ...
> 
>>> There was some thoughts about how to make a better usage of the memory. As
>>> memory is exploding way beyond work_mem, at least, avoid to waste it with
>>> too many buffers of BufFile. So you expand either the work_mem or the
>>> number of batch, depending on what move is smarter. TJis is explained and
>>> tested here:
>>>
>>> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
>>> https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development
>>>
>>> And then, another patch to overflow each batch to a dedicated temp file and
>>> stay inside work_mem (v4-per-slice-overflow-file.patch):
>>>
>>> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>>>
>>> Then, nothing more on the discussion about this last patch. So I guess it
>>> just went cold.
>>
>> I think a contributing factor was that the OP did not respond for a
>> couple months, so the thread went cold.
>>
>>> For what it worth, these two patches seems really interesting to me. Do you
>>> need any help to revive it?
>>
>> I think another reason why that thread went nowhere were some that we've
>> been exploring a different (and likely better) approach to fix this by
>> falling back to a nested loop for the "problematic" batches.
>>
>> As proposed in this thread:
>>
>>  
>> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
> 
> Unless I'm wrong, you are linking to the same «frustrated as heck!» 
> discussion,
> for your patch v2-0001-account-for-size-of-BatchFile-structure-in-hashJo.patch
> (balancing between increasing batches *and* work_mem).
> 
> No sign of turning "problematic" batches to nested loop. Did I miss something?
> 
> Do you have a link close to your hand about such algo/patch test by any 
> chance?
> 

Gah! My apologies, I meant to post a link to this thread:

https://www.postgresql.org/message-id/caakru_b6+jc93wp+pwxqk5kazjc5rmxm8uquktef-kq++1l...@mail.gmail.com

which then points to this BNL patch

https://www.postgresql.org/message-id/CAAKRu_YsWm7gc_b2nBGWFPE6wuhdOLfc1LBZ786DUzaCPUDXCA%40mail.gmail.com

That discussion apparently stalled in August 2020, so maybe that's where
we should pick up and see in what shape that patch is.

regards


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
On Thu, 2 Mar 2023 19:53:14 +0100
Tomas Vondra  wrote:
> On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
...

> > There was some thoughts about how to make a better usage of the memory. As
> > memory is exploding way beyond work_mem, at least, avoid to waste it with
> > too many buffers of BufFile. So you expand either the work_mem or the
> > number of batch, depending on what move is smarter. TJis is explained and
> > tested here:
> > 
> > https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
> > https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development
> > 
> > And then, another patch to overflow each batch to a dedicated temp file and
> > stay inside work_mem (v4-per-slice-overflow-file.patch):
> > 
> > https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
> > 
> > Then, nothing more on the discussion about this last patch. So I guess it
> > just went cold.
> 
> I think a contributing factor was that the OP did not respond for a
> couple months, so the thread went cold.
> 
> > For what it worth, these two patches seems really interesting to me. Do you
> > need any help to revive it?
> 
> I think another reason why that thread went nowhere were some that we've
> been exploring a different (and likely better) approach to fix this by
> falling back to a nested loop for the "problematic" batches.
> 
> As proposed in this thread:
> 
>  
> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development

Unless I'm wrong, you are linking to the same «frustrated as heck!» discussion,
for your patch v2-0001-account-for-size-of-BatchFile-structure-in-hashJo.patch
(balancing between increasing batches *and* work_mem).

No sign of turning "problematic" batches to nested loop. Did I miss something?

Do you happen to have a link at hand to such an algo/patch test?

> I was hoping we'd solve this by the BNL, but if we didn't get that in 4
> years, maybe we shouldn't stall and get at least an imperfect stop-gap
> solution ...

I'll keep searching tomorrow for existing BNL discussions (is it block-level
nested loops?).

Regards,




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Tomas Vondra



On 3/2/23 19:15, Jehan-Guillaume de Rorthais wrote:
> Hi!
> 
> On Thu, 2 Mar 2023 13:44:52 +0100
> Tomas Vondra  wrote:
>> Well, yeah and no.
>>
>> In principle we could/should have allocated the BufFiles in a different
>> context (possibly hashCxt). But in practice it probably won't make any
>> difference, because the query will probably run all the hashjoins at the
>> same time. Imagine a sequence of joins - we build all the hashes, and
>> then tuples from the outer side bubble up through the plans. And then
>> you process the last tuple and release all the hashes.
>>
>> This would not fix the issue. It'd be helpful for accounting purposes
>> (we'd know it's the buffiles and perhaps for which hashjoin node). But
>> we'd still have to allocate the memory etc. (so still OOM).
> 
> Well, accounting things in the correct context would already worth a patch I
> suppose. At least, it help to investigate faster. Plus, you already wrote a
> patch about that[1]:
> 
> https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development
> 
> Note that I did reference the "Out of Memory errors are frustrating as heck!"
> thread in my first email, pointing on a Tom Lane's email explaining that
> ExecutorState was not supposed to be so large[2].
> 
> [2] 
> https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> 

Ah, right, I didn't realize it's the same thread. There are far too many
threads about this sort of thing, and I probably submitted half-baked
patches to most of them :-/

>> There's only one thing I think could help - increase the work_mem enough
>> not to trigger the explosive growth in number of batches. Imagine
>> there's one very common value, accounting for ~65MB of tuples. With
>> work_mem=64MB this leads to exactly the explosive growth you're
>> observing here. With 128MB it'd probably run just fine.
>>
>> The problem is we don't know how large the work_mem would need to be :-(
>> So you'll have to try and experiment a bit.
>>
>> I remembered there was a thread [1] about *exactly* this issue in 2019.
>>
>> [1]
>> https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net
>>
>> I even posted a couple patches that try to address this by accounting
>> for the BufFile memory, and increasing work_mem a bit instead of just
>> blindly increasing the number of batches (ignoring the fact that means
>> more memory will be used for the BufFile stuff).
>>
>> I don't recall why it went nowhere, TBH. But I recall there were
>> discussions about maybe doing something like "block nestloop" at the
>> time, or something. Or maybe the thread just went cold.
> 
> So I read the full thread now. I'm still not sure why we try to avoid hash
> collision so hard, and why a similar data subset barely larger than work_mem
> makes the number of batchs explode, but I think I have a better understanding 
> of
> the discussion and the proposed solutions.
> 

I don't think this is about hash collisions (i.e. the same hash value
being computed for different values). You can construct cases like this,
of course, particularly if you only look at a subset of the bits (for 1M
batches we only look at the first 20 bits), but I'd say it's fairly
unlikely to happen unless you do that intentionally.

(I'm assuming regular data types with reasonable hash functions. If the
query joins on custom data types with some silly hash function, it may
be more likely to have conflicts.)

IMHO a much more likely explanation is there actually is a very common
value in the data. For example there might be a value repeated 1M times,
and that'd be enough to break this.

We do build special "skew" buckets for values from the MCV list, but maybe
the stats are not updated yet, or maybe there are too many such values
to fit into the MCV list?

I now realize there's probably another way to get into this - oversized
rows. Could there be a huge row (e.g. with a large text/bytea value)?
Imagine a row that's 65MB - that'd be game over with work_mem=64MB. Or
there might be smaller rows, but a couple hash collisions would suffice.

> There was some thoughts about how to make a better usage of the memory. As
> memory is exploding way beyond work_mem, at least, avoid to waste it with too
> many buffers of BufFile. So you expand either the work_mem or the number of
> batch, depending on what move is smarter. TJis is explained and tested here:
> 
> https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
> https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development
> 
> And then, another patch to overflow each batch to a dedicated temp file and
> stay inside work_mem (v4-per-slice-overflow-file.patch):
> 
> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
> 
> Then, nothing more on the discussion about this last patch. So I guess it just
> went cold.
> 

I think a contributing factor was that the 

Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
On Thu, 2 Mar 2023 19:15:30 +0100
Jehan-Guillaume de Rorthais  wrote:
[...]
> For what it worth, these two patches seems really interesting to me. Do you
> need any help to revive it?

To avoid confusion, the two patches I meant were:

* 0001-move-BufFile-stuff-into-separate-context.patch   
* v4-per-slice-overflow-file.patch

Regards,




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
Hi!

On Thu, 2 Mar 2023 13:44:52 +0100
Tomas Vondra  wrote:
> Well, yeah and no.
> 
> In principle we could/should have allocated the BufFiles in a different
> context (possibly hashCxt). But in practice it probably won't make any
> difference, because the query will probably run all the hashjoins at the
> same time. Imagine a sequence of joins - we build all the hashes, and
> then tuples from the outer side bubble up through the plans. And then
> you process the last tuple and release all the hashes.
> 
> This would not fix the issue. It'd be helpful for accounting purposes
> (we'd know it's the buffiles and perhaps for which hashjoin node). But
> we'd still have to allocate the memory etc. (so still OOM).

Well, accounting for things in the correct context would already be worth a
patch, I suppose. At least, it helps to investigate faster. Plus, you already
wrote a patch about that[1]:

https://www.postgresql.org/message-id/20190421114618.z3mpgmimc3rmubi4%40development

Note that I did reference the "Out of Memory errors are frustrating as heck!"
thread in my first email, pointing to a Tom Lane email explaining that
ExecutorState was not supposed to be so large[2].

[2] 
https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b

> There's only one thing I think could help - increase the work_mem enough
> not to trigger the explosive growth in number of batches. Imagine
> there's one very common value, accounting for ~65MB of tuples. With
> work_mem=64MB this leads to exactly the explosive growth you're
> observing here. With 128MB it'd probably run just fine.
> 
> The problem is we don't know how large the work_mem would need to be :-(
> So you'll have to try and experiment a bit.
> 
> I remembered there was a thread [1] about *exactly* this issue in 2019.
> 
> [1]
> https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net
>
> I even posted a couple patches that try to address this by accounting
> for the BufFile memory, and increasing work_mem a bit instead of just
> blindly increasing the number of batches (ignoring the fact that means
> more memory will be used for the BufFile stuff).
> 
> I don't recall why it went nowhere, TBH. But I recall there were
> discussions about maybe doing something like "block nestloop" at the
> time, or something. Or maybe the thread just went cold.

So I read the full thread now. I'm still not sure why we try to avoid hash
collisions so hard, and why a similar data subset barely larger than work_mem
makes the number of batches explode, but I think I have a better understanding
of the discussion and the proposed solutions.

There were some thoughts about how to make better use of the memory. As
memory is exploding way beyond work_mem, at least avoid wasting it on too
many BufFile buffers. So you expand either the work_mem or the number of
batches, depending on which move is smarter. This is explained and tested here:

https://www.postgresql.org/message-id/20190421161434.4hedytsadpbnglgk%40development
https://www.postgresql.org/message-id/20190422030927.3huxq7gghms4kmf4%40development

And then, another patch to overflow each batch to a dedicated temp file and
stay inside work_mem (v4-per-slice-overflow-file.patch):

https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development

Then, nothing more on the discussion about this last patch. So I guess it just
went cold.

For what it's worth, these two patches seem really interesting to me. Do you
need any help reviving them?

Regards,




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Tomas Vondra


On 3/2/23 13:08, Jehan-Guillaume de Rorthais wrote:
> ...
> [...]
>> But I have another idea - put a breakpoint on makeBufFile() which is the
>> bit that allocates the temp files including the 8kB buffer, and print in
>> what context we allocate that. I have a hunch we may be allocating it in
>> the ExecutorState. That'd explain all the symptoms.
> 
> That what I was wondering as well yesterday night.
> 
> So, on your advice, I set a breakpoint on makeBufFile:
> 
>   (gdb) info br
>   Num Type   Disp Enb AddressWhat
>   1   breakpoint keep y   0x007229df in makeBufFile
>   bt 10
>   p CurrentMemoryContext.name
> 
> 
> Then, I disabled it and ran the query up to this mem usage:
> 
>VIRTRESSHR S  %CPU %MEM
>   20.1g   7.0g  88504 t   0.0 22.5
> 
> Then, I enabled the breakpoint and look at around 600 bt and context name
> before getting bored. They **all** looked like that:
> 
>   Breakpoint 1, BufFileCreateTemp (...)at buffile.c:201
>   201 in buffile.c
>   #0  BufFileCreateTemp (...) buffile.c:201
>   #1  ExecHashJoinSaveTuple (tuple=0x1952c180, ...)   nodeHashjoin.c:1238
>   #2  ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
>   #3  ExecHashJoin (pstate=0x31a6418) nodeHashjoin.c:584
>   #4  ExecProcNodeInstr (node=)execProcnode.c:462
>   #5  ExecProcNode (node=0x31a6418)
>   #6  ExecSort (pstate=0x31a6308)
>   #7  ExecProcNodeInstr (node=)
>   #8  ExecProcNode (node=0x31a6308)
>   #9  fetch_input_tuple (aggstate=aggstate@entry=0x31a5ea0)
>   
>   $421643 = 0x99d7f7 "ExecutorState"
> 
> These 600-ish 8kB buffer were all allocated in "ExecutorState". I could
> probably log much more of them if more checks/stats need to be collected, but
> it really slow down the query a lot, granting it only 1-5% of CPU time instead
> of the usual 100%.
> 

Bingo!

> So It's not exactly a leakage, as memory would be released at the end of the
> query, but I suppose they should be allocated in a shorter living context,
> to avoid this memory bloat, am I right?
> 

Well, yeah and no.

In principle we could/should have allocated the BufFiles in a different
context (possibly hashCxt). But in practice it probably won't make any
difference, because the query will probably run all the hashjoins at the
same time. Imagine a sequence of joins - we build all the hashes, and
then tuples from the outer side bubble up through the plans. And then
you process the last tuple and release all the hashes.

This would not fix the issue. It'd be helpful for accounting purposes
(we'd know it's the buffiles and perhaps for which hashjoin node). But
we'd still have to allocate the memory etc. (so still OOM).
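
For the record, the change I mean is tiny - something along these lines in
ExecHashJoinSaveTuple(), assuming the hash table (or at least its memory
context) is reachable from there:

  MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);

  if (*fileptr == NULL)
      *fileptr = BufFileCreateTemp(false);

  MemoryContextSwitchTo(oldcxt);

That just moves the BufFile (and its 8kB buffer) out of ExecutorState, so it
shows up separately in the memory context stats.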

There's only one thing I think could help - increase the work_mem enough
not to trigger the explosive growth in number of batches. Imagine
there's one very common value, accounting for ~65MB of tuples. With
work_mem=64MB this leads to exactly the explosive growth you're
observing here. With 128MB it'd probably run just fine.

The problem is we don't know how large the work_mem would need to be :-(
So you'll have to try and experiment a bit.

I remembered there was a thread [1] about *exactly* this issue in 2019.

[1]
https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net

I even posted a couple patches that try to address this by accounting
for the BufFile memory, and increasing work_mem a bit instead of just
blindly increasing the number of batches (ignoring the fact that means
more memory will be used for the BufFile stuff).

I don't recall why it went nowhere, TBH. But I recall there were
discussions about maybe doing something like "block nestloop" at the
time, or something. Or maybe the thread just went cold.

>> BTW with how many batches does the hash join start?
> 
> * batches went from 32 to 1048576 before being growEnabled=false as suspected
> * original and current nbuckets were set to 1048576 immediately
> * allowed space is set to the work_mem, but current space usage is 1.3GB, as
>   measured previously close before system refuse more memory allocation.
> 

Yeah, I think this is pretty expected. We start with multiple batches,
so we pick optimal buckets for the whole work_mem (so no growth here).

But then batches explode, in the futile hope of keeping this in work_mem.
Once that growth gets disabled, we end up with a 1.3GB hash table.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-03-02 Thread Jehan-Guillaume de Rorthais
On Thu, 2 Mar 2023 01:30:27 +0100
Tomas Vondra  wrote:
> On 3/2/23 00:18, Jehan-Guillaume de Rorthais wrote:
> >>> ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands
> >>> of calls, always short-cut'ed to 1048576, I guess because of the
> >>> conditional block «/* safety check to avoid overflow */» appearing early
> >>> in this function.   
> >[...] But what about this other short-cut then?
> > 
> > /* do nothing if we've decided to shut off growth */
> > if (!hashtable->growEnabled)
> > return;
> > 
> > [...]
> > 
> > /*
> >  * If we dumped out either all or none of the tuples in the table,
> >  * disable
> >  * further expansion of nbatch.  This situation implies that we have
> >  * enough tuples of identical hashvalues to overflow spaceAllowed.
> >  * Increasing nbatch will not fix it since there's no way to subdivide
> >  * the
> >  * group any more finely. We have to just gut it out and hope the server
> >  * has enough RAM.
> >  */
> > if (nfreed == 0 || nfreed == ninmemory)
> > {
> > hashtable->growEnabled = false;
> >   #ifdef HJDEBUG
> > printf("Hashjoin %p: disabling further increase of nbatch\n",
> >hashtable);
> >   #endif
> > }
> > 
> > If I guess correctly, the function is not able to split the current batch,
> > so it sits and hopes. This is a much better suspect and I can surely track
> > this from gdb.
> 
> Yes, this would make much more sense - it'd be consistent with the
> hypothesis that this is due to number of batches exploding (it's a
> protection exactly against that).
> 
> You specifically mentioned the other check earlier, but now I realize
> you've been just speculating it might be that.

Yes, sorry about that, I jumped on this speculation without actually digging
into it much...

[...]
> But I have another idea - put a breakpoint on makeBufFile() which is the
> bit that allocates the temp files including the 8kB buffer, and print in
> what context we allocate that. I have a hunch we may be allocating it in
> the ExecutorState. That'd explain all the symptoms.

That what I was wondering as well yesterday night.

So, on your advice, I set a breakpoint on makeBufFile:

  (gdb) info br
  Num Type   Disp Enb AddressWhat
  1   breakpoint keep y   0x007229df in makeBufFile
  bt 10
  p CurrentMemoryContext.name


Then, I disabled it and ran the query up to this mem usage:

   VIRTRESSHR S  %CPU %MEM
  20.1g   7.0g  88504 t   0.0 22.5

Then, I enabled the breakpoint and look at around 600 bt and context name
before getting bored. They **all** looked like that:

  Breakpoint 1, BufFileCreateTemp (...)at buffile.c:201
  201 in buffile.c
  #0  BufFileCreateTemp (...) buffile.c:201
  #1  ExecHashJoinSaveTuple (tuple=0x1952c180, ...)   nodeHashjoin.c:1238
  #2  ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
  #3  ExecHashJoin (pstate=0x31a6418) nodeHashjoin.c:584
  #4  ExecProcNodeInstr (node=)execProcnode.c:462
  #5  ExecProcNode (node=0x31a6418)
  #6  ExecSort (pstate=0x31a6308)
  #7  ExecProcNodeInstr (node=)
  #8  ExecProcNode (node=0x31a6308)
  #9  fetch_input_tuple (aggstate=aggstate@entry=0x31a5ea0)
  
  $421643 = 0x99d7f7 "ExecutorState"

These 600-ish 8kB buffers were all allocated in "ExecutorState". I could
probably log many more of them if more checks/stats need to be collected, but
it really slows down the query a lot, granting it only 1-5% of CPU time instead
of the usual 100%.

So it's not exactly a leak, as the memory would be released at the end of the
query, but I suppose they should be allocated in a shorter-lived context,
to avoid this memory bloat, am I right?

> BTW with how many batches does the hash join start?

* batches went from 32 to 1048576 before growEnabled was set to false, as
  suspected
* original and current nbuckets were set to 1048576 immediately
* allowed space is set to work_mem, but current space usage is 1.3GB, as
  measured previously, just before the system refused more memory allocation.

Here are the full details about the hash associated with the previous backtrace:

  (gdb) up
  (gdb) up
  (gdb) p *((HashJoinState*)pstate)->hj_HashTable
  $421652 = {
nbuckets = 1048576,
log2_nbuckets = 20,
nbuckets_original = 1048576,
nbuckets_optimal = 1048576,
log2_nbuckets_optimal = 20,
buckets = {unshared = 0x68f12e8, shared = 0x68f12e8},
keepNulls = true,
skewEnabled = false,
skewBucket = 0x0,
skewBucketLen = 0,
nSkewBuckets = 0,
skewBucketNums = 0x0,
nbatch = 1048576,
curbatch = 0,
nbatch_original = 32,
nbatch_outstart = 1048576,
growEnabled = false,
totalTuples = 19541735,
partialTuples = 19541735,
skewTuples = 0,
innerBatchFile = 0xdfcd168,
outerBatchFile = 0xe7cd1a8,

Re: Memory leak from ExecutorState context?

2023-03-01 Thread Tomas Vondra
On 3/2/23 00:18, Jehan-Guillaume de Rorthais wrote:
> Hi,
> 
> On Wed, 1 Mar 2023 20:29:11 +0100
> Tomas Vondra  wrote:
>> On 3/1/23 18:48, Jehan-Guillaume de Rorthais wrote:
>>> On Tue, 28 Feb 2023 20:51:02 +0100
>>> Tomas Vondra  wrote:  
 On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:  
> * HashBatchContext goes up to 1441MB after 240s then stay flat until the
> end (400s as the last record)

 That's interesting. We're using HashBatchContext for very few things, so
 what could it consume so much memory? But e.g. the number of buckets
 should be limited by work_mem, so how could it get to 1.4GB?

 Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
 and print how many batches/butches are there?  
>>>
>>> I did this test this morning.
>>>
>>> Batches and buckets increased really quickly to 1048576/1048576.
>>
>> OK. I think 1M buckets is mostly expected for work_mem=64MB. It means
>> buckets will use 8MB, which leaves ~56B per tuple (we're aiming for
>> fillfactor 1.0).
>>
>> But 1M batches? I guess that could be problematic. It doesn't seem like
>> much, but we need 1M files on each side - 1M for the hash table, 1M for
>> the outer relation. That's 16MB of pointers, but the files are BufFile
>> and we keep 8kB buffer for each of them. That's ~16GB right there :-(
>>
>> In practice it probably won't be that bad, because not all files will be
>> allocated/opened concurrently (especially if this is due to many tuples
>> having the same value). Assuming that's what's happening here, ofc.
> 
> And I suppose they are close/freed concurrently as well?
> 

Yeah. There can be different subsets of the files used, depending on
when the number of batches start to explode, etc.

>>> ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands
>>> of calls, always short-cut'ed to 1048576, I guess because of the
>>> conditional block «/* safety check to avoid overflow */» appearing early in
>>> this function. 
>>
>> Hmmm, that's a bit weird, no? I mean, the check is
>>
>>   /* safety check to avoid overflow */
>>   if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
>>  return;
>>
>> Why would it stop at 1048576? It certainly is not higher than  INT_MAX/2
>> and with MaxAllocSize = ~1GB the second value should be ~33M. So what's
>> happening here?
> 
> Indeed, not the good suspect. But what about this other short-cut then?
> 
> /* do nothing if we've decided to shut off growth */
> if (!hashtable->growEnabled)
> return;
> 
> [...]
> 
> /*
>  * If we dumped out either all or none of the tuples in the table, disable
>  * further expansion of nbatch.  This situation implies that we have
>  * enough tuples of identical hashvalues to overflow spaceAllowed.
>  * Increasing nbatch will not fix it since there's no way to subdivide the
>  * group any more finely. We have to just gut it out and hope the server
>  * has enough RAM.
>  */
> if (nfreed == 0 || nfreed == ninmemory)
> {
> hashtable->growEnabled = false;
>   #ifdef HJDEBUG
> printf("Hashjoin %p: disabling further increase of nbatch\n",
>hashtable);
>   #endif
> }
> 
> If I guess correctly, the function is not able to split the current batch, so
> it sits and hopes. This is a much better suspect and I can surely track this
> from gdb.
> 

Yes, this would make much more sense - it'd be consistent with the
hypothesis that this is due to the number of batches exploding (it's a
protection exactly against that).

You specifically mentioned the other check earlier, but now I realize
you've been just speculating it might be that.

> Being able to find what are the fields involved in the join could help as well
> to check or gather some stats about them, but I hadn't time to dig this yet...
> 

It's going to be tricky, because all parts of the plan may be doing
something, and there may be multiple hash joins. So you won't know if
you're executing the part of the plan that's causing issues :-(


But I have another idea - put a breakpoint on makeBufFile() which is the
bit that allocates the temp files including the 8kB buffer, and print in
what context we allocate that. I have a hunch we may be allocating it in
the ExecutorState. That'd explain all the symptoms.


BTW with how many batches does the hash join start?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-03-01 Thread Jehan-Guillaume de Rorthais
Hi,

On Wed, 1 Mar 2023 20:29:11 +0100
Tomas Vondra  wrote:
> On 3/1/23 18:48, Jehan-Guillaume de Rorthais wrote:
> > On Tue, 28 Feb 2023 20:51:02 +0100
> > Tomas Vondra  wrote:  
> >> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:  
> >>> * HashBatchContext goes up to 1441MB after 240s then stay flat until the
> >>> end (400s as the last record)
> >>
> >> That's interesting. We're using HashBatchContext for very few things, so
> >> what could it consume so much memory? But e.g. the number of buckets
> >> should be limited by work_mem, so how could it get to 1.4GB?
> >>
> >> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
> >> and print how many batches/butches are there?  
> > 
> > I did this test this morning.
> > 
> > Batches and buckets increased really quickly to 1048576/1048576.
> 
> OK. I think 1M buckets is mostly expected for work_mem=64MB. It means
> buckets will use 8MB, which leaves ~56B per tuple (we're aiming for
> fillfactor 1.0).
> 
> But 1M batches? I guess that could be problematic. It doesn't seem like
> much, but we need 1M files on each side - 1M for the hash table, 1M for
> the outer relation. That's 16MB of pointers, but the files are BufFile
> and we keep 8kB buffer for each of them. That's ~16GB right there :-(
>
> In practice it probably won't be that bad, because not all files will be
> allocated/opened concurrently (especially if this is due to many tuples
> having the same value). Assuming that's what's happening here, ofc.

And I suppose they are close/freed concurrently as well?

> > ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands
> > of calls, always short-cut'ed to 1048576, I guess because of the
> > conditional block «/* safety check to avoid overflow */» appearing early in
> > this function. 
> 
> Hmmm, that's a bit weird, no? I mean, the check is
> 
>   /* safety check to avoid overflow */
>   if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
>  return;
> 
> Why would it stop at 1048576? It certainly is not higher than  INT_MAX/2
> and with MaxAllocSize = ~1GB the second value should be ~33M. So what's
> happening here?

Indeed, not the right suspect. But what about this other short-cut then?

/* do nothing if we've decided to shut off growth */
if (!hashtable->growEnabled)
return;

[...]

/*
 * If we dumped out either all or none of the tuples in the table, disable
 * further expansion of nbatch.  This situation implies that we have
 * enough tuples of identical hashvalues to overflow spaceAllowed.
 * Increasing nbatch will not fix it since there's no way to subdivide the
 * group any more finely. We have to just gut it out and hope the server
 * has enough RAM.
 */
if (nfreed == 0 || nfreed == ninmemory)
{
hashtable->growEnabled = false;
  #ifdef HJDEBUG
printf("Hashjoin %p: disabling further increase of nbatch\n",
   hashtable);
  #endif
}

If I guess correctly, the function is not able to split the current batch, so
it sits and hopes. This is a much better suspect and I can surely track this
from gdb.

Being able to find which fields are involved in the join could help as well,
to check or gather some stats about them, but I haven't had time to dig into
this yet...

[...]
> >> Investigating memory leaks is tough, especially for generic memory
> >> contexts like ExecutorState :-( Even more so when you can't reproduce it
> >> on a machine with custom builds.
> >>
> >> What I'd try is this:

[...]
> > I couldn't print "size" as it is optimzed away, that's why I tracked
> > chunk->size... Is there anything wrong with my current run and gdb log? 
> 
> There's definitely something wrong. The size should not be 0, and
> neither it should be > 1GB. I suspect it's because some of the variables
> get optimized out, and gdb just uses some nonsense :-(
> 
> I guess you'll need to debug the individual breakpoints, and see what's
> available. It probably depends on the compiler version, etc. For example
> I don't see the "chunk" for breakpoint 3, but "chunk_size" works and I
> can print the chunk pointer with a bit of arithmetics:
> 
>   p (block->freeptr - chunk_size)
> 
> I suppose similar gympastics could work for the other breakpoints.

OK, I'll give it a try tomorrow.

Thank you!

NB: the query has been killed by the replication.




Re: Memory leak from ExecutorState context?

2023-03-01 Thread Jehan-Guillaume de Rorthais
On Wed, 1 Mar 2023 20:34:08 +0100
Tomas Vondra  wrote:

> On 3/1/23 19:09, Jehan-Guillaume de Rorthais wrote:
> > On Wed, 1 Mar 2023 18:48:40 +0100
> > Jehan-Guillaume de Rorthais  wrote:
> > ...  
> >> You'll find some intermediate stats I already collected in attachment:
> >>
> >> * break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
> >> * most of the non-free'd chunk are allocated since the very beginning,
> >> before the 5000's allocation call for almost 1M call so far.
> >> * 3754 of them have a chunk->size of 0
> >> * it seems there's some buggy stats or data:
> >># this one actually really comes from the gdb log
> >>0x38a77b8: break=3 num=191   sz=4711441762604810240 (weird sz)
> >># this one might be a bug in my script
> >>  0x2: break=2 num=945346sz=2   (weird
> >> address)
> >> * ignoring the weird size requested during the 191st call, the total amount
> >>   of non free'd memory is currently 5488MB  
> > 
> > I forgot one stat. I don't know if this is expected, normal or not, but 53
> > chunks have been allocated at an existing address that was not free'd before.
> >   
> 
> It's likely the chunk was freed by repalloc() and not by pfree() directly.
> Or maybe the whole context got destroyed/reset, in which case we don't
> free individual chunks. But that's unlikely for the ExecutorState.

Well, as all breakpoints were conditional on ExecutorState, I suppose this
might be repalloc then.
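
If needed, repalloc could be traced the same way as the other breakpoints, to
match those reused addresses. A minimal sketch (same REL_11_18 setup as above,
logging only the old address at function entry):

  break AllocSetRealloc if strcmp("ExecutorState",context.name) == 0
  commands
  print pointer
  cont
  end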

Regards,




Re: Memory leak from ExecutorState context?

2023-03-01 Thread Tomas Vondra



On 3/1/23 19:09, Jehan-Guillaume de Rorthais wrote:
> On Wed, 1 Mar 2023 18:48:40 +0100
> Jehan-Guillaume de Rorthais  wrote:
> ...
>> You'll find some intermediate stats I already collected in attachment:
>>
>> * break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
>> * most of the non-free'd chunks were allocated near the very beginning, before
>>   the 5000th allocation call out of almost 1M calls so far.
>> * 3754 of them have a chunk->size of 0
>> * it seems there's some buggy stats or data:
>># this one actually really comes from the gdb log
>>0x38a77b8: break=3 num=191   sz=4711441762604810240 (weird sz)
>># this one might be a bug in my script
>>  0x2: break=2 num=945346sz=2   (weird 
>> address)
>> * ignoring the weird size requested during the 191st call, the total amount
>>   of non free'd memory is currently 5488MB
> 
> I forgot one stat. I don't know if this is expected, normal or not, but 53
> chunks have been allocated at an existing address that was not free'd before.
> 

It's likely the chunk was freed by repalloc() and not by pfree() directly.
Or maybe the whole context got destroyed/reset, in which case we don't
free individual chunks. But that's unlikely for the ExecutorState.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-03-01 Thread Tomas Vondra
On 3/1/23 18:48, Jehan-Guillaume de Rorthais wrote:
> Hi Tomas,
> 
> On Tue, 28 Feb 2023 20:51:02 +0100
> Tomas Vondra  wrote:
>> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> > * HashBatchContext goes up to 1441MB after 240s then stays flat until the end
>>>   (400s as the last record)  
>>
>> That's interesting. We're using HashBatchContext for very few things, so
>> what could it consume so much memory? But e.g. the number of buckets
>> should be limited by work_mem, so how could it get to 1.4GB?
>>
>> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
> and print how many batches/buckets there are?
> 
> I did this test this morning.
> 
> Batches and buckets increased really quickly to 1048576/1048576.
> 

OK. I think 1M buckets is mostly expected for work_mem=64MB. It means
buckets will use 8MB, which leaves ~56B per tuple (we're aiming for
fillfactor 1.0).

But 1M batches? I guess that could be problematic. It doesn't seem like
much, but we need 1M files on each side - 1M for the hash table, 1M for
the outer relation. That's 16MB of pointers, but the files are BufFile
and we keep 8kB buffer for each of them. That's ~16GB right there :-(
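(Roughly: 2 sides x 1,048,576 batches x 8 kB of BufFile buffer each = ~16 GB,
on top of the ~16 MB of pointer arrays.)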

In practice it probably won't be that bad, because not all files will be
allocated/opened concurrently (especially if this is due to many tuples
having the same value). Assuming that's what's happening here, ofc.

> ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands of
> calls, always short-circuited to 1048576, I guess because of the conditional 
> block
> «/* safety check to avoid overflow */» appearing early in this function.
> 

Hmmm, that's a bit weird, no? I mean, the check is

  /* safety check to avoid overflow */
  if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
 return;

Why would it stop at 1048576? It certainly is not higher than  INT_MAX/2
and with MaxAllocSize = ~1GB the second value should be ~33M. So what's
happening here?

> I disabled the breakpoint on ExecHashIncreaseNumBatches a few times to make the
> query run faster. Enabling it at 19.1GB of memory consumption, it stayed
> silent till the memory exhaustion (around 21 or 22GB, I don't remember 
> exactly).
> 
> The breakpoint on ExecHashIncreaseNumBuckets triggered a few times at the
> beginning,
> and a last time close to the end of the query execution.
> 
>>> Any idea? How could I help to have a better idea if a leak is actually
>>> occurring and where exactly?
>>
>> Investigating memory leaks is tough, especially for generic memory
>> contexts like ExecutorState :-( Even more so when you can't reproduce it
>> on a machine with custom builds.
>>
>> What I'd try is this:
>>
>> 1) attach breakpoints to all returns in AllocSetAlloc(), printing the
>> pointer and size for ExecutorState context, so something like
>>
>>   break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
>>   commands
>>   print MemoryChunkGetPointer(chunk) size
>>   cont
>>   end
>>
>> 2) do the same for AllocSetFree()
>>
>> 3) Match the palloc/pfree calls (using the pointer address), to
>> determine which ones are not freed and do some stats on the size.
>> Usually there's only a couple distinct sizes that account for most of
>> the leaked memory.
> 
> So here is what I ended up with this afternoon, using file names, line
> numbers and macros from REL_11_18:
> 
>   set logging  on
>   set pagination off
>   
>   break aset.c:781 if strcmp("ExecutorState",context.name) == 0
>   commands 1
>   print (((char *)(chunk)) + sizeof(struct AllocChunkData))
>   print chunk->size
>   cont
>   end
>   
>   break aset.c:820 if strcmp("ExecutorState",context.name) == 0
>   commands 2
>   print (((char *)(chunk)) + sizeof(struct AllocChunkData))
>   print chunk->size
>   cont 
>   end
>   
>   break aset.c:979 if strcmp("ExecutorState",context.name) == 0
>   commands 3
>   print (((char *)(chunk)) + sizeof(struct AllocChunkData))
>   print chunk->size
>   cont 
>   end
>   
>   break AllocSetFree if strcmp("ExecutorState",context.name) == 0
>   commands 4 
>   print pointer
>   cont
>   end
> 
> So far, gdb had more than 3h of CPU time and is eating 2.4GB of memory. The
> backend had only 3'32" of CPU time:
> 
>VIRTRESSHR S  %CPU %MEM TIME+ COMMAND
>   2727284   2.4g  17840 R  99.0  7.7 181:25.07 gdb
>   9054688 220648 103056 t   1.3  0.7   3:32.05 postmaster
> 
> Interestingly, the RES memory of the backend has not exploded yet, but VIRT is
> already high.
> 
> I suppose the query will run for some more hours, hopefully, gdb will not
> exhaust the memory in the meantime...
> 
> You'll find some intermediate stats I already collected in attachment:
> 
> * break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
> * most of the non-free'd chunks were allocated near the very beginning, before
>   the 5000th allocation call out of almost 1M calls so far.
> * 3754 of them have a chunk->size of 0
> * it seems there's some buggy stats or data:
>

Re: Memory leak from ExecutorState context?

2023-03-01 Thread Jehan-Guillaume de Rorthais
On Wed, 1 Mar 2023 18:48:40 +0100
Jehan-Guillaume de Rorthais  wrote:
...
> You'll find some intermediate stats I already collected in attachment:
> 
> * break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
> * most of the non-free'd chunks were allocated near the very beginning, before
>   the 5000th allocation call out of almost 1M calls so far.
> * 3754 of them have a chunk->size of 0
> * it seems there's some buggy stats or data:
># this one actually really comes from the gdb log
>0x38a77b8: break=3 num=191   sz=4711441762604810240 (weird sz)
># this one might be a bug in my script
>  0x2: break=2 num=945346sz=2   (weird address)
> * ignoring the weird size requested during the 191st call, the total amount
>   of non free'd memory is currently 5488MB

I forgot one stat. I don't know if this is expected, normal or not, but 53
chunks have been allocated at an existing address that was not free'd before.

Regards,




Re: Memory leak from ExecutorState context?

2023-03-01 Thread Jehan-Guillaume de Rorthais
Hi Tomas,

On Tue, 28 Feb 2023 20:51:02 +0100
Tomas Vondra  wrote:
> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> > * HashBatchContext goes up to 1441MB after 240s then stays flat until the end
> >   (400s as the last record)  
> 
> That's interesting. We're using HashBatchContext for very few things, so
> what could it consume so much memory? But e.g. the number of buckets
> should be limited by work_mem, so how could it get to 1.4GB?
> 
> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
> and print how many batches/buckets there are?

I did this test this morning.

Batches and buckets increased really quickly to 1048576/1048576.

ExecHashIncreaseNumBatches was really chatty, having hundreds of thousands of
calls, always short-circuited to 1048576, I guess because of the conditional block
«/* safety check to avoid overflow */» appearing early in this function.

I disabled the breakpoint on ExecHashIncreaseNumBatches a few times to make the
query run faster. Enabling it at 19.1GB of memory consumption, it stayed
silent till the memory exhaustion (around 21 or 22GB, I don't remember exactly).

The breakpoint on ExecHashIncreaseNumBuckets triggered a few times at the beginning,
and a last time close to the end of the query execution.

> > Any idea? How could I help to have a better idea if a leak is actually
> > occurring and where exactly?
> 
> Investigating memory leaks is tough, especially for generic memory
> contexts like ExecutorState :-( Even more so when you can't reproduce it
> on a machine with custom builds.
> 
> What I'd try is this:
> 
> 1) attach breakpoints to all returns in AllocSetAlloc(), printing the
> pointer and size for ExecutorState context, so something like
> 
>   break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
>   commands
>   print MemoryChunkGetPointer(chunk) size
>   cont
>   end
> 
> 2) do the same for AllocSetFree()
> 
> 3) Match the palloc/pfree calls (using the pointer address), to
> determine which ones are not freed and do some stats on the size.
> Usually there's only a couple distinct sizes that account for most of
> the leaked memory.

So here is what I ended up with this afternoon, using file names, line
numbers and macros from REL_11_18:

  set logging  on
  set pagination off
  
  break aset.c:781 if strcmp("ExecutorState",context.name) == 0
  commands 1
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont
  end
  
  break aset.c:820 if strcmp("ExecutorState",context.name) == 0
  commands 2
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont 
  end
  
  break aset.c:979 if strcmp("ExecutorState",context.name) == 0
  commands 3
  print (((char *)(chunk)) + sizeof(struct AllocChunkData))
  print chunk->size
  cont 
  end
  
  break AllocSetFree if strcmp("ExecutorState",context.name) == 0
  commands 4 
  print pointer
  cont
  end

So far, gdb had more than 3h of CPU time and is eating 2.4GB of memory. The
backend had only 3'32" of CPU time:

   VIRTRESSHR S  %CPU %MEM TIME+ COMMAND
  2727284   2.4g  17840 R  99.0  7.7 181:25.07 gdb
  9054688 220648 103056 t   1.3  0.7   3:32.05 postmaster

Interestingly, the RES memory of the backend has not exploded yet, but VIRT is
already high.

I suppose the query will run for some more hours, hopefully, gdb will not
exhaust the memory in the meantime...

You'll find some intermediate stats I already collected in attachment:

* break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
* most of the non-free'd chunks were allocated near the very beginning, before
  the 5000th allocation call out of almost 1M calls so far.
* 3754 of them have a chunk->size of 0
* it seems there's some buggy stats or data:
   # this one actually really comes from the gdb log
   0x38a77b8: break=3 num=191   sz=4711441762604810240 (weird sz)
   # this one might be a bug in my script
 0x2: break=2 num=945346sz=2   (weird address)
* ignoring the weird size requested during the 191st call, the total amount
  of non free'd memory is currently 5488MB

I couldn't print "size" as it is optimized away, that's why I tracked
chunk->size... Is there anything wrong with my current run and gdb log? 

The gdb log is 5MB compressed. I'll keep it off-list, but I can provide it if
needed.

Stay tuned...

Thank you!


allocs-tmp-stats.txt.gz
Description: application/gzip


Re: Memory leak from ExecutorState context?

2023-03-01 Thread Tomas Vondra



On 3/1/23 10:46, Jehan-Guillaume de Rorthais wrote:
> Hi Justin,
> 
> On Tue, 28 Feb 2023 12:25:08 -0600
> Justin Pryzby  wrote:
> 
>> On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
>>> Hello all,
>>>
>>> A customer is facing an out-of-memory query which looks similar to this
>>> situation:
>>>
>>>   
>>> https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
>>>
>>> This PostgreSQL version is 11.18. Some settings:  
>>
>> hash joins could exceed work_mem until v13:
> 
> Yes, I am aware of this. But as far as I understand Tom Lane's explanations from
> the discussion mentioned up thread, it should not be ExecutorState.
> ExecutorState (13GB) is at least ten times bigger than any other context,
> including HashBatchContext (1.4GB) or HashTableContext (16MB). So maybe some
> aggregate is walking toward the wall because of bad estimation, but something
> else is racing way faster to the wall. And presently it might be something
> related to some JOIN node.
> 

I still don't understand why this would be due to a hash aggregate. That
should not allocate memory in ExecutorState at all. And HashBatchContext
(which is the bloated one) is used by hashjoin, so the issue is likely
somewhere in that area.

> About your other points, you are right, there are numerous things we could do to
> improve this query, and our customer is considering it as well. It's just a
> matter of time now.
> 
> But in the meantime, we are facing a query with a memory behavior that seemed
> suspect. Following the 4-year-old thread I mentioned, my goal is to inspect
> and provide all possible information to make sure it's a "normal" behavior or
> something that might/should be fixed.
> 

It'd be interesting to see if the gdb stuff I suggested yesterday yields
some interesting info.

Furthermore, I realized the plan you posted yesterday may not be the one
used for the failing query. It'd be interesting to see what plan is used
for the case that actually fails. Can you at least run an EXPLAIN on it?
Or alternatively, if the query is already running and eating a lot of
memory, attach gdb and print the plan in ExecutorStart:

  set print elements 0
  p nodeToString(queryDesc->plannedstmt->planTree)

Thinking about this, I have one suspicion. Hashjoins try to fit into
work_mem by increasing the number of batches - when a batch gets too
large, we double the number of batches (and split the batch into two, to
reduce the size). But if there's a lot of tuples for a particular key
(or at least the hash value), we quickly run into work_mem and keep
adding more and more batches.

The problem with this theory is that the batches are allocated in
HashTableContext, and that doesn't grow very much. And the 1.4GB
HashBatchContext is used for buckets - but we should not allocate that
many, because we cap that to nbuckets_optimal (see 30d7ae3c76). And it
does not explain the ExecutorState bloat either.

Nevertheless, it'd be interesting to see the hashtable parameters:

  p *hashtable


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-03-01 Thread Jehan-Guillaume de Rorthais
Hi Justin,

On Tue, 28 Feb 2023 12:25:08 -0600
Justin Pryzby  wrote:

> On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
> > Hello all,
> > 
> > A customer is facing an out-of-memory query which looks similar to this
> > situation:
> > 
> >   
> > https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> > 
> > This PostgreSQL version is 11.18. Some settings:  
> 
> hash joins could exceed work_mem until v13:

Yes, I am aware of this. But as far as I understand Tom Lane's explanations from
the discussion mentioned up thread, it should not be ExecutorState.
ExecutorState (13GB) is at least ten times bigger than any other context,
including HashBatchContext (1.4GB) or HashTableContext (16MB). So maybe some
aggregate is walking toward the wall because of bad estimation, but something
else is racing way faster to the wall. And presently it might be something
related to some JOIN node.

About your other points, you are right, there are numerous things we could do to
improve this query, and our customer is considering it as well. It's just a
matter of time now.

But in the meantime, we are facing a query with a memory behavior that seemed
suspect. Following the 4-year-old thread I mentioned, my goal is to inspect
and provide all possible information to make sure it's a "normal" behavior or
something that might/should be fixed.

Thank you for your help!




Re: Memory leak from ExecutorState context?

2023-02-28 Thread Tomas Vondra
On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> Hello all,
> 
> A customer is facing an out-of-memory query which looks similar to this
> situation:
> 
>   
> https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> 
> This PostgreSQL version is 11.18. Some settings:
> 
> * shared_buffers: 8GB
> * work_mem: 64MB
> * effective_cache_size: 24GB
> * random/seq_page_cost are by default
> * physical memory: 32GB
> 
> The query is really large and actually updates a kind of materialized view.
> 
> The customer records the plans of this query on a regular basis. The explain
> analyze of this query before running out of memory was:
> 
>   https://explain.depesz.com/s/sGOH
> 
> The customer is aware he should rewrite this query to optimize it, but it's a
> long process he cannot start immediately. To make it run in the meantime, he
> actually moved the top CTE into a dedicated table. According to their
> experience, it's not the first time they had to split a query this way to make
> it work.
> 
> I've been able to run this query on a standby myself. I ran "call
> MemoryContextStats(TopMemoryContext)" every 10s during a run; see the parsed
> data (best viewed with "less -S") and the associated graph in attachment. It
> shows:
> 
> * HashBatchContext goes up to 1441MB after 240s then stays flat until the end
>   (400s as the last record)

That's interesting. We're using HashBatchContext for very few things, so
what could it consume so much memory? But e.g. the number of buckets
should be limited by work_mem, so how could it get to 1.4GB?

Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
and print how many batches/buckets there are?
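
For instance, something along these lines (just a sketch, assuming a build with
debug symbols; field names as in HashJoinTableData):

  break ExecHashIncreaseNumBatches
  commands
  print hashtable->nbatch
  cont
  end

  break ExecHashIncreaseNumBuckets
  commands
  print hashtable->nbuckets
  print hashtable->nbuckets_optimal
  cont
  end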

> * ALL other contexts are stable before 240s, except ExecutorState
> * ExecutorState keeps rising up to 13GB with no interruption until the memory
>   exhaustion
> 
> I did another run with interactive gdb session (see the messy log session in
> attachment, for what it's worth). Looking at some backtraces during the memory
> inflation close to the end of the query, all of them were having these frames 
> in
> common:
> 
>   [...]
>   #6  0x00621ffc in ExecHashJoinImpl (parallel=false, 
> pstate=0x31a3378)
>   at nodeHashjoin.c:398 [...]
> 
> ...which is not really helpful but at least, it seems to come from a hash join
> node or some other hash related code. See the gdb session log for more 
> details.
> After the out of mem, pmap of this process shows:
> 
>   430:   postgres: postgres  [local] EXPLAIN
>   Address   Kbytes RSS   Dirty Mode  Mapping
>   [...]
>   02c5e000 13719620 8062376 8062376 rw---   [ anon ]
>   [...]
> 
> Is it usual for a backend to request such a large amount of memory (13 GB) and
> actually use less than 60% of it (7.7GB of RSS)?
> 

No idea. Interpreting this info is pretty tricky, in my experience. It
might mean the memory is no longer used but sbrk couldn't return it to
the OS yet, or something like that.

> Sadly, the database is 1.5TB large and I cannot test on a newer major version.
> I did not try to check how large the required data set would be to reproduce
> this, but it moves tens of millions of rows from multiple tables anyway...
> 
> Any idea? How could I help to have a better idea if a leak is actually
> occurring and where exactly?
> 

Investigating memory leaks is tough, especially for generic memory
contexts like ExecutorState :-( Even more so when you can't reproduce it
on a machine with custom builds.

What I'd try is this:

1) attach breakpoints to all returns in AllocSetAlloc(), printing the
pointer and size for ExecutorState context, so something like

  break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
  commands
  print MemoryChunkGetPointer(chunk) size
  cont
  end

2) do the same for AllocSetFree()

3) Match the palloc/pfree calls (using the pointer address), to
determine which ones are not freed and do some stats on the size.
Usually there's only a couple distinct sizes that account for most of
the leaked memory.

4) Break AllocSetAlloc on those leaked sizes, to determine where the
calls come from.

This usually gives enough info about the leak or at least allows
focusing the investigation to a particular area of code.
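
For step 4, the breakpoint could look something like this, where 1024 is just a
placeholder for whichever leaked size step 3 points at:

  break AllocSetAlloc if strcmp("ExecutorState",context->header.name) == 0 && size == 1024
  commands
  bt 10
  cont
  end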


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-02-28 Thread Tomas Vondra



On 2/28/23 19:25, Justin Pryzby wrote:
> On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
>> Hello all,
>>
>> A customer is facing an out-of-memory query which looks similar to this
>> situation:
>>
>>   
>> https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
>>
>> This PostgreSQL version is 11.18. Some settings:
> 
> hash joins could exceed work_mem until v13:
> 
> |Allow hash aggregation to use disk storage for large aggregation result
> |sets (Jeff Davis)
> |

That's hash aggregate, not hash join.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory leak from ExecutorState context?

2023-02-28 Thread Justin Pryzby
On Tue, Feb 28, 2023 at 07:06:43PM +0100, Jehan-Guillaume de Rorthais wrote:
> Hello all,
> 
> A customer is facing an out-of-memory query which looks similar to this
> situation:
> 
>   
> https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> 
> This PostgreSQL version is 11.18. Some settings:

hash joins could exceed work_mem until v13:

|Allow hash aggregation to use disk storage for large aggregation result
|sets (Jeff Davis)
|
|Previously, hash aggregation was avoided if it was expected to use more
|than work_mem memory. Now, a hash aggregation plan can be chosen despite
|that. The hash table will be spilled to disk if it exceeds work_mem
|times hash_mem_multiplier.
|
|This behavior is normally preferable to the old behavior, in which once
|hash aggregation had been chosen, the hash table would be kept in memory
|no matter how large it got — which could be very large if the planner
|had misestimated. If necessary, behavior similar to that can be obtained
|by increasing hash_mem_multiplier.

>   https://explain.depesz.com/s/sGOH

This shows multiple plan nodes underestimating the row counts by factors
of ~50,000, which could lead to the issue fixed in v13.

I think you should try to improve the estimates, which might help other
queries in addition to this one, and maybe avoid the issue with joins.

> The customer is aware he should rewrite this query to optimize it, but it's a
> long process he cannot start immediately. To make it run in the meantime, he
> actually moved the top CTE into a dedicated table.

Is the table analyzed?

> Is it usual for a backend to request such a large amount of memory (13 GB) and
> actually use less than 60% of it (7.7GB of RSS)?

It's possible it's "using less" simply because it's not available.  Is
the process swapping?

-- 
Justin