Re: Skipping logical replication transactions on subscriber side

2021-08-25 Thread Amit Kapila
On Thu, Aug 26, 2021 at 11:39 AM Amit Kapila  wrote:
>
> On Thu, Aug 26, 2021 at 9:50 AM Masahiko Sawada  wrote:
> >
> > On Thu, Aug 26, 2021 at 12:51 PM Amit Kapila  
> > wrote:
> >
> > Yeah, I agree that it's a handy way to detect a missing switch case,
> > but I think we don't necessarily need it in this case. There are many
> > places in the code that do similar things, and since apply_dispatch()
> > is the entry function for handling incoming messages, it is unlikely
> > that a missing switch case would survive until the patch gets
> > committed. If we don't move it, we would end up either adding the code
> > that resets apply_error_callback_arg.command to every message type,
> > adding a flag indicating the message was handled and checking it
> > later, or having a big if statement checking whether the incoming
> > message type is valid, etc.
> >
>
> I was reviewing and making minor edits to your v11-0001* patch and
> noticed that the below parts of the code could be improved:
> 1.
> + if (errarg->rel)
> + appendStringInfo(&buf, _(" for replication target relation \"%s.%s\""),
> + errarg->rel->remoterel.nspname,
> + errarg->rel->remoterel.relname);
> +
> + if (errarg->remote_attnum >= 0)
> + appendStringInfo(&buf, _(" column \"%s\""),
> + errarg->rel->remoterel.attnames[errarg->remote_attnum]);
>
> Isn't it better if the 'remote_attnum' check is inside the
> if (errarg->rel) check? It would be weird to print column information
> without relation information, and in the current code we don't set
> remote_attnum without rel. The other possibility could be to have an
> Assert on rel in the 'remote_attnum' check.
>
> 2.
> + /* Reset relation for error callback */
> + apply_error_callback_arg.rel = NULL;
> +
>   logicalrep_rel_close(rel, NoLock);
>
>   end_replication_step();
>
> Isn't it better to reset the relation info as the last thing in
> apply_handle_insert/update/delete, as you do for a few other
> parameters? There is very little chance of an error from those two
> functions, but it would still be good to cover the case where they
> ever throw one, and it would make clear to future editors that this
> needs to be reset as the last thing in these functions.
>

I see that resetting it before logicalrep_rel_close has the advantage
that we can't accidentally access, after the close, information that is
no longer there in rel. I am not sure if that is the reason you have in
mind for resetting it before the close.

-- 
With Regards,
Amit Kapila.




Re: Separate out FileSet from SharedFileSet (was Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o)

2021-08-25 Thread Dilip Kumar
On Wed, Aug 25, 2021 at 5:49 PM Amit Kapila  wrote:

> On Tue, Aug 24, 2021 at 3:55 PM Dilip Kumar  wrote:
> >
> > On Tue, Aug 24, 2021 at 12:26 PM Amit Kapila 
> wrote:
> >
>
> The first patch looks good to me. I have made minor changes to the
> attached patch. The changes include: fixing compilation warning, made
> some comment changes, ran pgindent, and few other cosmetic changes. If
> you are fine with the attached, then kindly rebase the second patch
> atop it.
>
>
The patch looks good to me. I have rebased 0002 atop this patch and
also made some cosmetic fixes in 0002.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 3558cdc8a4f77e3c9092bbaeded451eaaa1feca4 Mon Sep 17 00:00:00 2001
From: Amit Kapila 
Date: Wed, 25 Aug 2021 15:00:19 +0530
Subject: [PATCH v5 1/2] Refactor sharedfileset.c to separate out fileset
 implementation.

Move fileset related implementation out of sharedfileset.c to allow its
usage by backends that don't want to share filesets among different
processes. After this split, fileset infrastructure is used by both
sharedfileset.c and worker.c for the named temporary files that survive
across transactions.

Author: Dilip Kumar, based on suggestion by Andres Freund
Reviewed-by: Hou Zhijie, Masahiko Sawada, Amit Kapila
Discussion: https://postgr.es/m/e1mcc6u-0004ik...@gemulon.postgresql.org
---
 src/backend/replication/logical/launcher.c |   3 +
 src/backend/replication/logical/worker.c   |  82 ++
 src/backend/storage/file/Makefile  |   1 +
 src/backend/storage/file/buffile.c |  84 +-
 src/backend/storage/file/fd.c  |   2 +-
 src/backend/storage/file/fileset.c | 205 
 src/backend/storage/file/sharedfileset.c   | 244 +
 src/backend/utils/sort/logtape.c   |   8 +-
 src/backend/utils/sort/sharedtuplestore.c  |   5 +-
 src/include/replication/worker_internal.h  |   1 +
 src/include/storage/buffile.h  |  14 +-
 src/include/storage/fileset.h  |  40 +
 src/include/storage/sharedfileset.h|  14 +-
 src/tools/pgindent/typedefs.list   |   1 +
 14 files changed, 368 insertions(+), 336 deletions(-)
 create mode 100644 src/backend/storage/file/fileset.c
 create mode 100644 src/include/storage/fileset.h

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index e3b11da..8b1772d 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -648,6 +648,9 @@ logicalrep_worker_onexit(int code, Datum arg)
 
 	logicalrep_worker_detach();
 
+	/* Cleanup filesets used for streaming transactions. */
+	logicalrep_worker_cleanupfileset();
+
 	ApplyLauncherWakeup();
 }
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 38b493e..6adc6ddd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -39,13 +39,13 @@
  * BufFile infrastructure supports temporary files that exceed the OS file size
  * limit, (b) provides a way for automatic clean up on the error and (c) provides
  * a way to survive these files across local transactions and allow to open and
- * close at stream start and close. We decided to use SharedFileSet
+ * close at stream start and close. We decided to use FileSet
  * infrastructure as without that it deletes the files on the closure of the
  * file and if we decide to keep stream files open across the start/stop stream
  * then it will consume a lot of memory (more than 8K for each BufFile and
  * there could be multiple such BufFiles as the subscriber could receive
  * multiple start/stop streams for different transactions before getting the
- * commit). Moreover, if we don't use SharedFileSet then we also need to invent
+ * commit). Moreover, if we don't use FileSet then we also need to invent
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
@@ -231,8 +231,8 @@ typedef struct ApplyExecutionData
 typedef struct StreamXidHash
 {
 	TransactionId xid;			/* xid is the hash key and must be first */
-	SharedFileSet *stream_fileset;	/* shared file set for stream data */
-	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+	FileSet    *stream_fileset; /* file set for stream data */
+	FileSet    *subxact_fileset;	/* file set for subxact info */
 } StreamXidHash;
 
 static MemoryContext ApplyMessageContext = NULL;
@@ -255,8 +255,8 @@ static bool in_streamed_transaction = false;
 static TransactionId stream_xid = InvalidTransactionId;
 
 /*
- * Hash table for storing the streaming xid information along with shared file
- * set for streaming and subxact files.
+ * Hash table for storing the streaming xid information along with filesets
+ * for streaming and subxact files.
  */
 static HTAB *xidhash = N

Re: Skipping logical replication transactions on subscriber side

2021-08-25 Thread Amit Kapila
On Thu, Aug 26, 2021 at 9:50 AM Masahiko Sawada  wrote:
>
> On Thu, Aug 26, 2021 at 12:51 PM Amit Kapila  wrote:
>
> Yeah, I agree that it's a handy way to detect a missing switch case,
> but I think we don't necessarily need it in this case. There are many
> places in the code that do similar things, and since apply_dispatch()
> is the entry function for handling incoming messages, it is unlikely
> that a missing switch case would survive until the patch gets
> committed. If we don't move it, we would end up either adding the code
> that resets apply_error_callback_arg.command to every message type,
> adding a flag indicating the message was handled and checking it
> later, or having a big if statement checking whether the incoming
> message type is valid, etc.
>

I was reviewing and making minor edits to your v11-0001* patch and
noticed that the below parts of the code could be improved:
1.
+ if (errarg->rel)
+ appendStringInfo(&buf, _(" for replication target relation \"%s.%s\""),
+ errarg->rel->remoterel.nspname,
+ errarg->rel->remoterel.relname);
+
+ if (errarg->remote_attnum >= 0)
+ appendStringInfo(&buf, _(" column \"%s\""),
+ errarg->rel->remoterel.attnames[errarg->remote_attnum]);

Isn't it better if the 'remote_attnum' check is inside the
if (errarg->rel) check? It would be weird to print column information
without relation information, and in the current code we don't set
remote_attnum without rel. The other possibility could be to have an
Assert on rel in the 'remote_attnum' check.
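
For illustration, the nested form suggested above would look roughly
like this (a sketch assembled from the quoted fragment, not the patch
itself):

    if (errarg->rel)
    {
        appendStringInfo(&buf, _(" for replication target relation \"%s.%s\""),
                         errarg->rel->remoterel.nspname,
                         errarg->rel->remoterel.relname);

        /* remote_attnum is only ever set together with rel */
        if (errarg->remote_attnum >= 0)
            appendStringInfo(&buf, _(" column \"%s\""),
                             errarg->rel->remoterel.attnames[errarg->remote_attnum]);
    }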

2.
+ /* Reset relation for error callback */
+ apply_error_callback_arg.rel = NULL;
+
  logicalrep_rel_close(rel, NoLock);

  end_replication_step();

Isn't it better to reset the relation info as the last thing in
apply_handle_insert/update/delete, as you do for a few other
parameters? There is very little chance of an error from those two
functions, but it would still be good to cover the case where they
ever throw one, and it would make clear to future editors that this
needs to be reset as the last thing in these functions.

Note - I can take care of the above points based on whatever we agree
on; you don't need to send a new version for this.

-- 
With Regards,
Amit Kapila.




RE: Allow escape in application_name (was: [postgres_fdw] add local pid to fallback_application_name)

2021-08-25 Thread kuroda.hay...@fujitsu.com
Dear Fujii-san,

Thank you for replying! I attached a new version.

> Why did you make even %u and %d available in application_name?

Actually, no particular reason. I added them because they could easily
be added... And I agree with what you say, so I removed them.

> So some people may want to specify their own ID in application_name
> when connecting to the foreign server. For now they can do that by
> setting application_name per foreign server object. But this means that
> every session connecting to the same foreign server basically should
> use the same application_name. If they want to change the setting,
> they need to execute "ALTER SERAVER ... OPTIONS ...". Isn't this inconvenient?

Yeah, currently such users must execute an ALTER command each time.

> If yes, what about allowing us to specify foreign server's application_name
> per session? For example, we can add postgres_fdw.application_name GUC,
> and then if it's set postgres_fdw always uses it instead of the server-level
> option when connecting to the foreign server. Thought?

OK, I added the GUC parameter. My patch defines the new GUC in _PG_init().
The value is used when a backend starts to connect to the remote server.
Normal users can modify the value with SET commands, but such a change
does not propagate to already-established connections.
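
For example, the intended per-session usage would look something like
this (a sketch only; it assumes the GUC ends up being named
postgres_fdw.application_name as discussed above, and the foreign table
is invented):

    -- set a per-session identifier instead of ALTER SERVER ... OPTIONS
    SET postgres_fdw.application_name = 'myapp_batch_job';
    -- connections established from now on use this value; existing
    -- connections keep the name they were opened with
    SELECT count(*) FROM remote_orders;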

I'm not sure about the versioning policy for contrib modules.
Do we have to make a new version? No other objects are changed.

Finally, Fujii-san and I have been discussing locally whether this
replacement should live in the libpq or the postgres_fdw layer.
I have started to think that it should be postgres_fdw, because:
* previously, applications could already distinguish themselves using
  the current application_name,
* libpq is used by almost everything, so modifications there might
  affect someone, and
* libpq should arguably remain a simple tool, and not be intelligent.

I will make and attach another version that moves the new code into
postgres_fdw. Could you then vote on which version is better?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v02_0002_add_guc.patch
Description: v02_0002_add_guc.patch


v02_0001_allow_escape_in_application_name.patch
Description: v02_0001_allow_escape_in_application_name.patch


Re: Async-unsafe functions in signal handlers

2021-08-25 Thread Andrey Borodin



> On 25 Aug 2021, at 19:22, Denis Smirnov wrote:
> 
> I am going to refactor Greenplum backtraces for error messages and want to 
> make it more compatible with PostgreSQL code. Backtraces in PostgreSQL were 
> introduced by 71a8a4f6e36547bb060dbcc961ea9b57420f7190 commit (original 
> discussion 
> https://www.postgresql.org/message-id/CAMsr+YGL+yfWE=jvbubnpwtrrzney7hj07+zt4byjdvp4sz...@mail.gmail.com
>  ) and rely on backtrace() and backtrace_symbols() functions. They are used 
> inside errfinish() that is wrapped by ereport() macros. ereport() is invoked 
> inside bgworker_die() and FloatExceptionHandler() signal handlers. I am 
> confused with this fact - both backtrace functions are async-unsafe: 
> backtrace_symbols() - always, backtrace() - only for the first call due to 
> dlopen. I wonder why PostgreSQL uses async-unsafe functions in signal
> handlers?

In my view the GUC backtrace_functions is expected to be used for
debugging purposes, not for enabling on a production server for
bgworker_die() or FloatExceptionHandler().
Is there any way to call backtrace_symbols() without touching
backtrace_functions?

Best regards, Andrey Borodin.



RE: Skipping logical replication transactions on subscriber side

2021-08-25 Thread houzj.f...@fujitsu.com
On Wed, Aug 25, 2021 12:22 PM Masahiko Sawada  wrote:
> 
> Attached updated version patches. Please review them.

The v11-0001 patch LGTM.

Best regards,
Hou zj


Re: Skipping logical replication transactions on subscriber side

2021-08-25 Thread Greg Nancarrow
On Thu, Aug 26, 2021 at 1:51 PM Amit Kapila  wrote:
>
> Do you have any suggestions on how to achieve that without adding some
> additional variable? I think it is not a very hard requirement as we
> don't follow the same practice at other places in the code.
>

Sorry, forget my suggestion; I see it's not easy to achieve while still
executing the non-error-case code after the switch (you'd have to use a
variable set in the default case, defeating the purpose, or put the
switch in a separate function with a return in each case).
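
For the record, that separate-function shape would look roughly like
this (a sketch only; names are abbreviated and most cases elided):

    /*
     * With no default case and a return in each case, -Wswitch still
     * warns if a LogicalRepMsgType value is unhandled, and the caller
     * can run the common non-error code after a successful dispatch.
     */
    static bool
    dispatch_message(StringInfo s, LogicalRepMsgType action)
    {
        switch (action)
        {
            case LOGICAL_REP_MSG_BEGIN:
                apply_handle_begin(s);
                return true;
            case LOGICAL_REP_MSG_COMMIT:
                apply_handle_commit(s);
                return true;
            /* ... one case per message type ... */
        }
        return false;           /* invalid message type */
    }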

So the 0001 patch LGTM.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: row filtering for logical replication

2021-08-25 Thread Amit Kapila
On Thu, Aug 26, 2021 at 9:51 AM Peter Smith  wrote:
>
> On Thu, Aug 26, 2021 at 1:20 PM Amit Kapila  wrote:
> >
> > On Thu, Aug 26, 2021 at 7:37 AM Peter Smith  wrote:
> > >
> > > On Wed, Aug 25, 2021 at 3:28 PM Amit Kapila  
> > > wrote:
> > > >
> > > ...
> > > >
> > > > Hmm, I think the gain via caching is not visible because we are using
> > > > simple expressions. It will be visible when we use somewhat complex
> > > > expressions where expression evaluation cost is significant.
> > > > Similarly, the impact of this change will magnify and will also be
> > > > visible when a publication has many tables. Apart from performance,
> > > > this change is logically correct as well, because it is in any case
> > > > better not to invalidate the cached expressions unless required.
> > >
> > > Please tell me what your idea of a "complex" row filter expression is.
> > > Do you just mean a filter that has multiple AND conditions in it? I
> > > don't really know if a few complex expressions would amount to any
> > > significant evaluation cost, so I would like to run some timing tests
> > > with some real examples to see the results.
> > >
> >
> > I think this means you didn't understand, or aren't convinced, why the
> > patch has the cache in the first place. As per your theory, even if we
> > didn't have the cache it wouldn't matter, but that is not true;
> > otherwise, the patch wouldn't have it.
>
> I have never said there should be no caching. On the contrary, my
> performance test results [1] already confirmed that caching ExprState
> is of benefit for the millions of times it may be used in the
> pgoutput_row_filter function. My only doubts are in regard to how much
> observable impact there would be from re-evaluating the filter
> expression just a few extra times in the get_rel_sync_entry function.
>

I think it depends, but why would you want to allow re-evaluation in
the first place when there is a way to avoid it?

-- 
With Regards,
Amit Kapila.




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Masahiko Sawada
On Thu, Aug 26, 2021 at 12:59 PM Ajin Cherian  wrote:
>
> On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila  wrote:
> >
> > On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian  wrote:
> > >
> > > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila  
> > > wrote:
> > >
> > > >
> > > > You have a point but if we see the below logs, it seems the second
> > > > walsender (#step6) exited before the first walsender
> > > > (#step4).
> > > >
> > > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> > > > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> > > > session time: 0:00:06.367 user=nm database=postgres host=[local]
> > > >
> > > > Isn't it possible that pid is cleared in the other order due to which
> > > > we are seeing this problem?
> > >
> > > If the pid is cleared in the other order, wouldn't the query [1] return a 
> > > false?
> > >
> > > [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
> > > application_name =  'tap_sub';"
> > >
> >
> > I think it should return true because pid for 16336 is cleared first
> > and the remaining one will be 16475.
>
> Yes, that was what I explained as well. 16336 is PID 'a' (first
> walsender) in my explanation. The first walsender should
> be cleared first for this theory to work.

I think it's possible that the order in which processes write
disconnection logs and the order in which they set the walsender's pid
to 0 are mismatched. We set the walsender's pid to 0 in WalSndKill(),
which is called from an on_shmem_exit callback. Once we set it to 0,
pg_stat_replication no longer shows the entry. On the other hand,
disconnection logs are written by log_disconnections(), which is called
from an on_proc_exit callback. That is, the following sequence could
happen:

1. the second walsender (pid = 16475) raises an error as the slot is
already active (held by the first walsender).
2. the first walsender (pid = 16336) clears its pid on the shmem.
3. the polling query checks the walsender's pid, and returns true
(since there is only the second walsender now).
4. the second walsender clears its pid on the shmem.
5. the second walsender writes its disconnection log.
6. the first walsender writes its disconnection log.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: row filtering for logical replication

2021-08-25 Thread Peter Smith
On Thu, Aug 26, 2021 at 1:20 PM Amit Kapila  wrote:
>
> On Thu, Aug 26, 2021 at 7:37 AM Peter Smith  wrote:
> >
> > On Wed, Aug 25, 2021 at 3:28 PM Amit Kapila  wrote:
> > >
> > ...
> > >
> > > Hmm, I think the gain via caching is not visible because we are using
> > > simple expressions. It will be visible when we use somewhat complex
> > > expressions where expression evaluation cost is significant.
> > > Similarly, the impact of this change will magnify and will also be
> > > visible when a publication has many tables. Apart from performance,
> > > this change is logically correct as well, because it is in any case
> > > better not to invalidate the cached expressions unless required.
> >
> > Please tell me what your idea of a "complex" row filter expression is.
> > Do you just mean a filter that has multiple AND conditions in it? I
> > don't really know if a few complex expressions would amount to any
> > significant evaluation cost, so I would like to run some timing tests
> > with some real examples to see the results.
> >
>
> I think this means you didn't understand, or aren't convinced, why the
> patch has the cache in the first place. As per your theory, even if we
> didn't have the cache it wouldn't matter, but that is not true;
> otherwise, the patch wouldn't have it.

I have never said there should be no caching. On the contrary, my
performance test results [1] already confirmed that caching ExprState
is of benefit for the millions of times it may be used in the
pgoutput_row_filter function. My only doubts are in regard to how much
observable impact there would be from re-evaluating the filter
expression just a few extra times in the get_rel_sync_entry function.

--
[1] 
https://www.postgresql.org/message-id/CAHut%2BPs5j7mkO0xLmNW%3DkXh0eezGoKyzBCiQc9bfkCiM_MVDrg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Skipping logical replication transactions on subscriber side

2021-08-25 Thread Masahiko Sawada
On Thu, Aug 26, 2021 at 12:51 PM Amit Kapila  wrote:
>
> On Thu, Aug 26, 2021 at 7:15 AM Greg Nancarrow  wrote:
> >
> > On Wed, Aug 25, 2021 at 2:22 PM Masahiko Sawada  
> > wrote:
> > >
> > > Attached updated version patches. Please review them.
> > >
> >
> > Regarding the v11-0001 patch, it looks OK to me, but I do have one point:
> > In apply_dispatch(), wouldn't it be better to NOT move the error
> > reporting for an invalid message type into the switch as the default
> > case - because then, if you add a new message type, you won't get a
> > compiler warning (when warnings are enabled) for a missing switch
> > case, which is a handy way to alert you that the new message type
> > needs to be added as a case to the switch.
> >
>
> Do you have any suggestions on how to achieve that without adding some
> additional variable? I think it is not a very hard requirement as we
> don't follow the same practice at other places in the code.

Yeah, I agree that it's a handy way to detect a missing switch case,
but I think we don't necessarily need it in this case. There are many
places in the code that do similar things, and since apply_dispatch()
is the entry function for handling incoming messages, it is unlikely
that a missing switch case would survive until the patch gets
committed. If we don't move it, we would end up either adding the code
that resets apply_error_callback_arg.command to every message type,
adding a flag indicating the message was handled and checking it
later, or having a big if statement checking whether the incoming
message type is valid, etc.
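
The shape being defended here, with the error report folded into the
switch and the reset done once afterwards, is roughly the following
(a sketch only; most cases elided):

    switch (action)
    {
        case LOGICAL_REP_MSG_BEGIN:
            apply_handle_begin(s);
            break;
        /* ... one case per message type ... */
        default:
            ereport(ERROR,
                    (errcode(ERRCODE_PROTOCOL_VIOLATION),
                     errmsg("invalid logical replication message type \"%c\"",
                            action)));
    }

    /* Reset the current command, common to all message types. */
    apply_error_callback_arg.command = 0;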

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila  wrote:
>
> On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian  wrote:
> >
> > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila  wrote:
> >
> > >
> > > You have a point but if we see the below logs, it seems the second
> > > walsender (#step6) exited before the first walsender
> > > (#step4).
> > >
> > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> > > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> > > session time: 0:00:06.367 user=nm database=postgres host=[local]
> > >
> > > Isn't it possible that pid is cleared in the other order due to which
> > > we are seeing this problem?
> >
> > If the pid is cleared in the other order, wouldn't the query [1] return a 
> > false?
> >
> > [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
> > application_name =  'tap_sub';"
> >
>
> I think it should return true because pid for 16336 is cleared first
> and the remaining one will be 16475.

Yes, that was what I explained as well. 16336 is PID 'a' (first
walsender) in my explanation. The first walsender should
be cleared first for this theory to work.

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Amit Kapila
On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian  wrote:
>
> On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila  wrote:
>
> >
> > You have a point but if we see the below logs, it seems the second
> > walsender (#step6) exited before the first walsender
> > (#step4).
> >
> > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> > session time: 0:00:06.367 user=nm database=postgres host=[local]
> >
> > Isn't it possible that pid is cleared in the other order due to which
> > we are seeing this problem?
>
> If the pid is cleared in the other order, wouldn't the query [1] return a 
> false?
>
> [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
> application_name =  'tap_sub';"
>

I think it should return true because pid for 16336 is cleared first
and the remaining one will be 16475.

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2021-08-25 Thread Amit Kapila
On Thu, Aug 26, 2021 at 7:15 AM Greg Nancarrow  wrote:
>
> On Wed, Aug 25, 2021 at 2:22 PM Masahiko Sawada  wrote:
> >
> > Attached updated version patches. Please review them.
> >
>
> Regarding the v11-0001 patch, it looks OK to me, but I do have one point:
> In apply_dispatch(), wouldn't it be better to NOT move the error
> reporting for an invalid message type into the switch as the default
> case - because then, if you add a new message type, you won't get a
> compiler warning (when warnings are enabled) for a missing switch
> case, which is a handy way to alert you that the new message type
> needs to be added as a case to the switch.
>

Do you have any suggestions on how to achieve that without adding some
additional variable? I think it is not a very hard requirement as we
don't follow the same practice at other places in the code.

-- 
With Regards,
Amit Kapila.




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila  wrote:

>
> You have a point but if we see the below logs, it seems the second
> walsender (#step6) exited before the first walsender
> (#step4).
>
> 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> session time: 0:00:00.036 user=nm database=postgres host=[local]
> 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> session time: 0:00:06.367 user=nm database=postgres host=[local]
>
> Isn't it possible that pid is cleared in the other order due to which
> we are seeing this problem?

If the pid is cleared in the other order, wouldn't the query [1] return a false?

[1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
application_name =  'tap_sub';"

regards,
Ajin Cherian
Fujitsu Australia




Re: prevent immature WAL streaming

2021-08-25 Thread Bossart, Nathan
On 8/25/21, 5:40 PM, "Kyotaro Horiguchi"  wrote:
> At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan"
> wrote in
>> Let's say we have the following situation (F = flush, E = earliest
>> registered boundary, and L = latest registered boundary), and let's
>> assume that each segment has a cross-segment record that ends in the
>> next segment.
>>
>> F E L
>> |-|-|-|-|-|-|-|-|
>>1 2 3 4 5 6 7 8
>>
>> Then, we write out WAL to disk and create .ready files as needed.  If
>> we didn't flush beyond the latest registered boundary, the latest
>> registered boundary now becomes the earliest boundary.
>>
>>   F E
>> |-|-|-|-|-|-|-|-|
>>1 2 3 4 5 6 7 8
>>
>> At this point, the earliest segment boundary past the flush point is
>> before the "earliest" boundary we are tracking.
>
> We know we have some cross-segment records in the region [E, L], so we
> cannot add a .ready file if the flush point is in that region. I haven't
> looked at the latest patch (or I may misunderstand the discussion here),
> but I think we shouldn't move E before F exceeds the previous L (the one
> in the first picture above). Things are done that way in my ancient
> proposal in [1].

The strategy in place ensures that we track a boundary that doesn't
change until the flush position passes it, as well as the latest
registered boundary.  I think it is important that any segment
boundary tracking mechanism does at least those two things.  I don't
see how we could do that if we didn't update E until F passed both E
and L.
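
To make the bookkeeping concrete, here is one way the two-boundary
scheme described above could be written (a sketch with purely
illustrative names, not the patch's actual identifiers):

    typedef struct SegBoundaryState
    {
        XLogSegNo   earliest_seg;   /* E: fixed until flush passes it */
        XLogRecPtr  earliest_end;   /* where E's cross-segment record ends */
        XLogSegNo   latest_seg;     /* L: replaced by each registration */
        XLogRecPtr  latest_end;
    } SegBoundaryState;

    /* Register a newly written cross-segment record boundary. */
    static void
    register_boundary(SegBoundaryState *st, XLogSegNo seg, XLogRecPtr end)
    {
        if (st->earliest_seg == 0)  /* nothing tracked yet */
        {
            st->earliest_seg = seg;
            st->earliest_end = end;
        }
        st->latest_seg = seg;       /* always remember the latest */
        st->latest_end = end;
    }

    /* On flush: if we passed E but not L, L becomes the new E. */
    static void
    advance_boundaries(SegBoundaryState *st, XLogRecPtr flush)
    {
        if (flush >= st->latest_end)
            st->earliest_seg = st->latest_seg = 0;  /* all crossed */
        else if (flush >= st->earliest_end)
        {
            st->earliest_seg = st->latest_seg;
            st->earliest_end = st->latest_end;
        }
    }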

Nathan



Re: row filtering for logical replication

2021-08-25 Thread Amit Kapila
On Thu, Aug 26, 2021 at 7:37 AM Peter Smith  wrote:
>
> On Wed, Aug 25, 2021 at 3:28 PM Amit Kapila  wrote:
> >
> ...
> >
> > Hmm, I think the gain via caching is not visible because we are using
> > simple expressions. It will be visible when we use somewhat complex
> > expressions where expression evaluation cost is significant.
> > Similarly, the impact of this change will magnify and will also be
> > visible when a publication has many tables. Apart from performance,
> > this change is logically correct as well, because it is in any case
> > better not to invalidate the cached expressions unless required.
>
> Please tell me what your idea of a "complex" row filter expression is.
> Do you just mean a filter that has multiple AND conditions in it? I
> don't really know if a few complex expressions would amount to any
> significant evaluation cost, so I would like to run some timing tests
> with some real examples to see the results.
>

I think this means you didn't understand, or aren't convinced, why the
patch has the cache in the first place. As per your theory, even if we
didn't have the cache it wouldn't matter, but that is not true;
otherwise, the patch wouldn't have it.

> On Wed, Aug 25, 2021 at 6:31 PM Amit Kapila  wrote:
> >
> > On Wed, Aug 25, 2021 at 10:57 AM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Aug 25, 2021 at 5:52 AM Euler Taveira  wrote:
> > > >
> > > > On Tue, Aug 24, 2021, at 4:46 AM, Peter Smith wrote:
> >
> > > >
> > > > Anyway, I have implemented the suggested cache change because I agree
> > > > it is probably theoretically superior, even if in practice there is
> > > > almost no difference.
> > > >
> > > > I didn't inspect your patch carefully but it seems you add another List 
> > > > to
> > > > control this new cache mechanism. I don't like it. IMO if we can use 
> > > > the data
> > > > structures that we have now, let's implement your idea; otherwise, -1 
> > > > for this
> > > > new micro optimization.
> > > >
> > >
> > > As mentioned above, without this we will invalidate many cached
> > > expressions even though it is not required. I don't deny that there
> > > might be a better way to achieve the same and if you or Peter have any
> > > ideas, I am all ears.
> > >
> >
> > I see that the new list is added to store row_filter node which we
> > later use to compute expression. This is not required for invalidation
> > but for delaying the expression evaluation till it is required (for
> > example, for truncate, we may not need the row evaluation, so there is
> > no need to compute it). Can we try to postpone the syscache lookup to
> > a later stage when we are actually doing row_filtering? If we can do
> > that, then I think we can avoid having this extra list?
>
> Yes, you are correct - that Node list was re-instated only because you
> had requested that the ExprState evaluation should be deferred until
> it is needed by the pgoutput_row_filter. Otherwise, the additional
> list would not be needed so everything would be much the same as in
> v23 except the invalidations would be more focussed on single tables.
>
> I don't think the syscache lookup can be easily postponed. That logic
> of get_rel_sync_entry processes the table filters of *all*
> publications, so moving that publications loop (including the
> partition logic) into the pgoutput_row_filter seems a bridge too far
> IMO.
>

Hmm, I don't think that is true. You just need it for the relation
being processed.

> Furthermore, I am not yet convinced that this ExprState postponement
> is very useful. It may be true that for truncate there is no need to
> compute it, but consider that the user would never even define a row
> filter in the first place unless they intended there to be some CRUD
> operations. So even if the truncate does not need the filter,
> *something* is surely going to need it.
>

Sure, but we don't need to add additional computation until it is required.

-- 
With Regards,
Amit Kapila.




Re: Fix around conn_duration in pgbench

2021-08-25 Thread Yugo NAGATA
On Fri, 20 Aug 2021 02:05:27 +0900
Fujii Masao  wrote:

> 
> On 2021/08/11 13:56, Fujii Masao wrote:
> > Yes, but I was thinking that's a bug that we should fix.
> > IOW, I was thinking that, in v13, both connection and disconnection delays
> > should be measured whether -C is specified or not, *per spec*.
> > But, in v13, the disconnection delays are not measured in the cases
> > where -C is specified and not specified. So I was thinking that this is
> > a bug and we should fix those both cases.
> > 
> > But you're thinking that, in v13, the disconnection delays don't need to
> > be measured because they are not measured for now?
> 
> Please let me clarify my thought.

Thank you for your clarification.

> 
> In master and v14,
> 
> # Expected behavior
> (1) Both connection and disconnection delays should be measured
>only when -C is specified, but not otherwise.
> (2) When -C is specified, since each transaction establishes and closes
>a connection, those delays should be measured for each transaction.
> 
> # Current behavior
> (1) Connection delay is measured whether -C is specified or not.
> (2) Even when -C is specified, disconnection delay is NOT measured
>at the end of transaction.
> 
> # What the patch should do
> (1) Make pgbench skip measuring connection and disconnection delays
>if not necessary (i.e., -C is not specified).
> (2) Make pgbench measure the disconnection delays whenever
>the connection is closed at the end of transaction, when -C is 
> specified.

I agree with you. This is what the patch for pg14 does. We don't need to measure
disconnection delay when -C is not specified because the output just reports
"initial connection time".

> In v13 or before,
> 
> # Expected behavior
> (1) Both connection and disconnection delays should be measured
>whether -C is specified or not. Because information about those delays
>is used for the benchmark result report.
> (2) When -C is specified, since each transaction establishes and closes
>a connection, those delays should be measured for each transaction.
> 
> # Current behavior
> (1)(2) Disconnection delay is NOT measured whether -C is specified or not.
> 
> # What the patch should do
> (1)(2) Make pgbench measure the disconnection delays whenever
>  the connection is closed at the end of transaction (for -C case)
>  and the end of thread (for NOT -C case).

Ok. That makes sense. The output reports "including connections establishing"
and "excluding connections establishing" regardless of -C, so we should
measure the delays in the same way.

I updated the patch for pg13 to measure disconnection delay when -C is not
specified. I attached the updated patch for pg13 as well as the one for
pg14, which is the same as attached before.

> 
> Anyway, I changed the status of this patch to "Waiting on Author" in CF.

I returned the status to "Ready for Committer". 
Could you please review this?

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 338c0152a2..0058f3 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3285,8 +3285,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 if (is_connect)
 {
+	instr_time	start = now;
+
+	INSTR_TIME_SET_CURRENT_LAZY(start);
 	finishCon(st);
-	INSTR_TIME_SET_ZERO(now);
+	INSTR_TIME_SET_CURRENT(now);
+	INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
 }
 
 if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3310,7 +3314,28 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  */
 			case CSTATE_ABORTED:
 			case CSTATE_FINISHED:
-finishCon(st);
+/*
+ * In CSTATE_FINISHED state, this finishCon() is unnecessary
+ * under -C/--connect because all the connections that this
+ * thread established should have already been closed at the end
+ * of transactions. So we need to measure the disconnection
+ * delays here only when -C/--connect is not specified.
+ *
+ * In CSTATE_ABORTED state, the measurement is no longer
+ * necessary because we cannot report complete results anyways
+ * in this case.
+ */
+if (!is_connect)
+{
+	instr_time	start = now;
+
+	INSTR_TIME_SET_CURRENT_LAZY(start);
+	finishCon(st);
+	INSTR_TIME_SET_CURRENT(now);
+	INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+}
+else
+	finishCon(st);
 return;
 		}
 	}
@@ -6210,6 +6235,12 @@ main(int argc, char **argv)
 		latency_late += thread->latency_late;
 		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
 	}
+
+	/*
+	 * All connections should be already closed in threadRun(), so this
+	 * disconnect_all() will be a no-op, but clean up the connections just
+	 * to be sure. We don't need to measure the disconnection delays here.
+	 */
 	disconnect_all(state, nclients);
 
 	/*
@@ -6237,8 +6268,7 @@ thread

Re: Failure of subscription tests with topminnow

2021-08-25 Thread Amit Kapila
On Thu, Aug 26, 2021 at 7:38 AM Ajin Cherian  wrote:
>
> On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada  
> wrote:
> >
>
> Luckily these logs have the disconnection times of the tap test client
> sessions as well. (not sure why I don't see these when I run these
> tests).
>
> Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
> 18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG:  statement: SELECT
> pid != 16336 FROM pg_stat_replication WHERE application_name =
> 'tap_sub';
> :
> :
> 18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG:  disconnection:
> session time: 0:00:00.063 user=nm database=postgres host=[local]
>
> When the query starts, both walsenders are present, but when the query
> completes, both walsenders are gone; the actual query evaluation could
> have happened any time in between. This is the rare timing window that
> causes this problem.
>

You have a point but if we see the below logs, it seems the second
walsender (#step6) exited before the first walsender
(#step4).

2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]

Isn't it possible that pid is cleared in the other order due to which
we are seeing this problem?

-- 
With Regards,
Amit Kapila.




Re: log_autovacuum in Postgres 14 -- ordering issue

2021-08-25 Thread Peter Geoghegan
On Wed, Aug 25, 2021 at 5:23 PM Alvaro Herrera  wrote:
> > The question of whether or not we do an index scan (i.e. index
> > vacuuming) depends entirely on the number of LP_DEAD items that heap
> > pruning left behind in the table structure. [...]
>
> Ooh, this was illuminating -- thanks for explaining.  TBH I would have
> been very confused if asked to explain what that log line meant; and now
> that I know what it means, I am even more convinced that we need to work
> harder at it :-)

The way that VACUUM and ANALYZE do dead tuple accounting is very
confusing. In fact, it's so confusing that even autovacuum can get
confused! I think that we need to treat LP_DEAD items and pruned
tuples even more differently than we do in Postgres 14, probably in a
number of different areas (not just VACUUM).

I've found that if I set autovacuum_vacuum_scale_factor and
autovacuum_analyze_scale_factor to 0.02 with a HOT-heavy workload
(almost stock pgbench), then autovacuum workers are launched almost
constantly. If I then increase autovacuum_vacuum_scale_factor to 0.05,
but make no other changes, then the system decides that it should
actually never launch an autovacuum worker, even once (except for
anti-wraparound purposes) [1]. This behavior is completely absurd, of
course. To me this scenario illustrates an important general point:
VACUUM has the wrong idea. At least when it comes to certain specific
details. Details that have plenty of real world relevance.
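
A sketch of that scenario, assuming a near-stock pgbench workload (the
numbers are from the report above; the statements are just one way to
apply them):

    ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.02;
    ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.02;
    SELECT pg_reload_conf();
    -- autovacuum workers are launched almost constantly; then:
    ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;
    SELECT pg_reload_conf();
    -- with only this change, autovacuum decides it never needs to run
    -- (except for anti-wraparound purposes)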

VACUUM currently fails to understand anything about the rate of change
-- which, as I've said, is often the most important thing in the real
world. That's what my absurd scenario seems to show. That's how I view
a lot of these things.

> I'll see if I can come up with something ...

Thanks.

The message itself probably does need some work. But documentation
seems at least as important. It's slightly daunting, honestly, because
we don't even document HOT itself (unless you count passing references
that don't even explain the basic idea). I did try to get people
interested in this stuff at one point not too long ago [2]. That
thread went an entirely different direction to the one I'd planned on,
though, so I became discouraged. I should pick it up again now,
though.

[1] 
https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=w...@mail.gmail.com
[2] 
https://postgr.es/m/cah2-wzkju+nibskzunbdpz6trse+aqvupae+xgm8zvob4wq...@mail.gmail.com
--
Peter Geoghegan




Re: Window Function "Run Conditions"

2021-08-25 Thread Andy Fan
On Thu, Aug 19, 2021 at 2:35 PM David Rowley  wrote:
>
> On Thu, 19 Aug 2021 at 00:20, Andy Fan  wrote:
> > In the current master, the result is:
> >
> >  empno | salary | c | dr
> > ---++---+
> >  8 |   6000 | 4 |  1
>
> > In the patched version, the result is:
> >
> >  empno | salary | c | dr
> > ---++---+
> >  8 |   6000 | 1 |  1
>
> Thanks for taking it for a spin.
>
> That's a bit unfortunate.  I don't immediately see how to fix it other
> than to restrict the optimisation to only apply when there's a single
> WindowClause. It might be possible to relax it further and only apply
> if it's the final window clause to be evaluated, but in those cases,
> the savings are likely to be much less anyway as some previous
> WindowAgg will have exhausted all rows from its subplan.

I am trying to hack the select_active_windows function to make sure
the WindowClause with a run condition is the last one to be evaluated
(we also need to consider the case where more than one window function
has a run condition); by that point, the run condition clause is
already available.

However, there are two troubles in this direction: a) it may conflict
with "the windows that need the same sorting are adjacent in the list";
b) "when two or more windows are order-equivalent then all peer rows
must be presented in the same order in all of them" (see General Rule 4
of  in SQL2008 - SQL2016).

In summary, I am not sure it is correct to change the execution order
of WindowAgg nodes freely.

>   Likely
> restricting it to only working if there's 1 WindowClause would be fine
> as for the people using row_number() for a top-N type query, there's
> most likely only going to be 1 WindowClause.
>

This sounds practical. And I suggest the following small change
(just check the partitionClause before the prosupport lookup):

@@ -2133,20 +2133,22 @@ find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,

*keep_original = true;

-   prosupport = get_func_support(wfunc->winfnoid);
-
-   /* Check if there's a support function for 'wfunc' */
-   if (!OidIsValid(prosupport))
-   return false;
-
/*
 * Currently the WindowAgg node just stop when the run condition is no
 * longer true.  If there is a PARTITION BY clause then we cannot just
 * stop as other partitions still need to be processed.
 */
+
+   /* Check this first since window function with a partition clause is common */
if (wclause->partitionClause != NIL)
return false;

+   prosupport = get_func_support(wfunc->winfnoid);
+
+   /* Check if there's a support function for 'wfunc' */
+   if (!OidIsValid(prosupport))
+   return false;
+
/* get the Expr from the other side of the OpExpr */
if (wfunc_left)
otherexpr = lsecond(opexpr->args);



-- 
Best Regards
Andy Fan (https://www.aliyun.com/)




Re: Mark all GUC variable as PGDLLIMPORT

2021-08-25 Thread Bruce Momjian
On Wed, Aug 25, 2021 at 10:41:14AM -0400, Tom Lane wrote:
> Magnus Hagander  writes:
> > On Wed, Aug 25, 2021 at 4:06 PM Robert Haas  wrote:
> >> It does tend to be controversial, but I think that's basically only
> >> because Tom Lane has reservations about it. I think if Tom dropped his
> >> opposition to this, nobody else would really care. And I think that
> >> would be a good thing for the project.
> 
> > But in particular, both on that argument, and on the general
> > maintenance argument, I have a very hard time seeing how exporting the
> > GUC variables would be any worse than exporting the many hundreds of
> > functions we already export.
> 
> My beef about it has nothing to do with binary-size concerns, although
> that is an interesting angle.  (I wonder whether marking a variable
> PGDLLIMPORT has any negative effect on the cost of accessing it from
> within the core code?)  Rather, I'm unhappy with spreading even more
> Microsoft-droppings all over our source.  If there were some way to
> just do this automatically for all global variables without any source
> changes, I'd be content with that.  That would *really* make the
> platforms more nearly equivalent.

OK, so the big question is how can we minimize the code impact of this
feature?  Can we add some kind of checking so we don't forget to mark
anything?  Can we automate this somehow?
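
One conceivable low-tech check, sketched here purely for illustration
(it assumes one declaration per line and would miss anything fancier):

    # Flag extern variable declarations in headers that lack PGDLLIMPORT.
    while (<>)
    {
        next unless /^extern\s+/;       # only extern declarations
        next if /PGDLLIMPORT/;          # already marked
        next if /\(/;                   # skip function prototypes
        print "$ARGV:$.: unmarked extern variable: $_";
    }
    continue { close ARGV if eof; }     # reset $. per input file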

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





RE: Added schema level support for publication.

2021-08-25 Thread tanghy.f...@fujitsu.com
On Wednesday, August 25, 2021 5:37 PM vignesh C  wrote:
> 
> Attached v21 patch has the changes based on the new syntax and fixes
> few of the other review comments provided by reviewers.
> 

Thanks for your new patch. I saw the following warning when building;
please have a look.

publicationcmds.c: In function ‘ConvertPubObjSpecListToOidList’:
publicationcmds.c:212:23: warning: ‘prevobjtype’ may be used uninitialized in 
this function [-Wmaybe-uninitialized]
pubobj->pubobjtype = prevobjtype;
~~~^

Regards
Tang


Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada  wrote:
>
> On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian  wrote:
> >
> > On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > I did a quick check with the following tap test code:
> > > >
> > > > $node_publisher->poll_query_until('postgres',
> > > >   qq(
> > > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > > ));
> > > >
> > > > The query returns {t, f} but poll_query_until() never finished. The
> > > > same is true when the query returns {f, t}.
> > > >
> >
> > Yes, this is true, I also see the same behaviour.
> >
> > >
> > > This means something different is going on in Ajin's setup. Ajin, can
> > > you please share how did you confirm your findings about poll_query?
> >
> > Relooking at my logs, I think what happens is this:
> >
> > 1. First walsender 'a' is running.
> > 2. Second walsender 'b' starts, attempts to acquire the slot, and
> > finds that the slot is active for pid a.
> > 3. Now both walsenders are active, the query does not return.
> > 4. First walsender 'a' times out and exits.
> > 5. Now only the second walsender is active and the query returns OK
> > because pid != a.
> > 6. Second walsender exits with error.
> > 7. Another query attempts to get the pid of the running walsender for
> > tap_sub but returns null because both walsenders have exited.
> > 8. This null return value results in the next query erroring out and
> > the test failing.
>
> So this is slightly different than what we can see in the topminnow
> logs? According to the server logs, step #5 happened (at 18:44:38.016)
> before step #4 happened (at 18:44:38.043).
>

Luckily these logs have the disconnection times of the tap test client
sessions as well. (not sure why I don't see these when I run these
tests).

Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG:  statement: SELECT
pid != 16336 FROM pg_stat_replication WHERE application_name =
'tap_sub';
:
:
18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG:  disconnection:
session time: 0:00:00.063 user=nm database=postgres host=[local]

When the query starts, both walsenders are present, but when the query
completes, both walsenders are gone; the actual query evaluation could
have happened any time in between. This is the rare timing window that
causes this problem.

regards,
Ajin Cherian
Fujitsu Australia




Re: row filtering for logical replication

2021-08-25 Thread Peter Smith
On Wed, Aug 25, 2021 at 3:28 PM Amit Kapila  wrote:
>
...
>
> Hmm, I think the gain via caching is not visible because we are using
> simple expressions. It will be visible when we use somewhat complex
> expressions where expression evaluation cost is significant.
> Similarly, the impact of this change will magnify and it will also be
> visible when a publication has many tables. Apart from performance,
> this change is logically correct as well because it would be any way
> better if we don't invalidate the cached expressions unless required.

Please tell me what your idea of a "complex" row filter expression is.
Do you just mean a filter that has multiple AND conditions in it? I
don't really know if a few complex expressions would amount to any
significant evaluation cost, so I would like to run some timing tests
with some real examples to see the results.
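
For instance, a "simple" versus a more "complex" filter might look like
the following (a sketch using the patch's proposed CREATE PUBLICATION
... WHERE syntax; the table and columns are invented):

    CREATE PUBLICATION pub_simple FOR TABLE orders
        WHERE (region = 'EU');

    CREATE PUBLICATION pub_complex FOR TABLE orders
        WHERE (region = 'EU' AND total > 100
               AND status NOT IN ('cancelled', 'rejected'));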

On Wed, Aug 25, 2021 at 6:31 PM Amit Kapila  wrote:
>
> On Wed, Aug 25, 2021 at 10:57 AM Amit Kapila  wrote:
> >
> > On Wed, Aug 25, 2021 at 5:52 AM Euler Taveira  wrote:
> > >
> > > On Tue, Aug 24, 2021, at 4:46 AM, Peter Smith wrote:
>
> > >
> > > Anyway, I have implemented the suggested cache change because I agree
> > > it is probably theoretically superior, even if in practice there is
> > > almost no difference.
> > >
> > > I didn't inspect your patch carefully but it seems you add another List to
> > > control this new cache mechanism. I don't like it. IMO if we can use the 
> > > data
> > > structures that we have now, let's implement your idea; otherwise, -1 for 
> > > this
> > > new micro optimization.
> > >
> >
> > As mentioned above, without this we will invalidate many cached
> > expressions even though it is not required. I don't deny that there
> > might be a better way to achieve the same and if you or Peter have any
> > ideas, I am all ears.
> >
>
> I see that the new list is added to store row_filter node which we
> later use to compute expression. This is not required for invalidation
> but for delaying the expression evaluation till it is required (for
> example, for truncate, we may not need the row evaluation, so there is
> no need to compute it). Can we try to postpone the syscache lookup to
> a later stage when we are actually doing row_filtering? If we can do
> that, then I think we can avoid having this extra list?

Yes, you are correct - that Node list was re-instated only because you
had requested that the ExprState evaluation should be deferred until
it is needed by the pgoutput_row_filter. Otherwise, the additional
list would not be needed so everything would be much the same as in
v23 except the invalidations would be more focussed on single tables.

I don't think the syscache lookup can be easily postponed. That logic
of get_rel_sync_entry processes the table filters of *all*
publications, so moving that publications loop (including the
partition logic) into the pgoutput_row_filter seems a bridge too far
IMO.

Furthermore, I am not yet convinced that this ExprState postponement
is very useful. It may be true that for truncate there is no need to
compute it, but consider that the user would never even define a row
filter in the first place unless they intended there to be some CRUD
operations. So even if the truncate does not need the filter,
*something* is surely going to need it. In other words, IIUC this
postponement is not going to save any time overall - it only shifts
when the (one-time) expression evaluation happens.

I feel it would be better to just remove the postponed evaluation of
the ExprState added in v24. That will remove any need for the extra
Node list (which I think is Euler's concern). The ExprState cache will
still be slightly improved from how it was implemented before because
it is "logically correct" that we don't invalidate the cached
expressions unless required.

Thoughts?

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Separate out FileSet from SharedFileSet (was Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o)

2021-08-25 Thread Masahiko Sawada
On Wed, Aug 25, 2021 at 9:19 PM Amit Kapila  wrote:
>
> On Tue, Aug 24, 2021 at 3:55 PM Dilip Kumar  wrote:
> >
> > On Tue, Aug 24, 2021 at 12:26 PM Amit Kapila  
> > wrote:
> >
>
> The first patch looks good to me. I have made minor changes to the
> attached patch. The changes include: fixing compilation warning, made
> some comment changes, ran pgindent, and few other cosmetic changes. If
> you are fine with the attached, then kindly rebase the second patch
> atop it.

The patch looks good to me.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Skipping logical replication transactions on subscriber side

2021-08-25 Thread Greg Nancarrow
On Wed, Aug 25, 2021 at 2:22 PM Masahiko Sawada  wrote:
>
> Attached updated version patches. Please review them.
>

Regarding the v11-0001 patch, it looks OK to me, but I do have one point:
In apply_dispatch(), wouldn't it be better to NOT move the error
reporting for an invalid message type into the switch as the default
case - because then, if you add a new message type, you won't get a
compiler warning (when warnings are enabled) for a missing switch
case, which is a handy way to alert you that the new message type
needs to be added as a case to the switch.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Mark all GUC variable as PGDLLIMPORT

2021-08-25 Thread Julien Rouhaud
On Thu, Aug 26, 2021 at 1:51 AM Alvaro Herrera  wrote:
>
> On 2021-Aug-25, Magnus Hagander wrote:
>
> > The thing we need the PGDLLIMPORT definition for is to *import* them
> > on the other end?
>
> Oh ... so modules that are willing to cheat can include their own
> declarations of the variables they need, and mark them __declspec
> (dllimport)?

I just tried and msvc doesn't like it.  It errors out with a C2370
error "redefinition; different storage class".  According to
https://docs.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2370
changing __declspec() on the other side is not possible.
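
Concretely, the failed "cheat" looks like this (a sketch; the variable
name is invented):

    /* In a core header, the variable is declared without the marking: */
    extern int  some_guc;

    /* The extension then tries to re-declare it for import: */
    extern __declspec(dllimport) int some_guc;
    /* MSVC rejects the second declaration with:
     *   error C2370: 'some_guc': redefinition; different storage class
     */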




Re: prevent immature WAL streaming

2021-08-25 Thread Kyotaro Horiguchi
(this is off-topic here)

At Wed, 25 Aug 2021 09:56:56 -0400, Robert Haas  wrote 
in 
> On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi
>  wrote:
> > At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera 
> >  wrote in
> > > I'd also like to have tests.  That seems moderately hard, but if we had
> > > WAL-molasses that could be used in walreceiver, it could be done. (It
> > > sounds easier to write tests with a molasses-archive_command.)
> > >
> > >
> > > [1] https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
> > > [2] 
> > > https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan@alibaba-inc.com
> >
> > (I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)
> 
> I think, but am not 100% sure, that "molasses" here is being used to
> refer to fault injection.

Oh. That makes sense, thanks.

I sometimes inject artificial faults (a server crash, in this case) to
create specific on-disk states, but I cannot imagine that kind of
machinery being statically placed in the source tree...

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: prevent immature WAL streaming

2021-08-25 Thread Kyotaro Horiguchi
At Wed, 25 Aug 2021 20:20:04 -0400, "alvhe...@alvh.no-ip.org" 
 wrote in 
> BTW while going about testing this, I noticed that we forbid
> pg_walfile_name() while in recovery.  That restriction was added by
> commit 370f770c15a4 because ThisTimeLineID was not set correctly during
> recovery.  That was supposed to be fixed by commit 1148e22a82ed, so I
> thought that it should be possible to remove the restriction.  However,
> I did that per the attached patch, but was quickly disappointed because
> ThisTimeLineID seems to remain zero in a standby for reasons that I
> didn't investigate.

On an intermediate node of a cascading replication setup, the timeline
ID seen by the walsender and the walreceiver can differ, and ordinary
backends cannot decide which one to use.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Masahiko Sawada
On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian  wrote:
>
> On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila  wrote:
> >
> > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada  
> > wrote:
> > >
> > > I did a quick check with the following tap test code:
> > >
> > > $node_publisher->poll_query_until('postgres',
> > >   qq(
> > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > ));
> > >
> > > The query returns {t, f} but poll_query_until() never finished. The
> > > same is true when the query returns {f, t}.
> > >
>
> Yes, this is true, I also see the same behaviour.
>
> >
> > This means something different is going on in Ajin's setup. Ajin, can
> > you please share how did you confirm your findings about poll_query?
>
> Relooking at my logs, I think what happens is this:
>
> 1. First walsender 'a' is running.
> 2. Second walsender 'b' starts and attempts at acquiring the slot
> finds that the slot is active for pid a.
> 3. Now both walsenders are active, the query does not return.
> 4. First walsender 'a' times out and exits.
> 5. Now only the second walsender is active and the query returns OK
> because pid != a.
> 6. Second walsender exits with error.
> 7. Another query attempts to get the pid of the running walsender for
> tap_sub but returns null because both walsender exits.
> 8. This null return value results in the next query erroring out and
> the test failing.

So this is slightly different than what we can see in the topminnow
logs? According to the server logs, step #5 happened (at 18:44:38.016)
before step #4 happened (at 18:44:38.043).

>
> >Can you additionally check the value of 'state' from
> >pg_stat_replication for both the old and new walsender sessions?
>
> Yes, will try this and post a patch tomorrow.

Thanks. I guess the state of the new walsender should be "startup"
whereas the old one should be "streaming".

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: prevent immature WAL streaming

2021-08-25 Thread Kyotaro Horiguchi
At Wed, 25 Aug 2021 18:18:59 +, "Bossart, Nathan"  
wrote in 
> On 8/25/21, 5:33 AM, "alvhe...@alvh.no-ip.org"  
> wrote:
> > On 2021-Aug-24, Bossart, Nathan wrote:
> >> Another interesting thing I see is that the boundary stored in
> >> earliestSegBoundary is not necessarily the earliest one.  It's just
> >> the first one that has been registered.  I did this for simplicity for
> >> the .ready file fix, but I can see it causing problems here.
> >
> > Hmm, is there really a problem here?  Surely the flush point cannot go
> > past whatever has been written.  If somebody is writing an earlier
> > section of WAL, then we cannot move the flush pointer to a later
> > position.  So it doesn't matter if the earliest point we have registered
> > is the true earliest -- we only care for it to be the earliest that is
> > past the flush point.
> 
> Let's say we have the following situation (F = flush, E = earliest
> registered boundary, and L = latest registered boundary), and let's
> assume that each segment has a cross-segment record that ends in the
> next segment.
> 
> F E L
> |-|-|-|-|-|-|-|-|
>1 2 3 4 5 6 7 8
> 
> Then, we write out WAL to disk and create .ready files as needed.  If
> we didn't flush beyond the latest registered boundary, the latest
> registered boundary now becomes the earliest boundary.
> 
>   F E
> |-|-|-|-|-|-|-|-|
>1 2 3 4 5 6 7 8
> 
> At this point, the earliest segment boundary past the flush point is
> before the "earliest" boundary we are tracking.

We know we have some cross-segment records in the region [E L], so we
cannot add a .ready file while the flush pointer is in that region. I
haven't looked at the latest patch (or I may be misunderstanding the
discussion here), but I think we shouldn't move E before F exceeds the
previous L (the L in the first picture above).  Things are done that
way in my ancient proposal in [1].

https://www.postgresql.org/message-id/attachment/117052/v4-0001-Avoid-archiving-a-WAL-segment-that-continues-to-t.patch
+if (LogwrtResult.Write < firstSegContRecStart ||
+lastSegContRecEnd <= LogwrtResult.Write)
+{


By doing so, at the time of the second picture, the pointers are set as:

   E   F L
 |-|-|-|-|-|-|-|-|
1 2 3 4 5 6 7 8

Then the pointers are cleared once F reaches L:

 F
 |-|-|-|-|-|-|-|-|
1 2 3 4 5 6 7 8

Doesn't this work correctly?  As I think I mentioned in the thread, I
don't think we would have many (more than several, specifically)
segments in a region [E L].

[1] 
https://www.postgresql.org/message-id/20201216.110120.887433782054853494.horikyota.ntt%40gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: log_autovacuum in Postgres 14 -- ordering issue

2021-08-25 Thread Alvaro Herrera
On 2021-Aug-25, Peter Geoghegan wrote:

> On Wed, Aug 25, 2021 at 2:06 PM Alvaro Herrera  
> wrote:

> > I like it better than the current layout, so +1.
> 
>  This seems like a release housekeeping task to me. I'll come up with
> a patch targeting 14 and master in a few days.

Agreed, thanks.

> The question of whether or not we do an index scan (i.e. index
> vacuuming) depends entirely on the number of LP_DEAD items that heap
> pruning left behind in the table structure. [...]

Ooh, this was illuminating -- thanks for explaining.  TBH I would have
been very confused if asked to explain what that log line meant; and now
that I know what it means, I am even more convinced that we need to work
harder at it :-)

I'll see if I can come up with something ...

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"The problem with the future is that it keeps turning into the present"
(Hobbes)




Re: prevent immature WAL streaming

2021-08-25 Thread alvhe...@alvh.no-ip.org
BTW while going about testing this, I noticed that we forbid
pg_walfile_name() while in recovery.  That restriction was added by
commit 370f770c15a4 because ThisTimeLineID was not set correctly during
recovery.  That was supposed to be fixed by commit 1148e22a82ed, so I
thought that it should be possible to remove the restriction.  However,
I did that per the attached patch, but was quickly disappointed because
ThisTimeLineID seems to remain zero in a standby for reasons that I
didn't investigate.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
   (Paul Graham)
From 1b57e504b9a7eeddf2d99e615f21c5aec042a908 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 25 Aug 2021 19:55:43 -0400
Subject: [PATCH] Don't forbid pg_walfile_name in recovery mode

---
 src/backend/access/transam/xlogfuncs.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b98deb72ec..99b22d0b30 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -503,13 +503,6 @@ pg_walfile_name(PG_FUNCTION_ARGS)
 	XLogRecPtr	locationpoint = PG_GETARG_LSN(0);
 	char		xlogfilename[MAXFNAMELEN];
 
-	if (RecoveryInProgress())
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("recovery is in progress"),
- errhint("%s cannot be executed during recovery.",
-		 "pg_walfile_name()")));
-
 	XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
 	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size);
 
-- 
2.30.2



Re: log_autovacuum in Postgres 14 -- ordering issue

2021-08-25 Thread Peter Geoghegan
On Wed, Aug 25, 2021 at 2:06 PM Alvaro Herrera  wrote:
> You mean:
>
> LOG:  automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1
> pages: 0 removed, 8810377 remain, 0 skipped due to pins, 3044924 frozen
> tuples: 16819838 removed, 576364686 remain, 2207444 are dead but not yet removable, oldest xmin: 88197949
> index scan needed: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed
> index "bmsql_order_line_pkey": pages: 2380261 in total, 0 newly deleted, 0 currently deleted, 0 reusable
> I/O timings: read: 65813.666 ms, write: 11310.689 ms
> avg read rate: 65.621 MB/s, avg write rate: 48.403 MB/s
> buffer usage: 174505 hits, 7630386 misses, 5628271 dirtied
> WAL usage: 7387358 records, 4051205 full page images, 28472185998 bytes
> system usage: CPU: user: 72.55 s, system: 52.07 s, elapsed: 908.42 s

Yes, exactly.

> I like it better than the current layout, so +1.

 This seems like a release housekeeping task to me. I'll come up with
a patch targeting 14 and master in a few days.

> I think the "index scan needed" line (introduced very late in the 14
> cycle, commit 5100010ee4d5 dated April 7 2021) is a bit odd.

But that's largely a reflection of what's going on here.

> It is
> telling us stuff about the table -- how many pages had TIDs removed, am
> I reading that right? -- and it is also telling us whether indexes were
> scanned. But the fact that it starts with "index scan needed" suggests
> that it's talking about indexes.

The question of whether or not we do an index scan (i.e. index
vacuuming) depends entirely on the number of LP_DEAD items that heap
pruning left behind in the table structure. Actually, sometimes it's
~100% opportunistic pruning that happens to run outside of VACUUM (in
which case VACUUM merely notices and collects TIDs to delete from
indexes) -- it depends entirely on the workload. This isn't a new
thing added in commit 5100010ee4d5, really. That commit merely made
the index-bypass behavior not only occur when we had precisely 0 items
to delete from indexes -- now it can be skipped when the percentage of
heap pages with one or more LP_DEAD items is < 2%. So yes: this "pages
from table" output *is* primarily concerned with what happened with
indexes, even though the main piece of information says something
about the heap/table structure.

Note that in general a table could easily have many many more "tuples:
N removed" than "N dead item identifiers removed" in its
log_autovacuum output -- this is very common (any table that mostly or
only gets HOT updates and no deletes will look like that). The
opposite situation is also possible, and almost as common with tables
that only get non-HOT updates. The BenchmarkSQL TPC-C implementation
has tables in both categories -- it does tend to be a stable thing for
a table, in general.

Here is the second largest BenchmarkSQL table (this is just a random
VACUUM operation from logs used by a recent benchmark of mine):

automatic aggressive vacuum of table "regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 943785 remain, 6 skipped due to pins, 205851 skipped frozen
tuples: 63649 removed, 105630136 remain, 2785 are dead but not yet removable, oldest xmin: 186094041
buffer usage: 2660543 hits, 1766591 misses, 1375104 dirtied
index scan needed: 219092 pages from table (23.21% of total) had 14946563 dead item identifiers removed
index "bmsql_oorder_pkey": pages: 615866 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "bmsql_oorder_idx1": pages: 797957 in total, 131608 newly deleted, 131608 currently deleted, 131608 reusable
avg read rate: 33.933 MB/s, avg write rate: 26.413 MB/s
I/O timings: read: 105551.978 ms, write: 16538.690 ms
system usage: CPU: user: 79.71 s, system: 49.74 s, elapsed: 406.73 s
WAL usage: 1934993 records, 1044051 full page images, 7076820876 bytes

On Postgres 13 you'd only see "tuples: 63649 removed" here. You'd
never see anything like "14946563 dead item identifiers removed", even
though that's obviously hugely important (more important than "tuples
removed", even). A user could be forgiven for thinking that HOT must
hurt performance! So yes, I agree. This *is* a bit odd.

(Another problem here is that "205851 skipped frozen" only counts
those heap pages that were specifically skipped frozen, even for a
non-aggressive VACUUM.)

> I think we should reword this line.  I
> don't have any great ideas; what do you think of this?
>
> dead items: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed; index scan {needed, not needed, bypassed, bypassed by failsafe}
>
> I have to say that I am a bit bothered about the coding pattern used to
> build this sentence from two parts.  I'm not sure it'll work okay in
> languages that build sentences in different ways.  Maybe we should split
> this in two lines, one to give the numbers and the other to talk about
> the decision taken about indexes.

I'm 

Re: prevent immature WAL streaming

2021-08-25 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-25, Jakub Wartak wrote:

> In order to get a reliable reproducer and proper fault injection
> instead of playing with really filling up the fs, apparently one could
> substitute the fd with the fd of /dev/full using e.g. dup2() so that
> every write is going to throw this error too:

Oh, this is a neat trick that I didn't know about.  Thanks.

> After restarting master and inspecting standby - in all of those above
> 3 cases - the standby didn't inhibit the "invalid contrecord length"
> at least here, while without corruption this v02 patch it is
> notorious. So if it passes the worst-case code review assumptions I
> would be wondering if it shouldn't even be committed as it stands
> right now.

Well, Nathan is right that this patch isn't really closing the hole
because we aren't tracking segment boundaries that aren't "earliest" nor
"latest" at the moment of registration.  The simplistic approach seems
okay for the archive case, but not for the replication case.

I also noticed today (facepalm) that the patch doesn't work unless you
have archiving enabled, because the registration code is only invoked
when XLogArchivingActive().  Doh.  This part is easy to solve.  The
other doesn't look easy ...

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: [PoC] Federated Authn/z with OAUTHBEARER

2021-08-25 Thread Zhihong Yu
On Wed, Aug 25, 2021 at 3:25 PM Zhihong Yu  wrote:

>
>
> On Wed, Aug 25, 2021 at 11:42 AM Jacob Champion 
> wrote:
>
>> On Tue, 2021-06-22 at 23:22 +, Jacob Champion wrote:
>> > On Fri, 2021-06-18 at 11:31 +0300, Heikki Linnakangas wrote:
>> > >
>> > > A few small things caught my eye in the backend oauth_exchange
>> function:
>> > >
>> > > > +   /* Handle the client's initial message. */
>> > > > +   p = strdup(input);
>> > >
>> > > this strdup() should be pstrdup().
>> >
>> > Thanks, I'll fix that in the next re-roll.
>> >
>> > > In the same function, there are a bunch of reports like this:
>> > >
>> > > >ereport(ERROR,
>> > > > +  (errcode(ERRCODE_PROTOCOL_VIOLATION),
>> > > > +   errmsg("malformed OAUTHBEARER message"),
>> > > > +   errdetail("Comma expected, but found
>> character \"%s\".",
>> > > > + sanitize_char(*p;
>> > >
>> > > I don't think the double quotes are needed here, because
>> sanitize_char
>> > > will return quotes if it's a single character. So it would end up
>> > > looking like this: ... found character "'x'".
>> >
>> > I'll fix this too. Thanks!
>>
>> v2, attached, incorporates Heikki's suggested fixes and also rebases on
>> top of latest HEAD, which had the SASL refactoring changes committed
>> last month.
>>
>> The biggest change from the last patchset is 0001, an attempt at
>> enabling jsonapi in the frontend without the use of palloc(), based on
>> suggestions by Michael and Tom from last commitfest. I've also made
>> some improvements to the pytest suite. No major changes to the OAuth
>> implementation yet.
>>
>> --Jacob
>>
> Hi,
> For v2-0001-common-jsonapi-support-FRONTEND-clients.patch :
>
> +   /* Clean up. */
> +   termJsonLexContext(&lex);
>
> At the end of termJsonLexContext(), empty is copied to lex. For stack
> based JsonLexContext, the copy seems unnecessary.
> Maybe introduce a boolean parameter for termJsonLexContext() to signal
> that the copy can be omitted ?
>
> +#ifdef FRONTEND
> +   /* make sure initialization succeeded */
> +   if (lex->strval == NULL)
> +   return JSON_OUT_OF_MEMORY;
>
> Should PQExpBufferBroken(lex->strval) be used for the check ?
>
> Thanks
>
Hi,
For v2-0002-libpq-add-OAUTHBEARER-SASL-mechanism.patch :

+   i_init_session(&session);
+
+   if (!conn->oauth_client_id)
+   {
+   /* We can't talk to a server without a client identifier. */
+   appendPQExpBufferStr(&conn->errorMessage,
+libpq_gettext("no oauth_client_id is set for
the connection"));
+   goto cleanup;

Can conn->oauth_client_id check be performed ahead of i_init_session() ?
That way, ```goto cleanup``` can be replaced with return.
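
Something along these lines, assuming nothing needing cleanup has been
set up at that point (a sketch only; the surrounding function's actual
failure return value may differ):

if (!conn->oauth_client_id)
{
	/* We can't talk to a server without a client identifier. */
	appendPQExpBufferStr(&conn->errorMessage,
						 libpq_gettext("no oauth_client_id is set for the connection"));
	return NULL;			/* nothing allocated yet, no cleanup needed */
}

i_init_session(&session);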

+   if (!error_code || (strcmp(error_code, "authorization_pending")
+   && strcmp(error_code, "slow_down")))

What if, in the future, there is an error code different from the above
two that doesn't represent the "OAuth token retrieval failed" scenario?

For client_initial_response(),

+   token_buf = createPQExpBuffer();
+   if (!token_buf)
+   goto cleanup;

If token_buf is NULL, there doesn't seem to be anything to free. We can
return directly.

Cheers


Re: [PoC] Federated Authn/z with OAUTHBEARER

2021-08-25 Thread Zhihong Yu
On Wed, Aug 25, 2021 at 11:42 AM Jacob Champion 
wrote:

> On Tue, 2021-06-22 at 23:22 +, Jacob Champion wrote:
> > On Fri, 2021-06-18 at 11:31 +0300, Heikki Linnakangas wrote:
> > >
> > > A few small things caught my eye in the backend oauth_exchange
> function:
> > >
> > > > +   /* Handle the client's initial message. */
> > > > +   p = strdup(input);
> > >
> > > this strdup() should be pstrdup().
> >
> > Thanks, I'll fix that in the next re-roll.
> >
> > > In the same function, there are a bunch of reports like this:
> > >
> > > >ereport(ERROR,
> > > > +  (errcode(ERRCODE_PROTOCOL_VIOLATION),
> > > > +   errmsg("malformed OAUTHBEARER message"),
> > > > +   errdetail("Comma expected, but found
> character \"%s\".",
> > > > + sanitize_char(*p;
> > >
> > > I don't think the double quotes are needed here, because sanitize_char
> > > will return quotes if it's a single character. So it would end up
> > > looking like this: ... found character "'x'".
> >
> > I'll fix this too. Thanks!
>
> v2, attached, incorporates Heikki's suggested fixes and also rebases on
> top of latest HEAD, which had the SASL refactoring changes committed
> last month.
>
> The biggest change from the last patchset is 0001, an attempt at
> enabling jsonapi in the frontend without the use of palloc(), based on
> suggestions by Michael and Tom from last commitfest. I've also made
> some improvements to the pytest suite. No major changes to the OAuth
> implementation yet.
>
> --Jacob
>
Hi,
For v2-0001-common-jsonapi-support-FRONTEND-clients.patch :

+   /* Clean up. */
+   termJsonLexContext(&lex);

At the end of termJsonLexContext(), empty is copied to lex. For stack
based JsonLexContext, the copy seems unnecessary.
Maybe introduce a boolean parameter for termJsonLexContext() to signal that
the copy can be omitted ?

+#ifdef FRONTEND
+   /* make sure initialization succeeded */
+   if (lex->strval == NULL)
+   return JSON_OUT_OF_MEMORY;

Should PQExpBufferBroken(lex->strval) be used for the check ?
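
For context, PQExpBufferBroken() in pqexpbuffer.h treats a NULL pointer
or a zero maxlen as an out-of-memory buffer. Assuming lex->strval is a
PQExpBuffer in the frontend case, the check would become:

#ifdef FRONTEND
	/* make sure initialization succeeded */
	if (PQExpBufferBroken(lex->strval))
		return JSON_OUT_OF_MEMORY;
#endif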

Thanks


Re: Multi-Column List Partitioning

2021-08-25 Thread Zhihong Yu
On Wed, Aug 25, 2021 at 5:41 AM Nitin Jadhav 
wrote:

> > The new list bound binary search and related comparison support
> > function look a bit too verbose to me.  I was expecting
> > partition_list_bsearch() to look very much like
> > partition_range_datum_bsearch(), but that is not the case.  The
> > special case code that you wrote in partition_list_bsearch() seems
> > unnecessary, at least in that function.  I'm talking about the code
> > fragment starting with this comment:
> >
> > I will look at other parts of the patch next week hopefully.   For
> > now, attached is a delta patch that applies on top of your v1, which
> > does:
> >
> > * Simplify partition_list_bsearch() and partition_lbound_datum_cmp()
> > * Make qsort_partition_list_value_cmp simply call
> > partition_lbound_datum_cmp() instead of having its own logic to
> > compare input bounds
> > * Move partition_lbound_datum_cmp() into partbounds.c as a static
> > function (export seems unnecessary)
> > * Add a comment for PartitionBoundInfo.isnulls and remove that for
> null_index
>
> Yes. You are right. The extra code added in partition_list_bsearch()
> is not required and thanks for sharing the delta patch. It looks good
> to me and I have incorporated the changes in the attached patch.
>
> > I guess you're perhaps trying to address the case where the caller
> > does not specify the values for all of the partition key columns,
> > which can happen when the partition pruning code needs to handle a set
> > of clauses matching only some of the partition key columns.  But
> > that's a concern of the partition pruning code and so the special case
> > should be handled there (if at all), not in the binary search function
> > that is shared with other callers.  Regarding that, I'm wondering if
> > we should require clauses matching all of the partition key columns to
> > be found for the pruning code to call the binary search, so do
> > something like get_matching_hash_bounds() does:
> >
> > Do you think that trying to match list partitions even with fewer keys
> > is worth the complexity of the implementation?  That is, is the use
> > case to search for only a subset of partition key columns common
> > enough with list partitioning?
> >
> > If we do decide to implement the special case, remember that to do
> > that efficiently, we'd need to require that the subset of matched key
> > columns constitutes a prefix, because of the way the datums are
> > sorted.  That is, match all partitions when the query only contains a
> > clause for b when the partition key is (a, b, c), but engage the
> > special case of pruning if the query contains clauses for a, or for a
> > and b.
>
> Thanks for the suggestion. Below is the implementation details for the
> partition pruning for multi column list partitioning.
>
> In the existing code (For single column list partitioning)
> 1. In gen_partprune_steps_internal(), we try to match the where
> clauses provided by the user with the partition key data using
> match_clause_to_partition_key(). Based on the match, this function can
> return many values like PARTCLAUSE_MATCH_CLAUSE,
> PARTCLAUSE_MATCH_NULLNESS, PARTCLAUSE_NOMATCH, etc.
> 2. In case of PARTCLAUSE_MATCH_CLAUSE, we generate steps using
> gen_prune_steps_from_opexps() (strategy-2), which generates and returns
> a list of PartitionPruneStepOps based on the OpExpr and BooleanTest
> clauses that have been matched to the partition key; it also takes care
> of handling prefixes of the partition keys.
> 3. In case of PARTCLAUSE_MATCH_NULLNESS, we generate steps using
> gen_prune_step_op() (strategy-1), which generates a single
> PartitionPruneStepOp, since list partitioning previously supported a
> single column and there could be only one NULL value. In
> get_matching_list_bounds(), if nullkeys is not empty, we fetch the
> partition index that accepts NULL and return from there.
>
> In case of multi-column list partitioning, we have more than one column
> and hence there is a possibility of more than one NULL value in the
> where clauses. The above-mentioned steps are modified as below.
>
> 1.  Modified the match_clause_to_partition_key() to generate an object
> of PartClauseInfo structure and return PARTCLAUSE_MATCH_CLAUSE even in
> case of clauses related to NULL. The information required to generate
> PartClauseInfo is populated here like the constant expression
> consisting of (Datum) 0, op_strategy, op_is_ne, etc.
> 2. Since I am returning PARTCLAUSE_MATCH_CLAUSE, now we use strategy-2
> (gen_prune_steps_from_opexps) to generate partition pruning steps.
> This function takes care of generating a list of pruning steps if
> there are multiple clauses and also takes care of handling prefixes.
> 3. Modified perform_pruning_base_step() to generate the datum values
> and isnulls data of the where clauses. In case any of the keys
> contains a NULL value, the corresponding datum value is 0.
> 4. Modified get_matching_list_bounds() to generate the min

Re: log_autovacuum in Postgres 14 -- ordering issue

2021-08-25 Thread Stephen Frost
Greetings,

* Peter Geoghegan (p...@bowt.ie) wrote:
> On Wed, Aug 25, 2021 at 11:42 AM Nikolay Samokhvalov
>  wrote:
> > The last two lines are also "*** usage" -- shouldn't the buffer numbers be 
> > next to them?
> 
> I agree that that would be better still -- but all the "usage" stuff
> together in one block.
> 
> And that leads me to another observation: The track_io_timing stuff
> (also new to Postgres 14) might also need to be reordered. And maybe
> even the WAL usage stuff, which was added in Postgres 13.
> 
> That way the overall structure starts with details of the physical
> data structures (the table and its indexes), then goes into buffers
> 
> 1. Heap pages
> 2. Heap tuples
> 3. Index stuff
> 4. I/O timings (only when track_io_timing is on)
> 5. avg read rate (always)
> 6. buffer usage
> 7. WAL usage.
> 8. system usage
> 
> This would mean that I'd be flipping the order of 7 and 8 relative to
> Postgres 13 -- meaning there'd be one difference between Postgres 14
> and some existing stable release. But I think that putting WAL usage
> last of all (after system usage) makes little sense -- commit
> b7ce6de93b shouldn't have done it that way. I always expect to see the
> getrusage() stuff last.

I generally like the idea though I'm not sure about changing things in
v13 as there's likely code out there that's already parsing that data
and it might suddenly break if this was changed.

Given that such code would need to be adjusted for v14 anyway, I don't
really see changing it in v14 as as being an issue (nor do I feel that
it's even a big concern at this point in the release cycle, though
perhaps others feel differently).

Thanks,

Stephen




Re: log_autovacuum in Postgres 14 -- ordering issue

2021-08-25 Thread Alvaro Herrera
On 2021-Aug-25, Peter Geoghegan wrote:

> That way the overall structure starts with details of the physical
> data structures (the table and its indexes), then goes into buffers
> 
> 1. Heap pages
> 2. Heap tuples
> 3. Index stuff
> 4. I/O timings (only when track_io_timing is on)
> 5. avg read rate (always)
> 6. buffer usage
> 7. WAL usage.
> 8. system usage
> 
> This would mean that I'd be flipping the order of 7 and 8 relative to
> Postgres 13 -- meaning there'd be one difference between Postgres 14
> and some existing stable release. But I think that putting WAL usage
> last of all (after system usage) makes little sense -- commit
> b7ce6de93b shouldn't have done it that way. I always expect to see the
> getrusage() stuff last.

You mean:

LOG:  automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1
pages: 0 removed, 8810377 remain, 0 skipped due to pins, 3044924 frozen
tuples: 16819838 removed, 576364686 remain, 2207444 are dead but not yet removable, oldest xmin: 88197949
index scan needed: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed
index "bmsql_order_line_pkey": pages: 2380261 in total, 0 newly deleted, 0 currently deleted, 0 reusable
I/O timings: read: 65813.666 ms, write: 11310.689 ms
avg read rate: 65.621 MB/s, avg write rate: 48.403 MB/s
buffer usage: 174505 hits, 7630386 misses, 5628271 dirtied
WAL usage: 7387358 records, 4051205 full page images, 28472185998 bytes
system usage: CPU: user: 72.55 s, system: 52.07 s, elapsed: 908.42 s

I like it better than the current layout, so +1.


I think the "index scan needed" line (introduced very late in the 14
cycle, commit 5100010ee4d5 dated April 7 2021) is a bit odd.  It is
telling us stuff about the table -- how many pages had TIDs removed, am
I reading that right? -- and it is also telling us whether indexes were
scanned. But the fact that it starts with "index scan needed" suggests
that it's talking about indexes.  I think we should reword this line.  I
don't have any great ideas; what do you think of this?

dead items: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed; index scan {needed, not needed, bypassed, bypassed by failsafe}

I have to say that I am a bit bothered about the coding pattern used to
build this sentence from two parts.  I'm not sure it'll work okay in
languages that build sentences in different ways.  Maybe we should split
this in two lines, one to give the numbers and the other to talk about
the decision taken about indexes.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php




Re: log_autovacuum in Postgres 14 -- ordering issue

2021-08-25 Thread Peter Geoghegan
On Wed, Aug 25, 2021 at 1:33 PM Stephen Frost  wrote:
> I don't have any particular issue with moving them.

What do you think of the plan I just outlined to Nikolay?

-- 
Peter Geoghegan




Re: log_autovacuum in Postgres 14 -- ordering issue

2021-08-25 Thread Peter Geoghegan
On Wed, Aug 25, 2021 at 11:42 AM Nikolay Samokhvalov
 wrote:
> The last two lines are also "*** usage" -- shouldn't the buffer numbers be 
> next to them?

I agree that that would be better still -- but all the "usage" stuff
together in one block.

And that leads me to another observation: The track_io_timing stuff
(also new to Postgres 14) might also need to be reordered. And maybe
even the WAL usage stuff, which was added in Postgres 13.

That way the overall structure starts with details of the physical
data structures (the table and its indexes), then goes into buffers

1. Heap pages
2. Heap tuples
3. Index stuff
4. I/O timings (only when track_io_timing is on)
5. avg read rate (always)
6. buffer usage
7. WAL usage.
8. system usage

This would mean that I'd be flipping the order of 7 and 8 relative to
Postgres 13 -- meaning there'd be one difference between Postgres 14
and some existing stable release. But I think that putting WAL usage
last of all (after system usage) makes little sense -- commit
b7ce6de93b shouldn't have done it that way. I always expect to see the
getrusage() stuff last.

-- 
Peter Geoghegan




Re: log_autovacuum in Postgres 14 -- ordering issue

2021-08-25 Thread Stephen Frost
Greetings,

* Peter Geoghegan (p...@bowt.ie) wrote:
> log_autovacuum output looks like this (as of Postgres 14):
> 
> LOG:  automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1
> pages: 0 removed, 8810377 remain, 0 skipped due to pins, 3044924 frozen
> tuples: 16819838 removed, 576364686 remain, 2207444 are dead but not yet removable, oldest xmin: 88197949
> buffer usage: 174505 hits, 7630386 misses, 5628271 dirtied
> index scan needed: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed
> index "bmsql_order_line_pkey": pages: 2380261 in total, 0 newly deleted, 0 currently deleted, 0 reusable
> avg read rate: 65.621 MB/s, avg write rate: 48.403 MB/s
> I/O timings: read: 65813.666 ms, write: 11310.689 ms
> system usage: CPU: user: 72.55 s, system: 52.07 s, elapsed: 908.42 s
> WAL usage: 7387358 records, 4051205 full page images, 28472185998 bytes
> 
> I think that this output is slightly misleading. I'm concerned about
> the specific order of the lines here: the "buffer usage" line comes
> after the information that applies specifically to the heap structure,
> but before the information about indexes. This is the case despite the
> fact that its output applies to all buffers (not just those for the
> heap structure).
> 
> It would be a lot clearer if the "buffer usage" line was simply moved
> down. I think that it should appear after the lines that are specific
> to the table's indexes -- just before the "avg read rate" line. That
> way we'd group the buffer usage output with all of the other I/O
> related output that summarizes the VACUUM operation as a whole.
> 
> I propose changing the ordering along those lines, and backpatching
> the change to Postgres 14.

I don't have any particular issue with moving them.

Thanks,

Stephen




Re: badly calculated width of emoji in psql

2021-08-25 Thread John Naylor
On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion  wrote:
>
> Does there need to be any sanity check for overlapping ranges between
> the combining and fullwidth sets? The Unicode data on a dev's machine
> would have to be broken somehow for that to happen, but it could
> potentially go undetected for a while if it did.

It turns out I should have done that to begin with. In the Unicode data, it
apparently happens that a character can be both combining and wide, and
that will cause ranges to overlap in my scheme:

302A..302D;W # Mn [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC ENTERING TONE MARK

{0x3000, 0x303E, 2},
{0x302A, 0x302D, 0},

3099..309A;W # Mn [2] COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK..COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

{0x3099, 0x309A, 0},
{0x3099, 0x30FF, 2},

Going by the above, Jacob's patch from July 21 just happened to be correct
by chance since the combining character search happened first.

It seems the logical thing to do is revert my 0001 and 0002 and go back to
something much closer to Jacob's patch, plus a big comment explaining that
the order in which the searches happen matters.
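
Roughly like this, as a sketch of the pre-0002 shape (assuming
mbbisearch() returns nonzero on a hit; "east_asian_fw" is a placeholder
table name):

static int
ucs_wcwidth(pg_wchar ucs)
{
	/* ... NUL and control-character checks elided ... */

	/*
	 * The combining-character search must run before the fullwidth
	 * search: a few codepoints (e.g. U+302A..U+302D) appear in both
	 * tables, and they must come out as zero-width.
	 */
	if (mbbisearch(ucs, combining, lengthof(combining)))
		return 0;

	if (mbbisearch(ucs, east_asian_fw, lengthof(east_asian_fw)))
		return 2;

	return 1;
}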

The EastAsianWidth.txt does have combining property "Mn" in the comment
above, so it's tempting to just read that (plus we could read just one file
for these properties). However, it seems risky to rely on comments, since
their presence and format is probably less stable than the data format.
--
John Naylor
EDB: http://www.enterprisedb.com


Re: Autovacuum on partitioned table (autoanalyze)

2021-08-25 Thread Justin Pryzby
On Fri, Aug 20, 2021 at 07:55:13AM -0500, Justin Pryzby wrote:
> On Tue, Aug 17, 2021 at 06:30:18AM -0500, Justin Pryzby wrote:
> > On Mon, Aug 16, 2021 at 05:28:10PM -0500, Justin Pryzby wrote:
> > > On Mon, Aug 16, 2021 at 05:42:48PM -0400, Álvaro Herrera wrote:
> > > > On 2021-Aug-16, Álvaro Herrera wrote:
> > > > 
> > > > > Here's the reversal patch for the 14 branch.  (It applies cleanly to
> > > > > master, but the unused member of PgStat_StatTabEntry needs to be
> > > > > removed and catversion bumped).
> > > > 
> > > > I have pushed this to both branches.  (I did not remove the item from
> > > > the release notes in the 14 branch.)
> > > 
> > > |I retained the addition of relkind 'p' to tables included by
> > > |pg_stat_user_tables, because reverting that would require a 
> > > catversion
> > > |bump.
> > > 
> > > Right now, on v15dev, it shows 0, which is misleading.
> > > Shouldn't it be null ?
> > > 
> > > analyze_count   | 0
> > > 
> > > Note that having analyze_count and last_analyze would be an
> > > independently
> > > useful change.  Since parent tables aren't analyzed automatically, I have 
> > > a
> > > script to periodically process them if they weren't processed recently.  
> > > Right
> > > now, for partitioned tables, the best I could find is to check its 
> > > partitions:
> > > | MIN(last_analyzed) FROM pg_stat_all_tables psat JOIN pg_inherits i ON 
> > > psat.relid=i.inhrelid
> > > 
> > > In 20200418050815.ge26...@telsasoft.com I wrote:
> > > |This patch includes partitioned tables in pg_stat_*_tables, which is 
> > > great; I
> > > |complained awhile ago that they were missing [0].  It might be useful if 
> > > that
> > > |part was split out into a separate 0001 patch (?).
> > > | [0] 
> > > https://www.postgresql.org/message-id/20180601221428.GU5164%40telsasoft.com
> > 
> > I suggest the attached (which partially reverts the revert), to allow 
> > showing
> > correct data for analyze_count and last_analyzed.
> 
> Álvaro, would you comment on this ?
> 
> To me this could be an open item, but someone else should make that
> determination.

I added an opened item until this is discussed.
|   pg_stats includes partitioned tables, but always shows analyze_count=0
|   Owner: Alvaro Herrera

Possible solutions, in decreasing order of my own preference:

 - partially revert the revert, as proposed, to have "analyze_count" and
   "last_analyzed" work properly for partitioned tables.  This doesn't suffer
   from any of the problems that led to the revert, does it ?

 - Update the .c code to return analyze_count=NULL for partitioned tables.

 - Update the catalog definition to exclude partitioned tables, again.
   Requires a catversion bump.

 - Document that analyze_count=NULL for partitioned tables.  It seems to just
   document a misbehavior.

-- 
Justin




Re: log_autovacuum in Postgres 14 -- ordering issue

2021-08-25 Thread Nikolay Samokhvalov
On Wed, Aug 25, 2021 at 10:34 AM Peter Geoghegan  wrote:

> It would be a lot clearer if the "buffer usage" line was simply moved
> down. I think that it should appear after the lines that are specific
> to the table's indexes -- just before the "avg read rate" line. That
> way we'd group the buffer usage output with all of the other I/O
> related output that summarizes the VACUUM operation as a whole.
>

The last two lines are also "*** usage" -- shouldn't the buffer numbers be
next to them?


Re: archive status ".ready" files may be created too early

2021-08-25 Thread Bossart, Nathan
On 8/25/21, 11:01 AM, "Fujii Masao"  wrote:
> If LogwrtResult.Flush >= EndPos, which means that another process already
> has flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush.
> This situation also means that another process called
> NotifySegmentsReadyForArchive(LogwrtResult.Flush). Right?

If the segment boundary wasn't registered before the other process
called NotifySegmentsReadyForArchive(), then it couldn't have used the
boundary for deciding which .ready files to create.

> If this understanding is right, there seems no need to wake walwriter up here
> so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) again.
> Thought?

We're actually discussing this right now in another thread [0].  I
think we might be able to get rid of that part if we move the boundary
registration to before we release the WAL insert lock(s).

Nathan

[0] https://postgr.es/m/DE60B9AA-9670-47DA-9678-6C79BCD884E3%40amazon.com



Re: prevent immature WAL streaming

2021-08-25 Thread Bossart, Nathan
On 8/25/21, 5:33 AM, "alvhe...@alvh.no-ip.org"  wrote:
> On 2021-Aug-24, Bossart, Nathan wrote:
>
>> If moving RegisterSegmentBoundary() is sufficient to prevent the flush
>> pointer from advancing before we register the boundary, I bet we could
>> also remove the WAL writer nudge.
>
> Can you elaborate on this?  I'm not sure I see the connection.

The reason we are moving RegisterSegmentBoundary() to before
WALInsertLockRelease() is because we believe it will prevent boundary
registration from taking place after the flush pointer has already
advanced past the boundary in question.  We had added the WAL writer
nudge to make sure we called NotifySegmentsReadyForArchive() whenever
that happened.

If moving boundary registration to before we release the lock(s) is
enough to prevent the race condition with the flush pointer, then ISTM
we no longer have to worry about nudging the WAL writer.
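
In pseudo-code, the before/after ordering is (a sketch, not the actual
patch):

/* before: */
WALInsertLockRelease();
/* ... the flush pointer can advance here, past a boundary that nobody
 * has registered yet ... */
RegisterSegmentBoundary(EndSeg, EndPos);

/* after: register while still holding the insert lock(s) */
RegisterSegmentBoundary(EndSeg, EndPos);
WALInsertLockRelease();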

>> Another interesting thing I see is that the boundary stored in
>> earliestSegBoundary is not necessarily the earliest one.  It's just
>> the first one that has been registered.  I did this for simplicity for
>> the .ready file fix, but I can see it causing problems here.
>
> Hmm, is there really a problem here?  Surely the flush point cannot go
> past whatever has been written.  If somebody is writing an earlier
> section of WAL, then we cannot move the flush pointer to a later
> position.  So it doesn't matter if the earliest point we have registered
> is the true earliest -- we only care for it to be the earliest that is
> past the flush point.

Let's say we have the following situation (F = flush, E = earliest
registered boundary, and L = latest registered boundary), and let's
assume that each segment has a cross-segment record that ends in the
next segment.

F E L
|-|-|-|-|-|-|-|-|
   1 2 3 4 5 6 7 8

Then, we write out WAL to disk and create .ready files as needed.  If
we didn't flush beyond the latest registered boundary, the latest
registered boundary now becomes the earliest boundary.

  F E
|-|-|-|-|-|-|-|-|
   1 2 3 4 5 6 7 8

At this point, the earliest segment boundary past the flush point is
before the "earliest" boundary we are tracking.

Nathan



Re: archive status ".ready" files may be created too early

2021-08-25 Thread Fujii Masao




On 2021/08/24 4:55, alvhe...@alvh.no-ip.org wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
>> Ah, okay.  BTW the other changes you mentioned made sense to me.
>
> Thanks.  I've pushed this now to all live branches.


Thanks a lot!

+	/*
+	 * There's a chance that the record was already flushed to disk and we
+	 * missed marking segments as ready for archive.  If this happens, we
+	 * nudge the WALWriter, which will take care of notifying segments as
+	 * needed.
+	 */
+	if (StartSeg != EndSeg && XLogArchivingActive() &&
+		LogwrtResult.Flush >= EndPos && ProcGlobal->walwriterLatch)
+		SetLatch(ProcGlobal->walwriterLatch);

Is this really necessary?

If LogwrtResult.Flush >= EndPos, which means that another process already
has flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush.
This situation also means that another process called
NotifySegmentsReadyForArchive(LogwrtResult.Flush). Right?

If this understanding is right, there seems no need to wake walwriter up here
so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) again.
Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: The Free Space Map: Problems and Opportunities

2021-08-25 Thread Robert Haas
On Mon, Aug 23, 2021 at 5:55 PM Peter Geoghegan  wrote:
> Right now my prototype has a centralized table in shared memory, with
> a hash table. One entry per relation, generally multiple freelists per
> relation. And with per-freelist metadata such as owner and original
> leader backend XID values. Plus of course the lists of free blocks
> themselves. The prototype already clearly fixes the worst problems
> with BenchmarkSQL, but that's only one of my goals. That's just the
> starting point.
>
> I appreciate your input on this. And not just on the implementation
> details -- the actual requirements themselves are still in flux. This
> isn't so much a project to replace the FSM as it is a project that
> adds a new rich abstraction layer that goes between access methods and
> smgr.c -- free space management is only one of the responsibilities.

Makes sense. I think one of the big implementation challenges here is
coping with the scenario where there's not enough shared memory
available ... or else somehow making that impossible without reserving
an unreasonable amount of shared memory. If you allowed space for
every buffer to belong to a different relation and have the maximum
number of leases and whatever, you'd probably have no possibility of
OOM, but you'd probably be pre-reserving too much memory. I also think
there are some implementation challenges around locking. You probably
need some, because the data structure is shared, but because it's
complex, it's not easy to create locking that allows for good
concurrency. Or so I think.
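
To make the sizing point concrete: with the classic shared-memory hash
approach (sketched below with made-up names), the maximum number of
entries is fixed at postmaster start -- a ShmemInitHash() table cannot
grow beyond what was reserved, so you either risk running out of
entries or pre-reserve too much:

#include "postgres.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

/* hypothetical per-relation entry for the free space structure */
typedef struct FsmLeaseEntry
{
	Oid			relid;			/* hash key */
	int			nfreelists;
	/* ... freelists, owner/leader XIDs, etc. would go here ... */
} FsmLeaseEntry;

static HTAB *fsm_lease_table;

void
FsmLeaseShmemInit(void)
{
	HASHCTL		info;

	info.keysize = sizeof(Oid);
	info.entrysize = sizeof(FsmLeaseEntry);

	/* max_size must be chosen now and can never grow afterwards */
	fsm_lease_table = ShmemInitHash("fsm lease table",
									128,	/* init_size */
									1024,	/* max_size */
									&info,
									HASH_ELEM | HASH_BLOBS);
}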

Andres has been working -- I think for years now -- on replacing the
buffer mapping table with a radix tree of some kind. That strikes me
as very similar to what you're doing here. The per-relation data can
then include not only the kind of stuff you're talking about but very
fundamental things like how long it is and where its buffers are in
the buffer pool. Hopefully we don't end up with dueling patches.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Mark all GUC variable as PGDLLIMPORT

2021-08-25 Thread Alvaro Herrera
On 2021-Aug-25, Magnus Hagander wrote:

> The thing we need the PGDLLIMPORT definition for is to *import* them
> on the other end?

Oh ... so modules that are willing to cheat can include their own
declarations of the variables they need, and mark them __declspec
(dllimport)?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The Gord often wonders why people threaten never to come back after they've
been told never to return" (www.actsofgord.com)




log_autovacuum in Postgres 14 -- ordering issue

2021-08-25 Thread Peter Geoghegan
log_autovacuum output looks like this (as of Postgres 14):

LOG:  automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1
pages: 0 removed, 8810377 remain, 0 skipped due to pins, 3044924 frozen
tuples: 16819838 removed, 576364686 remain, 2207444 are dead but not yet removable, oldest xmin: 88197949
buffer usage: 174505 hits, 7630386 misses, 5628271 dirtied
index scan needed: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed
index "bmsql_order_line_pkey": pages: 2380261 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 65.621 MB/s, avg write rate: 48.403 MB/s
I/O timings: read: 65813.666 ms, write: 11310.689 ms
system usage: CPU: user: 72.55 s, system: 52.07 s, elapsed: 908.42 s
WAL usage: 7387358 records, 4051205 full page images, 28472185998 bytes

I think that this output is slightly misleading. I'm concerned about
the specific order of the lines here: the "buffer usage" line comes
after the information that applies specifically to the heap structure,
but before the information about indexes. This is the case despite the
fact that its output applies to all buffers (not just those for the
heap structure).

It would be a lot clearer if the "buffer usage" line was simply moved
down. I think that it should appear after the lines that are specific
to the table's indexes -- just before the "avg read rate" line. That
way we'd group the buffer usage output with all of the other I/O
related output that summarizes the VACUUM operation as a whole.

I propose changing the ordering along those lines, and backpatching
the change to Postgres 14.

-- 
Peter Geoghegan




Re: badly calculated width of emoji in psql

2021-08-25 Thread John Naylor
On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion  wrote:
>
> On Fri, 2021-08-20 at 13:05 -0400, John Naylor wrote:
> > On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion 
wrote:
> > > I guess it just depends on what the end result looks/performs like.
> > > We'd save seven hops or so in the worst case?
> >
> > Something like that. Attached is what I had in mind (using real
> > patches to see what the CF bot thinks):
> >
> > 0001 is a simple renaming
> > 0002 puts the char width inside the mbinterval so we can put arbitrary
values there
>
> 0002 introduces a mixed declarations/statements warning for
> ucs_wcwidth(). Other than that, LGTM overall.

I didn't see that warning with clang 12, either with or without assertions,
but I do see it on gcc 11. Fixed, and pushed 0001 and 0002. I decided
against squashing them, since my original instinct was correct -- the
header changes too much for git to consider it the same file, which may
make archeology harder.

> > --- a/src/common/wchar.c
> > +++ b/src/common/wchar.c
> > @@ -583,9 +583,9 @@ pg_utf_mblen(const unsigned char *s)
> >
> >  struct mbinterval
> >  {
> > -   unsigned short first;
> > -   unsigned short last;
> > -   signed short width;
> > +   unsigned int first;
> > +   unsigned int last:21;
> > +   signed int  width:4;
> >  };
>
> Oh, right -- my patch moved mbinterval from short to int, but should I
> have used uint32 instead? It would only matter in theory for the
> `first` member now that the bitfields are there.

I'm not sure it would matter, but the usual type for codepoints is unsigned.

> > I think the adjustments to 0003 result in a cleaner and more
> > extensible design, but a case could be made otherwise. The former
> > combining table script is a bit more complex than the sum of its
> > former self and Jacob's proposed new script, but just slightly.
>
> The microbenchmark says it's also more performant, so +1 to your
> version.
>
> Does there need to be any sanity check for overlapping ranges between
> the combining and fullwidth sets? The Unicode data on a dev's machine
> would have to be broken somehow for that to happen, but it could
> potentially go undetected for a while if it did.

Thanks for testing again! The sanity check sounds like a good idea, so I'll
work on that and push soon.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] document

2021-08-25 Thread Justin Pryzby
On Wed, Aug 25, 2021 at 09:50:13AM -0400, Tom Lane wrote:
> Fujii Masao  writes:
> > When I applied the patch to the master, I found that the table entries for
> > those function were added into the table for aclitem functions in the docs.
> > I think this is not valid position and needs to be moved to the proper one
> > (maybe the table for system catalog information functions?).
> 
> You have to be very careful these days when applying stale patches to
> func.sgml --- there's enough duplicate boilerplate that "patch" can easily
> be fooled into dumping an addition into the wrong place.  I doubt that
> the submitter meant the doc addition to go there.

I suppose one solution to this is to use git format-patch -U11 or similar, at
least for doc/

Or write the "duplicate boilerplate" across fewer lines.

And another is to add comments before and/or after each.

-- 
Justin




Re: Regression tests for MobilityDB: Continous shutdowns at a random step

2021-08-25 Thread Tom Lane
Esteban Zimanyi  writes:
> However, I continuously receive at a random step in the process the
> following error in the log file

> 2021-08-25 16:48:13.608 CEST [22375] LOG:  received fast shutdown request

This indicates that something sent the postmaster SIGINT.
You need to look around for something in your test environment
that would do that.  Possibly you need to decouple the test
processes from your terminal session?

regards, tom lane




Regression tests for MobilityDB: Continous shutdowns at a random step

2021-08-25 Thread Esteban Zimanyi
Hello

While executing the regression tests for MobilityDB I load a predefined
database on which I run the tests and then compare the results obtained
with those expected. All the tests are driven by the following bash file
https://github.com/MobilityDB/MobilityDB/blob/develop/test/scripts/test.sh

However, at a random step in the process, I repeatedly receive the
following error in the log file:

2021-08-25 16:48:13.608 CEST [22375] LOG:  received fast shutdown request
2021-08-25 16:48:13.622 CEST [22375] LOG:  aborting any active transactions
2021-08-25 16:48:13.622 CEST [22375] LOG:  background worker "logical
replication launcher" (PID 22382) exited with exit code 1
2021-08-25 16:48:13.623 CEST [22377] LOG:  shutting down
2021-08-25 16:48:13.971 CEST [22375] LOG:  database system is shut down

and sometimes I need to relaunch the whole build process in CMake
https://github.com/MobilityDB/MobilityDB/blob/develop/CMakeLists.txt
*numerous* times for the tests to complete

/* While on MobilityDB/build directory */
rm -rf *
cmake ..
make
make test

Any idea where I can begin looking for the problem?

Thanks for your help

Esteban


Re: PostgreSQL <-> Babelfish integration

2021-08-25 Thread Matthias van de Meent
On Mon, 15 Feb 2021 at 17:01, Finnerty, Jim  wrote:
>
> We are applying the Babelfish commits to the REL_12_STABLE branch now, and 
> the plan is to merge them into the REL_13_STABLE and master branch ASAP after 
> that.  There should be a publicly downloadable git repository before very 
> long.

Hi,

Out of curiosity, are you able to share the status on the publication
of this repository?

I mainly ask this because I haven't seen any announcements from Amazon
/ AWS regarding the publication of the Babelfish project since the
start of this thread, and the relevant websites [0][1][2] also do not
appear to have seen an update. The last mention of Babelfish in a
thread here on -hackers also only seems to date back to late March.

Kind regards,

Matthias van de Meent

[0] https://babelfish-for-postgresql.github.io/babelfish-for-postgresql/
[1] https://aws.amazon.com/rds/aurora/babelfish/
[2] https://github.com/babelfish-for-postgresql/




Re: Mark all GUC variable as PGDLLIMPORT

2021-08-25 Thread Magnus Hagander
On Wed, Aug 25, 2021 at 4:41 PM Tom Lane  wrote:
>
> Magnus Hagander  writes:
> > On Wed, Aug 25, 2021 at 4:06 PM Robert Haas  wrote:
> >> It does tend to be controversial, but I think that's basically only
> >> because Tom Lane has reservations about it. I think if Tom dropped his
> >> opposition to this, nobody else would really care. And I think that
> >> would be a good thing for the project.
>
> > But in particular, both on that argument, and on the general
> > maintenance argument, I have a very hard time seeing how exporting the
> > GUC variables would be any worse than exporting the many hundreds of
> > functions we already export.
>
> My beef about it has nothing to do with binary-size concerns, although
> that is an interesting angle.  (I wonder whether marking a variable
> PGDLLIMPORT has any negative effect on the cost of accessing it from
> within the core code?)

It should have no effect on local code.

PGDLLIMPORT turns into "__declspec (dllexport)" when building the
backend, which should have no effect on imports

(it turns into __declspec (dllimport) when building a frontend only,
but that's why we need it in the headers, iirc)
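
(For reference, the definition in src/include/port/win32.h is, from
memory, roughly:)

#ifdef BUILDING_DLL
#define PGDLLIMPORT __declspec (dllexport)
#else
#define PGDLLIMPORT __declspec (dllimport)
#endif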

The only overhead I've seen discussed in the docs around that is the
overhead of exporting by name vs. exporting by ordinal.


>  Rather, I'm unhappy with spreading even more
> Microsoft-droppings all over our source.  If there were some way to
> just do this automatically for all global variables without any source
> changes, I'd be content with that.  That would *really* make the
> platforms more nearly equivalent.

Actually, it's clearly been a while since I dug into this...

AFAICT we *do* actually export all the data symbols as well? At least
in the MSVC build where we use gendef.pl? Specifically, see
a5eed4d770.

The thing we need the PGDLLIMPORT definition for is to *import* them
on the other end?

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Mark all GUC variable as PGDLLIMPORT

2021-08-25 Thread Tom Lane
Magnus Hagander  writes:
> On Wed, Aug 25, 2021 at 4:06 PM Robert Haas  wrote:
>> It does tend to be controversial, but I think that's basically only
>> because Tom Lane has reservations about it. I think if Tom dropped his
>> opposition to this, nobody else would really care. And I think that
>> would be a good thing for the project.

> But in particular, both on that argument, and on the general
> maintenance argument, I have a very hard time seeing how exporting the
> GUC variables would be any worse than exporting the many hundreds of
> functions we already export.

My beef about it has nothing to do with binary-size concerns, although
that is an interesting angle.  (I wonder whether marking a variable
PGDLLIMPORT has any negative effect on the cost of accessing it from
within the core code?)  Rather, I'm unhappy with spreading even more
Microsoft-droppings all over our source.  If there were some way to
just do this automatically for all global variables without any source
changes, I'd be content with that.  That would *really* make the
platforms more nearly equivalent.

regards, tom lane




Re: Mark all GUC variable as PGDLLIMPORT

2021-08-25 Thread Magnus Hagander
On Wed, Aug 25, 2021 at 4:06 PM Robert Haas  wrote:
>
> On Tue, Aug 24, 2021 at 5:06 PM Chapman Flack  wrote:
> > The thing is, I think I have somewhere a list of all the threads on this
> > topic that I've read through since the first time I had to come with my own
> > hat in hand asking for a PGDLLIMPORT on something, years ago now, and
> > I don't think I have ever seen one where it was as uncontroversial
> > as you suggest.
>
> It does tend to be controversial, but I think that's basically only
> because Tom Lane has reservations about it. I think if Tom dropped his
> opposition to this, nobody else would really care. And I think that
> would be a good thing for the project.


I have only one consideration about it, and that's a technical one :)

Does this in some way have an effect on the size of the binary and/or
the time it takes to load it?

I ask, because IIRC back in the prehistoric days, adding all the
*functions* to the list of exports had a very significant impact on
the size of the binary, and some (but not very large) impact on the
loading time. Of course, we had to do that so that even our own
libraries would properly load. And at the time it was decided that we
definitely wanted to export all functions and not just the ones that
we would somehow define as an API.

Now, I'm pretty sure that the GUCs are few enough that this should
have no measurable effect on size/load time, but it's something that
should be verified.

But in particular, both on that argument, and on the general
maintenance argument, I have a very hard time seeing how exporting the
GUC variables would be any worse than exporting the many hundreds of
functions we already export.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Async-unsafe functions in signal handlers

2021-08-25 Thread Denis Smirnov
Hello all,

I am going to refactor Greenplum backtraces for error messages and want to make 
it more compatible with PostgreSQL code. Backtraces in PostgreSQL were 
introduced by 71a8a4f6e36547bb060dbcc961ea9b57420f7190 commit (original 
discussion 
https://www.postgresql.org/message-id/CAMsr+YGL+yfWE=jvbubnpwtrrzney7hj07+zt4byjdvp4sz...@mail.gmail.com
 ) and rely on backtrace() and backtrace_symbols() functions. They are used 
inside errfinish() that is wrapped by ereport() macros. ereport() is invoked 
inside bgworker_die() and FloatExceptionHandler() signal handlers. I am
confused by this fact - both backtrace functions are async-unsafe:
backtrace_symbols() always, and backtrace() on its first call, due to
dlopen. I wonder why PostgreSQL uses async-unsafe functions in signal
handlers?
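
(For what it's worth, one common mitigation for the backtrace() half of
the problem is to force its one-time dlopen to happen at startup,
outside any signal handler. A minimal sketch, not taken from the
PostgreSQL source:

#include <execinfo.h>

static void
warm_up_backtrace(void)
{
    void   *frames[2];

    /*
     * The first backtrace() call may dlopen libgcc.  Doing it here, in
     * a normal execution context, means later calls from a signal
     * handler at least skip that step.
     */
    (void) backtrace(frames, 2);
}

That still leaves backtrace_symbols() unsafe, since it mallocs.)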

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia





Re: Postgres perl module namespace

2021-08-25 Thread Robert Haas
On Wed, Aug 25, 2021 at 1:48 AM Michael Paquier  wrote:
> On Mon, Aug 23, 2021 at 03:39:15PM -0400, Robert Haas wrote:
> > On Mon, Aug 23, 2021 at 3:03 PM Andrew Dunstan  wrote:
> >> OK, I count 3 in favor of changing to PgTest::Cluster, 1 against,
> >> remainder don't care.
> >
> > I'd have gone with something starting with Postgres:: ... but I don't care 
> > much.
>
> PgTest seems like a better choice to me, as "Postgres" could be used
> for other purposes than a testing framework, and the argument that
> multiple paths makes things annoying for hackers is sensible.

I mean, it's a hierarchical namespace. The idea is you do
Postgres::Test or Postgres:: and other people using the
Postgres database can use other parts of it. But again, not really
worth arguing about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Mark all GUC variable as PGDLLIMPORT

2021-08-25 Thread Robert Haas
On Tue, Aug 24, 2021 at 5:06 PM Chapman Flack  wrote:
> The thing is, I think I have somewhere a list of all the threads on this
> topic that I've read through since the first time I had to come with my own
> hat in hand asking for a PGDLLIMPORT on something, years ago now, and
> I don't think I have ever seen one where it was as uncontroversial
> as you suggest.

It does tend to be controversial, but I think that's basically only
because Tom Lane has reservations about it. I think if Tom dropped his
opposition to this, nobody else would really care. And I think that
would be a good thing for the project.

> In each iteration, I think I've also seen a countervailing view expressed
> in favor of looking into whether globals visibility could be further
> /reduced/.

But, like I say, that's only a view that gets advanced as a reason not
to mark things PGDLLIMPORT. Nobody ever wants that thing for its own
sake. I think it's a complete red herring.

If and when somebody wants to make a serious proposal to do something
like that, it can be considered on its own merits. But between now and
then, refusing to make things work on Windows as they do on Linux does
not benefit anyone. A ton of work has gone into making PostgreSQL
portable over the years, and abandoning it in just this one case is
unreasonable.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila  wrote:
>
> On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada  wrote:
> >
> > I did a quick check with the following tap test code:
> >
> > $node_publisher->poll_query_until('postgres',
> >   qq(
> > select 1 != foo.column1 from (values(0), (1)) as foo;
> > ));
> >
> > The query returns {t, f} but poll_query_until() never finished. The
> > same is true when the query returns {f, t}.
> >

Yes, this is true, I also see the same behaviour.

>
> This means something different is going on in Ajin's setup. Ajin, can
> you please share how you confirmed your findings about poll_query?

Relooking at my logs, I think what happens is this:

1. First walsender 'a' is running.
2. Second walsender 'b' starts and, while attempting to acquire the
slot, finds that the slot is active for pid a.
3. Now both walsenders are active, the query does not return.
4. First walsender 'a' times out and exits.
5. Now only the second walsender is active and the query returns OK
because pid != a.
6. Second walsender exits with error.
7. Another query attempts to get the pid of the running walsender for
tap_sub, but returns null because both walsenders have exited.
8. This null return value results in the next query erroring out and
the test failing.

>Can you additionally check the value of 'state' from
>pg_stat_replication for both the old and new walsender sessions?

Yes, will try this and post a patch tomorrow.

regards,
Ajin Cherian
Fujitsu Australia




Re: Remove Value node struct

2021-08-25 Thread Dagfinn Ilmari Mannsåker
Peter Eisentraut  writes:

> While trying to refactor the node support in various ways, the Value
> node is always annoying.
[…]
> This change removes the Value struct and node type and replaces them
> by separate Integer, Float, String, and BitString node types that are
> proper node types and structs of their own and behave mostly like
> normal node types.

This looks like a nice cleanup overall, independent of any future
refactoring.

> Also, this removes the T_Null node tag, which was previously also a
> possible variant of Value but wasn't actually used outside of the
> Value contained in A_Const.  Replace that by an isnull field in
> A_Const.

However, the patch adds:

> +typedef struct Null
> +{
> + NodeTag type;
> + char   *val;
> +} Null;

which doesn't seem to be used anywhere. Is that a leftover from an
intermediate development stage?

- ilmari




Re: prevent immature WAL streaming

2021-08-25 Thread Robert Haas
On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi
 wrote:
> At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera  
> wrote in
> > I'd also like to have tests.  That seems moderately hard, but if we had
> > WAL-molasses that could be used in walreceiver, it could be done. (It
> > sounds easier to write tests with a molasses-archive_command.)
> >
> >
> > [1] https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
> > [2] 
> > https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan@alibaba-inc.com
>
> (I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)

I think, but am not 100% sure, that "molasses" here is being used to
refer to fault injection.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] document

2021-08-25 Thread Tom Lane
Fujii Masao  writes:
> When I applied the patch to the master, I found that the table entries for
> those functions were added into the table for aclitem functions in the docs.
> I think this is not a valid position and they need to be moved to the proper one
> (maybe the table for system catalog information functions?).

You have to be very careful these days when applying stale patches to
func.sgml --- there's enough duplicate boilerplate that "patch" can easily
be fooled into dumping an addition into the wrong place.  I doubt that
the submitter meant the doc addition to go there.

regards, tom lane




Re: Remove Value node struct

2021-08-25 Thread Robert Haas
On Wed, Aug 25, 2021 at 9:20 AM Peter Eisentraut
 wrote:
> This change removes the Value struct and node type and replaces them
> by separate Integer, Float, String, and BitString node types that are
> proper node types and structs of their own and behave mostly like
> normal node types.

+1. I noticed this years ago and never thought of doing anything about
it. I'm glad you did think of it...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-08-25 Thread Robert Haas
On Wed, Aug 25, 2021 at 5:36 AM Greg Nancarrow  wrote:
> I've attached an updated patch, hopefully more along the lines that
> you were thinking of.

LGTM. Committed and back-patched to v10 and up. In theory the same bug
exists in 9.6, but you'd have to have third-party code using the
parallel context infrastructure in order to hit it. If the patch had
back-patched cleanly I would have done so just in case, but
shm_toc_lookup lacks a bool noError option in that version.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Some RELKIND macro refactoring

2021-08-25 Thread Alvaro Herrera
On 2021-Aug-25, Michael Paquier wrote:

> On Tue, Aug 24, 2021 at 12:01:33PM +0200, Peter Eisentraut wrote:
> > While analyzing this again, I think I found an existing mistake.  The
> > handling of RELKIND_PARTITIONED_INDEX in RelationGetNumberOfBlocksInFork()
> > seems to be misplaced.  See attached patch.

Agreed, that's a mistake.

> Right.  This maps with RELKIND_HAS_STORAGE().  Makes me wonder whether
> it would be better to add a check on RELKIND_HAS_STORAGE() in this
> area, even if that's basically the same as the Assert() already used
> in this code path.

Well, the patch replaces the switch on individual relkind values with if
tests on RELKIND_HAS_FOO macros.  I suppose we'd have

Assert(RELKIND_HAS_STORAGE(relkind));

so the function would not even be called for partitioned indexes. (In a
quick scan of 'git grep RelationGetNumberOfBlocks' I see nothing that
would obviously call this function on a partitioned index.)
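
Expressed as code, the shape would be roughly this (just a sketch of
the direction, not the actual patch):

BlockNumber
RelationGetNumberOfBlocksInFork(Relation relation, ForkNumber forkNum)
{
    /* Callers must not pass relations without storage. */
    Assert(RELKIND_HAS_STORAGE(relation->rd_rel->relkind));

    /* ... smgr size lookup as before, no per-relkind switch needed ... */
}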

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"XML!" Exclaimed C++.  "What are you doing here? You're not a programming
language."
"Tell that to the people who use me," said XML.
https://burningbird.net/the-parable-of-the-languages/




Re: [PATCH] document

2021-08-25 Thread Fujii Masao




On 2021/07/14 14:45, Laurenz Albe wrote:
> On Wed, 2021-07-14 at 14:43 +0900, Ian Lawrence Barwick wrote:
>> Hi
>>
>> The description for "pg_database" [1] mentions the function
>> "pg_encoding_to_char()", but this is not described anywhere in the docs. Given
>> that it (and the corresponding "pg_char_to_encoding()") have been around
>> since 7.0 [2], it's probably not a burning issue, but it seems not entirely
>> unreasonable to add short descriptions for both (and link from "pg_conversion"
>> while we're at it); see attached patch. "System Catalog Information Functions"
>> seems the most logical place to put these.
>>
>> [1] https://www.postgresql.org/docs/current/catalog-pg-database.html
>> [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5eb1d0deb15f2b7cd0051bef12f3e091516c723b
>>
>> Will add to the next commitfest.
>
> +1

+1

When I applied the patch to the master, I found that the table entries for
those functions were added into the table for aclitem functions in the docs.
I think this is not a valid position and they need to be moved to the proper one
(maybe the table for system catalog information functions?).

+int
+pg_encoding_to_char ( encoding int )

It's better to s/int/integer because the entries for other functions
in func.sgml use "integer".

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Remove Value node struct

2021-08-25 Thread Peter Eisentraut


While trying to refactor the node support in various ways, the Value 
node is always annoying.


The Value node struct is a weird construct.  It is its own node type,
but most of the time, it actually has a node type of Integer, Float,
String, or BitString.  As a consequence, the struct name and the node
type don't match most of the time, and so it has to be treated
specially a lot.  There doesn't seem to be any value in the special
construct.  There is very little code that wants to accept all Value
variants but nothing else (and even if it did, this doesn't provide
any convenient way to check it), and most code wants either just one
particular node type (usually String), or it accepts a broader set of
node types besides just Value.

This change removes the Value struct and node type and replaces them
by separate Integer, Float, String, and BitString node types that are
proper node types and structs of their own and behave mostly like
normal node types.
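
Roughly, instead of one struct shared by several node tags, there is
now one struct per tag; a sketch of the shape (exact field names may
differ from the patch):

typedef struct Integer
{
    NodeTag     type;
    int         ival;
} Integer;

typedef struct Float
{
    NodeTag     type;
    char       *val;        /* kept as a string to preserve precision */
} Float;

typedef struct String
{
    NodeTag     type;
    char       *val;
} String;

typedef struct BitString
{
    NodeTag     type;
    char       *val;
} BitString;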

Also, this removes the T_Null node tag, which was previously also a
possible variant of Value but wasn't actually used outside of the
Value contained in A_Const.  Replace that by an isnull field in
A_Const.
From 14bfa252aa64e52528ecbd457154cb3c9d32781a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 25 Aug 2021 13:41:09 +0200
Subject: [PATCH 1/2] Remove useless casts

Casting the argument of strVal() to (Value *) is useless, since
strVal() already does that.

Most code didn't do that anyway; this was apparently just a style that
snuck into certain files.
---
 src/backend/catalog/objectaddress.c | 20 ++++++++++----------
 src/backend/commands/alter.c        | 16 ++++++++--------
 src/backend/commands/comment.c      |  2 +-
 src/backend/commands/dropcmds.c     | 16 ++++++++--------
 src/backend/commands/statscmds.c    |  2 +-
 5 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 9882e549c4..7423b99265 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -2386,7 +2386,7 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address,
 		case OBJECT_DATABASE:
 			if (!pg_database_ownercheck(address.objectId, roleid))
 				aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-							   strVal((Value *) object));
+							   strVal(object));
 			break;
 		case OBJECT_TYPE:
 		case OBJECT_DOMAIN:
@@ -2433,7 +2433,7 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address,
 		case OBJECT_SCHEMA:
 			if (!pg_namespace_ownercheck(address.objectId, roleid))
 				aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-							   strVal((Value *) object));
+							   strVal(object));
 			break;
 		case OBJECT_COLLATION:
 			if (!pg_collation_ownercheck(address.objectId, roleid))
@@ -2448,27 +2448,27 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address,
 		case OBJECT_EXTENSION:
 			if (!pg_extension_ownercheck(address.objectId, roleid))
 				aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-							   strVal((Value *) object));
+							   strVal(object));
 			break;
 		case OBJECT_FDW:
 			if (!pg_foreign_data_wrapper_ownercheck(address.objectId, roleid))
 				aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-							   strVal((Value *) object));
+							   strVal(object));
 			break;
 		case OBJECT_FOREIGN_SERVER:
 			if (!pg_foreign_server_ownercheck(address.objectId, roleid))
 				aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-							   strVal((Value *) object));
+							   strVal(object));
 			break;
 		case OBJECT_EVENT_TRIGGER:
 			if (!pg_event_trigger_ownercheck(address.objectId, roleid))
 				aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
-							   strVal((Value *) object));
+							   strVal(object));
 			break;
 		case OBJECT_LANGUAGE:
 			if (!pg

Re: Failure of subscription tests with topminnow

2021-08-25 Thread Amit Kapila
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada  wrote:
>
> I did a quick check with the following tap test code:
>
> $node_publisher->poll_query_until('postgres',
>   qq(
> select 1 != foo.column1 from (values(0), (1)) as foo;
> ));
>
> The query returns {t, f} but poll_query_until() never finished. The
> same is true when the query returns {f, t}.
>

This means something different is going on in Ajin's setup. Ajin, can
you please share how you confirmed your findings about poll_query?


-- 
With Regards,
Amit Kapila.




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Amit Kapila
On Wed, Aug 25, 2021 at 5:54 PM Ajin Cherian  wrote:
>
> On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada  wrote:
>
> >
> > IIUC the query[1] used for polling returns two rows in this case: {t,
> > f} or {f, t}. But did poll_query_until() returned OK in this case even
> > if we expected one row of 't'? My guess of how this issue happened is:
> >
> > > 1. the first polling query after "ALTER SUBSCRIPTION CONNECTION"
> > passed (for some reason).
> > 2. all wal senders exited.
> > 3. get the pid of wal sender with application_name 'tap_sub' but got 
> > nothing.
> > 4. the second polling query resulted in a syntax error since $oldpid is 
> > null.
> >
> > If the fact that two walsenders with the same application_name could
> > be present in pg_stat_replication view was the cause of this issue,
> > poll_query_until() should return OK even if we expected just 't'. I
> > might be missing something, though.
> >
> > [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> > application_name = '$appname';"
>
> Yes, the query [1] returns OK with a {f,t} or {t,f}
>
> [1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = '$appname';"
>

Can you additionally check the value of 'state' from
pg_stat_replication for both the old and new walsender sessions?

-- 
With Regards,
Amit Kapila.




Re: Added schema level support for publication.

2021-08-25 Thread vignesh C
On Sat, Aug 14, 2021 at 3:02 PM Peter Eisentraut <peter.eisentr...@enterprisedb.com> wrote:
>
> On 13.08.21 04:59, Amit Kapila wrote:
> >> Even if we drop all tables added to the publication from it, 'pubkind'
> >> doesn't go back to 'empty'. Is that intentional behavior? If we do
> >> that, we can save the lookup of pg_publication_rel and
> >> pg_publication_schema in some cases, and we can switch the publication
> >> that was created as FOR SCHEMA to FOR TABLE and vice versa.
> >>
> > Do we really want to allow users to change a publication that is FOR
> > SCHEMA to FOR TABLE? I see that we don't allow to do that FOR TABLES.
> > postgres=# Alter Publication pub add table tbl1;
> > ERROR:  publication "pub" is defined as FOR ALL TABLES
> > DETAIL:  Tables cannot be added to or dropped from FOR ALL TABLES publications.
>
> I think the strict separation between publication-for-tables vs.
> publication-for-schemas is a mistake.  Why can't I have a publication
> that publishes tables t1, t2, t3, *and* schemas s1, s2, s3.  Also note
> that we have a pending patch to add sequences support to logical
> replication.  So eventually, a publication will be able to contain a
> bunch of different objects of different kinds.

Thanks for the feedback. Now the same publication can handle schemas as
well as tables, and it can be extended further to other objects. This is
handled in the v21 patch attached at [1]. It is supported by the following
syntax:
CREATE PUBLICATION pub FOR ALL TABLES IN SCHEMA s1,s2,s3, TABLE t1,t2,t3;

[1] -
https://www.postgresql.org/message-id/CALDaNm2iKJvSdCyh0S%2BwYgFjMNB4hu3kYjk%3DYrEkpqTJY9zW%2Bw%40mail.gmail.com

Regards,
Vignesh


Re: Failure of subscription tests with topminnow

2021-08-25 Thread Masahiko Sawada
On Wed, Aug 25, 2021 at 9:23 PM Amit Kapila  wrote:
>
> On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada  wrote:
> >
> > On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian  wrote:
> > >
> > > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian  wrote:
> > > >
> > > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila 
> > > > > >  wrote:
> > > > > >
> > > > > > > But will poll function still poll or exit? Have you tried that?
> > > > > >
> > > > > > I have forced that condition with a changed query and found that the
> > > > > > poll will not exit in case of a NULL return.
> > > > > >
> > > > >
> > > > > What if the query in a poll is fired just before we get an error
> > > > > "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> > > > > Won't at that stage both old and new walsender's are present, so the
> > > > > query might return true. You can check that via debugger by stopping
> > > > > just before this error occurs and then check pg_stat_replication view.
> > > >
> > > > If this error happens then the PID is NOT updated as the pid in the
> > > > Replication slot. I have checked this
> > > > and explained this in my first email itself
> > > >
> > >
> > > Sorry about the above email, I misunderstood. I was looking at
> > > pg_stat_replication_slot rather than pg_stat_replication hence the 
> > > confusion.
> > > Amit is correct, just prior to the walsender erroring out, it briefly
> > > appears in the
> > > pg_stat_replication, and that is why this error happens. Sorry for the
> > > confusion.
> > > I just confirmed it, got both the walsenders stopped in the debugger:
> > >
> > > postgres=# select pid from pg_stat_replication where application_name = 
> > > 'sub';
> > >  pid
> > > --
> > >  7899
> > >  7993
> > > (2 rows)
> >
> > IIUC the query[1] used for polling returns two rows in this case: {t,
> > f} or {f, t}. But did poll_query_until() returned OK in this case even
> > if we expected one row of 't'? My guess of how this issue happened is:
> >
>
> Yeah, we can check this but I guess as soon as it gets 't', the poll
> query will exit.

I did a quick check with the following tap test code:

$node_publisher->poll_query_until('postgres',
  qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));

The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: prevent immature WAL streaming

2021-08-25 Thread alvhe...@alvh.no-ip.org
On 2021-Aug-24, Bossart, Nathan wrote:

> If moving RegisterSegmentBoundary() is sufficient to prevent the flush
> pointer from advancing before we register the boundary, I bet we could
> also remove the WAL writer nudge.

Can you elaborate on this?  I'm not sure I see the connection.

> Another interesting thing I see is that the boundary stored in
> earliestSegBoundary is not necessarily the earliest one.  It's just
> the first one that has been registered.  I did this for simplicity for
> the .ready file fix, but I can see it causing problems here.

Hmm, is there really a problem here?  Surely the flush point cannot go
past whatever has been written.  If somebody is writing an earlier
section of WAL, then we cannot move the flush pointer to a later
position.  So it doesn't matter if the earliest point we have registered
is the true earliest -- we only care for it to be the earliest that is
past the flush point.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada  wrote:

>
> IIUC the query[1] used for polling returns two rows in this case: {t,
> f} or {f, t}. But did poll_query_until() returned OK in this case even
> if we expected one row of 't'? My guess of how this issue happened is:
>
> 1. the first polling query after "ALTER SUBSCRIPTION CONNECTION"
> passed (for some reason).
> 2. all wal senders exited.
> 3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
> 4. the second polling query resulted in a syntax error since $oldpid is null.
>
> If the fact that two walsenders with the same application_name could
> be present in pg_stat_replication view was the cause of this issue,
> poll_query_until() should return OK even if we expected just 't'. I
> might be missing something, though.
>
> [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = '$appname';"

Yes, the query [1] returns OK with a {f,t} or {t,f}

[1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"

regards,
Ajin Cherian
Fujitsu Australia




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Amit Kapila
On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada  wrote:
>
> On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian  wrote:
> >
> > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian  wrote:
> > >
> > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian  wrote:
> > > > >
> > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila 
> > > > >  wrote:
> > > > >
> > > > > > But will poll function still poll or exit? Have you tried that?
> > > > >
> > > > > I have forced that condition with a changed query and found that the
> > > > > poll will not exit in case of a NULL return.
> > > > >
> > > >
> > > > What if the query in a poll is fired just before we get an error
> > > > "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> > > > Won't at that stage both old and new walsender's are present, so the
> > > > query might return true. You can check that via debugger by stopping
> > > > just before this error occurs and then check pg_stat_replication view.
> > >
> > > If this error happens then the PID is NOT updated as the pid in the
> > > Replication slot. I have checked this
> > > and explained this in my first email itself
> > >
> >
> > Sorry about the above email, I misunderstood. I was looking at
> > pg_stat_replication_slot rather than pg_stat_replication hence the 
> > confusion.
> > Amit is correct, just prior to the walsender erroring out, it briefly
> > appears in the
> > pg_stat_replication, and that is why this error happens. Sorry for the
> > confusion.
> > I just confirmed it, got both the walsenders stopped in the debugger:
> >
> > postgres=# select pid from pg_stat_replication where application_name = 
> > 'sub';
> >  pid
> > --
> >  7899
> >  7993
> > (2 rows)
>
> IIUC the query[1] used for polling returns two rows in this case: {t,
> f} or {f, t}. But did poll_query_until() returned OK in this case even
> if we expected one row of 't'? My guess of how this issue happened is:
>

Yeah, we can check this but I guess as soon as it gets 't', the poll
query will exit.

> 1. the first polling query after "ALTER SUBSCRIPTION CONNECTION"
> passed (for some reason).
>

I think the reason for exit is that we get two rows with the same
application_name in pg_stat_replication.

> 2. all wal senders exited.
> 3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
> 4. the second polling query resulted in a syntax error since $oldpid is null.
>

Your understanding of steps is the same as mine.


--
With Regards,
Amit Kapila.




Re: Separate out FileSet from SharedFileSet (was Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o)

2021-08-25 Thread Amit Kapila
On Tue, Aug 24, 2021 at 3:55 PM Dilip Kumar  wrote:
>
> On Tue, Aug 24, 2021 at 12:26 PM Amit Kapila  wrote:
>

The first patch looks good to me. I have made minor changes to the
attached patch. The changes include: fixing compilation warning, made
some comment changes, ran pgindent, and few other cosmetic changes. If
you are fine with the attached, then kindly rebase the second patch
atop it.

-- 
With Regards,
Amit Kapila.


v5-0001-Refactor-sharedfileset.c-to-separate-out-fileset-.patch
Description: Binary data


Re: replay of CREATE TABLESPACE eats data at wal_level=minimal

2021-08-25 Thread Robert Haas
On Wed, Aug 25, 2021 at 1:21 AM Noah Misch  wrote:
> Sounds good.  I think the log message is the optimal place:

Looks awesome.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: prevent immature WAL streaming

2021-08-25 Thread Jakub Wartak
Hi Álvaro, -hackers, 

> I attach the patch with the change you suggested.

I've given the v02 patch a shot on top of REL_12_STABLE (already including
5065aeafb0b7593c04d3bc5bc2a86037f32143fc). Previously (yesterday), without
the v02 patch I was always getting standby corruption via a simulation with
a separate dedicated /pg_xlog filesystem, archive_mode=on,
wal_keep_segments=120, archive_command set to rsync to a different dir on
the same fs, and wal_init_zero at its default (true).

Today (with v02) I got corruption only in the initial 2 runs out of more
than ~30 tries on the standby. Probably the 2 failures were somehow my
fault (?) or some rare condition (and in 1 of those 2 cases simply
restarting the standby did help). To be honest I've tried to force this
error, but with v02 I simply cannot force it anymore, so that's good! :)

> I didn't have a lot of luck with a reliable reproducer script.  I was able to
> reproduce the problem starting with Ryo Matsumura's script and attaching
> a replica; most of the time the replica would recover by restarting from a
> streaming position earlier than where the problem occurred; but a few
> times it would just get stuck with a WAL segment containing a bogus
> record.  

In order to get a reliable reproducer and proper fault injection, instead
of actually filling up the fs, apparently one can substitute the fd with an
fd for /dev/full using e.g. dup2(), so that every write throws this error
too:

root@hive:~# ./t & # simple while(1) { fprintf() flush () } testcase
root@hive:~# ls -l /proc/27296/fd/3
lrwx-- 1 root root 64 Aug 25 06:22 /proc/27296/fd/3 -> /tmp/testwrite
root@hive:~# gdb -q -p 27296
-- 1089 is bitmask O_WRONLY|..
(gdb) p dup2(open("/dev/full", 1089, 0777), 3)
$1 = 3
(gdb) c
Continuing.
==>
fflush/write(): : No space left on device
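
For completeness, the "t" test case above is nothing more than a loop
like this (a sketch; the real test case differs only cosmetically):

#include <errno.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    FILE   *f = fopen("/tmp/testwrite", "w");   /* first free fd, i.e. 3 */

    if (f == NULL)
        return 1;
    while (1)
    {
        /* once dup2() has swapped in /dev/full, fflush() fails with ENOSPC */
        if (fprintf(f, "x\n") < 0 || fflush(f) != 0)
        {
            fprintf(stderr, "fflush/write(): : %s\n", strerror(errno));
            return 1;
        }
    }
}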

So I've also tried to be malicious while writing to the DB and inject ENOSPC
near places like:
 
a) XLogWrite()->XLogFileInit() near line 3322 // assuming: if (wal_init_zero) 
is true, one gets classic "PANIC:  could not write to file 
"pg_wal/xlogtemp.90670": No space left on device"
b) XLogWrite() near line 2547 just after pg_pwrite // one can get "PANIC:  
could not write to log file 0001003B00A8 at offset 0, length 
15466496: No space left on device" (that would be possible with 
wal_init_zero=false?)
c) XLogWrite() near line 2592 // just before issue_xlog_fsync to get "PANIC:  
could not fdatasync file "000100430004": Invalid argument" that 
would pretty much mean same as above but with last possible offset near end of 
WAL? 

This was done with gdb voodoo:
handle SIGUSR1 noprint nostop
break xlog.c: // 
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/transam/xlog.c#L3311
c
print fd or openLogFile -- to verify it is 3
p dup2(open("/dev/full", 1089, 0777), 3) -- during most of walwriter runtime it 
has current log as fd=3

After restarting the master and inspecting the standby - in all of those
above 3 cases - the standby didn't exhibit the "invalid contrecord length"
error, at least here, while without this v02 patch the corruption is
notorious. So if it passes the worst-case code review assumptions, I would
wonder whether it shouldn't be committed as it stands right now.

-J.




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Masahiko Sawada
On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian  wrote:
>
> On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian  wrote:
> >
> > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila  wrote:
> > >
> > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian  wrote:
> > > >
> > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila  
> > > > wrote:
> > > >
> > > > > But will poll function still poll or exit? Have you tried that?
> > > >
> > > > I have forced that condition with a changed query and found that the
> > > > poll will not exit in case of a NULL return.
> > > >
> > >
> > > What if the query in a poll is fired just before we get an error
> > > "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> > > Won't at that stage both old and new walsender's are present, so the
> > > query might return true. You can check that via debugger by stopping
> > > just before this error occurs and then check pg_stat_replication view.
> >
> > If this error happens then the PID is NOT updated as the pid in the
> > Replication slot. I have checked this
> > and explained this in my first email itself
> >
>
> Sorry about the above email, I misunderstood. I was looking at
> pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
> Amit is correct, just prior to the walsender erroring out, it briefly
> appears in the
> pg_stat_replication, and that is why this error happens. Sorry for the
> confusion.
> I just confirmed it, got both the walsenders stopped in the debugger:
>
> postgres=# select pid from pg_stat_replication where application_name = 'sub';
>  pid
> --
>  7899
>  7993
> (2 rows)

IIUC the query[1] used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() returned OK in this case even
if we expected one row of 't'? My guess of how this issue happened is:

1. the first polling query after "ALTER SUBSCRIPTION CONNECTION"
passed (for some reason).
2. all wal senders exited.
3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
4. the second polling query resulted in a syntax error since $oldpid is null.

If the fact that two walsenders with the same application_name could
be present in pg_stat_replication view was the cause of this issue,
poll_query_until() should return OK even if we expected just 't'. I
might be missing something, though.

[1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: .ready and .done files considered harmful

2021-08-25 Thread Dipesh Pandit
> If a .ready file is created out of order, the directory scan logic
> will pick it up about as soon as possible based on its priority.  If
> the archiver is keeping up relatively well, there's a good chance such
> a file will have the highest archival priority and will be picked up
> the next time the archiver looks for a file to archive.  With the
> patch proposed in this thread, an out-of-order .ready file has no such
> guarantee.  As long as the archiver never has to fall back to a
> directory scan, it won't be archived.  The proposed patch handles the
> case where RemoveOldXlogFiles() creates missing .ready files by
> forcing a directory scan, but I'm not sure this is enough.  I think we
> have to check the archiver state each time we create a .ready file to
> see whether we're creating one out-of-order.

We can handle the scenario where a .ready file is created out of order
in XLogArchiveNotify() itself. This way we can avoid making an explicit
call to enable the directory scan from the different code paths that
may create an out-of-order .ready file.

The archiver can store the segment number corresponding to the last
(most recent) .ready file found. When a .ready file is created in
XLogArchiveNotify(), the log segment number of the current .ready file
can be compared with the segment number of the last .ready file found
by the archiver, to detect whether this file is created out of order.
A directory scan can be forced if required.
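
In code form the check could look roughly like the following; the names
here are hypothetical, and the attached patch is the actual
implementation (it protects the shared state with a new LWLock):

/* Hypothetical sketch of the archiver-side comparison described above. */
bool
PgArchIsBrokenReadyFileOrder(XLogSegNo readySegNo)
{
    bool    out_of_order;

    LWLockAcquire(PgArchReadySegLock, LW_SHARED);
    out_of_order = (readySegNo < PgArch->lastReadySegNo);
    LWLockRelease(PgArchReadySegLock);

    return out_of_order;
}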

I have incorporated these changes in patch v11.

> While this may be an extremely rare problem in practice, archiving
> something after the next checkpoint completes seems better than never
> archiving it at all.  IMO this isn't an area where there is much space
> to take risks.

An alternate approach could be to force a directory scan at checkpoint to
break the infinite wait for a .ready file which is being missed due to the
fact that it is created out of order. This will make sure that the file
gets archived within the checkpoint boundaries.

Thoughts?

Please find attached patch v11.

Thanks,
Dipesh
From 9392fd1b82ade933e8127845013bb2940239af68 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Tue, 24 Aug 2021 12:17:34 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL
file that needs to be archived. This directory scan can be minimised
by maintaining the log segment number of the current WAL file being
archived and incrementing it by '1' to get the next WAL file. The
archiver can check the availability of the next file, and if the file
is not available it should fall back to a directory scan to get the
oldest WAL file.

There are some special scenarios, like a timeline switch, a backup, or
a .ready file created out of order, which require the archiver to
perform a full directory scan, as archiving these files takes
precedence over regular WAL files.
---
 src/backend/access/transam/xlogarchive.c |  23 ++++
 src/backend/postmaster/pgarch.c          | 211 ++++++++++++++++++---
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/postmaster/pgarch.h          |   4 +
 4 files changed, 224 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index b9c19b2..cb73895 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -465,12 +465,16 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
  *
  * Optionally, nudge the archiver process so that it'll notice the file we
  * create.
+ *
+ * Also, notifies archiver to enable directory scan to handle a few special
+ * scenarios.
  */
 void
 XLogArchiveNotify(const char *xlog, bool nudge)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	FILE	   *fd;
+	bool		fileOutOfOrder = false;
 
 	/* insert an otherwise empty file called .ready */
 	StatusFilePath(archiveStatusPath, xlog, ".ready");
@@ -492,6 +496,25 @@ XLogArchiveNotify(const char *xlog, bool nudge)
 		return;
 	}
 
+	/* Check if .ready file is created out of order */
+	if (IsXLogFileName(xlog))
+	{
+		XLogSegNo	curSegNo;
+		TimeLineID	tli;
+
+		XLogFromFileName(xlog, &tli, &curSegNo, wal_segment_size);
+
+		fileOutOfOrder = PgArchIsBrokenReadyFileOrder(curSegNo);
+	}
+
+	/*
+	 * History files or a .ready file created out of order requires archiver to
+	 * perform a full directory scan.
+	 */
+	if (IsTLHistoryFileName(xlog) || IsBackupHistoryFileName(xlog) ||
+			fileOutOfOrder)
+		PgArchEnableDirScan();
+
 	/* If caller requested, let archiver know it's got work to do */
 	if (nudge)
 		PgArchWakeup();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..a33648a 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -76,8 +76,31 @@
 typedef struct PgArchData
 {
 	int			pgprocno;		/* pgprocno of archiver process */
+
+	/*
+	 * Flag to enable/disable directory scan. If this flag is set then it
+	 * f

Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian  wrote:
>
> On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila  wrote:
> >
> > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian  wrote:
> > >
> > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila  
> > > wrote:
> > >
> > > > But will poll function still poll or exit? Have you tried that?
> > >
> > > I have forced that condition with a changed query and found that the
> > > poll will not exit in case of a NULL return.
> > >
> >
> > What if the query in a poll is fired just before we get an error
> > "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> > Won't at that stage both old and new walsender's are present, so the
> > query might return true. You can check that via debugger by stopping
> > just before this error occurs and then check pg_stat_replication view.
>
> If this error happens then the PID is NOT updated as the pid in the
> Replication slot. I have checked this
> and explained this in my first email itself
>

Sorry about the above email, I misunderstood. I was looking at
pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
Amit is correct, just prior to the walsender erroring out, it briefly
appears in the
pg_stat_replication, and that is why this error happens. Sorry for the
confusion.
I just confirmed it, got both the walsenders stopped in the debugger:

postgres=# select pid from pg_stat_replication where application_name = 'sub';
 pid
--
 7899
 7993
(2 rows)


ajin  7896  3326  0 05:22 pts/200:00:00 psql -p 6972 postgres
ajin  7897  7882  0 05:22 ?00:00:00 postgres: ajin
postgres [local] idle
ajin  7899  7882  0 05:22 ?00:00:00 postgres: walsender
ajin ::1(37719) START_REPLICATION
ajin  7992  3647  0 05:24 ?00:00:00 postgres: logical
replication worker for subscription 16426
ajin  7993  7882  0 05:24 ?00:00:00 postgres: walsender
ajin ::1(37720) START_REPLICATION

regards,
Ajin Cherian
Fujitsu Australia




Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-08-25 Thread Greg Nancarrow
On Wed, Aug 25, 2021 at 1:37 AM Robert Haas  wrote:
>
> I guess I was thinking more of rejiggering things so that we save the
> results of each RestoreSnapshot() call in a local variable, e.g.
> asnapshot and tsnapshot. And then I think we could just
> RestoreTransactionSnapshot() on whichever one we want, and then
> PushActiveSnapshot(asnapshot) either way. I think it would be worth
> trying to move the PushActiveSnapshot() call out of the if statement
> instead it in two places, written differently but doing the same
> thing.
>

I've attached an updated patch, hopefully more along the lines that
you were thinking of.

Regards,
Greg Nancarrow
Fujitsu Australia


v11-0001-Fix-parallel-worker-failed-assertion-and-coredump.patch
Description: Binary data


Re: row filtering for logical replication

2021-08-25 Thread Amit Kapila
On Wed, Aug 25, 2021 at 10:57 AM Amit Kapila  wrote:
>
> On Wed, Aug 25, 2021 at 5:52 AM Euler Taveira  wrote:
> >
> > On Tue, Aug 24, 2021, at 4:46 AM, Peter Smith wrote:

> >
> > Anyway, I have implemented the suggested cache change because I agree
> > it is probably theoretically superior, even if in practice there is
> > almost no difference.
> >
> > I didn't inspect your patch carefully but it seems you add another List to
> > control this new cache mechanism. I don't like it. IMO if we can use the 
> > data
> > structures that we have now, let's implement your idea; otherwise, -1 for 
> > this
> > new micro optimization.
> >
>
> As mentioned above, without this we will invalidate many cached
> expressions even though it is not required. I don't deny that there
> might be a better way to achieve the same and if you or Peter have any
> ideas, I am all ears.
>

I see that the new list is added to store the row_filter node, which we
later use to compute the expression. This is not required for
invalidation but for delaying the expression evaluation till it is
required (for example, for truncate we may not need the row evaluation,
so there is no need to compute it). Can we try to postpone the syscache
lookup to a later stage, when we are actually doing the row filtering?
If we can do that, then I think we can avoid having this extra list?


-- 
With Regards,
Amit Kapila.




Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead

2021-08-25 Thread Andres Freund
Hi,

On 2021-08-25 12:51:58 +0900, Michael Paquier wrote:
> I was looking at this WIP patch, and plugging in the connection
> statistics to the table-access statistics looks like the wrong
> abstraction to me.  I find much cleaner the approach of HEAD to use a
> separate API to report this information, as of
> pgstat_send_connstats().

As I said before, this ship has long sailed:

typedef struct PgStat_MsgTabstat
{
	PgStat_MsgHdr m_hdr;
	Oid			m_databaseid;
	int			m_nentries;
	int			m_xact_commit;
	int			m_xact_rollback;
	PgStat_Counter m_block_read_time;	/* times in microseconds */
	PgStat_Counter m_block_write_time;
	PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
} PgStat_MsgTabstat;


> As of the two problems discussed on this thread, aka the increased
> number of UDP packages and the extra timestamp computations, it seems
> to me that we had better combine the following ideas for HEAD and 14,
> for now:
> - Avoid the extra timestamp computation as proposed by Laurenz in [1]
> - Throttle the frequency where the connection stat packages are sent,
> as of [2].

I think in that case we'd have to do the bigger redesign and move "live"
connection stats to backend_status.c...

Greetings,

Andres Freund




Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead

2021-08-25 Thread Andres Freund
Hi,

On 2021-08-20 14:27:20 -0500, Justin Pryzby wrote:
> On Tue, Aug 17, 2021 at 02:14:20AM -0700, Andres Freund wrote:
> > Doubling the number of UDP messages in common workloads seems also 
> > problematic
> > enough that it should be addressed for 14. It increases the likelihood of
> > dropping stats messages on busy systems, which can have downstream impacts.
> 
> I think by "common workloads" you mean ones with many short-lived sessions.

You don't need short-lived sessions. You just need sessions that don't
process queries all the time (so that there's only one or a few queries
within each PGSTAT_STAT_INTERVAL). The connection stats aren't sent once
per session, they're sent once per PGSTAT_STAT_INTERVAL.

Greetings,

Andres Freund




Re: Added schema level support for publication.

2021-08-25 Thread Masahiko Sawada
On Mon, Aug 23, 2021 at 11:16 PM vignesh C  wrote:
>
> On Tue, Aug 17, 2021 at 6:55 PM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > On Tue, Aug 17, 2021 at 6:40 AM Peter Smith  wrote:
> > >> On Mon, Aug 16, 2021 at 11:31 PM Tom Lane  wrote:
> > >>> Abstractly it'd be
> > >>>
> > >>> createpub := CREATE PUBLICATION pubname FOR cpitem [, ... ] [ WITH ... ]
> > >>>
> > >>> cpitem := ALL TABLES |
> > >>> TABLE name |
> > >>> ALL TABLES IN SCHEMA name |
> > >>> ALL SEQUENCES |
> > >>> SEQUENCE name |
> > >>> ALL SEQUENCES IN SCHEMA name |
> > >>> name
> > >>>
> > >>> The grammar output would need some post-analysis to attribute the
> > >>> right type to bare "name" items, but that doesn't seem difficult.
> >
> > >> That last bare "name" cpitem. looks like it would permit the following 
> > >> syntax:
> > >> CREATE PUBLICATION pub FOR a,b,c;
> > >> Was that intentional?
> >
> > > I think so.
> >
> > I had supposed that we could throw an error at the post-processing stage,
> > or alternatively default to assuming that such names are tables.
> >
> > Now you could instead make the grammar work like
> >
> > cpitem := ALL TABLES |
> >   TABLE name [, ...] |
> >   ALL TABLES IN SCHEMA name [, ...] |
> >   ALL SEQUENCES |
> >   SEQUENCE name [, ...] |
> >   ALL SEQUENCES IN SCHEMA name [, ...]
> >
> > which would result in a two-level-list data structure.  I'm not sure
> > that this is better, as any sort of mistake would result in a very
> > uninformative generic "syntax error" from Bison.  Errors out of a
> > post-processing stage could be more specific than that.
>
> I preferred an implementation along the lines Tom Lane had proposed at [1]. Is
> it ok if the implementation is something like below:
> CreatePublicationStmt:
>         CREATE PUBLICATION name FOR pub_obj_list opt_definition
>             {
>                 CreatePublicationStmt *n = makeNode(CreatePublicationStmt);
>                 n->pubname = $3;
>                 n->options = $6;
>                 n->pubobjects = (List *)$5;
>                 $$ = (Node *)n;
>             }
>         ;
> pub_obj_list: PublicationObjSpec
>             { $$ = list_make1($1); }
>         | pub_obj_list ',' PublicationObjSpec
>             { $$ = lappend($1, $3); }
>         ;
> /* FOR TABLE and FOR ALL TABLES IN SCHEMA specifications */
> PublicationObjSpec: TABLE pubobj_expr
>             { }
>         | ALL TABLES IN_P SCHEMA pubobj_expr
>             { }
>         | pubobj_expr
>             { }
>         ;
> pubobj_expr:
>         any_name
>             { }
>         | any_name '*'
>             { }
>         | ONLY any_name
>             { }
>         | ONLY '(' any_name ')'
>             { }
>         | CURRENT_SCHEMA
>             { }
>         ;

"FOR ALL TABLES” (that includes all tables in the database) is missing
in this syntax?

>
> I needed pubobj_expr to support the existing syntaxes supported by 
> relation_expr and also to handle CURRENT_SCHEMA support in case of the "FOR 
> ALL TABLES IN SCHEMA" feature. I changed the name to any_name to support 
> objects like "sch1.t1".

I think that relation_expr also accepts objects like "sch1.t1", no?

> I felt if a user specified "FOR ALL TABLES", the user should not be allowed 
> to combine it with "FOR TABLE" and "FOR ALL TABLES IN SCHEMA" as "FOR ALL 
> TABLES" anyway will include all the tables.

I think so too.

> Should we support the similar syntax in case of alter publication, like 
> "ALTER PUBLICATION pub1 ADD TABLE t1,t2, ALL TABLES IN SCHEMA sch1, sch2" or 
> shall we keep these separate like "ALTER PUBLICATION pub1 ADD TABLE t1, t2"  
> and "ALTER PUBLICATION pub1 ADD ALL TABLES IN SCHEMA sch1, sch2". I preferred 
> to keep it separate as we have kept ADD/DROP separately which cannot be 
> combined currently.

If we support the former syntax, the latter two syntaxes are also
supported. Why do we want to support only the latter separate two
syntaxes?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Failure of subscription tests with topminnow

2021-08-25 Thread Ajin Cherian
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila  wrote:
>
> On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian  wrote:
> >
> > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila  
> > wrote:
> >
> > > But will poll function still poll or exit? Have you tried that?
> >
> > I have forced that condition with a changed query and found that the
> > poll will not exit in case of a NULL return.
> >
>
> What if the query in a poll is fired just before we get an error
> "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> Won't at that stage both old and new walsender's are present, so the
> query might return true. You can check that via debugger by stopping
> just before this error occurs and then check pg_stat_replication view.

If this error happens then the PID is NOT updated as the pid in the
Replication slot. I have checked this
and explained this in my first email itself

regards,
Ajin Cherian
Fujitsu Australia




Re: Add some tests for pg_stat_statements compatibility verification under contrib

2021-08-25 Thread Michael Paquier
On Mon, Mar 15, 2021 at 03:05:24PM +0800, Erica Zhang wrote:
> This way the same query can be reused for both older versions and current
> version.
> Yep, it's neater to use the query as you suggested. Thanks!
> 
> Also, can you register your patch for the next commitfest at
> https://commitfest.postgresql.org/33/, to make sure it won't be forgotten?

I was just looking at your patch, and I think that you should move all
the past compatibility tests into a separate test file, in a way
consistent with what we do in contrib/pageinspect/ for
oldextversions.sql.  I would suggest using the same file names, while
on it.
--
Michael




Re: logical replication empty transactions

2021-08-25 Thread Peter Smith
I reviewed the v14-0001 patch.

All my previous comments have been addressed.

Apply / build / test was all OK.

--

More review comments:

1. Param names in the function declarations should match the rest of the code.

1a. src/include/replication/logical.h

@@ -26,7 +26,8 @@ typedef LogicalOutputPluginWriterWrite
LogicalOutputPluginWriterPrepareWrite;

 typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct
LogicalDecodingContext *lr,
  XLogRecPtr Ptr,
- TransactionId xid
+ TransactionId xid,
+ bool send_keep_alive

=>
Change "send_keep_alive" --> "send_keepalive"

~~

1b. src/include/replication/output_plugin.h

@@ -243,6 +243,6 @@ typedef struct OutputPluginCallbacks
 /* Functions in replication/logical/logical.c */
 extern void OutputPluginPrepareWrite(struct LogicalDecodingContext
*ctx, bool last_write);
 extern void OutputPluginWrite(struct LogicalDecodingContext *ctx,
bool last_write);
-extern void OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx);
+extern void OutputPluginUpdateProgress(struct LogicalDecodingContext
*ctx, bool send_keep_alive);

=>
Change "send_keep_alive" --> "send_keepalive"

--

2. Comment should be capitalized - src/backend/replication/walsender.c

@@ -170,6 +170,9 @@ static TimestampTz last_reply_timestamp = 0;
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;

+/* force keep alive when skipping transactions in synchronous
replication mode */
+static bool force_keepalive_syncrep = false;

=>
"force" --> "Force"

--

Otherwise, v14-0001 LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia