Re: Skipping logical replication transactions on subscriber side
On Thu, Aug 26, 2021 at 11:39 AM Amit Kapila wrote:
>
> On Thu, Aug 26, 2021 at 9:50 AM Masahiko Sawada wrote:
> >
> > On Thu, Aug 26, 2021 at 12:51 PM Amit Kapila wrote:
> >
> > Yeah, I agree that it's a handy way to detect a missing switch case,
> > but I don't think we necessarily need it in this case. There are many
> > places in the code that do similar things, and since apply_dispatch()
> > is the entry function for handling incoming messages, it is unlikely
> > that a missing switch case would go unnoticed until the patch gets
> > committed. If we don't move it, we would end up either adding code
> > that resets apply_error_callback_arg.command to every message type,
> > adding a flag indicating that the message was handled and checking it
> > later, or having a big if statement checking whether the incoming
> > message type is valid, etc.
> >
>
> I was reviewing and making minor edits to your v11-0001* patch and
> noticed that the below parts of the code could be improved:
>
> 1.
> + if (errarg->rel)
> + appendStringInfo(&buf, _(" for replication target relation \"%s.%s\""),
> + errarg->rel->remoterel.nspname,
> + errarg->rel->remoterel.relname);
> +
> + if (errarg->remote_attnum >= 0)
> + appendStringInfo(&buf, _(" column \"%s\""),
> + errarg->rel->remoterel.attnames[errarg->remote_attnum]);
>
> Isn't it better if the 'remote_attnum' check is inside the
> if (errarg->rel) check? It would be weird to print column information
> without relation information, and in the current code we don't set
> remote_attnum without rel. The other possibility could be to have an
> Assert on rel in the 'remote_attnum' check.
>
> 2.
> + /* Reset relation for error callback */
> + apply_error_callback_arg.rel = NULL;
> +
>   logicalrep_rel_close(rel, NoLock);
>
>   end_replication_step();
>
> Isn't it better to reset the relation info as the last thing in
> apply_handle_insert/update/delete, as you do for a few other
> parameters? There is very little chance of an error from those two
> functions, but still, it will be good if they ever throw an error, and
> it might make it clear for future edits in this function that this
> needs to be set as the last thing in these functions.
>

I see that resetting it before logicalrep_rel_close has the advantage
that we cannot accidentally access, after the close, information that
is no longer there in rel. I am not sure if that is the reason you have
in mind for resetting it before the close.

--
With Regards,
Amit Kapila.
Re: Separate out FileSet from SharedFileSet (was Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o)
On Wed, Aug 25, 2021 at 5:49 PM Amit Kapila wrote:
>
> On Tue, Aug 24, 2021 at 3:55 PM Dilip Kumar wrote:
> >
> > On Tue, Aug 24, 2021 at 12:26 PM Amit Kapila wrote:
> >
>
> The first patch looks good to me. I have made minor changes to the
> attached patch. The changes include: fixing a compilation warning, some
> comment changes, running pgindent, and a few other cosmetic changes. If
> you are fine with the attached, then kindly rebase the second patch
> atop it.
>

The patch looks good to me. I have rebased 0002 atop this patch and
also done some cosmetic fixes in 0002.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

From 3558cdc8a4f77e3c9092bbaeded451eaaa1feca4 Mon Sep 17 00:00:00 2001
From: Amit Kapila
Date: Wed, 25 Aug 2021 15:00:19 +0530
Subject: [PATCH v5 1/2] Refactor sharedfileset.c to separate out fileset
 implementation.

Move fileset related implementation out of sharedfileset.c to allow its
usage by backends that don't want to share filesets among different
processes. After this split, fileset infrastructure is used by both
sharedfileset.c and worker.c for the named temporary files that survive
across transactions.

Author: Dilip Kumar, based on suggestion by Andres Freund
Reviewed-by: Hou Zhijie, Masahiko Sawada, Amit Kapila
Discussion: https://postgr.es/m/e1mcc6u-0004ik...@gemulon.postgresql.org
---
 src/backend/replication/logical/launcher.c |   3 +
 src/backend/replication/logical/worker.c   |  82 ++
 src/backend/storage/file/Makefile          |   1 +
 src/backend/storage/file/buffile.c         |  84 +-
 src/backend/storage/file/fd.c              |   2 +-
 src/backend/storage/file/fileset.c         | 205
 src/backend/storage/file/sharedfileset.c   | 244 +
 src/backend/utils/sort/logtape.c           |   8 +-
 src/backend/utils/sort/sharedtuplestore.c  |   5 +-
 src/include/replication/worker_internal.h  |   1 +
 src/include/storage/buffile.h              |  14 +-
 src/include/storage/fileset.h              |  40 +
 src/include/storage/sharedfileset.h        |  14 +-
 src/tools/pgindent/typedefs.list           |   1 +
 14 files changed, 368 insertions(+), 336 deletions(-)
 create mode 100644 src/backend/storage/file/fileset.c
 create mode 100644 src/include/storage/fileset.h

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index e3b11da..8b1772d 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -648,6 +648,9 @@ logicalrep_worker_onexit(int code, Datum arg)
 
 	logicalrep_worker_detach();
 
+	/* Cleanup filesets used for streaming transactions. */
+	logicalrep_worker_cleanupfileset();
+
 	ApplyLauncherWakeup();
 }
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 38b493e..6adc6ddd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -39,13 +39,13 @@
  * BufFile infrastructure supports temporary files that exceed the OS file size
  * limit, (b) provides a way for automatic clean up on the error and (c) provides
  * a way to survive these files across local transactions and allow to open and
- * close at stream start and close. We decided to use SharedFileSet
+ * close at stream start and close. We decided to use FileSet
  * infrastructure as without that it deletes the files on the closure of the
  * file and if we decide to keep stream files open across the start/stop stream
  * then it will consume a lot of memory (more than 8K for each BufFile and
  * there could be multiple such BufFiles as the subscriber could receive
  * multiple start/stop streams for different transactions before getting the
- * commit). Moreover, if we don't use SharedFileSet then we also need to invent
+ * commit). Moreover, if we don't use FileSet then we also need to invent
  * a new way to pass filenames to BufFile APIs so that we are allowed to open
  * the file we desired across multiple stream-open calls for the same
  * transaction.
@@ -231,8 +231,8 @@ typedef struct ApplyExecutionData
 typedef struct StreamXidHash
 {
 	TransactionId xid;			/* xid is the hash key and must be first */
-	SharedFileSet *stream_fileset;	/* shared file set for stream data */
-	SharedFileSet *subxact_fileset; /* shared file set for subxact info */
+	FileSet    *stream_fileset; /* file set for stream data */
+	FileSet    *subxact_fileset;	/* file set for subxact info */
 } StreamXidHash;
 
 static MemoryContext ApplyMessageContext = NULL;
@@ -255,8 +255,8 @@ static bool in_streamed_transaction = false;
 static TransactionId stream_xid = InvalidTransactionId;
 
 /*
- * Hash table for storing the streaming xid information along with shared file
- * set for streaming and subxact files.
+ * Hash table for storing the streaming xid information along with filesets
+ * for streaming and subxact files.
  */
 static HTAB *xidhash = N
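To make the intent of the split concrete, here is a minimal sketch of
how a single backend might use the new backend-local API; the function
names come from the fileset.h/buffile.h interfaces introduced by the
patch, but the surrounding code and the file name are illustrative
only, not part of the patch:

	#include "storage/buffile.h"
	#include "storage/fileset.h"

	static FileSet worker_fileset;	/* backend-local; no shared memory needed */

	static void
	fileset_example(void)
	{
		BufFile    *file;

		/* One-time setup; files in the set survive transaction boundaries */
		FileSetInit(&worker_fileset);

		/* Create a named file so the same backend can find it again later */
		file = BufFileCreateFileSet(&worker_fileset, "12345-changes");
		BufFileClose(file);

		/*
		 * ... later, possibly in another transaction, the file can be
		 * re-opened by name with BufFileOpenFileSet() ...
		 */

		/* Drop all files in the set when the worker is done with them */
		FileSetDeleteAll(&worker_fileset);
	}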
Re: Skipping logical replication transactions on subscriber side
On Thu, Aug 26, 2021 at 9:50 AM Masahiko Sawada wrote:
>
> On Thu, Aug 26, 2021 at 12:51 PM Amit Kapila wrote:
>
> Yeah, I agree that it's a handy way to detect a missing switch case,
> but I don't think we necessarily need it in this case. There are many
> places in the code that do similar things, and since apply_dispatch()
> is the entry function for handling incoming messages, it is unlikely
> that a missing switch case would go unnoticed until the patch gets
> committed. If we don't move it, we would end up either adding code
> that resets apply_error_callback_arg.command to every message type,
> adding a flag indicating that the message was handled and checking it
> later, or having a big if statement checking whether the incoming
> message type is valid, etc.
>

I was reviewing and making minor edits to your v11-0001* patch and
noticed that the below parts of the code could be improved:

1.
+ if (errarg->rel)
+ appendStringInfo(&buf, _(" for replication target relation \"%s.%s\""),
+ errarg->rel->remoterel.nspname,
+ errarg->rel->remoterel.relname);
+
+ if (errarg->remote_attnum >= 0)
+ appendStringInfo(&buf, _(" column \"%s\""),
+ errarg->rel->remoterel.attnames[errarg->remote_attnum]);

Isn't it better if the 'remote_attnum' check is inside the
if (errarg->rel) check? It would be weird to print column information
without relation information, and in the current code we don't set
remote_attnum without rel. The other possibility could be to have an
Assert on rel in the 'remote_attnum' check.

2.
+ /* Reset relation for error callback */
+ apply_error_callback_arg.rel = NULL;
+
  logicalrep_rel_close(rel, NoLock);

  end_replication_step();

Isn't it better to reset the relation info as the last thing in
apply_handle_insert/update/delete, as you do for a few other
parameters? There is very little chance of an error from those two
functions, but still, it will be good if they ever throw an error, and
it might make it clear for future edits in this function that this
needs to be set as the last thing in these functions.

Note - I can take care of the above points based on whatever we agree
on; you don't need to send a new version for this.

--
With Regards,
Amit Kapila.
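To make point 1 concrete, the suggested restructuring would look
roughly like this (a sketch based on the quoted hunk, not final code):

	if (errarg->rel)
	{
		appendStringInfo(&buf, _(" for replication target relation \"%s.%s\""),
						 errarg->rel->remoterel.nspname,
						 errarg->rel->remoterel.relname);

		/* remote_attnum is only ever set together with rel */
		if (errarg->remote_attnum >= 0)
			appendStringInfo(&buf, _(" column \"%s\""),
							 errarg->rel->remoterel.attnames[errarg->remote_attnum]);
	}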
RE: Allow escape in application_name (was: [postgres_fdw] add local pid to fallback_application_name)
Dear Fujii-san,

Thank you for replying! I attached a new version.

> Why did you make even %u and %d available in application_name?

Actually, for no particular reason. I added them because they could
easily be added... And I agree with what you say, so I removed them.

> So some people may want to specify their own ID in application_name
> when connecting to the foreign server. For now they can do that by
> setting application_name per foreign server object. But this means that
> every session connecting to the same foreign server basically should
> use the same application_name. If they want to change the setting,
> they need to execute "ALTER SERVER ... OPTIONS ...". Isn't this
> inconvenient?

Yeah, currently such users must execute an ALTER command every time.

> If yes, what about allowing us to specify the foreign server's
> application_name per session? For example, we can add a
> postgres_fdw.application_name GUC, and then if it's set, postgres_fdw
> always uses it instead of the server-level option when connecting to
> the foreign server. Thought?

OK, I added the GUC parameter. My patch defines the new GUC at
_PG_init(). The value is used when backends start to connect to the
remote server. Normal users can modify the value with SET commands, but
that change does not propagate to already-established connections.

I'm not sure about the versioning policy for contrib modules. Do we
have to make a new version? No other objects are changed.

Finally, Fujii-san and I have been discussing locally whether this
replacement should be done in the libpq or the postgres_fdw layer. I
started to think that it should be postgres_fdw, because:

* previously, applications could distinguish themselves by using the
  current application_name,
* libpq is used by almost all applications, so any modification might
  affect someone, and
* libpq should be just a tool, and should not be intelligent.

I will make and attach another version in which the new code moves to
postgres_fdw. So could you vote on which version is better?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:
  v02_0002_add_guc.patch
  v02_0001_allow_escape_in_application_name.patch
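For reference, defining such a GUC from a contrib module's _PG_init()
would look roughly like this; a minimal sketch using the standard
DefineCustomStringVariable() API, not the submitted patch itself:

	#include "postgres.h"
	#include "utils/guc.h"

	static char *pgfdw_application_name = NULL;

	void
	_PG_init(void)
	{
		/* "postgres_fdw.application_name" is the name proposed above */
		DefineCustomStringVariable("postgres_fdw.application_name",
								   "Sets the application_name used on remote servers.",
								   NULL,	/* no long description */
								   &pgfdw_application_name,
								   NULL,	/* boot value: unset */
								   PGC_USERSET,
								   0,		/* no flags */
								   NULL, NULL, NULL);	/* no hooks */
	}

With this shape, a plain "SET postgres_fdw.application_name = ..."
changes the value per session, which is exactly the per-session
behavior discussed above.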
Re: Async-unsafe functions in signal handlers
> On 25 Aug 2021, at 19:22, Denis Smirnov wrote:
>
> I am going to refactor Greenplum backtraces for error messages and want
> to make it more compatible with PostgreSQL code. Backtraces in
> PostgreSQL were introduced by commit
> 71a8a4f6e36547bb060dbcc961ea9b57420f7190 (original discussion:
> https://www.postgresql.org/message-id/CAMsr+YGL+yfWE=jvbubnpwtrrzney7hj07+zt4byjdvp4sz...@mail.gmail.com)
> and rely on the backtrace() and backtrace_symbols() functions. They are
> used inside errfinish(), which is wrapped by the ereport() macro.
> ereport() is invoked inside the bgworker_die() and
> FloatExceptionHandler() signal handlers. I am confused by this fact -
> both backtrace functions are async-unsafe: backtrace_symbols() always,
> and backtrace() only on the first call, due to dlopen. I wonder why
> PostgreSQL uses async-unsafe functions in signal handlers?

In my view, the GUC backtrace_functions is expected to be used for
debugging purposes, not for enabling backtraces of bgworker_die() or
FloatExceptionHandler() on a production server. Is there any way to
call backtrace_symbols() without touching backtrace_functions?

Best regards, Andrey Borodin.
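For what it's worth, the usual workaround for the two hazards mentioned
above looks like this (an illustrative sketch, not PostgreSQL code):
force the one-time dlopen of libgcc at startup so later backtrace()
calls in a handler don't allocate, and use backtrace_symbols_fd() -
which writes straight to a file descriptor without malloc - instead of
the async-unsafe backtrace_symbols():

	#include <execinfo.h>
	#include <unistd.h>

	void
	prime_backtrace(void)
	{
		void	   *dummy[1];

		/* First call may dlopen libgcc; do it before any signal can arrive */
		(void) backtrace(dummy, 1);
	}

	void
	handler_backtrace(void)
	{
		void	   *frames[32];
		int			nframes = backtrace(frames, 32);

		/* Async-signal-safe: writes directly to stderr, no malloc */
		backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
	}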
RE: Skipping logical replication transactions on subscriber side
On Wed, Aug 25, 2021 12:22 PM Masahiko Sawada wrote:
>
> Attached updated version patches. Please review them.

The v11-0001 patch LGTM.

Best regards,
Hou zj
Re: Skipping logical replication transactions on subscriber side
On Thu, Aug 26, 2021 at 1:51 PM Amit Kapila wrote:
>
> Do you have any suggestions on how to achieve that without adding some
> additional variable? I think it is not a very hard requirement, as we
> don't follow the same at other places in the code.
>

Sorry, forget my suggestion; I see it's not easy to achieve it and
still execute the non-error-case code after the switch (you'd have to
use a variable set in the default case, defeating the purpose, or have
the switch in a separate function with a return for each case). So the
0001 patch LGTM.

Regards,
Greg Nancarrow
Fujitsu Australia
Re: row filtering for logical replication
On Thu, Aug 26, 2021 at 9:51 AM Peter Smith wrote:
>
> On Thu, Aug 26, 2021 at 1:20 PM Amit Kapila wrote:
> >
> > On Thu, Aug 26, 2021 at 7:37 AM Peter Smith wrote:
> > >
> > > On Wed, Aug 25, 2021 at 3:28 PM Amit Kapila wrote:
> > >
> > ...
> > > >
> > > > Hmm, I think the gain via caching is not visible because we are
> > > > using simple expressions. It will be visible when we use somewhat
> > > > complex expressions where expression evaluation cost is
> > > > significant. Similarly, the impact of this change will magnify and
> > > > it will also be visible when a publication has many tables. Apart
> > > > from performance, this change is logically correct as well because
> > > > it would be any way better if we don't invalidate the cached
> > > > expressions unless required.
> > >
> > > Please tell me what is your idea of a "complex" row filter
> > > expression. Do you just mean a filter that has multiple AND
> > > conditions in it? I don't really know if a few complex expressions
> > > would amount to any significant evaluation cost, so I would like to
> > > run some timing tests with some real examples to see the results.
> > >
> >
> > I think this means you didn't even understand or are not convinced why
> > the patch has a cache in the first place. As per your theory, even if
> > we didn't have a cache, it won't matter, but that is not true;
> > otherwise, the patch wouldn't have it.
>
> I have never said there should be no caching. On the contrary, my
> performance test results [1] already confirmed that caching ExprState
> is of benefit for the millions of times it may be used in the
> pgoutput_row_filter function. My only doubts are in regard to how much
> observable impact there would be re-evaluating the filter expression
> just a few extra times by the get_rel_sync_entry function.
>

I think it depends, but why in the first place do you want to allow
re-evaluation when there is a way of not doing that?

--
With Regards,
Amit Kapila.
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 12:59 PM Ajin Cherian wrote:
>
> On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila wrote:
> >
> > On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian wrote:
> > >
> > > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila wrote:
> > > >
> > > > You have a point, but if we look at the below logs, it seems the
> > > > second walsender (#step6) exited before the first walsender
> > > > (#step4).
> > > >
> > > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
> > > > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
> > > > session time: 0:00:06.367 user=nm database=postgres host=[local]
> > > >
> > > > Isn't it possible that the pids are cleared in the other order, due
> > > > to which we are seeing this problem?
> > >
> > > If the pid is cleared in the other order, wouldn't the query [1]
> > > return false?
> > >
> > > [1] - "SELECT pid != 16336 FROM pg_stat_replication WHERE
> > > application_name = 'tap_sub';"
> > >
> >
> > I think it should return true because the pid for 16336 is cleared
> > first and the remaining one will be 16475.
>
> Yes, that was what I explained as well. 16336 is PID 'a' (the first
> walsender) in my explanation. The first walsender should be cleared
> first for this theory to work.

I think it's possible that the order of the processes writing their
disconnection logs and the order of clearing the walsenders' pids are
mismatched. We set the walsender's pid to 0 in WalSndKill(), which is
called as an on_shmem_exit callback. Once we set 0,
pg_stat_replication no longer shows the entry. On the other hand,
disconnection logs are written by log_disconnections(), which is called
as an on_proc_exit callback. That is, the following sequence could
happen:

1. The second walsender (pid = 16475) raises an error as the slot is
   already active (held by the first walsender).
2. The first walsender (pid = 16336) clears its pid in shmem.
3. The polling query checks the walsenders' pids, and returns true
   (since there is only the second walsender now).
4. The second walsender clears its pid in shmem.
5. The second walsender writes its disconnection log.
6. The first walsender writes its disconnection log.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
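As a minimal sketch of why the shmem pid can disappear before the
disconnection log line appears (the registration points are paraphrased
from walsender.c and postgres.c; the ordering is proc_exit's normal
callback sequence):

	/* walsender startup: pid cleanup runs during shmem_exit() */
	on_shmem_exit(WalSndKill, 0);

	/* backend startup: log_disconnections runs later, from proc_exit() */
	if (Log_disconnections)
		on_proc_exit(log_disconnections, 0);

	/*
	 * proc_exit() first runs the before_shmem_exit and on_shmem_exit
	 * callbacks (clearing the pid in shared memory, which hides the
	 * walsender from pg_stat_replication) and only then the on_proc_exit
	 * callbacks that write the "disconnection: ..." log line.  Two
	 * exiting walsenders can therefore interleave these steps.
	 */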
Re: row filtering for logical replication
On Thu, Aug 26, 2021 at 1:20 PM Amit Kapila wrote:
>
> On Thu, Aug 26, 2021 at 7:37 AM Peter Smith wrote:
> >
> > On Wed, Aug 25, 2021 at 3:28 PM Amit Kapila wrote:
> >
> ...
> > >
> > > Hmm, I think the gain via caching is not visible because we are using
> > > simple expressions. It will be visible when we use somewhat complex
> > > expressions where expression evaluation cost is significant.
> > > Similarly, the impact of this change will magnify and it will also be
> > > visible when a publication has many tables. Apart from performance,
> > > this change is logically correct as well because it would be any way
> > > better if we don't invalidate the cached expressions unless required.
> >
> > Please tell me what is your idea of a "complex" row filter expression.
> > Do you just mean a filter that has multiple AND conditions in it? I
> > don't really know if a few complex expressions would amount to any
> > significant evaluation cost, so I would like to run some timing tests
> > with some real examples to see the results.
> >
>
> I think this means you didn't even understand or are not convinced why
> the patch has a cache in the first place. As per your theory, even if
> we didn't have a cache, it won't matter, but that is not true;
> otherwise, the patch wouldn't have it.

I have never said there should be no caching. On the contrary, my
performance test results [1] already confirmed that caching ExprState
is of benefit for the millions of times it may be used in the
pgoutput_row_filter function. My only doubts are in regard to how much
observable impact there would be re-evaluating the filter expression
just a few extra times by the get_rel_sync_entry function.

--
[1] https://www.postgresql.org/message-id/CAHut%2BPs5j7mkO0xLmNW%3DkXh0eezGoKyzBCiQc9bfkCiM_MVDrg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Skipping logical replication transactions on subscriber side
On Thu, Aug 26, 2021 at 12:51 PM Amit Kapila wrote:
>
> On Thu, Aug 26, 2021 at 7:15 AM Greg Nancarrow wrote:
> >
> > On Wed, Aug 25, 2021 at 2:22 PM Masahiko Sawada wrote:
> > >
> > > Attached updated version patches. Please review them.
> > >
> >
> > Regarding the v11-0001 patch, it looks OK to me, but I do have one
> > point: in apply_dispatch(), wouldn't it be better to NOT move the
> > error reporting for an invalid message type into the switch as the
> > default case - because then, if you add a new message type, you won't
> > get a compiler warning (when warnings are enabled) for a missing
> > switch case, which is a handy way to alert you that the new message
> > type needs to be added as a case to the switch.
> >
>
> Do you have any suggestions on how to achieve that without adding some
> additional variable? I think it is not a very hard requirement, as we
> don't follow the same at other places in the code.

Yeah, I agree that it's a handy way to detect a missing switch case,
but I don't think we necessarily need it in this case. There are many
places in the code that do similar things, and since apply_dispatch()
is the entry function for handling incoming messages, it is unlikely
that a missing switch case would go unnoticed until the patch gets
committed. If we don't move it, we would end up either adding code that
resets apply_error_callback_arg.command to every message type, adding a
flag indicating that the message was handled and checking it later, or
having a big if statement checking whether the incoming message type is
valid, etc.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
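For concreteness, a rough sketch of the apply_dispatch() shape being
discussed (paraphrased from the v11 patch under review, not the
committed code; apply_error_callback_arg comes from that patch):

	switch (action)
	{
		case LOGICAL_REP_MSG_BEGIN:
			apply_handle_begin(s);
			break;

		case LOGICAL_REP_MSG_COMMIT:
			apply_handle_commit(s);
			break;

		/* ... cases for the other message types ... */

		default:
			ereport(ERROR,
					(errcode(ERRCODE_PROTOCOL_VIOLATION),
					 errmsg("invalid logical replication message type \"%c\"",
							action)));
	}

	/* Reached only when the message was actually handled */
	apply_error_callback_arg.command = 0;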
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila wrote:
>
> On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian wrote:
> >
> > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila wrote:
> > >
> > > You have a point, but if we look at the below logs, it seems the
> > > second walsender (#step6) exited before the first walsender
> > > (#step4).
> > >
> > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
> > > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
> > > session time: 0:00:06.367 user=nm database=postgres host=[local]
> > >
> > > Isn't it possible that the pids are cleared in the other order, due
> > > to which we are seeing this problem?
> >
> > If the pid is cleared in the other order, wouldn't the query [1]
> > return false?
> >
> > [1] - "SELECT pid != 16336 FROM pg_stat_replication WHERE
> > application_name = 'tap_sub';"
> >
>
> I think it should return true because the pid for 16336 is cleared
> first and the remaining one will be 16475.

Yes, that was what I explained as well. 16336 is PID 'a' (the first
walsender) in my explanation. The first walsender should be cleared
first for this theory to work.

regards,
Ajin Cherian
Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian wrote:
>
> On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila wrote:
> >
> > You have a point, but if we look at the below logs, it seems the
> > second walsender (#step6) exited before the first walsender (#step4).
> >
> > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
> > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
> > session time: 0:00:06.367 user=nm database=postgres host=[local]
> >
> > Isn't it possible that the pids are cleared in the other order, due to
> > which we are seeing this problem?
>
> If the pid is cleared in the other order, wouldn't the query [1] return
> false?
>
> [1] - "SELECT pid != 16336 FROM pg_stat_replication WHERE
> application_name = 'tap_sub';"
>

I think it should return true because the pid for 16336 is cleared
first and the remaining one will be 16475.

--
With Regards,
Amit Kapila.
Re: Skipping logical replication transactions on subscriber side
On Thu, Aug 26, 2021 at 7:15 AM Greg Nancarrow wrote:
>
> On Wed, Aug 25, 2021 at 2:22 PM Masahiko Sawada wrote:
> >
> > Attached updated version patches. Please review them.
> >
>
> Regarding the v11-0001 patch, it looks OK to me, but I do have one
> point: in apply_dispatch(), wouldn't it be better to NOT move the error
> reporting for an invalid message type into the switch as the default
> case - because then, if you add a new message type, you won't get a
> compiler warning (when warnings are enabled) for a missing switch case,
> which is a handy way to alert you that the new message type needs to be
> added as a case to the switch.
>

Do you have any suggestions on how to achieve that without adding some
additional variable? I think it is not a very hard requirement, as we
don't follow the same at other places in the code.

--
With Regards,
Amit Kapila.
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila wrote:
>
> You have a point, but if we look at the below logs, it seems the second
> walsender (#step6) exited before the first walsender (#step4).
>
> 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
> session time: 0:00:00.036 user=nm database=postgres host=[local]
> 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
> session time: 0:00:06.367 user=nm database=postgres host=[local]
>
> Isn't it possible that the pids are cleared in the other order, due to
> which we are seeing this problem?

If the pid is cleared in the other order, wouldn't the query [1] return
false?

[1] - "SELECT pid != 16336 FROM pg_stat_replication WHERE
application_name = 'tap_sub';"

regards,
Ajin Cherian
Fujitsu Australia
Re: prevent immature WAL streaming
On 8/25/21, 5:40 PM, "Kyotaro Horiguchi" wrote:
> At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" wrote in
>> Let's say we have the following situation (F = flush, E = earliest
>> registered boundary, and L = latest registered boundary), and let's
>> assume that each segment has a cross-segment record that ends in the
>> next segment.
>>
>> F   E         L
>> |-|-|-|-|-|-|-|-|
>>  1 2 3 4 5 6 7 8
>>
>> Then, we write out WAL to disk and create .ready files as needed. If
>> we didn't flush beyond the latest registered boundary, the latest
>> registered boundary now becomes the earliest boundary.
>>
>> F             E
>> |-|-|-|-|-|-|-|-|
>>  1 2 3 4 5 6 7 8
>>
>> At this point, the earliest segment boundary past the flush point is
>> before the "earliest" boundary we are tracking.
>
> We know we have some cross-segment records in the region [E L], so we
> cannot add a .ready file while the flush point is in that region. I
> haven't looked at the latest patch (or I may misunderstand the
> discussion here), but I think we shouldn't move E before F exceeds the
> previous L (the L in the first picture above). Things are done that way
> in my ancient proposal in [1].

The strategy in place ensures that we track a boundary that doesn't
change until the flush position passes it, as well as the latest
registered boundary. I think it is important that any segment boundary
tracking mechanism does at least those two things. I don't see how we
could do that if we didn't update E until F passed both E and L.

Nathan
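For readers following along, here is a hedged sketch of the
two-boundary bookkeeping being described; the field and function names
follow the thread, but the update logic is my paraphrase (the real
patch keeps this state in shared memory under a lock, and the sentinel
macro below is mine):

	#define BOUNDARY_UNSET	((XLogSegNo) 0xFFFFFFFFFFFFFFFF)

	typedef struct
	{
		XLogSegNo	earliestSegBoundary;	/* first boundary registered since
											 * the flush point last passed one */
		XLogRecPtr	earliestSegBoundaryEndPtr;	/* end of its cross-segment
												 * record */
		XLogSegNo	latestSegBoundary;		/* most recently registered */
		XLogRecPtr	latestSegBoundaryEndPtr;
	} SegBoundaryState;

	static SegBoundaryState bdry = {BOUNDARY_UNSET, 0, BOUNDARY_UNSET, 0};

	static void
	RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr endpos)
	{
		if (bdry.earliestSegBoundary == BOUNDARY_UNSET)
		{
			/* the first registration becomes the tracked "earliest" */
			bdry.earliestSegBoundary = seg;
			bdry.earliestSegBoundaryEndPtr = endpos;
		}
		else
		{
			/* otherwise just remember the latest one */
			bdry.latestSegBoundary = seg;
			bdry.latestSegBoundaryEndPtr = endpos;
		}

		/*
		 * When the flush position later passes "latest", it is copied down
		 * to "earliest" - which is what the two pictures above illustrate.
		 */
	}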
Re: row filtering for logical replication
On Thu, Aug 26, 2021 at 7:37 AM Peter Smith wrote:
>
> On Wed, Aug 25, 2021 at 3:28 PM Amit Kapila wrote:
> >
> ...
> >
> > Hmm, I think the gain via caching is not visible because we are using
> > simple expressions. It will be visible when we use somewhat complex
> > expressions where expression evaluation cost is significant.
> > Similarly, the impact of this change will magnify and it will also be
> > visible when a publication has many tables. Apart from performance,
> > this change is logically correct as well because it would be any way
> > better if we don't invalidate the cached expressions unless required.
>
> Please tell me what is your idea of a "complex" row filter expression.
> Do you just mean a filter that has multiple AND conditions in it? I
> don't really know if a few complex expressions would amount to any
> significant evaluation cost, so I would like to run some timing tests
> with some real examples to see the results.
>
> On Wed, Aug 25, 2021 at 6:31 PM Amit Kapila wrote:
> >
> > On Wed, Aug 25, 2021 at 10:57 AM Amit Kapila wrote:
> > >
> > > On Wed, Aug 25, 2021 at 5:52 AM Euler Taveira wrote:
> > > >
> > > > On Tue, Aug 24, 2021, at 4:46 AM, Peter Smith wrote:
> > > >
> > > > > Anyway, I have implemented the suggested cache change because I
> > > > > agree it is probably theoretically superior, even if in practice
> > > > > there is almost no difference.
> > > >
> > > > I didn't inspect your patch carefully but it seems you add another
> > > > List to control this new cache mechanism. I don't like it. IMO if
> > > > we can use the data structures that we have now, let's implement
> > > > your idea; otherwise, -1 for this new micro optimization.
> > > >
> > >
> > > As mentioned above, without this we will invalidate many cached
> > > expressions even though it is not required. I don't deny that there
> > > might be a better way to achieve the same, and if you or Peter have
> > > any ideas, I am all ears.
> > >
> >
> > I see that the new list is added to store the row_filter node, which
> > we later use to compute the expression. This is not required for
> > invalidation but for delaying the expression evaluation till it is
> > required (for example, for truncate, we may not need the row
> > evaluation, so there is no need to compute it). Can we try to postpone
> > the syscache lookup to a later stage when we are actually doing
> > row_filtering? If we can do that, then I think we can avoid having
> > this extra list?
>
> Yes, you are correct - that Node list was re-instated only because you
> had requested that the ExprState evaluation should be deferred until it
> is needed by pgoutput_row_filter. Otherwise, the additional list would
> not be needed, so everything would be much the same as in v23, except
> the invalidations would be more focussed on single tables.
>
> I don't think the syscache lookup can be easily postponed. That logic
> of get_rel_sync_entry processes the table filters of *all*
> publications, so moving that publications loop (including the
> partition logic) into pgoutput_row_filter seems a bridge too far IMO.
>

Hmm, I don't think that is true. You just need it for the relation to
be processed.

> Furthermore, I am not yet convinced that this ExprState postponement
> is very useful. It may be true that for truncate there is no need to
> compute it, but consider that the user would never even define a row
> filter in the first place unless they intended there to be some CRUD
> operations. So even if the truncate does not need the filter,
> *something* is surely going to need it.
>

Sure, but we don't need to add additional computation until it is
required.

--
With Regards,
Amit Kapila.
Re: Fix around conn_duration in pgbench
On Fri, 20 Aug 2021 02:05:27 +0900
Fujii Masao wrote:
>
> On 2021/08/11 13:56, Fujii Masao wrote:
> > Yes, but I was thinking that's a bug that we should fix.
> > IOW, I was thinking that, in v13, both connection and disconnection
> > delays should be measured whether -C is specified or not, *per spec*.
> > But, in v13, the disconnection delays are not measured in the cases
> > where -C is specified and not specified. So I was thinking that this
> > is a bug and we should fix both cases.
> >
> > But you're thinking that, in v13, the disconnection delays don't need
> > to be measured because they are not measured for now?
>
> Please let me clarify my thought.

Thank you for your clarification.

> In master and v14,
>
> # Expected behavior
> (1) Both connection and disconnection delays should be measured
>     only when -C is specified, but not otherwise.
> (2) When -C is specified, since each transaction establishes and closes
>     a connection, those delays should be measured for each transaction.
>
> # Current behavior
> (1) Connection delay is measured whether -C is specified or not.
> (2) Even when -C is specified, disconnection delay is NOT measured
>     at the end of the transaction.
>
> # What the patch should do
> (1) Make pgbench skip measuring connection and disconnection delays
>     if not necessary (i.e., -C is not specified).
> (2) Make pgbench measure the disconnection delays whenever
>     the connection is closed at the end of a transaction, when -C is
>     specified.

I agree with you. This is what the patch for pg14 does. We don't need
to measure the disconnection delay when -C is not specified because the
output just reports "initial connection time".

> In v13 or before,
>
> # Expected behavior
> (1) Both connection and disconnection delays should be measured
>     whether -C is specified or not, because information about those
>     delays is used for the benchmark result report.
> (2) When -C is specified, since each transaction establishes and closes
>     a connection, those delays should be measured for each transaction.
>
> # Current behavior
> (1)(2) Disconnection delay is NOT measured whether -C is specified or
>     not.
>
> # What the patch should do
> (1)(2) Make pgbench measure the disconnection delays whenever
>     the connection is closed at the end of a transaction (for the -C
>     case) and at the end of the thread (for the non--C case).

Ok. That makes sense. The output reports "including connections
establishing" and "excluding connections establishing" regardless of
-C, so we should measure delays in the same way.

I updated the patch for pg13 to measure the disconnection delay when -C
is not specified. I attached the updated patch for pg13 as well as the
one for pg14, which is the same as attached before.

> Anyway, I changed the status of this patch to "Waiting on Author" in CF.

I returned the status to "Ready for Committer". Could you please review
this?
Regards,
Yugo Nagata

--
Yugo NAGATA

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 338c0152a2..0058f3 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3285,8 +3285,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 			if (is_connect)
 			{
+				instr_time	start = now;
+
+				INSTR_TIME_SET_CURRENT_LAZY(start);
 				finishCon(st);
-				INSTR_TIME_SET_ZERO(now);
+				INSTR_TIME_SET_CURRENT(now);
+				INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
 			}
 
 			if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3310,7 +3314,28 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 			 */
 		case CSTATE_ABORTED:
 		case CSTATE_FINISHED:
-			finishCon(st);
+			/*
+			 * In CSTATE_FINISHED state, this finishCon() is unnecessary
+			 * under -C/--connect because all the connections that this
+			 * thread established should have already been closed at the end
+			 * of transactions. So we need to measure the disconnection
+			 * delays here only when -C/--connect is not specified.
+			 *
+			 * In CSTATE_ABORTED state, the measurement is no longer
+			 * necessary because we cannot report complete results anyways
+			 * in this case.
+			 */
+			if (!is_connect)
+			{
+				instr_time	start = now;
+
+				INSTR_TIME_SET_CURRENT_LAZY(start);
+				finishCon(st);
+				INSTR_TIME_SET_CURRENT(now);
+				INSTR_TIME_ACCUM_DIFF(thread->conn_time, now, start);
+			}
+			else
+				finishCon(st);
 			return;
 		}
 	}
@@ -6210,6 +6235,12 @@ main(int argc, char **argv)
 		latency_late += thread->latency_late;
 		INSTR_TIME_ADD(conn_total_time, thread->conn_time);
 	}
+
+	/*
+	 * All connections should be already closed in threadRun(), so this
+	 * disconnect_all() will be a no-op, but clean up the connecions just
+	 * to be sure. We don't need to measure the disconnection delays here.
+	 */
 	disconnect_all(state, nclients);
 
 	/*
@@ -6237,8 +6268,7 @@ thread
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 7:38 AM Ajin Cherian wrote:
>
> On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada wrote:
> >
>
> Luckily these logs have the disconnection times of the tap test client
> sessions as well (not sure why I don't see these when I run these
> tests).
>
> Step 5 could have happened anywhere between 18:44:38.016 and
> 18:44:38.063:
>
> 18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG: statement: SELECT
> pid != 16336 FROM pg_stat_replication WHERE application_name =
> 'tap_sub';
> :
> :
> 18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG: disconnection:
> session time: 0:00:00.063 user=nm database=postgres host=[local]
>
> When the query starts both walsenders are present, but when the query
> completes both walsenders are gone; the actual query evaluation could
> have happened any time in between. This is the rare timing window that
> causes this problem.
>

You have a point, but if we look at the below logs, it seems the second
walsender (#step6) exited before the first walsender (#step4):

2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]

Isn't it possible that the pids are cleared in the other order, due to
which we are seeing this problem?

--
With Regards,
Amit Kapila.
Re: log_autovacuum in Postgres 14 -- ordering issue
On Wed, Aug 25, 2021 at 5:23 PM Alvaro Herrera wrote:
> > The question of whether or not we do an index scan (i.e. index
> > vacuuming) depends entirely on the number of LP_DEAD items that heap
> > pruning left behind in the table structure. [...]
>
> Ooh, this was illuminating -- thanks for explaining. TBH I would have
> been very confused if asked to explain what that log line meant; and
> now that I know what it means, I am even more convinced that we need to
> work harder at it :-)

The way that VACUUM and ANALYZE do dead tuple accounting is very
confusing. In fact, it's so confusing that even autovacuum can get
confused! I think that we need to treat LP_DEAD items and pruned tuples
even more differently than we do in Postgres 14, probably in a number
of different areas (not just VACUUM).

I've found that if I set autovacuum_vacuum_scale_factor and
autovacuum_analyze_scale_factor to 0.02 with a HOT-heavy workload
(almost stock pgbench), then autovacuum workers are launched almost
constantly. If I then increase autovacuum_vacuum_scale_factor to 0.05,
but make no other changes, then the system decides that it should
actually never launch an autovacuum worker, even once (except for
anti-wraparound purposes) [1]. This behavior is completely absurd, of
course.

To me this scenario illustrates an important general point: VACUUM has
the wrong idea, at least when it comes to certain specific details --
details that have plenty of real-world relevance. VACUUM currently
fails to understand anything about the rate of change, which, as I've
said, is often the most important thing in the real world. That's what
my absurd scenario seems to show. That's how I view a lot of these
things.

> I'll see if I can come up with something ...

Thanks. The message itself probably does need some work, but
documentation seems at least as important. It's slightly daunting,
honestly, because we don't even document HOT itself (unless you count
passing references that don't even explain the basic idea). I did try
to get people interested in this stuff at one point not too long ago
[2]. That thread went in an entirely different direction to the one I'd
planned on, though, so I became discouraged. I should pick it up again
now, though.

[1] https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=w...@mail.gmail.com
[2] https://postgr.es/m/cah2-wzkju+nibskzunbdpz6trse+aqvupae+xgm8zvob4wq...@mail.gmail.com

--
Peter Geoghegan
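For anyone puzzling over why a 0.03 difference flips the behavior so
completely: autovacuum launches a worker once the dead-tuple count
crosses the documented threshold

    vacuum threshold = autovacuum_vacuum_threshold
                     + autovacuum_vacuum_scale_factor * pg_class.reltuples

so on, say, a 10-million-row table (an illustrative figure, not one
from the test above), 0.02 means roughly 200,050 dead tuples before a
worker launches, while 0.05 means roughly 500,050 - and if HOT pruning
continually recycles dead tuples in the meantime, the counter may
simply never reach the larger target.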
Re: Window Function "Run Conditions"
On Thu, Aug 19, 2021 at 2:35 PM David Rowley wrote:
>
> On Thu, 19 Aug 2021 at 00:20, Andy Fan wrote:
> > In the current master, the result is:
> >
> >  empno | salary | c | dr
> > -------+--------+---+----
> >      8 |   6000 | 4 |  1
> >
> > In the patched version, the result is:
> >
> >  empno | salary | c | dr
> > -------+--------+---+----
> >      8 |   6000 | 1 |  1
>
> Thanks for taking it for a spin.
>
> That's a bit unfortunate. I don't immediately see how to fix it other
> than to restrict the optimisation to only apply when there's a single
> WindowClause. It might be possible to relax it further and only apply
> it if it's the final window clause to be evaluated, but in those cases,
> the savings are likely to be much less anyway as some previous
> WindowAgg will have exhausted all rows from its subplan.

I am trying to hack the select_active_windows function to make sure the
WindowClause with a run condition is the last one to be evaluated (we
also need to consider that more than one window function may have a run
condition); at that point, the run condition clause is already ready.
However, there are two troubles in this direction:

a) It may conflict with "the windows that need the same sorting are
   adjacent in the list."
b) "When two or more windows are order-equivalent then all peer rows
   must be presented in the same order in all of them. .. (See General
   Rule 4 of <window clause> in SQL2008 - SQL2016.)"

In summary, I am not sure whether it is correct to change the
evaluation order of WindowAggs freely.

> Likely
> restricting it to only working if there's 1 WindowClause would be fine
> as for the people using row_number() for a top-N type query, there's
> most likely only going to be 1 WindowClause.
>

This sounds practical. And I suggest the following small change (just
check the partitionClause before the prosupport lookup):

@@ -2133,20 +2133,22 @@ find_window_run_conditions(Query *subquery, RangeTblEntry *rte, Index rti,
 
 	*keep_original = true;
 
-	prosupport = get_func_support(wfunc->winfnoid);
-
-	/* Check if there's a support function for 'wfunc' */
-	if (!OidIsValid(prosupport))
-		return false;
-
 	/*
 	 * Currently the WindowAgg node just stop when the run condition is no
 	 * longer true. If there is a PARTITION BY clause then we cannot just
 	 * stop as other partitions still need to be processed.
 	 */
+
+	/* Check this first since window function with a partition clause is common */
 	if (wclause->partitionClause != NIL)
 		return false;
 
+	prosupport = get_func_support(wfunc->winfnoid);
+
+	/* Check if there's a support function for 'wfunc' */
+	if (!OidIsValid(prosupport))
+		return false;
+
 	/* get the Expr from the other side of the OpExpr */
 	if (wfunc_left)
 		otherexpr = lsecond(opexpr->args);

--
Best Regards
Andy Fan (https://www.aliyun.com/)
Re: Mark all GUC variable as PGDLLIMPORT
On Wed, Aug 25, 2021 at 10:41:14AM -0400, Tom Lane wrote:
> Magnus Hagander writes:
> > On Wed, Aug 25, 2021 at 4:06 PM Robert Haas wrote:
> >> It does tend to be controversial, but I think that's basically only
> >> because Tom Lane has reservations about it. I think if Tom dropped
> >> his opposition to this, nobody else would really care. And I think
> >> that would be a good thing for the project.
>
> > But in particular, both on that argument, and on the general
> > maintenance argument, I have a very hard time seeing how exporting the
> > GUC variables would be any worse than exporting the many hundreds of
> > functions we already export.
>
> My beef about it has nothing to do with binary-size concerns, although
> that is an interesting angle. (I wonder whether marking a variable
> PGDLLIMPORT has any negative effect on the cost of accessing it from
> within the core code?) Rather, I'm unhappy with spreading even more
> Microsoft-droppings all over our source. If there were some way to
> just do this automatically for all global variables without any source
> changes, I'd be content with that. That would *really* make the
> platforms more nearly equivalent.

OK, so the big question is how we can minimize the code impact of this
feature. Can we add some kind of checking so we don't forget to mark
anything? Can we automate this somehow?

--
Bruce Momjian  https://momjian.us
EDB  https://enterprisedb.com

If only the physical world exists, free will is an illusion.
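For readers unfamiliar with the mechanics being argued about, the
difference is a one-word annotation on the extern declaration (the
variable names below are illustrative, not a claim about which GUCs are
currently marked):

	/* In a core header: visible to extension DLLs on Windows */
	extern PGDLLIMPORT int some_guc_variable;

	/*
	 * Without the marking, the same declaration links fine for extensions
	 * on Unix-like systems, but the variable cannot be referenced from a
	 * loadable module on Windows.
	 */
	extern int	another_guc_variable;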
RE: Added schema level support for publication.
On Wednesday, August 25, 2021 5:37 PM vignesh C wrote:
>
> Attached v21 patch has the changes based on the new syntax and fixes
> a few of the other review comments provided by reviewers.
>

Thanks for your new patch. I saw the following warning when building;
please have a look:

publicationcmds.c: In function 'ConvertPubObjSpecListToOidList':
publicationcmds.c:212:23: warning: 'prevobjtype' may be used uninitialized in this function [-Wmaybe-uninitialized]
   pubobj->pubobjtype = prevobjtype;
   ~~~~~~~~~~~~~~~~~~~^

Regards
Tang
Re: Failure of subscription tests with topminnow
On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada wrote:
>
> On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian wrote:
> >
> > On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila wrote:
> > >
> > > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada wrote:
> > > >
> > > > I did a quick check with the following tap test code:
> > > >
> > > > $node_publisher->poll_query_until('postgres',
> > > > qq(
> > > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > > ));
> > > >
> > > > The query returns {t, f} but poll_query_until() never finished. The
> > > > same is true when the query returns {f, t}.
> > > >
> >
> > Relooking at my logs, I think what happens is this:
> >
> > 1. First walsender 'a' is running.
> > 2. Second walsender 'b' starts and, attempting to acquire the slot,
> >    finds that the slot is active for pid a.
> > 3. Now both walsenders are active, and the query does not return.
> > 4. First walsender 'a' times out and exits.
> > 5. Now only the second walsender is active and the query returns OK
> >    because pid != a.
> > 6. Second walsender exits with an error.
> > 7. Another query attempts to get the pid of the running walsender for
> >    tap_sub but returns null because both walsenders have exited.
> > 8. This null return value results in the next query erroring out and
> >    the test failing.
>
> So this is slightly different than what we can see in the topminnow
> logs? According to the server logs, step #5 happened (at 18:44:38.016)
> before step #4 happened (at 18:44:38.043).
>

Luckily these logs have the disconnection times of the tap test client
sessions as well (not sure why I don't see these when I run these
tests).

Step 5 could have happened anywhere between 18:44:38.016 and
18:44:38.063:

18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG: statement: SELECT
pid != 16336 FROM pg_stat_replication WHERE application_name =
'tap_sub';
:
:
18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG: disconnection:
session time: 0:00:00.063 user=nm database=postgres host=[local]

When the query starts both walsenders are present, but when the query
completes both walsenders are gone; the actual query evaluation could
have happened any time in between. This is the rare timing window that
causes this problem.

regards,
Ajin Cherian
Fujitsu Australia
Re: row filtering for logical replication
On Wed, Aug 25, 2021 at 3:28 PM Amit Kapila wrote:
>
...
>
> Hmm, I think the gain via caching is not visible because we are using
> simple expressions. It will be visible when we use somewhat complex
> expressions where expression evaluation cost is significant.
> Similarly, the impact of this change will magnify and it will also be
> visible when a publication has many tables. Apart from performance,
> this change is logically correct as well because it would be any way
> better if we don't invalidate the cached expressions unless required.

Please tell me what is your idea of a "complex" row filter expression.
Do you just mean a filter that has multiple AND conditions in it? I
don't really know if a few complex expressions would amount to any
significant evaluation cost, so I would like to run some timing tests
with some real examples to see the results.

On Wed, Aug 25, 2021 at 6:31 PM Amit Kapila wrote:
>
> On Wed, Aug 25, 2021 at 10:57 AM Amit Kapila wrote:
> >
> > On Wed, Aug 25, 2021 at 5:52 AM Euler Taveira wrote:
> > >
> > > On Tue, Aug 24, 2021, at 4:46 AM, Peter Smith wrote:
> > >
> > > > Anyway, I have implemented the suggested cache change because I
> > > > agree it is probably theoretically superior, even if in practice
> > > > there is almost no difference.
> > >
> > > I didn't inspect your patch carefully but it seems you add another
> > > List to control this new cache mechanism. I don't like it. IMO if we
> > > can use the data structures that we have now, let's implement your
> > > idea; otherwise, -1 for this new micro optimization.
> > >
> >
> > As mentioned above, without this we will invalidate many cached
> > expressions even though it is not required. I don't deny that there
> > might be a better way to achieve the same, and if you or Peter have
> > any ideas, I am all ears.
> >
>
> I see that the new list is added to store the row_filter node, which we
> later use to compute the expression. This is not required for
> invalidation but for delaying the expression evaluation till it is
> required (for example, for truncate, we may not need the row
> evaluation, so there is no need to compute it). Can we try to postpone
> the syscache lookup to a later stage when we are actually doing
> row_filtering? If we can do that, then I think we can avoid having this
> extra list?

Yes, you are correct - that Node list was re-instated only because you
had requested that the ExprState evaluation should be deferred until it
is needed by pgoutput_row_filter. Otherwise, the additional list would
not be needed, so everything would be much the same as in v23, except
the invalidations would be more focussed on single tables.

I don't think the syscache lookup can be easily postponed. That logic
of get_rel_sync_entry processes the table filters of *all*
publications, so moving that publications loop (including the partition
logic) into pgoutput_row_filter seems a bridge too far IMO.

Furthermore, I am not yet convinced that this ExprState postponement is
very useful. It may be true that for truncate there is no need to
compute it, but consider that the user would never even define a row
filter in the first place unless they intended there to be some CRUD
operations. So even if the truncate does not need the filter,
*something* is surely going to need it. In other words, IIUC this
postponement is not going to save any time overall - it only shifts
when the (one-time) expression evaluation will happen.

I feel it would be better to just remove the postponed evaluation of
the ExprState added in v24. That will remove any need for the extra
Node list (which I think is Euler's concern). The ExprState cache will
still be slightly improved from how it was implemented before, because
it is "logically correct" that we don't invalidate the cached
expressions unless required.

Thoughts?

--
Kind Regards,
Peter Smith.
Fujitsu Australia
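To make the caching being discussed concrete, here is a rough sketch of
the shape involved (the exprstate field and the helper below are
hypothetical names, not the patch's actual code): the compiled
ExprState lives in the per-relation cache entry, is built on first use,
and an invalidation simply nulls it out so only that relation pays for
a rebuild:

	/* Hypothetical extra fields in pgoutput's per-relation cache entry */
	typedef struct RelationSyncEntry
	{
		Oid			relid;			/* hash key */
		/* ... existing fields ... */
		Node	   *rowfilter;		/* filter expression tree, from catalog */
		ExprState  *exprstate;		/* compiled filter, or NULL if not built */
	} RelationSyncEntry;

	/* Hypothetical helper called from pgoutput_row_filter() */
	static bool
	row_filter_exec(RelationSyncEntry *entry, EState *estate,
					ExprContext *ecxt)
	{
		Datum		ret;
		bool		isnull;

		/* Build once per cache entry; invalidation resets exprstate to NULL */
		if (entry->exprstate == NULL)
			entry->exprstate = ExecPrepareExpr((Expr *) entry->rowfilter,
											   estate);

		ret = ExecEvalExprSwitchContext(entry->exprstate, ecxt, &isnull);

		/* Rows are sent only when the filter evaluates to true */
		return !isnull && DatumGetBool(ret);
	}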
Re: Separate out FileSet from SharedFileSet (was Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o)
On Wed, Aug 25, 2021 at 9:19 PM Amit Kapila wrote:
>
> On Tue, Aug 24, 2021 at 3:55 PM Dilip Kumar wrote:
> >
> > On Tue, Aug 24, 2021 at 12:26 PM Amit Kapila wrote:
> >
>
> The first patch looks good to me. I have made minor changes to the
> attached patch. The changes include: fixing a compilation warning, some
> comment changes, running pgindent, and a few other cosmetic changes. If
> you are fine with the attached, then kindly rebase the second patch
> atop it.

The patch looks good to me.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: Skipping logical replication transactions on subscriber side
On Wed, Aug 25, 2021 at 2:22 PM Masahiko Sawada wrote:
>
> Attached updated version patches. Please review them.
>

Regarding the v11-0001 patch, it looks OK to me, but I do have one
point: in apply_dispatch(), wouldn't it be better to NOT move the error
reporting for an invalid message type into the switch as the default
case - because then, if you add a new message type, you won't get a
compiler warning (when warnings are enabled) for a missing switch case,
which is a handy way to alert you that the new message type needs to be
added as a case to the switch.

Regards,
Greg Nancarrow
Fujitsu Australia
Re: Mark all GUC variable as PGDLLIMPORT
On Thu, Aug 26, 2021 at 1:51 AM Alvaro Herrera wrote:
>
> On 2021-Aug-25, Magnus Hagander wrote:
>
> > The thing we need the PGDLLIMPORT definition for is to *import* them
> > on the other end?
>
> Oh ... so modules that are willing to cheat can include their own
> declarations of the variables they need, and mark them __declspec
> (dllimport)?

I just tried, and MSVC doesn't like it. It errors out with a C2370
error, "redefinition; different storage class". According to
https://docs.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2370
changing __declspec() on the other side is not possible.
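A minimal illustration of the failure (my reconstruction of what MSVC
objects to, not code posted in the thread; the variable name is
illustrative):

	/* from a PostgreSQL header, where the variable lacks PGDLLIMPORT: */
	extern int	some_guc_variable;

	/* the extension's attempt to "cheat" with its own declaration: */
	extern __declspec(dllimport) int some_guc_variable;
	/* MSVC: error C2370: 'some_guc_variable': redefinition;
	 * different storage class */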
Re: prevent immature WAL streaming
(this is off-topic here)

At Wed, 25 Aug 2021 09:56:56 -0400, Robert Haas wrote in
> On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi wrote:
> > At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera wrote in
> > > I'd also like to have tests. That seems moderately hard, but if we
> > > had WAL-molasses that could be used in walreceiver, it could be done.
> > > (It sounds easier to write tests with a molasses-archive_command.)
> > >
> > >
> > > [1] https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com
> > > [2] https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan@alibaba-inc.com
> >
> > (I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)
>
> I think, but am not 100% sure, that "molasses" here is being used to
> refer to fault injection.

Oh. That makes sense, thanks. I sometimes inject artificial faults (a
server crash, in this case) to create specific on-disk states, but I
cannot imagine that that kind of machinery can be statically placed in
the source tree..

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: prevent immature WAL streaming
At Wed, 25 Aug 2021 20:20:04 -0400, "alvhe...@alvh.no-ip.org" wrote in
> BTW while going about testing this, I noticed that we forbid
> pg_walfile_name() while in recovery. That restriction was added by
> commit 370f770c15a4 because ThisTimeLineID was not set correctly during
> recovery. That was supposed to be fixed by commit 1148e22a82ed, so I
> thought that it should be possible to remove the restriction. However,
> I did that per the attached patch, but was quickly disappointed because
> ThisTimeLineID seems to remain zero in a standby for reasons that I
> didn't investigate.

On an intermediate node of a cascading replication setup, the timeline
IDs of the walsender and walreceiver can differ, and ordinary backends
cannot decide which one to use.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian wrote:
>
> On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila wrote:
> >
> > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada wrote:
> > >
> > > I did a quick check with the following tap test code:
> > >
> > > $node_publisher->poll_query_until('postgres',
> > > qq(
> > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > ));
> > >
> > > The query returns {t, f} but poll_query_until() never finished. The
> > > same is true when the query returns {f, t}.
> > >
> >
> > Yes, this is true, I also see the same behaviour.
> >
> > This means something different is going on in Ajin's setup. Ajin, can
> > you please share how you confirmed your findings about poll_query?
>
> Relooking at my logs, I think what happens is this:
>
> 1. First walsender 'a' is running.
> 2. Second walsender 'b' starts and, attempting to acquire the slot,
>    finds that the slot is active for pid a.
> 3. Now both walsenders are active, and the query does not return.
> 4. First walsender 'a' times out and exits.
> 5. Now only the second walsender is active and the query returns OK
>    because pid != a.
> 6. Second walsender exits with an error.
> 7. Another query attempts to get the pid of the running walsender for
>    tap_sub but returns null because both walsenders have exited.
> 8. This null return value results in the next query erroring out and
>    the test failing.

So this is slightly different than what we can see in the topminnow
logs? According to the server logs, step #5 happened (at 18:44:38.016)
before step #4 happened (at 18:44:38.043).

> > Can you additionally check the value of 'state' from
> > pg_stat_replication for both the old and new walsender sessions?
>
> Yes, will try this and post a patch tomorrow.

Thanks. I guess the state of the new walsender should be "startup"
whereas the old one should be "streaming".

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: prevent immature WAL streaming
At Wed, 25 Aug 2021 18:18:59 +, "Bossart, Nathan" wrote in
> On 8/25/21, 5:33 AM, "alvhe...@alvh.no-ip.org" wrote:
> > On 2021-Aug-24, Bossart, Nathan wrote:
> >> Another interesting thing I see is that the boundary stored in
> >> earliestSegBoundary is not necessarily the earliest one. It's just
> >> the first one that has been registered. I did this for simplicity for
> >> the .ready file fix, but I can see it causing problems here.
> >
> > Hmm, is there really a problem here? Surely the flush point cannot go
> > past whatever has been written. If somebody is writing an earlier
> > section of WAL, then we cannot move the flush pointer to a later
> > position. So it doesn't matter if the earliest point we have registered
> > is the true earliest -- we only care for it to be the earliest that is
> > past the flush point.
>
> Let's say we have the following situation (F = flush, E = earliest
> registered boundary, and L = latest registered boundary), and let's
> assume that each segment has a cross-segment record that ends in the
> next segment.
>
>      F E     L
> |-|-|-|-|-|-|-|-|
>  1 2 3 4 5 6 7 8
>
> Then, we write out WAL to disk and create .ready files as needed. If
> we didn't flush beyond the latest registered boundary, the latest
> registered boundary now becomes the earliest boundary.
>
>            F E
> |-|-|-|-|-|-|-|-|
>  1 2 3 4 5 6 7 8
>
> At this point, the earliest segment boundary past the flush point is
> before the "earliest" boundary we are tracking.

We know we have some cross-segment records in the region [E L], so we cannot add a .ready file while the flush point is in that region. I haven't looked at the latest patch (or I may misunderstand the discussion here), but I think we shouldn't move E before F exceeds the previous L (the one in the first picture above). Things are done that way in my ancient proposal in [1].

https://www.postgresql.org/message-id/attachment/117052/v4-0001-Avoid-archiving-a-WAL-segment-that-continues-to-t.patch

+		if (LogwrtResult.Write < firstSegContRecStart ||
+			lastSegContRecEnd <= LogwrtResult.Write)
+		{

By doing so, at the time of the second picture, the pointers are set as:

       E   F L
|-|-|-|-|-|-|-|-|
 1 2 3 4 5 6 7 8

Then the pointers are cleared at the time F reaches L:

             F
|-|-|-|-|-|-|-|-|
 1 2 3 4 5 6 7 8

Doesn't this work correctly? As I think I mentioned in the thread, I don't think we have that many (more than several, specifically) segments in a region [E L].

[1] https://www.postgresql.org/message-id/20201216.110120.887433782054853494.horikyota.ntt%40gmail.com

regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: log_autovacuum in Postgres 14 -- ordering issue
On 2021-Aug-25, Peter Geoghegan wrote: > On Wed, Aug 25, 2021 at 2:06 PM Alvaro Herrera > wrote: > > I like it better than the current layout, so +1. > > This seems like a release housekeeping task to me. I'll come up with > a patch targeting 14 and master in a few days. Agreed, thanks. > The question of whether or not we do an index scan (i.e. index > vacuuming) depends entirely on the number of LP_DEAD items that heap > pruning left behind in the table structure. [...] Ooh, this was illuminating -- thanks for explaining. TBH I would have been very confused if asked to explain what that log line meant; and now that I know what it means, I am even more convinced that we need to work harder at it :-) I'll see if I can come up with something ... -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "The problem with the future is that it keeps turning into the present" (Hobbes)
Re: prevent immature WAL streaming
BTW while going about testing this, I noticed that we forbid pg_walfile_name() while in recovery. That restriction was added by commit 370f770c15a4 because ThisTimeLineID was not set correctly during recovery. That was supposed to be fixed by commit 1148e22a82ed, so I thought that it should be possible to remove the restriction. However, I did that per the attached patch, but was quickly disappointed because ThisTimeLineID seems to remain zero in a standby for reasons that I didn't investigate. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "The problem with the facetime model is not just that it's demoralizing, but that the people pretending to work interrupt the ones actually working." (Paul Graham)

>From 1b57e504b9a7eeddf2d99e615f21c5aec042a908 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Wed, 25 Aug 2021 19:55:43 -0400
Subject: [PATCH] Don't forbid pg_walfile_name in recovery mode

---
 src/backend/access/transam/xlogfuncs.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b98deb72ec..99b22d0b30 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -503,13 +503,6 @@ pg_walfile_name(PG_FUNCTION_ARGS)
 	XLogRecPtr	locationpoint = PG_GETARG_LSN(0);
 	char		xlogfilename[MAXFNAMELEN];
 
-	if (RecoveryInProgress())
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("recovery is in progress"),
-				 errhint("%s cannot be executed during recovery.",
-						 "pg_walfile_name()")));
-
 	XLByteToPrevSeg(locationpoint, xlogsegno, wal_segment_size);
 	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno, wal_segment_size);
-- 
2.30.2
Re: log_autovacuum in Postgres 14 -- ordering issue
On Wed, Aug 25, 2021 at 2:06 PM Alvaro Herrera wrote:
> You mean:
>
> LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1
> pages: 0 removed, 8810377 remain, 0 skipped due to pins, 3044924 frozen
> tuples: 16819838 removed, 576364686 remain, 2207444 are dead but not yet removable, oldest xmin: 88197949
> index scan needed: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed
> index "bmsql_order_line_pkey": pages: 2380261 in total, 0 newly deleted, 0 currently deleted, 0 reusable
> I/O timings: read: 65813.666 ms, write: 11310.689 ms
> avg read rate: 65.621 MB/s, avg write rate: 48.403 MB/s
> buffer usage: 174505 hits, 7630386 misses, 5628271 dirtied
> WAL usage: 7387358 records, 4051205 full page images, 28472185998 bytes
> system usage: CPU: user: 72.55 s, system: 52.07 s, elapsed: 908.42 s

Yes, exactly.

> I like it better than the current layout, so +1.

This seems like a release housekeeping task to me. I'll come up with a patch targeting 14 and master in a few days.

> I think the "index scan needed" line (introduced very late in the 14 cycle, commit 5100010ee4d5 dated April 7 2021) is a bit odd.

But that's largely a reflection of what's going on here.

> It is telling us stuff about the table -- how many pages had TIDs removed, am I reading that right? -- and it is also telling us whether indexes were scanned. But the fact that it starts with "index scan needed" suggests that it's talking about indexes.

The question of whether or not we do an index scan (i.e. index vacuuming) depends entirely on the number of LP_DEAD items that heap pruning left behind in the table structure. Actually, sometimes it's ~100% opportunistic pruning that happens to run outside of VACUUM (in which case VACUUM merely notices and collects TIDs to delete from indexes) -- it depends entirely on the workload. This isn't a new thing added in commit 5100010ee4d5, really. That commit merely made the index-bypass behavior not only occur when we had precisely 0 items to delete from indexes -- now it can be skipped when the percentage of heap pages with one or more LP_DEAD items is < 2%.

So yes: this "pages from table" output *is* primarily concerned with what happened with indexes, even though the main piece of information says something about the heap/table structure. Note that in general a table could easily have many many more "tuples: N removed" than "N dead item identifiers removed" in its log_autovacuum output -- this is very common (any table that mostly or only gets HOT updates and no deletes will look like that). The opposite situation is also possible, and almost as common with tables that only get non-HOT updates. The BenchmarkSQL TPC-C implementation has tables in both categories -- it does tend to be a stable thing for a table, in general.
Here is the second largest BenchmarkSQL table (this is just a random VACUUM operation from logs used by a recent benchmark of mine):

automatic aggressive vacuum of table "regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 943785 remain, 6 skipped due to pins, 205851 skipped frozen
tuples: 63649 removed, 105630136 remain, 2785 are dead but not yet removable, oldest xmin: 186094041
buffer usage: 2660543 hits, 1766591 misses, 1375104 dirtied
index scan needed: 219092 pages from table (23.21% of total) had 14946563 dead item identifiers removed
index "bmsql_oorder_pkey": pages: 615866 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "bmsql_oorder_idx1": pages: 797957 in total, 131608 newly deleted, 131608 currently deleted, 131608 reusable
avg read rate: 33.933 MB/s, avg write rate: 26.413 MB/s
I/O timings: read: 105551.978 ms, write: 16538.690 ms
system usage: CPU: user: 79.71 s, system: 49.74 s, elapsed: 406.73 s
WAL usage: 1934993 records, 1044051 full page images, 7076820876 bytes

On Postgres 13 you'd only see "tuples: 63649 removed" here. You'd never see anything like "14946563 dead item identifiers removed", even though that's obviously hugely important (more important than "tuples removed", even). A user could be forgiven for thinking that HOT must hurt performance! So yes, I agree. This *is* a bit odd. (Another problem here is that "205851 skipped frozen" only counts those heap pages that were specifically skipped frozen, even for a non-aggressive VACUUM.)

> I think we should reword this line. I don't have any great ideas; what do you think of this?
>
> dead items: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed; index scan {needed, not needed, bypassed, bypassed by failsafe}
>
> I have to say that I am a bit bothered about the coding pattern used to build this sentence from two parts. I'm not sure it'll work okay in languages that build sentences in different ways. Maybe we should split this in two lines, one to give the numbers and the other to talk about the decision taken about indexes.

I'm
Re: prevent immature WAL streaming
On 2021-Aug-25, Jakub Wartak wrote: > In order to get a reliable reproducer and proper fault injection, > instead of playing with really filling up the fs, apparently one could > substitute the fd with an fd for /dev/full using e.g. dup2(), so that every > write is going to throw this error too: Oh, this is a neat trick that I didn't know about. Thanks. > After restarting the master and inspecting the standby - in all of the above > 3 cases - the standby didn't exhibit the "invalid contrecord length" error, > at least here, while without this v02 patch it is > notorious. So if it passes the worst-case code review assumptions, I > would be wondering if it shouldn't even be committed as it stands > right now. Well, Nathan is right that this patch isn't really closing the hole, because we aren't tracking segment boundaries that are neither "earliest" nor "latest" at the moment of registration. The simplistic approach seems okay for the archive case, but not for the replication case. I also noticed today (facepalm) that the patch doesn't work unless you have archiving enabled, because the registration code is only invoked when XLogArchivingActive(). Doh. This part is easy to solve. The other doesn't look easy ... -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
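A minimal, self-contained sketch of the /dev/full trick Jakub describes, assuming Linux (where /dev/full exists); the file name and overall program structure here are illustrative only, not taken from any patch in this thread:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	/* stand-in for a WAL segment file; the name is made up */
	int			fd = open("fake_wal_segment", O_CREAT | O_WRONLY, 0600);
	int			fullfd = open("/dev/full", O_WRONLY);

	if (fd < 0 || fullfd < 0)
		return 1;

	/* redirect fd to /dev/full: every later write() fails with ENOSPC */
	if (dup2(fullfd, fd) < 0)
		return 1;

	if (write(fd, "x", 1) < 0)
		fprintf(stderr, "write failed as intended: %s\n", strerror(errno));

	return 0;
}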
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Wed, Aug 25, 2021 at 3:25 PM Zhihong Yu wrote: > > > On Wed, Aug 25, 2021 at 11:42 AM Jacob Champion > wrote: > >> On Tue, 2021-06-22 at 23:22 +, Jacob Champion wrote: >> > On Fri, 2021-06-18 at 11:31 +0300, Heikki Linnakangas wrote: >> > > >> > > A few small things caught my eye in the backend oauth_exchange >> function: >> > > >> > > > + /* Handle the client's initial message. */ >> > > > + p = strdup(input); >> > > >> > > this strdup() should be pstrdup(). >> > >> > Thanks, I'll fix that in the next re-roll. >> > >> > > In the same function, there are a bunch of reports like this: >> > > >> > > >ereport(ERROR, >> > > > + (errcode(ERRCODE_PROTOCOL_VIOLATION), >> > > > + errmsg("malformed OAUTHBEARER message"), >> > > > + errdetail("Comma expected, but found >> character \"%s\".", >> > > > + sanitize_char(*p; >> > > >> > > I don't think the double quotes are needed here, because >> sanitize_char >> > > will return quotes if it's a single character. So it would end up >> > > looking like this: ... found character "'x'". >> > >> > I'll fix this too. Thanks! >> >> v2, attached, incorporates Heikki's suggested fixes and also rebases on >> top of latest HEAD, which had the SASL refactoring changes committed >> last month. >> >> The biggest change from the last patchset is 0001, an attempt at >> enabling jsonapi in the frontend without the use of palloc(), based on >> suggestions by Michael and Tom from last commitfest. I've also made >> some improvements to the pytest suite. No major changes to the OAuth >> implementation yet. >> >> --Jacob >> > Hi, > For v2-0001-common-jsonapi-support-FRONTEND-clients.patch : > > + /* Clean up. */ > + termJsonLexContext(&lex); > > At the end of termJsonLexContext(), empty is copied to lex. For stack > based JsonLexContext, the copy seems unnecessary. > Maybe introduce a boolean parameter for termJsonLexContext() to signal > that the copy can be omitted ? > > +#ifdef FRONTEND > + /* make sure initialization succeeded */ > + if (lex->strval == NULL) > + return JSON_OUT_OF_MEMORY; > > Should PQExpBufferBroken(lex->strval) be used for the check ? > > Thanks >

Hi,
For v2-0002-libpq-add-OAUTHBEARER-SASL-mechanism.patch :

+	i_init_session(&session);
+
+	if (!conn->oauth_client_id)
+	{
+		/* We can't talk to a server without a client identifier. */
+		appendPQExpBufferStr(&conn->errorMessage,
+							 libpq_gettext("no oauth_client_id is set for the connection"));
+		goto cleanup;

Can the conn->oauth_client_id check be performed ahead of i_init_session()? That way, ```goto cleanup``` can be replaced with a plain return.

+	if (!error_code || (strcmp(error_code, "authorization_pending")
+						&& strcmp(error_code, "slow_down")))

What if, in the future, there is an error code different from the above two which doesn't represent the "OAuth token retrieval failed" scenario?

For client_initial_response(),

+	token_buf = createPQExpBuffer();
+	if (!token_buf)
+		goto cleanup;

If token_buf is NULL, there doesn't seem to be anything to free. We can return directly.

Cheers
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Wed, Aug 25, 2021 at 11:42 AM Jacob Champion wrote: > On Tue, 2021-06-22 at 23:22 +, Jacob Champion wrote: > > On Fri, 2021-06-18 at 11:31 +0300, Heikki Linnakangas wrote: > > > > > > A few small things caught my eye in the backend oauth_exchange > function: > > > > > > > + /* Handle the client's initial message. */ > > > > + p = strdup(input); > > > > > > this strdup() should be pstrdup(). > > > > Thanks, I'll fix that in the next re-roll. > > > > > In the same function, there are a bunch of reports like this: > > > > > > >ereport(ERROR, > > > > + (errcode(ERRCODE_PROTOCOL_VIOLATION), > > > > + errmsg("malformed OAUTHBEARER message"), > > > > + errdetail("Comma expected, but found > character \"%s\".", > > > > + sanitize_char(*p; > > > > > > I don't think the double quotes are needed here, because sanitize_char > > > will return quotes if it's a single character. So it would end up > > > looking like this: ... found character "'x'". > > > > I'll fix this too. Thanks! > > v2, attached, incorporates Heikki's suggested fixes and also rebases on > top of latest HEAD, which had the SASL refactoring changes committed > last month. > > The biggest change from the last patchset is 0001, an attempt at > enabling jsonapi in the frontend without the use of palloc(), based on > suggestions by Michael and Tom from last commitfest. I've also made > some improvements to the pytest suite. No major changes to the OAuth > implementation yet. > > --Jacob >

Hi,
For v2-0001-common-jsonapi-support-FRONTEND-clients.patch :

+	/* Clean up. */
+	termJsonLexContext(&lex);

At the end of termJsonLexContext(), empty is copied to lex. For a stack-based JsonLexContext, the copy seems unnecessary. Maybe introduce a boolean parameter for termJsonLexContext() to signal that the copy can be omitted?

+#ifdef FRONTEND
+	/* make sure initialization succeeded */
+	if (lex->strval == NULL)
+		return JSON_OUT_OF_MEMORY;

Should PQExpBufferBroken(lex->strval) be used for the check?

Thanks
Re: Multi-Column List Partitioning
On Wed, Aug 25, 2021 at 5:41 AM Nitin Jadhav wrote: > > The new list bound binary search and related comparison support > > function look a bit too verbose to me. I was expecting > > partition_list_bsearch() to look very much like > > partition_range_datum_bsearch(), but that is not the case. The > > special case code that you wrote in partition_list_bsearch() seems > > unnecessary, at least in that function. I'm talking about the code > > fragment starting with this comment: > > > > I will look at other parts of the patch next week hopefully. For > > now, attached is a delta patch that applies on top of your v1, which > > does: > > > > * Simplify partition_list_bsearch() and partition_lbound_datum_cmp() > > * Make qsort_partition_list_value_cmp simply call > > partition_lbound_datum_cmp() instead of having its own logic to > > compare input bounds > > * Move partition_lbound_datum_cmp() into partbounds.c as a static > > function (export seems unnecessary) > > * Add a comment for PartitionBoundInfo.isnulls and remove that for > null_index > > Yes. You are right. The extra code added in partition_list_bsearch() > is not required, and thanks for sharing the delta patch. It looks good > to me and I have incorporated the changes in the attached patch. > > > I guess you're perhaps trying to address the case where the caller > > does not specify the values for all of the partition key columns, > > which can happen when the partition pruning code needs to handle a set > > of clauses matching only some of the partition key columns. But > > that's a concern of the partition pruning code and so the special case > > should be handled there (if at all), not in the binary search function > > that is shared with other callers. Regarding that, I'm wondering if > > we should require clauses matching all of the partition key columns to > > be found for the pruning code to call the binary search, so do > > something like get_matching_hash_bounds() does: > > > > Do you think that trying to match list partitions even with fewer keys > > is worth the complexity of the implementation? That is, is the use > > case to search for only a subset of partition key columns common > > enough with list partitioning? > > > > If we do decide to implement the special case, remember that to do > > that efficiently, we'd need to require that the subset of matched key > > columns constitutes a prefix, because of the way the datums are > > sorted. That is, match all partitions when the query only contains a > > clause for b when the partition key is (a, b, c), but engage the > > special case of pruning if the query contains clauses for a, or for a > > and b. > > Thanks for the suggestion. Below are the implementation details of > partition pruning for multi-column list partitioning. > > In the existing code (for single-column list partitioning): > 1. In gen_partprune_steps_internal(), we try to match the where > clauses provided by the user with the partition key data using > match_clause_to_partition_key(). Based on the match, this function can > return many values like PARTCLAUSE_MATCH_CLAUSE, > PARTCLAUSE_MATCH_NULLNESS, PARTCLAUSE_NOMATCH, etc. > 2. In case of PARTCLAUSE_MATCH_CLAUSE, we generate steps using > gen_prune_steps_from_opexps() (strategy-2), which generates and returns a > list of PartitionPruneStepOp that are based on OpExpr and BooleanTest > clauses that have been matched to the partition key, and it also takes > care of handling prefixes of the partition keys. > 3. 
In case of PARTCLAUSE_MATCH_NULLNESS, we generate steps using > gen_prune_step_op() (strategy-1), which generates a single > PartitionPruneStepOp, since the earlier list partitioning supports a > single column and there can be only one NULL value. In > get_matching_list_bounds(), if the nullkeys is not empty, we fetch the > partition index which accepts null and return from there. > > In case of multi-column list partitioning, we have more than one > column and hence there is a possibility of more than one NULL value in > the where clauses. The above-mentioned steps are modified as below. > > 1. Modified match_clause_to_partition_key() to generate an object > of the PartClauseInfo structure and return PARTCLAUSE_MATCH_CLAUSE even in > case of clauses related to NULL. The information required to generate > PartClauseInfo is populated here, like the constant expression > consisting of (Datum) 0, op_strategy, op_is_ne, etc. > 2. Since I am returning PARTCLAUSE_MATCH_CLAUSE, now we use strategy-2 > (gen_prune_steps_from_opexps) to generate partition pruning steps. > This function takes care of generating a list of pruning steps if > there are multiple clauses and also takes care of handling prefixes. > 3. Modified perform_pruning_base_step() to generate the datum values > and isnulls data of the where clauses. If any of the keys > contains a NULL value, then the corresponding datum value is 0. > 4. Modified get_matching_list_bounds() to generate the min
Re: log_autovacuum in Postgres 14 -- ordering issue
Greetings, * Peter Geoghegan (p...@bowt.ie) wrote:
> On Wed, Aug 25, 2021 at 11:42 AM Nikolay Samokhvalov wrote:
> > The last two lines are also "*** usage" -- shouldn't the buffer numbers be next to them?
>
> I agree that that would be better still -- but all the "usage" stuff together in one block.
>
> And that leads me to another observation: The track_io_timing stuff (also new to Postgres 14) might also need to be reordered. And maybe even the WAL usage stuff, which was added in Postgres 13.
>
> That way the overall structure starts with details of the physical data structures (the table and its indexes), then goes into buffers
>
> 1. Heap pages
> 2. Heap tuples
> 3. Index stuff
> 4. I/O timings (only when track_io_timing is on)
> 5. avg read rate (always)
> 6. buffer usage
> 7. WAL usage.
> 8. system usage
>
> This would mean that I'd be flipping the order of 7 and 8 relative to Postgres 13 -- meaning there'd be one difference between Postgres 14 and some existing stable release. But I think that putting WAL usage last of all (after system usage) makes little sense -- commit b7ce6de93b shouldn't have done it that way. I always expect to see the getrusage() stuff last.

I generally like the idea, though I'm not sure about changing things in v13, as there's likely code out there that's already parsing that data and it might suddenly break if this was changed. Given that such code would need to be adjusted for v14 anyway, I don't really see changing it in v14 as being an issue (nor do I feel that it's even a big concern at this point in the release cycle, though perhaps others feel differently). Thanks, Stephen
Re: log_autovacuum in Postgres 14 -- ordering issue
On 2021-Aug-25, Peter Geoghegan wrote:
> That way the overall structure starts with details of the physical data structures (the table and its indexes), then goes into buffers
>
> 1. Heap pages
> 2. Heap tuples
> 3. Index stuff
> 4. I/O timings (only when track_io_timing is on)
> 5. avg read rate (always)
> 6. buffer usage
> 7. WAL usage.
> 8. system usage
>
> This would mean that I'd be flipping the order of 7 and 8 relative to Postgres 13 -- meaning there'd be one difference between Postgres 14 and some existing stable release. But I think that putting WAL usage last of all (after system usage) makes little sense -- commit b7ce6de93b shouldn't have done it that way. I always expect to see the getrusage() stuff last.

You mean:

LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1
pages: 0 removed, 8810377 remain, 0 skipped due to pins, 3044924 frozen
tuples: 16819838 removed, 576364686 remain, 2207444 are dead but not yet removable, oldest xmin: 88197949
index scan needed: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed
index "bmsql_order_line_pkey": pages: 2380261 in total, 0 newly deleted, 0 currently deleted, 0 reusable
I/O timings: read: 65813.666 ms, write: 11310.689 ms
avg read rate: 65.621 MB/s, avg write rate: 48.403 MB/s
buffer usage: 174505 hits, 7630386 misses, 5628271 dirtied
WAL usage: 7387358 records, 4051205 full page images, 28472185998 bytes
system usage: CPU: user: 72.55 s, system: 52.07 s, elapsed: 908.42 s

I like it better than the current layout, so +1.

I think the "index scan needed" line (introduced very late in the 14 cycle, commit 5100010ee4d5 dated April 7 2021) is a bit odd. It is telling us stuff about the table -- how many pages had TIDs removed, am I reading that right? -- and it is also telling us whether indexes were scanned. But the fact that it starts with "index scan needed" suggests that it's talking about indexes. I think we should reword this line. I don't have any great ideas; what do you think of this?

dead items: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed; index scan {needed, not needed, bypassed, bypassed by failsafe}

I have to say that I am a bit bothered about the coding pattern used to build this sentence from two parts. I'm not sure it'll work okay in languages that build sentences in different ways. Maybe we should split this in two lines, one to give the numbers and the other to talk about the decision taken about indexes. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "This is what I like so much about PostgreSQL. Most of the surprises are of the "oh wow! That's cool" Not the "oh shit!" kind. :)" Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php
Re: log_autovacuum in Postgres 14 -- ordering issue
On Wed, Aug 25, 2021 at 1:33 PM Stephen Frost wrote: > I don't have any particular issue with moving them. What do you think of the plan I just outlined to Nikolay? -- Peter Geoghegan
Re: log_autovacuum in Postgres 14 -- ordering issue
On Wed, Aug 25, 2021 at 11:42 AM Nikolay Samokhvalov wrote: > The last two lines are also "*** usage" -- shouldn't the buffer numbers be > next to them? I agree that that would be better still -- but all the "usage" stuff together in one block. And that leads me to another observation: The track_io_timing stuff (also new to Postgres 14) might also need to be reordered. And maybe even the WAL usage stuff, which was added in Postgres 13. That way the overall structure starts with details of the physical data structures (the table and its indexes), then goes into buffers:

1. Heap pages
2. Heap tuples
3. Index stuff
4. I/O timings (only when track_io_timing is on)
5. avg read rate (always)
6. buffer usage
7. WAL usage.
8. system usage

This would mean that I'd be flipping the order of 7 and 8 relative to Postgres 13 -- meaning there'd be one difference between Postgres 14 and some existing stable release. But I think that putting WAL usage last of all (after system usage) makes little sense -- commit b7ce6de93b shouldn't have done it that way. I always expect to see the getrusage() stuff last. -- Peter Geoghegan
Re: log_autovacuum in Postgres 14 -- ordering issue
Greetings, * Peter Geoghegan (p...@bowt.ie) wrote:
> log_autovacuum output looks like this (as of Postgres 14):
>
> LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1
> pages: 0 removed, 8810377 remain, 0 skipped due to pins, 3044924 frozen
> tuples: 16819838 removed, 576364686 remain, 2207444 are dead but not yet removable, oldest xmin: 88197949
> buffer usage: 174505 hits, 7630386 misses, 5628271 dirtied
> index scan needed: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed
> index "bmsql_order_line_pkey": pages: 2380261 in total, 0 newly deleted, 0 currently deleted, 0 reusable
> avg read rate: 65.621 MB/s, avg write rate: 48.403 MB/s
> I/O timings: read: 65813.666 ms, write: 11310.689 ms
> system usage: CPU: user: 72.55 s, system: 52.07 s, elapsed: 908.42 s
> WAL usage: 7387358 records, 4051205 full page images, 28472185998 bytes
>
> I think that this output is slightly misleading. I'm concerned about the specific order of the lines here: the "buffer usage" line comes after the information that applies specifically to the heap structure, but before the information about indexes. This is the case despite the fact that its output applies to all buffers (not just those for the heap structure).
>
> It would be a lot clearer if the "buffer usage" line was simply moved down. I think that it should appear after the lines that are specific to the table's indexes -- just before the "avg read rate" line. That way we'd group the buffer usage output with all of the other I/O related output that summarizes the VACUUM operation as a whole.
>
> I propose changing the ordering along those lines, and backpatching the change to Postgres 14.

I don't have any particular issue with moving them. Thanks, Stephen
Re: badly calculated width of emoji in psql
On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion wrote: > > Does there need to be any sanity check for overlapping ranges between > the combining and fullwidth sets? The Unicode data on a dev's machine > would have to be broken somehow for that to happen, but it could > potentially go undetected for a while if it did. It turns out I should have done that to begin with. In the Unicode data, it apparently happens that a character can be both combining and wide, and that will cause ranges to overlap in my scheme:

302A..302D;W # Mn [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC ENTERING TONE MARK

{0x3000, 0x303E, 2},
{0x302A, 0x302D, 0},

3099..309A;W # Mn [2] COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK..COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

{0x3099, 0x309A, 0},
{0x3099, 0x30FF, 2},

Going by the above, Jacob's patch from July 21 just happened to be correct by chance, since the combining character search happened first. It seems the logical thing to do is revert my 0001 and 0002 and go back to something much closer to Jacob's patch, plus a big comment explaining that the order in which the searches happen matters. The EastAsianWidth.txt file does have the combining property "Mn" in the comments above, so it's tempting to just read that (plus we could read just one file for these properties). However, it seems risky to rely on comments, since their presence and format is probably less stable than the data format. -- John Naylor EDB: http://www.enterprisedb.com
Re: Autovacuum on partitioned table (autoanalyze)
On Fri, Aug 20, 2021 at 07:55:13AM -0500, Justin Pryzby wrote: > On Tue, Aug 17, 2021 at 06:30:18AM -0500, Justin Pryzby wrote: > > On Mon, Aug 16, 2021 at 05:28:10PM -0500, Justin Pryzby wrote: > > > On Mon, Aug 16, 2021 at 05:42:48PM -0400, Álvaro Herrera wrote: > > > > On 2021-Aug-16, Álvaro Herrera wrote: > > > > > > > > > Here's the reversal patch for the 14 branch. (It applies cleanly to > > > > > master, but the unused member of PgStat_StatTabEntry needs to be > > > > > removed and catversion bumped). > > > > > > > > I have pushed this to both branches. (I did not remove the item from > > > > the release notes in the 14 branch.) > > > > > > |I retained the addition of relkind 'p' to tables included by > > > |pg_stat_user_tables, because reverting that would require a > > > catversion > > > |bump. > > > > > > Right now, on v15dev, it shows 0, which is misleading. > > > Shouldn't it be null ? > > > > > > analyze_count | 0 > > > > > > Note that having analyze_count and last_analyze would be an > > > independently > > > useful change. Since parent tables aren't analyzed automatically, I have > > > a > > > script to periodically process them if they weren't processed recently. > > > Right > > > now, for partitioned tables, the best I could find is to check its > > > partitions: > > > | MIN(last_analyzed) FROM pg_stat_all_tables psat JOIN pg_inherits i ON > > > psat.relid=i.inhrelid > > > > > > In 20200418050815.ge26...@telsasoft.com I wrote: > > > |This patch includes partitioned tables in pg_stat_*_tables, which is > > > great; I > > > |complained awhile ago that they were missing [0]. It might be useful if > > > that > > > |part was split out into a separate 0001 patch (?). > > > | [0] > > > https://www.postgresql.org/message-id/20180601221428.GU5164%40telsasoft.com > > > > I suggest the attached (which partially reverts the revert), to allow > > showing > > correct data for analyze_count and last_analyzed. > > Álvaro, would you comment on this ? > > To me this could be an open item, but someone else should make that > determination.

I added an open item until this is discussed.

| pg_stats includes partitioned tables, but always shows analyze_count=0
| Owner: Alvaro Herrera

Possible solutions, in decreasing order of my own preference:

- Partially revert the revert, as proposed, to have "analyze_count" and "last_analyzed" work properly for partitioned tables. This doesn't suffer from any of the problems that led to the revert, does it?
- Update the .c code to return analyze_count=NULL for partitioned tables.
- Update the catalog definition to exclude partitioned tables, again. Requires a catversion bump.
- Document that analyze_count=NULL for partitioned tables. That seems to just document a misbehavior.

-- Justin
Re: log_autovacuum in Postgres 14 -- ordering issue
On Wed, Aug 25, 2021 at 10:34 AM Peter Geoghegan wrote: > It would be a lot clearer if the "buffer usage" line was simply moved > down. I think that it should appear after the lines that are specific > to the table's indexes -- just before the "avg read rate" line. That > way we'd group the buffer usage output with all of the other I/O > related output that summarizes the VACUUM operation as a whole. > The last two lines are also "*** usage" -- shouldn't the buffer numbers be next to them?
Re: archive status ".ready" files may be created too early
On 8/25/21, 11:01 AM, "Fujii Masao" wrote: > If LogwrtResult.Flush >= EndPos, which means that another process already > has flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush. > This situation also means that that another process called > NotifySegmentsReadyForArchive(LogwrtResult.Flush). Right? If the segment boundary wasn't registered before the other process called NotifySegmentsReadyForArchive(), then it couldn't have used the boundary for deciding which .ready files to create. > If this understanding is right, there seems no need to wake walwriter up here > so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) gain. > Thought? We're actually discussing this right now in another thread [0]. I think we might be able to get rid of that part if we move the boundary registration to before we release the WAL insert lock(s). Nathan [0] https://postgr.es/m/DE60B9AA-9670-47DA-9678-6C79BCD884E3%40amazon.com
Re: prevent immature WAL streaming
On 8/25/21, 5:33 AM, "alvhe...@alvh.no-ip.org" wrote:
> On 2021-Aug-24, Bossart, Nathan wrote:
>> If moving RegisterSegmentBoundary() is sufficient to prevent the flush
>> pointer from advancing before we register the boundary, I bet we could
>> also remove the WAL writer nudge.
>
> Can you elaborate on this? I'm not sure I see the connection.

The reason we are moving RegisterSegmentBoundary() to before WALInsertLockRelease() is because we believe it will prevent boundary registration from taking place after the flush pointer has already advanced past the boundary in question. We had added the WAL writer nudge to make sure we called NotifySegmentsReadyForArchive() whenever that happened. If moving boundary registration to before we release the lock(s) is enough to prevent the race condition with the flush pointer, then ISTM we no longer have to worry about nudging the WAL writer.

>> Another interesting thing I see is that the boundary stored in
>> earliestSegBoundary is not necessarily the earliest one. It's just
>> the first one that has been registered. I did this for simplicity for
>> the .ready file fix, but I can see it causing problems here.
>
> Hmm, is there really a problem here? Surely the flush point cannot go
> past whatever has been written. If somebody is writing an earlier
> section of WAL, then we cannot move the flush pointer to a later
> position. So it doesn't matter if the earliest point we have registered
> is the true earliest -- we only care for it to be the earliest that is
> past the flush point.

Let's say we have the following situation (F = flush, E = earliest registered boundary, and L = latest registered boundary), and let's assume that each segment has a cross-segment record that ends in the next segment.

     F E     L
|-|-|-|-|-|-|-|-|
 1 2 3 4 5 6 7 8

Then, we write out WAL to disk and create .ready files as needed. If we didn't flush beyond the latest registered boundary, the latest registered boundary now becomes the earliest boundary.

           F E
|-|-|-|-|-|-|-|-|
 1 2 3 4 5 6 7 8

At this point, the earliest segment boundary past the flush point is before the "earliest" boundary we are tracking. Nathan
Re: archive status ".ready" files may be created too early
On 2021/08/24 4:55, alvhe...@alvh.no-ip.org wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
>> Ah, okay. BTW the other changes you mentioned made sense to me.
>
> Thanks. I've pushed this now to all live branches.

Thanks a lot!

+	/*
+	 * There's a chance that the record was already flushed to disk and we
+	 * missed marking segments as ready for archive. If this happens, we
+	 * nudge the WALWriter, which will take care of notifying segments as
+	 * needed.
+	 */
+	if (StartSeg != EndSeg && XLogArchivingActive() &&
+		LogwrtResult.Flush >= EndPos && ProcGlobal->walwriterLatch)
+		SetLatch(ProcGlobal->walwriterLatch);

Is this really necessary? If LogwrtResult.Flush >= EndPos, that means another process has already flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush. This situation also means that that other process called NotifySegmentsReadyForArchive(LogwrtResult.Flush). Right? If this understanding is right, there seems to be no need to wake the walwriter up here so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) again. Thought? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: The Free Space Map: Problems and Opportunities
On Mon, Aug 23, 2021 at 5:55 PM Peter Geoghegan wrote: > Right now my prototype has a centralized table in shared memory, with > a hash table. One entry per relation, generally multiple freelists per > relation. And with per-freelist metadata such as owner and original > leader backend XID values. Plus of course the lists of free blocks > themselves. The prototype already clearly fixes the worst problems > with BenchmarkSQL, but that's only one of my goals. That's just the > starting point. > > I appreciate your input on this. And not just on the implementation > details -- the actual requirements themselves are still in flux. This > isn't so much a project to replace the FSM as it is a project that > adds a new rich abstraction layer that goes between access methods and > smgr.c -- free space management is only one of the responsibilities. Makes sense. I think one of the big implementation challenges here is coping with the scenario where there's not enough shared memory available ... or else somehow making that impossible without reserving an unreasonable amount of shared memory. If you allowed space for every buffer to belong to a different relation and have the maximum number of leases and whatever, you'd probably have no possibility of OOM, but you'd probably be pre-reserving too much memory. I also think there are some implementation challenges around locking. You probably need some, because the data structure is shared, but because it's complex, it's not easy to create locking that allows for good concurrency. Or so I think. Andres has been working -- I think for years now -- on replacing the buffer mapping table with a radix tree of some kind. That strikes me as very similar to what you're doing here. The per-relation data can then include not only the kind of stuff you're talking about but very fundamental things like how long it is and where its buffers are in the buffer pool. Hopefully we don't end up with dueling patches. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Mark all GUC variable as PGDLLIMPORT
On 2021-Aug-25, Magnus Hagander wrote: > The thing we need the PGDLLIMPORT definition for is to *import* them > on the other end? Oh ... so modules that are willing to cheat can include their own declarations of the variables they need, and mark them __declspec (dllimport)? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "The Gord often wonders why people threaten never to come back after they've been told never to return" (www.actsofgord.com)
log_autovacuum in Postgres 14 -- ordering issue
log_autovacuum output looks like this (as of Postgres 14):

LOG: automatic vacuum of table "regression.public.bmsql_order_line": index scans: 1
pages: 0 removed, 8810377 remain, 0 skipped due to pins, 3044924 frozen
tuples: 16819838 removed, 576364686 remain, 2207444 are dead but not yet removable, oldest xmin: 88197949
buffer usage: 174505 hits, 7630386 misses, 5628271 dirtied
index scan needed: 1959301 pages from table (22.24% of total) had 11745226 dead item identifiers removed
index "bmsql_order_line_pkey": pages: 2380261 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 65.621 MB/s, avg write rate: 48.403 MB/s
I/O timings: read: 65813.666 ms, write: 11310.689 ms
system usage: CPU: user: 72.55 s, system: 52.07 s, elapsed: 908.42 s
WAL usage: 7387358 records, 4051205 full page images, 28472185998 bytes

I think that this output is slightly misleading. I'm concerned about the specific order of the lines here: the "buffer usage" line comes after the information that applies specifically to the heap structure, but before the information about indexes. This is the case despite the fact that its output applies to all buffers (not just those for the heap structure).

It would be a lot clearer if the "buffer usage" line was simply moved down. I think that it should appear after the lines that are specific to the table's indexes -- just before the "avg read rate" line. That way we'd group the buffer usage output with all of the other I/O related output that summarizes the VACUUM operation as a whole.

I propose changing the ordering along those lines, and backpatching the change to Postgres 14. -- Peter Geoghegan
Re: badly calculated width of emoji in psql
On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion wrote: > > On Fri, 2021-08-20 at 13:05 -0400, John Naylor wrote: > > On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion wrote: > > > I guess it just depends on what the end result looks/performs like. > > > We'd save seven hops or so in the worst case? > > > > Something like that. Attached is what I had in mind (using real > > patches to see what the CF bot thinks): > > > > 0001 is a simple renaming > > 0002 puts the char width inside the mbinterval so we can put arbitrary values there > > 0002 introduces a mixed declarations/statements warning for > ucs_wcwidth(). Other than that, LGTM overall.

I didn't see that warning with clang 12, either with or without assertions, but I do see it on gcc 11. Fixed, and pushed 0001 and 0002. I decided against squashing them, since my original instinct was correct -- the header changes too much for git to consider it the same file, which may make archeology harder.

> > --- a/src/common/wchar.c
> > +++ b/src/common/wchar.c
> > @@ -583,9 +583,9 @@ pg_utf_mblen(const unsigned char *s)
> >
> >  struct mbinterval
> >  {
> > -	unsigned short first;
> > -	unsigned short last;
> > -	signed short width;
> > +	unsigned int first;
> > +	unsigned int last:21;
> > +	signed int width:4;
> >  };

> Oh, right -- my patch moved mbinterval from short to int, but should I have used uint32 instead? It would only matter in theory for the `first` member now that the bitfields are there.

I'm not sure it would matter, but the usual type for codepoints is unsigned.

> > I think the adjustments to 0003 result in a cleaner and more > > extensible design, but a case could be made otherwise. The former > > combining table script is a bit more complex than the sum of its > > former self and Jacob's proposed new script, but just slightly.
>
> The microbenchmark says it's also more performant, so +1 to your version.
>
> Does there need to be any sanity check for overlapping ranges between the combining and fullwidth sets? The Unicode data on a dev's machine would have to be broken somehow for that to happen, but it could potentially go undetected for a while if it did.

Thanks for testing again! The sanity check sounds like a good idea, so I'll work on that and push soon. -- John Naylor EDB: http://www.enterprisedb.com
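A sanity check of the kind discussed here can be quite small once the combining and fullwidth ranges are merged into one sorted table: assert that consecutive intervals are strictly disjoint. A minimal sketch follows, with the struct layout taken from the fragment quoted above; everything else (the table contents, the check itself) is assumed for illustration and is not the actual committed code:

#include <assert.h>

struct mbinterval
{
	unsigned int first;
	unsigned int last:21;
	signed int	width:4;
};

/* hypothetical merged table; the real data comes from the generator script */
static const struct mbinterval table[] = {
	{0x0300, 0x036F, 0},		/* combining */
	{0x1100, 0x115F, 2},		/* East Asian wide */
};

int
main(void)
{
	int			n = sizeof(table) / sizeof(table[0]);

	/* entries must be sorted by codepoint and strictly disjoint */
	for (int i = 1; i < n; i++)
		assert(table[i - 1].last < table[i].first);

	return 0;
}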
Re: [PATCH] document
On Wed, Aug 25, 2021 at 09:50:13AM -0400, Tom Lane wrote: > Fujii Masao writes: > > When I applied the patch to the master, I found that the table entries for > > those functions were added into the table for aclitem functions in the docs. > > I think this is not a valid position and needs to be moved to the proper one > > (maybe the table for system catalog information functions?). > > You have to be very careful these days when applying stale patches to > func.sgml --- there's enough duplicate boilerplate that "patch" can easily > be fooled into dumping an addition into the wrong place. I doubt that > the submitter meant the doc addition to go there. I suppose one solution to this is to use git format-patch -U11 or similar, at least for doc/. Or write the "duplicate boilerplate" across fewer lines. And another is to add comments before and/or after each. -- Justin
Re: Regression tests for MobilityDB: Continous shutdowns at a random step
Esteban Zimanyi writes: > However, I continuously receive at a random step in the process the > following error in the log file > 2021-08-25 16:48:13.608 CEST [22375] LOG: received fast shutdown request This indicates that something sent the postmaster SIGINT. You need to look around for something in your test environment that would do that. Possibly you need to decouple the test processes from your terminal session? regards, tom lane
Regression tests for MobilityDB: Continous shutdowns at a random step
Hello While executing the regression tests for MobilityDB, I load a predefined database on which I run the tests and then compare the results obtained with those expected. All the tests are driven by the following bash file: https://github.com/MobilityDB/MobilityDB/blob/develop/test/scripts/test.sh However, I continuously receive, at a random step in the process, the following error in the log file:

2021-08-25 16:48:13.608 CEST [22375] LOG: received fast shutdown request
2021-08-25 16:48:13.622 CEST [22375] LOG: aborting any active transactions
2021-08-25 16:48:13.622 CEST [22375] LOG: background worker "logical replication launcher" (PID 22382) exited with exit code 1
2021-08-25 16:48:13.623 CEST [22377] LOG: shutting down
2021-08-25 16:48:13.971 CEST [22375] LOG: database system is shut down

and sometimes I need to relaunch the whole build process in CMake *numerous* times to finalize the tests: https://github.com/MobilityDB/MobilityDB/blob/develop/CMakeLists.txt

/* While on MobilityDB/build directory */
rm -rf *
cmake ..
make
make test

Any idea where I can begin looking for the problem? Thanks for your help Esteban
Re: PostgreSQL <-> Babelfish integration
On Mon, 15 Feb 2021 at 17:01, Finnerty, Jim wrote: > > We are applying the Babelfish commits to the REL_12_STABLE branch now, and > the plan is to merge them into the REL_13_STABLE and master branch ASAP after > that. There should be a publicly downloadable git repository before very > long. Hi, Out of curiosity, are you able to share the status on the publication of this repository? I mainly ask this because I haven't seen any announcements from Amazon / AWS regarding the publication of the Babelfish project since the start of this thread, and the relevant websites [0][1][2] also do not appear to have seen an update. The last mention of babelfish in a thread here on -hackers also only seem to date back to late March. Kind regards, Matthias van de Meent [0] https://babelfish-for-postgresql.github.io/babelfish-for-postgresql/ [1] https://aws.amazon.com/rds/aurora/babelfish/ [2] https://github.com/babelfish-for-postgresql/
Re: Mark all GUC variable as PGDLLIMPORT
On Wed, Aug 25, 2021 at 4:41 PM Tom Lane wrote: > > Magnus Hagander writes: > > On Wed, Aug 25, 2021 at 4:06 PM Robert Haas wrote: > >> It does tend to be controversial, but I think that's basically only > >> because Tom Lane has reservations about it. I think if Tom dropped his > >> opposition to this, nobody else would really care. And I think that > >> would be a good thing for the project. > > > But in particular, both on that argument, and on the general > > maintenance argument, I have a very hard time seeing how exporting the > > GUC variables would be any worse than exporting the many hundreds of > > functions we already export. > > My beef about it has nothing to do with binary-size concerns, although > that is an interesting angle. (I wonder whether marking a variable > PGDLLIMPORT has any negative effect on the cost of accessing it from > within the core code?)

It should have no effect on local code. PGDLLIMPORT turns into "__declspec (dllexport)" when building the backend, which should have no effect on imports (it turns into __declspec (dllimport) only when building a frontend, but that's why we need it in the headers, IIRC). The only overhead I've seen discussed in the docs around that is the overhead of exporting by name vs. exporting by ordinal.

> Rather, I'm unhappy with spreading even more > Microsoft-droppings all over our source. If there were some way to > just do this automatically for all global variables without any source > changes, I'd be content with that. That would *really* make the > platforms more nearly equivalent.

Actually, it's clearly been a while since I dug into this. AFAICT we *do* actually export all the data symbols as well? At least in the MSVC build, where we use gendef.pl? Specifically, see a5eed4d770. The thing we need the PGDLLIMPORT definition for is to *import* them on the other end? -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
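For reference, the definition in question lives in src/include/port/win32.h and boils down to roughly the following (check the header for the exact text):

#ifdef BUILDING_DLL
#define PGDLLIMPORT __declspec (dllexport)
#else
#define PGDLLIMPORT __declspec (dllimport)
#endif

So when compiling the backend itself (BUILDING_DLL is defined), a marked symbol is exported; when the same header is compiled into frontend or extension code, the marker becomes an import declaration.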
Re: Mark all GUC variable as PGDLLIMPORT
Magnus Hagander writes: > On Wed, Aug 25, 2021 at 4:06 PM Robert Haas wrote: >> It does tend to be controversial, but I think that's basically only >> because Tom Lane has reservations about it. I think if Tom dropped his >> opposition to this, nobody else would really care. And I think that >> would be a good thing for the project. > But in particular, both on that argument, and on the general > maintenance argument, I have a very hard time seeing how exporting the > GUC variables would be any worse than exporting the many hundreds of > functions we already export. My beef about it has nothing to do with binary-size concerns, although that is an interesting angle. (I wonder whether marking a variable PGDLLIMPORT has any negative effect on the cost of accessing it from within the core code?) Rather, I'm unhappy with spreading even more Microsoft-droppings all over our source. If there were some way to just do this automatically for all global variables without any source changes, I'd be content with that. That would *really* make the platforms more nearly equivalent. regards, tom lane
Re: Mark all GUC variable as PGDLLIMPORT
On Wed, Aug 25, 2021 at 4:06 PM Robert Haas wrote: > > On Tue, Aug 24, 2021 at 5:06 PM Chapman Flack wrote: > > The thing is, I think I have somewhere a list of all the threads on this > > topic that I've read through since the first time I had to come with my own > > hat in hand asking for a PGDLLIMPORT on something, years ago now, and > > I don't think I have ever seen one where it was as uncontroversial > > as you suggest. > > It does tend to be controversial, but I think that's basically only > because Tom Lane has reservations about it. I think if Tom dropped his > opposition to this, nobody else would really care. And I think that > would be a good thing for the project. I have only one consideration about it, and that's a technical one :) Does this in some way have an effect on the size of the binary and/or the time it takes to load it? I ask because, IIRC, back in the prehistoric days, adding all the *functions* to the list of exports had a very significant impact on the size of the binary, and some (but not very large) impact on the loading time. Of course, we had to do that so that even our own libraries would properly load. And at the time it was decided that we definitely wanted to export all functions and not just the ones that we would somehow define as an API. Now, I'm pretty sure that the GUCs are few enough that this should have no measurable effect on size/load time, but it's something that should be verified. But in particular, both on that argument, and on the general maintenance argument, I have a very hard time seeing how exporting the GUC variables would be any worse than exporting the many hundreds of functions we already export. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Async-unsafe functions in signal handlers
Hello all, I am going to refactor Greenplum backtraces for error messages and want to make them more compatible with PostgreSQL code. Backtraces in PostgreSQL were introduced by commit 71a8a4f6e36547bb060dbcc961ea9b57420f7190 (original discussion https://www.postgresql.org/message-id/CAMsr+YGL+yfWE=jvbubnpwtrrzney7hj07+zt4byjdvp4sz...@mail.gmail.com ) and rely on the backtrace() and backtrace_symbols() functions. They are used inside errfinish(), which is wrapped by the ereport() macro. ereport() is invoked inside the bgworker_die() and FloatExceptionHandler() signal handlers. I am confused by this fact: both backtrace functions are async-unsafe. backtrace_symbols() is always unsafe (it allocates memory), and backtrace() is unsafe on its first call, due to a lazy dlopen. I wonder why PostgreSQL uses async-unsafe functions in signal handlers? Best regards, Denis Smirnov | Developer s...@arenadata.io Arenadata | Godovikova 9-17, Moscow 129085 Russia
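For context, the usual mitigation on glibc is to call backtrace() once during startup, outside any signal handler, so its one-time dlopen of libgcc happens early, and then to use backtrace_symbols_fd(), which writes straight to a file descriptor without calling malloc(), inside the handler. An illustrative sketch (not PostgreSQL code; all names here are invented):

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void
preload_backtrace_machinery(void)
{
	void	   *frames[2];

	/* first call may dlopen libgcc; do it outside any signal handler */
	(void) backtrace(frames, 2);
}

static void
crash_handler(int signo)
{
	void	   *frames[32];
	int			n = backtrace(frames, 32);

	/* writes directly to the fd, no malloc involved */
	backtrace_symbols_fd(frames, n, STDERR_FILENO);
	_exit(1);
}

int
main(void)
{
	preload_backtrace_machinery();
	signal(SIGSEGV, crash_handler);
	/* ... application code ... */
	return 0;
}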
Re: Postgres perl module namespace
On Wed, Aug 25, 2021 at 1:48 AM Michael Paquier wrote: > On Mon, Aug 23, 2021 at 03:39:15PM -0400, Robert Haas wrote: > > On Mon, Aug 23, 2021 at 3:03 PM Andrew Dunstan wrote: > >> OK, I count 3 in favor of changing to PgTest::Cluster, 1 against, > >> remainder don't care. > > > > I'd have gone with something starting with Postgres:: ... but I don't care > > much. > > PgTest seems like a better choice to me, as "Postgres" could be used > for other purposes than a testing framework, and the argument that > multiple paths makes things annoying for hackers is sensible. I mean, it's a hierarchical namespace. The idea is you do Postgres::Test or Postgres:: and other people using the Postgres database can use other parts of it. But again, not really worth arguing about. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Mark all GUC variable as PGDLLIMPORT
On Tue, Aug 24, 2021 at 5:06 PM Chapman Flack wrote: > The thing is, I think I have somewhere a list of all the threads on this > topic that I've read through since the first time I had to come with my own > hat in hand asking for a PGDLLIMPORT on something, years ago now, and > I don't think I have ever seen one where it was as uncontroversial > as you suggest. It does tend to be controversial, but I think that's basically only because Tom Lane has reservations about it. I think if Tom dropped his opposition to this, nobody else would really care. And I think that would be a good thing for the project. > In each iteration, I think I've also seen a countervailing view expressed > in favor of looking into whether globals visibility could be further > /reduced/. But, like I say, that's only a view that gets advanced as a reason not to mark things PGDLLIMPORT. Nobody ever wants that thing for its own sake. I think it's a complete red herring. If and when somebody wants to make a serious proposal to do something like that, it can be considered on its own merits. But between now and then, refusing to make things work on Windows as they do on Linux does not benefit anyone. A ton of work has been made into making PostgreSQL portable over the years, and abandoning it in just this one case is unreasonable. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila wrote: > > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada wrote: > > > > I did a quick check with the following tap test code: > > > > $node_publisher->poll_query_until('postgres', > > qq( > > select 1 != foo.column1 from (values(0), (1)) as foo; > > )); > > > > The query returns {t, f} but poll_query_until() never finished. The > > same is true when the query returns {f, t}. > > Yes, this is true, I also see the same behaviour. > > This means something different is going on in Ajin's setup. Ajin, can > you please share how you confirmed your findings about poll_query? Relooking at my logs, I think what happens is this: 1. First walsender 'a' is running. 2. Second walsender 'b' starts and, attempting to acquire the slot, finds that the slot is active for pid a. 3. Now both walsenders are active, and the query does not return. 4. First walsender 'a' times out and exits. 5. Now only the second walsender is active and the query returns OK because pid != a. 6. Second walsender exits with error. 7. Another query attempts to get the pid of the running walsender for tap_sub but returns null because both walsenders have exited. 8. This null return value results in the next query erroring out and the test failing. > Can you additionally check the value of 'state' from > pg_stat_replication for both the old and new walsender sessions? Yes, will try this and post a patch tomorrow. regards, Ajin Cherian Fujitsu Australia
Re: Remove Value node struct
Peter Eisentraut writes: > While trying to refactor the node support in various ways, the Value > node is always annoying. […] > This change removes the Value struct and node type and replaces them > by separate Integer, Float, String, and BitString node types that are > proper node types and structs of their own and behave mostly like > normal node types. This looks like a nice cleanup overall, independent of any future refactoring. > Also, this removes the T_Null node tag, which was previously also a > possible variant of Value but wasn't actually used outside of the > Value contained in A_Const. Replace that by an isnull field in > A_Const. However, the patch adds: > +typedef struct Null > +{ > + NodeTag type; > + char *val; > +} Null; which doesn't seem to be used anywhere. Is that a leftover from an intermediate development stage? - ilmari
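For context, a rough sketch of what the separated node types look like (the field names here are assumptions and may differ from the patch as posted):

#include "nodes/nodes.h"        /* for NodeTag */

typedef struct Integer
{
    NodeTag     type;
    int         ival;
} Integer;

typedef struct Float
{
    NodeTag     type;
    char       *fval;           /* kept as a string to avoid precision loss */
} Float;

typedef struct String
{
    NodeTag     type;
    char       *sval;
} String;

typedef struct BitString
{
    NodeTag     type;
    char       *bsval;
} BitString;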
Re: prevent immature WAL streaming
On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi wrote: > At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera > wrote in > > I'd also like to have tests. That seems moderately hard, but if we had > > WAL-molasses that could be used in walreceiver, it could be done. (It > > sounds easier to write tests with a molasses-archive_command.) > > > > > > [1] https://postgr.es/m/cbddfa01-6e40-46bb-9f98-9340f4379...@amazon.com > > [2] > > https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan@alibaba-inc.com > > (I'm not sure what "WAL-molasses" above expresses, same as "sugar"?) I think, but am not 100% sure, that "molasses" here is being used to refer to fault injection. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] document
Fujii Masao writes: > When I applied the patch to the master, I found that the table entries for > those functions were added into the table for aclitem functions in the docs. > I think this is not a valid position and needs to be moved to the proper one > (maybe the table for system catalog information functions?). You have to be very careful these days when applying stale patches to func.sgml --- there's enough duplicate boilerplate that 'patch' can easily be fooled into dumping an addition into the wrong place. I doubt that the submitter meant the doc addition to go there. regards, tom lane
Re: Remove Value node struct
On Wed, Aug 25, 2021 at 9:20 AM Peter Eisentraut wrote: > This change removes the Value struct and node type and replaces them > by separate Integer, Float, String, and BitString node types that are > proper node types and structs of their own and behave mostly like > normal node types. +1. I noticed this years ago and never thought of doing anything about it. I'm glad you did think of it... -- Robert Haas EDB: http://www.enterprisedb.com
Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
On Wed, Aug 25, 2021 at 5:36 AM Greg Nancarrow wrote: > I've attached an updated patch, hopefully more along the lines that > you were thinking of. LGTM. Committed and back-patched to v10 and up. In theory the same bug exists in 9.6, but you'd have to have third-party code using the parallel context infrastructure in order to hit it. If the patch had back-patched cleanly, I would have done so just in case, but shm_toc_lookup lacks a bool noError option in that version. -- Robert Haas EDB: http://www.enterprisedb.com
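For reference, a sketch of how the newer shm_toc_lookup() API (v10 and later) lets a caller treat a missing TOC entry as optional; PARALLEL_KEY_EXAMPLE is an invented key for illustration:

#include "postgres.h"
#include "storage/shm_toc.h"

#define PARALLEL_KEY_EXAMPLE    UINT64CONST(0xFFFFFFFFFFFF0001)

static void
restore_optional_state(shm_toc *toc)
{
    /* noError = true: return NULL instead of erroring on a missing key. */
    char   *state = shm_toc_lookup(toc, PARALLEL_KEY_EXAMPLE, true);

    if (state == NULL)
        return;                 /* entry absent; nothing to restore */

    /* ... deserialize from 'state' here ... */
}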
Re: Some RELKIND macro refactoring
On 2021-Aug-25, Michael Paquier wrote: > On Tue, Aug 24, 2021 at 12:01:33PM +0200, Peter Eisentraut wrote: > > While analyzing this again, I think I found an existing mistake. The > > handling of RELKIND_PARTITIONED_INDEX in RelationGetNumberOfBlocksInFork() > > seems to be misplaced. See attached patch. Agreed, that's a mistake. > Right. This maps with RELKIND_HAS_STORAGE(). Makes me wonder whether > it would be better to add a check on RELKIND_HAS_STORAGE() in this > area, even if that's basically the same as the Assert() already used > in this code path. Well, the patch replaces the switch on individual relkind values with if tests on RELKIND_HAS_FOO macros. I suppose we'd have Assert(RELKIND_HAS_STORAGE(relkind)); so the function would not even be called for partitioned indexes. (In a quick scan of 'git grep RelationGetNumberOfBlocks' I see nothing that would obviously call this function on a partitioned index.) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "XML!" Exclaimed C++. "What are you doing here? You're not a programming language." "Tell that to the people who use me," said XML. https://burningbird.net/the-parable-of-the-languages/
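A sketch of the shape being discussed (illustrative only, the real function does more than this):

#include "postgres.h"
#include "catalog/pg_class.h"
#include "storage/smgr.h"
#include "utils/rel.h"

BlockNumber
RelationGetNumberOfBlocksInFork(Relation relation, ForkNumber forkNum)
{
    /*
     * Relations without storage (views, partitioned tables, partitioned
     * indexes, ...) must never get here.
     */
    Assert(RELKIND_HAS_STORAGE(relation->rd_rel->relkind));

    return smgrnblocks(RelationGetSmgr(relation), forkNum);
}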
Re: [PATCH] document
On 2021/07/14 14:45, Laurenz Albe wrote: On Wed, 2021-07-14 at 14:43 +0900, Ian Lawrence Barwick wrote: Hi The description for "pg_database" [1] mentions the function "pg_encoding_to_char()", but this is not described anywhere in the docs. Given that it (and the corresponding "pg_char_to_encoding()") have been around since 7.0 [2], it's probably not a burning issue, but it seems not entirely unreasonable to add short descriptions for both (and link from "pg_conversion" while we're at it); see attached patch. "System Catalog Information Functions" seems the most logical place to put these. [1] https://www.postgresql.org/docs/current/catalog-pg-database.html [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5eb1d0deb15f2b7cd0051bef12f3e091516c723b Will add to the next commitfest. +1 +1 When I applied the patch to the master, I found that the table entries for those functions were added into the table for aclitem functions in the docs. I think this is not a valid position and needs to be moved to the proper one (maybe the table for system catalog information functions?). +int +pg_encoding_to_char ( encoding int ) It's better to s/int/integer because the entries for other functions in func.sgml use "integer". Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Remove Value node struct
While trying to refactor the node support in various ways, the Value node is always annoying. The Value node struct is a weird construct. It is its own node type, but most of the time, it actually has a node type of Integer, Float, String, or BitString. As a consequence, the struct name and the node type don't match most of the time, and so it has to be treated specially a lot. There doesn't seem to be any value in the special construct. There is very little code that wants to accept all Value variants but nothing else (and even if it did, this doesn't provide any convenient way to check it), and most code wants either just one particular node type (usually String), or it accepts a broader set of node types besides just Value. This change removes the Value struct and node type and replaces them by separate Integer, Float, String, and BitString node types that are proper node types and structs of their own and behave mostly like normal node types. Also, this removes the T_Null node tag, which was previously also a possible variant of Value but wasn't actually used outside of the Value contained in A_Const. Replace that by an isnull field in A_Const. From 14bfa252aa64e52528ecbd457154cb3c9d32781a Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 25 Aug 2021 13:41:09 +0200 Subject: [PATCH 1/2] Remove useless casts Casting the argument of strVal() to (Value *) is useless, since strVal() already does that. Most code didn't do that anyway; this was apparently just a style that snuck into certain files. --- src/backend/catalog/objectaddress.c | 20 ++-- src/backend/commands/alter.c| 16 src/backend/commands/comment.c | 2 +- src/backend/commands/dropcmds.c | 16 src/backend/commands/statscmds.c| 2 +- 5 files changed, 28 insertions(+), 28 deletions(-) diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c index 9882e549c4..7423b99265 100644 --- a/src/backend/catalog/objectaddress.c +++ b/src/backend/catalog/objectaddress.c @@ -2386,7 +2386,7 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address, case OBJECT_DATABASE: if (!pg_database_ownercheck(address.objectId, roleid)) aclcheck_error(ACLCHECK_NOT_OWNER, objtype, - strVal((Value *) object)); + strVal(object)); break; case OBJECT_TYPE: case OBJECT_DOMAIN: @@ -2433,7 +2433,7 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address, case OBJECT_SCHEMA: if (!pg_namespace_ownercheck(address.objectId, roleid)) aclcheck_error(ACLCHECK_NOT_OWNER, objtype, - strVal((Value *) object)); + strVal(object)); break; case OBJECT_COLLATION: if (!pg_collation_ownercheck(address.objectId, roleid)) @@ -2448,27 +2448,27 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address, case OBJECT_EXTENSION: if (!pg_extension_ownercheck(address.objectId, roleid)) aclcheck_error(ACLCHECK_NOT_OWNER, objtype, - strVal((Value *) object)); + strVal(object)); break; case OBJECT_FDW: if (!pg_foreign_data_wrapper_ownercheck(address.objectId, roleid)) aclcheck_error(ACLCHECK_NOT_OWNER, objtype, - strVal((Value *) object)); + strVal(object)); break; case OBJECT_FOREIGN_SERVER: if (!pg_foreign_server_ownercheck(address.objectId, roleid)) aclcheck_error(ACLCHECK_NOT_OWNER, objtype, - strVal((Value *) object)); + strVal(object)); break; case OBJECT_EVENT_TRIGGER: if (!pg_event_trigger_ownercheck(address.objectId, roleid)) aclcheck_error(ACLCHECK_NOT_OWNER, objtype, - strVal((Value *) object)); + strVal(object)); break; case OBJECT_LANGUAGE: if (!pg
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada wrote: > > I did a quick check with the following tap test code: > > $node_publisher->poll_query_until('postgres', > qq( > select 1 != foo.column1 from (values(0), (1)) as foo; > )); > > The query returns {t, f} but poll_query_until() never finished. The > same is true when the query returns {f, t}. > This means something different is going on in Ajin's setup. Ajin, can you please share how you confirmed your findings about poll_query? -- With Regards, Amit Kapila.
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 5:54 PM Ajin Cherian wrote: > > On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada wrote: > > > > > IIUC the query[1] used for polling returns two rows in this case: {t, > > f} or {f, t}. But did poll_query_until() return OK in this case even > > if we expected one row of 't'? My guess of how this issue happened is: > > > > 1. the first polling query after "ALTER SUBSCRIPTION CONNECTION" > > passed (for some reason). > > 2. all wal senders exited. > > 3. get the pid of wal sender with application_name 'tap_sub' but got > > nothing. > > 4. the second polling query resulted in a syntax error since $oldpid is > > null. > > > > If the fact that two walsenders with the same application_name could > > be present in pg_stat_replication view was the cause of this issue, > > poll_query_until() should return OK even if we expected just 't'. I > > might be missing something, though. > > > > [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE > > application_name = '$appname';" > > Yes, the query [1] returns OK with a {f,t} or {t,f} > > [1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE > application_name = '$appname';" > Can you additionally check the value of 'state' from pg_stat_replication for both the old and new walsender sessions? -- With Regards, Amit Kapila.
Re: Added schema level support for publication.
On Sat, Aug 14, 2021 at 3:02 PM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > On 13.08.21 04:59, Amit Kapila wrote: > >> Even if we drop all tables added to the publication from it, 'pubkind' > >> doesn't go back to 'empty'. Is that intentional behavior? If we do > >> that, we can save the lookup of pg_publication_rel and > >> pg_publication_schema in some cases, and we can switch the publication > >> that was created as FOR SCHEMA to FOR TABLE and vice versa. > >> > > Do we really want to allow users to change a publication that is FOR > > SCHEMA to FOR TABLE? I see that we don't allow to do that FOR TABLES. > > postgres=# Alter Publication pub add table tbl1; > > ERROR: publication "pub" is defined as FOR ALL TABLES > > DETAIL: Tables cannot be added to or dropped from FOR ALL TABLES publications. > > I think the strict separation between publication-for-tables vs. > publication-for-schemas is a mistake. Why can't I have a publication > that publishes tables t1, t2, t3, *and* schemas s1, s2, s3. Also note > that we have a pending patch to add sequences support to logical > replication. So eventually, a publication will be able to contain a > bunch of different objects of different kinds. Thanks for the feedback. Now the same publication can handle schemas as well as tables, and can be extended further to other objects. This is handled in the v21 patch attached at [1]. It is supported by the following syntax: CREATE PUBLICATION pub FOR ALL TABLES IN SCHEMA s1,s2,s3, TABLE t1,t2,t3; [1] - https://www.postgresql.org/message-id/CALDaNm2iKJvSdCyh0S%2BwYgFjMNB4hu3kYjk%3DYrEkpqTJY9zW%2Bw%40mail.gmail.com Regards, Vignesh
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 9:23 PM Amit Kapila wrote: > > On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada wrote: > > > > On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian wrote: > > > > > > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian wrote: > > > > > > > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila > > > > wrote: > > > > > > > > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian > > > > > wrote: > > > > > > > > > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila > > > > > > wrote: > > > > > > > > > > > > > But will poll function still poll or exit? Have you tried that? > > > > > > > > > > > > I have forced that condition with a changed query and found that the > > > > > > poll will not exit in case of a NULL return. > > > > > > > > > > > > > > > > What if the query in a poll is fired just before we get an error > > > > > "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? > > > > > Won't at that stage both old and new walsender's are present, so the > > > > > query might return true. You can check that via debugger by stopping > > > > > just before this error occurs and then check pg_stat_replication view. > > > > > > > > If this error happens then the PID is NOT updated as the pid in the > > > > Replication slot. I have checked this > > > > and explained this in my first email itself > > > > > > > > > > Sorry about the above email, I misunderstood. I was looking at > > > pg_stat_replication_slot rather than pg_stat_replication hence the > > > confusion. > > > Amit is correct, just prior to the walsender erroring out, it briefly > > > appears in the > > > pg_stat_replication, and that is why this error happens. Sorry for the > > > confusion. > > > I just confirmed it, got both the walsenders stopped in the debugger: > > > > > > postgres=# select pid from pg_stat_replication where application_name = > > > 'sub'; > > > pid > > > -- > > > 7899 > > > 7993 > > > (2 rows) > > > > IIUC the query[1] used for polling returns two rows in this case: {t, > > f} or {f, t}. But did poll_query_until() return OK in this case even > > if we expected one row of 't'? My guess of how this issue happened is: > > > > Yeah, we can check this but I guess as soon as it gets 't', the poll > query will exit. I did a quick check with the following tap test code: $node_publisher->poll_query_until('postgres', qq( select 1 != foo.column1 from (values(0), (1)) as foo; )); The query returns {t, f} but poll_query_until() never finished. The same is true when the query returns {f, t}. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: prevent immature WAL streaming
On 2021-Aug-24, Bossart, Nathan wrote: > If moving RegisterSegmentBoundary() is sufficient to prevent the flush > pointer from advancing before we register the boundary, I bet we could > also remove the WAL writer nudge. Can you elaborate on this? I'm not sure I see the connection. > Another interesting thing I see is that the boundary stored in > earliestSegBoundary is not necessarily the earliest one. It's just > the first one that has been registered. I did this for simplicity for > the .ready file fix, but I can see it causing problems here. Hmm, is there really a problem here? Surely the flush point cannot go past whatever has been written. If somebody is writing an earlier section of WAL, then we cannot move the flush pointer to a later position. So it doesn't matter if the earliest point we have registered is the true earliest -- we only care for it to be the earliest that is past the flush point. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
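To make the "earliest" question above concrete, here is a sketch of registration logic that keeps the true minimum rather than the first registration; the names are invented, and the real code would keep this state in shared memory under a lock:

#include "postgres.h"
#include "access/xlogdefs.h"

typedef struct SegBoundaryState
{
    bool        boundary_valid;
    XLogSegNo   earliest_seg;
    XLogRecPtr  earliest_endpos;
} SegBoundaryState;

static SegBoundaryState bstate;     /* really shared memory; locking elided */

static void
RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr endpos)
{
    /* Keep the smallest registered segment, not the first-registered one. */
    if (!bstate.boundary_valid || seg < bstate.earliest_seg)
    {
        bstate.boundary_valid = true;
        bstate.earliest_seg = seg;
        bstate.earliest_endpos = endpos;
    }
}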
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada wrote: > > IIUC the query[1] used for polling returns two rows in this case: {t, > f} or {f, t}. But did poll_query_until() return OK in this case even > if we expected one row of 't'? My guess of how this issue happened is: > > 1. the first polling query after "ALTER SUBSCRIPTION CONNECTION" > passed (for some reason). > 2. all wal senders exited. > 3. get the pid of wal sender with application_name 'tap_sub' but got nothing. > 4. the second polling query resulted in a syntax error since $oldpid is null. > > If the fact that two walsenders with the same application_name could > be present in pg_stat_replication view was the cause of this issue, > poll_query_until() should return OK even if we expected just 't'. I > might be missing something, though. > > [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE > application_name = '$appname';" Yes, the query [1] returns OK with a {f,t} or {t,f} [1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';" regards, Ajin Cherian Fujitsu Australia
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada wrote: > > On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian wrote: > > > > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian wrote: > > > > > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila > > > wrote: > > > > > > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian wrote: > > > > > > > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila > > > > > wrote: > > > > > > > > > > > But will poll function still poll or exit? Have you tried that? > > > > > > > > > > I have forced that condition with a changed query and found that the > > > > > poll will not exit in case of a NULL return. > > > > > > > > > > > > > What if the query in a poll is fired just before we get an error > > > > "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? > > > > Won't at that stage both old and new walsender's are present, so the > > > > query might return true. You can check that via debugger by stopping > > > > just before this error occurs and then check pg_stat_replication view. > > > > > > If this error happens then the PID is NOT updated as the pid in the > > > Replication slot. I have checked this > > > and explained this in my first email itself > > > > > > > Sorry about the above email, I misunderstood. I was looking at > > pg_stat_replication_slot rather than pg_stat_replication hence the > > confusion. > > Amit is correct, just prior to the walsender erroring out, it briefly > > appears in the > > pg_stat_replication, and that is why this error happens. Sorry for the > > confusion. > > I just confirmed it, got both the walsenders stopped in the debugger: > > > > postgres=# select pid from pg_stat_replication where application_name = > > 'sub'; > > pid > > -- > > 7899 > > 7993 > > (2 rows) > > IIUC the query[1] used for polling returns two rows in this case: {t, > f} or {f, t}. But did poll_query_until() return OK in this case even > if we expected one row of 't'? My guess of how this issue happened is: > Yeah, we can check this but I guess as soon as it gets 't', the poll query will exit. > 1. the first polling query after "ALTER SUBSCRIPTION CONNECTION" > passed (for some reason). > I think the reason for exit is that we get two rows with the same application_name in pg_stat_replication. > 2. all wal senders exited. > 3. get the pid of wal sender with application_name 'tap_sub' but got nothing. > 4. the second polling query resulted in a syntax error since $oldpid is null. > Your understanding of steps is the same as mine. -- With Regards, Amit Kapila.
Re: Separate out FileSet from SharedFileSet (was Re: pgsql: pgstat: Bring up pgstat in BaseInit() to fix uninitialized use o)
On Tue, Aug 24, 2021 at 3:55 PM Dilip Kumar wrote: > > On Tue, Aug 24, 2021 at 12:26 PM Amit Kapila wrote: > The first patch looks good to me. I have made minor changes to the attached patch. The changes include: fixing compilation warning, made some comment changes, ran pgindent, and few other cosmetic changes. If you are fine with the attached, then kindly rebase the second patch atop it. -- With Regards, Amit Kapila. v5-0001-Refactor-sharedfileset.c-to-separate-out-fileset-.patch Description: Binary data
Re: replay of CREATE TABLESPACE eats data at wal_level=minimal
On Wed, Aug 25, 2021 at 1:21 AM Noah Misch wrote: > Sounds good. I think the log message is the optimal place: Looks awesome. -- Robert Haas EDB: http://www.enterprisedb.com
RE: prevent immature WAL streaming
Hi Álvaro, -hackers, > I attach the patch with the change you suggested. I've given the v02 patch a shot on top of REL_12_STABLE (already including 5065aeafb0b7593c04d3bc5bc2a86037f32143fc). Previously (yesterday), without the v02 patch, I was always getting standby corruption via simulation: a separate dedicated fs for /pg_xlog, archive_mode=on, wal_keep_segments=120, archive_command set to rsync to a different dir on the same fs, and wal_init_zero at its default (true). Today (with v02) I got corruption only in the initial 2 runs out of more than 30 tries on the standby. Probably the 2 failures were somehow my fault (?) or some rare condition (and in 1 of those 2 cases simply restarting the standby helped). To be honest, I've tried to force this error, but with v02 I simply cannot force it anymore, so that's good! :) > I didn't have a lot of luck with a reliable reproducer script. I was able to > reproduce the problem starting with Ryo Matsumura's script and attaching > a replica; most of the time the replica would recover by restarting from a > streaming position earlier than where the problem occurred; but a few > times it would just get stuck with a WAL segment containing a bogus > record. In order to get a reliable reproducer with proper fault injection, instead of actually filling up the fs, one can apparently substitute the fd with an fd for /dev/full using e.g. dup2(), so that every write throws this error too: root@hive:~# ./t & # simple while(1) { fprintf() flush () } testcase root@hive:~# ls -l /proc/27296/fd/3 lrwx-- 1 root root 64 Aug 25 06:22 /proc/27296/fd/3 -> /tmp/testwrite root@hive:~# gdb -q -p 27296 -- 1089 is bitmask O_WRONLY|.. (gdb) p dup2(open("/dev/full", 1089, 0777), 3) $1 = 3 (gdb) c Continuing. ==> fflush/write(): : No space left on device So I've also tried to be malicious while writing to the DB and inject ENOSPC near places like: a) XLogWrite()->XLogFileInit() near line 3322 // assuming: if (wal_init_zero) is true, one gets the classic "PANIC: could not write to file "pg_wal/xlogtemp.90670": No space left on device" b) XLogWrite() near line 2547 just after pg_pwrite // one can get "PANIC: could not write to log file 0001003B00A8 at offset 0, length 15466496: No space left on device" (that would be possible with wal_init_zero=false?) c) XLogWrite() near line 2592 // just before issue_xlog_fsync to get "PANIC: could not fdatasync file "000100430004": Invalid argument" that would pretty much mean the same as above but with the last possible offset near the end of WAL? This was done with gdb voodoo: handle SIGUSR1 noprint nostop break xlog.c: // https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/transam/xlog.c#L3311 c print fd or openLogFile -- to verify it is 3 p dup2(open("/dev/full", 1089, 0777), 3) -- during most of the walwriter runtime it has the current log as fd=3 After restarting the master and inspecting the standby - in all of those 3 cases above - the standby didn't exhibit the "invalid contrecord length" error, at least here, while without this v02 patch it is notorious. So if it passes the worst-case code review assumptions, I wonder whether it shouldn't just be committed as it stands right now. -J.
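The dup2()-to-/dev/full trick described above can also be packaged as a tiny standalone C program, for anyone who wants to see the ENOSPC behaviour without gdb:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Redirect an already-open file descriptor to /dev/full so that every
 * subsequent write fails with ENOSPC, without actually filling a
 * filesystem.  This mirrors what the gdb dup2() call does to a live
 * walwriter.
 */
int
main(void)
{
    int     fd = open("/tmp/testwrite", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int     full = open("/dev/full", O_WRONLY);

    if (fd < 0 || full < 0)
        return 1;

    /* From here on, every write to fd hits /dev/full and reports ENOSPC. */
    dup2(full, fd);

    if (write(fd, "x", 1) < 0)
        fprintf(stderr, "write: %s\n", strerror(errno)); /* No space left on device */
    return 0;
}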
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian wrote: > > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian wrote: > > > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila wrote: > > > > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian wrote: > > > > > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila > > > > wrote: > > > > > > > > > But will poll function still poll or exit? Have you tried that? > > > > > > > > I have forced that condition with a changed query and found that the > > > > poll will not exit in case of a NULL return. > > > > > > > > > > What if the query in a poll is fired just before we get an error > > > "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? > > > Won't at that stage both old and new walsender's are present, so the > > > query might return true. You can check that via debugger by stopping > > > just before this error occurs and then check pg_stat_replication view. > > > > If this error happens then the PID is NOT updated as the pid in the > > Replication slot. I have checked this > > and explained this in my first email itself > > > > Sorry about the above email, I misunderstood. I was looking at > pg_stat_replication_slot rather than pg_stat_replication hence the confusion. > Amit is correct, just prior to the walsender erroring out, it briefly > appears in the > pg_stat_replication, and that is why this error happens. Sorry for the > confusion. > I just confirmed it, got both the walsenders stopped in the debugger: > > postgres=# select pid from pg_stat_replication where application_name = 'sub'; > pid > -- > 7899 > 7993 > (2 rows) IIUC the query[1] used for polling returns two rows in this case: {t, f} or {f, t}. But did poll_query_until() return OK in this case even if we expected one row of 't'? My guess of how this issue happened is: 1. the first polling query after "ALTER SUBSCRIPTION CONNECTION" passed (for some reason). 2. all wal senders exited. 3. get the pid of wal sender with application_name 'tap_sub' but got nothing. 4. the second polling query resulted in a syntax error since $oldpid is null. If the fact that two walsenders with the same application_name could be present in the pg_stat_replication view was the cause of this issue, poll_query_until() should return OK even if we expected just 't'. I might be missing something, though. [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = '$appname';" Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: .ready and .done files considered harmful
> If a .ready file is created out of order, the directory scan logic > will pick it up about as soon as possible based on its priority. If > the archiver is keeping up relatively well, there's a good chance such > a file will have the highest archival priority and will be picked up > the next time the archiver looks for a file to archive. With the > patch proposed in this thread, an out-of-order .ready file has no such > guarantee. As long as the archiver never has to fall back to a > directory scan, it won't be archived. The proposed patch handles the > case where RemoveOldXlogFiles() creates missing .ready files by > forcing a directory scan, but I'm not sure this is enough. I think we > have to check the archiver state each time we create a .ready file to > see whether we're creating one out-of-order. We can handle the scenario where a .ready file is created out of order in XLogArchiveNotify(). This way we can avoid making an explicit call to enable a directory scan from the different code paths that may result in creating an out-of-order .ready file. The archiver can store the segment number corresponding to the last or most recent .ready file found. When a .ready file is created in XLogArchiveNotify(), the log segment number of the current .ready file can be compared with the segment number of the last .ready file found by the archiver to detect whether this file was created out of order. A directory scan can be forced if required. I have incorporated these changes in patch v11. > While this may be an extremely rare problem in practice, archiving > something after the next checkpoint completes seems better than never > archiving it at all. IMO this isn't an area where there is much space > to take risks. An alternate approach could be to force a directory scan at checkpoint to break the infinite wait for a .ready file that is being missed because it was created out of order. This will make sure that the file gets archived within the checkpoint boundaries. Thoughts? Please find attached patch v11. Thanks, Dipesh From 9392fd1b82ade933e8127845013bb2940239af68 Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Tue, 24 Aug 2021 12:17:34 +0530 Subject: [PATCH] mitigate directory scan for WAL archiver WAL archiver scans the status directory to identify the next WAL file that needs to be archived. This directory scan can be minimised by maintaining the log segment of the current WAL file which is being archived and incrementing it by '1' to get the next WAL file. Archiver can check the availability of the next file and in case the file is not available then it should fall back to a directory scan to get the oldest WAL file. There are some special scenarios like timeline switch, backup or .ready file created out of order which requires archiver to perform a full directory scan as archiving these files takes precedence over regular WAL files. --- src/backend/access/transam/xlogarchive.c | 23 src/backend/postmaster/pgarch.c | 211 --- src/backend/storage/lmgr/lwlocknames.txt | 1 + src/include/postmaster/pgarch.h | 4 + 4 files changed, 224 insertions(+), 15 deletions(-) diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index b9c19b2..cb73895 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -465,12 +465,16 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname) * * Optionally, nudge the archiver process so that it'll notice the file we * create. 
+ * + * Also, notifies archiver to enable directory scan to handle a few special + * scenarios. */ void XLogArchiveNotify(const char *xlog, bool nudge) { char archiveStatusPath[MAXPGPATH]; FILE *fd; + bool fileOutOfOrder = false; /* insert an otherwise empty file called .ready */ StatusFilePath(archiveStatusPath, xlog, ".ready"); @@ -492,6 +496,25 @@ XLogArchiveNotify(const char *xlog, bool nudge) return; } + /* Check if .ready file is created out of order */ + if (IsXLogFileName(xlog)) + { + XLogSegNo curSegNo; + TimeLineID tli; + + XLogFromFileName(xlog, &tli, &curSegNo, wal_segment_size); + + fileOutOfOrder = PgArchIsBrokenReadyFileOrder(curSegNo); + } + + /* + * History files or a .ready file created out of order requires archiver to + * perform a full directory scan. + */ + if (IsTLHistoryFileName(xlog) || IsBackupHistoryFileName(xlog) || + fileOutOfOrder) + PgArchEnableDirScan(); + /* If caller requested, let archiver know it's got work to do */ if (nudge) PgArchWakeup(); diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 74a7d7c..a33648a 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -76,8 +76,31 @@ typedef struct PgArchData { int pgprocno; /* pgprocno of archiver process */ + + /* + * Flag to enable/disable directory scan. If this flag is set then it + * f
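A simplified sketch of the out-of-order check the patch adds (the shared-state variable and the elided locking are assumptions here; see PgArchIsBrokenReadyFileOrder() in the patch itself for the real thing):

#include "postgres.h"
#include "access/xlogdefs.h"

/* Really lives in archiver shared memory, protected by an LWLock. */
static XLogSegNo lastReadySegNo = 0;

bool
PgArchIsBrokenReadyFileOrder(XLogSegNo curSegNo)
{
    /*
     * A .ready file for a segment older than the newest one already seen
     * was created out of order; the caller then forces a directory scan.
     */
    return lastReadySegNo != 0 && curSegNo < lastReadySegNo;
}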
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian wrote: > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila wrote: > > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian wrote: > > > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila > > > wrote: > > > > > > > But will poll function still poll or exit? Have you tried that? > > > > > > I have forced that condition with a changed query and found that the > > > poll will not exit in case of a NULL return. > > > > > > > What if the query in a poll is fired just before we get an error > > "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? > > Won't at that stage both old and new walsender's are present, so the > > query might return true. You can check that via debugger by stopping > > just before this error occurs and then check pg_stat_replication view. > > If this error happens then the PID is NOT updated as the pid in the > Replication slot. I have checked this > and explained this in my first email itself > Sorry about the above email, I misunderstood. I was looking at pg_stat_replication_slot rather than pg_stat_replication hence the confusion. Amit is correct, just prior to the walsender erroring out, it briefly appears in the pg_stat_replication, and that is why this error happens. Sorry for the confusion. I just confirmed it, got both the walsenders stopped in the debugger: postgres=# select pid from pg_stat_replication where application_name = 'sub'; pid -- 7899 7993 (2 rows) ajin 7896 3326 0 05:22 pts/200:00:00 psql -p 6972 postgres ajin 7897 7882 0 05:22 ?00:00:00 postgres: ajin postgres [local] idle ajin 7899 7882 0 05:22 ?00:00:00 postgres: walsender ajin ::1(37719) START_REPLICATION ajin 7992 3647 0 05:24 ?00:00:00 postgres: logical replication worker for subscription 16426 ajin 7993 7882 0 05:24 ?00:00:00 postgres: walsender ajin ::1(37720) START_REPLICATION regards, Ajin Cherian Fujitsu Australia
Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
On Wed, Aug 25, 2021 at 1:37 AM Robert Haas wrote: > > I guess I was thinking more of rejiggering things so that we save the > results of each RestoreSnapshot() call in a local variable, e.g. > asnapshot and tsnapshot. And then I think we could just > RestoreTransactionSnapshot() on whichever one we want, and then > PushActiveSnapshot(asnapshot) either way. I think it would be worth > trying to move the PushActiveSnapshot() call out of the if statement > instead it in two places, written differently but doing the same > thing. > I've attached an updated patch, hopefully more along the lines that you were thinking of. Regards, Greg Nancarrow Fujitsu Australia v11-0001-Fix-parallel-worker-failed-assertion-and-coredump.patch Description: Binary data
Re: row filtering for logical replication
On Wed, Aug 25, 2021 at 10:57 AM Amit Kapila wrote: > > On Wed, Aug 25, 2021 at 5:52 AM Euler Taveira wrote: > > > > On Tue, Aug 24, 2021, at 4:46 AM, Peter Smith wrote: > > > > Anyway, I have implemented the suggested cache change because I agree > > it is probably theoretically superior, even if in practice there is > > almost no difference. > > > > I didn't inspect your patch carefully but it seems you add another List to > > control this new cache mechanism. I don't like it. IMO if we can use the > > data > > structures that we have now, let's implement your idea; otherwise, -1 for > > this > > new micro optimization. > > > > As mentioned above, without this we will invalidate many cached > expressions even though it is not required. I don't deny that there > might be a better way to achieve the same and if you or Peter have any > ideas, I am all ears. > I see that the new list is added to store the row_filter node, which we later use to compute the expression. This is not required for invalidation but for delaying the expression evaluation till it is required (for example, for truncate, we may not need the row evaluation, so there is no need to compute it). Can we try to postpone the syscache lookup to a later stage when we are actually doing the row filtering? If we can do that, then I think we can avoid having this extra list. -- With Regards, Amit Kapila.
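To make the "delay until needed" idea concrete, a sketch under invented names (not the patch itself): keep the raw qual Node in the cache entry and build the ExprState only the first time a row actually has to be filtered.

#include "postgres.h"
#include "executor/executor.h"
#include "nodes/execnodes.h"

typedef struct RowFilterCacheEntry
{
    Node       *row_filter;     /* raw qual from the catalog, or NULL */
    ExprState  *exprstate;      /* built lazily on first use */
} RowFilterCacheEntry;

static bool
row_matches_filter(RowFilterCacheEntry *entry, ExprContext *econtext)
{
    Datum       ret;
    bool        isnull;

    if (entry->row_filter == NULL)
        return true;            /* no filter: publish everything */

    /* Defer expression compilation until a row actually needs it. */
    if (entry->exprstate == NULL)
        entry->exprstate = ExecPrepareExpr((Expr *) entry->row_filter,
                                           econtext->ecxt_estate);

    ret = ExecEvalExprSwitchContext(entry->exprstate, econtext, &isnull);
    return !isnull && DatumGetBool(ret);
}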
Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead
Hi, On 2021-08-25 12:51:58 +0900, Michael Paquier wrote: > I was looking at this WIP patch, and plugging in the connection > statistics to the table-access statistics looks like the wrong > abstraction to me. I find much cleaner the approach of HEAD to use a > separate API to report this information, as of > pgstat_send_connstats(). As I said before, this ship has long sailed: typedef struct PgStat_MsgTabstat { PgStat_MsgHdr m_hdr; Oid m_databaseid; int m_nentries; int m_xact_commit; int m_xact_rollback; PgStat_Counter m_block_read_time; /* times in microseconds */ PgStat_Counter m_block_write_time; PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES]; } PgStat_MsgTabstat; > As of the two problems discussed on this thread, aka the increased > number of UDP packages and the extra timestamp computations, it seems > to me that we had better combine the following ideas for HEAD and 14, > for now: > - Avoid the extra timestamp computation as proposed by Laurenz in [1] > - Throttle the frequency where the connection stat packages are sent, > as of [2]. I think in that case we'd have to do the bigger redesign and move "live" connection stats to backend_status.c... Greetings, Andres Freund
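A sketch of the throttling idea from [2] above (the function name here is invented, not the actual patch): send a connection-stats packet only if PGSTAT_STAT_INTERVAL has elapsed since the last one, or if the backend is disconnecting.

#include "postgres.h"
#include "pgstat.h"
#include "utils/timestamp.h"

static TimestampTz last_connstat_report = 0;

static void
maybe_send_connstats(bool disconnect)
{
    TimestampTz now = GetCurrentTimestamp();

    /* Always report on disconnect; otherwise rate-limit the UDP traffic. */
    if (!disconnect &&
        !TimestampDifferenceExceeds(last_connstat_report, now,
                                    PGSTAT_STAT_INTERVAL))
        return;                 /* too soon; skip this packet */

    last_connstat_report = now;
    /* ... assemble and send the connection-stats message here ... */
}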
Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead
Hi, On 2021-08-20 14:27:20 -0500, Justin Pryzby wrote: > On Tue, Aug 17, 2021 at 02:14:20AM -0700, Andres Freund wrote: > > Doubling the number of UDP messages in common workloads seems also > > problematic > > enough that it should be addressed for 14. It increases the likelihood of > > dropping stats messages on busy systems, which can have downstream impacts. > > I think by "common workloads" you mean ones with many short-lived sessions. You don't need short-lived sessions. You just need sessions that don't process queries all the time (so that there's only one or a few queries within each PGSTAT_STAT_INTERVAL). The connection stats aren't sent once per session, they're sent once per PGSTAT_STAT_INTERVAL. Greetings, Andres Freund
Re: Added schema level support for publication.
On Mon, Aug 23, 2021 at 11:16 PM vignesh C wrote: > > On Tue, Aug 17, 2021 at 6:55 PM Tom Lane wrote: > > > > Amit Kapila writes: > > > On Tue, Aug 17, 2021 at 6:40 AM Peter Smith wrote: > > >> On Mon, Aug 16, 2021 at 11:31 PM Tom Lane wrote: > > >>> Abstractly it'd be > > >>> > > >>> createpub := CREATE PUBLICATION pubname FOR cpitem [, ... ] [ WITH ... ] > > >>> > > >>> cpitem := ALL TABLES | > > >>> TABLE name | > > >>> ALL TABLES IN SCHEMA name | > > >>> ALL SEQUENCES | > > >>> SEQUENCE name | > > >>> ALL SEQUENCES IN SCHEMA name | > > >>> name > > >>> > > >>> The grammar output would need some post-analysis to attribute the > > >>> right type to bare "name" items, but that doesn't seem difficult. > > > > >> That last bare "name" cpitem looks like it would permit the following > > >> syntax: > > >> CREATE PUBLICATION pub FOR a,b,c; > > >> Was that intentional? > > > > > I think so. > > > > I had supposed that we could throw an error at the post-processing stage, > > or alternatively default to assuming that such names are tables. > > > > Now you could instead make the grammar work like > > > > cpitem := ALL TABLES | > > TABLE name [, ...] | > > ALL TABLES IN SCHEMA name [, ...] | > > ALL SEQUENCES | > > SEQUENCE name [, ...] | > > ALL SEQUENCES IN SCHEMA name [, ...] > > > > which would result in a two-level-list data structure. I'm not sure > > that this is better, as any sort of mistake would result in a very > > uninformative generic "syntax error" from Bison. Errors out of a > > post-processing stage could be more specific than that. > > I preferred the implementation along the lines Tom Lane had proposed at [1]. Is > it ok if the implementation is something like below: > CreatePublicationStmt: > CREATE PUBLICATION name FOR pub_obj_list opt_definition > { > CreatePublicationStmt *n = makeNode(CreatePublicationStmt); > n->pubname = $3; > n->options = $6; > n->pubobjects = (List *)$5; > $$ = (Node *)n; > } > ; > pub_obj_list: PublicationObjSpec > { $$ = list_make1($1); } > | pub_obj_list ',' PublicationObjSpec > { $$ = lappend($1, $3); } > ; > /* FOR TABLE and FOR ALL TABLES IN SCHEMA specifications */ > PublicationObjSpec: TABLE pubobj_expr > { } > | ALL TABLES IN_P SCHEMA pubobj_expr > { } > | pubobj_expr > { } > ; > pubobj_expr: > any_name > { } > | any_name '*' > { } > | ONLY any_name > { } > | ONLY '(' any_name ')' > { } > | CURRENT_SCHEMA > { } > ; "FOR ALL TABLES" (that includes all tables in the database) is missing in this syntax? > > I needed pubobj_expr to support the existing syntaxes supported by > relation_expr and also to handle CURRENT_SCHEMA support in case of the "FOR > ALL TABLES IN SCHEMA" feature. I changed the name to any_name to support > objects like "sch1.t1". I think that relation_expr also accepts objects like "sch1.t1", no? > I felt if a user specified "FOR ALL TABLES", the user should not be allowed > to combine it with "FOR TABLE" and "FOR ALL TABLES IN SCHEMA" as "FOR ALL > TABLES" anyway will include all the tables. I think so too. > Should we support the similar syntax in case of alter publication, like > "ALTER PUBLICATION pub1 ADD TABLE t1,t2, ALL TABLES IN SCHEMA sch1, sch2" or > shall we keep these separate like "ALTER PUBLICATION pub1 ADD TABLE t1, t2" > and "ALTER PUBLICATION pub1 ADD ALL TABLES IN SCHEMA sch1, sch2". I preferred > to keep it separate as we have kept ADD/DROP separately which cannot be > combined currently. If we support the former syntax, the latter two syntaxes are also supported.
Why do we want to support only the latter two separate syntaxes? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Failure of subscription tests with topminnow
On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila wrote: > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian wrote: > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila > > wrote: > > > > > But will poll function still poll or exit? Have you tried that? > > > > I have forced that condition with a changed query and found that the > > poll will not exit in case of a NULL return. > > > > What if the query in a poll is fired just before we get an error > "tap_sub ERROR: replication slot "tap_sub" is active for PID 16336"? > Won't at that stage both old and new walsender's are present, so the > query might return true. You can check that via debugger by stopping > just before this error occurs and then check pg_stat_replication view. If this error happens then the PID is NOT updated as the pid in the Replication slot. I have checked this and explained this in my first email itself regards, Ajin Cherian Fujitsu Australia
Re: Add some tests for pg_stat_statements compatibility verification under contrib
On Mon, Mar 15, 2021 at 03:05:24PM +0800, Erica Zhang wrote: > This way the same query can be reused for both older versions and current > version. > Yep, it's neater to use the query as you suggested. Thanks! > > Also, can you register your patch for the next commitfest at > https://commitfest.postgresql.org/33/, to make sure it won't be forgotten? I was just looking at your patch, and I think that you should move all the past compatibility tests into a separate test file, in a way consistent with what we do in contrib/pageinspect/ for oldextversions.sql. I would suggest using the same file names, while on it. -- Michael
Re: logical replication empty transactions
I reviewed the v14-0001 patch. All my previous comments have been addressed. Apply / build / test was all OK. -- More review comments: 1. Params names in the function declarations should match the rest of the code. 1a. src/include/replication/logical.h @@ -26,7 +26,8 @@ typedef LogicalOutputPluginWriterWrite LogicalOutputPluginWriterPrepareWrite; typedef void (*LogicalOutputPluginWriterUpdateProgress) (struct LogicalDecodingContext *lr, XLogRecPtr Ptr, - TransactionId xid + TransactionId xid, + bool send_keep_alive => Change "send_keep_alive" --> "send_keepalive" ~~ 1b. src/include/replication/output_plugin.h @@ -243,6 +243,6 @@ typedef struct OutputPluginCallbacks /* Functions in replication/logical/logical.c */ extern void OutputPluginPrepareWrite(struct LogicalDecodingContext *ctx, bool last_write); extern void OutputPluginWrite(struct LogicalDecodingContext *ctx, bool last_write); -extern void OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx); +extern void OutputPluginUpdateProgress(struct LogicalDecodingContext *ctx, bool send_keep_alive); => Change "send_keep_alive" --> "send_keepalive" -- 2. Comment should be capitalized - src/backend/replication/walsender.c @@ -170,6 +170,9 @@ static TimestampTz last_reply_timestamp = 0; /* Have we sent a heartbeat message asking for reply, since last reply? */ static bool waiting_for_ping_response = false; +/* force keep alive when skipping transactions in synchronous replication mode */ +static bool force_keepalive_syncrep = false; => "force" --> "Force" -- Otherwise, v14-0001 LGTM. -- Kind Regards, Peter Smith. Fujitsu Australia
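To visualize how the reviewed flag might be consumed, a sketch under assumptions (this is not the v14 patch itself; WalSndKeepalive() is stubbed here to stand in for walsender.c's real keepalive sender):

#include "postgres.h"
#include "replication/logical.h"
#include "replication/syncrep.h"

/* Force keepalive when skipping transactions in synchronous replication mode. */
static bool force_keepalive_syncrep = false;

static void
WalSndKeepalive(bool requestReply)
{
    /* stub: the real function emits a keepalive message to the standby */
}

static void
WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn,
                     TransactionId xid, bool send_keepalive)
{
    /*
     * When a transaction is skipped, a synchronous standby still needs
     * feedback to advance; remember to send a keepalive soon.
     */
    if (send_keepalive && SyncRepRequested())
        force_keepalive_syncrep = true;

    /* ... existing progress-tracking logic ... */
}

static void
WalSndKeepaliveIfNecessary(void)
{
    if (force_keepalive_syncrep)
    {
        WalSndKeepalive(false); /* no reply requested */
        force_keepalive_syncrep = false;
        return;
    }

    /* ... existing wal_sender_timeout-based keepalive logic ... */
}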