Re: In-place persistence change of a relation
Thank you for the quick comments. At Thu, 31 Oct 2024 23:24:36 +0200, Heikki Linnakangas wrote in > On 31/10/2024 10:01, Kyotaro Horiguchi wrote: > > After some delays, here’s the new version. In this update, UNDO logs > > are WAL-logged and processed in memory under most conditions. During > > checkpoints, they’re flushed to files, which are then read when a > > specific XID’s UNDO log is accessed for the first time during > > recovery. > > The biggest changes are in patches 0001 through 0004 (equivalent to > > the previous 0001-0002). After that, there aren’t any major > > changes. Since this update involves removing some existing features, > > I’ve split these parts into multiple smaller identity transformations > > to make them clearer. > > As for changes beyond that, the main one is lifting the previous > > restriction on PREPARE for transactions after a persistence > > change. This was made possible because, with the shift to in-memory > > processing of UNDO logs, commit-time crash recovery detection is now > > simpler. Additional changes include completely removing the > > abort-handling portion from the pendingDeletes mechanism (0008-0010). > > In this patch version, the undo log is kept in dynamic shared > memory. It can grow indefinitely. On a checkpoint, it's flushed to > disk. > > If I'm reading it correctly, the undo records are kept in the DSA area > even after it's flushed to disk. That's not necessary; system never > needs to read the undo log unless there's a crash, so there's no need The system also needs to read the undo log whenever additional undo logs are added. In this version, I’ve moved all abort-time pendingDeletes data entirely to the undo logs. In other words, the DSA area is expanded in exchange for reducing the pendingDelete list. As a result, there is minimal impact on overall memory usage. Additionally, the current flushing code is straightforward because it relies on the in-memory primary image. 
If we drop the in-memory image during flush, we might need exclusive locking or possibly some ordering techniques. Anyway, I’ll consider that approach. > to keep it in memory after it's been flushed to disk. That's true > today; we could start relying on the undo log to clean up on abort > even when there's no crash, but I think it's a good design to not do > that and rely on backend-private state for non-crash transaction > abort. Hmm. Sounds reasonable. In the next version, I'll revert the changes to pendingDeletes and adjust it to just discard the log on regular aborts. > I'd suggest doing this the other way 'round. Let's treat the on-disk > representation as the primary representation, not the in-memory > one. Let's use a small fixed-size shared memory area just as a write > buffer to hold the dirty undo log entries that haven't been written to > disk yet. Most transactions are short, so most undo log entries never > need to be flushed to disk, but I think it'll be simpler to think of > it that way. On checkpoint, flush all the buffered dirty entries from > memory to disk and clear the buffer. Also do that if the buffer fills > up. I'd like to clarify the specific concept of these fixed-length memory slots. Is it something like this: each slot is keyed by an XID, followed by an in-file offset and a series of, say, 1024-byte areas? When writing a log for a new XID, if no slot is available, the backend would immediately evict the slot with the smallest XID to disk to free up space. If an existing slot runs out of space while writing new logs, the backend would flush it immediately and continue using the area. Is this correct? Additionally, if multiple processes try to write to a single slot, stricter locking might be needed. For example, if a slot is evicted by a backend other than its user, exclusive control might be required during the file write. Is there any effective way to avoid such locking?
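To make the slot scheme in the questions above concrete, here is a minimal, self-contained C sketch of the design as I understand it. Everything here is an assumption for illustration, not code from the patch: the struct layout, the slot count, the buffer size, and the function names are all made up, and a real implementation would live in shared memory, need locking, and remember each evicted XID's file offset.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_SLOTS	32			/* assumed default, as floated above */
#define SLOT_BUFSZ	1024		/* assumed per-slot buffer size */

typedef uint32_t TransactionId;
#define InvalidTransactionId 0

typedef struct UndoSlot
{
	TransactionId xid;			/* owning transaction, or invalid if free */
	uint64_t	file_offset;	/* where buffered bytes go in the XID's file */
	int			used;			/* bytes currently buffered */
	char		buf[SLOT_BUFSZ];
} UndoSlot;

static UndoSlot slots[NUM_SLOTS];

/* stand-in for writing a slot's buffered bytes to its per-XID undo file */
static void
flush_slot(UndoSlot *slot)
{
	slot->file_offset += slot->used;
	slot->used = 0;
}

/* Find the slot for xid; if none is free, evict the smallest-XID slot. */
static UndoSlot *
get_slot(TransactionId xid)
{
	UndoSlot   *victim = NULL;

	for (int i = 0; i < NUM_SLOTS; i++)
	{
		if (slots[i].xid == xid)
			return &slots[i];
		if (slots[i].xid == InvalidTransactionId)
			victim = &slots[i];	/* remember a free slot */
	}
	if (victim == NULL)
	{
		victim = &slots[0];
		for (int i = 1; i < NUM_SLOTS; i++)
			if (slots[i].xid < victim->xid)
				victim = &slots[i];
		flush_slot(victim);		/* evict smallest XID; a real version would
								 * also remember that XID's file offset */
	}
	victim->xid = xid;
	victim->file_offset = 0;
	victim->used = 0;
	return victim;
}

/* Append a record; flush and reuse the buffer when it would overflow. */
static void
undo_append(TransactionId xid, const char *data, int len)
{
	UndoSlot   *slot = get_slot(xid);

	if (slot->used + len > SLOT_BUFSZ)
		flush_slot(slot);
	memcpy(slot->buf + slot->used, data, len);
	slot->used += len;
}
```

In this toy model, both eviction and overflow go through the same flush path, which is exactly where the cross-backend locking question above arises.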
In the current patch set, I’m avoiding any impact on the backend from checkpointer file writes by treating the in-memory image as primary. And regarding the number of these areas… although I’m not entirely sure, it seems unlikely we’d have hundreds of sessions simultaneously creating tables, so would it make sense to make this configurable, with a default of around 32 areas? > A high-level overview comment of the on-disk format would be nice. If > I understand correctly, there's a magic constant at the beginning of > each undo file, followed by UndoLogRecords. There are no other file > headers and no page structure within the file. Right. > That format seems reasonable. For cross-checking, maybe add the XID to > the file header too. There is a separate CRC value on each record, > which is
Re: In-place persistence change of a relation
At Mon, 2 Sep 2024 09:30:20 +0900, Michael Paquier wrote in > On Sun, Sep 01, 2024 at 10:15:00PM +0300, Heikki Linnakangas wrote: > > I wonder if the twophase state files and undo log files should be merged > > into one file. They're similar in many ways: there's one file per > > transaction, named using the XID. I haven't thought this fully through, just > > a thought.. > > Hmm. It could be possible to extract some of this knowledge out of > twophase.c and design some APIs that could be used for both, but would > that be really necessary? The 2PC data and the LSNs used by the files > to check if things are replayed or on disk rely on > GlobalTransactionData that has its own idea of things and timings at > recovery. I'm not sure, but I think Heikki was only suggesting reusing the file format and in/out functions, if the two file formats overlap sufficiently. > Or perhaps your point is actually to do that and add one layer for the > file handlings and their flush timings? I am not sure, TBH, what this > thread is trying to fix is complicated enough that it may be better to > live with two different code paths. But perhaps my gut feeling is > just wrong reading your paragraph. I believe this statement is valid, so I’m not in a hurry to do this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: In-place persistence change of a relation
Hello. Thank you for the response. At Sun, 1 Sep 2024 22:15:00 +0300, Heikki Linnakangas wrote in > On 31/08/2024 19:09, Kyotaro Horiguchi wrote: > > - UNDO log(0002): This handles file deletion during transaction aborts, > >which was previously managed, in part, by the commit XLOG record at > >the end of a transaction. > > - Prevent orphan files after a crash (0005): This is another use-case > >of the UNDO log system. > > Nice, I'm very excited if we can fix that long-standing issue! I'll > try to review this properly later, but at a quick 5 minute glance, one > thing caught my eye: > > This requires fsync()ing the per-xid undo log file every time a > relation is created. I fear that can be a pretty big performance hit > for workloads that repeatedly create and drop small tables. Especially I initially thought that one additional file manipulation during file creation wouldn't be an issue. However, the created storage file isn't being synced, so your concern seems valid. > if they're otherwise running with synchronous_commit=off. Instead of > flushing the undo log file after every write, I'd suggest WAL-logging > the undo log like regular relations and SLRUs. So before writing the > entry to the undo log, WAL-log it. And with a little more effort, you > could postpone creating the files altogether until a checkpoint > happens, similar to how twophase state files are checkpointed > nowadays. I thought that an UNDO log file not flushed before the last checkpoint might not survive a system crash. However, including UNDO files in the checkpointing process resolves that concern. Thank you for the suggestion. > I wonder if the twophase state files and undo log files should be > merged into one file. They're similar in many ways: there's one file > per transaction, named using the XID. I haven't thought this fully > through, just a thought.. To be precise, UNDO log files are created per subtransaction, unlike twophase files.
It might be possible if we allow the twophase files (as they are currently named) to be overwritten or modified at every subcommit. If ULOG contents are WAL-logged, these two things will become even more similar. However, I'm not planning to include that in the next version for now. > > +static void > > +undolog_set_filename(char *buf, TransactionId xid) > > +{ > > + snprintf(buf, MAXPGPATH, "%s/%08x", SIMPLE_UNDOLOG_DIR, xid); > > +} > > I'd suggest using FullTransactionId. Doesn't matter much, but seems > like a good future-proofing. Agreed. Will fix it in the next version. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
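For illustration, the suggested FullTransactionId change could look like the sketch below. The FullTransactionId wrapper and the directory name here are simplified stand-ins, not the actual PostgreSQL definitions; the point is only that a 64-bit XID widens the file name from 8 to 16 hex digits, so names stay unique across epoch wraparound.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MAXPGPATH 1024
#define SIMPLE_UNDOLOG_DIR "pg_undo"	/* stand-in directory name */

/* simplified stand-in for PostgreSQL's FullTransactionId */
typedef struct FullTransactionId
{
	uint64_t	value;
} FullTransactionId;

/* Hypothetical revision of undolog_set_filename() using a 64-bit XID */
static void
undolog_set_filename(char *buf, FullTransactionId fxid)
{
	/* 16 hex digits instead of the previous "%08x" */
	snprintf(buf, MAXPGPATH, "%s/%016" PRIx64,
			 SIMPLE_UNDOLOG_DIR, fxid.value);
}
```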
Re: Add callback in pgstats for backend initialization
At Wed, 4 Sep 2024 15:04:09 +0900, Michael Paquier wrote in > On Wed, Sep 04, 2024 at 02:15:43PM +0900, Kyotaro Horiguchi wrote: > > The name "init_backend" makes it sound like the function initializes > > the backend. backend_init might be a better choice, but I'm not sure. > > We (kind of) tend to prefer $verb_$thing-or-action_cb for the name of > the callbacks, which is why I chose this naming. If you feel strongly > about "backend_init_cb", that's also fine by me; you are the original > author of this code area with Andres. More accurately, I believe this applies when the sentence follows a verb-object structure. In this case, the function’s meaning is “pgstat initialization on backend,” which seems somewhat different from the policy you mentioned. However, it could also be interpreted as “initialize backend status related to pgstat.” Either way, I’m okay with the current name if you are, based on the discussion above. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add callbacks for fixed-numbered stats flush in pgstats
At Wed, 4 Sep 2024 15:12:37 +0900, Michael Paquier wrote in > On Wed, Sep 04, 2024 at 05:28:56AM +, Bertrand Drouvot wrote: > > On Wed, Sep 04, 2024 at 02:05:46PM +0900, Kyotaro Horiguchi wrote: > >> The generalization sounds good to me, and hiding the private flags in > >> private places also seems good. > > > > +1 on the idea. > > Thanks for the feedback. > > >> Regarding pgstat_io_flush_cb, I think it no longer needs to check the > >> have_iostats variable, and that check should be moved to the new > >> pgstat_flush_io(). The same applies to pgstat_wal_flush_cb() (and > >> pgstat_flush_wal()). As for pgstat_slru_flush_cb, it simply doesn't > >> need the check anymore. > > > > Agree. Another option could be to add only one callback (the flush_fixed_cb > > one) > > and get rid of the have_fixed_pending_cb one. Then add an extra parameter > > to the > > flush_fixed_cb that would only check if there is pending stats (without > > flushing them). I think those 2 callbacks are highly related that's why I > > think we could "merge" them, thoughts? > > I would still value the shortcut that we can use if there is no > activity to avoid the clock check with GetCurrentTimestamp(), though, > which is why I'd rather stick with two callbacks as that can lead to a > much cheaper path. Yeah, I was wrong on this point. > Anyway, I am not sure to follow your point about removing the boolean > checks in the flush callbacks. It's surely always a better to never > call LWLockAcquire() or LWLockConditionalAcquire() if we don't have > to, and we would go through the whole flush dance as long as we detect > some stats activity in at least one stats kind. I mean, that's cheap > enough to keep around. I doubt that overprotection is always better, but in this case, it's not overprotection at all. The flush callbacks are called unconditionally once we decide to flush anything. Sorry for the noise. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RecoveryTargetAction is left out in xlog_internal.h
Hello, While reviewing a patch, I noticed that enum RecoveryTargetAction is still in xlog_internal.h, even though it seems like it should be in xlogrecovery.h. Commit 70e81861fa separated out xlogrecovery.c/h and moved several enums related to recovery targets to xlogrecovery.h. However, it appears that enum RecoveryTargetAction was inadvertently left behind in that commit. Please find the attached patch, which addresses this oversight.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 4c938f473164d75761001b31fa85e5dc215fbd9a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Wed, 4 Sep 2024 17:12:45 +0900
Subject: [PATCH] Move enum RecoveryTargetAction to xlogrecovery.h

Commit 70e81861fa split out xlogrecovery.c/h and moved some enums
related to recovery targets to xlogrecovery.h. However, it seems that
the enum RecoveryTargetAction was inadvertently left out by that
commit. This commit moves it to xlogrecovery.h for consistency.
---
 src/include/access/xlog_internal.h | 10 ----------
 src/include/access/xlogrecovery.h  | 10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index e5cdba0584..9ba01ec20e 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -316,16 +316,6 @@ typedef struct XLogRecData
 	uint32		len;			/* length of rmgr data to include */
 } XLogRecData;
 
-/*
- * Recovery target action.
- */
-typedef enum
-{
-	RECOVERY_TARGET_ACTION_PAUSE,
-	RECOVERY_TARGET_ACTION_PROMOTE,
-	RECOVERY_TARGET_ACTION_SHUTDOWN,
-} RecoveryTargetAction;
-
 struct LogicalDecodingContext;
 struct XLogRecordBuffer;
 
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..c09f84765b 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -40,6 +40,16 @@ typedef enum
 	RECOVERY_TARGET_TIMELINE_NUMERIC,
 } RecoveryTargetTimeLineGoal;
 
+/*
+ * Recovery target action.
+ */
+typedef enum
+{
+	RECOVERY_TARGET_ACTION_PAUSE,
+	RECOVERY_TARGET_ACTION_PROMOTE,
+	RECOVERY_TARGET_ACTION_SHUTDOWN,
+} RecoveryTargetAction;
+
 /* Recovery pause states */
 typedef enum RecoveryPauseState
 {
-- 
2.43.5
Re: Use XLOG_CONTROL_FILE macro everywhere?
At Tue, 3 Sep 2024 12:02:26 +0300, "Anton A. Melnikov" wrote in > In v2 removed XLOG_CONTROL_FILE from args and used it directly in the > string. > IMHO this makes the code more readable and will output the correct > text if the macro changes. Yeah, I had this in my mind. Looks good to me. > 3) .. > Maybe include/access/xlogdefs.h would be a good place for this? > In v2 moved some definitions from xlog_internal to xlogdefs > and removed including xlog_internal.h from the postmaster.c. > Please, take a look on it. The change can help avoid disrupting existing users of the macro. However, the file is documented as follows: > * Postgres write-ahead log manager record pointer and > * timeline number definitions We could modify the file definition, but I'm not sure if that's the best approach. Instead, I'd like to propose separating the file and path-related definitions from xlog_internal.h, as shown in the attached first patch. This change would allow some modules to include files without unnecessary details. The second file is your patch, adjusted based on the first patch. I’d appreciate hearing from others on whether they find the first patch worthwhile. If it’s not considered worthwhile, then I believe having postmaster include xlog_internal.h would be the best approach. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 22b71830ff601717d01dc66be95bafb930b1a0ea Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 4 Sep 2024 16:15:16 +0900 Subject: [PATCH 1/2] Split file-related parts from xlog_internal.h Some modules that require XLOG file manipulation had to include xlog_internal.h, which contains excessively detailed internal information. This commit separates the file and file path-related components into xlogfilepaths.h, thereby avoiding the exposure of unrelated details of the XLOG module. 
--- src/backend/access/transam/timeline.c | 1 - src/backend/postmaster/pgarch.c | 1 - src/backend/utils/adt/genfile.c | 1 - src/bin/initdb/initdb.c | 1 + src/bin/pg_archivecleanup/pg_archivecleanup.c | 2 +- src/bin/pg_basebackup/pg_basebackup.c | 4 +- src/bin/pg_basebackup/pg_receivewal.c | 1 + src/bin/pg_basebackup/pg_recvlogical.c| 1 - src/bin/pg_basebackup/receivelog.c| 2 +- src/bin/pg_basebackup/streamutil.c| 2 +- src/bin/pg_controldata/pg_controldata.c | 1 - src/bin/pg_rewind/pg_rewind.c | 2 +- src/include/access/xlog.h | 13 +- src/include/access/xlog_internal.h| 178 +- src/include/access/xlogfilepaths.h| 219 ++ 15 files changed, 230 insertions(+), 199 deletions(-) create mode 100644 src/include/access/xlogfilepaths.h diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 146751ae37..18df8fd326 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -38,7 +38,6 @@ #include "access/xlog.h" #include "access/xlog_internal.h" #include "access/xlogarchive.h" -#include "access/xlogdefs.h" #include "pgstat.h" #include "storage/fd.h" diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 02f91431f5..4971cd76ff 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -30,7 +30,6 @@ #include #include "access/xlog.h" -#include "access/xlog_internal.h" #include "archive/archive_module.h" #include "archive/shell_archive.h" #include "lib/binaryheap.h" diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c index 24b95c32b7..17be1756e2 100644 --- a/src/backend/utils/adt/genfile.c +++ b/src/backend/utils/adt/genfile.c @@ -21,7 +21,6 @@ #include #include "access/htup_details.h" -#include "access/xlog_internal.h" #include "catalog/pg_authid.h" #include "catalog/pg_tablespace_d.h" #include "catalog/pg_type.h" diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index f00718a015..95ec8d9c92 100644 --- 
a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -65,6 +65,7 @@ #endif #include "access/xlog_internal.h" +#include "lib/stringinfo.h" #include "catalog/pg_authid_d.h" #include "catalog/pg_class_d.h" /* pgrminclude ignore */ #include "catalog/pg_collation_d.h" diff --git a/src/bin/pg_archivecleanup/pg_archivecleanup.c b/src/bin/pg_archivecleanup/pg_archivecleanup.c index 5a124385b7..f6b33de32a 100644 --- a/src/bin/pg_archivecleanup/pg_archivecleanup.c +++ b/src/bin/pg_archivecleanup/pg_archivecleanup.c @@ -15,7 +15,7 @@ #include #include -#include "access/xlog_inte
Re: DOCS - pg_replication_slot. Fix the 'inactive_since' description
At Tue, 3 Sep 2024 10:43:14 +0530, Amit Kapila wrote in > On Mon, Sep 2, 2024 at 9:14 AM shveta malik wrote: > > > > On Mon, Sep 2, 2024 at 5:47 AM Peter Smith wrote: > > > > > > > > > To summarize, the current description wrongly describes the field as a > > > time duration: > > > "The time since the slot has become inactive." > > > > > > I suggest replacing it with: > > > "The slot has been inactive since this time." > > > > > > > +1 for the change. If I had read the document without knowing about > > the patch, I too would have interpreted it as a duration. > > > > The suggested change looks good to me as well. I'll wait for a day or > two before pushing to see if anyone thinks otherwise. If possible, I'd prefer to use "the time" as the subject. For example, would "The time this slot was inactivated" be acceptable? However, this loses the sense of continuation up to that point, so if that's crucial, the current proposal might be better. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Add callback in pgstats for backend initialization
At Tue, 3 Sep 2024 05:00:54 +, Bertrand Drouvot wrote in > Hi, > > On Tue, Sep 03, 2024 at 10:52:20AM +0900, Michael Paquier wrote: > > Hi all, > > > > Currently, the backend-level initialization of pgstats happens in > > pgstat_initialize(), where we are using a shortcut for the WAL stats, > > with pgstat_init_wal(). > > > > I'd like to propose to add a callback for that, so as in-core or > > custom pgstats kinds can assign custom actions when initializing a > > backend. The use-case I had for this one are pretty close to what we > > do for WAL, where we would rely on a difference of state depending on > > what may have existed before reaching the initialization path. So > > this can offer more precision. Another case, perhaps less > > interesting, is to be able to initialize some static backend state. > > I think the proposal makes sense and I can see the use cases too, so +1. +1, too. The name "init_backend" makes it sound like the function initializes the backend. backend_init might be a better choice, but I'm not sure. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Improving the latch handling between logical replication launcher and worker processes.
At Tue, 3 Sep 2024 09:10:07 +0530, vignesh C wrote in > The attached v2 version patch has the changes for the same. Sorry for jumping in at this point. I've just reviewed the latest patch (v2), and the frequent Own/Disown-Latch operations caught my attention. Additionally, handling multiple concurrently operating trigger sources with nested latch waits seems bug-prone, which I’d prefer to avoid from both a readability and safety perspective. With that in mind, I’d like to suggest an alternative approach. I may not be fully aware of all the previous discussions, so apologies if this idea has already been considered and dismissed. Currently, WaitForReplicationWorkerAttach() and logicalrep_worker_stop_internal() wait on a latch after verifying the desired state. This ensures that even if there are spurious or missed wakeups, they won't cause issues. In contrast, ApplyLauncherMain() enters a latch wait without checking the desired state first. Consequently, if another process sets the latch to wake up the main loop while the former two functions are waiting, that wakeup could be missed. If my understanding is correct, the problem lies in ApplyLauncherMain() not checking the expected state before beginning to wait on the latch. There is no issue with waiting if the state hasn't been satisfied yet. So, I propose that ApplyLauncherMain() should check the condition that triggers a main loop wakeup before calling WaitLatch(). To do this, we could add a flag in LogicalRepCtxStruct to signal that the main loop has immediate tasks to handle. ApplyLauncherWakeup() would set this flag before sending SIGUSR1. In turn, ApplyLauncherMain() would check this flag before calling WaitLatch() and skip the WaitLatch() call if the flag is set. I think this approach could solve the issue without adding complexity. What do you think? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
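The flag protocol proposed above can be sketched with a toy model. The names here (launcher_has_work, the helper functions) are hypothetical; real code would put the flag in the shared LogicalRepCtxStruct, set the latch in ApplyLauncherWakeup(), and call WaitLatch() in ApplyLauncherMain(). The essential point is that the waiter consumes the flag before deciding to block, so a wakeup posted while it was busy elsewhere is never lost.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the proposed shared flag (names are hypothetical) */
typedef struct LogicalRepCtxStruct
{
	atomic_bool launcher_has_work;	/* set before SetLatch(), per proposal */
} LogicalRepCtxStruct;

static LogicalRepCtxStruct LogicalRepCtx;
static int	latch_waits;			/* counts simulated WaitLatch() calls */

/* stand-in for WaitLatch(): in the real code a nested latch wait
 * elsewhere could consume the wakeup this call is waiting for */
static void
wait_latch(void)
{
	latch_waits++;
}

/* ApplyLauncherWakeup() under the proposal: flag first, then latch */
static void
launcher_wakeup(void)
{
	atomic_store(&LogicalRepCtx.launcher_has_work, true);
	/* SetLatch(&launcher->procLatch) would follow here */
}

/* One iteration of the ApplyLauncherMain() loop under the proposal */
static void
launcher_main_iteration(void)
{
	/* Consume the flag; skip the wait if work already arrived */
	if (!atomic_exchange(&LogicalRepCtx.launcher_has_work, false))
		wait_latch();
	/* ... handle pending tasks ... */
}
```

This mirrors the usual "re-check the condition before sleeping" discipline that WaitForReplicationWorkerAttach() already follows.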
Re: per backend I/O statistics
At Tue, 03 Sep 2024 15:37:49 +0900 (JST), Kyotaro Horiguchi wrote in > When I first looked at this patch, my initial thought was whether we > should let these stats stay "fixed." The reason why the current > PGSTAT_KIND_IO is fixed is that there is only one global statistics > storage for the entire database. If we have stats for a flexible > number of backends, it would need to be non-fixed, perhaps with the > entry for INVALID_PROC_NUMBER storing the global I/O stats, I > suppose. However, one concern with that approach would be the impact > on performance due to the frequent creation and deletion of stats > entries caused by high turnover of backends. As an additional benefit of this approach, the client can set a connection variable, for example, no_backend_iostats to true, or set its inverse variable to false, to restrict memory usage to only the required backends. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: per backend I/O statistics
At Mon, 2 Sep 2024 14:55:52 +, Bertrand Drouvot wrote in > Hi hackers, > > Please find attached a patch to implement $SUBJECT. > > While pg_stat_io provides cluster-wide I/O statistics, this patch adds a new > pg_my_stat_io view to display "my" backend I/O statistics and a new > pg_stat_get_backend_io() function to retrieve the I/O statistics for a given > backend pid. > > By having the per backend level of granularity, one could for example identify > which running backend is responsible for most of the reads, most of the > extends > and so on... The pg_my_stat_io view could also be useful to check the > impact on the I/O made by some operations, queries,... in the current session. > > Some remarks: > > - it is split in 2 sub patches: 0001 introducing the necessary changes to > provide > the pg_my_stat_io view and 0002 to add the pg_stat_get_backend_io() function. > - the idea of having per backend I/O statistics has already been mentioned in > [1] by Andres. > > Some implementation choices: > > - The KIND_IO stats are still "fixed amount" ones as the maximum number of > backend is fixed. > - The statistics snapshot is made for the global stats (the aggregated ones) > and > for my backend stats. The snapshot is not build for all the backend stats > (that > could be memory expensive depending on the number of max connections and given > the fact that PgStat_IO is 16KB long). > - The above point means that pg_stat_get_backend_io() behaves as if > stats_fetch_consistency is set to none (each execution re-fetches counters > from shared memory). > - The above 2 points are also the reasons why the pg_my_stat_io view has been > added (as its results takes care of the stats_fetch_consistency setting). I > think > that makes sense to rely on it in that case, while I'm not sure that would > make > a lot of sense to retrieve other's backend I/O stats and taking care of > stats_fetch_consistency. 
> > > [1]: > https://www.postgresql.org/message-id/20230309003438.rectf7xo7pw5t5cj%40awork3.anarazel.de I'm not sure about the usefulness of having the stats only available from the current session. Since they are stored in shared memory, shouldn't we make them accessible to all backends? However, this would introduce permission considerations and could become complex. When I first looked at this patch, my initial thought was whether we should let these stats stay "fixed." The reason why the current PGSTAT_KIND_IO is fixed is that there is only one global statistics storage for the entire database. If we have stats for a flexible number of backends, it would need to be non-fixed, perhaps with the entry for INVALID_PROC_NUMBER storing the global I/O stats, I suppose. However, one concern with that approach would be the impact on performance due to the frequent creation and deletion of stats entries caused by high turnover of backends. Just to be clear, the above comments are not meant to oppose the current implementation approach. They are purely for the sake of discussing comparisons with other possible approaches. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Use XLOG_CONTROL_FILE macro everywhere?
At Mon, 2 Sep 2024 15:44:45 +0200, Daniel Gustafsson wrote in > Summarizing the thread it seems consensus is using XLOG_CONTROL_FILE > consistently as per the original patch. > > A few comments on the patch though: > > - * reads the data from $PGDATA/global/pg_control > + * reads the data from $PGDATA/ > > I don't think this is an improvement, I'd leave that one as the filename > spelled out. > > - "the \".old\" suffix from %s/global/pg_control.old.\n" > + "the \".old\" suffix from %s/%s.old.\n" > > Same with that change, not sure I think that makes reading the errormessage > code any easier. I agree with the first point. In fact, I think it might be even better to just write something like "reads the data from the control file" in plain language rather than using the actual file name. As for the second point, I'm fine either way, but if the main goal is to ensure resilience against future changes to the value of XLOG_CONTROL_FILE, then changing it makes sense. On the other hand, I don't see any strong reasons not to change it. That said, I don't really expect the value to change in the first place. The following point caught my attention. > +++ b/src/backend/postmaster/postmaster.c ... > +#include "access/xlog_internal.h" The name xlog_internal suggests that the file should only be included by modules dealing with XLOG details, and postmaster.c doesn't seem to fit that category. If the macro is used more broadly, it might be better to move it to a more public location. However, following the current discussion, if we decide to keep the macro's name as it is, it would make more sense to keep it in its current location. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Missing reflection in nls.mk from recent basebackup changes
Hello. After a recent commit f80b09bac8, "make update-po" fails with "missing required files". The commit moved some files to fe_utils, but this change was not reflected in pg_basebackup's nls.mk. The attached patch fixes this issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From f9eac9605d0309d32d6d1fab7a0a2d4e8d42798a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Tue, 6 Aug 2024 10:04:54 +0900
Subject: [PATCH] Fix nls.mk to reflect astreamer files relocation

In the recent commit f80b09bac8, astreamer files were moved to another
directory, but this change was not reflected in nls.mk. This commit
corrects that oversight.
---
 src/bin/pg_basebackup/nls.mk | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_basebackup/nls.mk b/src/bin/pg_basebackup/nls.mk
index 950b9797b1..f75f44dbe6 100644
--- a/src/bin/pg_basebackup/nls.mk
+++ b/src/bin/pg_basebackup/nls.mk
@@ -1,12 +1,7 @@
 # src/bin/pg_basebackup/nls.mk
 CATALOG_NAME = pg_basebackup
 GETTEXT_FILES= $(FRONTEND_COMMON_GETTEXT_FILES) \
-	astreamer_file.c \
-	astreamer_gzip.c \
 	astreamer_inject.c \
-	astreamer_lz4.c \
-	astreamer_tar.c \
-	astreamer_zstd.c \
 	pg_basebackup.c \
 	pg_createsubscriber.c \
 	pg_receivewal.c \
@@ -19,6 +14,11 @@ GETTEXT_FILES= $(FRONTEND_COMMON_GETTEXT_FILES) \
 	../../common/fe_memutils.c \
 	../../common/file_utils.c \
 	../../common/restricted_token.c \
+	../../fe_utils/astreamer_file.c \
+	../../fe_utils/astreamer_gzip.c \
+	../../fe_utils/astreamer_lz4.c \
+	../../fe_utils/astreamer_tar.c \
+	../../fe_utils/astreamer_zstd.c \
 	../../fe_utils/option_utils.c \
 	../../fe_utils/recovery_gen.c \
 	../../fe_utils/string_utils.c
-- 
2.43.5
wrong translation file reference in pg_createsubscriber
Hi, I found that pg_createsubscriber doesn't use NLS files. This is due to the wrong reference name "pg_createsubscriber" being passed to set_pglocale_pgservice(). It should be "pg_basebackup" instead. See the attached patch.

# Sorry for being away for a while. I should be able to return soon.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From ede9fecf6a042e87c5cf92ed34d6b1991646577b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Fri, 2 Aug 2024 11:33:52 +0900
Subject: [PATCH] Fix NLS file reference in pg_createsubscriber

pg_createsubscriber is referring to a non-existent message translation
file, causing NLS to not work correctly. This command should use the
same file as pg_basebackup.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index aab9bf0f51..4bce45e1e4 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -1904,7 +1904,7 @@ main(int argc, char **argv)
 	pg_logging_init(argv[0]);
 	pg_logging_set_level(PG_LOG_WARNING);
 	progname = get_progname(argv[0]);
-	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_createsubscriber"));
+	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_basebackup"));
 
 	if (argc > 1)
 	{
-- 
2.43.5
Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
At Fri, 21 Jun 2024 11:48:22 +0530, Amit Kapila wrote in > On Wed, Jun 19, 2024 at 10:44 AM Hayato Kuroda (Fujitsu) > wrote: > > > > Dear Horiguchi-san, > > > > Thanks for sharing the patch! I agree this approach (ensure WAL records are > > flushed) > > Is more proper than others. > > > > I have an unclear point. According to the comment atop GetInsertRecPtr(), > > it just > > returns the approximated value - the position of the last full WAL page [1]. > > If there is a continuation WAL record which across a page, will it return > > the > > halfway point of the WAL record (end of the first WAL page)? If so, the > > proposed > > fix seems not sufficient. We have to point out the exact the end of the > > record. > > > > You have a point but if this theory is correct why we are not able to > reproduce the issue after this patch? Also, how to get the WAL > location up to which we need to flush? Is XLogCtlData->logInsertResult > the one we are looking for? It is not exposed, but of course logInsertResult is a more straightforward source for the LSN. The reason the patch works well is the following bit of code. xlog.c:958, in XLogInsertRecord() > /* >* Update shared LogwrtRqst.Write, if we crossed page boundary. >*/ > if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ) > { > SpinLockAcquire(&XLogCtl->info_lck); > /* advance global request to include new block(s) */ > if (XLogCtl->LogwrtRqst.Write < EndPos) > XLogCtl->LogwrtRqst.Write = EndPos; > SpinLockRelease(&XLogCtl->info_lck); > RefreshXLogWriteResult(LogwrtResult); > } This code, which has existed for a long time, ensures that GetInsertRecPtr() returns the accurate end of a record when it spans page boundaries. This would need to be written in the new comment if we use GetInsertRecPtr(). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
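The page-boundary test in the quoted code is plain integer division: start and end positions are on different WAL pages exactly when their page numbers differ. A tiny self-contained illustration (XLOG_BLCKSZ fixed here at its default of 8192; the helper name is made up):

```c
#include <stdint.h>

#define XLOG_BLCKSZ 8192		/* default WAL block size */

typedef uint64_t XLogRecPtr;

/*
 * True when a record spanning [StartPos, EndPos) crosses a page
 * boundary, i.e. the two positions fall on different XLOG pages.
 * This mirrors the condition in the quoted XLogInsertRecord() code.
 */
static int
crosses_page_boundary(XLogRecPtr StartPos, XLogRecPtr EndPos)
{
	return StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ;
}
```

Only such boundary-crossing records advance LogwrtRqst.Write, which is why GetInsertRecPtr() ends up reflecting the exact end of a continuation record rather than the halfway point.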
Re: Pluggable cumulative statistics
At Thu, 13 Jun 2024 16:59:50 +0900, Michael Paquier wrote in > * The kind IDs may change across restarts, meaning that any stats data > associated to a custom kind is stored with the *name* of the custom > stats kind. Depending on the discussion happening here, I'd be open > to use the same concept as custom RMGRs, where custom kind IDs are > "reserved", fixed in time, and tracked in the Postgres wiki. It is > cheaper to store the stats this way, as well, while managing conflicts > across extensions available in the community ecosystem. I prefer to avoid having a central database if possible. If we don't intend to move stats data alone out of a cluster for use in another one, can't we store the relationship between stats names and numeric IDs (or index numbers) in a separate file, which is loaded just before and synced just after extension preloading finishes? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] make async slave to wait for lsn to be replayed
ts. = L756 +{ oid => '16387', descr => 'wait for LSN with timeout', The description seems a bit off. While timeout is mentioned, the more important characteristic, that the LSN is the replay LSN, is not included. = L791 + * WaitLSNProcInfo – the shared memory structure representing information This line contains a non-ASCII character, an EN DASH (U+2013). = L798 + * A process number, same as the index of this item in waitLSN->procInfos. + * Stored for convenience. + */ + int procnum; It is described as "(just) for convenience". However, it is referenced by Startup to fetch the PGPROC entry for the waiter, which is necessary for Startup. That aside, why don't we hold (the pointer to) procLatch instead of procnum? It makes things simpler and I believe it is our standard practice. = L809 + /* A flag indicating that this item is added to waitLSN->waitersHeap */ + bool inHeap; The name "inHeap" seems too literal and might be hard to understand in most occurrences. How about using "waiting" instead? = L920 +# I +# Make sure that pg_wal_replay_wait() works: add new content to Hmm. I feel that Arabic numerals look nicer than Roman ones here. = L940 +# Check that new data is visible after calling pg_wal_replay_wait() On the other hand, the comment for the check for this test states that > +# Make sure the current LSN on standby and is the same as primary's > LSN +ok($output eq 30, "standby reached the same LSN as primary"); I think the first comment and the second should be consistent. > Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl. > > 1. It's not clear why this test needs node_standby2 at all. It seems useless. I agree with you. What we would need is a second *waiter client* connecting to the same standby rather than a second standby. I feel it would be good to have a test where the first waiter is removed while multiple waiters are waiting, as well as a test where promotion occurs under the same circumstances. > 2. 
The target LSN is set to pg_current_wal_insert_lsn() + 1. This > location seems to be unachievable in this test. So, it's not clear > what race condition this test could potentially detect. > 3. I think it would make sense to check for the race condition > reported by Heikki. That is to insert the injection point at the > beginning of WaitLSNSetLatches(). I think the race condition you mentioned refers to the inconsistency between the inHeap flag and the pairing heap caused by a race condition between timeout and wakeup (or perhaps other combinations? I'm not sure which version of the patch the mentioned race condition refers to). However, I imagine it is difficult to reliably reproduce this condition. In that regard, in the latest patch, the coherence between the inHeap flag and the pairing heap is protected by LWLock, so I believe we no longer need that test. regards. > Links. > 1. > https://www.postgresql.org/message-id/flat/CAPpHfdvGRssjqwX1%2Bidm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw%40mail.gmail.com#557900e860457a9e24256c93a2ad4920 -- Kyotaro Horiguchi NTT Open Source Software Center
Re: replace strtok()
At Tue, 18 Jun 2024 09:18:28 +0200, Peter Eisentraut wrote in > Under the topic of getting rid of thread-unsafe functions in the > backend [0], here is a patch series to deal with strtok(). > > Of course, strtok() is famously not thread-safe and can be replaced by > strtok_r(). But it also has the wrong semantics in some cases, > because it considers adjacent delimiters to be one delimiter. So if > you parse > > SCRAM-SHA-256$:$: > > with strtok(), then > > SCRAM-SHA-256$$::$$:: > > parses just the same. In many cases, this is arguably wrong and could > hide mistakes. > > So I'm suggesting to use strsep() in those places. strsep() is > nonstandard but widely available. > > There are a few places where strtok() has the right semantics, such as > parsing tokens separated by whitespace. For those, I'm using > strtok_r(). I agree with the distinction. > A reviewer job here would be to check whether I made that distinction > correctly in each case. 0001 and 0002 look correct to me regarding that distinction. They applied cleanly to the master HEAD and all tests passed on Linux. > On the portability side, I'm including a port/ replacement for > strsep() and some workaround to get strtok_r() for Windows. I have > included these here as separate patches for clarity. 0003 looks fine; it built successfully and seems to work in an MSVC build. About 0004, Cygwin seems to have its own strtok_r, but I haven't checked how that fact affects the build. > [0]: > https://www.postgresql.org/message-id/856e5ec3-879f-42ee-8258-8bcc6ec9b...@eisentraut.org regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
At Thu, 13 Jun 2024 09:29:03 +0530, Amit Kapila wrote in > Yeah, but the commit you quoted later reverted by commit 703f148e98 > and committed again as c6c3334364. Yeah, right.. > > aiming to prevent walsenders from > > generating competing WAL (by, for example, CREATE_REPLICATION_SLOT) > > records with the shutdown checkpoint. Thus, it seems that the > > walsender cannot see the shutdown record, > > > > This is true of logical walsender. The physical walsender do send > shutdown checkpoint record before getting terminated. Yes, I know. They differ in their blocking mechanisms. > > and a certain amount of > > bytes before it, as the walsender appears to have relied on the > > checkpoint flushing its record, rather than on XLogBackgroundFlush(). > > > > If we approve of the walsender being terminated before the shutdown > > checkpoint, we need to "fix" the comment, then provide a function to > > ensure the synchronization of WAL records. > > > > Which comment do you want to fix? Yeah. The part you seem to think I was trying to fix is actually fine. Instead, I have revised the comment on the modified section to make its intention clearer. > > I'll consider this direction for a while. > > > > Okay, thanks. The attached patch is it. It's only for the master. I decided not to create a new function because the simple code has only one caller. I haven't seen the test script fail with this fix. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 663bdeaf8d4d2f5a192dd3ecb6d2817d9b1311f1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 18 Jun 2024 16:36:43 +0900 Subject: [PATCH] Ensure WAL is written out when terminating a logical walsender In commit c6c3334364, XLogBackgroundFlush() was assumed to flush all written WAL to the end, but this function may not flush the last incomplete page in a single call. 
In such cases, if the last written page ends with a continuation record, the latter part will not be flushed, causing shutdown to wait indefinitely for that part. This commit ensures that the written records are fully flushed to the end, guaranteeing expected behavior. --- src/backend/replication/walsender.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index c623b07cf0..5aa0f58238 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1858,12 +1858,13 @@ WalSndWaitForWal(XLogRecPtr loc) ProcessRepliesIfAny(); /* - * If we're shutting down, trigger pending WAL to be written out, - * otherwise we'd possibly end up waiting for WAL that never gets - * written, because walwriter has shut down already. + * If we're shutting down, all WAL-writing processes are gone, leaving + * only checkpointer to perform the shutdown checkpoint. Ensure that + * any pending WAL is written out here; otherwise, we'd possibly end up + * waiting for WAL that never gets written. */ - if (got_STOPPING) - XLogBackgroundFlush(); + if (got_STOPPING && !RecoveryInProgress()) + XLogFlush(GetInsertRecPtr()); /* * To avoid the scenario where standbys need to catch up to a newer -- 2.43.0
Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
At Tue, 11 Jun 2024 14:27:28 +0530, Amit Kapila wrote in > On Tue, Jun 11, 2024 at 12:34 PM Kyotaro Horiguchi > wrote: > > > > At Tue, 11 Jun 2024 11:32:12 +0530, Amit Kapila > > wrote in > > > Sorry, it is not clear to me why we failed to flush the last > > > continuation record in logical walsender? I see that we try to flush > > > the WAL after receiving got_STOPPING in WalSndWaitForWal(), why is > > > that not sufficient? > > > > It seems that, it uses XLogBackgroundFlush(), which does not guarantee > > flushing WAL until the end. > > > > What would it take to ensure the same? I am trying to explore this > path because currently logical WALSender sends any outstanding logs up > to the shutdown checkpoint record (i.e., the latest record) and waits > for them to be replicated to the standby before exit. Please take a > look at the comments where we call WalSndDone(). The fix you are > proposing will break that guarantee. Since 086221cf6b, the shutdown checkpoint has been performed after the walsender completes termination, aiming to prevent walsenders from generating WAL records (by, for example, CREATE_REPLICATION_SLOT) that compete with the shutdown checkpoint. Thus, it seems that the walsender cannot see the shutdown record, nor a certain number of bytes before it, as the walsender appears to have relied on the checkpoint flushing its record, rather than on XLogBackgroundFlush(). If we approve of the walsender being terminated before the shutdown checkpoint, we need to "fix" the comment, then provide a function to ensure the synchronization of WAL records. I'll consider this direction for a while. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
At Tue, 11 Jun 2024 09:27:20 +0900, Michael Paquier wrote in > On Thu, Jun 06, 2024 at 03:19:20PM +0900, Kyotaro Horiguchi wrote: > > So, I believe the attached small patch fixes the behavior. I haven't > > come up with a good test script for this issue. Something like > > 026_overwrite_contrecord.pl might work, but this situation seems a bit > > more complex than what it handles. > > Hmm. Indeed you will not be able to reuse the same trick with the end > of a segment. Still you should be able to get a rather stable test by > using the same tricks as 039_end_of_wal.pl to spawn a record across > multiple pages, no? With the trick, we could write a page-spanning record, but I'm not sure we can control the behavior of XLogBackgroundFlush(). As Amit suggested, we have the option to create a variant of the function that guarantees flushing WAL until the end. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
At Tue, 11 Jun 2024 11:32:12 +0530, Amit Kapila wrote in > Sorry, it is not clear to me why we failed to flush the last > continuation record in logical walsender? I see that we try to flush > the WAL after receiving got_STOPPING in WalSndWaitForWal(), why is > that not sufficient? It seems that it uses XLogBackgroundFlush(), which does not guarantee flushing WAL to the end. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: relfilenode statistics
At Mon, 10 Jun 2024 08:09:56 +, Bertrand Drouvot wrote in > Hi, > > On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote: > > On Thu, Jun 6, 2024 at 11:17 PM Andres Freund wrote: > > > If we just want to keep prior stats upon arelation rewrite, we can just > > > copy > > > the stats from the old relfilenode. Or we can decide that those stats > > > don't > > > really make sense anymore, and start from scratch. > > > > I think we need to think carefully about what we want the user > > experience to be here. "Per-relfilenode stats" could mean "sometimes I > > don't know the relation OID so I want to use the relfilenumber > > instead, without changing the user experience" or it could mean "some > > of these stats actually properly pertain to the relfilenode rather > > than the relation so I want to associate them with the right object > > and that will affect how the user sees things." We need to decide > > which it is. If it's the former, then we need to examine whether the > > goal of hiding the distinction between relfilenode stats and relation > > stats from the user is in fact feasible. If it's the latter, then we > > need to make sure the whole patch reflects that design, which would > > include e.g. NOT copying stats from the old to the new relfilenode, > > and which would also include documenting the behavior in a way that > > will be understandable to users. > > Thanks for sharing your thoughts! > > Let's take the current heap_blks_read as an example: it currently survives > a relation rewrite and I guess we don't want to change the existing user > experience for it. > > Now say we want to add "heap_blks_written" (like in this POC patch) then I > think > that it makes sense for the user to 1) query this new stat from the same place > as the existing heap_blks_read: from pg_statio_all_tables and 2) to have the > same > experience as far the relation rewrite is concerned (keep the previous stats). 
> > To achieve the rewrite behavior we could: > > 1) copy the stats from the OLD relfilenode to the relation (like in the POC > patch) > 2) copy the stats from the OLD relfilenode to the NEW one (could be in a > dedicated > field) > > The PROS of 1) is that the behavior is consistent with the current > heap_blks_read > and that the user could still see the current relfilenode stats (through a > new API) > if he wants to. > > > In my experience, the worst thing you can do in cases like this is be > > somewhere in the middle. Then you tend to end up with stuff like: the > > difference isn't supposed to be something that the user knows or cares > > about, except that they do have to know and care because you haven't > > thoroughly covered up the deception, and often they have to reverse > > engineer the behavior because you didn't document what was really > > happening because you imagined that they wouldn't notice. > > My idea was to move all that is in pg_statio_all_tables to relfilenode stats > and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) > ensure > the user can still retrieve the stats from pg_statio_all_tables in such a way > that it survives a rewrite, 3) provide dedicated APIs to retrieve > relfilenode stats but only for the current relfilenode, 4) document this > behavior. This is what the POC patch is doing for heap_blks_written (would > need to do the same for heap_blks_read and friends) except for the > documentation > part. What do you think? In my opinion, it is certainly strange that bufmgr is aware of relation kinds, but introducing relfilenode stats to avoid this skew doesn't seem to be the best way, as it invites inconclusive arguments like the one raised above. The fact that we transfer counters from old relfilenodes to new ones indicates that we are not really interested in counts by relfilenode. 
If that's the case, wouldn't it be simpler to call pgstat_count_relation_buffer_read() from bufmgr.c and then branch according to relkind within that function? If you're concerned about the additional branch, some ingenuity may be needed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Things I don't like about \du's "Attributes" column
At Sat, 8 Jun 2024 14:09:11 -0400, Robert Haas wrote in > On Sat, Jun 8, 2024 at 10:02 AM Pavel Luzanov > wrote: > > Therefore, I think the current patch offers a better version of the \du > > command. > > However, I admit that these improvements are not enough to accept the patch. > > I would like to hear other opinions. > > Hmm, I don't think I quite agree with this. If people like this > version better than what we have now, that's all we need to accept the > patch. I just don't really want to be the one to decide all by myself > whether this is, in fact, better. So, like you, I would like to hear > other opinions. > regress_du_role0 | yes | Inherit | Tue Jun 04 00:00:00 2024 PDT | > 0 | > regress_du_role1 | no| Create role+| infinity | > | I guess that in English, when written as "'Login' = 'yes/no'", it can be easily understood. However, in Japanese, "'ログイン' = 'はい/いいえ'" looks somewhat awkward and is a bit difficult to understand at a glance. "'ログイン' = '可/不可'" (equivalent to "Login is 'can/cannot'") sounds more natural in Japanese, but it was rejected upthread, and I also don't like 'can/cannot'. To give further candidates, "allowed/not allowed" or "granted/denied" can be mentioned, and they would be easier to translate, at least to Japanese. However, is there a higher likelihood that 'granted/denied' will be misunderstood as referring to database permissions? Likewise, "'Valid until' = 'infinity'" (equivalent to "'有効期限' = ' 無限'") also sounds awkward. Maybe that's the same in English. I guess that 'unbounded' or 'indefinite' sounds better, and their Japanese translation '無期限' also sounds natural. However, I'm not sure we want to go to that extent in transforming the table. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Thu, 06 Jun 2024 17:15:15 +0900 (JST), Kyotaro Horiguchi wrote in > At Thu, 06 Jun 2024 16:45:00 +0900 (JST), Kyotaro Horiguchi > wrote in > > I have been thinking about this since then. At first, I thought it > > referred to FindFirstChangeNotification() and friends, and inotify on > > Linux. However, I haven't found a way to simplify the specified code > > area using those APIs. > > By the way, the need to shift by 2 seconds to tolerate clock skew > suggests that the current launcher-postmaster association mechanism is > somewhat unreliable. Couldn't we add a command line option to > postmaster to explicitly pass a unique identifier (say, pid itself) of > the launcher? If it is not specified, the number should be the PID of > the immediate parent process. No. We should use the combination of pg_ctl's PID and a timestamp, to avoid false matching during a reboot. > This change avoids the need for the special treatment for Windows. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Thu, 06 Jun 2024 16:45:00 +0900 (JST), Kyotaro Horiguchi wrote in > I have been thinking about this since then. At first, I thought it > referred to FindFirstChangeNotification() and friends, and inotify on > Linux. However, I haven't found a way to simplify the specified code > area using those APIs. By the way, the need to shift by 2 seconds to tolerate clock skew suggests that the current launcher-postmaster association mechanism is somewhat unreliable. Couldn't we add a command line option to postmaster to explicitly pass a unique identifier (say, pid itself) of the launcher? If it is not specified, the number should be the PID of the immediate parent process. This change avoids the need for the special treatment for Windows. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
At Tue, 4 Jun 2024 08:30:19 +0900, Michael Paquier wrote in > On Mon, Jan 15, 2024 at 01:34:46PM -0500, Robert Haas wrote: > > This kind of change looks massively helpful to me. I don't know if it > > is exactly right or not, but it would have been a big help to me when > > writing my previous review, so +1 for some change of this general > > type. > > During a live review of this patch last week, as part of the Advanced > Patch Workshop of pgconf.dev, it has been mentioned by Tom that we may > be able to simplify the check on pmstart if the detection of the > postmaster PID started by pg_ctl is more stable using the WIN32 > internals that this patch relies on. I am not sure that this > suggestion is right, though, because we should still care about the > clock skew case as written in the surrounding comments? Even if > that's OK, I would assume that this should be an independent patch, > written on top of the proposed v6-0001. > > Tom, could you comment about that? Perhaps my notes did not catch > what you meant. Thank you for the follow-up. I have been thinking about this since then. At first, I thought it referred to FindFirstChangeNotification() and friends, and inotify on Linux. However, I haven't found a way to simplify the specified code area using those APIs. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
At Thu, 6 Jun 2024 12:49:45 +1000, Peter Smith wrote in > Hi, I have reproduced this multiple times now. > > I confirmed the initial post/steps from Alexander. i.e. The test > script provided [1] gets itself into a state where function > ReadPageInternal (called by XLogDecodeNextRecord and commented "Wait > for the next page to become available") constantly returns > XLREAD_FAIL. Ultimately the test times out because WalSndLoop() loops > forever, since it never calls WalSndDone() to exit the walsender > process. Thanks for the repro; I believe I understand what's happening here. During server shutdown, the latter half of the last continuation record may fail to be flushed. This is similar to what is described in the commit message of commit ff9f111bce. While shutting down, WalSndLoop() waits for XLogSendLogical() to consume WAL up to flushPtr, but in this case, the last record cannot complete without the continuation part starting from flushPtr, which is missing. In such cases, however, xlogreader.missingContrecPtr is set to the beginning of the missing part. So, I believe the attached small patch fixes the behavior. I haven't come up with a good test script for this issue. Something like 026_overwrite_contrecord.pl might work, but this situation seems a bit more complex than what it handles. Versions back to 10 should suffer from the same issue and the same patch will be applicable without significant changes. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 99cad7bd53a94b4b90937fb1eb2f37f2ebcadf6a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 6 Jun 2024 14:56:53 +0900 Subject: [PATCH] Fix infinite loop in walsender during publisher shutdown When a publisher server is shutting down, there can be a case where the last WAL record at that point is a continuation record with its latter part not yet flushed. In such cases, the walsender attempts to read this unflushed part and ends up in an infinite loop. 
To prevent this situation, modify the logical WAL sender to consider itself caught up in this case. The records that are not fully flushed at this point are generally not significant, so simply ignoring them should not cause any issues. --- src/backend/replication/walsender.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index c623b07cf0..635396c138 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -3426,8 +3426,15 @@ XLogSendLogical(void) flushPtr = GetFlushRecPtr(NULL); } - /* If EndRecPtr is still past our flushPtr, it means we caught up. */ - if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr) + /* + * If EndRecPtr is still past our flushPtr, it means we caught up. When + * the server is shutting down, the latter part of a continuation record + * may be missing. If got_STOPPING is true, assume we are caught up if the + * last record is missing its continuation part at flushPtr. + */ + if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr || + (got_STOPPING && + logical_decoding_ctx->reader->missingContrecPtr == flushPtr)) WalSndCaughtUp = true; /* -- 2.43.0
Re: Add last_commit_lsn to pg_stat_database
At Tue, 20 Feb 2024 07:56:28 +0900, Michael Paquier wrote in > On Mon, Feb 19, 2024 at 10:26:43AM +0100, Tomas Vondra wrote: > It just means that I am not much a fan of the signature changes done > for RecordTransactionCommit() and AtEOXact_PgStat_Database(), adding a > dependency to a specify LSN. Your suggestion to switching to > XactLastRecEnd should avoid that. > > > - stats_fetch_consistency=cache: always the same min/max value > > > > - stats_fetch_consistency=none: min != max > > > > Which would suggest you're right and this should be VOLATILE and not > > STABLE. But then again - the existing pg_stat_get_db_* functions are all > > marked as STABLE, so (a) is that correct and (b) why should this > > function be marked different? > > This can look like an oversight of 5891c7a8ed8f to me. I've skimmed > through the threads related to this commit and messages around [1] > explain why this GUC exists and why we have both behaviors, but I'm > not seeing a discussion about the volatibility of these functions. > The current situation is not bad either, the default makes these > functions stable, and I'd like to assume that users of this GUC know > what they do. Perhaps Andres or Horiguchi-san can comment on that. > > https://www.postgresql.org/message-id/382908.1616343...@sss.pgh.pa.us I agree that we cannot achieve, nor do we need, perfect MVCC behavior, and that completely volatile behavior is unusable. I believe the functions are kept marked as stable, as this is the nearest and most usable volatility for the default behavior, since function volatility is not switchable on-the-fly. This approach seems least trouble-prone to me. The consistency of the functions is discussed here. https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-STATS-VIEWS > This is a feature, not a bug, because ... 
Conversely, if it's known > that statistics are only accessed once, caching accessed statistics is > unnecessary and can be avoided by setting stats_fetch_consistency to > none. It seems to me that this description already implies such a deviation of the functions' behavior from "stable" behavior, but we might want to mention that deviation explicitly. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Fix use of possible uninitialized variable retval (src/pl/plpgsql/src/pl_handler.c)
At Mon, 27 May 2024 11:31:24 -0300, Ranier Vilela wrote in > Hi. > > The function *plpgsql_inline_handler* can use uninitialized > variable retval, if PG_TRY fails. > Fix like function *plpgsql_call_handler* wich declare retval as > volatile and initialize to (Datum 0). If PG_TRY fails, retval is not actually accessed, so no real issue exists. Commit 7292fd8f1c changed plpgsql_call_handler() to the current form, but as stated in its commit message, it did not fix a real issue and was done solely to silence a compiler warning. I believe we do not need to modify plpgsql_inline_handler() unless the compiler actually issues a false warning for it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [PATCH]A minor improvement to the error-report in SimpleLruWriteAll()
At Tue, 28 May 2024 20:15:59 +0800 (CST), "Long Song" wrote in > > Hi, > Actually, I still wonder why only the error message > of the last failure to close the file was recorded. > For this unusual situation, it is acceptable to > record all failure information without causing > too much logging. > Was it designed that way on purpose? Note that SlruReportIOError() causes a non-local exit. To me, the point of the loop seems to be that we want to close every single file, apart from the failed ones. From that perspective, the patch disrupts that intended behavior by exiting in the middle of the loop. It seems we didn't want to bother collecting errors for every failed file in that part. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_parse_json() should not leak token copies on failure
Hi, At Fri, 24 May 2024 08:43:01 -0700, Jacob Champion wrote in > On Tue, Apr 30, 2024 at 2:29 PM Jacob Champion > wrote: > > Attached is a draft patch to illustrate what I mean, but it's > > incomplete: it only solves the problem for scalar values. > > (Attached is a v2 of that patch; in solving a frontend leak I should > probably not introduce a backend segfault.) I understand that the part enclosed in parentheses refers to the "if (ptr) pfree(ptr)" structure. I believe we rely on the behavior of free(NULL), which safely does nothing. (I couldn't find the related discussion due to a timeout error on the ML search page.) Although I don't fully understand the entire parser code, it seems that the ownership transfer only occurs in JSON_TOKEN_STRING cases. That is, the memory pointed to by scalar_val would become dangling in JSON_TOKEN_NUMBER cases. Even if this is not the case, the ownership transition appears quite challenging to follow. It might be safer or clearer to pstrdup the token in jsonb_in_scalar() and avoid NULLifying scalar_val after calling callbacks, or to let jsonb_in_scalar() NULLify the pointer. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: In-placre persistance change of a relation
Rebased. Along with rebasing, I changed the interface of XLogFsyncFile() to return a boolean instead of an error message. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From bed74e638643d7491bbd86fe640c33db1e16f0e5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Mon, 15 Jan 2024 15:57:53 +0900 Subject: [PATCH v33 1/3] Export wal_sync_method related functions Export several functions related to wal_sync_method for use in subsequent commits. Since PG_O_DIRECT cannot be used in those commits, the new function XLogGetSyncBit() will mask PG_O_DIRECT. --- src/backend/access/transam/xlog.c | 73 +-- src/include/access/xlog.h | 2 + 2 files changed, 52 insertions(+), 23 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 330e058c5f..492ababd9c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8592,21 +8592,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra) } } +/* + * Exported version of get_sync_bit() + * + * Do not expose PG_O_DIRECT for uses outside xlog.c. + */ +int +XLogGetSyncBit(void) +{ + return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT; +} + /* - * Issue appropriate kind of fsync (if any) for an XLOG output file. + * Issue appropriate kind of fsync (if any) according to wal_sync_method. * - * 'fd' is a file descriptor for the XLOG file to be fsync'd. - * 'segno' is for error reporting purposes. + * 'fd' is a file descriptor for the file to be fsync'd. */ -void -issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) +const char * +XLogFsyncFile(int fd) { - char *msg = NULL; + const char *msg = NULL; instr_time start; - Assert(tli != 0); - /* * Quick exit if fsync is disabled or write() has already synced the WAL * file. 
@@ -8614,7 +8622,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) if (!enableFsync || wal_sync_method == WAL_SYNC_METHOD_OPEN || wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC) - return; + return NULL; /* Measure I/O timing to sync the WAL file */ if (track_wal_io_timing) @@ -8651,19 +8659,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) break; } - /* PANIC if failed to fsync */ - if (msg) - { - char xlogfname[MAXFNAMELEN]; - int save_errno = errno; - - XLogFileName(xlogfname, tli, segno, wal_segment_size); - errno = save_errno; - ereport(PANIC, -(errcode_for_file_access(), - errmsg(msg, xlogfname))); - } - pgstat_report_wait_end(); /* @@ -8677,7 +8672,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start); } - PendingWalStats.wal_sync++; + if (msg != NULL) + PendingWalStats.wal_sync++; + + return msg; +} + +/* + * Issue appropriate kind of fsync (if any) for an XLOG output file. + * + * 'fd' is a file descriptor for the XLOG file to be fsync'd. + * 'segno' is for error reporting purposes. 
+ */ +void +issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) +{ + const char *msg; + + Assert(tli != 0); + + msg = XLogFsyncFile(fd); + + /* PANIC if failed to fsync */ + if (msg) + { + char xlogfname[MAXFNAMELEN]; + int save_errno = errno; + + XLogFileName(xlogfname, tli, segno, wal_segment_size); + errno = save_errno; + ereport(PANIC, +(errcode_for_file_access(), + errmsg(msg, xlogfname))); + } } /* diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 1a1f11a943..badfe4abd6 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record); extern void xlog_desc(StringInfo buf, struct XLogReaderState *record); extern const char *xlog_identify(uint8 info); +extern int XLogGetSyncBit(void); +extern const char *XLogFsyncFile(int fd); extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli); extern bool RecoveryInProgress(void); -- 2.43.0 >From c200b85c1311f97bdae2ed20e2746c44d5c4aadb Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 31 Aug 2023 11:49:10 +0900 Subject: [PATCH v33 2/3] Introduce undo log implementation This patch adds a simple implementation of UNDO log feature. --- src/backend/access/transam/Makefile| 1 + src/backend/access/transam/meson.build | 1 + src/backend/access/transam/rmgr.c | 4 +- src/backend/access/transam/simpleundolog.c | 362 + src/backend/access/transam/twophase.c | 3 + src/backend/access/transam/xact.c | 24 ++ src/backend/access/transam/xlog.c | 42 ++- src/backend/catalog/storage.c | 171 ++ src/backend/storage/file/reinit.c | 78 + src/backend/storage/smgr/smgr.c| 9 + src/bin/initdb/initdb.c| 17 + src/bin/pg_rewind/parsexlog.c
inconsistent quoting in error messages
I noticed that there are slightly inconsistent messages regarding quoting policies. > This happens if you temporarily set "wal_level=minimal" on the server. > WAL generated with "full_page_writes=off" was replayed during online backup > pg_log_standby_snapshot() can only be used if "wal_level" >= "replica" > WAL streaming ("max_wal_senders" > 0) requires "wal_level" to be "replica" or > "logical" I think it's best to quote variable names and values separately, like "wal_level" = "minimal" (but not using quotes for numeric values), as that seems to be the most common practice. In any case, we might want to unify them. Likewise, I saw two different versions of values with units. > "max_stack_depth" must not exceed %ldkB. > "vacuum_buffer_usage_limit" must be 0 or between %d kB and %d kB I'm not sure, but it seems like the latter version is more common. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: cataloguing NOT NULL constraints
Hello, At Wed, 1 May 2024 19:49:35 +0200, Alvaro Herrera wrote in > Here are two patches that I intend to push soon (hopefully tomorrow). This commit added and edited two error messages, resulting in slightly different wordings, "in" and "on", for relation constraints. + errmsg("cannot change NO INHERIT status of NOT NULL constraint \"%s\" on relation \"%s\"", === + errmsg("cannot change NO INHERIT status of NOT NULL constraint \"%s\" in relation \"%s\"", I think we usually use "on" in this case. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Allow non-superuser to cancel superuser tasks.
At Fri, 12 Apr 2024 09:10:35 +0900, Michael Paquier wrote in > On Thu, Apr 11, 2024 at 04:55:59PM +0500, Kirill Reshke wrote: > > The test doesn't fail because pg_terminate_backend actually meets his > > point: autovac is killed. But while dying, autovac also receives > > segfault. Thats because of injections points: > > > > (gdb) bt > > #0 0x56361c3379ea in tas (lock=0x7fbcb9632224 > access memory at address 0x7fbcb9632224>) at > > ../../../../src/include/storage/s_lock.h:228 > > #1 ConditionVariableCancelSleep () at condition_variable.c:238 ... > > #3 0x56361c330a40 in CleanupProcSignalState (status=, arg=) at procsignal.c:240 > > #4 0x56361c328801 in shmem_exit (code=code@entry=1) at ipc.c:276 > > #9 0x56361c3378d7 in ConditionVariableTimedSleep > > (cv=0x7fbcb9632224, timeout=timeout@entry=-1, ... > I can see this stack trace as well. Capturing a bit more than your > own stack, this is crashing in the autovacuum worker while waiting on > a condition variable when processing a ProcessInterrupts(). > > That may point to a legit bug with condition variables in this > context, actually? From what I can see, issuing a signal on a backend > process waiting with a condition variable is able to process the > interrupt correctly. ProcSignalInit sets up CleanupProcSignalState to be called via on_shmem_exit. If the CV is allocated in a dsm segment, shmem_exit should have detached the region for the CV. CV cleanup code should be invoked via before_shmem_exit. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: NLS doesn't work for pg_combinebackup
At Tue, 9 Apr 2024 15:00:27 +0900, Michael Paquier wrote in > I've checked the whole tree, and the two you are pointing at are the > only incorrect paths. So applied, thanks! Thank you for cross-checking and committing! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: NLS doesn't work for pg_combinebackup
At Mon, 08 Apr 2024 16:27:02 +0900 (JST), Kyotaro Horiguchi wrote in > Hello. > > I noticed that NLS doesn't work for pg_combinebackup. The cause is > that the tool forgets to call set_pglocale_pgservice(). > > This issue is fixed by the following chage. > > diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c > b/src/bin/pg_combinebackup/pg_combinebackup.c > index 1b07ca3fb6..2788c78fdd 100644 > --- a/src/bin/pg_combinebackup/pg_combinebackup.c > +++ b/src/bin/pg_combinebackup/pg_combinebackup.c > @@ -154,6 +154,7 @@ main(int argc, char *argv[]) > > pg_logging_init(argv[0]); > progname = get_progname(argv[0]); > + set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_combinebackup")); > handle_help_version_opts(argc, argv, progname, help); > > memset(&opt, 0, sizeof(opt)); Forgot to mention, but pg_walsummary has the same issue. diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c index 5e41b376d7..daf6cd14ce 100644 --- a/src/bin/pg_walsummary/pg_walsummary.c +++ b/src/bin/pg_walsummary/pg_walsummary.c @@ -67,6 +67,7 @@ main(int argc, char *argv[]) pg_logging_init(argv[0]); progname = get_progname(argv[0]); + set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_walsummary")); handle_help_version_opts(argc, argv, progname, help); /* process command-line options */ regards. -- Kyotaro Horiguchi NTT Open Source Software Center
NLS doesn't work for pg_combinebackup
Hello. I noticed that NLS doesn't work for pg_combinebackup. The cause is that the tool forgets to call set_pglocale_pgservice(). This issue is fixed by the following change. diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c index 1b07ca3fb6..2788c78fdd 100644 --- a/src/bin/pg_combinebackup/pg_combinebackup.c +++ b/src/bin/pg_combinebackup/pg_combinebackup.c @@ -154,6 +154,7 @@ main(int argc, char *argv[]) pg_logging_init(argv[0]); progname = get_progname(argv[0]); + set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_combinebackup")); handle_help_version_opts(argc, argv, progname, help); memset(&opt, 0, sizeof(opt)); regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: remaining sql/json patches
At Fri, 22 Mar 2024 11:44:08 +0900, Amit Langote wrote in > Thanks for the heads up. > > My bad, will push a fix shortly. No problem. Thank you for the prompt correction. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: remaining sql/json patches
At Wed, 20 Mar 2024 21:53:52 +0900, Amit Langote wrote in > I'll push 0001 tomorrow. This patch (v44-0001-Add-SQL-JSON-query-functions.patch) introduced the following new error message: +errmsg("can only specify constant, non-aggregate" + " function, or operator expression for" + " DEFAULT"), I believe that our convention here is to write an error message in a single string literal, not split into multiple parts, for better grep'ability. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Inconsistent printf placeholders
Thank you for looking at this. At Tue, 19 Mar 2024 10:50:23 +0100, Peter Eisentraut wrote in > On 15.03.24 08:20, Kyotaro Horiguchi wrote: > > diff --git a/src/backend/access/transam/twophase.c > > b/src/backend/access/transam/twophase.c > > @@ -1369,8 +1369,8 @@ ReadTwoPhaseFile(TransactionId xid, bool > > missing_ok) > > errmsg("could not read file \"%s\": > > %m", path))); > > else > > ereport(ERROR, > > - (errmsg("could not read file \"%s\": > > - read %d of %lld", > > - path, r, (long long > > - int) stat.st_size))); > > + (errmsg("could not read file \"%s\": read %zd of %zu", > > + path, r, (Size) stat.st_size))); > > } > > pgstat_report_wait_end(); > > This might be worse, because stat.st_size is of type off_t, which > could be smaller than Size/size_t. I think you were trying to mention that off_t could be wider than size_t, and you're right on that point. I thought that it was safe since we are trying to read the whole content of the file at once here into palloc'ed memory. However, on second thought, if st_size is out of the range of ssize_t and palloc accepts that size, at least on Linux, read(2) reads only 0x7ffff000 bytes and reaches the error reporting. In addition to that, this size was closer to the memory allocation size limit than I had thought. As a result, I removed the change. However, I kept the change of the type of variable "r" and the corresponding placeholder %zd. 
> > diff --git a/src/backend/libpq/be-secure-gssapi.c > > b/src/backend/libpq/be-secure-gssapi.c > > index bc04e78abb..68645b4519 100644 > > --- a/src/backend/libpq/be-secure-gssapi.c > > +++ b/src/backend/libpq/be-secure-gssapi.c > > @@ -572,9 +572,9 @@ secure_open_gssapi(Port *port) > > if (input.length > PQ_GSS_RECV_BUFFER_SIZE) > > { > > ereport(COMMERROR, > > - (errmsg("oversize GSSAPI packet sent > > - by the client (%zu > %d)", > > + (errmsg("oversize GSSAPI packet sent by the client (%zu > %zu)", > > (size_t) input.length, > > - > > PQ_GSS_RECV_BUFFER_SIZE))); > > + (size_t) PQ_GSS_RECV_BUFFER_SIZE))); > > return -1; > > } > > > > Might be better to add that cast to the definition of > PQ_GSS_RECV_BUFFER_SIZE instead, so that all code can benefit. As far as I see, the only exceptional uses I found were a comparison with int values, and being passed as an OM_uint32 (to one of the parameters of gss_wrap_size_limit()). Therefore, I agree that it is beneficial. By the way, we currently define Size as the same as size_t (since 1998). Is it correct to consider Size as merely for backward compatibility and we should use size_t for new code? I used size_t in the modified part in the attached patch. > > diff --git a/src/backend/replication/repl_gram.y > > b/src/backend/replication/repl_gram.y > > index 7474f5bd67..baa76280b9 100644 > > --- a/src/backend/replication/repl_gram.y > > +++ b/src/backend/replication/repl_gram.y > > @@ -312,11 +312,6 @@ timeline_history: > > { > > TimeLineHistoryCmd *cmd; > > - if ($2 <= 0) > > - ereport(ERROR, > > - > > (errcode(ERRCODE_SYNTAX_ERROR), > > -errmsg("invalid > > -timeline %u", > > -$2))); > > - ... > I don't think this is correct. It loses the check for == 0. Ugh. It's my mistake. So we need to consider unifying the messages again. In walsummaryfuncs.c, %lld is required, but it's silly for the uses in repl_gram.y. Finally, I chose not to change anything here. 
> > diff --git a/src/backend/tsearch/to_tsany.c > > b/src/backend/tsearch/to_tsany.c > > index 88cba58cba..9d21178107 100644 > > --- a/src/backend/tsearch/to_tsany.c > > +++ b/src/backend/tsearch/to_tsany.c > > @@ -191,7 +
Re: Add new error_action COPY ON_ERROR "log"
At Fri, 15 Mar 2024 16:57:25 +0900, Michael Paquier wrote in > On Wed, Mar 13, 2024 at 07:32:37PM +0530, Bharath Rupireddy wrote: > > I'm attaching the v8 patch set implementing the above idea. With this, > > [1] is sent to the client, [2] is sent to the server log. This > > approach not only reduces the duplicate info in the NOTICE and CONTEXT > > messages, but also makes it easy for users to get all the necessary > > info in the NOTICE message without having to set extra parameters to > > get CONTEXT message. > > In terms of eliminating the information duplication while allowing > clients to know the value, the attribute and the line involved in the > conversion failure, the approach of tweaking the context information > has merits, I guess. > > How do others feel about this kind of tuning? If required, I think that we have already included the minimum information necessary for the primary diagnosis, including locations, within the primary messages. Here is an example: > LOG: 0: invalid record length at 0/18049F8: expected at least 24, got 0 And I believe that CONTEXT, if it exists, is augmentation information to the primary messages. The objective of the precedent for the use of relname_only was somewhat different, but this use also seems legit. In short, I think the distribution between message types (primary and context) is fine as it is in the latest patch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Inconsistent printf placeholders
At Fri, 15 Mar 2024 16:20:27 +0900 (JST), Kyotaro Horiguchi wrote in > I checked for that kind of msgids in a bit more intensive way. The > numbers are the line numbers in ja.po of backend. I didn't check the > same for other modules. > > > ###: invalid timeline %lld > >@ backup/walsummaryfuncs.c:95 > > invalid timeline %u > >@ repl_gram.y:318 repl_gram.y:359 "The numbers are the line numbers in ja.po" is wrong. Correctly, it should be written as: The information consists of two lines: the first of them is the msgid, and the second indicates the locations where they appear. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Inconsistent printf placeholders
At Fri, 15 Mar 2024 16:01:28 +1300, David Rowley wrote in > On Fri, 15 Mar 2024 at 15:27, Kyotaro Horiguchi > wrote: > > I have considered only the two messages. Actually, buffile.c and md.c > > are already like that. The attached aligns the messages in > > pg_combinebackup.c and reconstruct.c with the precedents. > > This looks like a worthy cause to make translator work easier. > > I don't want to widen the goalposts or anything, but just wondering if > you'd searched for any others that could get similar treatment? > > I only just had a quick look at the following. > > $ cat src/backend/po/fr.po | grep -E "^msgid\s" | sed -E > 's/%[a-zA-Z]+/\%/g' | sort | uniq -d -c > 31 msgid "" > 2 msgid "could not accept SSL connection: %" > 2 msgid "could not initialize LDAP: %" > 2 msgid "could not look up local user ID %: %" ... > I've not looked at how hard it would be with any of the above to > determine how hard it would be to make the formats consistent. The > 3rd last one seems similar enough that it might be worth doing > together with this? I checked for that kind of msgids in a bit more intensive way. The numbers are the line numbers in ja.po of backend. I didn't check the same for other modules. > ###: invalid timeline %lld >@ backup/walsummaryfuncs.c:95 > invalid timeline %u >@ repl_gram.y:318 repl_gram.y:359 In the first case, the %lld can be more than 2^31. In the second case, %u is uint32. However, the bigger issue is checking if the uint32 value is negative: repl_gram.c: 147 > if ((yyvsp[0].uintval) <= 0) > ereport(ERROR, > (errcode(ERRCODE_SYNTAX_ERROR), >errmsg("invalid timeline %u", > (yyvsp[0].uintval; which cannot happen. I think we can simply remove the useless error check. 
> ###: could not read file \"%s\": read %d of %zu >@ ../common/controldata_utils.c:116 ../common/controldata_utils.c:119 > access/transam/xlog.c:3417 access/transam/xlog.c:4278 > replication/logical/origin.c:750 replication/logical/origin.c:789 > replication/logical/snapbuild.c:2040 replication/slot.c:2218 > replication/slot.c:2259 replication/walsender.c:660 > utils/cache/relmapper.c:833 > could not read file \"%s\": read %d of %lld >@ access/transam/twophase.c:1372 > could not read file \"%s\": read %zd of %zu >@ backup/basebackup.c:2102 > ###: oversize GSSAPI packet sent by the client (%zu > %zu) >@ libpq/be-secure-gssapi.c:351 > oversize GSSAPI packet sent by the client (%zu > %d) >@ libpq/be-secure-gssapi.c:575 > ###: compressed segment file \"%s\" has incorrect uncompressed size %d, > skipping >@ pg_receivewal.c:362 > compressed segment file \"%s\" has incorrect uncompressed size %zu, > skipping >@ pg_receivewal.c:448 > ###: could not read file \"%s\": read only %zd of %lld bytes >@ pg_combinebackup.c:1304 > could not read file \"%s\": read only %d of %u bytes >@ reconstruct.c:514 We can "fix" them the same way. I debated whether to use ssize_t for read() and replace all instances of size_t with Size. However, in the end, I decided to only keep it consistent with the surrounding code. > ###: index %d out of valid range, 0..%d >@ utils/adt/varlena.c:3220 utils/adt/varlena.c:3287 > index %lld out of valid range, 0..%lld >@ utils/adt/varlena.c:3251 utils/adt/varlena.c:3323 > ###: string is too long for tsvector (%d bytes, max %d bytes) >@ tsearch/to_tsany.c:194 utils/adt/tsvector.c:277 > utils/adt/tsvector_op.c:1126 > string is too long for tsvector (%ld bytes, max %ld bytes) >@ utils/adt/tsvector.c:223 We can unify them and did in the attached, but I'm not sure that's sensible.. 
> ###: could not look up local user ID %d: %s >@ ../port/user.c:43 ../port/user.c:79 > could not look up local user ID %ld: %s >@ libpq/auth.c:1886 Both of the above use uid_t (defined as int) as user ID and it is explicitly cast to "int" in the first case and to long in the second, which seems mysterious. Although I'm not sure if there's a possibility of uid_t being widened in the future, I unified them to the latter way for robustness. > ###: error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %m >@ file.c:44 > error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s >@ file.c:65 > > The la
Re: Typos in reorderbuffer.c.
At Thu, 14 Mar 2024 11:23:38 +0530, Amit Kapila wrote in > On Thu, Mar 14, 2024 at 9:58 AM Kyotaro Horiguchi > wrote: > > > > While examining reorderbuffer.c, I found several typos. I'm not sure > > if fixing them is worthwhile, but I've attached a fix just in case. > > > > LGTM. I'll push this in some time. Thanks! -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Inconsistent printf placeholders
Thank you for the suggestions. At Thu, 14 Mar 2024 13:45:41 +0100, Daniel Gustafsson wrote in > I've only skimmed this so far but +1 on keeping the messages the same where > possible to reduce translation work. Adding a comment on the message where > the > casting is done to indicate that it is for translation might reduce the risk > of > it "getting fixed" down the line. Added a comment "/* cast xxx to avoid extra translatable messages */. At Thu, 14 Mar 2024 14:02:46 +0100, Peter Eisentraut wrote in > If you want to make them uniform, then I suggest the error messages Yeah. Having the same messages with only the placeholders changed is not very pleasing during translation. If possible, I would like to align them. > should both be "%zd of %zu bytes", which are the actual types read() > deals with. I have considered only the two messages. Actually, buffile.c and md.c are already like that. The attached aligns the messages in pg_combinebackup.c and reconstruct.c with the precedents. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c index 6f0814d9ac..feb4d5dcf4 100644 --- a/src/bin/pg_combinebackup/pg_combinebackup.c +++ b/src/bin/pg_combinebackup/pg_combinebackup.c @@ -1301,8 +1301,9 @@ slurp_file(int fd, char *filename, StringInfo buf, int maxlen) if (rb < 0) pg_fatal("could not read file \"%s\": %m", filename); else - pg_fatal("could not read file \"%s\": read only %zd of %lld bytes", - filename, rb, (long long int) st.st_size); + /* cast st_size to avoid extra translatable messages */ + pg_fatal("could not read file \"%s\": read only %zd of %zu bytes", + filename, rb, (size_t) st.st_size); } /* Adjust buffer length for new data and restore trailing-\0 invariant */ diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c index 41f06bb26b..a4badb90e2 100644 --- a/src/bin/pg_combinebackup/reconstruct.c +++ b/src/bin/pg_combinebackup/reconstruct.c @@ -504,15 +504,16 @@ make_rfile(char *filename, bool missing_ok) static void read_bytes(rfile *rf, void *buffer, unsigned length) { - int rb = read(rf->fd, buffer, length); + ssize_t rb = read(rf->fd, buffer, length); if (rb != length) { if (rb < 0) pg_fatal("could not read file \"%s\": %m", rf->filename); else - pg_fatal("could not read file \"%s\": read only %d of %u bytes", - rf->filename, rb, length); + /* cast length to avoid extra translatable messages */ + pg_fatal("could not read file \"%s\": read only %zd of %zu bytes", + rf->filename, rb, (size_t) length); } }
Typos in reorderbuffer.c.
Hello. While examining reorderbuffer.c, I found several typos. I'm not sure if fixing them is worthwhile, but I've attached a fix just in case. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 001f901ee6..92cf39ff74 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -32,7 +32,7 @@ * * In order to cope with large transactions - which can be several times as * big as the available memory - this module supports spooling the contents - * of a large transactions to disk. When the transaction is replayed the + * of large transactions to disk. When the transaction is replayed the * contents of individual (sub-)transactions will be read from disk in * chunks. * @@ -3078,10 +3078,10 @@ ReorderBufferImmediateInvalidation(ReorderBuffer *rb, uint32 ninvalidations, * least once for every xid in XLogRecord->xl_xid (other places in records * may, but do not have to be passed through here). * - * Reorderbuffer keeps some datastructures about transactions in LSN order, + * Reorderbuffer keeps some data structures about transactions in LSN order, * for efficiency. To do that it has to know about when transactions are seen * first in the WAL. As many types of records are not actually interesting for - * logical decoding, they do not necessarily pass though here. + * logical decoding, they do not necessarily pass through here. */ void ReorderBufferProcessXid(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn) @@ -3513,11 +3513,11 @@ ReorderBufferLargestTXN(ReorderBuffer *rb) * snapshot because we don't decode such transactions. Also, we do not select * the transaction which doesn't have any streamable change. * - * Note that, we skip transactions that contains incomplete changes. There + * Note that, we skip transactions that contain incomplete changes. 
There * is a scope of optimization here such that we can select the largest * transaction which has incomplete changes. But that will make the code and * design quite complex and that might not be worth the benefit. If we plan to - * stream the transactions that contains incomplete changes then we need to + * stream the transactions that contain incomplete changes then we need to * find a way to partially stream/truncate the transaction changes in-memory * and build a mechanism to partially truncate the spilled files. * Additionally, whenever we partially stream the transaction we need to @@ -4701,8 +4701,8 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn, } /* - * Rejigger change->newtuple to point to in-memory toast tuples instead to - * on-disk toast tuples that may not longer exist (think DROP TABLE or VACUUM). + * Rejigger change->newtuple to point to in-memory toast tuples instead of + * on-disk toast tuples that may no longer exist (think DROP TABLE or VACUUM). * * We cannot replace unchanged toast tuples though, so those will still point * to on-disk toast data. @@ -4713,7 +4713,7 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn, * at unexpected times. * * We simply subtract size of the change before rejiggering the tuple, and - * then adding the new size. This makes it look like the change was removed + * then add the new size. This makes it look like the change was removed * and then added back, except it only tweaks the accounting info. * * In particular it can't trigger serialization, which would be pointless
Inconsistent printf placeholders
Hello. A recent commit 6612185883 introduced two error messages that are identical in text but differ in their placeholders. - pg_fatal("could not read file \"%s\": read only %d of %d bytes", -filename, (int) rb, (int) st.st_size); + pg_fatal("could not read file \"%s\": read only %zd of %lld bytes", +filename, rb, (long long int) st.st_size); ... - pg_fatal("could not read file \"%s\": read only %d of %d bytes", + pg_fatal("could not read file \"%s\": read only %d of %u bytes", rf->filename, rb, length); I'd be happy if the two messages kept consistency. I suggest aligning types instead of making the messages different, as attached. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c index 41f06bb26b..e3b8e84289 100644 --- a/src/bin/pg_combinebackup/reconstruct.c +++ b/src/bin/pg_combinebackup/reconstruct.c @@ -504,15 +504,15 @@ make_rfile(char *filename, bool missing_ok) static void read_bytes(rfile *rf, void *buffer, unsigned length) { - int rb = read(rf->fd, buffer, length); + ssize_t rb = read(rf->fd, buffer, length); if (rb != length) { if (rb < 0) pg_fatal("could not read file \"%s\": %m", rf->filename); else - pg_fatal("could not read file \"%s\": read only %d of %u bytes", - rf->filename, rb, length); + pg_fatal("could not read file \"%s\": read only %zd of %lld bytes", + rf->filename, rb, (long long int) length); } }
Re: Infinite loop in XLogPageRead() on standby
At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi wrote in > Oh, I once saw the fix work, but seems not to be working after some > point. The new issue was a corruption of received WAL records on the > first standby, and it may be related to the setting. I identified the cause of the second issue. When I tried to reproduce the issue, the second standby accidentally received the old timeline's last page-spanning record till the end while the first standby was promoting (but it had not been read by recovery). In addition to that, on the second standby, there's a time window where the timeline has increased but the first segment of the new timeline is not available yet. In this case, the second standby successfully reads the page-spanning record in the old timeline even after noticing that the timeline ID has increased, thanks to the robustness of XLogFileReadAnyTLI(). I think the primary change to XLogPageRead that I suggested is correct (assuming the use of wal_segment_size instead of the constant). However, XLogFileReadAnyTLI() still has a chance to read the segment from the old timeline after the second standby notices a timeline switch, leading to the second issue. The second issue was fixed by preventing XLogFileReadAnyTLI from reading segments from timelines older than those suggested by the latest timeline history (in other words, disabling the "AnyTLI" part). I recall that there was a discussion for commit 4bd0ad9e44 about the objective of allowing reading segments from timelines older than the timeline history suggests. In my faint memory, we concluded to postpone the decision to remove the feature due to uncertainty about the objective. If there's no clear reason to continue using XLogFileReadAnyTLI(), I suggest we stop using it and instead adopt XLogFileReadOnTLHistory(), which reads segments that align precisely with the timeline history. 
Of course, regardless of the changes above, if recovery on the second standby had reached the end of the page-spanning record before redirection to the first standby, it would need pg_rewind to connect to the first standby. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Infinite loop in XLogPageRead() on standby
At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin wrote in > Hmm, I think you meant to use wal_segment_size, because 0x100000 is just > 1MB. As a result, currently it works for you by accident. Oh, I once saw the fix work, but seems not to be working after some point. The new issue was a corruption of received WAL records on the first standby, and it may be related to the setting. > > Thus, I managed to reproduce precisely the same situation as you > > described utilizing your script with modifications and some core > > tweaks, and with the change above, I saw that the behavior was > > fixed. However, for reasons unclear to me, it shows another issue, and > > I am running out of time and need more caffeine. I'll continue > > investigating this tomorrow. > > > > Thank you for spending your time on it! You're welcome, but I apologize for the delay in the work. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Infinite loop in XLogPageRead() on standby
At Tue, 5 Mar 2024 09:36:44 +0100, Alexander Kukushkin wrote in > Please find attached the patch fixing the problem and the updated TAP test > that addresses Nit. Record-level retries happen when the upper layer detects errors. In my previous mail, I cited code that is intended to prevent this at segment boundaries. However, the resulting code applies to all page boundaries, since we judged that the difference doesn't significantly affect the outcome. > * Check the page header immediately, so that we can retry immediately if > * it's not valid. This may seem unnecessary, because ReadPageInternal() > * validates the page header anyway, and would propagate the failure up to So, the following (tentative) change should also work. xlogrecovery.c: @@ -3460,8 +3490,10 @@ retry: * responsible for the validation. */ if (StandbyMode && + targetPagePtr % 0x100000 == 0 && !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) { Thus, I managed to reproduce precisely the same situation as you described utilizing your script with modifications and some core tweaks, and with the change above, I saw that the behavior was fixed. However, for reasons unclear to me, it shows another issue, and I am running out of time and need more caffeine. I'll continue investigating this tomorrow. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: initdb's -c option behaves wrong way?
At Mon, 4 Mar 2024 09:39:39 +0100, Daniel Gustafsson wrote in > > > > On 4 Mar 2024, at 02:01, Tom Lane wrote: > > > > Daniel Gustafsson writes: > >> I took the liberty to add this to the upcoming CF to make sure we don't > >> lose > >> track of it. > > > > Thanks for doing that, because the cfbot pointed out a problem: > > I should have written pg_strncasecmp not strncasecmp. If this > > version tests cleanly, I'll push it. > > +1, LGTM. Thank you for fixing this, Tom! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Infinite loop in XLogPageRead() on standby
At Fri, 01 Mar 2024 12:37:55 +0900 (JST), Kyotaro Horiguchi wrote in > Anyway, our current policy here is to avoid record-rereads beyond > source switches. However, fixing this seems to require that source > switches cause record rereads unless some additional information is > available to know if replay LSN needs to back up. It seems to me that the error messages are related to commit 0668719801. XLogPageRead: > * Check the page header immediately, so that we can retry immediately if > * it's not valid. This may seem unnecessary, because ReadPageInternal() > * validates the page header anyway, and would propagate the failure up to > * ReadRecord(), which would retry. However, there's a corner case with > * continuation records, if a record is split across two pages such that > * we would need to read the two pages from different sources. For ... > if (StandbyMode && > !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, > readBuf)) > { > /* >* Emit this error right now then retry this page immediately. > Use >* errmsg_internal() because the message was already translated. >*/ > if (xlogreader->errormsg_buf[0]) > ereport(emode_for_corrupt_record(emode, > xlogreader->EndRecPtr), > (errmsg_internal("%s", > xlogreader->errormsg_buf))); This code intends to prevent a page header error from causing a record reread, when a record is required to be read from multiple sources. We could restrict this to only fire at segment boundaries. At segment boundaries, we won't let LSN back up by using XLP_FIRST_IS_CONTRECORD. Having thought up to this point, I now believe that we should completely prevent LSN from going back in any case. One drawback is that the fix cannot be back-patched. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Infinite loop in XLogPageRead() on standby
At Fri, 01 Mar 2024 12:04:31 +0900 (JST), Kyotaro Horiguchi wrote in > At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi > wrote in > > After reading this, I came up with a possibility that walreceiver > > recovers more quickly than the calling interval to > > WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call > > to the function WaitForWAL...(), and then somehow recovers the > > connection before the next call, the function doesn't notice the > > disconnection and returns XLREAD_SUCCESS wrongly. If this assumption > > is correct, the correct fix might be for us to return XLREAD_FAIL when > > reconnection happens after the last call to the WaitForWAL...() > > function. > > That's my stupid. The function performs reconnection by itself. Anyway, our current policy here is to avoid record-rereads beyond source switches. However, fixing this seems to require that source switches cause record rereads unless some additional information is available to know if replay LSN needs to back up. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Infinite loop in XLogPageRead() on standby
At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi wrote in > After reading this, I came up with a possibility that walreceiver > recovers more quickly than the calling interval to > WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call > to the function WaitForWAL...(), and then somehow recovers the > connection before the next call, the function doesn't notice the > disconnection and returns XLREAD_SUCCESS wrongly. If this assumption > is correct, the correct fix might be for us to return XLREAD_FAIL when > reconnection happens after the last call to the WaitForWAL...() > function. That was my mistake; the function performs reconnection by itself. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Infinite loop in XLogPageRead() on standby
At Fri, 1 Mar 2024 08:17:04 +0900, Michael Paquier wrote in > On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote: > > On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi > > wrote: > >> In the first place, it's important to note that we do not guarantee > >> that an async standby can always switch its replication connection to > >> the old primary or another sibling standby. This is due to the > >> variations in replication lag among standbys. pg_rewind is required to > >> adjust such discrepancies. > > > > Sure, I know. But in this case the async standby received and flushed > > absolutely the same amount of WAL as the promoted one. > > Ugh. If it can happen transparently to the user without the user > knowing directly about it, that does not sound good to me. I did not > look very closely at monitoring tools available out there, but if both > standbys flushed the same WAL locations a rewind should not be > required. It is not something that monitoring tools would be able to > detect because they just look at LSNs. > > >> In the repro script, the replication connection of the second standby > >> is switched from the old primary to the first standby after its > >> promotion. After the switching, replication is expected to continue > >> from the beginning of the last replayed segment. > > > > Well, maybe, but apparently the standby is busy trying to decode a record > > that spans multiple pages, and it is just infinitely waiting for the next > > page to arrive. Also, the restart "fixes" the problem, because indeed it is > > reading the file from the beginning. > > What happens if the continuation record spawns across multiple segment > files boundaries in this case? We would go back to the beginning of > the segment where the record spawning across multiple segments began, > right? (I may recall this part of contrecords incorrectly, feel free > to correct me if necessary.) 
After reading this, I came up with a possibility that walreceiver recovers more quickly than the calling interval to WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call to the function WaitForWAL...(), and then somehow recovers the connection before the next call, the function doesn't notice the disconnection and returns XLREAD_SUCCESS wrongly. If this assumption is correct, the correct fix might be for us to return XLREAD_FAIL when reconnection happens after the last call to the WaitForWAL...() function. > >> But with the script, > >> the second standby copies the intentionally broken file, which differs > >> from the data that should be received via streaming. > > > > As I already said, this is a simple way to emulate the primary crash while > > standbys receiving WAL. > > It could easily happen that the record spans on multiple pages is not fully > > received and flushed. > > I think that's OK to do so at test level to force a test in the > backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has > proved that this can be designed to be cheap and stable across the > buildfarm fleet. Yeah, I agree that it clearly illustrates the final state after the issue happened, but if my assumption above is correct, the test doesn't manifest the real issue. > For anything like that, the result had better have solid test > coverage, where perhaps we'd better refactor some of the routines of > 039_end_of_wal.pl into a module to use them here, rather than > duplicate the code. The other test has a few assumptions with the > calculation of page boundaries, and I'd rather not duplicate that > across the tree. I agree on that point. 
> Seeing that Alexander is a maintainer of Patroni, which is very > probably used by his employer across a large set of PostgreSQL > instances, if he says that he's seen that in the field, that's good > enough for me to respond to the problem, especially if reconnecting a > standby to a promoted node where both flushed the same LSN. Now the > level of response also depends on the invasiveness of the change, and we > need a very careful evaluation here. I don't mean to say that we should respond with DNF to this "issue" at all. I simply wanted to make the real issue clear. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Infinite loop in XLogPageRead() on standby
At Thu, 29 Feb 2024 14:05:15 +0900, Michael Paquier wrote in > On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote: > > I spent some time debugging an issue with standby not being able to > > continue streaming after failover. > > > > The problem happens when standbys received only the first part of the WAL > > record that spans multiple pages. > > In this case the promoted standby discards the first part of the WAL record > > and writes END_OF_RECOVERY instead. If in addition to that someone will > > call pg_switch_wal(), then there are chances that SWITCH record will also > > fit to the page where the discarded part was settling, As a result the > > other standby (that wasn't promoted) will infinitely try making attempts to > > decode WAL record span on multiple pages by reading the next page, which is > > filled with zero bytes. And, this next page will never be written, because > > the new primary will be writing to the new WAL file after pg_switch_wal(). In the first place, it's important to note that we do not guarantee that an async standby can always switch its replication connection to the old primary or another sibling standby. This is due to the variations in replication lag among standbys. pg_rewind is required to adjust such discrepancies. I might be overlooking something, but I don't understand how this occurs without purposefully tweaking WAL files. The repro script pushes an incomplete WAL file to the archive as a non-partial segment. This shouldn't happen in the real world. In the repro script, the replication connection of the second standby is switched from the old primary to the first standby after its promotion. After the switching, replication is expected to continue from the beginning of the last replayed segment. But with the script, the second standby copies the intentionally broken file, which differs from the data that should be received via streaming. 
A similar problem to the issue here was seen at segment boundaries, before we introduced the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag, which prevents overwriting a WAL file that is already archived. However, in this case, the second standby won't see the broken record because it cannot be in a non-partial segment in the archive, and the new primary streams END_OF_RECOVERY instead of the broken record. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera wrote in > Here's the complete set, with these two names using the singular. The commit by the second patch added several GUC descriptions: > Sets the size of the dedicated buffer pool used for the commit timestamp > cache. Some of them (commit_timestamp_buffers, transaction_buffers, and subtransaction_buffers) use 0 to mean auto-tuning based on the shared-buffers size. I think it's worth adding an extra_desc such as "0 to automatically determine this value based on the shared buffer size". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
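To make the suggestion concrete, here is a rough sketch of how such an extra_desc could be attached to one of these entries in guc_tables.c. This is illustrative only: the field layout is abbreviated, and the variable name, bounds, and hooks shown are placeholders rather than the actual table contents.

```c
/* Hypothetical guc_tables.c entry; fields abbreviated, names assumed. */
{
	{"commit_timestamp_buffers", PGC_POSTMASTER, RESOURCES_MEM,
		gettext_noop("Sets the size of the dedicated buffer pool used for the commit timestamp cache."),
		/* proposed extra_desc */
		gettext_noop("0 to automatically determine this value based on the shared buffer size."),
		GUC_UNIT_BLOCKS
	},
	&commit_timestamp_buffers,		/* assumed variable name */
	0, 0, SLRU_MAX_ALLOWED_BUFFERS,	/* assumed bounds */
	NULL, NULL, NULL
},
```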
Re: MultiXact\SLRU buffers configuration
At Sat, 3 Feb 2024 22:32:45 +0500, "Andrey M. Borodin" wrote in > Here's the test draft. This test reliably reproduces sleep on CV when waiting > next multixact to be filled into "members" SLRU. By the way, I raised a question about using multiple CVs simultaneously [1]. That is, I suspect that the current CV implementation doesn't allow us to use multiple condition variables at the same time, because all CVs use the same PGPROC member cvWaitLink to accommodate different waiter sets. If this assumption is correct, we should resolve the issue before spreading more uses of CVs. [1] https://www.postgresql.org/message-id/20240227.150709.1766217736683815840.horikyota.ntt%40gmail.com regards. -- Kyotaro Horiguchi NTT Open Source Software Center
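The concern above can be illustrated with a small, self-contained C sketch. This is an analogy only: it uses a bare intrusive list rather than PostgreSQL's actual proclist machinery (which has its own safeguards). With a single embedded link per process, enqueueing the same "process" on a second wait list rewires that link and silently drops other waiters from the first list:

```c
#include <assert.h>
#include <stdio.h>

/* Minimal intrusive list; Node plays the role of the single cvWaitLink
 * member embedded in each PGPROC. */
typedef struct Node { struct Node *next; } Node;
typedef struct List { Node *head; } List;

static void list_push(List *l, Node *n) { n->next = l->head; l->head = n; }

static int list_contains(const List *l, const Node *n)
{
	const Node *c;

	for (c = l->head; c != NULL; c = c->next)
		if (c == n)
			return 1;
	return 0;
}

/* Returns 1 if the "other" waiter is still reachable on wait list A after
 * "me" has been pushed onto both lists; with a single shared link, it
 * comes back 0 because the second push clobbers me.next. */
static int other_survives_double_enqueue(void)
{
	List cv_a = { NULL };
	List cv_b = { NULL };
	Node other = { NULL };
	Node me = { NULL };

	list_push(&cv_a, &other);	/* another waiter on CV A */
	list_push(&cv_a, &me);		/* we wait on CV A too */
	list_push(&cv_b, &me);		/* ...then also get queued on CV B */

	return list_contains(&cv_a, &other);
}

int main(void)
{
	int ok = other_survives_double_enqueue();

	printf("other still reachable on A: %d\n", ok);
	assert(ok == 0);			/* the first wait list was corrupted */
	return 0;
}
```

This is why one link member per PGPROC implies at most one wait-list membership at a time, which is the premise of the question raised above.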
Re: A failure in t/001_rep_changes.pl
At Fri, 23 Feb 2024 15:50:21 +0530, vignesh C wrote in > By any chance do you have the log files when this failure occurred, if > so please share it. In my understanding, within a single instance, no two proclists can simultaneously share the same waitlink member of PGPROC. On the other hand, a publisher uses two condition variables for slots and WAL waiting, which work on the same PGPROC member cvWaitLink. I suspect this issue arises from the configuration. However, although it is unlikely to be related to this specific issue, a similar problem can arise in instances that function both as a logical publisher and a physical primary. Regardless of this issue, I think we should provide separate waitlink members for condition variables that can possibly be used simultaneously. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: About a recently-added message
At Thu, 22 Feb 2024 10:51:07 +0530, Amit Kapila wrote in > > Do you think some additional tests for the rest of the messages are > > worth the trouble? > > > > We have discussed this during development and didn't find it worth > adding tests for all misconfigured parameters. However, in the next > patch where we are planning to add a slot sync worker that will > automatically sync slots, we are adding a test for one more parameter. > I am not against adding tests for all the parameters but it didn't > appeal to add more test cycles for this. Thanks for the explanation. I'm fine with that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: About a recently-added message
At Thu, 22 Feb 2024 09:36:43 +0900 (JST), Kyotaro Horiguchi wrote in > Yes, I'm happy with all of the changes. The proposed patch appears to > cover all instances related to slotsync.c, and it looks fine to > me. Thanks! I'd like to raise another potential issue outside the patch. The patch needed to change only one test item even though it changed nine messages. This means eight out of nine messages that the patch changed are not covered by our test. I doubt all of them are worth additional test items; however, I think we want to increase coverage. Do you think some additional tests for the rest of the messages are worth the trouble? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: About a recently-added message
At Wed, 21 Feb 2024 14:57:42 +0530, Amit Kapila wrote in > On Tue, Feb 20, 2024 at 3:21 PM shveta malik wrote: > > > > okay, attached v2 patch with changed error msgs and double quotes > > around logical. > > > > Horiguchi-San, does this address all your concerns related to > translation with these new messages? Yes, I'm happy with all of the changes. The proposed patch appears to cover all instances related to slotsync.c, and it looks fine to me. Thanks! I found that logical.c is also using the policy that I complained about, but it is a separate issue. ./logical.c 122:errmsg("logical decoding requires wal_level >= logical"))); regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
At Tue, 20 Feb 2024 19:56:10 +0530, Robert Haas wrote in > It seems like maybe somebody should look into why this is happening, > and perhaps fix it. GetConnection()@streamutil.c wants to ensure conninfo has a fallback database name ("replication"). However, the function seems to be ignoring the case where neither dbname nor connection string is given, which is the first case Kuroda-san raised. The second case is the intended behavior of the function. > /* pg_recvlogical uses dbname only; others use connection_string only. > */ > Assert(dbname == NULL || connection_string == NULL); And the function incorrectly assumes that the connection string requires "dbname=replication". >* Merge the connection info inputs given in form of connection string, >* options and default values (dbname=replication, replication=true, > etc.) But the name is a pseudo database name only used by pg_hba.conf (maybe), which cannot be used as an actual database name (without quotation marks, or unless it is actually created). The function should not add the fallback database name because the connection string for a physical replication connection doesn't require the dbname parameter. (attached patch) About the proposed patch, pg_basebackup cannot verify the validity of the dbname. It could be problematic. Although I haven't looked at the original thread, it seems that the dbname is used only by pg_sync_replication_slots(). If that is true, couldn't we make the SQL function require a database name to make a connection, instead of requiring it in physical-replication conninfo? regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 56d1b15951..da63a04de4 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -78,7 +78,7 @@ GetConnection(void)
 	/*
 	 * Merge the connection info inputs given in form of connection string,
-	 * options and default values (dbname=replication, replication=true, etc.)
+	 * options and default values (replication=true, etc.)
 	 */
 	i = 0;
 	if (connection_string)
@@ -96,14 +96,6 @@ GetConnection(void)
 	keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
 	values = pg_malloc0((argcount + 1) * sizeof(*values));
 
-	/*
-	 * Set dbname here already, so it can be overridden by a dbname in the
-	 * connection string.
-	 */
-	keywords[i] = "dbname";
-	values[i] = "replication";
-	i++;
-
 	for (conn_opt = conn_opts; conn_opt->keyword != NULL; conn_opt++)
 	{
 		if (conn_opt->val != NULL && conn_opt->val[0] != '\0')
Re: Do away with zero-padding assumption before WALRead()
At Mon, 19 Feb 2024 11:02:39 +0530, Bharath Rupireddy wrote in > > After all, the patch looks good to me. > > Thanks. It was committed - > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=73f0a1326608ac3a7d390706fdeec59fe4dc42c0. Yeah. I realized that after I had already sent the mail... No harm done :p regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: A new message seems missing a punctuation
At Mon, 19 Feb 2024 10:31:33 +0530, Robert Haas wrote in > On Mon, Feb 19, 2024 at 10:10 AM Kyotaro Horiguchi > wrote: > > A recent commit (7a424ece48) added the following message: > > > > > could not sync slot information as remote slot precedes local slot: > > > remote slot "%s": LSN (%X/%X), catalog xmin (%u) local slot: LSN > > > (%X/%X), catalog xmin (%u) > > > > Since it is a bit overloaded but doesn't have a separator between > > "catalog xmin (%u)" and "local slot:", it is somewhat cluttered. Don't > > we need something, for example a semicolon there to improve > > readability and reduce clutter? > > I think maybe we should even revise the message even more. In most > places we do not just print out a whole bunch of values, but use a > sentence to connect them. Mmm. Something like this?: "could not sync slot information: local slot LSN (%X/%X) or xmin(%u) behind remote (%X/%X, %u)" Or I thought the values could be moved to DETAILS: line. By the way, the code around the message is as follows. > /* > * The remote slot didn't catch up to locally reserved position. > * > * We do not drop the slot because the restart_lsn can be ahead of the > * current location when recreating the slot in the next cycle. It may > * take more time to create such a slot. Therefore, we keep this slot > * and attempt the synchronization in the next cycle. > * > * XXX should this be changed to elog(DEBUG1) perhaps? > */ > ereport(LOG, > errmsg("could not sync slot information as remote slot precedes > local slot:" > " remote slot \"%s\": LSN (%X/%X), > catalog xmin (%u) local slot: LSN (%X/%X), catalog xmin (%u)", If we change this message to DEBUG1, a timeout feature to show a WARNING message might be needed for DBAs to notice that something bad is ongoing. However, it doesn't seem appropriate as a LOG message to me. Perhaps, a LOG message should be more like: > "LOG: waiting for local slot to catch up to remote" > "DETAIL: remote slot LSN (%X/%X); " regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center
A new message seems missing a punctuation
A recent commit (7a424ece48) added the following message: > could not sync slot information as remote slot precedes local slot: > remote slot "%s": LSN (%X/%X), catalog xmin (%u) local slot: LSN > (%X/%X), catalog xmin (%u) Since it is a bit overloaded but doesn't have a separator between "catalog xmin (%u)" and "local slot:", it is somewhat cluttered. Don't we need something, for example a semicolon there to improve readability and reduce clutter? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Do away with zero-padding assumption before WALRead()
At Mon, 19 Feb 2024 11:56:22 +0900 (JST), Kyotaro Horiguchi wrote in > Yeah, perhaps I was overly concerned. The removed comment made me > think that someone could add code relying on the incorrect assumption > that the remaining bytes beyond the returned count are cleared out. On > the flip side, SimpleXLogPageRead always reads a whole page and > returns XLOG_BLCKSZ. However, as you know, the returned buffer doesn't > contain random garbage bytes. Therefore, it's safe as long as the Forgot to mention that there is a case involving non-initialized pages, but it doesn't affect the correctness of the description you pointed out. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Do away with zero-padding assumption before WALRead()
At Fri, 16 Feb 2024 19:50:00 +0530, Bharath Rupireddy wrote in > On Fri, Feb 16, 2024 at 7:10 AM Kyotaro Horiguchi > wrote: > > 1. It's useless to copy the whole page regardless of the 'count'. It's > > enough to copy only up to the 'count'. The patch looks good in this > > regard. > > Yes, it's not needed to copy the whole page. Per my understanding > about page transfers between disk and OS page cache - upon OS page > cache miss, the whole page (of disk block size) gets fetched from disk > even if we just read 'count' bytes (< disk block size). Right, but with a possibly-different block size. Anyway that behavior doesn't affect the result of this change. (Could affect performance hereafter if it were not the case, though..) > > 2. Maybe we need a comment that states the page_read callback > > functions leave garbage bytes beyond the returned count, due to > > partial copying without clearing the unused portion. > > Isn't the comment around page_read callback at > https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/xlogreader.h;h=2e9e5f43eb2de1ca9ba81afe76d21357065c61aa;hb=d57b7cc3338e9d9aa1d7c5da1b25a17c5a72dcce#l78 > enough? > > "The callback shall return the number of bytes read (never more than > XLOG_BLCKSZ), or -1 on failure." Yeah, perhaps I was overly concerned. The removed comment made me think that someone could add code relying on the incorrect assumption that the remaining bytes beyond the returned count are cleared out. On the flip side, SimpleXLogPageRead always reads a whole page and returns XLOG_BLCKSZ. However, as you know, the returned buffer doesn't contain random garbage bytes. Therefore, it's safe as long as the caller doesn't access beyond the returned count. As a result, the description you pointed out seems to be enough. After all, the patch looks good to me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Do away with zero-padding assumption before WALRead()
At Tue, 13 Feb 2024 11:47:06 +0530, Bharath Rupireddy wrote in > Hi, > > I noticed an assumption [1] at WALRead() call sites expecting the > flushed WAL page to be zero-padded after the flush LSN. I think this > can't always be true as the WAL can get flushed after determining the > flush LSN before reading it from the WAL file using WALRead(). I've > hacked the code up a bit to check if that's true - Good catch! The comment also seems wrong to me. The subsequent bytes can be written simultaneously, and it's very normal that there are unflushed bytes in the OS's page buffer. The objective of the comment seems to be to declare that there's no need to clear out the remaining bytes here. I agree that it's not a problem for now. However, I think we need two fixes here. 1. It's useless to copy the whole page regardless of the 'count'. It's enough to copy only up to the 'count'. The patch looks good in this regard. 2. Maybe we need a comment that states the page_read callback functions leave garbage bytes beyond the returned count, due to partial copying without clearing the unused portion. > I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ > despite knowing exactly how much to read. Is it to tell the OS to > explicitly fetch the whole page from the disk? If yes, the OS will do > that anyway because the page transfers from disk to OS page cache are > always in terms of disk block sizes, no? If I understand your question correctly, I guess that the whole-page copy was expected to clear out the remaining bytes, as I mentioned earlier. > Although, there's no immediate problem with it right now, the > assumption is going to be incorrect when reading WAL from WAL buffers > using WALReadFromBuffers - > https://www.postgresql.org/message-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hla_hsgwnx2y5-g+mswd...@mail.gmail.com. > > If we have no reason, can the WALRead() callers just read how much > they want like walsender for physical replication? 
> Attached a patch > for the change. > > Thoughts? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: About a recently-added message
At Thu, 15 Feb 2024 09:22:23 +0530, shveta malik wrote in > On Thu, Feb 15, 2024 at 8:26 AM Amit Kapila wrote: > > > > On Wed, Feb 14, 2024 at 7:51 PM Euler Taveira wrote: > > > > > > On Wed, Feb 14, 2024, at 8:45 AM, Amit Kapila wrote: > > > > > > Now, I am less clear about whether to quote "logical" or not in the > > > above message. Do you have any suggestions? > > > > > > > > > The possible confusion comes from the fact that the sentence contains > > > "must be" > > > in the middle of a comparison expression. For pg_createsubscriber, we are > > > using > > > > > > publisher requires wal_level >= logical > > > > > > I suggest to use something like > > > > > > slot synchronization requires wal_level >= logical > > > > > > > This sounds like a better idea but shouldn't we directly use this in > > 'errmsg' instead of a separate 'errhint'? Also, if change this, then > > we should make a similar change for other messages in the same > > function. > > +1 on changing the msg(s) suggested way. Please find the patch for the > same. It also removes double quotes around the variable names Thanks for the discussion. With a translator hat on, I would be happy if I could determine whether a word requires translation with minimal background information. In this case, a translator needs to know which values wal_level can take. It's relatively easy in this case, but I'm not sure if this is always the case. Therefore, I would be slightly happier if "logical" were double-quoted. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: About a recently-added message
At Wed, 14 Feb 2024 16:26:52 +0900 (JST), Kyotaro Horiguchi wrote in > > "wal_level" must be >= logical. .. > > wal_level must be set to "replica" or "logical" at server start. .. > I suggest making the quoting policy consistent. Just after this, I found another inconsistency regarding quotation. > 'dbname' must be specified in "%s". The use of single quotes doesn't seem to comply with our standard. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
About a recently-added message
A recent commit added the following message: > "wal_level" must be >= logical. The use of the term "logical" here is a bit confusing, as it's unclear whether it's meant to be a natural language word or a token. (I believe it to be a token.) On the contrary, we already have the following message: > wal_level must be set to "replica" or "logical" at server start. This has the conflicting policy about quotation of variable names and enum values. I suggest making the quoting policy consistent. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: A comment in DropRole() contradicts the actual behavior
At Thu, 8 Feb 2024 16:39:23 +0900, Michael Paquier wrote in > On Thu, Feb 08, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote: > > Hello, > > > > Please look at errors, which produced by the following script, starting > > from 6566133c5: > > for i in `seq 100`; do (echo "CREATE USER u; DROP USER u;"); done | psql > > >psql-1.log 2>&1 & > > for i in `seq 100`; do (echo "CREATE USER u; DROP USER u;"); done | psql > > >psql-2.log 2>&1 & > > wait > > > > ERROR: could not find tuple for role 16387 > > ERROR: could not find tuple for role 16390 > > ERROR: could not find tuple for role 16394 > > ... > > > > Maybe these errors are expected, but then I'm confused by the comment in > > DropRole(): > > Well, these errors should never happen, IMHO, so it seems to me that > the comment is right and that the code lacks locking, contrary to what > the comment tells. The function acquires a lock, but it does not perform an existence check until it first attempts to fetch the tuple, believing in its presence. However, I doubt performing an additional existence check right after acquiring the lock is worthwhile. By the way, I saw the following error with the provided script: > ERROR: duplicate key value violates unique constraint > "pg_authid_rolname_index" > DETAIL: Key (rolname)=(u) already exists. > STATEMENT: CREATE USER u; This seems to be another instance of a similar thinko. I vaguely think that we should just regard the absence as a concurrent drop and either adjust or remove the message, then fix the comment. The situation is slightly different for the duplication case. We shouldn't ignore it but rather need to adjust the error message. As long as these behaviors don't lead to inconsistencies. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: InstallXLogFileSegment() vs concurrent WAL flush
At Fri, 2 Feb 2024 14:42:46 +0100, Thomas Munro wrote in > On Fri, Feb 2, 2024 at 12:56 PM Yugo NAGATA wrote: > > On Fri, 2 Feb 2024 11:18:18 +0100 > > Thomas Munro wrote: > > > One simple way to address that would be to make XLogFileInitInternal() > > > wait for InstallXLogFileSegment() to finish. It's a little > > > > Or, can we make sure the rename is durable by calling fsync before > > returning the fd, as a patch attached here? > > Right, yeah, that works too. I'm not sure which way is better. I'm not sure I like issuing spurious syncs unconditionally. Therefore, I prefer Thomas' approach in that regard. 0002 would be beneficial, considering the case of a very large max_wal_size, and the code seems to be the minimal required. I don't think it matters that the lock attempts occur uselessly until the first segment installation. That being said, we could avoid it by initializing last_known_installed_segno properly. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA wrote in > On Mon, 05 Feb 2024 11:28:59 +0900 > torikoshia wrote: > > > > Based on this, I've made a patch. > > > based on COPY Synopsis: ON_ERROR 'error_action' > > > on_error 'null', the keyword NULL should be single quoted. > > > > As you mentioned, single quotation seems a little odd.. > > > > I'm not sure what is the best name and syntax for this feature, but > > since current error_action are verbs('stop' and 'ignore'), I feel 'null' > > might not be appropriate. > > I am not in favour of using 'null' either, so I suggested to use > "set_to_null" or more generic syntax like "set_to (col, val)" in my > previous post[1], although I'm not convinced what is the best either. > > [1] > https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp Tom suggested using a separate option, and I agree with the suggestion. Taking this into consideration, I imagined something like the following, for example. Although I'm not sure we are actually going to do whole-tuple replacement, the action name in this example has the suffix '-column'. COPY (on_error 'replace-column', replacement 'null') .. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: More new SQL/JSON item methods
Sorry for a minor correction, but.. At Thu, 01 Feb 2024 14:53:57 +0900 (JST), Kyotaro Horiguchi wrote in > Ah.. Understood. "NaN or Infinity" cannot be used in those > cases. Additionally, for jpiBoolean and jpiBigint, we lack the text > representation of the value. This "Additionally" was merely left in by mistake. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: More new SQL/JSON item methods
At Thu, 1 Feb 2024 09:22:22 +0530, Jeevan Chalke wrote in > On Thu, Feb 1, 2024 at 7:24 AM Kyotaro Horiguchi > wrote: > > > At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi < > > horikyota@gmail.com> wrote in > > > By the way, while playing with this feature, I noticed the following > > > error message: > > > > > > > select jsonb_path_query('1.1' , '$.boolean()'); > > > > ERROR: numeric argument of jsonpath item method .boolean() is out of > > range for type boolean > > > > > > The error message seems a bit off to me. For example, "argument '1.1' > > > is invalid for type [bB]oolean" seems more appropriate for this > > > specific issue. (I'm not certain about our policy on the spelling of > > > Boolean..) > > > > Or, following our general convention, it would be spelled as: > > > > 'invalid argument for type Boolean: "1.1"' > > > > jsonpath way: Hmm. I see. > ERROR: argument of jsonpath item method .boolean() is invalid for type > boolean > > or, if we add input value, then > > ERROR: argument "1.1" of jsonpath item method .boolean() is invalid for > type boolean > > And this should work for all the error types, like out of range, not valid, > invalid input, etc, etc. Also, we don't need separate error messages for > string input as well, which currently has the following form: > > "string argument of jsonpath item method .%s() is not a valid > representation.." Agreed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: More new SQL/JSON item methods
At Thu, 1 Feb 2024 09:19:40 +0530, Jeevan Chalke wrote in > > As Tom suggested, given that similar situations have already been > > disregarded elsewhere, worrying about excessively long input strings > > in this specific instance won't notably improve safety in total. > > > > > Also, for non-string input, we need to convert numeric to string just for > > > the error message, which seems overkill. > > > > As I suggested and you seem to agree, using literally "NaN or > > Infinity" would be sufficient. > > > > I am more concerned about .bigint() and .integer(). We can have errors when > the numeric input is out of range, but not NaN or Infinity. At those > places, we need to convert numeric to string to put that value into the > error. > Do you mean we should still put "NaN or Infinity" there? > > This is the case: > select jsonb_path_query('12345678901', '$.integer()'); > ERROR: numeric argument of jsonpath item method .integer() is out of > range for type integer Ah.. Understood. "NaN or Infinity" cannot be used in those cases. Additionally, for jpiBoolean and jpiBigint, we lack the text representation of the value. From a quick grep, I found that the following functions call numeric_out to convert jbvNumeric values back into text representation: JsonbValueAsText, populate_scalar, iterate_jsonb_values, executeItemOptUnwrapTarget, jsonb_put_escaped_value The function iterate_jsonb_values(), in particular, iterates over a values array, calling numeric_out on each iteration. The following functions re-convert the converted numeric into another type: jsonb_int[248]() converts the numeric value into int[248] using numeric_int[248](). jsonb_float[48]() converts it into float[48] using numeric_float[48](). Given these facts, it seems more efficient for jbvNumeric to retain the original scalar value, converting it only when necessary. If needed, we could also add a numeric struct member as a cache for better performance. 
I'm not sure we refer to the values more than once, though. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: More new SQL/JSON item methods
At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi wrote in > By the way, while playing with this feature, I noticed the following > error message: > > > select jsonb_path_query('1.1' , '$.boolean()'); > > ERROR: numeric argument of jsonpath item method .boolean() is out of range > > for type boolean > > The error message seems a bit off to me. For example, "argument '1.1' > is invalid for type [bB]oolean" seems more appropriate for this > specific issue. (I'm not certain about our policy on the spelling of > Boolean..) Or, following our general convention, it would be spelled as: 'invalid argument for type Boolean: "1.1"' regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: More new SQL/JSON item methods
Thank you for the fix! At Tue, 30 Jan 2024 13:46:17 +0530, Jeevan Chalke wrote in > On Mon, Jan 29, 2024 at 11:03 AM Tom Lane wrote: > > > Kyotaro Horiguchi writes: > > > Having said that, I'm a bit concerned about the case where an overly > > > long string is given there. However, considering that we already have > > > "invalid input syntax for type xxx: %x" messages (including for the > > > numeric), this concern might be unnecessary. > > > > Yeah, we have not worried about that in the past. > > > > > Another concern is that the input value is already a numeric, not a > > > string. This situation occurs when the input is NaN of +-Inf. Although > > > numeric_out could be used, it might cause another error. Therefore, > > > simply writing out as "argument NaN and Infinity.." would be better. > > > > Oh! I'd assumed that we were discussing a string that we'd failed to > > convert to numeric. If the input is already numeric, then either > > the error is unreachable or what we're really doing is rejecting > > special values such as NaN on policy grounds. I would ask first > > if that policy is sane at all. (I'd lean to "not" --- if we allow > > it in the input JSON, why not in the output?) If it is sane, the > > error message needs to be far more specific. > > > > regards, tom lane > > > > *Consistent error message related to type:* ... > What if we use phrases like "for type double precision" at both places, > like: > numeric argument of jsonpath item method .%s() is out of range for type > double precision > string argument of jsonpath item method .%s() is not a valid representation > for type double precision > > With this, the rest will be like: > for type numeric > for type bigint > for type integer > > Suggestions? FWIW, I prefer consistently using "for type hoge". > --- > > *Showing input string in the error message:* > > argument "...input string here..." 
of jsonpath item method .%s() is out of > range for type numeric > > If we add the input string in the error, then for some errors, it will be > too big, for example: > > -ERROR: numeric argument of jsonpath item method .double() is out of range > for type double precision > +ERROR: argument > "10" > of jsonpath item method .double() is out of range for type double precision As Tom suggested, given that similar situations have already been disregarded elsewhere, worrying about excessively long input strings in this specific instance won't notably improve safety in total. > Also, for non-string input, we need to convert numeric to string just for > the error message, which seems overkill. As I suggested and you seem to agree, using literally "NaN or Infinity" would be sufficient. > On another note, irrespective of these changes, is it good to show the > given input in the error messages? Error messages are logged and may leak > some details. > > I think the existing way seems ok. In my opinion, it is quite common to include the error-causing value in error messages. Also, we already have many functions that imply the possibility of revealing input values when converting text representation into internal format, such as int4in. However, I don't insist on that approach. > --- > > *NaN and Infinity restrictions:* > > I am not sure why NaN and Infinity are not allowed in conversion to double > precision (.double() method). I have used the same restriction for > .decimal() and .number(). However, as you said, we should have error > messages more specific. I tried that in the attached patch; please have > your views. I have the following wordings for that error message: > "NaN or Infinity is not allowed for jsonpath item method .%s()" > > Suggestions... They seem good to *me*. 
By the way, while playing with this feature, I noticed the following error message: > select jsonb_path_query('1.1' , '$.boolean()'); > ERROR: numeric argument of jsonpath item method .boolean() is out of range > for type boolean The error message seems a bit off to me. For example, "argument '1.1' is invalid for type [bB]oolean" seems more appropriate for this specific issue. (I'm not certain about our policy on the spelling of Boolean..) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
possible inter-gcc-version incompatibility regarding fat objects
Hello. I would like to share some information regarding a recent issue we encountered. While compiling an extension against the RPM package postgresql16-devel-16.1-2PGDG.rhel9.x86_64.rpm we faced a problem that appears to be caused by the combination of LTO and constant folding/propagation, leading to a SEGV. This crash didn't occur when we gave -fno-lto instead of -flto=auto to the compiler. To put this case briefly: postgresql16-devel-16.1-2PGDG.rhel9.x86_64.rpm + gcc 11.3 => building the extension completes but the binary may crash with SEGV -fno-lto prevents this crash from occurring. The PG package is built with gcc-11.4, and the compiler in our build environment for the extension was gcc-11.3. A quick search suggests that the LTO bytecode format was optimized in gcc-11.4. Conversely, when we built the extension with postgresql-16-16.0 and gcc-11.4, the build fails. This failure seems to stem from LTO binary format incompatibility. > Error: bad register name `%' > Error: open CFI at the end of file; missing .cfi_endproc directive > /usr/bin/ld: error: lto-wrapper failed To put this case briefly: postgresql16-devel-16.0-1PGDG.rhel9.x86_64.rpm + gcc 11.4 => build failure regarding LTO Given these results, it seems there may be problems with including fat objects in the -devel package. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Network failure may prevent promotion
Thank you for fixing the issue. At Tue, 23 Jan 2024 11:43:43 +0200, Heikki Linnakangas wrote in > There's an existing AmWalReceiverProcess() macro too. Let's use that. Mmm. I sought an Is* function because "IsLogicalWorker()" is placed on the previous line. Our convention regarding those functions (macros) and variables seems inconsistent. However, I can't say for sure that we should unify all of them. > (See also > https://www.postgresql.org/message-id/f3ecd4cb-85ee-4e54-8278-5fabfb3a4ed0%40iki.fi > for refactoring in this area) > > Here's a patch set summarizing the changes so far. They should be > squashed, but I kept them separate for now to help with review: > > 1. revert the revert of 728f86fec6. > 2. your walrcv_shutdown_deblocking_v2-2.patch > 3. Also replace libpqrcv_PQexec() and libpqrcv_PQgetResult() with the > wrappers from libpq-be-fe-helpers.h Both replacements look fine. I didn't find another instance of similar code. > 4. Replace IsWalReceiver() with AmWalReceiverProcess() Looks just fine. > >> - pqsignal(SIGTERM, SignalHandlerForShutdownRequest); /* request > >> - shutdown */ > >> + pqsignal(SIGTERM, WalRcvShutdownSignalHandler); /* request shutdown > >> */ > >> > >> Can't we just use die(), instead? > > There was a comment explaining the problems associated with exiting > > within a signal handler; > > - * Currently, only SIGTERM is of interest. We can't just exit(1) within > > - * the > > - * SIGTERM signal handler, because the signal might arrive in the middle > > - * of > > - * some critical operation, like while we're holding a spinlock. > > - * Instead, the > > And I think we should keep the considerations it suggests. The patch > > removes the comment itself, but it does so because it implements our > > standard process exit procedure, which incorporates points suggested > > by the now-removed comment. > > die() doesn't call exit(1). Unless DoingCommandRead is set, but it > never is in the walreceiver. 
It looks just like the new > WalRcvShutdownSignalHandler() function. Am I missing something? Ugh.. Doesn't the name 'die()' suggest exit()? I agree that die() can be used instead. > Hmm, but doesn't bgworker_die() have that problem with exit(1)ing in > the signal handler? I noticed that but ignored it for now. > I also wonder if we should replace SignalHandlerForShutdownRequest() > completely with die(), in all processes? The difference is that > SignalHandlerForShutdownRequest() uses ShutdownRequestPending, while > die() uses ProcDiePending && InterruptPending to indicate that the > signal was received. Or do some of the processes want to check for > ShutdownRequestPending only at specific places, and don't want to get > terminated at the any random CHECK_FOR_INTERRUPTS()? At least, pg_log_backend_memory_context() causes a call to ProcessInterrupts via "ereport(LOG_SERVER_ONLY" which can lead to an exit due to ProcDiePending. In this regard, checkpointer clearly requires the distinction. Rather than merely consolidating the notification variables and striving to annihilate CFI calls in the execution path, I believe we need a shutdown mechanism that CFI doesn't react to. However, as for the method to achieve this, whether we should keep the notification variables separate as they are now, or whether it would be better to introduce a variable that causes CFI to ignore ProcDiePending, is a matter I think is open to discussion. Attached patches are the rebased version of v3 (0003 is updated) and additional 0005 that makes use of die() instead of walreceiver's custom function. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From daac3aa06bd33f5e8f04a406c8a59fd4cc97 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 23 Jan 2024 11:01:03 +0200 Subject: [PATCH v4 1/5] Revert "Revert "libpqwalreceiver: Convert to libpq-be-fe-helpers.h"" This reverts commit 21ef4d4d897563adb2f7920ad53b734950f1e0a4. 
--- .../libpqwalreceiver/libpqwalreceiver.c | 55 +++ 1 file changed, 8 insertions(+), 47 deletions(-) diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 2439733b55..dbee2f8f0e 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -24,6 +24,7 @@ #include "common/connect.h" #include "funcapi.h" #include "libpq-fe.h" +#include "libpq/libpq-be-fe-helpers.h" #include "mb/pg_wchar.h" #include "miscadmin.h" #include "pgstat.h" @@ -136,7 +137,6 @@ libp
Re: More new SQL/JSON item methods
At Sun, 28 Jan 2024 22:47:02 -0500, Tom Lane wrote in Kyotaro Horiguchi writes: > They seem to be suggesting that PostgreSQL has the types "decimal" and > "number". I know of the former, but I don't think PostgreSQL has the > latter type. Perhaps the "number" was intended to refer to "numeric"? Probably. But I would write just "type numeric". We do not generally acknowledge "decimal" as a separate type, because for us it's only an alias for numeric (there is not a pg_type entry for it). Also, that leads to the thought that "numeric argument ... is out of range for type numeric" seems either redundant or contradictory depending on how you look at it. So I suggest wording like argument "...input string here..." of jsonpath item method .%s() is out of range for type numeric > (And I think it would be quite helpful if the given string were shown in > the error message, but that would be another issue.) Agreed, so I suggest the above. Having said that, I'm a bit concerned about the case where an overly long string is given there. However, considering that we already have "invalid input syntax for type xxx: %x" messages (including for the numeric), this concern might be unnecessary. Another concern is that the input value is already a numeric, not a string. This situation occurs when the input is NaN or +-Inf. Although numeric_out could be used, it might cause another error. Therefore, simply writing out as "argument NaN and Infinity.." would be better. Once we resolve these issues, another question arises regarding one of the messages. In the case of NaN or Infinity, the message will be as follows: "argument NaN or Infinity of jsonpath item method .%s() is out of range for type numeric" This message appears quite strange and puzzling. I suspect that the intended message was somewhat different. 
> The same commit has introduced the following set of messages: >> %s format is not recognized: "%s" >> date format is not recognized: "%s" >> time format is not recognized: "%s" >> time_tz format is not recognized: "%s" >> timestamp format is not recognized: "%s" >> timestamp_tz format is not recognized: "%s" > I believe that the first line was intended to cover all the others:p +1 regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: More new SQL/JSON item methods
I have found two possible issues in a recent commit. Commit 66ea94e8e6 has introduced the following messages: (Apologies in advance if the commit is not related to this thread.) jsonpath_exec.c:1287 > if (numeric_is_nan(num) || numeric_is_inf(num)) > RETURN_ERROR(ereport(ERROR, > (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM), > errmsg("numeric argument of jsonpath item method .%s() is out > of range for type decimal or number", >jspOperationName(jsp->type); :1387 > noerr = DirectInputFunctionCallSafe(numeric_in, numstr, ... > if (!noerr || escontext.error_occurred) > RETURN_ERROR(ereport(ERROR, > (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM), > errmsg("string argument of jsonpath item method .%s() is not a > valid representation of a decimal or number", They seem to be suggesting that PostgreSQL has the types "decimal" and "number". I know of the former, but I don't think PostgreSQL has the latter type. Perhaps the "number" was intended to refer to "numeric"? (And I think it would be quite helpful if the given string were shown in the error message, but that would be another issue.) The same commit has introduced the following set of messages: > %s format is not recognized: "%s" > date format is not recognized: "%s" > time format is not recognized: "%s" > time_tz format is not recognized: "%s" > timestamp format is not recognized: "%s" > timestamp_tz format is not recognized: "%s" I believe that the first line was intended to cover all the others:p They are attached to this message separately. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c index 22f598cd35..c10926a8a2 100644 --- a/src/backend/utils/adt/jsonpath_exec.c +++ b/src/backend/utils/adt/jsonpath_exec.c @@ -1287,7 +1287,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp, if (numeric_is_nan(num) || numeric_is_inf(num)) RETURN_ERROR(ereport(ERROR, (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM), - errmsg("numeric argument of jsonpath item method .%s() is out of range for type decimal or number", + errmsg("numeric argument of jsonpath item method .%s() is out of range for type decimal or numeric", jspOperationName(jsp->type); if (jsp->type == jpiDecimal) @@ -1312,14 +1312,14 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp, if (!noerr || escontext.error_occurred) RETURN_ERROR(ereport(ERROR, (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM), - errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or number", + errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or numeric", jspOperationName(jsp->type); num = DatumGetNumeric(datum); if (numeric_is_nan(num) || numeric_is_inf(num)) RETURN_ERROR(ereport(ERROR, (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM), - errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or number", + errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or numeric", jspOperationName(jsp->type); res = jperOk; @@ -1401,7 +1401,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp, if (!noerr || escontext.error_occurred) RETURN_ERROR(ereport(ERROR, (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM), - errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or number", + errmsg("string argument of jsonpath item method .%s() 
is not a valid representation of a decimal or numeric", jspOperationName(jsp->type); num = DatumGetNumeric(numdatum); diff --git a/src/test/regress/expected/jsonb_jsonpath.out b/src/test/regress/expected/jsonb_jsonpath.out index eea2af30c8..9d8ce48a25 100644 --- a/src/test/regress/expected/jsonb_jsonpath.out +++ b/src/test/regress/expected/jsonb_jsonpath.out @@ -2144,7 +2144,7 @@ select jsonb_path_query('"1.23"', '$.decimal()'); (1 row) select jsonb_path_
Re: Network failure may prevent promotion
Thank you for looking at this! At Tue, 23 Jan 2024 15:07:10 +0900, Fujii Masao wrote in > Regarding the patch, here are the review comments. > > +/* > + * Is current process a wal receiver? > + */ > +bool > +IsWalReceiver(void) > +{ > + return WalRcv != NULL; > +} > > This looks wrong because WalRcv can be non-NULL in processes other > than walreceiver. Mmm. Sorry for the silly mistake. We can use B_WAL_RECEIVER instead. I'm not sure if the new function IsWalReceiver() is required. The expression "MyBackendType == B_WAL_RECEIVER" is quite descriptive. However, the function does make ProcessInterrupts() more aligned regarding process types. > - pqsignal(SIGTERM, SignalHandlerForShutdownRequest); /* request shutdown */ > + pqsignal(SIGTERM, WalRcvShutdownSignalHandler); /* request shutdown */ > > Can't we just use die(), instead? There was a comment explaining the problems associated with exiting within a signal handler; - * Currently, only SIGTERM is of interest. We can't just exit(1) within the - * SIGTERM signal handler, because the signal might arrive in the middle of - * some critical operation, like while we're holding a spinlock. Instead, the And I think we should keep the considerations it suggests. The patch removes the comment itself, but it does so because it implements our standard process exit procedure, which incorporates points suggested by the now-removed comment. 
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 201c36cb22..db779dc6ca 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -749,7 +749,7 @@ libpqrcv_PQgetResult(PGconn *streamConn) if (rc & WL_LATCH_SET) { ResetLatch(MyLatch); - ProcessWalRcvInterrupts(); + CHECK_FOR_INTERRUPTS(); } /* Consume whatever data is available from the socket */ @@ -1053,7 +1053,7 @@ libpqrcv_processTuples(PGresult *pgres, WalRcvExecResult *walres, { char *cstrs[MaxTupleAttributeNumber]; - ProcessWalRcvInterrupts(); + CHECK_FOR_INTERRUPTS(); /* Do the allocations in temporary context. */ oldcontext = MemoryContextSwitchTo(rowcontext); diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 728059518e..e491f7d4c5 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -147,39 +147,34 @@ static void XLogWalRcvSendReply(bool force, bool requestReply); static void XLogWalRcvSendHSFeedback(bool immed); static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime); static void WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now); +static void WalRcvShutdownSignalHandler(SIGNAL_ARGS); -/* - * Process any interrupts the walreceiver process may have received. - * This should be called any time the process's latch has become set. - * - * Currently, only SIGTERM is of interest. We can't just exit(1) within the - * SIGTERM signal handler, because the signal might arrive in the middle of - * some critical operation, like while we're holding a spinlock. Instead, the - * signal handler sets a flag variable as well as setting the process's latch. - * We must check the flag (by calling ProcessWalRcvInterrupts) anytime the - * latch has become set. 
Operations that could block for a long time, such as - * reading from a remote server, must pay attention to the latch too; see - * libpqrcv_PQgetResult for example. - */ void -ProcessWalRcvInterrupts(void) +WalRcvShutdownSignalHandler(SIGNAL_ARGS) { - /* - * Although walreceiver interrupt handling doesn't use the same scheme as - * regular backends, call CHECK_FOR_INTERRUPTS() to make sure we receive - * any incoming signals on Win32, and also to make sure we process any - * barrier events. - */ - CHECK_FOR_INTERRUPTS(); + int save_errno = errno; - if (ShutdownRequestPending) + /* Don't joggle the elbow of proc_exit */ + if (!proc_exit_inprogress) { - ereport(FATAL, -(errcode(ERRCODE_ADMIN_SHUTDOWN), - errmsg("terminating walreceiver process due to administrator command"))); + InterruptPending = true; + ProcDiePending = true; } + + SetLatch(MyLatch); + + errno = save_errno; + } +/* + * Is current process a wal receiver? + */ +bool +IsWalReceiver(void) +{ + return MyBackendType == B_WAL_RECEIVER; +} /* Main entry point for walreceiver process */ void @@ -277,7 +272,7 @@ WalReceiverMain(void) pqsignal(SIGHUP, SignalHandlerForConfigReload); /* set flag to read config * file */ pqsignal(SIGINT, SIG_IGN); - pqsignal(SIGTERM, SignalHandlerForShutdownRequest); /* request shutdown */ + pqsignal(SIGTERM, WalRcvShutdownSignalHandler); /* request shutdown *
Fix some errdetail's message format
We recently made corrections to the capitalization of DETAIL messages. For example: - GUC_check_errdetail("invalid list syntax in parameter %s", + GUC_check_errdetail("Invalid list syntax in parameter %s", But it is still missing the period at the end. There are several patterns to this issue, but this time, I have only fixed the ones that are simple and obvious, as follows: a. GUC_check_errdetail("LITERAL"), errdetail("LITERAL") without a period. b. psprintf()'ed string that is passed to errdetail_internal() I didn't touch the following patterns: c. errdetail_internal("%s") d. errdetail("Whatever: %s") e. errdetail("%s versus %s") and the like f. errdetail_internal("%s", pchomp(PQerrorMessage())) g. complex message compilation The attached patch contains the following fix: - GUC_check_errdetail("timestamp out of range: \"%s\"", str); + GUC_check_errdetail("Timestamp out of range: \"%s\".", str); But I'm not quite confident about whether capitalizing the type name here is correct. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/contrib/cube/cubescan.l b/contrib/cube/cubescan.l index a30fbfc311..b45dc08f0c 100644 --- a/contrib/cube/cubescan.l +++ b/contrib/cube/cubescan.l @@ -82,7 +82,7 @@ cube_yyerror(NDBOX **result, Size scanbuflen, (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("invalid input syntax for cube"), /* translator: %s is typically "syntax error" */ - errdetail("%s at end of input", message))); + errdetail("%s at end of input.", message))); } else { @@ -90,7 +90,7 @@ cube_yyerror(NDBOX **result, Size scanbuflen, (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("invalid input syntax for cube"), /* translator: first %s is typically "syntax error" */ - errdetail("%s at or near \"%s\"", message, yytext))); + errdetail("%s at or near \"%s\".", message, yytext))); } } diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 19a362526d..2720e91c14 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -2633,7 +2633,7 @@ dblink_security_check(PGconn *conn, remoteConn *rconn, const char *connstr) ereport(ERROR, (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED), errmsg("password or GSSAPI delegated credentials required"), - errdetail("Non-superusers may only connect using credentials they provide, eg: password in connection string or delegated GSSAPI credentials"), + errdetail("Non-superusers may only connect using credentials they provide, eg: password in connection string or delegated GSSAPI credentials."), errhint("Ensure provided credentials match target server's authentication method."))); } diff --git a/contrib/seg/segscan.l b/contrib/seg/segscan.l index 4ad529eccc..f5a72c2496 100644 --- a/contrib/seg/segscan.l +++ b/contrib/seg/segscan.l @@ -79,7 +79,7 @@ seg_yyerror(SEG *result, struct Node *escontext, const char *message) (errcode(ERRCODE_SYNTAX_ERROR), errmsg("bad seg representation"), /* translator: %s is typically "syntax error" */ - errdetail("%s at end of 
input", message))); + errdetail("%s at end of input.", message))); } else { @@ -87,7 +87,7 @@ seg_yyerror(SEG *result, struct Node *escontext, const char *message) (errcode(ERRCODE_SYNTAX_ERROR), errmsg("bad seg representation"), /* translator: first %s is typically "syntax error" */ - errdetail("%s at or near \"%s\"", message, yytext))); + errdetail("%s at or near \"%s\".", message, yytext))); } } diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 1b48d7171a..2d90bbfee8 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -4882,7 +4882,7 @@ check_recovery_target_time(char **newval, void **extra, GucSource source) if (tm2timestamp(tm, fsec, &tz, ×tamp) != 0) { -GUC_check_errdetail("timestamp out of range: \"%s\"", str); +GUC_check_errdetail("Timestamp out of range: \"%s\".", str); return false; } } diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index 226f85d0e3..564f79dab9 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -2946,7 +2946,7 @@ AlterExtensionNamespace(const char *extensionName, const char *newschema, Oid *o (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("extension
Re: Network failure may prevent promotion
At Mon, 22 Jan 2024 13:29:10 -0800, Andres Freund wrote in > Hi, > > On 2024-01-19 12:28:05 +0900, Michael Paquier wrote: > > On Thu, Jan 18, 2024 at 03:42:28PM +0200, Heikki Linnakangas wrote: > > > Given that commit 728f86fec6 that introduced this issue was not strictly > > > required, perhaps we should just revert it for v16. > > > > Is there a point in keeping 728f86fec6 as well on HEAD? That does not > > strike me as wise to keep that in the tree for now. If it needs to be > > reworked, looking at this problem from scratch would be a safer > > approach. > > IDK, I think we'll introduce this type of bug over and over if we don't fix it > properly. Just to clarify my position, I thought that 728f86fec6 was heading in the right direction. Considering the current approach to signal handling in walreceiver, I believed that it would be better to further generalize in this direction rather than reverting. That's why I proposed that patch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: In-placre persistance change of a relation
At Mon, 22 Jan 2024 15:36:31 +1100, Peter Smith wrote in
> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there was a CFbot test failure last time it was run [2]. Please have a
> look and post an updated version if necessary. Thanks!

I have added the necessary includes to the header file this patch
adds. With this change, "make headerscheck" now passes. However, when
I run "make cpluspluscheck" in my environment, it generates a large
number of errors in other areas, but I didn't find any related to this
patch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 9a2b6fbda882587c127d3e50bccf89508837d1a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v32 1/3] Export wal_sync_method related functions

Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
 src/backend/access/transam/xlog.c | 73 +--
 src/include/access/xlog.h         |  2 +
 2 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..c5f51849ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8403,21 +8403,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
 	}
 }
 
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+	return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
 /*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
  *
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
  */
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
 {
-	char	   *msg = NULL;
+	const char *msg = NULL;
 	instr_time	start;
 
-	Assert(tli != 0);
-
 	/*
 	 * Quick exit if fsync is disabled or write() has already synced the WAL
 	 * file.
@@ -8425,7 +8433,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	if (!enableFsync ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
-		return;
+		return NULL;
 
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
@@ -8460,19 +8468,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 			break;
 	}
 
-	/* PANIC if failed to fsync */
-	if (msg)
-	{
-		char		xlogfname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		XLogFileName(xlogfname, tli, segno, wal_segment_size);
-		errno = save_errno;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg(msg, xlogfname)));
-	}
-
 	pgstat_report_wait_end();
 
 	/*
@@ -8486,7 +8481,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
 	}
 
-	PendingWalStats.wal_sync++;
+	if (msg != NULL)
+		PendingWalStats.wal_sync++;
+
+	return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+	const char *msg;
+
+	Assert(tli != 0);
+
+	msg = XLogFsyncFile(fd);
+
+	/* PANIC if failed to fsync */
+	if (msg)
+	{
+		char		xlogfname[MAXFNAMELEN];
+		int			save_errno = errno;
+
+		XLogFileName(xlogfname, tli, segno, wal_segment_size);
+		errno = save_errno;
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg(msg, xlogfname)));
+	}
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..2a0d65b537 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
+extern int	XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
 
 extern bool RecoveryInProgress(void);
-- 
2.39.3

>From c464013071dedc15b838e573ae828f150b3b60f7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v32 2/3] Introduce undo log implementation

This patch adds a simple implementation of the UNDO log feature.
---
 src/backend/access/transam/Makefile    | 1 +
 src/backend/access/transam/meson.build | 1 +
 src/backend/access/transam/rmgr.c
Re: Make mesage at end-of-recovery less scary.
At Wed, 17 Jan 2024 14:32:00 +0900, Michael Paquier wrote in
> On Tue, Jan 16, 2024 at 02:46:02PM +0300, Aleksander Alekseev wrote:
> >> For now, let me explain the basis for this patch. The fundamental
> >> issue is that these warnings that always appear are, in practice, not
> >> a problem in almost all cases. Some of those who encounter them for
> >> the first time may feel uneasy and reach out with inquiries. On the
> >> other hand, those familiar with these warnings tend to ignore them and
> >> only pay attention to details when actual issues arise. Therefore, the
> >> intention of this patch is to label them as "no issue" unless a
> >> problem is blatantly evident, in order to prevent unnecessary concern.
> >
> > I agree and don't mind affecting the error message per se.
> >
> > However I see that the actual logic of how WAL is processed is being
> > changed. If we do this, at very least it requires thorough thinking. I
> > strongly suspect that the proposed code is wrong and/or not safe
> > and/or less safe than it is now for the reasons named above.
>
> FWIW, that pretty much sums up my feeling regarding this patch,
> because an error, basically any error, would hurt back very badly.
> Sure, the error messages we generate now when reaching the end of WAL
> can sound scary, and they are (I suspect that's not really the case
> for anybody who has history doing support with PostgreSQL because a
> bunch of these messages are old enough to vote, but I can understand
> that anybody would freak out the first time they see that).
>
> However, per the recent issues we've had in this area, like
> cd7f19da3468 but I'm more thinking about 6b18b3fe2c2f and
> bae868caf222, I am of the opinion that the header validation, the
> empty page case in XLogReaderValidatePageHeader() and the record read
> changes are risky enough that I am not convinced that the gains are
> worth the risks taken.
>
> The error stack in the WAL reader is complicated enough that making it
> more complicated as the patch proposes does not sound like a good
> tradeoff to me to make the reports related to the end of WAL cleaner
> for the end-user. I agree that we should do something, but the patch
> does not seem like a good step towards this goal. Perhaps somebody
> would be more excited about this proposal than I am, of course.

Thank you both for the comments. The criticism seems valid. The
approach to identifying the end-of-WAL state in this patch is quite
heuristic, and its validity or safety can certainly be contested. On
the other hand, if we seek perfection in this area of judgment, we may
need to make the WAL format itself more robust.

In any case, since the majority of the feedback on this patch seems to
be negative, I am going to withdraw it if no supportive opinions
emerge during this commitfest.

The attached patch addresses the errors reported by CF-bot.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 933d10fa6c7b71e4684f5ba38e85177afaa56f58 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Tue, 7 Mar 2023 14:55:58 +0900
Subject: [PATCH v29] Make End-Of-Recovery error less scary

When any type of recovery ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records. Make
this message less scary, as "reached end of WAL". Instead, raise the
error level for other kinds of WAL failure to WARNING. To make sure
that the detection is correct, this patch checks whether all trailing
bytes in the same page are zeroed in that case.
---
 src/backend/access/transam/xlogreader.c   | 147 ++
 src/backend/access/transam/xlogrecovery.c |  96 ++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/bin/pg_waldump/t/001_basic.pl         |   5 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/035_recovery.pl       | 130 +++
 src/test/recovery/t/039_end_of_wal.pl     |  47 +--
 8 files changed, 383 insertions(+), 63 deletions(-)
 create mode 100644 src/test/recovery/t/035_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7190156f2f..94861969eb 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@
 static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 							 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bo
Re: Make mesage at end-of-recovery less scary.
At Mon, 22 Jan 2024 16:09:28 +1100, Peter Smith wrote in
> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.
>
> ==
> [1] https://commitfest.postgresql.org/46/2490/
> [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/2490

Thanks for noticing that. I will repost a new version.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Network failure may prevent promotion
At Sun, 31 Dec 2023 20:07:41 +0900 (JST), Kyotaro Horiguchi wrote in
> We've noticed that when walreceiver is waiting for a connection to
> complete, standby does not immediately respond to promotion
> requests. In PG14, upon receiving a promotion request, walreceiver
> terminates instantly, but in PG16, it waits for connection
> timeout. This behavior is attributed to commit 728f86fec65, where a
> part of libpqrcv_connect was simply replaced with a call to
> libpqsrc_connect_params. This behavior can be verified by simply
> dropping packets from the standby to the primary.

I apologize for the inconvenience, but I need to fix this behavior. To
continue this discussion, I'm providing a repro script here. With the
script, the standby is expected to promote immediately, emitting the
following log lines:

standby.log:
> 2024-01-18 16:25:22.245 JST [31849] LOG:  received promote request
> 2024-01-18 16:25:22.245 JST [31850] FATAL:  terminating walreceiver process due to administrator command
> 2024-01-18 16:25:22.246 JST [31849] LOG:  redo is not required
> 2024-01-18 16:25:22.246 JST [31849] LOG:  selected new timeline ID: 2
> 2024-01-18 16:25:22.274 JST [31849] LOG:  archive recovery complete
> 2024-01-18 16:25:22.275 JST [31847] LOG:  checkpoint starting: force
> 2024-01-18 16:25:22.277 JST [31846] LOG:  database system is ready to accept connections
> 2024-01-18 16:25:22.280 JST [31847] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.005 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/1548E98, redo lsn=0/1548E40
> 2024-01-18 16:25:22.356 JST [31846] LOG:  received immediate shutdown request
> 2024-01-18 16:25:22.361 JST [31846] LOG:  database system is shut down

After 728f86fec65 was introduced, promotion does not complete with the
same operation, as follows.
The patch attached to the previous mail restores the old behavior
shown above.

> 2024-01-18 16:47:53.314 JST [34515] LOG:  received promote request
> 2024-01-18 16:48:03.947 JST [34512] LOG:  received immediate shutdown request
> 2024-01-18 16:48:03.952 JST [34512] LOG:  database system is shut down

The attached script requires that sudo is executable. And there's
another point to note. The script attempts to establish a replication
connection to $primary_address:$primary_port. For the packet filter to
work, this must be a remote address that is accessible when no
packet-filter rule is in place, and the firewall-cmd rule needs to be
configured to block this connection. If an inaccessible IP address is
simply set instead, the process will fail immediately with a "No route
to host" error before the first packet is sent out, and it will not be
blocked as intended.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#! /bin/perl

use Cwd;

# This IP address must be a valid and accessible remote address,
# otherwise replication connection will immediately fail with 'No
# route to host', while we want to wait for replication connection to
# complete.
$primary_addr = '192.168.56.1';
$primary_port = 5432;

$fwzone = 'public';

$rootdir = cwd();
$standby_dir = "$rootdir/standby";
$standby_port = 5432;
$standby_logfile = "standby.log";

$rich_rule = "'rule family=\"ipv4\" destination address=\"$primary_addr\" port port=\"$primary_port\" protocol=\"tcp\" drop'";
$add_cmd = "sudo firewall-cmd --zone=$fwzone --add-rich-rule=$rich_rule";
$del_cmd = "sudo firewall-cmd --zone=$fwzone --remove-rich-rule=$rich_rule";

# Remove old entities
system('killall -9 postgres');
system("rm -rf $standby_dir $standby_logfile");

# Set up packet drop
system($add_cmd) == 0 or die "failed to exec \"$add_cmd\": $!\n";

# Set up a standby.
system("initdb -D $standby_dir -c primary_conninfo='host=$primary_addr port=$primary_port'");
system("touch $standby_dir/standby.signal");

# Start it.
system("pg_ctl start -D $standby_dir -o \"-p $standby_port\" -l standby.log");
sleep 1;

# Try promoting the standby, waiting for up to 10 seconds.
system("pg_ctl promote -t 10 -D $standby_dir");

# Stop servers
system("pg_ctl stop -m i -D $standby_dir");

# Remove the packet-drop setting
system($del_cmd) == 0 or die "failed to exec \"$del_cmd\": $!\n";
Re: initdb's -c option behaves wrong way?
Thank you for picking this up.

At Wed, 17 Jan 2024 23:47:41 +0100, Daniel Gustafsson wrote in
> > On 17 Jan 2024, at 21:33, Tom Lane wrote:
> >
> > I wrote:
> >> However ... I don't like the patch much. It seems to have left
> >> the code in a rather random state. Why, for example, didn't you
> >> keep all the code that constructs the "newline" value together?
> >
> > After thinking about it a bit more, I don't see why you didn't just
> > s/strncmp/strncasecmp/ and call it good. The messiness seems to be
> > a result of your choice to replace the GUC's case as shown in the
> > file with the case used on the command line, which is not better
> > IMO. We don't change our mind about the canonical spelling of a
> > GUC because somebody varied the case in a SET command.
>
> The original patch was preserving the case of the file throwing away
> the case from the commandline. The attached is a slimmed down version
> which only moves the assembly of newline to the different cases
> (replace vs. new) keeping the rest of the code intact. Keeping it in
> one place gets sort of messy too since it needs to use different
> values for a replacement and a new variable. Not sure if there is a
> cleaner approach?

Just to clarify, the current code breaks out after processing the
first matching line. I haven't changed that behavior.

The reason I moved the rewrite processing code out of the loop was
that I wanted to avoid adding more lines that are executed only once
into the loop. However, this comes in exchange for the additional
complexity of remembering the original spelling of the parameter name.
Personally, I believe separating the search and rewrite processing is
better, but I'm not particularly attached to the approach, so either
way is fine with me.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
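To make the trade-off concrete, here is a minimal standalone sketch (a hypothetical helper, not the actual initdb code) of the behavior Tom Lane's s/strncmp/strncasecmp/ suggestion aims at: match the parameter name case-insensitively, but keep the file's own spelling when rewriting the line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>			/* strncasecmp */

/*
 * Hypothetical sketch, not the actual initdb code: match a (possibly
 * commented-out) "name = value" line case-insensitively, but preserve
 * the file's spelling of the name and replace only the value.
 * Returns 1 when the line matched and was rewritten, 0 otherwise.
 */
static int
replace_guc_value(char *line, size_t linesize,
				  const char *guc_name, const char *new_value)
{
	const char *p = line;
	char		namebuf[64];
	size_t		namelen = strlen(guc_name);

	/* skip leading whitespace and an optional comment marker */
	while (*p == ' ' || *p == '\t')
		p++;
	if (*p == '#')
		p++;
	while (*p == ' ' || *p == '\t')
		p++;

	/* case-insensitive name comparison, per the suggestion */
	if (namelen >= sizeof(namebuf) ||
		strncasecmp(p, guc_name, namelen) != 0)
		return 0;

	/* remember the file's spelling before overwriting the buffer */
	memcpy(namebuf, p, namelen);
	namebuf[namelen] = '\0';

	snprintf(line, linesize, "%s = %s", namebuf, new_value);
	return 1;
}
```

Copying the matched name into a scratch buffer before the snprintf() is what "remembering the original spelling" amounts to; a real implementation would also have to verify that the match ends at a word boundary.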
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
At Wed, 17 Jan 2024 14:38:54 +0900, torikoshia wrote in
> Hi,
>
> Thanks for applying!
>
> > + errmsg_plural("%zd row were skipped due to data type
> > incompatibility",
>
> Sorry, I just noticed it, but 'were' should be 'was' here?
>
> >> BTW I'm thinking we should add a column to pg_stat_progress_copy that
> >> counts soft errors. I'll suggest this in another thread.
>
> Please do!
>
> I've started it here:
> https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff...@oss.nttdata.com

Switching topics, this commit (9e2d870119) adds the following help
message:

> "COPY { %s [ ( %s [, ...] ) ] | ( %s ) }\n"
> "TO { '%s' | PROGRAM '%s' | STDOUT }\n"
> ...
> "SAVE_ERROR_TO '%s'\n"
> ...
> _("location"),

On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which
indicate "immediately error out" and "just ignore the failure"
respectively, but these options hardly seem to denote a "location" and
appear more like an "action". I somewhat suspect that this parameter
name was initially conceived with the assumption that it would take
file names or similar values. I'm not sure if others will agree, but I
think the parameter name might not be the best choice. For instance,
considering the addition of a third value 'log', something like
on_error_action (error, ignore, log) would be more intuitively
understandable. What do you think?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
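To illustrate the "action, not location" framing, a tiny sketch of how an option along the lines of the proposed on_error_action could be parsed. All names here are hypothetical, invented for this example; this is not the actual COPY implementation.

```c
#include <assert.h>
#include <strings.h>			/* strcasecmp */

/*
 * Hypothetical enum for an "action"-style COPY option taking
 * error / ignore / log, as floated in the mail above.  Not the
 * actual PostgreSQL code.
 */
typedef enum CopyOnErrorAction
{
	COPY_ON_ERROR_ERROR,		/* immediately error out (default) */
	COPY_ON_ERROR_IGNORE,		/* silently skip rows that fail to convert */
	COPY_ON_ERROR_LOG,			/* skip, but log each failure */
	COPY_ON_ERROR_INVALID		/* unrecognized value */
} CopyOnErrorAction;

static CopyOnErrorAction
parse_on_error_action(const char *value)
{
	/* option values are compared case-insensitively in this sketch */
	if (strcasecmp(value, "error") == 0)
		return COPY_ON_ERROR_ERROR;
	if (strcasecmp(value, "ignore") == 0)
		return COPY_ON_ERROR_IGNORE;
	if (strcasecmp(value, "log") == 0)
		return COPY_ON_ERROR_LOG;
	return COPY_ON_ERROR_INVALID;
}
```

The point of the sketch is that each value names what the server should *do* on a soft error, so adding a new behavior later means adding one enum member and one comparison, with no "location" semantics to stretch.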
Re: Make mesage at end-of-recovery less scary.
Thank you for the comments.

At Fri, 12 Jan 2024 15:03:26 +0300, Aleksander Alekseev wrote in
> ```
> +p = (char *) record;
> +pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
> +
> +while (p < pe && *p == 0)
> +p++;
> +
> +if (p == pe)
> ```
>
> Just as a random thought: perhaps we should make this a separate
> function, as a part of src/port/. It seems to me that this code could
> benefit from using vector instructions some day, similarly to
> memcmp(), memset() etc. Surprisingly there doesn't seem to be a
> standard C function for this. Alternatively one could argue that one
> cycle doesn't make much code to reuse and that the C compiler will
> place SIMD instructions for us. However a counter-counter argument
> would be that we could use a macro or even better an inline function
> and have the same effect except getting a slightly more readable code.

Creating a function with a name like memcmp_byte() should be
straightforward, but implementing it with SIMD right away seems a bit
challenging. Similar operations are already being performed elsewhere
in the code, probably within the stats collector, where memcmp is used
with a statically allocated area that's filled with zeros. If we can
achieve performance equivalent to memcmp with this new function, then
it definitely seems worth pursuing.

> ```
> - * This is just a convenience subroutine to avoid duplicated code in
> + * This is just a convenience subroutine to avoid duplicate code in
> ```
>
> This change doesn't seem to be related to the patch. Personally I
> don't mind it though.

Ah, I'm sorry. That was a change I made at the last moment, mistakenly
thinking the comment was text I had written myself.

> All in all I find v28 somewhat scary. It does much more than "making
> one message less scary" as it was initially intended and what bugged
> me personally, and accordingly touches many more places including
> xlogreader.c, xlogrecovery.c, etc.
>
> Particularly I have mixed feelings about this:
>
> ```
> +/*
> + * Consider it as end-of-WAL if all subsequent bytes of this page
> + * are zero. We don't bother checking the subsequent pages since
> + * they are not zeroed in the case of recycled segments.
> + */
> ```
>
> If I understand correctly, if somehow several FS blocks end up being
> zeroed (due to OS bug, bit rot, restoring from a corrupted for
> whatever reason backup, hardware failures, ...) there is non-zero
> chance that PG will interpret this as a normal situation. To my
> knowledge this is not what we typically do - typically PG would report
> an error and ask a human to figure out what happened. Of course the
> possibility of such a scenario is small, but I don't think that as
> DBMS developers we can ignore it.

For now, let me explain the basis for this patch. The fundamental
issue is that these warnings that always appear are, in practice, not
a problem in almost all cases. Some of those who encounter them for
the first time may feel uneasy and reach out with inquiries. On the
other hand, those familiar with these warnings tend to ignore them and
only pay attention to details when actual issues arise. Therefore, the
intention of this patch is to label them as "no issue" unless a
problem is blatantly evident, in order to prevent unnecessary concern.

> Does anyone agree or maybe I'm making things up?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
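For reference, the zero-scan loop quoted in this thread could be factored into a small helper along the following lines. memcmp_byte() is only the name floated in the mail, not an existing PostgreSQL function; the self-overlapping memcmp is one way to avoid the statically allocated all-zeros reference buffer mentioned above, since it lets the libc memcmp implementation (often vectorized) do the scanning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if all 'len' bytes at 'p' equal the byte value 'c'.
 *
 * After checking the first byte, comparing the buffer against itself
 * shifted by one byte verifies that every byte equals its successor,
 * i.e. that all bytes are identical.  Overlapping arguments are fine
 * for memcmp since both regions are only read.
 */
static bool
memcmp_byte(const void *p, int c, size_t len)
{
	const unsigned char *s = p;

	if (len == 0)
		return true;
	if (*s != (unsigned char) c)
		return false;
	return memcmp(s, s + 1, len - 1) == 0;
}
```

With such a helper, the loop in the patch would reduce to something like `memcmp_byte(p, 0, pe - p)`; whether this actually matches memcmp's performance would need measuring, as the mail notes.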
Re: Error "initial slot snapshot too large" in create replication slot
At Fri, 12 Jan 2024 11:28:09 -0500, Robert Haas wrote in
> However, I wonder whether this whole area is in need of a bigger
> rethink. There seem to be a number of situations in which the split
> into xip and subxip arrays is not very convenient, and also some
> situations where it's quite important. Sometimes we want to record
> what's committed, and sometimes what isn't. It's all a bit messy and
> inconsistent. The necessity of limiting snapshot size is annoying,
> too. I have no real idea what can be done about all of this, but what
> strikes me is that the current system has grown up incrementally: we
> started with a data structure designed for the original use case, and
> now by gradually adding new use cases things have gotten complicated.
> If we were designing things over from scratch, maybe we'd do it
> differently and end up with something less messy. And maybe someone
> can imagine a redesign that is likewise less messy.
>
> But on the other hand, maybe not. Perhaps we can't really do any
> better than what we have. Then the question becomes whether this case
> is important enough to justify additional code complexity. I don't
> think I've personally seen users run into this problem so I have no
> special reason to think that it's important, but if it's causing
> issues for other people then maybe it is.

Thank you for the deep insights. I have understood your points. As I
can't think of any further simple modifications along this line, I
will withdraw this CF entry. At the moment, I also lack a fundamental,
comprehensive solution, but if I or anyone else should come up with
one in the future, I believe it would be worth a separate discussion.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: In-placre persistance change of a relation
At Tue, 9 Jan 2024 15:07:20 +0530, vignesh C wrote in
> CFBot shows compilation issues at [1] with:

Thanks! The reason for those errors was that I didn't consider Meson
at the time. Additionally, the signature change of reindex_index()
caused the build failure. I fixed both issues.

While addressing these issues, I modified the simpleundolog module to
honor wal_sync_method. Previously, the sync method for undo logs was
determined independently, separate from xlog.c. However, I'm still not
satisfied with the method for handling PG_O_DIRECT. In this version, I
have added the changes that enable the use of wal_sync_method outside
of xlog.c as the first part of the patchset.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 40749357f24adf89dc79db9b34f5c053288489bb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v31 1/3] Export wal_sync_method related functions

Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
 src/backend/access/transam/xlog.c | 73 +--
 src/include/access/xlog.h         |  2 +
 2 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..c5f51849ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8403,21 +8403,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
 	}
 }
 
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+	return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
 /*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
  *
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
  */
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
 {
-	char	   *msg = NULL;
+	const char *msg = NULL;
 	instr_time	start;
 
-	Assert(tli != 0);
-
 	/*
 	 * Quick exit if fsync is disabled or write() has already synced the WAL
 	 * file.
@@ -8425,7 +8433,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	if (!enableFsync ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
-		return;
+		return NULL;
 
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
@@ -8460,19 +8468,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 			break;
 	}
 
-	/* PANIC if failed to fsync */
-	if (msg)
-	{
-		char		xlogfname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		XLogFileName(xlogfname, tli, segno, wal_segment_size);
-		errno = save_errno;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg(msg, xlogfname)));
-	}
-
 	pgstat_report_wait_end();
 
 	/*
@@ -8486,7 +8481,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
 	}
 
-	PendingWalStats.wal_sync++;
+	if (msg != NULL)
+		PendingWalStats.wal_sync++;
+
+	return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+	const char *msg;
+
+	Assert(tli != 0);
+
+	msg = XLogFsyncFile(fd);
+
+	/* PANIC if failed to fsync */
+	if (msg)
+	{
+		char		xlogfname[MAXFNAMELEN];
+		int			save_errno = errno;
+
+		XLogFileName(xlogfname, tli, segno, wal_segment_size);
+		errno = save_errno;
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg(msg, xlogfname)));
+	}
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..2a0d65b537 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
+extern int	XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
 
 extern bool RecoveryInProgress(void);
-- 
2.39.3

>From 5c120b94c407b971485ab52133399305e5e81a88 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v31 2/3] Introduce undo log implementation

This patch adds a simple implementation of the UNDO log feature.
---
 src/backend/access/transam/Makefile    | 1 +
 src/backend/access/transam/meson.build | 1 +
 src/backend/access/t
Re: initdb --no-locale=C doesn't work as specified when the environment is not C
At Wed, 10 Jan 2024 18:16:03 -0500, Tom Lane wrote in
> Kyotaro Horiguchi writes:
> > It seems somewhat intentional that only lc_messages references the
> > environment at boot time. On the other hand, previously, in the
> > absence of a specified locale, initdb would embed the environmental
> > value in the configuration file, as it seems to be documented. Given
> > that initdb is always used for cluster creation, it's unlikely that
> > systems depend on this boot-time default for their operation.
>
> Yeah, after further reflection there doesn't seem to be a lot of value
> in leaving these entries commented-out, even in the cases where that's
> technically correct. Let's just go back to the old behavior of always
> uncommenting them; that stood for years without complaints. So I
> committed your latest patch as-is.

I'm glad you understand. Thank you for committing.

> > By the way, the lines around lc_* in the sample file seem to have
> > somewhat inconsistent indentations. Wouldn't it be preferable to fix
> > this? (The attached doesn't do that.)
>
> They look all right if you assume the tab width is 8, which seems to
> be what is used elsewhere in the file. I think there's been some
> prior discussion about whether to ban use of tabs at all in these
> sample files, so as to reduce confusion about how wide the tabs are.
> But I'm not touching that question today.

Ah, I see; understood.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center