Re: In-place persistence change of a relation

2024-11-04 Thread Kyotaro Horiguchi
Thank you for the quick comments.

At Thu, 31 Oct 2024 23:24:36 +0200, Heikki Linnakangas  wrote 
in 
> On 31/10/2024 10:01, Kyotaro Horiguchi wrote:
> > After some delays, here’s the new version. In this update, UNDO logs
> > are WAL-logged and processed in memory under most conditions. During
> > checkpoints, they’re flushed to files, which are then read when a
> > specific XID’s UNDO log is accessed for the first time during
> > recovery.
> > The biggest changes are in patches 0001 through 0004 (equivalent to
> > the previous 0001-0002). After that, there aren’t any major
> > changes. Since this update involves removing some existing features,
> > I’ve split these parts into multiple smaller identity transformations
> > to make them clearer.
> > As for changes beyond that, the main one is lifting the previous
> > restriction on PREPARE for transactions after a persistence
> > change. This was made possible because, with the shift to in-memory
> > processing of UNDO logs, commit-time crash recovery detection is now
> > simpler. Additional changes include completely removing the
> > abort-handling portion from the pendingDeletes mechanism (0008-0010).
> 
> In this patch version, the undo log is kept in dynamic shared
> memory. It can grow indefinitely. On a checkpoint, it's flushed to
> disk.
> 
> If I'm reading it correctly, the undo records are kept in the DSA area
> even after it's flushed to disk. That's not necessary; system never
> needs to read the undo log unless there's a crash, so there's no need

The system also needs to read the undo log whenever additional undo
logs are added. In this version, I’ve moved all abort-time
pendingDeletes data entirely to the undo logs. In other words, the DSA
area is expanded in exchange for reducing the pendingDelete list. As a
result, there is minimal impact on overall memory usage. Additionally,
the current flushing code is straightforward because it relies on the
in-memory primary image. If we drop the in-memory image during flush,
we might need exclusive locking or possibly some ordering
techniques. Anyway, I’ll consider that approach.

> to keep it in memory after it's been flushed to disk. That's true
> today; we could start relying on the undo log to clean up on abort
> even when there's no crash, but I think it's a good design to not do
> that and rely on backend-private state for non-crash transaction
> abort.

Hmm. Sounds reasonable. In the next version, I'll revert the changes
to pendingDeletes and adjust it to just discard the log on regular
aborts.

> I'd suggest doing this the other way 'round. Let's treat the on-disk
> representation as the primary representation, not the in-memory
> one. Let's use a small fixed-size shared memory area just as a write
> buffer to hold the dirty undo log entries that haven't been written to
> disk yet. Most transactions are short, so most undo log entries never
> need to be flushed to disk, but I think it'll be simpler to think of
> it that way. On checkpoint, flush all the buffered dirty entries from
> memory to disk and clear the buffer. Also do that if the buffer fills
> up.

I'd like to clarify the specific concept of these fixed-length memory
slots. Is it something like this: each slot is keyed by an XID,
followed by an in-file offset and a series of, say, 1024-byte areas?
When writing a log for a new XID, if no slot is available, the backend
would immediately evict the slot with the smallest XID to disk to free
up space. If an existing slot runs out of space while writing new
logs, the backend would flush it immediately and continue using the
area. Is this correct? Additionally, if multiple processes try to
write to a single slot, stricter locking might be needed. For example,
if a slot is evicted by a backend other than its user, exclusive
control might be required during the file write. Is there any
effective way to avoid such locking? In the current patch set, I’m
avoiding any impact on the backend from checkpointer file writes by
treating the in-memory image as primary. And regarding the number of
these areas… although I’m not entirely sure, it seems unlikely we’d
have hundreds of sessions simultaneously creating tables, so would it
make sense to make this configurable, with a default of around 32
areas?
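To make the question concrete, here is a minimal sketch of the slot
layout I have in mind (all structure and function names are invented for
illustration, not taken from the patch; locking and the actual disk
write are omitted):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UNDO_NSLOTS   32      /* hypothetical default discussed above */
#define UNDO_SLOT_SZ  1024    /* per-slot buffer area */

typedef struct UndoBufferSlot
{
	uint32_t	xid;           /* owning transaction, 0 = free */
	uint64_t	file_offset;   /* offset of buf[0] within the on-disk file */
	uint32_t	used;          /* bytes of buf currently filled */
	char		buf[UNDO_SLOT_SZ];
} UndoBufferSlot;

static UndoBufferSlot slots[UNDO_NSLOTS];

/* Find the slot for xid, or claim one, evicting the smallest XID if full. */
static int
undo_slot_for_xid(uint32_t xid)
{
	int			victim = 0;

	for (int i = 0; i < UNDO_NSLOTS; i++)
	{
		if (slots[i].xid == xid)
			return i;
		if (slots[i].xid == 0)
			victim = i;			/* prefer a free slot */
		else if (slots[victim].xid != 0 && slots[i].xid < slots[victim].xid)
			victim = i;			/* otherwise evict the smallest XID */
	}
	/* a real implementation would flush slots[victim] to its file here */
	memset(&slots[victim], 0, sizeof(UndoBufferSlot));
	slots[victim].xid = xid;
	return victim;
}
```

This is only meant to pin down the shape of the proposal; the locking
question above is exactly about what must happen around the flush step.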

> A high-level overview comment of the on-disk format would be nice. If
> I understand correctly, there's a magic constant at the beginning of
> each undo file, followed by UndoLogRecords. There are no other file
> headers and no page structure within the file.

Right.
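For illustration, that layout could be sketched as follows, with a
per-file XID included for cross-checking (all names, field choices, and
widths here are invented; the actual record layout in the patch may
differ):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative magic constant; not the value used by the patch. */
#define UNDOLOG_FILE_MAGIC 0x55AA4C01u

typedef struct UndoFileHeader
{
	uint32_t	magic;			/* UNDOLOG_FILE_MAGIC */
	uint64_t	xid;			/* full XID, for cross-checking */
} UndoFileHeader;

typedef struct UndoLogRecordSketch
{
	uint32_t	len;			/* total record length, including header */
	uint32_t	crc;			/* per-record CRC, as described above */
	/* payload of (len - 8) bytes follows; no page structure */
} UndoLogRecordSketch;

/* Validate the header of an undo file image held in memory. */
static int
undo_file_header_ok(const char *image, size_t size, uint64_t expected_xid)
{
	UndoFileHeader hdr;

	if (size < sizeof(hdr))
		return 0;
	memcpy(&hdr, image, sizeof(hdr));
	return hdr.magic == UNDOLOG_FILE_MAGIC && hdr.xid == expected_xid;
}
```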

> That format seems reasonable. For cross-checking, maybe add the XID to
> the file header too. There is a separate CRC value on each record,
> which is

Re: In-place persistence change of a relation

2024-09-04 Thread Kyotaro Horiguchi
At Mon, 2 Sep 2024 09:30:20 +0900, Michael Paquier  wrote 
in 
> On Sun, Sep 01, 2024 at 10:15:00PM +0300, Heikki Linnakangas wrote:
> > I wonder if the twophase state files and undo log files should be merged
> > into one file. They're similar in many ways: there's one file per
> > transaction, named using the XID. I haven't thought this fully through, just
> > a thought..
> 
> Hmm.  It could be possible to extract some of this knowledge out of
> twophase.c and design some APIs that could be used for both, but would
> that be really necessary?  The 2PC data and the LSNs used by the files
> to check if things are replayed or on disk rely on
> GlobalTransactionData that has its own idea of things and timings at
> recovery.

I'm not sure, but my reading is that Heikki was only suggesting sharing
the file format and its in/out functions, if the two file formats turn
out to overlap sufficiently.

> Or perhaps your point is actually to do that and add one layer for the
> file handlings and their flush timings?  I am not sure, TBH, what this
> thread is trying to fix is complicated enough that it may be better to
> live with two different code paths.  But perhaps my gut feeling is
> just wrong reading your paragraph.

I think that reasoning is valid, so I'm not in a hurry to do this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: In-place persistence change of a relation

2024-09-04 Thread Kyotaro Horiguchi
Hello.

Thank you for the response.

At Sun, 1 Sep 2024 22:15:00 +0300, Heikki Linnakangas  wrote 
in 
> On 31/08/2024 19:09, Kyotaro Horiguchi wrote:
> > - UNDO log(0002): This handles file deletion during transaction aborts,
> >which was previously managed, in part, by the commit XLOG record at
> >the end of a transaction.
> > - Prevent orphan files after a crash (0005): This is another use-case
> >of the UNDO log system.
> 
> Nice, I'm very excited if we can fix that long-standing issue! I'll
> try to review this properly later, but at a quick 5 minute glance, one
> thing caught my eye:
> 
> This requires fsync()ing the per-xid undo log file every time a
> relation is created. I fear that can be a pretty big performance hit
> for workloads that repeatedly create and drop small tables. Especially

I initially thought that one additional file manipulation during file
creation wouldn't be an issue. However, the created storage file isn't
being synced, so your concern seems valid.

> if they're otherwise running with synchronous_commit=off. Instead of
> flushing the undo log file after every write, I'd suggest WAL-logging
> the undo log like regular relations and SLRUs. So before writing the
> entry to the undo log, WAL-log it. And with a little more effort, you
> could postpone creating the files altogether until a checkpoint
> happens, similar to how twophase state files are checkpointed
> nowadays.

I thought that an UNDO log file not flushed before the last checkpoint
might not survive a system crash. However, including UNDO files in the
checkpointing process resolves that concern. Thank you for the
suggestion.
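To restate the suggested protocol in code form, here is a toy model (the
buffers and wal_log() are stand-ins, not PostgreSQL APIs): the undo
entry is WAL-logged before it is buffered, and no per-entry fsync() is
issued; materializing the file is left to checkpoint time.

```c
#include <assert.h>
#include <string.h>

#define MAXENTRIES 16

/* Toy stand-ins for the WAL and the in-memory undo buffer. */
static const char *wal[MAXENTRIES];
static int	wal_len;
static const char *undo_buf[MAXENTRIES];
static int	undo_len;

static void
wal_log(const char *entry)
{
	wal[wal_len++] = entry;		/* durable once WAL is flushed at commit */
}

static void
undolog_append(const char *entry)
{
	wal_log(entry);				/* WAL first: recoverable after a crash */
	undo_buf[undo_len++] = entry;	/* then the in-memory undo buffer */
	/* no per-entry fsync(); files are written out at checkpoint time */
}
```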

> I wonder if the twophase state files and undo log files should be
> merged into one file. They're similar in many ways: there's one file
> per transaction, named using the XID. I haven't thought this fully
> through, just a thought..

To be precise, UNDO log files are created per subtransaction, unlike
twophase files. It might be possible if we allow the twophase files
(as they are currently named) to be overwritten or modified at every
subcommit. If ULOG contents are WAL-logged, these two things will
become even more similar. However, I'm not planning to include that in
the next version for now.

> > +static void
> > +undolog_set_filename(char *buf, TransactionId xid)
> > +{
> > +   snprintf(buf, MAXPGPATH, "%s/%08x", SIMPLE_UNDOLOG_DIR, xid);
> > +}
> 
> I'd suggest using FullTransactionId. Doesn't matter much, but seems
> like a good future-proofing.

Agreed. Will fix it in the next version.
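A sketch of what the fixed function might look like (FullTransactionId
is modeled as a plain 64-bit value here, and the file-name width and
format are illustrative only):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAXPGPATH 1024
#define SIMPLE_UNDOLOG_DIR "pg_undo"

/* Stand-in for PostgreSQL's FullTransactionId (64-bit epoch+xid). */
typedef struct FullTransactionId
{
	uint64_t	value;
} FullTransactionId;

/*
 * Key the undo file name by the full 64-bit transaction ID so that it
 * remains unique across XID epoch wraparound.
 */
static void
undolog_set_filename(char *buf, FullTransactionId fxid)
{
	snprintf(buf, MAXPGPATH, "%s/%016llx", SIMPLE_UNDOLOG_DIR,
			 (unsigned long long) fxid.value);
}
```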

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add callback in pgstats for backend initialization

2024-09-04 Thread Kyotaro Horiguchi
At Wed, 4 Sep 2024 15:04:09 +0900, Michael Paquier  wrote 
in 
> On Wed, Sep 04, 2024 at 02:15:43PM +0900, Kyotaro Horiguchi wrote:
> > The name "init_backend" makes it sound like the function initializes
> > the backend. backend_init might be a better choice, but I'm not sure.
> 
> We (kind of) tend to prefer $verb_$thing-or-action_cb for the name of
> the callbacks, which is why I chose this naming.  If you feel strongly
> about "backend_init_cb", that's also fine by me; you are the original
> author of this code area with Andres.

More accurately, I believe this applies when the sentence follows a
verb-object structure. In this case, the function’s meaning is “pgstat
initialization on backend,” which seems somewhat different from the
policy you mentioned. However, it could also be interpreted as
“initialize backend status related to pgstat.” Either way, I’m okay
with the current name if you are, based on the discussion above.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add callbacks for fixed-numbered stats flush in pgstats

2024-09-04 Thread Kyotaro Horiguchi
At Wed, 4 Sep 2024 15:12:37 +0900, Michael Paquier  wrote 
in 
> On Wed, Sep 04, 2024 at 05:28:56AM +, Bertrand Drouvot wrote:
> > On Wed, Sep 04, 2024 at 02:05:46PM +0900, Kyotaro Horiguchi wrote:
> >> The generalization sounds good to me, and hiding the private flags in
> >> private places also seems good.
> > 
> > +1 on the idea.
> 
> Thanks for the feedback.
> 
> >> Regarding pgstat_io_flush_cb, I think it no longer needs to check the
> >> have_iostats variable, and that check should be moved to the new
> >> pgstat_flush_io(). The same applies to pgstat_wal_flush_cb() (and
> >> pgstat_flush_wal()). As for pgstat_slru_flush_cb, it simply doesn't
> >> need the check anymore.
> > 
> > Agree. Another option could be to add only one callback (the flush_fixed_cb one)
> > and get rid of the have_fixed_pending_cb one. Then add an extra parameter to the
> > flush_fixed_cb that would only check if there is pending stats (without
> > flushing them). I think those 2 callbacks are highly related that's why I
> > think we could "merge" them, thoughts?
> 
> I would still value the shortcut that we can use if there is no
> activity to avoid the clock check with GetCurrentTimestamp(), though,
> which is why I'd rather stick with two callbacks as that can lead to a
> much cheaper path.

Yeah, I was wrong on this point.

> Anyway, I am not sure to follow your point about removing the boolean
> checks in the flush callbacks.  It's surely always a better to never
> call LWLockAcquire() or LWLockConditionalAcquire() if we don't have
> to, and we would go through the whole flush dance as long as we detect
> some stats activity in at least one stats kind.  I mean, that's cheap
> enough to keep around.

I doubt that overprotection is always better, but in this case, it's
not overprotection at all. The flush callbacks are called
unconditionally once we decide to flush anything. Sorry for the noise.
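For reference, the two-callback shape being discussed can be sketched as
follows (names invented; the point is that the cheap have-pending check
lets the caller skip the GetCurrentTimestamp() call and the whole flush
path when nothing changed):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct PgStatKindSketch
{
	bool		(*have_fixed_pending_cb) (void);
	void		(*flush_fixed_cb) (bool nowait);
} PgStatKindSketch;

static bool io_pending;
static int	flush_calls;

static bool io_have_pending(void) { return io_pending; }
static void io_flush(bool nowait) { (void) nowait; flush_calls++; io_pending = false; }

static PgStatKindSketch kinds[] = {
	{io_have_pending, io_flush},
};

/* Returns true if anything was flushed. */
static bool
pgstat_report_fixed_sketch(void)
{
	bool		any = false;

	for (size_t i = 0; i < sizeof(kinds) / sizeof(kinds[0]); i++)
		any |= kinds[i].have_fixed_pending_cb();

	if (!any)
		return false;			/* shortcut: no clock check, no locks */

	/* once we decide to flush, every flush callback runs unconditionally */
	for (size_t i = 0; i < sizeof(kinds) / sizeof(kinds[0]); i++)
		kinds[i].flush_fixed_cb(true);
	return true;
}
```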

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RecoveryTargetAction is left out in xlog_internal.h

2024-09-04 Thread Kyotaro Horiguchi
Hello,

While reviewing a patch, I noticed that enum RecoveryTargetAction is
still in xlog_internal.h, even though it seems like it should be in
xlogrecovery.h. Commit 70e81861fa separated out xlogrecovery.c/h and
moved several enums related to recovery targets to
xlogrecovery.h. However, it appears that enum RecoveryTargetAction was
inadvertently left behind in that commit.

Please find the attached patch, which addresses this oversight.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 4c938f473164d75761001b31fa85e5dc215fbd9a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 4 Sep 2024 17:12:45 +0900
Subject: [PATCH] Move enum RecoveryTargetAction to xlogrecovery.h

Commit 70e81861fa split out xlogrecovery.c/h and moved some enums
related to recovery targets to xlogrecovery.h. However, it seems that
the enum RecoveryTargetAction was inadvertently left out by that
commit. This commit moves it to xlogrecovery.h for consistency.
---
 src/include/access/xlog_internal.h | 10 --
 src/include/access/xlogrecovery.h  | 10 ++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index e5cdba0584..9ba01ec20e 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -316,16 +316,6 @@ typedef struct XLogRecData
 	uint32		len;			/* length of rmgr data to include */
 } XLogRecData;
 
-/*
- * Recovery target action.
- */
-typedef enum
-{
-	RECOVERY_TARGET_ACTION_PAUSE,
-	RECOVERY_TARGET_ACTION_PROMOTE,
-	RECOVERY_TARGET_ACTION_SHUTDOWN,
-}			RecoveryTargetAction;
-
 struct LogicalDecodingContext;
 struct XLogRecordBuffer;
 
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..c09f84765b 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -40,6 +40,16 @@ typedef enum
 	RECOVERY_TARGET_TIMELINE_NUMERIC,
 } RecoveryTargetTimeLineGoal;
 
+/*
+ * Recovery target action.
+ */
+typedef enum
+{
+	RECOVERY_TARGET_ACTION_PAUSE,
+	RECOVERY_TARGET_ACTION_PROMOTE,
+	RECOVERY_TARGET_ACTION_SHUTDOWN,
+}			RecoveryTargetAction;
+
 /* Recovery pause states */
 typedef enum RecoveryPauseState
 {
-- 
2.43.5



Re: Use XLOG_CONTROL_FILE macro everywhere?

2024-09-04 Thread Kyotaro Horiguchi
At Tue, 3 Sep 2024 12:02:26 +0300, "Anton A. Melnikov" 
 wrote in 
> In v2 removed XLOG_CONTROL_FILE from args and used it directly in the
> string.
> IMHO this makes the code more readable and will output the correct
> text if the macro changes.

Yeah, I had this in my mind. Looks good to me.

> 3)
..
> Maybe include/access/xlogdefs.h would be a good place for this?
> In v2 moved some definitions from xlog_internal to xlogdefs
> and removed including xlog_internal.h from the postmaster.c.
> Please, take a look on it.

The change can help avoid disrupting existing users of the
macro. However, the file is documented as follows:

>  * Postgres write-ahead log manager record pointer and
>  * timeline number definitions

We could modify the file definition, but I'm not sure if that's the
best approach. Instead, I'd like to propose separating the file and
path-related definitions from xlog_internal.h, as shown in the
attached first patch. This change would allow some modules to include
files without unnecessary details.

The second file is your patch, adjusted based on the first patch.

I’d appreciate hearing from others on whether they find the first
patch worthwhile. If it’s not considered worthwhile, then I believe
having postmaster include xlog_internal.h would be the best approach.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 22b71830ff601717d01dc66be95bafb930b1a0ea Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 4 Sep 2024 16:15:16 +0900
Subject: [PATCH 1/2] Split file-related parts from xlog_internal.h

Some modules that require XLOG file manipulation had to include
xlog_internal.h, which contains excessively detailed internal
information. This commit separates the file and file path-related
components into xlogfilepaths.h, thereby avoiding the exposure of
unrelated details of the XLOG module.
---
 src/backend/access/transam/timeline.c |   1 -
 src/backend/postmaster/pgarch.c   |   1 -
 src/backend/utils/adt/genfile.c   |   1 -
 src/bin/initdb/initdb.c   |   1 +
 src/bin/pg_archivecleanup/pg_archivecleanup.c |   2 +-
 src/bin/pg_basebackup/pg_basebackup.c |   4 +-
 src/bin/pg_basebackup/pg_receivewal.c |   1 +
 src/bin/pg_basebackup/pg_recvlogical.c|   1 -
 src/bin/pg_basebackup/receivelog.c|   2 +-
 src/bin/pg_basebackup/streamutil.c|   2 +-
 src/bin/pg_controldata/pg_controldata.c   |   1 -
 src/bin/pg_rewind/pg_rewind.c |   2 +-
 src/include/access/xlog.h |  13 +-
 src/include/access/xlog_internal.h| 178 +-
 src/include/access/xlogfilepaths.h| 219 ++
 15 files changed, 230 insertions(+), 199 deletions(-)
 create mode 100644 src/include/access/xlogfilepaths.h

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 146751ae37..18df8fd326 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,6 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
-#include "access/xlogdefs.h"
 #include "pgstat.h"
 #include "storage/fd.h"
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 02f91431f5..4971cd76ff 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -30,7 +30,6 @@
 #include 
 
 #include "access/xlog.h"
-#include "access/xlog_internal.h"
 #include "archive/archive_module.h"
 #include "archive/shell_archive.h"
 #include "lib/binaryheap.h"
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 24b95c32b7..17be1756e2 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -21,7 +21,6 @@
 #include 
 
 #include "access/htup_details.h"
-#include "access/xlog_internal.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_tablespace_d.h"
 #include "catalog/pg_type.h"
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f00718a015..95ec8d9c92 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -65,6 +65,7 @@
 #endif
 
 #include "access/xlog_internal.h"
+#include "lib/stringinfo.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
 #include "catalog/pg_collation_d.h"
diff --git a/src/bin/pg_archivecleanup/pg_archivecleanup.c b/src/bin/pg_archivecleanup/pg_archivecleanup.c
index 5a124385b7..f6b33de32a 100644
--- a/src/bin/pg_archivecleanup/pg_archivecleanup.c
+++ b/src/bin/pg_archivecleanup/pg_archivecleanup.c
@@ -15,7 +15,7 @@
 #include 
 #include 
 
-#include "access/xlog_inte

Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description

2024-09-03 Thread Kyotaro Horiguchi
At Tue, 3 Sep 2024 10:43:14 +0530, Amit Kapila  wrote 
in 
> On Mon, Sep 2, 2024 at 9:14 AM shveta malik  wrote:
> >
> > On Mon, Sep 2, 2024 at 5:47 AM Peter Smith  wrote:
> > > 
> > >
> > > To summarize, the current description wrongly describes the field as a
> > > time duration:
> > > "The time since the slot has become inactive."
> > >
> > > I suggest replacing it with:
> > > "The slot has been inactive since this time."
> > >
> >
> > +1 for the change. If I had read the document without knowing about
> > the patch, I too would have interpreted it as a duration.
> >
> 
> The suggested change looks good to me as well. I'll wait for a day or
> two before pushing to see if anyone thinks otherwise.

If possible, I'd prefer to use "the time" as the subject. For example,
would "The time this slot was inactivated" be acceptable? However,
this loses the sense of continuation up to that point, so if that's
crucial, the current proposal might be better.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Add callback in pgstats for backend initialization

2024-09-03 Thread Kyotaro Horiguchi
At Tue, 3 Sep 2024 05:00:54 +, Bertrand Drouvot 
 wrote in 
> Hi,
> 
> On Tue, Sep 03, 2024 at 10:52:20AM +0900, Michael Paquier wrote:
> > Hi all,
> > 
> > Currently, the backend-level initialization of pgstats happens in
> > pgstat_initialize(), where we are using a shortcut for the WAL stats,
> > with pgstat_init_wal().
> > 
> > I'd like to propose to add a callback for that, so as in-core or
> > custom pgstats kinds can assign custom actions when initializing a
> > backend.  The use-case I had for this one are pretty close to what we
> > do for WAL, where we would rely on a difference of state depending on
> > what may have existed before reaching the initialization path.  So
> > this can offer more precision.  Another case, perhaps less
> > interesting, is to be able to initialize some static backend state.
> 
> I think the proposal makes sense and I can see the use cases too, so +1.

+1, too.

The name "init_backend" makes it sound like the function initializes
the backend. backend_init might be a better choice, but I'm not sure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Improving the latch handling between logical replication launcher and worker processes.

2024-09-03 Thread Kyotaro Horiguchi
At Tue, 3 Sep 2024 09:10:07 +0530, vignesh C  wrote in 
> The attached v2 version patch has the changes for the same.

Sorry for jumping in at this point. I've just reviewed the latest
patch (v2), and the frequent Own/Disown-Latch operations caught my
attention. Additionally, handling multiple concurrently operating
trigger sources with nested latch waits seems bug-prone, which I’d
prefer to avoid from both a readability and safety perspective.

With that in mind, I’d like to suggest an alternative approach. I may
not be fully aware of all the previous discussions, so apologies if
this idea has already been considered and dismissed.

Currently, WaitForReplicationWorkerAttach() and
logicalrep_worker_stop_internal() wait on a latch after verifying the
desired state. This ensures that even if there are spurious or missed
wakeups, they won't cause issues. In contrast, ApplyLauncherMain()
enters a latch wait without checking the desired state
first. Consequently, if another process sets the latch to wake up the
main loop while the former two functions are waiting, that wakeup
could be missed. If my understanding is correct, the problem lies in
ApplyLauncherMain() not checking the expected state before beginning
to wait on the latch. There is no issue with waiting if the state
hasn't been satisfied yet.

So, I propose that ApplyLauncherMain() should check the condition that
triggers a main loop wakeup before calling WaitLatch(). To do this, we
could add a flag in LogicalRepCtxStruct to signal that the main loop
has immediate tasks to handle. ApplyLauncherWakeup() would set this
flag before sending SIGUSR1. In turn, ApplyLauncherMain() would check
this flag before calling WaitLatch() and skip the WaitLatch() call if
the flag is set.

I think this approach could solve the issue without adding
complexity. What do you think?
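A minimal model of the proposal (LogicalRepCtxStruct and the function
names are stand-ins; only the pending-wakeup flag is the new part):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct LogicalRepCtxSketch
{
	volatile bool wakeup_pending;	/* proposed new flag */
} LogicalRepCtxSketch;

static LogicalRepCtxSketch LogicalRepCtx;

/* Called by workers: set the flag first, then signal (SIGUSR1 in reality). */
static void
ApplyLauncherWakeupSketch(void)
{
	LogicalRepCtx.wakeup_pending = true;
	/* kill(LauncherPid, SIGUSR1) would follow here */
}

/*
 * Called by the launcher just before WaitLatch(): if a wakeup arrived
 * after the last pass of the main loop, consume it and skip the wait,
 * so the request cannot be lost.
 */
static bool
ApplyLauncherSkipWaitSketch(void)
{
	if (LogicalRepCtx.wakeup_pending)
	{
		LogicalRepCtx.wakeup_pending = false;
		return true;			/* skip WaitLatch() this time */
	}
	return false;				/* nothing pending; safe to wait */
}
```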

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: per backend I/O statistics

2024-09-03 Thread Kyotaro Horiguchi
At Tue, 03 Sep 2024 15:37:49 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> When I first looked at this patch, my initial thought was whether we
> should let these stats stay "fixed." The reason why the current
> PGSTAT_KIND_IO is fixed is that there is only one global statistics
> storage for the entire database. If we have stats for a flexible
> number of backends, it would need to be non-fixed, perhaps with the
> entry for INVALID_PROC_NUMBER storing the global I/O stats, I
> suppose. However, one concern with that approach would be the impact
> on performance due to the frequent creation and deletion of stats
> entries caused by high turnover of backends.

As an additional benefit of this approach, the client can set a
connection variable, for example, no_backend_iostats to true, or set
its inverse variable to false, to restrict memory usage to only the
required backends.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: per backend I/O statistics

2024-09-02 Thread Kyotaro Horiguchi
At Mon, 2 Sep 2024 14:55:52 +, Bertrand Drouvot 
 wrote in 
> Hi hackers,
> 
> Please find attached a patch to implement $SUBJECT.
> 
> While pg_stat_io provides cluster-wide I/O statistics, this patch adds a new
> pg_my_stat_io view to display "my" backend I/O statistics and a new
> pg_stat_get_backend_io() function to retrieve the I/O statistics for a given
> backend pid.
> 
> By having the per backend level of granularity, one could for example identify
> which running backend is responsible for most of the reads, most of the extends
> and so on... The pg_my_stat_io view could also be useful to check the
> impact on the I/O made by some operations, queries,... in the current session.
> 
> Some remarks:
> 
> - it is split in 2 sub patches: 0001 introducing the necessary changes to provide
> the pg_my_stat_io view and 0002 to add the pg_stat_get_backend_io() function.
> - the idea of having per backend I/O statistics has already been mentioned in
> [1] by Andres.
> 
> Some implementation choices:
> 
> - The KIND_IO stats are still "fixed amount" ones as the maximum number of
> backend is fixed.
> - The statistics snapshot is made for the global stats (the aggregated ones) and
> for my backend stats. The snapshot is not built for all the backend stats (that
> could be memory expensive depending on the number of max connections and given
> the fact that PgStat_IO is 16KB long).
> - The above point means that pg_stat_get_backend_io() behaves as if
> stats_fetch_consistency is set to none (each execution re-fetches counters
> from shared memory).
> - The above 2 points are also the reasons why the pg_my_stat_io view has been
> added (as its results takes care of the stats_fetch_consistency setting). I think
> that makes sense to rely on it in that case, while I'm not sure that would make
> a lot of sense to retrieve other's backend I/O stats and taking care of
> stats_fetch_consistency.
> 
> 
> [1]: https://www.postgresql.org/message-id/20230309003438.rectf7xo7pw5t5cj%40awork3.anarazel.de

I'm not sure about the usefulness of having the stats only available
from the current session. Since they are stored in shared memory,
shouldn't we make them accessible to all backends? However, this would
introduce permission considerations and could become complex.

When I first looked at this patch, my initial thought was whether we
should let these stats stay "fixed." The reason why the current
PGSTAT_KIND_IO is fixed is that there is only one global statistics
storage for the entire database. If we have stats for a flexible
number of backends, it would need to be non-fixed, perhaps with the
entry for INVALID_PROC_NUMBER storing the global I/O stats, I
suppose. However, one concern with that approach would be the impact
on performance due to the frequent creation and deletion of stats
entries caused by high turnover of backends.

Just to be clear, the above comments are not meant to oppose the
current implementation approach. They are purely for the sake of
discussing comparisons with other possible approaches.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Use XLOG_CONTROL_FILE macro everywhere?

2024-09-02 Thread Kyotaro Horiguchi
At Mon, 2 Sep 2024 15:44:45 +0200, Daniel Gustafsson  wrote in 
> Summarizing the thread it seems consensus is using XLOG_CONTROL_FILE
> consistently as per the original patch.
> 
> A few comments on the patch though:
> 
> - * reads the data from $PGDATA/global/pg_control
> + * reads the data from $PGDATA/
> 
> I don't think this is an improvement, I'd leave that one as the filename
> spelled out.
> 
> - "the \".old\" suffix from %s/global/pg_control.old.\n"
> + "the \".old\" suffix from %s/%s.old.\n"
> 
> Same with that change, not sure I think that makes reading the errormessage
> code any easier.

I agree with the first point. In fact, I think it might be even better
to just write something like "reads the data from the control file" in
plain language rather than using the actual file name. As for the
second point, I'm fine either way, but if the main goal is to ensure
resilience against future changes to the value of XLOG_CONTROL_FILE,
then changing it makes sense. On the other hand, I don't see any
strong reasons not to change it. That said, I don't really expect the
value to change in the first place.

The following point caught my attention.

> +++ b/src/backend/postmaster/postmaster.c
...
> +#include "access/xlog_internal.h"

The name xlog_internal suggests that the file should only be included
by modules dealing with XLOG details, and postmaster.c doesn't seem to
fit that category. If the macro is used more broadly, it might be
better to move it to a more public location. However, following the
current discussion, if we decide to keep the macro's name as it is, it
would make more sense to keep it in its current location.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Missing reflection in nls.mk from recent basebackup changes

2024-08-05 Thread Kyotaro Horiguchi
Hello.

After a recent commit f80b09bac8, make update-po fails with "missing
required files". The commit moved some files to fe_utils but this
change was not reflected in pg_basebackup's nls.mk. The attached patch
fixes this issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f9eac9605d0309d32d6d1fab7a0a2d4e8d42798a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 6 Aug 2024 10:04:54 +0900
Subject: [PATCH] Fix nls.mk to reflect astreamer files relocation

In the recent commit f80b09bac8, astreamer files were moved to another
directory, but this change was not reflected in nls.mk. This commit
corrects that oversight.
---
 src/bin/pg_basebackup/nls.mk | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_basebackup/nls.mk b/src/bin/pg_basebackup/nls.mk
index 950b9797b1..f75f44dbe6 100644
--- a/src/bin/pg_basebackup/nls.mk
+++ b/src/bin/pg_basebackup/nls.mk
@@ -1,12 +1,7 @@
 # src/bin/pg_basebackup/nls.mk
 CATALOG_NAME = pg_basebackup
 GETTEXT_FILES= $(FRONTEND_COMMON_GETTEXT_FILES) \
-   astreamer_file.c \
-   astreamer_gzip.c \
astreamer_inject.c \
-   astreamer_lz4.c \
-   astreamer_tar.c \
-   astreamer_zstd.c \
pg_basebackup.c \
pg_createsubscriber.c \
pg_receivewal.c \
@@ -19,6 +14,11 @@ GETTEXT_FILES= $(FRONTEND_COMMON_GETTEXT_FILES) \
../../common/fe_memutils.c \
../../common/file_utils.c \
../../common/restricted_token.c \
+   ../../fe_utils/astreamer_file.c \
+   ../../fe_utils/astreamer_gzip.c \
+   ../../fe_utils/astreamer_lz4.c \
+   ../../fe_utils/astreamer_tar.c \
+   ../../fe_utils/astreamer_zstd.c \
../../fe_utils/option_utils.c \
../../fe_utils/recovery_gen.c \
../../fe_utils/string_utils.c
-- 
2.43.5



wrong translation file reference in pg_createsubscriber

2024-08-01 Thread Kyotaro Horiguchi
Hi,

I found that pg_createsubscriber doesn't use NLS files. This is due to
the wrong reference name "pg_createsubscriber" being passed to
set_pglocale_pgservice(). It should be "pg_basebackup" instead. See
the attached patch.

# Sorry for being away for a while. I should be able to return soon.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From ede9fecf6a042e87c5cf92ed34d6b1991646577b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 2 Aug 2024 11:33:52 +0900
Subject: [PATCH] Fix NLS file reference in pg_createsubscriber

pg_createsubscriber is referring to a non-existent message translation
file, causing NLS to not work correctly. This command should use the
same file as pg_basebackup.
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index aab9bf0f51..4bce45e1e4 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -1904,7 +1904,7 @@ main(int argc, char **argv)
 	pg_logging_init(argv[0]);
 	pg_logging_set_level(PG_LOG_WARNING);
 	progname = get_progname(argv[0]);
-	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_createsubscriber"));
+	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_basebackup"));
 
 	if (argc > 1)
 	{
-- 
2.43.5



Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-24 Thread Kyotaro Horiguchi
At Fri, 21 Jun 2024 11:48:22 +0530, Amit Kapila  wrote 
in 
> On Wed, Jun 19, 2024 at 10:44 AM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Horiguchi-san,
> >
> > Thanks for sharing the patch! I agree this approach (ensure WAL records
> > are flushed) is more proper than others.
> >
> > I have an unclear point. According to the comment atop GetInsertRecPtr(), 
> > it just
> > returns the approximated value - the position of the last full WAL page [1].
> > If there is a continuation WAL record which across a page, will it return 
> > the
> > halfway point of the WAL record (end of the first WAL page)? If so, the 
> > proposed
> > fix seems not sufficient. We have to point out the exact the end of the 
> > record.
> >
> 
> You have a point but if this theory is correct why we are not able to
> reproduce the issue after this patch? Also, how to get the WAL
> location up to which we need to flush? Is XLogCtlData->logInsertResult
> the one we are looking for?

It is not exposed, but of course logInsertResult is a more
straightforward source for the LSN.

The reason the patch works well is the following bit of code.

xlog.c:958, in XLogInsertRecord()
>   /*
>* Update shared LogwrtRqst.Write, if we crossed page boundary.
>*/
>   if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
>   {
>   SpinLockAcquire(&XLogCtl->info_lck);
>   /* advance global request to include new block(s) */
>   if (XLogCtl->LogwrtRqst.Write < EndPos)
>   XLogCtl->LogwrtRqst.Write = EndPos;
>   SpinLockRelease(&XLogCtl->info_lck);
>   RefreshXLogWriteResult(LogwrtResult);
>   }

This code, which has existed for a long time, ensures that
GetInsertRecPtr() returns the accurate end of a record when it spans
page boundaries. This would need to be noted in a new comment if we
use GetInsertRecPtr().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Pluggable cumulative statistics

2024-06-20 Thread Kyotaro Horiguchi
At Thu, 13 Jun 2024 16:59:50 +0900, Michael Paquier  wrote 
in 
> * The kind IDs may change across restarts, meaning that any stats data 
> associated to a custom kind is stored with the *name* of the custom
> stats kind.  Depending on the discussion happening here, I'd be open
> to use the same concept as custom RMGRs, where custom kind IDs are
> "reserved", fixed in time, and tracked in the Postgres wiki.  It is
> cheaper to store the stats this way, as well, while managing conflicts
> across extensions available in the community ecosystem.

I prefer to avoid having a central database if possible.

If we don't intend to move stats data alone out of a cluster for use
in another one, can't we store the relationship between stats names
and numeric IDs (or index numbers) in a separate file, which is loaded
just before and synced just after extension preloading finishes?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-06-20 Thread Kyotaro Horiguchi
ts.


= L756
+{ oid => '16387', descr => 'wait for LSN with timeout',

The description seems a bit off. While timeout is mentioned, the more
important characteristic that the LSN is the replay LSN is not
included.

= L791
+ * WaitLSNProcInfo [e2 80 93] the shared memory structure representing 
information

This line contains a non-ascii character EN Dash(U+2013).


= L798
+* A process number, same as the index of this item in 
waitLSN->procInfos.
+* Stored for convenience.
+*/
+   int procnum;

It is described as "(just) for convenience". However, it is referenced
by Startup to fetch the PGPROC entry for the waiter, which is
necessary for Startup. That aside, why don't we hold (a pointer to)
procLatch instead of procnum? That would make things simpler, and I
believe it is our standard practice.


= L809
+   /* A flag indicating that this item is added to waitLSN->waitersHeap */
+   boolinHeap;

The name "inHeap" seems too literal and might be hard to understand in
most occurrences. How about using "waiting" instead?

= L920
+# I
+# Make sure that pg_wal_replay_wait() works: add new content to

Hmm. I feel that Arabic numerals look nicer than Roman ones here.


= L940
+# Check that new data is visible after calling pg_wal_replay_wait()

On the other hand, the comment for the check for this test states that

> +# Make sure the current LSN on standby and is the same as primary's
> LSN +ok($output eq 30, "standby reached the same LSN as primary");

I think the first comment and the second should be consistent.




> Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl.
> 
> 1. It's not clear why this test needs node_standby2 at all.  It seems useless.

I agree with you. What we would need is a second *waiter client*
connecting to the same standby, rather than a second standby. I would
also like to see a test where the first waiter is removed while
multiple waiters are waiting, as well as a test where promotion occurs
under the same circumstances.

> 2. The target LSN is set to pg_current_wal_insert_lsn() + 1.  This
> location seems to be unachievable in this test.  So, it's not clear
> what race condition this test could potentially detect.
> 3. I think it would make sense to check for the race condition
> reported by Heikki.  That is to insert the injection point at the
> beginning of WaitLSNSetLatches().

I think the race condition you mentioned refers to the inconsistency
between the inHeap flag and the pairing heap caused by a race between
timeout and wakeup (or perhaps other combinations? I'm not sure which
version of the patch the mentioned race condition refers to). However,
I imagine it is difficult to reliably reproduce this condition. In
that regard, in the latest patch the coherence between the inHeap flag
and the pairing heap is protected by an LWLock, so I believe we no
longer need that test.

regards.


> Links.
> 1. 
> https://www.postgresql.org/message-id/flat/CAPpHfdvGRssjqwX1%2Bidm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw%40mail.gmail.com#557900e860457a9e24256c93a2ad4920

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: replace strtok()

2024-06-19 Thread Kyotaro Horiguchi
At Tue, 18 Jun 2024 09:18:28 +0200, Peter Eisentraut  
wrote in 
> Under the topic of getting rid of thread-unsafe functions in the
> backend [0], here is a patch series to deal with strtok().
> 
> Of course, strtok() is famously not thread-safe and can be replaced by
> strtok_r().  But it also has the wrong semantics in some cases,
> because it considers adjacent delimiters to be one delimiter.  So if
> you parse
> 
> SCRAM-SHA-256$:$:
> 
> with strtok(), then
> 
> SCRAM-SHA-256$$::$$::
> 
> parses just the same.  In many cases, this is arguably wrong and could
> hide mistakes.
> 
> So I'm suggesting to use strsep() in those places.  strsep() is
> nonstandard but widely available.
> 
> There are a few places where strtok() has the right semantics, such as
> parsing tokens separated by whitespace.  For those, I'm using
> strtok_r().

I agree with the distinction.

> A reviewer job here would be to check whether I made that distinction
> correctly in each case.

0001 and 0002 look correct to me regarding that distinction. They
applied correctly to the master HEAD and all tests passed on Linux.

> On the portability side, I'm including a port/ replacement for
> strsep() and some workaround to get strtok_r() for Windows.  I have
> included these here as separate patches for clarity.

0003 looks fine and successfully built and seems working on an MSVC
build.

About 0004, Cygwin seems to have its own strtok_r, but I haven't
checked how that fact affects the build.

> [0]:
> https://www.postgresql.org/message-id/856e5ec3-879f-42ee-8258-8bcc6ec9b...@eisentraut.org

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-18 Thread Kyotaro Horiguchi
At Thu, 13 Jun 2024 09:29:03 +0530, Amit Kapila  wrote 
in 
> Yeah, but the commit you quoted later reverted by commit 703f148e98
> and committed again as c6c3334364.

Yeah, right.

> > aiming to prevent walsenders from
> > generating competing WAL (by, for example, CREATE_REPLICATION_SLOT)
> > records with the shutdown checkpoint. Thus, it seems that the
> > walsender cannot see the shutdown record,
> >
> 
> This is true of logical walsender. The physical walsender do send
> shutdown checkpoint record before getting terminated.

Yes, I know. They differ in their blocking mechanisms.

> > and a certain amount of
> > bytes before it, as the walsender appears to have relied on the
> > checkpoint flushing its record, rather than on XLogBackgroundFlush().
> >
> > If we approve of the walsender being terminated before the shutdown
> > checkpoint, we need to "fix" the comment, then provide a function to
> > ensure the synchronization of WAL records.
> >
> 
> Which comment do you want to fix?

Yeah. The part you seem to think I was trying to fix is actually
fine. Instead, I have revised the comment on the modified section to
make its intention clearer.

> > I'll consider this direction for a while.
> >
> 
> Okay, thanks.

The attached patch is it. It's only for the master.

I decided not to create a new function because the simple code has
only one caller. I haven't seen the test script fail with this fix.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 663bdeaf8d4d2f5a192dd3ecb6d2817d9b1311f1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 18 Jun 2024 16:36:43 +0900
Subject: [PATCH] Ensure WAL is written out when terminating a logical
 walsender

In commit c6c3334364, XLogBackgroundFlush() was assumed to flush all
written WAL to the end, but this function may not flush the last
incomplete page in a single call. In such cases, if the last written
page ends with a continuation record, the latter part will not be
flushed, causing shutdown to wait indefinitely for that part. This
commit ensures that the written records are fully flushed to the end,
guaranteeing expected behavior.
---
 src/backend/replication/walsender.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c623b07cf0..5aa0f58238 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1858,12 +1858,13 @@ WalSndWaitForWal(XLogRecPtr loc)
 		ProcessRepliesIfAny();
 
 		/*
-		 * If we're shutting down, trigger pending WAL to be written out,
-		 * otherwise we'd possibly end up waiting for WAL that never gets
-		 * written, because walwriter has shut down already.
+		 * If we're shutting down, all WAL-writing processes are gone, leaving
+		 * only checkpointer to perform the shutdown checkpoint. Ensure that
+		 * any pending WAL is written out here; otherwise, we'd possibly end up
+		 * waiting for WAL that never gets written.
 		 */
-		if (got_STOPPING)
-			XLogBackgroundFlush();
+		if (got_STOPPING && !RecoveryInProgress())
+			XLogFlush(GetInsertRecPtr());
 
 		/*
 		 * To avoid the scenario where standbys need to catch up to a newer
-- 
2.43.0



Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-11 Thread Kyotaro Horiguchi
At Tue, 11 Jun 2024 14:27:28 +0530, Amit Kapila  wrote 
in 
> On Tue, Jun 11, 2024 at 12:34 PM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 11 Jun 2024 11:32:12 +0530, Amit Kapila  
> > wrote in
> > > Sorry, it is not clear to me why we failed to flush the last
> > > continuation record in logical walsender? I see that we try to flush
> > > the WAL after receiving got_STOPPING in WalSndWaitForWal(), why is
> > > that not sufficient?
> >
> > It seems that, it uses XLogBackgroundFlush(), which does not guarantee
> > flushing WAL until the end.
> >
> 
> What would it take to ensure the same? I am trying to explore this
> path because currently logical WALSender sends any outstanding logs up
> to the shutdown checkpoint record (i.e., the latest record) and waits
> for them to be replicated to the standby before exit. Please take a
> look at the comments where we call WalSndDone(). The fix you are
> proposing will break that guarantee.

The shutdown checkpoint has been performed after walsender termination
completes since 086221cf6b, aiming to prevent walsenders from
generating WAL records (by, for example, CREATE_REPLICATION_SLOT) that
compete with the shutdown checkpoint. Thus, it seems that the
walsender cannot see the shutdown record, nor a certain number of
bytes before it, as the walsender appears to have relied on the
checkpoint flushing its records, rather than on XLogBackgroundFlush().

If we approve of the walsender being terminated before the shutdown
checkpoint, we need to "fix" the comment, then provide a function to
ensure the synchronization of WAL records.

I'll consider this direction for a while.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-11 Thread Kyotaro Horiguchi
At Tue, 11 Jun 2024 09:27:20 +0900, Michael Paquier  wrote 
in 
> On Thu, Jun 06, 2024 at 03:19:20PM +0900, Kyotaro Horiguchi wrote:
> > So, I believe the attached small patch fixes the behavior. I haven't
> > come up with a good test script for this issue. Something like
> > 026_overwrite_contrecord.pl might work, but this situation seems a bit
> > more complex than what it handles.
> 
> Hmm.  Indeed you will not be able to reuse the same trick with the end
> of a segment.  Still you should be able to get a rather stable test by
> using the same tricks as 039_end_of_wal.pl to spawn a record across
> multiple pages, no?  

With the trick, we could write a page-spanning record, but I'm not
sure we can control the behavior of XLogBackgroundFlush().

As Amit suggested, we have the option to create a variant of the
function that guarantees flushing WAL until the end.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-11 Thread Kyotaro Horiguchi
At Tue, 11 Jun 2024 11:32:12 +0530, Amit Kapila  wrote 
in 
> Sorry, it is not clear to me why we failed to flush the last
> continuation record in logical walsender? I see that we try to flush
> the WAL after receiving got_STOPPING in WalSndWaitForWal(), why is
> that not sufficient?

It seems that it uses XLogBackgroundFlush(), which does not guarantee
flushing WAL to the end.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: relfilenode statistics

2024-06-10 Thread Kyotaro Horiguchi
At Mon, 10 Jun 2024 08:09:56 +, Bertrand Drouvot 
 wrote in 
> Hi,
> 
> On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote:
> > On Thu, Jun 6, 2024 at 11:17 PM Andres Freund  wrote:
> > > If we just want to keep prior stats upon arelation rewrite, we can just 
> > > copy
> > > the stats from the old relfilenode.  Or we can decide that those stats 
> > > don't
> > > really make sense anymore, and start from scratch.
> > 
> > I think we need to think carefully about what we want the user
> > experience to be here. "Per-relfilenode stats" could mean "sometimes I
> > don't know the relation OID so I want to use the relfilenumber
> > instead, without changing the user experience" or it could mean "some
> > of these stats actually properly pertain to the relfilenode rather
> > than the relation so I want to associate them with the right object
> > and that will affect how the user sees things." We need to decide
> > which it is. If it's the former, then we need to examine whether the
> > goal of hiding the distinction between relfilenode stats and relation
> > stats from the user is in fact feasible. If it's the latter, then we
> > need to make sure the whole patch reflects that design, which would
> > include e.g. NOT copying stats from the old to the new relfilenode,
> > and which would also include documenting the behavior in a way that
> > will be understandable to users.
> 
> Thanks for sharing your thoughts!
> 
> Let's take the current heap_blks_read as an example: it currently survives
> a relation rewrite and I guess we don't want to change the existing user
> experience for it.
> 
> Now say we want to add "heap_blks_written" (like in this POC patch) then I 
> think
> that it makes sense for the user to 1) query this new stat from the same place
> as the existing heap_blks_read: from pg_statio_all_tables and 2) to have the 
> same
> experience as far the relation rewrite is concerned (keep the previous stats).
> 
> To achieve the rewrite behavior we could:
> 
> 1) copy the stats from the OLD relfilenode to the relation (like in the POC 
> patch)
> 2) copy the stats from the OLD relfilenode to the NEW one (could be in a 
> dedicated
> field)
> 
> The PROS of 1) is that the behavior is consistent with the current 
> heap_blks_read
> and that the user could still see the current relfilenode stats (through a 
> new API)
> if he wants to.
> 
> > In my experience, the worst thing you can do in cases like this is be
> > somewhere in the middle. Then you tend to end up with stuff like: the
> > difference isn't supposed to be something that the user knows or cares
> > about, except that they do have to know and care because you haven't
> > thoroughly covered up the deception, and often they have to reverse
> > engineer the behavior because you didn't document what was really
> > happening because you imagined that they wouldn't notice.
> 
> My idea was to move all that is in pg_statio_all_tables to relfilenode stats
> and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) 
> ensure
> the user can still retrieve the stats from pg_statio_all_tables in such a way
> that it survives a rewrite, 3) provide dedicated APIs to retrieve
> relfilenode stats but only for the current relfilenode, 4) document this
> behavior. This is what the POC patch is doing for heap_blks_written (would
> need to do the same for heap_blks_read and friends) except for the 
> documentation
> part. What do you think?

In my opinion, it is certainly strange that bufmgr is aware of
relation kinds, but introducing relfilenode stats to avoid this skew
doesn't seem to be the best way, as it invites inconclusive arguments
like the one raised above. The fact that we transfer counters from old
relfilenodes to new ones indicates that we are not really interested
in counts by relfilenode. If that's the case, wouldn't it be simpler
to call pgstat_count_relation_buffer_read() from bufmgr.c and then
branch according to relkind within that function? If you're concerned
about the additional branch, some ingenuity may be needed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Things I don't like about \du's "Attributes" column

2024-06-09 Thread Kyotaro Horiguchi
At Sat, 8 Jun 2024 14:09:11 -0400, Robert Haas  wrote in 
> On Sat, Jun 8, 2024 at 10:02 AM Pavel Luzanov  
> wrote:
> > Therefore, I think the current patch offers a better version of the \du 
> > command.
> > However, I admit that these improvements are not enough to accept the patch.
> > I would like to hear other opinions.
> 
> Hmm, I don't think I quite agree with this. If people like this
> version better than what we have now, that's all we need to accept the
> patch. I just don't really want to be the one to decide all by myself
> whether this is, in fact, better. So, like you, I would like to hear
> other opinions.

>  regress_du_role0 | yes   | Inherit | Tue Jun 04 00:00:00 2024 PDT |  
>   0 |
>  regress_du_role1 | no| Create role+| infinity |  
> |

I guess that in English, when written as "'Login' = 'yes/no'", it can
be easily understood. However, in Japanese, "'ログイン' = 'はい/いいえ'"
looks somewhat awkward and is a bit difficult to understand at a
glance. "'ログイン' = '可/不可'" (equivalent to "Login is
'can/cannot'") sounds more natural in Japanese, but it was rejected
upthread, and I also don't like 'can/cannot'. As further candidates,
"allowed/not allowed" or "granted/denied" could be mentioned; they
would be easier to translate, at least into Japanese. However, isn't
there a risk that 'granted/denied' would be misunderstood as referring
to database permissions?

Likewise, "'Valid until' = 'infinity'" (equivalent to "'有効期限' = '
無限'") also sounds awkward. Maybe that's the same in English. I guess
that 'unbounded' or 'indefinite' sounds better, and their Japanese
translation '無期限' also sounds natural. However, I'm not sure we
want to go to that extent in transforming the table.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2024-06-06 Thread Kyotaro Horiguchi
At Thu, 06 Jun 2024 17:15:15 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 06 Jun 2024 16:45:00 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > I have been thinking about this since then. At first, I thought it
> > referred to FindFirstChangeNotification() and friends, and inotify on
> > Linux. However, I haven't found a way to simplify the specified code
> > area using those APIs.
> 
> By the way, the need to shift by 2 seconds to tolerate clock skew
> suggests that the current launcher-postmaster association mechanism is
> somewhat unreliable. Couldn't we add a command line option to
> postmaster to explicitly pass a unique identifier (say, pid itself) of
> the launcher? If it is not specified, the number should be the PID of
> the immediate parent process.

No. It should be the combination of pg_ctl's pid and a timestamp, to
avoid false matches across a reboot.

> This change avoids the need for the special treatment for Windows.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2024-06-06 Thread Kyotaro Horiguchi
At Thu, 06 Jun 2024 16:45:00 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I have been thinking about this since then. At first, I thought it
> referred to FindFirstChangeNotification() and friends, and inotify on
> Linux. However, I haven't found a way to simplify the specified code
> area using those APIs.

By the way, the need to shift by 2 seconds to tolerate clock skew
suggests that the current launcher-postmaster association mechanism is
somewhat unreliable. Couldn't we add a command line option to
postmaster to explicitly pass a unique identifier (say, pid itself) of
the launcher? If it is not specified, the number should be the PID of
the immediate parent process.

This change avoids the need for the special treatment for Windows.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2024-06-06 Thread Kyotaro Horiguchi
At Tue, 4 Jun 2024 08:30:19 +0900, Michael Paquier  wrote 
in 
> On Mon, Jan 15, 2024 at 01:34:46PM -0500, Robert Haas wrote:
> > This kind of change looks massively helpful to me. I don't know if it
> > is exactly right or not, but it would have been a big help to me when
> > writing my previous review, so +1 for some change of this general
> > type.
> 
> During a live review of this patch last week, as part of the Advanced
> Patch Workshop of pgconf.dev, it has been mentioned by Tom that we may
> be able to simplify the check on pmstart if the detection of the
> postmaster PID started by pg_ctl is more stable using the WIN32
> internals that this patch relies on.  I am not sure that this
> suggestion is right, though, because we should still care about the
> clock skew case as written in the surrounding comments?  Even if
> that's OK, I would assume that this should be an independent patch,
> written on top of the proposed v6-0001.
> 
> Tom, could you comment about that?  Perhaps my notes did not catch
> what you meant.

Thank you for the follow-up.

I have been thinking about this since then. At first, I thought it
referred to FindFirstChangeNotification() and friends, and inotify on
Linux. However, I haven't found a way to simplify the specified code
area using those APIs.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-05 Thread Kyotaro Horiguchi
At Thu, 6 Jun 2024 12:49:45 +1000, Peter Smith  wrote in 
> Hi, I have reproduced this multiple times now.
> 
> I confirmed the initial post/steps from Alexander. i.e. The test
> script provided [1] gets itself into a state where function
> ReadPageInternal (called by XLogDecodeNextRecord and commented "Wait
> for the next page to become available") constantly returns
> XLREAD_FAIL. Ultimately the test times out because WalSndLoop() loops
> forever, since it never calls WalSndDone() to exit the walsender
> process.

Thanks for the repro; I believe I understand what's happening here.

During server shutdown, the latter half of the last continuation
record may fail to be flushed. This is similar to what is described in
the commit message of commit ff9f111bce. While shutting down,
WalSndLoop() waits for XLogSendLogical() to consume WAL up to
flushPtr, but in this case, the last record cannot complete without
the continuation part starting from flushPtr, which is
missing. However, in such cases, xlogreader.missingContrecPtr is set
to the beginning of the missing part, but something similar to 

So, I believe the attached small patch fixes the behavior. I haven't
come up with a good test script for this issue. Something like
026_overwrite_contrecord.pl might work, but this situation seems a bit
more complex than what it handles.

Versions back to 10 should suffer from the same issue and the same
patch will be applicable without significant changes.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 99cad7bd53a94b4b90937fb1eb2f37f2ebcadf6a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 6 Jun 2024 14:56:53 +0900
Subject: [PATCH] Fix infinite loop in walsender during publisher shutdown

When a publisher server is shutting down, there can be a case where
the last WAL record at that point is a continuation record with its
latter part not yet flushed. In such cases, the walsender attempts to
read this unflushed part and ends up in an infinite loop. To prevent
this situation, modify the logical WAL sender to consider itself
caught up in this case. The records that are not fully flushed at this
point are generally not significant, so simply ignoring them should
not cause any issues.
---
 src/backend/replication/walsender.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c623b07cf0..635396c138 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3426,8 +3426,15 @@ XLogSendLogical(void)
 			flushPtr = GetFlushRecPtr(NULL);
 	}
 
-	/* If EndRecPtr is still past our flushPtr, it means we caught up. */
-	if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
+	/*
+	 * If EndRecPtr is still past our flushPtr, it means we caught up.  When
+	 * the server is shutting down, the latter part of a continuation record
+	 * may be missing.  If got_STOPPING is true, assume we are caught up if the
+	 * last record is missing its continuation part at flushPtr.
+	 */
+	if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr ||
+		(got_STOPPING &&
+		 logical_decoding_ctx->reader->missingContrecPtr == flushPtr))
 		WalSndCaughtUp = true;
 
 	/*
-- 
2.43.0



Re: Add last_commit_lsn to pg_stat_database

2024-06-04 Thread Kyotaro Horiguchi
At Tue, 20 Feb 2024 07:56:28 +0900, Michael Paquier  wrote 
in 
> On Mon, Feb 19, 2024 at 10:26:43AM +0100, Tomas Vondra wrote:
> It just means that I am not much a fan of the signature changes done
> for RecordTransactionCommit() and AtEOXact_PgStat_Database(), adding a
> dependency to a specify LSN.  Your suggestion to switching to
> XactLastRecEnd should avoid that.
> 
> > - stats_fetch_consistency=cache: always the same min/max value
> > 
> > - stats_fetch_consistency=none: min != max
> > 
> > Which would suggest you're right and this should be VOLATILE and not
> > STABLE. But then again - the existing pg_stat_get_db_* functions are all
> > marked as STABLE, so (a) is that correct and (b) why should this
> > function be marked different?
> 
> This can look like an oversight of 5891c7a8ed8f to me.  I've skimmed
> through the threads related to this commit and messages around [1]
> explain why this GUC exists and why we have both behaviors, but I'm
> not seeing a discussion about the volatibility of these functions.
> The current situation is not bad either, the default makes these
> functions stable, and I'd like to assume that users of this GUC know
> what they do.  Perhaps Andres or Horiguchi-san can comment on that.
> 
> https://www.postgresql.org/message-id/382908.1616343...@sss.pgh.pa.us

I agree that we cannot achieve, nor do we need, perfect MVCC behavior,
and that completely volatile behavior is unusable. I believe the
functions are kept marked as stable, as this is the nearest and most
usable volatility for the default behavior, since function volatility
is not switchable on-the-fly. This approach seems least trouble-prone
to me.

The consistency of the functions is discussed here.

https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-STATS-VIEWS

> This is a feature, not a bug, because ... Conversely, if it's known
> that statistics are only accessed once, caching accessed statistics is
> unnecessary and can be avoided by setting stats_fetch_consistency to
> none.

It seems to me that this description already implies such a deviation
of the functions' behavior from "stable" semantics, but we might want
to mention that incongruity explicitly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Fix use of possible uninitialized variable retval (src/pl/plpgsql/src/pl_handler.c)

2024-06-04 Thread Kyotaro Horiguchi
At Mon, 27 May 2024 11:31:24 -0300, Ranier Vilela  wrote 
in 
> Hi.
> 
> The function *plpgsql_inline_handler* can use uninitialized
> variable retval, if PG_TRY fails.
> Fix like function*plpgsql_call_handler* wich declare retval as
> volatile and initialize to (Datum 0).

If PG_TRY fails, retval is not actually accessed, so no real issue
exists. Commit 7292fd8f1c changed plpgsql_call_handler() to its
current form, but as stated in its commit message, it did not fix a
real issue and was done solely to silence a compiler warning.

I believe we do not need to modify plpgsql_inline_handler() unless the
compiler actually issues a false warning for it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH]A minor improvement to the error-report in SimpleLruWriteAll()

2024-06-04 Thread Kyotaro Horiguchi
At Tue, 28 May 2024 20:15:59 +0800 (CST), "Long Song"  
wrote in 
> 
> Hi,
> Actually, I still wonder why only the error message
> of the last failure to close the file was recorded.
> For this unusual situation, it is acceptable to
> record all failure information without causing
> too much logging.
> Was it designed that way on purpose?

Note that SlruReportIOError() causes a non-local exit. To me, the
point of the loop seems to be that we want to close every single file,
apart from the failed ones. From that perspective, the patch disrupts
that intended behavior by exiting in the middle of the loop. It seems
we didn't want to bother collecting errors for every failed file in
that part.
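The intended shape of that loop, reduced to a standalone sketch (this is an illustration, not the SLRU code itself): every file is closed regardless of earlier failures, and the error report (a non-local exit via SlruReportIOError() in the real code) happens only after the loop.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Close every descriptor; remember only the first errno.  The caller
 * reports once after the loop, so a failure on one file cannot leave
 * later files unclosed. */
static int close_all(int *fds, int n, int *first_errno)
{
    int nfail = 0;

    for (int i = 0; i < n; i++)
    {
        if (fds[i] >= 0 && close(fds[i]) != 0)
        {
            if (nfail++ == 0)
                *first_errno = errno;   /* keep only the first failure */
        }
        fds[i] = -1;
    }
    return nfail;   /* caller reports here, after all files were tried */
}
```

Reporting inside the loop, as the proposed patch did, would exit before the remaining descriptors are closed.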
 
regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_parse_json() should not leak token copies on failure

2024-06-03 Thread Kyotaro Horiguchi
Hi,

At Fri, 24 May 2024 08:43:01 -0700, Jacob Champion 
 wrote in 
> On Tue, Apr 30, 2024 at 2:29 PM Jacob Champion
>  wrote:
> > Attached is a draft patch to illustrate what I mean, but it's
> > incomplete: it only solves the problem for scalar values.
> 
> (Attached is a v2 of that patch; in solving a frontend leak I should
> probably not introduce a backend segfault.)

I understand that the part enclosed in parentheses refers to the "if
(ptr) pfree(ptr)" structure. I believe we rely on the behavior of
free(NULL), which safely does nothing. (I couldn't find the related
discussion due to a timeout error on the ML search page.)

Although I don't fully understand the entire parser code, it seems
that the ownership transfer only occurs in JSON_TOKEN_STRING cases.
That is, the memory pointed to by scalar_val would become dangling in
JSON_TOKEN_NUMBER cases. Even if this is not the case, the ownership
transition appears quite challenging to follow.

It might be safer or clearer to pstrdup the token in jsonb_in_scalar()
and avoid NULLifying scalar_val after calling callbacks, or to let
jsonb_in_scalar() NULLify the pointer.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: In-placre persistance change of a relation

2024-05-24 Thread Kyotaro Horiguchi
Rebased.

Along with rebasing, I changed the interface of XLogFsyncFile() to
return a boolean instead of an error message.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From bed74e638643d7491bbd86fe640c33db1e16f0e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v33 1/3] Export wal_sync_method related functions

Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
 src/backend/access/transam/xlog.c | 73 +--
 src/include/access/xlog.h |  2 +
 2 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f..492ababd9c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8592,21 +8592,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
 	}
 }
 
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+	return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
 
 /*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
  *
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
  */
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
 {
-	char	   *msg = NULL;
+	const char *msg = NULL;
 	instr_time	start;
 
-	Assert(tli != 0);
-
 	/*
 	 * Quick exit if fsync is disabled or write() has already synced the WAL
 	 * file.
@@ -8614,7 +8622,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	if (!enableFsync ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
-		return;
+		return NULL;
 
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
@@ -8651,19 +8659,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 			break;
 	}
 
-	/* PANIC if failed to fsync */
-	if (msg)
-	{
-		char		xlogfname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		XLogFileName(xlogfname, tli, segno, wal_segment_size);
-		errno = save_errno;
-		ereport(PANIC,
-(errcode_for_file_access(),
- errmsg(msg, xlogfname)));
-	}
-
 	pgstat_report_wait_end();
 
 	/*
@@ -8677,7 +8672,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
 	}
 
-	PendingWalStats.wal_sync++;
+	if (msg == NULL)
+		PendingWalStats.wal_sync++;
+
+	return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+	const char	*msg;
+	
+	Assert(tli != 0);
+
+	msg = XLogFsyncFile(fd);
+
+	/* PANIC if failed to fsync */
+	if (msg)
+	{
+		char		xlogfname[MAXFNAMELEN];
+		int			save_errno = errno;
+
+		XLogFileName(xlogfname, tli, segno, wal_segment_size);
+		errno = save_errno;
+		ereport(PANIC,
+(errcode_for_file_access(),
+ errmsg(msg, xlogfname)));
+	}
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1a1f11a943..badfe4abd6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
+extern int XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
 
 extern bool RecoveryInProgress(void);
-- 
2.43.0

>From c200b85c1311f97bdae2ed20e2746c44d5c4aadb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v33 2/3] Introduce undo log implementation

This patch adds a simple implementation of UNDO log feature.
---
 src/backend/access/transam/Makefile|   1 +
 src/backend/access/transam/meson.build |   1 +
 src/backend/access/transam/rmgr.c  |   4 +-
 src/backend/access/transam/simpleundolog.c | 362 +
 src/backend/access/transam/twophase.c  |   3 +
 src/backend/access/transam/xact.c  |  24 ++
 src/backend/access/transam/xlog.c  |  42 ++-
 src/backend/catalog/storage.c  | 171 ++
 src/backend/storage/file/reinit.c  |  78 +
 src/backend/storage/smgr/smgr.c|   9 +
 src/bin/initdb/initdb.c|  17 +
 src/bin/pg_rewind/parsexlog.c 

inconsistent quoting in error messages

2024-05-20 Thread Kyotaro Horiguchi
I noticed that there are slightly inconsistent messages regarding
quoting policies.

> This happens if you temporarily set "wal_level=minimal" on the server.
> WAL generated with "full_page_writes=off" was replayed during online backup
 
> pg_log_standby_snapshot() can only be used if "wal_level" >= "replica"

> WAL streaming ("max_wal_senders" > 0) requires "wal_level" to be "replica" or 
> "logical"

I think it's best to quote variable names and values separately, like
"wal_level" = "minimal" (but not use quotes for numeric values), as it
seems to be the most common practice. Anyway, we might want to unify
them.


Likewise, I saw two different versions of values with units.

> "max_stack_depth" must not exceed %ldkB.
> "vacuum_buffer_usage_limit" must be 0 or between %d kB and %d kB

I'm not sure, but it seems like the latter version is more common.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: cataloguing NOT NULL constraints

2024-05-07 Thread Kyotaro Horiguchi
Hello,

At Wed, 1 May 2024 19:49:35 +0200, Alvaro Herrera  
wrote in 
> Here are two patches that I intend to push soon (hopefully tomorrow).

This commit added and edited two error messages, resulting in slightly
different wordings, "in" and "on", for relation constraints.

+   errmsg("cannot change NO INHERIT status of NOT NULL constraint \"%s\" on 
relation \"%s\"",
===
+   errmsg("cannot change NO INHERIT status of NOT NULL constraint \"%s\" in 
relation \"%s\"",

I think we usually use "on" in this case.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Allow non-superuser to cancel superuser tasks.

2024-04-11 Thread Kyotaro Horiguchi
At Fri, 12 Apr 2024 09:10:35 +0900, Michael Paquier  wrote 
in 
> On Thu, Apr 11, 2024 at 04:55:59PM +0500, Kirill Reshke wrote:
> > The test doesn't fail because pg_terminate_backend actually meets his
> > point: autovac is killed. But while dying, autovac also receives
> > segfault. Thats because of injections points:
> > 
> > (gdb) bt
> > #0  0x56361c3379ea in tas (lock=0x7fbcb9632224 <error: Cannot access memory at address 0x7fbcb9632224>) at
> > ../../../../src/include/storage/s_lock.h:228
> > #1  ConditionVariableCancelSleep () at condition_variable.c:238
...
> > #3  0x56361c330a40 in CleanupProcSignalState (status=, arg=) at procsignal.c:240
> > #4  0x56361c328801 in shmem_exit (code=code@entry=1) at ipc.c:276
> > #9  0x56361c3378d7 in ConditionVariableTimedSleep
> > (cv=0x7fbcb9632224, timeout=timeout@entry=-1,
...
> I can see this stack trace as well.  Capturing a bit more than your
> own stack, this is crashing in the autovacuum worker while waiting on
> a condition variable when processing a ProcessInterrupts().
> 
> That may point to a legit bug with condition variables in this
> context, actually?  From what I can see, issuing a signal on a backend
> process waiting with a condition variable is able to process the
> interrupt correctly.

ProcSignalInit sets up CleanupProcSignalState to be called via
on_shmem_exit. If the CV is allocated in a DSM segment, shmem_exit
will already have detached the region containing the CV by that point.
The CV cleanup code should instead be invoked via before_shmem_exit.
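A toy model of the ordering argument, with illustrative names (this is not the real shmem_exit() machinery): every before_shmem_exit() callback runs before any on_shmem_exit() callback, so a CV sleep cleanup registered in the first phase can still touch the DSM segment that the second phase detaches.

```c
#include <assert.h>
#include <string.h>

typedef void (*exit_cb)(void);

static exit_cb before_cbs[8], on_cbs[8];
static int     n_before, n_on;
static char    trace[32];

static void before_shmem_exit_cb(exit_cb f) { before_cbs[n_before++] = f; }
static void on_shmem_exit_cb(exit_cb f)     { on_cbs[n_on++] = f; }

/* Each phase runs LIFO, and the "before" phase strictly precedes "on". */
static void toy_shmem_exit(void)
{
    while (n_before > 0) before_cbs[--n_before]();
    while (n_on > 0)     on_cbs[--n_on]();
}

static void cv_cleanup(void) { strcat(trace, "cv "); }     /* must run first */
static void dsm_detach(void) { strcat(trace, "detach"); }  /* unmaps the CV */
```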

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: NLS doesn't work for pg_combinebackup

2024-04-09 Thread Kyotaro Horiguchi
At Tue, 9 Apr 2024 15:00:27 +0900, Michael Paquier  wrote 
in 
> I've checked the whole tree, and the two you are pointing at are the
> only incorrect paths.  So applied, thanks!

Thank you for cross-checking and committing!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: NLS doesn't work for pg_combinebackup

2024-04-08 Thread Kyotaro Horiguchi
At Mon, 08 Apr 2024 16:27:02 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Hello.
> 
> I noticed that NLS doesn't work for pg_combinebackup. The cause is
> that the tool forgets to call set_pglocale_pgservice().
> 
> This issue is fixed by the following change.
> 
> diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c 
> b/src/bin/pg_combinebackup/pg_combinebackup.c
> index 1b07ca3fb6..2788c78fdd 100644
> --- a/src/bin/pg_combinebackup/pg_combinebackup.c
> +++ b/src/bin/pg_combinebackup/pg_combinebackup.c
> @@ -154,6 +154,7 @@ main(int argc, char *argv[])
>  
>   pg_logging_init(argv[0]);
>   progname = get_progname(argv[0]);
> + set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_combinebackup"));
>   handle_help_version_opts(argc, argv, progname, help);
>  
>   memset(&opt, 0, sizeof(opt));

Forgot to mention, but pg_walsummary has the same issue.

diff --git a/src/bin/pg_walsummary/pg_walsummary.c 
b/src/bin/pg_walsummary/pg_walsummary.c
index 5e41b376d7..daf6cd14ce 100644
--- a/src/bin/pg_walsummary/pg_walsummary.c
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -67,6 +67,7 @@ main(int argc, char *argv[])
 
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
+   set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_walsummary"));
handle_help_version_opts(argc, argv, progname, help);
 
/* process command-line options */


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




NLS doesn't work for pg_combinebackup

2024-04-08 Thread Kyotaro Horiguchi
Hello.

I noticed that NLS doesn't work for pg_combinebackup. The cause is
that the tool forgets to call set_pglocale_pgservice().

This issue is fixed by the following change.

diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c 
b/src/bin/pg_combinebackup/pg_combinebackup.c
index 1b07ca3fb6..2788c78fdd 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -154,6 +154,7 @@ main(int argc, char *argv[])
 
pg_logging_init(argv[0]);
progname = get_progname(argv[0]);
+   set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_combinebackup"));
handle_help_version_opts(argc, argv, progname, help);
 
memset(&opt, 0, sizeof(opt));


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: remaining sql/json patches

2024-03-21 Thread Kyotaro Horiguchi
At Fri, 22 Mar 2024 11:44:08 +0900, Amit Langote  
wrote in 
> Thanks for the heads up.
> 
> My bad, will push a fix shortly.

No problem. Thank you for the prompt correction.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: remaining sql/json patches

2024-03-21 Thread Kyotaro Horiguchi
At Wed, 20 Mar 2024 21:53:52 +0900, Amit Langote  
wrote in 
> I'll push 0001 tomorrow.

This patch (v44-0001-Add-SQL-JSON-query-functions.patch) introduced the 
following new error message:

+errmsg("can only specify 
constant, non-aggregate"
+   " function, or 
operator expression for"
+   " DEFAULT"),

I believe that our convention here is to write an error message in a
single string literal, not split into multiple parts, for better
grep'ability.
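Adjacent string literals are concatenated by the compiler, so the split form and the single-literal form are byte-identical at run time; the convention exists purely so that grep on the source finds the whole message:

```c
#include <assert.h>
#include <string.h>

/* The split form from the patch; the compiler concatenates the pieces. */
static const char *split_msg =
    "can only specify constant, non-aggregate"
    " function, or operator expression for"
    " DEFAULT";

/* The conventional single-literal form, grep'able as one source line. */
static const char *single_msg = "can only specify constant, non-aggregate function, or operator expression for DEFAULT";
```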

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Inconsistent printf placeholders

2024-03-21 Thread Kyotaro Horiguchi
Thank you for looking this.

At Tue, 19 Mar 2024 10:50:23 +0100, Peter Eisentraut  
wrote in 
> On 15.03.24 08:20, Kyotaro Horiguchi wrote:
> > diff --git a/src/backend/access/transam/twophase.c
> > b/src/backend/access/transam/twophase.c
> > @@ -1369,8 +1369,8 @@ ReadTwoPhaseFile(TransactionId xid, bool
> > missing_ok)
> >  errmsg("could not read file \"%s\":
> >  %m", path)));
> > else
> > ereport(ERROR,
> > -   (errmsg("could not read file \"%s\":
> > -   read %d of %lld",
> > -   path, r, (long long
> > -   int) stat.st_size)));
> > + (errmsg("could not read file \"%s\": read %zd of %zu",
> > + path, r, (Size) stat.st_size)));
> > }
> > pgstat_report_wait_end();
> 
> This might be worse, because stat.st_size is of type off_t, which
> could be smaller than Size/size_t.

I think you were trying to point out that off_t could be wider than
size_t, and you're right on that point. I thought it was safe since we
are trying to read the whole content of the file at once here into
palloc'ed memory.

However, on second thought, if st_size is out of the range of ssize_t
and palloc accepts that size, then at least on Linux read(2) reads
only 0x7ffff000 bytes and reaches the error reporting. In addition to
that, this size was closer to the memory allocation size limit than I
had thought.

As a result, I removed the change. However, I kept the change to the
type of the variable "r" and the corresponding %zd placeholder.

> > diff --git a/src/backend/libpq/be-secure-gssapi.c
> > b/src/backend/libpq/be-secure-gssapi.c
> > index bc04e78abb..68645b4519 100644
> > --- a/src/backend/libpq/be-secure-gssapi.c
> > +++ b/src/backend/libpq/be-secure-gssapi.c
> > @@ -572,9 +572,9 @@ secure_open_gssapi(Port *port)
> > if (input.length > PQ_GSS_RECV_BUFFER_SIZE)
> > {
> > ereport(COMMERROR,
> > -   (errmsg("oversize GSSAPI packet sent
> > -   by the client (%zu > %d)",
> > + (errmsg("oversize GSSAPI packet sent by the client (%zu > %zu)",
> > (size_t) input.length,
> > -   
> > PQ_GSS_RECV_BUFFER_SIZE)));
> > + (size_t) PQ_GSS_RECV_BUFFER_SIZE)));
> > return -1;
> > }
> >   
> 
> Might be better to add that cast to the definition of
> PQ_GSS_RECV_BUFFER_SIZE instead, so that all code can benefit.

As far as I can see, the only exceptional uses I found were a
comparison with int values and being passed as an OM_uint32 (to one of
the parameters of gss_wrap_size_limit()). Therefore, I agree that it
is beneficial. By the way, we have defined Size as the same as size_t
since 1998. Is it correct to consider Size as existing merely for
backward compatibility, meaning we should use size_t in new code? I
used size_t in the modified part of the attached patch.
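What moving the cast into the macro could look like, as a sketch (the buffer size value is the one defined in be-secure-gssapi.c; the point here is the cast placement, not the number): every comparison and every %zu format then works without a per-site cast.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: cast inside the macro definition, so all users see a size_t. */
#define PQ_GSS_RECV_BUFFER_SIZE ((size_t) 16384)

static int gss_packet_oversized(size_t input_length)
{
    /* both operands are size_t: no sign/width warning, no local cast */
    return input_length > PQ_GSS_RECV_BUFFER_SIZE;
}
```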

> > diff --git a/src/backend/replication/repl_gram.y
> > b/src/backend/replication/repl_gram.y
> > index 7474f5bd67..baa76280b9 100644
> > --- a/src/backend/replication/repl_gram.y
> > +++ b/src/backend/replication/repl_gram.y
> > @@ -312,11 +312,6 @@ timeline_history:
> > {
> > TimeLineHistoryCmd *cmd;
> >   - if ($2 <= 0)
> > -   ereport(ERROR,
> > -   
> > (errcode(ERRCODE_SYNTAX_ERROR),
> > -errmsg("invalid
> > -timeline %u",
> > -$2)));
> > -
...
> I don't think this is correct.  It loses the check for == 0.

Ugh. It's my mistake. So we need to consider unifying the messages
again. In walsummaryfuncs.c, %lld is required, but it would be silly
for the uses in repl_gram.y. In the end, I chose not to change
anything here.

> > diff --git a/src/backend/tsearch/to_tsany.c
> > b/src/backend/tsearch/to_tsany.c
> > index 88cba58cba..9d21178107 100644
> > --- a/src/backend/tsearch/to_tsany.c
> > +++ b/src/backend/tsearch/to_tsany.c
> > @@ -191,7 +

Re: Add new error_action COPY ON_ERROR "log"

2024-03-17 Thread Kyotaro Horiguchi
At Fri, 15 Mar 2024 16:57:25 +0900, Michael Paquier  wrote 
in 
> On Wed, Mar 13, 2024 at 07:32:37PM +0530, Bharath Rupireddy wrote:
> > I'm attaching the v8 patch set implementing the above idea. With this,
> > [1] is sent to the client, [2] is sent to the server log. This
> > approach not only reduces the duplicate info in the NOTICE and CONTEXT
> > messages, but also makes it easy for users to get all the necessary
> > info in the NOTICE message without having to set extra parameters to
> > get CONTEXT message.
> 
> In terms of eliminating the information duplication while allowing
> clients to know the value, the attribute and the line involved in the
> conversion failure, the approach of tweaking the context information
> has merits, I guess.
> 
> How do others feel about this kind of tuning?

If required, I think that we have already included the minimum
information necessary for the primary diagnosis, including locations,
within the primary messages. Here is an example:

> LOG:  0: invalid record length at 0/18049F8: expected at least 24, got 0

And I believe that CONTEXT, if present, is supplementary information
for the primary message. The objective of the precedent for the use of
relname_only was somewhat different, but this use also seems legit.

In short, I think the distribution between message types (primary and
context) is fine as it is in the latest patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Inconsistent printf placeholders

2024-03-15 Thread Kyotaro Horiguchi
At Fri, 15 Mar 2024 16:20:27 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I checked for that kind of msgids in a bit more intensive way. The
> numbers are the line numbers in ja.po of backend. I didn't check the
> same for other modules.
> 
> > ###: invalid timeline %lld
> >@ backup/walsummaryfuncs.c:95
> >  invalid timeline %u
> >@ repl_gram.y:318 repl_gram.y:359

"The numbers are the line numbers in ja.po" is wrong. Correctly, it
should be written as:

The information consists of two lines: the first of them is the msgid,
and the second indicates the locations where they appear.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Inconsistent printf placeholders

2024-03-15 Thread Kyotaro Horiguchi
At Fri, 15 Mar 2024 16:01:28 +1300, David Rowley  wrote 
in 
> On Fri, 15 Mar 2024 at 15:27, Kyotaro Horiguchi  
> wrote:
> > I have considered only the two messages. Actually, buffile.c and md.c
> > are already like that. The attached aligns the messages in
> > pg_combinebackup.c and reconstruct.c with the precedents.
> 
> This looks like a worthy cause to make translator work easier.
> 
> I don't want to widen the goalposts or anything, but just wondering if
> you'd searched for any others that could get similar treatment?
> 
> I only just had a quick look at the following.
> 
> $ cat src/backend/po/fr.po | grep -E "^msgid\s" | sed -E
> 's/%[a-zA-Z]+/\%/g' | sort | uniq -d -c
>  31 msgid ""
>   2 msgid "could not accept SSL connection: %"
>   2 msgid "could not initialize LDAP: %"
>   2 msgid "could not look up local user ID %: %"
...
> I've not looked at how hard it would be with any of the above to
> determine how hard it would be to make the formats consistent.  The
> 3rd last one seems similar enough that it might be worth doing
> together with this?


I checked for that kind of msgids in a bit more intensive way. The
numbers are the line numbers in ja.po of backend. I didn't check the
same for other modules.

> ###: invalid timeline %lld
>@ backup/walsummaryfuncs.c:95
>  invalid timeline %u
>@ repl_gram.y:318 repl_gram.y:359

In the first case, the %lld can be more than 2^31.

In the second case, %u is uint32. However, the bigger issue is
checking if the uint32 value is negative:

repl_gram.c: 147
>   if ((yyvsp[0].uintval) <= 0)
>   ereport(ERROR,
>   (errcode(ERRCODE_SYNTAX_ERROR),
>errmsg("invalid timeline %u",
> (yyvsp[0].uintval)));

which cannot happen. I think we can simply remove the useless error
check.
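The reason the check can never fire for negative input, isolated as a standalone function: for an unsigned operand, "<= 0" degenerates to "== 0", and a negative value wraps before the comparison.

```c
#include <assert.h>

/* The quoted repl_gram.y test in isolation: with an unsigned operand,
 * "tli <= 0" is exactly "tli == 0"; it cannot detect a negative. */
static int timeline_check_fires(unsigned int tli)
{
    return tli <= 0;
}
```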

> ###: could not read file \"%s\": read %d of %zu
>@ ../common/controldata_utils.c:116 ../common/controldata_utils.c:119 
> access/transam/xlog.c:3417 access/transam/xlog.c:4278 
> replication/logical/origin.c:750 replication/logical/origin.c:789 
> replication/logical/snapbuild.c:2040 replication/slot.c:2218 
> replication/slot.c:2259 replication/walsender.c:660 
> utils/cache/relmapper.c:833
>  could not read file \"%s\": read %d of %lld
>@ access/transam/twophase.c:1372
>  could not read file \"%s\": read %zd of %zu
>@ backup/basebackup.c:2102
> ###: oversize GSSAPI packet sent by the client (%zu > %zu)
>@ libpq/be-secure-gssapi.c:351
>  oversize GSSAPI packet sent by the client (%zu > %d)
>@ libpq/be-secure-gssapi.c:575
> ###: compressed segment file \"%s\" has incorrect uncompressed size %d, 
> skipping
>@ pg_receivewal.c:362
>  compressed segment file \"%s\" has incorrect uncompressed size %zu, 
> skipping
>@ pg_receivewal.c:448
> ###: could not read file \"%s\": read only %zd of %lld bytes
>@ pg_combinebackup.c:1304
>  could not read file \"%s\": read only %d of %u bytes
>@ reconstruct.c:514


We can "fix" them the same way.  I debated whether to use ssize_t for
read() and replace all instances of size_t with Size. However, in the
end, I decided to only keep it consistent with the surrounding code.

> ###: index %d out of valid range, 0..%d
>@ utils/adt/varlena.c:3220 utils/adt/varlena.c:3287
>  index %lld out of valid range, 0..%lld
>@ utils/adt/varlena.c:3251 utils/adt/varlena.c:3323
> ###: string is too long for tsvector (%d bytes, max %d bytes)
>@ tsearch/to_tsany.c:194 utils/adt/tsvector.c:277 
> utils/adt/tsvector_op.c:1126
>  string is too long for tsvector (%ld bytes, max %ld bytes)
>@ utils/adt/tsvector.c:223

We can unify them, and I did so in the attached patch, but I'm not
sure that's sensible.

> ###: could not look up local user ID %d: %s
>@ ../port/user.c:43 ../port/user.c:79
>  could not look up local user ID %ld: %s
>@ libpq/auth.c:1886

Both of the above use uid_t (defined as int) as the user ID, and it is
explicitly cast to "int" in the first case and to "long" in the
second, which seems mysterious. Although I'm not sure if there's a
possibility of uid_t being widened in the future, I unified them to
the latter form for robustness.
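The unified form, sketched with an illustrative function name: casting uid_t to long pins down the printf width regardless of how the platform defines uid_t.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Format a uid_t through a fixed-width type, as in the "%ld" variant
 * kept for robustness against uid_t being widened someday. */
static void format_uid_error(char *buf, size_t len, uid_t uid)
{
    snprintf(buf, len, "could not look up local user ID %ld", (long) uid);
}
```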

> ###: error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %m
>@ file.c:44
>  error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s
>@ file.c:65
> 
> The la

Re: Typos in reorderbuffer.c.

2024-03-14 Thread Kyotaro Horiguchi
At Thu, 14 Mar 2024 11:23:38 +0530, Amit Kapila  wrote 
in 
> On Thu, Mar 14, 2024 at 9:58 AM Kyotaro Horiguchi
>  wrote:
> >
> > While examining reorderbuffer.c, I found several typos. I'm not sure
> > if fixing them is worthwhile, but I've attached a fix just in case.
> >
> 
> LGTM. I'll push this in some time.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Inconsistent printf placeholders

2024-03-14 Thread Kyotaro Horiguchi
Thank you for the suggestions.

At Thu, 14 Mar 2024 13:45:41 +0100, Daniel Gustafsson  wrote 
in 
> I've only skimmed this so far but +1 on keeping the messages the same where
> possible to reduce translation work.  Adding a comment on the message where 
> the
> casting is done to indicate that it is for translation might reduce the risk 
> of
> it "getting fixed" down the line.

Added a comment "/* cast xxx to avoid extra translatable messages */".

At Thu, 14 Mar 2024 14:02:46 +0100, Peter Eisentraut  
wrote in 
> If you want to make them uniform, then I suggest the error messages

Yeah. Having the same messages with only the placeholders changed is
not very pleasing during translation. If possible, I would like to
align them.

> should both be "%zd of %zu bytes", which are the actual types read()
> deals with.

I have considered only the two messages. Actually, buffile.c and md.c
are already like that. The attached aligns the messages in
pg_combinebackup.c and reconstruct.c with the precedents.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 6f0814d9ac..feb4d5dcf4 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -1301,8 +1301,9 @@ slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
 		if (rb < 0)
 			pg_fatal("could not read file \"%s\": %m", filename);
 		else
-			pg_fatal("could not read file \"%s\": read only %zd of %lld bytes",
-	 filename, rb, (long long int) st.st_size);
+			/* cast st_size to avoid extra translatable messages */
+			pg_fatal("could not read file \"%s\": read only %zd of %zu bytes",
+	 filename, rb, (size_t) st.st_size);
 	}
 
 	/* Adjust buffer length for new data and restore trailing-\0 invariant */
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 41f06bb26b..a4badb90e2 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -504,15 +504,16 @@ make_rfile(char *filename, bool missing_ok)
 static void
 read_bytes(rfile *rf, void *buffer, unsigned length)
 {
-	int			rb = read(rf->fd, buffer, length);
+	ssize_t		rb = read(rf->fd, buffer, length);
 
 	if (rb != length)
 	{
 		if (rb < 0)
 			pg_fatal("could not read file \"%s\": %m", rf->filename);
 		else
-			pg_fatal("could not read file \"%s\": read only %d of %u bytes",
-	 rf->filename, rb, length);
+			/* cast length to avoid extra translatable messages */
+			pg_fatal("could not read file \"%s\": read only %zd of %zu bytes",
+	 rf->filename, rb, (size_t) length);
 	}
 }
 


Typos in reorderbuffer.c.

2024-03-13 Thread Kyotaro Horiguchi
Hello.

While examining reorderbuffer.c, I found several typos. I'm not sure
if fixing them is worthwhile, but I've attached a fix just in case.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 001f901ee6..92cf39ff74 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -32,7 +32,7 @@
  *
  *	  In order to cope with large transactions - which can be several times as
  *	  big as the available memory - this module supports spooling the contents
- *	  of a large transactions to disk. When the transaction is replayed the
+ *	  of large transactions to disk. When the transaction is replayed the
  *	  contents of individual (sub-)transactions will be read from disk in
  *	  chunks.
  *
@@ -3078,10 +3078,10 @@ ReorderBufferImmediateInvalidation(ReorderBuffer *rb, uint32 ninvalidations,
  * least once for every xid in XLogRecord->xl_xid (other places in records
  * may, but do not have to be passed through here).
  *
- * Reorderbuffer keeps some datastructures about transactions in LSN order,
+ * Reorderbuffer keeps some data structures about transactions in LSN order,
  * for efficiency. To do that it has to know about when transactions are seen
  * first in the WAL. As many types of records are not actually interesting for
- * logical decoding, they do not necessarily pass though here.
+ * logical decoding, they do not necessarily pass through here.
  */
 void
 ReorderBufferProcessXid(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
@@ -3513,11 +3513,11 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
  * snapshot because we don't decode such transactions.  Also, we do not select
  * the transaction which doesn't have any streamable change.
  *
- * Note that, we skip transactions that contains incomplete changes. There
+ * Note that, we skip transactions that contain incomplete changes. There
  * is a scope of optimization here such that we can select the largest
  * transaction which has incomplete changes.  But that will make the code and
  * design quite complex and that might not be worth the benefit.  If we plan to
- * stream the transactions that contains incomplete changes then we need to
+ * stream the transactions that contain incomplete changes then we need to
  * find a way to partially stream/truncate the transaction changes in-memory
  * and build a mechanism to partially truncate the spilled files.
  * Additionally, whenever we partially stream the transaction we need to
@@ -4701,8 +4701,8 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
 }
 
 /*
- * Rejigger change->newtuple to point to in-memory toast tuples instead to
- * on-disk toast tuples that may not longer exist (think DROP TABLE or VACUUM).
+ * Rejigger change->newtuple to point to in-memory toast tuples instead of
+ * on-disk toast tuples that may no longer exist (think DROP TABLE or VACUUM).
  *
  * We cannot replace unchanged toast tuples though, so those will still point
  * to on-disk toast data.
@@ -4713,7 +4713,7 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * at unexpected times.
  *
  * We simply subtract size of the change before rejiggering the tuple, and
- * then adding the new size. This makes it look like the change was removed
+ * then add the new size. This makes it look like the change was removed
  * and then added back, except it only tweaks the accounting info.
  *
  * In particular it can't trigger serialization, which would be pointless


Inconsistent printf placeholders

2024-03-13 Thread Kyotaro Horiguchi
Hello.

A recent commit 6612185883 introduced two error messages that are
identical in text but differ in their placeholders.

-   pg_fatal("could not read file \"%s\": read only %d of 
%d bytes",
-filename, (int) rb, (int) st.st_size);
+   pg_fatal("could not read file \"%s\": read only %zd of 
%lld bytes",
+filename, rb, (long long int) 
st.st_size);
...
-   pg_fatal("could not read file \"%s\": read only %d of 
%d bytes",
+   pg_fatal("could not read file \"%s\": read only %d of 
%u bytes",
 rf->filename, rb, length);

I'd be happy if the two messages stayed consistent. I suggest aligning
the types instead of making the messages differ, as attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 41f06bb26b..e3b8e84289 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -504,15 +504,15 @@ make_rfile(char *filename, bool missing_ok)
 static void
 read_bytes(rfile *rf, void *buffer, unsigned length)
 {
-	int			rb = read(rf->fd, buffer, length);
+	ssize_t		rb = read(rf->fd, buffer, length);
 
 	if (rb != length)
 	{
 		if (rb < 0)
 			pg_fatal("could not read file \"%s\": %m", rf->filename);
 		else
-			pg_fatal("could not read file \"%s\": read only %d of %u bytes",
-	 rf->filename, rb, length);
+			pg_fatal("could not read file \"%s\": read only %zd of %lld bytes",
+	 rf->filename, rb, (long long int) length);
 	}
 }
 


Re: Infinite loop in XLogPageRead() on standby

2024-03-12 Thread Kyotaro Horiguchi
At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Oh, I once saw the fix work, but seems not to be working after some
> point. The new issue was a corruption of received WAL records on the
> first standby, and it may be related to the setting.

I identified the cause of the second issue. When I tried to replay the
issue, the second standby accidentally received the old timeline's
last page-spanning record till the end while the first standby was
promoting (but it had not been read by recovery). In addition to that,
on the second standby, there's a time window where the timeline
increased but the first segment of the new timeline is not available
yet. In this case, the second standby successfully reads the
page-spanning record in the old timeline even after the second standby
noticed that the timeline ID has been increased, thanks to the
robustness of XLogFileReadAnyTLI().

I think the primary change to XLogPageRead that I suggested is correct
(assuming the use of wal_segment_size instead of the
constant). However, still XLogFileReadAnyTLI() has a chance to read
the segment from the old timeline after the second standby notices a
timeline switch, leading to the second issue. The second issue was
fixed by preventing XLogFileReadAnyTLI from reading segments from
older timelines than those suggested by the latest timeline
history. (In other words, disabling the "AnyTLI" part).

I recall that there was a discussion for commit 4bd0ad9e44, about the
objective of allowing reading segments from older timelines than the
timeline history suggests. In my faint memory, we concluded to
postpone the decision to remove the feature due to uncertainty
about the objective. If there's no clear reason to continue using
XLogFileReadAnyTLI(), I suggest we stop its use and instead adopt
XLogFileReadOnTLHistory(), which reads segments that align precisely
with the timeline history.

Of course, regardless of the changes above, if recovery on the second
standby had reached the end of the page-spanning record before
redirection to the first standby, it would need pg_rewind to connect
to the first standby.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Infinite loop in XLogPageRead() on standby

2024-03-11 Thread Kyotaro Horiguchi
At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin  
wrote in 
> Hmm, I think you meant to use wal_segment_size, because 0x10 is just
> 1MB. As a result, currently it works for you by accident.

Oh, I once saw the fix work, but seems not to be working after some
point. The new issue was a corruption of received WAL records on the
first standby, and it may be related to the setting.

> > Thus, I managed to reproduce precisely the same situation as you
> > described utilizing your script with modifications and some core
> > tweaks, and with the change above, I saw that the behavior was
> > fixed. However, for reasons unclear to me, it shows another issue, and
> > I am running out of time and need more caffeine. I'll continue
> > investigating this tomorrow.
> >
> 
> Thank you for spending your time on it!

You're welcome, but I apologize for the delay in the work.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Infinite loop in XLogPageRead() on standby

2024-03-06 Thread Kyotaro Horiguchi
At Tue, 5 Mar 2024 09:36:44 +0100, Alexander Kukushkin  
wrote in 
> Please find attached the patch fixing the problem and the updated TAP test
> that addresses Nit.

Record-level retries happen when the upper layer detects errors. In my
previous mail, I cited code that is intended to prevent this at
segment boundaries. However, the resulting code applies to all page
boundaries, since we judged that the difference doesn't significantly
affect the outcome.

> * Check the page header immediately, so that we can retry immediately if
> * it's not valid. This may seem unnecessary, because ReadPageInternal()
> * validates the page header anyway, and would propagate the failure up to

So, the following (tentative) change should also work.

xlogrecovery.c:
@@ -3460,8 +3490,10 @@ retry:
 * responsible for the validation.
 */
if (StandbyMode &&
+   targetPagePtr % 0x10 == 0 &&
!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
{

Thus, I managed to reproduce precisely the same situation as you
described utilizing your script with modifications and some core
tweaks, and with the change above, I saw that the behavior was
fixed. However, for reasons unclear to me, it shows another issue, and
I am running out of time and need more caffeine. I'll continue
investigating this tomorrow.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: initdb's -c option behaves wrong way?

2024-03-04 Thread Kyotaro Horiguchi
At Mon, 4 Mar 2024 09:39:39 +0100, Daniel Gustafsson  wrote in 
> 
> 
> > On 4 Mar 2024, at 02:01, Tom Lane  wrote:
> > 
> > Daniel Gustafsson  writes:
> >> I took the liberty to add this to the upcoming CF to make sure we don't 
> >> lose
> >> track of it.
> > 
> > Thanks for doing that, because the cfbot pointed out a problem:
> > I should have written pg_strncasecmp not strncasecmp.  If this
> > version tests cleanly, I'll push it.
> 
> +1, LGTM.

Thank you for fixing this, Tom!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 01 Mar 2024 12:37:55 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Anyway, our current policy here is to avoid record-rereads beyond
> source switches. However, fixing this seems to require that source
> switches cause record rereads unless some additional information is
> available to know if replay LSN needs to back up.

It seems to me that the error messages are related to commit 0668719801.

XLogPageRead:
> * Check the page header immediately, so that we can retry immediately if
> * it's not valid. This may seem unnecessary, because ReadPageInternal()
> * validates the page header anyway, and would propagate the failure up to
> * ReadRecord(), which would retry. However, there's a corner case with
> * continuation records, if a record is split across two pages such that
> * we would need to read the two pages from different sources. For
...
>   if (StandbyMode &&
>       !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
>   {
>   /*
>* Emit this error right now then retry this page immediately. 
> Use
>* errmsg_internal() because the message was already translated.
>*/
>   if (xlogreader->errormsg_buf[0])
>           ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
>                   (errmsg_internal("%s", xlogreader->errormsg_buf)));

This code intends to prevent a page header error from causing a record
reread, when a record is required to be read from multiple sources. We
could restrict this to only fire at segment boundaries. At segment
boundaries, we won't let LSN back up by using XLP_FIRST_IS_CONTRECORD.

Having thought up to this point, I now believe that we should
completely prevent LSN from going back in any case. One drawback is
that the fix cannot be back-patched.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 01 Mar 2024 12:04:31 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > After reading this, I came up with a possibility that walreceiver
> > recovers more quickly than the calling interval to
> > WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call
> > to the function WaitForWAL...(), and then somehow recovers the
> > connection before the next call, the function doesn't notice the
> > disconnection and returns XLREAD_SUCCESS wrongly. If this assumption
> > is correct, the correct fix might be for us to return XLREAD_FAIL when
> > reconnection happens after the last call to the WaitForWAL...()
> > function.
> 
> That was my mistake. The function performs reconnection by itself.

Anyway, our current policy here is to avoid record-rereads beyond
source switches. However, fixing this seems to require that source
switches cause record rereads unless some additional information is
available to know if replay LSN needs to back up.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> After reading this, I came up with a possibility that walreceiver
> recovers more quickly than the calling interval to
> WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call
> to the function WaitForWAL...(), and then somehow recovers the
> connection before the next call, the function doesn't notice the
> disconnection and returns XLREAD_SUCCESS wrongly. If this assumption
> is correct, the correct fix might be for us to return XLREAD_FAIL when
> reconnection happens after the last call to the WaitForWAL...()
> function.

That was my mistake. The function performs reconnection by itself.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 1 Mar 2024 08:17:04 +0900, Michael Paquier  wrote 
in 
> On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:
> > On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi 
> > wrote:
> >> In the first place, it's important to note that we do not guarantee
> >> that an async standby can always switch its replication connection to
> >> the old primary or another sibling standby. This is due to the
> >> variations in replication lag among standbys. pg_rewind is required to
> >> adjust such discrepancies.
> > 
> > Sure, I know. But in this case the async standby received and flushed
> > absolutely the same amount of WAL as the promoted one.
> 
> Ugh.  If it can happen transparently to the user without the user
> knowing directly about it, that does not sound good to me.  I did not
> look very closely at monitoring tools available out there, but if both
> standbys flushed the same WAL locations a rewind should not be
> required.  It is not something that monitoring tools would be able to
> detect because they just look at LSNs.
> 
> >> In the repro script, the replication connection of the second standby
> >> is switched from the old primary to the first standby after its
> >> promotion. After the switching, replication is expected to continue
> >> from the beginning of the last replayed segment.
> > 
> > Well, maybe, but apparently the standby is busy trying to decode a record
> > that spans multiple pages, and it is just infinitely waiting for the next
> > page to arrive. Also, the restart "fixes" the problem, because indeed it is
> > reading the file from the beginning.
> 
> What happens if the continuation record spawns across multiple segment
> files boundaries in this case?  We would go back to the beginning of
> the segment where the record spawning across multiple segments began,
> right?  (I may recall this part of contrecords incorrectly, feel free
> to correct me if necessary.)

After reading this, I came up with a possibility that walreceiver
recovers more quickly than the calling interval to
WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call
to the function WaitForWAL...(), and then somehow recovers the
connection before the next call, the function doesn't notice the
disconnection and returns XLREAD_SUCCESS wrongly. If this assumption
is correct, the correct fix might be for us to return XLREAD_FAIL when
reconnection happens after the last call to the WaitForWAL...()
function.

> >> But with the script,
> >> the second standby copies the intentionally broken file, which differs
> >> from the data that should be received via streaming.
> > 
> > As I already said, this is a simple way to emulate the primary crash while
> > standbys receiving WAL.
> > It could easily happen that the record spans on multiple pages is not fully
> > received and flushed.
> 
> I think that's OK to do so at test level to force a test in the
> backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
> proved that this can be designed to be cheap and stable across the
> buildfarm fleet.

Yeah, I agree that it clearly illustrates the final state after the
issue happened, but if my assumption above is correct, the test
doesn't manifest the real issue.

> For anything like that, the result had better have solid test
> coverage, where perhaps we'd better refactor some of the routines of
> 039_end_of_wal.pl into a module to use them here, rather than
> duplicate the code.  The other test has a few assumptions with the
> calculation of page boundaries, and I'd rather not duplicate that
> across the tree.

I agree on that point.

> Seeing that Alexander is a maintainer of Patroni, which is very
> probably used by his employer across a large set of PostgreSQL
> instances, if he says that he's seen that in the field, that's good
> enough for me to respond to the problem, especially if reconnecting a
> standby to a promoted node where both flushed the same LSN.  Now the
> level of response also depends on the invasiness of the change, and we
> need a very careful evaluation here.

I don't mean to say that we should respond with DNF to this "issue" at
all. I simply wanted to make the real issue clear.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Infinite loop in XLogPageRead() on standby

2024-02-28 Thread Kyotaro Horiguchi
At Thu, 29 Feb 2024 14:05:15 +0900, Michael Paquier  wrote 
in 
> On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:
> > I spent some time debugging an issue with standby not being able to
> > continue streaming after failover.
> >
> > The problem happens when standbys received only the first part of the WAL
> > record that spans multiple pages.
> > In this case the promoted standby discards the first part of the WAL record
> > and writes END_OF_RECOVERY instead. If in addition to that someone will
> > call pg_switch_wal(), then there are chances that SWITCH record will also
> > fit to the page where the discarded part was settling, As a result the
> > other standby (that wasn't promoted) will infinitely try making attempts to
> > decode WAL record span on multiple pages by reading the next page, which is
> > filled with zero bytes. And, this next page will never be written, because
> > the new primary will be writing to the new WAL file after pg_switch_wal().

In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.

I might be overlooking something, but I don't understand how this
occurs without purposefully tweaking WAL files. The repro script
pushes an incomplete WAL file to the archive as a non-partial
segment. This shouldn't happen in the real world.

In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switching, replication is expected to continue
from the beginning of the last replayed segment. But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming. A similar problem
to the issue here was seen at segment boundaries, before we introduced
the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag, which prevents overwriting
a WAL file that is already archived. However, in this case, the second
standby won't see the broken record because it cannot be in a
non-partial segment in the archive, and the new primary streams
END_OF_RECOVERY instead of the broken record.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-28 Thread Kyotaro Horiguchi
At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera  
wrote in 
> Here's the complete set, with these two names using the singular.

The commit by the second patch added several GUC descriptions:

> Sets the size of the dedicated buffer pool used for the commit timestamp 
> cache.

Some of them (commit_timestamp_buffers, transaction_buffers, and
subtransaction_buffers) use 0 to mean auto-tuning based on the
shared buffer size. I think it's worth adding an extra_desc such as "0
to automatically determine this value based on the shared buffer
size".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: MultiXact\SLRU buffers configuration

2024-02-28 Thread Kyotaro Horiguchi
At Sat, 3 Feb 2024 22:32:45 +0500, "Andrey M. Borodin"  
wrote in 
> Here's the test draft. This test reliably reproduces sleep on CV when waiting 
> next multixact to be filled into "members" SLRU.

By the way, I raised a question about using multiple CVs
simultaneously [1]. That is, I suspect that the current CV
implementation doesn't allow us to use multiple condition variables at
the same time, because all CVs use the same PCPROC member cvWaitLink
to accommodate different waiter sets. If this assumption is correct,
we should resolve the issue before spreading more uses of CVs.

[1] 
https://www.postgresql.org/message-id/20240227.150709.1766217736683815840.horikyota.ntt%40gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: A failure in t/001_rep_changes.pl

2024-02-26 Thread Kyotaro Horiguchi
At Fri, 23 Feb 2024 15:50:21 +0530, vignesh C  wrote in 
> By any chance do you have the log files when this failure occurred, if
> so please share it.

In my understanding, within a single instance, no two proclists can
simultaneously share the same waitlink member of PGPROC.

On the other hand, a publisher uses two condition variables for slots
and WAL waiting, which work on the same PGPROC member cvWaitLink. I
suspect this issue arises from the configuration. However, although it
is unlikely to be related to this specific issue, a similar problem can arise
in instances that function both as logical publisher and physical
primary.

Regardless of this issue, I think we should provide separate waitlink
members for condition variables that can possibly be used
simultaneously.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: About a recently-added message

2024-02-21 Thread Kyotaro Horiguchi
At Thu, 22 Feb 2024 10:51:07 +0530, Amit Kapila  wrote 
in 
> > Do you think some additional tests for the rest of the messages are
> > worth the trouble?
> >
> 
> We have discussed this during development and didn't find it worth
> adding tests for all misconfigured parameters. However, in the next
> patch where we are planning to add a slot sync worker that will
> automatically sync slots, we are adding a test for one more parameter.
> I am not against adding tests for all the parameters but it didn't
> appeal to add more test cycles for this.

Thanks for the explanation. I'm fine with that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: About a recently-added message

2024-02-21 Thread Kyotaro Horiguchi
At Thu, 22 Feb 2024 09:36:43 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Yes, I'm happy with all of the changes.  The proposed patch appears to
> cover all instances related to slotsync.c, and it looks fine to
> me. Thanks!

I'd like to raise another potential issue outside the patch. The patch
needed to change only one test item even though it changed nine
messages. This means eight of the nine messages that the patch changed
are not covered by our tests. I doubt all of them are worth additional
test items; however, I think we want to increase coverage.

Do you think some additional tests for the rest of the messages are
worth the trouble?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: About a recently-added message

2024-02-21 Thread Kyotaro Horiguchi
At Wed, 21 Feb 2024 14:57:42 +0530, Amit Kapila  wrote 
in 
> On Tue, Feb 20, 2024 at 3:21 PM shveta malik  wrote:
> >
> > okay, attached v2 patch with changed error msgs and double quotes
> > around logical.
> >
> 
> Horiguchi-San, does this address all your concerns related to
> translation with these new messages?

Yes, I'm happy with all of the changes.  The proposed patch appears to
cover all instances related to slotsync.c, and it looks fine to
me. Thanks!

I found that logical.c also uses the policy that I complained
about, but that is a separate issue.

./logical.c:122:	errmsg("logical decoding requires wal_level >= logical")));

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Kyotaro Horiguchi
At Tue, 20 Feb 2024 19:56:10 +0530, Robert Haas  wrote 
in 
> It seems like maybe somebody should look into why this is happening,
> and perhaps fix it.

GetConnection()@streamutil.c wants to ensure conninfo has a fallback
database name ("replication"). However, the function seems to be
ignoring the case where neither dbname nor connection string is given,
which is the first case Kuroda-san raised. The second case is the
intended behavior of the function.

>   /* pg_recvlogical uses dbname only; others use connection_string only. 
> */
>   Assert(dbname == NULL || connection_string == NULL);

And the function incorrectly assumes that the connection string
requires "dbname=replication".

>* Merge the connection info inputs given in form of connection string,
>* options and default values (dbname=replication, replication=true, 
> etc.)

But the name is a pseudo database name only used by pg_hba.conf
(maybe), which cannot be used as an actual database name (without
quotation marks, or unless it is actually created). The function
should not add the fallback database name because the connection
string for physical replication connection doesn't require the dbname
parameter. (attached patch)

About the proposed patch, pg_basebackup cannot verify the validity of
the dbname. It could be problematic.

Although I haven't looked the original thread, it seems that the
dbname is used only by pg_sync_replication_slots(). If it is true,
couldn't we make the SQL function require a database name to make a
connection, instead of requiring it in physical-replication conninfo?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 56d1b15951..da63a04de4 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -78,7 +78,7 @@ GetConnection(void)
 
/*
 * Merge the connection info inputs given in form of connection string,
-	 * options and default values (dbname=replication, replication=true, etc.)
+	 * options and default values (replication=true, etc.)
 */
i = 0;
if (connection_string)
@@ -96,14 +96,6 @@ GetConnection(void)
keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
values = pg_malloc0((argcount + 1) * sizeof(*values));
 
-   /*
-	 * Set dbname here already, so it can be overridden by a dbname in the
-* connection string.
-*/
-   keywords[i] = "dbname";
-   values[i] = "replication";
-   i++;
-
	for (conn_opt = conn_opts; conn_opt->keyword != NULL; conn_opt++)
{
if (conn_opt->val != NULL && conn_opt->val[0] != '\0')


Re: Do away with zero-padding assumption before WALRead()

2024-02-18 Thread Kyotaro Horiguchi
At Mon, 19 Feb 2024 11:02:39 +0530, Bharath Rupireddy 
 wrote in 
> > After all, the patch looks good to me.
> 
> Thanks. It was committed -
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=73f0a1326608ac3a7d390706fdeec59fe4dc42c0.

Yeah. I realized that after I had already sent the mail. No harm done :p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: A new message seems missing a punctuation

2024-02-18 Thread Kyotaro Horiguchi
At Mon, 19 Feb 2024 10:31:33 +0530, Robert Haas  wrote 
in 
> On Mon, Feb 19, 2024 at 10:10 AM Kyotaro Horiguchi
>  wrote:
> > A recent commit (7a424ece48) added the following message:
> >
> > > could not sync slot information as remote slot precedes local slot:
> > > remote slot "%s": LSN (%X/%X), catalog xmin (%u) local slot: LSN
> > > (%X/%X), catalog xmin (%u)
> >
> > Since it is a bit overloaded but doesn't have a separator between
> > "catalog xmin (%u)" and "local slot:", it is somewhat cluttered. Don't
> > we need something, for example a semicolon there to improve
> > readability and reduce clutter?
> 
> I think maybe we should even revise the message even more. In most
> places we do not just print out a whole bunch of values, but use a
> sentence to connect them.

Mmm. Something like this?:

"could not sync slot information: local slot LSN (%X/%X) or xmin (%u)
 behind remote (%X/%X, %u)"

Or I thought the values could be moved to DETAILS: line.

By the way, the code around the message is as follows.

> /*
>  * The remote slot didn't catch up to locally reserved position.
>  *
>  * We do not drop the slot because the restart_lsn can be ahead of the
>  * current location when recreating the slot in the next cycle. It may
>  * take more time to create such a slot. Therefore, we keep this slot
>  * and attempt the synchronization in the next cycle.
>  *
>  * XXX should this be changed to elog(DEBUG1) perhaps?
>  */
> ereport(LOG,
>         errmsg("could not sync slot information as remote slot precedes local slot:"
>                " remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN (%X/%X), catalog xmin (%u)",

If we change this message to DEBUG1, a timeout feature to show a
WARNING message might be needed for DBAs to notice that something bad
is ongoing. However, it doesn't seem appropriate as a LOG message to
me. Perhaps, a LOG message should be more like:

> "LOG:  waiting for local slot to catch up to remote"
> "DETAIL:  remote slot LSN (%X/%X); "

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


A new message seems missing a punctuation

2024-02-18 Thread Kyotaro Horiguchi
A recent commit (7a424ece48) added the following message:

> could not sync slot information as remote slot precedes local slot:
> remote slot "%s": LSN (%X/%X), catalog xmin (%u) local slot: LSN
> (%X/%X), catalog xmin (%u)

Since it is a bit overloaded but doesn't have a separator between
"catalog xmin (%u)" and "local slot:", it is somewhat cluttered. Don't
we need something, for example a semicolon there to improve
readability and reduce clutter?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Do away with zero-padding assumption before WALRead()

2024-02-18 Thread Kyotaro Horiguchi
At Mon, 19 Feb 2024 11:56:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Yeah, perhaps I was overly concerned. The removed comment made me
> think that someone could add code relying on the incorrect assumption
> that the remaining bytes beyond the returned count are cleared out. On
> the flip side, SimpleXLogPageRead always reads a whole page and
> returns XLOG_BLCKSZ. However, as you know, the returned buffer doesn't
> contain random garbage bytes. Therefore, it's safe as long as the

Forgot to mention that there is a case involving non-initialized
pages, but it doesn't affect the correctness of the description you
pointed out.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Do away with zero-padding assumption before WALRead()

2024-02-18 Thread Kyotaro Horiguchi
At Fri, 16 Feb 2024 19:50:00 +0530, Bharath Rupireddy 
 wrote in 
> On Fri, Feb 16, 2024 at 7:10 AM Kyotaro Horiguchi
>  wrote:
> > 1. It's useless to copy the whole page regardless of the 'count'. It's
> >   enough to copy only up to the 'count'. The patch looks good in this
> >   regard.
> 
> Yes, it's not needed to copy the whole page. Per my understanding
> about page transfers between disk and OS page cache - upon OS page
> cache miss, the whole page (of disk block size) gets fetched from disk
> even if we just read 'count' bytes (< disk block size).

Right, but with a possibly-different block size. Anyway that behavior
doesn't affect the result of this change. (Could affect performance
hereafter if it were not the case, though..)

> > 2. Maybe we need a comment that states the page_read callback
> >   functions leave garbage bytes beyond the returned count, due to
> >   partial copying without clearing the unused portion.
> 
> Isn't the comment around page_read callback at
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/xlogreader.h;h=2e9e5f43eb2de1ca9ba81afe76d21357065c61aa;hb=d57b7cc3338e9d9aa1d7c5da1b25a17c5a72dcce#l78
> enough?
> 
> "The callback shall return the number of bytes read (never more than
> XLOG_BLCKSZ), or -1 on failure."

Yeah, perhaps I was overly concerned. The removed comment made me
think that someone could add code relying on the incorrect assumption
that the remaining bytes beyond the returned count are cleared out. On
the flip side, SimpleXLogPageRead always reads a whole page and
returns XLOG_BLCKSZ. However, as you know, the returned buffer doesn't
contain random garbage bytes. Therefore, it's safe as long as the
caller doesn't access beyond the returned count. As a result, the
description you pointed out seems to be enough.

After all, the patch looks good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Do away with zero-padding assumption before WALRead()

2024-02-15 Thread Kyotaro Horiguchi
At Tue, 13 Feb 2024 11:47:06 +0530, Bharath Rupireddy 
 wrote in 
> Hi,
> 
> I noticed an assumption [1] at WALRead() call sites expecting the
> flushed WAL page to be zero-padded after the flush LSN. I think this
> can't always be true as the WAL can get flushed after determining the
> flush LSN before reading it from the WAL file using WALRead(). I've
> hacked the code up a bit to check if that's true -

Good catch! The comment seems wrong also to me. The subsequent bytes
can be written simultaneously, and it's very normal that there are
unflushed bytes in the OS's page buffer. The objective of the comment
seems to be to declare that there's no need to clear out the remaining
bytes, here. I agree that it's not a problem for now. However, I think
we need two fixes here.

1. It's useless to copy the whole page regardless of the 'count'. It's
  enough to copy only up to the 'count'. The patch looks good in this
  regard.

2. Maybe we need a comment that states the page_read callback
  functions leave garbage bytes beyond the returned count, due to
  partial copying without clearing the unused portion.

> I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ
> despite knowing exactly how much to read. Is it to tell the OS to
> explicitly fetch the whole page from the disk? If yes, the OS will do
> that anyway because the page transfers from disk to OS page cache are
> always in terms of disk block sizes, no?

If I understand your question correctly, I guess that the whole-page
copy was expected to clear out the remaining bytes, as I mentioned
earlier.

> Although, there's no immediate problem with it right now, the
> assumption is going to be incorrect when reading WAL from WAL buffers
> using WALReadFromBuffers -
> https://www.postgresql.org/message-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hla_hsgwnx2y5-g+mswd...@mail.gmail.com.
>
> If we have no reason, can the WALRead() callers just read how much
> they want like walsender for physical replication? Attached a patch
> for the change.
> 
> Thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: About a recently-added message

2024-02-14 Thread Kyotaro Horiguchi
At Thu, 15 Feb 2024 09:22:23 +0530, shveta malik  wrote 
in 
> On Thu, Feb 15, 2024 at 8:26 AM Amit Kapila  wrote:
> >
> > On Wed, Feb 14, 2024 at 7:51 PM Euler Taveira  wrote:
> > >
> > > On Wed, Feb 14, 2024, at 8:45 AM, Amit Kapila wrote:
> > >
> > > Now, I am less clear about whether to quote "logical" or not in the
> > > above message. Do you have any suggestions?
> > >
> > >
> > > The possible confusion comes from the fact that the sentence contains 
> > > "must be"
> > > in the middle of a comparison expression. For pg_createsubscriber, we are 
> > > using
> > >
> > > publisher requires wal_level >= logical
> > >
> > > I suggest to use something like
> > >
> > > slot synchronization requires wal_level >= logical
> > >
> >
> > This sounds like a better idea but shouldn't we directly use this in
> > 'errmsg' instead of a separate 'errhint'? Also, if change this, then
> > we should make a similar change for other messages in the same
> > function.
> 
> +1 on changing the msg(s) suggested way. Please find the patch for the
> same. It also removes double quotes around the variable names

Thanks for the discussion.

With a translator hat on, I would be happy if I could determine
whether a word requires translation with minimal background
information. In this case, a translator needs to know which values
wal_level can take. It's relatively easy in this case, but I'm not
sure if this is always the case. Therefore, I would be slightly
happier if "logical" were double-quoted.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: About a recently-added message

2024-02-13 Thread Kyotaro Horiguchi
At Wed, 14 Feb 2024 16:26:52 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> > "wal_level" must be >= logical.
..
> > wal_level must be set to "replica" or "logical" at server start.
..
> I suggest making the quoting policy consistent.

Just after this, I found another inconsistency regarding quotation.

> 'dbname' must be specified in "%s".

The use of single quotes doesn't seem to comply with our standard.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




About a recently-added message

2024-02-13 Thread Kyotaro Horiguchi
A recent commit added the following message:

> "wal_level" must be >= logical.

The use of the term "logical" here is a bit confusing, as it's unclear
whether it's meant to be a natural language word or a token. (I
believe it to be a token.)

On the contrary, we already have the following message:

> wal_level must be set to "replica" or "logical" at server start.

These two messages have conflicting policies regarding the quotation
of variable names and enum values.

I suggest making the quoting policy consistent.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: A comment in DropRole() contradicts the actual behavior

2024-02-08 Thread Kyotaro Horiguchi
At Thu, 8 Feb 2024 16:39:23 +0900, Michael Paquier  wrote 
in 
> On Thu, Feb 08, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
> > Hello,
> > 
> > Please look at errors, which produced by the following script, starting
> > from 6566133c5:
> > for i in `seq 100`; do (echo "CREATE USER u; DROP USER u;"); done | psql 
> > >psql-1.log 2>&1 &
> > for i in `seq 100`; do (echo "CREATE USER u; DROP USER u;"); done | psql 
> > >psql-2.log 2>&1 &
> > wait
> > 
> > ERROR:  could not find tuple for role 16387
> > ERROR:  could not find tuple for role 16390
> > ERROR:  could not find tuple for role 16394
> > ...
> > 
> > Maybe these errors are expected, but then I'm confused by the comment in
> > DropRole():
> 
> Well, these errors should never happen, IMHO, so it seems to me that
> the comment is right and that the code lacks locking, contrary to what
> the comment tells.

The function acquires a lock, but it does not perform an existence
check before it first attempts to fetch the tuple, trusting that it is
present. However, I doubt that performing an additional existence
check right after acquiring the lock is worthwhile.

By the way, I saw the following error with the provided script:

> ERROR:  duplicate key value violates unique constraint 
> "pg_authid_rolname_index"
> DETAIL:  Key (rolname)=(u) already exists.
> STATEMENT:  CREATE USER u;

This seems to be another instance of a similar thinko.

I vaguely think that we should just regard the absence as a concurrent
drop and either adjust or remove the message, then fix the
comment. The situation is slightly different for the duplication
case: we shouldn't ignore it, but rather need to adjust the error
message, as long as these behaviors don't lead to inconsistencies.
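As a minimal sketch of the "regard absence as a concurrent drop" idea (all names hypothetical; this is not the actual DropRole code), a failed lookup after acquiring the lock would be reported as an already-dropped role instead of an internal "could not find tuple" error:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical one-slot catalog standing in for pg_authid. */
static char current_role[64];	/* empty string means no role exists */

static bool
sketch_role_exists(const char *name)
{
	return current_role[0] != '\0' && strcmp(current_role, name) == 0;
}

void
sketch_create_role(const char *name)
{
	strncpy(current_role, name, sizeof(current_role) - 1);
	current_role[sizeof(current_role) - 1] = '\0';
}

/*
 * Returns true if we dropped the role, false if it had already vanished
 * concurrently; absence is never treated as an internal error.
 */
bool
sketch_drop_role(const char *name)
{
	/* (the exclusive lock on the role name would be taken here) */
	if (!sketch_role_exists(name))
		return false;			/* concurrent drop: report, don't ERROR */
	current_role[0] = '\0';		/* perform the drop */
	return true;
}
```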

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: InstallXLogFileSegment() vs concurrent WAL flush

2024-02-05 Thread Kyotaro Horiguchi
At Fri, 2 Feb 2024 14:42:46 +0100, Thomas Munro  wrote 
in 
> On Fri, Feb 2, 2024 at 12:56 PM Yugo NAGATA  wrote:
> > On Fri, 2 Feb 2024 11:18:18 +0100
> > Thomas Munro  wrote:
> > > One simple way to address that would be to make XLogFileInitInternal()
> > > wait for InstallXLogFileSegment() to finish.  It's a little
> >
> > Or, can we make sure the rename is durable by calling fsync before
> > returning the fd, as a patch attached here?
> 
> Right, yeah, that works too.  I'm not sure which way is better.

I'm not sure I like issuing spurious syncs unconditionally. Therefore,
I prefer Thomas' approach in that regard.  0002 would be beneficial,
considering the case of a very large max_wal_size, and the code seems
to be the minimal required. I don't think it matters that the lock
attempts occur uselessly until the first segment installation. That
being said, we could avoid it by initializing
last_known_installed_segno properly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-02-05 Thread Kyotaro Horiguchi
At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA  wrote in 
> On Mon, 05 Feb 2024 11:28:59 +0900
> torikoshia  wrote:
> 
> > > Based on this, I've made a patch.
> > > based on COPY Synopsis: ON_ERROR 'error_action'
> > > on_error 'null', the  keyword NULL should be single quoted.
> > 
> > As you mentioned, single quotation seems a little odd..
> > 
> > I'm not sure what is the best name and syntax for this feature, but 
> > since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
> > might not be appropriate.
> 
> I am not in favour of using 'null' either, so I suggested to use
> "set_to_null" or more generic syntax like  "set_to (col, val)" in my
> previous post[1], although I'm not convinced what is the best either.
> 
> [1] 
> https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp

Tom suggested using a separate option, and I agree with the
suggestion. Taking this into consideration, I imagined something like
the following, for example.  Although I'm not sure we are actually
going to do whole-tuple replacement, the action name in this example
has the suffix '-column'.

COPY (on_error 'replace-column', replacement 'null') ..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: More new SQL/JSON item methods

2024-01-31 Thread Kyotaro Horiguchi
Sorry for a minor correction, but..

At Thu, 01 Feb 2024 14:53:57 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Ah.. Understood. "NaN or Infinity" cannot be used in those
> cases. Additionally, for jpiBoolean and jpiBigint, we lack the text
> representation of the value.

This "Additionally" was merely left in by mistake.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: More new SQL/JSON item methods

2024-01-31 Thread Kyotaro Horiguchi
At Thu, 1 Feb 2024 09:22:22 +0530, Jeevan Chalke 
 wrote in 
> On Thu, Feb 1, 2024 at 7:24 AM Kyotaro Horiguchi 
> wrote:
> 
> > At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi <
> > horikyota@gmail.com> wrote in
> > > By the way, while playing with this feature, I noticed the following
> > > error message:
> > >
> > > > select jsonb_path_query('1.1' , '$.boolean()');
> > > > ERROR:  numeric argument of jsonpath item method .boolean() is out of
> > range for type boolean
> > >
> > > The error message seems a bit off to me. For example, "argument '1.1'
> > > is invalid for type [bB]oolean" seems more appropriate for this
> > > specific issue. (I'm not certain about our policy on the spelling of
> > > Boolean..)
> >
> > Or, following our general convention, it would be spelled as:
> >
> > 'invalid argument for type Boolean: "1.1"'
> >
> 
> jsonpath way:

Hmm. I see.

> ERROR:  argument of jsonpath item method .boolean() is invalid for type
> boolean
> 
> or, if we add input value, then
> 
> ERROR:  argument "1.1" of jsonpath item method .boolean() is invalid for
> type boolean
> 
> And this should work for all the error types, like out of range, not valid,
> invalid input, etc, etc. Also, we don't need separate error messages for
> string input as well, which currently has the following form:
> 
> "string argument of jsonpath item method .%s() is not a valid
> representation.."

Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: More new SQL/JSON item methods

2024-01-31 Thread Kyotaro Horiguchi
At Thu, 1 Feb 2024 09:19:40 +0530, Jeevan Chalke 
 wrote in 
> > As Tom suggested, given that similar situations have already been
> > disregarded elsewhere, worrying about excessively long input strings
> > in this specific instance won't notably improve safety in total.
> >
> > > Also, for non-string input, we need to convert numeric to string just for
> > > the error message, which seems overkill.
> >
> > As I suggested and you seem to agree, using literally "Nan or
> > Infinity" would be sufficient.
> >
> 
> I am more concerned about .bigint() and .integer(). We can have errors when
> the numeric input is out of range, but not NaN or Infinity. At those
> places, we need to convert numeric to string to put that value into the
> error.
> Do you mean we should still put "Nan or Infinity" there?
> 
> This is the case:
>  select jsonb_path_query('12345678901', '$.integer()');
>  ERROR:  numeric argument of jsonpath item method .integer() is out of
> range for type integer

Ah.. Understood. "NaN or Infinity" cannot be used in those
cases. Additionally, for jpiBoolean and jpiBigint, we lack the text
representation of the value.

By a quick grepping, I found that the following functions call
numeric_out to convert the jbvNumeric values back into text
representation.

JsonbValueAsText, populate_scalar, iterate_jsonb_values,
executeItemOptUnwrapTarget, jsonb_put_escaped_value

The function iterate_jsonb_values(), in particular, iterates over a
values array, calling numeric_out on each iteration.

The following functions re-convert the numeric value into another type.

jsonb_int[248]() converts the numeric value into int[248] using numeric_int[248]().
jsonb_float[48]() converts it into float[48] using numeric_float[48]().

Given these facts, it seems more efficient for jbvNumeric to retain the
original scalar value, converting it only when necessary. If needed,
we could also add a numeric struct member as a cache for better
performance. I'm not sure we refer to the values more than once, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: More new SQL/JSON item methods

2024-01-31 Thread Kyotaro Horiguchi
At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> By the way, while playing with this feature, I noticed the following
> error message:
> 
> > select jsonb_path_query('1.1' , '$.boolean()');
> > ERROR:  numeric argument of jsonpath item method .boolean() is out of range 
> > for type boolean
> 
> The error message seems a bit off to me. For example, "argument '1.1'
> is invalid for type [bB]oolean" seems more appropriate for this
> specific issue. (I'm not certain about our policy on the spelling of
> Boolean..)

Or, following our general convention, it would be spelled as:

'invalid argument for type Boolean: "1.1"'

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: More new SQL/JSON item methods

2024-01-31 Thread Kyotaro Horiguchi
Thank you for the fix!

At Tue, 30 Jan 2024 13:46:17 +0530, Jeevan Chalke 
 wrote in 
> On Mon, Jan 29, 2024 at 11:03 AM Tom Lane  wrote:
> 
> > Kyotaro Horiguchi  writes:
> > > Having said that, I'm a bit concerned about the case where an overly
> > > long string is given there. However, considering that we already have
> > > "invalid input syntax for type xxx: %x" messages (including for the
> > > numeric), this concern might be unnecessary.
> >
> > Yeah, we have not worried about that in the past.
> >
> > > Another concern is that the input value is already a numeric, not a
> > > string. This situation occurs when the input is NaN of +-Inf. Although
> > > numeric_out could be used, it might cause another error. Therefore,
> > > simply writing out as "argument NaN and Infinity.." would be better.
> >
> > Oh!  I'd assumed that we were discussing a string that we'd failed to
> > convert to numeric.  If the input is already numeric, then either
> > the error is unreachable or what we're really doing is rejecting
> > special values such as NaN on policy grounds.  I would ask first
> > if that policy is sane at all.  (I'd lean to "not" --- if we allow
> > it in the input JSON, why not in the output?)  If it is sane, the
> > error message needs to be far more specific.
> >
> > regards, tom lane
> >
> 
> *Consistent error message related to type:*
...
> What if we use phrases like "for type double precision" at both places,
> like:
> numeric argument of jsonpath item method .%s() is out of range for type
> double precision
> string argument of jsonpath item method .%s() is not a valid representation
> for type double precision
> 
> With this, the rest will be like:
> for type numeric
> for type bigint
> for type integer
> 
> Suggestions?

FWIW, I prefer consistently using "for type hoge".

> ---
> 
> *Showing input string in the error message:*
> 
> argument "...input string here..." of jsonpath item method .%s() is out of
> range for type numeric
> 
> If we add the input string in the error, then for some errors, it will be
> too big, for example:
> 
> -ERROR:  numeric argument of jsonpath item method .double() is out of range
> for type double precision
> +ERROR:  argument
> "10"
> of jsonpath item method .double() is out of range for type double precision

As Tom suggested, given that similar situations have already been
disregarded elsewhere, worrying about excessively long input strings
in this specific instance won't notably improve safety in total.

> Also, for non-string input, we need to convert numeric to string just for
> the error message, which seems overkill.

As I suggested and you seem to agree, using literally "NaN or
Infinity" would be sufficient.

> On another note, irrespective of these changes, is it good to show the
> given input in the error messages? Error messages are logged and may leak
> some details.
> 
> I think the existing way seems ok.

In my opinion, it is quite common to include the error-causing value
in error messages. Also, we already have many functions that imply the
possibility of revealing input values when converting a text
representation into the internal format, such as int4in. However, I
don't insist on that approach.

> ---
> 
> *NaN and Infinity restrictions:*
> 
> I am not sure why NaN and Infinity are not allowed in conversion to double
> precision (.double() method). I have used the same restriction for
> .decimal() and .number(). However, as you said, we should have error
> messages more specific. I tried that in the attached patch; please have
> your views. I have the following wordings for that error message:
> "NaN or Infinity is not allowed for jsonpath item method .%s()"
> 
> Suggestions...

They seem good to *me*.

By the way, while playing with this feature, I noticed the following
error message:

> select jsonb_path_query('1.1' , '$.boolean()');
> ERROR:  numeric argument of jsonpath item method .boolean() is out of range 
> for type boolean

The error message seems a bit off to me. For example, "argument '1.1'
is invalid for type [bB]oolean" seems more appropriate for this
specific issue. (I'm not certain about our policy on the spelling of
Boolean..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


possible inter-gcc-version incompatibility regarding fat objects

2024-01-30 Thread Kyotaro Horiguchi
Hello.

I would like to share some information regarding a recent issue we
encountered.

While compiling an extension against the RPM package
postgresql16-devel-16.1-2PGDG.rhel9.x86_64.rpm, we faced a problem that
appears to be caused by the combination of LTO and constant
folding/propagation, leading to a SEGV. This crash didn't occur when
we gave -fno-lto instead of -flto=auto to the compiler.

To put this case briefly:
postgresql16-devel-16.1-2PGDG.rhel9.x86_64.rpm + gcc 11.3
  => building the extension completes but the binary may crash with SEGV
     -fno-lto prevents this crash from occurring.

The PG package is built with gcc-11.4, and the compiler in our build
environment for the extension was gcc-11.3. A quick search suggests
that the LTO bytecode format was optimized in gcc-11.4. Conversely,
when we built the extension with postgresql-16-16.0 and gcc-11.4, the
build failed. This failure seems to stem from LTO binary format
incompatibility.

> Error: bad register name `%'
> Error: open CFI at the end of file; missing .cfi_endproc directive
> /usr/bin/ld: error: lto-wrapper failed

To put this case briefly:
postgresql16-devel-16.0-1PGDG.rhel9.x86_64.rpm + gcc 11.4
  => build failure regarding LTO

Given these problems, it seems there might be issues with including
fat objects in the -devel package.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Network failure may prevent promotion

2024-01-28 Thread Kyotaro Horiguchi
Thank you for fixing the issue.

At Tue, 23 Jan 2024 11:43:43 +0200, Heikki Linnakangas  wrote i
n 
> There's an existing AmWalReceiverProcess() macro too. Let's use that.

Mmm. I sought an Is* function because "IsLogicalWorker()" is placed on
the previous line. Our convention regarding those functions (macros)
and variables seems inconsistent. However, I can't say for sure that
we should unify all of them.

> (See also
> https://www.postgresql.org/message-id/f3ecd4cb-85ee-4e54-8278-5fabfb3a4ed0%40iki.fi
> for refactoring in this area)
> 
> Here's a patch set summarizing the changes so far. They should be
> squashed, but I kept them separate for now to help with review:
> 
> 1. revert the revert of 728f86fec6.
> 2. your walrcv_shutdown_deblocking_v2-2.patch
> 3. Also replace libpqrcv_PQexec() and libpqrcv_PQgetResult() with the
> wrappers from libpq-be-fe-helpers.h

Both replacements look fine. I didn't find another instance of similar
code.

> 4. Replace IsWalReceiver() with AmWalReceiverProcess()

Looks just fine.

> >> - pqsignal(SIGTERM, SignalHandlerForShutdownRequest); /* request
> >> - shutdown */
> >> + pqsignal(SIGTERM, WalRcvShutdownSignalHandler); /* request shutdown
> >> */
> >>
> >> Can't we just use die(), instead?
> > There was a comment explaining the problems associated with exiting
> > within a signal handler;
> > - * Currently, only SIGTERM is of interest.  We can't just exit(1) within
> > - * the
> > - * SIGTERM signal handler, because the signal might arrive in the middle
> > - * of
> > - * some critical operation, like while we're holding a spinlock.
> > - * Instead, the
> > And I think we should keep the considerations it suggests. The patch
> > removes the comment itself, but it does so because it implements our
> > standard process exit procedure, which incorporates points suggested
> > by the now-removed comment.
> 
> die() doesn't call exit(1). Unless DoingCommandRead is set, but it
> never is in the walreceiver. It looks just like the new
> WalRcvShutdownSignalHandler() function. Am I missing something?

Ugh.. Doesn't the name 'die()' suggest exit()?
I agree that die() can be used instead.

> Hmm, but doesn't bgworker_die() have that problem with exit(1)ing in
> the signal handler?

I noticed that, but ignored it for now.

> I also wonder if we should replace SignalHandlerForShutdownRequest()
> completely with die(), in all processes? The difference is that
> SignalHandlerForShutdownRequest() uses ShutdownRequestPending, while
> die() uses ProcDiePending && InterruptPending to indicate that the
> signal was received. Or do some of the processes want to check for
> ShutdownRequestPending only at specific places, and don't want to get
> terminated at the any random CHECK_FOR_INTERRUPTS()?

At least, pg_log_backend_memory_context() causes a call to
ProcessInterrupts via "ereport(LOG_SERVER_ONLY" which can lead to an
exit due to ProcDiePending. In this regard, checkpointer clearly
requires the distinction.

Rather than merely consolidating the notification variables and
striving to annihilate CFI calls in the execution path, I
believe we need a shutdown mechanism that CFI doesn't react
to. However, as for the method to achieve this, whether we should keep
the notification variables separate as they are now, or whether it
would be better to introduce a variable that causes CFI to ignore
ProcDiePending, is a matter I think is open to discussion.

Attached patches are the rebased version of v3 (0003 is updated) and
additional 0005 that makes use of die() instead of walreceiver's
custom function.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From daac3aa06bd33f5e8f04a406c8a59fd4cc97 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 23 Jan 2024 11:01:03 +0200
Subject: [PATCH v4 1/5] Revert "Revert "libpqwalreceiver: Convert to
 libpq-be-fe-helpers.h""

This reverts commit 21ef4d4d897563adb2f7920ad53b734950f1e0a4.
---
 .../libpqwalreceiver/libpqwalreceiver.c   | 55 +++
 1 file changed, 8 insertions(+), 47 deletions(-)

diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 2439733b55..dbee2f8f0e 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -24,6 +24,7 @@
 #include "common/connect.h"
 #include "funcapi.h"
 #include "libpq-fe.h"
+#include "libpq/libpq-be-fe-helpers.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -136,7 +137,6 @@ libp

Re: More new SQL/JSON item methods

2024-01-28 Thread Kyotaro Horiguchi
At Sun, 28 Jan 2024 22:47:02 -0500, Tom Lane  wrote in
> Kyotaro Horiguchi  writes:
> > They seem to be suggesting that PostgreSQL has the types "decimal" and
> > "number". I know of the former, but I don't think PostgreSQL has the
> > latter type. Perhaps the "number" was intended to refer to "numeric"?
> 
> Probably.  But I would write just "type numeric".  We do not generally
> acknowledge "decimal" as a separate type, because for us it's only an
> alias for numeric (there is not a pg_type entry for it).
> 
> Also, that leads to the thought that "numeric argument ... is out of
> range for type numeric" seems either redundant or contradictory
> depending on how you look at it.  So I suggest wording like
> 
> argument "...input string here..." of jsonpath item method .%s() is out of range for type numeric
> 
> > (And I think it is largely helpful if the given string were shown in
> > the error message, but it would be another issue.)
> 
> Agreed, so I suggest the above.

Having said that, I'm a bit concerned about the case where an overly
long string is given there. However, considering that we already have
"invalid input syntax for type xxx: %x" messages (including for the
numeric), this concern might be unnecessary.

Another concern is that the input value is already a numeric, not a
string. This situation occurs when the input is NaN or +-Inf. Although
numeric_out could be used, it might cause another error. Therefore,
simply writing out as "argument NaN and Infinity.." would be better.

Once we resolve these issues, another question arises regarding one of
the messages. In the case of NaN or Infinity, the message will be as
follows:

"argument NaN or Infinity of jsonpath item method .%s() is out of range for type numeric"

This message appears quite strange and puzzling. I suspect that the
intended message was somewhat different.

> > The same commit has introduced the following set of messages:
> 
> >> %s format is not recognized: "%s"
> >> date format is not recognized: "%s"
> >> time format is not recognized: "%s"
> >> time_tz format is not recognized: "%s"
> >> timestamp format is not recognized: "%s"
> >> timestamp_tz format is not recognized: "%s"
> 
> > I believe that the first line was intended to cover all the others:p
> 
> +1


regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center




Re: More new SQL/JSON item methods

2024-01-28 Thread Kyotaro Horiguchi
I found two possible issues in a recent commit.

Commit 66ea94e8e6 has introduced the following messages:
(Apologies in advance if the commit is not related to this thread.)

jsonpath_exec.c:1287
> if (numeric_is_nan(num) || numeric_is_inf(num))
>   RETURN_ERROR(ereport(ERROR,
>  (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM),
>   errmsg("numeric argument of jsonpath item method .%s() is out 
> of range for type decimal or number",
>jspOperationName(jsp->type);

:1387
>   noerr = DirectInputFunctionCallSafe(numeric_in, numstr,
...
> if (!noerr || escontext.error_occurred)
>   RETURN_ERROR(ereport(ERROR,
>  (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM),
>   errmsg("string argument of jsonpath item method .%s() is not a 
> valid representation of a decimal or number",

They seem to be suggesting that PostgreSQL has the types "decimal" and
"number". I know of the former, but I don't think PostgreSQL has the
latter type. Perhaps the "number" was intended to refer to "numeric"?
(And I think it is largely helpful if the given string were shown in
the error message, but it would be another issue.)


The same commit has introduced the following set of messages:

> %s format is not recognized: "%s"
> date format is not recognized: "%s"
> time format is not recognized: "%s"
> time_tz format is not recognized: "%s"
> timestamp format is not recognized: "%s"
> timestamp_tz format is not recognized: "%s"

I believe that the first line was intended to cover all the others:p

They are attached to this message separately.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index 22f598cd35..c10926a8a2 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -1287,7 +1287,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp,
 	if (numeric_is_nan(num) || numeric_is_inf(num))
 		RETURN_ERROR(ereport(ERROR,
 			 (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM),
-			  errmsg("numeric argument of jsonpath item method .%s() is out of range for type decimal or number",
+			  errmsg("numeric argument of jsonpath item method .%s() is out of range for type decimal or numeric",
 	 jspOperationName(jsp->type);
 
 	if (jsp->type == jpiDecimal)
@@ -1312,14 +1312,14 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp,
 	if (!noerr || escontext.error_occurred)
 		RETURN_ERROR(ereport(ERROR,
 			 (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM),
-			  errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or number",
+			  errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or numeric",
 	 jspOperationName(jsp->type);
 
 	num = DatumGetNumeric(datum);
 	if (numeric_is_nan(num) || numeric_is_inf(num))
 		RETURN_ERROR(ereport(ERROR,
 			 (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM),
-			  errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or number",
+			  errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or numeric",
 	 jspOperationName(jsp->type);
 
 	res = jperOk;
@@ -1401,7 +1401,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp,
 	if (!noerr || escontext.error_occurred)
 		RETURN_ERROR(ereport(ERROR,
 			 (errcode(ERRCODE_NON_NUMERIC_SQL_JSON_ITEM),
-			  errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or number",
+			  errmsg("string argument of jsonpath item method .%s() is not a valid representation of a decimal or numeric",
 	 jspOperationName(jsp->type);
 
 	num = DatumGetNumeric(numdatum);
diff --git a/src/test/regress/expected/jsonb_jsonpath.out b/src/test/regress/expected/jsonb_jsonpath.out
index eea2af30c8..9d8ce48a25 100644
--- a/src/test/regress/expected/jsonb_jsonpath.out
+++ b/src/test/regress/expected/jsonb_jsonpath.out
@@ -2144,7 +2144,7 @@ select jsonb_path_query('"1.23"', '$.decimal()');
 (1 row)
 
 select jsonb_path_

Re: Network failure may prevent promotion

2024-01-23 Thread Kyotaro Horiguchi
Thank you for looking at this!

At Tue, 23 Jan 2024 15:07:10 +0900, Fujii Masao  wrote 
in 
> Regarding the patch, here are the review comments.
> 
> +/*
> + * Is current process a wal receiver?
> + */
> +bool
> +IsWalReceiver(void)
> +{
> + return WalRcv != NULL;
> +}
> 
> This looks wrong because WalRcv can be non-NULL in processes other
> than walreceiver.

Mmm. Sorry for the silly mistake. We can use B_WAL_RECEIVER
instead. I'm not sure if the new function IsWalReceiver() is
required. The expression "MyBackendType == B_WAL_RECEIVER" is quite
descriptive. However, the function does make ProcessInterrupts() more
aligned regarding process types.

> - pqsignal(SIGTERM, SignalHandlerForShutdownRequest); /* request shutdown */
> + pqsignal(SIGTERM, WalRcvShutdownSignalHandler); /* request shutdown */
> 
> Can't we just use die(), instead?

There was a comment explaining the problems associated with exiting
within a signal handler:

- * Currently, only SIGTERM is of interest.  We can't just exit(1) within the
- * SIGTERM signal handler, because the signal might arrive in the middle of
- * some critical operation, like while we're holding a spinlock.  Instead, the

And I think we should keep the considerations it suggests. The patch
removes the comment itself, but it does so because it implements our
standard process exit procedure, which incorporates points suggested
by the now-removed comment.

--
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 201c36cb22..db779dc6ca 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -749,7 +749,7 @@ libpqrcv_PQgetResult(PGconn *streamConn)
 		if (rc & WL_LATCH_SET)
 		{
 			ResetLatch(MyLatch);
-			ProcessWalRcvInterrupts();
+			CHECK_FOR_INTERRUPTS();
 		}
 
 		/* Consume whatever data is available from the socket */
@@ -1053,7 +1053,7 @@ libpqrcv_processTuples(PGresult *pgres, WalRcvExecResult *walres,
 	{
 		char	   *cstrs[MaxTupleAttributeNumber];
 
-		ProcessWalRcvInterrupts();
+		CHECK_FOR_INTERRUPTS();
 
 		/* Do the allocations in temporary context. */
 		oldcontext = MemoryContextSwitchTo(rowcontext);
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 728059518e..e491f7d4c5 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -147,39 +147,34 @@ static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 static void WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now);
+static void WalRcvShutdownSignalHandler(SIGNAL_ARGS);
 
-/*
- * Process any interrupts the walreceiver process may have received.
- * This should be called any time the process's latch has become set.
- *
- * Currently, only SIGTERM is of interest.  We can't just exit(1) within the
- * SIGTERM signal handler, because the signal might arrive in the middle of
- * some critical operation, like while we're holding a spinlock.  Instead, the
- * signal handler sets a flag variable as well as setting the process's latch.
- * We must check the flag (by calling ProcessWalRcvInterrupts) anytime the
- * latch has become set.  Operations that could block for a long time, such as
- * reading from a remote server, must pay attention to the latch too; see
- * libpqrcv_PQgetResult for example.
- */
 void
-ProcessWalRcvInterrupts(void)
+WalRcvShutdownSignalHandler(SIGNAL_ARGS)
 {
-	/*
-	 * Although walreceiver interrupt handling doesn't use the same scheme as
-	 * regular backends, call CHECK_FOR_INTERRUPTS() to make sure we receive
-	 * any incoming signals on Win32, and also to make sure we process any
-	 * barrier events.
-	 */
-	CHECK_FOR_INTERRUPTS();
+	int			save_errno = errno;
 
-	if (ShutdownRequestPending)
+	/* Don't joggle the elbow of proc_exit */
+	if (!proc_exit_inprogress)
 	{
-		ereport(FATAL,
-(errcode(ERRCODE_ADMIN_SHUTDOWN),
- errmsg("terminating walreceiver process due to administrator command")));
+		InterruptPending = true;
+		ProcDiePending = true;
 	}
+
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+	
 }
 
+/*
+ * Is current process a wal receiver?
+ */
+bool
+IsWalReceiver(void)
+{
+	return MyBackendType == B_WAL_RECEIVER;
+}
 
 /* Main entry point for walreceiver process */
 void
@@ -277,7 +272,7 @@ WalReceiverMain(void)
 	pqsignal(SIGHUP, SignalHandlerForConfigReload); /* set flag to read config
 	 * file */
 	pqsignal(SIGINT, SIG_IGN);
-	pqsignal(SIGTERM, SignalHandlerForShutdownRequest); /* request shutdown */
+	pqsignal(SIGTERM, WalRcvShutdownSignalHandler); /* request shutdown */

Fix some errdetail's message format

2024-01-22 Thread Kyotaro Horiguchi
We recently made corrections to the capitalization of DETAIL messages.
For example;

-   GUC_check_errdetail("invalid list syntax in parameter %s",
+   GUC_check_errdetail("Invalid list syntax in parameter %s",

But it is still missing the period at the end.

There are several patterns to this issue, but this time, I have only
fixed the ones that are simple and obvious as follows:

a. GUC_check_errdetail("LITERAL"), errdetail("LITERAL") without a period.
b. psprintf()'ed string that is passed to errdetail_internal()

I didn't touch the following patterns:

c. errdetail_internal("%s")
d. errdetail("Whatever: %s")
e. errdetail("%s versus %s") and the like
f. errdetail_internal("%s", pchomp(PQerrorMessage()))
g. complex message compilation


The attached patch contains the following fix:

-   GUC_check_errdetail("timestamp out of range: \"%s\"", str);
+   GUC_check_errdetail("Timestamp out of range: \"%s\".", str);

But I'm not quite confident about whether capitalizing the type name
here is correct.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/contrib/cube/cubescan.l b/contrib/cube/cubescan.l
index a30fbfc311..b45dc08f0c 100644
--- a/contrib/cube/cubescan.l
+++ b/contrib/cube/cubescan.l
@@ -82,7 +82,7 @@ cube_yyerror(NDBOX **result, Size scanbuflen,
 (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
  errmsg("invalid input syntax for cube"),
  /* translator: %s is typically "syntax error" */
- errdetail("%s at end of input", message)));
+ errdetail("%s at end of input.", message)));
 	}
 	else
 	{
@@ -90,7 +90,7 @@ cube_yyerror(NDBOX **result, Size scanbuflen,
 (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
  errmsg("invalid input syntax for cube"),
  /* translator: first %s is typically "syntax error" */
- errdetail("%s at or near \"%s\"", message, yytext)));
+ errdetail("%s at or near \"%s\".", message, yytext)));
 	}
 }
 
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 19a362526d..2720e91c14 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2633,7 +2633,7 @@ dblink_security_check(PGconn *conn, remoteConn *rconn, const char *connstr)
 	ereport(ERROR,
 			(errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
 			 errmsg("password or GSSAPI delegated credentials required"),
-			 errdetail("Non-superusers may only connect using credentials they provide, eg: password in connection string or delegated GSSAPI credentials"),
+			 errdetail("Non-superusers may only connect using credentials they provide, eg: password in connection string or delegated GSSAPI credentials."),
 			 errhint("Ensure provided credentials match target server's authentication method.")));
 }
 
diff --git a/contrib/seg/segscan.l b/contrib/seg/segscan.l
index 4ad529eccc..f5a72c2496 100644
--- a/contrib/seg/segscan.l
+++ b/contrib/seg/segscan.l
@@ -79,7 +79,7 @@ seg_yyerror(SEG *result, struct Node *escontext, const char *message)
 (errcode(ERRCODE_SYNTAX_ERROR),
  errmsg("bad seg representation"),
  /* translator: %s is typically "syntax error" */
- errdetail("%s at end of input", message)));
+ errdetail("%s at end of input.", message)));
 	}
 	else
 	{
@@ -87,7 +87,7 @@ seg_yyerror(SEG *result, struct Node *escontext, const char *message)
 (errcode(ERRCODE_SYNTAX_ERROR),
  errmsg("bad seg representation"),
  /* translator: first %s is typically "syntax error" */
- errdetail("%s at or near \"%s\"", message, yytext)));
+ errdetail("%s at or near \"%s\".", message, yytext)));
 	}
 }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1b48d7171a..2d90bbfee8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -4882,7 +4882,7 @@ check_recovery_target_time(char **newval, void **extra, GucSource source)
 
 			if (tm2timestamp(tm, fsec, &tz, ×tamp) != 0)
 			{
-GUC_check_errdetail("timestamp out of range: \"%s\"", str);
+GUC_check_errdetail("Timestamp out of range: \"%s\".", str);
 return false;
 			}
 		}
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 226f85d0e3..564f79dab9 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -2946,7 +2946,7 @@ AlterExtensionNamespace(const char *extensionName, const char *newschema, Oid *o
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 	 errmsg("extension 

Re: Network failure may prevent promotion

2024-01-22 Thread Kyotaro Horiguchi
At Mon, 22 Jan 2024 13:29:10 -0800, Andres Freund  wrote in 
> Hi,
> 
> On 2024-01-19 12:28:05 +0900, Michael Paquier wrote:
> > On Thu, Jan 18, 2024 at 03:42:28PM +0200, Heikki Linnakangas wrote:
> > > Given that commit 728f86fec6 that introduced this issue was not strictly
> > > required, perhaps we should just revert it for v16.
> > 
> > Is there a point in keeping 728f86fec6 as well on HEAD?  That does not
> > strike me as wise to keep that in the tree for now.  If it needs to be
> > reworked, looking at this problem from scratch would be a safer
> > approach.
> 
> IDK, I think we'll introduce this type of bug over and over if we don't fix it
> properly.

Just to clarify my position, I thought that 728f86fec6 was heading the
right direction. Considering the current approach to signal handling
in walreceiver, I believed that it would be better to further
generalize in this direction rather than reverting. That's why I
proposed that patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2024-01-22 Thread Kyotaro Horiguchi
At Mon, 22 Jan 2024 15:36:31 +1100, Peter Smith  wrote 
in 
> 2024-01 Commitfest.
> 
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there was a CFbot test failure last time it was run [2]. Please have a
> look and post an updated version if necessary.

Thanks! I have added the necessary includes to the header file this
patch adds. With this change, "make headerscheck" now passes. However,
when I run "make cpluspluscheck" in my environment, it generates a
large number of errors in other areas, but I didn't find any related
to this patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 9a2b6fbda882587c127d3e50bccf89508837d1a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v32 1/3] Export wal_sync_method related functions

Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
 src/backend/access/transam/xlog.c | 73 +--
 src/include/access/xlog.h |  2 +
 2 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..c5f51849ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8403,21 +8403,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
 	}
 }
 
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+	return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
 
 /*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
  *
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
  */
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
 {
-	char	   *msg = NULL;
+	const char *msg = NULL;
 	instr_time	start;
 
-	Assert(tli != 0);
-
 	/*
 	 * Quick exit if fsync is disabled or write() has already synced the WAL
 	 * file.
@@ -8425,7 +8433,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	if (!enableFsync ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
-		return;
+		return NULL;
 
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
@@ -8460,19 +8468,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 			break;
 	}
 
-	/* PANIC if failed to fsync */
-	if (msg)
-	{
-		char		xlogfname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		XLogFileName(xlogfname, tli, segno, wal_segment_size);
-		errno = save_errno;
-		ereport(PANIC,
-(errcode_for_file_access(),
- errmsg(msg, xlogfname)));
-	}
-
 	pgstat_report_wait_end();
 
 	/*
@@ -8486,7 +8481,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
 	}
 
-	PendingWalStats.wal_sync++;
+	if (msg != NULL)
+		PendingWalStats.wal_sync++;
+
+	return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+	const char	*msg;
+	
+	Assert(tli != 0);
+
+	msg = XLogFsyncFile(fd);
+
+	/* PANIC if failed to fsync */
+	if (msg)
+	{
+		char		xlogfname[MAXFNAMELEN];
+		int			save_errno = errno;
+
+		XLogFileName(xlogfname, tli, segno, wal_segment_size);
+		errno = save_errno;
+		ereport(PANIC,
+(errcode_for_file_access(),
+ errmsg(msg, xlogfname)));
+	}
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..2a0d65b537 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
+extern int XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
 
 extern bool RecoveryInProgress(void);
-- 
2.39.3

>From c464013071dedc15b838e573ae828f150b3b60f7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v32 2/3] Introduce undo log implementation

This patch adds a simple implementation of UNDO log feature.
---
 src/backend/access/transam/Makefile|   1 +
 src/backend/access/transam/meson.build |   1 +
 src/backend/access/transam/rmgr.c

Re: Make mesage at end-of-recovery less scary.

2024-01-22 Thread Kyotaro Horiguchi
At Wed, 17 Jan 2024 14:32:00 +0900, Michael Paquier  wrote 
in 
> On Tue, Jan 16, 2024 at 02:46:02PM +0300, Aleksander Alekseev wrote:
> >> For now, let me explain the basis for this patch. The fundamental
> >> issue is that these warnings that always appear are, in practice, not
> >> a problem in almost all cases. Some of those who encounter them for
> >> the first time may feel uneasy and reach out with inquiries. On the
> >> other hand, those familiar with these warnings tend to ignore them and
> >> only pay attention to details when actual issues arise. Therefore, the
> >> intention of this patch is to label them as "no issue" unless a
> >> problem is blatantly evident, in order to prevent unnecessary concern.
> > 
> > I agree and don't mind affecting the error message per se.
> > 
> > However I see that the actual logic of how WAL is processed is being
> > changed. If we do this, at very least it requires thorough thinking. I
> > strongly suspect that the proposed code is wrong and/or not safe
> > and/or less safe than it is now for the reasons named above.
> 
> FWIW, that pretty much sums up my feeling regarding this patch,
> because an error, basically any error, would hurt back very badly.
> Sure, the error messages we generate now when reaching the end of WAL
> can sound scary, and they are (I suspect that's not really the case
> for anybody who has history doing support with PostgreSQL because a
> bunch of these messages are old enough to vote, but I can understand
> that anybody would freak out the first time they see that).
> 
> However, per the recent issues we've had in this area, like
> cd7f19da3468 but I'm more thinking about 6b18b3fe2c2f and
> bae868caf222, I am of the opinion that the header validation, the
> empty page case in XLogReaderValidatePageHeader() and the record read
> changes are risky enough that I am not convinced that the gains are
> worth the risks taken.
> 
> The error stack in the WAL reader is complicated enough that making it
> more complicated as the patch proposes does not sound like not a good
> tradeoff to me to make the reports related to the end of WAL cleaner
> for the end-user.  I agree that we should do something, but the patch
> does not seem like a good step towards this goal.  Perhaps somebody
> would be more excited about this proposal than I am, of course.

Thank you both for the comments. The criticism seems valid. The
approach to identifying the end-of-WAL state in this patch is quite
heuristic, and its validity and safety can certainly be contested. On
the other hand, if we seek perfection in this area of judgment, we may
need to make the WAL format itself more robust. In any case, since the
majority of the feedback on this patch seems to be negative, I am
going to withdraw it if no supportive opinions emerge during this
commitfest.

The attached patch addresses the errors reported by CF-bot.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 933d10fa6c7b71e4684f5ba38e85177afaa56f58 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 7 Mar 2023 14:55:58 +0900
Subject: [PATCH v29] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.  To make
sure that the detection is correct, this patch checks if all trailing
bytes in the same page are zeroed in that case.
---
 src/backend/access/transam/xlogreader.c   | 147 ++
 src/backend/access/transam/xlogrecovery.c |  96 ++
 src/backend/replication/walreceiver.c |   7 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 +-
 src/bin/pg_waldump/t/001_basic.pl |   5 +-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/035_recovery.pl   | 130 +++
 src/test/recovery/t/039_end_of_wal.pl |  47 +--
 8 files changed, 383 insertions(+), 63 deletions(-)
 create mode 100644 src/test/recovery/t/035_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7190156f2f..94861969eb 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 			 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bo

Re: Make mesage at end-of-recovery less scary.

2024-01-22 Thread Kyotaro Horiguchi
At Mon, 22 Jan 2024 16:09:28 +1100, Peter Smith  wrote 
in 
> 2024-01 Commitfest.
> 
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.
> 
> ==
> [1] https://commitfest.postgresql.org/46/2490/
> [2] 
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/2490

Thanks for noticing that. I will repost a new version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Network failure may prevent promotion

2024-01-18 Thread Kyotaro Horiguchi
At Sun, 31 Dec 2023 20:07:41 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> We've noticed that when walreceiver is waiting for a connection to
> complete, standby does not immediately respond to promotion
> requests. In PG14, upon receiving a promotion request, walreceiver
> terminates instantly, but in PG16, it waits for connection
> timeout. This behavior is attributed to commit 728f86fec65, where a
> part of libpqrcv_connect was simply replaced with a call to
> libpqsrc_connect_params. This behavior can be verified by simply
> dropping packets from the standby to the primary.

Apologize for the inconvenience on my part, but I need to fix this
behavior. To continue this discussion, I'm providing a repro script
here.

With the script, the standby is expected to promote immediately,
emitting the following log lines:

standby.log:
> 2024-01-18 16:25:22.245 JST [31849] LOG:  received promote request
> 2024-01-18 16:25:22.245 JST [31850] FATAL:  terminating walreceiver process due to administrator command
> 2024-01-18 16:25:22.246 JST [31849] LOG:  redo is not required
> 2024-01-18 16:25:22.246 JST [31849] LOG:  selected new timeline ID: 2
> 2024-01-18 16:25:22.274 JST [31849] LOG:  archive recovery complete
> 2024-01-18 16:25:22.275 JST [31847] LOG:  checkpoint starting: force
> 2024-01-18 16:25:22.277 JST [31846] LOG:  database system is ready to accept connections
> 2024-01-18 16:25:22.280 JST [31847] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.005 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/1548E98, redo lsn=0/1548E40
> 2024-01-18 16:25:22.356 JST [31846] LOG:  received immediate shutdown request
> 2024-01-18 16:25:22.361 JST [31846] LOG:  database system is shut down

After 728f86fec65 was introduced, promotion does not complete with the
same operation, as follows. The patch attached to the previous mail
fixes this behavior to the old behavior above.

> 2024-01-18 16:47:53.314 JST [34515] LOG:  received promote request
> 2024-01-18 16:48:03.947 JST [34512] LOG:  received immediate shutdown request
> 2024-01-18 16:48:03.952 JST [34512] LOG:  database system is shut down

The attached script requires that sudo is executable. And there's
another point to note. The script attempts to establish a replication
connection to $primary_address:$primary_port. To packet-filter can
work, it must be a remote address that is accessible when no
packet-filter setting is set up.  The firewall-cmd setting, need to be
configured to block this connection. If simply an inaccessible IP
address is set, the process will fail immediately with a "No route to
host" error before the first packet is sent out, and it will not be
blocked as intended.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /bin/perl

use Cwd;

# This IP address must be a valid and accessible remote address,
# otherwise replication connection will immediately fail with 'No
# route to host', while we want to wait for replication connection to
# complete.
$primary_addr = '192.168.56.1';
$primary_port = 5432;
$fwzone = 'public';
$rootdir = cwd();
$standby_dir = "$rootdir/standby";
$standby_port = 5432;
$standby_logfile= "standby.log";
$rich_rule = "'rule family=\"ipv4\" destination address=\"$primary_addr\" port port=\"$primary_port\" protocol=\"tcp\" drop'";
$add_cmd = "sudo firewall-cmd --zone=$fwzone --add-rich-rule=$rich_rule";
$del_cmd = "sudo firewall-cmd --zone=$fwzone --remove-rich-rule=$rich_rule";

# Remove old entities
system('killall -9 postgres');
system("rm -rf $standby_dir $standby_logfile");

# Setup packet drop 
system($add_cmd) == 0 or die "failed to exec \"$add_cmd\": $!\n";

# Setup a standby.
system("initdb -D $standby_dir -c primary_conninfo='host=$primary_addr port=$primary_port'");
system("touch $standby_dir/standby.signal");

# Start it.
system("pg_ctl start -D $standby_dir -o \"-p $standby_port\" -l standby.log");
sleep 1;

# Try promoting standby, waiting for 10 seconds.
system("pg_ctl promote -t 10 -D $standby_dir");

# Stop servers
system("pg_ctl stop -m i -D $standby_dir");

# Remove packet-drop setting
system($del_cmd) == 0 or die "failed to exec \"$del_cmd\": $!\n";


Re: initdb's -c option behaves wrong way?

2024-01-17 Thread Kyotaro Horiguchi
Thank you for picking this up.

At Wed, 17 Jan 2024 23:47:41 +0100, Daniel Gustafsson  wrote 
in 
> > On 17 Jan 2024, at 21:33, Tom Lane  wrote:
> > 
> > I wrote:
> >> However ... I don't like the patch much.  It seems to have left
> >> the code in a rather random state.  Why, for example, didn't you
> >> keep all the code that constructs the "newline" value together?
> > 
> > After thinking about it a bit more, I don't see why you didn't just
> > s/strncmp/strncasecmp/ and call it good.  The messiness seems to be
> > a result of your choice to replace the GUC's case as shown in the
> > file with the case used on the command line, which is not better
> > IMO.  We don't change our mind about the canonical spelling of a
> > GUC because somebody varied the case in a SET command.
> 
> The original patch was preserving the case of the file throwing away the case
> from the commandline.  The attached is a slimmed down version which only moves
> the assembly of newline to the different cases (replace vs.  new) keeping the
> rest of the code intact.  Keeping it in one place gets sort of messy too since
> it needs to use different values for a replacement and a new variable.  Not
> sure if there is a cleaner approach?

Just to clarify, the current code breaks out after processing the
first matching line; I haven't changed that behavior. The reason I
moved the rewrite-processing code out of the loop was that I wanted to
avoid adding more lines that are executed only once into the
loop. However, this comes at the cost of the additional complexity of
remembering the original spelling of the parameter name. Personally, I
believe separating the search and rewrite processing is better, but
I'm not particularly attached to the approach, so either way is fine
with me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-16 Thread Kyotaro Horiguchi
At Wed, 17 Jan 2024 14:38:54 +0900, torikoshia  
wrote in 
> Hi,
> 
> Thanks for applying!
> 
> > + errmsg_plural("%zd row were skipped due to data type
> > incompatibility",
> 
> Sorry, I just noticed it, but 'were' should be 'was' here?
> 
> >> BTW I'm thinking we should add a column to pg_stat_progress_copy that
> >> counts soft errors. I'll suggest this in another thread.
> > Please do!
> 
> I've started it here:
> 
> https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff...@oss.nttdata.com

Switching topics, this commit (9e2d870119) adds the following help message:


>   "COPY { %s [ ( %s [, ...] ) ] | ( %s ) }\n"
>   "TO { '%s' | PROGRAM '%s' | STDOUT }\n"
> ...
>   "SAVE_ERROR_TO '%s'\n"
> ...
>   _("location"),

On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which
indicate "immediately error out" and "just ignore the failure"
respectively, but these options hardly seem to denote a 'location';
they read more like an 'action'. I somewhat suspect that this
parameter name was initially conceived with the assumption that it
would take file names or similar values. I'm not sure if others will
agree, but I think the parameter name might not be the best
choice. For instance, considering the addition of a third value
'log', something like on_error_action (error, ignore, log) would be
more intuitively understandable. What do you think?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Make mesage at end-of-recovery less scary.

2024-01-15 Thread Kyotaro Horiguchi
Thank you for the comments.

At Fri, 12 Jan 2024 15:03:26 +0300, Aleksander Alekseev 
 wrote in 
> ```
> +p = (char *) record;
> +pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
> +
> +while (p < pe && *p == 0)
> +p++;
> +
> +if (p == pe)
> ```
> 
> Just as a random thought: perhaps we should make this a separate
> function, as a part of src/port/. It seems to me that this code could
> benefit from using vector instructions some day, similarly to
> memcmp(), memset() etc. Surprisingly there doesn't seem to be a
> standard C function for this. Alternatively one could argue that one
> cycle doesn't make much code to reuse and that the C compiler will
> place SIMD instructions for us. However a counter-counter argument
> would be that we could use a macro or even better an inline function
> and have the same effect except getting a slightly more readable code.

Creating a function with a name like memcmp_byte() should be
straightforward, but implementing it with SIMD right away seems a bit
challenging. Similar operations are already being performed elsewhere
in the code, probably within the stats collector, where memcmp is used
against a statically allocated area filled with zeros. If we can
achieve performance equivalent to memcmp with this new function, then
it definitely seems worth pursuing.

> ```
> - * This is just a convenience subroutine to avoid duplicated code in
> + * This is just a convenience subroutine to avoid duplicate code in
> ```
> 
> This change doesn't seem to be related to the patch. Personally I
> don't mind it though.

Ah, I'm sorry. That was a change I mistakenly made at the last moment.

> All in all I find v28 somewhat scary. It does much more than "making
> one message less scary" as it was initially intended and what bugged
> me personally, and accordingly touches many more places including
> xlogreader.c, xlogrecovery.c, etc.
> 
> Particularly I have mixed feeling about this:
> 
> ```
> +/*
> + * Consider it as end-of-WAL if all subsequent bytes of this page
> + * are zero. We don't bother checking the subsequent pages since
> + * they are not zeroed in the case of recycled segments.
> + */
> ```
> 
> If I understand correctly, if somehow several FS blocks end up being
> zeroed (due to OS bug, bit rot, restoring from a corrupted for
> whatever reason backup, hardware failures, ...) there is non-zero
> chance that PG will interpret this as a normal situation. To my
> knowledge this is not what we typically do - typically PG would report
> an error and ask a human to figure out what happened. Of course the
> possibility of such a scenario is small, but I don't think that as
> DBMS developers we can ignore it.

For now, let me explain the basis for this patch. The fundamental
issue is that these warnings that always appear are, in practice, not
a problem in almost all cases. Some of those who encounter them for
the first time may feel uneasy and reach out with inquiries. On the
other hand, those familiar with these warnings tend to ignore them and
only pay attention to details when actual issues arise. Therefore, the
intention of this patch is to label them as "no issue" unless a
problem is blatantly evident, in order to prevent unnecessary concern.

> Does anyone agree or maybe I'm making things up?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Error "initial slot snapshot too large" in create replication slot

2024-01-15 Thread Kyotaro Horiguchi
At Fri, 12 Jan 2024 11:28:09 -0500, Robert Haas  wrote 
in 
> However, I wonder whether this whole area is in need of a bigger
> rethink. There seem to be a number of situations in which the split
> into xip and subxip arrays is not very convenient, and also some
> situations where it's quite important. Sometimes we want to record
> what's committed, and sometimes what isn't. It's all a bit messy and
> inconsistent. The necessity of limiting snapshot size is annoying,
> too. I have no real idea what can be done about all of this, but what
> strikes me is that the current system has grown up incrementally: we
> started with a data structure designed for the original use case, and
> now by gradually adding new use cases things have gotten complicated.
> If we were designing things over from scratch, maybe we'd do it
> differently and end up with something less messy. And maybe someone
> can imagine a redesign that is likewise less messy.
> 
> But on the other hand, maybe not. Perhaps we can't really do any
> better than what we have. Then the question becomes whether this case
> is important enough to justify additional code complexity. I don't
> think I've personally seen users run into this problem so I have no
> special reason to think that it's important, but if it's causing
> issues for other people then maybe it is.

Thank you for the deep insights. I have understood your points. As I
can't think of any further simple modifications along this line, I
will withdraw this CF entry. At the moment, I also lack a fundamental,
comprehensive solution, but should I or anyone else come up with one
in the future, I believe it would be worth a separate discussion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2024-01-14 Thread Kyotaro Horiguchi
At Tue, 9 Jan 2024 15:07:20 +0530, vignesh C  wrote in 
> CFBot shows compilation issues at [1] with:

Thanks!

The reason for those errors was that I didn't consider Meson at the
time. Additionally, the signature change of reindex_index() caused the
build failure. I fixed both issues. While addressing these issues, I
modified the simpleundolog module to honor
wal_sync_method. Previously, the sync method for undo logs was
determined independently, separate from xlog.c. However, I'm still not
satisfied with the method for handling PG_O_DIRECT.

In this version, I have added the changes to enable the use of
wal_sync_method outside of xlog.c as the first part of the patchset.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 40749357f24adf89dc79db9b34f5c053288489bb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v31 1/3] Export wal_sync_method related functions

Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
 src/backend/access/transam/xlog.c | 73 +--
 src/include/access/xlog.h |  2 +
 2 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..c5f51849ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8403,21 +8403,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
 	}
 }
 
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+	return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
 
 /*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
  *
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
  */
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
 {
-	char	   *msg = NULL;
+	const char *msg = NULL;
 	instr_time	start;
 
-	Assert(tli != 0);
-
 	/*
 	 * Quick exit if fsync is disabled or write() has already synced the WAL
 	 * file.
@@ -8425,7 +8433,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	if (!enableFsync ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN ||
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
-		return;
+		return NULL;
 
 	/* Measure I/O timing to sync the WAL file */
 	if (track_wal_io_timing)
@@ -8460,19 +8468,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 			break;
 	}
 
-	/* PANIC if failed to fsync */
-	if (msg)
-	{
-		char		xlogfname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		XLogFileName(xlogfname, tli, segno, wal_segment_size);
-		errno = save_errno;
-		ereport(PANIC,
-(errcode_for_file_access(),
- errmsg(msg, xlogfname)));
-	}
-
 	pgstat_report_wait_end();
 
 	/*
@@ -8486,7 +8481,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
 	}
 
-	PendingWalStats.wal_sync++;
+	if (msg != NULL)
+		PendingWalStats.wal_sync++;
+
+	return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+	const char	*msg;
+	
+	Assert(tli != 0);
+
+	msg = XLogFsyncFile(fd);
+
+	/* PANIC if failed to fsync */
+	if (msg)
+	{
+		char		xlogfname[MAXFNAMELEN];
+		int			save_errno = errno;
+
+		XLogFileName(xlogfname, tli, segno, wal_segment_size);
+		errno = save_errno;
+		ereport(PANIC,
+(errcode_for_file_access(),
+ errmsg(msg, xlogfname)));
+	}
 }
 
 /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..2a0d65b537 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
+extern int XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
 
 extern bool RecoveryInProgress(void);
-- 
2.39.3

From 5c120b94c407b971485ab52133399305e5e81a88 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v31 2/3] Introduce undo log implementation

This patch adds a simple implementation of UNDO log feature.
---
 src/backend/access/transam/Makefile|   1 +
 src/backend/access/t

Re: initdb --no-locale=C doesn't work as specified when the environment is not C

2024-01-11 Thread Kyotaro Horiguchi
At Wed, 10 Jan 2024 18:16:03 -0500, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > It seems somewhat intentional that only lc_messages references the
> > environment at boot time. On the other hand, previously, in the
> > absence of a specified locale, initdb would embed the environmental
> > value in the configuration file, as it seems to be documented. Given
> > that initdb is always used for cluster creation, it's unlikely that
> > systems depend on this boot-time default for their operation.
> 
> Yeah, after further reflection there doesn't seem to be a lot of value
> in leaving these entries commented-out, even in the cases where that's
> technically correct.  Let's just go back to the old behavior of always
> uncommenting them; that stood for years without complaints.  So I
> committed your latest patch as-is.
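
For illustration, with the restored behavior initdb always writes the
resolved locale values uncommented into postgresql.conf (the exact
values below are examples, derived from whatever environment initdb
ran under):

```
lc_messages = 'en_US.UTF-8'		# locale for system error message
					# strings
lc_monetary = 'en_US.UTF-8'		# locale for monetary formatting
lc_numeric = 'en_US.UTF-8'		# locale for number formatting
lc_time = 'en_US.UTF-8'			# locale for time formatting
```

Previously, some of these entries could be left commented out, so a
server booted from such a configuration would fall back to boot-time
defaults rather than the values initdb resolved.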

I'm glad you understand. Thank you for committing.

> > By the way, the lines around lc_* in the sample file seem to have
> > somewhat inconsistent indentation. Wouldn't it be preferable to fix
> > this? (The attached doesn't do that.)
> 
> They look all right if you assume the tab width is 8, which seems to
> be what is used elsewhere in the file.  I think there's been some
> prior discussion about whether to ban use of tabs at all in these
> sample files, so as to reduce confusion about how wide the tabs are.
> But I'm not touching that question today.

Ah, I see; understood.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



