On 2021-03-18 13:37, Fujii Masao wrote:
On 2021/03/18 11:59, kuroda.hay...@fujitsu.com wrote:
Dear Ikeda-san,
I confirmed new patch and no problem was found. Thanks.
(I'm not a native English speaker, so I cannot check your comments
correctly, sorry)
One user-visible side-effect by this change is; with the patch, the
stats is
cleared when only the stats collector is killed (with SIGQUIT)
accidentally
and restarted by postmaster later.
Thanks for your comments.
As you said, the temporary stats files don't removed if the stats
collector is killed with SIGQUIT.
So, if the user change the GUC parameter "stats_temp_directory" after
immediate shutdown,
temporary stats file can't be removed forever.
But, I think this case is rarely happened and unimportant. Actually,
pgstat_write_statsfiles()
didn't check error of unlink() and the same problem is occurred if the
server is crashed now.
The documentation said following. I think it's enough.
```
For better performance, <varname>stats_temp_directory</varname> can
be
pointed at a RAM-based file system, decreasing physical I/O
requirements.
When the server shuts down cleanly, a permanent copy of the
statistics
data is stored in the <filename>pg_stat</filename> subdirectory, so
that
statistics can be retained across server restarts. When recovery is
performed at server start (e.g., after immediate shutdown, server
crash,
and point-in-time recovery), all statistics counters are reset.
```
On the other than, currently the stats is
written in that case and subsequently-restarted stats collector can use
that stats file. I'm not sure if we need to keep supporting this
behavior, though.
I don't have any strong opinion this behaivor is useless too.
Since the reinitialized phase is not executed when only the stats
collector is crashed
(since it didn't touch the shared memory), if the permanent stats file
is exists, the
stats collector can use it. But, IIUC, the case is rare.
The case is happened by operation mistake which a operator sends the
SIGQUIT signal to
the stats collector. Please let me know if there are more important
case.
But, if SIGKILL is sent by operator, the stats can't be rescure now
because the permanent stats
files can't be written before exiting. Since the case which can rescure
the stats is rare,
I think it's ok to initialize the stats even if SIGQUIT is sent.
If to support this feature, we need to implement the following first.
(2) As another aspect, it needs to change the behavior removing all
stats files because autovacuum
uses the stats. There are some ideas, for example writing the stats
files every X minutes
(using wal or another mechanism) and even if a crash occurs, the
startup process can restore
the stats with slightly low freshness. But, it needs to find a way
how to handle the stats files
when deleting on PITR rewind or stats collector crash happens.
When only the stats collector exits by SIGQUIT, with the patch
FreeWaitEventSet() is also skipped. Is this ok?
Thanks, I fixed it.
- * Loop to process messages until we get SIGQUIT or detect ungraceful
- * death of our parent postmaster.
+ * Loop to process messages until we get SIGTERM or SIGQUIT of our
parent
+ * postmaster.
"SIGTERM or SIGQUIT of our parent postmaster" should be
"SIGTERM, SIGQUIT, or detect ungraceful death of our parent
postmaster"?
Yes, I fixed it.
+SignalHandlerForUnsafeExit(SIGNAL_ARGS)
I don't think SignalHandlerForUnsafeExit is good name. Because that's
not
"unsafe" exit. No? Even after this signal handler is triggered, the
server is
still running normally and a process that exits will be restarted
later. What
about SignalHandlerForNonCrashExit or SignalHandlerForNonFatalExit?
OK, I fixed.
I changed to the SignalPgstatHandlerForNonCrashExit() to add
FreeWaitEventSet()
in the handler for the stats collector.
Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index dd9136a942..2cc5a39ec5 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
* shut down and exit.
*
* Typically, this handler would be used for SIGTERM, but some processes use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2 and the
+ * WAL writer exits on either SIGINT or SIGTERM.
*
* ShutdownRequestPending should be checked at a convenient place within the
* main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 208a33692f..7c09a8d5fa 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -190,6 +190,8 @@ static time_t last_pgstat_start_time;
static bool pgStatRunningInCollector = false;
+static WaitEventSet *wes;
+
/*
* Structures in which backends store per-table info that's waiting to be
* sent to the collector.
@@ -386,6 +388,8 @@ static void pgstat_recv_connstat(PgStat_MsgConn *msg, int len);
static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void SignalPgstatHandlerForNonCrashExit(SIGNAL_ARGS);
+
/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
@@ -724,6 +728,7 @@ pgstat_reset_remove_files(const char *directory)
snprintf(fname, sizeof(fname), "%s/%s", directory,
entry->d_name);
+ elog(DEBUG2, "removing stats file \"%s\"", fname);
unlink(fname);
}
FreeDir(dir);
@@ -4817,17 +4822,16 @@ PgstatCollectorMain(int argc, char *argv[])
PgStat_Msg msg;
int wr;
WaitEvent event;
- WaitEventSet *wes;
/*
* Ignore all signals usually bound to some action in the postmaster,
- * except SIGHUP and SIGQUIT. Note we don't need a SIGUSR1 handler to
- * support latch operations, because we only use a local latch.
+ * except SIGHUP, SIGTERM and SIGQUIT. Note we don't need a SIGUSR1
+ * handler to support latch operations, because we only use a local latch.
*/
pqsignal(SIGHUP, SignalHandlerForConfigReload);
pqsignal(SIGINT, SIG_IGN);
- pqsignal(SIGTERM, SIG_IGN);
- pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ pqsignal(SIGQUIT, SignalPgstatHandlerForNonCrashExit);
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGUSR1, SIG_IGN);
@@ -4852,8 +4856,8 @@ PgstatCollectorMain(int argc, char *argv[])
AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
/*
- * Loop to process messages until we get SIGQUIT or detect ungraceful
- * death of our parent postmaster.
+ * Loop to process messages until we get SIGTERM, SIGQUIT or detect
+ * ungraceful death of our parent postmaster.
*
* For performance reasons, we don't want to do ResetLatch/WaitLatch after
* every message; instead, do that only after a recv() fails to obtain a
@@ -4871,7 +4875,7 @@ PgstatCollectorMain(int argc, char *argv[])
ResetLatch(MyLatch);
/*
- * Quit if we get SIGQUIT from the postmaster.
+ * Quit if we get SIGTERM from the postmaster.
*/
if (ShutdownRequestPending)
break;
@@ -7462,3 +7466,19 @@ pgstat_count_slru_truncate(int slru_idx)
{
slru_entry(slru_idx)->m_truncate += 1;
}
+
+/*
+ * Simple signal handler for handling SIGQUIT to exit quickly without
+ * writing stats files.
+ */
+static void
+SignalPgstatHandlerForNonCrashExit(SIGNAL_ARGS)
+{
+ FreeWaitEventSet(wes);
+
+ /*
+ * We arrange to do _exit(1) because the stats collector doesn't touch
+ * shared memory.
+ */
+ _exit(1);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6436ae0f48..b9ca39ef95 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3084,7 +3084,7 @@ reaper(SIGNAL_ARGS)
* nothing left for it to do.
*/
if (PgStatPID != 0)
- signal_child(PgStatPID, SIGQUIT);
+ signal_child(PgStatPID, SIGTERM);
}
else
{