On 2021-03-19 16:30, Fujii Masao wrote: > On 2021/03/15 10:39, Masahiro Ikeda wrote: >> Thanks, I understood get_sync_bit() checks the sync flags and >> the write unit of generated wal data and replicated wal data is >> different. >> (It's interesting optimization whether to use kernel cache or not.) >> >> OK. Although I agree to separate the stats for the walrecever, >> I want to hear opinions from other people too. I didn't change the >> patch. >> >> Please feel to your comments. > > What about applying the patch for common WAL write function like > XLogWriteFile(), separately from the patch for walreceiver's stats? > Seems the former reaches the consensus, so we can commit it firstly. > Also even only the former change is useful because which allows > walreceiver to report WALWrite wait event.
Agreed. I separated the patches. If only the former is committed, my trivial concern is that there may be a disadvantage, but no advantage for the standby server. It may lead to performance degradation to the wal receiver by calling INSTR_TIME_SET_CURRENT(), but the stats can't visible for users until the latter patch is committed. I think it's ok because this not happening in the case to disable the "track_wal_io_timing" in the standby server. Although some users may start the standby server using the backup which "track_wal_io_timing" is enabled in the primary server, they will say it's ok since the users already accept the performance degradation in the primary server. >> OK. I agree. >> >> I wonder to change the error check ways depending on who calls this >> function? >> Now, only the walreceiver checks (1)errno==0 and doesn't check >> (2)errno==ENITR. >> Other processes are the opposite. >> >> IIUC, it's appropriate that every process checks (1)(2). >> Please let me know my understanding is wrong. > > I'm thinking the same. Regarding (2), commit 79ce29c734 introduced > that code. According to the following commit log, it seems harmless > to retry on EINTR even walreceiver. > > Also retry on EINTR. All signals used in the backend are flagged > SA_RESTART > nowadays, so it shouldn't happen, but better to be defensive. Thanks, I understood. >>> BTW, currently XLogWrite() increments IO timing even when pg_pwrite() >>> reports an error. But this is useless. Probably IO timing should be >>> incremented after the return code of pg_pwrite() is checked, instead? >> >> Yes, I agree. I fixed it. >> (v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch) > > Thanks for the patch! > > nleft = nbytes; > do > { > - errno = 0; > + written = XLogWriteFile(openLogFile, from, > nleft, (off_t) > startoffset, > + > ThisTimeLineID, openLogSegNo, wal_segment_size); > > Can we merge this do-while loop in XLogWrite() into the loop > in XLogWriteFile()? > If we do that, ISTM that the following codes are not necessary in > XLogWrite(). > > nleft -= written; > from += written; OK, I fixed it. > + * 'segsize' is a segment size of WAL segment file. > > Since segsize is always wal_segment_size, segsize argument seems > not necessary in XLogWriteFile(). Right. I fixed it. > +XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset, > + TimeLineID timelineid, XLogSegNo segno, int segsize) > > Why did you use "const void *" instead of "char *" for *buf? I followed the argument of pg_pwrite(). But, I think "char *" is better, so fixed it. > Regarding 0005 patch, I will review it later. Thanks. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index a7a94d2a83..df028c5039 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -771,6 +771,9 @@ WalRcvDie(int code, Datum arg) /* Ensure that all WAL records received are flushed to disk */ XLogWalRcvFlush(true); + /* Send WAL statistics to the stats collector before terminating */ + pgstat_send_wal(true); + /* Mark ourselves inactive in shared memory */ SpinLockAcquire(&walrcv->mutex); Assert(walrcv->walRcvState == WALRCV_STREAMING || @@ -910,6 +913,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) XLogArchiveForceDone(xlogfname); else XLogArchiveNotify(xlogfname); + + /* + * Send WAL statistics to the stats collector when finishing + * the current WAL segment file to avoid overloading it. + */ + pgstat_send_wal(false); } recvFile = -1;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index bd5e787e55..4c7d90f1b9 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2536,61 +2536,14 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) Size nbytes; Size nleft; int written; - instr_time start; /* OK to write the page(s) */ from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ; nbytes = npages * (Size) XLOG_BLCKSZ; nleft = nbytes; - do - { - errno = 0; - - /* Measure I/O timing to write WAL data */ - if (track_wal_io_timing) - INSTR_TIME_SET_CURRENT(start); - - pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); - written = pg_pwrite(openLogFile, from, nleft, startoffset); - pgstat_report_wait_end(); - - /* - * Increment the I/O timing and the number of times WAL data - * were written out to disk. - */ - if (track_wal_io_timing) - { - instr_time duration; - - INSTR_TIME_SET_CURRENT(duration); - INSTR_TIME_SUBTRACT(duration, start); - WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration); - } - - WalStats.m_wal_write++; - - if (written <= 0) - { - char xlogfname[MAXFNAMELEN]; - int save_errno; - - if (errno == EINTR) - continue; - - save_errno = errno; - XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo, - wal_segment_size); - errno = save_errno; - ereport(PANIC, - (errcode_for_file_access(), - errmsg("could not write to log file %s " - "at offset %u, length %zu: %m", - xlogfname, startoffset, nleft))); - } - nleft -= written; - from += written; - startoffset += written; - } while (nleft > 0); + written = XLogWriteFile(openLogFile, from, nleft, (off_t) startoffset, + ThisTimeLineID, openLogSegNo, true); + startoffset += written; npages = 0; @@ -2707,6 +2660,94 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) } } +/* + * Issue pg_pwrite to write an WAL segment file. + * + * 'fd' is a file descriptor for the XLOG file to write + * 'buf' is a buffer starting address to write. + * 'nbyte' is a number of max bytes to write up. + * 'offset' is a offset of XLOG file to be set. + * 'timelineid' is a timeline ID of WAL segment file. + * 'segno' is a segment number of WAL segment file. + * 'write_all' is whether to write 'nbyte' exactly. + * + * Return the number of bytes written. + */ +int +XLogWriteFile(int fd, char *buf, size_t nbyte, off_t offset, + TimeLineID timelineid, XLogSegNo segno, bool write_all) +{ + int written = 0; + + /* + * Loop until to write the buffer data or an error occurred. + */ + for (;;) + { + int written_tmp; + instr_time start; + + errno = 0; + + /* Measure I/O timing to write WAL data */ + if (track_wal_io_timing) + INSTR_TIME_SET_CURRENT(start); + + pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); + written_tmp = pg_pwrite(fd, buf, nbyte, offset); + pgstat_report_wait_end(); + + if (written_tmp <= 0) + { + char xlogfname[MAXFNAMELEN]; + int save_errno; + + /* + * Retry on EINTR. All signals used in the backend and background + * processes are flagged SA_RESTART, so it shouldn't happen, but + * better to be defensive. + */ + if (errno == EINTR) + continue; + + /* if write didn't set errno, assume no disk space */ + if (errno == 0) + errno = ENOSPC; + + save_errno = errno; + XLogFileName(xlogfname, timelineid, segno, wal_segment_size); + errno = save_errno; + ereport(PANIC, + (errcode_for_file_access(), + errmsg("could not write to log file %s " + "at offset %u, length %zu: %m", + xlogfname, (unsigned int) offset, nbyte))); + } + + /* + * Increment the I/O timing and the number of times WAL data were + * written out to disk. + */ + if (track_wal_io_timing) + { + instr_time duration; + + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration); + } + + WalStats.m_wal_write++; + + nbyte -= written_tmp; + buf += written_tmp; + written += written_tmp; + + if (!write_all || nbyte <= 0) + return written; + } +} + /* * Record the LSN for an asynchronous transaction commit/abort * and nudge the WALWriter if there is work for it to do. diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index a7a94d2a83..daf764446f 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -868,7 +868,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len) static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) { - int startoff; + uint32 startoff; int byteswritten; while (nbytes > 0) @@ -929,27 +929,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) segbytes = nbytes; /* OK to write the logs */ - errno = 0; - - byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff); - if (byteswritten <= 0) - { - char xlogfname[MAXFNAMELEN]; - int save_errno; - - /* if write didn't set errno, assume no disk space */ - if (errno == 0) - errno = ENOSPC; - - save_errno = errno; - XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size); - errno = save_errno; - ereport(PANIC, - (errcode_for_file_access(), - errmsg("could not write to log segment %s " - "at offset %u, length %lu: %m", - xlogfname, startoff, (unsigned long) segbytes))); - } + byteswritten = XLogWriteFile(recvFile, buf, segbytes, (off_t) startoff, + recvFileTLI, recvSegNo, false); /* Update state for write */ recptr += byteswritten; diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 77187c12be..b562cfa4c1 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -298,6 +298,10 @@ extern bool XLogBackgroundFlush(void); extern bool XLogNeedsFlush(XLogRecPtr RecPtr); extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock); extern int XLogFileOpen(XLogSegNo segno); +extern int XLogWriteFile(int fd, char *buf, + size_t nbyte, off_t offset, + TimeLineID timelineid, XLogSegNo segno, + bool write_all); extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli); extern XLogSegNo XLogGetLastRemovedSegno(void);