Tighten error control for OpenTransientFile/CloseTransientFile
Hi all, Joe's message here has reminded me that we have lacked a lot of error handling around CloseTransientFile(): https://www.postgresql.org/message-id/c49b69ec-e2f7-ff33-4f17-0eaa4f2ce...@joeconway.com This has been mentioned by Alvaro a couple of months ago (cannot find the thread about that at quick glance), and I just forgot about it at that time. Anyway, attached is a patch to do some cleanup for all that: - Switch OpenTransientFile to read-only where sufficient. - Add more error handling for CloseTransientFile A major take of this patch is to make sure that the new error messages generated have an elevel consistent with their neighbors. Just on time for this last CF. Thoughts? -- Michael diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 9905593661..7b39283c89 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -1991,7 +1991,10 @@ qtext_load_file(Size *buffer_size) return NULL; } - CloseTransientFile(fd); + if (CloseTransientFile(fd)) + ereport(LOG, +(errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", PGSS_TEXT_FILE))); *buffer_size = stat.st_size; return buf; diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index f5cf9ffc9c..bce4274362 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -1202,7 +1202,10 @@ heap_xlog_logical_rewrite(XLogReaderState *r) errmsg("could not fsync file \"%s\": %m", path))); pgstat_report_wait_end(); - CloseTransientFile(fd); + if (CloseTransientFile(fd)) + ereport(ERROR, +(errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", path))); } /* --- @@ -1300,7 +1303,11 @@ CheckPointLogicalRewriteHeap(void) (errcode_for_file_access(), errmsg("could not fsync file \"%s\": %m", path))); pgstat_report_wait_end(); - CloseTransientFile(fd); + + if (CloseTransientFile(fd)) +ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", path))); } } FreeDir(mappings_dir); diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index 3623352b9c..974d42fc86 100644 --- a/src/backend/access/transam/slru.c +++ b/src/backend/access/transam/slru.c @@ -599,7 +599,7 @@ SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno) SlruFileName(ctl, path, segno); - fd = OpenTransientFile(path, O_RDWR | PG_BINARY); + fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); if (fd < 0) { /* expected: file doesn't exist */ @@ -621,7 +621,13 @@ SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno) result = endpos >= (off_t) (offset + BLCKSZ); - CloseTransientFile(fd); + if (CloseTransientFile(fd)) + { + slru_errcause = SLRU_CLOSE_FAILED; + slru_errno = errno; + return false; + } + return result; } @@ -654,7 +660,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno) * SlruPhysicalWritePage). Hence, if we are InRecovery, allow the case * where the file doesn't exist, and return zeroes instead. */ - fd = OpenTransientFile(path, O_RDWR | PG_BINARY); + fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); if (fd < 0) { if (errno != ENOENT || !InRecovery) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index c96c8b60ba..cbd9b2cee1 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -370,7 +370,11 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, } pgstat_report_wait_end(); } - CloseTransientFile(srcfd); + + if (CloseTransientFile(srcfd)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", path))); } /* @@ -416,7 +420,6 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, (errcode_for_file_access(), errmsg("could not close file \"%s\": %m", tmppath))); - /* * Now move the completed history file into place with its final name. */ @@ -495,7 +498,6 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size) (errcode_for_file_access(), errmsg("could not close file \"%s\": %m", tmppath))); - /* * Now move the completed history file into place with its final name. */ diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 64679dd2de..21986e48fe 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1297,7 +1297,11 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok) } pgstat_report_wait_end(); - CloseTransientFile(fd); + + if (CloseTransientFile(fd)) + ereport(ERROR, +(errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", path))); hdr = (TwoPhaseFileHeader *) buf; if (hdr->magic != TWOPHASE_MAGIC) diff --git a/src/bac
Re: Tighten error control for OpenTransientFile/CloseTransientFile
On 2/28/19 9:33 PM, Michael Paquier wrote: > Hi all, > > Joe's message here has reminded me that we have lacked a lot of error > handling around CloseTransientFile(): > https://www.postgresql.org/message-id/c49b69ec-e2f7-ff33-4f17-0eaa4f2ce...@joeconway.com > > This has been mentioned by Alvaro a couple of months ago (cannot find > the thread about that at quick glance), and I just forgot about it at > that time. Anyway, attached is a patch to do some cleanup for all > that: > - Switch OpenTransientFile to read-only where sufficient. > - Add more error handling for CloseTransientFile > A major take of this patch is to make sure that the new error messages > generated have an elevel consistent with their neighbors. > > Just on time for this last CF. Thoughts? Seems like it would be better to modify the arguments to CloseTransientFile() to include the filename being closed, errorlevel, and fail_on_error or something similar. Then all the repeated ereport stanzas could be eliminated. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development signature.asc Description: OpenPGP digital signature
Re: Tighten error control for OpenTransientFile/CloseTransientFile
On Fri, Mar 01, 2019 at 05:05:54PM -0500, Joe Conway wrote: > Seems like it would be better to modify the arguments to > CloseTransientFile() to include the filename being closed, errorlevel, > and fail_on_error or something similar. Then all the repeated ereport > stanzas could be eliminated. Sure. Now some code paths close file descriptors without having at hand the file name, which would mean that we'd need to pass NULL as argument in this case. That's not really elegant in my opinion. And having a consistent mapping with the system's close() is not really bad to me either.. -- Michael signature.asc Description: PGP signature
Re: Tighten error control for OpenTransientFile/CloseTransientFile
Overall the patch looks good and according to the previous discussion fulfils its purpose. It might be worthwhile to also check for errors on close in SaveSlotToPath(). pgstat_report_wait_end(); CloseTransientFile(fd); /* rename to permanent file, fsync file and directory */ if (rename(tmppath, path) != 0)
Re: Tighten error control for OpenTransientFile/CloseTransientFile
On Fri, Mar 1, 2019 at 5:06 PM Joe Conway wrote: > Seems like it would be better to modify the arguments to > CloseTransientFile() to include the filename being closed, errorlevel, > and fail_on_error or something similar. Then all the repeated ereport > stanzas could be eliminated. Hmm. I'm not sure that really saves much in terms of notation, and it's less flexible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Tighten error control for OpenTransientFile/CloseTransientFile
On Wed, Mar 06, 2019 at 02:54:52PM +, Georgios Kokolatos wrote: > Overall the patch looks good and according to the previous > discussion fulfils its purpose. > > It might be worthwhile to also check for errors on close in > SaveSlotToPath(). Thanks for the feedback, added. I have spent some time double-checking this stuff, and noticed that the new errors in StartupReplicationOrigin() and CheckPointReplicationOrigin() should be switched from ERROR to PANIC to be consistent. One message in dsm_impl_mmap() was not consistent either. Are there any objections if I commit this patch? -- Michael diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 9905593661..7b39283c89 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -1991,7 +1991,10 @@ qtext_load_file(Size *buffer_size) return NULL; } - CloseTransientFile(fd); + if (CloseTransientFile(fd)) + ereport(LOG, +(errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", PGSS_TEXT_FILE))); *buffer_size = stat.st_size; return buf; diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index f5cf9ffc9c..bce4274362 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -1202,7 +1202,10 @@ heap_xlog_logical_rewrite(XLogReaderState *r) errmsg("could not fsync file \"%s\": %m", path))); pgstat_report_wait_end(); - CloseTransientFile(fd); + if (CloseTransientFile(fd)) + ereport(ERROR, +(errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", path))); } /* --- @@ -1300,7 +1303,11 @@ CheckPointLogicalRewriteHeap(void) (errcode_for_file_access(), errmsg("could not fsync file \"%s\": %m", path))); pgstat_report_wait_end(); - CloseTransientFile(fd); + + if (CloseTransientFile(fd)) +ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", path))); } } FreeDir(mappings_dir); diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index 3623352b9c..974d42fc86 100644 --- a/src/backend/access/transam/slru.c +++ b/src/backend/access/transam/slru.c @@ -599,7 +599,7 @@ SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno) SlruFileName(ctl, path, segno); - fd = OpenTransientFile(path, O_RDWR | PG_BINARY); + fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); if (fd < 0) { /* expected: file doesn't exist */ @@ -621,7 +621,13 @@ SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno) result = endpos >= (off_t) (offset + BLCKSZ); - CloseTransientFile(fd); + if (CloseTransientFile(fd)) + { + slru_errcause = SLRU_CLOSE_FAILED; + slru_errno = errno; + return false; + } + return result; } @@ -654,7 +660,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno) * SlruPhysicalWritePage). Hence, if we are InRecovery, allow the case * where the file doesn't exist, and return zeroes instead. */ - fd = OpenTransientFile(path, O_RDWR | PG_BINARY); + fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); if (fd < 0) { if (errno != ENOENT || !InRecovery) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index c96c8b60ba..cbd9b2cee1 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -370,7 +370,11 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, } pgstat_report_wait_end(); } - CloseTransientFile(srcfd); + + if (CloseTransientFile(srcfd)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", path))); } /* @@ -416,7 +420,6 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, (errcode_for_file_access(), errmsg("could not close file \"%s\": %m", tmppath))); - /* * Now move the completed history file into place with its final name. */ @@ -495,7 +498,6 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size) (errcode_for_file_access(), errmsg("could not close file \"%s\": %m", tmppath))); - /* * Now move the completed history file into place with its final name. */ diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 64679dd2de..21986e48fe 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1297,7 +1297,11 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok) } pgstat_report_wait_end(); - CloseTransientFile(fd); + + if (CloseTransientFile(fd)) + ereport(ERROR, +(errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", path))); hdr = (TwoPhaseFileHeader *) buf; if (hdr->magic != TWOPHASE_MAGIC) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 0fdd82a287..c7047738b6 100644 ---
Re: Tighten error control for OpenTransientFile/CloseTransientFile
The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: not tested Spec compliant: not tested Documentation:not tested The second version of this patch seems to be in order and ready for committer. Thank you for taking the time to code! The new status of this patch is: Ready for Committer
Re: Tighten error control for OpenTransientFile/CloseTransientFile
On 2019-Mar-07, Michael Paquier wrote: > #else > - close(fd); > + if (close(fd)) > + { > + fprintf(stderr, _("%s: could not close file \"%s\": %s"), > + progname, ControlFilePath, strerror(errno)); > + exit(EXIT_FAILURE); > + } > #endif I think this one needs a terminating \n. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Tighten error control for OpenTransientFile/CloseTransientFile
On Thu, Mar 07, 2019 at 10:00:05PM -0300, Alvaro Herrera wrote: > I think this one needs a terminating \n. Argh... Thanks for the lookup, Alvaro. -- Michael signature.asc Description: PGP signature
Re: Tighten error control for OpenTransientFile/CloseTransientFile
On Fri, Mar 08, 2019 at 10:23:24AM +0900, Michael Paquier wrote: > Argh... Thanks for the lookup, Alvaro. And committed, after an extra pass to beautify the whole experience. -- Michael signature.asc Description: PGP signature
Re: Tighten error control for OpenTransientFile/CloseTransientFile
Hi, On 2019-03-07 10:56:25 +0900, Michael Paquier wrote: > diff --git a/src/backend/access/heap/rewriteheap.c > b/src/backend/access/heap/rewriteheap.c > index f5cf9ffc9c..bce4274362 100644 > --- a/src/backend/access/heap/rewriteheap.c > +++ b/src/backend/access/heap/rewriteheap.c > @@ -1202,7 +1202,10 @@ heap_xlog_logical_rewrite(XLogReaderState *r) >errmsg("could not fsync file \"%s\": %m", > path))); > pgstat_report_wait_end(); > > - CloseTransientFile(fd); > + if (CloseTransientFile(fd)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not close file \"%s\": %m", > path))); > } ... > diff --git a/src/backend/access/transam/twophase.c > b/src/backend/access/transam/twophase.c > index 64679dd2de..21986e48fe 100644 > --- a/src/backend/access/transam/twophase.c > +++ b/src/backend/access/transam/twophase.c > @@ -1297,7 +1297,11 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok) > } > > pgstat_report_wait_end(); > - CloseTransientFile(fd); > + > + if (CloseTransientFile(fd)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not close file \"%s\": %m", > path))); > > hdr = (TwoPhaseFileHeader *) buf; > if (hdr->magic != TWOPHASE_MAGIC) > diff --git a/src/backend/access/transam/xlog.c > b/src/backend/access/transam/xlog.c > index 0fdd82a287..c7047738b6 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -3469,7 +3469,10 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, > XLogSegNo srcsegno, > (errcode_for_file_access(), >errmsg("could not close file \"%s\": %m", > tmppath))); > > - CloseTransientFile(srcfd); > + if (CloseTransientFile(srcfd)) > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not close file \"%s\": %m", > path))); > > /* >* Now move the segment into place with its final name. ... This seems like an odd set of changes to me. What is this supposed to buy us? The commit message says: 2) When opening transient files, it is up to the caller to close the file descriptors opened. In error code paths, CloseTransientFile() gets called to clean up things before issuing an error. However in normal exit paths, a lot of callers of CloseTransientFile() never actually reported errors, which could leave a file descriptor open without knowing about it. This is an issue I complained about a couple of times, but never had the courage to write and submit a patch, so here we go. but that reasoning seems bogus to me. For one, on just about any platform close always closes the fd, even when returning an error (unless you pass in a bad fd, in which case it obviously doesn't). So the reasoning that this fixes unnoticed fd leaks doesn't really seem to make sense. For another, even if it did, throwing an ERROR seems to achieve very little: We continue with a leaked fd *AND* we cause the operation to error out. I can see reasoning for: - LOG, so it can be noticed, but operations continue to work - FATAL, to fix the leak - PANIC, so we recover from the problem, in case of the close indicating a durability issue commit 9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1 Author: Thomas Munro Date: 2018-11-19 13:31:10 +1300 PANIC on fsync() failure. but ERROR seems to have very little going for it. The durability argument doesn't seem to apply for the cases where we previously fsynced the file, a significant fraction of the locations you touched. And if your goal was just to achieve consistency, I also don't understand, because you left plenty close()'s unchecked? E.g. you added an error check in get_controlfile(), but not one in ReadControlFile(). alterSystemSetConfigFile() writes, but you didn't add one. Greetings, Andres Freund
Re: Tighten error control for OpenTransientFile/CloseTransientFile
On Fri, Oct 04, 2019 at 01:39:38PM -0700, Andres Freund wrote: > but that reasoning seems bogus to me. For one, on just about any > platform close always closes the fd, even when returning an error > (unless you pass in a bad fd, in which case it obviously doesn't). So > the reasoning that this fixes unnoticed fd leaks doesn't really seem to > make sense. For another, even if it did, throwing an ERROR seems to > achieve very little: We continue with a leaked fd *AND* we cause the > operation to error out. I have read again a couple of times the commit log, and this mentions to let users know that a fd is leaking, not that it fixes things. Still we get to know about it, while previously it was not possible. In some cases we may see errors in close() after a previous write(2). Of course this does not apply to all the code paths patched here, but it seems to me that's a good habit to spread, no? > I can see reasoning for: > - LOG, so it can be noticed, but operations continue to work > - FATAL, to fix the leak > - PANIC, so we recover from the problem, in case of the close indicating > a durability issue LOG or WARNING would not be visible enough and would likely be skipped by users. Not sure that this justifies a FATAL either, and PANIC would cause more harm than necessary, so for most of them ERROR sounded like a good compromise, still the elevel choice is not innocent depending on the code paths patched, because the elevel used is consistent with the error handling of the surroundings. > And if your goal was just to achieve consistency, I also don't > understand, because you left plenty close()'s unchecked? E.g. you added > an error check in get_controlfile(), but not one in > ReadControlFile(). alterSystemSetConfigFile() writes, but you didn't add > one. Because I have not considered these when looking at transient files. That may be worth an extra lookup. -- Michael signature.asc Description: PGP signature