Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Re: Tom Lane 2015-05-29 <13871.1432921...@sss.pgh.pa.us> > Why can't the user stop it? We won't be bleating about the case of a > symlink to a non-writable file someplace else, which is the Debian use > case. I don't see a very good excuse to have a non-writable file right > in the data directory. I've repeatedly seen PGDATA or pg_xlog been put directly on a mountpoint, which means there well be a non-writable lost+found directory there. (A case with pg_xlog was also reported as a support case at credativ.) I'm usually advising against using the top level directory directly, but it's not uncommon to encounter it. > In any case, if the cost of such a file is one more line of log output > during a crash restart, most people would have no problem at all in > ignoring that log output. Nod. Christoph -- c...@df7cb.de | http://www.df7cb.de/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
* Andres Freund (and...@anarazel.de) wrote: > On 2015-05-29 13:49:16 -0400, Tom Lane wrote: > > > That sounds like a potentially nontrivial amount of repetitive log bleat > > > after every crash start? One which the user can't really stop? > > > > Why can't the user stop it? > > Because it makes a good amount of sense to have e.g. certificates not > owned by postgres and not writeable? You don't necessarily want to > symlink them somewhere else, because that makes moving clusters around > harder than when they're self contained. A certain other file might be non-writable by PG too... (*cough* .auto *cough*). > > I'd say it's a pretty damn-fool arrangement: for starters, it's an > > unnecessary security hazard. > > I don't buy the security argument at all. You likely have > postgresql.conf in the data directoy. You can write to at least .auto, > which will definitely reside the data directory. That contains > archive_command. I'm not sure that I see the security issue here either.. We're not talking about setuid shell scripts or anything that isn't running as the PG user, which a superuser could take over anyway.. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-29 14:15:48 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2015-05-29 13:49:16 -0400, Tom Lane wrote: > >> Why can't the user stop it? > > > Because it makes a good amount of sense to have e.g. certificates not > > owned by postgres and not writeable? You don't necessarily want to > > symlink them somewhere else, because that makes moving clusters around > > harder than when they're self contained. > > Meh. Well, I'm willing to yield on the EACCES point, but I still find > the exclusion for ETXTBSY to be ugly and inappropriate. Ok. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Andres Freund writes: > On 2015-05-29 13:49:16 -0400, Tom Lane wrote: >> Why can't the user stop it? > Because it makes a good amount of sense to have e.g. certificates not > owned by postgres and not writeable? You don't necessarily want to > symlink them somewhere else, because that makes moving clusters around > harder than when they're self contained. Meh. Well, I'm willing to yield on the EACCES point, but I still find the exclusion for ETXTBSY to be ugly and inappropriate. >> I'd say it's a pretty damn-fool arrangement: for starters, it's an >> unnecessary security hazard. > I don't buy the security argument at all. You likely have > postgresql.conf in the data directoy. You can write to at least .auto, > which will definitely reside the data directory. That contains > archive_command. The fact that a superuser might have multiple ways to subvert things doesn't make it a good idea to add another one: the attack surface could be larger, or at least different. But even if you don't buy that it's a security hazard, why would it be a good idea to have executables inside $PGDATA? That would for example lead to them getting copied by pg_basebackup, which seems unlikely to be a good thing. And if you did have such executables there, why would they be active during a postmaster restart? I really seriously doubt that this is either common enough or useful enough to justify suppressing warning messages about it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-29 13:49:16 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2015-05-29 13:14:18 -0400, Tom Lane wrote: > >> Abhijit Menon-Sen writes: > >> As I mentioned yesterday, I'm not really on board with ignoring EACCES, > >> except for the directories-on-Windows case. Since we're only logging > >> the failures anyway, I think it is reasonable to log a complaint for any > >> unwritable file in the data directory. > > > That sounds like a potentially nontrivial amount of repetitive log bleat > > after every crash start? One which the user can't really stop? > > Why can't the user stop it? Because it makes a good amount of sense to have e.g. certificates not owned by postgres and not writeable? You don't necessarily want to symlink them somewhere else, because that makes moving clusters around harder than when they're self contained. > I'd say it's a pretty damn-fool arrangement: for starters, it's an > unnecessary security hazard. I don't buy the security argument at all. You likely have postgresql.conf in the data directoy. You can write to at least .auto, which will definitely reside the data directory. That contains archive_command. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Andres Freund writes: > On 2015-05-29 13:14:18 -0400, Tom Lane wrote: >> Abhijit Menon-Sen writes: >> As I mentioned yesterday, I'm not really on board with ignoring EACCES, >> except for the directories-on-Windows case. Since we're only logging >> the failures anyway, I think it is reasonable to log a complaint for any >> unwritable file in the data directory. > That sounds like a potentially nontrivial amount of repetitive log bleat > after every crash start? One which the user can't really stop? Why can't the user stop it? We won't be bleating about the case of a symlink to a non-writable file someplace else, which is the Debian use case. I don't see a very good excuse to have a non-writable file right in the data directory. >> Also I want to get rid of the ETXTBSY special cases. That one doesn't >> seem like something that we should silently ignore: what the heck are >> executables doing in the data directory? Or is there some other meaning >> on Windows? > I've seen a bunch of binaries placed in the data directory as > archive/restore commands. Those will be busy a good amount of the > time. While it'd not be my choice to do that, it's not entirely > unreasonable. I'd say it's a pretty damn-fool arrangement: for starters, it's an unnecessary security hazard. In any case, if the cost of such a file is one more line of log output during a crash restart, most people would have no problem at all in ignoring that log output. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-29 13:14:18 -0400, Tom Lane wrote: > Abhijit Menon-Sen writes: > As I mentioned yesterday, I'm not really on board with ignoring EACCES, > except for the directories-on-Windows case. Since we're only logging > the failures anyway, I think it is reasonable to log a complaint for any > unwritable file in the data directory. That sounds like a potentially nontrivial amount of repetitive log bleat after every crash start? One which the user can't really stop? > Also I want to get rid of the ETXTBSY special cases. That one doesn't > seem like something that we should silently ignore: what the heck are > executables doing in the data directory? Or is there some other meaning > on Windows? I've seen a bunch of binaries placed in the data directory as archive/restore commands. Those will be busy a good amount of the time. While it'd not be my choice to do that, it's not entirely unreasonable. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-29 13:14:18 -0400, t...@sss.pgh.pa.us wrote: > > Pushed with minor revisions. Thanks, looks good. > Since we're only logging the failures anyway, I think it is reasonable > to log a complaint for any unwritable file in the data directory. Sounds reasonable, patch attached. ETXTBSY has no special meaning that I can find under Windows, so I included that change as well. -- Abhijit diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index b4f6590..70e2347 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2646,11 +2646,12 @@ pre_sync_fname(const char *fname, bool isdir, int elevel) if (fd < 0) { - if (errno == EACCES || (isdir && errno == EISDIR)) + /* Some platforms don't support opening directories at all. */ + if (isdir && errno == EISDIR) return; -#ifdef ETXTBSY - if (errno == ETXTBSY) +#ifdef WIN32 + if (isdir && errno == EACCES) return; #endif @@ -2701,11 +2702,12 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel) fd = OpenTransientFile((char *) fname, flags, 0); if (fd < 0) { - if (errno == EACCES || (isdir && errno == EISDIR)) + /* Some platforms don't support opening directories at all. */ + if (isdir && errno == EISDIR) return; -#ifdef ETXTBSY - if (errno == ETXTBSY) +#ifdef WIN32 + if (isdir && errno == EACCES) return; #endif diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index f0d66fa..f343168 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -617,11 +617,12 @@ pre_sync_fname(const char *fname, bool isdir) if (fd < 0) { - if (errno == EACCES || (isdir && errno == EISDIR)) + /* Some platforms don't support opening directories at all. */ + if (isdir && errno == EISDIR) return; -#ifdef ETXTBSY - if (errno == ETXTBSY) +#ifdef WIN32 + if (isdir && errno == EACCES) return; #endif @@ -682,11 +683,12 @@ fsync_fname_ext(const char *fname, bool isdir) fd = open(fname, flags); if (fd < 0) { - if (errno == EACCES || (isdir && errno == EISDIR)) + /* Some platforms don't support opening directories at all. */ + if (isdir && errno == EISDIR) return; -#ifdef ETXTBSY - if (errno == ETXTBSY) +#ifdef WIN32 + if (isdir && errno == EACCES) return; #endif -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Abhijit Menon-Sen writes: > At 2015-05-28 17:37:16 -0400, t...@sss.pgh.pa.us wrote: >> I have to leave shortly, so I'll look at the initdb cleanup tomorrow. > Here's a revision of that patch that's more along the lines of what you > committed. Pushed with minor revisions. > It occurred to me that we should probably also silently skip EACCES on > opendir and stat/lstat in walkdir. That's what the attached initdb patch > does. If you think it's a good idea, there's also a small fixup attached > for the backend version too. As I mentioned yesterday, I'm not really on board with ignoring EACCES, except for the directories-on-Windows case. Since we're only logging the failures anyway, I think it is reasonable to log a complaint for any unwritable file in the data directory. I've not done it yet, but what I think we oughta do is revert if (errno == EACCES || (isdir && errno == EISDIR)) back to the former logic if (isdir && (errno == EISDIR || errno == EACCES)) and even then, maybe the EACCES exclusion ought to be Windows only? Also I want to get rid of the ETXTBSY special cases. That one doesn't seem like something that we should silently ignore: what the heck are executables doing in the data directory? Or is there some other meaning on Windows? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Michael Paquier writes: > On Fri, May 29, 2015 at 6:49 PM, Amit Kapila wrote: >> Test - 2 - Directory Symlink for pg_xlog >> First 4 steps are same as Test-1 >> 5. mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc >> 6. Restart Server - Error >> - FATAL: required WAL directory "pg_xlog" does not exist >> This error occurs in >> ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf) >> 7. If I manually step-through that line, it proceeds and in >> SyncDataDirectory(), it detects the symlink; >> 8. After that in SyncDataDirectory(), when it does walkdir for >> datadir, for ./pg_xlog it reports the log message and then >> proceeds normal and the server is started. > Regarding this test, I just tested initdb -X and it works correctly > after crashing the server, so there is at least a workaround for the > error you are triggering here. Now, shouldn't we improve things as > well with mklink? That's a rather narrow case and we have a workaround > for it at initialization with initdb. I haven't looked much at the > code so I don't have a patch at hand but thoughts on this matter are > welcome. My (vague) recollection is that what mklink creates is not the same as a junction point and we intentionally only support the latter as pseudo-symlinks. You'd need to trawl the archives for more detail, unless someone else remembers. So I think we're good here. Thanks for all the testing! regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Fri, May 29, 2015 at 6:49 PM, Amit Kapila wrote: > On Fri, May 29, 2015 at 2:28 PM, Michael Paquier > wrote: >> >> On Fri, May 29, 2015 at 5:01 PM, Amit Kapila >> wrote: >> > >> > Test-3 - Symlinks in pg_tblspc. >> > 1. Create couple of tablespaces which creates symlinks >> > in pg_tblspc >> > 2. Crash the server >> > 3. Restart Server - It works fine. >> > >> > I am not sure Test-2 is a valid thing and we should support it or >> > not, but the current patch is sane w.r.t that form of symlinks >> > as well. >> >> It is always good to check, but is that relevant to the bug fixed? I >> mean, you need to symlink a file in PGDATA that server has no write >> permission on... For example tablespaces can be written to in test 3, >> so that would work even with 9.4.2 after crashing the server with -m >> immediate. > > I have just tried to cover the paths which has symlink related code > in the new commit (in functions SyncDataDirectory() and walkdir()), > because thats what I understood from Tom's mail. Without test-3, it > won't cover walkdir symlink test path, I didn't knew that there was any > confusion about verification of permissions problem on Windows > and I haven't looked to verify the whole patch. Reading once again this thread (after some good red wine + alpha), it looks that you are right: not many checks with Windows junctions have actually been with what has been committed... > Test - 2 - Directory Symlink for pg_xlog > First 4 steps are same as Test-1 > 5. mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc > 6. Restart Server - Error > - FATAL: required WAL directory "pg_xlog" does not exist > This error occurs in > ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf) > 7. If I manually step-through that line, it proceeds and in > SyncDataDirectory(), it detects the symlink; > 8. After that in SyncDataDirectory(), when it does walkdir for > datadir, for ./pg_xlog it reports the log message and then > proceeds normal and the server is started. Regarding this test, I just tested initdb -X and it works correctly after crashing the server, so there is at least a workaround for the error you are triggering here. Now, shouldn't we improve things as well with mklink? That's a rather narrow case and we have a workaround for it at initialization with initdb. I haven't looked much at the code so I don't have a patch at hand but thoughts on this matter are welcome. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Re: Tom Lane 2015-05-28 <5740.1432849...@sss.pgh.pa.us> > Abhijit Menon-Sen writes: > > Here's an updated patch for the fsync problem(s). > > I've committed this after some mostly-cosmetic rearrangements. Fwiw, I can confirm that the problem is fixed for 9.5. The regression tests I've added to postgresql-common pass. Thanks! Christoph -- c...@df7cb.de | http://www.df7cb.de/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Fri, May 29, 2015 at 2:28 PM, Michael Paquier wrote: > > On Fri, May 29, 2015 at 5:01 PM, Amit Kapila wrote: > > > > Test-3 - Symlinks in pg_tblspc. > > 1. Create couple of tablespaces which creates symlinks > > in pg_tblspc > > 2. Crash the server > > 3. Restart Server - It works fine. > > > > I am not sure Test-2 is a valid thing and we should support it or > > not, but the current patch is sane w.r.t that form of symlinks > > as well. > > It is always good to check, but is that relevant to the bug fixed? I > mean, you need to symlink a file in PGDATA that server has no write > permission on... For example tablespaces can be written to in test 3, > so that would work even with 9.4.2 after crashing the server with -m > immediate. I have just tried to cover the paths which has symlink related code in the new commit (in functions SyncDataDirectory() and walkdir()), because thats what I understood from Tom's mail. Without test-3, it won't cover walkdir symlink test path, I didn't knew that there was any confusion about verification of permissions problem on Windows and I haven't looked to verify the whole patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Fri, May 29, 2015 at 5:01 PM, Amit Kapila wrote: > On Fri, May 29, 2015 at 9:25 AM, Tom Lane wrote: >> >> >> Speaking of which, could somebody test that the committed patch does >> what it's supposed to on Windows? You were worried upthread about >> whether the tests for symlinks (aka junction points) behaved correctly, >> and I have no way to check that either. >> > > I have done some tests for the committed patch for this issue > (especially to check the behaviour of symlink tests in the > new code) on Windows - 7 and below are results: > > Test symlink for pg_xlog > - > Test -1 - Junction point for pg_xlog > 1. Create a database (initdb) > 2. Start server and forcefully crash it > 3. cp E:\PGData\pg_xlog E:\TempLog\xlog_symlink_loc > 4. mv E:\PGData\pg_xlog E:\bak_pg_xlog > 5. Created a directory junction (equivalent symlink) > mklink /J E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc > 6. Restart Server - operation is successful. > I have debugged this operation, it does what it suppose to do, > detects the junction point and does fsync. > > Test - 2 - Directory Symlink for pg_xlog > First 4 steps are same as Test-1 > 5. mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc > 6. Restart Server - Error > - FATAL: required WAL directory "pg_xlog" does not exist > This error occurs in > ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf) > 7. If I manually step-through that line, it proceeds and in > SyncDataDirectory(), it detects the symlink; > 8. After that in SyncDataDirectory(), when it does walkdir for > datadir, for ./pg_xlog it reports the log message and then > proceeds normal and the server is started. > > Test-3 - Symlinks in pg_tblspc. > 1. Create couple of tablespaces which creates symlinks > in pg_tblspc > 2. Crash the server > 3. Restart Server - It works fine. > > I am not sure Test-2 is a valid thing and we should support it or > not, but the current patch is sane w.r.t that form of symlinks > as well. It is always good to check, but is that relevant to the bug fixed? I mean, you need to symlink a file in PGDATA that server has no write permission on... For example tablespaces can be written to in test 3, so that would work even with 9.4.2 after crashing the server with -m immediate. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Fri, May 29, 2015 at 9:25 AM, Tom Lane wrote: > > > Speaking of which, could somebody test that the committed patch does > what it's supposed to on Windows? You were worried upthread about > whether the tests for symlinks (aka junction points) behaved correctly, > and I have no way to check that either. > I have done some tests for the committed patch for this issue (especially to check the behaviour of symlink tests in the new code) on Windows - 7 and below are results: Test symlink for pg_xlog - Test -1 - Junction point for pg_xlog 1. Create a database (initdb) 2. Start server and forcefully crash it 3. cp E:\PGData\pg_xlog E:\TempLog\xlog_symlink_loc 4. mv E:\PGData\pg_xlog E:\bak_pg_xlog 5. Created a directory junction (equivalent symlink) mklink /J E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc 6. Restart Server - operation is successful. I have debugged this operation, it does what it suppose to do, detects the junction point and does fsync. Test - 2 - Directory Symlink for pg_xlog First 4 steps are same as Test-1 5. mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc 6. Restart Server - Error - FATAL: required WAL directory "pg_xlog" does not exist This error occurs in ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf) 7. If I manually step-through that line, it proceeds and in SyncDataDirectory(), it detects the symlink; 8. After that in SyncDataDirectory(), when it does walkdir for datadir, for ./pg_xlog it reports the log message and then proceeds normal and the server is started. Test-3 - Symlinks in pg_tblspc. 1. Create couple of tablespaces which creates symlinks in pg_tblspc 2. Crash the server 3. Restart Server - It works fine. I am not sure Test-2 is a valid thing and we should support it or not, but the current patch is sane w.r.t that form of symlinks as well. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Fri, May 29, 2015 at 12:55 PM, Tom Lane wrote: > Abhijit Menon-Sen writes: >>> I have to leave shortly, so I'll look at the initdb cleanup tomorrow. > >> Here's a revision of that patch that's more along the lines of what you >> committed. > > Will look at that tomorrow ... > >> It occurred to me that we should probably also silently skip EACCES on >> opendir and stat/lstat in walkdir. That's what the attached initdb patch >> does. If you think it's a good idea, there's also a small fixup attached >> for the backend version too. > > Actually, I was seriously considering removing the don't-log-EACCES > special case (at least for files, perhaps not for directories). > The reason being that it's darn hard to verify that the code is doing > what it's supposed to when there is no simple way to get it to log any > complaints. I left it as you wrote it in the committed version, but > I think both that and the ETXTBSY special case do not really have value > as long as their only effect is to suppress a LOG entry. > > Speaking of which, could somebody test that the committed patch does > what it's supposed to on Windows? You were worried upthread about > whether the tests for symlinks (aka junction points) behaved correctly, > and I have no way to check that either. The error can be reproduced by using junction (https://technet.microsoft.com/en-us/sysinternals/bb896768.aspx) to define a junction point within PGDATA and then for example mark as read-only a configuration file included in postgresql.conf through the junction point. Using this method, I have checked that 9.4.1 works, reproduced the permission error with 9.4.2, and checked as well that a3ae3db4 fixed the problem. So things look to work fine. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Abhijit Menon-Sen writes: >> I have to leave shortly, so I'll look at the initdb cleanup tomorrow. > Here's a revision of that patch that's more along the lines of what you > committed. Will look at that tomorrow ... > It occurred to me that we should probably also silently skip EACCES on > opendir and stat/lstat in walkdir. That's what the attached initdb patch > does. If you think it's a good idea, there's also a small fixup attached > for the backend version too. Actually, I was seriously considering removing the don't-log-EACCES special case (at least for files, perhaps not for directories). The reason being that it's darn hard to verify that the code is doing what it's supposed to when there is no simple way to get it to log any complaints. I left it as you wrote it in the committed version, but I think both that and the ETXTBSY special case do not really have value as long as their only effect is to suppress a LOG entry. Speaking of which, could somebody test that the committed patch does what it's supposed to on Windows? You were worried upthread about whether the tests for symlinks (aka junction points) behaved correctly, and I have no way to check that either. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-28 17:37:16 -0400, t...@sss.pgh.pa.us wrote: > > I've committed this after some mostly-cosmetic rearrangements. Thank you. > I have to leave shortly, so I'll look at the initdb cleanup tomorrow. Here's a revision of that patch that's more along the lines of what you committed. It occurred to me that we should probably also silently skip EACCES on opendir and stat/lstat in walkdir. That's what the attached initdb patch does. If you think it's a good idea, there's also a small fixup attached for the backend version too. -- Abhijit >From 26bb25e99a2954381587bbfe296a50b19a7f047c Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen Date: Thu, 28 May 2015 22:02:29 +0530 Subject: Make initdb -S behave like xlog.c:fsync_pgdata() In particular, it should silently skip unreadable/unexpected files in the data directory and follow symlinks only for pg_xlog and under pg_tblspc. --- src/bin/initdb/initdb.c | 305 1 file changed, 151 insertions(+), 154 deletions(-) diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index f1c4920..9d5478c 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -71,6 +71,13 @@ /* Ideally this would be in a .h file, but it hardly seems worth the trouble */ extern const char *select_default_timezone(const char *share_path); +/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */ +#if defined(HAVE_SYNC_FILE_RANGE) +#define PG_FLUSH_DATA_WORKS 1 +#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) +#define PG_FLUSH_DATA_WORKS 1 +#endif + static const char *auth_methods_host[] = {"trust", "reject", "md5", "password", "ident", "radius", #ifdef ENABLE_GSS "gss", @@ -217,10 +224,13 @@ static char **filter_lines_with_token(char **lines, const char *token); #endif static char **readfile(const char *path); static void writefile(char *path, char **lines); -static void walkdir(char *path, void (*action) (char *fname, bool isdir)); -static void walktblspc_links(char *path, void (*action) (char *fname, bool isdir)); -static void pre_sync_fname(char *fname, bool isdir); -static void fsync_fname(char *fname, bool isdir); +static void walkdir(const char *path, + void (*action) (const char *fname, bool isdir), + bool process_symlinks); +#ifdef PG_FLUSH_DATA_WORKS +static void pre_sync_fname(const char *fname, bool isdir); +#endif +static void fsync_fname_ext(const char *fname, bool isdir); static FILE *popen_check(const char *command, const char *mode); static void exit_nicely(void); static char *get_id(void); @@ -248,7 +258,7 @@ static void load_plpgsql(void); static void vacuum_db(void); static void make_template0(void); static void make_postgres(void); -static void perform_fsync(void); +static void fsync_pgdata(void); static void trapsig(int signum); static void check_ok(void); static char *escape_quotes(const char *src); @@ -521,56 +531,58 @@ writefile(char *path, char **lines) * Adapted from copydir() in copydir.c. */ static void -walkdir(char *path, void (*action) (char *fname, bool isdir)) +walkdir(const char *path, + void (*action) (const char *fname, bool isdir), + bool process_symlinks) { DIR *dir; - struct dirent *direntry; - char subpath[MAXPGPATH]; + struct dirent *de; dir = opendir(path); if (dir == NULL) { - fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"), -progname, path, strerror(errno)); - exit_nicely(); + if (errno != EACCES) + fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"), + progname, path, strerror(errno)); + return; } - while (errno = 0, (direntry = readdir(dir)) != NULL) + while (errno = 0, (de = readdir(dir)) != NULL) { + char subpath[MAXPGPATH]; struct stat fst; + int sret; - if (strcmp(direntry->d_name, ".") == 0 || - strcmp(direntry->d_name, "..") == 0) + if (strcmp(de->d_name, ".") == 0 || + strcmp(de->d_name, "..") == 0) continue; - snprintf(subpath, MAXPGPATH, "%s/%s", path, direntry->d_name); + snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name); + + if (process_symlinks) + sret = stat(subpath, &fst); + else + sret = lstat(subpath, &fst); - if (lstat(subpath, &fst) < 0) + if (sret < 0) { - fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"), - progname, subpath, strerror(errno)); - exit_nicely(); + if (errno != EACCES) +fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"), + progname, subpath, strerror(errno)); + continue; } - if (S_ISDIR(fst.st_mode)) - walkdir(subpath, action); - else if (S_ISREG(fst.st_mode)) + if (S_ISREG(fst.st_mode)) (*action) (subpath, false); + else if (S_ISDIR(fst.st_mode)) + walkdir(subpath, action, false); } if (errno) - { fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"), progname, path, strerror(errno)); - exit_nicely(); - } - if (closedir(dir)) - { - fprintf(stderr, _("%s: could not close directory \"
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Abhijit Menon-Sen writes: > Here's an updated patch for the fsync problem(s). I've committed this after some mostly-cosmetic rearrangements. > 4. As a partial aside, why does fsync_fname use OpenTransientFile? It >looks like it should use BasicOpenFile as pre_sync_fname does. We >close the fd immediately after calling fsync anyway. Or is there >some reason I'm missing that we should use OpenTransientFile in >pre_sync_fname too? pre_sync_fname is the one that is wrong IMO. Using BasicOpenFile just creates an opportunity for problems; that function should get called from as few places as possible. > 5. I made walktblspc_entries use stat() instead of lstat(), so that we >can handle symlinks and directories the same way. Note that we won't >continue to follow links inside the sub-directories because we use >walkdir instead of recursing. Given that, walktblspc_entries was 99% duplicate, so I folded it into walkdir with a process_symlinks boolean. I have to leave shortly, so I'll look at the initdb cleanup tomorrow. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Thu, May 28, 2015 at 1:04 PM, Tom Lane wrote: > Robert Haas writes: >> On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen >> wrote: >>> 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c? >>> I wasn't sure if I should move that to fd.c as well. I think it's >>> borderline OK for now. > >> I think if the function is specific as fsync_pgdata(), fd.c is not the >> right place. I'm not sure xlog.c is either, though. > > ISTM xlog.c is clearly the *wrong* place; if this behavior has anything > to do with WAL logging as such, it's not apparent to me. fd.c is not > a great place perhaps, but in view of the fact that we have things like > RemovePgTempFiles() in there, it's not unreasonable to see fsync_pgdata > as something to put there as well (perhaps with a name more chosen to > match fd.c names...) OK, sure, makes sense. > Since Robert appears to be busy worrying about the multixact issue > reported by Steve Kehlet, I suggest he focus on that and I'll take > care of getting this thing committed. AFAICT we have consensus on > what it should do and we're down to arguing about code style. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-28 19:56:15 +0530, a...@2ndquadrant.com wrote: > > I have a separate patch to initdb with the corresponding changes, which > I will post after dinner and a bit more testing. Here's that patch too. I was a bit unsure how far to go with this. It fixes the problem of not following pg_xlog if it's a symlink (Andres) and the one where it died on bogus stuff in pg_tblspc (Tom) and unreadable files anywhere else (it now spews errors to stderr, but carries on for as long as it can). I've done a bit more than strictly necessary to fix those problems, though, and made the code largely similar to what's in the other patch. If you want something minimal, let me know and I'll cut it down to size. -- Abhijit P.S. If this patch gets applied, then the "Adapted from walktblspc_links in initdb.c" comment in fd.c should be changed: s/links/entries/. >From 8e69433fae0b4f51879793f6594c76b99d764818 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen Date: Thu, 28 May 2015 22:02:29 +0530 Subject: Make initdb -S behave like xlog.c:fsync_pgdata() In particular, it should silently skip unreadable/unexpected files in the data directory and follow symlinks only for pg_xlog and under pg_tblspc. --- src/bin/initdb/initdb.c | 235 +--- 1 file changed, 122 insertions(+), 113 deletions(-) diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index f1c4920..8ebaa55 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -218,9 +218,9 @@ static char **filter_lines_with_token(char **lines, const char *token); static char **readfile(const char *path); static void writefile(char *path, char **lines); static void walkdir(char *path, void (*action) (char *fname, bool isdir)); -static void walktblspc_links(char *path, void (*action) (char *fname, bool isdir)); +static void walktblspc_entries(char *path, void (*action) (char *fname, bool isdir)); static void pre_sync_fname(char *fname, bool isdir); -static void fsync_fname(char *fname, bool isdir); +static void fsync_fname_ext(char *fname, bool isdir); static FILE *popen_check(const char *command, const char *mode); static void exit_nicely(void); static char *get_id(void); @@ -248,7 +248,7 @@ static void load_plpgsql(void); static void vacuum_db(void); static void make_template0(void); static void make_postgres(void); -static void perform_fsync(void); +static void fsync_pgdata(void); static void trapsig(int signum); static void check_ok(void); static char *escape_quotes(const char *src); @@ -518,38 +518,38 @@ writefile(char *path, char **lines) * walkdir: recursively walk a directory, applying the action to each * regular file and directory (including the named directory itself). * - * Adapted from copydir() in copydir.c. + * Adapted from walkdir() in backend/storage/file/fd.c. */ static void walkdir(char *path, void (*action) (char *fname, bool isdir)) { DIR *dir; - struct dirent *direntry; - char subpath[MAXPGPATH]; + struct dirent *de; dir = opendir(path); if (dir == NULL) { fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"), progname, path, strerror(errno)); - exit_nicely(); + return; } - while (errno = 0, (direntry = readdir(dir)) != NULL) + while (errno = 0, (de = readdir(dir)) != NULL) { + char subpath[MAXPGPATH]; struct stat fst; - if (strcmp(direntry->d_name, ".") == 0 || - strcmp(direntry->d_name, "..") == 0) + if (strcmp(de->d_name, ".") == 0 || + strcmp(de->d_name, "..") == 0) continue; - snprintf(subpath, MAXPGPATH, "%s/%s", path, direntry->d_name); + snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name); if (lstat(subpath, &fst) < 0) { fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"), progname, subpath, strerror(errno)); - exit_nicely(); + continue; } if (S_ISDIR(fst.st_mode)) @@ -559,75 +559,72 @@ walkdir(char *path, void (*action) (char *fname, bool isdir)) } if (errno) - { fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"), progname, path, strerror(errno)); - exit_nicely(); - } - if (closedir(dir)) - { - fprintf(stderr, _("%s: could not close directory \"%s\": %s\n"), -progname, path, strerror(errno)); - exit_nicely(); - } + (void) closedir(dir); - /* - * It's important to fsync the destination directory itself as individual - * file fsyncs don't guarantee that the directory entry for the file is - * synced. Recent versions of ext4 have made the window much wider but - * it's been an issue for ext3 and other filesystems in the past. - */ (*action) (path, true); } /* - * walktblspc_links: call walkdir on each entry under the given - * pg_tblspc directory, or do nothing if pg_tblspc doesn't exist. + * walktblspc_entries -- apply the action to the entries in pb_tblspc + * + * We expect to find only symlinks to tablespace directories here, but + * we'll apply the action to regular files, and call walkdir() on any + * directorie
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Robert Haas writes: > On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen > wrote: >> 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c? >> I wasn't sure if I should move that to fd.c as well. I think it's >> borderline OK for now. > I think if the function is specific as fsync_pgdata(), fd.c is not the > right place. I'm not sure xlog.c is either, though. ISTM xlog.c is clearly the *wrong* place; if this behavior has anything to do with WAL logging as such, it's not apparent to me. fd.c is not a great place perhaps, but in view of the fact that we have things like RemovePgTempFiles() in there, it's not unreasonable to see fsync_pgdata as something to put there as well (perhaps with a name more chosen to match fd.c names...) Since Robert appears to be busy worrying about the multixact issue reported by Steve Kehlet, I suggest he focus on that and I'll take care of getting this thing committed. AFAICT we have consensus on what it should do and we're down to arguing about code style. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen wrote: > 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c? >I wasn't sure if I should move that to fd.c as well. I think it's >borderline OK for now. I think if the function is specific as fsync_pgdata(), fd.c is not the right place. I'm not sure xlog.c is either, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Hi. Here's an updated patch for the fsync problem(s). A few points that may be worth considering: 1. I've made the ReadDir → ReadDirExtended change, but in retrospect I don't think it's really worth it. Using ReadDir and letting it die is probably fine. (Aside: I left ReadDirExtended static for now.) 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c? I wasn't sure if I should move that to fd.c as well. I think it's borderline OK for now. 3. There's a comment in fsync_fname that we need to open directories with O_RDONLY and non-directories with O_RDWR. Does this also apply to pre_sync_fname (i.e. sync_file_range and posix_fadvise) on any platforms we care about? It doesn't seem so, but I'm not sure. 4. As a partial aside, why does fsync_fname use OpenTransientFile? It looks like it should use BasicOpenFile as pre_sync_fname does. We close the fd immediately after calling fsync anyway. Or is there some reason I'm missing that we should use OpenTransientFile in pre_sync_fname too? (Aside²: it's pointless to specify S_IRUSR|S_IWUSR in fsync_fname since we're not setting O_CREAT. Minor, so will submit separate patch.) 5. I made walktblspc_entries use stat() instead of lstat(), so that we can handle symlinks and directories the same way. Note that we won't continue to follow links inside the sub-directories because we use walkdir instead of recursing. But this depends on S_ISDIR() returning true after stat() on Windows with a junction. Does that work? And are the entries in pb_tblspc on Windows actually junctions? I've tested this with unreadable files in PGDATA and junk inside pb_tblspc. Feedback welcome. I have a separate patch to initdb with the corresponding changes, which I will post after dinner and a bit more testing. -- Abhijit >From 491dcb8c7203fd992dd9c441f5463a75c88e6f52 Mon Sep 17 00:00:00 2001 From: Abhijit Menon-Sen Date: Thu, 28 May 2015 17:22:18 +0530 Subject: Skip unreadable files in fsync_pgdata; follow only expected symlinks --- src/backend/access/transam/xlog.c | 40 +- src/backend/storage/file/fd.c | 262 ++ src/include/storage/fd.h | 6 +- 3 files changed, 226 insertions(+), 82 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 087b6be..32447e7 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -11605,20 +11605,53 @@ SetWalWriterSleeping(bool sleeping) /* * Issue fsync recursively on PGDATA and all its contents. + * + * We fsync regular files and directories wherever they are, but we + * follow symlinks only for pg_xlog and under pg_tblspc. */ static void fsync_pgdata(char *datadir) { + bool xlog_is_symlink; + char pg_xlog[MAXPGPATH]; + char pg_tblspc[MAXPGPATH]; + struct stat st; + if (!enableFsync) return; + snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", datadir); + snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", datadir); + + /* + * If pg_xlog is a symlink, then we need to recurse into it separately, + * because the first walkdir below will ignore it. + */ + xlog_is_symlink = false; + + if (lstat(pg_xlog, &st) < 0) + ereport(ERROR, +(errcode_for_file_access(), +errmsg("could not stat directory \"pg_xlog\": %m"))); + +#ifndef WIN32 + if (S_ISLNK(st.st_mode)) + xlog_is_symlink = true; +#else + if (pgwin32_is_junction(subpath)) + xlog_is_symlink = true; +#endif + /* * If possible, hint to the kernel that we're soon going to fsync the data * directory and its contents. */ #if defined(HAVE_SYNC_FILE_RANGE) || \ (defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)) - walkdir(datadir, pre_sync_fname); + walkdir(DEBUG1, datadir, pre_sync_fname); + if (xlog_is_symlink) + walkdir(DEBUG1, pg_xlog, pre_sync_fname); + walktblspc_entries(DEBUG1, pg_tblspc, pre_sync_fname); #endif /* @@ -11628,5 +11661,8 @@ fsync_pgdata(char *datadir) * file fsyncs don't guarantee that the directory entry for the file is * synced. */ - walkdir(datadir, fsync_fname); + walkdir(LOG, datadir, fsync_fname_ext); + if (xlog_is_symlink) + walkdir(LOG, pg_xlog, fsync_fname_ext); + walktblspc_entries(LOG, pg_tblspc, fsync_fname_ext); } diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 68d43c6..916f8b5 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -1921,6 +1921,37 @@ TryAgain: } /* + * Read a directory opened with AllocateDir and return the next dirent. + * On error, ereports at a caller-specified level and returns NULL if + * the level is not ERROR or more severe. + */ +static struct dirent * +ReadDirExtended(int elevel, DIR *dir, const char *dirname) +{ + struct dirent *dent; + + /* Give a generic message for AllocateDir failure, if caller didn't */ + if (dir == NULL) + { + ereport(elevel, +(errcode_for_file_access(), + errmsg("could not open d
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-27 23:41:29 -0400, t...@sss.pgh.pa.us wrote: > > What about turning ReadDir into a wrapper around a ReadDirExtended > function that adds an elevel argument? Sounds reasonable, doing. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Abhijit Menon-Sen writes: > At 2015-05-27 20:22:18 -0400, t...@sss.pgh.pa.us wrote: >> I doubt that that (not using AllocateDir) [â¦] > (Disregarding per your followup.) >> What I think is a reasonable compromise is to treat AllocateDir >> failure as nonfatal, but to continue using ReadDir etc which means >> that any post-open failure in reading a successfully-opened directory >> would be fatal. > OK, that's halfway between the two patches I posted. Will do. Thought you were disregarding? Anyway, I had an idea about reducing the ugliness. What you basically did is copy-and-paste ReadDir so you could change the elevel. What about turning ReadDir into a wrapper around a ReadDirExtended function that adds an elevel argument? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-27 20:22:18 -0400, t...@sss.pgh.pa.us wrote: > > I doubt that that (not using AllocateDir) […] (Disregarding per your followup.) > What I think is a reasonable compromise is to treat AllocateDir > failure as nonfatal, but to continue using ReadDir etc which means > that any post-open failure in reading a successfully-opened directory > would be fatal. OK, that's halfway between the two patches I posted. Will do. > Meanwhile, it seems like you have copied a couple of very dubious > decisions in initdb's walktblspc_links(): […] > > Now, we don't really have to do anything about these things in order > to fix the immediate problem, but I wonder if we shouldn't try to > clean up initdb's behavior while we're at it. OK, I'll fix that in both. > Independently of that, I thought the plan was to complain about any > problems at LOG message level, not DEBUG1, and definitely not WARNING > (which is lower than LOG for this purpose). I'll change the level to LOG for fsync_fname_ext, but I think DEBUG1 is more appropriate for the pre_sync_fname run. Or do you think I should use LOG for that too? > And why would you use a different message level for pg_xlog/ than for > other files anyway? That was just a mistake, I forgot to change it after copying. Thanks for having a look. I'll post a revised patch shortly. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
I wrote: > Abhijit Menon-Sen writes: >> At 2015-05-27 11:46:39 +0530, a...@2ndquadrant.com wrote: >>> I'm trying a couple of approaches to that (e.g. using readdir directly >>> instead of ReadDir), but other suggestions are welcome. >> Here's what that looks like, but not yet fully tested. > I doubt that that (not using AllocateDir) is a good idea in the backend, Oh, scratch that. Reading the patch closer, I see you kept the use of AllocateDir and only replaced ReadDir. That's kind of ugly (and certainly requires a comment about the inconsistency) but it's probably ok from an error handling standpoint. There are hard failure cases in AllocateDir, but they seem unlikely to create problems in practice. The other points stand though ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Abhijit Menon-Sen writes: > At 2015-05-27 11:46:39 +0530, a...@2ndquadrant.com wrote: >> I'm trying a couple of approaches to that (e.g. using readdir directly >> instead of ReadDir), but other suggestions are welcome. > Here's what that looks like, but not yet fully tested. I doubt that that (not using AllocateDir) is a good idea in the backend, mainly because you are going to leak directory references if anything called by walkdir() throws elog(ERROR). While that doesn't matter much for the immediate usage, since we'd throw up our hands and die anyway, it completely puts the kibosh on any idea that walkdir() is a general-purpose subroutine. It would have to be decorated with big red warnings NEVER USE THIS FUNCTION. Ick. What I think is a reasonable compromise is to treat AllocateDir failure as nonfatal, but to continue using ReadDir etc which means that any post-open failure in reading a successfully-opened directory would be fatal. Such errors would suggest that something is badly wrong in the filesystem, so I do not feel too terrible about not covering that case. Meanwhile, it seems like you have copied a couple of very dubious decisions in initdb's walktblspc_links(): 1. It doesn't say a word if the passed-in path (which in practice is always $PGDATA/pg_tblspc) doesn't exist. 2. It assumes that every entry within pg_tblspc must be a tablespace directory (or symlink to one), and fails if any are plain files. At the very least, these behaviors seem inconsistent. #2 is making strong assumptions about pg_tblspc being correctly set up and free of user-added cruft, while #1 doesn't even assume it's there at all. Now, we don't really have to do anything about these things in order to fix the immediate problem, but I wonder if we shouldn't try to clean up initdb's behavior while we're at it. Independently of that, I thought the plan was to complain about any problems at LOG message level, not DEBUG1, and definitely not WARNING (which is lower than LOG for this purpose). And why would you use a different message level for pg_xlog/ than for other files anyway? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-27 11:46:39 +0530, a...@2ndquadrant.com wrote: > > I'm trying a couple of approaches to that (e.g. using readdir directly > instead of ReadDir), but other suggestions are welcome. Here's what that looks like, but not yet fully tested. -- Abhijit diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 087b6be..6956e88 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -11609,16 +11609,41 @@ SetWalWriterSleeping(bool sleeping) static void fsync_pgdata(char *datadir) { + bool xlog_is_symlink; + char pg_xlog[MAXPGPATH]; + char pg_tblspc[MAXPGPATH]; + struct stat st; + if (!enableFsync) return; + snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", datadir); + snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", datadir); + + if (lstat(pg_xlog, &st) < 0) + ereport(ERROR, +(errcode_for_file_access(), +errmsg("could not stat directory \"pg_xlog\": %m"))); + + xlog_is_symlink = false; +#ifndef WIN32 + if (S_ISLNK(st.st_mode)) + xlog_is_symlink = true; +#else + if (pgwin32_is_junction(subpath)) + xlog_is_symlink = true; +#endif + /* * If possible, hint to the kernel that we're soon going to fsync the data * directory and its contents. */ #if defined(HAVE_SYNC_FILE_RANGE) || \ (defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)) - walkdir(datadir, pre_sync_fname); + walkdir(DEBUG1, datadir, pre_sync_fname); + if (xlog_is_symlink) + walkdir(DEBUG1, pg_xlog, pre_sync_fname); + walktblspc_links(DEBUG1, pg_tblspc, pre_sync_fname); #endif /* @@ -11628,5 +11653,8 @@ fsync_pgdata(char *datadir) * file fsyncs don't guarantee that the directory entry for the file is * synced. */ - walkdir(datadir, fsync_fname); + walkdir(WARNING, datadir, fsync_fname_ext); + if (xlog_is_symlink) + walkdir(DEBUG1, pg_xlog, fsync_fname_ext); + walktblspc_links(WARNING, pg_tblspc, fsync_fname_ext); } diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 68d43c6..711b4b7 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2443,48 +2443,113 @@ looks_like_temp_rel_name(const char *name) /* * Hint to the OS that it should get ready to fsync() this file. * - * Adapted from pre_sync_fname in initdb.c + * Ignores errors trying to open unreadable files, and logs other errors at a + * caller-specified level. */ void -pre_sync_fname(char *fname, bool isdir) +pre_sync_fname(int elevel, char *fname, bool isdir) { int fd; fd = BasicOpenFile(fname, O_RDONLY | PG_BINARY, 0); + if (fd < 0) + { + if (errno == EACCES || (isdir && errno == EISDIR)) + return; + +#ifdef ETXTBSY + if (errno == ETXTBSY) + return; +#endif + + ereport(elevel, +(errcode_for_file_access(), + errmsg("could not open file \"%s\": %m", fname))); + } + + (void) pg_flush_data(fd, 0, 0); + (void) close(fd); +} + +/* + * fsync_fname_ext -- Try to fsync a file or directory + * + * Ignores errors trying to open unreadable files, or trying to fsync + * directories on systems where that isn't allowed/required, and logs other + * errors at a caller-specified level. + */ +void +fsync_fname_ext(int elevel, char *fname, bool isdir) +{ + int fd; + int returncode; + /* - * Some OSs don't allow us to open directories at all (Windows returns - * EACCES) + * Some OSs require directories to be opened read-only whereas other + * systems don't allow us to fsync files opened read-only; so we need both + * cases here */ - if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES)) - return; + if (!isdir) + fd = OpenTransientFile(fname, + O_RDWR | PG_BINARY, + S_IRUSR | S_IWUSR); + else + fd = OpenTransientFile(fname, + O_RDONLY | PG_BINARY, + S_IRUSR | S_IWUSR); if (fd < 0) - ereport(FATAL, -(errmsg("could not open file \"%s\": %m", - fname))); + { + if (errno == EACCES || (isdir && errno == EISDIR)) + return; - pg_flush_data(fd, 0, 0); +#ifdef ETXTBSY + if (errno == ETXTBSY) + return; +#endif - close(fd); + ereport(elevel, +(errcode_for_file_access(), + errmsg("could not open file \"%s\": %m", fname))); + } + + returncode = pg_fsync(fd); + + /* + * Some OSes don't allow us to fsync directories at all, so we can ignore + * those errors. Anything else needs to be logged. + */ + if (returncode != 0 && !(isdir && errno == EBADF)) + ereport(elevel, +(errcode_for_file_access(), + errmsg("could not fsync file \"%s\": %m", fname))); + + CloseTransientFile(fd); } /* * walkdir: recursively walk a directory, applying the action to each - * regular file and directory (including the named directory itself) - * and following symbolic links. + * regular file and directory (including the named directory itself). * - * NB: There is another version of walkdir in initdb.c, but that version - * behaves differently with respect to symbolic links. Caveat emptor! + * Adapted from walkdir in initdb.c */ voi
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-26 22:44:03 +0200, and...@anarazel.de wrote: > > So what I propose is: > 1) Remove the automatic symlink following > 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in >initdb -S > 3) Add a elevel argument to walkdir(), return if AllocateDir() fails, >continue for stat() failures in the readdir() loop. > 4) Add elevel argument to pre_sync_fname, fsync_fname, return after >errors. > 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By >virtue of not following symlinks we should not need to worry about >EROFS Here's a WIP patch for discussion. I've (a) removed the S_ISLNK() branch in walkdir, (b) reintroduced walktblspc_links to call walkdir on each of the entries within pg_tblspc (simpler than trying to make walkdir follow links only for pg_xlog and under pg_tblspc), (c) call walkdir on pg_xlog if it's a symlink (not done for initdb -S; will submit separately), (d) add elevel arguments as described, (e) ignore EACCES and ETXTBSY. This correctly fsync()s stuff according to strace, and doesn't die if there are unreadable files/links in PGDATA. What I haven't done is return if AllocateDir() fails. I'm not convinced that's correct, because it'll not complain if PGDATA is unreadable (but this will break other things, so it doesn't matter), but also will die if readdir fails rather than opendir. I'm trying a couple of approaches to that (e.g. using readdir directly instead of ReadDir), but other suggestions are welcome. -- Abhijit diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 087b6be..6956e88 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -11609,16 +11609,41 @@ SetWalWriterSleeping(bool sleeping) static void fsync_pgdata(char *datadir) { + bool xlog_is_symlink; + char pg_xlog[MAXPGPATH]; + char pg_tblspc[MAXPGPATH]; + struct stat st; + if (!enableFsync) return; + snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", datadir); + snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", datadir); + + if (lstat(pg_xlog, &st) < 0) + ereport(ERROR, +(errcode_for_file_access(), +errmsg("could not stat directory \"pg_xlog\": %m"))); + + xlog_is_symlink = false; +#ifndef WIN32 + if (S_ISLNK(st.st_mode)) + xlog_is_symlink = true; +#else + if (pgwin32_is_junction(subpath)) + xlog_is_symlink = true; +#endif + /* * If possible, hint to the kernel that we're soon going to fsync the data * directory and its contents. */ #if defined(HAVE_SYNC_FILE_RANGE) || \ (defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)) - walkdir(datadir, pre_sync_fname); + walkdir(DEBUG1, datadir, pre_sync_fname); + if (xlog_is_symlink) + walkdir(DEBUG1, pg_xlog, pre_sync_fname); + walktblspc_links(DEBUG1, pg_tblspc, pre_sync_fname); #endif /* @@ -11628,5 +11653,8 @@ fsync_pgdata(char *datadir) * file fsyncs don't guarantee that the directory entry for the file is * synced. */ - walkdir(datadir, fsync_fname); + walkdir(WARNING, datadir, fsync_fname_ext); + if (xlog_is_symlink) + walkdir(DEBUG1, pg_xlog, fsync_fname_ext); + walktblspc_links(WARNING, pg_tblspc, fsync_fname_ext); } diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 68d43c6..c855004 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -2443,42 +2443,99 @@ looks_like_temp_rel_name(const char *name) /* * Hint to the OS that it should get ready to fsync() this file. * - * Adapted from pre_sync_fname in initdb.c + * Ignores errors trying to open unreadable files, and logs other errors at a + * caller-specified level. */ void -pre_sync_fname(char *fname, bool isdir) +pre_sync_fname(int elevel, char *fname, bool isdir) { int fd; fd = BasicOpenFile(fname, O_RDONLY | PG_BINARY, 0); + if (fd < 0) + { + if (errno == EACCES || (isdir && errno == EISDIR)) + return; + +#ifdef ETXTBSY + if (errno == ETXTBSY) + return; +#endif + + ereport(elevel, +(errcode_for_file_access(), + errmsg("could not open file \"%s\": %m", fname))); + } + + (void) pg_flush_data(fd, 0, 0); + (void) close(fd); +} + +/* + * fsync_fname_ext -- Try to fsync a file or directory + * + * Ignores errors trying to open unreadable files, or trying to fsync + * directories on systems where that isn't allowed/required, and logs other + * errors at a caller-specified level. + */ +void +fsync_fname_ext(int elevel, char *fname, bool isdir) +{ + int fd; + int returncode; + /* - * Some OSs don't allow us to open directories at all (Windows returns - * EACCES) + * Some OSs require directories to be opened read-only whereas other + * systems don't allow us to fsync files opened read-only; so we need both + * cases here */ - if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES)) - return; + if (!isdir) + fd = OpenTransientFile(fname, + O_RDWR | PG_BINARY, + S_IRUSR | S_IWUSR); + else + fd = OpenTransientFile(fname, +
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-26 22:44:03 +0200, and...@anarazel.de wrote: > > So, this was discussed in the following thread, starting at: > http://archives.postgresql.org/message-id/20150403163232.GA28444%40eldon.alvh.no-ip.org Sorry, I didn't see this before replying. > There are no other places we it's "allowed" to introduce symlinks and > we have refuted bugreports of people having problems after doing that. OK. > So what I propose is: > 1) Remove the automatic symlink following > 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in >initdb -S > 3) Add a elevel argument to walkdir(), return if AllocateDir() fails, >continue for stat() failures in the readdir() loop. > 4) Add elevel argument to pre_sync_fname, fsync_fname, return after >errors. > 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By >virtue of not following symlinks we should not need to worry about >EROFS Makes sense. I've got most of that done, I'll just remove the symlink following (or rather, restrict it to the cases mentioned), test a bit, and post. > I'm inclined to think that 4) is a big enough compat break that a > fsync_fname_ext with the new argument is a good idea. Agreed. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-26 19:07:20 +0200, and...@anarazel.de wrote: > > Abhijit, do you recall why the code was changed to follow all symlinks > in contrast to explicitly going through the tablespaces as initdb -S > does? I'm pretty sure early versions of the patch pretty much had a > verbatim copy of the initdb logic? Yes, earlier versions of the patch did follow symlinks only in pg_tblspc. I changed this because Álvaro pointed out that this was likely to miss important stuff, e.g. in 20150403163232.ga28...@eldon.alvh.no-ip.org, 20150406131636.gd4...@alvh.no-ip.org -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-26 19:07:20 +0200, Andres Freund wrote: > It is somewhat interesting that similar code has been used in > pg_upgrade, via initdb -S, for a while now, without, to my knowledge, it > causing reported problem. I think the relevant difference is that that > code doesn't follow symlinks. It's obviously also less exercised and > poeople might just have fixed up permissions when encountering troubles. > > Abhijit, do you recall why the code was changed to follow all symlinks > in contrast to explicitly going through the tablespaces as initdb -S > does? I'm pretty sure early versions of the patch pretty much had a > verbatim copy of the initdb logic? That logic is missing pg_xlog btw, > which is bad for pg_upgrade. So, this was discussed in the following thread, starting at: http://archives.postgresql.org/message-id/20150403163232.GA28444%40eldon.alvh.no-ip.org "Actually, since surely we must follow symlinks everywhere, why do we have to do this separately for pg_tblspc? Shouldn't that link-following occur automatically when walking PGDATA in the first place?" I don't think it's true that we must follow symlinks everywhere. I think, as argued upthread, that it's sufficient to recurse through PGDATA, follow the symlinks in pg_tbspc, and if a symlink, also go through pg_xlog separately. There are no other places we it's "allowed" to introduce symlinks and we have refuted bugreports of people having problems after doing that. So what I propose is: 1) Remove the automatic symlink following 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in initdb -S 3) Add a elevel argument to walkdir(), return if AllocateDir() fails, continue for stat() failures in the readdir() loop. 4) Add elevel argument to pre_sync_fname, fsync_fname, return after errors. 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By virtue of not following symlinks we should not need to worry about EROFS I'm inclined to think that 4) is a big enough compat break that a fsync_fname_ext with the new argument is a good idea. Arguments for/against? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-26 19:07:20 +0200, Andres Freund wrote: > Abhijit, do you recall why the code was changed to follow all symlinks > in contrast to explicitly going through the tablespaces as initdb -S > does? I'm pretty sure early versions of the patch pretty much had a > verbatim copy of the initdb logic? Forgot to add Abhijit to CC list, sorry. > That [initdb's] logic is missing pg_xlog btw, which is bad for pg_upgrade. On second thought it's probably not actually bad for pg_upgrade's specific usecase. It'll always end with a proper shutdown, so it'll never need that WAL. But it'd be bad if anybody ever relied on initdb -S in a different scenario. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-26 10:41:12 -0400, Tom Lane wrote: > Yeah. Perhaps I missed it, but was the original patch motivated by > actual failures that had been seen in the field, or was it just a > hypothetical concern? I'd mentioned that it might be a good idea to do this while investingating a bug with unlogged tables that several parties had reported. That patch has been fixed in a more granular fashion. From there it took on kind of a life on its own. It is somewhat interesting that similar code has been used in pg_upgrade, via initdb -S, for a while now, without, to my knowledge, it causing reported problem. I think the relevant difference is that that code doesn't follow symlinks. It's obviously also less exercised and poeople might just have fixed up permissions when encountering troubles. Abhijit, do you recall why the code was changed to follow all symlinks in contrast to explicitly going through the tablespaces as initdb -S does? I'm pretty sure early versions of the patch pretty much had a verbatim copy of the initdb logic? That logic is missing pg_xlog btw, which is bad for pg_upgrade. I *can* reproduce corrupted clusters without this without trying too hard. I yesterday wrote a test for the crash testing infrastructure I've on and off worked on (since 2011. Uhm) and I could reproduce a bunch of corrupted clusters without this patch. When randomly triggering crash restarts shortly afterwards follwed by a simulated hw restart (stop accepting writes via linux device mapper) while concurrent COPYs are running, I can trigger a bunch of data corruptions. Since then my computer in berlin has done 440 testruns with the patch, and 450 without. I've gotten 7 errors without, 0 with. But the instrumentation for detecting errors is really shitty (pkey lookup for every expected row) and doesn't yet keep meaningful diagnosistics. So this isn't particularly bulletproof either way. I can't tell whether the patch is just masking yet another problem due to different timings caused by the fsync, or whether it's fixing the problem that we can forget to sync WAL segments. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Andrew Dunstan writes: > On 05/26/2015 11:58 AM, Tom Lane wrote: >> One thing perhaps we *should* be selective about, though, is which >> symlinks we try to follow. I think that a good case could be made >> for ignoring symlinks everywhere except in the pg_tablespace directory. >> If we did, that would all by itself take care of the Debian scenario, >> if I understand that case correctly. > People have symlinked the xlog directory. Good point, we need to cover that case. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Tue, May 26, 2015 at 6:16 PM, Andrew Dunstan wrote: > > On 05/26/2015 11:58 AM, Tom Lane wrote: > >> Andrew Dunstan writes: >> >>> OK, I'm late to the party. But why exactly are we syncing absolutely >>> everything? That seems over-broad. >>> >> If we try to be selective, we risk errors of omission, which no one would >> ever notice until someone's data got eaten in a low-probability crash >> scenario. It seems more robust (at least to me) to fsync everything we >> can find. That does require more thought about error cases than went >> into the original patch ... but I think that we need more thought about >> error cases even if we do try to be selective. >> >> One thing perhaps we *should* be selective about, though, is which >> symlinks we try to follow. I think that a good case could be made >> for ignoring symlinks everywhere except in the pg_tablespace directory. >> If we did, that would all by itself take care of the Debian scenario, >> if I understand that case correctly. >> > > People have symlinked the xlog directory. I've done it myself in the past. > A better rule might be to ignore symlinks unless either they are in > pg_tblspc or they are in the data directory and their name starts with > "pg_". > Not just "people". initdb will symlink the xlog directory if you use -x. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 05/26/2015 11:58 AM, Tom Lane wrote: Andrew Dunstan writes: OK, I'm late to the party. But why exactly are we syncing absolutely everything? That seems over-broad. If we try to be selective, we risk errors of omission, which no one would ever notice until someone's data got eaten in a low-probability crash scenario. It seems more robust (at least to me) to fsync everything we can find. That does require more thought about error cases than went into the original patch ... but I think that we need more thought about error cases even if we do try to be selective. One thing perhaps we *should* be selective about, though, is which symlinks we try to follow. I think that a good case could be made for ignoring symlinks everywhere except in the pg_tablespace directory. If we did, that would all by itself take care of the Debian scenario, if I understand that case correctly. People have symlinked the xlog directory. I've done it myself in the past. A better rule might be to ignore symlinks unless either they are in pg_tblspc or they are in the data directory and their name starts with "pg_". cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Andrew Dunstan writes: > OK, I'm late to the party. But why exactly are we syncing absolutely > everything? That seems over-broad. If we try to be selective, we risk errors of omission, which no one would ever notice until someone's data got eaten in a low-probability crash scenario. It seems more robust (at least to me) to fsync everything we can find. That does require more thought about error cases than went into the original patch ... but I think that we need more thought about error cases even if we do try to be selective. One thing perhaps we *should* be selective about, though, is which symlinks we try to follow. I think that a good case could be made for ignoring symlinks everywhere except in the pg_tablespace directory. If we did, that would all by itself take care of the Debian scenario, if I understand that case correctly. > And might it be better to check that we can open each file using > access() than calling open() and looking at the error code? Don't really see the point; that's just an extra step, and access() won't exactly prove you can open the file, anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 05/26/2015 08:05 AM, Robert Haas wrote: On Mon, May 25, 2015 at 9:54 PM, Stephen Frost wrote: I certainly see your point, but Tom also pointed out that it's not great to ignore failures during this phase: * Tom Lane (t...@sss.pgh.pa.us) wrote: Greg Stark writes: What exactly is failing? Is it that fsync is returning -1 ? According to the original report from Christoph Berg, it was open() not fsync() that was failing, at least in permissions-based cases. I'm not sure if we should just uniformly ignore all failures in this phase. That would have the merit of clearly not creating any new startup failure cases compared to the previous code, but as you say sometimes it might mean ignoring real problems. If we accept this, then we still have to have the lists, to decide what to fail on and what to ignore. If we're going to have said lists tho, I don't really see the point in fsync'ing things we're pretty confident aren't ours. No, that's not right at all. The idea, at least as I understand it, would be decide which errors to ignore by looking at the error code, not by looking at which file is involved. OK, I'm late to the party. But why exactly are we syncing absolutely everything? That seems over-broad. And might it be better to check that we can open each file using access() than calling open() and looking at the error code? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Robert Haas writes: > Anything we do short of making all errors in this area non-fatal is > going to leave behind startup-failure cases that exist today, and we > have no evidence at this time that such startup failures would be > justified by any actual data loss risk. Yeah. Perhaps I missed it, but was the original patch motivated by actual failures that had been seen in the field, or was it just a hypothetical concern? Certainly, any actual failures of that sort are few and far between compared to the number of problems we now realize the patch introduced. Also, we need to discuss how hard walkdir() needs to try to avoid throwing errors of its own. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Mon, May 25, 2015 at 9:54 PM, Andres Freund wrote: > On 2015-05-25 21:33:03 -0400, Robert Haas wrote: >> On Mon, May 25, 2015 at 2:20 PM, Tom Lane wrote: >> > Perhaps, but if we didn't have permission to write the file, it's hard to >> > argue that it's our responsibility to fsync it. So this seems like it's >> > adding complexity without really adding any safety. >> >> I agree. I think ignoring fsync failures is a very sensible approach. >> If the files are not writable, they're probably not ours. > > The reason we started discussing this is because Tom had the - quite > reasonable - concern that this might not solely be a problem of EACCESS, > but that there could be other errors that we need to ignore to not fail > spuriously. Say a symlink goes to a binary, which is currently being > executed: ETXTBSY. Or the file is in a readonly filesystem: EROFS. So > we'd need to ignore a lot of errors, possibly ignoring valid ones. But ignoring those errors wouldn't compromise data integrity, either. So let's just ignore (but log) all errors: then we'll be demonstrably no worse off than we were before this patch went in. If it turns out that there's some particular case where ignoring errors DOES compromise data integrity, then we can plug that hole surgically when somebody reports it. Anything we do short of making all errors in this area non-fatal is going to leave behind startup-failure cases that exist today, and we have no evidence at this time that such startup failures would be justified by any actual data loss risk. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Mon, May 25, 2015 at 9:54 PM, Stephen Frost wrote: > I certainly see your point, but Tom also pointed out that it's not great > to ignore failures during this phase: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> Greg Stark writes: >> > What exactly is failing? >> > Is it that fsync is returning -1 ? >> According to the original report from Christoph Berg, it was open() >> not fsync() that was failing, at least in permissions-based cases. >> >> I'm not sure if we should just uniformly ignore all failures in this >> phase. That would have the merit of clearly not creating any new >> startup failure cases compared to the previous code, but as you say >> sometimes it might mean ignoring real problems. > > If we accept this, then we still have to have the lists, to decide what > to fail on and what to ignore. If we're going to have said lists tho, I > don't really see the point in fsync'ing things we're pretty confident > aren't ours. No, that's not right at all. The idea, at least as I understand it, would be decide which errors to ignore by looking at the error code, not by looking at which file is involved. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
* Abhijit Menon-Sen (a...@2ndquadrant.com) wrote: > At 2015-05-26 03:54:51 +0200, and...@anarazel.de wrote: > > Another thing is whether we should handle a recursive symlink in > > pgdata? I personally think not, but... > > I think not too. Yikes.. That's definitely the kind of thing that's why I worry about the whole 'fsync everything' idea- what if I symlink to /? I've certainly done that before from my home directory for ease of use and I imagine there are people out there who have similar setups where they sftp as the PG user and use the symlink to more easily navigate somewhere else. We have to realize that, on at least some systems, PGDATA could be the postgres user's home directory too. That's not the case on Debian-based systems today, but I think it might have been before Debian had the multi-cluster tooling. > > It's also not just as simple as making fsync_fname fail gracefully > > upon EACCESS - the opendir() could fail just as well. > > I'll post a proposed patch shortly. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
At 2015-05-26 03:54:51 +0200, and...@anarazel.de wrote: > > Say a symlink goes to a binary, which is currently being executed: > ETXTBSY. Or the file is in a readonly filesystem: EROFS. So we'd > need to ignore a lot of errors, possibly ignoring valid ones. Right. That's why I started out by being conservative and following only the "expected" symlinks in pg_tblspc (as other parts of the code do). > Another thing is whether we should handle a recursive symlink in > pgdata? I personally think not, but... I think not too. > It's also not just as simple as making fsync_fname fail gracefully > upon EACCESS - the opendir() could fail just as well. I'll post a proposed patch shortly. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-25 21:33:03 -0400, Robert Haas wrote: > On Mon, May 25, 2015 at 2:20 PM, Tom Lane wrote: > > Perhaps, but if we didn't have permission to write the file, it's hard to > > argue that it's our responsibility to fsync it. So this seems like it's > > adding complexity without really adding any safety. > > I agree. I think ignoring fsync failures is a very sensible approach. > If the files are not writable, they're probably not ours. The reason we started discussing this is because Tom had the - quite reasonable - concern that this might not solely be a problem of EACCESS, but that there could be other errors that we need to ignore to not fail spuriously. Say a symlink goes to a binary, which is currently being executed: ETXTBSY. Or the file is in a readonly filesystem: EROFS. So we'd need to ignore a lot of errors, possibly ignoring valid ones. I personally can see why people will put things in PGDATA itself, if you put unreadable stuff in some subdirectory that you didn't create yourself, I see much less reason to tolerate that. Another thing is whether we should handle a recursive symlink in pgdata? I personally think not, but... It's also not just as simple as making fsync_fname fail gracefully upon EACCESS - the opendir() could fail just as well. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Robert, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, May 25, 2015 at 2:20 PM, Tom Lane wrote: >> Andres Freund writes: >>> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote: Not really sure I see that as helping. >>> On most OSs, except windows and some obscure unixes, a readonly fd is >>> allowed to fsync a file. >> Perhaps, but if we didn't have permission to write the file, it's hard to >> argue that it's our responsibility to fsync it. So this seems like it's >> adding complexity without really adding any safety. > > I agree. I think ignoring fsync failures is a very sensible approach. > If the files are not writable, they're probably not ours. If they are > not writable but somehow still ours, we probably can't have written > them before the crash, either. If they are ours and we somehow wrote > to them before the crash, and then while the system was down they were > made inaccessible, and then the database was restarted, then we're > well into the territory where the system administrator has done > something that we cannot possibly be expected to cope with ... but > ignoring the fsync isn't very likely to cause any real problems even > here. If we really did modify those blocks recently, recovery will > try to redo the changes, and we'll fail then anyway. So what's the > problem? > > I agree with Tom's concern that if we have two lists of directories, > they may get out of sync. We could probably merge the two lists > somehow, but I'm not really seeing the point, since Tom's blanket > approach should work just fine. I certainly see your point, but Tom also pointed out that it's not great to ignore failures during this phase: * Tom Lane (t...@sss.pgh.pa.us) wrote: > Greg Stark writes: > > What exactly is failing? > > Is it that fsync is returning -1 ? > According to the original report from Christoph Berg, it was open() > not fsync() that was failing, at least in permissions-based cases. > > I'm not sure if we should just uniformly ignore all failures in this > phase. That would have the merit of clearly not creating any new > startup failure cases compared to the previous code, but as you say > sometimes it might mean ignoring real problems. If we accept this, then we still have to have the lists, to decide what to fail on and what to ignore. If we're going to have said lists tho, I don't really see the point in fsync'ing things we're pretty confident aren't ours. Further, in any of these cases, we have to decide which failure cases are ones that are "fatal" and which are not- being in the list or not isn't the only criteria, it's just one part of the overall decision. We also need to consider what return value we get back for which system calls, all of which may entirely be system dependent, meanly we may have to deal with portability issues here too. Then there are other interesting considerations like what happens with an NFS mount (as Greg mentioned), or perhaps what happens when it's a MAC violation (eg: SELinux). Generally speaking, those will also return an error code which we can contemplate, but it'll still create annoying log noise for people running in such environments. Perhaps that would encourage them to move whatever files they have out of $PGDATA, which is likely to be a good decision, but that may not always be possible.. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On Mon, May 25, 2015 at 2:20 PM, Tom Lane wrote: > Andres Freund writes: >> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote: Additionally we could attempt to fsync with a readonly fd before trying the read-write fd... > >>> Not really sure I see that as helping. > >> On most OSs, except windows and some obscure unixes, a readonly fd is >> allowed to fsync a file. > > Perhaps, but if we didn't have permission to write the file, it's hard to > argue that it's our responsibility to fsync it. So this seems like it's > adding complexity without really adding any safety. I agree. I think ignoring fsync failures is a very sensible approach. If the files are not writable, they're probably not ours. If they are not writable but somehow still ours, we probably can't have written them before the crash, either. If they are ours and we somehow wrote to them before the crash, and then while the system was down they were made inaccessible, and then the database was restarted, then we're well into the territory where the system administrator has done something that we cannot possibly be expected to cope with ... but ignoring the fsync isn't very likely to cause any real problems even here. If we really did modify those blocks recently, recovery will try to redo the changes, and we'll fail then anyway. So what's the problem? I agree with Tom's concern that if we have two lists of directories, they may get out of sync. We could probably merge the two lists somehow, but I'm not really seeing the point, since Tom's blanket approach should work just fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 05/25/2015 01:21 PM, Tom Lane wrote: > And from the other direction, where exactly is it written that > distros/users will only create problematic files at the top level of > $PGDATA? I'd have zero confidence in such an assertion applied to > tablespace directories, for sure. Yes, absolutely. For example: Cstore_FDW now puts its files in a subdirectory of the tablespace directory, which for a really large cstore table, the user will want to symlink somewhere else, or might create as a mount. Such a user might then, for example, annotate this with a README.txt which happens to be root/root perms. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Alvaro Herrera writes: > Tom Lane wrote: >> Well, that opens us to errors of omission, ie failing to fsync things we >> should have. Maybe that's an okay risk, but personally I'd judge that >> "fsync everything and ignore (some?) errors" is probably a more robust >> approach over time. > How is it possible to make errors of omission? The list of directories > in initdb is the complete set of directories that are created for a > newly-initdb'd database, no? Surely there can't be a database that > contains vital directories that are not created there? See "subdirs" > static in initdb.c. Easy: all you need is to suppose that some of the plain files at top level of $PGDATA ought to be fsync'd. (I'm fairly sure for example that we took steps awhile back to force postmaster.pid to be fsync'd.) If there is a distinction between the fsync requirements of top-level files and everything else, it is completely accidental and not to be relied on. And from the other direction, where exactly is it written that distros/users will only create problematic files at the top level of $PGDATA? I'd have zero confidence in such an assertion applied to tablespace directories, for sure. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Tom Lane wrote: > Stephen Frost writes: > > I've not followed this thread all that closely, but I do tend to agree > > with the idea of "only try to mess with files that are *clearly* ours to > > mess with." > > Well, that opens us to errors of omission, ie failing to fsync things we > should have. Maybe that's an okay risk, but personally I'd judge that > "fsync everything and ignore (some?) errors" is probably a more robust > approach over time. How is it possible to make errors of omission? The list of directories in initdb is the complete set of directories that are created for a newly-initdb'd database, no? Surely there can't be a database that contains vital directories that are not created there? See "subdirs" static in initdb.c. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Andres Freund writes: > On 2015-05-25 14:14:10 -0400, Stephen Frost wrote: >>> Additionally we could attempt to fsync with a readonly fd before trying >>> the read-write fd... >> Not really sure I see that as helping. > On most OSs, except windows and some obscure unixes, a readonly fd is > allowed to fsync a file. Perhaps, but if we didn't have permission to write the file, it's hard to argue that it's our responsibility to fsync it. So this seems like it's adding complexity without really adding any safety. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-25 14:02:28 -0400, Tom Lane wrote: > Stephen Frost writes: > > I've not followed this thread all that closely, but I do tend to agree > > with the idea of "only try to mess with files that are *clearly* ours to > > mess with." > > Well, that opens us to errors of omission, ie failing to fsync things we > should have. Is that really that likely? I mean we don't normally add data to the top level directory itself, and subdirectories hopefully won't be added except via initdb? > Maybe that's an okay risk, but personally I'd judge that > "fsync everything and ignore (some?) errors" is probably a more robust > approach over time. The over-the-top approach would be to combine the two. Error out in directories that are in the initdb list, and ignore permission errors otherwise... Additionally we could attempt to fsync with a readonly fd before trying the read-write fd... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
* Andres Freund (and...@anarazel.de) wrote: > On 2015-05-25 14:14:10 -0400, Stephen Frost wrote: > > That seems overly complicated, for my 2c at least. I don't particularly > > like trying to mess with files that might be rightfully considered "not > > ours" either. > > I'd not consider an fsync to be "messing" with files, especially if > they're in PGDATA. I'm not entirely sure I agree. > > > Additionally we could attempt to fsync with a readonly fd before trying > > > the read-write fd... > > > > Not really sure I see that as helping. > > On most OSs, except windows and some obscure unixes, a readonly fd is > allowed to fsync a file. I wouldn't have thought otherwise, given that you were suggesting it, but there's no guarantee we're going to be allowed to read it either, or even access the directory the symlink points to, etc.. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
* Andres Freund (and...@anarazel.de) wrote: > On 2015-05-25 14:02:28 -0400, Tom Lane wrote: > > Stephen Frost writes: > > > I've not followed this thread all that closely, but I do tend to agree > > > with the idea of "only try to mess with files that are *clearly* ours to > > > mess with." > > > > Well, that opens us to errors of omission, ie failing to fsync things we > > should have. > > Is that really that likely? I mean we don't normally add data to the top > level directory itself, and subdirectories hopefully won't be added > except via initdb? That feels like a pretty low risk, to me at least. Certainly better than having a failure, like what's going on now. > > Maybe that's an okay risk, but personally I'd judge that > > "fsync everything and ignore (some?) errors" is probably a more robust > > approach over time. > > The over-the-top approach would be to combine the two. Error out in > directories that are in the initdb list, and ignore permission errors > otherwise... That seems overly complicated, for my 2c at least. I don't particularly like trying to mess with files that might be rightfully considered "not ours" either. This all makes me really wonder about postgresql.auto.conf too.. Clearly, on the one hand, we consider that "our" file, and so we should error out if we don't own it, but on the other hand, I've specifically recommended making that file owned by root to some folks, to avoid DBAs playing with the startup-time settings.. > Additionally we could attempt to fsync with a readonly fd before trying > the read-write fd... Not really sure I see that as helping. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Stephen Frost writes: > I've not followed this thread all that closely, but I do tend to agree > with the idea of "only try to mess with files that are *clearly* ours to > mess with." Well, that opens us to errors of omission, ie failing to fsync things we should have. Maybe that's an okay risk, but personally I'd judge that "fsync everything and ignore (some?) errors" is probably a more robust approach over time. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-25 14:14:10 -0400, Stephen Frost wrote: > That seems overly complicated, for my 2c at least. I don't particularly > like trying to mess with files that might be rightfully considered "not > ours" either. I'd not consider an fsync to be "messing" with files, especially if they're in PGDATA. > > Additionally we could attempt to fsync with a readonly fd before trying > > the read-write fd... > > Not really sure I see that as helping. On most OSs, except windows and some obscure unixes, a readonly fd is allowed to fsync a file. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
* Andres Freund (and...@anarazel.de) wrote: > On 2015-05-25 13:38:01 -0400, Tom Lane wrote: > > Andres Freund writes: > > > On May 24, 2015 7:52:53 AM PDT, Tom Lane wrote: > > > If we'd merge it with initdb's list I think I'd not be that bad. I'm > > > thinking of some header declaring it, roughly like the rmgr list. > > > > pg_log/ is a counterexample to that idea too; initdb doesn't know about it > > (and shouldn't). > > The idea would be to *only* directories that initdb knows about. Since > that's where the valuables are. So I don't see how pg_log would be a > counterexample. Indeed, that wouldn't be included in the list of things to fsync and it isn't listed in initdb, so that works. I've not followed this thread all that closely, but I do tend to agree with the idea of "only try to mess with files that are *clearly* ours to mess with." Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-25 13:38:01 -0400, Tom Lane wrote: > Andres Freund writes: > > On May 24, 2015 7:52:53 AM PDT, Tom Lane wrote: > > If we'd merge it with initdb's list I think I'd not be that bad. I'm > > thinking of some header declaring it, roughly like the rmgr list. > > pg_log/ is a counterexample to that idea too; initdb doesn't know about it > (and shouldn't). The idea would be to *only* directories that initdb knows about. Since that's where the valuables are. So I don't see how pg_log would be a counterexample. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Andres Freund writes: > On May 24, 2015 7:52:53 AM PDT, Tom Lane wrote: >> Christoph Berg writes: >>> pg_log/ is also admin domain. What about only recursing into >>> well-known directories + postgresql.auto.conf? >> The idea that this code would know exactly what's what under $PGDATA >> scares me. I can positively guarantee that it would diverge from >> reality over time, and nobody would notice until it ate their data, >> failed to start, or otherwise behaved undesirably. >> >> pg_log/ is a perfect example, because that is not a hard-wired >> directory name; somebody could point the syslogger at a different place >> very easily. Wiring in special behavior for that name is just wrong. >> >> I would *much* rather have a uniform rule for how to treat each file >> the scan comes across. It might take some tweaking to get to one that >> works well; but once we did, we could have some confidence that it >> wouldn't break later. > If we'd merge it with initdb's list I think I'd not be that bad. I'm thinking > of some header declaring it, roughly like the rmgr list. pg_log/ is a counterexample to that idea too; initdb doesn't know about it (and shouldn't). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On May 24, 2015 7:52:53 AM PDT, Tom Lane wrote: >Christoph Berg writes: >> Re: To Andres Freund 2015-05-24 <20150524075244.gb27...@msg.df7cb.de> >>> Re: Andres Freund 2015-05-24 ><20150524005245.gd32...@alap3.anarazel.de> How about, to avoid masking actual problems, we have a more differentiated logic for the toplevel data directory? > >> pg_log/ is also admin domain. What about only recursing into >> well-known directories + postgresql.auto.conf? > >The idea that this code would know exactly what's what under $PGDATA >scares me. I can positively guarantee that it would diverge from >reality >over time, and nobody would notice until it ate their data, failed to >start, or otherwise behaved undesirably. > >pg_log/ is a perfect example, because that is not a hard-wired >directory >name; somebody could point the syslogger at a different place very >easily. >Wiring in special behavior for that name is just wrong. > >I would *much* rather have a uniform rule for how to treat each file >the scan comes across. It might take some tweaking to get to one that >works well; but once we did, we could have some confidence that it >wouldn't break later. If we'd merge it with initdb's list I think I'd not be that bad. I'm thinking of some header declaring it, roughly like the rmgr list. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Christoph Berg writes: > Re: To Andres Freund 2015-05-24 <20150524075244.gb27...@msg.df7cb.de> >> Re: Andres Freund 2015-05-24 <20150524005245.gd32...@alap3.anarazel.de> >>> How about, to avoid masking actual problems, we have a more >>> differentiated logic for the toplevel data directory? > pg_log/ is also admin domain. What about only recursing into > well-known directories + postgresql.auto.conf? The idea that this code would know exactly what's what under $PGDATA scares me. I can positively guarantee that it would diverge from reality over time, and nobody would notice until it ate their data, failed to start, or otherwise behaved undesirably. pg_log/ is a perfect example, because that is not a hard-wired directory name; somebody could point the syslogger at a different place very easily. Wiring in special behavior for that name is just wrong. I would *much* rather have a uniform rule for how to treat each file the scan comes across. It might take some tweaking to get to one that works well; but once we did, we could have some confidence that it wouldn't break later. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Re: To Andres Freund 2015-05-24 <20150524075244.gb27...@msg.df7cb.de> > Re: Andres Freund 2015-05-24 <20150524005245.gd32...@alap3.anarazel.de> > > How about, to avoid masking actual problems, we have a more > > differentiated logic for the toplevel data directory? I think we could > > just skip all non-directory files in there data_directory itself. None > > of the files in the toplevel directory, with the exception of > > postgresql.auto.conf, will ever get written to by PG itself. And if > > there's readonly files somewhere in a subdirectory, I won't feel > > particularly bad. pg_log/ is also admin domain. What about only recursing into well-known directories + postgresql.auto.conf? (I've also been wondering if pg_basebackup shouldn't skip pg_log, but that's a different topic...) Christoph -- c...@df7cb.de | http://www.df7cb.de/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Re: Andres Freund 2015-05-24 <20150524005245.gd32...@alap3.anarazel.de> > How about, to avoid masking actual problems, we have a more > differentiated logic for the toplevel data directory? I think we could > just skip all non-directory files in there data_directory itself. None > of the files in the toplevel directory, with the exception of > postgresql.auto.conf, will ever get written to by PG itself. And if > there's readonly files somewhere in a subdirectory, I won't feel > particularly bad. I like that idea. Christoph -- c...@df7cb.de | http://www.df7cb.de/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
On 2015-05-23 16:33:29 -0400, Tom Lane wrote: > Christoph Berg writes: > > the new fsync-pgdata-on-recovery code tries to open all files using > > O_RDWR. At least on 9.1, this can make recovery fail: > > Hm. I wonder whether it would be all right to just skip files for which > we get EPERM on open(). The argument being that if we can't write to the > file, we should not be held responsible for fsync'ing it either. But > I'm not sure whether EPERM would be the only relevant errno, or whether > there are cases where this would mask real problems. We could even try doing the a fsync with a readonly fd as a fallback, but that's also pretty hacky. How about, to avoid masking actual problems, we have a more differentiated logic for the toplevel data directory? I think we could just skip all non-directory files in there data_directory itself. None of the files in the toplevel directory, with the exception of postgresql.auto.conf, will ever get written to by PG itself. And if there's readonly files somewhere in a subdirectory, I won't feel particularly bad. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Re: Tom Lane 2015-05-23 <2284.1432413...@sss.pgh.pa.us> > Christoph Berg writes: > > the new fsync-pgdata-on-recovery code tries to open all files using > > O_RDWR. At least on 9.1, this can make recovery fail: > > Hm. I wonder whether it would be all right to just skip files for which > we get EPERM on open(). The argument being that if we can't write to the > file, we should not be held responsible for fsync'ing it either. But > I'm not sure whether EPERM would be the only relevant errno, or whether > there are cases where this would mask real problems. Maybe logging WARNINGs instead of FATAL would be enough of a fix? Christoph -- c...@df7cb.de | http://www.df7cb.de/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Christoph Berg writes: > the new fsync-pgdata-on-recovery code tries to open all files using > O_RDWR. At least on 9.1, this can make recovery fail: Hm. I wonder whether it would be all right to just skip files for which we get EPERM on open(). The argument being that if we can't write to the file, we should not be held responsible for fsync'ing it either. But I'm not sure whether EPERM would be the only relevant errno, or whether there are cases where this would mask real problems. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously
Hi, the new fsync-pgdata-on-recovery code tries to open all files using O_RDWR. At least on 9.1, this can make recovery fail: * launch postgres, hit ^\ (or otherwise shut down uncleanly) * touch foo; chmod 444 foo * launch postgres LOG: database system was interrupted; last known up at 2015-05-23 19:18:36 CEST FATAL: could not open file "/home/cbe/9.1/foo": Permission denied LOG: startup process (PID 27305) exited with exit code 1 LOG: aborting startup due to startup process failure The code on 9.4 looks similar to me, but I couldn't trigger the problem there. I think this is a real-world problem: 1) In older releases, the SSL certs were read from the data directory, and at least the default Debian installation creates symlinks from PGDATA/server.* to /etc/ssl/ where PostgreSQL can't write 2) It's probably a pretty common scenario that the root user will edit postgresql.conf, and make backups or create other random files there that are not writable. Even a non-writable postgresql.conf itself or recovery.conf was not a problem previously. To me, this is a serious regression because it prevents automatic startup of a server that would otherwise just run. Christoph -- c...@df7cb.de | http://www.df7cb.de/ signature.asc Description: Digital signature