subject:"\[HACKERS\] fsync\-pgdata\-on\-recovery tries to write to more files than previously"

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Christoph Berg

Re: Tom Lane 2015-05-29 <13871.1432921...@sss.pgh.pa.us>
> Why can't the user stop it?  We won't be bleating about the case of a
> symlink to a non-writable file someplace else, which is the Debian use
> case.  I don't see a very good excuse to have a non-writable file right
> in the data directory.

I've repeatedly seen PGDATA or pg_xlog been put directly on a
mountpoint, which means there well be a non-writable lost+found
directory there. (A case with pg_xlog was also reported as a support
case at credativ.) I'm usually advising against using the top level
directory directly, but it's not uncommon to encounter it.

> In any case, if the cost of such a file is one more line of log output
> during a crash restart, most people would have no problem at all in
> ignoring that log output.

Nod.

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Stephen Frost

* Andres Freund (and...@anarazel.de) wrote:
> On 2015-05-29 13:49:16 -0400, Tom Lane wrote:
> > > That sounds like a potentially nontrivial amount of repetitive log bleat
> > > after every crash start? One which the user can't really stop?
> > 
> > Why can't the user stop it?
> 
> Because it makes a good amount of sense to have e.g. certificates not
> owned by postgres and not writeable? You don't necessarily want to
> symlink them somewhere else, because that makes moving clusters around
> harder than when they're self contained.

A certain other file might be non-writable by PG too... (*cough* .auto
*cough*).

> > I'd say it's a pretty damn-fool arrangement: for starters, it's an
> > unnecessary security hazard.
> 
> I don't buy the security argument at all. You likely have
> postgresql.conf in the data directoy. You can write to at least .auto,
> which will definitely reside the data directory. That contains
> archive_command.

I'm not sure that I see the security issue here either..  We're not
talking about setuid shell scripts or anything that isn't running as the
PG user, which a superuser could take over anyway..

Thanks!

Stephen


signature.asc
Description: Digital signature

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Andres Freund

On 2015-05-29 14:15:48 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2015-05-29 13:49:16 -0400, Tom Lane wrote:
> >> Why can't the user stop it?
> 
> > Because it makes a good amount of sense to have e.g. certificates not
> > owned by postgres and not writeable? You don't necessarily want to
> > symlink them somewhere else, because that makes moving clusters around
> > harder than when they're self contained.
> 
> Meh.  Well, I'm willing to yield on the EACCES point, but I still find
> the exclusion for ETXTBSY to be ugly and inappropriate.

Ok.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Tom Lane

Andres Freund  writes:
> On 2015-05-29 13:49:16 -0400, Tom Lane wrote:
>> Why can't the user stop it?

> Because it makes a good amount of sense to have e.g. certificates not
> owned by postgres and not writeable? You don't necessarily want to
> symlink them somewhere else, because that makes moving clusters around
> harder than when they're self contained.

Meh.  Well, I'm willing to yield on the EACCES point, but I still find
the exclusion for ETXTBSY to be ugly and inappropriate.

>> I'd say it's a pretty damn-fool arrangement: for starters, it's an
>> unnecessary security hazard.

> I don't buy the security argument at all. You likely have
> postgresql.conf in the data directoy. You can write to at least .auto,
> which will definitely reside the data directory. That contains
> archive_command.

The fact that a superuser might have multiple ways to subvert things
doesn't make it a good idea to add another one: the attack surface
could be larger, or at least different.  But even if you don't buy
that it's a security hazard, why would it be a good idea to have
executables inside $PGDATA?  That would for example lead to them getting
copied by pg_basebackup, which seems unlikely to be a good thing.
And if you did have such executables there, why would they be active
during a postmaster restart?

I really seriously doubt that this is either common enough or useful
enough to justify suppressing warning messages about it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Andres Freund

On 2015-05-29 13:49:16 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2015-05-29 13:14:18 -0400, Tom Lane wrote:
> >> Abhijit Menon-Sen  writes:
> >> As I mentioned yesterday, I'm not really on board with ignoring EACCES,
> >> except for the directories-on-Windows case.  Since we're only logging
> >> the failures anyway, I think it is reasonable to log a complaint for any
> >> unwritable file in the data directory.
> 
> > That sounds like a potentially nontrivial amount of repetitive log bleat
> > after every crash start? One which the user can't really stop?
> 
> Why can't the user stop it?

Because it makes a good amount of sense to have e.g. certificates not
owned by postgres and not writeable? You don't necessarily want to
symlink them somewhere else, because that makes moving clusters around
harder than when they're self contained.

> I'd say it's a pretty damn-fool arrangement: for starters, it's an
> unnecessary security hazard.

I don't buy the security argument at all. You likely have
postgresql.conf in the data directoy. You can write to at least .auto,
which will definitely reside the data directory. That contains
archive_command.

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Tom Lane

Andres Freund  writes:
> On 2015-05-29 13:14:18 -0400, Tom Lane wrote:
>> Abhijit Menon-Sen  writes:
>> As I mentioned yesterday, I'm not really on board with ignoring EACCES,
>> except for the directories-on-Windows case.  Since we're only logging
>> the failures anyway, I think it is reasonable to log a complaint for any
>> unwritable file in the data directory.

> That sounds like a potentially nontrivial amount of repetitive log bleat
> after every crash start? One which the user can't really stop?

Why can't the user stop it?  We won't be bleating about the case of a
symlink to a non-writable file someplace else, which is the Debian use
case.  I don't see a very good excuse to have a non-writable file right
in the data directory.

>> Also I want to get rid of the ETXTBSY special cases.  That one doesn't
>> seem like something that we should silently ignore: what the heck are
>> executables doing in the data directory?  Or is there some other meaning
>> on Windows?

> I've seen a bunch of binaries placed in the data directory as
> archive/restore commands. Those will be busy a good amount of the
> time. While it'd not be my choice to do that, it's not entirely
> unreasonable.

I'd say it's a pretty damn-fool arrangement: for starters, it's
an unnecessary security hazard.

In any case, if the cost of such a file is one more line of log output
during a crash restart, most people would have no problem at all in
ignoring that log output.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Andres Freund

On 2015-05-29 13:14:18 -0400, Tom Lane wrote:
> Abhijit Menon-Sen  writes:
> As I mentioned yesterday, I'm not really on board with ignoring EACCES,
> except for the directories-on-Windows case.  Since we're only logging
> the failures anyway, I think it is reasonable to log a complaint for any
> unwritable file in the data directory.

That sounds like a potentially nontrivial amount of repetitive log bleat
after every crash start? One which the user can't really stop?

> Also I want to get rid of the ETXTBSY special cases.  That one doesn't
> seem like something that we should silently ignore: what the heck are
> executables doing in the data directory?  Or is there some other meaning
> on Windows?

I've seen a bunch of binaries placed in the data directory as
archive/restore commands. Those will be busy a good amount of the
time. While it'd not be my choice to do that, it's not entirely
unreasonable.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Abhijit Menon-Sen

At 2015-05-29 13:14:18 -0400, t...@sss.pgh.pa.us wrote:
>
> Pushed with minor revisions.

Thanks, looks good.

> Since we're only logging the failures anyway, I think it is reasonable
> to log a complaint for any unwritable file in the data directory. 

Sounds reasonable, patch attached. ETXTBSY has no special meaning that I
can find under Windows, so I included that change as well.

-- Abhijit
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b4f6590..70e2347 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2646,11 +2646,12 @@ pre_sync_fname(const char *fname, bool isdir, int elevel)
 
 	if (fd < 0)
 	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
+		/* Some platforms don't support opening directories at all. */
+		if (isdir && errno == EISDIR)
 			return;
 
-#ifdef ETXTBSY
-		if (errno == ETXTBSY)
+#ifdef WIN32
+		if (isdir && errno == EACCES)
 			return;
 #endif
 
@@ -2701,11 +2702,12 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel)
 	fd = OpenTransientFile((char *) fname, flags, 0);
 	if (fd < 0)
 	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
+		/* Some platforms don't support opening directories at all. */
+		if (isdir && errno == EISDIR)
 			return;
 
-#ifdef ETXTBSY
-		if (errno == ETXTBSY)
+#ifdef WIN32
+		if (isdir && errno == EACCES)
 			return;
 #endif
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f0d66fa..f343168 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -617,11 +617,12 @@ pre_sync_fname(const char *fname, bool isdir)
 
 	if (fd < 0)
 	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
+		/* Some platforms don't support opening directories at all. */
+		if (isdir && errno == EISDIR)
 			return;
 
-#ifdef ETXTBSY
-		if (errno == ETXTBSY)
+#ifdef WIN32
+		if (isdir && errno == EACCES)
 			return;
 #endif
 
@@ -682,11 +683,12 @@ fsync_fname_ext(const char *fname, bool isdir)
 	fd = open(fname, flags);
 	if (fd < 0)
 	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
+		/* Some platforms don't support opening directories at all. */
+		if (isdir && errno == EISDIR)
 			return;
 
-#ifdef ETXTBSY
-		if (errno == ETXTBSY)
+#ifdef WIN32
+		if (isdir && errno == EACCES)
 			return;
 #endif
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Tom Lane

Abhijit Menon-Sen  writes:
> At 2015-05-28 17:37:16 -0400, t...@sss.pgh.pa.us wrote:
>> I have to leave shortly, so I'll look at the initdb cleanup tomorrow.

> Here's a revision of that patch that's more along the lines of what you
> committed.

Pushed with minor revisions.

> It occurred to me that we should probably also silently skip EACCES on
> opendir and stat/lstat in walkdir. That's what the attached initdb patch
> does. If you think it's a good idea, there's also a small fixup attached
> for the backend version too.

As I mentioned yesterday, I'm not really on board with ignoring EACCES,
except for the directories-on-Windows case.  Since we're only logging
the failures anyway, I think it is reasonable to log a complaint for any
unwritable file in the data directory.  I've not done it yet, but what
I think we oughta do is revert

if (errno == EACCES || (isdir && errno == EISDIR))

back to the former logic

if (isdir && (errno == EISDIR || errno == EACCES))

and even then, maybe the EACCES exclusion ought to be Windows only?

Also I want to get rid of the ETXTBSY special cases.  That one doesn't
seem like something that we should silently ignore: what the heck are
executables doing in the data directory?  Or is there some other meaning
on Windows?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Tom Lane

Michael Paquier  writes:
> On Fri, May 29, 2015 at 6:49 PM, Amit Kapila  wrote:
>> Test - 2 - Directory Symlink for pg_xlog
>> First 4 steps are same as Test-1
>> 5.  mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
>> 6.  Restart Server - Error
>> - FATAL:  required WAL directory "pg_xlog" does not exist
>> This error occurs in
>> ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf)
>> 7. If I manually step-through that line, it proceeds and in
>> SyncDataDirectory(), it detects the symlink;
>> 8. After that in SyncDataDirectory(), when it does walkdir for
>> datadir, for ./pg_xlog it reports the log message and then
>> proceeds normal and the server is started.

> Regarding this test, I just tested initdb -X and it works correctly
> after crashing the server, so there is at least a workaround for the
> error you are triggering here. Now, shouldn't we improve things as
> well with mklink? That's a rather narrow case and we have a workaround
> for it at initialization with initdb. I haven't looked much at the
> code so I don't have a patch at hand but thoughts on this matter are
> welcome.

My (vague) recollection is that what mklink creates is not the same
as a junction point and we intentionally only support the latter as
pseudo-symlinks.  You'd need to trawl the archives for more detail,
unless someone else remembers.

So I think we're good here.  Thanks for all the testing!

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Michael Paquier

On Fri, May 29, 2015 at 6:49 PM, Amit Kapila  wrote:
> On Fri, May 29, 2015 at 2:28 PM, Michael Paquier 
> wrote:
>>
>> On Fri, May 29, 2015 at 5:01 PM, Amit Kapila 
>> wrote:
>> >
>> > Test-3 - Symlinks in pg_tblspc.
>> > 1. Create couple of tablespaces which creates symlinks
>> > in pg_tblspc
>> > 2. Crash the server
>> > 3. Restart Server - It works fine.
>> >
>> > I am not sure Test-2 is a valid thing and we should support it or
>> > not, but the current patch is sane w.r.t that form of symlinks
>> > as well.
>>
>> It is always good to check, but is that relevant to the bug fixed? I
>> mean, you need to symlink a file in PGDATA that server has no write
>> permission on... For example tablespaces can be written to in test 3,
>> so that would work even with 9.4.2 after crashing the server with -m
>> immediate.
>
> I have just tried to cover the paths which has symlink related code
> in the new commit (in functions SyncDataDirectory() and walkdir()),
> because thats what I understood from Tom's mail.  Without test-3, it
> won't cover walkdir symlink test path, I didn't knew that there was any
> confusion about verification of permissions problem on Windows
> and I haven't looked to verify the whole patch.

Reading once again this thread (after some good red wine + alpha), it
looks that you are right: not many checks with Windows junctions have
actually been with what has been committed...

> Test - 2 - Directory Symlink for pg_xlog
> First 4 steps are same as Test-1
> 5.  mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
> 6.  Restart Server - Error
> - FATAL:  required WAL directory "pg_xlog" does not exist
> This error occurs in
> ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf)
> 7. If I manually step-through that line, it proceeds and in
> SyncDataDirectory(), it detects the symlink;
> 8. After that in SyncDataDirectory(), when it does walkdir for
> datadir, for ./pg_xlog it reports the log message and then
> proceeds normal and the server is started.

Regarding this test, I just tested initdb -X and it works correctly
after crashing the server, so there is at least a workaround for the
error you are triggering here. Now, shouldn't we improve things as
well with mklink? That's a rather narrow case and we have a workaround
for it at initialization with initdb. I haven't looked much at the
code so I don't have a patch at hand but thoughts on this matter are
welcome.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Christoph Berg

Re: Tom Lane 2015-05-28 <5740.1432849...@sss.pgh.pa.us>
> Abhijit Menon-Sen  writes:
> > Here's an updated patch for the fsync problem(s).
> 
> I've committed this after some mostly-cosmetic rearrangements.

Fwiw, I can confirm that the problem is fixed for 9.5. The regression
tests I've added to postgresql-common pass. Thanks!

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Amit Kapila

On Fri, May 29, 2015 at 2:28 PM, Michael Paquier 
wrote:
>
> On Fri, May 29, 2015 at 5:01 PM, Amit Kapila 
wrote:
> >
> > Test-3 - Symlinks in pg_tblspc.
> > 1. Create couple of tablespaces which creates symlinks
> > in pg_tblspc
> > 2. Crash the server
> > 3. Restart Server - It works fine.
> >
> > I am not sure Test-2 is a valid thing and we should support it or
> > not, but the current patch is sane w.r.t that form of symlinks
> > as well.
>
> It is always good to check, but is that relevant to the bug fixed? I
> mean, you need to symlink a file in PGDATA that server has no write
> permission on... For example tablespaces can be written to in test 3,
> so that would work even with 9.4.2 after crashing the server with -m
> immediate.

I have just tried to cover the paths which has symlink related code
in the new commit (in functions SyncDataDirectory() and walkdir()),
because thats what I understood from Tom's mail.  Without test-3, it
won't cover walkdir symlink test path, I didn't knew that there was any
confusion about verification of permissions problem on Windows
and I haven't looked to verify the whole patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Michael Paquier

On Fri, May 29, 2015 at 5:01 PM, Amit Kapila  wrote:
> On Fri, May 29, 2015 at 9:25 AM, Tom Lane  wrote:
>>
>>
>> Speaking of which, could somebody test that the committed patch does
>> what it's supposed to on Windows?  You were worried upthread about
>> whether the tests for symlinks (aka junction points) behaved correctly,
>> and I have no way to check that either.
>>
>
> I have done some tests for the committed patch for this issue
> (especially to check the behaviour of symlink tests in the
> new code) on Windows - 7 and below are results:
>
> Test symlink for pg_xlog
> -
> Test -1 - Junction point for pg_xlog
> 1. Create a database (initdb)
> 2. Start server and forcefully crash it
> 3. cp E:\PGData\pg_xlog E:\TempLog\xlog_symlink_loc
> 4. mv E:\PGData\pg_xlog E:\bak_pg_xlog
> 5. Created a directory junction (equivalent symlink)
> mklink /J E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
> 6.  Restart Server - operation is successful.
> I have debugged this operation, it does what it suppose to do,
> detects the junction point and does fsync.
>
> Test - 2 - Directory Symlink for pg_xlog
> First 4 steps are same as Test-1
> 5.  mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
> 6.  Restart Server - Error
> - FATAL:  required WAL directory "pg_xlog" does not exist
> This error occurs in
> ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf)
> 7. If I manually step-through that line, it proceeds and in
> SyncDataDirectory(), it detects the symlink;
> 8. After that in SyncDataDirectory(), when it does walkdir for
> datadir, for ./pg_xlog it reports the log message and then
> proceeds normal and the server is started.
>
> Test-3 - Symlinks in pg_tblspc.
> 1. Create couple of tablespaces which creates symlinks
> in pg_tblspc
> 2. Crash the server
> 3. Restart Server - It works fine.
>
> I am not sure Test-2 is a valid thing and we should support it or
> not, but the current patch is sane w.r.t that form of symlinks
> as well.

It is always good to check, but is that relevant to the bug fixed? I
mean, you need to symlink a file in PGDATA that server has no write
permission on... For example tablespaces can be written to in test 3,
so that would work even with 9.4.2 after crashing the server with -m
immediate.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Amit Kapila

On Fri, May 29, 2015 at 9:25 AM, Tom Lane  wrote:
>
>
> Speaking of which, could somebody test that the committed patch does
> what it's supposed to on Windows?  You were worried upthread about
> whether the tests for symlinks (aka junction points) behaved correctly,
> and I have no way to check that either.
>

I have done some tests for the committed patch for this issue
(especially to check the behaviour of symlink tests in the
new code) on Windows - 7 and below are results:

Test symlink for pg_xlog
-
Test -1 - Junction point for pg_xlog
1. Create a database (initdb)
2. Start server and forcefully crash it
3. cp E:\PGData\pg_xlog E:\TempLog\xlog_symlink_loc
4. mv E:\PGData\pg_xlog E:\bak_pg_xlog
5. Created a directory junction (equivalent symlink)
mklink /J E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
6.  Restart Server - operation is successful.
I have debugged this operation, it does what it suppose to do,
detects the junction point and does fsync.

Test - 2 - Directory Symlink for pg_xlog
First 4 steps are same as Test-1
5.  mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
6.  Restart Server - Error
- FATAL:  required WAL directory "pg_xlog" does not exist
This error occurs in
ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf)
7. If I manually step-through that line, it proceeds and in
SyncDataDirectory(), it detects the symlink;
8. After that in SyncDataDirectory(), when it does walkdir for
datadir, for ./pg_xlog it reports the log message and then
proceeds normal and the server is started.

Test-3 - Symlinks in pg_tblspc.
1. Create couple of tablespaces which creates symlinks
in pg_tblspc
2. Crash the server
3. Restart Server - It works fine.

I am not sure Test-2 is a valid thing and we should support it or
not, but the current patch is sane w.r.t that form of symlinks
as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-29 Thread Michael Paquier

On Fri, May 29, 2015 at 12:55 PM, Tom Lane  wrote:
> Abhijit Menon-Sen  writes:
>>> I have to leave shortly, so I'll look at the initdb cleanup tomorrow.
>
>> Here's a revision of that patch that's more along the lines of what you
>> committed.
>
> Will look at that tomorrow ...
>
>> It occurred to me that we should probably also silently skip EACCES on
>> opendir and stat/lstat in walkdir. That's what the attached initdb patch
>> does. If you think it's a good idea, there's also a small fixup attached
>> for the backend version too.
>
> Actually, I was seriously considering removing the don't-log-EACCES
> special case (at least for files, perhaps not for directories).
> The reason being that it's darn hard to verify that the code is doing
> what it's supposed to when there is no simple way to get it to log any
> complaints.  I left it as you wrote it in the committed version, but
> I think both that and the ETXTBSY special case do not really have value
> as long as their only effect is to suppress a LOG entry.
>
> Speaking of which, could somebody test that the committed patch does
> what it's supposed to on Windows?  You were worried upthread about
> whether the tests for symlinks (aka junction points) behaved correctly,
> and I have no way to check that either.

The error can be reproduced by using junction
(https://technet.microsoft.com/en-us/sysinternals/bb896768.aspx) to
define a junction point within PGDATA and then for example mark as
read-only a configuration file included in postgresql.conf through the
junction point.

Using this method, I have checked that 9.4.1 works, reproduced the
permission error with 9.4.2, and checked as well that a3ae3db4 fixed
the problem. So things look to work fine.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Tom Lane

Abhijit Menon-Sen  writes:
>> I have to leave shortly, so I'll look at the initdb cleanup tomorrow.

> Here's a revision of that patch that's more along the lines of what you
> committed.

Will look at that tomorrow ...

> It occurred to me that we should probably also silently skip EACCES on
> opendir and stat/lstat in walkdir. That's what the attached initdb patch
> does. If you think it's a good idea, there's also a small fixup attached
> for the backend version too.

Actually, I was seriously considering removing the don't-log-EACCES
special case (at least for files, perhaps not for directories).
The reason being that it's darn hard to verify that the code is doing
what it's supposed to when there is no simple way to get it to log any
complaints.  I left it as you wrote it in the committed version, but
I think both that and the ETXTBSY special case do not really have value
as long as their only effect is to suppress a LOG entry.

Speaking of which, could somebody test that the committed patch does
what it's supposed to on Windows?  You were worried upthread about
whether the tests for symlinks (aka junction points) behaved correctly,
and I have no way to check that either.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Abhijit Menon-Sen

At 2015-05-28 17:37:16 -0400, t...@sss.pgh.pa.us wrote:
>
> I've committed this after some mostly-cosmetic rearrangements.

Thank you.

> I have to leave shortly, so I'll look at the initdb cleanup tomorrow.

Here's a revision of that patch that's more along the lines of what you
committed.

It occurred to me that we should probably also silently skip EACCES on
opendir and stat/lstat in walkdir. That's what the attached initdb patch
does. If you think it's a good idea, there's also a small fixup attached
for the backend version too.

-- Abhijit
>From 26bb25e99a2954381587bbfe296a50b19a7f047c Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Thu, 28 May 2015 22:02:29 +0530
Subject: Make initdb -S behave like xlog.c:fsync_pgdata()

In particular, it should silently skip unreadable/unexpected files in
the data directory and follow symlinks only for pg_xlog and under
pg_tblspc.
---
 src/bin/initdb/initdb.c | 305 
 1 file changed, 151 insertions(+), 154 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f1c4920..9d5478c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -71,6 +71,13 @@
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
 
+/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
+#if defined(HAVE_SYNC_FILE_RANGE)
+#define PG_FLUSH_DATA_WORKS 1
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+#define PG_FLUSH_DATA_WORKS 1
+#endif
+
 static const char *auth_methods_host[] = {"trust", "reject", "md5", "password", "ident", "radius",
 #ifdef ENABLE_GSS
 	"gss",
@@ -217,10 +224,13 @@ static char **filter_lines_with_token(char **lines, const char *token);
 #endif
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
-static void walkdir(char *path, void (*action) (char *fname, bool isdir));
-static void walktblspc_links(char *path, void (*action) (char *fname, bool isdir));
-static void pre_sync_fname(char *fname, bool isdir);
-static void fsync_fname(char *fname, bool isdir);
+static void walkdir(const char *path,
+	void (*action) (const char *fname, bool isdir),
+	bool process_symlinks);
+#ifdef PG_FLUSH_DATA_WORKS
+static void pre_sync_fname(const char *fname, bool isdir);
+#endif
+static void fsync_fname_ext(const char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
@@ -248,7 +258,7 @@ static void load_plpgsql(void);
 static void vacuum_db(void);
 static void make_template0(void);
 static void make_postgres(void);
-static void perform_fsync(void);
+static void fsync_pgdata(void);
 static void trapsig(int signum);
 static void check_ok(void);
 static char *escape_quotes(const char *src);
@@ -521,56 +531,58 @@ writefile(char *path, char **lines)
  * Adapted from copydir() in copydir.c.
  */
 static void
-walkdir(char *path, void (*action) (char *fname, bool isdir))
+walkdir(const char *path,
+		void (*action) (const char *fname, bool isdir),
+		bool process_symlinks)
 {
 	DIR		   *dir;
-	struct dirent *direntry;
-	char		subpath[MAXPGPATH];
+	struct dirent *de;
 
 	dir = opendir(path);
 	if (dir == NULL)
 	{
-		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
-progname, path, strerror(errno));
-		exit_nicely();
+		if (errno != EACCES)
+			fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
+	progname, path, strerror(errno));
+		return;
 	}
 
-	while (errno = 0, (direntry = readdir(dir)) != NULL)
+	while (errno = 0, (de = readdir(dir)) != NULL)
 	{
+		char		subpath[MAXPGPATH];
 		struct stat fst;
+		int			sret;
 
-		if (strcmp(direntry->d_name, ".") == 0 ||
-			strcmp(direntry->d_name, "..") == 0)
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
 			continue;
 
-		snprintf(subpath, MAXPGPATH, "%s/%s", path, direntry->d_name);
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (process_symlinks)
+			sret = stat(subpath, &fst);
+		else
+			sret = lstat(subpath, &fst);
 
-		if (lstat(subpath, &fst) < 0)
+		if (sret < 0)
 		{
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-	progname, subpath, strerror(errno));
-			exit_nicely();
+			if (errno != EACCES)
+fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
+		progname, subpath, strerror(errno));
+			continue;
 		}
 
-		if (S_ISDIR(fst.st_mode))
-			walkdir(subpath, action);
-		else if (S_ISREG(fst.st_mode))
+		if (S_ISREG(fst.st_mode))
 			(*action) (subpath, false);
+		else if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action, false);
 	}
 
 	if (errno)
-	{
 		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
 progname, path, strerror(errno));
-		exit_nicely();
-	}
 
-	if (closedir(dir))
-	{
-		fprintf(stderr, _("%s: could not close directory \"

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Tom Lane

Abhijit Menon-Sen  writes:
> Here's an updated patch for the fsync problem(s).

I've committed this after some mostly-cosmetic rearrangements.

> 4. As a partial aside, why does fsync_fname use OpenTransientFile? It
>looks like it should use BasicOpenFile as pre_sync_fname does. We
>close the fd immediately after calling fsync anyway. Or is there
>some reason I'm missing that we should use OpenTransientFile in
>pre_sync_fname too?

pre_sync_fname is the one that is wrong IMO.  Using BasicOpenFile just
creates an opportunity for problems; that function should get called
from as few places as possible.

> 5. I made walktblspc_entries use stat() instead of lstat(), so that we
>can handle symlinks and directories the same way. Note that we won't
>continue to follow links inside the sub-directories because we use
>walkdir instead of recursing.

Given that, walktblspc_entries was 99% duplicate, so I folded it into
walkdir with a process_symlinks boolean.

I have to leave shortly, so I'll look at the initdb cleanup tomorrow.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Robert Haas

On Thu, May 28, 2015 at 1:04 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen  
>> wrote:
>>> 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c?
>>> I wasn't sure if I should move that to fd.c as well. I think it's
>>> borderline OK for now.
>
>> I think if the function is specific as fsync_pgdata(), fd.c is not the
>> right place.  I'm not sure xlog.c is either, though.
>
> ISTM xlog.c is clearly the *wrong* place; if this behavior has anything
> to do with WAL logging as such, it's not apparent to me.  fd.c is not
> a great place perhaps, but in view of the fact that we have things like
> RemovePgTempFiles() in there, it's not unreasonable to see fsync_pgdata
> as something to put there as well (perhaps with a name more chosen to
> match fd.c names...)

OK, sure, makes sense.

> Since Robert appears to be busy worrying about the multixact issue
> reported by Steve Kehlet, I suggest he focus on that and I'll take
> care of getting this thing committed.  AFAICT we have consensus on
> what it should do and we're down to arguing about code style.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Abhijit Menon-Sen

At 2015-05-28 19:56:15 +0530, a...@2ndquadrant.com wrote:
>
> I have a separate patch to initdb with the corresponding changes, which
> I will post after dinner and a bit more testing.

Here's that patch too.

I was a bit unsure how far to go with this. It fixes the problem of not
following pg_xlog if it's a symlink (Andres) and the one where it died
on bogus stuff in pg_tblspc (Tom) and unreadable files anywhere else
(it now spews errors to stderr, but carries on for as long as it can).

I've done a bit more than strictly necessary to fix those problems,
though, and made the code largely similar to what's in the other patch.
If you want something minimal, let me know and I'll cut it down to size.

-- Abhijit

P.S. If this patch gets applied, then the "Adapted from walktblspc_links
in initdb.c" comment in fd.c should be changed: s/links/entries/.
>From 8e69433fae0b4f51879793f6594c76b99d764818 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Thu, 28 May 2015 22:02:29 +0530
Subject: Make initdb -S behave like xlog.c:fsync_pgdata()

In particular, it should silently skip unreadable/unexpected files in
the data directory and follow symlinks only for pg_xlog and under
pg_tblspc.
---
 src/bin/initdb/initdb.c | 235 +---
 1 file changed, 122 insertions(+), 113 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f1c4920..8ebaa55 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -218,9 +218,9 @@ static char **filter_lines_with_token(char **lines, const char *token);
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
 static void walkdir(char *path, void (*action) (char *fname, bool isdir));
-static void walktblspc_links(char *path, void (*action) (char *fname, bool isdir));
+static void walktblspc_entries(char *path, void (*action) (char *fname, bool isdir));
 static void pre_sync_fname(char *fname, bool isdir);
-static void fsync_fname(char *fname, bool isdir);
+static void fsync_fname_ext(char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
@@ -248,7 +248,7 @@ static void load_plpgsql(void);
 static void vacuum_db(void);
 static void make_template0(void);
 static void make_postgres(void);
-static void perform_fsync(void);
+static void fsync_pgdata(void);
 static void trapsig(int signum);
 static void check_ok(void);
 static char *escape_quotes(const char *src);
@@ -518,38 +518,38 @@ writefile(char *path, char **lines)
  * walkdir: recursively walk a directory, applying the action to each
  * regular file and directory (including the named directory itself).
  *
- * Adapted from copydir() in copydir.c.
+ * Adapted from walkdir() in backend/storage/file/fd.c.
  */
 static void
 walkdir(char *path, void (*action) (char *fname, bool isdir))
 {
 	DIR		   *dir;
-	struct dirent *direntry;
-	char		subpath[MAXPGPATH];
+	struct dirent *de;
 
 	dir = opendir(path);
 	if (dir == NULL)
 	{
 		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
 progname, path, strerror(errno));
-		exit_nicely();
+		return;
 	}
 
-	while (errno = 0, (direntry = readdir(dir)) != NULL)
+	while (errno = 0, (de = readdir(dir)) != NULL)
 	{
+		char		subpath[MAXPGPATH];
 		struct stat fst;
 
-		if (strcmp(direntry->d_name, ".") == 0 ||
-			strcmp(direntry->d_name, "..") == 0)
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
 			continue;
 
-		snprintf(subpath, MAXPGPATH, "%s/%s", path, direntry->d_name);
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
 
 		if (lstat(subpath, &fst) < 0)
 		{
 			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
 	progname, subpath, strerror(errno));
-			exit_nicely();
+			continue;
 		}
 
 		if (S_ISDIR(fst.st_mode))
@@ -559,75 +559,72 @@ walkdir(char *path, void (*action) (char *fname, bool isdir))
 	}
 
 	if (errno)
-	{
 		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
 progname, path, strerror(errno));
-		exit_nicely();
-	}
 
-	if (closedir(dir))
-	{
-		fprintf(stderr, _("%s: could not close directory \"%s\": %s\n"),
-progname, path, strerror(errno));
-		exit_nicely();
-	}
+	(void) closedir(dir);
 
-	/*
-	 * It's important to fsync the destination directory itself as individual
-	 * file fsyncs don't guarantee that the directory entry for the file is
-	 * synced.  Recent versions of ext4 have made the window much wider but
-	 * it's been an issue for ext3 and other filesystems in the past.
-	 */
 	(*action) (path, true);
 }
 
 /*
- * walktblspc_links: call walkdir on each entry under the given
- * pg_tblspc directory, or do nothing if pg_tblspc doesn't exist.
+ * walktblspc_entries -- apply the action to the entries in pb_tblspc
+ *
+ * We expect to find only symlinks to tablespace directories here, but
+ * we'll apply the action to regular files, and call walkdir() on any
+ * directorie

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Tom Lane

Robert Haas  writes:
> On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen  
> wrote:
>> 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c?
>> I wasn't sure if I should move that to fd.c as well. I think it's
>> borderline OK for now.

> I think if the function is specific as fsync_pgdata(), fd.c is not the
> right place.  I'm not sure xlog.c is either, though.

ISTM xlog.c is clearly the *wrong* place; if this behavior has anything
to do with WAL logging as such, it's not apparent to me.  fd.c is not
a great place perhaps, but in view of the fact that we have things like
RemovePgTempFiles() in there, it's not unreasonable to see fsync_pgdata
as something to put there as well (perhaps with a name more chosen to
match fd.c names...)

Since Robert appears to be busy worrying about the multixact issue
reported by Steve Kehlet, I suggest he focus on that and I'll take
care of getting this thing committed.  AFAICT we have consensus on
what it should do and we're down to arguing about code style.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Robert Haas

On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen  
wrote:
> 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c?
>I wasn't sure if I should move that to fd.c as well. I think it's
>borderline OK for now.

I think if the function is specific as fsync_pgdata(), fd.c is not the
right place.  I'm not sure xlog.c is either, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-28 Thread Abhijit Menon-Sen

Hi.

Here's an updated patch for the fsync problem(s).

A few points that may be worth considering:

1. I've made the ReadDir → ReadDirExtended change, but in retrospect I
   don't think it's really worth it. Using ReadDir and letting it die
   is probably fine. (Aside: I left ReadDirExtended static for now.)

2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c?
   I wasn't sure if I should move that to fd.c as well. I think it's
   borderline OK for now.

3. There's a comment in fsync_fname that we need to open directories
   with O_RDONLY and non-directories with O_RDWR. Does this also apply
   to pre_sync_fname (i.e. sync_file_range and posix_fadvise) on any
   platforms we care about? It doesn't seem so, but I'm not sure.

4. As a partial aside, why does fsync_fname use OpenTransientFile? It
   looks like it should use BasicOpenFile as pre_sync_fname does. We
   close the fd immediately after calling fsync anyway. Or is there
   some reason I'm missing that we should use OpenTransientFile in
   pre_sync_fname too?

   (Aside²: it's pointless to specify S_IRUSR|S_IWUSR in fsync_fname
   since we're not setting O_CREAT. Minor, so will submit separate
   patch.)

5. I made walktblspc_entries use stat() instead of lstat(), so that we
   can handle symlinks and directories the same way. Note that we won't
   continue to follow links inside the sub-directories because we use
   walkdir instead of recursing.

   But this depends on S_ISDIR() returning true after stat() on Windows
   with a junction. Does that work? And are the entries in pb_tblspc on
   Windows actually junctions?

I've tested this with unreadable files in PGDATA and junk inside
pb_tblspc. Feedback welcome.

I have a separate patch to initdb with the corresponding changes, which
I will post after dinner and a bit more testing.

-- Abhijit
>From 491dcb8c7203fd992dd9c441f5463a75c88e6f52 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Thu, 28 May 2015 17:22:18 +0530
Subject: Skip unreadable files in fsync_pgdata; follow only expected symlinks

---
 src/backend/access/transam/xlog.c |  40 +-
 src/backend/storage/file/fd.c | 262 ++
 src/include/storage/fd.h  |   6 +-
 3 files changed, 226 insertions(+), 82 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 087b6be..32447e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11605,20 +11605,53 @@ SetWalWriterSleeping(bool sleeping)
 
 /*
  * Issue fsync recursively on PGDATA and all its contents.
+ *
+ * We fsync regular files and directories wherever they are, but we
+ * follow symlinks only for pg_xlog and under pg_tblspc.
  */
 static void
 fsync_pgdata(char *datadir)
 {
+	bool		xlog_is_symlink;
+	char		pg_xlog[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
+	struct stat st;
+
 	if (!enableFsync)
 		return;
 
+	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", datadir);
+	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", datadir);
+
+	/*
+	 * If pg_xlog is a symlink, then we need to recurse into it separately,
+	 * because the first walkdir below will ignore it.
+	 */
+	xlog_is_symlink = false;
+
+	if (lstat(pg_xlog, &st) < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+errmsg("could not stat directory \"pg_xlog\": %m")));
+
+#ifndef WIN32
+	if (S_ISLNK(st.st_mode))
+		xlog_is_symlink = true;
+#else
+	if (pgwin32_is_junction(subpath))
+		xlog_is_symlink = true;
+#endif
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.
 	 */
 #if defined(HAVE_SYNC_FILE_RANGE) || \
 	(defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED))
-	walkdir(datadir, pre_sync_fname);
+	walkdir(DEBUG1, datadir, pre_sync_fname);
+	if (xlog_is_symlink)
+		walkdir(DEBUG1, pg_xlog, pre_sync_fname);
+	walktblspc_entries(DEBUG1, pg_tblspc, pre_sync_fname);
 #endif
 
 	/*
@@ -11628,5 +11661,8 @@ fsync_pgdata(char *datadir)
 	 * file fsyncs don't guarantee that the directory entry for the file is
 	 * synced.
 	 */
-	walkdir(datadir, fsync_fname);
+	walkdir(LOG, datadir, fsync_fname_ext);
+	if (xlog_is_symlink)
+		walkdir(LOG, pg_xlog, fsync_fname_ext);
+	walktblspc_entries(LOG, pg_tblspc, fsync_fname_ext);
 }
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 68d43c6..916f8b5 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1921,6 +1921,37 @@ TryAgain:
 }
 
 /*
+ * Read a directory opened with AllocateDir and return the next dirent.
+ * On error, ereports at a caller-specified level and returns NULL if
+ * the level is not ERROR or more severe.
+ */
+static struct dirent *
+ReadDirExtended(int elevel, DIR *dir, const char *dirname)
+{
+	struct dirent *dent;
+
+	/* Give a generic message for AllocateDir failure, if caller didn't */
+	if (dir == NULL)
+	{
+		ereport(elevel,
+(errcode_for_file_access(),
+ errmsg("could not open d

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-27 Thread Abhijit Menon-Sen

At 2015-05-27 23:41:29 -0400, t...@sss.pgh.pa.us wrote:
>
> What about turning ReadDir into a wrapper around a ReadDirExtended
> function that adds an elevel argument?

Sounds reasonable, doing.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-27 Thread Tom Lane

Abhijit Menon-Sen  writes:
> At 2015-05-27 20:22:18 -0400, t...@sss.pgh.pa.us wrote:
>> I doubt that that (not using AllocateDir) [â¦]

> (Disregarding per your followup.)

>> What I think is a reasonable compromise is to treat AllocateDir
>> failure as nonfatal, but to continue using ReadDir etc which means
>> that any post-open failure in reading a successfully-opened directory
>> would be fatal.

> OK, that's halfway between the two patches I posted. Will do.

Thought you were disregarding?

Anyway, I had an idea about reducing the ugliness.  What you basically
did is copy-and-paste ReadDir so you could change the elevel.  What
about turning ReadDir into a wrapper around a ReadDirExtended function
that adds an elevel argument?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-27 Thread Abhijit Menon-Sen

At 2015-05-27 20:22:18 -0400, t...@sss.pgh.pa.us wrote:
>
> I doubt that that (not using AllocateDir) […]

(Disregarding per your followup.)

> What I think is a reasonable compromise is to treat AllocateDir
> failure as nonfatal, but to continue using ReadDir etc which means
> that any post-open failure in reading a successfully-opened directory
> would be fatal.

OK, that's halfway between the two patches I posted. Will do.

> Meanwhile, it seems like you have copied a couple of very dubious
> decisions in initdb's walktblspc_links(): […]
>
> Now, we don't really have to do anything about these things in order
> to fix the immediate problem, but I wonder if we shouldn't try to
> clean up initdb's behavior while we're at it.

OK, I'll fix that in both.

> Independently of that, I thought the plan was to complain about any
> problems at LOG message level, not DEBUG1, and definitely not WARNING
> (which is lower than LOG for this purpose). 

I'll change the level to LOG for fsync_fname_ext, but I think DEBUG1 is
more appropriate for the pre_sync_fname run. Or do you think I should
use LOG for that too?

> And why would you use a different message level for pg_xlog/ than for
> other files anyway?

That was just a mistake, I forgot to change it after copying.

Thanks for having a look. I'll post a revised patch shortly.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-27 Thread Tom Lane

I wrote:
> Abhijit Menon-Sen  writes:
>> At 2015-05-27 11:46:39 +0530, a...@2ndquadrant.com wrote:
>>> I'm trying a couple of approaches to that (e.g. using readdir directly
>>> instead of ReadDir), but other suggestions are welcome.

>> Here's what that looks like, but not yet fully tested.

> I doubt that that (not using AllocateDir) is a good idea in the backend,

Oh, scratch that.  Reading the patch closer, I see you kept the use of
AllocateDir and only replaced ReadDir.  That's kind of ugly (and certainly
requires a comment about the inconsistency) but it's probably ok from an
error handling standpoint.  There are hard failure cases in AllocateDir,
but they seem unlikely to create problems in practice.

The other points stand though ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-27 Thread Tom Lane

Abhijit Menon-Sen  writes:
> At 2015-05-27 11:46:39 +0530, a...@2ndquadrant.com wrote:
>> I'm trying a couple of approaches to that (e.g. using readdir directly
>> instead of ReadDir), but other suggestions are welcome.

> Here's what that looks like, but not yet fully tested.

I doubt that that (not using AllocateDir) is a good idea in the backend,
mainly because you are going to leak directory references if anything
called by walkdir() throws elog(ERROR).  While that doesn't matter much
for the immediate usage, since we'd throw up our hands and die anyway,
it completely puts the kibosh on any idea that walkdir() is a
general-purpose subroutine.  It would have to be decorated with big red
warnings NEVER USE THIS FUNCTION.  Ick.

What I think is a reasonable compromise is to treat AllocateDir failure as
nonfatal, but to continue using ReadDir etc which means that any post-open
failure in reading a successfully-opened directory would be fatal.  Such
errors would suggest that something is badly wrong in the filesystem, so
I do not feel too terrible about not covering that case.

Meanwhile, it seems like you have copied a couple of very dubious
decisions in initdb's walktblspc_links():

1. It doesn't say a word if the passed-in path (which in practice
is always $PGDATA/pg_tblspc) doesn't exist.

2. It assumes that every entry within pg_tblspc must be a tablespace
directory (or symlink to one), and fails if any are plain files.

At the very least, these behaviors seem inconsistent.  #2 is making
strong assumptions about pg_tblspc being correctly set up and free
of user-added cruft, while #1 doesn't even assume it's there at all.

Now, we don't really have to do anything about these things in order
to fix the immediate problem, but I wonder if we shouldn't try to
clean up initdb's behavior while we're at it.

Independently of that, I thought the plan was to complain about any
problems at LOG message level, not DEBUG1, and definitely not WARNING
(which is lower than LOG for this purpose).  And why would you use a
different message level for pg_xlog/ than for other files anyway?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Abhijit Menon-Sen

At 2015-05-27 11:46:39 +0530, a...@2ndquadrant.com wrote:
>
> I'm trying a couple of approaches to that (e.g. using readdir directly
> instead of ReadDir), but other suggestions are welcome.

Here's what that looks like, but not yet fully tested.

-- Abhijit
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 087b6be..6956e88 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11609,16 +11609,41 @@ SetWalWriterSleeping(bool sleeping)
 static void
 fsync_pgdata(char *datadir)
 {
+	bool		xlog_is_symlink;
+	char		pg_xlog[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
+	struct stat st;
+
 	if (!enableFsync)
 		return;
 
+	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", datadir);
+	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", datadir);
+
+	if (lstat(pg_xlog, &st) < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+errmsg("could not stat directory \"pg_xlog\": %m")));
+
+	xlog_is_symlink = false;
+#ifndef WIN32
+	if (S_ISLNK(st.st_mode))
+		xlog_is_symlink = true;
+#else
+	if (pgwin32_is_junction(subpath))
+		xlog_is_symlink = true;
+#endif
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.
 	 */
 #if defined(HAVE_SYNC_FILE_RANGE) || \
 	(defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED))
-	walkdir(datadir, pre_sync_fname);
+	walkdir(DEBUG1, datadir, pre_sync_fname);
+	if (xlog_is_symlink)
+		walkdir(DEBUG1, pg_xlog, pre_sync_fname);
+	walktblspc_links(DEBUG1, pg_tblspc, pre_sync_fname);
 #endif
 
 	/*
@@ -11628,5 +11653,8 @@ fsync_pgdata(char *datadir)
 	 * file fsyncs don't guarantee that the directory entry for the file is
 	 * synced.
 	 */
-	walkdir(datadir, fsync_fname);
+	walkdir(WARNING, datadir, fsync_fname_ext);
+	if (xlog_is_symlink)
+		walkdir(DEBUG1, pg_xlog, fsync_fname_ext);
+	walktblspc_links(WARNING, pg_tblspc, fsync_fname_ext);
 }
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 68d43c6..711b4b7 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2443,48 +2443,113 @@ looks_like_temp_rel_name(const char *name)
 /*
  * Hint to the OS that it should get ready to fsync() this file.
  *
- * Adapted from pre_sync_fname in initdb.c
+ * Ignores errors trying to open unreadable files, and logs other errors at a
+ * caller-specified level.
  */
 void
-pre_sync_fname(char *fname, bool isdir)
+pre_sync_fname(int elevel, char *fname, bool isdir)
 {
 	int			fd;
 
 	fd = BasicOpenFile(fname, O_RDONLY | PG_BINARY, 0);
 
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+
+#ifdef ETXTBSY
+		if (errno == ETXTBSY)
+			return;
+#endif
+
+		ereport(elevel,
+(errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", fname)));
+	}
+
+	(void) pg_flush_data(fd, 0, 0);
+	(void) close(fd);
+}
+
+/*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Ignores errors trying to open unreadable files, or trying to fsync
+ * directories on systems where that isn't allowed/required, and logs other
+ * errors at a caller-specified level.
+ */
+void
+fsync_fname_ext(int elevel, char *fname, bool isdir)
+{
+	int			fd;
+	int			returncode;
+
 	/*
-	 * Some OSs don't allow us to open directories at all (Windows returns
-	 * EACCES)
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here
 	 */
-	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
-		return;
+	if (!isdir)
+		fd = OpenTransientFile(fname,
+			   O_RDWR | PG_BINARY,
+			   S_IRUSR | S_IWUSR);
+	else
+		fd = OpenTransientFile(fname,
+			   O_RDONLY | PG_BINARY,
+			   S_IRUSR | S_IWUSR);
 
 	if (fd < 0)
-		ereport(FATAL,
-(errmsg("could not open file \"%s\": %m",
-		fname)));
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
 
-	pg_flush_data(fd, 0, 0);
+#ifdef ETXTBSY
+		if (errno == ETXTBSY)
+			return;
+#endif
 
-	close(fd);
+		ereport(elevel,
+(errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", fname)));
+	}
+
+	returncode = pg_fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be logged.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+		ereport(elevel,
+(errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m", fname)));
+
+	CloseTransientFile(fd);
 }
 
 /*
  * walkdir: recursively walk a directory, applying the action to each
- * regular file and directory (including the named directory itself)
- * and following symbolic links.
+ * regular file and directory (including the named directory itself).
  *
- * NB: There is another version of walkdir in initdb.c, but that version
- * behaves differently with respect to symbolic links.  Caveat emptor!
+ * Adapted from walkdir in initdb.c
  */
 voi

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Abhijit Menon-Sen

At 2015-05-26 22:44:03 +0200, and...@anarazel.de wrote:
>
> So what I propose is:
> 1) Remove the automatic symlink following
> 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in
>initdb -S
> 3) Add a elevel argument to walkdir(), return if AllocateDir() fails,
>continue for stat() failures in the readdir() loop.
> 4) Add elevel argument to pre_sync_fname, fsync_fname, return after
>errors.
> 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By
>virtue of not following symlinks we should not need to worry about
>EROFS

Here's a WIP patch for discussion.

I've (a) removed the S_ISLNK() branch in walkdir, (b) reintroduced
walktblspc_links to call walkdir on each of the entries within pg_tblspc
(simpler than trying to make walkdir follow links only for pg_xlog and
under pg_tblspc), (c) call walkdir on pg_xlog if it's a symlink (not
done for initdb -S; will submit separately), (d) add elevel arguments as
described, (e) ignore EACCES and ETXTBSY.

This correctly fsync()s stuff according to strace, and doesn't die if
there are unreadable files/links in PGDATA.

What I haven't done is return if AllocateDir() fails. I'm not convinced
that's correct, because it'll not complain if PGDATA is unreadable (but
this will break other things, so it doesn't matter), but also will die
if readdir fails rather than opendir.

I'm trying a couple of approaches to that (e.g. using readdir directly
instead of ReadDir), but other suggestions are welcome.

-- Abhijit
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 087b6be..6956e88 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11609,16 +11609,41 @@ SetWalWriterSleeping(bool sleeping)
 static void
 fsync_pgdata(char *datadir)
 {
+	bool		xlog_is_symlink;
+	char		pg_xlog[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
+	struct stat st;
+
 	if (!enableFsync)
 		return;
 
+	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", datadir);
+	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", datadir);
+
+	if (lstat(pg_xlog, &st) < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+errmsg("could not stat directory \"pg_xlog\": %m")));
+
+	xlog_is_symlink = false;
+#ifndef WIN32
+	if (S_ISLNK(st.st_mode))
+		xlog_is_symlink = true;
+#else
+	if (pgwin32_is_junction(subpath))
+		xlog_is_symlink = true;
+#endif
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.
 	 */
 #if defined(HAVE_SYNC_FILE_RANGE) || \
 	(defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED))
-	walkdir(datadir, pre_sync_fname);
+	walkdir(DEBUG1, datadir, pre_sync_fname);
+	if (xlog_is_symlink)
+		walkdir(DEBUG1, pg_xlog, pre_sync_fname);
+	walktblspc_links(DEBUG1, pg_tblspc, pre_sync_fname);
 #endif
 
 	/*
@@ -11628,5 +11653,8 @@ fsync_pgdata(char *datadir)
 	 * file fsyncs don't guarantee that the directory entry for the file is
 	 * synced.
 	 */
-	walkdir(datadir, fsync_fname);
+	walkdir(WARNING, datadir, fsync_fname_ext);
+	if (xlog_is_symlink)
+		walkdir(DEBUG1, pg_xlog, fsync_fname_ext);
+	walktblspc_links(WARNING, pg_tblspc, fsync_fname_ext);
 }
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 68d43c6..c855004 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2443,42 +2443,99 @@ looks_like_temp_rel_name(const char *name)
 /*
  * Hint to the OS that it should get ready to fsync() this file.
  *
- * Adapted from pre_sync_fname in initdb.c
+ * Ignores errors trying to open unreadable files, and logs other errors at a
+ * caller-specified level.
  */
 void
-pre_sync_fname(char *fname, bool isdir)
+pre_sync_fname(int elevel, char *fname, bool isdir)
 {
 	int			fd;
 
 	fd = BasicOpenFile(fname, O_RDONLY | PG_BINARY, 0);
 
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+
+#ifdef ETXTBSY
+		if (errno == ETXTBSY)
+			return;
+#endif
+
+		ereport(elevel,
+(errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", fname)));
+	}
+
+	(void) pg_flush_data(fd, 0, 0);
+	(void) close(fd);
+}
+
+/*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Ignores errors trying to open unreadable files, or trying to fsync
+ * directories on systems where that isn't allowed/required, and logs other
+ * errors at a caller-specified level.
+ */
+void
+fsync_fname_ext(int elevel, char *fname, bool isdir)
+{
+	int			fd;
+	int			returncode;
+
 	/*
-	 * Some OSs don't allow us to open directories at all (Windows returns
-	 * EACCES)
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here
 	 */
-	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
-		return;
+	if (!isdir)
+		fd = OpenTransientFile(fname,
+			   O_RDWR | PG_BINARY,
+			   S_IRUSR | S_IWUSR);
+	else
+		fd = OpenTransientFile(fname,
+

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Abhijit Menon-Sen

At 2015-05-26 22:44:03 +0200, and...@anarazel.de wrote:
>
> So, this was discussed in the following thread, starting at:
> http://archives.postgresql.org/message-id/20150403163232.GA28444%40eldon.alvh.no-ip.org

Sorry, I didn't see this before replying.

> There are no other places we it's "allowed" to introduce symlinks and
> we have refuted bugreports of people having problems after doing that.

OK.

> So what I propose is:
> 1) Remove the automatic symlink following
> 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in
>initdb -S
> 3) Add a elevel argument to walkdir(), return if AllocateDir() fails,
>continue for stat() failures in the readdir() loop.
> 4) Add elevel argument to pre_sync_fname, fsync_fname, return after
>errors.
> 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By
>virtue of not following symlinks we should not need to worry about
>EROFS

Makes sense. I've got most of that done, I'll just remove the symlink
following (or rather, restrict it to the cases mentioned), test a bit,
and post.

> I'm inclined to think that 4) is a big enough compat break that a
> fsync_fname_ext with the new argument is a good idea.

Agreed.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Abhijit Menon-Sen

At 2015-05-26 19:07:20 +0200, and...@anarazel.de wrote:
>
> Abhijit, do you recall why the code was changed to follow all symlinks
> in contrast to explicitly going through the tablespaces as initdb -S
> does? I'm pretty sure early versions of the patch pretty much had a
> verbatim copy of the initdb logic?

Yes, earlier versions of the patch did follow symlinks only in
pg_tblspc. I changed this because Álvaro pointed out that this
was likely to miss important stuff, e.g. in

20150403163232.ga28...@eldon.alvh.no-ip.org,
20150406131636.gd4...@alvh.no-ip.org

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Andres Freund

On 2015-05-26 19:07:20 +0200, Andres Freund wrote:
> It is somewhat interesting that similar code has been used in
> pg_upgrade, via initdb -S, for a while now, without, to my knowledge, it
> causing reported problem. I think the relevant difference is that that
> code doesn't follow symlinks.  It's obviously also less exercised and
> poeople might just have fixed up permissions when encountering troubles.
> 
> Abhijit, do you recall why the code was changed to follow all symlinks
> in contrast to explicitly going through the tablespaces as initdb -S
> does? I'm pretty sure early versions of the patch pretty much had a
> verbatim copy of the initdb logic?  That logic is missing pg_xlog btw,
> which is bad for pg_upgrade.

So, this was discussed in the following thread, starting at:
http://archives.postgresql.org/message-id/20150403163232.GA28444%40eldon.alvh.no-ip.org

"Actually, since surely we must follow symlinks everywhere, why do we
have to do this separately for pg_tblspc?  Shouldn't that link-following
occur automatically when walking PGDATA in the first place?"

I don't think it's true that we must follow symlinks everywhere. I
think, as argued upthread, that it's sufficient to recurse through
PGDATA, follow the symlinks in pg_tbspc, and if a symlink, also go
through pg_xlog separately.  There are no other places we it's "allowed"
to introduce symlinks and we have refuted bugreports of people having
problems after doing that.

So what I propose is:
1) Remove the automatic symlink following
2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in
   initdb -S
3) Add a elevel argument to walkdir(), return if AllocateDir() fails,
   continue for stat() failures in the readdir() loop.
4) Add elevel argument to pre_sync_fname, fsync_fname, return after
   errors.
5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By
   virtue of not following symlinks we should not need to worry about
   EROFS

I'm inclined to think that 4) is a big enough compat break that a
fsync_fname_ext with the new argument is a good idea.

Arguments for/against?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Andres Freund

On 2015-05-26 19:07:20 +0200, Andres Freund wrote:
> Abhijit, do you recall why the code was changed to follow all symlinks
> in contrast to explicitly going through the tablespaces as initdb -S
> does? I'm pretty sure early versions of the patch pretty much had a
> verbatim copy of the initdb logic?

Forgot to add Abhijit to CC list, sorry.

> That [initdb's] logic is missing pg_xlog btw, which is bad for pg_upgrade.

On second thought it's probably not actually bad for pg_upgrade's
specific usecase. It'll always end with a proper shutdown, so it'll
never need that WAL. But it'd be bad if anybody ever relied on initdb -S
in a different scenario.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Andres Freund

On 2015-05-26 10:41:12 -0400, Tom Lane wrote:
> Yeah.  Perhaps I missed it, but was the original patch motivated by
> actual failures that had been seen in the field, or was it just a
> hypothetical concern?

I'd mentioned that it might be a good idea to do this while
investingating a bug with unlogged tables that several parties had
reported. That patch has been fixed in a more granular fashion. From
there it took on kind of a life on its own.

It is somewhat interesting that similar code has been used in
pg_upgrade, via initdb -S, for a while now, without, to my knowledge, it
causing reported problem. I think the relevant difference is that that
code doesn't follow symlinks.  It's obviously also less exercised and
poeople might just have fixed up permissions when encountering troubles.

Abhijit, do you recall why the code was changed to follow all symlinks
in contrast to explicitly going through the tablespaces as initdb -S
does? I'm pretty sure early versions of the patch pretty much had a
verbatim copy of the initdb logic?  That logic is missing pg_xlog btw,
which is bad for pg_upgrade.

I *can* reproduce corrupted clusters without this without trying too
hard. I yesterday wrote a test for the crash testing infrastructure I've
on and off worked on (since 2011. Uhm) and I could reproduce a bunch of
corrupted clusters without this patch.  When randomly triggering crash
restarts shortly afterwards follwed by a simulated hw restart (stop
accepting writes via linux device mapper) while concurrent COPYs are
running, I can trigger a bunch of data corruptions.

Since then my computer in berlin has done 440 testruns with the patch,
and 450 without. I've gotten 7 errors without, 0 with. But the
instrumentation for detecting errors is really shitty (pkey lookup for
every expected row) and doesn't yet keep meaningful diagnosistics. So
this isn't particularly bulletproof either way.

I can't tell whether the patch is just masking yet another problem due
to different timings caused by the fsync, or whether it's fixing the
problem that we can forget to sync WAL segments.

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Tom Lane

Andrew Dunstan  writes:
> On 05/26/2015 11:58 AM, Tom Lane wrote:
>> One thing perhaps we *should* be selective about, though, is which
>> symlinks we try to follow.  I think that a good case could be made
>> for ignoring symlinks everywhere except in the pg_tablespace directory.
>> If we did, that would all by itself take care of the Debian scenario,
>> if I understand that case correctly.

> People have symlinked the xlog directory.

Good point, we need to cover that case.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Magnus Hagander

On Tue, May 26, 2015 at 6:16 PM, Andrew Dunstan  wrote:

>
> On 05/26/2015 11:58 AM, Tom Lane wrote:
>
>> Andrew Dunstan  writes:
>>
>>> OK, I'm late to the party. But why exactly are we syncing absolutely
>>> everything? That seems over-broad.
>>>
>> If we try to be selective, we risk errors of omission, which no one would
>> ever notice until someone's data got eaten in a low-probability crash
>> scenario.  It seems more robust (at least to me) to fsync everything we
>> can find.  That does require more thought about error cases than went
>> into the original patch ... but I think that we need more thought about
>> error cases even if we do try to be selective.
>>
>> One thing perhaps we *should* be selective about, though, is which
>> symlinks we try to follow.  I think that a good case could be made
>> for ignoring symlinks everywhere except in the pg_tablespace directory.
>> If we did, that would all by itself take care of the Debian scenario,
>> if I understand that case correctly.
>>
>
> People have symlinked the xlog directory. I've done it myself in the past.
> A better rule might be to ignore symlinks unless either they are in
> pg_tblspc or they are in the data directory and their name starts with
> "pg_".
>

Not just "people". initdb will symlink the xlog directory if you use -x.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Andrew Dunstan



On 05/26/2015 11:58 AM, Tom Lane wrote:

Andrew Dunstan  writes:

OK, I'm late to the party. But why exactly are we syncing absolutely
everything? That seems over-broad.

If we try to be selective, we risk errors of omission, which no one would
ever notice until someone's data got eaten in a low-probability crash
scenario.  It seems more robust (at least to me) to fsync everything we
can find.  That does require more thought about error cases than went
into the original patch ... but I think that we need more thought about
error cases even if we do try to be selective.

One thing perhaps we *should* be selective about, though, is which
symlinks we try to follow.  I think that a good case could be made
for ignoring symlinks everywhere except in the pg_tablespace directory.
If we did, that would all by itself take care of the Debian scenario,
if I understand that case correctly.


People have symlinked the xlog directory. I've done it myself in the 
past. A better rule might be to ignore symlinks unless either they are 
in pg_tblspc or they are in the data directory and their name starts 
with "pg_".


cheers

andrew




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Tom Lane

Andrew Dunstan  writes:
> OK, I'm late to the party. But why exactly are we syncing absolutely 
> everything? That seems over-broad.

If we try to be selective, we risk errors of omission, which no one would
ever notice until someone's data got eaten in a low-probability crash
scenario.  It seems more robust (at least to me) to fsync everything we
can find.  That does require more thought about error cases than went
into the original patch ... but I think that we need more thought about
error cases even if we do try to be selective.

One thing perhaps we *should* be selective about, though, is which
symlinks we try to follow.  I think that a good case could be made
for ignoring symlinks everywhere except in the pg_tablespace directory.
If we did, that would all by itself take care of the Debian scenario,
if I understand that case correctly.

> And might it be better to check that we can open each file using 
> access() than calling open() and looking at the error code?

Don't really see the point; that's just an extra step, and access()
won't exactly prove you can open the file, anyway.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Andrew Dunstan



On 05/26/2015 08:05 AM, Robert Haas wrote:

On Mon, May 25, 2015 at 9:54 PM, Stephen Frost  wrote:

I certainly see your point, but Tom also pointed out that it's not great
to ignore failures during this phase:

* Tom Lane (t...@sss.pgh.pa.us) wrote:

Greg Stark  writes:

What exactly is failing?
Is it that fsync is returning -1 ?

According to the original report from Christoph Berg, it was open()
not fsync() that was failing, at least in permissions-based cases.

I'm not sure if we should just uniformly ignore all failures in this
phase.  That would have the merit of clearly not creating any new
startup failure cases compared to the previous code, but as you say
sometimes it might mean ignoring real problems.

If we accept this, then we still have to have the lists, to decide what
to fail on and what to ignore.  If we're going to have said lists tho, I
don't really see the point in fsync'ing things we're pretty confident
aren't ours.

No, that's not right at all.  The idea, at least as I understand it,
would be decide which errors to ignore by looking at the error code,
not by looking at which file is involved.




OK, I'm late to the party. But why exactly are we syncing absolutely 
everything? That seems over-broad.


And might it be better to check that we can open each file using 
access() than calling open() and looking at the error code?


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Tom Lane

Robert Haas  writes:
> Anything we do short of making all errors in this area non-fatal is
> going to leave behind startup-failure cases that exist today, and we
> have no evidence at this time that such startup failures would be
> justified by any actual data loss risk.

Yeah.  Perhaps I missed it, but was the original patch motivated by
actual failures that had been seen in the field, or was it just a
hypothetical concern?  Certainly, any actual failures of that sort
are few and far between compared to the number of problems we now
realize the patch introduced.

Also, we need to discuss how hard walkdir() needs to try to avoid
throwing errors of its own.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Robert Haas

On Mon, May 25, 2015 at 9:54 PM, Andres Freund  wrote:
> On 2015-05-25 21:33:03 -0400, Robert Haas wrote:
>> On Mon, May 25, 2015 at 2:20 PM, Tom Lane  wrote:
>> > Perhaps, but if we didn't have permission to write the file, it's hard to
>> > argue that it's our responsibility to fsync it.  So this seems like it's
>> > adding complexity without really adding any safety.
>>
>> I agree.  I think ignoring fsync failures is a very sensible approach.
>> If the files are not writable, they're probably not ours.
>
> The reason we started discussing this is because Tom had the - quite
> reasonable - concern that this might not solely be a problem of EACCESS,
> but that there could be other errors that we need to ignore to not fail
> spuriously.  Say a symlink goes to a binary, which is currently being
> executed: ETXTBSY. Or the file is in a readonly filesystem: EROFS.  So
> we'd need to ignore a lot of errors, possibly ignoring valid ones.

But ignoring those errors wouldn't compromise data integrity, either.
So let's just ignore (but log) all errors: then we'll be demonstrably
no worse off than we were before this patch went in.  If it turns out
that there's some particular case where ignoring errors DOES
compromise data integrity, then we can plug that hole surgically when
somebody reports it.

Anything we do short of making all errors in this area non-fatal is
going to leave behind startup-failure cases that exist today, and we
have no evidence at this time that such startup failures would be
justified by any actual data loss risk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Robert Haas

On Mon, May 25, 2015 at 9:54 PM, Stephen Frost  wrote:
> I certainly see your point, but Tom also pointed out that it's not great
> to ignore failures during this phase:
>
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Greg Stark  writes:
>> > What exactly is failing?
>> > Is it that fsync is returning -1 ?
>> According to the original report from Christoph Berg, it was open()
>> not fsync() that was failing, at least in permissions-based cases.
>>
>> I'm not sure if we should just uniformly ignore all failures in this
>> phase.  That would have the merit of clearly not creating any new
>> startup failure cases compared to the previous code, but as you say
>> sometimes it might mean ignoring real problems.
>
> If we accept this, then we still have to have the lists, to decide what
> to fail on and what to ignore.  If we're going to have said lists tho, I
> don't really see the point in fsync'ing things we're pretty confident
> aren't ours.

No, that's not right at all.  The idea, at least as I understand it,
would be decide which errors to ignore by looking at the error code,
not by looking at which file is involved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-26 Thread Stephen Frost

* Abhijit Menon-Sen (a...@2ndquadrant.com) wrote:
> At 2015-05-26 03:54:51 +0200, and...@anarazel.de wrote:
> > Another thing is whether we should handle a recursive symlink in
> > pgdata? I personally think not, but...
> 
> I think not too.

Yikes..  That's definitely the kind of thing that's why I worry about
the whole 'fsync everything' idea- what if I symlink to /?  I've
certainly done that before from my home directory for ease of use and I
imagine there are people out there who have similar setups where they
sftp as the PG user and use the symlink to more easily navigate
somewhere else.  We have to realize that, on at least some systems,
PGDATA could be the postgres user's home directory too.  That's not the
case on Debian-based systems today, but I think it might have been
before Debian had the multi-cluster tooling.

> > It's also not just as simple as making fsync_fname fail gracefully
> > upon EACCESS - the opendir() could fail just as well.
> 
> I'll post a proposed patch shortly.

Thanks!

Stephen

signature.asc
Description: Digital signature

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Abhijit Menon-Sen

At 2015-05-26 03:54:51 +0200, and...@anarazel.de wrote:
>
> Say a symlink goes to a binary, which is currently being executed:
> ETXTBSY. Or the file is in a readonly filesystem: EROFS.  So we'd
> need to ignore a lot of errors, possibly ignoring valid ones.

Right. That's why I started out by being conservative and following only
the "expected" symlinks in pg_tblspc (as other parts of the code do).

> Another thing is whether we should handle a recursive symlink in
> pgdata? I personally think not, but...

I think not too.

> It's also not just as simple as making fsync_fname fail gracefully
> upon EACCESS - the opendir() could fail just as well.

I'll post a proposed patch shortly.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Andres Freund

On 2015-05-25 21:33:03 -0400, Robert Haas wrote:
> On Mon, May 25, 2015 at 2:20 PM, Tom Lane  wrote:
> > Perhaps, but if we didn't have permission to write the file, it's hard to
> > argue that it's our responsibility to fsync it.  So this seems like it's
> > adding complexity without really adding any safety.
> 
> I agree.  I think ignoring fsync failures is a very sensible approach.
> If the files are not writable, they're probably not ours.

The reason we started discussing this is because Tom had the - quite
reasonable - concern that this might not solely be a problem of EACCESS,
but that there could be other errors that we need to ignore to not fail
spuriously.  Say a symlink goes to a binary, which is currently being
executed: ETXTBSY. Or the file is in a readonly filesystem: EROFS.  So
we'd need to ignore a lot of errors, possibly ignoring valid ones.

I personally can see why people will put things in PGDATA itself, if you
put unreadable stuff in some subdirectory that you didn't create
yourself, I see much less reason to tolerate that.

Another thing is whether we should handle a recursive symlink in pgdata?
I personally think not, but...

It's also not just as simple as making fsync_fname fail gracefully upon
EACCESS - the opendir() could fail just as well.

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Stephen Frost

Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, May 25, 2015 at 2:20 PM, Tom Lane  wrote:
>> Andres Freund  writes:
>>> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
 Not really sure I see that as helping.
>>> On most OSs, except windows and some obscure unixes, a readonly fd is
>>> allowed to fsync a file.
>> Perhaps, but if we didn't have permission to write the file, it's hard to
>> argue that it's our responsibility to fsync it.  So this seems like it's
>> adding complexity without really adding any safety.
>
> I agree.  I think ignoring fsync failures is a very sensible approach.
> If the files are not writable, they're probably not ours.  If they are
> not writable but somehow still ours, we probably can't have written
> them before the crash, either.  If they are ours and we somehow wrote
> to them before the crash, and then while the system was down they were
> made inaccessible, and then the database was restarted, then we're
> well into the territory where the system administrator has done
> something that we cannot possibly be expected to cope with ... but
> ignoring the fsync isn't very likely to cause any real problems even
> here.  If we really did modify those blocks recently, recovery will
> try to redo the changes, and we'll fail then anyway.  So what's the
> problem?
>
> I agree with Tom's concern that if we have two lists of directories,
> they may get out of sync.  We could probably merge the two lists
> somehow, but I'm not really seeing the point, since Tom's blanket
> approach should work just fine.

I certainly see your point, but Tom also pointed out that it's not great
to ignore failures during this phase:

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Greg Stark  writes:
> > What exactly is failing?
> > Is it that fsync is returning -1 ?
> According to the original report from Christoph Berg, it was open()
> not fsync() that was failing, at least in permissions-based cases.
>
> I'm not sure if we should just uniformly ignore all failures in this
> phase.  That would have the merit of clearly not creating any new
> startup failure cases compared to the previous code, but as you say
> sometimes it might mean ignoring real problems.

If we accept this, then we still have to have the lists, to decide what
to fail on and what to ignore.  If we're going to have said lists tho, I
don't really see the point in fsync'ing things we're pretty confident
aren't ours.

Further, in any of these cases, we have to decide which failure cases
are ones that are "fatal" and which are not- being in the list or not
isn't the only criteria, it's just one part of the overall decision.  We
also need to consider what return value we get back for which system
calls, all of which may entirely be system dependent, meanly we may have
to deal with portability issues here too.

Then there are other interesting considerations like what happens with
an NFS mount (as Greg mentioned), or perhaps what happens when it's a
MAC violation (eg: SELinux).  Generally speaking, those will also return
an error code which we can contemplate, but it'll still create annoying
log noise for people running in such environments.  Perhaps that would
encourage them to move whatever files they have out of $PGDATA, which is
likely to be a good decision, but that may not always be possible..

Thanks!

Stephen

signature.asc
Description: Digital signature

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Robert Haas

On Mon, May 25, 2015 at 2:20 PM, Tom Lane  wrote:
> Andres Freund  writes:
>> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
 Additionally we could attempt to fsync with a readonly fd before trying
 the read-write fd...
>
>>> Not really sure I see that as helping.
>
>> On most OSs, except windows and some obscure unixes, a readonly fd is
>> allowed to fsync a file.
>
> Perhaps, but if we didn't have permission to write the file, it's hard to
> argue that it's our responsibility to fsync it.  So this seems like it's
> adding complexity without really adding any safety.

I agree.  I think ignoring fsync failures is a very sensible approach.
If the files are not writable, they're probably not ours.  If they are
not writable but somehow still ours, we probably can't have written
them before the crash, either.  If they are ours and we somehow wrote
to them before the crash, and then while the system was down they were
made inaccessible, and then the database was restarted, then we're
well into the territory where the system administrator has done
something that we cannot possibly be expected to cope with ... but
ignoring the fsync isn't very likely to cause any real problems even
here.  If we really did modify those blocks recently, recovery will
try to redo the changes, and we'll fail then anyway.  So what's the
problem?

I agree with Tom's concern that if we have two lists of directories,
they may get out of sync.  We could probably merge the two lists
somehow, but I'm not really seeing the point, since Tom's blanket
approach should work just fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Josh Berkus

On 05/25/2015 01:21 PM, Tom Lane wrote:
> And from the other direction, where exactly is it written that
> distros/users will only create problematic files at the top level of
> $PGDATA?  I'd have zero confidence in such an assertion applied to
> tablespace directories, for sure.

Yes, absolutely.

For example: Cstore_FDW now puts its files in a subdirectory of the
tablespace directory, which for a really large cstore table, the user
will want to symlink somewhere else, or might create as a mount.  Such a
user might then, for example, annotate this with a README.txt which
happens to be root/root perms.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Tom Lane

Alvaro Herrera  writes:
> Tom Lane wrote:
>> Well, that opens us to errors of omission, ie failing to fsync things we
>> should have.  Maybe that's an okay risk, but personally I'd judge that
>> "fsync everything and ignore (some?) errors" is probably a more robust
>> approach over time.

> How is it possible to make errors of omission?  The list of directories
> in initdb is the complete set of directories that are created for a
> newly-initdb'd database, no?  Surely there can't be a database that
> contains vital directories that are not created there?  See "subdirs"
> static in initdb.c.

Easy: all you need is to suppose that some of the plain files at top level
of $PGDATA ought to be fsync'd.  (I'm fairly sure for example that we took
steps awhile back to force postmaster.pid to be fsync'd.)  If there is a
distinction between the fsync requirements of top-level files and
everything else, it is completely accidental and not to be relied on.

And from the other direction, where exactly is it written that
distros/users will only create problematic files at the top level of
$PGDATA?  I'd have zero confidence in such an assertion applied to
tablespace directories, for sure.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Alvaro Herrera

Tom Lane wrote:
> Stephen Frost  writes:
> > I've not followed this thread all that closely, but I do tend to agree
> > with the idea of "only try to mess with files that are *clearly* ours to
> > mess with."
> 
> Well, that opens us to errors of omission, ie failing to fsync things we
> should have.  Maybe that's an okay risk, but personally I'd judge that
> "fsync everything and ignore (some?) errors" is probably a more robust
> approach over time.

How is it possible to make errors of omission?  The list of directories
in initdb is the complete set of directories that are created for a
newly-initdb'd database, no?  Surely there can't be a database that
contains vital directories that are not created there?  See "subdirs"
static in initdb.c.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Tom Lane

Andres Freund  writes:
> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
>>> Additionally we could attempt to fsync with a readonly fd before trying
>>> the read-write fd...

>> Not really sure I see that as helping.

> On most OSs, except windows and some obscure unixes, a readonly fd is
> allowed to fsync a file.

Perhaps, but if we didn't have permission to write the file, it's hard to
argue that it's our responsibility to fsync it.  So this seems like it's
adding complexity without really adding any safety.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Andres Freund

On 2015-05-25 14:02:28 -0400, Tom Lane wrote:
> Stephen Frost  writes:
> > I've not followed this thread all that closely, but I do tend to agree
> > with the idea of "only try to mess with files that are *clearly* ours to
> > mess with."
>
> Well, that opens us to errors of omission, ie failing to fsync things we
> should have.

Is that really that likely? I mean we don't normally add data to the top
level directory itself, and subdirectories hopefully won't be added
except via initdb?

> Maybe that's an okay risk, but personally I'd judge that
> "fsync everything and ignore (some?) errors" is probably a more robust
> approach over time.

The over-the-top approach would be to combine the two. Error out in
directories that are in the initdb list, and ignore permission errors
otherwise...

Additionally we could attempt to fsync with a readonly fd before trying
the read-write fd...

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Stephen Frost

* Andres Freund (and...@anarazel.de) wrote:
> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
> > That seems overly complicated, for my 2c at least.  I don't particularly
> > like trying to mess with files that might be rightfully considered "not
> > ours" either.
> 
> I'd not consider an fsync to be "messing" with files, especially if
> they're in PGDATA.

I'm not entirely sure I agree.

> > > Additionally we could attempt to fsync with a readonly fd before trying
> > > the read-write fd...
> > 
> > Not really sure I see that as helping.
> 
> On most OSs, except windows and some obscure unixes, a readonly fd is
> allowed to fsync a file.

I wouldn't have thought otherwise, given that you were suggesting it,
but there's no guarantee we're going to be allowed to read it either, or
even access the directory the symlink points to, etc..

Thanks,

Stephen


signature.asc
Description: Digital signature

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Stephen Frost

* Andres Freund (and...@anarazel.de) wrote:
> On 2015-05-25 14:02:28 -0400, Tom Lane wrote:
> > Stephen Frost  writes:
> > > I've not followed this thread all that closely, but I do tend to agree
> > > with the idea of "only try to mess with files that are *clearly* ours to
> > > mess with."
> >
> > Well, that opens us to errors of omission, ie failing to fsync things we
> > should have.
> 
> Is that really that likely? I mean we don't normally add data to the top
> level directory itself, and subdirectories hopefully won't be added
> except via initdb?

That feels like a pretty low risk, to me at least.  Certainly better
than having a failure, like what's going on now.

> > Maybe that's an okay risk, but personally I'd judge that
> > "fsync everything and ignore (some?) errors" is probably a more robust
> > approach over time.
> 
> The over-the-top approach would be to combine the two. Error out in
> directories that are in the initdb list, and ignore permission errors
> otherwise...

That seems overly complicated, for my 2c at least.  I don't particularly
like trying to mess with files that might be rightfully considered "not
ours" either.  This all makes me really wonder about
postgresql.auto.conf too..  Clearly, on the one hand, we consider that
"our" file, and so we should error out if we don't own it, but on the
other hand, I've specifically recommended making that file owned by
root to some folks, to avoid DBAs playing with the startup-time
settings..

> Additionally we could attempt to fsync with a readonly fd before trying
> the read-write fd...

Not really sure I see that as helping.

Thanks!

Stephen

signature.asc
Description: Digital signature

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Tom Lane

Stephen Frost  writes:
> I've not followed this thread all that closely, but I do tend to agree
> with the idea of "only try to mess with files that are *clearly* ours to
> mess with."

Well, that opens us to errors of omission, ie failing to fsync things we
should have.  Maybe that's an okay risk, but personally I'd judge that
"fsync everything and ignore (some?) errors" is probably a more robust
approach over time.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Andres Freund

On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
> That seems overly complicated, for my 2c at least.  I don't particularly
> like trying to mess with files that might be rightfully considered "not
> ours" either.

I'd not consider an fsync to be "messing" with files, especially if
they're in PGDATA.

> > Additionally we could attempt to fsync with a readonly fd before trying
> > the read-write fd...
> 
> Not really sure I see that as helping.

On most OSs, except windows and some obscure unixes, a readonly fd is
allowed to fsync a file.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Stephen Frost

* Andres Freund (and...@anarazel.de) wrote:
> On 2015-05-25 13:38:01 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > On May 24, 2015 7:52:53 AM PDT, Tom Lane  wrote:
> > > If we'd merge it with initdb's list I think I'd not be that bad. I'm 
> > > thinking of some header declaring it, roughly like the rmgr list.
> > 
> > pg_log/ is a counterexample to that idea too; initdb doesn't know about it
> > (and shouldn't).
> 
> The idea would be to *only* directories that initdb knows about. Since
> that's where the valuables are. So I don't see how pg_log would be a
> counterexample.

Indeed, that wouldn't be included in the list of things to fsync and it
isn't listed in initdb, so that works.

I've not followed this thread all that closely, but I do tend to agree
with the idea of "only try to mess with files that are *clearly* ours to
mess with."

Thanks!

Stephen


signature.asc
Description: Digital signature

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Andres Freund

On 2015-05-25 13:38:01 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On May 24, 2015 7:52:53 AM PDT, Tom Lane  wrote:
> > If we'd merge it with initdb's list I think I'd not be that bad. I'm 
> > thinking of some header declaring it, roughly like the rmgr list.
> 
> pg_log/ is a counterexample to that idea too; initdb doesn't know about it
> (and shouldn't).

The idea would be to *only* directories that initdb knows about. Since
that's where the valuables are. So I don't see how pg_log would be a
counterexample.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Tom Lane

Andres Freund  writes:
> On May 24, 2015 7:52:53 AM PDT, Tom Lane  wrote:
>> Christoph Berg  writes:
>>> pg_log/ is also admin domain. What about only recursing into
>>> well-known directories + postgresql.auto.conf?

>> The idea that this code would know exactly what's what under $PGDATA
>> scares me.  I can positively guarantee that it would diverge from
>> reality over time, and nobody would notice until it ate their data,
>> failed to start, or otherwise behaved undesirably.
>> 
>> pg_log/ is a perfect example, because that is not a hard-wired
>> directory name; somebody could point the syslogger at a different place
>> very easily.  Wiring in special behavior for that name is just wrong.
>> 
>> I would *much* rather have a uniform rule for how to treat each file
>> the scan comes across.  It might take some tweaking to get to one that
>> works well; but once we did, we could have some confidence that it
>> wouldn't break later.

> If we'd merge it with initdb's list I think I'd not be that bad. I'm thinking 
> of some header declaring it, roughly like the rmgr list.

pg_log/ is a counterexample to that idea too; initdb doesn't know about it
(and shouldn't).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-25 Thread Andres Freund

On May 24, 2015 7:52:53 AM PDT, Tom Lane  wrote:
>Christoph Berg  writes:
>> Re: To Andres Freund 2015-05-24 <20150524075244.gb27...@msg.df7cb.de>
>>> Re: Andres Freund 2015-05-24
><20150524005245.gd32...@alap3.anarazel.de>
 How about, to avoid masking actual problems, we have a more
 differentiated logic for the toplevel data directory?
>
>> pg_log/ is also admin domain. What about only recursing into
>> well-known directories + postgresql.auto.conf?
>
>The idea that this code would know exactly what's what under $PGDATA
>scares me.  I can positively guarantee that it would diverge from
>reality
>over time, and nobody would notice until it ate their data, failed to
>start, or otherwise behaved undesirably.
>
>pg_log/ is a perfect example, because that is not a hard-wired
>directory
>name; somebody could point the syslogger at a different place very
>easily.
>Wiring in special behavior for that name is just wrong.
>
>I would *much* rather have a uniform rule for how to treat each file
>the scan comes across.  It might take some tweaking to get to one that
>works well; but once we did, we could have some confidence that it
>wouldn't break later.

If we'd merge it with initdb's list I think I'd not be that bad. I'm thinking 
of some header declaring it, roughly like the rmgr list.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-24 Thread Tom Lane

Christoph Berg  writes:
> Re: To Andres Freund 2015-05-24 <20150524075244.gb27...@msg.df7cb.de>
>> Re: Andres Freund 2015-05-24 <20150524005245.gd32...@alap3.anarazel.de>
>>> How about, to avoid masking actual problems, we have a more
>>> differentiated logic for the toplevel data directory?

> pg_log/ is also admin domain. What about only recursing into
> well-known directories + postgresql.auto.conf?

The idea that this code would know exactly what's what under $PGDATA
scares me.  I can positively guarantee that it would diverge from reality
over time, and nobody would notice until it ate their data, failed to
start, or otherwise behaved undesirably.

pg_log/ is a perfect example, because that is not a hard-wired directory
name; somebody could point the syslogger at a different place very easily.
Wiring in special behavior for that name is just wrong.

I would *much* rather have a uniform rule for how to treat each file
the scan comes across.  It might take some tweaking to get to one that
works well; but once we did, we could have some confidence that it
wouldn't break later.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-24 Thread Christoph Berg

Re: To Andres Freund 2015-05-24 <20150524075244.gb27...@msg.df7cb.de>
> Re: Andres Freund 2015-05-24 <20150524005245.gd32...@alap3.anarazel.de>
> > How about, to avoid masking actual problems, we have a more
> > differentiated logic for the toplevel data directory? I think we could
> > just skip all non-directory files in there data_directory itself. None
> > of the files in the toplevel directory, with the exception of
> > postgresql.auto.conf, will ever get written to by PG itself.  And if
> > there's readonly files somewhere in a subdirectory, I won't feel
> > particularly bad.

pg_log/ is also admin domain. What about only recursing into
well-known directories + postgresql.auto.conf?

(I've also been wondering if pg_basebackup shouldn't skip pg_log, but
that's a different topic...)

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-24 Thread Christoph Berg

Re: Andres Freund 2015-05-24 <20150524005245.gd32...@alap3.anarazel.de>
> How about, to avoid masking actual problems, we have a more
> differentiated logic for the toplevel data directory? I think we could
> just skip all non-directory files in there data_directory itself. None
> of the files in the toplevel directory, with the exception of
> postgresql.auto.conf, will ever get written to by PG itself.  And if
> there's readonly files somewhere in a subdirectory, I won't feel
> particularly bad.

I like that idea.

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-23 Thread Andres Freund

On 2015-05-23 16:33:29 -0400, Tom Lane wrote:
> Christoph Berg  writes:
> > the new fsync-pgdata-on-recovery code tries to open all files using
> > O_RDWR. At least on 9.1, this can make recovery fail:
> 
> Hm.  I wonder whether it would be all right to just skip files for which
> we get EPERM on open().  The argument being that if we can't write to the
> file, we should not be held responsible for fsync'ing it either.  But
> I'm not sure whether EPERM would be the only relevant errno, or whether
> there are cases where this would mask real problems.

We could even try doing the a fsync with a readonly fd as a fallback,
but that's also pretty hacky.

How about, to avoid masking actual problems, we have a more
differentiated logic for the toplevel data directory? I think we could
just skip all non-directory files in there data_directory itself. None
of the files in the toplevel directory, with the exception of
postgresql.auto.conf, will ever get written to by PG itself.  And if
there's readonly files somewhere in a subdirectory, I won't feel
particularly bad.

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-23 Thread Christoph Berg

Re: Tom Lane 2015-05-23 <2284.1432413...@sss.pgh.pa.us>
> Christoph Berg  writes:
> > the new fsync-pgdata-on-recovery code tries to open all files using
> > O_RDWR. At least on 9.1, this can make recovery fail:
> 
> Hm.  I wonder whether it would be all right to just skip files for which
> we get EPERM on open().  The argument being that if we can't write to the
> file, we should not be held responsible for fsync'ing it either.  But
> I'm not sure whether EPERM would be the only relevant errno, or whether
> there are cases where this would mask real problems.

Maybe logging WARNINGs instead of FATAL would be enough of a fix?

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-23 Thread Tom Lane

Christoph Berg  writes:
> the new fsync-pgdata-on-recovery code tries to open all files using
> O_RDWR. At least on 9.1, this can make recovery fail:

Hm.  I wonder whether it would be all right to just skip files for which
we get EPERM on open().  The argument being that if we can't write to the
file, we should not be held responsible for fsync'ing it either.  But
I'm not sure whether EPERM would be the only relevant errno, or whether
there are cases where this would mask real problems.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] fsync-pgdata-on-recovery tries to write to more files than previously

2015-05-23 Thread Christoph Berg

Hi,

the new fsync-pgdata-on-recovery code tries to open all files using
O_RDWR. At least on 9.1, this can make recovery fail:

* launch postgres, hit ^\ (or otherwise shut down uncleanly)
* touch foo; chmod 444 foo
* launch postgres

LOG:  database system was interrupted; last known up at 2015-05-23 19:18:36 CEST
FATAL:  could not open file "/home/cbe/9.1/foo": Permission denied
LOG:  startup process (PID 27305) exited with exit code 1
LOG:  aborting startup due to startup process failure

The code on 9.4 looks similar to me, but I couldn't trigger the
problem there.

I think this is a real-world problem:

1) In older releases, the SSL certs were read from the data directory,
and at least the default Debian installation creates symlinks from
PGDATA/server.* to /etc/ssl/ where PostgreSQL can't write

2) It's probably a pretty common scenario that the root user will edit
postgresql.conf, and make backups or create other random files there
that are not writable. Even a non-writable postgresql.conf itself or
recovery.conf was not a problem previously.

To me, this is a serious regression because it prevents automatic
startup of a server that would otherwise just run.

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/


signature.asc
Description: Digital signature

69 matches

Mail list logo