subject:"\[HACKERS\] Bug in point releases 9.3.6 and 9.2.10\?"

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

2015-03-17 Thread Peter Geoghegan

On Thu, Mar 12, 2015 at 5:56 PM, Andres Freund and...@2ndquadrant.com wrote:
  Any chance that the new nodes also use a different kernel version or
  such?

 They may differ, but that doesn't seem likely to be relevant, at least
 to me.

 There've been some issues with seek(END) sometimes returning the wrong
 length in the past. And I've seen a report that might indicate a similar
 bug has been reintroduced. That could certainly cause such anerror.

It seems that I failed to take account of the fact that these cases
all involved Ubuntu 14.04 LTS (that wasn't clear to me before, but
I've verified that that's the case now), which is a small minority of
the kernels our instances use. I've verified that the issue arises
independently of point release, including on earlier point releases
from about a year ago, so I was hasty in suggesting that there was a
regression in the latest point releases.

I'm still going to get a back trace here, because it seems reasonable
to suppose that there is a Postgres bug even still - it may be that
whatever differences are in the 14.04 kernel are enough to make a
previously latent bug trip this code up. Plenty of people are using
Ubuntu 14.04 LTS with Postgres without seeing this issue, or any other
issue, and there is no other issue with the kernel that I know of. The
issue continues to occur as new instances with this configuration are
deployed.

I should also add that my temporary remediation (VACUUM FULL
pg_auth_members;) works consistently, as new cases start to crop up.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

2015-03-17 Thread Josh Berkus

On 03/17/2015 10:35 AM, Peter Geoghegan wrote:
 I'm still going to get a back trace here, because it seems reasonable
 to suppose that there is a Postgres bug even still - it may be that
 whatever differences are in the 14.04 kernel are enough to make a
 previously latent bug trip this code up. Plenty of people are using
 Ubuntu 14.04 LTS with Postgres without seeing this issue, or any other
 issue, and there is no other issue with the kernel that I know of. The
 issue continues to occur as new instances with this configuration are
 deployed.

Yeah ... I think I have logins on something like 50 servers running
9.3.6 on 14.04, in replication, and haven't seen this issue.

Which specific kernel version?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

2015-03-17 Thread Greg Stark

Well, when a database is first initdb'd that relation is in fact 0 bytes:

::***# select pg_relation_filenode(oid) from pg_class where relname =
'pg_auth_members';
┌──┐
│ pg_relation_filenode │
├──┤
│12610 │
└──┘
(1 row)

$ find . -name 12610 -ls
1317720 -rw---   1 starkstark   0 Mar  3 18:52
./global/12610

So it's not surprising to find a 0-byte file there sometime. The question
is how the ALTER ROLE commands failed to extend it or why they're expecting
it to be non-zero when it is.

Is it possible the permissions on the file are wrong for some reason? The
symptoms would make sense if the directory had permissions that allowed
unlinking (for vacuum full) but the file didn't have r access for the
postgres user or SE rules blocked it from being read or something like that.



-- 
greg

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

2015-03-12 Thread Andres Freund

Hi,

On 2015-03-12 16:42:24 -0700, Peter Geoghegan wrote:
 We want to create a new role when this happens, for various reasons.
 This occurs after recovery ends, but before the database has been
 unfenced. The template code that generates various ALTER ROLE
 statements in our internal provisioning system - which has apparently
 worked just fine for a long time - is:

Is this all the code that's exececuted after recovery? How are these
forks brought up? Promoted how? Is it a common 'source' database?

 db.execute(ALTER ROLE #{old_database_user} RENAME TO #{database_user})
 db.execute(ALTER ROLE #{database_user} PASSWORD '#{database_password}' 
 LOGIN)
 db.execute(CREATE ROLE \#{old_database_user}\ PASSWORD
 '#{old_database_password}' IN ROLE \#{database_user}\ LOGIN)
 
 I've seen multiple reports of apparent corruption, appearing as the
 resulting ALTER ROLE statements are executed:
 
 PG::DataCorrupted: ERROR: could not read block 0 in file
 global/12811: read only 0 of 8192 bytes
 or:
 PG::DataCorrupted: ERROR: could not read block 0 in file
 global/12785: read only 0 of 8192 bytes
 or:
 PG::DataCorrupted: ERROR: could not read block 0 in file
 global/12811: read only 0 of 8192 bytes

Have you looked at these files? Are they indeed zero bytes when this
error occurs?

Do you still have a base backup from the relevant time, so you could
repeat the whole thing?

 The only common factor is that this occurs on the latest point
 releases (either 9.3.6 and 9.2.10, at least so far). In all cases I've
 seen so far, the relation in question is the pg_auth_members heap
 relation. For example:

Any chance that the new nodes also use a different kernel version or
such?

 redacteddb=#  select pg_relation_filenode(oid), oid, relname, relkind
 from pg_class where pg_relation_filenode(oid) = 12785;
  pg_relation_filenode | oid  | relname | relkind
 --+--+-+-
 12785 | 1261 | pg_auth_members | r
 (1 row)

This filenode got to be pg_auth_member's original one, given it's below
FirstNormalObjectId. I get a lower value, but that's probably caused by
having fewer collations and other data generated during initdb. That
implies that the table hasn't ever been rewritten.

What's 12811?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

2015-03-12 Thread Peter Geoghegan

In a hurry right now, so unfortunately I'll need to be brief for now.

On Thu, Mar 12, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-03-12 16:42:24 -0700, Peter Geoghegan wrote:
 We want to create a new role when this happens, for various reasons.
 This occurs after recovery ends, but before the database has been
 unfenced. The template code that generates various ALTER ROLE
 statements in our internal provisioning system - which has apparently
 worked just fine for a long time - is:

 Is this all the code that's exececuted after recovery? How are these
 forks brought up? Promoted how? Is it a common 'source' database?

We do PITR up to a recovery target. We're talking about the same issue
occurring on entirely distinct customer databases, with entirely
distinct major PG versions. I'm not sure what other code might have
already been run at this point, but it won't have been much. As I
said, the only common factor that I know of is all affected databases
being on the latest point release.

 Have you looked at these files? Are they indeed zero bytes when this
 error occurs?

I think that they are indeed zero. I looked at that last week, when I
didn't consider that this might be a more widespread issue. I'll check
again later.

 Do you still have a base backup from the relevant time, so you could
 repeat the whole thing?

Yes.

 The only common factor is that this occurs on the latest point
 releases (either 9.3.6 and 9.2.10, at least so far). In all cases I've
 seen so far, the relation in question is the pg_auth_members heap
 relation. For example:

 Any chance that the new nodes also use a different kernel version or
 such?

They may differ, but that doesn't seem likely to be relevant, at least
to me. This has happened something like 6 or 7 times already, starting
late last week. I am unfamiliar with this provisioning code, so, as I
mentioned, offhand I cannot be entirely sure that there isn't some
other code run when the problem originally arises (that I should have
included in my report). What I can tell you is that I saw the same
error messages when I manually ran the statements generated by the
above code within a transaction...until I ran VACUUM FULL
pg_auth_members;.

 This filenode got to be pg_auth_member's original one, given it's below
 FirstNormalObjectId. I get a lower value, but that's probably caused by
 having fewer collations and other data generated during initdb. That
 implies that the table hasn't ever been rewritten.

 What's 12811?

It's the same catalog, pg_auth_member. As I said, the messages you saw
are on entirely different customer databases, servers and (sometimes)
PG version.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

2015-03-12 Thread Andres Freund

On 2015-03-12 17:42:33 -0700, Peter Geoghegan wrote:
 On Thu, Mar 12, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2015-03-12 16:42:24 -0700, Peter Geoghegan wrote:
  We want to create a new role when this happens, for various reasons.
  This occurs after recovery ends, but before the database has been
  unfenced. The template code that generates various ALTER ROLE
  statements in our internal provisioning system - which has apparently
  worked just fine for a long time - is:
 
  Is this all the code that's exececuted after recovery? How are these
  forks brought up? Promoted how? Is it a common 'source' database?
 
 We do PITR up to a recovery target.

So, no hot standby enabled?

 I'm not sure what other code might have already been run at this
 point, but it won't have been much.

  Have you looked at these files? Are they indeed zero bytes when this
  error occurs?

 I think that they are indeed zero. I looked at that last week, when I
 didn't consider that this might be a more widespread issue. I'll check
 again later.

  Any chance that the new nodes also use a different kernel version or
  such?
 
 They may differ, but that doesn't seem likely to be relevant, at least
 to me.

There've been some issues with seek(END) sometimes returning the wrong
length in the past. And I've seen a report that might indicate a similar
bug has been reintroduced. That could certainly cause such anerror.

 I am unfamiliar with this provisioning code, so, as I
 mentioned, offhand I cannot be entirely sure that there isn't some
 other code run when the problem originally arises (that I should have
 included in my report).

It's probably worthwhile to dig out what's happening.

 What I can tell you is that I saw the same error messages when I
 manually ran the statements generated by the above code within a
 transaction...until I ran VACUUM FULL pg_auth_members;.

You can reproduce that problem? How easily? If you can, please produce a
backtrace. It'll certainly be interesting to see whether that access is
through an index or whatnot.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

2015-03-12 Thread Peter Geoghegan

On Thu, Mar 12, 2015 at 5:56 PM, Andres Freund and...@2ndquadrant.com wrote:
 So, no hot standby enabled?

Right.

 There've been some issues with seek(END) sometimes returning the wrong
 length in the past. And I've seen a report that might indicate a similar
 bug has been reintroduced. That could certainly cause such anerror.

Perhaps.

 I am unfamiliar with this provisioning code, so, as I
 mentioned, offhand I cannot be entirely sure that there isn't some
 other code run when the problem originally arises (that I should have
 included in my report).

 It's probably worthwhile to dig out what's happening.

I'll get to that after the backtrace.

 What I can tell you is that I saw the same error messages when I
 manually ran the statements generated by the above code within a
 transaction...until I ran VACUUM FULL pg_auth_members;.

 You can reproduce that problem? How easily? If you can, please produce a
 backtrace. It'll certainly be interesting to see whether that access is
 through an index or whatnot.

I suspect I can reproduce it quite easily. The next step should be to
do a backtrace. I'll look at that tomorrow.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Bug in point releases 9.3.6 and 9.2.10?

2015-03-12 Thread Peter Geoghegan

Heroku Postgres runs provisioning code that performs certain actions
on roles when creating a new fork of an existing database. This
often causes the new fork to be on the latest point release, where the
database being forked was not.

We want to create a new role when this happens, for various reasons.
This occurs after recovery ends, but before the database has been
unfenced. The template code that generates various ALTER ROLE
statements in our internal provisioning system - which has apparently
worked just fine for a long time - is:

db.execute(ALTER ROLE #{old_database_user} RENAME TO #{database_user})
db.execute(ALTER ROLE #{database_user} PASSWORD '#{database_password}' LOGIN)
db.execute(CREATE ROLE \#{old_database_user}\ PASSWORD
'#{old_database_password}' IN ROLE \#{database_user}\ LOGIN)

I've seen multiple reports of apparent corruption, appearing as the
resulting ALTER ROLE statements are executed:

PG::DataCorrupted: ERROR: could not read block 0 in file
global/12811: read only 0 of 8192 bytes
or:
PG::DataCorrupted: ERROR: could not read block 0 in file
global/12785: read only 0 of 8192 bytes
or:
PG::DataCorrupted: ERROR: could not read block 0 in file
global/12811: read only 0 of 8192 bytes


The only common factor is that this occurs on the latest point
releases (either 9.3.6 and 9.2.10, at least so far). In all cases I've
seen so far, the relation in question is the pg_auth_members heap
relation. For example:

redacteddb=#  select pg_relation_filenode(oid), oid, relname, relkind
from pg_class where pg_relation_filenode(oid) = 12785;
 pg_relation_filenode | oid  | relname | relkind
--+--+-+-
12785 | 1261 | pg_auth_members | r
(1 row)


Running VACUUM FULL pg_auth_members; has made the problem go away
(to the extent that the above code doesn't trip up and everything is
at least superficially okay) on at least one occasion. I'm currently
investigating how consistently that works as a short term remediation.

Theories?
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?

[HACKERS] Bug in point releases 9.3.6 and 9.2.10?

8 matches

Site Navigation

Mail list logo

Footer information