Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?
On Thu, Mar 12, 2015 at 5:56 PM, Andres Freund and...@2ndquadrant.com wrote: Any chance that the new nodes also use a different kernel version or such? They may differ, but that doesn't seem likely to be relevant, at least to me. There've been some issues with seek(END) sometimes returning the wrong length in the past. And I've seen a report that might indicate a similar bug has been reintroduced. That could certainly cause such anerror. It seems that I failed to take account of the fact that these cases all involved Ubuntu 14.04 LTS (that wasn't clear to me before, but I've verified that that's the case now), which is a small minority of the kernels our instances use. I've verified that the issue arises independently of point release, including on earlier point releases from about a year ago, so I was hasty in suggesting that there was a regression in the latest point releases. I'm still going to get a back trace here, because it seems reasonable to suppose that there is a Postgres bug even still - it may be that whatever differences are in the 14.04 kernel are enough to make a previously latent bug trip this code up. Plenty of people are using Ubuntu 14.04 LTS with Postgres without seeing this issue, or any other issue, and there is no other issue with the kernel that I know of. The issue continues to occur as new instances with this configuration are deployed. I should also add that my temporary remediation (VACUUM FULL pg_auth_members;) works consistently, as new cases start to crop up. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?
On 03/17/2015 10:35 AM, Peter Geoghegan wrote: I'm still going to get a back trace here, because it seems reasonable to suppose that there is a Postgres bug even still - it may be that whatever differences are in the 14.04 kernel are enough to make a previously latent bug trip this code up. Plenty of people are using Ubuntu 14.04 LTS with Postgres without seeing this issue, or any other issue, and there is no other issue with the kernel that I know of. The issue continues to occur as new instances with this configuration are deployed. Yeah ... I think I have logins on something like 50 servers running 9.3.6 on 14.04, in replication, and haven't seen this issue. Which specific kernel version? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?
Well, when a database is first initdb'd that relation is in fact 0 bytes: ::***# select pg_relation_filenode(oid) from pg_class where relname = 'pg_auth_members'; ┌──┐ │ pg_relation_filenode │ ├──┤ │12610 │ └──┘ (1 row) $ find . -name 12610 -ls 1317720 -rw--- 1 starkstark 0 Mar 3 18:52 ./global/12610 So it's not surprising to find a 0-byte file there sometime. The question is how the ALTER ROLE commands failed to extend it or why they're expecting it to be non-zero when it is. Is it possible the permissions on the file are wrong for some reason? The symptoms would make sense if the directory had permissions that allowed unlinking (for vacuum full) but the file didn't have r access for the postgres user or SE rules blocked it from being read or something like that. -- greg
Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?
Hi, On 2015-03-12 16:42:24 -0700, Peter Geoghegan wrote: We want to create a new role when this happens, for various reasons. This occurs after recovery ends, but before the database has been unfenced. The template code that generates various ALTER ROLE statements in our internal provisioning system - which has apparently worked just fine for a long time - is: Is this all the code that's exececuted after recovery? How are these forks brought up? Promoted how? Is it a common 'source' database? db.execute(ALTER ROLE #{old_database_user} RENAME TO #{database_user}) db.execute(ALTER ROLE #{database_user} PASSWORD '#{database_password}' LOGIN) db.execute(CREATE ROLE \#{old_database_user}\ PASSWORD '#{old_database_password}' IN ROLE \#{database_user}\ LOGIN) I've seen multiple reports of apparent corruption, appearing as the resulting ALTER ROLE statements are executed: PG::DataCorrupted: ERROR: could not read block 0 in file global/12811: read only 0 of 8192 bytes or: PG::DataCorrupted: ERROR: could not read block 0 in file global/12785: read only 0 of 8192 bytes or: PG::DataCorrupted: ERROR: could not read block 0 in file global/12811: read only 0 of 8192 bytes Have you looked at these files? Are they indeed zero bytes when this error occurs? Do you still have a base backup from the relevant time, so you could repeat the whole thing? The only common factor is that this occurs on the latest point releases (either 9.3.6 and 9.2.10, at least so far). In all cases I've seen so far, the relation in question is the pg_auth_members heap relation. For example: Any chance that the new nodes also use a different kernel version or such? redacteddb=# select pg_relation_filenode(oid), oid, relname, relkind from pg_class where pg_relation_filenode(oid) = 12785; pg_relation_filenode | oid | relname | relkind --+--+-+- 12785 | 1261 | pg_auth_members | r (1 row) This filenode got to be pg_auth_member's original one, given it's below FirstNormalObjectId. I get a lower value, but that's probably caused by having fewer collations and other data generated during initdb. That implies that the table hasn't ever been rewritten. What's 12811? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?
In a hurry right now, so unfortunately I'll need to be brief for now. On Thu, Mar 12, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-03-12 16:42:24 -0700, Peter Geoghegan wrote: We want to create a new role when this happens, for various reasons. This occurs after recovery ends, but before the database has been unfenced. The template code that generates various ALTER ROLE statements in our internal provisioning system - which has apparently worked just fine for a long time - is: Is this all the code that's exececuted after recovery? How are these forks brought up? Promoted how? Is it a common 'source' database? We do PITR up to a recovery target. We're talking about the same issue occurring on entirely distinct customer databases, with entirely distinct major PG versions. I'm not sure what other code might have already been run at this point, but it won't have been much. As I said, the only common factor that I know of is all affected databases being on the latest point release. Have you looked at these files? Are they indeed zero bytes when this error occurs? I think that they are indeed zero. I looked at that last week, when I didn't consider that this might be a more widespread issue. I'll check again later. Do you still have a base backup from the relevant time, so you could repeat the whole thing? Yes. The only common factor is that this occurs on the latest point releases (either 9.3.6 and 9.2.10, at least so far). In all cases I've seen so far, the relation in question is the pg_auth_members heap relation. For example: Any chance that the new nodes also use a different kernel version or such? They may differ, but that doesn't seem likely to be relevant, at least to me. This has happened something like 6 or 7 times already, starting late last week. I am unfamiliar with this provisioning code, so, as I mentioned, offhand I cannot be entirely sure that there isn't some other code run when the problem originally arises (that I should have included in my report). What I can tell you is that I saw the same error messages when I manually ran the statements generated by the above code within a transaction...until I ran VACUUM FULL pg_auth_members;. This filenode got to be pg_auth_member's original one, given it's below FirstNormalObjectId. I get a lower value, but that's probably caused by having fewer collations and other data generated during initdb. That implies that the table hasn't ever been rewritten. What's 12811? It's the same catalog, pg_auth_member. As I said, the messages you saw are on entirely different customer databases, servers and (sometimes) PG version. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?
On 2015-03-12 17:42:33 -0700, Peter Geoghegan wrote: On Thu, Mar 12, 2015 at 5:21 PM, Andres Freund and...@2ndquadrant.com wrote: On 2015-03-12 16:42:24 -0700, Peter Geoghegan wrote: We want to create a new role when this happens, for various reasons. This occurs after recovery ends, but before the database has been unfenced. The template code that generates various ALTER ROLE statements in our internal provisioning system - which has apparently worked just fine for a long time - is: Is this all the code that's exececuted after recovery? How are these forks brought up? Promoted how? Is it a common 'source' database? We do PITR up to a recovery target. So, no hot standby enabled? I'm not sure what other code might have already been run at this point, but it won't have been much. Have you looked at these files? Are they indeed zero bytes when this error occurs? I think that they are indeed zero. I looked at that last week, when I didn't consider that this might be a more widespread issue. I'll check again later. Any chance that the new nodes also use a different kernel version or such? They may differ, but that doesn't seem likely to be relevant, at least to me. There've been some issues with seek(END) sometimes returning the wrong length in the past. And I've seen a report that might indicate a similar bug has been reintroduced. That could certainly cause such anerror. I am unfamiliar with this provisioning code, so, as I mentioned, offhand I cannot be entirely sure that there isn't some other code run when the problem originally arises (that I should have included in my report). It's probably worthwhile to dig out what's happening. What I can tell you is that I saw the same error messages when I manually ran the statements generated by the above code within a transaction...until I ran VACUUM FULL pg_auth_members;. You can reproduce that problem? How easily? If you can, please produce a backtrace. It'll certainly be interesting to see whether that access is through an index or whatnot. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in point releases 9.3.6 and 9.2.10?
On Thu, Mar 12, 2015 at 5:56 PM, Andres Freund and...@2ndquadrant.com wrote: So, no hot standby enabled? Right. There've been some issues with seek(END) sometimes returning the wrong length in the past. And I've seen a report that might indicate a similar bug has been reintroduced. That could certainly cause such anerror. Perhaps. I am unfamiliar with this provisioning code, so, as I mentioned, offhand I cannot be entirely sure that there isn't some other code run when the problem originally arises (that I should have included in my report). It's probably worthwhile to dig out what's happening. I'll get to that after the backtrace. What I can tell you is that I saw the same error messages when I manually ran the statements generated by the above code within a transaction...until I ran VACUUM FULL pg_auth_members;. You can reproduce that problem? How easily? If you can, please produce a backtrace. It'll certainly be interesting to see whether that access is through an index or whatnot. I suspect I can reproduce it quite easily. The next step should be to do a backtrace. I'll look at that tomorrow. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Bug in point releases 9.3.6 and 9.2.10?
Heroku Postgres runs provisioning code that performs certain actions on roles when creating a new fork of an existing database. This often causes the new fork to be on the latest point release, where the database being forked was not. We want to create a new role when this happens, for various reasons. This occurs after recovery ends, but before the database has been unfenced. The template code that generates various ALTER ROLE statements in our internal provisioning system - which has apparently worked just fine for a long time - is: db.execute(ALTER ROLE #{old_database_user} RENAME TO #{database_user}) db.execute(ALTER ROLE #{database_user} PASSWORD '#{database_password}' LOGIN) db.execute(CREATE ROLE \#{old_database_user}\ PASSWORD '#{old_database_password}' IN ROLE \#{database_user}\ LOGIN) I've seen multiple reports of apparent corruption, appearing as the resulting ALTER ROLE statements are executed: PG::DataCorrupted: ERROR: could not read block 0 in file global/12811: read only 0 of 8192 bytes or: PG::DataCorrupted: ERROR: could not read block 0 in file global/12785: read only 0 of 8192 bytes or: PG::DataCorrupted: ERROR: could not read block 0 in file global/12811: read only 0 of 8192 bytes The only common factor is that this occurs on the latest point releases (either 9.3.6 and 9.2.10, at least so far). In all cases I've seen so far, the relation in question is the pg_auth_members heap relation. For example: redacteddb=# select pg_relation_filenode(oid), oid, relname, relkind from pg_class where pg_relation_filenode(oid) = 12785; pg_relation_filenode | oid | relname | relkind --+--+-+- 12785 | 1261 | pg_auth_members | r (1 row) Running VACUUM FULL pg_auth_members; has made the problem go away (to the extent that the above code doesn't trip up and everything is at least superficially okay) on at least one occasion. I'm currently investigating how consistently that works as a short term remediation. Theories? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers