Re: "PANIC: could not open critical system index 2662" - twice
Hi,

On 2023-06-19 10:04:35 +, Evgeny Morozov wrote:
> There haven't been any updates posted to
> https://www.postgresql.org/message-id/20230509040203.z6mvijumv7wxcuib%40awork3.anarazel.de
> so I just wanted to check if there is any update on the status of the
> patch? Can we expect it in PostgreSQL 15.4? Thanks.

I pushed the fixes to all branches just now. Thanks for the report!

Greetings,

Andres Freund
Re: "PANIC: could not open critical system index 2662" - twice
There haven't been any updates posted to https://www.postgresql.org/message-id/20230509040203.z6mvijumv7wxcuib%40awork3.anarazel.de so I just wanted to check if there is any update on the status of the patch? Can we expect it in PostgreSQL 15.4? Thanks.
Re: "PANIC: could not open critical system index 2662" - twice
On 17/05/2023 1:39 am, Andres Freund wrote:
> Try to prevent the DROP DATABASE from getting cancelled :/.

I still don't know why that's happening. I mean, I know why it gets cancelled (the client timeout we set in Npgsql), but I don't know why the drop does not succeed within 30 seconds. We could, of course, increase that timeout, but my gut feeling is that if it hasn't happened in 30 seconds it won't happen in 5 minutes, either, and the connection will have to be closed eventually. Can you think of any way to troubleshoot that?

> If you want to go a bit further, you could rename the database to
> *_dropped before dropping it, and then try to do the DROP DATABASE.
> That way you'd at least know that it's corrupt because of a failed
> DROP database.

That's a good idea and I will do that, but my bigger concern is PG crashing entirely with the "PANIC" error and refusing to start again. I still don't know why that happened back in April (but not this time, in May), so I don't know how to prevent that.
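One way to start troubleshooting a DROP DATABASE that never returns (a sketch, not something from this thread; the ILIKE filter is just a convenience) is to watch the dropping backend's wait state from a second connection while the drop is hung:

```sql
-- While the DROP DATABASE is hung, from another session:
SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE query ILIKE 'drop database%';

-- Check for lock conflicts (an empty array means nothing is blocking it):
SELECT pid, pg_blocking_pids(pid)
FROM pg_stat_activity
WHERE query ILIKE 'drop database%';
```

A wait_event of ProcSignalBarrier with no blocking PIDs would point at interrupt processing in other backends rather than ordinary lock contention.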
Re: "PANIC: could not open critical system index 2662" - twice
Hi,

On 2023-05-16 14:20:46 +, Evgeny Morozov wrote:
> On 9/05/2023 3:32 am, Andres Freund wrote:
> > Attached is a rough prototype of that idea (only using datconnlimit ==
> > -2 for now).
> > I guess we need to move this to -hackers. Perhaps I'll post subsequent
> > versions below
> > https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
> > ?
>
> So now that a patch is in the works, can I drop the two corrupted
> databases?

Yes.

> Is there a workaround I can use to reduce the risk of running into this
> issue again until a patch is released? (Which I guess would be in August?)

Try to prevent the DROP DATABASE from getting cancelled :/.

If you want to go a bit further, you could rename the database to *_dropped before dropping it, and then try to do the DROP DATABASE. That way you'd at least know that it's corrupt because of a failed DROP DATABASE.

Greetings,

Andres Freund
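The rename-then-drop workaround could look like this (the database name is illustrative, and the _dropped suffix is just a convention):

```sql
ALTER DATABASE test_behavior_12345 RENAME TO test_behavior_12345_dropped;
DROP DATABASE test_behavior_12345_dropped;
```

If a *_dropped database is later found corrupt, it is then clear the corruption came from a failed DROP rather than a failed CREATE.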
Re: "PANIC: could not open critical system index 2662" - twice
On Tue, May 16, 2023 at 10:20 AM Evgeny Morozov <postgres...@realityexists.net> wrote:
> On 9/05/2023 3:32 am, Andres Freund wrote:
> > Attached is a rough prototype of that idea (only using datconnlimit ==
> > -2 for now).
> > I guess we need to move this to -hackers. Perhaps I'll post subsequent
> > versions below
> > https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
> > ?
>
> So now that a patch is in the works, can I drop the two corrupted
> databases? Is there a workaround I can use to reduce the risk of running
> into this issue again until a patch is released? (Which I guess would be
> in August?)

The only workaround to avoid losing data that I know of is backups plus WAL backups, plus "hard core" testing/validation that they actually work. We settled on pgBackRest and are happy with it.

Technically, based on what I understand of this bug, it did not corrupt the WAL. If that's true, then if you had a base backup and all the WAL files, you could have replayed them and recovered the data, at least to some degree. Assuming I am right.

HTH
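A minimal sketch of that safety net - a base backup plus continuous WAL archiving. The paths and settings here are assumptions for illustration, not a vetted configuration:

```shell
# Take a base backup, streaming the WAL needed to make it consistent:
pg_basebackup -D /backup/base -X stream -c fast

# In postgresql.conf, archive every completed WAL segment:
#   archive_mode = on
#   archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
```

With both pieces in place, a corrupted database could be rebuilt by restoring the base backup and replaying the archived WAL up to a point before the corruption.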
Re: "PANIC: could not open critical system index 2662" - twice
On 9/05/2023 3:32 am, Andres Freund wrote:
> Attached is a rough prototype of that idea (only using datconnlimit ==
> -2 for now).
> I guess we need to move this to -hackers. Perhaps I'll post subsequent
> versions below
> https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
> ?

So now that a patch is in the works, can I drop the two corrupted databases? Is there a workaround I can use to reduce the risk of running into this issue again until a patch is released? (Which I guess would be in August?)
Re: "PANIC: could not open critical system index 2662" - twice
On Wed, May 10, 2023 at 9:32 AM Evgeny Morozov <postgres...@realityexists.net> wrote:
> On 10/05/2023 6:39 am, Kirk Wolak wrote:
> > It could be as simple as creating temp tables in the other database (since
> > I believe pg_class was hit).
>
> We do indeed create temp tables, both in other databases and in the ones
> being tested. (We also create non-temp tables there.)
>
> > Also, not sure if the OP has a set of things done after he creates the DB
> > that may help?
>
> Basically we read rows from the source database, create some partitions of
> tables in the target database, insert into a temp table there using BULK
> COPY, then using a regular INSERT copy from the temp tables to the new
> partitions.
>
> Now that the problem has been reproduced and understood by the PG
> developers, could anyone explain why PG crashed entirely with the "PANIC"
> error back in April when only specific databases were corrupted, not any
> global objects necessary for PG to run? And why did it not crash with the
> "PANIC" on this occasion?

I understand the question as: why would it PANIC on non-system data corruption, but not on system data corruption?

To which my guess is: because system data corruption on startup is probably an anticipated case, and we want to report it and come up as much as possible. Whereas the OTHER code did a PANIC simply because it was BOTH unexpected and NOT in a place where it could move forward - meaning it had no idea if it had read in bad data, or if it had CREATED the bad data.

As a programmer, you will find much more robust checking on startup than in the middle of doing something else. But that's just a guess; someone deeper into the code might explain it better. And you COULD go dig through the source to compare the origination of the error messages?

Kirk...
Re: "PANIC: could not open critical system index 2662" - twice
On 10/05/2023 6:39 am, Kirk Wolak wrote:
> It could be as simple as creating temp tables in the other database
> (since I believe pg_class was hit).

We do indeed create temp tables, both in other databases and in the ones being tested. (We also create non-temp tables there.)

> Also, not sure if the OP has a set of things done after he creates the
> DB that may help?

Basically we read rows from the source database, create some partitions of tables in the target database, insert into a temp table there using BULK COPY, then using a regular INSERT copy from the temp tables to the new partitions.

Now that the problem has been reproduced and understood by the PG developers, could anyone explain why PG crashed entirely with the "PANIC" error back in April when only specific databases were corrupted, not any global objects necessary for PG to run? And why did it not crash with the "PANIC" on this occasion?
Re: "PANIC: could not open critical system index 2662" - twice
On Sun, May 7, 2023 at 10:18 PM Thomas Munro wrote:
> On Mon, May 8, 2023 at 4:10 AM Evgeny Morozov wrote:
> > On 6/05/2023 11:13 pm, Thomas Munro wrote:
> > > Would you like to try requesting FILE_COPY for a while and see if it
> > > eventually happens like that too?
> >
> > Sure, we can try that.
>
> Maybe you could do some one way and some the other, so that we try to
> learn more?
>
> > Ubuntu 18.04.6
> > Kernel 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023
> > x86_64 x86_64 x86_64 GNU/Linux
> > zfsutils-linux package version 0.7.5-1ubuntu16.12 amd64
>
> I tried for a few hours to reproduce this by trying to make as many
> things as similar to yours as I could based on info in this thread
> (Linux: up-to-date Ubuntu 18.04 in vagrant, which has nearly the same
> kernel 4.15.0-208-generic and a close but slightly different version
> of ancient ZFS 0.7.5-1ubuntu15, not sure why; ZFS: mirror (I used a
> pair of loopback files), recordsize=16kB, compression=lz4; PG:
> compiled from tag REL_15_2, data_checksums=on, full_page_writes=off,
> wal_recycle=off, wal_init_zero=off), with what I thought might be
> roughly what you're doing (creating three DBs, two clones of the
> first, with various modifications at various points, with various
> overlapping activities, and then checking for catalog corruption). No
> cigar. Hrmph.

My first thought on this was along the lines of previous comments: we are writing the initial pieces of the files for a new DATABASE, some flag is set, then context is lost, and it ends up that a save happens to an existing DATABASE.

So my thought was a parallel set of UPDATE-type commands that would cause those types of pages for a stable/existing DB to be updated/written. It could be as simple as creating temp tables in the other database (since I believe pg_class was hit).

From a "rare" standpoint, that "feels" about right to me. It would pass serial tests (like you are running). Just a thought. Maybe a simple background script creating temp tables in another DB.

Also, not sure if the OP has a set of things done after he creates the DB that may help?

Kirk
Re: "PANIC: could not open critical system index 2662" - twice
On 8/05/2023 11:04 pm, Andres Freund wrote:
> Are you using any extensions?

Only plpgsql.

> Do you have any chance to figure out what statements were running
> concurrently with the DROP DATABASE?

No. Is there some way to log that, other than just logging all statements (which seems impractical)? Most of the statements running on that server should be inserts into another database, some reads. Something could be creating or dropping partitions in another DB in parallel, too, but not frequently.
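On the logging question: a middle ground between logging nothing and log_statement = 'all' would be to log all DDL plus any statement running longer than a threshold, so that statements concurrent with a slow DROP DATABASE leave a trace. A sketch; the threshold value is illustrative:

```sql
-- Log every DDL statement and anything slower than 5 seconds:
ALTER SYSTEM SET log_statement = 'ddl';
ALTER SYSTEM SET log_min_duration_statement = '5s';
SELECT pg_reload_conf();
```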
Re: "PANIC: could not open critical system index 2662" - twice
On Tue, May 9, 2023 at 3:15 AM Michael Paquier wrote:
> On Mon, May 08, 2023 at 07:15:20PM +0530, Dilip Kumar wrote:
> > I am able to reproduce this using the steps given above, I am also
> > trying to analyze this further. I will send the update once I get
> > some clue.
>
> Have you been able to reproduce this on HEAD or at the top of
> REL_15_STABLE, or is that 15.2-ish without fa5dd46?

I am able to reproduce this on REL_15_STABLE as well with your test case. The only difference is that without fa5dd46 the issue reproduces within a couple of seconds and very consistently, whereas on REL_15_STABLE it takes 2-3 minutes to reproduce and reproduction is not very consistent.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: "PANIC: could not open critical system index 2662" - twice
Hi,

On 2023-05-08 17:46:37 -0700, Andres Freund wrote:
> My current gut feeling is that we should use datconnlimit == -2 to prevent
> connections after reaching DropDatabaseBuffers() in dropdb(), and use a new
> column in 16, for both createdb() and dropdb().

Attached is a rough prototype of that idea (only using datconnlimit == -2 for now).

This would need a fair bit more polish. The tests are crappy, output stuff to stderr and don't validate enough. The error messages are bad. No docs changes. More comments about why datconnlimit == -2 is used. Etc. But I think it should be sufficient to discuss whether this is a viable path.

I guess we need to move this to -hackers. Perhaps I'll post subsequent versions below https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de ?

Greetings,

Andres Freund

From 5784ba0b21eaf05c2989c57ce1f73c13edf0111e Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Mon, 8 May 2023 18:28:33 -0700
Subject: [PATCH v1] Handle interrupted DROP DATABASE

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20230509004637.cgvmfwrbht7xm...@awork3.anarazel.de
Backpatch:
---
 src/include/catalog/pg_database.h           |  2 +-
 src/backend/commands/dbcommands.c           | 50 ++---
 src/backend/utils/init/postinit.c           | 11 +
 src/test/recovery/meson.build               |  1 +
 src/test/recovery/t/037_invalid_database.pl | 48
 5 files changed, 95 insertions(+), 17 deletions(-)
 create mode 100644 src/test/recovery/t/037_invalid_database.pl

diff --git a/src/include/catalog/pg_database.h b/src/include/catalog/pg_database.h
index d004f4dc8aa..5cd972da324 100644
--- a/src/include/catalog/pg_database.h
+++ b/src/include/catalog/pg_database.h
@@ -49,7 +49,7 @@ CATALOG(pg_database,1262,DatabaseRelationId) BKI_SHARED_RELATION BKI_ROWTYPE_OID
 	/* new connections allowed? */
 	bool		datallowconn;

-	/* max connections allowed (-1=no limit) */
+	/* max connections allowed (-1=no limit, -2=invalid database) */
 	int32		datconnlimit;

 	/* all Xids < this are frozen in this DB */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2e242eeff24..a0360d9ad80 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1569,6 +1569,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	bool		db_istemplate;
 	Relation	pgdbrel;
 	HeapTuple	tup;
+	Form_pg_database datform;
 	int			notherbackends;
 	int			npreparedxacts;
 	int			nslots,
@@ -1684,17 +1685,6 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 				 dbname),
 				 errdetail_busy_db(notherbackends, npreparedxacts)));

-	/*
-	 * Remove the database's tuple from pg_database.
-	 */
-	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(db_id));
-	if (!HeapTupleIsValid(tup))
-		elog(ERROR, "cache lookup failed for database %u", db_id);
-
-	CatalogTupleDelete(pgdbrel, &tup->t_self);
-
-	ReleaseSysCache(tup);
-
 	/*
 	 * Delete any comments or security labels associated with the database.
 	 */
@@ -1711,6 +1701,31 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 */
 	dropDatabaseDependencies(db_id);

+	/*
+	 * Tell the cumulative stats system to forget it immediately, too.
+	 */
+	pgstat_drop_database(db_id);
+
+	tup = SearchSysCacheCopy1(DATABASEOID, ObjectIdGetDatum(db_id));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR, "cache lookup failed for database %u", db_id);
+	datform = (Form_pg_database) GETSTRUCT(tup);
+
+	/*
+	 * Subsequent actions are not transactional (consider
+	 * DropDatabaseBuffers() discarding modified buffers). But we might crash
+	 * or get interrupted below. To prevent accessing a database with invalid
+	 * contents, mark the database as invalid using an in-place update.
+	 */
+	datform->datconnlimit = -2;
+	heap_inplace_update(pgdbrel, tup);
+
+	/*
+	 * Also delete the tuple - transactionally. If this transaction commits,
+	 * the row will be gone, but if we fail, dropdb() can be invoked again.
+	 */
+	CatalogTupleDelete(pgdbrel, &tup->t_self);
+
 	/*
 	 * Drop db-specific replication slots.
 	 */
@@ -1723,11 +1738,6 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 */
 	DropDatabaseBuffers(db_id);

-	/*
-	 * Tell the cumulative stats system to forget it immediately, too.
-	 */
-	pgstat_drop_database(db_id);
-
 	/*
 	 * Tell checkpointer to forget any pending fsync and unlink requests for
 	 * files in the database; else the fsyncs will fail at next checkpoint, or
@@ -2339,6 +2349,14 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
 	datform = (Form_pg_database) GETSTRUCT(tuple);
 	dboid = datform->oid;

+	if (datform->datconnlimit == -2)
+	{
+		ereport(FATAL,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("database \"%s\" is invalid", stmt->dbname),
+				errdetail("Use DROP DATABASE to drop invalid databases"));
+	}
+
 	if (!object_ownercheck(DatabaseRelationId, dboid, GetUs
Re: "PANIC: could not open critical system index 2662" - twice
Hi,

On 2023-05-08 14:04:00 -0700, Andres Freund wrote:
> But perhaps a similar approach could be the solution? My gut says that the
> rough direction might allow us to keep dropdb() a single transaction.

I started to hack on the basic approach of committing after the catalog changes. But then I started wondering if we're not tackling this all wrong.

We don't actually want to drop the catalog contents early, as that prevents the database from being dropped again, while potentially leaving behind contents in the case of a crash / interrupt.

Similarly, the fact that we only commit the transaction at the end of createdb() leads to interrupts / crashes orphaning the contents of that [partial] database. We also have similar issues with movedb(), I think. This is a non-transactional operation.

I think we should copy the approach of the CONCURRENTLY operations. Namely, add a new column to pg_database, indicating whether the database contents are valid. An invalid database can be dropped, but not connected to.

Then we could have createdb() commit before starting to create the target database directory (with invalid = true, of course). After all the filesystem level stuff is complete, set invalid = false.

For dropping a database we'd use heap_inplace_update() to set invalid = true just before the DropDatabaseBuffers(), preventing any connections after that point.

Making movedb() safe is probably a bit harder - I think it'd temporarily require two pg_database entries?

Of course we can't add a new column in the back branches. IIRC we had a similar issue with CIC at some point, and just ended up misusing some other column for the back branches? We could e.g. use datconnlimit == -2 for that purpose (but would need to make sure that ALTER DATABASE can't unset it).

My current gut feeling is that we should use datconnlimit == -2 to prevent connections after reaching DropDatabaseBuffers() in dropdb(), and use a new column in 16, for both createdb() and dropdb().

In some ways handling createdb() properly is a new feature, but it's also arguably a bug that we leak the space - and I think the code will be better if we work on both together.

Greetings,

Andres Freund
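Under the back-branch convention floated here (datconnlimit == -2 meaning invalid), leftover half-dropped databases would be easy to spot. A sketch assuming that convention:

```sql
-- List databases marked invalid by an interrupted DROP DATABASE:
SELECT datname
FROM pg_database
WHERE datconnlimit = -2;
```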
Re: "PANIC: could not open critical system index 2662" - twice
On Mon, May 08, 2023 at 06:04:23PM -0400, Tom Lane wrote:
> Andres seems to think it's a problem with aborting a DROP DATABASE.
> Adding more data might serve to make the window wider, perhaps.

And the odds get indeed much better once I use these two toys:

CREATE OR REPLACE FUNCTION create_tables(num_tables int)
RETURNS VOID AS
$func$
BEGIN
  FOR i IN 1..num_tables LOOP
    EXECUTE format('CREATE TABLE IF NOT EXISTS %I (id int)', 't_' || i);
  END LOOP;
END
$func$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION drop_tables(num_tables int)
RETURNS VOID AS
$func$
BEGIN
  FOR i IN 1..num_tables LOOP
    EXECUTE format('DROP TABLE IF EXISTS %I', 't_' || i);
  END LOOP;
END
$func$ LANGUAGE plpgsql;

And then use this loop with the others I have mentioned upthread (just create origindb and the functions in them):

while true; do
  psql -c 'select create_tables(1000)' origindb > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
  psql -c 'select drop_tables(1000)' origindb > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
done;

On top of that, I have also been able to reproduce the issue on HEAD, and luckily some pg_class files remain around, full of zeros:

$ hexdump ./base/199634/1259
000

The contents of 2662, though, looked OK.

Echoing Alvaro.. Could we, err, revisit the choice of making WAL_LOG the default strategy even for this set of minor releases? FWIW, I've mentioned that this choice was too aggressive in the thread of 8a86618..

--
Michael
Re: "PANIC: could not open critical system index 2662" - twice
On Tue, May 9, 2023 at 10:04 AM Tom Lane wrote:
> Michael Paquier writes:
> > One thing I was wondering about to improve the odds of the hits is to
> > be more aggressive with the number of relations created at once, so as
> > we are much more aggressive with the number of pages extended in
> > pg_class from the origin database.
>
> Andres seems to think it's a problem with aborting a DROP DATABASE.
> Adding more data might serve to make the window wider, perhaps.

Here's an easy way:

@@ -1689,6 +1689,14 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	/* Close all smgr fds in all backends. */
 	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

+	/* XXX pretend one of the above steps got interrupted by a statement timeout or ^C */
+	if (random() % 2 == 0)
+	{
+		QueryCancelPending = true;
+		InterruptPending = true;
+		CHECK_FOR_INTERRUPTS();
+	}

postgres=# create database db2;
CREATE DATABASE
postgres=# drop database db2;
ERROR:  canceling statement due to user request

$ psql db2
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC:  could not open critical system index 2662

$ od -t x1 base/17/2662
000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
010
$ od -t x1 base/17/2837
000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
004
$ od -t x1 base/17/2840
000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
010
Re: "PANIC: could not open critical system index 2662" - twice
Michael Paquier writes:
> One thing I was wondering about to improve the odds of the hits is to
> be more aggressive with the number of relations created at once, so as
> we are much more aggressive with the number of pages extended in
> pg_class from the origin database.

Andres seems to think it's a problem with aborting a DROP DATABASE. Adding more data might serve to make the window wider, perhaps.

regards, tom lane
Re: "PANIC: could not open critical system index 2662" - twice
On Mon, May 08, 2023 at 07:15:20PM +0530, Dilip Kumar wrote:
> I am able to reproduce this using the steps given above, I am also
> trying to analyze this further. I will send the update once I get
> some clue.

Have you been able to reproduce this on HEAD or at the top of REL_15_STABLE, or is that 15.2-ish without fa5dd46?

One thing I was wondering about to improve the odds of the hits is to be more aggressive with the number of relations created at once, so as we are much more aggressive with the number of pages extended in pg_class from the origin database.

--
Michael
Re: "PANIC: could not open critical system index 2662" - twice
Hi,

On 2023-05-08 20:27:14 +, Evgeny Morozov wrote:
> On 8/05/2023 9:47 pm, Andres Freund wrote:
> > Did you have any occasions where CREATE or DROP DATABASE was interrupted?
> > Either due the connection being terminated or a crash?
>
> I've uploaded an edited version of the PG log for the time as
> https://objective.realityexists.net/temp/log-extract-2023-05-02.txt
> (test_behavior_638186279733138190 and test_behavior_638186280406544656
> are the two DBs that got corrupted).
>
> I cannot see any crash in the test logs or the PG logs, but whether it
> was interrupted is less clear. I don't know whether the tests ran
> successfully up to the point where they tried to drop the DBs (I've
> since added logging to show that next time), but DROP DATABASE did not
> return after 30 seconds and the client library (Npgsql) then tried to
> cancel the requests. We then tried to drop the DB again, with the same
> results in both cases. After the second attempts timed out we closed the
> connections anyway - so maybe that was the interruption?

Yep, that is the interruption. I suspect that something was not processing interrupts, which then led the WaitForProcSignalBarrier() in dropdb() to block. Are you using any extensions? Do you have any chance to figure out what statements were running concurrently with the DROP DATABASE?

Seems we figured out what the problem is... Just need to figure out how to fix it.

I think the minimal fix will entail moving the commit of the transaction into dropdb(). We need to commit before executing DropDatabaseBuffers(), as the database is inconsistent after that point. Luckily DROP DATABASE already can't be executed in a transaction, so that's easily possible. That means we'd lose track of the files for the database in case of a crash, but that's surely better than a corrupt database. And it's already the case for CREATE DATABASE.

Additionally we probably need to prevent processing cancellations between DropDatabaseBuffers() and remove_dbtablespaces(). But we can't just stop accepting interrupts, because that'd break WaitForProcSignalBarrier(). Ah, what fun. Unfortunately QueryCancelHoldoffCount is not sufficient for this purpose - we don't just need to prevent cancellations, we also can't accept terminations either...

For a bit I thought we might be able to just reorder operations, moving the WaitForProcSignalBarrier() earlier. Unfortunately I don't think that works, because until DropDatabaseBuffers() has executed, backends might write such buffers back, so we need to do WaitForProcSignalBarrier() after that.

I've toyed with the idea of splitting DropDatabaseBuffers() into two. In a first pass, mark all buffers as IO_IN_PROGRESS (plus perhaps some other flag). That'd prevent new IO from being started for those buffers. Then WaitForProcSignalBarrier(), to make sure nobody has an open FD for those files. Then actually drop all those buffers and unlink the files. That now can happen with interrupts held, without any chance of being blocked, afaict. In case of being cancelled during WaitForProcSignalBarrier(), AbortBufferIO() would remove IO_IN_PROGRESS, and everything would be fine.

I don't like the idea of WaitForProcSignalBarrier() while having buffers marked as in-progress. I don't *immediately* see a deadlock danger, but I'd be unsurprised if there were some.

But perhaps a similar approach could be the solution? My gut says that the rough direction might allow us to keep dropdb() a single transaction.

Greetings,

Andres Freund
Re: "PANIC: could not open critical system index 2662" - twice
On 8/05/2023 9:47 pm, Andres Freund wrote:
> Did you have any occasions where CREATE or DROP DATABASE was interrupted?
> Either due the connection being terminated or a crash?

I've uploaded an edited version of the PG log for the time as https://objective.realityexists.net/temp/log-extract-2023-05-02.txt (test_behavior_638186279733138190 and test_behavior_638186280406544656 are the two DBs that got corrupted).

I cannot see any crash in the test logs or the PG logs, but whether it was interrupted is less clear. I don't know whether the tests ran successfully up to the point where they tried to drop the DBs (I've since added logging to show that next time), but DROP DATABASE did not return after 30 seconds and the client library (Npgsql) then tried to cancel the requests. We then tried to drop the DB again, with the same results in both cases. After the second attempts timed out we closed the connections anyway - so maybe that was the interruption?

> As described in
> https://postgr.es/m/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
> we don't handle that correctly for DROP DATABASE.
>
> I think that might actually fit the symptoms - the DropDatabaseBuffers() will
> throw away the dirty buffer contents from the WAL strategy CREATE DATABASE,
> but if you then get cancelled at a point before all the files are removed, the
> on-disk files with all-zeroes would remain.

Oooh, that does seem to fit! Thank you for digging that up.
Re: "PANIC: could not open critical system index 2662" - twice
Hi,

On 2023-05-07 16:10:28 +, Evgeny Morozov wrote:
> Yes, kind of. We have a test suite that creates one test DB and runs a
> bunch of tests on it. Two of these tests, however, create another DB
> each (also by cloning the same template DB) in order to test copying
> data between DBs. It's only these "extra" DBs that were corrupted, at
> least on this occasion.

Did you have any occasions where CREATE or DROP DATABASE was interrupted? Either due to the connection being terminated or a crash?

As described in https://postgr.es/m/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de we don't handle that correctly for DROP DATABASE.

I think that might actually fit the symptoms - the DropDatabaseBuffers() will throw away the dirty buffer contents from the WAL strategy CREATE DATABASE, but if you then get cancelled at a point before all the files are removed, the on-disk files with all-zeroes would remain.

Greetings,

Andres Freund
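The on-disk symptom described here - relation files left behind full of zeroes - can be checked for directly. A minimal sketch (paths are assumptions): run it against base/&lt;db-oid&gt; of a stopped or quiesced cluster; here a throwaway directory with hypothetical file names stands in.

```shell
# scan_zero_files DIR: print paths of non-empty files that contain only zero bytes.
scan_zero_files() {
    find "$1" -type f -size +0c | while read -r f; do
        # 'od -An -v -t x1' prints every byte; a zero-filled file yields
        # only lines of " 00 00 ...", so grep -v finds no other lines.
        if ! od -An -v -t x1 "$f" | grep -qv '^\( 00\)*$'; then
            echo "all zeroes: $f"
        fi
    done
}

demo=$(mktemp -d)                                            # stand-in for base/<db-oid>
dd if=/dev/zero of="$demo/2662" bs=1024 count=8 2>/dev/null  # zero-filled, like the corrupt index
printf 'real content' > "$demo/1259"                         # healthy file for contrast
scan_zero_files "$demo"                                      # should flag only the 2662 file
```

This is the same check the od/hexdump output elsewhere in the thread performs by hand, just automated over a whole directory.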
Re: "PANIC: could not open critical system index 2662" - twice
Alvaro Herrera writes:
> Maybe it would be sensible to make STRATEGY=FILE_COPY the default
> again, for branch 15, before today's release.

If we had more than one such report, I'd be in favor of that. But I think it's a bit premature to conclude that the copy strategy is to blame.

regards, tom lane
Re: "PANIC: could not open critical system index 2662" - twice
On Mon, May 8, 2023 at 7:55 AM Michael Paquier wrote:
> On Sun, May 07, 2023 at 10:30:52PM +1200, Thomas Munro wrote:
> > Bug-in-PostgreSQL explanations could include that we forgot it was
> > dirty, or some backend wrote it out to the wrong file; but if we were
> > forgetting something like permanent or dirty, would there be a more
> > systematic failure? Oh, it could require special rare timing if it is
> > similar to 8a8661828's confusion about permanence level or otherwise
> > somehow not setting BM_PERMANENT, but in the target blocks, so I think
> > that'd require a checkpoint AND a crash. It doesn't reproduce for me,
> > but perhaps more unlucky ingredients are needed.
> >
> > Bug-in-OS/FS explanations could include that a whole lot of writes
> > were mysteriously lost in some time window, so all those files still
> > contain the zeroes we write first in smgrextend(). I guess this
> > previously rare (previously limited to hash indexes?) use of sparse
> > file hole-punching could be a factor in an it's-all-ZFS's-fault
> > explanation:
>
> Yes, you would need a bit of all that.
>
> I can reproduce the same backtrace here. That's just my usual laptop
> with ext4, so this would be a Postgres bug. First, here are the four
> things running in parallel so as I can get a failure in loading a
> critical index when connecting:
>
> 1) Create and drop a database with WAL_LOG as strategy and the
> regression database as template:
> while true; do
>   createdb --template=regression --strategy=wal_log testdb;
>   dropdb testdb;
> done
>
> 2) Feeding more data to pg_class in the middle, while testing the
> connection to the database created:
> while true; do
>   psql -c 'create table popo as select 1 as a;' regression > /dev/null 2>&1 ;
>   psql testdb -c "select 1" > /dev/null 2>&1 ;
>   psql -c 'drop table popo' regression > /dev/null 2>&1 ;
>   psql testdb -c "select 1" > /dev/null 2>&1 ;
> done;
>
> 3) Force some checkpoints:
> while true; do psql -c 'checkpoint' > /dev/null 2>&1; sleep 4; done
>
> 4) Force a few crashes and recoveries:
> while true ; do pg_ctl stop -m immediate ; pg_ctl start ; sleep 4 ; done

I am able to reproduce this using the steps given above, and I am also trying to analyze this further. I will send the update once I get some clue.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: "PANIC: could not open critical system index 2662" - twice
On 2023-May-07, Thomas Munro wrote:
> Did you previously run this same workload on versions < 15 and never
> see any problem? 15 gained a new feature CREATE DATABASE ...
> STRATEGY=WAL_LOG, which is also the default. I wonder if there is a
> bug somewhere near that, though I have no specific idea. If you
> explicitly added STRATEGY=FILE_COPY to your CREATE DATABASE commands,
> you'll get the traditional behaviour.

Maybe it would be sensible to make STRATEGY=FILE_COPY the default again, for branch 15, before today's release.

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem."  (Tom Lane)
Re: "PANIC: could not open critical system index 2662" - twice
On 8/05/2023 4:24 am, Michael Paquier wrote: > here are the four things running in parallel so as I can get a failure > in loading a critical index when connecting Wow, that is some amazing detective work! We do indeed create tables during our tests, specifically partitions of tables copied from the template DB. Checkpoints seem to be happening every few minutes (we don't force them, but there is a big DB with more writes on the same instance - it's probably due to that). PG is not crashing in our case, though - not this time. Do you have any idea why the "drop database" command would have timed out (not completed after 30 seconds) for the corrupted DBs? On 8/05/2023 4:17 am, Thomas Munro wrote: > Maybe you could do some one way and some the other, so that we try to > learn more? Do you still want me to try this given what Michael has found? Or anything else to help narrow this down?
Re: "PANIC: could not open critical system index 2662" - twice
On Mon, May 08, 2023 at 02:46:37PM +1200, Thomas Munro wrote:
> That sounds like good news, but I'm still confused: do you see all 0s
> in the target database (popo)'s catalogs, as reported (and if so can
> you explain how they got there?), or is it regression that is
> corrupted in more subtle ways also involving 1s?

Nope, I have not been able to confirm that yet without 8a86618. The
test runs at a high frequency, so that's kind of hard to catch. I have
not been able to get things in a state where I could look at a FPW for
pg_class or its index, either, in a way similar to Evgeny.
--
Michael
Re: "PANIC: could not open critical system index 2662" - twice
On Mon, May 8, 2023 at 2:24 PM Michael Paquier wrote: > I can reproduce the same backtrace here. That's just my usual laptop > with ext4, so this would be a Postgres bug. First, here are the four > things running in parallel so as I can get a failure in loading a > critical index when connecting: That sounds like good news, but I'm still confused: do you see all 0s in the target database (popo)'s catalogs, as reported (and if so can you explain how they got there?), or is it regression that is corrupted in more subtle ways also involving 1s?
Re: "PANIC: could not open critical system index 2662" - twice
On Sun, May 07, 2023 at 10:30:52PM +1200, Thomas Munro wrote:
> Bug-in-PostgreSQL explanations could include that we forgot it was
> dirty, or some backend wrote it out to the wrong file; but if we were
> forgetting something like permanent or dirty, would there be a more
> systematic failure? Oh, it could require special rare timing if it is
> similar to 8a8661828's confusion about permanence level or otherwise
> somehow not setting BM_PERMANENT, but in the target blocks, so I think
> that'd require a checkpoint AND a crash. It doesn't reproduce for me,
> but perhaps more unlucky ingredients are needed.
>
> Bug-in-OS/FS explanations could include that a whole lot of writes
> were mysteriously lost in some time window, so all those files still
> contain the zeroes we write first in smgrextend(). I guess this
> previously rare (previously limited to hash indexes?) use of sparse
> file hole-punching could be a factor in an it's-all-ZFS's-fault
> explanation:

Yes, you would need a bit of all that.

I can reproduce the same backtrace here. That's just my usual laptop
with ext4, so this would be a Postgres bug. First, here are the four
things running in parallel so as I can get a failure in loading a
critical index when connecting:

1) Create and drop a database with WAL_LOG as strategy and the
regression database as template:
while true; do
  createdb --template=regression --strategy=wal_log testdb;
  dropdb testdb;
done

2) Feeding more data to pg_class in the middle, while testing the
connection to the database created:
while true; do
  psql -c 'create table popo as select 1 as a;' regression > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
  psql -c 'drop table popo' regression > /dev/null 2>&1 ;
  psql testdb -c "select 1" > /dev/null 2>&1 ;
done;

3) Force some checkpoints:
while true; do psql -c 'checkpoint' > /dev/null 2>&1; sleep 4; done

4) Force a few crashes and recoveries:
while true ; do pg_ctl stop -m immediate ; pg_ctl start ; sleep 4 ; done

And here is one backtrace:

(gdb) bt
#0  __pthread_kill_implementation (threadid=, signo=signo@entry=6, no_tid=no_tid@entry=0)
    at ./nptl/pthread_kill.c:44
#1  0x7f8a395cad2f in __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78
#2  0x7f8a3957bef2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x7f8a39566472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x55f0b36b95ac in errfinish (filename=0x55f0b38d4f68 "relcache.c", lineno=4335,
    funcname=0x55f0b38d6500 <__func__.12> "load_critical_index") at elog.c:604
#5  0x55f0b36a9c64 in load_critical_index (indexoid=2662, heapoid=1259) at relcache.c:4335
#6  0x55f0b36a9712 in RelationCacheInitializePhase3 () at relcache.c:4110
(gdb) up 5
#5  0x55f0b36a9c64 in load_critical_index (indexoid=2662, heapoid=1259) at relcache.c:4335
4335            elog(PANIC, "could not open critical system index %u", indexoid);

You can also get failures with btree deduplication because of the CTAS
running above, by the way, but that's just the top of the iceberg once
the state is corrupted:

#0  __pthread_kill_implementation (threadid=, signo=signo@entry=6, no_tid=no_tid@entry=0)
    at ./nptl/pthread_kill.c:44
#1  0x7fcae38abd2f in __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78
#2  0x7fcae385cef2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x7fcae3847472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x5556a809f584 in ExceptionalCondition (conditionName=0x5556a812b639 "false",
    fileName=0x5556a812abc8 "heapam.c", lineNumber=7882) at assert.c:66
#5  0x5556a79e13db in index_delete_sort_cmp (deltid1=0x5556a9682158, deltid2=0x7ffdb62c7088) at heapam.c:7882
#6  0x5556a79e14f6 in index_delete_sort (delstate=0x7ffdb62c8350) at heapam.c:7923
#7  0x5556a79e0cd0 in heap_index_delete_tuples (rel=0x7fcae0f1a9a8, delstate=0x7ffdb62c8350) at heapam.c:7580
#8  0x5556a7a07be4 in table_index_delete_tuples (rel=0x7fcae0f1a9a8, delstate=0x7ffdb62c8350)
    at ../../../../src/include/access/tableam.h:1337
#9  0x5556a7a0a7a3 in _bt_delitems_delete_check (rel=0x7fcae0f231c8, buf=179, heapRel=0x7fcae0f1a9a8,
    delstate=0x7ffdb62c8350) at nbtpage.c:1541
#10 0x5556a7a07545 in _bt_simpledel_pass (rel=0x7fcae0f231c8, buffer=179, heapRel=0x7fcae0f1a9a8,
    deletable=0x7ffdb62c8430, ndeletable=160, newitem=0x5556a9689260,

Anyway, you would be able to see that b3e184a (pretty much the same as
15.2), without 8a86618 included. Once 8a86618 is included, all these
steps are stable in the backend, at least here. Or do we have some
low-hanging fruit with the WAL_LOG strategy? That could always be
possible, of course, but that looks like the same issue to me, just
with a different symptom showing up.
--
Michael
Re: "PANIC: could not open critical system index 2662" - twice
On Mon, May 8, 2023 at 4:10 AM Evgeny Morozov wrote: > On 6/05/2023 11:13 pm, Thomas Munro wrote: > > Would you like to try requesting FILE_COPY for a while and see if it > > eventually happens like that too? > Sure, we can try that. Maybe you could do some one way and some the other, so that we try to learn more? > Ubuntu 18.04.6 > Kernel 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 > x86_64 x86_64 x86_64 GNU/Linux > zfsutils-linux package version 0.7.5-1ubuntu16.12 amd64 I tried for a few hours to reproduce this by trying to make as many things as similar to yours as I could based on info in this thread (Linux: up-to-date Ubuntu 18.04 in vagrant which has nearly the same kernel 4.15.0-208-generic and a close but slightly different version of ancient ZFS 0.7.5-1ubuntu15, not sure why, ZFS: mirror (I used a pair of loopback files), recordsize=16kB, compression=lz4, PG: compiled from tag REL_15_2, data_checksums=on, full_page_writes=off, wal_recycle=off, wal_init_zero=off), with what I thought might be roughly what you're doing (creating three DBs, two clones of the first, with various modification at various points, with various overlapping activities, and then checking for catalog corruption). No cigar. Hrmph.
Re: "PANIC: could not open critical system index 2662" - twice
On 6/05/2023 11:13 pm, Thomas Munro wrote: > Did you previously run this same workload on versions < 15 and never > see any problem? Yes, kind of. We have a test suite that creates one test DB and runs a bunch of tests on it. Two of these tests, however, create another DB each (also by cloning the same template DB) in order to test copying data between DBs. It's only these "extra" DBs that were corrupted, at least on this occasion. (Hard to say about the last time, because that time it all went south and the whole server crashed, and we may have had some residual corruption from bad disks then - who knows.) I'm not sure whether the tests that created the extra DBs existed before we upgraded to PG 15, but we definitely have not seen such problems on PG 13 or 14. > It seems like you have some kind of high frequency testing workload that > creates and tests databases all day long, and just occasionally detects this > corruption. Maybe 10-30 times per day normally, depending on the day. However, I have tried to repro this by running those two specific tests thousands of times in one day, without success. > Would you like to try requesting FILE_COPY for a while and see if it > eventually happens like that too? Sure, we can try that. 
On 7/05/2023 12:30 pm, Thomas Munro wrote:
> your "zfs get all /path/to/pgdata"

PROPERTY              VALUE                  SOURCE
type                  filesystem             -
creation              Mon Mar  6 17:07 2023  -
used                  166G                   -
available             2.34T                  -
referenced            166G                   -
compressratio         2.40x                  -
mounted               yes                    -
quota                 none                   default
reservation           none                   default
recordsize            16K                    local
mountpoint            /default
sharenfs              off                    default
checksum              on                     default
compression           lz4                    received
atime                 off                    inherited from pgdata
devices               on                     default
exec                  off                    inherited from pgdata
setuid                off                    inherited from pgdata
readonly              off                    default
zoned                 off                    default
snapdir               hidden                 default
aclinherit            restricted             default
createtxg             90                     -
canmount              on                     received
xattr                 on                     default
copies                1                      default
version               5                      -
utf8only              off                    -
normalization         none                   -
casesensitivity       sensitive              -
vscan                 off                    default
nbmand                off                    default
sharesmb              off                    default
refquota              none                   default
refreservation        none                   default
primarycache          all                    default
secondarycache        all                    default
usedbysnapshots       199M                   -
usedbydataset         166G                   -
usedbychildren        0B                     -
usedbyrefreservation  0B                     -
logbias               latency                default
dedup                 off                    default
mlslabel              none                   default
sync                  standard               default
dnodesize             legacy                 default
refcompressratio      2.40x                  -
written               64.9M                  -
logicalused           397G                   -
logicalreferenced     397G                   -
volmode               default                default
filesystem_limit      none                   default
snapshot_limit        none                   default
filesystem_count      none                   default
snapshot_count        none                   default
snapdev               hidden                 default
acltype               off                    default
context               none                   default
fscontext             none                   default
defcontext            none                   default
rootcontext           none                   default
relatime              off                    default
redundant_metadata    all                    default
overlay               off                    default

> your postgresql.conf?

We have a bunch of config files, so I tried to get the resulting config
using "select name, setting from pg_settings where source =
'configuration file'" - hopefully that gives what you wanted.

name | setting
Re: "PANIC: could not open critical system index 2662" - twice
On Sun, May 7, 2023 at 1:21 PM Tom Lane wrote:
> Thomas Munro writes:
> > Did you previously run this same workload on versions < 15 and never
> > see any problem? 15 gained a new feature CREATE DATABASE ...
> > STRATEGY=WAL_LOG, which is also the default. I wonder if there is a
> > bug somewhere near that, though I have no specific idea.
>
> Per the release notes I was just writing ...
>
>     Fix potential corruption of the template (source) database after
>     CREATE DATABASE with the STRATEGY WAL_LOG option (Nathan Bossart,
>     Ryo Matsumura)

Hmm. That bug seems to have caused corruption (backwards time travel)
of blocks in the *source* DB's pg_class, by failing to write back
changes. We seem to have zeroed pages in the *target* database, for
all catalogs (apparently everything copied by
RelationCopyStorageUsingBuffer()), even though the template is still
fine. It is as if RelationCopyStorageUsingBuffer() created the
zero-filled file with smgrextend(), but then the buffer data was never
written out even though we memcpy'd it into a buffer and set the
buffer dirty.

Bug-in-PostgreSQL explanations could include that we forgot it was
dirty, or some backend wrote it out to the wrong file; but if we were
forgetting something like permanent or dirty, would there be a more
systematic failure? Oh, it could require special rare timing if it is
similar to 8a8661828's confusion about permanence level or otherwise
somehow not setting BM_PERMANENT, but in the target blocks, so I think
that'd require a checkpoint AND a crash. It doesn't reproduce for me,
but perhaps more unlucky ingredients are needed.

Bug-in-OS/FS explanations could include that a whole lot of writes
were mysteriously lost in some time window, so all those files still
contain the zeroes we write first in smgrextend(). I guess this
previously rare (previously limited to hash indexes?) use of sparse
file hole-punching could be a factor in an it's-all-ZFS's-fault
explanation:

openat(AT_FDCWD,"base/16390/2662",O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC,0600) = 36 (0x24)
openat(AT_FDCWD,"base/1/2662",O_RDWR|O_CLOEXEC,00) = 37 (0x25)
lseek(37,0x0,SEEK_END) = 32768 (0x8000)
lseek(37,0x0,SEEK_END) = 32768 (0x8000)
pwrite(36,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,8192,0x6000) = 8192 (0x2000)  <-- smgrextend(final block)
lseek(36,0x0,SEEK_END) = 32768 (0x8000)

I was trying to think about how I might go about trying to repro the
exact system setup. Evgeny, do you mind sharing your "zfs get all
/path/to/pgdata" (curious to see block size, compression settings,
anything else etc) and your postgresql.conf? And your exact Ubuntu
kernel version and ZFS package versions?
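The hole-punching hypothesis above can be probed from userspace. This is not from the thread — just a rough Python sketch (the function names are mine) that compares allocated blocks against file size and walks Linux lseek(2) SEEK_DATA/SEEK_HOLE to see whether a relation file is actually sparse:

```python
import os

def is_sparse(path: str) -> bool:
    """True if fewer blocks are allocated than the apparent size needs."""
    st = os.stat(path)
    return st.st_blocks * 512 < st.st_size

def sparse_regions(path: str):
    """Yield (offset, length) of data extents; the gaps between them are holes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                data_start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:  # ENXIO: only holes remain past this offset
                break
            hole_start = os.lseek(fd, data_start, os.SEEK_HOLE)
            yield (data_start, hole_start - data_start)
            offset = hole_start
    finally:
        os.close(fd)
```

On a file like base/16390/2662 above, a fully dense file yields a single extent covering the whole file, while a hole-punched one shows gaps (support for SEEK_DATA/SEEK_HOLE varies by filesystem and kernel).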
Re: "PANIC: could not open critical system index 2662" - twice
Thomas Munro writes:
> Did you previously run this same workload on versions < 15 and never
> see any problem? 15 gained a new feature CREATE DATABASE ...
> STRATEGY=WAL_LOG, which is also the default. I wonder if there is a
> bug somewhere near that, though I have no specific idea.

Per the release notes I was just writing ...

    Fix potential corruption of the template (source) database after
    CREATE DATABASE with the STRATEGY WAL_LOG option (Nathan Bossart,
    Ryo Matsumura)

    Improper buffer handling created a risk that any later modification
    of the template's pg_class catalog would be lost.

The comment about only pg_class being affected is my interpretation of
what the commit message said, but I might have misunderstood.

            regards, tom lane
Re: "PANIC: could not open critical system index 2662" - twice
On Sun, May 7, 2023 at 10:23 AM Jeffrey Walton wrote: > This may be related... I seem to recall the GNUlib folks talking about > a cp bug on sparse files. It looks like it may be fixed in coreutils > release 9.2 (2023-03-20): > https://github.com/coreutils/coreutils/blob/master/NEWS#L233 > > If I recall correctly, it had something to do with the way > copy_file_range worked. (Or maybe, it did not work as expected). > > According to the GNUlib docs > (https://www.gnu.org/software/gnulib/manual/html_node/copy_005ffile_005frange.html): > > This function has many problems on Linux > kernel versions before 5.3 That's quite interesting, thanks (we've been talking about making direct use of copy_file_range() in a few threads, I'll definitely be looking into that history), but we don't currently use copy_file_range() or any coreutils stuff in the relevant code paths here -- this data is copied by plain old pread() and pwrite().
Re: "PANIC: could not open critical system index 2662" - twice
On Sat, May 6, 2023 at 6:35 AM Thomas Munro wrote:
>
> On Sat, May 6, 2023 at 9:58 PM Evgeny Morozov wrote:
> > Right - I should have realised that! base/1414389/2662 is indeed all
> > nulls, 32KB of them. I included the file anyway in
> > https://objective.realityexists.net/temp/pgstuff2.zip
>
> OK so it's not just page 0, you have 32KB or 4 pages of all zeroes.
> That's the expected length of that relation when copied from the
> initial template, and consistent with the pg_waldump output (it uses
> FPIs to copy blocks 0-3). We can't see the block contents but we know
> that block 2 definitely is not all zeroes at that point because there
> are various modifications to it, which not only write non-zeroes but
> must surely have required a sane page 0.
>
> So it does indeed look like something unknown has replaced 32KB of
> data with 32KB of zeroes underneath us.

This may be related... I seem to recall the GNUlib folks talking about
a cp bug on sparse files. It looks like it may be fixed in coreutils
release 9.2 (2023-03-20):
https://github.com/coreutils/coreutils/blob/master/NEWS#L233

If I recall correctly, it had something to do with the way
copy_file_range worked. (Or maybe, it did not work as expected).

According to the GNUlib docs
(https://www.gnu.org/software/gnulib/manual/html_node/copy_005ffile_005frange.html):

    This function has many problems on Linux
    kernel versions before 5.3

> Are there more non-empty
> files that are all-zeroes? Something like this might find them:
>
> for F in base/1414389/*
> do
>   if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
>   then
>     echo $F
>   fi
> done
Re: "PANIC: could not open critical system index 2662" - twice
On Sun, May 7, 2023 at 12:29 AM Evgeny Morozov wrote:
> On 6/05/2023 12:34 pm, Thomas Munro wrote:
> > So it does indeed look like something unknown has replaced 32KB of
> > data with 32KB of zeroes underneath us. Are there more non-empty
> > files that are all-zeroes? Something like this might find them:
> >
> > for F in base/1414389/*
> > do
> >   if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
> >   then
> >     echo $F
> >   fi
> > done
>
> Yes, a total of 309 files are all-zeroes (and 52 files are not).
>
> I also checked the other DB that reports the same "unexpected zero page
> at block 0" error, "test_behavior_638186280406544656" (OID 1414967) -
> similar story there. I uploaded the lists of zeroed and non-zeroed files
> and the ls -la output for both as
> https://objective.realityexists.net/temp/pgstuff3.zip
>
> I then searched recursively such all-zeroes files in $PGDATA/base and
> did not find any outside of those two directories (base/1414389 and
> base/1414967). None in $PGDATA/global, either.

So "diff -u zeroed-files-1414967.txt zeroed-files-1414389.txt" shows
that they have the same broken stuff in the range cloned from the
template database by CREATE DATABASE STRATEGY=WAL_LOG, and it looks
like it's *all* the cloned catalogs, and then they have some
non-matching relfilenodes > 140, presumably stuff you created directly
in the new database (I'm not sure if I can say for sure that those
files are broken, without knowing what they are).

Did you previously run this same workload on versions < 15 and never
see any problem? 15 gained a new feature CREATE DATABASE ...
STRATEGY=WAL_LOG, which is also the default. I wonder if there is a
bug somewhere near that, though I have no specific idea. If you
explicitly added STRATEGY=FILE_COPY to your CREATE DATABASE commands,
you'll get the traditional behaviour.
It seems like you have some kind of high frequency testing workload that creates and tests databases all day long, and just occasionally detects this corruption. Would you like to try requesting FILE_COPY for a while and see if it eventually happens like that too? My spidey sense is leaning away from filesystem bugs. We've found plenty of filesystem bugs on these mailing lists over the years and of course it's not impossible, but I dunno... it seems quite suspicious that all the system catalogs have apparently been wiped during or moments after the creation of a new database that's running new PostgreSQL 15 code...
Re: "PANIC: could not open critical system index 2662" - twice
On 6/05/2023 12:34 pm, Thomas Munro wrote:
> So it does indeed look like something unknown has replaced 32KB of
> data with 32KB of zeroes underneath us. Are there more non-empty
> files that are all-zeroes? Something like this might find them:
>
> for F in base/1414389/*
> do
>   if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
>   then
>     echo $F
>   fi
> done

Yes, a total of 309 files are all-zeroes (and 52 files are not).

I also checked the other DB that reports the same "unexpected zero page
at block 0" error, "test_behavior_638186280406544656" (OID 1414967) -
similar story there. I uploaded the lists of zeroed and non-zeroed files
and the ls -la output for both as
https://objective.realityexists.net/temp/pgstuff3.zip

I then searched recursively such all-zeroes files in $PGDATA/base and
did not find any outside of those two directories (base/1414389 and
base/1414967). None in $PGDATA/global, either.
Re: "PANIC: could not open critical system index 2662" - twice
On Sat, May 6, 2023 at 9:58 PM Evgeny Morozov wrote:
> Right - I should have realised that! base/1414389/2662 is indeed all
> nulls, 32KB of them. I included the file anyway in
> https://objective.realityexists.net/temp/pgstuff2.zip

OK so it's not just page 0, you have 32KB or 4 pages of all zeroes.
That's the expected length of that relation when copied from the
initial template, and consistent with the pg_waldump output (it uses
FPIs to copy blocks 0-3). We can't see the block contents but we know
that block 2 definitely is not all zeroes at that point because there
are various modifications to it, which not only write non-zeroes but
must surely have required a sane page 0.

So it does indeed look like something unknown has replaced 32KB of
data with 32KB of zeroes underneath us. Are there more non-empty
files that are all-zeroes? Something like this might find them:

for F in base/1414389/*
do
  if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
  then
    echo $F
  fi
done
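The xxd/grep pipeline above spawns two processes per file; the same check can be done directly. A rough Python sketch, not from the thread (function names are mine; 8192 bytes is PostgreSQL's default block size):

```python
import os

PAGE_SIZE = 8192  # PostgreSQL's default block size

def is_all_zeroes(path: str) -> bool:
    """True if the file is non-empty and every byte in it is zero."""
    if os.path.getsize(path) == 0:
        return False  # mirror the [ -s $F ] test: skip empty files
    with open(path, "rb") as f:
        while chunk := f.read(PAGE_SIZE):
            if chunk.count(0) != len(chunk):
                return False
    return True

def find_zeroed_relations(dbdir: str) -> list[str]:
    """List non-empty, all-zero files directly under a database directory."""
    return sorted(
        name for name in os.listdir(dbdir)
        if os.path.isfile(os.path.join(dbdir, name))
        and is_all_zeroes(os.path.join(dbdir, name))
    )
```

For example, `find_zeroed_relations("base/1414389")` should report the same 309 files the shell loop found.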
Re: "PANIC: could not open critical system index 2662" - twice
On 6/05/2023 1:06 am, Thomas Munro wrote:
> Next can you share the file base/1414389/2662? ("5" was from the wrong
> database.)

Right - I should have realised that! base/1414389/2662 is indeed all
nulls, 32KB of them. I included the file anyway in
https://objective.realityexists.net/temp/pgstuff2.zip

> Try something like:
> pg_waldump -R 1663/1414389/2662 -F main 00010001
> 00010007
>
> ... but change that to the range of files you have in your pg_wal.

The PG server had only 2 files left from today, and pg_waldumping them
gave this error:

pg_waldump -R 1663/1414389/2662 -F main 00010065005E 00010065005F
pg_waldump: error: error in WAL record at 65/5F629838: invalid record length at 65/5F62A1E0: wanted 24, got 0

Not sure if that's something to worry about or not!

Then I realised we're actually archiving our WAL files with pgBackRest,
retrieved the WAL files for the time the DB was created and used (~12:39
UTC on 2023-05-02) and re-ran pg_waldump on those.

pg_waldump -R 1663/1414389/2662 -F main 0001005B 0001005B000F
rmgr: XLOG  len (rec/tot): 51/  108, tx: 242382, lsn: 5B/0222BC68, prev 5B/0222BC38, desc: FPI , blkref #0: rel 1663/1414389/2662 blk 0 FPW
rmgr: XLOG  len (rec/tot): 51/ 3224, tx: 242382, lsn: 5B/0222BCD8, prev 5B/0222BC68, desc: FPI , blkref #0: rel 1663/1414389/2662 blk 1 FPW
rmgr: XLOG  len (rec/tot): 51/ 1544, tx: 242382, lsn: 5B/0222C988, prev 5B/0222BCD8, desc: FPI , blkref #0: rel 1663/1414389/2662 blk 2 FPW
rmgr: XLOG  len (rec/tot): 49/  121, tx: 242382, lsn: 5B/0222CF90, prev 5B/0222C988, desc: FPI , blkref #0: rel 1663/1414389/2662 blk 3 FPW
rmgr: Btree len (rec/tot): 64/   64, tx: 242384, lsn: 5B/02321870, prev 5B/023217A0, desc: INSERT_LEAF off 132, blkref #0: rel 1663/1414389/2662 blk 2
rmgr: Btree len (rec/tot): 64/   64, tx: 242384, lsn: 5B/02322640, prev 5B/02322570, desc: INSERT_LEAF off 133, blkref #0: rel 1663/1414389/2662 blk 2
... (many more entries like that)
pg_waldump: error: error in WAL record at 5B/F38: missing contrecord at 5B/F70

The full output is also in https://objective.realityexists.net/temp/pgstuff2.zip
Re: "PANIC: could not open critical system index 2662" - twice
On Fri, May 5, 2023 at 7:50 PM Evgeny Morozov wrote:
> The OID of the bad DB ('test_behavior_638186279733138190') is 1414389 and
> I've uploaded base/1414389/pg_filenode.map and also base/5/2662 (in case
> that's helpful) as https://objective.realityexists.net/temp/pgstuff1.zip

Thanks. That pg_filenode.map looks healthy to me.

tmunro@build1:~/junk $ od -t x1 pg_filenode.map
0000000  17 27 59 00 11 00 00 00  eb 04 00 00 eb 04 00 00
0000020  e1 04 00 00 e1 04 00 00  e7 04 00 00 e7 04 00 00
0000040  df 04 00 00 df 04 00 00  14 0b 00 00 14 0b 00 00
0000060  15 0b 00 00 15 0b 00 00  4b 10 00 00 4b 10 00 00
0000100  4c 10 00 00 4c 10 00 00  82 0a 00 00 82 0a 00 00
0000120  83 0a 00 00 83 0a 00 00  8f 0a 00 00 8f 0a 00 00
0000140  90 0a 00 00 90 0a 00 00  62 0a 00 00 62 0a 00 00
0000160  63 0a 00 00 63 0a 00 00  66 0a 00 00 66 0a 00 00
...

hex(2662) is 0xa66, and we see 66 0a 00 00 followed by 66 0a 00 00 in
that last line as expected, so that rules out the idea that it's
somehow trashed that map file and points to the wrong relation file.

Next can you share the file base/1414389/2662? ("5" was from the wrong
database.)

> > Maybe you still have enough WAL if it happened recently?
>
> Maybe! What should I do with pg_waldump? I've never used it before.

Try something like:

pg_waldump -R 1663/1414389/2662 -F main 00010001 00010007

... but change that to the range of files you have in your pg_wal.
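The manual od(1) decoding above can be scripted. A hedged Python sketch of a pg_filenode.map reader — assuming the PostgreSQL 15 on-disk layout visible in the dump (int32 magic 0x592717, int32 mapping count, then little-endian OID/filenode pairs); the trailing CRC is not verified here:

```python
import struct

RELMAPPER_FILEMAGIC = 0x592717  # magic seen as "17 27 59 00" in the od output

def parse_filenode_map(data: bytes) -> dict[int, int]:
    """Return {relation OID: filenode} from pg_filenode.map contents."""
    magic, num_mappings = struct.unpack_from("<ii", data, 0)
    if magic != RELMAPPER_FILEMAGIC:
        raise ValueError(f"bad relmapper magic: {magic:#x}")
    # Mappings start at offset 8: pairs of unsigned 32-bit (mapoid, mapfilenode).
    values = struct.unpack_from("<%dI" % (2 * num_mappings), data, 8)
    return dict(zip(values[0::2], values[1::2]))

# The first od line above decodes to: magic 0x592717, 17 mappings, then
# pg_class (1259 -> 1259); a healthy map should show 2662 -> 2662 unless
# the index was ever rewritten, in which case the filenode differs.
```

Running it over the uploaded base/1414389/pg_filenode.map would confirm the same conclusion as the hex dump: OID 2662 still maps to filenode 2662.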
Re: "PANIC: could not open critical system index 2662" - twice
On 5/05/2023 10:38 am, Andrew Gierth wrote:
> sudo -u postgres psql -w -p 5434 -d "options='-P'"
> (make that -d "dbname=whatever options='-P'" if you need to specify
> some database name; or use PGOPTIONS="-P" in the environment.)

Thanks, good to know! Unfortunately that also fails:

# sudo -u postgres psql -w -p 5434 -d "dbname=test_behavior_638186279733138190 options='-P'"
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: could not open relation with OID 2964

(PG server log contains the same error message and no further details.)

select 2964::regclass returns "pg_db_role_setting" FWIW.
Re: "PANIC: could not open critical system index 2662" - twice
> "Evgeny" == Evgeny Morozov writes:

 Evgeny> Indeed, I cannot get that far due to the same error. I read
 Evgeny> about ignore_system_indexes, but...

 Evgeny> # sudo -u postgres psql -w -p 5434 -c "set ignore_system_indexes=on";
 Evgeny> ERROR: parameter "ignore_system_indexes" cannot be set after
 Evgeny> connection start

sudo -u postgres psql -w -p 5434 -d "options='-P'"

(make that -d "dbname=whatever options='-P'" if you need to specify
some database name; or use PGOPTIONS="-P" in the environment.)

--
Andrew (irc:RhodiumToad)
Re: "PANIC: could not open critical system index 2662" - twice
On 5/05/2023 2:02 am, Thomas Munro wrote:
> On Fri, May 5, 2023 at 11:15 AM Thomas Munro wrote:
>> What does select
>> pg_relation_filepath('pg_class_oid_index') show in the corrupted
>> database, base/5/2662 or something else?
> Oh, you can't get that far, but perhaps you could share the
> pg_filenode.map file?

Hi Thomas, thanks very much for looking into this! Indeed, I cannot get
that far due to the same error. I read about ignore_system_indexes,
but...

# sudo -u postgres psql -w -p 5434 -c "set ignore_system_indexes=on";
ERROR: parameter "ignore_system_indexes" cannot be set after connection
start

I'm not sure how to set it BEFORE connection start, but without
restarting the server (which I'd rather avoid if I can).

The OID of the bad DB ('test_behavior_638186279733138190') is 1414389
and I've uploaded base/1414389/pg_filenode.map and also base/5/2662 (in
case that's helpful) as
https://objective.realityexists.net/temp/pgstuff1.zip

> Maybe you still have enough WAL if it happened recently?

Maybe! What should I do with pg_waldump? I've never used it before.
Re: "PANIC: could not open critical system index 2662" - twice
On Fri, May 5, 2023 at 11:15 AM Thomas Munro wrote: > What does select > pg_relation_filepath('pg_class_oid_index') show in the corrupted > database, base/5/2662 or something else? Oh, you can't get that far, but perhaps you could share the pg_filenode.map file? Or alternatively strace -f PostgreSQL while it's starting up to see which file it's reading, just to be sure. One way to find clues about whether PostgreSQL did something wrong, once we definitely have the right relfilenode for the index, aside from examining its contents, would be to search the WAL for references to that block with pg_waldump. Maybe you still have enough WAL if it happened recently?
Re: "PANIC: could not open critical system index 2662" - twice
On Fri, May 5, 2023 at 11:15 AM Thomas Munro wrote: > Now *that* is a piece of > logic that changed in PostgreSQL 15. It changed from sector-based > atomicity assumptions to a directory entry swizzling trick, in commit > d8cd0c6c95c0120168df93aae095df4e0682a08a. Hmm. I spoke too soon, that only changed in 16. But still, it means there are two files that could be corrupted here, pg_filenode.map which might somehow be pointing to the wrong file, and the relation (index) main fork file.
Re: "PANIC: could not open critical system index 2662" - twice
On Fri, May 5, 2023 at 6:11 AM Evgeny Morozov wrote:
> Meanwhile, what do I do with the existing server, though? Just try to
> drop the problematic DBs again manually?

That earlier link to a FreeBSD thread is surely about bleeding edge new
ZFS stuff that was briefly broken then fixed, being discovered by
people running code imported from OpenZFS master branch into FreeBSD
main branch (ie it's not exactly released, not following the details
but I think it might soon be 2.2?), but you're talking about an LTS
Ubuntu release from 2018, which shipped "ZFS on Linux" version 0.7.5,
unless you installed a newer version somehow? So it doesn't sound like
it could be related. That doesn't mean it couldn't be a different ZFS
bug though. While looking into file system corruption issues that had
similar symptoms on some other file system (which turned out to be a
bug in btrfs) I did bump into a claim that ZFS could produce unexpected
zeroes in some mmap coherency scenario, OpenZFS issue #14548. I don't
immediately see how PostgreSQL could get tangled up with that problem
though, as we aren't doing that...

It seems quite interesting that it's always pg_class_oid_index block 0
(the btree meta-page), which feels more like a PostgreSQL bug, unless
the access pattern of that particular file/block is somehow highly
unusual compared to every other block and tickling bugs elsewhere in
the stack. How does that file look, in terms of size, and how many
pages in it are zero? I think it should be called base/5/2662.

Oooh, but this is a relation that goes through
RelationMapOidToFilenumber. What does select
pg_relation_filepath('pg_class_oid_index') show in the corrupted
database, base/5/2662 or something else? Now *that* is a piece of
logic that changed in PostgreSQL 15. It changed from sector-based
atomicity assumptions to a directory entry swizzling trick, in commit
d8cd0c6c95c0120168df93aae095df4e0682a08a. Hmm.
Re: "PANIC: could not open critical system index 2662" - twice
On 5/4/23 13:10, Evgeny Morozov wrote:
[snip]
> I'm now thinking of setting up a dedicated AWS EC2 instance just for
> these little DBs that get created by our automated tests. If the
> problem happens there as well then that would strongly point towards a
> bug in PostgreSQL, wouldn't it?

Many other people besides you would have noticed regular corruption of
system catalogs.

> Meanwhile, what do I do with the existing server, though? Just try to
> drop the problematic DBs again manually?

*"Fix the hardware"* is what you do.

--
Born in Arizona, moved to Babylonia.
Re: "PANIC: could not open critical system index 2662" - twice
On 4/05/2023 6:42 pm, Laurenz Albe wrote:
> On Thu, 2023-05-04 at 15:49 +, Evgeny Morozov wrote:
>> Well, the problem happened again! Kind of... This time PG has not
>> crashed with the PANIC error in the subject, but pg_dumping certain DBs
>> again fails with
>>
>> pg_dump: error: connection to server on socket
>> "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: index
>> "pg_class_oid_index" contains unexpected zero page at block 0
>
> If you dumped and restored the database after the last time the error
> happened, there must be a systemic problem.

I dumped and restored the "real" databases I cared about. The test databases on which the error now happens are new (created 2 days ago).

> Perhaps you have bad hardware, or a problem with a storage driver,
> file system or some other low-level software component.
> It might of course be a PostgreSQL bug too, but it is hard to say
> without a way to reproduce...

I'm now thinking of setting up a dedicated AWS EC2 instance just for these little DBs that get created by our automated tests. If the problem happens there as well then that would strongly point towards a bug in PostgreSQL, wouldn't it? (And if nothing else, at least it won't affect the more important DBs!)

Meanwhile, what do I do with the existing server, though? Just try to drop the problematic DBs again manually?
Re: "PANIC: could not open critical system index 2662" - twice
On Thu, 2023-05-04 at 15:49 +, Evgeny Morozov wrote:
> Well, the problem happened again! Kind of... This time PG has not
> crashed with the PANIC error in the subject, but pg_dumping certain DBs
> again fails with
>
> pg_dump: error: connection to server on socket
> "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: index
> "pg_class_oid_index" contains unexpected zero page at block 0

If you dumped and restored the database after the last time the error happened, there must be a systemic problem. Perhaps you have bad hardware, or a problem with a storage driver, file system or some other low-level software component.

It might of course be a PostgreSQL bug too, but it is hard to say without a way to reproduce...

Yours,
Laurenz Albe
Re: "PANIC: could not open critical system index 2662" - twice
On 14/04/2023 10:42 am, Alban Hertroys wrote:
> Your problem coincides with a thread at freebsd-current with very
> similar data corruption after a recent OpenZFS import: blocks of all
> zeroes, but also missing files. So, perhaps these problems are related?
> Apparently, there was a recent fix for a data corruption issue with the
> block_cloning feature enabled, but people are still seeing corruption even
> when they never enabled that feature.
>
> I couldn’t really find the start of the thread in the archives, so this one
> kind of jumps into the middle of the thread at a relevant-looking point:
>
> https://lists.freebsd.org/archives/freebsd-current/2023-April/003446.html

That thread was a bit over my head, I'm afraid, so I can't say if it's related. I haven't detected any missing files, anyway.

Well, the problem happened again! Kind of... This time PG has not crashed with the PANIC error in the subject, but pg_dumping certain DBs again fails with

pg_dump: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5434" failed: FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0

PG server log contains:

2023-05-03 04:31:49.903 UTC [14724] postgres@test_behavior_638186279733138190 FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-05-03 04:31:49.903 UTC [14724] postgres@test_behavior_638186279733138190 HINT: Please REINDEX it.

The server PID does not change on such a pg_dump attempt, so it appears that only the backend process for the pg_dump connection crashes this time. I don't see any disk errors and there haven't been any OS crashes.

This currently happens for two DBs. Both of them are very small DBs created by automated .NET tests using Npgsql as client. The code creates such a test DB from a template DB and tries to drop it at the end of the test. This times out sometimes and on timeout our test code tries to drop the DB again (while the first drop command is likely still pending on the server). This second attempt to drop the DB also timed out:

[12:40:39] Npgsql.NpgsqlException : Exception while reading from stream
> System.TimeoutException : Timeout during reading attempt
   at Npgsql.NpgsqlConnector.g__ReadMessageLong|194_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult()
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery()
   ...
[12:41:41] (same error again for the same DB)

From looking at old logs it seems like the same thing happened last time (in April) as well. That's quite an unlikely coincidence if a bad disk or bad filesystem was to blame, isn't it? I've tried to reproduce this by re-running those tests over and over, but without success so far.

So what can I do about this? Do I just try to drop those databases again manually? But what about the next time it happens? How do I figure out the cause and prevent this problem?
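For reference, the drop-and-retry sequence described above can be sketched as follows. This is a hypothetical model, not the actual Npgsql test code: `run_sql` is an assumed helper that executes one statement and raises `TimeoutError` on a client-side timeout, and the rename-first step (suggested later in this thread) marks a database that survives a failed DROP as garbage rather than live data.

```python
def drop_test_db(run_sql, dbname, attempts=2):
    """Rename the test database, then try DROP DATABASE a limited number
    of times, tolerating client-side timeouts between attempts."""
    dropped = dbname + "_dropped"
    run_sql(f'ALTER DATABASE "{dbname}" RENAME TO "{dropped}"')
    for _ in range(attempts):
        try:
            run_sql(f'DROP DATABASE "{dropped}"')
            return True
        except TimeoutError:
            # The server may still be executing the previous DROP;
            # retrying here mirrors what the test harness does.
            continue
    return False
```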
Re: "PANIC: could not open critical system index 2662" - twice
> On 14 Apr 2023, at 9:38, Evgeny Morozov wrote:
(…)
> I don't know whether ZFS zero-fills blocks on disk errors. As I
> understood, ZFS should have been able to recover from disk errors (that
> were "unrecoverable" at the hardware level) using the data on the other
> two disks (which did not report any errors). Thus, PG should not have
> seen any corrupted data (if ZFS was working correctly).
> https://unix.stackexchange.com/questions/341614/understanding-the-error-reporting-of-zfs-on-linux
> seems to confirm this. Am I misunderstanding something?

Your problem coincides with a thread at freebsd-current with very similar data corruption after a recent OpenZFS import: blocks of all zeroes, but also missing files. So, perhaps these problems are related? Apparently, there was a recent fix for a data corruption issue with the block_cloning feature enabled, but people are still seeing corruption even when they never enabled that feature.

I couldn’t really find the start of the thread in the archives, so this one kind of jumps into the middle of the thread at a relevant-looking point:

https://lists.freebsd.org/archives/freebsd-current/2023-April/003446.html

Regards,
Alban Hertroys
--
There is always an exception to always.
Re: "PANIC: could not open critical system index 2662" - twice
> Hmm, I am not certain. The block was filled with zeros from your error
> message, and I think such blocks don't trigger a checksum warning.

OK, so data_checksums=on might not have made any difference in this case?

> So if your disk replaces a valid block with zeros (filesystem check
> after crash?), that could explain what you see.

If by "crash" here you mean the OS crashing - we didn't have that happen. The OS is on separate disks, which have not reported any errors.

When we first ran into this problem the PG data was on a ZFS RAIDZ (i.e. RAID5) volume of 3 disks, and for one of them `zpool status -v` reported read, write and checksum error count > 0, but it also said "errors: No known data errors" and the disk status remained "online" (it did not become "faulted" or "offline"). (Now we have the PG data on a ZFS mirror volume of 2 new disks, which have not reported any errors.)

I don't know whether ZFS zero-fills blocks on disk errors. As I understood, ZFS should have been able to recover from disk errors (that were "unrecoverable" at the hardware level) using the data on the other two disks (which did not report any errors). Thus, PG should not have seen any corrupted data (if ZFS was working correctly). https://unix.stackexchange.com/questions/341614/understanding-the-error-reporting-of-zfs-on-linux seems to confirm this. Am I misunderstanding something?
Re: "PANIC: could not open critical system index 2662" - twice
On Fri, 14 Apr 2023, Laurenz Albe wrote:
>So if your disk replaces a valid block with zeros (filesystem check
>after crash?), that could explain what you see.

Oh, I had that happen on a RAID 1 once. One of the two discs had an intermittent error (write I guess) but didn’t fail out of the RAID, and some of the reads from there got zero-filled blocks in some positions. It was a CVS repository so I was able to identify all such blocks in question and restore them from the rsync slave (whose initial population predated the HDD issue). Hm, now that I think about it, it could even have been a read error with subsequent block reassignment. Oh well.

Filesystem issues (ext3, and ext4 without/predating auto_da_alloc, in particular) could be it just as well of course.

bye,
//mirabilos
--
15:41⎜ Somebody write a testsuite for helloworld :-)
Re: "PANIC: could not open critical system index 2662" - twice
On Thu, 2023-04-13 at 19:07 +, Evgeny Morozov wrote:
> On 13/04/2023 5:02 pm, Laurenz Albe wrote:
> > It means that if the error is caused by a faulty disk changing your data,
> > you'll notice as soon as you touch the page.
> >
> > That would perhaps not have made a lot of difference in your case,
> > except that the error message would have been different and proof
> > that the disk was the problem.
>
> OK, but we had data checksums on the whole time. So that means that the
> disk was NOT the problem in our case?

Hmm, I am not certain. The block was filled with zeros from your error message, and I think such blocks don't trigger a checksum warning. So if your disk replaces a valid block with zeros (filesystem check after crash?), that could explain what you see.

Yours,
Laurenz Albe
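The reason an all-zero block can slip past data checksums: PostgreSQL's page-validity check treats a page of all zeroes as a never-initialized "new" page and accepts it without consulting the checksum at all; only non-zero pages are compared against their stored checksum. The sketch below is a simplified Python model of that decision only (it does not implement the real checksum algorithm; `checksum_matches` is a stand-in callback):

```python
BLKSZ = 8192  # default PostgreSQL block size

def page_is_verified(page, checksum_matches):
    """Simplified model of PostgreSQL's page verification.

    An all-zero page is treated as a never-initialized ("new") page and
    passes without any checksum comparison; only non-zero pages are
    checked against their stored checksum (modelled here by the
    checksum_matches callback)."""
    if len(page) != BLKSZ:
        return False
    if page == b"\x00" * BLKSZ:
        return True  # zeroed page: accepted, checksum never consulted
    return checksum_matches(page)
```

So a disk or filesystem that silently replaces a valid block with zeroes produces exactly the "unexpected zero page" symptom without ever raising a checksum warning.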
Re: "PANIC: could not open critical system index 2662" - twice
On 13/04/2023 5:02 pm, Laurenz Albe wrote:
> It means that if the error is caused by a faulty disk changing your data,
> you'll notice as soon as you touch the page.
>
> That would perhaps not have made a lot of difference in your case,
> except that the error message would have been different and proof
> that the disk was the problem.

OK, but we had data checksums on the whole time. So that means that the disk was NOT the problem in our case?
Re: "PANIC: could not open critical system index 2662" - twice
On Thu, 2023-04-13 at 06:56 +, Evgeny Morozov wrote:
> On 12/04/2023 2:35 am, Michael Paquier wrote:
> > initdb does not enable checksums by default, requiring a
> > -k/--data-checksums, so likely this addition comes from your
> > environment.
>
> OK, so then what does that mean for the error in the subject? At what
> point should that problem have been detected by the data checksums?

It means that if the error is caused by a faulty disk changing your data, you'll notice as soon as you touch the page.

That would perhaps not have made a lot of difference in your case, except that the error message would have been different and proof that the disk was the problem.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
Re: "PANIC: could not open critical system index 2662" - twice
On 12/04/2023 2:35 am, Michael Paquier wrote:
> initdb does not enable checksums by default, requiring a
> -k/--data-checksums, so likely this addition comes from your
> environment.

Indeed, turns out we had it in init_db_options.

> However, the docs say "Only
>> data pages are protected by checksums; internal data structures and
>> temporary files are not.", so I guess pg_class_oid_index might be an
>> "internal data structure"?
> pg_class_oid_index is a btree index that relies on 8k on-disk pages
> (default size), so it is subject to the same rules as normal relations
> regarding checksums for the pages flushed to disk, even if it is on a
> catalog.

OK, so then what does that mean for the error in the subject? At what point should that problem have been detected by the data checksums?
Re: "PANIC: could not open critical system index 2662" - twice
On Tue, Apr 11, 2023 at 04:44:54PM +, Evgeny Morozov wrote:
> We have data_checksums=on. (It must be on by default, since I cannot
> find that in our config files anywhere.)

initdb does not enable checksums by default, requiring a -k/--data-checksums, so likely this addition comes from your environment.

> However, the docs say "Only
> data pages are protected by checksums; internal data structures and
> temporary files are not.", so I guess pg_class_oid_index might be an
> "internal data structure"?

pg_class_oid_index is a btree index that relies on 8k on-disk pages (default size), so it is subject to the same rules as normal relations regarding checksums for the pages flushed to disk, even if it is on a catalog.
--
Michael
Re: "PANIC: could not open critical system index 2662" - twice
> No idea about the former, but bad hardware is a good enough explanation.
> As to keeping it from happening: use good hardware.

Alright, thanks, I'll just keep my fingers crossed that it doesn't happen again then!

> Also: Use checksums. PostgreSQL offers data checksums[1]. Some
> filesystems also offer checksums.

We have data_checksums=on. (It must be on by default, since I cannot find that in our config files anywhere.) However, the docs say "Only data pages are protected by checksums; internal data structures and temporary files are not.", so I guess pg_class_oid_index might be an "internal data structure"?

We also have checksum=on for the ZFS dataset on which the data is stored (also the default - we didn't change it). ZFS did detect problems (zpool status reported read, write and checksum errors for one of the old disks), but it also said "errors: No known data errors". I understood that to mean that it recovered from the errors, i.e. wrote the data to different disk blocks or read it from another disk in the pool.
Re: "PANIC: could not open critical system index 2662" - twice
On 2023-04-07 13:04:34 +0200, Laurenz Albe wrote:
> On Thu, 2023-04-06 at 16:41 +, Evgeny Morozov wrote:
> > What can I do to figure out why this is happening and prevent it from
> > happening again?
>
> No idea about the former, but bad hardware is a good enough explanation.
>
> As to keeping it from happening: use good hardware.

Also: Use checksums. PostgreSQL offers data checksums[1]. Some filesystems also offer checksums. This doesn't prevent corruption but at least it will be detected early and can't spread.

hp

[1] For some reason I thought the Debian/Ubuntu packages enabled this by default. But that doesn't seem to be the case.

--
Peter J. Holzer | h...@hjp.at | http://www.hjp.at/
"Story must make more sense than reality." -- Charles Stross, "Creative writing challenge!"
Re: "PANIC: could not open critical system index 2662" - twice
On Fri, Apr 07, 2023 at 01:04:34PM +0200, Laurenz Albe wrote:
> On Thu, 2023-04-06 at 16:41 +, Evgeny Morozov wrote:
>> Could this be a PG bug?
>
> It could be, but data corruption caused by bad hardware is much more likely.

There is no way to be completely sure here, except if we would be able to put our hands on a reproducible test case that would break the cluster so much that we'd get into this state. I don't recall seeing this error pattern in recent history, though.
--
Michael
Re: "PANIC: could not open critical system index 2662" - twice
On Thu, 2023-04-06 at 16:41 +, Evgeny Morozov wrote:
> Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed with this
> error:
>
> 2023-04-05 09:24:03.448 UTC [15227] ERROR: index "pg_class_oid_index"
> contains unexpected zero page at block 0
> [...]
>
> We had the same thing happen about a month ago on a different database on
> the same cluster. For a while PG actually ran OK as long as you didn't
> access that specific DB, but when trying to back up that DB with pg_dump
> it would crash every time. At that time one of the disks hosting the ZFS
> dataset with the PG data directory on it was reporting errors, so we
> thought it was likely due to that.
>
> Unfortunately, before we could replace the disks, PG crashed completely
> and would not start again at all, so I had to rebuild the cluster from
> scratch and restore from pg_dump backups (still onto the old, bad disks).
> Once the disks were replaced (all of them) I just copied the data to them
> using zfs send | zfs receive and didn't bother restoring pg_dump backups
> again - which was perhaps foolish in hindsight.
>
> Well, yesterday it happened again. The server still restarted OK, so I
> took fresh pg_dump backups of the databases we care about (which ran
> fine), rebuilt the cluster and restored the pg_dump backups again - now
> onto the new disks, which are not reporting any problems.
>
> So while everything is up and running now this error has me rather
> concerned. Could the error we're seeing now have been caused by some
> corruption in the PG data that's been there for a month (so it could
> still be attributed to the bad disk), which should now be fixed by
> having restored from backups onto good disks?

Yes, that is entirely possible.

> Could this be a PG bug?

It could be, but data corruption caused by bad hardware is much more likely.

> What can I do to figure out why this is happening and prevent it from
> happening again?

No idea about the former, but bad hardware is a good enough explanation. As to keeping it from happening: use good hardware.

Yours,
Laurenz Albe
"PANIC: could not open critical system index 2662" - twice
Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed with this error:

2023-04-05 09:24:03.448 UTC [15227] ERROR: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 09:24:03.448 UTC [15227] HINT: Please REINDEX it.
...
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.018 UTC [15437] root@test_behavior_638162834106895162 HINT: Please REINDEX it.
... (same error for a few more DBs)
2023-04-05 13:05:25.144 UTC [16965] root@test_behavior_638162855458823077 FATAL: index "pg_class_oid_index" contains unexpected zero page at block 0
2023-04-05 13:05:25.144 UTC [16965] root@test_behavior_638162855458823077 HINT: Please REINDEX it.
...
2023-04-05 13:05:25.404 UTC [17309] root@test_behavior_638162881641031612 PANIC: could not open critical system index 2662
2023-04-05 13:05:25.405 UTC [9372] LOG: server process (PID 17309) was terminated by signal 6: Aborted
2023-04-05 13:05:25.405 UTC [9372] LOG: terminating any other active server processes

We had the same thing happen about a month ago on a different database on the same cluster. For a while PG actually ran OK as long as you didn't access that specific DB, but when trying to back up that DB with pg_dump it would crash every time. At that time one of the disks hosting the ZFS dataset with the PG data directory on it was reporting errors, so we thought it was likely due to that.

Unfortunately, before we could replace the disks, PG crashed completely and would not start again at all, so I had to rebuild the cluster from scratch and restore from pg_dump backups (still onto the old, bad disks). Once the disks were replaced (all of them) I just copied the data to them using zfs send | zfs receive and didn't bother restoring pg_dump backups again - which was perhaps foolish in hindsight.

Well, yesterday it happened again. The server still restarted OK, so I took fresh pg_dump backups of the databases we care about (which ran fine), rebuilt the cluster and restored the pg_dump backups again - now onto the new disks, which are not reporting any problems.

So while everything is up and running now this error has me rather concerned. Could the error we're seeing now have been caused by some corruption in the PG data that's been there for a month (so it could still be attributed to the bad disk), which should now be fixed by having restored from backups onto good disks? Could this be a PG bug? What can I do to figure out why this is happening and prevent it from happening again?

Advice appreciated!