On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote: > daveg <da...@sonic.net> writes: > > Here is the update: the problem happens with vacuum full alone, no reindex > > is needed to trigger it. I updated the script to avoid reindexing after > > vacuum. Over the past two days there are still many ocurrances of this > > error coincident with the vacuum. > > Well, that jives with the assumption that the one case we saw in > the buildfarm was the same thing, because the regression tests were > certainly only doing a VACUUM FULL and not a REINDEX of pg_class. > But it doesn't get us much closer to understanding what's happening. > In particular, it seems to knock out most ideas associated with race > conditions, because the VAC FULL should hold exclusive lock on pg_class > until it's completely done (including index rebuilds). > > I think we need to start adding some instrumentation so we can get a > better handle on what's going on in your database. If I were to send > you a source-code patch for the server that adds some more logging > printout when this happens, would you be willing/able to run a patched > build on your machine?
Yes we can run an instrumented server so long as the instrumentation does not interfere with normal operation. However, scheduling downtime to switch binaries is difficult, and generally needs to be happen on a weekend, but sometimes can be expedited. I'll look into that. > (BTW, just to be perfectly clear ... the "could not find pg_class tuple" > errors always mention index 2662, right, never any other number?) Yes, only index 2662, never any other. I'm attaching a somewhat redacted log for two different databases on the same instance around the time of vacuum full of pg_class in each database. My observations so far are: - the error occurs at commit of vacuum full of pg_class - in these cases error hits autovacuum after it waited for a lock on pg_class - in these two cases there was a new process startup while the vacuum was running. Don't know if this is relevant. - while these hit autovacuum, the error does hit other processs (just not in these sessions). Unknown if autovacuum is a required component. -dg -- David Gould da...@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
NOTES: -- most statements for db c01 in one catalog vacuum session -- database c01 is oid 16404 -- applications run as user 'app' -- vacuums run as user postgres -- somewhat redacted MM:SS.ms pid user log text --------- ----- ---- ----------------------------------------------------- 03:00.350 2716 c01 postgres vacuum full pg_catalog.pg_class; 03:00.507 2866 LOG: process 2866 acquired ShareUpdateExclusiveLock on relation 2604 of database 16416 after 10143.750 ms 03:01.222 21685 c01 app 11100.670 ms 03:01.223 21685 c01 app SELECT DISTINCT 03:01.224 21685 c01 app 1.524 ms 03:02.031 3001 c01 app connection authorized: user=app database=c01 03:04.952 2787 LOG: process 2787 still waiting for AccessShareLock on relation 1259 of database 16412 after 5001.035 ms 03:05.065 2787 LOG: process 2787 acquired AccessShareLock on relation 1259 of database 16412 after 5114.253 ms 03:05.564 2977 LOG: process 2977 still waiting for AccessShareLock on relation 1259 of database 16404 after 5000.970 ms 03:05.640 2731 LOG: process 2731 still waiting for RowExclusiveLock on relation 1259 of database 16404 after 5000.186 ms 03:06.045 2977 LOG: process 2977 acquired AccessShareLock on relation 1259 of database 16404 after 5482.389 ms 03:06.045 2731 LOG: process 2731 acquired RowExclusiveLock on relation 1259 of database 16404 after 5405.652 ms 03:06.046 2731 ERROR: could not find pg_class tuple for index 2662 03:06.046 2731 CONTEXT: automatic vacuum of table "c01.pg_catalog.pg_index" 03:06.046 2731 ERROR: could not find pg_class tuple for index 2662 03:06.046 2716 c01 postgres 5696.537 ms 03:06.056 3001 c01 app SET SESSION TIME ZONE 'UTC'; ... 03:06.057 2716 c01 postgres vacuum full pg_catalog.pg_rewrite; 03:06.066 3001 c01 app 10.459 ms 03:06.091 3001 c01 app disconnection: session time: 0:00:04.085 user=app database=c01 host=xxxl01 03:08.908 3006 LOG: process 3006 still waiting for AccessShareLock on relation 2659 of database 16407 after 5000.778 ms 03:11.777 21685 c01 app SELECT 03:11.779 21685 c01 app 2.296 ms 03:11.779 21685 c01 app SELECT 03:11.780 21685 c01 app 0.328 ms 03:11.798 21685 c01 app SELECT 03:11.799 21685 c01 app 0.348 ms 03:11.800 21685 c01 app SELECT 03:11.800 21685 c01 app 0.205 ms 03:11.804 21685 c01 app SELECT 03:11.805 21685 c01 app 0.589 ms 03:11.806 21685 c01 app SELECT DISTINCT 03:11.809 21685 c01 app 3.552 ms 03:11.810 21685 c01 app SELECT 03:12.554 21685 c01 app 744.233 ms 03:12.555 21685 c01 app SELECT DISTINCT 03:12.556 21685 c01 app 1.136 ms 03:13.153 2716 c01 postgres 7096.367 ms 03:13.182 2716 c01 postgres vacuum full pg_catalog.pg_depend; 03:16.926 2716 c01 postgres 3743.880 ms 03:16.937 2716 c01 postgres vacuum full pg_catalog.pg_type; 03:20.460 2716 c01 postgres 3523.571 ms 03:20.492 2716 c01 postgres disconnection: session time: 0:02:06.916 user=postgres database=c01 host=xxxb-0
NOTES: -- most statements for db c57 in one catalog vacuum session -- database c57 is oid 16413 -- applications run as user 'app' -- vacuums run as user postgres -- somewhat redacted MM:SS.ms pid user log text --------- ----- ---- ----------------------------------------------------- 02:49.335 2758 postgres vacuum full pg_catalog.pg_attribute; ... 03:00.356 2758 postgres 11020.962 ms 03:00.366 2758 postgres vacuum full pg_catalog.pg_class; 03:00.382 2968 app SELECT ... 03:00.383 2968 app 1.132 ms 03:00.384 2968 app SELECT ... 03:00.385 2968 app 0.605 ms 03:00.392 2968 app SELECT ... 03:00.394 2968 app 2.165 ms 03:00.443 2968 app select ... 03:00.461 2992 [unknown] [unknown] connection received: host=xxx03 03:00.462 2992 app connection authorized: user=app database=c57 03:00.469 2994 [unknown] [unknown] connection received: host=xxx03 03:00.470 2994 app connection authorized: user=app database=c57 03:02.875 26962 app temporary file: path "pg_tblspc/16402/PG_9.0_201008051/pgsql_tmp/pgsql_tmp26962.6", size 502128640 03:02.875 26962 app STATEMENT: UPDATE ... 03:02.875 21693 app 13286.010 ms 03:02.876 21693 app SELECT DISTINCT ... 03:02.907 26962 app 95964.763 ms 03:02.910 26962 app EXPLAIN INSERT ... SELECT ... 03:05.444 2968 app process 2968 still waiting for AccessShareLock on relation 1259 of database 16413 after 5000.577 ms at character 41 03:05.444 2968 app STATEMENT: select ... 03:05.463 2992 app process 2992 still waiting for AccessShareLock on relation 1259 of database 16413 after 5000.790 ms 03:05.471 2994 app process 2994 still waiting for AccessShareLock on relation 1259 of database 16413 after 5000.600 ms 03:05.987 2968 app process 2968 acquired AccessShareLock on relation 1259 of database 16413 after 5543.035 ms at character 41 03:05.987 2968 app STATEMENT: select ... 03:05.987 2992 app process 2992 acquired AccessShareLock on relation 1259 of database 16413 after 5524.348 ms 03:05.987 2994 app process 2994 acquired AccessShareLock on relation 1259 of database 16413 after 5516.085 ms 03:05.987 2908 ERROR: could not find pg_class tuple for index 2662 03:05.987 2908 CONTEXT: automatic vacuum of table "c57.pg_catalog.pg_depend" 03:05.988 2758 postgres 5622.009 ms 03:05.991 21693 app 3115.449 ms 03:05.991 2968 app 5548.129 ms 03:05.992 2968 app BEGIN 03:05.992 2968 app 0.081 ms 03:05.993 2968 app UPDATE ... 03:05.994 26962 app 3084.019 ms 03:05.996 26962 app insert ... select 03:05.997 2994 app SHOW default_transaction_isolation 03:05.998 2992 app SHOW default_transaction_isolation 03:05.998 2758 postgres vacuum full pg_catalog.pg_depend; 03:05.998 2994 app 0.306 ms 03:05.998 2992 app 0.230 ms 03:06.000 2992 app select ... 03:06.000 2994 app select ... 03:06.002 2994 app 1.994 ms 03:06.002 2992 app 2.010 ms 03:06.002 2992 app SELECT ... 03:06.002 2994 app SELECT ... 03:06.003 2992 app 1.114 ms 03:06.004 2994 app 1.443 ms 03:06.021 2968 app 28.084 ms 03:06.022 2968 app COMMIT 03:06.023 2968 app 0.921 ms 03:06.032 2992 app SELECT ... 03:06.032 2457 app unexpected EOF on client connection 03:06.033 2992 app 0.956 ms 03:06.034 2994 app SELECT ... 03:06.034 2994 app 0.854 ms 03:06.034 2992 app SELECT ... 03:06.035 2992 app 0.626 ms 03:06.036 2994 app SELECT ... 03:06.036 2994 app 0.532 ms 03:06.041 2992 app SELECT ... 03:06.042 2992 app 1.380 ms 03:06.043 2994 app SELECT ... 03:06.046 2994 app 3.062 ms 03:06.048 2968 app disconnection: session time: 0:00:16.643 user=app database=c57 host=xxx03 03:06.050 2908 ERROR: could not find pg_class tuple for index 2662 03:06.071 2457 app disconnection: session time: 0:02:17.725 user=app database=c57 host=xxx03 03:06.093 2992 app select ... 03:06.095 2992 app 1.621 ms 03:06.095 2992 app BEGIN 03:06.096 2992 app 0.105 ms 03:06.096 2992 app UPDATE ... 03:06.098 2994 app select ... 03:06.100 2994 app 1.459 ms 03:06.100 2994 app BEGIN 03:06.100 2994 app 0.069 ms 03:06.101 2994 app UPDATE ... 03:06.138 2992 app 42.103 ms 03:06.138 2992 app COMMIT 03:06.143 2994 app 42.321 ms 03:06.143 2994 app COMMIT 03:06.469 2992 app 330.560 ms 03:06.469 2994 app 325.776 ms 03:06.480 2088 app unexpected EOF on client connection 03:06.480 1972 app unexpected EOF on client connection 03:06.496 2994 app disconnection: session time: 0:00:06.037 user=app database=c57 host=xxx03 03:06.496 2992 app disconnection: session time: 0:00:06.046 user=app database=c57 host=xxx03 03:11.481 1972 app process 1972 still waiting for RowExclusiveLock on relation 2608 of database 16413 after 5001.056 ms 03:11.481 2088 app process 2088 still waiting for RowExclusiveLock on relation 2608 of database 16413 after 5001.155 ms 03:12.614 2618 app create temp table ... 03:13.152 2088 app process 2088 acquired RowExclusiveLock on relation 2608 of database 16413 after 6671.666 ms 03:13.152 1972 app process 1972 acquired RowExclusiveLock on relation 2608 of database 16413 after 6671.637 ms 03:13.153 2758 postgres 7155.089 ms 03:13.157 2618 app 543.485 ms 03:13.158 2618 app COPY ... FROM stdin ... 03:13.160 2618 app 1.819 ms 03:13.160 2618 app update ... 03:13.182 2758 postgres vacuum full pg_catalog.pg_type; 03:13.183 2088 app disconnection: session time: 0:03:38.788 user=app database=c57 host=xxx03 03:13.183 2618 app 22.693 ms 03:13.183 2618 app insert ... 03:13.184 1972 app disconnection: session time: 0:04:42.576 user=app database=c57 host=xxx03 03:13.187 2618 app 4.191 ms 03:13.188 2618 app select distinct ... 03:13.191 2618 app 2.939 ms 03:15.306 2115 app create temp table ... 03:17.200 2758 postgres 4018.090 ms 03:17.205 2115 app 1899.466 ms 03:17.206 2115 app COPY ... FROM stdin ... 03:17.207 2115 app 1.538 ms 03:17.208 2115 app update ... 03:17.213 2758 postgres vacuum full pg_catalog.pg_index; 03:17.231 2115 app 23.801 ms 03:17.232 2115 app insert ... 03:17.238 2115 app 6.210 ms 03:17.239 2115 app select distinct ... 03:18.138 21693 app SELECT ... 03:21.043 2758 postgres 3830.239 ms 03:21.044 21693 app 2906.017 ms 03:21.044 21693 app SELECT ... 03:21.045 21693 app 0.720 ms 03:21.047 2115 app 3807.690 ms 03:21.049 2758 postgres vacuum full pg_catalog.pg_rewrite; 03:21.058 21693 app SELECT ... 03:21.058 21693 app 0.322 ms 03:21.059 21693 app SELECT ... 03:21.060 21693 app 0.308 ms 03:21.064 21693 app SELECT ... 03:21.064 21693 app 0.545 ms 03:21.065 21693 app SELECT DISTINCT ... 03:21.123 2618 app unexpected EOF on client connection 03:21.162 2618 app disconnection: session time: 0:02:15.747 user=app database=c57 host=xxx03 03:21.235 21693 app 169.569 ms 03:21.279 21693 app SELECT ... 03:23.683 21693 app 2404.143 ms 03:23.684 21693 app SELECT DISTINCT ... 03:23.688 21693 app 4.080 ms 03:24.499 2758 postgres 3450.352 ms 03:24.539 2758 postgres disconnection: session time: 0:01:56.796 user=postgres database=c57 host=xxxb-1
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers