On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote:
> daveg <da...@sonic.net> writes:
> > Here is the update: the problem happens with vacuum full alone, no reindex
> > is needed to trigger it. I updated the script to avoid reindexing after
> > vacuum. Over the past two days there are still many ocurrances of this
> > error coincident with the vacuum.
> 
> Well, that jives with the assumption that the one case we saw in
> the buildfarm was the same thing, because the regression tests were
> certainly only doing a VACUUM FULL and not a REINDEX of pg_class.
> But it doesn't get us much closer to understanding what's happening.
> In particular, it seems to knock out most ideas associated with race
> conditions, because the VAC FULL should hold exclusive lock on pg_class
> until it's completely done (including index rebuilds).
> 
> I think we need to start adding some instrumentation so we can get a
> better handle on what's going on in your database.  If I were to send
> you a source-code patch for the server that adds some more logging
> printout when this happens, would you be willing/able to run a patched
> build on your machine?

Yes we can run an instrumented server so long as the instrumentation does
not interfere with normal operation. However, scheduling downtime to switch
binaries is difficult, and generally needs to be happen on a weekend, but
sometimes can be expedited. I'll look into that.

> (BTW, just to be perfectly clear ... the "could not find pg_class tuple"
> errors always mention index 2662, right, never any other number?)

Yes, only index 2662, never any other.

I'm attaching a somewhat redacted log for two different databases on the same
instance around the time of vacuum full of pg_class in each database.
My observations so far are:

 - the error occurs at commit of vacuum full of pg_class
 - in these cases error hits autovacuum after it waited for a lock on pg_class
 - in these two cases there was a new process startup while the vacuum was
   running. Don't know if this is relevant.
 - while these hit autovacuum, the error does hit other processs (just not in
   these sessions).  Unknown if autovacuum is a required component.

-dg

-- 
David Gould       da...@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
NOTES:
-- most statements for db c01 in one catalog vacuum session
-- database c01 is oid 16404
-- applications run as user 'app'
-- vacuums run as user postgres
-- somewhat redacted

MM:SS.ms   pid    user  log text
---------  -----  ----  -----------------------------------------------------

03:00.350  2716   c01  postgres  vacuum full pg_catalog.pg_class;
03:00.507  2866  LOG: process 2866 acquired ShareUpdateExclusiveLock on 
relation 2604 of database 16416 after 10143.750 ms
03:01.222  21685  c01  app   11100.670 ms
03:01.223  21685  c01  app  SELECT DISTINCT
03:01.224  21685  c01  app   1.524 ms
03:02.031  3001   c01  app  connection authorized: user=app database=c01
03:04.952  2787  LOG: process 2787 still waiting for AccessShareLock on 
relation 1259 of database 16412 after 5001.035 ms
03:05.065  2787  LOG: process 2787 acquired AccessShareLock on relation 1259 of 
database 16412 after 5114.253 ms
03:05.564  2977  LOG: process 2977 still waiting for AccessShareLock on 
relation 1259 of database 16404 after 5000.970 ms
03:05.640  2731  LOG: process 2731 still waiting for RowExclusiveLock on 
relation 1259 of database 16404 after 5000.186 ms
03:06.045  2977  LOG: process 2977 acquired AccessShareLock on relation 1259 of 
database 16404 after 5482.389 ms
03:06.045  2731  LOG: process 2731 acquired RowExclusiveLock on relation 1259 
of database 16404 after 5405.652 ms
03:06.046  2731  ERROR: could not find pg_class tuple for index 2662
03:06.046  2731  CONTEXT: automatic vacuum of table "c01.pg_catalog.pg_index"
03:06.046  2731  ERROR: could not find pg_class tuple for index 2662
03:06.046  2716   c01  postgres   5696.537 ms
03:06.056  3001   c01  app  SET SESSION TIME ZONE 'UTC'; ...
03:06.057  2716   c01  postgres  vacuum full pg_catalog.pg_rewrite;
03:06.066  3001   c01  app   10.459 ms
03:06.091  3001   c01  app  disconnection: session time: 0:00:04.085 user=app 
database=c01 host=xxxl01
03:08.908  3006  LOG: process 3006 still waiting for AccessShareLock on 
relation 2659 of database 16407 after 5000.778 ms
03:11.777  21685  c01  app  SELECT
03:11.779  21685  c01  app   2.296 ms
03:11.779  21685  c01  app  SELECT
03:11.780  21685  c01  app   0.328 ms
03:11.798  21685  c01  app  SELECT
03:11.799  21685  c01  app   0.348 ms
03:11.800  21685  c01  app  SELECT
03:11.800  21685  c01  app   0.205 ms
03:11.804  21685  c01  app  SELECT
03:11.805  21685  c01  app   0.589 ms
03:11.806  21685  c01  app  SELECT DISTINCT
03:11.809  21685  c01  app   3.552 ms
03:11.810  21685  c01  app  SELECT
03:12.554  21685  c01  app   744.233 ms
03:12.555  21685  c01  app  SELECT DISTINCT
03:12.556  21685  c01  app   1.136 ms
03:13.153  2716   c01  postgres   7096.367 ms
03:13.182  2716   c01  postgres  vacuum full pg_catalog.pg_depend;
03:16.926  2716   c01  postgres   3743.880 ms
03:16.937  2716   c01  postgres  vacuum full pg_catalog.pg_type;
03:20.460  2716   c01  postgres   3523.571 ms
03:20.492  2716   c01  postgres  disconnection: session time: 0:02:06.916 
user=postgres database=c01 host=xxxb-0
NOTES:
-- most statements for db c57 in one catalog vacuum session
-- database c57 is oid 16413
-- applications run as user 'app'
-- vacuums run as user postgres
-- somewhat redacted

MM:SS.ms   pid    user  log text
---------  -----  ----  -----------------------------------------------------
02:49.335  2758   postgres  vacuum full pg_catalog.pg_attribute;
...
03:00.356  2758   postgres  11020.962 ms
03:00.366  2758   postgres  vacuum full pg_catalog.pg_class;
03:00.382  2968   app  SELECT ...
03:00.383  2968   app  1.132 ms
03:00.384  2968   app  SELECT ...
03:00.385  2968   app  0.605 ms
03:00.392  2968   app  SELECT ...
03:00.394  2968   app  2.165 ms
03:00.443  2968   app  select ...
03:00.461  2992   [unknown]  [unknown]  connection received: host=xxx03 
03:00.462  2992   app  connection authorized: user=app database=c57
03:00.469  2994   [unknown]  [unknown]  connection received: host=xxx03 
03:00.470  2994   app  connection authorized: user=app database=c57
03:02.875  26962  app  temporary file: path 
"pg_tblspc/16402/PG_9.0_201008051/pgsql_tmp/pgsql_tmp26962.6", size 502128640
03:02.875  26962  app  STATEMENT:  UPDATE ...
03:02.875  21693  app  13286.010 ms
03:02.876  21693  app  SELECT DISTINCT ...
03:02.907  26962  app  95964.763 ms
03:02.910  26962  app  EXPLAIN INSERT ... SELECT ...
03:05.444  2968   app  process 2968 still waiting for AccessShareLock on 
relation 1259 of database 16413 after 5000.577 ms at character 41
03:05.444  2968   app  STATEMENT:  select ...
03:05.463  2992   app  process 2992 still waiting for AccessShareLock on 
relation 1259 of database 16413 after 5000.790 ms
03:05.471  2994   app  process 2994 still waiting for AccessShareLock on 
relation 1259 of database 16413 after 5000.600 ms
03:05.987  2968   app  process 2968 acquired AccessShareLock on relation 1259 
of database 16413 after 5543.035 ms at character 41
03:05.987  2968   app  STATEMENT:  select ...
03:05.987  2992   app  process 2992 acquired AccessShareLock on relation 1259 
of database 16413 after 5524.348 ms
03:05.987  2994   app  process 2994 acquired AccessShareLock on relation 1259 
of database 16413 after 5516.085 ms
03:05.987  2908  ERROR:  could not find pg_class tuple for index 2662
03:05.987  2908  CONTEXT:  automatic vacuum of table "c57.pg_catalog.pg_depend"
03:05.988  2758   postgres  5622.009 ms
03:05.991  21693  app  3115.449 ms
03:05.991  2968   app  5548.129 ms
03:05.992  2968   app  BEGIN
03:05.992  2968   app  0.081 ms
03:05.993  2968   app  UPDATE ...
03:05.994  26962  app  3084.019 ms
03:05.996  26962  app  insert ... select
03:05.997  2994   app  SHOW default_transaction_isolation
03:05.998  2992   app  SHOW default_transaction_isolation
03:05.998  2758   postgres  vacuum full pg_catalog.pg_depend;
03:05.998  2994   app  0.306 ms
03:05.998  2992   app  0.230 ms
03:06.000  2992   app  select ...
03:06.000  2994   app  select ...
03:06.002  2994   app  1.994 ms
03:06.002  2992   app  2.010 ms
03:06.002  2992   app  SELECT ...
03:06.002  2994   app  SELECT ...
03:06.003  2992   app  1.114 ms
03:06.004  2994   app  1.443 ms
03:06.021  2968   app  28.084 ms
03:06.022  2968   app  COMMIT
03:06.023  2968   app  0.921 ms
03:06.032  2992   app  SELECT ...
03:06.032  2457   app  unexpected EOF on client connection
03:06.033  2992   app  0.956 ms
03:06.034  2994   app  SELECT ...
03:06.034  2994   app  0.854 ms
03:06.034  2992   app  SELECT ...
03:06.035  2992   app  0.626 ms
03:06.036  2994   app  SELECT ...
03:06.036  2994   app  0.532 ms
03:06.041  2992   app  SELECT ...
03:06.042  2992   app  1.380 ms
03:06.043  2994   app  SELECT ...
03:06.046  2994   app  3.062 ms
03:06.048  2968   app  disconnection: session time: 0:00:16.643 user=app 
database=c57 host=xxx03 
03:06.050  2908  ERROR:  could not find pg_class tuple for index 2662
03:06.071  2457   app  disconnection: session time: 0:02:17.725 user=app 
database=c57 host=xxx03 
03:06.093  2992   app  select ...
03:06.095  2992   app  1.621 ms
03:06.095  2992   app  BEGIN
03:06.096  2992   app  0.105 ms
03:06.096  2992   app  UPDATE ...
03:06.098  2994   app  select ...
03:06.100  2994   app  1.459 ms
03:06.100  2994   app  BEGIN
03:06.100  2994   app  0.069 ms
03:06.101  2994   app  UPDATE ...
03:06.138  2992   app  42.103 ms
03:06.138  2992   app  COMMIT
03:06.143  2994   app  42.321 ms
03:06.143  2994   app  COMMIT
03:06.469  2992   app  330.560 ms
03:06.469  2994   app  325.776 ms
03:06.480  2088   app  unexpected EOF on client connection
03:06.480  1972   app  unexpected EOF on client connection
03:06.496  2994   app  disconnection: session time: 0:00:06.037 user=app 
database=c57 host=xxx03 
03:06.496  2992   app  disconnection: session time: 0:00:06.046 user=app 
database=c57 host=xxx03 
03:11.481  1972   app  process 1972 still waiting for RowExclusiveLock on 
relation 2608 of database 16413 after 5001.056 ms
03:11.481  2088   app  process 2088 still waiting for RowExclusiveLock on 
relation 2608 of database 16413 after 5001.155 ms
03:12.614  2618   app  create temp table ...
03:13.152  2088   app  process 2088 acquired RowExclusiveLock on relation 2608 
of database 16413 after 6671.666 ms
03:13.152  1972   app  process 1972 acquired RowExclusiveLock on relation 2608 
of database 16413 after 6671.637 ms
03:13.153  2758   postgres  7155.089 ms
03:13.157  2618   app  543.485 ms
03:13.158  2618   app  COPY ... FROM stdin ...
03:13.160  2618   app  1.819 ms
03:13.160  2618   app  update ...
03:13.182  2758   postgres  vacuum full pg_catalog.pg_type;
03:13.183  2088   app  disconnection: session time: 0:03:38.788 user=app 
database=c57 host=xxx03 
03:13.183  2618   app  22.693 ms
03:13.183  2618   app  insert ...
03:13.184  1972   app  disconnection: session time: 0:04:42.576 user=app 
database=c57 host=xxx03 
03:13.187  2618   app  4.191 ms
03:13.188  2618   app  select distinct ...
03:13.191  2618   app  2.939 ms
03:15.306  2115   app  create temp table ...
03:17.200  2758   postgres  4018.090 ms
03:17.205  2115   app  1899.466 ms
03:17.206  2115   app  COPY ... FROM stdin ...
03:17.207  2115   app  1.538 ms
03:17.208  2115   app  update ...
03:17.213  2758   postgres  vacuum full pg_catalog.pg_index;
03:17.231  2115   app  23.801 ms
03:17.232  2115   app  insert ...
03:17.238  2115   app  6.210 ms
03:17.239  2115   app  select distinct ...
03:18.138  21693  app  SELECT ...
03:21.043  2758   postgres  3830.239 ms
03:21.044  21693  app  2906.017 ms
03:21.044  21693  app  SELECT ...
03:21.045  21693  app  0.720 ms
03:21.047  2115   app  3807.690 ms
03:21.049  2758   postgres  vacuum full pg_catalog.pg_rewrite;
03:21.058  21693  app  SELECT ...
03:21.058  21693  app  0.322 ms
03:21.059  21693  app  SELECT ...
03:21.060  21693  app  0.308 ms
03:21.064  21693  app  SELECT ...
03:21.064  21693  app  0.545 ms
03:21.065  21693  app  SELECT DISTINCT ...
03:21.123  2618   app  unexpected EOF on client connection
03:21.162  2618   app  disconnection: session time: 0:02:15.747 user=app 
database=c57 host=xxx03 
03:21.235  21693  app  169.569 ms
03:21.279  21693  app  SELECT ...
03:23.683  21693  app  2404.143 ms
03:23.684  21693  app  SELECT DISTINCT ...
03:23.688  21693  app  4.080 ms
03:24.499  2758   postgres  3450.352 ms
03:24.539  2758   postgres  disconnection: session time: 0:01:56.796 
user=postgres database=c57 host=xxxb-1
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to