I noticed that buildfarm member piculet fell over this afternoon:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2016-11-10%2020%3A10%3A02

with this interesting failure during startup of the "collate" test:

    psql: FATAL: cache lookup failed for relation 1255

1255 is pg_proc, and nosing around, I noticed that the concurrent "init_privs" test does this:

    GRANT SELECT ON pg_proc TO CURRENT_USER;
    GRANT SELECT (prosrc) ON pg_proc TO CURRENT_USER;

So that led me to hypothesize that GRANT on a system catalog can cause a concurrent connection failure, which I tested by running

    pgbench -U postgres -n -f script1.sql -T 300 regression

with this script:

    GRANT SELECT ON pg_proc TO CURRENT_USER;
    GRANT SELECT (prosrc) ON pg_proc TO CURRENT_USER;
    REVOKE SELECT ON pg_proc FROM CURRENT_USER;
    REVOKE SELECT (prosrc) ON pg_proc FROM CURRENT_USER;

and concurrently

    pgbench -C -U postgres -n -f script2.sql -c 10 -j 10 -T 300 regression

with this script:

    select 2 + 2;

and sure enough, the second one falls over after a bit with

    connection to database "regression" failed:
    FATAL: cache lookup failed for relation 1255
    client 5 aborted while establishing connection

For me, this typically happens within thirty seconds or less. I thought perhaps it only happened with --no-atomics which piculet is using, but nope, I can reproduce it in a stock debug build. For the record, I'm testing on an 8-core x86_64 machine running RHEL6.

Note: you can't merge this test scenario into one pgbench run with two scripts, because then you can't keep pgbench from sometimes running two instances of script1 concurrently, with ensuing "tuple concurrently updated" errors. That's something we've previously deemed not worth changing, and in any case it's not what I'm on about at the moment. I tried to make script1 safe for concurrent calls by putting "begin; lock table pg_proc in share row exclusive mode; ...; commit;" around it, but that caused the error to go away, or at least become far less frequent. Which is odd in itself, since that lock level shouldn't block connection startup accesses to pg_proc.

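For anyone wanting to try this, the two runs above can be wrapped in one shell script, roughly as sketched here. It assumes a running server with a "regression" database and a "postgres" superuser, as in the commands above. Writing the script files with here-documents, backgrounding both runs, and the check for pgbench on PATH are conveniences added for illustration, not part of the original test.

```shell
#!/bin/sh
# Sketch of the reproduction described above. Assumes a running server with a
# database named "regression" and a superuser role "postgres".

# script1.sql: repeatedly grant and revoke table- and column-level
# privileges on pg_proc.
cat > script1.sql <<'EOF'
GRANT SELECT ON pg_proc TO CURRENT_USER;
GRANT SELECT (prosrc) ON pg_proc TO CURRENT_USER;
REVOKE SELECT ON pg_proc FROM CURRENT_USER;
REVOKE SELECT (prosrc) ON pg_proc FROM CURRENT_USER;
EOF

# script2.sql: a trivial query; with -C, pgbench opens a new connection for
# each transaction, so connection startup is what gets exercised.
cat > script2.sql <<'EOF'
select 2 + 2;
EOF

# The PATH check is an added convenience (assumption), so the script still
# writes the files on a machine without pgbench installed.
if command -v pgbench >/dev/null 2>&1; then
    # Single client running the GRANT/REVOKE loop for 300 seconds.
    pgbench -U postgres -n -f script1.sql -T 300 regression &
    # Ten clients reconnecting for every transaction, concurrently.
    pgbench -C -U postgres -n -f script2.sql -c 10 -j 10 -T 300 regression &
    wait
else
    echo "pgbench not found on PATH; script files written only" >&2
fi
```

The two pgbench invocations are kept as separate processes, for the reason given in the note above about not merging them into one run.
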
A quick look through the sources confirms that this error implies that SearchSysCache on the RELOID cache must have failed to find a tuple for pg_proc --- there are many occurrences of this text, but they all are reporting that. Which absolutely should not be happening now that we use MVCC catalog scans, concurrent updates or no. So I think this is a bug, and possibly a fairly-recently-introduced one, because I can't remember seeing buildfarm failures like this one before.

I've not dug further than that yet. Any thoughts out there?

regards, tom lane