Hi Tom,

I have tested the subscription test 013_partition.pl with CCA (CLOBBER_CACHE_ALWAYS) enabled on HEAD and PG13, and I am able to reproduce the issue on both versions.
*Logs:*

[centos@clobber-cache subscription]$ git branch
* REL_13_STABLE
  master
[centos@clobber-cache-db93 subscription]$ tail -f tmp_check/log/013_partition_subscriber1.log
2020-09-15 08:42:19.763 UTC [27866] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:20.395 UTC [27866] ERROR:  cache lookup failed for relation 0
2020-09-15 08:42:20.436 UTC [26427] LOG:  background worker "logical replication worker" (PID 27866) exited with exit code 1
2020-09-15 08:42:20.835 UTC [27868] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:21.462 UTC [27868] ERROR:  cache lookup failed for relation 0
2020-09-15 08:42:21.508 UTC [26427] LOG:  background worker "logical replication worker" (PID 27868) exited with exit code 1
2020-09-15 08:42:21.921 UTC [27870] LOG:  logical replication apply worker for subscription "sub1" has started
2020-09-15 08:42:22.551 UTC [27870] ERROR:  cache lookup failed for relation 0

Thanks.

--
Regards,
Neha Sharma

On Mon, Sep 14, 2020 at 8:50 PM Tom Lane <t...@sss.pgh.pa.us> wrote:

> In connection with a nearby thread, I tried to run the subscription
> test suite in a CLOBBER_CACHE_ALWAYS build. I soon found that I had
> to increase wal_receiver_timeout, but after doing this:
>
> diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
> index 1488bff..5fe6810 100644
> *** a/src/test/perl/PostgresNode.pm
> --- b/src/test/perl/PostgresNode.pm
> *************** sub init
> *** 447,452 ****
> --- 447,453 ----
>       print $conf "log_statement = all\n";
>       print $conf "log_replication_commands = on\n";
>       print $conf "wal_retrieve_retry_interval = '500ms'\n";
> +     print $conf "wal_receiver_timeout = '10min'\n";
>
>       # If a setting tends to affect whether tests pass or fail, print it after
>       # TEMP_CONFIG. Otherwise, print it before TEMP_CONFIG, thereby permitting
>
> I let it run overnight, and came back to find that it was stuck at
>
> [03:02:15] t/013_partition.pl ................. 19/51
>
> and had been for circa eight hours, where extrapolation from other tests
> said it shouldn't take much over half an hour. Investigation found that
> the subscriber was repeatedly failing like this:
>
> 2020-09-14 11:05:26.483 EDT [1030506] LOG:  logical replication apply worker for subscription "sub1" has started
> 2020-09-14 11:05:27.139 EDT [1030506] ERROR:  cache lookup failed for relation 0
> 2020-09-14 11:05:27.140 EDT [947156] LOG:  background worker "logical replication worker" (PID 1030506) exited with exit code 1
> 2020-09-14 11:05:27.571 EDT [1030509] LOG:  logical replication apply worker for subscription "sub1" has started
> 2020-09-14 11:05:28.227 EDT [1030509] ERROR:  cache lookup failed for relation 0
> 2020-09-14 11:05:28.228 EDT [947156] LOG:  background worker "logical replication worker" (PID 1030509) exited with exit code 1
>
> The publisher's log shows no sign of distress:
>
> 2020-09-14 11:06:09.380 EDT [1030619] sub1 LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
> 2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication command: IDENTIFY_SYSTEM
> 2020-09-14 11:06:09.446 EDT [1030619] sub1 LOG:  received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"')
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  starting logical decoding for slot "sub1"
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08.
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 LOG:  logical decoding found consistent point at 0/163CF08
> 2020-09-14 11:06:09.447 EDT [1030619] sub1 DETAIL:  There are no running transactions.
> 2020-09-14 11:06:10.468 EDT [1030621] sub1 LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
> 2020-09-14 11:06:10.533 EDT [1030621] sub1 LOG:  received replication command: IDENTIFY_SYSTEM
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  received replication command: START_REPLICATION SLOT "sub1" LOGICAL 0/163CF08 (proto_version '2', publication_names '"pub1"')
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  starting logical decoding for slot "sub1"
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  Streaming transactions committing after 0/163D848, reading WAL from 0/163CF08.
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 LOG:  logical decoding found consistent point at 0/163CF08
> 2020-09-14 11:06:10.534 EDT [1030621] sub1 DETAIL:  There are no running transactions.
>
> I do not have time today to chase this further, but somebody should.
>
> More generally, this seems like good evidence that we really oughta have a
> buildfarm member that's running *all* the tests under CLOBBER_CACHE_ALWAYS
> not just the core tests. That seems impossibly expensive, but I realized
> while watching these tests that a ridiculous fraction of the runtime is
> being spent in repeated initdb calls. On my machine, initdb takes about
> 12 minutes under CCA, so doing it separately for publisher and subscriber
> means 24 minutes, which compares not at all favorably to the
> circa-half-an-hour total runtime of each of the subscription test scripts.
> We're surely not learning anything after the first CCA run of initdb, so
> if we could find a way to skip that overhead for later runs, it'd make a
> huge difference in the practicality of running these tests under CCA.
>
> I recall having worked on a patch to make the regression tests run
> initdb just once, creating a template directory tree, and then "cp -a"
> that into place for each test. I did not finish it, because it wasn't
> showing a lot of advantage in a normal test run, but maybe the idea
> could be resurrected for CCA and other slow builds.
>
> Another idea is to make CCA a little more dynamic, say allow it to be
> suppressed through an environment variable setting, and then use that
> to speed up per-test initdbs.
>
>                         regards, tom lane
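Tom's "initdb once, then copy a template" idea could be sketched roughly as below. This is only an illustration of the caching pattern, not his unfinished patch: the CCA_TEMPLATE path and the make_node_datadir helper are invented names, and nothing like this exists in PostgresNode.pm today.

```shell
# Hypothetical sketch: pay the (very slow, under CLOBBER_CACHE_ALWAYS)
# initdb cost once, then give every later test node a cheap file-level
# copy of the resulting data directory.
CCA_TEMPLATE="${TMPDIR:-/tmp}/cca_initdb_template"

make_node_datadir() {
    local datadir=$1
    if [ ! -d "$CCA_TEMPLATE" ]; then
        # First caller builds the template cluster.
        initdb -D "$CCA_TEMPLATE"
    fi
    # Later callers just copy it, preserving permissions and timestamps.
    cp -a "$CCA_TEMPLATE" "$datadir"
}
```

With two nodes per subscription test, this would cut the roughly 24 minutes of CCA initdb time Tom measured down to one initdb plus two copies.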
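The second idea, making cache clobbering suppressible at runtime, could be modeled like this. The PG_SUPPRESS_CLOBBER_CACHE variable is purely hypothetical (no such knob exists); in a real patch the backend's cache-invalidation code, not a shell function, would consult it.

```shell
# Hypothetical model of runtime-switchable cache clobbering.  The variable
# name PG_SUPPRESS_CLOBBER_CACHE is invented for illustration only.
clobber_caches_enabled() {
    # Clobbering stays on unless the suppression variable is set.
    [ -z "${PG_SUPPRESS_CLOBBER_CACHE:-}" ]
}

# A test harness could then run just initdb with clobbering suppressed,
# while the server (and hence the tests themselves) still runs with CCA:
#   PG_SUPPRESS_CLOBBER_CACHE=1 initdb -D "$datadir"
```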