Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

Noah Misch Wed, 02 Aug 2017 20:49:13 -0700

On Wed, Jun 21, 2017 at 06:44:09PM -0400, Tom Lane wrote:
> Today, lorikeet failed with a new variant on the bgworker start crash:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10
> 
> This one is even more exciting than the last one, because it sure looks
> like the crashing bgworker took the postmaster down with it.  That is
> Not Supposed To Happen.
> 
> Wondering if we broke something here recently, I tried to reproduce it
> on a Linux machine by adding a randomized Assert failure in
> shm_mq_set_sender.  I don't see any such problem, even with EXEC_BACKEND;
> we recover from the crash as-expected.
> 
> So I'm starting to get a distinct feeling that there's something wrong
> with the cygwin port.  But I dunno what.


I think signal blocking broke on Cygwin.

On a system (gcc 5.4.0, CYGWIN_NT-10.0 2.7.0(0.306/5/3) 2017-02-12 13:18
x86_64) that reproduces lorikeet's symptoms, I instrumented the postmaster as
attached.  The patch's small_parallel.sql is a subset of select_parallel.sql
sufficient to reproduce the mq_sender Assert failure and the postmaster silent
exit.  (It occasionally needed hundreds of iterations to do so.)  The parallel
query normally starts four bgworkers; when the mq_sender Assert fired, the
test had started five workers in response to four registrations.

The postmaster.c instrumentation regularly detects sigusr1_handler() calls
while another sigusr1_handler() is already on the stack:

6328 2017-08-02 07:25:42.788 GMT LOG:  forbid signals @ sigusr1_handler
6328 2017-08-02 07:25:42.788 GMT DEBUG:  saw slot-0 registration, want 0
6328 2017-08-02 07:25:42.788 GMT DEBUG:  saw slot-0 registration, want 1
6328 2017-08-02 07:25:42.788 GMT DEBUG:  slot 1 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker 
"parallel worker for PID 4776" (slot 1)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 2
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 2
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 2 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker 
"parallel worker for PID 4776" (slot 2)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-2 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 3 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker 
"parallel worker for PID 4776" (slot 3)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-3 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-2 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-1 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  saw slot-0 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG:  slot 4 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG:  registering background worker 
"parallel worker for PID 4776" (slot 4)
6328 2017-08-02 07:25:42.789 GMT DEBUG:  starting background worker process 
"parallel worker for PID 4776"
6328 2017-08-02 07:25:42.790 GMT LOG:  forbid signals @ sigusr1_handler
6328 2017-08-02 07:25:42.790 GMT WARNING:  signals already forbidden @ 
sigusr1_handler
6328 2017-08-02 07:25:42.790 GMT LOG:  permit signals @ sigusr1_handler

postmaster algorithms rely on the PG_SETMASK() calls preventing that.  Without
such protection, duplicate bgworkers are an understandable result.  I caught
several other assertions; the PMChildFlags failure is another case of
duplicate postmaster children:

      6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: 
"pgstat.c", Line: 871)
      3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", 
File: "pmsignal.c", Line: 229)
     20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 
2523)
     21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: 
"shm_mq.c", Line: 221)
     Also, got a few "select() failed in postmaster: Bad address"

I suspect a Cygwin signals bug.  I'll try to distill a self-contained test
case for the Cygwin hackers.  The lack of failures on buildfarm member brolga
argues that older Cygwin is not affected.

diff --git a/src/backend/access/transam/parallel.c 
b/src/backend/access/transam/parallel.c
index 17b1038..f42034d 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -985,6 +985,8 @@ ParallelWorkerMain(Datum main_arg)
        error_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_ERROR_QUEUE, 
false);
        mq = (shm_mq *) (error_queue_space +
                                         ParallelWorkerNumber * 
PARALLEL_ERROR_QUEUE_SIZE);
+       write_stderr("PID %d claiming queue %d (%p)\n",
+                                MyProc->pid, ParallelWorkerNumber, mq);
        shm_mq_set_sender(mq, MyProc);
        mqh = shm_mq_attach(mq, seg, NULL);
        pq_redirect_to_shm_mq(seg, mqh);
diff --git a/src/backend/postmaster/bgworker.c 
b/src/backend/postmaster/bgworker.c
index 28af6f0..e393d31 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -219,14 +219,18 @@ FindRegisteredWorkerBySlotNumber(int slotno)
 {
        slist_iter      siter;
 
+       /* elog(DEBUG1, "looking for slot-%d registration", slotno); */
        slist_foreach(siter, &BackgroundWorkerList)
        {
                RegisteredBgWorker *rw;
 
                rw = slist_container(RegisteredBgWorker, rw_lnode, siter.cur);
+               elog(DEBUG1, "saw slot-%d registration, want %d",
+                        rw->rw_shmem_slot, slotno);
                if (rw->rw_shmem_slot == slotno)
                        return rw;
        }
+       /* elog(DEBUG1, "did not find slot-%d registration", slotno); */
 
        return NULL;
 }
@@ -297,6 +301,7 @@ BackgroundWorkerStateChange(void)
                        }
                        continue;
                }
+               elog(DEBUG1, "slot %d not yet registered", slotno);
 
                /*
                 * If the worker is marked for termination, we don't need to 
add it to
@@ -390,8 +395,8 @@ BackgroundWorkerStateChange(void)
 
                /* Log it! */
                ereport(DEBUG1,
-                               (errmsg("registering background worker \"%s\"",
-                                               rw->rw_worker.bgw_name)));
+                               (errmsg("registering background worker \"%s\" 
(slot %d)",
+                                               rw->rw_worker.bgw_name, 
rw->rw_shmem_slot)));
 
                slist_push_head(&BackgroundWorkerList, &rw->rw_lnode);
        }
@@ -422,8 +427,8 @@ ForgetBackgroundWorker(slist_mutable_iter *cur)
        slot->in_use = false;
 
        ereport(DEBUG1,
-                       (errmsg("unregistering background worker \"%s\"",
-                                       rw->rw_worker.bgw_name)));
+                       (errmsg("unregistering background worker \"%s\" (slot 
%d)",
+                                       rw->rw_worker.bgw_name, 
rw->rw_shmem_slot)));
 
        slist_delete_current(cur);
        free(rw);
diff --git a/src/backend/postmaster/postmaster.c 
b/src/backend/postmaster/postmaster.c
index 83e99b7..5fb5db1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -561,6 +561,34 @@ int                        postmaster_alive_fds[2] = {-1, 
-1};
 HANDLE         PostmasterHandle;
 #endif
 
+static bool siglevel = false;
+static void sigforbid(const char *location)
+{
+       int save_errno;
+
+       PG_SETMASK(&BlockSig);
+
+       save_errno = errno;
+       elog(LOG, "forbid signals @ %s", location);
+       if (siglevel)
+               elog(WARNING, "signals already forbidden @ %s", location);
+       siglevel = true;
+       errno = save_errno;
+}
+static void sigpermit(const char *location)
+{
+       int save_errno;
+
+       save_errno = errno;
+       elog(LOG, "permit signals @ %s", location);
+       if (!siglevel)
+               elog(WARNING, "signals already permitted @ %s", location);
+       siglevel = false;
+       errno = save_errno;
+
+       PG_SETMASK(&UnBlockSig);
+}
+
 /*
  * Postmaster main entry point
  */
@@ -629,7 +657,7 @@ PostmasterMain(int argc, char *argv[])
         * postmaster/checkpointer.c.
         */
        pqinitmask();
-       PG_SETMASK(&BlockSig);
+       sigforbid(__FUNCTION__);
 
        pqsignal_no_restart(SIGHUP, SIGHUP_handler);    /* reread config file 
and
                                                                                
                         * have children do same */
@@ -1686,12 +1714,12 @@ ServerLoop(void)
 
                if (pmState == PM_WAIT_DEAD_END)
                {
-                       PG_SETMASK(&UnBlockSig);
+                       sigpermit("PM_WAIT_DEAD_END");
 
                        pg_usleep(100000L); /* 100 msec seems reasonable */
                        selres = 0;
 
-                       PG_SETMASK(&BlockSig);
+                       sigforbid("PM_WAIT_DEAD_END");
                }
                else
                {
@@ -1701,11 +1729,11 @@ ServerLoop(void)
                        /* Needs to run with blocked signals! */
                        DetermineSleepTime(&timeout);
 
-                       PG_SETMASK(&UnBlockSig);
+                       sigpermit("select");
 
                        selres = select(nSockets, &rmask, NULL, NULL, &timeout);
 
-                       PG_SETMASK(&BlockSig);
+                       sigforbid("select");
                }
 
                /* Now check the select() result */
@@ -2507,7 +2535,7 @@ SIGHUP_handler(SIGNAL_ARGS)
 {
        int                     save_errno = errno;
 
-       PG_SETMASK(&BlockSig);
+       sigforbid(__FUNCTION__);
 
        if (Shutdown <= SmartShutdown)
        {
@@ -2566,7 +2594,7 @@ SIGHUP_handler(SIGNAL_ARGS)
 #endif
        }
 
-       PG_SETMASK(&UnBlockSig);
+       sigpermit(__FUNCTION__);
 
        errno = save_errno;
 }
@@ -2580,7 +2608,7 @@ pmdie(SIGNAL_ARGS)
 {
        int                     save_errno = errno;
 
-       PG_SETMASK(&BlockSig);
+       sigforbid(__FUNCTION__);
 
        ereport(DEBUG2,
                        (errmsg_internal("postmaster received signal %d",
@@ -2737,7 +2765,7 @@ pmdie(SIGNAL_ARGS)
                        break;
        }
 
-       PG_SETMASK(&UnBlockSig);
+       sigpermit(__FUNCTION__);
 
        errno = save_errno;
 }
@@ -2752,7 +2780,7 @@ reaper(SIGNAL_ARGS)
        int                     pid;                    /* process id of dead 
child process */
        int                     exitstatus;             /* its exit status */
 
-       PG_SETMASK(&BlockSig);
+       sigforbid(__FUNCTION__);
 
        ereport(DEBUG4,
                        (errmsg_internal("reaping dead processes")));
@@ -3051,7 +3079,7 @@ reaper(SIGNAL_ARGS)
        PostmasterStateMachine();
 
        /* Done with signal handler */
-       PG_SETMASK(&UnBlockSig);
+       sigpermit(__FUNCTION__);
 
        errno = save_errno;
 }
@@ -4765,7 +4793,7 @@ SubPostmasterMain(int argc, char *argv[])
 
        /* In EXEC_BACKEND case we will not have inherited these settings */
        pqinitmask();
-       PG_SETMASK(&BlockSig);
+       sigforbid(__FUNCTION__);
 
        /* Read in remaining GUC variables */
        read_nondefault_variables();
@@ -4957,7 +4985,7 @@ sigusr1_handler(SIGNAL_ARGS)
 {
        int                     save_errno = errno;
 
-       PG_SETMASK(&BlockSig);
+       sigforbid(__FUNCTION__);
 
        /* Process background worker state change. */
        if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
@@ -5091,7 +5119,7 @@ sigusr1_handler(SIGNAL_ARGS)
                signal_child(StartupPID, SIGUSR2);
        }
 
-       PG_SETMASK(&UnBlockSig);
+       sigpermit(__FUNCTION__);
 
        errno = save_errno;
 }
@@ -5511,13 +5539,13 @@ BackgroundWorkerInitializeConnectionByOid(Oid dboid, 
Oid useroid)
 void
 BackgroundWorkerBlockSignals(void)
 {
-       PG_SETMASK(&BlockSig);
+       sigforbid(__FUNCTION__);
 }
 
 void
 BackgroundWorkerUnblockSignals(void)
 {
-       PG_SETMASK(&UnBlockSig);
+       sigpermit(__FUNCTION__);
 }
 
 #ifdef EXEC_BACKEND
diff --git a/src/backend/utils/error/assert.c b/src/backend/utils/error/assert.c
index 2ef7792..9f4e597 100644
--- a/src/backend/utils/error/assert.c
+++ b/src/backend/utils/error/assert.c
@@ -42,6 +42,8 @@ ExceptionalCondition(const char *conditionName,
        /* Usually this shouldn't be needed, but make sure the msg went out */
        fflush(stderr);
 
+       for(;;);
+
 #ifdef SLEEP_ON_ASSERT
 
        /*
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 918db0a..e0ef9ec 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -554,6 +554,7 @@ errfinish(int dummy,...)
                 */
                fflush(stdout);
                fflush(stderr);
+               for(;;);
                abort();
        }
 
diff --git a/src/test/regress/expected/small_parallel.out 
b/src/test/regress/expected/small_parallel.out
new file mode 100644
index 0000000..a17ed33
--- /dev/null
+++ b/src/test/regress/expected/small_parallel.out
@@ -0,0 +1,44 @@
+-- Serializable isolation would disable parallel query, so explicitly use an
+-- arbitrary other level.
+begin isolation level repeatable read;
+-- encourage use of parallel plans
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+-- test parallel bitmap heap scan.
+set enable_seqscan to off;
+set enable_indexscan to off;
+set enable_hashjoin to off;
+set enable_mergejoin to off;
+set enable_material to off;
+-- test prefetching, if the platform allows it
+DO $$
+BEGIN
+ SET effective_io_concurrency = 50;
+EXCEPTION WHEN invalid_parameter_value THEN
+END $$;
+set work_mem='64kB';  --set small work mem to force lossy pages
+explain (costs off)
+       select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and 
tenk2.thousand=0;
+                         QUERY PLAN                         
+------------------------------------------------------------
+ Aggregate
+   ->  Nested Loop
+         ->  Seq Scan on tenk2
+               Filter: (thousand = 0)
+         ->  Gather
+               Workers Planned: 4
+               ->  Parallel Bitmap Heap Scan on tenk1
+                     Recheck Cond: (hundred > 1)
+                     ->  Bitmap Index Scan on tenk1_hundred
+                           Index Cond: (hundred > 1)
+(10 rows)
+
+select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
+ count 
+-------
+ 98000
+(1 row)
+
+rollback;
diff --git a/src/test/regress/sql/small_parallel.sql 
b/src/test/regress/sql/small_parallel.sql
new file mode 100644
index 0000000..1598d2e
--- /dev/null
+++ b/src/test/regress/sql/small_parallel.sql
@@ -0,0 +1,28 @@
+-- Serializable isolation would disable parallel query, so explicitly use an
+-- arbitrary other level.
+begin isolation level repeatable read;
+
+-- encourage use of parallel plans
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+
+-- test parallel bitmap heap scan.
+set enable_seqscan to off;
+set enable_indexscan to off;
+set enable_hashjoin to off;
+set enable_mergejoin to off;
+set enable_material to off;
+-- test prefetching, if the platform allows it
+DO $$
+BEGIN
+ SET effective_io_concurrency = 50;
+EXCEPTION WHEN invalid_parameter_value THEN
+END $$;
+set work_mem='64kB';  --set small work mem to force lossy pages
+explain (costs off)
+       select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and 
tenk2.thousand=0;
+select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
+
+rollback;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

Reply via email to