On Thu, Feb 23, 2012, at 05:07 AM, Jenkins wrote: > See <http://ci.cyrusimap.org/job/cyrus-imapd-master/410/> > > ------------------------------------------ > [...] > + exitcode=1 > + '[' -x jenkins-xml-summary.pl ']' > + ./jenkins-xml-summary.pl > --build-url=http://ci.cyrusimap.org/job/cyrus-imapd-master/410/ > Test failures and errors summary > ================================
Ok, so the good news first. * these test failure report URLs are now correct * the stray process killer in autobuild.sh is working Now for the test failures. Thirty-four of these thirty-five appear to be of the nature > test_dup_keep_keep(Cassandane::Cyrus::Sieve) > Perl exception: Some process is already listening on 127.0.0.1:9101 which AFAICS are all cascading failures due to Cyrus failing to shut down correctly here: > > Cassandane::Cyrus::Bug3072.copy_longset > > http://ci.cyrusimap.org/job/cyrus-imapd-master/410//testReport/%28root%29/Cassandane__Cyrus__Bug3072/test_copy_longset/ which is thorny one. The error message is Error Message Perl exception: Cannot shut down master pid 14360 Stacktrace test_copy_longset(Cassandane::Cyrus::Bug3072) Perl exception: Cannot shut down master pid 14360 at Cassandane/Instance.pm line 831 Cassandane::Instance::stop('Cassandane::Instance=HASH(0x1db7eb80)') called at Cassandane/Cyrus/TestCase.pm line 255 Cassandane::Cyrus::TestCase::tear_down('Cassandane::Cyrus::Bug3072=HASH(0x1c2642d0)') called at Cassandane/Cyrus/Bug3072.pm line 64 Cassandane::Cyrus::Bug3072::tear_down('Cassandane::Cyrus::Bug3072=HASH(0x1c2642d0)') called at /usr/lib/perl5/vendor_perl/5.8.8/Test/Unit/TestCase.pm line 65 So Cassandane::Instance::_stop_pid() returned false. This happens because it fails to shut down the master process gracefully, where "graceful" means the master terminates within about 20 seconds of receiving SIGQUIT. We can see the sequence in syslog Feb 23 05:04:55 hudson-01 cassandane: =====> Instance[826] stop Feb 23 05:04:55 hudson-01 cassandane: =====> Instance[804] _stop_pid: sending signal 3 to 14360 Feb 23 05:05:24 hudson-01 cassandane: =====> Instance[811] _stop_pid: failed to shut down pid 14360 with signal 3 Feb 23 05:05:24 hudson-01 cassandane: =====> Instance[804] _stop_pid: sending signal 4 to 14360 Feb 23 05:05:24 hudson-01 cassandane: =====> Util::Wait[75] Waited 0.011758 sec for unknown condition At 05:04:55, _stop_pid() sent SIGQUIT (signal 3) to the master process. Twenty-nine seconds later, it gave up waiting and sent SIGILL (signal 4) in the hope of getting a core file. Twelve milliseconds later that hope was fulfilled. There's nothing else interesting in the log. In particular, no unusual messages from master. The previous message which mentions instance starting is Feb 23 05:03:17 hudson-01 cassandane: =====> Instance[688] start main instance for test test_copy_longset: basedir /var/tmp/cass/1003145 I logged in earlier this morning and grabbed a copy of that directory before the next build ran (each build removes the leftovers from the previous one and I'm now wondering whether that's a good thing). gnb@hudson-01 2002 ls -l 1003145/conf/cores/ total 840 -rw------- 1 gnb gnb 851968 Feb 23 05:05 core.14360 gnb@hudson-01 2004 file 1003145/conf/cores/core.14360 1003145/conf/cores/core.14360: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'master' gnb@hudson-01 2005 gdb /var/lib/jenkins/jobs/cyrus-imapd-master/workspace/inst/usr/cyrus/bin/master 1003145/conf/cores/core.14360 ... Program terminated with signal 4, Illegal instruction. #0 0x0000000000406903 in child_janitor (now=...) at master.c:1060 1060 janitor_position = janitor_position % child_table_size; (gdb) bt #0 0x0000000000406903 in child_janitor (now=...) at master.c:1060 #1 0x000000000040c916 in main (argc=12, argv=0x7fffade2ce88) at master.c:2272 At this point I poked around and eventually worked out that master was busy looping into and out of the main select loop and the janitor code, but the janitor code itself wasn't stuck in a loop. The child table contains exactly one entry. (gdb) p ctable $1 = {0x0 <repeats 4366 times>, 0x13be10a0, 0x0 <repeats 5633 times>} (gdb) p *(struct centry *)0x13be10a0 $2 = {pid = 14366, service_state = SERVICE_STATE_READY, janitor_deadline = 0, si = 0, next = 0x0} According to syslog, pid 14366 was an imapd Feb 23 05:03:17 hudson-01 1003145/imap[14366]: login: localhost.localdomain [127.0.0.1] admin plaintext User logged in SESSIONID=<1003145-14366-1329991397-1> This explains why master didn't die, it's waiting for this child imapd to die. Indeed, we see the struct service has the correct nactive and ready_workers counts (gdb) p Services[0] $5 = {name = 0x13be12c0 "imap", listen = 0x13be1120 "127.0.0.1:9104", proto = 0x13be1280 "tcp", exec = 0x13be1360, babysit = 0, associate = 0, family = 2, socket = 5, stat = {7, 9}, desired_workers = 0, max_workers = 2147483647, maxfds = 256, maxforkrate = 0, ready_workers = 1, nforks = 1, nactive = 1, nconnections = 2, forkrate = 0, nreadyfails = 0, last_interval_start = {tv_sec = 1329991397, tv_usec = 543872}, interval_forks = 1} So what went wrong? 1078 static void sigquit_handler(int sig __attribute__((unused))) 1079 { ... 1095 /* Set a flag so main loop knows to shut down when 1096 all children have exited */ 1097 in_shutdown = 1; (gdb) p in_shutdown $6 = 1 So we did receive SIGQUIT and run sigquit_handler, which re-sends SIGQUIT to master's process group. That should have killed off imapd. Because autobuild.sh now cleans up stray processes, there's no evidence that it failed to shut down. We don't have 'auditlog' set in imapd.conf so there's no message in syslog when an imapd shuts down gracefully, and so no evidence that it succeeded in shutting down. What we have in fact is a big question mark. There's a number of possible failure modes here, but basically either a) imapd might have died and master failed to notice. b) imapd might have decided not to die. So did happen in imapd? From syslog we see Feb 23 05:04:55 hudson-01 1003145/imap[14366]: USAGE cassandane user: 2.518617 sys: 1.865716 so telemetry_rusage() was called. We don't create any shutdown file in this test, so this means imapd processed a LOGOUT command, which is to be expected as Cassandane is running the test's tear_down method. This means cmdloop() would have returned, and service_main() also, back to main() in master/service.c. At this point, imapd would have sent a MASTER_SERVICE_AVAILABLE status message to master, which is consistent with the centry being in state SERVICE_STATE_READY. Imapd then goes back around and accept()s. What probably happens here is that accept() returns with EINTR and signals_poll() is called. It seems like signals_poll() should be driving the whole sequence of dealing with SIGQUIT, by detecting it and calling shut_down(). Here's what it does 107 int signals_poll(void) 108 { 109 switch (gotsignal) { 110 case SIGINT: 111 case SIGQUIT: 112 if (shutdown_cb) shutdown_cb(EC_TEMPFAIL); 113 else exit(EC_TEMPFAIL); 114 break; 115 default: 116 return gotsignal; 117 break; 118 } 119 return 0; /* compiler warning stupidity */ 120 } and where is gotsignal set? 55 static volatile sig_atomic_t gotsignal = 0; 56 57 static void sighandler(int sig) 58 { 59 /* syslog(LOG_DEBUG, "got signal %d", sig); */ 60 gotsignal = sig; 61 } and guess what, sighandler is set up as the signal handler for SIGQUIT, SIGALRM, SIGHUP and SIGINT. So we handle only the *last received* of all those signals. If SIGQUIT is received, followed by (say) SIGALRM, we never even notice SIGQUIT. Hmm, that's definitely a bug and would explain the observed behaviour, although it's not the only possibility. I'll raise a Bugzilla ticket. -- Greg.