On Fri, Jan 24, 2014, at 10:51 AM, John Griffith wrote:
> On Fri, Jan 24, 2014 at 11:37 AM, Clay Gerrard <clay.gerr...@gmail.com>
> wrote:
> >>
> >> That's a pretty high rate of failure, and really needs investigation.
> >
> > That's a great point, did you look into the logs of any of those jobs?
> > Thanks for bringing it to my attention.
> >
> > I saw a few swift tests that would pop, I'll open bugs to look into those.
> > But the cardinality of the failures (7) was dwarfed by jenkins failures I
> > don't quite understand.
> >
> > [EnvInject] - [ERROR] - SEVERE ERROR occurs: java.lang.InterruptedException
> > (e.g.
> > http://logs.openstack.org/86/66986/3/gate/gate-swift-python27/2e6a8fc/console.html)
> >
> > FATAL: command execution failed | java.io.InterruptedIOException (e.g.
> > http://logs.openstack.org/84/67584/5/gate/gate-swift-python27/4ad733d/console.html)
> >
> > These jobs are blowing up setting up the workspace on the slave, and we're
> > not automatically retrying them? How can this only be effecting swift?
>
> It's certainly not just swift:
>
> http://logstash.openstack.org/#eyJzZWFyY2giOiJcImphdmEuaW8uSW50ZXJydXB0ZWRJT0V4Y2VwdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwNTg5MTg4NjY5fQ==
>
> >
> > -Clay
> >
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStack-dev@lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

This isn't all doom and gloom, but rather an unfortunate side effect of
how Jenkins aborts jobs. When a job is aborted, there are corner cases
where Jenkins does not catch all of the exceptions that may occur, and
the build is then reported as a failure instead of an abort. All of this
would be fine if we never aborted jobs, but Zuul aggressively aborts
jobs when it knows the result of a job will not help anything (neither
the ability to merge nor useful results to report back to code
reviewers). I have a hunch (but would need to do a bunch of digging to
confirm it) that most of these errors are simply job aborts that
happened in ways that Jenkins couldn't recover from gracefully.

Looking at the most recent occurrence of this particular failure, we see
that https://review.openstack.org/#/c/66307 failed
gate-tempest-dsvm-neutron-large-ops. If we go to the comments on the
change, we see that this particular failure was never reported, which
implies the failure happened as part of a build abort.

The other thing we can do to convince ourselves that this problem is
mostly poor reporting of job aborts is to restrict our logstash search
to build_queue:"check". Only the gate queue aborts jobs in this way, so
occurrences in the check queue would indicate an actual problem. If we
do that, we see a bunch of "hudson.remoting.RequestAbortedException"
errors. Those are also aborts that were not handled properly and, since
Zuul shouldn't abort jobs in the check queue, they were probably the
result of a human aborting jobs after a Zuul restart.

TL;DR: I believe this is mostly a non-issue and has to do with Zuul and
Jenkins quirks. If you see this error reported to Gerrit, we should do
more digging.

Clark
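
For anyone who wants to repeat the check-queue search, here is a rough
sketch of how the logstash link quoted above could be narrowed. It
assumes the part of the URL after the '#' is simply base64-encoded JSON
(which is what that particular link decodes to) and that appending
AND build_queue:"check" is acceptable query syntax for the dashboard.
The helper names are made up for illustration and are not part of any
OpenStack tooling.

```python
import base64
import json

# Fragment (everything after '#') of the logstash URL quoted in this thread.
FRAGMENT = (
    "eyJzZWFyY2giOiJcImphdmEuaW8uSW50ZXJydXB0ZWRJT0V4Y2VwdGlvblwiIiwi"
    "ZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFw"
    "aG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1w"
    "IjoxMzkwNTg5MTg4NjY5fQ=="
)


def decode_state(fragment):
    """Decode a URL fragment into a dict (assumes base64-encoded JSON)."""
    return json.loads(base64.b64decode(fragment).decode("utf-8"))


def encode_state(state):
    """Inverse of decode_state: compact JSON, then base64."""
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def restrict_to_queue(state, queue):
    """Return a copy of the state whose search is limited to one build queue.

    The 'AND build_queue:"..."' form is an assumption about the query syntax.
    """
    narrowed = dict(state)
    narrowed["search"] = '{} AND build_queue:"{}"'.format(state["search"], queue)
    return narrowed


state = decode_state(FRAGMENT)
check_state = restrict_to_queue(state, "check")

print(state["search"])
print(check_state["search"])
print("http://logstash.openstack.org/#" + encode_state(check_state))
```

The last line prints a link with the same time window as the original
but limited to the check queue. Any hits there would be worth a closer
look, since Zuul should not be aborting those jobs.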