Hi,

We got a report of CI jobs failing with disconnects when downloading from tarballs.openstack.org. The file in question is a largish (~1.3 GB) container tarball for kolla-kubernetes [1].
ISTR this is not the first time we've had complaints about this, but I'm not sure we ever came up with a solution. Below are some of the failed jobs, with the IP, start time and failure time of each download.

---
http://logs.openstack.org/98/426598/5/check/gate-kolla-kubernetes-deploy-centos-binary-2-ceph-nv/21e4361/console.html
inet 146.20.105.26
2017-01-30 00:00:08.109056 | + curl ... http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
2017-01-30 00:04:44.002456 | curl: (18) transfer closed with 88598540 bytes remaining to read

http://logs.openstack.org/98/426598/5/check/gate-kolla-kubernetes-deploy-centos-binary-2-ceph-multi-nv/fe0849b/console.html
inet 146.20.105.198
2017-01-30 00:00:08.471434 | + curl -o ... http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
2017-01-30 00:04:42.685201 | curl: (18) transfer closed with 542002092 bytes remaining to read

http://logs.openstack.org/98/426598/5/check/gate-kolla-kubernetes-deploy-centos-binary-2-external-ovs-nv/38030a1/console.html
inet6 2001:4800:1ae1:18:f816:3eff:fe9b:f9e2/64
2017-01-30 00:01:07.306370 | + curl -o ... http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
2017-01-30 00:03:34.810258 | curl: (18) transfer closed with 222546512 bytes remaining to read
---

At first glance there is not much correlation on the server side, and two of the requests appear not to have been logged at all.
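(As an aside, whatever the server-side cause turns out to be, the jobs could make the download step more resilient by retrying with resume, so a disconnect part-way through doesn't throw away the bytes already fetched. A minimal sketch, assuming a POSIX shell; the `retry` helper is hypothetical, not something the jobs use today:)

```shell
# Hypothetical retry wrapper: run a command up to N times, pausing
# between attempts, returning success as soon as it succeeds once.
retry() {
    attempts=$1; shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# A job could then resume interrupted transfers with `curl -C -`, e.g.:
# retry 5 curl -f -C - -o centos-binary-ceph.tar.bz2 \
#     http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
```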
---
root@static:/var/log/apache2# grep 'centos-binary-ceph.tar.bz2' tarballs.openstack.org_* | grep '30/Jan'
tarballs.openstack.org_access.log:2001:4800:1ae1:18:f816:3eff:fe9e:4ccf - - [30/Jan/2017:00:00:33 +0000] "GET /kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2 HTTP/1.1" 200 1395496732 "-" "curl/7.29.0"
tarballs.openstack.org_access.log:2001:4800:1ae1:18:f816:3eff:fe67:2ca6 - - [30/Jan/2017:00:00:47 +0000] "GET /kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2 HTTP/1.1" 200 1395496732 "-" "curl/7.29.0"
tarballs.openstack.org_access.log:2001:4800:1ae1:18:f816:3eff:fe9b:f9e2 - - [30/Jan/2017:00:01:07 +0000] "GET /kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2 HTTP/1.1" 200 1395496732 "-" "curl/7.29.0"
---

However, in the generic Apache error log around that time I found the following:

---
[Sun Jan 29 23:59:16.284909 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:00:02.334021 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:00:04.336258 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:01:48.449350 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:02:25.490781 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:03:10.539081 2017] [mpm_event:error] [pid 1967:tid 140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
...
---

I think this is the smoking gun, because this condition leads to the death of the serving processes, which is logged a little later. Correlating the timestamps, several of them match up with the times at which the reported jobs were disconnected.
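(For context, as I understand it the scoreboard is sized at startup from the process and thread limits, and under mpm_event a child that is terminating gracefully keeps holding its process slot until its last connection finishes, which is how the scoreboard can fill up before MaxRequestWorkers is reached. Something along these illustrative lines determines the slot count; these values are examples, not our actual configuration:)

```apache
# Illustrative event-MPM settings (example values only).  The scoreboard
# has roughly ServerLimit process slots; gracefully-exiting children
# continue to occupy a slot while they drain connections.
<IfModule mpm_event_module>
    ServerLimit              16
    ThreadsPerChild          25
    MaxRequestWorkers       400
    MaxConnectionsPerChild    0
</IfModule>
```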
---
[Mon Jan 30 00:03:31.562290 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 14410 exit signal Segmentation fault (11)
[Mon Jan 30 00:03:35.566735 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 20378 exit signal Segmentation fault (11)
...
[Mon Jan 30 00:04:17.614883 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 4126 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:17.614951 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 4204 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:22.621893 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 4290 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:37.638901 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 4358 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:41.643388 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 4324 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:42.645053 2017] [core:notice] [pid 1967:tid 140205198583680] AH00052: child pid 20485 exit signal Segmentation fault (11)
---

This ungraceful exit might also explain the missing access log entries. Unfortunately, the prognosis for this issue is not great. The original bug [2] suggests it is a systemic issue, and it is discussed in the documentation [3], which says, in short:

  This mpm showed some scalability bottlenecks in the past leading to the following error: "scoreboard is full, not at MaxRequestWorkers". ... From 2.4.24 onward, mpm-event is smarter and it is able to handle graceful terminations in a much better way

We are not in a position to upgrade Apache immediately, and tuning does not seem to be a reliable workaround. Given this server's workload, avoiding the problem by switching to the worker MPM looks like the expedient solution. I've proposed that in [4].
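(For reference, a worker-MPM configuration would look something like the sketch below. The values are illustrative placeholders, not the settings actually proposed in [4]; worker uses a fixed thread pool per child rather than event's async connection handling, which sidesteps the graceful-termination path implicated above.)

```apache
# Hypothetical mpm_worker tuning (placeholder values, not the change in [4]).
<IfModule mpm_worker_module>
    ServerLimit              16
    StartServers              4
    MinSpareThreads          25
    MaxSpareThreads          75
    ThreadsPerChild          25
    MaxRequestWorkers       400
    MaxConnectionsPerChild    0
</IfModule>
```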
Thanks,
-i

[1] http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
[2] https://bz.apache.org/bugzilla/show_bug.cgi?id=53555
[3] https://httpd.apache.org/docs/2.4/mod/event.html (Graceful process termination)
[4] https://review.openstack.org/#/c/426639/

_______________________________________________
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra