Folks,
Starting around 1:15pm DST / 5:15pm UTC, there has been a 20-30% job
failure rate due to 'Java Connection Closed Exception' errors, causing a
100% verification failure rate of VPP gerrit changes. I have been working
with Vanessa in LF-IT and Mohammed at Vexxhost to resolve the issue.
Based on past causes of connection resets, all network paths between the
Nomad cluster and Jenkins instance were tested for latency and packet
loss without any issues being uncovered. Jenkins was restarted, which
unfortunately did not resolve the issue. Then the primary Nomad server,
which Jenkins is configured to connect to for spinning up executors, was
rebooted. This too failed to resolve the issue.
Further investigation tonight with Mohammed's assistance (a huge THANK
YOU to Mohammed for staying up late debugging this) seems to indicate
that the Docker containers are dying prematurely. However, the Nomad
logs are also being removed at the same time, so there is presently no
means of verifying whether the containers are being terminated due to
internal events. The next step is to temporarily disable or reduce the
frequency of Nomad garbage collection, archive the Nomad logs, and then
collate them with the system logs to determine the order of events that
cause the Docker containers to be terminated.
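As a rough illustration of the collation step only, here is a minimal
Python sketch. It assumes each archived log line starts with an ISO 8601
timestamp followed by a space and the message, and that each log is
already in time order. The real Nomad and system log formats may differ,
and the function names and sample messages are made up for the example.

```python
import heapq
from datetime import datetime


def parse_line(source, line):
    """Split 'TIMESTAMP message' into (datetime, source, message).

    Assumes an ISO 8601 timestamp such as 2021-01-01T00:00:02 at the
    start of the line; adjust the parsing to the real log format.
    """
    stamp, _, message = line.strip().partition(" ")
    return datetime.fromisoformat(stamp), source, message


def collate(nomad_lines, system_lines):
    """Merge two individually time-ordered logs into one timeline."""
    nomad = (parse_line("nomad", l) for l in nomad_lines if l.strip())
    system = (parse_line("syslog", l) for l in system_lines if l.strip())
    # heapq.merge interleaves the sorted inputs by timestamp without
    # loading and re-sorting everything at once.
    return list(heapq.merge(nomad, system, key=lambda entry: entry[0]))


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the real logs.
    nomad_log = [
        "2021-01-01T00:00:02 task terminated",
        "2021-01-01T00:00:05 garbage collection ran",
    ]
    system_log = [
        "2021-01-01T00:00:01 oom-killer invoked",
        "2021-01-01T00:00:03 container stopped",
    ]
    for when, source, message in collate(nomad_log, system_log):
        print(when.isoformat(), source, message)
```

Reading the merged timeline top to bottom should show whether a system
event precedes each container termination.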
Thank you for your patience as the root cause of this outage is being
investigated & fixed.
-daw-