Folks,

Starting around 1:15pm DST / 5:15pm UTC, 20-30% of jobs have been failing with a 'Java Connection Closed Exception', which has caused a 100% verification failure rate for VPP Gerrit changes.  I have been working with Vanessa in LF-IT and Mohammed at Vexxhost to resolve the issue.

Based on past causes of connection resets, all network paths between the Nomad cluster and the Jenkins instance were tested for latency and packet loss, and no issues were uncovered.  Jenkins was then restarted, which unfortunately did not resolve the issue.  Next, the primary Nomad server, which Jenkins is configured to connect to for spinning up executors, was rebooted.  This too failed to resolve the issue.

Further investigation tonight with Mohammed's assistance (a huge THANK YOU to Mohammed for staying up late debugging this) seems to indicate that the Docker containers are dying prematurely.  However, the Nomad logs are being removed at the same time, so there is presently no means of verifying whether the containers are being terminated due to internal events.  The next step is to temporarily disable or reduce the frequency of Nomad garbage collection, archive the Nomad logs, and then collate them with the system logs to determine the order of events that causes the Docker containers to be terminated.
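
For anyone who wants to help once the logs are archived, the collation step could look roughly like the sketch below.  It is only an illustration, not the tooling actually in use: it assumes each log line starts with an ISO 8601 timestamp followed by a space, which the real Nomad and system logs may not do, and the names Event, parse_lines and collate are made up for this example.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    """One timestamped log line, tagged with the log it came from."""
    timestamp: datetime
    source: str
    message: str


def parse_lines(lines: Iterable[str], source: str) -> List[Event]:
    """Parse lines of the assumed form '<ISO-8601 timestamp> <message>'.

    Lines that do not start with a parseable timestamp are skipped.
    """
    events = []
    for line in lines:
        stamp, _, message = line.strip().partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # not in the assumed format; ignore the line
        events.append(Event(when, source, message))
    return events


def collate(*streams: List[Event]) -> List[Event]:
    """Merge events from several logs into one timeline ordered by time."""
    merged = [event for stream in streams for event in stream]
    # sorted() is stable, so events with equal timestamps keep stream order.
    return sorted(merged, key=lambda event: event.timestamp)


if __name__ == "__main__":
    # Invented sample lines, purely to show the shape of the output.
    nomad = parse_lines(["2000-01-01T00:00:05 sample nomad line"], "nomad")
    system = parse_lines(["2000-01-01T00:00:02 sample system line"], "syslog")
    for event in collate(nomad, system):
        print(event.timestamp.isoformat(), event.source, event.message)
```

Reading the merged timeline top to bottom should show whether a system-level event precedes each container termination, or whether the terminations come first.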

Thank you for your patience as the root cause of this outage is being investigated & fixed.
-daw-