Final update: Peter fixed the vpp-device job issue and I have issued 'recheck' on all gerrit changes in the VPP queue which failed due to this week's outage.

Operation of the CI is now back to normal.  There is still a low rate of 'git clone' failures (~1/day), so a 'recheck' may be required if a job fails to clone the vpp repo. A fix for this issue is tentatively scheduled for early August.

Thanks again for your patience.
-daw-

On 7/20/2021 6:08 PM, Dave Wallace via lists.fd.io wrote:
Folks,

Troubleshooting of latency issues in the datapath between the Jenkins openstack instance and the Nomad cluster pointed to the host of the Ingress instance as the source of the problem, so the Ingress instance was live migrated to another host.  In addition, the primary Nomad server was rebooted. These changes resolved the 'Java Connection Closed Exception' issues.

However, at this time, the vpp-device job is still failing due to a known issue which will be resolved when Peter Mikus comes online tomorrow morning CET.  Once the vpp-device job failures have been resolved, I will be issuing 'recheck' on open VPP gerrit changes which are failing due to the vpp-device job. Please feel free to 'recheck' your gerrit changes if you would like to verify that the rest of the CI jobs complete successfully.

I'd like to thank Mohammed Naser, Vanessa Valderrama, Anton Baranov, Peter Mikus & Maciek Konstantynowicz for their coordinated efforts in resolving this outage.

Thanks again for your patience during this CI outage.
-daw-

On 7/19/2021 10:51 PM, Dave Wallace via lists.fd.io wrote:
Folks,

Vanessa performed a Jenkins reset at my request to see if that would resolve this problem.  Unfortunately, the Jenkins reset did not resolve the connection resets. A recheck of a gerrit change after the Jenkins reset failed with multiple job failures due to TCP connection resets:

https://gerrit.fd.io/r/c/vpp/+/32858/6#message-c77806c2fd58c3c00935e1b5589a402e4b670f9f

The failures also show no correlation with Ping Monitor events, Nomad cluster events, Nomad host, subnet, or docker image.

Investigation continues in the datapath between the Jenkins openstack instance and the Nomad cluster.

Thanks again for your patience.
-daw-

On 7/19/2021 11:29 AM, Dave Wallace via lists.fd.io wrote:
Folks,

There have been large numbers of CI job failures due to 'Java Connection Closed Exception' errors, which appear to have started on July 17.

I have opened a ticket with Vexxhost and am actively diagnosing the problem with them.

Thank you for your patience while the issue is being resolved.
-daw-







