Hello folks, apologize for this long post, but this issue is randomly failing CI job and I would like to share the result of a long and subtle gray hair inducing investigation :)
The most obvious symptom in the CI tests is the health-check failure of this task: "Check for config-update post job". It can't find the commit sha in config-update/builds/lastStableBuild logs. This is abnormal since a previous task made sure the change has been Verified +2 and that it was successfully merged, so there should have been a config-update run. I believe this failure may also be affecting other tests that rely on the config-update job to succeed (such as the test_project_test and test_resource_workflow). Here is an analysis of this failure: https://softwarefactory-project.io/jenkins/job/software-factory-functional-trackerless/46/console The relevant lines are: [+] Starting the health base check Fri Dec 2 00:33:53 UTC 2016 Running health-check/zuul.yaml Zuul integration test SUCCESS Fri Dec 2 00:35:43 UTC 2016 Running health-check/gerritbot.yaml [...] TASK [Check for config-update post job] **************************************** fatal: [managesf.sftests.com]: FAILED! => {"changed": false, "elapsed": 1200, "failed": true, "msg": "Timeout when waiting for search string 67ddb89966cc4999df1869d99d140dd969963416 in /var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"} to retry, use: --limit @/etc/ansible/health-check/gerritbot.retry Those two health-check (zuul and gerritbot) are very similar as they both trigger a config-update job. Looking in zuul.log, we can observe how the first change (1,1) is being merged and how the config-update is being successfully executed: https://softwarefactory-project.io/paste/show/661/ As you can see, gerrit.GerritWatcher received two events: change-merged and ref-updated, resulting in two trigger event eventually running the config-update. Then, gerritbot test merge another config change (4,1), gerrit.GerritWatcher didn't received the ref-update event and the config-update didn't get triggered: https://softwarefactory-project.io/paste/show/662/ The question is, does gerrit failed to emit that event, or does zuul failed to process it? Using a fresh sf deployment, I tried to reproduce the failure while dumping the gerrit events-stream in parallel: ssh sftests.com gerrit stream-events > stream.log & while true; do ansible-playbook health-check/zuul.yaml && \ ansible-playbook health-check/gerritbot.yaml || break; done After about 3 hours of execution, it finally reproduces the infamous error: "Timeout when waiting for search string 65e5ef9b9c4de1d1e3f037f79efe9b55a7da9ccd in /var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml" The stream.log do have both events logged: https://softwarefactory-project.io/paste/show/664/ The zuul.log doesn't show any trace of the last updated-ref: https://softwarefactory-project.io/paste/show/665/ That's it for now, next is to debug the GerritWatcher, but since it's happening so rarely, it's not an easy task. Alternatively, it may be worthy to run post pipeline on change-merged instead of ref-updated. If anybody have an idea what's going with that failure, please chime in! Yours, -Tristan
signature.asc
Description: OpenPGP digital signature
_______________________________________________ Softwarefactory-dev mailing list [email protected] https://www.redhat.com/mailman/listinfo/softwarefactory-dev
