On 12/02/2016 08:38 AM, Fabien Boucher wrote: > Hi Tristan ! > > Thanks investigating that. It is pretty clear now there is an > issue in Zuul catching that ref-updated event. > I suspected that too since I saw sometime (rarely) in dev env > the config-updated not executed (or if I remember well executed but > after a long delay, by far after the event). > Ha, the same thing happened in my test environment, the event got processed after I submitted a random change! It seems like there is a really issue in GerritWatcher and I reported a bug upstream here:
https://storyboard.openstack.org/#!/story/2000816 > It can be interesting to try to bind the config-update on the change-merged > event as you suggested and experiment with that in our functional tests > and take a decision to keep that change or not before the next release of SF. > Actually the issue may happen with any type of events afterall, I'll try to fix this at zuul level instead. > Do you think trying to reproduce it in the Zuul unittest may be feasible ? > > Cheers, > Fabien > > Le 02/12/2016 à 08:34, Tristan Cacqueray a écrit : >> Hello folks, >> >> apologize for this long post, but this issue is randomly failing CI job >> and I would like to share the result of a long and subtle gray hair >> inducing investigation :) >> >> The most obvious symptom in the CI tests is the health-check failure of >> this task: "Check for config-update post job". It can't find the commit >> sha in config-update/builds/lastStableBuild logs. This is abnormal since >> a previous task made sure the change has been Verified +2 and that it >> was successfully merged, so there should have been a config-update run. >> >> I believe this failure may also be affecting other tests that rely on >> the config-update job to succeed (such as the test_project_test and >> test_resource_workflow). >> >> >> Here is an analysis of this failure: >> https://softwarefactory-project.io/jenkins/job/software-factory-functional-trackerless/46/console >> >> The relevant lines are: >> [+] Starting the health base check >> Fri Dec 2 00:33:53 UTC 2016 Running health-check/zuul.yaml >> Zuul integration test SUCCESS >> Fri Dec 2 00:35:43 UTC 2016 Running health-check/gerritbot.yaml >> [...] >> TASK [Check for config-update post job] >> **************************************** >> fatal: [managesf.sftests.com]: FAILED! => {"changed": false, "elapsed": >> 1200, "failed": true, "msg": "Timeout when waiting for search string >> 67ddb89966cc4999df1869d99d140dd969963416 in >> /var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"} >> to retry, use: --limit @/etc/ansible/health-check/gerritbot.retry >> >> >> Those two health-check (zuul and gerritbot) are very similar as they >> both trigger a config-update job. Looking in zuul.log, we can observe >> how the first change (1,1) is being merged and how the config-update is >> being successfully executed: >> https://softwarefactory-project.io/paste/show/661/ >> >> As you can see, gerrit.GerritWatcher received two events: change-merged >> and ref-updated, resulting in two trigger event eventually running the >> config-update. >> >> >> Then, gerritbot test merge another config change (4,1), >> gerrit.GerritWatcher didn't received the ref-update event and the >> config-update didn't get triggered: >> https://softwarefactory-project.io/paste/show/662/ >> >> >> The question is, does gerrit failed to emit that event, or does zuul >> failed to process it? Using a fresh sf deployment, I tried to reproduce >> the failure while dumping the gerrit events-stream in parallel: >> >> ssh sftests.com gerrit stream-events > stream.log & >> while true; do >> ansible-playbook health-check/zuul.yaml && \ >> ansible-playbook health-check/gerritbot.yaml || break; >> done >> >> >> After about 3 hours of execution, it finally reproduces the infamous error: >> "Timeout when waiting for search string >> 65e5ef9b9c4de1d1e3f037f79efe9b55a7da9ccd in >> /var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml" >> >> The stream.log do have both events logged: >> https://softwarefactory-project.io/paste/show/664/ >> >> The zuul.log doesn't show any trace of the last updated-ref: >> https://softwarefactory-project.io/paste/show/665/ >> >> >> That's it for now, next is to debug the GerritWatcher, but since it's >> happening so rarely, it's not an easy task. >> >> Alternatively, it may be worthy to run post pipeline on change-merged >> instead of ref-updated. >> >> If anybody have an idea what's going with that failure, please chime in! >> >> Yours, >> -Tristan >> >> >> >> _______________________________________________ >> Softwarefactory-dev mailing list >> [email protected] >> https://www.redhat.com/mailman/listinfo/softwarefactory-dev >> > > > > _______________________________________________ > Softwarefactory-dev mailing list > [email protected] > https://www.redhat.com/mailman/listinfo/softwarefactory-dev >
signature.asc
Description: OpenPGP digital signature
_______________________________________________ Softwarefactory-dev mailing list [email protected] https://www.redhat.com/mailman/listinfo/softwarefactory-dev
