Hi Tristan ! Thanks investigating that. It is pretty clear now there is an issue in Zuul catching that ref-updated event. I suspected that too since I saw sometime (rarely) in dev env the config-updated not executed (or if I remember well executed but after a long delay, by far after the event).
It can be interesting to try to bind the config-update on the change-merged event as you suggested and experiment with that in our functional tests and take a decision to keep that change or not before the next release of SF. Do you think trying to reproduce it in the Zuul unittest may be feasible ? Cheers, Fabien Le 02/12/2016 à 08:34, Tristan Cacqueray a écrit : > Hello folks, > > apologize for this long post, but this issue is randomly failing CI job > and I would like to share the result of a long and subtle gray hair > inducing investigation :) > > The most obvious symptom in the CI tests is the health-check failure of > this task: "Check for config-update post job". It can't find the commit > sha in config-update/builds/lastStableBuild logs. This is abnormal since > a previous task made sure the change has been Verified +2 and that it > was successfully merged, so there should have been a config-update run. > > I believe this failure may also be affecting other tests that rely on > the config-update job to succeed (such as the test_project_test and > test_resource_workflow). > > > Here is an analysis of this failure: > https://softwarefactory-project.io/jenkins/job/software-factory-functional-trackerless/46/console > > The relevant lines are: > [+] Starting the health base check > Fri Dec 2 00:33:53 UTC 2016 Running health-check/zuul.yaml > Zuul integration test SUCCESS > Fri Dec 2 00:35:43 UTC 2016 Running health-check/gerritbot.yaml > [...] > TASK [Check for config-update post job] > **************************************** > fatal: [managesf.sftests.com]: FAILED! => {"changed": false, "elapsed": > 1200, "failed": true, "msg": "Timeout when waiting for search string > 67ddb89966cc4999df1869d99d140dd969963416 in > /var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"} > to retry, use: --limit @/etc/ansible/health-check/gerritbot.retry > > > Those two health-check (zuul and gerritbot) are very similar as they > both trigger a config-update job. Looking in zuul.log, we can observe > how the first change (1,1) is being merged and how the config-update is > being successfully executed: > https://softwarefactory-project.io/paste/show/661/ > > As you can see, gerrit.GerritWatcher received two events: change-merged > and ref-updated, resulting in two trigger event eventually running the > config-update. > > > Then, gerritbot test merge another config change (4,1), > gerrit.GerritWatcher didn't received the ref-update event and the > config-update didn't get triggered: > https://softwarefactory-project.io/paste/show/662/ > > > The question is, does gerrit failed to emit that event, or does zuul > failed to process it? Using a fresh sf deployment, I tried to reproduce > the failure while dumping the gerrit events-stream in parallel: > > ssh sftests.com gerrit stream-events > stream.log & > while true; do > ansible-playbook health-check/zuul.yaml && \ > ansible-playbook health-check/gerritbot.yaml || break; > done > > > After about 3 hours of execution, it finally reproduces the infamous error: > "Timeout when waiting for search string > 65e5ef9b9c4de1d1e3f037f79efe9b55a7da9ccd in > /var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml" > > The stream.log do have both events logged: > https://softwarefactory-project.io/paste/show/664/ > > The zuul.log doesn't show any trace of the last updated-ref: > https://softwarefactory-project.io/paste/show/665/ > > > That's it for now, next is to debug the GerritWatcher, but since it's > happening so rarely, it's not an easy task. > > Alternatively, it may be worthy to run post pipeline on change-merged > instead of ref-updated. > > If anybody have an idea what's going with that failure, please chime in! > > Yours, > -Tristan > > > > _______________________________________________ > Softwarefactory-dev mailing list > [email protected] > https://www.redhat.com/mailman/listinfo/softwarefactory-dev >
signature.asc
Description: OpenPGP digital signature
_______________________________________________ Softwarefactory-dev mailing list [email protected] https://www.redhat.com/mailman/listinfo/softwarefactory-dev
