Hi Tristan !

Thanks investigating that. It is pretty clear now there is an
issue in Zuul catching that ref-updated event.
I suspected that too since I saw sometime (rarely) in dev env
the config-updated not executed (or if I remember well executed but
after a long delay, by far after the event).

It can be interesting to try to bind the config-update on the change-merged
event as you suggested and experiment with that in our functional tests
and take a decision to keep that change or not before the next release of SF.

Do you think trying to reproduce it in the Zuul unittest may be feasible ?

Cheers,
Fabien

Le 02/12/2016 à 08:34, Tristan Cacqueray a écrit :
> Hello folks,
> 
> apologize for this long post, but this issue is randomly failing CI job
> and I would like to share the result of a long and subtle gray hair
> inducing investigation :)
> 
> The most obvious symptom in the CI tests is the health-check failure of
> this task: "Check for config-update post job". It can't find the commit
> sha in config-update/builds/lastStableBuild logs. This is abnormal since
> a previous task made sure the change has been Verified +2 and that it
> was successfully merged, so there should have been a config-update run.
> 
> I believe this failure may also be affecting other tests that rely on
> the config-update job to succeed (such as the test_project_test and
> test_resource_workflow).
> 
> 
> Here is an analysis of this failure:
> https://softwarefactory-project.io/jenkins/job/software-factory-functional-trackerless/46/console
> 
> The relevant lines are:
> [+] Starting the health base check
> Fri Dec  2 00:33:53 UTC 2016 Running health-check/zuul.yaml
> Zuul integration test SUCCESS
> Fri Dec  2 00:35:43 UTC 2016 Running health-check/gerritbot.yaml
> [...]
> TASK [Check for config-update post job]
> ****************************************
> fatal: [managesf.sftests.com]: FAILED! => {"changed": false, "elapsed":
> 1200, "failed": true, "msg": "Timeout when waiting for search string
> 67ddb89966cc4999df1869d99d140dd969963416 in
> /var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"}
>       to retry, use: --limit @/etc/ansible/health-check/gerritbot.retry
> 
> 
> Those two health-check (zuul and gerritbot) are very similar as they
> both trigger a config-update job.  Looking in zuul.log, we can observe
> how the first change (1,1) is being merged and how the config-update is
> being successfully executed:
> https://softwarefactory-project.io/paste/show/661/
> 
> As you can see, gerrit.GerritWatcher received two events: change-merged
> and ref-updated, resulting in two trigger event eventually running the
> config-update.
> 
> 
> Then, gerritbot test merge another config change (4,1),
> gerrit.GerritWatcher didn't received the ref-update event and the
> config-update didn't get triggered:
> https://softwarefactory-project.io/paste/show/662/
> 
> 
> The question is, does gerrit failed to emit that event, or does zuul
> failed to process it? Using a fresh sf deployment, I tried to reproduce
> the failure while dumping the gerrit events-stream in parallel:
> 
>     ssh sftests.com gerrit stream-events > stream.log &
>     while true; do
>       ansible-playbook health-check/zuul.yaml && \
>       ansible-playbook health-check/gerritbot.yaml || break;
>     done
> 
> 
> After about 3 hours of execution, it finally reproduces the infamous error:
> "Timeout when waiting for search string
> 65e5ef9b9c4de1d1e3f037f79efe9b55a7da9ccd in
> /var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"
> 
> The stream.log do have both events logged:
> https://softwarefactory-project.io/paste/show/664/
> 
> The zuul.log doesn't show any trace of the last updated-ref:
> https://softwarefactory-project.io/paste/show/665/
> 
> 
> That's it for now, next is to debug the GerritWatcher, but since it's
> happening so rarely, it's not an easy task.
> 
> Alternatively, it may be worthy to run post pipeline on change-merged
> instead of ref-updated.
> 
> If anybody have an idea what's going with that failure, please chime in!
> 
> Yours,
> -Tristan
> 
> 
> 
> _______________________________________________
> Softwarefactory-dev mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/softwarefactory-dev
> 

Attachment: signature.asc
Description: OpenPGP digital signature

_______________________________________________
Softwarefactory-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/softwarefactory-dev

Reply via email to