On 12/02/2016 08:38 AM, Fabien Boucher wrote:
> Hi Tristan !
> 
> Thanks investigating that. It is pretty clear now there is an
> issue in Zuul catching that ref-updated event.
> I suspected that too since I saw sometime (rarely) in dev env
> the config-updated not executed (or if I remember well executed but
> after a long delay, by far after the event).
> 
Ha, the same thing happened in my test environment, the event got
processed after I submitted a random change! It seems like there is a
really issue in GerritWatcher and I reported a bug upstream here:

  https://storyboard.openstack.org/#!/story/2000816


> It can be interesting to try to bind the config-update on the change-merged
> event as you suggested and experiment with that in our functional tests
> and take a decision to keep that change or not before the next release of SF.
> 
Actually the issue may happen with any type of events afterall, I'll try
to fix this at zuul level instead.

> Do you think trying to reproduce it in the Zuul unittest may be feasible ?
> 
> Cheers,
> Fabien
> 
> Le 02/12/2016 à 08:34, Tristan Cacqueray a écrit :
>> Hello folks,
>>
>> apologize for this long post, but this issue is randomly failing CI job
>> and I would like to share the result of a long and subtle gray hair
>> inducing investigation :)
>>
>> The most obvious symptom in the CI tests is the health-check failure of
>> this task: "Check for config-update post job". It can't find the commit
>> sha in config-update/builds/lastStableBuild logs. This is abnormal since
>> a previous task made sure the change has been Verified +2 and that it
>> was successfully merged, so there should have been a config-update run.
>>
>> I believe this failure may also be affecting other tests that rely on
>> the config-update job to succeed (such as the test_project_test and
>> test_resource_workflow).
>>
>>
>> Here is an analysis of this failure:
>> https://softwarefactory-project.io/jenkins/job/software-factory-functional-trackerless/46/console
>>
>> The relevant lines are:
>> [+] Starting the health base check
>> Fri Dec  2 00:33:53 UTC 2016 Running health-check/zuul.yaml
>> Zuul integration test SUCCESS
>> Fri Dec  2 00:35:43 UTC 2016 Running health-check/gerritbot.yaml
>> [...]
>> TASK [Check for config-update post job]
>> ****************************************
>> fatal: [managesf.sftests.com]: FAILED! => {"changed": false, "elapsed":
>> 1200, "failed": true, "msg": "Timeout when waiting for search string
>> 67ddb89966cc4999df1869d99d140dd969963416 in
>> /var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"}
>>      to retry, use: --limit @/etc/ansible/health-check/gerritbot.retry
>>
>>
>> Those two health-check (zuul and gerritbot) are very similar as they
>> both trigger a config-update job.  Looking in zuul.log, we can observe
>> how the first change (1,1) is being merged and how the config-update is
>> being successfully executed:
>> https://softwarefactory-project.io/paste/show/661/
>>
>> As you can see, gerrit.GerritWatcher received two events: change-merged
>> and ref-updated, resulting in two trigger event eventually running the
>> config-update.
>>
>>
>> Then, gerritbot test merge another config change (4,1),
>> gerrit.GerritWatcher didn't received the ref-update event and the
>> config-update didn't get triggered:
>> https://softwarefactory-project.io/paste/show/662/
>>
>>
>> The question is, does gerrit failed to emit that event, or does zuul
>> failed to process it? Using a fresh sf deployment, I tried to reproduce
>> the failure while dumping the gerrit events-stream in parallel:
>>
>>     ssh sftests.com gerrit stream-events > stream.log &
>>     while true; do
>>      ansible-playbook health-check/zuul.yaml && \
>>      ansible-playbook health-check/gerritbot.yaml || break;
>>     done
>>
>>
>> After about 3 hours of execution, it finally reproduces the infamous error:
>> "Timeout when waiting for search string
>> 65e5ef9b9c4de1d1e3f037f79efe9b55a7da9ccd in
>> /var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"
>>
>> The stream.log do have both events logged:
>> https://softwarefactory-project.io/paste/show/664/
>>
>> The zuul.log doesn't show any trace of the last updated-ref:
>> https://softwarefactory-project.io/paste/show/665/
>>
>>
>> That's it for now, next is to debug the GerritWatcher, but since it's
>> happening so rarely, it's not an easy task.
>>
>> Alternatively, it may be worthy to run post pipeline on change-merged
>> instead of ref-updated.
>>
>> If anybody have an idea what's going with that failure, please chime in!
>>
>> Yours,
>> -Tristan
>>
>>
>>
>> _______________________________________________
>> Softwarefactory-dev mailing list
>> [email protected]
>> https://www.redhat.com/mailman/listinfo/softwarefactory-dev
>>
> 
> 
> 
> _______________________________________________
> Softwarefactory-dev mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/softwarefactory-dev
> 


Attachment: signature.asc
Description: OpenPGP digital signature

_______________________________________________
Softwarefactory-dev mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/softwarefactory-dev

Reply via email to