On Mon, Dec 10, 2018, at 8:56 PM, Vysali Vaidhyam Subramanian wrote:
> Hello,
> 
> I am a grad student currently studying flaky tests.
> As part of my study, I've been examining the check and gate jobs in
> the OpenStack projects. I have been trying to identify why developers
> run rechecks and how often running a recheck helps in the
> identification of a flaky test.
> 
> To identify how often a recheck points to a flaky test, I need the
> test results of each of the rechecks.
> However, I have not been able to get this information from Gerrit for
> each recheck comment.
> I was wondering if the history of the jobs run against a recheck
> comment is available and if it can be retrieved.
> 
> It would be great if I could get some pointers :)
> 

This information is known to Zuul, but I don't think Zuul currently records a 
flag to indicate that results are due to some human-triggered retry mechanism. 
One approach could be to add this functionality to Zuul and rely on the Zuul 
builds db for that data.

Another approach that doesn't require updates to Zuul is to parse the Gerrit 
comments and flag things yourself. Check jobs only run when a new patchset is 
pushed or when a recheck is requested. This means the first results for a 
patchset are the initial set, and any subsequent results for that patchset from 
the check pipeline (indicated in the comment itself) are the result of rechecks.
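
To make that concrete, a rough sketch against the Gerrit REST API might look 
something like this. The change number is a placeholder and the exact Zuul 
result comment wording is a guess on my part, so check a few real comments and 
adjust the patterns before trusting the output:

    import json
    import re

    import requests

    GERRIT = "https://review.openstack.org"
    CHANGE = "12345"  # placeholder change number

    # Gerrit prefixes its JSON responses with )]}' to defeat XSSI, so
    # drop the first line before parsing.
    raw = requests.get("%s/changes/%s/detail" % (GERRIT, CHANGE)).text
    messages = json.loads(raw.split("\n", 1)[1]).get("messages", [])

    seen_results = set()  # patchsets that already have check results
    for msg in messages:
        ps = msg.get("_revision_number")
        text = msg.get("message", "")
        if re.search(r"^\s*recheck\b", text, re.IGNORECASE | re.MULTILINE):
            print("patchset %s: recheck requested at %s" % (ps, msg["date"]))
        elif "(check pipeline)" in text:  # assumed Zuul result wording
            kind = "recheck results" if ps in seen_results else "initial results"
            print("patchset %s: %s at %s" % (ps, kind, msg["date"]))
            seen_results.add(ps)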

The gate is a bit more complicated because shared gate queues can cause a 
change's tests to be rerun if a related change is rechecked. You can probably 
infer whether the recheck was on this particular change by looking for a 
preceding recheck comment on the change that doesn't yet have results.
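
Continuing the sketch above, one way to apply that heuristic might be the 
following (again, the gate comment wording is an assumption on my part):

    import re

    def gate_result_cause(messages):
        """Classify gate results using the `messages` list from the
        sketch above (Gerrit change messages in chronological order)."""
        pending_recheck = False
        for msg in messages:
            text = msg.get("message", "")
            if re.search(r"^\s*recheck\b", text, re.IGNORECASE | re.MULTILINE):
                pending_recheck = True
            elif "(gate pipeline)" in text:  # assumed Zuul result wording
                cause = ("a recheck on this change" if pending_recheck
                         else "a shared gate queue reset")
                print("gate results at %s likely due to %s"
                      % (msg["date"], cause))
                pending_recheck = False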

Unfortunately I don't know how clean the data is. I believe the Zuul comments 
have been very consistent over time, but I don't know that for sure. You may 
want to pursue both approaches: the first to make future data easier to 
consume, and the second to have a way to get at the preexisting data.

Some other thoughts: our job log retention time is quite short due to disk 
space constraints (~4 weeks?). While the Gerrit comments go back many years, if 
you want to know which specific test case a tempest job failed on, you'll only 
be able to get that data for the last month or so.

We also try to index our job logs in Elasticsearch and expose them via a Kibana 
web UI and a subset of the Elasticsearch API at http://logstash.openstack.org. 
More details at https://docs.openstack.org/infra/system-config/logstash.html. 
We are happy for people to use that for additional insight. Just please try to 
be nice to our cluster, and we'd love it if you shared your insights/results 
with us too.
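
As a starting point, a query against that cluster might look something like 
the snippet below. The /elasticsearch path and the field names are assumptions 
on my part (they mirror what our bug fingerprints tend to use); the docs above 
have the real details:

    import requests

    # Assumption: the standard Elasticsearch _search endpoint is proxied
    # under /elasticsearch; check the system-config docs for the actual
    # path and index names, and please keep result sizes small.
    query = {
        "query": {
            "query_string": {
                "query": 'build_status:"FAILURE" AND tags:"console"'
            }
        },
        "size": 10,
    }
    resp = requests.post(
        "http://logstash.openstack.org/elasticsearch/_search", json=query)
    for hit in resp.json()["hits"]["hits"]:
        src = hit["_source"]
        print(src.get("build_name"), src.get("build_status"),
              src.get("@timestamp"))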

Finally, we do some tracking of what we think are the reasons for rechecks with 
our "elastic-recheck" tool. It builds on top of the Elasticsearch cluster above, 
using bug fingerprint queries to track the occurrence of known issues. 
http://status.openstack.org/elastic-recheck/ renders graphs, and the source 
repo for elastic-recheck has all the query fingerprints. Again, feel free to 
use this tool if it is helpful, but we'd love insights/feedback/etc. if you end 
up learning anything interesting with it.

Hope this was useful,
Clark

_______________________________________________
OpenStack-Infra mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
