On Fri, Sep 08, 2017 at 06:55:19AM -0700, Frank Filz wrote:
> > On Fri, Sep 01, 2017 at 03:09:34PM -0700, Frank Filz wrote:
> > > Lately, we have been plagued by a lot of intermittent test failures.
> > >
> > > I have seen intermittent failures in pynfs WRT14, WRT15, and WRT16.
> > > These have not been resolved by the latest ntirpc pullup.
> > >
> > > Additionally, we see a lot of intermittent failures in the continuous
> > > integration.
> > >
> > > A big issue with the Centos CI is that it seems to have a fragile
> > > setup, and sometimes doesn't even succeed in trying to build Ganesha,
> > > and then fires a Verified -1. This makes it hard to evaluate what
> > > patches are actually ready for integration.
> > 
> > We can look into this, but it helps if you can provide a link to the patch
> > in GerritHub or the job in the CI.
> 
> Here's one merged last week with a Gluster CI Verify -1:
> 
> https://review.gerrithub.io/#/c/375463/
> 
> And just to preserve it in case... here's the log:
> 
> Triggered by Gerrit: https://review.gerrithub.io/375463 in silent mode.
> [EnvInject] - Loading node environment variables.
> Building remotely on nfs-ganesha-ci-slave01 (nfs-ganesha) in workspace
> /home/nfs-ganesha/workspace/nfs-ganesha_trigger-fsal_gluster
> [nfs-ganesha_trigger-fsal_gluster] $ /bin/sh -xe
> /tmp/jenkins5031649144466335345.sh
> + set +x
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time
> Current
>                                  Dload  Upload   Total   Spent    Left
> Speed
> 
>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--
> 0
>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--
> 0
> 100  1735  100  1735    0     0   8723      0 --:--:-- --:--:-- --:--:--
> 8718
> Traceback (most recent call last):
>   File "bootstrap.py", line 33, in <module>
>     b=json.loads(dat)
>   File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
>     return _default_decoder.decode(s)
>   File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
>     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>   File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
>     raise ValueError("No JSON object could be decoded")
> ValueError: No JSON object could be decoded
> https://ci.centos.org/job/nfs-ganesha_trigger-fsal_gluster/3455//console :
> FAILED
> Build step 'Execute shell' marked build as failure
> Finished: FAILURE
> 
> Which doesn't tell me much about why it failed, though it looks like a failure
> that has nothing to do with Ganesha...

From #centos-devel on Freenode:

15:49 < ndevos> bstinson: is 
https://ci.centos.org/job/nfs-ganesha_trigger-fsal_gluster/3487/console a known 
duffy problem? and how can the jobs work around this?
15:51 < bstinson> ndevos: you may be hitting the rate limit
15:52 < ndevos> bstinson: oh, that is possible, I guess... it might happen when 
a series of patches get sent
15:53 < ndevos> bstinson: should I do a sleep and retry in case of such a 
failure?
15:55 < bstinson> ndevos: yeah, that should work. we measure your usage over 5 
minutes
15:57 < ndevos> bstinson: ok, so sleeping 5 minutes, retry and loop should be 
acceptible?
15:59 < ndevos> bstinson: is there a particular message returned by duffy when 
the rate limit is hit? the reply is not json, but maybe some error?
15:59 < ndevos> (in plain text format?)
15:59 < bstinson> yeah 5 minutes should be acceptable, it does return a plain 
text error message
16:00 < bstinson> 'Deployment rate over quota, try again in a few minutes'

I added retry logic, which is now live and should apply to all upcoming
tests:

https://github.com/nfs-ganesha/ci-tests/commit/ed055058c7956ebb703464c742837a9ace797129
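
For the curious, the check behaves roughly like the sketch below. This is
only an illustration; the actual commit linked above may be structured
differently, and fetch() is just a placeholder for however bootstrap.py
obtains the raw reply from duffy:

# Sketch only, not the actual ci-tests change: retry while duffy's reply
# is not valid JSON, e.g. the plain-text message
# 'Deployment rate over quota, try again in a few minutes'.
import json
import sys
import time

RETRY_DELAY = 5 * 60   # duffy measures usage over a 5 minute window
MAX_RETRIES = 6        # give up after roughly half an hour

def request_nodes(fetch):
    # 'fetch' is a placeholder callable returning duffy's raw reply
    for attempt in range(MAX_RETRIES):
        dat = fetch()
        try:
            return json.loads(dat)
        except ValueError:
            # not JSON, most likely the rate-limit message: log it and wait
            sys.stderr.write("duffy returned a non-JSON reply: %r, "
                             "retrying in %d seconds\n" % (dat, RETRY_DELAY))
            time.sleep(RETRY_DELAY)
    raise RuntimeError("duffy kept refusing the request, giving up")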


> > > An additional issue with the Centos CI is that the failure logs often
> > > aren't preserved long enough to even diagnose the issue.
> > 
> > That is something we can change. Some jobs do not delete the results, but
> > others seem to do. How long (in days), or how many results would you like
> > to keep?
> 
> I'd say they need to be kept at least a week. If we could have time-based
> retention rather than number-of-results retention, I think that would help.

Some jobs seem to have been set to keep results for 7 days, with a maximum
of 7 builds. It does not really cost us anything, so I'll change it to 14
days. A screenshot of these settings is attached. It is possible that I
missed updating a job, so let us know if logs get deleted too early.
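
If someone wants to double-check that no job slipped through, something like
the snippet below could list the current settings. It is purely illustrative:
it assumes the python-jenkins module and credentials for the Jenkins
instance, and the URL/username/password are placeholders, not real values.

# Illustrative only: print the days-to-keep setting of the nfs-ganesha jobs.
import xml.etree.ElementTree as ET

import jenkins  # pip install python-jenkins

server = jenkins.Jenkins('https://ci.centos.org/',    # placeholder URL
                         username='changeme',         # placeholder
                         password='changeme-too')     # placeholder

for job in server.get_jobs():
    name = job['name']
    if not name.startswith('nfs-ganesha'):
        continue
    # config.xml may carry an XML declaration, so parse it as bytes
    config = ET.fromstring(server.get_job_config(name).encode('utf-8'))
    days = config.find('.//daysToKeep')
    print('%s: keeps builds for %s days' %
          (name, days.text if days is not None else 'unset'))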

> At least after a week, it's reasonable to expect folks to rebase their
> patches and re-submit, which would trigger a new run.
> 
> > > The result is that honestly, I mostly ignore the Centos CI results.
> > > They almost might as well not be run...
> > 
> > This is definitely not what we want, so lets fix the problems.
> 
> Yea, and thus my rant...

I really understand this; a CI should help to identify problems, not
introduce problems of its own. Let's try hard to make sure you do not need
to rant about it much more :-)

> > > Let's talk about CI more on a near-term concall (it would help if
> > > Niels and Jiffin could join a call to talk about this, our next call
> > > might be too soon for that).
> > 
> > Tuesdays tend to be very busy for me, and I am not sure I can join the call
> > next week. Arthy did some work on the jobs in the CentOS CI, she could
> > probably work with Jiffin to make any changes that improve the experience
> > for you. I'm happy to help out where I can too, of course :-)
> 
> If we can figure out another time to have a CI call, that would be helpful.

> It would be good to pull in Patrice from CEA as well as anyone else who
> cares.
> 
> It would really help if we could have someone with better time zone overlap
> with me who could manage the CI stuff, but that may not be realistic.

We can sign up anyone in the NFS-Ganesha community to do this. It takes a
little time to get familiar with the scripts and tools that are used, but
once that has settled it is relatively straightforward.

Volunteers?

Cheers,
Niels
