On 01/08/2016 03:25 PM, Emmanuel Dreyfus wrote:
On Fri, Jan 08, 2016 at 03:18:02PM +0530, Pranith Kumar Karampuri wrote:
Does the cleanup script need to be manually executed on the NetBSD
machine?
You can run the script manually, but if the goal is to restore a
misbehaving machine, rebooting is probably the fastest way to sort
the issue.

While thinking about it, I suspect there may be some benefit in
rebooting the machine if the regression does not finish within a
sane amount of time.

Rebooting because a single test led to a crash may not be a good idea. We need a reliable way of detecting that the mount hung because of a crash, and then running this cleanup script when that happens. So the question is: can we detect this state?
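
As a very rough idea of what that detection could look like, here is an untested sketch (the mount point and the cleanup script path are assumptions, not necessarily what the slaves actually use):

#!/usr/bin/env python3
# Untested sketch: treat the mount as hung if stat() on it does not return
# within a few seconds, and run the cleanup script in that case.
# MOUNT and CLEANUP are assumptions, not the real paths on the slaves.
import subprocess

MOUNT = "/mnt/glusterfs/0"
CLEANUP = "/opt/qa/cleanup.sh"

def mount_is_hung(path, timeout=10):
    # stat() on a wedged FUSE mount can block forever, so run it in a
    # child process and give up after `timeout` seconds.
    proc = subprocess.Popen(["stat", path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        proc.wait(timeout=timeout)
        return proc.returncode != 0
    except subprocess.TimeoutExpired:
        proc.kill()
        return True

if __name__ == "__main__":
    if mount_is_hung(MOUNT):
        subprocess.call(["sh", CLEANUP])

Checking for a crashed glusterfs process (or a fresh core file) could be a complementary signal, but a stat probe on the mount seems like the simplest starting point.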


A first step could be to parse the Jenkins logs and find which tests fail or hang
most often in the NetBSD regression.
This work is under way. I will have to change some of the scripts I wrote to
get this information.
Great.

To avoid duplication of work, are there any tests that you are
already investigating? If not, that is the first thing I will try to find out.
No, I have not started investigating yet because I have no idea where
I should look. Your input will be very valuable.
Since we don't have the script yet, I did this manually:

Here are the results for the last 15-20 runs:

Test                                              Number of times it happened
tests/basic/afr/arbiter-statfs.t (bad status 1)   5
tests/basic/afr/self-heal.t                       1
tests/basic/afr/entry-self-heal.t                 1
tests/basic/quota-nfs.t                           2
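
To automate this counting, something along these lines might work as a starting point (untested sketch; the build-number range is a guess at the last ~20 runs, and it greps for the same "bad status" marker I looked for manually):

#!/usr/bin/env python3
# Untested sketch: count, per test, how many recent NetBSD regression runs
# reported it with "bad status" in the Jenkins console log.
# FIRST/LAST are an assumed build-number range; adjust to the runs of interest.
import re
import urllib.request
from collections import Counter

JOB = "https://build.gluster.org/job/rackspace-netbsd7-regression-triggered"
FIRST, LAST = 13264, 13283

failures = Counter()
for build in range(FIRST, LAST + 1):
    url = "%s/%d/consoleText" % (JOB, build)
    try:
        text = urllib.request.urlopen(url, timeout=60).read().decode("utf-8", "replace")
    except Exception:
        continue  # the run may have been pruned or aborted before producing a log
    for match in re.finditer(r"(\.?/?tests/\S+\.t)\s*:\s*bad status", text):
        failures[match.group(1)] += 1

for test, count in failures.most_common():
    print("%3d  %s" % (count, test))

If this roughly matches the manual counts above, it could be folded into the script changes you mentioned.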



The following happened 4 times.
One example: https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13283/console. Another run, https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13280/console, seems different compared to the one above.

+ '/opt/qa/build.sh'
Build timed out (after 300 minutes). Marking the build as failed.
Build was aborted
Finished: FAILURE



The following happened 4 times.
One example: https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13279/console

ERROR: Connection was broken: java.io.IOException: Unexpected EOF
        at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:99)
        at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)


I can take a look at why the tests are failing (on Sunday, not today :-)). Could you look at why the timeouts and the 'Connection was broken' errors are happening?

Once we find out what happened, the first goal is to detect and repair it automatically. If we can't, let us write up a wiki page or something describing how to proceed when this happens.

Pranith


_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
