Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-18 Thread Emmanuel Dreyfus
Emmanuel Dreyfus  wrote:

> This means the dd process getting stuck in tstile because glusterfsd
> died is probably a NetBSD kernel bug. I have to investigate. 

I think I found the culprit, but fixing this will need some discussions
on NetBSD lists:

dd waits on a vnode lock owned by the ioflush kernel thread, which is
responsible for periodic fsync.

ioflush is stuck on the following backtrace:
cv_wait
genfs_do_putpages
genfs_putpages
VOP_PUTPAGES
nfs_flush
nfs_fsync
VOP_FSYNC
nfs_sync
sync_fsync

The cv_wait() call in genfs_do_putpages():
    /* Wait for output to complete. */
    if (!wasclean && !async && vp->v_numoutput != 0) {
        while (vp->v_numoutput != 0)
            cv_wait(&vp->v_cv, slock);
    }

cv_wait() is an uninterruptible wait with no timeout, which is obviously
wrong there. cv_timedwait_sig() would be better, but that means pulling NFS
mount options from a lower layer, which is not obvious on the architecture
front.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-18 Thread Emmanuel Dreyfus
Emmanuel Dreyfus  wrote:

> > We again hit this problem [1]. Can we use soft mount with some retries and
> > timeouts so that we don't need manual intervention to recover a hung VM?
> 
> Um, looking at the current test scripts, we already do it. 

A side note: it seems the hung case is always with dd(1). I have never
caught tests using quota.c undergoing the same failure.

The only tests that do NFS mount + dd(1) are:

tests/basic/ec/nfs.t
tests/basic/mount-nfs-auth.t
tests/bugs/glusterfs/bug-872923.t
tests/bugs/quota/bug-1153964.t

Perhaps it is time to add options to quota.c and use it everywhere? It
would be interesting to understand what makes dd(1) hang while quota.c
is fine, though.
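
To re-check the list above, a minimal sketch along these lines could be used. The function name tests_with_both and the default patterns ("mount_nfs" and "dd ") are assumptions about how the test scripts spell the two commands, not something taken from the tree, and the "dd " match is crude, so the output still needs a manual look:

```shell
# Hypothetical helper: list *.t files under a directory whose contents
# match both patterns. The default patterns are assumptions and crude
# ("dd " also matches any other word ending in "dd").
tests_with_both() {
    _dir="$1"
    _pat_a="${2:-mount_nfs}"
    _pat_b="${3:-dd }"
    # First pass: files mentioning the first pattern, in sorted order.
    find "$_dir" -type f -name '*.t' -exec grep -l -e "$_pat_a" {} + |
        sort |
        while IFS= read -r _file; do
            # Second pass: keep only those that also mention the second.
            if grep -q -e "$_pat_b" "$_file"; then
                printf '%s\n' "$_file"
            fi
        done
}
```

From the top of the source tree this would be called as: tests_with_both tests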

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-18 Thread Emmanuel Dreyfus
Vijay Bellur  wrote:

> We again hit this problem [1]. Can we use soft mount with some retries and
> timeouts so that we don't need manual intervention to recover a hung VM?

Um, looking at the current test scripts, we already do it. In
tests/nfs.rc, both for Linux and NetBSD:
opt="soft,intr,vers=3$opt"

mount -vvv shows the options are indeed honoured. timeo is not specified,
but a default of 300 is used on NetBSD.
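
For reference, a minimal sketch of how the option string could carry an explicit timeout instead of relying on the default. The helper name nfs_mount_opts and the retrans value are hypothetical, and timeo/retrans are assumed to be accepted through -o on the platform in use, which should be checked against mount_nfs(8):

```shell
# Hypothetical helper in the style of tests/nfs.rc: build the NFS option
# string with an explicit timeout and retry count.
nfs_mount_opts() {
    _extra="$1"            # optional caller-supplied options, e.g. "nolock"
    _timeo="${2:-300}"     # 300 is the NetBSD default reported above
    _retrans="${3:-3}"     # assumed value, not a project default
    _opt="soft,intr,vers=3,timeo=${_timeo},retrans=${_retrans}"
    if [ -n "$_extra" ]; then
        _opt="${_opt},${_extra}"
    fi
    printf '%s\n' "$_opt"
}
```

The result would then be passed to mount with -o, for example: mount -o "$(nfs_mount_opts nolock 50 2)" ...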

This means the dd process getting stuck in tstile because glusterfsd
died is probably a NetBSD kernel bug. I have to investigate. 

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-18 Thread Emmanuel Dreyfus
Vijay Bellur  wrote:

> We again hit this problem [1]. Can we use soft mount with some retries
> and timeouts so that we don't need manual intervention to recover a hung VM?

Sure, but while we are at it, I advise a soft and interruptible mount (on
NetBSD, either mount -o soft,intr or mount -i -s).

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-18 Thread Vijay Bellur

On Tuesday 16 June 2015 02:19 AM, Emmanuel Dreyfus wrote:
> Rajesh Joseph wrote:
>
>> Correct me if I am wrong, but I think interruptible is good with hard
>> mount. Which is good in real deployment scenario. Since we are talking
>> about test scripts, I thought soft mount along with timeout period can be
>> a good option to prevent hangs.
>
> A soft mount means an I/O operation can time out and return failure.
> An interruptible mount means you can kill a process undergoing I/O, which
> is useful for cleanup routines.
>
> Using both is like belt and suspenders, but given how likely we are to
> hang, it does not hurt.

We again hit this problem [1]. Can we use soft mount with some retries
and timeouts so that we don't need manual intervention to recover a hung VM?


Thanks,
Vijay

[1] http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/6971/console




Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-15 Thread Emmanuel Dreyfus
Rajesh Joseph  wrote:

> Correct me if I am wrong, but I think interruptible is good with hard
> mount. Which is good in real deployment scenario. Since we are talking
> about test scripts, I thought soft mount along with timeout period can be
> a good option to prevent hangs.

A soft mount means an I/O operation can time out and return failure.
An interruptible mount means you can kill a process undergoing I/O, which
is useful for cleanup routines.

Using both is like belt and suspenders, but given how likely we are to
hang, it does not hurt.
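
As a rough illustration of the cleanup side, a watchdog along these lines could bound how long a test command runs. It is a sketch only: run_with_timeout is a hypothetical helper, not part of the test framework, and it can only help when the stuck process is killable (hence the interruptible mount). A process blocked in an uninterruptible kernel wait will ignore the signal.

```shell
# Hypothetical helper: run a command, and send it SIGTERM if it is still
# running after the given number of seconds.
run_with_timeout() {
    _secs="$1"
    shift
    "$@" &
    _cmd_pid=$!
    # Watchdog: after _secs seconds, ask the command to terminate.
    # Its output is discarded so a leftover sleep holds no pipes open.
    ( sleep "$_secs"; kill -TERM "$_cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    _dog_pid=$!
    _rc=0
    wait "$_cmd_pid" || _rc=$?
    # The command finished or was killed: stop the watchdog. Its sleep
    # may linger until it expires, which is harmless.
    kill "$_dog_pid" 2>/dev/null || true
    wait "$_dog_pid" 2>/dev/null || true
    return "$_rc"
}
```

A test could then wrap its I/O as: run_with_timeout 60 dd if=/dev/zero of=... and treat a non-zero status as a failure rather than hanging the whole run.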

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-15 Thread Rajesh Joseph



On Monday 15 June 2015 06:34 PM, Emmanuel Dreyfus wrote:
> On Mon, Jun 15, 2015 at 06:28:26PM +0530, Rajesh Joseph wrote:
>> For these test cases can't we use the nfs soft mount option to prevent the
>> hang?
>
> soft mount will not be enough. I think you also need interruptible.

Correct me if I am wrong, but I think interruptible is good with a hard
mount, which is good in a real deployment scenario. Since we are talking
about test scripts, I thought a soft mount along with a timeout period
could be a good option to prevent hangs.




Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-15 Thread Emmanuel Dreyfus
On Mon, Jun 15, 2015 at 06:28:26PM +0530, Rajesh Joseph wrote:
> For these test cases can't we use the nfs soft mount option to prevent the
> hang?

soft mount will not be enough. I think you also need interruptible.

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-15 Thread Rajesh Joseph



On Monday 15 June 2015 05:21 PM, Kaushal M wrote:
> The hang we observe is not something specific to Gluster. I've
> observed this kind of hang when a filesystem which is in use goes
> offline.
> For example, I've accidentally shut down machines which were being used
> for mounting nfs, which led to the client systems hanging completely
> and requiring a hard reboot.
>
> If there are ways to avoid these kinds of hangs when they eventually
> occur, I'm all ears.

For these test cases, can't we use the nfs soft mount option to prevent
the hang?






Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-15 Thread Emmanuel Dreyfus
On Mon, Jun 15, 2015 at 04:38:54PM +0530, Pranith Kumar Karampuri wrote:
> Emmanuel,
> I am not sure of the feasibility but just wanted to ask you. Do you
> think there is a possibility to error out operations on the mount when mount
> crashes instead of hanging? That would prevent a lot of manual intervention
> even in future.

Your message is a bit contradictory: there are bits quoted about NFS mount, 
which is native, and bits about glusterfs mount. What information are
you looking for?

If we talk about a hanging mount, this is probably an NFS client waiting
for an NFS server that will never respond. I already wrote how this can be
cleaned up with umount -f -R, and the limitations of that approach.

If we talk about crashing mount then this is more likely to be a
native mount, for which you have information in the logs, don't you?

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-15 Thread Kaushal M
The hang we observe is not something specific to Gluster. I've
observed this kind of hang when a filesystem which is in use goes
offline.
For example, I've accidentally shut down machines which were being used
for mounting nfs, which led to the client systems hanging completely
and requiring a hard reboot.

If there are ways to avoid these kinds of hangs when they eventually
occur, I'm all ears.



Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t

2015-06-15 Thread Pranith Kumar Karampuri

Emmanuel,
   I am not sure of the feasibility but just wanted to ask you. Do
you think it is possible to make operations on the mount return an error
when the mount crashes, instead of hanging? That would prevent a lot of
manual intervention in the future.


Pranith.
On 06/15/2015 01:35 PM, Niels de Vos wrote:
> Hi,
>
> sometimes the NetBSD regression tests hang with messages like this:
>
>   [12:29:07] ./tests/basic/mgmt_v3-locks.t
>   ... ok 79867 ms
>   No volumes present
>   mount_nfs: can't access /patchy: Permission denied
>   mount_nfs: can't access /patchy: Permission denied
>   mount_nfs: can't access /patchy: Permission denied
>
> Most (if not all) of these hangs are caused by a crashing Gluster/NFS
> process. Once the Gluster/NFS server is not reachable anymore,
> unmounting fails.
>
> The only way to recover is to reboot the VM and retrigger the test. For
> rebooting, the http://build.gluster.org/job/reboot-vm job can be used,
> and retriggering works by clicking the "retrigger" link in the left menu
> once the test has been marked as failed/aborted.
>
> When logging in on the NetBSD system that hangs, you can verify with
> these steps:
>
> 1. check if there is a /glusterfsd.core file
> 2. run gdb on the core:
>
>   # cd /build/install
>   # gdb --core=/glusterfsd.core sbin/glusterfs
>   ...
>   Program terminated with signal SIGSEGV, Segmentation fault.
>   #0  0xb9b94f0b in auth_cache_lookup (cache=0xb9aa2310, fh=0xb9044bf8,
>   host_addr=0xb900e400 "104.130.205.187", timestamp=0xbf7fd900,
>   can_write=0xbf7fd8fc)
>   at /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/nfs/server/src/auth-cache.c:164
>   164 *can_write = lookup_res->item->opts->rw;
>
> 3. verify the lookup_res structure:
>
>   (gdb) p *lookup_res
>   $1 = {timestamp = 1434284981, item = 0xb901e3b0}
>   (gdb) p *lookup_res->item
>   $2 = {name = 0xff00 , opts = 0x}
>
> A fix for this has been sent; it is currently waiting for an update to
> the proposed reference counting:
>
> - http://review.gluster.org/11022
>   core: add "gf_ref_t" for common refcounting structures
> - http://review.gluster.org/11023
>   nfs: refcount each auth_cache_entry and related data_t
>
> Thanks,
> Niels