Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
Emmanuel Dreyfus wrote:

> This means the dd process getting stuck in tstile because glusterfsd
> died is probably a NetBSD kernel bug. I have to investigate.

I think I found the culprit, but fixing this will need some discussion on the NetBSD lists: dd waits on a vnode lock owned by the ioflush kernel thread, which is responsible for periodic fsync. ioflush is stuck with the following backtrace:

    cv_wait
    genfs_do_putpages
    genfs_putpages
    VOP_PUTPAGES
    nfs_flush
    nfs_fsync
    VOP_FSYNC
    nfs_sync
    sync_fsync

The cv_wait() call in genfs_do_putpages():

    /* Wait for output to complete. */
    if (!wasclean && !async && vp->v_numoutput != 0) {
            while (vp->v_numoutput != 0)
                    cv_wait(&vp->v_cv, slock);
    }

cv_wait() is an uninterruptible wait with no timeout, which is obviously wrong there. cv_timedwait_sig() would be better, but that means pulling NFS mount options from a lower layer. Not obvious on the architecture front.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
Emmanuel Dreyfus wrote:

> > We again hit this problem [1]. Can we use soft mount with some retries and
> > timeouts so that we don't need manual intervention to recover a hung VM?
>
> Um, looking at the current test scripts, we already do it.

A side note: it seems the hung case is always with dd(1). I have never caught tests using quota.c undergoing the same failure.

The only tests that do NFS mount + dd(1) are:

    tests/basic/ec/nfs.t
    tests/basic/mount-nfs-auth.t
    tests/bugs/glusterfs/bug-872923.t
    tests/bugs/quota/bug-1153964.t

Perhaps it is time to add options to quota.c and use it everywhere? It would be interesting to understand what makes dd(1) hang while quota.c is fine, though.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
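To re-check which tests combine an NFS mount with dd(1), something like the following might help. It is only a sketch: the function name list_nfs_dd_tests is made up for illustration, and it assumes the test scripts call a helper literally named mount_nfs, which has not been verified against the tree.

```shell
# list_nfs_dd_tests DIR
# Hypothetical helper: print the .t files under DIR that mention mount_nfs
# and also contain dd as a standalone word.
list_nfs_dd_tests() {
    dir="$1"
    # Step 1: keep only .t files that mention mount_nfs (assumed helper name).
    find "$dir" -type f -name '*.t' -exec grep -l 'mount_nfs' {} + |
    while IFS= read -r f; do
        # Step 2: keep files where dd appears as its own word, so that
        # words such as "add" do not count.
        if grep -Eq '(^|[^[:alnum:]_])dd([^[:alnum:]_]|$)' "$f"; then
            printf '%s\n' "$f"
        fi
    done | sort
}
```

Running list_nfs_dd_tests tests from the top of the source tree would print one path per line; comparing that output with the four tests listed above is left to the reader.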
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
Vijay Bellur wrote:

> We again hit this problem [1]. Can we use soft mount with some retries and
> timeouts so that we don't need manual intervention to recover a hung VM?

Um, looking at the current test scripts, we already do it. In tests/nfs.rc, both for Linux and NetBSD:

    opt="soft,intr,vers=3$opt"

mount -vvv shows the options are indeed honoured. timeo is not specified, but a default of 300 is used on NetBSD.

This means the dd process getting stuck in tstile because glusterfsd died is probably a NetBSD kernel bug. I have to investigate.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
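For illustration, the option string from tests/nfs.rc could be extended with explicit timeout settings roughly as follows. This is a sketch under stated assumptions: nfs_mount_opts is a hypothetical helper and not part of nfs.rc, and the timeo and retrans names and values in the usage note are only assumed to be accepted by the platform's NFS mount command; check the local mount_nfs(8) or nfs(5) manual before relying on them.

```shell
# nfs_mount_opts [EXTRA]
# Hypothetical helper: start from the soft,intr,vers=3 base used in
# tests/nfs.rc and append caller-supplied options, if any.
nfs_mount_opts() {
    extra="$1"
    opts="soft,intr,vers=3"
    if [ -n "$extra" ]; then
        # EXTRA is a comma-separated list such as "timeo=30,retrans=3"
        # (illustrative values, not project settings).
        opts="$opts,$extra"
    fi
    printf '%s\n' "$opts"
}
```

Called as nfs_mount_opts "timeo=30,retrans=3", this prints soft,intr,vers=3,timeo=30,retrans=3. As the message above notes, the hang occurs even with soft,intr in place, so a tighter timeout is not expected to fix the underlying kernel problem.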
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
Vijay Bellur wrote:

> We again hit this problem [1]. Can we use soft mount with some retries
> and timeouts so that we don't need manual intervention to recover a hung VM?

Sure, but while we are at it, I advise a soft and interruptible mount (on NetBSD, either mount -o soft,intr or mount -i -s).

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
On Tuesday 16 June 2015 02:19 AM, Emmanuel Dreyfus wrote:
> Rajesh Joseph wrote:
>> Correct me if I am wrong, but I think interruptible is good with hard
>> mount, which is good in a real deployment scenario. Since we are talking
>> about test scripts, I thought soft mount along with a timeout period can
>> be a good option to prevent hangs.
>
> soft mount means an I/O operation can time out and return failure.
> interruptible mount means you can kill a process undergoing I/O, which
> is useful for a cleanup routine.
>
> Both together are like belt and suspenders, but given how likely we are
> to hang, it does not hurt.

We again hit this problem [1]. Can we use soft mount with some retries and timeouts so that we don't need manual intervention to recover a hung VM?

Thanks,
Vijay

[1] http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/6971/console
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
Rajesh Joseph wrote:

> Correct me if I am wrong, but I think interruptible is good with hard
> mount. Which is good in real deployment scenario. Since we are talking
> about test scripts, I thought soft mount along with timeout period can be
> a good option to prevent hangs.

soft mount means an I/O operation can time out and return failure.
interruptible mount means you can kill a process undergoing I/O, which is useful for a cleanup routine.

Both together are like belt and suspenders, but given how likely we are to hang, it does not hurt.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
On Monday 15 June 2015 06:34 PM, Emmanuel Dreyfus wrote:
> On Mon, Jun 15, 2015 at 06:28:26PM +0530, Rajesh Joseph wrote:
>> For these test cases can't we use the nfs soft mount option to prevent
>> the hang?
>
> soft mount will not be enough. I think you also need interruptible.

Correct me if I am wrong, but I think interruptible is good with hard mount, which is good in a real deployment scenario. Since we are talking about test scripts, I thought soft mount along with a timeout period can be a good option to prevent hangs.
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
On Mon, Jun 15, 2015 at 06:28:26PM +0530, Rajesh Joseph wrote:
> For these test cases can't we use the nfs soft mount option to prevent the
> hang?

soft mount will not be enough. I think you also need interruptible.

-- 
Emmanuel Dreyfus
m...@netbsd.org
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
On Monday 15 June 2015 05:21 PM, Kaushal M wrote:
> The hang we observe is not something specific to Gluster. I've
> observed this kind of hang when a filesystem which is in use goes
> offline. For example, I've accidentally shut down machines which were
> being used for mounting nfs, which led to the client systems hanging
> completely and required a hard reboot.
>
> If there are ways to avoid these kinds of hangs when they eventually
> occur, I'm all ears.

For these test cases can't we use the nfs soft mount option to prevent the hang?

> On Mon, Jun 15, 2015 at 4:38 PM, Pranith Kumar Karampuri wrote:
>> Emmanuel,
>> I am not sure of the feasibility but just wanted to ask you. Do you
>> think there is a possibility to error out operations on the mount when
>> mount crashes instead of hanging? That would prevent a lot of manual
>> intervention even in future.
>>
>> Pranith.
>>
>> On 06/15/2015 01:35 PM, Niels de Vos wrote:
>>> Hi,
>>>
>>> sometimes the NetBSD regression tests hang with messages like this:
>>>
>>>     [12:29:07] ./tests/basic/mgmt_v3-locks.t
>>>     ... ok 79867 ms
>>>     No volumes present
>>>     mount_nfs: can't access /patchy: Permission denied
>>>     mount_nfs: can't access /patchy: Permission denied
>>>     mount_nfs: can't access /patchy: Permission denied
>>>
>>> Most (if not all) of these hangs are caused by a crashing Gluster/NFS
>>> process. Once the Gluster/NFS server is not reachable anymore,
>>> unmounting fails.
>>>
>>> The only way to recover is to reboot the VM and retrigger the test. For
>>> rebooting, the http://build.gluster.org/job/reboot-vm job can be used,
>>> and retriggering works by clicking the "retrigger" link in the left menu
>>> once the test has been marked as failed/aborted.
>>>
>>> When logging in on the NetBSD system that hangs, you can verify with
>>> these steps:
>>>
>>> 1. check if there is a /glusterfsd.core file
>>> 2. run gdb on the core:
>>>
>>>     # cd /build/install
>>>     # gdb --core=/glusterfsd.core sbin/glusterfs
>>>     ...
>>>     Program terminated with signal SIGSEGV, Segmentation fault.
>>>     #0 0xb9b94f0b in auth_cache_lookup (cache=0xb9aa2310, fh=0xb9044bf8,
>>>         host_addr=0xb900e400 "104.130.205.187", timestamp=0xbf7fd900,
>>>         can_write=0xbf7fd8fc)
>>>         at /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/nfs/server/src/auth-cache.c:164
>>>     164 *can_write = lookup_res->item->opts->rw;
>>>
>>> 3. verify the lookup_res structure:
>>>
>>>     (gdb) p *lookup_res
>>>     $1 = {timestamp = 1434284981, item = 0xb901e3b0}
>>>     (gdb) p *lookup_res->item
>>>     $2 = {name = 0xff00 , opts = 0x}
>>>
>>> A fix for this has been sent; it is currently waiting for an update to
>>> the proposed reference counting:
>>>
>>> - http://review.gluster.org/11022
>>>   core: add "gf_ref_t" for common refcounting structures
>>> - http://review.gluster.org/11023
>>>   nfs: refcount each auth_cache_entry and related data_t
>>>
>>> Thanks,
>>> Niels
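The first two verification steps quoted above could be scripted along these lines. This is a rough sketch, not an existing tool: check_core is a hypothetical name, and it assumes gdb is installed on the VM and that the paths from the message (/glusterfsd.core, /build/install/sbin/glusterfs) still apply.

```shell
# check_core CORE BINARY
# Hypothetical helper: report whether CORE exists and, when gdb is
# available, print a backtrace without starting an interactive session.
check_core() {
    core="$1"
    binary="$2"
    if [ ! -f "$core" ]; then
        echo "no core file at $core"
        return 1
    fi
    echo "core file found: $core"
    if command -v gdb >/dev/null 2>&1; then
        # --batch exits after running the commands given with -ex.
        gdb --batch -ex bt --core="$core" "$binary" 2>&1 || true
    fi
    return 0
}
```

On a hung VM this might be invoked as check_core /glusterfsd.core /build/install/sbin/glusterfs. Inspecting lookup_res as in step 3 still needs an interactive gdb session or further -ex commands.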
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
On Mon, Jun 15, 2015 at 04:38:54PM +0530, Pranith Kumar Karampuri wrote:
> Emmanuel,
> I am not sure of the feasibility but just wanted to ask you. Do you
> think there is a possibility to error out operations on the mount when mount
> crashes instead of hanging? That would prevent a lot of manual intervention
> even in future.

Your message is a bit contradictory: there are bits quoted about NFS mount, which is native, and bits about glusterfs mount. What information are you looking for?

If we talk about a hanging mount, this is probably the NFS client waiting for an NFS server that will never return. I already wrote how this can be cleaned up with umount -f -R, and the limitation of that approach.

If we talk about a crashing mount, then this is more likely to be a native mount, for which you have information in the logs, don't you?

-- 
Emmanuel Dreyfus
m...@netbsd.org
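The umount -f -R cleanup mentioned above could be wrapped for use in a test teardown roughly like this. It is a sketch with assumptions: force_umount_nfs and the UMOUNT override variable are invented for illustration, and -R is used with its NetBSD meaning (pass the path straight to unmount(2)); on Linux, umount -R means something different (recursive unmount).

```shell
# force_umount_nfs MOUNTPOINT
# Hypothetical helper: force-unmount a hung NFS mount point the NetBSD way.
force_umount_nfs() {
    mnt="$1"
    # UMOUNT may be set to "echo umount" for a dry run; it defaults to the
    # real umount command. The expansion is left unquoted on purpose so
    # that a two-word override is split into command and argument.
    ${UMOUNT:-umount} -f -R "$mnt"
}
```

With UMOUNT="echo umount" set, force_umount_nfs /mnt/nfs/0 prints umount -f -R /mnt/nfs/0 instead of unmounting anything, which makes the wrapper safe to try out before using it for real.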
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
The hang we observe is not something specific to Gluster. I've observed this kind of hang when a filesystem which is in use goes offline. For example, I've accidentally shut down machines which were being used for mounting nfs, which led to the client systems hanging completely and required a hard reboot.

If there are ways to avoid these kinds of hangs when they eventually occur, I'm all ears.

On Mon, Jun 15, 2015 at 4:38 PM, Pranith Kumar Karampuri wrote:
> Emmanuel,
> I am not sure of the feasibility but just wanted to ask you. Do you
> think there is a possibility to error out operations on the mount when mount
> crashes instead of hanging? That would prevent a lot of manual intervention
> even in future.
>
> Pranith.
>
> On 06/15/2015 01:35 PM, Niels de Vos wrote:
>> [...]
Re: [Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
Emmanuel,
I am not sure of the feasibility but just wanted to ask you. Do you think there is a possibility to error out operations on the mount when mount crashes instead of hanging? That would prevent a lot of manual intervention even in future.

Pranith.

On 06/15/2015 01:35 PM, Niels de Vos wrote:
> [...]