Re: [Gluster-devel] Shall we revert quota-anon-fd.t?
On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:
> On 06/11/2014 09:45 AM, Vijay Bellur wrote:
>> On 06/11/2014 08:21 AM, Pranith Kumar Karampuri wrote:
>>> hi,
>>>     I see that quota-anon-fd.t is causing too many spurious failures. I think we should revert it and raise a bug so that it can be fixed and committed again along with the fix.
>>
>> I think we can do that. The problem here stems from the issue that NFS can deadlock when we have client and servers on the same node and system memory utilization is on the higher side. We also need to look into other NFS tests to determine if there are similar possibilities.
>
> I doubt it is because of that; there are so many nfs mount tests,

I have been following this problem closely on b.g.o. This backtrace does indicate dd being hung:

INFO: task dd:6039 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.3.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dd            D 880028100840     0  6039   5704 0x0080
 8801f843faa8 0286 8801 01eff88bb6f58e28 8801db96bb80 8801f8213590
 036c74dc ac6f4edf 8801faf11af8 8801f843ffd8 fbc8 8801faf11af8
Call Trace:
 [810a70b1] ? ktime_get_ts+0xb1/0xf0
 [8111f940] ? sync_page+0x0/0x50
 [815280b3] io_schedule+0x73/0xc0
 [8111f97d] sync_page+0x3d/0x50
 [81528b7f] __wait_on_bit+0x5f/0x90
 [8111fbb3] wait_on_page_bit+0x73/0x80
 [8109b330] ? wake_bit_function+0x0/0x50
 [81135c05] ? pagevec_lookup_tag+0x25/0x40
 [8111ffdb] wait_on_page_writeback_range+0xfb/0x190
 [811201a8] filemap_write_and_wait_range+0x78/0x90
 [811baa4e] vfs_fsync_range+0x7e/0x100
 [811bab1b] generic_write_sync+0x4b/0x50
 [81122056] generic_file_aio_write+0xe6/0x100
 [a042f20e] nfs_file_write+0xde/0x1f0 [nfs]
 [81188c8a] do_sync_write+0xfa/0x140
 [8152a825] ? page_fault+0x25/0x30
 [8109b2b0] ? autoremove_wake_function+0x0/0x40
 [8128ec6f] ? __clear_user+0x3f/0x70
 [8128ec51] ? __clear_user+0x21/0x70
 [812263d6] ? security_file_permission+0x16/0x20
 [81188f88] vfs_write+0xb8/0x1a0
 [81189881] sys_write+0x51/0x90
 [810e1e6e] ? __audit_syscall_exit+0x25e/0x290
 [8100b072] system_call_fastpath+0x16/0x1b

I have seen dd being in uninterruptible sleep on b.g.o. There are also instances [1] where anon-fd-nfs has run for 6000+ seconds. This definitely points to the nfs deadlock.

> only this one keeps failing for the past 2-3 days.

It is a function of the system memory consumption and what the OOM killer decides to kill. If NFS or a glusterfsd process gets killed, then the test unit will fail. If the test can continue till the system reclaims memory, it can possibly succeed. However, there could be other possibilities and we need to root cause them as well.

-Vijay

[1] http://build.gluster.org/job/regression/4783/console

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
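[Editor's note: for anyone watching a regression slave while this happens, a stuck writer like the dd above can be spotted before the 120-second hung-task watchdog fires. A minimal sketch using standard procps and sysrq tooling; the sysrq step assumes root on the slave, and exact output formats vary by distro and kernel:]

```shell
# List tasks in uninterruptible sleep (state D), together with the kernel
# function they are blocked in (wchan) -- a dd wedged in sync_page /
# io_schedule shows up here long before the watchdog message hits dmesg.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'

# Dump kernel stacks of all blocked tasks to the kernel log (SysRq-w),
# then read them back; this produces the same trace format quoted above,
# and would also show whether alloc_pages()/try_to_free_pages() appear
# in other tasks at the same time.
echo w > /proc/sysrq-trigger
dmesg | tail -n 80
```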
Re: [Gluster-devel] Shall we revert quota-anon-fd.t?
On 06/11/2014 11:18 AM, Pranith Kumar Karampuri wrote:
> On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:
>> On 06/11/2014 09:45 AM, Vijay Bellur wrote:
>>> On 06/11/2014 08:21 AM, Pranith Kumar Karampuri wrote:
>>>> hi,
>>>>     I see that quota-anon-fd.t is causing too many spurious failures. I think we should revert it and raise a bug so that it can be fixed and committed again along with the fix.
>>>
>>> I think we can do that. The problem here stems from the issue that NFS can deadlock when we have client and servers on the same node and system memory utilization is on the higher side. We also need to look into other NFS tests to determine if there are similar possibilities.
>>
>> I doubt it is because of that; there are so many nfs mount tests, only this one keeps failing for the past 2-3 days.
>
> http://review.gluster.org/8031 reverts this test. More tests are failing because of this test. To give this patch more priority, shall I remove the remaining builds from regression and start a run for this, and then add the removed ones back after this one?

I have merged this patch.

Thanks,
Vijay
Re: [Gluster-devel] Shall we revert quota-anon-fd.t?
On Wed, Jun 11, 2014 at 12:58:46PM +0200, Niels de Vos wrote:
> On Wed, Jun 11, 2014 at 01:31:04PM +0530, Vijay Bellur wrote:
>> On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:
>> [snip: earlier discussion about reverting quota-anon-fd.t]
>>
>> I have been following this problem closely on b.g.o. This backtrace does indicate dd being hung:
>>
>> [snip: the dd hung-task backtrace quoted in full earlier in this thread]
>>
>> I have seen dd being in uninterruptible sleep on b.g.o. There are also instances [1] where anon-fd-nfs has run for 6000+ seconds. This definitely points to the nfs deadlock.
>
> [1] is a run where nfs.drc is still enabled. I'd like to know if you have seen other, more recent runs where http://review.gluster.org/8004 has been included (disable nfs.drc by default).

To answer my own question: yes, some runs have that included:
- http://build.gluster.org/job/regression/4828/console

Should Bug 1107937 "quota-anon-fd-nfs.t fails spuriously" be used to figure out what the problem is and diagnose the issues there?

Niels

> Are there backtraces at the same time where alloc_pages() and/or try_to_free_pages() are listed? The blocking of the writer (here: dd) likely depends on the memory allocations needed on the receiving end (here: the nfs-server). This is a relatively common issue for the Linux kernel NFS server where loopback mounts are used under memory pressure. A nice description and proposed solution of this was recently posted to LWN.net:
>
> - http://lwn.net/Articles/595652/
>
> That solution is client-side (the NFS client in the Linux kernel), and from a quick cursory look it should help prevent these issues for Gluster's NFS server too. But I don't think the patches have been merged yet.
>
>>> only this one keeps failing for the past 2-3 days.
>>
>> It is a function of the system memory consumption and what the OOM killer decides to kill. If NFS or a glusterfsd process gets killed, then the test unit will fail. If the test can continue till the system reclaims memory, it can possibly succeed. However, there could be other possibilities and we need to root cause them as well.
>
> Yes, I agree. It would help if there is a known way to trigger the OOM so that investigation can be done on a different system than build.gluster.org. Does anyone know of steps that reliably reproduce this kind of issue?
>
> Thanks,
> Niels
>
>> -Vijay
>>
>> [1] http://build.gluster.org/job/regression/4783/console
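[Editor's note: absent a known reproducer, the loopback scenario the LWN article describes could be approximated by hand. The sketch below is an untested outline, not a verified reproducer: the volume name, mount point, cgroup path, and memory limit are all illustrative, and whether the deadlock actually triggers depends on timing and kernel version:]

```shell
# Outline: client and server on the same node, then write under memory
# pressure so that page writeback (dd) and the NFS server receiving the
# writes both need memory at the same time.

# 1. Loopback-mount a gluster volume over NFS ("patchy" is the volume
#    name the test framework uses; adjust to your setup).
mount -t nfs -o vers=3,nolock localhost:/patchy /mnt/nfs

# 2. Constrain available memory for the writer, e.g. with a cgroup v1
#    memory controller (path and 256 MB limit are made up for the sketch).
mkdir /sys/fs/cgroup/memory/nfs-repro
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/memory/nfs-repro/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/nfs-repro/tasks

# 3. Write well past the limit with an fsync at the end, much like
#    quota-anon-fd-nfs.t does; if the deadlock hits, this dd goes into
#    uninterruptible sleep (state D) like the trace earlier in the thread.
dd if=/dev/zero of=/mnt/nfs/bigfile bs=1M count=1024 conv=fsync
```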
Re: [Gluster-devel] Shall we revert quota-anon-fd.t?
On 11/06/2014, at 9:01 AM, Vijay Bellur wrote:
> <snip>
> I have seen dd being in uninterruptible sleep on b.g.o. There are also instances [1] where anon-fd-nfs has run for 6000+ seconds. This definitely points to the nfs deadlock.

A few of the Rackspace regression runs seem to be getting stuck on bugs/bug-1095097.t. They run forever, never finishing:

[19:18:01] ./tests/bugs/bug-1095097.t  ok  16 s
8+0 records in
8+0 records out
41943040 bytes (42 MB) copied, 0.304967 s, 138 MB/s
8+0 records in
8+0 records out
41943040 bytes (42 MB) copied, 0.395341 s, 106 MB/s
1+0 records in
1+0 records out
1024 bytes (1.0 kB) copied, 0.00233259 s, 439 kB/s
4+0 records in
4+0 records out
20971520 bytes (21 MB) copied, 0.333085 s, 63.0 MB/s
1+0 records in
1+0 records out
5242880 bytes (5.2 MB) copied, 0.0986338 s, 53.2 MB/s

I've been aborting these ones manually, after it's obvious they're not working. Could this weird behaviour be related to what you're mentioning above?

+ Justin

Example ones for anyone interested:
http://build.gluster.org/job/rackspace-regression/90/console
http://build.gluster.org/job/rackspace-regression/88/console

--
GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift
Re: [Gluster-devel] Shall we revert quota-anon-fd.t?
On 12/06/2014, at 12:12 AM, Justin Clift wrote:
> On 11/06/2014, at 9:01 AM, Vijay Bellur wrote:
>> <snip>
>> I have seen dd being in uninterruptible sleep on b.g.o. There are also instances [1] where anon-fd-nfs has run for 6000+ seconds. This definitely points to the nfs deadlock.
>
> A few of the Rackspace regression runs seem to be getting stuck on bugs/bug-1095097.t. They run forever, never finishing:

Ignore this, it seems to be some weirdness with the Rackspace VMs themselves. They're all now hanging (on different things). I'll look into it after I get some sleep.

+ Justin
[Gluster-devel] Shall we revert quota-anon-fd.t?
hi,
    I see that quota-anon-fd.t is causing too many spurious failures. I think we should revert it and raise a bug so that it can be fixed and committed again along with the fix.

Pranith
Re: [Gluster-devel] Shall we revert quota-anon-fd.t?
Agreed! +1

On Tue, Jun 10, 2014 at 7:51 PM, Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
> hi,
>     I see that quota-anon-fd.t is causing too many spurious failures. I think we should revert it and raise a bug so that it can be fixed and committed again along with the fix.
>
> Pranith

--
Religious confuse piety with mere ritual, the virtuous confuse regulation with outcomes