Re: [Gluster-devel] Shall we revert quota-anon-fd.t?

2014-06-11 Thread Vijay Bellur

On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:


On 06/11/2014 09:45 AM, Vijay Bellur wrote:

On 06/11/2014 08:21 AM, Pranith Kumar Karampuri wrote:

hi,
I see that quota-anon-fd.t is causing too many spurious failures. I
think we should revert it and raise a bug so that it can be fixed and
committed again along with the fix.



I think we can do that. The problem here is stemming from the issue
that nfs can deadlock when we have client and servers on the same node
with system memory utilization being on the higher side. We also need
to look into other nfs tests to determine if there are similar
possibilities.


I doubt it is because of that, there are so many nfs mount tests,


I have been following this problem closely on b.g.o. This backtrace does 
indicate dd being hung:


INFO: task dd:6039 blocked for more than 120 seconds.
  Not tainted 2.6.32-431.3.1.el6.x86_64 #1
echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
dd        D 880028100840     0  6039   5704 0x0080
 8801f843faa8 0286 8801 01eff88bb6f58e28
 8801db96bb80 8801f8213590 036c74dc ac6f4edf
 8801faf11af8 8801f843ffd8 fbc8 8801faf11af8
Call Trace:
 [810a70b1] ? ktime_get_ts+0xb1/0xf0
 [8111f940] ? sync_page+0x0/0x50
 [815280b3] io_schedule+0x73/0xc0
 [8111f97d] sync_page+0x3d/0x50
 [81528b7f] __wait_on_bit+0x5f/0x90
 [8111fbb3] wait_on_page_bit+0x73/0x80
 [8109b330] ? wake_bit_function+0x0/0x50
 [81135c05] ? pagevec_lookup_tag+0x25/0x40
 [8111ffdb] wait_on_page_writeback_range+0xfb/0x190
 [811201a8] filemap_write_and_wait_range+0x78/0x90
 [811baa4e] vfs_fsync_range+0x7e/0x100
 [811bab1b] generic_write_sync+0x4b/0x50
 [81122056] generic_file_aio_write+0xe6/0x100
 [a042f20e] nfs_file_write+0xde/0x1f0 [nfs]
 [81188c8a] do_sync_write+0xfa/0x140
 [8152a825] ? page_fault+0x25/0x30
 [8109b2b0] ? autoremove_wake_function+0x0/0x40
 [8128ec6f] ? __clear_user+0x3f/0x70
 [8128ec51] ? __clear_user+0x21/0x70
 [812263d6] ? security_file_permission+0x16/0x20
 [81188f88] vfs_write+0xb8/0x1a0
 [81189881] sys_write+0x51/0x90
 [810e1e6e] ? __audit_syscall_exit+0x25e/0x290
 [8100b072] system_call_fastpath+0x16/0x1b

I have seen dd being in uninterruptible sleep on b.g.o. There are also 
instances [1] where anon-fd-nfs has run for more than 6000 seconds. This 
definitely points to the nfs deadlock.
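As an aside for anyone chasing this on a live builder: tasks stuck the way dd is here can be spotted from userspace. A minimal sketch, using only standard ps/awk (nothing Gluster-specific is assumed):

```shell
# List tasks in uninterruptible sleep (state D) -- the state the hung dd
# above is in. A task that stays in D for minutes points at stuck I/O.
ps -eo state=,pid=,comm= | awk '$1 == "D" { print $2, $3 }'
```

Run periodically during a regression run, this would flag the hang long before the 120-second hung-task warning appears in the kernel log.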




only
this one keeps failing for the past 2-3 days.


It is a function of the system memory consumption and what oom killer 
decides to kill. If NFS or a glusterfsd process gets killed, then the 
test unit will fail. If the test can continue till the system reclaims 
memory, it can possibly succeed.
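Whether the OOM killer actually took out a gluster or nfs process should be visible in the kernel log. A hedged check (the exact message wording varies across kernel versions, so the patterns below are a best guess):

```shell
# Scan the kernel log for OOM kills; a glusterfs/glusterfsd/nfs victim
# here would explain a failed test unit. Message format varies by kernel.
dmesg | grep -iE 'out of memory|killed process' \
      | grep -iE 'gluster|nfs' \
      || echo 'no gluster/nfs OOM kill logged'
```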


However, there could be other possibilities and we need to root cause 
them as well.



-Vijay

[1] http://build.gluster.org/job/regression/4783/console

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Shall we revert quota-anon-fd.t?

2014-06-11 Thread Vijay Bellur

On 06/11/2014 11:18 AM, Pranith Kumar Karampuri wrote:


On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:


On 06/11/2014 09:45 AM, Vijay Bellur wrote:

On 06/11/2014 08:21 AM, Pranith Kumar Karampuri wrote:

hi,
I see that quota-anon-fd.t is causing too many spurious failures. I
think we should revert it and raise a bug so that it can be fixed and
committed again along with the fix.



I think we can do that. The problem here is stemming from the issue
that nfs can deadlock when we have client and servers on the same
node with system memory utilization being on the higher side. We also
need to look into other nfs tests to determine if there are similar
possibilities.


I doubt it is because of that, there are so many nfs mount tests, only
this one keeps failing for the past 2-3 days.
http://review.gluster.org/8031 reverts this test.

More tests are failing because of this test. To give this patch more
priority, shall I remove the remaining regression builds from the queue,
start a run for this patch, and then re-add the removed ones after it
completes?



I have merged this patch.

Thanks,
Vijay



Re: [Gluster-devel] Shall we revert quota-anon-fd.t?

2014-06-11 Thread Niels de Vos
On Wed, Jun 11, 2014 at 12:58:46PM +0200, Niels de Vos wrote:
 On Wed, Jun 11, 2014 at 01:31:04PM +0530, Vijay Bellur wrote:
  On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:
  
  On 06/11/2014 09:45 AM, Vijay Bellur wrote:
  On 06/11/2014 08:21 AM, Pranith Kumar Karampuri wrote:
  hi,
  I see that quota-anon-fd.t is causing too many spurious failures. I
  think we should revert it and raise a bug so that it can be fixed and
  committed again along with the fix.
  
  
  I think we can do that. The problem here is stemming from the issue
  that nfs can deadlock when we have client and servers on the same node
  with system memory utilization being on the higher side. We also need
  to look into other nfs tests to determine if there are similar
  possibilities.
  
  I doubt it is because of that, there are so many nfs mount tests,
  
  I have been following this problem closely on b.g.o. This backtrace
  does indicate dd being hung:
  
  INFO: task dd:6039 blocked for more than 120 seconds.
Not tainted 2.6.32-431.3.1.el6.x86_64 #1
  echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
  dd        D 880028100840     0  6039   5704 0x0080
   8801f843faa8 0286 8801 01eff88bb6f58e28
   8801db96bb80 8801f8213590 036c74dc ac6f4edf
   8801faf11af8 8801f843ffd8 fbc8 8801faf11af8
  Call Trace:
   [810a70b1] ? ktime_get_ts+0xb1/0xf0
   [8111f940] ? sync_page+0x0/0x50
   [815280b3] io_schedule+0x73/0xc0
   [8111f97d] sync_page+0x3d/0x50
   [81528b7f] __wait_on_bit+0x5f/0x90
   [8111fbb3] wait_on_page_bit+0x73/0x80
   [8109b330] ? wake_bit_function+0x0/0x50
   [81135c05] ? pagevec_lookup_tag+0x25/0x40
   [8111ffdb] wait_on_page_writeback_range+0xfb/0x190
   [811201a8] filemap_write_and_wait_range+0x78/0x90
   [811baa4e] vfs_fsync_range+0x7e/0x100
   [811bab1b] generic_write_sync+0x4b/0x50
   [81122056] generic_file_aio_write+0xe6/0x100
   [a042f20e] nfs_file_write+0xde/0x1f0 [nfs]
   [81188c8a] do_sync_write+0xfa/0x140
   [8152a825] ? page_fault+0x25/0x30
   [8109b2b0] ? autoremove_wake_function+0x0/0x40
   [8128ec6f] ? __clear_user+0x3f/0x70
   [8128ec51] ? __clear_user+0x21/0x70
   [812263d6] ? security_file_permission+0x16/0x20
   [81188f88] vfs_write+0xb8/0x1a0
   [81189881] sys_write+0x51/0x90
   [810e1e6e] ? __audit_syscall_exit+0x25e/0x290
   [8100b072] system_call_fastpath+0x16/0x1b
  
  I have seen dd being in uninterruptible sleep on b.g.o. There are
  also instances [1] where anon-fd-nfs has run for close to 6000+
  seconds. This definitely points to the nfs deadlock.
 
 [1] is a run where nfs.drc is still enabled. I'd like to know if you 
 have seen other, more recent runs where http://review.gluster.org/8004 
 has been included (disable nfs.drc by default).

To answer my own question, yes some runs have that included:
- http://build.gluster.org/job/regression/4828/console

Should Bug 1107937 "quota-anon-fd-nfs.t fails spuriously" be used to 
figure out what the problem is and to diagnose the issues there?

Niels

 
 Are there backtraces from the same time where alloc_pages() and/or 
 try_to_free_pages() are listed? The blocking of the writer (here: dd) 
 likely depends on memory allocations needed on the receiving end 
 (here: the nfs server). This is a relatively common issue for the Linux 
 kernel NFS server when loopback mounts are used under memory pressure.  
 
 A nice description and proposed solution of this has recently been 
 posted to LWN.net:
 - http://lwn.net/Articles/595652/
 
 This solution is client-side (the NFS client in the Linux kernel), and 
 it should help prevent these issues for Gluster-nfs too (based on 
 a quick, cursory look through it). But I don't think the patches have 
 been merged yet.
 
  only
  this one keeps failing for the past 2-3 days.
  
  It is a function of the system memory consumption and what oom
  killer decides to kill. If NFS or a glusterfsd process gets killed,
  then the test unit will fail. If the test can continue till the
  system reclaims memory, it can possibly succeed.
  
  However, there could be other possibilities and we need to root
  cause them as well.
 
 Yes, I agree. It would help if there were a known way to trigger the OOM 
 so that investigation could be done on a system other than 
 build.gluster.org. Does anyone know of steps that reliably reproduce 
 this kind of issue?
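One untested way to approximate it, going by the loopback-NFS analysis above: export a directory over NFS, mount it back on the same host, cap the writer's memory, and push a large fsync'd write through the loopback mount. All paths, sizes, and the cgroup-v1 layout below are assumptions, not a verified reproducer:

```shell
# Assumed reproducer sketch for the loopback-NFS memory-pressure deadlock.
# Paths, sizes and the cgroup-v1 layout are guesses; adapt to the machine.
mount -t nfs -o vers=3 localhost:/export /mnt/loop-nfs

# Cap memory so writeback must reclaim pages while the NFS client and the
# NFS server on the same host compete for the same allocations.
mkdir -p /sys/fs/cgroup/memory/nfs-repro
echo $((256*1024*1024)) > /sys/fs/cgroup/memory/nfs-repro/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/nfs-repro/tasks

# Write more dirty data than the cap allows and force it to stable storage.
dd if=/dev/zero of=/mnt/loop-nfs/bigfile bs=1M count=4096 conv=fsync
```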
 
 Thanks,
 Niels
 
  
  
  -Vijay
  
  [1] http://build.gluster.org/job/regression/4783/console
  
 

Re: [Gluster-devel] Shall we revert quota-anon-fd.t?

2014-06-11 Thread Justin Clift
On 11/06/2014, at 9:01 AM, Vijay Bellur wrote:
snip
 I have seen dd being in uninterruptible sleep on b.g.o. There are also 
 instances [1] where anon-fd-nfs has run for close to 6000+ seconds. This 
 definitely points to the nfs deadlock.

A few of the rackspace regression runs seem to be getting
stuck on bug/bug-1095097.t.  They run forever, never finishing:

  [19:18:01] ./tests/bugs/bug-1095097.t  ok   16 s
  8+0 records in
  8+0 records out
  41943040 bytes (42 MB) copied, 0.304967 s, 138 MB/s
  8+0 records in
  8+0 records out
  41943040 bytes (42 MB) copied, 0.395341 s, 106 MB/s
  1+0 records in
  1+0 records out
  1024 bytes (1.0 kB) copied, 0.00233259 s, 439 kB/s
  4+0 records in
  4+0 records out
  20971520 bytes (21 MB) copied, 0.333085 s, 63.0 MB/s
  1+0 records in
  1+0 records out
  5242880 bytes (5.2 MB) copied, 0.0986338 s, 53.2 MB/s

I've been aborting these ones manually, once it's obvious they're
not working.
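For runs like these, a coarse guard is to bound each test's wall-clock time so a hang becomes a visible failure instead of running forever. A sketch, assuming GNU coreutils timeout and prove are available on the slaves (this is not how build.gluster.org currently invokes tests):

```shell
# Abort any single .t file that exceeds 30 minutes; timeout exits with
# status 124 when it kills the command, so hangs show up as failures.
timeout 1800 prove -v ./tests/bugs/bug-1095097.t
```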

Could this weird behaviour be related to what you're mentioning
above?

+ Justin

Example ones for anyone interested:

  http://build.gluster.org/job/rackspace-regression/90/console
  http://build.gluster.org/job/rackspace-regression/88/console

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift



Re: [Gluster-devel] Shall we revert quota-anon-fd.t?

2014-06-11 Thread Justin Clift
On 12/06/2014, at 12:12 AM, Justin Clift wrote:
 On 11/06/2014, at 9:01 AM, Vijay Bellur wrote:
 snip
 I have seen dd being in uninterruptible sleep on b.g.o. There are also 
 instances [1] where anon-fd-nfs has run for close to 6000+ seconds. This 
 definitely points to the nfs deadlock.
 
 A few of the rackspace regression runs seem to be getting
 stuck on bug/bug-1095097.t.  They run forever, never finishing:


Ignore this, it seems to be some weirdness with the Rackspace VMs
themselves.  They're all now hanging (on different things).

I'll look into it after I get some sleep.

+ Justin

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift



[Gluster-devel] Shall we revert quota-anon-fd.t?

2014-06-10 Thread Pranith Kumar Karampuri

hi,
   I see that quota-anon-fd.t is causing too many spurious failures. I 
think we should revert it and raise a bug so that it can be fixed and 
committed again along with the fix.


Pranith


Re: [Gluster-devel] Shall we revert quota-anon-fd.t?

2014-06-10 Thread Harshavardhana
Agreed! +1

On Tue, Jun 10, 2014 at 7:51 PM, Pranith Kumar Karampuri
pkara...@redhat.com wrote:
 hi,
I see that quota-anon-fd.t is causing too many spurious failures. I think
 we should revert it and raise a bug so that it can be fixed and committed
 again along with the fix.

 Pranith



-- 
Religious confuse piety with mere ritual, the virtuous confuse
regulation with outcomes