Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
No change in behavior even in the case of low-memory systems. I confirmed it running on a 1 Gig machine.

Thanks
--Chakri

On 9/28/07, Chakri n <[EMAIL PROTECTED]> wrote:
> Here is a snapshot of the vmstats when the problem happened. I believe
> this could help a little.
>
> crash> kmem -V
>         NR_FREE_PAGES: 680853
>           NR_INACTIVE: 95380
>             NR_ACTIVE: 26891
>         NR_ANON_PAGES: 2507
>        NR_FILE_MAPPED: 1832
>         NR_FILE_PAGES: 119779
>         NR_FILE_DIRTY: 0
>          NR_WRITEBACK: 18272
>   NR_SLAB_RECLAIMABLE: 1305
> NR_SLAB_UNRECLAIMABLE: 2085
>          NR_PAGETABLE: 123
>       NR_UNSTABLE_NFS: 0
>             NR_BOUNCE: 0
>       NR_VMSCAN_WRITE: 0
>
> In my testing, I always saw the processes waiting in
> balance_dirty_pages_ratelimited(), never in the throttle_vm_writeout()
> path.
>
> But this could be because I have about 4 Gig of memory in the system
> and plenty of memory is still available.
>
> I will rerun the test limiting memory to 1024 MB and let's see if it
> takes any different path.
>
> Thanks
> --Chakri
>
> On 9/28/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > On Fri, 28 Sep 2007 16:32:18 -0400
> > Trond Myklebust <[EMAIL PROTECTED]> wrote:
> >
> > > On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
> > > > On Fri, 28 Sep 2007 15:52:28 -0400
> > > > Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > > > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > > > > > > Looking back, they were getting caught up in
> > > > > > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > > > > > example...
> > > > > >
> > > > > > that one is nfs-on-loopback, which is a special case, isn't it?
> > > > >
> > > > > I'm not sure that the hang that is illustrated here is so special. It is
> > > > > an example of a bog-standard ext3 write that ends up calling the NFS
> > > > > client, which is hanging. The fact that it happens to be hanging on the
> > > > > nfsd process is more or less irrelevant here: the same thing could
> > > > > happen to any other process in the case where we have an NFS server
> > > > > that is down.
> > > >
> > > > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
> > > >
> > > > We should be able to fix that by marking the backing device as
> > > > write-congested. That'll have small race windows, but it should be a
> > > > 99.9% fix?
> > >
> > > No. The problem would rather appear to be that we're doing
> > > per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
> > > we're measuring variables which are global to the VM. The backing device
> > > that we are selecting may not be writing out any dirty pages, in which
> > > case we're just spinning in balance_dirty_pages_ratelimited().
> >
> > OK, so it's unrelated to page reclaim.
> >
> > > Should we therefore perhaps be looking at adding per-backing_dev stats
> > > too?
> >
> > That's what mm-per-device-dirty-threshold.patch and friends are doing.
> > Whether it works adequately is not really known at this time.
> > Unfortunately kernel developers don't test -mm much.
Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
Here is a snapshot of the vmstats when the problem happened. I believe this could help a little.

crash> kmem -V
        NR_FREE_PAGES: 680853
          NR_INACTIVE: 95380
            NR_ACTIVE: 26891
        NR_ANON_PAGES: 2507
       NR_FILE_MAPPED: 1832
        NR_FILE_PAGES: 119779
        NR_FILE_DIRTY: 0
         NR_WRITEBACK: 18272
  NR_SLAB_RECLAIMABLE: 1305
NR_SLAB_UNRECLAIMABLE: 2085
         NR_PAGETABLE: 123
      NR_UNSTABLE_NFS: 0
            NR_BOUNCE: 0
      NR_VMSCAN_WRITE: 0

In my testing, I always saw the processes waiting in balance_dirty_pages_ratelimited(), never in the throttle_vm_writeout() path.

But this could be because I have about 4 Gig of memory in the system and plenty of memory is still available.

I will rerun the test limiting memory to 1024 MB and let's see if it takes any different path.

Thanks
--Chakri

On 9/28/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Fri, 28 Sep 2007 16:32:18 -0400
> Trond Myklebust <[EMAIL PROTECTED]> wrote:
>
> > On Fri, 2007-09-28 at 13:10 -0700, Andrew Morton wrote:
> > > On Fri, 28 Sep 2007 15:52:28 -0400
> > > Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Fri, 2007-09-28 at 12:26 -0700, Andrew Morton wrote:
> > > > > On Fri, 28 Sep 2007 15:16:11 -0400 Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > > > > > Looking back, they were getting caught up in
> > > > > > balance_dirty_pages_ratelimited() and friends. See the attached
> > > > > > example...
> > > > >
> > > > > that one is nfs-on-loopback, which is a special case, isn't it?
> > > >
> > > > I'm not sure that the hang that is illustrated here is so special. It is
> > > > an example of a bog-standard ext3 write that ends up calling the NFS
> > > > client, which is hanging. The fact that it happens to be hanging on the
> > > > nfsd process is more or less irrelevant here: the same thing could
> > > > happen to any other process in the case where we have an NFS server
> > > > that is down.
> > >
> > > hm, so ext3 got stuck in nfs via __alloc_pages direct reclaim?
> > >
> > > We should be able to fix that by marking the backing device as
> > > write-congested. That'll have small race windows, but it should be a
> > > 99.9% fix?
> >
> > No. The problem would rather appear to be that we're doing
> > per-backing_dev writeback (if I read sync_sb_inodes() correctly), but
> > we're measuring variables which are global to the VM. The backing device
> > that we are selecting may not be writing out any dirty pages, in which
> > case we're just spinning in balance_dirty_pages_ratelimited().
>
> OK, so it's unrelated to page reclaim.
>
> > Should we therefore perhaps be looking at adding per-backing_dev stats
> > too?
>
> That's what mm-per-device-dirty-threshold.patch and friends are doing.
> Whether it works adequately is not really known at this time.
> Unfortunately kernel developers don't test -mm much.
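Trond's point in the quoted exchange is mechanical enough to model in a few lines of userspace C. The sketch below is not kernel source: the names (DIRTY_THRESH, writeback_one_bdi, the page counts) are invented for illustration. It shows how a throttle loop that tests a global dirty count, while flushing only the current backing device, spins forever once a dead device pins enough pages:

#include <stdio.h>

#define DIRTY_THRESH 100                /* global dirty limit, in "pages" */
#define NBDI 2

static int bdi_dirty[NBDI] = { 120, 5 };        /* bdi 0: the dead NFS mount */
static int bdi_dead[NBDI]  = { 1, 0 };

static int global_dirty(void)
{
        int i, sum = 0;

        for (i = 0; i < NBDI; i++)
                sum += bdi_dirty[i];
        return sum;
}

/* Per-bdi writeout: a dead device never completes anything. */
static void writeback_one_bdi(int bdi)
{
        if (!bdi_dead[bdi] && bdi_dirty[bdi] > 0)
                bdi_dirty[bdi]--;
}

int main(void)
{
        int iter;

        /* A writer on the healthy bdi 1 enters its throttle loop. */
        for (iter = 0; iter < 20 && global_dirty() > DIRTY_THRESH; iter++) {
                writeback_one_bdi(1);   /* only *this* bdi gets flushed */
                printf("iter %2d: global_dirty=%d (bdi0=%d, bdi1=%d)\n",
                       iter, global_dirty(), bdi_dirty[0], bdi_dirty[1]);
        }
        return 0;
}

Bdi 1 drains to zero after five iterations, yet the global count never falls below the threshold, so the loop here stops only because of the iteration cap. A real process has no such cap, which is the spin in balance_dirty_pages_ratelimited() described above.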
Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
It works on .23-rc8-mm2 without any problems. The "dd" process does not hang any more.

Thanks for all the help.

Cheers
--Chakri

On 9/28/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> [ and one copy for the list too ]
>
> On Fri, 2007-09-28 at 02:20 -0700, Chakri n wrote:
> > It's 2.6.23-rc6.
>
> Could you try .23-rc8-mm2? It includes the per-BDI stuff.
Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
It's 2.6.23-rc6.

Thanks
--Chakri

On 9/28/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Fri, 2007-09-28 at 02:01 -0700, Chakri n wrote:
> > Thanks for explaining the adaptive logic.
> >
> > > However other devices will at that moment try to maintain a limit of 0,
> > > which ends up being similar to a sync mount.
> > >
> > > So they'll not get stuck, but they will be slow.
> >
> > Sync should be OK when the situation is bad like this and someone has
> > hijacked all the buffers.
> >
> > But I see my simple dd to write 10 blocks on the local disk never
> > completes, even after 10 minutes.
> >
> > [EMAIL PROTECTED] ~]# dd if=/dev/zero of=/tmp/x count=10
> >
> > I think the process is completely stuck and is not progressing at all.
> >
> > Is something going wrong in the calculations, where it does not fall
> > back to sync mode?
>
> What kernel is that?
Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
Thanks for explaining the adaptive logic.

> However other devices will at that moment try to maintain a limit of 0,
> which ends up being similar to a sync mount.
>
> So they'll not get stuck, but they will be slow.

Sync should be OK when the situation is bad like this and someone has hijacked all the buffers.

But I see my simple dd to write 10 blocks on the local disk never completes, even after 10 minutes.

[EMAIL PROTECTED] ~]# dd if=/dev/zero of=/tmp/x count=10

I think the process is completely stuck and is not progressing at all.

Is something going wrong in the calculations, where it does not fall back to sync mode?

Thanks
--Chakri

On 9/28/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> [ please don't top-post! ]
>
> On Fri, 2007-09-28 at 01:27 -0700, Chakri n wrote:
>
> > On 9/27/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > > On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
> > >
> > > > What we _don't_ want to happen is for other processes which are writing to
> > > > other, non-dead devices to get collaterally blocked. We have patches which
> > > > might fix that queued for 2.6.24. Peter?
> > >
> > > Nasty problem, don't do that :-)
> > >
> > > But yeah, with per-BDI dirty limits we get stuck at whatever ratio that
> > > NFS server/mount (?) has - which could be 100%. Other processes will
> > > then work almost synchronously against their BDIs, but it should work.
> > >
> > > [ They will lower the NFS BDI's ratio, but some fancy clipping code will
> > > limit the other BDIs' dirty limit to not exceed the total limit.
> > > And with all these NFS pages stuck, that will still be nothing. ]
> >
> > Thanks.
> >
> > The BDI dirty limits sound like a good idea.
> >
> > Is there already a patch for this, which I could try?
>
> v2.6.23-rc8-mm2
>
> > I believe it works like this:
> >
> > Each BDI will have a limit. If the dirty_thresh exceeds the limit,
> > all the I/O on the block device will be synchronous.
> >
> > So, if I have sda & an NFS mount, the dirty limit can be different for
> > each of them.
> >
> > I can set the dirty limit for
> > - sda to be 90% and
> > - the NFS mount to be 50%.
> >
> > So, if the dirty limit is greater than 50%, NFS writes synchronously,
> > but sda can work asynchronously, till the dirty limit reaches 90%.
>
> Not quite; the system determines the limit itself in an adaptive
> fashion.
>
>   bdi_limit = total_limit * p_bdi
>
> Where p is a fraction [0,1], and is determined by the relative writeout
> speed of the current BDI vs all other BDIs.
>
> So if you were to have 3 BDIs (sda, sdb and 1 NFS mount), and sda is
> idle, and the NFS mount gets twice as much traffic as sdb, the ratios
> will look like:
>
>   p_sda: 0
>   p_sdb: 1/3
>   p_nfs: 2/3
>
> Once the traffic exceeds the write speed of the device we build up a
> backlog and stuff gets throttled, so these proportions converge to the
> relative write speed of the BDIs when saturated with data.
>
> So what can happen in your case is that the NFS mount is the only one
> with traffic, so it will get a fraction of 1. If it then disconnects, like in
> your case, it will still have all of the dirty limit pinned for NFS.
>
> However other devices will at that moment try to maintain a limit of 0,
> which ends up being similar to a sync mount.
>
> So they'll not get stuck, but they will be slow.
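Peter's formula is easy to check numerically. Below is a minimal sketch of the arithmetic using his example numbers; the proportion here is computed naively from recent writeout counts, whereas the real -mm code tracks it with a decaying per-BDI counter, so this illustrates the idea rather than the kernel's implementation:

#include <stdio.h>

struct bdi {
        const char *name;
        long written;           /* pages recently written out */
};

int main(void)
{
        /* sda idle; the NFS mount gets twice as much traffic as sdb */
        struct bdi bdis[] = {
                { "sda", 0 }, { "sdb", 100 }, { "nfs", 200 },
        };
        const double total_limit = 1000;        /* global dirty limit, pages */
        long total_written = 0;
        size_t i, n = sizeof(bdis) / sizeof(bdis[0]);

        for (i = 0; i < n; i++)
                total_written += bdis[i].written;

        for (i = 0; i < n; i++) {
                double p = (double)bdis[i].written / total_written;

                /* bdi_limit = total_limit * p_bdi */
                printf("p_%s = %.3f -> bdi_limit = %.0f pages\n",
                       bdis[i].name, p, total_limit * p);
        }
        return 0;
}

Run as-is it prints p_sda = 0.000, p_sdb = 0.333 and p_nfs = 0.667, matching the ratios above. It also shows the failure mode Peter describes: an NFS mount that owned all recent traffic has p ~= 1, so every other device is left maintaining a limit near 0 and behaves like a sync mount until the proportions drift back.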
Re: An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
Thanks.

The BDI dirty limits sound like a good idea.

Is there already a patch for this, which I could try?

I believe it works like this:

Each BDI will have a limit. If the dirty_thresh exceeds the limit, all the I/O on the block device will be synchronous.

So, if I have sda & an NFS mount, the dirty limit can be different for each of them.

I can set the dirty limit for
- sda to be 90% and
- the NFS mount to be 50%.

So, if the dirty limit is greater than 50%, NFS writes synchronously, but sda can work asynchronously, till the dirty limit reaches 90%.

Thanks
--Chakri

On 9/27/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
>
> > What we _don't_ want to happen is for other processes which are writing to
> > other, non-dead devices to get collaterally blocked. We have patches which
> > might fix that queued for 2.6.24. Peter?
>
> Nasty problem, don't do that :-)
>
> But yeah, with per-BDI dirty limits we get stuck at whatever ratio that
> NFS server/mount (?) has - which could be 100%. Other processes will
> then work almost synchronously against their BDIs, but it should work.
>
> [ They will lower the NFS BDI's ratio, but some fancy clipping code will
> limit the other BDIs' dirty limit to not exceed the total limit.
> And with all these NFS pages stuck, that will still be nothing. ]
An unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)
Hi,

In my testing, an unresponsive file system can hang all I/O in the system. This is not seen in 2.4.

I started 20 threads doing I/O on an NFS share. They are just doing 4K writes in a loop.

Now I stop the NFS server hosting the NFS share and start a "dd" process to write a file on the local EXT3 file system.

# dd if=/dev/zero of=/tmp/x count=1000

This process never progresses, even though there is plenty of HIGH MEMORY available in the system.

# free
             total       used       free     shared    buffers     cached
Mem:       3238004     609340    2628664          0      15136     551024
-/+ buffers/cache:      43180    3194824
Swap:      4096532          0    4096532

vmstat on the machine:

# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd    free   buff  cache  si  so  bi  bo  in  cs us sy id wa st
 0 21     0 2628416  15152 551024   0   0   0   0  28 344  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0   8 340  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0  26 343  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0   8 341  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0  26 357  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0   8 325  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0  26 343  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0   8 325  0  0  0 100  0

The problem seems to be in balance_dirty_pages, which calculates dirty_thresh based on only ZONE_NORMAL. The same scenario works fine in 2.4; the dd process finishes in no time.

NFS file systems can go offline for multiple reasons (a failed switch, a failed filer, etc.), but that should not affect other file systems in the machine. Can this behavior be fenced off? Can the buffer cache be tuned so that other processes do not see the effect?

The following is the back trace of the processes:

--------------------------------------------------------------
PID: 3552  TASK: cb1fc610  CPU: 0  COMMAND: "dd"
 #0 [f5c04c38] schedule at c0624a34
 #1 [f5c04cac] schedule_timeout at c06250ee
 #2 [f5c04cf0] io_schedule_timeout at c0624c15
 #3 [f5c04d04] congestion_wait at c045eb7d
 #4 [f5c04d28] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f5c04d7c] generic_file_buffered_write at c0457148
 #6 [f5c04e10] __generic_file_aio_write_nolock at c04576e5
 #7 [f5c04e84] generic_file_aio_write at c0457799
 #8 [f5c04eb4] ext3_file_write at ffd7
 #9 [f5c04ed0] do_sync_write at c0472e27
#10 [f5c04f7c] vfs_write at c0473689
#11 [f5c04f98] sys_write at c0473c95
#12 [f5c04fb4] sysenter_entry at c0404ddf
--------------------------------------------------------------
PID: 3091  TASK: cb1f0100  CPU: 1  COMMAND: "test"
 #0 [f6050c10] schedule at c0624a34
 #1 [f6050c84] schedule_timeout at c06250ee
 #2 [f6050cc8] io_schedule_timeout at c0624c15
 #3 [f6050cdc] congestion_wait at c045eb7d
 #4 [f6050d00] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f6050d54] generic_file_buffered_write at c0457148
 #6 [f6050de8] __generic_file_aio_write_nolock at c04576e5
 #7 [f6050e40] enqueue_entity at c042131f
 #8 [f6050e5c] generic_file_aio_write at c0457799
 #9 [f6050e8c] nfs_file_write at f8f90cee
#10 [f6050e9c] getnstimeofday at c043d3f7
#11 [f6050ed0] do_sync_write at c0472e27
#12 [f6050f7c] vfs_write at c0473689
#13 [f6050f98] sys_write at c0473c95
#14 [f6050fb4] sysenter_entry at c0404ddf

Thanks
--Chakri
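The test program itself was not posted; a minimal approximation in C of what is described above - 20 threads doing 4K writes in a loop to separate files on the NFS mount - could look like the sketch below. The thread count, the file names, and the rewind at 10MB are assumptions, not the original code:

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 20
#define BLOCKSZ  4096

static const char *dir;

static void *writer(void *arg)
{
        char path[256], buf[BLOCKSZ];
        int fd;

        memset(buf, 'x', sizeof(buf));
        snprintf(path, sizeof(path), "%s/testfile.%ld", dir, (long)arg);
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return NULL;
        }
        for (;;) {                      /* 4K writes in a loop */
                if (write(fd, buf, sizeof(buf)) < 0) {
                        perror("write");
                        break;
                }
                if (lseek(fd, 0, SEEK_CUR) >= 10 * 1024 * 1024)
                        lseek(fd, 0, SEEK_SET); /* cap each file at ~10MB */
        }
        close(fd);
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tid[NTHREADS];
        long i;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <nfs-mounted-dir>\n", argv[0]);
                return 1;
        }
        dir = argv[1];
        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, writer, (void *)i);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}

Point it at the NFS mount and stop the NFS server: the writers keep dirtying pages until the global dirty_thresh is reached, and from then on any local writer, such as the dd above, blocks in balance_dirty_pages alongside them.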
Re: [NFS] NFS on loopback locks up entire system (2.6.23-rc6)?
On 9/21/07, Trond Myklebust <[EMAIL PROTECTED]> wrote:
> No. The requirement for 'hard' mounts is not that the server be up all
> the time. The server can go up and down as it pleases: the client can
> happily recover from that.
>
> The requirement is rather that nobody remove it permanently before the
> application is done with it, and the partition is unmounted. That is
> hardly unreasonable (it is the only way I know of to ensure data
> integrity), and it is much less strict than the requirements for local
> disks.

Yes. I completely agree. This is required for data consistency.

But in my testing, if one of the NFS servers/mounts goes offline for some period of time, the entire system slows down, especially I/O.

In my test program, I forked off 50 threads to do 4K writes on 50 different files in an NFS-mounted directory.

Now, I have turned off the NFS server and started another dd process on the local disk ("dd if=/dev/zero of=/tmp/x count=1000"), and this dd process never progresses. I see an I/O wait of 100% in vmstat.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd    free   buff  cache  si  so  bi  bo  in  cs us sy id wa st
 0 21     0 2628416  15152 551024   0   0   0   0  28 344  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0   8 340  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0  26 343  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0   8 341  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0  26 357  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0   8 325  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0  26 343  0  0  0 100  0
 0 21     0 2628416  15152 551024   0   0   0   0   8 325  0  0  0 100  0

I have about 4 Gig of RAM in the system and most of the memory is free. I see only about 550 MB in buffers; the rest is pretty much all available.

[EMAIL PROTECTED] ~]# free
             total       used       free     shared    buffers     cached
Mem:       3238004     609340    2628664          0      15136     551024
-/+ buffers/cache:      43180    3194824
Swap:      4096532          0    4096532

Here is the stack trace for one of my test program threads and the dd process; both of them are stuck in congestion_wait.

--------------------------------------------------------------
PID: 3552  TASK: cb1fc610  CPU: 0  COMMAND: "dd"
 #0 [f5c04c38] schedule at c0624a34
 #1 [f5c04cac] schedule_timeout at c06250ee
 #2 [f5c04cf0] io_schedule_timeout at c0624c15
 #3 [f5c04d04] congestion_wait at c045eb7d
 #4 [f5c04d28] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f5c04d7c] generic_file_buffered_write at c0457148
 #6 [f5c04e10] __generic_file_aio_write_nolock at c04576e5
 #7 [f5c04e84] generic_file_aio_write at c0457799
 #8 [f5c04eb4] ext3_file_write at ffd7
 #9 [f5c04ed0] do_sync_write at c0472e27
#10 [f5c04f7c] vfs_write at c0473689
#11 [f5c04f98] sys_write at c0473c95
#12 [f5c04fb4] sysenter_entry at c0404ddf
--------------------------------------------------------------
 #0 [f6050c10] schedule at c0624a34
 #1 [f6050c84] schedule_timeout at c06250ee
 #2 [f6050cc8] io_schedule_timeout at c0624c15
 #3 [f6050cdc] congestion_wait at c045eb7d
 #4 [f6050d00] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f6050d54] generic_file_buffered_write at c0457148
 #6 [f6050de8] __generic_file_aio_write_nolock at c04576e5
 #7 [f6050e40] enqueue_entity at c042131f
 #8 [f6050e5c] generic_file_aio_write at c0457799
 #9 [f6050e8c] nfs_file_write at f8f90cee
#10 [f6050e9c] getnstimeofday at c043d3f7
#11 [f6050ed0] do_sync_write at c0472e27
#12 [f6050f7c] vfs_write at c0473689
#13 [f6050f98] sys_write at c0473c95
#14 [f6050fb4] sysenter_entry at c0404ddf
--------------------------------------------------------------

Can this be worked around? Since most of the RAM is available, the dd process could in fact find more memory for its buffers rather than waiting on NFS requests.

I believe this could be one reason why file systems like VxFS use their own buffer cache, separate from the system-wide buffer cache.

Thanks
--Chakri
Re: [NFS] NFS on loopback locks up entire system (2.6.23-rc6)?
Isn't this a strict requirement from the client side, asking to guarantee that a server stays up all the time? I have seen many cases where people go and directly change the IP of their NFS filers or servers, worrying least about the clients using them.

Can we get around this with some sort of congestion logic?

Thanks
--Chakri

On 9/21/07, Trond Myklebust <[EMAIL PROTECTED]> wrote:
> On Fri, 2007-09-21 at 09:20 -0700, Chakri n wrote:
> > Thanks.
> >
> > I was using flock (BSD locking) and I think the problem should be
> > solved if I move my application to use POSIX locks.
>
> Yup.
>
> > And any option to avoid processes waiting indefinitely to free pages
> > from NFS requests waiting on an unresponsive NFS server?
>
> The only solution I know of is to use soft mounts, but that brings
> another set of problems:
> 1. most applications don't know how to recover safely from an EIO
>    error.
> 2. You lose data.
>
> Cheers
>   Trond
Re: [NFS] NFS on loopback locks up entire system (2.6.23-rc6)?
Thanks.

I was using flock (BSD locking) and I think the problem should be solved if I move my application to use POSIX locks.

And any option to avoid processes waiting indefinitely to free pages from NFS requests waiting on an unresponsive NFS server?

Thanks
--Chakri

On 9/21/07, Trond Myklebust <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-09-20 at 20:12 -0700, Chakri n wrote:
> > Thanks Trond, for clarifying this for me.
> >
> > I have seen similar behavior when a remote NFS server is not
> > available. Many processes end up waiting in nfs_release_page. So,
> > what will happen if the remote server is not available?
> > nfs_release_page cannot free the memory, since it waits on an rpc
> > request to complete, which never completes, and processes wait in
> > there forever.
> >
> > And unfortunately in my case, I cannot use "mount --bind". I want to
> > use the same file system from two different nodes, and I want file &
> > record locking to be consistent. The only way to make sure locking is
> > consistent is to use loopback NFS on one host and NFS-mount the same
> > file system on the other nodes, so that the NFS server ensures file &
> > record locking is consistent. Is there any alternative to this?
> >
> > Is it possible, or are there any efforts, to integrate ext3 or other
> > local file systems' locking with network file system locking, so that
> > a user can use "mount --bind" on the local host and an NFS mount on
> > remote nodes, but file & record locking will be consistent between
> > both nodes?
>
> Could you be a bit more specific? Is the problem that your application
> is using BSD locks (flock()) instead of POSIX locks?
>
> Cheers
>   Trond
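For reference, the distinction Trond is drawing: flock() takes a BSD lock, which the NFS client of this era keeps purely local to the client, while fcntl() POSIX locks are forwarded to the server's lock manager and so are seen consistently by every node. A minimal sketch of the two calls (the path and the error handling are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
        struct flock fl = {
                .l_type   = F_WRLCK,    /* exclusive write lock */
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,          /* 0 = lock the whole file */
        };
        int fd = open("/mnt/nfs/lockfile", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* BSD lock: on 2.6-era NFS this is local to this client only. */
        if (flock(fd, LOCK_EX) < 0)
                perror("flock");
        flock(fd, LOCK_UN);

        /* POSIX lock: goes through the server's lockd, so it is
         * consistent across all clients of the same export. */
        if (fcntl(fd, F_SETLKW, &fl) < 0)
                perror("fcntl(F_SETLKW)");
        fl.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &fl);

        close(fd);
        return 0;
}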
Re: NFS on loopback locks up entire system (2.6.23-rc6)?
Thanks Trond, for clarifying this for me.

I have seen similar behavior when a remote NFS server is not available. Many processes end up waiting in nfs_release_page. So, what will happen if the remote server is not available? nfs_release_page cannot free the memory, since it waits on an rpc request to complete, which never completes, and processes wait in there forever.

And unfortunately in my case, I cannot use "mount --bind". I want to use the same file system from two different nodes, and I want file & record locking to be consistent. The only way to make sure locking is consistent is to use loopback NFS on one host and NFS-mount the same file system on the other nodes, so that the NFS server ensures file & record locking is consistent. Is there any alternative to this?

Is it possible, or are there any efforts, to integrate ext3 or other local file systems' locking with network file system locking, so that a user can use "mount --bind" on the local host and an NFS mount on remote nodes, but file & record locking will be consistent between both nodes?

Thanks
--Chakri

On 9/20/07, Trond Myklebust <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-09-20 at 17:22 -0700, Chakri n wrote:
> > Hi,
> >
> > In my testing, NFS on loopback locks up the entire system with the
> > 2.6.23-rc6 kernel.
> >
> > I have mounted a local ext3 partition using loopback NFS (version 3)
> > and started my test program. The test program forks 20 threads,
> > allocates 10MB for each thread, and writes & reads a file on the
> > loopback NFS mount. After running for about 5 min, I cannot even
> > log in to the machine. Commands like ps, etc., hang in a live session.
> >
> > The machine is a DELL 1950 with 4 Gig of RAM, so there is plenty of
> > RAM & CPU to play around with, and no other I/O-heavy processes are
> > running on the system.
> >
> > vmstat output shows no buffers are actually getting transferred in or
> > out and iowait is 100%.
> >
> > [EMAIL PROTECTED] ~]# vmstat 1
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> >  r  b  swpd   free   buff   cache  si  so  bi  bo  in  cs us sy id wa st
> >  0 24   116 110080  11132 3045664   0   0   0   0  28 345  0  1  0 99  0
> >  0 24   116 110080  11132 3045664   0   0   0   0   5 329  0  0  0 100 0
> >  0 24   116 110080  11132 3045664   0   0   0   0  26 336  0  0  0 100 0
> >  0 24   116 110080  11132 3045664   0   0   0   0   8 335  0  0  0 100 0
> >  0 24   116 110080  11132 3045664   0   0   0   0  26 352  0  0  0 100 0
> >  0 24   116 110080  11132 3045664   0   0   0   0   8 351  0  0  0 100 0
> >  0 24   116 110080  11132 3045664   0   0   0   0  23 358  0  1  0 99  0
> >  0 24   116 110080  11132 3045664   0   0   0   0  10 350  0  0  0 100 0
> >  0 24   116 110080  11132 3045664   0   0   0   0  26 363  0  0  0 100 0
> >  0 24   116 110080  11132 3045664   0   0   0   0   8 346  0  1  0 99  0
> >  0 24   116 110080  11132 3045664   0   0   0   0  26 360  0  0  0 100 0
> >  0 24   116 110080  11140 3045656   0   0   8   0  11 345  0  0  0 100 0
> >  0 24   116 110080  11140 3045664   0   0   0   0  27 355  0  0  2 97  0
> >  0 24   116 110080  11140 3045664   0   0   0   0   9 330  0  0  0 100 0
> >  0 24   116 110080  11140 3045664   0   0   0   0  26 358  0  0  0 100 0
> >
> > The following is the backtrace of
> > 1. one of the threads of my test program,
> > 2. the nfsd daemon, and
> > 3. a generic command like pstree, after the machine hangs:
> > -----------------------------------------------------------------------
> > crash> bt 3252
> > PID: 3252  TASK: f6f3c610  CPU: 0  COMMAND: "test"
> >  #0 [f6bdcc10] schedule at c0624a34
> >  #1 [f6bdcc84] schedule_timeout at c06250ee
> >  #2 [f6bdccc8] io_schedule_timeout at c0624c15
> >  #3 [f6bdccdc] congestion_wait at c045eb7d
> >  #4 [f6bdcd00] balance_dirty_pages_ratelimited_nr at c045ab91
> >  #5 [f6bdcd54] generic_file_buffered_write at c0457148
> >  #6 [f6bdcde8] __generic_file_aio_write_nolock at c04576e5
> >  #7 [f6bdce40] try_to_wake_up at c042342b
> >  #8 [f6bdce5c] generic_file_aio_write at c0457799
> >  #9 [f6bdce8c] nfs_file_write at f8c25cee
> > #10 [f6bdced0] do_sync_write at c0472e27
> > #11 [f6bdcf7c] vfs_write at c0473689
> > #12 [f6bdcf98] sys_write at c0473c95
> > #13 [f6bdcfb4] sysenter_entry at c0404ddf
NFS on loopback locks up entire system (2.6.23-rc6)?
Hi,

In my testing, NFS on loopback locks up the entire system with the 2.6.23-rc6 kernel.

I have mounted a local ext3 partition using loopback NFS (version 3) and started my test program. The test program forks 20 threads, allocates 10MB for each thread, and writes & reads a file on the loopback NFS mount. After running for about 5 min, I cannot even log in to the machine. Commands like ps, etc., hang in a live session.

The machine is a DELL 1950 with 4 Gig of RAM, so there is plenty of RAM & CPU to play around with, and no other I/O-heavy processes are running on the system.

vmstat output shows no buffers are actually getting transferred in or out and iowait is 100%.

[EMAIL PROTECTED] ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd   free   buff   cache  si  so  bi  bo  in  cs us sy id wa st
 0 24   116 110080  11132 3045664   0   0   0   0  28 345  0  1  0 99  0
 0 24   116 110080  11132 3045664   0   0   0   0   5 329  0  0  0 100 0
 0 24   116 110080  11132 3045664   0   0   0   0  26 336  0  0  0 100 0
 0 24   116 110080  11132 3045664   0   0   0   0   8 335  0  0  0 100 0
 0 24   116 110080  11132 3045664   0   0   0   0  26 352  0  0  0 100 0
 0 24   116 110080  11132 3045664   0   0   0   0   8 351  0  0  0 100 0
 0 24   116 110080  11132 3045664   0   0   0   0  23 358  0  1  0 99  0
 0 24   116 110080  11132 3045664   0   0   0   0  10 350  0  0  0 100 0
 0 24   116 110080  11132 3045664   0   0   0   0  26 363  0  0  0 100 0
 0 24   116 110080  11132 3045664   0   0   0   0   8 346  0  1  0 99  0
 0 24   116 110080  11132 3045664   0   0   0   0  26 360  0  0  0 100 0
 0 24   116 110080  11140 3045656   0   0   8   0  11 345  0  0  0 100 0
 0 24   116 110080  11140 3045664   0   0   0   0  27 355  0  0  2 97  0
 0 24   116 110080  11140 3045664   0   0   0   0   9 330  0  0  0 100 0
 0 24   116 110080  11140 3045664   0   0   0   0  26 358  0  0  0 100 0

The following is the backtrace of
1. one of the threads of my test program,
2. the nfsd daemon, and
3. a generic command like pstree, after the machine hangs:
-----------------------------------------------------------------------
crash> bt 3252
PID: 3252  TASK: f6f3c610  CPU: 0  COMMAND: "test"
 #0 [f6bdcc10] schedule at c0624a34
 #1 [f6bdcc84] schedule_timeout at c06250ee
 #2 [f6bdccc8] io_schedule_timeout at c0624c15
 #3 [f6bdccdc] congestion_wait at c045eb7d
 #4 [f6bdcd00] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f6bdcd54] generic_file_buffered_write at c0457148
 #6 [f6bdcde8] __generic_file_aio_write_nolock at c04576e5
 #7 [f6bdce40] try_to_wake_up at c042342b
 #8 [f6bdce5c] generic_file_aio_write at c0457799
 #9 [f6bdce8c] nfs_file_write at f8c25cee
#10 [f6bdced0] do_sync_write at c0472e27
#11 [f6bdcf7c] vfs_write at c0473689
#12 [f6bdcf98] sys_write at c0473c95
#13 [f6bdcfb4] sysenter_entry at c0404ddf
    EAX: 0004  EBX: 0013  ECX: a4966008  EDX: 0098
    DS:  007b  ESI: 0098  ES:  007b  EDI: a4966008
    SS:  007b  ESP: a5ae6ec0  EBP: a5ae6ef0
    CS:  0073  EIP: b7eed410  ERR: 0004  EFLAGS: 0246

crash> bt 3188
PID: 3188  TASK: f74c4000  CPU: 1  COMMAND: "nfsd"
 #0 [f6836c7c] schedule at c0624a34
 #1 [f6836cf0] __mutex_lock_slowpath at c062543d
 #2 [f6836d0c] mutex_lock at c0625326
 #3 [f6836d18] generic_file_aio_write at c0457784
 #4 [f6836d48] ext3_file_write at ffd7
 #5 [f6836d64] do_sync_readv_writev at c0472d1f
 #6 [f6836e08] do_readv_writev at c0473486
 #7 [f6836e6c] vfs_writev at c047358e
 #8 [f6836e7c] nfsd_vfs_write at f8e7f8d7
 #9 [f6836ee0] nfsd_write at f8e80139
#10 [f6836f10] nfsd3_proc_write at f8e86afd
#11 [f6836f44] nfsd_dispatch at f8e7c20c
#12 [f6836f6c] svc_process at f89c18e0
#13 [f6836fbc] nfsd at f8e7c794
#14 [f6836fe4] kernel_thread_helper at c0405a35

crash> ps | grep ps
    234      2   3  cb194000  IN   0.0      0      0  [khpsbpkt]
    520      2   0  f7e18c20  IN   0.0      0      0  [kpsmoused]
   2859      1   2  f7f3cc20  IN   0.1   9600   2040  cupsd
   3340   3310   0  f4a0f840  UN   0.0   4360    816  pstree
   3343   3284   2  f4a0f230  UN   0.0   4212    944  ps

crash> bt 3340
PID: 3340  TASK: f4a0f840  CPU: 0  COMMAND: "pstree"
 #0 [e856be30] schedule at c0624a34
 #1 [e856bea4] rwsem_down_failed_common at c04df6c0
 #2 [e856bec4] rwsem_down_read_failed at c0625c2a
 #3 [e856bedc] call_rwsem_down_read_failed at c0625c96
 #4 [e856bee8] down_read at c043c21a
 #5 [e856bef0] access_process_vm at c0462039
 #6 [e856bf38] proc_pid_cmdline at c04a1bbb
 #7 [e856bf58] proc_info_read at c04a2f41
 #8 [e856bf7c] vfs_read at c04737db
 #9 [e856bf98] sys_read at c0473c2e
#10 [e856bfb4] sysenter_entry at c0404ddf
    EAX: 0003  EBX: 0005  ECX: 0804dc58  EDX: 0062
    DS:  007b  ESI: 0cba  ES:  007b  EDI: 0804e0e0
    SS:  007b  ESP: bfa3afe8  EBP: bfa3d4f8
NFS on loopback locks up entire system(2.6.23-rc6)?
Hi,

I am testing NFS on loopback, which locks up the entire system with the 2.6.23-rc6 kernel. I have mounted a local ext3 partition using loopback NFS (version 3) and started my test program. The test program forks 20 threads, allocates 10MB for each thread, and writes/reads a file on the loopback NFS mount. After running for about 5 min, I cannot even log in to the machine; commands like ps hang in a live session. The machine is a DELL 1950 with 4Gig of RAM, so there is plenty of RAM & CPU to play around with, and no other I/O-heavy processes are running on the system.

vmstat output shows no buffers are actually getting transferred in or out, and iowait is 100%:

[EMAIL PROTECTED] ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd  free  buff  cache  si  so  bi  bo  in  cs us sy id wa st
 0 24116 110080 11132 304566400 0 0 28 345 0 1 0 99 0
 0 24116 110080 11132 304566400 0 05 329 0 0 0 100 0
 0 24116 110080 11132 304566400 0 0 26 336 0 0 0 100 0
 0 24116 110080 11132 304566400 0 08 335 0 0 0 100 0
 0 24116 110080 11132 304566400 0 0 26 352 0 0 0 100 0
 0 24116 110080 11132 304566400 0 08 351 0 0 0 100 0
 0 24116 110080 11132 304566400 0 0 23 358 0 1 0 99 0
 0 24116 110080 11132 304566400 0 0 10 350 0 0 0 100 0
 0 24116 110080 11132 304566400 0 0 26 363 0 0 0 100 0
 0 24116 110080 11132 304566400 0 08 346 0 1 0 99 0
 0 24116 110080 11132 304566400 0 0 26 360 0 0 0 100 0
 0 24116 110080 11140 304565600 8 0 11 345 0 0 0 100 0
 0 24116 110080 11140 304566400 0 0 27 355 0 0 2 97 0
 0 24116 110080 11140 304566400 0 09 330 0 0 0 100 0
 0 24116 110080 11140 304566400 0 0 26 358 0 0 0 100 0

The following are the backtraces of (1) one of the threads of my test program, (2) the nfsd daemon, and (3) a generic command like pstree, after the machine hangs:

crash> bt 3252
PID: 3252  TASK: f6f3c610  CPU: 0  COMMAND: test
 #0 [f6bdcc10] schedule at c0624a34
 #1 [f6bdcc84] schedule_timeout at c06250ee
 #2 [f6bdccc8] io_schedule_timeout at c0624c15
 #3 [f6bdccdc] congestion_wait at c045eb7d
 #4 [f6bdcd00] balance_dirty_pages_ratelimited_nr at c045ab91
 #5 [f6bdcd54] generic_file_buffered_write at c0457148
 #6 [f6bdcde8] __generic_file_aio_write_nolock at c04576e5
 #7 [f6bdce40] try_to_wake_up at c042342b
 #8 [f6bdce5c] generic_file_aio_write at c0457799
 #9 [f6bdce8c] nfs_file_write at f8c25cee
#10 [f6bdced0] do_sync_write at c0472e27
#11 [f6bdcf7c] vfs_write at c0473689
#12 [f6bdcf98] sys_write at c0473c95
#13 [f6bdcfb4] sysenter_entry at c0404ddf
    EAX: 0004  EBX: 0013  ECX: a4966008  EDX: 0098
    DS:  007b  ESI: 0098  ES:  007b  EDI: a4966008
    SS:  007b  ESP: a5ae6ec0  EBP: a5ae6ef0
    CS:  0073  EIP: b7eed410  ERR: 0004  EFLAGS: 0246

crash> bt 3188
PID: 3188  TASK: f74c4000  CPU: 1  COMMAND: nfsd
 #0 [f6836c7c] schedule at c0624a34
 #1 [f6836cf0] __mutex_lock_slowpath at c062543d
 #2 [f6836d0c] mutex_lock at c0625326
 #3 [f6836d18] generic_file_aio_write at c0457784
 #4 [f6836d48] ext3_file_write at ffd7
 #5 [f6836d64] do_sync_readv_writev at c0472d1f
 #6 [f6836e08] do_readv_writev at c0473486
 #7 [f6836e6c] vfs_writev at c047358e
 #8 [f6836e7c] nfsd_vfs_write at f8e7f8d7
 #9 [f6836ee0] nfsd_write at f8e80139
#10 [f6836f10] nfsd3_proc_write at f8e86afd
#11 [f6836f44] nfsd_dispatch at f8e7c20c
#12 [f6836f6c] svc_process at f89c18e0
#13 [f6836fbc] nfsd at f8e7c794
#14 [f6836fe4] kernel_thread_helper at c0405a35

crash> ps | grep ps
  234     2  3  cb194000  IN  0.0     0     0  [khpsbpkt]
  520     2  0  f7e18c20  IN  0.0     0     0  [kpsmoused]
 2859     1  2  f7f3cc20  IN  0.1  9600  2040  cupsd
 3340  3310  0  f4a0f840  UN  0.0  4360   816  pstree
 3343  3284  2  f4a0f230  UN  0.0  4212   944  ps

crash> bt 3340
PID: 3340  TASK: f4a0f840  CPU: 0  COMMAND: pstree
 #0 [e856be30] schedule at c0624a34
 #1 [e856bea4] rwsem_down_failed_common at c04df6c0
 #2 [e856bec4] rwsem_down_read_failed at c0625c2a
 #3 [e856bedc] call_rwsem_down_read_failed at c0625c96
 #4 [e856bee8] down_read at c043c21a
 #5 [e856bef0] access_process_vm at c0462039
 #6 [e856bf38] proc_pid_cmdline at c04a1bbb
 #7 [e856bf58] proc_info_read at c04a2f41
 #8 [e856bf7c] vfs_read at c04737db
 #9 [e856bf98] sys_read at c0473c2e
#10 [e856bfb4] sysenter_entry at c0404ddf
    EAX: 0003  EBX: 0005  ECX: 0804dc58  EDX: 0062
    DS:  007b  ESI: 0cba  ES:  007b  EDI: 0804e0e0
    SS:  007b  ESP: bfa3afe8  EBP: bfa3d4f8
    CS:
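The test program itself was not posted; the following is a minimal sketch of a reproducer along the lines described above (20 writer threads, a 10MB buffer each, one file per thread). The mount path /mnt/loopnfs and the file names are illustrative assumptions, not details from the original report.

/* Hypothetical reproducer sketch: N threads each loop a 10MB
 * write+read against a file on the loopback NFS mount. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 20
#define BUFSIZE  (10 * 1024 * 1024)   /* 10MB per thread */

static void *writer(void *arg)
{
    long id = (long)arg;
    char path[64];
    char *buf = malloc(BUFSIZE);

    if (!buf)
        return NULL;
    memset(buf, 'x', BUFSIZE);
    snprintf(path, sizeof(path), "/mnt/loopnfs/testfile.%ld", id);

    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        perror("open");
        free(buf);
        return NULL;
    }
    for (;;) {
        /* Dirties pages; writers end up throttled in
         * balance_dirty_pages_ratelimited(), as in the backtrace above. */
        if (pwrite(fd, buf, BUFSIZE, 0) < 0)
            break;
        if (pread(fd, buf, BUFSIZE, 0) < 0)
            break;
    }
    close(fd);
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, writer, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}

Built with gcc -pthread, the writers dirty pages faster than the stuck nfsd can clean them, which is what drives every writer into balance_dirty_pages_ratelimited() in the first backtrace.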
Re: NFS on loopback locks up entire system (2.6.23-rc6)?
Thanks Trond, for clarifying this for me.

I have seen similar behavior when a remote NFS server is not available: many processes end up waiting in nfs_release_page. So what happens if the remote server is not available? nfs_release_page cannot free the memory, since it waits on an RPC request to complete, which never completes, so the processes wait in there forever?

And unfortunately, in my case I cannot use mount --bind. I want to use the same file system from two different nodes, and I want file record locking to be consistent. The only way to make sure locking is consistent is to use loopback NFS on one host and NFS-mount the same file system on the other nodes, so that the NFS server ensures file record locking is consistent. Is there any alternative to this? Is it possible, or are there any efforts, to integrate ext3 or other local file systems' locking with network file system locking, so that a user could use mount --bind on the local host and an NFS mount on remote nodes, with file record locking consistent between both nodes?

Thanks
--Chakri
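For context, the file record locking in question is POSIX byte-range locking via fcntl(); on an NFSv3 mount, the lock request is forwarded to the server's lock manager (lockd/NLM) rather than taken purely in the local kernel. The concern in the message above is that locks taken directly on the local file system and locks taken via NFS may not be kept coherent, which is why the poster routes all access through the NFS server. A minimal sketch of such record locking follows; the path and byte range are illustrative assumptions.

/* Sketch of POSIX record locking on a (possibly NFS-backed) file.
 * The path /mnt/shared/records.dat is a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/shared/records.dat", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = {
        .l_type   = F_WRLCK,     /* exclusive write lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,           /* lock the first 512 bytes */
        .l_len    = 512,
    };

    if (fcntl(fd, F_SETLKW, &fl) < 0) {  /* block until the record is free */
        perror("fcntl(F_SETLKW)");
        close(fd);
        return 1;
    }

    /* ... read/update the locked record here ... */

    fl.l_type = F_UNLCK;                 /* release the lock */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}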
Re: NFS hang + umount -f: better behaviour requested.
To add to the pain, lsof and fuser hang on unresponsive shares. I wrote my own wrapper to go through the /proc/<pid> file tables and find any process using the unresponsive mounts, and kill those processes. This works well.

Also, it brings up another point. If the unresponsiveness problem cannot be fixed (because of the NFS data-corruption concerns below), is it possible for a mount to have both soft & hard semantics? Some processes might want to use the mount point soft and other processes hard. This can be implemented easily in the NFS & SUNRPC layers by adding timeouts to requests, but it becomes tricky in the VFS layer: if a soft process is waiting on an inode locked by a hard process, the soft process gets hard semantics too.

Thanks
--Chakri

On 8/21/07, Peter Staubach <[EMAIL PROTECTED]> wrote:
> John Stoffel wrote:
> > Robin> I'm bringing this up again (I know it's been mentioned here
> > Robin> before) because I had been told that NFS support had gotten
> > Robin> better in Linux recently, so I have been (for my $dayjob)
> > Robin> testing the behaviour of NFS (autofs NFS, specifically) under
> > Robin> Linux with hard,intr and using iptables to simulate a hang.
> >
> > So why are you mounting with hard,intr semantics? At my current
> > SysAdmin job, we mount everything (Solaris included) with 'soft,intr'
> > and it works well. If an NFS server goes down, clients don't hang for
> > large periods of time.
>
> Wow! That's _really_ a bad idea. NFS READ operations which
> timeout can lead to executables which mysteriously fail, file
> corruption, etc. NFS WRITE operations which fail may or may
> not lead to file corruption.
>
> Anything writable should _always_ be mounted "hard" for safety
> purposes. Readonly mounted file systems _may_ be mounted "soft",
> depending upon what is located on them.
>
> > Robin> fuser hangs, as far as I can tell indefinitely, as does
> > Robin> lsof. umount -f returns after a long time with "busy", umount
> > Robin> -l works after a long time but leaves the system in a very
> > Robin> unfortunate state such that I have to kill things by hand and
> > Robin> manually edit /etc/mtab to get autofs to work again.
> >
> > Robin> The "correct solution" to this situation according to
> > Robin> http://nfs.sourceforge.net/ is cycles of "kill processes" and
> > Robin> "umount -f". This has two problems: 1. It sucks. 2. If fuser
> > Robin> and lsof both hang (and they do: fuser has been on
> > Robin> "stat("/home/rpowell/"," for > 30 minutes now), I have no way to
> > Robin> pick which processes to kill.
> >
> > Robin> I've read every man page I could find, and the only nfs option
> > Robin> that seems even vaguely helpful is "soft", but everything that
> > Robin> mentions "soft" also says to never use it.
> >
> > I think the man pages are out of date, or ignoring reality. Try
> > mounting with soft,intr and see how it works for you. I think you'll
> > be happy.
>
> Please don't. You will end up regretting it in the long run.
> Taking a chance on corrupted data or critical applications which
> just fail is not worth the benefit.
>
> It would be safer for us to implement something which works like
> the Solaris forced umount support for NFS.
>
> Thanx...
>
> ps
>
> > Robin> This is the single worst aspect of adminning a Linux system that I,
> > Robin> as a career sysadmin, have to deal with. In fact, it's really the
> > Robin> only one I even dislike. At my current work place, we've lost
> > Robin> multiple person-days to this issue, having to go around and reboot
> > Robin> every Linux box that was hanging off a down NFS server.
> >
> > Robin> I know many other admins who also really want Solaris-style
> > Robin> "umount -f"; I'm sure if I passed the hat I could get a decent
> > Robin> bounty together for this feature; let me know if you're interested.
> >
> > Robin> Thanks.
> > Robin> -Robin
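The /proc-scanning wrapper mentioned at the top of this message was not posted; the following is a rough, hypothetical reconstruction of the idea (the mount path is a placeholder, and it prints PIDs rather than killing them).

/* Sketch: find processes holding files open under an unresponsive mount
 * by walking /proc/<pid>/fd. Hypothetical reconstruction, not the
 * original wrapper. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *mnt = argc > 1 ? argv[1] : "/mnt/deadnfs";  /* placeholder */
    size_t mntlen = strlen(mnt);
    DIR *proc = opendir("/proc");
    if (!proc) {
        perror("opendir /proc");
        return 1;
    }

    struct dirent *de;
    while ((de = readdir(proc)) != NULL) {
        if (de->d_name[0] < '0' || de->d_name[0] > '9')
            continue;                  /* only numeric pid directories */

        char fddir[64];
        snprintf(fddir, sizeof(fddir), "/proc/%s/fd", de->d_name);
        DIR *fds = opendir(fddir);
        if (!fds)
            continue;                  /* no permission, or process exited */

        struct dirent *fe;
        while ((fe = readdir(fds)) != NULL) {
            char link[128], target[4096];
            snprintf(link, sizeof(link), "%s/%s", fddir, fe->d_name);
            ssize_t n = readlink(link, target, sizeof(target) - 1);
            if (n < 0)
                continue;
            target[n] = '\0';
            if (strncmp(target, mnt, mntlen) == 0) {
                printf("pid %s holds %s\n", de->d_name, target);
                break;                 /* one hit per process is enough */
            }
        }
        closedir(fds);
    }
    closedir(proc);
    return 0;
}

Reading the /proc/<pid>/fd symlinks with readlink() only inspects kernel-side path strings and never stat()s the target file, which is plausibly why this approach keeps working while lsof and fuser hang on the dead mount.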
Re: OOPS in shrink_dcache_for_umount
The patches do not help. The system still panicked in the same place. Still trying to correlate the problem & fix to a specific patch.

Regards
--Chakri

On 8/5/07, Chakri n <[EMAIL PROTECTED]> wrote:
> Hi Malte,
>
> Thanks for the information.
>
> Based on your suggestion I tried the following two patches on top of
> the 2.6.18-1.8.el5 NFS code. I had to keep the changes minimal to fix this
> crash in our release.
>
> http://linux-nfs.org/Linux-2.6.x/2.6.20-rc7/linux-2.6.20-007-fix_readdir_negative_dentry.dif
> http://linux-nfs.org/Linux-2.6.x/2.6.20-rc7/linux-2.6.20-008-fix_readdir_positive_dentry.dif
>
> The systems have been running for the past 7 hours without any issues.
> Hopefully this fixes it.
>
> Regards
> --Chakri
>
> On 8/4/07, Malte Schröder <[EMAIL PROTECTED]> wrote:
> > On Thu, 2 Aug 2007 14:27:04 -0700
> > "Chakri n" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > We are seeing this problem while unmounting file systems. It happens
> > > once in a while.
> > > I am able to grab the trace and core from linux-2.6.18-1.8.el5, but I
> > > have observed the same problem with the linux-2.6.20.1 kernel.
> > >
> > > Has this problem been fixed in a recent kernel?
> >
> > I had those too ... but I haven't seen one in a while.
> > I currently run 2.6.22 + cfs-v19 + the patch from
> > http://linux-nfs.org/Linux-2.6.x/2.6.22/linux-2.6.22-NFS_ALL.dif
> >
> > --
> > Malte Schröder
Re: OOPS in shrink_dcache_for_umount
Hi Malte,

Thanks for the information.

Based on your suggestion I tried the following two patches on top of the 2.6.18-1.8.el5 NFS code. I had to keep the changes minimal to fix this crash in our release.

http://linux-nfs.org/Linux-2.6.x/2.6.20-rc7/linux-2.6.20-007-fix_readdir_negative_dentry.dif
http://linux-nfs.org/Linux-2.6.x/2.6.20-rc7/linux-2.6.20-008-fix_readdir_positive_dentry.dif

The systems have been running for the past 7 hours without any issues. Hopefully this fixes it.

Regards
--Chakri

On 8/4/07, Malte Schröder <[EMAIL PROTECTED]> wrote:
> On Thu, 2 Aug 2007 14:27:04 -0700
> "Chakri n" <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > We are seeing this problem while unmounting file systems. It happens
> > once in a while.
> > I am able to grab the trace and core from linux-2.6.18-1.8.el5, but I
> > have observed the same problem with the linux-2.6.20.1 kernel.
> >
> > Has this problem been fixed in a recent kernel?
>
> I had those too ... but I haven't seen one in a while.
> I currently run 2.6.22 + cfs-v19 + the patch from
> http://linux-nfs.org/Linux-2.6.x/2.6.22/linux-2.6.22-NFS_ALL.dif
>
> --
> Malte Schröder
OOPS in shrink_dcache_for_umount
Hi,

We are seeing this problem while unmounting file systems. It happens once in a while. I am able to grab the trace and core from linux-2.6.18-1.8.el5, but I have observed the same problem with the linux-2.6.20.1 kernel.

Has this problem been fixed in a recent kernel?

BUG: Dentry f7498f70{i=12803e,n=client72} still in use (1) [unmount of nfs 0:18]
------------[ cut here ]------------
kernel BUG at fs/dcache.c:615!
invalid opcode: [#1] SMP
last sysfs file: /block/ram0/range
Modules linked in: nfs fscache nfsd exportfs lockd nfs_acl ipmi_msghandler kazprocs(U) sunrpc ipv6 dm_mirror dm_mod video sbs i2c_ec button battery asus_acpi ac lp floppy snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ide_cd e100 parport_pc i2c_i801 mii parport pcspkr i2c_core cdrom serio_raw ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<c047e49f>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-8.el5 #1)
EIP is at shrink_dcache_for_umount_subtree+0x133/0x1c1
eax: 0054   ebx: f7498f70   ecx: c0621b24   edx: de121ee8
esi: 0001   edi: f7835740   ebp: 0012803e   esp: de121ee4
ds: 007b   es: 007b   ss: 0068
Process umount.nfs (pid: 29559, ti=de121000 task=ebbe5000 task.ti=de121000)
Stack: c0621b24 f7498f70 0012803e f7498fd4 0001 f8d55765 f7835740 f7835600
       f8d67980 c047ec5d f7835600 c0470153 0018 f8d67960 c0470248 e1b0d380
       f8d3a175 f7835600 c04702d7 f74629c0 f7835600 c0483165
Call Trace:
 [<c047ec5d>] shrink_dcache_for_umount+0x2e/0x3a
 [<c0470153>] generic_shutdown_super+0x16/0xd5
 [<c0470248>] kill_anon_super+0x9/0x2f
 [<f8d3a175>] nfs_kill_super+0xc/0x14 [nfs]
 [<c04702d7>] deactivate_super+0x52/0x65
 [<c0483165>] sys_umount+0x1f0/0x218
 [<c059c9eb>] release_sock+0xc/0x91
 [<c05fd48f>] do_page_fault+0x20a/0x4b8
 [<c05fd4f9>] do_page_fault+0x274/0x4b8
 [<c0483198>] sys_oldumount+0xb/0xe
 [<c0403eff>] syscall_call+0x7/0xb
=======================
Code: ed 8b 53 0c 8b 33 8b 4b 24 8d b8 40 01 00 00 8b 40 1c 85 d2 8b 00 74 03 8b 6a 20 57 50 56 51 55 53 68 24 1b 62 c0 e8 c5 5c fa ff <0f> 0b 67 02 18 1b 62 c0 83 c4 1c 8b 73 18 39 de 75 04 31 f6 eb
EIP: [<c047e49f>] shrink_dcache_for_umount_subtree+0x133/0x1c1 SS:ESP 0068:de121ee4

Thanks
--Chakri