Re: Processes spinning forever, apparently in lock_timer_base()?
On Fri, Sep 21, 2007 at 16:58:15 +0100, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> But once I look harder at it, I wonder what would have kept
> 2.6.18 to 2.6.23 safe from the same issue: per-cpu deltas from
> the global vm stats too low to get synched back to global, yet
> adding up to something which misleads balance_dirty_pages into
> an indefinite loop e.g. total nr_writeback actually 0, but
> appearing more than dirty_thresh in the global approximation.

I have been seeing this problem since Fedora Core 5's 2.6.20-1.2300.fc5
kernel. The last Fedora kernel that didn't have this problem was
2.6.19-1.2288.2.4.fc5, which I am still running on one machine.

I have put up vmstat and /proc/meminfo data at:
http://wolff.to/bruno/vmlog
There are a number of iterations of the output and they are mixed
together, though the vmstat info was written when a buffer was full, so
there are chunks of lines together.

This was from a machine running Fedora 7's 2.6.23.1-4.fc7 kernel. The
machine is a single-processor P3 with 512 MB of memory. There are three
hard drives, two of which are used in software RAID 1 arrays. I have ext3
file systems mounted noatime. I can provide more details if you need
anything else.

I can get lockups to happen fairly often. Often lots of firefox activity
and/or some other disk-heavy task such as yum update will get it to
happen.

I am not subscribed to LKML, so please keep me copied on replies.
Re: Processes spinning forever, apparently in lock_timer_base()?
On Fri, 2007-09-21 at 03:33 -0700, Andrew Morton wrote:
> On Fri, 21 Sep 2007 11:25:41 +0100 richard kennedy <[EMAIL PROTECTED]> wrote:
>
> > > That's all a bit crappy if the wrong races happen and some other task is
> > > somehow exceeding the dirty limits each time this task polls them.  Seems
> > > unlikely that such a condition would persist forever.
> > >
> > > So the question is, why do we have large amounts of dirty pages for one
> > > disk which appear to be sitting there not getting written?
> >
> > The lockup I'm seeing intermittently occurs when I have 2+ tasks copying
> > large files (1GB+) on sda & a small read-mainly mysql db app running on
> > sdb. The lockup seems to happen just after the copies finish -- there
> > are lots of dirty pages but nothing left to write them until kupdate
> > gets round to it.
>
> Then what happens?  The system recovers?

When my system is locked up I get this from sysrq-w (2.6.23-rc7):

SysRq : Show Blocked State
auditd        D 8100b422fd28     0  1999      1
 8100b422fd78 0086 8100b4103020 0286
 8100b4103020 8100bf8af020 8100b41032b8 0001
 0001 0003
Call Trace:
 [8802c8e9] :jbd:log_wait_commit+0xa3/0xf5
 [810482d9] autoremove_wake_function+0x0/0x2e
 [8802770b] :jbd:journal_stop+0x1be/0x1ee
 [810a79f7] __writeback_single_inode+0x1f4/0x332
 [8108c0b0] vfs_statfs_native+0x29/0x34
 [810a8362] sync_inode+0x24/0x36
 [8803c350] :ext3:ext3_sync_file+0xb4/0xc8
 [81240730] mutex_lock+0x1e/0x2f
 [810aa7fa] do_fsync+0x52/0xa4
 [810aa86f] __do_fsync+0x23/0x36
 [8100bc8e] system_call+0x7e/0x83
syslogd       D 8100b3f67d28     0  2022      1
 8100b3f67d78 0086 0001 0003
 810037c66810 8100bf8af020 810037c66aa8 0001
 0001 0003
Call Trace:
 [8802c8e9] :jbd:log_wait_commit+0xa3/0xf5
 [810482d9] autoremove_wake_function+0x0/0x2e
 [8802770b] :jbd:journal_stop+0x1be/0x1ee
 [810a79f7] __writeback_single_inode+0x1f4/0x332
 [810a8362] sync_inode+0x24/0x36
 [8803c350] :ext3:ext3_sync_file+0xb4/0xc8
 [81240730] mutex_lock+0x1e/0x2f
 [810aa7fa] do_fsync+0x52/0xa4
 [810aa86f] __do_fsync+0x23/0x36
 [8100be0c] tracesys+0xdc/0xe1
Re: Processes spinning forever, apparently in lock_timer_base()?
On Fri, 21 Sep 2007 16:58:15 +0100 (BST) Hugh Dickins <[EMAIL PROTECTED]> wrote:

> But once I look harder at it, I wonder what would have kept
> 2.6.18 to 2.6.23 safe from the same issue: per-cpu deltas from
> the global vm stats too low to get synched back to global, yet
> adding up to something which misleads balance_dirty_pages into
> an indefinite loop e.g. total nr_writeback actually 0, but
> appearing more than dirty_thresh in the global approximation.

This could only happen when:

  dirty_thresh < nr_cpus * per_cpu_max_delta

> Looking at the 2.6.18-2.6.23 code, I'm uncertain what to try instead.
> There is a refresh_vm_stats function which we could call (then retest
> the break condition) just before resorting to congestion_wait. But
> the big NUMA people might get very upset with me calling that too
> often: causing a thundering herd of bouncing cachelines which that
> was all designed to avoid. And it's not obvious to me what condition
> to test for dirty_thresh "too low".

That could be modeled on the error limit I have. For this particular
case that would end up looking like:

  nr_online_cpus * pcp->stat_threshold

> I believe Peter gave all this quite a lot of thought when he was
> making the rc6-mm1 changes, and I'd rather defer to him for a
> suggestion of what best to do in earlier releases. Or maybe he'll
> just point out how this couldn't have been a problem before.

As outlined above, and I don't think we'll ever have such a low
dirty_limit. But who knows :-)

> Or there is Richard's patch, which I haven't considered, but
> Andrew was not quite satisfied with it - partly because he'd like
> to understand how the situation could come about first; perhaps
> we have now got an explanation.

I'm with Andrew on this, that is, quite puzzled on how all this arises.
Testing those writeback-fix-* patches might help rule out (or point to)
a mis-function of pdflush.

The theory that one task will spin in balance_dirty_pages() on a bdi
that does not actually have many dirty pages doesn't sound plausible,
because eventually the total dirty count (well, actually
dirty+unstable+writeback) should subside again. This theory can cause
crappy latencies, but should not 'hang' the machine.

> (The original bug report was indeed on SMP, but I haven't seen
> anyone say that's a necessary condition for the hang: it would
> be if this is the issue. And Richard writes at one point of the
> system only responding to AltSysRq: that would be surprising for
> this issue, though it's possible that a task in balance_dirty_pages
> is holding an i_mutex that everybody else comes to need.)

Are we actually holding i_mutex on paths that lead into
balance_dirty_pages()? That does (from my admittedly limited knowledge
of the vfs) sound like trouble, since we'd need it to complete
writeback.

All quite puzzling.
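The bound Peter describes is easy to sanity-check numerically. Below is
a minimal userspace C sketch (not kernel code) of the arithmetic: each
CPU buffers counter updates in a local delta and only folds it into the
global counter once the delta reaches stat_threshold, so the global
approximation can be off by up to nr_cpus * stat_threshold. All the
numbers in it are made up for illustration.

#include <stdio.h>

int main(void)
{
        long nr_cpus = 4;               /* hypothetical SMP box */
        long stat_threshold = 32;       /* hypothetical per-CPU delta limit */
        long max_error = nr_cpus * stat_threshold;

        /*
         * Pathological case: true nr_writeback is 0, but every CPU still
         * holds an unfolded delta, so the global counter can read up to
         * max_error.  If dirty_thresh is below that bound,
         * balance_dirty_pages() could keep seeing "over the limit".
         */
        long dirty_thresh = 100;        /* hypothetical, unrealistically low */

        if (dirty_thresh < max_error)
                printf("global counter alone cannot prove we are under "
                       "the limit: error bound %ld >= dirty_thresh %ld\n",
                       max_error, dirty_thresh);
        return 0;
}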
Re: Processes spinning forever, apparently in lock_timer_base()?
On 09/21/2007 11:58 AM, Hugh Dickins wrote:

> Looking at the 2.6.18-2.6.23 code, I'm uncertain what to try instead.
> There is a refresh_vm_stats function which we could call (then retest
> the break condition) just before resorting to congestion_wait. But
> the big NUMA people might get very upset with me calling that too
> often: causing a thundering herd of bouncing cachelines which that
> was all designed to avoid. And it's not obvious to me what condition
> to test for dirty_thresh "too low".
>
> I believe Peter gave all this quite a lot of thought when he was
> making the rc6-mm1 changes, and I'd rather defer to him for a
> suggestion of what best to do in earlier releases. Or maybe he'll
> just point out how this couldn't have been a problem before.
>
> Or there is Richard's patch, which I haven't considered, but
> Andrew was not quite satisfied with it - partly because he'd like
> to understand how the situation could come about first; perhaps
> we have now got an explanation.

My patch is just Richard's with the added requirement that we loop some
minimum number of times before exiting balance_dirty_pages() with
unfinished work.
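For illustration, the combination Chuck describes might look like this
inside the balance_dirty_pages() loop (an uncompiled sketch against the
2.6.22-era code, not the actual posted patch; MIN_POLLS is a
hypothetical name for the minimum iteration count):

        for (i = 0; ; i++) {
                /* ... writeback attempt as in mainline ... */
                pages_written += write_chunk - wbc.nr_to_write;
                if (pages_written >= write_chunk)
                        break;          /* We've done our duty */
                /*
                 * Richard's bail-out -- no congestion and nothing found
                 * to write -- but only after a minimum number of polls,
                 * so one transient empty scan cannot end the throttling.
                 */
                if (i >= MIN_POLLS &&
                    !wbc.encountered_congestion && wbc.nr_to_write > 0)
                        break;
                congestion_wait(WRITE, HZ/10);
        }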
Re: Processes spinning forever, apparently in lock_timer_base()?
On Fri, 21 Sep 2007, Andy Whitcroft wrote:

> This sounds an awful lot like the same problem I reported with fsck
> hanging. I believe that Hugh had a candidate patch for that, which was
> related to dirty tracking limits. It seems that that patch was tested,
> and acked by Peter. All on lkml under:
>
>   2.6.23-rc6-mm1 -- mkfs stuck in 'D'

You may well be right. My initial reaction was to point out that my
patch was to a 2.6.23-rc6-mm1 implementation detail, whereas this
thread is about a problem seen since 2.6.20 or perhaps earlier.

But once I look harder at it, I wonder what would have kept
2.6.18 to 2.6.23 safe from the same issue: per-cpu deltas from
the global vm stats too low to get synched back to global, yet
adding up to something which misleads balance_dirty_pages into
an indefinite loop e.g. total nr_writeback actually 0, but
appearing more than dirty_thresh in the global approximation.

Certainly my rc6-mm1 patch won't work here, and all it was doing was
apply some safety already added by Peter to one further case.

Looking at the 2.6.18-2.6.23 code, I'm uncertain what to try instead.
There is a refresh_vm_stats function which we could call (then retest
the break condition) just before resorting to congestion_wait. But
the big NUMA people might get very upset with me calling that too
often: causing a thundering herd of bouncing cachelines which that
was all designed to avoid. And it's not obvious to me what condition
to test for dirty_thresh "too low".

I believe Peter gave all this quite a lot of thought when he was
making the rc6-mm1 changes, and I'd rather defer to him for a
suggestion of what best to do in earlier releases. Or maybe he'll
just point out how this couldn't have been a problem before.

Or there is Richard's patch, which I haven't considered, but
Andrew was not quite satisfied with it - partly because he'd like
to understand how the situation could come about first; perhaps
we have now got an explanation.

(The original bug report was indeed on SMP, but I haven't seen
anyone say that's a necessary condition for the hang: it would
be if this is the issue. And Richard writes at one point of the
system only responding to AltSysRq: that would be surprising for
this issue, though it's possible that a task in balance_dirty_pages
is holding an i_mutex that everybody else comes to need.)

Hugh
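Concretely, the fallback Hugh sketches would sit just before the
congestion_wait() in the 2.6.18-2.6.23 balance_dirty_pages() loop,
along these lines (uncompiled sketch, not a tested patch; it assumes
refresh_vm_stats() and the global_page_state() counters as present in
those kernels, and pays the cacheline-bouncing cost only on the path
where we are about to sleep anyway):

                /*
                 * Sketch of Hugh's suggestion: before sleeping, fold the
                 * per-CPU deltas back into the global counters and retest
                 * the break condition, so a stale global approximation
                 * cannot pin us in this loop forever.
                 */
                refresh_vm_stats();
                if (global_page_state(NR_FILE_DIRTY) +
                    global_page_state(NR_UNSTABLE_NFS) +
                    global_page_state(NR_WRITEBACK) <= dirty_thresh)
                        break;
                congestion_wait(WRITE, HZ/10);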
Re: Processes spinning forever, apparently in lock_timer_base()?
On 09/21/2007 05:39 AM, Andy Whitcroft wrote:

> This sounds an awful lot like the same problem I reported with fsck
> hanging. I believe that Hugh had a candidate patch for that, which was
> related to dirty tracking limits. It seems that that patch was tested,
> and acked by Peter. All on lkml under:
>
>   2.6.23-rc6-mm1 -- mkfs stuck in 'D'

Unfortunately that whole thread is about fixing bugs in an already
much-changed balance_dirty_pages() in -mm.
Re: Processes spinning forever, apparently in lock_timer_base()?
On Fri, 2007-09-21 at 03:33 -0700, Andrew Morton wrote:
> On Fri, 21 Sep 2007 11:25:41 +0100 richard kennedy <[EMAIL PROTECTED]> wrote:
>
> > > That's all a bit crappy if the wrong races happen and some other task is
> > > somehow exceeding the dirty limits each time this task polls them.  Seems
> > > unlikely that such a condition would persist forever.
> > >
> > > So the question is, why do we have large amounts of dirty pages for one
> > > disk which appear to be sitting there not getting written?
> >
> > The lockup I'm seeing intermittently occurs when I have 2+ tasks copying
> > large files (1GB+) on sda & a small read-mainly mysql db app running on
> > sdb. The lockup seems to happen just after the copies finish -- there
> > are lots of dirty pages but nothing left to write them until kupdate
> > gets round to it.
>
> Then what happens?  The system recovers?

Nothing -- it stays stuck forever. I don't think kupdate is getting
started; I added some debug in there but haven't found out anything
useful yet. But I am trying to build a better test case, as the one
I've got at the moment can take hours to trigger this problem.
Re: Processes spinning forever, apparently in lock_timer_base()?
On Fri, 21 Sep 2007 11:25:41 +0100 richard kennedy <[EMAIL PROTECTED]> wrote:

> > That's all a bit crappy if the wrong races happen and some other task is
> > somehow exceeding the dirty limits each time this task polls them.  Seems
> > unlikely that such a condition would persist forever.
> >
> > So the question is, why do we have large amounts of dirty pages for one
> > disk which appear to be sitting there not getting written?
>
> The lockup I'm seeing intermittently occurs when I have 2+ tasks copying
> large files (1GB+) on sda & a small read-mainly mysql db app running on
> sdb. The lockup seems to happen just after the copies finish -- there
> are lots of dirty pages but nothing left to write them until kupdate
> gets round to it.

Then what happens?  The system recovers?
Re: Processes spinning forever, apparently in lock_timer_base()?
On Thu, 2007-09-20 at 15:36 -0700, Andrew Morton wrote:
> On Thu, 20 Sep 2007 18:04:38 -0400
> Chuck Ebbert <[EMAIL PROTECTED]> wrote:
>
> > >> Can we get some kind of band-aid, like making the endless 'for' loop in
> > >> balance_dirty_pages() terminate after some number of iterations?  Clearly
> > >> if we haven't written "write_chunk" pages after a few tries, *and* we
> > >> haven't encountered congestion, there's no point in trying forever...
> > >
> > > Did my above questions get looked at?
> > >
> > > Is anyone able to reproduce this?
> > >
> > > Do we have a clue what's happening?
> >
> > There are a ton of dirty pages for one disk, and zero or close to zero dirty
> > for a different one. Kernel spins forever trying to write some arbitrary
> > minimum amount of data ("write_chunk" pages) to the second disk...
>
> That should be OK.  The caller will sit in that loop, sleeping in
> congestion_wait(), polling the correct backing-dev occasionally and waiting
> until the dirty limits subside to an acceptable limit, at which stage this:
>
> 	if (nr_reclaimable +
> 		global_page_state(NR_WRITEBACK)
> 			<= dirty_thresh)
> 				break;
>
> will happen and we leave balance_dirty_pages().
>
> That's all a bit crappy if the wrong races happen and some other task is
> somehow exceeding the dirty limits each time this task polls them.  Seems
> unlikely that such a condition would persist forever.
>
> So the question is, why do we have large amounts of dirty pages for one
> disk which appear to be sitting there not getting written?

The lockup I'm seeing intermittently occurs when I have 2+ tasks copying
large files (1GB+) on sda & a small read-mainly mysql db app running on
sdb. The lockup seems to happen just after the copies finish -- there
are lots of dirty pages but nothing left to write them until kupdate
gets round to it.

BTW, kupdate can loop for long periods of time when a disk is under this
kind of load -- I regularly see it take over 20 seconds, and often it's
unable to start as there are no pdflush threads available.

> Do we know if there's any writeout at all happening when the system is in
> this state?

No, there doesn't seem to be any activity at all -- my machine is
completely unresponsive; only sysrq works.

> I guess it's possible that the dirty inodes on the "other" disk got
> themselves onto the wrong per-sb inode list, or are on the correct list,
> but in the wrong place.  If so, these:
>
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
> writeback-fix-comment-use-helper-function.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
> writeback-fix-periodic-superblock-dirty-inode-flushing.patch
>
> from 2.6.23-rc6-mm1 should help.
>
> Did anyone try running /bin/sync when the system is in this state?

I'm not able to run anything in this state, but sysrq-s doesn't make any
difference.

Richard
Re: Processes spinning forever, apparently in lock_timer_base()?
This sounds an awful lot like the same problem I reported with fsck
hanging. I believe that Hugh had a candidate patch for that, which was
related to dirty tracking limits. It seems that that patch was tested,
and acked by Peter. All on lkml under:

  2.6.23-rc6-mm1 -- mkfs stuck in 'D'

-apw
Re: Processes spinning forever, apparently in lock_timer_base()?
On Fri, 21 Sep 2007 10:08:08 +0200 Matthias Hensler <[EMAIL PROTECTED]> wrote:

> On Thu, Sep 20, 2007 at 03:36:54PM -0700, Andrew Morton wrote:
> > That's all a bit crappy if the wrong races happen and some other task
> > is somehow exceeding the dirty limits each time this task polls them.
> > Seems unlikely that such a condition would persist forever.
>
> How exactly do you define forever? It looks to me that this condition
> never resolves on its own, at least not in a window of several hours
> (the system used to get stuck around 3am and was normally rebooted
> between 7am and 8am, so at least hours are not enough to have the
> problem resolve on its own).

That's forever.

> > So the question is, why do we have large amounts of dirty pages for
> > one disk which appear to be sitting there not getting written?
>
> Unfortunately I have no idea. The full stacktrace for all processes was
> attached to the original bugreport, maybe that can give a clue to that.
>
> > Do we know if there's any writeout at all happening when the system is
> > in this state?
>
> Not sure about that. The system is responsive on a still-open SSH
> session, allowing several tasks still to be executed. However, that SSH
> session gets stuck very fast if the wrong commands are executed. New
> logins are not possible (the connection is accepted but results in a
> timeout 60 seconds later, so most likely /bin/login is not able to log
> into wtmp).
>
> From all that I suspect that there is no more write activity on that
> system, but cannot say for sure.

Easiest would be to run `vmstat 1' in a separate ssh session then just
leave it running.

> > Did anyone try running /bin/sync when the system is in this state?
>
> I did not, no.

Would be interesting if poss, please.
Re: Processes spinning forever, apparently in lock_timer_base()?
On Thu, Sep 20, 2007 at 03:36:54PM -0700, Andrew Morton wrote:
> That's all a bit crappy if the wrong races happen and some other task
> is somehow exceeding the dirty limits each time this task polls them.
> Seems unlikely that such a condition would persist forever.

How exactly do you define forever? It looks to me that this condition
never resolves on its own, at least not in a window of several hours
(the system used to get stuck around 3am and was normally rebooted
between 7am and 8am, so at least hours are not enough to have the
problem resolve on its own).

> So the question is, why do we have large amounts of dirty pages for
> one disk which appear to be sitting there not getting written?

Unfortunately I have no idea. The full stacktrace for all processes was
attached to the original bugreport, maybe that can give a clue to that.

> Do we know if there's any writeout at all happening when the system is
> in this state?

Not sure about that. The system is responsive on a still-open SSH
session, allowing several tasks still to be executed. However, that SSH
session gets stuck very fast if the wrong commands are executed. New
logins are not possible (the connection is accepted but results in a
timeout 60 seconds later, so most likely /bin/login is not able to log
into wtmp).

From all that I suspect that there is no more write activity on that
system, but cannot say for sure.

> Did anyone try running /bin/sync when the system is in this state?

I did not, no.

Regards,
Matthias
Re: Processes spinning forever, apparently in lock_timer_base()?
On 09/20/2007 06:36 PM, Andrew Morton wrote:
>
> So the question is, why do we have large amounts of dirty pages for one
> disk which appear to be sitting there not getting written?
>
> Do we know if there's any writeout at all happening when the system is in
> this state?
>
> I guess it's possible that the dirty inodes on the "other" disk got
> themselves onto the wrong per-sb inode list, or are on the correct list,
> but in the wrong place.  If so, these:
>
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
> writeback-fix-comment-use-helper-function.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
> writeback-fix-periodic-superblock-dirty-inode-flushing.patch
>
> from 2.6.23-rc6-mm1 should help.

Yikes! Simple fixes would be better. The patch that is confirmed to fix
the problem for this user is below, but that one could cause other
problems. I was looking for some band-aid that could be shown to be
harmless...

http://lkml.org/lkml/2007/8/2/89:
--
--- linux-2.6.22.1/mm/page-writeback.c.orig	2007-07-30 16:36:09.000000000 +0100
+++ linux-2.6.22.1/mm/page-writeback.c	2007-07-31 16:26:43.000000000 +0100
@@ -250,6 +250,8 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
+			if (!wbc.encountered_congestion && wbc.nr_to_write > 0)
+				break;		/* didn't find enough to do */
 		}
 		congestion_wait(WRITE, HZ/10);
 	}

> Did anyone try running /bin/sync when the system is in this state?

Reporter is in the CC: list.
Re: Processes spinning forever, apparently in lock_timer_base()?
On Thu, 20 Sep 2007 18:04:38 -0400
Chuck Ebbert <[EMAIL PROTECTED]> wrote:

> >> Can we get some kind of band-aid, like making the endless 'for' loop in
> >> balance_dirty_pages() terminate after some number of iterations?  Clearly
> >> if we haven't written "write_chunk" pages after a few tries, *and* we
> >> haven't encountered congestion, there's no point in trying forever...
> >
> > Did my above questions get looked at?
> >
> > Is anyone able to reproduce this?
> >
> > Do we have a clue what's happening?
>
> There are a ton of dirty pages for one disk, and zero or close to zero dirty
> for a different one. Kernel spins forever trying to write some arbitrary
> minimum amount of data ("write_chunk" pages) to the second disk...

That should be OK.  The caller will sit in that loop, sleeping in
congestion_wait(), polling the correct backing-dev occasionally and waiting
until the dirty limits subside to an acceptable limit, at which stage this:

	if (nr_reclaimable +
		global_page_state(NR_WRITEBACK)
			<= dirty_thresh)
				break;

will happen and we leave balance_dirty_pages().

That's all a bit crappy if the wrong races happen and some other task is
somehow exceeding the dirty limits each time this task polls them.  Seems
unlikely that such a condition would persist forever.

So the question is, why do we have large amounts of dirty pages for one
disk which appear to be sitting there not getting written?

Do we know if there's any writeout at all happening when the system is in
this state?

I guess it's possible that the dirty inodes on the "other" disk got
themselves onto the wrong per-sb inode list, or are on the correct list,
but in the wrong place.  If so, these:

writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
writeback-fix-comment-use-helper-function.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
writeback-fix-periodic-superblock-dirty-inode-flushing.patch

from 2.6.23-rc6-mm1 should help.

Did anyone try running /bin/sync when the system is in this state?
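For readers following along, the loop Andrew is describing looks
roughly like this in the 2.6.22-era mm/page-writeback.c (a condensed
paraphrase for orientation, not a verbatim copy of the source):

        for (;;) {
                get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
                nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
                                 global_page_state(NR_UNSTABLE_NFS);
                /* the exit condition quoted above */
                if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
                                                        dirty_thresh)
                        break;
                if (nr_reclaimable) {
                        /* try to push out ~write_chunk pages */
                        writeback_inodes(&wbc);
                        pages_written += write_chunk - wbc.nr_to_write;
                        if (pages_written >= write_chunk)
                                break;          /* We've done our duty */
                }
                /* otherwise sleep ~100ms and poll the limits again */
                congestion_wait(WRITE, HZ/10);
        }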
Re: Processes spinning forever, apparently in lock_timer_base()?
On 09/20/2007 05:29 PM, Andrew Morton wrote:
> On Thu, 20 Sep 2007 17:07:15 -0400
> Chuck Ebbert <[EMAIL PROTECTED]> wrote:
>
>> On 08/09/2007 12:55 PM, Andrew Morton wrote:
>>> On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <[EMAIL PROTECTED]> wrote:
>>>
>>>> On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
>>>>> On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
>>>>> [...]
>>>>> I am also willing to try the patch posted by Richard.
>>>>
>>>> I want to give some update here:
>>>>
>>>> 1. We finally hit the problem on a third system, with a total different
>>>>    setup and hardware. However, again high I/O load caused the problem
>>>>    and the affected filesystems were mounted with noatime.
>>>>
>>>> 2. I installed a recompiled kernel with just the two line patch from
>>>>    Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
>>>>    days uptime now and counting. I believe the patch fixed the problem.
>>>>    However, I will continue running "vmstat 1" and the endless loop of
>>>>    "cat /proc/meminfo", just in case I am wrong.
>>>
>>> Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?
>>>
>>> If Richard's patch has indeed fixed it then this confirms that we're seeing
>>> contention over the dirty-memory limits.  Richard's patch isn't really the
>>> right one because it allows unlimited dirty-memory windup in some situations
>>> (large number of disks with small writes, or when we perform queue
>>> congestion avoidance).
>>>
>>> As you're seeing this happening when multiple disks are being written to it
>>> is possible that the per-device-dirty-threshold patches which recently went
>>> into -mm (and which appear to have a bug) will fix it.
>>>
>>> But I worry that the stall appears to persist *forever*.  That would
>>> indicate that we have a dirty-memory accounting leak, or that for some
>>> reason the system has decided to stop doing writeback to one or more queues
>>> (might be caused by an error in a lower-level driver's queue congestion
>>> state management).
>>>
>>> If it is the latter, then it could be that running "sync" will clear the
>>> problem.  Temporarily, at least.  Because sync will ignore the queue
>>> congestion state.
>>
>> This is still a problem for people, and no fix is in sight until 2.6.24.
>
> Any bugzilla urls or anything like that?

https://bugzilla.redhat.com/show_bug.cgi?id=249563

>> Can we get some kind of band-aid, like making the endless 'for' loop in
>> balance_dirty_pages() terminate after some number of iterations?  Clearly
>> if we haven't written "write_chunk" pages after a few tries, *and* we
>> haven't encountered congestion, there's no point in trying forever...
>
> Did my above questions get looked at?
>
> Is anyone able to reproduce this?
>
> Do we have a clue what's happening?

There are a ton of dirty pages for one disk, and zero or close to zero
dirty for a different one. Kernel spins forever trying to write some
arbitrary minimum amount of data ("write_chunk" pages) to the second
disk...

> Is that function just spinning around, failing to start writeout against
> any pages at all?  If so, how come?

Yes, it spins forever. Just removing the "noatime" mount option for the
second disk generates enough dirty data to keep the system functional.
Re: Processes spinning forever, apparently in lock_timer_base()?
On Thu, 20 Sep 2007 17:07:15 -0400
Chuck Ebbert <[EMAIL PROTECTED]> wrote:

> On 08/09/2007 12:55 PM, Andrew Morton wrote:
> > On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <[EMAIL PROTECTED]> wrote:
> >
> >> On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
> >>> On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> >>> [...]
> >>> I am also willing to try the patch posted by Richard.
> >>
> >> I want to give some update here:
> >>
> >> 1. We finally hit the problem on a third system, with a total different
> >>    setup and hardware. However, again high I/O load caused the problem
> >>    and the affected filesystems were mounted with noatime.
> >>
> >> 2. I installed a recompiled kernel with just the two line patch from
> >>    Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
> >>    days uptime now and counting. I believe the patch fixed the problem.
> >>    However, I will continue running "vmstat 1" and the endless loop of
> >>    "cat /proc/meminfo", just in case I am wrong.
> >
> > Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?
> >
> > If Richard's patch has indeed fixed it then this confirms that we're seeing
> > contention over the dirty-memory limits.  Richard's patch isn't really the
> > right one because it allows unlimited dirty-memory windup in some situations
> > (large number of disks with small writes, or when we perform queue
> > congestion avoidance).
> >
> > As you're seeing this happening when multiple disks are being written to it
> > is possible that the per-device-dirty-threshold patches which recently went
> > into -mm (and which appear to have a bug) will fix it.
> >
> > But I worry that the stall appears to persist *forever*.  That would
> > indicate that we have a dirty-memory accounting leak, or that for some
> > reason the system has decided to stop doing writeback to one or more queues
> > (might be caused by an error in a lower-level driver's queue congestion
> > state management).
> >
> > If it is the latter, then it could be that running "sync" will clear the
> > problem.  Temporarily, at least.  Because sync will ignore the queue
> > congestion state.
>
> This is still a problem for people, and no fix is in sight until 2.6.24.

Any bugzilla urls or anything like that?

> Can we get some kind of band-aid, like making the endless 'for' loop in
> balance_dirty_pages() terminate after some number of iterations?  Clearly
> if we haven't written "write_chunk" pages after a few tries, *and* we
> haven't encountered congestion, there's no point in trying forever...

Did my above questions get looked at?

Is anyone able to reproduce this?

Do we have a clue what's happening?

Is that function just spinning around, failing to start writeout against
any pages at all?  If so, how come?
Re: Processes spinning forever, apparently in lock_timer_base()?
On 08/09/2007 12:55 PM, Andrew Morton wrote:
> On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <[EMAIL PROTECTED]> wrote:
>
>> On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
>>> On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
>>> [...]
>>> I am also willing to try the patch posted by Richard.
>>
>> I want to give some update here:
>>
>> 1. We finally hit the problem on a third system, with a total different
>>    setup and hardware. However, again high I/O load caused the problem
>>    and the affected filesystems were mounted with noatime.
>>
>> 2. I installed a recompiled kernel with just the two line patch from
>>    Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
>>    days uptime now and counting. I believe the patch fixed the problem.
>>    However, I will continue running "vmstat 1" and the endless loop of
>>    "cat /proc/meminfo", just in case I am wrong.
>
> Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?
>
> If Richard's patch has indeed fixed it then this confirms that we're seeing
> contention over the dirty-memory limits.  Richard's patch isn't really the
> right one because it allows unlimited dirty-memory windup in some situations
> (large number of disks with small writes, or when we perform queue congestion
> avoidance).
>
> As you're seeing this happening when multiple disks are being written to it is
> possible that the per-device-dirty-threshold patches which recently went into
> -mm (and which appear to have a bug) will fix it.
>
> But I worry that the stall appears to persist *forever*.  That would indicate
> that we have a dirty-memory accounting leak, or that for some reason the
> system has decided to stop doing writeback to one or more queues (might be
> caused by an error in a lower-level driver's queue congestion state
> management).
>
> If it is the latter, then it could be that running "sync" will clear the
> problem.  Temporarily, at least.  Because sync will ignore the queue
> congestion state.

This is still a problem for people, and no fix is in sight until 2.6.24.

Can we get some kind of band-aid, like making the endless 'for' loop in
balance_dirty_pages() terminate after some number of iterations? Clearly
if we haven't written "write_chunk" pages after a few tries, *and* we
haven't encountered congestion, there's no point in trying forever...

[not even compile tested patch follows]

---
 mm/page-writeback.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- linux-2.6.22.noarch.orig/mm/page-writeback.c
+++ linux-2.6.22.noarch/mm/page-writeback.c
@@ -208,11 +208,12 @@ static void balance_dirty_pages(struct a
 	long background_thresh;
 	long dirty_thresh;
 	unsigned long pages_written = 0;
+	int i;
 	unsigned long write_chunk = sync_writeback_pages();
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
-	for (;;) {
+	for (i = 0; ; i++) {
 		struct writeback_control wbc = {
 			.bdi		= bdi,
 			.sync_mode	= WB_SYNC_NONE,
@@ -250,6 +251,8 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
+			if (i >= write_chunk && !wbc.encountered_congestion)
+				break;		/* nothing to write? */
 		}
 		congestion_wait(WRITE, HZ/10);
 	}
Re: Processes spinning forever, apparently in lock_timer_base()?
On 09/20/2007 05:29 PM, Andrew Morton wrote:
> On Thu, 20 Sep 2007 17:07:15 -0400 Chuck Ebbert <[EMAIL PROTECTED]> wrote:
> [...]
>> This is still a problem for people, and no fix is in sight until 2.6.24.
>
> Any bugzilla urls or anything like that?

https://bugzilla.redhat.com/show_bug.cgi?id=249563

>> Can we get some kind of band-aid, like making the endless 'for' loop in
>> balance_dirty_pages() terminate after some number of iterations? Clearly
>> if we haven't written "write_chunk" pages after a few tries, *and* we
>> haven't encountered congestion, there's no point in trying forever...
>
> Did my above questions get looked at? Is anyone able to reproduce this?
> Do we have a clue what's happening?

There are a ton of dirty pages for one disk, and zero or close to zero
dirty for a different one. Kernel spins forever trying to write some
arbitrary minimum amount of data (write_chunk pages) to the second disk...

> Is that function just spinning around, failing to start writeout against
> any pages at all? If so, how come?

Yes, it spins forever. Just removing the noatime mount option for the
second disk generates enough dirty data to keep the system functional.
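The situation Chuck describes can be modelled in a few lines of plain C,
purely for illustration and with invented numbers: the throttled task is
judged against the global dirty count, while writeback against its own
device can never make the required progress.

#include <stdio.h>

int main(void)
{
	const long dirty_thresh = 1000;	/* global limit (made-up) */
	long dirty_sda = 1500;		/* backlog from the heavy writer */
	long dirty_sdb = 0;		/* our device: nothing to write back */

	/* A task writing to sdb polls the *global* count; writeback
	 * against sdb can never write write_chunk pages because the
	 * dirty pages all belong to sda. */
	for (int iter = 0; iter < 3; iter++) {
		long global_dirty = dirty_sda + dirty_sdb;

		if (global_dirty <= dirty_thresh) {
			puts("below threshold, writer may proceed");
			return 0;
		}
		printf("iter %d: global=%ld > thresh=%ld, but sdb has only "
		       "%ld dirty pages\n",
		       iter, global_dirty, dirty_thresh, dirty_sdb);
	}
	puts("writer to sdb is still throttled by sda's backlog");
	return 0;
}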
Re: Processes spinning forever, apparently in lock_timer_base()?
On Thu, 20 Sep 2007 18:04:38 -0400 Chuck Ebbert <[EMAIL PROTECTED]> wrote:

>>> Can we get some kind of band-aid, like making the endless 'for' loop in
>>> balance_dirty_pages() terminate after some number of iterations? Clearly
>>> if we haven't written "write_chunk" pages after a few tries, *and* we
>>> haven't encountered congestion, there's no point in trying forever...
>>
>> Did my above questions get looked at? Is anyone able to reproduce this?
>> Do we have a clue what's happening?
>
> There are a ton of dirty pages for one disk, and zero or close to zero
> dirty for a different one. Kernel spins forever trying to write some
> arbitrary minimum amount of data (write_chunk pages) to the second disk...

That should be OK. The caller will sit in that loop, sleeping in
congestion_wait(), polling the correct backing-dev occasionally and
waiting until the dirty limits subside to an acceptable limit, at which
stage this:

		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
			dirty_thresh)
				break;

will happen and we leave balance_dirty_pages().

That's all a bit crappy if the wrong races happen and some other task is
somehow exceeding the dirty limits each time this task polls them. Seems
unlikely that such a condition would persist forever.

So the question is, why do we have large amounts of dirty pages for one
disk which appear to be sitting there not getting written?

Do we know if there's any writeout at all happening when the system is in
this state?

I guess it's possible that the dirty inodes on the other disk got
themselves onto the wrong per-sb inode list, or are on the correct list,
but in the wrong place. If so, these:

writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
writeback-fix-comment-use-helper-function.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
writeback-fix-periodic-superblock-dirty-inode-flushing.patch

from 2.6.23-rc6-mm1 should help.

Did anyone try running /bin/sync when the system is in this state?
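The happy path Andrew describes can be sketched in userspace under
obviously simplified assumptions: a background thread stands in for
writeback completing elsewhere, and the throttled task's polling loop
exits once the count subsides. Illustration only, not kernel code:

#include <pthread.h>
#include <stdio.h>
#include <stdatomic.h>
#include <unistd.h>

/* stand-in for nr_reclaimable + global_page_state(NR_WRITEBACK) */
static atomic_long dirty = 2000;
static const long dirty_thresh = 1000;

/* stand-in for pdflush/kupdate completing writeback elsewhere */
static void *flusher(void *arg)
{
	(void)arg;
	while (atomic_load(&dirty) > 0) {
		atomic_fetch_sub(&dirty, 100);
		usleep(50 * 1000);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, flusher, NULL);
	for (;;) {
		if (atomic_load(&dirty) <= dirty_thresh) {
			puts("limits subsided, leaving balance_dirty_pages()");
			break;
		}
		usleep(100 * 1000);	/* congestion_wait(WRITE, HZ/10) */
	}
	pthread_join(t, NULL);
	return 0;
}

The reported hang is the case where nothing ever plays the flusher role
for the pages in question, so the poll never sees the count subside.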
Re: Processes spinning forever, apparently in lock_timer_base()?
On 09/20/2007 06:36 PM, Andrew Morton wrote:
> So the question is, why do we have large amounts of dirty pages for one
> disk which appear to be sitting there not getting written?
>
> Do we know if there's any writeout at all happening when the system is in
> this state?
>
> I guess it's possible that the dirty inodes on the other disk got
> themselves onto the wrong per-sb inode list, or are on the correct list,
> but in the wrong place. If so, these:
>
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
> writeback-fix-comment-use-helper-function.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
> writeback-fix-periodic-superblock-dirty-inode-flushing.patch
>
> from 2.6.23-rc6-mm1 should help.

Yikes! Simple fixes would be better. The patch that is confirmed to fix
the problem for this user is below, but that one could cause other
problems. I was looking for some band-aid that could be shown to be
harmless...

http://lkml.org/lkml/2007/8/2/89:
--
--- linux-2.6.22.1/mm/page-writeback.c.orig	2007-07-30 16:36:09.000000000 +0100
+++ linux-2.6.22.1/mm/page-writeback.c	2007-07-31 16:26:43.000000000 +0100
@@ -250,6 +250,8 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
+			if (!wbc.encountered_congestion && wbc.nr_to_write > 0)
+				break;		/* didn't find enough to do */
 		}
 		congestion_wait(WRITE, HZ/10);
 	}

> Did anyone try running /bin/sync when the system is in this state?

Reporter is in the CC: list.
Re: Processes spinning forever, apparently in lock_timer_base()?
On Thu, Aug 09, 2007 at 09:55:34AM -0700, Andrew Morton wrote:
> On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <[EMAIL PROTECTED]> wrote:
> > On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
> > > On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> > > [...]
> > > I am also willing to try the patch posted by Richard.
> >
> > I want to give some update here:
>
> Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?

So far not, sorry. The problem did not reoccur since I am running the
kernel with that patch now.

> As you're seeing this happening when multiple disks are being written
> to it is possible that the per-device-dirty-threshold patches which
> recently went into -mm (and which appear to have a bug) will fix it.

All affected systems have two devices which are cross mounted, e.g. one
has /home on disk 1 and /var/spool/imap on disk 2. On the respective
other device there is the mirror partition (mirror of /home on disk 2 and
mirror of /var/spool/imap on disk 1).

Normally the system hangs when running rsync from /var/spool/imap to its
corresponding mirror partition. We have around 10 different partitions
(all in LVM), but so far the hang mostly started when running rsync on
the imap spool (which has by far the most files on the system).

_However_: we also had this hang while running a backup to an external
FTP server (so only reading the filesystems, apart from the usual system
activity). And: the third system we had this problem on has no IMAP spool
at all. The hang occurred while running a backup together with "yum
update".

It might be related that all the systems use these crossmounts over two
different disks. I wasn't able to reproduce that on my home system with
only one disk, either because the activity pattern was different, or it
really needs two disks to run into that issue.

> But I worry that the stall appears to persist *forever*. That would
> indicate that we have a dirty-memory accounting leak, or that for some
> reason the system has decided to stop doing writeback to one or more
> queues (might be caused by an error in a lower-level driver's queue
> congestion state management).

Well, all the time I was still connected to one of these machines it was
possible to clear the situation by killing a lot of processes. That is,
mostly killing all httpd, smtpd, imapd, amavisd, crond, spamassassin and
rsync processes. Eventually the system responded normally again (and I
could cleanly reboot it). However, the final killed process which
resolved the issue was non-deterministic so far.

The workload is different on all the servers: server 1 processes around
10-20k mails per day but also serves a lot of HTTP requests. Server 2 is
only a mailserver processing around 5-10k mails per day. Server 3 just
serves HTTP requests (and a bit of DNS).

> If it is the latter, then it could be that running "sync" will clear
> the problem. Temporarily, at least. Because sync will ignore the
> queue congestion state.

I have to admit that I never tried to sync in such a case, since mostly I
had only one open SSH session and tried to find the root cause.

So far the problem occurred first on server 2 around March or April
without changes to the machine (we have a changelog for every server:
there was a single kernel update at that time, however we reverted back
after the first issue and ran a kernel which was stable for several weeks
before, and encountered the problem again).

In the beginning we encountered the problem maybe twice a week, getting
worse within the next weeks. Several major updates (kernel and
distribution) were made without resolving the problem. Since end of April
server 1 began showing the same problem (running a different Fedora
version and kernel at that time), first slowly (once a week or so), then
more regularly. Around July we had a hang nearly every day. Server 3 has
had the problem only once now, but that server has no high workload.

We spent a lot of time investigating the issue, but since all servers use
different hardware, different setups (beside the crossmounts with
noatime) and even different base systems (Fedora Core 5+6, Fedora 7,
different kernels: 2.6.18, .19, .20, .21 and .22), I think that we can
rule out hardware problems. I think the issue might have been there for
some time now, but is hit more often now since the workload increased a
lot over the last months.

We ruled a lot out over the months (e.g. syslog was replaced, many not so
important services were stopped, the scheduler was changed), without
change. Just reverting from "noatime" in the fstab to "default" fixed it
reliably so far.

As said, I am still running "vmstat 1" and catting /proc/meminfo, just in
case. If there is anything I can do beside that to clarify the problem I
will try to help.

Regards,
Matthias
Re: Processes spinning forever, apparently in lock_timer_base()?
On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <[EMAIL PROTECTED]> wrote:

> On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
> > On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> > [...]
> > I am also willing to try the patch posted by Richard.
>
> I want to give some update here:
>
> 1. We finally hit the problem on a third system, with a total different
>    setup and hardware. However, again high I/O load caused the problem
>    and the affected filesystems were mounted with noatime.
>
> 2. I installed a recompiled kernel with just the two line patch from
>    Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
>    days uptime now and counting. I believe the patch fixed the problem.
>    However, I will continue running "vmstat 1" and the endless loop of
>    "cat /proc/meminfo", just in case I am wrong.

Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?

If Richard's patch has indeed fixed it then this confirms that we're
seeing contention over the dirty-memory limits. Richard's patch isn't
really the right one because it allows unlimited dirty-memory windup in
some situations (large number of disks with small writes, or when we
perform queue congestion avoidance).

As you're seeing this happening when multiple disks are being written to
it is possible that the per-device-dirty-threshold patches which recently
went into -mm (and which appear to have a bug) will fix it.

But I worry that the stall appears to persist *forever*. That would
indicate that we have a dirty-memory accounting leak, or that for some
reason the system has decided to stop doing writeback to one or more
queues (might be caused by an error in a lower-level driver's queue
congestion state management).

If it is the latter, then it could be that running "sync" will clear the
problem. Temporarily, at least. Because sync will ignore the queue
congestion state.
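The idea behind the per-device-dirty-threshold work can be sketched
roughly as follows. This is a conceptual illustration only, not the
actual -mm code, and the numbers are invented: the global limit is split
between devices in proportion to their recent writeout, so one device's
backlog no longer throttles writers to another.

#include <stdio.h>

int main(void)
{
	const long dirty_thresh = 1000;		/* global limit (made-up) */
	const char *bdi[] = { "sda", "sdb" };
	long recent_writeout[] = { 900, 100 };	/* invented proportions */
	long total = recent_writeout[0] + recent_writeout[1];

	for (int i = 0; i < 2; i++) {
		/* each device gets a share of the global threshold */
		long share = dirty_thresh * recent_writeout[i] / total;
		printf("%s: may dirty up to %ld of %ld pages\n",
		       bdi[i], share, dirty_thresh);
	}
	return 0;
}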
Re: Processes spinning forever, apparently in lock_timer_base()?
On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
> On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> [...]
> I am also willing to try the patch posted by Richard.

I want to give some update here:

1. We finally hit the problem on a third system, with a total different
   setup and hardware. However, again high I/O load caused the problem
   and the affected filesystems were mounted with noatime.

2. I installed a recompiled kernel with just the two line patch from
   Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
   days uptime now and counting. I believe the patch fixed the problem.
   However, I will continue running "vmstat 1" and the endless loop of
   "cat /proc/meminfo", just in case I am wrong.

Regards,
Matthias
Re: Processes spinning forever, apparently in lock_timer_base()?
On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> (attempting to cc Matthias. If I have the wrong one, please fix it up)

You got the correct one.

> > Looks like the same problem with spinlock unfairness we've seen
> > elsewhere: it seems to be looping here? Or is everyone stuck
> > just waiting for writeout?
> [...]
> I think. Or perhaps lock_timer_base() really has gone and got stuck. One
> possibility is that gcc has decided to cache timer->base in a register
> rather than rereading it around that loop, which would be bad. Do:
>
> gdb vmlinux
> (gdb) x/100i lock_timer_base

This is from an affected kernel, but not the kernel matching my
stack-trace. Hope it is useful anyway. If not, maybe Chuck can provide
the vmlinux file for the 2.6.22.1-27.fc7 kernel (the debuginfo seems to
be deleted from all mirrors).

(gdb) x/100i lock_timer_base
0xc102ffda <lock_timer_base>:	push   %ebp
0xc102ffdb <lock_timer_base+1>:	mov    %edx,%ebp
0xc102ffdd <lock_timer_base+3>:	push   %edi
0xc102ffde <lock_timer_base+4>:	mov    %eax,%edi
0xc102ffe0 <lock_timer_base+6>:	push   %esi
0xc102ffe1 <lock_timer_base+7>:	push   %ebx
0xc102ffe2 <lock_timer_base+8>:	mov    0x14(%edi),%ebx
0xc102ffe5 <lock_timer_base+11>:	mov    %ebx,%esi
0xc102ffe7 <lock_timer_base+13>:	and    $0xfffffffe,%esi
0xc102ffea <lock_timer_base+16>:	je     0xc1030004 <lock_timer_base+42>
0xc102ffec <lock_timer_base+18>:	mov    %esi,%eax
0xc102ffee <lock_timer_base+20>:	call   0xc122ff35 <_spin_lock_irqsave>
0xc102fff3 <lock_timer_base+25>:	mov    %eax,0x0(%ebp)
0xc102fff6 <lock_timer_base+28>:	cmp    0x14(%edi),%ebx
0xc102fff9 <lock_timer_base+31>:	je     0xc1030008 <lock_timer_base+46>
0xc102fffb <lock_timer_base+33>:	mov    %eax,%edx
0xc102fffd <lock_timer_base+35>:	mov    %esi,%eax
0xc102ffff <lock_timer_base+37>:	call   0xc122ff95 <_spin_unlock_irqrestore>
0xc1030004 <lock_timer_base+42>:	pause
0xc1030006 <lock_timer_base+44>:	jmp    0xc102ffe2 <lock_timer_base+8>
0xc1030008 <lock_timer_base+46>:	mov    %esi,%eax
0xc103000a <lock_timer_base+48>:	pop    %ebx
0xc103000b <lock_timer_base+49>:	pop    %esi
0xc103000c <lock_timer_base+50>:	pop    %edi
0xc103000d <lock_timer_base+51>:	pop    %ebp
0xc103000e <lock_timer_base+52>:	ret
0xc103000f <try_to_del_timer_sync>:	push   %esi
0xc1030010 <try_to_del_timer_sync+1>:	or     $0xffffffff,%esi
0xc1030013 <try_to_del_timer_sync+4>:	push   %ebx
0xc1030014 <try_to_del_timer_sync+5>:	mov    %eax,%ebx
0xc1030016 <try_to_del_timer_sync+7>:	sub    $0x4,%esp
0xc1030019 <try_to_del_timer_sync+10>:	mov    %esp,%edx
0xc103001b <try_to_del_timer_sync+12>:	call   0xc102ffda <lock_timer_base>
0xc1030020 <try_to_del_timer_sync+17>:	cmp    %ebx,0x4(%eax)
0xc1030023 <try_to_del_timer_sync+20>:	mov    %eax,%ecx
0xc1030025 <try_to_del_timer_sync+22>:	je     0xc1030049 <try_to_del_timer_sync+58>
0xc1030027 <try_to_del_timer_sync+24>:	xor    %esi,%esi
0xc1030029 <try_to_del_timer_sync+26>:	cmpl   $0x0,(%ebx)
0xc103002c <try_to_del_timer_sync+29>:	je     0xc1030049 <try_to_del_timer_sync+58>
0xc103002e <try_to_del_timer_sync+31>:	mov    (%ebx),%edx
0xc1030030 <try_to_del_timer_sync+33>:	mov    $0x1,%si
0xc1030034 <try_to_del_timer_sync+37>:	mov    0x4(%ebx),%eax
0xc1030037 <try_to_del_timer_sync+40>:	mov    %eax,0x4(%edx)
0xc103003a <try_to_del_timer_sync+43>:	mov    %edx,(%eax)
0xc103003c <try_to_del_timer_sync+45>:	movl   $0x200200,0x4(%ebx)
0xc1030043 <try_to_del_timer_sync+52>:	movl   $0x0,(%ebx)
0xc1030049 <try_to_del_timer_sync+58>:	mov    (%esp),%edx
0xc103004c <try_to_del_timer_sync+61>:	mov    %ecx,%eax
0xc103004e <try_to_del_timer_sync+63>:	call   0xc122ff95 <_spin_unlock_irqrestore>
0xc1030053 <try_to_del_timer_sync+68>:	mov    %esi,%eax
0xc1030055 <try_to_del_timer_sync+70>:	pop    %ebx
---Type <return> to continue, or q <return> to quit---
0xc1030056 <try_to_del_timer_sync+71>:	pop    %ebx
0xc1030057 <try_to_del_timer_sync+72>:	pop    %esi
0xc1030058 <try_to_del_timer_sync+73>:	ret
0xc1030059 <del_timer_sync>:	push   %ebx
0xc103005a <del_timer_sync+1>:	mov    %eax,%ebx
0xc103005c <del_timer_sync+3>:	mov    %ebx,%eax
0xc103005e <del_timer_sync+5>:	call   0xc103000f <try_to_del_timer_sync>
0xc1030063 <del_timer_sync+10>:	test   %eax,%eax
0xc1030065 <del_timer_sync+12>:	jns    0xc103006b <del_timer_sync+18>
0xc1030067 <del_timer_sync+14>:	pause
0xc1030069 <del_timer_sync+16>:	jmp    0xc103005c <del_timer_sync+3>
0xc103006b <del_timer_sync+18>:	pop    %ebx
0xc103006c <del_timer_sync+19>:	ret
0xc103006d <__mod_timer>:	push   %ebp
0xc103006e <__mod_timer+1>:	mov    %edx,%ebp
0xc1030070 <__mod_timer+3>:	push   %edi
0xc1030071 <__mod_timer+4>:	push   %esi
0xc1030072 <__mod_timer+5>:	mov    %eax,%esi
0xc1030074 <__mod_timer+7>:	push   %ebx
0xc1030075 <__mod_timer+8>:	sub    $0x8,%esp
0xc1030078 <__mod_timer+11>:	mov    0x18(%esp),%edx
0xc103007c <__mod_timer+15>:	call   0xc102f693 <__timer_stats_timer_set_start_info>
0xc1030081 <__mod_timer+20>:	cmpl   $0x0,0xc(%esi)
0xc1030085 <__mod_timer+24>:	jne    0xc103008b <__mod_timer+30>
0xc1030087 <__mod_timer+26>:	ud2a
0xc1030089 <__mod_timer+28>:	jmp    0xc1030089 <__mod_timer+28>
0xc103008b <__mod_timer+30>:	lea    0x4(%esp),%edx
0xc103008f <__mod_timer+34>:	mov    %esi,%eax
0xc1030091 <__mod_timer+36>:	call   0xc102ffda <lock_timer_base>
0xc1030096 <__mod_timer+41>:	movl   $0x0,(%esp)
0xc103009d <__mod_timer+48>:	mov    %eax,%ebx
0xc103009f <__mod_timer+50>:	cmpl   $0x0,(%esi)
0xc10300a2 <__mod_timer+53>:	je     0xc10300bc <__mod_timer+79>
0xc10300a4 <__mod_timer+55>:	mov    0x4(%esi),%eax
0xc10300a7 <__mod_timer+58>:	mov    (%esi),%edx
0xc10300a9 <__mod_timer+60>:	mov    %eax,0x4(%edx)
0xc10300ac <__mod_timer+63>:	mov    %edx,(%eax)
0xc10300ae <__mod_timer+65>:	movl   $0x200200,0x4(%esi)
0xc10300b5 <__mod_timer+72>:	movl   $0x1,(%esp)
0xc10300bc <__mod_timer+79>:	mov    %fs:0xc13a6104,%edx
0xc10300c3 <__mod_timer+86>:	mov    $0xc13a75dc,%eax
0xc10300c8 <__mod_timer+91>:	mov    (%edx,%eax,1),%edi
0xc10300cb <__mod_timer+94>:	cmp    %edi,%ebx
0xc10300cd <__mod_timer+96>:	je     0xc10300f0 <__mod_timer+131>
0xc10300cf <__mod_timer+98>:	cmp    %esi,0x4(%ebx)
0xc10300d2 <__mod_timer+101>:	je     0xc10300f0 <__mod_timer+131>
0xc10300d4 <__mod_timer+103>:	andl   $0x1,0x14(%esi)
0xc10300d8 <__mod_timer+107>:	mov    $0x1,%al
0xc10300da <__mod_timer+109>:	xchg
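Andrew's worry, gcc caching timer->base in a register, is the classic
missing-barrier pattern; note that in the listing above the load of
0x14(%edi) is redone each iteration (the jmp at lock_timer_base+44 goes
back to +8), so at least this build does re-read the field. A userspace C
sketch of the concern, with made-up type and macro names (LOAD_BASE_ONCE
mimics the kernel's later ACCESS_ONCE()/READ_ONCE() idiom and is not a
kernel API):

#include <stdio.h>

struct tvec_base { int cpu; };
struct timer { struct tvec_base *base; };

/* Forces a fresh load of the field on every read; without a volatile
 * access like this, the compiler may hoist the load out of the loop
 * and spin on a stale NULL forever. */
#define LOAD_BASE_ONCE(t) (*(struct tvec_base * volatile *)&(t)->base)

static struct tvec_base *wait_for_base(struct timer *t)
{
	for (;;) {
		struct tvec_base *base = LOAD_BASE_ONCE(t);

		if (base != NULL)
			return base;
		/* cpu_relax() would go here in kernel code */
	}
}

int main(void)
{
	struct tvec_base b = { .cpu = 0 };
	struct timer t = { .base = &b };

	printf("got base of cpu %d\n", wait_for_base(&t)->cpu);
	return 0;
}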
Re: Processes spinning forever, apparently in lock_timer_base()?
Chuck Ebbert wrote:
>
> Looks like the same problem with spinlock unfairness we've seen
> elsewhere: it seems to be looping here? Or is everyone stuck
> just waiting for writeout?
>
> lock_timer_base():
>	for (;;) {
>		tvec_base_t *prelock_base = timer->base;
>		base = tbase_get_base(prelock_base);
>		if (likely(base != NULL)) {
>			spin_lock_irqsave(&base->lock, *flags);
>			if (likely(prelock_base == timer->base))
>				return base;
>			/* The timer has migrated to another CPU */
>			spin_unlock_irqrestore(&base->lock, *flags);
>		}
>		cpu_relax();
>	}

I don't think there is an unfairness problem. We are looping only if
timer->base changes in between. IOW, there is no "lock + unlock +
lock(same_lock)" here, we take another lock on each iteration.

And:

> [<c0407208>] do_IRQ+0xbd/0xd1
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
>
> ...
>
> [<c0407208>] do_IRQ+0xbd/0xd1
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
>
> ...
>
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f

All traces start from schedule_timeout()->mod_timer(). This timer is
local, nobody can see it, its ->base can't be NULL or changed. This means
that lock_timer_base() can't loop; it could only be failing to take the
tvec_t_base_s.lock. But in that case the trace should be different,
strange.

Oleg.
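The pattern Oleg is describing, lock whatever base the timer currently
points at, then re-check the pointer under the lock and retry if the timer
migrated, works outside the kernel too. A pthreads sketch with our own
type names, not the kernel's:

#include <pthread.h>
#include <stddef.h>

struct base { pthread_mutex_t lock; };
struct timer { struct base *base; };	/* changed only under base->lock */

static struct base *lock_timer_base(struct timer *t)
{
	for (;;) {
		struct base *b = __atomic_load_n(&t->base, __ATOMIC_ACQUIRE);

		if (b != NULL) {
			pthread_mutex_lock(&b->lock);
			if (b == t->base)
				return b;	/* caller must unlock */
			/* timer migrated to another base meanwhile */
			pthread_mutex_unlock(&b->lock);
		}
		/* cpu_relax() analogue: fall through and reload */
	}
}

int main(void)
{
	struct base b = { .lock = PTHREAD_MUTEX_INITIALIZER };
	struct timer t = { .base = &b };
	struct base *locked = lock_timer_base(&t);

	pthread_mutex_unlock(&locked->lock);
	return 0;
}

Note that each retry takes a different lock, which is Oleg's point about
why this loop cannot be a spinlock-fairness problem.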
Re: Processes spinning forever, apparently in lock_timer_base()?
(attempting to cc Matthias. If I have the wrong one, please fix it up)

(please generally cc reporters when forwarding their bug reports)

On Wed, 01 Aug 2007 18:39:51 -0400 Chuck Ebbert <[EMAIL PROTECTED]> wrote:

> Looks like the same problem with spinlock unfairness we've seen
> elsewhere: it seems to be looping here? Or is everyone stuck
> just waiting for writeout?
>
> lock_timer_base():
>	for (;;) {
>		tvec_base_t *prelock_base = timer->base;
>		base = tbase_get_base(prelock_base);
>		if (likely(base != NULL)) {
>			spin_lock_irqsave(&base->lock, *flags);
>			if (likely(prelock_base == timer->base))
>				return base;
>			/* The timer has migrated to another CPU */
>			spin_unlock_irqrestore(&base->lock, *flags);
>		}
>		cpu_relax();
>	}
>
> The problem goes away completely if filesystems are mounted
> *without* noatime. Has happened in 2.6.20 through 2.6.22...
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=249563
>
> Part of sysrq-t listing:
>
> mysqld        D 000017c0  2196 23162   1562
>        e383fcb8 00000082 61650954 000017c0 e383fc9c c0407208 e383f000 a12b0434
>        00004d1d c6ed2c00 c6ed2d9c c200fa80 c0724640 f6c60540 c4ff3c70 00000508
>        00000286 c042ffcb e383fcc8 00014926 00000286
> Call Trace:
> [<c0407208>] do_IRQ+0xbd/0xd1
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
> [<c042fd37>] process_timeout+0x0/0x5
> [<c060bb50>] schedule_timeout+0x6b/0x8f
> [<c060b67c>] io_schedule_timeout+0x39/0x5d
> [<c0465eea>] congestion_wait+0x50/0x64
> [<c0438539>] autoremove_wake_function+0x0/0x35
> [<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
> [<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3

I expect the lock_timer_base() here is just stack gunk. Matthias's trace
also includes

mysqld        S 000017c0  2524  1623   1562
       f6ce3b44 00000082 60ca34b2 000017c0 f6ce3b28 f6ce3b54 f6ce3000 57c63d9c
       00004d1d f6c90000 f6c9019c c200fa80 c0724640 f6c60540 000007d0 c07e1f00
       00000286 c042ffcb f6ce3b54 000290ef 00000286
Call Trace:
[<c042ffcb>] lock_timer_base+0x19/0x35
[<c04300df>] __mod_timer+0x9a/0xa4
[<c060bb55>] schedule_timeout+0x70/0x8f
[<c042fd37>] process_timeout+0x0/0x5
[<c060bb50>] schedule_timeout+0x6b/0x8f
[<c04857c7>] do_select+0x36d/0x3c4
[<c0485da0>] __pollwait+0x0/0xac
[<c04e934d>] __next_cpu+0x12/0x1e
[<c0420130>] find_busiest_group+0x1c4/0x553
[<c0420ada>] update_curr+0x23b/0x25c
[<c04eb700>] rb_insert_color+0x8c/0xad
[<c042128f>] enqueue_entity+0x276/0x294

and it appears that schedule_timeout() always leaves a copy of
lock_timer_base+0x19 on the stack. Enabling CONFIG_FRAME_POINTER might
help sort that out.

I think. Or perhaps lock_timer_base() really has gone and got stuck. One
possibility is that gcc has decided to cache timer->base in a register
rather than rereading it around that loop, which would be bad. Do:

gdb vmlinux
(gdb) x/100i lock_timer_base

Is the machine really completely dead? Or are some tasks running? If the
latter, it might be dirty-memory windup - perhaps some device driver has
died and we're not getting writes out to disk.

Are all the CPUs running flat-out? If so, yup, maybe it's
lock_timer_base(). Hit sysrq-P ten times, see where things are stuck.

Please leave `vmstat 1' running in an ssh session next time, let's see
the output just prior to the hang. And do this:

while true
do
	echo
	cat /proc/meminfo
	sleep 1
done

in another ssh session so we can see what the memory looked like when it
died too.
Re: Processes spinning forever, apparently in lock_timer_base()?
On Wed, 01 Aug 2007 18:39:51 -0400, Chuck Ebbert wrote:

> Looks like the same problem with spinlock unfairness we've seen
> elsewhere: it seems to be looping here? Or is everyone stuck just
> waiting for writeout?
>
> lock_timer_base():
>	for (;;) {
>		tvec_base_t *prelock_base = timer->base;
>		base = tbase_get_base(prelock_base);
>		if (likely(base != NULL)) {
>			spin_lock_irqsave(&base->lock, *flags);
>			if (likely(prelock_base == timer->base))
>				return base;
>			/* The timer has migrated to another CPU */
>			spin_unlock_irqrestore(&base->lock, *flags);
>		}
>		cpu_relax();
>	}
>
> The problem goes away completely if filesystems are mounted *without*
> noatime. Has happened in 2.6.20 through 2.6.22...
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=249563
>
> Part of sysrq-t listing:
>
> mysqld        D 000017c0  2196 23162   1562
>        e383fcb8 00000082 61650954 000017c0 e383fc9c c0407208 e383f000 a12b0434
>        00004d1d c6ed2c00 c6ed2d9c c200fa80 c0724640 f6c60540 c4ff3c70 00000508
>        00000286 c042ffcb e383fcc8 00014926 00000286
> Call Trace:
> [<c0407208>] do_IRQ+0xbd/0xd1
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
> [<c042fd37>] process_timeout+0x0/0x5
> [<c060bb50>] schedule_timeout+0x6b/0x8f
> [<c060b67c>] io_schedule_timeout+0x39/0x5d
> [<c0465eea>] congestion_wait+0x50/0x64
> [<c0438539>] autoremove_wake_function+0x0/0x35
> [<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
> [<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3
>
> named         D 000017c0  2024  1454      1
>        f722acb0 00000082 6165ed96 000017c0 c1523e80 c16f0c00 c16f20e0 f722a000
>        a12be87d 00004d1d f768ac00 f768ad9c c200fa80 f75bda80 c0407208 00000508
>        00000286 c042ffcb f722acc0 00020207 00000286
> Call Trace:
> [<c0407208>] do_IRQ+0xbd/0xd1
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
> [<c042fd37>] process_timeout+0x0/0x5
> [<c060bb50>] schedule_timeout+0x6b/0x8f
> [<c060b67c>] io_schedule_timeout+0x39/0x5d
> [<c0465eea>] congestion_wait+0x50/0x64
> [<c0438539>] autoremove_wake_function+0x0/0x35
> [<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
> [<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3
>
> mysqld        D 000017c0  2196 23456   1562
>        e9293cb8 00000082 616692ed 000017c0 e9293c9c e9293cc8 e9293000 a12c8dd0
>        00004d1d c3d5ac00 c3d5ad9c c200fa80 c0724640 f6c60540 e9293d10 c07e1f00
>        00000286 c042ffcb e9293cc8 0002b57f 00000286
> Call Trace:
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
> [<c042fd37>] process_timeout+0x0/0x5
> [<c060bb50>] schedule_timeout+0x6b/0x8f
> [<c060b67c>] io_schedule_timeout+0x39/0x5d
> [<c0465eea>] congestion_wait+0x50/0x64
> [<c0438539>] autoremove_wake_function+0x0/0x35
> [<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
> [<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3

I'm not sure if this is related to your problem, but I posted a patch the
other day that fixes an issue in balance_dirty_pages. If a drive is under
light load it can get stuck looping, waiting for enough pages to be
available to be written. In my test case this occurs when one drive is
under heavy load and the other isn't.

I sent it with the subject "[PATCH] balance_dirty_pages - exit loop when
no more pages available". Here's the patch again, in case it helps.

Cheers
Richard
--
--- linux-2.6.22.1/mm/page-writeback.c.orig	2007-07-30 16:36:09.000000000 +0100
+++ linux-2.6.22.1/mm/page-writeback.c	2007-07-31 16:26:43.000000000 +0100
@@ -250,6 +250,8 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
+			if (!wbc.encountered_congestion && wbc.nr_to_write > 0)
+				break;		/* didn't find enough to do */
 		}
 		congestion_wait(WRITE, HZ/10);
 	}
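What the two added lines key off can be modelled in plain C for
illustration; writeback_inodes() here is a made-up stand-in that
decrements nr_to_write for each page it manages to start, so returning
with nr_to_write still positive and no congestion seen means the queue
simply had too little dirty data, and waiting longer cannot help:

#include <stdio.h>
#include <stdbool.h>

struct writeback_control {
	long nr_to_write;
	bool encountered_congestion;
};

/* made-up stand-in: this queue only has 3 dirty pages to offer */
static void writeback_inodes(struct writeback_control *wbc)
{
	long available = 3;

	while (wbc->nr_to_write > 0 && available > 0) {
		wbc->nr_to_write--;
		available--;
	}
}

int main(void)
{
	struct writeback_control wbc = {
		.nr_to_write = 12,		/* write_chunk analogue */
		.encountered_congestion = false,
	};

	writeback_inodes(&wbc);
	if (!wbc.encountered_congestion && wbc.nr_to_write > 0)
		puts("didn't find enough to do -- break instead of spinning");
	return 0;
}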
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/