Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Rob Mueller
> Sorry for the confusion, you're hitting the other mmap_sem -> transaction lock problem. This one should be solvable with an iget so we make sure not to do the final unlink until after the mmap_sem is dropped. Let's see what I can do...

Oh dang. I thought this last crash after upgrading to

Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Chris Mason
On Tuesday 12 July 2005 20:50, Rob Mueller wrote:
> Are you saying that if you mount with noatime *and* use your new patch it will fix the problem?
>
> What about the 2 threads linked to. Did those end up getting anywhere?

Sorry for the confusion, you're hitting the other mmap_sem ->

Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Chris Mason
On Tuesday 12 July 2005 20:42, Chris Mason wrote:
> > Sounds like a different issue. The patch Bron included before fixes (or at least reduces to the point where it fixes it for us) a problem where processes get stuck in D state and are unkillable. A reboot is required to remove them.

Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Rob Mueller
There is a much less complex solution that I've just recently gotten working in the SUSE kernel. If reiser3/ext3 don't log the inode during atime updates, the problem goes away. You can solve this now by mounting with -o noatime (although that might not play well with cyrus, not sure). My

Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Chris Mason
On Tuesday 12 July 2005 20:27, Rob Mueller wrote:
> > > We're also applying the attached patch. There's a bug in reiserfs that gets tickled by our huge MMAP usage (it's amazing what really busy Cyrus daemons can do to a server, ouch). It's fixed in generic_write, so we take the

Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Rob Mueller
> We're also applying the attached patch. There's a bug in reiserfs that > gets tickled by our huge MMAP usage (it's amazing what really busy > Cyrus daemons can do to a server, ouch). It's fixed in generic_write, > so we take the few percent performance hit for something that doesn't >

Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Lars Roland
On 7/12/05, Bron Gondwana <[EMAIL PROTECTED]> wrote:
> We're also applying the attached patch. There's a bug in reiserfs that gets tickled by our huge MMAP usage (it's amazing what really busy Cyrus daemons can do to a server, ouch). It's fixed in generic_write, so we take the few percent

Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Bron Gondwana
On Tue, 12 Jul 2005 14:13:01 +0200, "Lars Roland" <[EMAIL PROTECTED]> said:
> You have irq balancing, the line
>
> CONFIG_IRQBALANCE=y
>
> in your config file confirms it - I am not completely sure that it is the root of the problem but when I experienced the problem I changed two things:

Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Lars Roland
On 7/12/05, Rob Mueller <[EMAIL PROTECTED]> wrote:
> Here's the /proc/interrupts dump:
>
>            CPU0       CPU1       CPU2       CPU3
>   0:   11524000          0          0          0    IO-APIC-edge  timer
>   1:          8          0          0          0    IO-APIC-edge  i8042
>   5:

Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Rob Mueller
> We recently tried upgrading one of the machines to the latest kernel > (2.6.12.2) and it's died after about 24 hours. It seemed to end up in > some > weird state where we could ssh into it, and some commands worked (eg > uptime) > but process list related commands (ps) would just freeze up

Re: 2.6.12.2 dies after 24 hours

2005-07-12 Thread Lars Roland
On 7/12/05, Rob Mueller <[EMAIL PROTECTED]> wrote:
> As background, we've been using a relatively old kernel (2.6.4-mm2) on some IBM x235 machines with 6G of RAM, umem cards, and serveraid storage. These machines are under continuous heavy-ish load, load avg between about 1 and 5, with

2.6.12.2 dies after 24 hours

2005-07-12 Thread Rob Mueller
As background, we've been using a relatively old kernel (2.6.4-mm2) on some IBM x235 machines with 6G of RAM, umem cards, and serveraid storage. These machines are under continuous heavy-ish load, load avg between about 1 and 5, with between 2500-3500 procs at all times, with several largish
