Re: System crash after "No irq handler for vector" linux 2.6.19
Luigi,

Unless you have a completely different cause, I believe the patches I just posted will fix the issue. If you can test and confirm this, that would be great.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: System crash after "No irq handler for vector" linux 2.6.19
"Luigi Genoni" <[EMAIL PROTECTED]> writes: > the message appeared just once, but no crash. > anyway the load average was really abnormal. Good. You tested it and it worked! High load average is interesting, because it has similar causes as the "No irq handler for vector", but technically they are completely independent. In practice surviving a high load average is a very useful property though. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: System crash after "No irq handler for vector" linux 2.6.19
"Luigi Genoni" <[EMAIL PROTECTED]> writes: > I tested the patch, but I could not really stress the HW. > anyway no crash, but load average is somehow abnormal, higher than it should > be. Thanks. Did you get any nasty messages about "No irq handler for vector?" If not then you never even hit the problem condition. Given how rare the trigger condition is it would probably take force irq migration to even trigger the "No irq handler for vector message". That patch I'm not really worried about, I've tested it and it meets the obviously correct condition :) But it should enable people to not be afraid of IRQ migration. I'm slowly working my way towards a real fix. I know what I have to code, now I just have to figure out how. :) Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: System crash after "No irq handler for vector" linux 2.6.19
> Ok. I've finally figured out what is going on. The code is race free but the
> programmer was an idiot.

Hi,

Could this IRQ problem account for this bug as well, please? Or is yours strictly a 2.6.19.x issue?

http://bugzilla.kernel.org/show_bug.cgi?id=7847

I have a dual P4 Xeon box (HT enabled), so there is a lot of scope for IRQ migration. I was playing WoW when this bug occurred, so there would have been a lot of IRQs needing handling between both the video and sound cards.

Cheers,
Chris
Re: System crash after "No irq handler for vector" linux 2.6.19
"Luigi Genoni" <[EMAIL PROTECTED]> writes: > OK, > willing to test any patch. Ok. I've finally figured out what is going on. The code is race free but the programmer was an idiot. In the local apic there are two relevant registers. ISR (in service register) describing all of the interrupts that the cpu in the process of handling. IRR (intrerupt request register) which lists all of the interrupts that are currently pending. Well it happens that IRR is used to catch the case when we are servicing an interrupt and that same interrupt comes in again. When that happens as soon as we are done service the interrupt that same interrupt fires again. We perform interrupt migration in an interrupt handler, so we can be race free. It turns out that if I'm performing migration (updating all of the data structures and hardware registers) while IRR is set the interrupt will happen in the old location immediate after my migration work is complete. And since the kernel is not setup to deal with it we get an ugly error message. Anyway now that I know what is going on I'm going to have to think about this a little bit more to figure out how to fix this. My hunch is the easy fix will be simply not to migrate until I have an interrupt instance when IRR is clear. Anyway with a little luck tomorrow I will be able to figure it out, it's to bed with me now. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: System crash after "No irq handler for vector" linux 2.6.19
"Luigi Genoni" <[EMAIL PROTECTED]> writes: > OK, > willing to test any patch. Sure. After I get things working on this end I will copy you, on any fixes so you can confirm they work for you. I am still root causing this but I have found a small fix that should keep the system from going down when this problem occurs. If you could confirm that it keeps your system from going down I'd appreciate it. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: System crash after "No irq handler for vector" linux 2.6.19
<[EMAIL PROTECTED]> writes:

> I have an interesting update, at least I suppose I have.

It was, at least as another data point.

> I do not know very well what happens with irq stuff migrating shared irq, but I
> suppose this has something to do with this crash.

The fact that the irq was shared should have no bearing on this crash scenario. A shared irq is not at all helpful in a performance sense, but this problem is low-level enough that a shared irq should have made no difference at all, except for the frequency at which the interrupt fired and was migrated. And a high interrupt frequency and a high migration rate tend to cause this problem.

> Right now I stopped irqbalance and puff! load average is back to normal, and
> under the same workload nothing similar is happening for the moment.

Yes. That sounds like a good workaround until this problem is sorted out.

> Lesson number one I learnt: avoid shared IRQ on these systems (but to reconfigure
> HW cabling right now is not so easy).

Right. Because the only sharing should be because the traces on the motherboard are shared.

> I hope this helps

It has all helped. I have been tracking some easier problems, keeping this one on my back burner.

The good/bad news is that by restricting the set of vectors I can choose from in the kernel, running ping -f to another machine, and migrating the single irq for my NIC, I have been able to reproduce this in about 5 minutes.

I haven't root caused it yet, but the fact that I can reproduce this on a dual socket motherboard suggests that the reason it took you an hour to reproduce the problem is simply that you had so few irqs on your system, and that the extreme latency of 8-socket Opterons is not to blame.

Hopefully, now that I can reproduce this, I will be able to root cause and then fix this bug.

Eric
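Migrating "the single irq for my NIC" first requires knowing which irq number the NIC uses. A minimal sketch, assuming the usual /proc/interrupts layout (irq number and colon, per-CPU counts, controller type, then device names). The function name `irq_for_device` and the device name `eth0` are assumptions, not taken from the thread:

```shell
#!/bin/sh
# Hypothetical helper: print the irq number(s) on which a device is registered.
# $1: a file in /proc/interrupts format, $2: device name (for example eth0)
irq_for_device() {
    awk -v dev="$2" '
        $1 ~ /^[0-9]+:$/ {                 # only numbered irq rows, not NMI/LOC
            for (i = 2; i <= NF; i++) {
                name = $i
                sub(/,$/, "", name)        # shared irqs list names as "a, b"
                if (name == dev) {
                    sub(/:$/, "", $1)
                    print $1
                    break
                }
            }
        }' "$1"
}
```

With irqbalance stopped, something like `irq=$(irq_for_device /proc/interrupts eth0)` followed by writing a mask to /proc/irq/$irq/smp_affinity (as root) would move only that interrupt.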
Re: System crash after "No irq handler for vector" linux 2.6.19
"Luigi Genoni" <[EMAIL PROTECTED]> writes: > reproduced. > it took more or less one hour to reproduce it. I could reproduce it olny > running also irqbalance 0.55 and commenting out the sleep 1. The message in > syslog is the same and then, after a few seconds I think, KABOM! system crash > and reboot. > > I tested also a similar system that has 4 dual core CPU Opteron 2600MHZ. On > this system (linux sees 8 CPU, but it is the same kernel, same gcc, same > config, same glibc, same active services) I could not reproduce it even > running irqbalance 0.55 in almost 1 hour. Maybe I could reproduce it waiting > for more time, but my users need to do their work, so I could not have a > longer test window. So on 16 CPU I had the crash, on 8 CPU I had no crash. > > I need to give back the system to the users, so if you need other tests, > please, tell me as soon. Ok. Since it seems to be the irq code I'm going to need to get a dump of the state of the apics, basically the output print_IO_APIC from the time this happens. I'm not really out of bed yet so no patch but a heads up. Once I've finished sleeping I'll look at putting a debugging patch together. Getting a bootlog of the identical system that would not crash at 8 cpus would be interesting. In part this is because the number of apic modes that are available for use are much fewer when we have 8 or more cpus, so it would be interesting to see if it is actually using the same code for interrupt delivery. If you have a few minutes to try it, it would be interesting to know if forcing the migration from the command line (as I was suggesting) would reproduce this faster than irqbalance. Hopefully I can think up things to fix this when I wake up. 
The time zone difference is going to be a pain :(

Eric
Re: System crash after "No irq handler for vector" linux 2.6.19
reproduced.
it took more or less one hour to reproduce it. I could reproduce it only when also running irqbalance 0.55 and commenting out the sleep 1. The message in syslog is the same and then, after a few seconds I think, KABOM! system crash and reboot.

I tested also a similar system that has 4 dual core CPU Opteron 2600MHZ. On this system (linux sees 8 CPU, but it is the same kernel, same gcc, same config, same glibc, same active services) I could not reproduce it even running irqbalance 0.55 in almost 1 hour. Maybe I could reproduce it waiting for more time, but my users need to do their work, so I could not have a longer test window. So on 16 CPU I had the crash, on 8 CPU I had no crash.

I need to give back the system to the users, so if you need other tests, please, tell me as soon as possible.

thanx

Luigi Genoni

On Monday 22 January 2007 18:14, Eric W. Biederman wrote:
> "Luigi Genoni" <[EMAIL PROTECTED]> writes:
>
> > (e-mail resent because not delivered using my other e-mail account)
> >
> > Hi,
> > this night a linux server 8 dual core CPU Opteron 2600Mhz crashed just
> > after giving this message
> >
> > Jan 22 04:48:28 frey kernel: do_IRQ: 1.98 No irq handler for vector
>
> Ok. This indicates that the hardware is doing something we didn't expect.
> We don't know which irq the hardware was trying to deliver when it
> sent vector 0x98 to cpu 1.
>
> > I have no other logs, and I eventually lost the OOPS since I have no net
> > console set up.
>
> If you had an oops it may have meant the above message was a secondary
> symptom. Groan. If it stayed up long enough to give an OOPS then
> there is a chance the above message appearing only once had nothing
> to do with the actual crash.
>
> How long had the system been up?
>
> > As I said system is running linux 2.6.19 compiled with gcc 4.1.1 for AMD
> > Opteron (attached see .config), no kernel preemption except the BKL
> > preemption. glibc 2.4.
> >
> > System has 16 GB RAM and 8 dual core Opteron 2600Mhz.
> >
> > I am running irqbalance 0.55.
> >
> > any hints on what has happened?
>
> Three guesses.
>
> - A race triggered by irq migration (but I would expect more people to be
>   yelling). The code path where that message comes from is new in 2.6.19
>   so it may not have had all of the bugs found yet :(
> - A weird hardware or BIOS setup.
> - A secondary symptom triggered by some other bug.
>
> If this winds up being reproducible we should be able to track it down.
> If not, this may end up in the files of "crap, something bad happened that
> we don't understand".
>
> The one condition I know how to test for (if you are willing) is an
> irq migration race. Simply by triggering irq migration much more often,
> and thus increasing our chances of hitting a problem.
>
> Stopping irqbalance and running something like:
>
>     for irq in 0 24 28 29 44 45 60 68 ; do
>         while :; do
>             for mask in 1 2 4 8 10 20 40 80 100 200 400 800 1000 2000 4000 8000 ; do
>                 echo $mask > /proc/irq/$irq/smp_affinity
>                 sleep 1
>             done
>         done &
>     done
>
> Should force every irq to migrate once a second, and removing the sleep 1
> is even harsher, although we max out at one irq migration per irq received.
>
> If some variation of the above loop does not trigger the
> "do_IRQ ??? No irq handler for vector" message, chances are it isn't a race
> in irq migration.
>
> If we can rule out the race scenario it will at least put us in the
> right direction for guessing what went wrong with your box.
>
> Eric
Re: System crash after "No irq handler for vector" linux 2.6.19
"Luigi Genoni" <[EMAIL PROTECTED]> writes: > (e-mail resent because not delivered using my other e-mail account) > > Hi, > this night a linux server 8 dual core CPU Optern 2600Mhz crashed just after > giving this message > > Jan 22 04:48:28 frey kernel: do_IRQ: 1.98 No irq handler for vector Ok. This indicates that the hardware is doing something we didn't expect. We don't know which irq the hardware was trying to deliver when it sent vector 0x98 to cpu 1. > I have no other logs, and I eventually lost the OOPS since I have no net > console setled up. If you had an oops it may have meant the above message was a secondary symptom. Groan. If it stayed up long enough to give an OOPS then there is a chance the above message appearing only once had nothing to do with the actual crash. How long had the system been up? > As I said sistem is running linux 2.6.19 compiled with gcc 4.1.1 for AMD > Opteron (attached see .config), no kernel preemption excepted the BKL > preemption. glibc 2.4. > > System has 16 GB RAM and 8 dual core Opteron 2600Mhz. > > I am running irqbalance 0.55. > > any hints on what has happened? Three guesses. - A race triggered by irq migration (but I would expect more people to be yelling). The code path where that message comes from is new in 2.6.19 so it may not have had all of the bugs found yet :( - A weird hardware or BIOS setup. - A secondary symptom triggered by some other bug. If this winds up being reproducible we should be able to track it down. If not this may end up in the files of crap something bad happened that we don't understand. The one condition I know how to test for (if you are willing) is an irq migration race. Simply by triggering irq migration much more often, and thus increasing our chances of hitting a problem. 
Stopping irqbalance and running something like: for irq in 0 24 28 29 44 45 60 68 ; do while :; do for mask in 1 2 4 8 10 20 40 80 100 200 400 800 1000 2000 4000 8000 ; do echo mask > /proc/irq/$irq/smp_affinity sleep 1 done done & done Should force every irq to migrate once a second, and removing the sleep 1 is even harsher, although we max at one irq migration by irq received. If some variation of the above loop does not trigger the do_IRQ ??? No irq handler for vector message chances are it isn't a race in irq migration. If we can rule out the race scenario it will at least put us in the right direction for guessing what went wrong with your box. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
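The mask list in that loop is one single-bit hex value per CPU, written out by hand for 16 CPUs. On a box with a different CPU count the same list could be generated. A minimal sketch, assuming at most 32 CPUs so that each mask fits in one hex word; the function name `cpu_masks` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print one single-bit hex affinity mask per CPU,
# lowest CPU first. Assumes at most 32 CPUs (one hex word per mask).
cpu_masks() {
    # $1: number of CPUs
    i=0
    while [ "$i" -lt "$1" ]; do
        printf '%x\n' $((1 << i))
        i=$((i + 1))
    done
}
```

The inner loop could then read `for mask in $(cpu_masks 16) ; do`, which yields the same sixteen values as the hand-written list, from 1 through 8000.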