Re: System crash after "No irq handler for vector" linux 2.6.19

2007-02-02 Thread Eric W. Biederman


Luigi,

Unless you have a completely different cause I believe the patches I
just posted will fix the issue.  If you can test and confirm this that
would be great.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: System crash after "No irq handler for vector" linux 2.6.19

2007-02-02 Thread Eric W. Biederman

"Luigi Genoni" <[EMAIL PROTECTED]> writes:

> the message appeared just once, but no crash.
> anyway the load average was really abnormal.

Good. You tested it and it worked!

High load average is interesting, because it has similar causes
as the "No irq handler for vector", but technically they are
completely independent.

In practice surviving a high load average is a very useful property
though.

Eric

Re: System crash after "No irq handler for vector" linux 2.6.19

2007-02-02 Thread Eric W. Biederman

"Luigi Genoni" <[EMAIL PROTECTED]> writes:

> I tested the patch, but I could not really stress the HW.
> anyway no crash, but load average is somehow abnormal, higher than it should 
> be.

Thanks.  Did you get any nasty messages about "No irq handler for vector"?
If not then you never even hit the problem condition.

Given how rare the trigger condition is it would probably take forced
irq migration to even trigger the "No irq handler for vector" message.

That patch I'm not really worried about, I've tested it and it meets
the obviously correct condition :)  But it should enable people to
not be afraid of IRQ migration.

I'm slowly working my way towards a real fix.  I know what I have to code,
now I just have to figure out how. :)

Eric

Re: System crash after "No irq handler for vector" linux 2.6.19

2007-02-01 Thread Chris Rankin

> Ok. I've finally figured out what is going on. The code is race free but the
> programmer was an idiot.

Hi,

Could this IRQ problem account for this bug as well, please? Or is yours
strictly a 2.6.19.x issue?

http://bugzilla.kernel.org/show_bug.cgi?id=7847

I have a dual P4 Xeon box (HT enabled) so there is a lot of scope for IRQ
migration. I was playing WoW when this bug occurred, so there would have been
a lot of IRQs needing handling between both the video and sound cards.

Cheers,
Chris





Re: System crash after "No irq handler for vector" linux 2.6.19

2007-01-31 Thread Eric W. Biederman

"Luigi Genoni" <[EMAIL PROTECTED]> writes:

> OK,
> willing to test any patch.

Ok. I've finally figured out what is going on.  The code is
race free but the programmer was an idiot.

In the local apic there are two relevant registers.
ISR (in service register) describing all of the
interrupts that the cpu is in the process of handling.
IRR (interrupt request register) which lists all of
the interrupts that are currently pending.

Well it happens that IRR is used to catch the case
when we are servicing an interrupt and that same interrupt
comes in again.  When that happens as soon as we are
done servicing the interrupt that same interrupt fires again.

We perform interrupt migration in an interrupt handler, so
we can be race free.

It turns out that if I'm performing migration (updating all
of the data structures and hardware registers) while IRR
is set the interrupt will happen in the old location immediately after
my migration work is complete.  And since the kernel is not
set up to deal with it we get an ugly error message.

Anyway now that I know what is going on I'm going to have to think
about this a little bit more to figure out how to fix this.  My hunch
is the easy fix will be simply not to migrate until I have an
interrupt instance when IRR is clear.  

Anyway with a little luck tomorrow I will be able to figure it out,
it's to bed with me now.
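
[Editor's sketch] The deferral idea above can be modelled in a few lines of
shell. This is only a toy illustration, not the kernel fix: it assumes the
usual local APIC layout in which IRR and ISR are each 256 bits wide and are
read as eight 32-bit words, with word 0 covering vectors 0-31. The function
names vector_bit_set and may_migrate_now are made up for this example.

```shell
# Toy model of a 256-bit local APIC register (IRR or ISR) passed in as
# eight 32-bit words; word 0 is assumed to hold vectors 0-31.
# Succeeds (exit status 0) when the bit for the given vector is set.
vector_bit_set() {
    vector=$1; shift
    word_index=$(($vector / 32))   # which 32-bit word holds the vector
    bit=$(($vector % 32))          # bit position inside that word
    i=0
    for word in "$@"; do
        if [ "$i" -eq "$word_index" ]; then
            [ $(( ($word >> bit) & 1 )) -eq 1 ]
            return
        fi
        i=$((i + 1))
    done
    return 1
}

# Hypothetical decision rule for the hunch above: only migrate when the
# vector is not pending in the (modelled) IRR, otherwise wait for the next
# interrupt instance.
may_migrate_now() {
    if vector_bit_set "$@"; then
        echo defer
    else
        echo migrate
    fi
}
```

For example, "may_migrate_now 0x98 0 0 0 0 0x01000000 0 0 0" models vector
0x98 (word 4, bit 24) still pending, so the move would be put off.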

Eric

Re: System crash after "No irq handler for vector" linux 2.6.19

2007-01-31 Thread Eric W. Biederman

"Luigi Genoni" <[EMAIL PROTECTED]> writes:

> OK,
> willing to test any patch.

Sure.  After I get things working on this end I will copy you
on any fixes so you can confirm they work for you.

I am still root causing this but I have found a small fix that should
keep the system from going down when this problem occurs.

If you could confirm that it keeps your system from going down I'd
appreciate it.

Eric

Re: System crash after "No irq handler for vector" linux 2.6.19

2007-01-31 Thread Eric W. Biederman

<[EMAIL PROTECTED]> writes:

> I have an interesting update, at least I suppose I have.

It was, at least as another data point.

> I do not know very well what happens with irq stuff migrating shared irq,
> but I suppose this has something to do with this crash.

The fact the irq was shared should have no bearing on this crash scenario.
A shared irq is not at all helpful in a performance sense, but this problem
is low level enough that a shared irq should have made no difference at all,
except for the frequency at which the interrupt fired and was migrated.  And a
high interrupt frequency and a high migration rate tend to cause this problem.

> Right now I stopped irqbalance and puff! load average is back to normal, and
> under the same workload nothing similar is happening for the moment.

Yes.  That sounds like a good work around until this problem is sorted out.
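
[Editor's sketch] With irqbalance stopped (however your distribution stops
that service), nothing rewrites the affinity masks any more, so it can be
useful to see where each irq is currently pinned. The helper below is a
minimal sketch; show_irq_affinity is a name invented here, and the optional
directory argument exists only so the function can be tried against a copy
of the /proc/irq layout.

```shell
# List every irq together with its current CPU affinity mask.
# $1: directory laid out like /proc/irq (defaults to /proc/irq itself).
show_irq_affinity() {
    dir=${1:-/proc/irq}
    for f in "$dir"/*/smp_affinity; do
        [ -r "$f" ] || continue        # skip unreadable or unmatched entries
        irq=${f%/smp_affinity}         # strip the file name ...
        irq=${irq##*/}                 # ... and keep only the irq number
        printf 'irq %s -> %s\n' "$irq" "$(cat "$f")"
    done
}
```

Running show_irq_affinity with no argument prints one line per irq, with
the mask in the same hexadecimal form that smp_affinity accepts.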

> Lesson number one I learnt: avoid shared IRQ on these systems (but to
> reconfigure HW cabling right now is not so easy).

Right.   Because the only sharing should be because the traces on the
motherboard are shared.

> I hope this helps

It has all helped.

I have been tracking some easier problems, keeping this one on my back burner.
The good/bad news is that by restricting the set of vectors I can choose from
in the kernel, running ping -f to another machine, and migrating the
single irq for my NIC, I have been able to reproduce this in about 5 minutes.
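
[Editor's sketch] The userspace half of that recipe might look roughly like
the following. It is a guess at the shape of the test, not the script that
was actually used: the restriction of the vector set is a kernel-side change
that is not shown in this thread, and bounce_affinity, NIC_IRQ and otherhost
are placeholders introduced for the example.

```shell
# Write each mask in turn to an affinity file for a fixed number of rounds,
# so a single irq keeps being asked to move between CPUs.
# $1: affinity file, $2: number of rounds, remaining arguments: hex masks.
bounce_affinity() {
    file=$1; rounds=$2; shift 2
    n=0
    while [ "$n" -lt "$rounds" ]; do
        for mask in "$@"; do
            echo "$mask" > "$file" || return 1
        done
        n=$((n + 1))
    done
}

# Hypothetical use, as root, with a flood ping keeping the NIC busy:
#   ping -f otherhost > /dev/null &
#   bounce_affinity /proc/irq/$NIC_IRQ/smp_affinity 100000 1 2
```

The flood ping matters because a write to smp_affinity only requests a move;
the irq actually migrates when its next interrupt arrives.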

I haven't root caused it yet, but the fact I can reproduce this on a dual socket
motherboard suggests the reason it took you an hour to reproduce the problem is
simply because you had so few irqs on your system, and that the extreme
latency of 8-socket Opterons is not to blame.

Hopefully now that I can reproduce this I will be able to root cause
this and then fix this bug.

Eric

Re: System crash after "No irq handler for vector" linux 2.6.19

2007-01-23 Thread Eric W. Biederman

"Luigi Genoni" <[EMAIL PROTECTED]> writes:

> reproduced.
> it took more or less one hour to reproduce it.  I could reproduce it only 
> running also irqbalance 0.55 and commenting out the sleep 1. The message in 
> syslog is the same and then, after a few seconds I think, KABOM! system crash 
> and reboot.
>
> I tested also a similar system that has 4 dual core CPU Opteron 2600MHZ. On 
> this system (linux sees 8 CPU, but it is the same kernel, same gcc, same 
> config, same glibc, same active services) I could not reproduce it even 
> running irqbalance 0.55 in almost 1 hour. Maybe I could reproduce it waiting 
> for more time, but my users need to do their work, so I could not have a 
> longer test window. So on 16 CPU I had the crash, on 8 CPU I had no crash. 
>
> I need to give back the system to the users, so if you need other tests, 
> please, tell me as soon.

Ok.  Since it seems to be the irq code I'm going to need to get a dump
of the state of the apics, basically the output of print_IO_APIC from the time
this happens.

I'm not really out of bed yet so no patch but a heads up.  Once I've finished
sleeping I'll look at putting a debugging patch together.

Getting a bootlog of the identical system that would not crash at 8 cpus
would be interesting.  In part this is because the number of apic modes
that are available for use is much smaller when we have 8 or more cpus,
so it would be interesting to see if it is actually using the same code
for interrupt delivery.

If you have a few minutes to try it, it would be interesting to know
if forcing the migration from the command line (as I was suggesting)
would reproduce this faster than irqbalance.

Hopefully I can think up things to fix this when I wake up.

The time zone difference is going to be a pain :(

Eric

Re: System crash after "No irq handler for vector" linux 2.6.19

2007-01-23 Thread l . genoni

reproduced.
it took more or less one hour to reproduce it.  I could reproduce it only
running also irqbalance 0.55 and commenting out the sleep 1. The message in
syslog is the same and then, after a few seconds I think, KABOM! system crash
and reboot.

I tested also a similar system that has 4 dual core CPU Opteron 2600MHZ. On
this system (linux sees 8 CPU, but it is the same kernel, same gcc, same
config, same glibc, same active services) I could not reproduce it even
running irqbalance 0.55 in almost 1 hour. Maybe I could reproduce it waiting
for more time, but my users need to do their work, so I could not have a
longer test window. So on 16 CPU I had the crash, on 8 CPU I had no crash.

I need to give back the system to the users, so if you need other tests,
please tell me as soon as possible.

thanx

Luigi Genoni


Re: System crash after "No irq handler for vector" linux 2.6.19

2007-01-22 Thread Eric W. Biederman

"Luigi Genoni" <[EMAIL PROTECTED]> writes:

> (e-mail resent because not delivered using my other e-mail account)
>
> Hi,
> this night a linux server 8 dual core CPU Opteron 2600Mhz crashed just after
> giving this message
>
> Jan 22 04:48:28 frey kernel: do_IRQ: 1.98 No irq handler for vector

Ok.  This indicates that the hardware is doing something we didn't expect.
We don't know which irq the hardware was trying to deliver when it
sent vector 0x98 to cpu 1.
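
[Editor's sketch] For anyone scanning syslog for this warning, a small
helper can pull the two numbers out of the line. It follows the reading
given above (the field before the dot is the cpu, the field after it is the
vector, taken as hexadecimal); that format is an assumption based on this
reply, not something checked against the kernel source, and decode_do_irq is
a name invented for the example.

```shell
# Extract cpu and vector from a "do_IRQ: C.VV No irq handler for vector"
# syslog line.  Assumes "<cpu>.<vector>" with the vector read as hex, as in
# the reply above.  Fails (non-zero status) if the line does not match.
decode_do_irq() {
    field=$(printf '%s\n' "$1" | sed -n \
        's/.*do_IRQ: \([0-9][0-9]*\)\.\([0-9a-fA-F][0-9a-fA-F]*\) No irq handler for vector.*/\1 \2/p')
    [ -n "$field" ] || return 1
    cpu=${field% *}    # text before the space: cpu number
    vec=${field#* }    # text after the space: vector digits
    printf 'cpu=%s vector=0x%s\n' "$cpu" "$vec"
}
```

Fed the log line quoted above, it reports cpu=1 and vector=0x98. It cannot
say which irq the vector belonged to, which is the missing piece here.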

> I have no other logs, and I eventually lost the OOPS since I have no net
> console set up.

If you had an oops it may have meant the above message was a secondary
symptom.  Groan.  If it stayed up long enough to give an OOPS then
there is a chance the above message appearing only once had nothing
to do with the actual crash.

How long had the system been up?

> As I said system is running linux 2.6.19 compiled with gcc 4.1.1 for AMD
> Opteron (attached see .config), no kernel preemption except the BKL
> preemption. glibc 2.4.
>
> System has 16 GB RAM and 8 dual core Opteron 2600Mhz.
>
> I am running irqbalance 0.55.
>
> any hints on what has happened?

Three guesses.

- A race triggered by irq migration (but I would expect more people to be
  yelling).  The code path where that message comes from is new in 2.6.19 so
  it may not have had all of the bugs found yet :(
- A weird hardware or BIOS setup.
- A secondary symptom triggered by some other bug.

If this winds up being reproducible we should be able to track it down.
If not this may end up in the files of crap something bad happened that
we don't understand.

The one condition I know how to test for (if you are willing) is an
irq migration race.  Simply by triggering irq migration much more often,
and thus increasing our chances of hitting a problem.

Stopping irqbalance and running something like:
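
# Cycle each listed irq through every single-CPU mask (hex, CPUs 0-15).
for irq in 0 24 28 29 44 45 60 68 ; do
    while :; do
        for mask in 1 2 4 8 10 20 40 80 100 200 400 800 1000 2000 4000 8000 ; do
            # Ask the kernel to deliver this irq to the one CPU in $mask.
            echo $mask > /proc/irq/$irq/smp_affinity
            sleep 1
        done
    done &
done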

Should force every irq to migrate once a second, and removing the sleep 1
is even harsher, although we max out at one irq migration per irq received.
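
[Editor's sketch] The mask list in the loop above is simply one bit per CPU,
written in hexadecimal, for a 16-CPU box. On a machine with a different CPU
count the list can be generated instead of typed; cpu_masks is a helper name
made up for this example, and it is only meant for counts small enough to
fit in a single mask word.

```shell
# Print one hexadecimal single-CPU affinity mask per line, lowest CPU first.
# $1: number of CPUs to cover.
cpu_masks() {
    ncpus=$1
    cpu=0
    while [ "$cpu" -lt "$ncpus" ]; do
        printf '%x\n' "$((1 << cpu))"   # bit N set means "CPU N only"
        cpu=$((cpu + 1))
    done
}
```

"for mask in $(cpu_masks 16)" then walks the same sixteen masks as the
hand-written list, from 1 up to 8000.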

If some variation of the above loop does not trigger the "do_IRQ ??? No irq
handler for vector" message chances are it isn't a race in irq migration.

If we can rule out the race scenario it will at least put us in the right
direction for guessing what went wrong with your box.

Eric
