Possibly SATA related freeze killed networking and RAID

2007-11-20 Thread noah
I just had a strange freeze that killed networking and made software
RAID fail two of my harddisks.

The kernel messages I extracted from the system log after the reboot
are at the end of this mail. The box froze, and right after I noticed
the console messages saying that two of my RAID disks had failed, it
started to do disk I/O again, so I hit power off in pure paranoia.
The network didn't recover when the harddrive suddenly started working again.
I managed to connect a USB keyboard and wake up the monitor from
sleep so I could see some of the messages printed on the console.

I looked through some other threads and found a mention of
smartmontools which I too use (5.37-5ubuntu2).

Kernel 2.6.22-14-generic (Ubuntu Gutsy Gibbon 7.10)
Motherboard: Asus M2N32 WS Professional nForce 590 SLI MCP (MCP55)
CPU: Athlon64 X2 Dual-Core 5600+
RAM: 4GB (passed memtest86 just a few minutes ago)

The harddrives are four Samsung HD501LJ 500GB drives.
sda and sdb have firmware CR100-10 and sdc and sdd have firmware CR100-11.
The drives are just a couple of months old, well cooled and so far
there's nothing interesting reported by S.M.A.R.T.

Software raid is configured like this:
sda1,sdc1 -> md0 (raid 1)
sdb1,sdd1 -> md1 (raid 1)
Both md0 and md1 are then encrypted with dm-crypt and the dm-devices
are then used to form md2 (stripe).

  -- noah


# lspci
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:08.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a1)
00:09.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a2)
00:09.1 SMBus: nVidia Corporation MCP55 SMBus (rev a2)
00:09.2 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:0a.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:0a.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:0c.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:0d.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0d.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0d.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:0e.1 Audio device: nVidia Corporation MCP55 High Definition Audio (rev a2)
00:10.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:11.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:12.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:14.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:15.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:16.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:17.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation GeForce 8400 GS (rev a1)
02:06.0 Communication controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface
03:00.0 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge (rev 06)
03:00.1 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge (rev 06)
08:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE6145 SATA II PCI-E controller (rev a1)



 kernel: [734344.717844] irq 21: nobody cared (try booting with the "irqpoll" option)
 kernel: [734344.717866]
 kernel: [734344.717866] Call Trace:
 kernel: [734344.717868]  [__report_bad_irq+30/128] __report_bad_irq+0x1e/0x80
 mdadm: Fail event detected on md device /dev/md1, component device /dev/sdd1
 kernel: [734344.717888]  [note_interrupt+643/704] note_interrupt+0x283/0x2c0
 kernel: [734344.717895]  [handle_fasteoi_irq+221/272] handle_fasteoi_irq+0xdd/0x110
 mdadm: Fail event detected on md device /dev/md0, component device /dev/sdc1
 kernel: [734344.717901]  [do_IRQ+123/256] do_IRQ+0x7b/0x100
 kernel: [734344.717904]  [default_idle+0/64] default_idle+0x0/0x40
 kernel: [734344.717907]  [ret_from_intr+0/10] ret_from_intr+0x0/0xa
 kernel: [734344.717909]  [tcp_poll+0/368] tcp_poll+0x0/0x170
 kernel: [734344.717918]

Re: Possibly SATA related freeze killed networking and RAID

2007-11-20 Thread Alan Cox
>  kernel: [734344.717844] irq 21: nobody cared (try booting with the
> "irqpoll" option)
>  kernel: [734344.717866]

Your machine decided to emit interrupt 21 without an apparent reason.
Whatever caused that made the kernel shut down IRQ 21 at which point the
disk drives on that IRQ were no longer being serviced. Everything on IRQ
21 would have died - which may be why your networking failed too.

What do you have on IRQ 21 and is this a one off ?
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possibly SATA related freeze killed networking and RAID

2007-11-20 Thread noah
2007/11/20, Alan Cox <[EMAIL PROTECTED]>:
> >  kernel: [734344.717844] irq 21: nobody cared (try booting with the
> > "irqpoll" option)
> >  kernel: [734344.717866]
>
> Your machine decided to emit interrupt 21 without an apparent reason.
> Whatever caused that made the kernel shut down IRQ 21 at which point the
> disk drives on that IRQ were no longer being serviced. Everything on IRQ
> 21 would have died - which may be why your networking failed too.
>
> What do you have on IRQ 21 and is this a one off ?

I've had other freezes before but this was the first time I was able
to see what was actually going on.
IRQ 21 appears to be shared between sata_nv and ethernet.

Does this mean my hardware/BIOS is broken somehow?
I'm running the latest BIOS available.

# cat /proc/interrupts
           CPU0   CPU1
  0:  264973603163   IO-APIC-edge  timer
  1:  0  2   IO-APIC-edge  i8042
  8:  0  0   IO-APIC-edge  rtc
  9:  0  0   IO-APIC-fasteoi   acpi
 12:  0  6   IO-APIC-edge  i8042
 16:   4851 669159   IO-APIC-fasteoi   shpchp, libata
 20:  0  0   IO-APIC-fasteoi   sata_nv
 21:  364434775430   IO-APIC-fasteoi   sata_nv, eth0
 22:  312614531218   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
 23:  4   1649   IO-APIC-fasteoi   HDA Intel, ehci_hcd:usb2
NMI:  0  0
LOC:36295623629543
ERR:  0

  -- noah
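
The sharing noted above can also be read out of /proc/interrupts
mechanically. The following is a minimal Python sketch, not a finished
tool. It assumes the older layout shown in this mail (IRQ number and
colon, one count per CPU, the interrupt controller type, then a
comma-separated list of handlers); newer kernels print extra columns.
The function names parse_interrupts and shared_irqs are made up for the
example, and the counts in SAMPLE are invented, not taken from this
machine.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {irq: (per_cpu_counts, handlers)}."""
    lines = [l for l in text.splitlines() if l.strip()]
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue                       # skip NMI, LOC, ERR summary rows
        fields = rest.split()
        counts = [int(f) for f in fields[:ncpus]]
        # fields[ncpus] is the controller type (assumed layout); the
        # remainder is the handler list, whose names may contain spaces.
        names = " ".join(fields[ncpus + 1:]).split(",")
        handlers = [n.strip() for n in names if n.strip()]
        table[int(label)] = (counts, handlers)
    return table


def shared_irqs(table):
    """Return only the IRQ lines that have more than one handler attached."""
    return {irq: h for irq, (_, h) in table.items() if len(h) > 1}


# Invented sample data in the assumed layout.
SAMPLE = """\
           CPU0       CPU1
  0:        100        200   IO-APIC-edge      timer
 20:          0          0   IO-APIC-fasteoi   sata_nv
 21:       1500       2500   IO-APIC-fasteoi   sata_nv, eth0
 23:          4       1649   IO-APIC-fasteoi   HDA Intel, ehci_hcd:usb2
NMI:          0          0
ERR:          0
"""

if __name__ == "__main__":
    print(shared_irqs(parse_interrupts(SAMPLE)))
```

On a live system the same functions could be fed the contents of
/proc/interrupts instead of SAMPLE, provided the layout matches the
assumption above.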


Re: Possibly SATA related freeze killed networking and RAID

2007-11-20 Thread Alan Cox
> I've had other freezes before but this was the first time I was able
> to see what was actually going on.
> IRQ 21 appears to be shared between sata_nv and ethernet.
> 
> Does this mean my hardware/BIOS is broken somehow?

Not necessarily. It could be a bug in one of the drivers using IRQ 21
(sata_nv or the nvidia ethernet), it could be another inactive device, or
it could be a hardware funny.

Nvidia stuff can be quite hard to diagnose as we have no documentation
but we can try. The first question is whether it is network or disk
triggered - seeing if heavy loads to one or the other trigger the problem
might be a first plan.


Alan


Re: Possibly SATA related freeze killed networking and RAID

2007-11-21 Thread noah
2007/11/21, Alan Cox <[EMAIL PROTECTED]>:
> > I've had other freezes before but this was the first time I was able
> > to see what was actually going on.
> > IRQ 21 appears to be shared between sata_nv and ethernet.
> >
> > Does this mean my hardware/BIOS is broken somehow?
>
> Not necessarily. It could be a bug in one of the drivers using IRQ 21
> (sata_nv or the nvidia ethernet), it could be another inactive device, or
> it could be a hardware funny.

How can I tell if there's an inactive device?

> Nvidia stuff can be quite hard to diagnose as we have no documentation
> but we can try. The first question is whether it is network or disk
> triggered - seeing if heavy loads to one or the other trigger the problem
> might be a first plan.

I haven't managed to trigger it again yet but at the time the CPU was
heavily loaded and I was re-indexing a database which caused a lot of
disk activity. I'm quite confident the network was pretty much idle at
the time.

  -- noah


Re: Possibly SATA related freeze killed networking and RAID

2007-11-26 Thread Pavel Machek
Hi!

> >  kernel: [734344.717844] irq 21: nobody cared (try booting with the
> > "irqpoll" option)
> >  kernel: [734344.717866]
> 
> Your machine decided to emit interrupt 21 without an apparent reason.
> Whatever caused that made the kernel shut down IRQ 21 at which point the
> disk drives on that IRQ were no longer being serviced. Everything on IRQ
> 21 would have died - which may be why your networking failed too.

Hmm, perhaps that 'nobody cared' message should be worded more
strongly, and printed at KERN_CRIT?
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Possibly SATA related freeze killed networking and RAID

2007-11-27 Thread Tejun Heo
Pavel Machek wrote:
> Hi!
> 
>>>  kernel: [734344.717844] irq 21: nobody cared (try booting with the
>>> "irqpoll" option)
>>>  kernel: [734344.717866]
>> Your machine decided to emit interrupt 21 without an apparent reason.
>> Whatever caused that made the kernel shut down IRQ 21 at which point the
>> disk drives on that IRQ were no longer being serviced. Everything on IRQ
>> 21 would have died - which may be why your networking failed too.
> 
> Hmm, perhaps that 'nobody cared' message should be worded more
> strongly, and printed at KERN_CRIT?

Agreed.  Nobody cared on ATA controllers is usually very effective at
taking the whole machine down.  Is there any reason why we don't turn on
irqpoll on turned off IRQs automatically?

Thanks.

-- 
tejun


Re: Possibly SATA related freeze killed networking and RAID

2007-11-29 Thread Phillip Susi

Tejun Heo wrote:
> Agreed.  Nobody cared on ATA controllers is usually very effective at
> taking the whole machine down.  Is there any reason why we don't turn on
> irqpoll on turned off IRQs automatically?

Why does a single spurious interrupt cause it to be shut down?  I can 
see if the interrupt is stuck on and keeps interrupting constantly, but 
if it's just the occasional spurious interrupt, why not just ignore it 
and move on?




Re: Possibly SATA related freeze killed networking and RAID

2007-11-29 Thread Tejun Heo
Phillip Susi wrote:
> Tejun Heo wrote:
>> Agreed.  Nobody cared on ATA controllers is usually very effective at
>> taking the whole machine down.  Is there any reason why we don't turn on
>> irqpoll on turned off IRQs automatically?
> 
> Why does a single spurious interrupt cause it to be shut down?  I can
> see if the interrupt is stuck on and keeps interrupting constantly, but
> if it's just the occasional spurious interrupt, why not just ignore it
> and move on?

Because SFF ATA controller don't have IRQ pending bit.  You don't know
whether IRQ is raised or not.  Plus, accessing the status register which
clears pending IRQ can be very slow on PATA machines.  It has to go
through the PCI and ATA bus and come back.  So, unconditionally trying
to clear IRQ by accessing Status can incur noticeable overhead if the
IRQ is shared with devices which raise a lot of IRQs.

-- 
tejun


Re: Possibly SATA related freeze killed networking and RAID

2007-11-29 Thread Robert Hancock

Phillip Susi wrote:
> Tejun Heo wrote:
>> Agreed.  Nobody cared on ATA controllers is usually very effective at
>> taking the whole machine down.  Is there any reason why we don't turn on
>> irqpoll on turned off IRQs automatically?
>
> Why does a single spurious interrupt cause it to be shut down?  I can
> see if the interrupt is stuck on and keeps interrupting constantly, but
> if it's just the occasional spurious interrupt, why not just ignore it
> and move on?

I'm not certain offhand, but I think there may be such a threshold. 
However, an occasional spurious interrupt isn't likely. For a 
level-triggered interrupt, an unhandled interrupt will keep interrupting 
forever since nobody knows how to clear it (until we decide to disable 
the IRQ entirely).


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Pavel Machek
On Fri 2007-11-30 10:00:55, Mark Lord wrote:
> Pavel Machek wrote:
>> On Fri 2007-11-30 13:13:44, Alan Cox wrote:
>>>> Why does a single spurious interrupt cause it to be shut down?  I can
>>> It doesn't.
>>>
>>>> see if the interrupt is stuck on and keeps interrupting constantly, but
>>>> if it's just the occasional spurious interrupt, why not just ignore it
>>>> and move on?
>>> The interrupt is usually level triggered so it continues to create
>>> interrupts until you silence it. The thresholds are about 10,000
>>> interrupt events and on newer kernels we also reset the count if we don't
>>> see any for a while. That works for most stuff except the thinkpad
>>> bluetooth problem.
>> Which is confirmed hw problem now, btw.
> ...
>
> What problem is that, exactly?

Spurious interrupt, interrupt link is disabled after ~15 minutes. It
seems pretty unique to t61.

> My Dell has an internal USB BT adapter that briefly appears
> and then disappears again on resume (or stays if I have "enabled" it
> via the BIOS key).
>
> I wonder if that has anything to do with the (new in) 2.6.23 pauses
> that machine has on resume (about every 10th time).

No idea, but t61 problem seems different.
Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Tejun Heo
Phillip Susi wrote:
> Tejun Heo wrote:
>> Because SFF ATA controller don't have IRQ pending bit.  You don't know
>> whether IRQ is raised or not.  Plus, accessing the status register which
>> clears pending IRQ can be very slow on PATA machines.  It has to go
>> through the PCI and ATA bus and come back.  So, unconditionally trying
>> to clear IRQ by accessing Status can incur noticeable overhead if the
>> IRQ is shared with devices which raise a lot of IRQs.
> 
> There HAS to be a way to determine if that device generated the
> interrupt, or the interrupt can not be shared.  Since the kernel said
> nobody cared about the interrupt, that indicates that the sata driver
> checked the status register and realized the sata chip didn't generate
> the interrupt, and returned to the kernel letting it know that the
> interrupt was not for it.

Surprise, surprise.  There's no way to tell whether the controller
raised interrupt or not if command is not in progress.  As I said
before, there's no IRQ pending bit.  While processing commands, you can
tell by looking at other status registers but when there's nothing in
flight and the controller determines it's a good time to raise a
spurious interrupt, there's no way you can tell.  That dang SFF
interface is like 15+ years old.

But we can still make things pretty robust.  We're working on it.

Thanks.

-- 
tejun
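
The situation described above can be illustrated with a toy model in
Python. This is only a sketch under stated assumptions, not libata or
kernel code: the names Nic, SffPort and dispatch are invented, and the
model simply assumes that a handler can claim an interrupt only when it
has some way to see that its own device raised it. The NIC is given a
readable pending flag for contrast. The SFF-style port has none, so in
this model it claims an interrupt only while a command is in flight.

```python
class Nic:
    """Device with a readable 'interrupt pending' flag (assumed, for contrast)."""

    def __init__(self):
        self.pending = False

    def handle_irq(self):
        if self.pending:
            self.pending = False     # acknowledge the device
            return True              # "this interrupt was mine"
        return False


class SffPort:
    """SFF-style ATA port modelled without a pending bit (toy model)."""

    def __init__(self):
        self.command_in_flight = False
        self.line_asserted = False   # true device state; the driver cannot read it

    def handle_irq(self):
        if not self.command_in_flight:
            return False             # nothing expected, so nothing to claim
        self.command_in_flight = False
        self.line_asserted = False   # completing the command quiets the device
        return True


def dispatch(handlers):
    """Call every handler on the shared line; report whether any claimed it."""
    results = [h.handle_irq() for h in handlers]   # call all, no short-circuit
    return any(results)


if __name__ == "__main__":
    nic, port = Nic(), SffPort()

    # Normal completion: a command is in flight, the port claims the interrupt.
    port.command_in_flight = True
    port.line_asserted = True
    print("claimed:", dispatch([port, nic]))

    # Spurious: the port asserts the line with nothing in flight.
    port.line_asserted = True
    print("claimed:", dispatch([port, nic]))
```

In the second case neither handler claims the interrupt and, in the
model, line_asserted stays set, which corresponds to the level-triggered
line firing again and again until the kernel gives up on it.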


Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Mark Lord

Pavel Machek wrote:
> On Fri 2007-11-30 13:13:44, Alan Cox wrote:
>>> Why does a single spurious interrupt cause it to be shut down?  I can
>> It doesn't.
>>
>>> see if the interrupt is stuck on and keeps interrupting constantly, but
>>> if it's just the occasional spurious interrupt, why not just ignore it
>>> and move on?
>> The interrupt is usually level triggered so it continues to create
>> interrupts until you silence it. The thresholds are about 10,000
>> interrupt events and on newer kernels we also reset the count if we don't
>> see any for a while. That works for most stuff except the thinkpad
>> bluetooth problem.
> Which is confirmed hw problem now, btw.

...

What problem is that, exactly?

My Dell has an internal USB BT adapter that briefly appears
and then disappears again on resume (or stays if I have "enabled" it
via the BIOS key).

I wonder if that has anything to do with the (new in) 2.6.23 pauses
that machine has on resume (about every 10th time).

Cheers


Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Pavel Machek
On Fri 2007-11-30 13:13:44, Alan Cox wrote:
> > Why does a single spurious interrupt cause it to be shut down?  I can 
> 
> It doesn't.
> 
> > see if the interrupt is stuck on and keeps interrupting constantly, but 
> > if it's just the occasional spurious interrupt, why not just ignore it 
> > and move on?
> 
> The interrupt is usually level triggered so it continues to create
> interrupts until you silence it. The thresholds are about 10,000
> interrupt events and on newer kernels we also reset the count if we don't
> see any for a while. That works for most stuff except the thinkpad
> bluetooth problem.

Which is confirmed hw problem now, btw.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Alan Cox
> Why does a single spurious interrupt cause it to be shut down?  I can 

It doesn't.

> see if the interrupt is stuck on and keeps interrupting constantly, but 
> if it's just the occasional spurious interrupt, why not just ignore it 
> and move on?

The interrupt is usually level triggered so it continues to create
interrupts until you silence it. The thresholds are about 10,000
interrupt events and on newer kernels we also reset the count if we don't
see any for a while. That works for most stuff except the thinkpad
bluetooth problem.
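
The accounting described above can be sketched as a toy model in
Python. The class name IrqLine, the method record, and the reset_after
parameter are hypothetical; the default threshold is just the "about
10,000" figure from this mail. The real logic lives in the kernel's
spurious interrupt handling (note_interrupt appears in the call trace
earlier in this thread), and its exact window and threshold values
should be checked against the kernel source rather than taken from this
sketch.

```python
class IrqLine:
    """Toy model of unhandled-interrupt accounting for one IRQ line."""

    def __init__(self, threshold=10_000, reset_after=0.1):
        self.threshold = threshold       # unhandled events tolerated before disabling
        self.reset_after = reset_after   # quiet time (seconds) after which the count restarts
        self.unhandled = 0
        self.last_unhandled = None
        self.disabled = False

    def record(self, now, handled):
        """Account for one interrupt at time `now`; `handled` is True if a driver claimed it."""
        if self.disabled or handled:
            return
        quiet = self.last_unhandled is not None and now - self.last_unhandled > self.reset_after
        if quiet:
            self.unhandled = 0           # old strays are forgotten
        self.unhandled += 1
        self.last_unhandled = now
        if self.unhandled >= self.threshold:
            self.disabled = True         # the "nobody cared" case: line is shut off


if __name__ == "__main__":
    # Occasional strays, one second apart: the count keeps resetting.
    occasional = IrqLine(threshold=5, reset_after=0.1)
    for i in range(100):
        occasional.record(now=float(i), handled=False)
    print("occasional strays disabled the line:", occasional.disabled)

    # A stuck level-triggered line fires back to back and crosses the threshold.
    stuck = IrqLine(threshold=5, reset_after=0.1)
    for i in range(5):
        stuck.record(now=i * 0.001, handled=False)
    print("stuck line disabled:", stuck.disabled)
```

The small threshold in the usage lines is only there to keep the
example short; the point is that isolated strays are tolerated while a
line that interrupts continuously is eventually disabled.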



Re: Possibly SATA related freeze killed networking and RAID

2007-11-30 Thread Phillip Susi

Tejun Heo wrote:
> Because SFF ATA controller don't have IRQ pending bit.  You don't know
> whether IRQ is raised or not.  Plus, accessing the status register which
> clears pending IRQ can be very slow on PATA machines.  It has to go
> through the PCI and ATA bus and come back.  So, unconditionally trying
> to clear IRQ by accessing Status can incur noticeable overhead if the
> IRQ is shared with devices which raise a lot of IRQs.

There HAS to be a way to determine if that device generated the 
interrupt, or the interrupt can not be shared.  Since the kernel said 
nobody cared about the interrupt, that indicates that the sata driver 
checked the status register and realized the sata chip didn't generate 
the interrupt, and returned to the kernel letting it know that the 
interrupt was not for it.




Re: Possibly SATA related freeze killed networking and RAID

2007-12-03 Thread Phillip Susi

Tejun Heo wrote:
> Surprise, surprise.  There's no way to tell whether the controller
> raised interrupt or not if command is not in progress.  As I said
> before, there's no IRQ pending bit.  While processing commands, you can
> tell by looking at other status registers but when there's nothing in
> flight and the controller determines it's a good time to raise a
> spurious interrupt, there's no way you can tell.  That dang SFF
> interface is like 15+ years old.
>
> But we can still make things pretty robust.  We're working on it.
>
> Thanks.

It sounds like you mean that you know the controller did NOT raise the 
interrupt ( intentionally/correctly ) if there was no command in 
progress, as opposed to not being able to tell.  Unless there is some 
condition under which it is valid for the controller to raise an 
interrupt when it had no commands in progress?  And if that's the case 
and there's no way to know WHY, that's a broken design.




Re: Possibly SATA related freeze killed networking and RAID

2007-12-03 Thread Tejun Heo
Phillip Susi wrote:
> Tejun Heo wrote:
>> Surprise, surprise.  There's no way to tell whether the controller
>> raised interrupt or not if command is not in progress.  As I said
>> before, there's no IRQ pending bit.  While processing commands, you can
>> tell by looking at other status registers but when there's nothing in
>> flight and the controller determines it's a good time to raise a
>> spurious interrupt, there's no way you can tell.  That dang SFF
>> interface is like 15+ years old.
>>
>> But we can still make things pretty robust.  We're working on it.
> 
> It sounds like you mean that you know the controller did NOT raise the
> interrupt ( intentionally/correctly ) if there was no command in
> progress, as opposed to not being able to tell.  Unless there is some
> condition under which it is valid for the controller to raise an
> interrupt when it had no commands in progress?  And if that's the case
> and there's no way to know WHY, that's a broken design.

If everything works correctly, all interrupts can be accounted for.
It's just that there's no margin for erratic behaviors and most ATA
controllers are built really cheap.  So, yeah, it's a 15+ years old
half-broken design.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possibly SATA related freeze killed networking and RAID

2007-12-10 Thread noah
2007/11/21, noah <[EMAIL PROTECTED]>:
> 2007/11/21, Alan Cox <[EMAIL PROTECTED]>:
> > > I've had other freezes before but this was the first time I was able
> > > to see what was actually going on.
> > > IRQ 21 appears to be shared between sata_nv and ethernet.
> > >
> > > Does this mean my hardware/BIOS is broken somehow?
> >
> > > Not necessarily. It could be a bug in one of the drivers using IRQ 21
> > (sata_nv or the nvidia ethernet), it could be another inactive device, or
> > it could be a hardware funny.
>
> How can I tell if there's an inactive device?
>
> > Nvidia stuff can be quite hard to diagnose as we have no documentation
> > but we can try. The first question is whether it is network or disk
> > triggered - seeing if heavy loads to one or the other trigger the problem
> > might be a first plan.
>
> I haven't managed to trigger it again yet but at the time the CPU was
> heavily loaded and I was re-indexing a database which caused a lot of
> disk activity. I'm quite confident the network was pretty much idle at
> the time.

The same thing has happened twice now, both during the weekly check of
the md0 and md1 RAID1-arrays. That is, networking on the primary
interface is dead. Its interrupt (irq 21) is shared between sata_nv
and forcedeth.

Is there anything I can do to debug this problem?

I don't have access to the logs right now but will have later.

  -- noah