Victor Balada Diaz wrote:
Hello,

I have several machines[1] at hetzner.de and I've been having problems
with interrupts on FreeBSD 7.0 and now FreeBSD 7.1-BETA2 on amd64. I've
been trying to narrow down the problem so that someone more knowledgeable
than me can fix it. This mail is another attempt to ask a question
about the ATA code, to see if this time I have found something.

For those who don't know what happened:

With FreeBSD 7.0-RELEASE for amd64 and the default kernel,
the system shared the re0 interrupt with OHCI, and this caused
re(4) to corrupt packets and create interrupt storms. I tried
updating to 7.1-BETA2 and still had some problems with it.

I opened PR kern/128287[2] and Remko quickly answered
with a workaround: removing USB support from my kernel.
I did so, and re(4) was no longer sharing interrupts and
the interrupt storms were gone. Now, some time later, the interface
still goes up and down from time to time, but less often. Also, sometimes
the machine loses the network interface but continues to work.

I know it continues to work because some days later I can see that
it tried to deliver the status reports but was unable to resolve the
aliases' hostnames. I can't ping the machine, and I know the network
is OK. If I reboot the machine, everything works again.

When I switched from 7.0 to 7.1-BETA2 I also found that, under load,
after some hours the machine created interrupt storms on the ATA disks.

Digging through the Linux source code, I found that they do some special
things for this chipset that I have been unable to find in our code. This
is the Linux code for my chipset:

                AHCI_HFLAGS     (AHCI_HFLAG_IGN_SERR_INTERNAL |
                                 AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
                                 AHCI_HFLAG_SECT255),

The file and the rest of the code are here[3].
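
To illustrate how host flags of this kind are usually consumed, here is a
minimal sketch. The names and bit values below are hypothetical and are not
the real Linux or FreeBSD definitions; the only assumption is that each
quirk is one bit in a mask that the driver tests before enabling a feature.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical quirk bits, for illustration only. These are NOT the
 * actual values used by the Linux or FreeBSD drivers. */
enum {
    QUIRK_IGN_SERR_INTERNAL = 1u << 0,
    QUIRK_32BIT_ONLY        = 1u << 1,
    QUIRK_NO_MSI            = 1u << 2,
    QUIRK_SECT255           = 1u << 3
};

/* Flags OR-ed together for one chipset, mirroring the quoted entry. */
static const uint32_t example_quirks =
    QUIRK_IGN_SERR_INTERNAL | QUIRK_32BIT_ONLY |
    QUIRK_NO_MSI | QUIRK_SECT255;

/* MSI may be used only if the NO_MSI bit is clear. */
static bool may_use_msi(uint32_t quirks)
{
    return (quirks & QUIRK_NO_MSI) == 0;
}

/* 64-bit DMA may be used only if the 32BIT_ONLY bit is clear. */
static bool may_use_64bit_dma(uint32_t quirks)
{
    return (quirks & QUIRK_32BIT_ONLY) == 0;
}
```

With a mask like this, the quirk applies to a single controller rather than
to the whole system.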

When I saw AHCI_HFLAG_NO_MSI I tried the easiest thing I could
think of, switching MSI and MSI-X off for the whole system, so
I added these tunables to /boot/loader.conf:

hw.pci.enable_msix="0"
hw.pci.enable_msi="0"

And then rebooted the machine. After several hours of doing almost nothing,
I found that the machine answered ping but was unable to answer any other
request (e.g. ssh, nagios nrpe, etc.). The machine recovered by itself after
some minutes, and when I was able to ssh in I saw the following in dmesg:

ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1463123158

and a lot more errors like that. I didn't get these errors with MSI enabled.
I see WRITE_DMA48, and in the Linux code I saw AHCI_HFLAG_32BIT_ONLY, which is
later used for DMA-related things. Could someone who is more knowledgeable
check whether we're doing the right thing?
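
For context on the two names: WRITE_DMA48 is the ATA write command that uses
48-bit LBA sector addressing, whereas a 32-bit-only flag concerns the memory
addresses the controller can reach for DMA, so the two are not necessarily
related. A minimal sketch of the kind of check a 32-bit-only restriction
implies follows; the helper name is hypothetical and the code is not taken
from either kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the buffer [addr, addr + len)
 * lies entirely below 4 GiB, i.e. it is reachable by a controller
 * that can only generate 32-bit DMA addresses. */
static bool dma_buffer_fits_32bit(uint64_t addr, uint64_t len)
{
    const uint64_t limit = UINT64_C(1) << 32;

    /* Reject empty buffers and lengths that cannot fit at all. */
    if (len == 0 || len > limit)
        return false;

    /* Written this way to avoid overflow in addr + len. */
    return addr <= limit - len;
}
```

A buffer that fails such a check has to be copied through memory below
4 GiB before the controller can use it. On FreeBSD that limit is normally
expressed through the busdma tag's address restrictions rather than checked
by hand.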

I've attached a verbose dmesg from a machine identical to this one, running
7.1-BETA2 with MSI enabled and a GENERIC kernel minus USB and FireWire.

Also, please, could someone give me a hand with how to continue debugging
these interrupt issues? I'm a bit lost, and digging through code and posting
each time I think I've found something is not going to get anywhere.

I would also like to say that I've seen reports of this kind of problem
on amd64 machines on the lists for several years, so I don't think
this is just a problem with this BIOS/motherboard (MSI K9AG Neo2 Digital).


Thanks in advance for any help.
Regards.


[1]: http://www.hetzner.de/hosting/produkte_rootserver/ds7000/
[2]: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/128287
[3]: http://fxr.watson.org/fxr/source/drivers/ata/ahci.c?v=linux-2.6#L369

Sorry, I didn't take the time to read the whole thread, but I had a similar problem with the same IXP600 chipset. Only it wasn't with a Realtek NIC (re) but with a Ralink wireless one. The symptoms were similar: interrupt 22 was shared between the SATA controller and the wireless card, and I got interrupt storms at random times when using the wireless network.

No problems since I removed the ral(4) NIC (I have a real access point now).
You might not want to point the finger at the re(4) driver too quickly.

Arnaud Houdelette


_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable