Re: [E1000-devel] e1000e: NVM corrupted (kernel 2.6.16.y)
Hi Jesse,

> > Other than that I'm fine evaluating that patch in our testlab.
>
> any news on the evaluation?

After checking the driver I think it's best to continuously run an offline selftest, as the driver seems to use the SWSM/SWSM2 registers somewhere below there. The first interface of the dual-port adapter is UP and continuously sending traffic; the other interface is DOWN and being tested. All is fine so far.

If you are fine with that test setup I'll keep it running until Monday.

/holger

___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
Re: [E1000-devel] e1000e: NVM corrupted (kernel 2.6.16.y)
On Fri, 24 Apr 2009, Holger Eitzenberger wrote:
> > > Other than that I'm fine evaluating that patch in our testlab.
> >
> > any news on the evaluation?
>
> After checking the driver I think it's best to continuously run an offline selftest, as the driver seems to use the SWSM/SWSM2 registers somewhere below there. The first interface of the dual-port adapter is UP and continuously sending traffic; the other interface is DOWN and being tested. All is fine so far.
>
> If you are fine with that test setup I'll keep it running until Monday.

How does this test relate to the original report of the NVM corruption? Was that the kind of test you were running on the interfaces that had reported corruption?

Otherwise, just by itself, the test sounds fine.

Jesse
Re: [E1000-devel] e1000e: NVM corrupted (kernel 2.6.16.y)
> The bit doesn't quite work the same as the original SWSM lock bit. And we are only trying to solve the problem of "has a driver loaded on either port yet?" and since probe is not parallelizable, we are guaranteed not to have a race here, or be preempted (to the point another probe could run)

Thanks, that explains a lot!

> > Other than that I'm fine evaluating that patch in our testlab.
>
> any news on the evaluation?

I think I can do that tomorrow. The patch at least applied cleanly against my 2.6.29 test kernel. I'll do some testing with ethtool then.

/holger
Re: [E1000-devel] e1000e: NVM corrupted (kernel 2.6.16.y)
> no, doesn't apply to this hardware, as it is not ICH (integrated LOM) it is a standalone 82571 with an actual discrete eeprom chip per port.
>
> holger, you're welcome to try this patch, it was made against net-2.6 from a couple of weeks ago

Thanks Jesse, I have a few questions after looking at your patch:

* The older all-in-one e1000 driver does not use SWSM2 on the dual-port adapters. Does that mean it is affected as well? I ask because I need to justify the driver update.

* I was unable to locate SWSM2 in the documentation of the 82571EB. However, its usage seems to be similar to SWSM. Referring to this snippet here:

      swsm2 = er32(SWSM2);
      if (!(swsm2 & E1000_SWSM2_LOCK)) {
              /* Only do this for the first interface on this card */
              ew32(SWSM2, swsm2 | E1000_SWSM2_LOCK);

  I see a general race condition, because the patch doesn't check SWSM2 after writing it, and I see nothing that prevents preemption after reading SWSM2 the first time. Therefore, going by the documentation, something like

      ew32(SWSM2, swsm2 | E1000_SWSM2_LOCK);
      swsm2 = er32(SWSM2);
      if (swsm2 & E1000_SWSM2_LOCK) {
              /* now you are sure you have the lock */
      }

  should be more correct. Please note, however, that I do not have documentation about SWSM2 in particular. If my assumption above about its workings is not correct, please just ignore this last issue.

Other than that I'm fine evaluating that patch in our testlab.

Many thanks! :)

/holger
Re: [E1000-devel] e1000e: NVM corrupted (kernel 2.6.16.y)
> > e1000e: write protect ICHx NVM to prevent malicious write/erase
>
> no, doesn't apply to this hardware, as it is not ICH (integrated LOM) it is a standalone 82571 with an actual discrete eeprom chip per port.
>
> > to include our modules as well in order to find out who is overwriting memory?
>
> it is highly unlikely something is succeeding in writing to the eeprom, however, we do know of some locking issues in the driver that we've been resolving specifically for 82571 and that might somehow be related.

Do you refer to these two here, or something different?

    e1000e: do not ever sleep in interrupt context
    e1000e: reset swflag after resetting hardware

Regards.

/holger
Re: [E1000-devel] e1000e: NVM corrupted (kernel 2.6.16.y)
On Mon, 20 Apr 2009, Holger Eitzenberger wrote:
> > > e1000e: write protect ICHx NVM to prevent malicious write/erase
> >
> > no, doesn't apply to this hardware, as it is not ICH (integrated LOM) it is a standalone 82571 with an actual discrete eeprom chip per port.
> >
> > > to include our modules as well in order to find out who is overwriting memory?
> >
> > it is highly unlikely something is succeeding in writing to the eeprom, however, we do know of some locking issues in the driver that we've been resolving specifically for 82571 and that might somehow be related.
>
> Do you refer to these two here, or something different?
>
>     e1000e: do not ever sleep in interrupt context
>     e1000e: reset swflag after resetting hardware

there is a different patch, currently under internal test; we hope to release it soon in a new e1000e driver patch to the kernel once it has completed testing.
[E1000-devel] e1000e: NVM corrupted (kernel 2.6.16.y)
Kernel v2.6.16.y (+ patches)
e1000e v0.5.11.2

Hi,

I'm facing NVM corruption on both ports of a dual-port NIC module:

    PCI: Enabling device :09:00.0 ( - 0003)
    ACPI: PCI Interrupt :09:00.0[A] - GSI 19 (level, low) - IRQ 193
    PCI: Setting latency timer of device :09:00.0 to 64
    :09:00.0: :09:00.0: The NVM Checksum Is Not Valid
    ACPI: PCI interrupt for device :09:00.0 disabled
    e1000e: probe of :09:00.0 failed with error -5
    PCI: Enabling device :09:00.1 ( - 0003)
    ACPI: PCI Interrupt :09:00.1[B] - GSI 16 (level, low) - IRQ 169
    PCI: Setting latency timer of device :09:00.1 to 64
    :09:00.1: :09:00.1: The NVM Checksum Is Not Valid

Output of lspci is available here [1], here [2] and here [3].

There are three other identical modules in that box which do not show the issue. Reportedly the interfaces more or less worked before the upgrade (the previous version was the all-in-one e1000 driver v7.6.15.5). However, both interfaces reportedly failed several times before the driver upgrade.

With regard to http://lkml.org/lkml/2008/9/25/510 and the patches mentioned therein, I backported specifically

    e1000e: allow bad checksum

As expected, the interface does not work after that, but the output is different:

    PCI: Enabling device :09:00.1 ( - 0003)
    ACPI: PCI Interrupt :09:00.1[B] - GSI 16 (level, low) - IRQ 169
    PCI: Setting latency timer of device :09:00.1 to 64
    :09:00.1: :09:00.1: The NVM Checksum Is Not Valid
    :09:00.1: :09:00.1: Invalid MAC Address: 00:00:00:00:00:00
    :09:00.1: eth7: (PCI Express:2.5GB/s:Width x4) f79b4118M
    :09:00.1: eth7: Intel(R) PRO/1000 Network Connection
    :09:00.1: eth7: MAC: 1, PHY: 1, PBA No: ff-0ff

As I'm only a bit familiar with the HW documentation available for 82571EB modules, I need your help:

1. Can I safely modify the commit 4a7703582836f55 (Linus tree)

       e1000e: write protect ICHx NVM to prevent malicious write/erase

   to include our modules as well, in order to find out who is overwriting memory?

* I copied an apparently correct EEPROM from another box (ethtool -e) and tried to apply it (ethtool -E) on the broken box:

      # ethtool -E eth6 e1000e-eeprom-eth6
      Cannot set EEPROM data: Invalid argument

  (I specifically made sure that the above-mentioned patch to write-protect the NVRAM was disabled.) Maybe I'm just stupid, but what is wrong here?

Any help welcome.

Regards.

/holger

[1] http://people.astaro.com/heitzenberger/e1000e/lspci_tv
[2] http://people.astaro.com/heitzenberger/e1000e/lspci_vvx
[3] http://people.astaro.com/heitzenberger/e1000e/lspci_vvxn
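For context on the "NVM Checksum Is Not Valid" message: the driver's check sums the first 64 16-bit EEPROM words (word offsets 0x00 through 0x3F) and expects the low 16 bits of the result to be 0xBABA. The following is a minimal Python sketch of that check, not driver code. It assumes a raw binary dump (for example one taken with `ethtool -e ethX raw on`) in which each word is stored little-endian; the function names are illustrative only.

```python
import struct

NVM_CHECKSUM_WORDS = 0x40   # words 0x00..0x3F are covered by the checksum
NVM_SUM = 0xBABA            # value the 16-bit sum of those words must equal


def read_words(dump: bytes) -> tuple:
    """Decode the first 64 little-endian 16-bit words of a raw dump."""
    needed = NVM_CHECKSUM_WORDS * 2
    if len(dump) < needed:
        raise ValueError("dump is shorter than %d bytes" % needed)
    return struct.unpack("<%dH" % NVM_CHECKSUM_WORDS, dump[:needed])


def nvm_checksum(dump: bytes) -> int:
    """Return the 16-bit sum of words 0x00..0x3F."""
    return sum(read_words(dump)) & 0xFFFF


def nvm_checksum_ok(dump: bytes) -> bool:
    """True if the dump would pass the 0xBABA checksum test."""
    return nvm_checksum(dump) == NVM_SUM


def expected_checksum_word(dump: bytes) -> int:
    """Value word 0x3F must hold so that the total comes out as 0xBABA."""
    words = read_words(dump)
    return (NVM_SUM - sum(words[:-1])) & 0xFFFF
```

An all-zero dump, such as one would expect alongside the 00:00:00:00:00:00 MAC address above, sums to 0 and therefore fails the test. Comparing `nvm_checksum()` across the good and bad modules is one quick way to see whether only the checksum word or the whole image differs.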
Re: [E1000-devel] e1000e: NVM corrupted (kernel 2.6.16.y)
On Fri, 17 Apr 2009, Holger Eitzenberger wrote:
> Kernel v2.6.16.y (+ patches)
> e1000e v0.5.11.2
>
> Hi,

Hi again Holger,

> I'm facing NVM corruption on both ports of a dual-port NIC module:
>
>     PCI: Enabling device :09:00.0 ( - 0003)
>     ACPI: PCI Interrupt :09:00.0[A] - GSI 19 (level, low) - IRQ 193
>     PCI: Setting latency timer of device :09:00.0 to 64
>     :09:00.0: :09:00.0: The NVM Checksum Is Not Valid
>     ACPI: PCI interrupt for device :09:00.0 disabled
>     e1000e: probe of :09:00.0 failed with error -5
>     PCI: Enabling device :09:00.1 ( - 0003)
>     ACPI: PCI Interrupt :09:00.1[B] - GSI 16 (level, low) - IRQ 169
>     PCI: Setting latency timer of device :09:00.1 to 64
>     :09:00.1: :09:00.1: The NVM Checksum Is Not Valid
>
> Output of lspci is available here [1], here [2] and here [3].

The only link that works is the first, but I do see that you have 82571 parts.

> There are three other identical modules in that box which do not show the issue. Reportedly the interfaces more or less worked before the upgrade (the previous version was the all-in-one e1000 driver v7.6.15.5). However, both interfaces reportedly failed several times before the driver upgrade.
>
> With regard to http://lkml.org/lkml/2008/9/25/510 and the patches mentioned therein, I backported specifically
>
>     e1000e: allow bad checksum
>
> As expected, the interface does not work after that, but the output is different:
>
>     PCI: Enabling device :09:00.1 ( - 0003)
>     ACPI: PCI Interrupt :09:00.1[B] - GSI 16 (level, low) - IRQ 169
>     PCI: Setting latency timer of device :09:00.1 to 64
>     :09:00.1: :09:00.1: The NVM Checksum Is Not Valid
>     :09:00.1: :09:00.1: Invalid MAC Address: 00:00:00:00:00:00
>     :09:00.1: eth7: (PCI Express:2.5GB/s:Width x4) f79b4118M
>     :09:00.1: eth7: Intel(R) PRO/1000 Network Connection
>     :09:00.1: eth7: MAC: 1, PHY: 1, PBA No: ff-0ff

great! okay, please send the output of ethtool -e for each bad interface (if you attach it to the list as a .txt file it will be let through)

> As I'm only a bit familiar with the HW documentation available for 82571EB modules, I need your help:
>
> 1. Can I safely modify the commit 4a7703582836f55 (Linus tree)
>
>        e1000e: write protect ICHx NVM to prevent malicious write/erase

no, doesn't apply to this hardware, as it is not ICH (integrated LOM); it is a standalone 82571 with an actual discrete eeprom chip per port.

>    to include our modules as well, in order to find out who is overwriting memory?

it is highly unlikely something is succeeding in writing to the eeprom, however, we do know of some locking issues in the driver that we've been resolving specifically for 82571 and that might somehow be related.

> * I copied an apparently correct EEPROM from another box (ethtool -e) and tried to apply it (ethtool -E) on the broken box:
>
>       # ethtool -E eth6 e1000e-eeprom-eth6
>       Cannot set EEPROM data: Invalid argument

unfortunately the eeprom cannot be written in one big hunk this way; the command will only write a byte at a time. The correct command (for each byte) looks something like this, assuming your device id in lspci -n is 8086:1060:

    ethtool -E eth2 magic 0x10608086 offset 0x10 value 0xfe

so some script that reads the data file and writes one byte at a time with the above command should be used.

> (I specifically made sure that the above-mentioned patch to write-protect the NVRAM was disabled.) Maybe I'm just stupid, but what is wrong here?

there is no write protect (AFAIK) like you refer to when we're using eeprom, only NVM (like flash memory)

if you have access to premier.intel.com you already have an NDA with us and can probably get hold of our manufacturing tool eeupdate, which will reprogram the eeprom for you. If not, you should talk to your local field agent.

do you have anything during your boot process that is using ethtool commands frequently on either interface on the MAC that is having problems? (one eeprom is shared for each pair of ports)
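The byte-at-a-time script described above could look roughly like the following Python sketch. It only builds the `ethtool -E` command lines so they can be reviewed before anything is written. The magic value follows the example in the message (device ID in the high 16 bits, vendor ID in the low 16 bits); the helper names are illustrative, and the IDs passed in must be taken from `lspci -n` on the actual system rather than copied from the 8086:1060 example.

```python
def eeprom_magic(vendor_id: int, device_id: int) -> int:
    """Build the magic value: device ID in bits 31..16, vendor ID in bits 15..0."""
    return ((device_id & 0xFFFF) << 16) | (vendor_id & 0xFFFF)


def ethtool_write_commands(ifname, data, vendor_id, device_id, start=0):
    """Yield one 'ethtool -E' argument list per byte of data.

    data is the raw EEPROM image (bytes); start is the byte offset in the
    EEPROM at which data[0] should be written.
    """
    magic = "0x%08x" % eeprom_magic(vendor_id, device_id)
    for offset, value in enumerate(data, start=start):
        yield [
            "ethtool", "-E", ifname,
            "magic", magic,
            "offset", "0x%x" % offset,
            "value", "0x%02x" % value,
        ]


if __name__ == "__main__":
    # Illustration only: three made-up bytes, using the IDs from the example.
    sample = bytes([0x00, 0x1B, 0x21])
    for cmd in ethtool_write_commands("eth2", sample, 0x8086, 0x1060):
        print(" ".join(cmd))
```

Printing the commands instead of running them keeps the sketch side-effect free; once the list looks right, each argument list could be handed to `subprocess.run(cmd, check=True)` so that the first failing write stops the loop.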