[Bug 263555] Re: [intrepid] 2.6.27 e1000e driver places Intel ICH8 and ICH9 gigE chipsets at risk

Bug Watch Updater Wed, 04 Jul 2018 09:48:01 -0700

Launchpad has imported 129 comments from the remote bug at
https://bugzilla.novell.com/show_bug.cgi?id=425480.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2008-09-11T13:05:40+00:00 Stbinner wrote:

Today I updated my workstation, after two or three weeks, to the current
Factory kernel, and since then the onboard network card no longer shows
up:

 e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2
 e1000e: Copyright (c) 1999-2008 Intel Corporation.
 e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
 e1000e 0000:00:19.0: setting latency timer to 64
 input: PC Speaker as /devices/platform/pcspkr/input/input3
 0000:00:19.0: 0000:00:19.0: The NVM Checksum Is Not Valid
 e1000e 0000:00:19.0: PCI INT A disabled
 e1000e: probe of 0000:00:19.0 failed with error -5

Booted an openSUSE 11.0 installation and the same issue shows up there
now too. Did some BIOS data or checksum get broken?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/10

------------------------------------------------------------------------
On 2008-09-11T13:06:33+00:00 Stbinner wrote:

Created attachment 239061
dmesg of Factory installation

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/11

------------------------------------------------------------------------
On 2008-09-11T13:07:04+00:00 Stbinner wrote:

Created attachment 239062
dmesg of 11.0 installation

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/12

------------------------------------------------------------------------
On 2008-09-21T15:25:56+00:00 Stbinner wrote:

*** Bug 428180 has been marked as a duplicate of this bug. ***

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/20

------------------------------------------------------------------------
On 2008-09-21T16:29:54+00:00 Karsten-keil wrote:

Intel cleaned up e1000 and e1000e so that no PCI IDs are duplicated in both 
drivers. Maybe they removed this ID from the wrong driver.
If the card is not detected, can you please try to unload e1000e and load e1000 
manually, then add the IDs to the driver at runtime:

echo "vendor device subvendor subdevice class class_mask driver_data" > \
/sys/bus/pci/drivers/e1000/new_id

All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory, the others are optional.
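
For illustration, a minimal Python sketch of how such a new_id line can be
assembled. The helper name format_new_id is hypothetical, and the sketch
assumes that optional fields are supplied in the order listed above.

```python
def format_new_id(vendor, device, *optional):
    """Build a new_id line: hex fields, no leading 0x, space separated.

    vendor and device are mandatory; optional holds, in order, any of
    subvendor, subdevice, class, class_mask, driver_data.
    """
    if len(optional) > 5:
        raise ValueError("at most five optional fields are accepted")
    return " ".join(format(value, "x") for value in (vendor, device, *optional))


if __name__ == "__main__":
    # Intel vendor ID with one of the device IDs reported in this bug.
    print(format_new_id(0x8086, 0x104B))
```

With vendor 0x8086 and device 0x104b this yields "8086 104b". Writing that
string to /sys/bus/pci/drivers/e1000/new_id needs root and a loaded e1000
module, so the sketch only formats it.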

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/21

------------------------------------------------------------------------
On 2008-09-21T17:26:30+00:00 Andreas Jaeger wrote:

See bug 428180 - I can load the e1000 just fine but it does not work at
all.

after:  echo "8086 104b" > /sys/bus/pci/drivers/e1000/new_id

I see the following in dmesg:
e1000 0000:00:19.0: setting latency timer to 64
e1000: 0000:00:19.0: e1000_probe: The EEPROM Checksum Is Not Valid
/*********************/
Current EEPROM Checksum : 0xffff
Calculated              : 0xbaf9
Offset    Values
========  ======
00000000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
00000010: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
00000020: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
00000030: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
00000040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
00000050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
00000060: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
00000070: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
Include this output when contacting your support provider.
This is not a software error! Something bad happened to your hardware or
EEPROM image. Ignoring this problem could result in further problems,
possibly loss of data, corruption or system hangs!
The MAC Address will be reset to 00:00:00:00:00:00, which is invalid
and requires you to set the proper MAC address manually before continuing
to enable this network device.
Please inspect the EEPROM dump and report the issue to your hardware vendor
or Intel Customer Support.
/*********************/
e1000: 0000:00:19.0: e1000_probe: Invalid MAC Address
e1000: 0000:00:19.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 00:00:00:00:00:00
e1000: 0000:00:19.0: e1000_probe: This device (id 8086:104b) will no longer be supported by this driver in the future.
e1000: 0000:00:19.0: e1000_probe: please use the "e1000e" driver instead.
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
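
The "Calculated" value in the dump above can be reproduced by hand. The
Python sketch below assumes the e1000 convention that the 16-bit words 0x00
through 0x3F must sum to 0xBABA, with word 0x3F holding the checksum. The
function names are illustrative and not taken from the driver.

```python
EEPROM_SUM = 0xBABA     # assumed target sum of words 0x00..0x3F
CHECKSUM_WORD = 0x3F    # assumed index of the word that stores the checksum


def words_from_bytes(data):
    """Pair bytes into little-endian 16-bit words."""
    return [data[i] | (data[i + 1] << 8) for i in range(0, len(data) - 1, 2)]


def calculated_checksum(words):
    """Value word 0x3F must hold so that words 0x00..0x3F sum to EEPROM_SUM."""
    partial = sum(words[:CHECKSUM_WORD]) & 0xFFFF
    return (EEPROM_SUM - partial) & 0xFFFF


if __name__ == "__main__":
    # The dump above: 0x80 bytes, every one of them 0xff.
    words = words_from_bytes(bytes([0xFF]) * 0x80)
    print(hex(words[CHECKSUM_WORD]), hex(calculated_checksum(words)))
```

For the all-ff image the stored word is 0xffff and the computed value is
0xbaf9, which matches the two numbers printed by the driver.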

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/22

------------------------------------------------------------------------
On 2008-09-21T18:31:10+00:00 Karsten-keil wrote:

I fear that the EEPROM was erased. The commit below may explain it: the fix 
is for e1000, but it seems that e1000e has the same problem.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=78566fecbb12a7616ae9a88b2ffbc8062c4a89e3

I hope that Intel can help here and has a way to reprogram the EEPROM.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/23

------------------------------------------------------------------------
On 2008-09-22T12:14:22+00:00 John Ronciak wrote:

And how was the EEPROM deleted?  This is very hard to do without having
things wrong with the system itself.

First is the kernel 2.6.27-rcx?  If so did the system experience a panic
of some sort (not NIC related)?  Was some other tool run on the system
once it got into this state?  Also, does this system have some sort
of manageability enabled on it?  If so, disable it and try things again.
Let me know about the other questions.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/24

------------------------------------------------------------------------
On 2008-09-22T13:03:58+00:00 Karsten-keil wrote:

Yes it is based on 2.6.27-rc6 and we have no idea how the system got into
this state, but we have multiple reports now :-(, all the same:
installing Beta1 with an e1000e card (I will collect PCI ids of all
reports later), and during the first driver load you see:

e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2
e1000e: Copyright (c) 1999-2008 Intel Corporation.
e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
e1000e 0000:00:19.0: setting latency timer to 64
input: PC Speaker as /devices/platform/pcspkr/input/input3
0000:00:19.0: 0000:00:19.0: The NVM Checksum Is Not Valid
e1000e 0000:00:19.0: PCI INT A disabled
e1000e: probe of 0000:00:19.0 failed with error -5

in the dmesg output (for more details see attachment #1 for the full log).
If you then try to boot into another OS version (which was working before), the 
network card no longer works and shows the same error, which makes me think that
the EEPROM was overwritten or deleted. Later I found the commit for e1000 in 
later kernels (comment #7), which sounds somehow related to me.

Our e1000e driver differs from mainline in 3 additional patches requested by
Kent Liu (in CC now)
1. http://tinyurl.com/6253yl
2. http://tinyurl.com/5bd8v2
3. http://tinyurl.com/6rj8j7

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/25

------------------------------------------------------------------------
On 2008-09-22T13:22:41+00:00 Karsten-keil wrote:

Stephan reading your first comment again, you did only install a new
kernel, you did not install Beta1, correct ?

Then it is unlikely that some program or configuration probing is causing this; 
it is the e1000e driver itself.
Please also give us the PCI IDs of your e1000e card.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/26

------------------------------------------------------------------------
On 2008-09-22T13:33:10+00:00 Karsten-keil wrote:

*** Bug 428322 has been marked as a duplicate of this bug. ***

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/27

------------------------------------------------------------------------
On 2008-09-22T15:20:20+00:00 Stbinner wrote:

> you did only install a new kernel, you did not install Beta1, correct
?

Correct, just "zypper dup" + reboot. openSUSE 11.1 Beta 1 didn't exist
yet when I reported that bug. :-)

29: PCI 19.0: 0200 Ethernet controller
  [Created at pci.318]
  UDI: /org/freedesktop/Hal/devices/pci_8086_294c
  Unique ID: kpGf.nWnnnRlG_JE
  SysFS ID: /devices/pci0000:00/0000:00:19.0
  SysFS BusID: 0000:00:19.0
  Hardware Class: network
  Model: "Intel 82566DC-2 Gigabit Network Connection"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x294c "82566DC-2 Gigabit Network Connection"
  SubVendor: pci 0x8086 "Intel Corporation"
  SubDevice: pci 0x0000
  Revision: 0x02
  Memory Range: 0x92200000-0x9221ffff (rw,non-prefetchable)
  Memory Range: 0x92224000-0x92224fff (rw,non-prefetchable)
  I/O Ports: 0x3400-0x341f (rw)
  IRQ: 216 (no events)
  Module Alias: "pci:v00008086d0000294Csv00008086sd00000000bc02sc00i00"
  Driver Info #0:
    Driver Status: e1000e is active
    Driver Activation Cmd: "modprobe e1000e"
  Config Status: cfg=no, avail=yes, need=no, active=unknown
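
The Module Alias line above packs the same IDs into one string. A small
Python sketch for decoding it, assuming the usual
pci:v...d...sv...sd...bc...sc...i... layout with uppercase hex fields. The
function name is made up for this example.

```python
import re

# One named group per field of a PCI modalias string.
MODALIAS_RE = re.compile(
    r"^pci:v(?P<vendor>[0-9A-F]{8})d(?P<device>[0-9A-F]{8})"
    r"sv(?P<subvendor>[0-9A-F]{8})sd(?P<subdevice>[0-9A-F]{8})"
    r"bc(?P<base_class>[0-9A-F]{2})sc(?P<sub_class>[0-9A-F]{2})"
    r"i(?P<interface>[0-9A-F]{2})$"
)


def parse_pci_modalias(alias):
    """Return the modalias fields as a dict of integers."""
    match = MODALIAS_RE.match(alias)
    if match is None:
        raise ValueError("not a PCI modalias: %r" % alias)
    return {name: int(value, 16) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    ids = parse_pci_modalias(
        "pci:v00008086d0000294Csv00008086sd00000000bc02sc00i00"
    )
    print(format(ids["vendor"], "04x"), format(ids["device"], "04x"))
```

For the alias reported here, the vendor decodes to 8086 and the device to
294c, the pair one would pass to new_id or look up in the driver's ID table.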

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/28

------------------------------------------------------------------------
On 2008-09-22T18:38:39+00:00 Andreas Jaeger wrote:

I have an 11.0 system with just this new kernel.  I even went into
Windows (which could not get a dhcp address) and installed a new BIOS
version.  the BIOS shows me that my macid has only FFs.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/29

------------------------------------------------------------------------
On 2008-09-22T19:14:46+00:00 Andreas Jaeger wrote:

Karsten, is the fix mentioned in #7 part of our kernel CVS?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/30

------------------------------------------------------------------------
On 2008-09-22T20:15:18+00:00 Karsten-keil wrote:

Yes with the update to rc7. But note, this is not a fix to this issue.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/31

------------------------------------------------------------------------
On 2008-09-22T20:33:55+00:00 Karsten-keil wrote:

Some more info about affected PCI IDs:
We got 2 reports about (this and bug #428322)
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x294c "82566DC-2 Gigabit Network Connection"

And one report with (bug #428180)
Vendor: pci 0x8086 "Intel Corporation"
Device: pci 0x104b "82566DC Gigabit Network Connection"

And I got one report about a working installation of Beta1 with e1000e
driver

Vendor: pci 0x8086 "Intel Corporation"
Device: pci 0x109a "82573L Gigabit Ethernet Controller"

maybe that helps.

John I did add you to the duplicate bugs as well.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/32

------------------------------------------------------------------------
On 2008-09-22T21:11:43+00:00 Bob Mahar wrote:

In just looking at this quickly, I think something is getting
balled up in one of the e1000_nvm_operations structures.  This would misdirect
the driver to the improper NVM operation and potentially cause the erasure of
the EEPROM.   

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/33

------------------------------------------------------------------------
On 2008-09-22T23:23:28+00:00 Jesse Brandeburg wrote:

bob can you elaborate?

we have reports (linked at debian bugzilla) of a user that had a
graphics panic and then ran into the issue.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/38

------------------------------------------------------------------------
On 2008-09-22T23:33:13+00:00 Jkosina-d wrote:

Upstream bug references to this bug:

        http://lkml.org/lkml/2008/8/8/123
        http://lkml.org/lkml/2008/9/22/23
        http://bugzilla.kernel.org/show_bug.cgi?id=11382

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/39

------------------------------------------------------------------------
On 2008-09-22T23:46:52+00:00 Andreas Jaeger wrote:

In my case: I saw the error message with a previous 2.6.27 kernel first
but did not report it :-(  My log files show the first occurrence on the
10th of September which would mean that this was 2.6.27-rc6 or even
2.6.27-rc5 with SUSE patches.

I had some crashes during boot where my graphics display was somehow
screwed up (did not succeed in debugging with serial console, so no
report for that yet).

So, yes it could be that another error broke this but since I mainly use
wireless and not the ethernet port, I only noticed this problem recently
and cannot say for sure when and how it happened.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/42

------------------------------------------------------------------------
On 2008-09-23T00:14:23+00:00 John Ronciak wrote:

All,

If you read the entire thread pointed to in #7 you would have read Intel's
response to this.  The thread has to do with VMWare and not Linux.
VMWare is based on a very old kernel that has some poor kernel locking
issues.  We pointed this out to them and asked whether any of
the people responding had actually seen this problem on Linux.  This is
where things went awry a bit.  All (_all_) of the reports that we have so
far have had this gfx panic just before this problem comes up.  The
current belief is that the gfx panic is scribbling all over our NVM
space somehow.  We are not sure how this is happening.  Since we can't
repro the problem without this panic happening first, this is very hard
for us to look at.  If the gfx panic does not happen there have been no
problems reported.  I do not know the status of a fix for the gfx panic.
We are working on repro cases but because the problem relies on a panic
in another driver this is very hard for us to work on.  Work will
continue.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/43

------------------------------------------------------------------------
On 2008-09-23T00:30:55+00:00 Jesse Brandeburg wrote:

okay, so two votes for a graphics panic possibly related to the issue. 
Andreas, would you be willing to comment out the code just after the
nvm.validate (validate_eeprom) in netdev.c and then try to dump your eeprom
using ethtool -e ethX

if it returns all 0xff 0xff ... then can you please download ethregs utility
from sourceforge.net/projects/e1000 and build/run it and attach or send me the
output?

we are still trying to reproduce, I'm raising the issue in priority and
we will continue to update as we make progress.
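
One way to automate the "all 0xff" check is sketched below in Python. It
assumes the dump is laid out as "offset: byte byte ..." rows, as in the
driver dump quoted earlier in this bug; the function names are hypothetical
and the exact ethtool layout may differ between versions.

```python
import re

# A data row: an offset (optionally 0x-prefixed), a colon, then hex bytes.
ROW_RE = re.compile(r"^\s*(?:0x)?[0-9A-Fa-f]+:((?:\s+[0-9A-Fa-f]{2})+)\s*$")


def dump_bytes(text):
    """Collect the byte values from every data row of a hex dump."""
    values = []
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match is None:
            continue  # header, separator, or unrelated log line
        values.extend(int(token, 16) for token in match.group(1).split())
    return values


def looks_erased(text):
    """True if the dump contains data and every byte of it is 0xff."""
    values = dump_bytes(text)
    return bool(values) and all(value == 0xFF for value in values)
```

Feeding it the saved output of ethtool -e ethX would then answer the question
above without reading the dump by eye.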

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/44

------------------------------------------------------------------------
On 2008-09-23T03:33:12+00:00 Bob Mahar wrote:

Jesse, the question I'm left with is "what TYPE of eeprom interface do
the broken NICs implement?"   I.e. what specific chip and what
interface does it provide to the eeprom.   Part of that you get from the
customer, and part you get from Intel.  Considering the Intel 8256x and
8257x parts, you have 4 different interfaces - that I know of.

For SPI / uW it's hard to "accidentally" overwrite the prom as you have
to send it the write enable opcode first and then shift in the data.
That's typically too complicated to happen via random garbage from a
crash. If these are parts with SPI / uW addressed proms, then the
overwrite is most likely the result of e1000e code being called.  If
that's the case, the debug builds of the driver would dump out the "I'm
writing to the prom" messages.  ( Hint: that's not the case. )

For the parallel eeproms, which are memory mapped, they can be
overwritten by writing junk over top of their address space.   I think
the 82573 has this type.   The 82566 part also has a similar, but
different memory mapped eeprom.

Point being, if the SPI based parts are having issues, this points
towards programmatic overwrite - as it's unlikely to happen by accidental
overwrite of I/O or memory space.   Damage to the memory mapped parts,
on the other hand, could happen because accidental overwrites are poorly
defended.

Perhaps John can give us a clue as to the Intel parts which use SPI vs
uW vs shadow RAM vs memory mapped.  I don't see much in the code that
latches the memory mapped eeproms against accidental overwrite.  For the
shadow ram method, it wants the write through flag to be set to true,
well FF's are "true" so if this is the means to protect the underlying
prom data, it's a pretty feeble one.  If the gfx crash writes all FF's
over a swath of memory, there you have it.
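
The contrast drawn above can be shown with a toy Python model. This is only
a sketch of the reasoning, not how the hardware or the e1000e driver works:
the two classes are hypothetical, the SPI framing is heavily simplified, and
the opcode values (WREN 0x06, WRITE 0x02) are the ones commonly used by SPI
EEPROMs, assumed here.

```python
WREN, WRITE = 0x06, 0x02   # assumed SPI EEPROM opcodes: write enable, write


class MappedNvm:
    """Memory-mapped style: every stray store lands in the array."""

    def __init__(self, size):
        self.cells = bytearray(size)          # starts as all 0x00

    def stray_stores(self, garbage):
        for addr, value in enumerate(garbage[:len(self.cells)]):
            self.cells[addr] = value


class SpiNvm:
    """Serial style: a write needs WREN, then WRITE, address and data."""

    def __init__(self, size):
        self.cells = bytearray(size)

    def stray_stores(self, garbage):
        stream = iter(garbage)
        for opcode in stream:
            if opcode != WREN:
                continue                      # not a write enable: ignored
            if next(stream, None) != WRITE:
                continue                      # latch not followed by WRITE
            addr, value = next(stream, None), next(stream, None)
            if addr is not None and value is not None and addr < len(self.cells):
                self.cells[addr] = value
```

A run of 0xff bytes wipes the MappedNvm model but leaves the SpiNvm model
untouched, because 0xff never forms the write-enable sequence.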

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/47

------------------------------------------------------------------------
On 2008-09-23T04:53:37+00:00 Andreas Jaeger wrote:

Ad comment #24: I'm not using vmware at all on this machine.

Ad comment #25: Karsten, could you compile a kernel and the tool for me?
I'll then test it (note I'm the whole day in meetings and have only
little time to do anything extra myself).  Stephan, will you test as
well?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/49

------------------------------------------------------------------------
On 2008-09-23T05:15:16+00:00 Jesse Brandeburg wrote:

Hi Bob, thanks for the informative reply.

all the broken ones seem to use the FLAG_HAS_FLASH, at this point all the
reports I've heard have BAR1 mapped to an area that has registers used for
read/writing the NVM.  The people I've heard reporting this have a laptop based
on the ICH8 or ICH9 chipsets, aka the lan part is the 82566 or 82567.

This NVM is usually part of a larger flash which contains the BIOS and possibly
the PXE boot code as well as the LAN Non-volatile-memory (NVM)

none of our "configuration" areas use direct memory mapped mode, and unless you
call the "write_eeprom" function using ethtool, nothing should be calling
e1000e_write_flash_data_ich8lan.

we have some patches ready for setting the registers that are mapped at the
flash BAR1 area to read-only.  They are not tested, but we will test a little
and post them to the mailing list tomorrow.

If you have any references to any users that *don't* have an 82566 or 567, then
please point it out.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/50

------------------------------------------------------------------------
On 2008-09-23T06:11:44+00:00 Karsten-keil wrote:

Ad comment #23:
Do you still have this kernel around with the exact version (best would be the 
rpm or kernel source) ?
I added the 3 additional patches on Wed Sep 10 17:18:30 CEST 2008. So this may 
be related.
Ad comment #27:
I will prepare a kernel based on 11.0 and the tools.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/52

------------------------------------------------------------------------
On 2008-09-23T06:57:20+00:00 Karsten-keil wrote:

I got another report about a working system (82566MM Gigabit Network
Connection [8086:1049] (rev 03)) but with a pre Beta1 kernel which is
mostly identical but _without_ the 3 additional patches.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/55

------------------------------------------------------------------------
On 2008-09-23T09:10:42+00:00 Wstephenson-9 wrote:

As per comment #19, I also have 8086:109a here and beta1 (kernel-
pae-2.6.27-7.2) works and has not broken the hardware yet.  I installed
beta1 at the weekend not knowing about this bug, but am loath to boot it
again if it will cook my ethernet.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/56

------------------------------------------------------------------------
On 2008-09-23T09:45:46+00:00 Stefan-seyfried wrote:

I had a machine (hp 2510p) which had a lockup with garbled X display and
refused to boot at all afterwards (got the mainboard replaced on
warranty) and we have another 2510p which, after updating to an early
2.6.27rc, suddenly "forgot" its 1280x800 video mode. Neither has
shown e1000e trouble (yet ;), but it points in the "graphics problem
overwrites system flash" direction.

The 2510p has

00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 0c)
00:19.0 Ethernet controller: Intel Corporation 82566MM Gigabit Network Connection (rev 03)

00:02.0 0300: 8086:2a02 (rev 0c)
00:19.0 0200: 8086:1049 (rev 03)

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/58

------------------------------------------------------------------------
On 2008-09-23T10:04:29+00:00 Martin-wilck-d wrote:

Sorry to interfere, our ESX expert told me that under ESX 3.5 it wasn't
necessary to have processes _writing_ to the EEPROM for this problem to
occur. Rather, it happened if two processes were just _reading_ the
EEPROM at the same time, due to a broken locking bit in the HW of some
NICS (82546, that's what I was told).

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/59

------------------------------------------------------------------------
On 2008-09-23T10:35:07+00:00 Karsten-keil wrote:

OK, I built the ethregs package and an 11.0 kernel with the error paths 
commented out. The kernel is only available internally in mbuild 
diabelli-kkeil-179  (kernel-default).
ethregs is available in the buildservice; search for it on 
http://software.opensuse.org/search

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/60

------------------------------------------------------------------------
On 2008-09-23T11:48:49+00:00 John Ronciak wrote:

Please stop talking about VMWare.  That OS has a problem that was fixed
by a work-around in the driver.  What the VMWare guys saw has never been
seen in Linux.  It has _nothing_ to do with this bug.

As Jesse said in his mail we are working on some NVM protection patches
for our driver.  This won't fix the root cause of writing over our NVM
area but it might help to find what/who is writing all over it.

For comment #30 and #31, if the system did not see one of the gfx driver
panics our NVM remains fine.  So if that panic does not happen our NIC
NVM remains fine as well.

As soon as the patches are out of test we'll be pushing them upstream.
We'll post links to this BZ once this happens today.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/61

------------------------------------------------------------------------
On 2008-09-23T14:15:34+00:00 Andi-nbz wrote:

If the graphics crash uses DMA to overwrite the MMIO area (assuming it's really
the graphics crash) then it would require VT-d to protect it. But write 
protecting it at the CPU level is probably a good start; it just might not 
be enough.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/65

------------------------------------------------------------------------
On 2008-09-23T14:23:37+00:00 John Ronciak wrote:

Agreed Andi.  This is all we can do however.  Let's see what happens
with the patches with the people that are actually seeing the problem.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/67

------------------------------------------------------------------------
On 2008-09-23T14:33:57+00:00 Andi-nbz wrote:

Those would first need to fix their mac addresses to try again right?
Also is there other vital information in that EEPROM?

Some of those systems should have VT-d. Presumably it would give a message
on a stray DMA? Of course the CPU protection would be needed too.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/68

------------------------------------------------------------------------
On 2008-09-23T14:41:36+00:00 Jkosina-d wrote:

(In reply to comment #38 from Andi N Kleen)
> Those would first need to fix their mac addresses to try again right?
> Also is there other vital information in that EEPROM?

By the way, the current driver doesn't even get bound to a card that
has a wrong EEPROM CRC, right? So it's not even possible to easily fix its
contents up using ethtool from within a default installation.

Karsten already patched and built the kernel so that it binds the driver
to the card even in cases of broken EEPROM checksum.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/69

------------------------------------------------------------------------
On 2008-09-23T16:38:12+00:00 Karsten-keil wrote:

The modified driver is in mbuild diabelli-kkeil-184 (diabelli-kkeil-179 was 
wrong and crashed).
With this driver I see the card; ethtool -e shows all FF and the MAC is also 
read as FF, but I set the old MAC address with ifconfig and it seems that 
the card is working, but only at 10 Mbit. ethregs output follows.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/74

------------------------------------------------------------------------
On 2008-09-23T16:45:11+00:00 Karsten-keil wrote:

Created attachment 241194
ethregs eth1 output

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/76

------------------------------------------------------------------------
On 2008-09-23T17:07:57+00:00 Andreas Jaeger wrote:

Ad comment #29: The kernel is not available anymore.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/78

------------------------------------------------------------------------
On 2008-09-23T17:13:44+00:00 Jesse Brandeburg wrote:

I strongly recommend if you are going to test for this bug or haven't seen it
yet on your ich8/9 system, that you RIGHT NOW, do ethtool -e ethX >
savemyeep.txt

Having a saved copy of your eeprom means we can help you write it back to your
system.
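
A hedged Python sketch of the same backup step, for scripting it across
several machines. It only wraps the ethtool -e call quoted above; the
function name and the injectable run parameter are assumptions made for this
example.

```python
import subprocess


def backup_eeprom(iface, dest, run=subprocess.run):
    """Run 'ethtool -e <iface>' and save its text output to dest.

    Returns the number of characters written. The run parameter exists
    only so the command can be replaced in tests.
    """
    result = run(
        ["ethtool", "-e", iface],
        capture_output=True,
        text=True,
        check=True,          # raise if ethtool reports an error
    )
    with open(dest, "w") as handle:
        handle.write(result.stdout)
    return len(result.stdout)
```

Calling backup_eeprom("eth0", "savemyeep.txt") as root would produce the same
file as the shell redirect, and an ethtool failure raises an exception
instead of silently leaving an empty file behind.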

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/79

------------------------------------------------------------------------
On 2008-09-23T17:38:39+00:00 Andreas Jaeger wrote:

Jesse, What happens with those that have already a broken eeprom?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/83

------------------------------------------------------------------------
On 2008-09-23T17:42:22+00:00 Eich-m wrote:

Could this be the same as bug #57976?

At the time I did not notice any connection with a graphics corruption - but 
would not rule it out either.
In any case the intel graphics driver was a totally different one, the one we 
used at the time (i810) is no longer shipped with 11.1.

Can someone elaborate a little on the 'graphics panic'? Was this a total
lockup or just a screen corruption?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/84

------------------------------------------------------------------------
On 2008-09-23T17:44:16+00:00 Eich-m wrote:

(In reply to comment #44 from Andreas Jaeger)
> Jesse, What happens with those that have already a broken eeprom?
> 

Point is: if it's related to #57976 writing back the eeprom doesn't work.
At the time I attributed this to preproduction hardware.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/85

------------------------------------------------------------------------
On 2008-09-23T18:11:28+00:00 Bob Mahar wrote:

(In reply to comment #46 from Egbert Eich)
> Point is: if it's related to #57976 writing back the eeprom doesn't work.
> At the time I attributed this to preproduction hardware.

I don't have access to that one... what was the issue?  Can you
elaborate?

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/87

------------------------------------------------------------------------
On 2008-09-23T18:20:48+00:00 John Ronciak wrote:

In comment #40, just setting the MAC address in the NVM is most likely
not going to restore everything to working.  The HW reads a lot of
things out of it when it's coming up.  You will probably have to do a BIOS
update to get everything in the system back.  You will also have to put
the MAC address back in as well.  The BIOS update might do this for you
so make sure it's right before trying to update.

We would like to know that this works so if you could try it on this
system it would help a lot.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/88

------------------------------------------------------------------------
On 2008-09-23T18:30:12+00:00 Karsten-keil wrote:

I could write the MAC address with ethtool, but now the driver does not load 
completely: insmod hangs for about a minute and then it disables the IRQ.
After this there is no eth1. This is the dmesg output:
e1000e: Intel(R) PRO/1000 Network Driver - 0.2.0
e1000e: Copyright (c) 1999-2007 Intel Corporation.
ACPI: PCI Interrupt 0000:00:19.0[A] -> GSI 20 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:19.0 to 64
ACPI: PCI interrupt for device 0000:00:19.0 disabled
e1000e: Intel(R) PRO/1000 Network Driver - 0.2.0
e1000e: Copyright (c) 1999-2007 Intel Corporation.
ACPI: PCI Interrupt 0000:00:19.0[A] -> GSI 20 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:19.0 to 64
ACPI: PCI interrupt for device 0000:00:19.0 disabled

I think that now, with a valid checksum, it interprets some of the still-0xFF
values as valid and sets some registers wrong.
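
A toy Python model of that situation. It assumes an e1000-style layout (MAC
address in words 0-2, low byte first, and words 0x00..0x3F summing to
0xBABA); the function names are illustrative and the MAC address used with
them below is a made-up placeholder.

```python
EEPROM_SUM = 0xBABA     # assumed target sum of words 0x00..0x3F
CHECKSUM_WORD = 0x3F    # assumed index of the checksum word


def mac_to_words(mac):
    """Pack 'aa:bb:cc:dd:ee:ff' into three 16-bit words, low byte first."""
    octets = [int(part, 16) for part in mac.split(":")]
    return [octets[i] | (octets[i + 1] << 8) for i in (0, 2, 4)]


def patch_mac_and_checksum(words, mac):
    """Return a copy with words 0-2 set to the MAC and word 0x3F recomputed."""
    patched = list(words)
    patched[0:3] = mac_to_words(mac)
    partial = sum(patched[:CHECKSUM_WORD]) & 0xFFFF
    patched[CHECKSUM_WORD] = (EEPROM_SUM - partial) & 0xFFFF
    return patched


def checksum_ok(words):
    """True if words 0x00..0x3F sum to EEPROM_SUM modulo 2**16."""
    return sum(words[:CHECKSUM_WORD + 1]) & 0xFFFF == EEPROM_SUM
```

Starting from 64 words of 0xffff and patching in a placeholder such as
00:11:22:33:44:55, checksum_ok() passes, yet words 0x03 through 0x3E are
still 0xffff. A driver that trusts the checksum would then read those
erased words as real configuration.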

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/90

------------------------------------------------------------------------
On 2008-09-23T18:59:17+00:00 Eich-m wrote:

(In reply to comment #47 from Bob Mahar)
> (In reply to comment #46 from Egbert Eich)
> > Point is: if it's related to #57976 writing back the eeprom doesn't work.
> > At the time I attributed this to preproduction hardware.
> 
> I don't have access to that one... what was the issue?  Can you elaborate? 
> 

In bug #57976 an SPI type eeprom which seemed to still hold valid
content (at least not 0xff) but a bogus checksum could not be restored
as no matter to which byte offset a value was written it always ended up
at offset 0 or 1.

However looking at comment #49 it doesn't seem to be related as in this
case the checksum could be fixed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/91

------------------------------------------------------------------------
On 2008-09-23T19:48:42+00:00 Karsten-keil wrote:

Since we do not have a similar machine, we cannot dump the EEPROM. The
board is an Intel DQ35JO. Maybe someone from Intel has access to this
board and can provide an ethtool -e dump?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/94

------------------------------------------------------------------------
On 2008-09-24T00:00:04+00:00 Jkosina-d wrote:

Intel has just posted patches to lkml [1] [2] [3] that mark the memory
mapped EEPROM region as read-only. Therefore if the EEPROM is garbled by
any bug in kernel code, after these patches are applied, the EEPROM
would no longer be overwritten, and stack trace would be dumped instead,
which will hopefully point to the code that is corrupting the memory.

If, however, userspace is corrupting the memory region (most probably
X.Org), then this protection is rendered useless, but it still is worth
trying so that we can potentially rule out either userspace or
kernelspace code completely.

I have built the kernel with these three patches applied, for those who
are willing (and able) to test. The kernel RPM can be obtained from

        http://labs.suse.cz/jikos/download/bug-425480/

Any testing would be highly appreciated.

[1] http://lkml.org/lkml/2008/9/23/427
[2] http://lkml.org/lkml/2008/9/23/431
[3] http://lkml.org/lkml/2008/9/23/432

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/106

------------------------------------------------------------------------
On 2008-09-24T00:02:48+00:00 Jkosina-d wrote:

(In reply to comment #52 from Jiri Kosina)
> If, however, userspace is corrupting the memory region (most probably X.Org),
> then this protection is rendered useless, but it still is worth trying so that
> we can potentially rule out either userspace or kernelspace code completely.

In fact, testing whether booting the system only in text-mode (so that
xorg won't be started at all) also triggers the bug or not would also be
a valuable test.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/107

------------------------------------------------------------------------
On 2008-09-24T00:21:00+00:00 John Ronciak wrote:

Thanks Jiri.  Yes, these patches need to be tried.  We tested them but we
have not been seeing the problem.

Andreas, could you please give us the model and serial number from the
system that had the NVM corrupted?  We have some people here at Intel
that think that they can track down the EEPROM image.  They want to see
if things are correct with it.

Thanks.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/109

------------------------------------------------------------------------
On 2008-09-24T05:19:54+00:00 Andreas Jaeger wrote:

John, my system is a Lenovo thinkpad X61s:
  System Info: #1
    Manufacturer: "LENOVO"
    Product: "766929G"
    Version: "ThinkPad X61s"
    Serial: "L3A2878"
    UUID: undefined, but settable
    Wake-up: 0x06 (Power Switch)
  Board Info: #2
    Manufacturer: "LENOVO"
    Product: "766929G"
    Version: "Not Available"
    Serial: "1ZDMN77215Z"

The e1000 is:
26: PCI 19.0: 0200 Ethernet controller
  [Created at pci.310]
  UDI: /org/freedesktop/Hal/devices/pci_8086_104b
  Unique ID: kpGf.mInfNyjoCrB
  SysFS ID: /devices/pci0000:00/0000:00:19.0
  SysFS BusID: 0000:00:19.0
  Hardware Class: network
  Model: "Intel 82566DC Gigabit Network Connection"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x104b "82566DC Gigabit Network Connection"
  SubVendor: pci 0x8086 "Intel Corporation"
  SubDevice: pci 0x0000 
  Revision: 0x03

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/113

------------------------------------------------------------------------
On 2008-09-24T06:36:21+00:00 Karsten-keil wrote:

The motherboard from this bug is:

Handle 0x0007, DMI type 2, 20 bytes
Base Board Information
        Manufacturer: Intel Corporation
        Product Name: DQ35JO
        Version: AAD82085-801
        Serial Number: BQJO749006WD

The onboard NIC:
33: PCI 19.0: 0200 Ethernet controller
  [Created at pci.310]
  UDI: /org/freedesktop/Hal/devices/pci_8086_294c
  Unique ID: kpGf.CUCsNZz8jz8
  SysFS ID: /devices/pci0000:00/0000:00:19.0
  SysFS BusID: 0000:00:19.0
  Hardware Class: network
  Model: "Intel 82566DC-2 Gigabit Network Connection"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x294c "82566DC-2 Gigabit Network Connection"
  SubVendor: pci 0x8086 "Intel Corporation"
  SubDevice: pci 0x0000
  Revision: 0xfd
  Memory Range: 0x92200000-0x9221ffff (rw,non-prefetchable)
  Memory Range: 0x92224000-0x92224fff (rw,non-prefetchable)
  I/O Ports: 0x3400-0x341f (rw)
  IRQ: 20 (no events)
  Module Alias: "pci:v00008086d0000294Csv00008086sd00000000bc02sc00i00"

The original MAC: 00:1c:c0:2b:74:3a

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/114

------------------------------------------------------------------------
On 2008-09-24T07:17:45+00:00 Stefan-seyfried wrote:

(In reply to comment #53 from Jiri Kosina)
> In fact, testing whether booting the system only in text-mode (so that xorg
> won't be started at all) also triggers the bug or not would also be a valuable
> test.

It is, unfortunately, not that easy. I have rebooted my machine (hp 2510p) with 
e1000e 17 times since Sep 15 with 2.6.27-rc5-git9+ Kernels (always pretty 
recent STABLE) and I did not encounter any problems.
So it is pretty hard to prove the absence of this bug.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/115

------------------------------------------------------------------------
On 2008-09-24T07:22:56+00:00 Jkosina-d wrote:

(In reply to comment #57 from Stefan Seyfried)
> It is, unfortunately, not that easy. I have rebooted my machine (hp 2510p) with
> e1000e 17 times since Sep 15 with 2.6.27-rc5-git9+ Kernels (always pretty
> recent STABLE) and I did not encounter any problems.
> So it is pretty hard to prove the absence of this bug.

And did this machine expose the problem at least once previously?
Apparently not all systems with e1000e hardware are hit by the issue;
either only specific product IDs are affected, or it might be
chipset-dependent, etc.

Also, please do not forget to back up contents of your EEPROM before you
start playing with this :)

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/116

------------------------------------------------------------------------
On 2008-09-24T07:28:09+00:00 Jesse Brandeburg wrote:

(In reply to comment #49 from Karsten Keil)
> I could write the MAC address with ethtool but now the driver do not load
> completely insmod hangs for about a minute and then it disable the IRQ.
> After this here is no eth1 this are the dmesg:
> e1000e: Intel(R) PRO/1000 Network Driver - 0.2.0
> e1000e: Copyright (c) 1999-2007 Intel Corporation.
> ACPI: PCI Interrupt 0000:00:19.0[A] -> GSI 20 (level, low) -> IRQ 20
> PCI: Setting latency timer of device 0000:00:19.0 to 64

at this point (without trying to activate the device) does ethtool -e still
work? I would assume not.

> ACPI: PCI interrupt for device 0000:00:19.0 disabled

I looked at your ethregs dump (thank you!!!) and in the EECD register, bit 8 is
not set, indicating the valid bits in the eeprom are not set.
bit 9 is set indicating the hardware tried to read the eeprom.
bit 22 is only valid if bits 8 and 9 are set, but it would indicate which of the
two eeprom banks had a valid signature.
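
As a rough illustration of the bit layout described above, here is a minimal
Python sketch that decodes those three EECD bits from a raw 32-bit register
value. The bit positions are taken from this comment only; the constant and
field names are hypothetical labels chosen for the example, not names from the
driver source or the datasheet.

```python
# Bit positions as described in the comment above. The names are
# hypothetical labels for this sketch, not taken from the driver.
EECD_NVM_VALID = 1 << 8        # "valid bits in the eeprom" are set
EECD_READ_ATTEMPTED = 1 << 9   # hardware tried to read the eeprom
EECD_BANK_SHIFT = 22           # which of the two banks had a valid signature


def decode_eecd(value):
    """Return a dict describing the three EECD bits discussed above."""
    valid = bool(value & EECD_NVM_VALID)
    attempted = bool(value & EECD_READ_ATTEMPTED)
    # Bit 22 is only meaningful when bits 8 and 9 are both set.
    bank = (value >> EECD_BANK_SHIFT) & 1 if (valid and attempted) else None
    return {"nvm_valid": valid, "read_attempted": attempted, "valid_bank": bank}


if __name__ == "__main__":
    # Example value with only bit 9 set, i.e. the state described above:
    # the hardware tried to read the eeprom but did not find it valid.
    print(decode_eecd(0x00000200))
```

With bit 8 clear, the sketch deliberately reports no bank at all, since bit 22
carries no information in that state.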

I'm curious if the other bank on the eeprom might still be okay.  I'll have to
figure out tomorrow if we can switch to the other bank.  I may be able to get
you some internal tools since this is an intel board, I'll have to see what is
available.

BTW this is the first desktop machine I've heard of that reported the
problem.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/117

------------------------------------------------------------------------
On 2008-09-24T07:49:45+00:00 Stefan-seyfried wrote:

(In reply to comment #58 from Jiri Kosina)
> And did this machine expose the problem at least once previously? Apparently
> not all systems having e1000e hardware are being hit by the issue, either only
> specific product IDs are affected, or it might be chipset-dependent, etc.

I am not sure. See comment #32. It might of course also just have been a
broken joint on the mainboard.

> Also, please do not forget to back up contents of your EEPROM before you start
> playing with this :)

If it hits me the same as last time, this won't help :) (and yes, i
backed it up)

(In reply to comment #59 from Jesse Brandeburg)
> BTW this is the first desktop machine I've heard of that reported the problem.
Regarding the desktop: it also has intel integrated graphics, and it had 
recurring problems with the graphics driver (lockups) before the ethernet 
broke. Maybe that's one common factor.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/118

------------------------------------------------------------------------
On 2008-09-24T08:26:15+00:00 Eich-m wrote:

(In reply to comment #52 from Jiri Kosina)
> Intel has just posted patches to lkml [1] [2] [3] that mark the memory mapped
> EEPROM region as read-only. Therefore if the EEPROM is garbled by any bug in
> kernel code, after these patches are applied, the EEPROM would no longer be
> overwritten, and stack trace would be dumped instead, which will hopefully
> point to the code that is corrupting the memory.
> 

This is indeed a valuable test.

> If, however, userspace is corrupting the memory region (most probably X.Org),
> then this protection is rendered useless, but it still is worth trying so that
> we can potentially rule out either userspace or kernelspace code completely.

Not necessarily. If X overwrites this memory from user space, yes.
However if it is overwritten from kernel space (by DRM - either
from the Xserver or from a DRM client) we will be able to catch it.

For now I would rule out user space. From user space the Xserver
cannot access memory unless it is explicitly mapped. You can find
out the mapped memory ranges from /proc/<pid>/maps. It would be
instructive to know which ranges show up there on the affected
machines and compare them to an lspci -v output.
I may have missed this, but I have not seen any analysis which
access method is used on the affected systems to write to the
EEPROM.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/119

------------------------------------------------------------------------
On 2008-09-24T08:46:56+00:00 Karsten-keil wrote:

(In reply to comment #59 from Jesse Brandeburg)
Yes, ethtool does not work any more.
While I was programming the MAC, I saw that another word in the EEPROM
changed as well (I assume the checksum), and one other byte also got a different
value (maybe BF, but I'm not sure). It was in another line of the ethtool -e
dump (2nd or 3rd line), between the MAC (first 6 bytes) and the checksum.
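
A small Python sketch of why the checksum word would be expected to change
together with the MAC. It assumes the generic e1000-family rule, as far as I
know it: the first 64 little-endian 16-bit words (0x00..0x3F) must sum to
0xBABA modulo 2^16, with word 0x3F acting as the checksum word. The function
names are made up for this example, and the ICH8/ICH9 parts may apply extra
checks that are not modelled here.

```python
import struct

NVM_CHECKSUM_WORD = 0x3F   # index of the checksum word (assumed generic layout)
NVM_SUM = 0xBABA           # expected 16-bit sum of words 0x00..0x3F


def nvm_words(raw):
    """Interpret the first 128 bytes as 64 little-endian 16-bit words."""
    if len(raw) < 128:
        raise ValueError("need at least 128 bytes of EEPROM data")
    return list(struct.unpack("<64H", raw[:128]))


def checksum_ok(raw):
    """True if the 64-word sum matches the expected constant."""
    return sum(nvm_words(raw)) & 0xFFFF == NVM_SUM


def fixed_checksum_word(raw):
    """Value word 0x3F must hold so that the 64-word sum equals 0xBABA."""
    words = nvm_words(raw)
    partial = sum(words[:NVM_CHECKSUM_WORD]) & 0xFFFF
    return (NVM_SUM - partial) & 0xFFFF
```

Under that assumption, any change to the first six bytes (the MAC) has to be
balanced by a new value in word 0x3F, and an image that is all 0xFF can never
pass the check.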

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/120

------------------------------------------------------------------------
On 2008-09-24T09:49:29+00:00 Jkosina-d wrote:

(In reply to comment #63 from Egbert Eich)
> For now I would rule out user space. From user space the Xserver
> cannot access memory unless it is explicitly mapped. You can find
> out the mapped memory ranges from /proc/<pid>/maps. It would be
> instructive to know which ranges show up there on the affected
> machines and compare them to an lspci -v output.

It could be some temporary mapping that goes away after a while, so that
it doesn't show in /proc/<pid>/maps permanently, but yes, this of course
can be tried.

Could someone who has access to affected hardware please provide the
output of the

       cat /proc/`pidof Xorg`/maps  
       lspci -v

commands, so that we can see if there is possibly some lethal overlap?
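
To make the comparison less error-prone, a minimal Python sketch of the
overlap check follows. It assumes that the X server maps device memory through
/dev/mem and that, for such mappings, the offset column in /proc/<pid>/maps is
the physical address; both are assumptions for this illustration. The helper
names are made up, and the BAR ranges have to be copied by hand from the
"Memory at ..." lines of lspci -v.

```python
def parse_maps(text, backing="/dev/mem"):
    """Return (phys_start, phys_end) for each mapping of `backing`.

    Assumes the offset column of such a mapping is the physical address.
    """
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[5] != backing:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        offset = int(fields[2], 16)
        ranges.append((offset, offset + (end - start)))
    return ranges


def overlaps(a, b):
    """True if half-open ranges a and b share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def find_overlaps(maps_text, bars):
    """Return (mapping, bar) pairs where a mapping touches a NIC BAR."""
    return [(m, bar) for m in parse_maps(maps_text) for bar in bars
            if overlaps(m, bar)]
```

For example, the two memory regions of a NIC reported as "Memory at fe200000
[size=128K]" and "Memory at fe225000 [size=4K]" would be passed as
[(0xfe200000, 0xfe220000), (0xfe225000, 0xfe226000)]. An empty result only
says that no such mapping existed at the moment the maps file was read.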

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/126

------------------------------------------------------------------------
On 2008-09-24T09:51:41+00:00 Jkosina-d wrote:

Also, the locking in e1000e does indeed seem to be dodgy: Thomas Gleixner
reported that lockdep spotted this on his system with 2.6.27-rc7:

e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2
e1000e: Copyright (c) 1999-2008 Intel Corporation.
e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
e1000e 0000:00:19.0: setting latency timer to 64
0000:00:19.0: eth0: (PCI Express:2.5GB/s:Width x1) 00:15:58:84:9f:94
0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
0000:00:19.0: eth0: MAC: 4, PHY: 6, PBA No: ffffff-0ff
------------[ cut here ]------------
WARNING: at /home/tglx/work/kernel/git/linux-2.6/kernel/mutex.c:135 
mutex_lock_nested+0x5c/0x26d()
Modules linked in: e1000e i915 drm ipt_MASQUERADE iptable_nat nf_nat 
nf_conntrack_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter 
ip_tables x_tables bridge stp bnep rfcomm l2cap bluetooth autofs4 coretemp fuse 
sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp 
libiscsi scsi_transport_iscsi e1000 cpufreq_ondemand acpi_cpufreq ext2 
dm_mirror dm_log dm_multipath dm_mod ipv6 kvm_intel kvm snd_hda_intel 
snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq arc4 snd_seq_device 
snd_pcm_oss ecb snd_mixer_oss snd_pcm video crypto_blkcipher snd_timer 
snd_page_alloc iwlagn i2c_i801 i2c_core firewire_ohci iwlcore mac80211 
snd_hwdep firewire_core crc_itu_t iTCO_wdt iTCO_vendor_support rtc_cmos snd 
soundcore output ac battery pcspkr cfg80211 sr_mod thinkpad_acpi rfkill cdrom 
sg hwmon button joydev ata_piix ahci libata sd_mod scsi_mod ext3 jbd mbcache 
uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 3484, comm: ip Not tainted 2.6.27-rc7-00006-gcec5eb7-dirty #89

Call Trace:
 <IRQ>  [<ffffffff8103654d>] warn_on_slowpath+0x51/0x77
 [<ffffffff810572e9>] __lock_acquire+0x6ad/0x716
 [<ffffffff8129c285>] mutex_lock_nested+0x5c/0x26d
 [<ffffffffa04f85c2>] e1000_acquire_swflag_ich8lan+0x59/0x74 [e1000e]
 [<ffffffffa04fd753>] e1000e_read_kmrn_reg+0x18/0x62 [e1000e]
 [<ffffffffa04f8606>] e1000e_gig_downshift_workaround_ich8lan+0x29/0x71 [e1000e]
 [<ffffffffa0503e07>] e1000_intr_msi+0x46/0xec [e1000e]
 [<ffffffff81076fa5>] handle_IRQ_event+0x1e/0x51
 [<ffffffff81078295>] handle_edge_irq+0xe8/0x12b
 [<ffffffffa04fb312>] e1000e_update_mc_addr_list_generic+0x0/0x18e [e1000e]
 [<ffffffff8100ea88>] do_IRQ+0x6c/0xd4
 [<ffffffff8100c556>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffffa04fb312>] e1000e_update_mc_addr_list_generic+0x0/0x18e 
[e1000e]
 [<ffffffffa04fb38f>] e1000e_update_mc_addr_list_generic+0x7d/0x18e [e1000e]
 [<ffffffffa04fb359>] e1000e_update_mc_addr_list_generic+0x47/0x18e [e1000e]
 [<ffffffffa0500ace>] e1000_set_multi+0xe2/0x11b [e1000e]
 [<ffffffff8121a1e8>] dev_set_rx_mode+0x21/0x2d
 [<ffffffff8121d1a6>] dev_open+0x85/0x9e
 [<ffffffff8121b172>] dev_change_flags+0xa6/0x15d
 [<ffffffff81262588>] devinet_ioctl+0x242/0x58a
 [<ffffffff812109a5>] sock_ioctl+0x1d8/0x1ff
 [<ffffffff810b80a9>] vfs_ioctl+0x21/0x6b
 [<ffffffff810b834c>] do_vfs_ioctl+0x259/0x272
 [<ffffffff81055a14>] trace_hardirqs_on_caller+0xf2/0x115
 [<ffffffff810b83b6>] sys_ioctl+0x51/0x73
 [<ffffffff8100bf4b>] system_call_fastpath+0x16/0x1b

Haven't looked into the code yet to see if this could possibly cause
some deadly racy access to NVRAM contents, but maybe some Intel people
will know off the top of their heads.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/128

------------------------------------------------------------------------
On 2008-09-24T10:41:31+00:00 Eich-m wrote:

(In reply to comment #65 from Jiri Kosina)
> It could be some temporary mapping that goes away after a while, so that it
> doesn't show in /proc/<pid>/maps permanently, but yes, this of course can be
> tried.

The Xserver itself doesn't have any temporary mappings. It could be buffers
requested from DRM which (depending on the implementation) could be requested 
and discarded during runtime of the server.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/131

------------------------------------------------------------------------
On 2008-09-24T11:16:29+00:00 Renato-yamane wrote:

Created attachment 241376
ethtool -e eth0 > e1000e.txt

Requested by Karsten Keil (comment #51).

Hardware: Laptop Lenovo Thinkpad T61

$ uname -a
Linux mandachuva 2.6.26.2 #1 SMP Sat Aug 16 19:08:09 BRT 2008 i686 GNU/Linux

$lspci -vvv

00:19.0 Ethernet controller: Intel Corporation 82566MM Gigabit Network 
Connection (rev 03)
        Subsystem: Lenovo Device 20b9
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- 
<MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 216
        Region 0: Memory at fe200000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fe225000 (32-bit, non-prefetchable) [size=4K]
        Region 2: I/O ports at 1840 [size=32]
        Capabilities: <access denied>
        Kernel driver in use: e1000e
        Kernel modules: e1000e

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/133

------------------------------------------------------------------------
On 2008-09-24T14:42:14+00:00 Okir wrote:

Vladimir Botka just posted this to an internal mailing list:
--------------------------------
So, I am the one who tried (without an intention). The installation of
SLED11 Beta1 on TP T61 crashed in the moment installer was probing the
X configuration. Instead of the well known notification "Dont panic ..."
an error message appeared for a few seconds telling something about
"yast, proposal, etc ...". Then the machine restarted. The network card
stopped responding.

Here is the card info:

00:19.0 Ethernet controller: Intel Corporation 82566DC Gigabit Network
Connection (rev 03) Subsystem: Intel Corporation Device 0000
        Flags: fast devsel, IRQ 20
        Memory at fe000000 (32-bit, non-prefetchable) [size=128K]
        Memory at fe025000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at 1840 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+
Queue=0/0 Enable- Kernel modules: e1000e
---------------------------------

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/140

------------------------------------------------------------------------
On 2008-09-24T14:45:00+00:00 Okir wrote:

Given that this is increasingly looking like it's closely related to video,
can people with affected machines please post the lspci output for their gfx
chip?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/142

------------------------------------------------------------------------
On 2008-09-24T14:58:27+00:00 Renato-yamane wrote:

Olaf, I really don't feel comfortable testing Kernel 2.6.27-rc if it can
damage my ethernet device, so I don't know if my hardware is affected,
but my video card is:

01:00.0 VGA compatible controller: nVidia Corporation Quadro NVS 140M (rev a1)
        Subsystem: Lenovo Device 20d8
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- 
<MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at d6000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at d4000000 (64-bit, non-prefetchable) [size=32M]
        Region 5: I/O ports at 2000 [size=128]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidia, nvidiafb

And my ethernet device is listed in Comment #68

Best regards,
Renato

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/143

------------------------------------------------------------------------
On 2008-09-24T16:00:38+00:00 Stefan-seyfried wrote:

The Ethernet on this Thinkpad R400 is already toasted, after an
installation attempt some time ago (Micha, do you still know when you
tried it?)

00:19.0 Ethernet controller: Intel Corporation 82566DC-2 Gigabit Network 
Connection (rev 03)
        Subsystem: Intel Corporation Device 0000
        Flags: fast devsel, IRQ 218
        Memory at fc000000 (32-bit, non-prefetchable) [size=128K]
        Memory at fc024000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at 1820 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Count=1/1 
Enable+
        Capabilities: [e0] PCIe advanced features <?>
        Kernel modules: e1000e

00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset 
Integrated Graphics Controller (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device 20e4
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at f4400000 (64-bit, non-prefetchable) [size=4M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        I/O ports at 1800 [size=8]
        Capabilities: [90] Message Signalled Interrupts: Mask- 64bit- Count=1/1 
Enable-
        Capabilities: [d0] Power Management version 3

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/152

------------------------------------------------------------------------
On 2008-09-24T16:41:24+00:00 Jkosina-d wrote:

Is all hardware that is currently known to be affected by the bug driven by
the i915 DRM driver? ("dmesg | grep -i drm" should be enough to learn that).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/155

------------------------------------------------------------------------
On 2008-09-24T16:44:55+00:00 Eich-m wrote:

(In reply to comment #71 from Renato Yamane)
> Olaf, I really don't feel comfortable testing Kernel 2.6.27-rc if it can
> damage my ethernet device, so I don't know if my hardware is affected, but my
> video card is:
> 
> 01:00.0 VGA compatible controller: nVidia Corporation Quadro NVS 140M (rev a1)
>         Subsystem: Lenovo Device 20d8
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-

Definitely not. 
And if this system does turn out to be affected it would point away from the
gfx driver.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/156

------------------------------------------------------------------------
On 2008-09-24T16:54:17+00:00 Karsten-keil wrote:

But all the others have i915.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/157

------------------------------------------------------------------------
On 2008-09-24T17:21:53+00:00 Jkosina-d wrote:

I would also be really interested to know whether the bug triggers when
booting with the 'nopat' kernel option, as that's another piece that might go
wrong on the way between Xorg, graphics card, ethernet card and MMIO.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/158

------------------------------------------------------------------------
On 2008-09-24T17:34:50+00:00 Karsten-keil wrote:

And as far as I understand comment #71, Renato's machine is not affected
yet; it is only on the list of potential victims, i.e. NICs with writable FLASH.

So yes, up to now all affected machines have used the 915 DRM driver.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/160

------------------------------------------------------------------------
On 2008-09-24T18:06:17+00:00 Andreas Jaeger wrote:

My laptop uses the 915 DRM:
dmesg | grep -i drm
[drm] Initialized drm 1.1.0 20060810
[drm] Initialized i915 1.6.0 20060119 on minor 0

# hwinfo --gfxcard
27: PCI 02.1: 0380 Display controller                           
  [Created at pci.310]
  UDI: /org/freedesktop/Hal/devices/pci_8086_2a03
  Unique ID: ruGf.a6pkzICrUB2
  SysFS ID: /devices/pci0000:00/0000:00:02.1
  SysFS BusID: 0000:00:02.1
  Hardware Class: graphics card
  Model: "Intel Mobile GM965/GL960 Integrated Graphics Controller"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x2a03 "Mobile GM965/GL960 Integrated Graphics Controller"
  SubVendor: pci 0x17aa "Lenovo"
  SubDevice: pci 0x20b5 
  Revision: 0x0c
  Memory Range: 0xf8200000-0xf82fffff (rw,non-prefetchable)
  Module Alias: "pci:v00008086d00002A03sv000017AAsd000020B5bc03sc80i00"
  Config Status: cfg=no, avail=yes, need=no, active=unknown

28: PCI 02.0: 0300 VGA compatible controller (VGA)
  [Created at pci.310]
  UDI: /org/freedesktop/Hal/devices/pci_8086_2a02
  Unique ID: _Znp.3gR64TvADaC
  SysFS ID: /devices/pci0000:00/0000:00:02.0
  SysFS BusID: 0000:00:02.0
  Hardware Class: graphics card
  Model: "Intel 965 GM"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x2a02 "965 GM"
  SubVendor: pci 0x17aa "Lenovo"
  SubDevice: pci 0x20b5 
  Revision: 0x0c
  Memory Range: 0xf8100000-0xf81fffff (rw,non-prefetchable)
  Memory Range: 0xe0000000-0xefffffff (rw,prefetchable)
  I/O Ports: 0x1800-0x1807 (rw)
  IRQ: 16 (2124 events)
  I/O Ports: 0x3c0-0x3df (rw)
  Module Alias: "pci:v00008086d00002A02sv000017AAsd000020B5bc03sc00i00"
  Driver Info #0:
    XFree86 v4 Server Module: intel
  Driver Info #1:
    XFree86 v4 Server Module: intel
    3D Support: yes
    Extensions: dri
  Config Status: cfg=no, avail=yes, need=no, active=unknown

Primary display adapter: #28

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/161

------------------------------------------------------------------------
On 2008-09-24T22:10:43+00:00 Jpallen wrote:

I had the problem on my T60p.  It does not appear to use the 915 DRM.

hwinfo --gfxcard
11: PCI 100.0: 0300 VGA compatible controller (VGA)             
  [Created at pci.318]
  UDI: /org/freedesktop/Hal/devices/pci_1002_71c4
  Unique ID: VCu0.s+GMGFf+eZ0
  Parent ID: vSkL.rxAOeWuq8i6
  SysFS ID: /devices/pci0000:00/0000:00:01.0/0000:01:00.0
  SysFS BusID: 0000:01:00.0
  Hardware Class: graphics card
  Model: "Lenovo ThinkPad T60p"
  Vendor: pci 0x1002 "ATI Technologies Inc"
  Device: pci 0x71c4 "Mobility FireGL V5200"
  SubVendor: pci 0x17aa "Lenovo"
  SubDevice: pci 0x2007 "ThinkPad T60p"
  Memory Range: 0xd0000000-0xdfffffff (rw,prefetchable)
  I/O Ports: 0x2000-0x2fff (rw)
  Memory Range: 0xee100000-0xee10ffff (rw,non-prefetchable)
  Memory Range: 0xee120000-0xee13ffff (ro,prefetchable,disabled)
  IRQ: 11 (no events)
  I/O Ports: 0x3c0-0x3df (rw)
  Module Alias: "pci:v00001002d000071C4sv000017AAsd00002007bc03sc00i00"
  Driver Info #0:
    XFree86 v4 Server Module: radeonhd
  Config Status: cfg=no, avail=yes, need=no, active=unknown
  Attached to: #27 (PCI bridge)

Primary display adapter: #11

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/165

------------------------------------------------------------------------
On 2008-09-24T22:26:47+00:00 Jpallen wrote:

I guess I really don't know for sure if I have the same problem.  I did
not inspect the contents of the EEPROM using ethtool, so I really can't
verify.  I have subsequently flashed my BIOS to recover.

What I experienced was a blank screen upon rebooting my system after
installing SLED 11 Beta - build47.  I was unable to install anything
after that (SLED 10, XP, etc.).  I continued to get a blank screen
during installation.  After updating my BIOS using Lenovo's bootable
BIOS CD I was able to install build48 (without loading the e1000e
module).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/166

------------------------------------------------------------------------
On 2008-09-24T22:31:47+00:00 Jkosina-d wrote:

That would be the first case when this would be reported to happen on
non-i915 hardware.

On the other hand, this mail that came to LKML just a while ago is quite
interesting too:

           http://lkml.org/lkml/2008/9/24/133

apparently some realtek ethernet device stopped working, because it has
a lot of 0xff somewhere in its configuration space ....

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/167

------------------------------------------------------------------------
On 2008-09-24T22:40:41+00:00 Jesse Brandeburg wrote:

I think the problem with the T60p is different (it has a different lan chip,
82573, with a real eeprom (not NVM based) that should not be able to be
corrupted in the same manner as 82566/82567).

let's leave Jared's problem off to the side as a (possibly) new issue,
maybe a new bug if it is reproducible?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/168

------------------------------------------------------------------------
On 2008-09-25T00:15:57+00:00 Bob Mahar wrote:

(In reply to comment #83 from Jesse Brandeburg)
> I think the problem with T60p is different (it has a different lan chip 82573
> with a real eeprom (not NVM based) that should not be able to be corrupted in
> the same manner as 82566/82567.

Oh, if it were only that simple.  The T60p has the 82573L  c.f. the
Intel docs...

http://download.intel.com/design/network/products/LAN/manuals/316080.pdf

See section 2.3...  "The 82573E/82573V/82573L supports both FLASH memory
and EEPROM; however, only one device can be connected at a time (not
both)."

So while the 8257x's are for the most part SEEPROMs, the -E,V,L suffixed
parts could go either way - oh joy!

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/171

------------------------------------------------------------------------
On 2008-09-25T07:17:26+00:00 Martin-wilck-d wrote:

Forgive me this dumb question - if this is due to an accidental
overwrite with random data (DMA), why would the EEPROM contain only FFs
afterwards? Can't we infer something from that?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/173

------------------------------------------------------------------------
On 2008-09-25T07:35:12+00:00 Jkosina-d wrote:

(In reply to comment #85 from Martin Wilck)
> Forgive me this dumb question - if this is due to an accidental overwrite with
> random data (DMA),

It's not DMA but MMIO. The data are not that random, it is really all
0xff.
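
For anyone who wants to check a saved dump for this pattern, a minimal Python
sketch follows. It assumes the usual text layout of "ethtool -e" output (an
offset such as 0x0000: followed by hex bytes on each data line); the function
names are made up for this example. As a general property of flash, erased
cells read back as 0xFF, so a dump that is almost entirely 0xFF would be
consistent with an erase rather than a scribble of random data. That is only a
hint, not a diagnosis.

```python
import re

# One data line of the dump: an offset, then whitespace-separated hex bytes.
LINE_RE = re.compile(r"^\s*0x[0-9a-fA-F]+:?\s+((?:[0-9a-fA-F]{2}\s*)+)$")


def parse_dump(text):
    """Collect the data bytes from the text of an 'ethtool -e' dump."""
    data = bytearray()
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            data.extend(int(b, 16) for b in m.group(1).split())
    return bytes(data)


def erased_fraction(data):
    """Fraction of bytes equal to 0xFF."""
    if not data:
        raise ValueError("no data bytes found in dump")
    return data.count(0xFF) / len(data)
```

Header and separator lines in the dump do not match the pattern and are
skipped.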

> why would the EEPROM contain only FFs afterwards? Can't we infer something
> from that?

Well, we weren't able to use this to identify the source of the corruption
so far. We have patches that could help to point to the guilty code, but
first we need a reliable way to restore the EEPROM contents, otherwise the
debugging is almost impossible.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/174

------------------------------------------------------------------------
On 2008-09-25T08:00:00+00:00 Okir wrote:

Looking at the lspci output from the system mentioned on LKML (comment #82)
that machine seems to have an i945G graphics controller, which AFAIK is
also driven by the i915 driver. The chipset is ICH7.

Has anyone tried to resurrect AJ's laptop via a BIOS update?

Is there any way Intel can help resuscitate these e1000e NVMs? This is
really preventing us from doing further debugging.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/175

------------------------------------------------------------------------
On 2008-09-25T08:19:35+00:00 Andreas Jaeger wrote:

Olaf, I did a BIOS update to the latest Lenovo BIOS.  It did not help at
all.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/176

------------------------------------------------------------------------
On 2008-09-25T08:31:31+00:00 Okir wrote:

Jesse, I'm setting this bug as NEEDINFO to you.
The biggest roadblock right now is our inability to bring those dead
NICs back to life. Without this, we cannot proceed with testing, and
we are somewhat reluctant to try this ourselves, as it seems someone at
RedHat has bricked a laptop this way.

We tried a BIOS update on one of the affected laptops, but this didn't
help. And since we weren't aware of the problem in advance, we don't
have an ethregs dump of these.

So can you please get someone from Intel to help us with restoring the
NVM to a working condition? Thanks a lot!

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/177

------------------------------------------------------------------------
On 2008-09-25T14:33:46+00:00 Jpallen wrote:

I realize that my issue might be something different (see comment #83).
However, updating my BIOS seemed to bring my bricked T60p back to life.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/179

------------------------------------------------------------------------
On 2008-09-25T16:42:55+00:00 Okir wrote:

One more question to Intel.

There's a question whether the NVM we're talking about here is actually larger,
and is used by components other than the e1000e. If for instance the video BIOS
maps all of the NVM and, due to some bug, scribbles over parts of it that
include the e1000e's config space - is there a way to verify this?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/181

------------------------------------------------------------------------
On 2008-09-25T17:35:42+00:00 Eich-m wrote:

Update:
Since numerous people have seen this problem during installation when the 
Xserver is probed I've looked into what's happening at this stage:
The Xserver is started with a standard config - only the line containing the 
bus id and driver name is special. The installation program then connects to 
the xserver to obtain the randr version and information about the available 
outputs. Nothing is ever drawn (except for the standard X background).
For now I would also rule out drm as it is initialized but never used (2d 
operations don't use drm on this driver).
The probing scenario during installation can be reproduced on any running 
system with the command:
sysp -s xstuff
I'm currently condensing down the part that involves the X connection of this 
for better reproduction.
My goal is to narrow down where to look. The X driver is still a considerable 
chunk of code so it would be beneficial to reduce the possible sources of the 
problem.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/183

------------------------------------------------------------------------
On 2008-09-25T17:39:46+00:00 Eich-m wrote:

(In reply to comment #91 from Olaf Kirch)
> One more question to Intel.
> 
> There's a question whether the NVM we're talking about here is actually 
> larger,
> and is used by components other than the e1000e. If for instance the video 
> BIOS
> maps all of the NVM and, due to some bug, scribbles over parts of it that
> include the e1000e's config space - is there a way to verify this?
> 

I don't think this is the case: the driver only maps the POSTed copy of
the VBIOS. This is copied into RAM at POST time (to the 0xC-segment).
This copy is then made read only. This copy is (should be) entirely
independent of the EEPROM containing the PCI ROM.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/184

------------------------------------------------------------------------
On 2008-09-25T17:40:38+00:00 Eich-m wrote:

Another question came up: does this happen on both 64 and 32 bit
installations?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/185

------------------------------------------------------------------------
On 2008-09-25T17:55:54+00:00 Jesse Brandeburg wrote:

(In reply to comment #89 from Olaf Kirch)
> Jesse, I'm setting this bug as NEEDINFO to you.
> The biggest roadblock right now is our inability to bring those dead
> NICs back to life. Without this, we cannot proceed with testing, and
> we are somewhat reluctant to try this ourselves, as it seems someone at
> RedHat has bricked a laptop this way.

We'll get you a utility today to help with this, and at the same time we're 
working on a quick hack to the driver to take in an ethtool eeprom dump and 
push it back to the NVM.  We hope to have that done and working today.

> We tried a BIOS update on one of the affected laptops, but this didn't
> help. And since we weren't aware of the problem in advance, we don't
> have an ethregs dump of these.

So it depends on whether the BIOS version has the LAN part included.
Some BIOS versions do, and some do not.  I know that in particular there
were a couple of BIOS versions for the X60/T60 line that had LAN
NVM updates.

> So can you please get someone from Intel to help us with restoring the
> NVM to a working condition? Thanks a lot!

We are working on it, a couple of different avenues.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/186

------------------------------------------------------------------------
On 2008-09-25T17:59:47+00:00 Jesse Brandeburg wrote:

(In reply to comment #91 from Olaf Kirch)
> There's a question whether the NVM we're talking about here is actually 
> larger,
> and is used by components other than the e1000e. If for instance the video 
> BIOS
> maps all of the NVM and, due to some bug, scribbles over parts of it that
> include the e1000e's config space - is there a way to verify this?

the NVM in question is a single part that the entire machine (VGA, BIOS,
LAN, Manageability, AHCI, etc) all use.

I couldn't tell you how to verify if something else is mapping over the
top of the LAN area of the NVM.  The only reports I've heard are that
the LAN NVM is corrupted.  If you managed to corrupt the BIOS area, the
machine wouldn't boot.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/187

------------------------------------------------------------------------
On 2008-09-25T18:23:27+00:00 Jesse Brandeburg wrote:

(In reply to comment #94 from Egbert Eich)
> Another question came up: does this happen on both 64 and 32 bit 
> installations?

At this point we don't know.  At least one reporter I worked with was
running 32 bit.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/188

------------------------------------------------------------------------
On 2008-09-25T18:26:09+00:00 Stefan-seyfried wrote:

(In reply to comment #96 from Jesse Brandeburg)

> The only reports I've heard are that the LAN NVM is
> corrupted.  If you managed to corrupt the BIOS area, the machine wouldn't 
> boot.

Helmut Schaa has an HP 2510p that lost some of its display modes after a
hard X crash on an early 2.6.27-rc kernel (it now no longer knows that
it has a 1280x800 panel but thinks that it only has 1024x768, the BIOS
screen is in the upper left corner instead of centered on the screen).
Even though we don't know that this is the same problem, it shows that
sh*t happens.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/189

------------------------------------------------------------------------
On 2008-09-25T18:31:53+00:00 Jesse Barnes wrote:

Do we have any dumps of the gfx related crashes?  Comment #98 seems to
indicate that the video ROM may have also become corrupted (either that
or the EEPROM containing the EDID), but I don't currently have any
theories about how the gfx driver could cause that...

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/190

------------------------------------------------------------------------
On 2008-09-26T15:37:39+00:00 Renato-yamane wrote:

About Comment #86:
> but first we need reliable way to restore the EEPROM contents, otherwise the
> debugging is almost impossible.

A strange comment in Ubuntu bug Report that, maybe, can help:
"...I have resolved on my hp 8510w with an old image of windows, and my network 
card is reborn..."
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/75

Anyone have dual-boot (Windows) and can try this?

Best regards,
Renato

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/206

------------------------------------------------------------------------
On 2008-09-26T15:44:50+00:00 John Ronciak wrote:

The Windows drivers do not restore NVM images.  So I don't think this
report was seeing the same issue.  If the NVM is really corrupted,
loading the Windows driver is not going to fix it.  The Windows driver does
not calculate and check the checksum, so the device could be using
whatever is in the corrupted NVM and running with those settings.  Much as
in some cases on this bug, if you comment out the checksum check it works
for some people (probably with some random MAC address).
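
To illustrate the checksum check mentioned above: in the e1000-family
drivers, the 16-bit words 0x00 through 0x3F of the NVM must sum to
0xBABA, and the last of those words holds the checksum. The Python
sketch below applies that rule to the text output of ethtool -e. The
function names are invented for this example, the parser assumes the
usual "0xOFFSET: hex bytes" line layout, and none of this is the
driver's actual code.

```python
import re

NVM_SUM = 0xBABA      # expected 16-bit sum of the checksummed words
CHECKSUM_WORDS = 64   # words 0x00..0x3F; the last one stores the checksum


def parse_ethtool_dump(text):
    """Collect the bytes from `ethtool -e` style hex-dump lines.

    Lines that do not start with an offset such as "0x0010:" (for
    example the header lines) are ignored.
    """
    data = bytearray()
    for line in text.splitlines():
        match = re.match(r"\s*0x[0-9a-fA-F]+:\s*(.*)$", line)
        if match:
            data.extend(int(byte, 16) for byte in match.group(1).split())
    return bytes(data)


def nvm_checksum_ok(data):
    """Return True if the first 64 little-endian words sum to 0xBABA."""
    if len(data) < 2 * CHECKSUM_WORDS:
        raise ValueError("dump is shorter than the checksummed region")
    total = 0
    for i in range(CHECKSUM_WORDS):
        total += data[2 * i] | (data[2 * i + 1] << 8)
    return (total & 0xFFFF) == NVM_SUM
```

A dump saved with "ethtool -e eth0 > dump.txt" could then be checked
with nvm_checksum_ok(parse_ethtool_dump(open("dump.txt").read())).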

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/207

------------------------------------------------------------------------
On 2008-09-26T16:31:03+00:00 Karsten-keil wrote:

Jesse, John, we have one case where the NVM is not completely destroyed. It
seems only the NVM valid bit is no longer set, and it shows a checksum error.
The Lenovo T61 did work until an attempt to install Beta1, a network install.
During YaST's X server setup it rebooted, and since then it no longer loads
e1000e because of the checksum error. I will attach ethtool -e and ethregs from
this machine. I have not set the NVM valid bit so far, so the NIC is still in
this state.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/208

------------------------------------------------------------------------
On 2008-09-26T16:35:02+00:00 Karsten-keil wrote:

Created attachment 242026
T61 ethtool -e dump

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/209

------------------------------------------------------------------------
On 2008-09-26T16:37:17+00:00 Karsten-keil wrote:

Created attachment 242027
T61 ethregs output

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/210

------------------------------------------------------------------------
On 2008-10-01T07:04:29+00:00 Quentin Jackson wrote:

Guys, I have an HP 8510w which is experiencing some interesting
behaviour.  I believe I may have had a graphics corruption first, though
I don't recall if the problems started directly afterward.  I'm
definitely running the e1000e driver, the machine has an NVIDIA Quadro
FX570M (Mobile Version).  The first thing I noticed was the Intel Boot
agent in the BIOS reports the following;

Initializing Intel (R) Boot Agent GE v1.2.45
PXE-E05: The LAN adapter's configuration is corrupted or has not been 
initialized.  The Boot Agent cannot continue.

Then the eth0 device would no longer work.  I found a link (posted at
the end of this comment) that describes some workarounds using FreeDOS
and resetting the Intel Boot Agent with an Intel program called IBAUTIL.
At this point I was able to use the NIC under Windows but not under
Linux; Linux would complain with a standard message in YaST that the
card was corrupted and that the module was therefore not loaded.

I ran the procedure outlined using IBAUTIL and voila my linux ethernet
worked again.  However, upon booting up a day after it is back to being
dead.

If this is indeed the same situation, this may be all we need to get
info out of the card.  Also, I may potentially have access to more of
these machines that ARE going if that helps.

The windows OS will now not get an IP address either, which I assume
isn't just about the address and rather about hardware failure.  Event
Viewer shows nothing as usual, where's the Windows DMESG!!!!  Windows
was working fine all day though.

I shall try this procedure again, but I expect I am now out of luck :(

If someone wants me to post some kind of image from a going one of these
machines it might be possible, but I'll need to do it from an older
version of Linux I expect :)

http://dance.richii.com/article238.html

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/231

------------------------------------------------------------------------
On 2008-10-01T07:16:47+00:00 Quentin Jackson wrote:

I am now definitely in the same boat as everyone else, I don't even have
lights on on my NIC at the hardware level and the driver has been auto
removed from windows!  The worst part is the wireless doesn't work on
Linux in Beta 1 for me so no network in linux at all!  Now where is that
old cisco wireless card......

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/232

------------------------------------------------------------------------
On 2008-10-01T07:43:36+00:00 Jkosina-d wrote:

(In reply to comment #105 from Quentin Jackson)
> Guys, I have an HP 8510w which is experiencing some interesting behaviour.  I
> believe I may have had a graphics corruption first, though I don't recall if
> the problems started directly afterward.  I'm definitely running the e1000e
> driver, the machine has an NVIDIA Quadro FX570M (Mobile Version).  The first
> thing I noticed was the Intel Boot agent in the BIOS reports the following;
> 
> Initializing Intel (R) Boot Agent GE v1.2.45
> PXE-E05: The LAN adapter's configuration is corrupted or has not been
> initialized.  The Boot Agent cannot continue.

Quentin,

could you please post a lspci output from the affected machine? If you are 
experiencing the problem on a system that doesn't have intel graphics chip at 
all, you'd be the first one whatsoever, and this would really change the 
direction of our debugging efforts -- currently the main suspect is intel 
graphics driver in X.org, which apparently couldn't be blamed in such case.
In addition to that, could you please attach your /etc/X11/xorg.conf?

Thanks.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/233

------------------------------------------------------------------------
On 2008-10-01T08:05:55+00:00 Quentin Jackson wrote:

Created attachment 242732
LSPCI.txt

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/234

------------------------------------------------------------------------
On 2008-10-01T08:06:35+00:00 Quentin Jackson wrote:

Created attachment 242733
Xorg.conf

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/235

------------------------------------------------------------------------
On 2008-10-01T08:07:50+00:00 Quentin Jackson wrote:

Done :)  I don't think the NIC is showing up in lspci at all from what I
can see.  I also noticed my Firewire connector (which shows up as a NIC in
Windows) has an x through it; I really hope that's unrelated!

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/236

------------------------------------------------------------------------
On 2008-10-01T08:40:09+00:00 Jkosina-d wrote:

(In reply to comment #110 from Quentin Jackson)
> Done 

Thanks. So apparently, you are really the first one, to my knowledge,
who reports the problem on ICH chipset, but with no Intel graphics chip
at all. This really seems to rule out the xorg graphics driver issue in
my eyes.

Could you please boot a "Kernel Of The Day" from

          ftp://ftp.suse.com/pub/projects/kernel/kotd/HEAD/

This kernel contains a load of fixes for the e1000e driver. It is
unfortunately not currently able to bring your network card back to
life, but it will output a EEPROM contents dump into 'dmesg' output even
if the contents are corrupt. Could you please attach this output then?

This will allow us to verify whether you are really hitting the very
same problem.

Thanks.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/237

------------------------------------------------------------------------
On 2008-10-01T09:29:24+00:00 Karsten-keil wrote:

Quentin, it is very important to get the NIC NVM image for this machine
with ethtool -e. You could use an old SuSE 11.0 CD for this; the rescue
system is enough. You can mount a USB stick and save the ethtool -e eth0
output on it.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/238

------------------------------------------------------------------------
On 2008-10-01T18:00:25+00:00 S-puch wrote:

Hi guys, so far I'm not affected by this bug, although (according to Jesse
Brandeburg) I would be a very hot candidate.
This bug does not seem to be specific to SUSE Linux (I mostly use Mandriva), but
since SUSE Labs seemed very active on LKML in getting this problem fixed, I
subscribed to this bug tracking system, too.

I would like to offer my help if desired, because I've got a Lenovo T61
as in Comment #102 and I have got a graphic adapter from NVIDIA (NVIDIA
Quadro 140M) which should use the same driver as the HP 8510w from
Comment #107.

I've got a backup of my working NIC NVM so if it would help I could post
it here. As I need my laptop for daily business work I can only do
further testings if there is a valid method to get a broken NIC back to
work. I know that some guys of Intel are working on a tool doing that
but I don't know if it is released yet.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/241

------------------------------------------------------------------------
On 2008-10-01T19:29:24+00:00 Quentin Jackson wrote:

OK, I'll have to do the kernel of the day tonight when I'm at home, but
I should be able to use the ethtool dump today, hunting down a laptop
now :)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/244

------------------------------------------------------------------------
On 2008-10-01T19:57:07+00:00 Quentin Jackson wrote:

Created attachment 242921
ethtool dump from HP 8510w

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/245

------------------------------------------------------------------------
On 2008-10-01T19:58:18+00:00 Quentin Jackson wrote:

Please advise, if this suffices.  Sounds like you've been looking for it
for a while.  Theoretically I have one of these machines to play with
whenever needed, both dead and not dead.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/246

------------------------------------------------------------------------
On 2008-10-02T06:41:39+00:00 Quentin Jackson wrote:

Created attachment 242983
DMESG output after latest kernel of the day

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/249

------------------------------------------------------------------------
On 2008-10-02T06:44:34+00:00 Quentin Jackson wrote:

The Kernel upgrade complained that I was upgrading over a newer version,
I forced it as it was dated October.  But thought I should mention it
in case anything doesn't come through correctly.  After the kernel was
loaded and rebooted one of the network card lights now comes on, I don't
think it was doing this in windows and definitely not in linux.  Let me
know if there is anything else I can provide and let me know if this is
this bug or if I need to log it somewhere else!  :)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/250

------------------------------------------------------------------------
On 2008-10-02T08:14:57+00:00 Mmeeks-i wrote:

I can volunteer too - I have a T60p with a:
02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet 
Controller
and (amusingly) the socket is physically broken (by myself), so I seldom to 
never use it.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/252

------------------------------------------------------------------------
On 2008-10-03T05:19:09+00:00 Quentin Jackson wrote:

Seems to have gone quiet around here :)

Can someone please explain to me what path they expect this bug to take?
I am sitting with an unusable system and am wondering whether to go back
to OpenSuSE 10.3 as at least I can have working wireless in that
version.  Unless I can get some direction I see no point in leaving
Beta1 on my system as I cannot continue with bug fixing with no network
access.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/271

------------------------------------------------------------------------
On 2008-10-03T07:02:04+00:00 Okir wrote:

Currently, we're busy testing the patches we've put into beta2. These
are mostly patches from intel, also posted upstream on LKML

On beta1, we're able to reproduce the issue pretty reliably by simply booting
into runlevel 3 and shutting down the machine 1 minute later. The problem will
usually show up within 3-20 reboots. With beta2, we have so far run 350
reboots or more without hitting the problem.

We're currently still discussing with Intel and LKML what the cause of the
problem may be. We're chasing a number of leads, but it seems at least
one of the patches we have so far is effective in stopping the corruption
from happening.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/272

------------------------------------------------------------------------
On 2008-10-03T07:22:44+00:00 Quentin Jackson wrote:

That's a good update, thanks.  More specifically, is someone able to
advise:

a) is it possible eventually for this hardware to be repaired via some
kind of software programming?

b) If so are we awaiting Intel or can this be done by my providing the
ethtool dump above or something more specific?

c) If so presuming we would have a fix within, 2-4 weeks?

If not then it would make sense to get my hardware repaired and no doubt
others will be interested in ETA's on this too.

Thanks.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/273

------------------------------------------------------------------------
On 2008-10-03T13:25:33+00:00 Karsten-keil wrote:

Hello Quentin, to answer your questions:
a) Yes, I'm working on a GPL tool for that
b) You need an ethtool dump, ideally from the machine itself or from a similar
   machine (then you need to give the MAC address to the tool).
   To see if another machine has the same device, you need the PCI IDs from
   the machine before the overwrite happened; the IDs are in most cases
   overridden via the NVM, and if the NVM gets corrupted the device falls
   back to the generic IDs
c) I hope I have a verified working version early next week

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/278

------------------------------------------------------------------------
On 2008-10-03T21:44:46+00:00 Jesse Brandeburg wrote:

It appears that the patch to use set_memory_ro/rw changes the timings
enough in our test boxes that the problem no longer occurs.

We are not currently sure why this patch fixes it, but I wanted to share
our findings.

We also have a patch (will attach here soon) to restore the eeprom from
an ethtool -e dump, using a sysfs interface to the driver.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/289

------------------------------------------------------------------------
On 2008-10-04T22:11:30+00:00 Renato-yamane wrote:

Fixed?
<http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4a7703582836f55a1cbad0e2c1c6ebbee3f9b3a7>

Best regards,
Renato S. Yamane

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/296

------------------------------------------------------------------------
On 2008-10-05T08:38:07+00:00 Jkosina-d wrote:

(In reply to comment #125 from Renato Yamane)
> Fixed?
> <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4a7703582836f55a1cbad0e2c1c6ebbee3f9b3a7>
> 

Yes, that's a workaround that prevents the corruption of the EEPROM
contents, but it doesn't fix the real problem, just prevents bad things
from happening when the bug triggers.

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/300

------------------------------------------------------------------------
On 2008-10-07T06:55:16+00:00 Quentin Jackson wrote:

I'm hanging out to restore my ethernet card firmware.  Any chance on
getting that EEPROM restore application?  Or if not public yet any
chance of emailing it to quentin dot jackson at exclamation dot co dot
nz?  :)

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/321

------------------------------------------------------------------------
On 2008-10-07T13:52:20+00:00 Karsten-keil wrote:

The restore application works now; I successfully restored a broken
ThinkPad X61s. I'm now preparing a mini ISO with the application and our
rescue system, so you can boot from this CD and use the application in a
sane environment.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/322

------------------------------------------------------------------------
On 2008-10-09T21:09:03+00:00 Quentin Jackson wrote:

Well, I have gotten hold of and applied the recovery tool.
Unfortunately it does not work for the following reason:

The device does not list in lspci or lspci -n because it is dead,
therefore I cannot find the new device ID because it doesn't have one.
The tool relies on this information to work.  Apparently there are other
tools that will get around it via some kind of BIOS update direct from
intel.  Thought you would all like to know.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/324

------------------------------------------------------------------------
On 2008-10-09T21:10:32+00:00 Quentin Jackson wrote:

I should have said, this is the case on my device, apparently it is not
the case for all devices, you will need to check if your device is
listed in lspci or not.
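
A quick way to make that check without reading lspci output by eye is
to scan sysfs directly. The Python sketch below is only an illustration
(the function name is invented here); it relies on the standard sysfs
attribute files vendor, device and class, and lists Intel (vendor
0x8086) devices whose PCI class is Ethernet controller (0x0200xx). An
empty result corresponds to the "not listed in lspci" case described
above.

```python
from pathlib import Path

INTEL_VENDOR = "0x8086"
ETHERNET_CLASS_PREFIX = "0x0200"  # PCI class 02 (network), subclass 00 (Ethernet)


def intel_ethernet_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return (pci_address, device_id) pairs for Intel Ethernet controllers."""
    found = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            device = (dev / "device").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
        except OSError:
            continue  # not a PCI device directory, or attribute unreadable
        if vendor == INTEL_VENDOR and pci_class.startswith(ETHERNET_CLASS_PREFIX):
            found.append((dev.name, device))
    return found


if __name__ == "__main__":
    devices = intel_ethernet_devices()
    if not devices:
        print("no Intel Ethernet controller visible on the PCI bus")
    for address, device_id in devices:
        print(address, device_id)
```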

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/325

------------------------------------------------------------------------
On 2008-10-15T23:57:15+00:00 John Ronciak wrote:

Here is a patch which we at Intel LAD have been testing today.  This
looks to be a workaround and, with the .28 ftrace code, a fix for the root
cause of the problem.  The problem was with ftrace, which is what we
bisected to last week.  On systems that failed within minutes we have not
been able to make it happen once ftrace was disabled.  So I think the .28
ftrace needs to get included into SLES11.

>---------- Forwarded message ----------
>From: Steven Rostedt <rost...@goodmis.org>
>Date: Wed, Oct 15, 2008 at 3:21 PM
>Subject: [PATCH -stable] disable CONFIG_DYNAMIC_FTRACE due to possible
>memory corruption on module unload
>To: LKML <linux-ker...@vger.kernel.org>, sta...@kernel.org
>Cc: Linus Torvalds <torva...@linux-foundation.org>, Andrew Morton
><a...@linux-foundation.org>, Arjan van de Ven <ar...@infradead.org>,
>gre...@suse.de, jesse.brandeb...@intel.com, Thomas Gleixner
><t...@linutronix.de>, Ingo Molnar <mi...@elte.hu>
>
>
>
>While debugging the e1000e corruption bug with Intel, we discovered
>today that the dynamic ftrace code in mainline is the likely source of
>this bug.
>
>For the stable kernel we are providing the only viable fix patch: labeling
>CONFIG_DYNAMIC_FTRACE as broken. (see the patch below)
>
>We will follow up with a backport patch that contains the fixes. But since
>the fixes are not a one liner, the safest approach for now is to
>disable the code in question.
>
>The cause of the bug is due to the way the current code in mainline
>handles dynamic ftrace.  When dynamic ftrace is turned on, it also
>turns on CONFIG_FTRACE which enables the -pg config in gcc that places
>a call to mcount at every function call. With just CONFIG_FTRACE this
>causes a noticeable overhead.  CONFIG_DYNAMIC_FTRACE works to ease this
>overhead by dynamically updating the mcount call sites into nops.
>
>The problem arises when we trace functions and modules are unloaded.
>The first time a function is called, it will call mcount and the mcount
>call will call ftrace_record_ip. This records the calling site and
>stores it in a preallocated hash table. Later on a daemon will
>wake up and call kstop_machine and convert any mcount callers into
>nops.
>
>The evolution of this code first tried to do this without the kstop_machine
>and used cmpxchg to update the callers as they were called. But I
>was informed that this is dangerous to do on SMP machines if another
>CPU is running that same code. The solution was to do this with
>kstop_machine.
>
>We still used cmpxchg to test if the code that we are modifying is
>indeed code that we expect to be before updating it - as a final
>line of defense.
>
>But on 32bit machines, ioremapped memory and modules share the same
>address space. When a module would load its code into memory and execute
>some code, that would register the function.
>
>On module unload, ftrace incorrectly did not zap these functions from
>its hash (this was the bug). The cmpxchg could have saved us in most
>cases (via luck) - but with ioremap-ed memory that was exactly the wrong
>thing to do - the results of cmpxchg on device memory are undefined.
>(and will likely result in a write)
>
>The pending .28 ftrace tree does not have this bug anymore, as a general push
>towards more robustness of code patching, this is done differently: we do not
>use cmpxchg and we do a WARN_ON and turn the tracer off if anything deviates
>from its expected state. Furthermore, patch sites are statically identified
>during build time so there's no runtime discovery of dynamic code areas
>anymore, and no room for code unmaps to cause the hash to become out of date.
>
>We believe the fragility of dynamic patching has been sufficiently
>addressed in the development code via the static patching method, but further
>suggestions to make it more robust are welcome.
>
>Signed-off-by: Steven Rostedt <srost...@goodmis.org>
>Acked-by: Ingo Molnar <mi...@elte.hu>
>Acked-by: Thomas Gleixner <t...@linutronix.de>
>---
> kernel/trace/Kconfig |    3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>Index: linux-compile.git/kernel/trace/Kconfig
>===================================================================
>--- linux-compile.git.orig/kernel/trace/Kconfig 2008-10-02 10:18:49.000000000 -0400
>+++ linux-compile.git/kernel/trace/Kconfig      2008-10-15 17:29:34.000000000 -0400
>@@ -103,7 +103,8 @@ config CONTEXT_SWITCH_TRACER
>         all switching of tasks.
>
> config DYNAMIC_FTRACE
>-       bool "enable/disable ftrace tracepoints dynamically"
>+       bool "enable/disable ftrace tracepoints dynamically (BROKEN)"
>+       depends on BROKEN
>       depends on FTRACE
>       depends on HAVE_DYNAMIC_FTRACE
>       default y
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majord...@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>
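
The failure sequence described in the forwarded message can be modelled
in a few lines. The Python toy below is only an illustration: ToyKernel,
its methods and the addresses are invented for this sketch and do not
correspond to the real ftrace code. It follows the description above:
call sites are recorded on first use, the module is unloaded, the same
addresses are reused for ioremapped device memory, and the patching
pass then touches whatever now lives at the recorded addresses. Any
cmpxchg that lands on device memory is counted as a write, matching the
remark that it "will likely result in a write".

```python
CALL_MCOUNT = "call mcount"
NOP = "nop"


class ToyKernel:
    """Very small model of recorded call sites and a shared address space."""

    def __init__(self, zap_on_unload):
        self.zap_on_unload = zap_on_unload
        self.memory = {}         # address -> ("code" or "device", contents)
        self.recorded = set()    # stand-in for the hash of recorded call sites
        self.device_writes = []  # device addresses touched by the patch pass

    def load_module(self, addresses):
        for addr in addresses:
            self.memory[addr] = ("code", CALL_MCOUNT)

    def call_function(self, addr):
        # The first call goes through mcount, which records the call site.
        if self.memory[addr] == ("code", CALL_MCOUNT):
            self.recorded.add(addr)

    def unload_module(self, addresses):
        for addr in addresses:
            del self.memory[addr]
            if self.zap_on_unload:
                self.recorded.discard(addr)  # the step the buggy code skipped

    def ioremap(self, addresses, contents):
        for addr in addresses:
            self.memory[addr] = ("device", contents)

    def patch_pass(self):
        # Stand-in for the daemon that turns recorded mcount calls into nops.
        for addr in sorted(self.recorded):
            if addr not in self.memory:
                continue
            kind, old = self.memory[addr]
            if kind == "device":
                self.device_writes.append(addr)  # cmpxchg hit device memory
            elif old == CALL_MCOUNT:
                self.memory[addr] = (kind, NOP)
        self.recorded.clear()


def scenario(zap_on_unload):
    kernel = ToyKernel(zap_on_unload)
    module = [0x1000, 0x1004]
    kernel.load_module(module)
    kernel.call_function(0x1000)
    kernel.unload_module(module)
    kernel.ioremap(module, "NVM registers")  # same addresses, now a device
    kernel.patch_pass()
    return kernel.device_writes
```

With the stale entry left in place, scenario(False) reports a write to
the remapped address; with entries removed on unload, scenario(True)
reports none.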

Reply at: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/336

------------------------------------------------------------------------
On 2008-10-16T00:07:40+00:00 Gregkh-n wrote:

This patch is now included in our SLE11 kernel, as it is in 2.6.27.1,
which is the base of our kernel tree.

So, I guess we can close this out now, thanks for all of the work
everyone!

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/337

------------------------------------------------------------------------
On 2009-04-28T13:33:09+00:00 Wstephenson-9 wrote:

Was the recovery tool ever published?  I just ran into a beta user who
still has a trashed e1000e.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/353

------------------------------------------------------------------------
On 2009-04-28T16:02:34+00:00 Jesse Brandeburg wrote:

david.gra...@intel.com can help

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/354

------------------------------------------------------------------------
On 2009-04-28T16:07:53+00:00 dave graham wrote:

I have been dealing with a lot of these recovery requests, and have been
using a tool developed by Karsten Keil. The tool reads the (probably)
corrupted content, which is sent to me. I repair the image - usually a
single-byte corruption, and then I return the corrected image to be
written back to the NVM using the same tool.
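
One way to locate such a corruption, assuming a known-good dump from an
identical machine is available, is a plain byte-wise comparison. The
Python sketch below shows that idea only; it is not necessarily how the
images are repaired in practice, and the function name is invented for
this example. The first six bytes are skipped by default because the
first three NVM words hold the MAC address, which legitimately differs
between machines.

```python
MAC_BYTES = 6  # the first three 16-bit NVM words hold the MAC address


def differing_offsets(reference, suspect, skip=MAC_BYTES):
    """Return (offset, reference_byte, suspect_byte) for each mismatch.

    Both arguments are raw NVM images as bytes objects of equal length.
    """
    if len(reference) != len(suspect):
        raise ValueError("images have different lengths")
    return [
        (offset, reference[offset], suspect[offset])
        for offset in range(skip, len(reference))
        if reference[offset] != suspect[offset]
    ]
```

A single-byte corruption shows up as exactly one entry in the returned
list.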

These are the instructions that I have been providing for the individual
reports I have had:
---- Start of instructions -------
Go to:

      ftp://ftp.suse.com/pub/people/kkeil/testing/e1000e/

Copy & paste this link into a browser window, and you should see a list
of files, including one:

      e1000e_recover.iso

This is an ISO image of a CD, so save it to your local system, then burn
it to a CD, and use it to boot your problem system. From finding the ISO
to actually booting your system is quite a few steps - if you get stuck
of course just let me know and I'll guide you through the detail, but
for now I'll assume that you're still with me.

From the boot options presented by the CD, select "rescue system", as
that's where we'll find the eeprom recovery tool.

When prompted for user, log on as root. There's no password, so just hit
return.

1) Read the current eeprom and save it to a file. Be patient!

      e1000_nvm -r -u -o ethtool.dmp

2) Mount a USB disk to save the file, and send the file to me at
david.gra...@intel.com

I will then fix up the image, and mail it back to you as ethtoola.dmp,
and then, you can boot again to the CD, and

3) Write the new eeprom content back to your system NVM, using something
like the following (it may differ depending on the device ID that is
indicated in the NVM, but I will provide any update to this step along
with the fixed-up NVM image that I return):

      e1000_nvm -u -P 10498086 ethtoola.dmp

And select YES when prompted.

4) You should then be able to remove the recovery CD, and successfully
boot back to a working ethernet.

---- End of instructions -------
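
For readers curious about what "The NVM Checksum Is Not Valid" refers to,
here is a minimal Python sketch of the generic e1000e checksum rule: the
16-bit sum of NVM words 0x00 through 0x3F must equal 0xBABA. It is only an
illustration, not the recovery tool's method and not the repair procedure
described above. It assumes the image is a raw little-endian dump starting
at NVM word 0; the real output format of e1000_nvm is not described in
this thread, and the function names are hypothetical.

```python
import struct

NVM_SUM = 0xBABA          # expected 16-bit sum (value used by the Linux e1000e driver)
NVM_CHECKSUM_WORD = 0x3F  # index of the word that stores the checksum


def load_words(raw):
    """Unpack the first 64 little-endian 16-bit words (raw layout is an assumption)."""
    return list(struct.unpack("<64H", raw[:128]))


def checksum_ok(words):
    """Return True if words 0x00..0x3F sum to NVM_SUM modulo 2**16."""
    return sum(words[:NVM_CHECKSUM_WORD + 1]) & 0xFFFF == NVM_SUM


def expected_checksum(words):
    """Compute the value word 0x3F must hold for the image to validate."""
    return (NVM_SUM - sum(words[:NVM_CHECKSUM_WORD])) & 0xFFFF
```

A failed check only shows that something in the first 64 words changed; it
cannot say which byte was corrupted. Recomputing word 0x3F over corrupted
data would merely hide the damage, which is why the image is repaired by
hand before being written back.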

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/355

** Changed in: linux (Suse)
   Importance: Unknown => Critical

