RE: Silent corruption on AMD64

2007-04-02 Thread Stuart MacDonald
From: On Behalf Of Aaron Lehmann
> I've been able to narrow it down to the Realtek Ethernet card. I can't
> reproduce the problem using onboard Ethernet, whereas the Realtek card
> causes trouble in any slot. However, I still don't know whether it's a
> hardware or software issue, or whether it's caused directly or
> indirectly by the Realtek card.

I had a similar issue recently:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=223216

I recommend trying Doug Ledford's memtest script:
http://people.redhat.com/dledford/memtest.html

It helped me prove the issue was the hardware and not something else.

..Stu

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Silent corruption on AMD64

2007-04-01 Thread Andi Kleen
Aaron Lehmann <[EMAIL PROTECTED]> writes:

[adding netdev]
[meta-comment: I wish people wouldn't use such unnecessarily broad subjects 
-- how is it the x86-64 port's or AMD's fault when you have broken hardware? 
Would anybody write "Silent corruption on i386" or "Silent corruption 
on Intel" or "Silent corruption on Linux"?]

> On Sat, Mar 31, 2007 at 08:03:16PM -0700, Jim Paris wrote:
> > Since it shows up under heavy load that includes unrelated devices, I
> > think ruling out hardware problems is important.  Some suggestions:
> 
> I've been able to narrow it down to the Realtek Ethernet card. I can't
> reproduce the problem using onboard Ethernet, whereas the Realtek card
> causes trouble in any slot. However, I still don't know whether it's a
> hardware or software issue, or whether it's caused directly or
> indirectly by the Realtek card.

You could disable the hardware checksumming support in the card with
the appended patch. Then hopefully Linux will catch most corruptions
(but perhaps not all, because TCP checksums are not very strong).
You can then watch for failed checksums with netstat -s.
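
For watching the counters over a long test run, polling them from a script may be easier than rereading netstat -s output by hand. The following is a minimal sketch, assuming the counters are exposed in /proc/net/snmp as alternating header and value lines, and assuming that failed TCP checksums show up in the Tcp InErrs field (newer kernels may also provide InCsumErrors). The function names are illustrative and not part of any existing tool.

```python
def parse_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}.

    The format is assumed to be pairs of lines: a header line of counter
    names followed by a line of integer values, both prefixed by "Proto:".
    """
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, numbers = values.split(":", 1)
        if proto != value_proto:
            continue  # lines are not a matching header/value pair
        counters[proto] = dict(
            zip(names.split(), (int(n) for n in numbers.split()))
        )
    return counters


def read_tcp_errors(path="/proc/net/snmp"):
    """Return whichever TCP receive-error counters this kernel exposes."""
    with open(path) as f:
        tcp = parse_snmp(f.read()).get("Tcp", {})
    # InCsumErrors is only present on newer kernels (assumption).
    return {name: tcp[name] for name in ("InErrs", "InCsumErrors") if name in tcp}
```

Calling read_tcp_errors() before and after a test run and comparing the two results shows whether the counters moved while the load was applied.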

-Andi

Index: linux-2.6.21-rc3-net/drivers/net/r8169.c
===
--- linux-2.6.21-rc3-net.orig/drivers/net/r8169.c
+++ linux-2.6.21-rc3-net/drivers/net/r8169.c
@@ -2477,6 +2477,7 @@ static inline int rtl8169_fragmented_fra
 
 static inline void rtl8169_rx_csum(struct sk_buff *skb, struct RxDesc *desc)
 {
+#if 0
 	u32 opts1 = le32_to_cpu(desc->opts1);
 	u32 status = opts1 & RxProtoMask;
 
@@ -2485,6 +2486,7 @@ static inline void rtl8169_rx_csum(struc
 	    ((status == RxProtoIP) && !(opts1 & IPFail)))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 	else
+#endif
 		skb->ip_summed = CHECKSUM_NONE;
 }
 



Re: Silent corruption on AMD64

2007-03-31 Thread Aaron Lehmann
On Sat, Mar 31, 2007 at 08:03:16PM -0700, Jim Paris wrote:
> Since it shows up under heavy load that includes unrelated devices, I
> think ruling out hardware problems is important.  Some suggestions:

I've been able to narrow it down to the Realtek Ethernet card. I can't
reproduce the problem using onboard Ethernet, whereas the Realtek card
causes trouble in any slot. However, I still don't know whether it's a
hardware or software issue, or whether it's caused directly or
indirectly by the Realtek card.


Re: Silent corruption on AMD64

2007-03-31 Thread Jim Paris
Aaron Lehmann wrote:
> I discovered a reproducible way of causing silent file corruption.
...
> 1. Heavy Ethernet load (nc remotehost < /dev/zero)
> 2. Heavy disk write load on any non-sata_sil drive (cat /dev/zero > /path)
> 3. Heavy disk read load on any other drive (tar c /path | cat > /dev/null)

Since it shows up under heavy load that includes unrelated devices, I
think ruling out hardware problems is important.  Some suggestions:

- Use mcelog to see if you're getting any machine check exceptions
  that would indicate hardware error: http://freshmeat.net/projects/mcelog/

- Use the edac module to turn on pci parity and memory error checks:
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/drivers/edac/edac.txt

- Run memtest86+ for several loops to make sure your RAM is ok

- Try moving the SiI card to a different slot

- Try running the SATA drives from a separate power supply

- Move disks and cables around to see whether the problem follows the
  disks, the cables, or the controllers

- Try enabling the "spread spectrum" clock option in your BIOS to
  reduce EMI
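
As an illustration of the edac suggestion above, the error counters can be sampled before and after a stress run. This is a minimal sketch, assuming the EDAC sysfs layout under /sys/devices/system/edac with per-controller ce_count and ue_count files and a pci directory holding pci_parity_count and pci_nonparity_count. Those paths may differ between kernel versions, and the helper names are made up for this example.

```python
import glob
import os


def read_counter(path):
    """Return the integer stored in a sysfs counter file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def read_edac_counters(base="/sys/devices/system/edac"):
    """Collect memory and PCI error counters found under the EDAC sysfs tree.

    The directory layout is an assumption; missing files are skipped.
    """
    counters = {}
    # One directory per memory controller: mc0, mc1, ...
    for mc_dir in sorted(glob.glob(os.path.join(base, "mc", "mc*"))):
        name = os.path.basename(mc_dir)
        for field in ("ce_count", "ue_count"):  # corrected / uncorrected errors
            value = read_counter(os.path.join(mc_dir, field))
            if value is not None:
                counters[name + "/" + field] = value
    for field in ("pci_parity_count", "pci_nonparity_count"):
        value = read_counter(os.path.join(base, "pci", field))
        if value is not None:
            counters["pci/" + field] = value
    return counters
```

If any counter increases while the corruption is being reproduced, that points toward a hardware problem rather than a driver bug.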

-jim


Re: Silent corruption on AMD64

2007-03-31 Thread Aaron Lehmann
On Sat, Mar 31, 2007 at 07:52:36PM -0700, Andrew Morton wrote:
> Are you able to provide us with some before-and-after data so we
> can see this corruption?
> 
> See, if it's dropped-bits or shifted-data or eight-byte-aligned
> kernel addresses or whatever, that helps us generate theories..

Sure.

I created a large file containing the repeating ASCII string "abcdefgh",
and subjected it to the corruption I described earlier. The correct
hex sequence is:

61 62 63 64 65 66 67 68

Here were some of the permutations that I found in corrupted copies:

61 62 63 64 92 57 5C 0A
61 62 63 64 A2 2D E1 C7
61 62 63 64 11 38 0E B6
61 62 63 64 57 B1 EE 1F
61 62 63 64 E0 3D 10 21
61 62 63 64 97 E1 C0 F5

I did not observe any errors other than replacements of four-byte
blocks. These errors always started at addresses in the file that had
a remainder of 12 modulo 16 (i.e. the hex addresses always ended in
'C'). On average there was about one error per 300MB.
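
A scan like the one described above could be sketched as follows. This is only an illustration, not the tool actually used for the report: it assumes the test file is nothing but the repeating string "abcdefgh" starting at offset 0, and the function names are made up. Note that a corrupted byte which happens to equal the expected byte is not counted, so a run can appear shorter than it really is, and a run that crosses a chunk boundary is reported as two runs.

```python
PATTERN = b"abcdefgh"


def find_corruption(data, pattern=PATTERN, base_offset=0):
    """Return (file_offset, length) runs where data departs from the pattern."""
    runs = []
    start = None
    for i, byte in enumerate(data):
        expected = pattern[(base_offset + i) % len(pattern)]
        if byte != expected:
            if start is None:
                start = i  # first bad byte of a new run
        elif start is not None:
            runs.append((base_offset + start, i - start))
            start = None
    if start is not None:
        runs.append((base_offset + start, len(data) - start))
    return runs


def scan_file(path, pattern=PATTERN, chunk_size=1 << 20):
    """Scan a file chunk by chunk and return all corrupted runs."""
    if chunk_size % len(pattern) != 0:
        raise ValueError("chunk_size must be a multiple of the pattern length")
    expected = pattern * (chunk_size // len(pattern))
    runs = []
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Cheap whole-chunk comparison first; byte scan only on mismatch.
            if chunk != expected[:len(chunk)]:
                runs.extend(find_corruption(chunk, pattern, offset))
            offset += len(chunk)
    return runs


def summarize(runs):
    """Print each run with its offset modulo 16 for alignment checks."""
    for offset, length in runs:
        print("offset 0x%X  length %d  offset mod 16 = %d"
              % (offset, length, offset % 16))
```

Running summarize(scan_file(path)) on a corrupted copy lists each damaged run, which makes it easy to check whether every error is four bytes long and starts at an offset of 12 modulo 16.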


Re: Silent corruption on AMD64

2007-03-31 Thread Andrew Morton
> On Sat, 31 Mar 2007 18:27:36 -0700 Aaron Lehmann <[EMAIL PROTECTED]> wrote:
> I have spent a lot of time trying to find a simpler test case. So far,
> as far as I can tell, there are three conditions that must be
> satisfied for corruption to occur:
> 
> 1. Heavy Ethernet load (nc remotehost < /dev/zero)
> 2. Heavy disk write load on any non-sata_sil drive (cat /dev/zero > /path)
> 3. Heavy disk read load on any other drive (tar c /path | cat > /dev/null)
> 
> With these conditions satisfied, data read off sda or sdb (the drives
> associated with sata_sil) is often corrupted. Since I can only see
> this problem with files on those two drives, I'm inclined to suspect
> the sata_sil driver, but I really have no idea what's going on. I know
> this is not a recent issue - I experienced very similar corruption at
> least a year ago. I wasn't able to reproduce it at the time, because
> it only appeared in the backups I was restoring from.

Are you able to provide us with some before-and-after data so we
can see this corruption?

See, if it's dropped-bits or shifted-data or eight-byte-aligned
kernel addresses or whatever, that helps us generate theories..


Silent corruption on AMD64

2007-03-31 Thread Aaron Lehmann
Hello,

I discovered a reproducible way of causing silent file corruption.
Unfortunately, this method happens to be my backup procedure :(.

Background: I have five hard drives. sda and sdb are on a SiI 3112
card. sdc and sdd use onboard sata_via. hda uses onboard VIA VT8237
IDE. All filesystems are ext3. Ethernet is PCI RTL8169. My kernel is
2.6.20.1, configured for SMP and PREEMPT, but I was able to confirm
that this corruption happens without SMP or PREEMPT (though it's
rarer).

The following simultaneous actions result in corrupt data being read
from one of the sata_sil drives:

1. rsync files from sdd to sdc
2. rsync files from sdb to a remote host

If I run md5sum on a few hundred megabytes on sdb while doing these
things, the md5sum computed will usually be wrong. I believe the data
getting rsynced off sdb is also corrupt.
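
The md5sum check can be automated by hashing the same region several times and comparing the results. The following is a minimal sketch with made-up function names. It assumes that repeated reads actually reach the disk, so the region should be larger than the page cache, or the cache should be dropped between passes. Otherwise later passes may be served from memory and hide the problem.

```python
import hashlib


def md5_of_range(path, length, offset=0, chunk_size=1 << 20):
    """Return the MD5 hex digest of `length` bytes of `path` starting at `offset`."""
    digest = hashlib.md5()
    remaining = length
    with open(path, "rb") as f:
        f.seek(offset)
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break  # reached end of file early
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def repeated_digests(path, length, passes=5):
    """Hash the same region several times and return all digests."""
    return [md5_of_range(path, length) for _ in range(passes)]


def digests_agree(digests):
    """True if every pass produced the same digest."""
    return len(set(digests)) <= 1
```

If digests_agree(repeated_digests(path, length)) returns False while the other loads are running, at least one read of unchanged data returned different bytes.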

I have spent a lot of time trying to find a simpler test case. So far,
as far as I can tell, there are three conditions that must be
satisfied for corruption to occur:

1. Heavy Ethernet load (nc remotehost < /dev/zero)
2. Heavy disk write load on any non-sata_sil drive (cat /dev/zero > /path)
3. Heavy disk read load on any other drive (tar c /path | cat > /dev/null)

With these conditions satisfied, data read off sda or sdb (the drives
associated with sata_sil) is often corrupted. Since I can only see
this problem with files on those two drives, I'm inclined to suspect
the sata_sil driver, but I really have no idea what's going on. I know
this is not a recent issue - I experienced very similar corruption at
least a year ago. I wasn't able to reproduce it at the time, because
it only appeared in the backups I was restoring from.

dmesg and .config follow.


Linux version 2.6.20.1 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP PREEMPT Sat Feb 24 11:41:46 PST 2007
Command line: root=/dev/sda1 notsc ro
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009ec00 (usable)
 BIOS-e820: 0009ec00 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 5fee (usable)
 BIOS-e820: 5fee - 5fee3000 (ACPI NVS)
 BIOS-e820: 5fee3000 - 5fef (ACPI data)
 BIOS-e820: 5fef - 5ff0 (reserved)
 BIOS-e820: e000 - f000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 158) 0 entries of 256 used
Entering add_active_range(0, 256, 392928) 1 entries of 256 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP (v000 K8T890) @ 0x000f7920
ACPI: RSDT (v001 K8T890 AWRDACPI 0x42302e31 AWRD 0x) @ 0x5fee3040
ACPI: FADT (v001 K8T890 AWRDACPI 0x42302e31 AWRD 0x) @ 0x5fee30c0
ACPI: SSDT (v001 PTLTD  POWERNOW 0x0001  LTP 0x0001) @ 0x5feea800
ACPI: SRAT (v001 AMDHAMMER   0x0001 AMD  0x0001) @ 0x5feeaa40
ACPI: MCFG (v001 K8T890 AWRDACPI 0x42302e31 AWRD 0x) @ 0x5feeab40
ACPI: MADT (v001 K8T890 AWRDACPI 0x42302e31 AWRD 0x) @ 0x5feea740
ACPI: DSDT (v001 K8T890 AWRDACPI 0x1000 MSFT 0x010e) @ 0x
Entering add_active_range(0, 0, 158) 0 entries of 256 used
Entering add_active_range(0, 256, 392928) 1 entries of 256 used
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 ->  1048576
early_node_map[2] active PFN ranges
    0:        0 ->      158
    0:      256 ->   392928
On node 0 totalpages: 392830
  DMA zone: 56 pages used for memmap
  DMA zone: 972 pages reserved
  DMA zone: 2970 pages, LIFO batch:0
  DMA32 zone: 5316 pages used for memmap
  DMA32 zone: 383516 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
ACPI: PM-Timer IO Port: 0x4008
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfecc] gsi_base[24])
IOAPIC[1]: apic_id 3, address 0xfecc, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Setting APIC routing to flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 0009e000 - 0009f000
Nosave address range: 0009f000 - 000a
Nosave address range: 000a - 000f
Nosave address range: 000f - 0010
Allocating PCI resources starting at 6000 (gap: 5ff0:8010)
PERCPU: Allocating 3