RE: Silent corruption on AMD64
From: On Behalf Of Aaron Lehmann
> I've been able to narrow it down to the Realtek Ethernet card. I can't
> reproduce the problem using onboard Ethernet, whereas the Realtek card
> causes trouble in any slot. However, I still don't know whether it's a
> hardware or software issue, or whether it's caused directly or
> indirectly by the Realtek card.

I had a similar issue recently:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=223216

I recommend trying Doug Ledford's memtest script:
http://people.redhat.com/dledford/memtest.html

It helped me prove the issue was the hardware and not something else.

..Stu

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Silent corruption on AMD64
Aaron Lehmann <[EMAIL PROTECTED]> writes:

[adding netdev]

[meta-comment: I wish people wouldn't use such unnecessarily broad
subjects -- how is it the x86-64 port's or AMD's fault when you have
broken hardware? Would anybody write "Silent corruption on i386" or
"Silent corruption on Intel" or "Silent corruption on Linux"?]

> On Sat, Mar 31, 2007 at 08:03:16PM -0700, Jim Paris wrote:
> > Since it shows up under heavy load that includes unrelated devices, I
> > think ruling out hardware problems is important. Some suggestions:
>
> I've been able to narrow it down to the Realtek Ethernet card. I can't
> reproduce the problem using onboard Ethernet, whereas the Realtek card
> causes trouble in any slot. However, I still don't know whether it's a
> hardware or software issue, or whether it's caused directly or
> indirectly by the Realtek card.

You could disable the hardware checksumming support in the card with
the appended patch. Then hopefully Linux will catch most corruptions
(but perhaps not all because TCP checksums are not very strong).

You can watch failed checksums then with netstat -s

-Andi

Index: linux-2.6.21-rc3-net/drivers/net/r8169.c
===
--- linux-2.6.21-rc3-net.orig/drivers/net/r8169.c
+++ linux-2.6.21-rc3-net/drivers/net/r8169.c
@@ -2477,6 +2477,7 @@ static inline int rtl8169_fragmented_fra
 
 static inline void rtl8169_rx_csum(struct sk_buff *skb, struct RxDesc *desc)
 {
+#if 0
 	u32 opts1 = le32_to_cpu(desc->opts1);
 	u32 status = opts1 & RxProtoMask;
 
@@ -2485,6 +2486,7 @@ static inline void rtl8169_rx_csum(struc
 	    ((status == RxProtoIP) && !(opts1 & IPFail)))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 	else
+#endif
 		skb->ip_summed = CHECKSUM_NONE;
 }
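To illustrate the remark that TCP checksums are not very strong, here is a
minimal Python sketch of a 16-bit ones' complement checksum in the style of
RFC 1071. It is not the kernel's implementation, and the function name
internet_checksum is made up for this example. It shows that reordering
16-bit words, or two errors that cancel each other out, leave the checksum
unchanged, while a replacement like the 92 57 5C 0A block reported elsewhere
in this thread happens to be caught.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071.

    Illustrative only: a sketch of the arithmetic, not the kernel's code.
    """
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    good = b"abcdefgh"
    reordered = b"cdabefgh"    # first two 16-bit words swapped
    compensated = b"acccefgh"  # one word +1, the next word -1
    replaced = b"abcd" + bytes.fromhex("92575C0A")  # block from the report

    for name, payload in [("good", good), ("reordered", reordered),
                          ("compensated", compensated), ("replaced", replaced)]:
        print(f"{name:12s} {internet_checksum(payload):#06x}")
```

Because the sum is commutative and only 16 bits wide, some corruptions pass
unnoticed, which is why software checksumming catches most but not
necessarily all errors.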
Re: Silent corruption on AMD64
On Sat, Mar 31, 2007 at 08:03:16PM -0700, Jim Paris wrote:
> Since it shows up under heavy load that includes unrelated devices, I
> think ruling out hardware problems is important. Some suggestions:

I've been able to narrow it down to the Realtek Ethernet card. I can't
reproduce the problem using onboard Ethernet, whereas the Realtek card
causes trouble in any slot. However, I still don't know whether it's a
hardware or software issue, or whether it's caused directly or
indirectly by the Realtek card.
Re: Silent corruption on AMD64
Aaron Lehmann wrote:
> I discovered a reproducible way of causing silent file corruption.
...
> 1. Heavy Ethernet load (nc remotehost < /dev/zero)
> 2. Heavy disk write load on any non-sata_sil drive (cat /dev/zero > /path)
> 3. Heavy disk read load on any other drive (tar c /path | cat > /dev/null)

Since it shows up under heavy load that includes unrelated devices, I
think ruling out hardware problems is important. Some suggestions:

- Use mcelog to see if you're getting any machine check exceptions
  that would indicate hardware error:
  http://freshmeat.net/projects/mcelog/

- Use the edac module to turn on pci parity and memory error checks:
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/drivers/edac/edac.txt

- Run memtest86+ for several loops to make sure your RAM is ok

- Try moving the SiI card to a different slot

- Try running the SATA drives from a separate power supply

- Move disks and cables around to see whether the problem follows the
  disks, the cables, or the controllers

- Try enabling the "spread spectrum" clock option in your BIOS to
  reduce EMI

-jim
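While moving hardware around as suggested above, it helps to have a
repeatable check for whether reads of the same file still disagree. The
following Python sketch mirrors the md5sum check from the original report:
it hashes a file several times and lists the passes whose digest differs
from the first one. The function names and the default of five passes are
assumptions made for this example. A mismatch only shows that two reads
disagreed, not which read was correct.

```python
import hashlib
import sys


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the file at `path`, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def mismatched_passes(path, passes=5):
    """Hash `path` once as a reference, then `passes` more times.

    Returns the 1-based numbers of the passes whose digest differs from
    the reference. Caveat: repeated reads may be served from the page
    cache rather than from the disk, so use a file larger than RAM or
    drop caches between passes if the goal is to exercise the controller.
    """
    reference = file_digest(path)
    return [n for n in range(1, passes + 1) if file_digest(path) != reference]


if __name__ == "__main__":
    bad = mismatched_passes(sys.argv[1])
    print("passes with a differing digest:", bad if bad else "none")
```

Run it against a large file on the suspect drive while the network and disk
load from the report is active.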
Re: Silent corruption on AMD64
On Sat, Mar 31, 2007 at 07:52:36PM -0700, Andrew Morton wrote:
> Are you able to provide us with some before-and-after data so we
> can see this corruption.
>
> See, if it's dropped-bits or shifted-data or eight-byte-aligned
> kernel addresses or whatever, that helps us generate theories..

Sure. I created a large file containing the repeating ASCII string
"abcdefgh", and subjected it to the corruption I described earlier.

The correct hex sequence is:

61 62 63 64 65 66 67 68

Here were some of the permutations that I found in corrupted copies:

61 62 63 64 92 57 5C 0A
61 62 63 64 A2 2D E1 C7
61 62 63 64 11 38 0E B6
61 62 63 64 57 B1 EE 1F
61 62 63 64 E0 3D 10 21
61 62 63 64 97 E1 C0 F5

I did not observe any errors other than replacements of four-byte
blocks. These errors always started at addresses in the file that had
a remainder of 12 modulo 16 (i.e. the hex addresses always ended in
'C'). There was an average of about one error per 300MB.
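The analysis above can be sketched as a small Python scanner. This is an
illustration and not the tool actually used for the report: the function
names, the 4-byte comparison granularity and the chunk size are all
assumptions. It walks a file that should contain only the repeating
"abcdefgh" string, yields every 4-byte word that deviates, and tallies the
offsets modulo 16 so a pattern such as "always 12" stands out.

```python
import io
from collections import Counter

PATTERN = b"abcdefgh"  # the repeating test string from the report


def find_corrupt_words(stream, pattern=PATTERN, word=4, chunk_size=1 << 20):
    """Yield (offset, observed_bytes) for each `word`-sized block that
    differs from the repeating pattern.

    Assumes a regular file or BytesIO, where read() returns full chunks
    until end of file, so word boundaries stay aligned.
    """
    assert len(pattern) % word == 0 and chunk_size % len(pattern) == 0
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        phase = offset % len(pattern)
        repeats = len(chunk) // len(pattern) + 2
        expected = (pattern * repeats)[phase:phase + len(chunk)]
        if chunk != expected:  # fast path skips clean chunks entirely
            for i in range(0, len(chunk), word):
                if chunk[i:i + word] != expected[i:i + word]:
                    yield offset + i, chunk[i:i + word]
        offset += len(chunk)


def offsets_mod_16(hits):
    """Count how often each (offset % 16) value occurs among the hits."""
    return Counter(off % 16 for off, _ in hits)


if __name__ == "__main__":
    # Synthetic sample: two corrupted blocks taken from the report,
    # placed at offsets that end in hex 'C'.
    data = bytearray(PATTERN * 8)
    data[12:16] = bytes.fromhex("92575C0A")
    data[44:48] = bytes.fromhex("A22DE1C7")
    hits = list(find_corrupt_words(io.BytesIO(bytes(data))))
    for off, observed in hits:
        print(f"{off:#x}: {observed.hex(' ')}")
    print(dict(offsets_mod_16(hits)))
```

For a real file, pass an object opened with open(path, "rb") instead of the
BytesIO sample.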
Re: Silent corruption on AMD64
> On Sat, 31 Mar 2007 18:27:36 -0700 Aaron Lehmann <[EMAIL PROTECTED]> wrote:
> I have spent a lot of time trying to find a simpler test case. So far,
> as far as I can tell, there are three conditions that must be
> satisfied for corruption to occur:
>
> 1. Heavy Ethernet load (nc remotehost < /dev/zero)
> 2. Heavy disk write load on any non-sata_sil drive (cat /dev/zero > /path)
> 3. Heavy disk read load on any other drive (tar c /path | cat > /dev/null)
>
> With these conditions satisfied, data read off sda or sdb (the drives
> associated with sata_sil) is often corrupted. Since I can only see
> this problem with files on those two drives, I'm inclined to suspect
> the sata_sil driver, but I really have no idea what's going on. I know
> this is not a recent issue - I experienced very similar corruption at
> least a year ago. I wasn't able to reproduce it at the time, because
> it only appeared in the backups I was restoring from.

Are you able to provide us with some before-and-after data so we
can see this corruption?

See, if it's dropped-bits or shifted-data or eight-byte-aligned
kernel addresses or whatever, that helps us generate theories..
Silent corruption on AMD64
Hello,

I discovered a reproducible way of causing silent file corruption.
Unfortunately, this method happens to be my backup procedure :(.

Background: I have five hard drives. sda and sdb are on a SiI 3112
card. sdc and sdd use onboard sata_via. hda uses onboard VIA VT8237
IDE. All filesystems are ext3. Ethernet is PCI RTL8169. My kernel is
2.6.20.1, configured for SMP and PREEMPT, but I was able to confirm
that this corruption happens without SMP or PREEMPT (though it's
rarer).

The following simultaneous actions result in corrupt data being read
from one of the sata_sil drives:

1. rsync files from sdd to sdc
2. rsync files from sdb to a remote host

If I run md5sum on a few hundred megabytes on sdb while doing these
things, the md5sum computed will usually be wrong. I believe the data
getting rsynced off sdb is also corrupt.

I have spent a lot of time trying to find a simpler test case. So far,
as far as I can tell, there are three conditions that must be
satisfied for corruption to occur:

1. Heavy Ethernet load (nc remotehost < /dev/zero)
2. Heavy disk write load on any non-sata_sil drive (cat /dev/zero > /path)
3. Heavy disk read load on any other drive (tar c /path | cat > /dev/null)

With these conditions satisfied, data read off sda or sdb (the drives
associated with sata_sil) is often corrupted. Since I can only see
this problem with files on those two drives, I'm inclined to suspect
the sata_sil driver, but I really have no idea what's going on. I know
this is not a recent issue - I experienced very similar corruption at
least a year ago. I wasn't able to reproduce it at the time, because
it only appeared in the backups I was restoring from.

dmesg and .config follow.
Linux version 2.6.20.1 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP PREEMPT Sat Feb 24 11:41:46 PST 2007
Command line: root=/dev/sda1 notsc ro
BIOS-provided physical RAM map:
BIOS-e820: - 0009ec00 (usable)
BIOS-e820: 0009ec00 - 000a (reserved)
BIOS-e820: 000f - 0010 (reserved)
BIOS-e820: 0010 - 5fee (usable)
BIOS-e820: 5fee - 5fee3000 (ACPI NVS)
BIOS-e820: 5fee3000 - 5fef (ACPI data)
BIOS-e820: 5fef - 5ff0 (reserved)
BIOS-e820: e000 - f000 (reserved)
BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 158) 0 entries of 256 used
Entering add_active_range(0, 256, 392928) 1 entries of 256 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP (v000 K8T890) @ 0x000f7920
ACPI: RSDT (v001 K8T890 AWRDACPI 0x42302e31 AWRD 0x) @ 0x5fee3040
ACPI: FADT (v001 K8T890 AWRDACPI 0x42302e31 AWRD 0x) @ 0x5fee30c0
ACPI: SSDT (v001 PTLTD POWERNOW 0x0001 LTP 0x0001) @ 0x5feea800
ACPI: SRAT (v001 AMDHAMMER 0x0001 AMD 0x0001) @ 0x5feeaa40
ACPI: MCFG (v001 K8T890 AWRDACPI 0x42302e31 AWRD 0x) @ 0x5feeab40
ACPI: MADT (v001 K8T890 AWRDACPI 0x42302e31 AWRD 0x) @ 0x5feea740
ACPI: DSDT (v001 K8T890 AWRDACPI 0x1000 MSFT 0x010e) @ 0x
Entering add_active_range(0, 0, 158) 0 entries of 256 used
Entering add_active_range(0, 256, 392928) 1 entries of 256 used
Zone PFN ranges:
DMA 0 -> 4096
DMA32 4096 -> 1048576
Normal 1048576 -> 1048576
early_node_map[2] active PFN ranges
0: 0 -> 158
0: 256 -> 392928
On node 0 totalpages: 392830
DMA zone: 56 pages used for memmap
DMA zone: 972 pages reserved
DMA zone: 2970 pages, LIFO batch:0
DMA32 zone: 5316 pages used for memmap
DMA32 zone: 383516 pages, LIFO batch:31
Normal zone: 0 pages used for memmap
ACPI: PM-Timer IO Port: 0x4008
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfecc] gsi_base[24])
IOAPIC[1]: apic_id 3, address 0xfecc, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Setting APIC routing to flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 0009e000 - 0009f000
Nosave address range: 0009f000 - 000a
Nosave address range: 000a - 000f
Nosave address range: 000f - 0010
Allocating PCI resources starting at 6000 (gap: 5ff0:8010)
PERCPU: Allocating 3