Re: Anyone??? (was Reproducible data corruption on 6.1-Stable)

2006-09-14 Thread Jonathan Stewart
Daniel Gerzo wrote:
> Hello Jonathan,
> 
> Wednesday, September 13, 2006, 2:38:14 AM, you wrote:
> 
>> I set up a new server recently and transferred all the information from
>> my old server over.  I tried to use unison to synchronize the backup of
>> pictures I have taken and noticed that a large number of pictures were
>> marked as changed on the server.  After checking the pictures by hand I
>> confirmed that many of the pictures on the server were corrupted.
> 
>> It appears the corruption happens during the read process because when I
>> recompare the files in a graphical diff tool between cache flushes the
>> differences move around!?!?!?  The differences also appear to be very
>> small for the most part, single bytes scattered throughout the file.  I
>> really have no idea what is causing the problem and would like to pin it
>> down so I can either replace hardware if it's bad or fix whatever the
>> bug is.
> 
>> CPU: AMD Athlon(tm) XP 3200+ (2090.16-MHz 686-class CPU)
>>   Origin = "AuthenticAMD"  Id = 0x6a0  Stepping = 0
> 
> I saw very similar symptoms on a 3.2 GHz P4. I was able to build world
> without any problems and the machine was otherwise completely stable,
> but when I tried to install some ports, the md5 sums didn't match the
> source and I was sure that they were all right.
> 
> The following simple test demonstrates the problem I was hitting:
> 
> [EMAIL PROTECTED] ~]# sha256 /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
> SHA256 (/usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz) = 
> b95ddf27bc0ffa379c9aa881ca39e92a7d79e0d08999b4dff6d7d9547ee2a72d
> [EMAIL PROTECTED] ~]# sha256 /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
> SHA256 (/usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz) = 
> 71432841b3965b7ab2d83f0dc7c3049195ea4e9267a8dc2d825a8a0466982930
> [EMAIL PROTECTED] ~]# sha256 /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
> SHA256 (/usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz) = 
> 83e44f5301b3270e821850164c74d275f6721bed5d126480cf518a9fe5ca0d6c
> [EMAIL PROTECTED] ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
> bd8c2e593e1fa4b01fd98eaf016329bb
> [EMAIL PROTECTED] ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
> bd8c2e593e1fa4b01fd98eaf016329bb
> [EMAIL PROTECTED] ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
> b9342bb213393238dd37322d4e2ee3fe
> [EMAIL PROTECTED] ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
> 88efa7977fd3febaa8d260e3d5f21917
> 
> memtest didn't show any problems with the RAM and we were unable to
> work out what was really going on. Then we managed to get the machine
> replaced with completely new hardware and the problem was gone.
> Later, I was told that it is some kind of known bug in the BIOSes of
> older P4s (and advised to update the BIOS, where it should have been
> fixed in the meantime), but we were unable to find any information
> about the problem. Fortunately the colo company replaced the hardware
> with no problems. So far so good; the box is running flawlessly.
> 

I don't think it's quite the same as my problem, as I have to use dd on a
large file to flush the cache and force FreeBSD to go back to the disk
before the checksum changes.  At this point I think I need to further
narrow down where the error is occurring but I don't know what to try
next.  I am 99.999% sure memory and CPU are not the problem but after
that point I'm getting into driver and filesystem code testing which is
a little overwhelming to just dive into.

Jonathan
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Anyone??? (was Reproducible data corruption on 6.1-Stable)

2006-09-14 Thread Daniel Gerzo
Hello Jonathan,

Wednesday, September 13, 2006, 2:38:14 AM, you wrote:

> I set up a new server recently and transferred all the information from
> my old server over.  I tried to use unison to synchronize the backup of
> pictures I have taken and noticed that a large number of pictures were
> marked as changed on the server.  After checking the pictures by hand I
> confirmed that many of the pictures on the server were corrupted.

> It appears the corruption happens during the read process because when I
> recompare the files in a graphical diff tool between cache flushes the
> differences move around!?!?!?  The differences also appear to be very
> small for the most part, single bytes scattered throughout the file.  I
> really have no idea what is causing the problem and would like to pin it
> down so I can either replace hardware if it's bad or fix whatever the
> bug is.

> CPU: AMD Athlon(tm) XP 3200+ (2090.16-MHz 686-class CPU)
>   Origin = "AuthenticAMD"  Id = 0x6a0  Stepping = 0

I saw very similar symptoms on a 3.2 GHz P4. I was able to build world
without any problems and the machine was otherwise completely stable,
but when I tried to install some ports, the md5 sums didn't match the
source and I was sure that they were all right.

The following simple test demonstrates the problem I was hitting:

[EMAIL PROTECTED] ~]# sha256 /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
SHA256 (/usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz) = 
b95ddf27bc0ffa379c9aa881ca39e92a7d79e0d08999b4dff6d7d9547ee2a72d
[EMAIL PROTECTED] ~]# sha256 /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
SHA256 (/usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz) = 
71432841b3965b7ab2d83f0dc7c3049195ea4e9267a8dc2d825a8a0466982930
[EMAIL PROTECTED] ~]# sha256 /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
SHA256 (/usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz) = 
83e44f5301b3270e821850164c74d275f6721bed5d126480cf518a9fe5ca0d6c
[EMAIL PROTECTED] ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
bd8c2e593e1fa4b01fd98eaf016329bb
[EMAIL PROTECTED] ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
bd8c2e593e1fa4b01fd98eaf016329bb
[EMAIL PROTECTED] ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
b9342bb213393238dd37322d4e2ee3fe
[EMAIL PROTECTED] ~]# md5 < /usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz
88efa7977fd3febaa8d260e3d5f21917
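
The repeated-checksum test above can also be scripted so it runs unattended. The following is only a minimal sketch in Python; the function name `digest_repeatedly` and its parameters are made up for illustration and are not part of any FreeBSD tool. It hashes the same file several times and counts how often each digest shows up. On healthy hardware exactly one digest should appear.

```python
import hashlib
from collections import Counter


def digest_repeatedly(path, runs=10, algorithm="sha256", chunk_size=1 << 20):
    """Hash the same file `runs` times and count how often each digest appears."""
    seen = Counter()
    for _ in range(runs):
        h = hashlib.new(algorithm)
        # Re-open the file on every pass so each run is a fresh read.
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        seen[h.hexdigest()] += 1
    return seen
```

Calling `digest_repeatedly("/usr/ports/distfiles/ruby/ruby-1.8.4.tar.gz")` and finding more than one key in the result would correspond to the changing sums shown in the transcript. Note that repeated reads may be served from the cache, so a stable result here does not by itself prove the disk path is fine.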

memtest didn't show any problems with the RAM and we were unable to
work out what was really going on. Then we managed to get the machine
replaced with completely new hardware and the problem was gone.
Later, I was told that it is some kind of known bug in the BIOSes of
older P4s (and advised to update the BIOS, where it should have been
fixed in the meantime), but we were unable to find any information
about the problem. Fortunately the colo company replaced the hardware
with no problems. So far so good; the box is running flawlessly.

-- 
Best regards,
 Daniel  mailto:[EMAIL PROTECTED]



Re: Anyone??? (was Reproducible data corruption on 6.1-Stable)

2006-09-13 Thread Jonathan Stewart
Oliver Fromme wrote:
> Jonathan Stewart <[EMAIL PROTECTED]> wrote:
>  > I set up a new server recently and transferred all the information from
>  > my old server over.  I tried to use unison to synchronize the backup of
>  > pictures I have taken and noticed that a large number of pictures were
>  > marked as changed on the server.  After checking the pictures by hand I
>  > confirmed that many of the pictures on the server were corrupted.  I
>  > attempted to use unison to update the files on the server with the
>  > correct local copies but it would fail on almost all the files with the
>  > message "destination updated during synchronization."
>  > 
>  > It appears the corruption happens during the read process because when I
>  > recompare the files in a graphical diff tool between cache flushes the
>  > differences move around!?!?!?  The differences also appear to be very
>  > small for the most part, single bytes scattered throughout the file.  I
>  > really have no idea what is causing the problem and would like to pin it
>  > down so I can either replace hardware if it's bad or fix whatever the
>  > bug is.
> 
> That very much sounds like bad RAM, or overclocked CPU
> or bus.  I assume you do not overclock, so I recommend
> you replace your RAM modules and check if the symptoms
> are gone.
> 
> Also check your BIOS settings for the RAM timings.
> Setting the timings to more conservative values might
> already solve the problem.

Thanks for the suggestions, but I have tried lowering the clock rate on
the processor and the RAM speed with no luck whatsoever.

I appear to have forgotten to mention that the problem appears no matter
how I read the file, unison, md5, etc.  1 out of maybe 100 times it will
read correctly.  I have another drive that I use for the OS and I have
done many buildworlds/kernels without problems on that drive as well as
compiling some very large software packages.  I'm wondering if a
possible cause is the controller ignoring read errors from the hard
drive but I would think more than the occasional single byte would be
changed.

I'm thinking about maybe trying to dd the file from the raw device in an
attempt to see if the problem is occurring in the filesystem code or is
lower level yet.  Any suggestions on how to locate the file on the disk
or how to isolate the problem better are welcome.  I don't mind doing
the work I just have no idea where to look/what to try next.
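
One way to approach the raw-device idea is to read the same range of sectors several times straight from the device node and compare checksums, leaving the filesystem out of the picture. What follows is a rough Python sketch under stated assumptions: the name `hash_region`, the 512-byte sector size, and the read size are illustrative choices, and the device path and sector range are placeholders the caller has to supply. It does not solve the problem of finding which sectors a given file occupies.

```python
import hashlib

SECTOR = 512  # assumed sector size; adjust if the device reports otherwise


def hash_region(device_path, start_sector, sector_count, repeats=3):
    """Read the same sector range `repeats` times and return the SHA-256 digests."""
    digests = []
    for _ in range(repeats):
        # buffering=0 avoids Python's own buffering layer on top of the device.
        with open(device_path, "rb", buffering=0) as dev:
            dev.seek(start_sector * SECTOR)
            remaining = sector_count * SECTOR
            h = hashlib.sha256()
            while remaining > 0:
                # Reads stay sector-aligned as long as no short read occurs.
                block = dev.read(min(remaining, 128 * SECTOR))
                if not block:
                    break  # hit end of device or file
                h.update(block)
                remaining -= len(block)
        digests.append(h.hexdigest())
    return digests
```

If the digests for the same region differ between passes, the problem would seem to sit below the filesystem (driver, controller, cabling, or disk); if they always agree while reads through the filesystem still differ, the filesystem layer becomes the more likely suspect.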

Thanks,
Jonathan


Re: Anyone??? (was Reproducible data corruption on 6.1-Stable)

2006-09-13 Thread Oliver Fromme
Jonathan Stewart <[EMAIL PROTECTED]> wrote:
 > I set up a new server recently and transferred all the information from
 > my old server over.  I tried to use unison to synchronize the backup of
 > pictures I have taken and noticed that a large number of pictures were
 > marked as changed on the server.  After checking the pictures by hand I
 > confirmed that many of the pictures on the server were corrupted.  I
 > attempted to use unison to update the files on the server with the
 > correct local copies but it would fail on almost all the files with the
 > message "destination updated during synchronization."
 > 
 > It appears the corruption happens during the read process because when I
 > recompare the files in a graphical diff tool between cache flushes the
 > differences move around!?!?!?  The differences also appear to be very
 > small for the most part, single bytes scattered throughout the file.  I
 > really have no idea what is causing the problem and would like to pin it
 > down so I can either replace hardware if it's bad or fix whatever the
 > bug is.

That very much sounds like bad RAM, or overclocked CPU
or bus.  I assume you do not overclock, so I recommend
you replace your RAM modules and check if the symptoms
are gone.

Also check your BIOS settings for the RAM timings.
Setting the timings to more conservative values might
already solve the problem.

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"Clear perl code is better than unclear awk code; but NOTHING
comes close to unclear perl code"  (taken from comp.lang.awk FAQ)


Re: Anyone??? (was Reproducible data corruption on 6.1-Stable)

2006-09-12 Thread George Hartzell
Jonathan Stewart writes:
 > [...]
 > I set up a new server recently and transferred all the information from
 > my old server over.  I tried to use unison to synchronize the backup of
 > pictures I have taken and noticed that a large number of pictures were
 > marked as changed on the server.  After checking the pictures by hand I
 > confirmed that many of the pictures on the server were corrupted.  I
 > attempted to use unison to update the files on the server with the
 > correct local copies but it would fail on almost all the files with the
 > message "destination updated during synchronization."
 > 
 > It appears the corruption happens during the read process because when I
 > recompare the files in a graphical diff tool between cache flushes the
 > differences move around!?!?!?  The differences also appear to be very
 > small for the most part, single bytes scattered throughout the file.  I
 > really have no idea what is causing the problem and would like to pin it
 > down so I can either replace hardware if it's bad or fix whatever the
 > bug is.
 > [...]

It might be a memory problem.  I had a linux server that was serving a
subversion repository, plus some web stuff.  I added some additional
memory to keep it from wheezing and it seemed to be running fine.  We
started noticing problems with things that had been checked out of the
repository (e.g. binary tarballs).  Removing the extra memory made
things work again.

memtest86 didn't find anything wrong, which I gather isn't that
unusual in these situations.

Then again, your problem might be something else entirely.


g.


Anyone??? (was Reproducible data corruption on 6.1-Stable)

2006-09-12 Thread Jonathan Stewart
(I know double posting is bad form but this includes new information and
it's been several days.  Suggestions on where else to look for help are
welcome; HighPoint was no help.)

I set up a new server recently and transferred all the information from
my old server over.  I tried to use unison to synchronize the backup of
pictures I have taken and noticed that a large number of pictures were
marked as changed on the server.  After checking the pictures by hand I
confirmed that many of the pictures on the server were corrupted.  I
attempted to use unison to update the files on the server with the
correct local copies but it would fail on almost all the files with the
message "destination updated during synchronization."

It appears the corruption happens during the read process because when I
recompare the files in a graphical diff tool between cache flushes the
differences move around!?!?!?  The differences also appear to be very
small for the most part, single bytes scattered throughout the file.  I
really have no idea what is causing the problem and would like to pin it
down so I can either replace hardware if it's bad or fix whatever the
bug is.
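
To put numbers on "single bytes scattered throughout the file", the known-good local copy can be compared against what the server returns and the differing offsets listed. This is only a sketch in Python; `differing_offsets` is a made-up helper name, and it compares only the range both files have in common, so a size mismatch has to be checked separately.

```python
def differing_offsets(path_a, path_b, chunk_size=1 << 16):
    """Return the byte offsets at which two files differ (common length only)."""
    offsets = []
    position = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if not block_a or not block_b:
                break  # one of the files is exhausted
            if block_a != block_b:
                # Only walk byte by byte when the blocks actually differ.
                for i, (x, y) in enumerate(zip(block_a, block_b)):
                    if x != y:
                        offsets.append(position + i)
            position += min(len(block_a), len(block_b))
    return offsets
```

Running this after each cache flush and comparing the offset lists would show whether the bad bytes land at regular intervals (for example a fixed position within each sector or transfer block) or truly at random, which may help tell a controller or driver fault from a flaky memory path.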

I cvsuped and rebuilt world and kernel recently hoping that it had been
fixed, but with no luck.  I have not seen any error messages on the
console either.  I have a pair of 320GB SATA hard drives set up as
RAID0 on a HighPoint RocketRaid 1520 card; the card BIOS is the latest
revision, as is the motherboard BIOS.

Since this is a data corruption issue I can afford any amount of downtime
needed for troubleshooting, as it's not very useful to have the server
up if everything is going to get corrupted.

Thank you,
Jonathan

uname -a:
FreeBSD X 6.1-STABLE FreeBSD 6.1-STABLE #0: Sun Sep 10 22:54:17 EDT
2006 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/SERVER  i386

dmesg:
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 6.1-STABLE #0: Sun Sep 10 22:54:17 EDT 2006
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/SERVER
mptable_probe: MP Config Table has bad signature: 4\^C\^_
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) XP 3200+ (2090.16-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x6a0  Stepping = 0
  Features=0x383fbff
  AMD Features=0xc0400800
real memory  = 1073676288 (1023 MB)
avail memory = 1041698816 (993 MB)
kbd1 at kbdmux0
ath_hal: 0.9.17.2 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
acpi0:  on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x4008-0x400b on acpi0
cpu0:  on acpi0
acpi_button0:  on acpi0
pcib0:  port 0xcf8-0xcff on acpi0
pci0:  on pcib0
Correcting nForce2 C1 CPU disconnect hangs
agp0:  mem 0xd800-0xdbff at
device 0.0 on pci0
pci0:  at device 0.1 (no driver attached)
pci0:  at device 0.2 (no driver attached)
pci0:  at device 0.3 (no driver attached)
pci0:  at device 0.4 (no driver attached)
pci0:  at device 0.5 (no driver attached)
isab0:  at device 1.0 on pci0
isa0:  on isab0
pci0:  at device 1.1 (no driver attached)
ohci0:  mem 0xe1085000-0xe1085fff irq 5
at device 2.0 on pci0
ohci0: [GIANT-LOCKED]
usb0: OHCI version 1.0, legacy support
usb0:  on ohci0
usb0: USB revision 1.0
uhub0: nVidia OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 3 ports with 3 removable, self powered
ohci1:  mem 0xe1082000-0xe1082fff irq 5
at device 2.1 on pci0
ohci1: [GIANT-LOCKED]
usb1: OHCI version 1.0, legacy support
usb1:  on ohci1
usb1: USB revision 1.0
uhub1: nVidia OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 3 ports with 3 removable, self powered
ehci0:  mem 0xe1083000-0xe10830ff irq
12 at device 2.2 on pci0
ehci0: [GIANT-LOCKED]
usb2: EHCI version 1.0
usb2: companion controllers, 4 ports each: usb0 usb1
usb2:  on ehci0
usb2: USB revision 2.0
uhub2: nVidia EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub2: 6 ports with 6 removable, self powered
nve0:  port 0xe400-0xe407 mem
0xe1084000-0xe1084fff irq 12 at device 4.0 on pci0
nve0: Ethernet address 00:0c:6e:7d:e0:79
miibus0:  on nve0
rlphy0:  on miibus0
rlphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
nve0: Ethernet address: 00:0c:6e:7d:e0:79
pci0:  at device 5.0 (no driver attached)
pci0:  at device 6.0 (no driver attached)
pcib1:  at device 8.0 on pci0
pci1:  on pcib1
atapci0:  port
0xa000-0xa007,0xa400-0xa403,0xa800-0xa807,0xac00-0xac03,0xb000-0xb0ff
irq 11 at device 6.0 on pci1
ata2:  on atapci0
ata3:  on atapci0
pci1:  at device 9.0 (no driver attached)
pci1:  at device 9.1 (no driver attached)
atapci1:  port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 9.0 on pci0
ata0:  on atapci1
ata1:  on atapci1
pcib2:  at device 12.0 on pci0
pci2:  on pcib2
xl0: <3Com 3c920B-EMB Integrated Fast Etherlink XL> port 0xc000-0xc07f
mem 0xdd00-0xdd7f irq 5 at device 1.0 on pci