Re: NFS corruption on p4 machines (please test)
Kris Kennaway wrote:
> On Fri, Oct 03, 2003 at 10:10:20AM -0700, Lars Eggert wrote:
> > Kris,
> >
> > Kris Kennaway wrote:
> >
> > >For some months now I have been experiencing NFS corruption on the
> > >three machines in the dosirak.kr package cluster - these are SMP
> > >pentium 4 machines that run -CURRENT. Setting DISABLE_PSE and
> > >DISABLE_PG_G does not fix these problems. I am able to easily
> > >reproduce these problems using /usr/src/tools/regression/fsx on a
> > >loopback nfs mount - they are not deterministic, but it blows up
> > >within about 8000 operations (less than a minute of operation). In
> > >fact sometimes it even manages to make fsx segfault, which is fairly
> > >impressive :)
> > >
> > >Just mount something rw via loopback nfs, and run 'fsx foo' on the nfs
> > >filesystem for a few minutes.
> >
> > I just ran an fsx cycle on my desktop machine over a TCP mount, and it
> > seemed to work fine:
>
> Thanks. What hardware specs?

Attached.

Lars
--
Lars Eggert <[EMAIL PROTECTED]> USC Information Sciences Institute

cam: using minimum scsi_delay (100ms)
Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.1-CURRENT #0: Tue Sep 30 10:11:59 PDT 2003
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/KERNEL-1.31
Preloaded elf kernel "/boot/kernel/kernel" at 0xc06ed000.
Preloaded elf module "/boot/kernel/vesa.ko" at 0xc06ed21c.
Preloaded elf module "/boot/kernel/md.ko" at 0xc06ed2c8.
Preloaded elf module "/boot/kernel/linux.ko" at 0xc06ed370.
Preloaded elf module "/boot/kernel/if_gif.ko" at 0xc06ed41c.
Preloaded elf module "/boot/kernel/if_tun.ko" at 0xc06ed4c8.
Preloaded elf module "/boot/kernel/ipfw.ko" at 0xc06ed574.
Preloaded elf module "/boot/kernel/if_an.ko" at 0xc06ed620.
Preloaded elf module "/boot/kernel/wlan.ko" at 0xc06ed6cc.
Preloaded elf module "/boot/kernel/rc4.ko" at 0xc06ed778.
Preloaded elf module "/boot/kernel/pccard.ko" at 0xc06ed820.
Preloaded elf module "/boot/kernel/if_em.ko" at 0xc06ed8cc.
Preloaded elf module "/boot/kernel/if_fxp.ko" at 0xc06ed978.
Preloaded elf module "/boot/kernel/miibus.ko" at 0xc06eda24.
Preloaded elf module "/boot/kernel/if_lnc.ko" at 0xc06edad0.
Preloaded elf module "/boot/kernel/if_wi.ko" at 0xc06edb7c.
Preloaded elf module "/boot/kernel/if_xl.ko" at 0xc06edc28.
Preloaded elf module "/boot/kernel/snd_emu10k1.ko" at 0xc06edcd4.
Preloaded elf module "/boot/kernel/snd_pcm.ko" at 0xc06edd84.
Preloaded elf module "/boot/kernel/snd_es137x.ko" at 0xc06ede30.
Preloaded elf module "/boot/kernel/snd_ich.ko" at 0xc06edee0.
Preloaded elf module "/boot/kernel/snd_maestro3.ko" at 0xc06edf8c.
Preloaded elf module "/boot/kernel/ugen.ko" at 0xc06ee040.
Preloaded elf module "/boot/kernel/usb.ko" at 0xc06ee0ec.
Preloaded elf module "/boot/kernel/uhid.ko" at 0xc06ee194.
Preloaded elf module "/boot/kernel/ukbd.ko" at 0xc06ee240.
Preloaded elf module "/boot/kernel/ulpt.ko" at 0xc06ee2ec.
Preloaded elf module "/boot/kernel/ums.ko" at 0xc06ee398.
Preloaded elf module "/boot/kernel/umass.ko" at 0xc06ee440.
Preloaded elf module "/boot/kernel/umodem.ko" at 0xc06ee4ec.
Preloaded elf module "/boot/kernel/ucom.ko" at 0xc06ee598.
Preloaded elf module "/boot/kernel/bktr.ko" at 0xc06ee644.
Preloaded elf module "/boot/kernel/bktr_mem.ko" at 0xc06ee6f0.
Preloaded elf module "/boot/kernel/agp.ko" at 0xc06ee7a0.
Preloaded elf module "/boot/kernel/random.ko" at 0xc06ee848.
Preloaded elf module "/boot/kernel/ip_mroute.ko" at 0xc06ee8f4.
Preloaded elf module "/boot/kernel/ip6fw.ko" at 0xc06ee9a4.
Preloaded elf module "/boot/kernel/netgraph.ko" at 0xc06eea50.
Preloaded elf module "/boot/kernel/dummynet.ko" at 0xc06eeb00.
Preloaded elf module "/boot/kernel/radeon.ko" at 0xc06eebb0.
Preloaded elf module "/boot/kernel/r128.ko" at 0xc06eec5c.
Preloaded elf module "/boot/kernel/ahc.ko" at 0xc06eed08.
Preloaded elf module "/boot/kernel/mpt.ko" at 0xc06eedb0.
Preloaded elf module "/boot/kernel/fdc.ko" at 0xc06eee58.
Preloaded elf module "/boot/kernel/cbb.ko" at 0xc06eef00.
Preloaded elf module "/boot/kernel/exca.ko" at 0xc06eefa8.
Preloaded elf module "/boot/kernel/cardbus.ko" at 0xc06ef054.
Preloaded elf module "/boot/kernel/lpt.ko" at 0xc06ef100.
Preloaded elf module "/boot/kernel/ubsa.ko" at 0xc06ef1a8.
Preloaded elf module "/boot/kernel/firewire.ko" at 0xc06ef254.
Preloaded elf module "/boot/kernel/sbp.ko" at 0xc06ef304.
Preloaded elf module "/boot/kernel/smbus.ko" at 0xc06ef3ac.
Preloaded elf module "/boot/kernel/intpm.ko" at 0xc06ef458.
Preloaded elf module "/boot/kernel/smb.ko" at 0xc06ef504.
Preloaded elf module "/boot/kernel/iicbus.ko" at 0xc06ef5ac.
Preloaded elf module "/boot/kernel/iic.ko" at 0xc06ef658.
Preloaded elf module "/boot/kernel/iicsmb.ko" at 0xc06ef700.
Preloaded elf module "/boot/kernel/uart.ko" at 0xc06ef7ac.
Preloaded elf module "/boot/kernel/acpi.ko" at 0xc06ef858.
Timecounter "i8254" frequency 1193121 Hz quality 0
CPU: Intel(R) XEON(TM) CPU 2.40GHz (2372.81-MHz 686-class CPU)
Origin = "GenuineIntel
Re: NFS corruption on p4 machines (please test)
On Fri, Oct 03, 2003 at 10:10:20AM -0700, Lars Eggert wrote:
> Kris,
>
> Kris Kennaway wrote:
>
> >For some months now I have been experiencing NFS corruption on the
> >three machines in the dosirak.kr package cluster - these are SMP
> >pentium 4 machines that run -CURRENT. Setting DISABLE_PSE and
> >DISABLE_PG_G does not fix these problems. I am able to easily
> >reproduce these problems using /usr/src/tools/regression/fsx on a
> >loopback nfs mount - they are not deterministic, but it blows up
> >within about 8000 operations (less than a minute of operation). In
> >fact sometimes it even manages to make fsx segfault, which is fairly
> >impressive :)
> >
> >Just mount something rw via loopback nfs, and run 'fsx foo' on the nfs
> >filesystem for a few minutes.
>
> I just ran an fsx cycle on my desktop machine over a TCP mount, and it
> seemed to work fine:

Thanks. What hardware specs?

Kris
Re: NFS corruption on p4 machines (please test)
Lars Eggert wrote:
> Kris Kennaway wrote:
> >Just mount something rw via loopback nfs, and run 'fsx foo' on the nfs
> >filesystem for a few minutes.
>
> I just ran an fsx cycle on my desktop machine over a TCP mount, and it
> seemed to work fine:

I should have mentioned that this is a Pentium 4 Xeon SMP machine
running -current.

Lars
--
Lars Eggert <[EMAIL PROTECTED]> USC Information Sciences Institute
Re: NFS corruption on p4 machines (please test)
Kris,

Kris Kennaway wrote:

>For some months now I have been experiencing NFS corruption on the
>three machines in the dosirak.kr package cluster - these are SMP
>pentium 4 machines that run -CURRENT. Setting DISABLE_PSE and
>DISABLE_PG_G does not fix these problems. I am able to easily
>reproduce these problems using /usr/src/tools/regression/fsx on a
>loopback nfs mount - they are not deterministic, but it blows up
>within about 8000 operations (less than a minute of operation). In
>fact sometimes it even manages to make fsx segfault, which is fairly
>impressive :)
>
>Just mount something rw via loopback nfs, and run 'fsx foo' on the nfs
>filesystem for a few minutes.

I just ran an fsx cycle on my desktop machine over a TCP mount, and it
seemed to work fine:

[EMAIL PROTECTED]: /usr/src/tools/regression/fsx] ./fsx /tmp/nfs/x
truncating to largest ever: 0x13e76
truncating to largest ever: 0x2e52c
truncating to largest ever: 0x3c2c2
truncating to largest ever: 0x3f15f
truncating to largest ever: 0x3fcb9
truncating to largest ever: 0x3fe96
truncating to largest ever: 0x3ff9d
truncating to largest ever: 0x3
skipping zero size read
skipping zero size write
skipping zero size write
^Csignal 2
testcalls = 166863

Lars
--
Lars Eggert <[EMAIL PROTECTED]> USC Information Sciences Institute
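For anyone following along without a FreeBSD source tree, the kind of check fsx performs can be sketched roughly as follows. This is not fsx itself, only a minimal illustration of the same idea: apply random writes, truncates and reads to a file while keeping an in-memory copy of what the file ought to contain, and stop at the first mismatch. The name mini_fsx and its parameters (ops, max_size, seed) are made up for this sketch, and it assumes a POSIX system where os.pread and os.pwrite are available.

```python
import os
import random
import tempfile


def mini_fsx(path, ops=1000, max_size=0x40000, seed=1):
    """Hypothetical, simplified stand-in for fsx (not the real tool).

    Performs `ops` random write/truncate/read operations on `path` and
    checks every read against an in-memory copy of the expected contents.
    Returns the number of operations attempted; raises AssertionError on
    the first mismatch.
    """
    rng = random.Random(seed)
    good = bytearray()  # what the file is expected to contain
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for n in range(1, ops + 1):
            op = rng.choice(("write", "truncate", "read"))
            if op == "write":
                off = rng.randrange(max_size)
                size = rng.randrange(1, min(4096, max_size - off) + 1)
                data = bytes(rng.getrandbits(8) for _ in range(size))
                if off > len(good):
                    # Writing past EOF leaves a hole that reads as zeros.
                    good.extend(bytes(off - len(good)))
                good[off:off + size] = data
                if os.pwrite(fd, data, off) != size:
                    raise OSError("short write at op %d" % n)
            elif op == "truncate":
                size = rng.randrange(max_size)
                os.ftruncate(fd, size)
                if size < len(good):
                    del good[size:]
                else:
                    # Growing the file is expected to zero-fill the new area.
                    good.extend(bytes(size - len(good)))
            else:
                if not good:
                    continue  # nothing to read yet
                off = rng.randrange(len(good))
                size = rng.randrange(1, 4097)
                got = os.pread(fd, size, off)
                want = bytes(good[off:off + size])
                if got != want:
                    raise AssertionError(
                        "op %d: read mismatch at offset 0x%x" % (n, off))
        # Final whole-file comparison against the in-memory copy.
        if os.fstat(fd).st_size != len(good):
            raise AssertionError("final size mismatch")
        if os.pread(fd, len(good) + 1, 0) != bytes(good):
            raise AssertionError("final content mismatch")
    finally:
        os.close(fd)
    return ops


if __name__ == "__main__":
    # Runs against a local temporary file; pass a path on the loopback
    # NFS mount instead to exercise NFS.
    with tempfile.TemporaryDirectory() as d:
        done = mini_fsx(os.path.join(d, "foo"))
        print("completed %d operations without a mismatch" % done)
```

The real fsx also exercises mmap and keeps a log of recent operations, so a clean run of this sketch says much less than a clean fsx run.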
NFS corruption on p4 machines (please test)
For some months now I have been experiencing NFS corruption on the
three machines in the dosirak.kr package cluster - these are SMP
pentium 4 machines that run -CURRENT. Setting DISABLE_PSE and
DISABLE_PG_G does not fix these problems. I am able to easily
reproduce these problems using /usr/src/tools/regression/fsx on a
loopback nfs mount - they are not deterministic, but it blows up
within about 8000 operations (less than a minute of operation). In
fact sometimes it even manages to make fsx segfault, which is fairly
impressive :)

Just mount something rw via loopback nfs, and run 'fsx foo' on the nfs
filesystem for a few minutes.

e.g.:

dosirak# fsx foo
truncating to largest ever: 0x13e76
truncating to largest ever: 0x2e52c
truncating to largest ever: 0x3c2c2
truncating to largest ever: 0x3f15f
truncating to largest ever: 0x3fcb9
ftruncate1: 30cc3
dotruncate: ftruncate: Permission denied

Is anyone else able to test this? The three machines I see this on
have the same hardware specs, so it may be an interaction with certain
hardware.

Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.1-CURRENT #0: Fri Sep 26 20:23:51 KST 2003
[EMAIL PROTECTED]:/usr/obj/d/src/sys/DALKI
Preloaded elf kernel "/boot/kernel/kernel" at 0xc0588000.
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) XEON(TM) CPU 2.20GHz (2199.94-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0xf24 Stepping = 4
Features=0x3febfbff
Hyperthreading: 2 logical CPUs
real memory = 2147418112 (2047 MB)
avail memory = 2084302848 (1987 MB)
Programming 16 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
Programming 16 pins in IOAPIC #1
Programming 16 pins in IOAPIC #2
Programming 16 pins in IOAPIC #3
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
cpu0 (BSP): apic id: 0, version: 0x00050014, at 0xfee0
cpu1 (AP): apic id: 1, version: 0x00050014, at 0xfee0
cpu2 (AP): apic id: 2, version: 0x00050014, at 0xfee0
cpu3 (AP): apic id: 3, version: 0x00050014, at 0xfee0
io0 (APIC): apic id: 8, version: 0x000f0011, at 0xfec0
io1 (APIC): apic id: 9, version: 0x000f0011, at 0xfec01000
io2 (APIC): apic id: 10, version: 0x000f0011, at 0xfec02000
io3 (APIC): apic id: 11, version: 0x000f0011, at 0xfec03000
Pentium Pro MTRR support enabled
ACPI-0660: *** Warning: Type override - [DEB_] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [MLIB] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [IO__] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [DATA] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [SIO_] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [SB__] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [PM__] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [ICNT] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [ACPI] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [IORG] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [SB__] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [PM__] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [SIO_] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [PM__] had invalid type (String) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [BIOS] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [CMOS] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [KBC_] had invalid type (Integer) for Scope operator, changed to (Scope)
ACPI-0660: *** Warning: Type override - [OEM_] had invalid type (Integer) for Scope operator, changed to (Scope)
acpi0: on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
pcibios: BIOS version 2.10
Using $PIR table, 7 entries at 0xc00f4a70
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x508-0x50b on acpi0
acpi_cpu0: on acpi0
acpi_cpu1: on acpi0
acpi_cpu2: on acpi0
acpi_cpu3: on acpi0
acpi_cpu4: on acpi0
acpi_cpu5: on acpi0
acpi_cpu6: on acpi0
acpi_cpu7: on acpi0
acpi_button0: on ac