Re: NFS writes being corrupted?
> On Aug 21, 2015, at 10:32 PM, Thor Lancelot Simon wrote:
>
> Whoah. Why?

Because NFS marks mbufs as RO and the bus dma code did something "special" for preparing a dma write into a readonly mbuf. Now it just causes an assert to fire.
Re: NFS writes being corrupted?
On Fri, Aug 21, 2015 at 04:02:29PM -0300, Jared McNeill wrote:
> On Sun, 9 Aug 2015, Jeff Rizzo wrote:
>
> >This would seem to indicate a problem with the particular interface
> >(awge0), perhaps specific to the odroid-c1, as opposed to some l2 cache
> >controller issue, which is kind of where I was leaning before. However,
> >my banana pi has awge0 as well, but does not exhibit this corruption.
>
> The following awge patch fixes it for me:
[...]
>	bus_dmamap_sync(sc->sc_dmat, map, 0, map->dm_mapsize,
> -	    BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
> +	    BUS_DMASYNC_PREWRITE);

Whoah. Why?

Thor
Re: NFS writes being corrupted?
On 8/21/15 12:02 PM, Jared McNeill wrote:
> On Sun, 9 Aug 2015, Jeff Rizzo wrote:
>
> > This would seem to indicate a problem with the particular interface
> > (awge0), perhaps specific to the odroid-c1, as opposed to some l2 cache
> > controller issue, which is kind of where I was leaning before. However,
> > my banana pi has awge0 as well, but does not exhibit this corruption.
>
> The following awge patch fixes it for me:
>
> Index: dwc_gmac.c
> ===
> RCS file: /cvsroot/src/sys/dev/ic/dwc_gmac.c,v
> retrieving revision 1.33
> diff -u -p -r1.33 dwc_gmac.c
> --- dwc_gmac.c	12 Jun 2015 11:54:39 -	1.33
> +++ dwc_gmac.c	21 Aug 2015 18:43:13 -
> @@ -917,7 +917,7 @@ dwc_gmac_queue(struct dwc_gmac_softc *sc
>  	data->td_active = map;
>
>  	bus_dmamap_sync(sc->sc_dmat, map, 0, map->dm_mapsize,
> -	    BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
> +	    BUS_DMASYNC_PREWRITE);
>
>  	/* Pass first to device */
>  	sc->sc_txq.t_desc[first].ddesc_status =

I can confirm this fixes things for me, too! Thanks!

+j
Re: NFS writes being corrupted?
On Sun, 9 Aug 2015, Jeff Rizzo wrote:

> This would seem to indicate a problem with the particular interface
> (awge0), perhaps specific to the odroid-c1, as opposed to some l2 cache
> controller issue, which is kind of where I was leaning before. However,
> my banana pi has awge0 as well, but does not exhibit this corruption.

The following awge patch fixes it for me:

Index: dwc_gmac.c
===
RCS file: /cvsroot/src/sys/dev/ic/dwc_gmac.c,v
retrieving revision 1.33
diff -u -p -r1.33 dwc_gmac.c
--- dwc_gmac.c	12 Jun 2015 11:54:39 -	1.33
+++ dwc_gmac.c	21 Aug 2015 18:43:13 -
@@ -917,7 +917,7 @@ dwc_gmac_queue(struct dwc_gmac_softc *sc
 	data->td_active = map;

 	bus_dmamap_sync(sc->sc_dmat, map, 0, map->dm_mapsize,
-	    BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
+	    BUS_DMASYNC_PREWRITE);

 	/* Pass first to device */
 	sc->sc_txq.t_desc[first].ddesc_status =
Re: NFS writes being corrupted?
> On Aug 9, 2015, at 4:01 PM, Jeff Rizzo wrote:
>
> This would seem to indicate a problem with the particular interface (awge0),
> perhaps specific to the odroid-c1, as opposed to some l2 cache controller
> issue, which is kind of where I was leaning before. However, my banana pi
> has awge0 as well, but does not exhibit this corruption.

The l2 cache flushing routines are different between the two. The awge on the a5 may be using the coherent interface to the pl310, in which case cache flushing may not even be needed.

USB probably doesn’t use the coherent interface, so that might be why it works.
Re: NFS writes being corrupted?
On 8/4/15 1:13 PM, Jeff Rizzo wrote:
> On 8/4/15 4:20 AM, Robert Swindells wrote:
> > David Holland wrote:
> > > Does that size vary with the NFS block size?
>
> Yep. Reducing blocksize to 8192 makes it barf on 8192+ byte files.
>
> > Also is it using UDP or TCP ?
>
> TCP, but I just confirmed UDP has the problem too.
>
> > > The symptoms make me think of scrambled mbufs, if anything...
> > My guess is that the panics that wiz and I saw in the checksum code on
> > amd64 were also due to scrambled mbufs.
> >
> > My cubietruck seems fine using awge(4), I have built a fair number of
> > packages over NFS recently.
> >
> > Robert Swindells
>
> Looks like awge(4) is seeing output errors:
>
> Name   Mtu   Network  Address             Ipkts  Ierrs   Opkts  Oerrs  Colls
> awge0  1500           00:1e:06:c3:49:c1  189582      0  134261    222      0
>
> Not sure of what variety, though. The oerrs go up when reading a large
> file (90M) which checksums properly, but DON'T go up when writing/reading
> an 8k file which gets corrupted.
>
> +j

I finally got around to sticking a USB interface (urtwn0) in and testing NFS over that... it was PAINFULLY SLOW - took well over a minute to copy a 4MB test file. But, the test file copied with no corruption!

This would seem to indicate a problem with the particular interface (awge0), perhaps specific to the odroid-c1, as opposed to some l2 cache controller issue, which is kind of where I was leaning before. However, my banana pi has awge0 as well, but does not exhibit this corruption.

Any suggestions what to try/test next gratefully accepted - I would really love to get nfs working on this board.

+j
Re: NFS writes being corrupted?
On 8/4/15 4:20 AM, Robert Swindells wrote:
> David Holland wrote:
> > Does that size vary with the NFS block size?

Yep. Reducing blocksize to 8192 makes it barf on 8192+ byte files.

> Also is it using UDP or TCP ?

TCP, but I just confirmed UDP has the problem too.

> > The symptoms make me think of scrambled mbufs, if anything...
> My guess is that the panics that wiz and I saw in the checksum code on
> amd64 were also due to scrambled mbufs.
>
> My cubietruck seems fine using awge(4), I have built a fair number of
> packages over NFS recently.
>
> Robert Swindells

Looks like awge(4) is seeing output errors:

Name   Mtu   Network  Address             Ipkts  Ierrs   Opkts  Oerrs  Colls
awge0  1500           00:1e:06:c3:49:c1  189582      0  134261    222      0

Not sure of what variety, though. The oerrs go up when reading a large file (90M) which checksums properly, but DON'T go up when writing/reading an 8k file which gets corrupted.

+j
Re: NFS writes being corrupted?
David Holland wrote:
>On Mon, Aug 03, 2015 at 02:51:37PM -0700, Jeff Rizzo wrote:
> > I need to look deeper, but a quick test writing lines of
> > ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
> >
> > Shows that corruption starts when the file is exactly 65536 bytes long
> > (with an 8192 byte page size), with anything that size or longer getting
> > corrupted. It seems to be randomly garbled - same size, same bytes, but
> > shuffled around. When I was narrowing it down, I sometimes saw random
> > corruption inserted at larger file sizes - at one point I saw short strings
> > of NUL and the string "posix2_upe" (which would appear to be a symbol?)
> > inserted at seemingly-random spots.
>
>Does that size vary with the NFS block size?

Also is it using UDP or TCP ?

>The symptoms make me think of scrambled mbufs, if anything...

My guess is that the panics that wiz and I saw in the checksum code on amd64 were also due to scrambled mbufs.

My cubietruck seems fine using awge(4), I have built a fair number of packages over NFS recently.

Robert Swindells
Re: NFS writes being corrupted?
On Mon, Aug 03, 2015 at 06:10:38PM -0400, Michael wrote:
> That's been a problem on MIPS for a long time, nobody seems to know
> why. Never seen it on ARM though, but then again I never checked.

I use a netbsd-current evbearm (v5, no hf) arm machine diskless and see no trouble with NFS writes. Different network hardware, so I wouldn't rule out awge bugs.

I'll test cubietruck NFS.

Martin
Re: NFS writes being corrupted?
On Mon, Aug 03, 2015 at 02:51:37PM -0700, Jeff Rizzo wrote:
> I need to look deeper, but a quick test writing lines of
> ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
>
> Shows that corruption starts when the file is exactly 65536 bytes long
> (with an 8192 byte page size), with anything that size or longer getting
> corrupted. It seems to be randomly garbled - same size, same bytes, but
> shuffled around. When I was narrowing it down, I sometimes saw random
> corruption inserted at larger file sizes - at one point I saw short strings
> of NUL and the string "posix2_upe" (which would appear to be a symbol?)
> inserted at seemingly-random spots.

Does that size vary with the NFS block size?

The symptoms make me think of scrambled mbufs, if anything...

--
David A. Holland
dholl...@netbsd.org
Re: NFS writes being corrupted?
Hello,

On Mon, 3 Aug 2015 09:02:19 -0700 Jeff Rizzo wrote:
> I got my odroid-c1 back online yesterday with -current, and noticed that
> anything I copied to an NFS-mounted volume would get silently
> corrupted. (sha1 from the NFS client and on the NFS server read the
> same, though)

That's been a problem on MIPS for a long time, nobody seems to know why. Never seen it on ARM though, but then again I never checked.

have fun
Michael
Re: NFS writes being corrupted?
On 8/3/15 10:15 AM, Martin Husemann wrote:
> On Mon, Aug 03, 2015 at 09:02:19AM -0700, Jeff Rizzo wrote:
> > I'm about 80% sure this was working around 7.99.9, but for a number of
> > reasons it's complicated for me to check older builds, and in any event
> > odroid-c1 support is fairly new. I noticed some changes to the NFS code
> > on 15 July (
> > http://mail-index.netbsd.org/source-changes/2015/07/15/msg067309.html ),
> > but backing these out does not change the behavior.
>
> What kind of differences do you see? Truncation to a multiple of page size?
> Last partial page filled with zeroes? Random corruption?
> Do you get identical content back when reading on the client directly after
> write?
>
> Typical culprit would be cache ops/pmap issues.
>
> Martin

I need to look deeper, but a quick test writing lines of

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

Shows that corruption starts when the file is exactly 65536 bytes long (with an 8192 byte page size), with anything that size or longer getting corrupted. It seems to be randomly garbled - same size, same bytes, but shuffled around. When I was narrowing it down, I sometimes saw random corruption inserted at larger file sizes - at one point I saw short strings of NUL and the string "posix2_upe" (which would appear to be a symbol?) inserted at seemingly-random spots.

+j
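For anyone wanting to repeat the pattern test described above, here is a minimal userland sketch in C. It is not the tool used in the thread; the function names (pattern_byte, write_pattern_file, verify_pattern_file) are made up for illustration, and the newline after each alphabet line is an assumption based on "writing lines of". It writes a file of an exact size and reports the offset of the first byte that differs from the expected pattern.

```c
#include <stdio.h>

/* One line of the test pattern; the trailing newline is an assumption. */
static const char PATTERN[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\n";

/* Expected byte at a given file offset: the pattern simply repeats. */
char
pattern_byte(long off)
{
	return PATTERN[off % (long)(sizeof(PATTERN) - 1)];
}

/* Write exactly `size` bytes of repeating pattern to `path`; 0 on success. */
int
write_pattern_file(const char *path, long size)
{
	FILE *fp = fopen(path, "wb");

	if (fp == NULL)
		return -1;
	for (long off = 0; off < size; off++) {
		if (fputc(pattern_byte(off), fp) == EOF) {
			fclose(fp);
			return -1;
		}
	}
	return fclose(fp) == 0 ? 0 : -1;
}

/*
 * Read `path` back and compare it against the pattern.
 * Returns -1 if all `size` bytes match, -2 if the file cannot be
 * opened, and otherwise the offset of the first mismatching byte
 * (or of a premature end of file).
 */
long
verify_pattern_file(const char *path, long size)
{
	FILE *fp = fopen(path, "rb");
	long bad = -1;

	if (fp == NULL)
		return -2;
	for (long off = 0; off < size; off++) {
		int c = fgetc(fp);

		if (c == EOF || (char)c != pattern_byte(off)) {
			bad = off;
			break;
		}
	}
	fclose(fp);
	return bad;
}
```

One way to use it would be to call write_pattern_file() on a path under the NFS mount with sizes just below and at the suspected threshold (65535 and 65536 here), then run verify_pattern_file() on the server's copy of the file so the client's cache is not what gets checked.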
Re: NFS writes being corrupted?
On Mon, Aug 03, 2015 at 09:02:19AM -0700, Jeff Rizzo wrote:
> I'm about 80% sure this was working around 7.99.9, but for a number of
> reasons it's complicated for me to check older builds, and in any event
> odroid-c1 support is fairly new. I noticed some changes to the NFS code
> on 15 July (
> http://mail-index.netbsd.org/source-changes/2015/07/15/msg067309.html ),
> but backing these out does not change the behavior.

What kind of differences do you see? Truncation to a multiple of page size? Last partial page filled with zeroes? Random corruption?
Do you get identical content back when reading on the client directly after write?

Typical culprit would be cache ops/pmap issues.

Martin
NFS writes being corrupted?
I got my odroid-c1 back online yesterday with -current, and noticed that anything I copied to an NFS-mounted volume would get silently corrupted. (sha1 from the NFS client and on the NFS server read the same, though)

I'm about 80% sure this was working around 7.99.9, but for a number of reasons it's complicated for me to check older builds, and in any event odroid-c1 support is fairly new. I noticed some changes to the NFS code on 15 July ( http://mail-index.netbsd.org/source-changes/2015/07/15/msg067309.html ), but backing these out does not change the behavior.

Has anyone else seen problems with NFS? Or with odroid-c1 or awge(4) in general?

dmesg below in case it gives any hints:

Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
    2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015
    The NetBSD Foundation, Inc. All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California. All rights reserved.

NetBSD 7.99.20 (ODROID-iscsi) #3: Sun Aug 2 21:06:09 PDT 2015
	r...@cassava.tastylime.net:/scratch/evbarm7/obj/sys/arch/evbarm/compile/ODROID-iscsi
total memory = 1024 MB
avail memory = 1007 MB
sysctl_createv: sysctl_create(machine_arch) returned 17
timecounter: Timecounters tick every 10.000 msec
mainbus0 (root)
cpu0 at mainbus0 core 0: 1512 MHz Cortex-A5 r0p1 (Cortex V7A core)
cpu0: DC enabled IC enabled WB disabled EABT branch prediction enabled
cpu0: sctlr: 0xc51c7d
cpu0: actlr: 0x6041
cpu0: revidr: 0x410fc051
cpu0: mpidr: 0x8200
cpu0: isar: [0]=0x10 [1]=0x13112111 [2]=0x21232041 [3]=0x2131, [4]=0x11142, [5]=0
cpu0: mmfr: [0]=0x100103 [1]=0x4000 [2]=0x123 [3]=0x102211
cpu0: pfr: [0]=0x1231 [1]=0x11
cpu0: 32KB/32B 2-way L1 VIPT Instruction cache
cpu0: 32KB/32B 4-way write-back-locking-C L1 PIPT Data cache
cpu0: 512KB/32B 8-way write-back L2 PIPT Unified cache
vfp0 at cpu0: NEON MPE (VFP 3.0+), rounding, NaN propagation, denormals
vfp0: mvfr: [0]=0x10110222 [1]=0x
cpu1 at mainbus0 core 1
cpu2 at mainbus0 core 2
cpu3 at mainbus0 core 3
armperiph0 at mainbus0
armgic0 at armperiph0: Generic Interrupt Controller, 256 sources (245 valid)
armgic0: 32 Priorities, 224 SPIs, 5 PPIs, 16 SGIs
a9tmr0 at armperiph0: A5 Global 64-bit Timer (378 MHz)
a9tmr0: interrupting on irq 27
a9wdt0 at armperiph0: A5 Watchdog Timer, default period is 12 seconds
arml2cc0 at armperiph0: ARM PL310 r3p3 L2 Cache Controller (disabled)
arml2cc0: cache enabled
amlogicio0 at mainbus0
amlogiccom0 at amlogicio0 port 0: console
amlogiccom0: interrupting at irq 122
amlogicgpio0 at amlogicio0: GPIO controller
gpio0 at amlogicgpio0 (GPIOX): 22 pins
gpio1 at amlogicgpio0 (GPIOY): 15 pins
gpio2 at amlogicgpio0 (GPIODV): 30 pins
gpio3 at amlogicgpio0 (GPIOH): 6 pins
gpio4 at amlogicgpio0 (GPIOAO): 14 pins
gpio5 at amlogicgpio0 (BOOT): 19 pins
gpio6 at amlogicgpio0 (CARD): 7 pins
genfb0 at amlogicio0: switching to framebuffer console
genfb0: framebuffer at 0xc9e0, size 1280x720, depth 16, stride 2560
wsdisplay0 at genfb0 kbdmux 1: console (default, vt100 emulation)
wsmux1: connecting to wsdisplay0
wsdisplay0: screen 1-3 added (default, vt100 emulation)
amlogicrng0 at amlogicio0
dwctwo0 at amlogicio0 port 0: USB controller
dwctwo1 at amlogicio0 port 1: USB controller
awge0 at amlogicio0: Gigabit Ethernet Controller
awge0: interrupting on irq 40
awge0: Ethernet address: 00:1e:06:c3:49:c1
rgephy0 at awge0 phy 0: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 6
rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT-FDX, auto
rgephy1 at awge0 phy 1: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 6
rgephy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT-FDX, auto
amlogicsdhc0 at amlogicio0 port 1: SDHC controller
amlogicsdhc0: interrupting on irq 110
amlogicrtc0 at amlogicio0: RTC battery not present or discharged
usb0 at dwctwo0: USB revision 2.0
usb1 at dwctwo1: USB revision 2.0
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
timecounter: Timecounter "a9tmr0" frequency 37800 Hz quality 500
cpu2: 1512 MHz Cortex-A5 r0p1 (Cortex V7A core)
cpu2: DC enabled IC enabled WB disabled EABT branch prediction enabled
cpu2: sctlr: 0xc51c7d
cpu2: actlr: 0x6041
cpu2: revidr: 0x410fc051
cpu2: mpidr: 0x8202
cpu2: isar: [0]=0x10 [1]=0x13112111 [2]=0x21232041 [3]=0x2131, [4]=0x11142, [5]=0
cpu2: mmfr: [0]=0x100103 [1]=0x4000 [2]=0x123 [3]=0x102211
cpu2: pfr: [0]=0x1231 [1]=0x11
cpu2: 32KB/32B 2-way L1 VIPT Instruction cache
cpu2: 32KB/32B 4-way write-back-locking-C L1 PIPT Data cache
cpu2: 512KB/32B 8-way write-back L2 PIPT Unified cache
vfp2 at cpu2: NEON MPE (VFP 3.0+), rounding, NaN propagation, denormals
vfp2: mvfr: [0]=0x10110222 [1]=0x
cpu3: 1512 MHz Cortex-A5 r0p1 (Cortex V7A core)
cpu3: DC enabled IC enabled WB disabled EABT branch prediction enabled
cpu3: sctlr: 0xc51c7d
cpu3: actlr: 0x6041
cpu3: revidr: 0x410fc051
cpu3: mpidr: 0x8203