Re: NFS writes being corrupted?

2015-08-21 Thread Matt Thomas

> On Aug 21, 2015, at 10:32 PM, Thor Lancelot Simon  wrote:
> 
> Whoah.  Why?

Because NFS marks its mbufs read-only (RO), and the bus_dma code used to do
something "special" when preparing a DMA write into a read-only mbuf.  Now
it just causes an assertion to fire.
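A minimal userspace model of the behaviour described above, for readers less familiar with bus_dma(9). In bus_dma terms, BUS_DMASYNC_PREREAD prepares a buffer for the device to write into memory, which makes no sense for a read-only mbuf that is only being transmitted. The SYNC_* values, struct model_buf and model_dmamap_sync() are hypothetical stand-ins, not the NetBSD API, and the kernel's assertion is modelled as a -1 return so the rule can be exercised:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-ins for the BUS_DMASYNC_* operations; the names
 * mirror bus_dma(9), but the values are local to this sketch.
 */
enum {
	SYNC_PREREAD   = 0x01,	/* device is about to write into memory */
	SYNC_POSTREAD  = 0x02,
	SYNC_PREWRITE  = 0x04,	/* device is about to read from memory */
	SYNC_POSTWRITE = 0x08,
};

/* Hypothetical model of a DMA buffer, e.g. an mbuf NFS marked read-only. */
struct model_buf {
	bool readonly;
};

/*
 * Model of the new rule: preparing a device-to-memory transfer
 * (PREREAD) on a read-only buffer is a caller bug.  The kernel fires an
 * assertion; this model returns -1 instead.  Returns 0 if acceptable.
 */
int
model_dmamap_sync(const struct model_buf *b, int ops)
{
	if (b->readonly && (ops & SYNC_PREREAD))
		return -1;
	return 0;
}

/* Usage: the pre-patch awge TX flags are rejected, PREWRITE alone is fine. */
void
model_demo(void)
{
	struct model_buf nfs_mbuf = { .readonly = true };

	assert(model_dmamap_sync(&nfs_mbuf, SYNC_PREREAD | SYNC_PREWRITE) == -1);
	assert(model_dmamap_sync(&nfs_mbuf, SYNC_PREWRITE) == 0);
}
```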


Re: NFS writes being corrupted?

2015-08-21 Thread Thor Lancelot Simon
On Fri, Aug 21, 2015 at 04:02:29PM -0300, Jared McNeill wrote:
> On Sun, 9 Aug 2015, Jeff Rizzo wrote:
> 
> >This would seem to indicate a problem with the particular interface
> >(awge0), perhaps specific to the odroid-c1, as opposed to some l2 cache
> >controller issue, which is kind of where I was leaning before.  However,
> >my banana pi has awge0 as well, but does not exhibit this corruption.
> 
> The following awge patch fixes it for me:
[...]
>  	bus_dmamap_sync(sc->sc_dmat, map, 0, map->dm_mapsize,
> -	    BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
> +	    BUS_DMASYNC_PREWRITE);

Whoah.  Why?

Thor


Re: NFS writes being corrupted?

2015-08-21 Thread Jeff Rizzo

On 8/21/15 12:02 PM, Jared McNeill wrote:
> On Sun, 9 Aug 2015, Jeff Rizzo wrote:
>
>> This would seem to indicate a problem with the particular interface
>> (awge0), perhaps specific to the odroid-c1, as opposed to some l2
>> cache controller issue, which is kind of where I was leaning before.
>> However, my banana pi has awge0 as well, but does not exhibit this
>> corruption.
>
> The following awge patch fixes it for me:
>
> Index: dwc_gmac.c
> ===
> RCS file: /cvsroot/src/sys/dev/ic/dwc_gmac.c,v
> retrieving revision 1.33
> diff -u -p -r1.33 dwc_gmac.c
> --- dwc_gmac.c	12 Jun 2015 11:54:39 -	1.33
> +++ dwc_gmac.c	21 Aug 2015 18:43:13 -
> @@ -917,7 +917,7 @@ dwc_gmac_queue(struct dwc_gmac_softc *sc
>  	data->td_active = map;
>  
>  	bus_dmamap_sync(sc->sc_dmat, map, 0, map->dm_mapsize,
> -	    BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
> +	    BUS_DMASYNC_PREWRITE);
>  
>  	/* Pass first to device */
>  	sc->sc_txq.t_desc[first].ddesc_status =


I can confirm this fixes things for me, too!

Thanks!
+j



Re: NFS writes being corrupted?

2015-08-21 Thread Jared McNeill

On Sun, 9 Aug 2015, Jeff Rizzo wrote:

> This would seem to indicate a problem with the particular interface (awge0),
> perhaps specific to the odroid-c1, as opposed to some l2 cache controller
> issue, which is kind of where I was leaning before.  However, my banana pi
> has awge0 as well, but does not exhibit this corruption.


The following awge patch fixes it for me:

Index: dwc_gmac.c
===
RCS file: /cvsroot/src/sys/dev/ic/dwc_gmac.c,v
retrieving revision 1.33
diff -u -p -r1.33 dwc_gmac.c
--- dwc_gmac.c	12 Jun 2015 11:54:39 -	1.33
+++ dwc_gmac.c	21 Aug 2015 18:43:13 -
@@ -917,7 +917,7 @@ dwc_gmac_queue(struct dwc_gmac_softc *sc
 	data->td_active = map;
 
 	bus_dmamap_sync(sc->sc_dmat, map, 0, map->dm_mapsize,
-	    BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
+	    BUS_DMASYNC_PREWRITE);
 
 	/* Pass first to device */
 	sc->sc_txq.t_desc[first].ddesc_status =
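Why dropping PREREAD is the right direction for this call, sketched under the usual bus_dma(9) convention: a transmit buffer is only read by the device, so it needs PREWRITE before the DMA starts (and POSTWRITE once it completes), while PREREAD/POSTREAD belong to receive buffers that the device writes into. The enum names and sync_ops_for() below are hypothetical illustrations, not dwc_gmac code:

```c
#include <assert.h>

/* Hypothetical stand-ins mirroring the BUS_DMASYNC_* operation names. */
enum {
	SYNC_PREREAD   = 0x01,
	SYNC_POSTREAD  = 0x02,
	SYNC_PREWRITE  = 0x04,
	SYNC_POSTWRITE = 0x08,
};

enum dma_dir { DMA_TO_DEVICE, DMA_FROM_DEVICE };	/* hypothetical */
enum dma_phase { BEFORE_DMA, AFTER_DMA };		/* hypothetical */

/* Which sync operation a buffer needs for a given direction and phase. */
int
sync_ops_for(enum dma_dir dir, enum dma_phase phase)
{
	if (dir == DMA_TO_DEVICE)	/* e.g. transmit: device reads memory */
		return phase == BEFORE_DMA ? SYNC_PREWRITE : SYNC_POSTWRITE;
	/* e.g. receive: device writes memory */
	return phase == BEFORE_DMA ? SYNC_PREREAD : SYNC_POSTREAD;
}

void
sync_ops_demo(void)
{
	/* The patched call in dwc_gmac_queue() is the TX, before-DMA case. */
	assert(sync_ops_for(DMA_TO_DEVICE, BEFORE_DMA) == SYNC_PREWRITE);
	assert(sync_ops_for(DMA_FROM_DEVICE, BEFORE_DMA) == SYNC_PREREAD);
}
```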



Re: NFS writes being corrupted?

2015-08-09 Thread Matt Thomas

> On Aug 9, 2015, at 4:01 PM, Jeff Rizzo  wrote:
> 
> This would seem to indicate a problem with the particular interface (awge0), 
> perhaps specific to the odroid-c1, as opposed to some l2 cache controller 
> issue, which is kind of where I was leaning before.  However, my banana pi 
> has awge0 as well, but does not exhibit this corruption.

The L2 cache flushing routines are different between the two.

The awge on the A5 may be using the coherent interface to the PL310,
in which case cache flushing may not even be needed.  USB probably doesn't
use the coherent interface, so that might be why it works.
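Some background on what the non-coherent path has to do. Without a coherent port, a PREWRITE sync typically writes back (cleans) every cache line the buffer touches, and a PREREAD sync invalidates them; lines only partly covered by the buffer also hold unrelated data, which makes invalidating them delicate. The sketch below only computes which lines a buffer spans, assuming the 32-byte line size shown in the dmesg posted with the original report; lines_for() and struct line_range are hypothetical names, not NetBSD's arm cache code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size, from the "32KB/32B" and "512KB/32B" dmesg lines. */
#define CACHE_LINE 32u

struct line_range {
	uintptr_t first;	/* address of the first cache line touched */
	uintptr_t end;		/* one past the last cache line touched */
	bool head_partial;	/* buffer starts in the middle of a line */
	bool tail_partial;	/* buffer ends in the middle of a line */
};

/* Cache lines a buffer [addr, addr + len) covers, rounded outward. */
struct line_range
lines_for(uintptr_t addr, size_t len)
{
	struct line_range r;
	uintptr_t mask = CACHE_LINE - 1;

	r.first = addr & ~mask;
	r.end = (addr + len + mask) & ~mask;
	r.head_partial = (addr & mask) != 0;
	r.tail_partial = ((addr + len) & mask) != 0;
	return r;
}

void
lines_demo(void)
{
	/* 100 bytes starting 4 bytes into a line span 4 lines, both ends partial. */
	struct line_range r = lines_for(0x1004, 100);

	assert(r.first == 0x1000 && r.end == 0x1080);
	assert((r.end - r.first) / CACHE_LINE == 4);
	assert(r.head_partial && r.tail_partial);
}
```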



Re: NFS writes being corrupted?

2015-08-09 Thread Jeff Rizzo

On 8/4/15 1:13 PM, Jeff Rizzo wrote:
> On 8/4/15 4:20 AM, Robert Swindells wrote:
>> David Holland wrote:
>>>
>>> Does that size vary with the NFS block size?
>
> Yep.  Reducing blocksize to 8192 makes it barf on 8192+ byte files.
>
>> Also is it using UDP or TCP ?
>
> TCP, but I just confirmed UDP has the problem too.
>
>>> The symptoms make me think of scrambled mbufs, if anything...
>>
>> My guess is that the panics that wiz and I saw in the checksum code
>> on amd64 were also due to scrambled mbufs.
>>
>> My cubietruck seems fine using awge(4), I have built a fair number of
>> packages over NFS recently.
>>
>> Robert Swindells
>
> Looks like awge(4) is seeing output errors:
>
> Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
> awge0 1500                00:1e:06:c3:49:c1   189582     0   134261   222     0
>
> Not sure of what variety, though.  The oerrs go up when reading a
> large file (90M) which checksums properly, but DON'T go up when
> writing/reading an 8k file which gets corrupted.
>
> +j



I finally got around to sticking a USB interface (urtwn0) in and testing
NFS over that... it was PAINFULLY SLOW - it took well over a minute to copy
a 4MB test file.  But the test file copied with no corruption!


This would seem to indicate a problem with the particular interface
(awge0), perhaps specific to the odroid-c1, as opposed to some L2 cache
controller issue, which is kind of where I was leaning before.  However,
my banana pi also has awge0 and does not exhibit this corruption.


Any suggestions on what to try or test next are gratefully accepted - I
would really love to get NFS working on this board.


+j



Re: NFS writes being corrupted?

2015-08-04 Thread Jeff Rizzo

On 8/4/15 4:20 AM, Robert Swindells wrote:
> David Holland wrote:
>>
>> Does that size vary with the NFS block size?

Yep.  Reducing blocksize to 8192 makes it barf on 8192+ byte files.

> Also is it using UDP or TCP ?

TCP, but I just confirmed UDP has the problem too.

>> The symptoms make me think of scrambled mbufs, if anything...
>
> My guess is that the panics that wiz and I saw in the checksum code
> on amd64 were also due to scrambled mbufs.
>
> My cubietruck seems fine using awge(4), I have built a fair number of
> packages over NFS recently.
>
> Robert Swindells


Looks like awge(4) is seeing output errors:

Name  Mtu   Network       Address              Ipkts Ierrs    Opkts Oerrs Colls
awge0 1500                00:1e:06:c3:49:c1   189582     0   134261   222     0


Not sure of what variety, though.  The oerrs go up when reading a large 
file (90M) which checksums properly, but DON'T go up when 
writing/reading an 8k file which gets corrupted.
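For scale, the counters in the netstat output above work out to an output error rate of roughly 0.17% of transmitted packets, with no input errors on the receive side:

```latex
\frac{\text{Oerrs}}{\text{Opkts}} = \frac{222}{134261} \approx 1.65 \times 10^{-3} \approx 0.17\,\%,
\qquad
\frac{\text{Ierrs}}{\text{Ipkts}} = \frac{0}{189582} = 0
```

Since the count rises during the 90M read that checksums correctly, the rate alone does not tie these errors to the corruption.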


+j



Re: NFS writes being corrupted?

2015-08-04 Thread Robert Swindells

David Holland wrote:
>On Mon, Aug 03, 2015 at 02:51:37PM -0700, Jeff Rizzo wrote:
> > I need to look deeper, but a quick test writing lines of
> > ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
> > 
> > Shows that corruption starts when the file is exactly 65536 bytes long
> > (with an 8192 byte page size), with anything that size or longer getting
> > corrupted.  It seems to be randomly garbled - same size, same bytes, but
> > shuffled around.  When I was narrowing it down, I sometimes saw random
> > corruption inserted at larger file sizes - at one point I saw short strings
> > of NUL and the string "posix2_upe" (which would appear to be a symbol?)
> > inserted at seemingly-random spots.
>
>Does that size vary with the NFS block size?

Also, is it using UDP or TCP?

>The symptoms make me think of scrambled mbufs, if anything...

My guess is that the panics that wiz and I saw in the checksum code
on amd64 were also due to scrambled mbufs.

My cubietruck seems fine using awge(4); I have built a fair number of
packages over NFS recently.

Robert Swindells


Re: NFS writes being corrupted?

2015-08-03 Thread Martin Husemann
On Mon, Aug 03, 2015 at 06:10:38PM -0400, Michael wrote:
> That's been a problem on MIPS for a long time, nobody seems to know
> why. Never seen it on ARM though, but then again I never checked.

I run a diskless NetBSD-current evbearm (ARMv5, soft-float) machine and see
no trouble with NFS writes. It has different network hardware, though, so I
wouldn't rule out awge bugs.

I'll test cubietruck NFS.

Martin


Re: NFS writes being corrupted?

2015-08-03 Thread David Holland
On Mon, Aug 03, 2015 at 02:51:37PM -0700, Jeff Rizzo wrote:
 > I need to look deeper, but a quick test writing lines of
 > ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
 > 
 > Shows that corruption starts when the file is exactly 65536 bytes long
 > (with an 8192 byte page size), with anything that size or longer getting
 > corrupted.  It seems to be randomly garbled - same size, same bytes, but
 > shuffled around.  When I was narrowing it down, I sometimes saw random
 > corruption inserted at larger file sizes - at one point I saw short strings
 > of NUL and the string "posix2_upe" (which would appear to be a symbol?)
 > inserted at seemingly-random spots.

Does that size vary with the NFS block size?

The symptoms make me think of scrambled mbufs, if anything...

-- 
David A. Holland
dholl...@netbsd.org


Re: NFS writes being corrupted?

2015-08-03 Thread Michael
Hello,

On Mon, 3 Aug 2015 09:02:19 -0700
Jeff Rizzo  wrote:

> I got my odroid-c1 back online yesterday with -current, and noticed that 
> anything I copied to an NFS-mounted volume would get silently 
> corrupted.  (sha1 from the NFS client and on the NFS server read the 
> same, though)

That's been a problem on MIPS for a long time; nobody seems to know
why. I've never seen it on ARM, though, but then again I never checked.

have fun
Michael


Re: NFS writes being corrupted?

2015-08-03 Thread Jeff Rizzo

On 8/3/15 10:15 AM, Martin Husemann wrote:
> On Mon, Aug 03, 2015 at 09:02:19AM -0700, Jeff Rizzo wrote:
>> I'm about 80% sure this was working around 7.99.9, but for a number of
>> reasons it's complicated for me to check older builds, and in any event
>> odroid-c1 support is fairly new.  I noticed some changes to the NFS code
>> on 15 July (
>> http://mail-index.netbsd.org/source-changes/2015/07/15/msg067309.html ),
>> but backing these out does not change the behavior.
>
> What kind of differences do you see?
>
> Truncation to a multiple of page size? Last partial page filled with zeroes?
> Random corruption? Do you get identical content back when reading on the
> client directly after write?
>
> Typical culprit would be cache ops/pmap issues.
>
> Martin


I need to look deeper, but a quick test writing lines of
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
shows that corruption starts when the file is exactly 65536 bytes long
(with an 8192-byte page size); anything that size or longer gets
corrupted.  It seems to be randomly garbled - same size, same bytes, but
shuffled around.  When I was narrowing it down, I sometimes saw random
corruption inserted at larger file sizes - at one point I saw short
strings of NULs and the string "posix2_upe" (which would appear to be a
symbol?) inserted at seemingly random spots.
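The test described above is easy to reproduce offline. A minimal sketch, assuming the file is compared in memory after being read back: fill_pattern() builds the alphabet-line pattern, first_mismatch() finds where corruption starts, and same_bytes() checks the "same bytes, but shuffled" observation with a byte histogram. All names are hypothetical, and the 8 KB chunk swap in the demo only simulates scrambling; it is not data from this thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static const char LINE[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\n";

/* Fill buf with repeated copies of LINE, like the test file. */
void
fill_pattern(unsigned char *buf, size_t len)
{
	size_t n = sizeof(LINE) - 1;	/* 53 bytes per line, without NUL */

	for (size_t i = 0; i < len; i++)
		buf[i] = (unsigned char)LINE[i % n];
}

/* Offset of the first differing byte, or len if the buffers match. */
size_t
first_mismatch(const unsigned char *a, const unsigned char *b, size_t len)
{
	size_t i = 0;

	while (i < len && a[i] == b[i])
		i++;
	return i;
}

/* True if b holds exactly the same bytes as a, in any order. */
bool
same_bytes(const unsigned char *a, const unsigned char *b, size_t len)
{
	long count[256] = { 0 };

	for (size_t i = 0; i < len; i++) {
		count[a[i]]++;
		count[b[i]]--;
	}
	for (int c = 0; c < 256; c++)
		if (count[c] != 0)
			return false;
	return true;
}

void
pattern_demo(void)
{
	static unsigned char sent[65536], got[65536];
	unsigned char tmp[8192];

	fill_pattern(sent, sizeof(sent));
	memcpy(got, sent, sizeof(got));
	/* Simulated scrambling: swap the first two 8 KB chunks. */
	memcpy(tmp, got, 8192);
	memcpy(got, got + 8192, 8192);
	memcpy(got + 8192, tmp, 8192);

	assert(first_mismatch(sent, got, sizeof(sent)) == 0);
	assert(same_bytes(sent, got, sizeof(sent)));
}
```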


+j




Re: NFS writes being corrupted?

2015-08-03 Thread Martin Husemann
On Mon, Aug 03, 2015 at 09:02:19AM -0700, Jeff Rizzo wrote:
> I'm about 80% sure this was working around 7.99.9, but for a number of 
> reasons it's complicated for me to check older builds, and in any event 
> odroid-c1 support is fairly new.  I noticed some changes to the NFS code 
> on 15 July ( 
> http://mail-index.netbsd.org/source-changes/2015/07/15/msg067309.html ), 
> but backing these out does not change the behavior.

What kind of differences do you see?

Truncation to a multiple of page size? Last partial page filled with zeroes?
Random corruption? Do you get identical content back when reading on the
client directly after write?

Typical culprits would be cache ops or pmap issues.

Martin
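The questions above can be answered mechanically by comparing the expected file with what is read back. A sketch with hypothetical names, using the 8192-byte page size from Jeff's report in the demo, that separates truncation to a page multiple, a zero-filled last partial page, and anything else:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum corruption {
	C_IDENTICAL,
	C_TRUNCATED_TO_PAGE,	/* shorter, cut at a page multiple, prefix intact */
	C_ZERO_TAIL,		/* last partial page reads back as zeroes */
	C_OTHER,		/* anything else: "random corruption" */
};

/* Classify read-back data against what was written; pagesize must be > 0. */
enum corruption
classify(const unsigned char *want, size_t want_len,
    const unsigned char *got, size_t got_len, size_t pagesize)
{
	if (got_len == want_len && memcmp(want, got, want_len) == 0)
		return C_IDENTICAL;

	if (got_len < want_len && got_len % pagesize == 0 &&
	    memcmp(want, got, got_len) == 0)
		return C_TRUNCATED_TO_PAGE;

	if (got_len == want_len && want_len % pagesize != 0) {
		size_t last_page = want_len - want_len % pagesize;

		if (memcmp(want, got, last_page) == 0) {
			for (size_t i = last_page; i < got_len; i++)
				if (got[i] != 0)
					return C_OTHER;
			return C_ZERO_TAIL;
		}
	}
	return C_OTHER;
}

void
classify_demo(void)
{
	unsigned char want[10000], got[10000];

	memset(want, 'x', sizeof(want));
	memcpy(got, want, sizeof(got));
	assert(classify(want, 10000, got, 10000, 8192) == C_IDENTICAL);
	assert(classify(want, 10000, got, 8192, 8192) == C_TRUNCATED_TO_PAGE);

	memset(got + 8192, 0, 10000 - 8192);	/* zero the partial last page */
	assert(classify(want, 10000, got, 10000, 8192) == C_ZERO_TAIL);

	got[100] = 'y';				/* damage an earlier byte */
	assert(classify(want, 10000, got, 10000, 8192) == C_OTHER);
}
```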


NFS writes being corrupted?

2015-08-03 Thread Jeff Rizzo
I got my odroid-c1 back online yesterday with -current, and noticed that
anything I copied to an NFS-mounted volume would get silently
corrupted.  (The sha1 computed on the NFS client and the one computed on
the NFS server match each other, though.)


I'm about 80% sure this was working around 7.99.9, but for a number of 
reasons it's complicated for me to check older builds, and in any event 
odroid-c1 support is fairly new.  I noticed some changes to the NFS code 
on 15 July ( 
http://mail-index.netbsd.org/source-changes/2015/07/15/msg067309.html ), 
but backing these out does not change the behavior.


Has anyone else seen problems with NFS?  Or with odroid-c1 or awge(4) in 
general?  dmesg below in case it gives any hints:


Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015
The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California.  All rights reserved.

NetBSD 7.99.20 (ODROID-iscsi) #3: Sun Aug  2 21:06:09 PDT 2015
r...@cassava.tastylime.net:/scratch/evbarm7/obj/sys/arch/evbarm/compile/ODROID-iscsi
total memory = 1024 MB
avail memory = 1007 MB
sysctl_createv: sysctl_create(machine_arch) returned 17
timecounter: Timecounters tick every 10.000 msec
mainbus0 (root)
cpu0 at mainbus0 core 0: 1512 MHz Cortex-A5 r0p1 (Cortex V7A core)
cpu0: DC enabled IC enabled WB disabled EABT branch prediction enabled
cpu0: sctlr: 0xc51c7d
cpu0: actlr: 0x6041
cpu0: revidr: 0x410fc051
cpu0: mpidr: 0x8200
cpu0: isar: [0]=0x10 [1]=0x13112111 [2]=0x21232041 [3]=0x2131, [4]=0x11142, [5]=0
cpu0: mmfr: [0]=0x100103 [1]=0x4000 [2]=0x123 [3]=0x102211
cpu0: pfr: [0]=0x1231 [1]=0x11
cpu0: 32KB/32B 2-way L1 VIPT Instruction cache
cpu0: 32KB/32B 4-way write-back-locking-C L1 PIPT Data cache
cpu0: 512KB/32B 8-way write-back L2 PIPT Unified cache
vfp0 at cpu0: NEON MPE (VFP 3.0+), rounding, NaN propagation, denormals
vfp0: mvfr: [0]=0x10110222 [1]=0x
cpu1 at mainbus0 core 1
cpu2 at mainbus0 core 2
cpu3 at mainbus0 core 3
armperiph0 at mainbus0
armgic0 at armperiph0: Generic Interrupt Controller, 256 sources (245 valid)
armgic0: 32 Priorities, 224 SPIs, 5 PPIs, 16 SGIs
a9tmr0 at armperiph0: A5 Global 64-bit Timer (378 MHz)
a9tmr0: interrupting on irq 27
a9wdt0 at armperiph0: A5 Watchdog Timer, default period is 12 seconds
arml2cc0 at armperiph0: ARM PL310 r3p3 L2 Cache Controller (disabled)
arml2cc0: cache enabled
amlogicio0 at mainbus0
amlogiccom0 at amlogicio0 port 0: console
amlogiccom0: interrupting at irq 122
amlogicgpio0 at amlogicio0: GPIO controller
gpio0 at amlogicgpio0 (GPIOX): 22 pins
gpio1 at amlogicgpio0 (GPIOY): 15 pins
gpio2 at amlogicgpio0 (GPIODV): 30 pins
gpio3 at amlogicgpio0 (GPIOH): 6 pins
gpio4 at amlogicgpio0 (GPIOAO): 14 pins
gpio5 at amlogicgpio0 (BOOT): 19 pins
gpio6 at amlogicgpio0 (CARD): 7 pins
genfb0 at amlogicio0: switching to framebuffer console
genfb0: framebuffer at 0xc9e0, size 1280x720, depth 16, stride 2560
wsdisplay0 at genfb0 kbdmux 1: console (default, vt100 emulation)
wsmux1: connecting to wsdisplay0
wsdisplay0: screen 1-3 added (default, vt100 emulation)
amlogicrng0 at amlogicio0
dwctwo0 at amlogicio0 port 0: USB controller
dwctwo1 at amlogicio0 port 1: USB controller
awge0 at amlogicio0: Gigabit Ethernet Controller
awge0: interrupting on irq 40
awge0: Ethernet address: 00:1e:06:c3:49:c1
rgephy0 at awge0 phy 0: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 6
rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT-FDX, auto
rgephy1 at awge0 phy 1: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 6
rgephy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT-FDX, auto
amlogicsdhc0 at amlogicio0 port 1: SDHC controller
amlogicsdhc0: interrupting on irq 110
amlogicrtc0 at amlogicio0: RTC battery not present or discharged
usb0 at dwctwo0: USB revision 2.0
usb1 at dwctwo1: USB revision 2.0
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
timecounter: Timecounter "a9tmr0" frequency 37800 Hz quality 500
cpu2: 1512 MHz Cortex-A5 r0p1 (Cortex V7A core)
cpu2: DC enabled IC enabled WB disabled EABT branch prediction enabled
cpu2: sctlr: 0xc51c7d
cpu2: actlr: 0x6041
cpu2: revidr: 0x410fc051
cpu2: mpidr: 0x8202
cpu2: isar: [0]=0x10 [1]=0x13112111 [2]=0x21232041 [3]=0x2131, [4]=0x11142, [5]=0
cpu2: mmfr: [0]=0x100103 [1]=0x4000 [2]=0x123 [3]=0x102211
cpu2: pfr: [0]=0x1231 [1]=0x11
cpu2: 32KB/32B 2-way L1 VIPT Instruction cache
cpu2: 32KB/32B 4-way write-back-locking-C L1 PIPT Data cache
cpu2: 512KB/32B 8-way write-back L2 PIPT Unified cache
vfp2 at cpu2: NEON MPE (VFP 3.0+), rounding, NaN propagation, denormals
vfp2: mvfr: [0]=0x10110222 [1]=0x
cpu3: 1512 MHz Cortex-A5 r0p1 (Cortex V7A core)
cpu3: DC enabled IC enabled WB disabled EABT branch prediction enabled
cpu3: sctlr: 0xc51c7d
cpu3: actlr: 0x6041
cpu3: revidr: 0x410fc051
cpu3: mpidr: 0x8203