Re: kern/173479: [nfs] chown and chgrp operations fail between FreeBSD 9.1RC3 NFSv4 server and RH63 NFSv4 client

2012-11-26 Thread Rick Macklem
Christopher D. Harrison wrote:
> The following reply was made to PR kern/173479; it has been noted by
> GNATS.
> 
> From: "Christopher D. Harrison" 
> To: bug-follo...@freebsd.org, j...@cse.yorku.ca
> Cc:
> Subject: Re: kern/173479: [nfs] chown and chgrp operations fail
> between FreeBSD
> 9.1RC3 NFSv4 server and RH63 NFSv4 client
> Date: Mon, 26 Nov 2012 17:23:15 -0600
> 
> The same problem also occurs in FreeBSD 9.0 release.
> -C
> 
In case you didn't see the previous discussions, this happens for
Linux 3.3 or later kernels, where the default is to put the uid in
a string for the owner and owner_group attributes. RFC-3530, which
has not yet been replaced as the RFC for NFSv4.0, does not recommend
this. A requirement for client support of this is in an internet
draft called rfc3530bis, but this has not become an RFC yet.

I think the Linux folks "jumped the gun" when they made this the
default. You can change this using a sysctl on the server, so that
it uses the "user@dns_domain" format recommended by RFC-3530, or
you can upgrade to stable/9, which does have client support for
the uid in a string. (The "uid in a string" was added mainly to
support NFSv4 root mounts for diskless clients.)
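
As a rough illustration (the names and numbers here are made up), the two
forms of the owner/owner_group attribute strings look like:

  owner = "joe@example.edu"  <- the user@dns_domain form RFC-3530 recommends
  owner = "1001"             <- the uid in a string that Linux 3.3+ sends by default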

This PR will be closed when I get home next week and can do so.

rick



Re: kern/173479: [nfs] chown and chgrp operations fail between FreeBSD 9.1RC3 NFSv4 server and RH63 NFSv4 client

2012-11-27 Thread Rick Macklem
Christopher D. Harrison wrote:
> do you know what that sysctl attr is called?
> -C
> 
I don't recall it and it seems I've deleted his email, but Jason Keltz
(whom I've cc'd) emailed it to me and hopefully can remember it.

rick
ps: When I get it this time, I'll write it down in my notebook.
I'm "old and old fashioned", so that's how I remember stuff;-)

> On 11/26/12 21:19, Rick Macklem wrote:
> > Christopher D. Harrison wrote:
> >> The following reply was made to PR kern/173479; it has been noted
> >> by
> >> GNATS.
> >>
> >> From: "Christopher D. Harrison"
> >> To: bug-follo...@freebsd.org, j...@cse.yorku.ca
> >> Cc:
> >> Subject: Re: kern/173479: [nfs] chown and chgrp operations fail
> >> between FreeBSD
> >> 9.1RC3 NFSv4 server and RH63 NFSv4 client
> >> Date: Mon, 26 Nov 2012 17:23:15 -0600
> >>
> >> The same problem also occurs in FreeBSD 9.0 release.
> >> -C
> >>
> > In case you didn't see the previous discussions, this happens for
> > Linux 3.3 or later kernels, where the default is to put the uid in
> > a string for the owner and owner_group attributes. RFC-3530, which
> > has not yet been replaced as the RFC for NFSv4.0 does not recommend
> > this. A requirement for client support of this is in an internet
> > draft called rfc3530bis, but this has not become an RFC yet.
> >
> > I think the Linux folks "jumped the gun" when they made this the
> > default. You can change this using a sysctl on the server, so that
> > it uses the "user@dns_domain" format recommended by RFC-3530 or
> > you can upgrade to stable/9, which does have client support for
> > the uid in a string. (The "uid in a string" was added mainly to
> > support NFSv4 root mounts for diskless clients.)
> >
> > This PR will be closed when I get home next week and can do so.
> >
> > rick
> >


Re: WTF? RPCPROG_NFS: RPC: Program not registered

2013-02-16 Thread Rick Macklem
Ronald F. Guilmette wrote:
> I have a 9.1-RELEASE server whose /etc/rc.conf file contains, among
> other
> things, the following lines:
> 
> ifconfig_nfe0="inet 192.168.1.2 netmask 255.255.255.0"
> #
> nfs_client_enable="YES"
> nfs_server_enable="YES"
> nfs_server_flags="-h 192.168.1.2"
Add -t to these flags. It appears that the default is
UDP only.

> mountd_enable="YES"
> rpcbind_enable="YES"
> 
> On this server, I also have an /etc/exports file that contains:
> 
> /home/rfg -network 192.168.1.0 -mask 255.255.255.0
> /x -network 192.168.1.0 -mask 255.255.255.0
> 
> On this same server, when I do "showmount -e 192.168.1.2" I get the
> following
> output:
> 
> Exports list on 192.168.1.2:
> /x 192.168.1.0
> /home/rfg 192.168.1.0
> 
> 
> On this server, when I am root, I attempt to do:
> 
> mount -t nfs 192.168.1.2:/x /mnt
> 
> but then I just get the following error:
> 
> [tcp] 192.168.1.2:/x: RPCPROG_NFS: RPC: Program not registered
> 
> Why?
> 
It is trying to mount via TCP and you only have UDP enabled, I think.

> More to the point, what can I do to get rid of this error?
> 
I think adding -t to the nfs_server_flags should fix it.
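
For example, something along these lines in /etc/rc.conf (keeping your -h
setting and adding -u so UDP stays enabled as well; adjust to taste):

nfs_server_flags="-h 192.168.1.2 -t -u"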

> I really am stuck. I have no idea what causes this error, nor even how
> to
> debug it. I have already google'd the hell out of the problem, and I
> am
> still coming up empty.
> 
> Note also that when the failure occurs there is -nothing- added at
> that
> time to /var/log/messages.
> 
> 
> Regards,
> rfg
> 
> 
> P.S. Of course, I don't actually need to mount the exported volume
> onto
> the same machine where it physically already resides. I do however
> wish
> to mount it (via NFS) onto another system on my LAN, and over on that
> other
> system, when I try to mount it, I am getting the exact same *&^%$#@
> error.
> 
> 
> P.P.S. In case anybody should ask, this is the output of rpcinfo
> 192.168.1.2:
> 
> program version netid address service owner
> 100000 4 tcp 0.0.0.0.0.111 rpcbind superuser
> 100000 3 tcp 0.0.0.0.0.111 rpcbind superuser
> 100000 2 tcp 0.0.0.0.0.111 rpcbind superuser
> 100000 4 udp 0.0.0.0.0.111 rpcbind superuser
> 100000 3 udp 0.0.0.0.0.111 rpcbind superuser
> 100000 2 udp 0.0.0.0.0.111 rpcbind superuser
> 100000 4 tcp6 ::.0.111 rpcbind superuser
> 100000 3 tcp6 ::.0.111 rpcbind superuser
> 100000 4 udp6 ::.0.111 rpcbind superuser
> 100000 3 udp6 ::.0.111 rpcbind superuser
> 100000 4 local /var/run/rpcbind.sock rpcbind superuser
> 100000 3 local /var/run/rpcbind.sock rpcbind superuser
> 100000 2 local /var/run/rpcbind.sock rpcbind superuser
> 100005 1 udp6 ::.3.63 mountd superuser
> 100005 3 udp6 ::.3.63 mountd superuser
> 100005 1 tcp6 ::.3.63 mountd superuser
> 100005 3 tcp6 ::.3.63 mountd superuser
> 100005 1 udp 0.0.0.0.3.63 mountd superuser
> 100005 3 udp 0.0.0.0.3.63 mountd superuser
> 100005 1 tcp 0.0.0.0.3.63 mountd superuser
> 100005 3 tcp 0.0.0.0.3.63 mountd superuser
> 100003 2 udp 0.0.0.0.8.1 nfs superuser
> 100003 3 udp 0.0.0.0.8.1 nfs superuser
> 
Only udp entries are listed for nfs. After adding -t and rebooting, you should
see tcp lines as well. At least that's my guess.

Good luck with it, rick



Re: WTF? RPCPROG_NFS: RPC: Program not registered

2013-02-17 Thread Rick Macklem
Ronald F. Guilmette wrote:
> In message
> <689563329.3076797.1361028594307.javamail.r...@erie.cs.uoguelph.ca>,
> Rick Macklem  wrote:
> 
> >Ronald F. Guilmette wrote:
> >> nfs_server_flags="-h 192.168.1.2"
> >Add -t to these flags. It appears that the default is UDP only.
> 
> 
> YES! Thank you. That did the trick alright.
> 
> I gather that in the 9.x series, there is a new nfs server thing, yes?
> 
> And I further gather that this one needs the new -t flag, yes?
> 
> (Sigh. My own feeling is that tcp support should have been enabled by
> default... as in the past.)
> 
Nope. The old server used "-t" as well. The default settings in
/etc/defaults/rc.conf are:
nfs_server_flags="-t -u -n 4"

You overrode those when you set nfs_server_flags.

rick

> Anyway, thanks again for your help.
> 
> 
> Regards,
> rfg


Re: Limits on jumbo mbuf cluster allocation

2013-03-08 Thread Rick Macklem
Garrett Wollman wrote:
> <  said:
> 
> > [stuff I wrote deleted]
> > You have an amd64 kernel running HEAD or 9.x?
> 
> Yes, these are 9.1 with some patches to reduce mutex contention on the
> NFS server's replay "cache".
> 
The cached replies are copies of the mbuf list done via m_copym().
As such, the clusters in these replies won't be free'd (ref cnt -> 0)
until the cache is trimmed (nfsrv_trimcache() gets called after the
TCP layer has received an ACK for receipt of the reply from the client).
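
As a point of reference, here is a minimal sketch (not the actual server
code; the helper name is made up) of how a reply mbuf chain gets copied
for the cache:

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/mbuf.h>

  /*
   * Copy an entire reply mbuf chain for the DRC. m_copym() shares the
   * underlying clusters, so their reference counts stay above zero
   * until the cached copy is m_freem()'d when the cache is trimmed.
   */
  static struct mbuf *
  cache_reply_copy(struct mbuf *reply)
  {

          return (m_copym(reply, 0, M_COPYALL, M_NOWAIT));
  }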

If reducing the size to 4K doesn't fix the problem, you might want to
consider shrinking the tunable vfs.nfsd.tcphighwater and suffering
the increased CPU overhead (and some increased mutex contention) of
calling nfsrv_trimcache() more frequently.
(I'm assuming that you are using drc2.patch + drc3.patch. If you are
 using one of ivoras@'s variants of the patch, I'm not sure if the
 tunable is called the same thing, although it should have basically
 the same effect.)

Good luck with it and thanks for running on the "bleeding edge" so
these issues get identified, rick

> > Jumbo pages come directly from the kernel_map which on amd64 is
> > 512GB.
> > So KVA shouldn't be a problem. Your problem indeed appears to come from
> > physical memory fragmentation in pmap.
> 
> I hadn't realized that they were physically contiguous, but that makes
> perfect sense.
> 
> > pages. Also since you're doing NFS serving almost all memory will be
> > in use for file caching.
> 
> I actually had the ZFS ARC tuned down to 64 GB (out of 96 GB physmem)
> when I experienced this, but there are plenty of data structures in
> the kernel that aren't subject to this limit and I could easily
> imagine them checkerboarding physical memory to the point where no
> contiguous three-page allocations were possible.
> 
> -GAWollman
> 


Re: NFS DRC size

2013-03-09 Thread Rick Macklem
Garrett Wollman wrote:
> <  said:
> 
> > The cached replies are copies of the mbuf list done via m_copym().
> > As such, the clusters in these replies won't be free'd (ref cnt ->
> > 0)
> > until the cache is trimmed (nfsrv_trimcache() gets called after the
> > TCP layer has received an ACK for receipt of the reply from the
> > client).
> 
> I wonder if this bit is even working at all. In my experience, the
> size of the DRC quickly grows under load up to the maximum (or
> actually, slightly beyond), and never drops much below that level. On
> my production server right now, "nfsstat -se" reports:
> 
Well, once you add the patches and turn vfs.nfsd.tcphighwater up, it
will only trim the cache when that highwater mark is exceeded. When
it does the trim, the size does drop for the simple testing I do with
a single client. (I'll take another look at drc3.patch and see if I
can spot anywhere this might be broken, although my hunch is
that you have a lot of TCP connections and enough activity that it
rapidly grows back up to the limit.) The fact that it trims down to
around the highwater mark basically indicates this is working. If it wasn't
throwing away replies where the receipt has been ack'd at the TCP
level, the cache would grow very large, since they would only be
discarded after a loonnngg timeout (12 hours unless you've changed
NFSRVCACHE_TCPTIMEOUT in sys/fs/nfs/nfs.h).

> Server Info:
> Getattr Setattr Lookup Readlink Read Write Create Remove
> 13036780 359901 1723623 3420 36397693 12385668 346590 109984
> Rename Link Symlink Mkdir Rmdir Readdir RdirPlus Access
> 45173 16 116791 14192 1176 24 12876747 3398533
> Mknod Fsstat Fsinfo PathConf Commit LookupP SetClId SetClIdCf
> 0 2703 14992 7502 1329196 0 1 1
> Open OpenAttr OpenDwnGr OpenCfrm DelePurge DeleRet GetFH Lock
> 263034 0 0 263019 0 0 545104 0
> LockT LockU Close Verify NVerify PutFH PutPubFH PutRootFH
> 0 0 263012 0 0 23753375 0 1
> Renew RestoreFH SaveFH Secinfo RelLckOwn V4Create
> 2 263006 263033 0 0 0
> Server:
> Retfailed Faults Clients
> 0 0 1
> OpenOwner Opens LockOwner Locks Delegs
> 56 10 0 0 0
> Server Cache Stats:
> Inprog Idem Non-idem Misses CacheSize TCPPeak
> 0 0 0 81714128 60997 61017
> 
> It's only been up for about the last 24 hours. Should I be setting
> the size limit to something truly outrageous, like 200,000? (I'd
> definitely need to deal with the mbuf cluster issue then!) The
> average request rate over this time is about 1000/s, but that includes
> several episodes of high-cpu spinning (which I resolved by increasing
> the DRC limit).
> 
It is the number of TCP connections from clients that determines how much
gets cached, not the request rate. For TCP, a scheme like LRU doesn't work,
because RPC retries (as opposed to TCP segment retransmits) only happen long
after the initial RPC request. (Usually after a TCP connection has broken and
the client has established a new connection, although some NFSv3 over TCP
clients will retry an RPC after a long timeout.) The cache needs to hold the
last N RPC replies for each TCP connection and discard them when further
traffic on the TCP connection indicates that the connection is still working.
(Some NFSv3 over TCP servers don't guarantee to generate a reply for an RPC
 when resource constrained, but the FreeBSD one always sends a reply, except
 for NFSv2, where it will close down the TCP connection when it has no choice.
 I doubt any client is doing NFSv2 over TCP, so I don't consider this relevant.)

If the CPU is spinning in nfsrc_trimcache() a lot, increasing 
vfs.nfsd.tcphighwater
should decrease that, but with an increase in mbuf cluster allocation.

If there is a lot of contention for mutexes, increasing the size of the hash
table might help. The drc3.patch bumped the hash table from 20->200,
but that would still be about 300 entries per hash list and one mutex for
those 300 entries, assuming the hash function is working well.
Increasing it only adds list head pointers and mutexes.
(It's NFSRVCACHE_HASHSIZE in sys/fs/nfs/nfsrvcache.h.)

Unfortunately, increasing it requires a kernel rebuild/reboot. Maybe the patch
for head should change the size of the hash table when vfs.nfsd.tcphighwater
is set much larger? (Not quite trivial and will probably result in a short 
stall of
the nfsd threads, since all the entries will need to be rehashed/moved to
new lists, but could be worth the effort.)

> Meanwhile, some relevant bits from sysctl:
> 
> vfs.nfsd.udphighwater: 500
> vfs.nfsd.tcphighwater: 61000
> vfs.nfsd.minthreads: 16
> vfs.nfsd.maxthreads: 64
> vfs.nfsd.threads: 64
> vfs.nfsd.request_space_used: 1416
> vfs.nfsd.request_space_used_highest: 4284672
> vfs.nfsd.request_space_high: 47185920
> vfs.nfsd.request_space_low: 31457280
> vfs.nfsd.request_space_throttled: 0
> vfs.nfsd.request_space_throttle_count: 0
> 
> (I'd actually like to put maxthreads back up at 256, which is where I
> had it during testing, but I need to test that the jumbo-frames issue
> is fixed first. I did

Re: Limits on jumbo mbuf cluster allocation

2013-03-09 Thread Rick Macklem
Garrett Wollman wrote:
> <  said:
> 
> > If reducing the size to 4K doesn't fix the problem, you might want
> > to
> > consider shrinking the tunable vfs.nfsd.tcphighwater and suffering
> > the increased CPU overhead (and some increased mutex contention) of
> > calling nfsrv_trimcache() more frequently.
> 
> Can't do that -- the system becomes intolerably slow when it gets into
> that state, and seems to get stuck that way, such that the only way to
> restore performance is to increase the size of the "cache".
> (Essentially all of the nfsd service threads end up spinning most of
> the time, load average goes to N, and goodput goes to nearly nil.) It
> does seem like a lot of effort for an extreme edge case that, in
> practical terms, never happens.
> 
So, it sounds like you've found a reasonable setting. Yes, if it is too
small, it will keep trimming over and over and over again...

I suspect this indicates that it isn't mutex contention, since the
threads would block waiting for the mutex for that case, I think?
(Bumping up NFSRVCACHE_HASHSIZE can't hurt if/when you get the chance.)

> > (I'm assuming that you are using drc2.patch + drc3.patch.
> 
> I believe that's what I have. If my kernel coding skills were less
> rusty, I'd fix it to have a separate cache-trimming thread.
> 
I've thought about this. My concern is that the separate thread might
not keep up with the trimming demand. If that occurred, the cache would
grow veryyy laarrggge, with effects like running out of mbuf clusters.

By having the nfsd threads do it, they slow down, which provides feedback
to the clients (slower RPC replies->generate fewer requests->less to cache).
(I think you are probably familiar with the generic concept that a system
 needs feedback to remain stable. An M/M/1 queue with open arrivals and
 no feedback to slow the arrival rate explodes when the arrival rate
 approaches the service rate, etc and so on...)
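
For the record, the textbook M/M/1 result behind that remark: with arrival
rate lambda and service rate mu, the utilization is rho = lambda/mu and the
expected number of requests in the system is

  L = rho / (1 - rho)

which blows up as lambda approaches mu. Some form of feedback that slows the
arrivals is what keeps rho bounded away from 1.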

As such, I'm not convinced a separate thread is a good idea. I think
that simply allowing sysadmins to disable the DRC for TCP may make
sense. Although I prefer more reliable vs better performance, I can
see the argument that TCP transport for RPC is "good enough" for
some environments. (Basically, if a site has a high degree of
confidence in their network fabric, such that network partitioning
type failures are pretty well non-existent and the NFS server isn't
getting overloaded to the point of very slow RPC replies, I can
see TCP retransmits as being sufficient?)

> One other weird thing that I've noticed is that netstat(1) reports the
> send and receive queues on NFS connections as being far higher than I
> have the limits configured. Does NFS do something to override this?
> 
> -GAWollman
> 
The nfs server does soreserve(so, sb_max_adj, sb_max_adj); I can't
recall exactly why it is that way, except that it needs to be large
enough to handle the largest RPC request a client might generate.

I should take another look at this, in case sb_max_adj is now
too large?

rick



Re: Limits on jumbo mbuf cluster allocation

2013-03-09 Thread Rick Macklem
Garrett Wollman wrote:
> In article <20795.29370.194678.963...@hergotha.csail.mit.edu>, I
> wrote:
> >< > said:
> >> I've thought about this. My concern is that the separate thread
> >> might
> >> not keep up with the trimming demand. If that occurred, the cache
> >> would
> >> grow veryyy laarrggge, with effects like running out of mbuf
> >> clusters.
> >
> >At a minimum, once one nfsd thread is committed to doing the cache
> >trim, a flag should be set to discourage other threads from trying to
> >do it. Having them all spinning their wheels punishes the clients
> >much too much.
> 
Yes, I think this is a good idea. The current code acquires the mutex
before updating the once/sec variable. As such it would be easy to
get multiple threads in there concurrently.

This is easy to do. Just define a static variable in nfsrc_trimcache(),
initially 0:
- if it is not 0, return;
- otherwise, set it non-zero;
- do the trimming;
- set it back to 0 before returning.

Since this is just a heuristic to avoid multiple threads doing the
trim concurrently, I think it can be safely done outside of the mutex.

If you need help coding this, just email and I can come up with a
quick patch.
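
A minimal sketch of that idea (illustrative only; the flag name is made up
and the real nfsrc_trimcache() arguments are omitted):

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <machine/atomic.h>

  static u_int nfsrc_intrim = 0;  /* nonzero while some thread is trimming */

  void
  nfsrc_trimcache(/* existing arguments */)
  {

          /* Let only one nfsd thread do the trim; the rest return at once. */
          if (atomic_cmpset_int(&nfsrc_intrim, 0, 1) == 0)
                  return;
          /* ... the existing trimming passes go here ... */
          atomic_store_rel_int(&nfsrc_intrim, 0);
  }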

> Also, it occurs to me that this strategy is subject to livelock. To
> put backpressure on the clients, it is far better to get them to stop
> sending (by advertising a small receive window) than to accept their
> traffic but queue it for a long time. By the time the NFS code gets
> an RPC, the system has already invested so much into it that it should
> be processed as quickly as possible, and this strategy essentially
> guarantees[1] that, once those 2 MB socket buffers start to fill up,
> they
> will stay filled, sending latency through the roof. If nfsd didn't
> override the usual socket-buffer sizing mechanisms, then sysadmins
> could limit the buffers to ensure a stable response time.
> 
> The bandwidth-delay product in our network is somewhere between 12.5
> kB and 125 kB, depending on how the client is connected and what sort
> of latency they experience. The usual theory would suggest that
> socket buffers should be no more than twice that -- i.e., about 256
> kB.
> 
Well, the code that uses sb_max_adj wasn't written by me (I just cloned
it for the new server). In the author's defence, I believe SB_MAX was 256K when
it was written. It was 256K in 2011. I think sb_max_adj was used because
soreserve() fails for a larger value and the code doesn't check for such a 
failure.
(Yea, it should be fixed so that it checks for a failure return from 
soreserve().
 I did so for the client some time ago.;-)

Just grep for sb_max_adj. You'll see it sets a variable called "siz".
Make "siz" whatever you want (256K sounds like a good guess). Just make
sure it isn't > sb_max_adj.
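
In other words, something like this (a sketch, not a patch; "so" is the
connection's socket and sb_max_adj is the existing kernel global):

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/socket.h>
  #include <sys/socketvar.h>

  /* Hypothetical helper: reserve capped socket buffers and report failure. */
  static int
  nfsrv_soreserve_capped(struct socket *so)
  {
          u_long siz;

          siz = 256 * 1024;               /* the 256K guess from above */
          if (siz > sb_max_adj)
                  siz = sb_max_adj;
          /* Unlike the current code, let the caller see a failure. */
          return (soreserve(so, siz, siz));
  }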

The I/O sizes are limited to MAXBSIZE, which is currently 64Kb, although
I'd like to increase that to 128Kb someday soon. (As you note below, the
largest RPC is slightly bigger than that.)

Btw, net.inet.tcp.{send/recv}buf_max are both 2Mbytes, just like sb_max,
so those don't seem useful in this case?

I'm no TCP guy, so suggestions w.r.t. how big soreserve() should be set
are welcome.

> I'd actually like to see something like WFQ in the NFS server to allow
> me to limit the amount of damage one client or group of clients can
> do without unnecessarily limiting other clients.
> 
Sorry, I'll admit I have no idea what WFQ is? (I'll look it up on some
web site someday soon, but obviously can't comment before then.) Since
it is possible to receive RPC requests for a given client from multiple
IP addresses, it is pretty hard for NFS to know what client a request
has come from.

rick

> -GAWollman
> 
> [1] The largest RPC is a bit more than 64 KiB (negotiated), so if the
> server gets slow, the 2 MB receive queue will be refilled by the
> client before the server manages to perform the RPC and send a
> response.


Re: NFS DRC size

2013-03-09 Thread Rick Macklem
Garrett Wollman wrote:
> <  said:
> 
> > around the highwater mark basically indicates this is working. If it
> > wasn't
> > throwing away replies where the receipt has been ack'd at the TCP
> > level, the cache would grow very large, since they would only be
> > discarded after a loonnngg timeout (12hours unless you've changes
> > NFSRVCACHE_TCPTIMEOUT in sys/fs/nfs/nfs.h).
> 
> That seems unreasonably large.
> 
I suppose. How long a network partitioning do you want the cache to
deal with? (My original design was trying to achieve a high level of
correctness by default.)

The only time cache entries normally hang around this long is when a
client has dismounted the volume(s) using the TCP connection. The
cached replies for the last few replies will then hang around until
the timeout. For a few clients this isn't an issue. For 2,000 clients,
I can see that it might be, if the clients choose to dismount volumes
(using something like amd).

Feel free to make it smaller, based on the longest network partitioning
that you anticipate might occur.
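
That is, something along these lines in sys/fs/nfs/nfs.h (illustrative only;
I'm assuming the macro is expressed in seconds, with the current value
corresponding to the 12 hours mentioned above):

  #define NFSRVCACHE_TCPTIMEOUT   (30 * 60)   /* 30 minutes instead of 12 hours */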

> > Well, the DRC will try to cache replies until the client's TCP layer
> > acknowledges receipt of the reply. It is hard to say how many
> > replies
> > that is for a given TCP connection, since it is a function of the
> > level
> > of concurrently (# of nfsiod threads in the FreeBSD client)
> > in the client. I'd guess it's somewhere between 1<->20?
> 
> Nearly all our clients are Linux, so it's likely to be whatever Debian
> does by default.
> 
> > Multiply that by the number of TCP connections from all clients and
> > you have about how big the server's DRC will be. (Some clients use
> > a single TCP connection for the client whereas others use a separate
> > TCP connection for each mount point.)
> 
> The Debian client appears to use a single TCP connection for
> everything.
> 
> So if I want to support 2,000 clients each with 20 requests in flight,
> that would suggest that I need a DRC size of 40,000, which my
> experience shows is not sufficient with even a much smaller number of
> clients.
> 
Well, especially since Debian is using one TCP connection for everything
from a client, the guess of 20 could be way low.

rick

> -GAWollman


Re: Limits on jumbo mbuf cluster allocation

2013-03-09 Thread Rick Macklem
Garrett Wollman wrote:
> <  said:
> 
> > I suspect this indicates that it isn't mutex contention, since the
> > threads would block waiting for the mutex for that case, I think?
> 
> No, because our mutexes are adaptive, so each thread spins for a while
> before blocking. With the current implementation, all of them end up
> doing this in pretty close to lock-step.
> 
> > (Bumping up NFSRVCACHE_HASHSIZE can't hurt if/when you get the
> > chance.)
> 
> I already have it set to 129 (up from 20); I could see putting it up
> to, say, 1023. It would be nice to have a sysctl for maximum chain
> length to see how bad it's getting (and if the hash function is
> actually effective).
> 
Yep, I'd bump it up to 1000 or so for a server the size you've built.
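
That is a one-line, compile-time change along these lines (illustrative; as
noted earlier it needs a kernel rebuild/reboot):

  /* sys/fs/nfs/nfsrvcache.h */
  #define NFSRVCACHE_HASHSIZE     1000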

> > I've thought about this. My concern is that the separate thread
> > might
> > not keep up with the trimming demand. If that occurred, the cache
> > would
> > grow veryyy laarrggge, with effects like running out of mbuf
> > clusters.
> 
> At a minimum, once one nfsd thread is committed to doing the cache
> trim, a flag should be set to discourage other threads from trying to
> do it. Having them all spinning their wheels punishes the clients
> much too much.
> 
Yes, this is a good idea, as I mentioned in another reply.

> > By having the nfsd threads do it, they slow down, which provides
> > feedback
> > to the clients (slower RPC replies->generate fewer request->less to
> > cache).
> > (I think you are probably familiar with the generic concept that a
> > system
> >  needs feedback to remain stable. An M/M/1 queue with open arrivals
> >  and
> >  no feedback to slow the arrival rate explodes when the arrival rate
> >  approaches the service rate, etc and so on...)
> 
> Unfortunately, the feedback channel that I have is: one user starts
> 500 virtual machines accessing a filesystem on the server -> other
> users of this server see their goodput go to zero -> everyone sends in
> angry trouble tickets -> I increase the DRC size manually. It would
> be nice if, by the time I next want to take a vacation, I have this
> figured out.
> 
I probably shouldn't say this, but my response to complaints w.r.t. a
slow NFS server was "Tell the boss to spend big bucks on a Netapp.",
back when I was a sysadmin for a college.;-)

Well, it would be easy to come up with a patch that disables the DRC
for TCP. If you'd like a patch for that, just email. So long as your
network fabric is solid, it isn't that big a risk to run that way.
If 500 virtual machines start pounding on the NFS server, I'd be surprised
if other parts of the server don't "hit the wall", but disabling the DRC
will find that out.

It would be nice if there was a way to guarantee that clients get a fair
slice of the server pie, but I don't know of a way to do that. As I noted
in another reply, a client may use multiple IP addresses for the requests.
Also, since traffic from clients tends to be very bursty, putting a limit
on traffic when there isn't a lot of load from other clients doesn't make
sense, I think? Then there is the question of "How does the NFS server know the
system is nearing its load limit so it should apply limits to clients sending
a lot of RPC requests?".
All the NFS server does is translate the RPC requests to VFS/VOP ops,
so I don't see how it will know that the underlying file systems are
nearing their load limit, as one example.

When I ran (much smaller) NFS servers in production, I usually saw the
disks hit their io ops limit.

> I'm OK with throwing memory at the problem -- these servers have 96 GB
> and can hold up to 144 GB -- so long as I can find a tuning that
> provides stability and consistent, reasonable performance for the
> users.
> 
> > The nfs server does soreserve(so, sb_max_adj, sb_max_adj); I can't
> > recall exactly why it is that way, except that it needs to be large
> > enough to handle the largest RPC request a client might generate.
> 
> > I should take another look at this, in case sb_max_adj is now
> > too large?
> 
> It probably shouldn't be larger than the
> net.inet.tcp.{send,recv}buf_max, and the read and write sizes that are
> negotiated should be chosen so that a whole RPC can fit in that
> space. If that's too hard for whatever reason, nfsd should at least
> log a message saying "hey, your socket buffer limits are too small,
> I'm going to ignore them".
> 
As I mentioned in another reply, *buf_max is 2Mbytes these days. I think
I agree that 2Mbytes is larger than you need for your server, given your
LAN environment.

The problem is, I can't think of how an NFS server will know that a new
client connection is on a LAN and not a long-fat WAN connection. The
latter may need to put 2Mbytes on the wire to fill the pipe. My TCP
is *very rusty*, but I think that a sb_hiwat of 256Kbytes is going to
make the send windows shrink so that neither end can send 2Mbytes
of unacknowledged data segments to fill the pipe?

Also, the intent is to apply the "feedback" in cases whe

Re: Limits on jumbo mbuf cluster allocation

2013-03-10 Thread Rick Macklem
Andre Oppermann wrote:
> On 10.03.2013 03:22, Rick Macklem wrote:
> > Garrett Wollman wrote:
> >> Also, it occurs to me that this strategy is subject to livelock. To
> >> put backpressure on the clients, it is far better to get them to
> >> stop
> >> sending (by advertising a small receive window) than to accept
> >> their
> >> traffic but queue it for a long time. By the time the NFS code gets
> >> an RPC, the system has already invested so much into it that it
> >> should
> >> be processed as quickly as possible, and this strategy essentially
> >> guarantees[1] that, once those 2 MB socket buffers start to fill
> >> up,
> >> they
> >> will stay filled, sending latency through the roof. If nfsd didn't
> >> override the usual socket-buffer sizing mechanisms, then sysadmins
> >> could limit the buffers to ensure a stable response time.
> >>
> >> The bandwidth-delay product in our network is somewhere between
> >> 12.5
> >> kB and 125 kB, depending on how the client is connected and what
> >> sort
> >> of latency they experience. The usual theory would suggest that
> >> socket buffers should be no more than twice that -- i.e., about 256
> >> kB.
> >>
> > Well, the code that uses sb_max_adj wasn't written by me (I just
> > cloned
> > it for the new server). In the author's defence, I believe SB_MAX
> > was 256K when
> > it was written. It was 256K in 2011. I think sb_max_adj was used
> > because
> > soreserve() fails for a larger value and the code doesn't check for
> > such a failure.
> > (Yea, it should be fixed so that it checks for a failure return from
> > soreserve().
> >   I did so for the client some time ago.;-)
> 
> We have TCP sockbuf size autotuning for some time now. So explicitly
> setting the size shouldn't be necessary anymore.
> 
Ok. Is it possible for the size to drop below the size of the largest RPC?
(Currently a little over 64K and hopefully a little over 128K soon.)

I'm thinking of the restriction in sosend_generic() where it won't allow a
request greater than sb_hiwat to be added to the send queue. (It is passed
in as an mbuf list via the "top" argument, which makes "atomic" true, I think?)

The soreserve() calls were done in the old days to make sure sb_hiwat was
big enough that sosend() wouldn't return EMSGSIZE.
(I'll take a look at the code and try to see if/when sb_hiwat gets autotuned.)

> > Just grep for sb_max_adj. You'll see it sets a variable called
> > "siz".
> > Make "siz" whatever you want (256K sounds like a good guess). Just
> > make
> > sure it isn't > sb_max_adj.
> >
> > The I/O sizes are limited to MAXBSIZE, which is currently 64Kb,
> > although
> > I'd like to increase that to 128Kb someday soon. (As you note below,
> > the
> > largest RPC is slightly bigger than that.)
> >
> > Btw, net.inet.tcp.{send/recv}buf_max are both 2Mbytes, just like
> > sb_max,
> > so those don't seem useful in this case?
> 
> These are just the limits for auto-tuning.
> 
> > I'm no TCP guy, so suggestions w.r.t. how big soreserve() should be
> > set
> > are welcome.
> 
> I'd have to look more at the NFS code to see what exactly is going on
> and what the most likely settings are going to be. Won't promise any
> ETA though.
> 
Basically an RPC request/reply is an mbuf list where its size can be
up to MAXBSIZE + a hundred bytes or so. (64Kb+ --> 128Kb+ soon)

These need to be queued for sending without getting EMSGSIZE back.

Then, if the mount is for a high bandwidth WAN, it would be nice if
the send window allows several of these to be "in flight" (not yet
acknowledged) so that the "bit pipe" can be kept full (use the
available bandwidth). These could be read-aheads/write-behinds or
requests for other processes/threads in the client.
For example:
- with a 128Kbyte MAXBSIZE and a read-ahead of 15, it would be possible
  to have 128 * 1024 * 16 bytes on the wire, if the TCP window allows
  that. (This would fill a 1Gbps network with a 20msec rtt, if I got
  my rusty math correct. It is rtt and not the time for a packet to
  go in one direction, since the RPC replies need to get back to the
  client before it will do any more reads.) This sounds like the
  upper bound of the current setup, given the 2Mbyte setting for
  net.inet.tcp.sendbuf_max, I think?
  (Yes, I know most use NFS over a LAN, but it would be nice if it
   can work well enough over a WAN to be useful.)
- for a fast LAN, obviously the rtt is much lower, so

Re: Limits on jumbo mbuf cluster allocation

2013-03-10 Thread Rick Macklem
Andre Oppermann wrote:
> On 10.03.2013 07:04, Garrett Wollman wrote:
> > <
> > said:
> >
> >> Yes, in the past the code was in this form, it should work fine
> >> Garrett,
> >> just make sure
> >> the 4K pool is large enough.
> >
> > [Andre Oppermann's patch:]
> >>>   if (adapter->max_frame_size <= 2048)
> >>>           adapter->rx_mbuf_sz = MCLBYTES;
> >>> - else if (adapter->max_frame_size <= 4096)
> >>> + else
> >>>           adapter->rx_mbuf_sz = MJUMPAGESIZE;
> >>> - else if (adapter->max_frame_size <= 9216)
> >>> -         adapter->rx_mbuf_sz = MJUM9BYTES;
> >>> - else
> >>> -         adapter->rx_mbuf_sz = MJUM16BYTES;
> >
> > So I tried exactly this, and it certainly worked insofar as only 4k
> > clusters were allocated, but NFS performance went down precipitously
> > (to fewer than 100 ops/s where normally it would be doing 2000
> > ops/s). I took a tcpdump while it was in this state, which I will
> > try
> > to make some sense of when I get back to the office. (It wasn't
> > livelocked; in fact, the server was mostly idle, but responses would
> > take seconds rather than milliseconds -- assuming the client could
> > even successfully mount the server at all, which the Debian
> > automounter frequently refused to do.)
> 
> This is very weird and unlikely to come from the 4k mbufs by itself
> considering they are in heavy use in the write() path. Such a high
> delay smells like an issue in either the driver dealing with multiple
> mbufs per packet or nfs having a problem with it.
> 
I am not aware of anything within the NFS server that would care. The
code simply believes the m_len field.

  --> However, this is a good way to reduce server load. At 100 ops/sec
  I'd think you shouldn't have any server resource exhaustion issues.
  --> Problem solved ;-);-)

rick

> > I ended up reverting back to the old kernel (which I managed to lose
> > the sources for), and once I get my second server up, I will try to
> > do
> > some more testing to see if I can identify the source of the
> > problem.
> 
> --
> Andre
> 


Re: Limits on jumbo mbuf cluster allocation

2013-03-10 Thread Rick Macklem
Andre Oppermann wrote:
> On 09.03.2013 01:47, Rick Macklem wrote:
> > Garrett Wollman wrote:
> >> < >>  said:
> >>
> >>> [stuff I wrote deleted]
> >>> You have an amd64 kernel running HEAD or 9.x?
> >>
> >> Yes, these are 9.1 with some patches to reduce mutex contention on
> >> the
> >> NFS server's replay "cache".
> >>
> > The cached replies are copies of the mbuf list done via m_copym().
> > As such, the clusters in these replies won't be free'd (ref cnt ->
> > 0)
> > until the cache is trimmed (nfsrv_trimcache() gets called after the
> > TCP layer has received an ACK for receipt of the reply from the
> > client).
> 
> If these are not received mbufs but locally generated with m_getm2()
> or so they won't be jumbo mbufs > PAGE_SIZE.
> 
Yes, you are correct. Since the DRC caches replies, they shouldn't
have jumbo clusters in them. (For his case of 60,000 cached entries,
there could be 100,000 or more regular clusters held. I'd think that
could make finding the space for the 3 page jumbo clusters
harder, wouldn't it?)

rick

> > If reducing the size to 4K doesn't fix the problem, you might want
> > to
> > consider shrinking the tunable vfs.nfsd.tcphighwater and suffering
> > the increased CPU overhead (and some increased mutex contention) of
> > calling nfsrv_trimcache() more frequently.
> > (I'm assuming that you are using drc2.patch + drc3.patch. If you are
> >   using one of ivoras@'s variants of the patch, I'm not sure if the
> >   tunable is called the same thing, although it should have
> >   basically
> >   the same effect.)
> >
> > Good luck with it and thanks for running on the "bleeding edge" so
> > these issues get identified, rick
> 
> --
> Andre
> 


Re: Limits on jumbo mbuf cluster allocation

2013-03-11 Thread Rick Macklem
Andre Oppermann wrote:
> On 11.03.2013 17:05, Garrett Wollman wrote:
> > In article <513db550.5010...@freebsd.org>, an...@freebsd.org writes:
> >
> >> Garrett's problem is receive side specific and NFS can't do much
> >> about it.
> >> Unless, of course, NFS is holding on to received mbufs for a longer
> >> time.
> >
> > Well, I have two problems: one is running out of mbufs (caused, we
> > think, by ixgbe requiring 9k clusters when it doesn't actually need
> > them), and one is livelock. Allowing potentially hundreds of clients
> > to queue 2 MB of requests before TCP pushes back on them helps to
> > sustain the livelock once it gets started, and of course those
> > packets
> > will be of the 9k jumbo variety, which makes the first problem worse
> > as well.
> 
> I think that TCP, or rather the send socket buffer, currently doesn't
> push back at all but simply accepts everything that gets thrown at it.
> This obviously is a problem and the NFS server seems to depend
> somewhat
> on that by requiring atomicity on a RPC send. I have to trace the mbuf
> path through NFS to the socket to be sure. The code is slightly opaque
> though.
> 
Yes, I think you are correct that when NFS sends RPC messages over TCP,
they just get queued via sbappendstream(). For some reason I thought the
krpc used sosend_generic(), whereas I just looked and it just uses sosend(),
which does nothing except call tcp_user_send() { for TCP sockets, of course }.

I tend to agree with Garrett that this is ok for the NFS server, since once
it has done the work of generating a reply, why should the nfsd thread get
stuck trying to queue the reply for the client. For the NFS client, it isn't
quite so obvious, but after queuing a request to be sent, it will sit waiting
for the reply, so it will "see" the NFS server's slow response. (Anyhow,
Garrett couldn't care less about the FreeBSD NFS client, since he isn't using
it.;-)

rick

> --
> Andre
> 


Re: Limits on jumbo mbuf cluster allocation

2013-03-11 Thread Rick Macklem
Garrett Wollman wrote:
> In article <513db550.5010...@freebsd.org>, an...@freebsd.org writes:
> 
> >Garrett's problem is receive side specific and NFS can't do much
> >about it.
> >Unless, of course, NFS is holding on to received mbufs for a longer
> >time.
The NFS server only holds onto receive mbufs until it performs the RPC
requested. Of course, if the server hits its load limit, there will
then be a backlog of RPC requests --> the received mbufs for these
requests will be held for a longer time.

To be honest, I'd consider seeing a lot of non-empty receive queues
for TCP connections to the NFS server to be an indication that it is
near/at its load limit. (Sure, if you do netstat a lot, you will occasionally
see a non-empty queue here or there, but I would not expect to see a lot
of them non-empty a lot of the time.) If that is the case, then the
question becomes "what is the bottleneck?". Below I suggest getting rid
of the DRC in case it is the bottleneck for your server.

> 
> Well, I have two problems: one is running out of mbufs (caused, we
> think, by ixgbe requiring 9k clusters when it doesn't actually need
> them), and one is livelock. Allowing potentially hundreds of clients
> to queue 2 MB of requests before TCP pushes back on them helps to
> sustain the livelock once it gets started, and of course those packets
> will be of the 9k jumbo variety, which makes the first problem worse
> as well.
> 
The problem for the receive side is "how small should you make it?".
Suppose we have the following situation:
- only one client is active and it is flushing writes for a large file
  written into that client's buffer cache.
  --> If you set the receive size so that it is just big enough for one
  write, then the client will end up doing:
  - send one write, wait a long while for the NFS_OK reply
  - send the next write, wait a long while for the NFS_OK reply
  and so on
  --> the write back will take a long time, even though no other client
  is generating load on the server.
  --> the user for this client won't be happy

If you make the receive side large enough to handle several Write requests,
then the above works much faster, however...
- the receive size is now large enough to accept many many other RPC requests
  (a Write request is 64Kbytes+, whereas Read requests are typically
   less than 100 bytes)

Even if you set the receive size to the minimum that will handle one Write
request, that will allow the client to issue something like 650 Read requests.

Since NFS clients wait for replies to the RPC requests they send, they will
only queue so many requests before sending no more of them until they receive
some replies. This does delay the "feedback" somewhat, but I'd argue that 
buffering of
requests in the server's receive queue helps when clients generate bursts of
requests on a server that is well below its load limit.

Now, I'm not sure I understand what you mean by "livelock"?
A - Do you mean that the server becomes unresponsive and is generating almost
no RPC replies, with all the clients are reporting
"NFS server not responding"?
or
B - Do you mean that the server keeps responding to RPCs at a steady rate,
but that rate is slower than what the clients (and their users) would
like to see?
If it is B, I'd just consider that as hitting the server's load limit.

For either A or B, I'd suggest that you disable the DRC for TCP connections
(email if you need a patch for that), which will have a couple of effects:
1 - It will avoid the DRC from defining the server's load limit. (If the
DRC is the server's bottleneck, this will increase the server's load
limit to whatever else is the next bottleneck.)
2 - If the mbuf clusters held by the DRC are somehow contributing to the
mbuf cluster allocation problem for the receive side of the network
interface, this would alleviate that. (I'm not saying it fixes the
problem, but might allow the server to avoid it until the driver
guys come up with a good solution for it.)

rick

> -GAWollman
> 


Re: Limits on jumbo mbuf cluster allocation

2013-03-12 Thread Rick Macklem
Garrett Wollman wrote:
> <  said:
> 
> > To be honest, I'd consider seeing a lot of non-empty receive queues
> > for TCP connections to the NFS server to be an indication that it is
> > near/at its load limit. (Sure, if you do netstat a lot, you will
> > occasionally
> > see a non-empty queue here or there, but I would not expect to see a
> > lot
> > of them non-empty a lot of the time.) If that is the case, then the
> > question becomes "what is the bottleneck?". Below I suggest getting
> > rid
> > of the DRC in case it is the bottleneck for your server.
> 
> The problem is not the DRC in "normal" operation, but the DRC when it
> gets into the livelocked state. I think we've talked about a number
> of solutions to the livelock problem, but I haven't managed to
> implement or test these ideas yet. I have a duplicate server up now,
> so I hope to do some testing this week.
> 
> In normal operation, the server is mostly idle, and the nfsd threads
> that aren't themselves idle are sleeping deep in ZFS waiting for
> something to happen on disk. When the arrival rate exceeds the rate
> at which requests are cleared from the DRC, *all* of the nfsd threads
> will spin, either waiting for the DRC mutex or walking the DRC finding
> that there is nothing that can be released yet. *That* is the
> livelock condition -- the spinning that takes over all nfsd threads is
> what causes the receive buffers to build up, and the large queues then
> maintain the livelocked condition -- and that is why it clears
> *immediately* when the DRC size is increased. (It's possible to
> reproduce this condition on a loaded server by simply reducing the
> tcphighwater to less than the current size.) Unfortunately, I'm at
> the NFSRC_FLOODSIZE limit right now (64k), so there is no room for
> further increases until I recompile the kernel. It's probably a bug
> that the sysctl definition in drc3.patch doesn't check the new value
> against this limit.
> 
> Note that I'm currently running 64 nfsd threads on a 12-core
> (24-thread) system. In the livelocked condition, as you would expect,
> the system goes to 100% CPU utilization and the load average peaks out
> at 64, while goodput goes to nearly nil.
> 
Ok, I think I finally understand what you are referring to by your livelock.
Basically, you are at the tcphighwater mark and the nfsd threads don't
succeed in freeing up many cache entries so each nfsd thread tries to
trim the cache for each RPC and that slows the server right down.

I suspect it is the cached entries from dismounted clients that are
filling up the cache (you did mention clients using amd at some point
in the discussion, which implies frequent mounts/dismounts).
I'm guessing that the tcp cache timeout needs to be made a lot smaller
for your case.

> > For either A or B, I'd suggest that you disable the DRC for TCP
> > connections
> > (email if you need a patch for that), which will have a couple of
> > effects:
> 
> I would like to see your patch, since it's more likely to be correct
> than one I might dream up.
> 
> The alternative solution is twofold: first, nfsrv_trimcache() needs to
> do something to ensure forward progress, even when that means dropping
> something that hasn't timed out yet, and second, the server code needs
> to ensure that nfsrv_trimcache() is only executing on one thread at a
> time. An easy way to do the first part would be to maintain an LRU
> queue for TCP in addition to the UDP LRU, and just blow away the first
> N (>NCPU) entries on the queue if, after checking all the TCP replies,
> the DRC is still larger than the limit. The second part is just an
> atomic_cmpset_int().
> 
I've attached a patch that has assorted changes. I didn't use an LRU list,
since that results in a single mutex to contend on, but I added a second
pass to the nfsrc_trimcache() function that frees old entries. (Approximate
LRU, using a histogram of timeout values to select a timeout value that
frees enough of the oldest ones.)
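
To illustrate the idea (a sketch with made-up names and bucket granularity,
not the code in the attached patch): bucket the cached entries by age, then
pick the smallest cutoff age that would free enough of them.

  #define TRIM_NBUCKETS   60      /* assumption: one bucket per unit of age */

  /*
   * hist[i] holds the number of cached entries whose age falls into the
   * i-th oldest bucket. Walk from oldest to newest, accumulating counts,
   * and return the first bucket index at which at least "need" entries
   * would be freed (TRIM_NBUCKETS means "free them all").
   */
  static int
  trim_pick_cutoff(const int hist[TRIM_NBUCKETS], int need)
  {
          int i, freed;

          freed = 0;
          for (i = 0; i < TRIM_NBUCKETS; i++) {
                  freed += hist[i];
                  if (freed >= need)
                          break;
          }
          return (i);
  }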

Basically, this patch:
- allows setting of the tcp timeout via vfs.nfsd.tcpcachetimeo
  (I'd suggest you go down to a few minutes instead of 12hrs)
- allows TCP caching to be disabled by setting vfs.nfsd.cachetcp=0
- does the above 2 things you describe to try and avoid the livelock,
  although not quite using an lru list
- increases the hash table size to 500 (still a compile time setting)
  (feel free to make it even bigger)
- sets nfsrc_floodlevel to at least nfsrc_tcphighwater, so you can
  grow vfs.nfsd.tcphighwater as big as you dare

The patch includes a lot of drc2.patch and drc3.patch, so don't try
and apply it to a patched kernel. Hopefully it will apply cleanly to
vanilla sources.

The patch has been minimally tested.

If you'd rather not apply the patch, you can change NFSRVCACHE_TCPTIMEOUT
and set the variable nfsrc_tcpidempotent to 0 to get a couple of
the changes. (You'll have to recompile the kernel for these changes to
take effect.)

Good luck with it, rick

> -GAWollman

Re: Limits on jumbo mbuf cluster allocation

2013-03-13 Thread Rick Macklem
Garrett Wollman wrote:
> <  said:
> 
> > Basically, this patch:
> > - allows setting of the tcp timeout via vfs.nfsd.tcpcachetimeo
> >   (I'd suggest you go down to a few minutes instead of 12hrs)
> > - allows TCP caching to be disabled by setting vfs.nfsd.cachetcp=0
> > - does the above 2 things you describe to try and avoid the
> > livelock,
> >   although not quite using an lru list
> > - increases the hash table size to 500 (still a compile time
> > setting)
> >   (feel free to make it even bigger)
> > - sets nfsrc_floodlevel to at least nfsrc_tcphighwater, so you can
> >   grow vfs.nfsd.tcphighwater as big as you dare
> 
> Thanks, this looks very good. One quibble with the last bit: I'd do
> that in a sysctl() handler rather than checking it every time through.
> If somebody uses a debugger rather than sysctl to change tcphighwater,
> they deserve what's coming to them. Also, I might suggest adding a
> counter for how many times we had to go through the "try harder"
> phase, so that the sysadmin has some indication that the defaults need
> adjustment.
> 
I agree w.r.t. both comments. I did the one line setting of nfsrc_floodlevel,
just because I was getting lazy while doing the patch.

And if/when a patch like this (I think if this works, it can easily be
added to ivoras@'s patch) goes into head, some way for a sysadmin to
monitor how well it's working would be good. A count of the "try harder"
attempts seems like a good candidate.

Good luck with testing of it, rick

> I will test this out later this week and see how it performs. I have
> a user who has been able to reproducibly clobber servers before, so if
> he has time and cycles available it should be pretty easy to tell
> whether it's working or not.
> 
> -GAWollman


Re: Limits on jumbo mbuf cluster allocation

2013-03-19 Thread Rick Macklem
Garrett Wollman wrote:
> <  said:
> 
> > I've attached a patch that has assorted changes.
> 
> So I've done some preliminary testing on a slightly modified form of
> this patch, and it appears to have no major issues. However, I'm
> still waiting for my user with 500 VMs to have enough free to be able
> to run some real stress tests for me.
> 
> I was able to get about 2.5 Gbit/s throughput for a single streaming
> client over local 10G interfaces with jumbo frames (through a single
> switch and with LACP on both sides -- how well does lagg(4) interact
> with TSO and checksum offload?) This is a little bit disappointing
> (considering that the filesystem can do 14 Gbit/s locally) but still
> pretty decent for one single-threaded client. This obviously does not
> implicate the DRC changes at all, but does suggest that there is room
> for more performance improvement. (In previous tests last year, I
> was able to get a sustained 8 Gbit/s when using multiple clients.) I
> also found that one of our 10G switches is reordering TCP segments in
> a way that causes poor performance.
> 
If the server for this test isn't doing anything else yet, you could
try a test run with a single nfsd thread and see if that improves
performance.

ken@ emailed yesterday mentioning that out-of-order reads were resulting
in poor performance related to ZFS and that a single nfsd thread improved
that for his test.

Although a single nfsd thread isn't practical, it suggests that the nfsd
thread affinity code that I had forgotten about and has never been ported
to the new server, might be needed for this. (I'm not sure how to do the
affinity stuff for NFSv4, but it should at least be easy to port the code
so that it works for NFSv3 mounts.)

rick
ps: For a couple of years I had assumed that Isilon would be doing this,
but they are no longer working on the FreeBSD NFS server, so the
affinity stuff slipped through the cracks.

> I'll hopefully have some proper testing results later in the week.
> 
> -GAWollman


Re: Limits on jumbo mbuf cluster allocation

2013-03-19 Thread Rick Macklem
I wrote:
> Garrett Wollman wrote:
> > < >  said:
> >
> > > I've attached a patch that has assorted changes.
> >
> > So I've done some preliminary testing on a slightly modified form of
> > this patch, and it appears to have no major issues. However, I'm
> > still waiting for my user with 500 VMs to have enough free to be
> > able
> > to run some real stress tests for me.
> >
> > I was able to get about 2.5 Gbit/s throughput for a single streaming
> > client over local 10G interfaces with jumbo frames (through a single
> > switch and with LACP on both sides -- how well does lagg(4) interact
> > with TSO and checksum offload?) This is a little bit disappointing
> > (considering that the filesystem can do 14 Gbit/s locally) but still
> > pretty decent for one single-threaded client. This obviously does
> > not
> > implicate the DRC changes at all, but does suggest that there is
> > room
> > for more performance improvement. (In previous tests last year, I
> > was able to get a sustained 8 Gbit/s when using multiple clients.) I
> > also found that one of our 10G switches is reordering TCP segments
> > in
> > a way that causes poor performance.
> >
> If the server for this test isn't doing anything else yet, you could
> try a test run with a single nfsd thread and see if that improves
> performance.
> 
> ken@ emailed yesterday mentioning that out of order reads was
> resulting
> in poor performance related to ZFS and that a single nfsd thread
> improved
> that for his test.
> 
> Although a single nfsd thread isn't practical, it suggests that the
> nfsd
> thread affinity code that I had forgotten about and has never been
> ported
> to the new server, might be needed for this. (I'm not sure how to do
> the
> affinity stuff for NFSv4, but it should at least be easy to port the
> code
> so that it works for NFSv3 mounts.)
> 
Oh, and don't hesitate to play with the rsize and readahead options on
the client mount. It is not obvious what is an optimal setting for a
given LAN/server config. (I think the Linux client has a readahead option?)

rick

> rick
> ps: For a couple of years I had assumed that Isilon would be doing
> this,
> but they are no longer working on the FreeBSD NFS server, so the
> affinity stuff slipped through the cracks.
> 
> > I'll hopefully have some proper testing results later in the week.
> >
> > -GAWollman
> > ___
> > freebsd-net@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to
> > "freebsd-net-unsubscr...@freebsd.org"
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


panic in tcp_do_segment()

2013-04-04 Thread Rick Macklem
Hi,

When pho@ was doing some NFS testing, he got the
following crash, which I can't figure out. (As far
as I can see, INP_WLOCK() is always held when
tp->t_state = TCPS_CLOSED and it is held from before
the test for TCPS_CLOSED in tcp_input() up until
the tcp_do_segment() call. As such, I don't see how
tp->t_state can be TCPS_CLOSED, but that seems to
be what causes the panic?)

The "umount -f" will result in:
  soshutdown(so, SHUT_WR);
  soclose(so);
being done by the krpc on the socket.

Anyone have any ideas on this?
pho@ wrote:
> I continued running the NFS tests and got this "panic: tcp_do_segment:
> TCPS_LISTEN". It's the second time I get this panic with the same test
> scenario, so it seems to be reproducible. The scenario is "umount -f"
> of a mount point that is very active.
>
> http://people.freebsd.org/~pho/stress/log/kostik555.txt

Thanks in advance for any help, rick
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: panic in tcp_do_segment()

2013-04-07 Thread Rick Macklem
Juan Mojica wrote:
> Agree with Matt.
> 
> Whenever there is an UNLOCK/LOCK like is present in soclose(), there
> is a window to allow something through.  Unsetting SO_ACCEPTCONN was
> put in place because the LOCK/UNLOCK in soclose let a new socket to be
> added to the so_incomp list causing a different ASSERT to be hit - and
> memory to be leaked.
> 
> As far as I can tell, because of the upcall performed in soabort(), we
> can't just hold the ACCEPT lock all the way through the close.
> 
> 
> The simplest thing to do in this case seemed to be to just drop the
> TCP segment - the connection is being closed anyway.  Or like Matt
> said, having someone look at the lock logic and see if there is
> something there that can be exploited to prevent this would also help.
> 
> 
> -Juan
> 
> 
> 
> 
> On Fri, Apr 5, 2013 at 7:09 AM, Matt Miller < m...@matthewjmiller.net
> > wrote:
> 
> 
> 
> 
> Hey Rick,
> 
> I believe Juan and I have root caused this crash recently.  The
> t_state = 0x1, TCPS_LISTEN, in the link provided at the time of the
> assertion.
> 
> In tcp_input(), if we're in TCPS_LISTEN, SO_ACCEPTCONN should be set
> on the socket and we should never enter tcp_do_segment() for this
> state.  I think if you look in your corefile, you'll see the socket
> *doesn't* have this flag set in your case.
> 
Thanks guys. I had missed the *tp near the end and mistakenly was
thinking it was the client socket. This sounds like a good explanation
to me.

Hopefully, one of the tcp stack guys will pick this up and commit
a patch, rick.

> 
> 1043 /*
> 1044  * When the socket is accepting connections (the INPCB is
> in LISTEN
> 1045  * state) we look into the SYN cache if this is a new
> connection
> 1046  * attempt or the completion of a previous one.  Because
> listen
> 1047  * sockets are never in TCPS_ESTABLISHED, the V_tcbinfo
> lock will be
> 1048  * held in this case.
> 1049  */
> 1050 if (so->so_options & SO_ACCEPTCONN) {
> 1051 struct in_conninfo inc;
> 1052
> 1053 KASSERT(tp->t_state == TCPS_LISTEN, ("%s: so
> accepting but "
> 1054 "tp not listening", __func__));
> ...
> 1356 syncache_add(&inc, &to, th, inp, &so, m, NULL,
> NULL);
> 1357 /*
> 1358  * Entry added to syncache and mbuf consumed.
> 1359  * Everything already unlocked by syncache_add().
> 1360  */
> 1361 INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
> 1362 return;
> 1363 }
> ...
> 1384 /*
> 1385  * Segment belongs to a connection in SYN_SENT,
> ESTABLISHED or later
> 1386  * state.  tcp_do_segment() always consumes the mbuf
> chain, unlocks
> 1387  * the inpcb, and unlocks pcbinfo.
> 1388  */
> 1389 tcp_do_segment(m, th, so, tp, drop_hdrlen, tlen, iptos,
> ti_locked);
> 
> 
> I think this has to do with this patch in soclose() where
> SO_ACCEPTCONN is being turned off in soclose().  I suspect if you look
> at the other threads in your corefile, you'll see one at this point in
> soclose() operating on the same socket as the one in the
> tcp_do_segment() thread.
> 
> 
> http://svnweb.freebsd.org/base?view=revision&revision=243627
> 
>  817 /*
>  818  * Prevent new additions to the accept queues due
>  819  * to ACCEPT_LOCK races while we are draining
> them.
>  820  */
>  821 so->so_options &= ~SO_ACCEPTCONN;
>  822 while ((sp = TAILQ_FIRST(&so->so_incomp)) !=
> NULL) {
>  823 TAILQ_REMOVE(&so->so_incomp, sp,
> so_list);
>  824 so->so_incqlen--;
>  825 sp->so_qstate &= ~SQ_INCOMP;
>  826 sp->so_head = NULL;
>  827 ACCEPT_UNLOCK();
>  828 soabort(sp);
>  829 ACCEPT_LOCK();
>  830 }
> 
> 
> Juan had evaluated this code path and it seemed safe to just drop the
> packet in this case:
> 
> 
> +     /*
> +      * In closing down the socket, the SO_ACCEPTCONN flag is removed
> to
> +      * prevent new connections from being established.  This means
> that
> +      * any frames in that were in the midst of being processed could
> +      * make it here.  Need to just drop the frame.
> +      */
> +     if (TCPS_LISTEN == tp->t_state) {

Re: TSO and FreeBSD vs Linux

2013-09-04 Thread Rick Macklem
David Wolfskill wrote:
> On Wed, Aug 21, 2013 at 07:12:38PM +0200, Andre Oppermann wrote:
> > On 13.08.2013 19:29, Julian Elischer wrote:
> > > I have been tracking down a performance embarrassment on AMAZON
> > > EC2 and have found it I think.
> > > Our OS cousins over at Linux land have implemented some
> > > interesting behaviour when TSO is in use.
> > 
> > There used to be a different problem with EC2 and FreeBSD TSO.  The
> > Xen hypervisor
> > doesn't like large 64K TSO bursts we generate, the drivers drops
> > the whole TSO chain,
> > TCP gets upset and turns off TSO alltogether leaving the connection
> > going at one
> > packet a time as in the old days.
> > ...
> 
> My apologies for jumping in so late -- I'm not subscribed to -net@.
> 
> At work, I received a new desktop machine a few months ago; here's a
> recent history of what it has been running:
> 
> FreeBSD 9.2-PRERELEASE #4  r254801M/254827:902501: Sun Aug 25
> 05:15:29 PDT 2013 root@dwolf-fbsd:/usr/obj/usr/src/sys/DWOLF
>  amd64
> FreeBSD 9.2-PRERELEASE #5  r255066M/255091:902503: Sat Aug 31
> 11:58:53 PDT 2013 root@dwolf-fbsd:/usr/obj/usr/src/sys/DWOLF
>  amd64
> FreeBSD 9.2-PRERELEASE #5  r255104M/255115:902503: Sun Sep  1
> 05:02:12 PDT 2013 root@dwolf-fbsd:/usr/obj/usr/src/sys/DWOLF
>  amd64
> 
> Now, I like to have a "private playground" for doing things with
> machines, so I make use of both em(4) NICs on the machine: em0
> connects
> to the rest of the work network; em1 is connected to a switch I
> brought
> in from home, and to which I connect "other things" (such as my
> laptop).
> And because I'm fairly comfortable with them, I use IPFW & natd.  For
> some folks here, none of that should come as a surprise. :-})
> 
> For reference, the em(4) devices in question are:
> 
> em0@pci0:0:25:0:class=0x02 card=0x060d15d9
> chip=0x10ef8086 rev=0x06 hdr=0x00
> vendor = 'Intel Corporation'
> device = '82578DM Gigabit Network Connection'
> 
> and
> 
> em1@pci0:3:0:0: class=0x02 card=0x060d15d9 chip=0x10d38086
> rev=0x00 hdr=0x00
> vendor = 'Intel Corporation'
> device = '82574L Gigabit Network Connection'
> 
> 
> 
> I noticed that when I tried to write files to NFS, I could write
> small
> files OK, but larger ones seemed to ... hang.
> 
> Note: We don't use jumbo frames.  (Work IT is convinced that they
> don't help.  I'm trying to better-understand their reasoning.)
> 
> Further poking around showed that (under the above conditions):
> * natd CPU% was climbing as more of the file was copied, up to 2^21
>   bytes.  (At that point, nothing further was saved on NFS.)
> * dhcpd CPU% was also climbing.  I tried killing that, but doing so
>   didn't affect the other results.  (Killing natd made connectivity
>   cease, given the IPFW rules in effect.)
> * Performing a tcpdump while trying to copy a file of length
> 117709618
>   showed lots of TCP retransmissions.  In fact, I'd hazard that every
>   TCP
>   packet was getting retransmitted.
> * "ifconfig -v em0" showed flags TSO4 & VLAN_HWTSO turned on.
> * "sysctl net.inet.tcp.tso" showed "1" -- enabled.
> 
> As soon as I issued "sudo net.inet.tcp.tso=0" ... the copy worked
> without
> a hitch or a whine.  And I was able to copy all 117709618 bytes, not
> just
> 2097152 (2^21).
> 
> Is the above expected?  It came rather as a surprise to me.
> 
Not surprising to me, I'm afraid. When there are serious NFS problems
like this, they are often caused by a network fabric issue, and broken
TSO is at the top of the list of likely causes.

rick

> Peace,
> david
> --
> David H. Wolfskillda...@catwhisker.org
> Taliban: Evil cowards with guns afraid of truth from a 14-year old
> girl.
> 
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: nfsv4 fails with kerberos

2013-09-07 Thread Rick Macklem
Martin Laabs wrote:
> Hi,
> 
> I set up a nfsv4 server with kerberos but when starting the nfs
> server on
> the arm (RBI-B) board I get the following error message and the first
> (managing part) of the nfs exits:
> 
> "nfsd: can't register svc name"
> 
> This error message is produced by the following code in
> /usr/src/sys/fs/nfsserver/nfs_nfsdkrpc.c:
> 
> 
> ==:<===
> /* An empty string implies AUTH_SYS only. */
> if (principal[0] != '\0') {
>  ret2 = rpc_gss_set_svc_name_call(principal,
>"kerberosv5", GSS_C_INDEFINITE, NFS_PROG, NFS_VER2);
>  ret3 = rpc_gss_set_svc_name_call(principal,
> "kerberosv5", GSS_C_INDEFINITE, NFS_PROG, NFS_VER3);
>  ret4 = rpc_gss_set_svc_name_call(principal,
> "kerberosv5", GSS_C_INDEFINITE, NFS_PROG, NFS_VER4);
> 
> if (!ret2 || !ret3 || !ret4)
>   printf("nfsd: can't register svc name\n");
> ==:<===
> 
> So something went wrong with the principals. Is there a way to get
> more
> information or more verbose debugging output from the nfs-server part
> of
> the kernel?
> 
The above message normally indicates that the gssd daemon isn't running.

Here's a few places you can get info:
man nfsv4, gssd
http://code.google.com/p/macnfsv4/wiki/FreeBSD8KerberizedNFSSetup
- This was done quite a while ago and I should go in and update it,
  but I think it is still mostly correct for the server side. (The client
  in head/10 now does have "host based initiator cred" support.)
  Feel free to update it. All you should need to do so is a Google
  login.

You need a service principal for "nfs", which means an entry for a
principal that looks like:
nfs/<host>.<your.domain>@<YOUR.REALM>
(Stuff in "<>" needs to be filled in with the answers for your machine.)
in /etc/krb5.keytab on the server.

rick

> Thank you,
>  Martin Laabs
> 
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: TSO help or hindrance ? (was Re: TSO and FreeBSD vs Linux)

2013-09-10 Thread Rick Macklem
Mike Tancsa wrote:
> On 9/10/2013 6:42 PM, Barney Cordoba wrote:
> > NFS has been broken since Day 1, so lets not come to conclusions
> > about
> > anything
> > as it relates to NFS.
> 
> iSCSI is NFS ?
> 
It would be really nice if you could try trasz@'s new iSCSI stack and
see how well it works. (I, for one, am hoping it makes it into 10.0,
but it may be too late.)

rick

>   ---Mike
> 
> > 
> > BC
> > 
> > --------
> > *From:* Mike Tancsa 
> > *To:* Rick Macklem 
> > *Cc:* FreeBSD Net ; David Wolfskill
> > 
> > *Sent:* Wednesday, September 4, 2013 11:26 AM
> > *Subject:* TSO help or hindrance ? (was Re: TSO and FreeBSD vs
> > Linux)
> > 
> > On 9/4/2013 8:50 AM, Rick Macklem wrote:
> >> David Wolfskill wrote:
> >>>
> >>>
> >>> I noticed that when I tried to write files to NFS, I could write
> >>> small
> >>> files OK, but larger ones seemed to ... hang.
> >>> * "ifconfig -v em0" showed flags TSO4 & VLAN_HWTSO turned on.
> >>> * "sysctl net.inet.tcp.tso" showed "1" -- enabled.
> >>>
> >>> As soon as I issued "sudo net.inet.tcp.tso=0" ... the copy worked
> >>> without
> >>> a hitch or a whine.  And I was able to copy all 117709618 bytes,
> >>> not
> >>> just
> >>> 2097152 (2^21).
> >>>
> >>> Is the above expected?  It came rather as a surprise to me.
> >>>
> >> Not surprising to me, I'm afraid. When there are serious NFS
> >> problems
> >> like this, it is often caused by a network fabric issue and broken
> >> TSO is at the top of the list w.r.t. cause.
> > 
> > 
> > I was just experimenting a bit with iSCSI via FreeNAS and was a
> > little
> > disappointed at the speeds I was getting. So, I tried disabling tso
> > on
> > both boxes and it did seem to speed things up a bit.  Data and
> > testing
> > methods attached in a txt file.
> > 
> > I did 3 cases.
> > 
> > Just boot up FreeNAS and the initiator without tweaks.  That had
> > the
> > worst performance.
> > disable tso on the nic as well as via sysctl on both boxes. That
> > had the
> > best performance.
> > re-enable tso on both boxes. That had better performance than the
> > first
> > case, but still not as good as totally disabling it.  I am guessing
> > something is not quite being re-enabled properly ? But its
> > different
> > than the other two cases ?!?
> > 
> > tgt is FreeNAS-9.1.1-RELEASE-x64 (a752d35) and initiator is r254328
> > 9.2
> > AMD64
> > 
> > The FreeNAS box has 16G of RAM, so the file is being served out of
> > cache
> > as gstat shows no activity when sending out the file
> > 
> > 
> > 
> > ---Mike
> > 
> > 
> > --
> > ---
> > Mike Tancsa, tel +1 519 651 3400
> > Sentex Communications, m...@sentex.net
> > Providing Internet services since 1994 www.sentex.net
> > Cambridge, Ontario Canada  http://www.tancsa.com/
> > 
> > ___
> > freebsd-net@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to
> > "freebsd-net-unsubscr...@freebsd.org
> > <mailto:freebsd-net-unsubscr...@freebsd.org>"
> > 
> 
> 
> --
> ---
> Mike Tancsa, tel +1 519 651 3400
> Sentex Communications, m...@sentex.net
> Providing Internet services since 1994 www.sentex.net
> Cambridge, Ontario Canada   http://www.tancsa.com/
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-13 Thread Rick Macklem
George Neville-Neil wrote:
> 
> On Aug 29, 2013, at 7:49 , Adrian Chadd  wrote:
> 
> > Hi,
> > 
> > There's a lot of good stuff to review here, thanks!
> > 
> > Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless
> > to keep
> > locking things like that on a per-packet basis. We should be able
> > to do
> > this in a cleaner way - we can defer RX into a CPU pinned taskqueue
> > and
> > convert the interrupt handler to a fast handler that just schedules
> > that
> > taskqueue. We can ignore the ithread entirely here.
> > 
> > What do you think?
> > 
> > Totally pie in the sky handwaving at this point:
> > 
> > * create an array of mbuf pointers for completed mbufs;
> > * populate the mbuf array;
> > * pass the array up to ether_demux().
> > 
> > For vlan handling, it may end up populating its own list of mbufs
> > to push
> > up to ether_demux(). So maybe we should extend the API to have a
> > bitmap of
> > packets to actually handle from the array, so we can pass up a
> > larger array
> > of mbufs, note which ones are for the destination and then the
> > upcall can
> > mark which frames its consumed.
> > 
> > I specifically wonder how much work/benefit we may see by doing:
> > 
> > * batching packets into lists so various steps can batch process
> > things
> > rather than run to completion;
> > * batching the processing of a list of frames under a single lock
> > instance
> > - eg, if the forwarding code could do the forwarding lookup for 'n'
> > packets
> > under a single lock, then pass that list of frames up to
> > inet_pfil_hook()
> > to do the work under one lock, etc, etc.
> > 
> > Here, the processing would look less like "grab lock and process to
> > completion" and more like "mark and sweep" - ie, we have a list of
> > frames
> > that we mark as needing processing and mark as having been
> > processed at
> > each layer, so we know where to next dispatch them.
> > 
> 
> One quick note here.  Every time you increase batching you may
> increase bandwidth
> but you will also increase per packet latency for the last packet in
> a batch.
> That is fine so long as we remember that and that this is a tuning
> knob
> to balance the two.
> 
And any time you increase latency, that will have a negative impact on
NFS performance. NFS RPCs are usually small messages (except Write requests
and Read replies) and the RTT for these (mostly small, bidirectional)
messages can have a significant impact on NFS perf.

rick

> > I still have some tool coding to do with PMC before I even think
> > about
> > tinkering with this as I'd like to measure stuff like per-packet
> > latency as
> > well as top-level processing overhead (ie,
> > CPU_CLK_UNHALTED.THREAD_P /
> > lagg0 TX bytes/pkts, RX bytes/pkts, NIC interrupts on that core,
> > etc.)
> > 
> 
> This would be very useful in identifying the actual hot spots, and
> would be helpful
> to anyone who can generate a decent stream of packets with, say, an
> IXIA.
> 
> Best,
> George
> 
> 
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-14 Thread Rick Macklem
Sam Fourman Jr. wrote:
> >
> 
> > And any time you increase latency, that will have a negative impact
> > on
> > NFS performance. NFS RPCs are usually small messages (except Write
> > requests
> > and Read replies) and the RTT for these (mostly small,
> > bidirectional)
> > messages can have a significant impact on NFS perf.
> >
> > rick
> >
> >
> this may be a bit off topic but not much... I have wondered with all
> of the
> new
> tcp algorithms
> http://freebsdfoundation.blogspot.com/2011/03/summary-of-five-new-tcp-congestion.html
> 
> what algorithm is best suited for NFS over gigabit Ethernet, say
> FreeBSD to
> FreeBSD.
> and further more would a NFS optimized tcp algorithm be useful?
> 
I have no idea what effect they might have. NFS traffic is quite different from
streaming or bulk data transfer. I think this might make a nice research
project for someone.

rick

> Sam Fourman Jr.
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-19 Thread Rick Macklem
Christopher Forgeron wrote:
> Hello,
> 
> 
> 
> I can report this problem as well on 10.0-RELEASE.
> 
> 
> 
> I think it's the same as kern/183390?
> 
> 
> 
> I have two physically identical machines, one running 9.2-STABLE, and
> one
> on 10.0-RELEASE.
> 
> 
> 
> My 10.0 machine used to be running 9.0-STABLE for over a year without
> any
> problems.
> 
> 
> 
> I'm not having the problems with 9.2-STABLE as far as I can tell, but
> it
> does seem to be a load-based issue more than anything. Since my 9.2
> system
> is in production, I'm unable to load it to see if the problem exists
> there.
> I have a ping_logger.py running on it now to see if it's experiencing
> problems briefly or not.
> 
> 
> 
> I am able to reproduce it fairly reliably within 15 min of a reboot
> by
> loading the server via NFS with iometer and some large NFS file
> copies at
> the same time. I seem to need to sustain ~2 Gbps for a few minutes.
> 
If you can easily do so, testing with the attached patch might shed
some light on the problem. It just adds a couple of diagnostic checks
before and after m_defrag() is called when bus_dmamap_load_mbuf_sg()
returns EFBIG.

If the "before" printf happens, it would suggest a problem with the
loop in tcp_output() that creates TSO segments.

If the "after" printf happens, it would suggest that m_defrag() somehow
doesn't create a list of 32 or fewer mbufs for the TSO segment.
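Roughly, the idea of the checks is something like the following (a sketch only,
not the attached patch; the names m_head, txr, segs and the exact placement in
the EFBIG path are assumptions about the driver, so treat it as illustrative):

    error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head, segs, &nsegs,
        BUS_DMA_NOWAIT);
    if (error == EFBIG) {
        struct mbuf *m;
        int actl, mbcnt;

        /* "before": the chain handed to the driver already exceeds 64K */
        actl = 0;
        for (m = m_head; m != NULL; m = m->m_next)
            actl += m->m_len;
        if (actl > IP_MAXPACKET)
            printf("before pklen=%d actl=%d\n", m_head->m_pkthdr.len, actl);

        m = m_defrag(m_head, M_NOWAIT);
        if (m == NULL)
            return (ENOBUFS);
        m_head = m;

        /* "after": m_defrag() still left more than 32 mbufs or > 64K of data */
        actl = mbcnt = 0;
        for (m = m_head; m != NULL; m = m->m_next) {
            mbcnt++;
            actl += m->m_len;
        }
        if (mbcnt > 32 || actl > IP_MAXPACKET)
            printf("after mbcnt=%d pklen=%d actl=%d\n",
                mbcnt, m_head->m_pkthdr.len, actl);
        /* ...then retry bus_dmamap_load_mbuf_sg() as the driver already does */
    }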

I don't have any ix hardware, so this patch is completely untested.

Just something maybe worth trying, rick

> 
> 
> It will happen with just ix0 (no lagg) or with lagg enabled across
> ix0 and
> ix1.
> 
> 
> 
> I've been load-testing new FreeBSD-10.0-RELEASE SAN's for production
> use
> here, so I'm quite willing to put time into this to help find out
> where
> it's coming from.  It took me a day to track down my iometer issues
> as
> being network related, and another day to isolate and write scripts
> to
> reproduce.
> 
> 
> 
> The symptom I notice is:
> 
> -  A running flood ping (ping -f 172.16.0.31) to the same
> hardware
> (running 9.2) will come back with "ping: sendto: File too large" when
> the
> problem occurs
> 
> -  Network connectivity is very spotty during these incidents
> 
> -  It can run with sporadic ping errors, or it can run a
> straight
> set of errors for minutes at a time
> 
> -  After a long run of ping errors, ESXi will show a
> disconnect
> from the hosted NFS stores on this machine.
> 
> -  I've yet to see it happen right after boot. Fastest is
> around 5
> min, normally it's within 15 min.
> 
> 
> 
> System Specs:
> 
> 
> 
> -  Dell PowerEdge M610x Blade
> 
> -  2 Xeon 6600  @ 2.40GHz (24 Cores total)
> 
> -  96 Gig RAM
> 
> -  35.3 TB ZFS Mirrored pool, lz4 compression on my test pool
> (ZFS
> pool is the latest)
> 
> -  Intel 520-DA2 10 Gb dual-port Blade Mezz. Cards
> 
> 
> 
> Currently this 10.0 testing machine is clean for all sysctl's other
> than
> hw.intr_storm_threshold=9900. I have the problem if that's set or
> not, so I
> leave it on.
> 
> 
> 
> ( I used to set manual nmbclusters, etc. as per the Intel Readme.doc,
> but I
> notice that the defaults on the new 10.0 system are larger. I did try
> using
> all of the old sysctl's from an older 9.0-STABLE, and still had the
> problem, but it did seem to take longer to occur? I haven't run
> enough
> tests to confirm that time observation is true. )
> 
> 
> 
> What logs / info can I provide to help?
> 
> 
> 
> I have written a small script called ping_logger.py that pings an IP,
> and
> checks to see if there is an error. On error it will execute and log:
> 
> -  netstat -m
> 
> -  sysctl hw.ix
> 
> -  sysctl dev.ix
> 
> 
> 
> then go back to pinging. It will also log those values on the startup
> of
> the script, and every 5 min (so you can see the progression on the
> system).
> I can add any number of things to the reporting, so I'm looking for
> suggestions.
> 
> 
> 
> This results in some large log files, but I can email a .gz directly
> to
> anyone who need them, or perhaps put it up on a website.
> 
> 
> 
> I will also make the ping_logger.py script available if anyone else
> wants
> it.
> 
> 
> 
> 
> 
> LASTLY:
> 
> 
> 
> The one thing I can see that is different in my 10.0 System and my
> 9.2 is:
> 
> 
> 
> 9.2's netstat -m:
> 
> 
> 
> 37965/16290/54255 mbufs in use (current/cache/total)
> 
> 4080/8360/12440/524288 mbuf clusters in use (current/cache/total/max)
> 
> 4080/4751 mbuf+clusters out of packet secondary zone in use
> (current/cache)
> 
> 0/452/452/262144 4k (page size) jumbo clusters in use
> (current/cache/total/max)
> 
> 32773/4129/36902/96000 9k jumbo clusters in use
> (current/cache/total/max)
> 
> 0/0/0/508538 16k jumbo clusters in use (current/cache/total/max)
> 
> 312608K/59761K/372369K bytes allocated to network
> (current/cache/total)
> 
> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> 
> 0/0

Re: 9.2 ixgbe tx queue hang

2014-03-20 Thread Rick Macklem
Christopher Forgeron wrote:
> 
> Output from the patch you gave me (I have screens of it.. let me know
> what you're hoping to see.
> 
> 
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Hmm. I think this means that the loop that generates TSO segments in
tcp_output() is broken, since I'm pretty sure that the maximum size
should be IP_MAXPACKET (65535).

Either that or some non-TCP socket is trying to send a packet that
exceeds IP_MAXPACKET for some reason.

Would it be possible to add a printf() for m->m_pkthdr.csum_flags
to the before case, in the "if" that generates the before printf?
I didn't think to put this in, but CSUM_TSO will be set if it
is a TSO segment, I think? My networking is very rusty.
(If how to add this isn't obvious, just email and I'll update
 the patch.)
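In case it isn't, something along these lines in the "before" case should do
(again just a sketch against the earlier diagnostic; the cast is only there
because csum_flags may be wider than an int):

    printf("before pklen=%d actl=%d csum=%u\n", m_head->m_pkthdr.len, actl,
        (unsigned int)m_head->m_pkthdr.csum_flags);

Printing it in decimal is fine; it just needs to be decoded against the
CSUM_* bits in sys/mbuf.h afterwards.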

Thanks for doing this, rick

> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: before pklen=65542 actl=65542
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 20 16:37:22 SAN0 kernel: before pklen=65542 actl=65542
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 20 16:37:22 SAN0 kernel: before pklen=65542 actl=65542
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 20 16:37:22 SAN0 kernel: before pklen=65542 actl=65542
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 20 16:37:22 SAN0 kernel: before pklen=65542 actl=65542
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:23 SAN0 kernel: before pklen=65538 actl=65538
> Mar 20 16:37:23 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:23 SAN0 kernel: before pklen=65542 actl=65542
> Mar 20 16:37:23 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 20 16:37:23 SAN0 kernel: before pklen=65542 actl=65542
> Mar 20 16:37:23 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 20 16:37:23 SAN0 kernel: before pklen=65542 actl=65542
> Mar 20 16:37:23 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65
> 
> 
> 
> 
> 
> On Wed, Mar 19, 2014 at 11:29 PM, Rick Macklem < rmack...@uoguelph.ca
> > wrote:
> 
> 
> 
> Christopher Forgeron wrote:
> > Hello,
> > 
> > 
> > 
> > I can report this problem as well on 10.0-RELEASE.
> > 
> > 
> > 
> > I think it's the same as kern/183390?
> > 
> > 
> > 
> > I have two physically identical machines, one running 9.2-STABLE,
> > and
> > one
> > on 10.0-RELEASE.
> > 
> > 
> > 
> > My 10.0 machine used to be running 9.0-STABLE for over a year
> > without
> > any
> > problems.
> > 
> > 
> > 
> > I'm not having the problems with 9.2-STABLE as far as I can tell,
> > but
> > it
> > does seem to be a load-based issue more than anything. Since my 9.2
> > system
> > is in production, I'm unable to load it to see if the problem
> > exists
> > there.
> > I have a ping_logger.py running on it now to see if it's
> > experiencing
> > problems briefly or not.
> > 
> > 
> > 
> > I am able to reproduce it fairly reliably within 15 min of a reboot
> > by
> > loading the server via NFS with iometer and some large NFS file
> > copies at
> > the same time. I seem to need to sustain ~2 Gbps for a few minutes.
> > 
> If you can easily do so, testing with the attached patch might shed
> some light on the problem. It just adds a couple of diagnostic checks
> before and after m_defrag

Re: 9.2 ixgbe tx queue hang

2014-03-20 Thread Rick Macklem
Christopher Forgeron wrote:
> 
> 
> 
> 
> 
> 
> On Thu, Mar 20, 2014 at 7:40 AM, Markus Gebert <
> markus.geb...@hostpoint.ch > wrote:
> 
> 
> 
> 
> 
> Possible. We still see this on nfsclients only, but I’m not convinced
> that nfs is the only trigger.
> 
> 
Since Christopher is getting a bunch of the "before" printf()s from
my patch, it indicates that a packet/TSO segment that is > 65535 bytes
in length is showing up at ixgbe_xmit(). I've asked him to add a printf()
for the m_pkthdr.csum_flags field to see if it is really a TSO segment.

If it is a TSO segment, that indicates to me that the code in tcp_output()
that should generate a TSO segment no greater than 65535 bytes in length is
busted.
And this would imply just about any app doing large sosend()s could cause
this, I think? (NFS read replies/write requests of 64K would be one of them.)

rick

> 
> 
> 
> Just to clarify, I'm experiencing this error with NFS, but also with
> iSCSI - I turned off my NFS server in rc.conf and rebooted, and I'm
> still able to create the error. This is not just a NFS issue on my
> machine.
> 
> 
> 
> I our case, when it happens, the problem persists for quite some time
> (minutes or hours) if we don’t interact (ifconfig or reboot).
> 
> 
> 
> The first few times that I ran into it, I had similar issues -
> Because I was keeping my system up and treating it like a temporary
> problem/issue. Worst case scenario resulted in reboots to reset the
> NIC. Then again, I find the ix's to be cranky if you ifconfig them
> too much.
> 
> Now, I'm trying to find a root cause, so as soon as I start seeing
> any errors, I abort and reboot the machine to test the next theory.
> 
> 
> Additionally, I'm often able to create the problem with just 1 VM
> running iometer on the SAN storage. When the problem occurs, that
> connection is broken temporarily, taking network load off the SAN -
> That may improve my chances of keeping this running.
> 
> 
> 
> 
> 
> > I am able to reproduce it fairly reliably within 15 min of a reboot
> > by
> > loading the server via NFS with iometer and some large NFS file
> > copies at
> > the same time. I seem to need to sustain ~2 Gbps for a few minutes.
> 
> That’s probably why we can’t reproduce it reliably here. Although
> having 10gig cards in our blade servers, the ones affected are
> connected to a 1gig switch.
> 
> 
> 
> 
> 
> It seems that it needs a lot of traffic. I have a 10 gig backbone
> between my SANs and my ESXi machines, so I can saturate quite
> quickly (just now I hit a record.. the error occurred within ~5 min
> of reboot and testing). In your case, I recommend firing up multiple
> VM's running iometer on different 1 gig connections and see if you
> can make it pop. I also often turn off ix1 to drive all traffic
> through ix0 - I've noticed it happens faster this way, but once
> again I'm not taking enough observations to make decent time
> predictions.
> 
> 
> 
> 
> 
> 
> Can you try this when the problem occurs?
> 
> for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2
> -c 2 -W 1 10.0.0.1 | grep sendto; done
> 
> It will tie ping to certain cpus to test the different tx queues of
> your ix interface. If the pings reliably fail only on some queues,
> then your problem is more likely to be the same as ours.
> 
> Also, if you have dtrace available:
> 
> kldload dtraceall
> dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / {
> stack(); }'
> 
> while you run pings over the interface affected. This will give you
> hints about where the EFBIG error comes from.
> 
> > […]
> 
> 
> Markus
> 
> 
> 
> 
> Will do. I'm not sure what shell the first script was written for,
> it's not working in csh, here's a re-write that does work in csh in
> case others are using the default shell:
> 
> #!/bin/csh
> foreach CPU (`seq 0 23`)
> echo "CPU$CPU";
> cpuset -l $CPU ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto;
> end
> 
> 
> Thanks for your input. I should have results to post to the list
> shortly.
> 
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: Network stack returning EFBIG?

2014-03-20 Thread Rick Macklem
Markus Gebert wrote:
> 
> On 20.03.2014, at 14:51, woll...@bimajority.org wrote:
> 
> > In article <21290.60558.750106.630...@hergotha.csail.mit.edu>, I
> > wrote:
> > 
> >> Since we put this server into production, random network system
> >> calls
> >> have started failing with [EFBIG] or maybe sometimes [EIO].  I've
> >> observed this with a simple ping, but various daemons also log the
> >> errors:
> >> Mar 20 09:22:04 nfs-prod-4 sshd[42487]: fatal: Write failed: File
> >> too
> >> large [preauth]
> >> Mar 20 09:23:44 nfs-prod-4 nrpe[42492]: Error: Could not complete
> >> SSL
> >> handshake. 5
> > 
> > I found at least one call stack where this happens and it does get
> > returned all the way to userspace:
> > 
> > 17  15547   _bus_dmamap_load_buffer:return
> >  kernel`_bus_dmamap_load_mbuf_sg+0x5f
> >  kernel`bus_dmamap_load_mbuf_sg+0x38
> >  kernel`ixgbe_xmit+0xcf
> >  kernel`ixgbe_mq_start_locked+0x94
> >  kernel`ixgbe_mq_start+0x12a
> >  if_lagg.ko`lagg_transmit+0xc4
> >  kernel`ether_output_frame+0x33
> >  kernel`ether_output+0x4fe
> >  kernel`ip_output+0xd74
> >  kernel`tcp_output+0xfea
> >  kernel`tcp_usr_send+0x325
> >  kernel`sosend_generic+0x3f6
> >  kernel`soo_write+0x5e
> >  kernel`dofilewrite+0x85
> >  kernel`kern_writev+0x6c
> >  kernel`sys_write+0x64
> >  kernel`amd64_syscall+0x5ea
> >  kernel`0x808443c7
> 
> This looks pretty similar to what we’ve seen when we got EFBIG:
> 
>  3  28502   _bus_dmamap_load_buffer:return
>   kernel`_bus_dmamap_load_mbuf_sg+0x5f
>   kernel`bus_dmamap_load_mbuf_sg+0x38
>   kernel`ixgbe_xmit+0xcf
>   kernel`ixgbe_mq_start_locked+0x94
>   kernel`ixgbe_mq_start+0x12a
>   kernel`ether_output_frame+0x33
>   kernel`ether_output+0x4fe
>   kernel`ip_output+0xd74
>   kernel`rip_output+0x229
>   kernel`sosend_generic+0x3f6
>   kernel`kern_sendit+0x1a3
>   kernel`sendit+0xdc
>   kernel`sys_sendto+0x4d
>   kernel`amd64_syscall+0x5ea
>   kernel`0x80d35667
> 
> In our case it looks like some of the ixgbe tx queues get stuck, and
> some don’t. You can test, wether your server shows the same symptoms
> with this command:
> 
> # for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i
> 0.5 -c 2 -W 1 10.0.0.1 | grep sendto; done
> 
> We also use 82599EB based ixgbe controllers on affected systems.
> 
> Also see these two threads on freebsd-net:
> 
> http://lists.freebsd.org/pipermail/freebsd-net/2014-February/037967.html
> http://lists.freebsd.org/pipermail/freebsd-net/2014-March/038061.html
> 
> I have started the second one, and there are some more details of
> what we were seeing in case you’re interested.
> 
> Then there is:
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=183390
> and:
> https://bugs.freenas.org/issues/4560
> 
Well, the "before" printf() from my patch is indicating a packet > 65535
and that will definitely result in an EFBIG. (There is no way that m_defrag()
can squeeze > 64K into 32 MCLBYTES mbufs.)

Note that the EFBIG will be returned by the call that dequeues this packet
and tries to transmit it (not necessarily the one that generated/queued the
packet). This was pointed out by Ryan in a previous discussion of this.

The code snippet from sys/netinet/tcp_output.c looks pretty straightforward:
	/*
 772	 * Limit a burst to t_tsomax minus IP,
 773	 * TCP and options length to keep ip->ip_len
 774	 * from overflowing or exceeding the maximum
 775	 * length allowed by the network interface.
 776	 */
 777	if (len > tp->t_tsomax - hdrlen) {
 778		len = tp->t_tsomax - hdrlen;
 779		sendalot = 1;
 780	}
If it is a TSO segment of > 65535, at a glance it would seem that this "if"
is busted. Just to see, you could try replacing line# 777-778 with
	if (len > IP_MAXPACKET - hdrlen) {
		len = IP_MAXPACKET - hdrlen;
which was what it was in 9.1. (Maybe t_tsomax isn't set correctly or somehow
screws up the calculation?)

rick

> 
> Markus
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: 9.2 ixgbe tx queue hang

2014-03-20 Thread Rick Macklem
Christopher Forgeron wrote:
> Yes, there is something broken in TSO for sure, as disabling it
> allows me
> to run without error. It is possible that the drop in performance is
> allowing me to stay under a critical threshold for the problem, but
> I'd
> feel happier testing to make sure.
> 
> I understand what you're asking for in the patch, I'll make the edits
> tomorrow and recompile a test kernel and see.
> 
I also suggested a small change (basically reverting it to the 9.1 code)
for tcp_output() in sys/netinet/tcp_output.c (around line# 777-778).
You might as well throw that in at the same time.

Thanks for all your work with this (and this applies to others that
have been working on this as well.)

rick

> Right now I'm running tests on the ixgbe that Jack sent. Even if his
> patch
> fixes the issue, I wonder if something else isn't broken in TSO, as
> the
> ixgbe code has had these lines for a long time, and it's only on this
> 10.0
> build that I have issues.
> 
> I'll be following up tomorrow with info on either outcome.
> 
> Thanks for your help.. your rusty networking is still better than
> mine. :-)
> 
> 
> On Thu, Mar 20, 2014 at 11:13 PM, Rick Macklem 
> wrote:
> 
> > Christopher Forgeron wrote:
> > >
> > > Output from the patch you gave me (I have screens of it.. let me
> > > know
> > > what you're hoping to see.
> > >
> > >
> > > Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538
> > > actl=65538
> > > Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
> > Hmm. I think this means that the loop that generates TSO segments
> > in
> > tcp_output() is broken, since I'm pretty sure that the maximum size
> > should be is IP_MAXPACKET (65535).
> >
> > Either that or some non-TCP socket is trying to send a packet that
> > exceeds IP_MAXPACKET for some reason.
> >
> > Would it be possible to add a printf() for m->m_pkthdr.csum_flags
> > to the before case, in the "if" that generates the before printf?
> > I didn't think to put this in, but CSUM_TSO will be set if it
> > is a TSO segment, I think? My networking is very rusty.
> > (If how to add this isn't obvious, just email and I'll update
> >  the patch.)
> >
> > Thanks for doing this, rick
> >
> >
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang (packets that exceed 65535 bytes in length)

2014-03-21 Thread Rick Macklem
Christopher Forgeron wrote:
> (Pardon me, for some reason my gmail is sending on my cut-n-pastes if
> I cr
> down too fast)
> 
> First set of logs:
> 
> Mar 21 11:07:00 SAN0 kernel: before pklen=65542 actl=65542 csum=4116
Ok, so this isn't a TSO segment then, unless I don't understand how
the csum flags are used, which is quite possible.
Assuming that you printed this out in decimal:
4116->0x1014
Looking in mbuf.h, 0x1014 is
CSUM_SCTP_VALID | CSUM_FRAGMENT | CSUM_UDP

alternately, if 4116 is hex, then it is:
CSUM_TCP_IPV6 | CSUM_IP_CHECKED | CSUM_FRAGMENT | CSUM_UDP

either way, it doesn't appear to be a TCP TSO?
(But you said that disabling TSO fixed the problem, so colour me
 confused by this.;-)

Sorry, but my rusty networking is confused by this, so maybe someone
else can explain it? (I don't think any packet handed to the net interface
should exceed 65535. Am I right?)

Anyhow, all I can say is that I think these mbuf chains should fail with EFBIG,
since they are too big. I have no idea where they come from and I don't
know why this would lead to exhaustion of the transmit descriptor entries,
which seems to be when things get really wedged.
(From what little I can see in the driver sources, these transmit descriptor
 entries should be released via interrupts, but I've just glanced at it.)

Sorry, but I think this will need someone conversant with the networking side
to figure out, rick

> Mar 21 11:07:00 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 21 11:07:00 SAN0 kernel: before pklen=65542 actl=65542 csum=4116
> Mar 21 11:07:00 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 21 11:07:00 SAN0 kernel: before pklen=65542 actl=65542 csum=4116
> Mar 21 11:07:00 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 21 11:07:00 SAN0 kernel: before pklen=65542 actl=65542 csum=4116
> Mar 21 11:07:00 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
> Mar 21 11:07:00 SAN0 kernel: before pklen=65542 actl=65542 csum=4116
> 
> Here's a few later on.
> 
> Mar 21 11:10:09 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
> Mar 21 11:10:09 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 21 11:10:09 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
> Mar 21 11:10:09 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 21 11:10:09 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
> Mar 21 11:10:09 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 21 11:10:09 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
> Mar 21 11:10:09 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> 
> Mar 21 11:23:00 SAN0 kernel: after mbcnt=33 pklen=65546 actl=65546
> Mar 21 11:23:01 SAN0 kernel: before pklen=65546 actl=65546 csum=4116
> Mar 21 11:23:01 SAN0 kernel: after mbcnt=33 pklen=65546 actl=65546
> Mar 21 11:23:03 SAN0 kernel: before pklen=65546 actl=65546 csum=4116
> Mar 21 11:23:03 SAN0 kernel: after mbcnt=33 pklen=65546 actl=65546
> Mar 21 11:23:04 SAN0 kernel: before pklen=65546 actl=65546 csum=4116
> Mar 21 11:23:04 SAN0 kernel: after mbcnt=33 pklen=65546 actl=65546
> 
> Mar 21 11:41:25 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
> Mar 21 11:41:25 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 21 11:41:25 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
> Mar 21 11:41:25 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 21 11:41:25 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
> Mar 21 11:41:25 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 21 11:41:25 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
> Mar 21 11:41:25 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 21 11:41:26 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
> Mar 21 11:41:26 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 21 11:41:26 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
> Mar 21 11:41:26 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> 
> To be clear, I changed tp->t_tsomax to IP_MAXPACKET at ~ 777 in
> sys/netinet/tcp_output.c like so:
> 
> if (len > IP_MAXPACKET - hdrlen) {
> len = IP_MAXPACKET - hdrlen;
> sendalot = 1;
> }
> 
> I notice there is more that is different between 9.1 and 10 for this
> file:
> http://fxr.watson.org/fxr/diff/netinet/tcp_output.c?v=FREEBSD10;diffval=FREEBSD91;diffvar=v
> 
> I'm going to attempt inserting a 9.1 tcp_output.c and see if that
> makes any
> difference.
> 
> Otherwise, I wait further ideas from the list.
> 
> Thanks.
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-21 Thread Rick Macklem
Christopher Forgeron wrote:
> 
> 
> 
> 
> 
> 
> Hello all,
> 
> I ran Jack's ixgbe MJUM9BYTES removal patch, and let iometer hammer
> away at the NFS store overnight - But the problem is still there.
> 
> 
> From what I read, I think the MJUM9BYTES removal is probably good
> cleanup (as long as it doesn't trade performance on a lightly memory
> loaded system for performance on a heavily memory loaded system). If
> I can stabilize my system, I may attempt those benchmarks.
> 
> 
> I think the fix will be obvious at boot for me - My 9.2 has a 'clean'
> netstat
> - Until I can boot and see a 'netstat -m' that looks similar to that,
> I'm going to have this problem.
> 
> 
> Markus: Do your systems show denied mbufs at boot like mine does?
> 
> 
> Turning off TSO works for me, but at a performance hit.
> 
> I'll compile Rick's patch (and extra debugging) this morning and let
> you know soon.
> 
> 
> 
> 
> 
> 
> On Thu, Mar 20, 2014 at 11:47 PM, Christopher Forgeron <
> csforge...@gmail.com > wrote:
> 
> 
> 
> 
> 
> 
> 
> 
> BTW - I think this will end up being a TSO issue, not the patch that
> Jack applied.
> 
> When I boot Jack's patch (MJUM9BYTES removal) this is what netstat -m
> shows:
> 
> 21489/2886/24375 mbufs in use (current/cache/total)
> 4080/626/4706/6127254 mbuf clusters in use (current/cache/total/max)
> 4080/587 mbuf+clusters out of packet secondary zone in use
> (current/cache)
> 16384/50/16434/3063627 4k (page size) jumbo clusters in use
> (current/cache/total/max)
> 0/0/0/907741 9k jumbo clusters in use (current/cache/total/max)
> 
> 0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
> 79068K/2173K/81241K bytes allocated to network (current/cache/total)
> 18831/545/4542 requests for mbufs denied
> (mbufs/clusters/mbuf+clusters)
> 
> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> 15626/0/0 requests for jumbo clusters denied (4k/9k/16k)
> 
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 
> Here is an un-patched boot:
> 
> 21550/7400/28950 mbufs in use (current/cache/total)
> 4080/3760/7840/6127254 mbuf clusters in use (current/cache/total/max)
> 4080/2769 mbuf+clusters out of packet secondary zone in use
> (current/cache)
> 0/42/42/3063627 4k (page size) jumbo clusters in use
> (current/cache/total/max)
> 16439/129/16568/907741 9k jumbo clusters in use
> (current/cache/total/max)
> 
> 0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
> 161498K/10699K/172197K bytes allocated to network
> (current/cache/total)
> 18345/155/4099 requests for mbufs denied
> (mbufs/clusters/mbuf+clusters)
> 
> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> 3/3723/0 requests for jumbo clusters denied (4k/9k/16k)
> 
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 
> 
> 
> See how removing the MJUM9BYTES is just pushing the problem from the
> 9k jumbo cluster into the 4k jumbo cluster?
> 
> Compare this to my FreeBSD 9.2 STABLE machine from ~ Dec 2013 : Exact
> same hardware, revisions, zpool size, etc. Just it's running an
> older FreeBSD.
> 
> # uname -a
> FreeBSD SAN1.X 9.2-STABLE FreeBSD 9.2-STABLE #0: Wed Dec 25
> 15:12:14 AST 2013 aatech@FreeBSD-Update
> Server:/usr/obj/usr/src/sys/GENERIC amd64
> 
> root@SAN1:/san1 # uptime
> 7:44AM up 58 days, 38 mins, 4 users, load averages: 0.42, 0.80, 0.91
> 
> root@SAN1:/san1 # netstat -m
> 37930/15755/53685 mbufs in use (current/cache/total)
> 4080/10996/15076/524288 mbuf clusters in use
> (current/cache/total/max)
> 4080/5775 mbuf+clusters out of packet secondary zone in use
> (current/cache)
> 0/692/692/262144 4k (page size) jumbo clusters in use
> (current/cache/total/max)
> 32773/4257/37030/96000 9k jumbo clusters in use
> (current/cache/total/max)
> 
> 0/0/0/508538 16k jumbo clusters in use (current/cache/total/max)
> 312599K/67011K/379611K bytes allocated to network
> (current/cache/total)
> 
> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> 0/0/0 sfbufs in use (current/peak/max)
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 0 calls to protocol drain routines
> 
> Lastly, please note this link:
> 
> http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033660.html
> 
Hmm, this mentioned the ethernet header being in the TSO segment. I think
I already mentioned my TCP/IP is rusty and I know diddly about TSO.
However, at a glance it does appear the driver uses ether_output() for
TSO segments and, as such, I think an ethernet header is prepended to the
TSO segment. (This makes sense, since how else would the hardware kn

Re: 9.2 ixgbe tx queue hang

2014-03-21 Thread Rick Macklem
Christopher Forgeron wrote:
> It may be a little early, but I think that's it!
> 
> It's been running without error for nearly an hour - It's very rare
> it
> would go this long under this much load.
> 
> I'm going to let it run longer, then abort and install the kernel
> with the
> extra printfs so I can see what value ifp->if_hw_tsomax is before you
> set
> it.
> 
I think you'll just find it set to 0. Code in if_attach_internal()
{ in sys/net/if.c } sets it to IP_MAXPACKET (which is 65535) if it
is 0. In other words, if the if_attach routine in the driver doesn't
set it, this code sets it to the maximum possible value.

Here's the snippet:
	/* Initialize to max value. */
657	if (ifp->if_hw_tsomax == 0)
658		ifp->if_hw_tsomax = IP_MAXPACKET;

Anyhow, this sounds like progress.

As far as NFS is concerned, I'd rather set it to a smaller value
(maybe 56K) so that m_defrag() doesn't need to be called, but I
suspect others wouldn't like this.
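
For reference, the core of the attached patch is just a one-line cap set at
interface attach time (the same line quoted later in this thread; num_segs is
the driver's scatter/gather segment limit):

    /* keep a TSO segment small enough that, even after m_defrag() into
     * 2K clusters, it fits in the DMA segments the hardware will accept */
    ifp->if_hw_tsomax = adapter->num_segs * MCLBYTES - ETHER_HDR_LEN;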

Hopefully Jack can decide if this patch is ok?

Thanks yet again for doing this testing, rick
ps: I've attached it again, so Jack (and anyone else who reads this)
can look at it.
pss: Please report if it keeps working for you.

> It still had netstat -m denied entries on boot, but they are not
> climbing
> like they did before:
> 
> 
> $ uptime
>  9:32PM  up 25 mins, 4 users, load averages: 2.43, 6.15, 4.65
> $ netstat -m
> 21556/7034/28590 mbufs in use (current/cache/total)
> 4080/3076/7156/6127254 mbuf clusters in use (current/cache/total/max)
> 4080/2281 mbuf+clusters out of packet secondary zone in use
> (current/cache)
> 0/53/53/3063627 4k (page size) jumbo clusters in use
> (current/cache/total/max)
> 16444/118/16562/907741 9k jumbo clusters in use
> (current/cache/total/max)
> 0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
> 161545K/9184K/170729K bytes allocated to network
> (current/cache/total)
> 17972/2230/4111 requests for mbufs denied
> (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> 35/8909/0 requests for jumbo clusters denied (4k/9k/16k)
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 
> - Started off bad with the 9k denials, but it's not going up!
> 
> uptime
> 10:20PM  up  1:13, 6 users, load averages: 2.10, 3.15, 3.67
> root@SAN0:/usr/home/aatech # netstat -m
> 21569/7141/28710 mbufs in use (current/cache/total)
> 4080/3308/7388/6127254 mbuf clusters in use (current/cache/total/max)
> 4080/2281 mbuf+clusters out of packet secondary zone in use
> (current/cache)
> 0/53/53/3063627 4k (page size) jumbo clusters in use
> (current/cache/total/max)
> 16447/121/16568/907741 9k jumbo clusters in use
> (current/cache/total/max)
> 0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
> 161575K/9702K/171277K bytes allocated to network
> (current/cache/total)
> 17972/2261/4111 requests for mbufs denied
> (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> 35/8913/0 requests for jumbo clusters denied (4k/9k/16k)
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 
> This is the 9.2 ixgbe that I'm patching into 10.0, I'll move into the
> base
> 10.0 code tomorrow.
> 
> 
> On Fri, Mar 21, 2014 at 8:44 PM, Rick Macklem 
> wrote:
> 
> > Christopher Forgeron wrote:
> > >
> > >
> > >
> > >
> > >
> > >
> > > Hello all,
> > >
> > > I ran Jack's ixgbe MJUM9BYTES removal patch, and let iometer
> > > hammer
> > > away at the NFS store overnight - But the problem is still there.
> > >
> > >
> > > From what I read, I think the MJUM9BYTES removal is probably good
> > > cleanup (as long as it doesn't trade performance on a lightly
> > > memory
> > > loaded system for performance on a heavily memory loaded system).
> > > If
> > > I can stabilize my system, I may attempt those benchmarks.
> > >
> > >
> > > I think the fix will be obvious at boot for me - My 9.2 has a
> > > 'clean'
> > > netstat
> > > - Until I can boot and see a 'netstat -m' that looks similar to
> > > that,
> > > I'm going to have this problem.
> > >
> > >
> > > Markus: Do your systems show denied mbufs at boot like mine does?
> > >
> 

Re: 9.2 ixgbe tx queue hang

2014-03-22 Thread Rick Macklem
Christopher Forgeron wrote:
> Status Update: Hopeful, but not done.
> 
> So the 9.2-STABLE ixgbe with Rick's TSO patch has been running all
> night
> while iometer hammered away at it. It's got over 8 hours of test time
> on
> it.
> 
> It's still running, the CPU queues are not clogged, and everything is
> functional.
> 
> However, my ping_logger.py did record 23 incidents of "sendto: File
> too
> large" over the 8 hour run.
> 
Well, you could try making if_hw_tsomax somewhat smaller. (I can't see
how the packet including ethernet header would be more than 64K with the
patch, but?? For example, the ether_output() code can call ng_output()
and I have no idea if that might grow the data size of the packet?)

To be honest, the optimum for NFS would be setting if_hw_tsomax == 56K,
since that would avoid the overhead of the m_defrag() calls. However,
it is suboptimal for other TCP transfers.

One other thing you could do (if you still have them) is scan the logs
for the code with my previous printf() patch and see if there is ever
a size > 65549 in it. If there is, then if_hw_tsomax needs to be smaller
by at least the amount that size exceeds 65549. (65535 + 14 == 65549)

If I were you, I'd try setting it to 57344 (56K) instead of
 "num_segs * MCLBYTES - ETHER_HDR_LEN"
ie. replace
 ifp->if_hw_tsomax = adapter->num_segs * MCLBYTES - ETHER_HDR_LEN;
with
 ifp->if_hw_tsomax = 57344;
in the patch.

Then see if all the errors go away. (Jack probably won't like making it
that small, but it will show if decreasing it a bit will completely
fix the problem.)
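
To make that concrete, here is roughly what the experiment looks like in the
driver (a sketch only; I'm assuming the assignment sits where the earlier
ixgbe.patch put it, just before the ether_ifattach() call):

  /*
   * Experiment: cap TSO bursts at 56K (28 * 2K mbuf clusters) so a
   * full 64K NFS I/O never needs m_defrag(). The earlier patch used
   * adapter->num_segs * MCLBYTES - ETHER_HDR_LEN here instead.
   */
  ifp->if_hw_tsomax = 57344;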

> That's really nothing compared to what I usually run into - Normally
> I'd
> have 23 incidents within a 5 minute span.
> 
> During those 23 incidents, (ping_logger.py triggers a cpuset ping) I
> see
> it's having the same symptoms of clogging on a few CPU cores. That
> clogging
> does go away, a symptom that Markus says he sometimes experiences.
> 
> So I would say the TSO patch makes things remarkably better, but
> something
> else is still up. Unfortunately, with the TSO patch in place it's now
> harder to trigger the error, so testing will be more difficult.
> 
> Could someone confirm for me where the jumbo clusters denied/mbuf
> denied
> counters come from for netstat? Would it be from a m_defrag call that
> fails?
> 
I'm not familiar enough with the mbuf/uma allocators to "confirm" it,
but I believe the "denied" refers to cases where m_getjcl() fails to get
a jumbo mbuf and returns NULL.

If this were to happen in m_defrag(), it would return NULL and the ix
driver returns ENOBUFS, so this is not the case for EFBIG errors.
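
To illustrate the distinction, the driver's transmit mapping path looks
roughly like this (a paraphrase, not the literal ixgbe code; txr->txtag,
map, m_head, segs and nsegs are the usual driver locals):

  error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
      segs, &nsegs, BUS_DMA_NOWAIT);
  if (error == EFBIG) {
      /* Too many segments for the hardware; try to compact the chain. */
      struct mbuf *m = m_defrag(m_head, M_NOWAIT);

      if (m == NULL) {
          /* m_defrag() couldn't allocate clusters. */
          m_freem(m_head);
          return (ENOBUFS);
      }
      m_head = m;
      /* Retry; still EFBIG if the copy needs more than 32 clusters. */
      error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
          segs, &nsegs, BUS_DMA_NOWAIT);
  }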

I don't know if increasing the limits for the jumbo mbufs via sysctl
will help. If you are using the code without Jack's patch, which uses
9K mbufs, then I think it can fragment the address space and result
in no 9K contiguous areas to allocate from. (I'm just going by what
Garrett and others have said about this.)

> I feel the netstat -m stats on boot are part of this issue - I was
> able to
> greatly reduce them during one of my test iterations. I'm going to
> see if I
> can repeat that with the TSO patch.
> 
> Getting this working on the 10-STABLE ixgbe:
> 
> Mike's contributed some edits (slightly different thread) I want to
> try on
> that driver. At the same time, a diff of 9.2 <-> 10.0 may give hints,
> as
> the 10.0 driver with TSO patch has issues quickly, and frequently...
> it's
> doing something that aggravates this condition.
> 
> 
> Thanks for all the help, please keep the suggestions or tidbits of
> info
> coming.
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-22 Thread Rick Macklem
Christopher Forgeron wrote:
> 
> 
> 
> 
> 
> 
> Ah yes, I see it now: Line #658
> 
> #if defined(INET) || defined(INET6)
> /* Initialize to max value. */
> if (ifp->if_hw_tsomax == 0)
> ifp->if_hw_tsomax = IP_MAXPACKET;
> KASSERT(ifp->if_hw_tsomax <= IP_MAXPACKET &&
> ifp->if_hw_tsomax >= IP_MAXPACKET / 8,
> ("%s: tsomax outside of range", __func__));
> #endif
> 
> 
> Should this be the location where it's being set rather than in
> ixgbe? I would assume that other drivers could fall prey to this
> issue.
> 
All of this should be prepended with "I'm an NFS guy, not a networking
guy, so I might be wrong".

Other drivers (and ixgbe for the 82598 chip) can handle a packet that
is in more than 32 mbufs. (I think the 82598 handles 100, grep for SCATTER
in *.h in sys/dev/ixgbe.)

Now, since several drivers do have this 32 mbufs limit, I can see an argument
for making the default a little smaller to make these work, since the
driver can override the default. (About now someone usually jumps in and says
something along the lines of "You can't do that until all the drivers that
can handle IP_MAXPACKET are fixed to set if_hw_tsomax" and since I can't fix
drivers I can't test, that pretty much puts a stop on it.)

You see the problem isn't that IP_MAXPACKET is too big, but that the hardware
has a limit of 32 non-contiguous chunks (mbufs)/packet and 32 * MCLBYTES = 64K.
(Hardware/network drivers that can handle 35 or more chunks (they like to call
 them transmit segments, although ixgbe uses the term scatter) shouldn't have
 any problems.)

I have an untested patch that adds a tsomaxseg count to use along with tsomax
bytes so that a driver could inform tcp_output() it can only handle 32 mbufs
and then tcp_output() would limit a TSO segment using both, but I can't test
it, so who knows when/if that might happen.

I also have a patch that modifies NFS to use pagesize clusters (reducing the
mbuf count in the list), but that one causes grief when testing on an i386
(seems to run out of kernel memory to the point where it can't allocate
something called "boundary tags" and pretty well wedges the machine at that
point.)
Since I don't know how to fix this (I thought of making the patch "amd64 only"),
I can't really commit this to head, either.
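
(For the record, the arithmetic behind "reducing the mbuf count": a 64K NFS
I/O is 32 x 2K clusters plus the TCP/IP, RPC header and NFS args mbufs, i.e.
35 mbufs, while 4K clusters bring that down to 16 + 3 = 19. A trivial
stand-alone check, nothing kernel-specific:)

  #include <stdio.h>

  int
  main(void)
  {
      int data = 64 * 1024;
      int hdr_mbufs = 3;    /* TCP/IP + RPC header + NFS args */

      printf("2K clusters: %d mbufs in chain\n", hdr_mbufs + data / 2048);
      printf("4K clusters: %d mbufs in chain\n", hdr_mbufs + data / 4096);
      return (0);
  }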

As such, I think it's going to be "fix the drivers one at a time" and tell
folks to "disable TSO or limit rsize,wsize to 32K" when they run into trouble.
(As you might have guessed, I'd rather just be "the NFS guy", but since NFS
 "triggers the problem" I\m kinda stuck with it;-)

> Also should we not also subtract ETHER_VLAN_ENCAP_LEN from tsomax to
> make sure VLANs fit?
> 
No idea. (I wouldn't know a VLAN if it jumped up and tried to
bite me on the nose.;-) So, I have no idea what does this, but
if it means the total ethernet header size can be > 14 bytes, then I'd agree.

> Perhaps there is something in the newer network code that is filling
> up the frames to the point where they are full - thus a TSO =
> IP_MAXPACKET is just now causing problems.
> 
Yea, I have no idea why this didn't bite running 9.1. (Did 9.1 have
TSO enabled by default?)

> I'm back on the 9.2-STABLE ixgbe with the tso patch for now. I'll
> make it run overnight while copying a few TB of data to make sure
> it's stable there before investigating the 10.0-STABLE driver more.
> 
I have no idea what needs to be changed to back-port a 10.0 driver to
9.2.

Good luck with it and thanks for what you've learned sofar, rick

> ..and there is still the case of the denied jumbo clusters on boot -
> something else is off someplace.
> 
> BTW - In all of this, I did not mention that my ix0 uses a MTU of
> 9000 - I assume others assumed this.
> 
> 
> 
> 
> 
> 
> 
> 
> On Fri, Mar 21, 2014 at 11:39 PM, Rick Macklem < rmack...@uoguelph.ca
> > wrote:
> 
> 
> 
> Christopher Forgeron wrote:
> > It may be a little early, but I think that's it!
> > 
> > It's been running without error for nearly an hour - It's very rare
> > it
> > would go this long under this much load.
> > 
> > I'm going to let it run longer, then abort and install the kernel
> > with the
> > extra printfs so I can see what value ifp->if_hw_tsomax is before
> > you
> > set
> > it.
> > 
> I think you'll just find it set to 0. Code in if_attach_internal()
> { in sys/net/if.c } sets it to IP_MAXPACKET (which is 65535) if it
> is 0. In other words, if the if_attach routine in the driver doesn't
> set it, this code sets it to the maximum possible value.
> 
> He

Re: 9.2 ixgbe tx queue hang

2014-03-22 Thread Rick Macklem
Christopher Forgeron wrote:
> 
> 
> 
> 
> 
> 
> Ah yes, I see it now: Line #658
> 
> #if defined(INET) || defined(INET6)
> /* Initialize to max value. */
> if (ifp->if_hw_tsomax == 0)
> ifp->if_hw_tsomax = IP_MAXPACKET;
> KASSERT(ifp->if_hw_tsomax <= IP_MAXPACKET &&
> ifp->if_hw_tsomax >= IP_MAXPACKET / 8,
> ("%s: tsomax outside of range", __func__));
> #endif
> 
> 
> Should this be the location where it's being set rather than in
> ixgbe? I would assume that other drivers could fall prey to this
> issue.
> 
> Also should we not also subtract ETHER_VLAN_ENCAP_LEN from tsomax to
> make sure VLANs fit?
> 
I took a look and, yes, this does seem to be needed. It will only be
needed for the case where a vlan is in use and hwtagging is disabled,
if I read the code correctly.

Do you use vlans?

I've attached an updated patch.

It might be nice to have the printf() patch in the driver too, so
we can see how big the ones that are too big are?
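
(For reference, that patch is just a diagnostic along these lines; the exact
diff isn't reproduced in this thread, so treat this as a guess at its shape,
based on the "before"/pklen lines that show up in the logs later on:)

  if (m_head->m_pkthdr.len > IP_MAXPACKET)
      printf("ix: before defrag, pklen=%d\n", m_head->m_pkthdr.len);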

Good luck with it, rick

> Perhaps there is something in the newer network code that is filling
> up the frames to the point where they are full - thus a TSO =
> IP_MAXPACKET is just now causing problems.
> 
> I'm back on the 9.2-STABLE ixgbe with the tso patch for now. I'll
> make it run overnight while copying a few TB of data to make sure
> it's stable there before investigating the 10.0-STABLE driver more.
> 
> ..and there is still the case of the denied jumbo clusters on boot -
> something else is off someplace.
> 
> BTW - In all of this, I did not mention that my ix0 uses a MTU of
> 9000 - I assume others assumed this.
> 
> 
> 
> 
> 
> 
> 
> 
> On Fri, Mar 21, 2014 at 11:39 PM, Rick Macklem < rmack...@uoguelph.ca
> > wrote:
> 
> 
> 
> Christopher Forgeron wrote:
> > It may be a little early, but I think that's it!
> > 
> > It's been running without error for nearly an hour - It's very rare
> > it
> > would go this long under this much load.
> > 
> > I'm going to let it run longer, then abort and install the kernel
> > with the
> > extra printfs so I can see what value ifp->if_hw_tsomax is before
> > you
> > set
> > it.
> > 
> I think you'll just find it set to 0. Code in if_attach_internal()
> { in sys/net/if.c } sets it to IP_MAXPACKET (which is 65535) if it
> is 0. In other words, if the if_attach routine in the driver doesn't
> set it, this code sets it to the maximum possible value.
> 
> Here's the snippet:
> /* Initialize to max value. */
> 657 if (ifp->if_hw_tsomax == 0)
> 658 ifp->if_hw_tsomax = IP_MAXPACKET;
> 
> Anyhow, this sounds like progress.
> 
> As far as NFS is concerned, I'd rather set it to a smaller value
> (maybe 56K) so that m_defrag() doesn't need to be called, but I
> suspect others wouldn't like this.
> 
> Hopefully Jack can decide if this patch is ok?
> 
> Thanks yet again for doing this testing, rick
> ps: I've attached it again, so Jack (and anyone else who reads this)
> can look at it.
> pss: Please report if it keeps working for you.
> 
> 
> 
> > It still had netstat -m denied entries on boot, but they are not
> > climbing
> > like they did before:
> > 
> > 
> > $ uptime
> > 9:32PM up 25 mins, 4 users, load averages: 2.43, 6.15, 4.65
> > $ netstat -m
> > 21556/7034/28590 mbufs in use (current/cache/total)
> > 4080/3076/7156/6127254 mbuf clusters in use
> > (current/cache/total/max)
> > 4080/2281 mbuf+clusters out of packet secondary zone in use
> > (current/cache)
> > 0/53/53/3063627 4k (page size) jumbo clusters in use
> > (current/cache/total/max)
> > 16444/118/16562/907741 9k jumbo clusters in use
> > (current/cache/total/max)
> > 0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
> > 161545K/9184K/170729K bytes allocated to network
> > (current/cache/total)
> > 17972/2230/4111 requests for mbufs denied
> > (mbufs/clusters/mbuf+clusters)
> > 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> > 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> > 35/8909/0 requests for jumbo clusters denied (4k/9k/16k)
> > 0 requests for sfbufs denied
> > 0 requests for sfbufs delayed
> > 0 requests for I/O initiated by sendfile
> > 
> > - Started off bad with the 9k denials, but it's not going up!
> > 
> > uptime
> > 10:20PM up 1:13, 6 users, load averages: 2.10, 3.15, 3.67
> > root@SAN0:/usr/home/aatech # netstat -m
> > 21569/7141/28710 mbufs in use (current/cache/total)
&

Re: 9.2 ixgbe tx queue hang

2014-03-23 Thread Rick Macklem
Christopher Forgeron wrote:
> 
> 
> 
> 
> 
> 
> On Sat, Mar 22, 2014 at 6:41 PM, Rick Macklem < rmack...@uoguelph.ca
> > wrote:
> 
> 
> 
> Christopher Forgeron wrote:
> > #if defined(INET) || defined(INET6)
> > /* Initialize to max value. */
> > if (ifp->if_hw_tsomax == 0)
> > ifp->if_hw_tsomax = IP_MAXPACKET;
> > KASSERT(ifp->if_hw_tsomax <= IP_MAXPACKET &&
> > ifp->if_hw_tsomax >= IP_MAXPACKET / 8,
> > ("%s: tsomax outside of range", __func__));
> > #endif
> > 
> > 
> > Should this be the location where it's being set rather than in
> > ixgbe? I would assume that other drivers could fall prey to this
> > issue.
> > 
> All of this should be prepended with "I'm an NFS guy, not a
> networking
> guy, so I might be wrong".
> 
> Other drivers (and ixgbe for the 82598 chip) can handle a packet that
> is in more than 32 mbufs. (I think the 82598 handles 100, grep for
> SCATTER
> in *.h in sys/dev/ixgbe.)
> 
> 
> [...]
> 
> 
> Yes, I agree we have to be careful about the limitations of other
> drivers, but I'm thinking setting tso to IP_MAXPACKET is a bad idea,
> unless all of the header subtractions are happening elsewhere. Then
> again, perhaps every other driver (and possibly ixgbe.. i need to
> look more) does a maxtso - various_headers to set a limit for data
> packets.
> 
> 
> I'm not familiar with the Freebsd network conventions/styles - I'm
> just asking questions, something I have a bad habit for, but I'm in
> charge of code stability issues at my work so it's hard to stop.
> 
Well, IP_MAXPACKET is simply the largest # that fits in the 16bit length
field of an IP header (65535). This limit is on the TSO segment (which
is really just a TCP/IP packet greater than the MTU) and does not include
a MAC level (ethernet) header.

Beyond that, it is the specific hardware that limits things, such as
this case, which is limited to 32 mbufs (which happens to imply 64K
total, including ethernet header using 2K mbuf clusters).
(The 64K limit is just a quirk caused by the 32mbuf limit and the fact
 that mbuf clusters hold 2K of data each.)

> 
> 
> Now, since several drivers do have this 32 mbufs limit, I can see an
> argument
> for making the default a little smaller to make these work, since the
> driver can override the default. (About now someone usually jumps in
> and says
> something along the lines of "You can't do that until all the drivers
> that
> can handle IP_MAXPACKET are fixed to set if_hw_tsomax" and since I
> can't fix
> drivers I can't test, that pretty much puts a stop on it.)
> 
> 
> 
> 
> Testing is a problem isn't it? I once again offer my stack of network
> cards and systems for some sort of testing.. I still have coax and
> token ring around. :-)
> 
> 
> 
> You see the problem isn't that IP_MAXPACKET is too big, but that the
> hardware
> has a limit of 32 non-contiguous chunks (mbufs)/packet and 32 *
> MCLBYTES = 64K.
> (Hardware/network drivers that can handle 35 or more chunks (they
> like to call
> them transmit segments, although ixgbe uses the term scatter)
> shouldn't have
> any problems.)
> 
> I have an untested patch that adds a tsomaxseg count to use along
> with tsomax
> bytes so that a driver could inform tcp_output() it can only handle
> 32 mbufs
> and then tcp_output() would limit a TSO segment using both, but I
> can't test
> it, so who knows when/if that might happen.
> 
> 
> 
> 
> I think you give that to me in the next email - if not, please send.
> 
> 
> 
> I also have a patch that modifies NFS to use pagesize clusters
> (reducing the
> mbuf count in the list), but that one causes grief when testing on an
> i386
> (seems to run out of kernel memory to the point where it can't
> allocate something
> called "boundary tags" and pretty well wedges the machine at that
> point.)
> Since I don't know how to fix this (I thought of making the patch
> "amd64 only"),
> I can't really commit this to head, either.
> 
> 
> 
> 
> Send me that one too. I love NFS patches.
> 
> 
> 
> As such, I think it's going to be "fix the drivers one at a time" and
> tell
> folks to "disable TSO or limit rsize,wsize to 32K" when they run into
> trouble.
> (As you might have guessed, I'd rather just be "the NFS guy", but
> since NFS
> "triggers the problem" I\m kinda stuck with it;-)
> 
> 
> 
> I know in some circumstances disabling TSO can be a benefit, but in
> general yo

Re: 9.2 ixgbe tx queue hang

2014-03-23 Thread Rick Macklem
Christopher Forgeron wrote:
> Hi Rick, very helpful as always.
> 
> 
> On Sat, Mar 22, 2014 at 6:18 PM, Rick Macklem 
> wrote:
> 
> > Christopher Forgeron wrote:
> >
> > Well, you could try making if_hw_tsomax somewhat smaller. (I can't
> > see
> > how the packet including ethernet header would be more than 64K
> > with the
> > patch, but?? For example, the ether_output() code can call
> > ng_output()
> > and I have no idea if that might grow the data size of the packet?)
> >
> 
> That's what I was thinking - I was going to drop it down to 32k,
> which is
> extreme, but I wanted to see if it cured it or not. Something would
> have to
> be very broken to be adding nearly 32k to a packet.
> 
> 
> > To be honest, the optimum for NFS would be setting if_hw_tsomax ==
> > 56K,
> > since that would avoid the overhead of the m_defrag() calls.
> > However,
> > it is suboptimal for other TCP transfers.
> >
> 
Ok, here is the critical code snippet from tcp_output():
   /*
774 * Limit a burst to t_tsomax minus IP,
775 * TCP and options length to keep ip->ip_len
776 * from overflowing or exceeding the maximum
777 * length allowed by the network interface.
778 */
779 if (len > tp->t_tsomax - hdrlen) {
780 len = tp->t_tsomax - hdrlen;
781 sendalot = 1;
782 }
783 
784 /*
785 * Prevent the last segment from being
786 * fractional unless the send sockbuf can
787 * be emptied.
788 */
789 if (sendalot && off + len < so->so_snd.sb_cc) {
790 len -= len % (tp->t_maxopd - optlen);
791 sendalot = 1;
792 }
The first "if" at #779 limits the len to if_hw_tsomax - hdrlen.
(tp->t_tsomax == if_hw_tsomax and hdrlen == size of TCP/IP header)
The second "if" at #789 reduces the len to an exact multiple of the output
MTU if it won't empty the send queue.

Here's how I think things work:
- For a full 64K of read/write data, NFS generates an mbuf list with
  32 MCLBYTES clusters of data and two small header packets prepended
  in front of them (one for the RPC header + one for the NFS args that
  come before the data).
  Total data length is a little over 64K (something like 65600 bytes).
  - When the above code processes this, it reduces the length to
if_hw_tsomax (65535 by default). { if at #779 }
  - Second "if" at #789 reduces it further (63000 for a 9000byte MTU).
  tcp_output() prepends an mbuf with the TCP/IP header in it, resulting
  in a total data length somewhat less than 64K and passes this to the
  ixgbe.c driver.
- The ixgbe.c driver prepends an ethernet header (14 or maybe 18 bytes in
  length) by calling ether_output() and then hands it (a little less than
  64K bytes of data in 35 mbufs) to ixgbe_xmit().
  ixgbe_xmit() calls bus_dmamap_load_mbuf_sg() which fails, returning
  EFBIG, because the list has more than 32 mbufs in it.
  - then it calls m_defrag(), which copies the slightly less than 64K
of data to a list of 32 mbuf clusters.
  - bus_dmamap_load_mbuf_sg() is called again and succeeds this time
because the list is only 32 mbufs long.
   (The call to m_defrag() adds some overhead and does have the potential
to fail if mbuf clusters are exhausted, so this works, but isn't ideal.)

The problem case happens when the size of the I/O is a little less than
the full 64K (hit EOF for read or a smaller than 64K dirty region in a
buffer cache block for write.
- Now, for example, the total data length for the mbuf chain (including
  RPC, NFS and TCP/IP headers) could be 65534 (slightly less than 64K).
The first "if" doesn't change the "len", since it is less than if_hw_tsomax.
The second "if" doesn't change the "len" if there is no additional data in
the send queue.
--> Now the ixgbe driver prepends an ethernet header, increasing the total
data length to 65548 (a little over 64K).
   - First call to bus_dmamap_load_mbuf_sg() fails with EFBIG because the
 mbuf list has more than 32 entries.
   - calls m_defrag(), which copies the data to a list of 33 mbuf clusters.
 (> 64K requires 33 * 2K clusters)
   - Second call to bus_dmamap_load_mbuf_sg() fails again with EFBIG, because
 the list has 33 mbufs in it.
   --> Returns EFBIG and throws away the TSO segment without sending it.
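
To make the arithmetic easy to check, here is a small stand-alone program
(plain user-space C, not kernel code) that works out how many 2K clusters
m_defrag() needs for a given frame; the 63000 and 65534 byte examples are
the ones from this walk-through:

  #include <stdio.h>

  #define MCLBYTES      2048
  #define ETHER_HDR_LEN 14
  #define TX_SEG_LIMIT  32    /* 82599 scatter/gather limit */

  static void
  check(int tcpip_len)
  {
      int frame = tcpip_len + ETHER_HDR_LEN;
      int clusters = (frame + MCLBYTES - 1) / MCLBYTES;

      printf("TSO payload %5d -> frame %5d -> %2d clusters -> %s\n",
          tcpip_len, frame, clusters,
          clusters <= TX_SEG_LIMIT ? "ok" : "EFBIG after m_defrag()");
  }

  int
  main(void)
  {
      check(63000);    /* full 64K NFS I/O after tcp_output() trims it */
      check(65534);    /* the "a little less than 64K" case that breaks */
      check(65518);    /* 32 * MCLBYTES - 18: still fits, even with a vlan header */
      return (0);
  }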

For NFS, the ideal would be to not only never fail with EFBIG, but to not
have the overhead of calling m_defrag().
- One way is to use pagesize (4K) clusters, so that the mbuf list only has
  19 entries.
- Another way is to teach tcp_output() to limit the mbuf list to 32 mbufs
  as well as 65535 bytes in length.
- Yet another is to make if_hw_tsomax small enough that the mbuf list
  doesn't exceed 32 mbufs. (56K would do this for NFS, but is sub

Re: 9.2 ixgbe tx queue hang

2014-03-23 Thread Rick Macklem
Christopher Forgeron wrote:
> 
> 
> 
> 
> 
> 
> 
> 
> Update:
> 
> For giggles, I set IP_MAXPACKET = 32768.
> 
Well, I'm pretty sure you don't want to do that, except for an experiment.
You can just set if_hw_tsomax to whatever you want to try, at the place
my ixgbe.patch put it (just before the call to ether_ifattach()).

> Over a hour of runtime, and no issues. This is better than with the
> TSO patch and the 9.2 ixgbe, as that was just a drastic reduction in
> errors.
> 
So now the question becomes "how much does if_hw_tsomax need to be
reduced from 65535 to get this?". If reducing it by the additional
4bytes for a vlan header is sufficient, then I understand what is
going on. If it needs to be reduced by more than that, then there
is something going on that I still don't understand.

> Still have an 'angry' netstat -m on boot, and I'm still incrementing
> denied netbuf calls, so something else is wrong.
> 
> I'm going to modify Rick's prinft in ixgbe to also output when we're
> over 32768. I'm sure it's still happening, but with an extra 32k of
> space, we're not busting like we did before.
> 
> 
> I notice a few interesting ip->ip_len changes since 9.2 - Like here,
> at line 720
> 
> http://fxr.watson.org/fxr/diff/netinet/ip_output.c?v=FREEBSD10;im=kwqeqdhhvovqn;diffval=FREEBSD92;diffvar=v
> 
> Looks like older code didn't byteswap with ntohs - I see that often
> in tcp_output.c, and in tcp_options.c.
> 
> 
> I'm also curious about this: Line 524
> http://fxr.watson.org/fxr/diff/netinet/ip_options.c?v=FREEBSD10;diffval=FREEBSD92;diffvar=v
> 
> 
> New 10 code:
> 
> ip->ip_len = htons(ntohs(ip->ip_len) + optlen);
> 
> Old 9.2 code:
> ip->ip_len += optlen;
> 
Well, TSO segments aren't generated when optlen > 0, so I doubt this
matters for our issue (and I would find it hard to believe that this
would have been broken?). You can always look at the svn commit logs
to see why/how something was changed.

> 
> 
> I wonder if there are any unexpected consequences of these changes,
> or perhaps a line someplace that doesn't make the change.
> 
> Is there a dtrace command I could use to watch these functions and
> compare the new ip_len with ip->ip_len or other variables?
> 
> 
> 
> 
> 
> 
> 
> On Sun, Mar 23, 2014 at 12:25 PM, Christopher Forgeron <
> csforge...@gmail.com > wrote:
> 
> 
> 
> 
> 
> 
> 
> On Sat, Mar 22, 2014 at 11:58 PM, Rick Macklem < rmack...@uoguelph.ca
> > wrote:
> 
> 
> 
> 
> Christopher Forgeron wrote:
> > 
> 
> > Also should we not also subtract ETHER_VLAN_ENCAP_LEN from tsomax
> > to
> > make sure VLANs fit?
> > 
> I took a look and, yes, this does seem to be needed. It will only be
> needed for the case where a vlan is in use and hwtagging is disabled,
> if I read the code correctly.
> 
> 
> 
> Yes, or in the rare case where you configure your switch to pass the
> vlan header through to the NIC.
> 
> 
> 
> Do you use vlans?
> 
> 
> (Answered in above email)
> 
> 
> 
> 
> 
> I've attached an updated patch.
> 
> It might be nice to have the printf() patch in the driver too, so
> we can see how big the ones that are too big are?
> 
> 
> 
> Yes, I'm going to leave those in until I know we have this fixed..
> will probably leave it in a while longer as it should only have a
> minor performance impact to iter-loop like that, and I'd like to see
> what the story is a few months down the road.
> 
> 
> Thanks for the patches, will have to start giving them code-names so
> we can keep them straight. :-) I guess we have printf, tsomax, and
> this one.
> 
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-24 Thread Rick Macklem
Julian Elischer wrote:
> On 3/23/14, 4:57 PM, Rick Macklem wrote:
> > Christopher Forgeron wrote:
> >>
> >>
> >>
> >>
> >>
> >> On Sat, Mar 22, 2014 at 6:41 PM, Rick Macklem <
> >> rmack...@uoguelph.ca
> >>> wrote:
> >>
> >>
> >> Christopher Forgeron wrote:
> >>> #if defined(INET) || defined(INET6)
> >>> /* Initialize to max value. */
> >>> if (ifp->if_hw_tsomax == 0)
> >>> ifp->if_hw_tsomax = IP_MAXPACKET;
> >>> KASSERT(ifp->if_hw_tsomax <= IP_MAXPACKET &&
> >>> ifp->if_hw_tsomax >= IP_MAXPACKET / 8,
> >>> ("%s: tsomax outside of range", __func__));
> >>> #endif
> >>>
> >>>
> >>> Should this be the location where it's being set rather than in
> >>> ixgbe? I would assume that other drivers could fall prey to this
> >>> issue.
> >>>
> >> All of this should be prepended with "I'm an NFS guy, not a
> >> networking
> >> guy, so I might be wrong".
> >>
> >> Other drivers (and ixgbe for the 82598 chip) can handle a packet
> >> that
> >> is in more than 32 mbufs. (I think the 82598 handles 100, grep for
> >> SCATTER
> >> in *.h in sys/dev/ixgbe.)
> >>
> 
> the Xen backend cannot handle more than 32 segments in some versions
> of Xen.
> 
Oops, poorly worded. I should have said "Some other drivers...". Yes,
there are several (I once did a find/grep, but didn't keep the output)
that have this 32 limit.

Also, I have no idea if the limit can easily be increased to 35 for them?
(Bryan was able to do that for the virtio network driver.)

rick
ps: If it was just "ix" I wouldn't care as much about this.

> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-24 Thread Rick Macklem
Christopher Forgeron wrote:
> I'm going to split this into different posts to focus on each topic.
> This
> is about setting IP_MAXPACKET to 65495
> 
> Update on Last Night's Run:
> 
> (Last night's run is a kernel with IP_MAXPACKET = 65495)
> 
> - Uptime on this run: 10:53AM  up 13:21, 5 users, load averages:
> 1.98,
> 2.09, 2.13
> - Ping logger records no ping errors for the entire run.
> - At Mar 24th 10:57 I did a grep through the night's log for 'before'
> (which is the printf logging that Rick suggested a few days ago), and
> saved
> it to before_total.txt
> - With wc -l on before_total.txt I can see that we have 504 lines,
> thus 504
> incidents of the packet being above IP_MAXPACKET during this run.
> - I did tr -c '[:alnum:]' '[\n*]' < before_total.txt | sort | uniq -c
> |
> sort -nr | head -50 to list the most common words. Ignoring the
> non-pklen
> output. The relevant output is:
> 
>  344 65498 (3)
>  330 65506 (11)
>  330 65502 (7)
> 
This makes sense to me, since tp->t_tsomax is used in tcp_output() for
the TCP/IP packet, which does not include the link level (ethernet)
header. When that is added, I would expect the length to be up to 14
(or maybe 18 for vlan cases) greater than IP_MAXPACKET. Since none of
these are greater than 65509, this looks fine to me.

So, unless you get ones greater than (65495 + 18 = 65513), this makes
sense and does not indicate a problem.

In another post, you indicate that having the driver set if_hw_tsomax
didn't set tp->t_tsomax to the same value.
--> I believe that is a bug and would mean my ixgbe.patch would not
fix the problem, because it is tp->t_tsomax that must be decreased
to at least (65536 - 18 = 65518).
--> Now, have you tried a case between 65495 and 65518 and seen
any EFBIG errors?
If so, then I don't understand why 65518 isn't small enough?

rick

>  - First # being the # of times. (Each pklen is printed twice on the
>  log,
> thus 2x the total line count).
>  - Last (#) being the byte overrun from 65495
>  - A fairly even distribution of each type of packet overrun.
> 
>  You will recall that my IP_MAXPACKET is 65495, so each of these
>  packet
> lengths represents a overshoot.
> 
>  The fact that we have only 3 different types of overrun is good - It
> suggests a non-random event, more like a broken 'if' statement for a
> particular case.
> 
I think it just means that your load happens to do only 3 sizes of I/O
that is a little less than 65536.

>  If IP_MAXPACKET was set to 65535 as it normally is, I would have had
>  504
> incidents of errors, with a chance that any one of them could have
> blocked
> the queue for considerable time.
> 
If tp->t_tsomax hasn't been set to a smaller value than 65535, the
ixgbe.patch didn't do what I thought it would.

>  Question: Should there be logic that discards packets that are over
> IP_MAXPACKET to ensure that we don't end up in a blocked queue
> situation
> again?
> 
> 
>  Moving forward, I am doing two things:
> 
>  1) I'm running a longer test with TSO disabled on my ix0 adapter. I
>  want
> to make sure that over say 4 hours I don't have even 1 packet over
> 65495.
> This will at least locate the issue to TSO related code.
> 
>  2) I have tcpdump running, to see if I can capture the packets over
>  65495.
> Here is my command. Any suggestions on additional switches I should
> include?
> 
> tcpdump -ennvvXS greater 65495
> 
> I'll report in on this again once I have new info.
> 
> Thanks for reading.
> 
> On Mon, Mar 24, 2014 at 2:14 AM, Christopher Forgeron
> wrote:
> 
> > Hi,
> >
> >  I'll follow up more tomorrow, as it's late and I don't have time
> >  for
> > detail.
> >
> >  The basic TSO patch didn't work, as packets were were still going
> >  over
> > 65535 by a fair amount. I thought I wrote that earlier, but I am
> > dumping a
> > lot of info into a few threads, so I apologize if I'm not as
> > concise as I
> > could be.
> >
> >  However, setting IP_MAXPACKET did. 4 hours of continuous run-time,
> >  no
> > issues. No lost pings, no issues. Of course this isn't a fix - but
> > it helps
> > isolate the problem.
> > > what the story is a few months down the road.
> > >
> > >
> > > Thanks for the patches, will have to start giving them code-names
> > > so
> > > we can keep them straight. :-) I guess we have printf, tsomax,
> > > and
> > > this one.
> > >
> > >
> >
> >
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-24 Thread Rick Macklem
Markus Gebert wrote:
> 
> On 24.03.2014, at 16:21, Christopher Forgeron 
> wrote:
> 
> > This is regarding the TSO patch that Rick suggested earlier. (With
> > many
> > thanks for his time and suggestion)
> > 
> > As I mentioned earlier, it did not fix the issue on a 10.0 system.
> > It did
> > make it less of a problem on 9.2, but either way, I think it's not
> > needed,
> > and shouldn't be considered as a patch for testing/etc.
> > 
> > Patching TSO to anything other than a max value (and by default the
> > code
> > gives it IP_MAXPACKET) is confusing the matter, as the packet
> > length
> > ultimately needs to be adjusted for many things on the fly like TCP
> > Options, etc. Using static header sizes won't be a good idea.
> > 
> > Additionally, it seems that setting nic TSO will/may be ignored by
> > code
> > like this in sys/netinet/tcp_output.c:
> > 
> > 10.0 Code:
> > 
> >  780 if (len > tp->t_tsomax - hdrlen) {   !!
> >  781 len = tp->t_tsomax - hdrlen;          !!
> >  782 sendalot = 1;
> >  783 }
> > 
> > 
> > I've put debugging here, set the nic's max TSO as per Rick's patch
> > ( set to
> > say 32k), and have seen that tp->t_tsomax == IP_MAXPACKET. It's
> > being set
> > someplace else, and thus our attempts to set TSO on the nic may be
> > in vain.
> > 
> > It may have mattered more in 9.2, as I see the code doesn't use
> > tp->t_tsomax in some locations, and may actually default to what
> > the nic is
> > set to.
> > 
> > The NIC may still win, I didn't walk through the code to confirm,
> > it was
> > enough to suggest to me that setting TSO wouldn't fix this issue.
> 
> 
> I just applied Rick’s ixgbe TSO patch and additionally wanted to be
> able to easily change the value of hw_tsomax, so I made a sysctl out
> of it.
> 
> While doing that, I asked myself the same question. Where and how
> will this value actually be used and how comes that tcp_output()
> uses that other value in struct tcpcb.
> 
> The only place tcpcb->t_tsomax gets set, that I have found so far, is
> in tcp_input.c’s tcp_mss() function. Some subfunctions get called:
> 
> tcp_mss() -> tcp_mss_update() -> tcp_maxmtu()
> 
> Then tcp_maxmtu() indeed uses the interface’s hw_tsomax value:
> 
> 1746 cap->tsomax = ifp->if_hw_tsomax;
> 
> It gets passed back to tcp_mss() where it is set on the connection
> level which will be used in tcp_output() later on.
> 
> tcp_mss() gets called from multiple places, I’ll look into that
> later. I will let you know if I find out more.
> 
> 
> Markus
> 
Well, if tp->t_tsomax isn't set to a value of 65518, then the ixgbe.patch
isn't doing what I thought it would.

The only explanation I can think of for this is that there might be
another net interface driver stacked on top of the ixgbe.c one and
that the setting doesn't get propagated up.
Does this make any sense?

IP_MAXPACKET can't be changed from 65535, but I can see an argument
for setting the default value of if_hw_tsomax to a smaller value.
For example, in sys/net/if.c change it from:
657 if (ifp->if_hw_tsomax == 0)
658 ifp->if_hw_tsomax = IP_MAXPACKET;
to
657 if (ifp->if_hw_tsomax == 0)
658 ifp->if_hw_tsomax = 65536 - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);

This is a slightly smaller default which won't have much impact unless
the hardware device can only handle 32 mbuf clusters for transmit of
a segment and there are several of those.

Christopher, can you do your test run with IP_MAXPACKET set to 65518,
which should be the same as the above. If that gets rid of all the
EFBIG error replies, then I think the above patch will have the same
effect.

Thanks, rick

> 
> > However, this is still a TSO related issue, it's just not one
> > related to
> > the setting of TSO's max size.
> > 
> > A 10.0-STABLE system with tso disabled on ix0 doesn't have a single
> > packet
> > over IP_MAXPACKET in 1 hour of runtime. I'll let it go a bit longer
> > to
> > increase confidence in this assertion, but I don't want to waste
> > time on
> > this when I could be logging problem packets on a system with TSO
> > enabled.
> > 
> > Comments are very welcome..
> > ___
> > freebsd-net@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to
> > "freebsd-net-unsubscr...@freebsd.org"
> > 
> 
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: 9.2 ixgbe tx queue hang

2014-03-24 Thread Rick Macklem
Christopher Forgeron wrote:
> 
> 
> 
> This is regarding the TSO patch that Rick suggested earlier. (With
> many thanks for his time and suggestion)
> 
> 
> As I mentioned earlier, it did not fix the issue on a 10.0 system. It
> did make it less of a problem on 9.2, but either way, I think it's
> not needed, and shouldn't be considered as a patch for testing/etc.
> 
> 
> Patching TSO to anything other than a max value (and by default the
> code gives it IP_MAXPACKET) is confusing the matter, as the packet
> length ultimately needs to be adjusted for many things on the fly
> like TCP Options, etc. Using static header sizes won't be a good
> idea.
> 
If you look at tcp_output(), you'll notice that it doesn't do TSO if
there are any options. That way it knows that the TCP/IP header is
just hdrlen.

If you don't limit the TSO packet (including TCP/IP and ethernet headers)
to 64K, then the "ix" driver can't send them, which is the problem
you guys are seeing.

There are other ways to fix this problem, but they all may introduce
issues that reducing if_hw_tsomax by a small amount does not.
For example, m_defrag() could be modified to use 4K pagesize clusters,
but this might introduce memory fragmentation problems. (I observed
what I think are memory fragmentation problems when I switched NFS
to use 4K pagesize clusters for large I/O messages.)

If setting IP_MAXPACKET to 65518 fixes the problem (no more EFBIG
error replies), then that is the size that if_hw_tsomax can be set
to (just can't change IP_MAXPACKET, but that is defined for other
things). (It just happens that IP_MAXPACKET is what if_hw_tsomax
defaults to. It has no other effect w.r.t. TSO.)

> 
> Additionally, it seems that setting nic TSO will/may be ignored by
> code like this in sys/netinet/tcp_output.c:
> 
Yes, but I don't know why.
The only conjecture I can come up with is that another net driver is
stacked above "ix" and the setting for if_hw_tsomax doesn't propagate
up. (If you look at the commit log message for r251296, the intent
of adding if_hw_tsomax was to allow device drivers to set a smaller
tsomax than IP_MAXPACKET.)

Are you using any of the "stacked" network device drivers like
lagg? I don't even know what the others all are?
Maybe someone else can list them?

rick
> 
> 10.0 Code:
> 
> 780 if (len > tp->t_tsomax - hdrlen) { !!
> 781 len = tp->t_tsomax - hdrlen; !!
> 782 sendalot = 1;
> 783 }
> 
> 
> 
> 
> I've put debugging here, set the nic's max TSO as per Rick's patch (
> set to say 32k), and have seen that tp->t_tsomax == IP_MAXPACKET.
> It's being set someplace else, and thus our attempts to set TSO on
> the nic may be in vain.
> 
> 
> It may have mattered more in 9.2, as I see the code doesn't use
> tp->t_tsomax in some locations, and may actually default to what the
> nic is set to.
> 
> The NIC may still win, I didn't walk through the code to confirm, it
> was enough to suggest to me that setting TSO wouldn't fix this
> issue.
> 
> 
> However, this is still a TSO related issue, it's just not one related
> to the setting of TSO's max size.
> 
> A 10.0-STABLE system with tso disabled on ix0 doesn't have a single
> packet over IP_MAXPACKET in 1 hour of runtime. I'll let it go a bit
> longer to increase confidence in this assertion, but I don't want to
> waste time on this when I could be logging problem packets on a
> system with TSO enabled.
> 
> 
> Comments are very welcome..
> 
> 
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-24 Thread Rick Macklem
Julian Elischer wrote:
- Original Message -
> I wrote (and snipped):
>> Other drivers (and ixgbe for the 82598 chip) can handle a packet that
>> is in more than 32 mbufs. (I think the 82598 handles 100, grep for
>> SCATTER
>> in *.h in sys/dev/ixgbe.)
>>
>
> the Xen backend cannot handle more than 32 segments in some versions
> of Xen.
Btw, I just did a quick find/grep (so I may have missed some), but here
is the list of net devices that appear to support TSO, but limited to
32 transmit segments for at least some supported chips:

jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re

Also, several of these call m_collapse() instead of m_defrag() when
they run into a transmit mbuf list with > 32 elements.
m_collapse() only copies data into free space already present in the
chain's mbufs, so it isn't likely to squeeze the 35 mbuf 64Kbyte NFS I/O
message (whose clusters are already full) into 32 mbufs, and I don't
think these ones will work at all for NFS with default 64K I/O size and
TSO enabled.

rick
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-25 Thread Rick Macklem
Markus Gebert wrote:
> 
> On 25.03.2014, at 02:18, Rick Macklem  wrote:
> 
> > Christopher Forgeron wrote:
> >> 
> >> 
> >> 
> >> This is regarding the TSO patch that Rick suggested earlier. (With
> >> many thanks for his time and suggestion)
> >> 
> >> 
> >> As I mentioned earlier, it did not fix the issue on a 10.0 system.
> >> It
> >> did make it less of a problem on 9.2, but either way, I think it's
> >> not needed, and shouldn't be considered as a patch for
> >> testing/etc.
> >> 
> >> 
> >> Patching TSO to anything other than a max value (and by default
> >> the
> >> code gives it IP_MAXPACKET) is confusing the matter, as the packet
> >> length ultimately needs to be adjusted for many things on the fly
> >> like TCP Options, etc. Using static header sizes won't be a good
> >> idea.
> >> 
> > If you look at tcp_output(), you'll notice that it doesn't do TSO
> > if
> > there are any options. That way it knows that the TCP/IP header is
> > just hdrlen.
> > 
> > If you don't limit the TSO packet (including TCP/IP and ethernet
> > headers)
> > to 64K, then the "ix" driver can't send them, which is the problem
> > you guys are seeing.
> > 
> > There are other ways to fix this problem, but they all may
> > introduce
> > issues that reducing if_hw_tsomax by a small amount does not.
> > For example, m_defrag() could be modified to use 4K pagesize
> > clusters,
> > but this might introduce memory fragmentation problems. (I observed
> > what I think are memory fragmentation problems when I switched NFS
> > to use 4K pagesize clusters for large I/O messages.)
> > 
> > If setting IP_MAXPACKET to 65518 fixes the problem (no more EFBIG
> > error replies), then that is the size that if_hw_tsomax can be set
> > to (just can't change IP_MAXPACKET, but that is defined for other
> > things). (It just happens that IP_MAXPACKET is what if_hw_tsomax
> > defaults to. It has no other effect w.r.t. TSO.)
> > 
> >> 
> >> Additionally, it seems that setting nic TSO will/may be ignored by
> >> code like this in sys/netinet/tcp_output.c:
> >> 
> 
> Is this confirmed or still an ‘it seems’? Have you actually seen a
> tp->t_tsomax value in tcp_output() bigger than if_hw_tsomax or was
> this just speculation because the values are stored in different
> places? (Sorry, if you already stated this in another email, it’s
> currently hard to keep track of all the information.)
> 
> Anyway, this dtrace one-liner should be a good test if other values
> appear in tp->t_tsomax:
> 
> # dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 &&
> args[0]->t_tsomax != 65518 / { printf("unexpected tp->t_tsomax:
> %i\n", args[0]->t_tsomax); stack(); }'
> 
> Remember to adjust the value in the condition to whatever you’re
> currently expecting. The value seems to be 0 for new connections,
> probably when tcp_mss() has not been called yet. So that’s seems
> normal and I have excluded that case too. This will also print a
> kernel stack trace in case it sees an unexpected value.
> 
> 
> > Yes, but I don't know why.
> > The only conjecture I can come up with is that another net driver
> > is
> > stacked above "ix" and the setting for if_hw_tsomax doesn't
> > propagate
> > up. (If you look at the commit log message for r251296, the intent
> > of adding if_hw_tsomax was to allow device drivers to set a smaller
> > tsomax than IP_MAXPACKET.)
> > 
> > Are you using any of the "stacked" network device drivers like
> > lagg? I don't even know what the others all are?
> > Maybe someone else can list them?
> 
> I guess the most obvious are lagg and vlan (and probably carp on
> FreeBSD 9.x or older).
> 
> On request from Jack, we’ve eliminated lagg and vlan from the
> picture, which gives us plain ixgbe interfaces with no stacked
> interfaces on top of it. And we can still reproduce the problem.
> 
This was related to the "did if_hw_tsomax set tp->t_tsomax to the
same value?" question. Since you reported that my patch that set
if_hw_tsomax in the driver didn't fix the problem, that suggests
that tp->t_tsomax isn't being set to if_hw_tsomax from the driver,
but we don't know why?

rick

> 
> Markus
> 
> 
> > 
> > rick
> >> 
> >> 10.0 Code:
> >> 
> >> 780 if (len > tp->t_tsomax - hdrlen) {

Re: 9.2 ixgbe tx queue hang

2014-03-25 Thread Rick Macklem
Markus Gebert wrote:
> 
> On 25.03.2014, at 22:46, Rick Macklem  wrote:
> 
> > Markus Gebert wrote:
> >> 
> >> On 25.03.2014, at 02:18, Rick Macklem 
> >> wrote:
> >> 
> >>> Christopher Forgeron wrote:
> >>>> 
> >>>> 
> >>>> 
> >>>> This is regarding the TSO patch that Rick suggested earlier.
> >>>> (With
> >>>> many thanks for his time and suggestion)
> >>>> 
> >>>> 
> >>>> As I mentioned earlier, it did not fix the issue on a 10.0
> >>>> system.
> >>>> It
> >>>> did make it less of a problem on 9.2, but either way, I think
> >>>> it's
> >>>> not needed, and shouldn't be considered as a patch for
> >>>> testing/etc.
> >>>> 
> >>>> 
> >>>> Patching TSO to anything other than a max value (and by default
> >>>> the
> >>>> code gives it IP_MAXPACKET) is confusing the matter, as the
> >>>> packet
> >>>> length ultimately needs to be adjusted for many things on the
> >>>> fly
> >>>> like TCP Options, etc. Using static header sizes won't be a good
> >>>> idea.
> >>>> 
> >>> If you look at tcp_output(), you'll notice that it doesn't do TSO
> >>> if
> >>> there are any options. That way it knows that the TCP/IP header
> >>> is
> >>> just hdrlen.
> >>> 
> >>> If you don't limit the TSO packet (including TCP/IP and ethernet
> >>> headers)
> >>> to 64K, then the "ix" driver can't send them, which is the
> >>> problem
> >>> you guys are seeing.
> >>> 
> >>> There are other ways to fix this problem, but they all may
> >>> introduce
> >>> issues that reducing if_hw_tsomax by a small amount does not.
> >>> For example, m_defrag() could be modified to use 4K pagesize
> >>> clusters,
> >>> but this might introduce memory fragmentation problems. (I
> >>> observed
> >>> what I think are memory fragmentation problems when I switched
> >>> NFS
> >>> to use 4K pagesize clusters for large I/O messages.)
> >>> 
> >>> If setting IP_MAXPACKET to 65518 fixes the problem (no more EFBIG
> >>> error replies), then that is the size that if_hw_tsomax can be
> >>> set
> >>> to (just can't change IP_MAXPACKET, but that is defined for other
> >>> things). (It just happens that IP_MAXPACKET is what if_hw_tsomax
> >>> defaults to. It has no other effect w.r.t. TSO.)
> >>> 
> >>>> 
> >>>> Additionally, it seems that setting nic TSO will/may be ignored
> >>>> by
> >>>> code like this in sys/netinet/tcp_output.c:
> >>>> 
> >> 
> >> Is this confirmed or still an ‘it seems’? Have you actually seen a
> >> tp->t_tsomax value in tcp_output() bigger than if_hw_tsomax or was
> >> this just speculation because the values are stored in different
> >> places? (Sorry, if you already stated this in another email, it’s
> >> currently hard to keep track of all the information.)
> >> 
> >> Anyway, this dtrace one-liner should be a good test if other
> >> values
> >> appear in tp->t_tsomax:
> >> 
> >> # dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 &&
> >> args[0]->t_tsomax != 65518 / { printf("unexpected tp->t_tsomax:
> >> %i\n", args[0]->t_tsomax); stack(); }'
> >> 
> >> Remember to adjust the value in the condition to whatever you’re
> >> currently expecting. The value seems to be 0 for new connections,
> >> probably when tcp_mss() has not been called yet. So that seems
> >> normal and I have excluded that case too. This will also print a
> >> kernel stack trace in case it sees an unexpected value.
> >> 
> >> 
> >>> Yes, but I don't know why.
> >>> The only conjecture I can come up with is that another net driver
> >>> is
> >>> stacked above "ix" and the setting for if_hw_tsomax doesn't
> >>> propagate
> >>> up. (If you look at the commit log message for r251296, the
> >>> intent
> >>> of adding if_hw_tsomax was to allow device drivers to set a
> >>> smaller

RFC: How to fix the NFS/iSCSI vs TSO problem

2014-03-25 Thread Rick Macklem
Hi,

First off, I hope you don't mind that I cross-posted this,
but I wanted to make sure both the NFS/iSCSI and networking
types see it.
If you look in this mailing list thread:
  
http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
you'll see that several people have been working hard at testing and
thanks to them, I think I now know what is going on.
(This applies to network drivers that support TSO and are limited to 32 transmit
 segments->32 mbufs in chain.) Doing a quick search I found the following
drivers that appear to be affected (I may have missed some):
 jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re

Further, of these drivers, the following use m_collapse() and not m_defrag()
to try and reduce the # of mbufs in the chain. m_collapse() is not going to
get the 35 mbufs down to 32 mbufs, as far as I can see, so these ones are
more badly broken:
 jme, fxp, age, sge, alc, ale, nfe, re

The long description is in the above thread, but the short version is:
- NFS generates a chain with 35 mbufs in it for (read/readdir replies and
  write requests) made up of (tcpip header, RPC header, NFS args, 32 clusters
  of file data)
- tcp_output() usually trims the data size down to tp->t_tsomax (65535) and
  then some more to make it an exact multiple of TCP transmit data size.
  - the net driver prepends an ethernet header, growing the length by 14 (or
    sometimes 18 for vlans), but in the first mbuf and not adding one to the
    chain.
  - m_defrag() copies this to a chain of 32 mbuf clusters (because the total
    data length is <= 64K) and it gets sent

However, if the data length is a little less than 64K when passed to
tcp_output(), so that the length including headers is in the range
65519->65535...
- tcp_output() doesn't reduce its size.
  - the net driver adds an ethernet header, making the total data length
    slightly greater than 64K
  - m_defrag() copies it to a chain of 33 mbuf clusters, which fails with
    EFBIG
--> trainwrecks NFS performance, because the TSO segment is dropped instead
    of sent.

A tester also stated that the problem could be reproduced using iSCSI. Maybe
Edward Napierala might know some details w.r.t. what kind of mbuf chain iSCSI
generates?

Also, one tester has reported that setting if_hw_tsomax in the driver before
the ether_ifattach() call didn't make the value of tp->t_tsomax smaller.
However, reducing IP_MAXPACKET (which is what it is set to by default) did
reduce it. I have no idea why this happens or how to fix it, but it implies
that setting if_hw_tsomax in the driver isn't a solution until this is resolved.

So, what to do about this?
First, I'd like a simple fix/workaround that can go into 9.3. (which is code
freeze in May). The best thing I can think of is setting if_hw_tsomax to a
smaller default value. (Line# 658 of sys/net/if.c in head.)

Version A:
replace
  ifp->if_hw_tsomax = IP_MAXPACKET;
with
  ifp->if_hw_tsomax = min(32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN),
      IP_MAXPACKET);
plus
  replace m_collapse() with m_defrag() in the drivers listed above.

This would only reduce the default from 65535->65518, so it only impacts
the uncommon case where the output size (with tcpip header) is within
this range. (As such, I don't think it would have a negative impact for
drivers that handle more than 32 transmit segments.)
From the testers, it seems that this is sufficient to get rid of the EFBIG
errors. (The total data length including ethernet header doesn't exceed 64K,
so m_defrag() fits it into 32 mbuf clusters.)

The main downside of this is that there will be a lot of m_defrag() calls
being done and they do quite a bit of bcopy()'ng.

Version B:
replace
  ifp->if_hw_tsomax = IP_MAXPACKET;
with
  ifp->if_hw_tsomax = min(29 * MCLBYTES, IP_MAXPACKET);

This one would avoid the m_defrag() calls, but might have a negative
impact on TSO performance for drivers that can handle 35 transmit segments,
since the maximum TSO segment size is reduced by about 6K. (Because of the
second size reduction to an exact multiple of TCP transmit data size, the
exact amount varies.)
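
For a quick sanity check on the two defaults, here is an ordinary user-space
program that just plugs in the constants used above:

  #include <stdio.h>

  #define MCLBYTES             2048
  #define IP_MAXPACKET         65535
  #define ETHER_HDR_LEN        14
  #define ETHER_VLAN_ENCAP_LEN 4

  static int
  min(int a, int b)
  {
      return (a < b ? a : b);
  }

  int
  main(void)
  {
      int a = min(32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN),
          IP_MAXPACKET);                            /* Version A: 65518 */
      int b = min(29 * MCLBYTES, IP_MAXPACKET);     /* Version B: 59392 */

      printf("A: if_hw_tsomax %d, worst-case frame %d (exactly 32 clusters)\n",
          a, a + ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
      printf("B: if_hw_tsomax %d, worst-case frame %d\n",
          b, b + ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
      return (0);
  }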

Possible longer term fixes:
One longer term fix might be to add something like if_hw_tsomaxseg so that
a driver can set a limit on the number of transmit segments (mbufs in chain)
and tcp_output() could use that to limit the size of the TSO segment, as
required. (I have a first stab at such a patch, but no way to test it, so
I can't see that being done by May. Also, it would require changes to a lot
of drivers to make it work. I've attached this patch, in case anyone wants
to work on it?)
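
In case it helps anyone pick that up, here is a tiny user-space sketch of the
idea (not the attached patch itself): clamp the burst by a byte limit and by
a segment limit, assuming the data sits in 2K clusters with three small
header mbufs (TCP/IP, RPC header, NFS args) in front:

  #include <stdio.h>

  #define MCLBYTES 2048

  static int
  tso_clamp(int len, int hdrlen, int tsomax, int tsomaxseg)
  {
      int hdr_mbufs = 3;    /* TCP/IP + RPC header + NFS args */
      int max_data = (tsomaxseg - hdr_mbufs) * MCLBYTES;

      if (len > tsomax - hdrlen)    /* existing byte limit */
          len = tsomax - hdrlen;
      if (len > max_data)           /* new segment-count limit */
          len = max_data;
      return (len);
  }

  int
  main(void)
  {
      /* 32-segment hardware, default 65535 t_tsomax, 52-byte TCP/IP header */
      printf("clamped burst: %d bytes\n", tso_clamp(65536, 52, 65535, 32));
      return (0);
  }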

Another might be to increase the size of MCLBYTES (I don't see this as
practical for 9.3, although the actual change is simple.). I do think
that increasing MCLBYTES might be something to consider doing in the
future, for reasons beyond fixing this.

So, what do others think should be done? rick


___
freebsd-net@f

Re: 9.2 ixgbe tx queue hang

2014-03-25 Thread Rick Macklem
Christopher Forgeron wrote:
> Update:
> 
>  I'm changing my mind, and I believe Rick's TSO patch is fixing
>  things
> (sorry). In looking at my notes, it's possible I had lagg on for
> those
> tests.  lagg does seem to negate the TSO patch in my case.
> 
Ok, that's useful information. It implies that r251296 doesn't quite
work and needs to be fixed for "stacked" network interface drivers
before it can be used. I've cc'd Andre who is the author of that
patch, in case he knows how to fix it.

Thanks for checking this, rick

> kernel.10stable_basicTSO_65535/
> 
> - IP_MAXPACKET = 65535;
> - manually forced (no if statement) ifp->if_hw_tsomax = IP_MAXPACKET
> -
> (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
> - Verified on boot via printf that ifp->if_hw_tsomax = 65517
> - Boot in a NON LAGG environment.  ix0 only.
> 
> ixgbe's printf is showing packets up to 65530. Haven't run long
> enough yet
> to see if anything will go over 65535
> 
> I have this tcpdump running to check packet size.
> tcpdump -ennvvXS -i ix0 greater 65518
> 
> I do expect to get packets over 65518, but I was just curious to see
> if any
> of them would go over 65535. Time will tell.
> 
> In a separate test, If I enable lagg, we have LOTS of oversized
> packet
> problems. It looks like tsomax is definitely not making it through in
> if_lagg.c - Any recommendations there? I will eventually need lagg,
> as I'm
> sure will others.
> 
> With dtrace, it's showing t_tsomax >= 65518. Shouldn't that not be
> happening?
> 
> 
> dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 &&
> args[0]->t_tsomax >= 65518 / { printf("unexpected tp->t_tsomax:
> %i\n",
> args[0]->t_tsomax); stack(); }'
> 
> 
>   6  31403 tcp_output:entry unexpected tp->t_tsomax:
>   65535
> 
>   kernel`tcp_do_segment+0x2c99
>   kernel`tcp_input+0x11a2
>   kernel`ip_input+0xa2
>   kernel`netisr_dispatch_src+0x5e
>   kernel`ether_demux+0x12a
>   kernel`ether_nh_input+0x35f
>   kernel`netisr_dispatch_src+0x5e
>   kernel`bce_intr+0x765
>   kernel`intr_event_execute_handlers+0xab
>   kernel`ithread_loop+0x96
>   kernel`fork_exit+0x9a
>   kernel`0x80c75b2e
> 
>   3  31403 tcp_output:entry unexpected tp->t_tsomax:
>   65535
> 
>   kernel`tcp_do_segment+0x2c99
>   kernel`tcp_input+0x11a2
>   kernel`ip_input+0xa2
>   kernel`netisr_dispatch_src+0x5e
>   kernel`ether_demux+0x12a
>   kernel`ether_nh_input+0x35f
>   kernel`netisr_dispatch_src+0x5e
>   kernel`bce_intr+0x765
>   kernel`intr_event_execute_handlers+0xab
>   kernel`ithread_loop+0x96
>   kernel`fork_exit+0x9a
>   kernel`0x80c75b2e
> 
>   6  31403 tcp_output:entry unexpected tp->t_tsomax:
>   65535
> 
>   kernel`tcp_do_segment+0x2c99
>   kernel`tcp_input+0x11a2
>   kernel`ip_input+0xa2
>   kernel`netisr_dispatch_src+0x5e
>   kernel`ether_demux+0x12a
>   kernel`ether_nh_input+0x35f
>   kernel`netisr_dispatch_src+0x5e
>   kernel`bce_intr+0x765
>   kernel`intr_event_execute_handlers+0xab
>   kernel`ithread_loop+0x96
>   kernel`fork_exit+0x9a
>   kernel`0x80c75b2e
> 
>   1  31403 tcp_output:entry unexpected tp->t_tsomax:
>   65535
> 
>   kernel`tcp_do_segment+0x2c99
>   kernel`tcp_input+0x11a2
>   kernel`ip_input+0xa2
>   kernel`netisr_dispatch_src+0x5e
>   kernel`ether_demux+0x12a
>   kernel`ether_nh_input+0x35f
>   kernel`netisr_dispatch_src+0x5e
>   kernel`bce_intr+0x765
>   kernel`intr_event_execute_handlers+0xab
>   kernel`ithread_loop+0x96
>   kernel`fork_exit+0x9a
>   kernel`0x80c75b2e
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-25 Thread Rick Macklem
Markus Gebert wrote:
> 
> On 26.03.2014, at 00:06, Christopher Forgeron 
> wrote:
> 
> > Update:
> > 
> > I'm changing my mind, and I believe Rick's TSO patch is fixing
> > things
> > (sorry). In looking at my notes, it's possible I had lagg on for
> > those
> > tests.  lagg does seem to negate the TSO patch in my case.
> 
> I’m glad to hear you could check that scenario again. In the other
> email I just sent, I just asked you to redo this test. Now it makes
> perfect sense why you saw oversized packets despite Rick’s
> if_hw_tsomax patch.
> 
> 
> > kernel.10stable_basicTSO_65535/
> > 
> > - IP_MAXPACKET = 65535;
> > - manually forced (no if statement) ifp->if_hw_tsomax =
> > IP_MAXPACKET -
> > (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
> > - Verified on boot via printf that ifp->if_hw_tsomax = 65517
> 
> Is 65517 correct? With Ricks patch, I get this:
> 
> dev.ix.0.hw_tsomax: 65518
> 
> Also the dtrace command you used excludes 65518...
> 
I am using 32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN) which
is 65518. Although IP_MAXPACKET (maximum IP len, not including ethernet header)
is 65535 (largest # that fits in 16bits), the maximum data length
(including ethernet header) that will fit in 32 mbuf clusters is 65536.
(In practice 65517 or anything <= 65518 should fix the problem.)
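For reference, with the default 2K mbuf cluster (MCLBYTES == 2048) the numbers
work out as:

  32 * MCLBYTES                                           = 65536
  32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN)  = 65536 - (14 + 4) = 65518
  IP_MAXPACKET                                            = 65535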

rick

> > - Boot in a NON LAGG environment.  ix0 only.
> > 
> > ixgbe's printf is showing packets up to 65530. Haven't run long
> > enough yet
> > to see if anything will go over 65535
> > 
With the ethernet header length, it can be <= 65536, because that
is 32 * MCLBYTES.

rick

> > I have this tcpdump running to check packet size.
> > tcpdump -ennvvXS -i ix0 greater 65518
> > 
> > I do expect to get packets over 65518, but I was just curious to
> > see if any
> > of them would go over 65535. Time will tell.
> > 
> > In a separate test, If I enable lagg, we have LOTS of oversized
> > packet
> > problems. It looks like tsomax is definitely not making it through
> > in
> > if_lagg.c - Any recommendations there? I will eventually need lagg,
> > as I'm
> > sure will others.
> 
> I think somebody has to invent a way to propagate if_hw_tsomax to
> interfaces on top of each other.
> 
> 
> > With dtrace, it's showing t_tsomax >= 65518. Shouldn't that not be
> > happening?
> 
> Looks like these all come from bce interfaces (bce_intr in the stack
> trace), which probably have another value for if_hw_tsomax.
> 
> 
> Markus
> 
> 
> > dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 &&
> > args[0]->t_tsomax >= 65518 / { printf("unexpected tp->t_tsomax:
> > %i\n",
> > args[0]->t_tsomax); stack(); }'
> > 
> > 
> >  6  31403 tcp_output:entry unexpected tp->t_tsomax:
> >  65535
> > 
> >  kernel`tcp_do_segment+0x2c99
> >  kernel`tcp_input+0x11a2
> >  kernel`ip_input+0xa2
> >  kernel`netisr_dispatch_src+0x5e
> >  kernel`ether_demux+0x12a
> >  kernel`ether_nh_input+0x35f
> >  kernel`netisr_dispatch_src+0x5e
> >  kernel`bce_intr+0x765
> >  kernel`intr_event_execute_handlers+0xab
> >  kernel`ithread_loop+0x96
> >  kernel`fork_exit+0x9a
> >  kernel`0x80c75b2e
> > 
> >  3  31403 tcp_output:entry unexpected tp->t_tsomax:
> >  65535
> > 
> >  kernel`tcp_do_segment+0x2c99
> >  kernel`tcp_input+0x11a2
> >  kernel`ip_input+0xa2
> >  kernel`netisr_dispatch_src+0x5e
> >  kernel`ether_demux+0x12a
> >  kernel`ether_nh_input+0x35f
> >  kernel`netisr_dispatch_src+0x5e
> >  kernel`bce_intr+0x765
> >  kernel`intr_event_execute_handlers+0xab
> >  kernel`ithread_loop+0x96
> >  kernel`fork_exit+0x9a
> >  kernel`0x80c75b2e
> > 
> >  6  31403 tcp_output:entry unexpected tp->t_tsomax:
> >  65535
> > 
> >  kernel`tcp_do_segment+0x2c99
> >  kernel`tcp_input+0x11a2
> >  kernel`ip_input+0xa2
> >  kernel`netisr_dispatch_src+0x5e
> >  kernel`ether_demux+0x12a
> >  kernel`ether_nh_input+0x35f
> >  kernel`netisr_dispatch_src+0x5e
> >  kernel`bce_intr+0x765
> >  kernel`intr_event_execute_handlers+0xab
> >  kernel`ithread_loop+0x96
> >  kernel`fork_exit+0x9a
> >  kernel`0x80c75b2e
> > 
> >  1  31403 tcp_output:entry unexpected tp->t_tsomax:
> >  65535
> > 
> >  kernel`tcp_do_segment+0x2c99
> >  kernel`tcp_input+0x11a2
> >  kernel`ip_input+0xa2
> >  kernel`netisr_dispatch_src+0x5e
> >  kernel`ether_demux+0x12a
> >  kernel`ether_nh_input+0x35f
> >  kernel`netisr_dispatch_src+0x5e
> >  kernel`bce_intr+0x765
> >  kernel`intr_event_execute_handlers+0xab

Re: RFC: How to fix the NFS/iSCSI vs TSO problem

2014-03-26 Thread Rick Macklem
pyu...@gmail.com wrote:
> On Tue, Mar 25, 2014 at 07:10:35PM -0400, Rick Macklem wrote:
> > Hi,
> > 
> > First off, I hope you don't mind that I cross-posted this,
> > but I wanted to make sure both the NFS/iSCSI and networking
> > types say it.
> > If you look in this mailing list thread:
> >   
> > http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
> > you'll see that several people have been working hard at testing
> > and
> > thanks to them, I think I now know what is going on.
> 
> 
> Thanks for your hard work on narrowing down that issue.  I'm too
> busy for $work in these days so I couldn't find time to investigate
> the issue.
> 
> > (This applies to network drivers that support TSO and are limited
> > to 32 transmit
> >  segments->32 mbufs in chain.) Doing a quick search I found the
> >  following
> > drivers that appear to be affected (I may have missed some):
> >  jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re
> > 
> 
> The magic number 32 was chosen long time ago when I implemented TSO
> in non-Intel drivers.  I tried to find optimal number to reduce the
> size kernel stack usage at that time.  bus_dma(9) will coalesce
> with previous segment if possible so I thought the number 32 was
> not an issue.  Not sure current bus_dma(9) also has the same code
> though.  The number 32 is arbitrary one so you can increase
> it if you want.
> 
Well, in the case of "ix" Jack Vogel says it is a hardware limitation.
I can't change drivers that I can't test and don't know anything about
the hardware. Maybe replacing m_collapse() with m_defrag() is an exception,
since I know what that is doing and it isn't hardware related, but I
would still prefer a review by the driver author/maintainer before making
such a change.

If there are drivers that you know can be increased from 32->35 please do
so, since that will not only avoid the EFBIG failures but also avoid a
lot of calls to m_defrag().
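Purely as an illustration of the kind of one-line change meant here (the macro
name is invented; each driver keeps its own constant that sizes its transmit
DMA tag):

/*
 * Illustration only: bumping the driver-local transmit-segment constant
 * lets a full NFS TSO chain (tcp/ip header + RPC header + NFS args +
 * 32 data clusters = 35 mbufs) load without EFBIG or an m_defrag() copy.
 */
#define	XX_MAXTXSEGS	35	/* was 32 */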

> > Further, of these drivers, the following use m_collapse() and not
> > m_defrag()
> > to try and reduce the # of mbufs in the chain. m_collapse() is not
> > going to
> > get the 35 mbufs down to 32 mbufs, as far as I can see, so these
> > ones are
> > more badly broken:
> >  jme, fxp, age, sge, alc, ale, nfe, re
> 
> I guess m_defrag(9) is more optimized for non-TSO packets. You don't
> want to waste CPU cycles to copy the full frame to reduce the
> number of mbufs in the chain.  For TSO packets, m_defrag(9) looks
> better but if we always have to copy a full TSO packet to make TSO
> work, driver writers have to invent better scheme rather than
> blindly relying on m_defrag(9), I guess.
> 
Yes, avoiding m_defrag() calls would be nice. For this issue, increasing
the transmit segment limit from 32->35 does that, if the change can be
done easily/safely.

Otherwise, all I can think of is my suggestion to add something like
if_hw_tsomaxseg which the driver can use to tell tcp_output() the
driver's limit for # of mbufs in the chain.

> > 
> > The long description is in the above thread, but the short version
> > is:
> > - NFS generates a chain with 35 mbufs in it for (read/readdir
> > replies and write requests)
> >   made up of (tcpip header, RPC header, NFS args, 32 clusters of
> >   file data)
> > - tcp_output() usually trims the data size down to tp->t_tsomax
> > (65535) and
> >   then some more to make it an exact multiple of TCP transmit data
> >   size.
> >   - the net driver prepends an ethernet header, growing the length
> >   by 14 (or
> > sometimes 18 for vlans), but in the first mbuf and not adding
> > one to the chain.
> >   - m_defrag() copies this to a chain of 32 mbuf clusters (because
> >   the total data
> > length is <= 64K) and it gets sent
> > 
> > However, if the data length is a little less than 64K when passed
> > to tcp_output()
> > so that the length including headers is in the range
> > 65519->65535...
> > - tcp_output() doesn't reduce its size.
> >   - the net driver adds an ethernet header, making the total data
> >   length slightly
> > greater than 64K
> >   - m_defrag() copies it to a chain of 33 mbuf clusters, which
> >   fails with EFBIG
> > --> trainwrecks NFS performance, because the TSO segment is dropped
> > instead of sent.
> > 
> > A tester also stated that the problem could be reproduced using
> > iSCSI. Maybe
> > Edward Napierala might know some details w.r.t. what kind of mbuf
> > chain iSCSI
> >

Re: 9.2 ixgbe tx queue hang

2014-03-26 Thread Rick Macklem
Christopher Forgeron wrote:
> 
> 
> 
> 
> 
> On Tue, Mar 25, 2014 at 8:21 PM, Markus Gebert <
> markus.geb...@hostpoint.ch > wrote:
> 
> 
> 
> 
> 
> Is 65517 correct? With Ricks patch, I get this:
> 
> dev.ix.0.hw_tsomax: 65518
> 
> 
> 
> Perhaps a difference between 9.2 and 10 for one of the macros? My
> code is:
> 
> ifp->if_hw_tsomax = IP_MAXPACKET - (ETHER_HDR_LEN +
> ETHER_VLAN_ENCAP_LEN);
> printf("CSF - 3 Init, ifp->if_hw_tsomax = %d\n", ifp->if_hw_tsomax);
> 
The difference is simply that IP_MAXPACKET == 65535, but I've been using
32 * MCLBYTES == 65536 (the latter is the amount of data m_defrag() can
squeeze into 32 mbuf clusters).

ie. I've suggested:
ifp->if_hw_tsomax = min(32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN),
    IP_MAXPACKET);
- I put the min() in just so it wouldn't break if MCLBYTES is increased someday.

rick

> 
> (BTW, you should submit the hw_tsomax sysctl patch, that's useful to
> others)
> 
> 
> 
> 
> 
> Also the dtrace command you used excludes 65518...
> 
> 
> 
> Oh, I thought it was giving every packet that is greater than or
> equal to 65518 - Could you show me the proper command? That's the
> third time I've used dtrace, so I'm making this up as I go. :-)
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-26 Thread Rick Macklem
Christopher Forgeron wrote:
> That's interesting. I see here in the r251296 commit Andre says :
> 
>   Drivers can set ifp->if_hw_tsomax before calling ether_ifattach()
>   to
>   change the limit.
> 
>  I wonder if we add your same TSO patch to if_lagg.c before line
>  356's
> ether_ifattach() will fix it.
> 
I think the value(s) for underlying hardware drivers have to somehow
be propagated up through lagg. I haven't looked at the code, so I
don't know what that would be.

Putting the patch for ixgbe.c in lagg wouldn't make sense, since it
doesn't know if the underlying devices have the 32 limit.

I've suggested in the other thread what you suggested in a recent
post...ie. to change the default, at least until the propagation
of driver set values is resolved.
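For what it's worth, a rough, untested sketch of one way the propagation could
look (this is not the lagg.patch mentioned elsewhere in the thread; the
softc/port field names are from memory and should be checked against
if_lagg.c, and locking of the port list is elided):

/*
 * Untested illustration: advertise the smallest member port limit on
 * the lagg interface whenever the port list changes.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/socket.h>
#include <sys/queue.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/if_lagg.h>

static void
lagg_update_hw_tsomax(struct ifnet *lagg_ifp, struct lagg_softc *sc)
{
	struct lagg_port *lp;
	u_int tsomax = 0;

	SLIST_FOREACH(lp, &sc->sc_ports, lp_entries) {
		if (lp->lp_ifp->if_hw_tsomax == 0)
			continue;	/* port driver uses the stack default */
		if (tsomax == 0 || lp->lp_ifp->if_hw_tsomax < tsomax)
			tsomax = lp->lp_ifp->if_hw_tsomax;
	}
	if (tsomax != 0)
		lagg_ifp->if_hw_tsomax = tsomax;
}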

rick

>  Ultimately, it will need to load the if_hw_tsomax from the if below
>  it -
> but then again, if the calculation for ixgbe is good enough for that
> driver, why wouldn't it be good enough for lagg?
> 
>  Unless people think I'm crazy, I'll compile that in at line 356 in
> if_lagg.c and give it a test run tomorrow.
> 
>  This may need to go into vlan and carp as well, I'm not sure yet.
> 
> 
> On Tue, Mar 25, 2014 at 8:16 PM, Rick Macklem 
> wrote:
> 
> > Christopher Forgeron wrote:
> > > Update:
> > >
> > >  I'm changing my mind, and I believe Rick's TSO patch is fixing
> > >  things
> > > (sorry). In looking at my notes, it's possible I had lagg on for
> > > those
> > > tests.  lagg does seem to negate the TSO patch in my case.
> > >
> > Ok, that's useful information. It implies that r251296 doesn't
> > quite
> > work and needs to be fixed for "stacked" network interface drivers
> > before it can be used. I've cc'd Andre who is the author of that
> > patch, in case he knows how to fix it.
> >
> > Thanks for checking this, rick
> >
> > > kernel.10stable_basicTSO_65535/
> > >
> > > - IP_MAXPACKET = 65535;
> > > - manually forced (no if statement) ifp->if_hw_tsomax =
> > > IP_MAXPACKET
> > > -
> > > (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
> > > - Verified on boot via printf that ifp->if_hw_tsomax = 65517
> > > - Boot in a NON LAGG environment.  ix0 only.
> > >
> > > ixgbe's printf is showing packets up to 65530. Haven't run long
> > > enough yet
> > > to see if anything will go over 65535
> > >
> > > I have this tcpdump running to check packet size.
> > > tcpdump -ennvvXS -i ix0 greater 65518
> > >
> > > I do expect to get packets over 65518, but I was just curious to
> > > see
> > > if any
> > > of them would go over 65535. Time will tell.
> > >
> > > In a separate test, If I enable lagg, we have LOTS of oversized
> > > packet
> > > problems. It looks like tsomax is definitely not making it
> > > through in
> > > if_lagg.c - Any recommendations there? I will eventually need
> > > lagg,
> > > as I'm
> > > sure will others.
> > >
> > > With dtrace, it's showing t_tsomax >= 65518. Shouldn't that not
> > > be
> > > happening?
> > >
> > >
> > > dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 &&
> > > args[0]->t_tsomax >= 65518 / { printf("unexpected tp->t_tsomax:
> > > %i\n",
> > > args[0]->t_tsomax); stack(); }'
> > >
> > >
> > >   6  31403 tcp_output:entry unexpected
> > >   tp->t_tsomax:
> > >   65535
> > >
> > >   kernel`tcp_do_segment+0x2c99
> > >   kernel`tcp_input+0x11a2
> > >   kernel`ip_input+0xa2
> > >   kernel`netisr_dispatch_src+0x5e
> > >   kernel`ether_demux+0x12a
> > >   kernel`ether_nh_input+0x35f
> > >   kernel`netisr_dispatch_src+0x5e
> > >   kernel`bce_intr+0x765
> > >   kernel`intr_event_execute_handlers+0xab
> > >   kernel`ithread_loop+0x96
> > >   kernel`fork_exit+0x9a
> > >   kernel`0x80c75b2e
> > >
> > >   3  31403 tcp_output:entry unexpected
> > >   tp->t_tsomax:
> > >   65535
> > >
> > >   kernel`tcp_do_segment+0x2c99
> > >   kernel`tc

Re: RFC: How to fix the NFS/iSCSI vs TSO problem

2014-03-27 Thread Rick Macklem
Christopher Forgeron wrote:
> I'm quite sure the problem is on 9.2-RELEASE, not 9.1-RELEASE or
> earlier,
> as a 9.2-STABLE from last year I have doesn't exhibit the problem.
>  New
> code in if.c at line 660 looks to be what is starting this, which
> makes me
> wonder how TSO was being handled before 9.2.
> 
> I also like Rick's NFS patch for cluster size. I notice an
> improvement, but
> don't have solid numbers yet. I'm still stress testing it as we
> speak.
> 
Unfortunately, this causes problems for small i386 systems, so I
am reluctant to commit it to head. Maybe a variant that is only
enabled for amd64 systems with lots of memory would be ok?

> 
> On Wed, Mar 26, 2014 at 11:44 PM, Marcelo Araujo
> wrote:
> 
> > Hello All,
> >
> >
> > 2014-03-27 8:27 GMT+08:00 Rick Macklem :
> > >
> > > Well, bumping it from 32->35 is all it would take for NFS (can't
> > > comment
> > > w.r.t. iSCSI). ixgbe uses 100 for the 82598 chip and 32 for the
> > > 82599
> > > (just so others aren't confused by the above comment). I
> > > understand
> > > your point was w.r.t. using 100 without blowing the kernel stack,
> > > but
> > > since the testers have been using "ix" with the 82599 chip which
> > > is
> > > limited to 32 transmit segments...
> > >
> > > However, please increase any you know can be safely done from
> > > 32->35,
> > rick
> > >
> > >
> > I have plenty of machines using Intel X540 that is based on 82599
> > chipset.
> > I have applied Rick's patch on ixgbe to check if the packet size is
> > bigger
> > than 65535 or cluster is bigger than 32. So far till now, on
> > FreeBSD
> > 9.1-RELEASE this problem does not happens.
> >
> > Unfortunately all my environment here is based on 9.1-RELEASE, with
> > some
> > merges from 10-RELEASE such like: NFS and IXGBE.
> >
I can't see why it couldn't happen on 9.1 or earlier, since it just
uses IP_MAXPACKET in tcp_output().

However, to make it happen NFS has to do a read reply (server) or
write request (client) that is a little under 64K bytes. Normally
the default will be a full 64K bytes, so for the server side it
would take a read of a file where the EOF is just shy of the 64K
boundary. For the client write it would be a write of a partially
dirtied block where most of the block has been dirtied. (Software
builds generate a lot of partially dirtied blocks, but I don't know
what else would.) For sequential writing it would be a file that
ends just shy of a 64K boundary (similar to the server side) being
written.

I think it is more likely your NFS file activity and not 9.1 vs 9.2
that avoids the problem. (I suspect there are quite a few folk running
NFS on 9.2 or later on these ix chips who don't see the problem as well.)
Fortunately you (Christopher) were able to reproduce it, so the problem
could be isolated.

Thanks everyone for your help with this, rick

> > Also I have applied the patch that Rick sent in another email with
> > the
> > subject 'NFS patch to use pagesize mbuf clusters'. And we can see
> > some
> > performance boost over 10Gbps Intel. However here at the company,
> > we are
> > still doing benchmarks. If someone wants to have my benchmark
> > result, I can
> > send it later.
> >
> > I'm wondering, if this update on ixgbe from 32->35 could be applied
> > also
> > for versions < 9.2. I'm thinking, that this problem arise only on
> > 9-STABLE
> > and consequently on 9.2-RELEASE. And fortunately or not 9.1-RELEASE
> > doesn't
> > share it.
> >
> > Best Regards,
> > --
> > Marcelo Araujo
> > ara...@freebsd.org
> > ___
> > freebsd-net@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to
> > "freebsd-net-unsubscr...@freebsd.org"
> >
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: RFC: How to fix the NFS/iSCSI vs TSO problem

2014-03-27 Thread Rick Macklem
Marcelo Araujo wrote:
> Hello All,
> 
> 
> 2014-03-27 8:27 GMT+08:00 Rick Macklem :
> >
> > Well, bumping it from 32->35 is all it would take for NFS (can't
> > comment
> > w.r.t. iSCSI). ixgbe uses 100 for the 82598 chip and 32 for the
> > 82599
> > (just so others aren't confused by the above comment). I understand
> > your point was w.r.t. using 100 without blowing the kernel stack,
> > but
> > since the testers have been using "ix" with the 82599 chip which is
> > limited to 32 transmit segments...
> >
> > However, please increase any you know can be safely done from
> > 32->35, rick
> >
> >
> I have plenty of machines using Intel X540 that is based on 82599
> chipset.
> I have applied Rick's patch on ixgbe to check if the packet size is
> bigger
> than 65535 or cluster is bigger than 32. So far till now, on FreeBSD
> 9.1-RELEASE this problem does not happens.
> 
> Unfortunately all my environment here is based on 9.1-RELEASE, with
> some
> merges from 10-RELEASE such like: NFS and IXGBE.
> 
> Also I have applied the patch that Rick sent in another email with
> the
> subject 'NFS patch to use pagesize mbuf clusters'. And we can see
> some
> performance boost over 10Gbps Intel. However here at the company, we
> are
> still doing benchmarks. If someone wants to have my benchmark result,
> I can
> send it later.
> 
> I'm wondering, if this update on ixgbe from 32->35 could be applied
> also
> for versions < 9.2. I'm thinking, that this problem arise only on
> 9-STABLE
> and consequently on 9.2-RELEASE. And fortunately or not 9.1-RELEASE
> doesn't
> share it.
> 
My understanding is that the 32 limitation is a hardware one for the
82599. It appears that drivers other than ixgbe.c can be increased
from 32->35, but not ixgbe.c (for the 82599 chips).

rick

> Best Regards,
> --
> Marcelo Araujo
> ara...@freebsd.org
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9.2 ixgbe tx queue hang

2014-03-27 Thread Rick Macklem
Christopher Forgeron wrote:
> 
> 
> 
> 
> 
> 
> On Wed, Mar 26, 2014 at 9:35 PM, Rick Macklem < rmack...@uoguelph.ca
> > wrote:
> 
> 
> 
> 
> I've suggested in the other thread what you suggested in a recent
> post...ie. to change the default, at least until the propagation
> of driver set values is resolved.
> 
> rick
> 
> 
> 
> I wonder if we need to worry about propagating values up from the
> sub-if's - Setting the default in if.c means this is set for all
> if's, and it's a simple 1 line code change. If a specific 'if' needs
> a different value, it can be set before ether_attach() is called.
> 
> 
> I'm more concerned with the equation we use to calculate if_hw_tsomax
> - Are we considering the right variables? Are we thinking on the
> wrong OSI layer for headers?
> 
Well, I'm pragmatic (which means I mostly care about some fix that works),
but it seems to me that:
- The problem is that some TSO enabled network drivers/hardware can only
  handle 32 transmit segments (or 32 mbufs in the chain for the TSO packet
  to be transmitted, if that is clearer).
--> Since the problem is in certain drivers, it seems that those drivers
should be where the long term fix goes.
--> Since some hardware can't handle more than 32, it seems that the
driver should be able to specify that limit, which tcp_output() can
then apply.

I have an untested patch that does this by adding if_hw_tsomaxseg.
(The attachment called tsomaxseg.patch.)

Changing if_hw_tsomax or its default value is just a hack that gets tcp_output()
to apply a limit that the driver can then fix to 32 mbufs in the chain via
m_defrag().

Since if_hw_tsomax (and if_hw_tsomaxseg in the untested patch) aren't
propagated up through lagg, that needs to be fixed.
(Yet another attached untested patch called lagg.patch.)

As I said before, I don't see these patches getting tested/reviewed etc
in time for 9.3, so I think reducing the default value of if_hw_tsomax
is a reasonable short term hack to work around the problem.
(And it sounds like Pyun YongHyeon has volunteered to fix many of the
drivers, where the 32 limit isn't a hardware one.)

rick

--- kern/uipc_sockbuf.c.sav	2014-01-30 20:27:17.0 -0500
+++ kern/uipc_sockbuf.c	2014-01-30 22:12:08.0 -0500
@@ -965,6 +965,39 @@ sbsndptr(struct sockbuf *sb, u_int off, 
 }
 
 /*
+ * Return the first mbuf for the provided offset.
+ */
+struct mbuf *
+sbsndmbuf(struct sockbuf *sb, u_int off, long *first_len)
+{
+	struct mbuf *m;
+
+	KASSERT(sb->sb_mb != NULL, ("%s: sb_mb is NULL", __func__));
+
+	*first_len = 0;
+	/*
+	 * Is off below stored offset? Happens on retransmits.
+	 * If so, just use sb_mb.
+	 */
+	if (sb->sb_sndptr == NULL || sb->sb_sndptroff > off)
+		m = sb->sb_mb;
+	else {
+		m = sb->sb_sndptr;
+		off -= sb->sb_sndptroff;
+	}
+	while (off > 0 && m != NULL) {
+		if (off < m->m_len)
+			break;
+		off -= m->m_len;
+		m = m->m_next;
+	}
+	if (m != NULL)
+		*first_len = m->m_len - off;
+
+	return (m);
+}
+
+/*
  * Drop a record off the front of a sockbuf and move the next record to the
  * front.
  */
--- sys/sockbuf.h.sav	2014-01-30 20:42:28.0 -0500
+++ sys/sockbuf.h	2014-01-30 22:08:43.0 -0500
@@ -153,6 +153,8 @@ int	sbreserve_locked(struct sockbuf *sb,
 	struct thread *td);
 struct mbuf *
 	sbsndptr(struct sockbuf *sb, u_int off, u_int len, u_int *moff);
+struct mbuf *
+	sbsndmbuf(struct sockbuf *sb, u_int off, long *first_len);
 void	sbtoxsockbuf(struct sockbuf *sb, struct xsockbuf *xsb);
 int	sbwait(struct sockbuf *sb);
 int	sblock(struct sockbuf *sb, int flags);
--- netinet/tcp_input.c.sav	2014-01-30 19:37:52.0 -0500
+++ netinet/tcp_input.c	2014-01-30 19:39:07.0 -0500
@@ -3627,6 +3627,7 @@ tcp_mss(struct tcpcb *tp, int offer)
 	if (cap.ifcap & CSUM_TSO) {
 		tp->t_flags |= TF_TSO;
 		tp->t_tsomax = cap.tsomax;
+		tp->t_tsomaxsegs = cap.tsomaxsegs;
 	}
 }
 
--- netinet/tcp_output.c.sav	2014-01-30 18:55:15.0 -0500
+++ netinet/tcp_output.c	2014-01-30 22:18:56.0 -0500
@@ -166,8 +166,8 @@ int
 tcp_output(struct tcpcb *tp)
 {
 	struct socket *so = tp->t_inpcb->inp_socket;
-	long len, recwin, sendwin;
-	int off, flags, error = 0;	/* Keep compiler happy */
+	long len, recwin, sendwin, tso_tlen;
+	int cnt, off, flags, error = 0;	/* Keep compiler happy */
 	struct mbuf *m;
 	struct ip *ip = NULL;
 	struct ipovly *ipov = NULL;
@@ -780,6 +780,24 @@ send:
 			}
 
 			/*
+			 * Limit the number of TSO transmit segments (mbufs
+			 * in mbuf list) to tp->t_tsomaxsegs.
+			 */
+			cnt = 0;
+			m = sbsndmbuf(&so->so_snd, off, &tso_tlen);
+			while (m != NULL && cnt < tp->t_tsomaxsegs &&
+			tso_tlen < len) {
+				if (cnt > 0)
+					tso_tlen += m->m_len;

Re: RFC: How to fix the NFS/iSCSI vs TSO problem

2014-03-31 Thread Rick Macklem
Yonghyeon Pyun wrote:
> On Wed, Mar 26, 2014 at 08:27:48PM -0400, Rick Macklem wrote:
> > pyu...@gmail.com wrote:
> > > On Tue, Mar 25, 2014 at 07:10:35PM -0400, Rick Macklem wrote:
> > > > Hi,
> > > > 
> > > > First off, I hope you don't mind that I cross-posted this,
> > > > but I wanted to make sure both the NFS/iSCSI and networking
> > > > types say it.
> > > > If you look in this mailing list thread:
> > > >   
> > > > http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root
> > > > you'll see that several people have been working hard at
> > > > testing
> > > > and
> > > > thanks to them, I think I now know what is going on.
> > > 
> > > 
> > > Thanks for your hard work on narrowing down that issue.  I'm too
> > > busy for $work in these days so I couldn't find time to
> > > investigate
> > > the issue.
> > > 
> > > > (This applies to network drivers that support TSO and are
> > > > limited
> > > > to 32 transmit
> > > >  segments->32 mbufs in chain.) Doing a quick search I found the
> > > >  following
> > > > drivers that appear to be affected (I may have missed some):
> > > >  jme, fxp, age, sge, msk, alc, ale, ixgbe/ix, nfe, e1000/em, re
> > > > 
> > > 
> > > The magic number 32 was chosen long time ago when I implemented
> > > TSO
> > > in non-Intel drivers.  I tried to find optimal number to reduce
> > > the
> > > size kernel stack usage at that time.  bus_dma(9) will coalesce
> > > with previous segment if possible so I thought the number 32 was
> > > not an issue.  Not sure current bus_dma(9) also has the same code
> > > though.  The number 32 is arbitrary one so you can increase
> > > it if you want.
> > > 
> > Well, in the case of "ix" Jack Vogel says it is a hardware
> > limitation.
> > I can't change drivers that I can't test and don't know anything
> > about
> > the hardware. Maybe replacing m_collapse() with m_defrag() is an
> > exception,
> > since I know what that is doing and it isn't hardware related, but
> > I
> > would still prefer a review by the driver author/maintainer before
> > making
> > such a change.
> > 
> > If there are drivers that you know can be increased from 32->35
> > please do
> > so, since that will not only avoid the EFBIG failures but also
> > avoid a
> > lot of calls to m_defrag().
> > 
> > > > Further, of these drivers, the following use m_collapse() and
> > > > not
> > > > m_defrag()
> > > > to try and reduce the # of mbufs in the chain. m_collapse() is
> > > > not
> > > > going to
> > > > get the 35 mbufs down to 32 mbufs, as far as I can see, so
> > > > these
> > > > ones are
> > > > more badly broken:
> > > >  jme, fxp, age, sge, alc, ale, nfe, re
> > > 
> > > I guess m_defrag(9) is more optimized for non-TSO packets. You
> > > don't
> > > want to waste CPU cycles to copy the full frame to reduce the
> > > number of mbufs in the chain.  For TSO packets, m_defrag(9) looks
> > > better but if we always have to copy a full TSO packet to make
> > > TSO
> > > work, driver writers have to invent better scheme rather than
> > > blindly relying on m_defrag(9), I guess.
> > > 
> > Yes, avoiding m_defrag() calls would be nice. For this issue,
> > increasing
> > the transmit segment limit from 32->35 does that, if the change can
> > be
> > done easily/safely.
> > 
> > Otherwise, all I can think of is my suggestion to add something
> > like
> > if_hw_tsomaxseg which the driver can use to tell tcp_output() the
> > driver's limit for # of mbufs in the chain.
> > 
> > > > 
> > > > The long description is in the above thread, but the short
> > > > version
> > > > is:
> > > > - NFS generates a chain with 35 mbufs in it for (read/readdir
> > > > replies and write requests)
> > > >   made up of (tcpip header, RPC header, NFS args, 32 clusters
> > > >   of
> > > >   file data)
> > > > - tcp_output() usually trims the data size down to tp->t_tsomax
> > > > (65535) and
> > > >   then some more to make it an exact multiple of TCP transmit
&

Re: 9.2 ixgbe tx queue hang

2014-04-02 Thread Rick Macklem
K Simon wrote:
> Hi, Rick,
>    Will these patches be committed to stable soon, or do I have to
>    patch it manually?
> 
Yonghyeon Pyun has already committed the changes for the drivers to
head (making them handle 35 mbufs in the chain instead of 32). I'll
assume those will be in stable in a couple of weeks.

I will be able to commit the one line change that reduces the default
setting for if_hw_tsomax in a couple of weeks, so it should be in stable
in about 1 month.

rick

> Regards
> Simon
> 
> On 14-3-28 6:44, Rick Macklem wrote:
> > Christopher Forgeron wrote:
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Mar 26, 2014 at 9:35 PM, Rick Macklem <
> >> rmack...@uoguelph.ca
> >>> wrote:
> >>
> >>
> >>
> >>
> >> I've suggested in the other thread what you suggested in a recent
> >> post...ie. to change the default, at least until the propagation
> >> of driver set values is resolved.
> >>
> >> rick
> >>
> >>
> >>
> >> I wonder if we need to worry about propagating values up from the
> >> sub-if's - Setting the default in if.c means this is set for all
> >> if's, and it's a simple 1 line code change. If a specific 'if'
> >> needs
> >> a different value, it can be set before ether_attach() is called.
> >>
> >>
> >> I'm more concerned with the equation we use to calculate
> >> if_hw_tsomax
> >> - Are we considering the right variables? Are we thinking on the
> >> wrong OSI layer for headers?
> >>
> > Well, I'm pragmatic (which means I mostly care about some fix that
> > works),
> > but it seems to me that:
> > - The problem is that some TSO enabled network drivers/hardware can
> > only
> >handle 32 transmit segments (or 32 mbufs in the chain for the
> >TSO packet
> >to be transmitted, if that is clearer).
> > --> Since the problem is in certain drivers, it seems that those
> > drivers
> >  should be where the long term fix goes.
> > --> Since some hardware can't handle more than 32, it seems that
> > the
> >  driver should be able to specify that limit, which
> >  tcp_output() can
> >  then apply.
> >
> > I have an untested patch that does this by adding if_hw_tsomaxseg.
> > (The attachment called tsomaxseg.patch.)
> >
> > Changing if_hw_tsomax or its default value is just a hack that gets
> > tcp_output()
> > to apply a limit that the driver can then fix to 32 mbufs in the
> > chain via
> > m_defrag().
> >
> > Since if_hw_tsomax (and if_hw_tsomaxseg in the untested patch)
> > aren't
> > propagated up through lagg, that needs to be fixed.
> > (Yet another attached untested patch called lagg.patch.)
> >
> > As I said before, I don't see these patches getting tested/reviewed
> > etc
> > in time for 9.3, so I think reducing the default value of
> > if_hw_tsomax
> > is a reasonable short term hack to work around the problem.
> > (And it sounds like Pyun YongHyeon has volunteered to fix many of
> > the
> > drivers, where the 32 limit isn't a hardware one.)
> >
> > rick
> >
> >
> >
> > ___
> > freebsd-net@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to
> > "freebsd-net-unsubscr...@freebsd.org"
> >
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: re0: watchdog timeout

2014-04-06 Thread Rick Macklem
Frank Volf wrote:
> 
> Hello,
> 
> I'm experiencing watchdog timeouts with my Realtek interface card.
> 
> I'm using a fairly new system (Shuttle DS47), running FreeBSD
> 10-STABLE.
> For this shuttle a patch has been recently committed to SVN to make
> this
> card work at all (revision 262391).
> 
> The timeout is only experienced under heavy network load (the system
> is
> running a bacula backup server that backs up to NFS-connected
> storage),
> and typically large full backups trigger this. Normal traffic works
> fine
> (this system is e.g. also my firewall to the Internet).
> 
Since you mention NFS, you could try disabling TSO on the interface
and see if that helps. (I'm beginning to feel like a parrot saying this,
but...) If you care about why it might help, read this email thread:
  
http://docs.FreeBSD.org/cgi/mid.cgi?1850411724.1687820.1395621539316.JavaMail.root

If it happens to help, please email again, since there are probably
better ways to fix the problem than disabling TSO.

Good luck with it, rick

> What might not be standard is that I use sub-interfaces on this
> system.
> First of all, the only way that I can get the sub-interfaces to work
> at
> all is by using
> 
>  ifconfig_re0="-vlanhwtag"
> 
> I'm not sure that is related.
> 
> The question is how can we debug this to solve the problem?
> I have no clue, but I'm happy to assist if somebody can tell me what
> I
> should do.
> 
> Some information that might be useful:
> 
> root@drawbridge:/usr/local/etc/bacula # dmesg | grep re0
> re0:  port
> 0xd000-0xd0ff mem 0xf7a0-0xf7a00fff,0xf010-0xf0103fff irq 17
> at
> device 0.0 on pci2
> re0: Using 1 MSI-X message
> re0: ASPM disabled
> re0: Chip rev. 0x4c00
> re0: MAC rev. 0x
> miibus0:  on re0
> re0: Ethernet address: 80:ee:73:77:e9:ab
> re0: watchdog timeout
> re0: link state changed to DOWN
> re0.98: link state changed to DOWN
> re0.10: link state changed to DOWN
> re0.11: link state changed to DOWN
> re0.12: link state changed to DOWN
> re0: link state changed to UP
> re0.98: link state changed to UP
> re0.10: link state changed to UP
> re0.11: link state changed to UP
> re0.12: link state changed to UP
> ...
> 
> root@drawbridge:/usr/local/etc/bacula # uname -a
> FreeBSD drawbridge.internal.deze.org 10.0-STABLE FreeBSD 10.0-STABLE
> #0
> r262433: Mon Feb 24 16:25:35 CET 2014
> r...@drawbridge-new.internal.deze.org:/usr/obj/usr/sources/src10-stable/sys/SHUTTLE
> i386
> 
> root@drawbridge:/usr/local/etc/bacula # pciconf -lv re0
> re0@pci0:2:0:0: class=0x02 card=0x40181297 chip=0x816810ec
> rev=0x0c
> hdr=0x00
>  vendor = 'Realtek Semiconductor Co., Ltd.'
>  device = 'RTL8111/8168B PCI Express Gigabit Ethernet
>  controller'
>  class  = network
>  subclass   = ethernet
> 
> 
> Kind regards,
> 
> Frank
> 
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Network troubles after 8.3 -> 8.4 upgrade

2014-04-17 Thread Rick Macklem
John Nielsen wrote:
> On Apr 17, 2014, at 2:38 PM, Andrea Venturoli  wrote:
> 
> > Three days ago I upgraded an amd64 8.3 box to the latest 8.4.
> > Since then the outside network is misbehaving: large mails are not
> > sent (although small ones are), svn operations will work for a
> > while, then come to a sudden stop, etc...
> > Perhaps the most evident test is "wget"ting a big file: it will
> > download some chunk, halt; restart after a while and download
> > another chunk; lose the connection once again, then restart and so
> > on.
> > 
> > I remember a couple of similar experiences in the past, from which
> > I got out by disabling TSO; however those boxes had fxp cards, while
> > this has an em.
> > In any case disabling TSO did not help.
> 
> My first thought was TSO as well, since I've seen the symptoms you
> describe a few times on systems running 10.0. Do you use IPFW or any
> kind of NAT on this system? When an application encounters a network
> problem, does it report or log anything at all? Anything in the
> kernel log/dmesg?
> 
> A bit of a shot in the dark, but could you try applying r264517
> (fixes a problem with VLAN and TSO interaction)?
> http://svnweb.freebsd.org/base/head/sys/net/if_vlan.c?r1=257241&r2=264517
> 
Since the only net driver that sets if_hw_tsomax is Xen's netfront, the
patch only affects systems that use that at this time. (The bug, which was
also in if_lagg.c, was found during testing of an experimental patch for
a net driver.)

So, I'm pretty sure that patch won't help, rick

> Otherwise my only other thought would be the driver. Can you try
> reverting only the em(4) driver back to 8.3? If that helps it would
> give you both a workaround and a clue for where to look for a
> solution. Build modules and a kernel without em(4) from unmodified
> 8.4 src, load em(4) as a module, confirm that the problem persists.
> Replace the contents of src/sys/dev/e1000, src/sys/modules/em and
> src/sys/conf/files with those from an 8.3 src tree (or otherwise
> revert revision 247430), rebuild em module, unload/reload or reboot,
> see if problem goes away. (Could be somewhat complicated by the fact
> that you also have igb interfaces which also use code from the e1000
> directory, but rather than speculate I'll leave solving that as an
> exercise for someone else.)
> 
> JN
> 
> > This is the relevant part of rc.conf:
> >> cloned_interfaces="lagg0 vlan1 vlan2 vlan3 carp0 carp1 carp3 carp4
> >> carp6 carp7 carp9 carp10"
> >> ifconfig_igb0="up"
> >> ifconfig_igb1="up"
> >> ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1
> >> 192.168.101.4 netmask 255.255.255.0"
> >> ifconfig_lagg0_alias0="inet 192.168.101.101 netmask 0x"
> >> ifconfig_carp0="vhid 1 advskew 100 pass xxx 192.168.101.10"
> >> ifconfig_carp1="vhid 2 pass  192.168.101.10"
> >> ifconfig_em0="up"
> >> ifconfig_vlan1="inet 81.174.30.11 netmask 255.255.255.248 vlan 4
> >> vlandev em0"
> >> ifconfig_vlan2="inet 83.211.188.186 netmask 255.255.255.248 vlan 2
> >> vlandev em0"
> >> ifconfig_vlan3="inet 192.168.2.202 netmask 255.255.255.0 vlan 3
> >> vlandev em0"
> >> ifconfig_carp3="vhid 4 advskew 100 pass  81.174.30.12"
> >> ifconfig_carp4="vhid 5 pass xxx 81.174.30.12"
> >> ifconfig_carp6="vhid 7 advskew 100 pass xx 83.211.188.187"
> >> ifconfig_carp7="vhid 8 pass xxx 83.211.188.187"
> >> ifconfig_carp9="vhid 10 advskew 100 pass  192.168.2.203"
> >> ifconfig_carp10="vhid 11 pass  192.168.2.203"
> >> ifconfig_lo0_alias0="inet 127.0.0.2 netmask 0x"
> >> ifconfig_lo0_alias1="inet 127.0.0.3 netmask 0x"
> >> ifconfig_lo0_alias2="inet 127.0.0.4 netmask 0x"
> > 
> > As you can see the setup is quite complicated, but worked like a
> > charm until the upgrade; actually the internal net (igb+lagg+carp)
> > still does, so this is what points me toward em0, where I cannot
> > seem to get any kind of stability.
> > 
> > The card is
> >> em0@pci0:6:0:0: class=0x02 card=0x10828086 chip=0x107d8086
> >> rev=0x06 hdr=0x00
> >>vendor = 'Intel Corporation'
> >>device = 'PRO/1000 PT'
> >>class  = network
> >>subclass   = ethernet
> > 
> > I tried disabling TSO, RXCSUM, TXCSUM, VLANHWTAG, VLANHWCSUM,
> > VLANHWTSO...
> > I tried putting the card into 10baseT/UTP  mode...
> > I tried sysctl net.inet.tcp.tso=0...
> > 
> > None helped.
> > 
> > Maybe I'm barking up the wrong tree, but nothing is in the logs to
> > help...
> > 
> > Nor did Google or wading through bug reports.
> > 
> > 
> > 
> > Now I could restore the dumps I made before upgrading to 8.4 (but
> > I'd really like to avoid this), try to upgrade even further to 9.2
> > (although this will be a lot of work and I'm not looking forward
> > to it as a shot in the dark), drop in another NIC...
> > What I'd really like, however, is some insight.
> > 
> > Is this a known problem of some sort? Is this card or this driver
> > known to be broken?
> > Is there any way I could get some 

Re: NFS over LAGG / lacp poor performance

2014-04-25 Thread Rick Macklem
Marek Salwerowicz wrote:
> Hi list,
> 
> I have two FreeBSD boxes (both based on SuperMicro X9DRD-7LN4F-JBOD
> motherboard,  with 32GB RAM, 1 CPU :Intel(R) Xeon(R) CPU E5-2640 v2)
> 
> storage1% uname -a
> FreeBSD storage1 9.1-RELEASE-p10 FreeBSD 9.1-RELEASE-p10 #0: Sun Jan
> 12
> 20:11:23 UTC 2014
> r...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
>  amd64
> 
> 
> storage2% uname -a
> FreeBSD storage2 10.0-RELEASE-p1 FreeBSD 10.0-RELEASE-p1 #0: Tue Apr
>  8
> 06:45:06 UTC 2014
> r...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
>  amd64
> 
> 
> 
> They work as NFS storages for VMware ESXi.
> 
> On both there are installed 4 igb 1Gbit NICs, with LACP and 2 VLANs
> (vlan900 is for NFS, vlan14 is for management/general purpose):
> 
> on storage1:
> igb0:  port
> 0xd060-0xd07f mem 0xfbb6-0xfbb7,0xfbb8c000-0xfbb8 irq 42
> at
> device 0.0 on pci5
> igb1:  port
> 0xd040-0xd05f mem 0xfbb4-0xfbb5,0xfbb88000-0xfbb8bfff irq 45
> at
> device 0.1 on pci5
> igb2:  port
> 0xd020-0xd03f mem 0xfbb2-0xfbb3,0xfbb84000-0xfbb87fff irq 44
> at
> device 0.2 on pci5
> igb3:  port
> 0xd000-0xd01f mem 0xfbb0-0xfbb1,0xfbb8-0xfbb83fff irq 46
> at
> device 0.3 on pci5
> 
> on storage2:
> igb0:  port
> 0xd060-0xd07f mem 0xfbb6-0xfbb7,0xfbb8c000-0xfbb8 irq 42
> at
> device 0.0 on pci5
> igb1:  port
> 0xd040-0xd05f mem 0xfbb4-0xfbb5,0xfbb88000-0xfbb8bfff irq 45
> at
> device 0.1 on pci5
> igb2:  port
> 0xd020-0xd03f mem 0xfbb2-0xfbb3,0xfbb84000-0xfbb87fff irq 44
> at
> device 0.2 on pci5
> igb3:  port
> 0xd000-0xd01f mem 0xfbb0-0xfbb1,0xfbb8-0xfbb83fff irq 46
> at
> device 0.3 on pci5
> 
> 
> storage1% ifconfig -a
> igb0: flags=8843 metric 0 mtu
> 9000
>
> options=401bb
> ether 00:25:90:c1:1d:18
> inet6 fe80::225:90ff:fec1:1d18%igb0 prefixlen 64 scopeid 0x1
> nd6 options=29
> media: Ethernet autoselect (1000baseT )
> status: active
> igb1: flags=8843 metric 0 mtu
> 9000
>
> options=401bb
> ether 00:25:90:c1:1d:18
> inet6 fe80::225:90ff:fec1:1d19%igb1 prefixlen 64 scopeid 0x2
> nd6 options=29
> media: Ethernet autoselect (1000baseT )
> status: active
> igb2: flags=8843 metric 0 mtu
> 9000
>
> options=401bb
> ether 00:25:90:c1:1d:18
> inet6 fe80::225:90ff:fec1:1d1a%igb2 prefixlen 64 scopeid 0x3
> nd6 options=29
> media: Ethernet autoselect (1000baseT )
> status: active
> igb3: flags=8843 metric 0 mtu
> 9000
>
> options=401bb
> ether 00:25:90:c1:1d:18
> inet6 fe80::225:90ff:fec1:1d1b%igb3 prefixlen 64 scopeid 0x4
> nd6 options=29
> media: Ethernet autoselect (1000baseT )
> status: active
> lo0: flags=8049 metric 0 mtu 16384
> options=63
> inet6 ::1 prefixlen 128
> inet6 fe80::1%lo0 prefixlen 64 scopeid 0x7
> inet 127.0.0.1 netmask 0xff00
> nd6 options=21
> lagg0: flags=8843 metric 0
> mtu 9000
>
> options=401bb
> ether 00:25:90:c1:1d:18
> inet6 fe80::225:90ff:fec1:1d18%lagg0 prefixlen 64 scopeid 0x8
> nd6 options=21
> media: Ethernet autoselect
> status: active
> laggproto lacp lagghash l2,l3,l4
> laggport: igb3 flags=1c
> laggport: igb2 flags=1c
> laggport: igb1 flags=1c
> laggport: igb0 flags=1c
> vlan14: flags=8843 metric 0
> mtu 9000
> options=103
> ether 00:25:90:c1:1d:18
> inet 192.168.1.65 netmask 0xff00 broadcast 192.168.1.255
> inet6 fe80::225:90ff:fec1:1d18%vlan14 prefixlen 64 scopeid
> 0x9
> nd6 options=29
> media: Ethernet autoselect
> status: active
> vlan: 14 parent interface: lagg0
> vlan900: flags=8843 metric 0
> mtu
> 9000
> options=103
> ether 00:25:90:c1:1d:18
> inet 172.25.25.65 netmask 0xff00 broadcast 172.25.25.255
> inet6 fe80::225:90ff:fec1:1d18%vlan900 prefixlen 64 scopeid
> 0xa
> nd6 options=29
> media: Ethernet autoselect
> status: active
> vlan: 900 parent interface: lagg0
> 
> 
> storage2% ifconfig -a
> igb0: flags=8843 metric 0 mtu
> 9000
>
> options=403bb
> ether 00:25:90:ca:3b:e0
> inet6 fe80::225:90ff:feca:3be0%igb0 prefixlen 64 scopeid 0x1
> nd6 options=29
> media: Ethernet autoselect (1000baseT )
> status: active
> igb1: flags=8843 metric 0 mtu
> 9000
>
> options=403bb
> ether 00:25:90:ca:3b:e0
> inet6 fe80::225:90ff:feca:3be1%igb1 prefixlen 64 scopeid 0x2
> nd6 options=29
> media: Ethernet autoselect (1000baseT )
> status: active
> igb2: flags=8843 metric 0 mtu
> 9000
>
> options=403bb
> ether 00:25:90:ca:3b:e0
> inet6 fe80::225:90ff:feca:3be2%igb2 prefixlen 64 scopeid 0x3
> nd6 options=29
>  

Re: NFS over LAGG / lacp poor performance

2014-04-25 Thread Rick Macklem
Marek Salwerowicz wrote:
> On 2014-04-25 13:48, Rick Macklem wrote:
> > Well, you don't mention what command(s) you are using to transfer
> > the
> > data, but I would guess you have one serial data transfer for each
> > command.
> > (Put another way, if you are only running one command to transfer
> > the data,
> >  there will only be one RPC happening at a time and that will only
> >  use one
> >  network interface.) I don't know anything about lagg, so I can't
> >  comment
> >  related to it, but if there is only one NFS RPC at a time, you'll
> >  only
> >  be transferring one message at a time on the wire.)
> 
> I need to transfer 15 files, each is about 1TB sized.
> 
> From 9.1-RELEASE[storage1] to 10-RELEASE[storage2]
> 
> I have tried to run concurrent 'cp'  and transfer 4 files at the same
> time:
> 
> (executed on storage1)
> # cp -a  file1 /net/storage2/ &
> # cp -a  file2 /net/storage2/ &
> # cp -a  file3 /net/storage2/ &
> # cp -a  file4 /net/storage2/ &
> 
Although I doubt it will make much difference, you might want to
try "dd" with a fairly large blocksize (at least 64K). I don't know
what blocksize "cp" uses and whether or not it does mmap'd file
access. (mmap will only do I/O in page size blocks, so I think it
will be slower.)

> 
> But in fact I did not observe bigger throughput
> 
> Both servers have filesystem exported using NFS, so I can execute
> copy
> on source, or destination.
> Would you recommend running this on source-side, or rather
> destination-side ?
> 
Usually reads run faster than writes for NFS, so I'd try doing the
mounts and running the commands on the destination side.
That is also when "readahead=8" might help some. I'd add that option
to the NFS mount (you can try any value you'd like, up to 16, but if 8
doesn't run faster than the default of 1, it probably isn't worth trying
other values).

> >
> > Adding the mount option "readahead=8" to the machine receiving the
> > data
> > might help, if the data transfer command is being done there. (ie.
> > The machine
> > the data is being copied to has the other one NFS mounted and it is
> > where
> > you are running the data transfer command(s).)
> 
> 
> Regarding what I wrote above - how should I mount the NFS volumes?
> 
As above, I'd use nfsv3,readahead=8 options on the destination as a
starting point.

rick

> Cheers,
> Marek
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: NFS over LAGG / lacp poor performance

2014-04-25 Thread Rick Macklem
Steven Hartland wrote:
> 
> - Original Message -
> From: "Marek Salwerowicz" 
> To: "Steven Hartland" ; "Gerrit Kühn"
> 
> Cc: 
> Sent: Friday, April 25, 2014 2:06 PM
> Subject: Re: NFS over LAGG / lacp poor performance
> 
> 
> >On 2014-04-25 14:55, Steven Hartland wrote:
> >> - Original Message - From: "Marek Salwerowicz"
> >> 
> >>
> >>
> >>> On 2014-04-25 14:01, Gerrit Kühn wrote:
>  Thanks for your input. As far as I understood so far, there
>  should
>  be one
>  igb queue created per cpu core in the system by default (and
>  this is
>  what
>  I see on my system). But my irq rate looks quite high to me (and
>  it is
>  only on one of these queues).
> >>>
> >>>
> >>> My CPU has 8 cores:
> >>>
> >>> http://ark.intel.com/products/75267/Intel-Xeon-Processor-E5-2640-v2-20M-Cache-2_00-GHz
> >>>
> >>>
> >>> So why do I have only 1 queue ?
> >>
> >> What does "sysctl hw.igb.num_queues" report?
> >
> > storage1% sysctl hw.igb.num_queues
> > hw.igb.num_queues: 1
> >>
> >> num_queues does default to 1 for Legacy or MSI so you might be
> >> hitting
> >> that.
> >>
> >> Do you see "Using MSIX interrupts with" in your dmesg?
> > storage% dmesg | grep MSIX
> > igb0: Using MSIX interrupts with 2 vectors
> > igb1: Using MSIX interrupts with 2 vectors
> > igb2: Using MSIX interrupts with 2 vectors
> > igb3: Using MSIX interrupts with 2 vectors
> > igb0: Using MSIX interrupts with 2 vectors
> > igb1: Using MSIX interrupts with 2 vectors
> > igb2: Using MSIX interrupts with 2 vectors
> > igb3: Using MSIX interrupts with 2 vectors
> 
> In that case I believe you've hard coded the number of queues, check
> /boot/loader.conf
> for references to this.
> 
Not really replying to Steve's email, but...

NFS uses a single TCP connection for a mount. I still know nothing
about lagg, but if lagg/lacp requires multiple TCP connections to
spread the load..I'd just switch to using something like ftp, given
you are only moving a few large files.

If you must use NFS, then to get multiple TCP connections, you'll
need to do multiple mounts and then do the file transfers concurrently
over the different mounts.

rick

> Regards
> Steve
> 
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: amd + NFS reconnect = ICMP storm + unkillable process.

2011-06-16 Thread Rick Macklem
> * We try to reconnect again, and again, and again
> * the process in this state is unkillable
> 
If you use the "intr" mount option, then an nfs reconnect
should be killable. I know diddly about amd, so I can't
help beyond that.

rick
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: udp checksum implementation error in FreeBSD 7.2?

2011-06-28 Thread Rick Macklem
Benoit Panizzon wrote:
> Hi
> 
> We are running a DHCP Server on a FreeBSD 7.2-RELEASE-p4 box.
> 
> This works for most of our customers, except ones with some kind of
> SonicWall
> Firewalls. We have analyzed the problem with the sonicwall tech
> support:
> 
> We found the problem being in the sonicwall setting a UDP checksum of
> 0x
> for DHCP Requests.
> 
> According to the RFC this is a valid value and tells the receiving UDP
> stack
> not to check the checksum:
> 
> http://www.faqs.org/rfcs/rfc768.html
> 
> If the value is different from 0x the receiving UDP stack can
> perform a
> checksum check and if this fails, silently drop that packet.
> 
> What we observe is:
> 
> DHCP Request with UDP checksum set => Packet reaches DHCP Daemon and
> is being
> answered.
> DHCP Request with UDP checksum 0x => ICMP Port Unreachable from
> FreeBSD.
> 
> Can someone confirm this non RFC conform behaviour and knows how to
> fix it?
> 
Well, I took a quick look at the sources (which are in sys/netinet/udp_usrreq.c
in a function called udp_input() at about line#300) and it only does the
checksum if it is non-zero. It looks like:
	if (uh->uh_sum) {
		/* ... verify the checksum and silently drop the packet if it is bad ... */
	}
(If you don't have kernel sources handy, you can find them here:
  http://svn.freebsd.org/viewvc/base/releng/7.2)

So, I have no idea why the packet without the checksum doesn't make it
through, but it doesn't appear to be because the checksum field is set to 0.
In fact, if you do "netstat -s", you should see the count for UDP packets
with no checksum increase as it receives them.
If this count isn't increasing when the request with checksum == 0x is
being sent to the FreeBSD box, it isn't getting as far as the udp checksum
calculation. (The code fails a UDP packet with a 0 destination port#, for
example.) Or maybe your network hardware is trying to do the checksum and
then dropping the packet? Look at "ifconfig -a" and if RXCSUM is enabled,
you could try disabling it with "-rxcsum" on the ifconfig command line.

Otherwise, all I can suggest is good old fashioned printf()s in the
udp_input() function to try and figure out why the packet is being dropped?
(Oh, this assumes you've already looked at the packet via wireshark or
tcpdump to make sure that the UDP packet looks ok otherwise when it has
the checksum == 0x.)
> As I understand, setting net.inet.udp.checksum to zero would not fix
> the
> problem, as this is only for packet generation.
> 
Yes, the code shouldn't ever try and calculate a UDP checksum when it's
0 in the packet.

Maybe others have better suggestions, rick
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: LOR with nfsclient "sillyrename"

2011-07-22 Thread Rick Macklem
Kostik Belousov wrote:
> On Fri, Jul 22, 2011 at 08:55:10AM -0400, John Baldwin wrote:
> > On Thursday, July 21, 2011 4:19:59 pm Jeremiah Lott wrote:
> > > We're seeing nfsclient deadlocks with what looks like lock order
> > > reversal after removing a "silly rename". It is fairly rare, but
> > > we've seen it
> > happen a few times. I included relevant back traces from an
> > occurrence. From what I can see, nfs_inactive() is called with the
> > vnode locked. If
> > there is a silly-rename, it will call vrele() on its parent
> > directory, which can potentially try to lock the parent directory.
> > Since this is the
> > opposite order of the lock acquisition in lookup, it can deadlock.
> > This happened in a FreeBSD7 build, but I looked through freebsd head
> > and
> > didn't see any change that addressed this. Anyone seen this before?
> >
> > I haven't seen this before, but your analysis looks correct to me.
> >
> > Perhaps the best fix would be to defer the actual freeing of the
> > sillyrename
> > to an asynchronous task? Maybe something like this (untested,
> > uncompiled):
> >
> > Index: nfsclient/nfsnode.h
> > ===
> > --- nfsclient/nfsnode.h (revision 224254)
> > +++ nfsclient/nfsnode.h (working copy)
> > @@ -36,6 +36,7 @@
> >  #ifndef _NFSCLIENT_NFSNODE_H_
> >  #define _NFSCLIENT_NFSNODE_H_
> >
> > +#include 
> >  #if !defined(_NFSCLIENT_NFS_H_) && !defined(_KERNEL)
> >  #include 
> >  #endif
> > @@ -45,8 +46,10 @@
> >   * can be removed by nfs_inactive()
> >   */
> >  struct sillyrename {
> > + struct task s_task;
> > struct ucred *s_cred;
> > struct vnode *s_dvp;
> > + struct vnode *s_vp;
> > int (*s_removeit)(struct sillyrename *sp);
> > long s_namlen;
> > char s_name[32];
> > Index: nfsclient/nfs_vnops.c
> > ===
> > --- nfsclient/nfs_vnops.c (revision 224254)
> > +++ nfsclient/nfs_vnops.c (working copy)
> > @@ -1757,7 +1757,6 @@
> >  {
> > /*
> >  * Make sure that the directory vnode is still valid.
> > - * XXX we should lock sp->s_dvp here.
> >  */
> > if (sp->s_dvp->v_type == VBAD)
> > return (0);
> > @@ -2754,8 +2753,10 @@
> > M_NFSREQ, M_WAITOK);
> > sp->s_cred = crhold(cnp->cn_cred);
> > sp->s_dvp = dvp;
> > + sp->s_vp = vp;
> > sp->s_removeit = nfs_removeit;
> > VREF(dvp);
> > + vhold(vp);
> >
> > /*
> >  * Fudge together a funny name.
> > Index: nfsclient/nfs_node.c
> > ===
> > --- nfsclient/nfs_node.c (revision 224254)
> > +++ nfsclient/nfs_node.c (working copy)
> > @@ -47,6 +47,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >
> >  #include 
> > @@ -185,6 +186,26 @@
> > return (0);
> >  }
> >
> > +static void
> > +nfs_freesillyrename(void *arg, int pending)
> > +{
> > + struct sillyrename *sp;
> > +
> > + sp = arg;
> > + vn_lock(sp->s_dvp, LK_SHARED | LK_RETRY);
> I think taking an exclusive lock is somewhat more clean.
> > + vn_lock(sp->s_vp, LK_EXCLUSIVE | LK_RETRY);
> I believe that you have to verify that at least dvp is not doomed.
> 
> Due to this, I propose to only move the vrele() call to taskqueue.

Yes. I was thinking that it would be simpler (and I'm a chicken about
changing more than I have to for these kinds of things:-) to just defer
the vrele(). I wasn't sure that holding onto "vp" when it was being
recycled was such a good plan, although I'm not saying it would actually
break anything. (As I understand it, VOP_INACTIVE() sometimes gets delayed
until just before VOP_RECLAIM() and doing a VHOLD(vp) in there just seems
like it's asking for trouble?;-)

I'll post with a patch, once I've tested something.

> > + (void)nfs_vinvalbuf(ap->a_vp, 0, td, 1);
> > + /*
> > + * Remove the silly file that was rename'd earlier
> > + */
> > + (sp->s_removeit)(sp);
> > + crfree(sp->s_cred);
> > + vput(sp->s_dvp);
> > + VOP_UNLOCK(sp->s_vp, 0);
> > + vdrop(sp->s_vp);
> > + free((caddr_t)sp, M_NFSREQ);
> > +}
> > +
> >  int
> >  nfs_inactive(struct vop_inactive_args *ap)
> >  {
> > @@ -200,15 +221,9 @@
> > } else
> > sp = NULL;
> > if (sp) {
> > + TASK_INIT(&sp->task, 0, nfs_freesillyrename, sp);
> > + taskqueue_enqueue(taskqueue_thread, &sp->task);
> > mtx_unlock(&np->n_mtx);
> > - (void)nfs_vinvalbuf(ap->a_vp, 0, td, 1);
> > - /*
> > - * Remove the silly file that was rename'd earlier
> > - */
> > - (sp->s_removeit)(sp);
> > - crfree(sp->s_cred);
> > - vrele(sp->s_dvp);
> > - free((caddr_t)sp, M_NFSREQ);
> > mtx_lock(&np->n_mtx);
> > }
> > np->n_flag &= NMODIFIED;
> >
Thanks everyone, for the helpful suggestions, rick

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: LOR with nfsclient "sillyrename"

2011-07-22 Thread Rick Macklem
Jeremiah Lott wrote:
> We're seeing nfsclient deadlocks with what looks like lock order
> reversal after removing a "silly rename". It is fairly rare, but we've
> seen it happen a few times. I included relevant back traces from an
> occurrence. From what I can see, nfs_inactive() is called with the
> vnode locked. If there is a silly-rename, it will call vrele() on its
> parent directory, which can potentially try to lock the parent
> directory. Since this is the opposite order of the lock acquisition in
> lookup, it can deadlock. This happened in a FreeBSD7 build, but I
> looked through freebsd head and didn't see any change that addressed
> this. Anyone seen this before?
> 
> Jeremiah Lott
> Avere Systems
> 
Please try the attached patch (which is also at):
  http://people.freebsd.org/~rmacklem/oldsilly.patch
  http://people.freebsd.org/~rmacklem/newsilly.patch
(for the old and new clients in -current, respectively)

- I think oldsilly.patch should apply to the 7.n kernel
  sources, although you might have to do the edit by hand?

The patch is based on what jhb@ posted, with changes as recommended
by kib@.

Please let me know how testing goes with it, rick
ps: Kostik, could you please review this, thanks.

--- nfsclient/nfsnode.h.sav	2011-07-22 15:31:11.0 -0400
+++ nfsclient/nfsnode.h	2011-07-22 15:32:54.0 -0400
@@ -36,6 +36,7 @@
 #ifndef _NFSCLIENT_NFSNODE_H_
 #define _NFSCLIENT_NFSNODE_H_
 
+#include 
 #if !defined(_NFSCLIENT_NFS_H_) && !defined(_KERNEL)
 #include 
 #endif
@@ -45,6 +46,7 @@
  * can be removed by nfs_inactive()
  */
 struct sillyrename {
+	struct	task s_task;
 	struct	ucred *s_cred;
 	struct	vnode *s_dvp;
 	int	(*s_removeit)(struct sillyrename *sp);
--- nfsclient/nfs_node.c.sav	2011-07-22 15:33:04.0 -0400
+++ nfsclient/nfs_node.c	2011-07-22 16:31:45.0 -0400
@@ -47,6 +47,7 @@ __FBSDID("$FreeBSD: head/sys/nfsclient/n
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -59,6 +60,8 @@ __FBSDID("$FreeBSD: head/sys/nfsclient/n
 
 static uma_zone_t nfsnode_zone;
 
+static void	nfs_freesillyrename(void *arg, __unused int pending);
+
 #define TRUE	1
 #define	FALSE	0
 
@@ -185,6 +188,20 @@ nfs_nget(struct mount *mntp, nfsfh_t *fh
 	return (0);
 }
 
+/*
+ * Do the vrele(sp->s_dvp) as a separate task in order to avoid a
+ * deadlock because of a LOR when vrele() locks the directory vnode.
+ */
+static void
+nfs_freesillyrename(void *arg, __unused int pending)
+{
+	struct sillyrename *sp;
+
+	sp = arg;
+	vrele(sp->s_dvp);
+	free(sp, M_NFSREQ);
+}
+
 int
 nfs_inactive(struct vop_inactive_args *ap)
 {
@@ -207,8 +224,8 @@ nfs_inactive(struct vop_inactive_args *a
 		 */
 		(sp->s_removeit)(sp);
 		crfree(sp->s_cred);
-		vrele(sp->s_dvp);
-		free((caddr_t)sp, M_NFSREQ);
+		TASK_INIT(&sp->s_task, 0, nfs_freesillyrename, sp);
+		taskqueue_enqueue(taskqueue_thread, &sp->s_task);
 		mtx_lock(&np->n_mtx);
 	}
 	np->n_flag &= NMODIFIED;
--- fs/nfsclient/nfsnode.h.sav2	2011-07-22 15:42:14.0 -0400
+++ fs/nfsclient/nfsnode.h	2011-07-22 15:43:25.0 -0400
@@ -35,11 +35,14 @@
 #ifndef _NFSCLIENT_NFSNODE_H_
 #define	_NFSCLIENT_NFSNODE_H_
 
+#include 
+
 /*
  * Silly rename structure that hangs off the nfsnode until the name
  * can be removed by nfs_inactive()
  */
 struct sillyrename {
+	struct	task s_task;
 	struct	ucred *s_cred;
 	struct	vnode *s_dvp;
 	long	s_namlen;
--- fs/nfsclient/nfs_clnode.c.sav2	2011-07-22 15:43:40.0 -0400
+++ fs/nfsclient/nfs_clnode.c	2011-07-22 16:32:53.0 -0400
@@ -47,6 +47,7 @@ __FBSDID("$FreeBSD: head/sys/fs/nfsclien
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -65,6 +66,8 @@ MALLOC_DECLARE(M_NEWNFSREQ);
 
 uma_zone_t newnfsnode_zone;
 
+static void	nfs_freesillyrename(void *arg, __unused int pending);
+
 void
 ncl_nhinit(void)
 {
@@ -186,6 +189,20 @@ ncl_nget(struct mount *mntp, u_int8_t *f
 	return (0);
 }
 
+/*
+ * Do the vrele(sp->s_dvp) as a separate task in order to avoid a
+ * deadlock because of a LOR when vrele() locks the directory vnode.
+ */
+static void
+nfs_freesillyrename(void *arg, __unused int pending)
+{
+	struct sillyrename *sp;
+
+	sp = arg;
+	vrele(sp->s_dvp);
+	free(sp, M_NEWNFSREQ);
+}
+
 int
 ncl_inactive(struct vop_inactive_args *ap)
 {
@@ -220,8 +237,8 @@ ncl_inactive(struct vop_inactive_args *a
 		 */
 		ncl_removeit(sp, vp);
 		crfree(sp->s_cred);
-		vrele(sp->s_dvp);
-		FREE((caddr_t)sp, M_NEWNFSREQ);
+		TASK_INIT(&sp->s_task, 0, nfs_freesillyrename, sp);
+		taskqueue_enqueue(taskqueue_thread, &sp->s_task);
 		mtx_lock(&np->n_mtx);
 	}
 	np->n_flag &= NMODIFIED;
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: LOR with nfsclient "sillyrename"

2011-07-31 Thread Rick Macklem
Jeremiah Lott wrote:
> On Jul 22, 2011, at 5:11 PM, Rick Macklem wrote:
> 
> > Please try the attached patch (which is also at):
> >  http://people.freebsd.org/~rmacklem/oldsilly.patch
> >  http://people.freebsd.org/~rmacklem/newsilly.patch
> > (for the old and new clients in -current, respectively)
> >
> > - I think oldsilly.patch should apply to the 7.n kernel
> >  sources, although you might have to do the edit by hand?
> 
> It applied with minimal futzing.
> 
Just to clarify.. Was there anything other than different line #s
needed? (If I am going to MFC it back to stable/7, I'll need to know,
since I don't currently have a stable/7 system installed to test with.)

Thanks for letting me know how it's going, rick

> > Please let me know how testing goes with it, rick
> 
> Unfortunately we've never reproduced the original problem in the lab.
> Only in the field under heavy stress. I did build a kernel with the
> patch and run it under some of our tests, it seems to work correctly.
> We'll continue to test it, but I wanted to give you an update. Thanks
> a lot for your help.
> 
> Jeremiah Lott
> Avere Systems
> 
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: amd + NFS reconnect = ICMP storm + unkillable process.

2011-08-25 Thread Rick Macklem
Artem Belevich wrote:
> On Wed, Jul 6, 2011 at 4:50 AM, Martin Birgmeier 
> wrote:
> > Hi Artem,
> >
> > I have exactly the same problem as you are describing below, also
> > with quite
> > a number of amd mounts.
> >
> > In addition to the scenario you describe, another way this happens
> > here
> > is when downloading a file via firefox to a directory currently open
> > in
> > dolphin (KDE file manager). This will almost surely trigger the
> > symptoms
> > you describe.
> >
> > I've had 7.4 running on the box before, now with 8.2 this has
> > started to
> > happen.
> >
> > Alas, I don't have a solution.
> 
> I may be on to something. Here's what seems to be happening in my
> case:
> 
> * Process, that's in the middle of a syscall accessing amd mountpoint
> gets interrupted.
> * If the syscall was restartable, msleep at the beginning of
> get_reply: loop in clnt_dg_call() would return ERESTART.
> * ERESTART will result in clnt_dg_call() returning with RPC_CANTRECV
> * clnt_reconnect_call() then will try to reconnect, and msleep will
> eventually fail with ERESTART in clnt_dg_call() again and the whole
> thing will be repeating for a while.
> 
Btw, I fixed exactly the same issue for the TCP code (clnt_vc.c) in
r221127, so I wouldn't be surprised if the UDP code suffers the same
problem. I'll take a look at your patch tomorrow. You could also try
a TCP mount and see if the problem goes away. (For TCP on a pre-r221127
system, the symptom would be a client thread looping in the kernel in
"R" state.)

I'll look tomorrow, but it sounds like you've figured it out. Looks like
a good catch to me at this point, rick

> I'm not familiar enough with the RPC code, but looking and clnt_vc.c
> and other RPC places, it appears that both EINTR and ERESTART should
> translate into RPC_INTR error. However in clnt_dg.c that's not the
> case and that's what seems to make amd-mounted accesses hang.
> 
> Following patch (against RELENG-8 @ r225118) seems to have fixed the
> issue for me. With the patch I no longer see the hangs or ICMP storms
> on the test case that could reliably reproduce the issue within
> minutes. Let me know if it helps in your case.
> 
> --- a/sys/rpc/clnt_dg.c
> +++ b/sys/rpc/clnt_dg.c
> @@ -636,7 +636,7 @@ get_reply:
> */
> if (error != EWOULDBLOCK) {
> errp->re_errno = error;
> - if (error == EINTR)
> + if (error == EINTR || error == ERESTART)
> errp->re_status = stat = RPC_INTR;
> else
> errp->re_status = stat = RPC_CANTRECV;
> 
> --Artem
> 
> >
> > We should probably file a PR, but I don't even know where to assign
> > it to.
> > Amd does not seem much maintained, it's probably using some
> > old-style
> > mounts (it never mounts anything via IPv6, for example).
> >
> > Regards,
> >
> > Martin
> >
> >> Hi,
> >>
> >> I wonder if someone else ran into this issue before and, maybe,
> >> have a
> >> solution.
> >>
> >> I've been running into a problem where access to filesystems mouted
> >> with amd wedges processes in an unkillable state and produces ICMP
> >> storm on loopback interface.I've managed to narrow down to NFS
> >> reconnect, but that's when I ran out of ideas.
> >>
> >> Usually the problem happens when I abort a parallel build job in an
> >> i386 jail on FreeBSD-8/amd64 (r223055). When the build job is
> >> killed
> >> now and then I end up with one process consuming 100% of CPU time
> >> on
> >> one of the cores. At the same time I get a lot of messages on the
> >> console saying "Limiting icmp unreach response from 49837 to 200
> >> packets/sec" and the loopback traffic goes way up.
> >>
> >> As far as I can tell here's what's happening:
> >>
> >> * My setup uses a lot of filesystems mounted by amd.
> >> * amd itself pretends to be an NFS server running on the localhost
> >> and
> >> serving requests for amd mounts.
> >> * Now and then amd seems to change the ports it uses. Beats me why.
> >> * the problem seems to happen when some process is about to access
> >> amd
> >> mountpoint when amd instance disappears from the port it used to
> >> listen on. In my case it does correlate with interrupted builds,
> >> but I
> >> have no clue why.
> >> * NFS client detects disconnect and tries to reconnect using the
> >> same
> >> destination port.
> >> * That generates ICMP response as port is unreachable and it
> >> reconnect
> >> call returns almost immediatelly.
> >> * We try to reconnect again, and again, and again
> >> * the process in this state is unkillable
> >>
> >> Here's what the stack of the 'stuck' process looks like in those
> >> rare
> >> moments when it gets to sleep:
> >> 18779 100511 collect2 - mi_switch+0x176
> >> turnstile_wait+0x1cb _mtx_lock_sleep+0xe1
> >> sleepq_catch_signals+0x386
> >> sleepq_timedwait_sig+0x19 _sleep+0x1b1 clnt_dg_call+0x7e6
> >> clnt_reconnect_call+0x12e nfs_request+0x212 nfs_getattr+0x2e4
> >> VOP_GETATTR_APV+0x44 nfs_bioread+0x42a VOP_READLINK_APV+0x4a
> >> namei+0x4f9 kern_statat_vnhook+0x92 kern_statat+0x15
> >> freebsd32_stat+0x2e syscallente

Re: amd + NFS reconnect = ICMP storm + unkillable process.

2011-08-26 Thread Rick Macklem
Artem Belevich wrote:
> On Thu, Aug 25, 2011 at 6:24 PM, Rick Macklem 
> wrote:
> > Btw, I fixed exactly the same issue for the TCP code (clnt_vc.c) in
> > r221127, so I wouldn't be surprised if the UDP code suffers the same
> 
> The code in clnt_vc.c was exactly what made me wonder about treatment
> of ERESTART.
> 
> > problem. I'll take a look at your patch tomorrow. You could also try
> > a TCP mount and see if the problem goes away. (For TCP on a
> > pre-r221127
> > system, the symptom would be a client thread looping in the kernel
> > in
> > "R" state.)
> 
> In my case the process was also stuck in unkillable running state
> because the process never returns from the syscall.
> 
> Unfortunately amd itself seems to handle NFS requests for its own
> top-level mountpoints only via UDP. At least I haven't found a way to
> do so without hacking rather convoluted amd code.
> 
> > I'll look tomorrow, but it sounds like you've figured it out. Looks
> > like
> > a good catch to me at this point, rick
> 
> Let me know if you're OK with the patch and I'll commit to head and
> MFC it to stable/8.
> 
The patch looks good to me. The only thing is that *maybe* it should
also do the same for the other msleep() higher up in clnt_dg_call()?
(It seems to me that if this msleep() were to return ERESTART, the same
 kernel loop would occur.)

Here's this variant of the patch (I'll let you decide which to commit).

Good work tracking this down, rick

--- rpc/clnt_dg.c.sav   2011-08-26 14:44:27.0 -0400
+++ rpc/clnt_dg.c   2011-08-26 14:48:07.0 -0400
@@ -467,7 +467,10 @@ send_again:
cu->cu_waitflag, "rpccwnd", 0);
if (error) {
errp->re_errno = error;
-   errp->re_status = stat = RPC_CANTSEND;
+   if (error == EINTR || error == ERESTART)
+   errp->re_status = stat = RPC_INTR;
+   else
+   errp->re_status = stat = RPC_CANTSEND;
goto out;
}
}
@@ -636,7 +639,7 @@ get_reply:
 */
if (error != EWOULDBLOCK) {
errp->re_errno = error;
-   if (error == EINTR)
+   if (error == EINTR || error == ERESTART)
errp->re_status = stat = RPC_INTR;
else
errp->re_status = stat = RPC_CANTRECV;

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: amd + NFS reconnect = ICMP storm + unkillable process.

2011-08-27 Thread Rick Macklem
Martin Birgmeier wrote:
> Thank you for these patches.
> 
> One interesting thing: I was trying to backport them to 7.4.0 and
> RELENG_7, too, but there the portion of the code dealing with the
> RPC_CANTSEND case does not exist. On the other hand, the problem
> surfaced (for me) when upgrading from 7.4 to 8.2. So could one
> probably
> conclude that it is more the write case which leads to the erroneous
> behavior?
> 
Well, the kernel rpc code isn't used by NFS in FreeBSD7.n. The only
thing that uses it in FreeBSD7.n is the NLM (Network Lock Manager).

As such, I don't think the patch is critical for FreeBSD7.n, rick

> Regards,
> 
> Martin
> 
> On 08/26/11 21:19, Artem Belevich wrote:
> > On Fri, Aug 26, 2011 at 12:04 PM, Rick Macklem
> > wrote:
> >> The patch looks good to me. The only thing is that *maybe* it
> >> should
> >> also do the same for the other msleep() higher up in
> >> clnt_dg_call()?
> >> (It seems to me that if this msleep() were to return ERESTART, the
> >> same
> >>   kernel loop would occur.)
> >>
> >> Here's this variant of the patch (I'll let you decide which to
> >> commit).
> >>
> >> Good work tracking this down, rick
> >>
> >> --- rpc/clnt_dg.c.sav 2011-08-26 14:44:27.0 -0400
> >> +++ rpc/clnt_dg.c 2011-08-26 14:48:07.0 -0400
> >> @@ -467,7 +467,10 @@ send_again:
> >> cu->cu_waitflag, "rpccwnd", 0);
> >> if (error) {
> >> errp->re_errno = error;
> >> - errp->re_status = stat = RPC_CANTSEND;
> >> + if (error == EINTR || error == ERESTART)
> >> + errp->re_status = stat = RPC_INTR;
> >> + else
> >> + errp->re_status = stat = RPC_CANTSEND;
> >> goto out;
> >> }
> >> }
> > You're right. I'll add the change to the commit.
> >
> > --Artem
> >
> >> @@ -636,7 +639,7 @@ get_reply:
> >>  */
> >> if (error != EWOULDBLOCK) {
> >> errp->re_errno = error;
> >> - if (error == EINTR)
> >> + if (error == EINTR || error == ERESTART)
> >> errp->re_status = stat = RPC_INTR;
> >> else
> >> errp->re_status = stat =
> >> RPC_CANTRECV;
> >>
> >>
> > ___
> > freebsd-net@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to
> > "freebsd-net-unsubscr...@freebsd.org"
> >
> >
> >
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Gigabit Ethernet performance with Realtek 8111E

2011-11-05 Thread Rick Macklem
bevan wrote:
> Hi!
> 
> I've got a small NAS with Intel D525MW (Atom) board inside using
> FreeBSD
> 9.0-RC1 as operating system. It has an onboard Realtek 8111E ethernet
> adapter. I'm experiencing heavy performance problems when transferring
> files from a specific PC in my network to that NAS. I did the following
> tests by transferring large amounts of data between the different machines
> (using dd and nc):
> 
> NAS -> Linux1: ~ 400Mbit/s
> NAS -> Linux2: ~ 400Mbit/s
> Linux1 -> NAS: heavy fluctuation, between 700Mbit/s and 0bit/s
> Linux2 -> NAS: ~ 400Mbit/s
> Linux1 -> Linux2: ~ 400Mbit/s
> Linux2 -> Linux1: ~ 400Mbit/s
> 
> As you can see everything works fine except for transferring data from
> Linux1 to that NAS box. The following graph shows the problem:
> http://dl.dropbox.com/u/25455527/network-problems.png
> 
> While the transfer rate drops to zero the NAS also has a very bad ping
> up to one second. Ping of Linux1 is perfectly fine during these
> outages.
> 
> I also had a quick look on the data stream with wireshark on Linux1
> and
> it shows a lot of TCP Dup ACK (up to 263 Dup ACKs created by NAS for
> one
> frame).
> 
> What can be eliminated as a cause is:
> - Switch (I tried connecting Linux1 and NAS directly)
> - Cable (I changed that a few times)
> - Harddisk I/O (I'm only writing from /dev/zero to /dev/null)
> 
> The severity of that problem varies from one minute to another but can
> always be reproduced with a few tries.
> 
> When limiting either NAS or Linux1 to 100Mbit I'm getting a steady
> transfer rate of about 90Mbit/s.
> When decreasing the MTU on NAS to 1200 the problem seems to disappear,
> getting a transfer rate of about 160Mbit/s.
> 
> ifconfig re0:
> > re0: flags=8843 metric 0 mtu
> > 1500
> > 
> > options=388b
> > ether 38:60:77:3e:af:a5
> > inet 192.168.178.54 netmask 0xff00 broadcast 192.168.178.255
> > nd6 options=29
> > media: Ethernet autoselect (1000baseT )
> > status: active
> 
try typing:
# sysctl dev.re.0.stats=1
- this will dump out the stats on the chip
  if the "Rx missed frames" count is non-zero, you're probably snookered,
  to put it technically:-)
  - That's what I get for a re chip in this laptop and I haven't found
a way around it. I just live with flakey net performance.

rick

> pciconf -lv:
> > re0@pci0:1:0:0: class=0x02 card=0xd6258086 chip=0x816810ec
> > rev=0x06 hdr=0x00
> > vendor = 'Realtek Semiconductor Co., Ltd.'
> > device = 'RTL8111/8168B PCI Express Gigabit Ethernet controller'
> > class = network
> > subclass = ethernet
> 
> Because Linux1 seems to be involved in that problem: It's running
> Linux
> 3.0 and it has an "Atheros Communications AR8121/AR8113/AR8114"
> onboard.
> 
> Does anyone have an idea what could be the problem here? Decreasing
> the
> MTU is some kind of solution but the performance is still not optimal
> and a MTU of 1500 should be no problem.
> 
> Greetings,
> Michael Laß
> 
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Gigabit Ethernet performance with Realtek 8111E

2011-11-06 Thread Rick Macklem
Michael wrote:
> Hi!
> 
> Am Samstag, den 05.11.2011, 11:03 -0400 schrieb Rick Macklem:
> > try typing:
> > # sysctl dev.re.0.stats=1
> > - this will dump out the stats on the chip
> >   if the "Rx missed frames" count is non-zero, you're probably
> >   snookered,
> >   to put it technically:-)
> >   - That's what I get for a re chip is this laptop and I haven't
> >   found
> > a way around it. I just live with flakey net performance.
> 
> Rx missed frames is >0 indeed. Every time I see those drops in speed
> the
> number of missed frames increases by approx. 20-50.
> 
> When searching for this problem I found your old thread on
> freebsd-current[1]. It seems that the problem is way less severe here.
> Some transfers even don't cause any problems. Others however spend
> more
> time at 0kbit/s than actually transferring data...
> It also seems like transfers are stabilizing after some seconds but
> that
> is not always the case.
> In good times the rate of missed frames is below 0.01%.
> 
> I think the Dup ACKs are just a result of these lost packets. I do not
> always see them when these problems occur.
> 
> Was there any progress after your last mail on 8th of Nov.?
> 
Nope. For my case, when Rx frames are missed, there is a Fifo overflow
reported. I'm no hardware guy, but my understanding is that, sometimes,
the dma engine transferring data to the receive buffers doesn't keep up
and the fifo fills up.

I did try assorted hacks on the driver, but none of them got rid of
the problem. For my case the combination of these two things did
reduce the # of Rx packets missed, but not down to 0.
- disable msi interrupts (there's an option in the driver)
- comment out the few lines of code that disabled/re-enabled
  interrupts (I don't think this code is broken, but for some reason,
  leaving the interrupts enabled reduced the # of Rx missed for this
  laptop. Maybe the dma engine stops running when interrupts are being
  switched on/off? Just pure conjecture, of course.)
Also, only both of the above together made a difference. Each one
individually didn't help.

I heard that there was a driver for BSD out there somewhere that puts
all the Realtek chips in 8139 compatible mode and drives them that way,
but I never even got as far as searching for this driver.

Good luck with it, rick
> Greetings,
> Michael
> 
> [1]:
> http://lists.freebsd.org/pipermail/freebsd-current/2010-October/020793.html
> http://lists.freebsd.org/pipermail/freebsd-current/2010-November/020797.html
> 
> 
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: choosing distribution: FreeBSD

2011-11-27 Thread Rick Macklem
LinuxIsOne wrote:
> Hi,
> 
> Well, I am basically a Windows convert, but very frankly saying that:
> I am
> new to the world of Linux. So I should use FreeBSD or something easier
> distribution in the Linux...? Or it is perfectly okay for a newbie to
> go
> with FreeBSD?
> 
As others have noted, FreeBSD isn't a Linux distribution, but another
Unix-like operating system.

You might also consider PC-BSD, which is a desktop distribution based,
at least in part, on FreeBSD.

Good luck with whatever you choose, rick
> Thanks.
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Can we do perform a C style file Read/Write from within a ARP module

2011-12-27 Thread Rick Macklem
perryh wrote:
> Jason Hellenthal  wrote:
> >
> > See siftr(4). This module writes to a file.
> 
> Is siftr(4) new since 8.1?
> 
> $ man siftr
> No manual entry for siftr
> $ cd /usr/ports
> $ ls -d */*siftr*
> ls: */*siftr*: No such file or directory
> 
You can look at:
  http://people.freebsd.org/~rmacklem/nfs_clpackrat.c

I won't say it is the best or even a good way to do it, but
this code reads/writes files directly in the kernel.

rick

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: low network speed

2012-01-25 Thread Rick Macklem
Eugene M. Zheganin wrote:
> Hi.
> 
> I'm suffering from low network performance on one of my FreeBSDs.
> I have an i386 8.2-RELEASE machine with an fxp(4) adapter. It's
> connected through a bunch of catalysts 2950 to another 8.2. While other
> machines in this server room using the same sequence of switches and
> the
> same target source server (which, btw, is equipped with an em(4) and a
> gigabit link via catalyst 3750) show sufficient speed, this particular
> machine while using scp starts with a speed of 200 Kbytes/sec and
> while
> copying the file shows speed about 600-800 Kbytes/sec.
> 
> I've added this tweak to the sysctl:
> 
> net.local.stream.recvspace=196605
> net.local.stream.sendspace=196605
> net.inet.tcp.sendspace=196605
> net.inet.tcp.recvspace=196605
> net.inet.udp.recvspace=196605
> kern.ipc.maxsockbuf=2621440
> kern.ipc.somaxconn=4096
> net.inet.tcp.sendbuf_max=524288
> net.inet.tcp.recvbuf_max=524288
> 
> With these settings the copying starts at 9.5 Mbytes/sec speed, but
> then, as file is copying, drops down to 3.5 Megs/sec in about
> two-three
> minutes.
> 
> Is there some way to maintain 9.5 Mbytes/sec (I like this speed more)
> ?
> 
You might want to try disabling the hardware checksumming via ifconfig.
(I very vaguely recall doing that for a fxp(4) interface some time ago,
 but am probably completely wrong.:-)

rick

> 
> Thanks.
> Eugene.
> 
> P.S. This machine also runs zfs, I don't know if it's important but I
> decided to mention it.
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kerberized NFS

2012-01-27 Thread Rick Macklem
Giulio Ferro wrote:
> I'm trying to setup a kerberized NFS system made of a server and a
> client (both freebsd 9 amd64 stable)
> 
> I've tried to follow this howto:
> http://code.google.com/p/macnfsv4/wiki/FreeBSD8KerberizedNFSSetup
> 
> But couldn't get much out of it.
> 
> First question : is this howto still valid or something more recent
> should be followed? I've searched with Google but I've come up empty.
> 
It's all there is. I don't think anything has changed since it was
written. (I haven't had a kerberos setup for about 2 years, so I know
I haven't changed anything recently.)

It was a google wiki, since I hoped others would add to it, but I don't
think that has happened?

> I've set up kerberos heimdal, created the dns entries for both
> client and server, set up krb5.keytab and copied it to client, set
> up nfs4 according to man nfsv4:
> 
> (server)
> cat /etc/exports
> V4: /usr/src -sec=krb5:krb5i:krb5p
> 
The V4: line doesn't export any file system. It only defines
where the root of the directory tree is for NFSv4 and what
authentication can be used for "system operations" which do
not take any file handle and, therefore, aren't tied to any
server file system.

For example, the above would need to be something like:
V4: /usr/src -sec=krb5:krb5i:krb5p
/usr/src -sec=krb5:krb5i:krb5p 
- If /usr/src is not the root of a file system on the server,
  it is less confusing to export the root of the file system,
  such as "/usr" or "/".

> and then tried to mount it from the client:
> 
> mount_nfs -o ntfsv4,sec=krb5i,gssname=nfs
> nfsinternal1.dcssrl.it:/usr/src /usr/src
> 
To make the "gssname" case work, you need a couple of things:
- You need the patch it refers to applied to the client's kernel,
  so it can handle "host based initiator credentials". After
  applying the patch, you also need to have an entry in the
  client's /etc/keytab that looks like:
nfs/client-host.dnsdomain@YOUR.REALM

Without the above, the client can only do an NFSv4 mount as a
user (not root) that has a valid credential. For example:
- non-root mounts enabled via
  # sysctl vfs.usermount=1
- then a user logs in
  - gets a kerberos TGT via "kinit"
  - then does a mount command that looks like:
  % mount -t nfs -o nfsv4,sec=krb5i :/path
  - this mount breaks if this user's TGT expires, so it either
must be maintained via some utility (there are a couple out
there, but I can't remember the name of one offhand) or
manually by doing "kinit" again before it expires
  - this user must umount the file system when done with it

(I know, it would be nice if the host based initiator cred. worked,
 "out of the box", but the patch is ugly and the reviewer understandably
 didn't agree with it. However, I don't know how to do it another way
 for the version of Heimdal in FreeBSD. There is a bug that has apparently
 been fixed for newer Heimdal releases, where it gets confused w.r.t.
 encryption type for the keytab entry unless it is forced to one
 encryption type only.)

Also, you need the following in the server's /etc/rc.conf:
nfsv4_server_enable="YES"
gssd_enable="YES"

and in the client:
nfsuserd_enable="YES"
gssd_enable="YES"

Finally, I'd suggest that you get NFSv4 mounts over "sys" working first
and then you can try Kerberos.

> but it failed with :
> [tcp] nfsinternal1.dcssrl.it:/usr/src: Permission denied
> 
> Can you point me to something that I might have got wrong?
> 
> Thanks in advance.
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kerberized NFS

2012-01-27 Thread Rick Macklem
Yuri Pankov wrote:
> On Fri, Jan 27, 2012 at 06:58:47PM +0100, Giulio Ferro wrote:
> > I'm trying to setup a kerberized NFS system made of a server and a
> > client (both freebsd 9 amd64 stable)
> >
> > I've tried to follow this howto:
> > http://code.google.com/p/macnfsv4/wiki/FreeBSD8KerberizedNFSSetup
> >
> > But couldn't get much out of it.
> >
> > First question : is this howto still valid or something more recent
> > should be followed? I've searched with Google but I've come up
> > empty.
> >
> > I've set up kerberos heimdal, created the dns entries for both
> > client and server, set up krb5.keytab and copied it to client, set
> > up nfs4 according to man nfsv4:
> >
> > (server)
> > cat /etc/exports
> > V4: /usr/src -sec=krb5:krb5i:krb5p
> >
> > and then tried to mount it from the client:
> >
> > mount_nfs -o ntfsv4,sec=krb5i,gssname=nfs
> > nfsinternal1.dcssrl.it:/usr/src /usr/src
> >
> > but it failed with :
> > [tcp] nfsinternal1.dcssrl.it:/usr/src: Permission denied
> >
> > Can you point me to something that I might have got wrong?
> 
> Not really related to Kerberos question, but.. Some problems here:
> - ntfsv4 - probably a typo
> - more serious one - V4: line specifies the ROOT of NFSv4 exported FS
> - nfsinternal1.dcssrl.it:/usr/src points to /usr/src/usr/src.
> 
> What your /etc/exports could look like (the way it works for me,
> doesn't
> mean that it's correct though):
> 
> /usr/src  
> V4: / -sec=krb5:krb5i:krb5p 
> 
Yes. If you specify "/", then the tree starts at the root. The main
problem with doing this is that, for ZFS, you then have to export
all file systems from "/" down to where you want to mount. (Again,
these are done by export lines separate from the "V4:" line.)

If you specify:
V4: /usr/src -sec=krb5:krb5i:krb5p
/usr/src -sec=krb5:krb5i:krb5p 

then the client mounts /usr/src via:
% mount -t nfs -o nfsv4,sec=krb5i server:/ /mntpoint

rick

> 
> Yuri
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kerberized NFS

2012-01-27 Thread Rick Macklem
Yuri Pankov wrote:
> On Fri, Jan 27, 2012 at 06:58:47PM +0100, Giulio Ferro wrote:
> > I'm trying to setup a kerberized NFS system made of a server and a
> > client (both freebsd 9 amd64 stable)
> >
> > I've tried to follow this howto:
> > http://code.google.com/p/macnfsv4/wiki/FreeBSD8KerberizedNFSSetup
> >
> > But couldn't get much out of it.
> >
> > First question : is this howto still valid or something more recent
> > should be followed? I've searched with Google but I've come up
> > empty.
> >
> > I've set up kerberos heimdal, created the dns entries for both
> > client and server, set up krb5.keytab and copied it to client, set
> > up nfs4 according to man nfsv4:
> >
> > (server)
> > cat /etc/exports
> > V4: /usr/src -sec=krb5:krb5i:krb5p
> >
> > and then tried to mount it from the client:
> >
> > mount_nfs -o ntfsv4,sec=krb5i,gssname=nfs
> > nfsinternal1.dcssrl.it:/usr/src /usr/src
> >
> > but it failed with :
> > [tcp] nfsinternal1.dcssrl.it:/usr/src: Permission denied
> >
> > Can you point me to something that I might have got wrong?
> 
> Not really related to Kerberos question, but.. Some problems here:
> - ntfsv4 - probably a typo
> - more serious one - V4: line specifies the ROOT of NFSv4 exported FS
> - nfsinternal1.dcssrl.it:/usr/src points to /usr/src/usr/src.
> 
> What your /etc/exports could look like (the way it works for me,
> doesn't
> mean that it's correct though):
> 
> /usr/src  
> V4: / -sec=krb5:krb5i:krb5p 
> 
> 
> Yuri
Btw, Giulio, your email address bounces for me, so hopefully you
read the mailing list and see the previous messages.

rick
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kerberized NFS

2012-01-28 Thread Rick Macklem
Giulio Ferro wrote:
> I forgot to mention that I compiled both servers with
> option KGSSAPI and device crypto, and I enabled gssd
> on both.
> 
> Is there anyone who was able to configure this setup?
>
I had a server at the nfsv4 testing event last June and it
worked ok. I haven't tried one since then.

Step 1: make sure that nfsv4 mounts work over auth_sys.
   (You'll need to add "sys" to the sec= flavours, so your
/etc/exports will look something like:

V4: /usr/src -sec=sys:krb5:krb5i:krb5p
/usr/src -sec=sys:krb5:krb5i:krb5p 

Then on the client:
# mount -t nfs -o nfsv4 :/ /
(Where "<" and ">" indicate "replace this with what yours".)
- Then cd / and do an "ls -l" to see that the file
  ownership looks ok. If it doesn't, it will be related to
  "nfsuserd", which must be running in both client and server.

Once Step 1 looks fine:
Step 2: Check that Kerberos is working ok in the server.
- Log into the server as root and do the following:
  # kinit -k nfs/@
  - This should work ok.
  # klist
  - This should list a TGT for nfs/@

If this doesn't work, something isn't right in the Kerberos setup
on the server. The NFS server (not client) must have a /etc/krb5.keytab
file with an entry for:
  nfs/@
in it. You should create it on your KDC with encryption type
  DES-CBC-CRC initially
and you should specify that as your default enctype in your /etc/krb5.conf.

Once that is working, make sure all the daemons are running on the server.
mountd, nfsd, nfsuserd and gssd

If this all looks good, go to the client:
# sysctl vfs.usermount=1
- make sure these daemons are running
nfsuserd, gssd

- Log in as non-root user:
% kinit
% klist
- there should be a TGT for the user you are logged in as

- Now, try a kerberos mount, as follows:
% mount -t nfs -o nfsv4,sec=krb5 :/ /
- if that works
% cd /
% ls -l

If these last steps fail, it is not easy to figure out why.
(Look in /var/log/messages for any errors. If you get what
 the gssd calls a minor status, that is the kerberos error.)

rick


___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kerberized NFS

2012-02-17 Thread Rick Macklem
Giulio Ferro wrote:
> Thanks everybody again for your help with setting up a working
> kerberized nfsv4 system.
> 
> I was able to user-mount a nfsv4 share with krb5 security, and I was
> trying to do the same as root.
> 
> Unfortunately the patch I found here:
> http://people.freebsd.org/~rmacklem/rpcsec_gss.patch
> 
> fails to apply cleanly on a 9 stable system.
> 
I'll try and generate an updated patch. I guess some commit has
changed the code enough that "patch" gets confused and it's a little
big to do the patch manually. (I'm pretty sure any changes done to
the sys/rpc/rpcsec_gss code hasn't broken the patch, but I have no
way of doing Kerberos testing these days.)

> Is there a more recent patch available or some better way to
> automatically
> mount the share at boot time?
> 
> Thanks again.
> ___
> freebsd-sta...@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to
> "freebsd-stable-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kerberized NFS

2012-02-17 Thread Rick Macklem
Giulio Ferro wrote:
> Thanks everybody again for your help with setting up a working
> kerberized nfsv4 system.
> 
> I was able to user-mount a nfsv4 share with krb5 security, and I was
> trying to do the same as root.
> 
> Unfortunately the patch I found here:
> http://people.freebsd.org/~rmacklem/rpcsec_gss.patch
> 
> fails to apply cleanly on a 9 stable system.
> 
There is now a patch called:
  http://people.freebsd.org/~rmacklem/rpcsec_gss-9.patch
that should apply to a FreeBSD9 or later kernel.

For the kernel to build after applying the patch, you will
need a kernel config with
options KGSSAPI
in it, since the patch adds a function that can't be called
via one of the XXX_call() functions using the function pointers.

Also, review the section of the wiki where it discusses setting
  vfs.rpcsec.keytab_enctype
because the host based initiator keytab entry won't work unless
it is set correctly.

Good luck with it, rick

> Is there a more recent patch available or some better way to
> automatically
> mount the share at boot time?
> 
> Thanks again.
> ___
> freebsd-sta...@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to
> "freebsd-stable-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: OFED stack, RDMA, ipoib help needed

2012-05-08 Thread Rick Macklem
Gergely CZUCZY wrote:
> Hello,
> 
> I'd like to ask a few question in order to get some hardware to work
> we've got recently.
> 
> The hardwares are the following:
> - 2x dualport Mellanox ConnectX-3 VPI cards, with 56Gbps ports
> - 4 computing modules with a singleport Mellanox MT27500-family
> ConnectX-3 port.
> 
> The 2 dualport cards are in a storage box, and the 4 singleport ones
> are integrated on blade-like computing nodes (4 boxes in 2U). The
> storage is running FreeBSD 9-STABLE, 2012-05-07 cvsup, and the
> computing nodes are running linux.
> 
> So far we had been able to bring up the subnet-manager on the FreeBSD
> node, and one of the links got into Active state, which is quite good.
> We had been able to ibping between the nodes. The FreeBSD kernel
> config, in addition to GENERIC, is the following:
> 
> options OFED
> options SDP
> device ipoib
> options IPOIB_CM
> device mlx4ib
> device mthca
> device mlxen
> 
> Right now we're having problems with the following issues, situations:
> 
> 1) we assigned IP addresses to both ib interfaces (fbsd, linux side),
> but weren't able to ping over IP. We've seen icmp-echo-requests
> leaving
> the box on the linux box, but haven't seen any incoming traffic on the
> freebsd one. On the freebsd side, we had several issues:
> - no incoming packets seen by tcpdump on the ib interface
> - when trying to ping the other side, we've got "no route to host",
> but the routing entry existed in the routing table.
> - we had a few of these messages in our messages: "ib2: timing out; 0
> sends N recieves not completed", where started at 22,34 and was
> growing.
> 
> 2) We're unable to find any resources on how to do RDMA on the FreeBSD
> side. We'd like to use SRP (SCSI RDMA Protocol) communication, and/or
> NFS-over-RDMA for our storage link between the boxes. Where could we
> find any info on this?
> 
NFS-over-RDMA requires sessions, which are a part of NFSv4.1. There is
no NFSv4.1 server support at this time.

I know diddly about infiniband, so I can't help w.r.t. the rest.

Good luck with it, rick

> 3) Enabling connected-mode, we weren't able to find a way to specify
> or
> query the port that connected mode is using. Could someone please
> point
> us to the right direction regarding this minor issue?
> 
> 4) We were also unable to find how to switch these dual-personality
> cards between infiniband and ethernet modes. Could we also get some
> pointers regarding this please?
> 
> Basically any help would be welcome which could help making infiniband
> work.
> 
> As a side question, I've seen a commit for OFED in HEAD by jhb,
> fixing
> a few things, may I ask when will that get MFC'd to RELENG-9?
> 
> Thanks in advance.
> 
> Best regards,
> Gergely CZUCZY
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release

2018-04-13 Thread Rick Macklem
Niels Kobschätzki wrote:
>sorry for the cross-posting but so far I had no real luck on the forum
>or on question, thus I want to try my luck here as well.
I read email lists but don't do the other stuff, so I just saw this yesterday.
Short answer, I haven't a clue why the cache hit rate would have changed.

The code that decides if there is a hit/miss for the attribute cache is in
ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
except the old code did a mtx_lock(&Giant), but I can't imagine how that
would affect the code.

You might want to:
# sysctl -a | fgrep vfs.nfs
for both the 10.3 and 11.1 systems, to check if any defaults have somehow
been changed. (I don't recall any being changed, but??)

If you go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
top, where it calculates "timeo" from it.
Running this hacked kernel might show you if either of these fields is bogus.
(You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
clause that increments "attrcache_misses", which is where the cache misses
happen to see why it is missing the cache.)
If you could do this for the 10.3 kernel as well, this might indicate why the
miss rate has increased?
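In case it helps to visualize what those printf()s would be showing, the
decision is a purely time-based one. The following is a simplified userland
model of it (the names, the /10 aging factor and the clamping are assumptions
to be checked against the real ncl_getattrcache(), not a copy of it):

#include <stdio.h>
#include <time.h>

/*
 * Simplified model of a time-based NFS attribute cache check: the timeout
 * grows with the age of the file (files that haven't changed recently are
 * assumed to keep not changing) and is clamped to the acregmin/acregmax
 * mount options.  Returns 1 for a cache hit, 0 for a miss (a Getattr RPC).
 */
static int
attrcache_is_hit(time_t now, time_t file_mtime, time_t attr_stamp,
    int acregmin, int acregmax)
{
	time_t timeo;

	timeo = (now - file_mtime) / 10;	/* heuristic aging factor */
	if (timeo < acregmin)
		timeo = acregmin;
	else if (timeo > acregmax)
		timeo = acregmax;
	return ((now - attr_stamp) < timeo);
}

int
main(void)
{
	time_t now = time(NULL);

	/* Attributes cached 2s ago for a file modified 60s ago: a hit. */
	printf("%d\n", attrcache_is_hit(now, now - 60, now - 2, 3, 60));
	/* Attributes cached 90s ago: a miss, so a Getattr RPC goes out. */
	printf("%d\n", attrcache_is_hit(now, now - 60, now - 90, 3, 60));
	return (0);
}

One thing the model makes obvious is that everything hinges on time_second and
the cached timestamps agreeing; if they don't (bogus fields, or clocks that are
out of sync), the computed timeout can collapse to the minimum and most checks
become misses.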

>I upgraded a machine from 10.3-Prerelease (custom kernel with
>tcp_fastopen added) to 11.1-Release (standard kernel) with
>freebsd-update. I have two other machines that are still on
>10.3-Prerelease. Those machines mount an NFS-export from a
>Linux-NFS-server and use NFSv3. The machine that got upgraded shows now
>far more cache misses for getattr than on the 10.3-machines (we talk a
>factor of 100) in munin. munin also shows a lot more cache-misses for
>other metrics like biow, biorl, biod (where can I find what those
>metrics mean…currently I have not even an understanding what these are)
>etc.
>
>Can anybody help me how I can debug this problem or has an idea what
>could cause the problem? The result of this behavior is that this
>machine shows a lower performance than the others and I cannot upgrade
>other machines before I didn't fix this bug.
I haven't run a 10.x system in quite a while. When I get home in a few days,
I might be able to reproduce this. If I can, I can poke at it, but it would be
at least a week before I might have an answer and I may not figure it out for a
long time.

rick
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release

2018-04-14 Thread Rick Macklem
Niels Kobschätzki wrote:
>On 04/14/2018 03:49 AM, Rick Macklem wrote:
>> Niels Kobschätzki wrote:
>>> sorry for the cross-posting but so far I had no real luck on the forum
>>> or on question, thus I want to try my luck here as well.
>> I read email lists but don't do the other stuff, so I just saw this 
>> yesterday.
>> Short answer, I haven't a clue why cache hits rate would have changed.
>>
>> The code that decides if there is a hit/miss for the attribute cache is in
>> ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
>> except the old code did a mtx_lock(&Giant), but I can't imagine how that
>> would affect the code.
>>
>> You might want to:
>> # sysctl -a | fgrep vfs.nfs
>> for both the 10.3 and 11.1 systems, to check if any defaults have somehow
>> been changed. (I don't recall any being changed, but??)
>
>I did that and there did nothing change.
>
>> If you go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
>> and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
>> top, where it calculates "timeo" from it.
>> Running this hacked kernel might show you if either of these fields is bogus.
>> (You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
>> clause that increments "attrcache_misses", which is where the cache misses
>> happen to see why it is missing the cache.)
>> If you could do this for the 10.3 kernel as well, this might indicate why the
>> miss rate has increased?
>
>I will do this next week. On monday we switch for other reasons to other
>nfs-servers and when we see that they run stable, I will do this next.
With a miss rate of 2.7%, I doubt printing the above will help. I thought
you were seeing a high miss rate.

>Btw. I calculated now the percentages. The old servers had a attr miss
>rate of something like 0.004%, while the upgraded one has more like
>2.7%. This is till low from what I've read (I remember that you should
>start adjusting acreg* when you hit more than 40% misses) but far higher
>than before.
You could try increasing acregmin, acregmax and see if the misses are reduced.
(The only risk with increasing the cache timeout is that, if another client
changes the attributes, then the client will use stale ones for longer.
Usually, this doesn't cause serious problems.)
To be honest, a Getattr RPC is pretty low overhead, so I doubt the increase
to 2.7% will affect your application's performance, but it is interesting that
it increased.

You might also try increasing acdirmin, acdirmax in case it is the directory
attributes that are having cache misses.

Oh, and check that your time of day clocks are in sync with the server,
since the caches are time based and there is no cache coherency protocol
in NFS.
[good stuff snipped]
rick
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release

2018-04-15 Thread Rick Macklem
Niels Kobschaetzki wrote:
[stuff snipped]
>It is a website with quite some traffic handled by three webservers behind a
>pair of loadbalancers.
>We see a loss of 20% in speed (TTFB reduced by 100ms; sounds not a lot but
>Google et al doesn't like it at all) after upgrading to 11.1 with a combined
>upgrade to php7.1. On another server without NFS that upgrade improved
>performance considerably (I was told ca 30% by the front-end dev)
One thing you could try is booting the 11.1 kernel on a 10.3 system. Newer
FreeBSD kernels should work with older userland.
This would tell you if it is kernel changes or userland changes that are causing
the higher miss rate.

Good luck with it, rick
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release

2018-04-16 Thread Rick Macklem
Niels Kobschaetzki wrote:
[stuff smipped]
>I just checked the code to see if I can figure out where exactly I have
>to put the printf(). And then I saw that there are ifdefs for
>NFS_ACDEBUG which seems to be a kernel option. When I add NFS_ACDEBUG in
>the config-file for the kernel, the build fails with an
I don't have sources handy right now, but you can probably just put a line
like:
#define NFS_ACDEBUG 1
at the top of the file /usr/src/sys/fs/nfsclient/nfs_clsubs.c

After building/booting the kernel "sysctl -a" should have a
vfs.nfs.acdebug
in the list. Set it to "1" to get the basic timeout info.

>/usr/src/sys/amd64/conf/ACDEBUG: unknown option "NFS_ACDEBUG"
>
>I looked in sysctl.h and there it isn't defined. Do I do something wrong
>or did this sysctl-tunable got lost at some point in time?
>Can I just use this code by removing the ifdef for getting information?
>
>Sorry, my C is not really existent, thus I have to ask :/
>
>The parts (except the part that looks at the sysctl looks like this):
>#ifdef NFS_ACDEBUG
>if (nfs_acdebug>1)
>   printf("ncl_getattrcache: initial timeo = %d\n", timeo);
>#endif
>
>……
>
>
>#ifdef NFS_ACDEBUG
>if (nfs_acdebug > 2)
>printf("acregmin %d; acregmax %d; acdirmin %d; acdirmax
>%d\n",
>nmp->nm_acregmin, nmp->nm_acregmax,
>nmp->nm_acdirmin, nmp->nm_acdirmax);
>
>if (nfs_acdebug)
>printf("ncl_getattrcache: age = %d; final timeo = %d\n",
>(time_second - np->n_attrstamp), timeo);
>#endif
>
>
>I would remove the ifdefs and the "if (nfs_acdebug …)"
This would work, too, rick
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release

2018-04-18 Thread Rick Macklem
Niels Kobschätzki wrote:
[stuff snipped]
>I solved now finally my problem after two weeks and it wasn't the NFS. I
>just got derailed from the real solution again and again from some
>people, thus I didn't look in the right place. The cache misses are gone
>now, the application performs now faster than on the other servers.
Good work. Btw, that was why I suggested running the new kernel on a
server with the old userland. It would have isolated out any userland 
differences,
and hopefully what was causing the problem.

Glad to hear NFS isn't the culprit, rick

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Diagnosing terrible ixl performance

2018-04-20 Thread Rick Macklem
I don't know if this post is helpful, but just in case:
http://docs.FreeBSD.org/cgi/mid.cgi?04125f40-6388-f074-d935-ce6c16d220fa

Hope you don't mind a top post, rick


From: owner-freebsd-...@freebsd.org  on behalf 
of hiren panchasara 
Sent: Friday, April 20, 2018 5:20:45 PM
To: Garrett Wollman
Cc: freebsd-net@freebsd.org
Subject: Re: Diagnosing terrible ixl performance

On 04/20/18 at 12:03P, Garrett Wollman wrote:
> I'm commissioning a new NFS server with an Intel dual-40G XL710
> interface, running 11.1.  I have a few other servers with this
> adapter, although not running 40G, and they work fine so long as you
> disable TSO.  This one ... not so much.  On the receive side, it gets
> about 600 Mbit/s with lots of retransmits.  On the *sending* side,
> though, it's not even able to sustain 10 Mbit/s -- but there's no
> evidence of retransmissions, it's just sending really really slowly.
> (Other machines with XL710 adapters are able to sustain full 10G.)
> There is no evidence of any errors on either the adapter or the switch
> it's connected to.
>
> So far, I've tried:
>
> - Using the latest Intel driver (no change)
> - Using the latest Intel firmware (breaks the adapter)
> - Disabling performance tweaks in loader.conf and sysctl.conf
> - Changing congestion-control algorithms
>
> Anyone have suggestions while I still have time to test this?  (My
> plan B is to fall back to an X520 card that I have in my spares kit,
> because I *know* those work great with no faffing about.)  Any
> relevant MIBs to inspect?
>
> The test I'm doing here is simple iperf over TCP, with MTU 9120.  It
> takes about 10 seconds for the sending side to complete, but buffers
> are severely constipated for 20 seconds after that (delaying all
> traffic, including ssh connections).
>
> I'm at the point of trying different switch ports just to eliminate
> that as a possibility.

You are already checking whether the switch in between is causing the
problem. A few other (probably obvious) things to try:
- sysctl -a | grep hw.ixl or dev.ixl to see if you find anything useful
(actual name might not be ixl, but you get the point)
- Try with a lower MTU to see if that's causing anything interesting
- If you can reproduce easily, a single stream pcap might be useful from
  both send and recv side to understand the slowness.

Cheers,
Hiren
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: mlx5(4) jumbo receive

2018-04-25 Thread Rick Macklem
Ryan Stone wrote:
>On Tue, Apr 24, 2018 at 4:55 AM, Konstantin Belousov wrote:
>> +#ifndef MLX5E_MAX_RX_BYTES
>> +#define	MLX5E_MAX_RX_BYTES MCLBYTES
>> +#endif
>
>Why do you use a 2KB buffer rather than a PAGE_SIZE'd buffer?
>MJUMPAGESIZE should offer significantly better performance for jumbo
>frames without increasing the risk of memory fragmentation.
Actually, when I was playing with using jumbo mbuf clusters for NFS, I was able
to get it to fragment to the point where allocations failed when mixing 2K and
4K mbuf clusters.
Admittedly I was using a 256Mbyte i386 and it wasn't easily reproduced, but
it was possible.
--> Using a mix of 2K and 4K mbuf clusters can result in fragmentation, although
  I suspect that it isn't nearly as serious as what can happen when using 9K
  mbuf clusters.

rick
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Starting and stopping nfsd apparently results in permanently disabling it

2018-04-29 Thread Rick Macklem
After a little look at nfsd.c, I think you need to SIGKILL the kernel daemon
to get rid of it. (That is what nfsd.c does.)

If you do a "ps ax" and find a "nfsd (server)" still there, "kill -9 " it
and then you can probably start the nfsd again.

rick
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


alignment of ai_addr in struct addrinfo

2018-06-07 Thread Rick Macklem
I have been doing "make universe" for the first time in a long time and
I ran into an interesting one.
I had code that looked like:
struct sockaddr_in *sin;
struct addrinfo *res;

- did a getaddrinfo() and then after this I had:
   ...
   sin = (struct sockaddr_in *)res->ai_addr;

For mips, the compiler complained that the alignment requirement for
"struct sockaddr_in" is different from that of "struct sockaddr", because
of the type cast.

I've worked around this by:
   struct sockaddr_in sin;
   ...
   memcpy(&sin, res->ai_addr, sizeof(sin));

Is this a real problem or a compiler quirk?

If it is real, it seems to me it would be nice if the alignment requirement for
"struct sockaddr" were the same as for "struct sockaddr_in" and
"struct sockaddr_in6". Is there a "trick" that could be applied to
"struct sockaddr" to force good alignment?
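
For reference, here is a minimal, self-contained sketch of the memcpy() workaround
(the host name, port and error handling are just for illustration, not from the
code that triggered the warning):

#include <sys/socket.h>
#include <arpa/inet.h>
#include <err.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct addrinfo hints, *res;
	struct sockaddr_in sin;
	int ecode;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET;
	hints.ai_socktype = SOCK_STREAM;
	ecode = getaddrinfo("localhost", "2049", &hints, &res);
	if (ecode != 0)
		errx(1, "getaddrinfo: %s", gai_strerror(ecode));

	/* Copy into a properly aligned sockaddr_in instead of casting ai_addr. */
	memcpy(&sin, res->ai_addr, sizeof(sin));
	printf("port %d\n", ntohs(sin.sin_port));
	freeaddrinfo(res);
	return (0);
}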

rick
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


IPv6 scope handling, was Re: svn commit: r335806 - projects/pnfs-planb-server/usr.sbin/nfsd

2018-06-30 Thread Rick Macklem
Andrey V. Elsukov wrote:
>On 30.06.2018 01:07, Rick Macklem wrote:
>> Author: rmacklem
>> Date: Fri Jun 29 22:07:25 2018
>> New Revision: 335806
>> URL: https://svnweb.freebsd.org/changeset/base/335806
>>
>> Log:
>>   Add support for IPv6 addresses to the "-p" option for the pNFS server DS
>>   specifications.
>>
>> + char *mdspath, *mdsp, ip6[INET6_ADDRSTRLEN];
>> + const char *ad;
>>   int ecode;
>> + hints.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
>> + hints.ai_family = PF_UNSPEC;
>>   hints.ai_socktype = SOCK_STREAM;
>>   hints.ai_protocol = IPPROTO_TCP;
>>   ecode = getaddrinfo(cp, NULL, &hints, &ai_tcp);
>>   if (ecode != 0)
>>   err(1, "getaddrinfo pnfs: %s %s", cp,
>>   gai_strerror(ecode));
>> + memcpy(&sin6, res->ai_addr, sizeof(sin6));
>> + ad = inet_ntop(AF_INET6, &sin6.sin6_addr, ip6,
>> + sizeof(ip6));
>
>Hi,
>
>I'm unaware of applicability of IPv6 addresses with restricted scope in
>this area, but when you use inet_ntop() to get IPv6 address text
>representation, you can lost IPv6 scope zone id. getaddrinfo() can
>return sockaddr structure with properly filled sin6_scope_id field. It
>is better to use getnameinfo() with NI_NUMERICHOST flag. Also the size
>of ip6 buffer should be enough to keep scope specifier.
Thanks for mentioning this. First off, you could write what I know about IPv6
addresses on a very small postage stamp...

Are you referring to the 4 bits in the second octet of the address, or the
stuff that can end up as a suffix starting with "%"?

In this case, the address string is put "on the wire" for the client to use to
connect to a data server (DS). I'm not sure the "%..." suffix is useful in this
case and, when it gets to the client, it will be translated to an address via
the kernel version of inet_pton(), which does not parse "%..." as far as I can
see.

So maybe others can clarify if it would be better to use getnameinfo() for this
use case?
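
In case it helps the discussion, here is a small, self-contained sketch of the
difference (the "fe80::1%em0" literal and the interface name are just examples;
substitute an interface that exists on the test box):

#include <sys/socket.h>
#include <arpa/inet.h>
#include <err.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct addrinfo hints, *res;
	struct sockaddr_in6 sin6;
	char host[NI_MAXHOST], ip6[INET6_ADDRSTRLEN];
	int ecode;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET6;
	hints.ai_socktype = SOCK_STREAM;
	ecode = getaddrinfo("fe80::1%em0", NULL, &hints, &res);
	if (ecode != 0)
		errx(1, "getaddrinfo: %s", gai_strerror(ecode));

	/* inet_ntop() only sees the 128-bit address; the scope zone id is lost. */
	memcpy(&sin6, res->ai_addr, sizeof(sin6));
	inet_ntop(AF_INET6, &sin6.sin6_addr, ip6, sizeof(ip6));
	printf("inet_ntop:   %s\n", ip6);

	/* getnameinfo() formats the whole sockaddr, including sin6_scope_id. */
	ecode = getnameinfo(res->ai_addr, res->ai_addrlen, host, sizeof(host),
	    NULL, 0, NI_NUMERICHOST);
	if (ecode != 0)
		errx(1, "getnameinfo: %s", gai_strerror(ecode));
	printf("getnameinfo: %s\n", host);
	freeaddrinfo(res);
	return (0);
}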

Thanks, rick
ps: I changed the mailing list to freebsd-net@ so hopefully the net folks will
notice.

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: IPv6 scope handling, was Re: svn commit: r335806 - projects/pnfs-planb-server/usr.sbin/nfsd

2018-07-01 Thread Rick Macklem
Andrey V. Elsukov wrote:
[stuff snipped]
>>
>> I think what you are saying above is that a Link-local address won't work
>> and that the address must be a global one?
>> Should the code check for "fe8" at the start and skip over those ones?
>
>It is possible that all hosts are in the same scope zone, e.g. they are
>connected in the one broadcast domain through the switch.
>In this case it is possible to use link-local addresses and they all
>will be reachable.
>
>> The "on-the-wire" address sent to HostC is specified in standard string form
>> (can't remember the RFC#, but it is referenced by RFC5661), so I can't send
>> any more than that to HostC.
>
>So if I understand correctly, after formatting you are sending this
>address string to some foreign host?
Yes, that is what happens.

>The scope zone id specifier only matters for the host where it is
>used. I.e. there is no sense in sending "%ifname" to the foreign host,
>because it may have a different ifname for the link, and that address
>specification won't work.
That is what I thought.

>I think for now we can leave the code as is (put some XXX comment
>here), and then in the future, if it is needed, add better handling
>for that :)
How about this patch? (Basically, use the link-local address if it is the
only one returned by getaddrinfo().)
--- nfsd.c	2018-06-30 08:16:51.771742000 -0400
+++ /tmp/nfsd.c	2018-07-01 13:01:30.243285000 -0400
@@ -1309,7 +1309,17 @@ parse_dsserver(const char *optionarg, st
 				memcpy(&sin6, res->ai_addr, sizeof(sin6));
 				ad = inet_ntop(AF_INET6, &sin6.sin6_addr, ip6,
 				    sizeof(ip6));
-				break;
+
+				/*
+				 * XXX
+				 * Since a link local address will only
+				 * work if the client and DS are in the
+				 * same scope zone, only use it if it is
+				 * the only address.
+				 */
+				if (ad != NULL &&
+				    !IN6_IS_ADDR_LINKLOCAL(&sin6.sin6_addr))
+					break;
 			}
 		}
 		if (ad == NULL)
[more stuff snipped]
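
For anyone skimming the thread, here is a small standalone sketch of the same
idea outside of nfsd (prefer a non-link-local address, but fall back to a
link-local one if that is all getaddrinfo() returns; the host name is just an
example):

#include <sys/socket.h>
#include <arpa/inet.h>
#include <err.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct addrinfo hints, *ai, *res;
	struct sockaddr_in6 sin6;
	char ip6[INET6_ADDRSTRLEN];
	const char *ad = NULL;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET6;
	hints.ai_socktype = SOCK_STREAM;
	if (getaddrinfo("ds0.example.com", NULL, &hints, &res) != 0)
		errx(1, "getaddrinfo failed");

	for (ai = res; ai != NULL; ai = ai->ai_next) {
		memcpy(&sin6, ai->ai_addr, sizeof(sin6));
		ad = inet_ntop(AF_INET6, &sin6.sin6_addr, ip6, sizeof(ip6));
		/*
		 * A link-local address only works if both ends are in the
		 * same scope zone, so keep looking for a non-link-local one
		 * and fall back to the link-local address only if nothing
		 * else shows up.
		 */
		if (ad != NULL && !IN6_IS_ADDR_LINKLOCAL(&sin6.sin6_addr))
			break;
	}
	if (ad != NULL)
		printf("using %s\n", ad);
	freeaddrinfo(res);
	return (0);
}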

Thanks for the comments, rick
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


review of a timeout patch for the kernel RPC

2018-07-17 Thread Rick Macklem
Hi,

I've created D16293 on reviews.freebsd.org for a patch that sets a timeout for
a failed server connected over TCP. The code changes are simple, so the
review is more about the technique I used.

kib@ has already made a comment.

If anyone else would like to comment or review this, it would be appreciated, 
rick
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: 9k jumbo clusters

2018-07-29 Thread Rick Macklem
Adrian Chadd wrote:
>John-Mark Gurney wrote:
[stuff snipped]
>>
>> Drivers need to be fixed to use 4k pages instead of cluster.  I really hope
>> no one is using a card that can't do 4k pages, or if they are, then they
>> should get a real card that can do scatter/gather on 4k pages for jumbo
>> frames..
>
>Yeah but it's 2018 and your server has like minimum a dozen million 4k
>pages.
>
>So if you're doing stuff like lots of network packet kerchunking why not
>have specialised allocator paths that can do things like "hey, always give
>me 64k physical contig pages for storage/mbufs because you know what?
>they're going to be allocated/freed together always."
>
>There was always a race between bus bandwidth, memory bandwidth and
>bus/memory latencies. I'm not currently on the disk/packet pushing side of
>things, but the last couple of times I was, it was at different points in that
>4d space and almost every single time there was a benefit from having a
>couple of specialised allocators so you didn't have to try and manage a few
>dozen million 4k pages based on your changing workload.
>
>I enjoy the 4k page size management stuff for my 128MB routers. Your 128G
>server has a lot of 4k pages. It's a bit silly.
Here's my NFS guy perspective.
I do think 9K mbuf clusters should go away. I'll note that I once coded NFS so
it would use 4K mbuf clusters for the big RPCs (write requests and read
replies) and I actually could get the mbuf cluster pool fragmented to the point
it stopped working on a small machine, so it is possible (although not likely)
to fragment even a 2K/4K mix.

For me, send and receive are two very different cases:
- For sending a large NFS RPC (let's say a reply to a 64K read), the NFS code
  will generate a list of 33 2K mbuf clusters. If the net interface doesn't do
  TSO, this is probably fine, since tcp_output() will end up busting it up into
  a bunch of TCP segments using the list of mbuf clusters, with TCP/IP headers
  added for each segment, etc...
  - If the net interface does TSO, this long list goes down to the net driver
    and uses 34-35 ring entries to send it (at least one extra segment is
    typically added for the MAC header). If the driver isn't buggy and the net
    chip supports lots of transmit ring entries, this works ok, but...
  - If there were a 64K supercluster, the NFS code could easily use that for
    the 64K of data, and a TSO-enabled net interface would use 2 transmit ring
    entries (one for the MAC/TCP/NFS header and one for the 64K of data). If
    the net interface can't handle a TSO segment over 65535 bytes, it will end
    up getting 2 TSO segments from tcp_output(), but that is still a lot less
    than 35.
I don't know enough about net hardware to know when/if this will help
performance, but it seems that it might, at least for some chipsets?
(The back-of-the-envelope arithmetic is sketched below.)
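
Just to make the arithmetic above concrete, here is a trivial, self-contained
sketch of the numbers for the send case (the extra header segment counts are
the assumptions stated above, nothing more):

#include <stdio.h>

int
main(void)
{
	const int reply = 64 * 1024;		/* 64K NFS read reply */
	const int mcl = 2 * 1024;		/* regular 2K mbuf cluster */
	int clusters = reply / mcl + 1;		/* 32 data clusters + 1 for the RPC header */

	printf("2K clusters:  %d mbufs, about %d TSO ring entries\n",
	    clusters, clusters + 1);
	printf("64K cluster:  2 mbufs, 2 TSO ring entries\n");
	return (0);
}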

For receive, it seems that a 64K mbuf cluster is overkill for jumbo packets,
but as others have noted, they won't be allocated for long unless packets
arrive out of order, at least for NFS. (For other apps., they might not read
the socket for a while to get the data, so they might sit in the socket rcv
queue for a while.)

I chose 64K, since that is what most net interfaces can handle for TSO these
days. (If it will soon be larger, I think this should be even larger, but all
of them the same size to avoid fragmentation.) For the send case for NFS, it
wouldn't even need to be a very large pool, since they get free'd as soon as
the net interface transmits the TSO segment.

For NFS, it could easily call mget_supercl() and then fall back on the current
code using 2K mbuf clusters if mget_supercl() failed, so a small pool would be
fine for the NFS send side.

I'd like to see a pool for 64K or larger mbuf clusters for the send side.
For the receive side, I'll let others figure out the best solution (4K or
larger for jumbo clusters). I do think anything larger than 4K needs a
separate allocation pool to avoid fragmentation.
(I don't know, but I'd guess iSCSI could use them as well?)

rick

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Fw: 100.chksetuid hanging on nfs mounts

2018-08-31 Thread Rick Macklem
Gerrit Kühn wrote:
>On Thu, 30 Aug 2018 08:07:52 -0600 Alan Somers  wrote
>about Re: Fw: 100.chksetuid handging on nfs mounts:
>
>> Well that's not very illuminating.  I was wondering if it had weird mount
>> options or something.  Are you sure that's why find is hanging?  What
>> happens if you unmount and repeat the command?
>
>I just tried these things:
>
>find command with nfs mounted and connection working: runs fine
>find command with nfs unmounted: runs fine
>find command with nfs mounted and nfs-nic down: hangs
Without a functioning network, NFS just keeps trying to do the RPC.
This is normal behaviour for NFS and has been since 1985.
If you are using NFSv3 and want the I/O attempt to fail after a couple of
minutes instead of "just keep trying", you can use the mount options:
"soft,retrans=2". These options are not recommended for NFSV4.

>As soon as I "up" the interface again, find continues to run:
Yep. At this point, the NFS client can do the RPC.

rick

---
root@crest:/ # find -sx / /dev/null \( ! -fstype local
\) -prune -o -type f \( \( ! -perm +010 -and -perm +001 \) -or \( ! -perm
+020 -and -perm +002 \) -or \( ! -perm +040 -and -perm +004 \) \) -exec ls
-liTd \{\} \+
nfs server hellpool:/samqfs/FC5/Gerrit: not responding
nfs server hellpool:/samqfs/FC5/Gerrit: is alive again
root@crest:/ #
---


cu
  Gerrit
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Fw: 100.chksetuid hanging on nfs mounts

2018-08-31 Thread Rick Macklem
Gerrit Kühn wrote:
[stuff snipped]
>My impression was that the find commandline is crafted in a way that
>should prevent it from touching any filesystems mounted into the root
>tree. Maybe I am mistaken here; in that case, hanging on the unavailable nfs
>mount is certainly expected (although not nice ;-) behaviour.
Without looking at the find sources, I'd guess that it does a stat(2) on a
directory and then checks the st_dev to see if it has changed. This will hang
for the mount point directory if it can't do the Getattr RPC.
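
Roughly the kind of check I mean (a guess sketched as standalone code, not
find's actual sources; the "/" and "/mnt" paths are just placeholders):

#include <sys/stat.h>
#include <err.h>
#include <stdio.h>

/*
 * stat(2) the parent and the entry and compare st_dev.  The stat() of an
 * NFS mount point turns into a Getattr RPC, which is where it would hang
 * when the server is unreachable.
 */
static int
crosses_mount(const char *parent, const char *entry)
{
	struct stat psb, esb;

	if (stat(parent, &psb) == -1 || stat(entry, &esb) == -1)
		err(1, "stat");
	return (psb.st_dev != esb.st_dev);
}

int
main(void)
{
	printf("crosses mount point: %d\n", crosses_mount("/", "/mnt"));
	return (0);
}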

rick
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: NFS poor performance in ipfw_nat

2018-09-19 Thread Rick Macklem
KIRIYAMA Kazuhiko wrote:
[good stuff snipped]
>
> Thanks for your advice. After adding '-lro' and '-tso' to ifconfig, the
> transfer rate went up to almost native NIC speed:
>
> # dd if=/dev/zero of=/.dake/tmp/foo.img bs=1k count=1m
> 1048576+0 records in
> 1048576+0 records out
> 1073741824 bytes transferred in 10.688162 secs (100460852 bytes/sec)
> #
>
> BTW, in a VM on bhyve, the transfer rate to an NFS mount of the VM server
> (the bhyve host) is noticeably lower:
>
> # dd if=/dev/zero of=/.dake/tmp/foo.img bs=1k count=1m
> 1048576+0 records in
> 1048576+0 records out
> 1073741824 bytes transferred in 32.094448 secs (33455687 bytes/sec)
>
>This was limited by disk transfer speed:
>
># dd if=/dev/zero of=/var/tmp/foo.img bs=1k count=1m
>1048576+0 records in
>1048576+0 records out
>1073741824 bytes transferred in 21.692358 secs (49498623 bytes/sec)
>#
It sounds like this is resolved, thanks to Andrey.

If you have more problems like this, another thing to try is reducing the I/O
size with mount options at the client.
For example, you might try adding "rsize=4096,wsize=4096" to your mount and
then increase the size by powers of 2 (8192, 16384, 32768) and see which size
works best. (This is another way to work around TSO problems. It also helps
when a net interface or packet filter can't keep up with a burst of 40+ ethernet
packets, which is what is generated when 64K I/O is used.)

Btw, doing "nfsstat -m" on the client will show you what mount options are
actually being used. This can be useful information.

Good to hear it has been resolved, rick
[more stuff snipped]

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: FCP-0101: Deprecating most 10/100 Ethernet drivers

2018-10-04 Thread Rick Macklem
Warner Losh wrote:
[lots of stuff snipped]
>That's why that one way to get the driver off the list is to convert to
>iflib. That greatly reduces the burden by centralizing all the stupid,
>common things of a driver so that we only have to change one place, not
>dozens.

I can probably do this for bfe and fxp, since I have both.
Can someone suggest a good example driver that has already been converted,
so I can see what needs to be done?

Again, I don't care if they stay in the current/head tree.

[more stuff snipped]

rick

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


  1   2   3   >