Rick:

 Unfortunately, your patch didn't work. I expected as much as soon as I saw
my boot-time 'netstat -m' output, but I wanted to run the tests to make sure.

 First, here is where I put in your additional line. Let me know if that's
what you were hoping for: I'm using mmm->m_pkthdr.csum_flags, since m
doesn't exist until the call to m_defrag a few lines below.

printf("before pklen=%d actl=%d csum=%lu\n", mmm->m_pkthdr.len, iii,
mmm->m_pkthdr.csum_flags);
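
 For reference, here's a rough sketch of how that line sits in
ixgbe_xmit() as I understand it (my paraphrase of the area around your
patch, not the verbatim code): 'mmm' is the mbuf chain handed to the
driver and 'iii' is the chain length summed by hand, while 'm' only
comes into existence at the m_defrag() call:

    /* Sketch only - paraphrased context, not the actual patch. */
    if (mmm->m_pkthdr.len > 65535) {
            struct mbuf *mp;
            int iii = 0;

            /* Sum the chain by hand to compare with the header. */
            for (mp = mmm; mp != NULL; mp = mp->m_next)
                    iii += mp->m_len;
            printf("before pklen=%d actl=%d csum=%lu\n",
                mmm->m_pkthdr.len, iii, mmm->m_pkthdr.csum_flags);
    }
    ...
    m = m_defrag(mmm, M_NOWAIT);    /* 'm' first exists here */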

 With this in place, here is the first set of logs after ~5 min of load:

On Thu, Mar 20, 2014 at 11:25 PM, Rick Macklem <rmack...@uoguelph.ca> wrote:

> Christopher Forgeron wrote:
> >
> > On Thu, Mar 20, 2014 at 7:40 AM, Markus Gebert <
> > markus.geb...@hostpoint.ch > wrote:
> >
> > Possible. We still see this on NFS clients only, but I'm not convinced
> > that NFS is the only trigger.
> >
> >
> Since Christopher is getting a bunch of the "before" printf()s from
> my patch, it indicates that a packet/TSO segment that is > 65535 bytes
> in length is showing up at ixgbe_xmit(). I've asked him to add a printf()
> for the m_pkthdr.csum_flags field to see if it is really a TSO segment.
>
> If it is a TSO segment, that indicates to me that the code in tcp_output()
> that should generate TSO segments no greater than 65535 bytes in length is
> busted. And this would imply just about any app doing large sosend()s could
> cause this, I think? (NFS read replies/write requests of 64K would be one
> of them.)
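>
> To make that concrete, the clamp in tcp_output() should look roughly
> like this (a sketch from memory of the stable/9-era code, not a
> verbatim quote), keeping the whole IP datagram at or below
> IP_MAXPACKET (65535):
>
>     if (tso) {
>             /*
>              * Limit the burst so payload plus TCP/IP headers
>              * can't push ip_len past IP_MAXPACKET.
>              */
>             if (len > IP_MAXPACKET - hdrlen) {
>                     len = IP_MAXPACKET - hdrlen;
>                     sendalot = 1;   /* loop for the remainder */
>             }
>     }
>
> If the printf()s show pklen > 65535 with the TSO flag set in
> csum_flags, something in that path isn't enforcing the limit.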
>
> rick
>
> >
> > Just to clarify: I'm experiencing this error with NFS, but also with
> > iSCSI - I turned off my NFS server in rc.conf and rebooted, and I'm
> > still able to reproduce the error. This is not just an NFS issue on
> > my machine.
> >
> > In our case, when it happens, the problem persists for quite some time
> > (minutes or hours) if we don't intervene (ifconfig or reboot).
> >
> > The first few times that I ran into it, I had similar issues, because
> > I was keeping my system up and treating it like a temporary problem.
> > The worst-case scenario resulted in reboots to reset the NIC. Then
> > again, I find the ix interfaces to be cranky if you ifconfig them too
> > much.
> >
> > Now, I'm trying to find a root cause, so as soon as I start seeing
> > any errors, I abort and reboot the machine to test the next theory.
> >
> >
> > Additionally, I'm often able to reproduce the problem with just one VM
> > running iometer on the SAN storage. When the problem occurs, that
> > connection is broken temporarily, taking network load off the SAN -
> > that may improve my chances of keeping the system running.
> >
> > > I am able to reproduce it fairly reliably within 15 min of a reboot
> > > by
> > > loading the server via NFS with iometer and some large NFS file
> > > copies at
> > > the same time. I seem to need to sustain ~2 Gbps for a few minutes.
> >
> > That's probably why we can't reproduce it reliably here. Although our
> > blade servers have 10gig cards, the affected ones are connected to a
> > 1gig switch.
> >
> > It seems that it needs a lot of traffic. I have a 10 gig backbone
> > between my SANs and my ESXi machines, so I can saturate quite quickly
> > (just now I hit a record: the error occurred within ~5 min of reboot
> > and testing). In your case, I recommend firing up multiple VMs running
> > iometer on different 1 gig connections to see if you can make it pop.
> > I also often turn off ix1 to drive all traffic through ix0 - I've
> > noticed it happens faster this way, but once again I'm not taking
> > enough observations to make decent time predictions.
> >
> > Can you try this when the problem occurs?
> >
> > for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2
> > -c 2 -W 1 10.0.0.1 | grep sendto; done
> >
> > It will tie ping to specific CPUs to test the different tx queues of
> > your ix interface. If the pings reliably fail only on some queues,
> > then your problem is more likely to be the same as ours.
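> >
> > For context on why CPU pinning maps onto tx queues: the multiqueue
> > transmit path picks a ring from the flowid when one is set, and
> > otherwise from the current CPU (a paraphrase of the ixgbe code of
> > this era, not a verbatim quote):
> >
> >     /* Sketch of ixgbe_mq_start() queue selection (paraphrased). */
> >     if (m->m_flags & M_FLOWID)
> >             i = m->m_pkthdr.flowid % adapter->num_queues;
> >     else
> >             i = curcpu % adapter->num_queues;
> >     txr = &adapter->tx_rings[i];
> >
> > Since ping traffic normally carries no flowid, pinning it to CPU N
> > exercises queue N % num_queues.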
> >
> > Also, if you have dtrace available:
> >
> > kldload dtraceall
> > dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack(); }'
> >
> > while you run pings over the affected interface. This will give you
> > hints about where the EFBIG error comes from.
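> >
> > A likely spot for it: in ixgbe_xmit() an EFBIG from the DMA mapping
> > is what triggers the m_defrag() retry mentioned above (a sketch
> > paraphrased from the driver of this era, not a verbatim quote):
> >
> >     error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
> >         segs, &nsegs, BUS_DMA_NOWAIT);
> >     if (error == EFBIG) {
> >             /* Too many segments or an oversized frame:
> >              * collapse the chain and retry the mapping once. */
> >             struct mbuf *m = m_defrag(m_head, M_NOWAIT);
> >
> >             if (m == NULL) {
> >                     m_freem(m_head);
> >                     return (ENOBUFS);
> >             }
> >             m_head = m;
> >             /* If the retry still returns EFBIG, the error
> >              * propagates up and ping sees sendto() fail. */
> >     }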
> >
> > > [...]
> >
> >
> > Markus
> >
> > Will do. I'm not sure what shell the first script was written for; it
> > doesn't work in csh. Here's a rewrite that does work in csh, in case
> > others are using the default shell:
> >
> > #!/bin/csh
> > foreach CPU (`seq 0 23`)
> >     echo "CPU$CPU"
> >     cpuset -l $CPU ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto
> > end
> >
> >
> > Thanks for your input. I should have results to post to the list
> > shortly.
> >
> >
>