On 01/22/2018 10:46 AM, Josh Hunt wrote:
On Mon, Jan 22, 2018 at 10:30 AM, Ben Greear <gree...@candelatech.com> wrote:
On 01/22/2018 10:16 AM, Eric Dumazet wrote:

On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:

My test case is to have 6 processes each create 5000 TCP IPv4 connections
to each other
on a system with 16GB RAM and send slow-speed data.  This works fine on a
4.7 kernel, but
will not work at all on a 4.13.  The 4.13 first complains about running
out of tcp memory,
but even after forcing those values higher, the max connections we can
get is around 15k.

Both kernels have my out-of-tree patches applied, so it is possible it is
my fault
at this point.

Any suggestions as to what this might be caused by, or if it is fixed in
more recent kernels?

I will start bisecting in the meantime...


Hi Ben

Unfortunately I have no idea.

Are you using loopback flows, or have I misunderstood you ?

How loopback connections can be slow-speed ?


I am sending to self, but over external network interfaces, by using
routing tables and rules and such.

On 4.13.16+, I see the Intel driver bouncing when I try to start 20k
connections.  In this case, I have a pair of 10G ports doing 15k, and then
I try to start 5k on two of the 1G ports....

Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Down
Jan 22 10:15:41 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:43 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Down
Jan 22 10:15:45 lf1003-e3v2-13100124-f20x64 kernel: e1000e: eth3 NIC Link is
Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: NETDEV WATCHDOG: eth3
(e1000e): transmit queue 0 timed out, trans_s...es: 1
Jan 22 10:15:51 lf1003-e3v2-13100124-f20x64 kernel: e1000e 0000:07:00.0
eth3: Reset adapter unexpectedly


Ben

We had an interface doing this and grabbing these commits resolved it for us:

4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts
19110cfbb34d e1000e: Separate signaling for link check/link up
d3509f8bc7b0 e1000e: Fix return value test
65a29da1f5fd e1000e: Fix wrong comment related to link detection
c4c40e51f9c3 e1000e: Fix error path in link detection

They are in the LTS kernels now, but don't believe they were when we
first hit this problem.

Thanks a lot for the suggestions, I can confirm that these patches applied to 
my 4.13.16+
tree does indeed seem to fix the problem.

Thanks,
Ben


Josh



--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

Reply via email to