[Bug 254244] panics after upgrade to stable/13-n244861-b9773574371

2021-03-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254244

Richard Scheffenegger  changed:

           What|Removed             |Added
---------------------------------------------------------
     Resolution|---                 |FIXED
         Status|In Progress         |Closed



Re: NFS Mount Hangs

2021-03-18 Thread tuexen
> On 18. Mar 2021, at 21:55, Rick Macklem  wrote:
> 
> Michael Tuexen wrote:
>>> On 18. Mar 2021, at 13:42, Scheffenegger, Richard 
>>>  wrote:
>>> 
>>>>> Output from the NFS Client when the issue occurs # netstat -an | grep
>>>>> NFS.Server.IP.X
>>>>> tcp0  0 NFS.Client.IP.X:46896  NFS.Server.IP.X:2049   
>>>>> FIN_WAIT2
>>>> I'm no TCP guy. Hopefully others might know why the client would be stuck 
>>>> in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack, but 
>>>> could be wrong?)
>>> 
>>> FIN_WAIT2 is the state the client side ends up in when it has actively 
>>> close()d the TCP session and the server has ACKed the FIN.
> Jason noted:
> 
>> When the issue occurs, this is what I see on the NFS Server.
>> tcp4   0  0 NFS.Server.IP.X.2049  NFS.Client.IP.X.51550 
>> CLOSE_WAIT
>> 
>> which corresponds to the state on the client side. The server received the 
>> FIN
>> from the client and acked it.
>> The server is waiting for a close call to happen.
>> So the question is: Is the server also closing the connection?
> Did you mean to say "client closing the connection here?"
Yes.
> 
> The server should call soclose() { it never calls soshutdown() } when
> soreceive(with MSG_WAIT) returns 0 bytes or an error that indicates
> the socket is broken.
> --> The soreceive() call is triggered by an upcall for the rcv side of the 
> socket.
> So, are you saying the FreeBSD NFS server did not call soclose() for this 
> case?
Yes. If the state at the server side is CLOSE_WAIT, no close call has happened 
yet.
The FIN from the client was received, it was ACKED, but no close() call
(or shutdown(..., SHUT_WR) or shutdown(..., SHUT_RDWR)) was issued. Therefore,
no FIN was sent and the client should be in the FINWAIT-2 state. This was also
reported. So the reported states are consistent.
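As a minimal userland sketch of the state machine described above (illustrative
only; this is not the kernel soreceive()/soclose() path of the FreeBSD NFS
server, and the function name and buffer size below are made up):

#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Once the peer closes its side, recv() returns 0 and this socket sits in
 * CLOSE_WAIT until close() is called.  Only then is our FIN sent (moving to
 * LAST_ACK), which lets the peer leave FIN_WAIT_2.
 */
static void
serve_until_peer_closes(int fd)
{
        char buf[8192];
        ssize_t n;

        for (;;) {
                n = recv(fd, buf, sizeof(buf), 0);
                if (n > 0)
                        continue;       /* a real server would process buf here */
                if (n == 0)
                        printf("peer sent FIN; this side is now in CLOSE_WAIT\n");
                else
                        perror("recv"); /* the connection is broken */
                break;
        }
        /*
         * Without this close() the socket would stay in CLOSE_WAIT and the
         * peer would remain stuck in FIN_WAIT_2 (or time out there).
         */
        close(fd);                      /* sends our FIN -> LAST_ACK */
}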

Best regards
Michael

> 
> rick
> 
> Best regards
> Michael
>> This will last for ~2 min or so, but is asynchronous. However, the same 
>> 4-tuple cannot be reused during this time.
>> 
>> In other words, from the socket / TCP perspective, a properly executed active 
>> close() will end up in this state. (If the other side initiated the close, i.e. 
>> a passive close, it will not end up in this state.)



Re: Severe IPv6 TCP transfer issues on 13.0-RC1 and RC2

2021-03-18 Thread tuexen
> On 18. Mar 2021, at 21:09, Michael Tuexen  wrote:
> 
>> On 15. Mar 2021, at 12:56, Blake Hartshorn  
>> wrote:
>> 
>> The short version: when I use FreeBSD 13, delivering data can take 5 minutes 
>> for 1MB over SSH or HTTP when using IPv6. This problem does not happen with 
>> IPv4. I installed FreeBSD 12 and Linux on that same device; neither had the 
>> problem.
>> 
>> I did some troubleshooting with Linode and have ultimately ruled the network 
>> itself out at this point. When the server is on FreeBSD 13, it can download 
>> quickly over IPv6, but not deliver. I started investigating after noticing my 
>> SSH session was lagging when cat'ing large files or running builds. This 
>> problem even occurs between VMs in the same datacenter. I generated a 1MB 
>> file of base64 garbage served by nginx for testing. IPv6 is configured by 
>> SLAAC and was set up by the installer on both the 12 and 13 installs. Linode 
>> uses Linux/KVM hosts for their virtual machines, so it's running on that 
>> virtual adapter.
>> 
>> I asked on the forums, another user recommended going to the mailing lists 
>> instead. Does anyone know if config settings need to be different on 13? Did 
>> I maybe just find a real issue? I can provide any requested details. Thanks!
> I was able to reproduce the issue locally. A fix is under review:
> https://reviews.freebsd.org/D29331
The fix is now committed to main, stable/13, and releng/13.0, and will be included
in the upcoming RC3.

Best regards
Michael
> 
> Best regards
> Michael



Re: NFS Mount Hangs

2021-03-18 Thread Rick Macklem
Michael Tuexen wrote:
>> On 18. Mar 2021, at 13:42, Scheffenegger, Richard 
>>  wrote:
>>
>>>> Output from the NFS Client when the issue occurs # netstat -an | grep
>>>> NFS.Server.IP.X
>>>> tcp0  0 NFS.Client.IP.X:46896  NFS.Server.IP.X:2049   
>>>> FIN_WAIT2
>>> I'm no TCP guy. Hopefully others might know why the client would be stuck 
>>> in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack, but 
>>> could be wrong?)
>>
>> FIN_WAIT2 is the state the client side ends up in when it has actively 
>> close()d the TCP session and the server has ACKed the FIN.
Jason noted:

>When the issue occurs, this is what I see on the NFS Server.
>tcp4   0  0 NFS.Server.IP.X.2049  NFS.Client.IP.X.51550 
>CLOSE_WAIT
>
>which corresponds to the state on the client side. The server received the FIN
>from the client and acked it.
>The server is waiting for a close call to happen.
>So the question is: Is the server also closing the connection?
Did you mean to say "client closing the connection here?"

The server should call soclose() { it never calls soshutdown() } when
soreceive(with MSG_WAIT) returns 0 bytes or an error that indicates
the socket is broken.
--> The soreceive() call is triggered by an upcall for the rcv side of the 
socket.
So, are you saying the FreeBSD NFS server did not call soclose() for this case?

rick

Best regards
Michael
> This will last for ~2 min or so, but is asynchronous. However, the same 
> 4-tuple cannot be reused during this time.
>
> In other words, from the socket / TCP perspective, a properly executed active 
> close() will end up in this state. (If the other side initiated the close, i.e. 
> a passive close, it will not end up in this state.)


[Bug 254366] Severe IPv6 TCP transfer issues on 13.0-RC2 with virtio network adapter

2021-03-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254366

Michael Tuexen  changed:

           What|Removed             |Added
---------------------------------------------------------
         Status|In Progress         |Closed
     Resolution|---                 |FIXED
          Flags|                    |mfc-stable13+



[Bug 254366] Severe IPv6 TCP transfer issues on 13.0-RC2 with virtio network adapter

2021-03-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254366

--- Comment #21 from Michael Tuexen  ---
This will be in the upcoming RC3.



[Bug 254366] Severe IPv6 TCP transfer issues on 13.0-RC2 with virtio network adapter

2021-03-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254366

--- Comment #20 from commit-h...@freebsd.org ---
A commit in branch releng/13.0 references this bug:

URL:
https://cgit.FreeBSD.org/src/commit/?id=11660fa28fd39a644cb7d30a4378cf4751b89f15

commit 11660fa28fd39a644cb7d30a4378cf4751b89f15
Author: Michael Tuexen 
AuthorDate: 2021-03-18 20:25:47 +
Commit: Michael Tuexen 
CommitDate: 2021-03-18 20:35:49 +

vtnet: fix TSO for TCP/IPv6

The decision whether a TCP packet is sent over IPv4 or IPv6 was
based on ethertype, which works correctly. In D27926 the criteria
was changed to checking if the CSUM_IP_TSO flag is set in the
csum-flags and then considering it to be TCP/IPv4.
However, the TCP stack sets the flag to CSUM_TSO for IPv4 and IPv6,
where CSUM_TSO is defined as CSUM_IP_TSO|CSUM_IP6_TSO.
Therefore TCP/IPv6 packets gets mis-classified as TCP/IPv4,
which breaks TSO for TCP/IPv6.
This patch bases the check again on the ethertype.
This fix is instantly MFCed.

Approved by:re(gjb)
PR: 254366
Sponsored by:   Netflix, Inc.
Differential Revision:  https://reviews.freebsd.org/D29331

(cherry picked from commit d4697a6b56168876fc0ffec1a0bb1b24d25b198e)
(cherry picked from commit 6064ea8172f078c54954bc6e8865625feb7979fe)

 sys/dev/virtio/network/if_vtnet.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
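
As an aside for readers following along: below is a self-contained illustration
of the flag logic the commit message describes. The CSUM_* bit values are
placeholders rather than the real sys/mbuf.h definitions, and this is not the
actual if_vtnet.c change; it only shows why a flag-based check labels IPv6 TSO
packets as IPv4 while an ethertype-based check does not.

#include <stdio.h>
#include <stdint.h>

#define CSUM_IP_TSO   0x01u                         /* placeholder bit value */
#define CSUM_IP6_TSO  0x02u                         /* placeholder bit value */
#define CSUM_TSO      (CSUM_IP_TSO | CSUM_IP6_TSO)  /* mirrors the real definition */

int
main(void)
{
        /* The TCP stack requests TSO the same way for IPv4 and IPv6. */
        uint32_t csum_flags = CSUM_TSO;             /* e.g. an IPv6 TSO packet */
        uint16_t ether_type = 0x86DD;               /* ETHERTYPE_IPV6 */

        /* Pre-fix check: flag based, so the IPv6 packet looks like IPv4. */
        if (csum_flags & CSUM_IP_TSO)
                printf("flag check: classified as TCP/IPv4 (wrong for IPv6)\n");

        /* Post-fix check: based on the ethertype of the frame. */
        printf("ethertype check: %s\n",
            ether_type == 0x0800 ? "TCP/IPv4" : "TCP/IPv6");
        return (0);
}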



[Bug 254366] Severe IPv6 TCP transfer issues on 13.0-RC2 with virtio network adapter

2021-03-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254366

--- Comment #19 from commit-h...@freebsd.org ---
A commit in branch stable/13 references this bug:

URL:
https://cgit.FreeBSD.org/src/commit/?id=6064ea8172f078c54954bc6e8865625feb7979fe

commit 6064ea8172f078c54954bc6e8865625feb7979fe
Author: Michael Tuexen 
AuthorDate: 2021-03-18 20:25:47 +
Commit: Michael Tuexen 
CommitDate: 2021-03-18 20:33:32 +

vtnet: fix TSO for TCP/IPv6

The decision whether a TCP packet is sent over IPv4 or IPv6 was
based on ethertype, which works correctly. In D27926 the criteria
was changed to checking if the CSUM_IP_TSO flag is set in the
csum-flags and then considering it to be TCP/IPv4.
However, the TCP stack sets the flag to CSUM_TSO for IPv4 and IPv6,
where CSUM_TSO is defined as CSUM_IP_TSO|CSUM_IP6_TSO.
Therefore TCP/IPv6 packets gets mis-classified as TCP/IPv4,
which breaks TSO for TCP/IPv6.
This patch bases the check again on the ethertype.
This fix is instantly MFCed.

Approved by:re(gjb)
PR: 254366
Sponsored by:   Netflix, Inc.
Differential Revision:  https://reviews.freebsd.org/D29331

(cherry picked from commit d4697a6b56168876fc0ffec1a0bb1b24d25b198e)

 sys/dev/virtio/network/if_vtnet.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)



[Bug 254366] Severe IPv6 TCP transfer issues on 13.0-RC2 with virtio network adapter

2021-03-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254366

--- Comment #18 from commit-h...@freebsd.org ---
A commit in branch main references this bug:

URL:
https://cgit.FreeBSD.org/src/commit/?id=d4697a6b56168876fc0ffec1a0bb1b24d25b198e

commit d4697a6b56168876fc0ffec1a0bb1b24d25b198e
Author: Michael Tuexen 
AuthorDate: 2021-03-18 20:25:47 +
Commit: Michael Tuexen 
CommitDate: 2021-03-18 20:32:20 +

vtnet: fix TSO for TCP/IPv6

The decision whether a TCP packet is sent over IPv4 or IPv6 was
based on ethertype, which works correctly. In D27926 the criteria
was changed to checking if the CSUM_IP_TSO flag is set in the
csum-flags and then considering it to be TCP/IPv4.
However, the TCP stack sets the flag to CSUM_TSO for IPv4 and IPv6,
where CSUM_TSO is defined as CSUM_IP_TSO|CSUM_IP6_TSO.
Therefore TCP/IPv6 packets gets mis-classified as TCP/IPv4,
which breaks TSO for TCP/IPv6.
This patch bases the check again on the ethertype.
This fix will be MFC instantly as discussed with re(gjb).

MFC after:  instantly
PR: 254366
Sponsored by:   Netflix, Inc.
Differential Revision:  https://reviews.freebsd.org/D29331

 sys/dev/virtio/network/if_vtnet.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)



Re: Severe IPv6 TCP transfer issues on 13.0-RC1 and RC2

2021-03-18 Thread Michael Tuexen
> On 15. Mar 2021, at 12:56, Blake Hartshorn  wrote:
> 
> The short version: when I use FreeBSD 13, delivering data can take 5 minutes 
> for 1MB over SSH or HTTP when using IPv6. This problem does not happen with 
> IPv4. I installed FreeBSD 12 and Linux on that same device; neither had the 
> problem.
> 
> I did some troubleshooting with Linode and have ultimately ruled the network 
> itself out at this point. When the server is on FreeBSD 13, it can download 
> quickly over IPv6, but not deliver. I started investigating after noticing my 
> SSH session was lagging when cat'ing large files or running builds. This 
> problem even occurs between VMs in the same datacenter. I generated a 1MB 
> file of base64 garbage served by nginx for testing. IPv6 is configured by 
> SLAAC and was set up by the installer on both the 12 and 13 installs. Linode 
> uses Linux/KVM hosts for their virtual machines, so it's running on that 
> virtual adapter.
> 
> I asked on the forums, another user recommended going to the mailing lists 
> instead. Does anyone know if config settings need to be different on 13? Did 
> I maybe just find a real issue? I can provide any requested details. Thanks!
I was able to reproduce the issue locally. A fix is under review:
https://reviews.freebsd.org/D29331

Best regards
Michael


[Bug 254366] Severe IPv6 TCP transfer issues on 13.0-RC2 with virtio network adapter

2021-03-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254366

--- Comment #17 from Michael Tuexen  ---
(In reply to Michael Tratz from comment #16)
Thanks for reporting back!

The patch is under review: https://reviews.freebsd.org/D29331



[Bug 254366] Severe IPv6 TCP transfer issues on 13.0-RC2 with virtio network adapter

2021-03-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254366

--- Comment #16 from Michael Tratz  ---
Thank you so much, Michael, for the quick fix. Everything is back to normal. IPv6
is speedy again.



[Bug 254366] Severe IPv6 TCP transfer issues on 13.0-RC2 with virtio network adapter

2021-03-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254366

Michael Tuexen  changed:

           What|Removed             |Added
---------------------------------------------------------
       Assignee|b...@freebsd.org    |tue...@freebsd.org
         Status|New                 |In Progress
             CC|                    |n...@freebsd.org



[Bug 254333] [tcp] sysctl net.inet.tcp.hostcache.list hangs

2021-03-18 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254333

--- Comment #2 from Maxim Shalomikhin  ---
It will take some time to reproduce the issue. Please let me know if there is any
other information I can collect when the sysctl hangs again.



Re: AW: NFS Mount Hangs

2021-03-18 Thread Rodney W. Grimes
> >>Output from the NFS Client when the issue occurs # netstat -an | grep 
> >>NFS.Server.IP.X
> >>tcp0  0 NFS.Client.IP.X:46896  NFS.Server.IP.X:2049   
> >>FIN_WAIT2
> >I'm no TCP guy. Hopefully others might know why the client would be stuck in 
> >FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack, but 
> >could be wrong?)
> 
> FIN_WAIT2 is the state the client side ends up in when it has actively 
> close()d the TCP session and the server has ACKed the FIN. 
> This will last for ~2 min or so, but is asynchronous. However, the same 
> 4-tuple cannot be reused during this time.

I do not think this is the full story.  If in fact the Client did call close() 
you are correct: you would end up in FIN_WAIT2 for up to whatever the timeout 
is set to (IIRC the default on that is 60 seconds); further, in this case the 
socket is disconnected from the application and only the kernel has knowledge 
of it.  The socket can stay in FIN_WAIT2 indefinitely if the application
calls shutdown(2) on it and never follows up with a close(2).

Which situation exists can be determined by looking at fstat to see if the 
socket has an associated PID or not;
not sure if that works on Linux though.

> 
> In other words, from the socket / TCP perspective, a properly executed active 
> close() will end up in this state. (If the other side initiated the close, i.e. 
> a passive close, it will not end up in this state.)

But only for a brief period.  Being stuck in this state indicates that close(2) 
was not called, but shutdown(2) was.
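
A userland sketch of the two cases distinguished above (illustrative only; the
helper names are made up):

#include <sys/socket.h>
#include <unistd.h>

/*
 * Case 1: proper active close.  Our FIN is sent; FIN_WAIT_2 is left as soon as
 * the peer's FIN arrives (then TIME_WAIT for ~2*MSL), or the kernel eventually
 * times the socket out because no process holds the descriptor anymore.
 */
static void
close_properly(int fd)
{
        close(fd);
}

/*
 * Case 2: half close without close().  The FIN is sent and ACKed, but the
 * descriptor still belongs to the process, so the kernel will not reap the
 * socket and it can sit in FIN_WAIT_2 indefinitely.
 */
static void
half_close_and_forget(int fd)
{
        shutdown(fd, SHUT_WR);
        /* ... no close(fd) ever follows; the state lingers ... */
}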

I believe it is also possible to get sockets stuck in this state when a 
client-side port number reuse collides with a server-side socket that is still 
in a CLOSE_WAIT state. Oh wait, that ends up in no response to the SYN, huh... 

-- 
Rod Grimes rgri...@freebsd.org


Re: NFS Mount Hangs

2021-03-18 Thread Jason Breitman
The laggproto is lacp and the switch is made by Extreme Networks.

Jason Breitman


On Mar 18, 2021, at 4:06 AM, Gerrit Kuehn  wrote:


On Wed, 17 Mar 2021 18:17:14 -0400
Jason Breitman  wrote:

> I will look into disabling the TSO and LRO options and let the group
> know how it goes. Below are the current options on the NFS Server.
> lagg0: flags=8943
> metric 0 mtu 1500
> options=e507bb

What laggproto are you using, and what kind of switch is connected on
the other end?


cu
Gerrit


Re: NFS Mount Hangs

2021-03-18 Thread tuexen
> On 18. Mar 2021, at 13:53, Rodney W. Grimes  
> wrote:
> 
> Note I am NOT a TCP expert, but know enough about it to add a comment...
> 
>> Alan Somers wrote:
>> [stuff snipped]
>>> Is the 128K limit related to MAXPHYS?  If so, it should be greater in 13.0.
>> For the client, yes. For the server, no.
>> For the server, it is just a compile time constant NFS_SRVMAXIO.
>> 
>> It's mainly related to the fact that I haven't gotten around to testing 
>> larger
>> sizes yet.
>> - kern.ipc.maxsockbuf needs to be several times the limit, which means it 
>> would
>>   have to increase for 1Mbyte.
>> - The session code must negotiate a maximum RPC size > 1 Mbyte.
>>   (I think the server code does do this, but it needs to be tested.)
>> And, yes, the client is limited to MAXPHYS.
>> 
>> Doing this is on my todo list, rick
>> 
>> The client should acquire the attributes that indicate that and set 
>> rsize/wsize
>> to that. "# nfsstat -m" on the client should show you what the client
>> is actually using. If it is larger than 128K, set both rsize and wsize to 
>> 128K.
>> 
>>> Output from the NFS Client when the issue occurs
>>> # netstat -an | grep NFS.Server.IP.X
>>> tcp0  0 NFS.Client.IP.X:46896  NFS.Server.IP.X:2049   
>>> FIN_WAIT2
>> I'm no TCP guy. Hopefully others might know why the client would be
>> stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack,
>> but could be wrong?)
> 
> The most common way to get stuck in FIN_WAIT2 is to call
> shutdown(2) on a socket but never follow up with a
> close(2) after some timeout period.  The "client" is still
> connected to the socket and can stay in this shutdown state
> forever; the kernel will not reap the socket, as it is
> associated with a process, aka not orphaned.  I suspect
> that the Linux client has a corner condition that is leading
> to this socket leak.
> 
> If on the Linux client you can look at the sockets to see
> whether these are still associated with a process, a la fstat or
> whatever Linux tool does this, that would be helpful.
> If they are in fact connected to a process, it is that
> process that must call close(2) to clean these up.
> 
> IIRC the server side socket would be gone at this point
> and there is nothing the server can do that would allow
> a FIN_WAIT2 to close down.
Jason reported that the server is in CLOSE_WAIT. This would
mean that the server received the FIN, ACKed it, but has not
initiated the teardown of the Server->Client direction.
So the server side socket is still there and close() has not
been called yet.
> 
> The real TCP experts can now correct my 30 year old TCP
> stack understanding...
I wouldn't count myself as a real TCP expert, but the behaviour
hasn't changed in the last 30 years, I think...

Best regards
Michael
> 
>> 
>>> # cat /sys/kernel/debug/sunrpc/rpc_xprt/*/info
>>> netid: tcp
>>> addr:  NFS.Server.IP.X
>>> port:  2049
>>> state: 0x51
>>> 
>>> syslog
>>> Mar  4 10:29:27 hostname kernel: [437414.131978] -pid- flgs status -client- 
>>> --rqstp- ->timeout ---ops--
>>> Mar  4 10:29:27 hostname kernel: [437414.133158] 57419 40a1  0 9b723c73 
>>> >143cfadf3 4ca953b5 nfsv4 OPEN_NOATTR a:call_connect_status 
>>> [sunrpc] >q:xprt_pending
>> I don't know what OPEN_NOATTR means, but I assume it is some variant
>> of NFSv4 Open operation.
>> [stuff snipped]
>>> Mar  4 10:29:30 hostname kernel: [437417.110517] RPC: 57419 
>>> xprt_connect_status: >connect attempt timed out
>>> Mar  4 10:29:30 hostname kernel: [437417.112172] RPC: 57419 
>>> call_connect_status
>>> (status -110)
>> I have no idea what status -110 means?
>>> Mar  4 10:29:30 hostname kernel: [437417.113337] RPC: 57419 call_timeout 
>>> (major)
>>> Mar  4 10:29:30 hostname kernel: [437417.114385] RPC: 57419 call_bind 
>>> (status 0)
>>> Mar  4 10:29:30 hostname kernel: [437417.115402] RPC: 57419 call_connect 
>>> xprt >e061831b is not connected
>>> Mar  4 10:29:30 hostname kernel: [437417.116547] RPC: 57419 xprt_connect 
>>> xprt >e061831b is not connected
>>> Mar  4 10:30:31 hostname kernel: [437478.551090] RPC: 57419 
>>> xprt_connect_status: >connect attempt timed out
>>> Mar  4 10:30:31 hostname kernel: [437478.552396] RPC: 57419 
>>> call_connect_status >(status -110)
>>> Mar  4 10:30:31 hostname kernel: [437478.553417] RPC: 57419 call_timeout 
>>> (minor)
>>> Mar  4 10:30:31 hostname kernel: [437478.554327] RPC: 57419 call_bind 
>>> (status 0)
>>> Mar  4 10:30:31 hostname kernel: [437478.555220] RPC: 57419 call_connect 
>>> xprt >e061831b is not connected
>>> Mar  4 10:30:31 hostname kernel: [437478.556254] RPC: 57419 xprt_connect 
>>> xprt >e061831b is not connected
>> Is it possible that the client is trying to (re)connect using the same 
>> client port#?
>> I would normally expect the client to create a new TCP connection using a
>> different client port# and then retry the outstanding RPCs.
>> --> Capturing packets when this happens would show us what is going on.
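
A sketch of the two reconnect behaviours contrasted above (illustrative only;
reconnect() and its parameters are hypothetical and not the Linux RPC client
code). Connecting without bind(2) gets a fresh ephemeral source port, so an old
connection lingering in FIN_WAIT_2/TIME_WAIT does not get in the way; re-binding
the previous source port can collide with that old 4-tuple until it is gone.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static int
reconnect(const char *server_ip, uint16_t server_port, int reuse_same_port,
    uint16_t old_port)
{
        struct sockaddr_in srv, cli;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
                return (-1);

        if (reuse_same_port) {
                /* Force the previous source port: may hit the old 4-tuple. */
                memset(&cli, 0, sizeof(cli));
                cli.sin_family = AF_INET;
                cli.sin_port = htons(old_port);
                if (bind(fd, (struct sockaddr *)&cli, sizeof(cli)) < 0) {
                        close(fd);
                        return (-1);
                }
        }
        /* Otherwise connect(2) picks a fresh ephemeral port. */

        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port = htons(server_port);
        inet_pton(AF_INET, server_ip, &srv.sin_addr);

        if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
                close(fd);
                return (-1);
        }
        return (fd);
}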

Re: NFS Mount Hangs

2021-03-18 Thread Michael Tuexen
> On 18. Mar 2021, at 13:42, Scheffenegger, Richard 
>  wrote:
> 
>>> Output from the NFS Client when the issue occurs # netstat -an | grep 
>>> NFS.Server.IP.X
>>> tcp0  0 NFS.Client.IP.X:46896  NFS.Server.IP.X:2049   
>>> FIN_WAIT2
>> I'm no TCP guy. Hopefully others might know why the client would be stuck in 
>> FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack, but 
>> could be wrong?)
> 
> FIN_WAIT2 is the state the client side ends up in when it has actively 
> close()d the TCP session and the server has ACKed the FIN.
Jason noted:

When the issue occurs, this is what I see on the NFS Server.
tcp4   0  0 NFS.Server.IP.X.2049  NFS.Client.IP.X.51550 
CLOSE_WAIT  

which corresponds to the state on the client side. The server received the FIN
from the client and acked it. The server is waiting for a close call to happen.
So the question is: Is the server also closing the connection?

Best regards
Michael
> This will last for ~2 min or so, but is asynchronous. However, the same 
> 4-tuple cannot be reused during this time.
> 
> In other words, from the socket / TCP perspective, a properly executed active 
> close() will end up in this state. (If the other side initiated the close, i.e. 
> a passive close, it will not end up in this state.)


Re: NFS Mount Hangs

2021-03-18 Thread Rodney W. Grimes
Note I am NOT a TCP expert, but know enough about it to add a comment...

> Alan Somers wrote:
> [stuff snipped]
> >Is the 128K limit related to MAXPHYS?  If so, it should be greater in 13.0.
> For the client, yes. For the server, no.
> For the server, it is just a compile time constant NFS_SRVMAXIO.
> 
> It's mainly related to the fact that I haven't gotten around to testing larger
> sizes yet.
> - kern.ipc.maxsockbuf needs to be several times the limit, which means it 
> would
>have to increase for 1Mbyte.
> - The session code must negotiate a maximum RPC size > 1 Mbyte.
>(I think the server code does do this, but it needs to be tested.)
> And, yes, the client is limited to MAXPHYS.
> 
> Doing this is on my todo list, rick
> 
> The client should acquire the attributes that indicate that and set 
> rsize/wsize
> to that. "# nfsstat -m" on the client should show you what the client
> is actually using. If it is larger than 128K, set both rsize and wsize to 
> 128K.
> 
> >Output from the NFS Client when the issue occurs
> ># netstat -an | grep NFS.Server.IP.X
> >tcp0  0 NFS.Client.IP.X:46896  NFS.Server.IP.X:2049   
> >FIN_WAIT2
> I'm no TCP guy. Hopefully others might know why the client would be
> stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack,
> but could be wrong?)

The most common way to get stuck in FIN_WAIT2 is to call
shutdown(2) on a socket but never follow up with a
close(2) after some timeout period.  The "client" is still
connected to the socket and can stay in this shutdown state
forever; the kernel will not reap the socket, as it is
associated with a process, aka not orphaned.  I suspect
that the Linux client has a corner condition that is leading
to this socket leak.

If on the Linux client you can look at the sockets to see
whether these are still associated with a process, a la fstat or
whatever Linux tool does this, that would be helpful.
If they are in fact connected to a process, it is that
process that must call close(2) to clean these up.

IIRC the server side socket would be gone at this point
and there is nothing the server can do that would allow
a FIN_WAIT2 to close down.

The real TCP experts can now correct my 30 year old TCP
stack understanding...

> 
> ># cat /sys/kernel/debug/sunrpc/rpc_xprt/*/info
> >netid: tcp
> >addr:  NFS.Server.IP.X
> >port:  2049
> >state: 0x51
> >
> >syslog
> >Mar  4 10:29:27 hostname kernel: [437414.131978] -pid- flgs status -client- 
> >--rqstp- ->timeout ---ops--
> >Mar  4 10:29:27 hostname kernel: [437414.133158] 57419 40a1  0 9b723c73 
> >>143cfadf3 4ca953b5 nfsv4 OPEN_NOATTR a:call_connect_status [sunrpc] 
> >>q:xprt_pending
> I don't know what OPEN_NOATTR means, but I assume it is some variant
> of NFSv4 Open operation.
> [stuff snipped]
> >Mar  4 10:29:30 hostname kernel: [437417.110517] RPC: 57419 
> >xprt_connect_status: >connect attempt timed out
> >Mar  4 10:29:30 hostname kernel: [437417.112172] RPC: 57419 
> >call_connect_status
> >(status -110)
> I have no idea what status -110 means?
> >Mar  4 10:29:30 hostname kernel: [437417.113337] RPC: 57419 call_timeout 
> >(major)
> >Mar  4 10:29:30 hostname kernel: [437417.114385] RPC: 57419 call_bind 
> >(status 0)
> >Mar  4 10:29:30 hostname kernel: [437417.115402] RPC: 57419 call_connect 
> >xprt >e061831b is not connected
> >Mar  4 10:29:30 hostname kernel: [437417.116547] RPC: 57419 xprt_connect 
> >xprt >e061831b is not connected
> >Mar  4 10:30:31 hostname kernel: [437478.551090] RPC: 57419 
> >xprt_connect_status: >connect attempt timed out
> >Mar  4 10:30:31 hostname kernel: [437478.552396] RPC: 57419 
> >call_connect_status >(status -110)
> >Mar  4 10:30:31 hostname kernel: [437478.553417] RPC: 57419 call_timeout 
> >(minor)
> >Mar  4 10:30:31 hostname kernel: [437478.554327] RPC: 57419 call_bind 
> >(status 0)
> >Mar  4 10:30:31 hostname kernel: [437478.555220] RPC: 57419 call_connect 
> >xprt >e061831b is not connected
> >Mar  4 10:30:31 hostname kernel: [437478.556254] RPC: 57419 xprt_connect 
> >xprt >e061831b is not connected
> Is it possible that the client is trying to (re)connect using the same client 
> port#?
> I would normally expect the client to create a new TCP connection using a
> different client port# and then retry the outstanding RPCs.
> --> Capturing packets when this happens would show us what is going on.
> 
> If there is a problem on the FreeBSD end, it is most likely a broken
> network device driver.
> --> Try disabling TSO , LRO.
> --> Try a different driver for the net hardware on the server.
> --> Try a different net chip on the server.
> If you can capture packets when (not after) the hang
> occurs, then you can look at them in wireshark and see
> what is actually happening. (Ideally on both client and
> server, to check that your network hasn't dropped anything.)
> --> I know, if the hangs aren't easily reproducible, this isn't
> easily done.
> --> Try a 

AW: NFS Mount Hangs

2021-03-18 Thread Scheffenegger, Richard
>>Output from the NFS Client when the issue occurs # netstat -an | grep 
>>NFS.Server.IP.X
>>tcp0  0 NFS.Client.IP.X:46896  NFS.Server.IP.X:2049   
>>FIN_WAIT2
>I'm no TCP guy. Hopefully others might know why the client would be stuck in 
>FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack, but could 
>be wrong?)

FIN_WAIT2 is the state the client side ends up in when it has actively close()d 
the TCP session and the server has ACKed the FIN. 
This will last for ~2 min or so, but is asynchronous. However, the same 4-tuple 
cannot be reused during this time.

In other words, from the socket / TCP perspective, a properly executed active 
close() will end up in this state. (If the other side initiated the close, i.e. 
a passive close, it will not end up in this state.)




Re: NFS Mount Hangs

2021-03-18 Thread Gerrit Kuehn


On Wed, 17 Mar 2021 18:17:14 -0400
Jason Breitman  wrote:

> I will look into disabling the TSO and LRO options and let the group
> know how it goes. Below are the current options on the NFS Server.
> lagg0: flags=8943
> metric 0 mtu 1500
> options=e507bb

What laggproto are you using, and what kind of switch is connected on
the other end?


cu
  Gerrit