Re: ssh connections break with "Fssh_packet_write_wait" on 13 [SOLVED]

2021-06-09 Thread tuexen
> On 9. Jun 2021, at 08:57, Don Lewis  wrote:
> [...]

Re: ssh connections break with "Fssh_packet_write_wait" on 13 [SOLVED]

2021-06-09 Thread Rodney W. Grimes
> [...]

Re: ssh connections break with "Fssh_packet_write_wait" on 13 [SOLVED]

2021-06-09 Thread Don Lewis
On  8 Jun, Michael Gmelin wrote:
> [...]
> 
> So, PRR probably was a red herring; the real reason this is happening
> is that FreeBSD (since version 13[0]) by default discards packets
> without timestamps on connections that had negotiated to use them.
> This new behavior seems to be in line with RFC 7323, section 3.2[1]:
> 
> "Once TSopt has been successfully negotiated, that is both <SYN> and
>  <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
>  segment for the duration of the connection, and SHOULD be sent in an
>  <RST> segment (see Section 5.2 for details)."
> 
> As it turns out, macOS does exactly this: it sends keep-alive packets
> without a timestamp on connections that negotiated to have them.

I wonder if I'm running into this with ssh connections to freefall.  My
outgoing IPv6 connections pass through an ipfw firewall that uses
dynamic rules.  When the dynamic rule gets close to expiration, it
generates keep-alive packets that just seem to be ignored by freefall.
Eventually the dynamic rule expires, then sometime later sshd on
freefall sends a keep-alive which gets dropped at my end.
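
For what it's worth, the ipfw side of this can be checked with a couple
of sysctls from the dynamic-rule family; the names below are real ipfw
knobs, but the values shown are only the usual defaults as far as I
recall:

  # does ipfw send keep-alives for dynamic rules that are about to expire?
  sysctl net.inet.ip.fw.dyn_keepalive

  # lifetime (in seconds) of a dynamic rule for an established TCP connection
  sysctl net.inet.ip.fw.dyn_ack_lifetime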

Re: ssh connections break with "Fssh_packet_write_wait" on 13 [SOLVED]

2021-06-08 Thread tuexen
> On 9. Jun 2021, at 00:20, Rodney W. Grimes  wrote:
> [...]

Re: ssh connections break with "Fssh_packet_write_wait" on 13 [SOLVED]

2021-06-08 Thread Rodney W. Grimes
> [...]

Re: ssh connections break with "Fssh_packet_write_wait" on 13 [SOLVED]

2021-06-08 Thread Michael Gmelin



On Thu, 3 Jun 2021 15:09:06 +0200
Michael Gmelin  wrote:

> [...]

So, PRR probably was a red herring; the real reason this is happening
is that FreeBSD (since version 13[0]) by default discards packets
without timestamps on connections that had negotiated to use them.
This new behavior seems to be in line with RFC 7323, section 3.2[1]:

"Once TSopt has been successfully negotiated, that is both <SYN> and
 <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
 segment for the duration of the connection, and SHOULD be sent in an
 <RST> segment (see Section 5.2 for details)."

As it turns out, macOS does exactly this: it sends keep-alive packets
without a timestamp on connections that negotiated to have them.
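
As far as I can tell, the stricter behavior comes with a sysctl to
relax it, added as part of the RFC 7323 rework - I believe it is called
net.inet.tcp.tolerate_missing_ts, but treat the name as an assumption
and verify it with sysctl -d before relying on it:

  # check that the knob exists and read its description (name assumed)
  sysctl -d net.inet.tcp.tolerate_missing_ts

  # if present: accept segments without timestamps on connections
  # that negotiated the timestamp option
  sysctl net.inet.tcp.tolerate_missing_ts=1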

Under normal circumstances - ssh from macOS to a server running FreeBSD
13 - this won't be noticed, since macOS uses the same defaults as
FreeBSD (2 hours of idle time, 75-second probe intervals), so the
server-side keep-alive will save the connection before it has a chance
to break due to eight consecutive unanswered keep-alives on the client
side.
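
For reference, those defaults correspond to the following FreeBSD
sysctls (the values are the stock defaults as I remember them; keepidle
and keepintvl are in milliseconds, so double-check on your own system):

  net.inet.tcp.keepidle=7200000   # 2 hours of idle time before probing starts
  net.inet.tcp.keepintvl=75000    # 75 seconds between keep-alive probes
  net.inet.tcp.keepcnt=8          # drop the connection after 8 unanswered probes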

This is different for ssh connections originating from a VM inside
Parallels, as connections created by prl_naptd will start sending tcp
keep-alives shortly after the connection becomes 

Re: ssh connections break with "Fssh_packet_write_wait" on 13

2021-06-03 Thread Michael Gmelin



On Tue, 1 Jun 2021 13:47:47 +0200
Michael Gmelin  wrote:

> Hi,
> 
> Since upgrading servers from 12.2 to 13.0, I get
> 
>   Fssh_packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe
> 
> consistently, usually after about 11 idle minutes, both with and
> without pf enabled. The client (11.4 in a VM) wasn't altered.
> 
> Verbose logging (client and server side) doesn't show anything special
> when the connection breaks. In the past, QoS problems caused these
> disconnects, but I didn't see any apparent change between 12.2 and 13
> in this respect.
> 
> I did a test on a newly commissioned server to rule out other factors
> (so, same client connections, same routes, same everything). On 12.2
> before the update: the connection stays open for hours. After the
> update (same server): connections break consistently after < 15
> minutes (this is with unaltered configurations, no *AliveInterval
> configured on either side of the connection).
> 

I did a little bit more testing and realized that the problem goes away
when I disable "Proportional Rate Reduction per RFC 6937" on the server
side:

  sysctl net.inet.tcp.do_prr=0

Keeping it on and enabling net.inet.tcp.do_prr_conservative doesn't fix
the problem.
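
For completeness, this is standard sysctl handling - a small sketch of
how the knobs can be inspected and how the workaround could be kept
across reboots, should anyone want that:

  # show descriptions and current values of the PRR knobs
  sysctl -d net.inet.tcp.do_prr net.inet.tcp.do_prr_conservative
  sysctl net.inet.tcp.do_prr net.inet.tcp.do_prr_conservative

  # apply the workaround at runtime
  sysctl net.inet.tcp.do_prr=0

  # make it persistent across reboots
  echo 'net.inet.tcp.do_prr=0' >> /etc/sysctl.conf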

This seems to be specific to Parallels. After some more digging, I
realized that Parallels Desktop's NAT daemon (prl_naptd) handles
keep-alive between the VM and the external server on its own. There is
no direct communication between the client and the server. This means:

- The NAT daemon starts sending keep-alive packets right away (not
  after the VM's net.inet.tcp.keepidle), every 75 seconds.
- Keep-alive packets originating in the VM never reach the server.
- Keep-alives originating on the server never reach the VM.
- Client and server effectively do keep-alive with the NAT daemon, not
  with each other.

It also seems like Parallels is zeroing the ToS field (so it's always
0x00), but that's unrelated.

I configured a bhyve VM running FreeBSD 11.4 on a separate laptop on
the same network for comparison and it has no such issues.

Looking at tcpdump output on the server, this is what a keep-alive
packet sent by Parallels looks like:

  10:14:42.449681 IP (tos 0x0, ttl 64, id 15689, offset 0, flags [none],
proto TCP (6), length 40)
192.168.1.1.58222 > 192.168.1.2.22: Flags [.], cksum x (correct),
seq 2534, ack 3851, win 4096, length 0

While those originating from the bhyve VM (after lowering
net.inet.tcp.keepidle) look like this:

  12:18:43.105460 IP (tos 0x0, ttl 62, id 0, offset 0, flags [DF],
proto TCP (6), length 52)
192.168.1.3.57555 > 192.168.1.2.22: Flags [.], cksum x
(correct), seq 1780337696, ack 45831723, win 1026, options
[nop,nop,TS val 3003646737 ecr 3331923346], length 0
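
For reference, output like the above can be captured on the server with
something along these lines (the interface name is just an example):

  # verbose capture of SSH traffic, including TCP options such as timestamps
  tcpdump -n -v -i em0 'tcp port 22'

Keep-alive probes then show up as the zero-length ACK segments quoted
above.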

As written above, once net.inet.tcp.do_prr is disabled, keep-alive
seems to work just fine. Otherwise, Parallels' NAT daemon kills the
connection, as its keep-alive probes go unanswered (well, that's what I
think is happening):

  10:19:43.614803 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
proto TCP (6), length 40)
192.168.1.1.58222 > 192.168.1.2.22: Flags [R.], cksum x (correct),
seq 2535, ack 3851, win 4096, length 0

The easiest way to work around the problem on the client side is to
configure ServerAliveInterval in ~/.ssh/config in the client VM.
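
A minimal sketch of that workaround (the values are just examples;
anything that keeps traffic flowing more often than the NAT daemon's
timeout should do):

  # ~/.ssh/config inside the client VM
  Host *
      ServerAliveInterval 60
      ServerAliveCountMax 3

The same effect can be had on the server side with ClientAliveInterval
in sshd_config, but touching only the client VM is less invasive.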

I'm curious, though, whether this is basically a Parallels problem that
has only been exposed by PRR being more correct (which is what I
suspect), or whether it is actually a FreeBSD problem.

Michael

-- 
Michael Gmelin



ssh connections break with "Fssh_packet_write_wait" on 13

2021-06-01 Thread Michael Gmelin
Hi,

Since upgrading servers from 12.2 to 13.0, I get

  Fssh_packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe

consistently, usually after about 11 idle minutes, both with and
without pf enabled. The client (11.4 in a VM) wasn't altered.

Verbose logging (client and server side) doesn't show anything special
when the connection breaks. In the past, QoS problems caused these
disconnects, but I didn't see any apparent change between 12.2 and 13
in this respect.

I did a test on a newly commissioned server to rule out other factors
(so, same client connections, same routes, same everything). On 12.2
before the update: the connection stays open for hours. After the
update (same server): connections break consistently after < 15 minutes
(this is with unaltered configurations, no *AliveInterval configured on
either side of the connection).

Thanks
Michael

-- 
Michael Gmelin