Increasing TCP TSO size support

2024-02-02 Thread Scheffenegger, Richard


Hi,

We have run a test for an RPC workload with 1MB IO sizes, and collected 
the tcp_default_output() length ("len") during the first pass of the output loop.


In such a scenario, where the application frequently introduces small 
pauses between sending additional data (since the next large IO is only 
sent after the corresponding request from the client has been received 
and processed), the current TSO maximum of 64kB (45*1448 bytes in 
effect) requires multiple passes through the output routine to send all 
the allowable (cwnd-limited) data.


I'll try to get a data collection with better granularity above 90,000 
bytes - but even here the average strongly indicates that a majority of 
transmission opportunities are in the 512 kB area - probably also having 
to do with LRO and ACK thinning effects by the client.


In other words, tcp_output() has to run about 9 times with TSO to 
transmit all eligible data - increasing the FreeBSD-supported maximum 
TSO size to what current hardware can handle (256kB..1MB) would reduce 
the CPU burden here.
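
(As a quick back-of-the-envelope check of the pass count at different TSO
maxima - a small standalone sketch, using the measured average from the table
below and the observed 1448-byte MSS; the 256 kB and 1 MB values are the
assumed hardware limits mentioned above:)

#include <stdio.h>

int
main(void)
{
        const double avg = 578844.44;   /* measured average opportunity, bytes */
        const int tso_max[3] = { 45 * 1448, 256 * 1024, 1024 * 1024 };

        /* passes through tcp_output() needed to drain one average opportunity */
        for (int i = 0; i < 3; i++)
                printf("TSO max %7d bytes -> %4.1f output passes\n",
                    tso_max[i], avg / tso_max[i]);
        return (0);
}

(which gives roughly 8.9, 2.2 and 0.6 passes, respectively)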



Is increasing the software-supported TSO size, to allow for what the NICs 
can do nowadays, something anyone apart from us would be interested in 
(in particular, those who work on the drivers)?



Best regards,

  Richard




TSO size (transmissions < 1448 would not be accounted here at all)

                    # count
<  1000                   0
<  2000                  23
<  3000                 111
<  4000                  40
<  5000                  30
<  7000                  14
<  8000                 134
<  9000                 442
< 10000                9396
< 20000               46227
< 30000               25646
< 40000               33060
< 60000               23162
< 70000               24368
< 80000               19772
< 90000               40101
>= 90000           75384169
Average:          578844.44





Re: TSO + ECN

2023-12-22 Thread Scheffenegger, Richard


Thanks Michael.

Having looked at that document, the bit masks there are incorrect.

In RFC3168, the CWR bit is supposed to be sent once only (and ideally as 
early as possible). The documented bitmasks for the First, Mid and Last 
segments don't make sense in that case:


0xFF6 0xFF6 0xF7F

These masks would allow the CWR bit in the first and any middle segment, 
only clearing it in the last - where PSH and FIN would be allowed to be 
sent... (Also, why the SYN and RST bits aren't similarly masked out 
escapes me).
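
(To illustrate the effect, here is a small sketch of how a TSO engine
typically applies such per-segment flag masks - not the code of any
particular driver:)

#include <stdint.h>

#define TSO_FLAGS_FIRST 0xFF6   /* clears FIN (0x01) and PSH (0x08) */
#define TSO_FLAGS_MID   0xFF6   /* clears FIN and PSH */
#define TSO_FLAGS_LAST  0xF7F   /* clears CWR (0x80) */

/*
 * Flags actually emitted for segment 'seg' out of 'nsegs' generated from
 * one TSO burst.  With the documented masks, CWR survives on the first
 * and every middle segment and is only stripped from the last one - the
 * opposite of what RFC 3168 intends.
 */
static uint16_t
tso_segment_flags(uint16_t th_flags, int seg, int nsegs)
{
        if (seg == 0)
                return (th_flags & TSO_FLAGS_FIRST);
        if (seg == nsegs - 1)
                return (th_flags & TSO_FLAGS_LAST);
        return (th_flags & TSO_FLAGS_MID);
}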



I also checked how the vmxnet3 driver behaves when TSO is active - and 
found that it will leave the CWR bit unchanged on any of the TSO segments.


Finally (and this is where this came from), the virtio driver discards 
TSO mbufs with ENOTSUP when encountering the CWR bit, if the host 
didn't indicate that its TSO capability would "properly" support 
ECN. That leads to massive performance degradation, as TSO remains 
enabled, but every time a segment carrying the CWR bit needs to be sent, 
the cwnd has to collapse to 1 MSS for the transmission to succeed. This 
typically takes an RTO...



Ultimately we also need to consider the upcoming changes in semantics of 
these ECN-related bits with AccECN (which do *NOT* require any special 
handling on the TX path for these bits any longer).



I decided to create D43166 to fix this in tcp_output(),
and D43167 to no longer stop TSO transmissions when encountering CWR on 
"unsupporting" hosts.



By restructuring some of the ECN handling, whenever the CWR bit is 
scheduled to be sent, the TSO TX path is now bypassed completely.


For 3168 ECN - where only a single segment per RTT would be expected to 
have the CWR bit set, I believe this is an acceptable compromise - to 
bypass the various broken or misbehaving TSO implementations when it 
comes to ECN.
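
(Conceptually, the decision boils down to something like the following -
a sketch of the idea only, not the actual D43166 diff; "accecn_session" is a
placeholder for "AccECN was negotiated on this connection":)

#include <stdbool.h>
#include <stdint.h>

#define TH_CWR  0x80

/*
 * May this transmission use TSO?  For a classic RFC 3168 session, a
 * CWR-marked send falls back to a plain single segment, so broken or
 * misbehaving TSO engines never see the bit; an AccECN session keeps
 * TSO, since CWR needs no special TX handling there.
 */
static bool
tso_allowed(uint16_t thflags, bool accecn_session)
{
        if ((thflags & TH_CWR) != 0 && !accecn_session)
                return (false);
        return (true);
}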


For AccECN, where long flights of data could easily have the CWR bit (as 
part of the ACE counter) set, a more performant solution would be needed.


I imagine the simplest one would be to remove any error branch for 
special handling of CWR - even on older TSO drivers, where ECN is not 
supported - and to reprogram the header bitmasks in "ECN-aware" TSO offload 
hardware to send the CWR bit unobstructed for the entire TSO burst:


0xFF6 0xFF6 0xFFF

and once that is all in place, allow TSO only for AccECN enabled 
sessions when the CWR bit is encountered...




I would like to gather some feedback from those who work on the various 
network drivers (intel, mlx, virtio, ...) on whether that sounds like a viable 
plan to rectify the sad state of ECN support with TSO - while becoming 
future-proof.



> On Dec 20, 2023, at 12:15, Scheffenegger, Richard 
 wrote:

>
> Hi,
>
> I am curious if anyone here has experience with the handling of ECN 
in TSO-enabled drivers/hardware...


Some data points, if I read the specification correctly.
Have a look at the specification of the 10GBit/sec card ix:
https://cdrdv2-public.intel.com/331520/82599-datasheet-v3-4.pdf

According to section 7.2.4 and 8.2.3.9.3 and 8.2.3.9.4 the
* first segment gets all flags except PSH and FIN.
* middle segments get all flags except PSH and FIN.
* last segment gets all flags except the CWR.

I think you should be able to change the masks.

Best regards
Michael

>
> The other day I found that the virtio driver would bail out with 
ENOTSUP when encountering the TCP CWR header bit on a TSO-enabled flow, 
when the host does not also claim ECN-support for TSO.

>
> But this made me wonder what the expected behavior is.
>
> Presumably, this means that the hardware (or driver) would clear the 
CWR bit after the first packet is sent, correct?

>
> However, in light of the upcoming AccECN signalling protocol, that is 
not what TSO should be doing (with AccECN, all segments should retain 
the exact same header flags, maybe except PSH).

>
> Probably "non-ECN" capable TSO offload would actually work better 
with AccECN - and if the above behavior is what ECN-aware TSO is doing, 
AccECN sessions would need to somehow work around that (e.g. 
spoon-feeding any segment with CWR set individually - e.g. bypassing the 
TSO capabilities in tcp_output)?

>
>
> Would appreciate any feedback around this...
>
> Best regards,
> Richard





TSO + ECN

2023-12-20 Thread Scheffenegger, Richard

Hi,

I am curious if anyone here has experience with the handling of ECN in 
TSO-enabled drivers/hardware...


The other day I found that the virtio driver would bail out with ENOTSUP 
when encountering the TCP CWR header bit on a TSO-enabled flow, when the 
host does not also claim ECN-support for TSO.


But this made me wonder what the expected behavior is.

Presumably, this means that the hardware (or driver) would clear the CWR 
bit after the first packet is sent, correct?


However, in light of the upcoming AccECN signalling protocol, that is 
not what TSO should be doing (with AccECN, all segments should retain 
the exact same header flags, maybe except PSH).


Probably "non-ECN" capable TSO offload would actually work better with 
AccECN - and if the above behavior is what ECN-aware TSO is doing, 
AccECN sessions would need to somehow work around that (e.g. 
spoon-feeding any segment with CWR set individually - e.g. bypassing the 
TSO capabilities in tcp_output)?



Would appreciate any feedback around this...

Best regards,
  Richard




RE: Network starvation question

2023-11-04 Thread Scheffenegger, Richard


Cheng is correct;

A non-reactive UDP flow (an application pushing data as quickly as it can, 
without any regard for whether the packet even departs the machine) will always be able 
to usurp excessive amounts of network capacity.

TCP uses (well-designed) congestion control in order to prevent TCP from 
usurping excessive amounts of resources - but that also means that in the presence 
of a bully, it will yield excessively.

If you cannot control that aggressive application directly, the canonical 
approach is to introduce some form of AQM (active queue management) on your 
bottleneck queue, e.g. FQ-CoDel - with the FQ part making sure that there is 
an effective minimum bandwidth guarantee for very light flows at all times 
(DNS, NTP, RDP, SSH) even in the presence of unreactive, aggressive and 
bandwidth-hogging flows (regardless of protocol; meaning this approach works 
not only with UDP, or a misused CC in TCP, but also SCTP, QUIC, RTMP and all 
others).
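
On FreeBSD, a minimal dummynet/FQ-CoDel setup along those lines could look
like the following (a sketch only - it assumes ipfw and dummynet are loaded,
and the 28 Mbit/s (3.5 MBps) bottleneck and em0 interface are placeholders):

ipfw pipe 1 config bw 28Mbit/s
ipfw sched 1 config pipe 1 type fq_codel
ipfw queue 1 config sched 1
ipfw add 100 queue 1 ip from any to any out xmit em0

With the FQ part on the bottleneck pipe, the light interactive flows keep
getting their share no matter how hard the unreactive flow pushes.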

Best regards,
   Richard




From: owner-freebsd-...@freebsd.org  On Behalf 
Of Cheng Cui
Sent: Friday, 3 November 2023 12:53
To: Yuri 
Cc: freebsd-net@freebsd.org
Subject: Re: Network starvation question


Hi Yuri,

If I understand your situation correctly, your application A is using UDP but
application B is using TCP and the max outbound bandwidth is < 4MBps.

If A can send at 3.5 Mbps and B can send at 0.5 Mbps, why is B suffering
more in throughput when the outbound limit is < 4Mbps?

There are many factors I think in my experience that can contribute to this.
But maybe we prefer a solution rather than the root cause of the difference.
If you can tune A to use less bandwidth and let B use the full 0.5 Mbps,
the problem may be solved. (for example, iperf default in UDP is ~1Mbps but
can be tuned to send at max bandwidth, so vice versa)

Then, back to the possible contribution factors of TCP suffering from a UDP
traffic competition :
1.  UDP traffic rate can be considered constant so it does not yield
2.  TCP congestion control may be encountered
3.  application's responsiveness may be different

Best Regards,
Cheng Cui


On Fri, Nov 3, 2023 at 12:42 AM Yuri  wrote:
Hi,


I've encountered the situation when the application A was using 100% of
the outbound bandwidth which is approximately 3.5 MBps of UDP traffic.

Then the application B (the RDP TCP connection) attempted to use a much
lower outbound speed, probably < 0.5 MBps, and it got starved.

Application B (RDP) was super slow as long as the application A kept
running. It was almost impossible to use the RDP connection.


My question is: shouldn't the system allow less intense streams to also
run at a decent speed?


Let's say that the outbound bandwidth threshold of the connection is 3.5
MBps.

The application A can send 3.5 MBps (or more).

The application B can send up to 0.5 MBps.

Obviously, they can't send 4.0 MBps in total, and their speeds should be
tuned down.

If both of the applications would be tuned down proportionately, this
could be done using the 3.5/4.0 ratio, which would be 0.875.

So why then does the slower connection get slowed down so much?

It was obviously slowed down many times, not just by 13%.


FreeBSD 13.2


Thanks,

Yuri






RE: Very slow scp performance comparing to Linux

2023-10-25 Thread Scheffenegger, Richard

Posting the full "iperf3 -i 1" output, as well as "netstat -snp tcp" before and 
after (or just the delta) would be nice;

On high speed NICs, iperf3 is nowadays typically core-limited (scales with 
clock speed of the active core where the singular worker thread is running), 
but that should be pretty much identical to how scp is doing things. On real 
hardware, it may be tricky to achieve more than 10Gbps or 25Gbps (depending on 
how modern the platform is) with iperf3.

Also, for high bandwidth operation, a number of NIC drivers typically perform 
better when tweaking their tx/rx queues:

CC: However, tuning "sysctl net.link.ifqmaxlen" directly does not work. There is a per NIC 
interface setup in the driver to setup device tx/rx queues. I have to increase the tx queue 
"ifq_maxlen" from the device sysctl "hw.bce.tx_pages". After tuning that, I can achieve a 
stable 1Gbps x 100ms delay BDP.
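
(For the bce(4) example quoted above, those are boot-time tunables - a
sketch, the exact names and limits are driver specific:

# /boot/loader.conf
hw.bce.tx_pages=8
hw.bce.rx_pages=8

Other drivers expose similar knobs for their tx/rx descriptor rings under
their own hw.<driver>.* tunables.)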


Richard Scheffenegger


-Original Message-
From: owner-freebsd-...@freebsd.org  On Behalf 
Of mike tancsa
Sent: Monday, 28 August 2023 16:02
To: Wei Hu ; freebsd-hack...@freebsd.org
Cc: freebsd-net@FreeBSD.org
Subject: Re: Very slow scp performance comparing to Linux



On 8/28/2023 3:32 AM, Wei Hu wrote:

Hi,

When I was testing a new NIC, I found the single stream scp performance was 
almost 8 times slower than Linux on the RX side. Initially I thought it might be 
something with the NIC. But when I switched to sending the file on localhost, 
the numbers stay the same.


Just curious, how does iperf3 perform in comparison ?

 ---Mike








RE: BPF to filter/mod ARP

2023-03-01 Thread Scheffenegger, Richard
>> On 1. Mar 2023, at 21:33, Scheffenegger, Richard  wrote:
>>
>> Hi group,
>>
>> Maybe someone can help me with this question - as I am usually only looking 
>> at L4 and the top side of L3 ;)
>
>> In order to validate a peculiar switch's behavior, I want to adjust some 
>> fields in gratuitous ARPs sent out by an interface, after a new IP is assigned 
>> or changed.

> Wouldn't scapy allow you to do this kind of testing?

Unfortunately not - I don't want to forge another packet, I want to make sure 
only the specific one is being sent, with the standard GARP retransmissions and 
so on.

Richard



mlx5en & tcpdump -Q

2023-03-01 Thread Scheffenegger, Richard
Related to the other issue just mentioned, I found that when trying to 
perform unidirectional packet captures using the tcpdump -Q option 
against a CX5 NIC, I get this error message:


tcpdump: e4a: pcap_setdirection() failed: Setting direction is not 
implemented on this platform


(this is a 13.0 kernel, can not really check main).

Does anyone know if this functionality is available already, or any 
plans to implement this for mlx5en ?
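
(For context, -Q simply maps to the libpcap call below, which is where the
error comes from - a minimal sketch, with "mce0" as a placeholder interface
name:)

#include <pcap/pcap.h>
#include <stdio.h>

int
main(void)
{
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p;

        p = pcap_open_live("mce0", 65535, 1, 1000, errbuf);
        if (p == NULL) {
                fprintf(stderr, "pcap_open_live: %s\n", errbuf);
                return (1);
        }
        /* this is what "tcpdump -Q in" requests; it fails on mlx5en today */
        if (pcap_setdirection(p, PCAP_D_IN) != 0)
                fprintf(stderr, "pcap_setdirection: %s\n", pcap_geterr(p));
        pcap_close(p);
        return (0);
}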


Thanks,
  Richard







BPF to filter/mod ARP

2023-03-01 Thread Scheffenegger, Richard

Hi group,

Maybe someone can help me with this question - as I am usually only 
looking at L4 and the top side of L3 ;)


In order to validate a peculiar switch's behavior, I want to adjust some 
fields in gratuitous ARPs sent out by an interface, after a new IP is 
assigned or changed.


I believe BPF can effectively filter on arbitrary bit patterns and 
modify packets on the fly.


However, as ARP doesn't seem to be accessible in the ipfw 
infrastructure, I was wondering how to go about setting up a BPF to 
tweak (temporarily) some of these ARPs to validate how the switch will 
behave.


(I need to validate if there is some difference when the target 
hardware address doesn't conform to RFC5227 - which states it SHOULD be 
zero and is ignored on the receiving side; I have reasons to believe 
that the switch needs either a target hardware address of 
ff:ff:ff:ff:ff:ff or the local interface MAC, to properly update its 
entries.)


Thanks a lot!

Richard




RE: Too aggressive TCP ACKs

2022-11-10 Thread Scheffenegger, Richard
This is the current draft in this space:

https://datatracker.ietf.org/doc/draft-gomez-tcpm-ack-rate-request/

and it has been adopted as a WG document at this week's IETF, from what I can tell.

So it has traction – if you want to give your feedback, please subscribe to the 
tcpm mailing list, and discuss your use case and how/if the approach aligns 
with this there.

Richard



From: owner-freebsd-...@freebsd.org  On Behalf 
Of Zhenlei Huang
Sent: Thursday, 10 November 2022 09:07
To: Hans Petter Selasky 
Cc: Michael Tuexen ; freebsd-net@freebsd.org
Subject: Re: Too aggressive TCP ACKs



On Nov 9, 2022, at 11:18 AM, Zhenlei Huang  wrote:


On Oct 22, 2022, at 6:14 PM, Hans Petter Selasky  wrote:

Hi,

Some thoughts about this topic.

Sorry for late response.



Delaying ACKs means loss of performance when using Gigabit TCP connections in 
data centers. There it is important to ACK the data as quickly as possible, to 
avoid running out of TCP window space. Thinking about TCP connections at 30 
GBit/s and above!

In data centers, the bandwidth is much higher and the latency is extremely low 
(compared to WAN), sub-millisecond.
The TCP window space is bandwidth multiplied by RTT. For a 30 GBit/s network at a 
200 µs RTT it is about 750 KiB. I think that is trivial for a
datacenter server.


4.2.3.2 in RFC 1122 states:
> in a stream of full-sized segments there SHOULD be an ACK for at least every 
> second segment
Even if the receiver ACKs every tenth segment, the impact of delayed ACKs on TCP window 
is not significant ( at most
 ten segments not ACKed in TCP send window ).

Anyway, for datacenter usage the bandwidth is symmetric and the reverse path ( 
TX path of receiver ) is sufficient.
Servers can even ACK every segment (no delaying ACK).


I think the implementation should be exactly like it is.

There is a software LRO in FreeBSD to coalesce the ACKs before they hit the 
network stack, so there are no real problems there.

I'm OK with the current implementation.

I think upper layers (or application) have (business) information to indicate 
whether delaying ACKs should be employed.
After googling I found there's a draft [1].

[1] Sender Control of Delayed Acknowledgments in TCP: 
https://www.ietf.org/archive/id/draft-gomez-tcpm-delack-suppr-reqs-01.xml

Found the html / pdf / txt version of the draft RFC.
https://datatracker.ietf.org/doc/draft-gomez-tcpm-ack-pull/





--HPS


Best regards,
Zhenlei



RE: Too aggressive TCP ACKs

2022-10-27 Thread Scheffenegger, Richard
To come back to this.

With TSO / LRO disabled, FBSD is behaving per RFC by acking every other data 
packet, or delaying an ACK (e.g when data stops on an “uneven” packet) after a 
short delay (delayed ACKs).

If you want to see different ACK ratios, and also higher gigabit throughput 
rates for a single session, maybe test with the FBSD RACK stack

https://klarasystems.com/articles/using-the-freebsd-rack-tcp-stack/

sysctl net.inet.tcp.functions_default=rack
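
(For that sysctl to take effect the RACK stack has to be present in the
kernel; on a stock kernel - assuming it was built with the extra TCP stacks -
that usually means loading the module first:

kldload tcp_rack
sysctl net.inet.tcp.functions_available    # "rack" should now be listed
sysctl net.inet.tcp.functions_default=rack)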

Richard


From: owner-freebsd-...@freebsd.org  On Behalf 
Of Zhenlei Huang
Sent: Friday, 21 October 2022 16:19
To: freebsd-net@freebsd.org
Subject: Too aggressive TCP ACKs



Hi,

While I was repeating https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=258755, 
I observed a
strange behavior. The TCP ACKs from FreeBSD host are too aggressive.

My setup is simple:
          A                                B
     [ MacOS ]  <-------------->  [ FreeBSD VM ]
  192.168.120.1                  192.168.120.134  (disable tso and lro)
While A <--- B, i.e. A as server and B as client, the packets rate looks good.

One session on B:

root@:~ # iperf3 -c 192.168.120.1 -b 10m
Connecting to host 192.168.120.1, port 5201
[  5] local 192.168.120.134 port 54459 connected to 192.168.120.1 port 5201
[ ID] Interval   Transfer Bitrate Retr  Cwnd
[  5]   0.00-1.00   sec  1.25 MBytes  10.5 Mbits/sec    0    257 KBytes
[  5]   1.00-2.00   sec  1.25 MBytes  10.5 Mbits/sec    0    257 KBytes
[  5]   2.00-3.00   sec  1.12 MBytes  9.44 Mbits/sec    0    257 KBytes
[  5]   3.00-4.00   sec  1.25 MBytes  10.5 Mbits/sec    0    257 KBytes
[  5]   4.00-5.00   sec  1.12 MBytes  9.44 Mbits/sec    0    257 KBytes
[  5]   5.00-6.00   sec  1.25 MBytes  10.5 Mbits/sec    0    257 KBytes
[  5]   6.00-7.00   sec  1.12 MBytes  9.44 Mbits/sec    0    257 KBytes
[  5]   7.00-8.00   sec  1.25 MBytes  10.5 Mbits/sec    0    257 KBytes
[  5]   8.00-9.00   sec  1.12 MBytes  9.44 Mbits/sec    0    257 KBytes
[  5]   9.00-10.00  sec  1.25 MBytes  10.5 Mbits/sec    0    257 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval   Transfer Bitrate Retr
[  5]   0.00-10.00  sec  12.0 MBytes  10.1 Mbits/sec0 sender
[  5]   0.00-10.00  sec  12.0 MBytes  10.1 Mbits/sec  receiver

iperf Done.

Another session on B:

root@:~ # netstat -w 1 -I vmx0
input   vmx0   output
   packets  errs idrops  bytespackets  errs  bytes colls
 0 0 0  0  0 0  0 0
 0 0 0  0  0 0  0 0
   342 0 0  22600526 0 775724 0
   150 0 0   9900851 01281454 0
   109 0 0   7194901 01357850 0
   126 0 0   8316828 01246632 0
   122 0 0   8052910 01370780 0
   109 0 0   7194819 01233702 0
   120 0 0   7920910 01370780 0
   110 0 0   7260819 01233702 0
   123 0 0   8118910 01370780 0
   109 0 0   7194819 01233702 0
73 0 0   5088465 0 686342 0
 0 0 0  0  0 0  0 0
 0 0 0  0  0 0  0 0






While A ---> B, i.e. A as client and B as server, the ACKs sent from B look 
strange.

Session on A:

% iperf3 -c 192.168.120.134 -b 10m
Connecting to host 192.168.120.134, port 5201
[  5] local 192.168.120.1 port 52370 connected to 192.168.120.134 port 5201
[ ID] Interval   Transfer Bitrate
[  5]   0.00-1.00   sec  1.25 MBytes  10.5 Mbits/sec
[  5]   1.00-2.00   sec  1.25 MBytes  10.5 Mbits/sec
[  5]   2.00-3.00   sec  1.12 MBytes  9.44 Mbits/sec
[  5]   3.00-4.00   sec  1.25 MBytes  10.5 Mbits/sec
[  5]   4.00-5.00   sec  1.12 MBytes  9.44 Mbits/sec
[  5]   5.00-6.00   sec  1.25 MBytes  10.5 Mbits/sec
[  5]   6.00-7.00   sec  1.12 MBytes  9.44 Mbits/sec
[  5]   7.00-8.00   sec  1.25 MBytes  10.5 Mbits/sec
[  5]   8.00-9.00   sec  1.12 MBytes  9.44 Mbits/sec
[  5]   9.00-10.00  sec  1.25 MBytes  10.5 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval   Transfer Bitrate
[  5]   0.00-10.00  sec  12.0 MBytes  10.1 Mbits/sec  sender
[  5]   0.00-10.00  sec  12.0 MBytes  10.1 Mbits/sec  receiver

iperf Done.

Session on B:

root@:~ # netstat -w 1 -I vmx0
input   vmx0   output
   packets  errs idrops  bytespackets  errs  bytes colls
 0 0 

RE: Too aggressive TCP ACKs

2022-10-27 Thread Scheffenegger, Richard
 It focuses on QUIC, but congestion control dynamics don't change 
 with the protocol. You should be able to read there, but if not I'm 
 happy to send anyone a pdf.
>>> Is QUIC using an L=2 for ABC?
>>
>> I think that is the rfc recommendation, actual deployed reality is 
>> more scattershot.
>Wouldn't that be relevant? If you get an ack for, let's say 8 packets, you 
>would only increment (in slow start) the cwnd by 2 packets, not 8?
>
>Best regards
>Michael

Isn't that the optimization in Linux with QuickAck during the periods, where 
the data receiver assumes, that the sender is still in SlowStart - and acking 
every packet?

Richard



RE: IPv6 - NS, DAD and MLDv2 interaction

2022-02-23 Thread Scheffenegger, Richard
-Original Message-
From: Lutz Donnerhacke  

> Yup. IPv6 replaced broadcast by multicast on the link layer.
>
>> It appears that some vendors of switches have started to become overly 
>> restrictive in forwarding Ethernet Multicast, and only deliver these 
>> *after* a Host has registered itself to receive / participate in 
>> specific IPv6 Multicast groups.
>
> May you please drop an information about the vendor?
> So that we can warn the community to buy those broken products.

The problem is described - with some workarounds here 

http://inconcepts.biz/~jsw/ipv6_nd_problems_with_l2_mcast.pdf


>> So far, I could not find any guidance (and with my lack of depth into 
>> IPv6, it is also unclear to me, if this would even be possible) if 
>> registering a host into a IPv6 MC group via MLDv2 in order for it to 
>> receive NS, ND, DAD is something that would be expected…
>
> https://datatracker.ietf.org/doc/html/rfc4861#section-7.2.1
>
> The host must join (using MLDv2) the multicast groups for all of its own 
> addresses to be reachable by Neighbor Discovery. This does not include the 
> standard RD group (IIRC).

Is there a known defect in fbsd11 or later, where the solicited-node multicast 
address is not joined/announced via MLD? Or any way to validate the current 
entries that are supposed to be announced via MLD on an interface?
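
(For reference, the group memberships an interface should currently be
announcing via MLD can be inspected with ifmcstat(8), e.g.
"ifmcstat -i ix0 -f inet6" with ix0 as a placeholder interface name - each
local address's solicited-node group ff02::1:ffxx:xxxx should show up there.)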

Apparently the MLD snooping tables on the affected switch do NOT show any IPv6 
host (completely empty) - which is surprising, when an IPv6 host following 
RFC4861 is supposed to join the solicited-node MC address via MLD...

(it also appears that some other OS may not use the multicast NS process with 
these Ethernet multicast addresses - or else issues should have been apparent 
immediately).

Best regards,
  Richard




IPv6 - NS, DAD and MLDv2 interaction

2022-02-23 Thread Scheffenegger, Richard
Hi,

I hope someone more knowledgeable than me in IPv6 affairs can give an informed 
opinion on the following:

As far as I know, an IPv6 host initially tries to perform Duplicate Address 
Detection, as well as Neighbor Discovery / Neighbor Solicitation. All of this 
typically works on Ethernet, by mapping into a well-known Ethernet multicast 
destination MAC “33-33-xx-xx-xx-xx”.

However, IPv6 multicast addresses are largely independent of the above protocols 
(afaik) and can be freely defined in the IPv6 address space. A "proper" 
(non-local) IPv6 multicast is formed by a host registering via MLD, and again 
mapping packets destined to a particular IPv6 multicast group into a similarly 
formed Ethernet MAC 33-33-xx-xx-xx-xx.
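
As a concrete example of that mapping (standard RFC 4291 rules, the address
is made up): for the unicast address 2001:db8::1234:5678, the solicited-node
group is built from its low 24 bits, giving ff02::1:ff34:5678, and a packet
sent to that group goes out to the Ethernet multicast MAC 33:33:ff:34:56:78
(33-33 followed by the low 32 bits of the IPv6 destination address).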

It appears that some vendors of switches have started to become overly 
restrictive in forwarding Ethernet Multicast, and only deliver these *after* a 
Host has registered itself to receive / participate in specific IPv6 Multicast 
groups.

A bit similar to IGMP snooping, with the difference that in v4, crucial basic 
information was exchanged using Eth broadcasts (ARP) rather than Eth multicasts 
which look like "data" IPv6 multicast.

So far, I could not find any guidance (and with my lack of depth into IPv6, it 
is also unclear to me if this would even be possible) on whether registering a host 
into an IPv6 MC group via MLDv2 in order for it to receive NS, ND, DAD is 
something that would be expected…

Best regards,
   Richard


AW: NFS Mount Hangs

2021-04-12 Thread Scheffenegger, Richard
I was trying to do some simple tests yesterday - but don't know if these are 
representative:

Using an old Debian 3.16.3 Linux box as NFS client, and simulating the 
disconnect with an ipfw rule, while introducing some packet drops using 
dummynet (I really should be adding a simple Markov-chain state machine for 
burst losses), to exercise some of the socket upcalls in the tcp_input code 
flow. But it got too late before I arrived at any relevant results...
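
(A minimal sketch of such an ipfw/dummynet setup - rule numbers, the NFS
port and the loss rate are placeholders:

# simulate the partition towards the NFS server
ipfw add 100 deny tcp from any to any 2049
# and, separately, random loss on the NFS traffic
ipfw pipe 1 config plr 0.01
ipfw add 200 pipe 1 tcp from any to any 2049)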


Richard Scheffenegger
Consulting Solution Architect
NAS & Networking

NetApp
+43 1 3676 811 3157 Direct Phone
+43 664 8866 1857 Mobile Phone
richard.scheffeneg...@netapp.com

https://ts.la/richard49892


-Original Message-
From: Rick Macklem  
Sent: Monday, 12 April 2021 00:50
To: Scheffenegger, Richard ; 
tue...@freebsd.org
Cc: Youssef GHORBAL ; freebsd-net@freebsd.org
Subject: Re: NFS Mount Hangs





I should be able to test D29690 in about a week.
Note that I will not be able to tell if it fixes otis@'s hung Linux client 
problem.

rick


From: Scheffenegger, Richard 
Sent: Sunday, April 11, 2021 12:54 PM
To: tue...@freebsd.org; Rick Macklem
Cc: Youssef GHORBAL; freebsd-net@freebsd.org
Subject: Re: NFS Mount Hangs



>From what I understand Rick stating about the socket state changing before 
>the upcall, I can only speculate that the RST fight is for the new sessions the 
>client tries with the same 5-tuple, while server side the old original session 
>persists, as the NFS server never closes / shuts down the session.

But a debug logged version of the socket upcall used by the nfs server should 
reveal any differences in socket state at the time of upcall.

I would very much like to know if d29690 addresses that problem (if it was due 
to releasing the lock before the upcall), or if that still shows differences 
between prior to my central upcall change, post that change and with d29690 ...


From: tue...@freebsd.org 
Sent: Sunday, April 11, 2021 2:30:09 PM
To: Rick Macklem 
Cc: Scheffenegger, Richard ; Youssef GHORBAL 
; freebsd-net@freebsd.org 
Subject: Re: NFS Mount Hangs





> On 10. Apr 2021, at 23:59, Rick Macklem  wrote:
>
> tue...@freebsd.org wrote:
>> Rick wrote:
> [stuff snipped]
>>>> With r367492 you don't get the upcall with the same error state? Or you 
>>>> don't get an error on a write() call, when there should be one?
>> If Send-Q is 0 when the network is partitioned, after healing, the 
>> krpc sees no activity on the socket (until it acquires/processes an RPC it 
>> will not do a sosend()).
>> Without the 6minute timeout, the RST battle goes on "forever" (I've 
>> never actually waited more than 30minutes, which is close enough to 
>> "forever" for me).
>> --> With the 6minute timeout, the "battle" stops after 6minutes, when 
>> --> the timeout
>> causes a soshutdown(..SHUT_WR) on the socket.
>> (Since the soshutdown() patch is not yet in "main". I got comments, but 
>> no "reviewed"
>>  on it, the 6minute timer won't help if enabled in main. The soclose() 
>> won't happen
>>  for TCP connections with the back channel enabled, such as Linux 
>> 4.1/4.2 ones.) I'm confused. So you are saying that if the Send-Q is 
>> empty when you partition the network, and the peer starts to send 
>> SYNs after the healing, FreeBSD responds with a challenge ACK which 
>> triggers the sending of a RST by Linux. This RST is ignored multiple times.
>> Is that true? Even with my patch for the bug I introduced?
> Yes and yes.
> Go take another look at linuxtofreenfs.pcap ("fetch 
> https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap; if you don't  
> already have it.) Look at packet #1949->2069. I use wireshark, but 
> you'll have your favourite.
> You'll see the "RST battle" that ends after 6minutes at packet#2069. 
> If there is no 6minute timeout enabled in the server side krpc, then 
> the battle just continues (I once let it run for about 30minutes 
> before giving up). The 6minute timeout is not currently enabled in 
> main, etc.
Hmm. I don't understand why r367492 can impact the processing of the RST, which 
basically d

Re: NFS Mount Hangs

2021-04-11 Thread Scheffenegger, Richard
>From what I understand Rick stating about the socket state changing before 
>the upcall, I can only speculate that the RST fight is for the new sessions the 
>client tries with the same 5-tuple, while server side the old original session 
>persists, as the NFS server never closes / shuts down the session.

But a debug logged version of the socket upcall used by the nfs server should 
reveal any differences in socket state at the time of upcall.

I would very much like to know if d29690 addresses that problem (if it was due 
to releasing the lock before the upcall), or if that still shows differences 
between prior to my central upcall change, post that change and with d29690 ...


From: tue...@freebsd.org 
Sent: Sunday, April 11, 2021 2:30:09 PM
To: Rick Macklem 
Cc: Scheffenegger, Richard ; Youssef GHORBAL 
; freebsd-net@freebsd.org 
Subject: Re: NFS Mount Hangs





> On 10. Apr 2021, at 23:59, Rick Macklem  wrote:
>
> tue...@freebsd.org wrote:
>> Rick wrote:
> [stuff snipped]
>>>> With r367492 you don't get the upcall with the same error state? Or you 
>>>> don't get an error on a write() call, when there should be one?
>> If Send-Q is 0 when the network is partitioned, after healing, the krpc sees 
>> no activity on
>> the socket (until it acquires/processes an RPC it will not do a sosend()).
>> Without the 6minute timeout, the RST battle goes on "forever" (I've never 
>> actually
>> waited more than 30minutes, which is close enough to "forever" for me).
>> --> With the 6minute timeout, the "battle" stops after 6minutes, when the 
>> timeout
>> causes a soshutdown(..SHUT_WR) on the socket.
>> (Since the soshutdown() patch is not yet in "main". I got comments, but 
>> no "reviewed"
>>  on it, the 6minute timer won't help if enabled in main. The soclose() 
>> won't happen
>>  for TCP connections with the back channel enabled, such as Linux 
>> 4.1/4.2 ones.)
>> I'm confused. So you are saying that if the Send-Q is empty when you 
>> partition the
>> network, and the peer starts to send SYNs after the healing, FreeBSD responds
>> with a challenge ACK which triggers the sending of a RST by Linux. This RST 
>> is
>> ignored multiple times.
>> Is that true? Even with my patch for the bug I introduced?
> Yes and yes.
> Go take another look at linuxtofreenfs.pcap
> ("fetch https://people.freebsd.org/~rmacklem/linuxtofreenfs.pcap; if you don't
>  already have it.)
> Look at packet #1949->2069. I use wireshark, but you'll have your favourite.
> You'll see the "RST battle" that ends after
> 6minutes at packet#2069. If there is no 6minute timeout enabled in the
> server side krpc, then the battle just continues (I once let it run for about
> 30minutes before giving up). The 6minute timeout is not currently enabled
> in main, etc.
Hmm. I don't understand why r367492 can impact the processing of the RST, which
basically destroys the TCP connection.

Richard: Can you explain that?

Best regards
Michael
>
>> What version of the kernel are you using?
> "main" dated Dec. 23, 2020 + your bugfix + assorted NFS patches that
> are not relevant + 2 small krpc related patches.
> --> The two small krpc related patches enable the 6minute timeout and
>   add a soshutdown(..SHUT_WR) call when the 6minute timeout is
>   triggered. These have no effect until the 6minutes is up and, without
>   them the "RST battle" goes on forever.
>
> Add to the above a revert of r367492 and the RST battle goes away and things
> behave as expected. The recovery happens quickly after the network is
> unpartitioned, with either 0 or 1 RSTs.
>
> rick
> ps: Once the irrelevant NFS patches make it into "main", I will upgrade to
> main bits-de-jur for testing.
>
> Best regards
> Michael
>>
>> If Send-Q is non-empty when the network is partitioned, the battle will not 
>> happen.
>>
>>>
>>> My understanding is that he needs this error indication when calling 
>>> shutdown().
>> There are several ways the krpc notices that a TCP connection is no longer 
>> functional.
>> - An error return like EPIPE from either sosend() or soreceive().
>> - A return of 0 from soreceive() with no data (normal EOF from other end).
>> - A 6minute timeout on the server end, when no activity has occurred on the
>> connection. This timer is currently disabled for NFSv4.1/4.2 mounts i

AW: NFS Mount Hangs

2021-04-10 Thread Scheffenegger, Richard
I went through all the instances, where there would be an immediate soupcall 
triggered (before r367492).

If the problem is related to a race condition, where the socket is unlocked 
before the upcall, I can change the patch in such a way, to retain the lock on 
the socket all through TCP processing.

Both sorwakeups are with a locked socket (which is the critical part, I 
understand), while for the write upcall there is one unlocked, and one 
locked


Richard Scheffenegger
Consulting Solution Architect
NAS & Networking

NetApp
+43 1 3676 811 3157 Direct Phone
+43 664 8866 1857 Mobile Phone
richard.scheffeneg...@netapp.com

https://ts.la/richard49892


-Original Message-
From: tue...@freebsd.org  
Sent: Saturday, 10 April 2021 18:13
To: Rick Macklem 
Cc: Scheffenegger, Richard ; Youssef GHORBAL 
; freebsd-net@freebsd.org
Subject: Re: NFS Mount Hangs





> On 10. Apr 2021, at 17:56, Rick Macklem  wrote:
>
> Scheffenegger, Richard  wrote:
>>> Rick wrote:
>>> Hi Rick,
>>>
>>>> Well, I have some good news and some bad news (the bad is mostly for 
>>>> Richard).
>>>>
>>>> The only message logged is:
>>>> tcpflags 0x4; tcp_do_segment: Timestamp missing, segment 
>>>> processed normally
>>>>
> Btw, I did get one additional message during further testing (with r367492 
> reverted):
> tcpflags 0x4; syncache_chkrst: Our SYN|ACK was rejected, connection 
> attempt aborted
>   by remote endpoint
>
> This only happened once of several test cycles.
That is OK.
>
>>>> But...the RST battle no longer occurs. Just one RST that works and then 
>>>> the SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>>>>
>>>> So, what is different?
>>>>
>>>> r367492 is reverted from the FreeBSD server.
>>>> I did the revert because I think it might be what otis@ hang is being 
>>>> caused by. (In his case, the Recv-Q grows on the socket for the stuck 
>>>> Linux client, while others work.
>>>>
>>>> Why does reverting fix this?
>>>> My only guess is that the krpc gets the upcall right away and sees a EPIPE 
>>>> when it does soreceive()->results in soshutdown(SHUT_WR).
> This was bogus and incorrect. The diagnostic printf() I saw was 
> generated for the back channel, and that would have occurred after the socket 
> was shut down.
>
>>>
>>> With r367492 you don't get the upcall with the same error state? Or you 
>>> don't get an error on a write() call, when there should be one?
> If Send-Q is 0 when the network is partitioned, after healing, the 
> krpc sees no activity on the socket (until it acquires/processes an RPC it 
> will not do a sosend()).
> Without the 6minute timeout, the RST battle goes on "forever" (I've 
> never actually waited more than 30minutes, which is close enough to "forever" 
> for me).
> --> With the 6minute timeout, the "battle" stops after 6minutes, when 
> --> the timeout
>  causes a soshutdown(..SHUT_WR) on the socket.
>  (Since the soshutdown() patch is not yet in "main". I got comments, but 
> no "reviewed"
>   on it, the 6minute timer won't help if enabled in main. The soclose() 
> won't happen
>   for TCP connections with the back channel enabled, such as Linux 
> 4.1/4.2 ones.)
I'm confused. So you are saying that if the Send-Q is empty when you partition 
the network, and the peer starts to send SYNs after the healing, FreeBSD 
responds with a challenge ACK which triggers the sending of a RST by Linux. 
This RST is ignored multiple times.
Is that true? Even with my patch for the bug I introduced?
What version of the kernel are you using?

Best regards
Michael
>
> If Send-Q is non-empty when the network is partitioned, the battle will not 
> happen.
>
>>
>> My understanding is that he needs this error indication when calling 
>> shutdown().
> There are several ways the krpc notices that a TCP connection is no longer 
> functional.
> - An error return like EPIPE from either sosend() or soreceive().
> - A return of 0 from soreceive() with no data (normal EOF from other end).
> - A 6minute timeout on the server end, when no activity has occurred 
> on the  connection. This timer is currently disabled for NFSv4.1/4.2 
> mounts in "main",  but I enabled it for this testing, to stop the "RST battle 
> goes on forever"
>  during testing. I am thinking of enabling it on "main", but t

Re: NFS Mount Hangs

2021-04-10 Thread Scheffenegger, Richard




From: tue...@freebsd.org 
Sent: Saturday, April 10, 2021 2:19 PM
To: Scheffenegger, Richard
Cc: Rick Macklem; Youssef GHORBAL; freebsd-net@freebsd.org
Subject: Re: NFS Mount Hangs





> On 10. Apr 2021, at 11:19, Scheffenegger, Richard 
>  wrote:
>
> Hi Rick,
>
>> Well, I have some good news and some bad news (the bad is mostly for 
>> Richard).
>>
>> The only message logged is:
>> tcpflags 0x4; tcp_do_segment: Timestamp missing, segment processed 
>> normally
>>
>> But...the RST battle no longer occurs. Just one RST that works and then the 
>> SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>>
>> So, what is different?
>>
>> r367492 is reverted from the FreeBSD server.
>> I did the revert because I think it might be what otis@ hang is being caused 
>> by. (In his case, the Recv-Q grows on the socket for the stuck Linux client, 
>> while others work.
>>
>> Why does reverting fix this?
>> My only guess is that the krpc gets the upcall right away and sees a EPIPE 
>> when it does soreceive()->results in soshutdown(SHUT_WR).
>
> With r367492 you don't get the upcall with the same error state? Or you don't 
> get an error on a write() call, when there should be one?

My understanding is that he needs this error indication when calling shutdown().

>
> From what you describe, this is on writes, isn't it? (I'm asking, as the 
> original problem that was fixed with r367492 occurs in the read path 
> (draining of the so_rcv buffer in the upcall right away, which subsequently 
> influences the ACK sent by the stack).)
>
> I only added the so_snd buffer after some discussion, if the WAKESOR 
> shouldn't have a symmetric equivalent on WAKESOW
>
> Thus a partial backout (leaving the WAKESOR part inside, but reverting the 
> WAKESOW part) would still fix my initial problem about erroneous DSACKs 
> (which can also lead to extremely poor performance with Linux clients), but 
> possible address this issue...
>
> Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for 
> the revert only on the so_snd upcall?

Since the release of 13.0 is almost done, can we try to fix the issue instead 
of reverting the commit?

Rs: agree, a good understanding of where the interaction between stack, socket and 
in-kernel TCP user breaks is needed;

>
> If this doesn't help, some major surgery will be necessary to prevent NFS 
> sessions with SACK enabled, to transmit DSACKs...

My understanding is that the problem is related to getting a local error 
indication after
receiving a RST segment too late or not at all.

Rs: but the move of the upcall should not materially change that; I don't have 
a PC here to see if any upcall actually happens on RST...

Best regards
Michael
>
>
>> I know from a printf that this happened, but whether it caused the RST 
>> battle to not happen, I don't know.
>>
>> I can put r367492 back in and do more testing if you'd like, but I think it 
>> probably needs to be reverted?
>
> Please, I don't quite understand why the exact timing of the upcall would be 
> that critical here...
>
> A comparison of the soxxx calls and errors between the "good" and the "bad" 
> would be perfect. I don't know if this is easy to do though, as these calls 
> appear to be scattered all around the RPC / NFS source paths.
>
>> This does not explain the original hung Linux client problem, but does shed 
>> light on the RST war I could create by doing a network partitioning.
>>
>> rick
>



AW: NFS Mount Hangs

2021-04-10 Thread Scheffenegger, Richard
Hi Rick,

> Well, I have some good news and some bad news (the bad is mostly for Richard).
>
> The only message logged is:
> tcpflags 0x4; tcp_do_segment: Timestamp missing, segment processed 
> normally
>
> But...the RST battle no longer occurs. Just one RST that works and then the 
> SYN gets SYN,ACK'd by the FreeBSD end and off it goes...
>
> So, what is different?
>
> r367492 is reverted from the FreeBSD server.
> I did the revert because I think it might be what otis@ hang is being caused 
> by. (In his case, the Recv-Q grows on the socket for the stuck Linux client, 
> while others work.
>
> Why does reverting fix this?
> My only guess is that the krpc gets the upcall right away and sees a EPIPE 
> when it does soreceive()->results in soshutdown(SHUT_WR).

With r367492 you don't get the upcall with the same error state? Or you don't 
get an error on a write() call, when there should be one?

>From what you describe, this is on writes, isn't it? (I'm asking, as the 
>original problem that was fixed with r367492 occurs in the read path 
>(draining of the so_rcv buffer in the upcall right away, which subsequently 
>influences the ACK sent by the stack).)

I only added the so_snd buffer after some discussion, if the WAKESOR shouldn't 
have a symmetric equivalent on WAKESOW

Thus a partial backout (leaving the WAKESOR part inside, but reverting the 
WAKESOW part) would still fix my initial problem about erroneous DSACKs (which 
can also lead to extremely poor performance with Linux clients), but possible 
address this issue...

Can you perhaps take MAIN and apply https://reviews.freebsd.org/D29690 for the 
revert only on the so_snd upcall?

If this doesn't help, some major surgery will be necessary to prevent NFS 
sessions with SACK enabled, to transmit DSACKs...


> I know from a printf that this happened, but whether it caused the RST battle 
> to not happen, I don't know.
> 
> I can put r367492 back in and do more testing if you'd like, but I think it 
> probably needs to be reverted?

Please, I don't quite understand why the exact timing of the upcall would be 
that critical here...

A comparison of the soxxx calls and errors between the "good" and the "bad" 
would be perfect. I don't know if this is easy to do though, as these calls 
appear to be scattered all around the RPC / NFS source paths.

> This does not explain the original hung Linux client problem, but does shed 
> light on the RST war I could create by doing a network partitioning.
>
> rick



Re: NFS Mount Hangs

2021-04-04 Thread Scheffenegger, Richard
For what it's worth, SUSE found two bugs in the Linux nf_conntrack (stateful 
firewall) and the pfifo-fast scheduler, which could conspire to make TCP sessions 
hang forever.

One is a missed update when the client is not using the noresvport mount 
option, which makes the firewall think RSTs are illegal (and drop them);

The fast scheduler can run into an issue if only a single packet should be 
forwarded (note that this is not the default scheduler, but often recommended 
for perf, as it runs lockless and at lower CPU cost than pfq (the default)). If no 
other/additional packet pushes out that last packet of a flow, it can become 
stuck forever...

I can try getting the relevant bug info next week...


From: owner-freebsd-...@freebsd.org  On Behalf 
Of Rick Macklem 
Sent: Friday, April 2, 2021 11:31:01 PM
To: tue...@freebsd.org 
Cc: Youssef GHORBAL ; freebsd-net@freebsd.org 


Subject: Re: NFS Mount Hangs





tue...@freebsd.org wrote:
>> On 2. Apr 2021, at 02:07, Rick Macklem  wrote:
>>
>> I hope you don't mind a top post...
>> I've been testing network partitioning between the only Linux client
>> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch
>> (does soshutdown(..SHUT_WR) when it knows the socket is broken)
>> applied to it.
>>
>> I'm not enough of a TCP guy to know if this is useful, but here's what
>> I see...
>>
>> While partitioned:
>> On the FreeBSD server end, the socket either goes to CLOSED during
>> the network partition or stays ESTABLISHED.
>If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also
>sent a FIN, but you never called close() on the socket.
>If the socket stays in ESTABLISHED, there is no communication ongoing,
>I guess, and therefore the server does not even detect that the peer
>is not reachable.
>> On the Linux end, the socket seems to remain ESTABLISHED for a
>> little while, and then disappears.
>So how does Linux detect the peer is not reachable?
Well, here's what I see in a packet capture in the Linux client once
I partition it (just unplug the net cable):
- lots of retransmits of the same segment (with ACK) for 54sec
- then only ARP queries

Once I plug the net cable back in:
- ARP works
- one more retransmit of the same segement
- receives RST from FreeBSD
** So, is this now a "new" TCP connection, despite
using the same port#.
--> It matters for NFS, since "new connection"
   implies "must retry all outstanding RPCs".
- sends SYN
- receives SYN, ACK from FreeBSD
--> connection starts working again
Always uses same port#.

On the FreeBSD server end:
- receives the last retransmit of the segment (with ACK)
- sends RST
- receives SYN
- sends SYN, ACK

I thought that there was no RST in the capture I looked at
yesterday, so I'm not sure if FreeBSD always sends an RST,
but the Linux client behaviour was the same. (Sent a SYN, etc).
The socket disappears from the Linux "netstat -a" and I
suspect that happens after about 54sec, but I am not sure
about the timing.

>>
>> After unpartitioning:
>> On the FreeBSD server end, you get another socket showing up at
>> the same port#
>> Active Internet connections (including servers)
>> Proto Recv-Q Send-Q Local Address  Foreign Address(state)
>> tcp4   0  0 nfsv4-new3.nfsdnfsv4-linux.678ESTABLISHED
>> tcp4   0  0 nfsv4-new3.nfsdnfsv4-linux.678CLOSED
>>
>> The Linux client shows the same connection ESTABLISHED.
But disappears from "netstat -a" for a while during the partitioning.

>> (The mount sometimes reports an error. I haven't looked at packet
>> traces to see if it retries RPCs or why the errors occur.)
I have now done so, as above.

>> --> However I never get hangs.
>> Sometimes it goes to SYN_SENT for a while and the FreeBSD server
>> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the
>> mount starts working again.
>>
>> The most obvious thing is that the Linux client always keeps using
>> the same port#. (The FreeBSD client will use a different port# when
>> it does a TCP reconnect after no response from the NFS server for
>> a little while.)
>>
>> What do those TCP conversant think?
>I guess you are you are never calling close() on the socket, for with
>the connection state is CLOSED.
Ok, that makes sense. For this case the Linux client has not done a
BindConnectionToSession to re-assign the back channel.
I'll have to bug them about this. However, I'll bet they'll answer
that I have to tell them the back channel needs re-assignment
or something like that.

I am pretty certain they are broken, in that the client needs to
retry all outstanding RPCs.

For others, here's the long winded version of this that I just
put on the phabricator review:
 In the server side kernel RPC, the socket (struct socket *) is in a
  structure 

AW: tcp-testsuite into src?

2021-03-23 Thread Scheffenegger, Richard
>> Yeah, it's not a problem to use binaries from ports in /usr/tests.  As 
>> long as the tests can compile they can live in the base system.  Is 
>> there a strong incentive to import them? 
>
> The tests are just scripts, which can be executed by packetdrill, which is 
> available in the ports tree.
>
>> Do they need to be adjusted for each release?
>
> It depends. If things like default timeouts or so change, then the tests need 
> to be adapted.
>
> If we would have (and I guess we will) tests for loss recovery, then 
> improvements to the code might also require changes to the tests.

Yes, I would really like to have the packetdrill scripts in the source tree. 
And a recipe for how to run a subtree of the tests (e.g. the TCP tests) as part 
of a kernel build...

As I work on adding newer mechanisms into the base stack TCP, I would be 
documenting these changes in microscopic timing etc. in terms of test cases...

Right now, the test suite is organized in a layout similar to that of the source files. 
However, as UDP, TCP and SCTP all live in /sys/netinet, and the existing 
packetdrill scripts cover a lot of ground in various scenarios, I am wondering 
if it wouldn't be easier to have a subdirectory under 
/tests/sys/netinet/packetdrill/tcp which mirrors freebsd-net/tcp-testsuite
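
For illustration, a trivial packetdrill script of the kind such a subtree
would hold (a made-up example, not one of the existing tests):

// accept a connection and receive 1000 bytes
0    socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0   bind(3, ..., ...) = 0
+0   listen(3, 1) = 0
+0   < S  0:0(0) win 65535 <mss 1460,sackOK,nop,nop,nop,wscale 7>
+0   > S. 0:0(0) ack 1 <...>
+0.1 < .  1:1(0) ack 1 win 257
+0   accept(3, ..., ...) = 4
+0   < P. 1:1001(1000) ack 1 win 257
+0   > .  1:1(0) ack 1001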

>
> Best regards
> Michael



AW: NFS Mount Hangs

2021-03-19 Thread Scheffenegger, Richard
Sorry, I thought this was a problem on stable/13.

This is only in HEAD, stable/13 and 13.0 - never MFC'd to stable/12 or 
backported to 12.1

> I did some reshuffling of socket-upcalls recently in the TCP stack, to 
> prevent some race conditions with our $work in-kernel NFS server 
> implementation.
Are these changes in 12.1p5? This is the OS version used by the reporter of the 
bug.

Best regards
Michael



AW: NFS Mount Hangs

2021-03-19 Thread Scheffenegger, Richard
Hi Rick,

I did some reshuffling of socket upcalls recently in the TCP stack, to prevent 
some race conditions with our $work in-kernel NFS server implementation.

Just mentioning this, as this may slightly change the timing (mostly delaying the 
upcall until TCP processing is all done, whereas before, an in-kernel consumer 
could register for a socket upcall, do some fancy stuff with the data sitting 
in the socket buffers, before returning to the TCP processing).

But I think there is no socket data handling being done in the upstream 
in-kernel NFS server (and I have not even checked if it actually registers a 
socket-upcall handler).

https://reviews.freebsd.org/R10:4d0770f1725f84e8bcd059e6094b6bd29bed6cc3

If you can reproduce this easily, perhaps back out this change and see if that 
has an impact...

NFS server is to my knowledge the only upstream in-kernel TCP consumer which 
may be impacted by this.

Richard Scheffenegger


-Original Message-
From: owner-freebsd-...@freebsd.org  On Behalf 
Of Rick Macklem
Sent: Friday, 19 March 2021 16:58
To: tue...@freebsd.org
Cc: Scheffenegger, Richard ; 
freebsd-net@freebsd.org; Alexander Motin 
Subject: Re: NFS Mount Hangs

Michael Tuexen wrote:
>> On 18. Mar 2021, at 21:55, Rick Macklem  wrote:
>>
>> Michael Tuexen wrote:
>>>> On 18. Mar 2021, at 13:42, Scheffenegger, Richard 
>>>>  wrote:
>>>>
>>>>>> Output from the NFS Client when the issue occurs # netstat -an | 
>>>>>> grep NFS.Server.IP.X
>>>>>> tcp0  0 NFS.Client.IP.X:46896  NFS.Server.IP.X:2049  
>>>>>>  FIN_WAIT2
>>>>> I'm no TCP guy. Hopefully others might know why the client would 
>>>>> be stuck in FIN_WAIT2 (I vaguely recall this means it is waiting 
>>>>> for a fin/ack, but could be wrong?)
>>>>
>>>> When the client is in Fin-Wait2 this is the state you end up when the 
>>>> Client side actively close() the tcp session, and then the server also 
>>>> ACKed the FIN.
>> Jason noted:
>>
>>> When the issue occurs, this is what I see on the NFS Server.
>>> tcp4   0  0 NFS.Server.IP.X.2049  NFS.Client.IP.X.51550 
>>> CLOSE_WAIT
>>>
>>> which corresponds to the state on the client side. The server 
>>> received the FIN from the client and acked it.
>>> The server is waiting for a close call to happen.
>>> So the question is: Is the server also closing the connection?
>> Did you mean to say "client closing the connection here?"
>Yes.
>>
>> The server should call soclose() { it never calls soshutdown() } when 
>> soreceive(with MSG_WAIT) returns 0 bytes or an error that indicates 
>> the socket is broken.
Btw, I looked and the soreceive() is done with MSG_DONTWAIT, but the 
EWOULDBLOCK is handled appropriately.

>> --> The soreceive() call is triggered by an upcall for the rcv side of the 
>> socket.
>> So, are you saying the FreeBSD NFS server did not call soclose() for this 
>> case?
>Yes. If the state at the server side is CLOSE_WAIT, no close call has happened 
>yet.
>The FIN from the client was received, it was ACKED, but no close() call 
>(or shutdown(..., SHUT_WR) or shutdown(..., SHUT_RDWR)) was issued. 
>Therefore, no FIN was sent and the client should be in the FINWAIT-2 
>state. This was also reported. So the reported states are consistent.
For a test, I commented out the soclose() call in the server side krpc and, 
when I dismounted, it did leave the server socket in CLOSE_WAIT.
For the FreeBSD client, it did the dismount and the socket was in FIN_WAIT2 for 
a little while and then disappeared (someone mentioned a short timeout and that 
seems to be the case).
I might argue that the Linux client should not get hung when this occurs, but 
there does appear to be an issue on the FreeBSD end.

So it does appear you have a case where the soclose() call is not happening on 
the FreeBSD NFS server. I am a little surprised since I don't think I've heard 
of this before and the code is at least 10years old (at least the parts related 
to this).

For the soclose() to not happen, the reference count on the socket structure 
cannot have gone to zero. (ie a SVC_RELEASE() was missed) Upon code inspection, 
I was not able to spot a reference counting bug.
(Not too surprising, since a reference counting bug should have shown  up long 
ago.)

The only thing I spotted that could conceivably explain this is that the 
function svc_vc_stat() which returns the indication that the socket has been 
closed at the other e

AW: NFS Mount Hangs

2021-03-18 Thread Scheffenegger, Richard
>>Output from the NFS Client when the issue occurs # netstat -an | grep 
>>NFS.Server.IP.X
>>tcp0  0 NFS.Client.IP.X:46896  NFS.Server.IP.X:2049   
>>FIN_WAIT2
>I'm no TCP guy. Hopefully others might know why the client would be stuck in 
>FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack, but could 
>be wrong?)

When the client is in FIN_WAIT2, this is the state you end up in when the client 
side actively close()s the TCP session, and the server has then ACKed the FIN. 
This will last for ~2 min or so, but is asynchronous. However, the same 4-tuple 
cannot be reused during this time.

In other words, from the socket / TCP perspective, a properly executed active close() 
will end up in this state. (If the other side initiated the close - a passive 
close - it will not end up in this state.)




AW: panic: sackhint bytes rtx >= 0

2021-02-23 Thread Scheffenegger, Richard
Hi Andriy,

I guess I am currently the person who has the most recent knowledge about that 
part of the base stack...

Do you happen to have more (preceding) information about this, or a way to 
reproduce this?

Are you running any special stack (RACK, BBR) which may have switched back to 
the base stack in the middle of a loss recovery (I suspected at one point that 
this may cause issues, potentially)?

Or was something done with the ipfw that may have temporarily impacted a tcp 
session?


The accounting with sack_bytes_rexmit is rather old, and not touched recently 
(but the sackhint struct was changed recently, and other/additional scoreboard 
accounting was added).


(kgdb) p *cur
$1 = {start = 3846347980, end = 3846352300, rxmit = 3846352300, scblink =
{tqe_next = 0xf8013da5a220, tqe_prev = 0xf80754818930}}

This indicates that the current hole in the SACK scoreboard (3 segments of 
1440 bytes each) was retransmitted (rxmit == end) before the current 
acknowledgement came back.

Thus the expectation is that sackhint.sack_bytes_rexmit also has a value of at 
least that number of bytes (4320). It is increased in tcp_output() for each 
packet sent while performing a retransmission.

But this is the peculiar part:
(kgdb) p tp@entry->sackhint.sack_bytes_rexmit
$3 = -1440

This indicates that negative one packet had been retransmitted before (thus subtracting 
the hole, which was previously retransmitted, violates the invariant). The 
only piece of code decrementing it appears to be in tcp_output() during 
non-permanent error handling...


All updates to sackhint should be protected by the INP lock, so even if the rx 
and tx paths are running on different cores, sack_bytes_rexmit should never 
become negative.


The sack blocks returned indicate that (with snd.una as zero baseline, in 
segments) the client knows about segments 2..34 and 35..47.

The first hole has shrunk from the right (unusual; possible when two 
retransmissions were lost again, or when the 3 segments originally sent were delayed by 
~50 segments, which is unlikely).


Sorry for not being able to spot something obvious right away...


Richard Scheffenegger
Consulting Solution Architect
NAS & Networking

NetApp
+43 1 3676 811 3157 Direct Phone
+43 664 8866 1857 Mobile Phone
richard.scheffeneg...@netapp.com

https://ts.la/richard49892


From: tue...@freebsd.org 
Sent: Tuesday, 23 February 2021 22:21
To: Richard Scheffenegger 
Subject: Fwd: panic: sackhint bytes rtx >= 0

FYI


Begin forwarded message:

From: Andriy Gapon <a...@freebsd.org>
Subject: panic: sackhint bytes rtx >= 0
Date: 23. February 2021 at 22:02:20 CET
To: FreeBSD Current <curr...@freebsd.org>, n...@freebsd.org


Got this panic on 13.0-STABLE 4b2a20dfde9c using a custom kernel with INVARIANTS
enabled.
Below is some information from the crash dump.
If anyone has any clues, suggestions, etc, please help.
I will try to help you to help me the best I can.

#0  doadump (textdump=textdump@entry=1)
   at /usr/devel/git/trant/sys/kern/kern_shutdown.c:399
#1  0x808396b2 in kern_reboot (howto=260)
   at /usr/devel/git/trant/sys/kern/kern_shutdown.c:486
#2  0x80839d07 in vpanic (
   fmt=0x80cbd551 "sackhint bytes rtx >= 0", ap=0xfe0120b9e6d0)
   at /usr/devel/git/trant/sys/kern/kern_shutdown.c:919
#3  0x808398b3 in panic (fmt=)
   at /usr/devel/git/trant/sys/kern/kern_shutdown.c:843
#4  0x8098a82c in tcp_sack_doack (tp=,
   tp@entry=0xf807548187f0, to=,
   to@entry=0xfe0120b9e780, th_ack=)
   at /usr/devel/git/trant/sys/netinet/tcp_sack.c:691
#5  0x80983699 in tcp_do_segment (m=0xf8029868ca00,
   m@entry=,
   th=,
   th@entry=,
   so=0xf804e7359b10,
   so@entry=,
   tp=0xf807548187f0,
   tp@entry=,
   drop_hdrlen=60,
   drop_hdrlen@entry=,
   tlen=,
   tlen@entry=,
   iptos=72 'H',
   iptos@entry=)
   at /usr/devel/git/trant/sys/netinet/tcp_input.c:2497
#6  0x80980d97 in tcp_input (mp=,
   mp@entry=,
   offp=,
   offp@entry=,
   proto=)
   at /usr/devel/git/trant/sys/netinet/tcp_input.c:1381
#7  0x80976eb7 in ip_input (m=0x0)
   at /usr/devel/git/trant/sys/netinet/ip_input.c:833
#8  0x8094c78f in netisr_dispatch_src (proto=1,
   source=source@entry=0, m=0xf8029868ca00)
   at /usr/devel/git/trant/sys/net/netisr.c:1143
#9  0x8094cb0e in netisr_dispatch (proto=,
   m=) at /usr/devel/git/trant/sys/net/netisr.c:1234
#10 0x80943345 in ether_demux (ifp=ifp@entry=0xf80008c75000,
   m=) at /usr/devel/git/trant/sys/net/if_ethersubr.c:923
#11 0x809446c1 in ether_input_internal (ifp=0xf80008c75000,
   m=) at /usr/devel/git/trant/sys/net/if_ethersubr.c:709
#12 0x809443d0 in 

RE: Socket option to configure Ethernet PCP / CoS per-flow

2020-10-09 Thread Scheffenegger, Richard
Hi Ryan,

D26409 was committed in r366569. The socket option is now "living" under the 
IP_PROTO or IPV6_PROTO (depending on the AF_FAMILY used by the socket), and 
stored with the INPCB for more efficient processing.

Would be grateful if anyone could look at D26627, which adds a "-C " 
option to ping, to validate the functionality (host + network).

That patch for ping also shows a simple example as to how to use this new 
functionality (basically, perform a setsockopt between a bind/connect or 
bind/listen of the socket, similar to other such setsockopt calls). 
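
For illustration, a minimal client-side sketch of that pattern (assumptions: the
option ends up spelled IP_VLAN_PCP at the IPPROTO_IP level, as suggested above -
the exact name and value are whatever D26409/r366569 define in the headers):

#include <sys/socket.h>
#include <netinet/in.h>
#include <err.h>

int
connect_with_pcp(const struct sockaddr_in *dst, int pcp)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);

	if (s < 0)
		err(1, "socket");
#ifdef IP_VLAN_PCP
	/* between socket()/bind() and connect(), like other per-socket options */
	if (setsockopt(s, IPPROTO_IP, IP_VLAN_PCP, &pcp, sizeof(pcp)) < 0)
		err(1, "setsockopt(IP_VLAN_PCP)");
#endif
	if (connect(s, (const struct sockaddr *)dst, sizeof(*dst)) < 0)
		err(1, "connect");
	return (s);
}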

Best regards,


Richard Scheffenegger


-Original Message-
From: Ryan Stone  
Sent: Thursday, 24 September 2020 23:31
To: Scheffenegger, Richard 
Cc: n...@freebsd.org; transp...@freebsd.org
Subject: Re: Socket option to configure Ethernet PCP / CoS per-flow


Hi Richard,

At $WORK we're running into situations where PFC support would be very useful, 
so I think that this would be a good thing to add.  I have a
question: does your work also communicate the priority value for an mbuf down 
to the Ethernet driver, so that it can put the packet in the proper queue?


RE: Socket option to configure Ethernet PCP / CoS per-flow

2020-09-24 Thread Scheffenegger, Richard
Hi Ryan,

As you can see in the code, when a specific PCP value is associated with a 
session, a vlan header is added to the mbuf, before all that gets handed off to 
the device drivers.

(I did improve upon the $work code basis, in allowing "default" and "explicit" 
pcp values - rather than assuming an underlying interface will always have a 
default PCP of 0).

I'm not perfectly happy with the pcp value living in the socket struct, but 
frankly, there is no more appropriate layer anyway, and this approach should be 
pretty speed-efficient.

I'm not a hw driver person, so whatever happens to the mbuf after the vlan tag 
is added (a pure pcp=x, vlan=0 may be attached) is all up to how the driver / 
hardware deals with that header as part of the mbuf chain...

Also, if you do have an account on reviews.freebsd.org, perhaps you want to 
comment on the Diff, that this is valuable work... As this is outside my normal 
scope of tweaks, I would certainly need some positive reviews around this to 
get it approved for committing.

Were you able to patch your kernel and achieve what you were trying to do?

Do you see any value in an interface default, that effectively lets each new 
session rotate through all PCPs, to make PFC more useful and not degrade into 
simple xon/xoff "global" flow control?


Richard Scheffenegger

-Original Message-
From: Ryan Stone  
Sent: Thursday, 24 September 2020 23:31
To: Scheffenegger, Richard 
Cc: n...@freebsd.org; transp...@freebsd.org
Subject: Re: Socket option to configure Ethernet PCP / CoS per-flow

On Fri, Sep 11, 2020 at 12:33 PM Scheffenegger, Richard 
 wrote:
>
> Hi,
>
> Currently, upstream head has only an IOCTL API to set up interface-wide 
> default PCP marking:
>
> #define SIOCGVLANPCP SIOCGLANPCP /* Get VLAN PCP */
> #define SIOCSVLANPCP SIOCSLANPCP /* Set VLAN PCP */
>
> And the interface is via ifconfig <interface> pcp <value>.
>
> However, while this allows all traffic sent via a specific interface to be 
> marked with a PCP (priority code point), it defeats the purpose of PFC 
> (priority flow control) which works by individually pausing different queues 
> of an interface, provided there is an actual differentiation of traffic into 
> those various classes.
>
> Internally, we have added a socket option (SO_VLAN_PCP) to change the PCP 
> specifically for traffic associated with that socket, to be marked 
> differently from whatever the interface default is (unmarked, or the default 
> PCP).
>
> Does the community see value in having such a socket option widely available? 
> (Linux currently doesn't seem to have a per-socket option either, only a 
> per-interface IOCTL API).
>
> Best regards,
>
> Richard Scheffenegger
>

Hi Richard,

At $WORK we're running into situations where PFC support would be very useful, 
so I think that this would be a good thing to add.  I have a
question: does your work also communicate the priority value for an mbuf down 
to the Ethernet driver, so that it can put the packet in the proper queue?


RE: Socket option to configure Ethernet PCP / CoS per-flow

2020-09-11 Thread Scheffenegger, Richard
Thank you for the quick feedback.

On a related note - it just occurred to me, that the PCP functionality could be 
extended to make more effective use of PFC (priority flow control) without 
explicitly managing it on an application level directly.

Right now, PFC typically degenerates to good-old Flow control, as all traffic 
is handled just in the default class (0, or whatever is set up using the IOCTL 
interface API).

Typically, the different Ethernet classes come with a notion of prioritization 
between them - traffic in a "higher" class may be forwarded prior to traffic in 
a lower class. But that is not a strong requirement - using WRR with 1/8th of the 
bandwidth "reserved" for each class in a switch, and assigning flows to a random 
PCP value, PFC could work in a more scalable fashion - only blocking the fraction 
of traffic that is actually queue-building (has to go over a lower bandwidth 
link, or a NIC excessively pausing its ingress), thus reducing the chance of 
the formation of congestion trees...

E.g. PCP runs from 0 (default) to 7; 

Adding a socket option to explicitly assign traffic to one of these flows would 
allow testing and configuring applications to make use of "real" prioritization 
capabilities of modern switches.

And what I was just pondering was a special interface-level setting (e.g. 8), 
which results in a socket picking a "random" value when created, to distribute 
packets across all the queues available in hardware, allowing PFC to no longer 
collapse in effect to old FC-style "on"/"off" for all traffic... 

Perhaps someone here has experience with congestion tree formation in multi-hop 
switching environments, and can comment if the above approach would be feasible 
to address that FC issue?


Richard Scheffenegger


-Original Message-
From: sth...@nethelp.no  
Sent: Friday, 11 September 2020 18:55
To: Scheffenegger, Richard 
Cc: n...@freebsd.org; transp...@freebsd.org
Subject: Re: Socket option to configure Ethernet PCP / CoS per-flow


> However, while this allows all traffic sent via a specific interface to be 
> marked with a PCP (priority code point), it defeats the purpose of PFC 
> (priority flow control) which works by individually pausing different queues 
> of an interface, provided there is an actual differentiation of traffic into 
> those various classes.
>
> Internally, we have added a socket option (SO_VLAN_PCP) to change the PCP 
> specifically for traffic associated with that socket, to be marked 
> differently from whatever the interface default is (unmarked, or the default 
> PCP).
>
> Does the community see value in having such a socket option widely available? 
> (Linux currently doesn't seem to have a per-socket option either, only a 
> per-interface IOCTL API).

I've been doing quite a bit of network testing using iperf3 and similar tools, 
and have wanted this type of functionality since the interface option became 
available. Having this on a socket level would make it possible to teach 
iperf3, ping and other tools to set PCP and facilitate/simplify testing of L2 
networks.

So the answer is a definite yes! This would be valuable.

Steinar Haug, Nethelp consulting, sth...@nethelp.no


Socket option to configure Ethernet PCP / CoS per-flow

2020-09-11 Thread Scheffenegger, Richard
Hi,

Currently, upstream head has only an IOCTL API to set up interface-wide default 
PCP marking:

#define SIOCGVLANPCP SIOCGLANPCP /* Get VLAN PCP */
#define SIOCSVLANPCP SIOCSLANPCP /* Set VLAN PCP */

And the interface is via ifconfig <interface> pcp <value>.

However, while this allows all traffic sent via a specific interface to be 
marked with a PCP (priority code point), it defeats the purpose of PFC 
(priority flow control) which works by individually pausing different queues of 
an interface, provided there is an actual differentiation of traffic into those 
various classes.

Internally, we have added a socket option (SO_VLAN_PCP) to change the PCP 
specifically for traffic associated with that socket, to be marked differently 
from whatever the interface default is (unmarked, or the default PCP).

Does the community see value in having such a socket option widely available? 
(Linux currently doesn't seem to have a per-socket option either, only a 
per-interface IOCTL API).

Best regards,

Richard Scheffenegger



RE: TFO for NFS

2020-08-29 Thread Scheffenegger, Richard
Hi Rick,

> It seems that, for TFO to be useful, the application needs to be doing 
> frequent short lived TCP connections and often across WAN/Internet.
> NFS mounts do neither of the above.
> - They, as we've noted, only normally do a TCP connect at mount time.
>   Usually run on low latency LAN environments. (High latency connections
>   hammer NFS performance, due to its frequent small RPCs that the client
>   must wait for replies to sychronously.)

Standard, run of the mill kernel-NFS-client mounts don't. OTOH, TFO is a 
transparent feature (on the server side), and requires slight changes on the 
client side to be useful. Providing  a "worked example" of how to do this 
(properly?) might inspire other uses later on. 

You may have heard of a large corporation mostly known for their Databases 
(and Java). There, the application itself implements a streamlined, 
light-weight NFS client (dNFS). In this scenario, each DB task / worker very 
frequently sets up a dedicated NFS session to perform the set of IOs necessary 
to complete its task, and then stop using this tcp session. And there are a 
couple more implementations of User-space, streamlined NFS applications, which 
utilize many parallel TCP sessions when working against an NFS server, 
bypassing the in-kernel NFS client.

> (If you were to implement this and benchmarking showed a significant  
> improvement in elapsed time to do an NFS mount, then that could be a 
> different story.)

Let me reach out to the maintainer of one of the applications using a userspace 
NFS client (an rsync-like app, bypassing the in-kernel NFS client for dramatic 
bandwidth gains during simple copy jobs, assuming exclusive access for the 
duration). That could be tweaked more easily to show the behavior of the 
above-mentioned, widely deployed DB application.


> I'm not sure I understand this. NFS always uses port# 2049.
> If you are referring to the host IP address, then wouldn't that be handled 
> via.

The NFS entity (and state, for v4) is handed off between hosts, while the IP 
remains, in this scenario. However, TCP state is not, and the client will 
observe a sudden RST when such a migration happened on the server side; then it 
has to re-establish the TCP and reclaim locks (in the v4 case), before being 
able to continue.

Anyway, I'll try to come up with a proof-of-concept patch, and try to get 
benchmark data. Let's continue the discussion then.

Best regards,
   Richard



-Original Message-
From: Rick Macklem  
Sent: Saturday, 29 August 2020 04:11
To: Scheffenegger, Richard 
Cc: Michael Tuexen ; freebsd-net@freebsd.org
Subject: Re: TFO for NFS

Scheffenegger, Richard wrote:
>I know, NFS TCP sessions are some of the most long-lived sessions in regular 
>use.
Ok, so I'll admit I can't wrap my head around this.
It is way out of my area of expertise (so I've added freebsd-net@ to the cc), 
but it seems to me that NFS is the about the least appropriate use fot TFO.

It seems that, for TFO to be useful, the application needs to be doing frequent 
short lived TCP connections and often across WAN/Internet.
NFS mounts do neither of the above.
- They, as we've noted, only normally do a TCP connect at mount time.
   Usually run on low latency LAN environments. (High latency connections
   hammer NFS performance, due to its frequent small RPCs that the client
   must wait for replies to sychronously.)

All you might save is one RTT. Take a look at how many RPCs (each with a RTT) 
happen on an active NFS mount.

>My rationale is two-fold:
>
>First, having a relatively high-profile use of the TFO option in the core OS 
>modules >will definitely expose that feature to at least some use.
Well, I don't think it is NFS's job to expose a feature that is not useful for 
it.
(If you were to implement this and benchmarking showed a significant  
improvement in elapsed time to do an NFS mount, then that could be a different 
story.)

>Second, in case of a network disconnect (or, something with my company does, 
>>that would be most comparable to unassigning and reassigning the server IP 
>>address between different physical ports), while there is IO load, TFO may 
>reduce >(ever so slightly) the latency impact of the enqueued IOs.
I'm not sure I understand this. NFS always uses port# 2049.
If you are referring to the host IP address, then wouldn't that be handled via.
Arp and routing? (Does this require a fresh TCP connection to the same server 
IP address?)

>My plan is first to simply enable the socket option - that should result in 
>TFO to >get negotiated for, but no actual latency improvement, while the 
>traditional >connect() sequence to set up a TCP session is done., from the 
>client side; the >serve

RE: Fast recovery ssthresh value

2020-08-23 Thread Scheffenegger, Richard
Hi Liang,

In SACK loss recovery, you can recover up to ssthresh (prior cwnd/2 [or 70% in 
case of cubic]) lost bytes - at least in theory.

In comparison, (New)Reno can only recover one lost packet per window, and then 
keeps on transmitting new segments (ack + cwnd), even before the receipt of the 
retransmitted packet is acked.

For historic reasons, the semantic of the variable cwnd is overloaded during 
loss recovery, and it doesn't "really" indicate cwnd, but rather indicates 
if/when retransmissions can happen.


In both cases (also the simple one, with only one packet loss), cwnd should be 
equal (or near equal) to ssthresh by the time loss recovery is finished - but 
NOT before! While it may appear like slow-start, the value of the cwnd variable 
really increases by acked_bytes only per ACK (not acked_bytes + SMSS), since 
the left edge (snd_una) doesn't move right - unlike during slow-start. But 
numerically, these different phases (slow-start / sack loss-recovery) may 
appear very similar.

You could check this using the (loadable) SIFTR module, which captures t_flags 
(indicating if cong/loss recovery is active), ssthresh, cwnd, and other 
parameters.

That is at least how things are supposed to work; or have you investigated the 
timing and behavior of SACK loss recovery and found a deviation from RFC3517? 
Note that FreeBSD currently has not fully implemented RFC6675 support (which 
deviates slightly from 3517 under specific circumstances; I have a patch 
pending to implement 6675 rescue retransmissions, but haven't tweaked the 
other aspects of 6675 vs. 3517).

BTW: While freebsd-net is not the wrong DL per se, TCP, UDP, SCTP specific 
questions can also be posted to freebsd-transport, which is more narrowly 
focused.

Best regards,

Richard Scheffenegger

-Original Message-
From: owner-freebsd-...@freebsd.org  On Behalf 
Of Liang Tian
Sent: Sonntag, 23. August 2020 00:14
To: freebsd-net 
Subject: Fast recovery ssthresh value

Hi all,

When 3 dupacks are received and TCP enter fast recovery, if SACK is used, the 
CWND is set to maxseg:

2593 if (tp->t_flags & TF_SACK_PERMIT) {
2594 TCPSTAT_INC(
2595 tcps_sack_recovery_episode);
2596 tp->snd_recover = tp->snd_nxt;
2597 tp->snd_cwnd = maxseg;
2598 (void) tp->t_fb->tfb_tcp_output(tp);
2599 goto drop;
2600 }

Otherwise(SACK is not in use), CWND is set to maxseg before
tcp_output() and then set back to snd_ssthresh+inflation
2601 tp->snd_nxt = th->th_ack;
2602 tp->snd_cwnd = maxseg;
2603 (void) tp->t_fb->tfb_tcp_output(tp);
2604 KASSERT(tp->snd_limited <= 2,
2605 ("%s: tp->snd_limited too big",
2606 __func__));
2607 tp->snd_cwnd = tp->snd_ssthresh +
2608  maxseg *
2609  (tp->t_dupacks - tp->snd_limited);
2610 if (SEQ_GT(onxt, tp->snd_nxt))
2611 tp->snd_nxt = onxt;
2612 goto drop;

I'm wondering in the SACK case, should CWND be set back to ssthresh(which has 
been slashed in cc_cong_signal() a few lines above) before line 2599, like 
non-SACK case, instead of doing slow start from maxseg?
I read rfc6675 and a few others, and it looks like that's the case. I 
appreciate your opinion, again.

Thanks,
Liang


RE: FreeBSD TCP/IP Tasks I (a contributor) could work on?

2020-08-12 Thread Scheffenegger, Richard
Hi Neel,

If you are brave enough to leave the (mostly) stateless domain of L3 packet 
handling, and take on the challenge of dipping your toes into the unforgiving realm 
of stateful L4 transport protocols, that would certainly be an area where every 
helping hand counts.

E.g. Rod has recently found some quite dated Diffs, around TSO, which would 
probably need some love (reviews and validation): 

D6611
D6612
D6656

(There are many more semi-abandoned Diffs waiting on reviews.freebsd.org; and I 
also have a few Diffs for new functionality waiting for someone willing to read 
into the RFCs and confirm that the code performs as intended, around RFC6675, 
ECN+/ECN++, ...)

If this is something for you, there are also biweekly transport calls happening 
(should be in your evening) - and anyone wanting to join is welcome.

Best regards,

Richard Scheffenegger

-Original Message-
From: owner-freebsd-...@freebsd.org  On Behalf 
Of Neel Chauhan
Sent: Dienstag, 11. August 2020 06:59
To: freebsd-net@freebsd.org
Subject: FreeBSD TCP/IP Tasks I (a contributor) could work on?


Hi freebsd-net@,

Sorry if this is the wrong place to post this.

In case you were wondering, I am responsible for patches like r357092 
(IPFW/libalias RFC6598, original idea), r363403 and r362900 (related to routing 
KPI, suggested by melifaro@).

However, despite my current accepted code, I am dry of ideas. At the same time, 
I'd love to work with kernel code, especially the network code.

Sadly, the Wiki is dreadfully out-of-date.

Is there any things I could work on in the FreeBSD networking stack? Any things 
you committers need help with?

Best,

Neel Chauhan

==

https://www.neelc.org/


RE: SFP I2C interface in drivers (driver SIOCGI2C support)

2020-06-22 Thread Scheffenegger, Richard
Hi Alex,

I was looking for the SIOCGI2C socket ioctl API, and did not find it in the 
Intel driver, which led me to believe this was the reason for me not seeing the 
SFP data.

But after correctly looking up the SFP info (the port was in a different jail, 
which confused me), it turns out that it's not the Intel NIC, but the QLogic CNA 
that is not providing this info when in IP mode.

slot 0: 1G/10G Ethernet Controller CNA EP 8324
(Dual-port, QLogic CNA 8324(8362) rev. 2)
e0f MAC Address:00:a0:98:ef:a9:4c (auto-10g_twinax-fd-up)
e0e MAC Address:00:a0:98:ef:a9:4b (auto-10g_twinax-fd-up)
Device Type:EP8324N
Firmware Version:   5.4.66.0

ql0@pci0:14:0:0:class=0x02 card=0xfb051275 chip=0x88301077 rev=0x02 
hdr=0x00
vendor = 'QLogic Corp.'
class  = network
subclass   = ethernet


Which is the driver qlxgbe...




Richard Scheffenegger


-Original Message-
From: Alexander V.Chernikov  
Sent: Monday, 22 June 2020 14:49
To: Scheffenegger, Richard ; n...@freebsd.org
Subject: Re: SFP I2C interface in drivers (driver SIOCGI2C support)

22.06.2020, 12:52, "Scheffenegger, Richard" :
> Hi,
>
> I am just curious if anyone is working to get the NIC drivers support to read 
> the pluggables I2C status (temperature, voltage level, optical power levels) 
> from Intel NICs and Qlogic CNAs?
Hi Richard,
which Intel nics you're referring to? IIRC ixgbe/ixl SIOCGI2C support was added 
5? years ago.

>
> Richard Scheffenegger
>


SFP I2C interface in drivers (driver SIOCGI2C support)

2020-06-22 Thread Scheffenegger, Richard
Hi,

I am just curious if anyone is working to get the NIC drivers support to read 
the pluggables I2C status (temperature, voltage level, optical power levels) 
from Intel NICs and Qlogic CNAs?

Richard Scheffenegger



RE: Some question about DCTCP implementation in FreeBSD

2019-06-06 Thread Scheffenegger, Richard

Hi Yu He,

This code is simply using integer arithmetic (floating point is not really possible in 
the kernel), scaling the fractional value of g by 1024 (a left shift by 10 bits).

Max_alpha_value = 1024 is “1” shifted left by 10.
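
As a standalone illustration of that fixed-point scheme (sketch only, not the
committed code; g = 1/16 and the function/variable names are assumptions for the
example):

#include <stdint.h>

#define DCTCP_SHIFT     10
#define MAX_ALPHA_VALUE (1 << DCTCP_SHIFT)     /* fixed-point representation of 1.0 */

static uint32_t
dctcp_update_alpha(uint32_t alpha, uint64_t bytes_ecn, uint64_t bytes_total)
{
	/* F = bytes_ecn / bytes_total, scaled up by 2^DCTCP_SHIFT */
	uint32_t frac = bytes_total ?
	    (uint32_t)((bytes_ecn << DCTCP_SHIFT) / bytes_total) : 0;

	/* alpha = (1 - g) * alpha + g * F, with g = 1/16, all in fixed point */
	alpha = alpha - (alpha >> 4) + (frac >> 4);
	return (alpha > MAX_ALPHA_VALUE ? MAX_ALPHA_VALUE : alpha);
}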

Agreed that this is not clearly documented, and I believe the sysctl handler 
also is not properly implemented to adjust this value.

I thought I had been working on this…

Ah, here it is. I was trying to implement the fluid-model DCTCP for much better 
RTT fairness, but apparently got distracted before putting up a Diff.
Fluid-model DCTCP was analyzed by the original authors of DCTCP, and basically 
adjusts cwnd fractionally immediately after a CE is received, instead of once 
at the end of a window. The update to Alpha is kept to once per window, to keep 
the bookkeeping easy and straight forward.

For reference, here is the partial code I came up with.

I’ll break this into an initial Diff to fix the sysctl tunables, as soon as I 
can. Would appreciate any help in getting the fluid-model improvement fully 
tested though.


[root@freebsd ~/netinet/cc]# git diff master cc_dctcp.c
diff --git a/sys/netinet/cc/cc_dctcp.c b/sys/netinet/cc/cc_dctcp.c
index 9affd0da2b3..778ff7a8477 100644
--- a/sys/netinet/cc/cc_dctcp.c
+++ b/sys/netinet/cc/cc_dctcp.c
@@ -56,7 +56,38 @@ __FBSDID("$FreeBSD$");
#include 
#include 

-#define MAX_ALPHA_VALUE 1024
+#define DCTCP_SHIFT 10
+#define MAX_ALPHA_VALUE (1 << DCTCP_SHIFT)
@@ -132,6 +166,7 @@ dctcp_ack_received(struct cc_var *ccv, uint16_t type)

/* Update total marked bytes. */
if (dctcp_data->ece_curr) {
+   // Fluid-model DCTCP for RTT fairness: adjust cwnd on each ACK, rather than once per window
if (!dctcp_data->ece_prev
&& bytes_acked > CCV(ccv, t_maxseg)) {
dctcp_data->bytes_ecn +=
@@ -143,10 +178,13 @@ dctcp_ack_received(struct cc_var *ccv, uint16_t type)
if (dctcp_data->ece_prev
&& bytes_acked > CCV(ccv, t_maxseg))
dctcp_data->bytes_ecn += CCV(ccv, t_maxseg);
+// (bytes_acked - CCV(ccv, t_maxseg));
dctcp_data->ece_prev = 0;
}
dctcp_data->ece_curr = 0;

/*
 * Update the fraction of marked bytes at the end of
 * current window size.

static void
@@ -165,18 +205,21 @@ dctcp_after_idle(struct cc_var *ccv)
{
struct dctcp *dctcp_data;

-   dctcp_data = ccv->cc_data;
+   if (CCV(ccv, t_flags) & TF_ECN_PERMIT) {

-   /* Initialize internal parameters after idle time */
-   dctcp_data->bytes_ecn = 0;
-   dctcp_data->bytes_total = 0;
-   dctcp_data->save_sndnxt = CCV(ccv, snd_nxt);
-   dctcp_data->alpha = V_dctcp_alpha;
-   dctcp_data->ece_curr = 0;
-   dctcp_data->ece_prev = 0;
-   dctcp_data->num_cong_events = 0;
+   dctcp_data = ccv->cc_data;

-   dctcp_cc_algo.after_idle = newreno_cc_algo.after_idle;
+   /* Initialize internal parameters after idle time */
+   dctcp_data->bytes_ecn = 0;
+   dctcp_data->bytes_total = 0;
+   dctcp_data->save_sndnxt = CCV(ccv, snd_nxt);
+   dctcp_data->alpha = V_dctcp_alpha << DCTCP_SHIFT;
+   dctcp_data->ece_curr = 0;
+   dctcp_data->ece_prev = 0;
+   dctcp_data->num_cong_events = 0;
+   }
+
+   newreno_cc_algo.after_idle(ccv);
}

static void
@@ -209,7 +252,7 @@ dctcp_cb_init(struct cc_var *ccv)
 * Note: DCTCP draft suggests initial alpha to be 1 but we've decided to
 * keep it 0 as default.
 */
-   dctcp_data->alpha = V_dctcp_alpha;
+   dctcp_data->alpha = V_dctcp_alpha << DCTCP_SHIFT;
dctcp_data->save_sndnxt = 0;
dctcp_data->ce_prev = 0;
dctcp_data->ece_curr = 0;
@@ -227,63 +270,73 @@ static void
dctcp_cong_signal(struct cc_var *ccv, uint32_t type)
{
struct dctcp *dctcp_data;
-   u_int win, mss;
+   u_int cwnd, mss;

-   dctcp_data = ccv->cc_data;
-   win = CCV(ccv, snd_cwnd);
-   mss = CCV(ccv, t_maxseg);
+   if (CCV(ccv, t_flags) & TF_ECN_PERMIT) {

-   switch (type) {
-   case CC_NDUPACK:
-   if (!IN_FASTRECOVERY(CCV(ccv, t_flags))) {
+   dctcp_data = ccv->cc_data;
+   cwnd = CCV(ccv, 

RE: RFC8312 Cubic

2019-01-23 Thread Scheffenegger, Richard
Thanks,

But this is against BSD11 (and checked against HEAD, which is virtually the 
same) – sorry for omitting this critical piece of info.


The issue we are observing is a performance degradation (only) when the sender 
has frequent idle periods because the application is not handing new data 
continuously to the socket.

That is, new data is generated at infrequent intervals, ranging from sub-RTT to 
dozens of RTTs, with varying size (sub-window to multiple windows' worth). The 
problem surfaces only sometimes, after the RTO timer has run out (cc_after_idle 
is called in tcp_output).

Depending on the exact timing, cwnd can then become very large (>int32), with 
the burst harming the session itself, or very small (2 MSS) growing very slowly 
at first…

But it’s “just” a ~25% reduced performance for that application compared with 
vanilla newreno. In general (more streaming applications, longer RTT paths) 
cubic beats newreno by a large margin in goodput (up to 5x higher for 
transcontinental).

We suspect that not clearing the cubic state in after_idle may be part of the 
culprit here (the cubic epoch should not stretch beyond idle periods, according to 
RFC8312), and that slow start is not always invoked with a properly reset ssthresh.
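
A back-of-the-envelope illustration of why a stale epoch inflates cwnd (plain
RFC8312 arithmetic, not the kernel's fixed-point code; MSS 1448 and W_max = 100
segments are just example numbers):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	const double C = 0.4, beta = 0.7, mss = 1448.0;
	double wmax = 100.0;            /* segments at the last congestion event */
	double K = cbrt(wmax * (1.0 - beta) / C);

	/* W_cubic(t) = C*(t-K)^3 + W_max, with t = time since the epoch started */
	for (double t = 16.0; t <= 256.0; t *= 2.0) {
		double w = C * pow(t - K, 3.0) + wmax;
		printf("t=%6.1fs  W=%12.0f segments  %16.0f bytes\n",
		    t, w, w * mss);
	}
	/* around t ~ 160s the byte value already exceeds 2^31 - so if after_idle
	 * lets the epoch keep running across a long idle period, the computed
	 * cwnd overflows / bursts exactly as described above */
	return (0);
}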

Best regards,
  Richard


From: Freddie Cash 
Sent: Wednesday, 23 January 2019 21:41
To: Scheffenegger, Richard 
Cc: freebsd-transp...@freebsd.org; freebsd-net@freebsd.org
Subject: Re: RFC8312 Cubic

On Wed, Jan 23, 2019 at 12:34 PM Scheffenegger, Richard 
<richard.scheffeneg...@netapp.com> wrote:
Hi,

we encounted an issue with the BSD11 cubic implementation during testing, when 
dealing with app-stalled and app-limited traffic patterns.

The gist is, that at least in the after_idle cong_func for cubic, the states 
for ssthresh and cubic epoch time are not reset, leading to excessive cwnd 
values (overflows even) and self-inflicted drops due to dramatic burst 
transmissions.

Bug Report, or Patch directly to Phabricator (when it is fully qualified)?

Search the freebsd-stable mailing list archives for the thread with subject 
line:

 HEADS UP: TCP CUBIC Broken on 12.0-RELEASE/STABLE

https://lists.freebsd.org/pipermail/freebsd-stable/2018-December/090255.html

An Errata Notice should be going out sometime this month-ish.

--
Freddie Cash
fjwc...@gmail.com
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


RFC8312 Cubic

2019-01-23 Thread Scheffenegger, Richard
Hi,

we encountered an issue with the BSD11 cubic implementation during testing, when 
dealing with app-stalled and app-limited traffic patterns.

The gist is, that at least in the after_idle cong_func for cubic, the states 
for ssthresh and cubic epoch time are not reset, leading to excessive cwnd 
values (overflows even) and self-inflicted drops due to dramatic burst 
transmissions.

Bug Report, or Patch directly to Phabricator (when it is fully qualified)?

Best regards, 
   Richard



RFC6675

2018-12-21 Thread Scheffenegger, Richard
For those inclined, I have a working patch for full RFC6675 support now (need 
to validate pipe and behavior in various scenarios still):

a) enters Loss Recovery on either Nth dupack, or when more than (N-1)*MSS bytes 
are sacked (the latter relevant for ack thinning)

b) a cumulative ack below snd_max, while snd_max == recovery_point (application 
limited) no longer requires a lengthy RTO and slow start to recover from. 
Instead, the Rescue Retransmission mechanism is implemented.

c) proper accounting for delivered_data and sacked_bytes in the single-pass 
update of the scoreboard. These variables enable further mechanisms like 
Proportional Rate Reduction. Also, this fixes potential exploits by malicious 
clients, and support thin IoT clients that can store data to the right of 
rcv_ack, but not keep state for a full RFC3517 compliant scoreboard.

I certainly need reviewers, if you are interested please let me know so that I 
can sign you in for it. Also, further testing would be required.

Best regards,
   Richard




RE: ECN+ Implementation

2018-11-04 Thread Scheffenegger, Richard
Hi Pavan,

Try to make this behavior change dependent on a sysctl, possibly with 3 
different settings (legacy, ECN+, ECN++):

You may also want to look at ECN++
https://tools.ietf.org/html/draft-ietf-tcpm-generalized-ecn-03

Especially when you are looking to deploy this in Datacenters in conjunction 
with DCTCP.
 
Best regards,
   Richard

-Original Message-
From: owner-freebsd-...@freebsd.org  On Behalf 
Of Pavan Vachhani
Sent: Saturday, 3 November 2018 20:33
To: freebsd-net@freebsd.org
Subject: ECN+ Implementation

Hi,
I am trying to implement ECN+ (RFC 5562)
in FreeBSD.
I am not able to figure out the code where SYN and SYN+ACK is sent and received.
Please guide me to correct part of code. It was looking into tcp_input.c and 
tcp_output.c but couldn't get it for sure.
Sorry for simple questions, I am a beginner.

Thanks in advance.


TCP SACK improvements (RFC6675 rescue retransmission and lost retransmission detection)

2015-03-25 Thread Scheffenegger, Richard
Hi,

I hope this is the correct forum to ask for help improving a rather crude patch 
to introduce RFC6675 Rescue Retransmissions and efficient Lost Retransmission 
Detection. Note that this is not a full implementation of the RFC6675.

The patch that I have is against 8.0, but I believe the SACK scoreboard has not 
really changed and thus should be applicable still.

One outstanding issue of that patch is the missing interaction with the 
congestion control part, when retransmitting a lost retransmission (that should 
reduce cwnd once  per cycle).

Also, the implementation is not very efficient, as more traversals of the 
scoreboard are done checking the elegibility of prior holes to be in need of 
being retransmitted once more...


Best regards,
  Richard Scheffenegger



sack5-freebsd8.0-release.diff
Description: sack5-freebsd8.0-release.diff

RE: 1gbit LFN WAN link - odd tcp behavior

2011-06-28 Thread Scheffenegger, Richard

What is the effective latency under load, and the packet loss
probability?

Just to add some more detail.

Richard Scheffenegger


 -Original Message-
 From: William Salt [mailto:williamejs...@googlemail.com]
 Sent: Monday, 27 June 2011 12:15
 To: freebsd-net@freebsd.org
 Subject: 1gbit LFN WAN link - odd tcp behavior
 
 Hi All,
  For the last couple of months i have been pulling my hair out
 trying to solve this problem.
 We have a 1Gbps transatlantic link from the UK to the US, which has
 successfully passed the RFC2544 test.
 
 At either end, we have a media converter, and a supermicro server with
 an
 intel quad port NIC running pfsense 2 (freebsd 8.1) with the NIC
 running on
 the yandex IGB driver.
 
 We can pass 1gbps either way with UDP. However we are experiencing
very
 strange issues with tcp connections.
 
 With window scaling enabled, and a max socket buffer set to 16MB, we
 see no
 difference.
 Even disabling window scaling and setting the window to 16MB makes no
 difference.
 
 Each TCP connection starts very slowly, and will max out at around
 190mbps,
 taking nearly 2 minutes to climb to this speed before *plateauing*.
 
 We have to initiate many (5+) connections to saturate the link with
tcp
 connections with iperf.
 
 I have followed guides like this:
 http://www.psc.edu/networking/projects/tcptune/#FreeBSD
 
 With no luck, and have tweaked, disabled, and enabled nearly every
 relevant
 sysctl parameter with no luck.
 
 Can anyone shed some light on this?
 
 I am now doubting the IGB driver, and am looking to swap out the cards
 as a
 last ditch effort.
 However, we have tried different hardware (L3 switches, media
convertes
 +
 laptops etc), and the symptoms still persist...
 The only constant is freebsd 8.1 (or 8.2 for production systems).
 
 
 Cheers in advance
 Will


kern/140597: Lost Retransmission Detection

2011-05-31 Thread Scheffenegger, Richard
Hi,

please review the following patch, which enables the detection and
recovery of lost retransmissions for SACK. This patch addresses the second
most prominent cause of retransmission timeouts (after the failure to
initiate loss recovery for small window sessions - e.g. Early
Retransmit).

The idea behind this patch is the same one that is emergent behavior in
the linux stack: 1RTT after sending a retransmission, that
retransmission should have made it to the receiver; the 1RTT signal is a
newly sent segment (snd.fack advances) being acknowledged by the
receiver...

The pointer indicating where a retransmission is to be started, is set
to snd.nxt (new segment after this loss window) - outside the boundaries
of the hole itself - to remember when a hole needs to be retransmitted.
If all retransmissions are proceeding in-order, these holes would close
and eventually be evicted from the scoreboard, before the first new
transmission after the loss window is SACKed by the receiver.

Over the first instance of this patch, it addresses a slight oversight,
when retransmitted segments from multiple holes became lost - it
traverses all the holes from the sackhint forward to the beginning of
the scoreboard when snd.fack is adjusted, and resets the pointer (and
sackhint) where the next transmission should come from accordingly.

However - just like Linux - there is no congestion control reaction
(even though the papers discussing lost retransmission all mention that
another reduction of cwnd would be appropriate).


Richard Scheffenegger


diff -u netinet.orig/tcp_output.c netinet/tcp_output.c
--- netinet.orig/tcp_output.c   2009-10-25 02:10:29.0 +0100
+++ netinet/tcp_output.c        2010-04-02 16:55:14.0 +0200
@@ -953,6 +953,10 @@
 		} else {
 			th->th_seq = htonl(p->rxmit);
 			p->rxmit += len;
+			/* lost again detection */
+			if (SEQ_GEQ(p->rxmit, p->end)) {
+				p->rxmit = tp->snd_nxt;
+			}
 			tp->sackhint.sack_bytes_rexmit += len;
 		}
 		th->th_ack = htonl(tp->rcv_nxt);
diff -u netinet.orig/tcp_sack.c netinet.simple_mod/tcp_sack.c
--- netinet.orig/tcp_sack.c 2009-10-25 02:10:29.0 +0100
+++ netinet/tcp_sack.c  2010-04-21 00:48:23.0 +0200
@@ -508,7 +508,9 @@
 			if (SEQ_GEQ(sblkp->end, cur->end)) {
 				/* Move end of hole backward. */
 				cur->end = sblkp->start;
-				cur->rxmit = SEQ_MIN(cur->rxmit, cur->end);
+				if (SEQ_GEQ(cur->rxmit, cur->end)) {
+					cur->rxmit = tp->snd_nxt;
+				}
 			} else {
 				/*
 				 * ACKs some data in middle of a hole; need
@@ -524,8 +526,9 @@
 					    - temp->start);
 				}
 				cur->end = sblkp->start;
-				cur->rxmit = SEQ_MIN(cur->rxmit,
-				    cur->end);
+				if (SEQ_GEQ(cur->rxmit, cur->end)) {
+					cur->rxmit = tp->snd_nxt;
+				}
 			}
 		}
 	}
@@ -540,6 +543,15 @@
 		else
 			sblkp--;
 	}
+	/* retransmission lost again - then restart */
+	if ((temp = tp->sackhint.nexthole) != NULL) {
+		do {
+			if (SEQ_GT(tp->snd_fack, temp->rxmit)) {
+				temp->rxmit = temp->start;
+				tp->sackhint.nexthole = temp;
+			}
+		} while ((temp = TAILQ_PREV(temp, sackhole_head, scblink)) != NULL);
+	}
 }

 /*


RFC3517bis rescue retransmission

2011-05-31 Thread Scheffenegger, Richard
Hi,

RFC3517bis has added a provision to fix a special corner case of SACK
loss recovery. Under certain circumstances (end of stream), TCP SACK can
be much less effective in loss recovery than TCP NewReno.

For a history of this corner case, please see

https://datatracker.ietf.org/doc/draft-scheffenegger-tcpm-sack-loss-recovery/

http://tools.ietf.org/html/draft-nishida-tcpm-rescue-retransmission-00

and the final agreed algorithm:

http://www.ietf.org/mail-archive/web/tcpm/current/msg06518.html
(to be included in 3517bis)


However, the final version would not work well, if that first
retransmitted segment is also lost (unfortunately, the first
retransmitted segment has the highest probability to get lost again...).
I would like to get feedback on the following idea:

Whenever the socket does not hold any additional data, or no new
segments can be sent beyond snd.max:

min(so->so_snd.sb_cc, sendwin) - off == 0

and a hole in the scoreboard shrinks (or is removed), set a flag (i.e.
sackhint.hole_shrunk = 1) in sack_do_ack.

In sack_output, if all the holes were transmitted once (the entire
scoreboard is traversed without finding a new hole to send), and the
above conditions hold (no new data eligible to send, the flag is set, and a
rescue hole does not exist), add one rescue hole of size mss,
starting at snd.nxt-mss, to the scoreboard, and adjust the retransmission
trigger values (from snd.nxt to snd.nxt-1)...

This would keep the change co-located with the SACK code; the rescue
retransmission would be treated like any other hole during SACK recovery
processing. Also, this would be compatible with lost retransmission
detection. When the receive buffer is full, or the sender does not have
any new data to send, the sender can still track whether the rescue
retransmission made it. If it did, by adjusting the retransmission
trigger, any still pending holes would be retransmitted (pipe
allowing...) again. If the receiver buffer was full because a lost
first retransmission caused head-of-line blocking, this would unlock
the receive window as snd.una advances.
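
A rough sketch of where this could hook into the SACK output path (hypothetical;
hole_shrunk is the flag proposed above, and tcp_sackhole_insert() is assumed to
be usable for adding the rescue hole):

static struct sackhole *
tcp_sack_rescue_hole(struct tcpcb *tp, struct socket *so, long off, long sendwin)
{
	struct sackhole *rescue;

	/* only when no new data is eligible and a hole shrank on this ACK */
	if ((min(so->so_snd.sb_cc, sendwin) - off) != 0 ||
	    tp->sackhint.hole_shrunk == 0)
		return (NULL);

	/* one MSS ending just below snd_nxt acts as the rescue retransmission */
	rescue = tcp_sackhole_insert(tp, tp->snd_nxt - tp->t_maxseg,
	    tp->snd_nxt, NULL);
	tp->sackhint.hole_shrunk = 0;
	return (rescue);
}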

Any comments?

Richard Scheffenegger




Re: [CFT] Early Retransmit for TCP (rfc5827) patch

2011-05-31 Thread Scheffenegger, Richard
Hi Weongyo,

Good to know that you are addressing the primary reason for
retransmission timeouts with SACK.

(Small window (early retransmit) is ~70%, lost retransmission ~25%,
end-of-stream loss ~5% of all addressable causes for a RTO).

I looked at your code to enable RFC5827 Early Retransmits.

There is one minor nit-pick: tcp_input is calling tcp_getrexmtthresh for
every duplicate ACK. When SACK is enabled (over 90% of all sessions
today), the byte-based tcp_sack_ownd routine cycles over the entire SACK
scoreboard.

As the scoreboard can become huge with fat, long pipes, this appears to
be suboptimal. 

Perhaps something along these lines:

ackedbyte = 0;
int mark = tp->snd_una;
TAILQ_FOREACH(p, &tp->snd_holes, scblink) {
	ackedbyte += p->start - mark;
	if (ackedbyte >= amount)
		return (TRUE);
	mark = p->end;
}
ackedbyte += tp->snd_fack - mark;
if (ackedbyte >= amount)
	return (TRUE);
return (FALSE);

Would be more scalable (only the holes at the start need to be cycled over,
increasing the chances that they stay close to the CPU)...

Perhaps adding a variable to the scoreboard to track the number of bytes SACKed
(updated with the receipt of each new SACK block) would be even more efficient
(see the sketch below).
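
For example (sketch only; the sacked_bytes field and its update points are
assumptions here):

/* in struct sackhint: add "int sacked_bytes;" (bytes currently reported SACKed) */

/* in tcp_sack_doack(), for every byte range newly covered by a SACK block:
 *     tp->sackhint.sacked_bytes += sacked_range_len;
 * and subtract again when snd_una advances over already-SACKed data. */

/* the byte-based check then becomes O(1): */
static int
tcp_sack_ownd(struct tcpcb *tp, int amount)
{

	return (tp->sackhint.sacked_bytes >= amount);
}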

Best regards,
  Richard Scheffenegger





From: weon...@freebsd.org
Date: Sat May 7 00:19:38 UTC 2011

Hello all,

I'd like to send another patch to support RFC5827 in TCP stack which
could be found at:

http://people.freebsd.org/~weongyo/patch_20110506_rfc5827.diff

This patch supports all Early Retransmit logics (Byte-Based Early
Retransmit and Segment-Based Early Retransmit) when net.inet.tcp.rfc5827
sysctl knob is turned on.

Please note that Segment-Based Early Retransmit logic is separated using
khelp module because it adds additional operations and requires variable
spaces to track segment boundaries on the right side window.

So if the khelp module is loaded, it's a preference but if not the
default logic is `Byte-Based Early Retransmit'.

I implemented based on DragonflyBSD's implementation but it looked it's
not same with RFC specification what I thought so I changed most of
parts.  In my test environments it looks it's working correctly.

Please review and test my work and tell me if you have any concerns and
questions.

regards,
Weongyo Jeong

-- next part --
A non-text attachment was scrubbed...
Name: patch_20110506_rfc5827.diff
Type: text/x-diff
Size: 18455 bytes
Desc: not available
Url :
http://lists.freebsd.org/pipermail/freebsd-net/attachments/20110507/90f2f164/patch_20110506_rfc5827.bin



Re: kern/140597 implement Lost Retransmission Detection

2010-04-20 Thread Scheffenegger, Richard
The following reply was made to PR kern/140597; it has been noted by GNATS.

From: Scheffenegger, Richard r...@netapp.com
To: bug-follo...@freebsd.org, Lawrence Stewart lastew...@swin.edu.au
Cc: Biswas, Anumita anumita.bis...@netapp.com
Subject: Re: kern/140597 implement Lost Retransmission Detection
Date: Wed, 21 Apr 2010 02:16:01 +0100

 I found a small oversight (bug) in my earlier simple fix. If we had sent
 out multiple holes already, all of which get (partially) lost in the
 retransmission again, the original simple patch would only work for the
 very first hole. So, subsequent holes would not get re-sent, unless
 another ACK (SACK) were received - however, seeing more SACKs
 becomes less likely the more loss is experienced.
 
 This is an updated patch diff, which accounts for that case as well - but
 at the cost of O(n) time, instead of O(c).
 
 (Just check all holes from the hint backwards to the head, if any
 already fully resent hole needs to be reset, instead of only the first
 one - which might go away during subsequent processing in the original
 patch).
 
 
 
 diff -u netinet.orig/tcp_output.c netinet/tcp_output.c
 --- netinet.orig/tcp_output.c   2009-10-25 02:10:29.0 +0100
 +++ netinet/tcp_output.c        2010-04-02 16:55:14.0 +0200
 @@ -953,6 +953,10 @@
 	} else {
 		th->th_seq = htonl(p->rxmit);
 		p->rxmit += len;
 +		/* lost again detection */
 +		if (SEQ_GEQ(p->rxmit, p->end)) {
 +			p->rxmit = tp->snd_nxt;
 +		}
 		tp->sackhint.sack_bytes_rexmit += len;
 	}
 	th->th_ack = htonl(tp->rcv_nxt);
 diff -u netinet.orig/tcp_sack.c netinet.simple_mod/tcp_sack.c
 --- netinet.orig/tcp_sack.c 2009-10-25 02:10:29.0 +0100
 +++ netinet/tcp_sack.c  2010-04-21 00:48:23.0 +0200
 @@ -508,7 +508,9 @@
 		if (SEQ_GEQ(sblkp->end, cur->end)) {
 			/* Move end of hole backward. */
 			cur->end = sblkp->start;
 -			cur->rxmit = SEQ_MIN(cur->rxmit, cur->end);
 +			if (SEQ_GEQ(cur->rxmit, cur->end)) {
 +				cur->rxmit = tp->snd_nxt;
 +			}
 		} else {
 			/*
 			 * ACKs some data in middle of a hole; need
 @@ -524,8 +526,9 @@
 				    - temp->start);
 			}
 			cur->end = sblkp->start;
 -			cur->rxmit = SEQ_MIN(cur->rxmit,
 -			    cur->end);
 +			if (SEQ_GEQ(cur->rxmit, cur->end)) {
 +				cur->rxmit = tp->snd_nxt;
 +			}
 		}
 	}
 @@ -540,6 +543,15 @@
 	else
 		sblkp--;
 }
 +	/* retransmission lost again - then restart */
 +	if ((temp = tp->sackhint.nexthole) != NULL) {
 +		do {
 +			if (SEQ_GT(tp->snd_fack, temp->rxmit)) {
 +				temp->rxmit = temp->start;
 +				tp->sackhint.nexthole = temp;
 +			}
 +		} while ((temp = TAILQ_PREV(temp, sackhole_head, scblink)) != NULL);
 +	}
  }
 
  /*
 
 
 
 
 Richard Scheffenegger
 
 


Re: kern/140597 implement Lost Retransmission Detection

2010-04-02 Thread Scheffenegger, Richard
The following reply was made to PR kern/140597; it has been noted by GNATS.

From: Scheffenegger, Richard r...@netapp.com
To: bug-follo...@freebsd.org, Lawrence Stewart lastew...@swin.edu.au
Cc: Biswas, Anumita anumita.bis...@netapp.com
Subject: Re: kern/140597 implement Lost Retransmission Detection
Date: Fri, 2 Apr 2010 18:48:04 +0100

 As discussed earlier, here is a simple fix.
 
 Caveat: Doesn't work at the end of a session or when cwnd is very small
 (4 segments). Also, during heavy reordering, some spurious
 re-retransmissions might occur (but that would only affect very few
 re-retransmitted segments, as holes would still close with each
 additional received SACK, reducing the chance of spurious
 re-transmissions).
 
 Benefit: during LAN burst drop events, TCP will not revert to
 retransmission timeouts in order to recover. In a LAN, the RTO is
 typically many orders of magnitude larger than the RTT. Not relying on
 RTO whenever possible can help keep throughput up.
 
 
 Simple Patch:
 
 
 --
 diff -u netinet.orig/tcp_output.c netinet/tcp_output.c
 --- netinet.orig/tcp_output.c   2009-10-25 02:10:29.0 +0100
 +++ netinet/tcp_output.c        2010-04-02 16:55:14.0 +0200
 @@ -953,6 +953,10 @@
 	} else {
 		th->th_seq = htonl(p->rxmit);
 		p->rxmit += len;
 +		/* lost again detection */
 +		if (SEQ_GEQ(p->rxmit, p->end)) {
 +			p->rxmit = tp->snd_nxt;
 +		}
 		tp->sackhint.sack_bytes_rexmit += len;
 	}
 	th->th_ack = htonl(tp->rcv_nxt);
 diff -u netinet.orig/tcp_sack.c netinet/tcp_sack.c
 --- netinet.orig/tcp_sack.c 2009-10-25 02:10:29.0 +0100
 +++ netinet/tcp_sack.c  2010-04-02 16:46:42.0 +0200
 @@ -460,6 +460,13 @@
 	/* We must have at least one SACK hole in scoreboard. */
 	KASSERT(!TAILQ_EMPTY(&tp->snd_holes),
 	    ("SACK scoreboard must not be empty"));
 +	/* lost again - then restart */
 +	if ((temp = TAILQ_FIRST(&tp->snd_holes)) != NULL) {
 +		if (SEQ_GT(tp->snd_fack, temp->rxmit)) {
 +			temp->rxmit = temp->start;
 +			tp->sackhint.nexthole = temp;
 +		}
 +	}
 	cur = TAILQ_LAST(&tp->snd_holes, sackhole_head); /* Last SACK hole. */
 	/*
 	 * Since the incoming sack blocks are sorted, we can process them
 @@ -508,7 +515,9 @@
 		if (SEQ_GEQ(sblkp->end, cur->end)) {
 			/* Move end of hole backward. */
 			cur->end = sblkp->start;
 -			cur->rxmit = SEQ_MIN(cur->rxmit, cur->end);
 +			if (SEQ_GEQ(cur->rxmit, cur->end)) {
 +				cur->rxmit = tp->snd_nxt;
 +			}
 		} else {
 			/*
 			 * ACKs some data in middle of a hole; need
 @@ -524,8 +533,9 @@
 				    - temp->start);
 			}
 			cur->end = sblkp->start;
 -			cur->rxmit = SEQ_MIN(cur->rxmit,
 -			    cur->end);
 +			if (SEQ_GEQ(cur->rxmit, cur->end)) {
 +				cur->rxmit = tp->snd_nxt;
 +			}
 		}
 	}
 }
 --
 
 Richard Scheffenegger
 Field Escalation Engineer
 NetApp Global Support
 NetApp
 +43 1 3676811 3146 Office (2143 3146 - internal)
 +43 676 654 3146 Mobile
 www.netapp.com
 Franz-Klein-Gasse 5
 1190 Wien
 
 