Throughout the history of AFS there has been recognition that growing
the Rx window size is necessary to increase throughput on high-latency
or fat pipes, where the meanings of "high-latency" and "fat" have
changed over time as networks have become faster.  The maximum window
sizes were increased in both IBM AFS 3.4 and 3.5, resulting in the
current default OpenAFS Rx window size of 32 packets (44KB).  Prior to
the release of OpenAFS 1.6, there were efforts to grow the default Rx
window size to 64 packets (88KB) in May 2008 and then to 128 packets
(176KB) in Sept 2009 with the expectation that there would be an
increase in throughput.   These changes were reverted in Sept 2010 after
the late Andrei Maslennikov presented his findings in Pilsen that
OpenAFS 1.5.77 was 50-60% slower than 1.4.12.

At DESY in 2011 Simon Wilkinson presented his findings and the
improvements that were subsequently made to OpenAFS Rx to slightly
improve the situation.  Simon said at the time, "There's only two things
wrong with RX: the protocol and the implementation".  To sustain a
10 Gbit/second flow, Rx needs to consistently process 175,000 DATA
packets/second as well as the matching ACK packets.  That requires not
only highly efficient packet processing but also the ability to
maintain a full network pipe instead of stalling each time the DATA
sender has filled the peer's advertised receive window.

Over the last decade AuriStor has continued to invest in its Rx
implementation in order to reduce the costs associated with DATA and ACK
packet processing, more effectively measure the pipe's congestion
window, more efficiently recover from packet loss, and improve
fairness.   These efforts have paid off in that AuriStor has been able
to increase the default window size to 60 packets (82KB) in 2014, 128
packets (176KB) in 2018, and 255 packets (351KB) in 2021.

One of the reasons that filesystems such as Lustre and GPFS can achieve
high throughput is that they support TCP window sizes of 8MB or
larger.  In order for AFS to match their performance, Rx needs to
support window sizes on the order of 6000 packets.  The ACK packet's
receiveWindow field has ample room to advertise larger window sizes as
it is an unsigned 32-bit integer.  In 2018 AuriStor removed the
requirement that the maximum window size be limited to the number of
packets that can be represented in the ACK packet's Selective
Acknowledgement (SACK) table.  There is TCP research that describes how
to perform congestion avoidance when the SACK provides limited
visibility into the state of the in-flight packets.  However, it is
always preferable to have access to SACK data for all of the in-flight
packets.
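
The packet counts quoted above follow from simple arithmetic.  A minimal
sketch, assuming each DATA packet carries 1408 bytes of payload (a value
inferred from the "32 packets (44KB)" figure earlier in this message);
the function names and the 10 Gbit/second, 5 ms example path are
illustrative only:

```python
import math

# Assumption: payload bytes per DATA packet, inferred from 32 packets = 44KB.
PAYLOAD_BYTES = 44 * 1024 // 32


def packets_for_window(window_bytes: int) -> int:
    """Number of DATA packets needed to cover a window of window_bytes."""
    return math.ceil(window_bytes / PAYLOAD_BYTES)


def window_for_path(bits_per_second: float, rtt_seconds: float) -> int:
    """Packets in flight needed to keep a path full (bandwidth-delay product)."""
    bdp_bytes = math.ceil(bits_per_second / 8 * rtt_seconds)
    return packets_for_window(bdp_bytes)


# An 8MB window, as supported by the TCP-based filesystems mentioned above.
print(packets_for_window(8 * 1024 * 1024))

# Hypothetical path: 10 Gbit/second with a 5 ms round-trip time.
print(window_for_path(10e9, 0.005))
```

Under that payload assumption an 8MB window works out to just under
6000 packets, which is where the figure above comes from.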

AuriStor is therefore proposing a backward compatible protocol extension
which will permit incrementally growing the ACK packet's SACK table and
will address two other design weaknesses in the ACK packet: the
inconsistent use of the 'previousPacket' field, which makes it unusable,
and the lack of a count for the number of ACK trailer fields.

There are three commits in OpenAFS Gerrit.  

"rx: compare RX_ACK_TYPE_ACK as a bit-field"
https://gerrit.openafs.org/#/c/14465/ is a code change that ensures that
OpenAFS Rx will only examine Bit-0 of each SACK table element.   This
permits Bit-1 through Bit-7 of each SACK element to be defined for
future use when the rx_maxWindow is increased above 255 packets.  
AuriStor Rx already implements this behavior.
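
The difference between the two comparisons can be sketched as follows.
This is not the code from the Gerrit change; it assumes the usual Rx
values RX_ACK_TYPE_NACK = 0 and RX_ACK_TYPE_ACK = 1, and the function
names are hypothetical:

```python
# Assumed values of the Rx constants.
RX_ACK_TYPE_NACK = 0
RX_ACK_TYPE_ACK = 1


def is_acked_equality(octet: int) -> bool:
    """Whole-octet comparison: any use of Bit-1..Bit-7 reads as a NACK."""
    return octet == RX_ACK_TYPE_ACK


def is_acked_bitfield(octet: int) -> bool:
    """Bit-field comparison: only Bit-0 decides, other bits are ignored."""
    return (octet & RX_ACK_TYPE_ACK) != 0


# A SACK element with Bit-0 and Bit-1 both set, as a future sender
# might produce once the higher bits are given a meaning.
element = 0b00000011
print(is_acked_equality(element), is_acked_bitfield(element))
```

With the whole-octet comparison a receiver would misread such an
element as a NACK; with the bit-field comparison it still sees the ACK
in Bit-0, which is what makes the higher bits safe to define later.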

"doc: rx-spec Update for accuracy with current Rx implementations"
https://gerrit.openafs.org/#/c/14692/2 is an update to Nickolai
Zeldovich's Rx Specification.  I hope it improves the description of the
protocol, corrects a number of misconceptions, and explains how the
protocol should be used.  The Historical Implementation Notes section is
particularly important in the context of ACK packet processing and
possible extensions.

"doc: rx-spec Document the Extended SACK Table protocol extension"
https://gerrit.openafs.org/#/c/14693/2 describes the proposed
EXTENDED-SACK ACK packet protocol extension which defines ACK packet
Flags Bit-3 as EXTENDED-SACK when set in an ACK packet; Bit-3 currently
only has meaning for DATA packets (MORE-PACKETS).   When the
EXTENDED-SACK flag is set the following is true:

  * The previousPacket field must be the largest DATA packet sequence
    number accepted by the peer.  This allows
    (previousPacket - firstPacket + 1) to represent the number of DATA
    packets that should be represented in SACK tables.

  * The SACK table can grow up to 256 octets instead of 255 octets by
    leveraging one of the three unused octets between the SACK and the
    first trailer.

  * The SACK table can represent the ACK/NACK state for up to 2048 DATA
    packets using horizontal striping.

  * The second unused octet between the SACK and the first trailer is
    used for a count of the number of unsigned 32-bit trailer fields.
    This will permit future extensibility.  The current value for this
    field is 4.

  * The third unused octet is a count of the number of additional SACK
    tables which are appended after the final trailer field.  Each SACK
    is variable length and can grow up to 256 octets representing up to
    2048 DATA packets.
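
To illustrate how 256 octets can describe 2048 packets, here is a
minimal sketch of one possible reading of "horizontal striping".  It
assumes that bit b of octet o describes the packet at offset
(b * 256 + o) from firstPacket, so that Bit-0 of each octet keeps its
existing meaning.  That layout, the function names, and the omission of
32-bit sequence number wraparound are all assumptions made for
illustration; the authoritative definition is the text in Gerrit 14693.

```python
STRIPE_OCTETS = 256   # maximum length of one SACK table
BITS_PER_OCTET = 8    # one stripe per bit, so 8 * 256 = 2048 packets


def sack_entry_count(first_packet: int, previous_packet: int) -> int:
    """Number of DATA packets the SACK must describe."""
    return previous_packet - first_packet + 1


def encode_sack(first_packet: int, previous_packet: int, received) -> bytes:
    """Build one striped SACK table from the set of received sequence numbers."""
    count = sack_entry_count(first_packet, previous_packet)
    if not 0 <= count <= STRIPE_OCTETS * BITS_PER_OCTET:
        raise ValueError("window does not fit in a single SACK table")
    table = bytearray(min(count, STRIPE_OCTETS))
    for seq in received:
        index = seq - first_packet
        if 0 <= index < count:
            # Assumed layout: octet = offset within stripe, bit = stripe number.
            table[index % STRIPE_OCTETS] |= 1 << (index // STRIPE_OCTETS)
    return bytes(table)


def is_acked(table: bytes, first_packet: int, seq: int) -> bool:
    """Look up the ACK/NACK state of one sequence number."""
    index = seq - first_packet
    octet, bit = index % STRIPE_OCTETS, index // STRIPE_OCTETS
    if index < 0 or bit >= BITS_PER_OCTET or octet >= len(table):
        return False
    return bool((table[octet] >> bit) & 1)


# 600 packets in flight starting at sequence 1000; three were received.
table = encode_sack(1000, 1599, {1000, 1256, 1599})
print(len(table), is_acked(table, 1000, 1256), is_acked(table, 1000, 1001))
```

Under this layout a receiver that examines only Bit-0 of each element
still sees a correct view of the first 256 packets in the window.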

With these changes up to 2048 DATA packets can be represented by an ACK
packet that fits within the minimum IPv4 MTU size and up to 8192 DATA
packets can be represented by an ACK packet that fits within the minimum
IPv6 MTU size.  Larger window sizes can be represented with larger ACK
packets, but 8192 DATA packets is 11MB, which should be more than
sufficient for now.
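
The MTU figures can be checked with a rough size calculation.  The
sketch below rests on several assumptions that are not stated in this
message: a 28-byte Rx header, 18 bytes of fixed ACK fields ahead of the
SACK, four 4-byte trailers, an 8-byte UDP header, IP headers without
options or extension headers (20 bytes for IPv4, 40 for IPv6), and
minimum MTU values of 576 octets for IPv4 and 1280 for IPv6.  The names
are illustrative only.

```python
# All sizes in octets; see the assumptions listed in the text above.
UDP_HEADER = 8
RX_HEADER = 28
ACK_FIXED_FIELDS = 18      # fields preceding the SACK table
SACK_TABLE_MAX = 256
COUNT_OCTETS = 2           # trailer count and extra SACK table count
TRAILER_FIELDS = 4 * 4     # four unsigned 32-bit trailers
PACKETS_PER_TABLE = 2048


def ack_datagram_size(sack_tables: int, ip_header: int) -> int:
    """IP datagram size of an ACK carrying sack_tables full SACK tables."""
    return (ip_header + UDP_HEADER + RX_HEADER + ACK_FIXED_FIELDS
            + COUNT_OCTETS + TRAILER_FIELDS + sack_tables * SACK_TABLE_MAX)


def max_sack_tables(mtu: int, ip_header: int) -> int:
    """Largest number of full SACK tables that fits within the MTU."""
    tables = 0
    while ack_datagram_size(tables + 1, ip_header) <= mtu:
        tables += 1
    return tables


for name, mtu, ip_header in (("IPv4", 576, 20), ("IPv6", 1280, 40)):
    tables = max_sack_tables(mtu, ip_header)
    print(name, tables, tables * PACKETS_PER_TABLE)
```

Under these assumptions one full table fits within the IPv4 limit and
four fit within the IPv6 limit, consistent with the 2048 and 8192
packet figures above.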

Even though it is unlikely that OpenAFS Rx will be able to increase the
default window sizes to benefit from these changes in the near term,
there are still benefits to OpenAFS Rx implementing the EXTENDED-SACK
flag and its associated meanings of previousPacket and the unused
octets.  As documented in Gerrit 14692, the prior usage of
previousPacket makes the field unusable as a means of detecting
out-of-sequence ACK packets and of obtaining an accurate view of the
leading edge of the in-flight window that has been received by the
peer.  The trailer and extra SACK counts provide much needed clarity
about the ACK packet size before Path MTU discovery padding.

AuriStor has implemented the EXTENDED-SACK proposal with up to one extra
SACK table or 4096 DATA packets (5.5MB).   With these changes AuriStor
is prepared to ship a default window size of 4096 in our September 2021
release provided that there is review from and consensus with the
OpenAFS community.

Your review and feedback will be appreciated.  AuriStor is prepared to
make changes as needed.

Sincerely,

Jeffrey Altman


