On Thu, Aug 08, 2019 at 11:54:44AM -0700, xg...@reliancememory.com wrote:
> To make sure I captured all the explanations correctly, please allow me to 
> summarize my understandings:

I think your understanding is basically correct (and thanks to Jeffrey for
the detailed explanation!).  A few more remarks inline, though none
particularly actionable.

> Flow control over a high-latency, potentially congested link is a fundamental 
> challenge that both TCP and UDP+Rx face. Both the protocol and the 
> implementation can pose a problem. The reason I did not see an improvement 
> when enlarging the window size in rxperf is, first, that I chose too few data 
> bytes to transfer and, second, that OpenAFS's Rx has implementation 
> limitations that become the limiting factor before the window size limit 
> kicks in. These are non-trivial to fix, as demonstrated by the 1.5.x 
> throughput "hiccup", though AuriStor has addressed a significant amount of 
> this in its proprietary Rx re-implementation. 
> 
> One can borrow ideas and principles from algorithm research in TCP's flow 
> control to improve Rx throughput. I am not an expert on this topic, but I 
> wonder if the principles in Google's BBR algorithm could help further improve 
> Rx throughput, and whether there is anything that makes TCP fundamentally 
> superior to UDP in implementing flow control. 

Congestion control algorithms are largely independent of TCP vs. UDP, and
many TCP stacks have a "modular" congestion controller, in which one
algorithm can be swapped out for another by configuration.  The current
draft version of IETF QUIC, which is UDP-based, uses what is effectively
TCP NewReno's congestion control algorithm:
https://tools.ietf.org/html/draft-ietf-quic-recovery-22 .  In principle,
BBR could also be adapted to Rx (or QUIC), but I expect its design
philosophy to be a substantial impedance mismatch with the current
openafs Rx implementation.
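
As a concrete illustration of that modularity, here is a minimal sketch
(Linux-specific, and assuming the bbr module is available; Rx itself has
no such knob) of a TCP application picking its congestion controller on
a per-socket basis:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        const char *cc = "bbr";

        /* Swap the congestion control algorithm for this socket only;
         * other sockets keep the system-wide default
         * (net.ipv4.tcp_congestion_control). */
        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
            perror("TCP_CONGESTION (is bbr available?)");

        close(fd);
        return 0;
    }

The loss detection and window arithmetic live below that interface,
which is why the same algorithmic ideas can in principle be carried over
to a UDP-based protocol such as QUIC or Rx.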

In general, how to achieve good performance on so-called "high
bandwidth-delay product" links remains a difficult problem, with active
research efforts underway as well as engineering work in SDOs like the
IETF.

-Ben

> When it comes to deployment strategy, there may be workarounds for the 
> high-latency limitation. Each of them, of course, has its own limitations. I 
> can probably use the technique mentioned below to leverage TCP throughput for 
> RO volume synchronization, 
> https://lists.openafs.org/pipermail/openafs-info/2018-August/042502.html
> and wait until DPF becomes available in vos operations:
> https://openafs-workshop.org/2019/schedule/faster-wan-volume-operations-with-dpf/
> 
> I can also adopt a strategy of small home volumes plus distributed subfolder 
> volumes, which allows home volumes to move with relocated users across the 
> WAN while keeping subdirectory volumes at their respective geographic 
> locations. Users can pick the subdirectory closest to their current location 
> to work with. When this is combined with a version control system that uses 
> TCP for synchronization, the burden of project data syncing can be alleviated. 
> 
> There is a commercial path that we can pursue with AuriStor or other vendors, 
> but I guess that is out of scope for this mailing list. 
> 
> Any other strategies that may help?
> 
> Thank you, Jeff!
> 
> Simon Guan
> 
> 
> -----Original Message-----
> From: Jeffrey E Altman <jalt...@auristor.com> 
> Sent: Wednesday, August 7, 2019 9:01 PM
> To: xg...@reliancememory.com; openafs-info@openafs.org
> Subject: Re: [OpenAFS] iperf vs rxperf in high latency network
> 
> On 8/7/2019 9:35 PM, xg...@reliancememory.com wrote:
> > Hello,
> > 
> > Can someone kindly explain again the possible reasons why Rx is so painfully
> > slow for a high latency (~230ms) link? 
> 
> As Simon Wilkinson said on slide 5 of "RX Performance"
> 
>   https://indico.desy.de/indico/event/4756/session/2/contribution/22
> 
>   "There's only two things wrong with RX
>     * The protocol
>     * The implementation"
> 
> This presentation was given at DESY on 5 Oct 2011.  Although there have
> been some improvements in the OpenAFS RX implementation since then, the
> fundamental issues described in that presentation still remain.
> 
> To explain slides 3 and 4: prior to the 1.5.53 release, the following
> commit was merged, which increased the default maximum window size from
> 32 packets to 64 packets.
> 
>   commit 3feee9278bc8d0a22630508f3aca10835bf52866
>   Date:   Thu May 8 22:24:52 2008 +0000
> 
>     rx-retain-windowing-per-peer-20080508
> 
>     we learned about the peer in a previous connection... retain the
>     information and keep using it. widen the available window.
>     makes rx perform better over high latency wans. needs to be present
>     in both sides for maximal effect.
> 
> Then, prior to 1.5.66, this commit raised the maximum window size to 128:
> 
>   commit 310cec9933d1ff3a74bcbe716dba5ade9cc28d15
>   Date:   Tue Sep 29 05:34:30 2009 -0400
> 
>     rx window size increase
> 
>     window size was previously pushed to 64; push to 128.
> 
> and then prior to 1.5.78, which was just before the 1.6 release:
> 
>   commit a99e616d445d8b713934194ded2e23fe20777f9a
>   Date:   Thu Sep 23 17:41:47 2010 +0100
> 
>     rx: Big windows make us sad
> 
>     The commit which took our Window size to 128 caused rxperf to run
>     40 times slower than before. All of the recent rx improvements have
>     reduced this to being around 2x slower than before, but we're still
>     not ready for large window sizes.
> 
>     As 1.6 is nearing release, reset back to the old, fast, window size
>     of 32. We can revist this as further performance improvements and
>     restructuring happen on master.
> 
> After 1.6, AuriStor Inc. (then Your File System Inc.) continued to work
> on reducing the overhead of RX packet processing.  Some of the results
> were presented in Simon Wilkinson's 16 October 2012 talk entitled "AFS
> Performance", slides 25 to 30:
> 
>   http://conferences.inf.ed.ac.uk/eakc2012/
> 
> The performance of OpenAFS 1.8 RX is roughly the same as the OpenAFS
> master performance from slide 28.  The Experimental RX numbers were from
> the AuriStor RX stack of the time, which was not contributed to OpenAFS.
> 
> Since 2012, AuriStor has addressed many of the issues raised in
> the "RX Performance" presentation:
> 
>  0. Per-packet processing expense
>  1. Bogus RTT calculations
>  2. Bogus RTO implementation
>  3. Lack of congestion avoidance
>  4. Incorrect window estimation when retransmitting
>  5. Incorrect window handling during loss recovery
>  6. Lock contention
> 
> The current AuriStor RX state machine implements SACK-based loss
> recovery as documented in RFC6675, with elements of New Reno from
> RFC6582, on top of TCP-style congestion control as documented in
> RFC5681.  The new RX also implements RFC2861-style congestion window
> validation.
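
For readers who don't want to chase the RFCs, the core window arithmetic
they describe is small.  The sketch below shows the textbook RFC5681
behaviour in packet units (the units in which Rx counts its window); it
is purely illustrative and is not AuriStor's or OpenAFS's actual code:

    struct cc_state {
        unsigned int cwnd;       /* congestion window, in packets */
        unsigned int ssthresh;   /* slow start threshold, in packets */
        unsigned int acked;      /* ACK counter for congestion avoidance */
    };

    /* Called for each newly acknowledged packet. */
    static void on_ack(struct cc_state *cc, unsigned int max_window)
    {
        if (cc->cwnd < cc->ssthresh) {
            cc->cwnd++;                /* slow start: window ~doubles per RTT */
        } else if (++cc->acked >= cc->cwnd) {
            cc->acked = 0;
            cc->cwnd++;                /* congestion avoidance: +1 packet per RTT */
        }
        if (cc->cwnd > max_window)
            cc->cwnd = max_window;     /* never exceed the configured maximum */
    }

    /* Called when loss is detected, e.g. from the SACK scoreboard. */
    static void on_loss(struct cc_state *cc)
    {
        cc->ssthresh = cc->cwnd / 2 > 2 ? cc->cwnd / 2 : 2;
        cc->cwnd = cc->ssthresh;       /* an RTO would instead collapse cwnd to 1 */
    }

Items 3 to 5 in the list above are essentially about getting this
bookkeeping right during retransmission and loss recovery.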
> 
> When sending data, the RX peer implementing these changes will be more
> likely to sustain the maximum available throughput while at the same
> time improving fairness towards competing network data flows. The
> improved estimation of available pipe capacity permits an increase in
> the default maximum window size from 60 packets (84.6 KB) to 128 packets
> (180.5 KB). The larger window size increases the per-call theoretical
> maximum throughput on a 1ms RTT link from 693 mbit/sec to 1478 mbit/sec
> and on a 30ms RTT link from 23.1 mbit/sec to 49.39 mbit/sec.
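
For reference, those figures are just window size divided by round-trip
time: 84.6 KB is 86,630 bytes, so 86,630 x 8 / 0.001 s is roughly 693
mbit/sec, and 180.5 KB gives 184,832 x 8 / 0.001 s, roughly 1478
mbit/sec; dividing by 30 for a 30ms RTT yields the ~23.1 and ~49.3
mbit/sec figures.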
> 
> AuriStor RX also includes experimental support for RX windows larger
> than 255 packets (360KB). This release extends the RX flow control state
> machine to support windows larger than the Selective Acknowledgment
> table. The new maximum of 65535 packets (90MB) could theoretically fill
> a 100 gbit/second pipe provided that the packet allocator and packet
> queue management strategies could keep up.  Hint: at present, they don't.
> 
> To saturate a 60 Mbit/sec link with 230ms latency with rxmaxmtu set to
> 1344 requires a window size of approximately 1284 packets.
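
That is simply the bandwidth-delay product expressed in packets:
60 Mbit/sec x 0.230 s is 13.8 megabits, or about 1,725,000 bytes in
flight, and 1,725,000 / 1344 bytes per packet is roughly 1284 packets.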
> 
> > From a user perspective, I wonder if there is any *quick Rx code hacking*
> > that could help reduce the throughput gap of (iperf2 = 30Mb/s vs rxperf =
> > 800Kb/s) for the following specific case. 
> 
> Probably not.  AuriStor's RX is a significant re-implementation of the
> protocol, with one eye focused on backward compatibility and the other
> on the future.
> 
> > We are considering the possibility of including two hosts ~230ms RTT apart
> > as server and client. I used iperf2 and rxperf to test throughput between
> > the two. There is no other connection competing with the test. So this is
> > different from a low-latency, thread or udp buffer exhaustion scenario. 
> > 
> > iperf2's UDP test shows a bandwidth of ~30Mb/s without packet loss, though
> > some of them have been re-ordered at the receiver side. Below 5 Mb/s, the
> > receiver sees no packet re-ordering.  Above 30 Mb/s, packet loss is seen by
> > the receiver. The test result is pretty consistent at multiple time points
> > within 24 hours. The UDP buffer size used by iperf is 208 KB. The write
> > length is set at 1300 (-l 1300), which is below the path MTU. 
> 
> Out-of-order packet delivery and packet loss have a significant
> performance impact on OpenAFS RX.
> 
> > Interestingly, a quick skim through the iperf2 source code suggests that an
> > iperf sender does not wait for the receiver's ack. It simply keeps calling
> > write(mSettings->mSock, mBuf, mSettings->mBufLen) and timing it to extract
> > the numerical value for the throughput. It only checks, in the end, to see
> > if the receiver complains about packet loss. 
> 
> This is because iperf2 does not attempt any flow control, error
> recovery, or fairness.  RX calls are sequenced data flows that are
> modeled on the same principles as TCP.
> 
> > rxperf, on the other hand, only gets ~800 Kb/s. What makes it worse is that
> > it does not seem to be dependent on the window size (-W 32~255), or udpsize
> > (-u default~512*1024). I tried to re-compile rxperf with #define
> > RXPERF_BUFSIZE (1024 * 1024 * 64) instead of the original (512 * 1024). I
> > did not see a throughput improvement from going above -u 512K. Occasionally
> > some packets are re-transmitted. If I reduce -W or -u to very small values,
> > I see some penalty. 
> 
> Changing RXPERF_BUFSIZE to 64MB is not going to help when the total
> number of bytes being sent per call is 1000KB.  Given how little data is
> being sent per call and the fact that each RX call begins in "slow
> start", I suspect that your test isn't growing the window size to 32
> packets, let alone 255.
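
A rough ramp-up simulation makes the point.  The numbers below are
purely illustrative: they assume 1344-byte packets and an initial window
of 4 packets that doubles every round trip, with no losses and no
maximum-window cap, which is not exactly what Rx does.

    #include <stdio.h>

    int main(void)
    {
        int remaining = (1000 * 1024 + 1343) / 1344;  /* 762 packets per 1000KB call */
        int cwnd = 4;                                 /* assumed initial window */
        int rtts = 0;

        while (remaining > 0) {
            int sent = remaining < cwnd ? remaining : cwnd;
            remaining -= sent;
            rtts++;
            printf("RTT %2d: window %4d, sent %4d, %4d packets left\n",
                   rtts, cwnd, sent, remaining);
            cwnd *= 2;                                /* idealized slow start */
        }
        printf("call done in %d round trips (~%.1f s at 230ms RTT)\n",
               rtts, rtts * 0.23);
        return 0;
    }

Even under these generous assumptions the window only exceeds 32 packets
after four round trips and the whole call is over in about eight, so most
of the transfer happens far below any configured maximum window; Rx's
real ramp-up is, if anything, slower.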
> >
> >[snip]
> >
> > The theory goes that if I have a 32-packet recv/send window (Ack Count) with
> > a packet size of 1344 bytes and RTT = 230ms, I should expect a theoretical
> > upper bound of 32 x 8 x 1344 / 0.23 / 1000000 = 1.5 Mb/s. If the
> > AFS-implemented Rx window size (32) is really the limiting factor for the
> > throughput, then the throughput should increase when I increase the window
> > size (-W) above 32 and configure a sufficiently large kernel socket buffer
> > size.
> 
> The fact that OpenAFS RX requires large kernel socket buffers to get
> reasonable performance is a bad sign.  It means that OpenAFS RX fares
> better when packets are delivered after long queueing delays than when
> they are dropped, even though drops would permit timely congestion
> detection.
> 
> > I did not see either prediction of the theory above borne out. I wonder if
> > some light could be shed on:
> > 
> > 1. What else may be the limiting factor in my case
> 
> Not enough data is being sent per call.  iperf2 is sending 30MB while
> rxperf is sending 1000KB.  It's not an equivalent comparison.
> 
> > 2. If there is a quick way to increase recv/send window from 32 to 255 in Rx
> > code without breaking other parts of AFS. 
> 
> As shown in the commits specified above, it doesn't take much to
> increase the default maximum window size.  However, performance is
> unlikely to increase unless the root causes are addressed.
> 
> > 3. If there is any quick (maybe dirty) way to leverage the iperf2
> > observation and relax the wait for acks as long as the received packets
> > arrive in order and are not lost (that is, to get me up to 5 Mb/s...)
> 
> Not without further violating the TCP Fairness principle.
> 
> > Thank you in advance.
> > ==========================
> > Ximeng (Simon) Guan, Ph.D.
> > Director of Device Technology
> > Reliance Memory
> > ==========================
> 
> Jeffrey Altman
> 
> > 
> 
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
