On Thu, Aug 08, 2019 at 11:54:44AM -0700, xg...@reliancememory.com wrote:
> To make sure I captured all the explanations correctly, please allow me to
> summarize my understandings:
I think your understanding is basically correct (and thanks to Jeffrey for
the detailed explanation!).  A few more remarks inline, though none
particularly actionable.

> Flow control over a high-latency, potentially congested link is a fundamental
> challenge that both TCP and UDP+Rx face. Both protocol and implementation can
> pose a problem. The reason why I did not see an improvement when enlarging
> the window size in rxperf is that firstly I chose too few data bytes to
> transfer and secondly that OpenAFS's Rx has some implementation limitations
> that become a limiting factor before the window size limit kicks in. They are
> non-trivial to fix, as demonstrated in the 1.5.x throughput "hiccup". But
> AuriStor fixed a significant amount of it in its proprietary Rx
> re-implementation.
>
> One can borrow ideas and principles from algorithm research in TCP's flow
> control to improve Rx throughput. I am not an expert on this topic, but I
> wonder if the principles in Google's BBR algorithm can help further improve
> Rx throughput, and I wonder if there is anything that makes TCP fundamentally
> superior to UDP in implementing flow control.

Congestion control algorithms are largely independent of TCP vs. UDP, and
many TCP stacks have a "modular" congestion controller, in which one
algorithm can be swapped out for another by configuration.  The current
draft version of IETF QUIC, which is UDP-based, is using what is effectively
TCP NewReno's congestion control algorithm:
https://tools.ietf.org/html/draft-ietf-quic-recovery-22 .  In principle, BBR
could also be adapted to Rx (or QUIC), but I expect the design philosophy to
have substantial impedance mismatch for the current openafs Rx
implementation.

In general, how to achieve good performance on the so-called "high
bandwidth-delay product" links remains a difficult problem, with active
research efforts underway, as well as engineering work in SDOs like the IETF.
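To illustrate how a congestion controller is logically separable from the
transport beneath it, here is a minimal toy sketch of a NewReno-style window
update in Python.  The constants (initial window, slow start threshold) are
illustrative assumptions, not values taken from any real stack:

```python
class NewRenoLikeCwnd:
    """Toy NewReno-style congestion window, independent of TCP vs. UDP.

    Units are packets.  Real stacks work in bytes and add many details
    (RTO handling, SACK-based recovery, pacing); this only shows the
    slow start / congestion avoidance / multiplicative decrease shape.
    """

    def __init__(self, initial_cwnd=4, ssthresh=64):
        self.cwnd = initial_cwnd   # congestion window, in packets
        self.ssthresh = ssthresh   # slow start threshold

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += 1              # slow start: +1 per ACK (doubles per RTT)
        else:
            self.cwnd += 1.0 / self.cwnd  # congestion avoidance: ~+1 per RTT

    def on_loss(self):
        self.ssthresh = max(self.cwnd / 2, 2)  # halve on detected loss
        self.cwnd = self.ssthresh
```

The same update rules could in principle sit on top of any sequenced
datagram flow (an Rx call, a QUIC stream), which is the sense in which the
algorithm is independent of the transport.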
-Ben

> When it comes to deployment strategy, there may be workarounds to the
> high-latency limitation. Each of them, of course, has limitations. I can
> probably use the technique mentioned below to leverage the TCP throughput in
> RO volume synchronization,
> https://lists.openafs.org/pipermail/openafs-info/2018-August/042502.html
> and wait until DPF becomes available in vos operations:
> https://openafs-workshop.org/2019/schedule/faster-wan-volume-operations-with-dpf/
>
> I can also adopt a small home volume, distributed subfolder volume strategy
> that allows home volumes to move with relocated users across WAN, but keep
> subdirectory volumes at their respective geographic location. Users can pick
> a subdirectory that is closest to their current location to work with. When
> combined with a version control system that uses TCP in syncing, project data
> synching can be alleviated.
>
> There is a commercial path that we can pursue with AuriStor or other vendors.
> But I guess that is out of the scope of this mail list.
>
> Any other strategies that may help?
>
> Thank you, Jeff!
>
> Simon Guan
>
>
> -----Original Message-----
> From: Jeffrey E Altman <jalt...@auristor.com>
> Sent: Wednesday, August 7, 2019 9:01 PM
> To: xg...@reliancememory.com; openafs-info@openafs.org
> Subject: Re: [OpenAFS] iperf vs rxperf in high latency network
>
> On 8/7/2019 9:35 PM, xg...@reliancememory.com wrote:
> > Hello,
> >
> > Can someone kindly explain again the possible reasons why Rx is so painfully
> > slow for a high latency (~230ms) link?
>
> As Simon Wilkinson said on slide 5 of "RX Performance"
>
>   https://indico.desy.de/indico/event/4756/session/2/contribution/22
>
> "There's only two things wrong with RX
>  * The protocol
>  * The implementation"
>
> This presentation was given at DESY on 5 Oct 2011. Although there have
> been some improvements in the OpenAFS RX implementation since then, the
> fundamental issues described in that presentation still remain.
>
> To explain slides 3 and 4. Prior to the 1.5.53 release the following
> commit was merged which increased the default maximum window size from
> 32 packets to 64 packets.
>
>   commit 3feee9278bc8d0a22630508f3aca10835bf52866
>   Date:   Thu May 8 22:24:52 2008 +0000
>
>       rx-retain-windowing-per-peer-20080508
>
>       we learned about the peer in a previous connection... retain the
>       information and keep using it. widen the available window.
>       makes rx perform better over high latency wans. needs to be present
>       in both sides for maximal effect.
>
> Then prior to 1.5.66 this commit raised the maximum window size to 128
>
>   commit 310cec9933d1ff3a74bcbe716dba5ade9cc28d15
>   Date:   Tue Sep 29 05:34:30 2009 -0400
>
>       rx window size increase
>
>       window size was previously pushed to 64; push to 128.
>
> and then prior to 1.5.78, which was just before the 1.6 release:
>
>   commit a99e616d445d8b713934194ded2e23fe20777f9a
>   Date:   Thu Sep 23 17:41:47 2010 +0100
>
>       rx: Big windows make us sad
>
>       The commit which took our Window size to 128 caused rxperf to run
>       40 times slower than before. All of the recent rx improvements have
>       reduced this to being around 2x slower than before, but we're still
>       not ready for large window sizes.
>
>       As 1.6 is nearing release, reset back to the old, fast, window size
>       of 32. We can revist this as further performance improvements and
>       restructuring happen on master.
>
> After 1.6 AuriStor Inc. (then Your File System Inc.) continued to work
> on reducing the overhead of RX packet processing. Some of the results
> were presented in Simon Wilkinson's 16 October 2012 talk entitled "AFS
> Performance", slides 25 to 30
>
>   http://conferences.inf.ed.ac.uk/eakc2012/
>
> The performance of OpenAFS 1.8 RX is roughly the same as the OpenAFS
> master performance from slide 28. The Experimental RX numbers were the
> AuriStor RX stack at the time, which was not contributed to OpenAFS.
>
> Since 2012 AuriStor has addressed many of the issues raised in
> the "RX Performance" presentation
>
>  0. Per-packet processing expense
>  1. Bogus RTT calculations
>  2. Bogus RTO implementation
>  3. Lack of congestion avoidance
>  4. Incorrect window estimation when retransmitting
>  5. Incorrect window handling during loss recovery
>  6. Lock contention
>
> The current AuriStor RX state machine implements SACK-based loss
> recovery as documented in RFC6675, with elements of New Reno from
> RFC5682 on top of TCP-style congestion control elements as documented in
> RFC5681. The new RX also implements RFC2861-style congestion window
> validation.
>
> When sending data the RX peer implementing these changes will be more
> likely to sustain the maximum available throughput while at the same
> time improving fairness towards competing network data flows. The
> improved estimation of available pipe capacity permits an increase in
> the default maximum window size from 60 packets (84.6 KB) to 128 packets
> (180.5 KB). The larger window size increases the per call theoretical
> maximum throughput on a 1ms RTT link from 693 mbit/sec to 1478 mbit/sec
> and on a 30ms RTT link from 23.1 mbit/sec to 49.39 mbit/sec.
>
> AuriStor RX also includes experimental support for RX windows larger
> than 255 packets (360KB). This release extends the RX flow control state
> machine to support windows larger than the Selective Acknowledgment
> table. The new maximum of 65535 packets (90MB) could theoretically fill
> a 100 gbit/second pipe provided that the packet allocator and packet
> queue management strategies could keep up. Hint: at present, they don't.
>
> To saturate a 60 Mbit/sec link with 230ms latency with rxmaxmtu set to
> 1344 requires a window size of approximately 1284 packets.
>
> > From a user perspective, I wonder if there is any *quick Rx code hacking*
> > that could help reduce the throughput gap of (iperf2 = 30Mb/s vs rxperf =
> > 800Kb/s) for the following specific case.
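The window/RTT arithmetic quoted above is easy to check: a window-limited
sender can move at most one full window per round trip.  A few lines of
Python reproduce Jeffrey's figures (this is only the window-limited ceiling,
ignoring slow start and loss):

```python
def window_limited_throughput_mbit(window_packets, packet_bytes, rtt_seconds):
    """Per-call throughput ceiling when limited by the send window:
    one full window of packets per round trip."""
    return window_packets * packet_bytes * 8 / rtt_seconds / 1e6

def window_for_rate(rate_mbit, packet_bytes, rtt_seconds):
    """Window (in packets) needed to sustain rate_mbit over the given RTT."""
    return rate_mbit * 1e6 * rtt_seconds / (packet_bytes * 8)

# 128-packet window (180.5 KB, i.e. 1444-byte packets) over a 1 ms RTT link:
# ~1478 mbit/sec, matching the figure quoted above.
print(window_limited_throughput_mbit(128, 180.5 * 1024 / 128, 0.001))

# Saturating 60 Mbit/sec at 230 ms RTT with 1344-byte packets:
# ~1284 packets, matching the figure quoted above.
print(window_for_rate(60, 1344, 0.23))
```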
>
> Probably not. AuriStor's RX is a significant re-implementation of the
> protocol with one eye focused on backward compatibility and the other on
> the future.
>
> > We are considering the possibility of including two hosts ~230ms RTT apart
> > as server and client. I used iperf2 and rxperf to test throughput between
> > the two. There is no other connection competing with the test. So this is
> > different from a low-latency, thread or udp buffer exhaustion scenario.
> >
> > iperf2's UDP test shows a bandwidth of ~30Mb/s without packet loss, though
> > some of them have been re-ordered at the receiver side. Below 5 Mb/s, the
> > receiver sees no packet re-ordering. Above 30 Mb/s, packet loss is seen by
> > the receiver. Test result is pretty consistent at multiple time points
> > within 24 hours. UDP buffer size used by iperf is 208 KB. Write length is
> > set at 1300 (-l 1300), which is below the path MTU.
>
> Out of order packet delivery and packet loss have significant
> performance impacts on OpenAFS RX.
>
> > Interestingly, a quick skim through the iperf2 source code suggests that an
> > iperf sender does not wait for the receiver's ack. It simply keeps calling
> > write(mSettings->mSock, mBuf, mSettings->mBufLen) and timing it to extract
> > the numerical value for the throughput. It only checks, in the end, to see
> > if the receiver complains about packet loss.
>
> This is because iperf2 is not attempting to perform any flow control or
> error recovery, and has no fairness model. RX calls are sequenced data
> flows that are modeled on the same principles as TCP.
>
> > rxperf, on the other hand, only gets ~800 Kb/s. What makes it worse is that
> > it does not seem to be dependent on the window size (-W 32~255), or udpsize
> > (-u default~512*1024). I tried to re-compile rxperf that has #define
> > RXPERF_BUFSIZE (1024 * 1024 * 64) instead of the original (512 * 1024). I
> > did not see a throughput improvement from going above -u 512K.
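The iperf2 sender behavior described above (fire datagrams in a loop, never
wait for any acknowledgment) can be sketched in a few lines of Python.  This
is an illustrative toy over loopback, not iperf2's actual code; the payload
size and datagram count are arbitrary:

```python
import socket
import time

def udp_blast(payload_len=1300, count=50):
    """Send `count` datagrams back-to-back with no acknowledgments, then
    report what arrived -- the iperf2-style unacknowledged send loop."""
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv.bind(("127.0.0.1", 0))      # receiver on an ephemeral loopback port
    recv.settimeout(1.0)
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    buf = b"x" * payload_len
    start = time.monotonic()
    for _ in range(count):
        send.sendto(buf, recv.getsockname())   # never waits for an ack
    elapsed = time.monotonic() - start

    # Only the receiver knows about loss or reordering; the sender's
    # timing alone says nothing about either.
    got = 0
    try:
        for _ in range(count):
            recv.recv(payload_len)
            got += 1
    except socket.timeout:
        pass
    send.close()
    recv.close()
    return got, elapsed

received, elapsed = udp_blast()
print(f"received {received} of 50 datagrams; send loop took {elapsed * 1e3:.2f} ms")
```

The send loop finishes as fast as the kernel will accept the datagrams,
which is why this style of test measures raw link capacity rather than
anything a flow-controlled protocol like RX could sustain.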
> > Occasionally some packets are re-transmitted. If I reduce -W or -u to
> > very small values, I see some penalty.
>
> Changing RXPERF_BUFSIZE to 64MB is not going to help when the total
> bytes being sent per call is 1000KB. Given how little data is being
> sent per call and the fact that each RX call begins in "slow start", I
> suspect that your test isn't growing the window size to 32 packets let
> alone 255.
>
> > [snip]
> >
> > The theory goes if I have a 32-packet recv/send window (Ack Count) with 1344
> > bytes of packet size and RTT=230ms, I should expect a theoretical upper
> > bound of 32 x 8 x 1344 / 0.23 / 1000000 = 1.5 Mb/s. If the AFS-implemented
> > Rx window size (32) is really the limiting factor of the throughput, then
> > the throughput should increase when I increase the window size (-w) above 32
> > and configure a sufficiently big kernel socket buffer size.
>
> The fact that OpenAFS RX requires large kernel socket buffers to get
> reasonable performance is a bad indication. It means that for OpenAFS RX
> it is better to deliver packets with long delays than to drop them and
> permit timely congestion detection.
>
> > I did not see either of the predictions by the theory above. I wonder if
> > some light could be shed on:
> >
> > 1. What else may be the limiting factor in my case
>
> Not enough data is being sent per call. 30MB are being sent by iperf2
> and rxperf is sending 1000KB. It's not an equivalent comparison.
>
> > 2. If there is a quick way to increase recv/send window from 32 to 255 in Rx
> > code without breaking other parts of AFS.
>
> As shown in the commits specified above, it doesn't take much to
> increase the default maximum window size. However, performance is
> unlikely to increase unless the root causes are addressed.
>
> > 3. If there is any quick (maybe dirty) way to leverage the iperf2
> > observation, relax the wait for ack as long as the received packets are in
> > order and not lost (that is, get me up to 5Mb/s...)
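The slow-start point can be made concrete with a toy calculation.  Assuming,
purely for illustration, an initial window of 4 packets that doubles each
RTT up to a 32-packet cap (these constants are hypothetical, not Rx's actual
parameters), a 1000 KB call over a 230 ms RTT spends nearly all of its time
window-limited, which lands in the right neighborhood of the observed
sub-5Mb/s rates:

```python
import math

def slow_start_transfer_time(total_bytes, packet_bytes=1344, rtt=0.23,
                             initial_window=4, max_window=32):
    """Toy model: each RTT the sender emits one window of packets, and the
    window doubles per RTT (slow start) until it reaches max_window."""
    packets = math.ceil(total_bytes / packet_bytes)
    window, sent, rtts = initial_window, 0, 0
    while sent < packets:
        sent += window
        rtts += 1
        window = min(window * 2, max_window)
    return rtts * rtt

t = slow_start_transfer_time(1000 * 1024)
rate_mbit = 1000 * 1024 * 8 / t / 1e6
print(f"{t:.2f} s for 1000 KB -> ~{rate_mbit:.2f} Mbit/s")
```

Under these assumptions the 1000 KB transfer takes 26 round trips, or about
6 seconds, for roughly 1.4 Mbit/s; any loss-triggered window reduction, or a
smaller initial window, pushes the rate down further toward the observed
800 Kb/s.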
>
> Not without further violating the TCP fairness principle.
>
> > Thank you in advance.
> > ==========================
> > Ximeng (Simon) Guan, Ph.D.
> > Director of Device Technology
> > Reliance Memory
> > ==========================
>
> Jeffrey Altman
>
>
> _______________________________________________
> OpenAFS-info mailing list
> OpenAFS-info@openafs.org
> https://lists.openafs.org/mailman/listinfo/openafs-info