Have you tried an MTU of 9000 bytes or larger (a.k.a. jumbo frames) on the 25G Ethernet NICs and on the switch? If it is left at 1500 bytes, the Ethernet + IP + TCP headers consume a noticeable fraction of every packet, reducing the bandwidth available for data.
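For example, a minimal check along these lines (assuming the interface is ens3, as in the tuning commands quoted below, and that the switch ports also permit jumbo frames end to end; the peer address 10.0.3.6 is taken from the quoted test command):

ip link show ens3 | grep mtu       # confirm the current MTU on both nodes
ip link set dev ens3 mtu 9000      # set jumbo frames on the NIC
# 8972 = 9000 - 20 (IP header) - 8 (ICMP header); -M do forbids fragmentation,
# so a successful reply means 9000-byte frames pass the whole path.
ping -M do -s 8972 -c 3 10.0.3.6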
On Thu, Nov 28, 2019 at 3:44 AM, Pinkesh Valdria <pinkesh.vald...@oracle.com> wrote:

> Thanks Andreas for your response.
>
> I ran another LNet self-test with 48 concurrent processes, since the nodes
> have 52 physical cores, and I achieved the same throughput (2052.71 MiB/s
> = 2152 MB/s).
>
> Is it expected to lose almost 600 MB/s (2750 - 2150 = 600) to overhead on
> Ethernet with LNet?
>
> Thanks,
> Pinkesh Valdria
> Oracle Cloud Infrastructure
>
> From: Andreas Dilger <adil...@whamcloud.com>
> Date: Wednesday, November 27, 2019 at 1:25 AM
> To: Pinkesh Valdria <pinkesh.vald...@oracle.com>
> Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
> Subject: Re: [lustre-discuss] Lnet Self Test
>
> The first thing to note is that lst reports results in binary units
> (MiB/s) while iperf reports results in decimal units (Gbps). If you do
> the conversion you get 2055.31 MiB/s = 2155 MB/s.
>
> The other thing to check is the CPU usage. For TCP the CPU usage can
> be high. You should try RoCE+o2iblnd instead.
>
> Cheers, Andreas
>
> On Nov 26, 2019, at 21:26, Pinkesh Valdria <pinkesh.vald...@oracle.com> wrote:
>
> Hello All,
>
> I created a new Lustre cluster on CentOS 7.6 and I am running
> lnet_selftest_wrapper.sh to measure throughput on the network. The nodes
> are connected to each other over 25 Gbps Ethernet, so the theoretical max
> is 25 Gbps * 125 = 3125 MB/s. Using iperf3, I get 22 Gbps (2750 MB/s)
> between the nodes.
>
> [root@lustre-client-2 ~]# for c in 1 2 4 8 12 16 20 24 ; do echo $c ; ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S) CN=$c SZ=1M TM=30 BRW=write CKSUM=simple LFROM="10.0.3.7@tcp1" LTO="10.0.3.6@tcp1" /root/lnet_selftest_wrapper.sh; done ;
>
> When I run lnet_selftest_wrapper.sh (from the Lustre wiki,
> http://wiki.lustre.org/LNET_Selftest) between 2 nodes, I get a max of
> 2055.31 MiB/s. Is that expected at the LNet level? Or can I further tune
> the network and OS kernel (the tuning I applied is below) to get better
> throughput?
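(A quick way to double-check the MiB/s vs. MB/s conversion described above, with 1 MiB = 1,048,576 bytes and 1 MB = 1,000,000 bytes; bc is used purely for illustration and the figures come from the quoted results:)

echo "2055.31 * 1048576 / 1000000" | bc -l   # 2055.31 MiB/s is roughly 2155 MB/s
echo "2750 / 1.048576" | bc -l               # iperf3's 2750 MB/s is roughly 2622 MiB/s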
> Result Snippet from lnet_selftest_wrapper.sh:
>
> [LNet Rates of lfrom]
> [R] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s
> [W] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s
> [LNet Bandwidth of lfrom]
> [R] Avg: 0.31 MiB/s Min: 0.31 MiB/s Max: 0.31 MiB/s
> [W] Avg: 2055.30 MiB/s Min: 2055.30 MiB/s Max: 2055.30 MiB/s
> [LNet Rates of lto]
> [R] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s
> [W] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s
> [LNet Bandwidth of lto]
> [R] Avg: 2055.31 MiB/s Min: 2055.31 MiB/s Max: 2055.31 MiB/s
> [W] Avg: 0.32 MiB/s Min: 0.32 MiB/s Max: 0.32 MiB/s
>
> Tuning applied:
>
> Ethernet NICs:
> ip link set dev ens3 mtu 9000
> ethtool -G ens3 rx 2047 tx 2047 rx-jumbo 8191
>
> less /etc/sysctl.conf
> net.core.wmem_max=16777216
> net.core.rmem_max=16777216
> net.core.wmem_default=16777216
> net.core.rmem_default=16777216
> net.core.optmem_max=16777216
> net.core.netdev_max_backlog=27000
> kernel.sysrq=1
> kernel.shmmax=18446744073692774399
> net.core.somaxconn=8192
> net.ipv4.tcp_adv_win_scale=2
> net.ipv4.tcp_low_latency=1
> net.ipv4.tcp_rmem = 212992 87380 16777216
> net.ipv4.tcp_sack = 1
> net.ipv4.tcp_timestamps = 1
> net.ipv4.tcp_window_scaling = 1
> net.ipv4.tcp_wmem = 212992 65536 16777216
> vm.min_free_kbytes = 65536
> net.ipv4.tcp_congestion_control = cubic
> net.ipv4.tcp_timestamps = 0
> net.ipv4.tcp_congestion_control = htcp
> net.ipv4.tcp_no_metrics_save = 0
>
> echo "#
> # tuned configuration
> #
> [main]
> summary=Broadly applicable tuning that provides excellent performance across a variety of common server workloads
>
> [disk]
> devices=!dm-*, !sda1, !sda2, !sda3
> readahead=>4096
>
> [cpu]
> force_latency=1
> governor=performance
> energy_perf_bias=performance
> min_perf_pct=100
>
> [vm]
> transparent_huge_pages=never
>
> [sysctl]
> kernel.sched_min_granularity_ns = 10000000
> kernel.sched_wakeup_granularity_ns = 15000000
> vm.dirty_ratio = 30
> vm.dirty_background_ratio = 10
> vm.swappiness=30
> " > lustre-performance/tuned.conf
>
> tuned-adm profile lustre-performance
>
> Thanks,
> Pinkesh Valdria

--
Jongwoo Han
+82-505-227-6108
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org