Thanks @Moreno Diego (ID SIS) for the detailed response to my email.   It gave 
me a lot of options for further tuning my cluster.   I have yet to apply those 
changes, but I thought I would share the changes I plan to make.  I also have some 
follow-up questions, to make sure the changes I am thinking of applying 
collectively make sense and do not conflict with each other.  

 
On lnet:
Before

/usr/sbin/lnetctl net add --net tcp1 --if eno2 --peer-timeout 180 --peer-credits 8 --credits 1024

After

/usr/sbin/lnetctl net add --net tcp1 --if eno2 --peer-timeout 180 --peer-credits 128 --credits 1024 --peer-buffer-credits 0
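
For persistence across reboots, the same tunables can also go in /etc/lnet.conf. This is only a sketch of the expected YAML, assuming the same tcp1/eno2 names; the authoritative layout for a running system comes from `lnetctl export`:

```yaml
net:
    - net type: tcp1
      local NI(s):
        - interfaces:
              0: eno2
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
```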

 

 
Do you have an example of how to set the PCI config to performance?  I tried a 
Google search but was unable to find an example.  
Currently the RPC size is 4 MB, and the related RPC settings are below:

lctl set_param obdfilter.lfsbv-*.brw_size=4

lctl set_param osc.*.max_pages_per_rpc=1024

lctl set_param osc.*.max_rpcs_in_flight=256

lctl set_param osc.*.max_dirty_mb=2048

 
Should I update brw_size to 16 MB, and the related settings to higher values, for better 
performance?   If yes, does that also require changes to the credits and 
peer_credits values in lnet.conf and ksocklnd.conf, to ensure there are 
enough credits to send that many RPC requests?    Should max_rpcs_in_flight be 
less than the peer_credits value in lnet.conf, or are they unrelated?  
 

lctl set_param obdfilter.lfsbv-*.brw_size=16

lctl set_param osc.*.max_pages_per_rpc=4096

lctl set_param osc.*.max_rpcs_in_flight=256

lctl set_param osc.*.max_dirty_mb=8192
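
On the unit relationship above: brw_size is specified in MB while max_pages_per_rpc counts PAGE_SIZE pages (4 KiB on x86), so 16 MB RPCs do correspond to max_pages_per_rpc=4096. A quick sanity check of the arithmetic, assuming 4 KiB pages:

```shell
# brw_size is in MB; max_pages_per_rpc is in PAGE_SIZE pages (4 KiB on x86).
page_size=4096
brw_size_mb=16
pages_per_rpc=$(( brw_size_mb * 1024 * 1024 / page_size ))
echo "$pages_per_rpc"   # 4096 -> matches osc.*.max_pages_per_rpc=4096
```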

On ksocklnd module options: more schedulers (10, 6 by default which was not 
enough for my server), also changed some of the buffers (tx_buffer_size and 
rx_buffer_size set to 1073741824) but you need to be very careful on these
Response:  I had none before. I plan to add the settings below, based on 
recommendations in various Lustre presentations at LUG meetings.   

 

echo "options ksocklnd sock_timeout=100 credits=2560 peer_credits=63 
enable_irq_affinity=0 nscheds=10 tx_buffer_size=1073741824 
rx_buffer_size=1073741824"  >  /etc/modprobe.d/ksocklnd.conf
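
After reloading the module (or rebooting), the values that actually took effect can be read back from sysfs. A sketch, assuming the ksocklnd module is loaded:

```shell
# Print the live value of each ksocklnd tunable set above.
for p in sock_timeout credits peer_credits enable_irq_affinity nscheds \
         tx_buffer_size rx_buffer_size; do
    printf '%-20s ' "$p"
    cat "/sys/module/ksocklnd/parameters/$p"
done
```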

 

 
Sysctl.conf: increase buffers (tcp_rmem, tcp_wmem, check window_scaling, 
net.core.max and default, check disabling timestamps if you can afford it)
Given below are my current settings.  My OSS and MDS nodes have 768 GB of memory 
and 52 physical cores (104 vCPUs), and my Lustre clients have 320 GB of memory and 
24 physical cores.  

 

echo "net.ipv4.tcp_window_scaling = 1" >> /etc/sysctl.conf

 

echo "net.ipv4.tcp_adv_win_scale=2" >> /etc/sysctl.conf

echo "net.ipv4.tcp_low_latency=1" >> /etc/sysctl.conf

 

echo "net.core.wmem_max=16777216" >> /etc/sysctl.conf

echo "net.core.rmem_max=16777216" >> /etc/sysctl.conf

echo "net.core.wmem_default=16777216" >> /etc/sysctl.conf

echo "net.core.rmem_default=16777216" >> /etc/sysctl.conf

echo "net.core.optmem_max=16777216" >> /etc/sysctl.conf

echo "net.core.netdev_max_backlog=27000" >> /etc/sysctl.conf   

echo "kernel.sysrq=1" >> /etc/sysctl.conf

echo "kernel.shmmax=18446744073692774399" >> /etc/sysctl.conf 

echo "net.core.somaxconn=8192" >> /etc/sysctl.conf

 

echo "net.ipv4.tcp_rmem = 212992 87380 16777216" >> /etc/sysctl.conf

echo "net.ipv4.tcp_sack = 1" >> /etc/sysctl.conf


echo "net.ipv4.tcp_wmem = 212992 65536 16777216" >> /etc/sysctl.conf

echo "vm.min_free_kbytes = 65536" >> /etc/sysctl.conf

 

echo "net.ipv4.tcp_no_metrics_save = 0" >> /etc/sysctl.conf

echo "net.ipv4.tcp_timestamps = 0" >> /etc/sysctl.conf

echo "net.ipv4.tcp_congestion_control = htcp" >> /etc/sysctl.conf
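
One caveat with appending via >>: a key can easily end up in /etc/sysctl.conf more than once, and sysctl -p applies the file in order, so the last occurrence silently wins. A quick, self-contained way to spot duplicate keys (the file contents here are a made-up demo):

```shell
# sysctl applies keys in file order, so the LAST occurrence of a
# duplicated key wins. This demo writes a small sysctl-style file
# and lists any keys that appear more than once.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 0
EOF
dups=$(awk -F'[= ]+' '{count[$1]++} END {for (k in count) if (count[k] > 1) print k}' "$tmp")
echo "$dups"
rm -f "$tmp"
```

Running the same awk line against the real /etc/sysctl.conf shows which keys are being overridden.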

 

 
I am running Lustre 2.12.3, and Lustre 2.12.1 already includes a fix for the 
single-threaded issue with ksocklnd: 
http://wiki.lustre.org/Lustre_2.12.1_Changelog lists LU-11415 (ksocklnd 
performance improvement on 40Gbps ethernet).

 

[opc@lustre-oss-server-nic0-4 ~]$ top 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   552 root      20   0       0      0      0 S   4.0  0.0  39:37.43 kswapd1
 60869 root      20   0       0      0      0 S   4.0  0.0  81:25.20 socknal_sd01_04
 60870 root      20   0       0      0      0 S   4.0  0.0  81:14.20 socknal_sd01_05
 60865 root      20   0       0      0      0 S   3.6  0.0  81:33.27 socknal_sd01_00
 60866 root      20   0       0      0      0 S   3.6  0.0  81:09.03 socknal_sd01_01
 60867 root      20   0       0      0      0 S   3.6  0.0  81:11.95 socknal_sd01_02
 60868 root      20   0       0      0      0 S   3.6  0.0  81:30.26 socknal_sd01_03
   551 root      20   0       0      0      0 S   2.6  0.0  39:24.00 kswapd0
 60860 root      20   0       0      0      0 S   2.3  0.0  30:54.35 socknal_sd00_01
 60864 root      20   0       0      0      0 S   2.3  0.0  30:58.20 socknal_sd00_05
 64426 root      20   0       0      0      0 S   2.3  0.0   7:28.65 ll_ost_io01_102
 60859 root      20   0       0      0      0 S   2.0  0.0  30:56.70 socknal_sd00_00
 60861 root      20   0       0      0      0 S   2.0  0.0  30:54.97 socknal_sd00_02
 60862 root      20   0       0      0      0 S   2.0  0.0  30:56.06 socknal_sd00_03
 60863 root      20   0       0      0      0 S   2.0  0.0  30:56.32 socknal_sd00_04
 64334 root      20   0       0      0      0 D   1.3  0.0   7:19.46 ll_ost_io01_010
 64329 root      20   0       0      0      0 S   1.0  0.0   7:46.48 ll_ost_io01_005

From: "Moreno Diego (ID SIS)" <diego.mor...@id.ethz.ch>
Date: Wednesday, December 4, 2019 at 11:12 PM
To: Pinkesh Valdria <pinkesh.vald...@oracle.com>, Jongwoo Han 
<jongwoo...@gmail.com>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet Self Test

 

I recently did some work on 40Gb and 100Gb ethernet interfaces and these are a 
few of the things that helped me during lnet_selftest:

 
On lnet: credits set higher than the default (e.g. 1024 or more), 
peer_credits to 128 at least for network testing (it's just 8 by default, which 
may be good for a big cluster but not for lnet_selftest with 2 clients).
On ksocklnd module options: more schedulers (10, 6 by default which was not 
enough for my server), also changed some of the buffers (tx_buffer_size and 
rx_buffer_size set to 1073741824) but you need to be very careful on these
Sysctl.conf: increase buffers (tcp_rmem, tcp_wmem, check window_scaling, 
net.core.max and default, check disabling timestamps if you can afford it)
Other: cpupower governor (set to performance at least for testing), BIOS 
settings (e.g. on my AMD routers it was better to disable HT, disable a few 
virtualization-oriented features and set the PCI config to performance). 
Basically, be aware that Lustre ethernet performance will take CPU resources, 
so it is better to optimize for it.
 

Last but not least, be aware that Lustre's ethernet driver (ksocklnd) does not 
load balance as well as Infiniband's (ko2iblnd). I have sometimes seen 
several Lustre peers using the same socklnd thread on the destination while the 
other socklnd threads stayed idle, which means that your entire load depends 
on just one core. The best way to check is to try with more clients 
and watch the per-thread CPU load on your node with top. 2 clients do 
not seem enough to me. With the proper configuration you should be perfectly 
able to saturate a 25Gb link in lnet_selftest.

 

Regards,

 

Diego

 

 

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Pinkesh Valdria <pinkesh.vald...@oracle.com>
Date: Thursday, 5 December 2019 at 06:14
To: Jongwoo Han <jongwoo...@gmail.com>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet Self Test

 

Thanks Jongwoo. 

 

I have the MTU set for 9000 and also ring buffer setting set to max. 

 

ip link set dev $primaryNICInterface mtu 9000

ethtool -G $primaryNICInterface rx 2047 tx 2047 rx-jumbo 8191

 

I read about changing interrupt coalescing, but was unable to find what values 
should be changed, or whether it really helps. 

# Several packets in a rapid sequence can be coalesced into one interrupt 
passed up to the CPU, providing more CPU time for application processing.
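
For what it's worth, interrupt coalescing is inspected and changed with ethtool. The sketch below reuses the $primaryNICInterface variable from the commands above, and the numeric values are only illustrative assumptions to benchmark against, not a recommendation:

```shell
# Show the NIC's current interrupt-coalescing settings.
ethtool -c $primaryNICInterface

# Example: batch up to ~50 us of packets per interrupt. Higher values
# reduce interrupt load (freeing CPU for Lustre threads) at the cost
# of a little added latency; re-run lnet_selftest after each change.
ethtool -C $primaryNICInterface rx-usecs 50 tx-usecs 50

# Many NICs can also adapt coalescing to load automatically.
ethtool -C $primaryNICInterface adaptive-rx on adaptive-tx on
```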

 

Thanks,

Pinkesh valdria

Oracle Cloud

 

 

 

From: Jongwoo Han <jongwoo...@gmail.com>
Date: Wednesday, December 4, 2019 at 8:07 PM
To: Pinkesh Valdria <pinkesh.vald...@oracle.com>
Cc: Andreas Dilger <adil...@whamcloud.com>, "lustre-discuss@lists.lustre.org" 
<lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet Self Test

 

Have you tried MTU >= 9000 bytes (AKA jumbo frame) on the 25G ethernet and the 
switch? 

If it is set to 1500 bytes, the ethernet + IP + TCP headers take up quite a 
large share of each packet, reducing the bandwidth available for data.

 

Jongwoo Han

 

On Thu, Nov 28, 2019 at 3:44 AM, Pinkesh Valdria <pinkesh.vald...@oracle.com> wrote:

Thanks Andreas for your response.  

 

I ran another LNet self-test with 48 concurrent processes, since the nodes have 
52 physical cores, and I achieved the same throughput (2052.71 MiB/s = 
2152 MB/s).

 

Is it expected to lose almost 600 MB/s (2750 - 2150 = 600) due to overheads on 
ethernet with LNet?

 

 

Thanks,

Pinkesh Valdria

Oracle Cloud Infrastructure 

 

 

 

 

From: Andreas Dilger <adil...@whamcloud.com>
Date: Wednesday, November 27, 2019 at 1:25 AM
To: Pinkesh Valdria <pinkesh.vald...@oracle.com>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lnet Self Test

 

The first thing to note is that lst reports results in binary units 
(MiB/s) while iperf reports results in decimal units (Gbps).  If you do the 
conversion you get 2055.31 MiB/s = 2155 MB/s.

 

The other thing to check is the CPU usage. For TCP the CPU usage can 
be high. You should try RoCE+o2iblnd instead. 

 

Cheers, Andreas


On Nov 26, 2019, at 21:26, Pinkesh Valdria <pinkesh.vald...@oracle.com> wrote:

Hello All, 

 

I created a new Lustre cluster on CentOS 7.6 and I am running 
lnet_selftest_wrapper.sh to measure throughput on the network.  The nodes are 
connected to each other using 25 Gbps ethernet, so the theoretical max is 25 Gbps * 
125 = 3125 MB/s.    Using iperf3, I get 22 Gbps (2750 MB/s) between the nodes.

 

 

[root@lustre-client-2 ~]# for c in 1 2 4 8 12 16 20 24 ;  do echo $c ; 
ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)  CN=$c  SZ=1M  TM=30 BRW=write 
CKSUM=simple LFROM="10.0.3.7@tcp1" LTO="10.0.3.6@tcp1" 
/root/lnet_selftest_wrapper.sh; done ;

 

When I run lnet_selftest_wrapper.sh (from the Lustre wiki) between 2 nodes, I get 
a max of 2055.31 MiB/s.  Is that expected at the LNet level?  Or can I 
further tune the network and OS kernel (the tuning I applied is below) to get 
better throughput?

 

 

 

Result Snippet from lnet_selftest_wrapper.sh

 

[LNet Rates of lfrom]

[R] Avg: 4112     RPC/s Min: 4112     RPC/s Max: 4112     RPC/s

[W] Avg: 4112     RPC/s Min: 4112     RPC/s Max: 4112     RPC/s

[LNet Bandwidth of lfrom]

[R] Avg: 0.31     MiB/s Min: 0.31     MiB/s Max: 0.31     MiB/s

[W] Avg: 2055.30  MiB/s Min: 2055.30  MiB/s Max: 2055.30  MiB/s

[LNet Rates of lto]

[R] Avg: 4136     RPC/s Min: 4136     RPC/s Max: 4136     RPC/s

[W] Avg: 4136     RPC/s Min: 4136     RPC/s Max: 4136     RPC/s

[LNet Bandwidth of lto]

[R] Avg: 2055.31  MiB/s Min: 2055.31  MiB/s Max: 2055.31  MiB/s

[W] Avg: 0.32     MiB/s Min: 0.32     MiB/s Max: 0.32     MiB/s

 

 

Tuning applied: 

Ethernet NICs: 

ip link set dev ens3 mtu 9000 

ethtool -G ens3 rx 2047 tx 2047 rx-jumbo 8191

 

 

less /etc/sysctl.conf

net.core.wmem_max=16777216

net.core.rmem_max=16777216

net.core.wmem_default=16777216

net.core.rmem_default=16777216

net.core.optmem_max=16777216

net.core.netdev_max_backlog=27000

kernel.sysrq=1

kernel.shmmax=18446744073692774399

net.core.somaxconn=8192

net.ipv4.tcp_adv_win_scale=2

net.ipv4.tcp_low_latency=1

net.ipv4.tcp_rmem = 212992 87380 16777216

net.ipv4.tcp_sack = 1

net.ipv4.tcp_timestamps = 1

net.ipv4.tcp_window_scaling = 1

net.ipv4.tcp_wmem = 212992 65536 16777216

vm.min_free_kbytes = 65536

net.ipv4.tcp_congestion_control = cubic

net.ipv4.tcp_timestamps = 0

net.ipv4.tcp_congestion_control = htcp

net.ipv4.tcp_no_metrics_save = 0

 

 

 

echo "#

# tuned configuration

#

[main]

summary=Broadly applicable tuning that provides excellent performance across a 
variety of common server workloads

 

[disk]

devices=!dm-*, !sda1, !sda2, !sda3

readahead=>4096

 

[cpu]

force_latency=1

governor=performance

energy_perf_bias=performance

min_perf_pct=100

[vm]

transparent_huge_pages=never

[sysctl]

kernel.sched_min_granularity_ns = 10000000

kernel.sched_wakeup_granularity_ns = 15000000

vm.dirty_ratio = 30

vm.dirty_background_ratio = 10

vm.swappiness=30

" > lustre-performance/tuned.conf

 

tuned-adm profile lustre-performance

 

 

Thanks,

Pinkesh Valdria

 

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



 

-- 

Jongwoo Han

+82-505-227-6108

