[lustre-discuss] Lustre client on kernel.org 4.4 ?

2019-02-12 Thread Louis Bailleul
Hi all,

Lustre 2.10 is supported on Ubuntu 16.04 and SLES12, both of which are based on the 
4.4 kernel, so I tried to compile the 2.10.6 client using DKMS against the latest 
kernel.org 4.4 LTS (4.4.174).

But it fails with what look like kernel API differences (full output attached).

make[3]: Entering directory `/usr/src/kernels/4.4.174-1.el6.elrepo.x86_64'
  LD  /var/lib/dkms/lustre-client/2.10.6/build/built-in.o
  LD  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/built-in.o
  LD  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/built-in.o
  CC [M]  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-tracefile.o
  CC [M]  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-debug.o
  CC [M]  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-prim.o
  CC [M]  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-cpu.o
  CC [M]  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-curproc.o
cc1: warnings being treated as errors
/var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-curproc.c: In function ‘cfs_access_process_vm’:
/var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-curproc.c:157: error: passing argument 6 of ‘get_user_pages’ makes pointer from integer without a cast
include/linux/mm.h:1200: note: expected ‘struct page **’ but argument is of type ‘int’
/var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-curproc.c:157: error: passing argument 7 of ‘get_user_pages’ from incompatible pointer type
include/linux/mm.h:1200: note: expected ‘struct vm_area_struct **’ but argument is of type ‘struct page **’
/var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-curproc.c:157: error: too many arguments to function ‘get_user_pages’
make[6]: *** [/var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-curproc.o] Error 1
make[5]: *** [/var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs] Error 2
make[4]: *** [/var/lib/dkms/lustre-client/2.10.6/build/libcfs] Error 2
make[3]: *** [_module_/var/lib/dkms/lustre-client/2.10.6/build] Error 2
make[3]: Leaving directory `/usr/src/kernels/4.4.174-1.el6.elrepo.x86_64'
make[2]: *** [modules] Error 2
make[2]: Leaving directory `/var/lib/dkms/lustre-client/2.10.6/build'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/var/lib/dkms/lustre-client/2.10.6/build'
make: *** [all] Error 2

Is there a way to get this to work?

Full disclosure: I need to upgrade the kernel on some CentOS 6 boxes because of a 
bug (not Lustre related), but can't upgrade the OS for *reasons*.


Best regards,
Louis


DKMS make.log for lustre-client-2.10.6 for kernel 4.4.174-1.el6.elrepo.x86_64 (x86_64)
Mon Feb 11 17:54:21 GMT 2019
make  all-recursive
make[1]: Entering directory `/var/lib/dkms/lustre-client/2.10.6/build'
Making all in .
make[2]: Entering directory `/var/lib/dkms/lustre-client/2.10.6/build'
make LDFLAGS= CC="gcc" -C /usr/src/kernels/4.4.174-1.el6.elrepo.x86_64 \
	-f /var/lib/dkms/lustre-client/2.10.6/build/build/Makefile LUSTRE_LINUX_CONFIG=/usr/src/kernels/4.4.174-1.el6.elrepo.x86_64/.config \
	LINUXINCLUDE='-I$(srctree)/arch/$(SRCARCH)/include -Iarch/$(SRCARCH)/include/generated -Iinclude $(if $(KBUILD_SRC),-Iinclude2 -I$(srctree)/include) -I$(srctree)/arch/$(SRCARCH)/include/uapi -Iarch/$(SRCARCH)/include/generated/uapi -I$(srctree)/include/uapi -Iinclude/generated/uapi -include /usr/src/kernels/4.4.174-1.el6.elrepo.x86_64/include/linux/kconfig.h' \
	M=/var/lib/dkms/lustre-client/2.10.6/build -o tmp_include_depends -o scripts -o \
	include/config/MARKER modules
make[3]: Entering directory `/usr/src/kernels/4.4.174-1.el6.elrepo.x86_64'
  LD  /var/lib/dkms/lustre-client/2.10.6/build/built-in.o
  LD  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/built-in.o
  LD  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/built-in.o
  CC [M]  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-tracefile.o
  CC [M]  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-debug.o
  CC [M]  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-prim.o
  CC [M]  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-cpu.o
  CC [M]  /var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-curproc.o
cc1: warnings being treated as errors
/var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-curproc.c: In function ‘cfs_access_process_vm’:
/var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-curproc.c:157: error: passing argument 6 of ‘get_user_pages’ makes pointer from integer without a cast
include/linux/mm.h:1200: note: expected ‘struct page **’ but argument is of type ‘int’
/var/lib/dkms/lustre-client/2.10.6/build/libcfs/libcfs/linux/linux-curproc.c:157: error: passing argument 7 of ‘get_user_pages’ from incompatible pointer type
include/linux/mm.h:1200: note: expected ‘struct vm_area_struct 

[lustre-discuss] obdfilter/mdt stats meaning ?

2019-07-16 Thread Louis Bailleul
Hi all,

I am trying to make sense of some of the OST/MDT stats for 2.12.
Can anybody point me to the doc that explains what the metrics are?
The wiki only mentions read/write/get_info:
http://wiki.lustre.org/Lustre_Monitoring_and_Statistics_Guide
But the list I get is quite different:
obdfilter.OST001.stats=
snapshot_time     1563285450.647120173 secs.nsecs
read_bytes        340177708 samples [bytes] 4096 4194304 396712660910080
write_bytes       30008856 samples [bytes] 24 4194304 78618271501667
setattr           1755 samples [reqs]
punch             73463 samples [reqs]
sync              50606 samples [reqs]
destroy           31990 samples [reqs]
create            956 samples [reqs]
statfs            75378743 samples [reqs]
connect           5798 samples [reqs]
reconnect         3242 samples [reqs]
disconnect        5820 samples [reqs]
statfs            3737980 samples [reqs]
preprw            370186566 samples [reqs]
commitrw          370186557 samples [reqs]
ping              882096292 samples [reqs]
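For what it's worth, the counters can at least be turned into averages. A minimal sketch (not an official tool; the line format `name count samples [unit] min max sum` is inferred from the output above):

```python
# Derive the average request size from a Lustre "stats" counter line.
# Assumed format (inferred from the listing above):
#   name count samples [unit] [min max sum]
def parse_stats_line(line):
    parts = line.split()
    name, count = parts[0], int(parts[1])
    if len(parts) >= 7:  # byte counters also carry min/max/sum
        return name, count, int(parts[4]), int(parts[5]), int(parts[6])
    return name, count, None, None, None

line = "read_bytes 340177708 samples [bytes] 4096 4194304 396712660910080"
name, count, lo, hi, total = parse_stats_line(line)
print(f"{name}: avg {total // count} bytes per read (range {lo}..{hi})")
```

On the numbers above this works out to roughly 1.1 MiB average per read RPC, well below the 4 MiB maximum.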

For the MDT, most are pretty much self-explanatory, but I'd still be happy to 
be pointed to some doc.
mdt.MDT.md_stats=
snapshot_time     1563287416.006001068 secs.nsecs
open              3174644054 samples [reqs]
close             3174494603 samples [reqs]
mknod             107564 samples [reqs]
unlink            99625 samples [reqs]
mkdir             199643 samples [reqs]
rmdir             45021 samples [reqs]
rename            12728 samples [reqs]
getattr           50227431 samples [reqs]
setattr           103435 samples [reqs]
getxattr          9051470 samples [reqs]
setxattr          14 samples [reqs]
statfs            7525513 samples [reqs]
sync              20597 samples [reqs]
samedir_rename    207 samples [reqs]
crossdir_rename   12521 samples [reqs]
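The counters above do support a couple of quick consistency checks (plain arithmetic on the quoted values, nothing authoritative): rename should equal samedir_rename + crossdir_rename, and open minus close gives a rough idea of how many opens have not yet been matched by a close since the counters were reset:

```python
# Consistency checks on the md_stats counters quoted above.
stats = {
    "open": 3174644054,
    "close": 3174494603,
    "rename": 12728,
    "samedir_rename": 207,
    "crossdir_rename": 12521,
}

# rename is split into same-directory and cross-directory renames
assert stats["samedir_rename"] + stats["crossdir_rename"] == stats["rename"]

# opens not (yet) matched by a close since the counters were reset
print("open - close =", stats["open"] - stats["close"])
```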

And does anyone know how to read the OST brw_stats?

obdfilter.OST0014.brw_stats=
snapshot_time: 1563287631.511085465 (secs.nsecs)

   read  | write
pages per bulk r/w rpcs  % cum % |  rpcs% cum %
1:   231699298  66  66   | 180944   0   0
2:  855611   0  67   | 322359   1   1
4:  541749   0  67   | 5539716  18  20
8: 1281219   0  67   | 67837   0  20
16: 637808   0  67   | 114546   0  20
32:1342813   0  68   | 3099780  10  31
64:1559834   0  68   | 173166   0  31
128:   1583127   0  69   | 211512   0  32
256:  10627583   3  72   | 499978   1  34
512:   3909601   1  73   | 1029686   3  37
1K:   92141161  26 100   | 18788597  62 100

   read  | write
discontiguous pagesrpcs  % cum % |  rpcs% cum %
0:   346179839 100 100   | 180946   0   0
1:   0   0 100   | 322363   1   1
2:   0   0 100   | 5521062  18  20
3:   0   0 100   | 18650   0  20
4:   0   0 100   | 18159   0  20
5:   0   0 100   | 26664   0  20
6:   0   0 100   | 10830   0  20
7:   0   0 100   | 12189   0  20
8:   0   0 100   | 11365   0  20
9:   0   0 100   | 10253   0  20
10:  0   0 100   | 8810   0  20
11:  0   0 100   | 9825   0  20
12:  0   0 100   | 16740   0  20
13:  0   0 100   | 14421   0  20
14:  0   0 100   | 10513   0  20
15:  0   0 100   | 32655   0  20
16:  0   0 100   | 1418677   4  25
17:  0   0 100   | 1477077   4  30
18:  0   0 100   | 6227   0  30
19:  0   0 100   | 7071   0  30
20:  0   0 100   | 7297   0  30
21:  0   0 100   | 8478   0  30
22:  0   0 100   | 34591   0  30
23:  0   0 100   | 35591   0  30
24:  0   0 100   | 8378   0  30
25:  0   0 100   | 8724   0  30
26:  0   0 100   | 52300   0  30
27:  0   0 100   | 14038   0  30
28:  0   0 100   | 4734   0  30
29:  0   0 100   | 4878   0  31
30:  0   0 100   | 6232   0  31
31:  0   0 100   | 20708383  68 100
   read  | write
disk I/Os in flightios   % cum % |  ios % cum %
1:   211177215  61  61   | 29305564  97  97
2:41332944  11  72   | 498260   1  99
3:  

Re: [lustre-discuss] [External] Re: obdfilter/mdt stats meaning ?

2019-07-16 Thread Louis Bailleul
Hi Aurélien,

Thanks for the prompt reply.
For the OST stats, any idea what preprw and commitrw mean?
And why are there two entries with different values for statfs?

For brw_stats, even with the doc I still struggle to read this.
For example, how do you make sense of disk I/Os in flight?
   read  | write
disk I/Os in flightios   % cum % |  ios % cum %
1:   211177215  61  61   | 29305564  97  97
2:41332944  11  72   | 498260   1  99
[..]
Do these lines mean:
since the last snapshot there were 211177215 read I/Os issued with 1 I/O in flight, and 41332944 with 2 in flight?
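The percentage columns at least are self-consistent with that reading. A sketch (my interpretation, not confirmed by any doc) using the two read rows above plus the read total of 346179839 RPCs visible in the discontiguous-pages table of the same OST:

```python
# Reproduce the %/cum% columns of brw_stats from raw bucket counts.
# Bucket counts: the two "disk I/Os in flight" read rows quoted above.
# Total: the read RPC total visible in the discontiguous-pages table.
buckets = [(1, 211177215), (2, 41332944)]  # depth -> rpc count
total = 346179839

cum = 0
for depth, count in buckets:
    cum += count
    print(f"{depth}: {count} {count * 100 // total}% cum {cum * 100 // total}%")
    # prints 61% / cum 61%, then 11% / cum 72% -- matching the table
```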

Best regards,
Louis

On 16/07/2019 15:50, Degremont, Aurelien wrote:
Hi Louis,

About brw_stats, there is a bit of explanation in the Lustre manual (not that 
detailed, but still):
http://doc.lustre.org/lustre_manual.xhtml#dbdoclet.50438271_55057

> Last thing, is there any way to get the name of the filesystem an OST is part 
> of by using lctl ?

I don't know exactly what you want, but the OST names are self-explanatory; 
they always look like fsname-OST, 
where fsname is the Lustre filesystem they are part of.

For the obdfilter stats, these are mostly actions on OST objects or client 
connection management RPCs.

setattr: changing an OST object's attributes (owner, group, ...)
punch: mostly used for truncate (theoretically it can punch holes in files, like 
a truncate with a start offset and a length)
sync: straightforward, sync the OST to disk
destroy: delete an OST object (mostly when a file is deleted)
create: create an OST object
statfs: like 'df' for this specific OST (used by 'lfs df', for example)
(re)connect: when a client connects/reconnects to this OST
ping: when a client pings this OST.


Aurélien

De : lustre-discuss 
<mailto:lustre-discuss-boun...@lists.lustre.org>
 au nom de Louis Bailleul 
<mailto:louis.baill...@pgs.com>
Date : mardi 16 juillet 2019 à 16:38
À : lustre-discuss 
<mailto:lustre-discuss@lists.lustre.org>
Objet : [lustre-discuss] obdfilter/mdt stats meaning ?

Hi all,

I am trying to make sense of some of the OST/MDT stats for 2.12.
Can anybody point me to the doc that explain what the metrics are ?
The wiki only mention read/write/get_info : 
http://wiki.lustre.org/Lustre_Monitoring_and_Statistics_Guide
But the list I get is quite different :
obdfilter.OST001.stats=
snapshot_time 1563285450.647120173 secs.nsecs
read_bytes340177708 samples [bytes] 4096 4194304 
396712660910080
write_bytes   30008856 samples [bytes] 24 4194304 78618271501667
setattr   1755 samples [reqs]
punch 73463 samples [reqs]
sync  50606 samples [reqs]
destroy   31990 samples [reqs]
create956 samples [reqs]
statfs75378743 samples [reqs]
connect   5798 samples [reqs]
reconnect 3242 samples [reqs]
disconnect5820 samples [reqs]
statfs3737980 samples [reqs]
preprw370186566 samples [reqs]
commitrw  370186557 samples [reqs]
ping  882096292 samples [reqs]
For the MDT, most are pretty much self explanatory, but I'll still be happy to 
be pointed to some doc.
mdt.MDT.md_stats=
snapshot_time 1563287416.006001068 secs.nsecs
open  3174644054 samples [reqs]
close 3174494603 samples [reqs]
mknod 107564 samples [reqs]
unlink99625 samples [reqs]
mkdir 199643 samples [reqs]
rmdir 45021 samples [reqs]
rename12728 samples [reqs]
getattr   50227431 samples [reqs]
setattr   103435 samples [reqs]
getxattr  9051470 samples [reqs]
setxattr  14 samples [reqs]
statfs7525513 samples [reqs]
sync  20597 samples [reqs]
samedir_rename207 samples [reqs]
crossdir_rename   12521 samples [reqs]
And anyone knows how to read the OST brw_stats ?
obdfilter.OST0014.brw_stats=
snapshot_time: 1563287631.511085465 (secs.nsecs)

   read  | write
pages per bulk r/w rpcs  % cum % |  rpcs% cum %
1:   231699298  66  66   | 180944   0   0
2:  855611   0  67   | 322359   1   1
4:  541749   0  67   | 5539716  18  20
8: 128121

[lustre-discuss] Very bad lnet ethernet read performance

2019-08-12 Thread Louis Bailleul
Hi all,

I am trying to understand what I am doing wrong here.
I have a Lustre 2.12.1 system backed by NVMe drives under ZFS, for which 
obdfilter-survey gives decent values:
ost  2 sz 536870912K rsz 1024K obj2 thr  256 write 15267.49 [6580.36, 8664.20] rewrite 15225.24 [6559.05, 8900.54] read 19739.86 [9062.25, 10429.04]
But my actual Lustre performance is pretty poor in comparison (I can't top 
8GB/s write and 13.5GB/s read).
So I started to question my lnet tuning, but playing with peer_credits and 
max_pages_per_rpc didn't help.

My test setup consists of 133x10G Ethernet clients (uplinks between end devices 
and OSS are 2x100G for every 20 nodes).
The single OSS is fitted with a bond of 2x100G Ethernet.

I have tried to understand the problem using lnet_selftest, but I'll need some 
help/doco as the results don't make sense to me.

Testing a single 10G client
[LNet Rates of lfrom]
[R] Avg: 2231 RPC/s Min: 2231 RPC/s Max: 2231 RPC/s
[W] Avg: 1156 RPC/s Min: 1156 RPC/s Max: 1156 RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 1075.16  MiB/s Min: 1075.16  MiB/s Max: 1075.16  MiB/s
[W] Avg: 0.18 MiB/s Min: 0.18 MiB/s Max: 0.18 MiB/s
[LNet Rates of lto]
[R] Avg: 1179 RPC/s Min: 1179 RPC/s Max: 1179 RPC/s
[W] Avg: 2254 RPC/s Min: 2254 RPC/s Max: 2254 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.19 MiB/s Min: 0.19 MiB/s Max: 0.19 MiB/s
[W] Avg: 1075.17  MiB/s Min: 1075.17  MiB/s Max: 1075.17  MiB/s
With 10x10G clients:
[LNet Rates of lfrom]
[R] Avg: 1416 RPC/s Min: 1102 RPC/s Max: 1642 RPC/s
[W] Avg: 708  RPC/s Min: 551  RPC/s Max: 821  RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 708.20   MiB/s Min: 550.77   MiB/s Max: 820.96   MiB/s
[W] Avg: 0.11 MiB/s Min: 0.08 MiB/s Max: 0.13 MiB/s
[LNet Rates of lto]
[R] Avg: 7084  RPC/s Min: 7084  RPC/s Max: 7084  RPC/s
[W] Avg: 14165 RPC/s Min: 14165 RPC/s Max: 14165 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.08 MiB/s Min: 1.08 MiB/s Max: 1.08 MiB/s
[W] Avg: 7081.86  MiB/s Min: 7081.86  MiB/s Max: 7081.86  MiB/s

With all 133x10G clients:
[LNet Rates of lfrom]
[R] Avg: 510  RPC/s Min: 98   RPC/s Max: 23457 RPC/s
[W] Avg: 510  RPC/s Min: 49   RPC/s Max: 45863 RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 169.87   MiB/s Min: 48.77 MiB/s Max: 341.26   MiB/s
[W] Avg: 169.86   MiB/s Min: 0.01 MiB/s Max: 22757.92 MiB/s
[LNet Rates of lto]
[R] Avg: 23458 RPC/s Min: 23458 RPC/s Max: 23458 RPC/s
[W] Avg: 45876 RPC/s Min: 45876 RPC/s Max: 45876 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 341.12   MiB/s Min: 341.12   MiB/s Max: 341.12   MiB/s
[W] Avg: 22758.42 MiB/s Min: 22758.42 MiB/s Max: 22758.42 MiB/s

So if I add clients the aggregate write bandwidth more or less stacks, but the read 
bandwidth decreases???
When throwing all the nodes at the system, I am pretty happy with the ~22GB/s 
write, as this is around 90% of the 2x100G, but the 341MB/s read 
looks very odd considering it is a third of the performance of a 
single client.
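A quick sanity check of those figures against raw link capacity (simple arithmetic only; links taken at their ideal payload rate, ignoring protocol overhead):

```python
MiB = 2**20

def gbe_to_mibs(gbits):
    """Ideal payload rate of an Ethernet link in MiB/s (no overhead)."""
    return gbits * 1e9 / 8 / MiB

# Single 10G client: 1075.16 MiB/s measured by lnet_selftest
print(f"1 client: {1075.16 / gbe_to_mibs(10):.0%} of 10GbE")       # ~90%
# All clients writing: 22758.42 MiB/s against the bonded 2x100G OSS
print(f"write   : {22758.42 / gbe_to_mibs(200):.0%} of 2x100GbE")  # ~95%
# All clients reading: 341.12 MiB/s -- the anomaly
print(f"read    : {341.12 / gbe_to_mibs(200):.1%} of 2x100GbE")    # ~1.4%
```

So the write side is essentially at wire speed, while the aggregate read collapses far below even a single 10G link.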

These are my ksocklnd tunings:

# for i in /sys/module/ksocklnd/parameters/*; do echo "$i : $(cat $i)"; done
/sys/module/ksocklnd/parameters/credits : 1024
/sys/module/ksocklnd/parameters/eager_ack : 0
/sys/module/ksocklnd/parameters/enable_csum : 0
/sys/module/ksocklnd/parameters/enable_irq_affinity : 0
/sys/module/ksocklnd/parameters/inject_csum_error : 0
/sys/module/ksocklnd/parameters/keepalive : 30
/sys/module/ksocklnd/parameters/keepalive_count : 5
/sys/module/ksocklnd/parameters/keepalive_idle : 30
/sys/module/ksocklnd/parameters/keepalive_intvl : 5
/sys/module/ksocklnd/parameters/max_reconnectms : 6
/sys/module/ksocklnd/parameters/min_bulk : 1024
/sys/module/ksocklnd/parameters/min_reconnectms : 1000
/sys/module/ksocklnd/parameters/nagle : 0
/sys/module/ksocklnd/parameters/nconnds : 4
/sys/module/ksocklnd/parameters/nconnds_max : 64
/sys/module/ksocklnd/parameters/nonblk_zcack : 1
/sys/module/ksocklnd/parameters/nscheds : 12
/sys/module/ksocklnd/parameters/peer_buffer_credits : 0
/sys/module/ksocklnd/parameters/peer_credits : 128
/sys/module/ksocklnd/parameters/peer_timeout : 180
/sys/module/ksocklnd/parameters/round_robin : 1
/sys/module/ksocklnd/parameters/rx_buffer_size : 0
/sys/module/ksocklnd/parameters/sock_timeout : 50
/sys/module/ksocklnd/parameters/tx_buffer_size : 0
/sys/module/ksocklnd/parameters/typed_conns : 1
/sys/module/ksocklnd/parameters/zc_min_payload : 16384
/sys/module/ksocklnd/parameters/zc_recv : 0
/sys/module/ksocklnd/parameters/zc_recv_min_nfrags : 16
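One arithmetic observation on those values (a hypothesis to check, not a diagnosis): as I understand the LNet docs, `credits` caps the total number of concurrent sends on a network interface while `peer_credits` caps sends per peer, so with 133 clients the per-peer limit is never the binding one on the server side:

```python
# ksocklnd credit arithmetic for the values above (hypothesis, not a diagnosis):
# "credits" caps concurrent sends per network interface,
# "peer_credits" caps concurrent sends per peer.
credits, peer_credits, clients = 1024, 128, 133

demand = clients * peer_credits  # what the clients could ask for in total
print("aggregate peer demand:", demand)             # 17024
print("NI-wide credit pool  :", credits)            # 1024
print("avg credits per peer :", credits / clients)  # ~7.7
```

Under full load each peer effectively averages fewer than 8 credits regardless of peer_credits=128; whether that matters for the read collapse is a separate question.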

Best regards,
Louis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [External] Re: Very bad lnet ethernet read performance

2019-08-16 Thread Louis Bailleul
Thanks for the pointers.

Flow control has limited impact at this point (no change under lnet_selftest, 
and a ~10% drop when disabled under iperf).
All machines have tcp_sack enabled.
Checksums don't seem to make a difference either.
Bumping up max_rpcs_in_flight didn't improve much, but it seems to have made 
the write speed more consistent.
read_ahead had no effect on read performance.

At this point I am struggling to understand what actually affects reads.
iperf between the clients and the OSS gives a combined bandwidth reaching ~90% of 
link capacity (43.7GB/s), but lnet_selftest maxes out at ~14GB/s, so about 28%.

Any clues on which lnet tunables / settings could have an impact here?

Best regards,
Louis

On 13/08/2019 12:53, Raj wrote:
Louis,
I would also try:
- Turning on selective ACK (net.ipv4.tcp_sack=1) on all nodes. This helps, 
although there is a CVE out there for older kernels.
- Turning off checksums (osc.ostid*.checksums). This can be turned off per OST/FS 
on the clients.
- Increasing max_pages_per_rpc to 16M, although this may not help with your 
reads.
- Increasing max_rpcs_in_flight, and setting max_dirty_mb to 2 x max_rpcs_in_flight.
- Increasing llite.ostid*.max_read_ahead_mb up to 1024 on the clients. Again, 
this can be set per OST/FS.

_Raj

On Mon, Aug 12, 2019 at 12:12 PM Shawn Hall 
mailto:shawn.h...@nag.com>> wrote:
Do you have Ethernet flow control configured on all ports (especially the 
uplink ports)?  We’ve found that flow control is critical when there are 
mismatched uplink/client port speeds.

Shawn

From: lustre-discuss 
mailto:lustre-discuss-boun...@lists.lustre.org>>
 On Behalf Of Louis Bailleul
Sent: Monday, August 12, 2019 1:08 PM
To: lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Very bad lnet ethernet read performance

Hi all,

I am trying to understand what I am doing wrong here.
I have a Lustre 2.12.1 system backed by NVME drives under zfs for which 
obdfilter-survey gives descent values
ost  2 sz 536870912K rsz 1024K obj2 thr  256 write 15267.49 [6580.36, 
8664.20] rewrite 15225.24 [6559.05, 8900.54] read 19739.86 [9062.25, 10429.04]
But my actual Lustre performances are pretty poor in comparison (can't top 
8GB/s write and 13.5GB/s read)
So I started to question my lnet tuning but playing with peer_credits and 
max_rpc_per_pages didn't help.

My test setup consist of 133x10G Ethernet clients (uplinks between end devices 
and OSS are 2x100G for every 20 nodes).
The single OSS is fitted with a bonding of 2x100G Ethernet.

I have tried to understand the problem using lnet_selftest but I'll need some 
help/doco as this doesn't make sense to me.

Testing a single 10G client
[LNet Rates of lfrom]
[R] Avg: 2231 RPC/s Min: 2231 RPC/s Max: 2231 RPC/s
[W] Avg: 1156 RPC/s Min: 1156 RPC/s Max: 1156 RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 1075.16  MiB/s Min: 1075.16  MiB/s Max: 1075.16  MiB/s
[W] Avg: 0.18 MiB/s Min: 0.18 MiB/s Max: 0.18 MiB/s
[LNet Rates of lto]
[R] Avg: 1179 RPC/s Min: 1179 RPC/s Max: 1179 RPC/s
[W] Avg: 2254 RPC/s Min: 2254 RPC/s Max: 2254 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.19 MiB/s Min: 0.19 MiB/s Max: 0.19 MiB/s
[W] Avg: 1075.17  MiB/s Min: 1075.17  MiB/s Max: 1075.17  MiB/s
With 10x10G clients :
[LNet Rates of lfrom]
[R] Avg: 1416 RPC/s Min: 1102 RPC/s Max: 1642 RPC/s
[W] Avg: 708  RPC/s Min: 551  RPC/s Max: 821  RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 708.20   MiB/s Min: 550.77   MiB/s Max: 820.96   MiB/s
[W] Avg: 0.11 MiB/s Min: 0.08 MiB/s Max: 0.13 MiB/s
[LNet Rates of lto]
[R] Avg: 7084 RPC/s Min: 7084 RPC/s Max: 7084 RPC/s
[W] Avg: 14165RPC/s Min: 14165RPC/s Max: 14165RPC/s
[LNet Bandwidth of lto]
[R] Avg: 1.08 MiB/s Min: 1.08 MiB/s Max: 1.08 MiB/s
[W] Avg: 7081.86  MiB/s Min: 7081.86  MiB/s Max: 7081.86  MiB/s

With all 133x10G clients:
[LNet Rates of lfrom]
[R] Avg: 510  RPC/s Min: 98   RPC/s Max: 23457RPC/s
[W] Avg: 510  RPC/s Min: 49   RPC/s Max: 45863RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 169.87   MiB/s Min: 48.77MiB/s Max: 341.26   MiB/s
[W] Avg: 169.86   MiB/s Min: 0.01 MiB/s Max: 22757.92 MiB/s
[LNet Rates of lto]
[R] Avg: 23458RPC/s Min: 23458RPC/s Max: 23458RPC/s
[W] Avg: 45876RPC/s Min: 45876RPC/s Max: 45876RPC/s
[LNet Bandwidth of lto]
[R] Avg: 341.12   MiB/s Min: 341.12   MiB/s Max: 341.12   MiB/s
[W] Avg: 22758.42 MiB/s Min: 22758.42 MiB/s Max: 22758.42 MiB/s

So if I add clients the aggregate write bandwidth somewhat stacks, but the read 
bandwidth decrease ???
When throwing all the nodes at the system, I am pretty happy with the ~22GB/s 
on write pretty as this is in the 90% of the 2x100G, but the 341MB/s read 
sounds very weird considering that this is a third of the performance of a 
single client.

This are my ksocklnd tuning :
# for i i