[lustre-discuss] How to make OSTs active again

2021-09-30 Thread Pinkesh Valdria via lustre-discuss
I have a simple Lustre setup: 1 MGS, 1 MDS (2 MDTs), 2 OSS (2 OSTs each) and 1
client node to run some IO load. I was testing what happens if one of the OSS
nodes dies (with no impact to the data). To recover from the failed OSS, I
created a new instance and attached the 2 OSTs from the failed node. Since I am
reusing the existing OSTs from the failed node and their indexes remain the
same, I assumed I could mount them directly, like below:

mount -t lustre /dev/oracleoci/oraclevdb /mnt/oss-2-ost-1
mount -t lustre /dev/oracleoci/oraclevdc /mnt/oss-2-ost-2

Since I had tried this many different times, I also tried the following.
Ran mkfs.lustre on the OSTs:
mkfs.lustre --fsname=lustrefs  --index=2 --ost --mgsnode=10.0.6.2@tcp1  
/dev/oracleoci/oraclevdb
mkfs.lustre --fsname=lustrefs  --index=3 --ost --mgsnode=10.0.6.2@tcp1  
/dev/oracleoci/oraclevdc
mount -t lustre /dev/oracleoci/oraclevdb /mnt/oss-2-ost-1
mount -t lustre /dev/oracleoci/oraclevdc /mnt/oss-2-ost-2

Ran mkfs.lustre on the OSTs with --reformat --replace:
mkfs.lustre --fsname=lustrefs --reformat --replace --index=2  --ost 
--mgsnode=10.0.6.2@tcp1  /dev/oracleoci/oraclevdb
mount -t lustre /dev/oracleoci/oraclevdb /mnt/oss-2-ost-1

mkfs.lustre --fsname=lustrefs --reformat --replace --index=3  --ost 
--mgsnode=10.0.6.2@tcp1  /dev/oracleoci/oraclevdc
mount -t lustre /dev/oracleoci/oraclevdc /mnt/oss-2-ost-2


Questions:

  1.  After the OSS node was replaced, the client node mount was still hung and
I had to reboot the client node for the mount to work. Is there some config I
need to set so it auto-recovers?
  2.  On the client node, I see the 2 OSTs showing as INACTIVE. How do I make
them active again? I read on forums to run "lctl --device <device>
recover/activate"; I ran that on the MDS and the client, and it still shows
INACTIVE. It was confusing what to pass as <device> and where to find the
correct name.
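
For reference, this is roughly the form of the commands I was trying for
question 2 (I am not sure these are the correct invocations, which is part of
my question; the device numbers come from the lctl dl output below):

# On the client: devices 5 and 6 are the OST0002/OST0003 osc devices in lctl dl
lctl --device 5 activate
lctl --device 6 activate

# An alternative form I believe exists: flip the osc "active" flag directly
lctl set_param osc.lustrefs-OST0002-osc-*.active=1
lctl set_param osc.lustrefs-OST0003-osc-*.active=1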

[root@client-1 ~]# lfs osts
OBDS:
0: lustrefs-OST_UUID ACTIVE
1: lustrefs-OST0001_UUID ACTIVE
2: lustrefs-OST0002_UUID INACTIVE
3: lustrefs-OST0003_UUID INACTIVE

[root@client-1 ~]# lctl dl
  0 UP mgc MGC10.0.6.2@tcp1 0e4fae60-66e5-963d-1aea-59b80f9fd77b 4
  1 UP lov lustrefs-clilov-89259ae86000 
6c141ed7-bffe-3d1b-a094-11fbdaab9ac5 3
  2 UP lmv lustrefs-clilmv-89259ae86000 
6c141ed7-bffe-3d1b-a094-11fbdaab9ac5 4
  3 UP mdc lustrefs-MDT-mdc-89259ae86000 
6c141ed7-bffe-3d1b-a094-11fbdaab9ac5 4
  4 UP mdc lustrefs-MDT0001-mdc-89259ae86000 
6c141ed7-bffe-3d1b-a094-11fbdaab9ac5 4
  5 UP osc lustrefs-OST0002-osc-89259ae86000 
6c141ed7-bffe-3d1b-a094-11fbdaab9ac5 4
  6 UP osc lustrefs-OST0003-osc-89259ae86000 
6c141ed7-bffe-3d1b-a094-11fbdaab9ac5 4
  7 UP osc lustrefs-OST-osc-89259ae86000 
6c141ed7-bffe-3d1b-a094-11fbdaab9ac5 4
  8 UP osc lustrefs-OST0001-osc-89259ae86000 
6c141ed7-bffe-3d1b-a094-11fbdaab9ac5 4
[root@client-1 ~]#

MDS node
$ sudo lctl dl
  0 UP osd-ldiskfs lustrefs-MDT0001-osd lustrefs-MDT0001-osd_UUID 10
  1 UP osd-ldiskfs lustrefs-MDT-osd lustrefs-MDT-osd_UUID 11
  2 UP mgc MGC10.0.6.2@tcp1 acc3160e-9975-9262-89e1-8dc66812ac94 4
  3 UP mds MDS MDS_uuid 2
  4 UP lod lustrefs-MDT-mdtlov lustrefs-MDT-mdtlov_UUID 3
  5 UP mdt lustrefs-MDT lustrefs-MDT_UUID 18
  6 UP mdd lustrefs-MDD lustrefs-MDD_UUID 3
  7 UP qmt lustrefs-QMT lustrefs-QMT_UUID 3
  8 UP osp lustrefs-MDT0001-osp-MDT lustrefs-MDT-mdtlov_UUID 4
  9 UP osp lustrefs-OST0002-osc-MDT lustrefs-MDT-mdtlov_UUID 4
10 UP osp lustrefs-OST0003-osc-MDT lustrefs-MDT-mdtlov_UUID 4
11 UP osp lustrefs-OST-osc-MDT lustrefs-MDT-mdtlov_UUID 4
12 UP osp lustrefs-OST0001-osc-MDT lustrefs-MDT-mdtlov_UUID 4
13 UP lwp lustrefs-MDT-lwp-MDT lustrefs-MDT-lwp-MDT_UUID 4
14 UP lod lustrefs-MDT0001-mdtlov lustrefs-MDT0001-mdtlov_UUID 3
15 UP mdt lustrefs-MDT0001 lustrefs-MDT0001_UUID 14
16 UP mdd lustrefs-MDD0001 lustrefs-MDD0001_UUID 3
17 UP osp lustrefs-MDT-osp-MDT0001 lustrefs-MDT0001-mdtlov_UUID 4
18 UP osp lustrefs-OST0002-osc-MDT0001 lustrefs-MDT0001-mdtlov_UUID 4
19 UP osp lustrefs-OST0003-osc-MDT0001 lustrefs-MDT0001-mdtlov_UUID 4
20 UP osp lustrefs-OST-osc-MDT0001 lustrefs-MDT0001-mdtlov_UUID 4
21 UP osp lustrefs-OST0001-osc-MDT0001 lustrefs-MDT0001-mdtlov_UUID 4
22 UP lwp lustrefs-MDT-lwp-MDT0001 lustrefs-MDT-lwp-MDT0001_UUID 4



Thanks,
Pinkesh Valdria
Oracle Cloud Infrastructure
+65-8932-3639 (m) - Singapore
+1-425-205-7834 (m) - USA
https://blogs.oracle.com/author/pinkesh-valdria


[lustre-discuss] Lustre using RDMA (RoCEv2)

2021-07-07 Thread Pinkesh Valdria via lustre-discuss
This is my first attempt to configure Lustre for RDMA (Mellanox RoCEv2).

lnetctl net show
net:
- net type: lo
  local NI(s):
- nid: 0@lo
  status: up

The command below results in an error. The interface (ens800f0) is up and I can
ping other nodes on that network.
lnetctl net add --net o2ib --if ens800f0

add:
- net:
  errno: -100
  descr: "cannot add network: Network is down"


[root@inst-fknk9-relaxing-louse ~]# dmesg | tail
[ 1399.903159] Lustre: Lustre: Build Version: 2.12.6
[ 1427.411527] LNetError: 20092:0:(o2iblnd.c:2781:kiblnd_dev_failover()) Failed 
to bind ens800f0:192.168.169.112 to device(  (null)): -19
[ 1427.564213] LNetError: 20092:0:(o2iblnd.c:3314:kiblnd_startup()) ko2iblnd: 
Can't initialize device: rc = -19
[ 1428.681259] LNetError: 105-4: Error -100 starting up LNI o2ib
[ 1474.343671] LNetError: 20260:0:(o2iblnd.c:2781:kiblnd_dev_failover()) Failed 
to bind ens800f0:192.168.169.112 to device(  (null)): -19
[ 1474.496347] LNetError: 20260:0:(o2iblnd.c:3314:kiblnd_startup()) ko2iblnd: 
Can't initialize device: rc = -19
[ 1475.610993] LNetError: 105-4: Error -100 starting up LNI o2ib
[ 1535.441463] LNetError: 20549:0:(o2iblnd.c:2781:kiblnd_dev_failover()) Failed 
to bind ens800f0:192.168.169.112 to device(  (null)): -19
[ 1535.594183] LNetError: 20549:0:(o2iblnd.c:3314:kiblnd_startup()) ko2iblnd: 
Can't initialize device: rc = -19
[ 1536.709841] LNetError: 105-4: Error -100 starting up LNI o2ib



Interface ens800f0 is the 100 Gbps RDMA Mellanox NIC:
ip addr
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
2: ens300f0:  mtu 9000 qdisc mq state UP group 
default qlen 1000
link/ether b8:ce:f6:25:ff:5e brd ff:ff:ff:ff:ff:ff
inet 172.16.5.112/22 brd 172.16.7.255 scope global dynamic ens300f0
   valid_lft 84734sec preferred_lft 84734sec
3: ens300f1:  mtu 1500 qdisc mq state UP group 
default qlen 1000
link/ether b8:ce:f6:25:ff:5f brd ff:ff:ff:ff:ff:ff
4: ens800f0:  mtu 1500 qdisc mq state UP group 
default qlen 1000
link/ether 04:3f:72:e3:08:42 brd ff:ff:ff:ff:ff:ff
inet 192.168.169.112/22 brd 192.168.171.255 scope global ens800f0
   valid_lft forever preferred_lft forever
5: ens800f1:  mtu 1500 qdisc mq state DOWN 
group default qlen 1000
link/ether 04:3f:72:e3:08:43 brd ff:ff:ff:ff:ff:ff


OS:  RHCK 7.9  3.10.0-1160.2.1.el7_lustre.x86_64
OFED:  Mellanox
ofed_info -n
4.9-3.1.5.0


cat /etc/lnet.conf  is empty

cat   /etc/modprobe.d/lnet.conf
cat: /etc/modprobe.d/lnet.conf: No such file or directory



[root@inst-fknk9-relaxing-louse ~]# modprobe -v lustre
insmod 
/lib/modules/3.10.0-1160.2.1.el7_lustre.x86_64/extra/lustre/fs/obdclass.ko
insmod /lib/modules/3.10.0-1160.2.1.el7_lustre.x86_64/extra/lustre/fs/ptlrpc.ko
insmod /lib/modules/3.10.0-1160.2.1.el7_lustre.x86_64/extra/lustre/fs/fld.ko
insmod /lib/modules/3.10.0-1160.2.1.el7_lustre.x86_64/extra/lustre/fs/fid.ko
insmod /lib/modules/3.10.0-1160.2.1.el7_lustre.x86_64/extra/lustre/fs/lov.ko
insmod /lib/modules/3.10.0-1160.2.1.el7_lustre.x86_64/extra/lustre/fs/osc.ko
insmod /lib/modules/3.10.0-1160.2.1.el7_lustre.x86_64/extra/lustre/fs/mdc.ko
insmod /lib/modules/3.10.0-1160.2.1.el7_lustre.x86_64/extra/lustre/fs/lmv.ko
insmod /lib/modules/3.10.0-1160.2.1.el7_lustre.x86_64/extra/lustre/fs/lustre.ko
[root@inst-fknk9-relaxing-louse ~]#


Based on discussion threads found via Google search, one thread said to add the
following, but I still get the same error:
echo 'options lnet networks="o2ib(ens800f0)" ' > /etc/modprobe.d/lustre.conf
echo 'options lnet networks="o2ib(ens800f0)" ' > /etc/modprobe.d/lnet.conf
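
For completeness, these are the sanity checks I ran to confirm the RDMA stack
itself is up (ibdev2netdev ships with MOFED; my assumption is that the
-19/ENODEV errors mean LNet cannot find an RDMA device behind ens800f0):

# confirm the Mellanox RDMA kernel modules are loaded
lsmod | grep -E 'mlx5_ib|mlx5_core|rdma_cm'
# list RDMA devices and their netdev mapping (ibdev2netdev is a MOFED utility)
ibv_devices
ibdev2netdev
# confirm the LNet o2ib LND module itself loads
modprobe -v ko2iblnd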




Thanks,
Pinkesh Valdria
Principal Solutions Architect – HPC



Re: [lustre-discuss] MOFED & Lustre 2.14.51 - install fails with dependency failure related to ksym/MOFED

2021-05-25 Thread Pinkesh Valdria via lustre-discuss

Running out of ideas. I also searched old messages on this distro and on Google
and found an unanswered question from Aug 2020:
https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg16346.html


Hello Laura,
I tried your recommendation of passing --kmp to mlnx_add_kernel_support.sh; my
steps are below, but I still get the same ksym error for the Lustre client. I
also see that KMP support is still not enabled (maybe KMP support is only
available on Red Hat & SUSE, but not CentOS, Oracle Linux, etc. - based on this
link:
https://docs.mellanox.com/display/MLNXOFEDv461000/Installing+Mellanox+OFED )

Step1:  On build server:
./mlnx_add_kernel_support.sh --make-tgz --verbose --yes --kernel 
3.10.0-1160.15.2.el7_lustre.x86_64 --kernel-sources 
/usr/src/kernels/3.10.0-1160.15.2.el7_lustre.x86_64 --tmpdir /tmp --distro 
ol7.9 --mlnx_ofed /root/MLNX_OFED_LINUX-5.3-1.0.0.1-ol7.9-x86_64 --kmp
….
Detected MLNX_OFED_LINUX-5.3-1.0.0.1
….
Building MLNX_OFED_LINUX RPMS . Please wait...
….

Running MLNX_OFED_SRC-5.3-1.0.0.1/install.pl --tmpdir /tmp/mlnx_iso.7168_logs 
--kernel-only --kernel 3.10.0-1160.15.2.el7_lustre.x86_64 --kernel-sources 
/usr/src/kernels/3.10.0-1160.15.2.el7_lustre.x86_64 --builddir 
/tmp/mlnx_iso.7168 --build-only --distro ol7.9 --bump-kmp-version 202105221109
….
Creating metadata-rpms for 3.10.0-1160.15.2.el7_lustre.x86_64 ...
Created /tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-ol7.9-x86_64-ext.tgz

Then:
./mlnxofedinstall --kernel 3.10.0-1160.15.2.el7_lustre.x86_64 --kernel-sources 
/usr/src/kernels/3.10.0-1160.15.2.el7_lustre.x86_64 --add-kernel-support 
--skip-repo --skip-distro-check --distro ol7.9 --kmp
…..
Installation finished successfully.
…
Updating / installing...
   1:mlnx-fw-updater-5.3-1.0.0.1  # [100%]
Failed to update Firmware.
See /tmp/MLNX_OFED_LINUX.235507.logs/fw_update.log
…
To load the new driver, run:
/etc/init.d/openibd restart

Ran /etc/init.d/openibd restart
Unloading HCA driver:  [  OK  ]
Loading HCA driver and Access Layer:   [  OK  ]



Step2:  On build server:  Create Lustre client package
./configure --disable-server --enable-client \
--with-linux=/usr/src/kernels/*_lustre.x86_64 \
--with-o2ib=/usr/src/ofa_kernel/default

make rpms


Step3:  On Lustre client node:  Install MOFED
Untar the MOFED package from Step1.
Ran mlnxofedinstall (I tried running with and without --kmp, but got the same
ksym error).


  a.  Passing the --kmp parameter
mlnxofedinstall --force --kernel 3.10.0-1160.15.2.el7_lustre.x86_64  
--kernel-sources /usr/src/kernels/3.10.0-1160.15.2.el7_lustre.x86_64 
--skip-distro-check --distro ol7.9 --kmp

  b.  Not passing the --kmp parameter
mlnxofedinstall --force --kernel 3.10.0-1160.15.2.el7_lustre.x86_64  
--kernel-sources /usr/src/kernels/3.10.0-1160.15.2.el7_lustre.x86_64 
--skip-distro-check --distro ol7.9



Step4:  On Lustre client node:

yum localinstall lustre-client-2.14.51-1.el7.x86_64.rpm 
kmod-lustre-client-2.14.51-1.el7.x86_64.rpm
……
Error: Package: kmod-lustre-client-2.14.51-1.el7.x86_64 
(/kmod-lustre-client-2.14.51-1.el7.x86_64)
   Requires: ksym(__ib_create_cq) = 0x1bb05802
Error: Package: kmod-lustre-client-2.14.51-1.el7.x86_64 
(/kmod-lustre-client-2.14.51-1.el7.x86_64)
   Requires: ksym(rdma_listen) = 0xf6bd553e
…..

I tried 3 different scenarios:

  1.  Copied the Lustre client rpms from the build server to the Lustre client
node and ran the above command - it failed.
  2.  On the Lustre client node, built the Lustre client packages after running
step 3a (mlnxofedinstall with --kmp).
  3.  On the Lustre client node, built the Lustre client packages after running
step 3b (mlnxofedinstall without --kmp).


Any other suggestions?
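
In case it helps, this is how I checked (on the client) whether the installed
MOFED kmod packages actually export the ksym() provides that kmod-lustre-client
requires; the exact package names below are from my setup and may differ:

# list installed MOFED kernel-module packages
rpm -qa | grep -E 'mlnx-ofa_kernel|kmod-mlnx'
# check whether they advertise ksym() provides at all
rpm -q --provides mlnx-ofa_kernel-modules 2>/dev/null | grep -i ksym | head
# compare against what the Lustre client kmod requires
rpm -qp --requires kmod-lustre-client-2.14.51-1.el7.x86_64.rpm | grep ksym | head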


Sidenote:
For the Lustre server, I found the workaround below; I am not sure yet whether
it will create issues once I mount and run Lustre.
Previously, I was installing Lustre on the servers using the command below and
getting ksym errors:

sudo yum install lustre-tests -y

But if I use the following, the install works:
rpm -ivh --nodeps lustre-2.14.51-1.el7.x86_64.rpm  
kmod-lustre-2.14.51-1.el7.x86_64.rpm 
kmod-lustre-osd-ldiskfs-2.14.51-1.el7.x86_64.rpm 
lustre-osd-ldiskfs-mount-2.14.51-1.el7.x86_64.rpm 
lustre-resource-agents-2.14.51-1.el7.x86_64.rpm


I have tested "modprobe lnet" and "lnetctl net add"; the MGS/MGT mount works,
but the MDT mount fails with "Invalid filesystem option set:
dirdata,uninit_bg,^extents,dir_nlink,quota,project,huge_file,ea_inode,large_dir,flex_bg"
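
My working assumption is that this MDT mount error is an ldiskfs feature
mismatch (the features the MDT was formatted with vs. what the installed
ldiskfs/e2fsprogs supports); this is how I checked, though I have not confirmed
it is the root cause. <mdt-device> is a placeholder for the actual MDT block
device:

rpm -qa | grep -E 'e2fsprogs|lustre-osd-ldiskfs'
dumpe2fs -h /dev/<mdt-device> | grep -i features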



Thanks,
Pinkesh Valdria


From: Pinkesh Valdria 
Date: Friday, May 21, 2021 at 6:04 PM
To: "lustre-discuss@lists.lustre.org" 
Subject: MOFED & Lustre 2.14.51 - install fails with dependency failure related 
to ksym/MOFED

Sorry for the long email; I wanted to make sure I share enough details for the
community to provide guidance. I am building all Lustre packages for Oracle
Linux 7.9-RHCK and MOFED 5.3-1.0.0.1 using the steps described here:
https

[lustre-discuss] MOFED & Lustre 2.14.51 - install fails with dependency failure related to ksym/MOFED

2021-05-21 Thread Pinkesh Valdria via lustre-discuss
ustreserver)
   Requires: ksym(rdma_disconnect) = 0x49262e62
Error: Package: kmod-lustre-2.14.51-1.el7.x86_64 (hpddLustreserver)
   Requires: ksym(rdma_connect_locked) = 0x7eaa4a8a
….
…. All ib/rdma related errors similar to above for kmod-lustre.x
….
Error: Package: kmod-lustre-2.14.51-1.el7.x86_64 (hpddLustreserver)
   Requires: ksym(ib_destroy_cq_user) = 0x5671830b
You could try using --skip-broken to work around the problem
** Found 3 pre-existing rpmdb problem(s), 'yum check' output follows:
oracle-cloud-agent-1.11.1-5104.el7.x86_64 is a duplicate with 
oracle-cloud-agent-1.8.2-3843.el7.x86_64
rdma-core-devel-52mlnx1-1.53100.x86_64 has missing requires of 
pkgconfig(libnl-3.0)
rdma-core-devel-52mlnx1-1.53100.x86_64 has missing requires of 
pkgconfig(libnl-route-3.0)
[opc@inst-dwnv3-topical-goblin ~]$





RPMS from:  LDISKFS and Patching the Linux Kernel
ls lustre-kernel/RPMS/

  *   bpftool-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   bpftool-debuginfo-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-debug-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-debug-debuginfo-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-debug-devel-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-debuginfo-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-debuginfo-common-x86_64-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-devel-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-headers-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-tools-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-tools-debuginfo-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-tools-libs-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   kernel-tools-libs-devel-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   perf-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   perf-debuginfo-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   python-perf-3.10.0-1160.15.2.el7_lustre.x86_64.rpm
  *   python-perf-debuginfo-3.10.0-1160.15.2.el7_lustre.x86_64.rpm


MOFED rpms

Steps followed:
Download from MLNX site the source:  MLNX_OFED_SRC-5.3-1.0.0.1.tgz
tar -zvxf $HOME/MLNX_OFED_SRC-5.3-1.0.0.1.tgz
cd MLNX_OFED_SRC-5.3-1.0.0.1/
./install.pl --build-only --kernel-only \
--kernel 3.10.0-1160.15.2.el7.x86_64 \
--kernel-sources /usr/src/kernels/3.10.0-1160.15.2.el7.x86_64

cp RPMS/*/*/*.rpm  $HOME/releases/mofed

Question: I am passing the regular kernel (3.10.0-1160.15.2.el7.x86_64) and its
sources (not the Lustre-patched kernel) as input to the MOFED install command
above; I hope that is correct.
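
For comparison, the variant I believe would build MOFED against the
Lustre-patched kernel instead (my assumption of what "correct" would look like,
not something I have verified):

./install.pl --build-only --kernel-only \
--kernel 3.10.0-1160.15.2.el7_lustre.x86_64 \
--kernel-sources /usr/src/kernels/3.10.0-1160.15.2.el7_lustre.x86_64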



  *   kernel-mft-4.16.3-12.kver.3.10.0_1160.15.2.el7.x86_64.x86_64.rpm
  *   knem-1.1.4.90mlnx1-OFED.5.1.2.5.0.1.ol7u9.x86_64.rpm
  *   
knem-modules-1.1.4.90mlnx1-OFED.5.1.2.5.0.1.kver.3.10.0_1160.15.2.el7.x86_64.x86_64.rpm
  *   
mlnx-nfsrdma-5.3-OFED.5.3.0.3.8.1.kver.3.10.0_1160.15.2.el7.x86_64.x86_64.rpm
  *   
mlnx-nfsrdma-debuginfo-5.3-OFED.5.3.0.3.8.1.kver.3.10.0_1160.15.2.el7.x86_64.x86_64.rpm
  *   mlnx-ofa_kernel-5.3-OFED.5.3.1.0.0.1.ol7u9.x86_64.rpm
  *   mlnx-ofa_kernel-debuginfo-5.3-OFED.5.3.1.0.0.1.ol7u9.x86_64.rpm
  *   mlnx-ofa_kernel-devel-5.3-OFED.5.3.1.0.0.1.ol7u9.x86_64.rpm
  *   
mlnx-ofa_kernel-modules-5.3-OFED.5.3.1.0.0.1.kver.3.10.0_1160.15.2.el7.x86_64.x86_64.rpm
  *   ofed-scripts-5.3-OFED.5.3.1.0.0.x86_64.rpm


Lustre Server packages

./configure --enable-server \
--with-linux=/usr/src/kernels/*_lustre.x86_64 \
--with-o2ib=/usr/src/ofa_kernel/default

make rpms



  *   kmod-lustre-2.14.51-1.el7.x86_64.rpm
  *   kmod-lustre-osd-ldiskfs-2.14.51-1.el7.x86_64.rpm
  *   kmod-lustre-tests-2.14.51-1.el7.x86_64.rpm
  *   lustre-2.14.51-1.el7.x86_64.rpm
  *   lustre-2.14.51-1.src.rpm
  *   lustre-debuginfo-2.14.51-1.el7.x86_64.rpm
  *   lustre-devel-2.14.51-1.el7.x86_64.rpm
  *   lustre-iokit-2.14.51-1.el7.x86_64.rpm
  *   lustre-osd-ldiskfs-mount-2.14.51-1.el7.x86_64.rpm
  *   lustre-resource-agents-2.14.51-1.el7.x86_64.rpm
  *   lustre-tests-2.14.51-1.el7.x86_64.rpm


Lustre Client packages

./configure --disable-server --enable-client \
--with-linux=/usr/src/kernels/*_lustre.x86_64 \
--with-o2ib=/usr/src/ofa_kernel/default

make rpms


  *   kmod-lustre-client-2.14.51-1.el7.x86_64.rpm
  *   kmod-lustre-client-tests-2.14.51-1.el7.x86_64.rpm
  *   lustre-2.14.51-1.src.rpm
  *   lustre-client-2.14.51-1.el7.x86_64.rpm
  *   lustre-client-debuginfo-2.14.51-1.el7.x86_64.rpm
  *   lustre-client-devel-2.14.51-1.el7.x86_64.rpm
  *   lustre-client-tests-2.14.51-1.el7.x86_64.rpm
  *   lustre-iokit-2.14.51-1.el7.x86_64.rpm



Thanks,
Pinkesh Valdria
Principal Solutions Architect – HPC
Oracle Cloud Infrastructure
+65-8932-3639 (m) - Singapore
+1-425-205-7834 (m) - USA



Re: [lustre-discuss] [External] : lustre-discuss Digest, Vol 179, Issue 26

2021-02-26 Thread Pinkesh Valdria via lustre-discuss
Nikitas, 

Your steps were perfect.  It worked. I am able to compile the client. 

@Andreas Dilger - I am happy to add a wiki page with the steps I followed to
get it compiled, or fix the existing page, if I can get a login account to make
updates. Similarly, I would like to add how to compile Lustre for Oracle Linux
UEK kernels, if that's okay.
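
For the archives, the sequence that worked for me was roughly the following
(combining Nikitas' suggestion with my earlier steps; paths are from my
environment):

cd /root/linux-oracle
cp /boot/config-5.4.0-1035-oracle .config
make oldconfig
make modules_prepare
cd /root/lustre-release
./configure --disable-server --with-linux=/root/linux-oracle
make debs    # Debian packages; I believe this is the right target on Ubuntu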



Thanks,
Pinkesh Valdria
Principal Solutions Architect – HPC
Oracle Cloud Infrastructure
+65-8932-3639 (m) - Singapore 
+1-425-205-7834 (m) - USA

 

On 2/26/21, 12:46 AM, "lustre-discuss on behalf of 
lustre-discuss-requ...@lists.lustre.org" 
 wrote:

Send lustre-discuss mailing list submissions to
lustre-discuss@lists.lustre.org

To subscribe or unsubscribe via the World Wide Web, visit

https://urldefense.com/v3/__http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org__;!!GqivPVa7Brio!JOuAP3APa1PYUYllxZG6H17Z9x96It0TNVvuN9BHUxVx5jj95BZllCbFIsz6eoZZiqC7$
 
or, via email, send a message with subject or body 'help' to
lustre-discuss-requ...@lists.lustre.org

You can reach the person managing the list at
lustre-discuss-ow...@lists.lustre.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of lustre-discuss digest..."


Today's Topics:

   1. Re: Lustre Client compile on Ubuntu18.04 failing
  (Nikitas Angelinas)


--

Message: 1
Date: Fri, 26 Feb 2021 08:46:19 +0000
From: Nikitas Angelinas 
To: Pinkesh Valdria 
Cc: "lustre-discuss@lists.lustre.org"

Subject: Re: [lustre-discuss] Lustre Client compile on Ubuntu18.04
failing
Message-ID: <2bf290dd-a3c8-4727-8c87-012074c48...@cray.com>
Content-Type: text/plain; charset="utf-8"

Hi Pinkesh,

Could you please try running "make oldconfig" and then "make
modules_prepare" in the kernel source after copying the .config? The latter
command should generate the missing files
/root/linux-oracle/include/generated/autoconf.h and
/root/linux-oracle/include/linux/autoconf.h.


Cheers,
    Nikitas

On 2/25/21, 10:14 PM, "lustre-discuss on behalf of Pinkesh Valdria via
lustre-discuss" <lustre-discuss-boun...@lists.lustre.org on behalf of
lustre-discuss@lists.lustre.org> wrote:

Hello All,

I am trying to compile the Lustre client (2.13.57) on Ubuntu 18.04 and I am
following the steps listed here:
https://urldefense.com/v3/__https://wiki.whamcloud.com/pages/viewpage.action?pageId=63968116__;!!GqivPVa7Brio!JOuAP3APa1PYUYllxZG6H17Z9x96It0TNVvuN9BHUxVx5jj95BZllCbFIsz6ehaK0R8W$
 , but it's failing with the error below.

Any pointers/advice on what I am missing?

uname -r
5.4.0-1035-oracle

# Using this one
cd /root
git clone 
https://urldefense.com/v3/__https://git.launchpad.net/*canonical-kernel/ubuntu/*source/linux-oracle__;fis!!GqivPVa7Brio!JOuAP3APa1PYUYllxZG6H17Z9x96It0TNVvuN9BHUxVx5jj95BZllCbFIsz6esq05dSq$
 
cd linux-oracle/
git checkout Ubuntu-oracle-5.4-5.4.0-1035.38_18.04.1


BUILDPATH=/root
cd ${BUILDPATH}
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout 2.13.57
git reset --hard && git clean -dfx
sh autogen.sh
./configure --disable-server --with-linux=/root/linux-oracle
# above command fails, saying .config file is missing
CONFIG_RETPOLINE=y
checking for Linux sources... /root/linux-oracle
checking for /root/linux-oracle... yes
checking for Linux objects... /root/linux-oracle
checking for /root/linux-oracle/.config... no
configure: error:

Kernel config could not be found.



cp /boot/config-5.4.0-1035-oracle /root/linux-oracle/.config

# Re-ran it
./configure --disable-server --with-linux=/root/linux-oracle


checking for swig2.0... no
yes
checking whether to build Lustre client support... yes
dirname: missing operand
Try 'dirname --help' for more information.
checking whether mpitests can be built... no
checking whether to build Linux kernel modules... yes (linux-gnu)
find: '/usr/src/kernels/': No such file or directory
checking for Linux sources... /root/linux-oracle
checking for /root/linux-oracle... yes
checking for Linux objects... /root/linux-oracle
checking for /root/linux-oracle/.config... yes
checking for /boot/kernel.h... no
checking for /var/adm/running-kernel.h... no
checking for /root/linux-oracle/include/generated/autoconf.h... no
checking for /root/linux-oracle/include/linux/autoconf.h... no
configure: error: Run make config in /root/linux-oracle.
root@lustre-client-2-12-4-ubuntu1804:~/lustre-release#




Thanks,
Pinkesh Valdria
Principal Solut

[lustre-discuss] Lustre Client compile on Ubuntu18.04 failing

2021-02-25 Thread Pinkesh Valdria via lustre-discuss
Hello All,

I am trying to compile the Lustre client (2.13.57) on Ubuntu 18.04 and I am
following the steps listed here:
https://wiki.whamcloud.com/pages/viewpage.action?pageId=63968116, but it's
failing with the error below.

Any pointers/advice on what I am missing?

uname -r
5.4.0-1035-oracle

# Using this one
cd /root
git clone 
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-oracle
cd linux-oracle/
git checkout Ubuntu-oracle-5.4-5.4.0-1035.38_18.04.1


BUILDPATH=/root
cd ${BUILDPATH}
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout 2.13.57
git reset --hard && git clean -dfx
sh autogen.sh
./configure --disable-server --with-linux=/root/linux-oracle
# above command fails, saying .config file is missing
CONFIG_RETPOLINE=y
checking for Linux sources... /root/linux-oracle
checking for /root/linux-oracle... yes
checking for Linux objects... /root/linux-oracle
checking for /root/linux-oracle/.config... no
configure: error:

Kernel config could not be found.



cp /boot/config-5.4.0-1035-oracle /root/linux-oracle/.config

# Re-ran it
./configure --disable-server --with-linux=/root/linux-oracle


checking for swig2.0... no
yes
checking whether to build Lustre client support... yes
dirname: missing operand
Try 'dirname --help' for more information.
checking whether mpitests can be built... no
checking whether to build Linux kernel modules... yes (linux-gnu)
find: '/usr/src/kernels/': No such file or directory
checking for Linux sources... /root/linux-oracle
checking for /root/linux-oracle... yes
checking for Linux objects... /root/linux-oracle
checking for /root/linux-oracle/.config... yes
checking for /boot/kernel.h... no
checking for /var/adm/running-kernel.h... no
checking for /root/linux-oracle/include/generated/autoconf.h... no
checking for /root/linux-oracle/include/linux/autoconf.h... no
configure: error: Run make config in /root/linux-oracle.
root@lustre-client-2-12-4-ubuntu1804:~/lustre-release#




Thanks,
Pinkesh Valdria
Principal Solutions Architect – HPC
Oracle Cloud Infrastructure
+65-8932-3639 (m) - Singapore
+1-425-205-7834 (m) - USA



[lustre-discuss] Complete list of rules for PCC

2020-08-25 Thread Pinkesh Valdria
I am looking for the various policy rules that can be applied for the Lustre
Persistent Client Cache (PCC). In the docs, I see the example below using
projid, fname and uid. Where can I find a complete list of supported rules?

 

Also, is there a way for PCC to only cache the contents of a few folders?

http://doc.lustre.org/lustre_manual.xhtml#pcc.design.rules

 

 

The following command adds a PCC backend on a client:

client# lctl pcc add /mnt/lustre /mnt/pcc  --param 
"projid={500,1000}&fname={*.h5},uid=1001 rwid=2"

The first substring of the config parameter is the auto-cache rule, where "&" 
represents the logical AND operator while "," represents the logical OR 
operator. The example rule means that new files are only auto cached if either 
of the following conditions are satisfied:
The project ID is either 500 or 1000 and the suffix of the file name is "h5";
The user ID is 1001;
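
As a concrete example of the kind of rule I would like to express (adapting the
manual's syntax above - untested on my side): cache only *.h5 and *.nc files
owned by uid 1001:

client# lctl pcc add /mnt/lustre /mnt/pcc --param "uid=1001&fname={*.h5,*.nc} rwid=2"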
 



[lustre-discuss] Bulk Attach/Detach - Lustre PCC (Persistent Client Cache)

2020-08-24 Thread Pinkesh Valdria
I am new to the Lustre PCC (Persistent Client Cache) feature. I was looking at
the PCC section of the Lustre documentation and found how I can attach or
detach a file from PCC.

http://doc.lustre.org/lustre_manual.xhtml#pcc.operations.detach

 

 

Question: Is there a command where I can specify a folder and have all files
under that folder and its sub-folders attached or detached?

 

How is everyone doing bulk attach or detach? Are folks using a custom script to
traverse a directory tree (recursively) and, for each file found, call
"lfs pcc detach <file>", or is there another command that I missed in the docs?
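
For reference, the kind of wrapper I had in mind is just this (a sketch; the
paths are made up, I have not checked how well it scales, and the archive ID 2
is only an example that must match the backend's rwid):

# detach every regular file under a directory tree from PCC
find /mnt/lustre/projectA -type f -print0 | xargs -0 -n 64 lfs pcc detach

# attach is analogous
find /mnt/lustre/projectA -type f -print0 | xargs -0 -n 64 lfs pcc attach -i 2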

 

 

 

Thanks, 

Pinkesh Valdria

 



[lustre-discuss] Lustre (latest) access via NFSv4

2020-06-04 Thread Pinkesh Valdria
Can Lustre be accessed via NFSv4? I know we can use NFSv3, but wanted to ask
about NFSv4 support.

 

 



[lustre-discuss] NFS Client Attributes caching - equivalent feature/config in Lustre

2020-04-21 Thread Pinkesh Valdria
Does Lustre have mount options to mimic the NFS mount option behavior listed
below?

 

I know that in most cases Lustre performs much better than NFS and can scale to
support a lot of clients in parallel. I have a use case where only a few
clients access the filesystem and the files are really small (but number in the
millions) and are very infrequently updated. The files are stored on an NFS
server and mounted on the clients with the mount options below, which results
in caching of file attributes/metadata on the client, reduces the number of
metadata calls, and delivers better performance.

 

NFS mount options

type nfs (rw,nolock,nocto,actimeo=900,nfsvers=3,proto=tcp)

 

A custom proprietary application that compiles (make command) some of these
files takes 20-24 seconds to run. The same command, when run against the same
files stored on a BeeGFS parallel filesystem, takes 80-90 seconds (4x slower),
mainly because there is no client caching in BeeGFS and the client has to make
many more metadata calls than NFS with attribute caching.

 

Question

I have already tried BeeGFS, and I am asking this question to determine whether
Lustre performance would be better than NFS for very-small-file workloads
(50-byte, 200-byte, 2 KB files) with 5 million files spread across nested
directories. Does Lustre have mount options to mimic the NFS mount option
behavior listed below, or is there some optional feature in Lustre to achieve
this caching behavior?

 

 

 

https://linux.die.net/man/5/nfs

 

ac / noac

Selects whether the client may cache file attributes. If neither option is 
specified (or if ac is specified), the client caches file attributes.

 

For my custom applications, caching of file attributes is fine (no negative
impact) and it helps improve NFS performance.

 

actimeo=n

Using actimeo sets all of acregmin, acregmax, acdirmin, and acdirmax to the 
same value. If this option is not specified, the NFS client uses the defaults 
for each of these options listed above.

For my applications, it's okay to cache file attributes/metadata for a few
minutes (e.g. 5 minutes). Setting this value can reduce the number of metadata
calls made to the server; especially with filesystems storing lots of small
files, that is a huge performance penalty which can be avoided.

 

nolock

 

When mounting servers that do not support the NLM protocol, or when mounting an 
NFS server through a firewall that blocks the NLM service port, specify the 
nolock mount option. Specifying the nolock option may also be advised to 
improve the performance of a proprietary application which runs on a single 
client and uses file locks extensively.

 

 

Appreciate any guidance.  

 

Thanks,

pinkesh valdria



[lustre-discuss] Lustre with 100 Gbps Mellanox CX5 card

2020-01-22 Thread Pinkesh Valdria
Hello Lustre Community, 

 

I am trying to configure Lustre for a 100 Gbps Mellanox CX5 card. I tried
version 2.12.3 first, but it failed when I ran "lnetctl net add --net o2ib0
--if enp94s0f0", so I started looking at the Lustre binaries and found the
repos below for IB.

Is the below a special build for Mellanox cards, or should I still be using the
common Lustre binaries which are also used for tcp/ksocklnd networks?
the common Lustre binaries which are also used for tcp/ksocklnd networks. 

 

 

[hpddLustreserver]

name=CentOS- - Lustre

baseurl=https://downloads.whamcloud.com/public/lustre/lustre-2.13.0-ib/MOFED-4.7-1.0.0.1/el7/server/

gpgcheck=0

 

[e2fsprogs]

name=CentOS- - Ldiskfs

baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el7/

gpgcheck=0

 

[hpddLustreclient]

name=CentOS- - Lustre

baseurl=https://downloads.whamcloud.com/public/lustre/lustre-2.13.0-ib/MOFED-4.7-1.0.0.1/el7/client/

gpgcheck=0

 

When I use the above repos, the command below returns success, but the options
I passed are not taking effect. NIC enp94s0f0 is my 100 Gbps card.

lnetctl net add --net o2ib0 --if enp94s0f0 --peer-timeout 100 --peer-credits 16
--credits 2560

 

Similarly, when I try to configure some options via the file
/etc/modprobe.d/ko2iblnd.conf (contents below), they are not taking effect and
are not reflected when I run "lnetctl net show":

 

cat  /etc/modprobe.d/ko2iblnd.conf

alias ko2iblnd ko2iblnd

options ko2iblnd map_on_demand=256 concurrent_sends=63 peercredits_hiw=31 
fmr_pool_size=1280 fmr_flush_trigger=1024 fmr_cache=1
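
For completeness, the reload sequence I would use to make ko2iblnd re-read
these options (my assumption being that modprobe.d options only apply when the
module is loaded, so the LNDs have to be unloaded and reloaded):

lustre_rmmod
modprobe lnet
lnetctl lnet configure
lnetctl net add --net o2ib0 --if enp94s0f0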

 

 

lnetctl net show -v --net o2ib

net:

    - net type: o2ib

  local NI(s):

    - nid: 192.168.1.2@o2ib

  status: up

  interfaces:

  0: enp94s0f0

  statistics:

  send_count: 0

  recv_count: 0

  drop_count: 0

  tunables:

  peer_timeout: 100

  peer_credits: 16

  peer_buffer_credits: 0

  credits: 2560

  peercredits_hiw: 8

  map_on_demand: 0

  concurrent_sends: 16

  fmr_pool_size: 512

  fmr_flush_trigger: 384

  fmr_cache: 1

  ntx: 512

  conns_per_peer: 1

  lnd tunables:

  dev cpt: 0

  tcp bonding: 0

  CPT: "[0,1]"

[root@inst-ran1f-lustre ~]#

 

 



Re: [lustre-discuss] Degraded read performance with Large Bulk IO (16MB RPC)

2020-01-22 Thread Pinkesh Valdria
To close the loop on this topic.   

 

The parameters below were not set by default and hence were not showing up in
the lctl list_param output; I had to set them first:

lctl set_param llite.*.max_read_ahead_mb=256

lctl set_param llite.*.max_read_ahead_per_file_mb=256
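
In case it is useful to others: to make these survive a client remount, I
believe the persistent form set on the MGS is the following (I have not
verified this myself):

lctl set_param -P llite.*.max_read_ahead_mb=256
lctl set_param -P llite.*.max_read_ahead_per_file_mb=256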

 

 

Thanks to the Lustre community for their help with tuning; I was able to tune
Lustre on Oracle Cloud Infrastructure to get good performance on bare metal
nodes with a 2x25 Gbps network. We have open sourced the deployment of Lustre
on Oracle Cloud, as well as all the performance tuning done at the
infrastructure level and at the Lustre filesystem level, for everyone to
benefit from it.

 

https://github.com/oracle-quickstart/oci-lustre

Terraform files are in :  
https://github.com/oracle-quickstart/oci-lustre/tree/master/terraform

Tuning scripts are in this folder:  
https://github.com/oracle-quickstart/oci-lustre/tree/master/scripts

 

 

As a next step, I plan to test deployment of Lustre on a 100 Gbps RoCEv2 RDMA
network (Mellanox CX5).

 

 

Thanks, 

Pinkesh Valdria 

Oracle Cloud – Principal Solutions Architect 

https://blogs.oracle.com/cloud-infrastructure/lustre-file-system-performance-on-oracle-cloud-infrastructure

https://blogs.oracle.com/author/pinkesh-valdria

 

 

From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Friday, December 13, 2019 at 11:14 AM
To: "Moreno Diego (ID SIS)" , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

I ran the latest command you provided and it does not show the parameter like
you see. I can do a screenshare.

 

 

[opc@lustre-client-1 ~]$ df -h

Filesystem  Size  Used Avail Use% Mounted on

/dev/sda339G  2.5G   36G   7% /

devtmpfs158G 0  158G   0% /dev

tmpfs   158G 0  158G   0% /dev/shm

tmpfs   158G   17M  158G   1% /run

tmpfs   158G 0  158G   0% /sys/fs/cgroup

/dev/sda1   512M   12M  501M   3% /boot/efi

10.0.3.6@tcp1:/lfsbv 50T   89M   48T   1% /mnt/mdt_bv

10.0.3.6@tcp1:/lfsnvme  185T  8.7M  176T   1% /mnt/mdt_nvme

tmpfs32G 0   32G   0% /run/user/1000

 

 

[opc@lustre-client-1 ~]$ lctl list_param -R llite | grep max_read_ahead

[opc@lustre-client-1 ~]$

 

So I ran this: 

 

[opc@lustre-client-1 ~]$ lctl list_param -R llite  >  llite_parameters.txt

 

There are other parameters under llite.   I attached the complete list. 

 

 

From: "Moreno Diego (ID SIS)" 
Date: Friday, December 13, 2019 at 8:36 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

From what I can see I think you just ran the wrong command (lctl list_param -R
* ) or it doesn’t work as you expected on 2.12.3.

 

But llite params are sure there on a *mounted* Lustre client. 

 

This will give you the parameters you’re looking for and need to modify to 
have, likely, better read performance:

 

lctl list_param -R llite | grep max_read_ahead

 

 

From: Pinkesh Valdria 
Date: Friday, 13 December 2019 at 17:33
To: "Moreno Diego (ID SIS)" , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

This is how I installed lustre clients (only showing packages installed steps). 

 

 

cat > /etc/yum.repos.d/lustre.repo << EOF

[hpddLustreserver]

name=CentOS- - Lustre

baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el7/server/

gpgcheck=0

 

[e2fsprogs]

name=CentOS- - Ldiskfs

baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el7/

gpgcheck=0

 

[hpddLustreclient]

name=CentOS- - Lustre

baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el7/client/

gpgcheck=0

EOF

 

yum  install  lustre-client  -y

 

reboot

 

 

 

From: "Moreno Diego (ID SIS)" 
Date: Friday, December 13, 2019 at 2:55 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

From what I can see they exist on my 2.12.3 client node:

 

[root@rufus4 ~]# lctl list_param -R llite | grep max_read_ahead

llite.reprofs-9f7c3b4a8800.max_read_ahead_mb

llite.reprofs-9f7c3b4a8800.max_read_ahead_per_file_mb

llite.reprofs-9f7c3b4a8800.max_read_ahead_whole_mb

 

Regards,

 

Diego

 

 

From: Pinkesh Valdria 
Date: Wednesday, 11 December 2019 at 17:46
To: "Moreno Diego (ID SIS)" , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

I was not able to find those parameters on my client nodes,  OSS or MGS nodes.  
 Here is how I was extracting all parameters .  


Re: [lustre-discuss] Lemur Lustre - make rpm fails

2020-01-08 Thread Pinkesh Valdria
Hello Nathaniel, 

As a workaround,  is there an older lemur rpm version or older Lustre version I 
should use to unblock myself? 
https://github.com/whamcloud/lemur/issues/7

https://github.com/whamcloud/lemur/issues/8

Thanks,
Pinkesh Valdria


On 12/11/19, 6:31 AM, "Pinkesh Valdria"  wrote:

Hi Nathaniel, 

I have an issue ticket opened:  https://github.com/whamcloud/lemur/issues/7

I tried to do it locally; that also fails, and the error is given below.

[root@lustre-client-4 lemur]# lfs --version
lfs 2.12.3
[root@lustre-client-4 lemur]# uname -a
Linux lustre-client-4 3.10.0-1062.7.1.el7.x86_64 #1 SMP Mon Dec 2 17:33:29 
UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
 [root@lustre-client-4 lemur]# lsb_release -r
Release:7.6.1810
[root@lustre-client-4 lemur]#


[root@lustre-client-4 lemur]# make local-rpm
make -C packaging/rpm NAME=lemur VERSION=0.6.0_4_g4655df8 RELEASE=1 
URL="https://github.com/intel-hpdd/lemur";
make[1]: Entering directory `/root/lemur/packaging/rpm'
cd ../../ && \
.



github.com/intel-hpdd/lemur/vendor/github.com/aws/aws-sdk-go/service/s3/s3manager
github.com/intel-hpdd/lemur/cmd/lhsm-plugin-s3
install -d $(dirname 
/root/rpmbuild/BUILDROOT/lemur-hsm-agent-0.6.0_4_g4655df8-1.x86_64//usr/bin/lhsm-plugin-s3)
install -m 755 lhsm-plugin-s3 
/root/rpmbuild/BUILDROOT/lemur-hsm-agent-0.6.0_4_g4655df8-1.x86_64//usr/bin/lhsm-plugin-s3
go build -v -i -ldflags "-X 'main.version=0.6.0_4_g4655df8'" -o lhsm 
./cmd/lhsm
github.com/intel-hpdd/lemur/vendor/github.com/intel-hpdd/go-lustre/pkg/pool
github.com/intel-hpdd/lemur/cmd/lhsmd/agent/fileid
github.com/intel-hpdd/lemur/vendor/github.com/intel-hpdd/go-lustre/llapi
github.com/intel-hpdd/lemur/vendor/gopkg.in/yaml.v2
github.com/intel-hpdd/lemur/vendor/gopkg.in/urfave/cli.v1
# github.com/intel-hpdd/lemur/vendor/github.com/intel-hpdd/go-lustre/llapi
cgo-gcc-prolog: In function '_cgo_c110903d49cd_C2func_llapi_get_version':
cgo-gcc-prolog:58:2: warning: 'llapi_get_version' is deprecated (declared 
at /usr/include/lustre/lustreapi.h:398) [-Wdeprecated-declarations]
cgo-gcc-prolog: In function '_cgo_c110903d49cd_Cfunc_llapi_get_version':
cgo-gcc-prolog:107:2: warning: 'llapi_get_version' is deprecated (declared 
at /usr/include/lustre/lustreapi.h:398) [-Wdeprecated-declarations]
# github.com/intel-hpdd/lemur/vendor/github.com/intel-hpdd/go-lustre/llapi
vendor/github.com/intel-hpdd/go-lustre/llapi/changelog.go:273:39: cannot 
use _Ctype_int(r.flags) (type _Ctype_int) as type int32 in argument to 
_Cfunc_hsm_get_cl_flags
make[2]: *** [lhsm] Error 2
make[2]: Leaving directory 
`/root/rpmbuild/BUILD/lemur-0.6.0_4_g4655df8/src/github.com/intel-hpdd/lemur'
error: Bad exit status from /var/tmp/rpm-tmp.cPPeEL (%install)


RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.cPPeEL (%install)
make[1]: *** [rpm] Error 1
make[1]: Leaving directory `/root/lemur/packaging/rpm'
    make: *** [local-rpm] Error 2
[root@lustre-client-4 lemur]#



Thanks,
Pinkesh Valdria 



On 12/10/19, 4:55 AM, "lustre-discuss on behalf of Nathaniel Clark" 
 
wrote:

Can you open a ticket for this on


https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_whamcloud_lemur_issues&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=HpfvG0tozSl7HgJJuyxxo2149EjwqpQDE7ytv-4sZuI&m=dvUy7ZhvTpzQ9yJzUhQmk0UHrXXOGiSc2X1_Sm5yOhY&s=aD2CBP6CmEF14pb7PM2A-H4aFyzbd09y5IRcQXqIHj8&e=
 

And possibly


https://urldefense.proofpoint.com/v2/url?u=https-3A__jira.whamcloud.com_projects_LMR&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=HpfvG0tozSl7HgJJuyxxo2149EjwqpQDE7ytv-4sZuI&m=dvUy7ZhvTpzQ9yJzUhQmk0UHrXXOGiSc2X1_Sm5yOhY&s=SoLFFKtz2XY9CNh4vyFssmTyhvmkIqyABrH_FzZzUQk&e=
 



You could also try:

$ make local-rpm



Which will avoid the docker stack and just build on the local machine

(beware it sudo's to install rpm build dependencies).





-- 

    Nathaniel Clark 

Senior Engineer

Whamcloud / DDN



On Mon, 2019-12-09 at 15:04 -0800, Pinkesh Valdria wrote:

> I am trying to install Lemur on CentOS 7.6 (7.6.1810) to integrate

> with Object storage but the install fails.   I used the instructions

> on below page to install.  I already had Lustre client (2.12.3)

> installed on the machine,  so I started 

Re: [lustre-discuss] Degraded read performance with Large Bulk IO (16MB RPC)

2019-12-13 Thread Pinkesh Valdria
I ran the latest command you provided and it does not show the parameter like
you see. I can do a screenshare.

 

 

[opc@lustre-client-1 ~]$ df -h

Filesystem  Size  Used Avail Use% Mounted on

/dev/sda3    39G  2.5G   36G   7% /

devtmpfs    158G 0  158G   0% /dev

tmpfs   158G 0  158G   0% /dev/shm

tmpfs   158G   17M  158G   1% /run

tmpfs   158G 0  158G   0% /sys/fs/cgroup

/dev/sda1   512M   12M  501M   3% /boot/efi

10.0.3.6@tcp1:/lfsbv 50T   89M   48T   1% /mnt/mdt_bv

10.0.3.6@tcp1:/lfsnvme  185T  8.7M  176T   1% /mnt/mdt_nvme

tmpfs    32G 0   32G   0% /run/user/1000

 

 

[opc@lustre-client-1 ~]$ lctl list_param -R llite | grep max_read_ahead

[opc@lustre-client-1 ~]$

 

So I ran this: 

 

[opc@lustre-client-1 ~]$ lctl list_param -R llite  >  llite_parameters.txt

 

There are other parameters under llite.   I attached the complete list. 

 

 

From: "Moreno Diego (ID SIS)" 
Date: Friday, December 13, 2019 at 8:36 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

From what I can see I think you just ran the wrong command (lctl list_param -R
* ) or it doesn’t work as you expected on 2.12.3.

 

But llite params are sure there on a *mounted* Lustre client. 

 

This will give you the parameters you’re looking for and need to modify to 
have, likely, better read performance:

 

lctl list_param -R llite | grep max_read_ahead

 

 

From: Pinkesh Valdria 
Date: Friday, 13 December 2019 at 17:33
To: "Moreno Diego (ID SIS)" , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

This is how I installed lustre clients (only showing packages installed steps). 

 

 

cat > /etc/yum.repos.d/lustre.repo << EOF

[hpddLustreserver]

name=CentOS- - Lustre

baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el7/server/

gpgcheck=0

 

[e2fsprogs]

name=CentOS- - Ldiskfs

baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el7/

gpgcheck=0

 

[hpddLustreclient]

name=CentOS- - Lustre

baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el7/client/

gpgcheck=0

EOF

 

yum  install  lustre-client  -y

 

reboot

 

 

 

From: "Moreno Diego (ID SIS)" 
Date: Friday, December 13, 2019 at 2:55 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

From what I can see they exist on my 2.12.3 client node:

 

[root@rufus4 ~]# lctl list_param -R llite | grep max_read_ahead

llite.reprofs-9f7c3b4a8800.max_read_ahead_mb

llite.reprofs-9f7c3b4a8800.max_read_ahead_per_file_mb

llite.reprofs-ffff9f7c3b4a8800.max_read_ahead_whole_mb

 

Regards,

 

Diego

 

 

From: Pinkesh Valdria 
Date: Wednesday, 11 December 2019 at 17:46
To: "Moreno Diego (ID SIS)" , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

I was not able to find those parameters on my client nodes,  OSS or MGS nodes.  
 Here is how I was extracting all parameters .  

 

mkdir -p lctl_list_param_R/

cd lctl_list_param_R/

lctl list_param -R *  > lctl_list_param_R

 

[opc@lustre-client-1 lctl_list_param_R]$ less lctl_list_param_R  | grep ahead

llite.lfsbv-98231c3bc000.statahead_agl

llite.lfsbv-98231c3bc000.statahead_max

llite.lfsbv-98231c3bc000.statahead_running_max

llite.lfsnvme-98232c30e000.statahead_agl

llite.lfsnvme-98232c30e000.statahead_max

llite.lfsnvme-98232c30e000.statahead_running_max

[opc@lustre-client-1 lctl_list_param_R]$

 

I also tried these commands:  

 

Not working: 

On client nodes

lctl get_param llite.lfsbv-*.max_read_ahead_mb

error: get_param: param_path 'llite/lfsbv-*/max_read_ahead_mb': No such file or 
directory

[opc@lustre-client-1 lctl_list_param_R]$

 

Works 

On client nodes

lctl get_param llite.*.statahead_agl

llite.lfsbv-98231c3bc000.statahead_agl=1

llite.lfsnvme-98232c30e000.statahead_agl=1

[opc@lustre-client-1 lctl_list_param_R]$

 

 

 

From: "Moreno Diego (ID SIS)" 
Date: Tuesday, December 10, 2019 at 2:06 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

With that kind of degradation performance on read I would immediately think on 
llite’s max_read_ahead parameters on the client. Specifically these 2:

 

max_read_ahead_mb: total amount of MB allocated for read ahead, usually quite 
low for bandwidth benchmarking purposes and when there’re several files per 
client

max_read_ahead_per_file_mb

Re: [lustre-discuss] Degraded read performance with Large Bulk IO (16MB RPC)

2019-12-13 Thread Pinkesh Valdria
This is how I installed lustre clients (only showing packages installed steps). 

 

 

cat > /etc/yum.repos.d/lustre.repo << EOF

[hpddLustreserver]

name=CentOS- - Lustre

baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el7/server/

gpgcheck=0

 

[e2fsprogs]

name=CentOS- - Ldiskfs

baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el7/

gpgcheck=0

 

[hpddLustreclient]

name=CentOS- - Lustre

baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el7/client/

gpgcheck=0

EOF

 

yum  install  lustre-client  -y

 

reboot

 

 

 

From: "Moreno Diego (ID SIS)" 
Date: Friday, December 13, 2019 at 2:55 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

From what I can see they exist on my 2.12.3 client node:

 

[root@rufus4 ~]# lctl list_param -R llite | grep max_read_ahead

llite.reprofs-9f7c3b4a8800.max_read_ahead_mb

llite.reprofs-9f7c3b4a8800.max_read_ahead_per_file_mb

llite.reprofs-9f7c3b4a8800.max_read_ahead_whole_mb

 

Regards,

 

Diego

 

 

From: Pinkesh Valdria 
Date: Wednesday, 11 December 2019 at 17:46
To: "Moreno Diego (ID SIS)" , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

I was not able to find those parameters on my client nodes,  OSS or MGS nodes.  
 Here is how I was extracting all parameters .  

 

mkdir -p lctl_list_param_R/

cd lctl_list_param_R/

lctl list_param -R *  > lctl_list_param_R

 

[opc@lustre-client-1 lctl_list_param_R]$ less lctl_list_param_R  | grep ahead

llite.lfsbv-98231c3bc000.statahead_agl

llite.lfsbv-98231c3bc000.statahead_max

llite.lfsbv-98231c3bc000.statahead_running_max

llite.lfsnvme-98232c30e000.statahead_agl

llite.lfsnvme-98232c30e000.statahead_max

llite.lfsnvme-98232c30e000.statahead_running_max

[opc@lustre-client-1 lctl_list_param_R]$

 

I also tried these commands:  

 

Not working: 

On client nodes

lctl get_param llite.lfsbv-*.max_read_ahead_mb

error: get_param: param_path 'llite/lfsbv-*/max_read_ahead_mb': No such file or 
directory

[opc@lustre-client-1 lctl_list_param_R]$

 

Works 

On client nodes

lctl get_param llite.*.statahead_agl

llite.lfsbv-98231c3bc000.statahead_agl=1

llite.lfsnvme-98232c30e000.statahead_agl=1

[opc@lustre-client-1 lctl_list_param_R]$

 

 

 

From: "Moreno Diego (ID SIS)" 
Date: Tuesday, December 10, 2019 at 2:06 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

With that kind of degradation performance on read I would immediately think on 
llite’s max_read_ahead parameters on the client. Specifically these 2:

 

max_read_ahead_mb: total amount of MB allocated for read ahead, usually quite 
low for bandwidth benchmarking purposes and when there’re several files per 
client

max_read_ahead_per_file_mb: the default is quite low for 16MB RPCs (only a few 
RPCs per file)

 

You probably need to check the effect increasing both of them.

 

Regards,

 

Diego

 

 

From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Tuesday, 10 December 2019 at 09:40
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] Degraded read performance with Large Bulk IO (16MB 
RPC)

 

I was expecting better or same read performance with Large Bulk IO (16MB RPC),  
but I see degradation in performance.   Do I need to tune any other parameter 
to benefit from Large Bulk IO?   Appreciate if I can get any pointers to 
troubleshoot further. 

 

Throughput before 

-  Read:  2563 MB/s

-  Write:  2585 MB/s

 

Throughput after

-  Read:  1527 MB/s. (down by ~1025)

-  Write:  2859 MB/s

 

 

Changes I did are: 

On oss

-  lctl set_param obdfilter.lfsbv-*.brw_size=16

 

On clients 

-  unmounted and remounted

-  lctl set_param osc.lfsbv-OST*.max_pages_per_rpc=4096  (got 
auto-updated after re-mount)

-  lctl set_param osc.*.max_rpcs_in_flight=64   (Had to manually 
increase this to 64,  since after re-mount, it was auto-set to 8,  but 
read/write performance was poor)

-  lctl set_param osc.*.max_dirty_mb=2040. (setting the value to 2048 
was failing with : Numerical result out of range error.   Previously it was set 
to 2000 when I got good performance. 

 

 

My other settings: 

-  lnetctl net add --net tcp1 --if $interface --peer-timeout 180
--peer-credits 128 --credits 1024

-  echo "options ksocklnd nscheds=10 sock_timeout=100 credits=2560 
peer_credits=63 enable_irq_affinity=0"  >  /etc/modprobe.d/ksocklnd.conf

-  lfs setstripe -c 1 -S 1M /mnt/mdt_bv/test1

 



Re: [lustre-discuss] Degraded read performance with Large Bulk IO (16MB RPC)

2019-12-11 Thread Pinkesh Valdria
I was not able to find those parameters on my client nodes,  OSS or MGS nodes.  
 Here is how I was extracting all parameters .  

 

mkdir -p lctl_list_param_R/

cd lctl_list_param_R/

lctl list_param -R *  > lctl_list_param_R

 

[opc@lustre-client-1 lctl_list_param_R]$ less lctl_list_param_R  | grep ahead

llite.lfsbv-98231c3bc000.statahead_agl

llite.lfsbv-98231c3bc000.statahead_max

llite.lfsbv-98231c3bc000.statahead_running_max

llite.lfsnvme-98232c30e000.statahead_agl

llite.lfsnvme-98232c30e000.statahead_max

llite.lfsnvme-98232c30e000.statahead_running_max

[opc@lustre-client-1 lctl_list_param_R]$

 

I also tried these commands:  

 

Not working: 

On client nodes

lctl get_param llite.lfsbv-*.max_read_ahead_mb

error: get_param: param_path 'llite/lfsbv-*/max_read_ahead_mb': No such file or 
directory

[opc@lustre-client-1 lctl_list_param_R]$

 

Works 

On client nodes

lctl get_param llite.*.statahead_agl

llite.lfsbv-98231c3bc000.statahead_agl=1

llite.lfsnvme-98232c30e000.statahead_agl=1

[opc@lustre-client-1 lctl_list_param_R]$

 

 

 

From: "Moreno Diego (ID SIS)" 
Date: Tuesday, December 10, 2019 at 2:06 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Degraded read performance with Large Bulk IO 
(16MB RPC)

 

With that kind of degradation performance on read I would immediately think on 
llite’s max_read_ahead parameters on the client. Specifically these 2:

 

max_read_ahead_mb: total amount of MB allocated for read ahead, usually quite 
low for bandwidth benchmarking purposes and when there’re several files per 
client

max_read_ahead_per_file_mb: the default is quite low for 16MB RPCs (only a few 
RPCs per file)

 

You probably need to check the effect increasing both of them.

 

Regards,

 

Diego

 

 

From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Tuesday, 10 December 2019 at 09:40
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] Degraded read performance with Large Bulk IO (16MB 
RPC)

 

I was expecting better or same read performance with Large Bulk IO (16MB RPC),  
but I see degradation in performance.   Do I need to tune any other parameter 
to benefit from Large Bulk IO?   Appreciate if I can get any pointers to 
troubleshoot further. 

 

Throughput before 

-  Read:  2563 MB/s

-  Write:  2585 MB/s

 

Throughput after

-  Read:  1527 MB/s. (down by ~1025)

-  Write:  2859 MB/s

 

 

Changes I did are: 

On oss

-  lctl set_param obdfilter.lfsbv-*.brw_size=16

 

On clients 

-  unmounted and remounted

-  lctl set_param osc.lfsbv-OST*.max_pages_per_rpc=4096  (got 
auto-updated after re-mount)

-  lctl set_param osc.*.max_rpcs_in_flight=64   (Had to manually 
increase this to 64,  since after re-mount, it was auto-set to 8,  but 
read/write performance was poor)

-  lctl set_param osc.*.max_dirty_mb=2040. (setting the value to 2048 
was failing with : Numerical result out of range error.   Previously it was set 
to 2000 when I got good performance. 

 

 

My other settings: 

-  lnetctl net add --net tcp1 --if $interface --peer-timeout 180
--peer-credits 128 --credits 1024

-  echo "options ksocklnd nscheds=10 sock_timeout=100 credits=2560 
peer_credits=63 enable_irq_affinity=0"  >  /etc/modprobe.d/ksocklnd.conf

-  lfs setstripe -c 1 -S 1M /mnt/mdt_bv/test1

 



Re: [lustre-discuss] Lemur Lustre - make rpm fails

2019-12-11 Thread Pinkesh Valdria
Hi Nathaniel, 

I have an issue ticket opened:  https://github.com/whamcloud/lemur/issues/7

I tried to do it locally; that also fails, and the error is given below.

[root@lustre-client-4 lemur]# lfs --version
lfs 2.12.3
[root@lustre-client-4 lemur]# uname -a
Linux lustre-client-4 3.10.0-1062.7.1.el7.x86_64 #1 SMP Mon Dec 2 17:33:29 UTC 
2019 x86_64 x86_64 x86_64 GNU/Linux
 [root@lustre-client-4 lemur]# lsb_release -r
Release:7.6.1810
[root@lustre-client-4 lemur]#


[root@lustre-client-4 lemur]# make local-rpm
make -C packaging/rpm NAME=lemur VERSION=0.6.0_4_g4655df8 RELEASE=1 
URL="https://github.com/intel-hpdd/lemur";
make[1]: Entering directory `/root/lemur/packaging/rpm'
cd ../../ && \
.


github.com/intel-hpdd/lemur/vendor/github.com/aws/aws-sdk-go/service/s3/s3manager
github.com/intel-hpdd/lemur/cmd/lhsm-plugin-s3
install -d $(dirname 
/root/rpmbuild/BUILDROOT/lemur-hsm-agent-0.6.0_4_g4655df8-1.x86_64//usr/bin/lhsm-plugin-s3)
install -m 755 lhsm-plugin-s3 
/root/rpmbuild/BUILDROOT/lemur-hsm-agent-0.6.0_4_g4655df8-1.x86_64//usr/bin/lhsm-plugin-s3
go build -v -i -ldflags "-X 'main.version=0.6.0_4_g4655df8'" -o lhsm ./cmd/lhsm
github.com/intel-hpdd/lemur/vendor/github.com/intel-hpdd/go-lustre/pkg/pool
github.com/intel-hpdd/lemur/cmd/lhsmd/agent/fileid
github.com/intel-hpdd/lemur/vendor/github.com/intel-hpdd/go-lustre/llapi
github.com/intel-hpdd/lemur/vendor/gopkg.in/yaml.v2
github.com/intel-hpdd/lemur/vendor/gopkg.in/urfave/cli.v1
# github.com/intel-hpdd/lemur/vendor/github.com/intel-hpdd/go-lustre/llapi
cgo-gcc-prolog: In function '_cgo_c110903d49cd_C2func_llapi_get_version':
cgo-gcc-prolog:58:2: warning: 'llapi_get_version' is deprecated (declared at 
/usr/include/lustre/lustreapi.h:398) [-Wdeprecated-declarations]
cgo-gcc-prolog: In function '_cgo_c110903d49cd_Cfunc_llapi_get_version':
cgo-gcc-prolog:107:2: warning: 'llapi_get_version' is deprecated (declared at 
/usr/include/lustre/lustreapi.h:398) [-Wdeprecated-declarations]
# github.com/intel-hpdd/lemur/vendor/github.com/intel-hpdd/go-lustre/llapi
vendor/github.com/intel-hpdd/go-lustre/llapi/changelog.go:273:39: cannot use 
_Ctype_int(r.flags) (type _Ctype_int) as type int32 in argument to 
_Cfunc_hsm_get_cl_flags
make[2]: *** [lhsm] Error 2
make[2]: Leaving directory 
`/root/rpmbuild/BUILD/lemur-0.6.0_4_g4655df8/src/github.com/intel-hpdd/lemur'
error: Bad exit status from /var/tmp/rpm-tmp.cPPeEL (%install)


RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.cPPeEL (%install)
make[1]: *** [rpm] Error 1
make[1]: Leaving directory `/root/lemur/packaging/rpm'
make: *** [local-rpm] Error 2
[root@lustre-client-4 lemur]#



Thanks,
Pinkesh Valdria 



On 12/10/19, 4:55 AM, "lustre-discuss on behalf of Nathaniel Clark" 
 
wrote:

Can you open at ticket for this on 


https://github.com/whamcloud/lemur/issues
 

And possibly


https://jira.whamcloud.com/projects/LMR
 



You could also try:

$ make local-rpm



Which will avoid the docker stack and just build on the local machine

(beware it sudo's to install rpm build dependencies).

    


    
-- 

Nathaniel Clark 

Senior Engineer

Whamcloud / DDN



On Mon, 2019-12-09 at 15:04 -0800, Pinkesh Valdria wrote:

> I am trying to install Lemur on CentOS 7.6 (7.6.1810) to integrate

> with Object storage but the install fails.   I used the instructions

> on below page to install.  I already had Lustre client (2.12.3)

> installed on the machine,  so I started with steps for Lemur.

>  

> https://wiki.whamcloud.com/display/PUB/HPDD+HSM+Agent+and+Data+Movers+%28Lemur%29+Getting+Started+Guide
 

>  

>  

> Steps followed:

>  

> git clone https://github.com/whamcloud/lemur.git

[lustre-discuss] Degraded read performance with Large Bulk IO (16MB RPC)

2019-12-10 Thread Pinkesh Valdria
I was expecting the same or better read performance with Large Bulk IO (16MB RPC), 
but I see a degradation in performance.  Do I need to tune any other parameters 
to benefit from Large Bulk IO?  I would appreciate any pointers to troubleshoot 
further. 

 

Throughput before 
Read:  2563 MB/s
Write:  2585 MB/s
 

Throughput after
Read:  1527 MB/s (down by ~1036 MB/s)
Write:  2859 MB/s
 

 

Changes I did are: 

On oss
lctl set_param obdfilter.lfsbv-*.brw_size=16
 

On clients 
unmounted and remounted
lctl set_param osc.lfsbv-OST*.max_pages_per_rpc=4096  (got auto-updated after 
re-mount)
lctl set_param osc.*.max_rpcs_in_flight=64   (Had to manually increase this to 
64,  since after re-mount, it was auto-set to 8,  but read/write performance 
was poor)
lctl set_param osc.*.max_dirty_mb=2040  (setting the value to 2048 was failing 
with a "Numerical result out of range" error; previously it was set to 2000 when I 
got good performance). 
 

 

My other settings: 
lnetctl net add --net tcp1 --if $interface --peer-timeout 180 --peer-credits 128 
--credits 1024
echo "options ksocklnd nscheds=10 sock_timeout=100 credits=2560 peer_credits=63 
enable_irq_affinity=0"  >  /etc/modprobe.d/ksocklnd.conf
lfs setstripe -c 1 -S 1M /mnt/mdt_bv/test1
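
(To check whether the larger RPCs are actually being issued, the RPC size 
histograms can be inspected; a sketch below, with the OST name pattern taken from 
my setup:)

# on the client: the "pages per rpc" histogram should show hits in the 4096-page (16MB) bucket
lctl get_param osc.lfsbv-OST*.rpc_stats

# on the OSS: the "disk I/O size" histogram in brw_stats should show 16M I/Os
lctl get_param obdfilter.lfsbv-OST*.brw_stats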
 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lemur Lustre - make rpm fails

2019-12-09 Thread Pinkesh Valdria
I am trying to install Lemur on CentOS 7.6 (7.6.1810) to integrate with Object 
storage but the install fails.   I used the instructions on below page to 
install.  I already had Lustre client (2.12.3) installed on the machine,  so I 
started with steps for Lemur. 

 

https://wiki.whamcloud.com/display/PUB/HPDD+HSM+Agent+and+Data+Movers+%28Lemur%29+Getting+Started+Guide

 

 

Steps followed: 

 
git clone https://github.com/whamcloud/lemur.git
cd lemur
git checkout master
service docker start
make rpm
 

[root@lustre-client-4 lemur]# make rpm

make -C packaging/docker

make[1]: Entering directory `/root/lemur/packaging/docker'

make[2]: Entering directory `/root/lemur/packaging/docker/go-el7'

Building go-el7/1.13.5-1.fc32 for 1.13.5-1.fc32

docker build -t go-el7:1.13.5-1.fc32 -t go-el7:latest 
--build-arg=go_version=1.13.5-1.fc32 --build-arg=go_macros_version=3.0.8-4.fc31 
 .

Sending build context to Docker daemon 4.608 kB

Step 1/9 : FROM centos:7

 ---> 5e35e350aded

Step 2/9 : MAINTAINER Robert Read 

 ---> Using cache

 ---> 4be0d7fa27a2

Step 3/9 : RUN yum install -y @development golang pcre-devel glibc-static which

 ---> Using cache

 ---> ac83254f37f7

Step 4/9 : RUN mkdir -p /go/src /go/bin && chmod -R 777 /go

 ---> Using cache

 ---> fdbb4d031716

Step 5/9 : ENV GOPATH /go PATH $GOPATH/bin:$PATH

 ---> Using cache

 ---> 216c5484727e

Step 6/9 : RUN go get github.com/tools/godep && cp /go/bin/godep /usr/local/bin

 ---> Running in aed86ac3eb87

 

/bin/sh: go: command not found

The command '/bin/sh -c go get github.com/tools/godep && cp /go/bin/godep 
/usr/local/bin' returned a non-zero code: 127

make[2]: *** [go-el7/1.13.5-1.fc32] Error 127

make[2]: Leaving directory `/root/lemur/packaging/docker/go-el7'

make[1]: *** [go-el7] Error 2

make[1]: Leaving directory `/root/lemur/packaging/docker'

make: *** [docker] Error 2

[root@lustre-client-4 lemur]#

 

Is this repo for Lemur the most updated version? 

 

 

[root@lustre-client-4 lemur]# lfs --version

lfs 2.12.3

[root@lustre-client-4 lemur]#

 

 

 

 

 

 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lnet Self Test

2019-12-07 Thread Pinkesh Valdria
0  0 S   3.6  0.0  81:30.26 
socknal_sd01_03

   551 root  20   0   0  0  0 S   2.6  0.0  39:24.00 kswapd0

 60860 root      20   0   0  0  0 S   2.3  0.0  30:54.35 
socknal_sd00_01

 60864 root  20   0   0  0  0 S   2.3  0.0  30:58.20 
socknal_sd00_05

 64426 root  20   0   0  0  0 S   2.3  0.0   7:28.65 
ll_ost_io01_102

 60859 root  20   0   0  0  0 S   2.0  0.0  30:56.70 
socknal_sd00_00

 60861 root  20   0   0  0  0 S   2.0  0.0  30:54.97 
socknal_sd00_02

 60862 root  20   0   0  0  0 S   2.0  0.0  30:56.06 
socknal_sd00_03

 60863 root  20   0   0  0  0 S   2.0  0.0  30:56.32 
socknal_sd00_04

64334 root  20   0   0  0  0 D   1.3  0.0   7:19.46 
ll_ost_io01_010

 64329 root  20   0   0  0  0 S   1.0  0.0   7:46.48 
ll_ost_io01_005

 

 

 

From: "Moreno Diego (ID SIS)" 
Date: Wednesday, December 4, 2019 at 11:12 PM
To: Pinkesh Valdria , Jongwoo Han 

Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Lnet Self Test

 

I recently did some work on 40Gb and 100Gb ethernet interfaces and these are a 
few of the things that helped me during lnet_selftest:

 
On lnet: credits set higher than the default (e.g. 1024 or more), and peer_credits 
to at least 128 for network testing (it is just 8 by default, which is fine for a 
big cluster but maybe not for lnet_selftest with 2 clients).
On ksocklnd module options: more schedulers (10; the default of 6 was not enough 
for my server). I also changed some of the buffers (tx_buffer_size and 
rx_buffer_size set to 1073741824), but you need to be very careful with these.
sysctl.conf: increase buffers (tcp_rmem, tcp_wmem, check window_scaling, the 
net.core rmem/wmem max and default values, and check disabling timestamps if you 
can afford it).
Other: cpupower governor (set to performance at least for testing) and BIOS 
settings (e.g. on my AMD routers it was better to disable HT, disable a few 
virtualization-oriented features and set the PCI config to performance). 
Basically, be aware that Lustre’s ethernet performance will take CPU resources, 
so better optimize for it.
 

Last but not least, be aware that Lustre’s ethernet driver (ksocklnd) does not 
load balance as well as InfiniBand’s (ko2iblnd). I have sometimes seen several 
Lustre peers using the same socklnd thread on the destination while the other 
socklnd threads might not be active, which means your entire load can depend on 
just one core. Because of that, the best approach is to try with more clients and 
check the per-thread CPU load on your node with top; 2 clients do not seem enough 
to me. With the proper configuration you should be perfectly able to saturate a 
25Gb link in lnet_selftest.
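As a rough sketch of what the above looks like in practice (all values are 
illustrative and the interface name is just an example, adjust to your hardware):

# /etc/modprobe.d/ksocklnd.conf
options ksocklnd nscheds=10 credits=2560 peer_credits=128 tx_buffer_size=1073741824 rx_buffer_size=1073741824

# LNet-level credits when adding the network dynamically
lnetctl net add --net tcp1 --if ens3 --peer-credits 128 --credits 1024

# sysctl buffers (illustrative sizes)
sysctl -w net.core.rmem_max=268435456 net.core.wmem_max=268435456
sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"

# CPU frequency governor
cpupower frequency-set -g performance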

 

Regards,

 

Diego

 

 

From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Thursday, 5 December 2019 at 06:14
To: Jongwoo Han 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Lnet Self Test

 

Thanks Jongwoo. 

 

I have the MTU set for 9000 and also ring buffer setting set to max. 

 

ip link set dev $primaryNICInterface mtu 9000

ethtool -G $primaryNICInterface rx 2047 tx 2047 rx-jumbo 8191

 

I read about changing Interrupt Coalescing, but I am unable to find what values 
should be changed and whether it really helps or not. 

# Several packets in a rapid sequence can be coalesced into one interrupt 
passed up to the CPU, providing more CPU time for application processing.

 

Thanks,

Pinkesh valdria

Oracle Cloud

 

 

 

From: Jongwoo Han 
Date: Wednesday, December 4, 2019 at 8:07 PM
To: Pinkesh Valdria 
Cc: Andreas Dilger , "lustre-discuss@lists.lustre.org" 

Subject: Re: [lustre-discuss] Lnet Self Test

 

Have you tried MTU >= 9000 bytes (AKA jumbo frame) on the 25G ethernet and the 
switch? 

If it is set to 1500 bytes, ethernet + IP + TCP frame headers take quite amount 
of packet, reducing available bandwidth for data.

 

Jongwoo Han

 

2019년 11월 28일 (목) 오전 3:44, Pinkesh Valdria 님이 작성:

Thanks Andreas for your response.  

 

I ran another LNet Self Test with 48 concurrent processes, since the nodes have 
52 physical cores, and I was able to achieve the same throughput (2052.71 MiB/s = 
2152 MB/s).

 

Is it expected to lose almost 600 MB/s (2750 - 2150 = 600) due to overheads on 
ethernet with LNet?

 

 

Thanks,

Pinkesh Valdria

Oracle Cloud Infrastructure 

 

 

 

 

From: Andreas Dilger 
Date: Wednesday, November 27, 2019 at 1:25 AM
To: Pinkesh Valdria 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Lnet Self Test

 

The first thing to note is that lst reports results in binary units 

(MiB/s) while iperf reports results in decimal units (Gbps).  If you do the

conversion you get 2055.31 MiB/s = 2155 MB/s.

 

The other thing to check is the CPU usage. For TCP the CPU usage can

be high. You should try RoCE+o2iblnd instead. 

 

Cheers, Andreas


On Nov 26, 20

Re: [lustre-discuss] Lnet Self Test

2019-12-04 Thread Pinkesh Valdria
Thanks Jongwoo. 

 

I have the MTU set for 9000 and also ring buffer setting set to max. 

 

ip link set dev $primaryNICInterface mtu 9000

ethtool -G $primaryNICInterface rx 2047 tx 2047 rx-jumbo 8191

 

I read about changing Interrupt Coalescing, but I am unable to find what values 
should be changed and whether it really helps or not. 

# Several packets in a rapid sequence can be coalesced into one interrupt 
passed up to the CPU, providing more CPU time for application processing.
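
(For reference, coalescing can be inspected and changed with ethtool; the values 
below are only an illustrative starting point, and whether they help depends on 
the NIC driver.)

# show current coalescing settings
ethtool -c $primaryNICInterface

# illustrative: batch more packets per interrupt, trading a little latency for CPU
ethtool -C $primaryNICInterface adaptive-rx off rx-usecs 64 rx-frames 128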

 

Thanks,

Pinkesh valdria

Oracle Cloud

 

 

 

From: Jongwoo Han 
Date: Wednesday, December 4, 2019 at 8:07 PM
To: Pinkesh Valdria 
Cc: Andreas Dilger , "lustre-discuss@lists.lustre.org" 

Subject: Re: [lustre-discuss] Lnet Self Test

 

Have you tried MTU >= 9000 bytes (AKA jumbo frame) on the 25G ethernet and the 
switch? 

If it is set to 1500 bytes, ethernet + IP + TCP frame headers take quite amount 
of packet, reducing available bandwidth for data.

 

Jongwoo Han

 

2019년 11월 28일 (목) 오전 3:44, Pinkesh Valdria 님이 작성:

Thanks Andreas for your response.  

 

I ran another LNet Self Test with 48 concurrent processes, since the nodes have 
52 physical cores, and I was able to achieve the same throughput (2052.71 MiB/s = 
2152 MB/s).

 

Is it expected to lose almost 600 MB/s (2750 - 2150 = 600) due to overheads on 
ethernet with LNet?

 

 

Thanks,

Pinkesh Valdria

Oracle Cloud Infrastructure 

 

 

 

 

From: Andreas Dilger 
Date: Wednesday, November 27, 2019 at 1:25 AM
To: Pinkesh Valdria 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Lnet Self Test

 

The first thing to note is that lst reports results in binary units 

(MiB/s) while iperf reports results in decimal units (Gbps).  If you do the

conversion you get 2055.31 MiB/s = 2155 MB/s.

 

The other thing to check is the CPU usage. For TCP the CPU usage can

be high. You should try RoCE+o2iblnd instead. 

 

Cheers, Andreas


On Nov 26, 2019, at 21:26, Pinkesh Valdria  wrote:

Hello All, 

 

I created a new Lustre cluster on CentOS7.6 and I am running 
lnet_selftest_wrapper.sh to measure throughput on the network.  The nodes are 
connected to each other using 25Gbps ethernet, so theoretical max is 25 Gbps * 
125 = 3125 MB/s.  Using iperf3, I get 22 Gbps (2750 MB/s) between the nodes.

 

 

[root@lustre-client-2 ~]# for c in 1 2 4 8 12 16 20 24 ;  do echo $c ; 
ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)  CN=$c  SZ=1M  TM=30 BRW=write 
CKSUM=simple LFROM="10.0.3.7@tcp1" LTO="10.0.3.6@tcp1" 
/root/lnet_selftest_wrapper.sh; done ;

 

When I run lnet_selftest_wrapper.sh (from Lustre wiki) between 2 nodes,  I get 
a max of  2055.31  MiB/s,  Is that expected at the Lnet level?  Or can I 
further tune the network and OS kernel (tuning I applied are below) to get 
better throughput?

 

 

 

Result Snippet from lnet_selftest_wrapper.sh

 

[LNet Rates of lfrom]

[R] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s

[W] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s

[LNet Bandwidth of lfrom]

[R] Avg: 0.31 MiB/s Min: 0.31 MiB/s Max: 0.31 MiB/s

[W] Avg: 2055.30  MiB/s Min: 2055.30  MiB/s Max: 2055.30  MiB/s

[LNet Rates of lto]

[R] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s

[W] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s

[LNet Bandwidth of lto]

[R] Avg: 2055.31  MiB/s Min: 2055.31  MiB/s Max: 2055.31  MiB/s

[W] Avg: 0.32 MiB/s Min: 0.32 MiB/s Max: 0.32 MiB/s

 

 

Tuning applied: 

Ethernet NICs: 

ip link set dev ens3 mtu 9000 

ethtool -G ens3 rx 2047 tx 2047 rx-jumbo 8191

 

 

less /etc/sysctl.conf

net.core.wmem_max=16777216

net.core.rmem_max=16777216

net.core.wmem_default=16777216

net.core.rmem_default=16777216

net.core.optmem_max=16777216

net.core.netdev_max_backlog=27000

kernel.sysrq=1

kernel.shmmax=18446744073692774399

net.core.somaxconn=8192

net.ipv4.tcp_adv_win_scale=2

net.ipv4.tcp_low_latency=1

net.ipv4.tcp_rmem = 212992 87380 16777216

net.ipv4.tcp_sack = 1

net.ipv4.tcp_timestamps = 1

net.ipv4.tcp_window_scaling = 1

net.ipv4.tcp_wmem = 212992 65536 16777216

vm.min_free_kbytes = 65536

net.ipv4.tcp_congestion_control = cubic

net.ipv4.tcp_timestamps = 0

net.ipv4.tcp_congestion_control = htcp

net.ipv4.tcp_no_metrics_save = 0

 

 

 

echo "#

# tuned configuration

#

[main]

summary=Broadly applicable tuning that provides excellent performance across a 
variety of common server workloads

 

[disk]

devices=!dm-*, !sda1, !sda2, !sda3

readahead=>4096

 

[cpu]

force_latency=1

governor=performance

energy_perf_bias=performance

min_perf_pct=100

[vm]

transparent_huge_pages=never

[sysctl]

kernel.sched_min_granularity_ns = 1000

kernel.sched_wakeup_granularity_ns = 1500

vm.dirty_ratio = 30

vm.dirty_background_ratio = 10

vm.swappiness=30

" > lustre-performance/tuned.conf

 

tuned-adm profile lustre-performance

 

 

Thanks,

Pinkesh Valdria

 

___

Re: [lustre-discuss] Lnet Self Test

2019-11-27 Thread Pinkesh Valdria
Thanks Andreas for your response.  

 

I ran another LNet Self Test with 48 concurrent processes, since the nodes have 
52 physical cores, and I was able to achieve the same throughput (2052.71 MiB/s = 
2152 MB/s).

 

Is it expected to lose almost 600 MB/s (2750 - 2150 = 600) due to overheads on 
ethernet with LNet?

 

 

Thanks,

Pinkesh Valdria

Oracle Cloud Infrastructure 

 

 

 

 

From: Andreas Dilger 
Date: Wednesday, November 27, 2019 at 1:25 AM
To: Pinkesh Valdria 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Lnet Self Test

 

The first thing to note is that lst reports results in binary units 

(MiB/s) while iperf reports results in decimal units (Gbps).  If you do the

conversion you get 2055.31 MiB/s = 2155 MB/s.

 

The other thing to check is the CPU usage. For TCP the CPU usage can

be high. You should try RoCE+o2iblnd instead. 

 

Cheers, Andreas


On Nov 26, 2019, at 21:26, Pinkesh Valdria  wrote:

Hello All, 

 

I created a new Lustre cluster on CentOS7.6 and I am running 
lnet_selftest_wrapper.sh to measure throughput on the network.  The nodes are 
connected to each other using 25Gbps ethernet, so theoretical max is 25 Gbps * 
125 = 3125 MB/s.  Using iperf3, I get 22 Gbps (2750 MB/s) between the nodes.

 

 

[root@lustre-client-2 ~]# for c in 1 2 4 8 12 16 20 24 ;  do echo $c ; 
ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)  CN=$c  SZ=1M  TM=30 BRW=write 
CKSUM=simple LFROM="10.0.3.7@tcp1" LTO="10.0.3.6@tcp1" 
/root/lnet_selftest_wrapper.sh; done ;

 

When I run lnet_selftest_wrapper.sh (from Lustre wiki) between 2 nodes,  I get 
a max of  2055.31  MiB/s,  Is that expected at the Lnet level?  Or can I 
further tune the network and OS kernel (tuning I applied are below) to get 
better throughput?

 

 

 

Result Snippet from lnet_selftest_wrapper.sh

 

[LNet Rates of lfrom]

[R] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s

[W] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s

[LNet Bandwidth of lfrom]

[R] Avg: 0.31 MiB/s Min: 0.31 MiB/s Max: 0.31 MiB/s

[W] Avg: 2055.30  MiB/s Min: 2055.30  MiB/s Max: 2055.30  MiB/s

[LNet Rates of lto]

[R] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s

[W] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s

[LNet Bandwidth of lto]

[R] Avg: 2055.31  MiB/s Min: 2055.31  MiB/s Max: 2055.31  MiB/s

[W] Avg: 0.32 MiB/s Min: 0.32 MiB/s Max: 0.32 MiB/s

 

 

Tuning applied: 

Ethernet NICs: 

ip link set dev ens3 mtu 9000 

ethtool -G ens3 rx 2047 tx 2047 rx-jumbo 8191

 

 

less /etc/sysctl.conf

net.core.wmem_max=16777216

net.core.rmem_max=16777216

net.core.wmem_default=16777216

net.core.rmem_default=16777216

net.core.optmem_max=16777216

net.core.netdev_max_backlog=27000

kernel.sysrq=1

kernel.shmmax=18446744073692774399

net.core.somaxconn=8192

net.ipv4.tcp_adv_win_scale=2

net.ipv4.tcp_low_latency=1

net.ipv4.tcp_rmem = 212992 87380 16777216

net.ipv4.tcp_sack = 1

net.ipv4.tcp_timestamps = 1

net.ipv4.tcp_window_scaling = 1

net.ipv4.tcp_wmem = 212992 65536 16777216

vm.min_free_kbytes = 65536

net.ipv4.tcp_congestion_control = cubic

net.ipv4.tcp_timestamps = 0

net.ipv4.tcp_congestion_control = htcp

net.ipv4.tcp_no_metrics_save = 0

 

 

 

echo "#

# tuned configuration

#

[main]

summary=Broadly applicable tuning that provides excellent performance across a 
variety of common server workloads

 

[disk]

devices=!dm-*, !sda1, !sda2, !sda3

readahead=>4096

 

[cpu]

force_latency=1

governor=performance

energy_perf_bias=performance

min_perf_pct=100

[vm]

transparent_huge_pages=never

[sysctl]

kernel.sched_min_granularity_ns = 1000

kernel.sched_wakeup_granularity_ns = 1500

vm.dirty_ratio = 30

vm.dirty_background_ratio = 10

vm.swappiness=30

" > lustre-performance/tuned.conf

 

tuned-adm profile lustre-performance

 

 

Thanks,

Pinkesh Valdria

 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lnet Self Test

2019-11-26 Thread Pinkesh Valdria
Hello All, 

 

I created a new Lustre cluster on CentOS7.6 and I am running 
lnet_selftest_wrapper.sh to measure throughput on the network.  The nodes are 
connected to each other using 25Gbps ethernet, so theoretical max is 25 Gbps * 
125 = 3125 MB/s.    Using iperf3,  I get 22Gbps (2750 MB/s) between the nodes.

 

 

[root@lustre-client-2 ~]# for c in 1 2 4 8 12 16 20 24 ;  do echo $c ; 
ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)  CN=$c  SZ=1M  TM=30 BRW=write 
CKSUM=simple LFROM="10.0.3.7@tcp1" LTO="10.0.3.6@tcp1" 
/root/lnet_selftest_wrapper.sh; done ;

 

When I run lnet_selftest_wrapper.sh (from Lustre wiki) between 2 nodes,  I get 
a max of  2055.31  MiB/s,  Is that expected at the Lnet level?  Or can I 
further tune the network and OS kernel (tuning I applied are below) to get 
better throughput?

 

 

 

Result Snippet from lnet_selftest_wrapper.sh

 

 [LNet Rates of lfrom]

[R] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s

[W] Avg: 4112 RPC/s Min: 4112 RPC/s Max: 4112 RPC/s

[LNet Bandwidth of lfrom]

[R] Avg: 0.31 MiB/s Min: 0.31 MiB/s Max: 0.31 MiB/s

[W] Avg: 2055.30  MiB/s Min: 2055.30  MiB/s Max: 2055.30  MiB/s

[LNet Rates of lto]

[R] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s

[W] Avg: 4136 RPC/s Min: 4136 RPC/s Max: 4136 RPC/s

[LNet Bandwidth of lto]

[R] Avg: 2055.31  MiB/s Min: 2055.31  MiB/s Max: 2055.31  MiB/s

[W] Avg: 0.32 MiB/s Min: 0.32 MiB/s Max: 0.32 MiB/s

 

 

Tuning applied: 

Ethernet NICs: 

ip link set dev ens3 mtu 9000 

ethtool -G ens3 rx 2047 tx 2047 rx-jumbo 8191

 

 

less /etc/sysctl.conf

net.core.wmem_max=16777216

net.core.rmem_max=16777216

net.core.wmem_default=16777216

net.core.rmem_default=16777216

net.core.optmem_max=16777216

net.core.netdev_max_backlog=27000

kernel.sysrq=1

kernel.shmmax=18446744073692774399

net.core.somaxconn=8192

net.ipv4.tcp_adv_win_scale=2

net.ipv4.tcp_low_latency=1

net.ipv4.tcp_rmem = 212992 87380 16777216

net.ipv4.tcp_sack = 1

net.ipv4.tcp_timestamps = 1

net.ipv4.tcp_window_scaling = 1

net.ipv4.tcp_wmem = 212992 65536 16777216

vm.min_free_kbytes = 65536

net.ipv4.tcp_congestion_control = cubic

net.ipv4.tcp_timestamps = 0

net.ipv4.tcp_congestion_control = htcp

net.ipv4.tcp_no_metrics_save = 0

 

 

 

echo "#

# tuned configuration

#

[main]

summary=Broadly applicable tuning that provides excellent performance across a 
variety of common server workloads

 

[disk]

devices=!dm-*, !sda1, !sda2, !sda3

readahead=>4096

 

[cpu]

force_latency=1

governor=performance

energy_perf_bias=performance

min_perf_pct=100

[vm]

transparent_huge_pages=never

[sysctl]

kernel.sched_min_granularity_ns = 1000

kernel.sched_wakeup_granularity_ns = 1500

vm.dirty_ratio = 30

vm.dirty_background_ratio = 10

vm.swappiness=30

" > lustre-performance/tuned.conf

 

tuned-adm profile lustre-performance

 

 

Thanks,

Pinkesh Valdria

 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] max_pages_per_rpc=4096 fails on the client nodes

2019-08-14 Thread Pinkesh Valdria
For others, in case they face this issue:  


Solution: I had to unmount and remount for the command to work. 
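
In case it helps, the sequence that worked looked roughly like this (the mount 
point, fsname and MGS NID are placeholders from my setup):

umount /mnt/mdt_bv
mount -t lustre -o flock ${mgs_ip}@tcp1:/lfsbv /mnt/mdt_bv
lctl set_param osc.lfsbv-OST*.max_pages_per_rpc=4096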

 

From: Pinkesh Valdria 
Date: Wednesday, August 14, 2019 at 9:25 AM
To: "lustre-discuss@lists.lustre.org" 
Subject: max_pages_per_rpc=4096 fails on the client nodes

 

I want to enable a large RPC size.  I followed the steps as per the Lustre 
manual section 33.9.2 Usage (http://doc.lustre.org/lustre_manual.xhtml), but 
I get the below error when I try to update the client.

 

Updated the OSS server: 

[root@lustre-oss-server-nic0-1 test]# lctl set_param 
obdfilter.lfsbv-*.brw_size=16

obdfilter.lfsbv-OST.brw_size=16

obdfilter.lfsbv-OST0001.brw_size=16

obdfilter.lfsbv-OST0002.brw_size=16

obdfilter.lfsbv-OST0003.brw_size=16

obdfilter.lfsbv-OST0004.brw_size=16

obdfilter.lfsbv-OST0005.brw_size=16

obdfilter.lfsbv-OST0006.brw_size=16

obdfilter.lfsbv-OST0007.brw_size=16

obdfilter.lfsbv-OST0008.brw_size=16

obdfilter.lfsbv-OST0009.brw_size=16

[root@lustre-oss-server-nic0-1 test]#

 

Add the above change permanently using MGS node: 

[root@lustre-mds-server-nic0-1 ~]# lctl set_param -P 
obdfilter.lfsbv-*.brw_size=16

[root@lustre-mds-server-nic0-1 ~]#

 

 

Client side update – failed 

[root@lustre-client-1 ~]# lctl set_param osc.lfsbv-OST*.max_pages_per_rpc=4096

error: set_param: setting 
/proc/fs/lustre/osc/lfsbv-OST-osc-8e66b4b08000/max_pages_per_rpc=4096: 
Numerical result out of range

error: set_param: setting 
/proc/fs/lustre/osc/lfsbv-OST0001-osc-8e66b4b08000/max_pages_per_rpc=4096: 
Numerical result out of range

error: set_param: setting 
/proc/fs/lustre/osc/lfsbv-OST0002-osc-8e66b4b08000/max_pages_per_rpc=4096: 
Numerical result out of range

error: set_param: setting 
/proc/fs/lustre/osc/lfsbv-OST0003-osc-8e66b4b08000/max_pages_per_rpc=4096: 
Numerical result out of range

…..

…..

 

 

33.9.2. Usage

In order to enable a larger RPC size, brw_size must be changed to an IO size 
value up to 16MB. To temporarily change brw_size, the following command should 
be run on the OSS:

oss# lctl set_param obdfilter.fsname-OST*.brw_size=16

To persistently change brw_size, the following command should be run:

oss# lctl set_param -P obdfilter.fsname-OST*.brw_size=16

When a client connects to an OST target, it will fetch brw_size from the target 
and pick the maximum value of brw_size and its local setting for 
max_pages_per_rpc as the actual RPC size. Therefore, the max_pages_per_rpc on 
the client side would have to be set to 16M, or 4096 if the PAGESIZE is 4KB, to 
enable a 16MB RPC. To temporarily make the change, the following command should 
be run on the client to setmax_pages_per_rpc:

client$ lctl set_param osc.fsname-OST*.max_pages_per_rpc=16M

To persistently make this change, the following command should be run:

client$ lctl set_param -P obdfilter.fsname-OST*.osc.max_pages_per_rpc=16M

Caution

The brw_size of an OST can be changed on the fly. However, clients have to be 
remounted to renegotiate the new maximum RPC size.

 

 

 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] max_pages_per_rpc=4096 fails on the client nodes

2019-08-14 Thread Pinkesh Valdria
I want to enable a large RPC size.  I followed the steps as per the Lustre 
manual section 33.9.2 Usage (http://doc.lustre.org/lustre_manual.xhtml), but 
I get the below error when I try to update the client.    

 

Updated the OSS server: 

[root@lustre-oss-server-nic0-1 test]# lctl set_param 
obdfilter.lfsbv-*.brw_size=16

obdfilter.lfsbv-OST.brw_size=16

obdfilter.lfsbv-OST0001.brw_size=16

obdfilter.lfsbv-OST0002.brw_size=16

obdfilter.lfsbv-OST0003.brw_size=16

obdfilter.lfsbv-OST0004.brw_size=16

obdfilter.lfsbv-OST0005.brw_size=16

obdfilter.lfsbv-OST0006.brw_size=16

obdfilter.lfsbv-OST0007.brw_size=16

obdfilter.lfsbv-OST0008.brw_size=16

obdfilter.lfsbv-OST0009.brw_size=16

[root@lustre-oss-server-nic0-1 test]#

 

Add the above change permanently using MGS node: 

[root@lustre-mds-server-nic0-1 ~]# lctl set_param -P 
obdfilter.lfsbv-*.brw_size=16

[root@lustre-mds-server-nic0-1 ~]#

 

 

Client side update – failed 

[root@lustre-client-1 ~]# lctl set_param osc.lfsbv-OST*.max_pages_per_rpc=4096

error: set_param: setting 
/proc/fs/lustre/osc/lfsbv-OST-osc-8e66b4b08000/max_pages_per_rpc=4096: 
Numerical result out of range

error: set_param: setting 
/proc/fs/lustre/osc/lfsbv-OST0001-osc-8e66b4b08000/max_pages_per_rpc=4096: 
Numerical result out of range

error: set_param: setting 
/proc/fs/lustre/osc/lfsbv-OST0002-osc-8e66b4b08000/max_pages_per_rpc=4096: 
Numerical result out of range

error: set_param: setting 
/proc/fs/lustre/osc/lfsbv-OST0003-osc-8e66b4b08000/max_pages_per_rpc=4096: 
Numerical result out of range

…..

…..

 

 

33.9.2. Usage

In order to enable a larger RPC size, brw_size must be changed to an IO size 
value up to 16MB. To temporarily change brw_size, the following command should 
be run on the OSS:

oss# lctl set_param obdfilter.fsname-OST*.brw_size=16

To persistently change brw_size, the following command should be run:

oss# lctl set_param -P obdfilter.fsname-OST*.brw_size=16

When a client connects to an OST target, it will fetch brw_size from the target 
and pick the maximum value of brw_size and its local setting for 
max_pages_per_rpc as the actual RPC size. Therefore, the max_pages_per_rpc on 
the client side would have to be set to 16M, or 4096 if the PAGESIZE is 4KB, to 
enable a 16MB RPC. To temporarily make the change, the following command should 
be run on the client to setmax_pages_per_rpc:

client$ lctl set_param osc.fsname-OST*.max_pages_per_rpc=16M

To persistently make this change, the following command should be run:

client$ lctl set_param -P obdfilter.fsname-OST*.osc.max_pages_per_rpc=16M

Caution

The brw_size of an OST can be changed on the fly. However, clients have to be 
remounted to renegotiate the new maximum RPC size.

 

 

 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lnet_selftest - fails for me

2019-08-12 Thread Pinkesh Valdria
Figured out the issue.  I forgot to load the module on the server side.  

 

Solution:   Load module on all nodes involved in testing.  

 

[root@lustre-oss-server-nic0-1 ~]# modprobe lnet_selftest

[root@lustre-oss-server-nic0-1 ~]#
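
A quick way to confirm the module is present on every node before starting a 
session:

lsmod | grep lnet_selftest || modprobe lnet_selftest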

 

From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Monday, August 12, 2019 at 10:55 AM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] lnet_selftest - fails for me

 

Hello,  

 

Does anyone know why this simple lnet_selftest is failing?  I am able to use 
the Lustre file system without any problem.  I looked at /var/log/messages on 
the client and server nodes and there are no errors.  Googling for the error 
was not helpful. 

 

 

 

The script lnet_selftest_wrapper.sh has the content mentioned on this 
page:  http://wiki.lustre.org/LNET_Selftest  (wrapper script at the end of the 
page). 

 

LFROM="10.0.3.4@tcp1"   is the client node where I am running this script. 

LTO="10.0.3.6@tcp1"   is one of the OSS server.I have total 3 of them. 

 

 

[root@lustre-client-1 ~]# ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)  CN=1  SZ=1M 
 TM=60 BRW=read CKSUM=simple LFROM="10.0.3.4@tcp1" LTO="10.0.3.6@tcp1" 
./lnet_selftest_wrapper.sh

lst-output-2019-08-12-17:47:09 1 1M 60 read simple 10.0.3.4@tcp1 10.0.3.6@tcp1

LST_SESSION = 6787

SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No

10.0.3.4@tcp1 are added to session

create session RPC failed on 12345-10.0.3.6@tcp1: Unknown error -110

No nodes added successfully, deleting group lto

Group is deleted

Can't get count of nodes from lto: No such file or directory

bulk_read is running now

Capturing statistics for 60 secs Invalid nid: lto

Failed to get count of nodes from lto: Success

 

 


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lnet_selftest - fails for me

2019-08-12 Thread Pinkesh Valdria
Hello,  

 

Does anyone know why this simple lnet_selftest is failing?  I am able to use 
the Lustre file system without any problem.  I looked at /var/log/messages on 
the client and server nodes and there are no errors.  Googling for the error 
was not helpful. 

 

 

 

The script lnet_selftest_wrapper.sh has the content mentioned on this 
page:  http://wiki.lustre.org/LNET_Selftest  (wrapper script at the end of the 
page). 

 

LFROM="10.0.3.4@tcp1"   is the client node where I am running this script. 

LTO="10.0.3.6@tcp1"   is one of the OSS server.    I have total 3 of them. 

 

 

[root@lustre-client-1 ~]# ST=lst-output-$(date +%Y-%m-%d-%H:%M:%S)  CN=1  SZ=1M 
 TM=60 BRW=read CKSUM=simple LFROM="10.0.3.4@tcp1" LTO="10.0.3.6@tcp1" 
./lnet_selftest_wrapper.sh

lst-output-2019-08-12-17:47:09 1 1M 60 read simple 10.0.3.4@tcp1 10.0.3.6@tcp1

LST_SESSION = 6787

SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No

10.0.3.4@tcp1 are added to session

create session RPC failed on 12345-10.0.3.6@tcp1: Unknown error -110

No nodes added successfully, deleting group lto

Group is deleted

Can't get count of nodes from lto: No such file or directory

bulk_read is running now

Capturing statistics for 60 secs Invalid nid: lto

Failed to get count of nodes from lto: Success

 

 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] LND tunables - how to set on Ethernet network

2019-08-11 Thread Pinkesh Valdria
I have been trying to find out how I can set the below values if I use 
Lustre & LNET over a 25 Gbps Ethernet network.  It seems that for InfiniBand or 
Intel OPA you can set them in ko2iblnd.conf.  

 

 

tunables:
  peer_timeout: 180
  peer_credits: 8
  peer_buffer_credits: 0
  credits: 256
lnd tunables:

  peercredits_hiw: 64

      map_on_demand: 32

  concurrent_sends: 256

  fmr_pool_size: 2048

  fmr_flush_trigger: 512

  fmr_cache: 1

 

Also, I have been doing dynamic network configuration using the lnetctl command; 
in that case, is there a way to set the above values, or does it have to be via 
some config file only? 
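
For reference, what I have been experimenting with so far (a sketch only; the lnd 
tunables shown above, like peercredits_hiw/map_on_demand/fmr_*, appear to be 
o2iblnd-specific, so on a tcp network the available knobs are the ksocklnd module 
options plus the generic LNet tunables; values and interface name are illustrative):

# ksocklnd options go in a modprobe file, read when the lnet modules load
echo "options ksocklnd nscheds=10 credits=2560 peer_credits=128" > /etc/modprobe.d/ksocklnd.conf

# generic LNet tunables can be given at "lnetctl net add" time
lnetctl net add --net tcp1 --if ens3 --peer-timeout 180 --peer-credits 128 --credits 1024

# and the resulting configuration can be saved and re-applied as YAML
lnetctl export > /etc/lnet.conf
lnetctl import /etc/lnet.conf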

 

 

 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] LNET tunables and LND tunables

2019-08-11 Thread Pinkesh Valdria
Hello, 

 

I have a Lustre cluster using a 25 Gbps ethernet network (no InfiniBand).  I 
see a lot of examples online for InfiniBand and what tunables to use for it, but 
I am struggling to find recommendations when using ethernet networks.   

 

I would appreciate it if someone could share their experience and settings when 
using ethernet, or if there are any details online for recommended ethernet values.  

 

 

From my lustre client node, I don’t see any LND tunables below like 
(peercredits_hiw: 64, map_on_demand: 32, concurrent_sends: 256, fmr_pool_size: 
2048, fmr_flush_trigger: 512, fmr_cache: 1).

 

[root@lustre-client-1 ~]# lnetctl net show --verbose

net:

    - net type: lo

  local NI(s):

    - nid: 0@lo

  status: up

  statistics:

  send_count: 0

  recv_count: 0

  drop_count: 0

  tunables:

  peer_timeout: 0

  peer_credits: 0

  peer_buffer_credits: 0

  credits: 0

  dev cpt: 0

  tcp bonding: 0

  CPT: "[0,1,2,3,4,5,6,7,8,9,10,11]"

    - net type: tcp1

  local NI(s):

    - nid: 10.0.3.4@tcp1

  status: up

  interfaces:

  0: ens3

  statistics:

  send_count: 657209676

      recv_count: 657208330

  drop_count: 0

  tunables:

  peer_timeout: 180

  peer_credits: 8

  peer_buffer_credits: 0

  credits: 256

  dev cpt: -1

  tcp bonding: 0

  CPT: "[0,1,2,3,4,5,6,7,8,9,10,11]"

[root@lustre-client-1 ~]#

 

 

OSS Server 

 

[root@lustre-oss-server-nic0-1 ~]# lnetctl net show --verbose

net:

    - net type: lo

  local NI(s):

    - nid: 0@lo

  status: up

  statistics:

  send_count: 0

  recv_count: 0

  drop_count: 0

  tunables:

  peer_timeout: 0

  peer_credits: 0

  peer_buffer_credits: 0

  credits: 0

  dev cpt: 0

  tcp bonding: 0

  CPT: "[0,1]"

    - net type: tcp1

  local NI(s):

    - nid: 10.0.3.6@tcp1

  status: up

  interfaces:

  0: eno3d1

  statistics:

  send_count: 232650108

  recv_count: 232650019

  drop_count: 0

  tunables:

  peer_timeout: 180

  peer_credits: 8

  peer_buffer_credits: 0

  credits: 256

  dev cpt: 0

  tcp bonding: 0

  CPT: "[0,1]"

[root@lustre-oss-server-nic0-1 ~]#

 

 

 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre tuning - help

2019-08-09 Thread Pinkesh Valdria
Lustre experts, 

 

I recently installed Lustre for the first time.  It’s working (so I am happy), 
but now I am trying to do some performance testing/tuning.  My goal is to run a 
SAS workload and use Lustre as the shared file system for SAS Grid, and later to 
tune Lustre for a generic HPC workload.   

 

 

Through Google searches, I read articles on Lustre and tuning recommendations 
from LUG conference slides, etc. 

https://cpb-us-e1.wpmucdn.com/blogs.rice.edu/dist/0/2327/files/2014/03/Fragalla.pdf

http://cdn.opensfs.org/wp-content/uploads/2019/07/LUG2019-Sysadmin-tutorial.pdf

http://support.sas.com/rnd/scalability/grid/SGMonAWS.pdf

 

I have results for IBM Spectrum Scale (GPFS) running on the same hardware/software 
stack, and based on the Lustre tuning I have done, I am not getting optimal 
performance.  My understanding was that Lustre can deliver better performance 
compared to GPFS if tuned correctly.  

I have tried, changing the following:

Use Stripe count =1, 4, 8, 16, 24 , -1 (to stripe across all OSTs). And 
progressive file layout: lfs setstripe -E 256M -c 1 -E 4G -c 4 -E -1 -c -1 -S 
4M  /mnt/mdt_bv

Use Stripe Size:  default (1M),  4M,  64K (since SAS apps use this).
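
(To confirm what layout the test files actually end up with, I check with lfs 
getstripe, e.g.:)

lfs getstripe -d /mnt/mdt_bv         # default layout of the directory
lfs getstripe /mnt/mdt_bv/test1      # layout of an individual file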

 

SAS Grid uses large-block, sequential IO patterns. (Block size: 64K, 128K, 
256K,  - 64K is their preferred value).   

 

Question 1:  How should I tune the Stripe Count and Stripe Size for the above? 
Also, should I use a Progressive File Layout? 

 

 

So I would appreciate some feedback on the tuning I have done, whether it is 
correct, and whether I am missing anything.  

 

 

Details:

It’s a cloud based solution – Oracle Cloud Infrastructure.   Installed Lustre 
using instructions on WhamCloud. 

All running CentOS 7.

MGS -1 node (shared with MDS), MDS -1 node, OSS -3 nodes.    All nodes are 
Baremetal machines (no VM) with  52 physical cores, 768GB RAM and have 2 NICs 
(2x25gbps ethernet, no dual bonding).   1 NIC is configured to connect to Block 
Storage disks.  2nd NiC is configured to talk to clients.   So LNET is 
configured with 2nd NIC.   Each OSS is connected to 10 Block Volume disk, 
800GB each.   So 10 OSTs per OSS.   Total of 30 OSTs (21TB storage) .  Have 1 
MDT (800GB) attached to MDS.   

 

Clients are 24 physical cores VMs,  320GB RAM, 1 NIC (24.6gbps).  Using 3 
clients in the above setup.  

 

 

On all nodes (MDS/OSS/Clients): 

 

###

### OS Performance tuning

###

 

setenforce 0

echo "

*  hard   memlock   unlimited

*  soft    memlock   unlimited

" >> /etc/security/limits.conf

 

# The below applies for both compute and server nodes (storage)

cd /usr/lib/tuned/

cp -r throughput-performance/ sas-performance

 

echo "#

# tuned configuration

#

[main]

include=throughput-performance

summary=Broadly applicable tuning that provides excellent performance across a 
variety of common server workloads

[disk]

devices=!dm-*, !sda1, !sda2, !sda3

readahead=>4096

 

[cpu]

force_latency=1

governor=performance

energy_perf_bias=performance

min_perf_pct=100

[vm]

transparent_huge_pages=never

[sysctl]

kernel.sched_min_granularity_ns = 1000

kernel.sched_wakeup_granularity_ns = 1500

vm.dirty_ratio = 30

vm.dirty_background_ratio = 10

vm.swappiness=30

" > sas-performance/tuned.conf

 

tuned-adm profile sas-performance

 

# Display active profile

tuned-adm active

   

 

Networking: 

All NICs are configured to use MTU – 9000

 

Block Volumes/Disks 

For all OSTs/MDT: 

cat /sys/block/$disk/queue/max_hw_sectors_kb 

32767

echo “32767” > /sys/block/$disk/queue/max_sectors_kb ;

echo "192" > /sys/block/$disk/queue/nr_requests ;

echo "deadline" > /sys/block/$disk/queue/scheduler ;

echo "0" > /sys/block/$disk/queue/read_ahead_kb ;

echo "68" > /sys/block/$disk/device/timeout ;

 

Only OSTs: 

lctl set_param osd-ldiskfs.*.readcache_max_filesize=2M

 

   

Lustre clients:

lctl set_param osc.*.checksums=0

lctl set_param timeout=600

#lctl set_param ldlm_timeout=200  - This fails with below error 

#error: set_param: param_path 'ldlm_timeout': No such file or directory

lctl set_param ldlm_timeout=200

lctl set_param at_min=250

lctl set_param at_max=600

lctl set_param ldlm.namespaces.*.lru_size=128

lctl set_param osc.*.max_rpcs_in_flight=32

lctl set_param osc.*.max_dirty_mb=256

lctl set_param debug="+neterror"

 

 

# 
https://cpb-us-e1.wpmucdn.com/blogs.rice.edu/dist/0/2327/files/2014/03/Fragalla.pdf
 - says turn off checksum at network level

ethtool -K ens3 rx off tx off

 

Lustre mounted with  -o flock option

mount -t lustre -o flock ${mgs_ip}@tcp1:/$fsname $mount_point

 

 

Once again, I appreciate any guidance or help you can provide, or any docs or 
articles you can point me to. 

 

 

Thanks,

Pinkesh Valdr

Re: [lustre-discuss] lctl set_param obdfilter.*.readcache_max_filesize=2M fails

2019-08-08 Thread Pinkesh Valdria
Thanks to Shaun and Chris.   

 

Sorry, I forgot to paste it.  I tried osd first and it didn’t work, so I tried 
ost, in case ost was the new name. 

 

[root@lustre-oss-server-nic0-1 ~]#  lctl set_param 
osd-*.readcache_max_filesize=2M

error: set_param: param_path 'osd-*/readcache_max_filesize': No such file or 
directory

[root@lustre-oss-server-nic0-1 ~]#  lctl set_param 
ost-*.readcache_max_filesize=2M

error: set_param: param_path 'ost-*/readcache_max_filesize': No such file or 
directory

[root@lustre-oss-server-nic0-1 ~]#

 

 

So, based on that, I ran it and got this:  

 

[root@lustre-oss-server-nic0-1 ~]# lctl list_param -R * | grep 
readcache_max_filesize

osd-ldiskfs.lfsbv-OST.readcache_max_filesize

osd-ldiskfs.lfsbv-OST0001.readcache_max_filesize

osd-ldiskfs.lfsbv-OST0002.readcache_max_filesize

osd-ldiskfs.lfsbv-OST0003.readcache_max_filesize

osd-ldiskfs.lfsbv-OST0004.readcache_max_filesize

osd-ldiskfs.lfsbv-OST0005.readcache_max_filesize

osd-ldiskfs.lfsbv-OST0006.readcache_max_filesize

osd-ldiskfs.lfsbv-OST0007.readcache_max_filesize

osd-ldiskfs.lfsbv-OST0008.readcache_max_filesize

osd-ldiskfs.lfsbv-OST0009.readcache_max_filesize

[root@lustre-oss-server-nic0-1 ~]#

 

So I did trial and error and found I need to do this:  osd-ldiskfs.* instead of 
osd.*

 

[root@lustre-oss-server-nic0-1 ~] lctl set_param 
osd-ldiskfs.*.readcache_max_filesize=2M

osd-ldiskfs.lfsbv-OST.readcache_max_filesize=2M

osd-ldiskfs.lfsbv-OST0001.readcache_max_filesize=2M

osd-ldiskfs.lfsbv-OST0002.readcache_max_filesize=2M

osd-ldiskfs.lfsbv-OST0003.readcache_max_filesize=2M

osd-ldiskfs.lfsbv-OST0004.readcache_max_filesize=2M

osd-ldiskfs.lfsbv-OST0005.readcache_max_filesize=2M

osd-ldiskfs.lfsbv-OST0006.readcache_max_filesize=2M

osd-ldiskfs.lfsbv-OST0007.readcache_max_filesize=2M

osd-ldiskfs.lfsbv-OST0008.readcache_max_filesize=2M

osd-ldiskfs.lfsbv-OST0009.readcache_max_filesize=2M

[root@lustre-oss-server-nic0-1 ~]#

 

 

 

From: Chris Horn 
Date: Thursday, August 8, 2019 at 11:11 AM
To: Pinkesh Valdria , Shaun Tancheff 
, "lustre-discuss@lists.lustre.org" 

Subject: Re: [lustre-discuss] lctl set_param 
obdfilter.*.readcache_max_filesize=2M fails

 

You have a typo in the command you ran. Shaun wrote
> lctl set_param osd-*.readcache_max_filesize=2M

 

but you have:
> lctl set_param ost-*.readcache_max_filesize=2M

ost->osd

 

> Is  there commands to see all the currently parameters using some get command 

 

lctl list_param -R *

 

# lctl list_param -R * | grep readcache_max_filesize

osd-ldiskfs.snx11922-OST0002.readcache_max_filesize

osd-ldiskfs.snx11922-OST0003.readcache_max_filesize

#

 

Chris Horn

 

From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Thursday, August 8, 2019 at 11:57 AM
To: Shaun Tancheff , "lustre-discuss@lists.lustre.org" 

Subject: Re: [lustre-discuss] lctl set_param 
obdfilter.*.readcache_max_filesize=2M fails

 

That also fails 

 

[root@lustre-oss-server-nic0-1 ~]#  lctl set_param 
ost-*.readcache_max_filesize=2M

error: set_param: param_path 'ost-*/readcache_max_filesize': No such file or 
directory

[root@lustre-oss-server-nic0-1 ~]#

 

Is  there commands to see all the currently parameters using some get command 

 

 

From: Shaun Tancheff 
Date: Thursday, August 8, 2019 at 9:50 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] lctl set_param 
obdfilter.*.readcache_max_filesize=2M fails

 

I think the parameter has changed:

'obdfilter.*.readcache_max_filesize’ => 'osd-*.readcache_max_filesize'

 

So try:

 lctl set_param osd-*.readcache_max_filesize=2M

 

 

 

From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Thursday, August 8, 2019 at 11:43 AM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] lctl set_param obdfilter.*.readcache_max_filesize=2M 
fails

 


 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lctl set_param obdfilter.*.readcache_max_filesize=2M fails

2019-08-08 Thread Pinkesh Valdria
That also fails 

 

[root@lustre-oss-server-nic0-1 ~]#  lctl set_param 
ost-*.readcache_max_filesize=2M

error: set_param: param_path 'ost-*/readcache_max_filesize': No such file or 
directory

[root@lustre-oss-server-nic0-1 ~]#

 

Is  there commands to see all the currently parameters using some get command 

 

 

From: Shaun Tancheff 
Date: Thursday, August 8, 2019 at 9:50 AM
To: Pinkesh Valdria , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] lctl set_param 
obdfilter.*.readcache_max_filesize=2M fails

 

I think the parameter has changed:

'obdfilter.*.readcache_max_filesize’ => 'osd-*.readcache_max_filesize'

 

So try:

 lctl set_param osd-*.readcache_max_filesize=2M

 

 

 

From: lustre-discuss  on behalf of 
Pinkesh Valdria 
Date: Thursday, August 8, 2019 at 11:43 AM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] lctl set_param obdfilter.*.readcache_max_filesize=2M 
fails

 


 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lctl set_param obdfilter.*.readcache_max_filesize=2M fails

2019-08-08 Thread Pinkesh Valdria
Hello Lustre experts,

 

I am fairly new to lustre and I did a deployment of it on Oracle Public Cloud 
using instructions on whamcloud wiki pages.    I am now trying to set some 
parameters for better performance and need help to understand why I am getting 
this error: 

 

On OSS Servers:  

[root@lustre-oss-server-nic0-1 ~]# lctl set_param 
obdfilter.*.readcache_max_filesize=2M

error: set_param: param_path 'obdfilter/*/readcache_max_filesize': No such file 
or directory

[root@lustre-oss-server-nic0-1 ~]#

 

 

Also I have seen this happen on Client nodes:

lctl set_param ldlm_timeout=200  - This fails with below error 

error: set_param: param_path 'ldlm_timeout': No such file or directory

 

 

Appreciate your help.  

 

 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] MDS/MGS has a block storage device mounted and it does not have any permissions (no read , no write, no execute)

2019-02-06 Thread Pinkesh Valdria
Thanks Andreas.   Given below are the output of the commands you asked to run:.

> [root@lustre-mds-server-1 opc]#
>   • Assuming if the above is not an issue,  after setting up OSS/OST and 
> Client node,  When my client tries to mount, I get the below error: 
> [root@lustre-client-1 opc]# mount -t lustre 10.0.2.4@tcp:/lustrewt 
> /mnt
> mount.lustre: mount 10.0.2.4@tcp:/lustrewt at /mnt failed: 
> Input/output error Is the MGS running?
> [root@lustre-client-1 opc]#

Andreas:  Can you do "lctl ping" from the client to the MGS node?  Most 
commonly this happens because the client still has a firewall configured, or it 
is defined to have "127.0.0.1" as the local node address.

Pinkesh response:  
[root@lustre-client-1 opc]# lctl ping 10.0.2.6@tcp
12345-0@lo
12345-10.0.2.6@tcp

So there is a "lo" mentioned here, could that be a problem?


I also ran the mount command on client node to capture logs on both the client 
node and MDS node.  

(ran command at 18.11 time)
[root@lustre-client-1 opc]# mount -t lustre 10.0.2.6@tcp:/lustrewt /mnt
mount.lustre: mount 10.0.2.6@tcp:/lustrewt at /mnt failed: Input/output error
Is the MGS running?
[root@lustre-client-1 opc]#


[root@lustre-mds-server-1 opc]# tail -f /var/log/messages  
Feb  6 18:11:38 lustre-mds-server-1 kernel: Lustre: MGS: Connection restored to 
88e1c321-1eaa-6914-5a37-4fff2063b526 (at 10.0.0.2@tcp)
Feb  6 18:11:38 lustre-mds-server-1 kernel: Lustre: Skipped 1 previous similar 
message
Feb  6 18:11:45 lustre-mds-server-1 kernel: Lustre: MGS: Received new LWP 
connection from 10.0.0.2@tcp, removing former export from same NID
Feb  6 18:11:45 lustre-mds-server-1 kernel: Lustre: MGS: Connection restored to 
88e1c321-1eaa-6914-5a37-4fff2063b526 (at 10.0.0.2@tcp)


[root@lustre-client-1 opc]# less /var/log/messages  
Feb  6 18:10:01 lustre-client-1 systemd: Removed slice User Slice of root.
Feb  6 18:10:01 lustre-client-1 systemd: Stopping User Slice of root.
Feb  6 18:11:45 lustre-client-1 kernel: Lustre: 
10376:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1549476698/real 1549476698]  req@9259bb42a100 
x1624614953288736/t0(0) o503->MGC10.0.2.6@tcp@10.0.2.6@tcp:26/25 lens 272/8416 
e 0 to 1 dl 1549476705 ref 2 fl Rpc:X/0/ rc 0/-1
Feb  6 18:11:45 lustre-client-1 kernel: Lustre: 
10376:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar 
message
Feb  6 18:11:45 lustre-client-1 kernel: LustreError: 166-1: MGC10.0.2.6@tcp: 
Connection to MGS (at 10.0.2.6@tcp) was lost; in progress operations using this 
service will fail
Feb  6 18:11:45 lustre-client-1 kernel: LustreError: 15c-8: MGC10.0.2.6@tcp: 
The configuration from log 'lustrewt-client' failed (-5). This may be the 
result of communication errors between this node and the MGS, a bad 
configuration, or other errors. See the syslog for more information.
Feb  6 18:11:45 lustre-client-1 kernel: Lustre: MGC10.0.2.6@tcp: Connection 
restored to MGC10.0.2.6@tcp_0 (at 10.0.2.6@tcp)
Feb  6 18:11:45 lustre-client-1 kernel: Lustre: Unmounted lustrewt-client
Feb  6 18:11:45 lustre-client-1 kernel: LustreError: 
10376:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-5)




Thanks,
Pinkesh Valdria
OCI – Big Data
Principal Solutions Architect 
m: +1-206-234-4314
pinkesh.vald...@oracle.com


-Original Message-
From: Andreas Dilger  
Sent: Wednesday, February 6, 2019 2:28 AM
To: Pinkesh Valdria 
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] MDS/MGS has a block storage device mounted and it 
does not have any permissions (no read , no write, no execute)

On Feb 5, 2019, at 15:39, Pinkesh Valdria  wrote:
> 
> Hello All,
>  
> I am new to Lustre.   I started by using the docs on this page to deploy 
> Lustre on Virtual machines running CentOS 7.x (CentOS-7-2018.08.15-0).
> Included below are the content of the scripts I used and the error I get.  
> I have not done any setup for “o2ib0(ib0)” and lnet is using tcp.   All the 
> nodes are on the same network & subnet and cannot communicate on my protocol 
> and port #. 
>  
> Thanks for your help.  I am completely blocked and looking for ideas. 
> (already did google search ☹).  
>  
> I have 2 questions:  
>   • The MDT mounted on MDS has no permissions (no read , no write, no 
> execute), even for root user on MDS/MGS node.   Is that expected? .   See 
> “MGS/MDS node setup” section for more details on what I did. 
> [root@lustre-mds-server-1 opc]# mount -t lustre /dev/sdb /mnt/mdt
>  
> [root@lustre-mds-server-1 opc]# ll /mnt
> total 0
> d-. 1 root root 0 Jan  1  1970 mdt

The mountpoint on the MDS is just there for "df" to work and to manage the 
block device. It does not provide access to filesystem.  You need to do a 
client mount for that (typically on another node, but 

[lustre-discuss] MDS/MGS has a block storage device mounted and it does not have any permissions (no read , no write, no execute)

2019-02-05 Thread Pinkesh Valdria
work up

LNET configured

[root@lustre-mds-server-1 opc]# lctl list_nids

10.0.2.4@tcp

 

[root@lustre-mds-server-1 opc]# ll /mnt

total 0

d-. 1 root root 0 Jan  1  1970 mdt

[root@lustre-mds-server-1 opc]#

 

 

OSS/OST node

1 OSS node with 1 block device for OST (/dev/sdb). The setup to update kernel 
was the same as MGS/MDS node (described above),  then I ran the below commands: 

 

 

mkfs.lustre --ost --fsname=lustrewt --index=0 --mgsnode=10.0.2.4@tcp /dev/sdb

mkdir -p /ostoss_mount

mount -t lustre /dev/sdb /ostoss_mount

 

 

Client  node

1 client node. The setup to update kernel was the same as MGS/MDS node 
(described above),  then I ran the below commands: 

 

[root@lustre-client-1 opc]# modprobe lustre

[root@lustre-client-1 opc]# mount -t lustre 10.0.2.3@tcp:/lustrewt /mnt   (This 
fails with below error):

mount.lustre: mount 10.0.2.4@tcp:/lustrewt at /mnt failed: Input/output error

Is the MGS running?

[root@lustre-client-1 opc]#

 

 

 

 

Thanks,

Pinkesh Valdria

OCI – Big Data

Principal Solutions Architect 

m: +1-206-234-4314

HYPERLINK "mailto:pinkesh.vald...@oracle.com"pinkesh.vald...@oracle.com

 

 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org