[lustre-discuss] Re: Cannot add new OST after upgrade from 2.5.3 to 2.10.6

2018-12-28 Thread wanglu
Hi, 

The new OSTs were formatted with e2fsprogs-1.44.3.wc1-0.el7.x86_64, while the
MGS and the old OSTs were formatted with e2fsprogs-1.42.12.wc1 last year and
are now mounted with e2fsprogs-1.44.3.wc1-0.el7.x86_64.
Do we need to run writeconf on all the devices, following this process?
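One thing worth checking with such a version gap: e2fsprogs 1.44 can enable ldiskfs features at format time that 1.42-formatted targets lack. A small sketch to list the features a newly formatted OST has that an old one does not; the device paths `/dev/new_ost` and `/dev/old_ost` are placeholders, not the poster's actual devices:

```shell
#!/bin/sh
# Compare the ldiskfs feature sets of two targets with dumpe2fs.
# /dev/new_ost and /dev/old_ost are PLACEHOLDERS; run as root on
# the servers that host the devices.
features() { dumpe2fs -h "$1" 2>/dev/null | sed -n 's/^Filesystem features: *//p'; }

# Print each feature present on target $1 but absent on target $2.
new_only() {
    old=" $(features "$2") "
    for f in $(features "$1"); do
        case "$old" in *" $f "*) ;; *) echo "$f" ;; esac
    done
}

# new_only /dev/new_ost /dev/old_ost
```

Any feature printed by `new_only` (for example metadata_csum, if the newer mke2fs enabled it) is a candidate explanation for old and new targets behaving differently.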
https://lustre-discuss.lustre.narkive.com/Z5s6LU8B/lustre-2-5-2-unable-to-mount-ost
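For reference, the writeconf procedure that the linked thread (and the Lustre manual's "Regenerating Lustre Configuration Logs" section) describes can be sketched roughly as below. The device paths, mount points, and target list are assumptions, not this cluster's actual ones, and by default the script only prints the commands it would run:

```shell
#!/bin/sh
# Hedged sketch of the full writeconf procedure. Device paths, mount
# points, and the target list are ASSUMPTIONS -- substitute your own.
DRYRUN=${DRYRUN:-1}                  # 1 = only print the commands
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

MGT=/dev/mapper/mgt_mdt              # combined MGS/MDT device
OSTS="/dev/mapper/ost0 /dev/mapper/ost1"

# 1. Unmount all clients, then the targets: OSTs first, MDT/MGT last.
for ost in $OSTS; do run umount "$ost"; done
run umount "$MGT"

# 2. Mark every target so its config log is regenerated on next mount.
run tunefs.lustre --writeconf "$MGT"
for ost in $OSTS; do run tunefs.lustre --writeconf "$ost"; done

# 3. Remount in order: MGT/MDT first, then each OST, then clients.
run mount -t lustre "$MGT" /mnt/mdt
i=0
for ost in $OSTS; do run mount -t lustre "$ost" "/mnt/ost$i"; i=$((i+1)); done
```

Note that writeconf erases parameters previously set with lctl conf_param, so those would need to be reapplied after the remount.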
 

Thanks,
Lu


Computing Center, the Institute of High Energy Physics, CAS, China
Wang Lu                        Tel: (+86) 10 8823 6087
P.O. Box 918-7                 Fax: (+86) 10 8823 6839
Beijing 100049, P.R. China     Email: lu.w...@ihep.ac.cn
===
 
From: wanglu
Date: 2018-12-28 10:45
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Cannot add new OST after upgrade from 2.5.3 to 2.10.6
Hi, 

For hardware compatibility reasons, we just upgraded a 2.5.3 instance to
2.10.6. After that, when we tried to mount a newly formatted OST on 2.10.6, we
got failures on the OSS. Here are the symptoms:
1. The OST mount operation gets stuck for about 10 minutes, and then we get
"Is the MGS running?..." on the terminal.
2. In syslog, we found
 LustreError: 166-1: MGC192.168.50.63@tcp: Connection to MGS (at 
192.168.50.63@tcp) was lost; in progress operations using this service will 
fail 
   LustreError: 105461:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) 
### lock timed out (enqueued at 1545962328, 300s ago), entering recovery for 
MGS@MGC192.168.50.63@tcp_0 ns: MGC192.168.50.63@tcp lock: 
9ae9283b8200/0xa4c148c2f2e256b9 lrc: 4/1,0 mode: -/CR res: 
[0x73666361:0x0:0x0].0x0 rrc: 3 type: PLN flags: 0x1 nid: local 
remote: 0x38d3cf901311c189 expref: -99 pid: 105461 timeout: 0 lvb_type: 0
3. While the mount is stuck, we can see ll_OST_XX and lazyldiskfsinit running
on the new OSS, but the obdfilter directory cannot be found under
/proc/fs/lustre
4. On the MDS+MGS node, we got
   "166-1: MGC192.168.50.63@tcp: Connection to MGS (at 0@lo) was lost; in
progress operations using this service will fail" on the MGS
5. After that, new clients cannot mount the filesystem.
6. It seemed the OST mount operation had caused problems on the MGS, so we
unmounted the MDT, ran e2fsck, and remounted it.
7. After that, client mounts are possible again, but "lfs df" shows the new
OST as inactive.
8. When we tried to mount the new OST, the same symptoms repeated...
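Some first checks on the new OSS that might narrow down where the MGS connection fails. The MGS NID is the one from the log above; the OST device path is a placeholder, and each command is printed but only executed if the tool is actually installed:

```shell
#!/bin/sh
# First diagnostic checks for the mount hang. The MGS NID comes from
# the log excerpt; /dev/new_ost_device is a PLACEHOLDER.
MGS_NID=192.168.50.63@tcp
OST_DEV=/dev/new_ost_device

check() {                  # print the command; run it only if present
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then "$@" || true; fi
}

check lctl ping "$MGS_NID"          # is the MGS reachable over LNet?
check lctl list_nids                # which NIDs does this OSS use?
check lctl get_param timeout        # obd timeout (the log shows 300 s)
check tunefs.lustre --dryrun "$OST_DEV"  # on-disk params, no changes
```

The tunefs.lustre --dryrun output shows the mgsnode, fsname, and index the OST was formatted with, which is worth comparing against what the MGS expects.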

Does anyone have a hint on this problem?

Cheers,
Lu


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ldiskfs performance degradation due to kernel swap hogging cpu

2018-12-28 Thread Patrick Farrell


Abe,

You gave some general info, but unless I missed something, nothing specific to 
show any involvement by swap.  How did you determine that?  Can you share that 
data?  And what performance are you expecting here?
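(One way to capture the kind of data being asked for here: sample vmstat's si/so columns and kswapd's accumulated CPU time while the survey runs. A rough sketch; the interval and sample count are arbitrary, and the survey invocation is the script name from the original post, left commented out:)

```shell
#!/bin/sh
# Rough sketch of how to substantiate "swap is hogging the CPU":
# sample vmstat while the survey runs, then sum the si/so columns.
: > vmstat.log                       # create the log up front
vmstat 1 3 >> vmstat.log 2>/dev/null &
# sh ./obdsurvey-script.sh           # run the workload concurrently
wait

# si/so are columns 7 and 8 of the default vmstat layout (pages/s).
awk 'NR>2 {si+=$7; so+=$8} END {printf "pages swapped in=%d out=%d\n", si, so}' vmstat.log

# CPU time accumulated by kswapd so far; growth during the run
# points at memory reclaim rather than the disks.
ps -o comm,time -C kswapd0 2>/dev/null || true
```

Nonzero si/so during the survey, together with kswapd CPU time climbing, would be direct evidence of swap involvement; all zeros would point elsewhere.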

- Patrick

From: lustre-devel  on behalf of Abe 
Asraoui 
Sent: Friday, December 28, 2018 6:42:50 PM
To: Lustre Developement; lustre-discuss@lists.lustre.org; Abe Asraoui
Subject: Re: [lustre-devel] ldiskfs performance degradation due to kernel swap
hogging cpu



+ lustre-discuss



Hi all,
We are seeing low performance with Lustre 2.11 in an ldiskfs configuration
with obdfilter-survey; not sure if this is a known issue.

obdfilter-survey performance under ldiskfs is impacted by kernel swap
hogging CPU usage. The current configuration is as follows:
2 OSTs: ost1, ost2
/dev/sdc on /mnt/mdt type lustre 
(ro,context=unconfined_u:object_r:user_tmp_t:s0,svname=tempAA-MDT,mgs,osd=osd-ldiskfs,user_xattr,errors=remount-ro)
/dev/sdb on /mnt/ost1 type lustre 
(ro,context=unconfined_u:object_r:user_tmp_t:s0,svname=tempAA-OST0001,mgsnode=10.10.10.168@o2ib,osd=osd-ldiskfs,errors=remount-ro)
/dev/sda on /mnt/ost2 type lustre 
(ro,context=unconfined_u:object_r:user_tmp_t:s0,svname=tempAA-OST0002,mgsnode=10.10.10.168@o2ib,osd=osd-ldiskfs,errors=remount-ro)
[root@oss100 htop-2.2.0]#
[root@oss100 htop-2.2.0]# dkms status
lustre-ldiskfs, 2.11.0, 3.10.0-693.21.1.el7_lustre.x86_64, x86_64: installed
spl, 0.7.6, 3.10.0-693.21.1.el7_lustre.x86_64, x86_64: installed
[root@oss100 htop-2.2.0]#
sh ./obdsurvey-script.sh
Mon Dec 10 17:19:52 PST 2018 Obdfilter-survey for case=disk from oss100
ost 2 sz 51200K rsz 1024K obj 2 thr 2 write 134.52 [ 49.99, 101.96] 
rewrite 132.09 [ 49.99, 78.99] read 2566.74 [ 258.96, 2068.71]
ost 2 sz 51200K rsz 1024K obj 2 thr 4 write 195.73 [ 76.99, 128.98] 
rewrite
[root@oss100 htop-2.2.0]# lctl dl
0 UP osd-ldiskfs tempAA-MDT-osd tempAA-MDT-osd_UUID 9
1 UP mgs MGS MGS 4
2 UP mgc MGC10.10.10.168@o2ib 65f231a0-8fd8-001d-6b0f-3e986f914178 4
3 UP mds MDS MDS_uuid 2
4 UP lod tempAA-MDT-mdtlov tempAA-MDT-mdtlov_UUID 3
5 UP mdt tempAA-MDT tempAA-MDT_UUID 8
6 UP mdd tempAA-MDD tempAA-MDD_UUID 3
7 UP qmt tempAA-QMT tempAA-QMT_UUID 3
8 UP lwp tempAA-MDT-lwp-MDT tempAA-MDT-lwp-MDT_UUID 4
9 UP osd-ldiskfs tempAA-OST0001-osd tempAA-OST0001-osd_UUID 4
10 UP ost OSS OSS_uuid 2
11 UP obdfilter tempAA-OST0001 tempAA-OST0001_UUID 5
12 UP lwp tempAA-MDT-lwp-OST0001 tempAA-MDT-lwp-OST0001_UUID 4
13 UP osp tempAA-OST0001-osc-MDT tempAA-MDT-mdtlov_UUID 4
14 UP echo_client tempAA-OST0001_ecc tempAA-OST0001_ecc_UUID 2
15 UP osd-ldiskfs tempAA-OST0002-osd tempAA-OST0002-osd_UUID 4
16 UP obdfilter tempAA-OST0002 tempAA-OST0002_UUID 5
17 UP lwp tempAA-MDT-lwp-OST0002 tempAA-MDT-lwp-OST0002_UUID 4
18 UP osp tempAA-OST0002-osc-MDT tempAA-MDT-mdtlov_UUID 4
19 UP echo_client tempAA-OST0002_ecc tempAA-OST0002_ecc_UUID 2
[root@oss100 htop-2.2.0]#
[root@oss100 htop-2.2.0]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 152.8T 0 disk /mnt/ost2
sdb 8:16 0 152.8T 0 disk /mnt/ost1
sdc 8:32 0 931.5G 0 disk /mnt/mdt
sdd 8:48 0 465.8G 0 disk
├─sdd1 8:49 0 200M 0 part /boot/efi
├─sdd2 8:50 0 1G 0 part /boot
└─sdd3 8:51 0 464.6G 0 part
  ├─centos-root 253:0 0 50G 0 lvm /
  ├─centos-swap 253:1 0 4G 0 lvm [SWAP]
  └─centos-home 253:2 0 410.6G 0 lvm /home
nvme0n1 259:2 0 372.6G 0 disk
└─md124 9:124 0 372.6G 0 raid1
nvme1n1 259:0 0 372.6G 0 disk
└─md124 9:124 0 372.6G 0 raid1
nvme2n1 259:3 0 372.6G 0 disk
└─md125 9:125 0 354G 0 raid1
nvme3n1 259:1 0 372.6G 0 disk
└─md125 9:125 0 354G 0 raid1

thanks,
Abe




___
lustre-devel mailing list
lustre-de...@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

