Re: [lustre-discuss] Linux 5.6 Kernel Support

2020-04-02 Thread Peter Jones
2.13.53 will not be a release per se; it will be an interim development build.
But, yes, it will be tagged shortly.
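
A minimal sketch for picking up that tag once it appears, assuming the usual
Whamcloud tree (the tag name is taken from this thread):

git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git fetch --tags && git checkout 2.13.53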

From: lustre-discuss on behalf of "Tauferner, Andrew T"
Date: Thursday, April 2, 2020 at 9:06 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Linux 5.6 Kernel Support

I notice that there have been various commits needed for support of recent
Linux kernels. Will there be a release tagged soon that would make it
convenient to get such commits? Maybe something like a 2.13.53 tag is
planned? Thanks.

Andrew Tauferner



[lustre-discuss] Linux 5.6 Kernel Support

2020-04-02 Thread Tauferner, Andrew T
I notice that there have been various commits needed for support of recent
Linux kernels. Will there be a release tagged soon that would make it
convenient to get such commits? Maybe something like a 2.13.53 tag is
planned? Thanks.

Andrew Tauferner



[lustre-discuss] BUG: soft lockup - CPU#13 stuck for 23s! [ll_ost_io01_027:24071]

2020-04-02 Thread Bernd Melchers
Hi all,
after several months of operation we have recently been having stability
problems with our Lustre installation. Each of our seven OSD servers
crashes after some hours with kernel messages like:

NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [ll_ost_io01_027:24071]

The reported stall duration varies between 22s and 23s, and the PID after
the last colon varies as well (e.g. 32281, 30488, 24071).
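
As a side note, 22-23s is consistent with the kernel's default soft-lockup
window of roughly twice kernel.watchdog_thresh (default 10s). A minimal
sketch for inspecting the relevant knobs; both are standard kernel sysctls,
not Lustre-specific:

# report threshold; soft lockups are flagged after about twice this value
sysctl kernel.watchdog_thresh
# optionally panic on a soft lockup so a crash dump can be collected
sysctl -w kernel.softlockup_panic=1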

Our environment:
CentOS 7.7 (recent kernels 3.10.0-1062.12.1 and 3.10.0-1062.18.1)
Lustre 2.12.4 on ZFS 0.7.13
single-rail Omni-Path network (mixed MPI and Lustre traffic)
same behaviour with the in-kernel Omni-Path stack and the Intel stack
(10.10.1.0.36)

At the time of these ll_ost_io kernel messages, the Omni-Path interface of
the failing OSD can no longer ping in either direction.
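
To separate an LNet failure from a plain IP failure, the peer can also be
pinged at the LNet level; a minimal sketch, with the peer address made up
for illustration:

# ICMP ping of the peer's IPoIB address (hypothetical address)
ping 10.10.0.1
# LNet-level ping of the same peer over o2ib0
lctl ping 10.10.0.1@o2ib0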

What I have already done is reduce ost_io.threads_max stepwise from 132
down to 40 (the server has 32 CPU cores):
lctl set_param ost.OSS.ost_io.threads_max=40
I have also switched between kernels 3.10.0-1062.12.1 and 3.10.0-1062.18.1,
and between the in-kernel and the Intel Omni-Path driver.
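
A plain set_param does not survive a reboot; a minimal sketch for making the
setting persistent, assuming Lustre 2.5+ and that the command is run on the
MGS:

lctl set_param -P ost.OSS.ost_io.threads_max=40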

It is not clear to me whether the failing Lustre takes down the server's
Omni-Path interface or the other way round. The Intel Omni-Path utilities
(opatop, fmgui) do not show any problems in the network (or we have not
found them).

Other parameters for the OSDs are:
# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0)"
options ptlrpc at_min=40 at_max=400 ldlm_enqueue_min=260
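
To confirm that LNet actually came up as configured, the NIDs can be read
back; a minimal sketch using standard Lustre commands:

lctl list_nids
lnetctl net show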

# cat /etc/modprobe.d/hfi1.conf
options hfi1 krcvqs=8 piothreshold=0 sge_copy_mode=2 wss_threshold=70 rcvhdrcnt=4096 cap_mask=0x4c09a01cbba
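
To verify that these hfi1 options were applied, the live values can be read
back from sysfs; a minimal sketch (parameter names as in the conf above):

cat /sys/module/hfi1/parameters/krcvqs
cat /sys/module/hfi1/parameters/piothreshold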

# lctl get_param '*.*.*.threads_max timeout *.*.*.timeout'
ldlm.services.ldlm_canceld.threads_max=128
ldlm.services.ldlm_cbd.threads_max=128
ost.OSS.ost.threads_max=132
ost.OSS.ost_create.threads_max=24
ost.OSS.ost_io.threads_max=40
ost.OSS.ost_out.threads_max=24
ost.OSS.ost_seq.threads_max=24
timeout=100
osd-zfs.scratch-OST.quota_slave.timeout=50
osd-zfs.scratch-OST.quota_slave_dt.timeout=50
... (similarly for all six OSTs)
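
When the lockup hits, the stack traces of the stuck I/O threads usually say
more than the watchdog line itself; a minimal sketch, run as root on the
affected server, with the thread name taken from the watchdog message:

for p in $(pgrep ll_ost_io); do
    echo "== PID $p =="
    cat /proc/$p/stack
done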


Any hints?
Bernd Melchers

-- 
Archiv- und Backup-Service | fab-serv...@zedat.fu-berlin.de
Freie Universität Berlin   |
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org