-----Original Message-----
From: "Andreas Ladanyi" <andreas.lada...@kit.edu>
Sent: 2019-12-16 18:34:10 (Monday)
To: huangql <huan...@ihep.ac.cn>, openafs-info <openafs-info@openafs.org>
Cc:
Subject: Re: [OpenAFS] AFS client hanged
Hi,
Dear all,
Recently I have been stuck with some AFS issues.
The AFS client hung with the following log messages. When this happens, the AFS
instance blocks and jobs fail to access any files located in AFS. I have to
reboot the worker node to recover service.
Dec 6 15:03:18 bws0825 kernel: INFO: task afs_callback:19124 blocked for more
than 120 seconds.
Dec 6 15:03:18 bws0825 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 6 15:03:18 bws0825 kernel: afs_callback D ffff9860d826e180 0 19124
2 0x00000000
Dec 6 15:03:18 bws0825 kernel: Call Trace:
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa2169df9>]
schedule_preempt_disabled+0x29/0x70
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa2167d77>]
__mutex_lock_slowpath+0xc7/0x1d0
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa216715f>] mutex_lock+0x1f/0x2f
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc084dff4>]
SRXAFSCB_InitCallBackState+0x34/0x470 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc0898047>] ? afs_xdr_vector+0x57/0x90
[openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc084f19e>]
SRXAFSCB_InitCallBackState3+0xe/0x10 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08b6f43>]
RXAFSCB_ExecuteRequest+0x6f3/0x8a0 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1b028ae>] ? getnstimeofday64+0xe/0x30
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08ae589>] ? afs_mutex_exit+0x29/0x40
[openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08a6a5d>] rxi_ServerProc+0xcd/0x1e0
[openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ?
afs_shutdown_pagecopy+0x20/0x20 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08af017>] rx_ServerProc+0x87/0xe0
[openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc084eedd>]
afs_RXCallBackServer+0x3d/0x50 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08c76a5>] afsd_thread+0x1e5/0x730
[openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ?
afs_shutdown_pagecopy+0x20/0x20 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1ac1da1>] kthread+0xd1/0xe0
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ?
insert_kthread_work+0x40/0x40
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa2175c1d>]
ret_from_fork_nospec_begin+0x7/0x21
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ?
insert_kthread_work+0x40/0x40
Dec 6 15:03:18 bws0825 kernel: INFO: task afs_rxevent:19127 blocked for more
than 120 seconds.
Dec 6 15:03:18 bws0825 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 6 15:03:18 bws0825 kernel: afs_rxevent D ffff9860cbbf6180 0 19127
2 0x00000000
Dec 6 15:03:18 bws0825 kernel: Call Trace:
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1aaa2d2>] ? del_timer_sync+0x52/0x60
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa2169df9>]
schedule_preempt_disabled+0x29/0x70
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa2167d77>]
__mutex_lock_slowpath+0xc7/0x1d0
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa216715f>] mutex_lock+0x1f/0x2f
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08bdb58>]
afs_osi_TimedSleep+0x118/0x210 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1ad6b60>] ? wake_up_state+0x20/0x20
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08bdce8>] afs_osi_Wait+0x98/0xd0
[openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ?
afs_shutdown_pagecopy+0x20/0x20 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08af575>]
afs_rxevent_daemon+0x95/0x140 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08c7af6>] afsd_thread+0x636/0x730
[openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ?
afs_shutdown_pagecopy+0x20/0x20 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1ac1da1>] kthread+0xd1/0xe0
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ?
insert_kthread_work+0x40/0x40
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa2175c1d>]
ret_from_fork_nospec_begin+0x7/0x21
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ?
insert_kthread_work+0x40/0x40
Dec 6 15:03:18 bws0825 kernel: INFO: task afs_checkserver:19870 blocked for
more than 120 seconds.
Dec 6 15:03:18 bws0825 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 6 15:03:18 bws0825 kernel: afs_checkserver D ffff9860c7811040 0 19870
2 0x00000000
Dec 6 15:03:18 bws0825 kernel: Call Trace:
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1aaa2d2>] ? del_timer_sync+0x52/0x60
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa2169df9>]
schedule_preempt_disabled+0x29/0x70
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa2167d77>]
__mutex_lock_slowpath+0xc7/0x1d0
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa216715f>] mutex_lock+0x1f/0x2f
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08bdb58>]
afs_osi_TimedSleep+0x118/0x210 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1ad6b60>] ? wake_up_state+0x20/0x20
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08bdce8>] afs_osi_Wait+0x98/0xd0
[openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc0853b08>]
afs_CheckServerDaemon+0x118/0x1a0 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ?
afs_shutdown_pagecopy+0x20/0x20 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08c7930>] afsd_thread+0x470/0x730
[openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffc08c74c0>] ?
afs_shutdown_pagecopy+0x20/0x20 [openafs]
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1ac1da1>] kthread+0xd1/0xe0
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa1ac1cd0>] ?
insert_kthread_work+0x40/0x40
Dec 6 15:03:18 bws0825 kernel: [<ffffffffa2175c1d>]
ret_from_fork_nospec_begin+0x7/0x21
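For anyone triaging a log like the one above, the hung-task headers can be pulled out mechanically. The following is a minimal Python sketch, not part of OpenAFS or any existing tool; the function name hung_tasks and the SAMPLE text are illustrative assumptions. It relies only on the "INFO: task NAME:PID blocked for more than N seconds" wording shown in the log, and it collapses whitespace first because the mail client wrapped the lines.

```python
import re

# Header the kernel prints for each hung task, e.g.
# "INFO: task afs_callback:19124 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(log_text):
    """Return (task name, pid, seconds) for every hung-task header found."""
    # Wrapped mail text splits a header across lines, so join everything
    # into one line with single spaces before matching.
    flat = " ".join(log_text.split())
    return [
        (m["name"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK.finditer(flat)
    ]


# Hypothetical excerpt: only the three header lines from the log above.
SAMPLE = """Dec 6 15:03:18 bws0825 kernel: INFO: task afs_callback:19124 blocked for more
than 120 seconds.
Dec 6 15:03:18 bws0825 kernel: INFO: task afs_rxevent:19127 blocked for more
than 120 seconds.
Dec 6 15:03:18 bws0825 kernel: INFO: task afs_checkserver:19870 blocked for
more than 120 seconds."""

for name, pid, secs in hung_tasks(SAMPLE):
    print(f"{name} (pid {pid}) blocked for more than {secs} s")
```

On the log above this lists the afs_callback, afs_rxevent and afs_checkserver threads. In all three traces the topmost frames are mutex_lock and __mutex_lock_slowpath, so each thread is waiting for a mutex.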
>Is there an I/O-intensive process running in the background?
No, I don't think there is any I/O-intensive process. Most tasks access library
files stored in the AFS file system, such as the gcc, python, and Gaudi libraries.
>Is there a process that uses too much RAM?
Some jobs consume a lot of RAM, but we enforce a memory limit for each job.
>>Is 1.6.23 incompatible with the Linux kernel or the AFS server version?
>SL7 has kernel 3.10, supported since AFS 1.6.4
>SL6 has kernel 2.6, supported before AFS 1.6
>Since AFS 1.6.22.4, kernel support up to 4.18 is included
I know this version covers the SL6 and SL7 kernels. However, version 1.6.20
worked well on the SL6 kernel
kernel-2.6.32-696.20.1.el6.x86_64. After we upgraded to the new Linux kernel and
installed the default OpenAFS client version with yum (the versions we use are
listed below), the hangs started. That is why I suspect a version
compatibility problem.
AFS client-SL7: 1.6.23
[root@bws0825 ~]# rpm -qa|grep openafs
openafs-1.6-sl-client-1.6.23-289.sl7.x86_64
openafs-1.6-sl-authlibs-1.6.23-289.sl7.x86_64
openafs-1.6-sl-devel-1.6.23-289.sl7.x86_64
openafs-1.6-sl-module-tools-1.6.23-289.sl7.x86_64
openafs-1.6-sl-krb5-1.6.23-289.sl7.x86_64
openafs-1.6-sl-1.6.23-289.sl7.x86_64
openafs-1.6-sl-authlibs-devel-1.6.23-289.sl7.x86_64
kmod-openafs-1.6-sl-957-1.6.23-289.sl7.957.x86_64
AFS client-SL6: 1.6.23
openafs-krb5-1.6.23-289.sl6.x86_64
openafs-client-1.6.23-289.sl6.x86_64
openafs-1.6.23-289.sl6.x86_64
openafs-kpasswd-1.6.23-289.sl6.x86_64
openafs-module-tools-1.6.23-289.sl6.x86_64
openafs-kernel-source-1.6.23-289.sl6.x86_64
openafs-firstboot-1.6-1.sl6.noarch
openafs-authlibs-1.6.23-289.sl6.x86_64
kmod-openafs-1.6.22.3-1.SL610.el6.noarch
openafs-compat-1.6.23-289.sl6.x86_64
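The two package lists above can also be compared mechanically. The following is a minimal Python sketch under stated assumptions: the names PKG and kmod_and_client are made up for illustration, and the regular expression assumes the usual name-version-release.arch layout of `rpm -qa` output. It reports the version of the kmod-openafs package next to the version of the client package. In the SL6 list these differ (1.6.22.3 and 1.6.23). Whether that matters is not established here, since the kmod package name may not reflect the module actually loaded.

```python
import re

# Splits "name-version-release.arch" as printed by `rpm -qa`.
# Assumption: the version is the dotted-digit field directly before the
# final dash-separated release field, which holds for the lines listed above.
PKG = re.compile(r"(?P<name>.+?)-(?P<version>\d+(?:\.\d+)+)-(?P<release>[^-]+)$")


def kmod_and_client(rpm_lines):
    """Return (kmod version, client version); None where a package is absent."""
    kmod = client = None
    for line in rpm_lines:
        m = PKG.match(line.strip())
        if not m:
            continue  # not a recognisable package line
        if m["name"].startswith("kmod-openafs"):
            kmod = m["version"]
        elif m["name"].endswith("client"):
            client = m["version"]
    return kmod, client


# Excerpts copied from the two listings above.
SL7 = [
    "openafs-1.6-sl-client-1.6.23-289.sl7.x86_64",
    "kmod-openafs-1.6-sl-957-1.6.23-289.sl7.957.x86_64",
]
SL6 = [
    "openafs-client-1.6.23-289.sl6.x86_64",
    "kmod-openafs-1.6.22.3-1.SL610.el6.noarch",
]

for label, lines in (("SL7", SL7), ("SL6", SL6)):
    kmod, client = kmod_and_client(lines)
    status = "match" if kmod == client else "differ"
    print(f"{label}: kmod {kmod}, client {client} ({status})")
```

On a live node the input could come from the output of `rpm -qa | grep openafs`, one package per line.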
>>Any information you can provide would be appreciated. Thanks.
Regards,
Qiulan
huangql
====================================================================
Computing Center, the Institute of High Energy Physics, CAS, China
Qiulan Huang Tel: (+86) 10 8823 6087
P.O. Box 918-7 Fax: (+86) 10 8823 6839
Beijing 100049 P.R. China Email: huan...@ihep.ac.cn
===================================================================