Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11512
# pigs52 /root > time cat /proc/fs/lustre/health_check
healthy
real 8m18.093s
user 0m0.000s
sys 0m0.001s
While an OSS is under a heavy ost_destroy load, reading the proc entry
/proc/fs/lustre/health_check can take an unreasonably long time. This
disrupts our ability to effectively monitor the health of the filesystem.
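As a stopgap for monitoring scripts, the read could be done in the background
so a blocked health_check doesn't also hang the monitor. This is only a sketch
of a possible workaround, not anything shipped with Lustre; the temp file name
and the 10 second threshold are arbitrary:

  # Bounded health probe (illustrative only).  The reader may stay stuck
  # in the kernel, but the monitor gives up after 10 seconds instead of
  # hanging along with it.
  out=/tmp/health_check.$$
  cat /proc/fs/lustre/health_check > $out 2>/dev/null &
  tries=10
  while [ $tries -gt 0 ] && [ ! -s $out ]; do
      sleep 1
      tries=$((tries - 1))
  done
  if [ -s $out ]; then
      cat $out
  else
      echo "health_check read blocked for >10s -- treating as NOT healthy"
  fi
  rm -f $out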
/proc/fs/lustre/ost/OSS/ost_io//stats @ 1168365762.925608
Name Cur.Rate #Events last min avg max
ost_destroy 18 1048933 15277920.89 5 5944339.71 388464831
# pigs52 /root > uptime
10:18:00 up 17 days, 18:38, 2 users, load average: 242.51, 228.74, 227
Notice the excessively long average destroy time of 5.9 seconds, resulting
in virtually all of the threads being busy and a load average of 242+. The
slow destroys look like they're the result of swamping the journal. This
is likely only this bad now because we've disabled the writeback caches on
our DDNs, which substantially slows down the journal I/O. It seems likely
to me that any OST activity which overwhelms the journal, not just destroys,
will result in this very sluggish behavior.
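As a rough cross-check (assuming the avg column in the stats above is in
microseconds and Cur.Rate is per second), Little's law gives

  18 destroys/sec * 5.94 sec/destroy ~= 107 destroys in flight

which by itself ties up a large share of the ost_io service threads and is
consistent with the 242+ load average once the other blocked threads are
counted.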
PIGS: ddn-7b: host taskinfo
DMT301,C op2A L:19 S:0000E8:42E0 a: 79 l: 2 01/01 00,00 W:WRU WT T:3
DMT311,5 op2A L:19 S:0000E8:4310 a: 80 l: 1 01/01 00,00 W:WRU WT T:3
DMT312,6 op2A L:21 S:0000E8:4340 a: 274AE l: 1 01/01 00,00 W:WRU WT T:3
DMT305,4 op2A L:21 S:0000E8:4358 a: 274BE l: 1 01/01 01,00 W:WRU WT T:3
DMT308,9 op2A L:21 S:0000E8:4370 a: 274CD l: 1 00/00 01+00 W:1/1 WT T:3
DMT310,B op2A L:19 S:0000E8:4388 a: 82 l: 2 01/01 00,00 W:WRU WT T:3
DMT303,3 op2A L:19 S:0000E8:43A0 a: 8A l: 1 01/01 01,00 W:WRU WT T:3
DMT304,2 op2A L:19 S:0000E8:43B8 a: 8C l: 3 00/00 01+00 W:1/1 WT T:2
DMT300,0 op2A L:21 S:0000E8:43E8 a: 274D8 l: 1 00/00 01+00 W:1/1 WT T:2
DMT306,7 op2A L:21 S:0000E8:4400 a: 274DD l: 1 00/00 01+00 W:1/1 WT T:2
DMT315,1 op2A L:21 S:0000E8:4490 a: 274F1 l: 1 00/00 01+00 W:1/1 WT T:1
DMT313,A op2A L:21 S:0000E8:44A8 a: 274FE l: 1 00/00 01+00 W:1/1 WT T:1
DMT316,8 op2A L:21 S:0000E8:44C0 a: 2750F l: 1 00/00 01+00 W:1/1 WT T:1
DMT406,2 op2A L:24 S:0000E8:2C18 a: 324A6 l: 1 01/01 00,00 W:WRU WT T:5
DMT405,C op2A L:24 S:0000E8:2C30 a: 324B6 l: 1 01/01 00,00 W:WRU WT T:4
DMT408,0 op2A L:24 S:0000E8:2C48 a: 3256E l: 1 01/01 01,00 W:WRU WT T:4
DMT407,A op2A L:23 S:0000E8:2D20 a: 38 l: 2 01/01 00,00 W:WRU WT T:3
DMT403,7 op2A L:23 S:0000E8:2D38 a: 3C l: 1 01/01 00,00 W:WRU WT T:3
DMT416,4 op2A L:23 S:0000E8:2D50 a: 3E l: 1 01/01 00,00 W:WRU WT T:2
DMT409,B op2A L:23 S:0000E8:2D68 a: 41 l: 1 01/01 01,00 W:WRU WT T:2
DMT400,9 op2A L:23 S:0000E8:2D98 a: 49 l: 2 00/00 01+00 W:1/1 WT T:1
DMT411,6 op2A L:23 S:0000E8:2DB0 a: 4C l: 3 00/00 01+00 W:1/1 WT T:1
DMT414,5 op2A L:24 S:0000E8:2DC8 a: 328B9 l: 1 01/01 00,00 W:WRU WT T:1
DMT413,8 op2A L:24 S:0000E8:2DE0 a: 328EA l: 1 01/01 00,00 W:WRU WT T:1
DMT404,3 op2A L:24 S:0000E8:2DF8 a: 32903 l: 1 01/01 01,00 W:WRU WT T:1
DMT415,1 op2A L:22 S:0000E8:2E10 a: 32666 l: 1 01/01 01,00 W:WRU WT T:1
From the DDN output we can see that all of the active commands are localized
around low-numbered addresses, exactly where we would expect the journal to
be located.
A backtrace from the timed cat above shows the proc read is waiting in
log_wait_commit(). Perhaps it's not the best of ideas to have the
health check rely on speedy access to the disk; it's quite often the
case on our servers that disk access will be slow.
crash> bt 25674
PID: 25674 TASK: 10009024980 CPU: 2 COMMAND: "cat"
#0 [10005085bf8] schedule at ffffffff802ec76f
#1 [10005085cd0] log_wait_commit at ffffffffa0392fc8
#2 [10005085d60] journal_stop at ffffffffa038db34
#3 [10005085da0] fsfilt_ldiskfs_write_record at ffffffffa03d82db
#4 [10005085e20] lvfs_check_io_health at ffffffffa02467fa
#5 [10005085e50] filter_health_check at ffffffffa03f1e61
#6 [10005085e70] obd_proc_read_health at ffffffffa025ca81
#7 [10005085eb0] proc_file_read at ffffffff801a5baf
#8 [10005085f10] vfs_read at ffffffff80175f17
#9 [10005085f40] sys_read at ffffffff801761b2
#10 [10005085f80] system_call at ffffffff8010f262
RIP: 0000002a95724a82 RSP: 0000007fbffff6d8 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff8010f262 RCX: 0000002a957244e2
RDX: 0000000000000400 RSI: 0000000000505070 RDI: 0000000000000003
RBP: 0000000000000003 R8: 0000002a9589cb18 R9: 0000000000000200
R10: 0000007fbffff601 R11: 0000000000000246 R12: 0000000000505070
R13: 0000000000000400 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
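For what it's worth, the journal dependency shown in the backtrace should be
easy to reproduce from userspace: the health check path effectively does a
small transactional write and waits for the commit, and any small synchronous
write to a journaled filesystem on the same DDN storage will see the same
multi-second latency. The mount point below is hypothetical (the OST backing
filesystem isn't normally mounted for regular file access), so treat this
purely as an illustration of the mechanism:

  # Hypothetical: /mnt/ost0 stands in for an ext3/ldiskfs filesystem on
  # the same DDN LUNs.  A single fsync'd 4k write has to wait for the
  # journal, much like the fsfilt_ldiskfs_write_record() call above, so
  # its latency should track the health_check latency under this load.
  time dd if=/dev/zero of=/mnt/ost0/health_probe bs=4k count=1 conv=fsync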
_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel