Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11512



# pigs52 /root >  time cat /proc/fs/lustre/health_check
healthy
real    8m18.093s
user    0m0.000s
sys     0m0.001s

While an OSS is under a heavy ost_destroy load, reading the proc entry
/proc/fs/lustre/health_check can take an unreasonably long time.  This
disrupts our ability to effectively monitor the health of the filesystem.
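
In the meantime, a workaround on the monitoring side is to bound the read
so that a slow health_check fails the probe instead of hanging it.  A
minimal sketch, assuming coreutils timeout(1) is available on the node
(the 30s budget and the exit codes are just placeholders):

  #!/bin/sh
  # Hypothetical monitoring wrapper: a health_check read that takes more
  # than 30s is treated as a failure rather than blocking for minutes.
  HEALTH=/proc/fs/lustre/health_check
  if status=$(timeout 30 cat "$HEALTH") && [ "$status" = "healthy" ]; then
      echo "OK: $status"
      exit 0
  else
      echo "CRITICAL: health_check slow or reported: ${status:-no output}"
      exit 2
  fi

That only keeps the monitor responsive, of course; the read itself is
still stuck behind the journal.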

/proc/fs/lustre/ost/OSS/ost_io//stats @ 1168365762.925608
Name         Cur.Rate #Events  last         min  avg        max
ost_destroy  18       1048933  15277920.89  5    5944339.71 388464831

# pigs52 /root > uptime
10:18:00 up 17 days, 18:38,  2 users,  load average: 242.51, 228.74, 227

Notice the excessively long average destroy time of 5.9 seconds, resulting
in virtually all of the threads being busy and a load average of 242+.  The
slow destroys look like they're the result of swamping the journal.  This
is likely only this bad now because we've disabled the writeback caches on
our DDNs, which substantially slows down journal IO.  It seems likely
to me that any OST activity which overwhelms the journal, not just destroys,
will result in this very sluggish behavior.
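
For reference, the stats times are in microseconds, which is where the
5.9 seconds comes from.  Given output in the column layout shown above
(Name, Cur.Rate, #Events, last, min, avg, max), a throwaway awk along
these lines converts the interesting columns to seconds -- here
ost_io.stats just stands for a saved copy of the stats output above:

  awk '/^ost_destroy/ { printf "rate %s/s  avg %.1f s  max %.1f s\n",
                        $2, $6/1e6, $7/1e6 }' ost_io.stats

which works out to 18 destroys/s at an average of 5.9 s each and a worst
case of roughly 388 s for a single destroy.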

PIGS: ddn-7b: host taskinfo
DMT301,C op2A L:19 S:0000E8:42E0 a:      79 l:   2 01/01 00,00 W:WRU WT T:3
DMT311,5 op2A L:19 S:0000E8:4310 a:      80 l:   1 01/01 00,00 W:WRU WT T:3
DMT312,6 op2A L:21 S:0000E8:4340 a:   274AE l:   1 01/01 00,00 W:WRU WT T:3
DMT305,4 op2A L:21 S:0000E8:4358 a:   274BE l:   1 01/01 01,00 W:WRU WT T:3
DMT308,9 op2A L:21 S:0000E8:4370 a:   274CD l:   1 00/00 01+00 W:1/1 WT T:3
DMT310,B op2A L:19 S:0000E8:4388 a:      82 l:   2 01/01 00,00 W:WRU WT T:3
DMT303,3 op2A L:19 S:0000E8:43A0 a:      8A l:   1 01/01 01,00 W:WRU WT T:3
DMT304,2 op2A L:19 S:0000E8:43B8 a:      8C l:   3 00/00 01+00 W:1/1 WT T:2
DMT300,0 op2A L:21 S:0000E8:43E8 a:   274D8 l:   1 00/00 01+00 W:1/1 WT T:2
DMT306,7 op2A L:21 S:0000E8:4400 a:   274DD l:   1 00/00 01+00 W:1/1 WT T:2
DMT315,1 op2A L:21 S:0000E8:4490 a:   274F1 l:   1 00/00 01+00 W:1/1 WT T:1
DMT313,A op2A L:21 S:0000E8:44A8 a:   274FE l:   1 00/00 01+00 W:1/1 WT T:1
DMT316,8 op2A L:21 S:0000E8:44C0 a:   2750F l:   1 00/00 01+00 W:1/1 WT T:1
DMT406,2 op2A L:24 S:0000E8:2C18 a:   324A6 l:   1 01/01 00,00 W:WRU WT T:5
DMT405,C op2A L:24 S:0000E8:2C30 a:   324B6 l:   1 01/01 00,00 W:WRU WT T:4
DMT408,0 op2A L:24 S:0000E8:2C48 a:   3256E l:   1 01/01 01,00 W:WRU WT T:4
DMT407,A op2A L:23 S:0000E8:2D20 a:      38 l:   2 01/01 00,00 W:WRU WT T:3
DMT403,7 op2A L:23 S:0000E8:2D38 a:      3C l:   1 01/01 00,00 W:WRU WT T:3
DMT416,4 op2A L:23 S:0000E8:2D50 a:      3E l:   1 01/01 00,00 W:WRU WT T:2
DMT409,B op2A L:23 S:0000E8:2D68 a:      41 l:   1 01/01 01,00 W:WRU WT T:2
DMT400,9 op2A L:23 S:0000E8:2D98 a:      49 l:   2 00/00 01+00 W:1/1 WT T:1
DMT411,6 op2A L:23 S:0000E8:2DB0 a:      4C l:   3 00/00 01+00 W:1/1 WT T:1
DMT414,5 op2A L:24 S:0000E8:2DC8 a:   328B9 l:   1 01/01 00,00 W:WRU WT T:1
DMT413,8 op2A L:24 S:0000E8:2DE0 a:   328EA l:   1 01/01 00,00 W:WRU WT T:1
DMT404,3 op2A L:24 S:0000E8:2DF8 a:   32903 l:   1 01/01 01,00 W:WRU WT T:1
DMT415,1 op2A L:22 S:0000E8:2E10 a:   32666 l:   1 01/01 01,00 W:WRU WT T:1

From the DDN task info above, all of the active commands are localized
around low-numbered addresses, exactly where we would expect the journal
to be located.

A backtrace from the timed cat above shows the proc read is waiting in
log_wait_commit().  Perhaps it's not the best of ideas to have the
health check rely on speedy access to the disk; it's quite often the
case on our servers that disk access will be slow.

crash> bt 25674
PID: 25674  TASK: 10009024980       CPU: 2   COMMAND: "cat"
 #0 [10005085bf8] schedule at ffffffff802ec76f
 #1 [10005085cd0] log_wait_commit at ffffffffa0392fc8
 #2 [10005085d60] journal_stop at ffffffffa038db34
 #3 [10005085da0] fsfilt_ldiskfs_write_record at ffffffffa03d82db
 #4 [10005085e20] lvfs_check_io_health at ffffffffa02467fa
 #5 [10005085e50] filter_health_check at ffffffffa03f1e61
 #6 [10005085e70] obd_proc_read_health at ffffffffa025ca81
 #7 [10005085eb0] proc_file_read at ffffffff801a5baf
 #8 [10005085f10] vfs_read at ffffffff80175f17
 #9 [10005085f40] sys_read at ffffffff801761b2
#10 [10005085f80] system_call at ffffffff8010f262
    RIP: 0000002a95724a82  RSP: 0000007fbffff6d8  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffffffff8010f262  RCX: 0000002a957244e2
    RDX: 0000000000000400  RSI: 0000000000505070  RDI: 0000000000000003
    RBP: 0000000000000003   R8: 0000002a9589cb18   R9: 0000000000000200
    R10: 0000007fbffff601  R11: 0000000000000246  R12: 0000000000505070
    R13: 0000000000000400  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
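
Since tasks blocked in uninterruptible sleep like this one count toward
the load average, a quick way to see how much of that 242+ load is
threads piled up behind the journal rather than anything actually running
is to count D-state tasks on the OSS.  A rough sketch:

  # count uninterruptible (D-state) tasks per command name
  ps -eo stat,comm | \
      awk '$1 ~ /^D/ { n[$2]++ } END { for (c in n) print n[c], c }' | \
      sort -rn

On a node in this state I would expect the ll_ost service threads to
dominate that list.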
