Dear all,
Last weekend we ran into a strange problem with the changelog on one of
our Lustre file systems.
On Saturday Lustre stopped working, with the following errors on the MDS:
Jun 20 10:03:06 monchmds01 kernel: LustreError: 97356:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for log...
Jun 20 10:03:06 monchmds01 kernel: LustreError: 97356:0:(llog_obd.c:461:llog_obd_origin_add()) write one catalog record failed: -28
Jun 20 10:03:06 monchmds01 kernel: LustreError: 97331:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for log...
Jun 20 10:03:06 monchmds01 kernel: LustreError: 97331:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed: rc=-28 op17 t[0x20cc50b18:0x1e83:0x0]
Jun 20 10:03:06 monchmds01 kernel: LustreError: 97331:0:(llog_obd.c:461:llog_obd_origin_add()) write one catalog record failed: -28
Jun 20 10:03:06 monchmds01 kernel: LustreError: 97331:0:(llog_obd.c:461:llog_obd_origin_add()) Skipped 1 previous similar message
Jun 20 10:03:06 monchmds01 kernel: LustreError: 97331:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed: rc=-28 op17 t[0x20cc1a400:0x1f9a0:0x0]
Jun 20 10:03:07 monchmds01 kernel: LustreError: 114688:0:(mdd_dir.c:665:mdd_changelog_ns_store()) changelog failed: rc=-28, op6 monchc206_3250911.0 c[0x20cc345f8:0x1f909:0x0] p[0x200156a5e:0x36:0x0]
Jun 20 10:03:07 monchmds01 kernel: LustreError: 120659:0:(mdd_dir.c:665:mdd_changelog_ns_store()) changelog failed: rc=-28, op6 monchc205_3250911.0 c[0x20cc8cff0:0x1b9c:0x0] p[0x200156a5e:0x36:0x0]
Jun 20 10:03:07 monchmds01 kernel: LustreError: 114688:0:(mdd_dir.c:665:mdd_changelog_ns_store()) Skipped 3 previous similar messages
Jun 20 10:03:07 monchmds01 kernel: LustreError: 16776:0:(mdd_dir.c:747:mdd_changelog_ext_ns_store()) changelog failed: rc=-28, op8 out24.dcd c[0x20cc8c820:0x8bf1:0x0] p[0x20cc4dc38:0x1cf:0x0]
Jun 20 10:03:07 monchmds01 kernel: Lustre: 16776:0:(cmm_object.c:697:cml_rename_warn()) cml_rename failed for mdo_rename, should revoke: [mo_po [0x20cc4dc38:0x1cf:0x0]] [mo_pn [0x20cc4dc38:0x1cf:0x0]] [lf [0x20cc8c820:0x8bf1:0x0]] [sname out.dcd] [mo_t NULL] [tname out24.dcd] [err -14]
...
In addition, /var/log/messages on the RBH servers was full of the
following messages:
Jun 18 22:07:40 monchrbh01 kernel: LustreError: 11-0: lnec-MDT0000-mdc-ffff88014558e400: Communicating with 148.187.72.14@o2ib, operation llog_origin_handle_open failed with -116.
Jun 18 22:07:40 monchrbh01 kernel: LustreError: 89229:0:(llog_cat.c:192:llog_cat_id2handle()) lnec-MDT0000-mdc-ffff88014558e400: error opening log id 0x0:1315963671:2e9d340b: rc = -116
Jun 18 22:07:40 monchrbh01 kernel: LustreError: 89229:0:(llog_cat.c:565:llog_cat_process_cb()) lnec-MDT0000-mdc-ffff88014558e400: cannot find handle for llog 0x0:1315963671: -116
Jun 18 22:08:25 monchrbh01 kernel: Lustre: 89281:0:(llog_cat.c:615:llog_cat_process_or_fork()) catlog 0x21800004:1 crosses index zero
Jun 18 22:08:25 monchrbh01 kernel: Lustre: 89281:0:(llog_cat.c:615:llog_cat_process_or_fork()) Skipped 557 previous similar messages
Jun 18 22:18:26 monchrbh01 kernel: Lustre: 89939:0:(llog_cat.c:615:llog_cat_process_or_fork()) catlog 0x21800004:1 crosses index zero
Jun 18 22:18:26 monchrbh01 kernel: Lustre: 89939:0:(llog_cat.c:615:llog_cat_process_or_fork()) Skipped 557 previous similar messages
Jun 18 22:28:26 monchrbh01 kernel: Lustre: 90577:0:(llog_cat.c:615:llog_cat_process_or_fork()) catlog 0x21800004:1 crosses index zero
I don't understand why the changelog catalog was full ("no free catalog
slots for log..."), so I would like to know whether anyone has had
similar problems before.
I am also wondering whether this is a general Lustre problem or whether
it is related to RBH.
For the moment we have de-registered the changelog and stopped RBH.
Thank you in advance
Carmelo Ponti
Additional information:
MDS Lustre version: 2.1.6
MDS OS version: CentOS release 6.4 (Final)
RBH Lustre client version: 2.5.4
RBH OS version: CentOS release 6.6 (Final)
RBH version: 2.5.4
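
For reference when reading the logs above: the negative numbers (-28, -116,
-14) are Linux errno values returned by the kernel code. The following is
only a minimal sketch for decoding them from log lines; the function name,
the regular expression and the small lookup table are illustrative
assumptions, not part of Lustre or RBH.

```python
import re

# Linux errno values for the three codes that appear in the logs above.
# The table is hard-coded so the result does not depend on the platform
# the script runs on.
LINUX_ERRNO = {
    14: ("EFAULT", "Bad address"),
    28: ("ENOSPC", "No space left on device"),
    116: ("ESTALE", "Stale file handle"),
}

# Hypothetical pattern covering the spellings seen in these logs:
# "rc=-28", "rc = -116", "failed: -28", "failed with -116", "err -14".
RC_PATTERN = re.compile(r"(?:rc\s*=\s*|failed:\s*|failed with\s*|err\s+)-(\d+)")


def decode_return_codes(line):
    """Return (number, name, description) for each return code found in line."""
    unknown = ("UNKNOWN", "not in table")
    return [
        (int(num), *LINUX_ERRNO.get(int(num), unknown))
        for num in RC_PATTERN.findall(line)
    ]


if __name__ == "__main__":
    samples = [
        "(llog_obd.c:461:llog_obd_origin_add()) write one catalog record failed: -28",
        "operation llog_origin_handle_open failed with -116.",
        "[tname out24.dcd] [err -14]",
    ]
    for sample in samples:
        for num, name, text in decode_return_codes(sample):
            print(f"-{num}: {name} ({text})")
```

Read this way, the -28 on the MDS is ENOSPC, which matches the "no free
catalog slots" message, and the -116 on the RBH side is ESTALE.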
-- ----------------------------------------------------------------------
Carmelo Ponti System Engineer
CSCS Swiss Center for Scientific Computing
Via Trevano 131 Email: [email protected]
CH-6900 Lugano http://www.cscs.ch
Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
----------------------------------------------------------------------
_______________________________________________
robinhood-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/robinhood-support