Hi,
A couple of Changelog bugs have been fixed since 2.1 (some of them quite
recently).
There are some flaws in the way llogs are designed which could lead you to
this state.
This reminds me of similar issues we ran into, but I could not find the
corresponding tickets (there are definitely JIRAs for them, though).
What is your changelog creation rate?
What's your MDS size?
You probably had too many records created or too many records
outstanding in your llog. Do you know what the highest number of
records in the changelog was at any one time?
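
For anyone wanting to answer the two questions above from numbers
collected by hand, here is a minimal sketch in Python. It assumes you
have noted, at a few points in time, the last record index written by
the MDT and the last index acknowledged by the slowest changelog reader.
The Sample type and its field names are hypothetical, made up for this
illustration; they are not part of Lustre or RBH.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Sample:
    """One hand-collected observation of the changelog (hypothetical format)."""
    unix_time: float     # when the observation was taken, in seconds
    current_index: int   # last record index written by the MDT
    cleared_index: int   # last record index acknowledged by the slowest reader


def creation_rate(first: Sample, second: Sample) -> float:
    """Average number of records created per second between two samples."""
    elapsed = second.unix_time - first.unix_time
    if elapsed <= 0:
        raise ValueError("samples must be given in increasing time order")
    return (second.current_index - first.current_index) / elapsed


def outstanding(sample: Sample) -> int:
    """Records written but not yet acknowledged at the time of the sample."""
    return sample.current_index - sample.cleared_index


def peak_outstanding(samples: Iterable[Sample]) -> int:
    """Largest backlog seen across all samples."""
    return max(outstanding(s) for s in samples)
```

A backlog that keeps growing from one sample to the next would suggest
the reader is not keeping up with the creation rate.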
Aurélien
On 22/06/2015 09:54, Carmelo Ponti (CSCS) wrote:
Dear all
Last weekend we had a strange problem with the changelog on one of our
Lustre file systems.
On Saturday Lustre stopped working, with the following errors on the MDS:
Jun 20 10:03:06 monchmds01 kernel: LustreError:
97356:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for
log...
Jun 20 10:03:06 monchmds01 kernel: LustreError:
97356:0:(llog_obd.c:461:llog_obd_origin_add()) write one catalog record
failed: -28
Jun 20 10:03:06 monchmds01 kernel: LustreError:
97331:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for
log...
Jun 20 10:03:06 monchmds01 kernel: LustreError:
97331:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed:
rc=-28 op17
t[0x20cc50b18:0x1e83:0x0]
Jun 20 10:03:06 monchmds01 kernel: LustreError:
97331:0:(llog_obd.c:461:llog_obd_origin_add()) write one catalog record
failed: -28
Jun 20 10:03:06 monchmds01 kernel: LustreError:
97331:0:(llog_obd.c:461:llog_obd_origin_add()) Skipped 1 previous similar
message
Jun 20 10:03:06 monchmds01 kernel: LustreError:
97331:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed:
rc=-28 op17
t[0x20cc1a400:0x1f9a0:0x0]
Jun 20 10:03:07 monchmds01 kernel: LustreError:
114688:0:(mdd_dir.c:665:mdd_changelog_ns_store()) changelog failed:
rc=-28, op6
monchc206_3250911.0 c[0x20cc345f8:0x1f909:0x0] p[0x200156a5e:0x36:0x0]
Jun 20 10:03:07 monchmds01 kernel: LustreError:
120659:0:(mdd_dir.c:665:mdd_changelog_ns_store()) changelog failed:
rc=-28, op6
monchc205_3250911.0 c[0x20cc8cff0:0x1b9c:0x0] p[0x200156a5e:0x36:0x0]
Jun 20 10:03:07 monchmds01 kernel: LustreError:
114688:0:(mdd_dir.c:665:mdd_changelog_ns_store()) Skipped 3 previous
similar messages
Jun 20 10:03:07 monchmds01 kernel: LustreError:
16776:0:(mdd_dir.c:747:mdd_changelog_ext_ns_store()) changelog failed:
rc=-28, op8
out24.dcd c[0x20cc8c820:0x8bf1:0x0] p[0x20cc4dc38:0x1cf:0x0]
Jun 20 10:03:07 monchmds01 kernel: Lustre:
16776:0:(cmm_object.c:697:cml_rename_warn()) cml_rename failed for
mdo_rename, should revoke: [mo_po [0x20cc4dc38:0x1cf:0x0]] [mo_pn
[0x20cc4dc38:0x1cf:0x0]] [lf [0x20cc8c820:0x8bf1:0x0]] [sname out.dcd]
[mo_t NULL] [tname out24.dcd] [err -14]
...
And /var/log/messages on the RBH servers was full of the following
messages:
Jun 18 22:07:40 monchrbh01 kernel: LustreError: 11-0:
lnec-MDT0000-mdc-ffff88014558e400: Communicating with
148.187.72.14@o2ib, operation llog_origin_handle_open failed with -116.
Jun 18 22:07:40 monchrbh01 kernel: LustreError:
89229:0:(llog_cat.c:192:llog_cat_id2handle())
lnec-MDT0000-mdc-ffff88014558e400: error opening log id
0x0:1315963671:2e9d340b: rc = -116
Jun 18 22:07:40 monchrbh01 kernel: LustreError:
89229:0:(llog_cat.c:565:llog_cat_process_cb())
lnec-MDT0000-mdc-ffff88014558e400: cannot find handle for llog
0x0:1315963671: -116
Jun 18 22:08:25 monchrbh01 kernel: Lustre:
89281:0:(llog_cat.c:615:llog_cat_process_or_fork()) catlog
0x21800004:1 crosses index zero
Jun 18 22:08:25 monchrbh01 kernel: Lustre:
89281:0:(llog_cat.c:615:llog_cat_process_or_fork()) Skipped 557
previous similar messages
Jun 18 22:18:26 monchrbh01 kernel: Lustre:
89939:0:(llog_cat.c:615:llog_cat_process_or_fork()) catlog
0x21800004:1 crosses index zero
Jun 18 22:18:26 monchrbh01 kernel: Lustre:
89939:0:(llog_cat.c:615:llog_cat_process_or_fork()) Skipped 557
previous similar messages
Jun 18 22:28:26 monchrbh01 kernel: Lustre:
90577:0:(llog_cat.c:615:llog_cat_process_or_fork()) catlog
0x21800004:1 crosses index zero
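
The return codes in the excerpts above are negated Linux errno values:
-28 is ENOSPC, -116 is ESTALE and -14 is EFAULT. As a reading aid, here
is a small Python sketch that pulls those codes out of a log message and
names them. The regular expression only covers the spellings that occur
in these excerpts ("rc=-28", "rc = -116", "failed: -28", "failed with
-116", "err -14"), and the function name is made up for this example.

```python
import re

# Linux errno numbers that appear in the log excerpts, with their
# standard names and descriptions.
LINUX_ERRNO = {
    14: ("EFAULT", "Bad address"),
    28: ("ENOSPC", "No space left on device"),
    116: ("ESTALE", "Stale file handle"),
}

# Matches the spellings of a negative return code seen in the excerpts.
RC_PATTERN = re.compile(r"\b(?:rc\s*=|failed:|failed with|err)\s*-(\d+)")


def decode_return_codes(message: str) -> list:
    """Return the errno name for each negative return code in one message.

    Codes missing from LINUX_ERRNO are reported as 'errno N'.
    """
    names = []
    for match in RC_PATTERN.finditer(message):
        code = int(match.group(1))
        name, _description = LINUX_ERRNO.get(code, ("errno %d" % code, ""))
        names.append(name)
    return names
```

For instance, decode_return_codes("changelog failed: rc=-28 op17")
yields ["ENOSPC"], which matches the "no free catalog slots" message
logged next to it.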
I don't understand why the changelog catalog was full (no free catalog
slots for log...). Has anyone had similar problems before?
I am also wondering whether this is a general Lustre problem or whether
it is related to RBH.
For the moment we have de-registered the changelog and stopped RBH.
Thank you in advance
Carmelo Ponti
Additional information:
MDS lustre version: 2.1.6
MDS OS version: CentOS release 6.4 (Final)
RBH lustre client version: 2.5.4
RBH OS version: CentOS release 6.6 (Final)
RBH version: 2.5.4
--
----------------------------------------------------------------------
Carmelo Ponti System Engineer
CSCS Swiss Center for Scientific Computing
Via Trevano 131 Email: [email protected]
CH-6900 Lugano http://www.cscs.ch
Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
----------------------------------------------------------------------
_______________________________________________
robinhood-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/robinhood-support