Last night our Hammer cluster suffered a series of OSD crashes on all cluster 
nodes.
We were running Hammer (0.94.1-98-g7df3eb5), a build we made a week ago after 
a major problem that we suspected was related to bugs we found in the 
tracker whose fixes were not yet in 0.94.1. In the meantime we had downgraded 
to the official 0.94.1 to rule out that we had introduced any instability by 
running a HEAD version. No change.

Around 22:00 our users started reporting that a web application was down, and 
at the same time we found lots of OSD crashes and restarts.
The stack trace in the log looks the same on all of them (this one was taken 
later, after I had increased debug_osd=20):

2015-05-19 03:38:35.449476 7f7aed260700 -1 osd/osd_types.h: In function 'void 
ObjectContext::RWState::put_excl(std::list<std::tr1::shared_ptr<OpRequest> >*)' 
thread 7f7aed260700 time 2015-05-19 03:38:35.445434
osd/osd_types.h: 3167: FAILED assert(state == RWEXCL)

 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) 
[0xbc2b8b]
 2: /usr/bin/ceph-osd() [0x8b9b05]
 3: (ReplicatedPG::remove_repop(ReplicatedPG::RepGather*)+0xec) [0x84516c]
 4: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x912) [0x857082]
 5: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x16d) [0x857bbd]
 6: (Context::complete(int)+0x9) [0x6caf09]
 7: (ReplicatedBackend::op_applied(ReplicatedBackend::InProgressOp*)+0x1ec) 
[0xa081dc]
 8: (Context::complete(int)+0x9) [0x6caf09]
 9: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af634]
 10: (Context::complete(int)+0x9) [0x6caf09]
 11: (void finish_contexts<Context>(CephContext*, std::list<Context*, 
std::allocator<Context*> >&, int)+0x94) [0x70b764]
 12: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6cb759]
 13: (Finisher::finisher_thread_entry()+0x158) [0xaef528]
 14: (()+0x8182) [0x7f7afd540182]
 15: (clone()+0x6d) [0x7f7afbaabfbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.
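
(In case it matters for reproduction: debug_osd was raised to 20 before 
capturing the trace above. Whether it was set via ceph.conf or injected at 
runtime is specific to our setup, but the equivalent fragment would look 
roughly like this:)

```
[osd]
; runtime equivalent: ceph tell osd.* injectargs '--debug-osd 20'
debug osd = 20
```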

Right before the crash, log lines like this were output (see also: 
http://pastebin.com/W0srF0bW):
2015-05-19 03:38:35.445346 7f7aed260700 15 osd.5 pg_epoch: 34697 pg[81.1f9( v 
34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 
34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 
34688'23684 mlcod 34688'23683 active+clean] do_osd_op_effects client.141679501 
con 0x13fcd180
2015-05-19 03:38:35.445356 7f7aed260700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 
34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 
34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 
34688'23684 mlcod 34688'23684 active+clean]  removing repgather(0x52f63900 
34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 
op=osd_op(client.141679501.0:19791 
default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call 
refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5)
2015-05-19 03:38:35.445370 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 
34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 
34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 
34688'23684 mlcod 34688'23684 active+clean]    q front is repgather(0x52f63900 
34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 
op=osd_op(client.141679501.0:19791 
default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call 
refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5)
2015-05-19 03:38:35.445381 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 
34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 
34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 
34688'23684 mlcod 34688'23684 active+clean] remove_repop repgather(0x52f63900 
34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 
op=osd_op(client.141679501.0:19791 
default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call 
refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5)
2015-05-19 03:38:35.445393 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 
34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 
34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 
34688'23684 mlcod 34688'23684 active+clean]  obc 
obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/head//81(dne)
 rwstate(excl n=1 w=0))
2015-05-19 03:38:35.445401 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 
34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 
34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 
34688'23684 mlcod 34688'23684 active+clean]  clone_obc 
obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/c8//81
 rwstate(write n=1 w=0))
2015-05-19 03:38:35.445409 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 
34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 
34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 
34688'23684 mlcod 34688'23684 active+clean]  snapset_obc 
obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/snapdir//81
 rwstate(write n=1 w=0))
We cannot pinpoint an exact trigger, but there _seems_ to be some correlation 
with larger uploads into the RGW. This is not yet fully validated, merely a 
timing-based assumption. Could the RGW code be causing OSDs to fail, either 
through a big upload alone or in conjunction with other parallel requests? We 
are seeing more crashes than large uploads, though, so this remains a guess 
at best.

We created bug ticket http://tracker.ceph.com/issues/11677, but we would be 
extremely thankful for some quicker help.
I am in the #ceph IRC channel as dschneller, too.

Thanks!
Daniel


-- 
Daniel Schneller
Infrastructure Engineer / Developer
 
CenterDevice GmbH                  | Merscheider Straße 1
                                   | 42699 Solingen
tel: +49 1754155711                | Deutschland
daniel.schnel...@centerdevice.de   | www.centerdevice.de




_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
