Hi list,

over the last few days our Ceph cluster did not handle a recovery correctly,
and several VM images on RBD have been corrupted. I'm currently trying to
understand what happened and how to avoid such problems in the future.

I'll describe the course of events.

1. We are running 32 OSDs on 4 storage servers, all Ceph 0.72.2.

2. Three OSD processes on 2 storage servers crashed one after another due to
the glibc "invalid fastbin entry (free)" bug. The time between consecutive
crashes was long enough for the cluster to return to HEALTH_OK in between.

3. I tried to restart the crashed OSD processes. While recovery was starting,
the nf_conntrack tables in the hosts' firewalls overflowed (kernel:
nf_conntrack: table full, dropping packet), causing packets of random TCP
connections to be dropped.

4. The OSDs struggled to recover. One OSD crashed with a failed assertion.

In the logs, I see tens of thousands of lines like:

2014-06-07 14:25:10.337771 7f0e989f3700  0 -- 172.22.8.13:6800/14741 >>
172.22.8.12:6802/26867 pipe(0x7f0e9c8b0090 sd=122 :43869 s=2 pgs=37296
cs=36291 l=0 c=0x7f0e9c8b02f0).fault, initiating reconnect
2014-06-07 14:25:10.345561 7f0e8aded700  0 -- 172.22.8.13:6800/14741 >>
172.22.8.16:6814/12031 pipe(0x7f0e7c4ef220 sd=33 :53214 s=2 pgs=30224 cs=51533
l=0 c=0x7f0e7c4ddc20).fault, initiating reconnect
2014-06-07 14:25:10.345569 7f0e899d9700  0 -- 172.22.8.13:6800/14741 >>
172.22.8.16:6804/19906 pipe(0x7f0e9c002010 sd=428 :47466 s=2 pgs=58398
cs=44949 l=0 c=0x7f0e9c002270).fault, initiating reconnect
2014-06-07 14:25:10.345609 7f0e920e7700  0 -- 172.22.8.13:6800/14741 >>
172.22.8.16:6806/32760 pipe(0x7f0e7c4e0720 sd=32 :54326 s=2 pgs=46482 cs=47809
l=0 c=0x7f0e7c4e0980).fault, initiating reconnect
2014-06-07 14:25:10.345627 7f0e92df4700  0 -- 172.22.8.13:6800/14741 >>
172.22.8.16:6802/17411 pipe(0x7f0e7c4db680 sd=166 :34622 s=2 pgs=47008
cs=47737 l=0 c=0x7f0e7c4db8e0).fault, initiating reconnect

After two or three minutes, the situation somehow settles and I see lines like:

2014-06-07 14:25:10.449240 7f0cefcfc700  0 -- 172.22.8.13:6800/14741 >>
172.22.8.14:6812/1029 pipe(0x7f0c84002580 sd=498 :6800 s=0 pgs=0 cs=0 l=0
c=0x7f0c84001620).accept connect_seq 0 vs existing 43412 state connecting
2014-06-07 14:25:10.449267 7f0cefcfc700  0 -- 172.22.8.13:6800/14741 >>
172.22.8.14:6812/1029 pipe(0x7f0c84002580 sd=498 :6800 s=0 pgs=0 cs=0 l=0
c=0x7f0c84001620).accept peer reset, then tried to connect to us, replacing
2014-06-07 14:25:10.477051 7f0cef6f6700  0 -- 172.22.8.13:6800/14741 >>
172.22.8.14:6800/14984 pipe(0x7f0c84002950 sd=502 :6800 s=0 pgs=0 cs=0 l=0
c=0x7f0c84001620).accept connect_seq 0 vs existing 38738 state connecting
2014-06-07 14:25:10.477070 7f0cef6f6700  0 -- 172.22.8.13:6800/14741 >>
172.22.8.14:6800/14984 pipe(0x7f0c84002950 sd=502 :6800 s=0 pgs=0 cs=0 l=0
c=0x7f0c84001620).accept peer reset, then tried to connect to us, replacing
2014-06-07 14:25:10.603165 7f0ceecec700  0 -- 172.22.4.43:6800/14741 >>
172.22.4.23:0/3597218324 pipe(0x7f0de0087df0 sd=508 :6800 s=0 pgs=0 cs=0 l=0
c=0x7f0de0088050).accept peer addr is really 172.22.4.23:0/3597218324 (socket
is 172.22.4.23:43421/0)

Shortly afterwards, one of the previously crashed OSDs aborts during recovery:

2014-06-07 14:25:12.943590 7f0ea18c3700 -1 ./include/interval_set.h: In
function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread
7f0ea18c3700 time 2014-06-07 14:25:12.910201
./include/interval_set.h: 385: FAILED assert(_size >= 0)

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0x1ee)
[0x8251d4]
 2: (PG::activate(ObjectStore::Transaction&, unsigned int, std::list<Context*,
std::allocator<Context*> >&, std::map<int, std::map<pg_t, pg_query_t,
std::less<pg_t>, std::allocator<std::pair<pg_t const, pg_query_t> > >,
std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, pg_query_t,
std::less<pg_t>, std::allocator<std::pair<pg_t const, pg_query_t> > > > > >&,
std::map<int, std::vector<std::pair<pg_notify_t, std::map<unsigned int,
pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int
const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t,
std::map<unsigned int, pg_interval_t, std::less<unsigned int>,
std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >,
std::less<int>, std::allocator<std::pair<int const,
std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t,
std::less<unsigned int>, std::allocator<std::pair<unsigned int const,
pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned
int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned
int const, pg_interval_t> > > > > > > > >*)+0x2347) [0x803ea9]
 3:
(PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active,
PG::RecoveryState::Primary, PG::RecoveryState::Activating,
(boost::statechart::history_mode)0>::my_context)+0x37d) [0x805235]
 4: (boost::statechart::state<PG::RecoveryState::Active,
PG::RecoveryState::Primary, PG::RecoveryState::Activating,
(boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Primary>
const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>&)+0x4f) [0x83a3fa]
 5: (boost::statechart::detail::safe_reaction_result
boost::statechart::simple_state<PG::RecoveryState::Peering,
PG::RecoveryState::Primary, PG::RecoveryState::GetInfo,
(boost::statechart::history_mode)0>::transit<PG::RecoveryState::Active>()+0xa4) 
[0x83a5c8]
 6: (boost::statechart::simple_state<PG::RecoveryState::Peering,
PG::RecoveryState::Primary, PG::RecoveryState::GetInfo,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x16a) [0x83a8ae]
 7: (boost::statechart::simple_state<PG::RecoveryState::WaitFlushedPeering,
PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x84) [0x837e0a]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::process_queued_events()+0xf2)
[0x81abe0]
 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x1e) [0x81ae24]
 10: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>,
PG::RecoveryCtx*)+0x2fb) [0x7d6303]
 11: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&,
ThreadPool::TPHandle&)+0x320) [0x64cdde]
 12: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&,
ThreadPool::TPHandle&)+0x16) [0x6aa06a]
 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x569) [0x9cf66f]
 14: (ThreadPool::WorkThread::entry()+0x10) [0x9d10b6]
 15: (()+0x7b77) [0x7f0ee292bb77]
 16: (clone()+0x6d) [0x7f0ee0c6368d]
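
If I read ./include/interval_set.h correctly, the set keeps a running total of
the lengths of the intervals it holds, and erase()/subtract() decrement that
total and assert it never goes negative. So the assert seems to fire when
PG::activate() tries to subtract snapshot intervals that the set does not
actually contain, which would mean the snapshot-related metadata of the
replicas no longer agreed at that point -- but that is just my guess. A toy
sketch of what I think the invariant is (this is NOT Ceph's code, just an
illustration):

// Toy model of the invariant I believe interval_set protects -- not Ceph's
// actual implementation. Erasing a range that is not really in the set
// drives the running size negative and trips the assert.
#include <cassert>
#include <cstdint>
#include <map>

struct toy_interval_set {
  std::map<int64_t, int64_t> m;   // start -> length, no merging for brevity
  int64_t size = 0;               // running total of all interval lengths

  void insert(int64_t start, int64_t len) {
    m[start] = len;
    size += len;
  }

  void erase(int64_t start, int64_t len) {
    size -= len;
    assert(size >= 0);            // analogous to the assert in interval_set.h
    auto it = m.find(start);
    assert(it != m.end() && it->second >= len);
    if (it->second == len) m.erase(it);
    else { m[start + len] = it->second - len; m.erase(it); }
  }
};

int main() {
  toy_interval_set snaps;
  snaps.insert(1, 4);   // snapshots 1..4 present
  snaps.erase(1, 4);    // fine, size back to 0
  snaps.erase(5, 2);    // a peer claims snaps 5..6 were purged as well ->
                        // size would go negative, the assert aborts the OSD
  return 0;
}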

We finally managed to restart all 3 affected OSDs, but afterwards we found
corrupted filesystems inside the VMs as well as scrub errors.

How can this be? Isn't Ceph designed to handle network failures? Obviously,
running nf_conntrack on the Ceph hosts is not a brilliant idea, but it happened
to be in place here. Still, I don't think that dropping network packets should
lead to corrupted data. Am I right? Any hints on what could be wrong here are
appreciated! I'd rather not run into a similar situation again.
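
On the nf_conntrack side: a check along the following lines would at least
have warned us before the table filled up (just a sketch; the /proc paths are
the usual ones on recent kernels, and the 90% threshold is arbitrary):

// Warn when the conntrack table is close to full -- a minimal sketch.
#include <fstream>
#include <iostream>

static long read_value(const char *path) {
  std::ifstream f(path);
  long v = -1;
  f >> v;                          // -1 stays if the file cannot be read
  return v;
}

int main() {
  long count = read_value("/proc/sys/net/netfilter/nf_conntrack_count");
  long max   = read_value("/proc/sys/net/netfilter/nf_conntrack_max");
  if (count < 0 || max <= 0) {
    std::cerr << "cannot read conntrack counters\n";
    return 2;
  }
  std::cout << count << " / " << max << " conntrack entries in use\n";
  if (count * 10 >= max * 9) {     // more than ~90% full
    std::cerr << "warning: conntrack table almost full\n";
    return 1;
  }
  return 0;
}

Raising nf_conntrack_max or excluding the Ceph traffic from connection
tracking altogether is probably the real fix, but that still wouldn't explain
why dropped packets ended in corrupted RBD images.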

TIA

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations