Hi list, over the last few days our Ceph cluster failed to handle a recovery correctly: several VM images on RBD have been corrupted. I'm currently trying to understand what happened and how to avoid such problems in the future.
I'll describe the course of events:

1. We are running 32 OSDs on 4 storage servers, all on Ceph 0.72.2.

2. Three OSD processes on 2 of the storage servers crashed, one at a time, due to the glibc "invalid fastbin entry (free)" bug. The time span between consecutive crashes was large enough that the cluster reached HEALTH_OK in between.

3. I tried to restart the crashed OSD processes. As recovery started, the nf_conntrack tables in the hosts' firewalls overflowed (kernel: nf_conntrack: table full, dropping packet), causing packets of random TCP connections to be dropped.

4. The OSDs struggled to recover, and one OSD died with a failed assertion. In the logs I see many tens of thousands of lines like:

2014-06-07 14:25:10.337771 7f0e989f3700 0 -- 172.22.8.13:6800/14741 >> 172.22.8.12:6802/26867 pipe(0x7f0e9c8b0090 sd=122 :43869 s=2 pgs=37296 cs=36291 l=0 c=0x7f0e9c8b02f0).fault, initiating reconnect
2014-06-07 14:25:10.345561 7f0e8aded700 0 -- 172.22.8.13:6800/14741 >> 172.22.8.16:6814/12031 pipe(0x7f0e7c4ef220 sd=33 :53214 s=2 pgs=30224 cs=51533 l=0 c=0x7f0e7c4ddc20).fault, initiating reconnect
2014-06-07 14:25:10.345569 7f0e899d9700 0 -- 172.22.8.13:6800/14741 >> 172.22.8.16:6804/19906 pipe(0x7f0e9c002010 sd=428 :47466 s=2 pgs=58398 cs=44949 l=0 c=0x7f0e9c002270).fault, initiating reconnect
2014-06-07 14:25:10.345609 7f0e920e7700 0 -- 172.22.8.13:6800/14741 >> 172.22.8.16:6806/32760 pipe(0x7f0e7c4e0720 sd=32 :54326 s=2 pgs=46482 cs=47809 l=0 c=0x7f0e7c4e0980).fault, initiating reconnect
2014-06-07 14:25:10.345627 7f0e92df4700 0 -- 172.22.8.13:6800/14741 >> 172.22.8.16:6802/17411 pipe(0x7f0e7c4db680 sd=166 :34622 s=2 pgs=47008 cs=47737 l=0 c=0x7f0e7c4db8e0).fault, initiating reconnect

After two or three minutes the situation somehow settles and I see lines like:

2014-06-07 14:25:10.449240 7f0cefcfc700 0 -- 172.22.8.13:6800/14741 >> 172.22.8.14:6812/1029 pipe(0x7f0c84002580 sd=498 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f0c84001620).accept connect_seq 0 vs existing 43412 state connecting
2014-06-07 14:25:10.449267 7f0cefcfc700 0 -- 172.22.8.13:6800/14741 >> 172.22.8.14:6812/1029 pipe(0x7f0c84002580 sd=498 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f0c84001620).accept peer reset, then tried to connect to us, replacing
2014-06-07 14:25:10.477051 7f0cef6f6700 0 -- 172.22.8.13:6800/14741 >> 172.22.8.14:6800/14984 pipe(0x7f0c84002950 sd=502 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f0c84001620).accept connect_seq 0 vs existing 38738 state connecting
2014-06-07 14:25:10.477070 7f0cef6f6700 0 -- 172.22.8.13:6800/14741 >> 172.22.8.14:6800/14984 pipe(0x7f0c84002950 sd=502 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f0c84001620).accept peer reset, then tried to connect to us, replacing
2014-06-07 14:25:10.603165 7f0ceecec700 0 -- 172.22.4.43:6800/14741 >> 172.22.4.23:0/3597218324 pipe(0x7f0de0087df0 sd=508 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f0de0088050).accept peer addr is really 172.22.4.23:0/3597218324 (socket is 172.22.4.23:43421/0)

Shortly after that, one of the previously dead OSDs crashes during recovery:

2014-06-07 14:25:12.943590 7f0ea18c3700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7f0ea18c3700 time 2014-06-07 14:25:12.910201
./include/interval_set.h: 385: FAILED assert(_size >= 0)
 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0x1ee) [0x8251d4]
 2: (PG::activate(ObjectStore::Transaction&, unsigned int, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, pg_query_t, std::less<pg_t>, std::allocator<std::pair<pg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, pg_query_t, std::less<pg_t>, std::allocator<std::pair<pg_t const, pg_query_t> > > > > >&, std::map<int, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >, std::less<int>, std::allocator<std::pair<int const, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > > > > >*)+0x2347) [0x803ea9]
 3: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::my_context)+0x37d) [0x805235]
 4: (boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<PG::RecoveryState::Primary> const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x4f) [0x83a3fa]
 5: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::transit<PG::RecoveryState::Active>()+0xa4) [0x83a5c8]
 6: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0x83a8ae]
 7: (boost::statechart::simple_state<PG::RecoveryState::WaitFlushedPeering, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x84) [0x837e0a]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xf2) [0x81abe0]
 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x1e) [0x81ae24]
 10: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x2fb) [0x7d6303]
 11: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x320) [0x64cdde]
 12: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x16) [0x6aa06a]
 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x569) [0x9cf66f]
 14: (ThreadPool::WorkThread::entry()+0x10) [0x9d10b6]
 15: (()+0x7b77) [0x7f0ee292bb77]
 16: (clone()+0x6d) [0x7f0ee0c6368d]

We finally managed to restart all 3 affected OSDs, but we were left with corrupted filesystems inside the VMs as well as scrub errors afterwards.

How can this be? Isn't Ceph designed to handle network failures? Obviously, running nf_conntrack on Ceph hosts is not a brilliant idea, but it simply was present here. Still, I don't think that dropped network packets should be able to corrupt data. Am I right?

Any hints on what could be wrong here are appreciated; I'd rather not run into a similar situation again.

TIA

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · k...@gocept.com · systems administration
gocept gmbh & co.
kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
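P.S. In case anyone else trips over the same conntrack overflow: below is a sketch of the mitigations we intend to try, not a verified fix. The table size is a guess that must be sized to your cluster, and the port range 6789:7300 assumes the default monitor port plus the default OSD port span; adjust both to your setup.

```shell
# Raise the conntrack table limit (262144 is an example value, not a
# recommendation; watch nf_conntrack_count against nf_conntrack_max
# to see whether it is large enough for your OSD peer connections):
sysctl -w net.netfilter.nf_conntrack_max=262144
cat /proc/sys/net/netfilter/nf_conntrack_count \
    /proc/sys/net/netfilter/nf_conntrack_max

# Arguably better: keep Ceph traffic out of conntrack entirely via the
# raw table. (On newer kernels "-j CT --notrack" replaces the
# deprecated NOTRACK target.)
iptables -t raw -A PREROUTING -p tcp --dport 6789:7300 -j NOTRACK
iptables -t raw -A OUTPUT     -p tcp --dport 6789:7300 -j NOTRACK
```

Note that untracked packets no longer match stateful rules such as "-m state --state ESTABLISHED", so the filter chains then need explicit ACCEPTs for these ports (or a "-m conntrack --ctstate UNTRACKED" match).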