That sounds like an easy rule to follow. Thanks again for your reply. /Peter
On Mon, 19 June 2017 at 10:19, Wido den Hollander <w...@42on.com> wrote:
>
> > On 19 June 2017 at 9:55, Peter Rosell <peter.ros...@gmail.com> wrote:
> >
> > I have my servers on a UPS and shut them down manually, the way I usually turn them off. There was still enough power in the UPS after the servers were shut down, because it continued to beep. Anyway, I will wipe the OSD and re-add it. Thanks for your reply.
>
> Ok, you didn't mention that in the first post. I assumed a sudden power failure.
>
> My general recommendation is to wipe a single OSD if it has issues. The reason is that I've seen many cases where people ran XFS repair, played with the files on the disk and then had data corruption.
>
> That's why I'd say that you should try to avoid fixing single OSDs when you don't need to.
>
> Wido
>
> > /Peter
> >
> > On Mon, 19 June 2017 at 09:11, Wido den Hollander <w...@42on.com> wrote:
> > >
> > > > On 18 June 2017 at 16:21, Peter Rosell <peter.ros...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > > I have a small cluster with only three nodes, 4 OSDs + 3 OSDs. I had been running version 0.87.2 (Giant) for over 2.5 years, but a couple of days ago I upgraded to 0.94.10 (Hammer) and then up to 10.2.7 (Jewel). Both upgrades went great. I started with the monitors, then the OSDs and finally the MDS. The log shows all 448 PGs active+clean. I'm running all daemons inside Docker, on ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185).
> > > >
> > > > Today I had a power outage and had to take down the servers. Now when I start the servers again, one OSD daemon doesn't start properly. It keeps crashing.
> > > >
> > > > I noticed that the first two restarts of the OSD daemon crashed with this error:
> > > > FAILED assert(rollback_info_trimmed_to_riter == log.rbegin())
> > > >
> > > > After that it always fails with "FAILED assert(i.first <= i.last)".
> > > >
> > > > I have 15 log entries like this one:
> > > > Jun 18 08:56:18 island sh[27991]: 2017-06-18 08:56:18.300641 7f5c5e0ff8c0 -1 log_channel(cluster) log [ERR] : 2.38 log bound mismatch, info (19544'666742,19691'671046] actual [19499'665843,19691'671046]
> > > >
> > > > I removed the <pg_id>_head directories, but that only removed these error logs. It crashes anyway.
> > > >
> > > > Does anyone have any suggestions on how to make it start up correctly? Of course I can remove the OSD from the cluster and re-add it, but this feels like a bug.
> > >
> > > Are you sure? Since you had a power failure, it could be that certain parts weren't committed to disk/FS properly when the power failed. That really depends on the hardware and configuration.
> > >
> > > Please, do not try to repair this OSD. Wipe it and re-add it to the cluster.
> > >
> > > Wido
> > >
> > > > A small snippet from the logs is added below. I didn't include the event list. If it will help I can send it too.
> > > >
> > > > Jun 18 13:52:23 island sh[7068]: osd/osd_types.cc: In function 'static bool pg_interval_t::check_new_interval(int, int, const std::vector<int>&, const std::vector<int>&, int, int, const std::vector<int>&, const std::vector<int>&, epoch_t, epoch_t, OSDMapRef, OSDMapRef, pg_t, IsPGRecoverablePredicate*, std::map<unsigned int, pg_interval_t>*, std::ostream*)' thread 7f4fc2500700 time 2017-06-18 13:52:23.593991
> > > > Jun 18 13:52:23 island sh[7068]: osd/osd_types.cc: 3132: FAILED assert(i.first <= i.last)
> > > > Jun 18 13:52:23 island sh[7068]: ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
> > > > Jun 18 13:52:23 island sh[7068]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x559fe4c14360]
> > > > Jun 18 13:52:23 island sh[7068]: 2: (pg_interval_t::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >*, std::ostream*)+0x72c) [0x559fe47f723c]
> > > > Jun 18 13:52:23 island sh[7068]: 3: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x3ff) [0x559fe461439f]
> > > > Jun 18 13:52:23 island sh[7068]: 4: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x478) [0x559fe4615828]
> > > > Jun 18 13:52:23 island sh[7068]: 5: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x176) [0x559fe4645b86]
> > > > Jun 18 13:52:23 island sh[7068]: 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x559fe4626d49]
> > > > Jun 18 13:52:23 island sh[7068]: 7: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x49e) [0x559fe45fa5ae]
> > > > Jun 18 13:52:23 island sh[7068]: 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2f2) [0x559fe452c042]
> > > > Jun 18 13:52:23 island sh[7068]: 9: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x214) [0x559fe4546d34]
> > > > Jun 18 13:52:23 island sh[7068]: 10: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x25) [0x559fe458f8e5]
> > > > Jun 18 13:52:23 island sh[7068]: 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x559fe4c06531]
> > > > Jun 18 13:52:23 island sh[7068]: 12: (ThreadPool::WorkThread::entry()+0x10) [0x559fe4c07630]
> > > > Jun 18 13:52:23 island sh[7068]: 13: (()+0x76fa) [0x7f4fe256b6fa]
> > > > Jun 18 13:52:23 island sh[7068]: 14: (clone()+0x6d) [0x7f4fe05e3b5d]
> > > > Jun 18 13:52:23 island sh[7068]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > Jun 18 13:52:23 island sh[7068]: 2017-06-18 13:52:23.599558 7f4fc2500700 -1 osd/osd_types.cc: In function 'static bool pg_interval_t::check_new_interval(int, int, const std::vector<int>&, const std::vector<int>&, int, int, const std::vector<int>&, const std::vector<int>&, epoch_t, epoch_t, OSDMapRef, OSDMapRef, pg_t, IsPGRecoverablePredicate*, std::map<unsigned int, pg_interval_t>*, std::ostream*)' thread 7f4fc2500700 time 2017-06-18 13:52:23.593991
> > > > Jun 18 13:52:23 island sh[7068]: osd/osd_types.cc: 3132: FAILED assert(i.first <= i.last)
> > > > Jun 18 13:52:23 island sh[7068]: ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
> > > > Jun 18 13:52:23 island sh[7068]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x559fe4c14360]
> > > > Jun 18 13:52:23 island sh[7068]: 2: (pg_interval_t::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >*, std::ostream*)+0x72c) [0x559fe47f723c]
> > > > Jun 18 13:52:23 island sh[7068]: 3: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x3ff) [0x559fe461439f]
> > > > Jun 18 13:52:23 island sh[7068]: 4: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x478) [0x559fe4615828]
> > > > Jun 18 13:52:23 island sh[7068]: 5: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x176) [0x559fe4645b86]
> > > > Jun 18 13:52:23 island sh[7068]: 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x559fe4626d49]
> > > > Jun 18 13:52:23 island sh[7068]: 7: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x49e) [0x559fe45fa5ae]
> > > > Jun 18 13:52:23 island sh[7068]: 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2f2) [0x559fe452c042]
> > > > Jun 18 13:52:23 island sh[7068]: 9: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x214) [0x559fe4546d34]
> > > > Jun 18 13:52:23 island sh[7068]: 10: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x25) [0x559fe458f8e5]
> > > > Jun 18 13:52:23 island sh[7068]: 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x559fe4c06531]
> > > > Jun 18 13:52:23 island sh[7068]: 12: (ThreadPool::WorkThread::entry()+0x10) [0x559fe4c07630]
> > > > Jun 18 13:52:23 island sh[7068]: 13: (()+0x76fa) [0x7f4fe256b6fa]
> > > > Jun 18 13:52:23 island sh[7068]: 14: (clone()+0x6d) [0x7f4fe05e3b5d]
> > > > Jun 18 13:52:23 island sh[7068]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > Jun 18 13:52:23 island sh[7068]: --- begin dump of recent events ---
> > > > Jun 18 13:52:23 island sh[7068]: -2051> 2017-06-18 13:50:36.086036 7f4fe36bb8c0 5 asok(0x559fef2d6000) register_command perfcounters_dump hook 0x559fef216030
> > > >
> > > > /Peter
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
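For anyone finding this thread later: the "wipe and re-add" Wido recommends is the standard manual OSD removal/re-creation procedure. A rough sketch for a Jewel-era cluster follows; the OSD id (osd.3 here) and device (/dev/sdX) are placeholders you must replace with your own, and since the daemons here run inside Docker the ceph-disk steps may need to be run in the appropriate container or on the host, depending on your setup. Wait for the cluster to finish rebalancing and return to HEALTH_OK before and after the removal.

```shell
# Placeholder id and device -- substitute your failing OSD and its disk.
ID=3
DEV=/dev/sdX

# 1. Mark the OSD out and let data rebalance away from it.
ceph osd out osd.$ID

# 2. Once rebalancing is done, remove it from CRUSH, auth and the OSD map.
ceph osd crush remove osd.$ID
ceph auth del osd.$ID
ceph osd rm osd.$ID

# 3. Wipe the disk and re-create the OSD (ceph-disk was the Jewel-era tool).
ceph-disk zap $DEV
ceph-disk prepare $DEV
ceph-disk activate ${DEV}1
```

The new OSD gets a fresh id and empty PGs, and backfill repopulates it from the healthy replicas, which is why this is safer than hand-repairing the on-disk state of the crashed OSD.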
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com