> On 19 June 2017 at 09:55, Peter Rosell <peter.ros...@gmail.com> wrote:
> 
> 
> I have my servers on a UPS and shut them down manually, the way I usually
> turn them off. There was enough power left in the UPS after the servers were
> shut down, because it continued to beep. Anyway, I will wipe the OSD and
> re-add it. Thanks for your reply.
> 

Ok, you didn't mention that in the first post. I assumed a sudden power failure.

My general recommendation is to wipe a single OSD when it has issues. The reason 
is that I've seen many cases where people ran XFS repair, played with the files 
on the disk, and ended up with data corruption.

That's why I'd say you should avoid trying to repair single OSDs when you 
don't need to.
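For reference, the wipe-and-re-add can be sketched roughly like this with the Jewel-era ceph-disk CLI. The OSD id 4 and the device /dev/sdX are placeholders; substitute the id of the failing daemon and its actual data disk:

```shell
# Mark the OSD out so the cluster rebalances its data away first
ceph osd out 4

# Stop the daemon (or the docker container running it)
systemctl stop ceph-osd@4

# Remove the OSD from the CRUSH map, delete its key, and remove it
ceph osd crush remove osd.4
ceph auth del osd.4
ceph osd rm 4

# Wipe the disk and prepare it again; udev/ceph-disk will activate it
# and the cluster will backfill onto the fresh OSD
ceph-disk zap /dev/sdX
ceph-disk prepare /dev/sdX
```

Wait for backfill to finish and all PGs to return to active+clean before touching the next OSD.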

Wido

> /Peter
> 
> On Mon, 19 June 2017 at 09:11, Wido den Hollander <w...@42on.com> wrote:
> 
> >
> > > On 18 June 2017 at 16:21, Peter Rosell <peter.ros...@gmail.com> wrote:
> > >
> > >
> > > Hi,
> > > I have a small cluster with only three nodes, 4 OSDs + 3 OSDs. I had been
> > > running version 0.87.2 (Giant) for over 2.5 years, but a couple of days
> > > ago I upgraded to 0.94.10 (Hammer) and then to 10.2.7 (Jewel). Both
> > > upgrades went great. I started with the monitors, then the OSDs and
> > > finally the MDS. The log shows all 448 pgs active+clean. I'm running all
> > > daemons inside docker, and ceph version 10.2.7
> > > (50e863e0f4bc8f4b9e31156de690d765af245185).
> > >
> > > Today I had a power outage and had to take down the servers. When I now
> > > start the servers again, one OSD daemon doesn't start properly. It keeps
> > > crashing.
> > >
> > > I noticed that the first two restarts of the OSD daemon crashed with this
> > > error:
> > > FAILED assert(rollback_info_trimmed_to_riter == log.rbegin())
> > >
> > > After that it always fails with "FAILED assert(i.first <= i.last)"
> > >
> > > I have 15 log entries like this one:
> > > Jun 18 08:56:18 island sh[27991]: 2017-06-18 08:56:18.300641 7f5c5e0ff8c0
> > > -1 log_channel(cluster) log [ERR] : 2.38 log bound mismatch, info
> > > (19544'666742,19691'671046] actual [19499'665843,19691'671046]
> > >
> > > I removed the <pg_id>_head directories, but that only made these error
> > > messages go away. The daemon crashes anyway.
> > >
> > > Does anyone have suggestions on how to make it start up correctly? Of
> > > course I can remove the OSD from the cluster and re-add it, but it feels
> > > like a bug.
> >
> > Are you sure? Since you had a power failure it could be that certain parts
> > weren't committed to disk/FS properly when the power failed. That really
> > depends on the hardware and configuration.
> >
> > Please, do not try to repair this OSD. Wipe it and re-add it to the
> > cluster.
> >
> > Wido
> >
> > > A small snippet from the logs is added below. I didn't include the event
> > > list. If it would help, I can send that too.
> > >
> > > Jun 18 13:52:23 island sh[7068]: osd/osd_types.cc: In function 'static
> > bool
> > > pg_interval_t::check_new_interval(int, int, const std::vector<int>&,
> > const
> > > std::vector<int>&, int, int, const std::vector<int>&, const
> > > std::vector<int>&, epoch_t, epoch_t, OSDMapRef, OSDMapRef, pg_t,
> > > IsPGRecoverablePredicate*, std::map<unsigned int, pg_interval_t>*,
> > > std::ostream*)' thread 7f4fc2500700 time 2017-06-18 13:52:23.593991
> > > Jun 18 13:52:23 island sh[7068]: osd/osd_types.cc: 3132: FAILED
> > > assert(i.first <= i.last)
> > > Jun 18 13:52:23 island sh[7068]:  ceph version 10.2.7
> > > (50e863e0f4bc8f4b9e31156de690d765af245185)
> > > Jun 18 13:52:23 island sh[7068]:  1: (ceph::__ceph_assert_fail(char
> > const*,
> > > char const*, int, char const*)+0x80) [0x559fe4c14360]
> > > Jun 18 13:52:23 island sh[7068]:  2:
> > > (pg_interval_t::check_new_interval(int, int, std::vector<int,
> > > std::allocator<int> > const&, std::vector<int, std::allocator<int> >
> > > const&, int, int, std::vector<int, std::allocator<int> > const&,
> > > std::vector<int, std::allocator<int> > const&, unsigned int, unsigned
> > int,
> > > std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
> > > IsPGRecoverablePredicate*, std::map<unsigned int, pg_interval_t,
> > > std::less<unsigned int>, std::allocator<std::pair<unsigned int const,
> > > pg_interval_t> > >*, std::ostream*)+0x72c) [0x559fe47f723c]
> > > Jun 18 13:52:23 island sh[7068]:  3:
> > > (PG::start_peering_interval(std::shared_ptr<OSDMap const>,
> > std::vector<int,
> > > std::allocator<int> > const&, int, std::vector<int, std::allocator<int> >
> > > const&, int, ObjectStore::Transaction*)+0x3ff) [0x559fe461439f]
> > > Jun 18 13:52:23 island sh[7068]:  4:
> > > (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x478)
> > [0x559fe4615828]
> > > Jun 18 13:52:23 island sh[7068]:  5:
> > > (boost::statechart::simple_state<PG::RecoveryState::Reset,
> > > PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na,
> > > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> > > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> > > mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> > >
> > (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> > > const&, void const*)+0x176) [0x559fe4645b86]
> > > Jun 18 13:52:23 island sh[7068]:  6:
> > > (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
> > > PG::RecoveryState::Initial, std::allocator<void>,
> > >
> > boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
> > > const&)+0x69) [0x559fe4626d49]
> > > Jun 18 13:52:23 island sh[7068]:  7:
> > > (PG::handle_advance_map(std::shared_ptr<OSDMap const>,
> > > std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&,
> > > int, std::vector<int, std::allocator<int> >&, int,
> > PG::RecoveryCtx*)+0x49e)
> > > [0x559fe45fa5ae]
> > > Jun 18 13:52:23 island sh[7068]:  8: (OSD::advance_pg(unsigned int, PG*,
> > > ThreadPool::TPHandle&, PG::RecoveryCtx*,
> > std::set<boost::intrusive_ptr<PG>,
> > > std::less<boost::intrusive_ptr<PG> >,
> > > std::allocator<boost::intrusive_ptr<PG> > >*)+0x2f2) [0x559fe452c042]
> > > Jun 18 13:52:23 island sh[7068]:  9:
> > > (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*>
> > >
> > > const&, ThreadPool::TPHandle&)+0x214) [0x559fe4546d34]
> > > Jun 18 13:52:23 island sh[7068]:  10:
> > > (ThreadPool::BatchWorkQueue<PG>::_void_process(void*,
> > > ThreadPool::TPHandle&)+0x25) [0x559fe458f8e5]
> > > Jun 18 13:52:23 island sh[7068]:  11:
> > > (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x559fe4c06531]
> > > Jun 18 13:52:23 island sh[7068]:  12:
> > > (ThreadPool::WorkThread::entry()+0x10) [0x559fe4c07630]
> > > Jun 18 13:52:23 island sh[7068]:  13: (()+0x76fa) [0x7f4fe256b6fa]
> > > Jun 18 13:52:23 island sh[7068]:  14: (clone()+0x6d) [0x7f4fe05e3b5d]
> > > Jun 18 13:52:23 island sh[7068]:  NOTE: a copy of the executable, or
> > > `objdump -rdS <executable>` is needed to interpret this.
> > > Jun 18 13:52:23 island sh[7068]: --- begin dump of recent events ---
> > > Jun 18 13:52:23 island sh[7068]:  -2051> 2017-06-18 13:50:36.086036
> > > 7f4fe36bb8c0  5 asok(0x559fef2d6000) register_command perfcounters_dump
> > > hook 0x559fef216030
> > >
> > >
> > > /Peter
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >