Hello,

Will try with noup now and see if it makes any difference.
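
(For reference, the noup sequence I have in mind is roughly the following; only a 
sketch, assuming a systemd-based install, with osd.37 used as an example ID.)

    ceph osd set noup                  # down OSDs will not be marked up while they settle
    systemctl restart ceph-osd@37      # restart a crashed OSD and let it work through load_pgs
    ceph -s                            # watch until the daemons are idle and peering has quietened
    ceph osd unset noup                # then allow them to be marked up again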

It is affecting both BlueStore and FileStore OSDs, and affecting different hosts and 
different PGs; there seems to be no pattern.

,Ashley

From: David Turner [mailto:[email protected]]
Sent: 18 November 2017 22:19
To: Ashley Merrick <[email protected]>
Cc: Eric Nelson <[email protected]>; [email protected]
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Does letting the cluster run with noup for a while until all down disks are 
idle, and then letting them come in, help at all?  I don't know your specific 
issue and haven't touched bluestore yet, but that is generally sound advice 
when an OSD won't start.

Also, is there any pattern to the OSDs that are down? Common PGs, common hosts, 
common SSDs, etc.?
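
(A rough sketch of how to check for such a pattern, assuming the standard ceph 
CLI; osd.37 below is just an example ID.)

    ceph osd tree                      # down OSDs are listed per host, so host clustering is visible
    ceph osd metadata 37               # backend (bluestore/filestore), devices and hostname for one OSD
    ceph pg ls-by-osd osd.37           # PGs currently mapped to that OSD, to compare for common PGs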

On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick <[email protected]> wrote:
Hello,

Any further suggestions or workarounds from anyone?

The cluster is hard down now with around 2% of PGs offline. On occasion I am able to 
get an OSD to start for a bit, but it then seems to do some peering and again 
crashes with “*** Caught signal (Aborted) ** in thread 7f3471c55700 
thread_name:tp_peering”.

,Ashley

From: Ashley Merrick
Sent: 16 November 2017 17:27
To: Eric Nelson <[email protected]>
Cc: [email protected]
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Hello,



Good to hear it's not just me; however, I have a cluster basically offline due to 
too many OSDs dropping because of this issue.



Anybody have any suggestions?



,Ashley

________________________________
From: Eric Nelson <[email protected]>
Sent: 16 November 2017 00:06:14
To: Ashley Merrick
Cc: [email protected]
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous

I've been seeing these as well on our SSD cache tier, which has been ravaged by disk 
failures of late... same tp_peering assert as above, even running the luminous 
branch from git.

Let me know if you have a bug filed that I can +1, or if you have found a workaround.

E

On Wed, Nov 15, 2017 at 10:25 AM, Ashley Merrick <[email protected]> wrote:

Hello,



After replacing a single OSD due to a failed disk, I am now seeing 2-3 OSDs 
randomly stop and fail to start: they boot-loop, get to load_pgs, and then 
fail with the following. (I tried setting the OSD logs to 5/5 but didn't get any 
extra lines around the error, just more information pre-boot.)



Could this be a certain PG causing these OSDs to crash (6.2f2s10, for example)?
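
(For reference, roughly how the log level was raised and how the suspect PG can be 
inspected; a sketch only. The injectargs form works on a running daemon; for one 
that crashes at start, debug osd = 5/5 can instead go in ceph.conf under [osd].)

    ceph tell osd.37 injectargs '--debug-osd 5/5'   # raise the OSD debug level on a running daemon
    ceph pg 6.2f2 query                             # inspect the suspect PG (EC shard suffix s10 dropped)
    ceph pg map 6.2f2                               # show the up/acting sets to see which OSDs it touches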



    -9> 2017-11-15 17:37:14.696229 7fa4ec50f700  1 osd.37 pg_epoch: 161571 
pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] 
local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 
161521/152523/159786 161517/161519/161519) 
[34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
 r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY 
m=21] state<Start>: transitioning to Stray

    -8> 2017-11-15 17:37:14.696239 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] 
local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 
161521/152523/159786 161517/161519/161519) 
[34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
 r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY 
m=21] exit Start 0.000019 0 0.000000

    -7> 2017-11-15 17:37:14.696250 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] 
local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 
161521/152523/159786 161517/161519/161519) 
[34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
 r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY 
m=21] enter Started/Stray

    -6> 2017-11-15 17:37:14.696324 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Reset 3.363755 2 0.000076

    -5> 2017-11-15 17:37:14.696337 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started

    -4> 2017-11-15 17:37:14.696346 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Start

    -3> 2017-11-15 17:37:14.696353 7fa4ec50f700  1 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] state<Start>: transitioning to 
Stray

    -2> 2017-11-15 17:37:14.696364 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Start 0.000018 0 0.000000

    -1> 2017-11-15 17:37:14.696372 7fa4ec50f700  5 osd.37 pg_epoch: 161571 
pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] 
local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 
161519/160963/159786 161517/161517/108939) 
[96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 
crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started/Stray

     0> 2017-11-15 17:37:14.697245 7fa4ebd0e700 -1 *** Caught signal (Aborted) 
**

in thread 7fa4ebd0e700 thread_name:tp_peering



ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

1: (()+0xa3acdc) [0x55dfb6ba3cdc]

2: (()+0xf890) [0x7fa510e2c890]

3: (gsignal()+0x37) [0x7fa50fe66067]

4: (abort()+0x148) [0x7fa50fe67448]

5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27f) 
[0x55dfb6be6f5f]

6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, 
std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > 
const&, int, ObjectStore::Transaction*)+0x14e3) [0x55dfb670f8a3]

7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x539) [0x55dfb670ff39]

8: (boost::statechart::simple_state<PG::RecoveryState::Reset, 
PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x244) [0x55dfb67552a4]

9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, 
PG::RecoveryState::Initial, std::allocator<void>, 
boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base
 const&)+0x6b) [0x55dfb6732c1b]

10: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, 
std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, 
std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x3e3) 
[0x55dfb6702ef3]

11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, 
std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > 
>*)+0x20a) [0x55dfb664db2a]

12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, 
ThreadPool::TPHandle&)+0x175) [0x55dfb664e6b5]

13: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, 
ThreadPool::TPHandle&)+0x27) [0x55dfb66ae5a7]

14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8f) [0x55dfb6bedb1f]

15: (ThreadPool::WorkThread::entry()+0x10) [0x55dfb6beea50]

16: (()+0x8064) [0x7fa510e25064]

17: (clone()+0x6d) [0x7fa50ff1962d]

NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.



--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds

   1/ 5 mds_balancer

   1/ 5 mds_locker

   1/ 5 mds_log

   1/ 5 mds_log_expire

   1/ 5 mds_migrator

   0/ 1 buffer

   0/ 1 timer

   0/ 1 filer

   0/ 1 striper

   0/ 1 objecter

   0/ 5 rados

   0/ 5 rbd

   0/ 5 rbd_mirror

   0/ 5 rbd_replay

   0/ 5 journaler

   0/ 5 objectcacher

   0/ 5 client

   1/ 5 osd

   0/ 5 optracker

   0/ 5 objclass

   1/ 3 filestore

   1/ 3 journal

   0/ 5 ms

   1/ 5 mon

   0/10 monc

   1/ 5 paxos

   0/ 5 tp

   1/ 5 auth

   1/ 5 crypto

   1/ 1 finisher

   1/ 5 heartbeatmap

   1/ 5 perfcounter

   1/ 5 rgw

   1/10 civetweb

   1/ 5 javaclient

   1/ 5 asok

   1/ 1 throttle

   0/ 0 refs

   1/ 5 xio

   1/ 5 compressor

   1/ 5 bluestore

   1/ 5 bluefs

   1/ 3 bdev

   1/ 5 kstore

   4/ 5 rocksdb

   4/ 5 leveldb

   4/ 5 memdb

   1/ 5 kinetic

   1/ 5 fuse

   1/ 5 mgr

   1/ 5 mgrc

   1/ 5 dpdk

   1/ 5 eventtrace

  -2/-2 (syslog threshold)

  -1/-1 (stderr threshold)

  max_recent     10000

  max_new         1000

  log_file /var/log/ceph/ceph-osd.37.log
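
(In case anyone wants to decode the addresses above, something along these lines 
should produce the annotated disassembly the NOTE asks for; the binary path 
assumes a standard package install.)

    objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.objdump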

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
