Hi HP.

I am just a site admin, so my opinion should be validated by proper support staff.

This looks very similar to
http://tracker.ceph.com/issues/14399

The ticket mentions a timezone difference between OSDs; maybe that is
worth checking (a rough sketch of such a check is below).
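
In case it helps, here is a minimal sketch of such a check in Python.
This is only an illustration: it assumes passwordless SSH to every OSD
host and that `timedatectl` is available there, and the host names are
placeholders for your own inventory (e.g. taken from `ceph osd tree`).

#!/usr/bin/env python3
import subprocess

# Placeholder host names; substitute your actual OSD hosts.
OSD_HOSTS = ["osd-host-1", "osd-host-2", "osd-host-3"]

def host_timezone(host):
    # Ask the remote host for its timedatectl output and keep only
    # the "Time zone" line.
    out = subprocess.check_output(["ssh", host, "timedatectl"],
                                  universal_newlines=True)
    for line in out.splitlines():
        if "Time zone" in line:
            return line.strip()
    return "unknown"

zones = {host: host_timezone(host) for host in OSD_HOSTS}
for host, zone in zones.items():
    print("%s: %s" % (host, zone))
if len(set(zones.values())) > 1:
    print("WARNING: OSD hosts disagree on their timezone")

If the hosts disagree, that would at least match the situation
described in the ticket.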

Cheers
Goncalo

________________________________________
From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Hein-Pieter 
van Braam [h...@tmm.cx]
Sent: 13 August 2016 21:48
To: ceph-users
Subject: [ceph-users] Cascading failure on a placement group

Hello all,

My cluster started to lose OSDs without any warning. Whenever an OSD
becomes the primary for a particular PG, it crashes with the following
stack trace:

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: /usr/bin/ceph-osd() [0xada722]
 2: (()+0xf100) [0x7fc28bca5100]
 3: (gsignal()+0x37) [0x7fc28a6bd5f7]
 4: (abort()+0x148) [0x7fc28a6bece8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
 6: (()+0x5e946) [0x7fc28afc0946]
 7: (()+0x5e973) [0x7fc28afc0973]
 8: (()+0x5eb93) [0x7fc28afc0b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbddcba]
 10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e48f]
 11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
 12: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a0d1a]
 13: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83be4a]
 14: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd1cf]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
 18: (()+0x7dc5) [0x7fc28bc9ddc5]
 19: (clone()+0x6d) [0x7fc28a77eced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Has anyone ever seen this? Is there a way to fix it? My cluster is in
rather serious disarray at the moment. One of the OSDs is now in a
restart loop, which at least prevents other OSDs from going down, but
obviously not all of the other PGs can peer now.

I'm not sure what else to do at the moment.

Thank you so much,

- HP
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
