12.2.5 on Proxmox cluster. 6 nodes, about 50 OSDs, bluestore and cache tiering on an EC pool. Mostly spinners with an SSD OSD drive and an SSD WAL DB drive on each node. PM863 SSDs with ~75%+ endurance remaning.
Has been running relatively okay besides some spinner failures until I checked today and found 5-6 OSDs flapping. I remember reading about some issues with 12.2.5, so I upgraded one node to 12.2.10 but no change. Seeing: 2018-12-20 00:27:42.754485 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0) -3> 2018-12-20 00:27:42.754503 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) ENOENT on clone suggests osd bug in the logs for each of them. I've found several bugs in the tracker related to these, but nothing with a resolution I could apply besides upgrading, which doesn't appear to have helped. Suggestions welcome. Snippet of the last few lines: rbd_data.17.afb3726b8b4567.00000000000db0d8:head (bitwise) local-lis/les=10138/10139 n=163802 ec=408/408 lis/c 15778/4121 les/c/f 15781/4127/0 59341/59343/15778) [28,27,17,19,32,33,14,22,7,9,25,23]/[28,27,17,19,2147483647,23,2147483647,13,2147483647,9,32,33]p28(0) r=-1 lpr=59343 pi=[4121,59343)/12 crt=15786'3211273 lcod 0'0 remapped NOTIFY mbc={}] enter Started/ReplicaActive -6> 2018-12-20 00:27:42.753106 7f578f68a700 5 osd.33 pg_epoch: 59344 pg[18.10s5( v 15786'3211273 (15781'3201224,15786'3211273] lb 18:0af1fd67:::rbd_data.17.afb3726b8b4567.00000000000db0d8:head (bitwise) local-lis/les=10138/10139 n=163802 ec=408/408 lis/c 15778/4121 les/c/f 15781/4127/0 59341/59343/15778) [28,27,17,19,32,33,14,22,7,9,25,23]/[28,27,17,19,2147483647,23,2147483647,13,2147483647,9,32,33]p28(0) r=-1 lpr=59343 pi=[4121,59343)/12 crt=15786'3211273 lcod 0'0 remapped NOTIFY mbc={}] enter Started/ReplicaActive/RepNotRecovering -5> 2018-12-20 00:27:42.753186 7f578f68a700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 15786'3211274, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0 -4> 2018-12-20 00:27:42.754485 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0) -3> 2018-12-20 00:27:42.754503 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) ENOENT on clone suggests osd bug -2> 2018-12-20 00:27:42.754507 7f578f68a700 0 bluestore(/var/lib/ceph/osd/ceph-33) transaction dump: { "ops": [ { "op_num": 0, "op_name": "clonerange2", "collection": "18.10s5_head", "src_oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#310014", "dst_oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#", "src_offset": 512000, "len": 8192, "dst_offset": 512000 }, { "op_num": 1, "op_name": "remove", "collection": "18.10s5_head", "oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#310014" }, { "op_num": 2, "op_name": "setattrs", "collection": "18.10s5_head", "oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#", "attr_lens": { "_": 298, "hinfo_key": 18, "snapset": 35 } }, { "op_num": 3, "op_name": "nop" }, { "op_num": 4, "op_name": "op_omap_rmkeyrange", "collection": "18.10s5_head", "oid": "5#18:08000000::::head#", "first": "0000015786.00000000000003211274", "last": "4294967295.18446744073709551615" }, { "op_num": 5, "op_name": "omap_setkeys", "collection": "18.10s5_head", "oid": "5#18:08000000::::head#", "attr_lens": { "_biginfo": 1646, "_epoch": 4, "_info": 1014, "can_rollback_to": 12, "rollback_info_trimmed_to": 12 } } ] } -1> 2018-12-20 00:27:42.757231 7f5795696700 1 -- 10.10.145.105:6801/29516 --> 10.10.145.100:6818/5468876 -- pg_info((query:59344 sent:59344 18.1es9( v 15786'3322304 (15786'3312225,15786'3322304] local-lis/les=59343/59344 n=163868 ec=408/408 lis/c 59343/3966 les/c/f 59344/3980/0 59341/59343/15773) 9->0)=(empty) epoch 59344) v5 -- 0x55c2f7e7be00 con 0 0> 2018-12-20 00:27:42.758519 7f578f68a700 -1 /home/builder/source/ceph-12.2.10/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f578f68a700 time 2018-12-20 00:27:42.754596 /home/builder/source/ceph-12.2.10/src/os/bluestore/BlueStore.cc: 9621: FAILED assert(0 == "unexpected error") ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55c2d9a2d9b2] 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x55c2d98d6c2a] 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x55c2d98d8296] 4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x55c2d945e0ef] 5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55c2d93df8ec] 6: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x55c2d940de22] 7: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x55c2d947e66c] 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55c2d9a34b58] 9: (ThreadPool::WorkThread::entry()+0x10) [0x55c2d9a35cf0] 10: (()+0x7494) [0x7f57a7275494] 11: (clone()+0x3f) [0x7f57a62fcacf] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 1 reserver 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 kinetic 1/ 5 fuse 1/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.33.log --- end dump of recent events --- 2018-12-20 00:27:42.795716 7f578f68a700 -1 *** Caught signal (Aborted) ** in thread 7f578f68a700 thread_name:tp_peering ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable) 1: (()+0xa56bd4) [0x55c2d99e5bd4] 2: (()+0x110c0) [0x7f57a727f0c0] 3: (gsignal()+0xcf) [0x7f57a6246fff] 4: (abort()+0x16a) [0x7f57a624842a] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55c2d9a2db3e] 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x55c2d98d6c2a] 7: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x55c2d98d8296] 8: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x55c2d945e0ef] 9: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55c2d93df8ec] 10: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x55c2d940de22] 11: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x55c2d947e66c] 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55c2d9a34b58] 13: (ThreadPool::WorkThread::entry()+0x10) [0x55c2d9a35cf0] 14: (()+0x7494) [0x7f57a7275494] 15: (clone()+0x3f) [0x7f57a62fcacf] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- begin dump of recent events --- -13> 2018-12-20 00:27:42.764248 7f57a3ed5700 5 -- 10.10.145.105:6802/29516 >> 10.10.145.102:0/436138 conn(0x55c2f8090000 :6802 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=54007 cs=1 l=1). rx osd.8 seq 2 0x55c2f7e7be00 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4 -12> 2018-12-20 00:27:42.764269 7f57a3ed5700 1 -- 10.10.145.105:6802/29516 <== osd.8 10.10.145.102:0/436138 2 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4 ==== 2004+0+0 (2799076138 0 0) 0x55c2f7e7be00 con 0x55c2f8090000 -11> 2018-12-20 00:27:42.764282 7f57a3ed5700 1 -- 10.10.145.105:6802/29516 --> 10.10.145.102:0/436138 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.752504) v4 -- 0x55c2f7f12200 con 0 -10> 2018-12-20 00:27:42.764391 7f57a2ed3700 5 -- 10.10.145.105:6803/29516 >> 10.10.145.102:0/436138 conn(0x55c2f8091800 :6803 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=54004 cs=1 l=1). rx osd.8 seq 2 0x55c2f76c8000 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4 -9> 2018-12-20 00:27:42.764413 7f57a2ed3700 1 -- 10.10.145.105:6803/29516 <== osd.8 10.10.145.102:0/436138 2 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4 ==== 2004+0+0 (2799076138 0 0) 0x55c2f76c8000 con 0x55c2f8091800 -8> 2018-12-20 00:27:42.764430 7f57a2ed3700 1 -- 10.10.145.105:6803/29516 --> 10.10.145.102:0/436138 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.752504) v4 -- 0x55c2f7df0800 con 0 -7> 2018-12-20 00:27:42.773561 7f57a2ed3700 5 -- 10.10.145.105:6802/29516 >> 10.10.145.104:0/3815033 conn(0x55c2f80eb000 :6802 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=50456 cs=1 l=1). rx osd.24 seq 1 0x55c2f7f97800 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4 -6> 2018-12-20 00:27:42.773580 7f57a2ed3700 1 -- 10.10.145.105:6802/29516 <== osd.24 10.10.145.104:0/3815033 1 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4 ==== 2004+0+0 (1752418694 0 0) 0x55c2f7f97800 con 0x55c2f80eb000 -5> 2018-12-20 00:27:42.773591 7f57a2ed3700 1 -- 10.10.145.105:6802/29516 --> 10.10.145.104:0/3815033 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.760326) v4 -- 0x55c2f7df0800 con 0 -4> 2018-12-20 00:27:42.773632 7f57a2ed3700 1 -- 10.10.145.105:6801/29516 --> 10.10.145.104:6807/23815033 -- osd_map(59344..59344 src has 3756..59344) v3 -- 0x55c2f892a080 con 0 -3> 2018-12-20 00:27:42.773737 7f57a36d4700 5 -- 10.10.145.105:6803/29516 >> 10.10.145.104:0/3815033 conn(0x55c2f80ec800 :6803 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=50470 cs=1 l=1). rx osd.24 seq 1 0x55c2f7fc6c00 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4 -2> 2018-12-20 00:27:42.773757 7f57a36d4700 1 -- 10.10.145.105:6803/29516 <== osd.24 10.10.145.104:0/3815033 1 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4 ==== 2004+0+0 (1752418694 0 0) 0x55c2f7fc6c00 con 0x55c2f80ec800 -1> 2018-12-20 00:27:42.773772 7f57a36d4700 1 -- 10.10.145.105:6803/29516 --> 10.10.145.104:0/3815033 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.760326) v4 -- 0x55c2f822d600 con 0 0> 2018-12-20 00:27:42.795716 7f578f68a700 -1 *** Caught signal (Aborted) ** in thread 7f578f68a700 thread_name:tp_peering ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable) 1: (()+0xa56bd4) [0x55c2d99e5bd4] 2: (()+0x110c0) [0x7f57a727f0c0] 3: (gsignal()+0xcf) [0x7f57a6246fff] 4: (abort()+0x16a) [0x7f57a624842a] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55c2d9a2db3e] 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x55c2d98d6c2a] 7: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x55c2d98d8296] 8: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x55c2d945e0ef] 9: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55c2d93df8ec] 10: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x55c2d940de22] 11: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x55c2d947e66c] 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55c2d9a34b58] 13: (ThreadPool::WorkThread::entry()+0x10) [0x55c2d9a35cf0] 14: (()+0x7494) [0x7f57a7275494] 15: (clone()+0x3f) [0x7f57a62fcacf] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 1 reserver 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 kinetic 1/ 5 fuse 1/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.33.log --- end dump of recent events ---
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com