Did you ever get anywhere with this? I have 6 OSDs out of 36 continuously flapping with this error in the logs.
Thanks,
Dan

On Fri, Jun 8, 2018 at 11:10 AM Caspar Smit <caspars...@supernas.eu> wrote:
> Hi all,
>
> Maybe this will help:
>
> The issue is with shards 3, 4 and 5 of PG 6.3f:
>
> Logs of OSDs 16, 17 & 36 (the ones crashing on startup).
>
>
> *Log OSD.16 (shard 4):*
>
> 2018-06-08 08:35:01.727261 7f4c585e3700 -1 bluestore(/var/lib/ceph/osd/ceph-16) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0)
> 2018-06-08 08:35:01.727273 7f4c585e3700 -1 bluestore(/var/lib/ceph/osd/ceph-16) ENOENT on clone suggests osd bug
> 2018-06-08 08:35:01.727274 7f4c585e3700 0 bluestore(/var/lib/ceph/osd/ceph-16) transaction dump:
> {
>     "ops": [
>         {
>             "op_num": 0,
>             "op_name": "clonerange2",
>             "collection": "6.3fs4_head",
>             "src_oid": "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903d0",
>             "dst_oid": "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "src_offset": 950272,
>             "len": 98304,
>             "dst_offset": 950272
>         },
>         {
>             "op_num": 1,
>             "op_name": "remove",
>             "collection": "6.3fs4_head",
>             "oid": "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903d0"
>         },
>         {
>             "op_num": 2,
>             "op_name": "setattrs",
>             "collection": "6.3fs4_head",
>             "oid": "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "attr_lens": {
>                 "_": 297,
>                 "hinfo_key": 18,
>                 "snapset": 35
>             }
>         },
>         {
>             "op_num": 3,
>             "op_name": "clonerange2",
>             "collection": "6.3fs4_head",
>             "src_oid": "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903cf",
>             "dst_oid": "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "src_offset": 679936,
>             "len": 274432,
>             "dst_offset": 679936
>         },
>         {
>             "op_num": 4,
>             "op_name": "remove",
>             "collection": "6.3fs4_head",
>             "oid": "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903cf"
>         },
>         {
>             "op_num": 5,
>             "op_name": "setattrs",
>             "collection": "6.3fs4_head",
>             "oid": "4#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "attr_lens": {
>                 "_": 297,
>                 "hinfo_key": 18,
>                 "snapset": 35
>             }
>         },
>         {
>             "op_num": 6,
>             "op_name": "nop"
>         },
>         {
>             "op_num": 7,
>             "op_name": "op_omap_rmkeyrange",
>             "collection": "6.3fs4_head",
>             "oid": "4#6:fc000000::::head#",
>             "first": "0000011124.00000000000000590799",
>             "last": "4294967295.18446744073709551615"
>         },
>         {
>             "op_num": 8,
>             "op_name": "omap_setkeys",
>             "collection": "6.3fs4_head",
>             "oid": "4#6:fc000000::::head#",
>             "attr_lens": {
>                 "_biginfo": 597,
>                 "_epoch": 4,
>                 "_info": 953,
>                 "can_rollback_to": 12,
>                 "rollback_info_trimmed_to": 12
>             }
>         }
>     ]
> }
>
> 2018-06-08 08:35:01.730584 7f4c585e3700 -1 /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f4c585e3700 time 2018-06-08 08:35:01.727379
> /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")
>
> ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x558e08ba4202]
> 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x558e08a55c3a]
> 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x558e08a572a6]
> 4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x558e085fa37f]
> 5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x558e0857db5c]
> 6: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x558e085abec2]
> 7: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x558e0861a91c]
> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x558e08bab3a8]
> 9: (ThreadPool::WorkThread::entry()+0x10) [0x558e08bac540]
> 10: (()+0x7494) [0x7f4c709ca494]
> 11: (clone()+0x3f) [0x7f4c6fa51aff]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> *Log OSD.17 (shard 5):*
>
> 2018-06-08 08:35:56.745249 7fe3fa687700 -1 bluestore(/var/lib/ceph/osd/ceph-17) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0)
> 2018-06-08 08:35:56.745264 7fe3fa687700 -1 bluestore(/var/lib/ceph/osd/ceph-17) ENOENT on clone suggests osd bug
> 2018-06-08 08:35:56.745266 7fe3fa687700 0 bluestore(/var/lib/ceph/osd/ceph-17) transaction dump:
> {
>     "ops": [
>         {
>             "op_num": 0,
>             "op_name": "clonerange2",
>             "collection": "6.3fs5_head",
>             "src_oid": "5#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903d0",
>             "dst_oid": "5#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "src_offset": 950272,
>             "len": 98304,
>             "dst_offset": 950272
>         },
>         {
>             "op_num": 1,
>             "op_name": "remove",
>             "collection": "6.3fs5_head",
>             "oid": "5#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903d0"
>         },
>         {
>             "op_num": 2,
>             "op_name": "setattrs",
>             "collection": "6.3fs5_head",
>             "oid": "5#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "attr_lens": {
>                 "_": 297,
>                 "hinfo_key": 18,
>                 "snapset": 35
>             }
>         },
>         {
>             "op_num": 3,
>             "op_name": "clonerange2",
>             "collection": "6.3fs5_head",
>             "src_oid": "5#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903cf",
>             "dst_oid": "5#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "src_offset": 679936,
>             "len": 274432,
>             "dst_offset": 679936
>         },
>         {
>             "op_num": 4,
>             "op_name": "remove",
>             "collection": "6.3fs5_head",
>             "oid": "5#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903cf"
>         },
>         {
>             "op_num": 5,
>             "op_name": "setattrs",
>             "collection": "6.3fs5_head",
>             "oid": "5#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "attr_lens": {
>                 "_": 297,
>                 "hinfo_key": 18,
>                 "snapset": 35
>             }
>         },
>         {
>             "op_num": 6,
>             "op_name": "nop"
>         },
>         {
>             "op_num": 7,
>             "op_name": "op_omap_rmkeyrange",
>             "collection": "6.3fs5_head",
>             "oid": "5#6:fc000000::::head#",
>             "first": "0000011124.00000000000000590799",
>             "last": "4294967295.18446744073709551615"
>         },
>         {
>             "op_num": 8,
>             "op_name": "omap_setkeys",
>             "collection": "6.3fs5_head",
>             "oid": "5#6:fc000000::::head#",
>             "attr_lens": {
>                 "_biginfo": 586,
>                 "_epoch": 4,
>                 "_info": 953,
>                 "can_rollback_to": 12,
>                 "rollback_info_trimmed_to": 12
>             }
>         }
>     ]
> }
>
> 2018-06-08 08:35:56.748436 7fe3fa687700 -1 /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe3fa687700 time 2018-06-08 08:35:56.745458
> /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")
>
> ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55695de1f202]
> 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x55695dcd0c3a]
> 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x55695dcd22a6]
> 4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x55695d87537f]
> 5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55695d7f8b5c]
> 6: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x55695d826ec2]
> 7: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x55695d89591c]
> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55695de263a8]
> 9: (ThreadPool::WorkThread::entry()+0x10) [0x55695de27540]
> 10: (()+0x7494) [0x7fe412a6e494]
> 11: (clone()+0x3f) [0x7fe411af5aff]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> *Log OSD.36 (shard 3):*
>
> -3> 2018-06-08 09:25:36.660055 7f43c9262700 -1 bluestore(/var/lib/ceph/osd/ceph-36) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0)
> -2> 2018-06-08 09:25:36.660068 7f43c9262700 -1 bluestore(/var/lib/ceph/osd/ceph-36) ENOENT on clone suggests osd bug
> -1> 2018-06-08 09:25:36.660070 7f43c9262700 0 bluestore(/var/lib/ceph/osd/ceph-36) transaction dump:
> {
>     "ops": [
>         {
>             "op_num": 0,
>             "op_name": "clonerange2",
>             "collection": "6.3fs3_head",
>             "src_oid": "3#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903d0",
>             "dst_oid": "3#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "src_offset": 950272,
>             "len": 98304,
>             "dst_offset": 950272
>         },
>         {
>             "op_num": 1,
>             "op_name": "remove",
>             "collection": "6.3fs3_head",
>             "oid": "3#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903d0"
>         },
>         {
>             "op_num": 2,
>             "op_name": "setattrs",
>             "collection": "6.3fs3_head",
>             "oid": "3#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "attr_lens": {
>                 "_": 297,
>                 "hinfo_key": 18,
>                 "snapset": 35
>             }
>         },
>         {
>             "op_num": 3,
>             "op_name": "clonerange2",
>             "collection": "6.3fs3_head",
>             "src_oid": "3#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903cf",
>             "dst_oid": "3#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "src_offset": 679936,
>             "len": 274432,
>             "dst_offset": 679936
>         },
>         {
>             "op_num": 4,
>             "op_name": "remove",
>             "collection": "6.3fs3_head",
>             "oid": "3#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#903cf"
>         },
>         {
>             "op_num": 5,
>             "op_name": "setattrs",
>             "collection": "6.3fs3_head",
>             "oid": "3#6:fc074663:::rbd_data.5.6c1d9574b0dc51.00000000000312db:head#",
>             "attr_lens": {
>                 "_": 297,
>                 "hinfo_key": 18,
>                 "snapset": 35
>             }
>         },
>         {
>             "op_num": 6,
>             "op_name": "nop"
>         },
>         {
>             "op_num": 7,
>             "op_name": "op_omap_rmkeyrange",
>             "collection": "6.3fs3_head",
>             "oid": "3#6:fc000000::::head#",
>             "first": "0000011124.00000000000000590799",
>             "last": "4294967295.18446744073709551615"
>         },
>         {
>             "op_num": 8,
>             "op_name": "omap_setkeys",
>             "collection": "6.3fs3_head",
>             "oid": "3#6:fc000000::::head#",
>             "attr_lens": {
>                 "_biginfo": 743,
>                 "_epoch": 4,
>                 "_info": 953,
>                 "can_rollback_to": 12,
>                 "rollback_info_trimmed_to": 12
>             }
>         }
>     ]
> }
>
> 0> 2018-06-08 09:25:36.663334 7f43c9262700 -1 /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f43c9262700 time 2018-06-08 09:25:36.660177
> /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")
>
> ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x557025013202]
> 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x557024ec4c3a]
> 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x557024ec62a6]
> 4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x557024a6937f]
> 5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x5570249ecb5c]
> 6: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x557024a1aec2]
> 7: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x557024a8991c]
> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55702501a3a8]
> 9: (ThreadPool::WorkThread::entry()+0x10) [0x55702501b540]
> 10: (()+0x7494) [0x7f43e0e48494]
> 11: (clone()+0x3f) [0x7f43dfecfaff]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> Kind regards,
> Caspar Smit
>
> 2018-06-08 16:38 GMT+02:00 Caspar Smit <caspars...@supernas.eu>:
>
>> Hi all,
>>
>> I seem to be hitting these tracker issues:
>>
>> https://tracker.ceph.com/issues/23145
>> http://tracker.ceph.com/issues/24422
>>
>> PGs 6.1 and 6.3f are having the issues.
>>
>> When I list all PGs of a down OSD with:
>>
>> ceph-objectstore-tool --dry-run --type bluestore --data-path /var/lib/ceph/osd/ceph-17/ --op list-pgs
>>
>> there are a lot of 'double' pgids like (also for other PGs):
>>
>> 6.3fs3
>> 6.3fs5
>>
>> Is that normal? I would assume different shards for EC would be on
>> separate OSDs.
>>
>> We still have 4 OSDs down and 2 PGs down+remapped and I can't find any
>> way to get the crashed OSDs back up.
>>
>> pg 6.1 is down+remapped, acting [6,3,2147483647,29,2147483647,2147483647]
>> pg 6.3f is down+remapped, acting [20,24,2147483647,2147483647,3,28]
>>
>> Kind regards,
>> Caspar Smit
>>
>> 2018-06-08 8:53 GMT+02:00 Caspar Smit <caspars...@supernas.eu>:
>>
>>> Update:
>>>
>>> I've unset nodown to let it continue, but now 4 OSDs are down and cannot
>>> be brought up again. Here's what the logfile reads:
>>>
>>> 2018-06-08 08:35:01.716245 7f4c58de4700 0 log_channel(cluster) log [INF] : 6.e3s0 continuing backfill to osd.37(4) from (10864'911406,11124'921472] 6:c7d71bbd:::rbd_data.5.6c1d9574b0dc51.0000000000bf38b9:head to 11124'921472
>>> 2018-06-08 08:35:01.727261 7f4c585e3700 -1 bluestore(/var/lib/ceph/osd/ceph-16) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0)
>>> 2018-06-08 08:35:01.727273 7f4c585e3700 -1 bluestore(/var/lib/ceph/osd/ceph-16) ENOENT on clone suggests osd bug
>>>
>>> 2018-06-08 08:35:01.730584 7f4c585e3700 -1 /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f4c585e3700 time 2018-06-08 08:35:01.727379
>>> /home/builder/source/ceph-12.2.2/src/os/bluestore/BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")
>>>
>>> ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x558e08ba4202]
>>> 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x558e08a55c3a]
>>> 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x558e08a572a6]
>>> 4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x558e085fa37f]
>>> 5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x558e0857db5c]
>>> 6: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x558e085abec2]
>>> 7: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x558e0861a91c]
>>> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x558e08bab3a8]
>>> 9: (ThreadPool::WorkThread::entry()+0x10) [0x558e08bac540]
>>> 10: (()+0x7494) [0x7f4c709ca494]
>>> 11: (clone()+0x3f) [0x7f4c6fa51aff]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>> Any help is highly appreciated.
>>>
>>> Kind regards,
>>> Caspar Smit
>>>
>>>
>>> 2018-06-08 7:57 GMT+02:00 Caspar Smit <caspars...@supernas.eu>:
>>>
>>>> Well, I let it run with the nodown flag set and it looked like it would
>>>> finish, BUT it all went wrong somewhere:
>>>>
>>>> This is now the state:
>>>>
>>>>     health: HEALTH_ERR
>>>>             nodown flag(s) set
>>>>             5602396/94833780 objects misplaced (5.908%)
>>>>             Reduced data availability: 143 pgs inactive, 142 pgs peering, 7 pgs stale
>>>>             Degraded data redundancy: 248859/94833780 objects degraded (0.262%), 194 pgs unclean, 21 pgs degraded, 12 pgs undersized
>>>>             11 stuck requests are blocked > 4096 sec
>>>>
>>>>     pgs:    13.965% pgs not active
>>>>             248859/94833780 objects degraded (0.262%)
>>>>             5602396/94833780 objects misplaced (5.908%)
>>>>             830 active+clean
>>>>             75  remapped+peering
>>>>             66  peering
>>>>             26  active+remapped+backfill_wait
>>>>             6   active+undersized+degraded+remapped+backfill_wait
>>>>             6   active+recovery_wait+degraded+remapped
>>>>             3   active+undersized+degraded+remapped+backfilling
>>>>             3   stale+active+undersized+degraded+remapped+backfill_wait
>>>>             3   stale+active+remapped+backfill_wait
>>>>             2   active+recovery_wait+degraded
>>>>             2   active+remapped+backfilling
>>>>             1   activating+degraded+remapped
>>>>             1   stale+remapped+peering
>>>>
>>>>
>>>> #ceph health detail shows:
>>>>
>>>> REQUEST_STUCK 11 stuck requests are blocked > 4096 sec
>>>>     11 ops are blocked > 16777.2 sec
>>>>     osds 4,7,23,24 have stuck requests > 16777.2 sec
>>>>
>>>>
>>>> So what happened and what should I do now?
>>>>
>>>> Thank you very much for any help.
>>>>
>>>> Kind regards,
>>>> Caspar
>>>>
>>>>
>>>> 2018-06-07 13:33 GMT+02:00 Sage Weil <s...@newdream.net>:
>>>>
>>>>> On Wed, 6 Jun 2018, Caspar Smit wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > We have a Luminous 12.2.2 cluster with 3 nodes and I recently added a
>>>>> > node to it.
>>>>> >
>>>>> > osd-max-backfills is at the default 1 so backfilling didn't go very fast
>>>>> > but that doesn't matter.
>>>>> >
>>>>> > Once it started backfilling everything looked ok:
>>>>> >
>>>>> > ~300 pgs in backfill_wait
>>>>> > ~10 pgs backfilling (~number of new OSDs)
>>>>> >
>>>>> > But I noticed the degraded objects increasing a lot. I presume a PG that
>>>>> > is in backfill_wait state doesn't accept any new writes anymore? Hence
>>>>> > the increasing degraded objects?
>>>>> >
>>>>> > So far so good, but once in a while I noticed a random OSD flapping (they
>>>>> > come back up automatically). This isn't because the disk is saturated but
>>>>> > because of a driver/controller/kernel incompatibility which 'hangs' the
>>>>> > disk for a short time (scsi abort_task error in syslog). Investigating
>>>>> > further, I noticed this was already the case before the node expansion.
>>>>> >
>>>>> > These flapping OSDs result in lots of PG states which are a bit
>>>>> > worrying:
>>>>> >
>>>>> > 109 active+remapped+backfill_wait
>>>>> > 80 active+undersized+degraded+remapped+backfill_wait
>>>>> > 51 active+recovery_wait+degraded+remapped
>>>>> > 41 active+recovery_wait+degraded
>>>>> > 27 active+recovery_wait+undersized+degraded+remapped
>>>>> > 14 active+undersized+remapped+backfill_wait
>>>>> > 4 active+undersized+degraded+remapped+backfilling
>>>>> >
>>>>> > I think the recovery_wait is more important than the backfill_wait, so I
>>>>> > would like to prioritize these because the recovery_wait was triggered by
>>>>> > the flapping OSDs.
>>>>>
>>>>> Just a note: this is fixed in mimic. Previously, we would choose the
>>>>> highest-priority PG to start recovery on at the time, but once recovery
>>>>> had started, the appearance of a new PG with a higher priority (e.g.,
>>>>> because it finished peering after the others) wouldn't preempt/cancel the
>>>>> other PG's recovery, so you would get behavior like the above.
>>>>>
>>>>> Mimic implements that preemption, so you should not see behavior like
>>>>> this. (If you do, then the function that assigns a priority score to a
>>>>> PG needs to be tweaked.)
>>>>>
>>>>> sage
>>>>>
>>>>
>>>>
>>>
>>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
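
For readers following Sage's explanation of the pre-mimic versus mimic behaviour, here is a toy model of a single recovery slot with and without preemption. It is only a sketch under stated assumptions: `simulate`, the tuple layout, and the PG names are all hypothetical and are not Ceph's scheduler code. It just illustrates why a high-priority PG that shows up late waits behind a lower-priority one when running work cannot be preempted.

```python
import heapq

def simulate(arrivals, preempt):
    """Toy single-slot recovery scheduler (NOT Ceph code).

    arrivals: list of (arrival_tick, priority, pg_name, work_ticks);
              a higher priority number means more urgent.
    preempt:  if True, a newly arrived higher-priority PG displaces the
              PG currently being recovered.
    Returns the PG names in the order their recovery completes.
    """
    pending = sorted(arrivals)   # ordered by arrival tick
    waiting = []                 # max-heap emulated with negated priority
    current = None               # [priority, name, remaining_ticks]
    finished = []
    tick = 0
    while pending or waiting or current:
        # Move every PG that has arrived by now into the waiting heap.
        while pending and pending[0][0] <= tick:
            _, prio, name, work = pending.pop(0)
            heapq.heappush(waiting, (-prio, name, work))
        # With preemption, a more urgent waiter pushes the running PG back.
        if current and preempt and waiting and -waiting[0][0] > current[0]:
            heapq.heappush(waiting, (-current[0], current[1], current[2]))
            current = None
        # If the slot is free, start the most urgent waiting PG.
        if current is None and waiting:
            neg_prio, name, work = heapq.heappop(waiting)
            current = [-neg_prio, name, work]
        # Do one tick of work on whatever holds the slot.
        if current:
            current[2] -= 1
            if current[2] == 0:
                finished.append(current[1])
                current = None
        tick += 1
    return finished
```

With a low-priority PG arriving at tick 0 and a high-priority one at tick 1, `simulate(..., preempt=False)` finishes the low-priority PG first, while `preempt=True` lets the high-priority one jump ahead, which is the difference described above.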
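
A side note on reading the `acting [...]` lists quoted in this thread: the value 2147483647 is 0x7fffffff, which Ceph uses as CRUSH_ITEM_NONE, meaning no OSD is currently mapped to that slot. For an erasure-coded pool the position in the list is the shard number. A minimal sketch for pulling the unmapped shards out of a `ceph health detail` style line follows; the helper names `parse_acting` and `unmapped_shards` are made up for illustration, and the regex assumes the exact line shape shown in the thread.

```python
import re

# 0x7fffffff: placeholder Ceph prints when no OSD is mapped to a slot.
CRUSH_ITEM_NONE = 0x7FFFFFFF

def parse_acting(line):
    """Return (pgid, acting_list) from a line such as
    'pg 6.3f is down+remapped, acting [20,24,2147483647,2147483647,3,28]'.
    Returns None if the line does not have that shape."""
    m = re.search(r"pg (\S+) is \S+, acting \[([\d,]+)\]", line)
    if m is None:
        return None
    return m.group(1), [int(x) for x in m.group(2).split(",")]

def unmapped_shards(acting):
    """Indices (EC shard numbers) that have no OSD assigned."""
    return [shard for shard, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]

if __name__ == "__main__":
    for line in (
        "pg 6.1 is down+remapped, acting [6,3,2147483647,29,2147483647,2147483647]",
        "pg 6.3f is down+remapped, acting [20,24,2147483647,2147483647,3,28]",
    ):
        pgid, acting = parse_acting(line)
        print(pgid, "shards without an OSD:", unmapped_shards(acting))
```

This only reports which slots are empty in the acting set; it says nothing about where the surviving shard data physically lives, which is what the `ceph-objectstore-tool --op list-pgs` output in the thread is for.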