On 2/17/2021 1:07 PM, Boris Behrens wrote:
Hi Igor,

this is good news for me. Do you have an idea in which version the fix will be released and can you tell me how I can track if the fix is in the release?

v14.2.17 will include the fix as the patch is already merged into the Nautilus branch.

I can't say when this version will be released, though. Not too long from now, I believe.
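One way to track this is to compare the version a daemon reports against 14.2.17. Below is a minimal Python sketch, assuming the version banner has the form shown in the crash dump (`ceph version 14.2.16 (...) nautilus (stable)`). The helper names `parse_ceph_version` and `nautilus_has_fix` are hypothetical, and the threshold is simply the release named in this thread.

```python
import re

# First Nautilus release expected to carry the fix, according to this thread.
FIXED_IN = (14, 2, 17)


def parse_ceph_version(banner):
    """Extract (major, minor, patch) from a 'ceph version X.Y.Z ...' banner."""
    match = re.search(r"ceph version (\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no ceph version found in: %r" % banner)
    return tuple(int(part) for part in match.groups())


def nautilus_has_fix(banner):
    """Return True/False for Nautilus (14.2.x) builds, None for anything else."""
    version = parse_ceph_version(banner)
    if version[:2] != (14, 2):
        # Not a Nautilus build; this thread makes no statement about it.
        return None
    return version >= FIXED_IN


if __name__ == "__main__":
    banner = ("ceph version 14.2.16 "
              "(762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)")
    print(parse_ceph_version(banner), nautilus_has_fix(banner))
```

The banner string can be taken from a crash dump, as here, or from the version strings that `ceph versions` reports for the running daemons.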


I will read a bit about the allocators, but I doubt we will make the switch; we will probably just wait it out (if it does not take a year) :)

Actually, switching allocators is quite a safe and trivial operation. To reduce the risk even further, you might apply it to the failing OSDs only.
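As a rough illustration of applying the switch to the failing OSDs only, the Python sketch below builds the commands for review instead of running them. It assumes the `bluestore_allocator` option is set per OSD through `ceph config set`, that the OSD has to be restarted for the new allocator to be used, and that the OSDs run under the usual `ceph-osd@<id>` systemd units; the function name `allocator_switch_commands` is hypothetical.

```python
# Allocators suggested as workarounds in this thread.
WORKAROUND_ALLOCATORS = ("bitmap", "avl")


def allocator_switch_commands(osd_ids, allocator="bitmap"):
    """Build (but do not run) the commands to switch the given OSDs."""
    if allocator not in WORKAROUND_ALLOCATORS:
        raise ValueError("unexpected allocator: %r" % allocator)
    commands = []
    for osd_id in sorted(set(osd_ids)):
        # Per-OSD override, so all other OSDs keep their current allocator.
        commands.append(
            "ceph config set osd.%d bluestore_allocator %s" % (osd_id, allocator)
        )
        # Assumption: the restart is run on the host that carries this OSD.
        commands.append("systemctl restart ceph-osd@%d" % osd_id)
    return commands


if __name__ == "__main__":
    # osd.384 and osd.395 are two of the OSDs from the crash list.
    for line in allocator_switch_commands([384, 395]):
        print(line)
```

Printing the commands first keeps the change reviewable; restarting one OSD at a time and waiting for the cluster to settle in between is the cautious way to roll it out.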



Thank you a lot.

On Wed, 17 Feb 2021 at 10:59, Igor Fedotov <ifedo...@suse.de> wrote:

    Hi Boris,

    You have most likely hit https://tracker.ceph.com/issues/47751

    It's fixed in the upcoming Nautilus release, but v14.2.16 still lacks
    the fix.

    As a workaround, you might want to switch back to the bitmap or avl
    allocator.

    Thanks,

    Igor


    On 2/17/2021 12:36 PM, Boris Behrens wrote:
    > Hi,
    >
    > We are currently experiencing OSD daemon crashes and I can't pin down the issue.
    > I hope someone can help me with it.
    >
    > * We operate multiple clusters (440 SSD - 1PB, 36 SSD - 126TB, 40 SSD - 100TB, 84 HDD - 680TB)
    > * All clusters were updated around the same time (2021-02-03)
    > * We restarted ALL ceph daemons (systemctl restart ceph.target) on
    > 2021-02-11 after we added OOMScoreAdjust=-900 to all the service files.
    >
    > Now, in our main cluster (440 SSD with 1PB), the OSD daemons have begun to crash:
    > # ceph crash ls
    > ID                                                                ENTITY   NEW
    > 2020-03-06_17:37:54.031675Z_0bbbb807-ff2f-46df-9508-58d319b89bd6  osd.397
    > 2020-05-28_12:23:27.677741Z_061f2449-9a36-4747-a2f8-624e72cd1ad0  osd.410
    > 2021-02-05_07:03:35.943384Z_dffab245-4788-4de2-a677-76b735d5fc01  osd.403
    > 2021-02-15_15:41:27.934194Z_97b57f8f-58f2-4390-9d3e-993874e0e000  osd.395
    > 2021-02-15_18:01:19.774879Z_18160e65-4659-451f-8aae-def2984f1f29  osd.178
    > 2021-02-17_04:51:05.101052Z_9f04c6e8-d0c7-442c-9a38-33d5164d2a83  osd.384
    >
    > osd.384 and osd.395 are on the same node, which had some memory issues that we
    > fixed at 2021-02-16_12:00:00.
    >
    > osd.384 had been marked out for >24h when the daemon crashed, and there are no
    > more misplaced objects in the cluster.
    >
    > Here is the latest crash dump
    >   --- begin dump of recent events ---
    >   -9999> 2021-02-17 03:31:31.305 7fcf7e136700  1 do_command 'perf dump' 'result is 30067 bytes
    >   -9998> 2021-02-17 03:31:31.626 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2882789376 unmapped: 956792832 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >   -9997> 2021-02-17 03:31:32.634 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2882789376 unmapped: 956792832 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >   -9996> 2021-02-17 03:31:33.639 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2882789376 unmapped: 956792832 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >   -9995> 2021-02-17 03:31:34.647 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2882789376 unmapped: 956792832 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >   -9994> 2021-02-17 03:31:35.651 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2882789376 unmapped: 956792832 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >   -9993> 2021-02-17 03:31:36.654 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2882789376 unmapped: 956792832 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >   -9992> 2021-02-17 03:31:37.657 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2882789376 unmapped: 956792832 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >   -9991> 2021-02-17 03:31:38.676 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2882789376 unmapped: 956792832 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >   -9990> 2021-02-17 03:31:39.680 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2882789376 unmapped: 956792832 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >   -9989> 2021-02-17 03:31:40.684 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2882789376 unmapped: 956792832 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >   -9988> 2021-02-17 03:31:41.193 7fcf7e136700  1 do_command 'perf dump' '
    >   -9987> 2021-02-17 03:31:41.193 7fcf7e136700  1 do_command 'perf dump' 'result is 30067 bytes
    >
    > <snip>
    >
    >     -31> 2021-02-17 05:50:41.158 7fcf7e136700  1 do_command 'perf dump' '
    >     -30> 2021-02-17 05:50:41.159 7fcf7e136700  1 do_command 'perf dump' 'result is 30070 bytes
    >     -29> 2021-02-17 05:50:41.804 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851831808 unmapped: 987750400 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -28> 2021-02-17 05:50:42.813 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851831808 unmapped: 987750400 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -27> 2021-02-17 05:50:43.820 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -26> 2021-02-17 05:50:44.825 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -25> 2021-02-17 05:50:45.831 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -24> 2021-02-17 05:50:46.837 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -23> 2021-02-17 05:50:47.840 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -22> 2021-02-17 05:50:48.843 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -21> 2021-02-17 05:50:49.847 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -20> 2021-02-17 05:50:50.853 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -19> 2021-02-17 05:50:51.524 7fcf7e136700  1 do_command 'perf dump' '
    >     -18> 2021-02-17 05:50:51.525 7fcf7e136700  1 do_command 'perf dump' 'result is 30070 bytes
    >     -17> 2021-02-17 05:50:51.859 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -16> 2021-02-17 05:50:52.862 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -15> 2021-02-17 05:50:53.871 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -14> 2021-02-17 05:50:54.875 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -13> 2021-02-17 05:50:55.886 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -12> 2021-02-17 05:50:56.891 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -11> 2021-02-17 05:50:57.905 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >     -10> 2021-02-17 05:50:58.911 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >      -9> 2021-02-17 05:50:59.917 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >      -8> 2021-02-17 05:51:00.929 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >      -7> 2021-02-17 05:51:01.566 7fcf7e136700  1 do_command 'perf dump' '
    >      -6> 2021-02-17 05:51:01.567 7fcf7e136700  1 do_command 'perf dump' 'result is 30070 bytes
    >      -5> 2021-02-17 05:51:01.935 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >      -4> 2021-02-17 05:51:02.943 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851405824 unmapped: 988176384 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >      -3> 2021-02-17 05:51:03.949 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851102720 unmapped: 988479488 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >      -2> 2021-02-17 05:51:04.967 7fcf73be6700  5 prioritycache tune_memory target: 4294967296 mapped: 2851102720 unmapped: 988479488 heap: 3839582208 old mem: 2845415832 new mem: 2845415832
    >      -1> 2021-02-17 05:51:05.091 7fcf743e7700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/os/bluestore/fastbmap_allocator_impl.h:
    > In function 'uint64_t AllocatorLevel02<T>::claim_free_to_right(uint64_t) [with L1 = AllocatorLevel01Loose; uint64_t = long unsigned int]' thread 7fcf743e7700 time 2021-02-17 05:51:04.998475
    > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/os/bluestore/fastbmap_allocator_impl.h:
    > 572: FAILED ceph_assert(available >= allocated)
    >
    >   ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
    >   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x561c84cc2c7d]
    >   2: (()+0x4d8e45) [0x561c84cc2e45]
    >   3: (HybridAllocator::_add_to_tree(unsigned long, unsigned long)+0x49e) [0x561c853167de]
    >   4: (AvlAllocator::_release(interval_set<unsigned long, std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > > > const&)+0x60) [0x561c85310b20]
    >   5: (HybridAllocator::release(interval_set<unsigned long, std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > > > const&)+0x3a) [0x561c853143ca]
    >   6: (BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x5f) [0x561c851ee83f]
    >   7: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x1be) [0x561c8522f4ae]
    >   8: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0xaa) [0x561c8522fe9a]
    >   9: (BlueStore::_kv_finalize_thread()+0x604) [0x561c85232ed4]
    >   10: (BlueStore::KVFinalizeThread::entry()+0xd) [0x561c852625ed]
    >   11: (()+0x7ea5) [0x7fcf840a2ea5]
    >   12: (clone()+0x6d) [0x7fcf82f6596d]
    >
    >       0> 2021-02-17 05:51:05.145 7fcf743e7700 -1 *** Caught signal (Aborted) **
    >   in thread 7fcf743e7700 thread_name:bstore_kv_final
    >
    >   ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
    >   1: (()+0xf630) [0x7fcf840aa630]
    >   2: (gsignal()+0x37) [0x7fcf82e9d387]
    >   3: (abort()+0x148) [0x7fcf82e9ea78]
    >   4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x561c84cc2ccc]
    >   5: (()+0x4d8e45) [0x561c84cc2e45]
    >   6: (HybridAllocator::_add_to_tree(unsigned long, unsigned long)+0x49e) [0x561c853167de]
    >   7: (AvlAllocator::_release(interval_set<unsigned long, std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > > > const&)+0x60) [0x561c85310b20]
    >   8: (HybridAllocator::release(interval_set<unsigned long, std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > > > const&)+0x3a) [0x561c853143ca]
    >   9: (BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x5f) [0x561c851ee83f]
    >   10: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x1be) [0x561c8522f4ae]
    >   11: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0xaa) [0x561c8522fe9a]
    >   12: (BlueStore::_kv_finalize_thread()+0x604) [0x561c85232ed4]
    >   13: (BlueStore::KVFinalizeThread::entry()+0xd) [0x561c852625ed]
    >   14: (()+0x7ea5) [0x7fcf840a2ea5]
    >   15: (clone()+0x6d) [0x7fcf82f6596d]
    >   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.



--
The "UTF-8 Problems" self-help group will, as an exception, meet in the groüen Saal (large hall) this time.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io