Re: [ceph-users] ceph-osd@n crash dumps

2019-10-02 Thread Del Monaco, Andrea
Hi Brad,

Apologies for the flood of messages; the previous ones were held for moderation
because of their length.
Here is the requested output: https://pastebin.com/N8jG08sH

Regards,

Andrea Del Monaco
HPC Consultant – Big Data & Security
M: +31 612031174
Burgemeester Rijnderslaan 30 – 1185 MC Amstelveen – The Netherlands
atos.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd@n crash dumps

2019-10-01 Thread Brad Hubbard
On Tue, Oct 1, 2019 at 10:43 PM Del Monaco, Andrea <andrea.delmon...@atos.net> wrote:

> Hi list,
>
> After the nodes ran out of memory (OOM) and were rebooted, we are no longer
> able to restart the ceph-osd@x services. (Details about the setup are at the end.)
>
> I am trying to start them manually so we can see the error, but all I see is
> a series of crash dumps; this is just one of the OSDs that is not starting.
> Any idea how to get past this?
> [root@ceph001 ~]# /usr/bin/ceph-osd --debug_osd 10 -f --cluster ceph --id
> 83 --setuser ceph --setgroup ceph  > /tmp/dump 2>&1
> starting osd.83 at - osd_data /var/lib/ceph/osd/ceph-83
> /var/lib/ceph/osd/ceph-83/journal
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/osd/ECUtil.h:
> In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
> thread 2aaf5540 time 2019-10-01 14:19:49.494368
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/osd/ECUtil.h:
> 34: FAILED assert(stripe_width % stripe_size == 0)
> [backtrace trimmed; the full trace is in the original message below]
>

 https://tracker.ceph.com/issues/41336 may be relevant here.

Can you post details of the pool involved as well as the erasure code
profile in use for that pool?
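
For reference, something like this should show both (just a sketch; <profile-name> is a
placeholder for whatever profile the pool listing reports):

ceph osd pool ls detail                            # pool settings, including stripe_width
                                                   # and the erasure_code_profile in use
ceph osd erasure-code-profile get <profile-name>   # plugin, k, m, stripe_unit, ...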


> [remainder of quoted stack trace trimmed; see the original message below]

[ceph-users] ceph-osd@n crash dumps

2019-10-01 Thread Del Monaco, Andrea
Hi list,

After the nodes ran out of memory (OOM) and were rebooted, we are no longer able to
restart the ceph-osd@x services. (Details about the setup are at the end.)

I am trying to start them manually so we can see the error, but all I see is
a series of crash dumps; this is just one of the OSDs that is not starting. Any
idea how to get past this?
[root@ceph001 ~]# /usr/bin/ceph-osd --debug_osd 10 -f --cluster ceph --id 83 
--setuser ceph --setgroup ceph  > /tmp/dump 2>&1
starting osd.83 at - osd_data /var/lib/ceph/osd/ceph-83 
/var/lib/ceph/osd/ceph-83/journal
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/osd/ECUtil.h:
 In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread 
2aaf5540 time 2019-10-01 14:19:49.494368
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/osd/ECUtil.h:
 34: FAILED assert(stripe_width % stripe_size == 0)
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14b) [0x2af3d36b]
 2: (()+0x26e4f7) [0x2af3d4f7]
 3: (ECBackend::ECBackend(PGBackend::Listener*, coll_t const&, 
boost::intrusive_ptr&, ObjectStore*, CephContext*, 
std::shared_ptr, unsigned long)+0x46d) 
[0x55c0bd3d]
 4: (PGBackend::build_pg_backend(pg_pool_t const&, std::map, std::allocator > > const&, PGBackend::Listener*, coll_t, 
boost::intrusive_ptr&, ObjectStore*, 
CephContext*)+0x30a) [0x55b0ba8a]
 5: (PrimaryLogPG::PrimaryLogPG(OSDService*, std::shared_ptr, 
PGPool const&, std::map, 
std::allocator > > const&, 
spg_t)+0x140) [0x55abd100]
 6: (OSD::_make_pg(std::shared_ptr, spg_t)+0x10cb) 
[0x55914ecb]
 7: (OSD::load_pgs()+0x4a9) [0x55917e39]
 8: (OSD::init()+0xc99) [0x559238e9]
 9: (main()+0x23a3) [0x558017a3]
 10: (__libc_start_main()+0xf5) [0x2aaab77de495]
 11: (()+0x385900) [0x558d9900]
2019-10-01 14:19:49.500 2aaf5540 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/osd/ECUtil.h:
 In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread 
2aaf5540 time 2019-10-01 14:19:49.494368
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/osd/ECUtil.h:
 34: FAILED assert(stripe_width % stripe_size == 0)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14b) [0x2af3d36b]
 2: (()+0x26e4f7) [0x2af3d4f7]
 3: (ECBackend::ECBackend(PGBackend::Listener*, coll_t const&, 
boost::intrusive_ptr&, ObjectStore*, CephContext*, 
std::shared_ptr, unsigned long)+0x46d) 
[0x55c0bd3d]
 4: (PGBackend::build_pg_backend(pg_pool_t const&, std::map, std::allocator > > const&, PGBackend::Listener*, coll_t, 
boost::intrusive_ptr&, ObjectStore*, 
CephContext*)+0x30a) [0x55b0ba8a]
 5: (PrimaryLogPG::PrimaryLogPG(OSDService*, std::shared_ptr, 
PGPool const&, std::map, 
std::allocator > > const&, 
spg_t)+0x140) [0x55abd100]
 6: (OSD::_make_pg(std::shared_ptr, spg_t)+0x10cb) 
[0x55914ecb]
 7: (OSD::load_pgs()+0x4a9) [0x55917e39]
 8: (OSD::init()+0xc99) [0x559238e9]
 9: (main()+0x23a3) [0x558017a3]
 10: (__libc_start_main()+0xf5) [0x2aaab77de495]
 11: (()+0x385900) [0x558d9900]

*** Caught signal (Aborted) **
 in thread 2aaf5540 thread_name:ceph-osd
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0xf5d0) [0x2aaab69765d0]
 2: (gsignal()+0x37) [0x2aaab77f22c7]
 3: (abort()+0x148) [0x2aaab77f39b8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x248) [0x2af3d468]
 5: (()+0x26e4f7) [0x2af3d4f7]
 6: (ECBackend::ECBackend(PGBackend::Listener*, coll_t const&, 
boost::intrusive_ptr&, ObjectStore*, CephContext*, 
std::shared_ptr, unsigned long)+0x46d) 
[0x55c0bd3d]
 7: (PGBackend::build_pg_backend(pg_pool_t const&, std::map, std::allocator > > const&, PGBackend::Listener*, coll_t, 
boost::intrusive_ptr&, ObjectStore*, 
CephContext*)+0x30a) [0x55b0ba8a]
 8: (PrimaryLogPG::PrimaryLogPG(OSDService*, std::shared_ptr, 
PGPool const&, std::map, 
std::allocator > > const&, 
spg_t)+0x140) [0x55abd100]
 9: (OSD::_make_pg(std::shared_ptr, spg_t)+0x10cb) 
[0x55914ecb]
 10: (OSD::load_pgs()+0x4a9) [0x55917e39]
 11: (OSD::init()+0xc99) [0x559238e9]
 12: (main()+0x23a3) [0x558017a3]
 13: (__libc_start_main()+0xf5) [0x2aaab77de495]
 14: (()+0x385900) [0x558d9900]
2019-10-01 14:19:49.509 2aaf5540 -1 *** Caught signal (Aborted)