Re: [ceph-users] OSDs flapping
SOLVED: upgrading to Luminous v12.1.2 put a stop to the OSD crashes in
cephx_verify_authorizer().

On Fri, Jul 21, 2017 at 3:21 AM Jens Harbott wrote:
> 2017-07-21 1:14 GMT+00:00 Gregory Farnum :
> > At a glance that looks like the bug fixed by just-merged
> > https://github.com/ceph/ceph/pull/16421
>
> With the crashes in cephx_verify_authorizer() this rather looks like
> an instance of http://tracker.ceph.com/issues/20667 to me with
> https://github.com/ceph/ceph/pull/16455 as proposed fix. See Sage's
> mail on ceph-dev earlier.
>
> > On Thu, Jul 20, 2017 at 1:02 PM Roger Brown wrote:
> ...
> >> Representative example from osd1 logs:
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: *** Caught signal (Segmentation fault) **
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: in thread 7f52960e7700 thread_name:msgr-worker-2
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.658076 7f529bf85c80 -1 osd.3 3444 log_to_monitors {default=true}
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.662695 7f52968e8700 -1 failed to decode message of type 70 v3: buffer::malformed_input: void osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 1 < struct_compat
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: 1: (()+0xa257a4) [0x55bc98fe27a4]
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2: (()+0x11390) [0x7f529a468390]
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: 3: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x496) [0x55bc991b0ca6]
> ...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
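For anyone who finds this thread later: after upgrading, it's worth confirming every OSD actually restarted on the fixed build. A rough sketch (on Luminous and later the running daemon versions can be listed with `ceph versions` or `ceph tell osd.* version`; the single version string below is a stand-in value for illustration):

```shell
# On a real cluster (Luminous or later), list what the running daemons report:
#   ceph versions
#   ceph tell osd.\* version
# Version-aware comparison of one reported version string against the fix
# release, using GNU sort -V; "reported" is a hypothetical stand-in value.
reported="12.1.1"
fix="12.1.2"
if [ "$(printf '%s\n' "$reported" "$fix" | sort -V | head -n1)" = "$fix" ]; then
  echo "OK: $reported >= $fix"
else
  echo "STALE: $reported < $fix, restart this OSD on the new build"
fi
# prints: STALE: 12.1.1 < 12.1.2, restart this OSD on the new build
```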
Re: [ceph-users] OSDs flapping
At a glance that looks like the bug fixed by just-merged
https://github.com/ceph/ceph/pull/16421

On Thu, Jul 20, 2017 at 1:02 PM Roger Brown wrote:
> I'm on Luminous 12.1.1 and noticed I have flapping OSDs. Even with `ceph
> osd set nodown`, the OSDs will catch signal Aborted and sometimes
> Segmentation fault 2-5 minutes after starting. I verified hosts can talk to
> each other on the cluster network. I've rebooted the hosts. I'm running out
> of ideas. Please advise.
>
> Tally of crashes:
> roger@osd1:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} | awk '{print $9}' | sort | uniq -c | sort -nr
>     100 (Segmentation
>      77 (Aborted)
> roger@osd2:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} | awk '{print $9}' | sort | uniq -c | sort -nr
>      77 (Aborted)
>      13 (Segmentation
> roger@osd3:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} | awk '{print $9}' | sort | uniq -c | sort -nr
>      86 (Aborted)
>       3 (Segmentation
>
> First crash observed Jul 19:
> roger@osd1:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head -1
> Jul 19 10:07:12 osd1 ceph-osd[13491]: *** Caught signal (Aborted) **
> roger@osd2:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head -1
> Jul 19 10:07:36 osd2 ceph-osd[13937]: *** Caught signal (Aborted) **
> roger@osd3:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head -1
> Jul 19 16:07:12 osd3 ceph-osd[8807]: *** Caught signal (Aborted) **
>
> Crashes started with Luminous 12.1.0:
> roger@osd1:~$ sudo grep 'Jul 19 10:07:12.*ceph version' /var/log/syslog.1 | head -1
> Jul 19 10:07:12 osd1 ceph-osd[13491]: ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
> roger@osd2:~$ sudo grep 'Jul 19 10:07:36.*ceph version' /var/log/syslog.1 | head -1
> Jul 19 10:07:36 osd2 ceph-osd[13937]: ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
> roger@osd3:~$ sudo grep 'Jul 19 16:07:12.*ceph version' /var/log/syslog.1 | head -1
> Jul 19 16:07:12 osd3 ceph-osd[8807]: ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
>
> Representative example from osd1 logs:
> Jul 20 13:42:18 osd1 ceph-osd[4035]: *** Caught signal (Segmentation fault) **
> Jul 20 13:42:18 osd1 ceph-osd[4035]: in thread 7f52960e7700 thread_name:msgr-worker-2
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.658076 7f529bf85c80 -1 osd.3 3444 log_to_monitors {default=true}
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.662695 7f52968e8700 -1 failed to decode message of type 70 v3: buffer::malformed_input: void osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 1 < struct_compat
> Jul 20 13:42:18 osd1 ceph-osd[4035]: ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 1: (()+0xa257a4) [0x55bc98fe27a4]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2: (()+0x11390) [0x7f529a468390]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 3: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x496) [0x55bc991b0ca6]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 4: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCapsInfo&, CryptoKey&, unsigned long*)+0x31a) [0x55bc991a2cda]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 5: (OSD::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0xf9) [0x55bc98a2c759]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 6: (AsyncConnection::handle_connect_msg(ceph_msg_connect&, ceph::buffer::list&, ceph::buffer::list&)+0x228) [0x55bc99271108]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 7: (AsyncConnection::_process_connection()+0x1e07) [0x55bc99276a57]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 8: (AsyncConnection::process()+0x1ae8) [0x55bc9927b978]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 9: (EventCenter::process_events(int, std::chrono::duration >*)+0xa08) [0x55bc990c6148]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 10: (()+0xb0d0d8) [0x55bc990ca0d8]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 11: (()+0xb8c80) [0x7f5299d6fc80]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 12: (()+0x76ba) [0x7f529a45e6ba]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 13: (clone()+0x6d) [0x7f52994d53dd]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.662763 7f52960e7700 -1 *** Caught signal (Segmentation fault) **
> Jul 20 13:42:18 osd1 ceph-osd[4035]: in thread 7f52960e7700 thread_name:msgr-worker-2
> Jul 20 13:42:18 osd1 ceph-osd[4035]: ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 1: (()+0xa257a4) [0x55bc98fe27a4]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2: (()+0x11390) [0x7f529a468390]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 3:
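An aside for readers reproducing the crash tally quoted above: `awk '{print $9}'` picks the parenthesized signal name out of each syslog line, which is why the counts read `(Aborted)` and `(Segmentation` ("Segmentation fault" spans two fields). A self-contained sketch on a synthetic log:

```shell
# Build a tiny stand-in for /var/log/syslog so the pipeline from the thread
# can be demonstrated anywhere; on a real host, grep the actual log instead.
cat > /tmp/syslog.sample <<'EOF'
Jul 19 10:07:12 osd1 ceph-osd[13491]: *** Caught signal (Aborted) **
Jul 20 13:42:18 osd1 ceph-osd[4035]: *** Caught signal (Segmentation fault) **
Jul 20 13:44:02 osd1 ceph-osd[4110]: *** Caught signal (Segmentation fault) **
EOF
# Same pipeline as in the thread: field 9 is "(Aborted)" or "(Segmentation".
grep ': \*\*\* Caught signal' /tmp/syslog.sample \
  | awk '{print $9}' | sort | uniq -c | sort -nr
# prints (leading whitespace from uniq -c may vary):
#   2 (Segmentation
#   1 (Aborted)
```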
[ceph-users] OSDs flapping
I'm on Luminous 12.1.1 and noticed I have flapping OSDs. Even with `ceph
osd set nodown`, the OSDs will catch signal Aborted and sometimes
Segmentation fault 2-5 minutes after starting. I verified hosts can talk to
each other on the cluster network. I've rebooted the hosts. I'm running out
of ideas. Please advise.

Tally of crashes:
roger@osd1:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} | awk '{print $9}' | sort | uniq -c | sort -nr
    100 (Segmentation
     77 (Aborted)
roger@osd2:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} | awk '{print $9}' | sort | uniq -c | sort -nr
     77 (Aborted)
     13 (Segmentation
roger@osd3:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} | awk '{print $9}' | sort | uniq -c | sort -nr
     86 (Aborted)
      3 (Segmentation

First crash observed Jul 19:
roger@osd1:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head -1
Jul 19 10:07:12 osd1 ceph-osd[13491]: *** Caught signal (Aborted) **
roger@osd2:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head -1
Jul 19 10:07:36 osd2 ceph-osd[13937]: *** Caught signal (Aborted) **
roger@osd3:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head -1
Jul 19 16:07:12 osd3 ceph-osd[8807]: *** Caught signal (Aborted) **

Crashes started with Luminous 12.1.0:
roger@osd1:~$ sudo grep 'Jul 19 10:07:12.*ceph version' /var/log/syslog.1 | head -1
Jul 19 10:07:12 osd1 ceph-osd[13491]: ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
roger@osd2:~$ sudo grep 'Jul 19 10:07:36.*ceph version' /var/log/syslog.1 | head -1
Jul 19 10:07:36 osd2 ceph-osd[13937]: ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
roger@osd3:~$ sudo grep 'Jul 19 16:07:12.*ceph version' /var/log/syslog.1 | head -1
Jul 19 16:07:12 osd3 ceph-osd[8807]: ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)

Representative example from osd1 logs:
Jul 20 13:42:18 osd1 ceph-osd[4035]: *** Caught signal (Segmentation fault) **
Jul 20 13:42:18 osd1 ceph-osd[4035]: in thread 7f52960e7700 thread_name:msgr-worker-2
Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.658076 7f529bf85c80 -1 osd.3 3444 log_to_monitors {default=true}
Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.662695 7f52968e8700 -1 failed to decode message of type 70 v3: buffer::malformed_input: void osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 1 < struct_compat
Jul 20 13:42:18 osd1 ceph-osd[4035]: ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
Jul 20 13:42:18 osd1 ceph-osd[4035]: 1: (()+0xa257a4) [0x55bc98fe27a4]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 2: (()+0x11390) [0x7f529a468390]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 3: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x496) [0x55bc991b0ca6]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 4: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCapsInfo&, CryptoKey&, unsigned long*)+0x31a) [0x55bc991a2cda]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 5: (OSD::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0xf9) [0x55bc98a2c759]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 6: (AsyncConnection::handle_connect_msg(ceph_msg_connect&, ceph::buffer::list&, ceph::buffer::list&)+0x228) [0x55bc99271108]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 7: (AsyncConnection::_process_connection()+0x1e07) [0x55bc99276a57]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 8: (AsyncConnection::process()+0x1ae8) [0x55bc9927b978]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 9: (EventCenter::process_events(int, std::chrono::duration >*)+0xa08) [0x55bc990c6148]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 10: (()+0xb0d0d8) [0x55bc990ca0d8]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 11: (()+0xb8c80) [0x7f5299d6fc80]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 12: (()+0x76ba) [0x7f529a45e6ba]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 13: (clone()+0x6d) [0x7f52994d53dd]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.662763 7f52960e7700 -1 *** Caught signal (Segmentation fault) **
Jul 20 13:42:18 osd1 ceph-osd[4035]: in thread 7f52960e7700 thread_name:msgr-worker-2
Jul 20 13:42:18 osd1 ceph-osd[4035]: ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
Jul 20 13:42:18 osd1 ceph-osd[4035]: 1: (()+0xa257a4) [0x55bc98fe27a4]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 2: (()+0x11390) [0x7f529a468390]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 3: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x496) [0x55bc991b0ca6]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 4: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCa
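One way to back up the "crashes started with Luminous 12.1.0" observation is to line up the first "Caught signal" entry against the package upgrade history in /var/log/dpkg.log. A sketch on synthetic files (the real paths on an Ubuntu host would be /var/log/syslog.1 and /var/log/dpkg.log; the dpkg entry below is invented for illustration):

```shell
# Stand-ins for the real logs so the commands can run anywhere; the dpkg.log
# line is a hypothetical example of the "upgrade pkg:arch old new" format.
cat > /tmp/syslog.sample <<'EOF'
Jul 19 09:55:01 osd1 CRON[1234]: (root) CMD (command -v debian-sa1)
Jul 19 10:07:12 osd1 ceph-osd[13491]: *** Caught signal (Aborted) **
EOF
cat > /tmp/dpkg.log.sample <<'EOF'
2017-07-19 09:58:30 upgrade ceph-osd:amd64 12.0.3-1 12.1.0-1
EOF
# First crash timestamp vs. most recent ceph-osd upgrade:
first_crash=$(grep ': \*\*\* Caught signal' /tmp/syslog.sample | head -1 | awk '{print $1, $2, $3}')
last_upgrade=$(grep ' upgrade ceph-osd' /tmp/dpkg.log.sample | tail -1)
echo "first crash:  $first_crash"
echo "last upgrade: $last_upgrade"
```

If the first crash lands shortly after the upgrade timestamp, that points at the new build rather than hardware or networking.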