Re: [ceph-users] OSDs flapping

2017-08-02 Thread Roger Brown
SOLVED: upgrading to Luminous v12.1.2 put a stop to the OSD crashes
in cephx_verify_authorizer().


On Fri, Jul 21, 2017 at 3:21 AM Jens Harbott wrote:

> 2017-07-21 1:14 GMT+00:00 Gregory Farnum:
> > At a glance that looks like the bug fixed by just-merged
> > https://github.com/ceph/ceph/pull/16421
>
> With the crashes in cephx_verify_authorizer() this rather looks like
> an instance of http://tracker.ceph.com/issues/20667 to me with
> https://github.com/ceph/ceph/pull/16455 as proposed fix. See Sage's
> mail on ceph-dev earlier.
>
> > On Thu, Jul 20, 2017 at 1:02 PM Roger Brown wrote:
> ...
> >> Representative example from osd1 logs:
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: *** Caught signal (Segmentation
> >> fault) **
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]:  in thread 7f52960e7700
> >> thread_name:msgr-worker-2
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.658076
> >> 7f529bf85c80 -1 osd.3 3444 log_to_monitors {default=true}
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.662695
> >> 7f52968e8700 -1 failed to decode message of type 70 v3:
> >> buffer::malformed_input: void
> >> osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer understand
> >> old encoding version 1 < struct_compat
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]:  ceph version 12.1.1
> >> (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]:  1: (()+0xa257a4) [0x55bc98fe27a4]
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]:  2: (()+0x11390) [0x7f529a468390]
> >> Jul 20 13:42:18 osd1 ceph-osd[4035]:  3:
> >> (cephx_verify_authorizer(CephContext*, KeyStore*,
> >> ceph::buffer::list::iterator&, CephXServiceTicketInfo&,
> >> ceph::buffer::list&)+0x496) [0x55bc991b0ca6]
> ...
>


Re: [ceph-users] OSDs flapping

2017-07-20 Thread Gregory Farnum
At a glance that looks like the bug fixed by just-merged
https://github.com/ceph/ceph/pull/16421

On Thu, Jul 20, 2017 at 1:02 PM Roger Brown wrote:

> I'm on Luminous 12.1.1 and noticed I have flapping OSDs. Even with `ceph
> osd set nodown`, the OSDs catch signal Aborted, and sometimes
> Segmentation fault, 2-5 minutes after starting. I verified the hosts can
> talk to each other on the cluster network. I've rebooted the hosts. I'm
> running out of ideas. Please advise.
>
> Tally of crashes:
> roger@osd1:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} |
> awk '{print $9}' | sort | uniq -c | sort -nr
> 100 (Segmentation
>  77 (Aborted)
> roger@osd2:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} |
> awk '{print $9}' | sort | uniq -c | sort -nr
>  77 (Aborted)
>  13 (Segmentation
> roger@osd3:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} |
> awk '{print $9}' | sort | uniq -c | sort -nr
>  86 (Aborted)
>   3 (Segmentation
>
> First crash observed Jul 19:
> roger@osd1:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head
> -1
> Jul 19 10:07:12 osd1 ceph-osd[13491]: *** Caught signal (Aborted) **
> roger@osd2:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head
> -1
> Jul 19 10:07:36 osd2 ceph-osd[13937]: *** Caught signal (Aborted) **
> roger@osd3:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head
> -1
> Jul 19 16:07:12 osd3 ceph-osd[8807]: *** Caught signal (Aborted) **
>
> Crashes started with Luminous 12.1.0:
> roger@osd1:~$ sudo grep 'Jul 19 10:07:12.*ceph version' /var/log/syslog.1
> | head -1
> Jul 19 10:07:12 osd1 ceph-osd[13491]:  ceph version 12.1.0
> (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
> roger@osd2:~$ sudo grep 'Jul 19 10:07:36.*ceph version' /var/log/syslog.1
> | head -1
> Jul 19 10:07:36 osd2 ceph-osd[13937]:  ceph version 12.1.0
> (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
> roger@osd3:~$ sudo grep 'Jul 19 16:07:12.*ceph version' /var/log/syslog.1
> | head -1
> Jul 19 16:07:12 osd3 ceph-osd[8807]:  ceph version 12.1.0
> (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
>
> Representative example from osd1 logs:
> Jul 20 13:42:18 osd1 ceph-osd[4035]: *** Caught signal (Segmentation
> fault) **
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  in thread 7f52960e7700
> thread_name:msgr-worker-2
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.658076
> 7f529bf85c80 -1 osd.3 3444 log_to_monitors {default=true}
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.662695
> 7f52968e8700 -1 failed to decode message of type 70 v3:
> buffer::malformed_input: void
> osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer understand
> old encoding version 1 < struct_compat
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  ceph version 12.1.1
> (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  1: (()+0xa257a4) [0x55bc98fe27a4]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  2: (()+0x11390) [0x7f529a468390]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  3:
> (cephx_verify_authorizer(CephContext*, KeyStore*,
> ceph::buffer::list::iterator&, CephXServiceTicketInfo&,
> ceph::buffer::list&)+0x496) [0x55bc991b0ca6]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  4:
> (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*,
> ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&,
> AuthCapsInfo&, CryptoKey&, unsigned long*)+0x31a) [0x55bc991a2cda]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  5:
> (OSD::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&,
> ceph::buffer::list&, bool&, CryptoKey&)+0xf9) [0x55bc98a2c759]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  6:
> (AsyncConnection::handle_connect_msg(ceph_msg_connect&,
> ceph::buffer::list&, ceph::buffer::list&)+0x228) [0x55bc99271108]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  7:
> (AsyncConnection::_process_connection()+0x1e07) [0x55bc99276a57]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  8:
> (AsyncConnection::process()+0x1ae8) [0x55bc9927b978]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  9: (EventCenter::process_events(int,
> std::chrono::duration >*)+0xa08)
> [0x55bc990c6148]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  10: (()+0xb0d0d8) [0x55bc990ca0d8]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  11: (()+0xb8c80) [0x7f5299d6fc80]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  12: (()+0x76ba) [0x7f529a45e6ba]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  13: (clone()+0x6d) [0x7f52994d53dd]
> Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.662763
> 7f52960e7700 -1 *** Caught signal (Segmentation fault) **
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  in thread 7f52960e7700
> thread_name:msgr-worker-2
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  ceph version 12.1.1
> (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  1: (()+0xa257a4) [0x55bc98fe27a4]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  2: (()+0x11390) [0x7f529a468390]
> Jul 20 13:42:18 osd1 ceph-osd[4035]:  3:

[ceph-users] OSDs flapping

2017-07-20 Thread Roger Brown
I'm on Luminous 12.1.1 and noticed I have flapping OSDs. Even with `ceph
osd set nodown`, the OSDs catch signal Aborted, and sometimes
Segmentation fault, 2-5 minutes after starting. I verified the hosts can
talk to each other on the cluster network. I've rebooted the hosts. I'm
running out of ideas. Please advise.

Tally of crashes:
roger@osd1:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} | awk
'{print $9}' | sort | uniq -c | sort -nr
100 (Segmentation
 77 (Aborted)
roger@osd2:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} | awk
'{print $9}' | sort | uniq -c | sort -nr
 77 (Aborted)
 13 (Segmentation
roger@osd3:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog{,.1} | awk
'{print $9}' | sort | uniq -c | sort -nr
 86 (Aborted)
  3 (Segmentation
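
The tallies above show "(Segmentation" because awk splits on whitespace, so
"Segmentation fault" spans two fields and only the first is printed. A
minimal Python sketch of the same tally that keeps the full signal name
follows. The names CRASH_RE and tally_signals are made up for illustration,
and the sketch assumes each crash is logged on a single syslog line, as in
the examples in this thread.

```python
import re
from collections import Counter

# Matches the syslog-style crash line, for example:
#   Jul 19 10:07:12 osd1 ceph-osd[13491]: *** Caught signal (Aborted) **
# The leading "]: " plays the role of the ': ' anchor in the grep pattern,
# so the second, timestamped copy of each crash line is not counted again.
CRASH_RE = re.compile(r"\]: \*\*\* Caught signal \(([^)]+)\)")


def tally_signals(lines):
    """Count crash lines per signal, keeping the full signal name."""
    counts = Counter()
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

To use it on a host, pass an open log file, for example
tally_signals(open("/var/log/syslog")), and print counts.most_common() to
get the same most-frequent-first order as `sort -nr`.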

First crash observed Jul 19:
roger@osd1:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head -1
Jul 19 10:07:12 osd1 ceph-osd[13491]: *** Caught signal (Aborted) **
roger@osd2:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head -1
Jul 19 10:07:36 osd2 ceph-osd[13937]: *** Caught signal (Aborted) **
roger@osd3:~$ sudo grep ': \*\*\* Caught signal' /var/log/syslog.1 | head -1
Jul 19 16:07:12 osd3 ceph-osd[8807]: *** Caught signal (Aborted) **

Crashes started with Luminous 12.1.0:
roger@osd1:~$ sudo grep 'Jul 19 10:07:12.*ceph version' /var/log/syslog.1 |
head -1
Jul 19 10:07:12 osd1 ceph-osd[13491]:  ceph version 12.1.0
(262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
roger@osd2:~$ sudo grep 'Jul 19 10:07:36.*ceph version' /var/log/syslog.1 |
head -1
Jul 19 10:07:36 osd2 ceph-osd[13937]:  ceph version 12.1.0
(262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
roger@osd3:~$ sudo grep 'Jul 19 16:07:12.*ceph version' /var/log/syslog.1 |
head -1
Jul 19 16:07:12 osd3 ceph-osd[8807]:  ceph version 12.1.0
(262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
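
The "ceph version" lines above can also be checked programmatically. The
sketch below pulls the version number out of such a line and compares it
with 12.1.2. That threshold is an assumption taken only from the report in
this thread that upgrading to v12.1.2 stopped the crashes. The names
VERSION_RE, ceph_version, FIXED_IN and predates_fix are made up for
illustration.

```python
import re

# Matches the version string as it appears in the crash dumps, for example:
#   ceph version 12.1.0 (262617c9...) luminous (dev)
VERSION_RE = re.compile(r"ceph version (\d+)\.(\d+)\.(\d+)")

# Assumption: 12.1.2 is used as the threshold only because this thread
# reports that upgrading to it stopped the crashes.
FIXED_IN = (12, 1, 2)


def ceph_version(line):
    """Return (major, minor, patch) from a 'ceph version' line, or None."""
    match = VERSION_RE.search(line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def predates_fix(line):
    """True if the line reports a version older than FIXED_IN."""
    version = ceph_version(line)
    return version is not None and version < FIXED_IN
```

Tuples compare element by element, so (12, 1, 1) < (12, 1, 2) holds without
any string handling of the version number.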

Representative example from osd1 logs:
Jul 20 13:42:18 osd1 ceph-osd[4035]: *** Caught signal (Segmentation fault)
**
Jul 20 13:42:18 osd1 ceph-osd[4035]:  in thread 7f52960e7700
thread_name:msgr-worker-2
Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.658076
7f529bf85c80 -1 osd.3 3444 log_to_monitors {default=true}
Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.662695
7f52968e8700 -1 failed to decode message of type 70 v3:
buffer::malformed_input: void
osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer understand
old encoding version 1 < struct_compat
Jul 20 13:42:18 osd1 ceph-osd[4035]:  ceph version 12.1.1
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
Jul 20 13:42:18 osd1 ceph-osd[4035]:  1: (()+0xa257a4) [0x55bc98fe27a4]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  2: (()+0x11390) [0x7f529a468390]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  3:
(cephx_verify_authorizer(CephContext*, KeyStore*,
ceph::buffer::list::iterator&, CephXServiceTicketInfo&,
ceph::buffer::list&)+0x496) [0x55bc991b0ca6]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  4:
(CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*,
ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&,
AuthCapsInfo&, CryptoKey&, unsigned long*)+0x31a) [0x55bc991a2cda]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  5:
(OSD::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&,
ceph::buffer::list&, bool&, CryptoKey&)+0xf9) [0x55bc98a2c759]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  6:
(AsyncConnection::handle_connect_msg(ceph_msg_connect&,
ceph::buffer::list&, ceph::buffer::list&)+0x228) [0x55bc99271108]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  7:
(AsyncConnection::_process_connection()+0x1e07) [0x55bc99276a57]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  8:
(AsyncConnection::process()+0x1ae8) [0x55bc9927b978]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  9: (EventCenter::process_events(int,
std::chrono::duration >*)+0xa08)
[0x55bc990c6148]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  10: (()+0xb0d0d8) [0x55bc990ca0d8]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  11: (()+0xb8c80) [0x7f5299d6fc80]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  12: (()+0x76ba) [0x7f529a45e6ba]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  13: (clone()+0x6d) [0x7f52994d53dd]
Jul 20 13:42:18 osd1 ceph-osd[4035]: 2017-07-20 13:42:18.662763
7f52960e7700 -1 *** Caught signal (Segmentation fault) **
Jul 20 13:42:18 osd1 ceph-osd[4035]:  in thread 7f52960e7700
thread_name:msgr-worker-2
Jul 20 13:42:18 osd1 ceph-osd[4035]:  ceph version 12.1.1
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
Jul 20 13:42:18 osd1 ceph-osd[4035]:  1: (()+0xa257a4) [0x55bc98fe27a4]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  2: (()+0x11390) [0x7f529a468390]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  3:
(cephx_verify_authorizer(CephContext*, KeyStore*,
ceph::buffer::list::iterator&, CephXServiceTicketInfo&,
ceph::buffer::list&)+0x496) [0x55bc991b0ca6]
Jul 20 13:42:18 osd1 ceph-osd[4035]:  4:
(CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*,
ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&,
AuthCa