[ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-09 Thread Stillwell, Bryan J
Last week I decided to play around with Kraken (11.1.1-1xenial) on a
single node, two OSD cluster, and after a while I noticed that the new
ceph-mgr daemon is frequently using a lot of the CPU:

17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
ceph-mgr

Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
usage down to < 1%, but after a while it climbs back up to > 100%.  Has
anyone else seen this?
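
For reference, the restart and a per-thread view of the CPU usage look
something like this (systemd unit names may differ between installs):

# systemctl restart 'ceph-mgr*'
# top -H -p $(pidof ceph-mgr)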

Bryan



Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread John Spray
On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
 wrote:
> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
> single node, two OSD cluster, and after a while I noticed that the new
> ceph-mgr daemon is frequently using a lot of the CPU:
>
> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
> ceph-mgr
>
> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
> usage down to < 1%, but after a while it climbs back up to > 100%.  Has
> anyone else seen this?

Definitely worth investigating, could you set "debug mgr = 20" on the
daemon to see if it's obviously spinning in a particular place?
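
For example (a sketch; substitute your mgr id for 'x'), at runtime through
the mgr's admin socket:

# ceph daemon mgr.x config set debug_mgr 20/20

or persistently in ceph.conf, followed by a restart of the daemon:

[mgr]
    debug mgr = 20/20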

Thanks,
John

>
> Bryan
>


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Stillwell, Bryan J
On 1/10/17, 5:35 AM, "John Spray"  wrote:

>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
> wrote:
>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>> single node, two OSD cluster, and after a while I noticed that the new
>> ceph-mgr daemon is frequently using a lot of the CPU:
>>
>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>> ceph-mgr
>>
>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>> usage down to < 1%, but after a while it climbs back up to > 100%.  Has
>> anyone else seen this?
>
>Definitely worth investigating, could you set "debug mgr = 20" on the
>daemon to see if it's obviously spinning in a particular place?

I've injected that option into the ceph-mgr process, and now I'm just
waiting for it to go out of control again.

However, I've noticed quite a few messages like this in the logs already:

2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
cs=1 l=0).fault initiating reconnect
2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
accept peer reset, then tried to connect to us, replacing
2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
send and in the half  accept state just closed


What's weird about that is that this is a single node cluster with
ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
host.  So none of the communication should be leaving the node.

Bryan



Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Samuel Just
What ceph sha1 is that?  Does it include
6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
spin?
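
For example, with a local clone of the ceph repo and the tags fetched,
something like this will answer that:

# git fetch --tags origin
# git merge-base --is-ancestor 6c3d015c6854a12cda40673848813d968ff6afae v11.1.1 \
      && echo included || echo missing
# git tag --contains 6c3d015c6854a12cda40673848813d968ff6afae
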
-Sam

On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
 wrote:
> On 1/10/17, 5:35 AM, "John Spray"  wrote:
>
>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>> wrote:
>>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>>> single node, two OSD cluster, and after a while I noticed that the new
>>> ceph-mgr daemon is frequently using a lot of the CPU:
>>>
>>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>>> ceph-mgr
>>>
>>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>>> usage down to < 1%, but after a while it climbs back up to > 100%.  Has
>>> anyone else seen this?
>>
>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>daemon to see if it's obviously spinning in a particular place?
>
> I've injected that option to the ceps-mgr process, and now I'm just
> waiting for it to go out of control again.
>
> However, I've noticed quite a few messages like this in the logs already:
>
> 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
> cs=1 l=0).fault initiating reconnect
> 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
> 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> accept peer reset, then tried to connect to us, replacing
> 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
> send and in the half  accept state just closed
>
>
> What's weird about that is that this is a single node cluster with
> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
> host.  So none of the communication should be leaving the node.
>
> Bryan
>


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Stillwell, Bryan J
This is from:

ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)
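
For reference, that's the string reported by the packaged binaries, e.g.:

# ceph --version
ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)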

On 1/10/17, 10:23 AM, "Samuel Just"  wrote:

>What ceph sha1 is that?  Does it include
>6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
>spin?
>-Sam
>
>On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
> wrote:
>> On 1/10/17, 5:35 AM, "John Spray"  wrote:
>>
>>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>>> wrote:
 Last week I decided to play around with Kraken (11.1.1-1xenial) on a
 single node, two OSD cluster, and after a while I noticed that the new
 ceph-mgr daemon is frequently using a lot of the CPU:

 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
 ceph-mgr

 Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
 usage down to < 1%, but after a while it climbs back up to > 100%.
Has
 anyone else seen this?
>>>
>>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>>daemon to see if it's obviously spinning in a particular place?
>>
>> I've injected that option to the ceps-mgr process, and now I'm just
>> waiting for it to go out of control again.
>>
>> However, I've noticed quite a few messages like this in the logs
>>already:
>>
>> 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
>> cs=1 l=0).fault initiating reconnect
>> 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>l=0).handle_connect_msg
>> accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>> 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>l=0).handle_connect_msg
>> accept peer reset, then tried to connect to us, replacing
>> 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
>> send and in the half  accept state just closed
>>
>>
>> What's weird about that is that this is a single node cluster with
>> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
>> host.  So none of the communication should be leaving the node.
>>
>> Bryan
>>


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Samuel Just
Can you push that branch somewhere?  I don't have a v11.1.1 or that sha1.
-Sam

On Tue, Jan 10, 2017 at 9:41 AM, Stillwell, Bryan J
 wrote:
> This is from:
>
> ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)
>
> On 1/10/17, 10:23 AM, "Samuel Just"  wrote:
>
>>What ceph sha1 is that?  Does it include
>>6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
>>spin?
>>-Sam
>>
>>On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
>> wrote:
>>> On 1/10/17, 5:35 AM, "John Spray"  wrote:
>>>
On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
 wrote:
> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
> single node, two OSD cluster, and after a while I noticed that the new
> ceph-mgr daemon is frequently using a lot of the CPU:
>
> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
> ceph-mgr
>
> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
> usage down to < 1%, but after a while it climbs back up to > 100%.
>Has
> anyone else seen this?

Definitely worth investigating, could you set "debug mgr = 20" on the
daemon to see if it's obviously spinning in a particular place?
>>>
>>> I've injected that option to the ceps-mgr process, and now I'm just
>>> waiting for it to go out of control again.
>>>
>>> However, I've noticed quite a few messages like this in the logs
>>>already:
>>>
>>> 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
>>> cs=1 l=0).fault initiating reconnect
>>> 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>>l=0).handle_connect_msg
>>> accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>>> 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>>l=0).handle_connect_msg
>>> accept peer reset, then tried to connect to us, replacing
>>> 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
>>> send and in the half  accept state just closed
>>>
>>>
>>> What's weird about that is that this is a single node cluster with
>>> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
>>> host.  So none of the communication should be leaving the node.
>>>
>>> Bryan
>>>


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Stillwell, Bryan J
That's strange, I installed that version using packages from here:

http://download.ceph.com/debian-kraken/pool/main/c/ceph/
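
For anyone wanting to reproduce, the repo can be added on xenial with
something like this (a sketch; check the key and suite against the docs):

# wget -q -O- https://download.ceph.com/keys/release.asc | apt-key add -
# echo deb http://download.ceph.com/debian-kraken/ xenial main \
      > /etc/apt/sources.list.d/ceph.list
# apt-get update && apt-get install ceph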


Bryan

On 1/10/17, 10:51 AM, "Samuel Just"  wrote:

>Can you push that branch somewhere?  I don't have a v11.1.1 or that sha1.
>-Sam
>
>On Tue, Jan 10, 2017 at 9:41 AM, Stillwell, Bryan J
> wrote:
>> This is from:
>>
>> ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)
>>
>> On 1/10/17, 10:23 AM, "Samuel Just"  wrote:
>>
>>>What ceph sha1 is that?  Does it include
>>>6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
>>>spin?
>>>-Sam
>>>
>>>On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
>>> wrote:
 On 1/10/17, 5:35 AM, "John Spray"  wrote:

>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
> wrote:
>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>> single node, two OSD cluster, and after a while I noticed that the
>>new
>> ceph-mgr daemon is frequently using a lot of the CPU:
>>
>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>> ceph-mgr
>>
>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its
>>CPU
>> usage down to < 1%, but after a while it climbs back up to > 100%.
>>Has
>> anyone else seen this?
>
>Definitely worth investigating, could you set "debug mgr = 20" on the
>daemon to see if it's obviously spinning in a particular place?

 I've injected that option to the ceps-mgr process, and now I'm just
 waiting for it to go out of control again.

 However, I've noticed quite a few messages like this in the logs
already:

 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>
 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN
pgs=2
 cs=1 l=0).fault initiating reconnect
 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>
 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg
 accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>
 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg
 accept peer reset, then tried to connect to us, replacing
 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>
 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing
to
 send and in the half  accept state just closed


 What's weird about that is that this is a single node cluster with
 ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
 host.  So none of the communication should be leaving the node.

 Bryan


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Samuel Just
Mm, maybe the tag didn't get pushed.  Alfredo, is there supposed to be
a v11.1.1 tag?
-Sam

On Tue, Jan 10, 2017 at 9:57 AM, Stillwell, Bryan J
 wrote:
> That's strange, I installed that version using packages from here:
>
> http://download.ceph.com/debian-kraken/pool/main/c/ceph/
>
>
> Bryan
>
> On 1/10/17, 10:51 AM, "Samuel Just"  wrote:
>
>>Can you push that branch somewhere?  I don't have a v11.1.1 or that sha1.
>>-Sam
>>
>>On Tue, Jan 10, 2017 at 9:41 AM, Stillwell, Bryan J
>> wrote:
>>> This is from:
>>>
>>> ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)
>>>
>>> On 1/10/17, 10:23 AM, "Samuel Just"  wrote:
>>>
What ceph sha1 is that?  Does it include
6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
spin?
-Sam

On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
 wrote:
> On 1/10/17, 5:35 AM, "John Spray"  wrote:
>
>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>> wrote:
>>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>>> single node, two OSD cluster, and after a while I noticed that the
>>>new
>>> ceph-mgr daemon is frequently using a lot of the CPU:
>>>
>>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>>> ceph-mgr
>>>
>>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its
>>>CPU
>>> usage down to < 1%, but after a while it climbs back up to > 100%.
>>>Has
>>> anyone else seen this?
>>
>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>daemon to see if it's obviously spinning in a particular place?
>
> I've injected that option to the ceps-mgr process, and now I'm just
> waiting for it to go out of control again.
>
> However, I've noticed quite a few messages like this in the logs
>already:
>
> 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>
> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN
>pgs=2
> cs=1 l=0).fault initiating reconnect
> 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>
> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>l=0).handle_connect_msg
> accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
> 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>
> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>l=0).handle_connect_msg
> accept peer reset, then tried to connect to us, replacing
> 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>
> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing
>to
> send and in the half  accept state just closed
>
>
> What's weird about that is that this is a single node cluster with
> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
> host.  So none of the communication should be leaving the node.
>
> Bryan
>

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Alfredo Deza
On Tue, Jan 10, 2017 at 12:59 PM, Samuel Just  wrote:
> Mm, maybe the tag didn't get pushed.  Alfredo, is there supposed to be
> a v11.1.1 tag?

Yep. You can see there is one here: https://github.com/ceph/ceph/releases

Specifically: https://github.com/ceph/ceph/releases/tag/v11.1.1 which
points to 
https://github.com/ceph/ceph/commit/87597971b371d7f497d7eabad3545d72d18dd755
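
That mapping can also be confirmed from a local clone once the tags have
been fetched:

# git fetch --tags https://github.com/ceph/ceph.git
# git rev-parse 'v11.1.1^{commit}'
87597971b371d7f497d7eabad3545d72d18dd755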


> -Sam
>
> On Tue, Jan 10, 2017 at 9:57 AM, Stillwell, Bryan J
>  wrote:
>> That's strange, I installed that version using packages from here:
>>
>> http://download.ceph.com/debian-kraken/pool/main/c/ceph/
>>
>>
>> Bryan
>>
>> On 1/10/17, 10:51 AM, "Samuel Just"  wrote:
>>
>>>Can you push that branch somewhere?  I don't have a v11.1.1 or that sha1.
>>>-Sam
>>>
>>>On Tue, Jan 10, 2017 at 9:41 AM, Stillwell, Bryan J
>>> wrote:
 This is from:

 ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)

 On 1/10/17, 10:23 AM, "Samuel Just"  wrote:

>What ceph sha1 is that?  Does it include
>6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
>spin?
>-Sam
>
>On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
> wrote:
>> On 1/10/17, 5:35 AM, "John Spray"  wrote:
>>
>>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>>> wrote:
 Last week I decided to play around with Kraken (11.1.1-1xenial) on a
 single node, two OSD cluster, and after a while I noticed that the
new
 ceph-mgr daemon is frequently using a lot of the CPU:

 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
 ceph-mgr

 Restarting it with 'systemctl restart ceph-mgr*' seems to get its
CPU
 usage down to < 1%, but after a while it climbs back up to > 100%.
Has
 anyone else seen this?
>>>
>>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>>daemon to see if it's obviously spinning in a particular place?
>>
>> I've injected that option to the ceps-mgr process, and now I'm just
>> waiting for it to go out of control again.
>>
>> However, I've noticed quite a few messages like this in the logs
>>already:
>>
>> 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104

>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN
>>pgs=2
>> cs=1 l=0).fault initiating reconnect
>> 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104

>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>l=0).handle_connect_msg
>> accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>> 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104

>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>l=0).handle_connect_msg
>> accept peer reset, then tried to connect to us, replacing
>> 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104

>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing
>>to
>> send and in the half  accept state just closed
>>
>>
>> What's weird about that is that this is a single node cluster with
>> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
>> host.  So none of the communication should be leaving the node.
>>
>> Bryan
>>

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-11 Thread Stillwell, Bryan J
John,

This morning I compared today's logs with yesterday's, and I see a
noticeable increase in messages like these:

2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:
notify_all mon_status
2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:
notify_all health
2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:
notify_all pg_summary
2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
mgrdigest v1
2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:
notify_all mon_status
2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:
notify_all health
2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:
notify_all pg_summary
2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
mgrdigest v1
2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1


In a one-minute period yesterday this group of messages showed up 84
times.  Today the same group showed up 156 times.
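
For reference, per-minute counts of that group can be pulled from the mgr
log (assuming the default log location) with something like:

# grep 'ms_dispatch mgrdigest v1' /var/log/ceph/ceph-mgr.*.log \
      | awk '{print substr($2,1,5)}' | sort | uniq -c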

Other than that, I did see this message increase from 9 times a minute to
14 times a minute:

2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104 >> -
conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
l=0).fault with nothing to send and in the half  accept state just closed

Let me know if you need anything else.

Bryan


On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
 wrote:

>On 1/10/17, 5:35 AM, "John Spray"  wrote:
>
>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>> wrote:
>>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>>> single node, two OSD cluster, and after a while I noticed that the new
>>> ceph-mgr daemon is frequently using a lot of the CPU:
>>>
>>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>>> ceph-mgr
>>>
>>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>>> usage down to < 1%, but after a while it climbs back up to > 100%.  Has
>>> anyone else seen this?
>>
>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>daemon to see if it's obviously spinning in a particular place?
>
>I've injected that option to the ceps-mgr process, and now I'm just
>waiting for it to go out of control again.
>
>However, I've noticed quite a few messages like this in the logs already:
>
>2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
>cs=1 l=0).fault initiating reconnect
>2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
>accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
>accept peer reset, then tried to connect to us, replacing
>2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
>s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
>send and in the half  accept state just closed
>
>
>What's weird about that is that this is a single node cluster with
>ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
>host.  So none of the communication should be leaving the node.
>
>Bryan



Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-13 Thread Robert Longstaff
FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS 7
w/ elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and has
allocated ~11GB of RAM after a single day of usage. Only the active manager
is performing this way. The growth is linear and reproducible.

The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with 45x8TB OSDs
each.


top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
%Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  4836772 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2351 ceph      20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27 ceph-mgr
 2302 ceph      20   0  620316 267992 157620 S   2.3  1.6   65:11.50 ceph-mon

On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J <
bryan.stillw...@charter.com> wrote:

> John,
>
> This morning I compared the logs from yesterday and I show a noticeable
> increase in messages like these:
>
> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:
> notify_all mon_status
> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:
> notify_all health
> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:
> notify_all pg_summary
> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
> mgrdigest v1
> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:
> notify_all mon_status
> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:
> notify_all health
> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:
> notify_all pg_summary
> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
> mgrdigest v1
> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>
>
> In a 1 minute period yesterday I saw 84 times this group of messages
> showed up.  Today that same group of messages showed up 156 times.
>
> Other than that I did see an increase in this messages from 9 times a
> minute to 14 times a minute:
>
> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104 >> -
> conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
>
> Let me know if you need anything else.
>
> Bryan
>
>
> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
>  bryan.stillw...@charter.com> wrote:
>
> >On 1/10/17, 5:35 AM, "John Spray"  wrote:
> >
> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
> >> wrote:
> >>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
> >>> single node, two OSD cluster, and after a while I noticed that the new
> >>> ceph-mgr daemon is frequently using a lot of the CPU:
> >>>
> >>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
> >>> ceph-mgr
> >>>
> >>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
> >>> usage down to < 1%, but after a while it climbs back up to > 100%.  Has
> >>> anyone else seen this?
> >>
> >>Definitely worth investigating, could you set "debug mgr = 20" on the
> >>daemon to see if it's obviously spinning in a particular place?
> >
> >I've injected that option to the ceps-mgr process, and now I'm just
> >waiting for it to go out of control again.
> >
> >However, I've noticed quite a few messages like this in the logs already:
> >
> >2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> >172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
> >cs=1 l=0).fault initiating reconnect
> >2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> >172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
> >s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg
> >accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
> >2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> >172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
> >s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg
> >accept peer reset, then tried to connect to us, replacing
> >2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
> >172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
> >s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
> >send and in the half  accept state just closed
> >
> >

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-13 Thread Brad Hubbard
Want to install debuginfo packages and use something like this to try
and find out where it is spending most of its time?

https://poormansprofiler.org/

Note that you may need to do multiple runs to get a "feel" for where
it is spending most of its time. Also note that likely only one or two
threads will be using the CPU (you can see this in ps output using a
command like the following); the rest will likely be idle or waiting
for something.

# ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan

Observation of these two and maybe a couple of manual gstack dumps
like this to compare thread ids to ps output (LWP is the thread id
(tid) in gdb output) should give us some idea of where it is spinning.

# gstack $(pidof ceph-mgr)
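
The profiler linked above essentially boils down to repeated full-thread
backtraces, e.g. (a sketch; needs gdb and the ceph debuginfo installed):

# gdb -batch -ex 'set pagination off' -ex 'thread apply all bt' \
      -p $(pidof ceph-mgr) > /tmp/mgr-bt.$(date +%s).txt 2>&1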


On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
 wrote:
> FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS 7 w/
> elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and has
> allocated ~11GB of RAM after a single day of usage. Only the active manager
> is performing this way. The growth is linear and reproducible.
>
> The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with 45x8TB OSDs
> each.
>
>
> top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
>
> Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
>
> %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0
> st
>
> KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812 buff/cache
>
> KiB Swap:  2097148 total,  2097148 free,0 used.  4836772 avail Mem
>
>
>   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
>
>  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27 ceph-mgr
>
>  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6  65:11.50 ceph-mon
>
>
> On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
>  wrote:
>>
>> John,
>>
>> This morning I compared the logs from yesterday and I show a noticeable
>> increase in messages like these:
>>
>> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
>> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
>> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:
>> notify_all mon_status
>> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:
>> notify_all health
>> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:
>> notify_all pg_summary
>> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
>> mgrdigest v1
>> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
>> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
>> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:
>> notify_all mon_status
>> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:
>> notify_all health
>> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:
>> notify_all pg_summary
>> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
>> mgrdigest v1
>> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>>
>>
>> In a 1 minute period yesterday I saw 84 times this group of messages
>> showed up.  Today that same group of messages showed up 156 times.
>>
>> Other than that I did see an increase in this messages from 9 times a
>> minute to 14 times a minute:
>>
>> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104 >> -
>> conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
>> l=0).fault with nothing to send and in the half  accept state just closed
>>
>> Let me know if you need anything else.
>>
>> Bryan
>>
>>
>> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
>> > bryan.stillw...@charter.com> wrote:
>>
>> >On 1/10/17, 5:35 AM, "John Spray"  wrote:
>> >
>> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>> >> wrote:
>> >>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>> >>> single node, two OSD cluster, and after a while I noticed that the new
>> >>> ceph-mgr daemon is frequently using a lot of the CPU:
>> >>>
>> >>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>> >>> ceph-mgr
>> >>>
>> >>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>> >>> usage down to < 1%, but after a while it climbs back up to > 100%.
>> >>> Has
>> >>> anyone else seen this?
>> >>
>> >>Definitely worth investigating, could you set "debug mgr = 20" on the
>> >>daemon to see if it's obviously spinning in a particular place?
>> >
>> >I've injected that option to the ceps-mgr process, and now I'm just
>> >waiting for it to go out of control again.
>> >
>> >However, I've noticed quite a few messages like this in the logs already:
>> >
>> >2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>> >172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=ST

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-13 Thread Donny Davis
I am having the same issue. When I looked at my idle cluster this morning,
one of the nodes had 400% CPU utilization, and ceph-mgr was 300% of that.
I have 3 AIO nodes, and only one of them seemed to be affected.

On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard  wrote:

> Want to install debuginfo packages and use something like this to try
> and find out where it is spending most of its time?
>
> https://poormansprofiler.org/
>
> Note that you may need to do multiple runs to get a "feel" for where
> it is spending most of its time. Also not that likely only one or two
> threads will be using the CPU (you can see this in ps output using a
> command like the following) the rest will likely be idle or waiting
> for something.
>
> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
>
> Observation of these two and maybe a couple of manual gstack dumps
> like this to compare thread ids to ps output (LWP is the thread id
> (tid) in gdb output) should give us some idea of where it is spinning.
>
> # gstack $(pidof ceph-mgr)
>
>
> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
>  wrote:
> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS
> 7 w/
> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and has
> > allocated ~11GB of RAM after a single day of usage. Only the active
> manager
> > is performing this way. The growth is linear and reproducible.
> >
> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with 45x8TB
> OSDs
> > each.
> >
> >
> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
> >
> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
> >
> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,
> 0.0
> > st
> >
> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812
> buff/cache
> >
> > KiB Swap:  2097148 total,  2097148 free,0 used.  4836772 avail
> Mem
> >
> >
> >   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
> COMMAND
> >
> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27
> ceph-mgr
> >
> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6  65:11.50
> ceph-mon
> >
> >
> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
> >  wrote:
> >>
> >> John,
> >>
> >> This morning I compared the logs from yesterday and I show a noticeable
> >> increase in messages like these:
> >>
> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all mon_status
> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all health
> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all pg_summary
> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
> >> mgrdigest v1
> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all mon_status
> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all health
> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:
> >> notify_all pg_summary
> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
> >> mgrdigest v1
> >> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
> >>
> >>
> >> In a 1 minute period yesterday I saw 84 times this group of messages
> >> showed up.  Today that same group of messages showed up 156 times.
> >>
> >> Other than that I did see an increase in this messages from 9 times a
> >> minute to 14 times a minute:
> >>
> >> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104
> >> -
> >> conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> >> l=0).fault with nothing to send and in the half  accept state just
> closed
> >>
> >> Let me know if you need anything else.
> >>
> >> Bryan
> >>
> >>
> >> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
> >>  >> bryan.stillw...@charter.com> wrote:
> >>
> >> >On 1/10/17, 5:35 AM, "John Spray"  wrote:
> >> >
> >> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
> >> >> wrote:
> >> >>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
> >> >>> single node, two OSD cluster, and after a while I noticed that the
> new
> >> >>> ceph-mgr daemon is frequently using a lot of the CPU:
> >> >>>
> >> >>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
> >> >>> ceph-mgr
> >> >>>
> >> >>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its
> CPU
> >> >>> usage down to < 1%, but after a while it climbs back up to > 100%.
> >> >>> Has
> >> >>> any

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-13 Thread Brad Hubbard
Could one of the reporters open a tracker for this issue and attach
the requested debugging data?

On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis  wrote:
> I am having the same issue. When I looked at my idle cluster this morning,
> one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of that.  I
> have 3 AIO nodes, and only one of them seemed to be affected.
>
> On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard  wrote:
>>
>> Want to install debuginfo packages and use something like this to try
>> and find out where it is spending most of its time?
>>
>> https://poormansprofiler.org/
>>
>> Note that you may need to do multiple runs to get a "feel" for where
>> it is spending most of its time. Also not that likely only one or two
>> threads will be using the CPU (you can see this in ps output using a
>> command like the following) the rest will likely be idle or waiting
>> for something.
>>
>> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
>>
>> Observation of these two and maybe a couple of manual gstack dumps
>> like this to compare thread ids to ps output (LWP is the thread id
>> (tid) in gdb output) should give us some idea of where it is spinning.
>>
>> # gstack $(pidof ceph-mgr)
>>
>>
>> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
>>  wrote:
>> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS
>> > 7 w/
>> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and has
>> > allocated ~11GB of RAM after a single day of usage. Only the active
>> > manager
>> > is performing this way. The growth is linear and reproducible.
>> >
>> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with 45x8TB
>> > OSDs
>> > each.
>> >
>> >
>> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
>> >
>> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
>> >
>> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,
>> > 0.0
>> > st
>> >
>> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812
>> > buff/cache
>> >
>> > KiB Swap:  2097148 total,  2097148 free,0 used.  4836772 avail
>> > Mem
>> >
>> >
>> >   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
>> > COMMAND
>> >
>> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27
>> > ceph-mgr
>> >
>> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6  65:11.50
>> > ceph-mon
>> >
>> >
>> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
>> >  wrote:
>> >>
>> >> John,
>> >>
>> >> This morning I compared the logs from yesterday and I show a noticeable
>> >> increase in messages like these:
>> >>
>> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all mon_status
>> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all health
>> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all pg_summary
>> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
>> >> mgrdigest v1
>> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all mon_status
>> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all health
>> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:
>> >> notify_all pg_summary
>> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
>> >> mgrdigest v1
>> >> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>> >>
>> >>
>> >> In a 1 minute period yesterday I saw 84 times this group of messages
>> >> showed up.  Today that same group of messages showed up 156 times.
>> >>
>> >> Other than that I did see an increase in this messages from 9 times a
>> >> minute to 14 times a minute:
>> >>
>> >> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104
>> >> >> -
>> >> conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
>> >> l=0).fault with nothing to send and in the half  accept state just
>> >> closed
>> >>
>> >> Let me know if you need anything else.
>> >>
>> >> Bryan
>> >>
>> >>
>> >> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
>> >> > >> bryan.stillw...@charter.com> wrote:
>> >>
>> >> >On 1/10/17, 5:35 AM, "John Spray"  wrote:
>> >> >
>> >> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>> >> >> wrote:
>> >> >>> Last week I decided to play around with Kraken (11.1.1-1xenial) on
>> >> >>> a
>> >> >>> single node, two OSD cluster, and after a while I noticed that the
>> >> >>> new
>> >> >>> ceph-mgr daemon is freq

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-16 Thread Muthusamy Muthiah
On one of our platforms ceph-mgr uses 3 CPU cores.  Is there a ticket
available for this issue?

Thanks,
Muthu

On 14 February 2017 at 03:13, Brad Hubbard  wrote:

> Could one of the reporters open a tracker for this issue and attach
> the requested debugging data?
>
> On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis 
> wrote:
> > I am having the same issue. When I looked at my idle cluster this
> morning,
> > one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of
> that.  I
> > have 3 AIO nodes, and only one of them seemed to be affected.
> >
> > On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard 
> wrote:
> >>
> >> Want to install debuginfo packages and use something like this to try
> >> and find out where it is spending most of its time?
> >>
> >> https://poormansprofiler.org/
> >>
> >> Note that you may need to do multiple runs to get a "feel" for where
> >> it is spending most of its time. Also note that likely only one or two
> >> threads will be using the CPU (you can see this in ps output using a
> >> command like the following); the rest will likely be idle or waiting
> >> for something.
> >>
> >> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
> >>
> >> Observation of these two and maybe a couple of manual gstack dumps
> >> like this to compare thread ids to ps output (LWP is the thread id
> >> (tid) in gdb output) should give us some idea of where it is spinning.
> >>
> >> # gstack $(pidof ceph-mgr)
> >>
> >>
> >> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
> >>  wrote:
> >> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on
> CentOS
> >> > 7 w/
> >> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and
> has
> >> > allocated ~11GB of RAM after a single day of usage. Only the active
> >> > manager
> >> > is performing this way. The growth is linear and reproducible.
> >> >
> >> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with 45x8TB
> >> > OSDs
> >> > each.
> >> >
> >> >
> >> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94,
> 4.21
> >> >
> >> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
> >> >
> >> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7
> si,
> >> > 0.0
> >> > st
> >> >
> >> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812
> >> > buff/cache
> >> >
> >> > KiB Swap:  2097148 total,  2097148 free,0 used.  4836772 avail
> >> > Mem
> >> >
> >> >
> >> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> >> >
> >> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27
> >> > ceph-mgr
> >> >
> >> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6  65:11.50
> >> > ceph-mon
> >> >
> >> >
> >> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
> >> >  wrote:
> >> >>
> >> >> John,
> >> >>
> >> >> This morning I compared the logs from yesterday and I see a noticeable
> >> >> increase in messages like these:
> >> >>
> >> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
> >> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
> >> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:
> >> >> notify_all mon_status
> >> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:
> >> >> notify_all health
> >> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:
> >> >> notify_all pg_summary
> >> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
> >> >> mgrdigest v1
> >> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest
> v1
> >> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
> >> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
> >> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:
> >> >> notify_all mon_status
> >> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:
> >> >> notify_all health
> >> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:
> >> >> notify_all pg_summary
> >> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
> >> >> mgrdigest v1
> >> >> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest
> v1
> >> >>
> >> >>
> >> >> In a 1 minute period yesterday I saw this group of messages show up 84
> >> >> times.  Today that same group of messages showed up 156 times.
> >> >>
> >> >> Other than that, I did see an increase in this message from 9 times a
> >> >> minute to 14 times a minute:
> >> >>
> >> >> 2017-01-11 09:00:00.402000 7f70f3d61700  0 --
> 172.24.88.207:6800/4104
> >> >> >> -
> >> >> conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0
> cs=0
> >> >> l=0).fault with nothing to send and in the half  accept state just
> >> >> closed
> >> >>
> >> >> Let me know if you need anything else.
> >> >>
> >> >> Bryan
> >> >>
> >> >>
> >> >> On 1/10/17, 10:00 AM, "ceph-users on behalf of 

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-17 Thread John Spray
On Fri, Feb 17, 2017 at 6:27 AM, Muthusamy Muthiah
 wrote:
> On one of our platforms the mgr uses 3 CPU cores. Is there a ticket
> available for this issue?

Not that I'm aware of, you could go ahead and open one.

Cheers,
John

> Thanks,
> Muthu
>
> On 14 February 2017 at 03:13, Brad Hubbard  wrote:
>>
>> Could one of the reporters open a tracker for this issue and attach
>> the requested debugging data?
>>
>> On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis 
>> wrote:
>> > I am having the same issue. When I looked at my idle cluster this
>> > morning,
>> > one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of
>> > that.  I
>> > have 3 AIO nodes, and only one of them seemed to be affected.
>> >
>> > On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard 
>> > wrote:
>> >>
>> >> Want to install debuginfo packages and use something like this to try
>> >> and find out where it is spending most of its time?
>> >>
>> >> https://poormansprofiler.org/
>> >>
>> >> Note that you may need to do multiple runs to get a "feel" for where
>> >> it is spending most of its time. Also note that likely only one or two
>> >> threads will be using the CPU (you can see this in ps output using a
>> >> command like the following); the rest will likely be idle or waiting
>> >> for something.
>> >>
>> >> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
>> >>
>> >> Observation of these two and maybe a couple of manual gstack dumps
>> >> like this to compare thread ids to ps output (LWP is the thread id
>> >> (tid) in gdb output) should give us some idea of where it is spinning.
>> >>
>> >> # gstack $(pidof ceph-mgr)
>> >>
>> >>
>> >> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
>> >>  wrote:
>> >> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on
>> >> > CentOS
>> >> > 7 w/
>> >> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and
>> >> > has
>> >> > allocated ~11GB of RAM after a single day of usage. Only the active
>> >> > manager
>> >> > is performing this way. The growth is linear and reproducible.
>> >> >
>> >> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with
>> >> > 45x8TB
>> >> > OSDs
>> >> > each.
>> >> >
>> >> >
>> >> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94,
>> >> > 4.21
>> >> >
>> >> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
>> >> >
>> >> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7
>> >> > si,
>> >> > 0.0
>> >> > st
>> >> >
>> >> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812
>> >> > buff/cache
>> >> >
>> >> > KiB Swap:  2097148 total,  2097148 free,0 used.  4836772
>> >> > avail
>> >> > Mem
>> >> >
>> >> >
>> >> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>> >> >
>> >> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27
>> >> > ceph-mgr
>> >> >
>> >> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6  65:11.50
>> >> > ceph-mon
>> >> >
>> >> >
>> >> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
>> >> >  wrote:
>> >> >>
>> >> >> John,
>> >> >>
>> >> >> This morning I compared the logs from yesterday and I see a
>> >> >> noticeable increase in messages like these:
>> >> >>
>> >> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all mon_status
>> >> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all health
>> >> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all pg_summary
>> >> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
>> >> >> mgrdigest v1
>> >> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest
>> >> >> v1
>> >> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all mon_status
>> >> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all health
>> >> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all pg_summary
>> >> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
>> >> >> mgrdigest v1
>> >> >> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest
>> >> >> v1
>> >> >>
>> >> >>
>> >> >> In a 1 minute period yesterday I saw this group of messages show up
>> >> >> 84 times.  Today that same group of messages showed up 156 times.
>> >> >>
>> >> >> Other than that, I did see an increase in this message from 9 times a
>> >> >> minute to 14 times a minute:
>> >> >>
>> >> >> 2017-01-11 09:00:00.

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-20 Thread Jay Linux
Hello John,

Created a tracker for this issue; see:
http://tracker.ceph.com/issues/18994
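
For anyone attaching the requested debugging data there, roughly the following
should capture it (output paths are only examples; gstack needs gdb, and the
debuginfo packages make the stacks readable):

# gstack $(pidof ceph-mgr) > /tmp/ceph-mgr.gstack.$(date +%s).txt
# ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan > /tmp/ceph-mgr.threads.txt

and, if the mgr is running with a high debug level, a rough per-minute count
of the mgrdigest dispatches mentioned earlier in the thread (point it at your
ceph-mgr log file):

# grep 'ms_dispatch mgrdigest' ceph-mgr.log | cut -c1-16 | sort | uniq -c

cut -c1-16 keeps the timestamp down to the minute, so uniq -c turns that into
a messages-per-minute count.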

Thanks

On Fri, Feb 17, 2017 at 6:15 PM, John Spray  wrote:

> On Fri, Feb 17, 2017 at 6:27 AM, Muthusamy Muthiah
>  wrote:
> > On one of our platforms the mgr uses 3 CPU cores. Is there a ticket
> > available for this issue?
>
> Not that I'm aware of, you could go ahead and open one.
>
> Cheers,
> John
>
> > Thanks,
> > Muthu
> >
> > On 14 February 2017 at 03:13, Brad Hubbard  wrote:
> >>
> >> Could one of the reporters open a tracker for this issue and attach
> >> the requested debugging data?
> >>
> >> On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis 
> >> wrote:
> >> > I am having the same issue. When I looked at my idle cluster this
> >> > morning,
> >> > one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of
> >> > that.  I
> >> > have 3 AIO nodes, and only one of them seemed to be affected.
> >> >
> >> > On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard 
> >> > wrote:
> >> >>
> >> >> Want to install debuginfo packages and use something like this to try
> >> >> and find out where it is spending most of its time?
> >> >>
> >> >> https://poormansprofiler.org/
> >> >>
> >> >> Note that you may need to do multiple runs to get a "feel" for where
> >> >> it is spending most of its time. Also note that likely only one or two
> >> >> threads will be using the CPU (you can see this in ps output using a
> >> >> command like the following); the rest will likely be idle or waiting
> >> >> for something.
> >> >>
> >> >> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
> >> >>
> >> >> Observation of these two and maybe a couple of manual gstack dumps
> >> >> like this to compare thread ids to ps output (LWP is the thread id
> >> >> (tid) in gdb output) should give us some idea of where it is
> spinning.
> >> >>
> >> >> # gstack $(pidof ceph-mgr)
> >> >>
> >> >>
> >> >> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
> >> >>  wrote:
> >> >> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on
> >> >> > CentOS
> >> >> > 7 w/
> >> >> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and
> >> >> > has
> >> >> > allocated ~11GB of RAM after a single day of usage. Only the active
> >> >> > manager
> >> >> > is performing this way. The growth is linear and reproducible.
> >> >> >
> >> >> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with
> >> >> > 45x8TB
> >> >> > OSDs
> >> >> > each.
> >> >> >
> >> >> >
> >> >> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94,
> >> >> > 4.21
> >> >> >
> >> >> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0
> zombie
> >> >> >
> >> >> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7
> >> >> > si,
> >> >> > 0.0
> >> >> > st
> >> >> >
> >> >> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812
> >> >> > buff/cache
> >> >> >
> >> >> > KiB Swap:  2097148 total,  2097148 free,0 used.  4836772
> >> >> > avail
> >> >> > Mem
> >> >> >
> >> >> >
> >> >> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> >> >> >
> >> >> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27
> >> >> > ceph-mgr
> >> >> >
> >> >> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6  65:11.50
> >> >> > ceph-mon
> >> >> >
> >> >> >
> >> >> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
> >> >> >  wrote:
> >> >> >>
> >> >> >> John,
> >> >> >>
> >> >> >> This morning I compared the logs from yesterday and I see a
> >> >> >> noticeable increase in messages like these:
> >> >> >>
> >> >> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest
> 575
> >> >> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest
> 441
> >> >> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all
> >> >> >> notify_all:
> >> >> >> notify_all mon_status
> >> >> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all
> >> >> >> notify_all:
> >> >> >> notify_all health
> >> >> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all
> >> >> >> notify_all:
> >> >> >> notify_all pg_summary
> >> >> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
> >> >> >> mgrdigest v1
> >> >> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch
> mgrdigest
> >> >> >> v1
> >> >> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest
> 575
> >> >> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest
> 441
> >> >> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all
> >> >> >> notify_all:
> >> >> >> notify_all mon_status
> >> >> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all
> >> >> >> notify_all:
> >> >> >> notify_all health
> >> >> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all
> >> >> >> notify_all:
> >> >> >> notify_all pg_summary
> >> >> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
> >> >> >> mgrdigest v1
> >> >>

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-20 Thread Brad Hubbard
Refer to my previous post for data you can gather that will help
narrow this down.
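
For convenience, the sampling loop behind the poor man's profiler page I
linked boils down to something like this sketch (sample count, interval and
the awk post-processing are rough and adjustable; the ceph debuginfo packages
need to be installed for the frames to resolve to symbols):

nsamples=10
sleeptime=2
pid=$(pidof ceph-mgr)
for i in $(seq 1 $nsamples); do
    gdb -batch -ex "set pagination 0" -ex "thread apply all bt" -p "$pid" 2>/dev/null
    sleep $sleeptime
done | \
awk '
  BEGIN     { s = "" }
  /^Thread/ { if (s != "") print s; s = "" }
  /^\#/     { s = (s == "" ? $4 : s "," $4) }
  END       { if (s != "") print s }
' | sort | uniq -c | sort -rn | head

The collapsed stacks with the highest counts at the top of that output are
usually a good hint at where the busy thread is spending its time.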

On Mon, Feb 20, 2017 at 6:36 PM, Jay Linux  wrote:
> Hello John,
>
> Created a tracker for this issue; see:
> http://tracker.ceph.com/issues/18994
>
> Thanks
>
> On Fri, Feb 17, 2017 at 6:15 PM, John Spray  wrote:
>>
>> On Fri, Feb 17, 2017 at 6:27 AM, Muthusamy Muthiah
>>  wrote:
>> > On one of our platforms the mgr uses 3 CPU cores. Is there a ticket
>> > available for this issue?
>>
>> Not that I'm aware of, you could go ahead and open one.
>>
>> Cheers,
>> John
>>
>> > Thanks,
>> > Muthu
>> >
>> > On 14 February 2017 at 03:13, Brad Hubbard  wrote:
>> >>
>> >> Could one of the reporters open a tracker for this issue and attach
>> >> the requested debugging data?
>> >>
>> >> On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis 
>> >> wrote:
>> >> > I am having the same issue. When I looked at my idle cluster this
>> >> > morning,
>> >> > one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of
>> >> > that.  I
>> >> > have 3 AIO nodes, and only one of them seemed to be affected.
>> >> >
>> >> > On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard 
>> >> > wrote:
>> >> >>
>> >> >> Want to install debuginfo packages and use something like this to
>> >> >> try
>> >> >> and find out where it is spending most of its time?
>> >> >>
>> >> >> https://poormansprofiler.org/
>> >> >>
>> >> >> Note that you may need to do multiple runs to get a "feel" for where
>> >> >> it is spending most of its time. Also note that likely only one or
>> >> >> two threads will be using the CPU (you can see this in ps output using
>> >> >> a command like the following); the rest will likely be idle or waiting
>> >> >> for something.
>> >> >>
>> >> >> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
>> >> >>
>> >> >> Observation of these two and maybe a couple of manual gstack dumps
>> >> >> like this to compare thread ids to ps output (LWP is the thread id
>> >> >> (tid) in gdb output) should give us some idea of where it is
>> >> >> spinning.
>> >> >>
>> >> >> # gstack $(pidof ceph-mgr)
>> >> >>
>> >> >>
>> >> >> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
>> >> >>  wrote:
>> >> >> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on
>> >> >> > CentOS
>> >> >> > 7 w/
>> >> >> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU
>> >> >> > and
>> >> >> > has
>> >> >> > allocated ~11GB of RAM after a single day of usage. Only the
>> >> >> > active
>> >> >> > manager
>> >> >> > is performing this way. The growth is linear and reproducible.
>> >> >> >
>> >> >> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with
>> >> >> > 45x8TB
>> >> >> > OSDs
>> >> >> > each.
>> >> >> >
>> >> >> >
>> >> >> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56,
>> >> >> > 3.94,
>> >> >> > 4.21
>> >> >> >
>> >> >> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0
>> >> >> > zombie
>> >> >> >
>> >> >> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,
>> >> >> > 0.7
>> >> >> > si,
>> >> >> > 0.0
>> >> >> > st
>> >> >> >
>> >> >> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812
>> >> >> > buff/cache
>> >> >> >
>> >> >> > KiB Swap:  2097148 total,  2097148 free,0 used.  4836772
>> >> >> > avail
>> >> >> > Mem
>> >> >> >
>> >> >> >
>> >> >> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>> >> >> >
>> >> >> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8
>> >> >> > 2094:27
>> >> >> > ceph-mgr
>> >> >> >
>> >> >> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6
>> >> >> > 65:11.50
>> >> >> > ceph-mon
>> >> >> >
>> >> >> >
>> >> >> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
>> >> >> >  wrote:
>> >> >> >>
>> >> >> >> John,
>> >> >> >>
>> >> >> >> This morning I compared the logs from yesterday and I see a
>> >> >> >> noticeable increase in messages like these:
>> >> >> >>
>> >> >> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest
>> >> >> >> 575
>> >> >> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest
>> >> >> >> 441
>> >> >> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all
>> >> >> >> notify_all:
>> >> >> >> notify_all mon_status
>> >> >> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all
>> >> >> >> notify_all:
>> >> >> >> notify_all health
>> >> >> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all
>> >> >> >> notify_all:
>> >> >> >> notify_all pg_summary
>> >> >> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
>> >> >> >> mgrdigest v1
>> >> >> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch
>> >> >> >> mgrdigest
>> >> >> >> v1
>> >> >> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest
>> >> >> >> 575
>> >> >> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest
>> >> >> >> 441
>> >> >> >> 2017-01-11 09:00:03.033628 7f70f15c1700 1