[ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
Last week I decided to play around with Kraken (11.1.1-1xenial) on a
single-node, two-OSD cluster, and after a while I noticed that the new
ceph-mgr daemon is frequently using a lot of CPU:

17519 ceph 20 0 850044 168104208 S 102.7 4.3 1278:27 ceph-mgr

Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
usage down to < 1%, but after a while it climbs back up to > 100%. Has
anyone else seen this?

Bryan

E-MAIL CONFIDENTIALITY NOTICE:
The contents of this e-mail message and any attachments are intended solely
for the addressee(s) and may contain confidential and/or legally privileged
information. If you are not the intended recipient of this message or if this
message has been addressed to you in error, please immediately alert the
sender by reply e-mail and then delete this message and any attachments. If
you are not the intended recipient, you are notified that any use,
dissemination, distribution, copying, or storage of this message or any
attachment is strictly prohibited.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
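[Editor's note: the `top` line above can be spot-checked non-interactively. A minimal sketch; it is demonstrated on the current shell's pid so it runs anywhere, with the real ceph-mgr invocation left in a comment.]

```shell
# Spot-check a daemon's CPU/memory without an interactive 'top' session.
# For the mgr one would run (procps-style ps, process name as in the thread):
#   ps -o pid,%cpu,%mem,etime,cmd -C ceph-mgr
# Demonstrated here on the current shell so the snippet is runnable anywhere:
cpu=$(ps -o %cpu= -p $$)
echo "pid $$ %cpu: $cpu"
```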
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J wrote:
> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
> single node, two OSD cluster, and after a while I noticed that the new
> ceph-mgr daemon is frequently using a lot of the CPU:
>
> 17519 ceph 20 0 850044 168104208 S 102.7 4.3 1278:27 ceph-mgr
>
> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
> usage down to < 1%, but after a while it climbs back up to > 100%. Has
> anyone else seen this?

Definitely worth investigating, could you set "debug mgr = 20" on the
daemon to see if it's obviously spinning in a particular place?

Thanks,
John
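[Editor's note: John's suggestion can be applied in ceph.conf on the mgr host; a minimal sketch, with the option spelled exactly as in the thread. Section placement is the conventional one, not confirmed by the thread.]

```ini
# ceph.conf on the mgr host -- raise mgr debug logging to level 20.
# Restart ceph-mgr afterwards for the file-based setting to take effect.
[mgr]
    debug mgr = 20
```

Bryan's later reply says he "injected" the option instead, i.e. applied it at runtime without a restart; on releases of this vintage that is typically done through the daemon's admin socket or an injectargs-style command, but the exact syntax varies by release, so check your version's documentation.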
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
On 1/10/17, 5:35 AM, "John Spray" wrote:
>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J wrote:
>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>> single node, two OSD cluster, and after a while I noticed that the new
>> ceph-mgr daemon is frequently using a lot of the CPU:
>>
>> 17519 ceph 20 0 850044 168104208 S 102.7 4.3 1278:27 ceph-mgr
>>
>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>> usage down to < 1%, but after a while it climbs back up to > 100%. Has
>> anyone else seen this?
>
>Definitely worth investigating, could you set "debug mgr = 20" on the
>daemon to see if it's obviously spinning in a particular place?

I've injected that option into the ceph-mgr process, and now I'm just
waiting for it to go out of control again.

However, I've noticed quite a few messages like this in the logs already:

2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
cs=1 l=0).fault initiating reconnect
2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
accept peer reset, then tried to connect to us, replacing
2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
send and in the half accept state just closed

What's weird about that is that this is a single-node cluster with
ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
host, so none of the communication should be leaving the node.

Bryan
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
What ceph sha1 is that? Does it include
6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger spin?
-Sam

On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J wrote:
> On 1/10/17, 5:35 AM, "John Spray" wrote:
>
>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>daemon to see if it's obviously spinning in a particular place?
>
> I've injected that option into the ceph-mgr process, and now I'm just
> waiting for it to go out of control again.
>
> [...]
>
> What's weird about that is that this is a single node cluster with
> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
> host. So none of the communication should be leaving the node.
>
> Bryan
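[Editor's note: Sam's question, whether a given build contains the messenger fix, can be answered mechanically with `git merge-base --is-ancestor`. A sketch; the real check against a clone of ceph.git is left in a comment (commit ids taken from the thread), and a throwaway repo demonstrates the same idiom so the snippet runs anywhere.]

```shell
# In a clone of ceph.git the check would be:
#   git merge-base --is-ancestor 6c3d015c6854a12cda40673848813d968ff6afae \
#       87597971b371d7f497d7eabad3545d72d18dd755 && echo "fix is included"
# (exit status 0 means the fix commit is an ancestor of the release commit)
# Demo of the same idiom in a throwaway repo:
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=t@t -c user.name=t commit -q --allow-empty -m fix
fix=$(git -C "$repo" rev-parse HEAD)
git -C "$repo" -c user.email=t@t -c user.name=t commit -q --allow-empty -m release
release=$(git -C "$repo" rev-parse HEAD)
git -C "$repo" merge-base --is-ancestor "$fix" "$release" && echo "fix is included"
```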
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
This is from:

ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)

On 1/10/17, 10:23 AM, "Samuel Just" wrote:
>What ceph sha1 is that? Does it include
>6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
>spin?
>-Sam
>
> [...]
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
Can you push that branch somewhere? I don't have a v11.1.1 or that sha1.
-Sam

On Tue, Jan 10, 2017 at 9:41 AM, Stillwell, Bryan J wrote:
> This is from:
>
> ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)
>
> [...]
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
That's strange, I installed that version using packages from here:

http://download.ceph.com/debian-kraken/pool/main/c/ceph/

Bryan

On 1/10/17, 10:51 AM, "Samuel Just" wrote:
>Can you push that branch somewhere? I don't have a v11.1.1 or that sha1.
>-Sam
>
> [...]
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
Mm, maybe the tag didn't get pushed. Alfredo, is there supposed to be
a v11.1.1 tag?
-Sam

On Tue, Jan 10, 2017 at 9:57 AM, Stillwell, Bryan J wrote:
> That's strange, I installed that version using packages from here:
>
> http://download.ceph.com/debian-kraken/pool/main/c/ceph/
>
> Bryan
>
> [...]
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
On Tue, Jan 10, 2017 at 12:59 PM, Samuel Just wrote:
> Mm, maybe the tag didn't get pushed. Alfredo, is there supposed to be
> a v11.1.1 tag?

Yep. You can see there is one here:

https://github.com/ceph/ceph/releases

Specifically: https://github.com/ceph/ceph/releases/tag/v11.1.1

which points to:

https://github.com/ceph/ceph/commit/87597971b371d7f497d7eabad3545d72d18dd755

> -Sam
>
> [...]
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
John, This morning I compared the logs from yesterday and I show a noticeable increase in messages like these: 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all: notify_all health 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary 2017-01-11 09:00:03.033613 7f70f15c1700 4 mgr ms_dispatch active mgrdigest v1 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all: notify_all health 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary 2017-01-11 09:00:03.532898 7f70f15c1700 4 mgr ms_dispatch active mgrdigest v1 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1 In a 1 minute period yesterday I saw 84 times this group of messages showed up. Today that same group of messages showed up 156 times. Other than that I did see an increase in this messages from 9 times a minute to 14 times a minute: 2017-01-11 09:00:00.402000 7f70f3d61700 0 -- 172.24.88.207:6800/4104 >> - conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed Let me know if you need anything else. 
Bryan


On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J" wrote:
>On 1/10/17, 5:35 AM, "John Spray" wrote:
>
>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J wrote:
>>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>>> single node, two OSD cluster, and after a while I noticed that the new
>>> ceph-mgr daemon is frequently using a lot of the CPU:
>>>
>>> 17519 ceph 20 0 850044 168104208 S 102.7 4.3 1278:27 ceph-mgr
>>>
>>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>>> usage down to < 1%, but after a while it climbs back up to > 100%. Has
>>> anyone else seen this?
>>
>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>daemon to see if it's obviously spinning in a particular place?
>
>I've injected that option into the ceph-mgr process, and now I'm just
>waiting for it to go out of control again.
>
>However, I've noticed quite a few messages like this in the logs already:
>
>2017-01-10 09:56:07.441678 7f70f4562700 0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2 cs=1 l=0).fault initiating reconnect
>2017-01-10 09:56:07.442044 7f70f4562700 0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>2017-01-10 09:56:07.442067 7f70f4562700 0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
>2017-01-10 09:56:07.443026 7f70f4562700 0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to send and in the half accept state just closed
>
>What's weird about that is that this is a single-node cluster with
>ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
>host, so none of the communication should be leaving the node.
>
>Bryan
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS 7 with an elrepo 4.8.10 kernel. ceph-mgr is currently tearing through CPU and has allocated ~11 GB of RAM after a single day of usage. Only the active manager is behaving this way. The growth is linear and reproducible.

The cluster is mostly idle: 3 mons (4 CPU, 16 GB), 20 heads with 45x8TB OSDs each.

top - 23:45:47 up 1 day, 1:32, 1 user, load average: 3.56, 3.94, 4.21
Tasks: 178 total, 1 running, 177 sleeping, 0 stopped, 0 zombie
%Cpu(s): 33.9 us, 28.1 sy, 0.0 ni, 37.3 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
KiB Mem : 16423844 total, 3980500 free, 11556532 used, 886812 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 4836772 avail Mem

  PID USER  PR NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
 2351 ceph  20  0 12.160g 0.010t  17380 S 203.7 64.8  2094:27 ceph-mgr
 2302 ceph  20  0  620316 267992 157620 S   2.3  1.6 65:11.50 ceph-mon
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
Want to install debuginfo packages and use something like this to try to find out where it is spending most of its time?

https://poormansprofiler.org/

Note that you may need to do multiple runs to get a "feel" for where it is spending most of its time. Also note that likely only one or two threads will be using the CPU (you can see this in ps output using a command like the following); the rest will likely be idle or waiting for something.

# ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan

Observation of these two, and maybe a couple of manual gstack dumps like this to compare thread ids to ps output (LWP is the thread id (tid) in gdb output), should give us some idea of where it is spinning.

# gstack $(pidof ceph-mgr)
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
I am having the same issue. When I looked at my idle cluster this morning, one of the nodes had 400% CPU utilization, and ceph-mgr accounted for 300% of that. I have 3 AIO nodes, and only one of them seemed to be affected.
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
Could one of the reporters open a tracker for this issue and attach the requested debugging data?
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
On one of our platforms, ceph-mgr uses 3 CPU cores. Is there a ticket available for this issue?

Thanks,
Muthu
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
On Fri, Feb 17, 2017 at 6:27 AM, Muthusamy Muthiah wrote:
> On one of our platforms, ceph-mgr uses 3 CPU cores. Is there a ticket available for this issue?

Not that I'm aware of, you could go ahead and open one.

Cheers,
John
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
Hello John,

Created a tracker for this issue. Refer to
http://tracker.ceph.com/issues/18994

Thanks

On Fri, Feb 17, 2017 at 6:15 PM, John Spray wrote:
> On Fri, Feb 17, 2017 at 6:27 AM, Muthusamy Muthiah wrote:
> > On one of our platforms mgr uses 3 CPU cores. Is there a ticket
> > available for this issue?
>
> Not that I'm aware of, you could go ahead and open one.
>
> Cheers,
> John
>
> > Thanks,
> > Muthu
> >
> > On 14 February 2017 at 03:13, Brad Hubbard wrote:
> >>
> >> Could one of the reporters open a tracker for this issue and attach
> >> the requested debugging data?
> >>
> >> On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis wrote:
> >> > I am having the same issue. When I looked at my idle cluster this
> >> > morning, one of the nodes had 400% CPU utilization, and ceph-mgr was
> >> > 300% of that. I have 3 AIO nodes, and only one of them seemed to be
> >> > affected.
> >> >
> >> > On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard wrote:
> >> >>
> >> >> Want to install debuginfo packages and use something like this to
> >> >> try and find out where it is spending most of its time?
> >> >>
> >> >> https://poormansprofiler.org/
> >> >>
> >> >> Note that you may need to do multiple runs to get a "feel" for where
> >> >> it is spending most of its time. Also note that likely only one or
> >> >> two threads will be using the CPU (you can see this in ps output
> >> >> using a command like the following); the rest will likely be idle or
> >> >> waiting for something.
> >> >>
> >> >> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
> >> >>
> >> >> Observation of these two and maybe a couple of manual gstack dumps
> >> >> like this to compare thread ids to ps output (LWP is the thread id
> >> >> (tid) in gdb output) should give us some idea of where it is
> >> >> spinning.
> >> >>
> >> >> # gstack $(pidof ceph-mgr)
> >> >>
> >> >> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff wrote:
> >> >> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on
> >> >> > CentOS 7 w/ elrepo kernel 4.8.10. ceph-mgr is currently tearing
> >> >> > through CPU and has allocated ~11GB of RAM after a single day of
> >> >> > usage. Only the active manager is performing this way. The growth
> >> >> > is linear and reproducible.
> >> >> >
> >> >> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with
> >> >> > 45x8TB OSDs each.
> >> >> >
> >> >> > top - 23:45:47 up 1 day, 1:32, 1 user, load average: 3.56, 3.94, 4.21
> >> >> > Tasks: 178 total, 1 running, 177 sleeping, 0 stopped, 0 zombie
> >> >> > %Cpu(s): 33.9 us, 28.1 sy, 0.0 ni, 37.3 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
> >> >> > KiB Mem : 16423844 total, 3980500 free, 11556532 used, 886812 buff/cache
> >> >> > KiB Swap: 2097148 total, 2097148 free, 0 used. 4836772 avail Mem
> >> >> >
> >> >> >   PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
> >> >> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8  2094:27 ceph-mgr
> >> >> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6 65:11.50 ceph-mon
> >> >> >
> >> >> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J wrote:
> >> >> >>
> >> >> >> John,
> >> >> >>
> >> >> >> This morning I compared the logs from yesterday and I show a
> >> >> >> noticeable increase in messages like these:
> >> >> >>
> >> >> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
> >> >> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
> >> >> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status
> >> >> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all: notify_all health
> >> >> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary
> >> >> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active mgrdigest v1
> >> >> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
> >> >> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
> >> >> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
> >> >> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status
> >> >> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all: notify_all health
> >> >> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary
> >> >> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active mgrdigest v1
> >> >>
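[For anyone wanting to act on Brad's suggestion quoted above, the poor-man's-profiler approach can be scripted roughly as follows. This is a sketch modeled on poormansprofiler.org, not code from the thread: the sample count and one-second interval are arbitrary choices, and it assumes gdb plus the ceph debuginfo packages are installed and that the script runs with permission to attach to the ceph-mgr process.]

```shell
#!/bin/sh
# Poor man's profiler for ceph-mgr: take repeated gdb stack samples,
# collapse each thread's backtrace into a comma-separated list of frame
# function names, and print the most frequently seen stacks first.
pid=$(pidof ceph-mgr)
nsamples=10

for i in $(seq 1 "$nsamples"); do
  gdb -batch -ex 'set pagination 0' -ex 'thread apply all bt' -p "$pid" 2>/dev/null
  sleep 1
done |
awk '
  BEGIN { s = "" }
  /^Thread/ { if (s != "") print s; s = "" }          # new thread: flush previous stack
  /^#/     { if (s != "") s = s "," $4; else s = $4 } # frame line: $4 is the function name
  END      { if (s != "") print s }' |
sort | uniq -c | sort -rn | head
```

[The `$4` extraction assumes gdb frames of the form `#0  0x... in func () from lib`; stacks that show up with high counts across samples are where the daemon is spinning, which is exactly the comparison against the `ps axHo` thread ids (tid/LWP) that Brad describes.]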
Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster
Refer to my previous post for data you can gather that will help narrow
this down.

On Mon, Feb 20, 2017 at 6:36 PM, Jay Linux wrote:
> Hello John,
>
> Created a tracker for this issue. Refer to
> http://tracker.ceph.com/issues/18994
>
> Thanks
>
> On Fri, Feb 17, 2017 at 6:15 PM, John Spray wrote:
>>
>> On Fri, Feb 17, 2017 at 6:27 AM, Muthusamy Muthiah wrote:
>> > On one of our platforms mgr uses 3 CPU cores. Is there a ticket
>> > available for this issue?
>>
>> Not that I'm aware of, you could go ahead and open one.
>>
>> Cheers,
>> John
>>
>> [snip]