On Thu, Aug 23, 2018 at 8:42 AM Adrien Gillard <gillard.adr...@gmail.com>
wrote:

> With a bit of profiling, it seems all the memory is allocated to
> ceph::logging::Log::create_entry (see below)
>
> Should this be normal? Is it because some OSDs are down and it is logging
> the results of its osd_ping?
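> (For anyone wanting to reproduce: these dumps are in the format produced by
> the built-in tcmalloc heap profiler, i.e. roughly:
>  $ ceph tell osd.36 heap start_profiler
>  $ ceph tell osd.36 heap dump
> and then analysed with pprof as shown below.)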
>

Hmm, is that where the memory is actually *staying*, or was it merely
allocated and then quickly deallocated? If you have debugging turned on, the
volume of log output can be quite large, but only a small amount is kept
resident in memory at a time. Given that you don't seem to have any debug
levels over 5 (except monc at 10), that does seem like a very strange ratio,
though.

Generally log messages are put into a circular buffer and then removed, and
if the buffer gets full the thread trying to log simply blocks. I'm not sure
how you could accumulate arbitrarily large amounts of data like that. Are you
using syslog or something? I don't think that should do it, but that's all I
can imagine.
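For what it's worth, the size of that in-memory buffer is governed by the
log_max_new and log_max_recent options; a quick sanity check (adjust the
socket path to your OSD):
 $ ceph --admin-daemon /var/run/ceph/ceph-osd.36.asok config show | grep log_max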
That or somehow the upgrade has broken the way the OSD releases memory back
to the OS. Have you looked at the heap stats available through the admin
socket? What distro are you running and what's the source of your packages?
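For example, something like this (assuming osd.36 and a tcmalloc build):
 $ ceph tell osd.36 heap stats
 $ ceph tell osd.36 heap release
The release asks tcmalloc to hand free pages back to the OS, which helps
distinguish memory held by the allocator from memory actually in use.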
-Greg


>
> The debug level of the OSD is below also.
>
> Thanks,
>
> Adrien
>
>  $ pprof /usr/bin/ceph-osd osd.36.profile.0042-first.heap
> Using local file /usr/bin/ceph-osd.
> Using local file osd.36.profile.0042-first.heap.
> Welcome to pprof!  For help, type 'help'.
> (pprof) top
> Total: 2468.0 MB
>    519.5  21.0%  21.0%   1000.4  40.5% std::_Rb_tree::_M_copy
>    481.0  19.5%  40.5%    481.0  19.5% std::_Rb_tree::_M_create_node
>    384.3  15.6%  56.1%    384.3  15.6% std::_Rb_tree::_M_emplace_hint_unique
>    374.0  15.2%  71.3%    374.0  15.2% ceph::buffer::create_aligned_in_mempool
>    305.6  12.4%  83.6%    305.6  12.4% ceph::logging::Log::create_entry
>    217.3   8.8%  92.4%    220.2   8.9% std::vector::_M_emplace_back_aux
>    114.4   4.6%  97.1%    115.6   4.7% ceph::buffer::list::append@a51900
>     21.4   0.9%  98.0%    210.0   8.5% OSD::heartbeat
>     21.2   0.9%  98.8%     21.2   0.9% std::string::_Rep::_S_create
>      4.7   0.2%  99.0%    266.2  10.8% AsyncConnection::send_message
> (pprof)
>
>
>  $ pprof /usr/bin/ceph-osd --alloc_space osd.36.profile.0042-first.heap
> Using local file /usr/bin/ceph-osd.
> Using local file osd.36.profile.0042.heap.
> Welcome to pprof!  For help, type 'help'.
> (pprof) top
> Total: 16519.3 MB
>  11915.7  72.1%  72.1%  11915.7  72.1% ceph::logging::Log::create_entry
>    745.0   4.5%  76.6%    745.0   4.5% std::string::_Rep::_S_create
>    716.9   4.3%  81.0%    716.9   4.3% ceph::buffer::create_aligned_in_mempool
>    700.2   4.2%  85.2%    703.8   4.3% std::vector::_M_emplace_back_aux
>    671.9   4.1%  89.3%    671.9   4.1% std::_Rb_tree::_M_emplace_hint_unique
>    557.8   3.4%  92.7%   1075.2   6.5% std::_Rb_tree::_M_copy
>    517.4   3.1%  95.8%    517.4   3.1% std::_Rb_tree::_M_create_node
>    468.2   2.8%  98.6%    469.6   2.8% ceph::buffer::list::append@a51900
>    122.8   0.7%  99.4%    122.8   0.7% ceph::buffer::list::append@a51d20
>     33.3   0.2%  99.6%   7042.7  42.6% OSD::heartbeat
> (pprof)
>
>
>  $ ceph --admin-daemon /var/run/ceph/ceph-osd.36.asok config show | grep debug
>     "debug_none": "0/5",
>     "debug_lockdep": "0/1",
>     "debug_context": "0/1",
>     "debug_crush": "1/1",
>     "debug_mds": "1/5",
>     "debug_mds_balancer": "1/5",
>     "debug_mds_locker": "1/5",
>     "debug_mds_log": "1/5",
>     "debug_mds_log_expire": "1/5",
>     "debug_mds_migrator": "1/5",
>     "debug_buffer": "0/1",
>     "debug_timer": "0/1",
>     "debug_filer": "0/1",
>     "debug_striper": "0/1",
>     "debug_objecter": "0/1",
>     "debug_rados": "0/5",
>     "debug_rbd": "0/5",
>     "debug_rbd_mirror": "0/5",
>     "debug_rbd_replay": "0/5",
>     "debug_journaler": "0/5",
>     "debug_objectcacher": "0/5",
>     "debug_client": "0/5",
>     "debug_osd": "0/0",
>     "debug_optracker": "0/5",
>     "debug_objclass": "0/5",
>     "debug_filestore": "0/0",
>     "debug_journal": "0/0",
>     "debug_ms": "0/5",
>     "debug_mon": "1/5",
>     "debug_monc": "0/10",
>     "debug_paxos": "1/5",
>     "debug_tp": "0/5",
>     "debug_auth": "1/5",
>     "debug_crypto": "1/5",
>     "debug_finisher": "1/1",
>     "debug_reserver": "1/1",
>     "debug_heartbeatmap": "1/5",
>     "debug_perfcounter": "1/5",
>     "debug_rgw": "1/5",
>     "debug_civetweb": "1/10",
>     "debug_javaclient": "1/5",
>     "debug_asok": "1/5",
>     "debug_throttle": "1/1",
>     "debug_refs": "0/0",
>     "debug_xio": "1/5",
>     "debug_compressor": "1/5",
>     "debug_bluestore": "1/5",
>     "debug_bluefs": "1/5",
>     "debug_bdev": "1/3",
>     "debug_kstore": "1/5",
>     "debug_rocksdb": "4/5",
>     "debug_leveldb": "4/5",
>     "debug_memdb": "4/5",
>     "debug_kinetic": "1/5",
>     "debug_fuse": "1/5",
>     "debug_mgr": "1/5",
>     "debug_mgrc": "1/5",
>     "debug_dpdk": "1/5",
>     "debug_eventtrace": "1/5",
>
>
> On Thu, Aug 23, 2018 at 4:27 PM Adrien Gillard <gillard.adr...@gmail.com>
> wrote:
>
>> After upgrading to Luminous, we see the exact same behaviour, with OSDs
>> eating as much as 80-90 GB of memory.
>>
>> We'll try some memory profiling, but at this point we're a bit lost. Are
>> there any specific logs that could help us?
>>
>> On Thu, Aug 23, 2018 at 2:34 PM Adrien Gillard <gillard.adr...@gmail.com>
>> wrote:
>>
>>> Well, after a few hours, still nothing new in the behaviour. With half
>>> of the OSDs (so 6 per host) up and peering and the nodown flag set to
>>> limit the creation of new maps, all the memory is consumed and OSDs get
>>> killed by the OOM killer.
>>> We observe a lot of threads being created for each OSD (roughly between
>>> 500 and 900). As there are around 200 PGs per OSD, does this seem like a
>>> lot to you? We see tons being created each second, most probably to
>>> replace the threads that time out or disappear with the OSD processes.
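>>> A rough sketch of how to check the per-process thread counts:
>>>  $ for p in $(pidof ceph-osd); do echo "$p: $(ps -o nlwp= -p $p)"; done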
>>>
>>> We are preparing to upgrade the OSDs to Luminous.
>>>
>>> On Thu, Aug 23, 2018 at 4:05 AM Gregory Farnum <gfar...@redhat.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Aug 22, 2018 at 6:02 PM Adrien Gillard <
>>>> gillard.adr...@gmail.com> wrote:
>>>>
>>>>> We 'paused' the cluster early in our investigation to avoid
>>>>> unnecessary IO.
>>>>> We also set the nodown flag, but the OOM rate was really sustained and
>>>>> some servers stopped responding from time to time, so we decided to
>>>>> lower the number of OSDs that are up and let them peer.
>>>>>
>>>>
>>>> Right, but this often also means that they need to deal with changing
>>>> intervals or trying to set up recovery and things, plus the change in which
>>>> OSDs are up creates new maps that need to be processed. If you set “nodown”
>>>> you can incrementally bring up OSDs while doing a better job of throttling
>>>> the map creation process and preventing a bunch of new IO when/if they OOM.
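>>>> Concretely, a sketch of that approach (assuming systemd units; adjust
>>>> flags and OSD ids to your situation):
>>>>  $ ceph osd set nodown
>>>>  $ ceph osd set norecover
>>>>  $ ceph osd set nobackfill
>>>>  $ systemctl start ceph-osd@36    # then a few more OSDs at a time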
>>>>
>>>>
>>>>> I don't know if it is the best solution, but it is the middle of the
>>>>> night here, so we'll see tomorrow morning and consider completing the
>>>>> upgrade.
>>>>>
>>>>> I agree with you, Greg, it does not make sense. The cluster was
>>>>> running great for around 6 hours after the mon upgrade...
>>>>>
>>>>> Finally, regarding cache tiering, I know it is not used a lot, nor
>>>>> will it be maintained over time. One of the main goals of the Luminous
>>>>> upgrade is to remove it completely and switch to EC RBD.
>>>>>
>>>>> Thank you for your input!
>>>>>
>>>>>
>>>>> On Wed, Aug 22, 2018 at 11:57 PM Gregory Farnum <gfar...@redhat.com>
>>>>> wrote:
>>>>>
>>>>>> On Wed, Aug 22, 2018 at 6:35 AM Adrien Gillard <
>>>>>> gillard.adr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> We are having a hard time figuring out a behaviour we encountered
>>>>>>> after upgrading the monitors of one of our clusters from Jewel to
>>>>>>> Luminous yesterday.
>>>>>>>
>>>>>>> The cluster is composed of 14 OSD hosts (2x E5-2640 v3 and 64 GB of
>>>>>>> RAM each), each containing 12x 4 TB OSDs with journals on DC-grade
>>>>>>> SSDs. The cluster was installed on Infernalis and upgraded to Jewel.
>>>>>>> We mainly use it for RBD with a cache tier in front of an EC pool.
>>>>>>>
>>>>>>> We started the upgrade from Jewel 10.2.10 to Luminous 12.2.7
>>>>>>> yesterday afternoon with the monitors and the provisioning of the
>>>>>>> MGR. We were planning on upgrading the OSD servers today.
>>>>>>> All went well until the cluster received the following commands, a
>>>>>>> few hours after the upgrade:
>>>>>>>
>>>>>>> ceph osd pool set pool-cache-01 target_max_bytes 0
>>>>>>> ceph osd pool set pool-cache-02 target_max_bytes 0
>>>>>>>
>>>>>>> (The context here is that we use this cluster for backups and set
>>>>>>> target_max_bytes to 0 to stop the flushing and give maximum
>>>>>>> resources to the backup IOs.)
>>>>>>>
>>>>>>> Below is an extract from the logs of one of the monitors.
>>>>>>> The cluster started exchanging a lot of traffic over the cluster
>>>>>>> network, apparently backfill/recovery at multiple Gbit/s, in an
>>>>>>> amount we've never seen before. Because some network uplinks are
>>>>>>> undersized, the traffic could not be handled well, and this led to
>>>>>>> OSDs crashing.
>>>>>>>
>>>>>>
>>>>>> This really doesn't make a ton of sense — you're saying that by
>>>>>> telling the cluster to *stop* doing flushing IO, everything went bad? In
>>>>>> terms of root-causing the issue you'll obviously want to focus on that,
>>>>>> although it may or may not be helpful in recovering things at this point.
>>>>>>
>>>>>> Do keep in mind that users running cache-tiering are increasingly
>>>>>> uncommon, and while we still test it we aren't expanding that coverage 
>>>>>> and
>>>>>> would really like to remove and replace that functionality, so you're out
>>>>>> on the far edge of things here. :/
>>>>>>
>>>>>> Finally, if you're still experiencing OOM issues, you may want to set
>>>>>> the "nodown" flag — that will prevent OSDs being marked down, which
>>>>>> obviously will stall IO and peering if they go down anyway, but will
>>>>>> prevent spamming map updates that cause the cluster to OOM again.
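>>>>>> For reference, setting and later clearing the flag is just:
>>>>>>  $ ceph osd set nodown
>>>>>>  $ ceph osd unset nodown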
>>>>>> -Greg
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> At the moment we are still investigating the logs to try to
>>>>>>> understand what happened exactly. The cluster is down. When we
>>>>>>> restart the OSDs, PGs start activating and peering, but at some point
>>>>>>> all the RAM and swap of the servers is consumed and the OSDs crash.
>>>>>>> I tried to set the obvious cluster parameters
>>>>>>> (osd_recovery_max_active=1, osd_recovery_threads=1,
>>>>>>> osd_max_backfills=1) and norecover, nobackfill, but when the OSDs
>>>>>>> start, we still see 4 or 5 Gbit/s over the cluster network.
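>>>>>>> For reference, a sketch of how those can be applied at runtime (some
>>>>>>> options may only take effect after an OSD restart):
>>>>>>>  $ ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_max_backfills 1'
>>>>>>>  $ ceph osd set norecover
>>>>>>>  $ ceph osd set nobackfill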
>>>>>>>
>>>>>>> We are currently reconfiguring the network to try to remove the
>>>>>>> bandwidth problem and are preparing to add more RAM to the servers.
>>>>>>>
>>>>>>> But still, I have a couple of questions:
>>>>>>>
>>>>>>>  - Has anyone encountered the issue we have with the cache, or does
>>>>>>> anyone have a possible explanation?
>>>>>>>  - Is it normal to observe so much traffic on the cluster network
>>>>>>> when the PGs are activating/peering with norecover and nobackfill
>>>>>>> set?
>>>>>>>
>>>>>>> Thank you for reading and for your input!
>>>>>>>
>>>>>>>
>>>>>>> 2018-08-21 21:05:02.766969 7f10de377700  0 log_channel(cluster) log
>>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 
>>>>>>> 225 TB
>>>>>>> / 611 TB avail; 298 MB/s rd, 212 MB/s wr, 456 op/s
>>>>>>> 2018-08-21 21:05:03.805124 7f10de377700  0 log_channel(cluster) log
>>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 
>>>>>>> 225 TB
>>>>>>> / 611 TB avail; 296 MB/s rd, 193 MB/s wr, 433 op/s
>>>>>>> 2018-08-21 21:05:03.870690 7f10d83cc700  0
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader) e6 handle_command
>>>>>>> mon_command({"var": "target_max_bytes", "prefix": "osd pool set", 
>>>>>>> "pool":
>>>>>>> "pool-cache-02", "val": "0"} v 0
>>>>>>> ) v1
>>>>>>> 2018-08-21 21:05:03.870749 7f10d83cc700  0 log_channel(audit) log
>>>>>>> [INF] : from='client.146529236 -' entity='client.admin' cmd=[{"var":
>>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-02
>>>>>>> ", "val": "0"}]: dispatch
>>>>>>> 2018-08-21 21:05:03.913663 7f10d83cc700  0
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader) e6 handle_command
>>>>>>> mon_command({"prefix": "osd pool stats", "format": "json"} v 0) v1
>>>>>>> 2018-08-21 21:05:03.913745 7f10d83cc700  0 log_channel(audit) log
>>>>>>> [DBG] : from='client.? 10.25.25.3:0/4101432770'
>>>>>>> entity='client.admin' cmd=[{"prefix": "osd pool stats", "format": 
>>>>>>> "json"}]:
>>>>>>> dispatch
>>>>>>> 2018-08-21 21:05:04.042835 7f10d83cc700  0
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader) e6 handle_command
>>>>>>> mon_command({"var": "target_max_bytes", "prefix": "osd pool set", 
>>>>>>> "pool":
>>>>>>> "pool-cache-01", "val": "0"} v 0
>>>>>>> ) v1
>>>>>>> 2018-08-21 21:05:04.042894 7f10d83cc700  0 log_channel(audit) log
>>>>>>> [INF] : from='client.146565481 -' entity='client.admin' cmd=[{"var":
>>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-01
>>>>>>> ", "val": "0"}]: dispatch
>>>>>>> 2018-08-21 21:05:04.813217 7f10de377700  0 log_channel(cluster) log
>>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 
>>>>>>> 225 TB
>>>>>>> / 611 TB avail; 271 MB/s rd, 181 MB/s wr, 415 op/s
>>>>>>> 2018-08-21 21:05:04.821486 7f10de377700  1
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114693 e114693: 168 total,
>>>>>>> 168 up, 168 in
>>>>>>> 2018-08-21 21:05:04.821711 7f10de377700  0
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114693 crush map has
>>>>>>> features 432365734501548032, adjusting msgr requires
>>>>>>> 2018-08-21 21:05:04.823032 7f10de377700  1
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114693 create_pending
>>>>>>> setting backfillfull_ratio = 0.9
>>>>>>> 2018-08-21 21:05:04.823156 7f10de377700  0 log_channel(audit) log
>>>>>>> [INF] : from='client.146529236 -' entity='client.admin' cmd='[{"var":
>>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-0
>>>>>>> 2", "val": "0"}]': finished
>>>>>>> 2018-08-21 21:05:04.823280 7f10de377700  0 log_channel(audit) log
>>>>>>> [INF] : from='client.146565481 -' entity='client.admin' cmd='[{"var":
>>>>>>> "target_max_bytes", "prefix": "osd pool set", "pool": "pool-cache-0
>>>>>>> 1", "val": "0"}]': finished
>>>>>>> 2018-08-21 21:05:04.825528 7f10de377700  0 log_channel(cluster) log
>>>>>>> [DBG] : osdmap e114693: 168 total, 168 up, 168 in
>>>>>>> 2018-08-21 21:05:04.844932 7f10de377700  0 log_channel(cluster) log
>>>>>>> [INF] : pgmap 6208 pgs: 6208 active+clean; 261 TB data, 385 TB used, 
>>>>>>> 225 TB
>>>>>>> / 611 TB avail; 275 MB/s rd, 192 MB/s wr, 429 op/s
>>>>>>> 2018-08-21 21:05:05.873682 7f10de377700  0 log_channel(cluster) log
>>>>>>> [INF] : pgmap 6208 pgs: 187 peering, 6021 active+clean; 261 TB data, 
>>>>>>> 385 TB
>>>>>>> used, 225 TB / 611 TB avail; 220 MB/s rd, 188 MB/s wr, 383
>>>>>>> op/s
>>>>>>> 2018-08-21 21:05:05.895414 7f10de377700  1
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114694 e114694: 168 total,
>>>>>>> 168 up, 168 in
>>>>>>> 2018-08-21 21:05:05.895721 7f10de377700  0
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114694 crush map has
>>>>>>> features 432365734501548032, adjusting msgr requires
>>>>>>> 2018-08-21 21:05:05.905473 7f10de377700  1
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114694 create_pending
>>>>>>> setting backfillfull_ratio = 0.9
>>>>>>> 2018-08-21 21:05:05.940729 7f10de377700  0 log_channel(cluster) log
>>>>>>> [DBG] : osdmap e114694: 168 total, 168 up, 168 in
>>>>>>> 2018-08-21 21:05:05.947921 7f10de377700  0 log_channel(cluster) log
>>>>>>> [INF] : pgmap 6208 pgs: 204 peering, 6004 active+clean; 261 TB data, 
>>>>>>> 385 TB
>>>>>>> used, 225 TB / 611 TB avail; 201 MB/s rd, 200 MB/s wr, 355
>>>>>>> op/s
>>>>>>> 2018-08-21 21:05:06.912752 7f10de377700  1
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114695 e114695: 168 total,
>>>>>>> 168 up, 168 in
>>>>>>> 2018-08-21 21:05:06.913104 7f10de377700  0
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114695 crush map has
>>>>>>> features 432365734501548032, adjusting msgr requires
>>>>>>> 2018-08-21 21:05:06.928521 7f10de377700  1
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114695 create_pending
>>>>>>> setting backfillfull_ratio = 0.9
>>>>>>> 2018-08-21 21:05:06.968793 7f10de377700  0 log_channel(cluster) log
>>>>>>> [DBG] : osdmap e114695: 168 total, 168 up, 168 in
>>>>>>> 2018-08-21 21:05:07.001112 7f10d93ce700  1 heartbeat_map
>>>>>>> reset_timeout 'Monitor::cpu_tp thread 0x7f10d93ce700' had timed out 
>>>>>>> after 0
>>>>>>> 2018-08-21 21:05:07.004274 7f10de377700  0 log_channel(cluster) log
>>>>>>> [INF] : pgmap 6208 pgs: 12 unknown, 114 remapped+peering, 5 activating, 
>>>>>>> 481
>>>>>>> peering, 5596 active+clean; 261 TB data, 385 TB used, 225 T
>>>>>>> B / 611 TB avail; 177 MB/s rd, 173 MB/s wr, 317 op/s
>>>>>>> 2018-08-21 21:05:07.025200 7f10d8bcd700  1 heartbeat_map
>>>>>>> reset_timeout 'Monitor::cpu_tp thread 0x7f10d8bcd700' had timed out 
>>>>>>> after 0
>>>>>>> 2018-08-21 21:05:07.948138 7f10de377700  1
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114696 e114696: 168 total,
>>>>>>> 168 up, 168 in
>>>>>>> 2018-08-21 21:05:07.948487 7f10de377700  0
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114696 crush map has
>>>>>>> features 432365734501548032, adjusting msgr requires
>>>>>>> 2018-08-21 21:05:07.965671 7f10de377700  1
>>>>>>> mon.clb-cph-frpar2-mon-01@0(leader).osd e114696 create_pending
>>>>>>> setting backfillfull_ratio = 0.9
>>>>>>> 2018-08-21 21:05:08.007585 7f10de377700  0 log_channel(cluster) log
>>>>>>> [DBG] : osdmap e114696: 168 total, 168 up, 168 in
>>>>>>> 2018-08-21 21:05:08.052153 7f10de377700  0 log_channel(cluster) log
>>>>>>> [INF] : pgmap 6208 pgs: 46 activating+remapped, 15 unknown, 305
>>>>>>> remapped+peering, 11 activating, 697 peering, 8 active+remapped+backfil
>>>>>>> ling, 5126 active+clean; 261 TB data, 385 TB used, 225 TB / 611 TB
>>>>>>> avail; 115 MB/s rd, 182 MB/s wr, 292 op/s; 26681/519327262 objects
>>>>>>> misplaced (0.005%); 4856 kB/s, 3 objects/s recovering
>>>>>>>
>>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
