[ceph-users] Re: Tool to cancel pending backfills
On 01.10.21 at 16:52, Josh Baergen wrote:
> Hi Peter,
>
>> When I check for circles I found that running the upmap balancer alone never seems to create any kind of circle in the graph
>
> By a circle, do you mean something like this?
> pg 1.a: 1->2 (upmap to put a chunk on 2 instead of 1)
> pg 1.b: 2->3
> pg 1.c: 3->1

Exactly. The upmap balancer tries to remove upmap entries first, as far as I understand, so I would expect that there will never be a circle like 1->2, 2->1, but I don't see why it would not accidentally create a circle with more nodes involved.

> If so, then it's not surprising that the upmap balancer wouldn't create this situation by itself, since there's no reason for this set of upmaps to exist purely for balance reasons. I don't think the balancer needs any explicit code to avoid the situation because of this.
>
>> Running pgremapper + balancer created circles with sometimes several dozen nodes. I would update the docs of the pgremapper to warn about this fact and guide the users to use undo-upmap to slowly remove the upmaps created by cancel-backfill.
>
> This is again not surprising, since cancel-backfill will do whatever's necessary to undo a set of CRUSH changes (and some CRUSH changes regularly lead to movement cycles like this), and then using the upmap balancer will only make enough changes to achieve balance, not undo everything that's there.
>
>> It might be a nice addition to pgremapper to add an option to optimize the upmap table.
>
> What I'm still missing here is the value in this. Are there demonstrable problems presented by a large upmap exception table (e.g. performance or operational)?

I have no evidence of how expensive upmap table entries are. They need to be synced to every client and therefore have a certain overhead. Maybe someone with more knowledge of the internals can give some insight here. The whole idea of CRUSH is not to carry a table with exact mappings around, and upmap entries are the exact opposite of this idea.

Bottom line: every cyclic upmap definition is overhead that's unnecessary, and I personally think it should be avoided.

Best,
Peter
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Tool to cancel pending backfills
Hi Peter,

> When I check for circles I found that running the upmap balancer alone never seems to create any kind of circle in the graph

By a circle, do you mean something like this?
pg 1.a: 1->2 (upmap to put a chunk on 2 instead of 1)
pg 1.b: 2->3
pg 1.c: 3->1

If so, then it's not surprising that the upmap balancer wouldn't create this situation by itself, since there's no reason for this set of upmaps to exist purely for balance reasons. I don't think the balancer needs any explicit code to avoid the situation because of this.

> Running pgremapper + balancer created circles with sometimes several dozen nodes. I would update the docs of the pgremapper to warn about this fact and guide the users to use undo-upmap to slowly remove the upmaps created by cancel-backfill.

This is again not surprising, since cancel-backfill will do whatever's necessary to undo a set of CRUSH changes (and some CRUSH changes regularly lead to movement cycles like this), and then using the upmap balancer will only make enough changes to achieve balance, not undo everything that's there.

> It might be a nice addition to pgremapper to add an option to optimize the upmap table.

What I'm still missing here is the value in this. Are there demonstrable problems presented by a large upmap exception table (e.g. performance or operational)?

Josh
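To make the point about cycles concrete, here is a minimal sketch that counts the net change in PGs per OSD for a set of upmaps. The dictionary layout (PG id mapped to a list of (from, to) pairs) and the function name `net_pg_change` are assumptions made for illustration, not anything pgremapper or Ceph provides. It only looks at PG counts, not at PG sizes or the data movement needed to remove the entries.

```python
from collections import Counter


def net_pg_change(upmaps):
    """Return the net change in PG count per OSD caused by a set of upmaps.

    `upmaps` is a hypothetical structure: PG id -> list of (from_osd, to_osd).
    """
    delta = Counter()
    for pairs in upmaps.values():
        for src, dst in pairs:
            delta[src] -= 1  # the chunk leaves src
            delta[dst] += 1  # and lands on dst
    # Keep only OSDs whose PG count actually changes.
    return {osd: n for osd, n in delta.items() if n != 0}


# The three-PG circle from the message above.
cycle = {"1.a": [(1, 2)], "1.b": [(2, 3)], "1.c": [(3, 1)]}
print(net_pg_change(cycle))  # {}
```

In PG-count terms the circle changes nothing, which is why the balancer has no reason to create one on its own, and also why such a set adds table entries without improving balance.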
[ceph-users] Re: Tool to cancel pending backfills
On 27.09.21 at 22:38, Josh Baergen wrote:
>> I have a question regarding the last step. It seems to me that the Ceph balancer is not able to remove the upmaps created by pgremapper, but instead creates new upmaps to balance the PGs among OSDs.
>
> The balancer will prefer to remove existing upmaps[1], but it's not guaranteed. The upmap has to exist between the source and target OSD already decided on by the balancer in order for this to happen. The reality, though, is that the upmap balancer will need to create many upmap exception table entries to balance any sizable system.
>
> Like you, I would prefer to have as few upmap exception table entries as possible in a system (fewer surprises when an OSD fails), but we regularly run systems that have thousands of entries without any discernible impact and haven't had any major operational issues that result from it, except for on really old systems that are just awful to work with in the first place.

Hi Josh,

thanks for the update. With pgremapper run first and then the upmap balancer enabled, I ended up with an upmap entry for almost every placement group in a pool. So I decided to write a tool which analyzes the upmap entries and puts them in a digraph.

When I check for circles I found that running the upmap balancer alone never seems to create any kind of circle in the graph (I will have to check the code you mentioned [1] to see if there is any mechanism to avoid that).

Running pgremapper + balancer created circles with sometimes several dozen nodes. I would update the docs of the pgremapper to warn about this fact and guide the users to use undo-upmap to slowly remove the upmaps created by cancel-backfill.

It might be a nice addition to pgremapper to add an option to optimize the upmap table. When searching for circles you might want to limit the depth of the DFS; otherwise the runtime will be crazy.

Thanks,
Peter
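The analysis described in this message (upmap entries as a digraph, circles found by a depth-limited DFS) could look roughly like the sketch below. It is not Peter's tool: the input layout (PG id mapped to a list of (from, to) OSD pairs), the function names and the default `max_depth` are all assumptions for illustration, and reading the upmap table from a live cluster is left out.

```python
from collections import defaultdict


def build_graph(upmaps):
    """Build a digraph with one edge from_osd -> to_osd per upmap pair."""
    graph = defaultdict(set)
    for pairs in upmaps.values():
        for src, dst in pairs:
            graph[src].add(dst)
    return graph


def find_cycles(graph, max_depth=6):
    """Return simple cycles that involve at most `max_depth` OSDs.

    Each cycle is reported once, as a path starting at its smallest OSD id.
    """
    cycles = []

    def dfs(start, node, path):
        for nxt in sorted(graph.get(node, ())):
            if nxt == start:
                cycles.append(path[:])
            # Only walk through OSDs larger than `start` so that a cycle is
            # not found again from each of its members, and stop descending
            # once the path has reached the depth limit.
            elif nxt > start and nxt not in path and len(path) < max_depth:
                path.append(nxt)
                dfs(start, nxt, path)
                path.pop()

    for start in sorted(graph):
        dfs(start, start, [start])
    return cycles


upmaps = {"1.a": [(1, 2)], "1.b": [(2, 3)], "1.c": [(3, 1)], "1.d": [(4, 5)]}
print(find_cycles(build_graph(upmaps)))  # [[1, 2, 3]]
```

The depth limit bounds both the recursion and the number of paths explored from each OSD, at the cost of missing circles longer than `max_depth`.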
[ceph-users] Re: Tool to cancel pending backfills
> I have a question regarding the last step. It seems to me that the Ceph balancer is not able to remove the upmaps created by pgremapper, but instead creates new upmaps to balance the PGs among OSDs.

The balancer will prefer to remove existing upmaps[1], but it's not guaranteed. The upmap has to exist between the source and target OSD already decided on by the balancer in order for this to happen. The reality, though, is that the upmap balancer will need to create many upmap exception table entries to balance any sizable system.

Like you, I would prefer to have as few upmap exception table entries as possible in a system (fewer surprises when an OSD fails), but we regularly run systems that have thousands of entries without any discernible impact and haven't had any major operational issues that result from it, except for on really old systems that are just awful to work with in the first place.

Josh

[1] I think this is the implementation: https://github.com/ceph/ceph/blob/bc8c846b36288ff7ac65005087b0dda0e4b857f4/src/osd/OSDMap.cc#L4794-L4832
[ceph-users] Re: Tool to cancel pending backfills
On 26.09.21 at 19:08, Alexandre Marangone wrote:
> Thanks for the feedback Alex! If you have any issue or ideas for improvements please do submit them on the GH repo: https://github.com/digitalocean/pgremapper/
>
> Last Thursday I did a Ceph at DO tech talk where I talked about how we use pgremapper to do augments on HDD clusters. The recording is not available yet but the gist is:
> - set nobackfill/norebalance
> - create all your OSDs -> PGs are in backfill state but no data movement
> - cancel all backfills with pgremapper -> PGs are back to active+clean
> - unset nobackfill/norebalance -> nothing happens
> - turn on ceph balancer or use pgremapper undo-upmap to do your augment in a controlled way

Thanks everyone for the pointer. Great tool!

I have a question regarding the last step. It seems to me that the Ceph balancer is not able to remove the upmaps created by pgremapper, but instead creates new upmaps to balance the PGs among OSDs. Is this expected?

The ideal way would be to try to remove (undo) upmaps to achieve balance rather than creating more and more upmaps. As it is, we might end up with an upmap for one PG from OSD A to OSD B and an upmap for another PG from OSD B to OSD A, whereas it would be enough to have no upmap at all.

Thanks,
Peter
[ceph-users] Re: Tool to cancel pending backfills
Thanks for the feedback Alex! If you have any issue or ideas for improvements please do submit them on the GH repo: https://github.com/digitalocean/pgremapper/

Last Thursday I did a Ceph at DO tech talk where I talked about how we use pgremapper to do augments on HDD clusters. The recording is not available yet but the gist is:
- set nobackfill/norebalance
- create all your OSDs -> PGs are in backfill state but no data movement
- cancel all backfills with pgremapper -> PGs are back to active+clean
- unset nobackfill/norebalance -> nothing happens
- turn on ceph balancer or use pgremapper undo-upmap to do your augment in a controlled way

Our main motivation to do it this way is that on HDD clusters flapping is a fact of life and creates recovery PGs that are blocked by the backfill reservations from the augment. As more flapping occurs, the number of degraded objects increases, which is always uncomfortable. Doing an augment this way allows us to have N backfills at a time, wait for completion -> let recovery happen -> undo N more upmaps -> etc., which dramatically lowers the amount of time a cluster is degraded.

On Sat, Sep 25, 2021 at 10:15 AM Alex Gorbachev wrote:
>
> Hi Ceph community,
>
> I think this is so important operationally that it bears repeating (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GJ35EL73A4LV6NPA74M6H6IN7BXMMHYA/)
>
> Digital Ocean has released the pgremapper tool, with which one can cancel pending backfills (in case bad decisions were made by the balancer or other tools). In my case this was a necessity to reweight many OSDs back to 1. This tool saved many days of waiting for an unneeded rebalance.
>
> I found the tool at https://golangrepo.com/repo/digitalocean-pgremapper
> --
> Alex Gorbachev
> https://alextelescope.blogspot.com
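The augment steps listed in this message can be written down as an ordered dry-run plan. The sketch below only prints the commands and runs nothing. The `ceph osd set/unset` and `ceph balancer on` invocations are standard Ceph CLI; the pgremapper line deliberately omits all flags, which is an assumption of this sketch, so check the pgremapper README for the options your version needs. `AUGMENT_STEPS` and `render_plan` are hypothetical names.

```python
import shlex

# Each step is (description, argv); argv is None for steps done by hand.
AUGMENT_STEPS = [
    ("pause backfill", ["ceph", "osd", "set", "nobackfill"]),
    ("pause rebalancing", ["ceph", "osd", "set", "norebalance"]),
    ("create the new OSDs (site-specific, done by hand)", None),
    # Flags intentionally omitted (assumption): see the pgremapper README.
    ("cancel the pending backfills", ["pgremapper", "cancel-backfill"]),
    ("allow backfill again", ["ceph", "osd", "unset", "nobackfill"]),
    ("allow rebalancing again", ["ceph", "osd", "unset", "norebalance"]),
    # Alternative to the balancer: undo the upmaps in batches with pgremapper.
    ("move data gradually with the balancer", ["ceph", "balancer", "on"]),
]


def render_plan(steps):
    """Render the steps as a commented, shell-quoted command listing."""
    lines = []
    for description, argv in steps:
        lines.append(f"# {description}")
        if argv is not None:
            lines.append(shlex.join(argv))
    return "\n".join(lines)


print(render_plan(AUGMENT_STEPS))
```

Keeping the plan as data makes it easy to review the order before anything touches the cluster; the last step is where the N-at-a-time pacing described above would be applied.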