Re: [ceph-users] extending ceph cluster with osds close to near full ratio (85%)
Hi Brian,

On 14 February 2017 at 19:33, Brian Andrus wrote:
> On Tue, Feb 14, 2017 at 5:27 AM, Tyanko Aleksiev wrote:
>> Hi Cephers,
>>
>> At University of Zurich we are using Ceph as a storage back-end for our
>> OpenStack installation. Since we recently reached 70% occupancy (mostly
>> caused by the cinder pool, served by 16384 PGs) we are in the phase of
>> extending the cluster with additional storage nodes of the same type
>> (except for a slightly more powerful CPU).
>>
>> We decided to opt for a gradual OSD deployment: we created a temporary
>> "root" bucket called "fresh-install" containing the newly installed
>> nodes and then we moved OSDs from this bucket to the current production
>> root via:
>>
>> ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}
>>
>> Everything seemed nicely planned, but when we started adding a few new
>> OSDs to the cluster, thus triggering a rebalance, one of the OSDs,
>> already at 84% disk use, passed the 85% threshold. This in turn
>> triggered the "near full osd(s)" warning, and more than 20 PGs
>> previously in "wait_backfill" state were marked
>> "wait_backfill+backfill_toofull".
>> Since the OSD kept growing until it reached 90% disk use, we decided to
>> reduce its relative weight from 1 to 0.95.
>> That recalculated the crushmap and remapped a few PGs, but did not
>> appear to move any data off the almost full OSD. Only when, by steps of
>> 0.05, we reached a relative weight of 0.50 was data moved and some
>> "backfill_toofull" requests released. However, we had to go down almost
>> to a relative weight of 0.10 to trigger some additional data movement
>> and have the backfilling finally finish.
>>
>> We are now adding new OSDs, but the problem is constantly triggered
>> since we have multiple OSDs above 83% that start growing during the
>> rebalance.
>>
>> My questions are:
>>
>> - Is there something wrong in our process of adding new OSDs (some
>>   additional details below)?
>
> It could work, but it could also be more disruptive than it needs to be.
> We have a similar situation/configuration, and what we do is start OSDs
> with `osd crush initial weight = 0` as well as "crush_osd_location" set
> properly. This weights the OSDs at 0 and lets us bring them in in a
> controlled fashion. We bring them in to 1 (no disruption), then crush
> weight in gradually.

We are currently trying out this type of gradual insertion. Thanks!

>> - We also noticed that the problem has the tendency to cluster around
>>   the newly added OSDs, so could those two things be correlated?
>
> I'm not sure which problem you are referring to - the OSDs filling?
> Possibly due to temporary files or some other mechanism I'm not familiar
> with adding a little extra data on top.

>> - Why does reweighting not trigger instant data movement? What's the
>>   logic behind remapped PGs? Is there some sort of flat queue of tasks,
>>   or does it have priorities defined?
>
> It should; perhaps you aren't choosing large enough increments, or
> perhaps you have some settings set.

Indeed, with sufficiently large increments it triggers some instant PG
rebalancing.

>> - Did somebody experience this situation, and if so, how was it
>>   solved/bypassed?
>
> FWIW, we also run a rebalance cronjob every hour with the following:
>
> `ceph osd reweight-by-utilization 103 .010 10`
>
> It was detailed in another recent thread on [ceph-users].

Already running that, but on a daily basis.

>> Cluster details are as follows:
>>
>> - version: 0.94.9
>> - 5 monitors,
>> - 40 storage hosts with an overall of 24 x 4TB disks: 1 OSD/disk
>>   (960 OSDs in total),
>> - osd pool default size = 3,
>> - journaling is on SSDs.
>>
>> We have "host" failure domain. Relevant crushmap details:
>>
>> # rules
>> rule sas {
>>     ruleset 1
>>     type replicated
>>     min_size 1
>>     max_size 10
>>     step take sas
>>     step chooseleaf firstn 0 type host
>>     step emit
>> }
>>
>> root sas {
>>     id -41    # do not change unnecessarily
>>     # weight 3283.279
>>     alg straw
>>     hash 0    # rjenkins1
>>     item osd-l2-16 weight 87.360
>>     item osd-l4-06 weight 87.360
>>     ...
>>     item osd-k7-41 weight 14.560
>>     item osd-l4-36 weight 14.560
>>     item osd-k5-36 weight 14.560
>> }
>>
>> host osd-k7-21 {
>>     id -46    # do not change unnecessarily
>>     # weight 87.360
>>     alg straw
>>     hash 0    # rjenkins1
>>     item osd.281 weight 3.640
>>     item osd.282 weight 3.640
>>     item osd.285 weight 3.640
>>     ...
>> }
>>
>> host osd-k7-41 {
>>     id -50    # do not change unnecessarily
>>     # weight 14.560
>>     alg straw
>>     hash 0    # rjenkins1
>>     item osd.900 weight 3.640
>>     item osd.901 weight 3.640
>>     item osd.902 weight 3.640
>>     ...
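
[Editor's note] The stepwise reweighting described above (dropping an OSD's relative weight in 0.05 decrements until backfill finally frees the near-full OSD) can be scripted as a dry run along these lines. This is only a sketch: the OSD id and bounds are hypothetical, and the `ceph osd reweight` commands are printed for review rather than executed against a cluster.

```shell
#!/bin/sh
# Print a sequence of "ceph osd reweight" commands that lower an OSD's
# relative (override) weight in 0.05 steps, e.g. 0.95 down to 0.50.
# Weights are passed in hundredths to avoid floating-point shell math.
gen_reweight_steps() {
    # args: osd_id start_hundredths stop_hundredths (e.g. 42 95 50)
    osd_id=$1; w=$2; stop=$3
    while [ "$w" -ge "$stop" ]; do
        printf 'ceph osd reweight %s 0.%02d\n' "$osd_id" "$w"
        w=$((w - 5))
    done
}

# Dry run for a hypothetical osd.42: review the output, then pipe to sh.
gen_reweight_steps 42 95 50
```

Each printed step can then be applied one at a time, waiting for backfill to settle before the next decrement.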
Re: [ceph-users] extending ceph cluster with osds close to near full ratio (85%)
On Tue, Feb 14, 2017 at 5:27 AM, Tyanko Aleksiev wrote:
> Hi Cephers,
>
> At University of Zurich we are using Ceph as a storage back-end for our
> OpenStack installation. Since we recently reached 70% occupancy (mostly
> caused by the cinder pool, served by 16384 PGs) we are in the phase of
> extending the cluster with additional storage nodes of the same type
> (except for a slightly more powerful CPU).
>
> We decided to opt for a gradual OSD deployment: we created a temporary
> "root" bucket called "fresh-install" containing the newly installed
> nodes and then we moved OSDs from this bucket to the current production
> root via:
>
> ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}
>
> Everything seemed nicely planned, but when we started adding a few new
> OSDs to the cluster, thus triggering a rebalance, one of the OSDs,
> already at 84% disk use, passed the 85% threshold. This in turn
> triggered the "near full osd(s)" warning, and more than 20 PGs
> previously in "wait_backfill" state were marked
> "wait_backfill+backfill_toofull".
> Since the OSD kept growing until it reached 90% disk use, we decided to
> reduce its relative weight from 1 to 0.95.
> That recalculated the crushmap and remapped a few PGs, but did not
> appear to move any data off the almost full OSD. Only when, by steps of
> 0.05, we reached a relative weight of 0.50 was data moved and some
> "backfill_toofull" requests released. However, we had to go down almost
> to a relative weight of 0.10 to trigger some additional data movement
> and have the backfilling finally finish.
>
> We are now adding new OSDs, but the problem is constantly triggered
> since we have multiple OSDs above 83% that start growing during the
> rebalance.
>
> My questions are:
>
> - Is there something wrong in our process of adding new OSDs (some
>   additional details below)?

It could work, but it could also be more disruptive than it needs to be.
We have a similar situation/configuration, and what we do is start OSDs
with `osd crush initial weight = 0` as well as "crush_osd_location" set
properly. This weights the OSDs at 0 and lets us bring them in in a
controlled fashion. We bring them in to 1 (no disruption), then crush
weight in gradually.

> - We also noticed that the problem has the tendency to cluster around
>   the newly added OSDs, so could those two things be correlated?

I'm not sure which problem you are referring to - the OSDs filling?
Possibly due to temporary files or some other mechanism I'm not familiar
with adding a little extra data on top.

> - Why does reweighting not trigger instant data movement? What's the
>   logic behind remapped PGs? Is there some sort of flat queue of tasks,
>   or does it have priorities defined?

It should; perhaps you aren't choosing large enough increments, or
perhaps you have some settings set.

> - Did somebody experience this situation, and if so, how was it
>   solved/bypassed?

FWIW, we also run a rebalance cronjob every hour with the following:

`ceph osd reweight-by-utilization 103 .010 10`

It was detailed in another recent thread on [ceph-users].

> Cluster details are as follows:
>
> - version: 0.94.9
> - 5 monitors,
> - 40 storage hosts with an overall of 24 x 4TB disks: 1 OSD/disk
>   (960 OSDs in total),
> - osd pool default size = 3,
> - journaling is on SSDs.
>
> We have "host" failure domain. Relevant crushmap details:
>
> # rules
> rule sas {
>     ruleset 1
>     type replicated
>     min_size 1
>     max_size 10
>     step take sas
>     step chooseleaf firstn 0 type host
>     step emit
> }
>
> root sas {
>     id -41    # do not change unnecessarily
>     # weight 3283.279
>     alg straw
>     hash 0    # rjenkins1
>     item osd-l2-16 weight 87.360
>     item osd-l4-06 weight 87.360
>     ...
>     item osd-k7-41 weight 14.560
>     item osd-l4-36 weight 14.560
>     item osd-k5-36 weight 14.560
> }
>
> host osd-k7-21 {
>     id -46    # do not change unnecessarily
>     # weight 87.360
>     alg straw
>     hash 0    # rjenkins1
>     item osd.281 weight 3.640
>     item osd.282 weight 3.640
>     item osd.285 weight 3.640
>     ...
> }
>
> host osd-k7-41 {
>     id -50    # do not change unnecessarily
>     # weight 14.560
>     alg straw
>     hash 0    # rjenkins1
>     item osd.900 weight 3.640
>     item osd.901 weight 3.640
>     item osd.902 weight 3.640
>     item osd.903 weight 3.640
> }
>
> As mentioned before, we created a temporary bucket called
> "fresh-install" containing the newly installed nodes, i.e.:
>
> root fresh-install {
>     id -34    # do not change unnecessarily
>     # weight 218.400
>     alg straw
>     hash 0    # rjenkins1
>     item osd-k5-36-fresh weight 72.800
>     item osd-k7-41-fresh weight 72.800
>     item osd-l4-36-fresh weight 72.800
> }
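
[Editor's note] The zero-weight bring-in described in this reply might look roughly like the following sketch. The `[osd]` config fragment (shown as a comment) would go in ceph.conf on the new hosts before their OSD daemons first start; the OSD name, target weight, and step size below are hypothetical, and the `ceph osd crush reweight` commands are only printed (dry run), not executed.

```shell
#!/bin/sh
# ceph.conf fragment so new OSDs join the CRUSH map at weight 0:
#
#   [osd]
#   osd crush initial weight = 0
#
# Then raise each new OSD's CRUSH weight toward its full size (3.640 for
# a 4 TB disk in this cluster) in small steps. Weights are handled in
# thousandths to avoid floating-point shell math.
gen_crush_steps() {
    # args: osd_name target_thousandths step_thousandths (e.g. osd.900 3640 500)
    osd_name=$1; target=$2; step=$3
    w=$step
    while [ "$w" -lt "$target" ]; do
        printf 'ceph osd crush reweight %s %d.%03d\n' \
            "$osd_name" $((w / 1000)) $((w % 1000))
        w=$((w + step))
    done
    # Final step lands exactly on the target weight.
    printf 'ceph osd crush reweight %s %d.%03d\n' \
        "$osd_name" $((target / 1000)) $((target % 1000))
}

# Dry run: bring a hypothetical osd.900 from 0 to 3.640 in 0.500 steps.
gen_crush_steps osd.900 3640 500
```

Between steps one would wait for the cluster to return to HEALTH_OK (or at least for backfill to drain) before applying the next increment.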
[ceph-users] extending ceph cluster with osds close to near full ratio (85%)
Hi Cephers,

At University of Zurich we are using Ceph as a storage back-end for our
OpenStack installation. Since we recently reached 70% occupancy (mostly
caused by the cinder pool, served by 16384 PGs) we are in the phase of
extending the cluster with additional storage nodes of the same type
(except for a slightly more powerful CPU).

We decided to opt for a gradual OSD deployment: we created a temporary
"root" bucket called "fresh-install" containing the newly installed nodes
and then we moved OSDs from this bucket to the current production root
via:

ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}

Everything seemed nicely planned, but when we started adding a few new
OSDs to the cluster, thus triggering a rebalance, one of the OSDs,
already at 84% disk use, passed the 85% threshold. This in turn triggered
the "near full osd(s)" warning, and more than 20 PGs previously in
"wait_backfill" state were marked "wait_backfill+backfill_toofull".
Since the OSD kept growing until it reached 90% disk use, we decided to
reduce its relative weight from 1 to 0.95.
That recalculated the crushmap and remapped a few PGs, but did not appear
to move any data off the almost full OSD. Only when, by steps of 0.05, we
reached a relative weight of 0.50 was data moved and some
"backfill_toofull" requests released. However, we had to go down almost
to a relative weight of 0.10 to trigger some additional data movement and
have the backfilling finally finish.

We are now adding new OSDs, but the problem is constantly triggered since
we have multiple OSDs above 83% that start growing during the rebalance.

My questions are:

- Is there something wrong in our process of adding new OSDs (some
  additional details below)?
- We also noticed that the problem has the tendency to cluster around the
  newly added OSDs, so could those two things be correlated?
- Why does reweighting not trigger instant data movement? What's the
  logic behind remapped PGs? Is there some sort of flat queue of tasks,
  or does it have priorities defined?
- Did somebody experience this situation, and if so, how was it
  solved/bypassed?

Cluster details are as follows:

- version: 0.94.9
- 5 monitors,
- 40 storage hosts with an overall of 24 x 4TB disks: 1 OSD/disk
  (960 OSDs in total),
- osd pool default size = 3,
- journaling is on SSDs.

We have "host" failure domain. Relevant crushmap details:

# rules
rule sas {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take sas
    step chooseleaf firstn 0 type host
    step emit
}

root sas {
    id -41    # do not change unnecessarily
    # weight 3283.279
    alg straw
    hash 0    # rjenkins1
    item osd-l2-16 weight 87.360
    item osd-l4-06 weight 87.360
    ...
    item osd-k7-41 weight 14.560
    item osd-l4-36 weight 14.560
    item osd-k5-36 weight 14.560
}

host osd-k7-21 {
    id -46    # do not change unnecessarily
    # weight 87.360
    alg straw
    hash 0    # rjenkins1
    item osd.281 weight 3.640
    item osd.282 weight 3.640
    item osd.285 weight 3.640
    ...
}

host osd-k7-41 {
    id -50    # do not change unnecessarily
    # weight 14.560
    alg straw
    hash 0    # rjenkins1
    item osd.900 weight 3.640
    item osd.901 weight 3.640
    item osd.902 weight 3.640
    item osd.903 weight 3.640
}

As mentioned before, we created a temporary bucket called "fresh-install"
containing the newly installed nodes, i.e.:

root fresh-install {
    id -34    # do not change unnecessarily
    # weight 218.400
    alg straw
    hash 0    # rjenkins1
    item osd-k5-36-fresh weight 72.800
    item osd-k7-41-fresh weight 72.800
    item osd-l4-36-fresh weight 72.800
}

Then, by steps of 6 OSDs (2 OSDs from each new host), we move OSDs from
the "fresh-install" to the "sas" bucket.

Thank you in advance for all the suggestions.

Cheers,
Tyanko

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
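
[Editor's note] The batch move described in the original post (2 OSDs from each new host per step, out of "fresh-install" into the production "sas" root at full weight) could be scripted roughly as below. Only osd.900 and osd.901 appear in the crushmap excerpt; the other OSD ids are hypothetical, and the `ceph osd crush set` commands are printed for review (dry run) rather than executed.

```shell
#!/bin/sh
# Print the "ceph osd crush set" commands for one batch: 2 OSDs from each
# of three new hosts, moved into the production root "sas" at full weight.
# Review the output, then pipe it to sh once it looks right.
WEIGHT=3.640   # full CRUSH weight of one 4 TB OSD in this cluster
ROOT=sas       # production root bucket

move_batch() {
    # args: host osd_id osd_id ...
    host=$1; shift
    for id in "$@"; do
        printf 'ceph osd crush set osd.%s %s host=%s root=%s\n' \
            "$id" "$WEIGHT" "$host" "$ROOT"
    done
}

# One batch of 6: osd ids other than 900/901 are made up for illustration.
move_batch osd-k5-36 860 861
move_batch osd-k7-41 900 901
move_batch osd-l4-36 920 921
```

Given the near-full OSDs in this thread, a gentler variant would set a small initial weight here and then ramp it up stepwise, as suggested elsewhere in the thread.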