Re: [ceph-users] extending ceph cluster with osds close to near full ratio (85%)

2017-02-20 Thread Tyanko Aleksiev
Hi Brian,


On 14 February 2017 at 19:33, Brian Andrus wrote:

>
>
> On Tue, Feb 14, 2017 at 5:27 AM, Tyanko Aleksiev wrote:
>
>> Hi Cephers,
>>
>> At University of Zurich we are using Ceph as a storage back-end for our
>> OpenStack installation. Since we recently reached 70% occupancy
>> (mostly caused by the cinder pool served by 16384 PGs) we are in the
>> phase of extending the cluster with additional storage nodes of the same
>> type (except for a slightly more powerful CPU).
>>
>> We decided to opt for a gradual OSD deployment: we created a temporary
>> "root" bucket called "fresh-install" containing the newly installed nodes
>> and then we moved OSDs from this bucket to the current production root via:
>>
>> ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}
>>
>> Everything seemed nicely planned, but when we started adding a few new
>> OSDs to the cluster, thus triggering a rebalance, one of the OSDs,
>> already at 84% disk use, passed the 85% threshold. This in turn
>> triggered the "near full osd(s)" warning and more than 20 PGs previously
>> in the "wait_backfill" state were marked as "wait_backfill+backfill_toofull".
>> Since the OSD kept growing until it reached 90% disk use, we decided to
>> reduce its relative weight from 1 to 0.95.
>> That action recalculated the crushmap and remapped a few PGs, but did
>> not appear to move any data off the almost full OSD. Only when, in steps
>> of 0.05, we reached a relative weight of 0.50 did data start to move and
>> some "backfill_toofull" requests get released. However, we had to go down
>> almost to a relative weight of 0.10 in order to trigger some additional
>> data movement and have the backfilling process finally finish.
>>
>> We are now adding new OSDs, but the problem is constantly triggered since
>> we have multiple OSDs > 83% that start growing during the rebalance.
>>
>> My questions are:
>>
>> - Is there something wrong with our process of adding new OSDs (some
>> additional details below)?
>>
>>
> It could work, but it could also be more disruptive than it needs to be. We
> have a similar situation/configuration, and what we do is start OSDs with
> `osd crush initial weight = 0` as well as "crush_osd_location" set properly.
> This brings the OSDs in at a CRUSH weight of 0 and lets us bring them in in a
> controlled fashion. We bring their reweight up to 1 (no disruption), then
> increase the CRUSH weight gradually.
>

We are currently trying out this type of gradual insertion. Thanks!


>
>
>> - We also noticed that the problem tends to cluster around the newly
>> added OSDs, so could those two things be correlated?
>>
> I'm not sure which problem you are referring to - the OSDs filling up?
> Possibly it's temporary files, or some other mechanism I'm not familiar with,
> adding a little extra data on top.
>
>> - Why does reweighting not trigger instant data movement? What's the logic
>> behind remapped PGs? Is there some sort of flat queue of tasks, or does
>> it have some priorities defined?
>>
>>
> It should; perhaps you aren't choosing large enough increments, or perhaps
> you have some settings set that prevent it.
>

Indeed, with sufficiently large increments it triggers some instant pg
rebalancing.
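
For reference, the two weights being juggled here look roughly like this
(the values below are purely illustrative, not taken from our cluster):

# override/relative weight (0.0-1.0): only shifts data away from this one OSD
ceph osd reweight osd.{id} 0.85
# CRUSH weight (roughly the disk size in TB): changing it redistributes data
# across the whole subtree
ceph osd crush reweight osd.{id} 3.640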


>
>
>> - Has anybody experienced this situation and, if so, how was it
>> solved/bypassed?
>>
>>
> FWIW, we also run a rebalance cronjob every hour with the following:
>
> `ceph osd reweight-by-utilization 103 .010 10`
>

Already running that but on a daily basis.


>
> It was detailed in another recent thread on [ceph-users].
>
>
>> Cluster details are as follows:
>>
>> - version: 0.94.9
>> - 5 monitors,
>> - 40 storage hosts with 24 x 4 TB disks each: 1 OSD/disk (960 OSDs
>> in total),
>> - osd pool default size = 3,
>> - journaling is on SSDs.
>>
>> We have a "host" failure domain. Relevant crushmap details:
>>
>> # rules
>> rule sas {
>> ruleset 1
>> type replicated
>> min_size 1
>> max_size 10
>> step take sas
>> step chooseleaf firstn 0 type host
>> step emit
>> }
>>
>> root sas {
>> id -41  # do not change unnecessarily
>> # weight 3283.279
>> alg straw
>> hash 0  # rjenkins1
>> item osd-l2-16 weight 87.360
>> item osd-l4-06 weight 87.360
>> ...
>> item osd-k7-41 weight 14.560
>> item osd-l4-36 weight 14.560
>> item osd-k5-36 weight 14.560
>> }
>>
>> host osd-k7-21 {
>> id -46  # do not change unnecessarily
>> # weight 87.360
>> alg straw
>> hash 0  # rjenkins1
>> item osd.281 weight 3.640
>> item osd.282 weight 3.640
>> item osd.285 weight 3.640
>> ...
>> }
>>
>> host osd-k7-41 {
>> id -50  # do not change unnecessarily
>> # weight 14.560
>> alg straw
>> hash 0  # rjenkins1
>> item osd.900 weight 3.640
>> item osd.901 weight 3.640
>> item osd.902 w

Re: [ceph-users] extending ceph cluster with osds close to near full ratio (85%)

2017-02-14 Thread Brian Andrus
On Tue, Feb 14, 2017 at 5:27 AM, Tyanko Aleksiev wrote:

> Hi Cephers,
>
> At University of Zurich we are using Ceph as a storage back-end for our
> OpenStack installation. Since we recently reached 70% occupancy
> (mostly caused by the cinder pool served by 16384 PGs) we are in the
> phase of extending the cluster with additional storage nodes of the same
> type (except for a slightly more powerful CPU).
>
> We decided to opt for a gradual OSD deployment: we created a temporary
> "root" bucket called "fresh-install" containing the newly installed nodes
> and then we moved OSDs from this bucket to the current production root via:
>
> ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}
>
> Everything seemed nicely planned, but when we started adding a few new
> OSDs to the cluster, thus triggering a rebalance, one of the OSDs,
> already at 84% disk use, passed the 85% threshold. This in turn
> triggered the "near full osd(s)" warning and more than 20 PGs previously
> in the "wait_backfill" state were marked as "wait_backfill+backfill_toofull".
> Since the OSD kept growing until it reached 90% disk use, we decided to
> reduce its relative weight from 1 to 0.95.
> That action recalculated the crushmap and remapped a few PGs, but did
> not appear to move any data off the almost full OSD. Only when, in steps
> of 0.05, we reached a relative weight of 0.50 did data start to move and
> some "backfill_toofull" requests get released. However, we had to go down
> almost to a relative weight of 0.10 in order to trigger some additional
> data movement and have the backfilling process finally finish.
>
> We are now adding new OSDs, but the problem is constantly triggered since
> we have multiple OSDs > 83% that start growing during the rebalance.
>
> My questions are:
>
> - Is there something wrong with our process of adding new OSDs (some
> additional details below)?
>
>
It could work, but it could also be more disruptive than it needs to be. We
have a similar situation/configuration, and what we do is start OSDs with
`osd crush initial weight = 0` as well as "crush_osd_location" set properly.
This brings the OSDs in at a CRUSH weight of 0 and lets us bring them in in a
controlled fashion. We bring their reweight up to 1 (no disruption), then
increase the CRUSH weight gradually.
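
Roughly, the sequence looks like this (a minimal sketch; the OSD id, the
ceph.conf placement and the 0.5 increments are only illustrative, not our
exact procedure):

# ceph.conf on the new OSD hosts, so freshly created OSDs come up with a
# CRUSH weight of 0
[osd]
osd crush initial weight = 0

# make sure the OSD is in with an override weight of 1 (usually already the
# default for a new OSD); nothing moves because its CRUSH weight is still 0
ceph osd reweight osd.960 1
# then raise the CRUSH weight in small steps until it reaches the disk's
# full weight
ceph osd crush reweight osd.960 0.5
ceph osd crush reweight osd.960 1.0
...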


> - We also noticed that the problem tends to cluster around the newly
> added OSDs, so could those two things be correlated?
>
I'm not sure which problem you are referring to - the OSDs filling up?
Possibly it's temporary files, or some other mechanism I'm not familiar with,
adding a little extra data on top.

> - Why does reweighting not trigger instant data movement? What's the logic
> behind remapped PGs? Is there some sort of flat queue of tasks, or does
> it have some priorities defined?
>
>
It should; perhaps you aren't choosing large enough increments, or perhaps
you have some settings set that prevent it.


> - Has anybody experienced this situation and, if so, how was it
> solved/bypassed?
>
>
FWIW, we also run a rebalance cronjob every hour with the following:

`ceph osd reweight-by-utilization 103 .010 10`
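
For completeness, the crontab entry is essentially just the line below
(schedule and path are illustrative; the three arguments are roughly the
utilization threshold in percent, the maximum weight change per OSD, and
the maximum number of OSDs adjusted per run):

# /etc/crontab - gentle hourly rebalance
0 * * * * root /usr/bin/ceph osd reweight-by-utilization 103 .010 10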

It was detailed in another recent thread on [ceph-users].


> Cluster details are as follows:
>
> - version: 0.94.9
> - 5 monitors,
> - 40 storage hosts with 24 x 4 TB disks each: 1 OSD/disk (960 OSDs
> in total),
> - osd pool default size = 3,
> - journaling is on SSDs.
>
> We have a "host" failure domain. Relevant crushmap details:
>
> # rules
> rule sas {
> ruleset 1
> type replicated
> min_size 1
> max_size 10
> step take sas
> step chooseleaf firstn 0 type host
> step emit
> }
>
> root sas {
> id -41  # do not change unnecessarily
> # weight 3283.279
> alg straw
> hash 0  # rjenkins1
> item osd-l2-16 weight 87.360
> item osd-l4-06 weight 87.360
> ...
> item osd-k7-41 weight 14.560
> item osd-l4-36 weight 14.560
> item osd-k5-36 weight 14.560
> }
>
> host osd-k7-21 {
> id -46  # do not change unnecessarily
> # weight 87.360
> alg straw
> hash 0  # rjenkins1
> item osd.281 weight 3.640
> item osd.282 weight 3.640
> item osd.285 weight 3.640
> ...
> }
>
> host osd-k7-41 {
> id -50  # do not change unnecessarily
> # weight 14.560
> alg straw
> hash 0  # rjenkins1
> item osd.900 weight 3.640
> item osd.901 weight 3.640
> item osd.902 weight 3.640
> item osd.903 weight 3.640
> }
>
>
> As mentioned before we created a temporary bucket called "fresh-install"
> containing the newly installed nodes (i.e.):
>
> root fresh-install {
> id -34  # do not change unnecessarily
> # weight 218.400
> alg straw
> hash 0  # rjenkins1
> item osd-k5-36-fresh weight 72.800
> item osd-k7-41-fresh weight 72.800
>   

[ceph-users] extending ceph cluster with osds close to near full ratio (85%)

2017-02-14 Thread Tyanko Aleksiev
Hi Cephers,

At University of Zurich we are using Ceph as a storage back-end for our
OpenStack installation. Since we recently reached 70% occupancy
(mostly caused by the cinder pool served by 16384 PGs) we are in the
phase of extending the cluster with additional storage nodes of the same
type (except for a slightly more powerful CPU).

We decided to opt for a gradual OSD deployment: we created a temporary
"root" bucket called "fresh-install" containing the newly installed nodes
and then we moved OSDs from this bucket to the current production root via:

ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}

Everything seemed nicely planned, but when we started adding a few new
OSDs to the cluster, thus triggering a rebalance, one of the OSDs,
already at 84% disk use, passed the 85% threshold. This in turn
triggered the "near full osd(s)" warning and more than 20 PGs previously
in the "wait_backfill" state were marked as "wait_backfill+backfill_toofull".
Since the OSD kept growing until it reached 90% disk use, we decided to
reduce its relative weight from 1 to 0.95.
That action recalculated the crushmap and remapped a few PGs, but did
not appear to move any data off the almost full OSD. Only when, in steps
of 0.05, we reached a relative weight of 0.50 did data start to move and
some "backfill_toofull" requests get released. However, we had to go down
almost to a relative weight of 0.10 in order to trigger some additional
data movement and have the backfilling process finally finish.

We are now adding new OSDs, but the problem is constantly triggered since
we have multiple OSDs > 83% that start growing during the rebalance.
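
For what it's worth, this is roughly how we keep an eye on the OSDs that are
close to the threshold while the rebalance runs (just a sketch; the grep
pattern may need adjusting):

# per-OSD utilization
ceph osd df
# OSDs that have already tripped the nearfull warning
ceph health detail | grep -i 'near full'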

My questions are:

- Is there something wrong with our process of adding new OSDs (some
additional details below)?
- We also noticed that the problem tends to cluster around the newly
added OSDs, so could those two things be correlated?
- Why does reweighting not trigger instant data movement? What's the logic
behind remapped PGs? Is there some sort of flat queue of tasks, or does
it have some priorities defined?
- Has anybody experienced this situation and, if so, how was it
solved/bypassed?

Cluster details are as follows:

- version: 0.94.9
- 5 monitors,
- 40 storage hosts with 24 x 4 TB disks each: 1 OSD/disk (960 OSDs
in total),
- osd pool default size = 3,
- journaling is on SSDs.

We have a "host" failure domain. Relevant crushmap details:

# rules
rule sas {
ruleset 1
type replicated
min_size 1
max_size 10
step take sas
step chooseleaf firstn 0 type host
step emit
}

root sas {
id -41 # do not change unnecessarily
# weight 3283.279
alg straw
hash 0 # rjenkins1
item osd-l2-16 weight 87.360
item osd-l4-06 weight 87.360
...
item osd-k7-41 weight 14.560
item osd-l4-36 weight 14.560
item osd-k5-36 weight 14.560
}

host osd-k7-21 {
id -46 # do not change unnecessarily
# weight 87.360
alg straw
hash 0 # rjenkins1
item osd.281 weight 3.640
item osd.282 weight 3.640
item osd.285 weight 3.640
...
}

host osd-k7-41 {
id -50 # do not change unnecessarily
# weight 14.560
alg straw
hash 0 # rjenkins1
item osd.900 weight 3.640
item osd.901 weight 3.640
item osd.902 weight 3.640
item osd.903 weight 3.640
}


As mentioned before we created a temporary bucket called "fresh-install"
containing the newly installed nodes (i.e.):

root fresh-install {
id -34 # do not change unnecessarily
# weight 218.400
alg straw
hash 0 # rjenkins1
item osd-k5-36-fresh weight 72.800
item osd-k7-41-fresh weight 72.800
item osd-l4-36-fresh weight 72.800
}

Then, in steps of 6 OSDs (2 OSDs from each new host), we move OSDs from
the "fresh-install" bucket to the "sas" bucket.
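
One such step looks more or less like this (the OSD ids below are just an
example; the 3.640 weight matches the per-disk weight shown above):

# 2 OSDs from each of the three new hosts, moved into the production root
ceph osd crush set osd.950 3.640 host=osd-k5-36 root=sas
ceph osd crush set osd.951 3.640 host=osd-k5-36 root=sas
ceph osd crush set osd.952 3.640 host=osd-k7-41 root=sas
ceph osd crush set osd.953 3.640 host=osd-k7-41 root=sas
ceph osd crush set osd.954 3.640 host=osd-l4-36 root=sas
ceph osd crush set osd.955 3.640 host=osd-l4-36 root=sas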

Thank you in advance for all the suggestions.

Cheers,
Tyanko
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com