[ceph-users] Re: ceph noout vs ceph norebalance, which is better for minor maintenance

2023-02-17 Thread Konstantin Shalygin


> On 17 Feb 2023, at 23:20, Anthony D'Atri  wrote:
> 
> 
> 
>> * if a rebalance starts due to EDAC or SFP degradation, it is faster to fix the 
>> issue via DC engineers and put the node back to work
> 
> A judicious mon_osd_down_out_subtree_limit setting can also do this by not 
> rebalancing when an entire node is detected down. 

Yes. But in the case where a single disk is reported dead, it may not actually be 
dead. Some examples:

* the disk is just stuck - a reboot and/or a physical eject/insert brings it back to life
* disk read errors - such errors mark the OSD down, but after an OSD restart it 
just works normally (Pending Sectors -> Reallocated Sectors)

Refilling a single 16TB OSD may take 7-10 days, while the issue can often be fixed 
in 10-20 minutes by the duty engineer.

> 
>> * noout prevents unwanted OSD fills and running out of space => outage of 
>> services
> 
> Do you run your clusters very full?

We provide public services. This means a client can rent 1000 disks x 1000GB via 
one terraform command at 02:00 on a Saturday night. It is physically impossible to 
add nodes in that case. Any data movement without upmap is highly undesirable.
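
For illustration, a minimal sketch of how the upmap balancer can be enabled 
(assuming a Luminous-or-newer cluster and the standard CLI):

  ceph osd set-require-min-compat-client luminous   # upmap needs Luminous+ clients
  ceph balancer mode upmap                           # balance with pg-upmap-items
  ceph balancer on                                   # enable the balancer module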



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph noout vs ceph norebalance, which is better for minor maintenance

2023-02-17 Thread Anthony D'Atri



> * if a rebalance starts due to EDAC or SFP degradation, it is faster to fix the 
> issue via DC engineers and put the node back to work

A judicious mon_osd_down_out_subtree_limit setting can also do this by not 
rebalancing when an entire node is detected down. 
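
For illustration, a sketch of how that might be set (assuming a cluster with the 
centralized config store; "host" means a whole down host will not be marked out 
automatically, while individual down OSDs still will be):

  ceph config set mon mon_osd_down_out_subtree_limit host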

> * noout prevents unwanted OSD fills and running out of space => outage of 
> services

Do you run your clusters very full?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph noout vs ceph norebalance, which is better for minor maintenance

2023-02-16 Thread Konstantin Shalygin
Hi Will,

All our clusters have the noout flag set by default, since cluster birth (see the sketch after this list). The reasons:

* if a rebalance starts due to EDAC or SFP degradation, it is faster to fix the 
issue via DC engineers and put the node back to work

* noout prevents unwanted OSD fills and running out of space => outage of 
services

* without noout, an OSD down (broken disk) Prometheus alert would resolve itself 
once the OSD is marked out, because the UP state of an OSD in the metrics world is 
an expression of (in + up). We need the alert to keep firing for humans, for disk replacement 🙂
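
For illustration, a minimal sketch of how the flag is set and verified (standard ceph CLI):

  ceph osd set noout           # down OSDs will not be marked out automatically
  ceph osd dump | grep flags   # confirm "noout" appears in the cluster flags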


Hope this also helps!

k
Sent from my iPhone

> On 16 Feb 2023, at 06:30, William Konitzer  wrote:
> Hi Dan,
> 
> I appreciate the quick response. In that case, would something like this be 
> better, or is it overkill!?
> 
> 1. ceph osd add-noout osd.x #mark out for recovery operations
> 2. ceph osd add-noin osd.x #prevent rebalancing onto the OSD
> 3. kubectl -n rook-ceph scale deployment rook-ceph-osd--* --replicas=0 
> #disable OSD
> 4. ceph osd down osd.x #prevent it from data placement and recovery operations
> 5. Upgrade the firmware on OSD
> 6. ceph osd up osd.x
> 7. kubectl -n rook-ceph scale deployment rook-ceph-osd--* --replicas=1
> 8. ceph osd rm-noin osd.x
> 9. ceph osd rm-noout osd.x
> 
> Thanks,
> Will
> 
> 
>> On Feb 15, 2023, at 5:05 PM, Dan van der Ster  wrote:
>> 
>> Sorry -- Let me rewrite that second paragraph without overloading the
>> term "rebalancing", which I recognize is confusing.
>> 
>> ...
>> 
>> In your case, where you want to perform a quick firmware update on the
>> drive, you should just use noout.
>> 
>> Without noout, the OSD will be marked out after 5 minutes and objects
>> will be re-replicated to other OSDs -- those degraded PGs will move to
>> "backfilling" state and copy the objects on new OSDs.
>> 
>> With noout, the cluster won't start backfilling/recovering, but don't
>> worry -- this won't block IO. What happens is the disk that is having
>> its firmware upgraded will be marked "down", and IO will be accepted
>> and logged by its peers, so that when the disk is back "up" it can
>> replay ("recover") those writes to catch up.
>> 
>> 
>> The norebalance flag only impacts data movement for PGs that are not
>> degraded -- no OSDs are down. This can be useful to pause backfilling
>> e.g. when you are adding or removing hosts to a cluster.
>> 
>> -- dan
>> 
>> On Wed, Feb 15, 2023 at 2:58 PM Dan van der Ster  wrote:
>>> Hi Will,
>>> There are some misconceptions in your mail.
>>> 1. "noout" is a flag used to prevent the down -> out transition after
>>> an osd is down for several minutes. (Default 5 minutes).
>>> 2. "norebalance" is a flag used to prevent objects from being
>>> backfilling to a different OSD *if the PG is not degraded*.
>>> In your case, where you want to perform a quick firmware update on the
>>> drive, you should just use noout.
>>> Without noout, the OSD will be marked out after 5 minutes and data
>>> will start rebalancing to other OSDs.
>>> With noout, the cluster won't start rebalancing. But this won't block
>>> IO -- the disk being repaired will be "down" and IO will be accepted
>>> and logged by its peers, so that when the disk is back "up" it can
>>> replay those writes to catch up.
>>> Hope that helps!
>>> Dan
>>> On Wed, Feb 15, 2023 at 1:12 PM  wrote:
>>>> Hi,
>>>> We have a discussion going on about which is the correct flag to use for 
>>>> some maintenance on an OSD, should it be "noout" or "norebalance"? This 
>>>> was sparked because we need to take an OSD out of service for a short 
>>>> while to upgrade the firmware.
>>>> One school of thought is:
>>>> - "ceph norebalance" prevents automatic rebalancing of data between OSDs, 
>>>> which Ceph does to ensure all OSDs have roughly the same amount of data.
>>>> - "ceph noout" on the other hand prevents OSDs from being marked as 
>>>> out-of-service during maintenance, which helps maintain cluster 
>>>> performance and availability.
>>>> - Additionally, if another OSD fails while the "norebalance" flag is set, 
>>>> the data redundancy and fault tolerance of the Ceph cluster may be 
>>>> compromised.
>>>> - So if we're going to maintain the performance and reliability we need to 
>>>> set the "ceph noout" flag to prevent the OSD from being marked as OOS 
>>>> during maintenance and allow the automatic data redistribution feature of 
>>>> Ceph to work as intended.
>>>> The other opinion is:
>>>> - With the noout flag set, Ceph clients are forced to think that OSD 
>>>> exists and is accessible - so they continue sending requests to such OSD. 
>>>> The OSD also remains in the crush map without any signs that it is 
>>>> actually out. If an additional OSD fails in the cluster with the noout 
>>>> flag set, Ceph is forced to continue thinking that this new failed OSD is 
>>>> OK. It leads to stalled or delayed response from the OSD side to clients.
>>>> - Norebalance instead takes into account the in/out OSD status, but 
>>>> prevents data rebalance. Clients are 

[ceph-users] Re: ceph noout vs ceph norebalance, which is better for minor maintenance

2023-02-15 Thread William Konitzer
Hi Dan,

I appreciate the quick response. In that case, would something like this be 
better, or is it overkill!?

1. ceph osd add-noout osd.x #mark out for recovery operations
2. ceph osd add-noin osd.x #prevent rebalancing onto the OSD
3. kubectl -n rook-ceph scale deployment rook-ceph-osd--* --replicas=0 
#disable OSD
4. ceph osd down osd.x #prevent it from data placement and recovery operations
5. Upgrade the firmware on OSD
6. ceph osd up osd.x
7. kubectl -n rook-ceph scale deployment rook-ceph-osd--* --replicas=1
8. ceph osd rm-noin osd.x
9. ceph osd rm-noout osd.x

Thanks,
Will


> On Feb 15, 2023, at 5:05 PM, Dan van der Ster  wrote:
> 
> Sorry -- Let me rewrite that second paragraph without overloading the
> term "rebalancing", which I recognize is confusing.
> 
> ...
> 
> In your case, where you want to perform a quick firmware update on the
> drive, you should just use noout.
> 
> Without noout, the OSD will be marked out after 5 minutes and objects
> will be re-replicated to other OSDs -- those degraded PGs will move to
> "backfilling" state and copy the objects on new OSDs.
> 
> With noout, the cluster won't start backfilling/recovering, but don't
> worry -- this won't block IO. What happens is the disk that is having
> its firmware upgraded will be marked "down", and IO will be accepted
> and logged by its peers, so that when the disk is back "up" it can
> replay ("recover") those writes to catch up.
> 
> 
> The norebalance flag only impacts data movement for PGs that are not
> degraded -- no OSDs are down. This can be useful to pause backfilling
> e.g. when you are adding or removing hosts to a cluster.
> 
> -- dan
> 
> On Wed, Feb 15, 2023 at 2:58 PM Dan van der Ster  wrote:
>> 
>> Hi Will,
>> 
>> There are some misconceptions in your mail.
>> 
>> 1. "noout" is a flag used to prevent the down -> out transition after
>> an osd is down for several minutes. (Default 5 minutes).
>> 2. "norebalance" is a flag used to prevent objects from being
>> backfilling to a different OSD *if the PG is not degraded*.
>> 
>> In your case, where you want to perform a quick firmware update on the
>> drive, you should just use noout.
>> Without noout, the OSD will be marked out after 5 minutes and data
>> will start rebalancing to other OSDs.
>> With noout, the cluster won't start rebalancing. But this won't block
>> IO -- the disk being repaired will be "down" and IO will be accepted
>> and logged by its peers, so that when the disk is back "up" it can
>> replay those writes to catch up.
>> 
>> Hope that helps!
>> 
>> Dan
>> 
>> 
>> 
>> On Wed, Feb 15, 2023 at 1:12 PM  wrote:
>>> 
>>> Hi,
>>> 
>>> We have a discussion going on about which is the correct flag to use for 
>>> some maintenance on an OSD, should it be "noout" or "norebalance"? This was 
>>> sparked because we need to take an OSD out of service for a short while to 
>>> upgrade the firmware.
>>> 
>>> One school of thought is:
>>> - "ceph norebalance" prevents automatic rebalancing of data between OSDs, 
>>> which Ceph does to ensure all OSDs have roughly the same amount of data.
>>> - "ceph noout" on the other hand prevents OSDs from being marked as 
>>> out-of-service during maintenance, which helps maintain cluster performance 
>>> and availability.
>>> - Additionally, if another OSD fails while the "norebalance" flag is set, 
>>> the data redundancy and fault tolerance of the Ceph cluster may be 
>>> compromised.
>>> - So if we're going to maintain the performance and reliability we need to 
>>> set the "ceph noout" flag to prevent the OSD from being marked as OOS 
>>> during maintenance and allow the automatic data redistribution feature of 
>>> Ceph to work as intended.
>>> 
>>> The other opinion is:
>>> - With the noout flag set, Ceph clients are forced to think that OSD exists 
>>> and is accessible - so they continue sending requests to such OSD. The OSD 
>>> also remains in the crush map without any signs that it is actually out. If 
>>> an additional OSD fails in the cluster with the noout flag set, Ceph is 
>>> forced to continue thinking that this new failed OSD is OK. It leads to 
>>> stalled or delayed response from the OSD side to clients.
>>> - Norebalance instead takes into account the in/out OSD status, but 
>>> prevents data rebalance. Clients are also aware of the real OSD status, so 
>>> no requests go to the OSD which is actually out. If an additional OSD fails 
>>> - only the required temporary PG are created to maintain at least 2 
>>> existing copies of the same data (well, generally it is set by the pool min 
>>> size).
>>> 
>>> The upstream docs seem pretty clear that noout should be used for 
>>> maintenance 
>>> (https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-osd/),
>>>  but the second opinion strongly suggests that norebalance is actually 
>>> better and the Ceph docs are out of date.
>>> 
>>> So what is the feedback from the wider community?
>>> 
>>> Thanks,
>>> Will
>>> ___

[ceph-users] Re: ceph noout vs ceph norebalance, which is better for minor maintenance

2023-02-15 Thread Dan van der Ster
Sorry -- Let me rewrite that second paragraph without overloading the
term "rebalancing", which I recognize is confusing.

...

In your case, where you want to perform a quick firmware update on the
drive, you should just use noout.

Without noout, the OSD will be marked out after 5 minutes and objects
will be re-replicated to other OSDs -- those degraded PGs will move to
"backfilling" state and copy the objects on new OSDs.

With noout, the cluster won't start backfilling/recovering, but don't
worry -- this won't block IO. What happens is the disk that is having
its firmware upgraded will be marked "down", and IO will be accepted
and logged by its peers, so that when the disk is back "up" it can
replay ("recover") those writes to catch up.


The norebalance flag only impacts data movement for PGs that are not
degraded -- no OSDs are down. This can be useful to pause backfilling
e.g. when you are adding or removing hosts to a cluster.
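
For illustration, a sketch of that use case (standard CLI; the hosts step is just 
a placeholder):

  ceph osd set norebalance     # pause backfill of misplaced (non-degraded) PGs
  # ... add or remove the hosts, let peering finish ...
  ceph osd unset norebalance   # let the paused backfill proceed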

-- dan

On Wed, Feb 15, 2023 at 2:58 PM Dan van der Ster  wrote:
>
> Hi Will,
>
> There are some misconceptions in your mail.
>
> 1. "noout" is a flag used to prevent the down -> out transition after
> an osd is down for several minutes. (Default 5 minutes).
> 2. "norebalance" is a flag used to prevent objects from being
> backfilling to a different OSD *if the PG is not degraded*.
>
> In your case, where you want to perform a quick firmware update on the
> drive, you should just use noout.
> Without noout, the OSD will be marked out after 5 minutes and data
> will start rebalancing to other OSDs.
> With noout, the cluster won't start rebalancing. But this won't block
> IO -- the disk being repaired will be "down" and IO will be accepted
> and logged by its peers, so that when the disk is back "up" it can
> replay those writes to catch up.
>
> Hope that helps!
>
> Dan
>
>
>
> On Wed, Feb 15, 2023 at 1:12 PM  wrote:
> >
> > Hi,
> >
> > We have a discussion going on about which is the correct flag to use for 
> > some maintenance on an OSD, should it be "noout" or "norebalance"? This was 
> > sparked because we need to take an OSD out of service for a short while to 
> > upgrade the firmware.
> >
> > One school of thought is:
> > - "ceph norebalance" prevents automatic rebalancing of data between OSDs, 
> > which Ceph does to ensure all OSDs have roughly the same amount of data.
> > - "ceph noout" on the other hand prevents OSDs from being marked as 
> > out-of-service during maintenance, which helps maintain cluster performance 
> > and availability.
> > - Additionally, if another OSD fails while the "norebalance" flag is set, 
> > the data redundancy and fault tolerance of the Ceph cluster may be 
> > compromised.
> > - So if we're going to maintain the performance and reliability we need to 
> > set the "ceph noout" flag to prevent the OSD from being marked as OOS 
> > during maintenance and allow the automatic data redistribution feature of 
> > Ceph to work as intended.
> >
> > The other opinion is:
> > - With the noout flag set, Ceph clients are forced to think that OSD exists 
> > and is accessible - so they continue sending requests to such OSD. The OSD 
> > also remains in the crush map without any signs that it is actually out. If 
> > an additional OSD fails in the cluster with the noout flag set, Ceph is 
> > forced to continue thinking that this new failed OSD is OK. It leads to 
> > stalled or delayed response from the OSD side to clients.
> > - Norebalance instead takes into account the in/out OSD status, but 
> > prevents data rebalance. Clients are also aware of the real OSD status, so 
> > no requests go to the OSD which is actually out. If an additional OSD fails 
> > - only the required temporary PG are created to maintain at least 2 
> > existing copies of the same data (well, generally it is set by the pool min 
> > size).
> >
> > The upstream docs seem pretty clear that noout should be used for 
> > maintenance 
> > (https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-osd/),
> >  but the second opinion strongly suggests that norebalance is actually 
> > better and the Ceph docs are out of date.
> >
> > So what is the feedback from the wider community?
> >
> > Thanks,
> > Will
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph noout vs ceph norebalance, which is better for minor maintenance

2023-02-15 Thread Dan van der Ster
Hi Will,

There are some misconceptions in your mail.

1. "noout" is a flag used to prevent the down -> out transition after
an osd is down for several minutes. (Default 5 minutes).
2. "norebalance" is a flag used to prevent objects from being
backfilling to a different OSD *if the PG is not degraded*.

In your case, where you want to perform a quick firmware update on the
drive, you should just use noout.
Without noout, the OSD will be marked out after 5 minutes and data
will start rebalancing to other OSDs.
With noout, the cluster won't start rebalancing. But this won't block
IO -- the disk being repaired will be "down" and IO will be accepted
and logged by its peers, so that when the disk is back "up" it can
replay those writes to catch up.
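
For illustration, a minimal sketch of such a maintenance window (assuming a 
systemd-based deployment; osd.12 is just a placeholder):

  ceph osd set noout            # suppress the down -> out transition
  systemctl stop ceph-osd@12    # stop the OSD whose drive is being serviced
  # ... upgrade the drive firmware ...
  systemctl start ceph-osd@12   # OSD comes back "up" and recovers the logged writes
  ceph osd unset noout          # back to normal behaviour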

Hope that helps!

Dan



On Wed, Feb 15, 2023 at 1:12 PM  wrote:
>
> Hi,
>
> We have a discussion going on about which is the correct flag to use for some 
> maintenance on an OSD, should it be "noout" or "norebalance"? This was 
> sparked because we need to take an OSD out of service for a short while to 
> upgrade the firmware.
>
> One school of thought is:
> - "ceph norebalance" prevents automatic rebalancing of data between OSDs, 
> which Ceph does to ensure all OSDs have roughly the same amount of data.
> - "ceph noout" on the other hand prevents OSDs from being marked as 
> out-of-service during maintenance, which helps maintain cluster performance 
> and availability.
> - Additionally, if another OSD fails while the "norebalance" flag is set, the 
> data redundancy and fault tolerance of the Ceph cluster may be compromised.
> - So if we're going to maintain the performance and reliability we need to 
> set the "ceph noout" flag to prevent the OSD from being marked as OOS during 
> maintenance and allow the automatic data redistribution feature of Ceph to 
> work as intended.
>
> The other opinion is:
> - With the noout flag set, Ceph clients are forced to think that OSD exists 
> and is accessible - so they continue sending requests to such OSD. The OSD 
> also remains in the crush map without any signs that it is actually out. If 
> an additional OSD fails in the cluster with the noout flag set, Ceph is 
> forced to continue thinking that this new failed OSD is OK. It leads to 
> stalled or delayed response from the OSD side to clients.
> - Norebalance instead takes into account the in/out OSD status, but prevents 
> data rebalance. Clients are also aware of the real OSD status, so no requests 
> go to the OSD which is actually out. If an additional OSD fails - only the 
> required temporary PG are created to maintain at least 2 existing copies of 
> the same data (well, generally it is set by the pool min size).
>
> The upstream docs seem pretty clear that noout should be used for maintenance 
> (https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-osd/), 
> but the second opinion strongly suggests that norebalance is actually better 
> and the Ceph docs are out of date.
>
> So what is the feedback from the wider community?
>
> Thanks,
> Will
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io