The strategy that Nghia described is inefficient in that it moves the data more 
than once, but it is safe since there are always N copies, versus a strategy of 
setting noout, destroying the OSDs, and recreating them on the new server.  The 
latter would be more efficient, albeit with a period of reduced redundancy.

I’ve done what Eugen describes, slightly differently (a rough command sketch 
follows the list):
- Create a staging root
- Create a host bucket there with the new nodename
- Create new OSDs, CRUSH weight them to 0
- Move into the production root
- Weight up using your method of choice
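
A rough sketch of those steps with the ceph CLI (bucket and host names like 
"staging" and "newhost" are just placeholders, and the exact provisioning 
commands depend on your tooling):

    # Staging root plus a host bucket for the new node
    ceph osd crush add-bucket staging root
    ceph osd crush add-bucket newhost host
    ceph osd crush move newhost root=staging

    # ... provision the OSDs on newhost (ceph-volume etc.) ...
    # Zero-weight the new OSDs (or set osd_crush_initial_weight = 0 up front)
    ceph osd crush reweight osd.<id> 0

    # Move the host into the production root, then weight up per OSD
    ceph osd crush move newhost root=default
    ceph osd crush reweight osd.<id> <target weight>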

Another option would be, if the hardware is compatible, to set noout, take down 
one node, destroy the OSDs, swap in the new drives/node, provision the OSDs 
with the same IDs, and wait for balancing.  But you have a period of reduced 
redundancy, and the wrong drive failing can cause grief.
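
Roughly, per OSD (just a sketch; "7" and /dev/sdX are placeholders):

    ceph osd set noout
    systemctl stop ceph-osd@7                          # on the old node
    ceph osd destroy 7 --yes-i-really-mean-it
    # ... physically swap in the new drive / node ...
    ceph-volume lvm create --osd-id 7 --data /dev/sdX  # re-use the old ID
    ceph osd unset noout                               # backfill refills the new OSD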

I think, though, that this sort of scenario may be what swap-bucket was 
designed for.

https://docs.ceph.com/docs/mimic/rados/operations/bluestore-migration/
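
From memory, the whole-host path there amounts to provisioning the 
replacement's OSDs under a host bucket outside the default root and then 
swapping that bucket into the old host's position, something like (hostnames 
are placeholders):

    ceph osd crush swap-bucket newhost oldhost   # newhost takes over oldhost's CRUSH position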

> On Apr 1, 2020, at 5:43 AM, Eugen Block <ebl...@nde.ag> wrote:
> 
> Hi,
> 
> I have a different approach in mind for a replacement; we successfully 
> accomplished that last year in our production environment, where we replaced 
> all nodes of the cluster with newer hardware. Of course, we wanted to avoid 
> rebalancing the data multiple times.
> 
> What we did was to create a new "root" bucket in our crush tree parallel to 
> root=default, then we moved the old nodes to the new root. This didn't 
> trigger rebalancing because there were no hosts left in the default root, but 
> the data was still available to the clients as if nothing had changed.
> Then we added the new nodes to the default root with initial osd crush weight 
> = 0. After all new nodes and OSDs were there, we increased the weight to start 
> data movement. This way each PG was recreated only once, on the new nodes, 
> slowly draining the old servers.
> 
> This should be a valid approach for a single server, too. Create a 
> (temporary) new root or bucket within your crush tree and move the old host 
> to that bucket. Then add your new server to the correct root with initial osd 
> crush weight = 0. When all OSDs are there, increase the weight for all OSDs 
> at once to start the data movement.
> 
> This was all in a Luminous cluster.
> 
> Regards,
> Eugen
> 
> 
> Zitat von Nghia Viet Tran <nghia.viet.t...@mgm-tp.com>:
> 
>> Hi everyone,
>> 
>> I'm working on replacing an OSD node with a newer one. The new host has a 
>> new hostname and new disks (faster, but the same size as the old ones). My 
>> plan is:
>> - Reweight the OSDs to zero to spread all existing data to the remaining 
>> nodes and keep the data available
>> - Set the noout, norebalance, norecover, and nobackfill flags, destroy the 
>> OSDs, and join the new OSDs with the same IDs as the old ones.
>> 
>> With the above approach, the cluster remaps PGs across all nodes. Each piece 
>> of data is moved twice before it reaches the new OSD (once for the reweight 
>> and once when the new node joins with the same IDs).
>> 
>> I also tried the other way, only setting the flags and destroying the OSDs, 
>> but the result is still the same (degraded objects from the destroyed OSDs 
>> and misplaced objects after the new OSDs join).
>> 
>> Is there any way to replace the OSD node directly without remapping PGs 
>> across the whole cluster?
>> 
>> Many thanks!
>> Nghia.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
