What got lost is that I need to change the pool’s m/k parameters, which is only possible by creating a new pool and moving all data from the old pool. Changing the crush rule doesn’t allow you to do that.
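For the record, creating the replacement pool itself is the easy part. Roughly something like this (untested sketch; profile name, pool name, PG count and failure domain are placeholders for our actual values):

  # new EC profile with the smaller k/m and host as the failure domain
  ceph osd erasure-code-profile set rgw-ec-k4-m3 k=4 m=3 crush-failure-domain=host
  # new data pool using that profile
  ceph osd pool create default.rgw.buckets.data.new 1024 1024 erasure rgw-ec-k4-m3
  ceph osd pool application enable default.rgw.buckets.data.new rgw

The hard part is getting the 215TiB of existing objects - and the bucket index entries that point at them - moved into that pool, which is what the rest of this thread is about.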
> On 16. Jun 2023, at 23:32, Nino Kotur <ninoko...@gmail.com> wrote:
>
> If you create a new crush rule for ssd/nvme/hdd and attach it to the existing
> pool, you should be able to do the migration seamlessly while everything is
> online... However, the impact on users will depend on storage device load and
> network utilization, as it will create chaos on the cluster network.
>
> Or did I get something wrong?
>
> Kind regards,
> Nino
>
> On Wed, Jun 14, 2023 at 5:44 PM Christian Theune <c...@flyingcircus.io> wrote:
> Hi,
>
> further note to self and for posterity … ;)
>
> This turned out to be a no-go as well, because you can't silently switch the
> pools to a different storage class: the objects will be found, but the index
> still refers to the old storage class, and lifecycle migrations won't work.
>
> I've brainstormed further options and it appears that the last resort is to
> use placement targets and copy the buckets explicitly - twice, because on
> Nautilus I don't have renames available yet. :(
>
> This will require temporary downtimes prohibiting users from accessing their
> buckets. Fortunately we only have a few very large buckets (200T+) that will
> take a while to copy. We can pre-sync them of course, so the downtime will
> only be during the second copy.
>
> Christian
>
> > On 13. Jun 2023, at 14:52, Christian Theune <c...@flyingcircus.io> wrote:
> >
> > Following up to myself and for posterity:
> >
> > I'm going to try to perform a switch here using (temporary) storage classes
> > and renaming of the pools, so that I can quickly point the STANDARD class
> > at a better EC pool and have new objects located there. After that we'll
> > add (temporary) lifecycle rules to all buckets to ensure their objects get
> > migrated to the STANDARD class.
> >
> > Once that is finished we should be able to delete the old pool and the
> > temporary storage class.
> >
> > First tests appear successful, but I'm struggling a bit to get the bucket
> > rules working (apparently 0 days isn't a real rule …) and the debug
> > interval setting causes very frequent LC runs but doesn't seem to move
> > objects just yet. I'll play around with that setting a bit more, though; I
> > think I might have tripped something that only wants to process objects
> > every so often, and with an interval of 10, a day is still 2.4 hours …
> >
> > Cheers,
> > Christian
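For anyone trying to reproduce the lifecycle approach quoted above: a minimal transition rule would look roughly like this (untested sketch; bucket name and endpoint are placeholders, and "Days" apparently has to be at least 1, since 0 days isn't accepted):

  aws s3api put-bucket-lifecycle-configuration \
    --endpoint-url https://rgw.example.com \
    --bucket my-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "migrate-to-standard",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Transitions": [{"Days": 1, "StorageClass": "STANDARD"}]
      }]
    }'

As far as I understand it, rgw_lc_debug_interval shrinks what LC treats as a "day", and `radosgw-admin lc list` / `radosgw-admin lc process` let you watch and kick processing instead of waiting for the next scheduled run.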
> >> On 9. Jun 2023, at 11:16, Christian Theune <c...@flyingcircus.io> wrote:
> >>
> >> Hi,
> >>
> >> we are running a cluster that has been alive for a long time and we tread
> >> carefully regarding updates. We are still a bit lagging and our cluster
> >> (which started around Firefly) is currently at Nautilus. We're updating and
> >> we know we're still behind, but we keep running into challenges along the
> >> way that typically are still unfixed on main, and - as I started with - we
> >> have to tread carefully.
> >>
> >> Nevertheless, mistakes happen, and we found ourselves in this situation:
> >> we converted our RGW data pool from replicated (n=3) to erasure coded
> >> (k=10, m=3, with 17 hosts), but when doing the EC profile selection we
> >> missed that our hosts are not evenly balanced (this is a growing cluster;
> >> some machines have around 20TiB capacity for the RGW data pool, whereas
> >> newer machines have around 160TiB) and we rather should have gone with
> >> k=4, m=3. In any case, having 13 chunks causes too many hosts to
> >> participate in each object. Going for k+m=7 allows distribution to be more
> >> effective, as we have 7 hosts with the 160TiB sizing.
> >>
> >> Our original migration used the "cache tiering" approach, but that only
> >> works once, when moving from replicated to EC, and cannot be used for
> >> further migrations.
> >>
> >> The amount of data - 215TiB - is somewhat significant, so we need an
> >> approach that scales when copying data[1] to avoid ending up with months
> >> of migration.
> >>
> >> I've run out of ideas for doing this on a low level (i.e. trying to fix it
> >> on a rados/pool level) and I guess we can only fix this on an application
> >> level using multi-zone replication.
> >>
> >> I have the setup nailed in general, but I'm running into issues with
> >> buckets in our staging and production environments that have
> >> `explicit_placement` pools attached. AFAICT this is an outdated mechanism,
> >> but there are no migration tools around. I've seen some people talk about
> >> patched versions of `radosgw-admin metadata put`, which (still) prohibits
> >> removing explicit placements.
> >>
> >> AFAICT those explicit placements will be synced to the secondary zone, and
> >> the effect that I'm seeing underpins that theory: the sync runs for a
> >> while and only a few hundred objects show up in the new zone, as the
> >> buckets/objects are already found in the old pool that the new zone uses
> >> due to the explicit placement rule.
> >>
> >> I'm currently running out of ideas, but am open to any other options.
> >>
> >> Looking at
> >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z
> >> I'm wondering whether the relevant patch is available somewhere, or
> >> whether I'll have to try building that patch again on my own.
> >>
> >> Going through the docs and the code, I'm actually wondering whether
> >> `explicit_placement` is a really crufty residual mechanism that won't get
> >> used in newer clusters, but that older clusters don't really have an
> >> option to get away from?
> >>
> >> In my specific case, the placement rules are identical to the explicit
> >> placements stored on (apparently older) buckets, and the only thing I need
> >> to do is remove them. I can accept a bit of downtime to avoid any race
> >> conditions if needed, so maybe having a small tool that just removes those
> >> entries while all RGWs are down would be fine. A call to
> >> `radosgw-admin bucket stats` takes about 18s for all buckets in production,
> >> and I guess that would be a good indication of what timing to expect when
> >> running an update on the metadata.
> >>
> >> I'll also be in touch with colleagues from Heinlein and 42on, but I'm open
> >> to other suggestions.
> >>
> >> Hugs,
> >> Christian
> >>
> >> [1] We currently have 215TiB of data in 230M objects. Using the "official"
> >> "cache-flush-evict-all" approach was unfeasible here, as it only yielded
> >> around 50MiB/s. Using cache limits and setting the cache size targets to 0
> >> caused proper parallelization and was able to flush/evict at an almost
> >> constant 1GiB/s in the cluster.
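For context on the `explicit_placement` part above, the workflow that keeps coming up is roughly the following (bucket name and instance ID are placeholders). Note that the `metadata put` step is exactly where an unpatched radosgw-admin refuses to drop the explicit placement, which is why people resort to patched builds:

  # dump the bucket instance metadata
  radosgw-admin metadata get bucket.instance:BUCKET:INSTANCE_ID > bucket-instance.json
  # edit bucket-instance.json and empty the explicit_placement pool entries
  # (data_pool, data_extra_pool, index_pool)
  radosgw-admin metadata put bucket.instance:BUCKET:INSTANCE_ID < bucket-instance.json

As mentioned above, this would be done with all RGWs stopped to avoid race conditions.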
Kind regards,
Christian Theune

--
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io