[ceph-users] Re: OMAP data growth

2022-12-03 Thread ceph
Just a shot in the dark:

Perhaps you have many backfill tasks running...
You can throttle recovery by limiting max-backfills.
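
For example, something along these lines (just a sketch; the exact values depend on your cluster and release):

  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1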

Hth
Mehmet 

On 2 December 2022 21:09:04 CET, Wyll Ingersoll wrote:
>
>We have a large cluster (10PB) which is about 30% full at this point.  We 
>recently fixed a configuration issue that then triggered the pg autoscaler to 
>start moving around massive amounts of data (85% misplaced objects - about 
>7.5B objects).  The misplaced % is dropping slowly (about 10% each day), but 
>the overall data usage is growing by about 300T/day even though the data being 
>written by clients is well under 30T/day.
>
>The issue was that we have both 3x replicated pools and a very large 
>erasure-coded (8+4) data pool for RGW.  The autoscaler doesn't work if it sees what 
>it thinks are overlapping roots ("default" vs "default~hdd" in the crush tree; even 
>though both refer to the same OSDs, they have different ids: -1 vs -2).  We 
>cleared that by setting the same root for both crush rules, and then the PG 
>autoscaler kicked in and started doing its thing.
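>
>For reference, the kind of change involved looks roughly like this (a sketch, not
>our exact commands; rule and pool names are made up):
>
>  # give the replicated pools a device-class rule so every rule resolves to the
>  # same (shadow) root as the EC rule
>  ceph osd crush rule create-replicated replicated_hdd default host hdd
>  ceph osd pool set mypool crush_rule replicated_hdd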
>
>The "ceph osd df" output shows the OMAP jumping significantly and our data 
>availability is shrinking MUCH faster than we would expect based on the client 
>usage.
>
>Questions:
>
>  *   What is causing the OMAP data consumption to grow so fast and can it be 
> trimmed/throttled?
>  *   Will the overhead data be cleaned up once the misplaced object counts 
> drop to a much lower value?
>  *   Would it do any good to disable the autoscaler at this point since the 
> PGs have already started being moved?
>  *   Any other recommendations to make this go smoother?
>
>thanks!
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Tuning CephFS on NVME for HPC / IO500

2022-12-03 Thread Sebastian
One thing to add to this discussion.
I had a lot of problems with my clusters and spent some time debugging.
What I found, and confirmed on AMD nodes, is that everything starts working like 
a charm once I added the kernel parameter iommu=pt.
There are some other tunings I can't share in full right now, but this iommu=pt 
should help.
Initially it looked as if something in the kernel network stack was slowing down packets.
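
For reference, a typical way to add that parameter (assuming a GRUB-based distro; adapt for your boot loader):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX="... iommu=pt"
  # then regenerate the grub config (grub2-mkconfig -o /boot/grub2/grub.cfg or update-grub) and reboot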

BR,
Sebastian

> On 2 Dec 2022, at 16:03, Manuel Holtgrewe  wrote:
> 
> Dear Mark.
> 
> Thank you very much for all of this information. I learned a lot! In
> particular that I need to learn more about pinning.
> 
> In the end, I want to run the whole thing in production with real world
> workloads. My main aim in running the benchmark is to ensure that my
> hardware and OS are correctly configured (I already found some configuration
> issues in my switches along the way: a lack of balancing between LAG
> interconnects until I used layer 3+4 hashing when creating my bonds, particularities of
> Dell VLTi, and needing unique VLT IDs...). Also, it will be interesting to
> see how things turn out after the cluster has run
> for a year.
> 
> As far as I can see, network and OS configuration is sane. Ceph
> configuration appears to be not too far off something that I could hand to
> my users.
> 
> I will try to play a bit more on the pinning and meta data tuning.
> 
> Best wishes,
> Manuel
> 
> Mark Nelson  wrote on Thu, 1 Dec 2022, 20:19:
> 
>> Hi Manuel,
>> 
>> 
>> I did the IO500 runs back in 2020 and wrote the cephfs aiori backend for
>> IOR/mdtest.  Not sure about the segfault, it's been a while since I've
>> touched that code.  It was working the last time I used it. :D  Having
>> said that, I don't think that's your issue.   The userland backend
>> helped work around an issue where I wasn't able to exceed about 3GB/s
>> per host with the kernel client and thus couldn't hit more than about
>> 30GB/s in the easy tests on a 10 node setup.  I think Jeff Layton might
>> have fixed that issue when he improved the locking code in the kernel a
>> while back and it appears you are getting good results with the kernel
>> client in the easy tests.  I don't recall the userland backend
>> performing much different than the kernel client in the other tests.
>> Instead I would recommend looking at each test individually:
>> 
>> 
>> ior-easy-write (and read):
>> 
>> Each process gets its own file with large aligned IO.  Pretty easy for the
>> MDS and the rest of Ceph to handle.  You get better results overall than
>> I did!  These are the tests we typically do best on out of the box.
>> 
>> 
>> mdtest-easy-write (and stat/delete):
>> 
>> Each process gets its own directory, writing out zero-sized files.  The
>> trick to getting good performance here is to use ephemeral pinning on
>> the parent test directory.  Even better would be to use static round
>> robin pinning for each rank's sub-directory.  Sadly that violates the
>> rules now and we haven't implemented a way to do this with a single
>> parent level xattr (though it would be pretty easy which makes the rule
>> not to touch the subdirs kind of silly imho).  I was able to achieve up
>> to around 10K IOPs per MDS, with the highest achieved score around
>> 400-500K IOPS with 80 MDSes (but that configuration was suboptimal for
>> other tests).  Ephemeral pinning is ok, but you need enough directories
>> to avoid "clumpy" distribution across MDSes.  At ~320
>> processes/directories and 40 MDSes I was seeing about half the
>> performance vs doing perfect round-robin pinning of the individual
>> process directories.  Well, with one exception:  When doing manual
>> pinning, it's better to exclude the authoritative MDS for the parent
>> directory (or perhaps just give it fewer directories than the others)
>> since it's also doing other work and ends up lagging behind slowing the
>> whole benchmark down.  Having said that, this is one of the easier tests
>> to improve so long as you use some kind of reasonable pinning strategy
>> with multiple MDSes.
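>>
>> For illustration, the pinning above is set with xattrs on the test
>> directories; a minimal sketch (paths and the rank number are made up):
>>
>>   # distributed ephemeral pinning on the parent test directory
>>   setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/mdtest-easy
>>   # or static pinning of one rank's sub-directory to a specific MDS rank
>>   setfattr -n ceph.dir.pin -v 3 /mnt/cephfs/mdtest-easy/dir-rank-0042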
>> 
>> 
>> ior-hard-write (and read):
>> 
>> Small unaligned IO to a single shared file.  I think it's ~47K IOs.
>> This is rough to improve without code changes imho.  I remember the
>> results being highly variable in my tests, and it took multiple runs to
>> get a high score.  I don't remember exactly what I had to tweak here,
>> but as opposed to the easy tests you are likely heavily latency bound
>> even with 47K IOs.  I expect you are going to be slamming a single OSD
>> (and PG!) over and over from multiple clients and constrained by how
>> quickly you can get those IOs replicated (for writes when rep > 1) and
>> locks acquired/released (in all cases).  I'm guessing that ensuring the
>> lowest possible per-OSD latency and highest per-OSD throughput is
>> probably a big win here.  Not sure what on the CephFS side might be
>> playing a role, but I imagine caps and file level locking might matter.
>> You can imagine that a

[ceph-users] Re: What to expect on rejoining a host to cluster?

2022-12-03 Thread Matt Larson
Thank you Frank and Eneko,

 Without help and support from ceph admins like you, I would be adrift.  I
really appreciate this.

 I rejoined the host one week ago, and the cluster has been dealing
with the misplaced objects and recovering well.

I will use this strategy in the future:

"If you consider replacing the host and all disks, get a new host first and
give it the host name in the crush map. Just before you deploy the new
host, simply purge all down OSDs in its bucket (set norebalance) and
deploy. Then, the data movement is restricted to re-balancing to the new
host.

If you just want to throw out the old host, destroy the OSDs but keep the
IDs intact (ceph osd destroy). Then, no further re-balancing will happen
and you can re-use the OSD ids later when adding a new host. That's a
stable situation from an operations point of view."
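
For my own notes, a rough sketch of those commands (the OSD id is a placeholder):

  ceph osd set norebalance
  ceph osd purge 12 --yes-i-really-mean-it    # repeat for each down OSD in the host's bucket
  # ... deploy the replacement host ...
  ceph osd unset norebalance

  # or, to keep the IDs for later re-use instead of purging:
  ceph osd destroy 12 --yes-i-really-mean-it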

One last question: I am now seeing that some OSDs have an uneven load of
PGs. Which balancer do you recommend, and are there any caveats for how
the balancer operations can affect/slow the cluster?

Thanks,
  Matt

On Mon, Nov 28, 2022 at 2:23 AM Eneko Lacunza  wrote:

> Hi Matt,
>
> Also, make sure that the rejoining host has the correct time. I have seen
> clusters go down when rejoining hosts that were down for maintenance for
> several weeks and came in with datetime deltas of some months (no idea why
> that happened, I arrived with the firefighter team ;-) )
>
> Cheers
>
> El 27/11/22 a las 13:27, Frank Schilder escribió:
>
> Hi Matt,
>
> if you didn't touch the OSDs on that host, they will join and only objects 
> that have been modified will actually be updated. Ceph keeps some basic 
> history information and can detect changes. 2 weeks is not a very long time. 
> If you have a lot of cold data, re-integration will go fast.
>
> Initially, you will see a huge amount of misplaced objects. However, this 
> count will go down much faster than the objects/s recovery rate would suggest.
>
> Before you rejoin the host, I would fix its issues though. Now that you have 
> it out of the cluster, do the maintenance first. There is no rush. In fact, 
> you can buy a new host, install the OSDs in the new one and join that to the 
> cluster with the host-name of the old host.
>
> If you consider replacing the host and all disks, get a new host first 
> and give it the host name in the crush map. Just before you deploy the new 
> host, simply purge all down OSDs in its bucket (set norebalance) and deploy. 
> Then, the data movement is restricted to re-balancing to the new host.
>
> If you just want to throw out the old host, destroy the OSDs but keep the IDs 
> intact (ceph osd destroy). Then, no further re-balancing will happen and you 
> can re-use the OSD ids later when adding a new host. That's a stable 
> situation from an operations point of view.
>
> Hope that helps.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Matt Larson  
> Sent: 26 November 2022 21:07:41
> To: ceph-users
> Subject: [ceph-users] What to expect on rejoining a host to cluster?
>
> Hi all,
>
>  I have had a host with 16 OSDs, each 14 TB in capacity, that started having
> hardware issues causing it to crash.  I took this host down 2 weeks ago,
> and the data rebalanced to the remaining 11 server hosts in the Ceph
> cluster over this time period.
>
>  My initial goal was to then remove the host completely from the cluster
> with `ceph osd rm XX` and `ceph osd purge XX` (Adding/Removing OSDs — Ceph
> Documentation
>  ).
> However, I found that after the large amount of data migration from the
> recovery, purging an OSD and removing it from the crush map still
> required another large data move.  It appears that it would have been a
> better strategy to assign a 0 weight to an OSD to have only a single larger
> data move instead of two.
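>
> For reference, a sketch of that alternative (the OSD id is a placeholder):
>
>   ceph osd crush reweight osd.12 0     # drain the OSD first, then rm/purge it once empty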
>
>  I'd like to join the downed server back into the Ceph cluster.  It still
> has 14 OSDs that are listed as out/down that would be brought back online.
> My question is what can I expect if I bring this host online?  Will the
> OSDs of a host that has been offline for an extended period of time and out
> of the cluster have PGs that are now quite different or inconsistent?  Will
> this be problematic?
>
>  Thanks for any advice,
>Matt
>
> --
> Matt Larson, PhD
> Madison, WI  53705 U.S.A.
>
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 | https://www.binovo.es
> Astigarragako Bide

[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-03 Thread Alex Gorbachev
Boris, I have seen one problematic OSD cause this issue on all OSDs with
which its PGs peered.  The solution was to take out the slow OSD;
immediately all slow ops stopped.  I found it by observing common OSDs in
the reported slow ops.  Not saying this is your issue, but it may be a
possibility.  Good luck!
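
A rough sketch of that approach (assuming the health output lists the offending daemons):

  ceph health detail | grep -i 'slow ops'   # note which osd.N keep showing up
  ceph osd out <id>                         # take the suspect OSD out and see whether the slow ops clear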

--
Alex Gorbachev
https://alextelescope.blogspot.com



On Fri, Dec 2, 2022 at 7:54 PM Boris Behrens  wrote:

> hi,
> maybe someone here can help me to debug an issue we faced today.
>
> Today one of our clusters came to a grinding halt with 2/3 of our OSDs
> reporting slow ops.
> The only option to get it back working quickly was to restart all OSD daemons.
>
> The cluster is an octopus cluster with 150 enterprise SSD OSDs. Last work
> on the cluster: synced in a node 4 days ago.
>
> The only health issue that was reported was SLOW_OPS. No slow pings
> on the networks. No restarting OSDs. Nothing.
>
> I was able to pin it down to a 20 s timeframe, and I read ALL the logs in a 20
> minute window around this issue.
>
> I haven't found any clues.
>
> Maybe someone has encountered this in the past?
>
> --
> The self-help group "UTF-8 problems" meets, as an exception, in the
> large hall this time.
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] set-rgw-api-host removed from pacific

2022-12-03 Thread Ryan
I recently upgraded a cluster from octopus to pacific using cephadm. The
cluster has 3 rados gateways internally facing with rgw_enable_apis at the
default value and 2 rados gateways externally facing with rgw_enable_apis
set to s3website. After the upgrade, the dashboard object gateway page was
showing an error page about a missing admin bucket. Troubleshooting revealed that "ceph
dashboard set-rgw-api-host" is no longer available. I had to reset
rgw_enable_apis to its default on the externally facing gateways to get the dashboard object
gateway page to work.

I'd like to disable the admin api on the externally facing gateways. How do
I control which rados gateways the dashboard will connect to?
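
For reference, a sketch of what I have in mind for the externally facing gateways (the config target name is a placeholder and depends on how the daemons were deployed):

  # expose only the S3 and s3website APIs, leaving out the admin API
  ceph config set client.rgw.external rgw_enable_apis "s3, s3website"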

Thanks,
Ryan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io