[ceph-users] Re: Backfill full osds

2024-11-06 Thread Anthony D'Atri
I’ve successfully used a *temporary* relax of the ratios to get out of a sticky 
situation, but I must qualify that with an admonition to make SURE that you 
move them back ASAP.

Note that backfillfull_ratio is enforced to be lower than full_ratio, so 
depending on how close to the precipice you skate, you may need to raise 
full_ratio slightly too.

And change it back ASAP.  I can’t stress that enough.
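
A minimal sketch of the commands involved (example values only; check your 
current settings first, and restore the defaults of 0.90 backfillfull / 0.95 
full once the cluster is healthy again):

  ceph osd dump | grep ratio             # show current nearfull/backfillfull/full ratios
  ceph osd set-backfillfull-ratio 0.92   # temporary bump, example value
  ceph osd set-full-ratio 0.96           # only if needed, and only temporarily
  # ... once backfill completes and utilization drops:
  ceph osd set-backfillfull-ratio 0.90
  ceph osd set-full-ratio 0.95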

pgremapper might help get out of this too, or depending on how many problematic 
OSDs you have, surgical manual remaps:

https://indico.cern.ch/event/669931/contributions/2742401/attachments/1533434/2401109/upmap.pdf
upmap
PDF Document · 213 KB
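
For a single problematic PG, a hand-rolled upmap exception looks roughly like 
this (PG and OSD IDs are hypothetical; requires that all clients are Luminous 
or newer):

  ceph osd set-require-min-compat-client luminous
  ceph osd pg-upmap-items 11.2f 123 45   # move 11.2f's copy from osd.123 to osd.45
  ceph osd rm-pg-upmap-items 11.2f       # drop the exception once things are healthy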



> On Nov 6, 2024, at 9:11 AM, Eugen Block  wrote:
> 
> Hi,
> 
> depending on the actual size of the PGs and OSDs, it could be sufficient to 
> temporarily increase the backfillfull_ratio (default 90%) to 91% or 92%. At 
> 95% the cluster is considered full, so you need to be really careful with 
> those ratios. If you provided more details about the current state, the 
> community might have a couple more ideas. I haven't used the pgremapper 
> yet so I can't really comment on that.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: why performance difference between 'rados bench seq' and 'rados bench rand' quite significant

2024-10-29 Thread Anthony D'Atri
The good Mr. Nelson and others may have more to contribute, but a few thoughts:

* Running for 60 or 120 seconds isn’t quantitative:  rados bench typically 
exhibits a clear ramp-up; watch the per-second stats.
* Suggest running for 10 minutes, three times in a row and averaging the results
* How many PGs in rep3datapool?  Average number of PG replicas per OSD shown by 
`ceph osd df` ?  I would shoot for 150 - 200 in your case.
* Disable scrubs, the balancer, and the PG autoscaler during benching (example commands below).
* If you have OS swap configured, disable it and reboot.  How much RAM?
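
A sketch of quiescing those background tasks for the duration of a run, 
assuming the pool name rep3datapool from the report below:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph balancer off
  ceph osd pool set rep3datapool pg_autoscale_mode off
  # ... run the benchmarks, then:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub
  ceph balancer on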

> On Oct 29, 2024, at 11:43 PM, Louisa  wrote:
> 
> Hi all,
> We used 'rados bench' to test 4k object read and write operations.  
> Our cluster is Pacific, one node, 11 BlueStore OSDs; DB and WAL share the 
> block device.  The block device is an HDD.
> 
> 1. testing 4k write with command 'rados bench 120 write -t 16 -b 4K -p 
> rep3datapool --run-name 4kreadwrite --no-cleanup'
> 
> 2. Before testing 4k reads, we restarted all OSD daemons.  The performance of 
> 'rados bench 120 seq -t 16 -p rep3datapool --run-name 4kreadwrite' was very 
> good, with Average IOPS: 17735; 
> using 'ceph daemon osd.1 perf dump rocksdb' , we found the 
> rocksdb:get_latency avgcount: 15189, avgtime: 0.12947 (12.9us)
> 
> 3. Before testing 4k rand reads, we restarted all OSD daemons.  'rados bench 
> 60 rand -t 16 -p rep3datapool --run-name 4kreadwrite' average IOPS: 2071
> rocksdb:get_latency avgcount: 8756, avgtime: 0.001761293 (1.7ms)
> 
> Q1: Why is the performance difference between 'rados bench seq' and 'rados bench 
> rand' so significant? How do we explain the rocksdb get_latency difference 
> between these two scenarios?
> 
> 4. We wrote 400,000 (40w) 4k objects to the pool and restarted all OSD daemons. Running 
> 'rados bench 120 seq -t 16 -p rep3datapool --run-name 4kreadwrite' again, 
> Average IOPS ~= 2000. 
> rocksdb:get_latency avgtime also reached the millisecond level.
> Q2: Why does 'rados bench seq' performance decrease so drastically after writing some 
> more 4k objects to the pool?
> 
> Q3: Are there any methods or suggestions to optimize the read performance in 
> this scenario under this hardware configuration?
> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Influencing the osd.id when creating or replacing an osd

2024-10-28 Thread Anthony D'Atri
Well sure, if you want to do it the EASY way :rolleyes:

> On Oct 28, 2024, at 1:02 PM, Eugen Block  wrote:
> 
> Or:
> 
> ceph osd find {ID}
> 
> :-)
> 
> Zitat von Robert Sander :
> 
>> On 10/28/24 17:41, Dave Hall wrote:
>> 
>>> However, it would be nice to have something like 'ceph osd location {id}'
>>> from the command line.  If such exists, I haven't seen it.
>> 
>> "ceph osd metadata {id} | jq -r .hostname" will give you the hostname
>> 
>> Regards
>> -- 
>> Robert Sander
>> Heinlein Consulting GmbH
>> Schwedter Str. 8/9b, 10119 Berlin
>> 
>> https://www.heinlein-support.de
>> 
>> Tel: 030 / 405051-43
>> Fax: 030 / 405051-19
>> 
>> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
>> Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Influencing the osd.id when creating or replacing an osd

2024-10-28 Thread Anthony D'Atri


> Yes, but it's irritating. Ideally, I'd like my OSD IDs and hostnames to track 
> so that if a server goes down I can find it and fix it ASAP

`ceph osd tree down` etc. (including alertmanager rules and Grafana panels) 
arguably make that faster and easier than everyone having to memorize OSD 
numbers, especially as the clusters grow.

> But it doesn't take much maintenance to break that scheme and the only thing 
> more painful than renaming a Ceph host is re-numbering an OSD.

Yep!

>> My advice: Do not try to manually number your OSDs.

This.  I’ve been there myself, but it’s truly a sisyphean goal.  

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: centos9 or el9/rocky9

2024-10-24 Thread Anthony D'Atri
Is this moot if the Ceph daemon nodes are numbered in RFC1918 space or 
otherwise not reachable from the internet at large?

> 
>> 
>>  Sorry for posting off topic, a bit too lazy to create yet another
>> account somewhere. I still need to make this upgrade to a different OS. I
>> now have some VMs on CentOS 9 Stream. What annoys me a lot is that TCP
>> wrapper support is not added to ssh by default. (I am using auto-fed DNS
>> blacklists to refuse access)
>> 
>>  Can anyone tell me if this is the same in el9/rocky9?
>> 
>> 
>> 
>> I use fail2ban for this purpose on CentOS Stream 9. It works with
>> firewalld.
> 
> Yes, on one host, no? It is not as if, when host A is being harassed and 
> blacklists an address, all hosts get that update. I am using remote syslog with 
> fail2ban -> dns update -> dns checks on all hosts.
> Or does firewalld allow for some remote updates?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RGW performance guidelines

2024-10-21 Thread Anthony D'Atri


> > Not surprising for HDDs.  Double your deep-scrub interval.
> 
> Done!

If your PG ratio is low, say <200, bumping pg_num may help as well.  Oh yeah, 
looking up your gist from a prior message, you average around 70 PG replicas 
per OSD.  Aim for 200.

Your index pool has way too few PGs.  Set pg_num to 1024.  I’d jack up your 
buckets.data pool to at least 8192 as well.  If you do any MPU at all, I’d 
raise non-ec to 512 or 1024.
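
Using the pool names from your gist, that would be roughly (if the autoscaler 
is enabled on a pool, set its pg_autoscale_mode to off first or it may undo the 
change):

  ceph osd pool set default.rgw.buckets.index pg_num 1024
  ceph osd pool set default.rgw.buckets.data pg_num 8192
  ceph osd pool set default.rgw.buckets.non-ec pg_num 512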

> 
> > So you’re relying on the SSD DB device for the index pool?  Have you looked 
> > at your logs / metrics for those OSDs to see if there is any spillover?
> > What type of SSD are you using here?  And how many HDD OSDs do you have 
> > using each? 
> 
> I will try to describe the system as best I can. We are talking about 18 
> different hosts. Each host has a large number of HDDs and a small number of 
> SSDs (4).
> Out of these SSDs, 2 are used as the backend for a high-speed volume-ssd 
> pool that certain VMs write into, and the other 2 are split into very large 
> LVM partitions, which act as the journal for the HDDs.

As I suspected.

> I have amended the gist to add that extra information from lsblk. I have not 
> added any information regarding disk models etc. But off the top of my head, 
> each HDD should be about 16T in size, and the NVMe is also extremely large 
> and built for high-I/O systems.

There are NVMe devices available that decidedly are not suited for this 
purpose.  The usual rule of thumb that I’ve seen when using TLC-class NVMe 
WAL+DB devices is a max ratio of 10:1 to spinners.  You seem to have 21:1 .

> Each db_device, as you can see in the lsblk output, is extremely large, so I 
> think there is no spillover.

675GB is the largest WAL+DB partition I've ever seen.

> 
> > Uggh.  If the index pool is entirely on HDDs, with no SSD DB partition, 
> > then yeah any metadata ops are going to be dog slow.  Check that your OSDs 
> > actually do have external SSD DBs — it’s easy over the OSD lifecycle to 
> > deploy that way initially but to inadvertently rebuild OSDs without the 
> > external device. 
> 
> I will investigate

`ceph osd metadata` piped through a suitable grep may show if you have OSDs that 
aren't actually using the offboard WAL+DB partition.
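
Something along these lines (field names can vary a bit by release, so treat 
this as a starting point):

  for id in $(ceph osd ls); do
    echo -n "osd.$id: "
    ceph osd metadata $id | jq -r '.bluefs_dedicated_db'
  done
  # "1" = separate DB device present, "0" = WAL+DB on the primary device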

> and I will start by planning a new pg bump which takes forever due to the 
> size of the cluster for the volumes pool

It takes forever because you have spinners ;). And because with recent Ceph 
releases the cluster throttles the (expensive) PG splitting to prevent DoS.  
Splitting all the PGs at once can be … impactful.

>  AND somehow move the index pool to an osd device before bumping.

Is it only on dedicated NVMes right now?  Which would be what, 36 OSDs?  

With your WAL+DB SSDs having a 21:1 ratio, using them for the index pool 
instead / in addition may or may not improve your performance, but you could 
always move back.

> All this is excellent advice which I thank you for.
> 
> I would like now to ask your opinion on the original query, 
> 
> Do you think that there is some palpable difference between 1 bucket with 10 
> million objects, and 10 buckets with 1 million objects each?

Depends on what you’re measuring.  The second case I suspect would list bucket 
contents faster.

> Intuitively, I feel that the first case would mean interacting with far fewer 
> pgs than the second (10 times less?) which spreads the load on more devices, 
> but my knowledge of ceph internals is nearly 0.
> 
> 
> Regards,
> Harry
> 
> 
> 
> On Tue, Oct 15, 2024 at 4:26 PM Anthony D'Atri <anthony.da...@gmail.com> wrote:
>> 
>> 
>> > On Oct 15, 2024, at 9:28 AM, Harry Kominos <hkomi...@gmail.com> wrote:
>> > 
>> > Hello Anthony and thank you for your response!
>> > 
>> > I have placed the requested info in a separate gist here:
>> > https://gist.github.com/hkominos/85dc46f3ce7037ec23ac6e1e2535e885
>> 
>> > 3826 pgs not deep-scrubbed in time
>> > 1501 pgs not scrubbed in time
>> 
>> Not surprising for HDDs.  Double your deep-scrub interval.
>> 
>> > Every OSD is an HDD, with their corresponding index, on a partition in an
>> > SSD device.
>> 
>> 
>> So you’re relying on the SSD DB device for the index pool?  Have you looked 
>> at your logs / metrics for those OSDs to see if there is any spillover?
>> 
>> What type of SSD are you using here?  And how many HDD OSDs do you have 
>> using each?
>> 
>> 
>> > And we are talking about 18 separate devices, with separate
>> > cluster_network for the rebalancing etc.
>> 
>> 
>> 18 separate devices?  Do you mean 18 OSDs per server?  18 servers?  Or the 
>> fact that you're using 18TB HDDs?

[ceph-users] Re: Influencing the osd.id when creating or replacing an osd

2024-10-19 Thread Anthony D'Atri


> On Oct 19, 2024, at 2:47 PM, Shain Miley  wrote:
> 
> We are running octopus but will be upgrading to reef or squid in the next few 
> weeks.  As part of that upgrade I am planning on switching over to using 
> cephadm as well.
> 
> Part of what I am doing right now is going through and replacing old drives 
> and removing some of our oldest nodes and replacing them with new ones…then I 
> will convert the rest of the filestore osd over to bluestore so that I can 
> upgrade.
>  
> One other question based on your suggestion below…my typical process of 
> removing or replacing an osd involves the following:
> 
> ceph osd crush reweight osd.id  0.0
> ceph osd out osd.id 
> service ceph stop osd.id 
> ceph osd crush remove osd.id 
> ceph auth del osd.id 
> ceph osd rm id
>  
> Does `ceph osd destroy` do something other than the last 3 commands above or 
> am I just doing the same thing using multiple commands?  If I need to start 
> issuing the destroy command as well I can.
> 

I don’t recall if it will stop the service if running, but it does leave the 
OSD in the CRUSH map marked as ‘destroyed’.  I *think* it leaves the auth but 
I’m not sure.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Influencing the osd.id when creating or replacing an osd

2024-10-18 Thread Anthony D'Atri
What release are you running where ceph-deploy still works?

I get what you're saying, but really you should get used to OSD IDs being 
arbitrary.

- ``ceph osd ls-tree {name}`` will output a list of OSD ids under
  the given CRUSH name (like a host or rack name).  This is useful
  for applying changes to entire subtrees.  For example, ``ceph
  osd down `ceph osd ls-tree rack1```.

This is useful for one-off scripts, where you can e.g. use it to get a list of 
OSDs on a given host.

Normally the OSD ID selected is the lowest-numbered unused one.  Which can 
either be an ID that has never been used before, or one that has been deleted.  
So if you delete an OSD entirely and redeploy, you may or may not get the same 
ID depending on the cluster’s history.

- ``ceph osd destroy`` will mark an OSD destroyed and remove its
  cephx and lockbox keys.  However, the OSD id and CRUSH map entry
  will remain in place, allowing the id to be reused by a
  replacement device with minimal data rebalancing.

Destroying OSDs and redeploying them can help with what you’re after. 
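
A rough sketch of that replace-in-place flow with a reused ID (device path and 
ID are hypothetical):

  ceph osd destroy 12 --yes-i-really-mean-it          # osd.12 stays in CRUSH, marked destroyed
  ceph-volume lvm zap /dev/sdX --destroy              # wipe the replacement device
  ceph-volume lvm create --osd-id 12 --data /dev/sdX  # redeploy reusing the same ID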

> On Oct 17, 2024, at 9:14 PM, Shain Miley  wrote:
> 
> Hello,
> I am still using ceph-deploy to add osd’s to my cluster.  From what I have 
> read ceph-deploy does not allow you to specify the osd.id when creating new 
> osds, however I am wondering if there is a way to influence the number that 
> ceph will assign for the next osd that is created.
> 
> I know that it really shouldn’t matter what osd number gets assigned to the 
> disk but as the number of osd increases it is much easier to keep track of 
> where things are if you can control the id when replacing failed disks or 
> adding new nodes.
> 
> Thank you,
> Shain
> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: "ceph orch" not working anymore

2024-10-17 Thread Anthony D'Atri

> I appreciate your kind words. 😎🙂
> Frederics link has the correct answer, remove the respective field from the 
> json.
> 
> 
>> You're so cool, Eugen. Somehow you seem to find out everything.

Indeed he’s awesome!  Perfect example of why the Ceph community is so valuable 
and indeed a gating factor when evaluating against other solutions.


— not Eugen

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef osd_memory_target and swapping

2024-10-16 Thread Anthony D'Atri

> Unfortunately, its not quite that simple. At least until mimic, but 
> potentially later too there was this behavior that either the OSD's allocator 
> did not release or the kernel did not reclaim unused pages if there was 
> sufficient total memory available. Which implied pointless swapping. The 
> observation was exactly what Dave describes, huge resident memory size 
> without any load. The resident memory size just stayed high for no apparent 
> reason.

I’ve seen that on non-Ceph systems too.  Sometimes with Ceph I see tcmalloc not 
actually freeing unused mem; in those situations a “heap release” on the admin 
socket does wonders.  I haven’t seen that since … Nautilus perhaps.

> The consequences were bad though, because during peering apparently the 
> "leaked memory" started playing a role and OSDs crashed due to pages on swap 
> not fitting into RAM.

Back in the BSD days swap had to be >= physmem, these days we skate SysV style 
where swap extends the VM space instead of backing it.

> Having said that, we do have large swap partitions on disk for emergency 
> cases. We have swap off by default to avoid the memory "leak" issue and we 
> actually have sufficient RAM to begin with - maybe that's a bit of a luxury.

I can’t argue with that strategy, if your boot drives are large enough.  I’ve 
as recently as this year suffered legacy systems with as little as 100GB boot 
drives — so overly balkanized that no partition was large enough.

K8s as I understand it won’t even run if swap is enabled.  Swap to me is what 
we did in 1988 when RAM cost money and we had 3MB (yes) diskless (yes) 
workstations.  Out of necessity.

> The swap recommendation is a contentious one - I, for one, have always been 
> against it.

Same here.  It’s a relic of the days when RAM was dramatically more expensive.  
I’ve had this argument with people stuck in the past, even when the resident 
performance expert 100% agreed with me.  

>IMHO, disabling swap is a recommendation that comes up because folks are 
>afraid of their OSDs becoming sluggish when their hosts become
>oversubscribed.

In part yes.  I tell people all the time that Ceph is usually better off with a 
failed component than a crippled component.

>But why not just avoid oversubscription altogether?

Well, yeah.  In the above case, with non-Ceph systems, there were like 2000 of 
them at unstaffed DCs around the world that were DellR430s with only 64GB.  
There was a closely-guarded secret that deployments were blue-green so enough 
vmem was needed to run two copies for brief intervals.  Upgrading them would 
have been prohibitively expensive, even if they weren’t already like 8 years 
old.  Plus certain people were stubborn.


> If you set appropriate OSD memory targets, set kernel swapiness to
> something like 10-20, and properly pin your OSDs in a system with >1 NUMA
> node so that they're evenly distributed across NUMA nodes, your kernel will
> not swap because it simply has no reason to.

I had swappiness arguments with the above people too, and had lobbied for the 
refresh nodes (again, non-Ceph) to be single-socket to avoid the NUMA factor 
that demonstrably was degrading performance.





> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Tyler Stachecki 
> Sent: Wednesday, October 16, 2024 1:46 PM
> To: Anthony D'Atri
> Cc: Dave Hall; ceph-users
> Subject: [ceph-users] Re: Reef osd_memory_target and swapping
> 
> On Tue, Oct 15, 2024, 1:38 PM Anthony D'Atri  wrote:
> 
>> 
>> 
>>> On Oct 15, 2024, at 1:06 PM, Dave Hall  wrote:
>>> 
>>> Hello.
>>> 
>>> I'm seeing the following in the Dashboard  -> Configuration panel
>>> for osd_memory_target:
>>> 
>>> Default:
>>> 4294967296
>>> 
>>> Current Values:
>>> osd: 9797659437,
>>> osd: 10408081664,
>>> osd: 11381160192,
>>> osd: 22260320563
>>> 
>>> I have 4 hosts in the cluster right now - all OSD+MGR+MON.  3 have 128GB
>>> RAM, the 4th has 256GB.
>> 
>> 
>> https://docs.ceph.com/en/reef/cephadm/services/osd/#automatically-tuning-osd-memory
>> 
>> You have autotuning enabled, and it’s trying to use all of your physmem.
>> I don’t know offhand how Ceph determines the amount of available memory, if
>> it looks specifically for physmem or if it only looks at vmem.  If it looks
>> at vmem that arguably could be a bug
>> 
>> 
>>> On the host with 256GB, top shows some OSD
>>> processes with very high VIRT and RES values - the highest VIRT OSD has
>>> 13.0g.  The highest RES is 8.5g

[ceph-users] Re: SLOW_OPS problems

2024-10-15 Thread Anthony D'Atri
Oh yeah that’s really high for a drive.

Do other drives in the same / other chassis show the same temps, or is this an 
outlier?

With Dell chassis, for example, I’ve often had to increase the iDRAC fan speed 
offset to get the drive temps below 40C

> On Oct 15, 2024, at 1:36 PM, Mat Young  wrote:
> 
> Looking at the smartlog seems to show 63C current temp with 53C as worst case, 
> which doesn't make a lot of sense. Could the drive be thermally throttling?
> 
> Rgds
> 
> mat
> 
> From: Tim Sauerbein 
> Sent: Tuesday, October 15, 2024 11:21 AM
> To: ceph-users 
> Subject: [ceph-users] Re: SLOW_OPS problems
> 
> 
> Sorry, forgot to mention:
> 
> 
> 
> I did a secure erase on the drive yesterday, added it to the OSD again with 
> the same result of slow ops a few hours later.
> 
> 
> 
>> On 15 Oct 2024, at 16:07, Tim Sauerbein <sauerb...@icloud.com> wrote:
> 
>> 
> 
>>> On 14 Oct 2024, at 16:01, Anthony D'Atri <a...@dreamsnake.net> wrote:
> 
>>> 
> 
>>> Remind me, have you sent me a full `smartctl -a` output for this drive?
> 
>> 
> 
>> See here, looks good though: 
>> https://gist.github.com/sauerbein/6423231adb954d28c8c82a8422256355
> 
>> 
> 
>>> If there’s a firmware update available, updating it with a subsequent 
>>> secure-erase could plausibly recover it.
> 
>> 
> 
>> I don't think there is a firmware update publicly available. Other disks of 
>> same model and same firmware run without issues in my cluster btw.
> 
>> 
> 
>>> On 14 Oct 2024, at 15:56, Mark Nelson wrote:
> 
>>> 
> 
>>> I've seen similar issues before where smart showed no failures but the 
>>> drive performed terribly.  You can try trimming the drive or even doing a 
>>> secure format to see if it helps, but at least in the case I recall it was 
>>> an issue with the drive itself.
> 
>> 
> 
>> I think that the disk is just faulty too. Do you have any idea of a test to 
>> run on the SSD to prove that independent of Ceph?
> 
>> 
> 
>> 

[ceph-users] Re: Reef osd_memory_target and swapping

2024-10-15 Thread Anthony D'Atri


> On Oct 15, 2024, at 1:06 PM, Dave Hall  wrote:
> 
> Hello.
> 
> I'm seeing the following in the Dashboard  -> Configuration panel
> for osd_memory_target:
> 
> Default:
> 4294967296
> 
> Current Values:
> osd: 9797659437,
> osd: 10408081664,
> osd: 11381160192,
> osd: 22260320563
> 
> I have 4 hosts in the cluster right now - all OSD+MGR+MON.  3 have 128GB
> RAM, the 4th has 256GB.

https://docs.ceph.com/en/reef/cephadm/services/osd/#automatically-tuning-osd-memory

You have autotuning enabled, and it’s trying to use all of your physmem.  I 
don’t know offhand how Ceph determines the amount of available memory, if it 
looks specifically for physmem or if it only looks at vmem.  If it looks at 
vmem that arguably could be a bug
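
If you'd rather pin the target yourself than have cephadm derive it from host 
memory, something along these lines should work (values are examples only):

  ceph config set osd osd_memory_target_autotune false
  ceph config set osd osd_memory_target 4294967296   # back to the 4 GiB default
  ceph config get osd osd_memory_target               # verify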


>  On the host with 256GB, top shows some OSD
> processes with very high VIRT and RES values - the highest VIRT OSD has
> 13.0g.  The highest RES is 8.5g.
> 
> All 4 systems are currently swapping, but the 256GB system has much higher
> swap usage.
> 
> I am confused why I have 4 current values for osd_memory_target, and
> especially about the 4th one at 22GB.
> 
> Also, I'm recalling that there might be a recommendation to disable swap.
> and I could easily do 'swapoff -a' when the swap usage is lower than the
> free RAM.

I tend to advise not using swap at all.  Suggest disabling swap in fstab, then 
serially rebooting your OSD nodes, of course waiting for recovery between each 
before proceeding to the next. 
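
A sketch of that procedure on one node (adapt to your distro; assumes swap is 
configured via fstab rather than a systemd swap unit):

  swapoff -a                                   # drain swap now, if it fits in free RAM
  sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab    # keep it off across reboots
  # then reboot nodes one at a time, waiting for recovery between each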

> 
> Can anybody shed any light on this?
> 
> Thanks.
> 
> -Dave
> 
> --
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RGW performance guidelines

2024-10-15 Thread Anthony D'Atri


> On Oct 15, 2024, at 9:28 AM, Harry Kominos  wrote:
> 
> Hello Anthony and thank you for your response!
> 
> I have placed the requested info in a separate gist here:
> https://gist.github.com/hkominos/85dc46f3ce7037ec23ac6e1e2535e885

> 3826 pgs not deep-scrubbed in time
> 1501 pgs not scrubbed in time

Not surprising for HDDs.  Double your deep-scrub interval.
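
For example (osd_deep_scrub_interval is in seconds and defaults to one week on 
recent releases):

  ceph config set osd osd_deep_scrub_interval 1209600   # 2x the 604800s default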

> Every OSD is an HDD, with their corresponding index, on a partition in an
> SSD device.


So you’re relying on the SSD DB device for the index pool?  Have you looked at 
your logs / metrics for those OSDs to see if there is any spillover?

What type of SSD are you using here?  And how many HDD OSDs do you have using 
each?


> And we are talking about 18 separate devices, with separate
> cluster_network for the rebalancing etc.


18 separate devices?  Do you mean 18 OSDs per server?  18 servers?  Or the fact 
that you’re using 18TB HDDs?

> The index for the RGW is also on an HDD (for now).

Uggh.  If the index pool is entirely on HDDs, with no SSD DB partition, then 
yeah any metadata ops are going to be dog slow.  Check that your OSDs actually 
do have external SSD DBs — it’s easy over the OSD lifecycle to deploy that way 
initially but to inadvertently rebuild OSDs without the external device.  

> Now as far as the number of pgs is concerned, I reached that number,
> through one of the calculators that are found online.

You’re using the autoscaler, I see.  

In your `ceph osd df` output, look at the PGS column at right.  Your balancer 
seems to be working fairly well.  Your average number of PG replicas per OSD is 
around 71, which is in alignment with upstream guidance.  

But I would suggest going twice as high.  See the very recent thread about PGs. 
 So I would adjust pg_num on pools in accordance with their usage and needs so 
that the PGS column there ends up in the 150 - 200 range.

> Since the cluster is doing Object store, Filesystem and Block storage, each 
> pool has a different
> number for pg_num.
> In the RGW Data case, the pool has about 300TB in it , so perhaps that
> explains that the pg_num is lower than what you expected ?

Ah, mixed cluster.  You shoulda led with that ;)

default.rgw.buckets.data 356.7T 3.0 16440T 0.0651 1.0 4096 off False
default.rgw.buckets.index 5693M 3.0 16440T 0. 1.0 32 on False
default.rgw.buckets.non-ec 62769k 3.0 418.7T 0. 1.0 32 
volumes 8 16384 2.4 PiB 650.08M 7.2 PiB 53.80 2.1 PiB

You have three pools with appreciable data — the two RBD pools and your bucket 
pool.  Your pg_nums are more or less reflective of that, which is general 
guidance.

But the index pool is not about data or objects stored.  The index pool is 
mainly omaps not RADOS objects, and needs to be resourced differently.
Assuming that all 978 OSDs are identical media?  Your `ceph df` output though 
implies that you have OSDs on SSDs, so I’ll again request info on the media and 
how your OSDs are built.


Your index pool has only 32 PGs.  I suggest setting pg_num for that pool to, 
say, 1024.  It’ll take a while to split those PGs and you’ll see pgp_num slowly 
increasing, but when it’s done I strongly suspect that you’ll have better 
results.

The non-ec pool is mainly AIUI used for multipart uploads.  If your S3 objects 
are 4MB in size it probably doesn’t matter.  If you do start using MPU you’ll 
want to increase pg_num there too.


> 
> Regards,
> Harry
> 
> 
> 
> On Tue, Oct 15, 2024 at 2:54 PM Anthony D'Atri 
> wrote:
> 
>> 
>> 
>>> Hello Ceph Community!
>>> 
>>> I have the following very interesting problem, for which I found no clear
>>> guidelines upstream so I am hoping to get some input from the mailing
>> list.
>>> I have a 6PB cluster in operation which is currently half full. The
>> cluster
>>> has around 1K OSD, and the RGW data pool  has 4096 pgs (and pgp_num).
>> 
>> Even without specifics I can tell you that pg_num is waay too
>> low.
>> 
>> Please send
>> 
>> `ceph -s`
>> `ceph osd tree | head -30`
>> `ceph osd df | head -10`
>> `ceph -v`
>> 
>> Also, tell us what media your index and bucket OSDs are on.
>> 
>>> The issue is as follows:
>>> Let's say that we have 10 million small objects (4MB) each.
>> 
>> In RGW terms, those are large objects.  Small objects would be 4KB.
>> 
>>> 1) Is there a performance difference *when fetching* between storing all 10
>>> million objects in one bucket and storing 1 million in 10 buckets?
>> 
>> Larger buckets will generally be slower for some things, but if you’re on
>> Reef, and your bucket wasn’t created on an older release, 10 million
>> shouldn’t be too bad.  Listing larger buckets will always be increasi

[ceph-users] Re: Ceph RGW performance guidelines

2024-10-15 Thread Anthony D'Atri


> Hello Ceph Community!
> 
> I have the following very interesting problem, for which I found no clear
> guidelines upstream so I am hoping to get some input from the mailing list.
> I have a 6PB cluster in operation which is currently half full. The cluster
> has around 1K OSD, and the RGW data pool  has 4096 pgs (and pgp_num).

Even without specifics I can tell you that pg_num is waay too low.

Please send

`ceph -s`
`ceph osd tree | head -30`
`ceph osd df | head -10`
`ceph -v`

Also, tell us what media your index and bucket OSDs are on.

> The issue is as follows:
> Let's say that we have 10 million small objects (4MB) each.

In RGW terms, those are large objects.  Small objects would be 4KB.

> 1)Is there a performance difference *when fetching* between storing all 10
> million objects in one bucket and storing 1 million in 10 buckets?

Larger buckets will generally be slower for some things, but if you’re on Reef, 
and your bucket wasn’t created on an older release, 10 million shouldn’t be too 
bad.  Listing larger buckets will always be increasingly slower.  

> There
> should be "some" because of the different number of pgs in use, in the 2
> scenarios but it is very hard to quantify.
> 
> 2) What if I have 100 million objects? Is there some theoretical limit /
> guideline on the number of objects that I should have in a bucket before I
> see performance drops?

At that point, you might consider indexless buckets, if your client/application 
can keep track of objects in its own DB.

With dynamic sharding (assuming you have it enabled), RGW defaults to 100,000 
objects per shard and 1999 max shards, so I *think* that after 199M objects in 
a bucket it won’t auto-reshard.

> I should mention here that the contents of the bucket *never need to be
> listed, *The user always knows how to do a curl, to get the contents.

We can most likely improve your config, but you may also be a candidate for an 
indexless bucket.  They don’t get a lot of press, and I won’t claim to be 
expert in them, but it’s something to look into.


> 
> Thank you for your help,
> Harry
> 
> P.S.
> The following URLs have been very informative, but they do not answer my
> question unfortunately.
> 
> https://www.redhat.com/en/blog/red-hat-ceph-object-store-dell-emc-servers-part-1
> https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"

2024-10-14 Thread Anthony D'Atri
Try failing over to a standby mgr
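
i.e. something like (on older releases you may need to name the active mgr 
explicitly):

  ceph mgr fail          # the active mgr steps down and a standby takes over
  ceph mgr stat          # confirm which mgr is active now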

> On Oct 14, 2024, at 9:33 PM, Harry G Coin  wrote:
> 
> I need help to remove a useless "HEALTH ERR" in 19.2.0 on a fully dual stack 
> docker setup with ceph using ip v6, public and private nets separated, with a 
> few servers.   After upgrading from an error free v18 rev, I can't get rid of 
> the 'health err' owing to the report that all osds are unreachable.  
> Meanwhile ceph -s reports all osds up and in and the cluster otherwise 
> operates normally.   I don't care if it's 'a real fix'  I just need to remove 
> the false error report.   Any ideas?
> 
> Thanks
> 
> Harry Coin
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SLOW_OPS problems

2024-10-14 Thread Anthony D'Atri


>>> Out of curiosity - have you found out what was the problem with that OSD? 
>>> Some hardware issues?
>> I guess the SSD is faulty, even though it doesn't show any issues in SMART. 
>> I will replace it next week to bring the OSD back online and will report if 
>> the issue reappears, which would mean something else is the cause.
> I've seen similar issues before where smart showed no failures but the drive 
> performed terribly.  You can try trimming the drive or even doing a secure 
> format to see if it helps, but at least in the case I recall it was an issue 
> with the drive itself.


Remind me, have you sent me a full `smartctl -a` output for this drive?

If there’s a firmware update available, updating it with a subsequent 
secure-erase could plausibly recover it.
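
On the earlier question of testing the drive independently of Ceph: a hedged 
fio sketch against the raw device (destructive; only after the OSD has been 
removed), comparing the result against a known-good drive of the same model:

  smartctl -a /dev/sdX        # capture full SMART output first
  fio --name=syncwrite --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
      --runtime=120 --time_based --group_reporting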

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reduced data availability: 3 pgs inactive, 3 pgs down

2024-10-13 Thread Anthony D'Atri

> 
> The majority of the pools have ‘replicated size 3 min_size 2’.
> 
Groovy.  

> I do see a few pools such as .rgw.control and a few others have ‘replicated 
> size 3 min_size 1’.

Not a good way to run.  Set min_size to 2 after you get healthy.  

> I am not using erasure encoding and none of the pools are set to ‘replicated 
> size 3 min_size 3’.

Odd that you’re in this situation.   You might increase the retries in your 
crush rules.  

You might also set min_size temporarily to 1 on pool #0, which may let these 
PGs activate and recover, then immediately set back to 2, then investigate if 
all PGs now have a full acting set.   NB There is some risk here.  
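
A sketch of that sequence (confirm which pool has ID 0 first, e.g. with 
`ceph osd pool ls detail`):

  ceph osd pool set <pool-name> min_size 1   # temporarily; writes proceed with one replica
  # ... watch the down PGs activate and recover, then immediately:
  ceph osd pool set <pool-name> min_size 2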

> 
> Thank you,
> 
> Shain
> 
> 
> From: Anthony D'Atri 
> Date: Sunday, October 13, 2024 at 11:29 AM
> To: Shain Miley 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] Reduced data availability: 3 pgs inactive, 3 pgs 
> down
> 
> When you get the cluster healthy, redeploy those Filestore OSDs as BlueStore. 
>  Not before.
> 
> 
> Does your pool have size=3, min_size=3?  Is this a replicated pool? Or EC 
> 2,1?
> 
> Don’t mark lost, there are things we can do.  I don’t want to suggest 
> anything until you share the above info.
> 
>> On Oct 13, 2024, at 10:00 AM, Shain Miley  wrote:
>> 
>> Hello,
>> 
>> I am seeing the following information after reviewing ‘ceph health detail’:
>> 
>> [WRN] PG_AVAILABILITY: Reduced data availability: 3 pgs inactive, 3 pgs down
>> 
>>   pg 0.1a is down, acting [234,35]
>> 
>>   pg 0.20 is down, acting [226,267]
>> 
>>   pg 0.2f is down, acting [227,161]
>> 
>> 
>> When I query each of those pgs I see the following message on each of them:
>> 
>> "peering_blocked_by": [
>> 
>>   {
>> 
>>   "osd": 233,
>> 
>>   "current_lost_at": 0,
>> 
>>   "comment": "starting or marking this osd lost may let us 
>> proceed"
>> 
>>   }
>> 
>> 
>> Osd.233 crashed a while ago and when I try to start it the log shows some 
>> sort of issue with the filesystem:
>> 
>> 
>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus 
>> (stable)
>> 
>> 1: (()+0x12980) [0x7f2779617980]
>> 
>> 2: (gsignal()+0xc7) [0x7f27782c9fb7]
>> 
>> 3: (abort()+0x141) [0x7f27782cb921]
>> 
>> 4: (ceph::__ceph_abort(char const*, int, char const*, 
>> std::__cxx11::basic_string, 
>> std::allocator > const&)+0x1b2) [0x556ebe773ddf]
>> 
>> 5: (FileStore::_do_transaction(ceph::os::Transaction&, unsigned long, int, 
>> ThreadPool::TPHandle*, char const*)+0x62b3) [0x556ebebe2753]
>> 
>> 6: (FileStore::_do_transactions(std::vector> std::allocator >&, unsigned long, 
>> ThreadPool::TPHandle*, char const*)+0x48) [0x556ebebe3f38]
>> 
>> 7: (JournalingObjectStore::journal_replay(unsigned long)+0x105a) 
>> [0x556ebebfc56a]
>> 
>> 8: (FileStore::mount()+0x438a) [0x556ebebda82a]
>> 
>> 9: (OSD::init()+0x4d1) [0x556ebe80fdc1]
>> 
>> 10: (main()+0x3f8c) [0x556ebe77ad2c]
>> 
>> 11: (__libc_start_main()+0xe7) [0x7f27782acbf7]
>> 
>> 12: (_start()+0x2a) [0x556ebe78fc4a]
>> 
>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>> 
>> 
>> 
>> 
>> 
>> At this point I am thinking about either running an xfs repair on osd.233 
>> and trying to see if I can get it back up (once the pgs are healthy again I 
>> would likely zap/readd or replace the drive).
>> 
>> 
>> 
>> Another option it sounds like is to mark the osd as lost.
>> 
>> 
>> 
>> I am just looking for advice on what exactly I should do next to try to 
>> minimize the chances of any data loss.
>> 
>> Here is the query output for each of those pgs:
>> https://pastebin.com/YbfnpZGC
>> 
>> 
>> 
>> Thank you,
>> 
>> Shain
>> 
>> 
>> 
>> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reduced data availability: 3 pgs inactive, 3 pgs down

2024-10-13 Thread Anthony D'Atri
When you get the cluster healthy, redeploy those Filestore OSDs as BlueStore.  
Not before.  


Does your pool have size=3, min_size=3?  Is this a replicated pool? Or EC 2,1?

Don’t mark lost, there are things we can do.  I don’t want to suggest anything 
until you share the above info.  

> On Oct 13, 2024, at 10:00 AM, Shain Miley  wrote:
> 
> Hello,
> 
> I am seeing the following information after reviewing ‘ceph health detail’:
> 
> [WRN] PG_AVAILABILITY: Reduced data availability: 3 pgs inactive, 3 pgs down
> 
>pg 0.1a is down, acting [234,35]
> 
>pg 0.20 is down, acting [226,267]
> 
>pg 0.2f is down, acting [227,161]
> 
> 
> When I query each of those pgs I see the following message on each of them:
> 
>  "peering_blocked_by": [
> 
>{
> 
>"osd": 233,
> 
>"current_lost_at": 0,
> 
>"comment": "starting or marking this osd lost may let us 
> proceed"
> 
>}
> 
> 
> Osd.233 crashed a while ago and when I try to start it the log shows some 
> sort of issue with the filesystem:
> 
> 
> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus 
> (stable)
> 
> 1: (()+0x12980) [0x7f2779617980]
> 
> 2: (gsignal()+0xc7) [0x7f27782c9fb7]
> 
> 3: (abort()+0x141) [0x7f27782cb921]
> 
> 4: (ceph::__ceph_abort(char const*, int, char const*, 
> std::__cxx11::basic_string, std::allocator 
> > const&)+0x1b2) [0x556ebe773ddf]
> 
> 5: (FileStore::_do_transaction(ceph::os::Transaction&, unsigned long, int, 
> ThreadPool::TPHandle*, char const*)+0x62b3) [0x556ebebe2753]
> 
> 6: (FileStore::_do_transactions(std::vector std::allocator >&, unsigned long, 
> ThreadPool::TPHandle*, char const*)+0x48) [0x556ebebe3f38]
> 
> 7: (JournalingObjectStore::journal_replay(unsigned long)+0x105a) 
> [0x556ebebfc56a]
> 
> 8: (FileStore::mount()+0x438a) [0x556ebebda82a]
> 
> 9: (OSD::init()+0x4d1) [0x556ebe80fdc1]
> 
> 10: (main()+0x3f8c) [0x556ebe77ad2c]
> 
> 11: (__libc_start_main()+0xe7) [0x7f27782acbf7]
> 
> 12: (_start()+0x2a) [0x556ebe78fc4a]
> 
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
> 
> 
> 
> 
> 
> At this point I am thinking about either running an xfs repair on osd.233 and 
> trying to see if I can get it back up (once the pgs are healthy again I would 
> likely zap/readd or replace the drive).
> 
> 
> 
> Another option it sounds like is to mark the osd as lost.
> 
> 
> 
> I am just looking for advice on what exactly I should do next to try to 
> minimize the chances of any data loss.
> 
> Here is the query output for each of those pgs:
> https://pastebin.com/YbfnpZGC
> 
> 
> 
> Thank you,
> 
> Shain
> 
> 
> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Anthony D'Atri


> Hi Anthony.
> 
>> ... Bump up pg_num on pools and see how the average / P90 ceph-osd process 
>> size changes?
>> Grafana FTW.  osd_map_cache_size I think defaults to 50 now; I want to say 
>> it used to be much higher.
> 
> That's not an option. What would help is a-priori information based on the 
> implementation

I think with so many variables in play that would be tough to quantify.

> I'm looking at a pool with 5PB of data and 8192 PGs. If I increase that by, 
> say a factor 4, its in one step, not gradual to avoid excessive redundant 
> data movement. I don't want to spend hardware life for nothing and also don't 
> want to wait for months or more for this to complete or get stuck along the 
> way due to something catastrophic.

Used to be that Ceph wouldn’t let you more than double pg_num in one step.  You 
might consider going to just, say, 9216 and see what happens.  Non power of 2 
pg_num isn’t THAT big a deal these days, you’ll end up some some PGs larger 
than others, but it’s not horrible for a short term.My sense re hardware 
life is that writes due to rebalancing are trivial.  

> 
> What I would like to know is is there a fundamental scaling limit in the PG 
> implementation that someone who was staring at the code for a long time knows 
> about. This is usually something that grows much worse than N log N in time- 
> or memory complexity. The answer to this is in the code and boils down to 
> "why the recommendation of 100 PGs per OSD" and not 200 or 1000 or 100 per TB 
> - the latter would make a lot more sense). There ought to be a reason 
> other than "we didn't know what else to write".
> 
> I would like to know the scaling in (worst-case) complexity as a function of 
> the number of PGs. Making a fixed recommendation of a specific number 
> independent of anything else is something really weird. It indicates that 
> there is something catastrophic in the code that will blow up once an 
> (unknown/undocumented!!) threshold is crossed. For example, a tiny but 
> important function that is exponential in the number of PGs. If there is 
> nothing catastrophic in the code, then why is the recommendation not 
> floating, specifying what increase in resource consumption one should expect.
> 
> None of the discussions I have seen so far address this extreme weirdness of 
> the recommendation. If there is an unsolved scaling problem, please anyone 
> state what it is, why its there and what the critical threshold is. What part 
> of the code will explode?
> 
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Anthony D'Atri 
> Sent: Wednesday, October 9, 2024 3:52 PM
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] What is the problem with many PGs per OSD
> 
>> Unfortunately, it doesn't really help answering my questions either.
> 
> 
> Sometimes the best we can do is grunt and shrug :-/. Before Nautilus we 
> couldn’t merge PGs, so we could raise pg_num for a pool but not decrease it, 
> so a certain fear of overshooting was established.  Mark is the go-to here.
> 
>> That's why deploying multiple OSDs per SSD is such a great way to improve 
>> performance on devices where 4K random IO throughput scales with iodepth.
> 
> Mark’s testing have shown this to not be so much the case with recent 
> releases — do you still see this?  Until recently I was expecting 30TB TLC 
> SSDs for RBD, and in the next year perhaps as large as 122T for object so I 
> was thinking of splitting just because of the size - and the systems in 
> question were overequipped with CPU.
> 
> 
>> Memory: I have never used file store, so can't relate to that.
> 
> XFS - I experienced a lot of ballooning, to the point of OOMkilling.  In 
> mixed clusters under duress the BlueStore OSDs consistently behaved better.
> 
>> 9000 PGs/OSD was too much for what kind of system? What CPU? How much RAM? 
>> How many OSDs per host?
> 
> Those were Cisco UCS… C240m3.  Dual 16c Sandy Bridge IIRC, 10x SATA HDD OSDs 
> @ 3TB, 64GB I think.
> 
>> Did it even work with 200PGs with the same data (recovery after power loss)?
> 
> I didn’t have remote power control, and being a shared lab it was difficult 
> to take a cluster down for such testing.  We did have a larger integration 
> cluster (450 OSDs) with a PG ratio of ~~ 200 where we tested a rack power 
> drop.  Ceph was fine (this was …. Firefly I think) but the LSI RoC HBAs lost 
> data like crazy due to hardware, firmware, and utility bugs.
> 
>> Was it maybe the death spiral 

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Anthony D'Atri


> but simply on the physical parameter of IOPS-per-TB (a "figure of merit" that
> is widely underestimated or ignored)

hear hear!

> of HDDs, and having enough IOPS-per-TB to sustain both user and admin 
> workload.

Even with SATA SSDs I twice had to expand a cluster to meet SLO long before it 
was nearly full.  The SNIA TCO calculator includes a multiplier for number of 
drives one has to provision for semi-acceptable IOPs.

> A couple of legacy Ceph instances I saw in the past had 8TB and
> 18TB drives and as they got full the instances basically
> congealed (latencies in the several seconds or even dozens of
> second range) even under modest user workloads, and anyhow
> expensive admin workloads like scrubbing (never mind deep
> scrubbing) got behind by a year or two, and rebalancing was
> nearly impossible. Again not because of Ceph.

Been there, ITSY’d.  Fragmentation matters with rotational media, even with op 
re-ordering within the drive or the driver.

> But that is completely different: SSDs have *much* higher IOPS,
> even SATA ones, so even large SSDs have enormously better
> IOPS-per-TB.

And IOPS-per-yourlocalcurrency.  Coarse-IU QLC is a bit of a wrinkle depending 
on workload...

>> I would like to point out that there are scale-out storage
>> systems that have adopted their architecture for this scenario
>> and use large HDDs very well.
> 
> That is *physically impossible* as they just do not have enough
> IOPS-per-TB for many "live" workloads. The illusion that they
> might work well happens in one of two cases:
> 
> * Either because they have not filled up yet,

I saw this with RGW on ultradense HDD toploaders.  

> or because they
>  have filled up but only a minuscule subset of the data is in
>  active use, the IOPS-per-*active*-TB of the user workload is
>  still good enough.

Archival workloads - sure.  Sometimes even backups.  Even then, 
prudently-sourced QLC often has superior TCO compared to spinners.

> * If the *active data* is mostly read-only and gets cached on a
>  SSD tier of sufficient size, and admin workload does not
>  matter.

And sometimes when that data active because of full backups, that process 
effectively flushes the cache to boot.

> I have some idea of how Qumulo does things and that is very
> unlikely, Ceph is not fundamentally inferior to their design.
> Perhaps the workload's anisotropy matches particularly well that
> of that particular Qumulo instance:

Like a DB that’s column-oriented vs row-oriented?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Procedure for temporary evacuation and replacement

2024-10-10 Thread Anthony D'Atri

> 
> We need to replace about 40 disks distributed over all 12 hosts backing a 
> large pool with EC 8+3. We can't do it host by host as it would take way too 
> long (replace disks per host and let recovery rebuild the data)

This is one of the false economies of HDDs ;) 

> Therefore, we would like to evacuate all data from these disks simultaneously 
> and with as little data movement as possible. This is the procedure that 
> seems to do the trick:
> 
> 1.) For all OSDs: ceph osd reweight ID 0  # Note: not "osd crush reweight"

Note that this will run afoul of the balancer module (see the sketch after the 
quoted steps below).  I *think* also that it will result in the data moving to 
OSDs on the same host.

> 2.) Wait for rebalance to finish
> 3.) Replace disks and deploy OSDs with the same IDs as before per host
> 4.) Start OSDs and let rebalance back
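
Re step 1 and the balancer caveat above, a sketch (OSD ID hypothetical):

  ceph balancer off          # otherwise the balancer may fight the reweights
  ceph osd reweight 123 0    # repeat per OSD being evacuated
  ceph -s                    # wait for misplaced objects to drain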
> 
> I tested step 1 on Octopus with 1 disk and it seems to work. The reason I ask 
> is that step 1 actually marks the OSDs as OUT. However, they are still UP and 
> I see only misplaced objects, not degraded objects. It is a bit 
> counter-intuitive, but it seems that UP+OUT OSDs still participate in IO.
> 
> Because it is counter-intuitive, I would like to have a second opinion. I 
> have read before that others reweight to something like 0.001 and hope that 
> this flushes all PGs. I would prefer not to rely on hope and a reweight to 0 
> apparently is a valid choice here, leading to a somewhat weird state with 
> UP+OUT OSDs.
> 
> Problems that could arise are timeouts I'm overlooking that will make data 
> chunks on UP+OUT OSDs unavailable after some time. I'm also wondering if 
> UP+OUT OSDs participate in peering in case there is an OSD restart somewhere 
> in the pool.
> 
> Thanks for your input and best regards!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Anthony D'Atri


> The main problem was the increase in ram use scaling with PGs, which in
> normal operation is often fine but as we all know balloons in failure
> conditions.

Less so with BlueStore in my experience. I think in part this surfaces a bit of 
Filestore legacy that we might re-examine with Filestore being deprecated.

> There are many developments that may have made things behave better

Very, very much so.

> but early on some clusters just couldn’t be recovered until they received
> double their starting ram and were babysat through careful
> manually-orchestrated startup. (Or maybe worse — I forget.)

I helped a colleague through just such a 40 hour outage, trust me the only way 
it coulda been worse was if it were unrecoverable, as was the lab setup I 
described with a 9000 ratio.  Ask Michael Kidd about the SVL disaster, he 
probably remembers ;)

That outage was in part precipitated (or at least exacerbated) by a user 
issuing a rather large number of snap trims at once.  I subsequently jacked up 
the snap trim cost and delay values.

We did emergency RAM upgrades followed by babysitting.  My colleague wrote a 
Python script that watched MemAvailable and gracefully restarted the OSDs on 
the given system as it reached a low water mark.  This way recovery could at 
least make incremental progress.  During which I increased the markdown count, 
and I think adjusted the reporters value.  The one and only time I’ve ever run 
“ceph osd pause”.  Luminous with mixed Filestore and BlueStore, and OSDs 
ranging from 1.6T to 3.84T.  That cluster had initially been deployed in only 
two racks, so the CRUSH rules weren’t ideal.  I subsequently refactored it and 
siblings to improve the failure domain situation and spread capacity.  In the 
end one larger and one smaller cluster became four clusters, each with nearly 
uniform OSD sizes.  And all the Filestore OSDs got redeployed in the process.

Two weeks previously I’d found that the mons in this very cluster had enough 
RAM to run but not enough to boot — a function of growth and dedicated mon 
nodes.  I’d arranged a Z0MG RAM upgrade on them.  If I hadn’t, that outage 
indeed would have been much, much worse.

> 
> Nobody’s run experiments, presumably because the current sizing guidelines
> are generally good enough to be getting on with, for anybody who has the
> resources to try and engage in the measurement work it would take to
> re-validate them. I will be surprised if anybody has information of the
> sort you seem to be searching for.

The inestimable Mr. Farnum here describes an opportunity for community 
contribution (nudge nudge wink wink ;)

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: About 100g network card for ceph

2024-10-10 Thread Anthony D'Atri

> I would treat having a separate cluster network
> at all as a serious cluster design bug.

I wouldn’t go quite that far, there are still situations where it can be the 
right thing to do.  Like if one is stuck with only 1GE or 10GE networking, but 
NICs and switch ports abound.  Then having separate nets, each with bonded 
links, can make sense.

I’ve also seen network scenarios where bonding isn’t feasible, say a very large 
cluster where the TORs aren’t redundant.  In such a case, one might reason that 
decreasing osd_max_markdown_count can reduce the impact of flapping, and the 
impact of the described flapping might be amortized toward the noise floor.

When bonding, always always talk to your networking folks about the right 
xmit_hash_policy for your deployment.  Suboptimal values rob people of 
bandwidth all the time.
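
As a concrete sketch with iproute2 (interface names hypothetical), for an LACP 
bond hashing on L3+L4 so that multiple TCP flows can use both links:

  ip link add bond0 type bond mode 802.3ad miimon 100 xmit_hash_policy layer3+4
  ip link set eth0 down && ip link set eth0 master bond0
  ip link set eth1 down && ip link set eth1 master bond0
  grep -i "hash policy" /proc/net/bonding/bond0   # verify what the kernel actually uses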

> Reason: a single faulty NIC or
> cable or switch port on the backend network can bring down the whole
> cluster. This is even documented:
> 
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/#flapping-osds

I love it when people reference stuff I suffer through and then write about :D. I haven't 
seen it bring down a whole cluster as such, but it does have an impact and can be tricky to 
troubleshoot if you aren't looking for it.  The clusters I wrote about there 
FWIW did have bonded private and public networks, but weren’t very large by 
modern standards.

> 
> On Thu, Oct 10, 2024 at 3:23 PM Phong Tran Thanh  
> wrote:
>> 
>> Hi ceph users
>> 
>> I have a 100G network card with dual ports for a Ceph node with NVMe disks.
>> Should I bond them or not? Should I bond 200G for both the public and
>> cluster networks, or separate it: one port for the public network and one
>> for the cluster?
>> 
>> Thank ceph users
>> --
>> Email: tranphong...@gmail.com
>> Skype: tranphong079
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> 
> -- 
> Alexander Patrakov
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-10 Thread Anthony D'Atri


> I'm afraid nobody will build a 100PB cluster with 1T drives. That's just 
> absurd

Check the archives for the panoply of absurdity that I’ve encountered ;)

> So, the sharp increase of per-device capacity has to be taken into account. 
> Specifically as the same development is happening with SSDs. There is no way 
> around 100TB drives in the near future and a system like ceph is either able 
> to handle that or will die

Agreed.  I expect 122TB QLC in 1H2025.  With NVMe and PCI-e Gen 5 one might 
experiment with slicing each into two OSDs.  But for archival and object 
workloads latency usually isn’t so big a deal, so we may increasingly see a 
strategy adapted to the workloads.
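
If you do want to try that kind of slicing, ceph-volume and the orchestrator can 
both do it; a sketch, with the device path as a placeholder:

ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1
# or in a cephadm/orchestrator OSD service spec:
#   data_devices:
#     paths:
#       - /dev/nvme0n1
#   osds_per_device: 2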

> 10 higher aggregated sustained IOP/s performance compared with a similarly 
> sized ceph cluster 
> 
But not, I suspect, nearly as many tentacles.




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-09 Thread Anthony D'Atri
> Unfortunately, it doesn't really help answering my questions either.


 Sometimes the best we can do is grunt and shrug :-/. Before Nautilus we 
couldn’t merge PGs, so we could raise pg_num for a pool but not decrease it, so 
a certain fear of overshooting was established.  Mark is the go-to here.

> That's why deploying multiple OSDs per SSD is such a great way to improve 
> performance on devices where 4K random IO throughput scales with iodepth.

Mark’s testing has shown this to not be so much the case with recent releases 
— do you still see this?  Until recently I was expecting 30TB TLC SSDs for RBD, 
and in the next year perhaps as large as 122T for object, so I was thinking of 
splitting just because of the size - and the systems in question were 
overequipped with CPU.


> Memory: I have never used file store, so can't relate to that.

XFS - I experienced a lot of ballooning, to the point of OOMkilling.  In mixed 
clusters under duress the BlueStore OSDs consistently behaved better.

> 9000 PGs/OSD was too much for what kind of system? What CPU? How much RAM? 
> How many OSDs per host?

Those were Cisco UCS… C240m3.  Dual 16c Sandy Bridge IIRC, 10x SATA HDD OSDs @ 
3TB, 64GB I think.

> Did it even work with 200PGs with the same data (recovery after power loss)?

I didn’t have remote power control, and being a shared lab it was difficult to 
take a cluster down for such testing.  We did have a larger integration cluster 
(450 OSDs) with a PG ratio of ~~ 200 where we tested a rack power drop.  Ceph 
was fine (this was …. Firefly I think) but the LSI RoC HBAs lost data like 
crazy due to hardware, firmware, and utility bugs.

> Was it maybe the death spiral 
> (https://ceph-users.ceph.narkive.com/KAzvjjPc/explanation-for-ceph-osd-set-nodown-and-ceph-osd-cluster-snap)
>  that prevented the cluster from coming up and not so much the PG count?

Not in this case, though I’ve seen a similar cascading issue in another context.

> Rumors: Yes, 1000 PGs/OSD on spinners without issues. I guess we are not 
> talking about barely working home systems with lack of all sorts of resources 
> here.

I’d be curious how such systems behave under duress.  I’ve seen a cluster that 
had grown - the mons ended up with enough RAM to run but not to boot, so I did 
urgent RAM upgrades on the mons.  That was the mixed Filestore / BlueStore 
cluster (Luminous 12.2.2) where the Filestore OSDs were much more affected by a 
cascading event than the [mostly larger] BlueStore OSDs.  I suspect that had 
the whole cluster been BlueStore it might not have cascaded.

> 
> The goal: Let's say I want to go 500-1000PGs/OSD on 16T spinners to trim PGs 
> to about 10-20G each. What are the resources that count will require compared 
> with, say, 200 PGs/OSD? That's the interesting question and if I can make the 
> resources available I would consider doing that.

The proof is in the proverbial pudding.  Bump up pg_num on pools and see how 
the average / P90 ceph-osd process size changes?  Grafana FTW.  
osd_map_cache_size I think defaults to 50 now; I want to say it used to be much 
higher.
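
A crude way to watch that without a dashboard, run periodically on an OSD host 
(just a sketch, no guarantees):

ceph config get osd osd_map_cache_size             # confirm what's in effect
ps -C ceph-osd -o pid=,rss=,args= | sort -k2 -n    # resident set size per OSD, in KiB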



> 
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Anthony D'Atri 
> Sent: Wednesday, October 9, 2024 2:40 AM
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] What is the problem with many PGs per OSD
> 
> I’ve sprinkled minimizers below.  Free advice and worth every penny.  ymmv.  
> Do not taunt Happy Fun Ball.
> 
> 
>> during a lot of discussions in the past the comment that having "many PGs 
>> per OSD can lead to issues" came up without ever explaining what these 
>> issues will (not might!) be or how one would notice. It comes up as kind of 
>> a rumor without any factual or even anecdotal backing.
> 
> A handful of years ago Sage IIRC retconned PG ratio guidance from 200 to 100 
> to help avoid OOMing, the idea being that more PGs = more RAM usage on each 
> daemon that stores the maps.  With BlueStore’s osd_memory_target, my sense is 
> that the ballooning seen with Filestore is much less of an issue.
> 
>> As far as I can tell from experience, any increase of resource utilization 
>> due to an increase of the PG count per OSD is more than offset by the 
>> performance impact of the reduced size of the PGs. Everything seems to 
>> benefit from smaller PGs, recovery, user IO, scrubbing.
> 
> My understanding is that there is serialization in the PG code, and thus the 
> PG ratio can be thought of as the degree of parallelism the OSD device can 
> handle.  SAS/SATA SSDs don’t seek so they can handle more than HDDS, and NVMe 
> devices can handle more than SAS/SATA.
> 
>&

[ceph-users] Re: What is the problem with many PGs per OSD

2024-10-08 Thread Anthony D'Atri
I’ve sprinkled minimizers below.  Free advice and worth every penny.  ymmv.  Do 
not taunt Happy Fun Ball.


> during a lot of discussions in the past the comment that having "many PGs per 
> OSD can lead to issues" came up without ever explaining what these issues 
> will (not might!) be or how one would notice. It comes up as kind of a rumor 
> without any factual or even anecdotal backing.

A handful of years ago Sage IIRC retconned PG ratio guidance from 200 to 100 to 
help avoid OOMing, the idea being that more PGs = more RAM usage on each daemon 
that stores the maps.  With BlueStore’s osd_memory_target, my sense is that the 
ballooning seen with Filestore is much less of an issue.

> As far as I can tell from experience, any increase of resource utilization 
> due to an increase of the PG count per OSD is more than offset by the 
> performance impact of the reduced size of the PGs. Everything seems to 
> benefit from smaller PGs, recovery, user IO, scrubbing.

My understanding is that there is serialization in the PG code, and thus the PG 
ratio can be thought of as the degree of parallelism the OSD device can handle. 
 SAS/SATA SSDs don’t seek so they can handle more than HDDS, and NVMe devices 
can handle more than SAS/SATA.

> Yet, I'm holding back on an increase of PG count due to these rumors.

My personal sense:

HDD OSD:  PG ratio 100-200
SATA/SAS SSD OSD: 200-300
NVMe SSD OSD: 300-400

These are not empirical figures.  ymmv.


> My situation: I would like to split PGs on large HDDs. Currently, we have on 
> average 135PGs per OSD and I would like to go for something like 450.

The good Mr. Nelson may have more precise advice, but my personal sense is that 
I wouldn’t go higher than 200 on an HDD.  If you were at like 20 (I’ve seen 
it!) that would be a different story, my sense is that there are diminishing 
returns over say 150.  Seek thrashing fu, elevator scheduling fu, op 
re-ordering fu, etc.  Assuming you’re on Nautilus or later, it doesn’t hurt to 
experiment with your actual workload since you can scale pg_num back down.  
Without Filestore colocated journals, the seek thrashing may be less of an 
issue than it used to be.
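
Since pools can now go both ways, an experiment can look roughly like this; the 
pool name and counts are examples, and set the autoscaler to warn first so it 
doesn't undo your change:

ceph osd pool set cephfs_data pg_autoscale_mode warn
ceph osd pool set cephfs_data pg_num 2048    # split up
# ...watch recovery, client latency, and OSD memory for a while...
ceph osd pool set cephfs_data pg_num 1024    # merge back down if it didn't help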

> I heard in related rumors that some users have 1000+ PGs per OSD without 
> problems.

On spinners?  Or NVMe?  On a 60-120 TB NVMe OSD I’d be sorely tempted to try 
500-1000.

> I would be very much interested in a non-rumor answer, that is, not an answer 
> of the form "it might use more RAM", "it might stress xyz". I don't care what 
> a rumor says it might do. I would like to know what it will do.

It WILL use more RAM.

> I'm looking for answers of the form "a PG per OSD requires X amount of RAM 
> fixed plus Y amount per object”

Derive the size of your map and multiply by the number of OSDs per system.  My 
sense is that it’s on the order of MBs per OSD.  After a certain point the RAM 
delta might have more impact if spent on raising osd_memory_target instead.  
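
One way to put a rough number on it: grab a full map, look at its size, and 
multiply by however many epochs you expect each OSD to hold:

ceph osd getmap -o /tmp/osdmap
ls -lh /tmp/osdmap                      # size of one full osdmap epoch
osdmaptool --print /tmp/osdmap | head   # sanity check that it decodes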

> or "searching/indexing stuff of kind A in N PGs per OSD requires N log 
> N/N²/... operations", "peering of N PGs per OSD requires N/N log 
> N/N²/N*#peers/... operations". In other words, what are the *actual* 
> resources required to host N PGs with M objects on an OSD (note that N*M is a 
> constant per OSD). With that info one could make an informed decision, 
> informed by facts not rumors.
> 
> An additional question of interest is: Has anyone ever observed any 
> detrimental effects of increasing the PG count per OSD to large values>500?

Consider this scenario:

An unmanaged lab setup used for successive OpenStack deployments, each of which 
created two RBD pools and the panoply of RGW pools.  Which nobody cleaned up 
before redeploys, so they accreted like plaque in the arteries of an omnivore.  
Such that the PG ratio hits 9000.  Yes, 9000. Then the building loses power.  
The systems don’t have nearly enough RAM to boot, peer, and activate, so the 
entire cluster has to be wiped and redeployed from scratch.  An extreme 
example, but remember that I don’t make stuff up.  

> 
> Thanks a lot for any clarifications in this matter!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Administrative test, please ignore

2024-10-08 Thread Anthony D'Atri
Testing
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm crush_device_class not applied

2024-10-03 Thread Anthony D'Atri
Interesting - my prior research had supported the idea that device classes are 
named arbitrarily.  For sure these three don’t cover everything — one might 
have, say, nvme-qlc-coarse, nvme-qlc-4k, nvme-slc, nvme-tlc-value, 
nvme-tlc-performance, etc.

> On Oct 3, 2024, at 5:20 PM, Eugen Block  wrote:
> 
> I think this PR [1] is responsible. And here are the three supported classes 
> [2]:
> 
> class to_ceph_volume(object):
> 
>_supported_device_classes = [
>"hdd", "ssd", "nvme"
>]
> 
> Why this limitation?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitors for two different cluster

2024-10-03 Thread Anthony D'Atri
No, you need — and want — separate mons.  The mon daemons can run on the OSD 
nodes.

I’m curious about your use-case where you’d want another tiny cluster instead 
of expanding the one you have.

> On Oct 3, 2024, at 6:06 AM, Michel Niyoyita  wrote:
> 
> Hello Team,
> 
> I have a running cluster deployed using ceph-ansible pacific version and
> ubuntu OS , with three mons and 3 osds servers , the cluster is well
> running , now I want to make another cluster wich will consist of 3 osds
> servers , can the new cluster be deployed using cephadm and using the
> existing Mons for the first cluster?
> 
> Best regards
> 
> Michel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about speeding hdd based cluster

2024-10-02 Thread Anthony D'Atri


>> It is nonetheless risky.  The wrong sequence of cascading events, of 
>> overlapping failures and you may lose data.  
> 
> Our setup is with 3/2.  size=3 seems much safer than 2.

Indeed that is the default for replicated pools.  Additional replicas exhibit 
diminishing returns in most cases at high cost.

> 
>>> 2-) Move the filesystem metadata pools to use at least SSD only.
>>> 
>> Absolutely.  The CephFS docs suggest using size=4 for the MD pool.
>> 
> 
> Hmm..  I don’t remember reading that anywhere, but it makes sense.  

https://docs.ceph.com/en/quincy/cephfs/createfs/#creating-pools

We recommend configuring at least 3 replicas for the metadata pool, as data 
loss in this pool can render the entire file system inaccessible. Configuring 4 
would not be extreme, especially since the metadata pool’s capacity 
requirements are quite modest.


> 
> Thanks!
> 
> George
> 
> 
>> 
>>> 
>>> 3-) Increase server and client cache.
>>> Here I left it like this:
>>> osd_memory_target_autotune=true (each OSD always has more than 12G).
>>> 
>>> For clients:
>>> client_cache_size=163840
>>>
>>> client_oc_max_dirty=1048576000  
>>>  
>>> client_oc_max_dirty_age=50
>>> client_oc_max_objects=1 
>>>
>>> client_oc_size=2097152000   
>>> 
>>> client_oc_target_dirty=838860800
>>> 
>>>  Evaluate, following the documentation, which of these variables makes 
>>> sense for your cluster.
>>> 
>>>  For the backup scenario, I imagine that decreasing the size and 
>>> min_size values will change the impact. However, you must evaluate your 
>>> needs for these settings.
>>> 
>>> 
>>> Rafael.
>>> 
>>>  
>>> 
>>> De: "Kyriazis, George" 
>>> Enviada: 2024/10/02 13:06:09
>>> Para: ebl...@nde.ag, ceph-users@ceph.io
>>> Assunto: [ceph-users] Re: Question about speeding hdd based cluster
>>>  
>>> Thank you all.
>>> 
>>> The cluster is used mostly for backup of large files currently, but we are 
>>> hoping to use it for home directories (compiles, etc.) soon. Most usage 
>>> would be for large files, though.
>>> 
>>> What I've observed with its current usage is that ceph rebalances, and 
>>> proxmox-initiated VM backups bring the storage to its knees.
>>> 
>>> Would a safe approach be to move the metadata pool to ssd first, see how it 
>>> goes (since it would be cheaper), and then add DB/WAL disks? How would ceph 
>>> behave if we are adding DB/WAL disks "slowly" (ie one node at a time)? We 
>>> have about 100 OSDs (mix hdd/ssd) spread across about 25 hosts. Hosts are 
>>> server-grade with plenty of memory and processing power.
>>> 
>>> Thank you!
>>> 
>>> George
>>> 
>>> 
>>> > -Original Message-
>>> > From: Eugen Block 
>>> > Sent: Wednesday, October 2, 2024 2:18 AM
>>> > To: ceph-users@ceph.io
>>> > Subject: [ceph-users] Re: Question about speeding hdd based cluster
>>> >
>>> > Hi George,
>>> >
>>> > the docs [0] strongly recommend to have dedicated SSD or NVMe OSDs for
>>> > the metadata pool. You'll also benefit from dedicated DB/WAL devices.
>>> > But as Joachim already stated, it depends on a couple of factors like the
>>> > number of clients, the load they produce, file sizes etc. There's no easy 
>>> > answer.
>>> >
>>> > Regards,
>>> > Eugen
>>> >
>>> > [0] https://docs.ceph.com/en/latest/cephfs/createfs/#creating-pools
>>> >
>>> > Zitat von Joachim Kraftmayer :
>>> >
>>> > > Hi Kyriazis,
>>> > >
>>> > > depends on the workload.
>>> > > I would recommend to add ssd/nvme DB/WAL to each osd.
>>> > >
>>> > >
>>> > >
>>> > > Joachim Kraftmayer
>>> > >
>>> > > www.clyso.com 
>>> > >
>>> > > Hohenzollernstr. 27, 80801 Munich
>>> > >
>>> > > Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306
>>> > >
>>> > > Kyriazis, George  schrieb am Mi., 2. Okt.
>>> > > 2024,
>>> > > 07:37:
>>> > >
>>> > >> Hello ceph-users,
>>> > >>
>>> > >> I’ve been wondering…. I have a proxmox hdd-based cephfs pool with no
>>> > >> DB/WAL drives. I also have ssd drives in this setup used for other 
>>> > >> pools.
>>> > >>
>>> > >> What would increase the speed of the hdd-based cephfs more, and in
>>> > >> what usage scenarios:
>>> > >>
>>> > >> 1. Adding ssd/nvme DB/WAL drives for each node 2. Moving the metadata
>>> > >> pool for my cephfs to ssd 3. Increasing the performance of the
>>> > >> network. I currently have 10gbe links.
>>> > >>
>>> > >> It doesn’t look like the network is currently saturated, so I’m
>>> > >> thinking
>>> > >> (3) is not a solution. However, if I choose any of the other
>>> > >> options, would I need to also upgrade the network so that the network
>>> > >> does not become a bottleneck?
>>> > >>
>>> > >> Thank you!
>>> > >>
>>> > >> George

[ceph-users] Re: Question about speeding hdd based cluster

2024-10-02 Thread Anthony D'Atri


> On Oct 2, 2024, at 2:19 PM, quag...@bol.com.br wrote:
> 
> Hi Kyriazis,
>  I work with a cluster similar to yours : 142 HDDs and 18 SSDs.
>  I had a lot of performance gains when I made the following settings:
> 
> 1-) For the pool that is configured on the HDDs (here, home directories are 
> on HDDs), reduce the following replica settings (I don't know what your 
> resilience requirement is):
> *size=2
> * min_size=1
> 
>   I do this for at least 4 years with no problems (even when there is a 
> need to change discs or reboot a server, this config never got me in trouble).
> 
It is nonetheless risky.  The wrong sequence of cascading events or overlapping 
failures and you may lose data.  


> 2-) Move the filesystem metadata pools to use at least SSD only.
> 
Absolutely.  The CephFS docs suggest using size=4 for the MD pool.
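
For reference, moving the metadata pool is just a CRUSH rule change plus, 
optionally, a size bump; a sketch, with rule and pool names as placeholders for 
whatever your cluster uses:

ceph osd crush rule create-replicated rep-ssd default host ssd
ceph osd pool set cephfs_metadata crush_rule rep-ssd
ceph osd pool set cephfs_metadata size 3     # or 4, per the suggestion above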


> 
> 3-) Increase server and client cache.
> Here I left it like this:
> osd_memory_target_autotune=true (each OSD always has more than 12G).
> 
> For clients:
> client_cache_size=163840  
>  
> client_oc_max_dirty=1048576000
>
> client_oc_max_dirty_age=50
> client_oc_max_objects=1   
>  
> client_oc_size=2097152000 
>   
> client_oc_target_dirty=838860800
> 
>  Evaluate, following the documentation, which of these variables makes 
> sense for your cluster.
> 
>  For the backup scenario, I imagine that decreasing the size and min_size 
> values will change the impact. However, you must evaluate your needs for 
> these settings.
> 
> 
> Rafael.
> 
>  
> 
> De: "Kyriazis, George" 
> Enviada: 2024/10/02 13:06:09
> Para: ebl...@nde.ag, ceph-users@ceph.io
> Assunto: [ceph-users] Re: Question about speeding hdd based cluster
>  
> Thank you all.
> 
> The cluster is used mostly for backup of large files currently, but we are 
> hoping to use it for home directories (compiles, etc.) soon. Most usage would 
> be for large files, though.
> 
> What I've observed with its current usage is that ceph rebalances, and 
> proxmox-initiated VM backups bring the storage to its knees.
> 
> Would a safe approach be to move the metadata pool to ssd first, see how it 
> goes (since it would be cheaper), and then add DB/WAL disks? How would ceph 
> behave if we are adding DB/WAL disks "slowly" (ie one node at a time)? We 
> have about 100 OSDs (mix hdd/ssd) spread across about 25 hosts. Hosts are 
> server-grade with plenty of memory and processing power.
> 
> Thank you!
> 
> George
> 
> 
> > -Original Message-
> > From: Eugen Block 
> > Sent: Wednesday, October 2, 2024 2:18 AM
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] Re: Question about speeding hdd based cluster
> >
> > Hi George,
> >
> > the docs [0] strongly recommend to have dedicated SSD or NVMe OSDs for
> > the metadata pool. You'll also benefit from dedicated DB/WAL devices.
> > But as Joachim already stated, it depends on a couple of factors like the
> > number of clients, the load they produce, file sizes etc. There's no easy 
> > answer.
> >
> > Regards,
> > Eugen
> >
> > [0] https://docs.ceph.com/en/latest/cephfs/createfs/#creating-pools
> >
> > Zitat von Joachim Kraftmayer :
> >
> > > Hi Kyriazis,
> > >
> > > depends on the workload.
> > > I would recommend to add ssd/nvme DB/WAL to each osd.
> > >
> > >
> > >
> > > Joachim Kraftmayer
> > >
> > > www.clyso.com 
> > >
> > > Hohenzollernstr. 27, 80801 Munich
> > >
> > > Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306
> > >
> > > Kyriazis, George  schrieb am Mi., 2. Okt.
> > > 2024,
> > > 07:37:
> > >
> > >> Hello ceph-users,
> > >>
> > >> I’ve been wondering…. I have a proxmox hdd-based cephfs pool with no
> > >> DB/WAL drives. I also have ssd drives in this setup used for other pools.
> > >>
> > >> What would increase the speed of the hdd-based cephfs more, and in
> > >> what usage scenarios:
> > >>
> > >> 1. Adding ssd/nvme DB/WAL drives for each node 2. Moving the metadata
> > >> pool for my cephfs to ssd 3. Increasing the performance of the
> > >> network. I currently have 10gbe links.
> > >>
> > >> It doesn’t look like the network is currently saturated, so I’m
> > >> thinking
> > >> (3) is not a solution. However, if I choose any of the other
> > >> options, would I need to also upgrade the network so that the network
> > >> does not become a bottleneck?
> > >>
> > >> Thank you!
> > >>
> > >> George
> > >>
> > >> ___
> > >> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > >> email to ceph-users-le...@ceph.io
> > >>
> > > ___
> > > ceph-users mailing list -- ceph-user

[ceph-users] Re: Using XFS and LVM backends together on the same cluster and hosts

2024-09-30 Thread Anthony D'Atri
BlueStore vs Filestore doesn’t matter beyond each OSD.  Filestore is very 
deprecated, so you’ll want to redeploy any Filestore OSDs when you can.  `ceph 
osd metadata` can survey which is which.

I’ve had multiple issues over time with the MG spinners fwiw.  For what SAS 
spinners cost, with some effort you can get SATA SSDs for legacy systems.  Do 
you have those NVMe drives for WAL+DB mirrored?  You generally don’t want to 
go higher than 10:1 (HDDs per WAL+DB device).
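
Something like this will show at a glance which OSDs are still Filestore 
(assumes jq is installed):

ceph osd metadata | jq -r '.[] | "\(.id) \(.osd_objectstore)"' | sort -k2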

> On Sep 30, 2024, at 3:00 PM, Özkan Göksu  wrote:
> 
> Hello folks! I hope you are doing well :)
> 
> I have a general question about XFS and LVM backend OSD performance
> and possible effects if they are used together in the same pool.
> 
> I built a cluster 5 years ago with Nautilus and I used the XFS backend for
> OSD's.
> After 5 years they reached me back with ERR state and I see 10K++ slow ops,
> 29 incomplete PG, 2 inconsistent PG, 34300 unfound objects and 1 osd down
> due to compactions problem.
> 
> At the first check-up I found some drives were replaced by others and they
> used LVM backend for the replaced drives without using wal+db on nvme.
> In the cluster I have mostly XFS backend drives wal+db on nvme and some LVM
> drives without wal+db.
> 
> We have Cephfs and RBD pools on SSD drives and 8+2 EC pool for RGW S3
> workload. RGW index stored on SSD pool.
> 10 nodes: with
> - 21 x 16TB Toshiba MG08CA16TEY, Firmware:EJ09 | 8+2 EC RGW DATA POOL
> - 3 x 960GB MZILS960HEHP/007 Firm: GXL0 | Rep 2 RGW index pool
> - 2 x PM1725B 1.6T PCI-E NVME | 50G WAL+DB for 21x HDD
> 
> Total HDD raw size 2.8PiB, SSD size 26TiB
> 
> I started fixing all the problems one by one and I'm gonna recreate these
> LVM drives without wal+db and I wonder 2 questions:
> 1- Are there any speed or latency differences on XFS and LVM backend OSD's
> for 16TB 7200rpm NL-SAS drives.
> 2- Mixing XFS and LVM backend on the same cluster does have any
> negative effect or problems?
> 
> Best regards:
> 
> Extra note: If you wonder please check the LSBLK output for 1/10 server:
> NODE-01# lsblk
> NAME
>   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> sda
>8:00  14.6T  0 disk
> ├─sda1
> 8:10   100M  0 part
> /var/lib/ceph/osd/ceph-180
> └─sda2
> 8:20  14.6T  0 part
> sdb
>8:16   0  14.6T  0 disk
> ├─sdb1
> 8:17   0   100M  0 part
> /var/lib/ceph/osd/ceph-181
> └─sdb2
> 8:18   0  14.6T  0 part
> sdc
>8:32   0  14.6T  0 disk
> ├─sdc1
> 8:33   0   100M  0 part
> /var/lib/ceph/osd/ceph-182
> └─sdc2
> 8:34   0  14.6T  0 part
> sdd
>8:48   0  14.6T  0 disk
> ├─sdd1
> 8:49   0   100M  0 part
> /var/lib/ceph/osd/ceph-183
> └─sdd2
> 8:50   0  14.6T  0 part
> sde
>8:64   0  14.6T  0 disk
> ├─sde1
> 8:65   0   100M  0 part
> /var/lib/ceph/osd/ceph-185
> └─sde2
> 8:66   0  14.6T  0 part
> sdf
>8:80   0  14.6T  0 disk
> └─ceph--ef5bd394--8dc9--46a8--a244--0c5d3c1400e3-osd--block--b69c0802--9634--43a5--b4a9--0f36cd8690c5
> 253:20  14.6T  0 lvm
> sdg
>8:96   0  14.6T  0 disk
> ├─sdg1
> 8:97   0   100M  0 part
> /var/lib/ceph/osd/ceph-186
> └─sdg2
> 8:98   0  14.6T  0 part
> sdh
>8:112  0  14.6T  0 disk
> ├─sdh1
> 8:113  0   100M  0 part
> /var/lib/ceph/osd/ceph-187
> └─sdh2
> 8:114  0  14.6T  0 part
> sdi
>8:128  0  14.6T  0 disk
> ├─sdi1
> 8:129  0   100M  0 part
> /var/lib/ceph/osd/ceph-188
> └─sdi2
> 8:130  0  14.6T  0 part
> sdj
>8:144  0  14.6T  0 disk
> ├─sdj1
> 8:145  0   100M  0 part
> /var/lib/ceph/osd/ceph-189
> └─sdj2
> 8:146  0  14.6T  0 part
> sdk
>8:160  0  14.6T  0 disk
> ├─sdk1
> 8:161  0   100M  0 part
> /var/lib/ceph/osd/ceph-190
> └─sdk2
> 8:162  0  14.6T  0 part
> sdl
>8:176  0  14.6T  0 disk
> ├─sdl1
> 8:177  0   100M  0 part
> /var/lib/ceph/osd/ceph-191
> └─sdl2
> 8:178  0  14.6T  0 part
> sdm
>8:192  0  14.6T  0 disk
> ├─sdm1
> 8:193  0   100M  0 part
> /var/lib/ceph/osd/ceph-192
> └─sdm2
> 8:194  0  14.6T  0 part
> sdn
>8:208  0  14.6T  0 disk
> ├─sdn1
> 8:209  0   100M  0 part
> /var/

[ceph-users] Re: SLOW_OPS problems

2024-09-30 Thread Anthony D'Atri
My point is that you may have more 10-30s delays that aren’t surfaced.  
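
If you want to see the ones below the waterline for a while, the 30s cutoff is 
just a reporting threshold and can be lowered temporarily, e.g.:

ceph config set osd osd_op_complaint_time 10   # report ops slower than 10s instead of 30s
# remember to set it back to 30 once you've seen enough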

> On Sep 30, 2024, at 10:17 AM, Tim Sauerbein  wrote:
> 
> Thanks for the replies everyone!
> 
>> On 30 Sep 2024, at 13:10, Anthony D'Atri  wrote:
>> 
>> Remember that slow ops are a top of the iceberg thing, you only see ones 
>> that crest above 30s
> 
> So far metrics of the hosted VMs show no other I/O slowdown except when these 
> hiccups occur.
> 
>> On 30 Sep 2024, at 13:35, Igor Fedotov  wrote:
>> 
>> there is no log attached to your post, you better share it via some other 
>> means.
>> 
>> BTW - what log did you mean - monitor or OSD one?
>> 
>> It would be nice to have logs for a couple of OSDs suffering from slow ops, 
>> preferably relevant to two different cases.
> 
> 
> Sorry, the attachments have apparently been stripped. See here for one 
> incident (they all look the same but I can share more if relevant) monitor 
> log, affected osd logs, iostat log:
> 
> https://gist.github.com/sauerbein/5a485a6d2546475912709743e3cfbf4b
> 
> Let me know if you need any other logs to analyse!
> 
>> On 30 Sep 2024, at 14:34, Alexander Schreiber  wrote:
>> 
>> One cause for "slow ops" I discovered are networking issues. I had slow
>> ops across my entire cluster (interconnected with 10G). Turns out the
>> switch was bad an achieved < 10 MBit/s on one of the 10G links.
>> Replaced the switch, tested the links again - got full 10G connectivity
>> and the slow ops disappeared.
> 
> Thanks for the idea. The hosts are connected to two switches with fail-over 
> bonding, normally communicating via the same switch. I will move them all 
> over to the second switch to rule out a switch issue.
> 
> Best regards,
> Tim
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SLOW_OPS problems

2024-09-30 Thread Anthony D'Atri
Remember that slow ops are a tip-of-the-iceberg thing: you only see the ones 
that crest above 30s.

> On Sep 30, 2024, at 6:06 AM, Tim Sauerbein  wrote:
> 
> 
>> On 30 Sep 2024, at 06:23, Joachim Kraftmayer  
>> wrote:
>> 
>> do you see the behaviour across all devices or does it only affect one 
>> type/manufacturer?
> 
> All devices are affected equally, every time one or two random ODSs report 
> slow ops. So I don't think the SSDs are to blame.
> 
> Thanks,
> Tim
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW returning HTTP 500 during resharding

2024-09-29 Thread Anthony D'Atri

>> 
>> AGN or AG is Dell for “Agnostic”, i.e. whatever the cheapest is they have on 
>> the shelf. Samsung
>> indeed is one of the potentials. `smartctl -a` should show you what it 
>> actually is.
> 
> Smartctl only gave “Dell”.


That’s weird.  Send me a full `smartctl -a` privately please.  I’m very close 
to forking smartmontools publicly or at least drivedb.h



> But see iDRAC is more telling and says 90% Samsung, 10% SkHynix (*sigh*)

iDRAC uses the same interfaces, so that’s weird.  Last year I found Dell 
shipping Hynix drives that my contacts said had been cancelled.  They wouldn’t 
give me a SMART reference, so I ended up getting upstream to accept a regex 
assumption.   

> 
>>> newest 2.5.0 firmware.
>> 
>> Verified by a DSU run?
> 
> Was updated through iDRAC.
> Cannot install DSU on the OS itself.

Why not?  I’ve done so thousands of times.  

> 
>>> They are pretty empty. Although there is some 10% capacity being used by 
>>> other stuff (RBD images)
>>> 
>>> - Single bucket. My import application already errored out after only 72 M 
>>> objects/476 GiB of data,
>>> and need a lot more. Objects are between 0 bytes and 1 MB, 7 KB average.
>> 
>> Only 72M? That’s a rather sizable bucket. Were there existing objects as 
>> well? Do you have the
>> ability to spread across multiple buckets? That would decrease your need to 
>> reshard. As I interpret
>> the docs, 199M is the default max number of objects above which 
>> auto-resharding won’t happen.
>> 
>> Since you *know* that you will be pulling in extreme numbers of objects, 
>> consider pre-sharding the
>> bucket while it’s empty. That will be dramatically faster in every way.
> 
> No existing data.
> And yes, I set it manually to 10069 shards now.
> So now it should not happen again, since that is above the 1999 
> rgw_max_dynamic_shards

Cool.  

> It still feels a bit wrong to me to have to set this manually though.

Might be a failsafe against fat fingers.  Your bucket is an outlier in my 
world.  

> I am not against having to tune applications for performance gains, but think 
> it is unfortunate that one seems to have to do so just to prevent the “500 
> internal server errors” that the resharding can effectively cause.

I’m just speculating with my limited info.  

> 
> 
>>> - I cannot touch TCP socket options settings in my Java application.
>> 
>> Your RGW daemons are running on a Java application, not a Linux system
> 
> Sorry, thought you were asking if my (Java based) import-existing-stuff-to-S3 
> program disabled nagle, in its communication with the rgw.
> Ceph has default settings, which is nagle disabled (ms_tcp_nodelay true).
> 
> 
>> Those numbers are nearly useless without context: the rest of the info I 
>> requested. There was a
>> reason for everything on the list. Folks contribute to the list out of the 
>> goodness of their
>> hearts, and aren’t paid for back-and-forth tooth-pulling. If your index pool 
>> and bucket pool share
>> say 3x HDDs or 3x coarse-IU QLC, then don’t expect much.
>> 
>> Sounds like you have the pg autoscaler enabled, which probably doesn’t help. 
>> Your index pool almost
>> certainly needs more PGs. Probably log as well, or set the rgw log levels to 
>> 0.
> 
> Reason I am a bit vague about amount of OSDs and such, is that the numbers 
> are not written on stone yet.

But it can inform what you’re seeing today, esp. if your PoC is a gating one.  

> Currently using 40 in total

Which drives? Replicated pools?

You aren’t equipped for the workload you’re throwing at it.   

> , but may or may not be able to “steal” more gear from another project, if 
> the need arises. So I am rather fishing for advice in the form of “if you 
> have >X OSDs, tune this and that” than specific to my exact current 
> situation. ;-)

That’s the thing, it isn’t one size fits all.  That advice is a function of 
your hardware choices.  


> Not doing anything special placement wise at the moment (not configured 
> anything that would make the index and data go to different OSDs)

All pools on all OSDs?  3x replication?

> 
> Also note that my use-case is a bit non-standard in other ways.
> While I have a lot of objects, the application itself is internal within an 
> organization and does not have many concurrent users.

Sounds ripe for tweaking for larger objects, and is more common than you think. 
 Yours must be like <1KB?  Tiny objects don’t work well with EC, so you’ll want 
3x replication.  Which costs more.   
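
Back-of-the-envelope to illustrate, assuming roughly 7 KiB objects and the 
default 4 KiB bluestore_min_alloc_size: in an 8+2 EC pool each object becomes 
10 chunks, each rounded up to one 4 KiB allocation unit, so about 40 KiB of raw 
space per object (~5.7x).  With 3x replication it's three copies of 8 KiB 
(7 KiB rounded up), about 24 KiB raw (~3.4x).  For tiny objects the EC pool can 
actually burn more raw space than replication, on top of the extra IOPS.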

> The performance demand is less than one would need if you have say a public 
> facing web application with lots of visitors and peaks.
> So my main concern right now is having it run without internal server errors 
> (the resharding thing), and fast enough so that my initial import completes 
> within a couple weeks…

Trying to help.  Can’t do that without info.  Network and other tuning likely 
would help dramatically.  Improving your PoC is the first step toward improving 
prod.   

> 
> I do have control ov

[ceph-users] Re: RGW returning HTTP 500 during resharding

2024-09-28 Thread Anthony D'Atri


> On Sep 28, 2024, at 5:21 PM, Floris Bos  wrote:
> 
> "Anthony D'Atri"  schreef op 28 september 2024 16:24:
>>> No retries.
>>> Is it expected that resharding can take so long?
>>> (in a setup with all NVMe drives)
>> 
>> Which drive SKU(s)? How full are they? Is their firmware up to date? How 
>> many RGWs? Have you tuned
>> your server network stack? Disabled Nagle? How many bucket OSDs? How many 
>> index OSDs? How many PGs
>> in the bucket and index pools? How many buckets? Do you have like 200M 
>> objects per? Do you have the
>> default max objects/shard setting?
>> 
>> Tiny objects are the devil of many object systems. I can think of cases 
>> where the above questions
>> could affect this case. I think you resharding in advance might help.
> 
> - Drives advertise themselves as “Dell Ent NVMe v2 AGN MU U.2 6.4TB” (think 
> that is Samsung under the Dell sticker)

AGN or AG is Dell for “Agnostic”, i.e. whatever the cheapest is they have on 
the shelf.  Samsung indeed is one of the potentials.  `smartctl -a` should show 
you what it actually is. 

> newest 2.5.0 firmware.

Verified by a DSU run?

>  They are pretty empty. Although there is some 10% capacity being used by 
> other stuff (RBD images)
> 
> - Single bucket. My import application already errored out after only 72 M 
> objects/476 GiB of data, and need a lot more. Objects are between 0 bytes and 
> 1 MB, 7 KB average.

Only 72M?  That’s a rather sizable bucket.  Were there existing objects as 
well?  Do you have the ability to spread across multiple buckets?  That would 
decrease your need to reshard.  As I interpret the docs, 199M is the default 
max number of objects above which auto-resharding won’t happen.

Since you *know* that you will be pulling in extreme numbers of objects, 
consider pre-sharding the bucket while it’s empty.  That will be dramatically 
faster in every way.
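
Pre-sharding is a one-liner once you have a rough target object count in mind 
(the bucket name and shard count here are just examples):

radosgw-admin bucket reshard --bucket=mybucket --num-shards=10069
radosgw-admin reshard status --bucket=mybucket    # confirm it finished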

> - Currently using only 1 RGW during my test run to simplify looking at logs, 
> although I have 4.

That’s a lot of ingest for even 4.  Would not be surprised if you’re saturating 
your connection limit or Linux networking. 

> - I cannot touch TCP socket options settings in my Java application.

Your RGW daemons are running on a Java application, not a Linux system?

> When you build a S3AsyncClient with the Java AWS SDK using the .crtBuilder(), 
> the SDK outsources the communication to the AWS aws-c-s3/aws-c-http/aws-io 
> CRT libraries written in C, and I never get to see the raw socket in Java.
> Looking at the source I don’t think Amazon is disabling the nagle algorithm 
> in their code.

On the server(s).  You’re unhappy with the *server* performance, no?  RGW can 
configure the frontend options to disable Nagle; search the archives for an 
article where doing so significantly improved small object latency.

> At least I don’t see TCP_NODELAY or similar options being used at the place 
> they seem to set the socket options:
> https://github.com/awslabs/aws-c-io/blob/c345d77274db83c0c2e30331814093e7c84c45e2/source/posix/socket.c#L1216
> 
> - Did not tune any network settings, and it is pretty quiet on the network 
> side, nowhere near saturating bandwidth because objects are so small.

There’s more to life than bandwidth.  somaxconn, nf_conntrack filling up, 
filling buffers, etc.
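
A few of the usual suspects to glance at on the RGW host while the import is 
running; these are checks, not recommended values:

sysctl net.core.somaxconn                                         # listen backlog ceiling
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
ss -s                                                             # socket summary: TIME-WAIT pileups, etc.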

> - Did not really tune anything else either yet. Pretty much a default cephadm 
> setup for now.
> 
> - See it (automagically) allocated 1024 PGs for .data and 32 for .index.

Those numbers are nearly useless without context: the rest of the info I 
requested. There was a reason for everything on the list.  Folks contribute to 
the list out of the goodness of their hearts, and aren’t paid for 
back-and-forth tooth-pulling. If your index pool and bucket pool share say 3x 
HDDs or 3x coarse-IU QLC, then don’t expect much.


Sounds like you have the pg autoscaler enabled, which probably doesn’t help.  
Your index pool almost certainly needs more PGs.  Probably the log pool as 
well, or set the rgw log levels to 0.

> 
> - Think the main delay is just Ceph wanting to make sure everything is 
> sync’ed to storage before reporting success. So that is why I am making a lot 
> of concurrent connections to perform multiple PUT requests simultaneously. 
> But even with 250 connections, it only does around 5000 objects per second 
> according to the “object ingress/egress” Grafana graph. Can probably raise it 
> some more…

With one RGW you aren’t going to get far, unless you have a 500 core CPU, and 
probably not even then.

> 
> 
> Had the default max. objects per shard settings for the dynamic sharding.
> But have now manually resharded to 10069 shards, and will have a go to see if 
> it works better now.
> 
> 
> Yours sincerely,
> 

[ceph-users] Re: RGW returning HTTP 500 during resharding

2024-09-28 Thread Anthony D'Atri


> 
> No retries.
> Is it expected that resharding can take so long?
> (in a setup with all NVMe drives)

Which drive SKU(s)?  How full are they?  Is their firmware up to date?  How 
many RGWs?  Have you tuned your server network stack? Disabled Nagle?   How 
many bucket OSDs? How many index OSDs? How many PGs in the bucket and index 
pools?  How many buckets?  Do you have like 200M objects per? Do you have the 
default max objects/shard setting? 

Tiny objects are the devil of many object systems.  I can think of cases where 
the answers to the above questions could affect this case.  I think resharding 
in advance might help you.




> And is it correct behavior that it returns HTTP response code 500, instead of 
> something that could indicate it is a retry'able condition?
> 
> If I would add my own code that does retry for a very long time, is there any 
> way I can detect the 500 is due to the resharding, instead of some other 
> condition that do is fatal?
> Also, is there any more efficient way to get a large amount of objects into 
> Ceph than individual PUTs?
> Yours sincerely,
> 
> Floris Bos
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: WAL on NVMe/SSD not used after OSD/HDD replace

2024-09-27 Thread Anthony D'Atri
Not a unique issue, and I suspect it affects lots of people who don’t know it 
yet.  

Might be that you should rm the old LVM first or specify it with an explicit 
create command.  
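
Roughly what I mean; device paths and VG/LV names are placeholders, and 
triple-check you're zapping the right LV before running anything like this:

ceph-volume lvm zap --destroy ceph-wal-vg/wal-osd-1          # clear the stale WAL LV
ceph-volume lvm create --data /dev/sdd --block.wal ceph-wal-vg/wal-osd-1
# or, with cephadm, make sure the OSD service spec still includes wal_devices
# so the replacement OSD is created with the WAL attached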

> On Sep 27, 2024, at 8:55 AM, mailing-lists  wrote:
> 
> Dear Ceph-users,
> I have a problem that I'd like to have your input for.
> 
> Preface:
> I have got a test-cluster and a productive-cluster. Both are setup the same 
> and both are having the same "issue". I am running Ubuntu 22.04 and deployed 
> ceph 17.2.3 via cephadm. Upgraded to 17.2.7 later on, which is the version we 
> are currently running. Since the issue seem to be the exact same on the 
> test-cluster, I will post test-cluster-outputs here for better readability.
> 
> The issue:
> I have replaced disks and after the replacement, it does not show that it 
> would use the NVMe as WAL device anymore. The LV still exists, but the 
> metadata of the osd does not show it, as it would be with any other osd/hdd, 
> that hasnt been replaced.
> 
> ODS.1 (incorrect, bluefs_dedicated_wal: "0")
> ```
> {
> "id": 1,
> "arch": "x86_64",
> "back_addr": 
> "[v2:192.168.6.241:6802/3213655489,v1:192.168.6.241:6803/3213655489]",
> "back_iface": "",
> "bluefs": "1",
> "bluefs_dedicated_db": "0",
> "bluefs_dedicated_wal": "0",
> "bluefs_single_shared_device": "1",
> "bluestore_bdev_access_mode": "blk",
> "bluestore_bdev_block_size": "4096",
> "bluestore_bdev_dev_node": "/dev/dm-3",
> "bluestore_bdev_devices": "sdd",
> "bluestore_bdev_driver": "KernelDevice",
> "bluestore_bdev_optimal_io_size": "0",
> "bluestore_bdev_partition_path": "/dev/dm-3",
> "bluestore_bdev_rotational": "1",
> "bluestore_bdev_size": "17175674880",
> "bluestore_bdev_support_discard": "1",
> "bluestore_bdev_type": "hdd",
> "bluestore_min_alloc_size": "4096",
> "ceph_release": "quincy",
> "ceph_version": "ceph version 17.2.7 
> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)",
> "ceph_version_short": "17.2.7",
> "ceph_version_when_created": "",
> "container_hostname": "bi-ubu-srv-ceph2-01",
> "container_image": 
> "quay.io/ceph/ceph@sha256:28323e41a7d17db238bdcc0a4d7f38d272f75c1a499bc30f59b0b504af132c6b",
> "cpu": "AMD EPYC 75F3 32-Core Processor",
> "created_at": "",
> "default_device_class": "hdd",
> "device_ids": "sdd=QEMU_HARDDISK_drive-scsi3",
> "device_paths": "sdd=/dev/disk/by-path/pci-:00:05.0-scsi-0:0:3:0",
> "devices": "sdd",
> "distro": "centos",
> "distro_description": "CentOS Stream 8",
> "distro_version": "8",
> "front_addr": 
> "[v2:.241:6800/3213655489,v1:.241:6801/3213655489]",
> "front_iface": "",
> "hb_back_addr": 
> "[v2:192.168.6.241:6806/3213655489,v1:192.168.6.241:6807/3213655489]",
> "hb_front_addr": 
> "[v2:.241:6804/3213655489,v1:.241:6805/3213655489]",
> "hostname": "bi-ubu-srv-ceph2-01",
> "journal_rotational": "1",
> "kernel_description": "#132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024",
> "kernel_version": "5.15.0-122-generic",
> "mem_swap_kb": "4018172",
> "mem_total_kb": "5025288",
> "network_numa_unknown_ifaces": "back_iface,front_iface",
> "objectstore_numa_unknown_devices": "sdd",
> "os": "Linux",
> "osd_data": "/var/lib/ceph/osd/ceph-1",
> "osd_objectstore": "bluestore",
> "osdspec_affinity": "dashboard-admin-1661853488642",
> "rotational": "1"
> }
> ```
> 
> ODS.0 (correct, bluefs_dedicated_wal: "1")
> ```
> {
> "id": 0,
> "arch": "x86_64",
> "back_addr": 
> "[v2:192.168.6.241:6810/3249286142,v1:192.168.6.241:6811/3249286142]",
> "back_iface": "",
> "bluefs": "1",
> "bluefs_dedicated_db": "0",
> "bluefs_dedicated_wal": "1",
> "bluefs_single_shared_device": "0",
> "bluefs_wal_access_mode": "blk",
> "bluefs_wal_block_size": "4096",
> "bluefs_wal_dev_node": "/dev/dm-0",
> "bluefs_wal_devices": "sdb",
> "bluefs_wal_driver": "KernelDevice",
> "bluefs_wal_optimal_io_size": "0",
> "bluefs_wal_partition_path": "/dev/dm-0",
> "bluefs_wal_rotational": "0",
> "bluefs_wal_size": "4290772992",
> "bluefs_wal_support_discard": "1",
> "bluefs_wal_type": "ssd",
> "bluestore_bdev_access_mode": "blk",
> "bluestore_bdev_block_size": "4096",
> "bluestore_bdev_dev_node": "/dev/dm-2",
> "bluestore_bdev_devices": "sdc",
> "bluestore_bdev_driver": "KernelDevice",
> "bluestore_bdev_optimal_io_size": "0",
> "bluestore_bdev_partition_path": "/dev/dm-2",
> "bluestore_bdev_rotational": "1",
> "bluestore_bdev_size": "17175674880",
> "bluestore_bdev_support_discard": "1",
> "bluestore_bdev_type": "hdd",
> "bluestore_min_alloc_size": "4096",
> "ceph_release": "quincy",
> "ceph_version": "ceph version 17.2.7 
> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)",
> "ceph_version_short": "17.2.7",
> "ceph_version_when_created": "",
> "con

[ceph-users] Re: ceph can list volumes from a pool but can not remove the volume

2024-09-26 Thread Anthony D'Atri
https://docs.ceph.com/en/reef/rbd/rbd-snapshot/ should give you everything you 
need.

Sounds like maybe you have snapshots / clones that have left the parent 
lingering as a tombstone?

Start with

rbd children volume-ssd/volume-8a30615b-1c91-4e44-8482-3c7d15026c28
rbd info volume-ssd/volume-8a30615b-1c91-4e44-8482-3c7d15026c28
rbd du volume-ssd/volume-8a30615b-1c91-4e44-8482-3c7d15026c28

That looks like the only volume in that pool?  If targeted cleanup doesn’t 
work, you could just delete the whole pool, but triple check everything before 
taking action here.
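
If it does turn out to be protected snapshots with clones, the usual sequence 
is roughly the below; the child image and snapshot names are placeholders, so 
verify each step against the `rbd children` output first:

rbd flatten volume-ssd/<child-image>       # detach each clone from the parent
rbd snap unprotect volume-ssd/volume-8a30615b-1c91-4e44-8482-3c7d15026c28@<snap>
rbd snap purge volume-ssd/volume-8a30615b-1c91-4e44-8482-3c7d15026c28
rbd rm volume-ssd/volume-8a30615b-1c91-4e44-8482-3c7d15026c28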


> On Sep 25, 2024, at 1:50 PM, bryansoon...@gmail.com wrote:
> 
> We have a volume in our cluster:
> 
> [r...@ceph-1.lab-a ~]# rbd ls volume-ssd
> volume-8a30615b-1c91-4e44-8482-3c7d15026c28
> 
> [r...@ceph-1.lab-a ~]# rbd rm 
> volume-ssd/volume-8a30615b-1c91-4e44-8482-3c7d15026c28
> Removing image: 0% complete...failed.
> rbd: error opening image volume-8a30615b-1c91-4e44-8482-3c7d15026c28: (2) No 
> such file or directory
> rbd: image has snapshots with linked clones - these must be deleted or 
> flattened before the image can be removed.
> 
> Any ideas on how can I remove the volume? Thanks
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [External Email] Overlapping Roots - How to Fix?

2024-09-20 Thread Anthony D'Atri
Well, it was pasted from a local cluster, meant as a guide not to be run 
literally.

> On Sep 20, 2024, at 12:48 PM, Dave Hall  wrote:
> 
> Stefan, Anthony,
> 
> Anthony's sequence of commands to reclassify the root failed with errors. so 
> I have tried to look a little deeper.
> 
> I can now see the extra root via 'ceph osd crush tree --show-shadow'.  
> Looking at the decompiled crush tree, I can also see the extra root:

> 
> root default {
> id -1   # do not change unnecessarily
> id -2 class hdd # do not change unnecessarily
> # weight 361.90518
> alg straw2
> hash 0  # rjenkins1
> item ceph00 weight 90.51434
> item ceph01 weight 90.29265
> item ceph09 weight 90.80554
> item ceph02 weight 90.29265
> }
> 
> Based on the hints given in the link provided by Stefan, it would appear that 
> the correct solution might be to get rid of 'id -2' and change id -1 to class 
> hdd, 
> 
> root default {
> id -1 class hdd # do not change unnecessarily
> # weight 361.90518
> alg straw2
> hash 0  # rjenkins1
> item ceph00 weight 90.51434
> item ceph01 weight 90.29265
> item ceph09 weight 90.80554
> item ceph02 weight 90.29265
> }
> 
> but I'm no expert and anxious about losing data.  
> 
> The rest of the rules in my crush map are:
> 
> # rules
> rule replicated_rule {
> id 0
> type replicated
> step take default   # missing device class
> step chooseleaf firstn 0 type host
> step emit
> }
> rule block-1 {
> id 1
> type erasure
> step set_chooseleaf_tries 5
> step set_choose_tries 100
> step take default class hdd
> step choose indep 0 type osd
> step emit
> }
> rule default.rgw.buckets.data {
> id 2
> type erasure
> step set_chooseleaf_tries 5
> step set_choose_tries 100
> step take default class hdd
> step choose indep 0 type osd
> step emit
> }
> rule ceph-block {
> id 3
> type erasure
> step set_chooseleaf_tries 5
> step set_choose_tries 100
> step take default class hdd
> step choose indep 0 type osd
> step emit
> }
> rule replicated-hdd {
> id 4
> type replicated
> step take default class hdd
> step choose firstn 0 type osd
> step emit
> }
> 
> # end crush map
> 
> Of these, the last - id 4 - is one that I added while trying to figure this 
> out.  What this tells me is that the 'take' step in rule id 0 should probably 
> change to 'step take default class hdd'.
> 
> I also notice that each of my host stanzas (buckets) has what looks like two 
> roots.  For example
> 
> host ceph00 {
> id -3 # do not change unnecessarily
> id -4 class hdd # do not change unnecessarily
> # weight 90.51434
> alg straw2
> hash 0 # rjenkins1
> item osd.0 weight 11.35069
> item osd.1 weight 11.35069
> item osd.2 weight 11.35069
> item osd.3 weight 11.35069
> item osd.4 weight 11.27789
> item osd.5 weight 11.27789
> item osd.6 weight 11.27789
> item osd.7 weight 11.27789
> }
> 
> I assume I may need to clean this up somehow, or perhaps this is the real 
> problem.
> 
> Please advise.
> 
> Thanks.
> 
> -Dave
> 
> --
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu <mailto:kdh...@binghamton.edu>
> 
> On Thu, Sep 19, 2024 at 3:56 AM Stefan Kooman  <mailto:ste...@bit.nl>> wrote:
>> On 19-09-2024 05:10, Anthony D'Atri wrote:
>> > 
>> > 
>> >>
>> >> Anthony,
>> >>
>> >> So it sounds like I need to make a new crush rule for replicated pools 
>> >> that specifies default-hdd and the device class?  (Or should I go the 
>> >> other way around?  I think I'd rather change the replicated pools even 
>> >> though there's more of them.)
>> > 
>> > I think it would be best to edit the CRUSH rules in-situ so that each 
>> > specifies the device class, that way if you do get different media in the 
>> > future, you'll be ready.  Rather than messing around with new rules and 
>> > modifying pools, this is arguably one of the few times when one would 
>> > decompile, edit, recompile, and inject the CRUSH map in toto.
>> > 
>> > I haven't tried this myself, but maybe something like the below, to avoid 
>> > the PITA and potential for error of edting the decompiled text file by 
>> > hand.
>> > 
>> > 
>> > ceph osd getcrushmap -o original.crush
>> > crushtool -d original.crush -o original.txt
>> > crushtool -i original.crush --reclassify --reclassify-root default hdd 
>> > --set-subtree-class default hdd -o adjusted.crush
>> > crushtool -d adjusted.crush -o adjusted.txt
>> > crushtool -i original.crush --compare adjusted.crush
>> > ceph osd setcrushmap -i adjusted.crush
>> 
>> This might be of use as well (if a lot of data would move): 
>> https://blog.widodh.nl/2019/02/comparing-two-ceph-crush-maps/
>> 
>> Gr. Stefan

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CPU requirements

2024-09-19 Thread Anthony D'Atri


> Thank you for your explanations and references. I will check them all. In the 
> meantime it turned out that the disks for Ceph will come from SAN

Be prepared for the additional latency and amplified network traffic.

> Probably in this case the per OSD CPU cores can be lowered to 2 CPU/OSD. But 
> I will try to find some references for this usecase.

It’s hard to give guidance for such things beyond “it depends”, but it does 
depend on the CPUs and media in use.

>> I think it depends on the type of cluster you're trying to build.

Indeed.  If you’re storing archival data, you might not care that much about 
throughput or latency.  With object storage you care a bit more; with block you 
usually do care about latency.


>> If made of HDD/SSD OSDs, then 2 cores per OSD is probably still valid.

Back in the Dumpling era the rule of thumb was 1 vcore/thread per OSD.  Mind 
you, the performance of a then-current vcore vs. one of today's is worlds 
apart, and back then most people were stuck with HDDs.  And Ceph itself has 
gained substantially in terms of performance.


>> I believe the 5-6 cores per OSDs recommendation you mentioned relates to all 
>> flash (NVMe) clusters where CPUs and especially memory bandwidth can't 
>> always keep up with all flash storage and small IO workloads.

Indeed, Mark and Dan’s journey to 1TB/s presentation is a great resource.  
Beyond a few cores, you often are into a long tail of diminishing returns.  It 
depends on what you’re solving for.

>> "Ceph can easily utilize five or six cores on real clusters and up to about 
>> fourteen cores on single OSDs in isolation" [1]

I think I’m to thank — or blame as you see fit — for that sentence.

>> I'd say 'real clusters' here stands for all flash NVMe clusters with 
>> multiples OSDs on multiple hosts with x3 replication

It was indeed intended to mean a production-class cluster with replication / 
EC, vs a standalone OSD not doing any subops.  While I maintain that HDDs are 
usually a false economy, I was not intending to limit a “real” cluster to 
non-rotational media.  Probably “production-class” would have been a better 
term.

>> while 'a single OSDs in isolation' refers to a single NVMe OSD, maybe on a 
>> single host cluster with no replication at all.

Yes.  My understanding of some of Mark’s perf work is that it was against a 
single OSD with no sub-ops, to test the OSD code without confounding variables. 
 In a production-class cluster, network latency and congestion may easily be 
dominant factors.

>> But this may need further confirmation. At the end of day, if your intention 
>> is to build an all flash cluster then I think you should multiply 5-6 cores 
>> by the number of NVMe OSDs and chose CPU accordingly (with the highest 
>> frequency you can get for the bucks).

My sense is that as a first-order model, with modern CPUs for NVMe OSDs, 
perhaps start with 4 vcores / threads per OSD, and more is gravy.  If running 
mons/mgrs, etc. on the same nodes, plan for a few cores for those.  After say 
5-6 you get into diminishing return territory.  Watch out for C-states, and on 
EPYCs, IOMMU.  Basically all the things in Mark and Dan’s excellent 
presentation.


>> 
>> You might want to check Mark's talks [2][3] and studies [4][5][6] about all 
>> flash Ceph clusters. It explains it all and suggests some modern hardware 
>> for all flash storage, if that's what you're building.
>> 
>> Cheers,
>> Frédéric.
>> 
>> [1]https://github.com/ceph/ceph/pull/44466#discussion_r779650295
>> [2]https://youtu.be/S2rPA7qlSYY
>> [3]https://youtu.be/pGwwlaCXfzo
>> [4]https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/
>> [5]https://docs.clyso.com/blog/ceph-a-journey-to-1tibps
>> [6]https://docs.clyso.com/blog/clyso-enterprise-storage-all-flash-ceph-deployment-guide-preview/
>> 
>> - Le 18 Sep 24, à 10:15, Laszlo budailas...@componentsoft.eu  a écrit :
>> 
>>> Hello everyone,
>>> 
>>> I'm trying to understand the CPU requirements for recent versions of CEPH.
>>> Reading the documentation
>>> (https://docs.ceph.com/en/latest/start/hardware-recommendations/) I cannot 
>>> get
>>> any conclusion about how to plan CPUs for ceph. There is the following
>>> statement:
>>> 
>>> "With earlier releases of Ceph, we would make hardware recommendations 
>>> based on
>>> the number of cores per OSD, but this cores-per-osd metric is no longer as
>>> useful a metric as the number of cycles per IOP and the number of IOPS per 
>>> OSD.
>>> For example, with NVMe OSD drives, Ceph can easily utilize five or six 
>>> cores on
>>> real clusters and up to about fourteen cores on single OSDs in isolation. So
>>> cores per OSD are no longer as pressing a concern as they were. When 
>>> selecting
>>> hardware, select for IOPS per core."
>>> 
>>> How should I understand this? On one side it's saying "with NVMe OSD drives,
>>> Ceph can easily utilize five or six cores on real clusters"
>>>  and then continues: "and up to about fourteen cores on single OSDs in
>>>  isolation."

[ceph-users] Re: [External Email] Overlapping Roots - How to Fix?

2024-09-18 Thread Anthony D'Atri


> 
> Anthony,
> 
> So it sounds like I need to make a new crush rule for replicated pools that 
> specifies default-hdd and the device class?  (Or should I go the other way 
> around?  I think I'd rather change the replicated pools even though there's 
> more of them.)

I think it would be best to edit the CRUSH rules in-situ so that each specifies 
the device class, that way if you do get different media in the future, you'll 
be ready.  Rather than messing around with new rules and modifying pools, this 
is arguably one of the few times when one would decompile, edit, recompile, and 
inject the CRUSH map in toto.  

I haven't tried this myself, but maybe something like the below, to avoid the 
PITA and potential for error of editing the decompiled text file by hand.


ceph osd getcrushmap -o original.crush 
crushtool -d original.crush -o original.txt 
crushtool -i original.crush --reclassify --reclassify-root default hdd 
--set-subtree-class default hdd -o adjusted.crush 
crushtool -d adjusted.crush -o adjusted.txt 
crushtool -i original.crush --compare adjusted.crush 
ceph osd setcrushmap -i adjusted.crush 



> 
> Then, after I create this new rule, I simply assign the pool to a new crush 
> rule using a command similar to the one shown in your note in the link you 
> referenced?
> 
> Thanks.
> 
> -Dave
> 
> --
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu <mailto:kdh...@binghamton.edu>
> 
> On Wed, Sep 18, 2024 at 2:10 PM Anthony D'Atri  <mailto:anthony.da...@gmail.com>> wrote:
>> 
>> 
>>> 
>>> Helllo,
>>> 
>>> I've reviewed some recent posts in this list and also searched Google for
>>> info about autoscale and overlapping roots.  In what I have found I do not
>>> see anything that I can understand regarding how to fix the issue -
>>> probably because I don't deal with Crush on a regular basis.
>> 
>> 
>> Checkout the Note in this section:  
>> https://docs.ceph.com/en/reef/rados/operations/placement-groups/#viewing-pg-scaling-recommendations
>> 
>> I added that last year I think it was as a result of how Rook was creating 
>> pools.
>> 
>>> 
>>> From what I read and looking at 'ceph osd crush rule dump', it looks like
>>> the 8 replicated pools have
>>> 
>>>"op": "take",
>>>"item": -1,
>>>"item_name": "default"
>>> 
>>> whereas the 2 EC pools have
>>> 
>>>"op": "take",
>>>"item": -2,
>>>"item_name": "default~hdd"
>>> 
>>> To be sure, all of my OSDs are identical - HDD with SSD WAL/DB.
>>> 
>>> Please advise on how to fix this.
>> 
>> The subtlety that's easy to miss is that when you specify a device class for 
>> only *some* pools, the pools/rules that specify a device class effectively 
>> act on a "shadow" CRUSH root.  My terminology may be inexact there.
>> 
>> So I think if you adjust your CRUSH rules so that they all specify a device 
>> class -- in your case all the same device class -- your problem (and 
>> balancer performance perhaps) will improve.
>> 
>> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Overlapping Roots - How to Fix?

2024-09-18 Thread Anthony D';Atri



> 
> Helllo,
> 
> I've reviewed some recent posts in this list and also searched Google for
> info about autoscale and overlapping roots.  In what I have found I do not
> see anything that I can understand regarding how to fix the issue -
> probably because I don't deal with Crush on a regular basis.


Checkout the Note in this section:  
https://docs.ceph.com/en/reef/rados/operations/placement-groups/#viewing-pg-scaling-recommendations

I added that last year I think it was as a result of how Rook was creating 
pools.

> 
> From what I read and looking at 'ceph osd crush rule dump', it looks like
> the 8 replicated pools have
> 
>"op": "take",
>"item": -1,
>"item_name": "default"
> 
> whereas the 2 EC pools have
> 
>"op": "take",
>"item": -2,
>"item_name": "default~hdd"
> 
> To be sure, all of my OSDs are identical - HDD with SSD WAL/DB.
> 
> Please advise on how to fix this.

The subtlety that's easy to miss is that when you specify a device class for 
only *some* pools, the pools/rules that specify a device class effectively act 
on a "shadow" CRUSH root.  My terminology may be inexact there.

So I think if you adjust your CRUSH rules so that they all specify a device 
class -- in your case all the same device class -- your problem (and balancer 
performance perhaps) will improve.
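
If you would rather not edit the existing rules in place, the other route is a class-aware rule plus pointing each pool at it; a sketch (rule and pool names are placeholders):

ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool set <poolname> crush_rule replicated_hdd    # repeat for each replicated pool

Either way the end state is the same: every rule names a device class.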


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD w/erasure coding

2024-09-15 Thread Anthony D';Atri
100% agree.  

I’ve seen claims that SSDs synch quickly so 2 is enough.   Such claims are 
shortsighted.  

I have personally witnessed cases where device failures and OSD flaps / crashes at 
just the wrong time resulted in no clear latest copy of PGs.  

There are cases where data loss isn’t catastrophic, but unless you’re sure, you 
want either R3 for durability or EC to minimize space amp.   A 2,2 profile, for 
example.  
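
If you already have size=2 pools, bumping them is a one-liner per pool (expect backfill while the third replica is created); the pool name below is taken from the quoted guide and is just an example:

ceph osd pool set volumes size 3
ceph osd pool set volumes min_size 2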

> On Sep 15, 2024, at 10:35 AM, Joachim Kraftmayer 
>  wrote:
> 
> first comment on the replicated pools:
> the replication size for rbd pools of 2 is not suitable for production
> clusters. It is only a matter of time before you lose data.
> Joachim
> 
> 
>  www.clyso.com
> 
>  Hohenzollernstr. 27, 80801 Munich
> 
> Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306
> 
> 
> 
>> On Sat, 14 Sep 2024 at 14:04,  wrote:
>> 
>> Here You have guide
>> 
>> https://febryandana.xyz/posts/deploy-ceph-openstack-cluster/
>> 
>> in short
>> 
>> ceph osd pool create images 128
>> ceph osd pool set images size 2
>> while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .;sleep 1; done
>> 
>> ceph osd pool create volumes 128
>> ceph osd pool set volumes size 2
>> while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .;sleep 1; done
>> 
>> ceph osd pool create vms 128
>> ceph osd pool set vms size 2
>> while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .;sleep 1; done
>> 
>> ceph osd erasure-code-profile set ec-22-profile k=2 m=2
>> crush-device-class=ssd
>> ceph osd erasure-code-profile ls
>> ceph osd erasure-code-profile get ec-22-profile
>> 
>> ceph osd pool create images_data 128 128 erasure ec-22-profile
>> while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .;sleep 1; done
>> 
>> ceph osd pool create volumes_data 128 128 erasure ec-22-profile
>> while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .;sleep 1; done
>> 
>> ceph osd pool create vms_data 128 128 erasure ec-22-profile
>> while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .;sleep 1; done
>> 
>> ceph osd pool ls detail
>> 
>> ceph osd pool set images_data allow_ec_overwrites true
>> ceph osd pool set volumes_data allow_ec_overwrites true
>> ceph osd pool set vms_data allow_ec_overwrites true
>> 
>> ceph osd pool application enable volumes rbd
>> ceph osd pool application enable images rbd
>> ceph osd pool application enable vms rbd
>> ceph osd pool application enable volumes_data rbd
>> ceph osd pool application enable images_data rbd
>> ceph osd pool application enable vms_data rbd
>> On ceph.conf You need put as below
>> 
>> [client.glance]
>> rbd default data pool = images_data
>> 
>> [client.cinder]
>> rbd default data pool = volumes_data
>> 
>> [client.nova]
>> rbd default data pool = vms_data
>> for permission you probably also need add
>> 
>> caps mon = "allow r, allow command \\"osd blacklist\\", allow command
>> \\"osd blocklist\\", allow command \\"blacklistop\\", allow command
>> \\"blocklistop\\""
>> Newer versions might not work with blacklist anymore.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Numa pinning best practices

2024-09-14 Thread Anthony D';Atri


> 
> 
> On Fri, Sep 13, 2024 at 12:16 PM Anthony D'Atri  wrote:
>> My sense is that with recent OS and kernel releases (e.g., not CentOS 8) 
>> irqbalance does a halfway decent job.
> 
> Strongly disagree! Canonical has actually disabled it by default in
> Ubuntu 24.04 and IIRC Debian already does, too:
> https://discourse.ubuntu.com/t/ubuntu-24-04-lts-noble-numbat-release-notes/39890#irqbalance-no-more-installed-and-enabled-by-default

Interesting.  The varied viewpoints of the Ceph community are invaluable.

Reading the above page, I infer that recent kernels do well by default now?

> While irqbalance _can_ do a decent job in some scenarios, it can also
> really mess things up. For something like Ceph where you are likely
> running a lot of the same platform(s) and are seeking predictability,
> you can probably do better controlling affinity yourself. At least,
> you should be able to do no worse.

Fair enough, would love to 

> 
>>> I came across a recent Ceph day NYC talk from Tyler Stachecki (Bloomberg) 
>>> [1] and a Reddit post [2]. Apparently there is quita a bit of performance 
>>> to gain when NUMA is optimally configured for Ceph.
>> 
>> My sense is that NUMA is very much a function of what CPUs one is using, and 
>> 1S vs 2S / 4S.  With 4S servers I've seen people using multiple NICs, 
>> multiple HBAs, etc., effectively partitioning into 4x 1S servers.  Why not 
>> save yourself hassle and just use 1S to begin with?  4+S-capable CPUs cost 
>> more and sometimes lag generationally.
> 
> Hey, that's me!

I first saw an elaborate 4S pinning scheme at an OpenStack Summit, 2016 or so.

> As Anthony says, YMMV based on your platform, what you use Ceph for
> (RBD?), and also how much Ceph you're running.
> 
> Early versions of Zen had quite bad core to core memory latency when
> you hopped across CCD/CCX.

There’s a graphic out there comparing those latencies for …. IIRC, Icelake and 
Rome or Milan.

> There's some early warning signs in the Zen
> 5 client reviews that such latencies may be back to bite (I have not
> gotten my hands on one yet, nor have I see anyone explain "why" yet):

Ouch.  Would one interpret this as Genoa being better?

> https://www.anandtech.com/show/21524/the-amd-ryzen-9-9950x-and-ryzen-9-9900x-review/3
> 
> In the diagram within that article you can clearly see the ~180ns
> difference, as well as the "striping" effect, when you cross a CCX.
> I'm wondering this is a byproduct of the new ladder cache design
> within the Zen 5 CCX? Regardless: if you have latencies like this
> within a single socket, you likely stand to gain something by pinning
> processes to NUMA nodes even with 1P servers. The results mentioned in
> my presentation are all based on 1P platforms as well for comparison.

Which presentation?  I want to read through that carefully.  I’m about to 
deploy a bunch of 1S EPYC 9454 systems with 30TB SSDs for RBD, RGW, and perhaps 
later CephFS.  After clamoring for 1S systems for years I finally got my wish, 
now I want to optimize them as best I can, especially with 12x 30TB SSDs each 
(PCI-e Gen 4, QLC and TLC).  Bonded 100GE.

In the past I inherited scripting that spread HBA and NIC interrupts across 
physical cores (every other thread) and messed with the CPU governor, but have 
not dived deeply into NVMe interrupts yet.



> 
>>> So what is most optimal there? Does it still make sense to have the Ceph 
>>> processes bound to the CPU where their respective NVMe resides when the 
>>> network interface card is attached to another CPU / NUMA node? Or would 
>>> this just result in more inter NUMA traffic (latency) and negate any 
>>> possible gains that could have been made?
> 
> I never benchmarked this, so I can only guess.
> 
> However: if you look at /proc/interrupts, you will see that most if
> not all enterprise NVMes in Linux effectively get allocated a MSI
> vector per thread per NVMe. Moreover, if you look at
> /proc/irq/<N>/smp_affinity for each of those MSI vectors, you will see
> that they are each pinned to exactly one CPU thread.
> 
> In my experience, when NUMA pinning OSDs, only the MSI vectors local
> to the NUMA node where the OSD runs really have any activity. That
> seems optimal, so I've never had a reason to look any further.
> 
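
A rough way to eyeball that on a node (a sketch; adjust the grep to your device naming):

grep nvme /proc/interrupts | awk '{print $1}' | tr -d ':' | while read irq; do
    printf '%s -> %s\n' "$irq" "$(cat /proc/irq/$irq/smp_affinity_list)"
done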
>>> So the default policy seems to be active, and no Ceph NUMA affinity seems 
>>> to have taken place. Can someone explain me what Ceph (cephadm) is 
>>> currently doing when the "osd_numa_auto_affinity" config setting is true 
>>> and NUMA is exposed?
> 
> I, personally, am in the camp of folk who are not cephadm fans. What I
> did in my case was to writ

[ceph-users] Re: Numa pinning best practices

2024-09-13 Thread Anthony D';Atri
Lots of opinions in this arena.  Below are mine.  ymmv.

>> Haven't really found a proper descripton in case of 2 socket how to pin osds 
>> to numa node, only this: 
>> https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#Ceph-Storage-Node-NUMA-Tuning
>> 

A bit dated, note the reference to Jewel and thus XFS / Filestore.


>> Is there anybody have some good how to on this topic?

Wido's presentation from a few years ago re affordable NVMe Ceph may be of 
interest.

> I'm also interested in how to configure NUMA for Ceph.

My sense is that with recent OS and kernel releases (e.g., not CentOS 8) 
irqbalance does a halfway decent job.

> I came across a recent Ceph day NYC talk from Tyler Stachecki (Bloomberg) [1] 
> and a Reddit post [2]. Apparently there is quita a bit of performance to gain 
> when NUMA is optimally configured for Ceph.

My sense is that NUMA is very much a function of what CPUs one is using, and 1S 
vs 2S / 4S.  With 4S servers I've seen people using multiple NICs, multiple 
HBAs, etc., effectively partitioning into 4x 1S servers.  Why not save yourself 
hassle and just use 1S to begin with?  4+S-capable CPUs cost more and sometimes 
lag generationally.

cf. Mark and Dan's Journey to 1TB/s post, it discusses the impact of 
inter-socket communication and IOMMU.  With EPYCs, one may gain by disabling 
IOMMU on the kernel commandline, and other tunings including NPS values.  Xeons 
may require less adjustment from defaults. There is growing favor for 1S 
servers.  2S has been de-facto for years, because CPUs with higher core counts 
were disproportionally expensive. 

XCC vs MCC CPU SKUs may matter too.

With Emerald Rapids and Genoa, we may be able to afford a desired core/thread 
count (say, 4-6 vcores/threads per NVMe OSD) with a single socket.  Note that 
1S servers may have differing RAM population dynamics.

> 
> Red Hat documentation (hyper converged infra)

I may be heterodox, but I dislike convergence.  ymmv.

> suggests to pin the Ceph processes on the CPU with the storage controller / 
> NIC attached [3]. In an all flash system there is not just one storage 
> controller but the NVMe are attached to different PCIe buses spread across 
> the different NUMA nodes.

Look up your server / motherboard specifically.  With, say, Dell systems, there 
are often multiple variants of a model, each with very different PCI-e - NVMe 
mappings.  Especially if there is an antiquated and counterproductive RAID HBA 
present, the NVMe bays may not be even close to evenly distributed across two 
sockets.
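
Checking where things actually landed is cheap (a sketch; eth0 is a placeholder for your NIC, and -1 means the platform didn't report a node):

for n in /sys/class/nvme/nvme*; do
    echo "$n -> NUMA node $(cat $n/device/numa_node)"
done
cat /sys/class/net/eth0/device/numa_node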


> So what is most optimal there? Does it still make sense to have the Ceph 
> processes bound to the CPU where their respective NVMe resides when the 
> network interface card is attached to another CPU / NUMA node? Or would this 
> just result in more inter NUMA traffic (latency) and negate any possible 
> gains that could have been made?
> 
> Is the benefit of NUMA optimization so large that it would make sense to add 
> another NIC to the system, add it to the other NUMA domain and have half the 
> OSDs listen on one nic (IP), and the rest of the OSDs on the other nic 
> (separate IP)?

That sounds like two servers to me.  One reason I favor 1S 1U servers for Ceph.

> 
> Ceph has an admin command to show the NUMA status that gives the following 
> output for a node called storage1:
> 
> ceph osd numa-status
> OSD  HOST  NETWORK  STORAGE  AFFINITY  CPUS
>  0  storage1-2 -  -
>  1  storage1-2 -  -
>  2  storage1-2 -  -
>  3  storage1-1 -  -
>  4  storage1-1 -  -
>  5  storage1-0 -  -
>  6  storage1-0 -  -
>  7  storage1-0 -  -
>  8  storage1-2 -  -
> 
> But I'm unsure what that means. Because when I look up the numa status for 
> the OSD processes it shows the following:
> 
> numactl -s 22579
> policy: default
> preferred node: current
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31
> cpubind: 1 2
> nodebind: 1 2
> membind: 1 2
> 
> And its the same for all OSDs (NUMA node 0 / 4 only have CPUs and no memory 
> (AMD EPYC 7343 16-Core Processor)).

cf. the NPS setting in BIOS.


> So the default policy seems to be active, and no Ceph NUMA affinity seems to 
> have taken place. Can someone explain me what Ceph (cephadm) is currently 
> doing when the "osd_numa_auto_affinity" config setting is true and NUMA is 
> exposed?
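
If the automatic affinity isn't doing what you want, you can also pin explicitly; a sketch (the OSD id and node number are just examples, and the daemon has to restart to pick it up):

ceph config set osd.5 osd_numa_node 1
ceph orch daemon restart osd.5
ceph osd numa-status     # verify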
> 
> Thanks in advance for any NUMA clue you can give me.
> 
> Gr. Stefan
> 
> [1]: https://www.youtube.com/watch?v=u8vgo2jfMpo
> [2]: 
> https://www.reddit.com/r/ceph/comments/15b3rp8/clyso_enterpris

[ceph-users] Re: [RGW][cephadm] How to configure RGW as code and independantely of daemon names ?

2024-09-12 Thread Anthony D';Atri
If those need improvement, please tag me on a tracker ticket.

> On Sep 12, 2024, at 2:37 AM, Robert Sander  
> wrote:
> 
> Hi,
> 
> On 9/11/24 22:00, Gilles Mocellin wrote:
> 
>> Is there some documentation I didn't find, or is this the kind of detail 
>> only a
>> developper can  find ?
> 
> It should be in these sections:
> 
> https://docs.ceph.com/en/reef/rados/configuration/ceph-conf/#configuration-sections
> https://docs.ceph.com/en/reef/rados/configuration/ceph-conf/#monitor-configuration-database
> 
> Kindest Regards
> -- 
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
> 
> https://www.heinlein-support.de
> 
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
> 
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Successfully using dm-cache

2024-09-12 Thread Anthony D';Atri
I *think* the rotational flag isn't used at OSD creation time, but rather each 
time the OSD starts to select between options that have _hdd and _ssd values.
If I'm mistaken, please do enlighten me.

One can use a udev rule to override the kernel's deduced rotational value.
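
For example (a sketch; the rule file name and the dm-* match are illustrative, match whichever device node the OSD actually sits on):

cat > /etc/udev/rules.d/99-force-nonrotational.rules <<'EOF'
ACTION=="add|change", KERNEL=="dm-*", ATTR{queue/rotational}="0"
EOF
udevadm control --reload
udevadm trigger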

> On Sep 12, 2024, at 10:08 AM, Frank Schilder  wrote:
> 
> 
> - How deployed (lvm+dm-cache first, then OSD or other way around): This is 
> related to the previous question. The order might be important. My guess is 
> that after attaching the cache the rotational flag from the cache device is 
> inherited. If this happens before OSD creation, the OSD will be created with 
> SSD tunings and with HDD tunings otherwise. How did you do it?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Somehow throotle recovery even further than basic options?

2024-09-06 Thread Anthony D';Atri

> This sounds interesting because this way the pressure wouldn't be too big if 
> go like 0.1 0.2 OSD by OSD.

I used to do this as well, back before pg-upmap was a thing, and while I still 
had Jewel clients.  It is however less efficient, because some data ends up 
moving more than once.  Upweighting a handful of OSDs at the same time may 
spread the load and allow faster progress than going one at a time.  Say one 
per host or one per failure domain.

The PG remapping tools allow fine-grained control with more efficiency, though 
any clients that aren’t Luminous or later will have a really bad day.

> What I can see how ceph did it, when add the new OSDs, the complete host get 
> the remapped pgs from other hosts also, so the old osds PG number increased 
> by like +50% (which was already overloaded) and slowly rebalance to the newly 
> added osds on the same host. This initial pressure to big.

I don’t follow; adding new OSDs should on average decrease the PG replicas on 
the existing OSDs.  But imbalances during topology changes are one reason I 
like to raise mon_max_pg_per_osd to 1000, otherwise you can end up with PGs 
that won’t activate.

> 
> This "misplaced ratio to 1%" I've never tried, let me read a bit, thank you.
> 
> Istvan
> 
> From: Eugen Block 
> Sent: Saturday, September 7, 2024 4:55:40 AM
> To: ceph-users@ceph.io 
> Subject: [ceph-users] Re: Somehow throotle recovery even further than basic 
> options?
> 
> Email received from the internet. If in doubt, don't click any link nor open 
> any attachment !
> 
> 
> I can’t say anything about the pgremapper, but have you tried
> increasing the crush weight gradually? Add new OSDs with crush initial
> weight 0 and then increase it in small steps. I haven’t used that
> approach for years, but maybe that can help here. Or are all OSDs
> already up and in? Or you could reduce the max misplaced ratio to 1%
> or even lower (default is 5%)?
> 
> Zitat von "Szabo, Istvan (Agoda)" :
> 
>> Forgot to paste, somehow I want to reduce this recovery operation:
>> recovery: 0 B/s, 941.90k keys/s, 188 objects/s
>> To 2-300Keys/sec
>> 
>> 
>> 
>> 
>> From: Szabo, Istvan (Agoda) 
>> Sent: Friday, September 6, 2024 11:18 PM
>> To: Ceph Users 
>> Subject: [ceph-users] Somehow throotle recovery even further than
>> basic options?
>> 
>> Hi,
>> 
>> 4 years ago we've created our cluster with all disks 4osds (ssds and
>> nvme disks) on octopus.
>> The 15TB SSDs still working properly with 4 osds but the small 1.8T
>> nvmes with the index pool not.
>> Each new nvme osd adding to the existing nodes generates slow ops
>> with scrub off, recovery_op_priority 1, backfill and recovery 1-1.
>> I even turned off all index pool heavy sync mechanism but the read
>> latency still high which means recovery op pushes it even higher.
>> 
>> I'm trying to somehow add resource to the cluster to spread the 2048
>> index pool pg (in replica 3 means 6144pg index pool) but can't make
>> it more gentle.
>> 
>> The balancer is working in upmap with max deviation 1.
>> 
>> Have this script from digitalocean
>> https://github.com/digitalocean/pgremapper, is there anybody tried
>> it before how is it or could this help actually?
>> 
>> Thank you the ideas.
>> 
>> 
>> This message is confidential and is for the sole use of the intended
>> recipient(s). It may also be privileged or otherwise protected by
>> copyright or other legal rules. If you have received it by mistake
>> please let us know by reply email and delete it from your system. It
>> is prohibited to copy this message or disclose its content to
>> anyone. Any confidentiality or privilege is not waived or lost by
>> any mistaken delivery or unauthorized disclosure of the message. All
>> messages sent to and from Agoda may be monitored to ensure
>> compliance with company policies, to protect the company's interests
>> and to remove potential malware. Electronic messages may be
>> intercepted, amended, lost or deleted, or contain viruses.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> This message is confidential and is for the sole use of the intended 
> recipient(s). It may also be privileged or otherwise protected by copyright 
> or other legal rules. If you have received it by mistake please let us know 
> by reply email and delete it from your system. It is prohibited to copy this 
> message or disclose its content to anyone. Any co

[ceph-users] Re: Prefered distro for Ceph

2024-09-05 Thread Anthony D';Atri
The bare metal has to run *something*, whether Ceph is run from packages or 
containers.



>> what distro would you prefer and why for the production Ceph? We use Ubuntu
>> on most of our Ceph clusters and some are Debian. Now we are thinking about
>> unifying it by using only Debian or Ubuntu.
>> 
>> I personally prefer Debian mainly for its stability and easy
>> upgrade-in-place. What are yours preferences?
> 
> I'm not the right person to answer you, I just wondering why not use the
> orchestrator ? 
> 
> Regards
> 
> JAS
> --
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Prefered distro for Ceph

2024-09-05 Thread Anthony D';Atri
I personally prefer Ubuntu.

I like RPM / YUM better than APT / DEB, but Ubuntu provides a far richer set of 
prebuilt packages, obviating the mess that is EPEL and most of the need to 
compile and package myself.  Ubuntu's kernels are also far more current than 
those of the RHEL family.  I've been forced in RHEL land to use elrepo-kernel 
kernels, but potentially with certain issues.

Stability is IMHO a factor of how you deploy, not of the distribution itself.  
Maintain local snapshots of the upstream package repositories, and deploy 
against those.  That way all your systems have a consistent set of package 
revisions, which you control, vs whatever upstream has on a given day.



ymmv.

> On Sep 5, 2024, at 6:25 AM, Denis Polom  wrote:
> 
> Hi guys,
> 
> what distro would you prefer and why for the production Ceph? We use Ubuntu 
> on most of our Ceph clusters and some are Debian. Now we are thinking about 
> unifying it by using only Debian or Ubuntu.
> 
> I personally prefer Debian mainly for its stability and easy 
> upgrade-in-place. What are yours preferences?
> 
> Thank you
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible installation error

2024-09-02 Thread Anthony D';Atri
I should know to not feed the trolls, but here goes.  I was answering a 
question asked to the list, not arguing for or against containers.


> 2. Logs in containerized ceph almost all go straight to the system journal. 
> Specialized subsystems such as Prometheus can be configured in other ways, 
> but everything's filed under /var/lib/ceph// so there's 
> relatively little confusion.

I see various other paths, which often aren’t /var/log. And don’t conflate 
“containerized Ceph” with “cephadm”.  There are lots of containerized 
deployments that don’t use cephadm / ceph orch.

> 3. I don't understand this, as I never stop all services just to play with 
> firewalls. RHEL 8+ support firewall-cmd

Lots of people don’t run RHEL, and I wrote “iptables”, not whatever obscure 
firewall system RHEL also happens to ship.

> 4. Ceph knows exactly the names and locations of its containers

Sometimes.  See above.

> (NOTE: a "package" is NOT a "container")

Nobody claimed otherwise.

> You don't talk to "Docker*" directly, though, as systemd handles that.

Not in my experience.  Docker is not Podman.  I have Ceph clusters *right now* 
that use Docker and do not have Podman installed.  They also aren’t RHEL.

> 6. As I said, Ceph does almost everything via cephadm

When deployed with cephadm.  You asked about containers, not about cephadm.  
They are not fungible.

> or ceph orch when running in containers, which actually means you need to 
> learn less.

You assume that everyone already knows how containers roll, including the 
subtle dynamics of /etc/ceph/ceph.conf being mapped to the container’s 
filesystem view and potentially containing option settings that are perplexing 
unless one knows how to find and modify them.  That isn’t true.  When someone 
doesn’t know the dynamics of containers, they can add to the learning curve.  
And yes the docs do not yet pervasively cover the panoply of container 
scenarios.

> Administration of ceph itself, is, again, done via systemd.

Sorry, but that often isn’t the case.

> *Docker. As I've said elsewhere, Red Hat prefers Podman to Docker these days

Confused look.  I know people who prefer using vi or like Brussels sprouts.  
Those aren’t relevant to the question about containerized deployments either. 
And the question was re containers, not about the organization formerly known 
as Red Hat.

> and even if you install Docker, there's a Podman transparency feature.

See above.

> Now if you really want networking headaches, run Podman containers rootless. 
> I've learned how to account for the differences but Ceph, fortunately hasn't 
> gone that route so far. Nor have they instituted private networks for Ceph 
> internal controls.
> 
> 
> On 9/1/24 15:54, Anthony D'Atri wrote:
>> * Docker networking is a hassle
>> * Not always clear how to get logs
>> * Not being able to update iptables without stopping all services
>> * Docker package management when the name changes at random
>> * Docker core leaks and kernel compatibility
>> * When someone isn’t already using containers, or has their own 
>> orchestration, going to containers steepens the learning curve.
>> 
>> Containers have advantages including decoupling the applications from the 
>> underlying OS
>> 
>>> I would greatly like to know what the rationale is for avoiding containers
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible installation error

2024-09-01 Thread Anthony D';Atri
* Docker networking is a hassle
* Not always clear how to get logs 
* Not being able to update iptables without stopping all services
* Docker package management when the name changes at random
* Docker core leaks and kernel compatibility 
* When someone isn’t already using containers, or has their own orchestration, 
going to containers steepens the learning curve.  

Containers have advantages including decoupling the applications from the 
underlying OS

> I would greatly like to know what the rationale is for avoiding containers
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How many MDS & MON servers are required

2024-08-31 Thread Anthony D';Atri
The number of mons is ideally an odd number.  For production 5 is usually the 
right number.

MDS is a complicated question.

> On Aug 30, 2024, at 2:24 AM, s.dhivagar@gmail.com wrote:
> 
> Hi,
> 
> We are using raw cephfs data 182 TB replica 2 and single MDS seemed to 
> regularly run around 4002 req/s, So how many MDS & MON servers are required?
> 
> Also mentioned current ceph cluster servers  
> 
> Client : 60
> MDS: 3 (2 Active + 1 Standby)
> MON: 4
> MGR: 3 (1 Active + 2 Standby)
> OSD: 52
> PG :  auto scale
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to know is ceph is ready?

2024-08-29 Thread Anthony D';Atri


> Short note, if you do it like this it will fail.
> You have to use a regex like active+clean*. Because you always habe
> active+clean+scrub or deep-scrub

This.

A few releases back backfill was changed to no longer itself trigger 
HEALTH_WARN, so code that waited for HEALTH_OK needed to be updated.  Mind you 
sometimes you care about backfill and sometimes you don't.  
In this case the PG states should suffice, no reason to make Proxmox wait for 
backfill.
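
A minimal sketch of such a gate, assuming you only care that every PG is active in some flavor (scrubbing, backfilling, degraded etc. are all fine to boot on top of):

until state=$(ceph pg stat 2>/dev/null) && \
      ! echo "$state" | grep -Eq 'peering|activating|creating|inactive|unknown|down|stale|incomplete'; do
    echo "waiting for Ceph: ${state:-cluster not reachable}"
    sleep 5
done
echo "PGs usable: $state"

Wrap that in a script and point a ceph-ready.service ExecStart at it, as Bogdan describes below.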

> 
> Joachim
> 
> 
> 
> Joachim Kraftmayer
> 
> CEO
> 
> joachim.kraftma...@clyso.com
> 
> www.clyso.com
> 
> Hohenzollernstr. 27, 80801 Munich
> 
> Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306
> 
> Bogdan Adrian Velica  wrote on Thu, 29 Aug 2024, 13:04:
> 
>> Hi,
>> 
>> It's a hacky idea but create a script that checks if the Ceph RBD pool is
>> fully "active+clean" to ensure it's ready before starting Proxmox VMs.
>> Something like this...
>> 
>> 1. Bash Script:
>> - Write a script that checks if the Ceph RBD pool is in the "active+clean"
>> state using ceph pg stat or ceph -s.
>> - The script should run until the pool is ready before exiting. A loop or
>> something...
>> 
>> 2. Systemd Service for Ceph Readiness:
>> - Create a systemd service unit file (something like: ceph-ready.service)
>> that runs the script at startup.
>> - Ensure the service waits until the Ceph cluster is ready before
>> proceeding.
>> 
>> 3. Systemd Service for Proxmox VMs:
>> - Modify an existing Proxmox service or create a new one to depend on the
>> ceph-ready.service.
>> - Use After=ceph-ready.service in the unit file to ensure Proxmox VMs start
>> only after Ceph is ready.
>> 
>> 4. Enable and Start the Services:
>> - Enable both systemd services to start at boot with systemctl enable.
>> - Reboot to ensure that Proxmox waits for Ceph before starting VMs.
>> 
>> Just an idea...
>> 
>> Thank you,
>> Bogdan Velica
>> croit.io
>> 
>> On Thu, Aug 29, 2024 at 1:53 PM Alfredo Rezinovsky 
>> wrote:
>> 
>>> I have a proxmox cluster using an external CEPH cluster.
>>> 
>>> Sometimes due to blackouts the servers need to restart. If proxmox starts
>>> before CEPH is ready the VMs fail to boot.
>>> 
>>> I want to add a dependency in proxmox to wait for ceph to be ready.
>>> 
>>> I can work with a HEALTH_WARN as long the RBD pool is usable.
>>> 
>>> ceph status exit status doesn´t helps
>>> 
>>> Should I grep for "pgs not active" in ceph status or for "inactive" pgs
>> in
>>> ceph health or is there something more direct to know if everything is
>>> alright?
>>> 
>>> 
>>> 
>>> --
>>> Alfrenovsky
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: how to distinguish SED/OPAL and non SED/OPAL disks in orchestrator?

2024-08-28 Thread Anthony D';Atri



> is it possible to somehow distinguish self encrypting drives from drives
> that are lacking the support in the orchestrator.
> 
> So I don't encrypt them twice :)

I believe that an osd_spec can be bound to drive models, so you could 
differentiate that way.
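
Something like this sketch, say, where the model string is a placeholder for whatever your non-SED drives report:

cat > osd-non-sed.yaml <<'EOF'
service_type: osd
service_id: non_sed_dmcrypt
placement:
  host_pattern: '*'
spec:
  data_devices:
    model: 'MZQL27T6HBLA'     # placeholder: your non-SED model string
  encrypted: true             # dmcrypt only these
EOF
ceph orch apply -i osd-non-sed.yaml --dry-run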

You could also just not use SED/OPAL and rely on dmcrypt across the board, but 
probably you paid extra for drives for a reason and have a key management 
scheme.  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Do you need to use a dedicated server for the MON service?

2024-08-26 Thread Anthony D';Atri
If you're using cephadm or other container deployment scheme, upgrades 
shouldn't be affected.

When using a traditional package-based install, there can be snags in the 
unlikely event of a system rebooting at exactly the wrong time.

I personally prefer to keep mon/mgr services on dedicated, modest nodes, having 
experienced unexpected effects from mon restarts in the past.  But with recent 
Ceph releases, the daemons are better-behaved and lots of people converge in 
just this way.

If your nodes do not have extra RAM available for mon/MDS/etc, you might starve 
the OSDs a bit.

MDS is not multithreaded so it benefits from high freq CPU and doesn't care 
about cores.  Some people thus tie MDS to specifically-chosen systems.

> On Aug 23, 2024, at 11:10 AM, Phong Tran Thanh  wrote:
> 
>  This is my first time setting up a Ceph cluster for OpenStack. Will
> running both the Mon service and the OSD service on the same node affect
> maintenance or upgrades in the future? While running Mon, MDS, and OSD
> services on the same node does offer some benefits, could someone provide
> me with additional advice?
> 
> On Fri, 23 Aug 2024 at 17:31, Bogdan Adrian Velica <
> vbog...@gmail.com> wrote:
> 
>> Ah, for OpenStack..
>> Yeah I've seen setups like that so it should be fine...
>> 
>> On Fri, Aug 23, 2024 at 1:27 PM Bogdan Adrian Velica 
>> wrote:
>> 
>>> Hi,
>>> 
>>> MON servers typically don't consume a lot of resources, so it's fairly
>>> common to run MGR/MON daemons on the same machines as the OSDs in a
>>> cluster. Do you have any insights on the type of workload or data you plan
>>> to use with your Ceph cluster?
>>> 
>>> Thank you,
>>> Bogdan Velica
>>> croit.io
>>> 
>>> On Fri, Aug 23, 2024 at 12:05 PM Phong Tran Thanh 
>>> wrote:
>>> 
 Hi Ceph users
 
 I am designing a CEPH system with 6 servers running full NVMe. Do I need
 to
 use 3 separate servers to run the MON services that communicate with
 OpenStack, or should I integrate the MON services into the OSD servers?
 What is the recommendation? Thank you.
 
 --
 Email: tranphong...@gmail.com
 Skype: tranphong079
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io
 
>>> 
> 
> -- 
> Best regards,
> 
> 
> *Tran Thanh Phong*
> 
> Email: tranphong...@gmail.com
> Skype: tranphong079
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Paid support options?

2024-08-23 Thread Anthony D';Atri
Names that come to mind:

Clyso
OSNEXUS
42on / Fairbanks
Redhat / IBM - has their distribution, do not know if they require
Croit - has their novel distribution

Some of us do an occasional bit of small-scale consulting as individuals



> On Aug 23, 2024, at 8:01 AM, Paul Mezzanini  wrote:
> 
> We've reached the point of our ceph cluster's life where it make sense to 
> have an outside vendor in the passenger seat with us.  I already know of a 
> few but I see value in having a thread consolidating this information so I'm 
> leaving it open ended.
> 
> The two main questions I'm asking are:
> 
> What vendors offer paid ceph support?  
> Do they have specific requirements? (e.g. must run their version of ceph vs 
> community, must be containerized vs bare metal)
> 
> Thanks
> -paul
> 
> --
> 
> Paul Mezzanini
> Platform Engineer III
> Research Computing
> Rochester Institute of Technology
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: squid release codename

2024-08-19 Thread Anthony D';Atri


> On Aug 19, 2024, at 9:45 AM, Yehuda Sadeh-Weinraub  wrote:
> 
> Originally I remember also suggesting "banana" (after bananaslug) [1] , 
> imagine how much worse it could have been.


Solidigm could have been Stodesic or Velostate ;)


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: weird outage of ceph

2024-08-18 Thread Anthony D';Atri


> 
> You may want to look into https://github.com/digitalocean/pgremapper to get 
> the situation under control first.
> 
> --
> Alex Gorbachev
> ISS

Not a bad idea.

>> We had a really weird outage today of ceph and I wonder how it came about.
>> The problem seems to have started around midnight, I still need to look if 
>> it was to the extend I found it in this morning or if it grew more 
>> gradually, but when I found it several osd servers had most or all osd 
>> processes down, to the point where our EC 8+3 buckets didn't work anymore.

Look at your metrics and systems.  Were the OSDs OOMkilled?

>> I see some of our OSDs are coming close to (but not quite) 80-85% full, 
>> There are many times when I've seen an overfull error lead to cascading and 
>> catastrophic failures. I suspect this may have been one of them.

One can (temporarily) raise the backfillfull / full ratios to help get out of a 
bad situation, but leaving them raised can lead to an even worse situation 
later.


>> Which brings me to another question, why is our balancer doing so badly at 
>> balancing the OSDs?

There are certain situations where the bundled balancer is confounded, 
including CRUSH trees with multiple roots.  Subtly, that may include a cluster 
where some CRUSH rules specify a deviceclass and some don’t, as with the .mgr 
pool if deployed by Rook.  That situation confounds the PG autoscaler for sure. 
 If this is the case, consider modifying CRUSH rules so that all specify a 
deviceclass, and/or simplifying your CRUSH tree if you have explicit multiple 
roots.

>> It's configured with upmap mode and it should work great with the amount of 
>> PGs per OSD we have

Which is?

>> , but it is letting some OSD's reach 80% full and others not yet 50% full 
>> (we're just over 61% full in total).
>> 
>> The current health status is:
>> HEALTH_WARN Low space hindering backfill (add storage if this doesn't 
>> resolve itself): 1 pg backfill_toofull 
>> [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this 
>> doesn't resolve itself): 1 pg backfill_toofull 
>>pg 30.3fc is active+remapped+backfill_wait+backfill_toofull, acting 
>> [66,105,124,113,89,132,206,242,179]
>> 
>> I've started reweighting again, because the balancer is not doing it's job 
>> in our cluster for some reason...

Reweighting … are you doing “ceph osd crush reweight”, or “ceph osd reweight / 
reweight-by-utilization”?  The latter in conjunction with pg-upmap confuses the 
balancer.  If that’s the situation you have, I might (a combined sketch follows this list):

* Use pg-remapper or Dan’s 
https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
 to freeze the PG mappings temporarily.
* Temp jack up the backfillfull/full ratios for some working room, say to 95 / 
98 %
* One at a time, reset the override reweights to 1.0.  No data should move.
* Remove the manual upmaps one at a time, in order of PGs on the most-full OSDs.  
You should see a brief spurt of backfill.
* Rinse, lather, repeat.
* This should progressively get you to a state where you no longer have any 
old-style override reweights, i.e. all OSDs have 1.0 for that value.
* Proceed removing the manual upmaps one or a few at a time
* The balancer should work now
* Set the ratios back to the default values
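
Strung together, that might look roughly like the below (the OSD and PG ids are examples, pg 30.3fc is the one from your health output, and upmap-remapped.py is the CERN script linked above):

ceph osd set-backfillfull-ratio 0.95      # temporary headroom only
ceph osd set-full-ratio 0.98
./upmap-remapped.py | sh                  # freeze current placements as upmaps
ceph osd reweight 66 1.0                  # one OSD at a time, no data should move
ceph osd rm-pg-upmap-items 30.3fc         # then drop upmaps, fullest OSDs first
# ...rinse, lather, repeat, watching `ceph -s` between steps...
ceph osd set-backfillfull-ratio 0.90      # back to defaults when done
ceph osd set-full-ratio 0.95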


>> 
>> Below is our dashboard overview, you can see the start and recovery in the 
>> 24h graph...
>> 
>> Cheers
>> 
>> /Simon
>> 

>> 
>> 
>> --
>> I'm using my gmail.com  address, because the gmail.com 
>>  dmarc policy is "none", some mail servers will reject 
>> this (microsoft?) others will instead allow this when I send mail to a 
>> mailling list which has not yet been configured to send mail "on behalf of" 
>> the sender, but rather do a kind of "forward". The latter situation causes 
>> dkim/dmarc failures and the dmarc policy will be applied. see 
>> https://wiki.list.org/DEV/DMARC for more details
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io 
>> To unsubscribe send an email to ceph-users-le...@ceph.io 
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Identify laggy PGs

2024-08-17 Thread Anthony D';Atri


> 
> I always thought that too many PGs have impact on the disk IO. I guess this 
> is wrong?

Mostly when they’re spinners.  Especially back in the Filestore days with a 
colocated journal.  Don’t get me started on that.   

Too many PGs can exhaust RAM if you’re tight - or using Filestore still.  

For a SATA SSD I’d set pg_num values so OSDs average 200-300 PGs per drive.  Your size mix 
complicates things, though, because the larger OSDs will get many more than the 
smaller.  Be sure to set mon_max_pg_per_osd to something like 1000.  

You might experiment with primary affinity, so that the smaller OSDs are 
more likely to be primaries and thus will get more read load.  I've seen a 
first-order approximation here increase read throughput by 20%.
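
A sketch of both knobs (the ids and values are examples):

ceph config set global mon_max_pg_per_osd 1000
ceph osd primary-affinity 12 1.0     # small, fast OSD: prefer as primary
ceph osd primary-affinity 37 0.5     # big OSD: make it a primary less often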



> So I could double the PGs in the pool and see if things become better.
> 
> And yes, removing that single OSD from the cluster stopped the flapping of 
> "monitor marked osd.N down".
> 
>> Am 15.08.2024 um 10:14 schrieb Frank Schilder :
>> 
>> The current ceph recommendation is to use between 100-200 PGs/OSD. 
>> Therefore, a large PG is a PG that has more data than 0.5-1% of the disk 
>> capacity and you should split PGs for the relevant pool.
>> 
>> A huge PG is a PG for which deep-scrub takes much longer than 20min on HDD 
>> and 4-5min on SSD.
>> 
>> Average deep-scrub times (time it takes to deep-scrub) are actually a very 
>> good way of judging if PGs are too large. These times roughly correlate with 
>> the time it takes to copy a PG.
>> 
>> On SSDs we aim for 200+PGs/OSD and for HDDs for 150PGs/OSD. For very large 
>> HDD disks (>=16TB) we consider raising this to 300PGs/OSD due to excessively 
>> long deep-scrub times per PG.
>> 
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> 
>> 
>> From: Szabo, Istvan (Agoda) 
>> Sent: Wednesday, August 14, 2024 12:00 PM
>> To: Eugen Block; ceph-users@ceph.io
>> Subject: [ceph-users] Re: Identify laggy PGs
>> 
>> Just curiously I've checked my pg size which is like 150GB, when are we 
>> talking about big pgs?
>> 
>> From: Eugen Block 
>> Sent: Wednesday, August 14, 2024 2:23 PM
>> To: ceph-users@ceph.io 
>> Subject: [ceph-users] Re: Identify laggy PGs
>> 
>> Email received from the internet. If in doubt, don't click any link nor open 
>> any attachment !
>> 
>> 
>> Hi,
>> 
>> how big are those PGs? If they're huge and are deep-scrubbed, for
>> example, that can cause significant delays. I usually look at 'ceph pg
>> ls-by-pool {pool}' and the "BYTES" column.
>> 
>> Zitat von Boris :
>> 
>>> Hi,
>>> 
>>> currently we encouter laggy PGs and I would like to find out what is
>>> causing it.
>>> I suspect it might be one or more failing OSDs. We had flapping OSDs and I
>>> synced one out, which helped with the flapping, but it doesn't help with
>>> the laggy ones.
>>> 
>>> Any tooling to identify or count PG performance and map that to OSDs?
>>> 
>>> 
>>> --
>>> The "UTF-8 problems" self-help group will meet in the large hall this time,
>>> as an exception.
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
>> 
>> This message is confidential and is for the sole use of the intended 
>> recipient(s). It may also be privileged or otherwise protected by copyright 
>> or other legal rules. If you have received it by mistake please let us know 
>> by reply email and delete it from your system. It is prohibited to copy this 
>> message or disclose its content to anyone. Any confidentiality or privilege 
>> is not waived or lost by any mistaken delivery or unauthorized disclosure of 
>> the message. All messages sent to and from Agoda may be monitored to ensure 
>> compliance with company policies, to protect the company's interests and to 
>> remove potential malware. Electronic messages may be intercepted, amended, 
>> lost or deleted, or contain viruses.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: squid release codename

2024-08-17 Thread Anthony D';Atri
> It's going to wreak havoc on search engines that can't tell when
> someone's looking up Ceph versus the long-establish Squid Proxy.

Search engines are way smarter than that, and I daresay that people are far 
more likely to search for “Ceph” or “Ceph squid" than for “squid” alone looking 
for Ceph.


> I don’t know how many more (sub)species there are to start over from A (the 
> first release was Argonaut) 

Ammonite is a natural, and two years later we *must* release Cthulhu.

Cartoon names run some risk of trademark issues.

> ...  that said, naming a *release* of a software with the name of
> well known other open source software is pure crazyness.

I haven’t seen the web cache used in years — maybe still in Antarctica?  These 
are vanity names for fun.  I’ve found that more people know the numeric release 
they run than the codename anyway.

> What's coming next? Ceph Redis? Ceph Apache? Or Apache Ceph?

Since you mention Apache, their “Spark” is an overload.  And Apache itself is 
cultural appropriation but that’s a tangent.

When I worked for Advanced Micro Devices we used the Auto Mounter Daemon

I’ve also used AMANDA for backups, which was not a Boston song.

Let’s not forget Apple’s iOS and Cisco’s IOS.

Ceph Octopus, and this cable 
https://usb.brando.com/usb-octopus-4-port-hub-cable_p999c39d15.html and of 
course this one https://www.ebay.com/itm/110473961774

The first Ceph release named after Jason’s posse.
Bobcat colliding with skid-loaders and Goldthwaite
Dumpling and gyoza
Firefly and the Uriah Heep album (though Demons & Wizards was better)
Giant and the Liz Taylor movie (and grocery store)
Hammer and Jan
Jewel and the singer
Moreover, Ceph Nautilus:
Korg software
Process engineering software
CMS
GNOME file manager
Firefox and the Clint Eastwood movie
Chrome and the bumper on a 1962 Karmann Ghia
Slack and the Linux distribution

When I worked for Cisco, people thought I was in food service.  Namespaces are 
crowded.  Overlap happens.  Context resolves readily.

Within the Cephapod scheme we’ve used Octopus and Nautilus, to not use Squid 
would be odd.  And Shantungendoceras doesn’t roll off the tongue.



“What’s in a name?”  - Shakespeare


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: squid release codename

2024-08-15 Thread Anthony D';Atri
Do Reef searchers get confused by reducing mainsail?

In voting Squid was the overwhelming favorite.   Do you have a better S 
suggestion?

> On Aug 15, 2024, at 8:45 AM, Alfredo Rezinovsky  wrote:
> 
> I think is a very bad idea to name a release with the name of the most
> popular http cache.
> It will difficult googling.
> 
> --
> Alfrenovsky
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd du USED greater than PROVISIONED

2024-08-14 Thread Anthony D';Atri


> On Aug 14, 2024, at 10:45 AM, Murilo Morais  wrote:
> 
> Good morning everyone!
> 
> I am confused about the listing of the total amount used on a volume.
> It says that more than the amount provisioned is being used.

The command shows 710GB of changes since the snapshot was taken, added to the 
404GB currently used by the parent. The snapshot by nature retains data that 
has been changed or deleted in the parent.
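
If you want to see how much has actually changed since the snapshot, a commonly used sketch is to sum the extent lengths that rbd diff reports:

rbd diff --from-snap snapshot-23af41b0-ca7b-45df-991f-ffbe95512dbe \
    osr1_volume_ssd/volume-6e5f90ac-78e9-465e-8705-a4d476ebc019 \
    | awk '{ sum += $2 } END { printf "%.1f GiB changed since snap\n", sum/2^30 }'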

> The image
> contains a snapshot. Below is the output of the "rbd du" command:
> 
> user@abc2:~# rbd info
> osr1_volume_ssd/volume-6e5f90ac-78e9-465e-8705-a4d476ebc019
> rbd image 'volume-6e5f90ac-78e9-465e-8705-a4d476ebc019':
>size 1000 GiB in 256000 objects
>order 22 (4 MiB objects)
>snapshot_count: 1
>id: 3ec2fef4cd0ad8
>block_name_prefix: rbd_data.3ec2fef4cd0ad8
>format: 2
>features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>op_features:
>flags:
>create_timestamp: Fri Jan  5 22:48:45 2024
>access_timestamp: Wed Aug 14 10:46:06 2024
>modify_timestamp: Wed Aug 14 10:47:02 2024
> user@abc2:~# rbd du 
> osr1_volume_ssd/volume-6e5f90ac-78e9-465e-8705-a4d476ebc019
> NAME
>PROVISIONED  USED
> volume-6e5f90ac-78e9-465e-8705-a4d476ebc019@snapshot-23af41b0-ca7b-45df-991f-ffbe95512dbe
>1000 GiB  710 GiB
> volume-6e5f90ac-78e9-465e-8705-a4d476ebc019
>   1000 GiB  404 GiB
> 
>   1000 GiB  1.1 TiB
> 
> 
> In this case, is that really the case or is it some kind of BUG?
> 
> For the record, we are still on version 17.2.3.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading RGW before cluster?

2024-08-14 Thread Anthony D';Atri



> there are/were customers who had services colocated, for example MON, MGR and 
> RGW on the same nodes. Before cephadm when they upgraded the first MON node 
> they automatically upgraded the RGW as well, of course. 

This is one of the arguments in favor of containerized daemons.

Strictly speaking, updating the packages and daemon restarts are decoupled, but 
in practice there's a period of risk when the running daemons don't all match 
the installed packages, in case the update stalls for some reason or a node 
crashes at just the wrong time.  Both of which happened to me back in the day.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What's the best way to add numerous OSDs?

2024-08-06 Thread Anthony D';Atri
Since they’re 20TB, I’m going to assume that these are HDDs.

There are a number of approaches.  One common theme is to avoid rebalancing 
until after all have been added to the cluster and are up / in, otherwise you 
can end up with a storm of map updates and superfluous rebalancing.


One strategy is to set osd_crush_initial_weight = 0 temporarily, so that the 
OSDs when added won’t take any data yet.  Then when you’re ready you can set 
their CRUSH weights up to where they otherwise would be, and unset 
osd_crush_initial_weight so you don’t wonder what the heck is going on six 
months down the road.
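
A sketch of that first approach (the CRUSH weight of a 20 TB drive is roughly 18.19, i.e. its size in TiB, and <new-host> is a placeholder):

ceph config set osd osd_crush_initial_weight 0     # before creating the new OSDs
# ...create the new OSDs...
for id in $(ceph osd ls-tree <new-host>); do
    ceph osd crush reweight osd.$id 18.19
done
ceph config rm osd osd_crush_initial_weight        # don't leave this behind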

Another is to add a staging CRUSH root.  If the new OSDs are all on new hosts, 
you can create CRUSH host buckets for them in advance so that when you create 
the OSDs they go there and again won’t immediately take data.  Then you can 
move the host buckets into the production root in quick succession.

Either way if you do want to add them to the cluster all at once, with HDDs 
you’ll want to limit the rate of backfill so you don’t DoS your clients.  One 
strategy is to leverage pg-upmap with a tool like 
https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

Note that to use pg-upmap safely, you will need to ensure that your clients are 
all at Luminous or later, in the case of CephFS I *think* that means kernel 
4.13 or later.  `ceph features` will I think give you that information.

An older method of spreading out the backfill thundering herd was to use a for 
loop to weight up the OSDs in increments of, say, 0.1 at a time, let the 
cluster settle, then repeat.  This strategy results in at least some data 
moving twice, so it’s less efficient.  Similarly you might add, say, one OSD 
per host at a time and let the cluster settle between iterations, which would 
also be less than ideally efficient.

— aad

> On Aug 6, 2024, at 11:08 AM, Fabien Sirjean  wrote:
> 
> Hello everyone,
> 
> We need to add 180 20TB OSDs to our Ceph cluster, which currently consists of 
> 540 OSDs of identical size (replicated size 3).
> 
> I'm not sure, though: is it a good idea to add all the OSDs at once? Or is it 
> better to add them gradually?
> 
> The idea is to minimize the impact of rebalancing on the performance of 
> CephFS, which is used in production.
> 
> Thanks in advance for your opinions and feedback 🙂
> 
> Wishing you a great summer,
> 
> Fabien
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD data corruption after node reboot in Rook

2024-08-05 Thread Anthony D';Atri
Describe your hardware, please, and are you talking about an orderly "shutdown -r" 
reboot, or a kernel / system crash or power loss?

Often corruptions like this are a result of:

* Using non-enterprise SSDs that lack power loss protection
* Buggy / defective RAID HBAs
* Enabling volatile write cache on drives



> On Aug 5, 2024, at 4:54 AM, Reza Bakhshayeshi  wrote:
> 
> Hello,
> 
> Whenever a node reboots in the cluster I get some corrupted OSDs, is there
> any config I should set to prevent this from happening that I am not aware
> of?
> 
> Here is the error log:
> 
> # kubectl logs rook-ceph-osd-1-5dcbd99cc7-2l5g2 -c expand-bluefs
> 
> ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x135) [0x7f969977ce15]
> 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f969977cfdb]
> 3: (BlueStore::expand_devices(std::ostream&)+0x5ff) [0x55ce89d1f3ff]
> 4: main()
> 5: __libc_start_main()
> 6: _start()
> 
> 0> 2024-07-31T08:39:19.840+ 7f969b1c0980 -1 *** Caught signal
> (Aborted) **
> in thread 7f969b1c0980 thread_name:ceph-bluestore-
> 
> ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef
> (stable)
> 1: /lib64/libpthread.so.0(+0x12d20) [0x7f969843fd20]
> 2: gsignal()
> 3: abort()
> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x18f) [0x7f969977ce6f]
> 5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f969977cfdb]
> 6: (BlueStore::expand_devices(std::ostream&)+0x5ff) [0x55ce89d1f3ff]
> 7: main()
> 8: __libc_start_main()
> 9: _start()
> NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Subscribe

2024-07-25 Thread Anthony D'Atri
Known problem.  I’m managing this list manually (think: interactive Python 
shell) until the CLT finds somebody with the chops to set it up fresh on better 
infra and not lose the archives.
I’ll get dobr...@gmu.edu  added.

> On Jul 25, 2024, at 8:55 AM, Dan O'Brien  wrote:
> 
> Sorry, the list has been wonky for me. I was logged in with my GitHub 
> credentials and when I try and publish the post, I get the message
> 
> This list is moderated, please subscribe to it before posting.
> 
> When I try and manage my subscription, I get:
> 
> Something went wrong
> Mailman REST API not available. Please start Mailman core.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Questions about the usage of space in Ceph

2024-07-13 Thread Anthony D'Atri



> 
> My Ceph cluster has a CephFS file system, using an erasure-code data pool 
> (k=8, m=2), which has used 14TiB of space. My CephFS has 19 subvolumes, and 
> each subvolume automatically creates a snapshot every day and keeps it for 3 
> days. The problem is that when I manually calculate the disk space usage of 
> each subvolume directory in CephFS, the total amount is only 8.4TiB. I don't 
> know why this is happening. Do snapshots take up a lot of space?

Snapshot consumption of underlying storage is a function of how much data is 
written / removed.

Do you have a lot of fairly small files on your CephFS?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Repurposing some Dell R750s for Ceph

2024-07-12 Thread Anthony D'Atri
There’s more to it than bottlenecking.

RAS, man.  RAS.

> On Jul 12, 2024, at 3:58 PM, John Jasen  wrote:
> 
> How large of a ceph cluster are you planning on building, and what network 
> cards/speeds will you be using?
> 
> A lot of the talk about RAID HBA pass-through being sub-optimal probably 
> won't be your bottleneck unless you're aiming for a large cluster at 100Gb/s 
> speeds, in my opinion. 
> 
> On Fri, Jul 12, 2024 at 12:02 PM Drew Weaver  <mailto:drew.wea...@thenap.com>> wrote:
>> Okay it seems like we don't really have a definitive answer on whether it's 
>> OK to use a RAID controller or not and in what capacity.
>> 
>> Passthrough meaning:
>> 
>> Are you saying that it's OK to use a raid controller where the disks are in 
>> non-RAID mode?
>> Are you saying that it's OK to use a raid controller where each disk is in 
>> its own RAID-0 volume?
>> 
>> I'm just trying to clarify a little bit. You can imagine that nobody wants 
>> to be that user that does this against the documentation's guidelines and 
>> then something goes terribly wrong.
>> 
>> Thanks again,
>> -Drew
>> 
>> 
>> -Original Message-
>> From: Anthony D'Atri mailto:a...@dreamsnake.net>> 
>> Sent: Thursday, July 11, 2024 7:24 PM
>> To: Drew Weaver mailto:drew.wea...@thenap.com>>
>> Cc: John Jasen mailto:jja...@gmail.com>>; 
>> ceph-users@ceph.io <mailto:ceph-users@ceph.io>
>> Subject: Re: [ceph-users] Repurposing some Dell R750s for Ceph
>> 
>> 
>> 
>> > 
>> > Isn’t the supported/recommended configuration to use an HBA if you have to 
>> > but never use a RAID controller?
>> 
>> That may be something I added to the docs.  My contempt for RAID HBAs knows 
>> no bounds ;)
>> 
>> Ceph doesn’t care.  Passthrough should work fine, I’ve done that for tens 
>> of thousands of OSDs, albeit on different LSI HBA SKUs.
>> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help with Mirroring

2024-07-12 Thread Anthony D'Atri


> Hi,
> 
> just one question coming to mind, if you intend to migrate the images 
> separately, is it really necessary to set up mirroring? You could just 'rbd 
> export' on the source cluster and 'rbd import' on the destination cluster.

That can be slower if using a pipe, and requires staging space one may not have 
if exporting to files and transferring them.

And of course it’s right out if the RBD volume is currently attached.




> 
> 
> Zitat von Anthony D'Atri :
> 
>>> 
>>> I would like to use mirroring to facilitate migrating from an existing
>>> Nautilus cluster to a new cluster running Reef.  RIght now I'm looking at
>>> RBD mirroring.  I have studied the RBD Mirroring section of the
>>> documentation, but it is unclear to me which commands need to be issued on
>>> each cluster and, for commands that have both clusters as arguments, when
>>> to specify site-a where vs. site-b.
>> 
>> I won’t go into the nitty-gritty, but note that you’ll likely run the 
>> rbd-mirror daemon on the destination cluster, and it will need reachability 
>> to all of the source cluster’s mons and OSDs.  Maybe mgrs, not sure.
>> 
>>> Another concern:  Both the old and new cluster internally have the default
>>> name 'Ceph' - when I set up the second cluster I saw no obvious reason to
>>> change from the default.  If these will cause a problem with mirroring, is
>>> there a workaround?
>> 
>> The docs used to imply that the clusters need to have distinct vanity names, 
>> but that was never actually the case — and vanity names are no longer 
>> supported for clusters.
>> 
>> The ceph.conf files for both clusters need to be distinct and present on the 
>> system where rbd-mirror runs.  You can do this by putting them in different 
>> subdirectories or calling them like cephsource.conf and cephdest.conf.  The 
>> filenames are arbitrary, you’ll just have to specify them when setting up 
>> rbd-mirror peers.
>> 
>> 
>>> In the long run I will also be migrating a bunch of RGW data.  If there are
>>> advantages to using mirroring for this I'd be glad to know.
>> 
>> Whole different ballgame.  You can use multisite or rclone or the new Clyso 
>> “Chorus” tool for that.
>> 
>>> (BTW, the plan is to gradually decommission the systems from the old
>>> cluster and add them to the new cluster.  In this context, I am looking to
>>> enable and disable mirroring on specific RBD images and RGW buckets as the
>>> client workload is migrated from accessing the old cluster to accessing the
>>> new.
>> 
>> I’ve migrated thousands of RBD volumes between clusters this way.  It gets a 
>> bit tricky if a volume is currently attached.
>> 
>>> 
>>> Thanks.
>>> 
>>> -Dave
>>> 
>>> --
>>> Dave Hall
>>> Binghamton University
>>> kdh...@binghamton.edu
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Repurposing some Dell R750s for Ceph

2024-07-12 Thread Anthony D'Atri


> Okay it seems like we don't really have a definitive answer on whether it's 
> OK to use a RAID controller or not and in what capacity.

It’s okay to use it if that’s what you have.

For new systems, eschew the things.  They cost money for something you can do 
with MD for free and are finicky.  Look back on the list a few years for my 
litany of why I hate the things, all based on personal experience ;)


> Passthrough meaning:
> 
> Are you saying that it's OK to use a raid controller where the disks are in 
> non-RAID mode?

That’s the best approach in your situation, where you have servers already.

> Are you saying that it's OK to use a raid controller where each disk is in 
> its own RAID-0 volume?

It’s OK to do that, but there are drawbacks and you should do passthrough 
instead if you can:

* Operational hassles creating those wrappers and tearing them down when 
deploying OSDs and replacing drives
* Firmware updates may be confounded
* iostat’s already-marginal stats are confounded
* Hassles getting SMART-style stats

> 
> I'm just trying to clarify a little bit. You can imagine that nobody wants to 
> be that user that does this against the documentation's guidelines and then 
> something goes terribly wrong.

Dell’s documentation tells you how to set passthrough.  If you’ve seen docs 
that tell you not to, please do send a link as I’d like to see them.

Dell qualified a certain unnamed now-EOL SSD, yet I underwent a long and 
arduous engagement with the manufacturer to fix a firmware design flaw.  So I 
don’t place much stock in Dell recommendations.  They RAALLLY like to 
stuff RoC HBAs down our throats, and they mark them up outrageously.

— aad


> 
> Thanks again,
> -Drew
> 
> 
> -Original Message-
> From: Anthony D'Atri  
> Sent: Thursday, July 11, 2024 7:24 PM
> To: Drew Weaver 
> Cc: John Jasen ; ceph-users@ceph.io
> Subject: Re: [ceph-users] Repurposing some Dell R750s for Ceph
> 
> 
> 
>> 
>> Isn’t the supported/recommended configuration to use an HBA if you have to 
>> but never use a RAID controller?
> 
> That may be something I added to the docs.  My contempt for RAID HBAs knows 
> no bounds ;)
> 
> Ceph doesn’t care.  Passthrough should work fine, I’ve done that for tens of 
> thousands of OSDs, albeit on different LSI HBA SKUs.
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help with Mirroring

2024-07-11 Thread Anthony D'Atri
> 
> I would like to use mirroring to facilitate migrating from an existing
> Nautilus cluster to a new cluster running Reef.  RIght now I'm looking at
> RBD mirroring.  I have studied the RBD Mirroring section of the
> documentation, but it is unclear to me which commands need to be issued on
> each cluster and, for commands that have both clusters as arguments, when
> to specify site-a where vs. site-b.

I won’t go into the nitty-gritty, but note that you’ll likely run the 
rbd-mirror daemon on the destination cluster, and it will need reachability to 
all of the source cluster’s mons and OSDs.  Maybe mgrs, not sure.

> Another concern:  Both the old and new cluster internally have the default
> name 'Ceph' - when I set up the second cluster I saw no obvious reason to
> change from the default.  If these will cause a problem with mirroring, is
> there a workaround?

The docs used to imply that the clusters need to have distinct vanity names, 
but that was never actually the case — and vanity names are no longer supported 
for clusters.

The ceph.conf files for both clusters need to be distinct and present on the 
system where rbd-mirror runs.  You can do this by putting them in different 
subdirectories or calling them like cephsource.conf and cephdest.conf.  The 
filenames are arbitrary, you’ll just have to specify them when setting up 
rbd-mirror peers.
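
A sketch of what that can look like on the rbd-mirror host -- the file names, 
cluster nicknames and pool here are hypothetical:

  /etc/ceph/source.conf         # mons + keyring for the old Nautilus cluster
  /etc/ceph/destination.conf    # the new Reef cluster
  rbd --cluster source ls rbdpool        # --cluster NAME resolves /etc/ceph/NAME.conf
  rbd --cluster destination ls rbdpool

The mirror peer commands then reference those nicknames rather than anything baked 
into the clusters themselves.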


> In the long run I will also be migrating a bunch of RGW data.  If there are
> advantages to using mirroring for this I'd be glad to know.

Whole different ballgame.  You can use multisite or rclone or the new Clyso 
“Chorus” tool for that.

> (BTW, the plan is to gradually decommission the systems from the old
> cluster and add them to the new cluster.  In this context, I am looking to
> enable and disable mirroring on specific RBD images and RGW buckets as the
> client workload is migrated from accessing the old cluster to accessing the
> new.

I’ve migrated thousands of RBD volumes between clusters this way.  It gets a 
bit tricky if a volume is currently attached.

> 
> Thanks.
> 
> -Dave
> 
> --
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Repurposing some Dell R750s for Ceph

2024-07-11 Thread Anthony D'Atri


> 
> Isn’t the supported/recommended configuration to use an HBA if you have to 
> but never use a RAID controller?

That may be something I added to the docs.  My contempt for RAID HBAs knows no 
bounds ;)

Ceph doesn’t care.  Passthrough should work fine, I’ve done that for tens of 
thousands of OSDs, albeit on different LSI HBA SKUs.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Repurposing some Dell R750s for Ceph

2024-07-11 Thread Anthony D'Atri
Agree with everything Robin wrote here.  RAID HBAs FTL.  Even in passthrough 
mode, it’s still an [absurdly expensive] point of failure, but a server in the 
rack is worth two on backorder.

Moreover, I’m told that it is possible to retrofit with cables and possibly an 
AIC mux / expander.

e.g.
https://www.ebay.com/itm/176400760681

Granted, I haven’t done this personally so I can’t speak to the BOM and 
procedure.  For OSD nodes it probably isn’t worth the effort.

Some of the LSI^H^H^H^HPERC HBAs — to my astonishment — don’t have a 
passthrough setting/mode.  This document though implies that this SKU does.



https://www.dell.com/support/manuals/en-ae/poweredge-r7525/perc11_ug/technical-specifications-of-perc-11-cards?guid=guid-aaaf8b59-903f-49c1-8832-f3997d125edf&lang=en-us


You should be able to set individual drives to passthrough:

storcli64 /call /eall /sall set jbod=on

or depending on the SKU and storcli revision, for the whole HBA

storcli64 /call set personality=JBOD

racadm set Storage.Controller.1.RequestedControllerMode HBA
or
racadm set Storage.Controller.1.RequestedControllerMode EnhancedHBA
then
  jobqueue create RAID.Integrated.1-1
  server action power cycle

LSI and Dell have not been particularly consistent with these beasts.

— aad



>> Hello,
>> 
>> We would like to repurpose some Dell PowerEdge R750s for a Ceph cluster.
>> 
>> Currently the servers have one H755N RAID controller for each 8 drives. (2 
>> total)
> The N variant of H755N specifically? So you have 16 NVME drives in each
> server?
> 
>> I have been asking their technical support what needs to happen in
>> order for us to just rip out those raid controllers and cable the
>> backplane directly to the motherboard/PCIe lanes and they haven't been
>> super enthusiastic about helping me. I get it just buy another 50
>> servers, right? No big deal.
> I don't think the motherboard has enough PCIe lanes to natively connect
> all the drives: the RAID controller effectively functioned as a
> expander, so you needed less PCIe lanes on the motherboard.
> 
> As the quickest way forward: look for passthrough / single-disk / RAID0
> options, in that order, in the controller management tools (perccli etc).
> 
> I haven't used the N variant at all, and since it's NVME presented as
> SCSI/SAS, I don't want to trust the solution of reflashing the
> controller for IT (passthrough) mode.
> 
> -- 
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
> E-Mail   : robb...@gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster Alerts

2024-07-03 Thread Anthony D'Atri
https://docs.ceph.com/en/quincy/mgr/crash/

> On Jul 3, 2024, at 08:27, filip Mutterer  wrote:
> 
> In my cluster I have old Alerts, how should solved Alerts be handled?
> 
> Just wait until they disappear or Silence them?
> 
> Whats the recommended way?
> 
> 
> filip
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS constant high write I/O to the metadata pool

2024-07-02 Thread Anthony D'Atri
This was common in the NFS days, and some Linux distributions deliberately slewed 
the execution time.  find over an NFS mount was a sure-fire way to horque the 
server. (e.g. Convex C1)

IMHO since the tool relies on a static index it isn't very useful, and I 
routinely remove any variant from my systems.

ymmv
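
If you do keep a locate variant around, a hedged sketch of fencing updatedb off 
CephFS -- syntax per mlocate's /etc/updatedb.conf, plocate is similar, and the 
mount point is hypothetical:

  PRUNEFS="ceph fuse.ceph-fuse nfs nfs4"
  PRUNEPATHS="/mnt/cephfs"      # belt and suspenders; adjust to your mounts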

> On Jul 2, 2024, at 10:20, Olli Rajala  wrote:
> 
> Hi - mostly as a note to future me and if anyone else looking for the same
> issue...
> 
> I finally solved this a couple of months ago. No idea what is wrong with
> Ceph but the root cause that was triggering this MDS issue was that I had
> several workstations and a couple servers where the updatedb of "locate"
> was getting run by daily cron exactly the same time every night causing
> high momentary strain on the MDS which then somehow screwed up the metadata
> caching and flushing creating this cumulative write io.
> 
> The thing to note here is that there's a difference with "locate" and
> "mlocate" packages. The default config (on Ubuntu atleast) of updatedb for
> "mlocate" does skip scanning cephfs filesystems but not so for "locate"
> which happily ventures onto all of your cephfs mounts :|
> 
> ---
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---
> 
> 
> On Wed, Dec 14, 2022 at 7:41 PM Olli Rajala  wrote:
> 
>> Hi,
>> 
>> One thing I now noticed in the mds logs is that there's a ton of entries
>> like this:
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d345,d346] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
>> [d345,d346] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d343,d344] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
>> [d343,d344] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d341,d342] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
>> [d341,d342] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d33f,d340] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
>> [d33f,d340] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache result
>> [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 
>> ...and after dropping the caches considerably less of those - normal,
>> abnormal, typical, atypical? ...or is that just something that starts
>> happening after the cache gets filled?
>> 
>> Tnx,
>> ---
>> Olli Rajala - Lead TD
>> Anima Vitae Ltd.
>> www.anima.fi
>> ---
>> 
>> 
>> On Sun, Dec 11, 2022 at 9:07 PM Olli Rajala  wrote:
>> 
>>> Hi,
>>> 
>>> I'm still totally lost with this issue. And now lately I've had a couple
>>> of incidents where the write bw has suddenly jumped to even crazier levels.
>>> See the graph here:
>>> https://gist.github.com/olliRJL/3e97e15a37e8e801a785a1bd5358120d
>>> 
>>> The points where it drops to something manageable again are when I have
>>> dropped the mds caches. Usually after the drop there is steady rise but now
>>> these sudden jumps are something new and even more scary :E
>>> 
>>> Here's a fresh 2sec level 20 mds log:
>>> https://gist.github.com/olliRJL/074bec65787085e70db8af0ec35f8148
>>> 
>>> Any help and ideas greatly appreciated. Is there any tool or procedure to
>>> safely check or rebuild the mds data? ...if this behaviour could be caused
>>> by some hidden issue with the data itself.
>>> 
>>> Tnx,
>>> ---
>>> Olli Rajala - Lead TD
>>> Anima Vitae Ltd.
>>> www.anima.fi
>>> ---
>>> 
>>> 
>>> On Fri, Nov 11, 2022 at 9:14 AM Venky Shankar 
>>> wrote:
>>> 
 On Fri, Nov 11, 2022 at 3:06 AM Olli Rajala 
 wrote:
> 
> Hi Venky,
> 
> I have indeed observed the output of the different sections of perf
 dump like so:
> watch -n 1 ceph tell mds.`hostname` perf dump objecter
> watch -n 1 ceph tell mds.`hostname` perf dump mds_cache
> ...etc...
> 
> ...but without any proper understanding of what is a normal rate for
 some number to go up it's really difficult to make anything from that.
> 
> btw - is there some convenie

[ceph-users] Re: OSD service specs in mixed environment

2024-06-28 Thread Anthony D'Atri


>> 
>> But this in a spec doesn't match it:
>> 
>> size: '7000G:'
>> 
>> This does:
>> 
>> size: '6950G:'

There definitely is some rounding within Ceph, and base 2 vs base 10 
shenanigans.  

> 
> $ cephadm shell ceph-volume inventory /dev/sdc --format json | jq 
> .sys_api.human_readable_size
> "3.64 TB"

Ceph, like humans, thinks in terms of base 2 units, e.g. GiB and TiB.  Storage 
manufacturers are, well, mustelids and almost always express capacity in base 10 
units, GB and TB, because those read as slightly higher.


> 
> The 'size:' spec you set is in GB (only GB and MB are supported). However, 
> ceph-volume inventory output can use other units (TB in this example). 
> Therefore, the orchestrator first converts both values to bytes. Since the 
> ceph-volume inventory produces a figure with only 2 decimals and the 
> conversion uses powers of 10 (1e+9 for GB, 1e+12 for TB), the matching size 
> here would be "size: 3640GB". This was confirmed by my testing a few months 
> ago.
> 
> If my understanding is correct, it may be worth adding to the doc [2] that 
> the device size is human_readable_size in TB from ceph-volume inventory x 10 
> GB.

Please enter a tracker ticket for this with details and tag me.  Do you mean 
x1000 not x10?


> 
> Regards,
> Frédéric.
> 
> [1] 
> https://github.com/ceph/ceph/blob/main/src/python-common/ceph/deployment/drive_selection/matchers.py
> [2] https://docs.ceph.com/en/latest/cephadm/services/osd/
> 
>> 
>> 
>>> Mvh.
>>> 
>>> Torkil
>>> 
>> 
>> --
>> Torkil Svensgaard
>> Sysadmin
>> MR-Forskningssektionen, afs. 714
>> DRCMR, Danish Research Centre for Magnetic Resonance
>> Hvidovre Hospital
>> Kettegård Allé 30
>> DK-2650 Hvidovre
>> Denmark
>> Tel: +45 386 22828
>> E-mail: tor...@drcmr.dk
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Viability of NVMeOF/TCP for VMWare

2024-06-27 Thread Anthony D'Atri
There are folks actively working on this gateway and there's a Slack channel.  
I haven't used it myself yet.

My understanding is that ESXi supports NFS.  Some people have had good success 
mounting KRBD volumes on a gateway system or VM and re-exporting via NFS.



> On Jun 27, 2024, at 09:01, Drew Weaver  wrote:
> 
> Howdy,
> 
> I recently saw that Ceph has a gateway which allows VMWare ESXi to connect to 
> RBD.
> 
> We had another gateway like this awhile back the ISCSI gateway.
> 
> The ISCSI gateway ended up being... let's say problematic.
> 
> Is there any reason to believe that NVMeOF will also end up on the floor and 
> has anyone that uses VMWare extensively evaluated its viability?
> 
> Just curious!
> 
> Thanks,
> -Drew
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Test after list GC

2024-06-24 Thread Anthony D'Atri
Here’s a test after de-crufting held messages.  Grok the fullness.

— aad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lot of spams on the list

2024-06-24 Thread Anthony D'Atri
I’m not sure if I have access but I can try.

> On Jun 24, 2024, at 4:37 PM, Kai Stian Olstad  wrote:
> 
> On 24.06.2024 19:15, Anthony D'Atri wrote:
>> * Subscription is now moderated
>> * The three worst spammers (you know who they are) have been removed
>> * I’ve deleted tens of thousands of crufty mail messages from the queue
>> The list should work normally now.  Working on the backlog of held messages. 
>>  99% are bogus, but I want to be careful wrt baby and bathwater.
> 
> Will the archive[1] also be clean up?
> 
> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/
> 
> -- 
> Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lot of spams on the list

2024-06-24 Thread Anthony D'Atri
* Subscription is now moderated
* The three worst spammers (you know who they are) have been removed
* I’ve deleted tens of thousands of crufty mail messages from the queue

The list should work normally now.  Working on the backlog of held messages.  
99% are bogus, but I want to be careful wrt baby and bathwater.



> On Jun 24, 2024, at 1:09 PM, Alex  wrote:
> 
> They seem to use the same few email address and then make new once. It
> should be possible to block them once a day to at least cut down the volume
> of emails but not completely block?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full list of metrics provided by ceph exporter daemon

2024-06-20 Thread Anthony D'Atri


> 
> That's not the answer I expected to see :( Are you suggesting enabling all
> possible daemons of Ceph just to obtain a full list of metrics?

Depends on whether you're after potential metrics or ones currently surfaced; I 
assumed you meant the latter.

> Isn't there any list in ceph repo or in documentation with all metrics 
> exposed?

Use the source, Luke?  I don't know of a documented list.

> Another question: ceph_rgw_qactive metric changed its instance_id from
> numeric format to a letter format. Is there any pull request for that in
> Ceph or it is Rook initiative? Example:

I believe recently there was a restructuring of some of the RGW metrics.

> 
> metrics from prometheus module:
> ceph_rgw_qactive{instance_id="4134960"} 0.0
> ceph_rgw_qactive{instance_id="4493203"} 0.0
> 
> metrics from ceph-exporter:
> ceph_rgw_qactive{instance_id="a"} 0 ceph_rgw_qactive{instance_id="a"} 0
> 
> 
> Thu, 20 Jun 2024 at 20:09, Anthony D'Atri :
> 
>> curl http://endpoint:port/metrics
>> 
>>> On Jun 20, 2024, at 10:15, Peter Razumovsky 
>> wrote:
>>> 
>>> Hello!
>>> 
>>> I'm using Ceph Reef with Rook v1.13 and want to find somewhere a full
>> list
>>> of metrics exported by brand new ceph exporter daemon. We found that some
>>> metrics have been changed after moving from prometheus module metrics to
>> a
>>> separate daemon.
>>> 
>>> We used described method [1] to give us a time to mitigate all risks
>> during
>>> transfer to a new daemon. Now we want to obtain a full list of
>>> ceph-exporter metrics to compare old and new metrics and to mitigate
>>> potential risks of losing monitoring data and breaking alerting systems.
>> We
>>> found some examples of metrics [2] but it is not complete so we will
>>> appreciate it if someone points us to the full list.
>>> 
>>> [1]
>>> 
>> https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics
>>> 
>>> [2] https://docs.ceph.com/en/latest/monitoring/
>>> 
>>> --
>>> Best regards,
>>> Peter Razumovsky
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
>> 
> 
> -- 
> Best regards,
> Peter Razumovsky
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full list of metrics provided by ceph exporter daemon

2024-06-20 Thread Anthony D'Atri
curl http://endpoint:port/metrics

> On Jun 20, 2024, at 10:15, Peter Razumovsky  wrote:
> 
> Hello!
> 
> I'm using Ceph Reef with Rook v1.13 and want to find somewhere a full list
> of metrics exported by brand new ceph exporter daemon. We found that some
> metrics have been changed after moving from prometheus module metrics to a
> separate daemon.
> 
> We used described method [1] to give us a time to mitigate all risks during
> transfer to a new daemon. Now we want to obtain a full list of
> ceph-exporter metrics to compare old and new metrics and to mitigate
> potential risks of losing monitoring data and breaking alerting systems. We
> found some examples of metrics [2] but it is not complete so we will
> appreciate it if someone points us to the full list.
> 
> [1]
> https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics
> 
> [2] https://docs.ceph.com/en/latest/monitoring/
> 
> -- 
> Best regards,
> Peter Razumovsky
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to change default osd reweight from 1.0 to 0.5

2024-06-19 Thread Anthony D'Atri
I’ve thought about this strategy in the past.  I think you could use a cron 
job to reset any OSDs at 1.0 to 0.5, but really the balancer module or JJ 
balancer is a better idea than old-style reweight.
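
If you must, an untested sketch of the cron approach (needs jq):

  ceph osd df -f json | jq -r '.nodes[] | select(.reweight == 1.0) | .id' | \
    while read id; do ceph osd reweight $id 0.5; done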

> On Jun 19, 2024, at 2:22 AM, 서민우  wrote:
> 
> Hello~
> 
> Our ceph cluster uses 260 osds.
> The most highest osd usage is 87% But, The most lowest is under 40%.
> We consider lowering the highest osd's reweight and rising the lowest osd's
> reweight.
> 
> We solved this problem following this workload.
> This is our workload.
> 1. Set All osd's reweight to 0.5
> 2. Rising the lowest osd's reweight to 0.6
> 3. Lowering the highest osd's reweight to 0.4
> 
> ** What I'm really curious about is this. **
> I want to set the default osd reweight to 0.5 for the new osd that will be
> added to the cluster.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitoring

2024-06-18 Thread Anthony D'Atri
Easier to ignore any node_exporter that Ceph (or k8s) deploys and just deploy 
your own on a different port across your whole fleet.

> On Jun 18, 2024, at 13:56, Alex  wrote:
> 
> But how do you combine it with Prometheus node exporter built into Ceph?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitoring

2024-06-18 Thread Anthony D'Atri
I don't; I have the fleetwide monitoring / observability systems query 
ceph_exporter and a fleetwide node_exporter instance on 9101.  ymmv.


> On Jun 18, 2024, at 09:25, Alex  wrote:
> 
> Good morning.
> 
> Our RH Ceph comes with Prometheus monitoring "built in". How does everyone
> integrate that into their existing monitoring infrastructure so Ceph and
> other servers are all under one dashboard?
> 
> Thanks,
> Alex.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Incomplete PGs. Ceph Consultant Wanted

2024-06-17 Thread Anthony D'Atri
Ohhh, so multiple OSD failure domains on a single SAN node?  I suspected as 
much.

I've experienced a Ceph cluster built on SanDisk InfiniFlash, which was arguably 
somewhere between SAN and DAS.  Each of 4 IF chassis drove 4x OSD nodes via SAS, 
but it was zoned such that the chassis was the failure domain in the CRUSH tree.

> On Jun 17, 2024, at 16:52, David C.  wrote:
> 
> In Pablo's unfortunate incident, it was because of a SAN incident, so it's 
> possible that Replica 3 didn't save him.
> In this scenario, the architecture is more the origin of the incident than 
> the number of replicas.


> It seems to me that replica 3 exists, by default, since firefly => make 
> replica 2, this is intentional.


The default EC profile though is 2,1 and that makes it too easy for someone to 
understandably assume that the default is suitable for production.  I have an 
action item to update docs and code to default to, say, 2,2 so that it still 
works on smaller configurations like sandboxes but is less dangerous.


> However, I'd rather see a full flash Replica 2 platform with solid backups 
> than Replica 3 without backups (well obviously, Replica 3, or E/C + backup 
> are much better).
> 

Tangent, but yeah RAID or replication != backups.  SolidFire was RF2 flash, 
their assertion was that resilvering was fast enough that it was safe.  With 
Ceph we know there's more to it than that, but I'm not sure if they had special 
provisions to address the sequences of events that can cause problems with Ceph 
RF2.  They did have limited scalability, though.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Incomplete PGs. Ceph Consultant Wanted

2024-06-17 Thread Anthony D'Atri


>> 
>> * We use replicated pools
>> * Replica 2, min replicas 1.

Note to self:   Change the docs and default to discourage this.  This is rarely 
appropriate in production.  

You had multiple overlapping drive failures?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: why not block gmail?

2024-06-17 Thread Anthony D'Atri
Yes.  I have admin juice on some other Ceph lists, I've asked for it here as 
well so that I can manage with alacrity.


> On Jun 17, 2024, at 09:31, Robert W. Eckert  wrote:
> 
> Is there any way to have a subscription request validated? 
> 
> -Original Message-
> From: Marc  
> Sent: Monday, June 17, 2024 7:56 AM
> To: ceph-users 
> Subject: [ceph-users] Re: why not block gmail?
> 
> I am putting ceph-users@ceph.io on the blacklist for now. Let me know via 
> different email address when it is resolved.
> 
>> 
>> Could we at least stop approving requests from obvious spammers?
>> 
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> 
>> 
>> From: Eneko Lacunza 
>> Sent: Monday, June 17, 2024 9:18 AM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: why not block gmail?
>> 
>> Hi,
>> 
>> El 15/6/24 a las 11:49, Marc escribió:
>>> If you don't block gmail, gmail/google will never make an effort to
>> clean up their shit. I don't think people with a gmail.com will mind, 
>> because this is free and get somewhere else a free account.
>>> 
>>> tip: google does not really know what part of their infrastructure 
>>> is
>> sending email so they use spf ~all. If you process gmail.com and force 
>> the -all manually, you block mostly spam.
>> In May, of 111 list messages 41 (no-spam) came from gmail.com
>> 
>> I think banning gmail.com will be an issue for the list, at least 
>> short-term.
>> 
>> Applying SPF -all seems better, but not sure about how easy that would 
>> be to implement... :)
>> 
>> Cheers
>> 
>> Eneko Lacunza
>> Zuzendari teknikoa | Director técnico
>> Binovo IT Human Project
>> 
>> Tel. +34 943 569 206 | https://www.binovo.es Astigarragako Bidea, 2 - 
>> 2º izda. Oficina 10-11, 20180 Oiartzun
>> 
>> https://www.youtube.com/user/CANALBINOVO
>> https://www.linkedin.com/company/37269706/
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
>> email to ceph-users-le...@ceph.io 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
>> email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
> ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-13 Thread Anthony D'Atri
There you go.

Tiny objects are the hardest thing for any object storage service:  you can 
have space amplification and metadata operations become a very high portion of 
the overall workload.

With 500KB objects, you may waste a significant fraction of underlying space -- 
especially if you have large-IU QLC OSDs, or OSDs made with an older Ceph 
release where the min_alloc_size was 64KB vs the current 4KB.  This is 
exacerbated by EC if you're using it, as many do for buckets pools.

https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?gid=358760253#gid=358760253
Bluestore Space Amplification Cheat Sheet


Things to do:  Disable Nagle  https://docs.ceph.com/en/quincy/radosgw/frontends/
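
e.g., assuming the Beast frontend and that 8080 is your existing port -- adjust the 
who (client.rgw.NAME under cephadm) and port to taste; sketch, not gospel:

  ceph config set client.rgw rgw_frontends "beast port=8080 tcp_nodelay=1"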

Putting your index pool on as many SSDs as you can would also help, I don't 
recall if it's on HDD now.   Index doesn't use all that much data, but benefits 
from a generous pg_num and multiple OSDs so that it isn't bottlenecked.


> On Jun 13, 2024, at 15:13, Sinan Polat  wrote:
> 
> 500K object size
> 
>> Op 13 jun 2024 om 21:11 heeft Anthony D'Atri  het 
>> volgende geschreven:
>> 
>> How large are the objects you tested with?  
>> 
>>> On Jun 13, 2024, at 14:46, si...@turka.nl wrote:
>>> 
>>> I have doing some further testing.
>>> 
>>> My RGW pool is placed on spinning disks.
>>> I created a 2nd RGW data pool, placed on flash disks.
>>> 
>>> Benchmarking on HDD pool:
>>> Client 1 -> 1 RGW Node: 150 obj/s
>>> Client 1-5 -> 1 RGW Node: 150 ob/s (30 obj/s each client)
>>> Client 1 -> HAProxy -> 3 RGW Nodes: 150 obj/s
>>> Client 1-5 -> HAProxy -> 3 RGW Nodes: 150 obj/s (30 obj/s each client)
>>> 
>>> I did the same tests towards the RGW pool on flash disks: same results
>>> 
>>> So, it doesn't matter if my pool is hosted on HDD or SSD.
>>> It doesn't matter if I am using 1 RGW or 3 RGW nodes.
>>> It doesn't matter if I am using 1 client or 5 clients.
>>> 
>>> I am constantly limited at around 140-160 objects/s.
>>> 
>>> I see some TCP Retransmissions on the RGW Node, but maybe thats 'normal'.
>>> 
>>> Any ideas/suggestions?
>>> 
>>> On 2024-06-11 22:08, Anthony D'Atri wrote:
>>>>> I am not sure adding more RGW's will increase the performance.
>>>> That was a tangent.
>>>>> To be clear, that means whatever.rgw.buckets.index ?
>>>>>>> No, sorry my bad. .index is 32 and .data is 256.
>>>>>> Oh, yeah. Does `ceph osd df` show you at the far right like 4-5 PG 
>>>>>> replicas on each OSD?  You want (IMHO) to end up with 100-200, keeping 
>>>>>> each pool's pg_num to a power of 2 ideally.
>>>>> No, my RBD pool is larger. My average PG per OSD is round 60-70.
>>>> Ah.  Aim for 100-200 with spinners.
>>>>>> Assuming all your pools span all OSDs, I suggest at a minimum 256 for 
>>>>>> .index and 8192 for .data, assuming you have only RGW pools.  And would 
>>>>>> be included to try 512 / 8192.  Assuming your  other minor pools are at 
>>>>>> 32, I'd bump .log and .non-ec to 128 or 256 as well.
>>>>>> If you have RBD or other pools colocated, those numbers would change.
>>>>>> ^ above assume disabling the autoscaler
>>>>> I bumped my .data pool from 256 to 1024 and .index from 32 to 128.
>>>> Your index pool still only benefits from half of your OSDs with a value of 
>>>> 128.
>>>>> Also doubled the .non-e and .log pools. Performance wise I don't see any 
>>>>> improvement. If I would see 10-20% improvement, I definitely would 
>>>>> increase it to 512 / 8192.
>>>>> With 0.5MB object size I am still limited at about 150 up to 250 
>>>>> objects/s.
>>>>> The disks aren't saturated. The wr await is mostly around 1ms and does 
>>>>> not get higher when benchmarking with S3.
>>>> Trust iostat about as far as you can throw it.
>>>>> Other suggestions, or does anyone else has suggestions?
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> 
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-13 Thread Anthony D'Atri
How large are the objects you tested with?  

> On Jun 13, 2024, at 14:46, si...@turka.nl wrote:
> 
> I have doing some further testing.
> 
> My RGW pool is placed on spinning disks.
> I created a 2nd RGW data pool, placed on flash disks.
> 
> Benchmarking on HDD pool:
> Client 1 -> 1 RGW Node: 150 obj/s
> Client 1-5 -> 1 RGW Node: 150 ob/s (30 obj/s each client)
> Client 1 -> HAProxy -> 3 RGW Nodes: 150 obj/s
> Client 1-5 -> HAProxy -> 3 RGW Nodes: 150 obj/s (30 obj/s each client)
> 
> I did the same tests towards the RGW pool on flash disks: same results
> 
> So, it doesn't matter if my pool is hosted on HDD or SSD.
> It doesn't matter if I am using 1 RGW or 3 RGW nodes.
> It doesn't matter if I am using 1 client or 5 clients.
> 
> I am constantly limited at around 140-160 objects/s.
> 
> I see some TCP Retransmissions on the RGW Node, but maybe thats 'normal'.
> 
> Any ideas/suggestions?
> 
> On 2024-06-11 22:08, Anthony D'Atri wrote:
>>> I am not sure adding more RGW's will increase the performance.
>> That was a tangent.
>>> To be clear, that means whatever.rgw.buckets.index ?
>>>>> No, sorry my bad. .index is 32 and .data is 256.
>>>> Oh, yeah. Does `ceph osd df` show you at the far right like 4-5 PG 
>>>> replicas on each OSD?  You want (IMHO) to end up with 100-200, keeping 
>>>> each pool's pg_num to a power of 2 ideally.
>>> No, my RBD pool is larger. My average PG per OSD is round 60-70.
>> Ah.  Aim for 100-200 with spinners.
>>>> Assuming all your pools span all OSDs, I suggest at a minimum 256 for 
>>>> .index and 8192 for .data, assuming you have only RGW pools.  And would be 
>>>> included to try 512 / 8192.  Assuming your  other minor pools are at 32, 
>>>> I'd bump .log and .non-ec to 128 or 256 as well.
>>>> If you have RBD or other pools colocated, those numbers would change.
>>>> ^ above assume disabling the autoscaler
>>> I bumped my .data pool from 256 to 1024 and .index from 32 to 128.
>> Your index pool still only benefits from half of your OSDs with a value of 
>> 128.
>>> Also doubled the .non-e and .log pools. Performance wise I don't see any 
>>> improvement. If I would see 10-20% improvement, I definitely would increase 
>>> it to 512 / 8192.
>>> With 0.5MB object size I am still limited at about 150 up to 250 objects/s.
>>> The disks aren't saturated. The wr await is mostly around 1ms and does not 
>>> get higher when benchmarking with S3.
>> Trust iostat about as far as you can throw it.
>>> Other suggestions, or does anyone else has suggestions?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Patching Ceph cluster

2024-06-12 Thread Anthony D'Atri
That's just setting noout, norebalance, etc.
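
For the record, the poor man's version is just:

  ceph osd set noout
  ceph osd set norebalance
  # ... patch / reboot ...
  ceph osd unset norebalance
  ceph osd unset noout

and with cephadm there's a per-host wrapper, roughly (hostname is a placeholder):

  ceph orch host maintenance enter host01
  ceph orch host maintenance exit host01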

> On Jun 12, 2024, at 11:28, Michael Worsham  
> wrote:
> 
> Interesting. How do you set this "maintenance mode"? If you have a series of 
> documented steps that you have to do and could provide as an example, that 
> would be beneficial for my efforts.
> 
> We are in the process of standing up both a dev-test environment consisting 
> of 3 Ceph servers (strictly for testing purposes) and a new production 
> environment consisting of 20+ Ceph servers.
> 
> We are using Ubuntu 22.04.
> 
> -- Michael
> 
> 
> From: Daniel Brown 
> Sent: Wednesday, June 12, 2024 9:18 AM
> To: Anthony D'Atri 
> Cc: Michael Worsham ; ceph-users@ceph.io 
> 
> Subject: Re: [ceph-users] Patching Ceph cluster
> 
> This is an external email. Please take care when clicking links or opening 
> attachments. When in doubt, check with the Help Desk or Security.
> 
> 
> There’s also a Maintenance mode that you can set for each server, as you’re 
> doing updates, so that the cluster doesn’t try to move data from affected 
> OSD’s, while the server being updated is offline or down. I’ve worked some on 
> automating this with Ansible, but have found my process (and/or my cluster) 
> still requires some manual intervention while it’s running to get things done 
> cleanly.
> 
> 
> 
>> On Jun 12, 2024, at 8:49 AM, Anthony D'Atri  wrote:
>> 
>> Do you mean patching the OS?
>> 
>> If so, easy -- one node at a time, then after it comes back up, wait until 
>> all PGs are active+clean and the mon quorum is complete before proceeding.
>> 
>> 
>> 
>>> On Jun 12, 2024, at 07:56, Michael Worsham  
>>> wrote:
>>> 
>>> What is the proper way to patch a Ceph cluster and reboot the servers in 
>>> said cluster if a reboot is necessary for said updates? And is it possible 
>>> to automate it via Ansible? This message and its attachments are from Data 
>>> Dimensions and are intended only for the use of the individual or entity to 
>>> which it is addressed, and may contain information that is privileged, 
>>> confidential, and exempt from disclosure under applicable law. If the 
>>> reader of this message is not the intended recipient, or the employee or 
>>> agent responsible for delivering the message to the intended recipient, you 
>>> are hereby notified that any dissemination, distribution, or copying of 
>>> this communication is strictly prohibited. If you have received this 
>>> communication in error, please notify the sender immediately and 
>>> permanently delete the original email and destroy any copies or printouts 
>>> of this email as well as any attachments.
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> This message and its attachments are from Data Dimensions and are intended 
> only for the use of the individual or entity to which it is addressed, and 
> may contain information that is privileged, confidential, and exempt from 
> disclosure under applicable law. If the reader of this message is not the 
> intended recipient, or the employee or agent responsible for delivering the 
> message to the intended recipient, you are hereby notified that any 
> dissemination, distribution, or copying of this communication is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender immediately and permanently delete the original email and destroy 
> any copies or printouts of this email as well as any attachments.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool size

2024-06-12 Thread Anthony D'Atri
If you have:

* pg_num too low (defaults are too low)
* pg_num not a power of 2
* pg_num != number of OSDs in the pool
* balancer not enabled

any of those might result in imbalance.
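
Quick ways to check -- the pool name here is hypothetical:

  ceph osd pool get cephfs.meta pg_num
  ceph osd df                          # PGS column at the far right
  ceph balancer mode upmap
  ceph balancer on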

> On Jun 12, 2024, at 07:33, Eugen Block  wrote:
> 
> I don't have any good explanation at this point. Can you share some more 
> information like:
> 
> ceph pg ls-by-pool 
> ceph osd df (for the relevant OSDs)
> ceph df
> 
> Thanks,
> Eugen
> 
> Zitat von Lars Köppel :
> 
>> Since my last update the size of the largest OSD increased by 0.4 TiB while
>> the smallest one only increased by 0.1 TiB. How is this possible?
>> 
>> Because the metadata pool reported to have only 900MB space left, I stopped
>> the hot-standby MDS. This gave me 8GB back but these filled up in the last
>> 2h.
>> I think I have to zap the next OSD because the filesystem is getting read
>> only...
>> 
>> How is it possible that an OSD has over 1 TiB less data on it after a
>> rebuild? And how is it possible to have so different sizes of OSDs?
>> 
>> 
>> Lars Köppel
>> Developer
>> Email: lars.koep...@ariadne.ai
>> Phone: +49 6221 5993580 <+4962215993580>
>> ariadne.ai (Germany) GmbH
>> Häusserstraße 3, 69115 Heidelberg
>> Amtsgericht Mannheim, HRB 744040
>> Geschäftsführer: Dr. Fabian Svara
>> https://ariadne.ai
>> 
>> 
>> On Tue, Jun 11, 2024 at 3:47 PM Lars Köppel  wrote:
>> 
>>> Only in warning mode. And there were no PG splits or merges in the last 2
>>> month.
>>> 
>>> 
>>> Lars Köppel
>>> Developer
>>> Email: lars.koep...@ariadne.ai
>>> Phone: +49 6221 5993580 <+4962215993580>
>>> ariadne.ai (Germany) GmbH
>>> Häusserstraße 3, 69115 Heidelberg
>>> Amtsgericht Mannheim, HRB 744040
>>> Geschäftsführer: Dr. Fabian Svara
>>> https://ariadne.ai
>>> 
>>> 
>>> On Tue, Jun 11, 2024 at 3:32 PM Eugen Block  wrote:
>>> 
 I don't think scrubs can cause this. Do you have autoscaler enabled?
 
 Zitat von Lars Köppel :
 
 > Hi,
 >
 > thank you for your response.
 >
 > I don't think this thread covers my problem, because the OSDs for the
 > metadata pool fill up at different rates. So I would think this is no
 > direct problem with the journal.
 > Because we had earlier problems with the journal I changed some
 > settings(see below). I already restarted all MDS multiple times but no
 > change here.
 >
 > The health warnings regarding cache pressure resolve normally after a
 > short period of time, when the heavy load on the client ends. Sometimes
 it
 > stays a bit longer because an rsync is running and copying data on the
 > cluster(rsync is not good at releasing the caps).
 >
 > Could it be a problem if scrubs run most of the time in the background?
 Can
 > this block any other tasks or generate new data itself?
 >
 > Best regards,
 > Lars
 >
 >
 > global  basic mds_cache_memory_limit
 > 17179869184
 > global  advanced  mds_max_caps_per_client
 >16384
 > global  advanced
 mds_recall_global_max_decay_threshold
 >262144
 > global  advanced  mds_recall_max_decay_rate
 >1.00
 > global  advanced  mds_recall_max_decay_threshold
 > 262144
 > mds advanced  mds_cache_trim_threshold
 > 131072
 > mds advanced  mds_heartbeat_grace
 >120.00
 > mds advanced  mds_heartbeat_reset_grace
 >7400
 > mds advanced  mds_tick_interval
 >3.00
 >
 >
> Lars Köppel
 > Developer
 > Email: lars.koep...@ariadne.ai
 > Phone: +49 6221 5993580 <+4962215993580>
 > ariadne.ai (Germany) GmbH
 > Häusserstraße 3, 69115 Heidelberg
 > Amtsgericht Mannheim, HRB 744040
 > Geschäftsführer: Dr. Fabian Svara
 > https://ariadne.ai
 >
 >
 > On Tue, Jun 11, 2024 at 2:05 PM Eugen Block  wrote:
 >
 >> Hi,
 >>
 >> can you check if this thread [1] applies to your situation? You don't
 >> have multi-active MDS enabled, but maybe it's still some journal
 >> trimming, or maybe misbehaving clients? In your first post there were
 >> health warnings regarding cache pressure and cache size. Are those
 >> resolved?
 >>
 >> [1]
 >>
 >>
 https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7U27L27FHHPDYGA6VNNVWGLTXCGP7X23/#VOOV235D4TP5TEOJUWHF4AVXIOTHYQQE
 >>
 >> Zitat von Lars Köppel :
 >>
 >> > Hello everyone,
 >> >
 >> > short update to this problem.
 >> > The zapped OSD is rebuilt and it has now 1.9 TiB (the expected size
 >> ~50%).
 >> > The other 2 OSDs are now at 2.8 respectively 3.2 TiB. They jumped up
 and
 >> > down a lot but the higher one has now

[ceph-users] Re: Patching Ceph cluster

2024-06-12 Thread Anthony D'Atri
Do you mean patching the OS?

If so, easy -- one node at a time, then after it comes back up, wait until all 
PGs are active+clean and the mon quorum is complete before proceeding.
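
A crude sketch of gating that wait from a script -- anything sitting in HEALTH_WARN 
for unrelated reasons will stall it, so eyeball it too:

  until [ "$(ceph health)" = "HEALTH_OK" ]; do sleep 30; done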



> On Jun 12, 2024, at 07:56, Michael Worsham  
> wrote:
> 
> What is the proper way to patch a Ceph cluster and reboot the servers in said 
> cluster if a reboot is necessary for said updates? And is it possible to 
> automate it via Ansible? This message and its attachments are from Data 
> Dimensions and are intended only for the use of the individual or entity to 
> which it is addressed, and may contain information that is privileged, 
> confidential, and exempt from disclosure under applicable law. If the reader 
> of this message is not the intended recipient, or the employee or agent 
> responsible for delivering the message to the intended recipient, you are 
> hereby notified that any dissemination, distribution, or copying of this 
> communication is strictly prohibited. If you have received this communication 
> in error, please notify the sender immediately and permanently delete the 
> original email and destroy any copies or printouts of this email as well as 
> any attachments.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-11 Thread Anthony D'Atri


> 
> I am not sure adding more RGW's will increase the performance.

That was a tangent.

> To be clear, that means whatever.rgw.buckets.index ?
>>> No, sorry my bad. .index is 32 and .data is 256.
>> Oh, yeah. Does `ceph osd df` show you at the far right like 4-5 PG replicas 
>> on each OSD?  You want (IMHO) to end up with 100-200, keeping each pool's 
>> pg_num to a power of 2 ideally.
> 
> No, my RBD pool is larger. My average PG per OSD is round 60-70.

Ah.  Aim for 100-200 with spinners.

> 
>> Assuming all your pools span all OSDs, I suggest at a minimum 256 for .index 
>> and 8192 for .data, assuming you have only RGW pools.  And would be included 
>> to try 512 / 8192.  Assuming your  other minor pools are at 32, I'd bump 
>> .log and .non-ec to 128 or 256 as well.
>> If you have RBD or other pools colocated, those numbers would change.
>> ^ above assume disabling the autoscaler
> 
> I bumped my .data pool from 256 to 1024 and .index from 32 to 128.

Your index pool still only benefits from half of your OSDs with a value of 128.


> Also doubled the .non-e and .log pools. Performance wise I don't see any 
> improvement. If I would see 10-20% improvement, I definitely would increase 
> it to 512 / 8192.
> With 0.5MB object size I am still limited at about 150 up to 250 objects/s.
> 
> The disks aren't saturated. The wr await is mostly around 1ms and does not 
> get higher when benchmarking with S3.

Trust iostat about as far as you can throw it.
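
If you want latency numbers from the OSDs themselves rather than from iostat, a 
few places to look; osd.12 is just a placeholder, and the `ceph daemon` commands 
have to run on that OSD's host:

  ceph osd perf                                    # per-OSD commit/apply latency
  ceph daemon osd.12 perf dump | grep -A4 op_w_latency
  ceph daemon osd.12 dump_historic_ops             # slowest recent ops, per-stage timings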


> 
> Other suggestions, or does anyone else have suggestions?
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Attention: Documentation - mon states and names

2024-06-11 Thread Anthony D'Atri
Custom names were never really 100% implemented, and I would not be surprised 
if they don't work in Reef.

> On Jun 11, 2024, at 14:02, Joel Davidow  wrote:
> 
> Zac,
> 
> Thanks for your super-fast response and action on this. Those four items
> are great and the corresponding email as reformatted looks good.
> 
> Jana's point about cluster names is a good one. The deprecation of custom
> cluster names, which appears to have started in octopus per
> https://docs.ceph.com/en/octopus/rados/configuration/common/, alleviates
> that confusion going forward but does not help with clusters already
> deployed with custom names.
> 
> Thanks again,
> Joel
> 
> On Tue, Jun 11, 2024 at 2:26 AM Janne Johansson  wrote:
> 
>>> Note the difference of convention in ceph command presentation. In
>>> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#understanding-mon-status
>>> , mon.X uses X to represent the portion of the command to be replaced by the
>>> operator with a specific value. However, that may not be clear to all
>>> readers, some of whom may read that as a literal X. I recommend switching
>>> convention to something that makes visually explicit any portion of a
>>> command that an operator has to replace with a specific value. One such
>>> convention is to use <> as delimiters marking the portion of a command that
>>> an operator has to replace with a specific value, minus the delimiters
>>> themselves. I'm sure there are other conventions that would accomplish the
>>> same goal and provide the <> convention as an example only.
>> 
>> Yes, this is one of my main gripes. Many of the doc parts should more
>> visibly point out which words or parts of names are the ones that you
>> chose (by selecting a hostname for instance), it gets weird when you
>> see "mon-1" or "client.rgw.rgw1" and you don't know which of those are
>> to be changed to suit your environment and which are not. Sometimes
>> the "ceph" word sneaks into paths because it is the name of the
>> software (duh) but sometimes because it is the clustername. Now I hope
>> not many people change their clustername, but if you did, docs
>> would be hard to follow in order to figure out where to replace "ceph"
>> with your cluster name.
>> 
>>> Also, the actual name of a mon is not clear due to the variety of mon name
>>> formats. The value of the NAME column returned by ceph orch ps
>>> --daemon-type mon and the return from ceph mon dump follow the format of
>>> mon.<name>, whereas the value of name returned by ceph tell mon.<name>
>>> mon_status, the mon line returned by ceph -s, and the return from ceph mon
>>> stat follow the format of <name>. Unifying the return for the mon name
>>> value of all those commands could be helpful in establishing the format of
>>> a mon name, though that is probably easier said than done.
>>> 
>>> In addition, in
>>> https://docs.ceph.com/en/latest/rados/configuration/mon-config-ref/#configuring-monitors
>>> , mon names are stated to use alpha notation by convention, but that
>>> convention is not followed by cephadm in the clusters that I've deployed.
>>> Cephadm also uses a minimal ceph.conf file with configs in the mon
>>> database. I recommend this section be updated to mention those changes. If
>>> there is a way to explain what a mon name is or how it is formatted,
>>> perhaps adding that to that same section would be good.
>> 
>> 
>> 
>> --
>> May the most significant bit of your life be positive.
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: About disk disk iops and ultil peak

2024-06-10 Thread Anthony D'Atri
What specifically are your OSD devices?
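
If you're not sure how to answer that, something like this shows what Ceph knows 
about the drive behind an OSD; osd.268 is just the one from your log excerpt:

  ceph osd metadata 268 | grep -E 'devices|rotational|bluestore_bdev'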

> On Jun 10, 2024, at 22:23, Phong Tran Thanh  wrote:
> 
> Hi ceph user!
> 
> I am encountering a problem with the IOPS and disk utilization of my OSDs. 
> Sometimes disk IOPS and utilization peak too high, which affects my 
> cluster and causes slow operations to appear in the logs.
> 
> 6/6/24 9:51:46 AM[WRN]Health check update: 0 slow ops, oldest one blocked for 
> 36 sec, osd.268 has slow ops (SLOW_OPS)
> 
> 6/6/24 9:51:37 AM[WRN]Health check update: 0 slow ops, oldest one blocked for 
> 31 sec, osd.268 has slow ops (SLOW_OPS)
> 
> 
> This is the config I applied to reduce it, but it does not resolve my problem:
> global  advanced  osd_mclock_profile                                custom
> global  advanced  osd_mclock_scheduler_background_best_effort_lim   0.10
> global  advanced  osd_mclock_scheduler_background_best_effort_res   0.10
> global  advanced  osd_mclock_scheduler_background_best_effort_wgt   1
> global  advanced  osd_mclock_scheduler_background_recovery_lim      0.10
> global  advanced  osd_mclock_scheduler_background_recovery_res      0.10
> global  advanced  osd_mclock_scheduler_background_recovery_wgt      1
> global  advanced  osd_mclock_scheduler_client_lim                   0.40
> global  advanced  osd_mclock_scheduler_client_res                   0.40
> global  advanced  osd_mclock_scheduler_client_wgt                   4
> 
> Hope someone can help me
> 
> Thanks so much!
> --
> 
> Email: tranphong...@gmail.com 
> Skype: tranphong079
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread Anthony D'Atri


>> To be clear, you don't need more nodes.  You can add RGWs to the ones you 
>> already have.  You have 12 OSD nodes - why not put an RGW on each?

> Might be an option, just don't like the idea to host multiple components on 
> nodes. But I'll consider it.

I really don't like mixing mon/mgr with other components because of coupled 
failure domains, and past experience with mon misbehavior, but many people do 
that.  ymmv.  With a bunch of RGWs none of them needs to grow to consume 
significant resources, and it can be difficult to get a single RGW daemon to 
really use all of a dedicated node by itself.

> 
 There are still serializations in the OSD and PG code.  You have 240 OSDs, 
 does your index pool have *at least* 256 PGs?
>>> Index as the data pool has 256 PG's.
>> To be clear, that means whatever.rgw.buckets.index ?
> 
> No, sorry my bad. .index is 32 and .data is 256.

Oh, yeah. Does `ceph osd df` show you, in the PGS column at the far right, 
something like 4-5 PG replicas on each OSD?  You want (IMHO) to end up with 
100-200, keeping each pool's pg_num to a power of 2 ideally.

Assuming all your pools span all OSDs, I suggest at a minimum 256 for .index 
and 8192 for .data, assuming you have only RGW pools.  And would be inclined to 
try 512 / 8192.  Assuming your other minor pools are at 32, I'd bump .log and 
.non-ec to 128 or 256 as well.

If you have RBD or other pools colocated, those numbers would change.



^ above assume disabling the autoscaler
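
For reference, the arithmetic behind those targets, assuming the data pool is 
3-way replicated across all 240 OSDs:

  8192 PGs x 3 replicas / 240 OSDs  ~=  102 PG replicas per OSD from .data alone

And to disable the autoscaler per pool (pool names here are the usual defaults, 
adjust to match yours):

  ceph osd pool set default.rgw.buckets.data  pg_autoscale_mode off
  ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off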
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread Anthony D';Atri



 
>>> You are right here, but we use Ceph mainly for RBD. It performs 'good 
>>> enough' for our RBD load.
>> You use RBD for archival?
> 
> No, storage for (light-weight) virtual machines.

I'm surprised that it's enough, I've seen HDDs fail miserably in that role.

> The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't hosted 
> on the OSD nodes and are running on modern hardware.
>> You didn't list additional nodes so I assumed.  You might still do well to 
>> have a larger number of RGWs, wherever they run.  RGWs often scale better 
>> horizontally than vertically.
> 
> Good to know. I'll check if adding more RGW nodes is possible.

To be clear, you don't need more nodes.  You can add RGWs to the ones you 
already have.  You have 12 OSD nodes - why not put an RGW on each?
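
If the cluster happens to be cephadm-managed, that's one label and one service 
spec; the label and realm/zone names below are placeholders, and the exact 
`ceph orch apply rgw` arguments vary a bit by release (on Octopus it takes realm 
and zone as positional arguments, I believe):

  ceph orch host label add <osd-host> rgw        # repeat for each of the 12 hosts
  ceph orch apply rgw myrealm myzone --placement="label:rgw"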

> 
>> There are still serializations in the OSD and PG code.  You have 240 OSDs, 
>> does your index pool have *at least* 256 PGs?
> 
> Index as the data pool has 256 PG's.

To be clear, that means whatever.rgw.buckets.index ?
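
A quick way to check, with the pool name prefix depending on your zone:

  ceph osd pool ls detail | grep -E 'rgw.buckets.(index|data)'

pg_num for each pool is shown in that output.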

> 
 You might also disable Nagle on the RGW nodes.
>>> I need to lookup what that exactly is and does.
> It depends on the concurrency setting of Warp.
> It looks like the objects/s is the bottleneck, not the throughput.
> Max memory usage is about 80-90GB per node. CPU's are quite idling.
> Is it reasonable to expect more IOps / objects/s for RGW with my setup? 
> At this moment I am not able to find the bottleneck what is causing the 
> low obj/s.
 HDDs are a false economy.
>>> Got it :)
> Ceph version is 15.2.
> Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

