[ceph-users] Re: Monitors for two different cluster

2024-10-03 Thread Christian Wuerdig
You could also add those SSD nodes to the existing cluster and just make a separate SSD pool. On Fri, 4 Oct 2024 at 01:06, Michel Niyoyita wrote: > Hello Anthony, > > Thank you for your reply. The first cluster is fully HDD drives and the > second would be SSD based. If it is not good to share mo

[ceph-users] Re: 3 DC with 4+5 EC not quite working

2024-01-14 Thread Christian Wuerdig
I could be wrong, but as far as I can see you have 9 chunks, which require 9 failure domains. Your failure domain is set to datacenter, of which you only have 3. So that won't work. You need to set your failure domain to host and then create a crush rule to choose a DC and choose 3 hosts within
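
A quick back-of-the-envelope check of that constraint (a hypothetical Python sketch of the arithmetic, not a Ceph command; the k=4, m=5 and 3-DC figures are taken from the thread):

    k, m = 4, 5
    chunks = k + m                          # 9 shards to place
    datacenters = 3
    if chunks > datacenters:
        per_dc = -(-chunks // datacenters)  # ceil(9/3) = 3 shards per DC
        print(f"{chunks} shards cannot use datacenter as failure domain with only {datacenters} DCs;")
        print(f"a rule picking {datacenters} DCs and {per_dc} hosts within each is needed instead")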

[ceph-users] Re: cephadm - podman vs docker

2023-12-31 Thread Christian Wuerdig
The general complaint about docker is usually that, by default, it stops all running containers when the docker daemon gets shut down. There is the "live-restore" option (which has been around for a while) but that's turned off by default (and requires a daemon restart to enable). It only supports patch u

[ceph-users] Re: EC Profiles & DR

2023-12-05 Thread Christian Wuerdig
You can structure your crush map so that you get multiple EC chunks per host in a way that you can still survive a host outage even though you have fewer hosts than k+1. For example, if you run an EC=4+2 profile on 3 hosts you can structure your crushmap so that you have 2 chunks per host. Thi

[ceph-users] Re: Hardware recommendations for a Ceph cluster

2023-10-10 Thread Christian Wuerdig
On Mon, 9 Oct 2023 at 14:24, Anthony D'Atri wrote: > > > > AFAIK the standing recommendation for all flash setups is to prefer fewer > > but faster cores > > Hrm, I think this might depend on what you’re solving for. This is the > conventional wisdom for MDS for sure. My sense is that OSDs can

[ceph-users] Re: Hardware recommendations for a Ceph cluster

2023-10-08 Thread Christian Wuerdig
AFAIK the standing recommendation for all-flash setups is to prefer fewer but faster cores, so something like a 75F3 might yield better latency. Plus you probably want to experiment with partitioning the NVMes and running multiple OSDs per drive - either 2 or 4. On Sat, 7 Oct 2023 at 08:23,

[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing

2023-07-16 Thread Christian Wuerdig
Based on my understanding of CRUSH, it works down the hierarchy: at each level it randomly (but deterministically for a given CRUSH map) picks buckets for the object, based on the specific selection rule, and it does this recursively until it ends up at the leaf nodes. Given that
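
A toy sketch of that idea (deliberately simplified, with a hypothetical hierarchy; this is not the real straw2 algorithm): hash the object together with each candidate bucket, take the best score, and recurse.

    import hashlib

    def score(obj, bucket):
        # deterministic pseudo-random score for an (object, bucket) pair
        return int(hashlib.sha256(f"{obj}:{bucket}".encode()).hexdigest(), 16)

    def pick(obj, tree, node="root"):
        children = tree.get(node)
        if not children:                    # reached a leaf of the hierarchy
            return node
        best = max(children, key=lambda c: score(obj, c))
        return pick(obj, tree, best)        # recurse into the chosen bucket

    tree = {"root": ["dc1", "dc2"], "dc1": ["host1", "host2"], "dc2": ["host3", "host4"]}
    print(pick("object-42", tree))          # the same object always lands on the same leaf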

[ceph-users] Re: Encryption per user Howto

2023-05-22 Thread Christian Wuerdig
Hm, this thread is confusing in the context of what S3 client-side encryption means - the user is responsible for encrypting the data with their own keys before submitting it. As far as I'm aware, client-side encryption doesn't require any specific server support - it's a function of the client SDK used whi

[ceph-users] Re: Eccessive occupation of small OSDs

2023-04-02 Thread Christian Wuerdig
With failure domain host, your max usable cluster capacity is essentially constrained by the total capacity of the smallest host, which is 8TB if I read the output correctly. You need to balance your hosts better by swapping drives. On Fri, 31 Mar 2023 at 03:34, Nicola Mori wrote: > Dear Ceph user
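
To illustrate with made-up numbers (a hypothetical 3-host cluster with 3x replication, so every PG needs one copy on every host):

    hosts_tb = [8, 14, 20]                  # assumed per-host raw capacity
    usable = min(hosts_tb)                  # smallest host fills up first -> ~8 TB usable
    balanced = sum(hosts_tb) / 3            # what 3x replication could give if hosts were equal
    print(f"usable ~{usable} TB, ~{balanced - usable:.1f} TB effectively stranded")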

[ceph-users] Re: Suggestion to build ceph storage

2022-06-19 Thread Christian Wuerdig
On Sun, 19 Jun 2022 at 02:29, Satish Patel wrote: > Greeting folks, > > We are planning to build Ceph storage for mostly cephFS for HPC workload > and in future we are planning to expand to S3 style but that is yet to be > decided. Because we need mass storage, we bought the following HW. > > 15

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-23 Thread Christian Wuerdig
I would not host multiple OSDs on a spinning drive (unless it's one of those Seagate MACH.2 drives that have two independent heads) - head seek time will most likely kill performance. The main reason to host multiple OSDs on a single SSD or NVMe is typically to make use of the large IOPS capacity whi

[ceph-users] Re: [EXTERNAL] Re: Why you might want packages not containers for Ceph deployments

2021-11-18 Thread Christian Wuerdig
I think Marc uses containers - but they've chosen Apache Mesos as orchestrator and cephadm doesn't work with that. Currently essentially two ceph container orchestrators exist - rook, which is a ceph orchestrator for kubernetes, and cephadm, which is an orchestrator expecting docker or podman. Admittedly I do

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-08 Thread Christian Wuerdig
In addition to what the others said - generally there is little point in splitting block.db and wal partitions - just stick to one for both. What model are your SSDs and how well do they handle small direct writes? Because that's what you'll be getting on them and the wrong type of SSD can make things

[ceph-users] Re: [Ceph] Recovery is very Slow

2021-10-28 Thread Christian Wuerdig
Yes, just expose each disk as an individual OSD and you'll already be better off. Depending what type of SSD they are - if they can sustain high random write IOPS you may even want to consider partitioning each disk and create 2 OSDs per SSD to make better use of the available IO capacity. For all-

[ceph-users] Re: Open discussing: Designing 50GB/s CephFS or S3 ceph cluster

2021-10-21 Thread Christian Wuerdig
- What is the expected file/object size distribution and count?
- Is it write-once or modify-often data?
- What's your overall required storage capacity?
- 18 OSDs per WAL/DB drive seems a lot - recommended is ~6-8
- With 12TB OSD the recommended WAL/DB size is 120-480GB (1-4%) per O
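
For the last point, the 1-4% rule of thumb works out roughly like this (illustrative arithmetic only):

    osd_tb = 12
    low_gb = 0.01 * osd_tb * 1000           # 1% -> 120 GB
    high_gb = 0.04 * osd_tb * 1000          # 4% -> 480 GB
    print(f"WAL/DB per {osd_tb}TB OSD: ~{low_gb:.0f}-{high_gb:.0f} GB")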

[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Christian Wuerdig
as well, suggested that in a > replicated pool writes and reads are handled by the primary PG, which would > explain this write bandwidth limit. > > /Z > > On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, > wrote: > >> Maybe some info is missing but 7k write IOP

[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance

2021-10-05 Thread Christian Wuerdig
Maybe some info is missing but 7k write IOPS at 4k block size seems fairly decent (as you also state) - the bandwidth automatically follows from that so not sure what you're expecting? I am a bit puzzled though - by my math 7k IOPS at 4k should only be 27MiB/sec - not sure how the 120MiB/sec was ach
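
The 27MiB/sec figure is just the IOPS-to-bandwidth conversion:

    iops, block_bytes = 7000, 4096          # 7k IOPS at 4 KiB per IO
    mib_per_s = iops * block_bytes / 2**20
    print(f"~{mib_per_s:.1f} MiB/s")        # ~27.3 MiB/s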

[ceph-users] Re: Erasure coded pool chunk count k

2021-10-05 Thread Christian Wuerdig
A couple of notes to this: Ideally you should have at least 2 more failure domains than your base resilience (K+M for EC or size=N for replicated) - reasoning: Maintenance needs to be performed so chances are every now and then you take a host down for a few hours or possibly days to do some upgra
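
That rule of thumb as a tiny helper (illustrative only; the function name is made up and not part of any Ceph tooling):

    def recommended_failure_domains(k=0, m=0, size=0):
        width = (k + m) if (k or m) else size   # shards per object (EC) or replica count
        return width + 2                        # + headroom for maintenance and recovery
    print(recommended_failure_domains(k=4, m=2))    # EC 4+2 -> ideally 8 failure domains
    print(recommended_failure_domains(size=3))      # size=3 -> ideally 5 failure domains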

[ceph-users] Re: osd_memory_target=level0 ?

2021-10-01 Thread Christian Wuerdig
rd, 9.4 MiB/s wr, 5.38k op/s rd, 2.42k op/s wr > > recovery: 23 MiB/s, 389 objects/s > > > Istvan Szabo

[ceph-users] Re: osd_memory_target=level0 ?

2021-09-30 Thread Christian Wuerdig
That is - one thing you could do is to rate limit PUT requests on your haproxy down to a level at which your cluster is stable. At least that gives you a chance to finish the PG scaling without OSDs dying on you constantly. On Fri, 1 Oct 2021 at 11:56, Christian Wuerdig wrote: > > Ok, so I

[ceph-users] Re: osd_memory_target=level0 ?

2021-09-30 Thread Christian Wuerdig
bbing+deep > 2 active+recovery_unfound+undersized+degraded+remapped > 2 active+remapped+backfill_wait > 1 active+clean+scrubbing > 1 active+undersized+remapped+backfilling > 1 active+undersized+degraded+remappe

[ceph-users] Re: osd_memory_target=level0 ?

2021-09-29 Thread Christian Wuerdig
Bluestore memory targets have nothing to do with spillover. It's already been said several times: The spillover warning is simply telling you that instead of writing data to your supposedly fast wal/blockdb device it's now hitting your slow device. You've stated previously that your fast device is

[ceph-users] Re: Limiting osd or buffer/cache memory with Pacific/cephadm?

2021-09-28 Thread Christian Wuerdig
buff/cache is the Linux kernel buffer and page cache which is unrelated to the ceph bluestore cache. Check the memory consumption of your individual OSD processes to confirm. Top also says 132GB available (since buffers and page cache entries will be dropped automatically if processes need more RAM

[ceph-users] Re: Corruption on cluster

2021-09-21 Thread Christian Wuerdig
This tracker item should cover it: https://tracker.ceph.com/issues/51948 On Wed, 22 Sept 2021 at 11:03, Nigel Williams wrote: > > Could we see the content of the bug report please, that RH bugzilla entry > seems to have restricted access. > "You are not authorized to access bug #1996680." > > On

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
ter with ec 4:2 :(( > > Istvan Szabo > > On 2021. Sep 21., at 20:21, Christ

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
bo > > On 2021. Sep 21., at 9:19, Christian Wuerdig > wrote: > > Email received fro

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
them. Somebody else would have to chime in to confirm. Also keep in mind that even with a 60GB partition you will still get spillover, since you seem to have around 120-130GB of metadata per OSD, so moving to 160GB partitions would seem to be better. > > > > > > > Christian Wuerdig , 21

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
It's been discussed a few times on the list, but RocksDB levels essentially grow by a factor of 10 (max_bytes_for_level_multiplier) by default, and each level needs 10x the space of the previous one on your drive to avoid spillover. So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB a
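
Summing those levels up (plus roughly 1GB of WAL, an assumed figure here) is why the often-quoted fast-device break-points sit near 3/30/300 GB; a small sketch of the arithmetic:

    base_gb, multiplier, wal_gb = 0.256, 10, 1.0
    levels = [base_gb * multiplier**i for i in range(4)]    # 0.256, 2.56, 25.6, 256 GB
    total = wal_gb
    for lvl in levels:
        total += lvl
        print(f"holding a {lvl:g} GB level needs ~{total:.0f} GB on the fast device")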

[ceph-users] Re: What's your biggest ceph cluster?

2021-09-02 Thread Christian Wuerdig
This probably provides a reasonable overview - https://ceph.io/en/news/blog/2020/public-telemetry-dashboards/, specifically the grafana dashboard is here: https://telemetry-public.ceph.com. Keep in mind not all clusters have telemetry enabled. The largest recorded cluster seems to be in the 32-64PB

[ceph-users] Re: ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool

2021-07-06 Thread Christian Wuerdig
Ceph on a single host makes little to no sense. You're better off running something like ZFS. On Tue, 6 Jul 2021 at 23:52, Wladimir Mutel wrote: > I started my experimental 1-host/8-HDDs setup in 2018 with > Luminous, > and I read > https://ceph.io/community/new-luminous-erasure-co

[ceph-users] Re: OT: How to Build a poor man's storage with ceph

2021-06-08 Thread Christian Wuerdig
Since you mention NextCloud it will probably be an RGW deployment. Also, it's not clear why 3 nodes? Is rack-space at a premium? Just to compare your suggestion: 3x24 (I guess 4U?) x 8TB with replication = 576 TB raw storage / 192 TB usable. Let's go 6x12 (2U) x 4TB with EC 3+2 = 288 TB raw storage / 172
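
For anyone who wants to tweak the comparison, the raw/usable numbers above come out of this arithmetic (figures as quoted in the post):

    raw_repl = 3 * 24 * 8                   # 3 nodes x 24 drives x 8TB = 576 TB raw
    usable_repl = raw_repl / 3              # 3x replication -> 192 TB usable
    raw_ec = 6 * 12 * 4                     # 6 nodes x 12 drives x 4TB = 288 TB raw
    usable_ec = raw_ec * 3 / 5              # EC 3+2 -> ~172.8 TB usable
    print(usable_repl, round(usable_ec, 1))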

[ceph-users] Re: Can I create 8+2 Erasure coding pool on 5 node?

2021-03-27 Thread Christian Wuerdig
Once you have your additional 5 nodes you can adjust your crush rule to have failure domain = host and ceph will rebalance the data automatically for you. This will involve quite a bit of data movement (at least 50% of your data will need to be migrated) so it can take some time. Also the official reco

[ceph-users] Re: Failure Domain = NVMe?

2021-03-11 Thread Christian Wuerdig
For EC 8+2 you can get away with 5 hosts by ensuring each host gets 2 shards, similar to this: https://ceph.io/planet/erasure-code-on-small-clusters/ If a host dies/goes down you can still recover all data (although at that stage your cluster is no longer available for client IO). You shouldn't just

[ceph-users] Re: Advice on SSD choices for WAL/DB?

2020-11-26 Thread Christian Wuerdig
I think it's time to start pointing out that the 3/30/300 logic no longer really holds true post-Octopus: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/CKRCB3HUR7UDRLHQGC7XXZPWCWNJSBNT/ On Thu, 2 Jul 2020 at 00:09, Burkhard Linke < burkhard.li...@computational.bio.uni-giesse

[ceph-users] Re: DB sizing for lots of large files

2020-11-26 Thread Christian Wuerdig
Sorry, I replied to the wrong email thread before, so reposting this: I think it's time to start pointing out that the 3/30/300 logic no longer really holds true post-Octopus: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/CKRCB3HUR7UDRLHQGC7XXZPWCWNJSBNT/ Although I suppose i