You could also add those SSD nodes to the existing cluster and just make a
separate SSD pool
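Roughly something like this, assuming the new OSDs come up with device class ssd (rule name, pool name and PG counts are just illustrative):

# CRUSH rule that only picks OSDs with device class "ssd"
ceph osd crush rule create-replicated ssd-only default host ssd

# pool that uses the SSD-only rule
ceph osd pool create fastpool 128 128 replicated ssd-only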
On Fri, 4 Oct 2024 at 01:06, Michel Niyoyita wrote:
> Hello Anthony,
>
> Thank you for your reply. The first cluster is fully HDD drives and the
> second would be SSD based. If it is not good to share mo
I could be wrong, however as far as I can see you have 9 chunks, which
requires 9 failure domains.
Your failure domain is set to datacenter which you only have 3 of. So that
won't work.
You need to set your failure domain to host and then create a crush rule to
choose 3 DCs and then 3 hosts within each.
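A sketch of what that rule could look like in the decompiled crushmap (names and the id are illustrative, assumes an EC profile with k+m=9):

rule ec-9-over-3dc {
    id 1
    type erasure
    step set_chooseleaf_tries 5
    step take default
    # pick 3 datacenters ...
    step choose indep 3 type datacenter
    # ... and 3 hosts (one OSD each) in every chosen DC -> 9 shards
    step chooseleaf indep 3 type host
    step emit
}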
The general complaint about docker is usually that it by default stops all
running containers when the docker daemon gets shut down. There is the
"live-restore" option (which has been around for a while) but that's turned
off by default (and requires a daemon restart to enable). It only supports
patch upgrades of the daemon.
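For reference, enabling it is just a daemon.json setting (default config path assumed), followed by the daemon restart mentioned above:

# /etc/docker/daemon.json
{
  "live-restore": true
}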
You can structure your crush map so that you get multiple EC chunks per
host in a way that you can still survive a host outage even though you have
fewer hosts than k+1.
For example if you run an EC=4+2 profile on 3 hosts you can structure your
crushmap so that you have 2 chunks per host. This way losing a single host
still leaves the 4 chunks needed to reconstruct the data.
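A sketch of such a rule (names and id are illustrative, default root assumed):

rule ec-4-2-on-3-hosts {
    id 2
    type erasure
    step set_chooseleaf_tries 5
    step take default
    # pick 3 hosts ...
    step choose indep 3 type host
    # ... and put 2 shards on separate OSDs within each host -> 6 shards
    step chooseleaf indep 2 type osd
    step emit
}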
On Mon, 9 Oct 2023 at 14:24, Anthony D'Atri wrote:
>
>
> > AFAIK the standing recommendation for all flash setups is to prefer fewer
> > but faster cores
>
> Hrm, I think this might depend on what you’re solving for. This is the
> conventional wisdom for MDS for sure. My sense is that OSDs can
AFAIK the standing recommendation for all-flash setups is to prefer fewer
but faster cores, so something like a 75F3 might yield better latency.
Plus you probably want to experiment with partitioning the NVMes and
running multiple OSDs per drive - either 2 or 4.
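Something along these lines with ceph-volume (device paths are placeholders):

# create 2 OSDs on each NVMe device
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1

With cephadm the equivalent is an OSD service spec that sets osds_per_device: 2.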
On Sat, 7 Oct 2023 at 08:23,
Based on my understanding of CRUSH, it basically works down the hierarchy:
at each level it randomly (but deterministically for a given CRUSH map)
picks buckets for the object, based on the specific selection rule, and it
does this recursively until it ends up at the leaf nodes (the OSDs).
Given that
Hm, this thread is confusing
Client-side encryption in the context of S3 means the user is responsible
for encrypting the data with their own keys before submitting it. As far as
I'm aware, client-side encryption doesn't require any specific server
support - it's a function of the client SDK used.
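A trivial illustration of that model with generic tools (bucket, endpoint and key file are placeholders; any S3-compatible client works the same way since the server only ever sees ciphertext):

# encrypt locally with your own key, upload the ciphertext like any other object
openssl enc -aes-256-cbc -pbkdf2 -pass file:./my.key -in report.pdf -out report.pdf.enc
aws --endpoint-url https://rgw.example.com s3 cp report.pdf.enc s3://mybucket/report.pdf.enc

# download and decrypt on the client side
aws --endpoint-url https://rgw.example.com s3 cp s3://mybucket/report.pdf.enc .
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:./my.key -in report.pdf.enc -out report.pdf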
With failure domain host your max usable cluster capacity is essentially
constrained by the total capacity of the smallest host which is 8TB if I
read the output correctly. You need to balance your hosts better by
swapping drives.
On Fri, 31 Mar 2023 at 03:34, Nicola Mori wrote:
> Dear Ceph user
On Sun, 19 Jun 2022 at 02:29, Satish Patel wrote:
> Greeting folks,
>
> We are planning to build Ceph storage for mostly cephFS for HPC workload
> and in future we are planning to expand to S3 style but that is yet to be
> decided. Because we need mass storage, we bought the following HW.
>
> 15
I would not host multiple OSDs on a spinning drive (unless it's one of those
Seagate MACH.2 drives that have two independent heads) - head seek time
will most likely kill performance. The main reason to host multiple OSDs on
a single SSD or NVMe is typically to make use of the large IOPS capacity
whi
I think Marc uses containers - but they've chosen Apache Mesos as
orchestrator and cephadm doesn't work with that.
Currently essentially two ceph container orchestrators exist - rook, which
is a ceph orchestrator for kubernetes, and cephadm, which is an orchestrator
expecting docker or podman
Admittedly I do
In addition to what the others said - generally there is little point
in splitting the block.db and WAL onto separate partitions - just stick to
one partition for both.
What model are your SSDs and how well do they handle small direct
writes? Because that's what you'll be getting on them and the wrong
type of SSD can make things significantly slower.
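For example (device names are placeholders) - if you only specify --block.db the WAL automatically lives on the same partition:

# data on the HDD, one fast partition carrying both DB and WAL
ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1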
Yes, just expose each disk as an individual OSD and you'll already be
better off. Depending on what type of SSD they are - if they can sustain
high random write IOPS you may even want to consider partitioning each
disk and creating 2 OSDs per SSD to make better use of the available IO
capacity.
For all-
- What is the expected file/object size distribution and count?
- Is it write-once or modify-often data?
- What's your overall required storage capacity?
- 18 OSDs per WAL/DB drive seems a lot - recommended is ~6-8
- With 12TB OSDs the recommended WAL/DB size is 120-480GB (1-4%) per OSD
> as well, suggested that in a
> replicated pool writes and reads are handled by the primary PG, which would
> explain this write bandwidth limit.
>
> /Z
>
> On Tue, 5 Oct 2021, 22:31 Christian Wuerdig,
> wrote:
>
>> Maybe some info is missing but 7k write IOP
Maybe some info is missing but 7k write IOPS at a 4k block size seems fairly
decent (as you also state) - the bandwidth follows automatically from that,
so not sure what you're expecting?
I am a bit puzzled though - by my math 7k IOPS at 4k should only be
about 27MiB/sec - not sure how the 120MiB/sec was achieved
A couple of notes to this:
Ideally you should have at least 2 more failure domains than your base
resilience (K+M for EC or size=N for replicated) - reasoning: maintenance
needs to be performed, so chances are every now and then you take a host
down for a few hours or possibly days to do some upgrades
rd, 9.4 MiB/s wr, 5.38k op/s rd, 2.42k op/s wr
>
> recovery: 23 MiB/s, 389 objects/s
>
>
> Istvan Szabo
That is - one thing you could do is to rate limit PUT requests on your
haproxy down to a level where your cluster is stable. At least that
gives you a chance to finish the PG scaling without OSDs constantly
dying on you.
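Roughly something like this in the haproxy frontend in front of the RGWs (names and the threshold are placeholders - tune it to whatever your cluster can actually sustain):

frontend rgw_front
    bind :8080
    # track per-client request rate over a 10s window
    stick-table type ip size 1m expire 60s store http_req_rate(10s)
    http-request track-sc0 src
    acl is_put method PUT
    acl too_fast sc_http_req_rate(0) gt 100
    # reject PUTs from clients exceeding ~100 requests per 10s
    http-request deny deny_status 429 if is_put too_fast
    default_backend rgw_back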
On Fri, 1 Oct 2021 at 11:56, Christian Wuerdig
wrote:
>
> Ok, so I
bbing+deep
> 2 active+recovery_unfound+undersized+degraded+remapped
> 2 active+remapped+backfill_wait
> 1 active+clean+scrubbing
> 1 active+undersized+remapped+backfilling
> 1 active+undersized+degraded+remappe
Bluestore memory targets have nothing to do with spillover. It's
already been said several times: The spillover warning is simply
telling you that instead of writing data to your supposedly fast
wal/blockdb device it's now hitting your slow device.
You've stated previously that your fast device is
buff/cache is the Linux kernel buffer and page cache, which is
unrelated to the ceph bluestore cache. Check the memory consumption of
your individual OSD processes to confirm. top also says 132GB
available (since buffers and page cache entries will be dropped
automatically if processes need more RAM).
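A couple of ways to check (osd.0 is just an example id):

# resident memory of the OSD processes as the kernel sees it
ps -o pid,rss,cmd -C ceph-osd

# bluestore's own view of its caches/memory pools for one OSD
ceph daemon osd.0 dump_mempools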
This tracker item should cover it: https://tracker.ceph.com/issues/51948
On Wed, 22 Sept 2021 at 11:03, Nigel Williams
wrote:
>
> Could we see the content of the bug report please, that RH bugzilla entry
> seems to have restricted access.
> "You are not authorized to access bug #1996680."
>
> On
ter with ec 4:2 :((
>
> Istvan Szabo
>
> On 2021. Sep 21., at 20:21, Christ
them. Somebody
else would have to chime in to confirm.
Also keep in mind that even with a 60GB partition you will still get
spillover since you seem to have around 120-130GB of metadata per OSD, so
moving to 160GB partitions would seem to be better.
It's been discussed a few times on the list but RocksDB levels essentially
grow by a factor of 10 (max_bytes_for_level_multiplier) by default, and each
level needs 10x the space of the previous one on your drive to avoid spillover.
So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB and so on.
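If you want to see where an OSD currently sits, the bluefs counters show DB vs slow-device usage (osd.0 is a placeholder):

# db_used_bytes vs slow_used_bytes shows how much metadata has spilled over
ceph daemon osd.0 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'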
This probably provides a reasonable overview -
https://ceph.io/en/news/blog/2020/public-telemetry-dashboards/,
specifically the grafana dashboard is here:
https://telemetry-public.ceph.com
Keep in mind not all clusters have telemetry enabled.
The largest recorded cluster seems to be in the 32-64PB range.
Ceph on a single host makes little to no sense. You're better off running
something like ZFS
On Tue, 6 Jul 2021 at 23:52, Wladimir Mutel wrote:
> I started my experimental 1-host/8-HDDs setup in 2018 with
> Luminous,
> and I read
> https://ceph.io/community/new-luminous-erasure-co
Since you mention NextCloud it will probably be an RGW deployment. Also it's
not clear why 3 nodes? Is rack space at a premium?
Just to compare your suggestion:
3x24 (I guess 4U?) x 8TB with replication = 576 TB raw storage, 192 TB usable
Let's go 6x12 (2U) x 4TB with EC 3+2 = 288 TB raw storage, 172.8 TB usable
Once you have your additional 5 nodes you can adjust your crush rule to have
failure domain = host and ceph will rebalance the data automatically for
you. This will involve quite a bit of data movement (at least 50% of your
data will need to be migrated) so it can take some time. Also the official
reco
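For a replicated pool that switch is typically just a new rule plus a pool update (names are placeholders); for an EC pool you'd create a new profile/rule with crush-failure-domain=host instead:

# new replicated rule with host as the failure domain
ceph osd crush rule create-replicated rep-host default host

# point the pool at it - data starts rebalancing straight away
ceph osd pool set mypool crush_rule rep-host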
For EC 8+2 you can get away with 5 hosts by ensuring each host gets 2
shards similar to this:
https://ceph.io/planet/erasure-code-on-small-clusters/
If a host dies/goes down you can still recover all data (although at that
stage your cluster is no longer available for client io).
You shouldn't just
I think it's time to start pointing out that the 3/30/300 logic no longer
really holds true post-Octopus:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/CKRCB3HUR7UDRLHQGC7XXZPWCWNJSBNT/
On Thu, 2 Jul 2020 at 00:09, Burkhard Linke <
burkhard.li...@computational.bio.uni-giesse
Sorry, I replied to the wrong email thread before, so reposting this:
I think it's time to start pointing out that the 3/30/300 logic no longer
really holds true post-Octopus:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/CKRCB3HUR7UDRLHQGC7XXZPWCWNJSBNT/
Although I suppose i