Why use such a card and M.2 drives that I suspect aren’t enterprise-class?
Instead of U.2, E1.S, or E3.S?
> On Jan 13, 2024, at 5:10 AM, Mike O'Connor wrote:
>
> On 13/1/2024 1:02 am, Drew Weaver wrote:
>> Hello,
>>
>> So we were going to replace a Ceph cluster with some hardware we had layin
There are nuances, but in general the higher the sum of m+k, the lower the
performance, because *every* operation has to hit that many drives, which is
especially impactful with HDDs. So there’s a tradeoff between storage
efficiency and performance. And as you’ve seen, larger parity groups
es
> On Jan 12, 2024, at 03:31, Phong Tran Thanh wrote:
>
> Hi Yang and Anthony,
>
> I found the solution for this problem on a HDD disk 7200rpm
>
> When the cluster recovers, one or multiple disk failures because slowop
> appears and then affects the cluster, we can change these configurations
y from it imho. In an
> alternate universe it would have been really neat if Intel could have worked
> with the HDD vendors to put like 16GB of user accessible optane on every HDD.
> Enough for the WAL and L0 (and maybe L1).
>
>
> Mark
>
>
> On 1/9/24 08:53, Anth
Not strictly an answer to your worthy question, but IMHO this supports my
stance that hybrid OSDs aren't worth the hassle.
> On Jan 9, 2024, at 06:13, Frédéric Nass
> wrote:
>
> With hybrid setups (RocksDB+WAL on SSDs or NVMes and Data on HDD), if mclock
> only considers write performance,
be there after replication. No lost data?
>> Correct, *if* nothing happens to the survivors. But unless you take manual
>> steps, data will be unavailable.
>> Most of the time if a node fails you can replace a DIMM etc. and bring it
>> back.
>>> Many thanks!!
>
>
> Many thanks!!
>
> Regards
> Marcus
>
>
>
> On fre, dec 22 2023 at 19:12:19 -0500, Anthony D'Atri
> wrote:
>>>> You can
>>
>> You can do that for a PoC, but that's a bad idea for any production
>> workload. You'd want at least three nodes with OSDs to use the default RF=3
>> replication. You can do RF=2, but at the peril of your mortal data.
>
> I'm not sure I agree - I think size=2, min_size=2 is no worse t
> I have manually configured a ceph cluster with ceph fs on debian bookworm.
Bookworm support is very, very recent I think.
> What is the difference from installing with cephadm compared to manuall
> install,
> any benefits that you miss with manual install?
A manual install is dramatically m
> Sorry I thought of one more thing.
>
> I was actually re-reading the hardware recommendations for Ceph and it seems
> to imply that both RAID controllers as well as HBAs are bad ideas.
Advice I added most likely ;) "RAID controllers" *are* a subset of HBAs BTW.
The nomenclature can be co
[rook@rook-ceph-tools-5ff8d58445-gkl5w .aws]$ ceph features
{
"mon": [
{
"features": "0x3f01cfbf7ffd",
"release": "luminous",
"num": 3
}
],
"osd": [
{
"features": "0x3f01cfbf7ffd",
"release": "lu
Four servers doth not a quality cluster make. This setup will work, but you
can't use a reasonable EC profile for your bucket pool. Aim higher than the
party line wrt PG counts esp. for the index pool.
> On Dec 18, 2023, at 10:19, Amardeep Singh
> wrote:
>
> Hi Everyone,
>
> We are in
Following up on my own post from last month, for posterity.
The trick was updating the period. I'm not using multisite, but Rook seems to
deploy so that one can.
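For posterity, the period commit is a one-liner (followed by an RGW restart so the daemons load the new period):

```shell
# Commit a new period so RGWs pick up zone/zonegroup/placement changes.
# Needed even without multisite, since Rook deploys realm/zonegroup/zone.
radosgw-admin period update --commit
# Then restart the RGW daemons to load the committed period.
```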
-- aad
> On Nov 6, 2023, at 16:52, Anthony D'Atri wrote:
>
> I'm having difficulty adding and using
>>
>> Today we had a big issue with slow ops on the nvme drives which holding
>> the index pool.
>>
>> Why the nvme shows full if on ceph is barely utilized? Which one I should
>> belive?
>>
>> When I check the ceph osd df it shows 10% usage of the osds (1x 2TB nvme
>> drive has 4x osds on it):
I try to address these ideas in
https://www.amazon.com/Learning-Ceph-scalable-reliable-solution-ebook/dp/B01NBP2D9I
though as with any tech topic the details change over time.
It's difficult to interpret the table the OP included, but I think it shows a 3
node cluster. When you only have 3 nod
Sent too quickly — also note that consumer / client SSDs often don’t have
powerloss protection, so if your whole cluster were to lose power at the wrong
time, you might lose data.
> On Nov 28, 2023, at 8:16 PM, Anthony D'Atri wrote:
>
>
>>>
>>> 1) They
>>
>> 1) They’re client aka desktop SSDs, not “enterprise”
>> 2) They’re a partition of a larger OSD shared with other purposes
>
> Yup. They're a mix of SATA SSDs and NVMes, but everything is
> consumer-grade. They're only 10% full on average and I'm not
> super-concerned with performance. I
>> Very small and/or non-uniform clusters can be corner cases for many things,
>> especially if they don’t have enough PGs. What is your failure domain —
>> host or OSD?
>
> Failure domain is host,
Your host buckets do vary in weight by roughly a factor of two. They naturally
will get PGs m
>
> I'm fairly new to Ceph and running Rook on a fairly small cluster
> (half a dozen nodes, about 15 OSDs).
Very small and/or non-uniform clusters can be corner cases for many things,
especially if they don’t have enough PGs. What is your failure domain — host
or OSD?
Are your OSDs sized u
The options Wes listed are for data, not RocksDB.
> On Nov 27, 2023, at 1:59 PM, Denis Polom wrote:
>
> Hi,
>
> no we don't:
>
> "bluestore_rocksdb_options":
> "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268
If there’s a filesystem on the volume, running fstrim or mounting with the
discard option might significantly reduce usage and block count.
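A minimal sketch, assuming an ext4/XFS filesystem on a mapped RBD device (the mountpoint and device names here are placeholders):

```shell
# One-off trim of a mounted filesystem; -v prints how much was discarded:
fstrim -v /mnt/rbd0

# Or mount with continuous discard, which can add write latency on some devices:
mount -o discard /dev/rbd0 /mnt/rbd0
```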
> On Nov 25, 2023, at 1:02 PM, Tony Liu wrote:
>
> Thank you Eugen! "rbd du" is it.
> The used_size from "rbd du" is object count times object size.
> T
>>>
>>> Should I modify the ceph.conf (vi/emacs) directly ?
>>
>> vi is never the answer.
>
> WTF ? You break my dream ;-) ;-)
Let line editors die.
>>
> You're right.
>
> Currently I'm testing
>
> 17.2.7 quincy.
>
> So in my daily life how I would know if I should use ceph config or
>
> to change something in the /etc/ceph/ceph.conf.
Central config was introduced with Mimic. Since both central config and
ceph.conf work and are supported, explicitly mentioning both in the docs every
time is a lot of work (and awkward). One day we’ll sort out an effective means
to generali
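For reference, the central-config workflow looks roughly like this (option and daemon names here are just illustrative):

```shell
# Store a setting in the monitors' central config database (Mimic and later):
ceph config set osd osd_memory_target 6442450944

# See what a specific daemon will pick up from the config database:
ceph config get osd.0 osd_memory_target

# Show effective values and where each comes from (file, mon, override):
ceph config show osd.0 | grep osd_memory_target
```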
Yes, lots of people are using EC.
Which is more “reliable” depends on what you need. If you need to survive 4
failures, there are scenarios where RF=3 won’t do it for you.
You could in such a case use an EC 4,4 profile, 8,4, etc.
It’s a tradeoff between write speed and raw::usable ratio effi
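As a sketch, a hypothetical 4,4 profile that survives four overlapping failures looks like this (profile and pool names are placeholders; PG count is illustrative):

```shell
# k=4 data + m=4 coding chunks: survives 4 failures at a 2.0x raw::usable
# ratio -- same raw cost as RF=2, far more durable than RF=3.
ceph osd erasure-code-profile set ec44 k=4 m=4 crush-failure-domain=host
ceph osd pool create ecpool 128 erasure ec44
```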
I encountered mgr ballooning multiple times with Luminous, but have not since.
At the time, I could often achieve relief by sending the admin socket a heap
release - it would show large amounts of memory unused but not yet released.
That experience is one reason I got Rook recently to allow pro
Common motivations for this strategy include the lure of unit economics and
RUs.
Often ultra dense servers can’t fill racks anyway due to power and weight
limits.
Here the osd_memory_target would have to be severely reduced to avoid
oomkilling. Assuming the OSDs are top load LFF HDDs with e
I was thinking the same thing. Very small OSDs can behave unexpectedly because
of the relatively high percentage of overhead.
> On Nov 18, 2023, at 3:08 AM, Eugen Block wrote:
>
> Do you have a large block.db size defined in the ceph.conf (or config store)?
>
> Zitat von Debian :
>
>> th
I'm going to assume that ALL of your pools are replicated with size 3, since
you didn't provide that info, and that all but the *hdd pools are on SSDs.
`ceph osd dump | grep pool`
Let me know if that isn't the case.
With that assumption, I make your PG ratio ~57, which is way too low.
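The arithmetic behind that ratio, sketched with made-up numbers (three size-3 pools on 12 OSDs):

```shell
# PG ratio = sum over pools of (pg_num * replica_size) / number of OSDs.
# Columns: pg_num, size. (32*3 + 64*3 + 128*3) / 12 OSDs:
echo "32 3
64 3
128 3" | awk -v osds=12 '{sum += $1 * $2} END {printf "%.0f\n", sum/osds}'
# -> 56
```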
R
IMHO we don't need yet another place to look for information, especially one
that some operators never see. ymmv.
>
>> Hello,
>>
>> We wanted to get some feedback on one of the features that we are planning
>> to bring in for upcoming releases.
>>
>> On the Ceph GUI, we thought it could be in
I'm having difficulty adding and using a non-default placement target & storage
class and would appreciate insights. Am I going about this incorrectly? Rook
does not yet have the ability to do this, so I'm adding it by hand.
Following instructions on the net I added a second bucket pool, place
nm, Adam beat me to it.
> On Nov 3, 2023, at 11:40, Josh Baergen wrote:
>
> The ticket has been updated, but it's probably important enough to
> state on the list as well: The documentation is currently wrong in a
> way that running the command as documented will cause this corruption.
> The cor
If someone can point me at the errant docs locus I'll make it right.
> On Nov 3, 2023, at 11:45, Laura Flores wrote:
>
> Yes, Josh beat me to it- this is an issue of incorrectly documenting the
> command. You can try the solution posted in the tracker issue.
>
> On Fri, Nov 3, 2023 at 10:43 AM
This admittedly is the case throughout the docs.
> On Nov 2, 2023, at 07:27, Joachim Kraftmayer - ceph ambassador
> wrote:
>
> Hi,
>
> another short note regarding the documentation, the paths are designed for a
> package installation.
>
> the paths for container installation look a bit diff
ty, strong consistency and
> higher failure domains as host we do with Ceph.
>
> Joachim
>
> ___
> ceph ambassador DACH
> ceph consultant since 2012
>
> Clyso GmbH - Premier Ceph Foundation Member
>
> https://www.clyso.com/
Ceph is all about strong consistency and data durability. There can also be a
distinction between performance of the cluster in aggregate vs a single client,
especially in a virtualization scenario where to avoid the noisy-neighbor
dynamic you deliberately throttle iops and bandwidth per client
Ah, our old friend the P5316.
A few things to remember about these:
* 64KB IU means that you'll burn through endurance if you do a lot of writes
smaller than that. The firmware will try to coalesce smaller writes,
especially if they're sequential. You probably want to keep your RGW / CephFS
This is one of many reasons for not using HDDs ;)
One nuance that is easily overlooked is the CRUSH weight of failure domains.
If, say, you have a failure domain of "rack" with size=3 replicated pools and
3x CRUSH racks, if you add the new, larger OSDs to only one rack, you will not
increase the
ening and how the issue can
> be alleviated or resolved, unfortunately monitor RocksDB usage and tunables
> appear to be not documented at all.
>
> /Z
>
> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri wrote:
>> cf. Mark
cf. Mark's article I sent you re RocksDB tuning. I suspect that with Reef you
would experience fewer writes. Universal compaction might also help, but in
the end this SSD is a client SKU and really not suited for enterprise use. If
you had the 1TB SKU you'd get much longer life, or you could
> AFAIK the standing recommendation for all flash setups is to prefer fewer
> but faster cores
Hrm, I think this might depend on what you’re solving for. This is the
conventional wisdom for MDS for sure. My sense is that OSDs can use multiple
cores fairly well, so I might look at the cores *
> Currently, I have an OpenStack installation with a Ceph cluster consisting of
> 4 servers for OSD, each with 16TB SATA HDDs. My intention is to add a second,
> independent Ceph cluster to provide faster disks for OpenStack VMs.
Indeed, I know from experience that LFF spinners don't cut it fo
And unless you *need* a given ailing OSD to be up because it's the only copy of
data, you may get better recovery/backfill results by stopping the service for
that OSD entirely, so that recovery reads all go to healthier OSDs.
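A sketch of stopping such an OSD — the id 123 is a placeholder, and which command applies depends on how the cluster was deployed:

```shell
# Stop the ailing OSD so recovery reads come only from healthy copies:
systemctl stop ceph-osd@123      # package / bare systemd deployments
ceph orch daemon stop osd.123    # cephadm deployments
```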
> On Oct 3, 2023, at 12:21, Josh Baergen wrote:
>
> Hi Simon,
>
Note that this will adjust override reweight values, which will conflict with
balancer upmaps.
> On Sep 26, 2023, at 3:51 AM, c...@elchaka.de wrote:
>
> Hi an idea is to see what
>
> Ceph osd test-reweight-by-utilization
> shows.
> If it looks usefull you can run the above command without "
That may be the very one I was thinking of, though the OP seemed to be
preserving the IP addresses, so I suspect containerization is in play.
> On Sep 9, 2023, at 11:36 AM, Tyler Stachecki
> wrote:
>
> On Sat, Sep 9, 2023 at 10:48 AM Anthony D'Atri
> wrote:
>>
Which Ceph release are you running, and how was it deployed?
With some older releases I experienced mons behaving unexpectedly when one of
the quorum bounced, so I like to segregate them for isolation still.
There was also at one point an issue where clients wouldn’t get a runtime update of
new m
Resurrection usually only makes sense if fate or a certain someone resulted in
enough overlapping removed OSDs that you can't meet min_size. I've had to do so a
couple of times :-/
If an OSD is down for more than a short while, backfilling a redeployed OSD
will likely be faster than waiting for it t
Is a secure-erase suggested after the firmware update? Sometimes manufacturers
do that.
> On Sep 1, 2023, at 05:16, Frédéric Nass
> wrote:
>
> Hello,
>
> This message to inform you that DELL has released a new firmwares for these
> SSD drives to fix the 70.000 POH issue:
>
> [
> https:/
>> The module don't have new commits for more than two year
>
> So diskprediction_local is unmaintained. Will it be removed?
> It looks like a nice feature but when you try to use it it's useless.
IIRC it has only a specific set of drive models, and the binary blob from
ProphetStor.
>> I sugg
> Thank you for reply,
>
> I have created two class SSD and NvME and assigned them to crush maps.
You don't have enough drives to keep them separate. Set the NVMe drives back
to "ssd" and just make one pool.
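Roughly (the OSD ids here are placeholders — substitute the NVMe OSDs):

```shell
# A device class must be cleared before it can be reassigned:
ceph osd crush rm-device-class osd.10 osd.11
ceph osd crush set-device-class ssd osd.10 osd.11
```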
>
> $ ceph osd crush rule ls
> replicated_rule
> ssd_pool
> nvme_pool
>
>
> Runni
> Thanks Eugen for the explanation. To summarize what I understood:
> - delete from GUI simply does a drain+destroy;
> - destroy will preserve the OSD id so that it will be used by the next OSD
> that will be created on that host;
> - purge will remove everything, and the next OSD that will be c
>> As per recent isdct/intelmas/sst? The web site?
>
> Yes. It's all "Solidigm" now, which has made information harder to
> find and firmware harder to get, but these drives aren't exactly
> getting regular updates at this point.
Exactly. "isdct" more or less became "intelmas", and post-sep
>
>> The OP implies that the cluster's performance *degraded* with the Quincy
>> upgrade.I wonder if there was a kernel change at the same time.
>
> No, it's never been great. But it's definitely getting worse over
> time. That is most likely correlated with increased utilization (both
> in term
>
> Also, 1 CPU core/OSD is definitely undersized. I'm not sure how much
> you have -- but you want at least a couple per OSD for SSD, and even
> more for NVMe... especially when it comes to small block write
> workloads.
Think you meant s/SSD/SAS|SATA/
If the OP means physical core, granted
Yep. Remember that most Ceph clusters serve a number of simultaneous clients,
so the “IO blender” effect more or less presents a random workload to drives.
Dedicated single-client node-local drives might benefit from such strategies.
But really gymnastics like this for uncertain gain serve
>
> This is an expected result, and it is not specific to Ceph. Any
> storage that consists of multiple disks will produce a performance
> gain over a single disk only if the workload allows for concurrent use
> of these disks - which is not the case with your 4K benchmark due to
> the de-facto
>
> Good afternoon everybody!
>
> I have the following scenario:
> Pool RBD replication x3
> 5 hosts with 12 SAS spinning disks each
Old hardware? SAS is mostly dead.
> I'm using exactly the following line with FIO to test:
> fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -s
This.
You can even constrain placement by size or model number.
> On Aug 2, 2023, at 6:53 AM, Eugen Block wrote:
>
> But that could be done easily like this:
>
> service_type: osd
> service_id: ssd-db
> service_name: osd.ssd-db
> placement:
> hosts:
> - storage01
> - storage02
> ...
> spec:
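For the size/model constraint, the device filters in the spec might look like this — the size range and commented-out model string are placeholders:

```yaml
service_type: osd
service_id: ssd-db
placement:
  host_pattern: 'storage*'
spec:
  data_devices:
    rotational: 1        # HDDs get the data
  db_devices:
    size: '800G:2TB'     # only SSDs in this size range get DB/WAL
    # model: 'XYZ123'    # or match on a model string instead
```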
I can believe the month timeframe for a cluster with multiple large spinners
behind each HBA. I’ve witnessed such personally.
> On Jul 20, 2023, at 4:16 PM, Michel Jouvin
> wrote:
>
> Hi Niklas,
>
> As I said, ceph placement is based on more than fulfilling the failure domain
> constraint.
Sometimes one can even get away with "ceph osd down 343" which doesn't affect
the process. I have had occasions when this goosed peering in a less-intrusive
way. I believe it just marks the OSD down in the mons' map, and when that
makes it to the OSD, the OSD responds with "I'm not dead yet" a
Indeed that's very useful. I improved the documentation for that not long ago,
took a while to sort out exactly what it was about.
Normally LC only runs once a day as I understand it; there's a debug option
that compresses time so that it'll run more frequently, as having to wait for a
day to
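That debug option is `rgw_lc_debug_interval` — a sketch, assuming central config and a default RGW client name:

```shell
# Compresses the LC clock: one "day" becomes 600 seconds.
# Testing only -- lifecycle rules will fire far more often than intended.
ceph config set client.rgw rgw_lc_debug_interval 600
# Restart the RGW daemons for it to take effect.
```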
Index pool on Aerospike?
Building OSDs on PRAM might be a lot less work than trying to ensure
consistency on backing storage while still servicing out of RAM and not syncing
every transaction.
> On Jul 18, 2023, at 14:31, Peter Grandi wrote:
>
> [...] S3 workload, that will need to delet
I've seen this dynamic contribute to a hypervisor with many attachments running
out of system-wide file descriptors.
> On Jul 18, 2023, at 16:21, Konstantin Shalygin wrote:
>
> Hi,
>
> Check you libvirt limits for qemu open files/sockets. Seems, when you added
> new OSD's, your librbd client
Index pool distributed over a large number of NVMe OSDs? Multiple, dedicated
RGW instances that only run LC?
> On Jul 18, 2023, at 12:08, Peter Grandi wrote:
>
On Mon, 17 Jul 2023 19:19:34 +0700, Ha Nguyen Van
said:
>
>> [...] S3 workload, that will need to delete 100M file daily [
The docs aren't necessarily structured that way, i.e. there isn't a 17.2.6 docs
site as such. We try to document changes in behavior in sync with code, but
don't currently have a process to ensure that a given docs build corresponds
exactly to a given dot release. In fact we sometimes go back
Indeed. For clarity, this process is not the same as the pg_autoscaler. It's
real easy to conflate the two, along with the balancer module, so I like to
call that out to reduce confusion.
> On Jul 6, 2023, at 18:01, Dan van der Ster wrote:
>
> Since nautilus, pgp_num (and pg_num) will be inc
I’m also using Rook on BM. I had never used K8s before, so that was the
learning curve, e.g. translating the example YAML files into the Helm charts we
needed, and the label / taint / toleration dance to fit the square peg of
pinning services to round hole nodes. We’re using Kubespray; I gath
There aren’t enough drives to split into multiple pools.
Deploy 1 OSD on each of the 3.8T devices and 2 OSDs on each of the 7.6Ts.
Or, alternately, 2 and 4.
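If the cluster is managed by cephadm, a sketch of splitting the larger drives (service id, host pattern, and size range are all placeholders):

```shell
cat <<'EOF' | ceph orch apply -i -
service_type: osd
service_id: dense-nvme
placement:
  host_pattern: '*'
spec:
  data_devices:
    size: '7TB:8TB'    # match only the 7.6 TB drives
  osds_per_device: 2   # two OSDs per matched device
EOF
```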
> On Jul 4, 2023, at 3:44 AM, Eneko Lacunza wrote:
>
> Hi,
>
> El 3/7/23 a las 17:27, wodel youchi escribió:
>> I will be deploying a Pr
Even when you factor in density, iops, and the cost of an HBA?
SAS is mostly dead, manufacturers are beginning to drop SATA from their
roadmaps.
> On Jun 28, 2023, at 10:24 AM, Marc wrote:
>
>
>
>>
>> What would we use instead? SATA / SAS that are progressively withering
>> in the market
That page has mixed info.
What would we use instead? SATA / SAS that are progressively withering in the
market, less performance for the same money? Why pay extra for an HBA just to
use legacy media?
You can use NVMe for WAL+DB, with more complexity. You’ll get faster metadata
and lower la
Stefan, how do you have this implemented? Earlier this year I submitted
https://tracker.ceph.com/issues/58569 asking to enable just this.
> On Jun 2, 2023, at 10:09, Stefan Kooman wrote:
>
> On 5/26/23 23:09, Alexander E. Patrakov wrote:
>> Hello Frank,
>> On Fri, May 26, 2023 at 6:27 PM Fran
This is my understanding as well: as with CRUSH tunable sets, features that
*happen* to be named after releases don't always correlate 1:1 with them.
> On May 25, 2023, at 15:49, Wesley Dillingham wrote:
>
> Fairly confident this is normal. I just checked a pacific cluster and they
> all report luminous as
The release of Reef has been delayed in part due to issues that sidelined the
testing / validation infrastructure.
> On May 15, 2023, at 05:40, huy nguyen wrote:
>
> Hi, as I understand, Pacific+ is having a performance issue that does not
> exist in older releases? So that why Ceph's new rele
As a KRBD client, I believe that 5.4 also introduces better support for RBD
features, including fast-diff.
> On May 11, 2023, at 3:59 AM, Gerdriaan Mulder wrote:
>
> As a data point: we've been running Octopus (solely for CephFS) on Ubuntu
> 20.04 with 5.4.0(-122) for some time now, with packag
There is also a direct RBD client for MS Windows, though it's relatively young.
> On Apr 27, 2023, at 18:20, Bailey Allison wrote:
>
> Hey Angelo,
>
> Just to make sure I'm understanding correctly, the main idea for the use
> case is to be able to present Ceph storage to windows clients as SMB?
> Indeed! Every Ceph instance I have seen (not many) and almost every HPC
> storage system I have seen have this problem, and that's because they were
> never setup to have enough IOPS to support the maintenance load, never mind
> the maintenance load plus the user load (and as a rule not even
Absolutely.
Moreover, PGs are not a unit of size, they are a logical grouping of smaller
RADOS objects, because a few thousand PGs are a lot easier and less expensive
to manage than tens or hundreds of millions of small underlying RADOS objects.
They’re for efficiency, and are not any set size
>>
>>
>> We have a customer that tries to use veeam with our rgw objectstorage and
>> it seems to be blazingly slow.
>> What also seems to be strange, that veeam sometimes show "bucket does not
>> exist" or "permission denied".
>> I've tested parallel and everything seems to work fine from the
Actually there was a firmware bug around that a while back. The HBA and
storcli claimed to not touch drive cache, but actually were enabling it and
lying.
> On Apr 19, 2023, at 1:41 PM, Marco Gaiarin wrote:
>
> Mandi! Mario Giammarco
> In chel di` si favelave...
>
>> The disk cache is:
>
LSI 9266/9271 as well in an affected range unless ECO’d
> On Apr 19, 2023, at 3:13 PM, Sebastian wrote:
>
> I want add one thing to what other says, we discussed this between
> Cephalocon sessions, avoid HP controllers p210/420, or upgrade firmware to
> latest.
> These controllers has strang
Are you baiting me? ;) HBA. Always. RAID HBAs are the devil.
> On Apr 19, 2023, at 12:56 AM, Murilo Morais wrote:
>
> Good evening everyone!
>
> Guys, about the P420 RAID controller, I have a question about the operation
> mode: What would be better: HBA or RAID-0 with BBU (active write c
With the LSI HBAs I’ve used, HBA cache seemed to only be used for VDs, not for
passthrough drives. And then with various nasty bugs. Be careful not to
conflate HBA cache with cache on the HDD itself.
> On Apr 15, 2023, at 11:51 AM, Konstantin Shalygin wrote:
>
> Hi,
>
> Current controller
I've used a similar process with great success for capacity management --
moving volumes from very full clusters to ones with more free space. There was
a weighting system to direct new volumes where there was space, but, to
forestall full ratio problems due to organic growth of existing
thi
>
> The truth is that:
> - hdd are too slow for ceph, the first time you need to do a rebalance or
> similar you will discover...
Depends on the needs. For cold storage, or sequential use-cases that aren't
performance-sensitive ... Can't say "too slow" without context. In Marco's
case, I
How bizarre, I haven’t dealt with this specific SKU before. Some Dell / LSI
HBAs call this passthrough mode, some “personality”, some “jbod mode”, dunno
why they can’t be consistent.
> We are testing an experimental Ceph cluster with server and controller at
> subject.
>
> The controller have
Any chance you ran `rados bench` but didn’t fully clean up afterward?
> On Apr 3, 2023, at 9:25 PM, Work Ceph
> wrote:
>
> Hello guys!
>
>
> We noticed an unexpected situation. In a recently deployed Ceph cluster we
> are seeing a raw usage, that is a bit odd. We have the following setup:
>
Mark Nelson's space amp sheet visualizes this really well. A nuance here is
that Ceph always writes a full stripe, so with a 9,6 profile, on conventional
media, a minimum of 15x4KB=60KB underlying storage will be consumed, even for
a 1KB object. A 22 KB object would similarly tie up ~18KB of
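The full-stripe arithmetic can be sanity-checked directly (assuming a 4 KB BlueStore allocation unit per shard):

```shell
# k=9 data + m=6 coding shards; even a tiny object consumes one
# allocation unit on every shard of the stripe:
awk 'BEGIN { k=9; m=6; alloc=4; printf "%d KB\n", (k+m)*alloc }'
# -> 60 KB
```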
I think those only work for librbd clients, not for Ceph-CSI or other KRBD
clients.
> On Apr 2, 2023, at 3:47 PM, Danny Webb wrote:
>
> for RBD workloads you can set QOS values on a per image basis (and maybe on
> an entire pool basis):
>
> https://docs.ceph.com/en/latest/rbd/rbd-config-re
>
>>
>> What I also see is that I have three OSDs that have quite a lot of OMAP
>> data, in compare to other OSDs (~20 time higher). I don't know if this
>> is an issue:
>
> I have on 2TB ssd's with 2GB - 4GB omap data, while on 8TB hdd's the omap
> data is only 53MB - 100MB.
> Should I manu
A custom CRUSH rule can have two steps to enforce that.
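A sketch of such a rule for the quoted 12-chunk, two-room scenario — rule name and id are placeholders; the two `step choose` lines guarantee exactly 6 chunks land in each room:

```
rule ec_two_rooms {
    id 5
    type erasure
    step take default
    step choose indep 2 type room        # pick both rooms
    step chooseleaf indep 6 type host    # 6 chunks on distinct hosts per room
    step emit
}
```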
> On Mar 24, 2023, at 11:04, Danny Webb wrote:
>
> The question I have regarding this setup is, how can you guarantee that the
> 12 m chunks will be located evenly across the two rooms. What would happen
> if by chance all 12 chunks were
>>
>> I would be surprised if I installed "ceph-osd" and didn't get
>> ceph-volume, but I never thought about if "only" installing "ceph" did
>> or did not provide it.
I (perhaps naively) think of `ceph` as just the CLI, usually installing
`ceph-common` too.
>>
>> (from one of the remaining
With CentOS/Rocky 7-8 I’ve observed unexpected usage of swap when there is
plenty of physmem available.
Swap IMHO is a relic of a time when RAM capacities were lower and much more
expensive.
In years beginning with a 2, and with Ceph explicitly, I assert that swap
should never be enabled duri
> so size 4 / min_size 2 would be a lot better (of course)
More copies (or parity) are always more reliable, but one quickly gets into
diminishing returns.
In your scenario you might look into stretch mode, which currently would
require 4 replicas. In the future maybe it could support EC wit
This is not speculation: I have personally experienced this with an inherited
2R cluster.
> On Mar 3, 2023, at 04:07, Janne Johansson wrote:
>
>
> Do not assume the last PG needs to die in a horrible fire, killing
> several DC operators with it, it only takes a REALLY small outage, a
> fluke
> but what is the problem with only one active PG?
> someone pointed out "split brain" but I am unsure about this.
I think Paxos will ensure that split-brain doesn’t happen by virtue of needing
>50% of the mon quorum to be up.
> i think what happens in the worst case is this:
> only 1 PG is
> By the sounds of it, a cluster may be configured for the 100 PG / OSD target;
> adding pools to the former configuration scenario will require an increase in
> OSDs to maintain that recommended PG distribution target and accommodate an
> increase of PGs resulting from additional pools.
This can be subtle and is easy to mix up.
The “PG ratio” is intended to be the number of PGs hosted on each OSD, plus or
minus a few.
Note how I phrased that, it’s not the number of PGs divided by the number of
OSDs. Remember that PGs are replicated.
While each PG belongs to exactly one pool,
> * if rebalance will starts due EDAC or SFP degradation, is faster to fix the
> issue via DC engineers and put node back to work
A judicious mon_osd_down_out_subtree_limit setting can also do this by not
rebalancing when an entire node is detected down.
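For example (assuming central config; "host" is the usual choice here):

```shell
# Don't automatically mark OSDs out when an entire host goes down;
# individual OSD failures are still handled normally:
ceph config set mon mon_osd_down_out_subtree_limit host
```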
> * noout prevents unwanted OSD's fi
Documented here:
https://github.com/ceph/ceph/blob/9754cafc029e1da83f5ddd4332b69066fe6b3ffb/src/common/options/global.yaml.in#L3202
Introduced back here with a bunch of other scrub tweaks:
https://github.com/ceph/ceph/pull/18971/files
Are your OSDs HDDs? Using EC?
How many deep scrubs do you h
When the client is libvirt/librbd/QEMU virtualization, IIRC one must set these
values in the hypervisor’s ceph.conf
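A sketch of what that hypervisor-side ceph.conf fragment might look like — the log and socket paths are illustrative, and the directories must be writable by the QEMU process:

```ini
[client]
debug_rbd = 20
log_file = /var/log/ceph/qemu-guest.$pid.log
admin_socket = /var/run/ceph/qemu-guest.$pid.asok
```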
> On Feb 1, 2023, at 11:05, Ruidong Gao wrote:
>
> Hi,
>
> You can use environment variable to set log level to what you want as below:
> bash-4.4$ export CEPH_ARGS="--debug-rbd=