> 24 active+clean+snaptrim
I see snaptrimming happening in your status output - do you know if
that was happening before restarting those OSDs? This is the mechanism
by which OSDs clean up deleted snapshots, and once all OSDs have
completed snaptrim for a given snapshot it should be removed
Hello,
Which version of Ceph are you using? Are all of your OSDs currently
up+in? If you're HEALTH_OK and all OSDs are up, snaptrim should work
through the removed_snaps_queue and clear it over time, but I have
seen cases where this seems to get stuck and restarting OSDs can help.
Josh
On Wed,
MPU etags are an MD5-of-MD5s, FWIW. If the user knows how the parts were
uploaded then the etag can be used to verify contents, both just after upload and
then at download time (both need to be validated if you want end-to-end
validation - but then you're trusting the system to not change the etag
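For reference, the MD5-of-MD5s scheme can be sketched in a few lines of Python (a minimal model I'm adding for illustration; the `multipart_etag` name and the tiny demo parts are mine, and real verification only works if you split the local file at the same part boundaries the uploader used, e.g. 8 MiB parts):

```python
import hashlib

def multipart_etag(parts):
    """S3-style multipart ETag: MD5 of the concatenated *binary* MD5
    digests of each part, suffixed with '-<part count>'."""
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return "%s-%d" % (hashlib.md5(digests).hexdigest(), len(parts))

# Demo with two toy "parts"; real parts are typically several MiB each.
parts = [b"a" * 8, b"b" * 8]
print(multipart_etag(parts))  # 32 hex chars, then "-2"
```

Note the suffix: a multipart ETag always carries the part count, which is how you can tell it apart from a plain single-PUT MD5 ETag.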
On Fri, Feb 2, 2024 at 7:44 AM Ruben Vestergaard wrote:
> Is the RBD client performing partial object reads? Is that even a thing?
Yup! The rados API has both length and offset parameters for reads
(https://docs.ceph.com/en/latest/rados/api/librados/#c.rados_aio_read)
and writes
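Those semantics can be mimicked with a toy in-memory store (illustrative Python only, not the real librados binding; the `FakeObjectStore` class and object names are made up) to show how offset/length reads let a client fetch just a slice of an object:

```python
# Toy model of RADOS read/write semantics: both calls take an offset,
# so a client can touch a byte range of an object without transferring
# the whole thing (mirrors rados_aio_read's len/off parameters).
class FakeObjectStore:
    def __init__(self):
        self.objects = {}

    def write(self, name, data, offset=0):
        buf = bytearray(self.objects.get(name, b""))
        end = offset + len(data)
        if end > len(buf):
            buf.extend(b"\x00" * (end - len(buf)))  # zero-fill any gap
        buf[offset:end] = data
        self.objects[name] = bytes(buf)

    def read(self, name, length, offset=0):
        return self.objects[name][offset:offset + length]

store = FakeObjectStore()
store.write("rbd_data.1.0", b"A" * 4096)
store.write("rbd_data.1.0", b"B" * 512, offset=1024)  # partial overwrite
chunk = store.read("rbd_data.1.0", 512, offset=1024)  # partial read
print(len(chunk))  # 512
```

This is exactly why an RBD client issuing a small I/O inside a 4 MiB object doesn't need to read the full object.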
Ah, yeah, you hit https://tracker.ceph.com/issues/63389 during the upgrade.
Josh
On Tue, Jan 30, 2024 at 3:17 AM Jan Marek wrote:
>
> Hello again,
>
> I'm sorry, I forgot to attach the file... :-(
>
> Sincerely
> Jan
>
> On Tue, Jan 30, 2024 at 11:09:44 CET, Jan Marek wrote:
> > Hello Sridhar,
> >
> On Mon, Jan 29, 2024 at 4:47 PM Josh Baergen
> wrote:
>>
>> Make sure you're on a fairly recent version of Ceph before doing this,
>> though.
>>
>> Josh
>>
>> On Mon, Jan 29, 2024 at 5:05 AM Janne Johansson wrote:
>> >
>> >
Make sure you're on a fairly recent version of Ceph before doing this, though.
Josh
On Mon, Jan 29, 2024 at 5:05 AM Janne Johansson wrote:
>
> On Mon, Jan 29, 2024 at 12:58, Michel Niyoyita wrote:
> >
> > Thank you Frank ,
> >
> > All disks are HDDs . Would like to know if I can increase the
> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger from 16384
> to 4096 hurt performance of HDD OSDs in any way? I have no growing latency on
> HDD OSDs, where data is stored, but it would be easier to set it in the [osd]
> section without cherry-picking only SSD/NVMe OSDs, but for all at
> Do you know if rocksdb_cf_compact_on_deletion_trigger and
> rocksdb_cf_compact_on_deletion_sliding_window can be changed at runtime
> without an OSD restart?
Unfortunately they cannot. You'll want to set them in centralized conf
and then restart OSDs for them to take effect.
Josh
On Fri, Jan
I would start with "ceph tell osd.1 config diff", as I find that
output the easiest to read when trying to understand where various
config overrides are coming from. You almost never need to use "ceph
daemon" in Octopus+ systems since "ceph tell" should be able to access
pretty much all commands.
Given that this is s3, are the slow ops on index or data OSDs? (You
mentioned HDD but I don't want to assume that meant that the osd you
mentioned
is data)
Josh
On Fri, Dec 1, 2023 at 7:05 AM VÔ VI wrote:
>
> Hi Stefan,
>
> I am running replicate x3 with a failure domain as host and setting
>
The ticket has been updated, but it's probably important enough to
state on the list as well: The documentation is currently wrong in a
way that running the command as documented will cause this corruption.
The correct command to run is:
ceph-bluestore-tool \
--path \
in the monitor
>> configuration database.
>>
>> This works rather well for various Ceph components, including the monitors.
>> RocksDB options are also applied to monitors correctly, but for some reason
>> are being ignored.
>>
>> /Z
>>
>> On Sa
Apologies if you tried this already and I missed it - have you tried
configuring that setting in /etc/ceph/ceph.conf (or wherever your conf
file is) instead of via 'ceph config'? I wonder if mon settings like
this one won't actually apply the way you want because they're needed
before the mon has
Hi Simon,
If the OSD is actually up, using 'ceph osd down' will cause it to flap
but come back immediately. To prevent this, you would want to 'ceph
osd set noup'. However, I don't think this is what you actually want:
> I'm thinking (but perhaps incorrectly?) that it would be good to keep the
My guess is that this is because this setting can't be changed at
runtime, though if so that's a new enforcement behaviour in Quincy
that didn't exist in prior versions.
I think what you want to do is 'config set osd osd_op_queue wpq'
(assuming you want this set for all OSDs) and then restart
Hi Jonathan,
> - All PGs seem to be backfilling at the same time which seems to be in
> violation of osd_max_backfills. I understand that there should be 6 readers
> and 6 writers at a time, but I'm seeing a given OSD participate in more
> than 6 PG backfills. Is an OSD only considered as
Out of curiosity, what is your require_osd_release set to? (ceph osd
dump | grep require_osd_release)
Josh
On Tue, Jul 11, 2023 at 5:11 AM Eugen Block wrote:
>
> I'm not so sure anymore if that could really help here. The dump-keys
> output from the mon contains 42 million osd_snap prefix
On Tue, Jun 27, 2023 at 11:50 AM Matthew Booth wrote:
> What do you mean by saturated here? FWIW I was using the default cache
> size of 1G and each test run only wrote ~100MB of data, so I don't
> think I ever filled the cache, even with multiple runs.
Ah, my apologies - I saw that fio had been
Hi Matthew,
We've done a limited amount of work on characterizing the pwl and I think
it suffers the classic problem of some writeback caches in that, once the
cache is saturated, it's actually worse than just being in writethrough.
IIRC the pwl does try to preserve write ordering (unlike the
Hi Zakhar,
I'm going to guess that it's a permissions issue arising from
https://github.com/ceph/ceph/pull/48804, which was included in 16.2.13. You
may need to change the directory permissions, assuming that you manage the
directories yourself. If this is managed by cephadm or something like
Hi Samuel,
Both pgremapper and the CERN scripts were developed against Luminous,
and in my experience 12.2.13 has all of the upmap patches needed for
the scheme that Janne outlined to work. However, if you have a complex
CRUSH map sometimes the upmap balancer can struggle, and I think
that's true
Hi Samuel,
While the second method would probably work fine in the happy path, if
something goes wrong I think you'll be happier having a uniform
release installed. In general, we've found the backfill experience to
be better on Nautilus than Luminous, so my vote would be for the first
method.
hanism? Besides that, if I don't want to upgrade version in
> recently, is a good way that adjust osd_pool_default_read_lease_ratio to
> lower? For example, 0.4 or 0.2 to reach the user's tolerance time.
>
> Yite Gu
>
> On Fri, Mar 10, 2023 at 22:09, Josh Baergen wrote:
>>
>> Hello
Hello,
When you say "osd restart", what sort of restart are you referring to
- planned (e.g. for upgrades or maintenance) or unplanned (OSD
hang/crash, host issue, etc.)? If it's the former, then these
parameters shouldn't matter provided that you're running a recent
enough Ceph with default
caches can improve the IOPS
performance of SSDs.
Josh
On Tue, Feb 28, 2023 at 7:19 AM Boris Behrens wrote:
>
> Hi Josh,
> we upgraded 15.2.17 -> 16.2.11 and we only use rbd workload.
>
>
>
> On Tue, Feb 28, 2023 at 15:00, Josh Baergen wrote:
>>
>>
Hi Boris,
Which version did you upgrade from and to, specifically? And what
workload are you running (RBD, etc.)?
Josh
On Tue, Feb 28, 2023 at 6:51 AM Boris Behrens wrote:
>
> Hi,
> today I did the first update from octopus to pacific, and it looks like the
> avg apply latency went up from 1ms
Hi Boris,
This sounds a bit like https://tracker.ceph.com/issues/53729.
https://tracker.ceph.com/issues/53729#note-65 might help you diagnose
whether this is the case.
Josh
On Tue, Feb 21, 2023 at 9:29 AM Boris Behrens wrote:
>
> Hi,
> today I wanted to increase the PGs from 2k -> 4k and
Do the counters need to be moved under a separate key? That would
break anything today that currently tries to parse them. We have quite
a bit of internal monitoring that relies on "perf dump" output, but
it's mostly not output that I would expect to gain labels in general
(e.g. bluestore stats).
This often indicates that something is up with your mgr process. Based
on ceph status, it looks like both the mgr and mon had recently
restarted. Is that expected?
Josh
On Sun, Jan 29, 2023 at 3:36 AM Daniel Brunner wrote:
>
> Hi,
>
> my ceph cluster started to show HEALTH_WARN, there are no
This might be due to tombstone accumulation in rocksdb. You can try to
issue a compact to all of your OSDs and see if that helps (ceph tell
osd.XXX compact). I usually prefer to do this one host at a time just
in case it causes issues, though on a reasonably fast RBD cluster you
can often get away
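The host-at-a-time approach can be sketched roughly as follows (my own Python sketch, not a tool from the thread; the osd-to-host map and the injected `run_cmd` callable are assumptions, and in real use `run_cmd` would shell out to the `ceph` CLI):

```python
from collections import defaultdict

def compact_by_host(osd_to_host, run_cmd):
    """Issue 'ceph tell osd.N compact' one host at a time, so that only
    a single host's OSDs are compacting at any moment. run_cmd is
    injected (e.g. subprocess.run in real use) to keep this testable."""
    by_host = defaultdict(list)
    for osd, host in sorted(osd_to_host.items()):
        by_host[host].append(osd)
    batches = []
    for host in sorted(by_host):
        cmds = ["ceph tell osd.%d compact" % osd for osd in by_host[host]]
        for c in cmds:
            run_cmd(c)          # real use: wait for completion here
        batches.append((host, cmds))
    return batches

log = []
compact_by_host({0: "host-a", 1: "host-b", 2: "host-a"}, log.append)
print(log)  # host-a's OSDs first, then host-b's
```

Waiting for each host's compactions to finish before moving on is the point: rocksdb compaction is I/O-heavy, and serializing it per host bounds the blast radius if something goes wrong.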
> - you will need to move those filestore OSDs to BlueStore before hitting
> Pacific, might even be part of the Nautilus upgrade. This takes some time if
> I remember correctly.
I don't think this is necessary. It _is_ necessary to convert all
leveldb to rocksdb before upgrading to Pacific, on
It's also possible you're running into large pglog entries - any
chance you're running RGW and there's an s3:CopyObject workload
hitting an object that was uploaded with MPU?
https://tracker.ceph.com/issues/56707
If that's the case, you can inject a much smaller value for
osd_min_pg_log_entries
Hi Murilo,
This is briefly referred to by
https://docs.ceph.com/en/octopus/rados/deployment/ceph-deploy-mon/,
but in order to avoid split brain issues it's common that distributed
consensus algorithms require a strict majority in order to maintain
quorum. This is why production deployments of
Hi Alexander,
I'd be suspicious that something is up with pool 25. Which pool is
that? ('ceph osd pool ls detail') Knowing the pool and the CRUSH rule
it's using is a good place to start. Then that can be compared to your
CRUSH map (e.g. 'ceph osd tree') to see why Ceph is struggling to map
that
>
> Hi Josh,
>
> On Mon, Oct 24, 2022 at 07:20:46AM -0600, Josh Baergen wrote:
> > > I've included the osd df output below, along with pool and crush rules.
> >
> > Looking at these, the balancer module should be taking care of this
> > imbalance automaticall
Hi Tim,
> I've included the osd df output below, along with pool and crush rules.
Looking at these, the balancer module should be taking care of this
imbalance automatically. What does "ceph balancer status" say?
Josh
As of Nautilus+, when you set pg_num, it actually internally sets
pg(p)_num_target, and then slowly increases (or decreases, if you're
merging) pg_num and then pgp_num until it reaches the target. The
amount of backfill scheduled into the system is controlled by
target_max_misplaced_ratio.
Josh
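That stepping behaviour can be modelled roughly in a few lines (a simplification I'm adding, not the actual mgr implementation; the step-size formula is my approximation of "keep the misplaced fraction under target_max_misplaced_ratio", whose default is 5%):

```python
def pgp_num_steps(current, target, max_misplaced_ratio=0.05):
    """Rough model of how the mgr walks pgp_num toward its target:
    raising pgp_num by delta remaps about delta/(current+delta) of the
    data, so each step is sized to stay under the misplaced cap."""
    steps = []
    while current < target:
        # largest delta with delta/(current+delta) <= ratio
        delta = max(1, int(max_misplaced_ratio * current
                           / (1 - max_misplaced_ratio)))
        current = min(target, current + delta)
        steps.append(current)
    return steps

# Doubling 2048 -> 4096 happens in many ~5%-misplaced increments.
print(pgp_num_steps(2048, 4096))
```

Raising target_max_misplaced_ratio makes the steps (and the backfill bursts) bigger; lowering it makes the split gentler but slower.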
FWIW, this is what the Quincy release notes say: LevelDB support has
been removed. WITH_LEVELDB is no longer a supported build option.
Users should migrate their monitors and OSDs to RocksDB before
upgrading to Quincy.
Josh
On Wed, Sep 28, 2022 at 4:20 AM Eugen Block wrote:
>
> Hi,
>
> there
Hey Wyll,
> $ pgremapper cancel-backfill --yes # to stop all pending operations
> $ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
> $ bash upmap-moves
>
> Repeat the above 3 steps until balance is achieved, then re-enable the
> balancer and unset the "no" flags set
Hi Fulvio,
> leads to a much shorter and less detailed page, and I assumed Nautilus
> was far behind Quincy in managing this...
The only major change I'm aware of between Nautilus and Quincy is that
in Quincy the mClock scheduler is able to automatically tune up/down
backfill parameters to
Hi Fulvio,
https://docs.ceph.com/en/quincy/dev/osd_internals/backfill_reservation/
describes the prioritization and reservation mechanism used for
recovery and backfill. AIUI, unless a PG is below min_size, all
backfills for a given pool will be at the same priority.
force-recovery will modify
Hi Jesper,
Given that the PG is marked recovery_unfound, I think you need to
follow
https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#unfound-objects.
Josh
On Tue, Sep 20, 2022 at 12:56 AM Jesper Lykkegaard Karlsen
wrote:
>
> Dear all,
>
> System: latest Octopus, 8+3
Hey Richard,
On Tue, Oct 19, 2021 at 8:37 PM Richard Bade wrote:
> user@cstor01 DEV:~$ sudo ceph config set osd/host:cstor01 osd_max_backfills 2
> user@cstor01 DEV:~$ sudo ceph config get osd.0 osd_max_backfills
> 2
> ...
> Are others able to reproduce?
Yes, we've found the same thing on
Hi Peter,
> When I check for circles I found that running the upmap balancer alone never
> seems to create
> any kind of circle in the graph
By a circle, do you mean something like this?
pg 1.a: 1->2 (upmap to put a chunk on 2 instead of 1)
pg 1.b: 2->3
pg 1.c: 3->1
If so, then it's not
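Detecting such a circle is a plain cycle search over the upmap moves (illustrative Python with a made-up input format of `(from_osd, to_osd)` pairs, one per pg-upmap-items mapping):

```python
def find_cycle(upmaps):
    """upmaps: list of (osd_from, osd_to) moves. Returns True if the
    moves form a directed cycle, i.e. data chasing itself around a
    ring of OSDs (1->2, 2->3, 3->1)."""
    graph = {}
    for src, dst in upmaps:
        graph.setdefault(src, set()).add(dst)

    def dfs(node, seen):
        if node in seen:
            return True
        return any(dfs(nxt, seen | {node}) for nxt in graph.get(node, ()))

    return any(dfs(start, set()) for start in graph)

print(find_cycle([(1, 2), (2, 3), (3, 1)]))  # True  (a circle)
print(find_cycle([(1, 2), (2, 3)]))          # False (a chain)
```

A chain of moves just shifts data along; only a closed ring would mean the upmaps collectively accomplish nothing.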
> I have a question regarding the last step. It seems to me that the ceph
> balancer is not able to remove the upmaps
> created by pgremapper, but instead creates new upmaps to balance the pgs
> among osds.
The balancer will prefer to remove existing upmaps[1], but it's not
guaranteed. The
> I assume it's the balancer module. If you write lots of data quickly
> into the cluster the distribution can vary and the balancer will try
> to even out the placement.
The balancer won't cause degradation, only misplaced objects.
> Degraded data redundancy: 260/11856050 objects degraded
>
Hey Seb,
> I have a test cluster on which I created pools rbd and cephfs (octopus), when
> I copy a directory containing many small files on a pool rbd the USED part of
> the ceph df command seems normal on the other hand on cephfs the USED part
> seems really abnormal, I tried to change the
Hi Melanie,
On Mon, Sep 6, 2021 at 10:06 AM Desaive, Melanie
wrote:
> When I execute "ceph mon_status --format json-pretty" from our
> ceph-management VM, the correct mon nodes are returned.
>
> But when I execute "ceph daemon osd.xx config show | grep mon_host" on the
> respective storage
y or need to
> trigger somehow.
>
> On Wed, Sep 1, 2021 at 17:07, Josh Baergen wrote:
>>
>> Googling for that balancer error message, I came across
>> https://tracker.ceph.com/issues/22814, which was closed/wont-fix, and
>> some threads that claimed that
> type replicated
> step take default class ssd
> step chooseleaf firstn 0 type host
> step emit
> }
>
> pool 54 'rgw.buckets.index' replicated size 3 min_size 1 crush_rule 1
> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change
> 31607 lf
GiB 4.1 GiB 401 GiB
> 55.20 0.91 103 up osd.201
> 217 ssd0.87329 1.0 894 GiB 261 GiB 83 GiB 176 GiB 2.3 GiB 634 GiB
> 29.15 0.48 89 up osd.217
>
>
> When I check the balancer status I saw that: ""optimize_result": "So
Hi there,
Could you post the output of "ceph osd df tree"? I would highly
suspect that this is a result of imbalance, and that's the easiest way
to see if that's the case. It would also confirm that the new disks
have taken on PGs.
Josh
On Tue, Aug 31, 2021 at 10:50 AM mhnx wrote:
>
> I'm
23hdd 0.89893 osd.23 up
> > 1.0 1.0
> > -45 2.69679 host jceph-n09
> > 24hdd 0.89893 osd.24 up
> > 1.0 1.0
> > 25hdd 0.89893 osd.25
Hi Jerry,
In general, your CRUSH rules should define the behaviour you're
looking for. Based on what you've stated about your configuration,
after failing a single node or an OSD on a single node, then you
should still be able to tolerate two more failures in the system
without losing data (or
Have you confirmed that all OSD hosts can see each other (on both the front
and back networks if you use split networks)? If there's not full
connectivity, then that can lead to the issues you see here.
Checking the logs on the mons can be helpful, as it will usually indicate
why a given OSD is
Oh, I just read your message again, and I see that I didn't answer your
question. :D I admit I don't know how MAX AVAIL is calculated, and whether
it takes things like imbalance into account (it might).
Josh
On Tue, Jul 6, 2021 at 7:41 AM Josh Baergen
wrote:
> Hey Wladimir,
>
> Th
(RAW STORAGE/RAW
> USED)-(SUM(POOLS/USED)) = 19-17.5 = 1.5 TiB ?
>
> As it does not seem I would get any more hosts for this setup,
> I am seriously thinking of bringing down this Ceph
> and setting up instead a Btrfs storing qcow2 images served over
> iSC
led RBD writes to EC data-pool ?
>
> Josh Baergen wrote:
> > Hey Arkadiy,
> >
> > If the OSDs are on HDDs and were created with the default
> > bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus, then in
> > effect data will be allocated from the pool in
Hey Arkadiy,
If the OSDs are on HDDs and were created with the default
bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus, then in
effect data will be allocated from the pool in 640KiB chunks (64KiB *
(k+m)). 5.36M objects taking up 501GiB is an average object size of 98KiB
which
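The arithmetic can be checked with a small helper (my own sketch; it assumes each of the k+m shards rounds its allocation up to min_alloc_size independently, which is the same reasoning as the 64KiB * (k+m) figure above):

```python
def ec_allocated_bytes(logical_size, k, m, min_alloc=64 * 1024):
    """Bytes actually allocated for one object on a k+m EC pool where
    every shard rounds up to min_alloc (64 KiB was the default
    bluestore_min_alloc_size_hdd through Octopus)."""
    stripe = -(-logical_size // k)      # bytes per data shard, rounded up
    chunks = -(-stripe // min_alloc)    # min_alloc units per shard
    return chunks * min_alloc * (k + m)

# The ~98 KiB average object from this thread, on an 8+2 pool:
alloc = ec_allocated_bytes(98 * 1024, 8, 2)
print(alloc // 1024, "KiB allocated")  # 640 KiB for ~98 KiB of data
```

So a ~98 KiB object still consumes a full 640 KiB of raw allocation, roughly 6.5x amplification, which is consistent with the RAW USED numbers being discussed.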
Hello all,
I just wanted to let you know that DigitalOcean has open-sourced a
tool we've developed called pgremapper.
Originally inspired by CERN's upmap exception table manipulation
scripts, pgremapper is a CLI written in Go which exposes a number of
upmap-based algorithms for backfill-related
Hey Josh,
Thanks for the info!
> With respect to reservations, it seems like an oversight that
> we don't reserve other shards for backfilling. We reserve all
> shards for recovery [0].
Very interesting that there is a reservation difference between
backfill and recovery.
> On the other hand,
Hey all,
I wanted to confirm my understanding of some of the mechanics of
backfill in EC pools. I've yet to find a document that outlines this
in detail; if there is one, please send it my way. :) Some of what I
write below is likely in the "well, duh" category, but I tended
towards completeness.
I thought that recovery below min_size for EC pools wasn't expected to work
until Octopus. From the Octopus release notes: "Ceph will allow recovery
below min_size for Erasure coded pools, wherever possible."
Josh
On Tue, Mar 30, 2021 at 6:53 AM Frank Schilder wrote:
> Dear Rainer,
>
> hmm,
As was mentioned in this thread, all of the mon clients (OSDs included)
learn about other mons through monmaps, which are distributed when mon
membership and election changes. Thus, your OSDs should already know about
the new mons.
mon_host indicates the list of mons that mon clients should try
Linux will automatically make use of all available memory for the buffer
cache, freeing buffers when it needs more memory for other things. This is
why MemAvailable is more useful than MemFree; the former indicates how much
memory could be used between Free, buffer cache, and anything else that
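The distinction is easy to see by pulling both fields out of /proc/meminfo (illustrative Python over a canned sample with made-up numbers; on a real host you would read the file itself):

```python
def mem_fields_kib(meminfo_text):
    """Extract MemFree and MemAvailable (both in kB) from /proc/meminfo
    content. MemAvailable ~ free + reclaimable buffer/page cache, which
    is why it's the better 'how much can I actually use?' number."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemFree", "MemAvailable"):
            fields[key] = int(rest.split()[0])
    return fields

sample = """MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:    9216000 kB
Buffers:          204800 kB
Cached:          8500000 kB"""
info = mem_fields_kib(sample)
print(info["MemAvailable"] - info["MemFree"], "kB sitting in reclaimable cache")
```

A host like this looks nearly out of memory if you only watch MemFree, while MemAvailable shows most of that "used" memory is cache the kernel will hand back on demand.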
Hi George,
> May I ask if enabling pool compression helps for the future space
> amplification?
If the amplification is indeed due to min_alloc_size, then I don't
think that compression will help. My understanding is that compression
is applied post-EC (and thus probably won't even activate due
On Wed, Jan 27, 2021 at 12:24 AM George Yil wrote:
> May I ask if it can be dynamically changed and any disadvantages should be
> expected?
Unless there's some magic I'm unaware of, there is no way to
dynamically change this. Each OSD must be recreated with the new
min_alloc_size setting. In
> I created radosgw pools. secondaryzone.rgw.buckets.data pool is
configured as EC 8+2 (jerasure).
Did you override the default bluestore_min_alloc_size_hdd (64k in that
version IIRC) when creating your hdd OSDs? If not, all of the small objects
produced by that EC configuration will be leading