[ceph-users] Re: using non client.admin user for ceph-iscsi gateways

2019-09-06 Thread Wesley Dillingham
From: Jason Dillaman Sent: Friday, September 6, 2019 12:37 PM To: Wesley Dillingham Cc: ceph-users@ceph.io Subject: Re: [ceph-users] using non client.admin user for ceph-iscsi gateways Notice: This email is from an external sender. On Fri, Sep 6, 2019 at 12:00

[ceph-users] using non client.admin user for ceph-iscsi gateways

2019-09-06 Thread Wesley Dillingham
The iscsi-gateway.cfg seemingly allows for an alternative cephx user other than client.admin to be used; however, the comments in the documentation say specifically to use client.admin. Other than having the cfg file point to the appropriate key/user with "gateway_keyring" and giving that
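For reference, a minimal sketch of the pieces involved, assuming a hypothetical client.iscsi user; the caps shown are deliberately broad placeholders, since the minimal capability set is exactly what this thread is asking about:
  ceph auth get-or-create client.iscsi mon 'allow *' osd 'allow *' \
      -o /etc/ceph/ceph.client.iscsi.keyring    # placeholder caps, not a vetted minimum
  # /etc/ceph/iscsi-gateway.cfg
  [config]
  cluster_name = ceph
  gateway_keyring = ceph.client.iscsi.keyring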

[ceph-users] OSD's addrvec, not getting msgr v2 address, PGs stuck unknown or peering

2019-11-11 Thread Wesley Dillingham
Running 14.2.4 (but same issue observed on 14.2.2) we have a problem with, thankfully, a testing cluster, where all PGs are failing to peer and are stuck in peering, unknown, stale, etc. states. My working theory is that this is because the OSDs don't seem to be utilizing msgr v2, as "ceph osd find

[ceph-users] Re: iSCSI Gateway reboots and permanent loss

2019-12-04 Thread Wesley Dillingham
On 12/04/2019 08:26 AM, Gesiel Galvão Bernardes wrote: > > Hi, > > > > Em qua., 4 de dez. de 2019 às 00:31, Mike Christie > <mailto:mchri...@redhat.com>> escreveu: > > > > On 12/03/2019 04:19 PM, Wesley Dillingham wrote: > > > Thanks. If

[ceph-users] iSCSI Gateway reboots and permanent loss

2019-12-03 Thread Wesley Dillingham
We utilize 4 iSCSI gateways in a cluster and have noticed the following during patching cycles when we sequentially reboot single iSCSI-gateways: "gwcli" often hangs on the still-up iSCSI GWs but sometimes still functions and gives the message: "1 gateway is inaccessible - updates will be

[ceph-users] Re: iSCSI Gateway reboots and permanent loss

2019-12-05 Thread Wesley Dillingham
com/in/wesleydillingham> On Thu, Dec 5, 2019 at 4:14 PM Mike Christie wrote: > On 12/04/2019 02:34 PM, Wesley Dillingham wrote: > > I have never had a permanent loss of a gateway but I'm a believer in > > Murphy's law and want to have a plan. Glad to hear that there is a > > solution

[ceph-users] Re: iSCSI Gateway reboots and permanent loss

2019-12-03 Thread Wesley Dillingham
l of a gateway via the "gwcli". I think the Ceph dashboard can > > do that as well. > > > > On Tue, Dec 3, 2019 at 1:59 PM Wesley Dillingham > wrote: > >> > >> We utilize 4 iSCSI gateways in a cluster and have noticed the following > during p

[ceph-users] Re: Unable to increase PG numbers

2020-02-25 Thread Wesley Dillingham
I believe you are encountering https://tracker.ceph.com/issues/39570 You should do a "ceph versions" on a mon and ensure all your OSDs are nautilus and if so set "ceph osd require-osd-release nautilus" then try to increase pg num. Upgrading to a more recent nautilus release is also probably a
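A sketch of the command sequence described above, with <pool> and <new_pg_num> as placeholders:
  ceph versions                                  # confirm every OSD reports a nautilus version
  ceph osd require-osd-release nautilus
  ceph osd pool set <pool> pg_num <new_pg_num>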

[ceph-users] Re: All pgs peering indefinetely

2020-02-04 Thread Wesley Dillingham
I would guess that you have something preventing osd to osd communication on ports 6800-7300 or osd to mon communication on port 6789 and/or 3300. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Tue, Feb 4, 2020 at 12:44 PM
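A quick way to test those paths from an affected host (a sketch; assumes an nc build that accepts port ranges):
  nc -zv <mon_host> 3300; nc -zv <mon_host> 6789
  nc -zv <osd_host> 6800-7300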

[ceph-users] ceph-iscsi create RBDs on erasure coded data pools

2020-01-30 Thread Wesley Dillingham
Is it possible to create an EC backed RBD via ceph-iscsi tools (gwcli, rbd-target-api)? It appears that a pre-existing RBD created with the rbd command can be imported, but there is no means to directly create an EC backed RBD. The API seems to expect a single pool field in the body to work with.
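The workaround mentioned above (create the image with the rbd CLI, then bring it into gwcli) relies on the --data-pool option; a sketch with hypothetical pool and image names:
  rbd create --size 100G --data-pool ec-data-pool rbd-meta-pool/iscsi-image-01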

[ceph-users] Re: ceph-volume lvm filestore OSDs fail to start on reboot. Permission denied on journal partition

2020-01-23 Thread Wesley Dillingham
red activation for: 219-529ea347-b129-4b53-81cb-bb5f2d91f8ae Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Thu, Jan 23, 2020 at 4:31 AM Jan Fajerski wrote: > On Wed, Jan 22, 2020 at 12:00:28PM -0500, Wesley Dill

[ceph-users] ceph-volume lvm filestore OSDs fail to start on reboot. Permission denied on journal partition

2020-01-22 Thread Wesley Dillingham
After upgrading to Nautilus 14.2.6 from Luminous 12.2.12 we are seeing the following behavior on OSDs which were created with "ceph-volume lvm create --filestore --osd-id --data --journal " Upon restart of the server containing these OSDs they fail to start with the following error in the logs:

[ceph-users] acting_primary is an osd with primary-affinity of 0, which seems wrong

2020-01-03 Thread Wesley Dillingham
In an exploration of trying to speed up the long tail of backfills resulting from marking a failing OSD out, I began looking at my PGs to see if I could tune some settings and noticed the following: Scenario: on a 12.2.12 cluster, I am alerted of an inconsistent PG and am alerted of SMART failures

[ceph-users] Re: Q release name

2020-03-23 Thread Wesley Dillingham
Checking the word "Octopus" in different languages the only one starting with a "Q" is in "Maltese": "Qarnit" For good measure here is a Maltesian Qarnit stew recipe: http://littlerock.com.mt/food/maltese-traditional-recipe-stuffat-tal-qarnit-octopus-stew/ Respectfully, *Wes Dillingham*

[ceph-users] Re: radosgw beast access logs

2020-08-19 Thread Wesley Dillingham
We would very much appreciate having this backported to nautilus. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Wed, Aug 19, 2020 at 9:02 AM Casey Bodley wrote: > On Tue, Aug 18, 2020 at 1:33 PM Graham Allan wrote: > > > >

[ceph-users] Apparent bucket corruption error: get_bucket_instance_from_oid failed

2020-08-04 Thread Wesley Dillingham
Long-running cluster, currently running 14.2.6. I have a certain user whose buckets have become corrupted, in that the following commands: "radosgw-admin bucket check --bucket" and "radosgw-admin bucket list --bucket=" return with the following: ERROR: could not init bucket: (2) No such file or

[ceph-users] Meaning of the "tag" key in bucket metadata

2020-08-12 Thread Wesley Dillingham
Recently we encountered an instance of bucket corruption of two varieties. One in which the bucket metadata was missing and another in which the bucket.instance metadata was missing for various buckets. We have seemingly been successful in restoring the metadata by reconstructing it from the

[ceph-users] Running Mons on msgrv2/3300 only.

2020-12-08 Thread Wesley Dillingham
We rebuilt all of our mons in one cluster such that they bind only to port 3300 with msgrv2. Previous to this we were binding to both 6789 and 3300. All of our server and client components are sufficiently new (14.2.x) and we haven’t observed any disruption but I am inquiring if this may be
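For context, a v2-only mon_host line in ceph.conf uses the bracketed addrvec form; a sketch with placeholder addresses:
  mon_host = [v2:192.0.2.11:3300],[v2:192.0.2.12:3300],[v2:192.0.2.13:3300]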

[ceph-users] Re: Monitors not starting, getting "e3 handle_auth_request failed to assign global_id"

2020-12-08 Thread Wesley Dillingham
We have also had this issue multiple times in 14.2.11 On Tue, Dec 8, 2020, 5:11 PM wrote: > I have same issue. My cluster runing 14.2.11 versions. What is your > version ceph? > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send

[ceph-users] Re: Monitors not starting, getting "e3 handle_auth_request failed to assign global_id"

2020-12-14 Thread Wesley Dillingham
/ packet inspection security technology being run on the servers. Perhaps you've made similar updates. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Tue, Dec 8, 2020 at 7:46 PM Wesley Dillingham wrote: > We have

[ceph-users] Mon's falling out of quorum, require rebuilding. Rebuilt with only V2 address.

2020-11-19 Thread Wesley Dillingham
We have had multiple clusters experiencing the following situation over the past few months on both 14.2.6 and 14.2.11. On a few instances it seemed random , in a second situation we had temporary networking disruption, in a third situation we accidentally made some osd changes which caused

[ceph-users] Re: bug ceph auth

2021-07-14 Thread Wesley Dillingham
is /var/lib/ceph/bootstrap-osd/ in existence and writeable? Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Wed, Jul 14, 2021 at 8:35 AM Marc wrote: > > > > [@t01 ~]# ceph auth get client.bootstrap-osd -o >

[ceph-users] Re: bug ceph auth

2021-07-14 Thread Wesley Dillingham
Do you get the same error if you just do "ceph auth get client.bootstrap-osd" i.e. does client.bootstrap exist as a user? Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Wed, Jul 14, 2021 at 1:56 PM Wesley D

[ceph-users] Re: ceph osd continously fails

2021-08-12 Thread Wesley Dillingham
Can you send the results of "ceph daemon osd.0 status" and maybe do that for a couple of osd ids ? You may need to target ones which are currently running. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Wed, Aug 11, 2021 at 9:51

[ceph-users] Re: erasure coded pool PG stuck inconsistent on ceph Pacific 15.2.13

2021-11-18 Thread Wesley Dillingham
That response is typically indicative of a PG whose OSD set has changed since it was last scrubbed (typically from a disk failing). Are you sure it's actually getting scrubbed when you issue the scrub? For example, you can issue: "ceph pg query" and look for "last_deep_scrub_stamp" which will
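For example (the pgid is a placeholder):
  ceph pg <pgid> query | grep -E 'last_scrub_stamp|last_deep_scrub_stamp'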

[ceph-users] Re: erasure coded pool PG stuck inconsistent on ceph Pacific 15.2.13

2021-11-19 Thread Wesley Dillingham
t; run at all. Could the deepscrubbing process be stuck elsewhere? > On 11/18/21 3:29 PM, Wesley Dillingham wrote: > > That response is typically indicative of a pg whose OSD sets has changed > since it was last scrubbed (typically from a disk failing). > > Are you sure its actual

[ceph-users] Re: erasure coded pool PG stuck inconsistent on ceph Pacific 15.2.13

2021-11-19 Thread Wesley Dillingham
You may also be able to use an upmap (or the upmap balancer) to help make room for you on the osd which is too full. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Fri, Nov 19, 2021 at 1:14 PM Wesley Dillingham wrote:
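A sketch of a manual upmap that moves one PG off the overfull OSD (IDs are placeholders), or alternatively letting the balancer handle it:
  ceph osd pg-upmap-items <pgid> <overfull_osd_id> <emptier_osd_id>
  # or:
  ceph balancer mode upmap
  ceph balancer on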

[ceph-users] Re: Experience reducing size 3 to 2 on production cluster?

2021-12-10 Thread Wesley Dillingham
I would avoid doing this. Size 2 is not where you want to be. Maybe you can give more details about your cluster size and shape and what you are trying to accomplish and another solution could be proposed. The contents of "ceph osd tree " and "ceph df" would help. Respectfully, *Wes Dillingham*

[ceph-users] Re: How do you handle large Ceph object storage cluster?

2023-10-17 Thread Wesley Dillingham
Well you are probably in the top 1% of cluster size. I would guess that trying to cut your existing cluster in half while not encountering any downtime as you shuffle existing buckets between old cluster and new cluster would be harder than redirecting all new buckets (or users) to a second

[ceph-users] owner locked out of bucket via bucket policy

2023-10-25 Thread Wesley Dillingham
I have a bucket which got injected with bucket policy which locks the bucket even to the bucket owner. The bucket now cannot be accessed (even get its info or delete bucket policy does not work) I have looked in the radosgw-admin command for a way to delete a bucket policy but do not see anything.

[ceph-users] Re: owner locked out of bucket via bucket policy

2023-10-25 Thread Wesley Dillingham
tials with awscli to delete or overwrite this > bucket policy > > On Wed, Oct 25, 2023 at 4:11 PM Wesley Dillingham > wrote: > > > > I have a bucket which got injected with bucket policy which locks the > > bucket even to the bucket owner. The bucket now cannot be accessed

[ceph-users] Re: owner locked out of bucket via bucket policy

2023-10-26 Thread Wesley Dillingham
Thank you, this has worked to remove the policy. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Wed, Oct 25, 2023 at 5:10 PM Casey Bodley wrote: > On Wed, Oct 25, 2023 at 4:59 PM Wesley Dillingham > wrote: > >

[ceph-users] Re: Unable to fix 1 Inconsistent PG

2023-10-11 Thread Wesley Dillingham
If I recall correctly, when the acting or up set of a PG changes, the scrub information is lost. This was likely lost when you stopped osd.238 and changed the sets. I do not believe, based on your initial post, that you need to be using the objectstore tool currently. Inconsistent PGs are a common

[ceph-users] Re: Unable to fix 1 Inconsistent PG

2023-10-11 Thread Wesley Dillingham
to finish backfill 4 - issue the pg repair Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Wed, Oct 11, 2023 at 4:38 PM Wesley Dillingham wrote: > If I recall correctly When the acting or up_set of an PG changes

[ceph-users] Re: cannot repair a handful of damaged pg's

2023-10-06 Thread Wesley Dillingham
A repair is just a type of scrub and it is also limited by osd_max_scrubs, which in Pacific is 1. If another scrub is occurring on any OSD in the PG it won't start. Do "ceph osd set noscrub" and "ceph osd set nodeep-scrub", wait for all scrubs to stop (a few seconds probably), then issue the pg
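Putting that sequence together (the pgid is a placeholder):
  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # wait for in-flight scrubs to wind down, then:
  ceph pg repair <pgid>
  # once the repair has completed:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub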

[ceph-users] Re: Unable to fix 1 Inconsistent PG

2023-10-10 Thread Wesley Dillingham
You likely have a failing disk, what does "rados list-inconsistent-obj15.f4f" return? It should identify the failing osd. Assuming "ceph osd ok-to-stop " returns in the affirmative for that osd, you likely need to stop the associated osd daemon, then mark it out "ceph osd out wait for it to

[ceph-users] Re: Unable to fix 1 Inconsistent PG

2023-10-10 Thread Wesley Dillingham
In case it's not obvious I forgot a space: "rados list-inconsistent-obj 15.f4f" Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Tue, Oct 10, 2023 at 4:55 PM Wesley Dillingham wrote: > You likely have a fail
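With the space restored, the checks described above look like this (the OSD ID to stop comes from the inconsistent-obj output):
  rados list-inconsistent-obj 15.f4f --format=json-pretty
  ceph osd ok-to-stop <osd_id>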

[ceph-users] Re: owner locked out of bucket via bucket policy

2023-11-08 Thread Wesley Dillingham
>> Thanks, >> Jayanth >> -- >> *From:* Jayanth Reddy >> *Sent:* Tuesday, November 7, 2023 11:59:38 PM >> *To:* Casey Bodley >> *Cc:* Wesley Dillingham ; ceph-users < >> ceph-users@ceph.io>; Adam Emerson >

[ceph-users] Re: Migration Nautilus to Pacifi : Very high latencies (EC profile)

2022-05-17 Thread Wesley Dillingham
What was the largest cluster that you upgraded that didn't exhibit the new issue in 16.2.8 ? Thanks. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Tue, May 17, 2022 at 10:24 AM David Orman wrote: > We had an issue with our

[ceph-users] Re: Drained OSDs are still ACTIVE_PRIMARY - casuing high IO latency on clients

2022-05-20 Thread Wesley Dillingham
This sounds similar to an inquiry I submitted a couple years ago [1] whereby I discovered that the choose_acting function does not consider primary affinity when choosing the primary osd. I had made the assumption it would when developing my procedure for replacing failing disks. After that

[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-26 Thread Wesley Dillingham
What does "ceph osd pool ls detail" say? Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Thu, May 26, 2022 at 11:24 AM Sarunas Burdulis < saru...@math.dartmouth.edu> wrote: > Running > > `ceph osd ok-to-stop 0` > > shows: > >

[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-26 Thread Wesley Dillingham
t 2:22 PM Sarunas Burdulis wrote: > On 5/26/22 14:09, Wesley Dillingham wrote: > > What does "ceph osd pool ls detail" say? > > $ ceph osd pool ls detail > pool 0 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash > rjenkins pg_num 64 pgp_num 64 auto

[ceph-users] Re: Slow delete speed through the s3 API

2022-06-02 Thread Wesley Dillingham
Is it just your deletes which are slow, or writes and reads as well? On Thu, Jun 2, 2022, 4:09 PM J-P Methot wrote: > I'm following up on this as we upgraded to Pacific 16.2.9 and deletes > are still incredibly slow. The pool rgw is using is a fairly small > erasure coding pool set at 8 + 3. Is

[ceph-users] Re: Trouble getting cephadm to deploy iSCSI gateway

2022-05-17 Thread Wesley Dillingham
Well, I don't use either the dashboard or the cephadm/containerized deployment but do use ceph-iscsi. The fact that your two gateways are not "up" might indicate that they haven't been added to the target IQN yet. Once you can get into gwcli and create an IQN and associate your gateways with it, I

[ceph-users] Re: Migration Nautilus to Pacifi : Very high latencies (EC profile)

2022-05-16 Thread Wesley Dillingham
We have a newly-built pacific (16.2.7) cluster running 8+3 EC jerasure ~250 OSDS across 21 hosts which has significantly lower than expected IOPS. Only doing about 30 IOPS per spinning disk (with appropriately sized SSD bluestore db) around ~100 PGs per OSD. Have around 100 CephFS (ceph fuse

[ceph-users] Re: Migration Nautilus to Pacifi : Very high latencies (EC profile)

2022-05-16 Thread Wesley Dillingham
;mds_max_purge_ops_per_pg": "0.10", with some success but still experimenting with how we can reduce the throughput impact from osd slow ops. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Mon, May 16, 2022 a

[ceph-users] Re: Changes to Crush Weight Causing Degraded PGs instead of Remapped

2022-06-13 Thread Wesley Dillingham
ght osd.1 0.0 ? > > Istvan Szabo > Senior Infrastructure Engineer > --- > Agoda Services Co., Ltd. > e: istvan.sz...@agoda.com > ------- > > On 2022. Jun 14., at 0

[ceph-users] Re: Changes to Crush Weight Causing Degraded PGs instead of Remapped

2022-06-15 Thread Wesley Dillingham
cache={type=binned_lru} L P" \ reshard Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Tue, Jun 14, 2022 at 11:31 AM Wesley Dillingham wrote: > I have made https://tracker.ceph.com/issues/56046 regarding the iss

[ceph-users] Re: Changes to Crush Weight Causing Degraded PGs instead of Remapped

2022-06-14 Thread Wesley Dillingham
quick search. > > [1] > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/H4L5VNQJKIDXXNY2TINEGUGOYLUTT5UL/ > > Zitat von Wesley Dillingham : > > > Thanks for the reply. I believe regarding "0" vs "0.0" its the same > > difference. I will no

[ceph-users] Changes to Crush Weight Causing Degraded PGs instead of Remapped

2022-06-13 Thread Wesley Dillingham
I have a brand new Cluster 16.2.9 running bluestore with 0 client activity. I am modifying some crush weights to move PGs off of a host for testing purposes but the result is that the PGs go into a degraded+remapped state instead of simply a remapped state. This is a strange result to me as in

[ceph-users] Re: rh8 krbd mapping causes no match of type 1 in addrvec problem decoding monmap, -2

2022-07-19 Thread Wesley Dillingham
2 at 9:12 PM Wesley Dillingham > wrote: > > > > > > from ceph.conf: > > > > mon_host = 10.26.42.172,10.26.42.173,10.26.42.174 > > > > map command: > > rbd --id profilerbd device map win-rbd-test/originalrbdfromsnap > > > > [root@a2tlom

[ceph-users] Re: rh8 krbd mapping causes no match of type 1 in addrvec problem decoding monmap, -2

2022-07-19 Thread Wesley Dillingham
com/in/wesleydillingham> On Tue, Jul 19, 2022 at 12:51 PM Ilya Dryomov wrote: > On Tue, Jul 19, 2022 at 5:01 PM Wesley Dillingham > wrote: > > > > I have a strange error when trying to map via krdb on a RH (alma8) > release > > / kernel 4.18.0-372.13.1.el8_6.x86_64

[ceph-users] Using cloudbase windows RBD / wnbd with pre-pacific clusters

2022-07-20 Thread Wesley Dillingham
I understand that the client side code available from cloudbase started being distributed with pacific and now quincy client code but is there any particular reason it shouldn't work in conjunction with a nautilus, for instance, cluster. We have seen some errors when trying to do IO with mapped

[ceph-users] Re: PGs stuck deep-scrubbing for weeks - 16.2.9

2022-07-18 Thread Wesley Dillingham
; fix, see if it fits what you've encountered: >> >> https://github.com/ceph/ceph/pull/46727 (backport to Pacific here: >> https://github.com/ceph/ceph/pull/46877 ) >> https://tracker.ceph.com/issues/54172 >> >> On Fri, Jul 15, 2022 at 8:52 AM Wesley Dillingham

[ceph-users] Re: Quincy full osd(s)

2022-07-24 Thread Wesley Dillingham
Can you send along the return of "ceph osd pool ls detail" and "ceph health detail" On Sun, Jul 24, 2022, 1:00 AM Nigel Williams wrote: > With current 17.2.1 (cephadm) I am seeing an unusual HEALTH_ERR > Adding files to a new empty cluster, replica 3 (crush is by host), OSDs > became 95% full

[ceph-users] Re: Map RBD to multiple nodes (line NFS)

2022-07-25 Thread Wesley Dillingham
You probably want CephFS instead of RBD. Overview here: https://docs.ceph.com/en/quincy/cephfs/ Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Mon, Jul 25, 2022 at 11:00 AM Thomas Schneider <74cmo...@gmail.com> wrote: > Hi, > > I

[ceph-users] PGs stuck deep-scrubbing for weeks - 16.2.9

2022-07-15 Thread Wesley Dillingham
We have two clusters one 14.2.22 -> 16.2.7 -> 16.2.9 Another 16.2.7 -> 16.2.9 Both with a multi disk (spinner block / ssd block.db) and both CephFS around 600 OSDs each with combo of rep-3 and 8+3 EC data pools. Examples of stuck scrubbing PGs from all of the pools. They have generally been

[ceph-users] rh8 krbd mapping causes no match of type 1 in addrvec problem decoding monmap, -2

2022-07-19 Thread Wesley Dillingham
I have a strange error when trying to map via krbd on a RH (alma8) release / kernel 4.18.0-372.13.1.el8_6.x86_64 using ceph client version 14.2.22 (cluster is 14.2.16). The rbd map causes the following error in dmesg: [Tue Jul 19 07:45:00 2022] libceph: no match of type 1 in addrvec [Tue Jul 19

[ceph-users] Re: rh8 krbd mapping causes no match of type 1 in addrvec problem decoding monmap, -2

2022-07-19 Thread Wesley Dillingham
:00 AM Wesley Dillingham wrote: > I have a strange error when trying to map via krdb on a RH (alma8) release > / kernel 4.18.0-372.13.1.el8_6.x86_64 using ceph client version 14.2.22 > (cluster is 14.2.16) > > the rbd map causes the following error in dmesg: > > [Tue Ju

[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?

2022-04-18 Thread Wesley Dillingham
If you mark an osd "out" but not down (you don't stop the daemon), do the PGs go remapped or do they go degraded then as well? Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Thu, Apr 14, 2022 at 5:15 AM Kai Stian Olstad wrote: >

[ceph-users] Aggressive Bluestore Compression Mode for client data only?

2022-04-18 Thread Wesley Dillingham
I would like to use bluestore compression (probably zstd level 3) to compress my clients' data unless the incompressible hint is set (aggressive mode), but I do not want to expose myself to the bug experienced in this Cern talk (Ceph bug of the year) https://www.youtube.com/watch?v=_4HUR00oCGo where

[ceph-users] Re: Erasure-coded PG stuck in the failed_repair state

2022-05-10 Thread Wesley Dillingham
In my experience: "No scrub information available for pg 11.2b5 error 2: (2) No such file or directory" is the output you get from the command when the up or acting osd set has changed since the last deep-scrub. Have you tried to run a deep scrub (ceph pg deep-scrub 11.2b5) on the pg and then

[ceph-users] Re: Full cluster, new OSDS not being used

2022-08-23 Thread Wesley Dillingham
ery lightly used > at this point, only a few PGs have been assigned to them, though more than > zero and the number does appear to be slowly (very slowly) growing so > recovery is happening but very very slowly. > > > > > -- > *From:* Wesley

[ceph-users] Re: Full cluster, new OSDS not being used

2022-08-23 Thread Wesley Dillingham
Can you please send the output of "ceph osd tree" Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Tue, Aug 23, 2022 at 10:53 AM Wyll Ingersoll < wyllys.ingers...@keepertech.com> wrote: > > We have a large cluster with a many osds

[ceph-users] Re: Full cluster, new OSDS not being used

2022-08-23 Thread Wesley Dillingham
have increased backfill settings, but can you elaborate on > "injecting upmaps" ? > ---------- > *From:* Wesley Dillingham > *Sent:* Tuesday, August 23, 2022 1:44 PM > *To:* Wyll Ingersoll > *Cc:* ceph-users@ceph.io > *Subject:* Re: [cep

[ceph-users] How to determine if a filesystem is allow_standby_replay = true

2022-10-20 Thread Wesley Dillingham
I am building some automation for version upgrades of MDS and part of the process I would like to determine if a filesystem has allow_standby_replay set to true and if so then disable it. Granted, I could just issue: "ceph fs set MyFS allow_standby_replay false" and be done with it, but it's got me

[ceph-users] Re: How to determine if a filesystem is allow_standby_replay = true

2022-10-20 Thread Wesley Dillingham
BRARY_PATH *** > 2022-10-21T00:10:43.938+0530 7fe6b3e7a640 -1 WARNING: all dangerous and > experimental features are enabled. > 2022-10-21T00:10:43.945+0530 7fe6b3e7a640 -1 WARNING: all dangerous and > experimental features are enabled. > dumped fsmap epoch 15 > > Hope it hel

[ceph-users] subdirectory pinning and reducing ranks / max_mds

2022-10-21 Thread Wesley Dillingham
In a situation where you have say 3 active MDS (and 3 standbys). You have 3 ranks, 0,1,2 In your filesystem you have three directories at the root level [/a, /b, /c] you pin: /a to rank 0 /b to rank 1 /c to rank 2 and you need to upgrade your Ceph Version. When it becomes time to reduce max_mds
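For reference, the pinning described above is done with extended attributes on the directories; a sketch assuming the filesystem is mounted at /mnt/cephfs:
  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/a
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/b
  setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/c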

[ceph-users] Re: Power outage recovery

2022-09-15 Thread Wesley Dillingham
What does "ceph status" "ceph health detail" etc show, currently? Based on what you have said here my thought is you have created a new monitor quorum and as such all auth details from the old cluster are lost including any and all mgr cephx auth keys, so what does the log for the mgr say? How

[ceph-users] Re: Fstab entry for mounting specific ceph fs?

2022-09-23 Thread Wesley Dillingham
Try adding mds_namespace option like so: 192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs/ name=james_user,secretfile=/etc/ceph/secret.key,mds_namespace=myfs On Fri, Sep 23, 2022 at 6:41 PM Sagittarius-A Black Hole < nigrat...@gmail.com> wrote: > Hi, > > The below fstab entry works,

[ceph-users] Re: Power outage recovery

2022-09-15 Thread Wesley Dillingham
Having the quorum / monitors back up may change the MDS and RGW's ability to start and stay running. Have you tried just restarting the MDS / RGW daemons again? Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Thu, Sep 15, 2022 at

[ceph-users] Re: Increasing number of unscrubbed PGs

2022-09-13 Thread Wesley Dillingham
what does "ceph pg ls scrubbing" show? Do you have PGs that have been stuck in a scrubbing state for a long period of time (many hours,days,weeks etc). This will show in the "SINCE" column. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-13 Thread Wesley Dillingham
I haven't read through this entire thread so forgive me if already mentioned: What is the parameter "bluefs_buffered_io" set to on your OSDs? We once saw a terrible slowdown on our OSDs during snaptrim events and setting bluefs_buffered_io to true alleviated that issue. That was on a nautilus

[ceph-users] Re: Can't delete or unprotect snapshot with rbd

2022-10-06 Thread Wesley Dillingham
You are demo'ing two RBDs here: images/f3f4c73f-2eec-4af1-9bdf-4974a747607b seems to have 1 snapshot yet later when you try to interact with the snapshot you are doing so with a different rbd/image altogether: images/1fcfaa6b-eba0-4c75-b77d-d5b3ab4538a9 Respectfully, *Wes Dillingham*

[ceph-users] Re: Can't delete or unprotect snapshot with rbd

2022-10-06 Thread Wesley Dillingham
ce or resource busy > # rbd children images/f3f4c73f-2eec-4af1-9bdf-4974a747607b@snap > rbd: listing children failed: (2) No such file or directory > > /Niklas > > > From: Wesley Dillingham > Sent: Thursday, October 6, 2022 20:11 > To: Nikl

[ceph-users] Re: Odd 10-minute delay before recovery IO begins

2022-12-05 Thread Wesley Dillingham
I think you are experiencing the mon_osd_down_out_interval https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/#confval-mon_osd_down_out_interval Ceph waits 10 minutes before marking a down osd as out for the reasons you mention, but this would have been the case in nautilus
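To check or adjust that interval (the default is 600 seconds, i.e. 10 minutes):
  ceph config get mon mon_osd_down_out_interval
  ceph config set mon mon_osd_down_out_interval 600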

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-01-27 Thread Wesley Dillingham
I hit this issue once on a nautilus cluster and changed the OSD parameter bluefs_buffered_io = true (was set at false). I believe the default of this parameter was switched from false to true in release 14.2.20; however, perhaps you could still check what your osds are configured with in regard to
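A sketch of how to check the current value across OSDs and, if desired, change it:
  ceph tell osd.* config get bluefs_buffered_io
  ceph config set osd bluefs_buffered_io true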

[ceph-users] Re: deep scrub and long backfilling

2023-03-05 Thread Wesley Dillingham
In general it is safe and during long running remapping and backfill situations I enable it. You can enable it with: "ceph config set osd osd_scrub_during_recovery true" If you have any problems you think are caused by the change, undo it: Stop scrubs asap: "ceph osd set nodeep-scrub" "ceph

[ceph-users] Re: v16.2.12 Pacific (hot-fix) released

2023-04-24 Thread Wesley Dillingham
A few questions: - Will the 16.2.12 packages be "corrected" and reuploaded to the ceph.com mirror? Or will 16.2.13 become what 16.2.12 was supposed to be? - Was the osd activation regression introduced in 16.2.11 (or does 16.2.10 have it as well)? - Were the hotfixes in 16.2.12 just related to

[ceph-users] Re: For suggestions and best practices on expanding Ceph cluster and removing old nodes

2023-04-25 Thread Wesley Dillingham
Get on nautilus first (and perhaps even go to pacific) before expansion. Primarily for the reason that starting in nautilus degraded data recovery will be prioritized over remapped data recovery. As you phase out old hardware and phase in new hardware you will have a very large amount of backfill

[ceph-users] Re: Ceph recovery

2023-05-01 Thread Wesley Dillingham
Assuming size=3 and min_size=2 It will run degraded (read/write capable) until a third host becomes available at which point it will backfill the third copy on the third host. It will be unable to create the third copy of data if no third host exists. If an additional host is lost the data will

[ceph-users] Re: librbd hangs during large backfill

2023-07-18 Thread Wesley Dillingham
Did your automation / process allow for stalls in between changes to allow peering to complete? My hunch is you caused a very large peering storm (during peering a PG is inactive) which in turn caused your VMs to panic. If the RBDs are unmapped and re-mapped does it still continue to struggle?

[ceph-users] Re: mon log file grows huge

2023-07-10 Thread Wesley Dillingham
At what level do you have logging set to for your mons? That is a high volume of logs for the mon to generate. You can ask all the mons to print their debug logging level with: "ceph tell mon.* config get debug_mon" The default is 1/5 What is the overall status of your cluster? Is it healthy?

[ceph-users] Re: ceph Pacific - MDS activity freezes when one the MDSs is restarted

2023-05-24 Thread Wesley Dillingham
There was a memory issue with standby-replay that may have been resolved since and fix is in 16.2.10 (not sure), the suggestion at the time was to avoid standby-replay. Perhaps a dev can chime in on that status. Your MDSs look pretty inactive. I would consider scaling them down (potentially to

[ceph-users] Re: The pg_num from 1024 reduce to 32 spend much time, is there way to shorten the time?

2023-06-06 Thread Wesley Dillingham
Can you send along the responses from "ceph df detail" and "ceph osd pool ls detail" Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Tue, Jun 6, 2023 at 1:03 PM Eugen Block wrote: > I suspect the target_max_misplaced_ratio

[ceph-users] Re: PGs stuck undersized and not scrubbed

2023-06-05 Thread Wesley Dillingham
When PGs are degraded they won't scrub; further, if an OSD is involved with recovery of another PG it won't accept scrubs either, so that is the likely explanation of your not-scrubbed-in-time issue. It's of low concern. Are you sure that recovery is not progressing? I see: "7349/147534197 objects

[ceph-users] Re: `ceph features` on Nautilus still reports "luminous"

2023-05-25 Thread Wesley Dillingham
Fairly confident this is normal. I just checked a pacific cluster and they all report luminous as well. I think some of the backstory of this is that luminous is the release where upmaps were introduced and there hasn't been a reason to increment the features release of subsequent daemons. To be honest

[ceph-users] Re: ceph.conf and two different ceph clusters

2023-06-26 Thread Wesley Dillingham
You need to use the --id and --cluster options of the rbd command and maintain a .conf file for each cluster. /etc/ceph/clusterA.conf /etc/ceph/clusterB.conf /etc/ceph/clusterA.client.userA.keyring /etc/ceph/clusterB.client.userB.keyring now use the rbd commands as such: rbd --id userA
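Concretely, with hypothetical pool names, the tools then resolve the matching conf file and keyring automatically:
  rbd --cluster clusterA --id userA ls poolX
  rbd --cluster clusterB --id userB ls poolY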

[ceph-users] Re: Upgrade Ceph cluster + radosgw from 14.2.18 to latest 15

2023-05-09 Thread Wesley Dillingham
Curious, why not go to Pacific? You can upgrade up to 2 major releases in a go. The upgrade process to pacific is here: https://docs.ceph.com/en/latest/releases/pacific/#upgrading-non-cephadm-clusters The upgrade to Octopus is here:

[ceph-users] Re: Upgrade Ceph cluster + radosgw from 14.2.18 to latest 15

2023-05-15 Thread Wesley Dillingham
I have upgraded dozens of clusters 14 -> 16 using the methods described in the docs, and when followed precisely no issues have arisen. I would suggest moving to a release that is receiving backports still (pacific or quincy). The important aspects are only doing one system at a time. In the case

[ceph-users] Re: Logging control

2023-12-19 Thread Wesley Dillingham
"ceph daemon" commands need to be run local to the machine where the daemon is running. So in this case if you arent on the node where osd.1 lives it wouldnt work. "ceph tell" should work anywhere there is a client.admin key. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn

[ceph-users] Re: Is there a way to find out which client uses which version of ceph?

2023-12-21 Thread Wesley Dillingham
You can ask the monitor to dump its sessions (which should expose the IPs and the release/features); you can then track down by IP those with the undesirable features/release: ceph daemon mon.`hostname -s` sessions. Assuming your mon is named after the short hostname, you may need to do this for
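A sketch of the check; the release name appears in each session entry's feature field:
  ceph daemon mon.$(hostname -s) sessions    # inspect each entry's IP and features/release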

[ceph-users] Re: Best Practice for OSD Balancing

2023-11-28 Thread Wesley Dillingham
It's a complicated topic and there is no one answer, it varies for each cluster and depends. You have a good lay of the land. I just wanted to mention that the correct "foundation" for equally utilized OSDs within a cluster relies on two important factors: - Symmetry of disk/osd quantity and

[ceph-users] Re: OSDs failing to start due to crc32 and osdmap error

2023-11-27 Thread Wesley Dillingham
> > "bluestore_rocksdb_options": > "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_t

[ceph-users] Re: OSDs failing to start due to crc32 and osdmap error

2023-11-27 Thread Wesley Dillingham
Curious if you are using bluestore compression? Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Mon, Nov 27, 2023 at 10:09 AM Denis Polom wrote: > Hi > > we have issue to start some OSDs on one node on our Ceph Quincy 17.2.7 >

[ceph-users] Re: OSDs failing to start due to crc32 and osdmap error

2023-11-27 Thread Wesley Dillingham
2,max_total_wal_size=1073741824", > thx > > On 11/27/23 19:17, Wesley Dillingham wrote: > > Curious if you are using bluestore compression? > > Respectfully, > > *Wes Dillingham* > w...@wesdillingham.com > LinkedIn <http://www.linkedin.com/in/wesleydillingham&g

[ceph-users] Re: About number of osd node can be failed with erasure code 3+2

2023-11-27 Thread Wesley Dillingham
With a k+m which is 3+2 each RADOS object is broken into 5 shards. By default the pool will have a min_size of k+1 (4 in this case). Which means you can lose 1 shard and still be >= min_size. If one host goes down and you use a host-based failure domain (default) you will lose 1 shard out of all
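To confirm the numbers for a given pool (profile and pool names are placeholders):
  ceph osd erasure-code-profile get <profile_name>    # shows k, m, crush-failure-domain
  ceph osd pool get <ec_pool> min_size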

[ceph-users] Re: OSDs failing to start due to crc32 and osdmap error

2023-11-27 Thread Wesley Dillingham
> it's: > > "bluestore_compression_algorithm": "snappy" > > "bluestore_compression_mode": "none" > > > On 11/27/23 20:13, Wesley Dillingham wrote: > > How about these two options: > > bluestore_compression_algorithm > bluestore_compression_mode > &g

[ceph-users] Re: 17.2.7: Backfilling deadlock / stall / stuck / standstill

2024-01-26 Thread Wesley Dillingham
I faced a similar issue. The PG just would never finish recovery. Changing all OSDs in the PG to "osd_op_queue wpq" and then restarting them serially ultimately allowed the PG to recover. Seemed to be some issue with mclock. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn

[ceph-users] Re: 6 pgs not deep-scrubbed in time

2024-01-29 Thread Wesley Dillingham
Respond back with "ceph versions" output. If your sole goal is to eliminate the "not scrubbed in time" errors you can increase the aggressiveness of scrubbing by setting osd_max_scrubs = 2 (the default in Pacific is 1). If you are going to start tinkering manually with the pg_num you will want to
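For example:
  ceph versions
  ceph config set osd osd_max_scrubs 2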
