[ceph-users] Re: [EXTERN] Re: Urgent help with degraded filesystem needed

2024-06-26 Thread Dietmar Rieder
Can anybody comment on my questions below? Thanks so much in advance.

On 26 June 2024 08:08:39 CEST, Dietmar Rieder wrote:
>...sending this also to the list and to Xiubo (they were accidentally removed from
>the recipients)...
>
>On 6/25/24 21:28, Dietmar Rieder wrote:
>> Hi Patrick,  Xiubo and List,
>> 
>> finally we managed to get the filesystem repaired and running again! YEAH, 
>> I'm so happy!!
>> 
>> Big thanks for your support, Patrick and Xiubo! (Would love to invite you for a
>> beer!)
>> 
>> 
>> Please see some comments and (important?) questions below:
>> 
>> On 6/25/24 03:14, Patrick Donnelly wrote:
>>> On Mon, Jun 24, 2024 at 5:22 PM Dietmar Rieder
>>>  wrote:
 
(resending this; it seems the original message didn't make it through among all the
SPAM recently sent to the list, my apologies if it shows up twice at some point)
 
 Hi List,
 
we are still struggling to get our cephfs back online again; this is an
update to inform you what we did so far, and we kindly ask for any input
on this to get an idea of how to proceed:
 
After resetting the journals, Xiubo suggested (in a PM) going on with the
disaster recovery procedure:
 
 cephfs-data-scan init skipped creating the inodes 0x0x1 and 0x0x100
 
 [root@ceph01-b ~]# cephfs-data-scan init
 Inode 0x0x1 already exists, skipping create.  Use --force-init to 
 overwrite the existing object.
 Inode 0x0x100 already exists, skipping create.  Use --force-init to 
 overwrite the existing object.
 
 We did not use --force-init and proceeded with scan_extents using a single 
 worker, which was indeed very slow.
 
After ~24h we interrupted the scan_extents and restarted it with 32 workers,
which went through in about 2h15min w/o any issue.
 
Then I started scan_inodes with 32 workers; this also finished after
~50min with no output on stderr or stdout.
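
For reference, the parallel run can be sketched roughly as below; this is only a
sketch, assuming the documented --worker_n/--worker_m splitting, and the pool
name cephfs_data is a placeholder to be replaced with the real data pool name(s):

# run 32 scan_extents workers in parallel, each covering 1/32 of the objects
for i in $(seq 0 31); do
  cephfs-data-scan scan_extents --worker_n "$i" --worker_m 32 cephfs_data &
done
wait
# same splitting for scan_inodes
for i in $(seq 0 31); do
  cephfs-data-scan scan_inodes --worker_n "$i" --worker_m 32 cephfs_data &
done
wait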
 
 I went on with scan_links, which after ~45 minutes threw the following 
 error:
 
 # cephfs-data-scan scan_links
 Error ((2) No such file or directory)
>>> 
>>> Not sure what this indicates necessarily. You can try to get more
>>> debug information using:
>>> 
>>> [client]
>>>    debug mds = 20
>>>    debug ms = 1
>>>    debug client = 20
>>> 
>>> in the local ceph.conf for the node running cephfs-data-scan.
>> 
>> I did that and restarted "cephfs-data-scan scan_links".
>> 
>> It didn't produce any additional debug output; however, this time it just
>> went through without error (~50 min).
>> 
>> We then reran "cephfs-data-scan cleanup" and it also finished without error 
>> after about 10h.
>> 
>> We then set the fs as repaired and all seems to work fine again:
>> 
>> [root@ceph01-b ~]# ceph mds repaired 0
>> repaired: restoring rank 1:0
>> 
>> [root@ceph01-b ~]# ceph -s
>>    cluster:
>>      id: aae23c5c-a98b-11ee-b44d-00620b05cac4
>>      health: HEALTH_OK
>> 
>>    services:
>>      mon: 3 daemons, quorum cephmon-01,cephmon-03,cephmon-02 (age 6d)
>>      mgr: cephmon-01.dsxcho(active, since 6d), standbys: cephmon-02.nssigg, 
>> cephmon-03.rgefle
>>      mds: 1/1 daemons up, 5 standby
>>      osd: 336 osds: 336 up (since 2M), 336 in (since 4M)
>> 
>>    data:
>>      volumes: 1/1 healthy
>>      pools:   4 pools, 6401 pgs
>>      objects: 284.68M objects, 623 TiB
>>      usage:   890 TiB used, 3.1 PiB / 3.9 PiB avail
>>      pgs: 6206 active+clean
>>   140  active+clean+scrubbing
>>   55   active+clean+scrubbing+deep
>> 
>>    io:
>>      client:   3.9 MiB/s rd, 84 B/s wr, 482 op/s rd, 1.11k op/s wr
>> 
>> 
>> [root@ceph01-b ~]# ceph fs status
>> cephfs - 0 clients
>> ======================================================================
>> RANK  STATE            MDS               ACTIVITY      DNS    INOS   DIRS   CAPS
>>  0    active  default.cephmon-03.xcujhz  Reqs:  0 /s   124k   60.3k  1993      0
>>          POOL            TYPE     USED  AVAIL
>> ssd-rep-metadata-pool  metadata    298G  63.4T
>>     sdd-rep-data-pool      data   10.2T  84.5T
>>      hdd-ec-data-pool      data    808T  1929T
>>     STANDBY MDS
>> default.cephmon-01.cepqjp
>> default.cephmon-01.pvnqad
>> default.cephmon-02.duujba
>> default.cephmon-02.nyfook
>> default.cephmon-03.chjusj
>> MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) 
>> reef (stable)
>> 
>> 
>> The mds log, however, shows some "bad backtrace on directory inode" messages:
>> 
>> 2024-06-25T18:45:36.575+ 7f8594659700  1 mds.default.cephmon-03.xcujhz 
>> Updating MDS map to version 8082 from mon.1
>> 2024-06-25T18:45:36.575+ 7f8594659700  1 mds.0.8082 handle_mds_map i am 
>> now mds.0.8082
>> 2024-06-25T18:45:36.575+ 7f8594659700  1 mds.0.8082 handle_mds_map state 
>> change up:standby --> up:replay
>> 2024-06-25T18:45:36.575+ 7f8594659700  1 mds.0.8082 replay_start
>> 2024-06-25T18:45:36.575+ 7f8594659700  1 mds.0.8082  waiting for osdmap 
>> 34331 (which blocklists prior instance)
>> 
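
A possible follow-up for the "bad backtrace on directory inode" messages, sketched
under the assumption that the filesystem is named "cephfs" and that an online
forward scrub is acceptable after this kind of recovery, would be a recursive
scrub with repair:

ceph tell mds.cephfs:0 scrub start / recursive,repair
ceph tell mds.cephfs:0 scrub status     # check progress and any reported errors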

[ceph-users] pg deep-scrub control scheme

2024-06-26 Thread David Yang
Hello everyone.

I have a cluster with 8321 pgs and recently I started to get "pg not
deep-scrubbed in time" warnings.
The reason is that I reduced osd_max_scrubs to avoid the impact of scrubbing on client IO.

Here is my current scrub configuration:

~]# ceph tell osd.1 config show|grep scrub
"mds_max_scrub_ops_in_progress": "5",
"mon_scrub_inject_crc_mismatch": "0.00",
"mon_scrub_inject_missing_keys": "0.00",
"mon_scrub_interval": "86400",
"mon_scrub_max_keys": "100",
"mon_scrub_timeout": "300",
"mon_warn_pg_not_deep_scrubbed_ratio": "0.80",
"mon_warn_pg_not_scrubbed_ratio": "0.50",
"osd_debug_deep_scrub_sleep": "0.00",
"osd_deep_scrub_interval": "1296000.00",
"osd_deep_scrub_keys": "1024",
"osd_deep_scrub_large_omap_object_key_threshold": "20",
"osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824",
"osd_deep_scrub_randomize_ratio": "0.08",
"osd_deep_scrub_stride": "131072",
"osd_deep_scrub_update_digest_min_age": "7200",
"osd_max_scrubs": "1",
"osd_requested_scrub_priority": "120",
"osd_scrub_auto_repair": "false",
"osd_scrub_auto_repair_num_errors": "5",
"osd_scrub_backoff_ratio": "0.66",
"osd_scrub_begin_hour": "0",
"osd_scrub_begin_week_day": "0",
"osd_scrub_chunk_max": "25",
"osd_scrub_chunk_min": "5",
"osd_scrub_cost": "52428800",
"osd_scrub_during_recovery": "false",
"osd_scrub_end_hour": "0",
"osd_scrub_end_week_day": "0",
"osd_scrub_extended_sleep": "0.00",
"osd_scrub_interval_randomize_ratio": "0.50",
"osd_scrub_invalid_stats": "true",
"osd_scrub_load_threshold": "0.50",
"osd_scrub_max_interval": "1296000.00",
"osd_scrub_max_preemptions": "5",
"osd_scrub_min_interval": "259200.00",
"osd_scrub_priority": "5",
"osd_scrub_sleep": "0.00",

I am currently trying to adjust the scrub intervals.

Is there a calculation formula that can be used to easily configure
the scrub/deep-scrub strategy?

At present I can only adjust individual values, then wait a long time,
and there may be no progress in the end.
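
As far as I know there is no single built-in formula, but a rough budget check
along these lines can show whether the configured intervals are achievable. The
numbers are taken from the settings above; the minutes-per-deep-scrub figure is
an assumption to be replaced with a value measured from your own OSD logs:

pgs=8321
interval_days=15                        # osd_deep_scrub_interval = 1296000 s
echo "deep scrubs needed per day : $(( pgs / interval_days ))"        # ~554
echo "deep scrubs needed per hour: $(( pgs / interval_days / 24 ))"   # ~23
# Each deep scrub occupies one scrub slot on every OSD in the PG's acting set,
# and osd_max_scrubs=1 allows one slot per OSD, so compare the needed rate
# against roughly (number of OSDs / replica count) * (60 / minutes per deep scrub).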


[ceph-users] pg's stuck activating on osd create

2024-06-26 Thread Richard Bade
Hi Everyone,
I had an issue last night when I was bringing online some osds that I
was rebuilding. When the osds were created and came online, 15 pgs got stuck
in activating. The first osd (osd.112) seemed to come online ok, but
the second one (osd.113) triggered the issue. All the pgs stuck in
activating included osd.112 in the pg map, and I resolved it by doing
pg-upmap-items to map the pgs back from osd.112 to where they currently
were, but it was painful having 10 min of stuck i/o on an rbd pool with
VMs running.

Some details about the cluster:
Pacific 16.2.15, upgraded from Nautilus fairly recently and Luminous
back in the past. All osds were rebuilt on bluestore in Nautilus, as
were the mons.
The disks in question are Intel DC P4510 8TB nvme. I'm rebuilding them
as I had previously had 4x2TB osd's per disk and now wanted to
consolidate down to one osd per disk.
There are around 300 osds in the pool with 16384 pgs, which means that
the 2TB osds had 157 pgs on them. However, this means the 8TB osds
have 615 pgs on them, and I'm wondering if this is maybe the cause of
the problem.

There are no warnings about too many pgs per osd in the logs or ceph status.
I have the default value of 250 for mon_max_pg_per_osd and default
value of 3.0 for osd_max_pg_per_osd_hard_ratio.
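
A quick sanity check against the overdose limits, as a sketch using the defaults
quoted above:

# above roughly this per-OSD count an OSD is expected to refuse new PG mappings,
# which can leave those PGs in activating:
echo $(( 250 * 3 ))     # mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio = 750
# per-OSD PG counts, to see how close osd.112/osd.113 got during the rebuild:
ceph osd df tree | grep -E 'osd\.(112|113)$'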

My plan is to reduce the number of pgs in the pool, but I want to
understand and prove what happened here.
Is it likely I've hit pg overdose protection? If I have, how would I
tell, as I can't see anything in the cluster logs?

Thanks,
Rich


[ceph-users] Unable to move realm master between zonegroups -- radosgw-admin zonegroup ignoring the --rgw-zonegroup flag?

2024-06-26 Thread Tim Hunter
Hi folks,

We have a number of ceph clusters organized into one realm and three
zonegroups.  All the clusters are running Ceph 17, deployed with cephadm.

I am trying to move the metadata master from one zonegroup (us-east-1) to
zone mn1 in zonegroup us-central-1, by following the steps in the
documentation, but it doesn't work as advised.  I ran these commands on a
system in the `mn1` zone in the us-central-1 zonegroup.

sudo radosgw-admin zone modify --rgw-zone=mn1 --master
sudo radosgw-admin zonegroup modify --rgw-zonegroup=us-central-1 --master
sudo radosgw-admin period update --commit

When I carry out those steps, the period update gives me this error:

2024-06-26T11:19:12.256+ 7f1f5f610e40  0 Error updating periodmap,
multiple master zonegroups configured

2024-06-26T11:19:12.256+ 7f1f5f610e40  0 master zonegroup: 9ef7877b and
 a3abf9e1

Here we have problem (1): the command doesn't move the master from
a3abf9e1 (us-east-1) to 9ef7877b (us-central-1).  There is a fix advised by
a 2016 post to ceph-users: explicitly mark the old zonegroup master as no
longer the master.

sudo radosgw-admin zonegroup modify --rgw-zonegroup=us-east-1 --master=false

However, the json output of this command is a modified map of the
us-central-1 zonegroup, showing that the 'is_master' flag has been set back
to 'false' for that zonegroup.  Thus we have problem (2) that I am unable
to modify two different zonegroups from the same system, leaving me unable
to change the realm master.

`sudo radosgw-admin zonegroup get --rgw-zonegroup=us-east-1`, when run from
the us-central-1 zonegroup, just shows me the us-central-1 zonegroup
details.  Changing the `=` to a space does not have any effect.
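
One way to see what the staged period actually contains, sketched here assuming
jq is available on the host, is to dump it and check each zonegroup's master flag:

sudo radosgw-admin period get | jq '.period_map.zonegroups[] | {name, id, is_master}'
sudo radosgw-admin zonegroup list      # shows which zonegroup the local default points at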

My brief attempts at reading the code for rgw_admin.cc on github left me
unenlightened.  Can anyone offer any assistance here?

-- 
Tim Hunter
Senior Infrastructure Engineer

telnyx.com


[ceph-users] Re: ceph rgw zone create fails EINVAL

2024-06-26 Thread Adam King
Interesting. Given this is coming from a radosgw-admin call being done from
within the rgw mgr module, I wonder if a radosgw-admin log file is ending
up in the active mgr container when this happens.
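
A quick way to check would be something like the following, run on the host
carrying the active mgr; this is only a sketch, with the daemon name taken from
the log snippet quoted below:

ceph mgr stat                                  # confirm the active mgr
sudo cephadm enter --name mgr.moss-be2001.qvwcaq
ls -ltr /var/log/ceph/                         # look for recently written radosgw-admin log files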

On Wed, Jun 26, 2024 at 9:04 AM Daniel Gryniewicz  wrote:

> On 6/25/24 3:21 PM, Matthew Vernon wrote:
> > On 24/06/2024 21:18, Matthew Vernon wrote:
> >
> >> 2024-06-24T17:33:26.880065+00:00 moss-be2001 ceph-mgr[129346]: [rgw
> >> ERROR root] Non-zero return from ['radosgw-admin', '-k',
> >> '/var/lib/ceph/mgr/ceph-moss-be2001.qvwcaq/keyring', '-n',
> >> 'mgr.moss-be2001.qvwcaq', 'realm', 'pull', '--url',
> >> 'https://apus.svc.eqiad.wmnet:443', '--access-key', 'REDACTED',
> >> '--secret', 'REDACTED', '--rgw-realm', 'apus']: request failed: (5)
> >> Input/output error
> >>
> >> EIO is an odd sort of error [doesn't sound very network-y], and I
> >> don't think I see any corresponding request in the radosgw logs in the
> >> primary zone. From the CLI outside the container I can do e.g. curl
> >> https://apus.svc.eqiad.wmnet/ just fine, are there other things worth
> >> checking here? Could it matter that the mgr node isn't an rgw?
> >
> > ...the answer turned out to be "container image lacked the relevant CA
> > details to validate the TLS of the other end".
> >
>
> Also, for the record, radosgw-admin logs do not end up in the same log
> file as RGW's logs.  Each invocation of radosgw-admin makes its own log
> file for the run of that command.  (This is because radosgw-admin is
> really a stripped down version of RGW itself, and it does not
> communicate with the running RGWs, but connects to the Ceph cluster
> directly.)  They're generally small, and frequently empty, but should
> have error messages in them on failure.
>
> Daniel


[ceph-users] Re: ceph rgw zone create fails EINVAL

2024-06-26 Thread Daniel Gryniewicz

On 6/25/24 3:21 PM, Matthew Vernon wrote:

On 24/06/2024 21:18, Matthew Vernon wrote:

2024-06-24T17:33:26.880065+00:00 moss-be2001 ceph-mgr[129346]: [rgw 
ERROR root] Non-zero return from ['radosgw-admin', '-k', 
'/var/lib/ceph/mgr/ceph-moss-be2001.qvwcaq/keyring', '-n', 
'mgr.moss-be2001.qvwcaq', 'realm', 'pull', '--url', 
'https://apus.svc.eqiad.wmnet:443', '--access-key', 'REDACTED', 
'--secret', 'REDACTED', '--rgw-realm', 'apus']: request failed: (5) 
Input/output error


EIO is an odd sort of error [doesn't sound very network-y], and I 
don't think I see any corresponding request in the radosgw logs in the 
primary zone. From the CLI outside the container I can do e.g. curl 
https://apus.svc.eqiad.wmnet/ just fine, are there other things worth 
checking here? Could it matter that the mgr node isn't an rgw?


...the answer turned out to be "container image lacked the relevant CA 
details to validate the TLS of the other end".




Also, for the record, radosgw-admin logs do not end up in the same log 
file as RGW's logs.  Each invocation of radosgw-admin makes its own log
file for the run of that command.  (This is because radosgw-admin is 
really a stripped down version of RGW itself, and it does not 
communicate with the running RGWs, but connects to the Ceph cluster 
directly.)  They're generally small, and frequently empty, but should 
have error messages in them on failure.


Daniel


[ceph-users] Re: Slow down RGW updates via orchestrator

2024-06-26 Thread Boris
Ah nice.
Thanks a lot :)



On Wed, 26 Jun 2024 at 11:56, Robert Sander <
r.san...@heinlein-support.de> wrote:

> Hi,
>
> On 6/26/24 11:49, Boris wrote:
>
> > Is there a way to only update 1 daemon at a time?
>
> You can use the feature "staggered upgrade":
>
> https://docs.ceph.com/en/reef/cephadm/upgrade/#staggered-upgrade
>
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin


-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.


[ceph-users] Re: Slow down RGW updates via orchestrator

2024-06-26 Thread Robert Sander

Hi,

On 6/26/24 11:49, Boris wrote:


Is there a way to only update 1 daemon at a time?


You can use the feature "staggered upgrade":

https://docs.ceph.com/en/reef/cephadm/upgrade/#staggered-upgrade
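
For example, something along these lines (a sketch; adjust the image tag, and
note that cephadm expects mgr and mon daemons to already be on the target
version before other daemon types are upgraded in a staggered fashion):

ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.0 --daemon-types rgw --limit 1
ceph orch upgrade status
# once the first daemon looks healthy, repeat with a larger --limit or drop it entirely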

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin


[ceph-users] Slow down RGW updates via orchestrator

2024-06-26 Thread Boris
Hi,
we've just updated our test cluster via

ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.0


During the update of the RGW service, all of the daemons went down at the
same time. If I did that on our production system it would cause a
small but noticeable outage.

Is there a way to only update 1 daemon at a time?

Cheers
 Boris

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.


[ceph-users] Re: OSD service specs in mixed environment

2024-06-26 Thread Torkil Svensgaard



On 26/06/2024 08:48, Torkil Svensgaard wrote:

Hi

We have a bunch of HDD OSD hosts with DB/WAL on PCI NVMe, either 2 x 
3.2TB or 1 x 6.4TB. We used to have 4 SSDs per node for journals before 
bluestore and those have been repurposed for an SSD pool (wear level is 
fine).


We've been using the following service specs to avoid the PCI NVMe 
devices for bluestore being provisioned as OSDs:


---
service_type: osd
service_id: fast
service_name: osd.fast
placement:
   host_pattern: '*'
spec:
   data_devices:
     rotational: 0
     size: :1000G  <-- only use devices smaller than 1TB = not PCI NVMe
   filter_logic: AND
   objectstore: bluestore
---
service_type: osd
service_id: slow
service_name: osd.slow
placement:
   host_pattern: '*'
spec:
   block_db_size: 290966113186
   data_devices:
     rotational: 1
   db_devices:
     rotational: 0
     size: '1000G:' <-- only use devices larger than 1TB for DB/WAL
   filter_logic: AND
   objectstore: bluestore
---

We just bought a few 7.68 TB SATA SSDs to add to the SSD pool. They 
aren't being picked up by the osd.fast spec because they are too large, 
and they could also be picked up as DB/WAL with the current specs.


As far as I can determine there is no way to achieve what I want with 
the existing specs: I can't filter on PCI vs SATA (only rotational or 
not), I can't use size (it can only define a single range, not an 
excluded range), and I can't use filter_logic OR for the sizes because I 
need the rotational qualifier to be ANDed.


I can do an osd.fast2 spec with size: 7000G: and change the db_devices 
size for osd.slow to something like 1000G:7000G, but I'm curious to see 
if anyone has a different suggestion.


Regarding this last part, this is the new SSD as ceph orch device ls 
sees it:


ssd   ATA_SAMSUNG_MZ7L37T6HBLA-00A07_S6EPNN0X504375  7153G

But this in a spec doesn't match it:

size: '7000G:'

This does:

size: '6950G:'

I can't get that to make sense. The value from ceph orch device ls looks 
like GiB. The documentation[1] states that the spec file uses GB, and 
7000 GB should be less than 7153 GiB (and so should 7000 GiB, for that 
matter). Some sort of internal rounding off?
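
For what it's worth, the raw numbers, assuming the drive's nominal 7.68 TB
(decimal) capacity:

echo $(( 7680000000000 / 1000**3 ))   # 7680 GB
echo $(( 7680000000000 / 1024**3 ))   # ~7152 GiB, i.e. the 7153G shown by ceph orch device ls
# So the listed value looks like GiB; since 7153 GiB exceeds 7000 in either unit,
# the cutoff presumably comes from rounding or from the exact byte size reported
# by the inventory. `ceph orch device ls --format json` shows the raw size to
# compare against.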


Mvh.

Torkil

[1] https://docs.ceph.com/en/latest/cephadm/services/osd/



Mvh.

Torkil



--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk


[ceph-users] OSD service specs in mixed environment

2024-06-26 Thread Torkil Svensgaard

Hi

We have a bunch of HDD OSD hosts with DB/WAL on PCI NVMe, either 2 x 
3.2TB or 1 x 6.4TB. We used to have 4 SSDs per node for journals before 
bluestore and those have been repurposed for an SSD pool (wear level is 
fine).


We've been using the following service specs to avoid the PCI NVMe 
devices for bluestore being provisioned as OSDs:


---
service_type: osd
service_id: fast
service_name: osd.fast
placement:
  host_pattern: '*'
spec:
  data_devices:
rotational: 0
size: :1000G  <-- only use devices smaller than 1TB = not PCI NVMe
  filter_logic: AND
  objectstore: bluestore
---
service_type: osd
service_id: slow
service_name: osd.slow
placement:
  host_pattern: '*'
spec:
  block_db_size: 290966113186
  data_devices:
rotational: 1
  db_devices:
rotational: 0
size: '1000G:' <-- only use devices larger than 1TB for DB/WAL
  filter_logic: AND
  objectstore: bluestore
---

We just bought a few 7.68 TB SATA SSDs to add to the SSD pool. They 
aren't being picked up by the osd.fast spec because they are too large, 
and they could also be picked up as DB/WAL with the current specs.


As far as I can determine there is no way to achieve what I want with 
the existing specs: I can't filter on PCI vs SATA (only rotational or 
not), I can't use size (it can only define a single range, not an 
excluded range), and I can't use filter_logic OR for the sizes because I 
need the rotational qualifier to be ANDed.


I can do an osd.fast2 spec with size: 7000G: and change the db_devices 
size for osd.slow to something like 1000G:7000G, but I'm curious to see 
if anyone has a different suggestion.
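
For completeness, a sketch of that workaround, using the 6950G boundary that was
found to match in the follow-up above (the exact boundary is this cluster's
observation, not a general rule); a --dry-run first is advisable:

cat > osd-fast2.yaml <<'EOF'
service_type: osd
service_id: fast2
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
    size: '6950G:'        # only the new 7.68 TB SATA SSDs
  filter_logic: AND
  objectstore: bluestore
EOF
ceph orch apply -i osd-fast2.yaml --dry-run
# and narrow osd.slow's db_devices filter to something like size: '1000G:6950G'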


Mvh.

Torkil

--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk


[ceph-users] Re: [EXTERN] Re: Urgent help with degraded filesystem needed

2024-06-26 Thread Dietmar Rieder
...sending this also to the list and to Xiubo (they were accidentally removed from 
the recipients)...


On 6/25/24 21:28, Dietmar Rieder wrote:

Hi Patrick,  Xiubo and List,

finally we managed to get the filesystem repaired and running again! 
YEAH, I'm so happy!!


Big thanks for your support, Patrick and Xiubo! (Would love to invite you 
for a beer!)



Please see some comments and (important?) questions below:

On 6/25/24 03:14, Patrick Donnelly wrote:

On Mon, Jun 24, 2024 at 5:22 PM Dietmar Rieder
 wrote:


(resending this; it seems the original message didn't make it 
through among all the SPAM recently sent to the list, my apologies 
if it shows up twice at some point)


Hi List,

we are still struggling to get our cephfs back online again; this is 
an update to inform you what we did so far, and we kindly ask for any 
input on this to get an idea of how to proceed:


After resetting the journals, Xiubo suggested (in a PM) going on with 
the disaster recovery procedure:


cephfs-data-scan init skipped creating the inodes 0x0x1 and 0x0x100

[root@ceph01-b ~]# cephfs-data-scan init
Inode 0x0x1 already exists, skipping create.  Use --force-init to 
overwrite the existing object.
Inode 0x0x100 already exists, skipping create.  Use --force-init to 
overwrite the existing object.


We did not use --force-init and proceeded with scan_extents using a 
single worker, which was indeed very slow.


After ~24h we interrupted the scan_extents and restarted it with 32 
workers, which went through in about 2h15min w/o any issue.


Then I started scan_inodes with 32 workers; this also finished 
after ~50min with no output on stderr or stdout.


I went on with scan_links, which after ~45 minutes threw the 
following error:


# cephfs-data-scan scan_links
Error ((2) No such file or directory)


Not sure what this indicates necessarily. You can try to get more
debug information using:

[client]
   debug mds = 20
   debug ms = 1
   debug client = 20

in the local ceph.conf for the node running cephfs-data-scan.


I did that and restarted "cephfs-data-scan scan_links".

It didn't produce any additional debug output; however, this time it just 
went through without error (~50 min).


We then reran "cephfs-data-scan cleanup" and it also finished without 
error after about 10h.


We then set the fs as repaired and all seems to work fine again:

[root@ceph01-b ~]# ceph mds repaired 0
repaired: restoring rank 1:0

[root@ceph01-b ~]# ceph -s
   cluster:
     id: aae23c5c-a98b-11ee-b44d-00620b05cac4
     health: HEALTH_OK

   services:
     mon: 3 daemons, quorum cephmon-01,cephmon-03,cephmon-02 (age 6d)
     mgr: cephmon-01.dsxcho(active, since 6d), standbys: 
cephmon-02.nssigg, cephmon-03.rgefle

     mds: 1/1 daemons up, 5 standby
     osd: 336 osds: 336 up (since 2M), 336 in (since 4M)

   data:
     volumes: 1/1 healthy
     pools:   4 pools, 6401 pgs
     objects: 284.68M objects, 623 TiB
     usage:   890 TiB used, 3.1 PiB / 3.9 PiB avail
     pgs: 6206 active+clean
  140  active+clean+scrubbing
  55   active+clean+scrubbing+deep

   io:
     client:   3.9 MiB/s rd, 84 B/s wr, 482 op/s rd, 1.11k op/s wr


[root@ceph01-b ~]# ceph fs status
cephfs - 0 clients
======================================================================
RANK  STATE            MDS               ACTIVITY      DNS    INOS   DIRS   CAPS
 0    active  default.cephmon-03.xcujhz  Reqs:  0 /s   124k   60.3k  1993      0
         POOL            TYPE     USED  AVAIL
ssd-rep-metadata-pool  metadata    298G  63.4T
    sdd-rep-data-pool      data   10.2T  84.5T
     hdd-ec-data-pool      data    808T  1929T
    STANDBY MDS
default.cephmon-01.cepqjp
default.cephmon-01.pvnqad
default.cephmon-02.duujba
default.cephmon-02.nyfook
default.cephmon-03.chjusj
MDS version: ceph version 18.2.2 
(531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)



The mds log, however, shows some "bad backtrace on directory inode" messages:

2024-06-25T18:45:36.575+ 7f8594659700  1 
mds.default.cephmon-03.xcujhz Updating MDS map to version 8082 from mon.1
2024-06-25T18:45:36.575+ 7f8594659700  1 mds.0.8082 handle_mds_map i 
am now mds.0.8082
2024-06-25T18:45:36.575+ 7f8594659700  1 mds.0.8082 handle_mds_map 
state change up:standby --> up:replay

2024-06-25T18:45:36.575+ 7f8594659700  1 mds.0.8082 replay_start
2024-06-25T18:45:36.575+ 7f8594659700  1 mds.0.8082  waiting for 
osdmap 34331 (which blocklists prior instance)
2024-06-25T18:45:36.581+ 7f858de4c700  0 mds.0.cache creating system 
inode with ino:0x100
2024-06-25T18:45:36.581+ 7f858de4c700  0 mds.0.cache creating system 
inode with ino:0x1

2024-06-25T18:45:36.589+ 7f858ce4a700  1 mds.0.journal EResetJournal
2024-06-25T18:45:36.589+ 7f858ce4a700  1 mds.0.sessionmap wipe start
2024-06-25T18:45:36.589+ 7f858ce4a700  1 mds.0.sessionmap wipe result
2024-06-25T18:45:36.589+ 7f858ce4a700  1 mds.0.sessionmap wipe done
2024-06-25T18:45:36.589+ 7f858ce4a700  1 mds.0.8082 Finished 
replaying journal
2024-06-25T18:45:36.589+