[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-16 Thread Michel Jouvin

Hi Eugen,

Yes, sure, no problem to share it. I'm attaching it to this email (as it 
might clutter the discussion if pasted inline).


If somebody on the list has some clue about the LRC plugin, I'm still 
interested in understanding what I'm doing wrong!


Cheers,

Michel

Le 04/05/2023 à 15:07, Eugen Block a écrit :

Hi,

I don't think you've shared your osd tree yet, could you do that? 
Apparently nobody else but us reads this thread or nobody reading this 
uses the LRC plugin. ;-)


Thanks,
Eugen

Zitat von Michel Jouvin :


Hi,

I had to restart one of my OSD servers today and the problem showed up 
again. This time I managed to capture "ceph health detail" output 
showing the problem with the 2 PGs:


[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
    pg 56.1 is down, acting [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
    pg 56.12 is down, acting [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]


I still don't understand why, if I am supposed to survive a datacenter 
failure, I cannot survive 3 OSDs down on the same host hosting shards 
for the PG. In the second case it is only 2 OSDs down, but I'm surprised 
they don't seem to be in the same "group" of OSDs (I'd have expected all 
the OSDs of one datacenter to be in the same group of 5, if the order 
given really reflects the allocation done)...


Still interested in some explanation of what I'm doing wrong! Best 
regards,


Michel

Le 03/05/2023 à 10:21, Eugen Block a écrit :
I think I got it wrong with the locality setting. I'm still limited 
by the number of hosts I have available in my test cluster, but as 
far as I got with failure-domain=osd, I believe k=6, m=3, l=3 with 
locality=datacenter could fit your requirement, at least with 
regard to the recovery bandwidth usage between DCs; the resiliency, 
however, would not match your requirement (one DC failure). That 
profile creates 3 groups of 4 chunks (3 data/coding chunks and one 
parity chunk) across three DCs, 12 chunks in total. The min_size=7 
would not allow an entire DC to go down, I'm afraid; you'd have to 
reduce it to 6 to allow reads/writes in a disaster scenario. I'm 
still not sure if I got it right this time, but maybe you're better 
off without the LRC plugin given the limited number of hosts. Instead 
you could use the jerasure plugin with a profile like k=4 m=5, 
allowing an entire DC to fail without losing data access (we have 
one customer using that).
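
For reference, a minimal sketch of such a profile (names are placeholders; 
note that spreading the 9 chunks as 3 per DC typically needs a custom CRUSH 
rule rather than just a failure-domain setting):

ceph osd erasure-code-profile set jerasure-k4-m5 plugin=jerasure k=4 m=5 crush-failure-domain=host
ceph osd pool create ecpool-test 32 32 erasure jerasure-k4-m5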


Zitat von Eugen Block :


Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might 
be some misunderstandings on my side. But I tried to play around 
with one of my test clusters (Nautilus). Because I'm limited in the 
number of hosts (6 across 3 virtual DCs) I tried two different 
profiles with lower numbers to get a feeling for how that works.


# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc 
k=4 m=2 l=3 crush-failure-domain=host


For every l=3 chunks one additional locality parity chunk is added, so 2 
more chunks to store, i.e. 8 chunks in total. Since my failure domain is 
host and I only have 6 hosts, I get incomplete PGs.


# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc 
k=2 m=2 l=2 crush-failure-domain=host


This gives me 6 chunks in total to store across 6 hosts which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED  UP                     ACTING                 SCRUB_STAMP                 DEEP_SCRUB_STAMP
50.0  1        0         0          0        619    0            0           1    active+clean  72s    18410'1  18415:54  [27,13,0,2,25,7]p27    [27,13,0,2,25,7]p27    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.1  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18414:26  [27,33,22,6,13,34]p27  [27,33,22,6,13,34]p27  2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.2  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:25  [1,28,14,4,31,21]p1    [1,28,14,4,31,21]p1    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.3  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:24  [8,16,26,33,7,25]p8    [8,16,26,33,7,25]p8    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135


After stopping all OSDs on one host I was still able to read and 
write into the pool, but after stopping a second host one PG from 
that pool went "down". I don't fully understand that yet, but I've 
just started to look into it.
With your setup (12 hosts) I would recommend not utilizing all of 
them so you have capacity to recover, let's say one "spare" host 
per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could 
make sense here, resulting in 9 chunks in total (one additional 
locality parity chunk for every two chunks), min_size 4. But as I 
wrote, it probably doesn't have the resiliency for a DC failure, so 
that needs some further investigation. A sketch of such a profile 
follows below.
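
A hedged sketch of that profile with the standard LRC plugin parameters 
(pool name and PG counts are placeholders, and the CRUSH map needs 
datacenter buckets for crush-locality to take effect):

ceph osd erasure-code-profile set LRCk3m3l2 plugin=lrc k=3 m=3 l=2 crush-failure-domain=host crush-locality=datacenter
ceph osd pool create lrcpool-test 32 32 erasure LRCk3m3l2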


Regards,
Eugen


[ceph-users] rbd mirror snapshot trash

2023-05-16 Thread Eugen Block

Good morning,

I would be grateful if anybody could shed some light on this; I can't  
reproduce it in my lab clusters, so I was hoping for the community.
A customer has 2 clusters with rbd mirroring (snapshots) enabled. It  
seems to work fine: they have regular checks and the images on the  
remote site are synced correctly. While investigating a different  
issue (which could be related, though) we noticed that there are quite  
a lot of snapshots in the trash namespace, but for unknown reasons they  
are not purged. This is some example output from the remote site (the  
primary site looks similar):


---snip---
# rbd snap ls --all /
SNAPID     NAME                                                                                           SIZE    PROTECTED  TIMESTAMP                 NAMESPACE
 23962608  42c2e4c3-e1ab-480f-9d29-aabc751b                                                               20 GiB             Tue Sep 20 20:33:18 2022  trash (.mirror.non_primary.61ef3a1e-6e5b-4147-ac62-552e4776dd70.ef345a74-2585-48ac-8e25-09b15c20c877)
 96796437  24cc96d4-51c6-447e-870d-1545d8ec9308                                                           20 GiB             Tue Apr 18 12:15:28 2023  trash (.mirror.non_primary.61ef3a1e-6e5b-4147-ac62-552e4776dd70.a6ae46e4-cbb5-4264-b2e4-6b565820aa16)
110025019  .mirror.non_primary.61ef3a1e-6e5b-4147-ac62-552e4776dd70.0fc73e88-6671-4ba8-a118-8fcbcd422c22  20 GiB             Mon May 15 16:27:39 2023  mirror (non-primary peer_uuids:[] 18ac37b9-a6c6-4f64-ba43-a1f0d02b3f96:115986586 copied)

---snip---

They're on latest Octopus and use all config defaults. There are more  
than 500 images in that pool, with a snapshot schedule of 3 minutes for  
each image.
A couple of questions arose, and hopefully at least some of them can  
be answered:


- Why are there snapshots in the trash namespace from September 2022,  
how can they be cleaned up, and why aren't they cleaned up  
automatically? Not all images have trash entries, though.
- Could disabling mirroring for those images (and then enabling it  
again) help get rid of the trash?
- Is the default osd_max_trimming_pgs = 2 a bottleneck here? They have  
week-old sst files in the store.db, leading to long startup times  
for the MONs after a reboot/restart. Would increasing that value help  
get rid of the trash entries and maybe also trim the mon store?
- Regarding the general snapshot mirroring procedure: the default of  
rbd_mirroring_max_mirroring_snapshots is 5, but I assume that the  
number of active snapshots would only grow in case of a disruption  
between the sites, correct? If the sync works there's no need to  
keep 5 snapshots, right?
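
For reference, the kind of per-image checks involved here (a sketch;  
pool/image names are placeholders):

rbd mirror image status <pool>/<image>
rbd mirror snapshot schedule ls --recursive
rbd snap ls --all <pool>/<image> | grep trash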


I'm still looking into these things myself but I'd appreciate anyone  
chiming in here.


Thanks!
Eugen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd mirror snapshot trash

2023-05-16 Thread Stefan Kooman

On 5/16/23 09:47, Eugen Block wrote:


I'm still looking into these things myself but I'd appreciate anyone 
chiming in here.


IIRC the configuration of the trash purge schedule has changed in one of 
the Ceph releases (not sure which one). Have they recently upgraded to 
a new(er) release? Do the trashed images get purged at all?


See ceph trash purge schedule * commands to check for this.

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd mirror snapshot trash

2023-05-16 Thread Eugen Block
Thanks, they upgraded to 15.2.17 a few months ago, upgrading further  
is currently not possible.



See ceph trash purge schedule * commands to check for this.


If you mean 'rbd trash purge schedule' commands, there are no  
schedules defined, but lots of images seem to be okay. On a different  
channel I also got a response; there's a theory that trash snapshots  
appeared during mon re-election and vanished after upgrading to Quincy.  
I'll recommend deleting the trash snapshots manually, and then maybe  
increasing the snaptrim config.
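
For reference, the schedule check boils down to something like (a sketch):

rbd trash purge schedule ls --recursive
rbd trash purge schedule status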


Zitat von Stefan Kooman :


On 5/16/23 09:47, Eugen Block wrote:


I'm still looking into these things myself but I'd appreciate  
anyone chiming in here.


IIRC the configuration of the trash purge schedule has changed in  
one of the Ceph releases (not sure which one). Have they recently  
upgaraded to a new(er) release? Do the trashed images get purged at  
all?


See ceph trash purge schedule * commands to check for this.

Gr. Stefan



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds dump inode crashes file system

2023-05-16 Thread Frank Schilder
Hi Xiubo,

Forgot to include these; the inodes I tried to dump and which caused a crash are:

ceph tell "mds.ceph-10" dump inode 2199322355147 <-- original file/folder 
causing trouble
ceph tell "mds.ceph-10" dump inode 2199322727209 <-- copy also causing trouble 
(after taking snapshot??)

Other folders all the way up in the hierarchy did not lead to a crash, the dump 
worked fine for these.

The debug settings during the first tries were:

ceph config set mds.ceph-10 debug_mds 20/5
ceph config set mds.ceph-10 debug_ms 5/0

The hex codes are 0x20011d3e5cb and 0x20011d99329. The dump commands are also 
in the log. I experimented a bit to find out where the problem is localized. I 
didn't have the high debug settings enabled all the time due to disk space 
constraints. I can pull specific logs over a short time for anything that is 
reproducible (a short sequence of specific commands).
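
For reference, the decimal inode numbers above convert directly to those hex codes:

printf '0x%x\n' 2199322355147   # 0x20011d3e5cb
printf '0x%x\n' 2199322727209   # 0x20011d99329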

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Monday, May 15, 2023 6:33 PM
To: Xiubo Li; ceph-users@ceph.io
Subject: [ceph-users] Re: mds dump inode crashes file system

Dear Xiubo,

I uploaded the cache dump, the MDS log and the dmesg log containing the 
snaptrace dump to

ceph-post-file: 763955a3-7d37-408a-bbe4-a95dc687cd3f

Sorry, I forgot to add user and description this time.

A question about trouble shooting. I'm pretty sure I know the path where the 
error is located. Would a "ceph tell mds.1 scrub start / recursive repair" be 
able to discover and fix broken snaptraces? If not I'm awaiting further 
instructions.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Friday, May 12, 2023 3:44 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system


On 5/12/23 20:27, Frank Schilder wrote:
> Dear Xiubo and others.
>
>>> I have never heard about that option until now. How do I check that and how 
>>> to I disable it if necessary?
>>> I'm in meetings pretty much all day and will try to send some more info 
>>> later.
>> $ mount|grep ceph
> I get
>
> MON-IPs:SRC on DST type ceph 
> (rw,relatime,name=con-fs2-rit-pfile,secret=,noshare,acl,mds_namespace=con-fs2,_netdev)
>
> so async dirop seems disabled.

Yeah.


>> Yeah, the kclient just received a corrupted snaptrace from MDS.
>> So the first thing is you need to fix the corrupted snaptrace issue in 
>> cephfs and then continue.
> Ooookaaa. I will take it as a compliment that you seem to assume I know 
> how to do that. The documentation gives 0 hits. Could you please provide me 
> with instructions of what to look for and/or what to do first?

There is no doc about this as I know.

>> If possible you can parse the above corrupted snap message to check what 
>> exactly corrupted.
>> I haven't get a chance to do that.
> Again, how would I do that? Is there some documentation and what should I 
> expect?

Currently there is no easy way to do this as I know, last time I have
parsed the corrupted binary data to the corresponding message manully.

And then we could know what exactly has happened for the snaptrace.


>> You seems didn't enable the 'osd blocklist' cephx auth cap for mon:
> I can't find anything about an osd blocklist client auth cap in the 
> documentation. Is this something that came after octopus? Our caps are as 
> shown in the documentation for a ceph fs client 
> (https://docs.ceph.com/en/octopus/cephfs/client-auth/), the one for mon is 
> "allow r":
>
>  caps mds = "allow rw path=/shares"
>  caps mon = "allow r"
>  caps osd = "allow rw tag cephfs data=con-fs2"
Yeah, it seems the 'osd blocklist' was disabled. As I remembered if
enabled it should be something likes:

caps mon = "allow r, allow command \"osd blocklist\""

>
>> I checked that but by reading the code I couldn't get what had cause the MDS 
>> crash.
>> There seems something wrong corrupt the metadata in cephfs.
> He wrote something about an invalid xattrib (empty value). It would be really 
> helpful to get a clue how to proceed. I managed to dump the MDS cache with 
> the critical inode in cache. Would this help with debugging? I also managed 
> to get debug logs with debug_mds=20 during a crash caused by an "mds dump 
> inode" command. Would this contain something interesting? I can also pull the 
> rados objects out and can upload all of these files.

Yeah, possibly. Where is the logs ?


> I managed to track the problem down to a specific folder with a few files 
> (I'm not sure if this coincides with the snaptrace issue, we might have 2 
> issues here). I made a copy of the folder and checked that an "mds dump 
> inode" for the copy does not crash the MDS. I then moved the folders for 
> which this command causes a crash to a different location outside the mounts. 
> Do you think this will help? I'm wondering if after taking our daily snapshot 
> tomorrow we end up in the degraded situation again.

[ceph-users] Re: Discussion thread for Known Pacific Performance Regressions

2023-05-16 Thread Konstantin Shalygin
Hi Mark!

Thank you very much for this message, acknowledging the problem publicly is the 
beginning of fixing it ❤️

> On 11 May 2023, at 17:38, Mark Nelson  wrote:
> 
> Hi Everyone,
> 
> This email was originally posted to d...@ceph.io, but Marc mentioned that he 
> thought this would be useful to post on the user list so I'm re-posting here 
> as well.

Is there any plan or way to fix the performance regressions in the current even 
release (e.g. 16) so that it works no worse than release 14? The 16.2.14 release 
should fix the last issues that block updates (such as the inability to delete 
old snapshots). I'm concerned because of the Ceph release cycle, which pushes 
the old release to EOL. A solution even of the form "you need to redeploy all 
your OSDs", due to the impossibility of RocksDB version migrations, seems okay. 
This is important for the development of internal products (based on RADOS as a 
backend): whether it is worth putting the writing of "migrators from one cluster 
to another" at the application level into the team roadmaps, or, in principle, 
preparing to live forever on version 14 and, accordingly, taking its scaling 
limits into account.


Thanks,
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Slow recovery on Quincy

2023-05-16 Thread Sake Paulusma


We noticed extremely slow performance when remapping is necessary. We didn't do 
anything special other than assigning the correct device_class (to ssd). When 
checking ceph status, we notice the number of objects recovering is around 
17-25 (with watch -n 1 -c ceph status).

How can we speed up the recovery process?

There isn't any client load, because we're going to migrate to this cluster in 
the future, so only an rsync is executed once in a while.

[ceph: root@pwsoel12998 /]# ceph status
  cluster:
id: da3ca2e4-ee5b-11ed-8096-0050569e8c3b
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set

  services:
mon: 5 daemons, quorum 
pqsoel12997,pqsoel12996,pwsoel12994,pwsoel12998,prghygpl03 (age 3h)
mgr: pwsoel12998.ylvjcb(active, since 3h), standbys: pqsoel12997.gagpbt
mds: 4/4 daemons up, 2 standby
osd: 32 osds: 32 up (since 73m), 32 in (since 6d); 10 remapped pgs
 flags noscrub,nodeep-scrub

  data:
volumes: 2/2 healthy
pools:   5 pools, 193 pgs
objects: 13.97M objects, 853 GiB
usage:   3.5 TiB used, 12 TiB / 16 TiB avail
pgs: 755092/55882956 objects misplaced (1.351%)
 183 active+clean
 10  active+remapped+backfilling

  io:
recovery: 2.3 MiB/s, 20 objects/s

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds dump inode crashes file system

2023-05-16 Thread Xiubo Li


On 5/16/23 00:33, Frank Schilder wrote:

Dear Xiubo,

I uploaded the cache dump, the MDS log and the dmesg log containing the 
snaptrace dump to

ceph-post-file: 763955a3-7d37-408a-bbe4-a95dc687cd3f


Okay, thanks.



Sorry, I forgot to add user and description this time.

A question about trouble shooting. I'm pretty sure I know the path where the error is 
located. Would a "ceph tell mds.1 scrub start / recursive repair" be able to 
discover and fix broken snaptraces? If not I'm awaiting further instructions.


Not very sure.

Haven't check this in detail yet.

Thanks




Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Friday, May 12, 2023 3:44 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system


On 5/12/23 20:27, Frank Schilder wrote:

Dear Xiubo and others.


I have never heard about that option until now. How do I check that and how to 
I disable it if necessary?
I'm in meetings pretty much all day and will try to send some more info later.

$ mount|grep ceph

I get

MON-IPs:SRC on DST type ceph 
(rw,relatime,name=con-fs2-rit-pfile,secret=,noshare,acl,mds_namespace=con-fs2,_netdev)

so async dirop seems disabled.

Yeah.



Yeah, the kclient just received a corrupted snaptrace from MDS.
So the first thing is you need to fix the corrupted snaptrace issue in cephfs 
and then continue.

Ooookaaa. I will take it as a compliment that you seem to assume I know how 
to do that. The documentation gives 0 hits. Could you please provide me with 
instructions of what to look for and/or what to do first?

There is no doc about this as I know.


If possible you can parse the above corrupted snap message to check what 
exactly corrupted.
I haven't get a chance to do that.

Again, how would I do that? Is there some documentation and what should I 
expect?

Currently there is no easy way to do this as I know, last time I have
parsed the corrupted binary data to the corresponding message manully.

And then we could know what exactly has happened for the snaptrace.



You seems didn't enable the 'osd blocklist' cephx auth cap for mon:

I can't find anything about an osd blocklist client auth cap in the documentation. Is 
this something that came after octopus? Our caps are as shown in the documentation for a 
ceph fs client (https://docs.ceph.com/en/octopus/cephfs/client-auth/), the one for mon is 
"allow r":

  caps mds = "allow rw path=/shares"
  caps mon = "allow r"
  caps osd = "allow rw tag cephfs data=con-fs2"

Yeah, it seems the 'osd blocklist' was disabled. As I remembered if
enabled it should be something likes:

caps mon = "allow r, allow command \"osd blocklist\""


I checked that but by reading the code I couldn't get what had cause the MDS 
crash.
There seems something wrong corrupt the metadata in cephfs.

He wrote something about an invalid xattrib (empty value). It would be really helpful to 
get a clue how to proceed. I managed to dump the MDS cache with the critical inode in 
cache. Would this help with debugging? I also managed to get debug logs with debug_mds=20 
during a crash caused by an "mds dump inode" command. Would this contain 
something interesting? I can also pull the rados objects out and can upload all of these 
files.

Yeah, possibly. Where is the logs ?



I managed to track the problem down to a specific folder with a few files (I'm not sure 
if this coincides with the snaptrace issue, we might have 2 issues here). I made a copy 
of the folder and checked that an "mds dump inode" for the copy does not crash 
the MDS. I then moved the folders for which this command causes a crash to a different 
location outside the mounts. Do you think this will help? I'm wondering if after taking 
our daily snapshot tomorrow we end up in the degraded situation again.

I really need instructions for how to check what is broken without an MDS crash 
and then how to fix it.

Firstly we need to know where the corrupted metadata is.

I think the mds debug logs and the above corrupted snaptrace could help.
Need to parse that corrupted binary data.

Thanks



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds dump inode crashes file system

2023-05-16 Thread Xiubo Li


On 5/16/23 17:44, Frank Schilder wrote:

Hi Xiubo,

forgot to include these, the inodes i tried to dump and which caused a crash are

ceph tell "mds.ceph-10" dump inode 2199322355147 <-- original file/folder 
causing trouble
ceph tell "mds.ceph-10" dump inode 2199322727209 <-- copy also causing trouble 
(after taking snapshot??)

Other folders all the way up in the hierarchy did not lead to a crash, the dump 
worked fine for these.

The debug settings during the first tries were:

ceph config set mds.ceph-10 debug_mds 20/5
ceph config set mds.ceph-10 debug_ms 5/0


Just setting 'debug_ms' to 1 should be enough. A higher level will 
introduce a lot of noise.
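
That is, something like:

ceph config set mds.ceph-10 debug_ms 1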




The hex codes are 0x20011d3e5cb and 0x20011d99329. The dump commands are also 
in the log. I tried a bit around to find out where the problem is localized. I 
didn't have the high debug settings all the time due to disk space constraints. 
I can pull specific logs over a short time for anything that is reproducible 
(short sequence of specific commands).

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Monday, May 15, 2023 6:33 PM
To: Xiubo Li; ceph-users@ceph.io
Subject: [ceph-users] Re: mds dump inode crashes file system

Dear Xiubo,

I uploaded the cache dump, the MDS log and the dmesg log containing the 
snaptrace dump to

ceph-post-file: 763955a3-7d37-408a-bbe4-a95dc687cd3f

Sorry, I forgot to add user and description this time.

A question about trouble shooting. I'm pretty sure I know the path where the error is 
located. Would a "ceph tell mds.1 scrub start / recursive repair" be able to 
discover and fix broken snaptraces? If not I'm awaiting further instructions.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Friday, May 12, 2023 3:44 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system


On 5/12/23 20:27, Frank Schilder wrote:

Dear Xiubo and others.


I have never heard about that option until now. How do I check that and how to 
I disable it if necessary?
I'm in meetings pretty much all day and will try to send some more info later.

$ mount|grep ceph

I get

MON-IPs:SRC on DST type ceph 
(rw,relatime,name=con-fs2-rit-pfile,secret=,noshare,acl,mds_namespace=con-fs2,_netdev)

so async dirop seems disabled.

Yeah.



Yeah, the kclient just received a corrupted snaptrace from MDS.
So the first thing is you need to fix the corrupted snaptrace issue in cephfs 
and then continue.

Ooookaaa. I will take it as a compliment that you seem to assume I know how 
to do that. The documentation gives 0 hits. Could you please provide me with 
instructions of what to look for and/or what to do first?

There is no doc about this as I know.


If possible you can parse the above corrupted snap message to check what 
exactly corrupted.
I haven't get a chance to do that.

Again, how would I do that? Is there some documentation and what should I 
expect?

Currently there is no easy way to do this as I know, last time I have
parsed the corrupted binary data to the corresponding message manully.

And then we could know what exactly has happened for the snaptrace.



You seems didn't enable the 'osd blocklist' cephx auth cap for mon:

I can't find anything about an osd blocklist client auth cap in the documentation. Is 
this something that came after octopus? Our caps are as shown in the documentation for a 
ceph fs client (https://docs.ceph.com/en/octopus/cephfs/client-auth/), the one for mon is 
"allow r":

  caps mds = "allow rw path=/shares"
  caps mon = "allow r"
  caps osd = "allow rw tag cephfs data=con-fs2"

Yeah, it seems the 'osd blocklist' was disabled. As I remembered if
enabled it should be something likes:

caps mon = "allow r, allow command \"osd blocklist\""


I checked that but by reading the code I couldn't get what had cause the MDS 
crash.
There seems something wrong corrupt the metadata in cephfs.

He wrote something about an invalid xattrib (empty value). It would be really helpful to 
get a clue how to proceed. I managed to dump the MDS cache with the critical inode in 
cache. Would this help with debugging? I also managed to get debug logs with debug_mds=20 
during a crash caused by an "mds dump inode" command. Would this contain 
something interesting? I can also pull the rados objects out and can upload all of these 
files.

Yeah, possibly. Where is the logs ?



I managed to track the problem down to a specific folder with a few files (I'm not sure 
if this coincides with the snaptrace issue, we might have 2 issues here). I made a copy 
of the folder and checked that an "mds dump inode" for the copy does not crash 
the MDS. I then moved the folders for which this command causes a crash to a different 
location outside the mounts. Do you think this will help?

[ceph-users] Re: mds dump inode crashes file system

2023-05-16 Thread Gregory Farnum
On Fri, May 12, 2023 at 5:28 AM Frank Schilder  wrote:
>
> Dear Xiubo and others.
>
> >> I have never heard about that option until now. How do I check that and 
> >> how to I disable it if necessary?
> >> I'm in meetings pretty much all day and will try to send some more info 
> >> later.
> >
> > $ mount|grep ceph
>
> I get
>
> MON-IPs:SRC on DST type ceph 
> (rw,relatime,name=con-fs2-rit-pfile,secret=,noshare,acl,mds_namespace=con-fs2,_netdev)
>
> so async dirop seems disabled.
>
> > Yeah, the kclient just received a corrupted snaptrace from MDS.
> > So the first thing is you need to fix the corrupted snaptrace issue in 
> > cephfs and then continue.
>
> Ooookaaa. I will take it as a compliment that you seem to assume I know 
> how to do that. The documentation gives 0 hits. Could you please provide me 
> with instructions of what to look for and/or what to do first?
>
> > If possible you can parse the above corrupted snap message to check what 
> > exactly corrupted.
> > I haven't get a chance to do that.
>
> Again, how would I do that? Is there some documentation and what should I 
> expect?
>
> > You seems didn't enable the 'osd blocklist' cephx auth cap for mon:
>
> I can't find anything about an osd blocklist client auth cap in the 
> documentation. Is this something that came after octopus? Our caps are as 
> shown in the documentation for a ceph fs client 
> (https://docs.ceph.com/en/octopus/cephfs/client-auth/), the one for mon is 
> "allow r":
>
> caps mds = "allow rw path=/shares"
> caps mon = "allow r"
> caps osd = "allow rw tag cephfs data=con-fs2"
>
>
> > I checked that but by reading the code I couldn't get what had cause the 
> > MDS crash.
> > There seems something wrong corrupt the metadata in cephfs.
>
> He wrote something about an invalid xattrib (empty value). It would be really 
> helpful to get a clue how to proceed. I managed to dump the MDS cache with 
> the critical inode in cache. Would this help with debugging? I also managed 
> to get debug logs with debug_mds=20 during a crash caused by an "mds dump 
> inode" command. Would this contain something interesting? I can also pull the 
> rados objects out and can upload all of these files.

I was just guessing about the invalid xattr based on the very limited
crash info, so if it's clearly broken snapshot metadata from the
kclient logs I would focus on that.

I'm surprised/concerned your system managed to generate one of those,
of course...I'll let Xiubo work with you on that.
-Greg
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow recovery on Quincy

2023-05-16 Thread Sridhar Seshasayee
Hello,

This is a known issue in Quincy, which uses the mClock scheduler. There is a fix
for this which should be available in 17.2.6+ releases.

You can confirm the active scheduler type on any osd using:

 ceph config show osd.0 osd_op_queue

If the active scheduler is 'mclock_scheduler', you can try switching the
mClock profile to
'high_recovery_ops' on all OSDs to speed up the backfilling using:

ceph config set osd osd_mclock_profile high_recovery_ops

After the backfilling is complete, you can switch the mClock profile back
to the default value using:

ceph config rm osd osd_mclock_profile
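
To double-check which profile an OSD is actually using (osd.0 as an example):

ceph config show osd.0 osd_mclock_profile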


On Tue, May 16, 2023 at 4:46 PM Sake Paulusma  wrote:

>
> We noticed extremely slow performance when remapping is necessary. We
> didn't do anything special other than assigning the correct device_class
> (to ssd). When checking ceph status, we notice the number of objects
> recovering is around 17-25 (with watch -n 1 -c ceph status).
>
> How can we increase the recovery process?
>
> There isn't any client load, because we're going to migrate to this
> cluster in the future, so only a rsync once a while is being executed.
>
> [ceph: root@pwsoel12998 /]# ceph status
>   cluster:
> id: da3ca2e4-ee5b-11ed-8096-0050569e8c3b
> health: HEALTH_WARN
> noscrub,nodeep-scrub flag(s) set
>
>   services:
> mon: 5 daemons, quorum
> pqsoel12997,pqsoel12996,pwsoel12994,pwsoel12998,prghygpl03 (age 3h)
> mgr: pwsoel12998.ylvjcb(active, since 3h), standbys: pqsoel12997.gagpbt
> mds: 4/4 daemons up, 2 standby
> osd: 32 osds: 32 up (since 73m), 32 in (since 6d); 10 remapped pgs
>  flags noscrub,nodeep-scrub
>
>   data:
> volumes: 2/2 healthy
> pools:   5 pools, 193 pgs
> objects: 13.97M objects, 853 GiB
> usage:   3.5 TiB used, 12 TiB / 16 TiB avail
> pgs: 755092/55882956 objects misplaced (1.351%)
>  183 active+clean
>  10  active+remapped+backfilling
>
>   io:
> recovery: 2.3 MiB/s, 20 objects/s
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 

Sridhar Seshasayee

Partner Engineer

Red Hat 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dedicated radosgw gateways

2023-05-16 Thread Ulrich Klein
Hi,

Might be a dumb question …
I'm wondering how I can set those config variables in some but not all RGW 
processes?

I'm on a cephadm 17.2.6. On 3 nodes I have RGWs. The ones on 8080 are behind 
haproxy for users. the ones one 8081 I'd like for sync only.

# ceph orch ps | grep rgw
rgw.max.maxvm4.lmjaef  maxvm4   *:8080   running (51m) 4s ago   2h 
262M-  17.2.6   d007367d0f3c  315f47a4f164
rgw.max.maxvm4.lwzxpf  maxvm4   *:8081   running (51m) 4s ago   2h 
199M-  17.2.6   d007367d0f3c  7ae82e5f6ef2
rgw.max.maxvm5.syxpnb  maxvm5   *:8081   running (51m) 4s ago   2h 
137M-  17.2.6   d007367d0f3c  c0635c09ba8f
rgw.max.maxvm5.wtpyfk  maxvm5   *:8080   running (51m) 4s ago   2h 
267M-  17.2.6   d007367d0f3c  b4ad91718094
rgw.max.maxvm6.ostneb  maxvm6   *:8081   running (51m) 4s ago   2h 
150M-  17.2.6   d007367d0f3c  83b2af8f787a
rgw.max.maxvm6.qfulra  maxvm6   *:8080   running (51m) 4s ago   2h 
262M-  17.2.6   d007367d0f3c  81d01bf9e21d

# ceph config show rgw.max.maxvm4.lwzxpf
Error ENOENT: no config state for daemon rgw.max.maxvm4.lwzxpf

# ceph config set rgw.max.maxvm4.lwzxpf rgw_enable_lc_threads false
Error EINVAL: unrecognized config target 'rgw.max.maxvm4.lwzxpf'
(Not surprised)

# ceph tell rgw.max.maxvm4.lmjaef get rgw_enable_lc_threads
error handling command target: local variable 'poolid' referenced before 
assignment

# ceph tell rgw.max.maxvm4.lmjaef set rgw_enable_lc_threads false
error handling command target: local variable 'poolid' referenced before 
assignment

Is there any way to set the config for specific RGWs in a containerized env?

(ceph.conf doesn't work: it doesn't do anything and gets overwritten with a 
minimal version at "unpredictable" intervals.)
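
One hedged guess I haven't verified: cephadm RGW daemons should register with a 
'client.rgw.' prefix, so the per-daemon config target might be e.g.:

ceph config set client.rgw.max.maxvm4.lwzxpf rgw_enable_lc_threads false
ceph config set client.rgw.max.maxvm4.lwzxpf rgw_enable_gc_threads false
ceph config show client.rgw.max.maxvm4.lwzxpf rgw_enable_lc_threads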

Thanks for any ideas.

Ciao, Uli

> On 15. May 2023, at 14:15, Konstantin Shalygin  wrote:
> 
> Hi,
> 
>> On 15 May 2023, at 14:58, Michal Strnad  wrote:
>> 
>> at Cephalocon 2023, it was mentioned several times that for service tasks 
>> such as data deletion via garbage collection or data replication in S3 via 
>> zoning, it is good to do them on dedicated radosgw gateways and not mix them 
>> with gateways used by users. How can this be achieved? How can we isolate 
>> these tasks? Will using dedicated keyrings instead of admin keys be 
>> sufficient? How do you operate this in your environment?
> 
> Just:
> 
> # don't put client traffic to "dedicated radosgw gateways"
> # disable lc/gc on "gateways used by users" via `rgw_enable_lc_threads = 
> false` & `rgw_enable_gc_threads = false`
> 
> 
> k
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow recovery on Quincy

2023-05-16 Thread Sake Paulusma
Hi,

The config shows "mclock_scheduler" and I already switched to 
high_recovery_ops; this does increase the recovery ops, but only a little.

You mention there is a fix in 17.2.6+, but we're running 17.2.6 (this 
cluster was created on this version). Any more ideas?

Best regards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow recovery on Quincy

2023-05-16 Thread Sake Paulusma
Just to add:
high_client_ops: around 8-13 objects per second
high_recovery_ops: around 17-25 objects per second

Both observed with "watch -n 1 -c ceph status".

Best regards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] User + Dev Monthly Meeting happening this Thursday

2023-05-16 Thread Laura Flores
Hi Ceph Users,

The User + Dev Monthly Meeting is coming up next week on *Thursday, May
17th* *@* *14:00 UTC* (time conversions below). See meeting details at the
bottom of this email.

Please add any topics you'd like to discuss to the agenda:
https://pad.ceph.com/p/ceph-user-dev-monthly-minutes

See you there,
Laura Flores

Meeting link: https://meet.jit.si/ceph-user-dev-monthly

Time Conversions:
UTC:   Wednesday, May 17, 14:00 UTC
Mountain View, CA, US: Wednesday, May 17,  7:00 PDT
Phoenix, AZ, US:   Wednesday, May 17,  7:00 MST
Denver, CO, US:Wednesday, May 17,  8:00 MDT
Huntsville, AL, US:Wednesday, May 17,  9:00 CDT
Raleigh, NC, US:   Wednesday, May 17, 10:00 EDT
London, England:   Wednesday, May 17, 15:00 BST
Paris, France: Wednesday, May 17, 16:00 CEST
Helsinki, Finland: Wednesday, May 17, 17:00 EEST
Tel Aviv, Israel:  Wednesday, May 17, 17:00 IDT
Pune, India:   Wednesday, May 17, 19:30 IST
Brisbane, Australia:   Thursday, May 18,  0:00 AEST
Singapore, Asia:   Wednesday, May 17, 22:00 +08
Auckland, New Zealand: Thursday, May 18,  2:00 NZST

-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Tech Talk For May 2023: RGW Lua Scripting Code Walkthrough

2023-05-16 Thread Mike Perez
Hello everyone,

Join us on May 24th at 17:00 UTC for a long overdue Ceph Tech Talk! This month, 
Yuval Lifshitz will give an RGW Lua Scripting Code Walkthrough.

https://ceph.io/en/community/tech-talks/

You can also see Yuval's previous presentation at Ceph Month 2021, From Open 
Source to Open Ended in Ceph with Lua.

https://www.youtube.com/watch?v=anQJugs27hE

If you want to give a technical presentation for Ceph Tech Talks, please 
contact me directly with a title and description. Thank you!

--
Mike Perez
Community Manager
Ceph Foundation
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: User + Dev Monthly Meeting happening this Thursday

2023-05-16 Thread Laura Flores
Hi users,

Correction: The User + Dev Monthly Meeting is happening *this week* on
*Thursday,
May 18th* *@* *14:00 UTC*. Apologies for the confusion.
See below for updated meeting details.

Thanks,
Laura Flores

Meeting link: https://meet.jit.si/ceph-user-dev-monthly

Time conversions:
UTC:   Thursday, May 18, 14:00 UTC
Mountain View, CA, US: Thursday, May 18,  7:00 PDT
Phoenix, AZ, US:   Thursday, May 18,  7:00 MST
Denver, CO, US:Thursday, May 18,  8:00 MDT
Huntsville, AL, US:Thursday, May 18,  9:00 CDT
Raleigh, NC, US:   Thursday, May 18, 10:00 EDT
London, England:   Thursday, May 18, 15:00 BST
Paris, France: Thursday, May 18, 16:00 CEST
Helsinki, Finland: Thursday, May 18, 17:00 EEST
Tel Aviv, Israel:  Thursday, May 18, 17:00 IDT
Pune, India:   Thursday, May 18, 19:30 IST
Brisbane, Australia:   Friday, May 19,  0:00 AEST
Singapore, Asia:   Thursday, May 18, 22:00 +08
Auckland, New Zealand: Friday, May 19,  2:00 NZST

On Tue, May 16, 2023 at 10:13 AM Laura Flores  wrote:

> Hi Ceph Users,
>
> The User + Dev Monthly Meeting is coming up next week on *Thursday, May
> 17th* *@* *14:00 UTC* (time conversions below). See meeting details at
> the bottom of this email.
>
> Please add any topics you'd like to discuss to the agenda:
> https://pad.ceph.com/p/ceph-user-dev-monthly-minutes
>
> See you there,
> Laura Flores
>
> Meeting link: https://meet.jit.si/ceph-user-dev-monthly
>
> Time Conversions:
> UTC:   Wednesday, May 17, 14:00 UTC
> Mountain View, CA, US: Wednesday, May 17,  7:00 PDT
> Phoenix, AZ, US:   Wednesday, May 17,  7:00 MST
> Denver, CO, US:Wednesday, May 17,  8:00 MDT
> Huntsville, AL, US:Wednesday, May 17,  9:00 CDT
> Raleigh, NC, US:   Wednesday, May 17, 10:00 EDT
> London, England:   Wednesday, May 17, 15:00 BST
> Paris, France: Wednesday, May 17, 16:00 CEST
> Helsinki, Finland: Wednesday, May 17, 17:00 EEST
> Tel Aviv, Israel:  Wednesday, May 17, 17:00 IDT
> Pune, India:   Wednesday, May 17, 19:30 IST
> Brisbane, Australia:   Thursday, May 18,  0:00 AEST
> Singapore, Asia:   Wednesday, May 17, 22:00 +08
> Auckland, New Zealand: Thursday, May 18,  2:00 NZST
>
> --
>
> Laura Flores
>
> She/Her/Hers
>
> Software Engineer, Ceph Storage 
>
> Chicago, IL
>
> lflo...@ibm.com | lflo...@redhat.com 
> M: +17087388804
>
>
>

-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage 

Chicago, IL

lflo...@ibm.com | lflo...@redhat.com 
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow recovery on Quincy

2023-05-16 Thread 胡 玮文
Hi Sake,

We are experiencing the same. I set "osd_mclock_cost_per_byte_usec_hdd" to 0.1 
(the default is 2.6) and got about 15 times the backfill speed, without 
significantly affecting client IO. This parameter seems to be calculated wrongly; 
from the description, 5e-3 should be a reasonable value for HDD (corresponding to 
200 MB/s). I noticed this default was originally 5.2, then changed to 2.6 to 
increase the recovery speed. So I suspect the original author just converted the 
unit wrongly: he may have wanted 5.2e-3 but wrote 5.2 in the code.
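
A hedged example of that change (check that the option still exists on your 
release before setting it):

ceph config set osd osd_mclock_cost_per_byte_usec_hdd 0.1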

But all this may not matter in the next version: I see the relevant code has 
been rewritten, and this parameter is now removed.

The high_recovery_ops profile works very poorly for us. It increases the average 
latency of client IO from 50 ms to about 1 s.

Weiwen Hu

在 2023年5月16日,19:16,Sake Paulusma  写道:


We noticed extremely slow performance when remapping is necessary. We didn't do 
anything special other than assigning the correct device_class (to ssd). When 
checking ceph status, we notice the number of objects recovering is around 
17-25 (with watch -n 1 -c ceph status).

How can we increase the recovery process?

There isn't any client load, because we're going to migrate to this cluster in 
the future, so only a rsync once a while is being executed.

[ceph: root@pwsoel12998 /]# ceph status
 cluster:
   id: da3ca2e4-ee5b-11ed-8096-0050569e8c3b
   health: HEALTH_WARN
   noscrub,nodeep-scrub flag(s) set

 services:
   mon: 5 daemons, quorum 
pqsoel12997,pqsoel12996,pwsoel12994,pwsoel12998,prghygpl03 (age 3h)
   mgr: pwsoel12998.ylvjcb(active, since 3h), standbys: pqsoel12997.gagpbt
   mds: 4/4 daemons up, 2 standby
   osd: 32 osds: 32 up (since 73m), 32 in (since 6d); 10 remapped pgs
flags noscrub,nodeep-scrub

 data:
   volumes: 2/2 healthy
   pools:   5 pools, 193 pgs
   objects: 13.97M objects, 853 GiB
   usage:   3.5 TiB used, 12 TiB / 16 TiB avail
   pgs: 755092/55882956 objects misplaced (1.351%)
183 active+clean
10  active+remapped+backfilling

 io:
   recovery: 2.3 MiB/s, 20 objects/s

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow recovery on Quincy

2023-05-16 Thread Sake Paulusma
Thanks for the input! Changing this value indeed increased the recovery 
speed from 20 objects per second to 500!

Now something strange:
1. We needed to change the class for our drives manually to ssd.
2. The setting "osd_mclock_max_capacity_iops_ssd" was set to 0. With the osd 
bench described in the documentation, we configured the value 1 for the ssd 
parameter. Only nothing changed.
3. But when setting "osd_mclock_max_capacity_iops_hdd" to 1, the recovery 
speed also increased dramatically!

I don't understand this anymore :( Is the mclock scheduler ignoring the 
override of the device class?
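
A hedged way to double-check what each OSD actually reports and uses 
(osd.0 as an example):

ceph osd tree | head -20    # the CLASS column shows the device class per OSD
ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
ceph config show osd.0 osd_mclock_max_capacity_iops_hdd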

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow recovery on Quincy

2023-05-16 Thread Sake Paulusma
Did an extra test: shutting down an OSD host and forcing a recovery. Using only 
the iops setting I got 500 objects a second, but also using the bytes_per_usec 
setting, I got 1200 objects a second!

Maybe there should also be an investigation into this performance issue.

Best regards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Grafana service fails to start due to bad directory name after Quincy upgrade

2023-05-16 Thread Adiga, Anantha
Hi

Upgraded from Pacific 16.2.5 to 17.2.6 on May 8th

However, Grafana fails to start due to bad folder path
:/tmp# journalctl -u 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201 -n 25
-- Logs begin at Sun 2023-05-14 20:05:52 UTC, end at Tue 2023-05-16 19:07:51 
UTC. --
May 16 19:05:00 fl31ca104ja0201 systemd[1]: Stopped Ceph 
grafana-fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e.
May 16 19:05:00 fl31ca104ja0201 systemd[1]: Started Ceph 
grafana-fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e.
May 16 19:05:00 fl31ca104ja0201 bash[2575021]: /bin/bash: 
/var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana-fl31ca104ja0201/u>
May 16 19:05:00 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: Main 
process ex>
May 16 19:05:00 fl31ca104ja0201 bash[2575030]: /bin/bash: 
/var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana-fl31ca104ja0201/u>
May 16 19:05:00 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: 
Failed with res>
May 16 19:05:10 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: 
Scheduled resta>
May 16 19:05:10 fl31ca104ja0201 systemd[1]: Stopped Ceph 
grafana-fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e.
May 16 19:05:10 fl31ca104ja0201 systemd[1]: Started Ceph 
grafana-fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e.
May 16 19:05:10 fl31ca104ja0201 bash[2575273]: /bin/bash: 
/var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana-fl31ca104ja0201/u>
May 16 19:05:10 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: Main 
process ex>
May 16 19:05:10 fl31ca104ja0201 bash[2575282]: /bin/bash: 
/var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana-fl31ca104ja0201/u>
May 16 19:05:10 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: 
Failed with res>
May 16 19:05:20 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: 
Scheduled resta>
May 16 19:05:20 fl31ca104ja0201 systemd[1]: Stopped Ceph 
grafana-fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e.
May 16 19:05:20 fl31ca104ja0201 systemd[1]: Started Ceph 
grafana-fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e.
May 16 19:05:20 fl31ca104ja0201 bash[2575369]: /bin/bash: 
/var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana-fl31ca104ja0201/u>
May 16 19:05:20 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: Main 
process ex>
May 16 19:05:20 fl31ca104ja0201 bash[2575370]: /bin/bash: 
/var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana-fl31ca104ja0201/u>
May 16 19:05:20 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: 
Failed with res>
May 16 19:05:30 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: 
Scheduled resta>
May 16 19:05:30 fl31ca104ja0201 systemd[1]: Stopped Ceph 
grafana-fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e.
May 16 19:05:30 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: 
Start request r>
May 16 19:05:30 fl31ca104ja0201 systemd[1]: 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: 
Failed with res>
May 16 19:05:30 fl31ca104ja0201 systemd[1]: Failed to start Ceph 
grafana-fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e.
(untruncated continuations of the journal lines above:)
/bin/bash: /var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana-fl31ca104ja0201/unit.run: No such file or directory
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: Main process exited, code=exited, status=127/n/a
/bin/bash: /var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana-fl31ca104ja0201/unit.poststop: No such file or directory
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: Failed with result 'exit-code'.
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana-fl31ca104ja0201.service: Scheduled restart job, restart counter is at 3.
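
A quick check, using the path from the journal output above, to confirm whether 
the unit files are really missing on that host:

ls -l /var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana-fl31ca104ja0201/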

[ceph-users] Bucket-level redirect_zone

2023-05-16 Thread Yixin Jin
Hi folks,

I created a feature request ticket to call for bucket-level redirect_zone 
(https://tracker.ceph.com/issues/61199), which is basically an extension of the 
zone-level redirect_zone. I found it helpful in realizing CopyObject (with 
x-copy-source) in multisite environments where bucket content doesn't exist in 
all zones. This feature is similar to what Matt Benjamin suggested about the 
concept of "bucket residence".

In my own prototyping effort, the redirect feature at the bucket level is fairly 
straightforward. Making use of it with CopyObject is trickier, because I haven't 
found a good way to get the source object policy when 
RGWCopyObj::verify_permission() is called. Anyway, I will continue to explore 
this idea and hope to get more support for it.

Thanks,
Yixin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bucket-level redirect_zone

2023-05-16 Thread Matt Benjamin
Hi Yixin,

I support experimentation for sure, but if we want to consider a feature
for inclusion, we need design proposal(s) and review, of course.  If you're
interested in feedback on your current ideas, you could consider coming to
the "refactoring" meeting on a Wednesday.  I think these ideas would be
interesting to discuss, maybe it would reduce time to develop.

Matt

On Tue, May 16, 2023 at 5:20 PM Yixin Jin  wrote:

> Hi folks,
>
> I created a feature request ticket to call for bucket-level redirect_zone (
> https://tracker.ceph.com/issues/61199), which is basically an extension
> from zone-level redirect_zone. I found it helpful in realizing CopyObject
> with (x-copy-source) in multisite environments where bucket content don't
> exist in all zones. This feature is similar to what Matt Benjamin suggested
> about the concept of "bucket residence".
>
> In my own prototyping affect, redirecting feature at bucket level is
> fairly straightforward. Making use of it with CopyObject is trickier
> because I haven't found a good way to get the source object policy when
> RGWCopyObj::verify_permission() is called. Anyway, I will continue to
> explore this idea and hope to get more support of it.
>
> Thanks,
> Yixin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>

-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds dump inode crashes file system

2023-05-16 Thread Xiubo Li


On 5/11/23 20:12, Frank Schilder wrote:

Dear Xiubo,

please see also my previous e-mail about the async dirop config.

I have a bit more log output from dmesg on the file server here: 
https://pastebin.com/9Y0EPgDD .


There is one bug in the kclient: it forgets to skip the memory for 
'snap_realms' if the op is not 'CEPH_SNAP_OP_SPLIT'.


The snaptrace itself was not corrupted, and I have sent out one patch to 
fix it: https://patchwork.kernel.org/project/ceph-devel/list/?series=748267


But this bug won't always trigger the call trace; in special cases the 
snaptrace could still be parsed successfully and may then corrupt the 
snaprealms on the kclient side, which in turn may send an incorrect capsnap 
update back to the MDS and make the MDS mad.


I am not sure whether your case is caused by this.

Thanks

- Xiubo


  This covers a reboot after the one in my previous e-mail as well as another fail at the end. When 
I checked around 16:30 the mount point was inaccessible again with "stale file handle". 
Please note the "wrong peer at address" messages in the log, it seems that a number of 
issues come together here. These threads are actually all related to this file server and the 
observations we make now:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/MSB5TIG42XAFNG2CKILY5DZWIMX6C5CO/
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LYY7TBK63XPR6X6TD7372I2YEPJO2L6F/

You mentioned directory migration in the MDS, I guess you mean migrating a directory 
fragment between MDSes? This should not happen, all these directories are statically 
pinned to a rank. An MDS may split/merge directory fragments, but they stay at the same 
MDS all the time. This is confirmed by running a "dump inode" on directories 
under a pin. Only one MDS reports back that it has the dir inode in its cache, so I think 
the static pinning works as expected.

It would be great if you could also look at Greg's reply, maybe you have 
something I could look at to find the cause of the crash during the mds dump 
inode command.

Thanks a lot and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Thursday, May 11, 2023 12:26 PM
To: Xiubo Li; ceph-users@ceph.io
Subject: [ceph-users] Re: mds dump inode crashes file system

Dear Xiubo,

thanks for your reply.


BTW, did you enabled the async dirop ? Currently this is disabled by
default in 4.18.0-486.el8.x86_64.

I have never heard about that option until now. How do I check that and how to 
I disable it if necessary?

I'm in meetings pretty much all day and will try to send some more info later.


Could you reproduce this by enabling the mds debug logs ?

Not right now. Our users are annoyed enough already. I first need to figure out 
how to move the troublesome inode somewhere else where I might be able to do 
something. The boot message shows up on this one file server every time. Is 
there any information about what dir/inode might be causing the issue? How 
could I reproduce this without affecting the users, say, by re-creating the 
same condition somewhere else? Any hints are appreciated.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Thursday, May 11, 2023 3:45 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system

Hey Frank,

On 5/10/23 21:44, Frank Schilder wrote:

The kernel message that shows up on boot on the file server in text format:

May 10 13:56:59 rit-pfile01 kernel: WARNING: CPU: 3 PID: 34 at 
fs/ceph/caps.c:689 ceph_add_cap+0x53e/0x550 [ceph]
May 10 13:56:59 rit-pfile01 kernel: Modules linked in: ceph libceph 
dns_resolver nls_utf8 isofs cirrus drm_shmem_helper intel_rapl_msr iTCO_wdt 
intel_rapl_common iTCO_vendor_support drm_kms_helper syscopyarea sysfillrect 
sysimgblt fb_sys_fops pcspkr joydev virtio_net drm i2c_i801 net_failover 
virtio_balloon failover lpc_ich nfsd nfs_acl lockd auth_rpcgss grace sunrpc 
sr_mod cdrom sg xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci 
libahci ghash_clmulni_intel libata serio_raw virtio_blk virtio_console 
virtio_scsi dm_mirror dm_region_hash dm_log dm_mod fuse
May 10 13:56:59 rit-pfile01 kernel: CPU: 3 PID: 34 Comm: kworker/3:0 Not 
tainted 4.18.0-486.el8.x86_64 #1
May 10 13:56:59 rit-pfile01 kernel: Hardware name: Red Hat KVM/RHEL-AV, BIOS 
1.16.0-3.module_el8.7.0+3346+68867adb 04/01/2014
May 10 13:56:59 rit-pfile01 kernel: Workqueue: ceph-msgr ceph_con_workfn 
[libceph]
May 10 13:56:59 rit-pfile01 kernel: RIP: 0010:ceph_add_cap+0x53e/0x550 [ceph]
May 10 13:56:59 rit-pfile01 kernel: Code: c0 48 c7 c7 c0 69 7f c0 e8 6c 4c 72 c3 0f 
0b 44 89 7c 24 04 e9 7e fc ff ff 44 8b 7c 24 04 e9 68 fe ff ff 0f 0b e9 c9 fc ff ff 
<0f> 0b e9 0a fe ff ff 0f 0b e9 12 fe ff ff 0f 0b 66 90 0f 1f 44 00
May 10 13:56:59 rit-pfile01 kernel: RSP: 0018:a4d000d87b

[ceph-users] Re: mds dump inode crashes file system

2023-05-16 Thread Xiubo Li


On 5/16/23 21:55, Gregory Farnum wrote:

On Fri, May 12, 2023 at 5:28 AM Frank Schilder  wrote:

Dear Xiubo and others.


I have never heard about that option until now. How do I check that and how to 
I disable it if necessary?
I'm in meetings pretty much all day and will try to send some more info later.

$ mount|grep ceph

I get

MON-IPs:SRC on DST type ceph 
(rw,relatime,name=con-fs2-rit-pfile,secret=,noshare,acl,mds_namespace=con-fs2,_netdev)

so async dirop seems disabled.


Yeah, the kclient just received a corrupted snaptrace from MDS.
So the first thing is you need to fix the corrupted snaptrace issue in cephfs 
and then continue.

Ooookaaa. I will take it as a compliment that you seem to assume I know how 
to do that. The documentation gives 0 hits. Could you please provide me with 
instructions of what to look for and/or what to do first?


If possible you can parse the above corrupted snap message to check what 
exactly corrupted.
I haven't get a chance to do that.

Again, how would I do that? Is there some documentation and what should I 
expect?


You seems didn't enable the 'osd blocklist' cephx auth cap for mon:

I can't find anything about an osd blocklist client auth cap in the documentation. Is 
this something that came after octopus? Our caps are as shown in the documentation for a 
ceph fs client (https://docs.ceph.com/en/octopus/cephfs/client-auth/), the one for mon is 
"allow r":

 caps mds = "allow rw path=/shares"
 caps mon = "allow r"
 caps osd = "allow rw tag cephfs data=con-fs2"



I checked that but by reading the code I couldn't get what had cause the MDS 
crash.
There seems something wrong corrupt the metadata in cephfs.

He wrote something about an invalid xattrib (empty value). It would be really helpful to 
get a clue how to proceed. I managed to dump the MDS cache with the critical inode in 
cache. Would this help with debugging? I also managed to get debug logs with debug_mds=20 
during a crash caused by an "mds dump inode" command. Would this contain 
something interesting? I can also pull the rados objects out and can upload all of these 
files.

I was just guessing about the invalid xattr based on the very limited
crash info, so if it's clearly broken snapshot metadata from the
kclient logs I would focus on that.


Actually the snaptrace was not corrupted, and I have fixed the bug on the 
kclient side. For more detail please see my reply in the last mail.


For the MDS side's crash:

 ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x158) [0x7fe979ae9b92]
 2: (()+0x27ddac) [0x7fe979ae9dac]
 3: (()+0x5ce831) [0x7fe979e3a831]
 4: (InodeStoreBase::dump(ceph::Formatter*) const+0x153) [0x55c08c59b543]
 5: (CInode::dump(ceph::Formatter*, int) const+0x144) [0x55c08c59b8d4]
 6: (MDCache::dump_inode(ceph::Formatter*, unsigned long)+0x7c) [0x55c08c41e00c]

I just guess it may have been corrupted while dumping the 'old_inode':

4383 void InodeStoreBase::dump(Formatter *f) const
4384 {

...
4401   f->open_array_section("old_inodes");
4402   for (const auto &p : old_inodes) {
4403 f->open_object_section("old_inode");
4404 // The key is the last snapid, the first is in the 
mempool_old_inode

4405 f->dump_int("last", p.first);
4406 p.second.dump(f);
4407 f->close_section();  // old_inode
4408   }
4409   f->close_section();  // old_inodes
...
4413 }

Because incorrectly parsing the snaptrace in the kclient may corrupt the 
snaprealm and then send a corrupted capsnap back to the MDS.


Thanks

- Xiubo


I'm surprised/concerned your system managed to generate one of those,
of course...I'll let Xiubo work with you on that.
-Greg


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io