date:20230526

[ceph-users] Re: Troubleshooting "N slow requests are blocked > 30 secs" on Pacific

2023-05-26 Thread Emmanuel Jaep

Hi Milind,

I finally managed to dump the cache and find the file.
It generated a 1.5 GB file with about 7 Mio lines. It's kind of hard to
know what is out of the ordinary…

Furthermore, I noticed that dumping the cache was actually stopping the
MDS. Is it a normal behavior?

Best,

Emmanuel

On Thu, May 25, 2023 at 1:19 PM Milind Changire  wrote:

> try the command with the --id argument:
>
> # ceph --id admin --cluster floki daemon mds.icadmin011 dump cache
> /tmp/dump.txt
>
> I presume that your keyring has an appropriate entry for the client.admin
> user
>
>
> On Wed, May 24, 2023 at 5:10 PM Emmanuel Jaep 
> wrote:
>
>> Absolutely! :-)
>>
>> root@icadmin011:/tmp# ceph --cluster floki daemon mds.icadmin011 dump
>> cache /tmp/dump.txt
>> root@icadmin011:/tmp# ll
>> total 48
>> drwxrwxrwt 12 root root 4096 May 24 13:23  ./
>> drwxr-xr-x 18 root root 4096 Jun  9  2022  ../
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .ICE-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .Test-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .X11-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .XIM-unix/
>> drwxrwxrwt  2 root root 4096 May  4 12:43  .font-unix/
>> drwx--  2 root root 4096 May 24 13:23  ssh-Sl5AiotnXp/
>> drwx--  3 root root 4096 May  8 13:26
>> 'systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf
>> '/
>> drwx--  3 root root 4096 May  4 12:43
>>  
>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi/
>> drwx--  3 root root 4096 May  4 12:43
>>  
>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f/
>> drwx--  3 root root 4096 May  4 12:43
>>  
>> systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i/
>>
>> On Wed, May 24, 2023 at 1:17 PM Milind Changire 
>> wrote:
>>
>>> I hope the daemon mds.icadmin011 is running on the same machine that
>>> you are looking for /tmp/dump.txt, since the file is created on the system
>>> which has that daemon running.
>>>
>>>
>>> On Wed, May 24, 2023 at 2:16 PM Emmanuel Jaep 
>>> wrote:
>>>
 Hi Milind,

 you are absolutely right.

 The dump_ops_in_flight is giving a good hint about what's happening:
 {
 "ops": [
 {
 "description": "internal op exportdir:mds.5:975673",
 "initiated_at": "2023-05-23T17:49:53.030611+0200",
 "age": 60596.355186077999,
 "duration": 60596.355234167997,
 "type_data": {
 "flag_point": "failed to wrlock, waiting",
 "reqid": "mds.5:975673",
 "op_type": "internal_op",
 "internal_op": 5377,
 "op_name": "exportdir",
 "events": [
 {
 "time": "2023-05-23T17:49:53.030611+0200",
 "event": "initiated"
 },
 {
 "time": "2023-05-23T17:49:53.030611+0200",
 "event": "throttled"
 },
 {
 "time": "2023-05-23T17:49:53.030611+0200",
 "event": "header_read"
 },
 {
 "time": "2023-05-23T17:49:53.030611+0200",
 "event": "all_read"
 },
 {
 "time": "2023-05-23T17:49:53.030611+0200",
 "event": "dispatched"
 },
 {
 "time": "2023-05-23T17:49:53.030657+0200",
 "event": "requesting remote authpins"
 },
 {
 "time": "2023-05-23T17:49:53.050253+0200",
 "event": "failed to wrlock, waiting"
 }
 ]
 }
 }
 ],
 "num_ops": 1
 }

 However, the dump cache does not seem to produce an output:
 root@icadmin011:~# ceph --cluster floki daemon mds.icadmin011 dump
 cache /tmp/dump.txt
 root@icadmin011:~# ls /tmp
 ssh-cHvP3iF611

 systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-ceph-mds@icadmin011.service-SGZrKf

 systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-logind.service-uU1GAi

 systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-resolved.service-KYHd7f

 systemd-private-18c17b770fc24c48a0507b8faa1c0ec2-systemd-timesyncd.service-1Qtj5i

 Do you have any hint?

 Best,

 Emmanuel

 On Wed, May 24, 2023 at 10:30 AM Milind Changire 
 wrote:

> Emmanuel,
> You probably missed the "daemon" keyword after the "ceph" c

[ceph-users] Pacific - MDS behind on trimming

2023-05-26 Thread Emmanuel Jaep

Hi,

lately, we have had some issues with our MDSs (Ceph version 16.2.10
Pacific).

Part of them are related to MDS being behind on trimming.

I checked the documentation and found the following information (
https://docs.ceph.com/en/pacific/cephfs/health-messages/):
> CephFS maintains a metadata journal that is divided into *log segments*.
The length of journal (in number of segments) is controlled by the setting
mds_log_max_segments, and when the number of segments exceeds that setting
the MDS starts writing back metadata so that it can remove (trim) the
oldest segments. If this writeback is happening too slowly, or a software
bug is preventing trimming, then this health message may appear. The
threshold for this message to appear is controlled by the config option
mds_log_warn_factor, the default is 2.0.


Some resources on the web (https://www.suse.com/support/kb/doc/?id=19740)
indicated that a solution would be to change the `mds_log_max_segments`.
Which I did:
```
ceph --cluster floki tell mds.* injectargs '--mds_log_max_segments=40'
```

Of course, the warning disappeared, but I have a feeling that I just hid
the problem. Pushing a value to 400'000 when the default value is 512 is a
lot.
 Why is the trimming not taking place? How can I troubleshoot this further?

Best,

Emmanuel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: `ceph features` on Nautilus still reports "luminous"

2023-05-26 Thread Frank Schilder

Hi Oliver,

there is a little bit more to it. The feature flags tell you also what version 
a client should be at a minimum - a bit as indicated by the command name 
set-require-*min-compat-client*. All clusters allow old clients to connect, but 
there is a minimum compatibility cap. If you increase the minimum version 
requirement, older clients might not be able to connect any more. Therefore, 
such min-compat-version increases only happen when there is a reason. The new 
cluster + old client combo is very common, specifically with ceph fs kclients. 
We have our cluster on octopus and clients reporting as kraken.

Your current setting simply says that clients all the way back to luminous will 
be able to connect and any client older than that might get refused.

In other words, there are always 2 versions: those of your daemons (cluster 
version) and those supported to connect (client versions). You should limit 
downwards compatibility only if there is a reason for doing so.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Oliver Schmidt 
Sent: Thursday, May 25, 2023 10:44 PM
To: Wesley Dillingham
Cc: ceph-users@ceph.io; a...@dreamsnake.net
Subject: [ceph-users] Re: `ceph features` on Nautilus still reports "luminous"



> To be honest I am not confident that "ceph osd set-require-min-compat-client 
> nautilus" is a necessary step for you. What prompted you to run that command?
>
> That step is not listed here: 
> https://docs.ceph.com/en/latest/releases/nautilus/#upgrading-from-mimic-or-luminous

You're correct indeed, it's neither listed in the docs for upgrading to Mimic 
nor Nautilus. It apparently just slipped over from my Luminous upgrade 
checklist, which I based the Nautilus upgrade steps upon.

Anthony D'Atri 
> features *happen* to be named after releases don't always correlate 1:1.

That somehow makes sense, but can be a bit confusing. It's good to hear that 
apparently I did not miss any steps. I wonder whether this should be documented 
somewhere, and where the chances of folks actually finding it are the best. 
(docs, manpage)

Kind regards
Oliver
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: BlueStore fragmentation woes

2023-05-26 Thread Igor Fedotov


yeah, definitely this makes sense

On 26/05/2023 09:39, Konstantin Shalygin wrote:

Hi Igor,

Should we backpot this to the p,q and reef release's?


Thanks,
k
Sent from my iPhone


On 25 May 2023, at 23:13, Igor Fedotov  wrote:

You might be facing the issue fixed by https://github.com/ceph/ceph/pull/49885

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-26 Thread Michel Jouvin


Hi,

 I realize that the crushmap I attached to one of my email, probably 
required to understand the discussion here, has been stripped down by 
mailman. To avoid poluting the thread with a long output, I put it on at 
https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are 
interested.


Best regards,

Michel

Le 21/05/2023 à 16:07, Michel Jouvin a écrit :

Hi Eugen,

My LRC pool is also somewhat experimental so nothing really urgent. If 
you manage to do some tests that help me to understand the problem I 
remain interested. I propose to keep this thread for that.


Zitat, I shared my crush map in the email you answered if the 
attachment was not suppressed by mailman.


Cheers,

Michel
Sent from my mobile

Le 18 mai 2023 11:19:35 Eugen Block  a écrit :


Hi, I don’t have a good explanation for this yet, but I’ll soon get
the opportunity to play around with a decommissioned cluster. I’ll try
to get a better understanding of the LRC plugin, but it might take
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I
don’t have anything to confirm it yet.

Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like a 
unique use

case to expand my knowledge. I don't use LRC or anything outside basic
erasure coding.

What is your current crush steps rule?  I know you made changes 
since your
first post and had some thoughts I wanted to share, but wanted to 
see your
rule first so I could try to visualize the distribution better.  The 
only
way I can currently visualize it working is with more servers, I'm 
thinking
6 or 9 per data center min, but that could be my lack of knowledge 
on some

of the step rules.

Thanks
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin <
michel.jou...@ijclab.in2p3.fr> wrote:


Hi Eugen,

Yes, sure, no problem to share it. I attach it to this email (as it may
clutter the discussion if inline).

If somebody on the list has some clue on the LRC plugin, I'm still
interested by understand what I'm doing wrong!

Cheers,

Michel

Le 04/05/2023 à 15:07, Eugen Block a écrit :

Hi,

I don't think you've shared your osd tree yet, could you do that?
Apparently nobody else but us reads this thread or nobody reading this
uses the LRC plugin. ;-)

Thanks,
Eugen

Zitat von Michel Jouvin :


Hi,

I had to restart one of my OSD server today and the problem showed up
again. This time I managed to capture "ceph health detail" output
showing the problem with the 2 PGs:

[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2
pgs down
pg 56.1 is down, acting
[208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
pg 56.12 is down, acting


[NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]


I still doesn't understand why, if I am supposed to survive to a
datacenter failure, I cannot survive to 3 OSDs down on the same host,
hosting shards for the PG. In the second case it is only 2 OSDs down
but I'm surprised they don't seem in the same "group" of OSD (I'd
expected all the the OSDs of one datacenter to be in the same groupe
of 5 if the order given really reflects the allocation done...

Still interested by some explanation on what I'm doing wrong! Best
regards,

Michel

Le 03/05/2023 à 10:21, Eugen Block a écrit :

I think I got it wrong with the locality setting, I'm still limited
by the number of hosts I have available in my test cluster, but as
far as I got with failure-domain=osd I believe k=6, m=3, l=3 with
locality=datacenter could fit your requirement, at least with
regards to the recovery bandwidth usage between DCs, but the
resiliency would not match your requirement (one DC failure). That
profile creates 3 groups of 4 chunks (3 data/coding chunks and one
parity chunk) across three DCs, in total 12 chunks. The min_size=7
would not allow an entire DC to go down, I'm afraid, you'd have to
reduce it to 6 to allow reads/writes in a disaster scenario. I'm
still not sure if I got it right this time, but maybe you're better
off without the LRC plugin with the limited number of hosts. Instead
you could use the jerasure plugin with a profile like k=4 m=5
allowing an entire DC to fail without losing data access (we have
one customer using that).

Zitat von Eugen Block :


Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might
be some misunderstandings on my side. But I tried to play around
with one of my test clusters (Nautilus). Because I'm limited in the
number of hosts (6 across 3 virtual DCs) I tried two different
profiles with lower numbers to get a feeling for how that works.

# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc
k=4 m=2 l=3 crush-failure-domain=host

For every third OSD one parity chunk is added, so 2 more chunks to
store ==> 8 chunks in total. Since my failure-domain is host and I
only have 6 I get incomplete PGs.

# second attempt
ceph:~ # ceph osd erasure-code-

[ceph-users] Re: Encryption per user Howto

2023-05-26 Thread Frank Schilder

Hi all,

jumping on this thread as we have requests for which per-client fs mount 
encryption makes a lot of sense:

> What kind of security to you want to achieve with encryption keys stored
> on the server side?

One of the use cases is if a user requests a share with encryption at rest. 
Since encryption has an unavoidable performance impact, it is impractical to 
make 100% of users pay for the requirements that only 1% of users really have. 
Instead of all-OSD back-end encryption hitting everyone for little reason, 
encrypting only some user-buckets/fs-shares on the front-end application level 
will ensure that the data is encrypted at rest.

It may very well not serve any other purpose, but these are requests we get. If 
I could provide an encryption key to a ceph-fs kernel at mount time, this 
requirement could be solved very elegantly on a per-user (request) basis and 
only making users who want it pay with performance penalties.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Robert Sander 
Sent: Tuesday, May 23, 2023 6:35 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Encryption per user Howto

On 23.05.23 08:42, huxia...@horebdata.cn wrote:
> Indeed, the question is on  server-side encryption with keys managed by ceph 
> on a per-user basis

What kind of security to you want to achieve with encryption keys stored
on the server side?

Regards
--
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de

Tel: 030-405051-43
Fax: 030-405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Encryption per user Howto

2023-05-26 Thread Robert Sander


On 5/26/23 12:26, Frank Schilder wrote:


It may very well not serve any other purpose, but these are requests we get. If 
I could provide an encryption key to a ceph-fs kernel at mount time, this 
requirement could be solved very elegantly on a per-user (request) basis and 
only making users who want it pay with performance penalties.


I understand this use case. But this would still mean that the client 
encrypts the data. In your case the CephFS mount or with S3 the 
rados-gateway.


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Encryption per user Howto

2023-05-26 Thread Frank Schilder

Hi Robert.

> But this would still mean that the client encrypts the data.

Yes and as far as I understood this would be fine for the original request as 
well. Maybe this might sound confusing, but here is my terminology for that:

I don't count the RGW daemon as a storage server, in my terminology its a 
storage gateway, which in itself is a client of the rados back-end store. 
Hence, I count encryption on a gateway as client-sided. For RGW the natural 
place to have keys for such encryption would be the gateway (which was called 
server-sided in an earlier e-mail), while for cephfs if would be on the machine 
that does the actual FS mount.

For the kclient, this would be the host itself and when using ganesha, it would 
have to be in the VFS config on the NFS gateway. All these I count under 
client-sided keys while others might consider a gateway as server-sided. Note 
that client is not the same as user.

The key point here is, that ordinary (end-) users will in none of these cases 
be aware of the encryption or able to bypass it. It happens transparently. It 
is still on application level and, therefore, can be applied selectively.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Robert Sander 
Sent: Friday, May 26, 2023 1:29 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Encryption per user Howto

On 5/26/23 12:26, Frank Schilder wrote:

> It may very well not serve any other purpose, but these are requests we get. 
> If I could provide an encryption key to a ceph-fs kernel at mount time, this 
> requirement could be solved very elegantly on a per-user (request) basis and 
> only making users who want it pay with performance penalties.

I understand this use case. But this would still mean that the client
encrypts the data. In your case the CephFS mount or with S3 the
rados-gateway.

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: ln: failed to create hard link 'file name': Read-only file system

2023-05-26 Thread Frank Schilder

Update to the list: after extensive debugging by Xiubo on our test cluster, the 
issue was identified and fixed. A patch in on its way to distro kernels. The 
tracker for this case is: https://tracker.ceph.com/issues/59515

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Frank Schilder 
Sent: Thursday, March 23, 2023 6:56 PM
To: Xiubo Li; Gregory Farnum
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: ln: failed to create hard link 'file name': Read-only 
file system

Hi Xiubo and Gregory,

sorry for the slow reply, I did some more debugging and didn't have too much 
time. First some questions to collecting logs, but please see also below for 
reproducing the issue yourselves.

I can reproduce it reliably but need some input for these:

> enabling the kclient debug logs and
How do I do that? I thought the kclient ignores the ceph.conf and I'm not aware 
of a mount option to this effect. Is there a "ceph config set ..." setting I 
can change for a specific client (by host name/IP) and how exactly?

> also the mds debug logs
I guess here I should set a higher loglevel for the MDS serving this directory 
(it is pinned to a single rank) or is it something else?

The issue seems to require a certain load to show up. I created a minimal tar 
file mimicking the problem and having 2 directories with a hard link from a 
file in the first to a new name in the second directory. This does not cause 
any problems, so its not that easy to reproduce.

How you can reproduce it:

As an alternative to my limited skills of pulling logs out, I make the 
tgz-archive available to you both. You will receive an e-mail from our 
one-drive with a download link. If you un-tar the archive on an NFS client dir 
that's a re-export of a kclient mount, after some time you should see the 
errors showing up.

I can reliably reproduce these errors on our production- as well as on our test 
cluster. You should be able to reproduce it too with the tgz file.

Here is a result on our set-up:

- production cluster (executed in a sub-dir conda to make cleanup easy):

$ time tar -xzf ../conda.tgz
tar: mambaforge/pkgs/libstdcxx-ng-9.3.0-h6de172a_18/lib/libstdc++.so.6.0.28: 
Cannot hard link to ‘envs/satwindspy/lib/libstdc++.so.6.0.28’: Read-only file 
system
[...]
tar: mambaforge/pkgs/boost-cpp-1.72.0-h9d3c048_4/lib/libboost_log.so.1.72.0: 
Cannot hard link to ‘envs/satwindspy/lib/libboost_log.so.1.72.0’: Read-only 
file system
^C

real1m29.008s
user0m0.612s
sys 0m6.870s

By this time there are already hard links created, so it doesn't fail right 
away:
$ find -type f -links +1
./mambaforge/pkgs/libev-4.33-h516909a_1/share/man/man3/ev.3
./mambaforge/pkgs/libev-4.33-h516909a_1/include/ev++.h
./mambaforge/pkgs/libev-4.33-h516909a_1/include/ev.h
...

- test cluster (octopus latest stable, 3 OSD hosts with 3 HDD OSDs each, simple 
ceph-fs):

# ceph fs status
fs - 2 clients
==
RANK  STATE MDSACTIVITY DNSINOS
 0active  tceph-02  Reqs:0 /s  1807k  1739k
  POOL  TYPE USED  AVAIL
fs-meta1  metadata  18.3G   156G
fs-meta2data   0156G
fs-data data1604G   312G
STANDBY MDS
  tceph-01
  tceph-03
MDS version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) 
octopus (stable)

Its the new recommended 3-pool layout with fs-data being a 4+2 EC pool.

$ time tar -xzf / ... /conda.tgz
tar: mambaforge/ssl/cacert.pem: Cannot hard link to 
‘envs/satwindspy/ssl/cacert.pem’: Read-only file system
[...]
tar: mambaforge/lib/engines-1.1/padlock.so: Cannot hard link to 
‘envs/satwindspy/lib/engines-1.1/padlock.so’: Read-only file system
^C

real6m23.522s
user0m3.477s
sys 0m25.792s

Same story here, a large number of hard links has already been created before 
it starts failing:

$ find -type f -links +1
./mambaforge/lib/liblzo2.so.2.0.0
...

Looking at the output of find in both cases it also looks a bit 
non-deterministic when it starts failing.

It would be great if you can reproduce the issue on a similar test setup using 
the archive conda.tgz. If not, I'm happy to collect any type of logs on our 
test cluster.

We have now one user who has problems with rsync to an NFS share and it would 
be really appreciated if this could be sorted.

Thanks for your help and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Xiubo Li 
Sent: Thursday, March 23, 2023 2:41 AM
To: Frank Schilder; Gregory Farnum
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: ln: failed to create hard link 'file name': 
Read-only file system

Hi Frank,

Could you reproduce it again by enabling the kclient debug logs and also
the mds debug logs ?

I need to know what exactly has happened in kclient and mds side.
Locally I couldn't reproduce it.

Thanks

- Xiubo

On 22/03/2023 23:27, Frank Schilder wrote:
> Hi Gregory,
>
> thanks for your reply. Firs

[ceph-users] Re: BlueStore fragmentation woes

2023-05-26 Thread Stefan Kooman


On 5/25/23 22:12, Igor Fedotov wrote:


On 25/05/2023 20:36, Stefan Kooman wrote:

On 5/25/23 18:17, Igor Fedotov wrote:

Perhaps...

I don't like the idea to use fragmentation score as a real index. IMO 
it's mostly like a very imprecise first turn marker to alert that 
something might be wrong. But not a real quantitative high-quality 
estimate.


Chiming in on the high fragmentation issue. We started collecting 
"fragmentation_rating" of each OSD this afternoon. All OSDs that have 
been provisioned a year ago have a fragmentation rating of ~ 0.9. Not 
sure for how long they are on this level.


Could you please collect allocation probes from existing OSD logs? Just 
a few samples from different OSDs...


10 OSDs from one host, but I have checked other nodes and they are similar:

CNT FRAGSizeRatio   Avg Frag size
21350923371468993170402590721.73982637659271
8534.77053554322
20951932381227693178414776321.8195347808498 8337.31352599283
21188454372989502783894118401.76034315670223
7463.73321072041
21605451393694622704271851521.82220042525379
6868.95810646333
19215230360637132909678182401.87682962941375
8068.16032059705
19293599354649282692384235521.83817068033807
7591.68109835159
19963538360881513157968363521.80770317365589
8750.70702159277
18030613317530982978261770241.76106591606176
9379.43683554909
17889602317180122995501424641.77298589426417
9444.16511551859
18475332332649442660532715521.80050588536109
7998.0074985847
18618154319142192548018831361.71414518324427
7983.96110323113
16437108294218732753503559681.78996651965784
9358.69568766067
17164338286053532494046494721.66655731202683
8718.81040838755
17895480296581023090471772161.65729569701399
10420.3288941416
19546560345885093013687377921.76954456436324
8712.97279081905
18525784348068563148758016001.87883309014075
9046.37297893266
18550989352364382730699489281.89943716747393
7749.64679823767
19085807346055722555120435201.81315738967705
7383.55209155335
17203820312055422770973573121.81387284916954
8879.74826112618
18003801337236702696967618561.87314167713807
7997.25420916525
18655425332271763065118105601.78109992133655
9224.7325069094
2638096545627920335280401.72957736762093
7348.15680925188
24923956447211093287909826561.79430219664968
7352.03106559813
25312482430353932877922263041.70016488308021
6687.33817079351
25841471462766992881684766721.79079197929561
6227.07502693742
25618384437859173215914885121.70915999229303
7344.63294469772
26006097450562062987476664321.73252472295247
6630.55532088077
26684805451967303511002439681.69372532420604
7768.26650883814
24025872424501353532654673921.76685095966548
8321.89267223768
24080466455105253717263237121.88993539410741
8167.91991988666
23195936450950513264738263041.94409274969546
7239.68193990955
23653302433127053075495731201.83114835298683
7100.67803707942
21589455400346703229821091841.85436223378497
8067.56017182107
22469039420427233143237017601.87114023879704
7476.29266924504
23647633434860983700038410241.83891969230071
8508.55464254346
23750561373871393204714536961.57415814304344
8571.70305799542
23142315386402743293410467841.66968058294946
8523.25857689312
23539469395732562925289103361.68114480407353
7392.08596674481
23810938379684992772703805441.59458224619291
7302.64266027477
19361754336102522863916769281.73590946357443
8520.96190555191
20331818341197362560768655361.67814486633709
7505.24170339419
21017537358622213187552829441.70629988661374
.33078531305
21660731426480773292175073281.96891217567865
7719.39863380007
20708620422851243445622620162.04190931119505
8148.54562129225
21371937431584473127541882882.01939800777066
7246.65065654471
21447150400341342836133314561.86664120873869
7084.28790931259
1890646936598724

[ceph-users] Multi region RGW Config Questions - Quincy

2023-05-26 Thread Deep Dish

Hello,



I have a Qunicy (17.2.6) cluster, looking to create a multi-zone /
multi-region RGW service and have a few questions with respect to published
docs - https://docs.ceph.com/en/quincy/radosgw/multisite/.



In general, I understand the process as:



1.   Create a new REALM, ZONEGROUP, ZONE:

radosgw-admin realm create --rgw-realm=my_new_realm [--default]


radosgw-admin zonegroup create --rgw-zonegroup=my_country
--endpoints=http://rgw1:80 --rgw-realm=my_new_realm --master –default



radosgw-admin zone create --rgw-zonegroup=my_country --rgw-zone=my-region *\*

--master --default *\*

--endpoints={http://fqdn}[,{http://fqdn}]





## Question:

If I have multiple RGWs deployed on my cluster, do I specify all of
them as individual endpoints?  OR specifying one rgw automatically
propagates config throughout all?





2.  Create SYSTEM user



radosgw-admin user create --uid="synchronization-user"
--display-name="Synchronization User" --system

radosgw-admin zone modify --rgw-zone={zone-name} --access-key={access-key}
--secret={secret}

radosgw-admin period update --commit





## Question:

The SYSTEM user is used only for replication?   Will creating new
REALM, ZONGROUP, ZONE reset any administrative access to management of
RGWs through ceph-dashboard?



3. Remove DEFAULT REALM, ZONEGROUP, ZONE and supporting pools

radosgw-admin zonegroup delete --rgw-zonegroup=default --rgw-zone=default

radosgw-admin period update --commit

radosgw-admin zone delete --rgw-zone=default

radosgw-admin period update --commit

radosgw-admin zonegroup delete --rgw-zonegroup=default

radosgw-admin period update --commit



ceph osd pool rm default.rgw.control default.rgw.control
--yes-i-really-really-mean-it

ceph osd pool rm default.rgw.data.root default.rgw.data.root
--yes-i-really-really-mean-it

ceph osd pool rm default.rgw.gc default.rgw.gc --yes-i-really-really-mean-it

ceph osd pool rm default.rgw.log default.rgw.log --yes-i-really-really-mean-it

ceph osd pool rm default.rgw.users.uid default.rgw.users.uid
--yes-i-really-really-mean-it



4. UPDATING CEPH CONFIG FILE / RGW CONFIG VIA CEPH ORCH



# QUESTION:

Since I’m using ceph orch would I simply set rgw_zone property via
CLUSTER -> CONFIGURATION on ceph-dashboard?


Thank you.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: mds dump inode crashes file system

2023-05-26 Thread Frank Schilder

Update to the list: a first issue was discovered and fixed on both, the MDS and 
kclient side. the tracker for the bug is here: 
https://tracker.ceph.com/issues/61200 . It contains a link to the kclient 
patchwork. There is no link to the MDS PR (yet).

This bug is responsible for the mount going stale. I still need to confirm that 
it did not lead to meta data corruption and also fixes the original problem 
reported here, the crash on "mds dump inode". TBC.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Wednesday, May 17, 2023 7:43 AM
To: Gregory Farnum; Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: mds dump inode crashes file system


On 5/16/23 21:55, Gregory Farnum wrote:
> On Fri, May 12, 2023 at 5:28 AM Frank Schilder  wrote:
>> Dear Xiubo and others.
>>
 I have never heard about that option until now. How do I check that and 
 how to I disable it if necessary?
 I'm in meetings pretty much all day and will try to send some more info 
 later.
>>> $ mount|grep ceph
>> I get
>>
>> MON-IPs:SRC on DST type ceph 
>> (rw,relatime,name=con-fs2-rit-pfile,secret=,noshare,acl,mds_namespace=con-fs2,_netdev)
>>
>> so async dirop seems disabled.
>>
>>> Yeah, the kclient just received a corrupted snaptrace from MDS.
>>> So the first thing is you need to fix the corrupted snaptrace issue in 
>>> cephfs and then continue.
>> Ooookaaa. I will take it as a compliment that you seem to assume I know 
>> how to do that. The documentation gives 0 hits. Could you please provide me 
>> with instructions of what to look for and/or what to do first?
>>
>>> If possible you can parse the above corrupted snap message to check what 
>>> exactly corrupted.
>>> I haven't get a chance to do that.
>> Again, how would I do that? Is there some documentation and what should I 
>> expect?
>>
>>> You seems didn't enable the 'osd blocklist' cephx auth cap for mon:
>> I can't find anything about an osd blocklist client auth cap in the 
>> documentation. Is this something that came after octopus? Our caps are as 
>> shown in the documentation for a ceph fs client 
>> (https://docs.ceph.com/en/octopus/cephfs/client-auth/), the one for mon is 
>> "allow r":
>>
>>  caps mds = "allow rw path=/shares"
>>  caps mon = "allow r"
>>  caps osd = "allow rw tag cephfs data=con-fs2"
>>
>>
>>> I checked that but by reading the code I couldn't get what had cause the 
>>> MDS crash.
>>> There seems something wrong corrupt the metadata in cephfs.
>> He wrote something about an invalid xattrib (empty value). It would be 
>> really helpful to get a clue how to proceed. I managed to dump the MDS cache 
>> with the critical inode in cache. Would this help with debugging? I also 
>> managed to get debug logs with debug_mds=20 during a crash caused by an "mds 
>> dump inode" command. Would this contain something interesting? I can also 
>> pull the rados objects out and can upload all of these files.
> I was just guessing about the invalid xattr based on the very limited
> crash info, so if it's clearly broken snapshot metadata from the
> kclient logs I would focus on that.

Actually the snaptrace was not corrupted and I have fixed the bug in
kclient side. More detail please see my reply in the last mail.

For the MDS side's crash:

  ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x158) [0x7fe979ae9b92]
  2: (()+0x27ddac) [0x7fe979ae9dac]
  3: (()+0x5ce831) [0x7fe979e3a831]
  4: (InodeStoreBase::dump(ceph::Formatter*) const+0x153) [0x55c08c59b543]
  5: (CInode::dump(ceph::Formatter*, int) const+0x144) [0x55c08c59b8d4]
  6: (MDCache::dump_inode(ceph::Formatter*, unsigned long)+0x7c) 
[0x55c08c41e00c]

I just guess it may corrupted in dumping the 'old_inode':

4383 void InodeStoreBase::dump(Formatter *f) const
4384 {

...
4401   f->open_array_section("old_inodes");
4402   for (const auto &p : old_inodes) {
4403 f->open_object_section("old_inode");
4404 // The key is the last snapid, the first is in the
mempool_old_inode
4405 f->dump_int("last", p.first);
4406 p.second.dump(f);
4407 f->close_section();  // old_inode
4408   }
4409   f->close_section();  // old_inodes
...
4413 }

Because incorrectly parsing the snaptrace in kclient may corrupt the
snaprealm and then send corruped capsnap back to MDS.

Thanks

- Xiubo

> I'm surprised/concerned your system managed to generate one of those,
> of course...I'll let Xiubo work with you on that.
> -Greg
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

2023-05-26 Thread Patrick Begou


Hi,

I'm back working on this problem.

First of all, I saw that I had a hardware memory error so I had to solve 
this first. It's done.


I've tested some different Ceph deployments, each time starting with a 
full OS re-install (it requires some time for each test).


Using Octopus, the devices are found:

   dnf -y install \
   
https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm
   monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }'))
   cephadm bootstrap --mon-ip $monip --initial-dashboard-password x \
  --allow-fqdn-hostname

   [ceph: root@mostha1 /]# *ceph orch device ls*
   Hostname  Path  Type  Serial Size   Health  
   Ident  Fault  Available
   mostha1.legi.grenoble-inp.fr  /dev/sda  hdd S2B5J90ZA02494    250G 
   Unknown  N/A    N/A    Yes
   mostha1.legi.grenoble-inp.fr  /dev/sdc  hdd WD-WMAYP0982329   500G 
   Unknown  N/A    N/A    Yes


But with Pacific or Quincy the command returns nothing.

With Pacific:

   dnf -y install \
   
https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noarch.rpm
   monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }')
   cephadm bootstrap --mon-ip $monip --initial-dashboard-password x \
   --allow-fqdn-hostname


"ceph orch device ls" doesn't return anything but "cephadm shell lsmcli 
ldl"  list all the devices.


   [ceph: root@mostha1 /]# *ceph orch device ls --wide*
   [ceph: root@mostha1 /]# *lsblk*
   NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
   sda    8:0    1 232.9G  0 disk
   |-sda1 8:1    1   3.9G  0 part /rootfs/boot
   |-sda2 8:2    1  78.1G  0 part
   | `-osvg-rootvol 253:0    0  48.8G  0 lvm  /rootfs
   |-sda3 8:3    1   3.9G  0 part [SWAP]
   `-sda4 8:4    1 146.9G  0 part
  |-secretvg-homevol 253:1    0   9.8G  0 lvm  /rootfs/home
  |-secretvg-tmpvol  253:2    0   9.8G  0 lvm  /rootfs/tmp
  `-secretvg-varvol  253:3    0   9.8G  0 lvm  /rootfs/var
   sdb    8:16   1 232.9G  0 disk
   sdc    8:32   1 465.8G  0 disk
   [ceph: root@mostha1 /]# exit
   [root@mostha1 ~]# *cephadm ceph-volume inventory*
   Inferring fsid 2e3e85a8-fbcf-11ed-84e5-00266cf8869c
   Using ceph image with id '0dc91bca92c2' and tag 'v17' created on
   2023-05-25 16:26:31 + UTC
   
quay.io/ceph/ceph@sha256:b8df01a568f4dec7bac6d5040f9391dcca14e00ec7f4de8a3dcf3f2a6502d3a9

   Device Path   Size Device nodes    rotates
   available Model name

   [root@mostha1 ~]# *cephadm shell lsmcli ldl*
   Inferring fsid 4d54823c-fb05-11ed-aecf-00266cf8869c
   Inferring config
   /var/lib/ceph/4d54823c-fb05-11ed-aecf-00266cf8869c/mon.mostha1/config
   Using ceph image with id 'c9a1062f7289' and tag 'v17' created on
   2023-04-25 16:04:33 + UTC
   
quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c64ca62a0b38ab362e614ad671efa4a0547e
   Path | SCSI VPD 0x83    | Link Type | Serial Number   | Health
   Status
   -
   */dev/sda | 50024e92039e4f1c | PATA/SATA | S2B5J90ZA10142  | Good**
   **/dev/sdc | 50014ee0ad5953c9 | PATA/SATA | WD-WMAYP0982329 | Good**
   **/dev/sdb | 50024e920387fa2c | PATA/SATA | S2B5J90ZA02494  | Good**
   *


Could it be a bug in ceph-volume ?
Adam suggest looking to the underlying commands (lsblk, blkid, udevadm, 
lvs, or pvs) but I'm not very comfortable with blkid and udevadm. Is 
there a "debug flag" to set ceph more verbose ?


Thanks

Patrick

Le 15/05/2023 à 21:20, Adam King a écrit :
As you've already seem to have figured out, "ceph orch device ls" is 
populated with the results from "ceph-volume inventory". My best guess 
to try and debug this would be to manually run "cephadm ceph-volume -- 
inventory" (the same as "cephadm ceph-volume inventory", I just like 
to separate the ceph-volume command from cephadm itself with the " -- 
") and then check /var/log/ceph//ceph-volume.log from when you 
ran the command onward to try and see why it isn't seeing your 
devices. For example I can see a line  like


[2023-05-15 19:11:58,048][ceph_volume.main][INFO  ] Running command: 
ceph-volume  inventory


in there. Then if I look onward from there I can see it ran things like

lsblk -P -o 
NAME,KNAME,PKNAME,MAJ:MIN,FSTYPE,MOUNTPOINT,LABEL,UUID,RO,RM,MODEL,SIZE,STATE,OWNER,GROUP,MODE,ALIGNMENT,PHY-SEC,LOG-SEC,ROTA,SCHED,TYPE,DISC-ALN,DISC-GRAN,DISC-MAX,DISC-ZERO,PKNAME,PARTLABEL


as part of getting my device list. So if I was having issues I would 
try running that directly and see what I got. Will note that 
ceph-volume on certain more recent versions (not sure about octopus) 
runs commands through nsenter, so you'd have to look past that part in 
the log lines to the underlying command being used, typically 
something with lsblk, blkid, udevadm, lvs, or pvs.


Also, if you want to see if it's an issue wit

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

2023-05-26 Thread Michel Jouvin


Hi Patrick,

It is weird, we have a couple of clusters with cephadm and running pacify 
or quincy and ceph orch device works well. Have you looked at the cephadm 
logs (ceph log last cephadm)?


Except if you are using a very specific hardware, I suspect Ceph is 
suffering of a problem outside it...


Cheers,

Michel
Sent from my mobile
Le 26 mai 2023 17:02:50 Patrick Begou 
 a écrit :



Hi,

I'm back working on this problem.

First of all, I saw that I had a hardware memory error so I had to solve
this first. It's done.

I've tested some different Ceph deployments, each time starting with a
full OS re-install (it requires some time for each test).

Using Octopus, the devices are found:

   dnf -y install \
   
https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm
   monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }'))
   cephadm bootstrap --mon-ip $monip --initial-dashboard-password x \
  --allow-fqdn-hostname

   [ceph: root@mostha1 /]# *ceph orch device ls*
   Hostname  Path  Type  Serial Size   Health
   Ident  Fault  Available
   mostha1.legi.grenoble-inp.fr  /dev/sda  hdd S2B5J90ZA02494250G
   Unknown  N/AN/AYes
   mostha1.legi.grenoble-inp.fr  /dev/sdc  hdd WD-WMAYP0982329   500G
   Unknown  N/AN/AYes


But with Pacific or Quincy the command returns nothing.

With Pacific:

   dnf -y install \
   
https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noarch.rpm
   monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }')
   cephadm bootstrap --mon-ip $monip --initial-dashboard-password x \
   --allow-fqdn-hostname


"ceph orch device ls" doesn't return anything but "cephadm shell lsmcli
ldl"  list all the devices.

   [ceph: root@mostha1 /]# *ceph orch device ls --wide*
   [ceph: root@mostha1 /]# *lsblk*
   NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
   sda8:01 232.9G  0 disk
   |-sda1 8:11   3.9G  0 part /rootfs/boot
   |-sda2 8:21  78.1G  0 part
   | `-osvg-rootvol 253:00  48.8G  0 lvm  /rootfs
   |-sda3 8:31   3.9G  0 part [SWAP]
   `-sda4 8:41 146.9G  0 part
  |-secretvg-homevol 253:10   9.8G  0 lvm  /rootfs/home
  |-secretvg-tmpvol  253:20   9.8G  0 lvm  /rootfs/tmp
  `-secretvg-varvol  253:30   9.8G  0 lvm  /rootfs/var
   sdb8:16   1 232.9G  0 disk
   sdc8:32   1 465.8G  0 disk
   [ceph: root@mostha1 /]# exit
   [root@mostha1 ~]# *cephadm ceph-volume inventory*
   Inferring fsid 2e3e85a8-fbcf-11ed-84e5-00266cf8869c
   Using ceph image with id '0dc91bca92c2' and tag 'v17' created on
   2023-05-25 16:26:31 + UTC
   
quay.io/ceph/ceph@sha256:b8df01a568f4dec7bac6d5040f9391dcca14e00ec7f4de8a3dcf3f2a6502d3a9

   Device Path   Size Device nodesrotates
   available Model name

   [root@mostha1 ~]# *cephadm shell lsmcli ldl*
   Inferring fsid 4d54823c-fb05-11ed-aecf-00266cf8869c
   Inferring config
   /var/lib/ceph/4d54823c-fb05-11ed-aecf-00266cf8869c/mon.mostha1/config
   Using ceph image with id 'c9a1062f7289' and tag 'v17' created on
   2023-04-25 16:04:33 + UTC
   
quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c64ca62a0b38ab362e614ad671efa4a0547e
   Path | SCSI VPD 0x83| Link Type | Serial Number   | Health
   Status
   -
   */dev/sda | 50024e92039e4f1c | PATA/SATA | S2B5J90ZA10142  | Good**
   **/dev/sdc | 50014ee0ad5953c9 | PATA/SATA | WD-WMAYP0982329 | Good**
   **/dev/sdb | 50024e920387fa2c | PATA/SATA | S2B5J90ZA02494  | Good**
   *


Could it be a bug in ceph-volume ?
Adam suggest looking to the underlying commands (lsblk, blkid, udevadm,
lvs, or pvs) but I'm not very comfortable with blkid and udevadm. Is
there a "debug flag" to set ceph more verbose ?

Thanks

Patrick

Le 15/05/2023 à 21:20, Adam King a écrit :

As you've already seem to have figured out, "ceph orch device ls" is
populated with the results from "ceph-volume inventory". My best guess
to try and debug this would be to manually run "cephadm ceph-volume --
inventory" (the same as "cephadm ceph-volume inventory", I just like
to separate the ceph-volume command from cephadm itself with the " --
") and then check /var/log/ceph//ceph-volume.log from when you
ran the command onward to try and see why it isn't seeing your
devices. For example I can see a line  like

[2023-05-15 19:11:58,048][ceph_volume.main][INFO  ] Running command:
ceph-volume  inventory

in there. Then if I look onward from there I can see it ran things like

lsblk -P -o
NAME,KNAME,PKNAME,MAJ:MIN,FSTYPE,MOUNTPOINT,LABEL,UUID,RO,RM,MODEL,SIZE,STATE,OWNER,GROUP,MODE,ALIGNMENT,PHY-SEC,LOG-SEC,ROTA,SCHED,TYPE,DISC-ALN,DISC-GRAN,DISC-MAX,DISC-ZERO,PKNAME,PARTLABEL

as part of getting my device list. So if I was having issues I woul

[ceph-users] Re: Pacific - MDS behind on trimming

2023-05-26 Thread Dan van der Ster

Hi Emmanuel,

In my experience MDS getting behind on trimming normally happens for
one of two reasons. Either your client workload is simply too
expensive for your metadata pool OSDs to keep up (and btw some ops are
known to be quite expensive such as setting xattrs or deleting files).
Or I've seen this during massive exports of subtrees between
multi-active MDS.

If you're using a single active MDS, you can exclude the 2nd case.

So if it's the former, then it would be useful to know exactly how
many log segments your MDS is accumulating.. is it going in short
bursts then coming back to normal? Or is it stuck at a very high
value?
Injecting mds_log_max_segments=40 is indeed a very large unusual
amount -- you definitely don't want to leave it like this long term.
(And silencing the warning for bursty client IO is better achieved by
increasing the mds_log_warn_factor e.g. to 5 or 10.)

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com

On Fri, May 26, 2023 at 1:29 AM Emmanuel Jaep  wrote:
>
> Hi,
>
> lately, we have had some issues with our MDSs (Ceph version 16.2.10
> Pacific).
>
> Part of them are related to MDS being behind on trimming.
>
> I checked the documentation and found the following information (
> https://docs.ceph.com/en/pacific/cephfs/health-messages/):
> > CephFS maintains a metadata journal that is divided into *log segments*.
> The length of journal (in number of segments) is controlled by the setting
> mds_log_max_segments, and when the number of segments exceeds that setting
> the MDS starts writing back metadata so that it can remove (trim) the
> oldest segments. If this writeback is happening too slowly, or a software
> bug is preventing trimming, then this health message may appear. The
> threshold for this message to appear is controlled by the config option
> mds_log_warn_factor, the default is 2.0.
>
>
> Some resources on the web (https://www.suse.com/support/kb/doc/?id=19740)
> indicated that a solution would be to change the `mds_log_max_segments`.
> Which I did:
> ```
> ceph --cluster floki tell mds.* injectargs '--mds_log_max_segments=40'
> ```
>
> Of course, the warning disappeared, but I have a feeling that I just hid
> the problem. Pushing a value to 400'000 when the default value is 512 is a
> lot.
>  Why is the trimming not taking place? How can I troubleshoot this further?
>
> Best,
>
> Emmanuel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Ceph iscsi gateway semi deprecation warning?

2023-05-26 Thread Mark Kirkwood

I am looking at using an iscsi gateway in front of a ceph setup. However 
the warning in the docs is concerning:


The iSCSI gateway is in maintenance as of November 2022. This means that 
it is no longer in active development and will not be updated to add new 
features.


Does this mean I should be wary of using it, or is it simply that it 
does all the stuff it needs to and no further development is needed?


regards

Mark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: mgr memory usage constantly increasing

2023-05-26 Thread Simon Fowler

It's a bit of a kludge, but just failing the active mgr on a regular 
schedule works around this issue (which we see also on our 17.2.5 
cluster). We just have a cron job that fails the active mgr every 24 
hours - it seems to get up to ~30G, then drop back to 10-15G once it 
goes to backup mode.


Simon Fowler

On 23/5/23 22:14, Tobias Hachmer wrote:

Hi Eugen,

Am 5/23/23 um 12:50 schrieb Eugen Block:
there was a thread [1] just a few weeks ago. Which mgr modules are 
enabled in your case? Also the mgr caps seem to be relevant here.


[1] 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/BKP6EVZZHJMYG54ZW64YABYV6RLPZNQO/


thanks for the hint and link. We actually use the restful module and 
modified the mgr caps to use zabbix monitoring. I now have reverted 
the mgr caps to default and will observe the memory usage. I think w 
ran into the same issue here.


Thanks
Tobias

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] sub-read of librados

2023-05-26 Thread johnnyjohnnypd

Hi!
I am trying to read only part of an object by specifying the non-trivial offset 
and length of the read function: `librados::IoCtxImpl::read(const object_t& 
oid, bufferlist& bl, size_t len, uint64_t off)` from `IoCtxImpl.cc`.
However, after connecting to an erasure code pool (e.g., 12+4), I try to read 
data from a randomly chosen OSD (that is, 1/12 of the object), but the results 
of command `vmstat -d` and `iostat` show that the entire object was read, since 
read operations appeared on all 12 OSDs.
So, I wonder if librados doesn't support the real sub-read of an object, and 
what should I do if I want to implement this function.

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Unexpected behavior of directory mtime after being set explicitly

2023-05-26 Thread sandip . divekar

Hi Milind

It's Kernel Client. 

Thanks
  Sandip
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Unexpected behavior of directory mtime after being set explicitly

2023-05-26 Thread Joseph Fernandes

Hello Greogry,

We are setting the mtime to 01 Jan 1970 00:00



  1.  Create a directory "dir1"
  2.  set mtime of the "dir1 to 0 -> i.e 1 jan 1970
  3.  Create child directory in "dir1" i.e mkdir dir1/dir2 OR Create a file in 
"dir1" i.e "touch dir1/file1
  4.  stat "dir1"
Linux FS : updates the mtime of "dir1"
Ceph FS: DOESNOT update the mtime of "dir1"


Linux command output :


CEPHFS
===
root@sds-ceph:/mnt/cephfs/volumes/_nogroup/test1/d5052b71-39ec-4d0a-9b0b-2091e1723538#
 mkdir dir1
root@sds-ceph:/mnt/cephfs/volumes/_nogroup/test1/d5052b71-39ec-4d0a-9b0b-2091e1723538#
 stat dir1
  File: dir1
  Size: 0   Blocks: 0  IO Block: 65536  directory
Device: 28h/40d Inode: 1099511714911  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
Access: 2023-05-24 11:09:25.260851345 +0530
Modify: 2023-05-24 11:09:25.260851345 +0530
Change: 2023-05-24 11:09:25.260851345 +0530
Birth: 2023-05-24 11:09:25.260851345 +0530
root@sds-ceph:/mnt/cephfs/volumes/_nogroup/test1/d5052b71-39ec-4d0a-9b0b-2091e1723538#
  touch -m -d '26 Aug 1982 22:00' dir1
root@sds-ceph:/mnt/cephfs/volumes/_nogroup/test1/d5052b71-39ec-4d0a-9b0b-2091e1723538#
 stat dir1/
  File: dir1/
  Size: 0   Blocks: 0  IO Block: 65536  directory
Device: 28h/40d Inode: 1099511714911  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
Access: 2023-05-24 11:09:25.260851345 +0530
Modify: 1982-08-26 22:00:00.0 +0530
Change: 2023-05-24 11:10:04.881454967 +0530
Birth: 2023-05-24 11:09:25.260851345 +0530
root@sds-ceph:/mnt/cephfs/volumes/_nogroup/test1/d5052b71-39ec-4d0a-9b0b-2091e1723538#
 mkdir dir1/dir2
root@sds-ceph:/mnt/cephfs/volumes/_nogroup/test1/d5052b71-39ec-4d0a-9b0b-2091e1723538#
 stat dir1/
  File: dir1/
  Size: 1   Blocks: 0  IO Block: 65536  directory
Device: 28h/40d Inode: 1099511714911  Links: 3
Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
Access: 2023-05-24 11:09:25.260851345 +0530
Modify: 1982-08-26 22:00:00.0 +0530
Change: 2023-05-24 11:10:19.141672220 +0530
Birth: 2023-05-24 11:09:25.260851345 +0530
root@sds-ceph:/mnt/cephfs/volumes/_nogroup/test1/d5052b71-39ec-4d0a-9b0b-2091e1723538#


-Joe

From: Gregory Farnum 
Sent: Thursday, May 25, 2023 8:43 PM
To: Sandip Divekar 
Cc: Chris Palmer ; Gavin Lucas 
; Joseph Fernandes 
; Simon Crosland 
; ceph-users@ceph.io; d...@ceph.io
Subject: Re: [ceph-users] Re: Unexpected behavior of directory mtime after 
being set explicitly

* EXTERNAL EMAIL *
I haven’t checked the logs, but the most obvious way this happens is if the 
mtime set on the directory is in the future compared to the time on the client 
or server making changes — CephFS does not move times backwards. (This causes 
some problems but prevents many, many others when times are not synchronized 
well across the clients and servers.)
-Greg

On Thu, May 25, 2023 at 7:58 AM Sandip Divekar 
mailto:sandip.dive...@hitachivantara.com>> 
wrote:
Hi Chris,

Kindly request you that follow steps given in previous mail and paste the 
output here.

The reason behind this request is that we have encountered an issue which is 
easily reproducible on
Latest version of both quincy and pacific, also we have thoroughly investigated 
the matter and we are certain that
No other factors are at play in this scenario.

Note :  We have used Debian 11 for testing.
sdsadmin@ceph-pacific-1:~$ uname -a
Linux ceph-pacific-1 5.10.0-10-amd64 #1 SMP Debian 5.10.84-1 (2021-12-08) 
x86_64 GNU/Linux
sdsadmin@ceph-pacific-1:~$ sudo ceph -v
ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific (stable)

Thanks for your prompt reply.

  Regards
   Sandip Divekar

-Original Message-
From: Chris Palmer mailto:chris.pal...@idnet.com>>
Sent: Thursday, May 25, 2023 7:25 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Unexpected behavior of directory mtime after being 
set explicitly

* EXTERNAL EMAIL *

Hi Milind
I just tried this using the ceph kernel client and ceph-common 17.2.6 package 
in the latest Fedora kernel, against Ceph 17.2.6 and it worked perfectly...
There must be some other factor in play.
Chris

On 25/05/2023 13:04, Sandip Divekar wrote:
> Hello Milind,
>
> We are using Ceph Kernel Client.
> But we found this same behavior while using Libcephfs library.
>
> Should we treat this as a bug?  Or
> Is there any existing bug for similar issue ?
>
> Thanks and Regards,
>Sandip Divekar
>
>
> From: Milind Changire mailto:mchan...@redhat.com>>
> Sent: Thursday, May 25, 2023 4:24 PM
> To: Sandip Divekar 
> mailto:sandip.dive...@hitachivantara.com>>
> Cc: ceph-users@ceph.io; 
> d...@ceph.io
> Subject: Re: [ceph-users] Unexpected behavior of directory mtime after
> being set explicitly
>
> * EXTERNAL EMAIL *
> Sandip,
>

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

2023-05-26 Thread Patrick Begou


Hi Michel,

I do not notice anything strange in the logs files (looking for errors 
or warnings).


The hardware is a DELL C6100 sled (from 2011) running Alma Linux8 
up-to-date. It uses 3 sata disks.


Is there a way to force osd installation by hand with providing the 
device /dev/sdc  for example ? A "do what I say" approach...


Is it a good try to deploy Octopus on the nodes, configure the osd (even 
if podman 4.2.0 is not validated for Octopus)  and then upgrade to 
Pacific? Could this be a workaround for this sort of regression from 
Octopus to Pacific ?


May be updating the BIOS from 1.7.1 to 1.8.1 ?


All this is a little bit confusing for me as I'm trying to discover Ceph 😁

Thanks

Patrick


Le 26/05/2023 à 17:19, Michel Jouvin a écrit :

Hi Patrick,

It is weird, we have a couple of clusters with cephadm and running 
pacify or quincy and ceph orch device works well. Have you looked at 
the cephadm logs (ceph log last cephadm)?


Except if you are using a very specific hardware, I suspect Ceph is 
suffering of a problem outside it...


Cheers,

Michel
Sent from my mobile

Le 26 mai 2023 17:02:50 Patrick Begou 
 a écrit :



Hi,

I'm back working on this problem.

First of all, I saw that I had a hardware memory error so I had to solve
this first. It's done.

I've tested some different Ceph deployments, each time starting with a
full OS re-install (it requires some time for each test).

Using Octopus, the devices are found:

    dnf -y install \
https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm
    monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }'))
    cephadm bootstrap --mon-ip $monip --initial-dashboard-password 
x \

       --allow-fqdn-hostname

    [ceph: root@mostha1 /]# *ceph orch device ls*
    Hostname  Path Type  Serial Size   Health
    Ident  Fault  Available
    mostha1.legi.grenoble-inp.fr  /dev/sda hdd S2B5J90ZA02494    250G
    Unknown  N/A    N/A    Yes
    mostha1.legi.grenoble-inp.fr  /dev/sdc hdd WD-WMAYP0982329   500G
    Unknown  N/A    N/A    Yes


But with Pacific or Quincy the command returns nothing.

With Pacific:

    dnf -y install \
https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noarch.rpm
    monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }')
    cephadm bootstrap --mon-ip $monip --initial-dashboard-password 
x \

    --allow-fqdn-hostname


"ceph orch device ls" doesn't return anything but "cephadm shell lsmcli
ldl"  list all the devices.

    [ceph: root@mostha1 /]# *ceph orch device ls --wide*
    [ceph: root@mostha1 /]# *lsblk*
    NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sda    8:0    1 232.9G 0 disk
    |-sda1 8:1    1   3.9G 0 part /rootfs/boot
    |-sda2 8:2    1  78.1G 0 part
    | `-osvg-rootvol 253:0    0  48.8G 0 lvm  /rootfs
    |-sda3 8:3    1   3.9G 0 part [SWAP]
    `-sda4 8:4    1 146.9G 0 part
       |-secretvg-homevol 253:1    0   9.8G 0 lvm  /rootfs/home
       |-secretvg-tmpvol  253:2    0   9.8G 0 lvm  /rootfs/tmp
       `-secretvg-varvol  253:3    0   9.8G 0 lvm  /rootfs/var
    sdb    8:16   1 232.9G 0 disk
    sdc    8:32   1 465.8G 0 disk
    [ceph: root@mostha1 /]# exit
    [root@mostha1 ~]# *cephadm ceph-volume inventory*
    Inferring fsid 2e3e85a8-fbcf-11ed-84e5-00266cf8869c
    Using ceph image with id '0dc91bca92c2' and tag 'v17' created on
    2023-05-25 16:26:31 + UTC
quay.io/ceph/ceph@sha256:b8df01a568f4dec7bac6d5040f9391dcca14e00ec7f4de8a3dcf3f2a6502d3a9

    Device Path   Size Device nodes    rotates
    available Model name

    [root@mostha1 ~]# *cephadm shell lsmcli ldl*
    Inferring fsid 4d54823c-fb05-11ed-aecf-00266cf8869c
    Inferring config
/var/lib/ceph/4d54823c-fb05-11ed-aecf-00266cf8869c/mon.mostha1/config
    Using ceph image with id 'c9a1062f7289' and tag 'v17' created on
    2023-04-25 16:04:33 + UTC
quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c64ca62a0b38ab362e614ad671efa4a0547e
    Path | SCSI VPD 0x83    | Link Type | Serial Number   | Health
    Status
-
    */dev/sda | 50024e92039e4f1c | PATA/SATA | S2B5J90ZA10142  | Good**
    **/dev/sdc | 50014ee0ad5953c9 | PATA/SATA | WD-WMAYP0982329 | Good**
    **/dev/sdb | 50024e920387fa2c | PATA/SATA | S2B5J90ZA02494  | Good**
    *


Could it be a bug in ceph-volume ?
Adam suggest looking to the underlying commands (lsblk, blkid, udevadm,
lvs, or pvs) but I'm not very comfortable with blkid and udevadm. Is
there a "debug flag" to set ceph more verbose ?

Thanks

Patrick

Le 15/05/2023 à 21:20, Adam King a écrit :

As you've already seem to have figured out, "ceph orch device ls" is
populated with the results from "ceph-volume inventory". My best guess
to try and debug this would be to manually run "cephad

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

2023-05-26 Thread Michel Jouvin


Patrick,

I can only say that I would not expect a specific problem due to your 
hardware. Upgrading the firmware is generally a good idea but I wouldn't 
expect it helps in your case if the osk (lsblk) sees the disk.


As for starting with octopus I don't know if it will help... But we are 
also using the same os as you (centos stream in fact but basically the 
same). We have been running octopus (with cephadm) on this os version 
without problem and upgraded since then in pacific and quincy. In fact on 
of our cluster started in infernetis, the other in luminous and they have 
been upgraded without problems since then...


Michel
Sent from my mobile
Le 26 mai 2023 18:34:22 Patrick Begou 
 a écrit :

Hi Michel,
I do not notice anything strange in the logs files (looking for errors or 
warnings).
The hardware is a DELL C6100 sled (from 2011) running Alma Linux8 
up-to-date. It uses 3 sata disks.
Is there a way to force osd installation by hand with providing the device 
/dev/sdc  for example ? A "do what I say" approach...
Is it a good try to deploy Octopus on the nodes, configure the osd (even if 
podman 4.2.0 is not validated for Octopus)  and then upgrade to Pacific? 
Could this be a workaround for this sort of regression from Octopus to 
Pacific ?

May be updating the BIOS from 1.7.1 to 1.8.1 ?

All this is a little bit confusing for me as I'm trying to discover Ceph 😁
Thanks
Patrick

Le 26/05/2023 à 17:19, Michel Jouvin a écrit :

Hi Patrick,

It is weird, we have a couple of clusters with cephadm and running pacify 
or quincy and ceph orch device works well. Have you looked at the cephadm 
logs (ceph log last cephadm)?


Except if you are using a very specific hardware, I suspect Ceph is 
suffering of a problem outside it...


Cheers,

Michel
Sent from my mobile

Le 26 mai 2023 17:02:50 Patrick Begou 
 a écrit :



Hi,

I'm back working on this problem.

First of all, I saw that I had a hardware memory error so I had to solve
this first. It's done.

I've tested some different Ceph deployments, each time starting with a
full OS re-install (it requires some time for each test).

Using Octopus, the devices are found:

dnf -y install \
https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm
monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }'))
cephadm bootstrap --mon-ip $monip --initial-dashboard-password x \
--allow-fqdn-hostname

[ceph: root@mostha1 /]# *ceph orch device ls*
Hostname  Path  Type  Serial Size   Health
Ident  Fault  Available
mostha1.legi.grenoble-inp.fr  /dev/sda  hdd S2B5J90ZA02494250G
Unknown  N/AN/AYes
mostha1.legi.grenoble-inp.fr  /dev/sdc  hdd WD-WMAYP0982329   500G
Unknown  N/AN/AYes


But with Pacific or Quincy the command returns nothing.

With Pacific:

dnf -y install \
https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noarch.rpm
monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }')
cephadm bootstrap --mon-ip $monip --initial-dashboard-password x \
--allow-fqdn-hostname


"ceph orch device ls" doesn't return anything but "cephadm shell lsmcli
ldl"  list all the devices.

[ceph: root@mostha1 /]# *ceph orch device ls --wide*
[ceph: root@mostha1 /]# *lsblk*
NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda8:01 232.9G  0 disk
|-sda1 8:11   3.9G  0 part /rootfs/boot
|-sda2 8:21  78.1G  0 part
| `-osvg-rootvol 253:00  48.8G  0 lvm  /rootfs
|-sda3 8:31   3.9G  0 part [SWAP]
`-sda4 8:41 146.9G  0 part
|-secretvg-homevol 253:10   9.8G  0 lvm  /rootfs/home
|-secretvg-tmpvol  253:20   9.8G  0 lvm  /rootfs/tmp
`-secretvg-varvol  253:30   9.8G  0 lvm  /rootfs/var
sdb8:16   1 232.9G  0 disk
sdc8:32   1 465.8G  0 disk
[ceph: root@mostha1 /]# exit
[root@mostha1 ~]# *cephadm ceph-volume inventory*
Inferring fsid 2e3e85a8-fbcf-11ed-84e5-00266cf8869c
Using ceph image with id '0dc91bca92c2' and tag 'v17' created on
2023-05-25 16:26:31 + UTC
quay.io/ceph/ceph@sha256:b8df01a568f4dec7bac6d5040f9391dcca14e00ec7f4de8a3dcf3f2a6502d3a9

Device Path   Size Device nodesrotates
available Model name

[root@mostha1 ~]# *cephadm shell lsmcli ldl*
Inferring fsid 4d54823c-fb05-11ed-aecf-00266cf8869c
Inferring config
/var/lib/ceph/4d54823c-fb05-11ed-aecf-00266cf8869c/mon.mostha1/config
Using ceph image with id 'c9a1062f7289' and tag 'v17' created on
2023-04-25 16:04:33 + UTC
quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c64ca62a0b38ab362e614ad671efa4a0547e
Path | SCSI VPD 0x83| Link Type | Serial Number   | Health
Status
-
*/dev/sda | 50024e92039e4f1c | PATA/SATA | S2B5J90ZA10142  | Good**
**/dev/sdc | 50014ee0ad5953c9 | PATA/SATA | WD-WMAYP0982329 | Good**
**/dev/sdb | 50024e920387fa2c | PATA/SATA | S2B5

[ceph-users] Seeking feedback on Improving cephadm bootstrap process

2023-05-26 Thread Redouane Kachach

Dear ceph community,

As you are aware, cephadm has become the default tool for installing Ceph
on bare-metal systems. Currently, during the bootstrap process of a new
cluster, if the user interrupts the process manually or if there are any
issues causing the bootstrap process to fail, cephadm leaves behind the
failed cluster files and processes on the current host. While this can be
beneficial for debugging and resolving issues related to the cephadm
bootstrap process, it can create difficulties for inexperienced users who
need to delete the faulty cluster and proceed with the Ceph installation.
The problem described in the tracker https://tracker.ceph.com/issues/57016 is
a good example of this issue.In the cephadm development team, we are
considering ways to enhance the user experience during the bootstrap of a
new cluster. We have discussed the following options:1) Retain the cluster
files without deleting them, but provide the user with a clear command to
remove the broken/faulty cluster.
2) Automatically delete the broken/failed ceph installation and offer an
option for the user to disable this behavior if desired.Both options have
their advantages and disadvantages, which is why we are seeking your
feedback. We would like to know which option you prefer and the reasoning
behind your choice. Please provide reasonable arguments to justify your
preference.Your feedback will be taken into careful consideration when we
work on improving the ceph bootstrap process.Thank you,
Redouane,
On behalf of cephadm dev team.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Unable to online CephFS, MDS segfaults during mds log replay

2023-05-26 Thread Alfred Heisner

Hello,

   I have a Ceph deployment using CephFS. Recently MDS failed and cannot
start. Attempting to start MDS for this filesystem results in nearly
immediate segfault in MDS. Logs below.

cephfs-journal-tool shows Overall journal integrity state OK
root@proxmox-2:/var/log/ceph# cephfs-journal-tool --rank=galaxy:all journal
inspect
Overall journal integrity: OK

Stack dump / log from MDS:
   -14> 2023-05-26T15:01:09.204-0500 7f27c24b2700  1
mds.0.journaler.mdlog(ro) probing for end of the log
   -13> 2023-05-26T15:01:09.208-0500 7f27c34b4700  1 mds.0.journaler.pq(ro)
_finish_read_head loghead(trim 4194304, expire 4194607, write 4194607,
stream_format 1).  probing for end of log (from 4194607)...
   -12> 2023-05-26T15:01:09.208-0500 7f27c34b4700  1 mds.0.journaler.pq(ro)
probing for end of the log
   -11> 2023-05-26T15:01:09.412-0500 7f27c24b2700  1
mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 2388235687 (header
had 2388213543). recovered.
   -10> 2023-05-26T15:01:09.412-0500 7f27c34b4700  1 mds.0.journaler.pq(ro)
_finish_probe_end write_pos = 4194607 (header had 4194607). recovered.
-9> 2023-05-26T15:01:09.412-0500 7f27c34b4700  4 mds.0.purge_queue
operator(): open complete
-8> 2023-05-26T15:01:09.412-0500 7f27c34b4700  1 mds.0.journaler.pq(ro)
set_writeable
-7> 2023-05-26T15:01:09.412-0500 7f27c1cb1700  4 mds.0.log Journal
0x200 recovered.
-6> 2023-05-26T15:01:09.412-0500 7f27c1cb1700  4 mds.0.log Recovered
journal 0x200 in format 1
-5> 2023-05-26T15:01:09.412-0500 7f27c1cb1700  2 mds.0.6403 Booting: 1:
loading/discovering base inodes
-4> 2023-05-26T15:01:09.412-0500 7f27c1cb1700  0 mds.0.cache creating
system inode with ino:0x100
-3> 2023-05-26T15:01:09.412-0500 7f27c1cb1700  0 mds.0.cache creating
system inode with ino:0x1
-2> 2023-05-26T15:01:09.416-0500 7f27c24b2700  2 mds.0.6403 Booting: 2:
replaying mds log
-1> 2023-05-26T15:01:09.416-0500 7f27c24b2700  2 mds.0.6403 Booting: 2:
waiting for purge queue recovered
 0> 2023-05-26T15:01:09.428-0500 7f27c0caf700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f27c0caf700 thread_name:md_log_replay

 ceph version 17.2.6 (995dec2cdae920da21db2d455e55efbc339bde24) quincy
(stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7f27cd70c140]
 2: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x66c2)
[0x563540fc7372]
 3: (EUpdate::replay(MDSRank*)+0x3c) [0x563540fc8abc]
 4: (MDLog::_replay_thread()+0x7cb) [0x563540f4d0fb]
 5: (MDLog::ReplayThread::entry()+0xd) [0x563540c1fbfd]
 6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f27cd700ea7]
 7: clone()
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

What are the safest steps to recovery at this point?

Thanks,
Al
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Important: RGW multisite bug may silently corrupt encrypted objects on replication

2023-05-26 Thread Casey Bodley

Our downstream QE team recently observed an md5 mismatch of replicated
objects when testing rgw's server-side encryption in multisite. This
corruption is specific to s3 multipart uploads, and only affects the
replicated copy - the original object remains intact. The bug likely
affects Ceph releases all the way back to Luminous where server-side
encryption was first introduced.

To expand on the cause of this corruption: Encryption of multipart
uploads requires special handling around the part boundaries, because
each part is uploaded and encrypted separately. In multisite, objects
are replicated in their encrypted form, and multipart uploads are
replicated as a single part. As a result, the replicated copy loses
its knowledge about the original part boundaries required to decrypt
the data correctly.

We don't have a fix yet, but we're tracking it in
https://tracker.ceph.com/issues/46062. The fix will only modify the
replication logic, so won't repair any objects that have already
replicated incorrectly. We'll need to develop a radosgw-admin command
to search for affected objects and reschedule their replication.

In the meantime, I can only advise multisite users to avoid using
encryption for multipart uploads. If you'd like to scan your cluster
for existing encrypted multipart uploads, you can identify them with a
s3 HeadObject request. The response would include a
x-amz-server-side-encryption header, and the ETag header value (with
"s removed) would be longer than 32 characters (multipart ETags are in
the special form "-"). Take care not to delete the
corrupted replicas, because an active-active multisite configuration
would go on to delete the original copy.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

2023-05-26 Thread Nico Schottelius



Hello Redouane,

much appreciated kick-off for improving cephadm. I was wondering why
cephadm does not use a similar approach to rook in the sense of "repeat
until it is fixed?"

For the background, rook uses a controller that checks the state of the
cluster, the state of monitors, whether there are disks to be added,
etc. It periodically restarts the checks and when needed shifts
monitors, creates OSDs, etc.

My question is, why not have a daemon or checker subcommand of cephadm
that a) checks what the current cluster status is (i.e. cephadm
verify-cluster) and b) fixes the situation (i.e. cephadm 
verify-and-fix-cluster)?

I think that option would be much more beneficial than the other two
suggested ones.

Best regards,

Nico


--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Encryption per user Howto

2023-05-26 Thread Alexander E. Patrakov

Hello Frank,

On Fri, May 26, 2023 at 6:27 PM Frank Schilder  wrote:
>
> Hi all,
>
> jumping on this thread as we have requests for which per-client fs mount 
> encryption makes a lot of sense:
>
> > What kind of security to you want to achieve with encryption keys stored
> > on the server side?
>
> One of the use cases is if a user requests a share with encryption at rest. 
> Since encryption has an unavoidable performance impact, it is impractical to 
> make 100% of users pay for the requirements that only 1% of users really 
> have. Instead of all-OSD back-end encryption hitting everyone for little 
> reason, encrypting only some user-buckets/fs-shares on the front-end 
> application level will ensure that the data is encrypted at rest.
>

I would disagree about the unavoidable performance impact of at-rest
encryption of OSDs. Read the CloudFlare blog article which shows how
they make the encryption impact on their (non-Ceph) drives negligible:
https://blog.cloudflare.com/speeding-up-linux-disk-encryption/. The
main part of their improvements (the ability to disable dm-crypt
workqueues) is already in the mainline kernel. There is also a Ceph
pull request that disables dm-crypt workqueues on certain drives:
https://github.com/ceph/ceph/pull/49554

While the other part of the performance enhancements authored by
CloudFlare (namely, the "xtsproxy" module) is not mainlined yet, I
hope that some equivalent solution will find its way into the official
kernel sooner or later.

In summary: just encrypt everything.

> It may very well not serve any other purpose, but these are requests we get. 
> If I could provide an encryption key to a ceph-fs kernel at mount time, this 
> requirement could be solved very elegantly on a per-user (request) basis and 
> only making users who want it pay with performance penalties.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Robert Sander 
> Sent: Tuesday, May 23, 2023 6:35 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: Encryption per user Howto
>
> On 23.05.23 08:42, huxia...@horebdata.cn wrote:
> > Indeed, the question is on  server-side encryption with keys managed by 
> > ceph on a per-user basis
>
> What kind of security to you want to achieve with encryption keys stored
> on the server side?
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Encryption per user Howto

2023-05-26 Thread Alexander E. Patrakov

On Sat, May 27, 2023 at 5:09 AM Alexander E. Patrakov
 wrote:
>
> Hello Frank,
>
> On Fri, May 26, 2023 at 6:27 PM Frank Schilder  wrote:
> >
> > Hi all,
> >
> > jumping on this thread as we have requests for which per-client fs mount 
> > encryption makes a lot of sense:
> >
> > > What kind of security to you want to achieve with encryption keys stored
> > > on the server side?
> >
> > One of the use cases is if a user requests a share with encryption at rest. 
> > Since encryption has an unavoidable performance impact, it is impractical 
> > to make 100% of users pay for the requirements that only 1% of users really 
> > have. Instead of all-OSD back-end encryption hitting everyone for little 
> > reason, encrypting only some user-buckets/fs-shares on the front-end 
> > application level will ensure that the data is encrypted at rest.
> >
>
> I would disagree about the unavoidable performance impact of at-rest
> encryption of OSDs. Read the CloudFlare blog article which shows how
> they make the encryption impact on their (non-Ceph) drives negligible:
> https://blog.cloudflare.com/speeding-up-linux-disk-encryption/. The
> main part of their improvements (the ability to disable dm-crypt
> workqueues) is already in the mainline kernel. There is also a Ceph
> pull request that disables dm-crypt workqueues on certain drives:
> https://github.com/ceph/ceph/pull/49554
>
> While the other part of the performance enhancements authored by
> CloudFlare (namely, the "xtsproxy" module) is not mainlined yet, I
> hope that some equivalent solution will find its way into the official
> kernel sooner or later.
>
> In summary: just encrypt everything.

As a follow-up, if you disagree with the advice to encrypt everything,
please note that CephFS allows one to place certain directories on a
separate pool. Therefore, you can create a separate device class for
encrypted OSDs, create a pool that uses this device class, and put the
directories owned by your premium users onto this pool.

Documentation: https://docs.ceph.com/en/latest/cephfs/file-layouts/

>
> > It may very well not serve any other purpose, but these are requests we 
> > get. If I could provide an encryption key to a ceph-fs kernel at mount 
> > time, this requirement could be solved very elegantly on a per-user 
> > (request) basis and only making users who want it pay with performance 
> > penalties.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Robert Sander 
> > Sent: Tuesday, May 23, 2023 6:35 PM
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] Re: Encryption per user Howto
> >
> > On 23.05.23 08:42, huxia...@horebdata.cn wrote:
> > > Indeed, the question is on  server-side encryption with keys managed by 
> > > ceph on a per-user basis
> >
> > What kind of security to you want to achieve with encryption keys stored
> > on the server side?
> >
> > Regards
> > --
> > Robert Sander
> > Heinlein Support GmbH
> > Linux: Akademie - Support - Hosting
> > http://www.heinlein-support.de
> >
> > Tel: 030-405051-43
> > Fax: 030-405051-19
> >
> > Zwangsangaben lt. §35a GmbHG:
> > HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> > Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> --
> Alexander E. Patrakov



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] cephfs-data-scan with multiple data pools

2023-05-26 Thread Justin Li

Dear All,

I'm trying to recover failed MDS metadata by following the link below but
having troubles. Thanks in advance.

Question1: how to scan 2 data pools with scan_extents (cmd 1). The cmd
didn't work with two pools specified. Should I scan one then the other?

Question2: As to scan_inodes (cmd 2), should I only specify the first data
pool per the document. I'm concerned if the 2nd pool is not scanned then
that'll cause metadata loss.

*my fs name: cephfs, data pools: cephfs_hdd, cephfs_ssd*

cmd 1: cephfs-data-scan scan_extents --filesystem cephfs cephfs_hdd
cephfs_ssd
cmd 2: cephfs-data-scan scan_inodes --filesystem cephfs cephfs_hdd

cephfs-data-scan scan_extents [ [
...]]cephfs-data-scan scan_inodes []cephfs-data-scan
scan_links


Note, the data pool parameters for ‘scan_extents’, ‘scan_inodes’ and
‘cleanup’ commands are optional, and usually the tool will be able to
detect the pools automatically. Still you may override this. The
‘scan_extents’ command needs all data pools to be specified,* while
‘scan_inodes’ and ‘cleanup’ commands need only the main data pool.*

*https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/
*





-- 
Best Regards,
*Justin Li*
IT Support/Systems Administrator
*justin.li2...@gmail.com *
  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Ceph iscsi gateway semi deprecation warning?

2023-05-26 Thread Alexander E. Patrakov

On Sat, May 27, 2023 at 12:21 AM Mark Kirkwood
 wrote:
>
> I am looking at using an iscsi gateway in front of a ceph setup. However
> the warning in the docs is concerning:
>
> The iSCSI gateway is in maintenance as of November 2022. This means that
> it is no longer in active development and will not be updated to add new
> features.
>
> Does this mean I should be wary of using it, or is it simply that it
> does all the stuff it needs to and no further development is needed?

Hello Mark,

The planned replacement is based on the newer NVMe-oF protocol and
SPDK. See this presentation for the purported performance benefits:
https://ci.spdk.io/download/2022-virtual-forum-prc/D2_4_Yue_A_Performance_Study_for_Ceph_NVMeoF_Gateway.pdf

The git repository is here: https://github.com/ceph/ceph-nvmeof.
However, this is not yet something recommended for a production-grade
setup. At the very least, wait until this subproject makes it into
Ceph documentation and becomes available as RPMs and DEBs.

For now, you can still use ceph-iscsi - assuming that you need it,
i.e. that raw RBD is not an option.

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

2023-05-26 Thread Sake Paulusma

Just a user opinion, maybe add the following additions to the options?

For option 1:
* Clear instructions how to remove all traces to the failed installation (if 
you can automate it, you can write a manual) or provide instructions to start a 
cleanup script.
* Don't allow another deployment of Cephadm if there's a failed deployment, 
only if everything is cleaned up.

For option 2:
* If an installation failed and gotten completely removed, don't allow another 
run unless the user sets an override (or removes the thing which triggers the 
check for failed installations). This to prevent a user in an endless loop to 
try and deploy Cephadm. Inform the user about the last failed deployment, show 
the available options for a retry and the option to keep the deployment files 
to troubleshoot the issue.
* If the deployment failed (or got interrupted) and the user wanted to keep a 
failed deployment, provide just like Option 1 clear instructions how to clean 
up the failed deployment.

With the above additions, I would prefer Option 1. Because there's almost 
always a reason a deployment fails and I would like to investigate directly why 
it happened.

Best regards,
Sake
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Troubleshooting "N slow requests are blocked > 30 secs" on Pacific

[ceph-users] Pacific - MDS behind on trimming

[ceph-users] Re: `ceph features` on Nautilus still reports "luminous"

[ceph-users] Re: BlueStore fragmentation woes

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

[ceph-users] Re: Encryption per user Howto

[ceph-users] Re: Encryption per user Howto

[ceph-users] Re: Encryption per user Howto

[ceph-users] Re: ln: failed to create hard link 'file name': Read-only file system

[ceph-users] Re: BlueStore fragmentation woes

[ceph-users] Multi region RGW Config Questions - Quincy

[ceph-users] Re: mds dump inode crashes file system

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

[ceph-users] Re: Pacific - MDS behind on trimming

[ceph-users] Ceph iscsi gateway semi deprecation warning?

[ceph-users] Re: mgr memory usage constantly increasing

[ceph-users] sub-read of librados

[ceph-users] Re: Unexpected behavior of directory mtime after being set explicitly

[ceph-users] Re: Unexpected behavior of directory mtime after being set explicitly

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

[ceph-users] Seeking feedback on Improving cephadm bootstrap process

[ceph-users] Unable to online CephFS, MDS segfaults during mds log replay

[ceph-users] Important: RGW multisite bug may silently corrupt encrypted objects on replication

[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

[ceph-users] Re: Encryption per user Howto

[ceph-users] Re: Encryption per user Howto

[ceph-users] cephfs-data-scan with multiple data pools

[ceph-users] Re: Ceph iscsi gateway semi deprecation warning?

[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

31 matches

Site Navigation

Mail list logo

Footer information