[ceph-users] How to repair the OSDs while WAL/DB device breaks down

2023-03-14 Thread Norman

hi, everyone,

I have a question about repairing the broken WAL/DB device.

I have a cluster with 8 OSDs and 4 WAL/DB devices (1 OSD per WAL/DB 
device). How can I repair the OSDs quickly if one WAL/DB device breaks 
down, without rebuilding them? Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 10x more used space than expected

2023-03-14 Thread Richard Bade
Hi,
I found the documentation for metadata get unhelpful about which
syntax to use. I eventually found that it's this:
radosgw-admin metadata get bucket:{bucket_name}
or
radosgw-admin metadata get bucket.instance:{bucket_name}:{instance_id}

Hopefully that helps you or someone else struggling with this.
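
For example, a minimal sketch (the bucket name is illustrative, and the exact
JSON field path may differ slightly between releases):

radosgw-admin metadata get bucket:registry
# the instance id is reported as "bucket_id" in the output, so something like:
instance_id=$(radosgw-admin metadata get bucket:registry | jq -r '.data.bucket.bucket_id')
radosgw-admin metadata get "bucket.instance:registry:${instance_id}"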

Rich

On Wed, 15 Mar 2023 at 07:18, Gaël THEROND  wrote:
>
> Alright,
> Seems something is odd out there, if I do a radosgw-admin metadata list
>
> I’ve got the following list:
>
> [
> "bucket",
> "bucket.instance",
> "otp",
> "user"
> ]
>
> BUT
>
> When I try a radosgw-admin metadata get bucket or bucket.instance it
> complain with the following error:
>
> ERROR: can’t get key: (22) Invalid argument
>
> Ok, fine for the api, I’ll deal with the s3 api.
>
> Even if a radosgw-admin bucket flush version --keep-current or something
> similar would be much appreciated xD
>
> On Tue, 14 Mar 2023 at 19:07, Robin H. Johnson  wrote:
>
> > On Tue, Mar 14, 2023 at 06:59:51PM +0100, Gaël THEROND wrote:
> > > Versioning wasn’t enabled, at least not explicitly and for the
> > > documentation it isn’t enabled by default.
> > >
> > > Using nautilus.
> > >
> > > I’ll get all the required missing information on tomorrow morning, thanks
> > > for the help!
> > >
> > > Is there a way to tell CEPH to delete versions that aren’t current used
> > one
> > > with radosgw-admin?
> > >
> > > If not I’ll use the rest api no worries.
> > Nope, s3 API only.
> >
> > You should also check for incomplete multiparts. For that, I recommend
> > using AWSCLI or boto directly. Specifically not s3cmd, because s3cmd
> > doesn't respect the  flag properly.
> >
> > --
> > Robin Hugh Johnson
> > Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> > E-Mail   : robb...@gentoo.org
> > GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> > GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 10x more used space than expected

2023-03-14 Thread Gaël THEROND
Thanks a lot for this spreadsheet, I’ll check it, but I doubt we store
data smaller than the min_alloc size.

Yes, we do use an EC 2+1 pool with the failure domain at host level.
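
As a rough back-of-the-envelope check of the amplification for one small
RADOS object (assuming bluestore_min_alloc_size_hdd = 64 KiB, the pre-Pacific
default on HDD OSDs; worth verifying on the cluster):

# 4 KiB object on an EC k=2,m=1 pool: each of the 3 shards still allocates a
# full min_alloc unit, so 3 x 64 KiB = 192 KiB get consumed for 4 KiB of data
obj_kb=4; k=2; m=1; min_alloc_kb=64
chunk_kb=$(( (obj_kb + k - 1) / k ))
alloc_kb=$(( ( (chunk_kb + min_alloc_kb - 1) / min_alloc_kb ) * min_alloc_kb * (k + m) ))
echo "${obj_kb} KiB logical -> ${alloc_kb} KiB allocated"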

Le mar. 14 mars 2023 à 19:38, Mark Nelson  a
écrit :

> Is it possible that you are storing object (chunks if EC) that are
> smaller than the min_alloc size?  This cheat sheet might help:
>
>
> https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing
>
> Mark
>
> On 3/14/23 12:34, Gaël THEROND wrote:
> > Hi everyone, I’ve got a quick question regarding one of our RadosGW
> bucket.
> >
> > This bucket is used to store docker registries, and the total amount of
> > data we use is supposed to be 4.5Tb BUT it looks like ceph told us we
> > rather use ~53Tb of data.
> >
> > One interesting thing is, this bucket seems to shard for unknown reason
> as
> > it is supposed to be disabled by default, but even taking that into
> account
> > we’re not supposed to see such a massive amount of additional data isn’t
> it?
> >
> > Here is the bucket stats of it:
> > https://paste.opendev.org/show/bdWFRvNFtxyHnbPfXWu9/
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 10x more used space than expected

2023-03-14 Thread Mark Nelson
Is it possible that you are storing objects (chunks, if EC) that are 
smaller than the min_alloc size?  This cheat sheet might help:


https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing

Mark

On 3/14/23 12:34, Gaël THEROND wrote:

Hi everyone, I’ve got a quick question regarding one of our RadosGW bucket.

This bucket is used to store docker registries, and the total amount of
data we use is supposed to be 4.5Tb BUT it looks like ceph told us we
rather use ~53Tb of data.

One interesting thing is, this bucket seems to shard for unknown reason as
it is supposed to be disabled by default, but even taking that into account
we’re not supposed to see such a massive amount of additional data isn’t it?

Here is the bucket stats of it:
https://paste.opendev.org/show/bdWFRvNFtxyHnbPfXWu9/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS thrashing through the page cache

2023-03-14 Thread Ashu Pachauri
Got the answer to my own question; posting here if someone else
encounters the same problem. The issue is that the default stripe size in a
cephfs mount is 4 MB. If you are doing small reads (like 4k reads in the
test I posted) inside the file, you'll end up pulling at least 4MB to the
client (and then discarding most of the pulled data) even if you set
readahead to zero. So, the solution for us was to set a lower stripe size,
which aligns better with our workloads.
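
In case it helps someone else, the layout can be set per directory for newly
created files; a minimal sketch (the mount path and the 64 KiB values are
purely illustrative, not a recommendation):

setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/db-dir
setfattr -n ceph.dir.layout.object_size -v 65536 /mnt/cephfs/db-dir
getfattr -n ceph.dir.layout /mnt/cephfs/db-dir   # layout that new files will inherit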

Thanks and Regards,
Ashu Pachauri


On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri  wrote:

> Also, I am able to reproduce the network read amplification when I try to
> do very small reads from larger files. e.g.
>
> for i in $(seq 1 10000); do
>   dd if=test_${i} of=/dev/null bs=5k count=10
> done
>
>
> This piece of code generates a network traffic of 3.3 GB while it actually
> reads approx 500 MB of data.
>
>
> Thanks and Regards,
> Ashu Pachauri
>
> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri 
> wrote:
>
>> We have an internal use case where we back the storage of a proprietary
>> database by a shared file system. We noticed something very odd when
>> testing some workload with a local block device backed file system vs
>> cephfs. We noticed that the amount of network IO done by cephfs is almost
>> double compared to the IO done in case of a local file system backed by an
>> attached block device.
>>
>> We also noticed that CephFS thrashes through the page cache very quickly
>> compared to the amount of data being read and think that the two issues
>> might be related. So, I wrote a simple test.
>>
>> 1. I wrote 10k files 400KB each using dd (approx 4 GB data).
>> 2. I dropped the page cache completely.
>> 3. I then read these files serially, again using dd. The page cache usage
>> shot up to 39 GB for reading such a small amount of data.
>>
>> Following is the code used to repro this in bash:
>>
>> for i in $(seq 1 10000); do
>>   dd if=/dev/zero of=test_${i} bs=4k count=100
>> done
>>
>> sync; echo 1 > /proc/sys/vm/drop_caches
>>
>> for i in $(seq 1 10000); do
>>   dd if=test_${i} of=/dev/null bs=4k count=100
>> done
>>
>>
>> The ceph version being used is:
>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus
>> (stable)
>>
>> The ceph configs being overridden:
>> WHO     MASK  LEVEL     OPTION                                  VALUE        RO
>> mon           advanced  auth_allow_insecure_global_id_reclaim  false
>> mgr           advanced  mgr/balancer/mode                      upmap
>> mgr           advanced  mgr/dashboard/server_addr              127.0.0.1    *
>> mgr           advanced  mgr/dashboard/server_port              8443         *
>> mgr           advanced  mgr/dashboard/ssl                      false        *
>> mgr           advanced  mgr/prometheus/server_addr             0.0.0.0      *
>> mgr           advanced  mgr/prometheus/server_port             9283         *
>> osd           advanced  bluestore_compression_algorithm        lz4
>> osd           advanced  bluestore_compression_mode             aggressive
>> osd           advanced  bluestore_throttle_bytes               536870912
>> osd           advanced  osd_max_backfills                      3
>> osd           advanced  osd_op_num_threads_per_shard_ssd       8            *
>> osd           advanced  osd_scrub_auto_repair                  true
>> mds           advanced  client_oc                              false
>> mds           advanced  client_readahead_max_bytes             4096
>> mds           advanced  client_readahead_max_periods           1
>> mds           advanced  client_readahead_min                   0
>> mds           basic     mds_cache_memory_limit                 21474836480
>> client        advanced  client_oc                              false
>> client        advanced  client_readahead_max_bytes             4096
>> client        advanced  client_readahead_max_periods           1
>> client        advanced  client_readahead_min                   0
>> client        advanced  fuse_disable_pagecache                 false
>>
>>
>> The cephfs mount options (note that readahead was disabled for this test):
>> /mnt/cephfs type ceph
>> (rw,relatime,name=cephfs,secret=,acl,rasize=0)
>>
>> Any help or pointers are appreciated; this is a major performance issue
>> for us.
>>
>>
>> Thanks and Regards,
>> Ashu Pachauri
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 10x more used space than expected

2023-03-14 Thread Gaël THEROND
Alright,
Something seems odd out there: if I do a radosgw-admin metadata list

I’ve got the following list:

[
"bucket",
"bucket.instance",
"otp",
"user"
]

BUT

When I try radosgw-admin metadata get bucket or bucket.instance, it
complains with the following error:

ERROR: can’t get key: (22) Invalid argument

Ok, fine for the API, I’ll deal with the S3 API.

Even so, a radosgw-admin bucket flush version --keep-current or something
similar would be much appreciated xD

On Tue, 14 Mar 2023 at 19:07, Robin H. Johnson  wrote:

> On Tue, Mar 14, 2023 at 06:59:51PM +0100, Gaël THEROND wrote:
> > Versioning wasn’t enabled, at least not explicitly and for the
> > documentation it isn’t enabled by default.
> >
> > Using nautilus.
> >
> > I’ll get all the required missing information on tomorrow morning, thanks
> > for the help!
> >
> > Is there a way to tell CEPH to delete versions that aren’t current used
> one
> > with radosgw-admin?
> >
> > If not I’ll use the rest api no worries.
> Nope, s3 API only.
>
> You should also check for incomplete multiparts. For that, I recommend
> using AWSCLI or boto directly. Specifically not s3cmd, because s3cmd
> doesn't respect the  flag properly.
>
> --
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> E-Mail   : robb...@gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 10x more used space than expected

2023-03-14 Thread Robin H. Johnson
On Tue, Mar 14, 2023 at 06:59:51PM +0100, Gaël THEROND wrote:
> Versioning wasn’t enabled, at least not explicitly and for the
> documentation it isn’t enabled by default.
> 
> Using nautilus.
> 
> I’ll get all the required missing information on tomorrow morning, thanks
> for the help!
> 
> Is there a way to tell CEPH to delete versions that aren’t current used one
> with radosgw-admin?
> 
> If not I’ll use the rest api no worries.
Nope, s3 API only.

You should also check for incomplete multiparts. For that, I recommend
using AWSCLI or boto directly. Specifically not s3cmd, because s3cmd
doesn't respect the  flag properly.
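
A sketch with awscli (the endpoint, credentials and bucket name here are all
illustrative):

aws --endpoint-url http://rgw.example.com:8080 s3api list-multipart-uploads --bucket registry
# anything listed still consumes space until it is aborted, e.g.:
aws --endpoint-url http://rgw.example.com:8080 s3api abort-multipart-upload \
    --bucket registry --key <object-key> --upload-id <upload-id>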

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 10x more used space than expected

2023-03-14 Thread Gaël THEROND
Versioning wasn’t enabled, at least not explicitly, and according to the
documentation it isn’t enabled by default.

Using nautilus.

I’ll get all the required missing information tomorrow morning, thanks
for the help!

Is there a way to tell Ceph to delete versions that aren’t the currently
used ones with radosgw-admin?

If not I’ll use the rest api no worries.

On Tue, 14 Mar 2023 at 18:49, Robin H. Johnson  wrote:

> On Tue, Mar 14, 2023 at 06:34:54PM +0100, Gaël THEROND wrote:
> > Hi everyone, I’ve got a quick question regarding one of our RadosGW
> bucket.
> >
> > This bucket is used to store docker registries, and the total amount of
> > data we use is supposed to be 4.5Tb BUT it looks like ceph told us we
> > rather use ~53Tb of data.
> >
> > One interesting thing is, this bucket seems to shard for unknown reason
> as
> > it is supposed to be disabled by default, but even taking that into
> account
> > we’re not supposed to see such a massive amount of additional data isn’t
> it?
> >
> > Here is the bucket stats of it:
> > https://paste.opendev.org/show/bdWFRvNFtxyHnbPfXWu9/
> At a glance, is versioning enabled?
>
> And if so, are you pruning old versions?
>
> Please share "radosgw-admin metadata get" for the bucket &
> bucket-instance.
>
> --
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> E-Mail   : robb...@gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 10x more used space than expected

2023-03-14 Thread Robin H. Johnson
On Tue, Mar 14, 2023 at 06:34:54PM +0100, Gaël THEROND wrote:
> Hi everyone, I’ve got a quick question regarding one of our RadosGW bucket.
> 
> This bucket is used to store docker registries, and the total amount of
> data we use is supposed to be 4.5Tb BUT it looks like ceph told us we
> rather use ~53Tb of data.
> 
> One interesting thing is, this bucket seems to shard for unknown reason as
> it is supposed to be disabled by default, but even taking that into account
> we’re not supposed to see such a massive amount of additional data isn’t it?
> 
> Here is the bucket stats of it:
> https://paste.opendev.org/show/bdWFRvNFtxyHnbPfXWu9/
At a glance, is versioning enabled?

And if so, are you pruning old versions?

Please share "radosgw-admin metadata get" for the bucket &
bucket-instance.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 10x more used space than expected

2023-03-14 Thread Gaël THEROND
Hi everyone, I’ve got a quick question regarding one of our RadosGW bucket.

This bucket is used to store docker registries, and the total amount of
data we use is supposed to be 4.5 TB, BUT it looks like Ceph tells us we
actually use ~53 TB of data.

One interesting thing is, this bucket seems to be sharded for an unknown
reason, as sharding is supposed to be disabled by default, but even taking
that into account we’re not supposed to see such a massive amount of
additional data, are we?

Here is the bucket stats of it:
https://paste.opendev.org/show/bdWFRvNFtxyHnbPfXWu9/
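
In case it helps others hitting the same symptom, a few radosgw-admin checks
that are commonly used to investigate this kind of gap (the bucket name is
illustrative; exact flags and output fields vary a bit between releases):

radosgw-admin bucket stats --bucket=registry      # compare num_objects / size vs size_actual
radosgw-admin gc list --include-all | head        # space still pending garbage collection
radosgw-admin bucket check --bucket=registry      # bucket index consistency / leftover multipart entries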
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Last day to sponsor Cephalocon Amsterdam 2023

2023-03-14 Thread Mike Perez
Hi everyone,

Today is the last day to sponsor Cephalocon Amsterdam 2023! I want to
thank our current sponsors:

Platinum: IBM
Silver: 42on, Canonical Ubuntu, Clyso
Startup: Koor

Also, thank you to Clyso for their lanyard add-on and 42on's offsite
attendee party.

We are still short of covering the costs for the event, so I'm asking
contributors and members of the Ceph Foundation to consider
applying today.

https://events.linuxfoundation.org/cephalocon/sponsor/
Sponsor Prospectus:
https://events.linuxfoundation.org/wp-content/uploads/2023/03/sponsor-ceph-23_030923.pdf

Please get in touch with us at sponsorships@ceph.foundation to get
started. Thank you!

-- 
Mike Perez
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg wait too long when osd restart

2023-03-14 Thread yite gu
Hello, baergen
Thanks for your reply, I got it. ☺

Best regards
Yitte Gu

On Mon, 13 Mar 2023 at 23:15, Josh Baergen  wrote:

> (trimming out the dev list and Radoslaw's email)
>
> Hello,
>
> I think the two critical PRs were:
> * https://github.com/ceph/ceph/pull/44585 - included in 15.2.16
> * https://github.com/ceph/ceph/pull/45655 - included in 15.2.17
>
> I don't have any comments on tweaking those configuration values, and
> what safe values would be.
>
> Josh
>
> On Sun, Mar 12, 2023 at 9:43 PM yite gu  wrote:
> >
> > Hello, Baergen
> > Thanks for your reply. Restart osd in planned, but my version is 15.2.7,
> so, I may have encountered the problem you said. Could you provide PR to me
> about optimize this mechanism? Besides that, if I don't want to upgrade
> version in recently, is a good way that adjust
> osd_pool_default_read_lease_ratio to lower? For example, 0.4 or 0.2 to
> reach the user's tolerance time.
> >
> > Yite Gu
> >
> > On Fri, 10 Mar 2023 at 22:09, Josh Baergen  wrote:
> >>
> >> Hello,
> >>
> >> When you say "osd restart", what sort of restart are you referring to
> >> - planned (e.g. for upgrades or maintenance) or unplanned (OSD
> >> hang/crash, host issue, etc.)? If it's the former, then these
> >> parameters shouldn't matter provided that you're running a recent
> >> enough Ceph with default settings - it's supposed to handle planned
> >> restarts with little I/O wait time. There were some issues with this
> >> mechanism before Octopus 15.2.17 / Pacific 16.2.8 that could cause
> >> planned restarts to wait for the read lease timeout in some
> >> circumstances.
> >>
> >> Josh
> >>
> >> On Fri, Mar 10, 2023 at 1:31 AM yite gu  wrote:
> >> >
> >> > Hi all,
> >> > osd_heartbeat_grace = 20 and osd_pool_default_read_lease_ratio = 0.8
> by
> >> > default, so, pg will wait 16s when osd restart in the worst case.
> This wait
> >> > time is too long, client i/o can not be unacceptable. I think
> adjusting
> >> > the osd_pool_default_read_lease_ratio to lower is a good way. Have
> any good
> >> > suggestions about reduce pg wait time?
> >> >
> >> > Best Regard
> >> > Yite Gu
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade 16.2.11 -> 17.2.0 failed

2023-03-14 Thread Robert Sander

On 14.03.23 14:21, bbk wrote:

# ceph orch upgrade start --ceph-version 17.2.0


I would never recommend updating to a .0 release.

Why not go directly to the latest 17.2.5?

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade 16.2.11 -> 17.2.0 failed

2023-03-14 Thread Adam King
That's very odd, I haven't seen this before. What container image is the
upgraded mgr running on (to know for sure, can check the podman/docker run
command at the end of the /var/lib/ceph//mgr./unit.run file
on the mgr's host)? Also, could maybe try "ceph mgr module enable cephadm"
to see if it does anything?
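
Something along these lines, for instance (the path is illustrative; the real
one contains your cluster fsid and the mgr daemon name):

grep -Eo '(quay|docker)\.io/[^ ]+' /var/lib/ceph/<fsid>/mgr.<daemon>/unit.run | tail -n 1
ceph mgr module enable cephadm
ceph orch set backend cephadm
ceph orch upgrade status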

On Tue, Mar 14, 2023 at 9:23 AM bbk  wrote:

> Dear List,
>
> Today i was sucessfully upgrading with cephadm from 16.2.8 -> 16.2.9 ->
> 16.2.10 -> 16.2.11
>
> Now i wanted to upgrade to 17.2.0 but after starting the upgrade with
>
> ```
> # ceph orch upgrade start --ceph-version 17.2.0
> ```
>
> The orch manager module seems to be gone now and the upgrade don't seem to
> run.
>
>
> ```
> # ceph orch upgrade status
> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
>
> # ceph orch set backend cephadm
> Error ENOENT: Module not found
> ```
>
> During the failed upgrade all nodes had the 16.2.11 cephadm installed.
>
> Fortunately the cluster is still running... somehow. I installed the
> latest 17.2.X cephadm on all
> nodes and rebooted them nodes, but this didn't help.
>
> Does someone have a hint?
>
> Yours,
> bbk
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Upgrade 16.2.11 -> 17.2.0 failed

2023-03-14 Thread bbk
Dear List,

Today I successfully upgraded with cephadm from 16.2.8 -> 16.2.9 -> 16.2.10 
-> 16.2.11

Now I wanted to upgrade to 17.2.0, but after starting the upgrade with

```
# ceph orch upgrade start --ceph-version 17.2.0
```

The orch manager module seems to be gone now and the upgrade doesn't seem to run.


```
# ceph orch upgrade status
Error ENOENT: No orchestrator configured (try `ceph orch set backend`)

# ceph orch set backend cephadm
Error ENOENT: Module not found
```

During the failed upgrade all nodes had the 16.2.11 cephadm installed.

Fortunately the cluster is still running... somehow. I installed the latest 
17.2.X cephadm on all nodes and rebooted the nodes, but this didn't help.

Does someone have a hint?

Yours,
bbk
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd on EC pool with fast and extremely slow writes/reads

2023-03-14 Thread Rok Jaklič
Once or twice a year we have a similar problem in a *non*-Ceph disk cluster,
where working but slow disk writes give us slow reads. We somehow
"understand it", since slow writes probably fill up queues and buffers.


On Thu, Mar 9, 2023 at 11:37 AM Andrej Filipcic 
wrote:

>
> Thanks for the hint, did run some short test, all fine. I am not sure
> it's a drive issue.
>
> Some more digging, the file with bad performance has this segments:
>
> [root@afsvos01 vicepa]# hdparm --fibmap $PWD/0
>
> /vicepa/0:
> filesystem blocksize 4096, begins at LBA 2048; assuming 512 byte sectors.
> byte_offset    begin_LBA     end_LBA    sectors
>           0       743232     2815039    2071808
>  1060765696      3733064     5838279    2105216
>  2138636288     70841232    87586575   16745344
> 10712252416     87586576    87635727      49152
>
> Reading by segments:
>
> # dd if=0 of=/tmp/0 bs=4M status=progress count=252
> 1052770304 bytes (1.1 GB, 1004 MiB) copied, 45 s, 23.3 MB/s
> 252+0 records in
> 252+0 records out
>
> # dd if=0 of=/tmp/0 bs=4M status=progress skip=252 count=256
> 935329792 bytes (935 MB, 892 MiB) copied, 4 s, 234 MB/s
> 256+0 records in
> 256+0 records out
>
> # dd if=0 of=/tmp/0 bs=4M status=progress skip=510
> 7885291520 bytes (7.9 GB, 7.3 GiB) copied, 12 s, 657 MB/s
> 2050+0 records in
> 2050+0 records out
>
> So, 1st 1G is very slow, second segment is faster, then the rest quite
> fast, and it's reproducible (dropped caches before each dd)
>
> Now, the rbd is 3TB with 256 pgs (EC 8+3), I checked with rados that
> objects are randomly distributed on pgs, eg
>
> # rados --pgid 23.82 ls|grep rbd_data.20.2723bd3292f6f8
> rbd_data.20.2723bd3292f6f8.0008
> rbd_data.20.2723bd3292f6f8.000d
> rbd_data.20.2723bd3292f6f8.01cb
> rbd_data.20.2723bd3292f6f8.000601b2
> rbd_data.20.2723bd3292f6f8.0009001b
> rbd_data.20.2723bd3292f6f8.005b
> rbd_data.20.2723bd3292f6f8.000900e8
>
> where object ...05b for example corresponds to the 1st block of the file
> I am testing. Well, if my understanding of rbd  is correct: I assume
> that LBA regions are mapped to consecutive rbd objects.
>
> So, now I am completely confused since the slow chunk of the file is
> still mapped to ~256 objects on different pgs
>
> Maybe I misunderstood the whole thing.
>
> Any other hints? we will still do hdd tests on all the drives
>
> Cheers,
> Andrej
>
> On 3/6/23 20:25, Paul Mezzanini wrote:
> > When I have seen behavior like this it was a dying drive.  It only
> became obviously when I did a smart long test and I got failed reads.
> Still reported smart OK though so that was a lie.
> >
> >
> >
> > --
> >
> > Paul Mezzanini
> > Platform Engineer III
> > Research Computing
> >
> > Rochester Institute of Technology
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > 
> > From: Andrej Filipcic
> > Sent: Monday, March 6, 2023 8:51 AM
> > To: ceph-users
> > Subject: [ceph-users] rbd on EC pool with fast and extremely slow
> writes/reads
> >
> >
> > Hi,
> >
> > I have a problem on one of ceph clusters I do not understand.
> > ceph 17.2.5 on 17 servers, 400 HDD OSDs, 10 and 25Gb/s NICs
> >
> > 3TB rbd image is on erasure coded 8+3 pool with 128pgs , xfs filesystem,
> > 4MB objects in rbd image, mostly empy.
> >
> > I have created a bunch of 10G files, most of them were written with
> > 1.5GB/s, few of them were really slow, ~10MB/s, a factor of 100.
> >
> > When reading these files back, the fast-written ones are read fast,
> > ~2-2.5GB/s, the slowly-written are also extremely slow in reading, iotop
> > shows between 1 and 30 MB/s reading speed.
> >
> > This does not happen at all on replicated images. There are some OSDs
> > with higher apply/commit latency, eg 200ms, but there are no slow ops.
> >
> > The tests were done actually on proxmox vm with librbd, but the same
> > happens with krbd, and on bare metal with mounted krbd as well.
> >
> > I have tried to check all OSDs for laggy drives, but they all look about
> > the same.
> >
> > I have also copied entire image with "rados get...", object by object,
> > the strange thing here is that most of objects were copied within
> > 0.1-0.2s, but quite some took more than 1s.
> > The cluster is quite busy with base traffic of ~1-2GB/s, so the speeds
> > can vary due to that. But I would not expect a factor of 100 slowdown
> > for some writes/reads with rbds.
> >
> > Any clues on what might be wrong or what else to check? I have another
> > similar ceph cluster where everything looks fine.
> >
> > Best,
> > Andrej
> >
> > --
> > _
> >  prof. dr. Andrej Filipcic,   E-mail:andrej.filip...@ijs.si
> >  Department of Experimental High Energy Physics - F9
> >  Jozef Stefan Institute, Jamova 39, P.o.Box 3000
> >  SI-1001 Ljubljana, Slovenia
> >  Tel.: +386-1-477-3674Fax: +386-1-477-3166
> > 

[ceph-users] handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)

2023-03-14 Thread Arvid Picciani
Since Quincy I'm randomly getting authentication issues from clients to OSDs.

The symptom is that qemu hangs, but when it happens I can reproduce it using:

>   ceph tell osd.\* version

Some - but only some - OSDs will never respond, but only to clients
on _some_ hosts.
The client gets stuck in a loop with this error:

> 2023-03-14T10:09:38.492+0100 7f38f5d95700  1 --2- 10.180.10.36:0/329477069 >> 
> [v2:10.180.10.24:6810/697584,v1:10.180.10.24:6811/697584] conn(0x7f38f0107990 
> 0x7f38f0107d60 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rev1=1 crypto rx=0 
> tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble 
> failed r=-1 ((1) Operation not permitted)

Restarting the affected OSD helps for a few hours.

In the OSD log I see only:

> 2023-03-14T09:27:27.801+ 7fb79020a700 10 osd.4 114909 
> ms_handle_authentication session 0x55880cd58b40 client.admin has caps osdcap[grant(*)] 'allow *'
> 2023-03-14T09:27:27.805+ 7fb781a3c700  2 osd.4 114909 ms_handle_reset con 
> 0x55880a7fec00 session 0x55880cd58b40


Searching for this issue turns up people whose mon is dead, but I don't
think "tell" is supposed to go through the mon, beyond the initial
listing, which succeeds. But here's the full auth log from the mon anyway
in case it helps:

2023-03-14T09:34:48.847+ 7fcc8a5c7700 10 In
get_auth_session_handler for protocol 0
2023-03-14T09:34:48.847+ 7fcc84dbc700 10 start_session
entity_name=client.admin global_id=6751719 is_new_global_id=1
2023-03-14T09:34:48.847+ 7fcc84dbc700 10 cephx server
client.admin: start_session server_challenge 20aa2b96857f41cf
2023-03-14T09:34:48.847+ 7fcc865bf700 10 start_session
entity_name=client.admin global_id=6751722 is_new_global_id=1
2023-03-14T09:34:48.847+ 7fcc865bf700 10 cephx server
client.admin: start_session server_challenge 6066dd1200ddc855
2023-03-14T09:34:48.847+ 7fcc84dbc700 10 cephx server
client.admin: handle_request get_auth_session_key for client.admin
2023-03-14T09:34:48.847+ 7fcc84dbc700 20 cephx server
client.admin:  checking key: req.key=92ed7ea281e9ac0c
expected_key=92ed7ea281e9ac0c
2023-03-14T09:34:48.847+ 7fcc84dbc700 20 cephx server
client.admin:  checking old_ticket: secret_id=0 len=0,
old_ticket_may_be_omitted=0
2023-03-14T09:34:48.847+ 7fcc84dbc700 10 cephx server
client.admin:  new global_id 6751719
2023-03-14T09:34:48.847+ 7fcc84dbc700 10 cephx:
build_service_ticket_reply encoding 1 tickets with secret REDACTED==
2023-03-14T09:34:48.847+ 7fcc84dbc700 10 cephx:
build_service_ticket service auth secret_id 160
ticket_info.ticket.name=client.admin ticket.global_id 6751719
2023-03-14T09:34:48.847+ 7fcc84dbc700 10 cephx keyserverdata:
get_caps: name=client.admin
2023-03-14T09:34:48.847+ 7fcc84dbc700 10 cephx keyserverdata:
get_secret: num of caps=4
2023-03-14T09:34:48.847+ 7fcc865bf700 10 cephx server
client.admin: handle_request get_auth_session_key for client.admin
2023-03-14T09:34:48.847+ 7fcc865bf700 20 cephx server
client.admin:  checking key: req.key=3c1f6182caf84073
expected_key=3c1f6182caf84073
2023-03-14T09:34:48.847+ 7fcc865bf700 20 cephx server
client.admin:  checking old_ticket: secret_id=0 len=0,
old_ticket_may_be_omitted=0
2023-03-14T09:34:48.847+ 7fcc865bf700 10 cephx server
client.admin:  new global_id 6751722
2023-03-14T09:34:48.847+ 7fcc865bf700 10 cephx:
build_service_ticket_reply encoding 1 tickets with secret REDACTED==
2023-03-14T09:34:48.847+ 7fcc865bf700 10 cephx:
build_service_ticket service auth secret_id 160
ticket_info.ticket.name=client.admin ticket.global_id 6751722
2023-03-14T09:34:48.847+ 7fcc865bf700 10 cephx keyserverdata:
get_caps: name=client.admin
2023-03-14T09:34:48.847+ 7fcc865bf700 10 cephx keyserverdata:
get_secret: num of caps=4
2023-03-14T09:34:48.851+ 7fcc84dbc700 10 start_session
entity_name=client.admin global_id=6751725 is_new_global_id=1
2023-03-14T09:34:48.851+ 7fcc84dbc700 10 cephx server
client.admin: start_session server_challenge 22fa068f8da1fb28
2023-03-14T09:34:48.851+ 7fcc84dbc700 10 cephx server
client.admin: handle_request get_auth_session_key for client.admin
2023-03-14T09:34:48.851+ 7fcc84dbc700 20 cephx server
client.admin:  checking key: req.key=fc7fdedb8e669347
expected_key=fc7fdedb8e669347
2023-03-14T09:34:48.851+ 7fcc84dbc700 20 cephx server
client.admin:  checking old_ticket: secret_id=0 len=0,
old_ticket_may_be_omitted=0
2023-03-14T09:34:48.851+ 7fcc84dbc700 10 cephx server
client.admin:  new global_id 6751725
2023-03-14T09:34:48.851+ 7fcc84dbc700 10 cephx:
build_service_ticket_reply encoding 1 tickets with secret REDACTED==
2023-03-14T09:34:48.851+ 7fcc84dbc700 10 cephx:
build_service_ticket service auth secret_id 160
ticket_info.ticket.name=client.admin ticket.global_id 6751725
2023-03-14T09:34:48.851+ 7fcc84dbc700 10 cephx keyserverdata:
get_caps: name=client.admin
2023-03-14T09:34:48.851+ 7fcc84dbc700 10 cephx keyserverdata:
get_secret: num of caps=4

[ceph-users] Re: Mixed mode ssd and hdd issue

2023-03-14 Thread Eneko Lacunza

Hi,

We need more info to be able to help you.

What CPU and network in nodes?
What model of SSD?
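
As a side note, since the crush map below already defines per-class rules,
pinning a pool to one device class is mostly a matter of pointing the pool at
the right rule; a sketch (the pool names are illustrative):

ceph osd pool set <hdd-pool> crush_rule replicated_rule      # this rule takes class hdd
ceph osd pool set <ssd-pool> crush_rule replicated_rule_ssd  # this rule takes class ssd
# or create a new SSD-backed pool directly:
ceph osd pool create rbd-ssd 32 32 replicated replicated_rule_ssd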

Cheers

On 13/3/23 at 16:27, xadhoo...@gmail.com wrote:

Hi, we have a cluster with 3 nodes . Each node has 4 HDD and 1 SSD
We would like to have a pool only on ssd and a pool only on hdd, using class 
feature.
here is the setup
# buckets
host ceph01s3 {
 id -3   # do not change unnecessarily
 id -4 class hdd # do not change unnecessarily
 id -21 class ssd# do not change unnecessarily
 # weight 34.561
 alg straw2
 hash 0  # rjenkins1
 item osd.0 weight 10.914
 item osd.5 weight 10.914
 item osd.8 weight 10.914
 item osd.9 weight 1.819
}
host ceph02s3 {
 id -5   # do not change unnecessarily
 id -6 class hdd # do not change unnecessarily
 id -22 class ssd# do not change unnecessarily
 # weight 34.561
 alg straw2
 hash 0  # rjenkins1
 item osd.1 weight 10.914
 item osd.3 weight 10.914
 item osd.7 weight 10.914
 item osd.10 weight 1.819
}
host ceph03s3 {
 id -7   # do not change unnecessarily
 id -8 class hdd # do not change unnecessarily
 id -23 class ssd# do not change unnecessarily
 # weight 34.561
 alg straw2
 hash 0  # rjenkins1
 item osd.2 weight 10.914
 item osd.4 weight 10.914
 item osd.6 weight 10.914
 item osd.11 weight 1.819
}
root default {
 id -1   # do not change unnecessarily
 id -2 class hdd # do not change unnecessarily
 id -24 class ssd# do not change unnecessarily
 # weight 103.683
 alg straw2
 hash 0  # rjenkins1
 item ceph01s3 weight 34.561
 item ceph02s3 weight 34.561
 item ceph03s3 weight 34.561
}

# rules
rule replicated_rule {
 id 0
 type replicated
 min_size 1
 max_size 10
 step take default class hdd
 step chooseleaf firstn 0 type host
 step emit
}
rule erasure-code {
 id 1
 type erasure
 min_size 3
 max_size 4
 step take default class hdd
 step set_chooseleaf_tries 5
 step set_choose_tries 100
 step chooseleaf indep 0 type host
 step emit
}
rule erasure2_1 {
 id 2
 type erasure
 min_size 3
 max_size 3
 step take default class hdd
 step set_chooseleaf_tries 5
 step set_choose_tries 100
 step chooseleaf indep 0 type host
 step emit
}
rule erasure-pool.meta {
 id 3
 type erasure
 min_size 3
 max_size 3
 step take default class hdd
 step set_chooseleaf_tries 5
 step set_choose_tries 100
 step chooseleaf indep 0 type host
 step emit
}
rule erasure-pool.data {
 id 4
 type erasure
 min_size 3
 max_size 3
 step take default class hdd
 step set_chooseleaf_tries 5
 step set_choose_tries 100
 step chooseleaf indep 0 type host
 step emit
}
rule replicated_rule_ssd {
 id 5
 type replicated
 min_size 1
 max_size 10
 step take default class ssd
 step chooseleaf firstn 0 type host
 step emit
}

# end crush map

pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 1669 
flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application 
mgr_devicehealth
pool 5 'Datapool' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 2749 lfor 0/0/321 
flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 7 'erasure-pool.data' erasure profile k2m1 size 3 min_size 2 crush_rule 4 
object_hash rjenkins pg_num 128 pgp_num 126 pgp_num_target 128 autoscale_mode 
on last_change 2780 lfor 0/0/1676 flags hashpspool,ec_overwrites stripe_width 
8192 application cephfs
pool 8 'erasure-pool.meta' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 344 
flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 
recovery_priority 5 application cephfs
pool 9 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 592 flags 
hashpspool stripe_width 0 application rgw
pool 10 'brescia-ovest.rgw.log' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 595 
flags hashpspool stripe_width 0 application rgw
pool 11 'brescia-ovest.rgw.control' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 

[ceph-users] Re: upgrading from 15.2.17 to 16.2.11 - Health ERROR

2023-03-14 Thread Alessandro Bolgia
cephadm is 16.2.11, because the error comes from the upgrade from 15 to 16.

On Mon, 13 Mar 2023 at 18:27, Clyso GmbH - Ceph Foundation
Member  wrote:

> which version of cephadm you are using?
>
> ___
> Clyso GmbH - Ceph Foundation Member
>
> On 10.03.23 at 11:17, xadhoo...@gmail.com wrote:
> > looking at ceph orch upgrade check
> > I find out
> >  },
> >
> "cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2":
> {
> >  "current_id": null,
> >  "current_name": null,
> >  "current_version": null
> >  },
> >
> >
> > Could this lead to the issue?
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io