[ceph-users] How to repair the OSDs while WAL/DB device breaks down
hi, everyone, I have a question about repairing a broken WAL/DB device. I have a cluster with 8 OSDs and 4 WAL/DB devices (1 OSD per WAL/DB device), and how can I repair the OSDs quickly if one WAL/DB device breaks down, without rebuilding them? Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
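For context (my understanding, not from the thread): whether a quick "repair" is possible depends on whether the device is actually dead. If the WAL/DB device has failed outright, its RocksDB is gone and the OSDs that used it normally have to be redeployed and backfilled; only a still-readable device can have its DB/WAL migrated away. A hedged sketch of both paths, where the OSD id, fsid, and VG/LV names are placeholders and the `ceph-volume lvm new-db`/`migrate` subcommands exist only in recent releases:

```shell
# --- Device still readable: move the DB to a new device (per OSD) ---
# Placeholders throughout -- adapt the id/fsid/VG/LV to your cluster.
systemctl stop ceph-osd@0
ceph-volume lvm new-db  --osd-id 0 --osd-fsid <osd-fsid> --target new-vg/db-lv
ceph-volume lvm migrate --osd-id 0 --osd-fsid <osd-fsid> --from db --target new-vg/db-lv
systemctl start ceph-osd@0

# --- Device dead: the OSDs on it must be recreated ---
ceph osd out osd.0
ceph osd destroy osd.0 --yes-i-really-mean-it
# ...then redeploy with ceph-volume lvm create (or your orchestrator)
# and let backfill restore the data.
```

Verify the exact subcommand set against the documentation for your release before running anything.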
[ceph-users] Re: 10x more used space than expected
Hi, I found the documentation for metadata get to be unhelpful for what syntax to use. I eventually found that it's this: radosgw-admin metadata get bucket:{bucket_name} or radosgw-admin metadata get bucket.instance:{bucket_name}:{instance_id} Hopefully that helps you or someone else struggling with this. Rich On Wed, 15 Mar 2023 at 07:18, Gaël THEROND wrote: > > Alright, > Seems something is odd out there, if I do a radosgw-admin metadata list > > I’ve got the following list: > > [ > ”bucket”, > ”bucket.instance”, > ”otp”, > ”user” > ] > > BUT > > When I try a radosgw-admin metadata get bucket or bucket.instance it > complain with the following error: > > ERROR: can’t get key: (22) Invalid argument > > Ok, fine for the api, I’ll deal with the s3 api. > > Even if a radosgw-admin bucket flush version —keep-current or something > similar would be much appreciated xD > > Le mar. 14 mars 2023 à 19:07, Robin H. Johnson a > écrit : > > > On Tue, Mar 14, 2023 at 06:59:51PM +0100, Gaël THEROND wrote: > > > Versioning wasn’t enabled, at least not explicitly and for the > > > documentation it isn’t enabled by default. > > > > > > Using nautilus. > > > > > > I’ll get all the required missing information on tomorrow morning, thanks > > > for the help! > > > > > > Is there a way to tell CEPH to delete versions that aren’t current used > > one > > > with radosgw-admin? > > > > > > If not I’ll use the rest api no worries. > > Nope, s3 API only. > > > > You should also check for incomplete multiparts. For that, I recommend > > using AWSCLI or boto directly. Specifically not s3cmd, because s3cmd > > doesn't respect the flag properly. 
> > --
> > Robin Hugh Johnson
> > Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> > E-Mail : robb...@gentoo.org
> > GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> > GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
[ceph-users] Re: 10x more used space than expected
Thanks a lot for this spreadsheet, I’ll check that, but I doubt we store data smaller than the min_alloc size. Yes, we do use an EC pool of type 2+1 with failure_domain at the host level.

On Tue, Mar 14, 2023 at 19:38, Mark Nelson wrote:
> Is it possible that you are storing object (chunks if EC) that are
> smaller than the min_alloc size? This cheat sheet might help:
>
> https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing
>
> Mark
>
> On 3/14/23 12:34, Gaël THEROND wrote:
> > Hi everyone, I’ve got a quick question regarding one of our RadosGW bucket.
> >
> > This bucket is used to store docker registries, and the total amount of
> > data we use is supposed to be 4.5Tb BUT it looks like ceph told us we
> > rather use ~53Tb of data.
> >
> > One interesting thing is, this bucket seems to shard for unknown reason as
> > it is supposed to be disabled by default, but even taking that into account
> > we’re not supposed to see such a massive amount of additional data isn’t it?
> >
> > Here is the bucket stats of it:
> > https://paste.opendev.org/show/bdWFRvNFtxyHnbPfXWu9/
[ceph-users] Re: 10x more used space than expected
Is it possible that you are storing objects (chunks if EC) that are smaller than the min_alloc size? This cheat sheet might help:

https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?usp=sharing

Mark

On 3/14/23 12:34, Gaël THEROND wrote:
> Hi everyone, I’ve got a quick question regarding one of our RadosGW bucket.
>
> This bucket is used to store docker registries, and the total amount of
> data we use is supposed to be 4.5Tb BUT it looks like ceph told us we
> rather use ~53Tb of data.
>
> One interesting thing is, this bucket seems to shard for unknown reason as
> it is supposed to be disabled by default, but even taking that into account
> we’re not supposed to see such a massive amount of additional data isn’t it?
>
> Here is the bucket stats of it:
> https://paste.opendev.org/show/bdWFRvNFtxyHnbPfXWu9/
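To illustrate the effect Mark is pointing at with some back-of-the-envelope numbers (mine, not from the thread): in an EC k+m pool each object is split into k data chunks plus m coding chunks, and BlueStore rounds every chunk up to bluestore_min_alloc_size; 64 KiB was the long-standing HDD default before Pacific. A small sketch of the amplification:

```python
# Hedged sketch: estimate on-disk usage for one object in an EC k+m pool
# where every chunk is rounded up to bluestore_min_alloc_size.
# 64 KiB is assumed here (the pre-Pacific HDD default).

MIN_ALLOC = 64 * 1024  # bytes

def round_up(n, step):
    return -(-n // step) * step

def ec_disk_usage(obj_size, k=2, m=1, min_alloc=MIN_ALLOC):
    """Bytes actually allocated on disk for one object in a k+m EC pool."""
    chunk = round_up(round_up(obj_size, k) // k, min_alloc)
    return chunk * (k + m)

print(ec_disk_usage(4096))       # 196608 -> 48x amplification for a 4 KiB object
print(ec_disk_usage(4 * 2**20))  # 6291456 -> only the expected 1.5x EC overhead
```

With millions of small objects (docker registry layers and manifests can be tiny) this alone can turn a few TB of logical data into tens of TB of raw usage.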
[ceph-users] Re: CephFS thrashing through the page cache
Got the answer to my own question; posting it here in case someone else encounters the same problem. The issue is that the default stripe size in a cephfs mount is 4 MB. If you are doing small reads (like the 4k reads in the test I posted) inside the file, you'll end up pulling at least 4MB to the client (and then discarding most of the pulled data) even if you set readahead to zero. So, the solution for us was to set a lower stripe size, which aligns better with our workloads.

Thanks and Regards,
Ashu Pachauri

On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri wrote:
> Also, I am able to reproduce the network read amplification when I try to
> do very small reads from larger files, e.g.
>
> for i in $(seq 1 10000); do
>   dd if=test_${i} of=/dev/null bs=5k count=10
> done
>
> This piece of code generates network traffic of 3.3 GB while it actually
> reads approx 500 MB of data.
>
> Thanks and Regards,
> Ashu Pachauri
>
> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri wrote:
>
>> We have an internal use case where we back the storage of a proprietary
>> database by a shared file system. We noticed something very odd when
>> testing some workload with a local block device backed file system vs
>> cephfs. We noticed that the amount of network IO done by cephfs is almost
>> double compared to the IO done in case of a local file system backed by an
>> attached block device.
>>
>> We also noticed that CephFS thrashes through the page cache very quickly
>> compared to the amount of data being read, and think that the two issues
>> might be related. So, I wrote a simple test.
>>
>> 1. I wrote 10k files 400KB each using dd (approx 4 GB data).
>> 2. I dropped the page cache completely.
>> 3. I then read these files serially, again using dd. The page cache usage
>> shot up to 39 GB for reading such a small amount of data.
>>
>> Following is the code used to repro this in bash:
>>
>> for i in $(seq 1 10000); do
>>   dd if=/dev/zero of=test_${i} bs=4k count=100
>> done
>>
>> sync; echo 1 > /proc/sys/vm/drop_caches
>>
>> for i in $(seq 1 10000); do
>>   dd if=test_${i} of=/dev/null bs=4k count=100
>> done
>>
>> The ceph version being used is:
>> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
>>
>> The ceph configs being overridden:
>> WHO     MASK  LEVEL     OPTION                                 VALUE        RO
>> mon           advanced  auth_allow_insecure_global_id_reclaim  false
>> mgr           advanced  mgr/balancer/mode                      upmap
>> mgr           advanced  mgr/dashboard/server_addr              127.0.0.1    *
>> mgr           advanced  mgr/dashboard/server_port              8443         *
>> mgr           advanced  mgr/dashboard/ssl                      false        *
>> mgr           advanced  mgr/prometheus/server_addr             0.0.0.0      *
>> mgr           advanced  mgr/prometheus/server_port             9283         *
>> osd           advanced  bluestore_compression_algorithm        lz4
>> osd           advanced  bluestore_compression_mode             aggressive
>> osd           advanced  bluestore_throttle_bytes               536870912
>> osd           advanced  osd_max_backfills                      3
>> osd           advanced  osd_op_num_threads_per_shard_ssd       8            *
>> osd           advanced  osd_scrub_auto_repair                  true
>> mds           advanced  client_oc                              false
>> mds           advanced  client_readahead_max_bytes             4096
>> mds           advanced  client_readahead_max_periods           1
>> mds           advanced  client_readahead_min                   0
>> mds           basic     mds_cache_memory_limit                 21474836480
>> client        advanced  client_oc                              false
>> client        advanced  client_readahead_max_bytes             4096
>> client        advanced  client_readahead_max_periods           1
>> client        advanced  client_readahead_min                   0
>> client        advanced  fuse_disable_pagecache                 false
>>
>> The cephfs mount options (note that readahead was disabled for this test):
>> /mnt/cephfs type ceph (rw,relatime,name=cephfs,secret=,acl,rasize=0)
>>
>> Any help or pointers are appreciated; this is a major performance issue
>> for us.
>>
>> Thanks and Regards,
>> Ashu Pachauri
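The amplification Ashu describes can be sanity-checked with simple arithmetic (a sketch with my own numbers; it assumes the default 4 MiB CephFS object size and that every cache-missing read pulls the whole object containing it):

```python
# Hedged sketch: with a 4 MiB object/stripe size, a cache-missing read
# inside an object fetches the whole 4 MiB object, however small the read.

OBJECT_SIZE = 4 * 2**20  # default cephfs object size

def ceil_div(a, b):
    return -(-a // b)

def bytes_pulled(read_size, reads_per_file, files, object_size=OBJECT_SIZE):
    """Bytes fetched when reads run sequentially from the start of each file."""
    span = read_size * reads_per_file              # region actually touched
    objects = max(ceil_div(span, object_size), 1)  # distinct objects hit
    return files * objects * object_size

# Ashu's second test: 10k files, 10 reads of 5 KiB each (~500 MB requested).
requested = 10_000 * 10 * 5 * 1024
pulled = bytes_pulled(5 * 1024, 10, 10_000)
print(f"{requested / 2**30:.2f} GiB requested, {pulled / 2**30:.2f} GiB pulled")
```

One 4 MiB object per file for 10k files works out to roughly 39 GiB, which matches the page-cache growth reported in the first test; the observed 3.3 GB of network traffic suggests the client did not always pull full objects, so treat this as an upper bound.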
[ceph-users] Re: 10x more used space than expected
Alright, seems something is odd out there: if I do a radosgw-admin metadata list I’ve got the following list:

[
  ”bucket”,
  ”bucket.instance”,
  ”otp”,
  ”user”
]

BUT when I try a radosgw-admin metadata get bucket or bucket.instance it complains with the following error:

ERROR: can’t get key: (22) Invalid argument

Ok, fine for the api, I’ll deal with the s3 api.

Even if a radosgw-admin bucket flush version --keep-current or something similar would be much appreciated xD

On Tue, Mar 14, 2023 at 19:07, Robin H. Johnson wrote:
> On Tue, Mar 14, 2023 at 06:59:51PM +0100, Gaël THEROND wrote:
> > Versioning wasn’t enabled, at least not explicitly and for the
> > documentation it isn’t enabled by default.
> >
> > Using nautilus.
> >
> > I’ll get all the required missing information on tomorrow morning, thanks
> > for the help!
> >
> > Is there a way to tell CEPH to delete versions that aren’t current used one
> > with radosgw-admin?
> >
> > If not I’ll use the rest api no worries.
> Nope, s3 API only.
>
> You should also check for incomplete multiparts. For that, I recommend
> using AWSCLI or boto directly. Specifically not s3cmd, because s3cmd
> doesn't respect the flag properly.
>
> --
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> E-Mail : robb...@gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
[ceph-users] Re: 10x more used space than expected
On Tue, Mar 14, 2023 at 06:59:51PM +0100, Gaël THEROND wrote:
> Versioning wasn’t enabled, at least not explicitly and for the
> documentation it isn’t enabled by default.
>
> Using nautilus.
>
> I’ll get all the required missing information on tomorrow morning, thanks
> for the help!
>
> Is there a way to tell CEPH to delete versions that aren’t current used one
> with radosgw-admin?
>
> If not I’ll use the rest api no worries.
Nope, s3 API only.

You should also check for incomplete multiparts. For that, I recommend
using AWSCLI or boto directly. Specifically not s3cmd, because s3cmd
doesn't respect the flag properly.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
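A hedged sketch of the AWSCLI check Robin recommends; the endpoint URL and bucket name below are placeholders, not from the thread, while the `s3api` subcommands are standard AWS CLI:

```shell
# List incomplete multipart uploads in a bucket.
aws --endpoint-url https://rgw.example.com s3api list-multipart-uploads \
    --bucket mybucket

# Abort a stale upload, using a Key/UploadId pair from the listing above.
aws --endpoint-url https://rgw.example.com s3api abort-multipart-upload \
    --bucket mybucket --key path/to/object --upload-id <UploadId>
```

Space freed by aborted multiparts should then be reclaimed by RGW's garbage collection, which can take a while; a bucket lifecycle rule with AbortIncompleteMultipartUpload can keep this from recurring.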
[ceph-users] Re: 10x more used space than expected
Versioning wasn’t enabled, at least not explicitly, and per the documentation it isn’t enabled by default.

Using nautilus.

I’ll get all the required missing information tomorrow morning, thanks for the help!

Is there a way to tell CEPH to delete versions that aren’t the currently used one with radosgw-admin?

If not I’ll use the rest api, no worries.

On Tue, Mar 14, 2023 at 18:49, Robin H. Johnson wrote:
> On Tue, Mar 14, 2023 at 06:34:54PM +0100, Gaël THEROND wrote:
> > Hi everyone, I’ve got a quick question regarding one of our RadosGW bucket.
> >
> > This bucket is used to store docker registries, and the total amount of
> > data we use is supposed to be 4.5Tb BUT it looks like ceph told us we
> > rather use ~53Tb of data.
> >
> > One interesting thing is, this bucket seems to shard for unknown reason as
> > it is supposed to be disabled by default, but even taking that into account
> > we’re not supposed to see such a massive amount of additional data isn’t it?
> >
> > Here is the bucket stats of it:
> > https://paste.opendev.org/show/bdWFRvNFtxyHnbPfXWu9/
> At a glance, is versioning enabled?
>
> And if so, are you pruning old versions?
>
> Please share "radosgw-admin metadata get" for the bucket &
> bucket-instance.
>
> --
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> E-Mail : robb...@gentoo.org
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
[ceph-users] Re: 10x more used space than expected
On Tue, Mar 14, 2023 at 06:34:54PM +0100, Gaël THEROND wrote:
> Hi everyone, I’ve got a quick question regarding one of our RadosGW bucket.
>
> This bucket is used to store docker registries, and the total amount of
> data we use is supposed to be 4.5Tb BUT it looks like ceph told us we
> rather use ~53Tb of data.
>
> One interesting thing is, this bucket seems to shard for unknown reason as
> it is supposed to be disabled by default, but even taking that into account
> we’re not supposed to see such a massive amount of additional data isn’t it?
>
> Here is the bucket stats of it:
> https://paste.opendev.org/show/bdWFRvNFtxyHnbPfXWu9/
At a glance, is versioning enabled?

And if so, are you pruning old versions?

Please share "radosgw-admin metadata get" for the bucket &
bucket-instance.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
[ceph-users] 10x more used space than expected
Hi everyone, I’ve got a quick question regarding one of our RadosGW buckets.

This bucket is used to store docker registries, and the total amount of data we use is supposed to be 4.5Tb, BUT it looks like ceph tells us we rather use ~53Tb of data.

One interesting thing is, this bucket seems to shard for an unknown reason, as sharding is supposed to be disabled by default; but even taking that into account, we’re not supposed to see such a massive amount of additional data, are we?

Here are the bucket stats of it: https://paste.opendev.org/show/bdWFRvNFtxyHnbPfXWu9/
[ceph-users] Last day to sponsor Cephalocon Amsterdam 2023
Hi everyone,

Today is the last day to sponsor Cephalocon Amsterdam 2023! I want to thank our current sponsors:

Platinum: IBM
Silver: 42on, Canonical Ubuntu, Clyso
Startup: Koor

Also, thank you to Clyso for their lanyard add-on and 42on's offsite attendee party. We are still short in covering the costs for the event, so I'm asking contributors and members of the Ceph Foundation to consider applying today.

https://events.linuxfoundation.org/cephalocon/sponsor/

Sponsor Prospectus: https://events.linuxfoundation.org/wp-content/uploads/2023/03/sponsor-ceph-23_030923.pdf

Please get in touch with us at sponsorships@ceph.foundation to get started. Thank you!

--
Mike Perez
[ceph-users] Re: pg wait too long when osd restart
Hello, Baergen,

Thanks for your reply, I got it. ☺

Best regards
Yitte Gu

Josh Baergen wrote on Mon, Mar 13, 2023 at 23:15:
> (trimming out the dev list and Radoslaw's email)
>
> Hello,
>
> I think the two critical PRs were:
> * https://github.com/ceph/ceph/pull/44585 - included in 15.2.16
> * https://github.com/ceph/ceph/pull/45655 - included in 15.2.17
>
> I don't have any comments on tweaking those configuration values, and
> what safe values would be.
>
> Josh
>
> On Sun, Mar 12, 2023 at 9:43 PM yite gu wrote:
> >
> > Hello, Baergen
> > Thanks for your reply. The restart was planned, but my version is 15.2.7, so I may have encountered the problem you mentioned. Could you point me to the PRs that improved this mechanism? Besides that, if I don't want to upgrade soon, is it a good idea to adjust osd_pool_default_read_lease_ratio lower? For example, 0.4 or 0.2, to get within the users' tolerance time.
> >
> > Yite Gu
> >
> > Josh Baergen wrote on Fri, Mar 10, 2023 at 22:09:
> >>
> >> Hello,
> >>
> >> When you say "osd restart", what sort of restart are you referring to
> >> - planned (e.g. for upgrades or maintenance) or unplanned (OSD
> >> hang/crash, host issue, etc.)? If it's the former, then these
> >> parameters shouldn't matter provided that you're running a recent
> >> enough Ceph with default settings - it's supposed to handle planned
> >> restarts with little I/O wait time. There were some issues with this
> >> mechanism before Octopus 15.2.17 / Pacific 16.2.8 that could cause
> >> planned restarts to wait for the read lease timeout in some
> >> circumstances.
> >>
> >> Josh
> >>
> >> On Fri, Mar 10, 2023 at 1:31 AM yite gu wrote:
> >> >
> >> > Hi all,
> >> > osd_heartbeat_grace = 20 and osd_pool_default_read_lease_ratio = 0.8
> >> > by default, so a pg will wait 16s when an osd restarts in the worst
> >> > case. This wait time is too long for client i/o to be acceptable. I
> >> > think adjusting osd_pool_default_read_lease_ratio lower is a good
> >> > way. Have any good suggestions about reducing pg wait time?
> >> >
> >> > Best Regard
> >> > Yite Gu
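The 16 s figure quoted above follows directly from the two options mentioned; a sketch of the arithmetic:

```python
# Worst-case wait after an OSD goes down before its read lease expires:
#   read lease = osd_heartbeat_grace * osd_pool_default_read_lease_ratio
# (defaults 20 and 0.8, as stated in the thread)

def read_lease_seconds(grace=20.0, ratio=0.8):
    return grace * ratio

print(read_lease_seconds())           # 16.0 -- the 16 s Yite mentions
print(read_lease_seconds(ratio=0.4))  # 8.0  -- with the proposed lower ratio
```

Note the trade-off implied by Josh's reply: lowering the ratio shortens client stalls on unplanned failures, but planned restarts are supposed to avoid the wait entirely on fixed releases, so upgrading addresses the common case without touching the lease.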
[ceph-users] Re: Upgrade 16.2.11 -> 17.2.0 failed
On 14.03.23 14:21, bbk wrote:
> # ceph orch upgrade start --ceph-version 17.2.0

I would never recommend updating to a .0 release. Why not go directly to the latest 17.2.5?

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
[ceph-users] Re: Upgrade 16.2.11 -> 17.2.0 failed
That's very odd, I haven't seen this before. What container image is the upgraded mgr running on? (To know for sure, you can check the podman/docker run command at the end of the /var/lib/ceph/<fsid>/mgr.<name>/unit.run file on the mgr's host.) Also, you could try "ceph mgr module enable cephadm" to see if it does anything.

On Tue, Mar 14, 2023 at 9:23 AM bbk wrote:
> Dear List,
>
> Today i was sucessfully upgrading with cephadm from 16.2.8 -> 16.2.9 ->
> 16.2.10 -> 16.2.11
>
> Now i wanted to upgrade to 17.2.0 but after starting the upgrade with
>
> ```
> # ceph orch upgrade start --ceph-version 17.2.0
> ```
>
> The orch manager module seems to be gone now and the upgrade don't seem to
> run.
>
> ```
> # ceph orch upgrade status
> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
>
> # ceph orch set backend cephadm
> Error ENOENT: Module not found
> ```
>
> During the failed upgrade all nodes had the 16.2.11 cephadm installed.
>
> Fortunately the cluster is still running... somehow. I installed the
> latest 17.2.X cephadm on all nodes and rebooted the nodes, but this didn't help.
>
> Does someone have a hint?
>
> Yours,
> bbk
[ceph-users] Upgrade 16.2.11 -> 17.2.0 failed
Dear List,

Today I successfully upgraded with cephadm from 16.2.8 -> 16.2.9 -> 16.2.10 -> 16.2.11.

Now I wanted to upgrade to 17.2.0, but after starting the upgrade with

```
# ceph orch upgrade start --ceph-version 17.2.0
```

the orch manager module seems to be gone and the upgrade doesn't seem to run.

```
# ceph orch upgrade status
Error ENOENT: No orchestrator configured (try `ceph orch set backend`)

# ceph orch set backend cephadm
Error ENOENT: Module not found
```

During the failed upgrade all nodes had the 16.2.11 cephadm installed.

Fortunately the cluster is still running... somehow. I installed the latest 17.2.X cephadm on all nodes and rebooted them, but this didn't help.

Does someone have a hint?

Yours,
bbk
[ceph-users] Re: rbd on EC pool with fast and extremely slow writes/reads
Once or twice a year we see a similar problem in a *non*-ceph disk cluster, where working but slow disk writes give us slow reads. We somehow "understand it", since slow writes probably fill up queues and buffers.

On Thu, Mar 9, 2023 at 11:37 AM Andrej Filipcic wrote:
>
> Thanks for the hint, did run some short tests, all fine. I am not sure
> it's a drive issue.
>
> Some more digging, the file with bad performance has these segments:
>
> [root@afsvos01 vicepa]# hdparm --fibmap $PWD/0
>
> /vicepa/0:
>  filesystem blocksize 4096, begins at LBA 2048; assuming 512 byte sectors.
>  byte_offset   begin_LBA    end_LBA    sectors
>            0      743232    2815039    2071808
>   1060765696     3733064    5838279    2105216
>   2138636288    70841232   87586575   16745344
>  10712252416    87586576   87635727      49152
>
> Reading by segments:
>
> # dd if=0 of=/tmp/0 bs=4M status=progress count=252
> 1052770304 bytes (1.1 GB, 1004 MiB) copied, 45 s, 23.3 MB/s
> 252+0 records in
> 252+0 records out
>
> # dd if=0 of=/tmp/0 bs=4M status=progress skip=252 count=256
> 935329792 bytes (935 MB, 892 MiB) copied, 4 s, 234 MB/s
> 256+0 records in
> 256+0 records out
>
> # dd if=0 of=/tmp/0 bs=4M status=progress skip=510
> 7885291520 bytes (7.9 GB, 7.3 GiB) copied, 12 s, 657 MB/s
> 2050+0 records in
> 2050+0 records out
>
> So, the 1st 1G is very slow, the second segment is faster, then the rest
> quite fast, and it's reproducible (dropped caches before each dd).
>
> Now, the rbd is 3TB with 256 pgs (EC 8+3), I checked with rados that
> objects are randomly distributed on pgs, e.g.
>
> # rados --pgid 23.82 ls | grep rbd_data.20.2723bd3292f6f8
> rbd_data.20.2723bd3292f6f8.0008
> rbd_data.20.2723bd3292f6f8.000d
> rbd_data.20.2723bd3292f6f8.01cb
> rbd_data.20.2723bd3292f6f8.000601b2
> rbd_data.20.2723bd3292f6f8.0009001b
> rbd_data.20.2723bd3292f6f8.005b
> rbd_data.20.2723bd3292f6f8.000900e8
>
> where object ...05b for example corresponds to the 1st block of the file
> I am testing.
Well, if my understanding of rbd is correct: I assume > that LBA regions are mapped to consecutive rbd objects. > > So, now I am completely confused since the slow chunk of the file is > still mapped to ~256 objects on different pgs > > Maybe I misunderstood the whole thing. > > Any other hints? we will still do hdd tests on all the drives > > Cheers, > Andrej > > On 3/6/23 20:25, Paul Mezzanini wrote: > > When I have seen behavior like this it was a dying drive. It only > became obviously when I did a smart long test and I got failed reads. > Still reported smart OK though so that was a lie. > > > > > > > > -- > > > > Paul Mezzanini > > Platform Engineer III > > Research Computing > > > > Rochester Institute of Technology > > > > > > > > > > > > > > > > > > > > > > From: Andrej Filipcic > > Sent: Monday, March 6, 2023 8:51 AM > > To: ceph-users > > Subject: [ceph-users] rbd on EC pool with fast and extremely slow > writes/reads > > > > > > Hi, > > > > I have a problem on one of ceph clusters I do not understand. > > ceph 17.2.5 on 17 servers, 400 HDD OSDs, 10 and 25Gb/s NICs > > > > 3TB rbd image is on erasure coded 8+3 pool with 128pgs , xfs filesystem, > > 4MB objects in rbd image, mostly empy. > > > > I have created a bunch of 10G files, most of them were written with > > 1.5GB/s, few of them were really slow, ~10MB/s, a factor of 100. > > > > When reading these files back, the fast-written ones are read fast, > > ~2-2.5GB/s, the slowly-written are also extremely slow in reading, iotop > > shows between 1 and 30 MB/s reading speed. > > > > This does not happen at all on replicated images. There are some OSDs > > with higher apply/commit latency, eg 200ms, but there are no slow ops. > > > > The tests were done actually on proxmox vm with librbd, but the same > > happens with krbd, and on bare metal with mounted krbd as well. > > > > I have tried to check all OSDs for laggy drives, but they all look about > > the same. 
> > > > I have also copied entire image with "rados get...", object by object, > > the strange thing here is that most of objects were copied within > > 0.1-0.2s, but quite some took more than 1s. > > The cluster is quite busy with base traffic of ~1-2GB/s, so the speeds > > can vary due to that. But I would not expect a factor of 100 slowdown > > for some writes/reads with rbds. > > > > Any clues on what might be wrong or what else to check? I have another > > similar ceph cluster where everything looks fine. > > > > Best, > > Andrej > > > > -- > > _ > > prof. dr. Andrej Filipcic, E-mail:andrej.filip...@ijs.si > > Department of Experimental High Energy Physics - F9 > > Jozef Stefan Institute, Jamova 39, P.o.Box 3000 > > SI-1001 Ljubljana, Slovenia > > Tel.: +386-1-477-3674Fax: +386-1-477-3166 > >
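Andrej's assumption about the layout can be made concrete with a small sketch (my arithmetic, assuming the default 4 MiB object size, no custom striping, and the standard rbd_data.<image-id>.<16-hex-digit> object naming):

```python
# Hedged sketch: map a byte offset in an rbd image to the hex suffix of
# the rbd_data object that stores it, assuming 4 MiB objects.

OBJ_SIZE = 4 * 2**20

def object_suffix(offset, obj_size=OBJ_SIZE):
    """16-hex-digit object-name suffix for the object covering `offset`."""
    return format(offset // obj_size, "016x")

# Object ...05b from the listing would cover image offset 0x5b * 4 MiB:
print(object_suffix(0x5b * OBJ_SIZE))  # 000000000000005b
```

So consecutive 4 MiB extents of the image map to consecutive object numbers, which is why a slow 1 GiB file region still spreads over ~256 objects on different PGs, and why per-object timing (as in the rados get test) is a better way to localize the slowness than per-PG reasoning.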
[ceph-users] handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
since quincy i'm randomly getting authentication issues from clients to osds. symptom is qemu hangs, but when it happens, i can reproduce it using:

> ceph tell osd.\* version

some - but only some - osds will never respond, but only to clients on _some_ hosts. the client gets stuck in a loop with this error:

> 2023-03-14T10:09:38.492+0100 7f38f5d95700 1 --2- 10.180.10.36:0/329477069 >> [v2:10.180.10.24:6810/697584,v1:10.180.10.24:6811/697584] conn(0x7f38f0107990 0x7f38f0107d60 crc :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)

restarting the affected OSD helps for a few hours. in the osd log i see only:

> 2023-03-14T09:27:27.801+0000 7fb79020a700 10 osd.4 114909 ms_handle_authentication session 0x55880cd58b40 client.admin has caps osdcap[grant(*)] 'allow *'
> 2023-03-14T09:27:27.805+0000 7fb781a3c700 2 osd.4 114909 ms_handle_reset con 0x55880a7fec00 session 0x55880cd58b40

searching for this issue gives me people whose mon is dead, but i don't think "tell" is supposed to go through the mon, beyond the initial listing, which succeeds.
but here's the full auth log from mon anyway if it helps:

2023-03-14T09:34:48.847+0000 7fcc8a5c7700 10 In get_auth_session_handler for protocol 0
2023-03-14T09:34:48.847+0000 7fcc84dbc700 10 start_session entity_name=client.admin global_id=6751719 is_new_global_id=1
2023-03-14T09:34:48.847+0000 7fcc84dbc700 10 cephx server client.admin: start_session server_challenge 20aa2b96857f41cf
2023-03-14T09:34:48.847+0000 7fcc865bf700 10 start_session entity_name=client.admin global_id=6751722 is_new_global_id=1
2023-03-14T09:34:48.847+0000 7fcc865bf700 10 cephx server client.admin: start_session server_challenge 6066dd1200ddc855
2023-03-14T09:34:48.847+0000 7fcc84dbc700 10 cephx server client.admin: handle_request get_auth_session_key for client.admin
2023-03-14T09:34:48.847+0000 7fcc84dbc700 20 cephx server client.admin: checking key: req.key=92ed7ea281e9ac0c expected_key=92ed7ea281e9ac0c
2023-03-14T09:34:48.847+0000 7fcc84dbc700 20 cephx server client.admin: checking old_ticket: secret_id=0 len=0, old_ticket_may_be_omitted=0
2023-03-14T09:34:48.847+0000 7fcc84dbc700 10 cephx server client.admin: new global_id 6751719
2023-03-14T09:34:48.847+0000 7fcc84dbc700 10 cephx: build_service_ticket_reply encoding 1 tickets with secret REDACTED==
2023-03-14T09:34:48.847+0000 7fcc84dbc700 10 cephx: build_service_ticket service auth secret_id 160 ticket_info.ticket.name=client.admin ticket.global_id 6751719
2023-03-14T09:34:48.847+0000 7fcc84dbc700 10 cephx keyserverdata: get_caps: name=client.admin
2023-03-14T09:34:48.847+0000 7fcc84dbc700 10 cephx keyserverdata: get_secret: num of caps=4
2023-03-14T09:34:48.847+0000 7fcc865bf700 10 cephx server client.admin: handle_request get_auth_session_key for client.admin
2023-03-14T09:34:48.847+0000 7fcc865bf700 20 cephx server client.admin: checking key: req.key=3c1f6182caf84073 expected_key=3c1f6182caf84073
2023-03-14T09:34:48.847+0000 7fcc865bf700 20 cephx server client.admin: checking old_ticket: secret_id=0 len=0, old_ticket_may_be_omitted=0
2023-03-14T09:34:48.847+0000 7fcc865bf700 10 cephx server client.admin: new global_id 6751722
2023-03-14T09:34:48.847+0000 7fcc865bf700 10 cephx: build_service_ticket_reply encoding 1 tickets with secret REDACTED==
2023-03-14T09:34:48.847+0000 7fcc865bf700 10 cephx: build_service_ticket service auth secret_id 160 ticket_info.ticket.name=client.admin ticket.global_id 6751722
2023-03-14T09:34:48.847+0000 7fcc865bf700 10 cephx keyserverdata: get_caps: name=client.admin
2023-03-14T09:34:48.847+0000 7fcc865bf700 10 cephx keyserverdata: get_secret: num of caps=4
2023-03-14T09:34:48.851+0000 7fcc84dbc700 10 start_session entity_name=client.admin global_id=6751725 is_new_global_id=1
2023-03-14T09:34:48.851+0000 7fcc84dbc700 10 cephx server client.admin: start_session server_challenge 22fa068f8da1fb28
2023-03-14T09:34:48.851+0000 7fcc84dbc700 10 cephx server client.admin: handle_request get_auth_session_key for client.admin
2023-03-14T09:34:48.851+0000 7fcc84dbc700 20 cephx server client.admin: checking key: req.key=fc7fdedb8e669347 expected_key=fc7fdedb8e669347
2023-03-14T09:34:48.851+0000 7fcc84dbc700 20 cephx server client.admin: checking old_ticket: secret_id=0 len=0, old_ticket_may_be_omitted=0
2023-03-14T09:34:48.851+0000 7fcc84dbc700 10 cephx server client.admin: new global_id 6751725
2023-03-14T09:34:48.851+0000 7fcc84dbc700 10 cephx: build_service_ticket_reply encoding 1 tickets with secret REDACTED==
2023-03-14T09:34:48.851+0000 7fcc84dbc700 10 cephx: build_service_ticket service auth secret_id 160 ticket_info.ticket.name=client.admin ticket.global_id 6751725
2023-03-14T09:34:48.851+0000 7fcc84dbc700 10 cephx keyserverdata: get_caps: name=client.admin
2023-03-14T09:34:48.851+0000 7fcc84dbc700 10 cephx keyserverdata: get_secret: num of caps=4
[ceph-users] Re: Mixed mode ssd and hdd issue
Hi,

We need more info to be able to help you. What CPU and network are in the nodes? What model of SSD?

Cheers

On 13/3/23 at 16:27, xadhoo...@gmail.com wrote:

> Hi, we have a cluster with 3 nodes. Each node has 4 HDDs and 1 SSD. We would like to have a pool only on ssd and a pool only on hdd, using the class feature. Here is the setup:

# buckets
host ceph01s3 {
	id -3		# do not change unnecessarily
	id -4 class hdd		# do not change unnecessarily
	id -21 class ssd	# do not change unnecessarily
	# weight 34.561
	alg straw2
	hash 0	# rjenkins1
	item osd.0 weight 10.914
	item osd.5 weight 10.914
	item osd.8 weight 10.914
	item osd.9 weight 1.819
}
host ceph02s3 {
	id -5		# do not change unnecessarily
	id -6 class hdd		# do not change unnecessarily
	id -22 class ssd	# do not change unnecessarily
	# weight 34.561
	alg straw2
	hash 0	# rjenkins1
	item osd.1 weight 10.914
	item osd.3 weight 10.914
	item osd.7 weight 10.914
	item osd.10 weight 1.819
}
host ceph03s3 {
	id -7		# do not change unnecessarily
	id -8 class hdd		# do not change unnecessarily
	id -23 class ssd	# do not change unnecessarily
	# weight 34.561
	alg straw2
	hash 0	# rjenkins1
	item osd.2 weight 10.914
	item osd.4 weight 10.914
	item osd.6 weight 10.914
	item osd.11 weight 1.819
}
root default {
	id -1		# do not change unnecessarily
	id -2 class hdd		# do not change unnecessarily
	id -24 class ssd	# do not change unnecessarily
	# weight 103.683
	alg straw2
	hash 0	# rjenkins1
	item ceph01s3 weight 34.561
	item ceph02s3 weight 34.561
	item ceph03s3 weight 34.561
}

# rules
rule replicated_rule {
	id 0
	type replicated
	min_size 1
	max_size 10
	step take default class hdd
	step chooseleaf firstn 0 type host
	step emit
}
rule erasure-code {
	id 1
	type erasure
	min_size 3
	max_size 4
	step take default class hdd
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step chooseleaf indep 0 type host
	step emit
}
rule erasure2_1 {
	id 2
	type erasure
	min_size 3
	max_size 3
	step take default class hdd
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step chooseleaf indep 0 type host
	step emit
}
rule erasure-pool.meta {
	id 3
	type erasure
	min_size 3
	max_size 3
	step take default class hdd
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step chooseleaf indep 0 type host
	step emit
}
rule erasure-pool.data {
	id 4
	type erasure
	min_size 3
	max_size 3
	step take default class hdd
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step chooseleaf indep 0 type host
	step emit
}
rule replicated_rule_ssd {
	id 5
	type replicated
	min_size 1
	max_size 10
	step take default class ssd
	step chooseleaf firstn 0 type host
	step emit
}
# end crush map

pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 1669 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr_devicehealth
pool 5 'Datapool' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 2749 lfor 0/0/321 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 7 'erasure-pool.data' erasure profile k2m1 size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 128 pgp_num 126 pgp_num_target 128 autoscale_mode on last_change 2780 lfor 0/0/1676 flags hashpspool,ec_overwrites stripe_width 8192 application cephfs
pool 8 'erasure-pool.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 344 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 9 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 592 flags hashpspool stripe_width 0 application rgw
pool 10 'brescia-ovest.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 595 flags hashpspool stripe_width 0 application rgw
pool 11 'brescia-ovest.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32
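The `replicated_rule_ssd` rule in the map above takes the `default` root restricted to the `ssd` device class, then chooses one OSD per host. A minimal Python sketch of that selection logic (OSD-to-class layout reconstructed from the bucket dump above; this is an illustration of class filtering, not the real straw2 algorithm):

```python
# Layout reconstructed from the CRUSH bucket dump: each host has three
# 10.914-weight hdd OSDs and one 1.819-weight ssd OSD.
hosts = {
    "ceph01s3": {"osd.0": "hdd", "osd.5": "hdd", "osd.8": "hdd", "osd.9": "ssd"},
    "ceph02s3": {"osd.1": "hdd", "osd.3": "hdd", "osd.7": "hdd", "osd.10": "ssd"},
    "ceph03s3": {"osd.2": "hdd", "osd.4": "hdd", "osd.6": "hdd", "osd.11": "ssd"},
}

def chooseleaf_by_class(hosts, device_class):
    """Mimic 'step take default class X / step chooseleaf firstn 0 type host':
    keep only OSDs of the requested class, then pick at most one per host."""
    placement = {}
    for host, osds in hosts.items():
        candidates = [osd for osd, cls in osds.items() if cls == device_class]
        if candidates:
            placement[host] = candidates[0]
    return placement

print(chooseleaf_by_class(hosts, "ssd"))
# With this map, the ssd rule can only ever land on osd.9, osd.10 and osd.11.
```

Note that a class-restricted replicated rule like `replicated_rule_ssd` can also be created without hand-editing the map, via `ceph osd crush rule create-replicated replicated_rule_ssd default host ssd`; a pool is then switched to it with `ceph osd pool set <pool> crush_rule replicated_rule_ssd`.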
[ceph-users] Re: upgrading from 15.2.17 to 16.2.11 - Health ERROR
cephadm is 16.2.11, because the error comes from the upgrade from 15 to 16.

On Mon, 13 Mar 2023 at 18:27, Clyso GmbH - Ceph Foundation Member wrote:
> Which version of cephadm are you using?
>
> ___
> Clyso GmbH - Ceph Foundation Member
>
> On 10.03.23 at 11:17, xadhoo...@gmail.com wrote:
> > Looking at ceph orch upgrade check, I found:
> >
> > },
> > "cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2": {
> >     "current_id": null,
> >     "current_name": null,
> >     "current_version": null
> > },
> >
> > Could this lead to the issue?
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
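For reference, a sketch of the cephadm upgrade workflow these messages refer to (commands run against a live cluster; the target version here is the one from the thread, not a recommendation):

```
# Dry-run check: does every daemon's host satisfy the target version's needs?
ceph orch upgrade check --ceph-version 16.2.11

# Start and monitor the staggered upgrade.
ceph orch upgrade start --ceph-version 16.2.11
ceph orch upgrade status

# Inspect per-daemon state if a daemon reports a null/unknown version.
ceph orch ps
ceph health detail
```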