[ceph-users] Re: ceph cluster extremely unbalanced

2024-03-25 Thread Denis Polom
…weekly until you redeploy all OSDs that were created with 64K bluestore_min_alloc_size. A hybrid approach (an initial round of balancing with TheJJ, then switching to the built-in balancer) may also be viable. On Sun, Mar 24, 2024 at 7:09 PM Denis Polom wrote: Hi guys, recently I took over the care
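A hedged way to see which OSDs still carry the old 64K allocation size, assuming your build exposes bluestore_min_alloc_size in the OSD metadata (recent releases do):

  ceph config get osd bluestore_min_alloc_size_hdd        # what newly deployed HDD OSDs will be created with
  ceph osd metadata | jq -r '.[] | select(.bluestore_min_alloc_size == "65536") | .id'   # OSDs still on 64K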

[ceph-users] ceph cluster extremely unbalanced

2024-03-24 Thread Denis Polom
Hi guys, recently I took over the care of a Ceph cluster that is extremely unbalanced. The cluster is running Quincy 17.2.7 (upgraded Nautilus -> Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it. The CRUSH failure domain is datacenter (there are 3), the data pool is EC 3+3. This
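A minimal sketch of the first steps usually suggested for a cluster in this state (assuming all clients are at least Luminous; option values are illustrative):

  ceph osd df tree                                          # compare %USE and PGS across the HDD OSDs
  ceph osd set-require-min-compat-client luminous           # required before upmap can be used
  ceph balancer mode upmap
  ceph config set mgr mgr/balancer/upmap_max_deviation 1    # illustrative: aim for +/- 1 PG per OSD
  ceph balancer on
  ceph balancer status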

[ceph-users] ceph metrics units

2024-03-14 Thread Denis Polom
Hi guys, do you know if there is some table of Ceph metrics and the units that should be used for them? I am currently struggling with ceph_osd_op_r_latency_sum and ceph_osd_op_w_latency_sum: are they in ms or in seconds? Any idea please? Thx!
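For what it's worth, these come from the OSD perf counters, where the latency sums are tracked in seconds; the average latency is the delta of the sum divided by the delta of the count. A quick way to inspect the raw counter on one OSD (daemon id is a placeholder):

  ceph daemon osd.0 perf dump | jq '.osd.op_r_latency'      # shows avgcount, sum (in seconds) and avgtime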

[ceph-users] ceph-mgr client.0 error registering admin socket command: (17) File exists

2024-02-26 Thread Denis Polom
Hi, running Ceph Quincy 17.2.7 on Ubuntu Focal LTS, the ceph-mgr service reports the following error: client.0 error registering admin socket command: (17) File exists. I don't use any extra mgr configuration: mgr   advanced  mgr/balancer/active true mgr   advanced

[ceph-users] Re: OSDs failing to start due to crc32 and osdmap error

2023-11-27 Thread Denis Polom
…bluestore_compression_mode. Thanks. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Mon, Nov 27, 2023 at 2:01 PM Denis Polom wrote: Hi, no we don't:

[ceph-users] Re: OSDs failing to start due to crc32 and osdmap error

2023-11-27 Thread Denis Polom
*Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Mon, Nov 27, 2023 at 2:01 PM Denis Polom wrote: Hi, no we don't: "bluestore_rocksdb_options": "compression=kNoCompression,max_write_buffer_number=4,min_write_buff

[ceph-users] Re: OSDs failing to start due to crc32 and osdmap error

2023-11-27 Thread Denis Polom
…max_background_compactions=2,max_total_wal_size=1073741824", thx On 11/27/23 19:17, Wesley Dillingham wrote: Curious if you are using bluestore compression? Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Mon, Nov 27, 2023 at 10:09 AM Denis Pol

[ceph-users] OSDs failing to start due to crc32 and osdmap error

2023-11-27 Thread Denis Polom
Hi, we have an issue starting some OSDs on one node of our Ceph Quincy 17.2.7 cluster. Some OSDs on that node are running fine, but some fail to start. It looks like a crc32 checksum error and a failure to get the OSD map. I found some discussions on that but nothing helped. I've also tried to
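One workaround often suggested for a corrupted local osdmap copy (a sketch only; back up the OSD first, and the OSD id and epoch below are placeholders taken from the error message):

  systemctl stop ceph-osd@48
  ceph osd getmap 123456 -o /tmp/osdmap.123456                      # fetch a good copy of the epoch the OSD complains about
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-48 \
      --op set-osdmap --file /tmp/osdmap.123456                     # inject it into the stopped OSD
  systemctl start ceph-osd@48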

[ceph-users] CephFS - MDS removed from map - filesystem keeps being stopped

2023-11-22 Thread Denis Polom
Hi, running Ceph Pacific 16.2.13. We had a full CephFS filesystem, and after adding new HW we tried to start it, but our MDS daemons are pushed to standby and removed from the MDS map. The filesystem was broken, so we repaired it with: # ceph fs fail cephfs # cephfs-journal-tool --rank=cephfs:0
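For reference, the documented recovery sequence this refers to looks roughly like the following (a sketch only; the journal and session-table resets are destructive, so check the upstream disaster-recovery docs for your exact release first):

  ceph fs fail cephfs
  cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
  cephfs-journal-tool --rank=cephfs:0 journal reset
  cephfs-table-tool all reset session        # resets the session table for all ranks
  ceph fs set cephfs joinable true
  ceph fs status cephfs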

[ceph-users] Re: resharding RocksDB after upgrade to Pacific breaks OSDs

2023-11-03 Thread Denis Polom
the command as documented will cause this corruption. The correct command to run is: ceph-bluestore-tool \ --path \ --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \ reshard Josh On Fri, Nov 3, 2023 at 7:58 AM Denis Polom wrote:
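Spelled out with the OSD stopped and a placeholder OSD id/path (the sharding string is the one from the upstream docs):

  systemctl stop ceph-osd@0
  ceph-bluestore-tool \
      --path /var/lib/ceph/osd/ceph-0 \
      --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
      reshard
  systemctl start ceph-osd@0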

[ceph-users] Re: resharding RocksDB after upgrade to Pacific breaks OSDs

2023-11-03 Thread Denis Polom
…of this resharding operation yet, but is it really safe? I don't have an idea how to fix it; I just recreated the OSDs. Zitat von Denis Polom : Hi we upgraded our Ceph cluster from latest Octopus to Pacific 16.2.14 and then we followed the docs (https://docs.ceph.com/en/latest/rados

[ceph-users] resharding RocksDB after upgrade to Pacific breaks OSDs

2023-11-02 Thread Denis Polom
Hi, we upgraded our Ceph cluster from the latest Octopus to Pacific 16.2.14 and then we followed the docs (https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#rocksdb-sharding ) to

[ceph-users] CephFS scrub causing MDS OOM-kill

2023-11-02 Thread Denis Polom
Hi, I set up a CephFS forward scrub by executing the command # ceph tell mds.cephfs:0 scrub start / recursive { "return_code": 0, "scrub_tag": "37a67f72-89a3-474e-8f8b-1e55cb979e14", "mode": "asynchronous" } But immediately after it started, memory usage on the MDS that holds rank 0 increased
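If the scrub needs to be stopped while the memory issue is investigated, a hedged sketch (the rank is taken from the thread):

  ceph tell mds.cephfs:0 scrub status
  ceph tell mds.cephfs:0 scrub pause                 # or: scrub abort
  ceph config get mds mds_cache_memory_limit         # the cache limit the MDS is expected to stay near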

[ceph-users] Re: Moving devices to a different device class?

2023-11-01 Thread Denis Polom
Hi, well, I would first check the crush rules to see whether a device class is defined there. If it is, then you have to create a new crush rule and set it on the affected pools. dp On 10/26/23 23:36, Matt Larson wrote: Thanks Janne, It is good to know that moving the devices over to a new class is a
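A hedged sketch of that check and of switching a pool to a class-specific rule (rule, class and pool names are placeholders):

  ceph osd crush rule ls
  ceph osd crush rule dump replicated_rule | grep item_name      # a class-restricted rule shows e.g. "default~hdd"
  ceph osd crush class ls
  ceph osd crush rule create-replicated replicated_nvme default host nvme
  ceph osd pool set mypool crush_rule replicated_nvme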

[ceph-users] what are the options for config a CephFS client session

2023-06-12 Thread Denis Polom
Hi, I didn't find any docs or any way to find out the valid options for configuring a client session over the MDS socket: #> ceph tell mds.mds1 session config session config [] :  Config a CephFS client session Any hint on this? Thank you

[ceph-users] Re: ceph fs perf stats output is empty

2023-06-12 Thread Denis Polom
…output would be empty or outdated in most cases. You need to query a few times to get the latest values. So try `watch ceph fs perf stats`. On Mon, 12 Jun 2023 at 06:30, Xiubo Li wrote: On 6/10/23 05:35, Denis Polom wrote: > Hi > > I'm running latest Ceph Pacific 16.2.13

[ceph-users] ceph fs perf stats output is empty

2023-06-09 Thread Denis Polom
Hi, I'm running latest Ceph Pacific 16.2.13 with CephFS. I need to collect performance stats per client, but I'm getting an empty list without any numbers. I even ran dd on a client against the mounted CephFS, but the output only looks like this: #> ceph fs perf stats 0 4638 192.168.121.1 {"version": 2,

[ceph-users] Re: OSDs are not utilized evenly

2022-11-08 Thread Denis Polom
…Nov 2, 2022 at 5:01 PM Denis Polom wrote: Hi Joseph, thank you for the answer. But if I'm reading the 'ceph osd df' output I posted correctly, I see there are about 195 PGs per OSD. There are 608 OSDs in the pool, which is the only data pool. What I have calculated - the PG calc says

[ceph-users] Re: OSDs are not utilized evenly

2022-11-02 Thread Denis Polom
Mundackal wrote: If the GB per pg is high, the balancer module won't be able to help. Your pg count per osd also looks low (30's), so increasing pgs per pool would help with both problems. You can use the pg calculator to determine which pools need what On Tue, Nov 1, 2022, 08:46 Denis Polom wrote
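A hedged sketch of the kind of change being suggested (pool name and target pg_num are placeholders; raise pg_num in steps and watch the resulting backfill):

  ceph osd pool autoscale-status                     # what the autoscaler thinks each pool should have
  ceph osd pool get cephfs_data pg_num
  ceph osd pool set cephfs_data pg_num 4096          # placeholder target; pgp_num follows automatically on recent releases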

[ceph-users] OSDs are not utilized evenly

2022-11-01 Thread Denis Polom
Hi, I observed on my Ceph cluster running the latest Pacific that same-size OSDs are utilized differently even though the balancer is running and reports its status as perfectly balanced. {     "active": true,     "last_optimize_duration": "0:00:00.622467",     "last_optimize_started": "Tue Nov  1 12:49:36

[ceph-users] Re: bunch of " received unsolicited reservation grant from osd" messages in log

2022-07-01 Thread Denis Polom
OK, and when will it be backported to Pacific? On 6/27/22 18:59, Neha Ojha wrote: This issue should be addressed by https://github.com/ceph/ceph/pull/46860. Thanks, Neha On Fri, Jun 24, 2022 at 2:53 AM Kenneth Waegeman wrote: Hi, I’ve updated the cluster to 17.2.0, but the log is still

[ceph-users] MDS error handle_find_ino_reply failed with -116

2022-06-15 Thread Denis Polom
Hi, I have Ceph Pacific 16.2.9 with CephFS and 4 MDS daemons (2 active, 2 standby-replay) == RANK  STATE   MDS  ACTIVITY DNS    INOS   DIRS CAPS  0    active  mds3  Reqs:   31 /s   162k   159k  69.5k 177k  1    active  mds1  Reqs:    4 /s  31.0k  28.7k  10.6k

[ceph-users] large removed snaps queue

2022-05-31 Thread Denis Polom
Hi, we are taking RBD snapshots of images on Ceph with an hourly schedule and 1-day retention. When I run the `ceph osd pool ls detail` command I can see a lot of entries in removed_snaps_queue and a few in removed_snaps. Can someone explain what it means and whether I should give extra
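As far as I understand it, removed_snaps_queue holds snapshot IDs that have been deleted but whose objects the OSDs have not finished trimming yet, so it normally drains on its own. A hedged way to keep an eye on it:

  ceph osd pool ls detail | grep removed_snaps_queue       # re-run over time; the queue should shrink
  ceph config get osd osd_snap_trim_sleep_hdd              # trimming pace is throttled by the snap-trim sleep options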

[ceph-users] orphaned journal_data objects on pool after disabling rbd mirror

2022-05-23 Thread Denis Polom
Hi, after disabling the journaling feature on images and disabling rbd-mirror on the pool, there are still a lot of journal_data objects in the pool. Is it safe to remove these objects manually from the pool? Thanks
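A hedged way to list and clean them up (the pool name is a placeholder; only remove objects after confirming no image in the pool still has the journaling feature enabled):

  rados -p rbd ls | grep '^journal_data\.' | head                       # leftover journal objects
  rados -p rbd ls | grep '^journal_data\.' | xargs -n1 rados -p rbd rm  # remove them once confirmed orphaned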

[ceph-users] Re: Drained OSDs are still ACTIVE_PRIMARY - causing high IO latency on clients

2022-05-20 Thread Denis Polom
in octopus to a replicated pool in nautilus. Does primary affinity work for you in octopus on a replicated pool? And does a nautilus EC pool work? .. Dan On Fri., May 20, 2022, 13:53 Denis Polom, wrote: Hi I observed high latencies and mount points hanging since

[ceph-users] Drained OSDs are still ACTIVE_PRIMARY - causing high IO latency on clients

2022-05-20 Thread Denis Polom
Hi, I have observed high latencies and hanging mount points since the Octopus release, and it is still observed on the latest Pacific while draining an OSD. Cluster setup: Ceph Pacific 16.2.7, CephFS with an EC data pool. EC profile setup: crush-device-class= crush-failure-domain=host crush-root=default
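One workaround discussed in this thread is to drop the primary affinity of the OSD before draining it, e.g. (OSD id is a placeholder; whether this is honored for EC pools is exactly what the thread is questioning):

  ceph osd primary-affinity osd.123 0        # avoid picking this OSD as primary where possible
  ceph osd crush reweight osd.123 0          # then start draining it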

[ceph-users] Re: bunch of " received unsolicited reservation grant from osd" messages in log

2022-05-17 Thread Denis Polom
Hi, is it still not backported to the latest 16.2.8? I don't see it in the release notes. On 12/19/21 11:05, Ronen Friedman wrote: On Sat, Dec 18, 2021 at 7:06 PM Ronen Friedman wrote: Hi all, This was indeed a bug, which I've already fixed in 'master'. I'll look for the backporting status

[ceph-users] unable to disable journaling image feature

2022-05-15 Thread Denis Polom
Hi, on Ceph Pacific 16.2.7 I have an image on which I need to disable the journaling feature. It's on the primary site, and the mirror is not running anymore. The secondary site doesn't exist anymore. I want to disable mirroring on the pool, but this is blocking it. The error is: # rbd feature disable
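The usual escape hatch when the peer no longer exists (hedged; pool and image names are placeholders) is to force-disable mirroring on the image first, after which the journaling feature can be removed:

  rbd mirror image disable --force rbd/myimage
  rbd feature disable rbd/myimage journaling
  rbd mirror pool disable rbd                    # once no image in the pool needs mirroring anymore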

[ceph-users] Re: RBD mirror direction settings issue

2022-05-02 Thread Denis Polom
Denis Polom wrote: Hi, I'm setting up RBD mirror between two Ceph clusters and have an issue to set up rx-tx direction on primary site. Issuing the command rbd mirror pool peer bootstrap import --direction rx-tx --site-name primary rbd token Hi Denis, Normally, the token is created

[ceph-users] RBD mirror direction settings issue

2022-04-30 Thread Denis Polom
Hi, I'm setting up RBD mirroring between two Ceph clusters and have an issue setting up the rx-tx direction on the primary site. When issuing the command rbd mirror pool peer bootstrap import --direction rx-tx --site-name primary rbd token, I'm expecting a bi-directional mirror for the pool rbd. But I'm
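For reference, the documented two-step bootstrap looks roughly like this (site names and token path are placeholders):

  # on the primary cluster
  rbd mirror pool enable rbd image
  rbd mirror pool peer bootstrap create --site-name primary rbd > /tmp/token
  # on the secondary cluster
  rbd mirror pool peer bootstrap import --site-name secondary --direction rx-tx rbd /tmp/token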

[ceph-users] Re: crashing OSDs with FAILED ceph_assert

2022-03-12 Thread Denis Polom
does that start to happen? How often? Thanks, Igor On 3/12/2022 6:14 PM, Denis Polom wrote: Hi Igor, before the assertion there is 2022-03-12T10:15:35.879+0100 7f0e61055700 -1 bdev(0x55a61c6a6000 /var/lib/ceph/osd/ceph-48/block) aio_submit retries 5 2022-03-12T10:15:35.883+0100 7f0e6d06d700

[ceph-users] Re: crashing OSDs with FAILED ceph_assert

2022-03-12 Thread Denis Polom
to happen? How often? Thanks, Igor On 3/12/2022 6:14 PM, Denis Polom wrote: Hi Igor, before the assertion there is 2022-03-12T10:15:35.879+0100 7f0e61055700 -1 bdev(0x55a61c6a6000 /var/lib/ceph/osd/ceph-48/block) aio_submit retries 5 2022-03-12T10:15:35.883+0100 7f0e6d06d700 -1 bdev

[ceph-users] Re: crashing OSDs with FAILED ceph_assert

2022-03-12 Thread Denis Polom
…It usually has some helpful information, e.g. an error code, about the root cause. Thanks, Igor On 3/12/2022 5:01 PM, Denis Polom wrote: Hi, I have Ceph cluster version Pacific 16.2.7 with RBD pool and OSDs made on SSDs with DB on separate NVMe. What I observe OSDs are crashing randomly. Output

[ceph-users] crashing OSDs with FAILED ceph_assert

2022-03-12 Thread Denis Polom
Hi, I have a Ceph cluster, version Pacific 16.2.7, with an RBD pool and OSDs built on SSDs with the DB on a separate NVMe. What I observe is that OSDs are crashing randomly. The output of the crash info is: {     "archived": "2022-03-12 11:44:37.251897",     "assert_condition": "r == 0",     "assert_file":
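For completeness, the crash dump shown here comes from the crash module, which can be browsed with (the crash id is a placeholder):

  ceph crash ls
  ceph crash info <crash-id>        # id taken from `ceph crash ls`
  ceph crash archive <crash-id>     # acknowledge it once analysed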

[ceph-users] Re: Scrubbing

2022-03-11 Thread Denis Polom
Hi, I had a similar problem on my large cluster. What I found, and what helped me solve it: due to bad drives and replacing drives too often because of scrub errors, there were always some recovery operations going on. I set this: osd_scrub_during_recovery true, and it basically solved my issue.
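For reference, that flag can be flipped at runtime (hedged; consider the extra load of scrubbing while recovery is in progress):

  ceph config set osd osd_scrub_during_recovery true
  ceph config get osd osd_scrub_during_recovery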

[ceph-users] Ceph OSD spurious read errors and PG autorepair

2021-12-07 Thread Denis Polom
Hi, I'm observing the following behavior on our Ceph clusters: on the Ceph cluster where I have enabled osd_scrub_auto_repair = true I can observe 'spurious read errors' warnings. On other Ceph clusters where this option is set to false I don't see this warning. But on the other hand I often have scrub
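A quick sketch for comparing the clusters and repairing flagged PGs by hand where auto-repair is off (pool and PG id are placeholders):

  ceph config get osd osd_scrub_auto_repair
  rados list-inconsistent-pg rbd                 # placeholder pool name
  ceph pg repair 2.1f                            # placeholder PG id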

[ceph-users] Re: CephFS multi active MDS high availability

2021-10-24 Thread Denis Polom
Hi, even better is to set allow_standby_replay and have for example 2 active and 2 standby. More here https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay dp On 10/24/21 09:52, huxia...@horebdata.cn wrote: Dear Cephers, When setting up multiple active CephFS MDS,
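A minimal sketch of that layout (the filesystem name is a placeholder):

  ceph fs set cephfs max_mds 2
  ceph fs set cephfs allow_standby_replay true
  ceph fs status cephfs          # should show two active ranks, each followed by a standby-replay daemon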

[ceph-users] Re: monitor not joining quorum

2021-10-21 Thread Denis Polom
mon_status Mike On Wed, 20 Oct 2021 at 07:58, Konstantin Shalygin wrote: Do you have any backfilling operations? In our case when backfilling was done mon joins to quorum immediately k Sent from my iPhone > On 20 Oct 2021, at 08:52, Denis Polom wrote: > > 

[ceph-users] Re: monitor not joining quorum

2021-10-19 Thread Denis Polom
Hi, I've checked it: there is no IP address collision, the ARP tables are OK, the MTU as well, and according to tcpdump no packets are being lost. On 10/19/21 21:36, Konstantin Shalygin wrote: Hi, On 19 Oct 2021, at 21:59, Denis Polom wrote: 2021-10-19 16:22:07.629 7faec9dd2700  1 mon.ceph1@0

[ceph-users] Re: monitor not joining quorum

2021-10-19 Thread Denis Polom
older than 16.2.6 it could be that same issue and workarounds are discussed in the tracker. Even if you are on 16.2.6 the workarounds in that tracker could still be helpful. On Tue, Oct 19, 2021 at 12:07 PM Denis Polom wrote: Hi, one of our monitor VM

[ceph-users] monitor not joining quorum

2021-10-19 Thread Denis Polom
Hi, one of our monitor VMs was rebooted and is not joining the quorum again (the quorum consists of 3 monitors). While the monitor service (ceph1) is running on this VM, the Ceph cluster becomes unreachable. In the monitor logs on the ceph3 VM I can see a lot of the following messages: 2021-10-19 17:50:19.555
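A few hedged checks for a monitor in this state (the daemon names are taken from the thread):

  ceph daemon mon.ceph1 mon_status | jq '.state, .quorum'   # run on the rebooted monitor's host: probing, synchronizing, electing?
  ceph tell mon.ceph3 mon_status                            # compare against a monitor that is in quorum
  ceph -s                                                   # long-running backfill can delay the join, as noted elsewhere in the thread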

[ceph-users] Re: ceph IO are interrupted when OSD goes down

2021-10-18 Thread Denis Polom
…Are some disks utilized around 100% (iostat) when this happens? Zitat von Denis Polom : Hi, it's min_size: 10 On 10/18/21 14:43, Eugen Block wrote: What is your min_size for the affected pool? Zitat von Denis Polom : Hi, I have 18 OSD nodes in this cluster. And it does happen even

[ceph-users] ceph IO are interrupted when OSD goes down

2021-10-18 Thread Denis Polom
Hi, I have an EC pool with these settings: crush-device-class= crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=10 m=2 plugin=jerasure technique=reed_sol_van w=8 and my understanding is that if some of the OSDs go down because of a read error or just flapping due
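Worth checking is the pool's min_size relative to k (hedged: for a k=10, m=2 profile the default min_size is k+1 = 11, and a PG pauses IO when fewer than min_size shards are available; the rest of the thread digs into exactly this). The pool name is a placeholder:

  ceph osd pool get cephfs_data min_size
  ceph osd pool get cephfs_data size            # should be k+m = 12 for this profile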

[ceph-users] Re: cephfs_metadata pool unexpected space utilization

2021-09-07 Thread Denis Polom
Hi, any help here, please? I observe the same behavior on a cluster I just updated to the latest Octopus. Any help will be appreciated, thx. On 8/6/21 14:41, Denis Polom wrote: Hi, I observe strange behavior on my Ceph MDS cluster, where the cephfs_metadata pool is filling up without an obvious

[ceph-users] cephfs_metadata pool unexpected space utilization

2021-08-06 Thread Denis Polom
Hi, I observe strange behavior on my Ceph MDS cluster, where the cephfs_metadata pool is filling up without an obvious reason. It's growing by +15% per day even when there is no I/O on the cluster. I have separate SSD disks for the metadata pool, each 112G, with pool replica size 3. `ceph fs status`
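Some hedged checks for narrowing down where the metadata growth comes from (the MDS daemon name is a placeholder):

  rados df | grep cephfs_metadata                        # objects vs. bytes over time
  ceph daemon mds.<name> perf dump | jq '.mds_log'       # a growing MDS journal (segments/events) lives in the metadata pool
  ceph fs status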