[ceph-users] Re: 1 stray daemon(s) not managed by cephadm

2022-07-25 Thread Adam King
Usually it's pretty explicit in "ceph health detail". What does it say there? On Mon, Jul 25, 2022 at 9:05 PM Jeremy Hansen wrote: > How do I track down what is the stray daemon? > > Thanks > -jeremy
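A minimal sketch of how one might chase this down, assuming a cephadm-managed cluster (hostnames are placeholders, not from the thread):

  # CEPHADM_STRAY_DAEMON warnings name the daemon and host explicitly
  ceph health detail
  # Daemons cephadm knows about and manages
  ceph orch ps
  # On each host: everything cephadm sees running locally; anything present here
  # but missing from 'ceph orch ps' is the stray
  cephadm ls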

[ceph-users] 1 stray daemon(s) not managed by cephadm

2022-07-25 Thread Jeremy Hansen
How do I track down what is the stray daemon? Thanks -jeremy

[ceph-users] Two osd's assigned to one device

2022-07-25 Thread Jeremy Hansen
I have a situation (not sure how it happened), but Ceph believes I have two OSDs assigned to a single device. I tried to delete osd.2 and osd.3, but it just hangs. I'm also trying to zap sdc, which claims it does not have an OSD, but I'm unable to zap it. Any suggestions? /dev/sdb HDD TOSHIBA
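A hedged sketch of the cephadm-era commands usually involved here, not a definitive recipe (the hostname is a placeholder; OSD ids and device are taken from the message above):

  # Ask the orchestrator to remove the OSDs and watch progress instead of waiting blindly
  ceph orch osd rm 2 3 --force
  ceph orch osd rm status
  # Once the OSDs are gone, zap the device through the orchestrator...
  ceph orch device zap cn01 /dev/sdc --force
  # ...or directly with ceph-volume on the host (e.g. from 'cephadm shell') if that refuses
  ceph-volume lvm zap --destroy /dev/sdc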

[ceph-users] Re: Quincy full osd(s)

2022-07-25 Thread Nigel Williams
Hi Wesley, thank you for the follow up. Anthony D'Atri kindly helped me out with some guidance and advice and we believe the problem is resolved now. This was a brand new install of a Quincy cluster and I made the mistake of presuming that autoscale would adjust the PGs as required, however it
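For anyone hitting the same thing, a rough sketch of the checks involved (the pool name is a placeholder):

  # See what the autoscaler thinks each pool needs vs. what it currently has
  ceph osd pool autoscale-status
  # Make sure the autoscaler is actually enabled on the pool
  ceph osd pool set mypool pg_autoscale_mode on
  # On Pacific/Quincy, marking a large data pool as 'bulk' lets the autoscaler
  # give it a full PG budget up front instead of growing it slowly
  ceph osd pool set mypool bulk true
  # Or just set pg_num by hand if you prefer
  ceph osd pool set mypool pg_num 256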

[ceph-users] Re: octopus v15.2.17 QE Validation status

2022-07-25 Thread Neha Ojha
On Mon, Jul 25, 2022 at 3:48 PM Neha Ojha wrote: > > Hello Frank, > > 15.2.17 includes > https://github.com/ceph/ceph/pull/46611/commits/263e0fa6b3e6e1d6e7b382923a1d586d9d1ffa1b, > which adds capability in the ceph-objectstore-tool to trim the dup ops > that led to memory growth in

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
I use Ubiquiti equipment, mainly because I'm not a network admin... I rebooted the 10G switches and now everything is working and recovering. I hate it when there's not a definitive answer, but that's kind of the deal when you use Ubiquiti stuff. Thank you Sean and Frank. Frank, you were right. It

[ceph-users] Re: octopus v15.2.17 QE Validation status

2022-07-25 Thread Neha Ojha
Hello Frank, 15.2.17 includes https://github.com/ceph/ceph/pull/46611/commits/263e0fa6b3e6e1d6e7b382923a1d586d9d1ffa1b, which adds capability in the ceph-objectstore-tool to trim the dup ops that led to memory growth in https://tracker.ceph.com/issues/53729. The complete fix is being tested in
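Not the official procedure, just a hedged sketch of how one might inspect the dup entries offline with ceph-objectstore-tool while the OSD is stopped (OSD id and PG id are placeholders; the actual trim operation is the one added by the PR above, see the tracker for its exact usage):

  # Stop the OSD first (the systemd unit name differs for cephadm-managed OSDs)
  systemctl stop ceph-osd@2
  # List the PGs held by this OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op list-pgs
  # Dump the pg log for one PG; the dup entries it carries show up in the JSON output
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --pgid 2.7 --op log > pg_2.7_log.json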

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Sean Redmond
Yea, assuming you can ping with a lower MTU, check the MTU on your switching. On Mon, 25 Jul 2022, 23:05 Jeremy Hansen, wrote: > That results in packet loss: > > [root@cn01 ~]# ping -M do -s 8972 192.168.30.14 > PING 192.168.30.14 (192.168.30.14) 8972(9000) bytes of data. > ^C > ---

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
That results in packet loss:

[root@cn01 ~]# ping -M do -s 8972 192.168.30.14
PING 192.168.30.14 (192.168.30.14) 8972(9000) bytes of data.
^C
--- 192.168.30.14 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2062ms

That's very weird... but this gives me something to
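For reference, a quick way to bracket where the MTU problem sits (addresses are from the output above; the payload sizes account for the 28 bytes of IP+ICMP headers):

  # Standard 1500-byte path: 1500 - 28 = 1472 payload; should work everywhere
  ping -M do -c 3 -s 1472 192.168.30.14
  # Jumbo path: 9000 - 28 = 8972 payload; fails if any hop is not configured for jumbo frames
  ping -M do -c 3 -s 8972 192.168.30.14
  # Confirm the local interface itself is at 9000
  ip link show enp2s0 | grep mtu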

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
Does ceph do any kind of io fencing if it notices an anomaly? Do I need to do something to re-enable these hosts if they get marked as bad? On Mon, Jul 25, 2022 at 2:56 PM Jeremy Hansen wrote: > MTU is the same across all hosts: > > - cn01.ceph.la1.clx.corp- > enp2s0:
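A hedged sketch of non-destructive commands that might show whether anything is administratively holding the OSDs or hosts down (the OSD id is a placeholder):

  # Which OSDs are down, and under which host?
  ceph osd tree
  # Any cluster-wide flags (noup/noout/nodown) that would keep daemons from rejoining?
  ceph osd dump | grep flags
  # Does cephadm still consider the hosts reachable?
  ceph orch host ls
  # An OSD that was auto-marked out can be put back in once its daemon is running again
  ceph osd in 12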

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
MTU is the same across all hosts:

- cn01.ceph.la1.clx.corp-
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.11  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:728d  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:72:8d  txqueuelen 1000

[ceph-users] Re: [Warning Possible spam] Re: Issues after a shutdown

2022-07-25 Thread Adam King
Do the journal logs for any of the OSDs that are marked down give any useful info on why they're failing to start back up? If the host level ip issues have gone away I think that would be the next place to check. On Mon, Jul 25, 2022 at 5:03 PM Jeremy Hansen wrote: > I noticed this on the
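In case it helps, a sketch of where those journal logs usually live on a cephadm deployment (the daemon name and fsid are placeholders):

  # On the affected host: list the daemons and their systemd unit names
  cephadm ls
  # Tail the journal for one OSD via cephadm...
  cephadm logs --name osd.12
  # ...or go straight to journalctl with the cluster fsid filled in
  journalctl -u ceph-<fsid>@osd.12 -n 200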

[ceph-users] Re: [Warning Possible spam] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
I noticed this on the initial run of ceph health, but I no longer see it. When you say "don't use ceph adm", can you explain why this is bad? This is ceph health outside of cephadm shell: HEALTH_WARN 1 filesystem is degraded; 2 MDSs report slow metadata IOs; 2/5 mons down, quorum cn02,cn03,cn01;

[ceph-users] Re: octopus v15.2.17 QE Validation status

2022-07-25 Thread Casey Bodley
On Sun, Jul 24, 2022 at 11:33 AM Yuri Weinstein wrote: > > Still seeking approvals for: > > rados - Travis, Ernesto, Adam > rgw - Casey rgw approved > fs, kcephfs, multimds - Venky, Patrick > ceph-ansible - Brad pls take a look > > Josh, upgrade/client-upgrade-nautilus-octopus failed, do we

[ceph-users] Re: octopus v15.2.17 QE Validation status

2022-07-25 Thread Adam King
orch approved. The test_cephadm_repos test failure is just a problem with the test I believe, not any actual ceph code. The other selinux denial I don't think is new. Thanks, - Adam King On Sun, Jul 24, 2022 at 11:33 AM Yuri Weinstein wrote: > Still seeking approvals for: > > rados - Travis,

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
Here's some more info: HEALTH_WARN 2 failed cephadm daemon(s); 3 hosts fail cephadm check; 2 filesystems are degraded; 1 MDSs report slow metadata IOs; 2/5 mons down, quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down; Reduced data availability: 13 pgs inactive, 9 pgs down; Degraded data

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
Pretty desperate here. Can someone suggest what I might be able to do to get these OSDs back up? It looks like my recovery has stalled. On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri wrote: > Do your values for public and cluster network include the new addresses on > all nodes? > This
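To make Anthony's question concrete, a hedged sketch of where those values can be checked (output will obviously differ per cluster):

  # What the cluster thinks the public and cluster networks are
  ceph config get mon public_network
  ceph config get osd cluster_network
  ceph config dump | grep -E 'public_network|cluster_network'
  # And the addresses the monitors are actually bound to
  ceph mon dump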

[ceph-users] Re: weird performance issue on ceph

2022-07-25 Thread Mark Nelson
I don't think so if this is just plain old RBD.  RBD  shouldn't require a bunch of RocksDB iterator seeks in the read/write hot path and writes should pretty quickly clear out tombstones as part of the memtable flush and compaction process even in the slow case.  Maybe in some kind of

[ceph-users] Re: Ceph orch commands non-responsive after mgr/mon reboots 16.2.9

2022-07-25 Thread Tim Olow
I just wanted to follow up on this issue as it corrected itself today. I started a drain/remove on two hosts a few weeks back, after the rolling restart of mgr/mon on the cluster it seems that the ops queue either became locked or overwhelmed with requests. I had a degraded PG during the
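For anyone searching later, a rough sketch of the drain workflow being referred to, not a verbatim replay of what was run (the hostname is a placeholder):

  # Mark the host for removal; cephadm schedules removal of all daemons/OSDs on it
  ceph orch host drain cn05
  # The OSD removals show up here and can sit in the queue for a long time on a busy cluster
  ceph orch osd rm status
  # Only once everything is gone:
  ceph orch host rm cn05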

[ceph-users] Re: weird performance issue on ceph

2022-07-25 Thread Mark Nelson
Hi Zoltan, We have a very similar setup with one of our upstream community performance test clusters.  60 4TB PM983 drives spread across 10 nodes.  We get similar numbers to what you are initially seeing (scaled down to 60 drives) though with somewhat lower random read IOPS (we tend to max

[ceph-users] Re: Map RBD to multiple nodes (like NFS)

2022-07-25 Thread Wesley Dillingham
You probably want CephFS instead of RBD. Overview here: https://docs.ceph.com/en/quincy/cephfs/ Respectfully, Wes Dillingham w...@wesdillingham.com LinkedIn On Mon, Jul 25, 2022 at 11:00 AM Thomas Schneider <74cmo...@gmail.com> wrote: > Hi, > > I
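A minimal sketch of the CephFS route for this use case (filesystem, client name, monitor address and mount point are made up for illustration; the keyring handling is abbreviated):

  # Create a filesystem and a client that is allowed to read/write it
  ceph fs volume create backups
  ceph fs authorize backups client.dbbackup / rw
  # Copy the resulting keyring to /etc/ceph/ on each DB node, then mount on every node;
  # all nodes then see the same files concurrently. The option that selects the filesystem
  # is 'fs=' on recent kernels ('mds_namespace=' on older ones).
  mount -t ceph MON_IP:6789:/ /mnt/backups -o name=dbbackup,fs=backups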

[ceph-users] failed OSD daemon

2022-07-25 Thread Magnus Hagdorn
Hi there, on our pacific (16.2.9) cluster one of the OSD daemons has died and fails to restart. The OSD exposes a NVMe drive and is one of 4 identical machines. We are using podman to orchestrate the ceph daemons. The underlying OS is managed. The system worked fine without any issues until

[ceph-users] Map RBD to multiple nodes (like NFS)

2022-07-25 Thread Thomas Schneider
Hi, I have this use case: Multi-node DB must write backup to a device that is accessible by any node. The backup is currently provided as RBD, and this RBD is mapped on any node belonging to the multi-node DB. Is it possible that any node has access to the same files, independent of which

[ceph-users] Re: LibCephFS Python Mount Failure

2022-07-25 Thread Bogdan Adrian Velica
Hi Adam, I think this might be related to the user you are running the script as; try running the script as the ceph user (or the user you are running your ceph with). Also make sure the variable os.environ.get is used (I might be mistaken here). Do a print or something first to see the key is
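A small, hedged example of the kind of check being suggested here (the script name and keyring path are just placeholders):

  # Can the user the script runs as actually read the conf and keyring?
  ls -l /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring
  # Try the same script as the ceph user to rule out permissions
  sudo -u ceph python3 mount_test.py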

[ceph-users] Re: Default erasure code profile not working for 3 node cluster?

2022-07-25 Thread Mark S. Holliman
Danny, Levin, Thanks, both your answers helped (and are exactly what I suspected was the case). Looking back at the documentation I can see where my confusion began, as it isn't clear there that the "simplest" and "default" erasure code profiles are different. I'll report a documentation bug

[ceph-users] Re: Default erasure code profile not working for 3 node cluster?

2022-07-25 Thread Danny Webb
The only thing I can see from your setup is you've not set a failure domain in your crush rule, so it would default to host. And a 2/2 erasure code wouldn't work in that scenario as each stripe of the EC must be in its own failure domain. If you wanted it to work with that setup you'd need
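To make that concrete, a hedged sketch of the two ways out on a 3-node cluster (profile and pool names are made up):

  # Option 1: keep host as the failure domain, but use a profile that fits 3 hosts (k=2, m=1)
  ceph osd erasure-code-profile set ec21host k=2 m=1 crush-failure-domain=host
  ceph osd pool create ecpool erasure ec21host
  # Option 2: keep k=2, m=2 but drop the failure domain to osd
  # (tolerates OSD failures, not the loss of a whole host)
  ceph osd erasure-code-profile set ec22osd k=2 m=2 crush-failure-domain=osd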

[ceph-users] Re: Default erasure code profile not working for 3 node cluster?

2022-07-25 Thread Levin Ng
Hi Mark, A k=2 + m=2 EC profile with the failure domain set to host will require at least 4 nodes. The statement “The simplest erasure coded pool is equivalent to RAID5 and requires at least three hosts” assumes your EC profile is k=2+m=1, which

[ceph-users] Default erasure code profile not working for 3 node cluster?

2022-07-25 Thread Mark S. Holliman
Dear All, I've recently set up a 3 node Ceph Quincy (17.2) cluster to serve a pair of CephFS mounts for a Slurm cluster. Each ceph node has 6 x SSD and 6 x HDD, and I've set up the pools and crush rules to create separate CephFS filesystems using the different disk classes. I used the default
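In case it is useful to anyone replicating this layout, a rough sketch of splitting pools by device class (rule and pool names are illustrative; this shows replicated rules, while an EC profile can do the same via crush-device-class):

  # One CRUSH rule per device class
  ceph osd crush rule create-replicated rule-ssd default host ssd
  ceph osd crush rule create-replicated rule-hdd default host hdd
  # Point each pool at the matching rule
  ceph osd pool set cephfs_ssd_data crush_rule rule-ssd
  ceph osd pool set cephfs_hdd_data crush_rule rule-hdd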

[ceph-users] Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
I transitioned some servers to a new rack and now I'm having major issues with Ceph upon bringing things back up. I believe the issue may be related to the ceph nodes coming back up with different IPs before VLANs were set. That's just a guess because I can't think of any other reason this would

[ceph-users] Re: Quincy recovery load

2022-07-25 Thread Satoru Takeuchi
On Mon, Jul 25, 2022 at 18:45 Sridhar Seshasayee wrote: > > > On Mon, Jul 25, 2022 at 2:05 PM Satoru Takeuchi > wrote: > >> >> - Does this problem not exist in Pacific and older versions? >> > This problem does not exist in Pacific and prior versions. On Pacific, the > default osd_op_queue > is set to 'wpq'

[ceph-users] Re: Quincy recovery load

2022-07-25 Thread Sridhar Seshasayee
On Mon, Jul 25, 2022 at 2:05 PM Satoru Takeuchi wrote: > > - Does this problem not exist in Pacific and older versions? > This problem does not exist in Pacific and prior versions. On Pacific, the default osd_op_queue is set to 'wpq' and so this issue is not observed. - Does this problem
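For reference, a hedged sketch of how to check and, if needed, switch the scheduler (the change only takes effect after the OSDs are restarted):

  # What the OSDs are using now (mclock_scheduler is the Quincy default)
  ceph config get osd osd_op_queue
  # Fall back to the Pacific behaviour if recovery load is a problem
  ceph config set osd osd_op_queue wpq
  # Then restart the OSDs for the new scheduler to take effect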

[ceph-users] Re: ceph health "overall_status": "HEALTH_WARN"

2022-07-25 Thread Monish Selvaraj
Hi all, Recently, I deployed ceph orch (Pacific) on my nodes with 5 mons, 5 mgrs, 238 OSDs and 5 RGWs. Yesterday, 4 OSDs went out and 2 RGWs went down. So I restarted the whole RGW service with "ceph orch restart rgw.rgw". After two minutes, all the RGW nodes went down. Then I turned up the 4 OSDs and also

[ceph-users] Re: ceph health "overall_status": "HEALTH_WARN"

2022-07-25 Thread Konstantin Shalygin
Hi, Mimic has many HEALTH troubles like this. Mimic has been EOL for years; I suggest you upgrade to Nautilus 14.2.22 at least. k > On 25 Jul 2022, at 11:45, Frank Schilder wrote: > > Hi all, > > I made a strange observation on our cluster. The command ceph status -f > json-pretty

[ceph-users] Re: Quincy recovery load

2022-07-25 Thread Satoru Takeuchi
I'm trying to upgrade my Pacific cluster to Quincy and found this thread. Let me confirm a few things. - Does this problem not exist in Pacific and older versions? - Does this problem happen only if `osd_op_queue=mclock_scheduler`? - Do all parameters written in the OPERATIONS section not work if