[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread David Orman
Use journalctl -xe (maybe with -S/-U if you want to filter) to find the time period in which a restart attempt has happened, and see what's logged during that period. If that's not helpful, then what you may want to do is disable that service (systemctl disable blah) then get the ExecStart out of it, t
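
For reference, the kind of invocation being described might look like this (the fsid, OSD id and time window are placeholders, not taken from the thread):

  # inspect logs around the failed restart attempt
  journalctl -xe -u ceph-<fsid>@osd.33.service -S "2021-03-18 20:00" -U "2021-03-18 21:00"
  # stop systemd from retrying, then inspect the ExecStart line so it can be run by hand
  systemctl disable --now ceph-<fsid>@osd.33.service
  systemctl cat ceph-<fsid>@osd.33.service | grep ExecStart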

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread 胡 玮文
“podman logs ceph-xxx-osd-xxx” may contain additional logs. > On Mar 19, 2021, at 04:29, Philip Brown wrote: > > I've been banging on my ceph octopus test cluster for a few days now. > 8 nodes. each node has 2 SSDs and 8 HDDs. > They were all autoprovisioned so that each HDD gets an LVM slice of an
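
A sketch of how to find that container name with podman (fsid and OSD id are placeholders):

  # cephadm names OSD containers ceph-<fsid>-osd.<id>
  podman ps -a --format '{{.Names}}' | grep osd
  podman logs ceph-<fsid>-osd.33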

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread Stefan Kooman
On 3/18/21 9:28 PM, Philip Brown wrote: I've been banging on my ceph octopus test cluster for a few days now. 8 nodes. each node has 2 SSDs and 8 HDDs. They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as a db partition. service_type: osd service_id: osd_spec_default pl
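
The drivegroup spec being quoted would look roughly like the following sketch (field values are illustrative, not the poster's exact file): HDDs as data devices, SSDs shared for the DB.

  cat > osd_spec.yml <<'EOF'
  service_type: osd
  service_id: osd_spec_default
  placement:
    host_pattern: '*'
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  EOF
  ceph orch apply osd -i osd_spec.yml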

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread Philip Brown
yup cephadm and orch was used to set all this up. Current state of things: ceph osd tree shows 33 hdd 1.84698 osd.33 destroyed 0 1.0 cephadm logs --name osd.33 --fsid xx-xx-xx-xx along with the systemctl stuff I already saw, showed me new things such as
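
For reference, the log retrieval being described (fsid is a placeholder; it can be looked up with "ceph fsid" or "cephadm ls"):

  cephadm logs --name osd.33 --fsid <fsid>
  # roughly equivalent to reading the daemon's systemd journal directly:
  journalctl -u ceph-<fsid>@osd.33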

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread Philip Brown
Unfortunately, the pod won't stay up. So "podman logs" won't work for it. It is not even visible with "podman ps -a" - Original Message - From: "胡 玮文" To: "Philip Brown" Cc: "ceph-users" Sent: Thursday, March 18, 2021 5:56:20 PM Subject: Re: [ceph-users] ceph octopus mysterious OSD cra
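
When the container never stays up, one way to still get output is to check what podman retained and, failing that, run the daemon's start script in the foreground (a sketch; cephadm typically writes a unit.run script under /var/lib/ceph/<fsid>/<daemon>/, and the fsid here is a placeholder):

  podman ps -a | grep osd.33
  # run the OSD container start script by hand to capture its output
  /var/lib/ceph/<fsid>/osd.33/unit.run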

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-18 Thread Stefan Kooman
On 3/19/21 2:20 AM, Philip Brown wrote: yup cephadm and orch was used to set all this up. Current state of things: ceph osd tree shows 33 hdd 1.84698 osd.33 destroyed 0 1.0 ^^ Destroyed, ehh, this doesn't look good to me. Ceph thinks this OSD is dest

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Philip Brown
mkay. Sooo... what's the new and nifty proper way to clean this up? The outsider's view is, "I should just be able to run 'ceph orch osd rm 33'" but that returns Unable to find OSDs: ['33'] - Original Message - From: "Stefan Kooman" To: "Philip Brown" Cc: "ceph-users" Sent: Thursday
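
A sketch of the fallback when the orchestrator no longer tracks the OSD (as suggested later in the thread): drop to the plain mon commands.

  # orchestrator-level removal, only works for OSDs cephadm still knows about
  ceph orch osd rm 33
  # lower-level cleanup: removes the OSD from the CRUSH map, the OSD map and auth in one go
  ceph osd purge 33 --yes-i-really-mean-it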

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Philip Brown
I made *some* progress for cleanup. I could already do "ceph osd rm 33" from my master. But doing the cleanup on the actual OSD node was problematical. ceph-volume lvm zap xxx wasn't working properly.. because the device wasn't fully released because at the regular OS level, it can't even SEE
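
The usual zap sequence on the OSD host, assuming the LVs are still present, is roughly:

  # list the LVs ceph-volume knows about for this OSD
  ceph-volume lvm list
  # zap by OSD id and destroy the VG/LVs so the raw device is released
  ceph-volume lvm zap --osd-id 33 --destroy
  # or zap a raw device directly
  ceph-volume lvm zap /dev/sdX --destroy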

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Philip Brown
Unfortunately, neither of those things will work, because ceph orch daemon add does not have a syntax that lets me add an SSD as a journal to an HDD, and likewise ceph orch apply osd --all-available-devices will not do the right thing: both for mixed ssd/hdd.. but also, even though I have a l
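
The two orchestrator paths being ruled out are, roughly (host and device are placeholders):

  # blanket deployment: consumes every clean device, no control over db placement
  ceph orch apply osd --all-available-devices
  # per-daemon add: takes host:device, but offers no hdd-data/ssd-db split
  ceph orch daemon add osd <host>:/dev/sdX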

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Philip Brown
if we can't replace a drive on a node in a crash situation, without blowing away the entire node, seems to me ceph octopus fails the "test" part of the "test cluster" :-/ I vaguely recall running into this "doesn't have PARTUUID" problem before. THAT time, I did end up wiping the entire machine
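
The PARTUUID complaint can at least be verified from the OS side (a sketch; the device name is a placeholder):

  # show whether the device and its partitions carry a PARTUUID
  lsblk -o NAME,TYPE,FSTYPE,PARTUUID /dev/sdX
  blkid /dev/sdX*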

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Stefan Kooman
On 3/19/21 3:53 PM, Philip Brown wrote: mkay. Sooo... what's the new and nifty proper way to clean this up? The outsider's view is, "I should just be able to run 'ceph orch osd rm 33'" Can you spawn a cephadm shell and run: ceph osd rm 33? And / or: ceph osd crush rm 33, or try to do it with
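
A sketch of that cleanup from a host with the admin keyring:

  cephadm shell
  # inside the shell:
  ceph osd rm 33
  ceph osd crush rm osd.33
  ceph auth del osd.33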

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Eugen Block
I am quite sure that this case is covered by cephadm already. A few months ago I tested it after a major rework of ceph-volume. I don’t have any links right now. But I had a lab environment with multiple OSDs per node with RocksDB on SSD, and after wiping both HDD and DB LV cephadm automatic
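
A sketch of the wipe that lets cephadm pick the devices up again (host, device and LV names are placeholders; note that zapping the whole SSD would destroy the other OSDs' DB LVs, so only the one LV is removed):

  # wipe the replaced HDD so the orchestrator sees it as available again
  ceph orch device zap <host> /dev/sdX --force
  # remove only the failed OSD's db LV from the shared SSD (names from "ceph-volume lvm list")
  lvremove <db-vg>/<osd.33-db-lv>
  ceph orch device ls <host> --refresh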

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread David Orman
We also ran into a scenario in which I did exactly this, and it did _not_ work. It created the OSD, but did not put the DB/WAL on the NVMe (didn't even create an LV). I'm wondering if there's some constraint applied (haven't looked at code yet) that when the NVMe already has all but the one DB on i
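
A sketch of how to check whether the recreated OSD actually got a separate DB device:

  # on the OSD host: a separate [db] entry should show up next to [block] if the DB landed on the NVMe
  ceph-volume lvm list
  # or from the cluster: the bluefs fields in the OSD metadata show whether a dedicated DB exists
  ceph osd metadata 33 | grep -i bluefs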

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Stefan Kooman
On 3/19/21 6:22 PM, Philip Brown wrote: I made *some* progress for cleanup. I could already do "ceph osd rm 33" from my master. But doing the cleanup on the actual OSD node was problematical. ceph-volume lvm zap xxx wasn't working properly.. because the device wasn't fully released because a

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Stefan Kooman
On 3/19/21 7:47 PM, Philip Brown wrote: I see. I don't think it works when 7/8 devices are already configured, and the SSD is already mostly sliced. OK. If it is a test cluster you might just blow it all away. By doing this you are simulating an "SSD" failure taking down all HDDs with it. It

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Tony Liu
To: "Stefan Kooman" Cc: "ceph-users" , "Philip Brown" Sent: Friday, March 19, 2021 2:19:55 PM Subject: [BULK] Re: [ceph-users] Re: ceph octopus mysterious OSD crash I am quite sure that this case is covered by cephadm already. A few months ago I tested it after a

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Philip Brown
There's still the concern about why the thing mysteriously crashed in the first place :-/ (on TWO OSDs!) But at least I know how to rebuild a single disk. - Original Message - From: "Eugen Block" To: "Stefan Kooman" Cc: "ceph-users" , "Philip
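
For the original question of why the OSDs went down, the crash module keeps the reports:

  ceph crash ls
  ceph crash info <crash-id>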

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-19 Thread Stefan Kooman
On 3/19/21 9:11 PM, Philip Brown wrote: if we cant replace a drive on a node in a crash situation, without blowing away the entire node seems to me ceph octopus fails the "test" part of the "test cluster" :-/ I agree. This should not be necessary. And I'm sure there is, or there will be f

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-25 Thread David Orman
As we wanted to verify this behavior with 15.2.10, we went ahead and tested with a failed OSD. The drive was replaced, and we followed the steps below (comments for clarity on our process) - this assumes you have a service specification that will perform deployment once matched: # capture "db devi
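
The general shape of such a replacement under a matching service spec is roughly (host, device and OSD id are placeholders, not the exact steps from the post):

  # mark the OSD for removal but keep its id for the replacement drive
  ceph orch osd rm 33 --replace
  ceph orch osd rm status
  # after the physical swap, clean the new drive so the spec picks it up
  ceph orch device zap <host> /dev/sdX --force
  # the existing spec (e.g. osd_spec_default) then redeploys osd.33 automatically
  ceph orch ps --daemon-type osd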

[ceph-users] Re: ceph octopus mysterious OSD crash

2021-03-25 Thread Philip Brown
ay, March 25, 2021 12:04:17 PM Subject: Re: [ceph-users] Re: ceph octopus mysterious OSD crash As we wanted to verify this behavior with 15.2.10, we went ahead and tested with a failed OSD. The drive was replaced, and we followed the steps below (comments for clarity on our process) - this assumes y