[ceph-users] Lost OSD from PCIe error, recovered, to restore OSD process

2019-05-14 Thread Tarek Zegar

Someone nuked an OSD that had 1-replica PGs. They accidentally did
echo 1 > /sys/block/nvme0n1/device/device/remove
We got it back with
echo 1 > /sys/bus/pci/rescan
However, it re-enumerated under a different device name (I guess we didn't
have udev rules). They restored the LVM volume:
vgcfgrestore ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
vgchange -ay ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
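
Worth noting: LVM identifies PVs by the UUID in their on-disk metadata, not
by kernel device name, so the VG comes back no matter what name the NVMe
re-enumerates under. A quick sketch (not among the original commands) to
confirm the PV and LV were picked up again:

pvs -o pv_name,vg_name,pv_uuid                   # the PV should appear under the ceph-... VG
lvs ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841    # the osd-data LV should show as active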

lsblk
nvme0n2                                     259:9   0  1.8T  0 disk
└─ceph--8c81b2a3--6c8e--4cae--a3c0--e2d91f82d841-osd--data--74b01ec2--124d--427d--9812--e437f90261d4
                                            253:1   0  1.8T  0 lvm

We are stuck here. How do we attach an OSD daemon to the drive? It was
OSD.122 previously

Thanks


Re: [ceph-users] Lost OSD from PCIe error, recovered, to restore OSD process

2019-05-14 Thread Bob R
Does 'ceph-volume lvm list' show it? If so, you can try to activate it with
'ceph-volume lvm activate 122 74b01ec2-124d-427d-9812-e437f90261d4' (note
the single dashes; lsblk doubles every hyphen in device-mapper names).
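
For example (the FSID here is read off the lsblk output above; confirm it
against what 'ceph-volume lvm list' actually reports before activating):

ceph-volume lvm list                                 # should list osd.122 with its data LV and OSD FSID
ceph-volume lvm activate 122 74b01ec2-124d-427d-9812-e437f90261d4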

Bob



Re: [ceph-users] Lost OSD from PCIe error, recovered, to restore OSD process

2019-05-15 Thread Alfredo Deza
On Tue, May 14, 2019 at 7:24 PM Bob R  wrote:
>
> Does 'ceph-volume lvm list' show it? If so you can try to activate it with
> 'ceph-volume lvm activate 122 74b01ec2-124d-427d-9812-e437f90261d4'

Good suggestion. If `ceph-volume lvm list` can see it, it can probably
activate it again. You can activate it with the OSD ID + OSD FSID, or
do:

ceph-volume lvm activate --all

You didn't say whether the OSD failed to come up after you tried to start
it (the systemd unit for ID 122 should still be there), or whether you
rebooted and the OSD didn't come back up.

The systemd unit is tied to both the ID and the FSID of the OSD, so it
shouldn't matter that the underlying device name changed: ceph-volume
verifies it has the right device every time it activates.
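
For reference, a minimal sketch of those units on a default systemd layout
(unit names assumed here, not taken from this host; <osd-fsid> is the OSD
FSID that activation reports):

systemctl status ceph-osd@122                    # the OSD daemon unit
systemctl list-units 'ceph-volume@lvm-122-*'     # activation unit, named ceph-volume@lvm-<id>-<osd-fsid>
systemctl start ceph-volume@lvm-122-<osd-fsid>   # re-runs activation, then starts ceph-osd@122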


Re: [ceph-users] Lost OSD from PCIe error, recovered, to restore OSD process

2019-05-15 Thread Tarek Zegar

TL;DR: I activated the drive successfully, but the daemon won't start. It
looks like it's complaining about the mon config, and I don't know why
(there is a valid ceph.conf on the host). Thoughts? I feel like it's close.
Thank you.

I executed the command:
ceph-volume lvm activate --all


It found the drive and activated it:
--> Activating OSD ID 122 FSID a151bea5-d123-45d9-9b08-963a511c042a

--> ceph-volume lvm activate successful for osd ID: 122



However, systemd could not start the OSD daemon for osd.122:
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15
14:16:13.862 71970700 -1 monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only support [2]
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15
14:16:13.862 7116f700 -1 monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only support [2]
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: failed to fetch
mon config (--no-mon-config to skip)
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Main process exited, code=exited, status=1/FAILURE
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Failed with result 'exit-code'.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Service hold-off time over, scheduling restart.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Scheduled restart job, restart counter is at 3.
-- Subject: Automatic restarting of a unit has been scheduled
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Automatic restarting of the unit ceph-osd@122.service has been
scheduled, as the result for
-- the configured Restart= setting for the unit.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Stopped Ceph object
storage daemon osd.122.
-- Subject: Unit ceph-osd@122.service has finished shutting down
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit ceph-osd@122.service has finished shutting down.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Start request repeated too quickly.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Failed with result 'exit-code'.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Failed to start Ceph
object storage daemon osd.122
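
The auth errors above, where the server's allowed_methods match what the
client claims to support, often turn out to be a cephx key mismatch rather
than a real auth-method problem. A hedged sketch for checking that, assuming
the default /var/lib/ceph layout:

ceph auth get osd.122                      # the key the monitors expect for this OSD
cat /var/lib/ceph/osd/ceph-122/keyring     # the key ceph-volume put in the OSD's data dir
# if the two differ, re-import the on-disk key into the cluster:
ceph auth import -i /var/lib/ceph/osd/ceph-122/keyring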








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com