[ceph-users] Re: OSD corruption and down PGs

2020-05-12 Thread Kári Bertilsson
Hello David

I have physical devices I can use to mirror the OSDs, no problem. But I
don't think those disks are actually failing, since there are no bad
sectors on them and they are brand new, with no issues reading from them.
They do have corrupt OSD superblocks, however, which I believe happened
because of a bad SAS controller or an unclean shutdown, and I can't find
any way to get the data off them or repair the OSD superblock.
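
The mirroring itself would be something along these lines, assuming GNU
ddrescue so a read error doesn't abort the copy (sdX, sdY and the mapfile
path are examples):

  ddrescue -f /dev/sdX /dev/sdY /root/osd-92.map

That way any further repair attempts can run against the copy instead of
the original disk.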



[ceph-users] Re: OSD corruption and down PGs

2020-05-12 Thread David Turner
Do you have access to another Ceph cluster with enough available space to
create RBDs that you can dd these failing disks into? That's what I'm doing
right now with some failing disks; I've recovered 2 out of 6 OSDs that
failed in this way. I would recommend against using the same cluster for
this, but a staging cluster or something similar would be great.
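
A rough sketch of that flow, run against the spare cluster (the pool and
image names, the size, and the device paths are all examples; ddrescue
instead of dd is also worth considering for flaky media):

  rbd create --size 4T recovery/osd-92
  rbd device map recovery/osd-92        # prints e.g. /dev/rbd0
  dd if=/dev/sdX of=/dev/rbd0 bs=4M conv=noerror,sync status=progress

Once the image holds a full copy, you can work against the mapped copy
instead of the failing disk.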



[ceph-users] Re: OSD corruption and down PGs

2020-05-12 Thread Kári Bertilsson
Hi Paul

I was able to successfully mount both OSDs I need data from, using
"ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-92 --op fuse
--mountpoint /osd92/".

I see the PG shards that are missing in the mounted folder, e.g.
"41.b3s3_head" and "41.ccs5_head", and I can copy data from inside the
mounted folder just fine.

But when I try to export, it fails. I get the same error when trying to
list.
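
(The export attempt is of this form; the pgid and output path are just
examples:)

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-92 --pgid 41.b3s3 --op export --file /root/41.b3s3.export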

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-92 --op list
--debug
Output @ https://pastebin.com/nXScEL6L

Any ideas?



[ceph-users] Re: OSD corruption and down PGs

2020-05-12 Thread Paul Emmerich
The first thing I'd try is to use ceph-objectstore-tool to scrape the
inactive/broken PGs from the dead OSDs using its PG export feature, then
import these PGs into any other OSD, which will recover them automatically.
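
A sketch of that flow, with both OSDs stopped (the pgid, file path and
target OSD here are examples):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-92 \
      --pgid 41.b3s3 --op export --file /tmp/41.b3s3.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-50 \
      --op import --file /tmp/41.b3s3.export

After the import, start the target OSD and let peering/backfill do the
rest.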

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90




[ceph-users] Re: OSD corruption and down PGs

2020-05-12 Thread Kári Bertilsson
Yes. The output of ceph osd df tree and ceph -s is at
https://pastebin.com/By6b1ps1



[ceph-users] Re: OSD corruption and down PGs

2020-05-12 Thread Eugen Block

Can you share your osd tree and the current ceph status?


Quoting Kári Bertilsson:


Hello

I had an incident where 3 OSDs crashed completely at once and won't power
up, and during recovery 3 OSDs in another host have somehow become
corrupted. I am running erasure coding with an 8+2 setup, using a crush map
which takes 2 OSDs per host, and after losing the other 2 OSDs I have a few
PGs down. Unfortunately these PGs seem to overlap almost all data on the
pool, so I believe the entire pool is mostly lost with only these 2% of
PGs down.

I am running ceph 14.2.9.

OSD 92 log https://pastebin.com/5aq8SyCW
OSD 97 log https://pastebin.com/uJELZxwr

ceph-bluestore-tool repair without --deep showed "success", but the OSDs
still fail with the logs above.

Log from trying ceph-bluestore-tool repair --deep, which is still running;
not sure if it will actually fix anything, and the log looks pretty bad:
https://pastebin.com/gkqTZpY3
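
(For completeness, the repair invocations were of roughly this form, run
with the OSDs stopped:)

  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-97
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-97 --deep true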

Trying "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97 --op
list" gave me input/output error. But everything in SMART looks OK, and i
see no indication of hardware read error in any logs. Same for both OSD.

The OSDs with corruption have absolutely no bad sectors and likely have
only minor corruption, but at important locations.

Any ideas on how to recover from this kind of scenario? Any tips would be
highly appreciated.

Best regards,
Kári Bertilsson



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io