Re: [ceph-users] Major ceph disaster
OK, this just gives me:

    error getting xattr ec31/10004dfce92./parent: (2) No such file or directory

Does this mean that the lost object doesn't even belong to a file that appears in the ceph directory? Maybe it is a leftover of a file that was not deleted properly? In that case it wouldn't be an issue to mark the object as lost.

On 24.05.19 5:08 p.m., Robert LeBlanc wrote:
> You need to use the first stripe of the object, as that is the only one with the metadata. Try "rados -p ec31 getxattr 10004dfce92. parent" instead.
>
> Robert LeBlanc
>
> On Fri, May 24, 2019, 4:42 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>> [...]
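For reference, the 'parent' xattr discussed above can be decoded once it is retrieved from the first stripe; a minimal sketch, assuming the first stripe carries the usual <inode-hex>.00000000 suffix and that ceph-dencoder is installed:

    # fetch the backtrace xattr from the first stripe of the file's objects
    rados -p ec31 getxattr 10004dfce92.00000000 parent > parent.bin
    # decode it into JSON; the ancestors list contains the path components
    ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json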
Re: [ceph-users] Major ceph disaster
Hi,

we already tried "rados -p ec31 getxattr 10004dfce92.003d parent", but this just hangs forever for the unfound objects; it works fine for all other objects. We also tried scanning the ceph directory with find -inum 1099593404050 (the decimal of 10004dfce92) and found nothing, although this also works for objects that are not unfound. Is there another way to find the corresponding file?

On 24.05.19 11:12 a.m., Burkhard Linke wrote:
> Hi,
>
> On 5/24/19 9:48 AM, Kevin Flöh wrote:
>> [...]
>> But first we would like to know which file(s) are affected. Is there a way to map the object id to the corresponding file?
>
> The object name is composed of the file inode id and the chunk within the file. The first chunk has some metadata you can use to retrieve the filename. See the 'CephFS object mapping' thread on the mailing list for more information.
>
> Regards,
> Burkhard
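For reference, the inode-number conversion used above is a plain hex/decimal conversion; a minimal sketch (the mount point is an example, not from the thread):

    # the object name prefix is the file's inode number in hex
    printf '%d\n' 0x10004dfce92      # -> 1099593404050
    printf '%x\n' 1099593404050      # -> 10004dfce92
    # search the mounted cephfs for that inode
    find /mnt/cephfs -inum 1099593404050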
Re: [ceph-users] Major ceph disaster
We got the object ids of the missing objects with "ceph pg 1.24c list_missing":

{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "10004dfce92.003d",
                "key": "",
                "snapid": -2,
                "hash": 90219084,
                "max": 0,
                "pool": 1,
                "namespace": ""
            },
            "need": "46950'195355",
            "have": "0'0",
            "flags": "none",
            "locations": [
                "36(3)",
                "61(2)"
            ]
        }
    ],
    "more": false
}

We want to give up those objects with:

    ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) are affected. Is there a way to map the object id to the corresponding file?

On 23.05.19 3:52 p.m., Alexandre Marangone wrote:
> The PGs will stay active+recovery_wait+degraded until you solve the unfound objects issue. You can follow this doc to look at which objects are unfound [1] and, if there is no other recourse, mark them lost.
>
> [1] http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#unfound-objects
>
> On Thu, May 23, 2019 at 5:47 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>> [...]
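A small sketch of how the unfound object ids can be pulled out of that output, and how the objects would eventually be given up; jq is assumed to be available, and mark_unfound_lost is irreversible:

    # print only the unfound object names for a PG
    ceph pg 1.24c list_missing -f json | jq -r '.objects[].oid.oid'
    # once the affected files are known (or given up on), revert or delete the objects
    ceph pg 1.24c mark_unfound_lost revert    # or "delete" if no previous version exists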
Re: [ceph-users] Major ceph disaster
Thank you for this idea, it has improved the situation. Nevertheless, there are still 2 PGs in recovery_wait. "ceph -s" gives me:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_WARN
            3/125481112 objects unfound (0.000%)
            Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 125.48M objects, 259TiB
    usage:   370TiB used, 154TiB / 524TiB avail
    pgs:     3/497011315 objects degraded (0.000%)
             3/125481112 objects unfound (0.000%)
             4083 active+clean
             10   active+clean+scrubbing+deep
             2    active+recovery_wait+degraded
             1    active+clean+scrubbing

  io:
    client: 318KiB/s rd, 77.0KiB/s wr, 190op/s rd, 0op/s wr

and "ceph health detail":

HEALTH_WARN 3/125481112 objects unfound (0.000%); Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded
OBJECT_UNFOUND 3/125481112 objects unfound (0.000%)
    pg 1.24c has 1 unfound objects
    pg 1.779 has 2 unfound objects
PG_DEGRADED Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded
    pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1 unfound
    pg 1.779 is active+recovery_wait+degraded, acting [50,4,77,62], 2 unfound

Also, the status changed from HEALTH_ERR to HEALTH_WARN. We also did "ceph osd down" for all OSDs of the degraded PGs. Do you have any further suggestions on how to proceed?

On 23.05.19 11:08 a.m., Dan van der Ster wrote:
> I think those osds (1, 11, 21, 32, ...) need a little kick to re-peer their degraded PGs.
>
> Open a window with `watch ceph -s`, then in another window slowly do
>
>     ceph osd down 1
>     # then wait a minute or so for that osd.1 to re-peer fully.
>     ceph osd down 11
>     ...
>
> Continue that for each of the osds with stuck requests, or until there are no more recovery_wait/degraded PGs.
>
> After each `ceph osd down ...`, you should expect to see several PGs re-peer, and then ideally the slow requests will disappear and the degraded PGs will become active+clean. If anything else happens, you should stop and let us know.
>
> -- dan
>
> On Thu, May 23, 2019 at 10:59 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>> [...]
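A minimal sketch of the "kick" Dan describes, written as a loop; the OSD ids are the ones implicated in the earlier health output, and the sleep is a stand-in for watching `ceph -s` between steps:

    for osd in 1 11 21 32 43 50 65; do
        ceph osd down "$osd"       # marks the osd down; it rejoins and re-peers its PGs
        sleep 60                   # give it time to re-peer before kicking the next one
    done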
Re: [ceph-users] Major ceph disaster
This is the current status of ceph:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_ERR
            9/125481144 objects unfound (0.000%)
            Degraded data redundancy: 9/497011417 objects degraded (0.000%), 7 pgs degraded
            9 stuck requests are blocked > 4096 sec. Implicated osds 1,11,21,32,43,50,65

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 125.48M objects, 259TiB
    usage:   370TiB used, 154TiB / 524TiB avail
    pgs:     9/497011417 objects degraded (0.000%)
             9/125481144 objects unfound (0.000%)
             4078 active+clean
             11   active+clean+scrubbing+deep
             7    active+recovery_wait+degraded

  io:
    client: 211KiB/s rd, 46.0KiB/s wr, 158op/s rd, 0op/s wr

On 23.05.19 10:54 a.m., Dan van der Ster wrote:
> What's the full ceph status? Normally recovery_wait just means that the relevant osds are busy recovering/backfilling another PG.
>
> On Thu, May 23, 2019 at 10:53 AM Kevin Flöh wrote:
>> [...]
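To see what Dan refers to (whether the OSDs are simply busy with other recoveries), the PGs can be listed by state; a sketch, assuming a Luminous-era ceph CLI:

    ceph pg ls recovery_wait        # PGs queued behind other recovery work
    ceph pg ls degraded
    ceph pg dump pgs_brief | grep -E 'recover|degraded'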
Re: [ceph-users] Major ceph disaster
Hi,

we have set the PGs to recover and now they are stuck in active+recovery_wait+degraded, and instructing them to deep-scrub does not change anything. Hence, the rados report is empty. Is there a way to stop the recovery wait, start the deep-scrub and get the output? I guess the recovery_wait might be caused by missing objects. Do we need to delete them first to get the recovery going?

Kevin

On 22.05.19 6:03 p.m., Robert LeBlanc wrote:
> On Wed, May 22, 2019 at 4:31 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>> Hi,
>>
>> thank you, it worked. The PGs are not incomplete anymore. Still we have another problem: there are 7 PGs inconsistent and a "ceph pg repair" is not doing anything. I just get "instructing pg 1.5dd on osd.24 to repair" and nothing happens. Does somebody know how we can get the PGs to repair?
>>
>> Regards,
>> Kevin
>
> Kevin,
>
> I just fixed an inconsistent PG yesterday. You will need to figure out why they are inconsistent. Do these steps and then we can figure out how to proceed.
>
> 1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of them.)
> 2. Print out the inconsistent report for each inconsistent PG: `rados list-inconsistent-obj <pgid> --format=json-pretty`
> 3. Look at the error messages and see if all the shards have the same data.
>
> Robert LeBlanc
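A sketch of Robert's steps 1 and 2 for all inconsistent PGs at once, assuming jq is available and that the pool is ec31 as elsewhere in the thread:

    # step 1: deep-scrub every PG currently flagged inconsistent
    for pg in $(rados list-inconsistent-pg ec31 | jq -r '.[]'); do
        ceph pg deep-scrub "$pg"
    done
    # step 2: once the scrubs have finished, dump the per-object report
    rados list-inconsistent-obj 1.5dd --format=json-pretty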
Re: [ceph-users] Major ceph disaster
Hi,

thank you, it worked. The PGs are not incomplete anymore. Still, we have another problem: there are 7 PGs inconsistent and a "ceph pg repair" is not doing anything. I just get "instructing pg 1.5dd on osd.24 to repair" and nothing happens. Does somebody know how we can get the PGs to repair?

Regards,
Kevin

On 21.05.19 4:52 p.m., Wido den Hollander wrote:
> On 5/21/19 4:48 PM, Kevin Flöh wrote:
>> Hi,
>>
>> we gave up on the incomplete pgs since we do not have enough complete shards to restore them. What is the procedure to get rid of these pgs?
>
> You need to start with marking the OSDs as 'lost' and then you can force_create_pg to get the PGs back (empty).
>
> Wido
>
>> regards,
>> Kevin
>>
>> On 20.05.19 9:22 a.m., Kevin Flöh wrote:
>>> [...]
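For completeness, a sketch of the procedure Wido describes, using the OSD and PG ids mentioned earlier in the thread; note that force-create-pg recreates the PGs empty, so any remaining data in them is gone for good:

    ceph osd lost 4 --yes-i-really-mean-it
    ceph osd lost 23 --yes-i-really-mean-it
    ceph osd force-create-pg 1.5dd      # newer releases also ask for --yes-i-really-mean-it
    ceph osd force-create-pg 1.619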
Re: [ceph-users] Major ceph disaster
Hi,

we gave up on the incomplete pgs since we do not have enough complete shards to restore them. What is the procedure to get rid of these pgs?

regards,
Kevin

On 20.05.19 9:22 a.m., Kevin Flöh wrote:
> Hi Frederic,
>
> we do not have access to the original OSDs. [...]
Re: [ceph-users] Major ceph disaster
Hi Frederic,

we do not have access to the original OSDs. We exported the remaining shards of the two pgs, but we are only left with two shards (of reasonable size) per pg. The rest of the shards displayed by "ceph pg query" are empty. I guess marking the OSD as complete doesn't make sense then.

Best,
Kevin

On 17.05.19 2:36 p.m., Frédéric Nass wrote:
> On 14/05/2019 at 10:04, Kevin Flöh wrote:
>> On 13.05.19 11:21 p.m., Dan van der Ster wrote:
>>> [...]
>>> If those "lost" OSDs by some miracle still have the PG data, you might be able to export the relevant PG stripes with the ceph-objectstore-tool. I've never tried this myself, but there have been threads in the past where people export a PG from a nearly dead hdd, import to another OSD, then backfilling works.
>> guess that is not possible.
>
> Hi Kevin,
>
> You want to make sure of this. Unless you recreated the OSDs 4 and 23 and had new data written on them, they should still host the data you need. What Dan suggested (export the two incomplete PGs and import them on a healthy OSD) seems to be the only way to recover your lost data, as with 4 hosts and 2 OSDs lost, you're left with 2 chunks of data/parity when you actually need 3 to access it. Reducing min_size to 3 will not help.
>
> Have a look here:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html
>
> This is probably the best way you want to follow from now on.
>
> Regards,
> Frédéric.
>
>> [...]
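For reference, a sketch of the export/import Dan and Frédéric are referring to; the paths are examples, the OSD must be stopped while ceph-objectstore-tool runs, and for EC pools the shard is part of the PG id (e.g. 1.5dds1):

    # on the host with the (stopped) source OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --pgid 1.5dds1 --op export --file /root/pg-1.5dds1.export
    # on the host with a healthy (also stopped) target OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
        --pgid 1.5dds1 --op import --file /root/pg-1.5dds1.export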
Re: [ceph-users] Major ceph disaster
We tried to export the shards from the OSDs, but there are only two shards left for each of the pgs, so we decided to give up these pgs. Will the files of these pgs be deleted from the mds, or do we have to delete them manually? Is this the correct command to mark the pgs as lost:

    ceph pg {pg-id} mark_unfound_lost revert|delete

Cheers,
Kevin

On 15.05.19 8:55 a.m., Kevin Flöh wrote:
> The hdds of OSDs 4 and 23 are completely lost, we cannot access them in any way. [...]
Re: [ceph-users] Major ceph disaster
ceph osd pool get ec31 min_size
min_size: 3

On 15.05.19 9:09 a.m., Konstantin Shalygin wrote:
> ceph osd pool get ec31 min_size
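The min_size bounds come from the pool's erasure-code profile (k=3, m=1 for the 3+1 setup described in this thread); a sketch of how to confirm it, where the profile name is an assumption:

    ceph osd pool get ec31 erasure_code_profile
    ceph osd erasure-code-profile get ec31_profile
    # expected output along the lines of: k=3 m=1 plugin=jerasure ...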
Re: [ceph-users] Major ceph disaster
The hdds of OSDs 4 and 23 are completely lost, we cannot access them in any way. Is it possible to use the shards which are maybe stored on working OSDs, as shown in the all_participants list?

On 14.05.19 5:24 p.m., Dan van der Ster wrote:
> On Tue, May 14, 2019 at 5:13 PM Kevin Flöh wrote:
>> [...]
>> the peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try?
>
> What is the state of the hdds which had osds 4 & 23?
>
> You may be able to use ceph-objectstore-tool to export those PG shards and import to another operable OSD.
>
> -- dan
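A small sketch of how to check which OSDs may still hold shards of a PG, based on the all_participants list mentioned above; jq is assumed, and 1.5dd is one of the incomplete PGs from this thread:

    ceph pg map 1.5dd          # current up/acting sets
    ceph pg 1.5dd query | jq '.recovery_state[] | select(.past_intervals) | .past_intervals[].all_participants'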
Re: [ceph-users] Major ceph disaster
Hi,

since we have 3+1 ec I didn't try this before. But when I run the command you suggested, I get the following error:

    ceph osd pool set ec31 min_size 2
    Error EINVAL: pool min_size must be between 3 and 4

On 14.05.19 6:18 p.m., Konstantin Shalygin wrote:
>> peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try?
>
> Try to reduce min_size for the problem pool as 'health detail' suggested: `ceph osd pool set ec31 min_size 2`.
>
> k
Re: [ceph-users] Major ceph disaster
OK, so now we see at least a difference in the recovery state:

"recovery_state": [
    { "name": "Started/Primary/Peering/Incomplete",
      "enter_time": "2019-05-14 14:15:15.650517",
      "comment": "not enough complete instances of this PG" },
    { "name": "Started/Primary/Peering",
      "enter_time": "2019-05-14 14:15:15.243756",
      "past_intervals": [
          { "first": "49767",
            "last": "59580",
            "all_participants": [
                { "osd": 2, "shard": 0 },
                { "osd": 4, "shard": 1 },
                { "osd": 23, "shard": 2 },
                { "osd": 24, "shard": 0 },
                { "osd": 72, "shard": 1 },
                { "osd": 79, "shard": 3 } ],
            "intervals": [
                { "first": "59562", "last": "59563", "acting": "4(1),24(0),79(3)" },
                { "first": "59564", "last": "59567", "acting": "23(2),24(0),79(3)" },
                { "first": "59570", "last": "59574", "acting": "4(1),23(2),79(3)" },
                { "first": "59577", "last": "59580", "acting": "4(1),23(2),24(0)" } ] } ],
      "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ],
      "down_osds_we_would_probe": [],
      "peering_blocked_by": [] },
    { "name": "Started",
      "enter_time": "2019-05-14 14:15:15.243663" }
],

The peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try?

On 14.05.19 11:02 a.m., Dan van der Ster wrote:
> On Tue, May 14, 2019 at 10:59 AM Kevin Flöh wrote:
>> [...]
Re: [ceph-users] Major ceph disaster
On 14.05.19 10:08 a.m., Dan van der Ster wrote:
> On Tue, May 14, 2019 at 10:02 AM Kevin Flöh wrote:
>> On 13.05.19 10:51 p.m., Lionel Bouton wrote:
>>> [...]
>>
>> OK, so the 2 OSDs (4,23) failed shortly one after the other, but we think that the recovery of the first was finished before the second failed. Nonetheless, both problematic pgs have been on both OSDs. We think that we still have enough shards left. For one of the pgs, the recovery state looks like this:
>> [...]
Re: [ceph-users] Major ceph disaster
On 13.05.19 11:21 p.m., Dan van der Ster wrote:
> Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs?
> It would be useful to double confirm that: check with `ceph pg query` and `ceph pg dump`.
> (If so, this is why the ignore_history_les thing isn't helping; you don't have the minimum 3 stripes up for those 3+1 PGs.)

yes, but as written in my other mail, we still have enough shards, at least I think so.

> If those "lost" OSDs by some miracle still have the PG data, you might be able to export the relevant PG stripes with the ceph-objectstore-tool. I've never tried this myself, but there have been threads in the past where people export a PG from a nearly dead hdd, import to another OSD, then backfilling works.

guess that is not possible.

> If OTOH those PGs are really lost forever, and someone else should confirm what I say here, I think the next step would be to force recreate the incomplete PGs and then run a set of cephfs scrub/repair disaster recovery cmds to recover what you can from the cephfs.
>
> -- dan

would this let us recover at least some of the data on the pgs? If not, we would just set up a new ceph directly without fixing the old one and copy whatever is left.

Best regards,
Kevin

> On Mon, May 13, 2019 at 4:20 PM Kevin Flöh wrote:
>> [...]
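A sketch of the last-resort path Dan outlines (only once the PGs are truly given up); the MDS daemon name and the scrub options follow the Luminous-era admin-socket interface and are assumptions as far as this particular cluster is concerned:

    ceph osd force-create-pg 1.5dd
    ceph osd force-create-pg 1.619
    # then let the MDS walk the tree and repair what it can still reach
    ceph daemon mds.ceph-node02 scrub_path / recursive repair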
Re: [ceph-users] Major ceph disaster
On 13.05.19 10:51 p.m., Lionel Bouton wrote:
> On 13/05/2019 at 16:20, Kevin Flöh wrote:
>> Dear ceph experts,
>> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
>> Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch. Ceph started recovering and then we lost another osd with the same behavior. We did the same as for the first osd.
>
> With 3+1 you only allow a single OSD failure per pg at a given time. You have 4096 pgs and 96 osds; having 2 OSDs fail at the same time on 2 separate servers (assuming standard crush rules) is a death sentence for the data on some pgs using both of those OSDs (the ones not fully recovered before the second failure).

OK, so the 2 OSDs (4,23) failed shortly one after the other, but we think that the recovery of the first was finished before the second failed. Nonetheless, both problematic pgs have been on both OSDs. We think that we still have enough shards left. For one of the pgs, the recovery state looks like this:

"recovery_state": [
    { "name": "Started/Primary/Peering/Incomplete",
      "enter_time": "2019-05-09 16:11:48.625966",
      "comment": "not enough complete instances of this PG" },
    { "name": "Started/Primary/Peering",
      "enter_time": "2019-05-09 16:11:48.611171",
      "past_intervals": [
          { "first": "49767",
            "last": "59313",
            "all_participants": [
                { "osd": 2, "shard": 0 },
                { "osd": 4, "shard": 1 },
                { "osd": 23, "shard": 2 },
                { "osd": 24, "shard": 0 },
                { "osd": 72, "shard": 1 },
                { "osd": 79, "shard": 3 } ],
            "intervals": [
                { "first": "58860", "last": "58861", "acting": "4(1),24(0),79(3)" },
                { "first": "58875", "last": "58877", "acting": "4(1),23(2),24(0)" },
                { "first": "59002", "last": "59009", "acting": "4(1),23(2),79(3)" },
                { "first": "59010", "last": "59012", "acting": "2(0),4(1),23(2),79(3)" },
                { "first": "59197", "last": "59233", "acting": "23(2),24(0),79(3)" },
                { "first": "59234", "last": "59313", "acting": "23(2),24(0),72(1),79(3)" } ] } ],
      "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ],
      "down_osds_we_would_probe": [],
      "peering_blocked_by": [],
      "peering_blocked_by_detail": [
          { "detail": "peering_blocked_by_history_les_bound" } ] },
    { "name": "Started", ... } ]
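The peering_blocked_by_history_les_bound entry above is the state that the osd_find_best_info_ignore_history_les override (the "ignore_history_les thing" Dan mentions elsewhere in the thread) is meant to work around. A hedged sketch of how it is typically applied to a single PG's primary and then reverted; it tells the OSD to skip a peering safety check, so it can cause data loss if used carelessly. osd.24 is the primary of pg 1.5dd according to the health detail:

    ceph tell osd.24 injectargs '--osd_find_best_info_ignore_history_les=true'
    ceph osd down 24      # force the PG to re-peer with the override in place
    ceph tell osd.24 injectargs '--osd_find_best_info_ignore_history_les=false'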
[ceph-users] Major ceph disaster
Dear ceph experts,

we have several (maybe related) problems with our ceph cluster. Let me first show you the current ceph status:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            1 MDSs behind on trimming
            1/126319678 objects unfound (0.000%)
            19 scrub errors
            Reduced data availability: 2 pgs inactive, 2 pgs incomplete
            Possible data damage: 7 pgs inconsistent
            Degraded data redundancy: 1/500333881 objects degraded (0.000%), 1 pg degraded
            118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 126.32M objects, 260TiB
    usage:   372TiB used, 152TiB / 524TiB avail
    pgs:     0.049% pgs not active
             1/500333881 objects degraded (0.000%)
             1/126319678 objects unfound (0.000%)
             4076 active+clean
             10   active+clean+scrubbing+deep
             7    active+clean+inconsistent
             2    incomplete
             1    active+recovery_wait+degraded

  io:
    client: 449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr

and ceph health detail:

HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19 scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs incomplete; Possible data damage: 7 pgs inconsistent; Degraded data redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 351193 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec
MDS_TRIM 1 MDSs behind on trimming
    mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46034/128) max_segments: 128, num_segments: 46034
OBJECT_UNFOUND 1/126319687 objects unfound (0.000%)
    pg 1.24c has 1 unfound objects
OSD_SCRUB_ERRORS 19 scrub errors
PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
    pg 1.5dd is incomplete, acting [24,4,23,79] (reducing pool ec31 min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 1.619 is incomplete, acting [91,23,4,81] (reducing pool ec31 min_size from 3 may help; search ceph.com/docs for 'incomplete')
PG_DAMAGED Possible data damage: 7 pgs inconsistent
    pg 1.17f is active+clean+inconsistent, acting [65,49,25,4]
    pg 1.1e0 is active+clean+inconsistent, acting [11,32,4,81]
    pg 1.203 is active+clean+inconsistent, acting [43,49,4,72]
    pg 1.5d3 is active+clean+inconsistent, acting [37,27,85,4]
    pg 1.779 is active+clean+inconsistent, acting [50,4,77,62]
    pg 1.77c is active+clean+inconsistent, acting [21,49,40,4]
    pg 1.7c3 is active+clean+inconsistent, acting [1,14,68,4]
PG_DEGRADED Degraded data redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded
    pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1 unfound
REQUEST_STUCK 118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
    118 ops are blocked > 536871 sec
    osds 24,32,91 have stuck requests > 536871 sec

Let me briefly summarize the setup: we have 4 nodes with 24 osds each and use 3+1 erasure coding. The nodes run on CentOS 7 and, due to a major mistake when setting up the cluster, we use more than one ceph version on the nodes: 3 nodes run 12.2.12 and one runs 13.2.5. We are currently not daring to update all nodes to 13.2.5. For all the version details see:

{
    "mon": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 2
    },
    "osd": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 72,
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 24
    },
    "mds": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 4
    },
    "overall": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 81,
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 24
    }
}

Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch. Ceph started recovering and then we lost another osd with the same behavior. We did the same as for the first osd.
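For context, a sketch of the "mark the osd as lost and set it up from scratch" sequence described above (the osd id is an example; with 3+1 erasure coding this should only ever be in flight for one OSD at a time):

    ceph osd out 4
    ceph osd lost 4 --yes-i-really-mean-it
    ceph osd purge 4 --yes-i-really-mean-it    # drops it from the osd and crush maps
    # then rebuild the OSD on the wiped disk, e.g. with ceph-volume lvm create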