Re: [ceph-users] Major ceph disaster
OK, this just gives me:

    error getting xattr ec31/10004dfce92./parent: (2) No such file or directory

Does this mean that the lost object doesn't even belong to a file that appears in the ceph directory? Maybe it is a leftover of a file that was not deleted properly? In that case it wouldn't be an issue to mark the object as lost.

On 24.05.19 5:08 p.m., Robert LeBlanc wrote:
> You need to use the first stripe of the object, as that is the only one with the metadata. Try "rados -p ec31 getxattr 10004dfce92. parent" instead.
>
> Robert LeBlanc
>
> On Fri, May 24, 2019, 4:42 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>> [...]
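For reference, the 'parent' xattr discussed above can be decoded once it is retrieved from the first stripe; a minimal sketch, assuming the first stripe carries the usual <inode-hex>.00000000 suffix and that ceph-dencoder is installed:

    # fetch the backtrace xattr from the first stripe of the file's objects
    rados -p ec31 getxattr 10004dfce92.00000000 parent > parent.bin
    # decode it into JSON; the ancestors list contains the path components
    ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json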
Re: [ceph-users] Major ceph disaster
Hi,

we already tried "rados -p ec31 getxattr 10004dfce92.003d parent", but this just hangs forever for the unfound objects; it works fine for all other objects. We also tried scanning the ceph directory with find -inum 1099593404050 (the decimal of 10004dfce92) and found nothing, although this also works for objects that are not unfound. Is there another way to find the corresponding file?

On 24.05.19 11:12 a.m., Burkhard Linke wrote:
> Hi,
>
> On 5/24/19 9:48 AM, Kevin Flöh wrote:
>> [...]
>> But first we would like to know which file(s) are affected. Is there a way to map the object id to the corresponding file?
>
> The object name is composed of the file inode id and the chunk within the file. The first chunk has some metadata you can use to retrieve the filename. See the 'CephFS object mapping' thread on the mailing list for more information.
>
> Regards,
> Burkhard
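For reference, the inode-number conversion used above is a plain hex/decimal conversion; a minimal sketch (the mount point is an example, not from the thread):

    # the object name prefix is the file's inode number in hex
    printf '%d\n' 0x10004dfce92      # -> 1099593404050
    printf '%x\n' 1099593404050      # -> 10004dfce92
    # search the mounted cephfs for that inode
    find /mnt/cephfs -inum 1099593404050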
Re: [ceph-users] Major ceph disaster
We got the object ids of the missing objects with "ceph pg 1.24c list_missing":

{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "10004dfce92.003d",
                "key": "",
                "snapid": -2,
                "hash": 90219084,
                "max": 0,
                "pool": 1,
                "namespace": ""
            },
            "need": "46950'195355",
            "have": "0'0",
            "flags": "none",
            "locations": [
                "36(3)",
                "61(2)"
            ]
        }
    ],
    "more": false
}

We want to give up those objects with:

    ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) are affected. Is there a way to map the object id to the corresponding file?

On 23.05.19 3:52 p.m., Alexandre Marangone wrote:
> The PGs will stay active+recovery_wait+degraded until you solve the unfound objects issue. You can follow this doc to look at which objects are unfound [1] and, if there is no other recourse, mark them lost.
>
> [1] http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#unfound-objects
>
> On Thu, May 23, 2019 at 5:47 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>> [...]
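A small sketch of how the unfound object ids can be pulled out of that output, and how the objects would eventually be given up; jq is assumed to be available, and mark_unfound_lost is irreversible:

    # print only the unfound object names for a PG
    ceph pg 1.24c list_missing -f json | jq -r '.objects[].oid.oid'
    # once the affected files are known (or given up on), revert or delete the objects
    ceph pg 1.24c mark_unfound_lost revert    # or "delete" if no previous version exists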
Re: [ceph-users] Major ceph disaster
Thank you for this idea, it has improved the situation. Nevertheless, there are still 2 PGs in recovery_wait. "ceph -s" gives me:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_WARN
            3/125481112 objects unfound (0.000%)
            Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 125.48M objects, 259TiB
    usage:   370TiB used, 154TiB / 524TiB avail
    pgs:     3/497011315 objects degraded (0.000%)
             3/125481112 objects unfound (0.000%)
             4083 active+clean
             10   active+clean+scrubbing+deep
             2    active+recovery_wait+degraded
             1    active+clean+scrubbing

  io:
    client: 318KiB/s rd, 77.0KiB/s wr, 190op/s rd, 0op/s wr

and "ceph health detail":

HEALTH_WARN 3/125481112 objects unfound (0.000%); Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded
OBJECT_UNFOUND 3/125481112 objects unfound (0.000%)
    pg 1.24c has 1 unfound objects
    pg 1.779 has 2 unfound objects
PG_DEGRADED Degraded data redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded
    pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1 unfound
    pg 1.779 is active+recovery_wait+degraded, acting [50,4,77,62], 2 unfound

Also, the status changed from HEALTH_ERR to HEALTH_WARN. We also did "ceph osd down" for all OSDs of the degraded PGs. Do you have any further suggestions on how to proceed?

On 23.05.19 11:08 a.m., Dan van der Ster wrote:
> I think those osds (1, 11, 21, 32, ...) need a little kick to re-peer their degraded PGs.
>
> Open a window with `watch ceph -s`, then in another window slowly do
>
>     ceph osd down 1
>     # then wait a minute or so for that osd.1 to re-peer fully.
>     ceph osd down 11
>     ...
>
> Continue that for each of the osds with stuck requests, or until there are no more recovery_wait/degraded PGs.
>
> After each `ceph osd down ...`, you should expect to see several PGs re-peer, and then ideally the slow requests will disappear and the degraded PGs will become active+clean. If anything else happens, you should stop and let us know.
>
> -- dan
>
> On Thu, May 23, 2019 at 10:59 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>> [...]
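A minimal sketch of the "kick" Dan describes, written as a loop; the OSD ids are the ones implicated in the earlier health output, and the sleep is a stand-in for watching `ceph -s` between steps:

    for osd in 1 11 21 32 43 50 65; do
        ceph osd down "$osd"       # marks the osd down; it rejoins and re-peers its PGs
        sleep 60                   # give it time to re-peer before kicking the next one
    done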
Re: [ceph-users] Major ceph disaster
This is the current status of ceph:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_ERR
            9/125481144 objects unfound (0.000%)
            Degraded data redundancy: 9/497011417 objects degraded (0.000%), 7 pgs degraded
            9 stuck requests are blocked > 4096 sec. Implicated osds 1,11,21,32,43,50,65

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 125.48M objects, 259TiB
    usage:   370TiB used, 154TiB / 524TiB avail
    pgs:     9/497011417 objects degraded (0.000%)
             9/125481144 objects unfound (0.000%)
             4078 active+clean
             11   active+clean+scrubbing+deep
             7    active+recovery_wait+degraded

  io:
    client: 211KiB/s rd, 46.0KiB/s wr, 158op/s rd, 0op/s wr

On 23.05.19 10:54 a.m., Dan van der Ster wrote:
> What's the full ceph status? Normally recovery_wait just means that the relevant osds are busy recovering/backfilling another PG.
>
> On Thu, May 23, 2019 at 10:53 AM Kevin Flöh wrote:
>> [...]
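To see what Dan refers to (whether the OSDs are simply busy with other recoveries), the PGs can be listed by state; a sketch, assuming a Luminous-era ceph CLI:

    ceph pg ls recovery_wait        # PGs queued behind other recovery work
    ceph pg ls degraded
    ceph pg dump pgs_brief | grep -E 'recover|degraded'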
Re: [ceph-users] Major ceph disaster
Hi,

we have set the PGs to recover and now they are stuck in active+recovery_wait+degraded, and instructing them to deep-scrub does not change anything. Hence, the rados report is empty. Is there a way to stop the recovery wait, start the deep-scrub and get the output? I guess the recovery_wait might be caused by missing objects. Do we need to delete them first to get the recovery going?

Kevin

On 22.05.19 6:03 p.m., Robert LeBlanc wrote:
> On Wed, May 22, 2019 at 4:31 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>> Hi,
>>
>> thank you, it worked. The PGs are not incomplete anymore. Still we have another problem: there are 7 PGs inconsistent and a "ceph pg repair" is not doing anything. I just get "instructing pg 1.5dd on osd.24 to repair" and nothing happens. Does somebody know how we can get the PGs to repair?
>>
>> Regards,
>> Kevin
>
> Kevin,
>
> I just fixed an inconsistent PG yesterday. You will need to figure out why they are inconsistent. Do these steps and then we can figure out how to proceed.
>
> 1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of them.)
> 2. Print out the inconsistent report for each inconsistent PG: `rados list-inconsistent-obj <pgid> --format=json-pretty`
> 3. Look at the error messages and see if all the shards have the same data.
>
> Robert LeBlanc
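A sketch of Robert's steps 1 and 2 for all inconsistent PGs at once, assuming jq is available and that the pool is ec31 as elsewhere in the thread:

    # step 1: deep-scrub every PG currently flagged inconsistent
    for pg in $(rados list-inconsistent-pg ec31 | jq -r '.[]'); do
        ceph pg deep-scrub "$pg"
    done
    # step 2: once the scrubs have finished, dump the per-object report
    rados list-inconsistent-obj 1.5dd --format=json-pretty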
Re: [ceph-users] Major ceph disaster
Hi,

thank you, it worked. The PGs are not incomplete anymore. Still, we have another problem: there are 7 PGs inconsistent and a "ceph pg repair" is not doing anything. I just get "instructing pg 1.5dd on osd.24 to repair" and nothing happens. Does somebody know how we can get the PGs to repair?

Regards,
Kevin

On 21.05.19 4:52 p.m., Wido den Hollander wrote:
> On 5/21/19 4:48 PM, Kevin Flöh wrote:
>> Hi,
>>
>> we gave up on the incomplete pgs since we do not have enough complete shards to restore them. What is the procedure to get rid of these pgs?
>
> You need to start with marking the OSDs as 'lost' and then you can force_create_pg to get the PGs back (empty).
>
> Wido
>
>> regards,
>> Kevin
>>
>> On 20.05.19 9:22 a.m., Kevin Flöh wrote:
>>> [...]
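For completeness, a sketch of the procedure Wido describes, using the OSD and PG ids mentioned earlier in the thread; note that force-create-pg recreates the PGs empty, so any remaining data in them is gone for good:

    ceph osd lost 4 --yes-i-really-mean-it
    ceph osd lost 23 --yes-i-really-mean-it
    ceph osd force-create-pg 1.5dd      # newer releases also ask for --yes-i-really-mean-it
    ceph osd force-create-pg 1.619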
Re: [ceph-users] Major ceph disaster
Hi,

we gave up on the incomplete pgs since we do not have enough complete shards to restore them. What is the procedure to get rid of these pgs?

regards,
Kevin

On 20.05.19 9:22 a.m., Kevin Flöh wrote:
> Hi Frederic,
>
> we do not have access to the original OSDs. [...]
Re: [ceph-users] Major ceph disaster
Hi Frederic,

we do not have access to the original OSDs. We exported the remaining shards of the two pgs, but we are only left with two shards (of reasonable size) per pg. The rest of the shards displayed by "ceph pg query" are empty. I guess marking the OSD as complete doesn't make sense then.

Best,
Kevin

On 17.05.19 2:36 p.m., Frédéric Nass wrote:
> On 14/05/2019 at 10:04, Kevin Flöh wrote:
>> On 13.05.19 11:21 p.m., Dan van der Ster wrote:
>>> [...]
>>> If those "lost" OSDs by some miracle still have the PG data, you might be able to export the relevant PG stripes with the ceph-objectstore-tool. I've never tried this myself, but there have been threads in the past where people export a PG from a nearly dead hdd, import to another OSD, then backfilling works.
>> guess that is not possible.
>
> Hi Kevin,
>
> You want to make sure of this. Unless you recreated the OSDs 4 and 23 and had new data written on them, they should still host the data you need. What Dan suggested (export the two incomplete PGs and import them on a healthy OSD) seems to be the only way to recover your lost data, as with 4 hosts and 2 OSDs lost, you're left with 2 chunks of data/parity when you actually need 3 to access it. Reducing min_size to 3 will not help.
>
> Have a look here:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html
>
> This is probably the best way you want to follow from now on.
>
> Regards,
> Frédéric.
>
>> [...]
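For reference, a sketch of the export/import Dan and Frédéric are referring to; the paths are examples, the OSD must be stopped while ceph-objectstore-tool runs, and for EC pools the shard is part of the PG id (e.g. 1.5dds1):

    # on the host with the (stopped) source OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --pgid 1.5dds1 --op export --file /root/pg-1.5dds1.export
    # on the host with a healthy (also stopped) target OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
        --pgid 1.5dds1 --op import --file /root/pg-1.5dds1.export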
Re: [ceph-users] Major ceph disaster
We tried to export the shards from the OSDs, but there are only two shards left for each of the pgs, so we decided to give up these pgs. Will the files of these pgs be deleted from the mds, or do we have to delete them manually? Is this the correct command to mark the pgs as lost:

    ceph pg {pg-id} mark_unfound_lost revert|delete

Cheers,
Kevin

On 15.05.19 8:55 a.m., Kevin Flöh wrote:
> The hdds of OSDs 4 and 23 are completely lost, we cannot access them in any way. [...]
Re: [ceph-users] Major ceph disaster
ceph osd pool get ec31 min_size
min_size: 3

On 15.05.19 9:09 a.m., Konstantin Shalygin wrote:
> ceph osd pool get ec31 min_size
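The min_size bounds come from the pool's erasure-code profile (k=3, m=1 for the 3+1 setup described in this thread); a sketch of how to confirm it, where the profile name is an assumption:

    ceph osd pool get ec31 erasure_code_profile
    ceph osd erasure-code-profile get ec31_profile
    # expected output along the lines of: k=3 m=1 plugin=jerasure ...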
Re: [ceph-users] Major ceph disaster
The hdds of OSDs 4 and 23 are completely lost, we cannot access them in any way. Is it possible to use the shards which are maybe stored on working OSDs, as shown in the all_participants list?

On 14.05.19 5:24 p.m., Dan van der Ster wrote:
> On Tue, May 14, 2019 at 5:13 PM Kevin Flöh wrote:
>> [...]
>> the peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try?
>
> What is the state of the hdds which had osds 4 & 23?
>
> You may be able to use ceph-objectstore-tool to export those PG shards and import to another operable OSD.
>
> -- dan
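A small sketch of how to check which OSDs may still hold shards of a PG, based on the all_participants list mentioned above; jq is assumed, and 1.5dd is one of the incomplete PGs from this thread:

    ceph pg map 1.5dd          # current up/acting sets
    ceph pg 1.5dd query | jq '.recovery_state[] | select(.past_intervals) | .past_intervals[].all_participants'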
Re: [ceph-users] Major ceph disaster
Hi,

since we have 3+1 ec I didn't try this before. But when I run the command you suggested, I get the following error:

    ceph osd pool set ec31 min_size 2
    Error EINVAL: pool min_size must be between 3 and 4

On 14.05.19 6:18 p.m., Konstantin Shalygin wrote:
>> peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try?
>
> Try to reduce min_size for the problem pool as 'health detail' suggested: `ceph osd pool set ec31 min_size 2`.
>
> k
Re: [ceph-users] Major ceph disaster
OK, so now we see at least a difference in the recovery state:

"recovery_state": [
    { "name": "Started/Primary/Peering/Incomplete",
      "enter_time": "2019-05-14 14:15:15.650517",
      "comment": "not enough complete instances of this PG" },
    { "name": "Started/Primary/Peering",
      "enter_time": "2019-05-14 14:15:15.243756",
      "past_intervals": [
          { "first": "49767",
            "last": "59580",
            "all_participants": [
                { "osd": 2, "shard": 0 },
                { "osd": 4, "shard": 1 },
                { "osd": 23, "shard": 2 },
                { "osd": 24, "shard": 0 },
                { "osd": 72, "shard": 1 },
                { "osd": 79, "shard": 3 } ],
            "intervals": [
                { "first": "59562", "last": "59563", "acting": "4(1),24(0),79(3)" },
                { "first": "59564", "last": "59567", "acting": "23(2),24(0),79(3)" },
                { "first": "59570", "last": "59574", "acting": "4(1),23(2),79(3)" },
                { "first": "59577", "last": "59580", "acting": "4(1),23(2),24(0)" } ] } ],
      "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ],
      "down_osds_we_would_probe": [],
      "peering_blocked_by": [] },
    { "name": "Started",
      "enter_time": "2019-05-14 14:15:15.243663" }
],

The peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try?

On 14.05.19 11:02 a.m., Dan van der Ster wrote:
> On Tue, May 14, 2019 at 10:59 AM Kevin Flöh wrote:
>> [...]
Re: [ceph-users] Major ceph disaster
On 14.05.19 10:08 a.m., Dan van der Ster wrote:
> On Tue, May 14, 2019 at 10:02 AM Kevin Flöh wrote:
>> On 13.05.19 10:51 p.m., Lionel Bouton wrote:
>>> [...]
>>
>> OK, so the 2 OSDs (4,23) failed shortly one after the other, but we think that the recovery of the first was finished before the second failed. Nonetheless, both problematic pgs have been on both OSDs. We think that we still have enough shards left. For one of the pgs, the recovery state looks like this:
>> [...]
Re: [ceph-users] Major ceph disaster
On 13.05.19 11:21 p.m., Dan van der Ster wrote:
> Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs?
> It would be useful to double confirm that: check with `ceph pg query` and `ceph pg dump`.
> (If so, this is why the ignore_history_les thing isn't helping; you don't have the minimum 3 stripes up for those 3+1 PGs.)

yes, but as written in my other mail, we still have enough shards, at least I think so.

> If those "lost" OSDs by some miracle still have the PG data, you might be able to export the relevant PG stripes with the ceph-objectstore-tool. I've never tried this myself, but there have been threads in the past where people export a PG from a nearly dead hdd, import to another OSD, then backfilling works.

guess that is not possible.

> If OTOH those PGs are really lost forever, and someone else should confirm what I say here, I think the next step would be to force recreate the incomplete PGs and then run a set of cephfs scrub/repair disaster recovery cmds to recover what you can from the cephfs.
>
> -- dan

would this let us recover at least some of the data on the pgs? If not, we would just set up a new ceph directly without fixing the old one and copy whatever is left.

Best regards,
Kevin

> On Mon, May 13, 2019 at 4:20 PM Kevin Flöh wrote:
>> [...]
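A sketch of the last-resort path Dan outlines (only once the PGs are truly given up); the MDS daemon name and the scrub options follow the Luminous-era admin-socket interface and are assumptions as far as this particular cluster is concerned:

    ceph osd force-create-pg 1.5dd
    ceph osd force-create-pg 1.619
    # then let the MDS walk the tree and repair what it can still reach
    ceph daemon mds.ceph-node02 scrub_path / recursive repair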
Re: [ceph-users] Major ceph disaster
On 13.05.19 10:51 p.m., Lionel Bouton wrote:
> On 13/05/2019 at 16:20, Kevin Flöh wrote:
>> Dear ceph experts,
>> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
>> Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch. Ceph started recovering and then we lost another osd with the same behavior. We did the same as for the first osd.
>
> With 3+1 you only allow a single OSD failure per pg at a given time. You have 4096 pgs and 96 osds; having 2 OSDs fail at the same time on 2 separate servers (assuming standard crush rules) is a death sentence for the data on some pgs using both of those OSDs (the ones not fully recovered before the second failure).

OK, so the 2 OSDs (4,23) failed shortly one after the other, but we think that the recovery of the first was finished before the second failed. Nonetheless, both problematic pgs have been on both OSDs. We think that we still have enough shards left. For one of the pgs, the recovery state looks like this:

"recovery_state": [
    { "name": "Started/Primary/Peering/Incomplete",
      "enter_time": "2019-05-09 16:11:48.625966",
      "comment": "not enough complete instances of this PG" },
    { "name": "Started/Primary/Peering",
      "enter_time": "2019-05-09 16:11:48.611171",
      "past_intervals": [
          { "first": "49767",
            "last": "59313",
            "all_participants": [
                { "osd": 2, "shard": 0 },
                { "osd": 4, "shard": 1 },
                { "osd": 23, "shard": 2 },
                { "osd": 24, "shard": 0 },
                { "osd": 72, "shard": 1 },
                { "osd": 79, "shard": 3 } ],
            "intervals": [
                { "first": "58860", "last": "58861", "acting": "4(1),24(0),79(3)" },
                { "first": "58875", "last": "58877", "acting": "4(1),23(2),24(0)" },
                { "first": "59002", "last": "59009", "acting": "4(1),23(2),79(3)" },
                { "first": "59010", "last": "59012", "acting": "2(0),4(1),23(2),79(3)" },
                { "first": "59197", "last": "59233", "acting": "23(2),24(0),79(3)" },
                { "first": "59234", "last": "59313", "acting": "23(2),24(0),72(1),79(3)" } ] } ],
      "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ],
      "down_osds_we_would_probe": [],
      "peering_blocked_by": [],
      "peering_blocked_by_detail": [
          { "detail": "peering_blocked_by_history_les_bound" } ] },
    { "name": "Started", ... } ]
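The peering_blocked_by_history_les_bound entry above is the state that the osd_find_best_info_ignore_history_les override (the "ignore_history_les thing" Dan mentions elsewhere in the thread) is meant to work around. A hedged sketch of how it is typically applied to a single PG's primary and then reverted; it tells the OSD to skip a peering safety check, so it can cause data loss if used carelessly. osd.24 is the primary of pg 1.5dd according to the health detail:

    ceph tell osd.24 injectargs '--osd_find_best_info_ignore_history_les=true'
    ceph osd down 24      # force the PG to re-peer with the override in place
    ceph tell osd.24 injectargs '--osd_find_best_info_ignore_history_les=false'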
[ceph-users] Major ceph disaster
Dear ceph experts,

we have several (maybe related) problems with our ceph cluster. Let me first show you the current ceph status:

  cluster:
    id:     23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            1 MDSs behind on trimming
            1/126319678 objects unfound (0.000%)
            19 scrub errors
            Reduced data availability: 2 pgs inactive, 2 pgs incomplete
            Possible data damage: 7 pgs inconsistent
            Degraded data redundancy: 1/500333881 objects degraded (0.000%), 1 pg degraded
            118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91

  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3 up:standby
    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 126.32M objects, 260TiB
    usage:   372TiB used, 152TiB / 524TiB avail
    pgs:     0.049% pgs not active
             1/500333881 objects degraded (0.000%)
             1/126319678 objects unfound (0.000%)
             4076 active+clean
             10   active+clean+scrubbing+deep
             7    active+clean+inconsistent
             2    incomplete
             1    active+recovery_wait+degraded

  io:
    client: 449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr

and ceph health detail:

HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19 scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs incomplete; Possible data damage: 7 pgs inconsistent; Degraded data redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 351193 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec
MDS_TRIM 1 MDSs behind on trimming
    mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46034/128) max_segments: 128, num_segments: 46034
OBJECT_UNFOUND 1/126319687 objects unfound (0.000%)
    pg 1.24c has 1 unfound objects
OSD_SCRUB_ERRORS 19 scrub errors
PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
    pg 1.5dd is incomplete, acting [24,4,23,79] (reducing pool ec31 min_size from 3 may help; search ceph.com/docs for 'incomplete')
    pg 1.619 is incomplete, acting [91,23,4,81] (reducing pool ec31 min_size from 3 may help; search ceph.com/docs for 'incomplete')
PG_DAMAGED Possible data damage: 7 pgs inconsistent
    pg 1.17f is active+clean+inconsistent, acting [65,49,25,4]
    pg 1.1e0 is active+clean+inconsistent, acting [11,32,4,81]
    pg 1.203 is active+clean+inconsistent, acting [43,49,4,72]
    pg 1.5d3 is active+clean+inconsistent, acting [37,27,85,4]
    pg 1.779 is active+clean+inconsistent, acting [50,4,77,62]
    pg 1.77c is active+clean+inconsistent, acting [21,49,40,4]
    pg 1.7c3 is active+clean+inconsistent, acting [1,14,68,4]
PG_DEGRADED Degraded data redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded
    pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1 unfound
REQUEST_STUCK 118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
    118 ops are blocked > 536871 sec
    osds 24,32,91 have stuck requests > 536871 sec

Let me briefly summarize the setup: we have 4 nodes with 24 osds each and use 3+1 erasure coding. The nodes run on CentOS 7 and, due to a major mistake when setting up the cluster, we use more than one ceph version on the nodes: 3 nodes run 12.2.12 and one runs 13.2.5. We are currently not daring to update all nodes to 13.2.5. For all the version details see:

{
    "mon": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 2
    },
    "osd": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 72,
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 24
    },
    "mds": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 4
    },
    "overall": {
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 81,
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 24
    }
}

Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch. Ceph started recovering and then we lost another osd with the same behavior. We did the same as for the first osd.
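For context, a sketch of the "mark the osd as lost and set it up from scratch" sequence described above (the osd id is an example; with 3+1 erasure coding this should only ever be in flight for one OSD at a time):

    ceph osd out 4
    ceph osd lost 4 --yes-i-really-mean-it
    ceph osd purge 4 --yes-i-really-mean-it    # drops it from the osd and crush maps
    # then rebuild the OSD on the wiped disk, e.g. with ceph-volume lvm create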