On Tue, May 14, 2019 at 5:13 PM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>
> ok, so now we see at least a difference in the recovery state:
>
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Peering/Incomplete",
>             "enter_time": "2019-05-14 14:15:15.650517",
>             "comment": "not enough complete instances of this PG"
>         },
>         {
>             "name": "Started/Primary/Peering",
>             "enter_time": "2019-05-14 14:15:15.243756",
>             "past_intervals": [
>                 {
>                     "first": "49767",
>                     "last": "59580",
>                     "all_participants": [
>                         {"osd": 2, "shard": 0},
>                         {"osd": 4, "shard": 1},
>                         {"osd": 23, "shard": 2},
>                         {"osd": 24, "shard": 0},
>                         {"osd": 72, "shard": 1},
>                         {"osd": 79, "shard": 3}
>                     ],
>                     "intervals": [
>                         {"first": "59562", "last": "59563", "acting": "4(1),24(0),79(3)"},
>                         {"first": "59564", "last": "59567", "acting": "23(2),24(0),79(3)"},
>                         {"first": "59570", "last": "59574", "acting": "4(1),23(2),79(3)"},
>                         {"first": "59577", "last": "59580", "acting": "4(1),23(2),24(0)"}
>                     ]
>                 }
>             ],
>             "probing_osds": [
>                 "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)"
>             ],
>             "down_osds_we_would_probe": [],
>             "peering_blocked_by": []
>         },
>         {
>             "name": "Started",
>             "enter_time": "2019-05-14 14:15:15.243663"
>         }
>     ],
>
> the peering does not seem to be blocked anymore, but still there is no
> recovery going on. Is there anything else we can try?
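(For reference, a recovery_state block like the one above comes from querying the PG directly; a minimal sketch, where the pgid is a placeholder you would take from "ceph health detail" or "ceph pg dump_stuck":)

    # PGID is a placeholder -- substitute one of your incomplete PGs
    PGID=1.2ab
    ceph pg "$PGID" query | jq '.recovery_state'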
What is the state of the HDDs which had OSDs 4 & 23? You may be able to
use ceph-objectstore-tool to export those PG shards and import them to
another operable OSD.

-- dan
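A rough sketch of that export/import, assuming the old disks still mount and the shards are readable; the data paths, the pgid/shard suffix, and the choice of osd.24 as the import target are placeholders, and both OSDs must be stopped while ceph-objectstore-tool runs:

    # on the host where the old osd.4 data directory is mounted (placeholder path)
    ceph-objectstore-tool --data-path /mnt/old-osd-4 \
        --pgid 1.2abs1 --op export --file /tmp/1.2abs1.export

    # copy the export to a host with a healthy OSD, stop that OSD, import, restart
    systemctl stop ceph-osd@24
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-24 \
        --op import --file /tmp/1.2abs1.export
    systemctl start ceph-osd@24

Once the shard is back on an up OSD, the PG should be able to find enough complete instances to peer.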
>
> On 14.05.19 11:02 AM, Dan van der Ster wrote:
> > On Tue, May 14, 2019 at 10:59 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
> >>
> >> On 14.05.19 10:08 AM, Dan van der Ster wrote:
> >>
> >> On Tue, May 14, 2019 at 10:02 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
> >>
> >> On 13.05.19 10:51 PM, Lionel Bouton wrote:
> >>
> >> On 13/05/2019 at 16:20, Kevin Flöh wrote:
> >>
> >> Dear ceph experts,
> >>
> >> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
> >> Here is what happened: One osd daemon could not be started and
> >> therefore we decided to mark the osd as lost and set it up from
> >> scratch. Ceph started recovering and then we lost another osd with
> >> the same behavior. We did the same as for the first osd.
> >>
> >> With 3+1 you only allow a single OSD failure per pg at a given time.
> >> You have 4096 pgs and 96 osds; having 2 OSDs fail at the same time on 2
> >> separate servers (assuming standard crush rules) is a death sentence
> >> for the data on some pgs using both of those OSDs (the ones not fully
> >> recovered before the second failure).
> >>
> >> OK, so the 2 OSDs (4, 23) failed shortly one after the other, but we think
> >> that the recovery of the first was finished before the second failed.
> >> Nonetheless, both problematic pgs have been on both OSDs. We think that
> >> we still have enough shards left. For one of the pgs, the recovery state
> >> looks like this:
> >>
> >>     "recovery_state": [
> >>         {
> >>             "name": "Started/Primary/Peering/Incomplete",
> >>             "enter_time": "2019-05-09 16:11:48.625966",
> >>             "comment": "not enough complete instances of this PG"
> >>         },
> >>         {
> >>             "name": "Started/Primary/Peering",
> >>             "enter_time": "2019-05-09 16:11:48.611171",
> >>             "past_intervals": [
> >>                 {
> >>                     "first": "49767",
> >>                     "last": "59313",
> >>                     "all_participants": [
> >>                         {"osd": 2, "shard": 0},
> >>                         {"osd": 4, "shard": 1},
> >>                         {"osd": 23, "shard": 2},
> >>                         {"osd": 24, "shard": 0},
> >>                         {"osd": 72, "shard": 1},
> >>                         {"osd": 79, "shard": 3}
> >>                     ],
> >>                     "intervals": [
> >>                         {"first": "58860", "last": "58861", "acting": "4(1),24(0),79(3)"},
> >>                         {"first": "58875", "last": "58877", "acting": "4(1),23(2),24(0)"},
> >>                         {"first": "59002", "last": "59009", "acting": "4(1),23(2),79(3)"},
> >>                         {"first": "59010", "last": "59012", "acting": "2(0),4(1),23(2),79(3)"},
> >>                         {"first": "59197", "last": "59233", "acting": "23(2),24(0),79(3)"},
> >>                         {"first": "59234", "last": "59313", "acting": "23(2),24(0),72(1),79(3)"}
> >>                     ]
> >>                 }
> >>             ],
> >>             "probing_osds": [
> >>                 "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)"
> >>             ],
> >>             "down_osds_we_would_probe": [],
> >>             "peering_blocked_by": [],
> >>             "peering_blocked_by_detail": [
> >>                 {
> >>                     "detail": "peering_blocked_by_history_les_bound"
> >>                 }
> >>             ]
> >>         },
> >>         {
> >>             "name": "Started",
> >>             "enter_time": "2019-05-09 16:11:48.611121"
> >>         }
> >>     ],
> >>
> >> Is there a chance to recover this pg from the shards on OSDs 2, 72, 79?
> >> ceph pg repair/deep-scrub/scrub did not work.
> >>
> >> repair/scrub are not related to this problem so they won't help.
> >>
> >> How exactly did you use the osd_find_best_info_ignore_history_les option?
> >>
> >> One correct procedure would be to set it to true in ceph.conf, then
> >> restart each of the probing_osds above.
> >> (Once the PG has peered, you need to unset the option and restart
> >> those osds again.)
> >>
> >> We executed
> >>     ceph --admin-daemon /var/run/ceph/ceph-osd.X.asok config set \
> >>         osd_find_best_info_ignore_history_les true
> >> and then we restarted the affected OSDs. I guess this is doing the same,
> >> right?
> > No, that doesn't work. That just sets it in memory, but the option
> > is reset to the default when you restart the OSD.
> > You need to set it in ceph.conf on the OSD host.
> >
> > -- dan
> >
> >> We are also worried about the MDS being behind on trimming, or is this
> >> not too problematic?
> >>
> >> Trimming requires IO on PGs, and the mds is almost certainly stuck on
> >> those incomplete PGs.
> >> Solve the incomplete, and then address the MDS later if it doesn't
> >> resolve itself.
> >>
> >> -- dan
> >>
> >> ok, then we don't have to worry about this for now.
> >>
> >> Best regards,
> >>
> >> Kevin
> >>
> >> MDS_TRIM 1 MDSs behind on trimming
> >>     mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46178/128)
> >>     max_segments: 128, num_segments: 46178
> >>
> >> Depending on the data stored (CephFS?) you probably can recover most
> >> of it, but some of it is irremediably lost.
> >>
> >> If you can recover the data from the failed OSDs at the time they
> >> failed, you might be able to recover some of your lost data (with the
> >> help of Ceph devs); if not, there's nothing to do.
> >>
> >> In the latter case I'd add a new server to use at least 3+2 for a fresh
> >> pool instead of 3+1 and begin moving the data to it.
> >>
> >> The 12.2 + 13.2 mix is a potential problem in addition to the one
> >> above, but it's a different one.
> >>
> >> Best regards,
> >>
> >> Lionel
> >>
> >> The idea for the future is to set up a new ceph with 3+2 with 8 servers
> >> in total, and of course with consistent versions on all nodes.
> >>
> >> Best regards,
> >>
> >> Kevin
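For completeness, the ceph.conf change described above for osd_find_best_info_ignore_history_les would look roughly like this on each host holding one of the probing OSDs; the [osd] section placement and the OSD ids in the restart command are illustrative assumptions, and the setting should be removed (with another restart) once the PG has peered:

    # /etc/ceph/ceph.conf on the OSD host -- temporary, remove after peering
    [osd]
    osd_find_best_info_ignore_history_les = true

    # then restart the probing OSDs on that host, e.g.
    systemctl restart ceph-osd@2 ceph-osd@24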