# ceph pg 4.ffe query
> { "state": "active+recovering",
>   "epoch": 1642,
>   "up": [
>         7,
>         26],
>   "acting": [
>         7,
>         26],
>   "info": { "pgid": "4.ffe",
>       "last_update": "339'96",
>       "last_complete": "339'89",
>       "log_tail": "0'0",
>       "last_backfill": "MAX",
>       "purged_snaps": "[1~9]",
>       "history": { "epoch_created": 3,
>           "last_epoch_started": 1617,
>           "last_epoch_clean": 339,
>           "last_epoch_split": 0,
>           "same_up_since": 1616,
>           "same_interval_since": 1616,
>           "same_primary_since": 1616,
>           "last_scrub": "337'71",
>           "last_scrub_stamp": "2014-05-31 19:27:24.250179",
>           "last_deep_scrub": "337'41",
>           "last_deep_scrub_stamp": "2014-05-29 19:26:41.314233",
>           "last_clean_scrub_stamp": "2014-05-31 19:27:24.250179"},
>       "stats": { "version": "339'96",
>           "reported_seq": "4126",
>           "reported_epoch": "1642",
>           "state": "active+recovering",
>           "last_fresh": "2014-06-03 19:31:00.239728",
>           "last_change": "2014-06-03 18:27:59.485319",
>           "last_active": "2014-06-03 19:31:00.239728",
>           "last_clean": "2014-05-31 20:07:05.950472",
>           "last_became_active": "0.000000",
>           "last_unstale": "2014-06-03 19:31:00.239728",
>           "mapping_epoch": 1614,
>           "log_start": "0'0",
>           "ondisk_log_start": "0'0",
>           "created": 3,
>           "last_epoch_clean": 339,
>           "parent": "0.0",
>           "parent_split_bits": 0,
>           "last_scrub": "337'71",
>           "last_scrub_stamp": "2014-05-31 19:27:24.250179",
>           "last_deep_scrub": "337'41",
>           "last_deep_scrub_stamp": "2014-05-29 19:26:41.314233",
>           "last_clean_scrub_stamp": "2014-05-31 19:27:24.250179",
>           "log_size": 96,
>           "ondisk_log_size": 96,
>           "stats_invalid": "0",
>           "stat_sum": { "num_bytes": 33554432,
>               "num_objects": 4,
>               "num_object_clones": 0,
>               "num_object_copies": 0,
>               "num_objects_missing_on_primary": 0,
>               "num_objects_degraded": 0,
>               "num_objects_unfound": 0,
>               "num_read": 2403,
>               "num_read_kb": 15994,
>               "num_write": 103,
>               "num_write_kb": 92721,
>               "num_scrub_errors": 0,
>               "num_shallow_scrub_errors": 0,
>               "num_deep_scrub_errors": 0,
>               "num_objects_recovered": 2,
>               "num_bytes_recovered": 16777216,
>               "num_keys_recovered": 0},
>           "stat_cat_sum": {},
>           "up": [
>                 7,
>                 26],
>           "acting": [
>                 7,
>                 26]},
>       "empty": 0,
>       "dne": 0,
>       "incomplete": 0,
>       "last_epoch_started": 1617},
>   "recovery_state": [
>         { "name": "Started\/Primary\/Active",
>           "enter_time": "2014-06-03 18:27:58.473736",
>           "might_have_unfound": [
>                 { "osd": 2,
>                   "status": "already probed"},
>                 { "osd": 3,
>                   "status": "already probed"},
>                 { "osd": 12,
>                   "status": "osd is down"},
>                 { "osd": 14,
>                   "status": "osd is down"},
>                 { "osd": 19,
>                   "status": "osd is down"},
>                 { "osd": 23,
>                   "status": "querying"},
>                 { "osd": 26,
>                   "status": "already probed"}],
>           "recovery_progress": { "backfill_target": -1,
>               "waiting_on_backfill": 0,
>               "backfill_pos": "0\/\/0\/\/-1",
>               "backfill_info": { "begin": "0\/\/0\/\/-1",
>                   "end": "0\/\/0\/\/-1",
>                   "objects": []},
>               "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
>                   "end": "0\/\/0\/\/-1",
>                   "objects": []},
>               "backfills_in_flight": [],
>               "pull_from_peer": [],
>               "pushing": []},
>           "scrub": { "scrubber.epoch_start": "0",
>               "scrubber.active": 0,
>               "scrubber.block_writes": 0,
>               "scrubber.finalizing": 0,
>               "scrubber.waiting_on": 0,
>               "scrubber.waiting_on_whom": []}},
>         { "name": "Started",
>           "enter_time": "2014-06-03 18:27:57.308690"}]}
OSDs 12, 14 and 19 were the ones that got corrupted.  I’ve marked them as lost and 
removed them from the cluster (a rough sketch of that sequence follows the tree 
below).  ‘ceph osd tree’ shows the following:

> # id    weight  type name       up/down reweight
> -1      16.38   root default
> -2      5.46            host r-17E813A511
> 0       0.91                    osd.0   up      1
> 1       0.91                    osd.1   up      1
> 2       0.91                    osd.2   up      1
> 3       0.91                    osd.3   up      1
> 4       0.91                    osd.4   up      1
> 5       0.91                    osd.5   up      1
> -3      5.46            host r-3A72F8075A
> 6       0.91                    osd.6   up      1
> 7       0.91                    osd.7   up      1
> 8       0.91                    osd.8   up      1
> 9       0.91                    osd.9   up      1
> 10      0.91                    osd.10  up      1
> 11      0.91                    osd.11  up      1
> -4      5.46            host r-F9CBF5C8C5
> 21      0.91                    osd.21  up      1
> 22      0.91                    osd.22  up      1
> 23      0.91                    osd.23  up      1
> 24      0.91                    osd.24  up      1
> 25      0.91                    osd.25  up      1
> 26      0.91                    osd.26  up      1
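
Marking a dead OSD lost and pulling it out of the cluster generally goes something 
like this (shown for osd.12, assuming the stock CLI; the same steps apply to 14 
and 19):

  ceph osd out 12                           # stop mapping data onto it
  ceph osd lost 12 --yes-i-really-mean-it   # tell the cluster its data is gone for good
  ceph osd crush remove osd.12              # drop it from the CRUSH map
  ceph auth del osd.12                      # remove its cephx key
  ceph osd rm 12                            # remove it from the OSD map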

./JRH

On Jun 3, 2014, at 4:00 PM, Smart Weblications GmbH - Florian Wiessner 
<f.wiess...@smart-weblications.de> wrote:

> Hi,
> 
> On 03.06.2014 21:46, Jason Harley wrote:
>> Howdy —
>> 
>> I’ve had a failure on a small, Dumpling (0.67.4) cluster running on Ubuntu 
>> 13.10 machines.  I had three OSD nodes (running 6 OSDs each), and lost two 
>> of them in a beautiful failure.  One of these nodes even went so far as to 
>> scramble the XFS filesystems of my OSD disks (I’m curious if it has some bad 
>> DIMMs).
>> 
>> Anyway, the thing is: I’m okay with losing the data; this was a test setup, 
>> and I want to take this opportunity to learn from the recovery process.  I’m 
>> now stuck in ‘HEALTH_ERR’ and want to get back to ‘HEALTH_OK’ without just 
>> reinitializing the cluster.
>> 
>> My OSD map seems correct, and I’ve done scrubs (deep and normal) at the PG and 
>> OSD levels.  ‘ceph -s’ shows that I still have 47 unfound objects after I 
>> told ceph to ‘mark_unfound_lost’.  The remaining 47 PGs tell me that they 
>> “haven't probed all sources, not marking lost”.  Two days have passed at 
>> this point, and I’d just like to get my cluster back to working and deal 
>> with the object loss (which seems confined to a single pool).
>> 
>> How do I move forward from here, if at all?  Do I ‘force_create_pg’ the PGs 
>> containing my unfound objects?
>> 
>>> # ceph health detail | grep "unfound" | grep "^pg"
>>> pg 4.ffe is active+recovering, acting [7,26], 3 unfound
> ...
>>> pg 4.43 is active+recovering, acting [9,23], 1 unfound
>> 
>> 
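
For reference, the commands mentioned above look roughly like this, using pg 4.ffe 
as the example (a sketch only; force_create_pg is very much a last resort, and 
mark_unfound_lost will only take effect once every source has been probed or 
written off):

  ceph pg 4.ffe mark_unfound_lost revert   # give up on the unfound objects in this PG
  ceph pg force_create_pg 4.ffe            # last resort: recreate the PG empty, discarding its contents
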
> 
> What is the output of:
> 
> ceph pg 4.ffe query
> 
> 
> 
> -- 
> 
> Kind regards,
> 
> Florian Wiessner
> 
> Smart Weblications GmbH
