Ben,

I haven't looked at everything in your message, but pg 12.7a1 has lost data because of writes that went only to osd.73. The way to recover is to force recovery to ignore this fact and go with whatever data you have on the remaining OSDs. I assume this was caused by having min_size 1, having multiple nodes fail while clients continued to write, and then permanently losing osd.73.

You should TEMPORARILY set the osd_find_best_info_ignore_history_les config option to 1 on osd.36 and then mark it down (ceph osd down), so it will rejoin, re-peer and mark the pg active+clean. Don't forget to set osd_find_best_info_ignore_history_les back to 0 afterwards.
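A minimal sketch of the commands (untested; if injectargs doesn't take effect on your build, put the option in ceph.conf on that OSD and restart it instead):

    ceph tell osd.36 injectargs '--osd_find_best_info_ignore_history_les 1'
    ceph osd down 36
    # once the pg is active+clean again, revert:
    ceph tell osd.36 injectargs '--osd_find_best_info_ignore_history_les 0'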


Later you should fix your crush map. See http://docs.ceph.com/docs/master/rados/operations/crush-map/

The wrong placements make you vulnerable to a single host failure taking out multiple copies of an object.
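For example, something along these lines should put a misplaced OSD back under its real host bucket (the weight is taken from your 'osd tree' output and root=default is an assumption, so treat it as a sketch):

    ceph osd crush set osd.26 1.81 host=cld-mtl-004 root=default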

David

On 3/7/16 9:41 PM, Ben Hines wrote:
Howdy,

I was hoping someone could help me recover a couple pgs which are causing
problems in my cluster. If we aren't able to resolve this soon, we may have
to just destroy them and lose some data. Recovery has so far been
unsuccessful. Data loss would probably cause some here to reconsider Ceph
as something we'll stick with long term, so I'd love to recover it.

Ceph 9.2.1. I have 4 (well, 3 now) pgs which are incomplete + stuck peering
after a disk failure.

pg 12.7a1 query: https://gist.github.com/benh57/ba4f96103e1f6b3b7a4d
pg 12.7b query: https://gist.github.com/benh57/8db0bfccc5992b9ca71a
pg 10.4f query:  https://gist.github.com/benh57/44bdd2a19ea667d920ab
ceph osd tree: https://gist.github.com/benh57/9fc46051a0f09b6948b7

- The bad OSD (osd-73) was on mtl-024. There were no 'unfound' objects when
it went down; the pg was 'down + peering'. It was marked lost.
- After marking 73 lost, the new primary still wants to peer and flips
between peering and incomplete.
- Noticed '73' still shows in the pg query output for the bad pgs. (Maybe I
need to bring back an osd with the same name?)
- Noticed that the new primary got set to an osd (osd-77) which was on the
same node as the osd (osd-76) which had all the data. Figuring 77 couldn't
peer with 36 because it was on the same node, I set 77 out; 36 became primary
and 76 became one of the replicas. No change. (Rough commands for these
steps are sketched just below.)
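A sketch of the commands for these steps (standard ceph CLI; exact IDs as above):

    ceph osd lost 73 --yes-i-really-mean-it
    ceph osd out 77
    ceph pg 12.7a1 query    # check the resulting peering state / acting set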

Startup logs of the primaries of the bad pgs (12.7a1, 10.4f) with 'debug osd = 20,
debug filestore = 30, debug ms = 1' (large files; the corresponding ceph.conf
snippet is sketched after the links):

osd 36 (12.7a1) startup log:
https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.36.log
osd 6 (10.4f) startup log:
https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.6.log
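That is, roughly this in ceph.conf on those OSDs (or the equivalent injectargs) before restarting them:

    [osd]
        debug osd = 20
        debug filestore = 30
        debug ms = 1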


Some other Notes:

- Searching for OSDs which had data in 12.7a1_head, I found that osd-76 has
12G, but primary osd-36 has only 728M. Another OSD which is out (100) also has a
copy of the data. Even running a pg repair does not pick up the data
from 76; the pg remains stuck peering.
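(The repair attempt was just the stock command, along the lines of:

    ceph pg repair 12.7a1
)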

- One of the pgs was part of a pool which was no longer needed (the unused
radosgw .rgw.control pool, with one 0kb object in it). Per steps previously
discussed here for a similar failure, I attempted this recovery procedure on
it, to see if it would work for the others:

-- The failed osd's disk only mounts read-only, which causes
ceph-objectstore-tool to fail to export, so I exported the pg from a seemingly
good copy on another osd.
-- stopped all osds
-- exported the pg with ceph-objectstore-tool from the apparently good OSD
(invocations sketched after this list)
-- removed the pg from all osds which had it, using ceph-objectstore-tool
-- imported the pg into an out osd, osd-100

   Importing pgid 4.95
Write 4/88aa5c95/notify.2/head
Import successful

-- force-recreated the pg on the cluster:
            ceph pg force_create_pg 4.95
-- brought up all osds
-- the new pg 4.95 primary gets set to osd-99 + osd-64, with 0 objects
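A rough sketch of the ceph-objectstore-tool invocations for the export/remove/import steps (run only with the relevant OSD stopped; the data paths, OSD ids and journal locations are illustrative):

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
        --journal-path /var/lib/ceph/osd/ceph-NN/journal \
        --op export --pgid 4.95 --file /tmp/pg4.95.export
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
        --journal-path /var/lib/ceph/osd/ceph-NN/journal \
        --op remove --pgid 4.95
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100 \
        --journal-path /var/lib/ceph/osd/ceph-100/journal \
        --op import --file /tmp/pg4.95.export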

However, the object doesn't sync to the new pg from osd-100; instead osd-64
tells osd-100 to remove its copy of the pg:

2016-03-05 15:44:22.858147 7fc004168700 20 osd.100 68025 _dispatch 0x7fc020867660 osd pg remove(epoch 68025; pg4.95; ) v2
2016-03-05 15:44:22.858174 7fc004168700  7 osd.100 68025 handle_pg_remove from osd.64 on 1 pgs
2016-03-05 15:44:22.858176 7fc004168700 15 osd.100 68025 require_same_or_newer_map 68025 (i am 68025) 0x7fc020867660
2016-03-05 15:44:22.858188 7fc004168700  5 osd.100 68025 queue_pg_for_deletion: 4.95
2016-03-05 15:44:22.858228 7fc004168700 15 osd.100 68025 project_pg_history 4.95 from 68025 to 68025, start ec=76 les/c/f 62655/62611/0 66982/67983/66982

Not wanting this to happen to the needed data in the other PGs, I didn't
try this procedure on them. After this procedure osd-100 does get
listed in 'pg query' as 'might_have_unfound', but ceph apparently decides
not to use it and the active osd sends a remove.

output of 'ceph pg 4.95 query' after these recovery steps:
https://gist.github.com/benh57/fc9a847cd83f4d5e4dcf


Quite Possibly Related:

I am occasionally noticing incorrect entries in 'ceph osd tree': it seems
my crush map thinks some osds are on the wrong hosts. I wonder if this is
why peering is failing? For example:
  -5   9.04999     host cld-mtl-006
  12   1.81000         osd.12               up  1.00000          1.00000
  13   1.81000         osd.13               up  1.00000          1.00000
  14   1.81000         osd.14               up  1.00000          1.00000
  94   1.81000         osd.94               up  1.00000          1.00000
  26   1.81000         osd.26               up  0.86775          1.00000

^^ This host only has 4 osds on it! osd.26 is actually running over on
cld-mtl-004! Restarting 26 fixed the map.
osd.42 (out) was also in the wrong place in 'osd tree': the tree says it's on
cld-mtl-013, but it's actually on cld-mtl-024.
- Fixing these issues caused a large re-balance, so 'ceph health detail' is
a bit dirty right now, but you can see the stuck pgs:
ceph health detail:

-  I wonder if these incorrect crush maps caused ceph to put some data on
the wrong osds, resulting in a peering failure later when the map repaired
itself?
-  How does ceph determine what node an OSD is on? That process may be
periodically failing due to some issue (DNS?). (See the ceph.conf sketch
after this list.)
-  Perhaps if I enable the 'allow peer to same host' setting, the cluster
could repair itself? Then I could turn it off again.
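If the problem is OSDs re-declaring their own location at startup, maybe pinning things with something like this in ceph.conf would help (untested sketch; as far as I know the default startup hook derives host= from the node's short hostname rather than DNS):

    [osd]
        # keep OSDs from updating their own crush location when they start
        osd crush update on start = false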


Any assistance is appreciated!

-Ben



_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
