On 08/10/18 10:20, Yan, Zheng wrote:
On Mon, Oct 8, 2018 at 9:07 PM Alfredo Daniel Rezinovsky
<alfrenov...@gmail.com> wrote:


On 08/10/18 09:45, Yan, Zheng wrote:
On Mon, Oct 8, 2018 at 6:40 PM Alfredo Daniel Rezinovsky
<alfrenov...@gmail.com> wrote:
On 08/10/18 07:06, Yan, Zheng wrote:
On Mon, Oct 8, 2018 at 5:43 PM Sergey Malinin <h...@newmail.com> wrote:
On 8.10.2018, at 12:37, Yan, Zheng <uker...@gmail.com> wrote:

On Mon, Oct 8, 2018 at 4:37 PM Sergey Malinin <h...@newmail.com> wrote:
What additional steps need to be taken in order to (try to) regain access to 
the fs, given that I backed up the metadata pool, created an alternate metadata 
pool, and ran scan_extents, scan_links, scan_inodes, and a somewhat recursive 
scrub?
After that I only mounted the fs read-only to back up the data.
Would anything even work if I had the MDS journal and purge queue truncated?

Did you back up the whole metadata pool? Did you make any modifications
to the original metadata pool? If you did, what modifications?
I backed up both the journal and the purge queue and used cephfs-journal-tool to 
recover dentries, then reset the journal and purge queue on the original metadata pool.
You can try restoring the original journal and purge queue, then downgrading
the MDS to 13.2.1.   Journal object names are 20x.xxxxxxxx; purge queue
object names are 50x.xxxxxxxxx.
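Before modifying anything it may be worth copying those objects out of the pool first; a hedged sketch, assuming the metadata pool is named "cephfs_metadata" and rank 0 (so journal objects are 200.* and purge queue objects are 500.*):

```shell
mkdir -p backup
# Copy every journal (200.*) and purge queue (500.*) object out of the pool
# into local files, so the originals can be restored if a later step goes wrong.
rados -p cephfs_metadata ls | grep -E '^(200|500)\.' | while read -r obj; do
  rados -p cephfs_metadata get "$obj" "backup/$obj"
done
```

The pool name and rank are assumptions; check yours with `ceph fs ls`.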
I've already done a scan_extents and am doing a scan_inodes. Do I need to
finish with scan_links?

I'm on 13.2.2. Do I finish the scan_links and then downgrade?

I have a backup done with "cephfs-journal-tool journal export
backup.bin". I think I don't have the purge queue.

Can I reset the purge queue journal? Can I import an empty file?

It's better to restore the journal to the original metadata pool and reset
the purge queue to empty, then try starting the MDS. Resetting the purge
queue will leave some objects in an orphaned state, but we can handle them
later.
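If I read the suggestion right, a minimal sketch of those two steps (defaults assumed: the plain "journal" is the MDS log journal of rank 0):

```shell
# Restore the previously exported MDS journal into the original metadata pool:
cephfs-journal-tool journal import backup.bin
# Reset the purge queue journal to empty (orphaned objects handled later):
cephfs-journal-tool --journal=purge_queue journal reset
```

Both commands act on the live metadata pool, so only run them once the backup above is in hand.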

Regards
Yan, Zheng
Let's see...

"cephfs-journal-tool journal import backup.bin" will restore the whole
metadata?
Is that what "journal" means?

It just restores the journal. If you only reset the original fs' journal
and purge queue (and ran the scan_foo commands with an alternate metadata pool),
it's highly likely that restoring the journal will bring your fs back.



So I can stop cephfs-data-scan, run the import, downgrade, and then
reset the purge queue?

You said you have already run scan_extents and scan_inodes. What
cephfs-data-scan command is running?
Already ran (without an alternate metadata pool):

time cephfs-data-scan scan_extents cephfs_data # 10 hours

time cephfs-data-scan scan_inodes cephfs_data # running 3 hours
with a warning:
7fddd8f64ec0 -1 datascan.inject_with_backtrace: Dentry 0x0x10000db852b/dovecot.index already exists but points to 0x0x1000134f97f

Still not run:

time cephfs-data-scan scan_links
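For what it's worth, the extent and inode scan phases can be sharded across parallel workers, which may help with those multi-hour runtimes; a sketch using the same data pool name as above:

```shell
# Shard scan_extents across 4 worker processes (the same --worker_n/--worker_m
# pattern applies to scan_inodes); each worker handles a disjoint slice.
for i in 0 1 2 3; do
  cephfs-data-scan scan_extents --worker_n "$i" --worker_m 4 cephfs_data &
done
wait   # all workers must finish before moving to the next phase
```

All workers of one phase must complete before starting the next phase.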


After importing the original journal, run 'ceph mds repaired
fs_name:damaged_rank', then try restarting the MDS. Check whether the MDS
can start.
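Spelled out, assuming the fs is named "cephfs" and rank 0 is the damaged rank (both are assumptions; substitute your own):

```shell
# Clear the damaged flag on the rank so an MDS is allowed to take it again:
ceph mds repaired cephfs:0
# Then watch whether a standby MDS picks up the rank and replays cleanly:
ceph fs status
```

If the MDS goes damaged again, the log lines around the failure are the next thing to look at.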

Please remind me of the commands:
I've been 3 days without sleep, and I don't want to break it any further.

sorry for that.
I updated on Friday, breaking a golden rule: "READ ONLY FRIDAY". My fault.
Thanks



What do I do with the journals?

Before proceeding to the alternate metadata pool recovery I was able to start the MDS, 
but it soon failed throwing lots of 'loaded dup inode' errors; I'm not sure whether that 
involved changing anything in the pool.
I have left the original metadata pool untouched since then.


Yan, Zheng

On 8.10.2018, at 05:15, Yan, Zheng <uker...@gmail.com> wrote:

Sorry, this was caused by a wrong backport. Downgrading the MDS to 13.2.1 and
marking the MDS repaired can resolve this.
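A hedged sketch of that downgrade-and-repair sequence on the MDS host (the package manager syntax and exact version string are assumptions; adjust for your distro):

```shell
# Stop the MDS, downgrade the ceph-mds package to 13.2.1, restart, mark repaired.
systemctl stop ceph-mds.target
apt-get install --allow-downgrades ceph-mds=13.2.1-1bionic   # version string is a guess
systemctl start ceph-mds.target
ceph mds repaired <fs_name>:0   # <fs_name> left as a placeholder
```

On RPM-based systems the middle step would be a `yum downgrade` instead.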

Yan, Zheng
On Sat, Oct 6, 2018 at 8:26 AM Sergey Malinin <h...@newmail.com> wrote:
Update:
I discovered http://tracker.ceph.com/issues/24236 and 
https://github.com/ceph/ceph/pull/22146
Make sure that it is not relevant in your case before proceeding to operations 
that modify on-disk data.


On 6.10.2018, at 03:17, Sergey Malinin <h...@newmail.com> wrote:

I ended up rescanning the entire fs using the alternate metadata pool approach as 
in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
The process has not completed yet because during the recovery our cluster 
encountered another problem with OSDs that I got fixed yesterday (thanks to 
Igor Fedotov @ SUSE).
The first stage (scan_extents) completed in 84 hours (120M objects in the data pool 
on 8 HDD OSDs on 4 hosts). The second (scan_inodes) was interrupted by the OSD 
failure so I have no timing stats, but it seems to be running 2-3 times faster 
than the extents scan.
As to the root cause -- in my case I recall that during the upgrade I had forgotten to 
restart 3 OSDs, one of which was holding metadata pool contents, before 
restarting the MDS daemons, and that seemed to have had an impact on the MDS journal 
corruption, because when I restarted those OSDs, the MDS was able to start up but 
soon failed throwing lots of 'loaded dup inode' errors.


On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky <alfrenov...@gmail.com> wrote:

Same problem...

# cephfs-journal-tool --journal=purge_queue journal inspect
2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.0000016c
Overall journal integrity: DAMAGED
Objects missing:
0x16c
Corrupt regions:
0x5b000000-ffffffffffffffff
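One way to dig into what the inspector is reporting, assuming the metadata pool is named "cephfs_metadata": the purge queue header records the range the journal spans, and "Missing object 500.0000016c" means that RADOS object is simply absent from the pool.

```shell
# Show the purge queue journal header (write_pos, expire_pos, etc.):
cephfs-journal-tool --journal=purge_queue header get
# List which purge queue (500.*) objects actually exist in the pool:
rados -p cephfs_metadata ls | grep '^500\.'
```

Comparing the two shows whether 0x16c was really lost or the header simply points past the end.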

Just after upgrade to 13.2.2

Did you fix it?


On 26/09/18 13:05, Sergey Malinin wrote:

Hello,
Followed the standard upgrade procedure to upgrade from 13.2.1 to 13.2.2.
After the upgrade the MDS cluster is down; mds rank 0 and the purge_queue journal are 
damaged. Resetting the purge_queue does not seem to work well, as the journal still 
appears to be damaged.
Can anybody help?

mds log:

-789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to 
version 586 from mon.2
-788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i am now 
mds.0.583
-787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map state 
change up:rejoin --> up:active
-786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- 
successful recovery!
<skip>
    -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: 
Decode error at read_pos=0x322ec6636
    -37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 set_want_state: 
up:active -> down:damaged
    -36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send 
down:damaged seq 137
    -35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message 
to mon.ceph3 at mon:6789/0
    -34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> 
mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 0x563b321ad480 
con 0
<skip>
     -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> 
mon:6789/0 conn(0x563b3213e000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 
cs=1 l=1). rx mon.2 seq 29 0x563b321ab880 mdsbeaco
n(85106/mds2 down:damaged seq 311 v587) v7
     -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== 
mon.2 mon:6789/0 29 ==== mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 ==== 
129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e
000
     -1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 
handle_mds_beacon down:damaged seq 311 rtt 0.038261
      0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!

# cephfs-journal-tool --journal=purge_queue journal inspect
Overall journal integrity: DAMAGED
Corrupt regions:
0x322ec65d9-ffffffffffffffff

# cephfs-journal-tool --journal=purge_queue journal reset
old journal was 13470819801~8463
new journal start will be 13472104448 (1276184 bytes past old end)
writing journal head
done
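As a sanity check, the reset output is at least self-consistent: the old journal spanned start~length = 13470819801~8463, and the reported new start is exactly 1276184 bytes past its end:

```shell
# old end = start + length; difference between the new start and the old end:
echo $(( 13472104448 - (13470819801 + 8463) ))   # prints 1276184
```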

# cephfs-journal-tool --journal=purge_queue journal inspect
2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.00000c8c
Overall journal integrity: DAMAGED
Objects missing:
0xc8c
Corrupt regions:
0x323000000-ffffffffffffffff
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


