Hello Guys,

This time without the original acting set osd.4, 16 and 28. The issue still exists...

[...]
For the record, this ONLY happens with this PG and no others that
share the same OSDs, right?

Yes, right.
[...]
When doing the deep-scrub, monitor (atop, etc.) all 3 nodes and
see if a particular OSD (HDD) stands out, as I would expect it to.

I have now logged all disks via atop at 2-second intervals while the
deep-scrub was running ( atop -w osdXX_atop 2 ).
As you expected, all disks were 100% busy, with a constant
~150 MB/s (osd.4), ~130 MB/s (osd.28) and ~170 MB/s (osd.16)...

- osd.4 (/dev/sdf): http://slexy.org/view/s21emd2u6j
- osd.16 (/dev/sdm): http://slexy.org/view/s20vukWz5E
- osd.28 (/dev/sdh): http://slexy.org/view/s20YX0lzZY
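In case anyone wants to inspect those raw atop files themselves, they can be replayed with atop's read mode (a sketch; "osd4_atop" is just the naming scheme from above):

```shell
# Replay a raw atop log captured with "atop -w <file> 2";
# 't' steps forward one 2-second sample, 'T' steps back.
atop -r osd4_atop

# Or dump only the per-disk statistics non-interactively
# (atopsar reads the same raw file):
atopsar -r osd4_atop -d
```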
[...]
But what is causing this? A deep-scrub on all other disks (same
model, ordered at the same time) does not seem to have this issue.
[...]
Next week, I will do this:

1.1 Remove osd.4 completely from Ceph - again (the current primary
for PG 0.223)

osd.4 is now removed completely.
The primary for this PG is now osd.9:

# ceph pg map 0.223
osdmap e8671 pg 0.223 (0.223) -> up [9,16,28] acting [9,16,28]
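For reference, "removed completely" here means the usual manual removal sequence (a sketch, assuming osd.4 on its host; the same steps apply later to osd.16 and osd.28):

```shell
# Mark the OSD out and let the cluster rebalance first
ceph osd out 4

# Once backfilling has finished, stop the daemon on its host
systemctl stop ceph-osd@4

# Remove it from the CRUSH map, delete its auth key,
# and finally remove the OSD itself
ceph osd crush remove osd.4
ceph auth del osd.4
ceph osd rm 4
```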

1.2 xfs_repair -n /dev/sdf1 (osd.4): to check for possible errors

xfs_repair did not find/report any errors.
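For completeness, a sketch of how the check was run (xfs_repair refuses to run on a mounted filesystem, so the OSD has to be stopped first; the mount point below is the usual default and an assumption):

```shell
# Stop the OSD and unmount its data partition, then run the
# read-only check (-n makes no modifications to the filesystem)
systemctl stop ceph-osd@4
umount /var/lib/ceph/osd/ceph-4
xfs_repair -n /dev/sdf1
```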

1.3 ceph pg deep-scrub 0.223
- Log with 'ceph tell osd.N injectargs "--debug_osd 5/5"' for N = 4, 16, 28

Because osd.9 is now the primary for this PG, I set debug_osd on it too:
ceph tell osd.9 injectargs "--debug_osd 5/5"
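Worth noting as a sketch: injectargs only changes the running daemon, so once the scrub logs are captured the level can be dropped back (0/5 is the usual default for debug_osd, as far as I know):

```shell
# Restore the default OSD debug level after capturing the logs,
# so the log files don't keep growing
ceph tell osd.9 injectargs "--debug_osd 0/5"
ceph tell osd.16 injectargs "--debug_osd 0/5"
ceph tell osd.28 injectargs "--debug_osd 0/5"
```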

Then I ran the deep-scrub on 0.223 (and again nearly all of my VMs stopped working for a while):
Start @ 15:33:27
End @ 15:48:31

The "ceph.log":
- http://slexy.org/view/s2WbdApDLz

The related log files (OSDs 9, 16 and 28) and the atop logs for those OSDs:

LogFile - osd.9 (/dev/sdk)
- ceph-osd.9.log: http://slexy.org/view/s2kXeLMQyw
- atop Log: http://slexy.org/view/s21wJG2qr8

LogFile - osd.16 (/dev/sdh)
- ceph-osd.16.log: http://slexy.org/view/s20D6WhD4d
- atop Log: http://slexy.org/view/s2iMjer8rC

LogFile - osd.28 (/dev/sdm)
- ceph-osd.28.log: http://slexy.org/view/s21dmXoEo7
- atop log: http://slexy.org/view/s2gJqzu3uG

2.1 Remove osd.16 completely from Ceph

osd.16 is now removed completely and has been replaced by osd.17 within the acting set.

# ceph pg map 0.223
osdmap e9017 pg 0.223 (0.223) -> up [9,17,28] acting [9,17,28]

2.2 xfs_repair -n /dev/sdh1

xfs_repair did not find/report any errors.

2.3 ceph pg deep-scrub 0.223
- Log with 'ceph tell osd.N injectargs "--debug_osd 5/5"' for N = 9, 17, 28

Then I ran the deep-scrub on 0.223 (and again nearly all of my VMs stopped working for a while):

Start @ 2016-08-02 10:02:44
End @ 2016-08-02 10:17:22

The "Ceph.log": http://slexy.org/view/s2ED5LvuV2

LogFile - osd.9 (/dev/sdk)
- ceph-osd.9.log: http://slexy.org/view/s21z9JmwSu
- atop Log: http://slexy.org/view/s20XjFZFEL

LogFile - osd.17 (/dev/sdi)
- ceph-osd.17.log: http://slexy.org/view/s202fpcZS9
- atop Log: http://slexy.org/view/s2TxeR1JSz

LogFile - osd.28 (/dev/sdm)
- ceph-osd.28.log: http://slexy.org/view/s2eCUyC7xV
- atop log: http://slexy.org/view/s21AfebBqK

3.1 Remove osd.28 completely from Ceph

Now osd.28 is also removed completely from Ceph and has been replaced by osd.23.

# ceph pg map 0.223
osdmap e9363 pg 0.223 (0.223) -> up [9,17,23] acting [9,17,23]

3.2 xfs_repair -n /dev/sdm1

As expected, xfs_repair did not find/report any errors.

3.3 ceph pg deep-scrub 0.223
- Log with 'ceph tell osd.N injectargs "--debug_osd 5/5"' for N = 9, 17, 23

... and again nearly all of my VMs stopped working for a while...

All "original" OSDs (4, 16, 28) that were in the acting set when I wrote my first e-mail to this mailing list are now removed. But the issue still exists with different OSDs (9, 17, 23) in the acting set, while the questionable PG 0.223 is still the same!

Suspecting that the CRUSH tunables could be the cause, I have now changed them back to "default" via "ceph osd crush tunables default". This will take a while... then I will run "ceph pg deep-scrub 0.223" again (without OSDs 4, 16, 28)...
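As a sketch, the currently active tunables profile can be checked before and after the switch:

```shell
# Show which CRUSH tunables are currently in effect
ceph osd crush show-tunables

# Switch back to the default profile; note this can trigger
# substantial data movement, so expect rebalancing traffic
ceph osd crush tunables default
```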

For the record: although nearly all disks are busy, I see no slow/blocked requests, and I have been watching the log files for nearly 20 minutes now...

Your help is really appreciated!
- Mehmet

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
