Three OSDs, holding the three replicas of one PG, are only half-starting here, and hence that single PG is stuck as "stale+active+clean". All three died of a suicide timeout while walking over a huge omap (pool 7, 'default.rgw.buckets.index'), and they will not bring PG 7.b back online again.
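
In case it matters for suggestions: the obvious knob would seem to be the suicide timeouts themselves. Something like this in ceph.conf on the three hosts (untested here; the option names are the stock Jewel ones, the 2000-second value is an arbitrary guess) should at least let the op threads survive a long omap walk instead of getting killed:

[osd]
# defaults are 180s / 150s; raised far enough to outlast the walk
filestore_op_thread_suicide_timeout = 2000
osd_op_thread_suicide_timeout = 2000

That would only paper over whatever makes the walk take so long, though.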

From the logs, they try to start normally, do a bit of leveldb recovery, replay the journal, and then say nothing more.

2019-11-19 15:15:46.967543 7fe644fad840  0 set uid:gid to 167:167 (ceph:ceph)
2019-11-19 15:15:46.967600 7fe644fad840  0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 5149
2019-11-19 15:15:47.026065 7fe644fad840  0 pidfile_write: ignore empty --pid-file
2019-11-19 15:15:47.078291 7fe644fad840  0 filestore(/var/lib/ceph/osd/ceph-22) backend xfs (magic 0x58465342)
2019-11-19 15:15:47.079317 7fe644fad840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2019-11-19 15:15:47.079331 7fe644fad840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2019-11-19 15:15:47.079352 7fe644fad840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: splice is supported
2019-11-19 15:15:47.080287 7fe644fad840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2019-11-19 15:15:47.080529 7fe644fad840  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_feature: extsize is disabled by conf
2019-11-19 15:15:47.095819 7fe644fad840  1 leveldb: Recovering log #2731809
2019-11-19 15:15:47.119792 7fe644fad840  1 leveldb: Level-0 table #2731812: started
2019-11-19 15:15:47.132107 7fe644fad840  1 leveldb: Level-0 table #2731812: 140642 bytes OK
2019-11-19 15:15:47.143782 7fe644fad840  1 leveldb: Delete type=0 #2731809
2019-11-19 15:15:47.147198 7fe644fad840  1 leveldb: Delete type=3 #2731792
2019-11-19 15:15:47.159339 7fe644fad840  0 filestore(/var/lib/ceph/osd/ceph-22) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2019-11-19 15:15:47.243262 7fe644fad840  1 journal _open /var/lib/ceph/osd/ceph-22/journal fd 18: 21472739328 bytes, block size 4096 bytes, directio = 1, aio = 1

At this point they consume a ton of CPU, systemd thinks all is fine, and this has been going on for some five hours. "ceph -s" reports them as down, and I can't talk to the OSDs remotely from a mon, but "ceph daemon" on the OSD hosts works normally; still, I can't do anything from there except get config or perf numbers.
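
For the record, this is the kind of admin-socket call that still responds locally (osd.22 taken as the example; "config get" and "perf dump" are stock commands):

ceph daemon osd.22 config get filestore_op_thread_suicide_timeout
ceph daemon osd.22 perf dump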

Strace shows they all keep looping over the same sequence:
machine1:

stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7fffd7c98080) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head", {st_mode=S_IFDIR|0755,
st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B",
{st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7fffd7c98080) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0

machine2:

stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ffe0b664240) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head", {st_mode=S_IFDIR|0755,
st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B",
{st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ffe0b664240) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0

machine3:

stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ffc63518650) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head", {st_mode=S_IFDIR|0755,
st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B",
{st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ffc63518650) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0

Help wanted.

--
May the most significant bit of your life be positive.