I jogged my own memory... My mon servers came back and didn't keep the full ratio settings. Ceph reported the OSDs as full (96% used), which caused the pools to report full. I run hotter than the default settings: we buy disk when we hit 98% capacity, not sooner, and arguing with that policy is like yelling at a brick wall.

Setting the ratios back with ceph osd set-full-ratio, set-nearfull-ratio, and set-backfillfull-ratio let my OSDs flip back into good states, and restarting the MDS shows nice lush logs again :) The CephFS troubleshooting documentation doesn't mention any of this, so hopefully it helps someone having trouble in the future.
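For reference, this is roughly what put things right. Treat it as a sketch, not a recipe: the 0.96/0.97/0.98 values come from our run-hot policy rather than the defaults, and "bridge" is just our MDS name and a plain systemd unit here, so substitute whatever fits your deployment:

# ceph osd set-nearfull-ratio 0.96
# ceph osd set-backfillfull-ratio 0.97
# ceph osd set-full-ratio 0.98
# systemctl restart ceph-mds@bridge

Afterwards, ceph osd dump | grep ratio confirms the mons actually kept the new ratios, and ceph health detail shows the OSD/pool full warnings clearing.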
/C

On Tue, May 11, 2021 at 9:04 PM Mazzystr <mazzy...@gmail.com> wrote:

> I did a simple OS update and reboot. Now the MDS is stuck in replay. I'm
> running Octopus.
>
> debug mds = 20 shows some pretty lame logs:
>
> # tail -f ceph-mds.bridge.log
> 2021-05-11T18:24:04.859-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s
> 2021-05-11T18:24:05.860-0700 7f41314a1700 10 mds.0.cache cache not ready for trimming
> 2021-05-11T18:24:05.860-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s
> 2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629 get_task_status
> 2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629 send_task_status: updating 1 status keys
> 2021-05-11T18:24:06.859-0700 7f4133ca6700 20 mds.0.2898629 schedule_update_timer_task
> 2021-05-11T18:24:06.859-0700 7f41314a1700 10 mds.0.cache cache not ready for trimming
> 2021-05-11T18:24:06.859-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s
> 2021-05-11T18:24:07.859-0700 7f41314a1700 10 mds.0.cache cache not ready for trimming
> 2021-05-11T18:24:07.859-0700 7f41314a1700 20 mds.0.cache upkeep thread waiting interval 1s
>
> # cephfs-journal-tool event recover_dentries summary
> This gets stuck on an object and stays stuck. I tried to run rados -p cephfs_metadata_pool rmomapkey per https://tracker.ceph.com/issues/38452 but the command ran for hours and never completed.
>
> # cephfs-journal-tool --rank cephfs:0 journal reset
> 2021-05-11T18:31:26.860-0700 7f2e9c2a9700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
> 2021-05-11T18:31:26.860-0700 7f2f2989ba80 4 waiting for MDS map...
> 2021-05-11T18:31:26.860-0700 7f2f2989ba80 4 Got MDS map 2898629
> 2021-05-11T18:31:26.861-0700 7f2f2989ba80 10 main: JournalTool::main
> 2021-05-11T18:31:26.861-0700 7f2f2989ba80 4 main: JournalTool: connecting to RADOS...
> 2021-05-11T18:31:26.863-0700 7f2f2989ba80 4 main: JournalTool: resolving pool 1
> 2021-05-11T18:31:26.863-0700 7f2f2989ba80 4 main: JournalTool: creating IoCtx..
> 2021-05-11T18:31:26.863-0700 7f2f2989ba80 4 main: Executing for rank 0
> 2021-05-11T18:31:26.864-0700 7f2edc2aa700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
> 2021-05-11T18:31:26.864-0700 7f2f2989ba80 4 waiting for MDS map...
> 2021-05-11T18:31:26.865-0700 7f2f2989ba80 4 Got MDS map 2898629
> 2021-05-11T18:31:26.865-0700 7f2f2989ba80 4 client.2024650.journalpointer Reading journal pointer '400.00000000'
> 2021-05-11T18:31:26.865-0700 7f2f2989ba80 1 client.2024650.journaler.resetter(ro) recover start
> 2021-05-11T18:31:26.865-0700 7f2f2989ba80 1 client.2024650.journaler.resetter(ro) read_head
> 2021-05-11T18:31:26.865-0700 7f291c293700 1 client.2024650.journaler.resetter(ro) _finish_read_head loghead(trim 14172553216, expire 14174788378, write 14400838791, stream_format 1). probing for end of log (from 14400838791)...
> 2021-05-11T18:31:26.865-0700 7f291c293700 1 client.2024650.journaler.resetter(ro) probing for end of the log
>
> I've been stuck here for hours.
>
> # strace -f -p 10357
> [pid 10360] <... sendmsg resumed>) = 9
> [pid 10361] read(14, <unfinished ...>
> [pid 10360] epoll_wait(7, <unfinished ...>
> [pid 10361] <... read resumed>0x55e95d982000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 10360] <... epoll_wait resumed>[{EPOLLIN, {u32=16, u64=16}}, {EPOLLIN, {u32=18, u64=18}}], 5000, 30000) = 2
> [pid 10361] epoll_wait(10, <unfinished ...>
> [pid 10360] read(16, "\23\1\10\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\354^\340;"..., 4096) = 57
> [pid 10360] read(16, 0x55e95d9a8000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 10360] read(18, "\17\264R\233`\327\275\222+", 4096) = 9
> [pid 10360] read(18, 0x55e95d9f4000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
> [pid 10360] epoll_wait(7, ^X <unfinished ...>
> [pid 10370] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
> [pid 10381] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
> [pid 10370] clock_gettime(CLOCK_REALTIME, <unfinished ...>
> [pid 10389] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
> [pid 10381] clock_gettime(CLOCK_REALTIME, <unfinished ...>
> [pid 10370] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731038214}) = 0
> [pid 10389] clock_gettime(CLOCK_REALTIME, <unfinished ...>
> [pid 10381] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731105584}) = 0
> [pid 10389] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731125991}) = 0
> [pid 10370] clock_gettime(CLOCK_REALTIME, <unfinished ...>
> [pid 10381] clock_gettime(CLOCK_REALTIME, <unfinished ...>
> [pid 10389] clock_gettime(CLOCK_REALTIME, <unfinished ...>
> [pid 10370] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731162065}) = 0
> [pid 10389] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731184311}) = 0
> [pid 10381] <... clock_gettime resumed>{tv_sec=1620791989, tv_nsec=731174345}) = 0
> [pid 10370] futex(0x55e95d97c2d8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 10381] futex(0x55e95d8a5320, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 10370] <... futex resumed>) = 0
> [pid 10389] futex(0x55e95d97fad8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 10381] <... futex resumed>) = 0
> [pid 10370] futex(0x55e95d97c31c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990, tv_nsec=731161399}, 0xffffffff <unfinished ...>
> [pid 10389] <... futex resumed>) = 0
> [pid 10381] futex(0x55e95d8a5364, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990, tv_nsec=731173986}, 0xffffffff <unfinished ...>
> [pid 10389] futex(0x55e95d97fb1c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 17805, {tv_sec=1620791990, tv_nsec=731183618}, 0xffffffff^Cstrace: Process 10357 detached
>
> Any help would be great.
>
> Thanks,
> /C

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io