On a cluster running Nautilus (14.2.8), we are getting a health warning about a PG that has not been deep-scrubbed in time.  Looking at the primary OSD's logs, the OSD appears to start a deep scrub of this PG every hour or so and emits some messages I don't understand, but the deep scrub never completes (neither cleanly nor with a scrub error).
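
For reference, this is roughly how I'm looking at the warning and the PG's scrub timestamps (paraphrased from my shell history, so the exact invocations may differ):

  ceph health detail              # should list 1.43f under the not-deep-scrubbed warning
  ceph pg dump pgs | grep ^1.43f  # per-PG stats, including the last scrub / deep-scrub stamps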

Here is the PG from pg dump:

1.43f     31794                  0        0         0       0 66930087214           0          0 3004     3004 active+clean+scrubbing+deep 2020-04-20 04:48:13.055481 46286'483734  46286:563439 [354,694,851]        354 [354,694,851]            354  39594'348643 2020-04-10 12:39:16.261088    38482'314644 2020-04-04 10:37:03.121638             0

Here is a section of the primary OSD's log (osd.354):

2020-04-18 08:21:08.322 7fffd2e2b700  0 log_channel(cluster) log [DBG] : 1.43f deep-scrub starts
2020-04-18 08:37:53.362 7fffd2e2b700  1 osd.354 pg_epoch: 45910 pg[1.43f( v 45909'449615 (45801'446525,45909'449615] local-lis/les=45908/45909 n=30862 ec=874/871 lis/c 45908/45908 les/c/f 45909/45909/0 45910/45910/42988) [354,851] r=0 lpr=45910 pi=[45908,45910)/1 luod=0'0 crt=45909'449615 lcod 45909'449614 mlcod 0'0 active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] -> [354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354, up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting 4611087854031667199
2020-04-18 08:37:53.362 7fffd2e2b700  1 osd.354 pg_epoch: 45910 pg[1.43f( v 45909'449615 (45801'446525,45909'449615] local-lis/les=45908/45909 n=30862 ec=874/871 lis/c 45908/45908 les/c/f 45909/45909/0 45910/45910/42988) [354,851] r=0 lpr=45910 pi=[45908,45910)/1 crt=45909'449615 lcod 45909'449614 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:38:01.002 7fffd2e2b700  1 osd.354 pg_epoch: 45912 pg[1.43f( v 45909'449615 (45801'446525,45909'449615] local-lis/les=45910/45911 n=30862 ec=874/871 lis/c 45910/45908 les/c/f 45911/45909/0 45912/45912/42988) [354,694,851] r=0 lpr=45912 pi=[45908,45912)/1 luod=0'0 crt=45909'449615 lcod 45909'449614 mlcod 0'0 active mbc={}] start_peering_interval up [354,851] -> [354,694,851], acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting 4611087854031667199
2020-04-18 08:38:01.002 7fffd2e2b700  1 osd.354 pg_epoch: 45912 pg[1.43f( v 45909'449615 (45801'446525,45909'449615] local-lis/les=45910/45911 n=30862 ec=874/871 lis/c 45910/45908 les/c/f 45911/45909/0 45912/45912/42988) [354,694,851] r=0 lpr=45912 pi=[45908,45912)/1 crt=45909'449615 lcod 45909'449614 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:40:04.219 7fffd2e2b700  0 log_channel(cluster) log [DBG] : 1.43f deep-scrub starts
2020-04-18 08:56:49.095 7fffd2e2b700  1 osd.354 pg_epoch: 45914 pg[1.43f( v 45913'449735 (45812'446725,45913'449735] local-lis/les=45912/45913 n=30868 ec=874/871 lis/c 45912/45912 les/c/f 45913/45913/0 45914/45914/42988) [354,851] r=0 lpr=45914 pi=[45912,45914)/1 luod=0'0 crt=45913'449735 lcod 45913'449734 mlcod 0'0 active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] -> [354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354, up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting 4611087854031667199
2020-04-18 08:56:49.095 7fffd2e2b700  1 osd.354 pg_epoch: 45914 pg[1.43f( v 45913'449735 (45812'446725,45913'449735] local-lis/les=45912/45913 n=30868 ec=874/871 lis/c 45912/45912 les/c/f 45913/45913/0 45914/45914/42988) [354,851] r=0 lpr=45914 pi=[45912,45914)/1 crt=45913'449735 lcod 45913'449734 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:56:55.627 7fffd2e2b700  1 osd.354 pg_epoch: 45916 pg[1.43f( v 45913'449735 (45812'446725,45913'449735] local-lis/les=45914/45915 n=30868 ec=874/871 lis/c 45914/45912 les/c/f 45915/45913/0 45916/45916/42988) [354,694,851] r=0 lpr=45916 pi=[45912,45916)/1 luod=0'0 crt=45913'449735 lcod 45913'449734 mlcod 0'0 active mbc={}] start_peering_interval up [354,851] -> [354,694,851], acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting 4611087854031667199
2020-04-18 08:56:55.627 7fffd2e2b700  1 osd.354 pg_epoch: 45916 pg[1.43f( v 45913'449735 (45812'446725,45913'449735] local-lis/les=45914/45915 n=30868 ec=874/871 lis/c 45914/45912 les/c/f 45915/45913/0 45916/45916/42988) [354,694,851] r=0 lpr=45916 pi=[45912,45916)/1 crt=45913'449735 lcod 45913'449734 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2020-04-18 08:56:56.867 7fffd2e2b700  0 log_channel(cluster) log [DBG] : 1.43f deep-scrub starts
2020-04-18 09:13:37.419 7fffd2e2b700  1 osd.354 pg_epoch: 45918 pg[1.43f( v 45917'449808 (45812'446725,45917'449808] local-lis/les=45916/45917 n=30876 ec=874/871 lis/c 45916/45916 les/c/f 45917/45917/0 45918/45918/42988) [354,851] r=0 lpr=45918 pi=[45916,45918)/1 luod=0'0 crt=45917'449808 lcod 45917'449807 mlcod 0'0 active+scrubbing+deep mbc={}] start_peering_interval up [354,694,851] -> [354,851], acting [354,694,851] -> [354,851], acting_primary 354 -> 354, up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting 4611087854031667199
2020-04-18 09:13:37.419 7fffd2e2b700  1 osd.354 pg_epoch: 45918 pg[1.43f( v 45917'449808 (45812'446725,45917'449808] local-lis/les=45916/45917 n=30876 ec=874/871 lis/c 45916/45916 les/c/f 45917/45917/0 45918/45918/42988) [354,851] r=0 lpr=45918 pi=[45916,45918)/1 crt=45917'449808 lcod 45917'449807 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
2020-04-18 09:13:44.959 7fffd2e2b700  1 osd.354 pg_epoch: 45920 pg[1.43f( v 45917'449808 (45812'446725,45917'449808] local-lis/les=45918/45919 n=30876 ec=874/871 lis/c 45918/45916 les/c/f 45919/45917/0 45920/45920/42988) [354,694,851] r=0 lpr=45920 pi=[45916,45920)/1 luod=0'0 crt=45917'449808 lcod 45917'449807 mlcod 0'0 active mbc={}] start_peering_interval up [354,851] -> [354,694,851], acting [354,851] -> [354,694,851], acting_primary 354 -> 354, up_primary 354 -> 354, role 0 -> 0, features acting 4611087854031667199 upacting 4611087854031667199
2020-04-18 09:13:44.959 7fffd2e2b700  1 osd.354 pg_epoch: 45920 pg[1.43f( v 45917'449808 (45812'446725,45917'449808] local-lis/les=45918/45919 n=30876 ec=874/871 lis/c 45918/45916 les/c/f 45919/45917/0 45920/45920/42988) [354,694,851] r=0 lpr=45920 pi=[45916,45920)/1 crt=45917'449808 lcod 45917'449807 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
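
(For what it's worth, the entries above were pulled out of the OSD's log with something along these lines; the path is just the usual default location here, not copied verbatim:)

  grep -E '1\.43f.*(deep-scrub|start_peering_interval|transitioning to Primary)' /var/log/ceph/ceph-osd.354.log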

I've tried restarting all 3 OSDs in question, but it didn't help.  I'm also running the upmap balancer on this cluster, in case that matters.
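
In case it's useful, the restarts were done roughly like this (paraphrased, one per host for osd.354, osd.694 and osd.851); the last line is simply how a deep scrub can be re-requested manually for this PG:

  systemctl restart ceph-osd@354
  ceph pg deep-scrub 1.43f

Any ideas?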

Andras
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
