I don't have much experience in recovering struggling EC pools
unfortunately. It looks like Ceph can't find OSDs for 2 out of the 6
shards. Since you run EC 4+2 the data isn't lost, but I'm not 100% sure
how to make it healthy.
There was a thread a while back that dealt with a similar issue, albeit
possibly with different underlying causes, but maybe there is some helpful
advice in there:
https://www.mail-archive.com/ceph-users@ceph.io/msg06854.html
I suspect your OSDs frequently suiciding has put these PGs into this state.
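
If it helps, I would start by checking exactly what is missing, something
along these lines (PG id taken from your dump below):

  ceph health detail
  ceph pg 28.5b query          # which OSDs the PG has probed / is still waiting for
  ceph pg 28.5b list_unfound   # lists the unfound objects in that PG

That won't fix anything by itself, but it should show whether the missing
shards sit on OSDs that are merely down/out or are truly gone.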

Maybe someone else on the list has some ideas.

On Fri, 1 Oct 2021 at 18:13, Szabo, Istvan (Agoda)
<istvan.sz...@agoda.com> wrote:
>
> Thank you very much Christian. Maybe you have an idea how I can get the
> cluster out of this state? Something is blocking the recovery and the rebalance,
> something is stuck somewhere, and that's why I can't increase the PG count further.
> I don't have the PG autoscaler active or anything, it is just set to warn.
>
> If I set the min size of the pool to 4, will this PG be recovered? Or how
> can I get the cluster out of a health error like this?
>
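> What I had in mind was roughly the following, but I'm not sure whether it
> is safe to do (the pool name is just a placeholder for my EC data pool):
>
>   ceph osd pool set <ec-data-pool> min_size 4
>   # and, only as a last resort for the 4 unfound objects:
>   ceph pg 28.5b mark_unfound_lost delete
>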
> Marking the objects as lost seems risky based on some mailing list experiences
> (even after marking them lost you can still have issues), so I'm curious what
> the proper way is to get the cluster out of this and let it recover:
>
>
>
> Example problematic pg:
>
> dumped pgs_brief
>
> PG_STAT  STATE                                                 UP                 UP_PRIMARY  ACTING                               ACTING_PRIMARY
>
> 28.5b    active+recovery_unfound+undersized+degraded+remapped  [18,33,10,0,48,1]  18          [2147483647,2147483647,29,21,4,47]  29
>
>
>
> Cluster state:
>
>   cluster:
>
>     id:     5a07ec50-4eee-4336-aa11-46ca76edcc24
>
>     health: HEALTH_ERR
>
>             10 OSD(s) experiencing BlueFS spillover
>
>             4/1055070542 objects unfound (0.000%)
>
>             noout flag(s) set
>
>             Possible data damage: 2 pgs recovery_unfound
>
>             Degraded data redundancy: 64150765/6329079237 objects degraded 
> (1.014%), 10 pgs degraded, 26 pgs undersized
>
>             4 pgs not deep-scrubbed in time
>
>
>
>   services:
>
>     mon: 3 daemons, quorum mon-2s01,mon-2s02,mon-2s03 (age 2M)
>
>     mgr: mon-2s01(active, since 2M), standbys: mon-2s03, mon-2s02
>
>     osd: 49 osds: 49 up (since 36m), 49 in (since 4d); 28 remapped pgs
>
>          flags noout
>
>     rgw: 3 daemons active (mon-2s01.rgw0, mon-2s02.rgw0, mon-2s03.rgw0)
>
>
>
>   task status:
>
>
>
>   data:
>
>     pools:   9 pools, 425 pgs
>
>     objects: 1.06G objects, 66 TiB
>
>     usage:   158 TiB used, 465 TiB / 623 TiB avail
>
>     pgs:     64150765/6329079237 objects degraded (1.014%)
>
>              38922319/6329079237 objects misplaced (0.615%)
>
>              4/1055070542 objects unfound (0.000%)
>
>              393 active+clean
>
>              13  active+undersized+remapped+backfill_wait
>
>              8   active+undersized+degraded+remapped+backfill_wait
>
>              3   active+clean+scrubbing
>
>              3   active+undersized+remapped+backfilling
>
>              2   active+recovery_unfound+undersized+degraded+remapped
>
>              2   active+remapped+backfill_wait
>
>              1   active+clean+scrubbing+deep
>
>
>
>   io:
>
>     client:   181 MiB/s rd, 9.4 MiB/s wr, 5.38k op/s rd, 2.42k op/s wr
>
>     recovery: 23 MiB/s, 389 objects/s
>
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---------------------------------------------------
>
> On 2021. Oct 1., at 1:25, Christian Wuerdig <christian.wuer...@gmail.com> 
> wrote:
>
>
> That is, one thing you could do is rate limit PUT requests on your
> haproxy down to a level at which your cluster is stable. At least that
> gives you a chance to finish the PG scaling without OSDs constantly
> dying on you.
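>
> A rough sketch of what I mean, assuming the RGWs sit behind an haproxy
> frontend (untested; the frontend name and the threshold are placeholders
> you would need to tune to what your cluster can sustain):
>
>   frontend rgw_frontend
>       acl is_put method PUT
>       # track PUT requests per client IP and reject them above ~100 req / 10s
>       stick-table type ip size 100k expire 60s store http_req_rate(10s)
>       http-request track-sc0 src if is_put
>       http-request deny deny_status 429 if is_put { sc_http_req_rate(0) gt 100 }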
>
> On Fri, 1 Oct 2021 at 11:56, Christian Wuerdig
> <christian.wuer...@gmail.com> wrote:
>
>
> Ok, so I guess there are several things coming together that end up
> making your life a bit miserable at the moment:
>
> - PG scaling causing increased IO
>
> - Ingesting a large number of objects into RGW causing lots of IOPS
>
> - Usual client traffic
>
> - Your NVMe that's being used for WAL/DB has only half the listed random
> write IOPS of your backing storage SSD, while it should be the other way
> around: the WAL/DB device is supposed to be the faster device.
>   You'd probably be better off replacing your NVMes with something like
> a P20096
>
> - A smaller number of large drives: Ceph traditionally scales better
> with more but smaller OSDs, especially if you plan on hosting
> truckloads of RGW blobs
>
>
> I don't have a good solution. Maybe you can stop the PG scaling until
> the big data load has finished, or arrange a schedule: load data at
> night and pause during the day to continue the PG scaling.
>
> Try to get your hands on a couple of faster NVMe drives and replace
> the WAL/DB drives in one node to see how much of a difference it makes.
>
>
> Also, I wouldn't lower the osd memory target if you can afford the RAM.
> You only have 6 OSDs per server, and with a mem target of 32GB that's
> 192GB RAM, so if you have at least 256GB in your servers then I would
> leave it. It won't help with writes, but it should help with reducing
> read IOPS; you probably don't want to make your existing problems
> even bigger by chucking more read IO onto the system due to smaller
> in-memory buffers.
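>
> If you want to double-check what the OSDs actually ended up with, you can
> query them directly, e.g. (the first command has to run on the node
> hosting that OSD):
>
>   ceph daemon osd.0 config get osd_memory_target
>   ceph config show osd.0 osd_memory_target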
>
>
> On Thu, 30 Sept 2021 at 21:02, Szabo, Istvan (Agoda)
>
> <istvan.sz...@agoda.com> wrote:
>
>
> Hi Christian,
>
>
> Yes, I know very well what spillover is; I've read that GitHub leveled-compaction 
> document multiple times every day over the last couple of days. (Answers to your 
> questions are after the cluster background information.)
>
>
> About the cluster:
>
> - users are continuously doing PUT/HEAD/DELETE operations
>
> - cluster iops: 10-50k read, 5000 write iops
>
> - throughput: 142MiB/s  write and 662 MiB/s read
>
> - Not a containerized deployment, 3 clusters in multisite
>
> - 3x mon/mgr/rgw (5 RGWs on each mon node, 15 altogether behind the haproxy VIP)
>
>
> 7 nodes and in each node the following config:
>
> - 1x 1.92TB nvme for index pool
>
> - 6x 15.3 TB osd SAS SSD (hpe VO015360JWZJN read intensive ssd, SKU 
> P19911-B21 in this document: 
> https://h20195.www2.hpe.com/v2/getpdf.aspx/a00001288enw.pdf)
>
> - 2x 1.92TB nvme  block.db for the 6 ssd (model: HPE KCD6XLUL1T92 SKU: 
> P20131-B21 in this document 
> https://h20195.www2.hpe.com/v2/getpdf.aspx/a00001288enw.pdf)
>
> - osd deployed with dmcrypt
>
> - the data pool is EC 4+2; the other pools are on the SSDs with 3 replicas
>
>
> This is the config file we have on all nodes (the mon nodes additionally have 
> the rgw definitions):
>
> [global]
>
> cluster network = 192.168.199.0/24
>
> fsid = 5a07ec50-4eee-4336-aa11-46ca76edcc24
>
> mon host = 
> [v2:10.118.199.1:3300,v1:10.118.199.1:6789],[v2:10.118.199.2:3300,v1:10.118.199.2:6789],[v2:10.118.199.3:3300,v1:10.118.199.3:6789]
>
> mon initial members = mon-2s01,mon-2s02,mon-2s03
>
> osd pool default crush rule = -1
>
> public network = 10.118.199.0/24
>
> rgw_relaxed_s3_bucket_names = true
>
> rgw_dynamic_resharding = false
>
> rgw_enable_apis = s3, s3website, swift, swift_auth, admin, sts, iam, pubsub, 
> notifications
>
> #rgw_bucket_default_quota_max_objects = 1126400
>
>
> [mon]
>
> mon_allow_pool_delete = true
>
> mon_pg_warn_max_object_skew = 0
>
> mon_osd_nearfull_ratio = 70
>
>
> [osd]
>
> osd_max_backfills = 1
>
> osd_recovery_max_active = 1
>
> osd_recovery_op_priority = 1
>
> osd_memory_target = 31490694621
>
> # due to osd reboots, the configs below have been added to survive the suicide timeout
>
> osd_scrub_during_recovery = true
>
> osd_op_thread_suicide_timeout=3000
>
> osd_op_thread_timeout=120
>
>
> Stability issue that I mean:
>
> - The PG increase from 32 to 128 on the erasure-coded data pool is still in 
> progress and hasn't finished; it is at 103 currently. The degraded objects almost 
> always get stuck when recovery is nearly finished, but then an OSD dies and the 
> recovery process starts again.
>
> - Compaction is happening all the time, so all the NVMe drives are generating 
> iowait continuously because they are 100% utilized (iowait is around 1-3). If I 
> try to compact with ceph tell osd.x compact it never finishes; I can only stop 
> it with Ctrl+C.
>
> - At the beginning, when we didn't have so many spilled-over disks, I didn't 
> mind it; actually I was happy about the spillover because the underlying SSD can 
> take some load off the NVMe. But then the OSDs started to reboot and, I'd say, 
> started to collapse one by one. Whenever I monitored which OSDs were collapsing, 
> it was always the ones that had spilled over. The op thread and suicide timeout 
> settings can keep the OSDs up a bit longer.
>
> - Now ALL RGWs start to die once one specific OSD goes down, and this causes a 
> total outage. There isn't anything about this in the logs, neither in messages 
> nor in the RGW log; it just looks like the connections time out. It is 
> unacceptable from the users' perspective that they need to wait 1.5 hours until 
> my manual compaction finishes and I can start the OSD.
>
>
> Current cluster state ceph -s:
>
> health: HEALTH_ERR
>
>            12 OSD(s) experiencing BlueFS spillover
>
>            4/1055038256 objects unfound (0.000%)
>
>            noout flag(s) set
>
>            Possible data damage: 2 pgs recovery_unfound
>
>            Degraded data redundancy: 12341016/6328900227 objects degraded 
> (0.195%), 16 pgs degraded, 21 pgs undersized
>
>            4 pgs not deep-scrubbed in time
>
>
>  services:
>
>    mon: 3 daemons, quorum mon-2s01,mon-2s02,mon-2s03 (age 2M)
>
>    mgr: mon-2s01(active, since 2M), standbys: mon-2s03, mon-2s02
>
>    osd: 49 osds: 49 up (since 101m), 49 in (since 4d); 23 remapped pgs
>
>         flags noout
>
>    rgw: 15 daemons active (mon-2s01.rgw0, mon-2s01.rgw1, mon-2s01.rgw2, 
> mon-2s01.rgw3, mon-2s01.rgw4, mon-2s02.rgw0, mon-2s02.rgw1, mon-2s02.rgw2, 
> mon-2s02.rgw3, mon-2s02.rgw4, mon-2s03.rgw0, mon-2s03.rgw1, mon-2s03.rgw2, 
> mon-2s03.rgw3, mon-2s03.rgw4)
>
>
>  task status:
>
>
>  data:
>
>    pools:   9 pools, 425 pgs
>
>    objects: 1.06G objects, 67 TiB
>
>    usage:   159 TiB used, 465 TiB / 623 TiB avail
>
>    pgs:     12032346/6328762449 objects degraded (0.190%)
>
>             68127707/6328762449 objects misplaced (1.076%)
>
>             4/1055015441 objects unfound (0.000%)
>
>             397 active+clean
>
>             13  active+undersized+degraded+remapped+backfill_wait
>
>             4   active+undersized+remapped+backfill_wait
>
>             4   active+clean+scrubbing+deep
>
>             2   active+recovery_unfound+undersized+degraded+remapped
>
>             2   active+remapped+backfill_wait
>
>             1   active+clean+scrubbing
>
>             1   active+undersized+remapped+backfilling
>
>             1   active+undersized+degraded+remapped+backfilling
>
>
>  io:
>
>    client:   256 MiB/s rd, 94 MiB/s wr, 17.70k op/s rd, 2.75k op/s wr
>
>    recovery: 16 MiB/s, 223 objects/s
>
>
> Ty
>
>
> -----Original Message-----
>
> From: Christian Wuerdig <christian.wuer...@gmail.com>
>
> Sent: Thursday, September 30, 2021 1:01 PM
>
> To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
>
> Cc: Ceph Users <ceph-users@ceph.io>
>
> Subject: Re: [ceph-users] osd_memory_target=level0 ?
>
>
>
>
> Bluestore memory targets have nothing to do with spillover. It's already been 
> said several times: The spillover warning is simply telling you that instead 
> of writing data to your supposedly fast wal/blockdb device it's now hitting 
> your slow device.
>
>
> You've stated previously that your fast device is nvme and your slow device 
> is SSD. So the spill-over is probably less of a problem than you think. It's 
> currently unclear what your actual problem is and why you think it's to do 
> with spill-over.
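>
> You can check per OSD how much metadata actually sits on the slow device
> via the bluefs counters, e.g. (run on the node hosting that OSD):
>
>   ceph daemon osd.27 perf dump bluefs | grep -E 'db_total|db_used|slow_used'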
>
>
> What model are your NVMEs and SSDs - what IOPS can each sustain (4k random 
> write direct IO), what's their current load? What are the actual problems 
> that you are observing, i.e. what does "stability problems" actually mean?
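>
> For reference, something like this fio run is what I'd use to measure the
> 4k random write IOPS (it writes to the raw device, so only run it against
> an empty or spare drive; the device path is a placeholder):
>
>   fio --name=4k-randwrite --filename=/dev/nvme0n1 --direct=1 --rw=randwrite \
>       --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting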
>
>
> On Thu, 30 Sept 2021 at 18:33, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> 
> wrote:
>
>
> Hi,
>
>
> Still suffering from the spilled-over disks and stability issues in 3 of my 
> clusters after uploading 6-900 million objects to the cluster (Octopus 
> 15.2.10).
>
>
> I've set the memory target to around 31-32GB, so could it be that the spillover 
> issue is coming from that?
>
> So with a mem target of 31GB the next level would be 310, and after that it goes 
> to the underlying SSD disk. So the 4th level doesn't have space on the NVMe.
>
>
> Let's say I set it to the default 4GB; levels 0-3 would then be 444GB, so it
> should fit on the 600GB LVM assigned on the NVMe for DB with WAL.
>
>
> This is how it looks; e.g. osd.27 is still spilled over even after 2 manual
> compactions :(
>
>
> osd.1 spilled over 198 GiB metadata from 'db' device (303 GiB used of 596 GiB) to slow device
>     osd.5 spilled over 251 GiB metadata from 'db' device (163 GiB used of 596 GiB) to slow device
>     osd.8 spilled over 61 GiB metadata from 'db' device (264 GiB used of 596 GiB) to slow device
>     osd.11 spilled over 260 GiB metadata from 'db' device (242 GiB used of 596 GiB) to slow device
>     osd.12 spilled over 149 GiB metadata from 'db' device (238 GiB used of 596 GiB) to slow device
>     osd.15 spilled over 259 GiB metadata from 'db' device (195 GiB used of 596 GiB) to slow device
>     osd.17 spilled over 10 GiB metadata from 'db' device (314 GiB used of 596 GiB) to slow device
>     osd.21 spilled over 324 MiB metadata from 'db' device (346 GiB used of 596 GiB) to slow device
>     osd.27 spilled over 12 GiB metadata from 'db' device (486 GiB used of 596 GiB) to slow device
>     osd.29 spilled over 61 GiB metadata from 'db' device (306 GiB used of 596 GiB) to slow device
>     osd.31 spilled over 59 GiB metadata from 'db' device (308 GiB used of 596 GiB) to slow device
>     osd.46 spilled over 69 GiB metadata from 'db' device (308 GiB used of 596 GiB) to slow device
>
>
> Also, is there a way to speed up compaction? It takes 1-1.5 hours per OSD to 
> compact.
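>
> Would offline compaction with ceph-bluestore-tool be any faster? What I had
> in mind is something like the below, with the OSD stopped first (not sure how
> well it plays with dmcrypt):
>
>   systemctl stop ceph-osd@27
>   ceph-bluestore-tool compact --path /var/lib/ceph/osd/ceph-27
>   systemctl start ceph-osd@27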
>
>
> Thank you
>
> _______________________________________________
>
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
>
> email to ceph-users-le...@ceph.io
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
