We have a Pacific cluster that is overfull and is having major trouble 
recovering.  We are desperate for help in improving recovery speed.  We have 
already modified all of the various recovery throttling parameters.

The full_ratio is 0.95, but several OSDs continue to grow and are 
approaching 100% utilization.  They have been reweighted to almost 0, yet 
they continue to grow.
Why is this happening?  I thought the cluster would stop writing to an OSD 
once it was above the full ratio.
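
For anyone wanting to see which OSDs are past the ratio, here is a minimal 
Python sketch.  It assumes the output of "ceph osd df --format json" has a 
"nodes" list whose entries carry "name", "utilization" (in percent) and 
"reweight"; those field names are an assumption, and the sample data below is 
made up for illustration.

```python
import json

# Made-up sample in the shape we ASSUME `ceph osd df --format json` returns;
# the field names ("nodes", "name", "utilization", "reweight") are assumptions.
SAMPLE = """
{"nodes": [
  {"id": 0, "name": "osd.0", "utilization": 96.4, "reweight": 0.05},
  {"id": 1, "name": "osd.1", "utilization": 71.2, "reweight": 1.0},
  {"id": 2, "name": "osd.2", "utilization": 97.9, "reweight": 0.02}
]}
"""

def osds_over(df, ratio):
    """Return (name, utilization %, reweight) for OSDs above `ratio`, fullest first."""
    hits = [(n["name"], n["utilization"], n["reweight"])
            for n in df["nodes"] if n["utilization"] / 100.0 > ratio]
    return sorted(hits, key=lambda h: h[1], reverse=True)

full = osds_over(json.loads(SAMPLE), 0.95)
for name, util, reweight in full:
    print(f"{name}: {util:.1f}% used, reweight {reweight}")
```

On a real cluster the JSON would come from the command itself rather than 
from SAMPLE.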


We have added capacity to the cluster, but the new OSDs are being filled 
very slowly.  The primary pool in the cluster is the RGW data pool, which is 
a 12+4 EC pool using "host" placement rules across 18 hosts.  Two new hosts 
with 20x10TB OSDs each were recently added, but they are filling up only very 
slowly.  I don't see how to force recovery on that particular pool.  From 
what I understand, we cannot modify the EC parameters without destroying the 
pool, and we cannot offload that pool to any other because there is no other 
place to store that amount of data.
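
To put numbers on the layout described above, here is a rough arithmetic 
sketch in Python.  It assumes the 18 hosts already include the 2 new ones 
(change "hosts" if they do not), that PGs are spread evenly, and it ignores 
per-OSD overhead and the other pools.

```python
# Rough arithmetic for a k+m EC pool with a "host" failure domain.
# Assumption: the 18 hosts already include the 2 new ones.
k, m = 12, 4
hosts = 18
new_hosts = 2
osds_per_new_host = 20
osd_size_tb = 10

shards_per_pg = k + m                # 16 shards, each on a different host
spare_hosts = hosts - shards_per_pg  # hosts that a given PG does not touch
host_share = shards_per_pg / hosts   # fraction of PGs with a shard on a given host
new_raw_tb = new_hosts * osds_per_new_host * osd_size_tb
new_usable_tb = new_raw_tb * k / shards_per_pg  # raw capacity minus parity overhead

print(shards_per_pg, spare_hosts, round(host_share, 3), new_raw_tb, new_usable_tb)
# prints: 16 2 0.889 400 300.0
```

With a host failure domain, each PG can place at most one shard on each new 
host, so the new OSDs fill only as individual shards are backfilled onto them.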


We have been running "ceph osd reweight-by-utilization" periodically.  It 
works for a while (a few hours), but then recovery and backfill IO drop to 
negligible values.

The balancer module will not run because the misplaced ratio is currently 
about 97%.

Would it be more effective to use osdmaptool to generate a batch of upmap 
commands and move data around manually, or to keep trying to get 
reweight-by-utilization to work?
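
To illustrate what "a bunch of upmap commands" would look like, here is a 
small Python sketch that turns a list of moves into 
"ceph osd pg-upmap-items" lines.  The PG ids and OSD ids are hypothetical, 
and choosing the moves is the hard part that osdmaptool (or an operator) 
would have to do; this only shows the shape of the resulting commands.

```python
# Hypothetical moves: (pgid, from_osd, to_osd). The PG and OSD ids are made up;
# in practice a tool or an operator would choose them.
moves = [("11.2f", 37, 402), ("11.2f", 91, 417), ("11.a0", 37, 409)]

def upmap_commands(moves):
    """Group moves by PG and emit one `ceph osd pg-upmap-items` line per PG."""
    by_pg = {}
    for pgid, src, dst in moves:
        by_pg.setdefault(pgid, []).extend([src, dst])
    return ["ceph osd pg-upmap-items {} {}".format(pgid, " ".join(map(str, pairs)))
            for pgid, pairs in by_pg.items()]

cmds = upmap_commands(moves)
print("\n".join(cmds))
```

Each line remaps the listed shards of one PG from the source OSD to the 
target OSD, so several moves for the same PG end up in a single command.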

Any suggestions?  Deleting data is not an option at this point because the 
pools are not accessible, and we have already added more storage, which for 
some reason is not yet being utilized very heavily.



