I've inherited a Ceph Octopus cluster that seems to need urgent maintenance 
before data loss starts to happen. I'm the person with the most Ceph 
experience on hand, and that's not saying much; I'm running into most of 
these ops and repair tasks for the first time.

Ceph health output looks like this:

HEALTH_WARN Degraded data redundancy: 3640401/8801868 objects degraded (41.359%),
 128 pgs degraded, 128 pgs undersized; 128 pgs not deep-scrubbed in time;
 128 pgs not scrubbed in time

'ceph -s' output: https://termbin.com/i06u

The CRUSH rule for 'cephfs.media' is here: https://termbin.com/2klmq

So, it seems all PGs for the main pool are in a warning state. The pool is 
erasure-coded, 11 TiB across 4 OSDs, of which around 6.4 TiB is used. The 
Ceph daemons themselves seem happy: they're stable and have quorum. I can 
also access the web dashboard fine. The block devices are of different sizes 
and types (2 large spinners of different sizes, and 2 identical SSDs).
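For reference, these are the commands I've been using to gather the per-OSD 
and per-PG state (the pool name cephfs.media is from my cluster; the profile 
name is a placeholder you'd substitute from the previous command's output):

```shell
# Per-OSD utilization, weights, and position in the CRUSH tree
ceph osd df tree

# EC profile backing the pool; k+m has to fit within the failure domain
ceph osd pool get cephfs.media erasure_code_profile
ceph osd erasure-code-profile get <profile-name>

# Which PGs are stuck undersized, and which OSDs they are mapped to
ceph pg dump_stuck undersized
```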

I would welcome any pointers on the steps to bring this back to full health. 
If the PGs are undersized, can I simply add another block device/OSD? Or 
will adjusting config somewhere get the rebalance to complete? (The 
rebalance jobs have been stuck at 0% for weeks.)
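In case the answer is "yes, just add an OSD", this is roughly what I was 
planning to run, assuming a cephadm-managed cluster; the hostname node1 and 
device /dev/sdX are placeholders for my actual spare device:

```shell
# List the devices cephadm considers available for OSDs
ceph orch device ls

# Create an OSD on the spare device (hostname and device are placeholders)
ceph orch daemon add osd node1:/dev/sdX

# Then watch recovery/backfill progress
ceph -s
ceph osd df tree
```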

Thank you for your time reading this message.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io