Hello,

We have an unusual scrub failure on one of our PGs. Ordinarily we can trigger a
repair using ceph pg repair, however in this case the command does not cause a
repair operation to be initiated at all.
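
For reference, the sort of thing we have been running looks roughly like this
(the PG id below is a placeholder, not the real one):

  # find the inconsistent PG and ask it to repair
  ceph health detail | grep inconsistent
  ceph pg repair 1.2f3                          # placeholder PG id
  # jewel can also report the specific inconsistent objects in a PG
  rados list-inconsistent-obj 1.2f3 --format=json-pretty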

Looking through the logs, we have found the original cause of the scrub error:
a single file which reports 'missing attr _, missing attr snapset'. However,
when we run find to locate this file, it does not physically exist on any of
the three replicas.
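
The search was along these lines, run on each of the three OSD hosts (FileStore
layout assumed; the PG id and object-name fragment are placeholders, and note
that FileStore escapes some characters in on-disk file names):

  # look for the object's backing file in the PG's FileStore directory
  find /var/lib/ceph/osd/ceph-*/current/1.2f3_head -name '*abc123*' -ls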

The only thing I can think of is that some thread hit a suicide timeout whilst
carrying out the write, after the metadata had been written to leveldb but
before the data could be committed to the FS. When running a rados get against
the file, it returns an IO error (as expected).
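
The rados check was simply the following (the pool and object names are
placeholders):

  # stat and fetch the object directly; both fail with an I/O error
  rados -p rbd stat rbd_data.abc123             # placeholder pool/object
  rados -p rbd get rbd_data.abc123 /tmp/obj.out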

I have repeatedly sent various commands to attempt to get the OSDs in question
to do some maintenance, however they don't seem to want to do anything. I have
tried restarting them and marking the primary as down and out temporarily, all
to no avail. I really don't want to deliberately trigger a large shuffle of
data by removing a disk entirely, as it won't get reintroduced into the cluster
due to the type of disk it is (SMR), and besides, I have no guarantee that
doing so would change anything.
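
Roughly, the sequence tried against the primary was (the OSD id is a
placeholder):

  # restart the primary OSD for the PG, then briefly mark it down and out
  systemctl restart ceph-osd@12                 # placeholder OSD id
  ceph osd down 12
  ceph osd out 12
  # ...wait for the cluster to settle, then bring it back in
  ceph osd in 12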

The question is: how can we get this cleaned up and take the cluster out of
HEALTH_ERR? We are running jewel (10.2.6).

Regards

Stuart Harland
