Hi,

Every now and then, sectors die on disks.
When this happens on my bluestore (kraken) OSDs, I get 1 PG that becomes 
inconsistent.
The exact status is:

HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 12.127 is active+clean+inconsistent, acting [141,67,85]

If I run # rados list-inconsistent-obj 12.127 --format=json-pretty
I get:
(...)
                    "osd": 112,
                    "errors": [
                        "read_error"
                    ],
                    "size": 4194304

When this happens, I'm forced to manually run "ceph pg repair" on the 
inconsistent PGs, after making sure the cause was a read error. I feel this 
should not be a manual process.
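
To make the manual procedure concrete, here is a rough sketch of what I end up doing by hand, written as a small Python helper. It is only an illustration, not a tested tool: the function names are made up, and the "inconsistents" and "shards" keys are my assumption about the overall JSON layout (only the per-shard "osd", "errors" and "size" fields are visible in the excerpt above). It also ignores any object-level error fields.

```python
import json
import subprocess


def only_read_errors(report):
    """Return True if every shard error in the report is a read_error.

    `report` is the parsed JSON from `rados list-inconsistent-obj`.
    The "inconsistents" and "shards" keys are assumed, not confirmed.
    """
    seen = []
    for obj in report.get("inconsistents", []):
        for shard in obj.get("shards", []):
            seen.extend(shard.get("errors", []))
    # No errors at all means there is nothing we know how to judge.
    return bool(seen) and all(err == "read_error" for err in seen)


def repair_if_read_error(pgid, run=subprocess.check_output):
    """Run `ceph pg repair` on pgid only if all shard errors are read errors.

    `run` is injectable so the logic can be tried without a cluster.
    """
    out = run(["rados", "list-inconsistent-obj", pgid, "--format=json"])
    if only_read_errors(json.loads(out)):
        run(["ceph", "pg", "repair", pgid])
        return True
    return False
```

Calling repair_if_read_error("12.127") would then mirror the two commands above, and it leaves the PG alone whenever anything other than a read error shows up.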

If I go on the machine and look at the syslogs, I do see that a sector read 
error happened once or twice.
But if I try to read the sector manually, I can, presumably because it was 
reallocated on the disk.
Last time this happened, I ran badblocks on the disk and it found no issue...

My questions therefore are:

Why doesn't bluestore retry reading the sector (in case of transient errors)? 
(Maybe it does.)
Why isn't the PG automatically fixed when a read error is detected?
What will happen when the disks get old and reach up to 2048 bad sectors before 
the controllers/SMART declare them as "failure predicted"?
I can't imagine manually fixing up to Nx2048 PGs in an infrastructure of N 
disks where N could reach the sky...

Ideas?

Thanks && regards