I looked at the code. The automatic repair should handle getting an EIO while reading an object replica. Contrary to what I said before, it does NOT require removing the object, so it doesn’t matter which copy has bad sectors. It will copy from a good replica to the primary if necessary. By default, a deep-scrub, which would catch this case, is performed weekly. A repair, however, must be initiated by administrative action.
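As a rough illustration of the administrative step mentioned above, here is a minimal Python sketch that builds the `ceph pg deep-scrub` and `ceph pg repair` command lines for a single placement group. The PG id `2.1f` and the helper names are hypothetical, and by default the sketch only prints the commands rather than running them. It is one way this could be scripted, not something taken from the Ceph code.

```python
import subprocess
from typing import List


def deep_scrub_cmd(pgid: str) -> List[str]:
    """Command line that requests a deep-scrub of one placement group."""
    return ["ceph", "pg", "deep-scrub", pgid]


def repair_cmd(pgid: str) -> List[str]:
    """Command line that requests a repair of one placement group."""
    return ["ceph", "pg", "repair", pgid]


def scrub_then_repair(pgid: str, execute: bool = False) -> List[List[str]]:
    """Print (and optionally run) a deep-scrub followed by a repair.

    Both commands only queue the operation on the cluster, so a real
    script would wait for the scrub to finish and inspect the output of
    `ceph health detail` before deciding to repair.
    """
    cmds = [deep_scrub_cmd(pgid), repair_cmd(pgid)]
    for cmd in cmds:
        print(" ".join(cmd))
        if execute:
            # Requires the ceph CLI and admin credentials on this host.
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # "2.1f" is a made-up PG id used purely for illustration.
    scrub_then_repair("2.1f")
```

Leaving `execute=False` makes it easy to review the commands before anything touches the cluster.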
When a checksum comparison shows that replicas differ, we currently don’t have a way to determine which copies are corrupt. This is where manual intervention may be necessary, if the administrator can determine which copies are bad.

David Zafman
Senior Developer
http://www.inktank.com

On Nov 18, 2013, at 1:11 PM, Chris Dunlop <ch...@onthe.net.au> wrote:

> OK, that's good (as far as it goes, being a manual process).
>
> So then, back to what I think was Mihály's original issue:
>
>> pg repair or deep-scrub can not fix this issue. But if I
>> understand correctly, osd has to known it can not retrieve
>> object from osd.0 and need to be replicate an another osd
>> because there is no 3 working replicas now.
>
> Given a bad checksum and/or read error tells ceph that an object
> is corrupt, it would seem to be a natural step to then have ceph
> automatically use another good-checksum copy, and even rewrite
> the corrupt object, either in normal operation or under a scrub
> or repair.
>
> Is there a reason this isn't done, apart from lack of tuits?
>
> Cheers,
>
> Chris
>
>
> On Mon, Nov 18, 2013 at 11:43:46AM -0800, David Zafman wrote:
>>
>> No, you wouldn’t need to re-replicate the whole disk for a single bad
>> sector. The way to deal with that if the object is on the primary is to
>> remove the file manually from the OSD’s filesystem and perform a repair of
>> the PG that holds that object. This will copy the object back from one of
>> the replicas.
>>
>> David
>>
>> On Nov 17, 2013, at 10:46 PM, Chris Dunlop <ch...@onthe.net.au> wrote:
>>
>>> Hi David,
>>>
>>> On Fri, Nov 15, 2013 at 10:00:37AM -0800, David Zafman wrote:
>>>>
>>>> Replication does not occur until the OSD is “out.” This creates a new
>>>> mapping in the cluster of where the PGs should be and thus data begins to
>>>> move and/or create sufficient copies. This scheme lets you control how
>>>> and when you want the replication to occur. If you have plenty of space
>>>> and you aren’t going to replace the drive immediately, just mark the OSD
>>>> “down” AND “out.” If you are going to replace the drive immediately, set
>>>> the “noout” flag. Take the OSD “down” and replace drive. Assuming it is
>>>> mounted in the same place as the bad drive, bring the OSD back up. This
>>>> will replicate exactly the same PGs the bad drive held back to the
>>>> replacement drive. As was stated before don’t forget to “ceph osd unset
>>>> noout”
>>>>
>>>> Keep in mind that in the case of a machine that has a hardware failure and
>>>> takes OSD(s) down there is an automatic timeout which will mark them “out”
>>>> for unattended operation. Unless you are monitoring the cluster 24/7 you
>>>> should have enough disk space available to handle failures.
>>>>
>>>> Related info in:
>>>>
>>>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>>>>
>>>> David Zafman
>>>> Senior Developer
>>>> http://www.inktank.com
>>>
>>>
>>> Are you saying, if a disk suffers from a bad sector in an object
>>> for which it's primary, and for which good data exists on other
>>> replica PGs, there's no way for ceph to recover other than by
>>> (re-)replicating the whole disk?
>>>
>>> I.e., even if the disk is able to remap the bad sector using a
>>> spare, so the disk is ok (albeit missing a sector's worth of
>>> object data), the only way to recover is to basically blow away
>>> all the data on that disk and start again, replicating
>>> everything back to the disk (or to other disks)?
>>>
>>> Cheers,
>>>
>>> Chris.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
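
The two drive-handling procedures quoted earlier in this thread (replace the drive right away under “noout”, or mark the OSD “out” and let the cluster re-replicate) can be written down as ordered checklists. The Python sketch below is only an illustration: the function names and the OSD id are hypothetical, and the steps for stopping and starting the OSD daemon are left as comments because they depend on the init system in use.

```python
from typing import List


def replace_drive_plan(osd_id: int) -> List[str]:
    """Steps for swapping a drive immediately without re-replicating its PGs."""
    return [
        "ceph osd set noout",
        f"# stop the ceph-osd daemon for osd.{osd_id} (init-system specific)",
        "# replace the drive and mount it where the bad drive was mounted",
        f"# start the ceph-osd daemon for osd.{osd_id}",
        "ceph osd unset noout",
    ]


def retire_drive_plan(osd_id: int) -> List[str]:
    """Steps when the drive will not be replaced soon and there is spare space."""
    return [
        f"# stop the ceph-osd daemon for osd.{osd_id} so it is marked down",
        f"ceph osd out {osd_id}",
    ]


if __name__ == "__main__":
    # OSD id 3 is a made-up example.
    for step in replace_drive_plan(3):
        print(step)
```

Keeping `ceph osd unset noout` as the last entry of the first plan mirrors the reminder in the quoted message not to leave the flag set.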