On Jun 11, 2014, at 10:02 PM, Gregory Farnum <g...@inktank.com> wrote:
> On Wed, Jun 11, 2014 at 12:54 AM, Guang Yang <yguan...@outlook.com> wrote:
>> On Jun 11, 2014, at 6:33 AM, Gregory Farnum <g...@inktank.com> wrote:
>>> On Tue, May 20, 2014 at 6:44 PM, Guang Yang <yguan...@outlook.com> wrote:
>>>> Hi ceph-devel,
>>>> Like some users of Ceph, we are using Ceph for a latency-sensitive
>>>> project, and scrubbing (especially deep scrubbing) impacts the SLA in a
>>>> non-trivial way. Since commodity hardware can fail in one way or another,
>>>> I think it is essential to keep scrubbing enabled to preserve data
>>>> durability.
>>>>
>>>> Inspired by how the erasure-coded backend implements scrubbing [1], I am
>>>> wondering whether the following changes would be a valid way to reduce
>>>> the performance impact of scrubbing:
>>>> 1. Store the CRC checksum along with each physical copy of the object on
>>>> the filesystem (via xattr or omap?).
>>>> 2. For a read request, check the CRC locally; if it mismatches, redirect
>>>> the request to a replica and mark the PG as inconsistent.
>>>
>>> The problem with this is that you need to maintain the CRC across
>>> partial overwrites of the object. And the real cost of scrubbing isn't
>>> in the network traffic, it's in the disk reads, which you would have
>>> to do anyway with this method. :)
>> Thanks Greg for the response!
>> Partial updates are the right concern if they happen frequently. However,
>> the major benefit of this proposal is to postpone the CRC check to read
>> time instead of doing it in a background job (although we may still need
>> background checks as deep scrubbing, we can reduce their frequency
>> dramatically). By checking the CRC at read time, inconsistent objects are
>> detected, the PG is marked inconsistent, and we can then trigger a repair
>> for the PG.
>
> Oh, I see.
> Still, partial update is in fact the major concern. We have a debug
> mechanism called "sloppy crc" or similar that keeps track of them for
> full (or sufficiently large?)
> writes, but it's not something you can
> use on a production cluster because it turns every write into a
> read-modify-write cycle, and that's just prohibitively expensive (in
> addition to issues with stuff like OSD restart, I think). This sort of
> thing would make sense for the erasure-coded pools; maybe that would
> be a better place to start?

Yeah, that sounds like a good starting point; let me see if I can spend
some time doing a simple POC. Thanks Greg.

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
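For readers following along, here is a toy sketch (not Ceph code; the class
and function names are invented for illustration) of the read-path idea in
the proposal: each copy carries a CRC computed at write time, analogous to
keeping the checksum in an xattr, and a read verifies it locally, falling
back to a replica and flagging the PG on mismatch.

```python
import zlib

class ToyOSD:
    """Toy stand-in for an OSD: stores object data plus a CRC
    computed at write time (analogous to a checksum in an xattr)."""
    def __init__(self):
        self.objects = {}  # name -> (data, stored_crc)

    def write(self, name, data):
        self.objects[name] = (data, zlib.crc32(data))

    def read(self, name):
        data, stored_crc = self.objects[name]
        ok = (zlib.crc32(data) == stored_crc)
        return data, ok

def read_with_fallback(primary, replica, name):
    """Check the CRC on the primary copy; on mismatch, redirect the
    read to a replica and report the PG as inconsistent."""
    data, ok = primary.read(name)
    if ok:
        return data, False  # (data, pg_inconsistent)
    data, ok = replica.read(name)
    if not ok:
        raise IOError("all copies corrupt")
    return data, True

primary, replica = ToyOSD(), ToyOSD()
primary.write("obj", b"hello")
replica.write("obj", b"hello")
# Simulate silent corruption of the primary copy (stale CRC kept).
primary.objects["obj"] = (b"hellO", primary.objects["obj"][1])
data, inconsistent = read_with_fallback(primary, replica, "obj")
```

In this model the corrupted primary copy fails its local check, the read is
served from the replica, and the caller learns the PG needs repair, which
is the "detect at read time, repair later" flow described above.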
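Greg's objection about partial overwrites can also be sketched in the same
toy model (again, invented names, not the actual sloppy-crc code): a full
write can compute the CRC from the buffer in hand, but a partial overwrite
of an object with a whole-object CRC forces a read-modify-write cycle.

```python
import zlib

def full_write(store, name, data):
    # A full write can compute the CRC directly from the buffer
    # being written; no extra disk read is needed.
    store[name] = (data, zlib.crc32(data))

def partial_write(store, name, offset, chunk):
    # A partial overwrite cannot: to keep a whole-object CRC valid,
    # the OSD must first read back the existing object, splice in
    # the new bytes, and recompute -- a read-modify-write cycle.
    data, _ = store[name]
    new = data[:offset] + chunk + data[offset + len(chunk):]
    store[name] = (new, zlib.crc32(new))
    return len(data)  # bytes that had to be read back

store = {}
full_write(store, "obj", b"abcdefgh")
read_back = partial_write(store, "obj", 2, b"XY")
```

The 2-byte overwrite here forces an 8-byte read-back, and the ratio only
gets worse for large objects, which is why this is prohibitively expensive
on the replicated write path. Erasure-coded pools, whose writes are
append-style rather than overwrites, avoid this problem, hence the
suggestion to start there.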