On Jun 11, 2014, at 10:02 PM, Gregory Farnum <g...@inktank.com> wrote:

> On Wed, Jun 11, 2014 at 12:54 AM, Guang Yang <yguan...@outlook.com> wrote:
>> On Jun 11, 2014, at 6:33 AM, Gregory Farnum <g...@inktank.com> wrote:
>> 
>>> On Tue, May 20, 2014 at 6:44 PM, Guang Yang <yguan...@outlook.com> wrote:
>>>> Hi ceph-devel,
>>>> Like some users of Ceph, we are using Ceph for a latency-sensitive 
>>>> project, and scrubbing (especially deep-scrubbing) impacts the SLA in a 
>>>> non-trivial way. Since commodity hardware can fail in one way or another, 
>>>> I think it is essential to keep scrubbing enabled to preserve data 
>>>> durability.
>>>> 
>>>> Inspired by how the erasure coding backend implements scrubbing[1], I am 
>>>> wondering whether the following changes would be a valid way to reduce the 
>>>> performance impact of scrubbing:
>>>> 1. Store the CRC checksum along with each physical copy of the object on 
>>>> the filesystem (via xattr or omap?).
>>>> 2. On a read request, check the CRC locally; if it mismatches, redirect 
>>>> the request to a replica and mark the PG as inconsistent.
>>> 
>>> The problem with this is that you need to maintain the CRC across
>>> partial overwrites of the object. And the real cost of scrubbing isn't
>>> in the network traffic, it's in the disk reads, which you would have
>>> to do anyway with this method. :)
>> Thanks Greg for the response!
>> A partial update is the right concern if it happens frequently. However, the 
>> major benefit of this proposal is to move the CRC check to the READ path 
>> instead of doing it from within a background job (we may still need the 
>> background check as deep-scrubbing, but its frequency can be reduced 
>> dramatically). By checking the CRC at read time, inconsistent objects are 
>> detected, the PG is marked inconsistent, and we can then trigger a repair 
>> for the PG.
> 
> Oh, I see.
> Still, partial update is in fact the major concern. We have a debug
> mechanism called "sloppy crc" or similar that keeps track of CRCs for
> full (or sufficiently large?) writes, but it's not something you can
> use on a production cluster because it turns every write into a
> read-modify-write cycle, and that's just prohibitively expensive (in
> addition to issues with stuff like OSD restart, I think). This sort of
> thing would make sense for the erasure-coded pools; maybe that would
> be a better place to start?
Yeah, that sounds like a good starting point, let me see if I can spend some 
time doing a simple POC.
Thanks Greg.
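To make the trade-off concrete, here is a minimal sketch of the idea (hypothetical, not Ceph code; the `Obj` class and its method names are purely illustrative): a whole-object CRC is kept alongside the data and verified on the read path, and the `write` method shows why a partial overwrite forces a read-modify-write to keep that CRC current.

```python
import zlib

class Obj:
    """Toy model: one object with a full-object CRC stored alongside it
    (in Ceph this could live in an xattr or omap entry)."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.crc = zlib.crc32(bytes(self.data))  # stored checksum

    def read(self) -> bytes:
        # Read-time verification: on mismatch, the OSD would mark the PG
        # inconsistent and redirect the read to a replica.
        if zlib.crc32(bytes(self.data)) != self.crc:
            raise IOError("CRC mismatch: mark PG inconsistent, redirect to replica")
        return bytes(self.data)

    def write(self, off: int, buf: bytes):
        # The partial-overwrite problem: recomputing a whole-object CRC
        # needs the unmodified bytes too, so every partial write becomes
        # a read-modify-write cycle (the cost Greg points out above).
        self.data[off:off + len(buf)] = buf
        self.crc = zlib.crc32(bytes(self.data))
```

For erasure-coded pools, where writes are append-only or full-stripe, the `write` path never needs to re-read old bytes, which is why that backend looks like the easier place to start.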
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
