Re: Erasure coding library API

Atchley, Scott Fri, 05 Jul 2013 09:56:26 -0700

On Jul 5, 2013, at 12:34 PM, Loic Dachary <l...@dachary.org> wrote:

> Hi Scott,
>
> That's unfortunate indeed and I now better understand why it is necessary to 
> add a signature to the chunks. I added the following to:
>
> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>
> Bit flips happen. Not often, but it is possible. Here is an article from 2011 
> also search for "bit rot" and "bit error rate". To detect corrupted chunks, a 
> signature (SHA1 for instance) should be added as an attribute of the file 
> containing the chunk so that deep scrubbing can check that the chunk is valid 
> by rehashing the content of the chunk and compare it with the signature.


Note, this can be local to the OSD to avoid sending an invalid chunk of the 
network.

> While writing a more detailed version of the API taking your comments and 
> Sam's comment in account, I realized that I don't know if it is possible to 
> recover a parity chunk. If a data chunk is missing, it is enough to recover 
> with a decode operation and write it back ( assuming a systematic code is 
> used ). Would it be possible to do something similar with parity chunks ? Or 
> would we need to re-encode all the parity chunks to write just one of them ? 
> I assume the later but ... I'm not sure ;-)

Yes, you can recover a parity chunk just as you would a data chunk. I have not 
used the jerasure library, so I do not know what it requires.

>
> Cheers
>
> On 05/07/2013 17:02, Atchley, Scott wrote:
>> On Jul 5, 2013, at 10:06 AM, Loic Dachary <l...@dachary.org> wrote:
>>
>>> On 05/07/2013 14:13, Atchley, Scott wrote:> Loic,
>>>>
>>>> Erasure codes take what ever you give them. You need to verify the chunk 
>>>> before using it. Perhaps storing the checksum in the metadata/context that 
>>>> describes the parity object?
>>>
>>> Hi Scott,
>>>
>>> Does that mean that if I give the chunks A + B + C to decode() and B is 
>>> corrupted but A and C are ok it will return an incorrectly decoded content ?
>>
>> Unfortunately, yes.
>>
>>> I'm curious to know the answer but I don't think it is an actual problem. A 
>>> corrupted chunk would mean that the underlying file system corrupted the 
>>> content of one of its file. I can't remember the last time I saw that 
>>> happen ;-)
>>
>> Bit flips happen. Not often, but it is possible. Here is an article from 
>> 2011:
>>
>> http://www.linux-mag.com/id/8794/
>>
>> also search for "bit rot" and "bit error rate".
>>
>> Scott
>>
>>>
>>> Cheers
>>>
>>>>
>>>> Scott
>>>>
>>>> On Jul 4, 2013, at 9:24 AM, Loic Dachary <l...@dachary.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I was thinking about scrubbing of erasure coded chunks and realized I 
>>>>> don't know the answer to this very simple question : what happens when a 
>>>>> chunk is corrupted ? I.e. if AB is coded with 2+1 into A + B ( data ) + Z 
>>>>> (parity ) and Z is replaced with Q. Would reed-solomon ignore/discard the 
>>>>> corrupted chunk ? If that's the case I think it slightly changes what the 
>>>>> API should be.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 04/07/2013 05:06, Paul Von-Stamwitz wrote:
>>>>>> Scott, et al.
>>>>>>
>>>>>> Here is an interesting paper from Usenix HotStorage Conference which 
>>>>>> provides local codes without additional capacity overhead.
>>>>>>
>>>>>> Check it out. (abstract with links to paper and slides)
>>>>>> https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage
>>>>>>
>>>>>> Cheers,
>>>>>> pvs
>>>>>>
>>>>>>> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote:
>>>>>>>
>>>>>>> Hi Scott,
>>>>>>>
>>>>>>> Point taken.
>>>>>>>
>>>>>>> I was thinking about Loic's decode description where k+m was requested 
>>>>>>> and
>>>>>>> data was decoded when k blocks were received. But he was referring to 
>>>>>>> full
>>>>>>> stripe reads where all the memory is allocated.
>>>>>>>
>>>>>>> Degraded reads and block repair are a different matter.
>>>>>>>
>>>>>>> pvs
>>>>>>>
>>>>>>>> On Jul 3, 2013, at 4:53 AM Scott Atchley wrote:
>>>>>>>>
>>>>>>>> On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz
>>>>>>>> <pvonstamw...@us.fujitsu.com> wrote:
>>>>>>>>
>>>>>>>>> Scott,
>>>>>>>>>
>>>>>>>>> You make a good point comparing (5/3) RS with Xorbas, but a small nit:
>>>>>>>>>
>>>>>>>>> "The I/O to recover from a single failure for both schemes is 5 blocks
>>>>>>>> so it is as efficient as Xorbas."
>>>>>>>>>
>>>>>>>>> Maybe not. You would probably issue I/O to all the remaining 7 blocks
>>>>>>> to
>>>>>>>> cover for the possibility of double errors. The time to reconstruct
>>>>>>> would
>>>>>>>> be about the same, but there could be more disk and network I/O. (LRC
>>>>>>> will
>>>>>>>> need to issue I/O to the rest of the global stripe if it detected
>>>>>>> multiple
>>>>>>>> local errors.)
>>>>>>>>
>>>>>>>> Why would you request more than five? If one cannot be read, request
>>>>>>>> another.
>>>>>>>>
>>>>>>>> Also, I am not sure that you want to request five at once since it will
>>>>>>>> lead to spikes in network traffic and require memory for all five 
>>>>>>>> blocks.
>>>>>>>> You will need at least two buffers. Request the first two and start the
>>>>>>>> decoding. You may want a third buffer to overlap the decoding of the
>>>>>>>> current block with the communication for the next block. It may be that
>>>>>>>> the decode time is less than the communication and, in that case, you
>>>>>>> will
>>>>>>>> want to request all of the blocks at once.
>>>>>>>>
>>>>>>>>> What I like about Xorbas is that it is an extension of a (x,y) RS. You
>>>>>>>> can start with traditional RS. If degraded reads and repaired blocks 
>>>>>>>> are
>>>>>>>> causing a problem, you can add the LRC. If capacity is an issue, you 
>>>>>>>> can
>>>>>>>> take it out.
>>>>>>>>
>>>>>>>> I like it too and Microsoft has something similar with Pyramid codes.
>>>>>>> That
>>>>>>>> said, my example using traditional RS can provide more fault-tolerance
>>>>>>> on
>>>>>>>> average given the same amount of storage overhead.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>> On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote:
>>>>>>>>>> I think we should be able to cover most cases by adding an interface
>>>>>>>> like:
>>>>>>>>>>
>>>>>>>>>> set<int> minimum_to_read(const set<int> &want_to_read, const set<int>
>>>>>>>>>> &available_chunks);
>>>>>>>>>>
>>>>>>>>>> which returns the smallest set required to read/rebuild the chunks in
>>>>>>>>>> want_to_read given the chunks in available_chunks.  Alternately, we
>>>>>>>> might
>>>>>>>>>> include a "cost" for reading each chunk like
>>>>>>>>>>
>>>>>>>>>> set<int> minimum_to_read_with_cost(const set<int> &want_to_read,
>>>>>>> const
>>>>>>>>>> map<int, int> &available)
>>>>>>>>>>
>>>>>>>>>> which returns the minimum cost set required to read the specified
>>>>>>>> chunks
>>>>>>>>>> given a mapping of available chunks to costs.  The costs might allow
>>>>>>> us
>>>>>>>> to
>>>>>>>>>> consider the difference between reading local chunks vs remote 
>>>>>>>>>> chunks.
>>>>>>>>>> This should be sufficient to cover the read case (esp the degraded
>>>>>>> read
>>>>>>>>>> case) and the repair case.
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchle...@ornl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchle...@ornl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <l...@dachary.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Project
>>>>>>>>>> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/
>>>>>>> for
>>>>>>>>>> instance ) would need to be different from the one initialy proposed:
>>>>>>>>>>>>
>>>>>>>>>>>> An interesting video. Not as entertaining as Jim Plank's video. ;-)
>>>>>>>>>>>>
>>>>>>>>>>>> While Plank's focused on the processor requirements for
>>>>>>>>>> encoding/decoding, this video focuses on the network and disk I/O
>>>>>>>>>> requirements.
>>>>>>>>>>>>
>>>>>>>>>>>>> context(k, m, reed-solomon|...) => context* c
>>>>>>>>>>>>> encode(context* c, void* data) => void* chunks[k+m]
>>>>>>>>>>>>> decode(context* c, void* chunk[k+m], int*
>>>>>>>>>>>>> indices_of_erased_chunks) => void* data // erased chunks are not
>>>>>>>> used
>>>>>>>>>>>>> repair(context* c, void* chunk[k+m], int*
>>>>>>>>>>>>> indices_of_erased_chunks) => void* chunks[k+m] // erased chunks
>>>>>>> are
>>>>>>>>>>>>> rebuilt
>>>>>>>>>>>>>
>>>>>>>>>>>>> The decode function must allow for partial read:
>>>>>>>>>>>>>
>>>>>>>>>>>>> decode(context* c, int offset, int length, void* chunk[k+m], int*
>>>>>>>>>>>>> indices_of_erased_chunks, int* missing_chunks) => void* data
>>>>>>>>>>>>>
>>>>>>>>>>>>> If there are not enough chunks to recover the desired data range
>>>>>>>>>> [offset, offset+length) the function returns NULL and sets
>>>>>>>> missing_chunks
>>>>>>>>>> to the list of chunks that must be retrieved in order to be able to
>>>>>>>> read
>>>>>>>>>> the desired data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If decode is called to read just 1 chunk and it is missing, reed-
>>>>>>>>>> solomon would return on error and ask for all other chunks to repair.
>>>>>>>> If
>>>>>>>>>> the underlying library implements LRC, it would ask for a subset of
>>>>>>> the
>>>>>>>>>> chunks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> An implementation allowing only full reads and using jerasure
>>>>>>>> ( which
>>>>>>>>>> does not do LRC ) requires that offset is always zero, length is the
>>>>>>>> size
>>>>>>>>>> of the object and returns a copy of indices_of_erased_chunks if there
>>>>>>>> are
>>>>>>>>>> not enough chunks to rebuild the missing ones.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Comments are welcome :-)
>>>>>>>>>>>>
>>>>>>>>>>>> I have loosely followed this discussion and I have not looked
>>>>>>> closely
>>>>>>>>>> at the proposed API nor at the jerasure interface. My apologies if
>>>>>>> this
>>>>>>>>>> has already been addressed.
>>>>>>>>>>>>
>>>>>>>>>>>> It is not clear to me from the above proposed API (ignoring the
>>>>>>>> partial
>>>>>>>>>> read) what it would do. Was the original intent to encode the entire
>>>>>>>> file
>>>>>>>>>> using k+m blocks irregardless of the file size and of the rados
>>>>>>> object
>>>>>>>>>> size?
>>>>>>>>>>>>
>>>>>>>>>>>> If so, how will you map rados objects to the logical k+m objects
>>>>>>> and
>>>>>>>>>> vice versa?
>>>>>>>>>>>>
>>>>>>>>>>>> If not, then the initial API needed an offset and length (either
>>>>>>>>>> logical or rados object).
>>>>>>>>>>>>
>>>>>>>>>>>> I would assume that you would want to operate on rados sized
>>>>>>> objects.
>>>>>>>>>> Given a fixed k+m, then you may have more than one set of k+m objects
>>>>>>>> per
>>>>>>>>>> file. This is ignoring the LRC "local" parity blocks. For example, if
>>>>>>>> the
>>>>>>>>>> rados object size if 1 MB and k = 10 and m = 4 (as in the Xorbas
>>>>>>> video),
>>>>>>>>>> then for a 20 MB file one would need two sets of encoding blocks. The
>>>>>>>>>> first for objects 1-10 and the second for objects 11-20.
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps, this is what the context is above. If so, it should have
>>>>>>> the
>>>>>>>>>> logical offset and rados object size, no?
>>>>>>>>>>>>
>>>>>>>>>>>> I see value in the Xorbas concept and I wonder if the jerasure
>>>>>>>> library
>>>>>>>>>> can be modified to generate the local parity blocks such that they
>>>>>>> can
>>>>>>>> be
>>>>>>>>>> used to generate the global parity blocks. That would be a question
>>>>>>> for
>>>>>>>>>> Jim Plank.
>>>>>>>>>>>
>>>>>>>>>>> The benefits of the Xorbas concept is reduced network and disk I/O
>>>>>>> for
>>>>>>>>>> failures while maintaining traditional RS's higher fault-tolerance
>>>>>>> than
>>>>>>>> 3x
>>>>>>>>>> replication while using less space.
>>>>>>>>>>>
>>>>>>>>>>> You can do almost the same thing with jerasure without modifying it
>>>>>>> at
>>>>>>>>>> all. Using the values from the Xorbas video, they have k data blocks,
>>>>>>> m
>>>>>>>>>> global parity blocks, and 2 local parity blocks (generated from k/2
>>>>>>>> data
>>>>>>>>>> blocks) for a total of k+m+2 blocks on disk that can tolerate any m
>>>>>>>>>> failures. In their example, k = 10 and m = 4. They store 16 blocks
>>>>>>> for
>>>>>>>>>> each 10 data blocks.
>>>>>>>>>>>
>>>>>>>>>>> If you use traditional RS encoding via jerasure and used the same
>>>>>>>> amount
>>>>>>>>>> of storage (16 blocks for each 10 data blocks), you could encode 3
>>>>>>>> parity
>>>>>>>>>> blocks for each 5 data blocks. This would consume 16 data blocks for
>>>>>>>> each
>>>>>>>>>> 10 data blocks and the fault-tolerance would be variable from 3-6
>>>>>>>> failures
>>>>>>>>>> depending on how the failures fell between the two groups of 5 blocks
>>>>>>>>>> which is higher than the static 4 failures for the Xorbas code. The
>>>>>>> I/O
>>>>>>>> to
>>>>>>>>>> recover from a single failure for both schemes is 5 blocks so it is
>>>>>>> as
>>>>>>>>>> efficient as Xorbas. On average, it provides more fault-tolerance,
>>>>>>> but
>>>>>>>> it
>>>>>>>>>> can be less (four failures from one group of 5 data + 3 parity
>>>>>>> blocks),
>>>>>>>>>> but that worst case is the same as 3x replication.
>>>>>>>>>>>
>>>>>>>>>>> Scott--
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
>>>>>>> devel"
>>>>>>>>>>> in the body of a message to majord...@vger.kernel.org More majordomo
>>>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>> in
>>>>>>>>>> the body of a message to majord...@vger.kernel.org More majordomo
>>>>>>> info
>>>>>>>> at
>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majord...@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>> All that is necessary for the triumph of evil is that good people do 
>>>>> nothing.
>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do 
>>> nothing.
>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Erasure coding library API

Reply via email to