On Jul 5, 2013, at 12:34 PM, Loic Dachary <l...@dachary.org> wrote: > Hi Scott, > > That's unfortunate indeed and I now better understand why it is necessary to > add a signature to the chunks. I added the following to: > > https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst > > Bit flips happen. Not often, but it is possible. Here is an article from 2011 > also search for "bit rot" and "bit error rate". To detect corrupted chunks, a > signature (SHA1 for instance) should be added as an attribute of the file > containing the chunk so that deep scrubbing can check that the chunk is valid > by rehashing the content of the chunk and compare it with the signature.
Note, this can be local to the OSD to avoid sending an invalid chunk of the network. > While writing a more detailed version of the API taking your comments and > Sam's comment in account, I realized that I don't know if it is possible to > recover a parity chunk. If a data chunk is missing, it is enough to recover > with a decode operation and write it back ( assuming a systematic code is > used ). Would it be possible to do something similar with parity chunks ? Or > would we need to re-encode all the parity chunks to write just one of them ? > I assume the later but ... I'm not sure ;-) Yes, you can recover a parity chunk just as you would a data chunk. I have not used the jerasure library, so I do not know what it requires. > > Cheers > > On 05/07/2013 17:02, Atchley, Scott wrote: >> On Jul 5, 2013, at 10:06 AM, Loic Dachary <l...@dachary.org> wrote: >> >>> On 05/07/2013 14:13, Atchley, Scott wrote:> Loic, >>>> >>>> Erasure codes take what ever you give them. You need to verify the chunk >>>> before using it. Perhaps storing the checksum in the metadata/context that >>>> describes the parity object? >>> >>> Hi Scott, >>> >>> Does that mean that if I give the chunks A + B + C to decode() and B is >>> corrupted but A and C are ok it will return an incorrectly decoded content ? >> >> Unfortunately, yes. >> >>> I'm curious to know the answer but I don't think it is an actual problem. A >>> corrupted chunk would mean that the underlying file system corrupted the >>> content of one of its file. I can't remember the last time I saw that >>> happen ;-) >> >> Bit flips happen. Not often, but it is possible. Here is an article from >> 2011: >> >> http://www.linux-mag.com/id/8794/ >> >> also search for "bit rot" and "bit error rate". >> >> Scott >> >>> >>> Cheers >>> >>>> >>>> Scott >>>> >>>> On Jul 4, 2013, at 9:24 AM, Loic Dachary <l...@dachary.org> wrote: >>>> >>>>> Hi, >>>>> >>>>> I was thinking about scrubbing of erasure coded chunks and realized I >>>>> don't know the answer to this very simple question : what happens when a >>>>> chunk is corrupted ? I.e. if AB is coded with 2+1 into A + B ( data ) + Z >>>>> (parity ) and Z is replaced with Q. Would reed-solomon ignore/discard the >>>>> corrupted chunk ? If that's the case I think it slightly changes what the >>>>> API should be. >>>>> >>>>> Cheers >>>>> >>>>> On 04/07/2013 05:06, Paul Von-Stamwitz wrote: >>>>>> Scott, et al. >>>>>> >>>>>> Here is an interesting paper from Usenix HotStorage Conference which >>>>>> provides local codes without additional capacity overhead. >>>>>> >>>>>> Check it out. (abstract with links to paper and slides) >>>>>> https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage >>>>>> >>>>>> Cheers, >>>>>> pvs >>>>>> >>>>>>> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote: >>>>>>> >>>>>>> Hi Scott, >>>>>>> >>>>>>> Point taken. >>>>>>> >>>>>>> I was thinking about Loic's decode description where k+m was requested >>>>>>> and >>>>>>> data was decoded when k blocks were received. But he was referring to >>>>>>> full >>>>>>> stripe reads where all the memory is allocated. >>>>>>> >>>>>>> Degraded reads and block repair are a different matter. >>>>>>> >>>>>>> pvs >>>>>>> >>>>>>>> On Jul 3, 2013, at 4:53 AM Scott Atchley wrote: >>>>>>>> >>>>>>>> On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz >>>>>>>> <pvonstamw...@us.fujitsu.com> wrote: >>>>>>>> >>>>>>>>> Scott, >>>>>>>>> >>>>>>>>> You make a good point comparing (5/3) RS with Xorbas, but a small nit: >>>>>>>>> >>>>>>>>> "The I/O to recover from a single failure for both schemes is 5 blocks >>>>>>>> so it is as efficient as Xorbas." >>>>>>>>> >>>>>>>>> Maybe not. You would probably issue I/O to all the remaining 7 blocks >>>>>>> to >>>>>>>> cover for the possibility of double errors. The time to reconstruct >>>>>>> would >>>>>>>> be about the same, but there could be more disk and network I/O. (LRC >>>>>>> will >>>>>>>> need to issue I/O to the rest of the global stripe if it detected >>>>>>> multiple >>>>>>>> local errors.) >>>>>>>> >>>>>>>> Why would you request more than five? If one cannot be read, request >>>>>>>> another. >>>>>>>> >>>>>>>> Also, I am not sure that you want to request five at once since it will >>>>>>>> lead to spikes in network traffic and require memory for all five >>>>>>>> blocks. >>>>>>>> You will need at least two buffers. Request the first two and start the >>>>>>>> decoding. You may want a third buffer to overlap the decoding of the >>>>>>>> current block with the communication for the next block. It may be that >>>>>>>> the decode time is less than the communication and, in that case, you >>>>>>> will >>>>>>>> want to request all of the blocks at once. >>>>>>>> >>>>>>>>> What I like about Xorbas is that it is an extension of a (x,y) RS. You >>>>>>>> can start with traditional RS. If degraded reads and repaired blocks >>>>>>>> are >>>>>>>> causing a problem, you can add the LRC. If capacity is an issue, you >>>>>>>> can >>>>>>>> take it out. >>>>>>>> >>>>>>>> I like it too and Microsoft has something similar with Pyramid codes. >>>>>>> That >>>>>>>> said, my example using traditional RS can provide more fault-tolerance >>>>>>> on >>>>>>>> average given the same amount of storage overhead. >>>>>>>> >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Paul >>>>>>>>> >>>>>>>>> On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote: >>>>>>>>>> I think we should be able to cover most cases by adding an interface >>>>>>>> like: >>>>>>>>>> >>>>>>>>>> set<int> minimum_to_read(const set<int> &want_to_read, const set<int> >>>>>>>>>> &available_chunks); >>>>>>>>>> >>>>>>>>>> which returns the smallest set required to read/rebuild the chunks in >>>>>>>>>> want_to_read given the chunks in available_chunks. Alternately, we >>>>>>>> might >>>>>>>>>> include a "cost" for reading each chunk like >>>>>>>>>> >>>>>>>>>> set<int> minimum_to_read_with_cost(const set<int> &want_to_read, >>>>>>> const >>>>>>>>>> map<int, int> &available) >>>>>>>>>> >>>>>>>>>> which returns the minimum cost set required to read the specified >>>>>>>> chunks >>>>>>>>>> given a mapping of available chunks to costs. The costs might allow >>>>>>> us >>>>>>>> to >>>>>>>>>> consider the difference between reading local chunks vs remote >>>>>>>>>> chunks. >>>>>>>>>> This should be sufficient to cover the read case (esp the degraded >>>>>>> read >>>>>>>>>> case) and the repair case. >>>>>>>>>> -Sam >>>>>>>>>> >>>>>>>>>> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchle...@ornl.gov> >>>>>>>>>> wrote: >>>>>>>>>>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchle...@ornl.gov> >>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <l...@dachary.org> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Project >>>>>>>>>> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/ >>>>>>> for >>>>>>>>>> instance ) would need to be different from the one initialy proposed: >>>>>>>>>>>> >>>>>>>>>>>> An interesting video. Not as entertaining as Jim Plank's video. ;-) >>>>>>>>>>>> >>>>>>>>>>>> While Plank's focused on the processor requirements for >>>>>>>>>> encoding/decoding, this video focuses on the network and disk I/O >>>>>>>>>> requirements. >>>>>>>>>>>> >>>>>>>>>>>>> context(k, m, reed-solomon|...) => context* c >>>>>>>>>>>>> encode(context* c, void* data) => void* chunks[k+m] >>>>>>>>>>>>> decode(context* c, void* chunk[k+m], int* >>>>>>>>>>>>> indices_of_erased_chunks) => void* data // erased chunks are not >>>>>>>> used >>>>>>>>>>>>> repair(context* c, void* chunk[k+m], int* >>>>>>>>>>>>> indices_of_erased_chunks) => void* chunks[k+m] // erased chunks >>>>>>> are >>>>>>>>>>>>> rebuilt >>>>>>>>>>>>> >>>>>>>>>>>>> The decode function must allow for partial read: >>>>>>>>>>>>> >>>>>>>>>>>>> decode(context* c, int offset, int length, void* chunk[k+m], int* >>>>>>>>>>>>> indices_of_erased_chunks, int* missing_chunks) => void* data >>>>>>>>>>>>> >>>>>>>>>>>>> If there are not enough chunks to recover the desired data range >>>>>>>>>> [offset, offset+length) the function returns NULL and sets >>>>>>>> missing_chunks >>>>>>>>>> to the list of chunks that must be retrieved in order to be able to >>>>>>>> read >>>>>>>>>> the desired data. >>>>>>>>>>>>> >>>>>>>>>>>>> If decode is called to read just 1 chunk and it is missing, reed- >>>>>>>>>> solomon would return on error and ask for all other chunks to repair. >>>>>>>> If >>>>>>>>>> the underlying library implements LRC, it would ask for a subset of >>>>>>> the >>>>>>>>>> chunks. >>>>>>>>>>>>> >>>>>>>>>>>>> An implementation allowing only full reads and using jerasure >>>>>>>> ( which >>>>>>>>>> does not do LRC ) requires that offset is always zero, length is the >>>>>>>> size >>>>>>>>>> of the object and returns a copy of indices_of_erased_chunks if there >>>>>>>> are >>>>>>>>>> not enough chunks to rebuild the missing ones. >>>>>>>>>>>>> >>>>>>>>>>>>> Comments are welcome :-) >>>>>>>>>>>> >>>>>>>>>>>> I have loosely followed this discussion and I have not looked >>>>>>> closely >>>>>>>>>> at the proposed API nor at the jerasure interface. My apologies if >>>>>>> this >>>>>>>>>> has already been addressed. >>>>>>>>>>>> >>>>>>>>>>>> It is not clear to me from the above proposed API (ignoring the >>>>>>>> partial >>>>>>>>>> read) what it would do. Was the original intent to encode the entire >>>>>>>> file >>>>>>>>>> using k+m blocks irregardless of the file size and of the rados >>>>>>> object >>>>>>>>>> size? >>>>>>>>>>>> >>>>>>>>>>>> If so, how will you map rados objects to the logical k+m objects >>>>>>> and >>>>>>>>>> vice versa? >>>>>>>>>>>> >>>>>>>>>>>> If not, then the initial API needed an offset and length (either >>>>>>>>>> logical or rados object). >>>>>>>>>>>> >>>>>>>>>>>> I would assume that you would want to operate on rados sized >>>>>>> objects. >>>>>>>>>> Given a fixed k+m, then you may have more than one set of k+m objects >>>>>>>> per >>>>>>>>>> file. This is ignoring the LRC "local" parity blocks. For example, if >>>>>>>> the >>>>>>>>>> rados object size if 1 MB and k = 10 and m = 4 (as in the Xorbas >>>>>>> video), >>>>>>>>>> then for a 20 MB file one would need two sets of encoding blocks. The >>>>>>>>>> first for objects 1-10 and the second for objects 11-20. >>>>>>>>>>>> >>>>>>>>>>>> Perhaps, this is what the context is above. If so, it should have >>>>>>> the >>>>>>>>>> logical offset and rados object size, no? >>>>>>>>>>>> >>>>>>>>>>>> I see value in the Xorbas concept and I wonder if the jerasure >>>>>>>> library >>>>>>>>>> can be modified to generate the local parity blocks such that they >>>>>>> can >>>>>>>> be >>>>>>>>>> used to generate the global parity blocks. That would be a question >>>>>>> for >>>>>>>>>> Jim Plank. >>>>>>>>>>> >>>>>>>>>>> The benefits of the Xorbas concept is reduced network and disk I/O >>>>>>> for >>>>>>>>>> failures while maintaining traditional RS's higher fault-tolerance >>>>>>> than >>>>>>>> 3x >>>>>>>>>> replication while using less space. >>>>>>>>>>> >>>>>>>>>>> You can do almost the same thing with jerasure without modifying it >>>>>>> at >>>>>>>>>> all. Using the values from the Xorbas video, they have k data blocks, >>>>>>> m >>>>>>>>>> global parity blocks, and 2 local parity blocks (generated from k/2 >>>>>>>> data >>>>>>>>>> blocks) for a total of k+m+2 blocks on disk that can tolerate any m >>>>>>>>>> failures. In their example, k = 10 and m = 4. They store 16 blocks >>>>>>> for >>>>>>>>>> each 10 data blocks. >>>>>>>>>>> >>>>>>>>>>> If you use traditional RS encoding via jerasure and used the same >>>>>>>> amount >>>>>>>>>> of storage (16 blocks for each 10 data blocks), you could encode 3 >>>>>>>> parity >>>>>>>>>> blocks for each 5 data blocks. This would consume 16 data blocks for >>>>>>>> each >>>>>>>>>> 10 data blocks and the fault-tolerance would be variable from 3-6 >>>>>>>> failures >>>>>>>>>> depending on how the failures fell between the two groups of 5 blocks >>>>>>>>>> which is higher than the static 4 failures for the Xorbas code. The >>>>>>> I/O >>>>>>>> to >>>>>>>>>> recover from a single failure for both schemes is 5 blocks so it is >>>>>>> as >>>>>>>>>> efficient as Xorbas. On average, it provides more fault-tolerance, >>>>>>> but >>>>>>>> it >>>>>>>>>> can be less (four failures from one group of 5 data + 3 parity >>>>>>> blocks), >>>>>>>>>> but that worst case is the same as 3x replication. >>>>>>>>>>> >>>>>>>>>>> Scott-- >>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph- >>>>>>> devel" >>>>>>>>>>> in the body of a message to majord...@vger.kernel.org More majordomo >>>>>>>>>>> info at http://vger.kernel.org/majordomo-info.html >>>>>>>>>> -- >>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" >>>>>>>> in >>>>>>>>>> the body of a message to majord...@vger.kernel.org More majordomo >>>>>>> info >>>>>>>> at >>>>>>>>>> http://vger.kernel.org/majordomo-info.html >>>>>> >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>>> the body of a message to majord...@vger.kernel.org >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>> >>>>> >>>>> -- >>>>> Loïc Dachary, Artisan Logiciel Libre >>>>> All that is necessary for the triumph of evil is that good people do >>>>> nothing. >>>>> >>>> >>> >>> -- >>> Loïc Dachary, Artisan Logiciel Libre >>> All that is necessary for the triumph of evil is that good people do >>> nothing. >>> >> > > -- > Loïc Dachary, Artisan Logiciel Libre > All that is necessary for the triumph of evil is that good people do nothing. > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html