On 06/15/2013 03:12 AM, Paul Von-Stamwitz wrote:
> Hi Loic and Martin,
>
> This is a great discussion and I agree that the performance ramifications of
> erasure coding need to be thought out very carefully. Since chunks are not
> only encoded, but distributed across the cluster, we need to pay attention to
> the network overhead as well as the arithmetic involved in the
> encoding/decoding.
>
> If I understand the proposal correctly, objects begin their life's journey
> replicated as normal. As an object grows cold, it gets transformed into an
> encoded PG in another pool. Subsequent reads will be redirected (ehh).
> Subsequent writes will first decode the original object and re-replicate it
> (ouch!). Client writes are never encoded on the fly; they are always
> replicated (nice).
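[Editor's note: the lifecycle Paul describes can be sketched as a tiny state machine. This is an illustration only, with invented names; it is not Ceph code.]

```python
# Illustrative state machine (invented names, not Ceph internals) for the
# object lifecycle described above: objects start replicated, are encoded
# when cold, and must be decoded back before accepting a new write.

REPLICATED, ENCODED = "replicated", "encoded"

class ObjectState:
    def __init__(self):
        self.state = REPLICATED

    def background_encode(self):
        """Low-priority transition, e.g. during weekly deep scrubs."""
        if self.state == REPLICATED:
            self.state = ENCODED

    def read(self):
        # Reads on encoded objects are redirected; no state change.
        return self.state

    def write(self):
        """A write to an encoded object first decodes and re-replicates
        it (the 'ouch!' above); client writes always land on a
        replicated object."""
        if self.state == ENCODED:
            self.state = REPLICATED

obj = ObjectState()
obj.background_encode()
assert obj.state == ENCODED
obj.write()
assert obj.state == REPLICATED
```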
Hi Paul,

The first step is to have an encoded PG: it could be useful on its own.

> So...
> encode() is run as a low-priority background process, probably once a week
> during deep scrubs.
> decode() should be rare (if not, the object shouldn't have been encoded in
> the first place). If the cluster is healthy, no arithmetic is needed, just
> concatenation, but a lot of network activity.
> repair() operations will be the most prevalent and may occur at any time
> during normal self-healing/rebalancing operations.
>
> Therefore, in my opinion, the algorithm we choose must be optimized for
> repairing damaged and missing chunks. The main problem I have with
> Reed-Solomon is that it uses MDS codes, which maximize network activity for
> recalculations. Pyramid codes have the same write (encode) overhead, but
> better read (repair) overhead.

Which library do you typically use? I would love to hear about the use cases
you worked on.

Cheers

> Loic, I know nothing about Mojette Transforms. From what little I gleaned, it
> might be good for repair (needing only a subset of chunks within a range to
> recalculate a missing chunk) but I'm worried about the storage efficiency.
> RozoFS claims 1.5x. I'd like to do better than that.
>
> All the best,
> Paul
>
> On 06/14/2013 3:57 PM, Loic Dachary wrote:
>> Hi Martin,
>>
>> Your explanations are very helpful to better understand the tradeoffs of
>> the existing implementations. To be honest, I was looking forward to your
>> intervention. Not you specifically, of course :-) But someone with a good
>> theoretical background to be a judge of what's best in the context of Ceph.
>> If you say it's the upcoming library to be released in August 2013, I'll
>> take your word for it.
>>
>> The work currently being done within Ceph is to rearchitect the storage
>> backend (namely placement groups) to make room for distributed parity.
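[Editor's note: the repair-cost tradeoff Paul raises can be made concrete with a toy example. An MDS code with k data chunks must read k chunks to rebuild one loss; a Pyramid-style local parity group only reads its own group. The sketch below uses plain XOR local parities and invented names; it is not any library's actual API.]

```python
# Toy illustration of local-group repair: with four data chunks split into
# two XOR local-parity groups, rebuilding one lost chunk reads only 2 chunks,
# where an MDS code with k=4 would have to read 4.

def xor_chunks(chunks):
    """XOR a list of equal-length byte chunks together."""
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

# Four data chunks split into two local groups of two.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
groups = [[0, 1], [2, 3]]
local_parity = [xor_chunks([data[i] for i in g]) for g in groups]

def repair_local(lost, data, groups, local_parity):
    """Rebuild a lost data chunk by XOR-ing its local parity with the
    surviving members of the same group; return (chunk, chunks_read)."""
    for g, p in zip(groups, local_parity):
        if lost in g:
            survivors = [data[i] for i in g if i != lost]
            return xor_chunks(survivors + [p]), len(survivors) + 1
    raise ValueError("chunk not in any group")

rebuilt, reads = repair_local(2, data, groups, local_parity)
assert rebuilt == b"CCCC"
assert reads == 2   # vs. 4 reads for a k=4 MDS code
```

The price of the cheaper repair is extra parity: the local parities cost storage that a pure MDS code would spend on tolerating more failures, which is exactly the encode-vs-repair tradeoff under discussion.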
>> My initial idea was to isolate the low level library under an API that
>> takes a region (16KB for instance, as in gf_unit.c found in
>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar )
>> as input and outputs chunks that can then be written on different hosts.
>> For instance:
>>
>> encode(char* region, char** chunks) => encode the region into N chunks
>> decode(char** chunks, char* region) => decode the N chunks into a region
>> repair(char** chunks, int damaged) => repair the damaged chunk
>>
>> Do you think it is a sensible approach? And if you do, will I find
>> examples of such higher level functions in
>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar ?
>> Or elsewhere?
>>
>> I'm a little confused about the relation between GF-Complete (as found at
>> http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703/gf_complete_0.1.tar )
>> which is very recent (2013) and Jerasure (as found at
>> http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627/Jerasure-1.2.tar )
>> which is comparatively older (2008). Do you know how Jerasure 2.0 relates
>> to GF-Complete?
>>
>> For completeness, here is a thread with pointers to the Mojette Transform
>> that's being used as part of RozoFS:
>>
>> http://www.mail-archive.com/[email protected]/msg14666.html
>>
>> I'm not able to compare it with the other libraries because it seems to
>> take a completely different approach. Do you have an opinion about it?
>>
>> As Patrick mentioned, I'll be at http://www.oscon.com/oscon2013 next month
>> but I'd love to understand more about this as soon as possible :-)
>>
>> Cheers
>>
>> P.S.
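[Editor's note: Loic's three-function API can be sketched end to end with the simplest possible code, striping plus a single XOR parity chunk, standing in for a real Reed-Solomon backend such as Jerasure. Python is used for brevity; all names and the K parameter are illustrative, not Ceph's actual interface.]

```python
# Minimal sketch of the proposed encode/decode/repair API, using striping
# plus one XOR parity chunk (a (k, m=1) code). A stand-in for a real
# Reed-Solomon backend; illustrative only.

K = 4  # number of data chunks

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(region):
    """Split a region into K data chunks and append one parity chunk."""
    assert len(region) % K == 0, "region must be K-aligned"
    size = len(region) // K
    chunks = [region[i * size:(i + 1) * size] for i in range(K)]
    parity = bytes(size)
    for c in chunks:
        parity = _xor(parity, c)
    return chunks + [parity]

def decode(chunks):
    """Concatenate the K data chunks back into the region (the healthy
    path: no arithmetic, just concatenation, as Paul notes)."""
    return b"".join(chunks[:K])

def repair(chunks, damaged):
    """Rebuild the damaged chunk by XOR-ing all the other chunks."""
    size = len(chunks[(damaged + 1) % (K + 1)])
    out = bytes(size)
    for i, c in enumerate(chunks):
        if i != damaged:
            out = _xor(out, c)
    chunks[damaged] = out
    return chunks

region = b"0123456789abcdef"
chunks = encode(region)
chunks[2] = b"\x00" * 4          # simulate a lost/damaged chunk
repair(chunks, 2)
assert decode(chunks) == region
```

A single XOR parity only survives one loss; swapping in Reed-Solomon over GF(2^w) generalizes the same three entry points to m parity chunks without changing the API shape.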
>> Updated
>> http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend#Erasure_Encoded_Storage
>> with a link to http://web.eecs.utk.edu/~plank/plank/www/software.html
>> for the record.
>>
>> On 06/14/2013 10:13 PM, Martin Flyvbjerg wrote:
>>> Dear Community
>>>
>>> I am a young engineer (not software or math, please bear with me) with
>>> some suggestions regarding erasure codes. I have never used Ceph before,
>>> or any other distributed file system.
>>>
>>> I stumbled upon the suggestion for adding erasure codes to Ceph, as
>>> described in this article:
>>> http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
>>>
>>> First, I would like to say: great initiative to add erasure codes to
>>> Ceph. Ceph needs its own implementation and it has to be done right, I
>>> cannot stress this enough; the software suggested in that article would
>>> result in very low performance.
>>>
>>> Why?
>>> Reed-Solomon is normally regarded as very slow compared to other erasure
>>> codes, because the underlying Galois field multiplication is slow.
>>> Please see the video at usenix.org for an explanation.
>>>
>>> The Zfec library and the other suggested software rely on the
>>> Vandermonde matrix, a matrix used in Reed-Solomon erasure codes; a
>>> faster approach would be to use the Cauchy Reed-Solomon implementation.
>>> Please see [1,2,3].
>>> Although there is something even better: by using the Intel SSE2/3 SIMD
>>> instructions it is possible to make Reed-Solomon as fast as any other
>>> XOR-based erasure code (RaptorQ, LT codes, LDPC, etc.).
>>>
>>> The suggested FECpp lib uses this optimisation but with a relatively
>>> small Galois field of only 2^8; since Ceph aims at unlimited
>>> scalability, increasing the size of the Galois field would improve
>>> performance [4]. Of course the configured Ceph object size and/or
>>> stripe width have to be taken into account.
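[Editor's note: the "slow Galois field multiplication" Martin refers to is easy to see in a scalar implementation. Below is a table-driven GF(2^8) multiply, using the 0x11d primitive polynomial common in storage codes, as a sketch of the arithmetic that GF-Complete replaces with SIMD region operations. Illustrative only.]

```python
# Scalar, table-driven GF(2^8) multiplication (primitive polynomial 0x11d),
# the per-byte arithmetic at the heart of Reed-Solomon. Libraries like
# GF-Complete perform the equivalent over whole regions with SSE table
# lookups instead of a Python-level loop like this.

EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d          # reduce modulo the primitive polynomial
for i in range(255, 512):
    EXP[i] = EXP[i - 255]   # duplicate so lookups need no modulo

def gf_mul(a, b):
    """Multiply two GF(2^8) elements via log/exp tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

assert gf_mul(1, 87) == 87    # 1 is the multiplicative identity
assert gf_mul(2, 142) == 1    # 142 is the inverse of 2 for polynomial 0x11d
```

Each multiply costs two table lookups and an add per byte; vectorized implementations amortize this across 16 bytes at a time, which is why the SSE approach in [4] changes the picture so dramatically.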
>>> Please see:
>>> https://www.usenix.org/conference/fast13/screaming-fast-galois-field-arithmetic-using-sse2-extensions
>>>
>>> The solution
>>> Use the GF-Complete open source library [4] to implement Reed-Solomon
>>> in Ceph in order to allow Ceph to scale to infinity.
>>> James S. Plank, the author of GF-Complete, has developed a library
>>> implementing various Reed-Solomon codes called Jerasure:
>>> http://web.eecs.utk.edu/~plank/plank/www/software.html
>>> Jerasure 2.0, using the GF-Complete arithmetic based on Intel SSE SIMD
>>> instructions, is currently in development, with release expected in
>>> August 2013. It will be released under the new BSD license. Jerasure 2.0
>>> also supports arbitrary Galois field sizes: 8, 16, 32, 64 or 128 bit.
>>>
>>> The limit of this implementation would be the processor's L2/L3 cache,
>>> not the underlying arithmetic.
>>>
>>> Best Regards
>>> Martin Flyvbjerg
>>>
>>> [1] http://web.eecs.utk.edu/~plank/plank/papers/CS-05-569.pdf
>>> [2] http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
>>> [3] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2009.pdf
>>> [4] http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-GF.pdf
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do
>> nothing.
>>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.