Re: Review request : Erasure Code plugin loader implementation
Hi Sage,

I created erasure code : convenience functions to code / decode http://tracker.ceph.com/issues/6064 to implement the suggested functions. Please let me know if this should be merged with another task.

Cheers
Re: Review request : Erasure Code plugin loader implementation
Hi Sage,

This makes a lot more sense indeed. I updated the http://tracker.ceph.com/issues/5878 description accordingly.

  ceph osd pool create poolname erasure-code-dir=/var/lib/ceph/erasure-code erasure-code-plugin=jerasure erasure-code-m=10 erasure-code-k=3 erasure-code-algorithm=Reed-Solomon

Thanks :-)

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
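The key=value form of the pool creation command above lends itself to a generic parser. The following is a minimal sketch, not Ceph's mon command code: the function name `parse_key_values` and the use of plain `std::string` maps are assumptions made for illustration only.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of Ceph): turn "key=value" tokens into a map.
// Returns false and leaves *out untouched if a token has no '=' or an empty key.
bool parse_key_values(const std::vector<std::string> &args,
                      std::map<std::string, std::string> *out) {
  std::map<std::string, std::string> parsed;
  for (const std::string &arg : args) {
    const std::string::size_type eq = arg.find('=');
    if (eq == std::string::npos || eq == 0)
      return false;  // malformed token
    parsed[arg.substr(0, eq)] = arg.substr(eq + 1);
  }
  *out = parsed;
  return true;
}
```

Fed the five parameters from the command line in this message, the sketch yields a map whose "erasure-code-plugin" entry is "jerasure"; values stay strings, so converting m and k to integers would be left to the plugin.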
Re: Review request : Erasure Code plugin loader implementation
On 19/08/2013 02:01, Sage Weil wrote:
> Ah, I guess this works when some of the chunks contain the original data (as with a parity code). There are codes that don't work that way, although I suspect we won't use them.
>
> Regardless, I wonder if we should generalize slightly and have some methods work in terms of (offset,length) of the original stripe to generalize that bit. Then we would have something like
>
>   map<int, buffer> transcode(const set<int> want_to_read, const map<int, buffer> chunks);
>
> to go from chunks -> chunks (as we would want to do with, say, a LRC-like code where we can rebuild some shards from a subset of the other shards). And then also have
>
>   int decode(const map<int, buffer> chunks, unsigned offset, unsigned len, bufferlist *out);

This function would be implemented more or less as:

  set<int> want_to_read = range_to_chunks(offset, len) // compute what chunks must be retrieved
  set<int> available = the up set
  set<int> minimum = minimum_to_decode(want_to_read, available);
  map<int, buffer> available_chunks = retrieve_chunks_from_osds(minimum);
  map<int, buffer> chunks = transcode(want_to_read, available_chunks); // repairs if necessary
  out = bufferptr(concat_chunks(chunks), offset - offset of the first chunk, len)

or do you have something else in mind ?

> that recovers the original data. In our case, the read path would use decode, and for recovery we would use transcode.
>
> We'd also want to have alternate minimum_to_decode* methods, like
>
>   virtual set<int> minimum_to_decode(unsigned offset, unsigned len, const set<int> available_chunks) = 0;

I also have a convenience wrapper in mind for this but I feel I'm missing something.

Cheers

> What do you think?
>
> sage
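The read path outlined in this message (map the byte range to chunks, concatenate them, slice out the requested bytes) can be sketched as follows. This is only an illustration under stated assumptions: the data chunks are numbered from 0, all have the same size, and hold consecutive pieces of the stripe; `Chunk` is a plain `std::string` standing in for Ceph's buffer types; `range_to_chunks` follows the name used in the message, while `decode_range` is a hypothetical name. Fetching chunks from OSDs and repairing missing ones (the `transcode` step) are assumed to have happened before the call.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using Chunk = std::string;  // hypothetical stand-in for Ceph's buffer types

// Which data chunks cover the byte range [offset, offset + len)?
std::set<int> range_to_chunks(unsigned offset, unsigned len, unsigned chunk_size) {
  std::set<int> want;
  if (len == 0 || chunk_size == 0)
    return want;
  const unsigned first = offset / chunk_size;
  const unsigned last = (offset + len - 1) / chunk_size;
  for (unsigned id = first; id <= last; ++id)
    want.insert(static_cast<int>(id));
  return want;
}

// Concatenate the chunks covering the range and slice out the requested bytes.
// Returns 0 on success, -1 if the range is empty or a needed chunk is missing.
int decode_range(const std::map<int, Chunk> &chunks, unsigned offset,
                 unsigned len, unsigned chunk_size, Chunk *out) {
  const std::set<int> want = range_to_chunks(offset, len, chunk_size);
  if (want.empty())
    return -1;
  Chunk joined;
  for (int id : want) {
    auto it = chunks.find(id);
    if (it == chunks.end() || it->second.size() != chunk_size)
      return -1;  // the caller was expected to repair this chunk first
    joined += it->second;
  }
  // "offset - offset of the first chunk" from the pseudocode in the message.
  const unsigned first_offset = static_cast<unsigned>(*want.begin()) * chunk_size;
  *out = joined.substr(offset - first_offset, len);
  return 0;
}
```

With three 4-byte chunks "ABCD", "EFGH" and "IJKL", reading 5 bytes at offset 2 touches chunks 0 and 1 only and yields "CDEFG".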
Re: Review request : Erasure Code plugin loader implementation
On Mon, 19 Aug 2013, Loic Dachary wrote:
> This function would be implemented more or less as:
>
>   set<int> want_to_read = range_to_chunks(offset, len) // compute what chunks must be retrieved
>   set<int> available = the up set
>   set<int> minimum = minimum_to_decode(want_to_read, available);
>   map<int, buffer> available_chunks = retrieve_chunks_from_osds(minimum);
>   map<int, buffer> chunks = transcode(want_to_read, available_chunks); // repairs if necessary
>   out = bufferptr(concat_chunks(chunks), offset - offset of the first chunk, len)
>
> or do you have something else in mind ?

This makes sense. I am still wondering if it is worth generalizing this a bit further to codes without a nice mapping of a range -> want_to_read (i.e. that require decoding the entire stripe to get any part of it). For those codes, we would want to choose the N cheapest/available chunks and the sequence above would be a bit different. I guess in reality, though, we probably don't care to implement any such codes (I'm not sure what their advantages would be, if any)!

sage
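For a code that needs N chunks of any kind to decode anything, the selection step mentioned in this message could look like the sketch below. It is an assumption-laden illustration: the function name `cheapest_chunks` and the idea of a per-chunk integer cost (for instance a network distance) are hypothetical and do not come from the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical helper: given the retrieval cost of each available chunk,
// return the ids of the n cheapest ones (ties go to the lower chunk id).
// Returns an empty set when fewer than n chunks are available.
std::set<int> cheapest_chunks(const std::map<int, unsigned> &cost_by_chunk,
                              std::size_t n) {
  std::vector<std::pair<unsigned, int>> by_cost;
  for (const auto &entry : cost_by_chunk)
    by_cost.emplace_back(entry.second, entry.first);
  std::sort(by_cost.begin(), by_cost.end());  // ascending cost, then id

  std::set<int> chosen;
  if (by_cost.size() < n)
    return chosen;
  for (std::size_t i = 0; i < n; ++i)
    chosen.insert(by_cost[i].second);
  return chosen;
}
```

The result plays the same role as the set returned by minimum_to_decode in the pseudocode, except that it ignores which chunks the caller actually wants to read.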
Re: Review request : Erasure Code plugin loader implementation
On Sun, 18 Aug 2013, Loic Dachary wrote:
> Hi Ceph,
>
> I've implemented a draft of the Erasure Code plugin loader in the context of http://tracker.ceph.com/issues/5878. It has a trivial unit test and an example plugin. It would be great if someone could do a quick review. The general idea is that the erasure code pool calls something like:
>
>   ErasureCodePlugin::factory(erasure_code, example, parameters)
>
> as shown at
>
>   https://github.com/ceph/ceph/blob/5a2b1d66ae17b78addc14fee68c73985412f3c8c/src/test/osd/TestErasureCode.cc#L28
>
> to get an object implementing the interface
>
>   https://github.com/ceph/ceph/blob/5a2b1d66ae17b78addc14fee68c73985412f3c8c/src/osd/ErasureCodeInterface.h
>
> which matches the proposal described at
>
>   https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api
>
> The draft is at
>
>   https://github.com/ceph/ceph/commit/5a2b1d66ae17b78addc14fee68c73985412f3c8c
>
> Thanks in advance :-)

I haven't been following this discussion too closely, but taking a look now, the first 3 make sense, but

  virtual map<int, bufferptr> decode(const set<int> want_to_read, const map<int, bufferptr> chunks) = 0;

it seems like this one should be more like

  virtual int decode(const map<int, bufferptr> chunks, bufferlist *out);

As in, you'd decode the chunks you have to get the actual data. If you want to get (missing) chunks for recovery, you'd do

  minimum_to_decode(...); // see what we need
  fetch those chunks from other nodes
  decode(...); // reconstruct original buffer
  encode(...); // encode missing chunks from original data

sage
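The recovery sequence suggested in this reply (minimum_to_decode, fetch, decode, encode) can be walked through with a toy code. The sketch below is not the proposed Ceph interface: `ToyXorCode`, `recover` and the `Chunk` alias are hypothetical, the code is a simple two-data-plus-one-XOR-parity scheme chosen because it fits in a few lines, and the `stored` map merely plays the role of the other nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

using Chunk = std::string;  // hypothetical stand-in for bufferptr/bufferlist

// Toy code: chunks 0 and 1 hold the two halves of the data, chunk 2 holds
// their XOR. Any two of the three chunks are enough to rebuild the data.
struct ToyXorCode {
  static Chunk xor_chunks(const Chunk &a, const Chunk &b) {
    Chunk out(a.size(), '\0');
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
      out[i] = static_cast<char>(a[i] ^ b[i]);
    return out;
  }
  // Split the data (assumed to have an even length) and add the parity chunk.
  std::map<int, Chunk> encode(const Chunk &data) const {
    const Chunk a = data.substr(0, data.size() / 2);
    const Chunk b = data.substr(data.size() / 2);
    return {{0, a}, {1, b}, {2, xor_chunks(a, b)}};
  }
  // Pick two available chunks; an empty set means decoding is impossible.
  std::set<int> minimum_to_decode(const std::set<int> &available) const {
    std::set<int> minimum;
    for (int id : available)
      if (minimum.size() < 2) minimum.insert(id);
    if (minimum.size() < 2) minimum.clear();
    return minimum;
  }
  // Rebuild the original buffer. Returns 0 on success, -1 otherwise.
  int decode(const std::map<int, Chunk> &chunks, Chunk *out) const {
    Chunk a, b;
    if (chunks.count(0) && chunks.count(1)) {
      a = chunks.at(0); b = chunks.at(1);
    } else if (chunks.count(0) && chunks.count(2)) {
      a = chunks.at(0); b = xor_chunks(a, chunks.at(2));
    } else if (chunks.count(1) && chunks.count(2)) {
      b = chunks.at(1); a = xor_chunks(b, chunks.at(2));
    } else {
      return -1;
    }
    *out = a + b;
    return 0;
  }
};

// Rebuild chunk `missing` (0, 1 or 2) from the chunks that are still stored.
int recover(const ToyXorCode &code, const std::map<int, Chunk> &stored,
            int missing, Chunk *rebuilt) {
  std::set<int> available;
  for (const auto &entry : stored) available.insert(entry.first);
  const std::set<int> needed = code.minimum_to_decode(available);  // see what we need
  if (needed.empty()) return -1;
  std::map<int, Chunk> fetched;
  for (int id : needed) fetched[id] = stored.at(id);  // "fetch" from other nodes
  Chunk original;
  if (code.decode(fetched, &original) != 0) return -1;  // reconstruct original buffer
  *rebuilt = code.encode(original).at(missing);  // encode the missing chunk again
  return 0;
}
```

Note that this path always goes through the full original buffer, even when a single chunk is lost; that cost is what the later chunk-to-chunk proposals in the thread try to avoid.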
Re: Review request : Erasure Code plugin loader implementation
Hi Sage,

Unless I misunderstood something (which is still possible at this stage ;-) decode() is used both for recovery of missing chunks and retrieval of the original buffer. Decoding the M data chunks is a special case of decoding N = M chunks out of the M+K chunks that were produced by encode(). It can be used to recover parity chunks as well as data chunks.

https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api

  map<int, buffer> decode(const set<int> want_to_read, const map<int, buffer> chunks)

decode chunks to read the content of the want_to_read chunks and return a map associating the chunk number with its decoded content. For instance, in the simplest case M=2,K=1 for an encoded payload of data A and B with parity Z, calling

  decode([1,2], { 1 = 'A', 2 = 'B', 3 = 'Z' }) = { 1 = 'A', 2 = 'B' }

If however, the chunk B is to be read but is missing it will be:

  decode([2], { 1 = 'A', 3 = 'Z' }) = { 2 = 'B' }

Cheers
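The M=2, K=1 example in this message can be reproduced with a single XOR parity chunk. The sketch below is a toy under stated assumptions, not the jerasure-backed implementation: `Chunk` is a `std::string` standing in for the buffer type, chunk 3 is assumed to be the XOR of chunks 1 and 2, and the helper `xor_chunks` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

using Chunk = std::string;  // hypothetical stand-in for Ceph's buffer type

// XOR two equally sized chunks byte by byte.
Chunk xor_chunks(const Chunk &a, const Chunk &b) {
  if (a.size() != b.size())
    throw std::invalid_argument("chunk size mismatch");
  Chunk out(a.size(), '\0');
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = static_cast<char>(a[i] ^ b[i]);
  return out;
}

// Toy code with data chunks 1 and 2 and parity chunk 3 = chunk 1 XOR chunk 2.
// Returns the content of every chunk listed in want_to_read, rebuilding a
// missing chunk from the two others when needed.
std::map<int, Chunk> decode(const std::set<int> &want_to_read,
                            const std::map<int, Chunk> &chunks) {
  std::map<int, Chunk> decoded;
  for (int id : want_to_read) {
    if (id < 1 || id > 3)
      throw std::out_of_range("unknown chunk id");
    auto present = chunks.find(id);
    if (present != chunks.end()) {
      decoded[id] = present->second;  // chunk is available: pass it through
      continue;
    }
    // Chunk is missing: with XOR parity it equals the XOR of the other two.
    std::vector<const Chunk *> others;
    for (int peer = 1; peer <= 3; ++peer) {
      auto it = chunks.find(peer);
      if (peer != id && it != chunks.end())
        others.push_back(&it->second);
    }
    if (others.size() != 2)
      throw std::runtime_error("not enough chunks to decode");
    decoded[id] = xor_chunks(*others[0], *others[1]);
  }
  return decoded;
}
```

Because XOR is symmetric, the same branch rebuilds the parity chunk when it is the one that is missing, which matches the remark that decode() can recover parity chunks as well as data chunks.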
Re: Review request : Erasure Code plugin loader implementation
On Sun, 18 Aug 2013, Loic Dachary wrote:
> For instance, in the simplest case M=2,K=1 for an encoded payload of data A and B with parity Z, calling
>
>   decode([1,2], { 1 = 'A', 2 = 'B', 3 = 'Z' }) = { 1 = 'A', 2 = 'B' }
>
> If however, the chunk B is to be read but is missing it will be:
>
>   decode([2], { 1 = 'A', 3 = 'Z' }) = { 2 = 'B' }

Ah, I guess this works when some of the chunks contain the original data (as with a parity code). There are codes that don't work that way, although I suspect we won't use them.

Regardless, I wonder if we should generalize slightly and have some methods work in terms of (offset,length) of the original stripe to generalize that bit. Then we would have something like

  map<int, buffer> transcode(const set<int> want_to_read, const map<int, buffer> chunks);

to go from chunks -> chunks (as we would want to do with, say, a LRC-like code where we can rebuild some shards from a subset of the other shards). And then also have

  int decode(const map<int, buffer> chunks, unsigned offset, unsigned len, bufferlist *out);

that recovers the original data. In our case, the read path would use decode, and for recovery we would use transcode.

We'd also want to have alternate minimum_to_decode* methods, like

  virtual set<int> minimum_to_decode(unsigned offset, unsigned len, const set<int> available_chunks) = 0;

What do you think?

sage
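The chunks-to-chunks idea behind transcode, where a shard is rebuilt from a subset of the other shards, can be illustrated with a small locally repairable layout. Everything in this sketch is an assumption made for illustration: the layout in `kLocalGroups` (two groups of two data chunks plus one XOR parity each), the helper names, the `bool` return with an output parameter, and the `Chunk` alias are hypothetical and are not taken from any Ceph plugin.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

using Chunk = std::string;  // hypothetical stand-in for Ceph's buffer type

Chunk xor_chunks(const Chunk &a, const Chunk &b) {
  if (a.size() != b.size())
    throw std::invalid_argument("chunk size mismatch");
  Chunk out(a.size(), '\0');
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = static_cast<char>(a[i] ^ b[i]);
  return out;
}

// Hypothetical layout: chunk 4 = chunk 0 XOR chunk 1, chunk 5 = chunk 2 XOR chunk 3.
const std::vector<std::vector<int>> kLocalGroups = {{0, 1, 4}, {2, 3, 5}};

// Rebuild chunk `id` from the other members of its local group only.
bool rebuild_from_group(int id, const std::map<int, Chunk> &chunks, Chunk *out) {
  for (const std::vector<int> &group : kLocalGroups) {
    if (std::find(group.begin(), group.end(), id) == group.end())
      continue;
    Chunk acc;
    bool first = true;
    for (int peer : group) {
      if (peer == id) continue;
      auto it = chunks.find(peer);
      if (it == chunks.end())
        return false;  // a second member of the group is missing as well
      acc = first ? it->second : xor_chunks(acc, it->second);
      first = false;
    }
    *out = acc;
    return true;
  }
  return false;  // id does not belong to any group
}

// chunks -> chunks: return the wanted chunks, repairing the missing ones.
// On failure returns false and leaves *out untouched.
bool transcode(const std::set<int> &want_to_read,
               const std::map<int, Chunk> &chunks, std::map<int, Chunk> *out) {
  std::map<int, Chunk> result;
  for (int id : want_to_read) {
    auto present = chunks.find(id);
    if (present != chunks.end())
      result[id] = present->second;
    else if (!rebuild_from_group(id, chunks, &result[id]))
      return false;
  }
  *out = result;
  return true;
}
```

In this layout, rebuilding chunk 1 needs only chunks 0 and 4; the other group is never read, which is the property that makes such a code attractive for recovery.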
Re: Review request : Erasure Code plugin loader implementation
Hi Loic,

One other thought on http://tracker.ceph.com/issues/5878:

The user interface there would let you adjust various parameters of the pool's erasure coding scheme after the pool is created. As a practical matter, I suspect that many/most of these fields will be specified exactly once (at pool creation time) and will be immutable properties of the pool after that. The m/k at a minimum need to match up with what we are requesting out of crush. And once there is data stored, I don't think it will make sense to be able to change the encoding scheme for new objects and still be able to deal with old objects. (Or maybe it will be, if the code metadata is in the object_info_t.)

Even if we do support changing some of these on the fly, though, I suspect the most important interface, and the first we implement, will be something like

  ceph osd pool create <name> [key=value ...]

with the various parameters listed, like EC algorithm, m, k, and pg_num. We can probably generalize the mon command interface to have a key/value list type that will make this easy to plumb from the CLI (and trivial via ceph-rest-api).

sage