Re: [ceph-users] why the erasure code pool not support random write?
2014-10-21 7:40 GMT+08:00 Lionel Bouton lionel+c...@bouton.name:

Hi,

On 21/10/2014 01:10, 池信泽 wrote:

Thanks. Another reason is that the checksum stored in the object's attributes, which is used for deep scrub in EC pools, must be recomputed whenever the object is modified. If random writes were supported, we would have to recalculate the checksum over the whole object even if only a single bit changed. By supporting only append writes, we can derive the new checksum from the previous checksum and the appended data, which is much faster. Am I right?

From what I understand, the deep scrub doesn't use a Ceph checksum but compares data between OSDs (and probably uses a majority-wins rule for repair). If you are using Btrfs it will report an I/O error, because Btrfs uses an internal checksum by default, and that will force Ceph to use other OSDs for repair. I'd be glad to be proven wrong on this subject though.

No: during a deep scrub, the whole 4M objects (assuming an object size of 4M) are not compared with each other byte by byte. That would put a high load on the network if whole 4M objects were transmitted, even if the object content were compressed. Instead, the whole 4M object content is hashed into a 64-bit digest, and comparing the digests confirms whether the content is consistent. This still requires reading the whole 4M object content, which is why a plain (non-deep) scrub only compares each object's metadata.

For an erasure pool, though, the situation changes. In a replicated pool the content of each replica is identical to the primary, but with 4+2 erasure coding, for example, the content of the six chunks is completely different. So for a deep scrub, all the content would need to be read, transmitted to the OSD sponsoring the scrub, the EC parity recalculated, and the results compared. That would be expensive. So I am not quite sure how this works in the erasure pool case.
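The digest comparison described above can be sketched as follows. This is an illustrative Python sketch, not Ceph's actual scrub code (Ceph is C++ and its digest details differ); the function names are invented for the example. The point is that only an 8-byte digest per OSD crosses the network, not the 4M object itself.

```python
import hashlib

def object_digest(data: bytes) -> int:
    # Hash the full object content down to a 64-bit digest, so each OSD
    # only has to ship 8 bytes to the scrubbing OSD instead of 4M.
    # (Each OSD still pays the local cost of reading the whole object.)
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def replicas_consistent(digests: list[int]) -> bool:
    # In a replicated pool every replica holds identical bytes,
    # so all digests must match for the scrub to pass.
    return len(set(digests)) == 1

obj = b"x" * (4 * 1024 * 1024)  # one 4M object
d = object_digest(obj)
assert replicas_consistent([d, d, d])            # three healthy replicas
assert not replicas_consistent([d, object_digest(obj[:-1] + b"!")])
```

Note that this trick only works when the copies are supposed to be byte-identical, which is exactly the property an erasure-coded pool lacks: the k data chunks and m parity chunks are all different, so their digests cannot simply be compared against each other.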
Waiting for others' input.

Best regards, Lionel Bouton

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why the erasure code pool not support random write?
2014-10-20 22:39 GMT+08:00 Wido den Hollander w...@42on.com:

On 10/20/2014 03:25 PM, 池信泽 wrote: hi, cephers: When I looked into the Ceph source code, I found that erasure code pools do not support random writes, only append writes. Why? Is it that random writes are high cost for erasure coding and that deep scrub performance would be very poor?

To modify an EC object you need to read all chunks in order to compute the parity again. So that would involve a lot of reads for what might be just a very small write. That's also why EC can't be used for RBD images.

But for RBD use cases, the smallest write will be 4k and the largest 512k, which is determined by the block device driver. Currently, if we use cache tiering, one 4k random write may promote the whole 4M object into the hot pool even though only several 4k regions in that object are hot, which is unreasonable. Furthermore, the hot pool can be a common replicated pool, and we could even use a kv-store for the hot pool. In a kv-store, the content is striped into 1k = 1024 byte pieces by default. So would it be possible to make use of this feature so that, on a 4k miss, we promote just the 4k of data to the hot pool and save it into the kv-store?

Expecting Haomai Wang's input.

Nicheal

Regards, Thanks.

-- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
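The read amplification Wido describes can be made concrete with a toy parity sketch. This is an assumption-laden Python illustration: it uses single-chunk XOR parity as a stand-in for the Reed-Solomon parity of a real 4+2 pool, and the variable names are invented. It shows why even a one-byte overwrite forces extra reads, either of all k data chunks or of the old data plus old parity (a read-modify-write).

```python
from functools import reduce

def xor_parity(chunks):
    # Byte-wise XOR across the k data chunks: a simplified stand-in
    # for the parity chunks computed by a real erasure code.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

# Four data chunks of one striped object (k = 4).
chunks = [bytes([i]) * 8 for i in range(4)]
parity = xor_parity(chunks)

# A "random write" touches a single byte of a single chunk...
old_chunk = chunks[2]
new_chunk = b"\xff" + old_chunk[1:]
chunks[2] = new_chunk

# ...but the parity is now stale. Option 1: re-read all k chunks and
# recompute from scratch -- many reads for a tiny write.
full_recompute = xor_parity(chunks)

# Option 2: read-modify-write -- read only the old chunk and old
# parity, XOR the old data out and the new data in. Fewer reads,
# but still a read cycle before every write can complete.
rmw = bytes(p ^ o ^ n for p, o, n in zip(parity, old_chunk, new_chunk))
assert rmw == full_recompute
```

Append-only writes avoid both options: new stripes are written fresh, so no existing chunk or parity ever needs to be read back first.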
Re: [ceph-users] why the erasure code pool not support random write?
This is a common constraint in many erasure-coded storage systems. It arises because random writes turn into a read-modify-write cycle (in order to redo the parity calculations). So we simply disallow them in EC pools, which works fine for the target use cases right now. -Greg

On Monday, October 20, 2014, 池信泽 xmdx...@gmail.com wrote: hi, cephers: When I looked into the Ceph source code, I found that erasure code pools do not support random writes, only append writes. Why? Is it that random writes are high cost for erasure coding and that deep scrub performance would be very poor? Thanks.

-- Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] why the erasure code pool not support random write?
On 10/20/2014 03:25 PM, 池信泽 wrote: hi, cephers: When I looked into the Ceph source code, I found that erasure code pools do not support random writes, only append writes. Why? Is it that random writes are high cost for erasure coding and that deep scrub performance would be very poor?

To modify an EC object you need to read all chunks in order to compute the parity again. So that would involve a lot of reads for what might be just a very small write. That's also why EC can't be used for RBD images.

Thanks.

-- Wido den Hollander Ceph consultant and trainer 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
Re: [ceph-users] why the erasure code pool not support random write?
On 20/10/2014 16:39, Wido den Hollander wrote:

On 10/20/2014 03:25 PM, 池信泽 wrote: hi, cephers: When I looked into the Ceph source code, I found that erasure code pools do not support random writes, only append writes. Why? Is it that random writes are high cost for erasure coding and that deep scrub performance would be very poor?

To modify an EC object you need to read all chunks in order to compute the parity again. So that would involve a lot of reads for what might be just a very small write. That's also why EC can't be used for RBD images.

I'm surprised this is a show stopper. Even if writes are really slow, I can see several use cases for RBD images on EC pools (archiving, template RBDs, ...). Using tier caching in a write-back configuration might even alleviate some of the performance problems, if writes from the cache pool are done on properly aligned and sized chunks of data. It may be overly optimistic (the small benchmark on the following page might have been done with all planets aligned...), but Sheepdog seems to implement EC storage with performance that would be interesting for me, if I could get equivalent performance on purely sequential accesses with theoretical Ceph EC RBDs. https://github.com/sheepdog/sheepdog/wiki/Erasure-Code-Support#performance

Lionel Bouton
Re: [ceph-users] why the erasure code pool not support random write?
Thanks. Another reason is that the checksum stored in the object's attributes, which is used for deep scrub in EC pools, must be recomputed whenever the object is modified. If random writes were supported, we would have to recalculate the checksum over the whole object even if only a single bit changed. By supporting only append writes, we can derive the new checksum from the previous checksum and the appended data, which is much faster. Am I right?

2014-10-21 0:36 GMT+08:00 Gregory Farnum g...@inktank.com:

This is a common constraint in many erasure-coded storage systems. It arises because random writes turn into a read-modify-write cycle (in order to redo the parity calculations). So we simply disallow them in EC pools, which works fine for the target use cases right now. -Greg

On Monday, October 20, 2014, 池信泽 xmdx...@gmail.com wrote: hi, cephers: When I looked into the Ceph source code, I found that erasure code pools do not support random writes, only append writes. Why? Is it that random writes are high cost for erasure coding and that deep scrub performance would be very poor? Thanks.

-- Software Engineer #42 @ http://inktank.com | http://ceph.com
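The append-only checksum argument above can be demonstrated with CRC32, which supports exactly this kind of incremental extension (Python's `zlib.crc32` accepts a running value as its second argument). This is an illustrative sketch, not Ceph's actual checksum scheme; the helper name is invented for the example.

```python
import zlib

def append_checksum(prev_crc: int, appended: bytes) -> int:
    # Extend a CRC32 with only the newly appended bytes and the
    # previous checksum value; no need to re-read the whole object.
    return zlib.crc32(appended, prev_crc)

obj = b""
crc = 0  # zlib.crc32's initial running value
for chunk in (b"hello ", b"erasure ", b"coded ", b"world"):
    obj += chunk
    crc = append_checksum(crc, chunk)

# The incrementally maintained checksum matches a full recompute.
assert crc == zlib.crc32(obj)

# An in-place overwrite breaks the trick: after modifying even one
# byte mid-object, the checksum must be recomputed over all bytes.
modified = b"H" + obj[1:]
assert zlib.crc32(modified) != crc
```

This is why the append-only restriction keeps per-write checksum maintenance O(size of the write) instead of O(size of the object).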
Re: [ceph-users] why the erasure code pool not support random write?
Hi,

On 21/10/2014 01:10, 池信泽 wrote:

Thanks. Another reason is that the checksum stored in the object's attributes, which is used for deep scrub in EC pools, must be recomputed whenever the object is modified. If random writes were supported, we would have to recalculate the checksum over the whole object even if only a single bit changed. By supporting only append writes, we can derive the new checksum from the previous checksum and the appended data, which is much faster. Am I right?

From what I understand, the deep scrub doesn't use a Ceph checksum but compares data between OSDs (and probably uses a majority-wins rule for repair). If you are using Btrfs it will report an I/O error, because Btrfs uses an internal checksum by default, and that will force Ceph to use other OSDs for repair. I'd be glad to be proven wrong on this subject though.

Best regards, Lionel Bouton