Re: [ceph-users] why the erasure code pool not support random write?

2014-10-21 Thread Nicheal
2014-10-21 7:40 GMT+08:00 Lionel Bouton lionel+c...@bouton.name:
 Hi,

 On 21/10/2014 01:10, 池信泽 wrote:

 Thanks.

Another reason is that the checksum stored in the object's attrs, used for
 deep scrub in EC pools, must be recomputed when the object is modified. If
 random writes were supported, we would have to recalculate the checksum over
 the whole object even if only a single bit were modified. With append-only
 writes, we can derive the new checksum from the previous checksum and the
 appended data, which is much faster.

Am I right?


 From what I understand, deep scrub doesn't use a Ceph checksum but
 compares data between OSDs (and probably uses a majority-wins rule for
 repair). If you are using Btrfs, it will report an I/O error because it uses
 an internal checksum by default, which will force Ceph to use other OSDs for
 repair.
 I'd be glad to be proven wrong on this subject, though.
No. During a deep scrub, the whole 4M object content (assuming a 4M object
size) is not compared between replicas byte by byte. That would put a heavy
load on the network if whole 4M objects were transmitted, even compressed.
Instead, the whole 4M object content is hashed into a 64-bit digest, and
comparing the digests confirms whether the content is consistent. It still
requires reading the whole 4M object content, which is why a plain (non-deep)
scrub only compares each object's metadata.
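The digest-based comparison described above can be sketched as follows. This is an assumed illustration of the mechanics, not Ceph's actual scrub protocol; the digest function and OSD names here are hypothetical stand-ins:

```python
import hashlib

def scrub_digest(content: bytes) -> bytes:
    # Stand-in 64-bit digest: the first 8 bytes of SHA-1.
    return hashlib.sha1(content).digest()[:8]

# Each OSD reads its full 4M copy locally and hashes it; only the
# 8-byte digests need to travel over the network for comparison.
replicas = {
    "osd.0": b"A" * (4 * 1024 * 1024),
    "osd.1": b"A" * (4 * 1024 * 1024),
    "osd.2": b"A" * (4 * 1024 * 1024),
}
digests = {osd: scrub_digest(data) for osd, data in replicas.items()}

# Deep scrub compares digests, not 3 x 4M of content.
consistent = len(set(digests.values())) == 1
assert consistent
```

The full 4M read on every OSD is still unavoidable, which is exactly why a regular scrub stops at metadata.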

But for an erasure-coded pool the situation changes. In a replicated pool,
the content of each replica is identical to the primary's; for 4+2 erasure
coding, however, the contents of the six chunks are all different. So for a
deep scrub, all chunk contents would need to be read, transmitted to the OSD
sponsoring the scrub, and checked by recomputing the EC parity and comparing.
That would be expensive, so I am not sure how deep scrub actually works in the
erasure-pool case. Awaiting input from others.


 Best regards,

 Lionel Bouton


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why the erasure code pool not support random write?

2014-10-21 Thread Nicheal
2014-10-20 22:39 GMT+08:00 Wido den Hollander w...@42on.com:
 On 10/20/2014 03:25 PM, 池信泽 wrote:
 hi, cephers:

  When I looked into the Ceph source code, I found that erasure-coded pools
 do not support random writes, only append writes. Why? Is it because random
 writes are expensive with erasure coding, and deep-scrub performance would be
 very poor?


 To modify an EC object you need to read all chunks in order to compute
 the parity again.

 So that would involve a lot of reads for what might be just a very small
 write.

 That's also why EC can't be used for RBD images.

But for RBD use cases, the smallest write will be 4k and the largest 512k,
which is determined by the block-device driver. Currently, with cache tiering,
a single 4k random write may promote the whole 4M object into the hot pool
even though only a few 4k ranges in that object are actually hot, which is
unreasonable. Furthermore, the hot pool can be a common replicated pool, and
we could even use a kv-store for the hot pool. In a kv-store, content is
striped into 1k (1024-byte) chunks by default. So would it be possible to use
that feature to build a model where, on a 4k miss, we promote just the 4k of
data into the hot pool and store it in the kv-store? Expecting Haomai Wang's
input.
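A minimal sketch of the block-granular promotion floated above. This is a hypothetical design, not existing Ceph behaviour; every name in it is made up for illustration:

```python
BLOCK = 4 * 1024  # smallest RBD write, per the block-device driver

# Hot tier keyed by (object, block index), as a kv-store that stripes
# content into small chunks naturally allows.
hot_pool = {}

def cold_read(obj_name: str, offset: int, length: int) -> bytes:
    # Stand-in for a read from the cold (e.g. erasure-coded) base pool.
    return b"C" * length

def read(obj_name: str, offset: int) -> bytes:
    idx = offset // BLOCK
    key = (obj_name, idx)
    if key not in hot_pool:
        # Miss: promote only this 4k block, not the whole 4M object.
        hot_pool[key] = cold_read(obj_name, idx * BLOCK, BLOCK)
    return hot_pool[key]

read("rbd_data.1", 12 * 1024)  # one miss promotes exactly one 4k block
assert sum(len(v) for v in hot_pool.values()) == BLOCK
```

The point of the sketch is the promotion granularity: one miss moves 4k of traffic between tiers instead of 4M.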

Regards,
Nicheal

  Thanks.






 --
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.

 Phone: +31 (0)20 700 9902
 Skype: contact42on


Re: [ceph-users] why the erasure code pool not support random write?

2014-10-20 Thread Gregory Farnum
This is a common constraint in many erasure-coded storage systems. It
arises because random writes turn into a read-modify-write cycle (in order
to redo the parity calculations). So we simply disallow them in EC pools,
which works fine for the target use cases right now.
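The read-modify-write cycle can be illustrated with simple single-parity XOR. This is a sketch only; Ceph's erasure plugins use Reed-Solomon-style codes, but the update pattern is the same:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(chunks: list) -> bytes:
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor(parity, c)
    return parity

def overwrite_chunk(chunks, parity, idx, new_data):
    # A random write to one chunk needs the OLD chunk and the OLD parity
    # before the new parity can be computed:
    #   new_parity = old_parity XOR old_chunk XOR new_chunk
    old = chunks[idx]                             # extra read #1: old data
    new_parity = xor(xor(parity, old), new_data)  # extra read #2: old parity
    chunks[idx] = new_data
    return chunks, new_parity

chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(chunks)
chunks, parity = overwrite_chunk(chunks, parity, 1, b"XYXY")
assert parity == make_parity(chunks)  # parity stays consistent
```

Even in this tiny example, a 4-byte overwrite forces two reads before anything can be written; an append, by contrast, only ever adds new chunks.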
-Greg

On Monday, October 20, 2014, 池信泽 xmdx...@gmail.com wrote:

 hi, cephers:

  When I looked into the Ceph source code, I found that erasure-coded pools
 do not support random writes, only append writes. Why? Is it because random
 writes are expensive with erasure coding, and deep-scrub performance would be
 very poor?

  Thanks.



-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] why the erasure code pool not support random write?

2014-10-20 Thread Wido den Hollander
On 10/20/2014 03:25 PM, 池信泽 wrote:
 hi, cephers:
 
  When I looked into the Ceph source code, I found that erasure-coded pools
 do not support random writes, only append writes. Why? Is it because random
 writes are expensive with erasure coding, and deep-scrub performance would be
 very poor?
 

To modify an EC object you need to read all chunks in order to compute
the parity again.

So that would involve a lot of reads for what might be just a very small
write.

That's also why EC can't be used for RBD images.

  Thanks.
 
 
 
 


-- 
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: [ceph-users] why the erasure code pool not support random write?

2014-10-20 Thread Lionel Bouton
On 20/10/2014 16:39, Wido den Hollander wrote:
 On 10/20/2014 03:25 PM, 池信泽 wrote:
 hi, cephers:

  When I looked into the Ceph source code, I found that erasure-coded pools
 do not support random writes, only append writes. Why? Is it because random
 writes are expensive with erasure coding, and deep-scrub performance would be
 very poor?

 To modify an EC object you need to read all chunks in order to compute
 the parity again.

 So that would involve a lot of reads for what might be just a very small
 write.

 That's also why EC can't be used for RBD images.

I'm surprised this is a show-stopper. Even if writes are really slow, I
can see several use cases for RBD images on EC pools (archiving,
template RBDs, ...). Using tier caching in a write-back configuration
might even alleviate some of the performance problems if writes from the
cache pool are done on properly aligned and sized chunks of data.

It may be overly optimistic (the small benchmark on the following page
might have been done with all planets aligned...), but Sheepdog seems to
implement EC storage with performance that would be interesting for me, if
I could get equivalent numbers on purely sequential accesses with
theoretical Ceph EC RBDs.

https://github.com/sheepdog/sheepdog/wiki/Erasure-Code-Support#performance

Lionel Bouton


Re: [ceph-users] why the erasure code pool not support random write?

2014-10-20 Thread 池信泽
Thanks.

   Another reason is that the checksum stored in the object's attrs, used for
deep scrub in EC pools, must be recomputed when the object is modified. If
random writes were supported, we would have to recalculate the checksum over
the whole object even if only a single bit were modified. With append-only
writes, we can derive the new checksum from the previous checksum and the
appended data, which is much faster.

   Am I right?
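The append-only advantage described above can be sketched with a running CRC32. This is an illustrative assumption; the post does not say which checksum Ceph actually keeps in the object attrs:

```python
import zlib

class AppendOnlyObject:
    """Object whose checksum is extended incrementally on each append."""

    def __init__(self):
        self.data = b""
        self.crc = 0  # zlib.crc32's initial value

    def append(self, chunk: bytes):
        self.data += chunk
        # O(len(chunk)) update: only the appended bytes are hashed,
        # not the whole object.
        self.crc = zlib.crc32(chunk, self.crc)

obj = AppendOnlyObject()
obj.append(b"hello ")
obj.append(b"world")
# The running checksum matches a full recompute over the object.
assert obj.crc == zlib.crc32(b"hello world")
```

A random overwrite in the middle breaks this simple shortcut, which is the cost the post describes: the straightforward fix is to re-read and re-hash the whole object.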

2014-10-21 0:36 GMT+08:00 Gregory Farnum g...@inktank.com:

 This is a common constraint in many erasure-coded storage systems. It
 arises because random writes turn into a read-modify-write cycle (in order
 to redo the parity calculations). So we simply disallow them in EC pools,
 which works fine for the target use cases right now.
 -Greg


 On Monday, October 20, 2014, 池信泽 xmdx...@gmail.com wrote:

 hi, cephers:

  When I looked into the Ceph source code, I found that erasure-coded pools
 do not support random writes, only append writes. Why? Is it because random
 writes are expensive with erasure coding, and deep-scrub performance would be
 very poor?

  Thanks.



 --
 Software Engineer #42 @ http://inktank.com | http://ceph.com



Re: [ceph-users] why the erasure code pool not support random write?

2014-10-20 Thread Lionel Bouton
Hi,

On 21/10/2014 01:10, 池信泽 wrote:
 Thanks.

Another reason is that the checksum stored in the object's attrs, used for
 deep scrub in EC pools, must be recomputed when the object is modified. If
 random writes were supported, we would have to recalculate the checksum over
 the whole object even if only a single bit were modified. With append-only
 writes, we can derive the new checksum from the previous checksum and the
 appended data, which is much faster.

Am I right?

From what I understand, deep scrub doesn't use a Ceph checksum but
compares data between OSDs (and probably uses a majority-wins rule for
repair). If you are using Btrfs, it will report an I/O error because it
uses an internal checksum by default, which will force Ceph to use other
OSDs for repair.
I'd be glad to be proven wrong on this subject, though.

Best regards,

Lionel Bouton