Hi James et al.,

Here is an example for clarity:
1. Client writes object object.abcd.
2. Based on the CRUSH rules, say OSD.a is the primary OSD which receives the 
write.
3. OSD.a performs segmenting/fingerprinting, which can be static or dynamic, and 
generates a list of segments. The object.abcd is now represented by a manifest 
object with the list of segment hashes and lengths:
 [Header] 
 [Seg1_sha, len]
 [Seg2_sha, len]
 ...
 [SegN_sha, len]
4. OSD.a writes each segment as a new object in the cluster with object name 
<reserved_dedupe_prefix><sha>.
5. The dedupe object write is treated differently from regular object writes: 
if the object is already present, its reference count is incremented and the 
object is not overwritten. This forms the basis of the dedupe logic. Multiple 
objects with one or more identical constituent segments share the segment 
objects.
6. Once all the segments are successfully written, the object 'object.abcd' is 
now just a stub object with the segment manifest as described above, and it goes 
through a regular object write sequence.
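
A rough sketch of steps 3 to 6 in Python, purely to illustrate the flow. It 
assumes static segmentation with a tiny segment size, SHA-256 as the secure 
hash, and an in-memory dictionary standing in for the cluster. The names 
Cluster, dedupe_write, write_object, DEDUPE_PREFIX and SEGMENT_SIZE are all 
hypothetical and are not Ceph APIs.

```python
import hashlib

DEDUPE_PREFIX = "__dedupe__"  # hypothetical reserved dedupe prefix
SEGMENT_SIZE = 4              # tiny static segment size, for illustration only


class Cluster:
    """In-memory stand-in for the cluster (assumption, not a Ceph API)."""

    def __init__(self):
        self.segments = {}  # segment object name -> [data, refcount]
        self.objects = {}   # object name -> manifest

    def dedupe_write(self, name, data):
        # Step 5: an existing segment is never overwritten, only referenced.
        entry = self.segments.get(name)
        if entry is None:
            self.segments[name] = [data, 1]
        else:
            entry[1] += 1


def write_object(cluster, obj_name, data):
    """Segment the data, write each segment, then store the stub manifest."""
    manifest = []
    for off in range(0, len(data), SEGMENT_SIZE):
        seg = data[off:off + SEGMENT_SIZE]
        sha = hashlib.sha256(seg).hexdigest()
        cluster.dedupe_write(DEDUPE_PREFIX + sha, seg)  # step 4
        manifest.append((sha, len(seg)))                # step 3
    cluster.objects[obj_name] = manifest                # step 6
    return manifest
```

Writing b"aaaabbbbaaaa" with this sketch gives a three-entry manifest but only 
two segment objects, with the "aaaa" segment holding a reference count of 2.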

Partial writes on objects will be complicated:
- Partially affected segments will have to be read, and the segmentation logic 
has to be run from the first to the last affected segment boundary
- New segments will be written
- Old overwritten segments have to be deleted
- The merged manifest of the object is written

All this will need the protection of the PG lock. An additional journaling 
mechanism will also be needed to recover from cases where the OSD goes down 
before writing all the segments.
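
To make the first bullet concrete, here is a small sketch of how the affected 
segment range could be located from the manifest. It assumes the manifest is 
an ordered list of (sha, len) pairs and that the write has a non-zero length; 
the function name affected_segments is hypothetical. Writes that start at or 
beyond the current end of the object (appends) overlap no existing segment and 
are not handled here.

```python
def affected_segments(manifest, write_off, write_len):
    """Return (first_idx, last_idx, start_off, end_off) for a partial write.

    first_idx/last_idx are the manifest indexes of the first and last segments
    overlapped by the byte range [write_off, write_off + write_len).
    start_off/end_off are the byte boundaries of that segment run, i.e. the
    range that has to be read back and re-segmented.
    Returns (None, None, 0, 0) when no existing segment is overlapped.
    """
    write_end = write_off + write_len
    first = last = None
    start = end = 0
    pos = 0
    for i, (_sha, seg_len) in enumerate(manifest):
        seg_start, seg_end = pos, pos + seg_len
        # Two half-open ranges overlap if each starts before the other ends.
        if seg_start < write_end and write_off < seg_end:
            if first is None:
                first, start = i, seg_start
            last, end = i, seg_end
        pos = seg_end
    return first, last, start, end
```

For a manifest of three 4-byte segments, a 4-byte write at offset 5 touches 
segments 1 and 2, so bytes 4 through 11 would have to be read, merged with the 
new data and re-segmented.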

Since this is quite a lot of processing, a better use case for this dedupe 
mechanism would be in the data tiering model with object redirects.
The manifest object fits quite well into the object redirect scheme: the 
idea is that, when an object is moved out of the base tier, you have the option 
to create a dedupe stub object and write the individual segments into the cold 
backend tier with a rados plugin.

Remaining responses inline.

Regards,
Chaitanya

-----Original Message-----
From: James (Fei) Liu-SSI [mailto:james....@ssi.samsung.com] 
Sent: Wednesday, July 01, 2015 4:00 AM
To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi Chaitanya,
   Very interesting thoughts. I am not sure whether I get all of them or not. 
Here are several questions about the solution you proposed; they might be a 
little bit detailed.

    Regards,
    James

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
[James] Does the OSD/PG mean PG Backend over here? 
[Chaitanya] I mean the primary OSD and the PG which get selected by CRUSH, 
not a specific OSD component

- Data is segmented (rabin/static) and secure hash computed
[James] In which component of the OSD are you going to do the data segmentation 
and hash computation?
[Chaitanya] If partial writes are not supported then this could be done before 
acquiring the PG lock; otherwise we need the protection of the PG lock. Probably 
in the do_request() path?

- A manifest is created with the offset/len/hash for all the segments
[James] Is the manifest going to be part of the object's xattrs? Where are you 
going to save the manifest?
[Chaitanya] The manifest is a stub object with the list of constituent segments

- OSD/pg sends rados write with a special name <__known__prefix><secure hash> 
for all segments
[James] What do you mean by RADOS write? Where do all the segments with secure 
hash signatures get written to?
[Chaitanya] All segments are unique objects with the above-mentioned naming 
scheme; they get written back into the cluster as regular client rados object 
writes

- PG receiving dedup write will:
        1. check for object presence and create object if not present
        2. If object is already present, then a reference count is incremented 
(check and increment needs to be atomic)
[James] It makes sense. But I was wondering whether the unit for dedupe is the 
segment or the object? If it is object based, it totally makes sense. However, 
why do we need segments with a manifest?

- Response is received by original primary PG for all segments
[James] What response?
[Chaitanya] Write response indicating the status of the segment object write

- Primary PG writes the manifest to local and replicas or EC members
[James] How about the dedupe data if the data is not present in replicas?
[Chaitanya] I am sorry, I did not get your question. The manifest object gets 
written to the primary and the replicas, or encoded and written to the EC 
members; it is afforded the protection policy set for the pool. The same is the 
case with the individual constituent segments.
 
- Response sent to client

Read:
- Read received at primary PG
[James] Can the read only fetch data from the primary PG?
- Reads manifest object

- sends reads for each segment object <__known__prefix><secure hash>
- coalesces all the response to build the required data
- Responds to client
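
The read steps above could be sketched roughly as follows. This is only an 
illustration under assumptions: the manifest is an ordered list of (sha, len) 
pairs, plain dictionaries stand in for the manifest objects and segment 
objects, and DEDUPE_PREFIX and read_object are hypothetical names. In a real 
cluster each segment lookup would be a separate rados read that may land on a 
different PG.

```python
DEDUPE_PREFIX = "__dedupe__"  # hypothetical stand-in for <__known__prefix>


def read_object(objects, segments, obj_name):
    """Rebuild an object's data from its manifest.

    objects:  object name -> list of (sha, length) manifest entries
    segments: segment object name -> bytes
    """
    parts = []
    for sha, length in objects[obj_name]:      # read the manifest object
        data = segments[DEDUPE_PREFIX + sha]   # one read per segment object
        if len(data) != length:
            raise ValueError("segment %s has unexpected length" % sha)
        parts.append(data)
    return b"".join(parts)                     # coalesce in manifest order
```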


Pros:
No need for a centralized hash index, so it is in line with Ceph's no-bottleneck 
philosophy

Cons:
Some PGs may get overloaded due to frequently occurring segment patterns 
Latency and increased traffic on the network
   


-----Original Message-----
From: Chaitanya Huilgol [mailto:chaitanya.huil...@sandisk.com]
Sent: Tuesday, June 30, 2015 8:50 AM
To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression


- Reference count has to be maintained as an attribute of the object
- As mentioned in the write workflow, duplicate segment writes increment the 
reference count
- Object delete would result in deletes on the constituent segments listed in 
the object segment manifest
- Segment object delete will decrement the reference count and remove the 
segment when there are no more references present
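
As a minimal sketch of the delete path described above: plain dictionaries 
stand in for the manifest objects and segment objects, the reference count is 
kept next to the segment data, and DEDUPE_PREFIX and delete_object are 
hypothetical names. A real implementation would need each decrement and 
removal to be atomic on the PG that holds the segment.

```python
DEDUPE_PREFIX = "__dedupe__"  # hypothetical reserved dedupe prefix


def delete_object(objects, segments, obj_name):
    """Delete an object and drop one reference per manifest entry.

    objects:  object name -> list of (sha, length) manifest entries
    segments: segment object name -> {"data": bytes, "refs": int}
    """
    manifest = objects.pop(obj_name)
    for sha, _length in manifest:
        name = DEDUPE_PREFIX + sha
        seg = segments[name]
        seg["refs"] -= 1
        if seg["refs"] == 0:
            # No object references this segment any more, so remove it.
            del segments[name]
```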

Regards,
Chaitanya

-----Original Message-----
From: Allen Samuels
Sent: Tuesday, June 30, 2015 9:02 PM
To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

This covers the read and write; what about the delete? One of the major issues 
with dedupe, whether global or local, is addressing the inherent ref-counting 
associated with sharing pieces of storage.

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Chaitanya Huilgol
Sent: Monday, June 29, 2015 11:20 PM
To: James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Below is an alternative idea at a very high level around dedup with ceph 
without the need for a centralized hash index:

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
- Data is segmented (rabin/static) and secure hash computed
- A manifest is created with the offset/len/hash for all the segments
- OSD/pg sends rados write with a special name <__known__prefix><secure hash> 
for all segments
- PG receiving dedup write will:
        1. check for object presence and create object if not present
        2. If object is already present, then a reference count is incremented 
(check and increment needs to be atomic)
- Response is received by original primary PG for all segments
- Primary PG writes the manifest to local and replicas or EC members
- Response sent to client

Read:
- Read received at primary PG
- Reads manifest object
- sends reads for each segment object <__known__prefix><secure hash>
- coalesces all the response to build the required data
- Responds to client


Pros:
No need for a centralized hash index, so it is in line with Ceph's no-bottleneck 
philosophy

Cons:
Some PGs may get overloaded due to frequently occurring segment patterns 
Latency and increased traffic on the network

Regards,
Chaitanya

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, June 30, 2015 2:25 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi Haomai,
  Thanks for moving the idea forward. Regarding the compression: if we do 
compression at the client level, it is not global, and the compression is only 
applied to the local client, am I right? I think there are pros and cons to 
both solutions and we can get into more detail on each.
  I really like your idea for dedupe on the OSD side, by the way. Let me think 
more about it.

 Regards,
 James

-----Original Message-----
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Friday, June 26, 2015 8:55 PM
To: James (Fei) Liu-SSI
Cc: ceph-devel
Subject: Re: Inline dedup/compression

On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI 
<james....@ssi.samsung.com> wrote:
> Hi Haomai,
>   Thanks for your response as always. I agree compression is a comparatively 
> easier task, but it is still very challenging in terms of implementation no 
> matter where we implement it. The client side (RBD, RADOSGW or CephFS) or the 
> PG should be a somewhat better place to implement it in terms of efficiency 
> and cost reduction, before the data is replicated to other OSDs. There are two 
> reasons:
> 1. Keeping the data consistent among OSDs in one PG
> 2. Saving computing resources
>
> IMHO, the compression should be accomplished before the replication comes 
> into play at the pool level. However, we can also have a second level of 
> compression in the local objectstore. In terms of the unit size of 
> compression, it really depends on the workload and on which layer we implement 
> it in.
>
> About inline deduplication, it will dramatically increase the complexity if 
> we bring replication and erasure coding into consideration.
>
> However, before we talk about implementation, it would be great if we could 
> understand the pros and cons of implementing inline dedupe/compression. We all 
> understand the benefits of dedupe/compression. However, the side effect is 
> a performance hit and the need for more computing resources. It would be great 
> if we could understand the problems from 30,000 feet for the whole picture of 
> Ceph. Please correct me if I am wrong.

Actually we may have some tricks to reduce the performance hit of compression. 
As Joe mentioned, we can compress only the replica PG data to avoid the 
performance hit, but it may increase the complexity of recovery and PG 
remapping. Another, more detailed implementation approach: if we begin to 
compress data in the messenger, the osd thread and pg thread won't access data 
for a normal client op, so maybe we can run the compression in parallel with pg 
processing. The journal thread will get the compressed data at the end.

The effectiveness of compression is also a concern: if we do compression in 
rados we may not get the best compression result. If we can do compression in 
libcephfs, librbd and radosgw and keep rados unaware of compression, it may be 
simpler and we can get file/block/object level compression. Wouldn't that be 
better?

About dedup, my current idea is that we could set up a memory pool on the osd 
side to store checksums. Then we compute a hash of the object data and map that 
to a PG, instead of the object name, at the client side, so an object would 
always land on the osd that is also responsible for its dedup storage. It could 
also be distributed at the pool level.
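
A very small sketch of what mapping by content rather than by name could look 
like. This is an assumption-laden illustration, not how Ceph places objects 
today: it uses SHA-256 and a plain modulo over a hypothetical pg_num, and the 
function name pg_for_content is made up for this example.

```python
import hashlib


def pg_for_content(data, pg_num):
    """Pick a PG from a hash of the object data instead of the object name.

    Identical data always yields the same PG, so duplicates meet on the same
    osd, which can then detect them against its local checksum store.
    """
    digest = hashlib.sha256(data).digest()
    # Use the first 8 bytes of the digest as an integer, then fold into pg_num.
    return int.from_bytes(digest[:8], "big") % pg_num
```

With this kind of mapping two clients writing the same bytes under different 
object names end up on the same PG, which is what makes the osd-local dedup 
check sufficient.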


>
> By the way, software-defined storage startups like Hedvig and Springpath both 
> provide inline dedupe/compression. It is not an apples-to-apples comparison, 
> but it is a good reference. Datacenters need cost-effective solutions.
>
> Regards,
> James
>
>
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiw...@gmail.com]
> Sent: Thursday, June 25, 2015 8:08 PM
> To: James (Fei) Liu-SSI
> Cc: ceph-devel
> Subject: Re: Inline dedup/compression
>
> On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI 
> <james....@ssi.samsung.com> wrote:
>> Hi Cephers,
>>     It is not easy to ask when Ceph is going to support inline 
>> dedup/compression across OSDs in RADOS, because it is not an easy task or an 
>> easy question to answer. Ceph provides replication and EC for performance and 
>> failure recovery, but we also lose storage efficiency and incur the cost 
>> associated with it. The two goals somewhat contradict each other. But I am 
>> curious what other Cephers think about this question.
>>    Are there any plans among Cephers to do anything regarding inline 
>> dedupe/compression, apart from the features provided by the local node itself 
>> such as Btrfs?
>
> Compression is easier to implement in rados than dedup. The most important 
> question about compression is where we begin to compress: client, pg or 
> objectstore. Then we need to decide how large the compression unit is. Of 
> course, compression and dedup would both prefer a keyvalue-like storage api, 
> but I think it's not difficult to use the existing objectstore api.
>
> Dedup is more feasible to implement within a local osd than across the whole 
> pool or cluster, and if we want to do dedup at the pool level, we need to do 
> dedup from the client.
>
>>
>>   Regards,
>>   James
>
>
>
> --
> Best Regards,
>
> Wheat



--
Best Regards,

Wheat

