[ 
https://issues.apache.org/jira/browse/HBASE-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clara Xiong updated HBASE-15400:
--------------------------------
    Description: 
When we compact, we can output multiple files along the current window 
boundaries. There are two use cases:

1. Major compaction: We want to output date tiered store files with data older 
than max age archived in trunks of the window size on the higher tier. Once a 
window is old enough, we don't combine the windows to promote to the next tier 
any further. So files in these windows retain the same timespan as they were 
minor-compacted last time, which is the window size of the highest tier. Major 
compaction will touch these files and we want to maintain the same layout.

2. Bulk load files and the old file generated by major compaction before 
upgrading to DTCP.

Pros: 
1. Restore locality, process versioning, updates and deletes while maintaining 
the tiered layout.
2. The best way to fix a skewed layout.
 
This work is based on a prototype of DateTieredCompactor from HBASE-15389 and 
focused on the part to meet needs for these two use cases while supporting 
others. I have to call out a few design decisions:

1. We only want to output the files along all windows for major compaction. And 
we want to output multiple files older than max age in the trunks of the 
maximum tier window size determined by base window size, windows per tier and 
max age.

2. For minor compaction, we don't want to output too many files, which will 
remain around because of current restriction of contiguous compaction by seq 
id. I will only output two files if all the files in the windows are being 
combined, one for the data within window and the other for the out-of-window 
tail. If there is any file in the window excluded from compaction, only one 
file will be output from compaction. When the windows are promoted, the 
situation of out of order data will gradually improve. For the incoming window, 
we need to accommodate the case with user-specified future data.

3. We have to pass the boundaries with the list of store file as a complete 
time snapshot instead of two separate calls because window layout is determined 
by the time the computation is called. So we will need new type of compaction 
request. 

4. Since we will assign the same seq id for all output files, we need to sort 
by maxTimestamp subsequently. Right now all compaction policy gets the files 
sorted for StoreFileManager which sorts by seq id and other criteria. I will 
use this order for DTCP only, to avoid impacting other compaction policies. 

5. We need some cleanup of current design of StoreEngine and CompactionPolicy.


  was:
When we compact, we can output multiple files along the current window 
boundaries. There are two use cases:

1. Major compaction: We want to output date tiered store files with data older 
than max age archived in trunks of the window size on the higher tier.1. Once a 
file is old enough to be out of the range that we do tiered minor compaction, 
we don't compact them any further. So they retain the same timespan as they 
were compacted last time, which is the window size of the highest tier. Major 
compaction will touch these files and we want to maintain the same layout.

2. Bulk load files and the old file generated by major compaction before 
upgrading to DTCP.

Pros: 
1. Restore locality, process versioning, updates and deletes while maintaining 
the tiered layout.
2. The best way to fix a skewed layout.
 
This work is based on a prototype of DateTieredCompactor from HBASE-15389 and 
focused on the part to meet needs for these two use cases while supporting 
others. I have to call out a few design decisions:

1. We only want to output the files along all windows for major compaction. And 
we want to output multiple files older than max age in the trunks of the 
maximum tier window size determined by base window size, windows per tier and 
max age.

2. For minor compaction, we don't want to output too many files, which will 
remain around because of current restriction of contiguous compaction by seq 
id. I will only output two files if all the files in the windows are being 
combined, one for the data within window and the other for the out-of-window 
tail. If there is any file in the window excluded from compaction, only one 
file will be output from compaction. When the windows are promoted, the 
situation of out of order data will gradually improve. For the incoming window, 
we need to accommodate the case with user-specified future data.

3. We have to pass the boundaries with the list of store file as a complete 
time snapshot instead of two separate calls because window layout is determined 
by the time the computation is called. So we will need new type of compaction 
request. 

4. Since we will assign the same seq id for all output files, we need to sort 
by maxTimestamp subsequently. Right now all compaction policy gets the files 
sorted for StoreFileManager which sorts by seq id and other criteria. I will 
use this order for DTCP only, to avoid impacting other compaction policies. 

5. We need some cleanup of current design of StoreEngine and CompactionPolicy.



> Use DateTieredCompactor for Date Tiered Compaction
> --------------------------------------------------
>
>                 Key: HBASE-15400
>                 URL: https://issues.apache.org/jira/browse/HBASE-15400
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Compaction
>            Reporter: Clara Xiong
>            Assignee: Clara Xiong
>             Fix For: 2.0.0
>
>         Attachments: HBASE-15400-15389-v12.patch, HBASE-15400-v1.pa, 
> HBASE-15400-v3.patch, HBASE-15400.patch
>
>
> When we compact, we can output multiple files along the current window 
> boundaries. There are two use cases:
> 1. Major compaction: We want to output date tiered store files with data 
> older than max age archived in trunks of the window size on the higher tier. 
> Once a window is old enough, we don't combine the windows to promote to the 
> next tier any further. So files in these windows retain the same timespan as 
> they were minor-compacted last time, which is the window size of the highest 
> tier. Major compaction will touch these files and we want to maintain the 
> same layout.
> 2. Bulk load files and the old file generated by major compaction before 
> upgrading to DTCP.
> Pros: 
> 1. Restore locality, process versioning, updates and deletes while 
> maintaining the tiered layout.
> 2. The best way to fix a skewed layout.
>  
> This work is based on a prototype of DateTieredCompactor from HBASE-15389 and 
> focused on the part to meet needs for these two use cases while supporting 
> others. I have to call out a few design decisions:
> 1. We only want to output the files along all windows for major compaction. 
> And we want to output multiple files older than max age in the trunks of the 
> maximum tier window size determined by base window size, windows per tier and 
> max age.
> 2. For minor compaction, we don't want to output too many files, which will 
> remain around because of current restriction of contiguous compaction by seq 
> id. I will only output two files if all the files in the windows are being 
> combined, one for the data within window and the other for the out-of-window 
> tail. If there is any file in the window excluded from compaction, only one 
> file will be output from compaction. When the windows are promoted, the 
> situation of out of order data will gradually improve. For the incoming 
> window, we need to accommodate the case with user-specified future data.
> 3. We have to pass the boundaries with the list of store file as a complete 
> time snapshot instead of two separate calls because window layout is 
> determined by the time the computation is called. So we will need new type of 
> compaction request. 
> 4. Since we will assign the same seq id for all output files, we need to sort 
> by maxTimestamp subsequently. Right now all compaction policy gets the files 
> sorted for StoreFileManager which sorts by seq id and other criteria. I will 
> use this order for DTCP only, to avoid impacting other compaction policies. 
> 5. We need some cleanup of current design of StoreEngine and CompactionPolicy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to