[ 
https://issues.apache.org/jira/browse/KUDU-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542192#comment-17542192
 ] 

Todd Lipcon commented on KUDU-3371:
-----------------------------------

Oh actually it looks like I did put it on public GitHub. Some info on my work here 
(including the link): https://issues.apache.org/jira/browse/KUDU-2204

> Use RocksDB to store LBM metadata
> ---------------------------------
>
>                 Key: KUDU-3371
>                 URL: https://issues.apache.org/jira/browse/KUDU-3371
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Yingchun Lai
>            Priority: Major
>
> h1. Motivation
> The current LBM container uses separate .data and .metadata files. The .data 
> file stores the real user data, and hole punching can be used to reclaim disk 
> space. The metadata file stores serialized protobuf records in append-only 
> mode. Each record is a BlockRecordPB:
>  
> {code:java}
> message BlockRecordPB {
>   required BlockIdPB block_id = 1;  // int64
>   required BlockRecordType op_type = 2;  // CREATE or DELETE
>   required uint64 timestamp_us = 3;
>   optional int64 offset = 4; // Required for CREATE.
>   optional int64 length = 5; // Required for CREATE.
> } {code}
> Each record is either a CREATE or a DELETE. To mark a block as deleted, the 
> metadata ends up holding two records for it: one CREATE and one DELETE.
> The current LBM metadata storage mechanism has some weak points:
> h2. 1. Disk space amplification
> The ratio of live blocks in the metadata may be very low. In the worst case 
> only one block is alive (assuming the container hasn't reached the runtime 
> compaction threshold) while thousands of other blocks are dead (i.e. exist as 
> CREATE-DELETE pairs), so the disk space amplification is severe.
> h2. 2. Long bootstrap time
> During server bootstrap, Kudu has to replay all the metadata files to find the 
> live blocks. In the worst case we replay thousands of block records but find 
> that only a few of them are alive.
> This wastes time in almost all cases, since a production Kudu cluster 
> typically runs for several months between restarts, so the LBM metadata can 
> become very sparse.
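> As an illustration, here is a minimal sketch (not Kudu's actual code) of what 
> this replay amounts to: the cost is proportional to every record ever 
> appended, not to the number of live blocks.
> {code:cpp}
> #include <cstdint>
> #include <unordered_map>
> #include <vector>
>
> // Simplified in-memory view of a parsed BlockRecordPB (see above).
> enum class OpType { kCreate, kDelete };
>
> struct BlockRecord {
>   int64_t block_id;
>   OpType op_type;
>   int64_t offset;   // valid for kCreate
>   int64_t length;   // valid for kCreate
> };
>
> // Replay the whole append-only record stream of a container; only blocks
> // whose CREATE has no matching DELETE survive.
> std::unordered_map<int64_t, BlockRecord>
> ReplayMetadata(const std::vector<BlockRecord>& records) {
>   std::unordered_map<int64_t, BlockRecord> live;
>   for (const auto& r : records) {
>     if (r.op_type == OpType::kCreate) {
>       live.emplace(r.block_id, r);
>     } else {
>       live.erase(r.block_id);
>     }
>   }
>   return live;
> }
> {code}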
> h2. 3. Metadata compaction
> To mitigate the issues above, LBM has a metadata compaction mechanism, both at 
> runtime and during bootstrap.
> The runtime compaction locks the container and is synchronous.
> The bootstrap-stage compaction is synchronous too, and may make the bootstrap 
> time even longer.
> h1. Optimization by using RocksDB
> h2. Storage design
>  * RocksDB instance: one RocksDB instance per data directory.
>  * Key: <container_id>.<block_id>
>  * Value: the same as before, i.e. the serialized protobuf string, stored only 
> for CREATE entries.
>  * Put/Delete: put the value into RocksDB when a block is created, and delete 
> it from RocksDB when the block is deleted.
>  * Scan: happens only at bootstrap, to retrieve all blocks.
>  * DeleteRange: happens only when a container is invalidated (a sketch of 
> these operations follows this list).
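> A minimal sketch of these operations against the RocksDB C++ API (the 
> directory suffix and helper names below are illustrative only, not actual 
> Kudu code):
> {code:cpp}
> #include <cstdint>
> #include <memory>
> #include <string>
> #include <rocksdb/db.h>
>
> // One RocksDB instance per data directory (directory name is illustrative).
> rocksdb::DB* OpenMetadataDb(const std::string& data_dir) {
>   rocksdb::Options opts;
>   opts.create_if_missing = true;
>   rocksdb::DB* db = nullptr;
>   rocksdb::Status s = rocksdb::DB::Open(opts, data_dir + "/lbm_metadata_rdb", &db);
>   return s.ok() ? db : nullptr;
> }
>
> // Key: "<container_id>.<block_id>"; value: serialized BlockRecordPB
> // (CREATE entries only).
> std::string MetadataKey(const std::string& container_id, int64_t block_id) {
>   return container_id + "." + std::to_string(block_id);
> }
>
> void OnBlockCreated(rocksdb::DB* db, const std::string& key,
>                     const std::string& serialized_record) {
>   db->Put(rocksdb::WriteOptions(), key, serialized_record);
> }
>
> void OnBlockDeleted(rocksdb::DB* db, const std::string& key) {
>   db->Delete(rocksdb::WriteOptions(), key);
> }
>
> // Bootstrap: a full scan returns only live blocks, with no dead records to replay.
> void LoadLiveBlocks(rocksdb::DB* db) {
>   std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
>   for (it->SeekToFirst(); it->Valid(); it->Next()) {
>     // Parse it->value() into a BlockRecordPB and register the live block.
>   }
> }
>
> // Invalidating a container drops all of its keys with a single DeleteRange.
> void InvalidateContainer(rocksdb::DB* db, const std::string& container_id) {
>   // '.' + 1 == '/' in ASCII, so [id + ".", id + "/") covers the whole prefix.
>   db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
>                   container_id + ".", container_id + "/");
> }
> {code}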
> h2. Advantages
>  # Disk space amplification: there is still some disk space amplification, 
> but we can tune RocksDB to reach a balanced point, and I believe that in most 
> cases RocksDB is better than an append-only file.
>  # Bootstrap time: since only live blocks remain in RocksDB, bootstrap may be 
> much faster than before.
>  # Metadata compaction: we can leave this work to RocksDB, with some tuning 
> needed of course.
> h2. Test & benchmark
> I have been prototyping RocksDB storage for LBM container metadata recently, 
> have finished most of the work, and have run some benchmarks. They show that 
> the fs module's block read/write/delete performance is similar to, or slightly 
> worse than, the old implementation, while the bootstrap time may be reduced 
> several times over.
> I'm not sure whether it's worth continuing this work, or whether anyone knows 
> of any prior discussion on this topic.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
