[ 
https://issues.apache.org/jira/browse/HDDS-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182908#comment-17182908
 ] 

runzhiwang commented on HDDS-3630:
----------------------------------

In my test, the capacity of each container is 5GB, ozone.container.cache.size 
is 1500, and there are about 25,000 blocks in each container. So there are 1500 
RocksDB instances in memory when one datanode writes 7.5TB of data. The basic 
memory settings are: -Xmx3500m, -XX:MaxDirectMemorySize=1000m.

After one datanode wrote 7.5TB of data, Resident Memory was 13.6GB and Virtual 
Memory was 53.5GB, so the off-heap memory of RocksDB is about 9.1GB (13.6GB 
resident minus the 3.5GB heap and 1GB direct memory). The smaller the block 
size, the bigger the off-heap memory.
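The off-heap estimate above can be reproduced with simple arithmetic (the figures are taken from the test; the attribution of all remaining resident memory to RocksDB is an assumption):

```python
# Rough off-heap estimate from the figures above (values from the test run).
resident = 13.6     # GB, Resident Memory observed
heap_max = 3.5      # GB, -Xmx3500m
direct_max = 1.0    # GB, -XX:MaxDirectMemorySize=1000m

# Assumption: everything resident beyond the JVM heap and direct-buffer
# caps is attributed to RocksDB's native (off-heap) allocations.
off_heap = resident - heap_max - direct_max
print(f"estimated RocksDB off-heap: {off_heap:.1f} GB")   # ~9.1 GB

# Average cost per instance with 1500 open RocksDB instances.
per_instance_mb = off_heap * 1024 / 1500
print(f"per instance: {per_instance_mb:.1f} MB")
```

This is only an average; the block cache and thread stacks are not spread evenly across instances.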

There are 4835 threads in the datanode, 3000 of which belong to RocksDB: 1500 
rocksdb:dump_st threads and 1500 rocksdb:pst_st threads.

So the off-heap memory of RocksDB consists of:
1. the block cache of RocksDB.
2. the 1500 rocksdb:dump_st threads and 1500 rocksdb:pst_st threads.
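The thread counts above scale linearly with the number of RocksDB instances, which is the core of the problem (the 2-threads-per-instance figure comes from the observation above; the 12-disks-per-datanode figure below is an assumed example, not from the source):

```python
# Background statistics threads observed per RocksDB instance:
# rocksdb:dump_st + rocksdb:pst_st.
threads_per_instance = 2

# Current scheme: one RocksDB per container.
instances = 1500
print(instances * threads_per_instance)   # 3000 threads, as observed

# Merged scheme: one RocksDB per disk (12 disks is an assumed example).
disks = 12
print(disks * threads_per_instance)       # 24 threads
```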

Besides merging RocksDB, there are 2 other options, both of which have cons.
1.  Flush the RocksDB memory of closed containers.
     cons: thousands of RocksDB threads still exist.
2.  Remove RocksDB from the datanode and store the data in files.
     cons: 
      a. Removing RocksDB is a big piece of work.
      b. If we store all the checksums in files, then to query a checksum 
quickly we must load the file into memory. Because there is one file per 
container, the number of files grows as the number of closed containers 
increases, and we cannot load the files of all closed containers into memory, 
otherwise OOM will happen. So we must maintain an elimination strategy, such 
as LRU, for the checksum files in memory. That amounts to re-doing work that 
RocksDB already does for us.
    c. For open containers, some data also needs to be stored in files and 
updated frequently. We cannot force a sync to disk on every update, because 
some of the data is not important, such as the block count of each container. 
So we must create a background thread to do batched syncs for these types of 
data, which looks complicated. Every time we add such data, we must repeat 
this work, and the code may be hard to maintain.
    d. The memory reduction can be achieved by merging RocksDB, which looks 
much easier than removing it, so maybe we need not spend so much time on 
removal.
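The elimination strategy mentioned in 2(b) would look roughly like this (a minimal sketch, not Ozone code; the class name and the loader callback are hypothetical):

```python
from collections import OrderedDict

class ChecksumFileCache:
    """LRU cache for per-container checksum files (hypothetical sketch).

    Only the most recently used files stay in memory; the rest are
    evicted and must be re-read from disk on the next query, which is
    essentially what RocksDB's block cache already provides.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()  # container_id -> parsed checksums

    def get(self, container_id, load_from_disk):
        if container_id in self._cache:
            self._cache.move_to_end(container_id)  # mark as recently used
            return self._cache[container_id]
        data = load_from_disk(container_id)        # read + parse the file
        self._cache[container_id] = data
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)        # evict least recently used
        return data
```

Here `load_from_disk` stands in for reading and parsing the per-container checksum file; getting eviction, sizing, and concurrency right for this cache is exactly the duplicated work the comment warns about.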
 

> Merge rocksdb in datanode
> -------------------------
>
>                 Key: HDDS-3630
>                 URL: https://issues.apache.org/jira/browse/HDDS-3630
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Major
>         Attachments: Merge RocksDB in Datanode-v1.pdf, Merge RocksDB in 
> Datanode-v2.pdf
>
>
> Currently, there is one rocksdb per container, and one container has 5GB 
> capacity. 10TB of data needs more than 2000 rocksdb instances in one 
> datanode. It's difficult to limit the memory of 2000 rocksdb instances. So 
> maybe we should use a limited number of rocksdb instances for each disk.
> The design of the improvement is in the following link, but it is still a draft. 
> TODO: 
>  1. compatibility with current logic i.e. one rocksdb for each container
>  2. measure the memory usage before and after improvement
>  3. effect on efficiency of read and write.
> https://docs.google.com/document/d/18Ybg-NjyU602c-MYXaJHP6yrg-dVMZKGyoK5C_pp1mM/edit#



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
