[ https://issues.apache.org/jira/browse/HDDS-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182908#comment-17182908 ]
runzhiwang commented on HDDS-3630: ---------------------------------- In my test the capacity of a container is 5GB, ozone.container.cache.size is 1500, and there are about 25,000 blocks in each container, so there are 1500 RocksDB instances in memory when one datanode writes 7.5TB of data. The basic memory settings are -Xmx3500m and -XX:MaxDirectMemorySize=1000m.

After one datanode writes 7.5TB of data, resident memory is 13.6GB and virtual memory is 53.5GB, so the off-heap memory of RocksDB is about 9.1GB. If the block size is smaller, the off-heap memory is bigger. There are 4835 threads in the datanode, 3000 of which belong to RocksDB: 1500 rocksdb:dump_st threads and 1500 rocksdb:pst_st threads. So the off-heap memory of RocksDB consists of:
1. the block cache of RocksDB.
2. the 1500 rocksdb:dump_st threads and 1500 rocksdb:pst_st threads.

Besides merging the RocksDB instances, there are 2 other options, and both have cons.
1. Flush the RocksDB memory of closed containers. Con: thousands of RocksDB threads still exist.
2. Remove RocksDB from the datanode and store the data in files. Cons:
a. Removing RocksDB is really a big piece of work.
b. If we store all the checksums in files, then to query a checksum quickly we must load the file into memory. Because there is one file per container, the number of files grows as the number of closed containers grows, and we cannot load the files of all closed containers into memory, otherwise OOM will happen. We would have to maintain an eviction strategy, such as LRU, for the checksum files in memory; in effect we would be redoing work that RocksDB already does.
c. For open containers there is also some data that must be stored in a file and updated frequently. We cannot force a sync to disk on every update, because some of this data, such as the block count of each container, is not critical. So we would have to create a background thread to batch-sync these types of data, which looks complicated.
Every time we add this type of data we would have to repeat that work, and the code may become hard to maintain.
d. The memory reduction can be achieved by merging the RocksDB instances, which looks much easier than removing RocksDB, so maybe we need not spend so much time on removing it.

> Merge rocksdb in datanode
> -------------------------
>
> Key: HDDS-3630
> URL: https://issues.apache.org/jira/browse/HDDS-3630
> Project: Hadoop Distributed Data Store
> Issue Type: Sub-task
> Reporter: runzhiwang
> Assignee: runzhiwang
> Priority: Major
> Attachments: Merge RocksDB in Datanode-v1.pdf, Merge RocksDB in Datanode-v2.pdf
>
> Currently there is one RocksDB instance for each container, and one container has 5GB capacity, so 10TB of data needs more than 2000 RocksDB instances on one datanode. It is difficult to limit the memory of 2000 RocksDB instances, so maybe we should use a limited number of RocksDB instances for each disk.
> The design of the improvement is in the following link, but it is still a draft.
> TODO:
> 1. Compatibility with the current logic, i.e. one RocksDB for each container.
> 2. Measure the memory usage before and after the improvement.
> 3. Effect on the efficiency of reads and writes.
> https://docs.google.com/document/d/18Ybg-NjyU602c-MYXaJHP6yrg-dVMZKGyoK5C_pp1mM/edit#

-- This message was sent by Atlassian Jira (v8.3.4#803005)
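The 9.1GB off-heap figure in the comment above follows from simple subtraction: resident memory minus the configured heap limit and direct-memory limit. A quick sketch of the arithmetic, using only the numbers reported in the comment:

```python
# Back-of-the-envelope check of the memory figures reported in the comment.
container_capacity_gb = 5        # capacity of one container
open_rocksdb_instances = 1500    # ozone.container.cache.size

# Total data written when every cached container is full:
data_written_gb = container_capacity_gb * open_rocksdb_instances  # 7500 GB, i.e. 7.5 TB

# Off-heap memory attributed to RocksDB = resident set size minus the
# JVM heap limit (-Xmx3500m) and the direct-memory limit
# (-XX:MaxDirectMemorySize=1000m).
resident_gb = 13.6
heap_limit_gb = 3.5
direct_limit_gb = 1.0
rocksdb_offheap_gb = resident_gb - heap_limit_gb - direct_limit_gb  # about 9.1 GB
```

This treats the whole gap between RSS and the JVM limits as RocksDB's; in practice some of it is the stacks of the 3000 RocksDB background threads and other native allocations, which the comment lists separately.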
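The merge idea in the attached design can be illustrated independently of RocksDB itself: instead of one store per container, a single per-disk store namespaces every key with its container ID, so thousands of containers share one instance (and therefore one set of background threads and one block cache). A minimal, hypothetical sketch, with a plain dict standing in for the shared RocksDB instance; the actual key layout in the design document may differ:

```python
class PerDiskStore:
    """One store per disk; keys are namespaced by container ID.

    A plain dict stands in for the single shared RocksDB instance, so
    thousands of containers share one store instead of one instance each.
    """

    def __init__(self):
        self._db = {}  # stand-in for the shared per-disk RocksDB instance

    @staticmethod
    def _key(container_id, block_id):
        # Zero-padded container-ID prefix keeps each container's blocks
        # contiguous in sorted key order, so per-container scans stay cheap.
        return f"{container_id:020d}|{block_id}"

    def put_block(self, container_id, block_id, value):
        self._db[self._key(container_id, block_id)] = value

    def get_block(self, container_id, block_id):
        return self._db.get(self._key(container_id, block_id))

    def container_blocks(self, container_id):
        # Equivalent to a RocksDB prefix iterator over one container.
        prefix = f"{container_id:020d}|"
        return {k[len(prefix):]: v
                for k, v in sorted(self._db.items())
                if k.startswith(prefix)}

store = PerDiskStore()
store.put_block(1, "blk-1", b"checksum-a")
store.put_block(2, "blk-1", b"checksum-b")
```

With this layout, deleting a closed container becomes a range delete over its key prefix rather than tearing down a whole DB instance, and the per-disk instance count stays fixed no matter how many containers the datanode holds.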