[
https://issues.apache.org/jira/browse/HDDS-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196076#comment-17196076
]
Stephen O'Donnell commented on HDDS-3630:
-----------------------------------------
Have a look at HDDS-4246 - it seems there is only one 8MB cache shared by all
RocksDBs related to datanode containers.
Looking at the rocksDB manual, one key memory user is the "write buffer size"
https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning#write-buffer-size
{quote}
It represents the amount of data to build up in memory (backed by an unsorted
log on disk) before converting to a sorted on-disk file. The default is 64 MB.
You need to budget for 2 x your worst case memory use. If you don't have enough
memory for this, you should reduce this value. Otherwise, it is not recommended
to change this option.
{quote}
It seems to me that this default of 64MB is set up for high write throughput,
which is probably the usual use case for RocksDB. However, for datanode
containers I doubt RocksDB is really stressed, especially for closed
containers. What if we:
1. Reduced this value significantly - eg to 1MB?
2. Reduced it significantly for only closed containers?
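A minimal sketch of option 1 using the rocksdbjni bindings (the 1MB figure and
the DB path are just illustrative; this assumes the native RocksDB library is
on the classpath):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class SmallWriteBufferSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options = new Options().setCreateIfMissing(true)) {
      // Shrink the memtable from the 64MB default to 1MB. A closed
      // container sees almost no writes, so a small buffer should be
      // enough there; an open container may want a larger value.
      options.setWriteBufferSize(1L * 1024 * 1024);
      try (RocksDB db = RocksDB.open(options, "/tmp/container-db")) {
        db.put("key".getBytes(), "value".getBytes());
      }
    }
  }
}
```

Option 2 would be the same call, but choosing the buffer size based on the
container state when the DB is opened.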
There are also some other interesting Rocks DB options. You can configure a
"Write Buffer Manager" and give it a target size for all RocksDB instances /
column families related to write buffers, and then all open instances will
share this. You can also make it be part of the LRU cache:
https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager
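Roughly, the shared Write Buffer Manager would look like this in the Java API
(cache and buffer sizes are illustrative, not tuned values, and this assumes a
rocksdbjni version that exposes WriteBufferManager):

```java
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.WriteBufferManager;

public class SharedWriteBufferSketch {
  public static void main(String[] args) {
    RocksDB.loadLibrary();
    // One LRU cache shared by every RocksDB instance on the datanode.
    Cache sharedCache = new LRUCache(256L * 1024 * 1024);
    // Charge all memtable memory against the shared cache, with a 64MB
    // budget across every instance / column family.
    WriteBufferManager writeBufferManager =
        new WriteBufferManager(64L * 1024 * 1024, sharedCache);
    Options options = new Options()
        .setCreateIfMissing(true)
        .setWriteBufferManager(writeBufferManager);
    // Pass the same manager and cache to every RocksDB.open() call so
    // all container DBs draw from the one budget.
  }
}
```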
And you can have the Index and Filter blocks cached in the LRU cache too via
the option - cache_index_and_filter_blocks.
Therefore, if we created a large shared LRU cache, used a shared Write Buffer
Manager that stores the memtables inside that LRU cache, and also cached the
Index and Filter blocks there, perhaps we could constrain the RocksDB memory
within reasonable bounds.
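Putting the three pieces together, a sketch of options that route block cache,
memtables, and index/filter blocks through one shared budget (sizes are
illustrative; setBlockCache(Cache) assumes a recent rocksdbjni):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.WriteBufferManager;

public class BoundedMemorySketch {
  public static void main(String[] args) {
    RocksDB.loadLibrary();
    // Single memory budget for all RocksDB instances on the datanode.
    Cache sharedCache = new LRUCache(512L * 1024 * 1024);
    // Memtables are charged against the shared cache.
    WriteBufferManager writeBufferManager =
        new WriteBufferManager(128L * 1024 * 1024, sharedCache);
    // Index and Filter blocks also live in the shared cache rather than
    // being pinned per-instance outside it.
    BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
        .setBlockCache(sharedCache)
        .setCacheIndexAndFilterBlocks(true);
    Options options = new Options()
        .setCreateIfMissing(true)
        .setWriteBufferManager(writeBufferManager)
        .setTableFormatConfig(tableConfig);
    // Reusing these options for every container DB should keep total
    // RocksDB memory near the single cache's 512MB capacity.
  }
}
```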
It would be good to experiment with some of these options before jumping into a
major refactor to use a single RocksDB per disk or other major changes.
> Merge rocksdb in datanode
> -------------------------
>
> Key: HDDS-3630
> URL: https://issues.apache.org/jira/browse/HDDS-3630
> Project: Hadoop Distributed Data Store
> Issue Type: Sub-task
> Reporter: runzhiwang
> Assignee: runzhiwang
> Priority: Major
> Attachments: Merge RocksDB in Datanode-v1.pdf, Merge RocksDB in
> Datanode-v2.pdf
>
>
> Currently, there is one rocksdb per container. One container has 5GB
> capacity, so 10TB of data needs more than 2000 rocksdb instances on one
> datanode. It is difficult to limit the memory of 2000 rocksdb instances, so
> maybe we should limit the number of rocksdb instances per disk.
> The design of the improvement is in the following link, but it is still a draft.
> TODO:
> 1. compatibility with current logic i.e. one rocksdb for each container
> 2. measure the memory usage before and after improvement
> 3. effect on efficiency of read and write.
> https://docs.google.com/document/d/18Ybg-NjyU602c-MYXaJHP6yrg-dVMZKGyoK5C_pp1mM/edit#
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]