[ 
https://issues.apache.org/jira/browse/HDDS-12903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-12903:
-------------------------------
    Description: 
Currently, when OM lists keys, it creates a RocksDB iterator, which loads both 
the RocksDB key and the value.

For a key with many blocks (for example, an MPU key with a few hundred parts), 
this can consume a considerable amount of OM heap memory. These blocks are 
loaded into OM memory even when we use listKeysLight, which removes the key 
block info.

One possible way to handle this is to split the keyspace and blockspace into 
two separate column families (CFs). The keyspace CF stores only the basic key 
info (similar to BasicOmKeyInfo), while the blockspace CF stores the blocks 
associated with the key. During list, we then load only the BasicOmKeyInfo, 
which lowers the memory overhead. The goal is to make the keyspace CF entry 
size predictable, regardless of data size. This idea can be extended to other 
attributes that can grow without bound, such as ACLs, Tags, or Metadata.
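
As a rough illustration of the split (a minimal sketch, not Ozone code), the 
two column families can be modeled as two in-memory maps. The names 
`BasicKeyInfo`, `keyspace_cf`, `blockspace_cf`, `put_key`, and 
`list_keys_light` are hypothetical, and plain Python dicts stand in for 
RocksDB column families.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicKeyInfo:
    """Small per-key metadata (hypothetical stand-in for BasicOmKeyInfo)."""
    name: str
    data_size: int
    replication: str


# Hypothetical stand-ins for the two column families.
keyspace_cf = {}    # key name -> BasicKeyInfo
blockspace_cf = {}  # key name -> list of block IDs


def put_key(name, data_size, replication, block_ids):
    """Store the basic info and the block list in separate maps."""
    keyspace_cf[name] = BasicKeyInfo(name, data_size, replication)
    blockspace_cf[name] = list(block_ids)


def list_keys_light(prefix):
    """List keys by scanning only the keyspace map; block lists are never read."""
    return [info for name, info in sorted(keyspace_cf.items())
            if name.startswith(prefix)]


put_key("vol/bucket/mpu-key", 500 * 1024, "RATIS/THREE", range(500))
put_key("vol/bucket/small-key", 1024, "RATIS/THREE", [1])
listing = list_keys_light("vol/bucket/")
```

In this model the listing cost per key depends only on the size of 
`BasicKeyInfo`, not on how many blocks the key has.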

Another possible optimization is deduplication for keys with the same data 
(e.g. copied keys): the blockspace CF entry can hold a reference count that 
increases for every unique keyspace entry that refers to the data. This would 
save some space, but might add further complexity.
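
The reference-counting idea could look roughly like the sketch below. It 
assumes an indirection that the description does not spell out: keyspace 
entries point to a data ID, and the blockspace entry for that ID carries the 
count. All names (`keyspace`, `blockspace`, `create_key`, `copy_key`, 
`delete_key`) are hypothetical, and the sketch assumes the destination of a 
copy is a new key name.

```python
# Hypothetical blockspace entries: data ID -> {"blocks": [...], "refs": int}
blockspace = {}
# Hypothetical keyspace entries: key name -> data ID
keyspace = {}


def create_key(name, data_id, blocks):
    """Create a key with its own block entry, referenced once."""
    blockspace[data_id] = {"blocks": list(blocks), "refs": 1}
    keyspace[name] = data_id


def copy_key(src, dst):
    """Copy by pointing dst at the same block entry and bumping its ref count."""
    data_id = keyspace[src]
    blockspace[data_id]["refs"] += 1
    keyspace[dst] = data_id


def delete_key(name):
    """Drop the key; the blocks are released only when the last reference goes."""
    data_id = keyspace.pop(name)
    entry = blockspace[data_id]
    entry["refs"] -= 1
    if entry["refs"] == 0:
        del blockspace[data_id]


create_key("a", "d1", [10, 11])
copy_key("a", "b")
delete_key("a")  # "b" still references d1, so the blocks stay
```

The extra complexity shows up in the delete path: block cleanup can no longer 
be decided from the keyspace entry alone.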

The downside is that we now have two CFs to update or query for key get, 
creation, update, and deletion. For get, we can use RocksDB multiget to read 
from both the keyspace and blockspace CFs. Additionally, future sharding 
implementations will be more complex.
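
The cost of the second CF can be seen in a small sketch where every write path 
touches two maps and a full get reads from both. Dicts again stand in for the 
column families, and `commit_key`, `delete_key`, and `get_key` are 
hypothetical names. In RocksDB the two writes could presumably go into a 
single write batch, since a write batch can span column families.

```python
keyspace_cf = {}    # key name -> basic key info
blockspace_cf = {}  # key name -> list of block IDs


def commit_key(name, info, blocks):
    """Write both entries; a real store would group them in one atomic batch."""
    keyspace_cf[name] = info
    blockspace_cf[name] = list(blocks)


def delete_key(name):
    """Remove the key from both maps so no orphaned block entry is left."""
    keyspace_cf.pop(name, None)
    blockspace_cf.pop(name, None)


def get_key(name):
    """Full lookup needs one read per column family (a multiget in RocksDB)."""
    info = keyspace_cf.get(name)
    if info is None:
        return None
    return info, blockspace_cf.get(name, [])


commit_key("k", {"size": 3}, [7, 8])
full = get_key("k")
```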

This is inspired by the Namespace and Block layers of the Tectonic filesystem 
([https://www.usenix.org/system/files/fast21-pan.pdf]).

Edit: This might have an additional benefit for bucket replication. For 
example, if we can keep the RocksDB SST files of the keyspace CF separate from 
those of the blockspace CF, then during bucket replication we can simply send 
the SST files of the metadata table as-is, without any change (since the 
metadata does not contain block locations). This can speed up the bucket 
metadata replication process.

  was:
Currently when OM list keys, it needs to create a RocksDB iterator which will 
load both the RocksDB key and value. 

For a key with a lot of blocks (for example MPU key with few hundred parts), 
this can take a considerable amount of OM heap memory. These blocks is even 
loaded to OM memory even when we use listKeysLight which will remove the key 
block info.

One possible way to handle this is to separate the keyspace and blockspace into 
two separate column families. The keyspace CF will only store the basic key 
info (similar to BasicOmKeyInfo), while the blockspace CF stores the blocks 
associated with the key. Therefore, during list, we will only load the 
BasicOmKeyInfo which will result in lower memory overhead. The goal is to make 
the key CF entry size to be predictable, regardless of data size. This idea can 
be extended to other features like ACLs, Tags, Metadata, or other attributes 
that might be able to increase indefinitely.

Another possible optimization would be the deduplication for keys with the same 
data (e.g. copied keys) where the blockspace CF entry can contain a reference 
count which increases for every unique keyspace entry that refers to the data. 
This will safe some space, but might cause further complexity.

The downside is that now we have two CF to update or query for key get, 
creation, update, and deletion. For get, we can use RocksDB multiget to get 
both the keyspace and blockspace CF. Additionally, future sharding 
implementations will be more complex.

This is inspired by Tectonic Filesystem Namespace and Block layer 
(https://www.usenix.org/system/files/fast21-pan.pdf).


> Separate OM namespace and blockspace into separate Column Families
> ------------------------------------------------------------------
>
>                 Key: HDDS-12903
>                 URL: https://issues.apache.org/jira/browse/HDDS-12903
>             Project: Apache Ozone
>          Issue Type: Wish
>          Components: Ozone Manager
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major


