[jira] [Updated] (HDDS-15530) Write small file directly to RocksDB with deferred writes to disk

Ivan Andika (Jira) Thu, 11 Jun 2026 20:18:09 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ivan Andika updated HDDS-15530:
-------------------------------
    Description: 
We have previously discussed storing small files in RocksDB.

While reading the Ceph BlueStore paper 
([https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf] ), I came across the 
idea of also writing small files to RocksDB, with deferred writes to disk:
{quote}For writes smaller than the minimum allocation size, both data and 
metadata are first inserted to RocksDB as promises of future I/O, and then 
asynchronously written to disk after the transaction commits. This deferred 
write mechanism has two purposes. First, it batches small writes to increase 
efficiency, because new data writes require two I/O operations whereas an 
insert to RocksDB requires one. Second, it optimizes I/O based on the device 
type. 64 KiB (or smaller) overwrites of a large object on an HDD are performed 
asynchronously in place to avoid seeks during reads, whereas in-place 
overwrites only happen for I/O sizes less than 16 KiB on SSDs
{quote}
We can consider this approach by writing small files directly to the 
block_data table and then asynchronously writing them to the actual separate 
block file. Previously, we also considered whether storing data inline in the 
OM DB would be better, but I think we should always store data in datanodes 
rather than in the OM DB: with billions of small files, inline data could 
quickly overwhelm the OM DB.
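
The flow above could be sketched roughly as follows. This is only an 
illustration under stated assumptions: the class and method names are 
hypothetical and are not existing Ozone or RocksDB APIs, a HashMap stands in 
for the block_data table, and the size threshold is a placeholder for the 
"minimum allocation size" mentioned in the quote.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch of deferred small writes; not an Ozone or RocksDB API. */
public class DeferredSmallWriteStore {
    // Stand-in for the RocksDB block_data table: block ID -> inline bytes.
    private final Map<String, byte[]> inlineTable = new HashMap<>();
    // Block IDs whose inline data still has to be written to a block file.
    private final Queue<String> pendingFlush = new ArrayDeque<>();
    private final Path blockDir;
    private final int smallWriteThreshold; // assumed threshold, in bytes

    public DeferredSmallWriteStore(Path blockDir, int smallWriteThreshold) {
        this.blockDir = blockDir;
        this.smallWriteThreshold = smallWriteThreshold;
    }

    /** Small writes go to the table first; larger writes go straight to a block file. */
    public synchronized void write(String blockId, byte[] data) {
        if (data.length < smallWriteThreshold) {
            inlineTable.put(blockId, data.clone());
            if (!pendingFlush.contains(blockId)) {
                pendingFlush.add(blockId);
            }
        } else {
            // Drop any stale inline copy so reads see the new block file.
            inlineTable.remove(blockId);
            pendingFlush.remove(blockId);
            writeBlockFile(blockId, data);
        }
    }

    /** Deferred step: move pending inline data into block files. Returns the number flushed. */
    public synchronized int flushDeferred() {
        int flushed = 0;
        String blockId;
        while ((blockId = pendingFlush.poll()) != null) {
            byte[] data = inlineTable.get(blockId);
            if (data == null) {
                continue;
            }
            // Write the file before removing the inline copy, so the data
            // is always readable from one of the two places.
            writeBlockFile(blockId, data);
            inlineTable.remove(blockId);
            flushed++;
        }
        return flushed;
    }

    /** Reads prefer the inline copy and fall back to the block file. */
    public synchronized byte[] read(String blockId) {
        byte[] inline = inlineTable.get(blockId);
        if (inline != null) {
            return inline.clone();
        }
        try {
            return Files.readAllBytes(blockPath(blockId));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public synchronized boolean isInline(String blockId) {
        return inlineTable.containsKey(blockId);
    }

    private Path blockPath(String blockId) {
        return blockDir.resolve(blockId + ".block");
    }

    private void writeBlockFile(String blockId, byte[] data) {
        try {
            Files.write(blockPath(blockId), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real datanode the flush would presumably run on a background thread and 
the table update would be part of the same RocksDB transaction as the block 
metadata; both are left out here to keep the sketch short.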

  was:
We have previously discussed storing small files in RocksDB.

While reading the Ceph BlueStore paper 
([https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf] ), I came across the 
idea of also writing small files to RocksDB, with deferred writes to disk:
{quote}For writes smaller than the minimum allocation size, both data and 
metadata are first inserted to RocksDB as promises of future I/O, and then 
asynchronously written to disk after the transaction commits. This deferred 
write mechanism has two purposes. First, it batches small writes to increase 
efficiency, because new data writes require two I/O operations whereas an 
insert to RocksDB requires one. Second, it optimizes I/O based on the device 
type. 64 KiB (or smaller) overwrites of a large object on an HDD are performed 
asynchronously in place to avoid seeks during reads, whereas in-place 
overwrites only happen for I/O sizes less than 16 KiB on SSDs
{quote}
We can consider this approach. Previously, we also considered whether storing 
data inline in the OM DB would be better, but I think we should always store 
data in datanodes rather than in the OM DB: with billions of small files, 
inline data could quickly overwhelm the OM DB.


> Write small file directly to RocksDB with deferred writes to disk
> -----------------------------------------------------------------
>
>                 Key: HDDS-15530
>                 URL: https://issues.apache.org/jira/browse/HDDS-15530
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> We have previously discussed storing small files in RocksDB.
> While reading the Ceph BlueStore paper 
> ([https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf] ), I came across 
> the idea of also writing small files to RocksDB, with deferred writes to disk:
> {quote}For writes smaller than the minimum allocation size, both data and 
> metadata are first inserted to RocksDB as promises of future I/O, and then 
> asynchronously written to disk after the transaction commits. This deferred 
> write mechanism has two purposes. First, it batches small writes to increase 
> efficiency, because new data writes require two I/O operations whereas an 
> insert to RocksDB requires one. Second, it optimizes I/O based on the device 
> type. 64 KiB (or smaller) overwrites of a large object on an HDD are 
> performed asynchronously in place to avoid seeks during reads, whereas 
> in-place overwrites only happen for I/O sizes less than 16 KiB on SSDs
> {quote}
> We can consider this approach by writing small files directly to the 
> block_data table and then asynchronously writing them to the actual separate 
> block file. Previously, we also considered whether storing data inline in the 
> OM DB would be better, but I think we should always store data in datanodes 
> rather than in the OM DB: with billions of small files, inline data could 
> quickly overwhelm the OM DB.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
