[ 
https://issues.apache.org/jira/browse/HDDS-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-15530:
-------------------------------
    Description: 
We have previously discussed storing small files in RocksDB.

The Ceph BlueStore paper 
([https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf] ) describes a 
similar idea: small writes go to RocksDB first and are written to disk later 
(deferred writes):
{quote}For writes smaller than the minimum allocation size, both data and 
metadata are first inserted to RocksDB as promises of future I/O, and then 
asynchronously written to disk after the transaction commits. This deferred 
write mechanism has two purposes. First, it batches small writes to increase 
efficiency, because new data writes require two I/O operations whereas an 
insert to RocksDB requires one. Second, it optimizes I/O based on the device 
type. 64 KiB (or smaller) overwrites of a large object on an HDD are performed 
asynchronously in place to avoid seeks during reads, whereas in-place 
overwrites only happen for I/O sizes less than 16 KiB on SSDs
{quote}
We can consider this approach. Previously, we also considered storing 
small-file data inline in the OM DB, but I think we should always store data 
in datanodes rather than in the OM DB: with billions of small files, inline 
data could quickly overwhelm the OM DB.
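
To make the deferred-write idea from the quote concrete, here is a minimal toy sketch. It is not Ozone's or BlueStore's implementation: plain dicts stand in for RocksDB and for on-disk block data, and the class name, the threshold parameter, and the I/O counter are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical default threshold; the quoted paper mentions 64 KiB for HDDs.
MIN_ALLOC_SIZE = 64 * 1024


@dataclass
class DeferredWriteStore:
    """Toy model of deferred writes; dicts stand in for RocksDB and disk."""

    min_alloc_size: int = MIN_ALLOC_SIZE
    kv: dict = field(default_factory=dict)    # stand-in for RocksDB
    disk: dict = field(default_factory=dict)  # stand-in for on-disk block data
    disk_ios: int = 0                         # simulated disk write operations

    def write(self, key: str, data: bytes) -> str:
        if len(data) < self.min_alloc_size:
            # Small write: keep the data in the KV store as a promise of
            # future I/O; it reaches disk only when flush() runs.
            self.kv[key] = data
            return "deferred"
        # Large write: goes to disk directly and replaces any pending entry.
        self.disk[key] = data
        self.kv.pop(key, None)
        self.disk_ios += 1
        return "direct"

    def read(self, key: str) -> bytes:
        # Pending deferred data is newer than anything on disk, so check it first.
        if key in self.kv:
            return self.kv[key]
        return self.disk[key]

    def flush(self) -> int:
        """Move all pending small writes to disk as one batch."""
        pending = list(self.kv.items())
        if not pending:
            return 0
        for key, data in pending:
            self.disk[key] = data
            del self.kv[key]
        self.disk_ios += 1  # the whole batch is modeled as a single disk write
        return len(pending)
```

In this model, reads stay correct before the flush because pending entries are consulted first, and many small writes are charged as a single simulated disk operation when flush() runs. A real design would also need crash recovery and a device-dependent threshold, which this sketch leaves out.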

  was:
We have previously discussed storing small files in RocksDB.

When reading the Ceph BlueStore paper 
([https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf] ), there is an idea 
of writing small files to RocksDB as well with deferred writes to disk
{quote}For writes smaller than the minimum allocation size, both data and 
metadata are first inserted to RocksDB as promises of future I/O, and then 
asynchronously written to disk after the transaction commits. This deferred 
write mechanism has two purposes. First, it batches small writes to increase 
efficiency, because new data writes require two I/O operations whereas an 
insert to RocksDB requires one. Second, it optimizes I/O based on the device 
type. 64 KiB (or smaller) overwrites of a large object on an HDD are performed 
asynchronously in place to avoid seeks during reads, whereas in-place 
overwrites only happen for I/O sizes less than 16 KiB on SSDs
{quote}
We can consider this.


> Write small file data RocksDB with deferred writes
> --------------------------------------------------
>
>                 Key: HDDS-15530
>                 URL: https://issues.apache.org/jira/browse/HDDS-15530
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> We have previously discussed storing small files in RocksDB.
> The Ceph BlueStore paper 
> ([https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf] ) describes a 
> similar idea: small writes go to RocksDB first and are written to disk 
> later (deferred writes):
> {quote}For writes smaller than the minimum allocation size, both data and 
> metadata are first inserted to RocksDB as promises of future I/O, and then 
> asynchronously written to disk after the transaction commits. This deferred 
> write mechanism has two purposes. First, it batches small writes to increase 
> efficiency, because new data writes require two I/O operations whereas an 
> insert to RocksDB requires one. Second, it optimizes I/O based on the device 
> type. 64 KiB (or smaller) overwrites of a large object on an HDD are 
> performed asynchronously in place to avoid seeks during reads, whereas 
> in-place overwrites only happen for I/O sizes less than 16 KiB on SSDs
> {quote}
> We can consider this approach. Previously, we also considered storing 
> small-file data inline in the OM DB, but I think we should always store 
> data in datanodes rather than in the OM DB: with billions of small files, 
> inline data could quickly overwhelm the OM DB.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
