[
https://issues.apache.org/jira/browse/HDDS-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated HDDS-15530:
-------------------------------
Description:
We have previously discussed storing small files in RocksDB.
The Ceph BlueStore paper
([https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf]) describes a similar
idea: small writes go to RocksDB first, and the write to disk is deferred.
{quote}For writes smaller than the minimum allocation size, both data and
metadata are first inserted to RocksDB as promises of future I/O, and then
asynchronously written to disk after the transaction commits. This deferred
write mechanism has two purposes. First, it batches small writes to increase
efficiency, because new data writes require two I/O operations whereas an
insert to RocksDB requires one. Second, it optimizes I/O based on the device
type. 64 KiB (or smaller) overwrites of a large object on an HDD are performed
asynchronously in place to avoid seeks during reads, whereas in-place
overwrites only happen for I/O sizes less than 16 KiB on SSDs
{quote}
We could do the same by writing small files directly to the block_data table
and then asynchronously writing them to the actual, separate block file.
Previously, we also considered storing the data inline in OM DB, but I think
we should always store data in datanodes rather than in OM DB: with billions of
small files, inline data could quickly overwhelm OM DB.
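To make the proposed write path concrete, here is a minimal sketch in Python.
It is not Ozone code: the class name, the method names, the ".block" file
suffix and the 64 KiB threshold are all assumptions for illustration, a plain
dict stands in for the RocksDB block_data table, and the deferred step is an
explicit call rather than a real background thread.

```python
import os
import tempfile

# Hypothetical threshold; the issue does not specify one.
SMALL_FILE_THRESHOLD = 64 * 1024


class DeferredBlockStore:
    """Toy model of a datanode that defers small writes via a KV table."""

    def __init__(self, block_dir, threshold=SMALL_FILE_THRESHOLD):
        self.block_dir = block_dir
        self.threshold = threshold
        # Stand-in for the RocksDB block_data table: block id -> inline bytes.
        self.block_data = {}

    def _block_path(self, block_id):
        return os.path.join(self.block_dir, f"{block_id}.block")

    def _write_file(self, block_id, data):
        with open(self._block_path(block_id), "wb") as f:
            f.write(data)

    def write(self, block_id, data):
        """Return 'inline' if the write was deferred, 'file' otherwise."""
        if len(data) < self.threshold:
            # Small write: one insert into the table, block file comes later.
            self.block_data[block_id] = data
            return "inline"
        self._write_file(block_id, data)
        return "file"

    def read(self, block_id):
        # Check the table first so deferred data is readable before the flush.
        if block_id in self.block_data:
            return self.block_data[block_id]
        with open(self._block_path(block_id), "rb") as f:
            return f.read()

    def flush_deferred(self):
        """Move every inline entry to its block file; return the count moved."""
        moved = 0
        for block_id in list(self.block_data):
            self._write_file(block_id, self.block_data[block_id])
            del self.block_data[block_id]
            moved += 1
        return moved


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as block_dir:
        store = DeferredBlockStore(block_dir)
        print(store.write("blk-1", b"small payload"))
        print(store.flush_deferred())
        print(store.read("blk-1"))
```

In a real datanode the flush would run in the background and would need to be
crash-safe, for example by removing the inline entry only after the block file
has been synced; the sketch leaves that out.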
was:
We have previously discussed storing small files in RocksDB.
The Ceph BlueStore paper
([https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf]) describes a similar
idea: small writes go to RocksDB first, and the write to disk is deferred.
{quote}For writes smaller than the minimum allocation size, both data and
metadata are first inserted to RocksDB as promises of future I/O, and then
asynchronously written to disk after the transaction commits. This deferred
write mechanism has two purposes. First, it batches small writes to increase
efficiency, because new data writes require two I/O operations whereas an
insert to RocksDB requires one. Second, it optimizes I/O based on the device
type. 64 KiB (or smaller) overwrites of a large object on an HDD are performed
asynchronously in place to avoid seeks during reads, whereas in-place
overwrites only happen for I/O sizes less than 16 KiB on SSDs
{quote}
We could consider this. Previously, we also considered storing the data inline
in OM DB, but I think we should always store data in datanodes rather than in
OM DB: with billions of small files, inline data could quickly overwhelm OM
DB.
> Write small file directly to RocksDB with deferred writes to disk
> -----------------------------------------------------------------
>
> Key: HDDS-15530
> URL: https://issues.apache.org/jira/browse/HDDS-15530
> Project: Apache Ozone
> Issue Type: Improvement
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major