[
https://issues.apache.org/jira/browse/HDDS-15120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18076416#comment-18076416
]
Ivan Andika edited comment on HDDS-15120 at 4/27/26 1:41 AM:
-------------------------------------------------------------
Feasibility Verdict (from AI)
Implementing bucket forks in Apache Ozone is feasible, but not as a small
extension of snapshots. A useful prototype is very feasible; a
production-ready, zero-copy, mutable, Git-like bucket fork feature is a major
OM metadata project.
The strongest path is not “make snapshots writable.” Current snapshots are
explicitly read-oriented checkpoint DBs: OmSnapshot wraps metadata reads only,
and checkpoint metadata managers are opened read-only by default in
OmSnapshot.java (line 53) and OmMetadataManagerImpl.java (line 238). Instead,
the best prototype would be a fork bucket that stores only fork-local deltas in
active OM metadata while falling back to a retained base snapshot for unchanged
keys.
Why Ozone Is A Good Fit
Ozone already has the two ingredients forks need:
* Immutable data blocks and snapshot-aware retention. The docs say snapshots
duplicate metadata pointers, not data blocks, and retain blocks while
referenced by the live bucket or snapshots. See Ozone Snapshot docs.
* O(1) snapshot creation is already an explicit design goal in code:
OMSnapshotCreateRequest.java (line 179) avoids key-table walks, and
OMSnapshotCreateResponse.java (line 62) writes SnapshotInfo then creates the
RocksDB checkpoint.
This aligns well with the model in HDDS-15120: fork creation should be cheap,
data should be shared until changed, and forks should isolate agent writes. It
also matches external precedent: Neon branching uses isolated copy-on-write
branches, and Tigris snapshots/forks describes isolated zero-copy bucket forks.
Main Blockers
The hard part is metadata, not data blocks.
Current snapshot chains are linear. SnapshotChainManager.java (line 39)
maintains chronological global and per-bucket chains, and rejects non-linear
additions (around line 97). Forks naturally form a DAG.
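To make the linear-versus-branching distinction concrete, here is a toy sketch in Python. It is not Ozone code: LinearChain and ForkTree are hypothetical names, and the rejection rule is only an assumption about what "rejects non-linear additions" means in spirit. The tree is the simplest DAG; fork merges would need a more general graph.

```python
from typing import Dict, List, Optional


class LinearChain:
    """Toy chain (hypothetical): every node has at most one successor."""

    def __init__(self) -> None:
        self.successor: Dict[str, Optional[str]] = {}

    def add(self, node: str, prev: Optional[str] = None) -> None:
        if node in self.successor:
            raise ValueError(f"{node} already exists")
        if prev is not None:
            if prev not in self.successor:
                raise KeyError(f"unknown predecessor {prev}")
            if self.successor[prev] is not None:
                # A second child would make the chain non-linear.
                raise ValueError(f"{prev} already has a successor")
            self.successor[prev] = node
        self.successor[node] = None


class ForkTree:
    """Toy branching structure (hypothetical): a node may have many children."""

    def __init__(self) -> None:
        self.parent: Dict[str, Optional[str]] = {}
        self.children: Dict[str, List[str]] = {}

    def add(self, node: str, parent: Optional[str] = None) -> None:
        if node in self.parent:
            raise ValueError(f"{node} already exists")
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent {parent}")
        self.parent[node] = parent
        self.children[node] = []
        if parent is not None:
            self.children[parent].append(node)

    def ancestors(self, node: str) -> List[str]:
        """Walk parent pointers from node up to the root."""
        result: List[str] = []
        current = self.parent[node]
        while current is not None:
            result.append(current)
            current = self.parent[current]
        return result
```

In this sketch, adding two forks under the same snapshot succeeds in ForkTree but raises ValueError in LinearChain, which is the structural gap a fork-aware chain manager would have to close.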
GC assumes linear ancestry. SnapshotDeletingService.java (line 147) moves
deleted entries to the next active snapshot or AOS. ReclaimableKeyFilter.java
(line 76) checks prior snapshots to decide reclaimability. Forks require
reachability/refcount semantics across multiple children, not just
“previous/next.”
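A minimal Python sketch of the reachability idea follows. It assumes each holder (live bucket, snapshot, or fork) can enumerate the blocks it references. BlockRefs and its methods are hypothetical names for illustration, not part of Ozone, and a real implementation would not materialize full block sets like this.

```python
from typing import Dict, Iterable, Set


class BlockRefs:
    """Toy reachability tracker (hypothetical): holder name -> referenced blocks."""

    def __init__(self) -> None:
        self.holders: Dict[str, Set[str]] = {}

    def set_holder(self, holder: str, blocks: Iterable[str]) -> None:
        self.holders[holder] = set(blocks)

    def drop_holder(self, holder: str) -> Set[str]:
        """Remove a holder and return the blocks no remaining holder references."""
        released = self.holders.pop(holder)
        still_referenced: Set[str] = set().union(*self.holders.values())
        return released - still_referenced
```

For example, if a base snapshot holds {b1, b2}, fork-a holds {b1, b3}, and fork-b holds {b1, b2}, dropping the base snapshot frees nothing, because every one of its blocks is still reachable from a fork. A "previous/next" check along a single chain has no way to see both children.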
Existing bucket links are not forks. Link buckets resolve requests to the
source bucket, so writes would hit the parent, not an isolated fork. See
OzoneManagerUtils.java (line 146) and ResolvedBucket.java (line 106).
Existing deleted tables cannot represent fork tombstones safely. A fork
deleting a base key should hide it from the fork without freeing base blocks.
Current deleted tables are part of physical deletion flow, so fork deletes need
separate logical tombstone metadata.
S3 API support is initially awkward. Ozone snapshot management is not available
via S3 today; the docs say snapshot creation/list/delete are managed via Ozone
RPC/CLI, though snapshot data can be read through .snapshot paths. See Known
Issues.
Recommended Design Direction
Start with a delta-overlay fork bucket:
* fork create /vol/src-bucket /vol/fork-bucket --from-snapshot S creates a new
bucket with a pointer to a retained base snapshot.
* Reads check fork-local metadata first, then fall back to the base snapshot.
* Writes create normal fork-local keys.
* Deletes of base-only keys create fork tombstones, not deletedTable entries.
* Overwrites create fork-local keys plus tombstones hiding the base version.
* The base snapshot is internal or refcounted so blocks remain protected while
forks exist.
This avoids multiple writable RocksDB instances and keeps OM Ratis mostly
centered on active om.db. It still requires careful changes in key
lookup/listing, delete/rename semantics, quota accounting, bucket metadata, and
cleanup.
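The read, write, delete, and overwrite rules above can be sketched as a small in-memory overlay. This is a toy model in Python under stated assumptions: ForkBucket is a hypothetical name, plain dicts and sets stand in for OM tables, and a dict of key to block reference stands in for the retained base snapshot.

```python
from typing import Dict, List, Optional, Set


class ForkBucket:
    """Toy delta overlay (hypothetical): fork-local keys plus tombstones over a base."""

    def __init__(self, base: Dict[str, str]) -> None:
        self._base = base                  # stand-in for the base snapshot; never mutated
        self._local: Dict[str, str] = {}   # fork-local keys
        self._tombstones: Set[str] = set() # base keys hidden from this fork

    def get(self, key: str) -> Optional[str]:
        # Fork-local metadata wins; tombstones hide the base version.
        if key in self._local:
            return self._local[key]
        if key in self._tombstones:
            return None
        return self._base.get(key)

    def put(self, key: str, block_ref: str) -> None:
        self._local[key] = block_ref
        if key in self._base:
            # Overwrite: local key plus a tombstone hiding the base version.
            self._tombstones.add(key)

    def delete(self, key: str) -> bool:
        existed = self.get(key) is not None
        self._local.pop(key, None)
        if key in self._base:
            # Logical delete only: the base entry and its blocks are untouched.
            self._tombstones.add(key)
        return existed

    def list_keys(self) -> List[str]:
        visible = set(self._local)
        visible.update(k for k in self._base if k not in self._tombstones)
        return sorted(visible)
```

Two forks built over the same base dict see independent views, and the base is unchanged after either fork deletes or overwrites a key. Listing is the part that gets expensive in practice, since it has to merge two sorted sources and filter tombstones.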
I would avoid a first implementation that mutates checkpoint DBs directly. It
sounds tempting because snapshot checkpoints are already zero-copy, but it
pulls in multi-writable metadata stores, HA bootstrap, write routing, cache
eviction, compaction, lock semantics, and Ratis replay across multiple DBs.
That is the deep end.
Prototype Scope
A realistic first prototype:
* Fork from an existing active snapshot only.
* OBS bucket layout first, FSO later.
* CLI/RPC only, no S3 management API initially.
* Support get/list/put/delete/overwrite.
* Defer rename, multipart edge cases, bucket quota precision, lifecycle,
replication tooling, and fork merge.
* Keep fork base snapshots hidden from normal users but visible in admin/debug
output.
This would be enough to validate the architecture without pretending the whole
feature is done.
Bottom Line
Bucket forks are a strong fit for Ozone’s immutable-block architecture, but the
existing snapshot implementation is read-only and linear. The feature is
feasible if implemented as “new mutable bucket + base snapshot +
delta/tombstone overlay + fork-aware GC.” It becomes risky if framed as “just
writable snapshots.” No code changes were made; this was a feasibility pass
grounded in the Jira, docs, and local OM implementation.
> Support bucket forks for agentic workload
> -----------------------------------------
>
> Key: HDDS-15120
> URL: https://issues.apache.org/jira/browse/HDDS-15120
> Project: Apache Ozone
> Issue Type: New Feature
> Reporter: Ivan Andika
> Priority: Major
>
> Currently, Ozone supports bucket snapshots, which create a read-only immutable
> state of the entire bucket for use cases such as backup, replication,
> compliance, etc. This is achieved using the RocksDB checkpoint feature, which
> tracks the current SST files at that point.
> With the recent rise of agentic workloads, there is a need for storage systems
> to implement forking / branching to cater for multi-agent workloads. Unlike
> snapshots, forks can be mutated. The idea of forking and branching is similar
> to Git branches / worktrees, where a new "branch" is created based on the base
> directory. Multiple agents can fork the same base file system in parallel and
> mutate these forks without affecting each other. These forks should also be
> zero-copy, similar to snapshots (creation should only require O(1) time).
> Additionally, the lifetime of these forks can vary (a fork can be retained
> for a long time or discarded quite quickly).
> Example systems
> * NeonDB branching: https://neon.com/docs/introduction/branching
> * Tigris Object Store: https://www.tigrisdata.com/docs/snapshots-and-forks/
> (please see the related blogs on the implementations of forks).
> Ozone can consider supporting this feature. Since more systems implement a
> storage-compute separation architecture on object storage, the compute /
> caching layer can rely on Ozone as the backing store for agentic workloads
> if Ozone supports snapshots and forking (these systems would not need to
> implement snapshots and forking themselves or write complicated logic to store
> their forks' state). Ozone can then position itself as the open-source object
> store / distributed file system for agentic workloads.
> This ticket acts as a way to start a discussion in the community on this
> direction. We can start thinking about this (and probably try to start
> prototyping some ideas). This might require a radical change of the Ozone
> Manager design (e.g. it might need to introduce versioning, reference counting,
> copy-on-write, log subsystems, changes to OM deletion semantics, etc.).