[ 
https://issues.apache.org/jira/browse/HDDS-15120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18076416#comment-18076416
 ] 

Ivan Andika edited comment on HDDS-15120 at 4/27/26 1:41 AM:
-------------------------------------------------------------

Feasibility Verdict (from AI)

Implementing bucket forks in Apache Ozone is feasible, but not as a small 
extension of snapshots. A useful prototype is very feasible; a 
production-ready, zero-copy, mutable, Git-like bucket fork feature is a major 
OM metadata project.

The strongest path is not “make snapshots writable.” Current snapshots are 
explicitly read-oriented checkpoint DBs: OmSnapshot wraps metadata reads only, 
and checkpoint metadata managers are opened read-only by default in 
OmSnapshot.java (line 53) and OmMetadataManagerImpl.java (line 238). Instead, 
the best prototype would be a fork bucket that stores only fork-local deltas in 
active OM metadata while falling back to a retained base snapshot for unchanged 
keys.

Why Ozone Is A Good Fit

Ozone already has the two ingredients forks need:

* Immutable data blocks and snapshot-aware retention. The docs say snapshots 
duplicate metadata pointers, not data blocks, and retain blocks while 
referenced by the live bucket or snapshots. See Ozone Snapshot docs.
* O(1) snapshot creation is already an explicit design goal in code: 
OMSnapshotCreateRequest.java (line 179) avoids key-table walks, and 
OMSnapshotCreateResponse.java (line 62) writes SnapshotInfo then creates the 
RocksDB checkpoint.

This aligns well with the model in HDDS-15120: fork creation should be cheap, 
data should be shared until changed, and forks should isolate agent writes. It 
also matches external precedent: Neon branching uses isolated copy-on-write 
branches, and Tigris snapshots/forks describes isolated zero-copy bucket forks.

Main Blockers

The hard part is metadata, not data blocks.

Current snapshot chains are linear. SnapshotChainManager.java (line 39) 
maintains chronological global and per-bucket chains, and rejects non-linear 
additions (around line 97). Forks naturally form a DAG.

GC assumes linear ancestry. SnapshotDeletingService.java (line 147) moves 
deleted entries to the next active snapshot or AOS. ReclaimableKeyFilter.java 
(line 76) checks prior snapshots to decide reclaimability. Forks require 
reachability/refcount semantics across multiple children, not just 
“previous/next.”
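
To make the reachability/refcount point concrete, here is a minimal sketch of one possible retention rule: a base snapshot only becomes purgeable once deletion has been requested and no fork still depends on it. The class and method names (BaseSnapshotPinsSketch, pin, unpin, requestDelete) are hypothetical and do not exist in Ozone. The sketch models only the pinning decision, not how deleted entries would be moved or how fork-local blocks are reclaimed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory model of fork-aware snapshot retention; not Ozone code. */
class BaseSnapshotPinsSketch {
  // Base snapshot id -> fork buckets that still fall back to it.
  private final Map<String, Set<String>> forksBySnapshot = new HashMap<>();
  // Snapshots whose deletion was requested by a user or admin.
  private final Set<String> deleteRequested = new HashSet<>();

  /** Registers a fork on its base snapshot; constant time per fork. */
  void pin(String snapshotId, String forkBucket) {
    forksBySnapshot.computeIfAbsent(snapshotId, k -> new HashSet<>()).add(forkBucket);
  }

  /** Removes a fork; returns true if the snapshot is purgeable afterwards. */
  boolean unpin(String snapshotId, String forkBucket) {
    Set<String> forks = forksBySnapshot.get(snapshotId);
    if (forks != null) {
      forks.remove(forkBucket);
      if (forks.isEmpty()) {
        forksBySnapshot.remove(snapshotId);
      }
    }
    return isPurgeable(snapshotId);
  }

  /** Marks the snapshot for deletion; returns true only if no fork pins it. */
  boolean requestDelete(String snapshotId) {
    deleteRequested.add(snapshotId);
    return isPurgeable(snapshotId);
  }

  /** Purgeable means: deletion was requested and no fork depends on the snapshot. */
  boolean isPurgeable(String snapshotId) {
    return deleteRequested.contains(snapshotId)
        && !forksBySnapshot.containsKey(snapshotId);
  }
}
```

With two forks pinned on snapshot S, requestDelete("S") returns false, and S only becomes purgeable when the second fork is unpinned. Counting forks per snapshot rather than references per block keeps fork creation independent of the number of keys.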

Existing bucket links are not forks. Link buckets resolve requests to the 
source bucket, so writes would hit the parent, not an isolated fork. See 
OzoneManagerUtils.java (line 146) and ResolvedBucket.java (line 106).

Existing deleted tables cannot represent fork tombstones safely. A fork 
deleting a base key should hide it from the fork without freeing base blocks. 
Current deleted tables are part of physical deletion flow, so fork deletes need 
separate logical tombstone metadata.

S3 API support is initially awkward. Ozone snapshot management is not available 
via S3 today; the docs say snapshot creation/list/delete are managed via Ozone 
RPC/CLI, though snapshot data can be read through .snapshot paths. See Known 
Issues.

Recommended Design Direction

Start with a delta-overlay fork bucket:

* fork create /vol/src-bucket /vol/fork-bucket --from-snapshot S creates a new 
bucket with a pointer to a retained base snapshot.
* Reads check fork-local metadata first, then fall back to the base snapshot.
* Writes create normal fork-local keys.
* Deletes of base-only keys create fork tombstones, not deletedTable entries.
* Overwrites create fork-local keys plus tombstones hiding the base version.
* The base snapshot is internal or refcounted so blocks remain protected while 
forks exist.

This avoids multiple writable RocksDB instances and keeps OM Ratis mostly 
centered on active om.db. It still requires careful changes in key 
lookup/listing, delete/rename semantics, quota accounting, bucket metadata, and 
cleanup.
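
The overlay rules listed above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: ForkOverlaySketch and its methods are hypothetical names, plain maps stand in for OM tables and the base snapshot DB, and a "block pointer" is just a string. It is not how Ozone stores metadata.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory model of a delta-overlay fork bucket; not Ozone code. */
class ForkOverlaySketch {
  // Read-only view of the retained base snapshot (key -> block pointer).
  private final Map<String, String> base;
  // Fork-local keys written after the fork was created.
  private final TreeMap<String, String> local = new TreeMap<>();
  // Logical tombstones hiding base keys; they never free base blocks.
  private final Set<String> tombstones = new HashSet<>();

  ForkOverlaySketch(Map<String, String> baseSnapshot) {
    this.base = Collections.unmodifiableMap(new HashMap<>(baseSnapshot));
  }

  /** Reads check fork-local metadata first, then fall back to the base. */
  Optional<String> get(String key) {
    String localValue = local.get(key);
    if (localValue != null) {
      return Optional.of(localValue);
    }
    if (tombstones.contains(key)) {
      return Optional.empty();
    }
    return Optional.ofNullable(base.get(key));
  }

  /** Writes create a fork-local key; overwrites also tombstone the base version. */
  void put(String key, String blockPointer) {
    local.put(key, blockPointer);
    if (base.containsKey(key)) {
      tombstones.add(key);
    }
  }

  /** Deletes drop fork-local data and hide base keys with a tombstone. */
  void delete(String key) {
    // A real implementation would queue only the fork-local blocks for reclaim.
    local.remove(key);
    if (base.containsKey(key)) {
      tombstones.add(key);
    }
  }

  /** Listing merges both layers in key order, skipping tombstoned base keys. */
  TreeMap<String, String> list() {
    TreeMap<String, String> merged = new TreeMap<>();
    for (Map.Entry<String, String> e : base.entrySet()) {
      if (!tombstones.contains(e.getKey())) {
        merged.put(e.getKey(), e.getValue());
      }
    }
    merged.putAll(local);
    return merged;
  }
}
```

The base map is never mutated by the fork, which mirrors the requirement that a fork delete hides a key without freeing base blocks. A real listing path would merge two sorted iterators instead of materializing the base, but the visibility rules would be the same.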

I would avoid a first implementation that mutates checkpoint DBs directly. It 
sounds tempting because snapshot checkpoints are already zero-copy, but it 
pulls in multi-writable metadata stores, HA bootstrap, write routing, cache 
eviction, compaction, lock semantics, and Ratis replay across multiple DBs. 
That is the deep end.

Prototype Scope

A realistic first prototype:

* Fork from an existing active snapshot only.
* OBS bucket layout first, FSO later.
* CLI/RPC only, no S3 management API initially.
* Support get/list/put/delete/overwrite.
* Defer rename, multipart edge cases, bucket quota precision, lifecycle, 
replication tooling, and fork merge.
* Keep fork base snapshots hidden from normal users but visible in admin/debug 
output.

This would be enough to validate the architecture without pretending the whole 
feature is done.

Bottom Line

Bucket forks are a strong fit for Ozone’s immutable-block architecture, but the 
existing snapshot implementation is read-only and linear. The feature is 
feasible if implemented as “new mutable bucket + base snapshot + 
delta/tombstone overlay + fork-aware GC.” It becomes risky if framed as “just 
writable snapshots.” No code changes were made; this was a feasibility pass 
grounded in the Jira, docs, and local OM implementation.


> Support bucket forks for agentic workload
> -----------------------------------------
>
>                 Key: HDDS-15120
>                 URL: https://issues.apache.org/jira/browse/HDDS-15120
>             Project: Apache Ozone
>          Issue Type: New Feature
>            Reporter: Ivan Andika
>            Priority: Major
>
> Currently, Ozone supports bucket snapshot which creates a read-only immutable 
> state of the entire bucket for use cases such as backup, replication, 
> compliance, etc. This is achieved using the RocksDB checkpoint feature which 
> tracks the current SST files at that point.
> With the recent rise of agentic workloads, there is a need for storage 
> systems to implement forking / branching to cater for multi-agent workloads. 
> Unlike snapshots, forks can be mutated. The idea of forking and branching is 
> similar to Git branches / worktrees, where a new "branch" is created based on 
> the base directory. Multiple agents can fork the same base file system in 
> parallel and mutate these forks without affecting each other. These forks 
> should also be zero-copy, similar to snapshots (which should only require 
> O(1) time to create). Additionally, the lifetime of these forks can vary (a 
> fork can be retained for a long time or discarded quite quickly).
> Example systems
> * NeonDB branching: https://neon.com/docs/introduction/branching
> * Tigris Object Store: https://www.tigrisdata.com/docs/snapshots-and-forks/ 
> (please see the related blogs on the implementations of forks).
> Ozone can consider supporting this feature. Since more systems implement 
> storage compute separation architecture on object storage, the compute / 
> caching layer can rely on Ozone as the backing store for agentic workloads 
> since Ozone supports snapshot and forking (they don't need to implement 
> snapshot and forking or need to write complicated logic to store their forks 
> state). Ozone can then position itself as the open-source object store / 
> distributed file system for agentic workloads.
> This ticket is meant to start a discussion in the community on this 
> direction. We can start thinking about this (and probably try to start 
> prototyping some ideas). This might require a radical change to the Ozone 
> Manager design (e.g. it might need to introduce versioning, reference 
> counting, copy-on-write, log subsystems, changes to OM deletion semantics, 
> etc.).


