[
https://issues.apache.org/jira/browse/HDDS-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085647#comment-18085647
]
Ivan Andika edited comment on HDDS-15463 at 6/3/26 1:46 AM:
------------------------------------------------------------
+1 on this direction. Although the Streaming Write Pipeline removes the Raft
overhead of WriteChunk by streaming the actual data, there is still Raft
overhead on the data commit (PutBlock and Client Watch Commit). The mix of Raft
and streaming is also sometimes pretty challenging to debug and reason about.
So I think we should use either the full Raft Pipeline (i.e. V1) or full
streaming (i.e. V3). The hope is that we can saturate the datanodes' I/O with
fewer pipelines than V1 and V2 need.
Just wondering whether this would be similar to CRAQ
(https://issues.apache.org/jira/browse/HDDS-12578 requires tracking) or HDFS
DataStreamer? We can take a look at
https://transactional.blog/blog/2024-data-replication-design-spectrum for the
tradeoffs. With CRAQ, the write guarantee would be stronger (all 3 nodes must
replicate the data) than with the Ratis pipeline (MAJORITY_COMMITTED only
requires the leader to apply the data and 1 follower to commit the transaction
and promise to apply the data), but the write latency can increase if one of
the datanodes is slow or stuck.
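To make the latency tradeoff above concrete, here is a minimal sketch. It is
only a toy model, not Ozone or Ratis code: the function names are hypothetical,
and each input is assumed to be the time (in ms since the start of the write)
at which a replica has the data persisted.

```python
def all_replica_ack_ms(ready_ms):
    """All-replica write (CRAQ-like): acknowledged once every replica has the data."""
    return max(ready_ms)


def majority_ack_ms(leader_ms, follower_ms):
    """Majority-style commit: the leader plus enough of the fastest followers."""
    replicas = 1 + len(follower_ms)
    # A majority is replicas // 2 + 1 nodes; the leader counts as one of them.
    needed_followers = replicas // 2
    if needed_followers == 0:
        return leader_ms
    # Wait for the slowest of the fastest `needed_followers` followers.
    kth_fastest = sorted(follower_ms)[needed_followers - 1]
    return max(leader_ms, kth_fastest)


# Hypothetical timings: one datanode is slow (400 ms).
print(all_replica_ack_ms([5, 7, 400]))  # waits for the slow node
print(majority_ack_ms(5, [7, 400]))     # the slow follower is off the critical path
```

With these made-up numbers, the all-replica write is gated by the slow
datanode, while the majority commit is not. The model ignores pipelining, chain
ordering and retries, so it only illustrates the shape of the tradeoff.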
Also, recently I came across a Ceph paper that argues that storage backends
should not be implemented on top of the OS file system
(https://dl.acm.org/doi/10.1145/3341301.3359656), which is what led Ceph to
implement its own backend (BlueStore). This might not be in scope since it
requires reworking the Ozone DN backend, but I think it's worth thinking about.
Looking forward to the design.
was (Author: JIRAUSER298977):
+1 on this direction. Although the Streaming Write Pipeline removes the Raft
overhead of WriteChunk by streaming the actual data, there is still Raft
overhead on the data commit (PutBlock and Client Watch Commit). The mix of Raft
and streaming is also sometimes pretty challenging to debug and reason about.
So I think we should support either the full Raft Pipeline (i.e. V1) or full
streaming (i.e. V3). The hope is that we can saturate the datanodes' I/O with
fewer pipelines than V1 and V2 need.
Just wondering whether this would be similar to CRAQ
(https://issues.apache.org/jira/browse/HDDS-12578) and HDFS DataStreamer? We
can take a look at
https://transactional.blog/blog/2024-data-replication-design-spectrum for the
tradeoffs.
Also, recently I came across a Ceph paper that argues that storage backends
should not be implemented on top of the OS file system
(https://dl.acm.org/doi/10.1145/3341301.3359656), which is what led Ceph to
implement its own backend (BlueStore). This might not be in scope since it
requires reworking the Ozone DN backend, but I think it's worth thinking about.
Looking forward to the design.
> Streaming Write Pipeline without Raft
> -------------------------------------
>
> Key: HDDS-15463
> URL: https://issues.apache.org/jira/browse/HDDS-15463
> Project: Apache Ozone
> Issue Type: New Feature
> Components: Ozone Client, Ozone Datanode
> Reporter: Tsz-wo Sze
> Assignee: Tsz-wo Sze
> Priority: Major
>
> - V1) Raft Pipeline: Use Raft for both WriteChunk and PutBlock
> - V2) Streaming Write Pipeline (HDDS-4454): Use Ratis streaming (RATIS-979)
> for WriteChunk and Raft for PutBlock
> - V3) Streaming Write Pipeline without Raft: Use Ratis streaming for both
> WriteChunk and PutBlock.
> We implement V3 in this JIRA. Will post a design and create subtasks.