[jira] [Updated] (HDDS-15271) Client should prioritize replicas with BCSID that cover the blocks

Ivan Andika (Jira) Wed, 13 May 2026 19:10:09 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-15271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ivan Andika updated HDDS-15271:
-------------------------------
    Description: 
Currently, the client prioritizes read replicas based on locality, datanode 
status (maintenance & decommission), etc. However, the client does not check 
whether the replica's BCSID (block commit sequence ID) covers the block it is 
trying to read. This causes BCSID_MISMATCH errors, which trigger failover and 
increase read latency.

The idea of this patch is to also consider the BCSID as a hint (not a 
requirement) when the client picks preferred datanodes. If a client requests a 
block with BCSID N, any datanode whose replica has BCSID >= N should be 
prioritized over those whose replica has BCSID < N. 

However, we need to note a few important things:
* We should not exclude the replicas with BCSID < N, since the known container 
replica BCSID might be stale (either the container location cache is stale or 
the container replica heartbeat has not yet been recorded by SCM) and the 
actual container replica BCSID may be higher. This means that we must still be 
able to read from replicas with BCSID < N.
* We need to treat all replicas with BCSID >= N as equal, since they all 
contain the requested blocks (even if one contains more blocks than another). 
So replica 1 with BCSID N + 1 and replica 2 with BCSID N + 2 rank the same, 
even though replica 2 is more up-to-date. This should prevent read hotspots.

We can include BCSID as one of the sorting criteria for client reads. This 
should reduce the chance of BCSID_MISMATCH and reduce read latency.
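
The ordering described above could be sketched roughly as follows. This is only 
an illustration, not the actual patch or the Ozone client API: the class 
BcsidHintSorter, the Replica type, and the prioritize method are hypothetical 
names. It also assumes the replica list has already been sorted by the existing 
criteria (locality, datanode status) and that the BCSID hint is applied on top 
of that order, which the description does not specify.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of BCSID-as-hint ordering; not the Ozone client API. */
final class BcsidHintSorter {

  /** Minimal stand-in for a container replica as known to the client. */
  static final class Replica {
    final String datanode;
    final long knownBcsid; // last BCSID known to the client; may be stale

    Replica(String datanode, long knownBcsid) {
      this.datanode = datanode;
      this.knownBcsid = knownBcsid;
    }
  }

  /**
   * Stable partition of an already-sorted replica list.
   * Replicas whose known BCSID is >= requestedBcsid come first; the rest are
   * kept at the end as fallbacks rather than excluded, because the known
   * BCSID may be stale. Inside each group the incoming order is preserved,
   * so every covering replica is treated as equal regardless of how far
   * ahead its BCSID is.
   */
  static List<Replica> prioritize(List<Replica> sorted, long requestedBcsid) {
    List<Replica> covering = new ArrayList<>();
    List<Replica> fallback = new ArrayList<>();
    for (Replica r : sorted) {
      if (r.knownBcsid >= requestedBcsid) {
        covering.add(r);
      } else {
        fallback.add(r);
      }
    }
    covering.addAll(fallback);
    return covering;
  }
}
```

For example, with replicas dn1 (BCSID 8), dn2 (BCSID 11), and dn3 (BCSID 12) in 
that order and a requested BCSID of 10, the sketch returns dn2, dn3, dn1: dn2 
stays ahead of dn3 even though dn3 has the higher BCSID, and dn1 remains 
available as a last resort.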

  was:
Currently, the client prioritizes read replicas based on locality, datanode 
status (maintenance & decommission), etc. However, the client does not check 
whether the replica's BCSID covers the block it is trying to read. This causes 
BCSID_MISMATCH errors, which trigger failover and increase read latency.

The idea of this patch is to also consider the BCSID as a hint (not a 
requirement) when the client picks preferred datanodes. If a client requests a 
block with BCSID N, any datanode whose replica has BCSID >= N should be 
prioritized over those whose replica has BCSID < N. 

However, we need to note a few important things:
* We should not exclude the replicas with BCSID < N, since the known container 
replica BCSID might be stale (either the container location cache is stale or 
the container replica heartbeat has not yet been recorded by SCM) and the 
actual container replica BCSID may be higher. This means that we must still be 
able to read from replicas with BCSID < N.
* We need to treat all replicas with BCSID >= N as equal. So replica 1 with 
BCSID N + 1 and replica 2 with BCSID N + 2 rank the same, even though replica 2 
is more up-to-date. This should prevent hotspots.

We can include BCSID as one of the sorting criteria for client reads. This 
should reduce the chance of BCSID_MISMATCH and reduce read latency.


> Client should prioritize replicas with BCSID that cover the blocks
> ------------------------------------------------------------------
>
>                 Key: HDDS-15271
>                 URL: https://issues.apache.org/jira/browse/HDDS-15271
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major


