[
https://issues.apache.org/jira/browse/HDDS-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ren Koike updated HDDS-15581:
-----------------------------
Description:
{{ozone debug replicas chunk-info}} returns incorrect {{blockData.size}} (and
chunk metadata) for EC-replicated keys. For example, for a partial-stripe key,
every internal block/replica is reported with the size (typically 1 MiB) of one
specific block on a single datanode, instead of the expected per-replica sizes.
*Steps to reproduce*
# Create an EC key with {{rs-3-2-1024k}} (or similar), e.g. size 1,148,576
bytes (between 1 MiB and 2 MiB).
# Run:
ozone debug replicas chunk-info o3://<volume>/<bucket>/<key>
# Inspect {{blockData.size}} for each entry in {{keyLocations}}.
Expected (EC 3+2, 1,148,576 bytes):
||Replica||Expected size||
|Data 1|1,048,576|
|Data 2|100,000|
|Data 3|0|
|Parity 4, 5|1,048,576 each|
Actual: all replicas show 1,048,576.
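The expected sizes in the table can be reproduced with a little arithmetic. The following is a minimal sketch, not code from Ozone: the class and method names are hypothetical, and it assumes a striped layout in which data cells are filled in order, parity cells are as long as the first data cell of the stripe, and the key fits in a single block group.

```java
import java.util.Arrays;

public class EcReplicaSizes {

    /**
     * Expected length of each internal block, data blocks first, then parity.
     * Hypothetical helper: assumes cells are filled in order across the data
     * blocks and that the whole key fits in one block group.
     */
    static long[] expectedSizes(long keyLength, int data, int parity, long cellSize) {
        long stripeSize = (long) data * cellSize;
        long fullStripes = keyLength / stripeSize;   // stripes where every data cell is full
        long remainder = keyLength % stripeSize;     // bytes in the final, partial stripe
        long[] sizes = new long[data + parity];

        for (int i = 0; i < data; i++) {
            // Bytes of the partial stripe that land in data block i (0 if none reach it).
            long partial = Math.min(cellSize, Math.max(0L, remainder - (long) i * cellSize));
            sizes[i] = fullStripes * cellSize + partial;
        }

        // Assumption: a parity cell is as long as the first (largest) data cell of its stripe.
        long parityPartial = Math.min(cellSize, remainder);
        for (int i = data; i < data + parity; i++) {
            sizes[i] = fullStripes * cellSize + parityPartial;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // rs-3-2-1024k with a 1,148,576-byte key
        System.out.println(Arrays.toString(expectedSizes(1_148_576L, 3, 2, 1_048_576L)));
        // prints [1048576, 100000, 0, 1048576, 1048576]
    }
}
```

With the bug, every entry instead shows the first value, 1,048,576.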
*Root cause*
This is a regression introduced when HDDS-13445 replaced
{{getBlockFromAllNodes()}} with a per-datanode loop that calls
{{ContainerProtocolCalls.getBlock()}}. {{getBlock()}} uses
{{tryEachDatanode()}}, which always queries the same datanode (the
pipeline’s “closest” / first node), not the datanode from the loop variable.
Each iteration then writes that datanode’s block metadata under a different
hostname/file path, duplicating the same replica’s data 5× (for EC 3+2).
*Proposed fix*
* Add {{ContainerProtocolCalls.getBlockFromDatanode(..., datanode,
replicaIndexes)}} that uses the existing private {{getBlock(..., datanode,
...)}} without {{tryEachDatanode}}.
* Use it in {{ChunkKeyHandler}} with the loop’s {{datanodeDetails}}.
Avoid restoring {{getBlockFromAllNodes()}} to prevent holding all block
metadata in memory for large keys.
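The difference between the regressed loop and the proposed one can be illustrated with a toy model. This is only a sketch under stated assumptions: the map of datanode names to block sizes, the class, and both helper methods are hypothetical stand-ins and do not reflect the real {{ContainerProtocolCalls}} signatures.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerDatanodeLookup {

    // Hypothetical stand-in for the block size each datanode holds (EC 3+2, 1,148,576-byte key).
    static final Map<String, Long> SIZE_BY_NODE = new LinkedHashMap<>();
    static {
        SIZE_BY_NODE.put("dn1", 1_048_576L); // data 1
        SIZE_BY_NODE.put("dn2", 100_000L);   // data 2
        SIZE_BY_NODE.put("dn3", 0L);         // data 3
        SIZE_BY_NODE.put("dn4", 1_048_576L); // parity 4
        SIZE_BY_NODE.put("dn5", 1_048_576L); // parity 5
    }

    // Toy version of a helper that picks the node itself: it always answers from the first one.
    static long getBlockFromFirstNode(List<String> pipeline) {
        return SIZE_BY_NODE.get(pipeline.get(0));
    }

    // Toy version of the proposed helper: it answers from the node the caller names.
    static long getBlockFromDatanode(String datanode) {
        return SIZE_BY_NODE.get(datanode);
    }

    // Regressed pattern: the loop variable is never passed to the lookup.
    static List<Long> buggyReport() {
        List<String> pipeline = new ArrayList<>(SIZE_BY_NODE.keySet());
        List<Long> sizes = new ArrayList<>();
        for (String datanode : pipeline) {
            sizes.add(getBlockFromFirstNode(pipeline));
        }
        return sizes;
    }

    // Proposed pattern: each iteration queries its own datanode, one result at a time.
    static List<Long> fixedReport() {
        List<Long> sizes = new ArrayList<>();
        for (String datanode : SIZE_BY_NODE.keySet()) {
            sizes.add(getBlockFromDatanode(datanode));
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + buggyReport());
        System.out.println("fixed: " + fixedReport());
    }
}
```

In this model the first loop reports 1,048,576 five times, while the second reports the per-replica sizes. The second loop still handles one datanode per iteration, so it does not need to hold all block metadata at once.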
> ozone debug replicas chunk-info reports wrong block size for EC keys
> --------------------------------------------------------------------
>
> Key: HDDS-15581
> URL: https://issues.apache.org/jira/browse/HDDS-15581
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ren Koike
> Assignee: Ren Koike
> Priority: Major