[ https://issues.apache.org/jira/browse/HDDS-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Andika updated HDDS-15228:
-------------------------------
Parent: HDDS-13775
Issue Type: Sub-task (was: Improvement)
> KeyDeletingService limit batch deletions based on number of blocks
> ------------------------------------------------------------------
>
> Key: HDDS-15228
> URL: https://issues.apache.org/jira/browse/HDDS-15228
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> We encountered the following NegativeArraySizeException, which triggered
> unnecessary failovers and increased latency.
> {code:java}
> com.google.protobuf.ServiceException: java.lang.NegativeArraySizeException:
> -1273201896, while invoking $Proxy34.send over
> nodeId=scm4,nodeAddress=<redacted> after 256 failover attempts. Trying to
> failover after sleeping for 2000ms.
> {code}
> Currently, KeyDeletingService sends deletions based on the number of
> keys (ozone.key.deleting.limit.per.task). However, some keys can have a large
> number of blocks. This is especially true of EC keys, where one block is
> assigned per shard (e.g. EC 6+3 has 9 different BlockIDs per KeyLocationInfo,
> compared to RATIS/THREE, which has only 1 BlockID).
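Here is a quick worked example of that difference. It assumes, as described above, that each KeyLocationInfo of an EC 6+3 key carries 9 BlockIDs and each RATIS/THREE KeyLocationInfo carries 1. `BlockCountEstimate` and its methods are hypothetical illustration helpers, not Ozone APIs. The 100-entry key is made up.

```java
/** Hypothetical helper: estimates how many BlockIDs a key contributes to a deletion request. */
final class BlockCountEstimate {

  /** EC d+p: one BlockID per shard per KeyLocationInfo (e.g. 6+3 gives 9). */
  static long ecBlockIds(int locationInfos, int dataShards, int parityShards) {
    return (long) locationInfos * (dataShards + parityShards);
  }

  /** RATIS/THREE: a single BlockID per KeyLocationInfo. */
  static long ratisBlockIds(int locationInfos) {
    return locationInfos;
  }

  public static void main(String[] args) {
    int locationInfos = 100; // hypothetical key with 100 KeyLocationInfo entries
    System.out.println("EC 6+3:      " + ecBlockIds(locationInfos, 6, 3)); // 900
    System.out.println("RATIS/THREE: " + ratisBlockIds(locationInfos));    // 100
  }
}
```

Under a purely key-based limit, a batch of EC 6+3 keys can therefore carry about 9 times as many BlockIDs as a batch of the same number of RATIS/THREE keys with the same layout.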
> This can cause a large SCM deleteKeyBlocks response to trigger an integer
> overflow, which in turn raises java.lang.NegativeArraySizeException. Even
> when we set ipc.maximum.data.length (512MB) and ipc.maximum.response.length
> (640MB) to higher values, the issue still seems to occur.
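The reported size can be read as a 32-bit wraparound. This is an assumption, since the trace does not show where the length was computed. Suppose -1273201896 is the true length wrapped exactly once. Reinterpreting it as unsigned then gives 3,021,765,400 bytes, about 2.8 GiB. That is above Integer.MAX_VALUE, so it cannot be represented as an int size or Java array length at all. The sketch below shows that arithmetic using plain JDK calls. `LengthOverflowDemo` and `assumedUnwrappedLength` are hypothetical names.

```java
/** Hypothetical demo: how a length above Integer.MAX_VALUE turns into a negative int. */
final class LengthOverflowDemo {

  /** Assumes the reported size is the real length wrapped around exactly once. */
  static long assumedUnwrappedLength(int reportedSize) {
    return Integer.toUnsignedLong(reportedSize);
  }

  public static void main(String[] args) {
    int reported = -1273201896; // size from the NegativeArraySizeException above
    long intended = assumedUnwrappedLength(reported);

    System.out.println(intended);                     // 3021765400
    System.out.println(intended > Integer.MAX_VALUE); // true
    System.out.println((int) intended);               // -1273201896 (narrowing cast wraps)

    // Math.toIntExact fails fast instead of silently producing a negative size.
    try {
      Math.toIntExact(intended);
    } catch (ArithmeticException e) {
      System.out.println("toIntExact rejected " + intended);
    }

    // Allocating an array with the wrapped size raises the exception from the trace.
    try {
      byte[] buffer = new byte[reported];
      System.out.println(buffer.length); // never reached
    } catch (NegativeArraySizeException e) {
      System.out.println("caught NegativeArraySizeException");
    }
  }
}
```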
> To prevent this, we can batch the deletions based on the number of blocks.
> However, we need to ensure that at least a single key is sent for deletion
> (even if it exceeds the block limit) so that OM deletion still makes progress.
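The following is a minimal sketch of block-based batching with the "at least one key per batch" guarantee. `BlockBoundedBatcher`, `PendingKey` and `maxBlocksPerBatch` are hypothetical names, not existing Ozone classes or configuration keys. A real change would presumably apply this bound alongside the existing ozone.key.deleting.limit.per.task key limit rather than replace it.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: split pending keys into batches bounded by total BlockID count.
 * A key whose block count alone exceeds the limit still gets its own batch, so each
 * run makes progress instead of skipping that key forever.
 */
final class BlockBoundedBatcher {

  /** Minimal stand-in for a pending-deletion key (not an Ozone type). */
  static final class PendingKey {
    final String name;
    final int blockCount;

    PendingKey(String name, int blockCount) {
      this.name = name;
      this.blockCount = blockCount;
    }

    @Override
    public String toString() {
      return name + "(" + blockCount + ")";
    }
  }

  static List<List<PendingKey>> batch(List<PendingKey> keys, int maxBlocksPerBatch) {
    if (maxBlocksPerBatch <= 0) {
      throw new IllegalArgumentException("maxBlocksPerBatch must be positive");
    }
    List<List<PendingKey>> batches = new ArrayList<>();
    List<PendingKey> current = new ArrayList<>();
    long currentBlocks = 0; // long so the running sum itself cannot overflow

    for (PendingKey key : keys) {
      // Close the current batch if this key would push it over the limit,
      // but never close an empty batch: the first key is always accepted.
      if (!current.isEmpty() && currentBlocks + key.blockCount > maxBlocksPerBatch) {
        batches.add(current);
        current = new ArrayList<>();
        currentBlocks = 0;
      }
      current.add(key);
      currentBlocks += key.blockCount;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<PendingKey> keys = List.of(
        new PendingKey("ratisKey", 1),
        new PendingKey("ecKeyA", 9),
        new PendingKey("hugeEcKey", 900), // exceeds the limit on its own
        new PendingKey("ecKeyB", 9));
    // prints [[ratisKey(1), ecKeyA(9)], [hugeEcKey(900)], [ecKeyB(9)]]
    System.out.println(batch(keys, 10));
  }
}
```

The oversized key is isolated in its own batch rather than dropped. This matches the requirement that OM deletion keeps moving even when one key breaches the block limit.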
> HDDS-13517 already splits the keys based on the Ratis limit, but it seems
> that if a single key is larger than the limit and cannot be split further,
> Ratis still rejects it and the pending deletion is retried. This might cause
> the deletion process to get stuck. An alternative might be to extend
> HDDS-13517 to also split a single key based on its blocks.
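Splitting a single key's BlockIDs could be as simple as chunking the list. The sketch below uses hypothetical names (`SingleKeyBlockSplitter`, `maxBlocksPerRequest`) and does not reflect how HDDS-13517 actually structures its requests.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split one key's BlockIDs into sub-requests of bounded size. */
final class SingleKeyBlockSplitter {

  static <T> List<List<T>> split(List<T> blockIds, int maxBlocksPerRequest) {
    if (maxBlocksPerRequest <= 0) {
      throw new IllegalArgumentException("maxBlocksPerRequest must be positive");
    }
    List<List<T>> chunks = new ArrayList<>();
    for (int start = 0; start < blockIds.size(); start += maxBlocksPerRequest) {
      int end = Math.min(start + maxBlocksPerRequest, blockIds.size());
      // Copy the subList view so each chunk is independent of the source list.
      chunks.add(new ArrayList<>(blockIds.subList(start, end)));
    }
    return chunks;
  }

  public static void main(String[] args) {
    // Stand-in for the 9 BlockIDs of one EC 6+3 KeyLocationInfo.
    List<Long> blockIds = new ArrayList<>();
    for (long id = 1; id <= 9; id++) {
      blockIds.add(id);
    }
    System.out.println(split(blockIds, 4)); // prints [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
  }
}
```

The harder part is bookkeeping rather than splitting. The key would presumably need to remain pending in OM until SCM has accepted every chunk, so that a partially deleted key is not purged early. That tracking is not shown here.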
--
This message was sent by Atlassian Jira
(v8.20.10#820010)