[
https://issues.apache.org/jira/browse/HDDS-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated HDDS-15228:
-------------------------------
Description:
We encountered the following NegativeArraySizeException, which triggered
unnecessary failovers and increased latency.
{code:java}
com.google.protobuf.ServiceException: java.lang.NegativeArraySizeException:
-1273201896, while invoking $Proxy34.send over
nodeId=scm4,nodeAddress=<redacted> after 256 failover attempts. Trying to
failover after sleeping for 2000ms.
{code}
Currently, KeyDeletingService limits each deletion batch by the number of keys
(ozone.key.deleting.limit.per.task). However, some keys can have a large number
of blocks, especially EC keys, where one block is assigned per shard (e.g.
EC 6+3 has 9 different BlockIDs per KeyLocationInfo, compared to RATIS/THREE,
which has only 1 BlockID).
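To make the difference concrete, here is a rough sketch of how the number of
block IDs in one batch grows with the replication type. The class and method
names are hypothetical, and the key and block-group counts in main are made-up
illustration values; only the per-KeyLocationInfo figures (9 for EC 6+3, 1 for
RATIS/THREE) come from the description above.

```java
final class BlockCountEstimate {
    private BlockCountEstimate() { }

    /**
     * Estimates how many block IDs a batch carries.
     * keys: number of keys in the batch (assumed value, e.g. the per-task key limit)
     * blockGroupsPerKey: KeyLocationInfo entries per key (assumed value)
     * blockIdsPerGroup: 9 for EC 6+3, 1 for RATIS/THREE (from the description)
     */
    static long blockIdsPerBatch(long keys, long blockGroupsPerKey, int blockIdsPerGroup) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(Math.multiplyExact(keys, blockGroupsPerKey), blockIdsPerGroup);
    }

    public static void main(String[] args) {
        long keys = 1_000;           // illustration value only
        long groupsPerKey = 100;     // illustration value only
        System.out.println("RATIS/THREE: " + blockIdsPerBatch(keys, groupsPerKey, 1));
        System.out.println("EC 6+3:      " + blockIdsPerBatch(keys, groupsPerKey, 9));
    }
}
```

With the same key limit, the EC batch carries nine times as many block IDs, so
a limit expressed in keys does not bound the message size.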
This can cause issues where a large SCM deleteKeyBlocks request (the stack
trace below fails in RpcProtobufRequest.writeTo, while the request is being
serialized) causes an integer overflow, which triggers
java.lang.NegativeArraySizeException. Even with ipc.maximum.data.length (512MB)
and ipc.maximum.response.length (640MB) set to higher values, the issue still
seems to occur.
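As an illustration of how a negative array size can arise, the sketch below
narrows a size that does not fit in a 32-bit int. The class and method names
are hypothetical, and the exact size computation inside protobuf is not shown
in this report. The value 3011000671 is only one of the sizes that wrap to the
observed -1283966625 (any value congruent to it modulo 2^32 would); it is an
assumption, not a measured message size.

```java
final class IntOverflowDemo {
    private IntOverflowDemo() { }

    /** Narrows a long size to int the way an unchecked cast does (wraps modulo 2^32). */
    static int narrow(long size) {
        return (int) size;
    }

    /** Returns true if allocating a byte[] of this size fails because the size is negative. */
    static boolean allocationFails(int size) {
        try {
            byte[] buffer = new byte[size];
            return buffer.length != size;
        } catch (NegativeArraySizeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long assumedSize = 3_011_000_671L;      // assumption: one size that wraps to the logged value
        int wrapped = narrow(assumedSize);
        System.out.println(wrapped);            // negative, because the value exceeds Integer.MAX_VALUE
        System.out.println(allocationFails(wrapped));
    }
}
```

This would also be consistent with the IPC length settings having no effect,
since a failure of this kind happens while the message is being built, before
any length limit is checked on the receiving side.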
To prevent this, we can batch the deletions based on the number of blocks.
However, we need to ensure that at least one key is sent for deletion in each
batch (even if it exceeds the block limit) so that OM deletion still makes
progress.
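A minimal sketch of this batching rule follows. It is not the actual
KeyDeletingService change: the class name, the method, and the
maxBlocksPerBatch parameter are hypothetical, and keys are represented only by
their block counts to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockLimitedBatcher {
    private BlockLimitedBatcher() { }

    /**
     * Groups key indices into batches whose total block count stays within
     * maxBlocksPerBatch. A key that alone exceeds the limit still gets a
     * batch of its own, so every batch contains at least one key.
     */
    static List<List<Integer>> batch(int[] blocksPerKey, int maxBlocksPerBatch) {
        if (maxBlocksPerBatch <= 0) {
            throw new IllegalArgumentException("maxBlocksPerBatch must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long blocksInCurrent = 0;
        for (int i = 0; i < blocksPerKey.length; i++) {
            // Close the current batch only if it already holds a key;
            // an empty batch always accepts the next key, whatever its size.
            if (!current.isEmpty() && blocksInCurrent + blocksPerKey[i] > maxBlocksPerBatch) {
                batches.add(current);
                current = new ArrayList<>();
                blocksInCurrent = 0;
            }
            current.add(i);
            blocksInCurrent += blocksPerKey[i];
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Key 2 has 20 blocks, above the limit of 8, and is sent alone.
        System.out.println(batch(new int[] {3, 4, 20, 2, 2}, 8));
    }
}
```

The oversized key is isolated rather than skipped, which keeps the deletion
queue draining even when a single key holds more blocks than the limit.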
Full stack trace:
{code:java}
Error while running delete keys background task. Will retry at next run.
java.io.IOException: com.google.protobuf.ServiceException: java.lang.NegativeArraySizeException: -1283966625
    at org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:76)
    at org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:133)
    at org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.deleteKeyBlocks(ScmBlockLocationProtocolClientSideTranslatorPB.java:284)
    at org.apache.hadoop.ozone.om.service.AbstractKeyDeletingService.processKeyDeletes(AbstractKeyDeletingService.java:118)
    at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:209)
    at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:140)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: com.google.protobuf.ServiceException: java.lang.NegativeArraySizeException: -1283966625
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:269)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:135)
    at jdk.proxy2/jdk.proxy2.$Proxy34.send(Unknown Source)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:465)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:185)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:177)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:106)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:385)
    at jdk.proxy2/jdk.proxy2.$Proxy34.send(Unknown Source)
    at org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:130)
    ... 11 more
Caused by: java.lang.NegativeArraySizeException: -1283966625
    at com.google.protobuf.CodedOutputStream.newInstance(CodedOutputStream.java:105)
    at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:87)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$RpcProtobufRequest.writeTo(ProtobufRpcEngine.java:544)
    at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1238)
    at org.apache.hadoop.ipc.Client.call(Client.java:1531)
    at org.apache.hadoop.ipc.Client.call(Client.java:1477)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:255)
    ... 22 more
{code}
was:
We encountered the following issue.
(The remainder of the previous description, including the error message and
the full stack trace, was identical to the updated description above.)
> KeyDeletingService limit batch deletions based on number of blocks
> ------------------------------------------------------------------
>
> Key: HDDS-15228
> URL: https://issues.apache.org/jira/browse/HDDS-15228
> Project: Apache Ozone
> Issue Type: Improvement
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major