[
https://issues.apache.org/jira/browse/HDDS-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079956#comment-18079956
]
Ivan Andika commented on HDDS-15228:
------------------------------------
See the following for the full stack.
{code:java}
Error while running delete keys background task. Will retry at next run.
java.io.IOException: com.google.protobuf.ServiceException:
java.lang.NegativeArraySizeException: -1283966625
at
org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:76)
at
org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:133)
at
org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.deleteKeyBlocks(ScmBlockLocationProtocolClientSideTranslatorPB.java:284)
at
org.apache.hadoop.ozone.om.service.AbstractKeyDeletingService.processKeyDeletes(AbstractKeyDeletingService.java:118)
at
org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:209)
at
org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:140)
at
java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: com.google.protobuf.ServiceException:
java.lang.NegativeArraySizeException: -1283966625
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:269)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:135)
at jdk.proxy2/jdk.proxy2.$Proxy34.send(Unknown Source)
at
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:465)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:185)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:177)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:106)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:385)
at jdk.proxy2/jdk.proxy2.$Proxy34.send(Unknown Source)
at
org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:130)
... 11 more
Caused by: java.lang.NegativeArraySizeException: -1283966625
at
com.google.protobuf.CodedOutputStream.newInstance(CodedOutputStream.java:105)
at
com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:87)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$RpcProtobufRequest.writeTo(ProtobufRpcEngine.java:544)
at
org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1238)
at org.apache.hadoop.ipc.Client.call(Client.java:1531)
at org.apache.hadoop.ipc.Client.call(Client.java:1477)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:255)
... 22 more
{code}
The root cause analysis
Flow:
# OM calls ScmBlockLocationProtocolClientSideTranslatorPB.deleteKeyBlocks.
# It builds one huge DeleteScmKeyBlocksRequestProto from all BlockGroups.
# Hadoop RPC wraps it in RpcProtobufRequest.
# RpcProtobufRequest.writeTo() calls payload.writeDelimitedTo(out).
# protobuf AbstractMessageLite.writeDelimitedTo() does:
{code:java}
int serialized = getSerializedSize();
int bufferSize = CodedOutputStream.computePreferredBufferSize(
CodedOutputStream.computeRawVarint32Size(serialized) + serialized);
CodedOutputStream.newInstance(output, bufferSize);
{code}
The exception value confirms integer overflow:
{code:java}
bufferSize = -1283966625
serialized ~= -1283966630
unsigned serialized ~= 3011000666 bytes, about 2.8 GiB / 3.0 GB
{code}
So the actual payload was around 3.01 GB, but protobuf stores serialized size
in a Java int. Once the computed protobuf message size crosses
Integer.MAX_VALUE, it wraps negative. Then CodedOutputStream.newInstance(...,
negative) tries new byte[negative], causing NegativeArraySizeException.
Why Hadoop limits did not catch it:
* ipc.maximum.data.length is enforced by the server after it reads the 4-byte
RPC frame length. Here OM fails before it can send a valid framed request to
SCM.
* ipc.maximum.response.length is enforced by the client when reading a
response. Here there is no response yet.
* Hadoop RPC does not appear to have a client-side outbound request-size guard
before protobuf serialization.
ScmBlockLocationProtocolClientSideTranslatorPB.deleteKeyBlocks
-> submitRequest
-> ProtobufRpcEngine$Invoker.invoke
-> Client.call
-> Client$Connection.sendRpcRequest
-> RpcProtobufRequest.writeTo
-> payload.writeDelimitedTo
-> CodedOutputStream.newInstance
-> new byte[negative]
> KeyDeletingService limit batch deletions based on number of blocks
> ------------------------------------------------------------------
>
> Key: HDDS-15228
> URL: https://issues.apache.org/jira/browse/HDDS-15228
> Project: Apache Ozone
> Issue Type: Improvement
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> We encountered the following NegativeArraySizeException issue which
> triggered unnecessary failovers which increases the latency.
> {code:java}
> com.google.protobuf.ServiceException: java.lang.NegativeArraySizeException:
> -1273201896, while invoking $Proxy34.send over
> nodeId=scm4,nodeAddress=<redacted> after 256 failover attempts. Trying to
> failover after sleeping for 2000ms.
> {code}
> Currently, KeyDeletingService would send deletions based on the number of
> keys (ozone.key.deleting.limit.per.task). However, some keys can have a large
> number of blocksĀ especially keys with EC where one block is assigned per
> shard (e.g. EC 6+3 will have 9 different BlockID per KeyLocationInfo compare
> to RATIS/THREE only have 1 BlockID).
> This can cause issues where a large SCM deleteKeyBlocks response causes
> Integer overflow which triggers java.lang.NegativeArraySizeException. Even
> when we set the ipc.maximum.data.length (512MB) and
> ipc.maximum.response.length (640MB) to higher value, it seems to still
> trigger the issue.
> To prevent this, we can batch the deletions based on the number of blocks.
> However, we need ensure that at least a single key is sent to deletion (even
> if breaches the number of blocks) so that the OM deletion still proceeds.
> Full stack
> {code:java}
> Error while running delete keys background task. Will retry at next run.
> java.io.IOException: com.google.protobuf.ServiceException:
> java.lang.NegativeArraySizeException: -1283966625
> at
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:76)
> at
> org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:133)
> at
> org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.deleteKeyBlocks(ScmBlockLocationProtocolClientSideTranslatorPB.java:284)
> at
> org.apache.hadoop.ozone.om.service.AbstractKeyDeletingService.processKeyDeletes(AbstractKeyDeletingService.java:118)
> at
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:209)
> at
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:140)
> at
> java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
> at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
> at
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: com.google.protobuf.ServiceException:
> java.lang.NegativeArraySizeException: -1283966625
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:269)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:135)
> at jdk.proxy2/jdk.proxy2.$Proxy34.send(Unknown Source)
> at
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> at java.base/java.lang.reflect.Method.invoke(Method.java:580)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:465)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:185)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:177)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:106)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:385)
> at jdk.proxy2/jdk.proxy2.$Proxy34.send(Unknown Source)
> at
> org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:130)
> ... 11 more
> Caused by: java.lang.NegativeArraySizeException: -1283966625
> at
> com.google.protobuf.CodedOutputStream.newInstance(CodedOutputStream.java:105)
> at
> com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:87)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$RpcProtobufRequest.writeTo(ProtobufRpcEngine.java:544)
> at
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1238)
> at org.apache.hadoop.ipc.Client.call(Client.java:1531)
> at org.apache.hadoop.ipc.Client.call(Client.java:1477)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:255)
> ... 22 more
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]