[ 
https://issues.apache.org/jira/browse/HDDS-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079956#comment-18079956
 ] 

Ivan Andika commented on HDDS-15228:
------------------------------------

See the following for the full stack.

{code:java}
Error while running delete keys background task. Will retry at next run.
java.io.IOException: com.google.protobuf.ServiceException: 
java.lang.NegativeArraySizeException: -1283966625
        at 
org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:76)
        at 
org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:133)
        at 
org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.deleteKeyBlocks(ScmBlockLocationProtocolClientSideTranslatorPB.java:284)
        at 
org.apache.hadoop.ozone.om.service.AbstractKeyDeletingService.processKeyDeletes(AbstractKeyDeletingService.java:118)
        at 
org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:209)
        at 
org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:140)
        at 
java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
        at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: com.google.protobuf.ServiceException: 
java.lang.NegativeArraySizeException: -1283966625
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:269)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:135)
        at jdk.proxy2/jdk.proxy2.$Proxy34.send(Unknown Source)
        at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:465)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:185)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:177)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:106)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:385)
        at jdk.proxy2/jdk.proxy2.$Proxy34.send(Unknown Source)
        at 
org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:130)
        ... 11 more
Caused by: java.lang.NegativeArraySizeException: -1283966625
        at 
com.google.protobuf.CodedOutputStream.newInstance(CodedOutputStream.java:105)
        at 
com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:87)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$RpcProtobufRequest.writeTo(ProtobufRpcEngine.java:544)
        at 
org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1238)
        at org.apache.hadoop.ipc.Client.call(Client.java:1531)
        at org.apache.hadoop.ipc.Client.call(Client.java:1477)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:255)
        ... 22 more
{code}

The root cause analysis

Flow:

# OM calls ScmBlockLocationProtocolClientSideTranslatorPB.deleteKeyBlocks.
# It builds one huge DeleteScmKeyBlocksRequestProto from all BlockGroups.
# Hadoop RPC wraps it in RpcProtobufRequest.
# RpcProtobufRequest.writeTo() calls payload.writeDelimitedTo(out).
# protobuf AbstractMessageLite.writeDelimitedTo() does:

{code:java}
int serialized = getSerializedSize();
int bufferSize = CodedOutputStream.computePreferredBufferSize(
    CodedOutputStream.computeRawVarint32Size(serialized) + serialized);
CodedOutputStream.newInstance(output, bufferSize);
{code}

The exception value confirms integer overflow:

{code:java}
bufferSize = -1283966625
serialized ~= -1283966630
unsigned serialized ~= 3011000666 bytes, about 2.8 GiB / 3.0 GB
{code}

So the actual payload was around 3.01 GB, but protobuf stores serialized size 
in a Java int. Once the computed protobuf message size crosses 
Integer.MAX_VALUE, it wraps negative. Then CodedOutputStream.newInstance(..., 
negative) tries new byte[negative], causing NegativeArraySizeException.

Why Hadoop limits did not catch it:
* ipc.maximum.data.length is enforced by the server after it reads the 4-byte 
RPC frame length. Here OM fails before it can send a valid framed request to 
SCM.
* ipc.maximum.response.length is enforced by the client when reading a 
response. Here there is no response yet.
* Hadoop RPC does not appear to have a client-side outbound request-size guard 
before protobuf serialization.

ScmBlockLocationProtocolClientSideTranslatorPB.deleteKeyBlocks
  -> submitRequest
  -> ProtobufRpcEngine$Invoker.invoke
  -> Client.call
  -> Client$Connection.sendRpcRequest
  -> RpcProtobufRequest.writeTo
  -> payload.writeDelimitedTo
  -> CodedOutputStream.newInstance
  -> new byte[negative]

> KeyDeletingService limit batch deletions based on number of blocks
> ------------------------------------------------------------------
>
>                 Key: HDDS-15228
>                 URL: https://issues.apache.org/jira/browse/HDDS-15228
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> We encountered  the following NegativeArraySizeException issue which 
> triggered unnecessary failovers which increases the latency.
> {code:java}
> com.google.protobuf.ServiceException: java.lang.NegativeArraySizeException: 
> -1273201896, while invoking $Proxy34.send over 
> nodeId=scm4,nodeAddress=<redacted> after 256 failover attempts. Trying to 
> failover after sleeping for 2000ms.
> {code}
> Currently, KeyDeletingService would send deletions based on the number of 
> keys (ozone.key.deleting.limit.per.task). However, some keys can have a large 
> number of blocksĀ  especially keys with EC where one block is assigned per 
> shard (e.g. EC 6+3 will have 9 different BlockID per KeyLocationInfo compare 
> to RATIS/THREE only have 1 BlockID).
> This can cause issues where a large SCM deleteKeyBlocks response causes 
> Integer overflow which triggers java.lang.NegativeArraySizeException. Even 
> when we set the ipc.maximum.data.length (512MB) and 
> ipc.maximum.response.length (640MB) to higher value, it seems to still 
> trigger the issue.
> To prevent this, we can batch the deletions based on the number of blocks. 
> However, we need ensure that at least a single key is sent to deletion (even 
> if breaches the number of blocks) so that the OM deletion still proceeds.
> Full stack
> {code:java}
> Error while running delete keys background task. Will retry at next run.
> java.io.IOException: com.google.protobuf.ServiceException: 
> java.lang.NegativeArraySizeException: -1283966625
>       at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:76)
>       at 
> org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:133)
>       at 
> org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.deleteKeyBlocks(ScmBlockLocationProtocolClientSideTranslatorPB.java:284)
>       at 
> org.apache.hadoop.ozone.om.service.AbstractKeyDeletingService.processKeyDeletes(AbstractKeyDeletingService.java:118)
>       at 
> org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:209)
>       at 
> org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:140)
>       at 
> java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
>       at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
>       at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>       at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: com.google.protobuf.ServiceException: 
> java.lang.NegativeArraySizeException: -1283966625
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:269)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:135)
>       at jdk.proxy2/jdk.proxy2.$Proxy34.send(Unknown Source)
>       at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:580)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:465)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:185)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:177)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:106)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:385)
>       at jdk.proxy2/jdk.proxy2.$Proxy34.send(Unknown Source)
>       at 
> org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:130)
>       ... 11 more
> Caused by: java.lang.NegativeArraySizeException: -1283966625
>       at 
> com.google.protobuf.CodedOutputStream.newInstance(CodedOutputStream.java:105)
>       at 
> com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:87)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$RpcProtobufRequest.writeTo(ProtobufRpcEngine.java:544)
>       at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1238)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1531)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1477)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:255)
>       ... 22 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to