rdhabalia opened a new pull request #5443: [pulsar-client] Fix message corruption on OOM for batch messages
URL: https://github.com/apache/pulsar/pull/5443

### Motivation

With the test setup below, we see message corruption and an incorrect batch size on a topic with batched messages, and the consumer hits the exception below. It is also related to a similar issue: #1201

```
17:47:55.973 [pulsar-client-io-31-1:org.apache.pulsar.client.impl.ClientCnx@257] WARN org.apache.pulsar.client.impl.ClientCnx - [localhost/127.0.0.1:15119] Got exception IndexOutOfBoundsException : readerIndex: 85, writerIndex: 1702065078 (expected: 0 <= readerIndex <= writerIndex <= capacity(183))
java.lang.IndexOutOfBoundsException: readerIndex: 85, writerIndex: 1702065078 (expected: 0 <= readerIndex <= writerIndex <= capacity(183))
    at io.netty.buffer.AbstractByteBuf.checkIndexBounds(AbstractByteBuf.java:112) ~[netty-all-4.1.32.Final.jar:4.1.32.Final]
    at io.netty.buffer.AbstractByteBuf.writerIndex(AbstractByteBuf.java:135) ~[netty-all-4.1.32.Final.jar:4.1.32.Final]
    at org.apache.pulsar.common.protocol.Commands.deSerializeSingleMessageInBatch(Commands.java:1518) ~[classes/:?]
    at org.apache.pulsar.client.impl.ConsumerImpl.receiveIndividualMessagesFromBatch(ConsumerImpl.java:950) ~[classes/:?]
    at org.apache.pulsar.client.impl.ConsumerImpl.messageReceived(ConsumerImpl.java:835) ~[classes/:?]
    at org.apache.pulsar.client.impl.ClientCnx.handleMessage(ClientCnx.java:375) ~[classes/:?]
```

1. Create a partitioned topic with 400 partitions
2. Restrict direct memory to 512m `-XX:MaxDirectMemorySize=512m`
3. Publish messages with a large producer queue size, which can cause OOM `./pulsar-perf produce persistent://sample/standalone/batch/part -r 10000 -s 10240 -o 200000 -p 100000000`
4. Unload the namespace, which makes the client fill up the producer queue and causes OOM

With the above setup, the producer hits OOM while serializing an individual message of a batch:

```
18:08:56.738 [pulsar-timer-5-1] WARN org.apache.pulsar.client.impl.ProducerImpl - [persistent://sample/standalone/batch/t2-partition-151] [standalone-7-3208] error while create opSendMsg by batch message container -- java.lang.RuntimeException: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 20971520 byte(s) of direct memory (used: 134217728, max: 134217728)
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 134217728, max: 134217728)
    at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:655)
    at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:610)
    at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:769)
    at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:745)
    at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
    at io.netty.buffer.PoolArena.reallocate(PoolArena.java:397)
    at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:118)
    at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:299)
    at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:278)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1103)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1096)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1087)
    at org.apache.pulsar.common.protocol.Commands.serializeSingleMessageInBatchWithPayload(Commands.java:1471)
    at org.apache.pulsar.common.protocol.Commands.serializeSingleMessageInBatchWithPayload(Commands.java:1501)
    at org.apache.pulsar.client.impl.BatchMessageContainerImpl.getCompressedBatchMetadataAndPayload(BatchMessageContainerImpl.java:92)
    at org.apache.pulsar.client.impl.BatchMessageContainerImpl.createOpSendMsg(BatchMessageContainerImpl.java:160)
    at org.apache.pulsar.client.impl.ProducerImpl.batchMessageAndSend(ProducerImpl.java:1296)
    at org.apache.pulsar.client.impl.ProducerImpl.access$500(ProducerImpl.java:76)
    at org.apache.pulsar.client.impl.ProducerImpl$2.run(ProducerImpl.java:1253)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:682)
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:757)
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:485)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:745)
```

This error corrupts:

1. the [batch-message buffer](https://github.com/apache/pulsar/blob/branch-2.4/pulsar-common/src/main/java/org/apache/pulsar/common/protocol/Commands.java#L1166), and
2. the individual message buffer's index; it also [recycles messageMetadata](https://github.com/apache/pulsar/blob/branch-2.4/pulsar-client/src/main/java/org/apache/pulsar/client/impl/BatchMessageContainerImpl.java#L89)

It can also be reproduced with the change in this [commit](https://github.com/apache/pulsar/commit/1411ff3e55470f3253c118f816b4cb23f14a84dc). It is hard to reproduce with a unit test, but I will see if we can add one.

### Modification

Reset the batch-message buffer index and each individual message's buffer index when message serialization fails.
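
The idea behind the modification can be illustrated with a minimal sketch. This is not the code from the PR: it uses `java.nio.ByteBuffer` as a stand-in for Netty's `ByteBuf`, a fixed-capacity overflow as a stand-in for the direct-memory OOM, and the class and method names (`BatchSerializationSketch`, `writeEntry`, `serializeBatch`) are hypothetical. It only shows the general pattern of remembering the buffer indexes before serialization and restoring them if serialization fails part-way, so that a retry does not see half-written state.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: ByteBuffer stands in for Netty's ByteBuf.
final class BatchSerializationSketch {

    // Appends one length-prefixed entry to the batch buffer.
    // May throw BufferOverflowException after the length prefix is already written.
    static void writeEntry(ByteBuffer batch, ByteBuffer payload) {
        batch.putInt(payload.remaining());
        batch.put(payload); // advances the position of both buffers
    }

    // Serializes all payloads into the batch buffer.
    // On failure, restores every buffer position to its value before the call.
    static boolean serializeBatch(ByteBuffer batch, List<ByteBuffer> payloads) {
        int batchStart = batch.position();
        int[] payloadStarts = new int[payloads.size()];
        for (int i = 0; i < payloads.size(); i++) {
            payloadStarts[i] = payloads.get(i).position();
        }
        try {
            for (ByteBuffer payload : payloads) {
                writeEntry(batch, payload);
            }
            return true;
        } catch (BufferOverflowException e) {
            // Reset the batch buffer index and each individual message's index.
            batch.position(batchStart);
            for (int i = 0; i < payloads.size(); i++) {
                payloads.get(i).position(payloadStarts[i]);
            }
            return false;
        }
    }

    public static void main(String[] args) {
        List<ByteBuffer> payloads = Arrays.asList(
                ByteBuffer.wrap("abcd".getBytes(StandardCharsets.UTF_8)),
                ByteBuffer.wrap("abcdefghij".getBytes(StandardCharsets.UTF_8)));

        // Too small for both entries: the second write fails part-way through.
        ByteBuffer small = ByteBuffer.allocate(16);
        boolean first = serializeBatch(small, payloads);
        System.out.println("first attempt ok=" + first + ", batch position=" + small.position());

        // The payload positions were restored, so a retry with more room can succeed.
        ByteBuffer large = ByteBuffer.allocate(64);
        boolean second = serializeBatch(large, payloads);
        System.out.println("second attempt ok=" + second + ", batch position=" + large.position());
    }
}
```

Without the restore step in the `catch` block, the first payload would stay fully consumed and the batch buffer would keep a partial entry, which is the kind of inconsistent index state the motivation above describes.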
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services