rdhabalia opened a new pull request #5443: [pulsar-client] Fix message 
corruption on OOM for batch messages
URL: https://github.com/apache/pulsar/pull/5443
 
 
   ### Motivation
   
   With below test setup, we can see message corruption and incorrect 
batch-size for a batch message topic and consumer will see below exception:
   
   It's also related to similar issue mentioned: #1201 
   
   ```
   17:47:55.973 
[pulsar-client-io-31-1:org.apache.pulsar.client.impl.ClientCnx@257] WARN  
org.apache.pulsar.client.impl.ClientCnx - [localhost/127.0.0.1:15119] Got 
exception IndexOutOfBoundsException : readerIndex: 85, writerIndex: 1702065078 
(expected: 0 <= readerIndex <= writerIndex <= capacity(183))
   java.lang.IndexOutOfBoundsException: readerIndex: 85, writerIndex: 
1702065078 (expected: 0 <= readerIndex <= writerIndex <= capacity(183))
        at 
io.netty.buffer.AbstractByteBuf.checkIndexBounds(AbstractByteBuf.java:112) 
~[netty-all-4.1.32.Final.jar:4.1.32.Final]
        at 
io.netty.buffer.AbstractByteBuf.writerIndex(AbstractByteBuf.java:135) 
~[netty-all-4.1.32.Final.jar:4.1.32.Final]
        at 
org.apache.pulsar.common.protocol.Commands.deSerializeSingleMessageInBatch(Commands.java:1518)
 ~[classes/:?]
        at 
org.apache.pulsar.client.impl.ConsumerImpl.receiveIndividualMessagesFromBatch(ConsumerImpl.java:950)
 ~[classes/:?]
        at 
org.apache.pulsar.client.impl.ConsumerImpl.messageReceived(ConsumerImpl.java:835)
 ~[classes/:?]
        at 
org.apache.pulsar.client.impl.ClientCnx.handleMessage(ClientCnx.java:375) 
~[classes/:?]
   ```
   
   
   
   1. Create a partition topic with 400 partitions
   2. Restrict direct memory to 512m `-XX:MaxDirectMemorySize=512m` 
   3. Publish messages with large producer queue size which can cause OOM 
   `./pulsar-perf produce persistent://sample/standalone/batch/part -r 10000 -s 
10240 -o 200000 -p 100000000`
   4. unload namespace which will make client to full producer queue and cause 
OOM 
   
   With above setup, producer will have OOM while serializing individual batch 
message
   ```
   18:08:56.738 [pulsar-timer-5-1] WARN  
org.apache.pulsar.client.impl.ProducerImpl - 
[persistent://sample/standalone/batch/t2-partition-151] [standalone-7-3208] 
error while create opSendMsg by batch message container -- 
java.lang.RuntimeException: io.netty.util.internal.OutOfDirectMemoryError: 
failed to allocate 20971520 byte(s) of direct memory (used: 134217728, max: 
134217728)
   io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 
byte(s) of direct memory (used: 134217728, max: 134217728)
        at 
io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:655)
        at 
io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:610)
        at 
io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:769)
        at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:745)
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
        at io.netty.buffer.PoolArena.reallocate(PoolArena.java:397)
        at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:118)
        at 
io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:299)
        at 
io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:278)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1103)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1096)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1087)
        at 
org.apache.pulsar.common.protocol.Commands.serializeSingleMessageInBatchWithPayload(Commands.java:1471)
        at 
org.apache.pulsar.common.protocol.Commands.serializeSingleMessageInBatchWithPayload(Commands.java:1501)
        at 
org.apache.pulsar.client.impl.BatchMessageContainerImpl.getCompressedBatchMetadataAndPayload(BatchMessageContainerImpl.java:92)
        at 
org.apache.pulsar.client.impl.BatchMessageContainerImpl.createOpSendMsg(BatchMessageContainerImpl.java:160)
        at 
org.apache.pulsar.client.impl.ProducerImpl.batchMessageAndSend(ProducerImpl.java:1296)
        at 
org.apache.pulsar.client.impl.ProducerImpl.access$500(ProducerImpl.java:76)
        at 
org.apache.pulsar.client.impl.ProducerImpl$2.run(ProducerImpl.java:1253)
        at 
io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:682)
        at 
io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:757)
        at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:485)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:745)
   ```
   
   This error will corrupt 
   1. [batch-message 
buffer](https://github.com/apache/pulsar/blob/branch-2.4/pulsar-common/src/main/java/org/apache/pulsar/common/protocol/Commands.java#L1166)
 and 
   2. individual message buffer's index and also [recycles 
messageMetadata](https://github.com/apache/pulsar/blob/branch-2.4/pulsar-client/src/main/java/org/apache/pulsar/client/impl/BatchMessageContainerImpl.java#L89)
   
   It can be also reproduced with change mentioned into this 
[commit](https://github.com/apache/pulsar/commit/1411ff3e55470f3253c118f816b4cb23f14a84dc).
 It's hard to reproduce with unit-test but I will see if we can add it.
   
   ### Modification
   Reset batch-message buffer index and individual message's buffer index in 
case of failure in message serialization.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to