bschofield edited a comment on issue #6806: URL: https://github.com/apache/pulsar/issues/6806#issuecomment-618900141
Yes, I agree, that's the problem. In `Commands::deSerializeSingleMessageInBatch()` you can see that the uncompressed payload is expected to have this format:

```
// Format of batch message
// Each Message = [METADATA_SIZE][METADATA] [PAYLOAD]
```

If you look at the dump of `uncompressedPayload` I gave above, you can see that **the first four bytes are 0x20201611 = 538973713**. So what I think has happened is that the code read a metadata size of 538973713 and tried to jump ahead to `readIdx_ = 538973713 + 4 = 538973717` (the extra 4 skips over the metadata size field itself). That is far past the end of the buffer, and it causes a segfault.

So the question is: why was the uncompressed payload in this batch corrupted? I'm not sure, but I suspect a data corruption bug in the producer.

I manually cleared the backlog on the broken topic and made two changes on the producer side:

1. I reduced the producer batch size from 1000 to 100.
2. I switched from LZ4 to Zlib compression.

With those two changes in place I'm no longer seeing the segfault. If I get it again I will do more debugging and let you know.

If I had kept the *compressed* payload we could check whether this is an LZ4 bug, but unfortunately I didn't think to take a dump of it :-(.
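To make the failure mode described in the comment concrete, here is a minimal sketch of a bounds-checked read of one `[METADATA_SIZE][METADATA] [PAYLOAD]` entry. It is not the Pulsar client's code: `BatchEntryView` and `readBatchEntry` are hypothetical names, and the 4-byte big-endian size prefix is an assumption based on the way the dump bytes are read as 0x20201611 in the comment.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical description of where one batch entry's parts sit in the buffer.
struct BatchEntryView {
    std::size_t metadataOffset;  // index of the first [METADATA] byte
    std::uint32_t metadataSize;  // value read from [METADATA_SIZE]
    std::size_t payloadOffset;   // index of the first [PAYLOAD] byte
};

// Reads the 4-byte size prefix at readIdx (assumed big-endian) and returns the
// entry layout, or std::nullopt if the prefix is truncated or the declared
// metadata size would run past the end of the buffer.
std::optional<BatchEntryView> readBatchEntry(const std::vector<std::uint8_t>& buf,
                                             std::size_t readIdx) {
    constexpr std::size_t kSizeField = 4;

    // The size field itself must fit in what is left of the buffer.
    if (readIdx > buf.size() || buf.size() - readIdx < kSizeField) {
        return std::nullopt;
    }

    const std::uint32_t metadataSize =
        (static_cast<std::uint32_t>(buf[readIdx]) << 24) |
        (static_cast<std::uint32_t>(buf[readIdx + 1]) << 16) |
        (static_cast<std::uint32_t>(buf[readIdx + 2]) << 8) |
        static_cast<std::uint32_t>(buf[readIdx + 3]);

    // Compare against the bytes remaining instead of advancing blindly; a
    // corrupted prefix such as 0x20201611 is rejected here.
    const std::size_t remaining = buf.size() - readIdx - kSizeField;
    if (metadataSize > remaining) {
        return std::nullopt;
    }

    const std::size_t metadataOffset = readIdx + kSizeField;
    return BatchEntryView{metadataOffset, metadataSize, metadataOffset + metadataSize};
}
```

With a check like this, a buffer starting with the bytes `20 20 16 11` would be reported as a malformed entry rather than moving the read index to 538973717. Whether the real deserializer should drop the batch, log it, or surface an error to the consumer is a separate design decision that this sketch does not address.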