bschofield edited a comment on issue #6806:
URL: https://github.com/apache/pulsar/issues/6806#issuecomment-618900141


   Yes, I agree, that's the problem. In 
`Commands::deSerializeSingleMessageInBatch()` you can see that the 
uncompressed payload is expected to have this format:
   
   ```
       // Format of batch message
       // Each Message = [METADATA_SIZE][METADATA] [PAYLOAD]
   ```
   
   If you look at the dump of `uncompressedPayload` I gave above, you can see 
that **the first four bytes are 0x20201611 = 538973713**. So what I think 
has happened is that the code read a metadata size of 538973713 and tried 
to jump ahead to `readIdx_ = 538973713 + 4 = 538973717` (the extra 4 skips 
the metadata size field itself). That is off the end of the buffer, which 
causes the segfault.
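
   As a rough sketch of the arithmetic above (this is not the actual Pulsar 
client code), the following shows how a 4-byte size prefix turns into a jump 
target, and how a bounds check could reject a corrupted value instead of 
reading past the buffer. It assumes the size field is a big-endian 32-bit 
integer, which matches reading the bytes `20 20 16 11` as 0x20201611; the 
names `readMetadataSize` and `payloadOffset` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the Pulsar client: decodes the 4-byte
// metadata size at `pos`, assuming big-endian byte order.
// The caller must ensure that at least 4 bytes are available at `pos`.
std::uint32_t readMetadataSize(const std::vector<std::uint8_t>& buf, std::size_t pos) {
    return (static_cast<std::uint32_t>(buf[pos]) << 24) |
           (static_cast<std::uint32_t>(buf[pos + 1]) << 16) |
           (static_cast<std::uint32_t>(buf[pos + 2]) << 8) |
           static_cast<std::uint32_t>(buf[pos + 3]);
}

// Computes where [PAYLOAD] starts for the entry beginning at `pos`, i.e.
// pos + 4 + METADATA_SIZE. Returns false (leaving `out` untouched) if the
// size field is truncated or claims more metadata than the buffer holds.
bool payloadOffset(const std::vector<std::uint8_t>& buf, std::size_t pos, std::size_t& out) {
    const std::size_t kSizeField = 4;
    if (pos > buf.size() || buf.size() - pos < kSizeField) {
        return false;  // not enough bytes left for the size field itself
    }
    const std::size_t metadataSize = readMetadataSize(buf, pos);
    if (metadataSize > buf.size() - pos - kSizeField) {
        return false;  // size field points past the end of the buffer
    }
    out = pos + kSizeField + metadataSize;
    return true;
}
```

   With a buffer that starts with `20 20 16 11`, `readMetadataSize` yields 
538973713, and `payloadOffset` returns false unless the buffer really is at 
least 538973717 bytes long.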
   
   So, the question is: why was the uncompressed payload in this batch 
corrupted? I'm not sure, but I suspect a data corruption bug in the producer.
   
   I manually cleared the backlog on the broken topic and made two changes on 
the producer side: (1) I dropped the batch production size from 1000 to 100, 
and (2) I switched from LZ4 to Zlib compression. With those two changes in 
place I no longer see the segfault. If it happens again I will do more 
debugging and let you know.
   
   If I had kept the *compressed* payload we could check whether this 
is an LZ4 bug, but unfortunately I didn't think to take a dump of it :-(.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

