Hello,
I’m using Flink 1.4.0 with FlinkKafkaConsumer010 and have been for almost a 
year.  Recently, I started getting messages of the wrong length in Flink, 
causing my deserializer to fail.  Let me share what I’ve learned:


  1.  All of my messages are exactly 520 bytes when my producer places them in Kafka
  2.  About 1% of these messages hit this deserialization issue in Flink
  3.  When it happens, I read 10104 bytes in Flink
  4.  When I write the bytes my producer creates to a file on disk (rather than to Kafka), my code reads 520 bytes per message and consumes them from disk without issue
  5.  When I use Kafka Tool (http://www.kafkatool.com/index.html) to dump the contents of my topic to disk and read each message off disk one at a time, my code reads 520 bytes per message and consumes them without issue
  6.  When I write a simple Kafka consumer (not using Flink) and read one message at a time, each message is 520 bytes and my code runs without issue (a sketch of that consumer follows this list)
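
For reference, here is a minimal sketch along the lines of the standalone consumer in #6 (the broker address, group id, and topic name are placeholders, not my real settings): it reads raw byte[] values and flags anything that is not 520 bytes.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class SimpleLengthCheckConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker address and group id -- substitute your own.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "length-check");
            props.put("key.deserializer", ByteArrayDeserializer.class.getName());
            props.put("value.deserializer", ByteArrayDeserializer.class.getName());
            props.put("auto.offset.reset", "earliest");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
                while (true) {
                    ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
                    for (ConsumerRecord<byte[], byte[]> record : records) {
                        int len = (record.value() == null) ? -1 : record.value().length;
                        if (len != 520) {
                            // Report any record whose payload is not the expected 520 bytes.
                            System.err.printf("Unexpected length %d at partition %d offset %d%n",
                                    len, record.partition(), record.offset());
                        }
                    }
                }
            }
        }
    }

Every record that comes back through this path is 520 bytes, which is why I believe the bytes sitting in the topic are fine.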

#5 and #6 are what lead me to believe that this issue is squarely a problem 
with Flink.

However, it gets more complicated.  I took the messages I wrote out with both 
my simple consumer and Kafka Tool, loaded them into a local Kafka server, and 
attached a local Flink cluster, and I cannot reproduce the error there, yet I 
can reproduce it 100% of the time in something closer to my production environment.

I realize this last point sounds suspicious, but I have not found anything in 
the Kafka docs indicating that I might have a configuration issue here, and the 
simple local setup that would let me iterate on this and debug has failed me.

I’m really quite at a loss here.  I believe there’s a bug in the Flink Kafka 
consumer; it happens exceedingly rarely, as I went a year without seeing it, 
and I can reproduce it in an expensive environment but not in a “cheap” one.

Thank you for your time.  I can provide my sample data set in case that helps; 
I dumped it on my Google Drive: 
https://drive.google.com/file/d/1h8jpAFdkSolMrT8n47JJdS6x21nd_b7n/view?usp=sharing
That’s the full data set, and about 1% of it ends up failing.  It’s really hard 
to figure out which messages fail, since I can’t read any of the messages I 
receive and I get data out of order.
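
One idea for narrowing this down, sketched below (the class name and the hard-coded 520-byte check are placeholders, and this is written against the Flink 1.4 KeyedDeserializationSchema interface as I understand it): a pass-through schema that logs the topic, partition, offset, and length of any payload that is not 520 bytes.

    import java.io.IOException;
    import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;

    // Pass-through schema: emits the raw payload and reports any record whose
    // value is not the expected 520 bytes, together with where it came from.
    public class LengthCheckingSchema implements KeyedDeserializationSchema<byte[]> {

        @Override
        public byte[] deserialize(byte[] messageKey, byte[] message,
                                  String topic, int partition, long offset) throws IOException {
            int len = (message == null) ? -1 : message.length;
            if (len != 520) {
                System.err.printf("Bad length %d at topic=%s partition=%d offset=%d%n",
                        len, topic, partition, offset);
            }
            return message;
        }

        @Override
        public boolean isEndOfStream(byte[] nextElement) {
            return false;
        }

        @Override
        public TypeInformation<byte[]> getProducedType() {
            return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
        }
    }

Passing that to FlinkKafkaConsumer010 in place of my real deserializer should at least tell me which partitions and offsets carry the 10104-byte payloads.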

