Jun, we are already past the retention period, so we can't go back and run DumpLogSegments. Plus, there are other factors that make this exercise difficult: 1) this topic has very high traffic volume, and 2) we don't know the offset of the corrupted message.
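(For the record, in case we hit this again while the segments are still on disk: my understanding is that the check Jun suggested would look something like the line below, with the log path a placeholder for the real partition directory.)

    bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration \
        --files /data/kafka-logs/<topic_name>-0/00000000000000000000.log

It prints each message's offset, CRC, and an isvalid flag, so an on-disk corruption would show up directly.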
Anyhow, it doesn't happen often, but can you advise the proper action/handling in this case? Are there any other exceptions from iterator.next() that we should handle?

Thanks,
Steven

On Wed, Feb 4, 2015 at 9:33 PM, Jun Rao <j...@confluent.io> wrote:
> 1) Does the corruption happen with the console consumer as well? If so,
> could you run the DumpLogSegments tool to see if the data is corrupted on
> disk?
>
> Thanks,
>
> Jun
>
> On Wed, Feb 4, 2015 at 9:55 AM, Steven Wu <stevenz...@gmail.com> wrote:
>
> > Hi,
> >
> > We have observed these two exceptions with consumer *iterator.next()*
> > recently and want to ask how we should handle them properly.
> >
> > *1) CRC corruption*
> > Message is corrupt (stored crc = 433657556, computed crc = 3265543163)
> >
> > I assume in this case we should just catch it and move on to the next
> > msg? Any other iterator/consumer exceptions we should catch and handle?
> >
> > *2) Unrecoverable consumer error with "Iterator is in failed state"*
> >
> > Yesterday, one of our Kafka consumers got stuck with a very large maxlag
> > and was throwing the following exception.
> >
> > 2015-02-04 08:35:19,841 ERROR KafkaConsumer-0 KafkaConsumer - Exception
> > on consuming kafka with topic: <topic_name>
> > java.lang.IllegalStateException: Iterator is in failed state
> > at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:54)
> > at kafka.utils.IteratorTemplate.next(IteratorTemplate.scala:38)
> > at kafka.consumer.ConsumerIterator.next(ConsumerIterator.scala:46)
> > at com.netflix.suro.input.kafka.KafkaConsumer$1.run(KafkaConsumer.java:103)
> > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:745)
> >
> > We had a surge of traffic on <topic_name>, so I guess the traffic storm
> > caused the problem. I tried to restart a few consumer instances, but
> > after rebalancing, another instance got assigned the problematic
> > partitions and got stuck again with the above errors.
> >
> > We decided to drop messages: we stopped all consumer instances, reset
> > all offsets by deleting the ZK entries, and restarted them, and the
> > problem went away.
> >
> > Producer version is kafka_2.8.2-0.8.1.1 with snappy-java-1.0.5
> > Consumer version is kafka_2.9.2-0.8.2-beta with snappy-java-1.1.1.6
> >
> > We googled this issue, but it was already fixed a long time ago, in
> > 0.7.x. Any idea? Is the mismatched snappy version the culprit? Is it a
> > bug in 0.8.2-beta?
> >
> > Thanks,
> > Steven
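P.S. To make the question concrete, here is a minimal sketch of the handling we have in mind, assuming the 0.8 high-level consumer API. The group id, ZooKeeper address, and the process() handler are placeholders, not our real code:

    import java.util.Collections;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.InvalidMessageException;
    import kafka.message.MessageAndMetadata;

    public class ResilientConsumerLoop {

        // placeholder for the real message handler
        static void process(MessageAndMetadata<byte[], byte[]> m) { }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // placeholder
            props.put("group.id", "my-consumer-group");       // placeholder

            while (true) {
                // build (or rebuild) the connector and grab one stream's iterator
                ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
                ConsumerIterator<byte[], byte[]> it = connector
                    .createMessageStreams(Collections.singletonMap("<topic_name>", 1))
                    .get("<topic_name>").get(0).iterator();

                boolean rebuild = false;
                while (!rebuild) {
                    try {
                        process(it.next());
                    } catch (InvalidMessageException e) {
                        // one message failed its CRC check; log it and keep going.
                        // Note: the underlying IteratorTemplate is usually left in
                        // FAILED state after the throw, so the next call tends to
                        // land in the IllegalStateException branch below anyway.
                        System.err.println("corrupt message: " + e.getMessage());
                    } catch (IllegalStateException e) {
                        // "Iterator is in failed state": every further hasNext()/
                        // next() throws, so the only recovery we know of is to
                        // shut the connector down and create a fresh one.
                        rebuild = true;
                    }
                }
                connector.shutdown();
            }
        }
    }

With consumer.timeout.ms left at its default of -1, it.next() blocks, so the inner loop only exits through the rebuild path. Is this roughly the right approach, or is there a way to skip a corrupt message without tearing down the connector?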