[ https://issues.apache.org/jira/browse/KAFKA-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16189731#comment-16189731 ]
Stanislav Chizhov commented on KAFKA-6003:
------------------------------------------

So to summarise the impact: for a topic with finite retention and an idempotent producer there is a very high risk of not being able to:
- replace a dead broker
- scale up the cluster
- reassign partitions (BTW I did try it and it failed in the same way)

If the above is true, it renders the idempotent producer feature pretty much unusable. Please correct me if I am wrong.

> Replication Fetcher thread for a partition with no data fails to start
> ----------------------------------------------------------------------
>
>                 Key: KAFKA-6003
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6003
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.11.0.1
>            Reporter: Stanislav Chizhov
>            Assignee: Apurva Mehta
>            Priority: Blocker
>             Fix For: 1.0.0, 0.11.0.2
>
>
> If a partition of a topic with an idempotent producer has no data on one of the brokers, but the data does exist on the others and some of the segments for this partition have already been deleted, the replication thread responsible for this partition on the broker that has no data for it fails to start with an out-of-order sequence exception:
> {code}
> [2017-10-02 09:44:23,825] ERROR [ReplicaFetcherThread-2-4]: Error due to (kafka.server.ReplicaFetcherThread)
> kafka.common.KafkaException: error processing data for partition [stage.data.adevents.v2,20] offset 1660336429
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:203)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:174)
>         at scala.Option.foreach(Option.scala:257)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:174)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:171)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:171)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:171)
>         at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:171)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:169)
>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
> Caused by: org.apache.kafka.common.errors.OutOfOrderSequenceException: Invalid sequence number for new epoch: 0 (request epoch), 154277489 (seq. number)
> {code}
> We run Kafka 0.11.0.1 and we ran into a situation where one of the replication threads was stopped for a few days, while everything else on that broker was functional. This is our staging cluster and retention is less than a day, so everything for the partitions whose replication thread was down was cleaned up. At the moment we have a broker which cannot start replication for a few partitions. I was also able to reproduce this in my local test environment.
> Another use case where this might cause real pain is a disk failure, or any situation where deleting all the data for a partition on a broker previously helped, since the broker would simply fetch all the data from the other replicas. That no longer works for topics with idempotent producers. It might also affect other, non-idempotent topics if they are unlucky enough to share the same replication fetcher thread.
> This seems to be caused by this logic: https://github.com/apache/kafka/blob/0.11.0.1/core/src/main/scala/kafka/log/ProducerStateManager.scala#L119 and might be fixed in the scope of https://issues.apache.org/jira/browse/KAFKA-5793.
> However, any hints on how to get those partitions back to a fully replicated state are highly appreciated.
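For reference, here is a minimal, simplified sketch (not the actual Kafka source) of the kind of epoch/sequence check that the ProducerStateManager.scala#L119 link above points at. The names (`ProducerState`, `validateSequence`) are illustrative only; the intent is to show why a replica that has lost all local producer state rejects a mid-stream batch whose first sequence number is not 0:
{code}
import org.apache.kafka.common.errors.OutOfOrderSequenceException

// Illustrative stand-in for the per-producer state a broker keeps for a partition.
case class ProducerState(epoch: Short, lastSeq: Int)

// Sketch of the validation: if the broker has no state for this producer
// (e.g. an empty replica, or one whose segments were deleted by retention),
// the batch is treated as the start of a new epoch and must begin at seq 0.
def validateSequence(local: Option[ProducerState], batchEpoch: Short, firstSeq: Int): Unit =
  local match {
    case Some(state) if state.epoch == batchEpoch =>
      // Known epoch: the sequence must directly follow the last appended one.
      if (firstSeq != state.lastSeq + 1)
        throw new OutOfOrderSequenceException(
          s"Out of order sequence number: $firstSeq (expected ${state.lastSeq + 1})")
    case _ =>
      // New or unknown epoch: only sequence 0 is accepted. A replica fetcher
      // replaying the middle of the leader's log lands here, which matches the
      // "Invalid sequence number for new epoch: 0 (request epoch),
      // 154277489 (seq. number)" failure in the stack trace above.
      if (firstSeq != 0)
        throw new OutOfOrderSequenceException(
          s"Invalid sequence number for new epoch: $batchEpoch (request epoch), $firstSeq (seq. number)")
  }
{code}
If that is a fair reading of the linked code, it also explains why wiping a replica's data no longer works as a recovery tactic for idempotent topics: the rebuilt replica has no producer state and refuses the first batch it fetches.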
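For completeness, a minimal sketch of the producer side that puts a topic into the affected class (the broker address is a placeholder; the topic name is taken from the stack trace above). Enabling idempotence is what makes brokers track per-producer epoch/sequence state for the partition in the first place:
{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")  // placeholder address
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")         // requires acks=all and retries > 0
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

// The second precondition from the comment above is finite retention on the
// topic (e.g. a short retention.ms), so that older segments get deleted while
// the partition is missing on one broker.
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("stage.data.adevents.v2", "key", "value"))
producer.close()
{code}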