[jira] [Commented] (KAFKA-7635) FetcherThread stops processing after "Error processing data for partition"

2021-07-14 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380590#comment-17380590
 ] 

Ismael Juma commented on KAFKA-7635:


Has anyone seen this issue with 2.3.0 or newer?

> FetcherThread stops processing after "Error processing data for partition"
> --
>
> Key: KAFKA-7635
> URL: https://issues.apache.org/jira/browse/KAFKA-7635
> Project: Kafka
>  Issue Type: Bug
>  Components: replication
>Affects Versions: 2.0.0
>Reporter: Steven Aerts
>Priority: Major
> Attachments: stacktraces.txt
>
>
> After disabling unclean leader election again, following recovery from a situation where we had enabled it due to a split brain in ZooKeeper, we saw that some of our brokers stopped replicating their partitions.
> Digging into the logs, we saw that the replica thread was stopped because a failure on one partition threw an [{{Error processing data for partition}} exception|https://github.com/apache/kafka/blob/2.0.0/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L207]. But the broker kept running and serving the partitions for which it was leader.
> We saw three different types of exceptions triggering this (example stacktraces attached):
> * {{kafka.common.UnexpectedAppendOffsetException}}
> * {{Trying to roll a new log segment for topic partition partition-b-97 with start offset 1388 while it already exists.}}
> * {{Kafka scheduler is not running.}}
> We think there are two acceptable ways for the Kafka broker to handle this:
> * Mark those partitions as partitions with errors and handle them accordingly, as is done [when a {{CorruptRecordException}} or {{KafkaStorageException}}|https://github.com/apache/kafka/blob/2.0.0/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L196] is thrown.
> * Exit the broker, as is done [when log truncation is not allowed|https://github.com/apache/kafka/blob/2.0.0/core/src/main/scala/kafka/server/ReplicaFetcherThread.scala#L189].
> Maybe even a combination of both. Our probably naive idea is that for the first two types the first strategy would be best, but for the last type it is probably better to re-throw a {{FatalExitError}} and exit the broker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7635) FetcherThread stops processing after "Error processing data for partition"

2021-04-30 Thread Yi Ding (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337705#comment-17337705
 ] 

Yi Ding commented on KAFKA-7635:


This bug has been fixed by KIP-461: 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-461+-+Improve+Replica+Fetcher+behavior+at+handling+partition+failure]

GitHub commit: 
[https://github.com/confluentinc/ce-kafka/commit/414852c701763b6f8362b44d156753b6c3ef247a#]

Earliest available release:

[https://github.com/confluentinc/ce-kafka/releases/tag/2.3.1]

[https://github.com/confluentinc/ce-kafka/releases/tag/2.3.1-rc2]

 

--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7635) FetcherThread stops processing after "Error processing data for partition"

2019-03-27 Thread Paul Whalen (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802888#comment-16802888
 ] 

Paul Whalen commented on KAFKA-7635:


For what it's worth, my team is also running 2.0.0 and seems to have 
encountered this error in our development environment after doing some 
maintenance work to expand the cluster.  Restarting the broker did not fix the 
issue; the replica fetcher thread would still die in short order.  We 
ultimately wiped the data directory and restarted the broker to get back to a 
healthy state.

{code:java}
ERROR [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.KafkaException: Error processing data for partition topic.a-0 offset 3395
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:207)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:172)
        at scala.Option.foreach(Option.scala:257)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:172)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:169)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:169)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:167)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: kafka.common.UnexpectedAppendOffsetException: Unexpected offset in append to topic.a-0. First offset 3389 is less than the next offset 3395. First 10 offsets in append: List(3389, 3390, 3391, 3392, 3393, 3394, 3395, 3396, 3397, 3398), last offset in append: 4945. Log start offset = 3353
        at kafka.log.Log$$anonfun$append$2.apply(Log.scala:825)
        at kafka.log.Log$$anonfun$append$2.apply(Log.scala:752)
        at kafka.log.Log.maybeHandleIOException(Log.scala:1837)
        at kafka.log.Log.append(Log.scala:752)
        at kafka.log.Log.appendAsFollower(Log.scala:733)
        at kafka.cluster.Partition$$anonfun$doAppendRecordsToFollowerOrFutureReplica$1.apply(Partition.scala:589)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
        at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:257)
        at kafka.cluster.Partition.doAppendRecordsToFollowerOrFutureReplica(Partition.scala:576)
        at kafka.cluster.Partition.appendRecordsToFollowerOrFutureReplica(Partition.scala:596)
        at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:129)
        at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:43)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:186)
        ... 13 more
{code}
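
The root-cause line is the interesting part: the first offset of the fetched batch (3389) is below the follower's next expected offset (3395). A self-contained sketch of that kind of guard, with invented names (this is not the actual {{kafka.log.Log}} code), makes the failure condition easier to see:

{code:scala}
// Illustrative only: a simplified version of the follower-side offset check
// that the exception message above points at. Names are invented for the
// sketch; the real (more involved) logic lives around kafka.log.Log.appendAsFollower.
final class UnexpectedAppendOffset(msg: String) extends RuntimeException(msg)

object FollowerAppendSketch {
  def append(nextOffset: Long, batchOffsets: Seq[Long]): Long = {
    val firstOffset = batchOffsets.head
    if (firstOffset < nextOffset)
      // The batch starts before the current log end offset, so appending it
      // blindly would duplicate or overwrite records already in the log.
      throw new UnexpectedAppendOffset(
        s"First offset $firstOffset is less than the next offset $nextOffset")
    batchOffsets.last + 1 // new log end offset after a successful append
  }

  def main(args: Array[String]): Unit = {
    // Values taken from the stack trace above: next offset 3395, batch 3389..3398.
    append(3395L, 3389L to 3398L) // throws, mirroring the logged failure
    ()
  }
}
{code}

Treat this purely as a reading aid; the real broker check handles more cases (for example batches overlapping already-present offsets) than this sketch does.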

[jira] [Commented] (KAFKA-7635) FetcherThread stops processing after "Error processing data for partition"

2018-11-18 Thread Steven Aerts (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691337#comment-16691337
 ] 

Steven Aerts commented on KAFKA-7635:
-

To give you an idea, here is a code change which we think might resolve this bug:

{code:scala}
@@ -200,12 +200,13 @@ abstract class AbstractFetcherThread(name: String,
                   // should get fixed in the subsequent fetches
                   error(s"Found invalid messages during fetch for partition $topicPartition offset ${currentPartitionFetchState.fetchOffset}", ime)
                   partitionsWithError += topicPartition
-                case e: KafkaStorageException =>
+                case e @(_ : KafkaStorageException | _ : UnexpectedAppendOffsetException) =>
                   error(s"Error while processing data for partition $topicPartition", e)
                   partitionsWithError += topicPartition
                 case e: Throwable =>
-                  throw new KafkaException(s"Error processing data for partition $topicPartition " +
+                  fatal(s"Error processing data for partition $topicPartition " +
                     s"offset ${currentPartitionFetchState.fetchOffset}", e)
+                  throw new FatalExitError()
               }
             case Errors.OFFSET_OUT_OF_RANGE =>
               try {
{code}

We did not submit this as a PR, as we are rather uncertain about which errors 
should end up in {{partitionsWithError}} and which should become fatal.
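
To make that uncertainty concrete, one possible split (purely illustrative; {{Classify}}, {{Recoverable}} and {{Fatal}} are not Kafka API and this is not a proposed patch) could look like this:

{code:scala}
// Hypothetical helper to discuss which failures stay per-partition and which
// should take the broker down. Matching on class names keeps the sketch free
// of Kafka dependencies.
sealed trait Handling
case object Recoverable extends Handling // add to partitionsWithError, keep fetching
case object Fatal       extends Handling // log and exit the broker (FatalExitError-style)

object Classify {
  def apply(e: Throwable): Handling = e.getClass.getSimpleName match {
    case "CorruptRecordException" | "KafkaStorageException" | "UnexpectedAppendOffsetException" =>
      Recoverable
    case _ =>
      // Anything unexpected, e.g. the "Kafka scheduler is not running" case,
      // is treated as fatal for the broker in this sketch.
      Fatal
  }
}
{code}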

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7635) FetcherThread stops processing after "Error processing data for partition"

2018-11-16 Thread Steven Aerts (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689278#comment-16689278
 ] 

Steven Aerts commented on KAFKA-7635:
-

Similar, but it focuses on a specific issue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)