Sergey Ivanov created KAFKA-13855:
-------------------------------------

             Summary: FileNotFoundException: Error while rolling log segment for topic partition in dir
                 Key: KAFKA-13855
                 URL: https://issues.apache.org/jira/browse/KAFKA-13855
             Project: Kafka
          Issue Type: Bug
          Components: log
    Affects Versions: 2.6.1
            Reporter: Sergey Ivanov


Hello,

We faced an issue where one of the Kafka brokers in the cluster failed with the following exception and restarted:

{code:java}
[2022-04-13T09:51:44,563][ERROR][category=kafka.server.LogDirFailureChannel] Error while rolling log segment for prod_data_topic-7 in dir /var/opt/kafka/data/1
java.io.FileNotFoundException: /var/opt/kafka/data/1/prod_data_topic-7/00000000000026872377.index (No such file or directory)
        at java.base/java.io.RandomAccessFile.open0(Native Method)
        at java.base/java.io.RandomAccessFile.open(Unknown Source)
        at java.base/java.io.RandomAccessFile.<init>(Unknown Source)
        at java.base/java.io.RandomAccessFile.<init>(Unknown Source)
        at kafka.log.AbstractIndex.$anonfun$resize$1(AbstractIndex.scala:183)
        at kafka.log.AbstractIndex.resize(AbstractIndex.scala:176)
        at kafka.log.AbstractIndex.$anonfun$trimToValidSize$1(AbstractIndex.scala:242)
        at kafka.log.AbstractIndex.trimToValidSize(AbstractIndex.scala:242)
        at kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:508)
        at kafka.log.Log.$anonfun$roll$8(Log.scala:1916)
        at kafka.log.Log.$anonfun$roll$2(Log.scala:1916)
        at kafka.log.Log.roll(Log.scala:2349)
        at kafka.log.Log.maybeRoll(Log.scala:1865)
        at kafka.log.Log.$anonfun$append$2(Log.scala:1169)
        at kafka.log.Log.append(Log.scala:2349)
        at kafka.log.Log.appendAsLeader(Log.scala:1019)
        at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:984)
        at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:972)
        at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$4(ReplicaManager.scala:883)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:273)
        at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
        at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
        at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
        at scala.collection.TraversableLike.map(TraversableLike.scala:273)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:266)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:871)
        at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:571)
        at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:605)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:132)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:70)
        at java.base/java.lang.Thread.run(Unknown Source)

[2022-04-13T09:51:44,812][ERROR][category=kafka.log.LogManager] Shutdown broker because all log dirs in /var/opt/kafka/data/1 have failed
{code}
 

There is no additional useful information in the logs, just one warning shortly before the error:

{code:java}
2022-04-13T09:51:44.720Z","filebeat-sf8sf","[2022-04-13T09:51:44,720][WARN][category=kafka.server.ReplicaManager]
 [ReplicaManager broker=1] Broker 1 stopped fetcher for partitions 
__consumer_offsets-22,prod_data_topic-5,__consumer_offsets-30,
....
prod_data_topic-0 and stopped moving logs for partitions  because they are in 
the failed log directory /var/opt/kafka/data/1." {code}
 

 

The topic configuration is:

{code:java}
/opt/kafka $ ./bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic prod_data_topic
Topic: prod_data_topic        PartitionCount: 12      ReplicationFactor: 3    Configs: min.insync.replicas=2,segment.bytes=1073741824,max.message.bytes=15728640,retention.bytes=4294967296
        Topic: prod_data_topic        Partition: 0    Leader: 3       Replicas: 3,1,2 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 1    Leader: 1       Replicas: 1,2,3 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 2    Leader: 2       Replicas: 2,3,1 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 3    Leader: 3       Replicas: 3,2,1 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 4    Leader: 1       Replicas: 1,3,2 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 5    Leader: 2       Replicas: 2,1,3 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 6    Leader: 3       Replicas: 3,2,1 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 7    Leader: 1       Replicas: 1,3,2 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 8    Leader: 2       Replicas: 2,1,3 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 9    Leader: 3       Replicas: 3,1,2 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 10   Leader: 1       Replicas: 1,2,3 Isr: 3,2,1
        Topic: prod_data_topic        Partition: 11   Leader: 2       Replicas: 2,3,1 Isr: 3,2,1
{code}

The day before this happened we changed the "retention.bytes" broker config to 5368709120 (the previous value was 6442450944), but we are not sure this is related; a sketch of how such a change is applied follows the config listing below. Our current custom broker config:

{code:java}
log.retention.check.interval.ms=300000
log.segment.bytes=1073741824
log.retention.bytes=4294967296
log.retention.hours=40

message.max.bytes=15728640
replica.lag.time.max.ms=30000
min.insync.replicas=2
delete.topic.enable=true
replica.fetch.max.bytes=15728640
default.replication.factor=3
num.replica.fetchers=2
{code}
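
For reference, a change like this is typically applied as a cluster-wide dynamic broker config with kafka-configs.sh. The following is only a sketch (not the exact command we ran), using the value from our change:

{code:java}
# Sketch only: set log.retention.bytes as a cluster-wide dynamic broker config.
# The value matches the change described above; adjust --bootstrap-server as needed.
./bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-default \
  --alter --add-config log.retention.bytes=5368709120

# Verify which dynamic broker configs are currently set:
./bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-default --describe
{code}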

Could you please help us investigate what the reason for this failure could be? We don't have any ideas so far: no topics were deleted, no log files were cleaned up manually, and no other disk maintenance was performed. The checks below show what we can inspect on the affected broker.
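
In case it helps narrow things down, these are the kinds of checks we can run (a sketch; the path and segment base offset are taken from the stack trace above):

{code:java}
# Does the partition directory from the error still exist, and what does it contain?
ls -la /var/opt/kafka/data/1/prod_data_topic-7/

# Is the exact index file from the stack trace present on disk?
ls -la /var/opt/kafka/data/1/prod_data_topic-7/00000000000026872377.index

# Any filesystem-level problems on the data volume?
df -h /var/opt/kafka/data/1
dmesg | tail -n 50
{code}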



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
