[ https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yannick updated KAFKA-9212:
---------------------------
    Description: 
When running the Kafka Connect S3 sink connector (Confluent 5.3.0), after one 
broker was restarted (the leaderEpoch was updated at this point), the Connect 
worker crashed with the following error:

[2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls] 
Uncaught exception in herder work thread, exiting: 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
 org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by 
times in 30003ms
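
For context, the worker's internal consumer hits this timeout while looking up offsets for its config topic. Below is a minimal sketch that exercises the same path from the public consumer API (this is not the actual Connect worker code; broker address, group id, and topic are taken from the logs in this report):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ListOffsetsRepro {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka6.fra2.internal:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "connect-ls");
        try (KafkaConsumer<byte[], byte[]> consumer =
                 new KafkaConsumer<>(props, new ByteArrayDeserializer(), new ByteArrayDeserializer())) {
            TopicPartition tp = new TopicPartition("connect_ls_config", 0);
            // endOffsets()/offsetsForTimes() send ListOffsetRequest under the hood;
            // against the affected partition they keep being answered with
            // FENCED_LEADER_EPOCH and eventually throw the TimeoutException above.
            Map<TopicPartition, Long> end = consumer.endOffsets(Collections.singleton(tp));
            System.out.println("end offsets: " + end);
        }
    }
}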

 

After investigation, it appears the worker got fenced while sending 
ListOffsetRequest in a loop and eventually timed out, as follows:

[2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, 
replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1, 
maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, 
isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 
rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)

[2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0 
failed due to FENCED_LEADER_EPOCH, retrying. 
(org.apache.kafka.clients.consumer.internals.Fetcher:985)

 

This repeats multiple times until the timeout is reached.

 

According to the debug logs, the consumer always gets a leaderEpoch of 1 for 
this topic when starting up:

[2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Updating last seen epoch from null to 1 for partition 
connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
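
For illustration, here is a simplified sketch of the epoch bookkeeping this log line refers to (illustrative only, not the actual org.apache.kafka.clients.Metadata code): the client caches the highest leader epoch it has seen per partition and attaches it to ListOffsetRequest as currentLeaderEpoch, so if the cached value is stuck at 1 while the broker is already on epoch 2, every request is fenced:

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import org.apache.kafka.common.TopicPartition;

class LastSeenEpochCache {
    private final Map<TopicPartition, Integer> lastSeenEpoch = new HashMap<>();

    // The "Updating last seen epoch from null to 1" step: keep the max epoch seen so far.
    void updateIfNewer(TopicPartition tp, int epochFromMetadata) {
        lastSeenEpoch.merge(tp, epochFromMetadata, Math::max);
    }

    // This value is what shows up as "currentLeaderEpoch: Optional[1]" in the
    // ListOffsetRequest above; as long as it never advances to 2, the broker
    // keeps answering FENCED_LEADER_EPOCH.
    Optional<Integer> currentLeaderEpoch(TopicPartition tp) {
        return Optional.ofNullable(lastSeenEpoch.get(tp));
    }
}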
  
  
But according to our broker logs, the leaderEpoch should be 2:

[2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] 
connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader 
Epoch was: 1 (kafka.cluster.Partition)

This makes it impossible to restart the worker, as it will always get fenced 
and eventually time out.
  
It is also impossible to consume with a 2.3 kafka-console-consumer, as follows:

kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic 
connect_ls_config --from-beginning

The above just hangs forever (which is not expected, since there is data in 
the topic).
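
For reference, here is roughly what that console-consumer run boils down to (a sketch using assign/seekToBeginning rather than the tool's exact internals, reusing consumer and tp from the first sketch plus java.time.Duration and ConsumerRecord imports): the earliest-offset lookup behind the seek goes through the same ListOffsets path, so the poll loop never returns a record:

consumer.assign(Collections.singletonList(tp));
consumer.seekToBeginning(Collections.singletonList(tp));
while (true) {
    // Never prints anything on the affected partition: the offset lookup
    // performed on poll() is fenced just like the requests above.
    for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(1)))
        System.out.printf("%d: %s%n", record.offset(), new String(record.value()));
}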
  
  
Interestingly, if we subscribe the same way with kafkacat (1.5.0), we can 
consume without problems (kafkacat must be consuming in a somewhat different 
way):

kafkacat -b BOOTSTRAPSERVER:9092 -t connect_ls_config -o beginning
  
  

> Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
> ------------------------------------------------------------------
>
>                 Key: KAFKA-9212
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9212
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, offset manager
>    Affects Versions: 2.3.0
>         Environment: Linux
>            Reporter: Yannick
>            Priority: Major
>

