[ https://issues.apache.org/jira/browse/KAFKA-10827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246250#comment-17246250 ]

lqjacklee commented on KAFKA-10827:
-----------------------------------

[~jaebinyo] If we are in a bad network environment, will the response to the 
request we send simply take too long to arrive? 


> Consumer group coordinator node never gets updated for manual partition 
> assignment
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-10827
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10827
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 2.2.1
>            Reporter: Jaebin Yoon
>            Priority: Major
>         Attachments: KAFKA-10827.patch
>
>
> We've run into a situation where the coordinator node in the consumer never 
> gets updated with the new coordinator when the coordinator broker gets 
> replaced with a new instance. Once the consumer gets into this mode, the 
> consumer keeps trying to connect to the old coordinator and never recovers 
> unless restarted.
> This happens when the consumer uses manual partition assignment and commits 
> offsets very infrequently (every 5 minutes) and the coordinator broker is not 
> reachable (ip address, hostname are gone in a cloud environment).
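> To illustrate, a minimal sketch of the consumer pattern that hits this (the 
> manual assignment and the 5-minute commit cadence are from this report; the 
> topic, group id, and bootstrap servers are hypothetical):
> {code:java}
> import java.time.Duration;
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.clients.consumer.ConsumerRecords;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.TopicPartition;
>
> public class ManualAssignConsumer {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical
>         props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");              // hypothetical
>         props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
>         props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
>                 "org.apache.kafka.common.serialization.StringDeserializer");
>         props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
>                 "org.apache.kafka.common.serialization.StringDeserializer");
>
>         try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
>             // Manual assignment: no rebalances, so the group coordinator is
>             // only ever contacted when offsets are committed.
>             consumer.assign(Collections.singletonList(new TopicPartition("my-topic", 0)));
>             long lastCommit = System.currentTimeMillis();
>             while (true) {
>                 ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
>                 // ... process records ...
>                 if (System.currentTimeMillis() - lastCommit >= 5 * 60 * 1000L) {
>                     consumer.commitSync(); // the only call that checks the coordinator
>                     lastCommit = System.currentTimeMillis();
>                 }
>             }
>         }
>     }
> }
> {code}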
> The exception the consumer keeps getting is:
> {code:java}
> Offset commit failed with a retriable exception. You should retry committing 
> the latest consumed offsets. Caused by: 
> org.apache.kafka.common.errors.TimeoutException: Failed to send request after 
> 120000 ms.
> {code}
> In this error condition we could see many connections from the consumer app 
> to the old hostname stuck in the *SYN_SENT* TCP state.
> In the current manual partition assignment scenario, the only way for the 
> coordinator to get updated is through checkAndGetCoordinator in 
> AbstractCoordinator, but that is called only when committing offsets, which 
> in our case happens every 5 minutes.  
> The current logic of checkAndGetCoordinator relies on 
> ConsumerNetworkClient.isUnavailable, but that returns false unless the 
> network client is within its reconnect backoff window, which is configured 
> with the default values (reconnect.backoff.ms = 50, reconnect.backoff.max.ms 
> = 1000), while request.timeout.ms is 120000.  In this scenario, 
> ConsumerNetworkClient.isUnavailable for the old coordinator node always 
> returns false, so checkAndGetCoordinator keeps the old coordinator node 
> forever.
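> Simplified, the check behaves roughly as follows (a sketch of the logic 
> described above, not the actual AbstractCoordinator source):
> {code:java}
> // isUnavailable(node) only returns true while the node is inside its
> // reconnect backoff window (at most reconnect.backoff.max.ms = 1000 ms by
> // default). A request can sit pending for up to request.timeout.ms =
> // 120000 ms, during which the connection still looks "available", so the
> // stale coordinator is never cleared.
> protected synchronized Node checkAndGetCoordinator() {
>     if (coordinator != null && client.isUnavailable(coordinator)) {
>         markCoordinatorUnknown(); // never reached while a send merely times out
>         return null;
>     }
>     return this.coordinator;
> }
> {code}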
> It seems checkAndGetCoordinator should not rely on 
> ConsumerNetworkClient.isUnavailable but should instead check 
> client.connectionFailed.
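> The suggested direction would look roughly like this (a sketch of the idea 
> only; see the attached KAFKA-10827.patch for the actual change):
> {code:java}
> // Treat any observed connection failure to the cached coordinator as grounds
> // to rediscover it, instead of only the short reconnect-backoff window.
> protected synchronized Node checkAndGetCoordinator() {
>     if (coordinator != null && client.connectionFailed(coordinator)) {
>         markCoordinatorUnknown();
>         return null;
>     }
>     return this.coordinator;
> }
> {code}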
> We had to restart the consumer to recover from this condition.


