[ 
https://issues.apache.org/jira/browse/KAFKA-12901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suriya Vijayaraghavan updated KAFKA-12901:
------------------------------------------
    Description: 
We upgraded from version 2.7 to 2.8. After monitoring for a few weeks, we
upgraded our production setup (we went ahead because we had not enabled KRaft).
A few weeks later, clients in our production setup started failing with
TimeoutException. We listed all active brokers using the admin client API, and
all brokers were listed properly. We then logged into the affected broker and
ran a describe topic with localhost as the bootstrap-server, but that timed out
as well.

When checking the logs, we noticed a shutdown message from the
kafka-shutdown-hook thread (the ZooKeeper session had timed out and there were
three retry failures). The controlled shutdown failed (the controller returned
an unknown server error response), and the broker proceeded to an unclean
shutdown. Even so, the process did not exit, and it did not process any other
operations either. The broker was not removed from alive status for hours (it
was still visible in the list of brokers), so our clients kept trying to
contact it and failing with timeout exceptions. We then restarted the
problematic broker, but after the restart our clients hit unknown topic or
partition errors, which caused timeouts as well. We noticed that metadata had
not been loaded, so we had to restart our controller. After restarting the
controller, everything went back to normal.



  was:
We upgraded from version 2.7 to 2.8. After monitoring for a few weeks, we
upgraded our production setup (we went ahead because we had not enabled KRaft).
A few weeks later, clients in our production setup started failing with
TimeoutException. We listed all active brokers using the admin client API, and
all brokers were listed properly. We then logged into the affected broker and
ran a describe topic with localhost as the bootstrap-server, but that timed out
as well.

When checking the logs, we noticed a shutdown message from the
kafka-shutdown-hook thread (the ZooKeeper session had timed out and there were
three retry failures). The controlled shutdown failed (the controller returned
an unknown server error response), and the broker proceeded to an unclean
shutdown. Even so, the process did not exit, and it did not process any other
operations either. The broker was not removed from alive status for hours (it
was still visible in the list of brokers), so our clients kept trying to
contact it and failing with timeout exceptions. We then restarted the
problematic broker, but after the restart our clients hit unknown topic or
partition errors, which caused timeouts as well. We noticed that metadata had
not been loaded, so we had to restart our controller. After restarting the
controller, everything went back to normal.

So how is metadata loading handled? Are there any alternative ways for us to
automate monitoring for metadata updates?



> Metadata not updated after broker restart.
> ------------------------------------------
>
>                 Key: KAFKA-12901
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12901
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.8.0
>            Reporter: Suriya Vijayaraghavan
>            Priority: Major
>
> We upgraded from version 2.7 to 2.8. After monitoring for a few weeks, we
> upgraded our production setup (we went ahead because we had not enabled
> KRaft). A few weeks later, clients in our production setup started failing
> with TimeoutException. We listed all active brokers using the admin client
> API, and all brokers were listed properly. We then logged into the affected
> broker and ran a describe topic with localhost as the bootstrap-server, but
> that timed out as well.
> When checking the logs, we noticed a shutdown message from the
> kafka-shutdown-hook thread (the ZooKeeper session had timed out and there
> were three retry failures). The controlled shutdown failed (the controller
> returned an unknown server error response), and the broker proceeded to an
> unclean shutdown. Even so, the process did not exit, and it did not process
> any other operations either. The broker was not removed from alive status
> for hours (it was still visible in the list of brokers), so our clients kept
> trying to contact it and failing with timeout exceptions. We then restarted
> the problematic broker, but after the restart our clients hit unknown topic
> or partition errors, which caused timeouts as well. We noticed that metadata
> had not been loaded, so we had to restart our controller. After restarting
> the controller, everything went back to normal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
