[ 
https://issues.apache.org/jira/browse/KAFKA-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776404#comment-16776404
 ] 

Sebastian Schmitz commented on KAFKA-6582:
------------------------------------------

We encounter the same problem...

Last week I did a rolling upgrade from 0.10.1.0 in our Dev-Environment to 
prepare the upgrade of all Kafka-Environments to 2.1. And starting on Friday we 
have that error...

It started after we did the third step and changed to 
log.message.format.version from 0.10.1 to 2.1. The two previous steps didn't 
cause the problem... Or there was not enough time between them for the problem 
to occur as I did the upgrade of Kafka and then the 
inter.broker.protocol.version to 2.1 within like 24 hours.

We are using three node cluster on Docker Containers in Azure.

On Friday first the node1, then node2 failed within 3 hours of each other 
during morning. The node3 then failed with the mentioned error during afternoon 
the same day. As all three had been restarted on Friday it ran fine until 
Sunday evening at 8:09 PM. When node2 failed with that error and I had to 
restart it this Morning to get everything back online.

As this is our Dev-Environment it's not too much pain it it fails or needs to 
be restarted. So we can use this for debugging etc.
So just tell me which Logs I should get or what else you need or what I should 
try to solve this issue.

> Partitions get underreplicated, with a single ISR, and doesn't recover. Other 
> brokers do not take over and we need to manually restart the broker.
> --------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6582
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6582
>             Project: Kafka
>          Issue Type: Bug
>          Components: network
>    Affects Versions: 1.0.0
>         Environment: Ubuntu 16.04
> Linux kafka04 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018 
> x86_64 x86_64 x86_64 GNU/Linux
> java version "9.0.1"
> Java(TM) SE Runtime Environment (build 9.0.1+11)
> Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode) 
> but also tried with the latest JVM 8 before with the same result.
>            Reporter: Jurriaan Pruis
>            Priority: Major
>         Attachments: Screenshot 2019-01-18 at 13.08.17.png, Screenshot 
> 2019-01-18 at 13.16.59.png
>
>
> Partitions get underreplicated, with a single ISR, and doesn't recover. Other 
> brokers do not take over and we need to manually restart the 'single ISR' 
> broker (if you describe the partitions of replicated topic it is clear that 
> some partitions are only in sync on this broker).
> This bug resembles KAFKA-4477 a lot, but since that issue is marked as 
> resolved this is probably something else but similar.
> We have the same issue (or at least it looks pretty similar) on Kafka 1.0. 
> Since upgrading to Kafka 1.0 in November 2017 we've had these issues (we've 
> upgraded from Kafka 0.10.2.1).
> This happens almost every 24-48 hours on a random broker. This is why we 
> currently have a cronjob which restarts every broker every 24 hours. 
> During this issue the ISR shows the following server log: 
> {code:java}
> [2018-02-20 12:02:08,342] WARN Attempting to send response via channel for 
> which there is no open connection, connection id 
> 10.132.0.32:9092-10.14.148.20:56352-96708 (kafka.network.Processor)
> [2018-02-20 12:02:08,364] WARN Attempting to send response via channel for 
> which there is no open connection, connection id 
> 10.132.0.32:9092-10.14.150.25:54412-96715 (kafka.network.Processor)
> [2018-02-20 12:02:08,349] WARN Attempting to send response via channel for 
> which there is no open connection, connection id 
> 10.132.0.32:9092-10.14.149.18:35182-96705 (kafka.network.Processor)
> [2018-02-20 12:02:08,379] WARN Attempting to send response via channel for 
> which there is no open connection, connection id 
> 10.132.0.32:9092-10.14.150.25:54456-96717 (kafka.network.Processor)
> [2018-02-20 12:02:08,448] WARN Attempting to send response via channel for 
> which there is no open connection, connection id 
> 10.132.0.32:9092-10.14.159.20:36388-96720 (kafka.network.Processor)
> [2018-02-20 12:02:08,683] WARN Attempting to send response via channel for 
> which there is no open connection, connection id 
> 10.132.0.32:9092-10.14.157.110:41922-96740 (kafka.network.Processor)
> {code}
> Also on the ISR broker, the controller log shows this:
> {code:java}
> [2018-02-20 12:02:14,927] INFO [Controller-3-to-broker-3-send-thread]: 
> Controller 3 connected to 10.132.0.32:9092 (id: 3 rack: null) for sending 
> state change requests (kafka.controller.RequestSendThread)
> [2018-02-20 12:02:14,927] INFO [Controller-3-to-broker-0-send-thread]: 
> Controller 3 connected to 10.132.0.10:9092 (id: 0 rack: null) for sending 
> state change requests (kafka.controller.RequestSendThread)
> [2018-02-20 12:02:14,928] INFO [Controller-3-to-broker-1-send-thread]: 
> Controller 3 connected to 10.132.0.12:9092 (id: 1 rack: null) for sending 
> state change requests (kafka.controller.RequestSendThread){code}
> And the non-ISR brokers show these kind of errors:
>  
> {code:java}
> 2018-02-20 12:02:29,204] WARN [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=0] Error in fetch to broker 3, request (type=FetchRequest, 
> replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={......................}, isolationLevel=READ_UNCOMMITTED) 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 3 was disconnected before the response was 
> read
>  at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:95)
>  at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
>  at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:205)
>  at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:41)
>  at 
> kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
>  at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to