[ https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745663#comment-15745663 ]

Tom DeVoe edited comment on KAFKA-4477 at 12/13/16 5:20 PM:
------------------------------------------------------------

[~junrao] I respectfully disagree, and this is why I was originally hesitant to 
post the extended logs - the ISR shrink from 1003, 1001, 1002 happened after I 
restarted node 1002 (as is expected).

If we pay attention to the timestamps, the symptoms described in the ticket 
*exactly* match what I have seen. 

- In the node 1002 log, we see node 1002 reduce the ISR for all of its 
partitions to just itself at {{2016-11-28 19:57:05}}.

- Approximately 10 seconds later at {{2016-11-28 19:57:16,003}} (as in the 
original issue description) the other two nodes (1001, 1003) both log 
{{java.io.IOException: Connection to 1002 was disconnected before the response 
was read}}.

- After this occurs, we also see an increasing number of file descriptors 
being opened on node 1002 (a quick way to watch this is sketched below).
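
For reference, here is a minimal sketch of how that file descriptor growth 
could be tracked. This is hypothetical, not what we actually ran; it assumes a 
Linux host, Python 3, and that you know the broker's PID, since it just counts 
the entries under {{/proc/<pid>/fd}}:

{code:python}
import os
import time

def count_open_fds(pid):
    """Return how many file descriptors the given process currently has open."""
    return len(os.listdir("/proc/%d/fd" % pid))

if __name__ == "__main__":
    broker_pid = 12345  # hypothetical: replace with the Kafka broker's actual PID
    while True:
        # Print a timestamped sample every 10 seconds; a steadily climbing
        # count matches the symptom described above.
        print(time.strftime("%Y-%m-%d %H:%M:%S"), count_open_fds(broker_pid))
        time.sleep(10)
{code}

The 10 second sampling interval is arbitrary; anything frequent enough to show 
the trend works.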

The zookeeper logs do not indicate that any sessions expired at the time this 
issue occurred.



> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4477
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4477
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>            Reporter: Michael Andre Pearce (IG)
>            Assignee: Apurva Mehta
>            Priority: Critical
>              Labels: reliability
>         Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has recurred in different 
> physical environments. We haven't worked out what is going on, but we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to just itself; moments later we see other nodes having 
> disconnects, and finally app issues where producing to these partitions is 
> blocked.
> It seems that only restarting the kafka java process resolves the issue.
> We have had this occur multiple times, and all of our network and machine 
> monitoring shows the machine never left the network or had any other 
> glitches.
> Below are logs seen from the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see on the sick machine an increasing 
> number of sockets in CLOSE_WAIT and open file descriptors.
> As a workaround to keep the service alive, we are currently putting in 
> place an automated process that tails the logs and matches the regex below; 
> when new_partitions is just the node itself, we restart the node.
> "\[(?P<time>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"


