[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2017-05-22 Thread dhiraj prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16019536#comment-16019536
 ] 

dhiraj prajapati edited comment on KAFKA-4477 at 5/22/17 1:45 PM:
--

Hi all,
We have a 3-node cluster in our production environment. We recently upgraded 
Kafka from 0.9.0.1 to 0.10.1.0 and we are seeing a similar issue of 
intermittent disconnection. We never had this issue in 0.9.0.1. Below is the 
exception stack trace from the broker's server.log:
[2017-05-15 09:33:55,398] WARN [ReplicaFetcherThread-0-2], Error in fetch 
kafka.server.ReplicaFetcherThread$FetchRequest@7213d6d 
(kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 2 was disconnected before the response was 
read
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
        at scala.Option.foreach(Option.scala:257)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
        at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
        at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
        at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)


Is this issue fixed in later versions? I am asking this because I saw a similar 
thread for version 0.10.2:
https://issues.apache.org/jira/browse/KAFKA-5153

Please assist.


was (Author: dhirajpraj):
Hi all,
We have a 3-node cluster in our production environment. We recently upgraded 
Kafka from 0.9.0.1 to 0.10.1.0 and we are seeing a similar issue of 
intermittent disconnection. We never had this issue in 0.9.0.1. 

Is this issue fixed in later versions? I am asking this because I saw a similar 
thread for version 0.10.2:
https://issues.apache.org/jira/browse/KAFKA-5153

Please assist.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: 2016_12_15.zip, 72_Server_Thread_Dump.txt, 
> 73_Server_Thread_Dump.txt, 74_Server_Thread_Dump, issue_node_1001_ext.log, 
> issue_node_1001.log, issue_node_1002_ext.log, issue_node_1002.log, 
> issue_node_1003_ext.log, issue_node_1003.log, kafka.jstack, 
> server_1_72server.log, server_2_73_server.log, server_3_74Server.log, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)

[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2017-05-22 Thread dhiraj prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16019536#comment-16019536
 ] 

dhiraj prajapati edited comment on KAFKA-4477 at 5/22/17 12:49 PM:
---

Hi all,
We have a 3-node cluster in our production environment. We recently upgraded 
Kafka from 0.9.0.1 to 0.10.1.0 and we are seeing a similar issue of 
intermittent disconnection. We never had this issue in 0.9.0.1. 

Is this issue fixed in later versions? I am asking this because I saw a similar 
thread for version 0.10.2:
https://issues.apache.org/jira/browse/KAFKA-5153

Please assist.


was (Author: dhirajpraj):
Hi all,
We have a 3-node cluster in our production environment. We recently upgraded 
Kafka from 0.9.0.1 to 0.10.1.0 and we are seeing a similar issue of 
intermittent disconnection. We never had this issue in 0.9.0.1. 

Is this issue fixed in later versions? I am asking this because I saw a similar 
thread for version 0.10.2:
https://issues.apache.org/jira/browse/KAFKA-5153

Please assist.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: 2016_12_15.zip, 72_Server_Thread_Dump.txt, 
> 73_Server_Thread_Dump.txt, 74_Server_Thread_Dump, issue_node_1001_ext.log, 
> issue_node_1001.log, issue_node_1002_ext.log, issue_node_1002.log, 
> issue_node_1003_ext.log, issue_node_1003.log, kafka.jstack, 
> server_1_72server.log, server_2_73_server.log, server_3_74Server.log, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"
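
(For illustration only: a minimal Python sketch of what such a tail-and-restart 
watchdog could look like. The log path and systemd unit name are placeholders, 
not the actual setup, and the extra broker capture group is an addition of this 
sketch, not part of the original regex.)

{code}
import re
import subprocess
import time

# The "Shrinking ISR" regex from above, with the broker id also captured
# so the new ISR can be compared against it.
SHRINK_RE = re.compile(
    r"\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker (?P<broker>.+): "
    r"Shrinking ISR for partition \[.*\] from (?P<old_partitions>.+) "
    r"to (?P<new_partitions>.+) \(kafka.cluster.Partition\)"
)

LOG_PATH = "/var/log/kafka/server.log"  # placeholder path


def follow(path):
    """Yield lines appended to the log file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # start at the current end of file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line


for line in follow(LOG_PATH):
    m = SHRINK_RE.search(line)
    if not m:
        continue
    # If the new ISR has shrunk to just the broker itself, restart the node.
    if m.group("new_partitions").strip() == m.group("broker").strip():
        subprocess.call(["systemctl", "restart", "kafka"])  # placeholder unit
{code}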



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2017-04-28 Thread Arpan (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988460#comment-15988460
 ] 

Arpan edited comment on KAFKA-4477 at 4/28/17 9:25 AM:
---

Hi [~apurva] ,

Attaching server logs for all 3 Kafka broker servers. The setup I have in the 
cluster is below:

72 Server - Zookeeper/KAFKA
73 Server - Zookeeper/KAFKA
74 Server - Zookeeper/KAFKA

It's been 4 days now since we restarted all the nodes because of this issue, 
and the current FD count on all 3 servers is below 500.

I do not have a thread dump from when we had the issue; I have taken one now 
for reference and am attaching it here as well.
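
(As an aside, and purely as a sketch: one way to sample a broker's open-FD 
count on Linux from Python. Matching on the broker main class kafka.Kafka is 
an assumption about how the process was started, and reading /proc/<pid>/fd 
generally requires being the process owner or root.)

{code}
import os
import subprocess


def kafka_pids():
    """Find broker PIDs by matching the kafka.Kafka main class."""
    out = subprocess.run(["pgrep", "-f", "kafka.Kafka"],
                         capture_output=True, text=True)
    return [int(p) for p in out.stdout.split()]


def fd_count(pid):
    """Count open file descriptors for a process via /proc (Linux-only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


for pid in kafka_pids():
    print(pid, fd_count(pid))
{code}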


was (Author: arpan.khagram0...@gmail.com):
Hi [~apurva] ,

Attaching server logs for all 3 Kafka broker servers. The setup I have in the 
cluster is below:

72 Server - Zookeeper/KAFKA
73 Server - Zookeeper/KAFKA
74 Server - Zookeeper/KAFKA

It's been 4 days now since we restarted all the nodes because of this issue, 
and the current FD count on all 3 servers is below 500.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: 2016_12_15.zip, issue_node_1001_ext.log, 
> issue_node_1001.log, issue_node_1002_ext.log, issue_node_1002.log, 
> issue_node_1003_ext.log, issue_node_1003.log, kafka.jstack, 
> server_1_72server.log, server_2_73_server.log, server_3_74Server.log, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2017-04-28 Thread Arpan (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988460#comment-15988460
 ] 

Arpan edited comment on KAFKA-4477 at 4/28/17 9:13 AM:
---

Hi [~apurva] ,

Attaching server logs for all 3 Kafka broker servers. The setup I have in the 
cluster is below:

72 Server - Zookeeper/KAFKA
73 Server - Zookeeper/KAFKA
74 Server - Zookeeper/KAFKA

It's been 4 days now since we restarted all the nodes because of this issue, 
and the current FD count on all 3 servers is below 500.


was (Author: arpan.khagram0...@gmail.com):
Hi [~apurva] ,

Attaching server logs for all 3 Kafka broker servers. The setup I have in the 
cluster is below:

72 Server - Zookeeper/KAFKA
73 Server - Zookeeper/KAFKA
74 Server - Zookeeper/KAFKA

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: 2016_12_15.zip, issue_node_1001_ext.log, 
> issue_node_1001.log, issue_node_1002_ext.log, issue_node_1002.log, 
> issue_node_1003_ext.log, issue_node_1003.log, kafka.jstack, 
> server_1_72server.log, server_2_73_server.log, server_3_74Server.log, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2017-04-28 Thread Arpan (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988159#comment-15988159
 ] 

Arpan edited comment on KAFKA-4477 at 4/28/17 9:08 AM:
---

Hi Apurva, I shall send you the stack trace again, but the behavior is exactly 
the same - we see this behavior almost every week. I haven't observed the file 
descriptor count during the issue.

It also gets resolved after restarting the nodes. We even observed missing 
offsets in the consumer at the time of the issue.

We restarted 2-3 days ago; I shall take thread dumps, observe the file 
descriptor count, and let you know today.

Also, I am not sure why I am unable to find a reference to this bug in the 
release notes.

Regards,
Arpan Khagram


was (Author: arpan.khagram0...@gmail.com):
Hi Apurva, I shall send you the stack trace again, but the behavior is exactly 
the same - we see this behavior almost every week. I haven't observed the file 
descriptor count during the issue.

It also gets resolved after restarting the nodes. We even observed missing 
offsets in the consumer at the time of the issue.

We restarted 2-3 days ago; I shall take thread dumps, observe the file 
descriptor count, and let you know today.

Regards,
Arpan Khagram

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: 2016_12_15.zip, issue_node_1001_ext.log, 
> issue_node_1001.log, issue_node_1002_ext.log, issue_node_1002.log, 
> issue_node_1003_ext.log, issue_node_1003.log, kafka.jstack, 
> server_1_72server.log, server_2_73_server.log, server_3_74Server.log, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2017-04-28 Thread Arpan (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988460#comment-15988460
 ] 

Arpan edited comment on KAFKA-4477 at 4/28/17 9:08 AM:
---

Hi [~apurva] ,

Attaching server logs for all 3 Kafka broker servers. The setup I have in the 
cluster is below:

72 Server - Zookeeper/KAFKA
73 Server - Zookeeper/KAFKA
74 Server - Zookeeper/KAFKA


was (Author: arpan.khagram0...@gmail.com):
Attaching server logs for all 3 Kafka broker servers. The setup I have in the 
cluster is below:

72 Server - Zookeeper/KAFKA
73 Server - Zookeeper/KAFKA
74 Server - Zookeeper/KAFKA

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: 2016_12_15.zip, issue_node_1001_ext.log, 
> issue_node_1001.log, issue_node_1002_ext.log, issue_node_1002.log, 
> issue_node_1003_ext.log, issue_node_1003.log, kafka.jstack, 
> server_1_72server.log, server_2_73_server.log, server_3_74Server.log, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2017-01-05 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15801238#comment-15801238
 ] 

Ismael Juma edited comment on KAFKA-4477 at 1/5/17 12:31 PM:
-

From all the responses, I think we can mark this as Fixed. The deadlock fixes 
seem most likely to have done it, but it's hard to be sure.

Feel free to reopen if any of you experience this with 0.10.1.1.


was (Author: ijuma):
From all the responses, I think we can mark this as Fixed. The deadlock fixes 
seem most likely to have done it, but it's hard to be sure.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Fix For: 0.10.1.1
>
> Attachments: 2016_12_15.zip, issue_node_1001.log, 
> issue_node_1001_ext.log, issue_node_1002.log, issue_node_1002_ext.log, 
> issue_node_1003.log, issue_node_1003_ext.log, kafka.jstack, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2017-01-04 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15799732#comment-15799732
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 1/4/17 11:44 PM:
---

We haven't seen this re-occur, though we have only been running 0.10.1.1 for 
<1 week, in testing and UAT envs.


was (Author: michael.andre.pearce):
We haven't seen this re-occur, though we have only been running for <1 week 
in testing and UAT envs.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: 2016_12_15.zip, issue_node_1001.log, 
> issue_node_1001_ext.log, issue_node_1002.log, issue_node_1002_ext.log, 
> issue_node_1003.log, issue_node_1003_ext.log, kafka.jstack, 
> state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747271#comment-15747271
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/14/16 4:59 AM:


Hi [~apurva],

Whilst I await the issue occurring again so I can provide some further logs 
for you, I was just reading the above comment and have a query on it.

Whilst, by the sounds of it, there's possibly a deadlock causing the ISR not 
to re-expand (though some stacks we have captured don't show this), the 
question is why the ISRs are shrinking in the first place.

Re the 0.10.1.1 RC: unfortunately, in the environments we see this in, we 
will only be able to deploy it once 0.10.1.1 is GA/tagged, as they're UAT and 
PROD environments.

Maybe it's worth pushing to get 0.10.1.1 tagged and released now, without 
waiting for additional fixes, as from what I understand this version is just 
fixes anyhow; then, if issues are still detected, we get a 0.10.1.2 with 
further hotfixes.

On another note, according to others it seems 0.10.0.0 doesn't contain this 
issue (we can only confirm 0.9.0.1 doesn't; we didn't run on 0.10.0.0 for a 
long period before upgrading some brokers to 0.10.1.0). Is there any possible 
way to downgrade from 0.10.1.0 to 0.10.0.0, and is there a doc for this? 
Obviously all the docs are for upgrade paths, not downgrades.

Cheers
Mike


was (Author: michael.andre.pearce):
Hi [~apurva],

Whilst I await the issue occurring again so I can provide some further logs 
for you, I was just reading the above comment and have a query on it.

Whilst, by the sounds of it, there's possibly a deadlock causing the ISR not 
to re-expand (though some stacks we have captured don't show this), the 
question is why the ISRs are shrinking in the first place.

Re the 0.10.1.1 RC: unfortunately, in the environments we see this in, we 
will only be able to deploy it once 0.10.1.1 is GA/tagged, as they're UAT and 
PROD environments.

Maybe it's worth pushing to get 0.10.1.1 tagged and released now, without 
waiting for additional fixes, as from what I understand this version is just 
fixes anyhow; then, if issues are still detected, we get a 0.10.1.2 with 
further hotfixes.

On another note, according to others it seems 0.10.0.0 doesn't contain this 
issue (we can only confirm 0.9.0.1 doesn't). Is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0, and is there a doc for this? Obviously 
all the docs are for upgrade paths, not downgrades.

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> 

[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747271#comment-15747271
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/14/16 4:58 AM:


Hi [~apurva],

Whilst I await the issue occurring again so I can provide some further logs 
for you, I was just reading the above comment and have a query on it.

Whilst, by the sounds of it, there's possibly a deadlock causing the ISR not 
to re-expand (though some stacks we have captured don't show this), the 
question is why the ISRs are shrinking in the first place.

Re the 0.10.1.1 RC: unfortunately, in the environments we see this in, we 
will only be able to deploy it once 0.10.1.1 is GA/tagged, as they're UAT and 
PROD environments.

Maybe it's worth pushing to get 0.10.1.1 tagged and released now, without 
waiting for additional fixes, as from what I understand this version is just 
fixes anyhow; then, if issues are still detected, we get a 0.10.1.2 with 
further hotfixes.

On another note, according to others it seems 0.10.0.0 doesn't contain this 
issue (we can only confirm 0.9.0.1 doesn't). Is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0, and is there a doc for this? Obviously 
all the docs are for upgrade paths, not downgrades.

Cheers
Mike


was (Author: michael.andre.pearce):
Hi [~apurva],

Whilst I await the issue occurring again so I can provide some further logs 
for you, I was just reading the above comment and have a query on it.

Whilst, by the sounds of it, there's possibly a deadlock causing the ISR not 
to re-expand (though some stacks we have captured don't show this), the 
question is why the ISRs are shrinking in the first place.

Re the 0.10.1.1 RC: unfortunately, in the environments we see this in, we 
will only be able to deploy it once 0.10.1.1 is GA/tagged, as they're UAT and 
PROD environments.

On another note, according to others it seems 0.10.0.0 doesn't contain this 
issue (we can only confirm 0.9.0.1 doesn't). Is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0, and is there a doc for this? Obviously 
all the docs are for upgrade paths, not downgrades.

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a work around to keep service we are currently putting in 

[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747271#comment-15747271
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/14/16 4:55 AM:


Hi [~apurva],

Whilst I await the issue occurring again so I can provide some further logs 
for you, I was just reading the above comment and have a query on it.

Whilst, by the sounds of it, there's possibly a deadlock causing the ISR not 
to re-expand (though some stacks we have captured don't show this), the 
question is why the ISRs are shrinking in the first place.

Re the 0.10.1.1 RC: unfortunately, in the environments we see this in, we 
will only be able to deploy it once 0.10.1.1 is GA/tagged, as they're UAT and 
PROD environments.

On another note, according to others it seems 0.10.0.0 doesn't contain this 
issue (we can only confirm 0.9.0.1 doesn't). Is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0, and is there a doc for this? Obviously 
all the docs are for upgrade paths, not downgrades.

Cheers
Mike


was (Author: michael.andre.pearce):
Hi Apurva,

Whilst I await the issue occurring again so I can provide some further logs 
for you, I was just reading the above comment and have a query on it.

Whilst, by the sounds of it, there's possibly a deadlock causing the ISR not 
to re-expand (though some stacks we have captured don't show this), the 
question is why the ISRs are shrinking in the first place.

Re the 0.10.1.1 RC: unfortunately, in the environments we see this in, we 
will only be able to deploy it once 0.10.1.1 is GA/tagged, as they're UAT and 
PROD environments.

On another note, according to others it seems 0.10.0.0 doesn't contain this 
issue (we can only confirm 0.9.0.1 doesn't). Is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0, and is there a doc for this? Obviously 
all the docs are for upgrade paths, not downgrades.

Cheers
Mike

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"

[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Michael Andre Pearce (IG) (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745749#comment-15745749
 ] 

Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/13/16 5:49 PM:


It is worth noting that, as mentioned by someone else, we see the open file 
descriptors increase if we leave the process in a sick mode (now that we 
restart quickly, we don't get to observe this).
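
(For anyone wanting to sample this observation, a small Python sketch that 
counts CLOSE_WAIT sockets by parsing /proc/net/tcp on Linux; state code 08 is 
CLOSE_WAIT. This is just an illustration, not something from this thread.)

{code}
def close_wait_count():
    """Count sockets in the CLOSE_WAIT state via /proc/net/tcp{,6}."""
    count = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    # Fields: sl local_address rem_address st ...
                    if line.split()[3] == "08":  # 08 == CLOSE_WAIT
                        count += 1
        except FileNotFoundError:
            pass
    return count


print(close_wait_count())
{code}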


was (Author: michael.andre.pearce):
It is worth noting we see the open file descriptors increase if we leave the 
process in a sick mode (now that we restart quickly, we don't get to observe 
this).

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-13 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745663#comment-15745663
 ] 

Tom DeVoe edited comment on KAFKA-4477 at 12/13/16 5:20 PM:


[~junrao] I respectfully disagree, and this is why I was originally hesitant 
to post the extended logs - the shrinking of the ISR from 1003, 1001, 1002 
happened after I restarted node 1002 (as is expected).

If we pay attention to the timestamps, the symptoms described in the ticket 
*exactly* match what I have seen. 

- In the node 1002 log, we see all ISRs reduced to itself at {{2016-11-28 
19:57:05}}.

- Approximately 10 seconds later at {{2016-11-28 19:57:16,003}} (as in the 
original issue description) the other two nodes (1001, 1003) both log 
{{java.io.IOException: Connection to 1002 was disconnected before the response 
was read}}.

- After this occurs, we also see an increasing amount of file descriptors 
opening on node 1002.

Checking the zookeeper logs does not indicate any sessions expired at the time 
this issue occurred.


was (Author: tdevoe):
[~junrao] I respectfully disagree, and this is why I was originally hesitant 
to post the extended logs - the shrinking of the ISR from 1003, 1001, 1002 
happened after I restarted node 1002 (as is expected).

If we pay attention to the timestamps, the symptoms described in the ticket 
*exactly* match what I have seen. 

- In the node 1002 log, we see all ISRs reduced to itself at {{2016-11-28 
19:57:05}}.

- Approximately 10 seconds later at {{2016-11-28 19:57:16,003}} (as in the 
original issue description) the other two nodes (1001, 1003) both log 
{{java.io.IOException: Connection to 1002 was disconnected before the response 
was read}}.

- After this occurs, we also see an increasing amount of file descriptors 
opening on node 1002.

Checking the zookeeper logs does not indicate *any* sessions expired at the 
time this issue occurred.

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

2016-12-09 Thread Tom DeVoe (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15735863#comment-15735863
 ] 

Tom DeVoe edited comment on KAFKA-4477 at 12/9/16 5:39 PM:
---

File limit is set to {{Max open files  1048576  1048576  files}} and the 
server only ever got to 2K open file descriptors. Somewhat related, however: 
the instances that encountered this saw their open files start steadily 
increasing, and it seemed the count would have kept increasing had I not 
restarted the process. 
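
(If useful for comparison, a tiny Python sketch for reading a process's 
"Max open files" limit from /proc/<pid>/limits on Linux; the PID is assumed 
known, e.g. from pgrep.)

{code}
def max_open_files(pid):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits."""
    with open("/proc/%d/limits" % pid) as f:
        for line in f:
            if line.startswith("Max open files"):
                # Columns: limit name, soft limit, hard limit, units
                parts = line.split()
                return int(parts[3]), int(parts[4])
{code}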


was (Author: tdevoe):
File limit is set to {{Max open files  1048576  1048576  files}} and the 
server only ever got to 2K open file descriptors. Though the instances that 
encountered this saw their open files start steadily increasing; it seemed 
the count would have kept increasing had I not restarted the process. 

> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>Reporter: Michael Andre Pearce (IG)
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1002.log, 
> issue_node_1003.log, kafka.jstack
>
>
> We have encountered a critical issue that has re-occurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself; moments later we see other nodes having disconnects, 
> followed finally by application issues, where producing to these partitions 
> is blocked.
> It seems only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and from all network and machine 
> monitoring the machine never left the network or had any other glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> sockets and open file descriptors on the sick machine.
> As a workaround to keep the service up, we are putting in an automated 
> process that tails the log and matches the regex below; where new_partitions 
> is just the node itself, we restart the node. 
> "\[(?P<timestamp>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)