[jira] [Updated] (KAFKA-2572) zk connection instability

2015-10-07 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Description: 
On two occasions, we've seen our process enter a cycle of: zk session expiry; 
new session creation; rebalancing activity; pause during which nothing is heard 
from the zk server. Sometimes, the reconnections are successful, elements are 
pulled from Kafka, but then disconnection and reconnection occurs shortly 
thereafter, causing OOMs when new elements are pulled in. Restarting the 
process that uses the zk client resolved the problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout. A stable session seems to have been established 
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this 
conflicted ephemeral node 
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
 at 
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
 a while back in a different session, hence I will backoff for this node to be 
deleted by Zookeeper and retry" were logged out repeatedly until we restarted 
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in 
the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a 
stable session was not established prior to our restarting the process.
It's perhaps also noteworthy that in the 09/08 case, two timeouts for the same 
session are seen during a single rebalance, at 2015-09-08T12:52:19.107-04:00 
and 2015-09-08T12:52:31.639-04:00. The rebalance in question begins at 
2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the 
process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last 
entry in the logs for this day.

  was:
On two occasions, we've seen our process enter a cycle of: zk session expiry; 
new session creation; rebalancing activity; pause during which nothing is heard 
from the zk server. Restarting the process using the zk client resolved the 
problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout. A stable session seems to have been established 
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this 
conflicted ephemeral node 
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
 at 
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
 a while back in a different session, hence I will backoff for this node to be 
deleted by Zookeeper and retry" were logged out repeatedly until we restarted 
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in 
the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a 
stable session was not established prior to our restarting the process.
It's perhaps also noteworthy that in the 09/08 case, two timeouts for the same 
session are seen during a single rebalance, at 2015-09-08T12:52:19.107-04:00 
and 2015-09-08T12:52:31.639-04:00. The rebalance in question begins at 
2015-09-08T12:5

[jira] [Updated] (KAFKA-2572) zk connection instability

2015-10-07 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Description: 
On several occasions, we've seen our process enter a cycle of: zk session 
expiry; new session creation; rebalancing activity; pause during which nothing 
is heard from the zk server. Sometimes, the reconnections are successful, 
elements are pulled from Kafka, but then disconnection and reconnection occurs 
shortly thereafter, causing OOMs when new elements are pulled in (although OOMs 
were not seen in the two cases attached as examples). Restarting the process 
that uses the zk client resolved the problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout. A stable session seems to have been established 
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this 
conflicted ephemeral node 
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
 at 
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
 a while back in a different session, hence I will backoff for this node to be 
deleted by Zookeeper and retry" were logged out repeatedly until we restarted 
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in 
the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a 
stable session was not established prior to our restarting the process.
It's perhaps also noteworthy that in the 09/08 case, two timeouts for the same 
session are seen during a single rebalance, at 2015-09-08T12:52:19.107-04:00 
and 2015-09-08T12:52:31.639-04:00. The rebalance in question begins at 
2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the 
process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last 
entry in the logs for this day.

  was:
On two occasions, we've seen our process enter a cycle of: zk session expiry; 
new session creation; rebalancing activity; pause during which nothing is heard 
from the zk server. Sometimes, the reconnections are successful, elements are 
pulled from Kafka, but then disconnection and reconnection occurs shortly 
thereafter, causing OOMs when new elements are pulled in. Restarting the 
process that uses the zk client resolved the problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout. A stable session seems to have been established 
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this 
conflicted ephemeral node 
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
 at 
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
 a while back in a different session, hence I will backoff for this node to be 
deleted by Zookeeper and retry" were logged out repeatedly until we restarted 
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in 
the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a 
stable session was not established prior to our 

[jira] [Updated] (KAFKA-2572) zk connection instability

2015-09-22 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Description: 
On two occasions, we've seen our process enter a cycle of: zk session expiry; 
new session creation; rebalancing activity; pause during which nothing is heard 
from the zk server. Restarting the process using the zk client resolved the 
problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout. A stable session seems to have been established 
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this 
conflicted ephemeral node 
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
 at 
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
 a while back in a different session, hence I will backoff for this node to be 
deleted by Zookeeper and retry" were logged out repeatedly until we restarted 
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in 
the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a 
stable session was not established prior to our restarting the process.
It's perhaps also noteworthy that in the 09/08 case, two timeouts for the same 
session are seen during a single rebalance, at 2015-09-08T12:52:19.107-04:00 
and 2015-09-08T12:52:31.639-04:00. The rebalance in question begins at 
2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the 
process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last 
entry in the logs for this day.

  was:
On two occasions, we've seen our process enter a cycle of: zk session expiry; 
new session creation; rebalancing activity; pause during which nothing is heard 
from the zk server. Restarting the process using the zk client resolved the 
problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout. A stable session seems to have been established 
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this 
conflicted ephemeral node 
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
 at 
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
 a while back in a different session, hence I will backoff for this node to be 
deleted by Zookeeper and retry" were logged out repeatedly until we restarted 
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in 
the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a 
stable session was not established prior to our restarting the process. It's 
perhaps also noteworthy that in the 09/08 case, two timeouts for the same 
session are seen during a single rebalance, at 2015-09-08T12:52:19.107-04:00 
and 2015-09-08T12:52:31.639-04:00. The rebalance in question begins at 
2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the 
process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last 
entry in the logs for this

[jira] [Updated] (KAFKA-2572) zk connection instability

2015-09-22 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Description: 
On two occasions, we've seen our process enter a cycle of: zk session expiry; 
new session creation; rebalancing activity; pause during which nothing is heard 
from the zk server. Restarting the process using the zk client resolved the 
problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout. A stable session seems to have been established 
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this 
conflicted ephemeral node 
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
 at 
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
 a while back in a different session, hence I will backoff for this node to be 
deleted by Zookeeper and retry" were logged out repeatedly until we restarted 
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in 
the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a 
stable session was not established prior to our restarting the process. It's 
perhaps also noteworthy that in the 09/08 case, two timeouts for the same 
session are seen during a single rebalance, at 2015-09-08T12:52:19.107-04:00 
and 2015-09-08T12:52:31.639-04:00. The rebalance in question begins at 
2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the 
process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last 
entry in the logs for this day.

  was:
On two occasions, we've seen our process enter a cycle of: zk session expiry; 
new session creation; rebalancing activity; pause during which nothing is heard 
from the zk server. Restarting the process using the zk client resolved the 
problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout. A stable session seems to have been established 
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this 
conflicted ephemeral node 
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
 at 
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
 a while back in a different session, hence I will backoff for this node to be 
deleted by Zookeeper and retry" were logged out repeatedly until we restarted 
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in 
the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a 
stable session was not re-established. It's perhaps also noteworthy that in the 
09/08 case, two timeouts for the same session are seen during a single 
rebalance, at 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00. 
The rebalance in question begins at 2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the 
process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last 
entry in the logs for this day.


> zk connection instabili

[jira] [Updated] (KAFKA-2572) zk connection instability

2015-09-22 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Description: 
On two occasions, we've seen our process enter a cycle of: zk session expiry; 
new session creation; rebalancing activity; pause during which nothing is heard 
from the zk server. Restarting the process using the zk client resolved the 
problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout. A stable session seems to have been established 
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this 
conflicted ephemeral node 
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
 at 
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
 a while back in a different session, hence I will backoff for this node to be 
deleted by Zookeeper and retry" were logged out repeatedly until we restarted 
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in 
the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a 
stable session was not re-established. It's perhaps also noteworthy that in the 
09/08 case, two timeouts for the same session are seen during a single 
rebalance, at 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00. 
The rebalance in question begins at 2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the 
process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last 
entry in the logs for this day.

  was:
On two occasions, have seen zk session expiry, followed by a timeout during a 
consumer rebalance following this expiry, followed by multiple successive zk 
session expiries. Restarting the process using the zk client resolved the 
problems. 
Comparing these with a case in which a new stable zk session was created 
following a session expiry, the timeout during rebalance is not seen in the 
successful case.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/08 case is a little more straightforward than the 09/11 case. In the 
09/08 case, a session times out at 2015-09-08T12:52:06.069-04:00; two timeouts 
for the same session are then seen during the rebalance that follows the 
establishment of that session, at 2015-09-08T12:52:19.107-04:00 and 
2015-09-08T12:52:31.639-04:00. The rebalance begins at 
2015-09-08T12:52:06.667-04:00. The connection to ZK then expires and is 
restablished multiple times before the process is killed after 
2015-09-08T13:13:41.655-04:00, which marks the last entry in the digest.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout.


> zk connection instability
> -
>
> Key: KAFKA-2572
> URL: https://issues.apache.org/jira/browse/KAFKA-2572
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 0.8.2.1
> Environment: zk version 3.4.6,
> CentOS 6, 2.6.32-504.1.3.el6.x86_64
>Reporter: John Firth
> Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log, 
> 091115-full.log.zip
>
>
> On two occasions, we've seen our process enter a cycle of

[jira] [Updated] (KAFKA-2572) zk connection instability

2015-09-22 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Summary: zk connection instability  (was: zk connection instability, 
perhaps precipitated by zk client timeout during rebalance)

> zk connection instability
> -
>
> Key: KAFKA-2572
> URL: https://issues.apache.org/jira/browse/KAFKA-2572
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 0.8.2.1
> Environment: zk version 3.4.6,
> CentOS 6, 2.6.32-504.1.3.el6.x86_64
>Reporter: John Firth
> Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log, 
> 091115-full.log.zip
>
>
> On two occasions, have seen zk session expiry, followed by a timeout during a 
> consumer rebalance following this expiry, followed by multiple successive zk 
> session expiries. Restarting the process using the zk client resolved the 
> problems. 
> Comparing these with a case in which a new stable zk session was created 
> following a session expiry, the timeout during rebalance is not seen in the 
> successful case.
> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show 
> all logs entries minus entries particular to our application. For 09/08, the 
> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; 
> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining 
> only error and warning entries, and entries containing any of: "begin 
> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the 
> 09/11 digest logs, entries from the kafka.network.Processor logger are also 
> excised for clarity. Unfortunately, debug logging was not enabled during 
> these events.
> The 09/08 case is a little more straightforward than the 09/11 case. In the 
> 09/08 case, a session times out at 2015-09-08T12:52:06.069-04:00; two 
> timeouts for the same session are then seen during the rebalance that follows 
> the establishment of that session, at 2015-09-08T12:52:19.107-04:00 and 
> 2015-09-08T12:52:31.639-04:00. The rebalance begins at 
> 2015-09-08T12:52:06.667-04:00. The connection to ZK then expires and is 
> restablished multiple times before the process is killed after 
> 2015-09-08T13:13:41.655-04:00, which marks the last entry in the digest.
> The 09/11 case shows repeated cycles of session expiry, followed by 
> rebalancing activity, followed by a pause during which nothing is heard from 
> the zk server, followed by a session timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2572) zk connection instability, perhaps precipitated by zk client timeout during rebalance

2015-09-22 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Description: 
On two occasions, have seen zk session expiry, followed by a timeout during a 
consumer rebalance following this expiry, followed by multiple successive zk 
session expiries. Restarting the process using the zk client resolved the 
problems. 
Comparing these with a case in which a new stable zk session was created 
following a session expiry, the timeout during rebalance is not seen in the 
successful case.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.

The 09/08 case is a little more straightforward than the 09/11 case. In the 
09/08 case, a session times out at 2015-09-08T12:52:06.069-04:00; two timeouts 
for the same session are then seen during the rebalance that follows the 
establishment of that session, at 2015-09-08T12:52:19.107-04:00 and 
2015-09-08T12:52:31.639-04:00. The rebalance begins at 
2015-09-08T12:52:06.667-04:00. The connection to ZK then expires and is 
restablished multiple times before the process is killed after 
2015-09-08T13:13:41.655-04:00, which marks the last entry in the digest.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing 
activity, followed by a pause during which nothing is heard from the zk server, 
followed by a session timeout.

  was:
On two occasions, have seen zk session expiry, followed by a timeout during a 
consumer rebalance following this expiry, followed by multiple successive zk 
session expiries. Restarting the process using the zk client resolved the 
problems. 
Comparing these with a case in which a new stable zk session was created 
following a session expiry, the timeout during rebalance is not seen in the 
successful case.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.





> zk connection instability, perhaps precipitated by zk client timeout during 
> rebalance
> -
>
> Key: KAFKA-2572
> URL: https://issues.apache.org/jira/browse/KAFKA-2572
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 0.8.2.1
> Environment: zk version 3.4.6,
> CentOS 6, 2.6.32-504.1.3.el6.x86_64
>Reporter: John Firth
> Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log, 
> 091115-full.log.zip
>
>
> On two occasions, have seen zk session expiry, followed by a timeout during a 
> consumer rebalance following this expiry, followed by multiple successive zk 
> session expiries. Restarting the process using the zk client resolved the 
> problems. 
> Comparing these with a case in which a new stable zk session was created 
> following a session expiry, the timeout during rebalance is not seen in the 
> successful case.
> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show 
> all logs entries minus entries particular to our application. For 09/08, the 
> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; 
> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining 
> only error and warning entries, and entries containing any of: "begin 
> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the 
> 09/11 digest logs, entries from the kafka.network.Processor logger are also 
> excised for clarity. Unfortunately, debug logging was not enabled during 
> these events.
> The 09/08 case is a little more straightforward than the 09/

[jira] [Updated] (KAFKA-2572) zk connection instability, perhaps precipitated by zk client timeout during rebalance

2015-09-22 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Attachment: 091115-full.log.zip

> zk connection instability, perhaps precipitated by zk client timeout during 
> rebalance
> -
>
> Key: KAFKA-2572
> URL: https://issues.apache.org/jira/browse/KAFKA-2572
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 0.8.2.1
> Environment: zk version 3.4.6,
> CentOS 6, 2.6.32-504.1.3.el6.x86_64
>Reporter: John Firth
> Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log, 
> 091115-full.log.zip
>
>
> On two occasions, have seen zk session expiry, followed by a timeout during a 
> consumer rebalance following this expiry, followed by multiple successive zk 
> session expiries. Restarting the process using the zk client resolved the 
> problems. 
> Comparing these with a case in which a new stable zk session was created 
> following a session expiry, the timeout during rebalance is not seen in the 
> successful case.
> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show 
> all logs entries minus entries particular to our application. For 09/08, the 
> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; 
> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining 
> only error and warning entries, and entries containing any of: "begin 
> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the 
> 09/11 digest logs, entries from the kafka.network.Processor logger are also 
> excised for clarity. Unfortunately, debug logging was not enabled during 
> these events.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2572) zk connection instability, perhaps precipitated by zk client timeout during rebalance

2015-09-22 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Description: 
On two occasions, have seen zk session expiry, followed by a timeout during a 
consumer rebalance following this expiry, followed by multiple successive zk 
session expiries. Restarting the process using the zk client resolved the 
problems. 
Comparing these with a case in which a new stable zk session was created 
following a session expiry, the timeout during rebalance is not seen in the 
successful case.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity. 
Unfortunately, debug logging was not enabled during these events.




  was:
On two occasions, have seen zk session expiry, followed by a timeout during a 
consumer rebalance following this expiry, followed by multiple successive zk 
session expiries. Restarting the process using the zk client resolved the 
problems. 
Comparing these with a case in which a new stable zk session was created 
following a session expiry, the timeout during rebalance is not seen in the 
successful case.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity.





> zk connection instability, perhaps precipitated by zk client timeout during 
> rebalance
> -
>
> Key: KAFKA-2572
> URL: https://issues.apache.org/jira/browse/KAFKA-2572
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 0.8.2.1
> Environment: zk version 3.4.6,
> CentOS 6, 2.6.32-504.1.3.el6.x86_64
>Reporter: John Firth
> Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log, 
> 091115-full.log.zip
>
>
> On two occasions, have seen zk session expiry, followed by a timeout during a 
> consumer rebalance following this expiry, followed by multiple successive zk 
> session expiries. Restarting the process using the zk client resolved the 
> problems. 
> Comparing these with a case in which a new stable zk session was created 
> following a session expiry, the timeout during rebalance is not seen in the 
> successful case.
> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show 
> all logs entries minus entries particular to our application. For 09/08, the 
> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; 
> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining 
> only error and warning entries, and entries containing any of: "begin 
> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the 
> 09/11 digest logs, entries from the kafka.network.Processor logger are also 
> excised for clarity. Unfortunately, debug logging was not enabled during 
> these events.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2572) zk connection instability, perhaps precipitated by zk client timeout during rebalance

2015-09-22 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Description: 
On two occasions, have seen zk session expiry, followed by a timeout during a 
consumer rebalance following this expiry, followed by multiple successive zk 
session expiries. Restarting the process using the zk client resolved the 
problems. 
Comparing these with a case in which a new stable zk session was created 
following a session expiry, the timeout during rebalance is not seen in the 
successful case.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all 
logs entries minus entries particular to our application. For 09/08, the time 
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for 
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only 
error and warning entries, and entries containing any of: "begin rebalancing", 
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs, 
entries from the kafka.network.Processor logger are also excised for clarity.




  was:
On two occasions, have seen zk session expiry, followed by a timeout during a 
consumer rebalance following this expiry, followed by multiple successive zk 
session expiries. Comparing these with a case in which a new stable zk session 
was created following a session expiry, the timeout during rebalance is not 
seen in the successful case.



> zk connection instability, perhaps precipitated by zk client timeout during 
> rebalance
> -
>
> Key: KAFKA-2572
> URL: https://issues.apache.org/jira/browse/KAFKA-2572
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 0.8.2.1
> Environment: zk version 3.4.6,
> CentOS 6, 2.6.32-504.1.3.el6.x86_64
>Reporter: John Firth
> Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log
>
>
> On two occasions, have seen zk session expiry, followed by a timeout during a 
> consumer rebalance following this expiry, followed by multiple successive zk 
> session expiries. Restarting the process using the zk client resolved the 
> problems. 
> Comparing these with a case in which a new stable zk session was created 
> following a session expiry, the timeout during rebalance is not seen in the 
> successful case.
> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show 
> all logs entries minus entries particular to our application. For 09/08, the 
> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; 
> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining 
> only error and warning entries, and entries containing any of: "begin 
> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the 
> 09/11 digest logs, entries from the kafka.network.Processor logger are also 
> excised for clarity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2572) zk connection instability, perhaps precipitated by zk client timeout during rebalance

2015-09-22 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Attachment: (was: 091115-full.log.zip)

> zk connection instability, perhaps precipitated by zk client timeout during 
> rebalance
> -
>
> Key: KAFKA-2572
> URL: https://issues.apache.org/jira/browse/KAFKA-2572
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 0.8.2.1
> Environment: zk version 3.4.6,
> CentOS 6, 2.6.32-504.1.3.el6.x86_64
>Reporter: John Firth
> Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log
>
>
> On two occasions, have seen zk session expiry, followed by a timeout during a 
> consumer rebalance following this expiry, followed by multiple successive zk 
> session expiries. Restarting the process using the zk client resolved the 
> problems. 
> Comparing these with a case in which a new stable zk session was created 
> following a session expiry, the timeout during rebalance is not seen in the 
> successful case.
> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show 
> all logs entries minus entries particular to our application. For 09/08, the 
> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; 
> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining 
> only error and warning entries, and entries containing any of: "begin 
> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the 
> 09/11 digest logs, entries from the kafka.network.Processor logger are also 
> excised for clarity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2572) zk connection instability, perhaps precipitated by zk client timeout during rebalance

2015-09-22 Thread John Firth (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Firth updated KAFKA-2572:
--
Attachment: 091115-full.log.zip
091115-digest.log
090815-full.log
090815-digest.log

> zk connection instability, perhaps precipitated by zk client timeout during 
> rebalance
> -
>
> Key: KAFKA-2572
> URL: https://issues.apache.org/jira/browse/KAFKA-2572
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Affects Versions: 0.8.2.1
> Environment: zk version 3.4.6,
> CentOS 6, 2.6.32-504.1.3.el6.x86_64
>Reporter: John Firth
> Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log, 
> 091115-full.log.zip
>
>
> On two occasions, have seen zk session expiry, followed by a timeout during a 
> consumer rebalance following this expiry, followed by multiple successive zk 
> session expiries. Comparing these with a case in which a new stable zk 
> session was created following a session expiry, the timeout during rebalance 
> is not seen in the successful case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2572) zk connection instability, perhaps precipitated by zk client timeout during rebalance

2015-09-22 Thread John Firth (JIRA)
John Firth created KAFKA-2572:
-

 Summary: zk connection instability, perhaps precipitated by zk 
client timeout during rebalance
 Key: KAFKA-2572
 URL: https://issues.apache.org/jira/browse/KAFKA-2572
 Project: Kafka
  Issue Type: Bug
  Components: zkclient
Affects Versions: 0.8.2.1
 Environment: zk version 3.4.6,
CentOS 6, 2.6.32-504.1.3.el6.x86_64
Reporter: John Firth


On two occasions, have seen zk session expiry, followed by a timeout during a 
consumer rebalance following this expiry, followed by multiple successive zk 
session expiries. Comparing these with a case in which a new stable zk session 
was created following a session expiry, the timeout during rebalance is not 
seen in the successful case.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)