[
https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Firth updated KAFKA-2572:
------------------------------
Description:
On two occasions, we've seen our process enter a cycle of: zk session expiry;
new session creation; rebalancing activity; pause during which nothing is heard
from the zk server. Restarting the process using the zk client resolved the
problems in both cases.
This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all
logs entries minus entries particular to our application. For 09/08, the time
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only
error and warning entries, and entries containing any of: "begin rebalancing",
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs,
entries from the kafka.network.Processor logger are also excised for clarity.
Unfortunately, debug logging was not enabled during these events.
The 09/11 case shows repeated cycles of session expiry, followed by rebalancing
activity, followed by a pause during which nothing is heard from the zk server,
followed by a session timeout. A stable session seems to have been established
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this
conflicted ephemeral node
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
at
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
a while back in a different session, hence I will backoff for this node to be
deleted by Zookeeper and retry" were logged out repeatedly until we restarted
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in
the log.
The 09/08 case is a little more straightforward than the 09/11 case, in that a
stable session was not established prior to our restarting the process. It's
perhaps also noteworthy that in the 09/08 case, two timeouts for the same
session are seen during a single rebalance, at 2015-09-08T12:52:19.107-04:00
and 2015-09-08T12:52:31.639-04:00. The rebalance in question begins at
2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the
process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last
entry in the logs for this day.
was:
On two occasions, we've seen our process enter a cycle of: zk session expiry;
new session creation; rebalancing activity; pause during which nothing is heard
from the zk server. Restarting the process using the zk client resolved the
problems in both cases.
This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all
logs entries minus entries particular to our application. For 09/08, the time
span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; for
11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to
2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only
error and warning entries, and entries containing any of: "begin rebalancing",
"end rebalancing", "timed", and "zookeeper state". For the 09/11 digest logs,
entries from the kafka.network.Processor logger are also excised for clarity.
Unfortunately, debug logging was not enabled during these events.
The 09/11 case shows repeated cycles of session expiry, followed by rebalancing
activity, followed by a pause during which nothing is heard from the zk server,
followed by a session timeout. A stable session seems to have been established
at 2015-09-11T04:13:47.140-04:00, but messages of the form "I wrote this
conflicted ephemeral node
[{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
at
/consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
a while back in a different session, hence I will backoff for this node to be
deleted by Zookeeper and retry" were logged out repeatedly until we restarted
the process after 2015-09-11T07:44:47.124-04:00, which marks the final entry in
the log.
The 09/08 case is a little more straightforward than the 09/11 case, in that a
stable session was not re-established. It's perhaps also noteworthy that in the
09/08 case, two timeouts for the same session are seen during a single
rebalance, at 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00.
The rebalance in question begins at 2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the
process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last
entry in the logs for this day.
> zk connection instability
> -------------------------
>
> Key: KAFKA-2572
> URL: https://issues.apache.org/jira/browse/KAFKA-2572
> Project: Kafka
> Issue Type: Bug
> Components: zkclient
> Affects Versions: 0.8.2.1
> Environment: zk version 3.4.6,
> CentOS 6, 2.6.32-504.1.3.el6.x86_64
> Reporter: John Firth
> Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log,
> 091115-full.log.zip
>
>
> On two occasions, we've seen our process enter a cycle of: zk session expiry;
> new session creation; rebalancing activity; pause during which nothing is
> heard from the zk server. Restarting the process using the zk client resolved
> the problems in both cases.
> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show
> all logs entries minus entries particular to our application. For 09/08, the
> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00;
> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to
> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining
> only error and warning entries, and entries containing any of: "begin
> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the
> 09/11 digest logs, entries from the kafka.network.Processor logger are also
> excised for clarity. Unfortunately, debug logging was not enabled during
> these events.
> The 09/11 case shows repeated cycles of session expiry, followed by
> rebalancing activity, followed by a pause during which nothing is heard from
> the zk server, followed by a session timeout. A stable session seems to have
> been established at 2015-09-11T04:13:47.140-04:00, but messages of the form
> "I wrote this conflicted ephemeral node
> [{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
> at
> /consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
> a while back in a different session, hence I will backoff for this node to
> be deleted by Zookeeper and retry" were logged out repeatedly until we
> restarted the process after 2015-09-11T07:44:47.124-04:00, which marks the
> final entry in the log.
> The 09/08 case is a little more straightforward than the 09/11 case, in that
> a stable session was not established prior to our restarting the process.
> It's perhaps also noteworthy that in the 09/08 case, two timeouts for the
> same session are seen during a single rebalance, at
> 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00. The
> rebalance in question begins at 2015-09-08T12:52:06.667-04:00.
> The connection to ZK expires and is restablished multiple times before the
> process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last
> entry in the logs for this day.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)