[
https://issues.apache.org/jira/browse/ZOOKEEPER-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Norbert Kalmar updated ZOOKEEPER-3037:
--------------------------------------
Description:
After a ZK crash or a client timeout, it is sometimes hard to determine from
the logs what happened. Knowing whether ZK was responsive at the time would
help a lot. For example, ZK might spend a lot of time waiting on GC (there is
still some misconception that ZK is a storage system).
To help detect this, Hadoop already has a great tool called JVM Pause Monitor.
(As the name suggests, it can also be used for monitoring, but it also helps
post-mortem in a lot of cases.) Basically, it has a daemon that sleeps for one
second, and if the sleep overruns the 1s by more than a threshold (1s: INFO,
10s: WARN by default; this can be made configurable in our case, see below),
it will alert or write a log entry. It can also report the time GC took.
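As a rough illustration of the mechanism described above, here is a minimal
sketch. It is not Hadoop's code: the class name PauseMonitorSketch, the method
names, and the use of System.err instead of a logger are all assumptions made
for the example. The sleep time and thresholds are the values quoted in this
description.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the Hadoop JvmPauseMonitor class.
class PauseMonitorSketch {
    // Values quoted in the description: 1s sleep, 1s INFO, 10s WARN.
    static final long SLEEP_MS = 1000;
    static final long INFO_THRESHOLD_MS = 1000;
    static final long WARN_THRESHOLD_MS = 10000;

    /** Maps the sleep overrun (in ms) to a log level, or "NONE". */
    static String classify(long extraSleepMs) {
        if (extraSleepMs > WARN_THRESHOLD_MS) {
            return "WARN";
        }
        if (extraSleepMs > INFO_THRESHOLD_MS) {
            return "INFO";
        }
        return "NONE";
    }

    /** Sleeps once and returns how much longer than requested it took. */
    static long measureExtraSleep(long sleepMs) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(sleepMs);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return elapsedMs - sleepMs;
    }

    /** Starts a daemon thread that reports overruns until interrupted. */
    static Thread startMonitor() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    long extra = measureExtraSleep(SLEEP_MS);
                    String level = classify(extra);
                    if (!level.equals("NONE")) {
                        // A real implementation would use the server's logger.
                        System.err.println(level
                            + ": detected pause of approximately " + extra + "ms");
                    }
                }
            } catch (InterruptedException e) {
                // Restore the interrupt flag and let the thread exit.
                Thread.currentThread().interrupt();
            }
        }, "PauseMonitorSketch");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

The idea is that a sleep which returns much later than requested means the
whole JVM was stalled (GC or otherwise), so the overrun approximates the pause
length.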
The class implementing this is in hadoop-common, but ZK should not depend on
that package. Since the implementation is straightforward, and the few commits
it has received in the past five years were nothing serious, I think we could
just copy this class into ZooKeeper and introduce it as a configurable feature
that is off by default.
The class:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
Task:
- Create a class in ZK under contrib called JvmPauseMonitor.
- Make feature configurable, by default: OFF
- Make sleep time and threshold time configurable
- Update documentation
- Add [current size of the heap OR % of heap used] to the log entry whenever
the sleep threshold is exceeded by a lot (10s)
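A possible shape for the configuration and heap-reporting tasks above, as a
sketch only. The property keys (pauseMonitor.enabled and so on), the class
name PauseMonitorConfigSketch, and the log text format are hypothetical
placeholders, not agreed names. Heap figures come from the standard
java.lang.Runtime API.

```java
import java.util.Properties;

// Hypothetical sketch; none of these names are settled.
class PauseMonitorConfigSketch {
    final boolean enabled;
    final long sleepMs;
    final long infoThresholdMs;
    final long warnThresholdMs;

    PauseMonitorConfigSketch(Properties p) {
        // Placeholder keys. The feature is OFF unless explicitly enabled.
        enabled = Boolean.parseBoolean(
            p.getProperty("pauseMonitor.enabled", "false"));
        sleepMs = Long.parseLong(
            p.getProperty("pauseMonitor.sleepMs", "1000"));
        infoThresholdMs = Long.parseLong(
            p.getProperty("pauseMonitor.infoThresholdMs", "1000"));
        warnThresholdMs = Long.parseLong(
            p.getProperty("pauseMonitor.warnThresholdMs", "10000"));
    }

    /** Formats used and max heap (bytes) as megabytes plus a percentage. */
    static String formatHeap(long usedBytes, long maxBytes) {
        long mb = 1024L * 1024L;
        long pct = maxBytes > 0 ? usedBytes * 100 / maxBytes : 0;
        return "heap used " + (usedBytes / mb) + "MB of "
            + (maxBytes / mb) + "MB (" + pct + "%)";
    }

    /** Reads current heap usage from the running JVM. */
    static String currentHeapUsage() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return formatHeap(used, rt.maxMemory());
    }
}
```

With something like this, the WARN-level log entry could append
currentHeapUsage() to the pause message, covering both the size and the
percentage variant of the last task.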
was:
After a ZK crash or a client timeout, it is sometimes hard to determine from
the logs what happened. Knowing whether ZK was responsive at the time would
help a lot. For example, ZK might spend a lot of time waiting on GC (there is
still some misconception that ZK is a storage system).
To help detect this, Hadoop already has a great tool called JVM Pause Monitor.
(As the name suggests, it can also be used for monitoring, but it also helps
post-mortem in a lot of cases.) Basically, it has a daemon that sleeps for one
second, and if the sleep overruns the 1s by more than a threshold (1s: INFO,
10s: WARN by default; this can be made configurable in our case, see below),
it will alert or write a log entry. It can also report the time GC took.
The class implementing this is in hadoop-common, but ZK should not depend on
that package. Since the implementation is straightforward, and the few commits
it has received in the past five years were nothing serious, I think we could
just copy this class into ZooKeeper and introduce it as a configurable feature
that is off by default.
The class:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
Task:
- Create a class in ZK under contrib called JvmPauseMonitor.
- Make feature configurable, by default: OFF
- Make sleep time and threshold time configurable
> Add JvmPauseMonitor to ZooKeeper
> --------------------------------
>
> Key: ZOOKEEPER-3037
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3037
> Project: ZooKeeper
> Issue Type: Improvement
> Components: contrib
> Affects Versions: 3.5.3, 3.4.12
> Reporter: Norbert Kalmar
> Assignee: Norbert Kalmar
> Priority: Minor
>
> After a ZK crash or a client timeout, it is sometimes hard to determine from
> the logs what happened. Knowing whether ZK was responsive at the time would
> help a lot. For example, ZK might spend a lot of time waiting on GC (there is
> still some misconception that ZK is a storage system).
> To help detect this, Hadoop already has a great tool called JVM Pause
> Monitor. (As the name suggests, it can also be used for monitoring, but it
> also helps post-mortem in a lot of cases.) Basically, it has a daemon that
> sleeps for one second, and if the sleep overruns the 1s by more than a
> threshold (1s: INFO, 10s: WARN by default; this can be made configurable in
> our case, see below), it will alert or write a log entry. It can also report
> the time GC took.
> The class implementing this is in hadoop-common, but ZK should not depend on
> that package. Since the implementation is straightforward, and the few
> commits it has received in the past five years were nothing serious, I think
> we could just copy this class into ZooKeeper and introduce it as a
> configurable feature that is off by default.
> The class:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
> Task:
> - Create a class in ZK under contrib called JvmPauseMonitor.
> - Make feature configurable, by default: OFF
> - Make sleep time and threshold time configurable
> - Update documentation
> - Add [current size of the heap OR % of heap used] to the log entry whenever
> the sleep threshold is exceeded by a lot (10s)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)