[jira] [Created] (ZOOKEEPER-3864) Reject create/renew/close global session in RO mode
Jie Huang created ZOOKEEPER-3864: Summary: Reject create/renew/close global session in RO mode Key: ZOOKEEPER-3864 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3864 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Jie Huang Fix For: 3.6.2 These ops are not read operations. They modify the state, so they should be rejected in RO mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3863) Do not track global sessions in ReadOnlyZooKeeperServer
Jie Huang created ZOOKEEPER-3863: Summary: Do not track global sessions in ReadOnlyZooKeeperServer Key: ZOOKEEPER-3863 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3863 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.6.2 Reporter: Jie Huang ReadOnlyZooKeeperServer uses the default SessionTrackerImpl, which tracks and expires global sessions. Global sessions should be tracked and expired only by the leader. This diff changes the code to use LearnerSessionTracker, which tracks and expires only local sessions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3859) Add a couple request processor metrics
Jie Huang created ZOOKEEPER-3859: Summary: Add a couple request processor metrics Key: ZOOKEEPER-3859 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3859 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.2 These metrics, together with existing request processor metrics, help identify the bottleneck in the pipeline: PROPOSAL_PROCESS_TIME LEARNER_REQUEST_PROCESSOR_QUEUE_SIZE -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3858) Add metrics to track server unavailable time
Jie Huang created ZOOKEEPER-3858: Summary: Add metrics to track server unavailable time Key: ZOOKEEPER-3858 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3858 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.2 These metrics track the time when a ZooKeeper server is up and running but not serving client traffic because it is not part of a quorum. They don't track hardware downtime or ZooKeeper process downtime. UNAVAILABLE_TIME: time between LOOKING and BROADCAST. LEADER_UNAVAILABLE_TIME: time between LOOKING and BROADCAST on the leader. -- This message was sent by Atlassian Jira (v8.3.4#803005)
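A minimal sketch of how a "time between LOOKING and BROADCAST" sample could be taken. This is an illustration in Python, not the ZooKeeper code; the class name `UnavailableTimeTracker`, the string state names, and the choice to keep the earliest LOOKING timestamp across repeated elections are all assumptions.

```python
class UnavailableTimeTracker:
    """Records one sample per LOOKING -> BROADCAST transition (hypothetical helper)."""

    def __init__(self):
        self.looking_since = None  # time the server first entered LOOKING, in ms
        self.samples_ms = []       # recorded unavailable durations

    def on_state(self, state, now_ms):
        if state == "LOOKING":
            # Assumption: keep the earliest timestamp if LOOKING repeats.
            if self.looking_since is None:
                self.looking_since = now_ms
        elif state == "BROADCAST" and self.looking_since is not None:
            self.samples_ms.append(now_ms - self.looking_since)
            self.looking_since = None
```

Process or hardware downtime never reaches `on_state`, which matches the note that those are not tracked.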
[jira] [Created] (ZOOKEEPER-3856) Add a couple metrics to track inflight diff syncs and snap syncs
Jie Huang created ZOOKEEPER-3856: Summary: Add a couple metrics to track inflight diff syncs and snap syncs Key: ZOOKEEPER-3856 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3856 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3847) Add a couple metrics to help track Netty memory usage
Jie Huang created ZOOKEEPER-3847: Summary: Add a couple metrics to help track Netty memory usage Key: ZOOKEEPER-3847 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3847 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.2 Adding these metrics: * RESPONSE_BYTES: size of responses (in bytes) being sent to a client * WATCH_BYTES: size of watch events (in bytes) being sent to a client -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3846) Add a couple TLS related metrics
Jie Huang created ZOOKEEPER-3846: Summary: Add a couple TLS related metrics Key: ZOOKEEPER-3846 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3846 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.2 Adding those metrics: * UNSUCCESSFUL_HANDSHAKE: number of unsuccessful TLS handshakes * INSECURE_ADMIN: number of insecure connections to admin port -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3845) Add metric JVM_PAUSE_TIME
Jie Huang created ZOOKEEPER-3845: Summary: Add metric JVM_PAUSE_TIME Key: ZOOKEEPER-3845 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3845 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.2 This metric reports how long the JVM stalls, which helps diagnose unexpectedly high latency caused by things like GC. -- This message was sent by Atlassian Jira (v8.3.4#803005)
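One common way to estimate a runtime stall is to sleep for a fixed interval and measure how much longer than that the sleep actually took. The sketch below shows that idea in Python; it is an assumption about the general technique, not a description of how JVM_PAUSE_TIME is computed, and `measure_pause_ms` is a hypothetical name.

```python
import time


def measure_pause_ms(sleep_ms, clock=time.monotonic, sleep=time.sleep):
    """Sleep for sleep_ms and return the overshoot in milliseconds.

    The clock and sleep functions are injectable so the logic can be
    exercised without really waiting.
    """
    start = clock()
    sleep(sleep_ms / 1000.0)
    elapsed_ms = (clock() - start) * 1000.0
    # Anything beyond the requested sleep is attributed to a stall.
    return max(0.0, elapsed_ms - sleep_ms)
```

A monitoring thread would call this in a loop and report any overshoot above some threshold.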
[jira] [Created] (ZOOKEEPER-3844) Add useful metrics for ZK servers
Jie Huang created ZOOKEEPER-3844: Summary: Add useful metrics for ZK servers Key: ZOOKEEPER-3844 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3844 Project: ZooKeeper Issue Type: Improvement Components: metric system Reporter: Jie Huang Fix For: 3.6.2 In ZOOKEEPER-3245, we upstreamed metrics that we use to monitor and debug ZooKeeper. We have introduced more metrics since then, which will be upstreamed in this JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3816) Improve the lagging detection between the leader and learners
Jie Huang created ZOOKEEPER-3816: Summary: Improve the lagging detection between the leader and learners Key: ZOOKEEPER-3816 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3816 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Jie Huang Assignee: Jie Huang Fix For: 3.6.2 Currently, we have SyncLimitCheck on the leader to detect a lagging learner by tracking the time it takes for a proposal to be acknowledged. If the leader doesn't receive the ack for a proposal from a learner within the syncLimit, it disconnects the learner. The purpose of the SyncLimitCheck is to prevent sessions connected to a slow learner from being expired. Disconnecting the slow learner gives its clients a chance to reconnect to another server before session expiration. However, there are two cases in which sessions can still expire with the current SyncLimitCheck implementation. In the first case, the ack reaches the leader on time but a ping response including the session table is delayed. The lagging detection is based on the proposal/ack time, yet the sessions are updated when the ping response is received. If the ping response is delayed longer than the ack, the sessions could expire without lagging being detected. It makes more sense to detect lagging based on the ping/ping response time. In the second case, the leader detects lagging and closes the connection to the slow learner, but the learner doesn't know that it is being disconnected because of a long socket closing time or a lost RST signal. So the learner doesn't disconnect its clients, who lose their chance to reconnect to another server before session expiration. The learner, like the leader, also needs a means to detect communication issues at a higher-than-socket layer. So we need a lagging detector that is based on ping/ping response and is bidirectional between the leader and the learners. -- This message was sent by Atlassian Jira (v8.3.4#803005)
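The ping-based detection described above can be sketched as follows. This is a simplified Python illustration under assumptions: the class name `LaggingDetector`, the ping ids, and the millisecond limit are hypothetical, and the real detector may differ. Because it depends only on pings sent and responses received, the same object could run on either side of the connection, which is what makes the check bidirectional.

```python
class LaggingDetector:
    """Flags the peer as lagging when a ping stays unanswered for too long."""

    def __init__(self, limit_ms):
        self.limit_ms = limit_ms
        self.outstanding = {}  # ping id -> send time in ms

    def ping_sent(self, ping_id, now_ms):
        self.outstanding[ping_id] = now_ms

    def response_received(self, ping_id):
        # Unknown ids are ignored so a late duplicate response is harmless.
        self.outstanding.pop(ping_id, None)

    def is_lagging(self, now_ms):
        if not self.outstanding:
            return False
        oldest = min(self.outstanding.values())
        return now_ms - oldest > self.limit_ms
```

A proposal ack arriving on time has no effect on this check; only ping responses clear an outstanding entry.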
[jira] [Created] (ZOOKEEPER-3774) Close quorum socket asynchronously on the leader to avoid ping being blocked by long socket closing time
Jie Huang created ZOOKEEPER-3774: Summary: Close quorum socket asynchronously on the leader to avoid ping being blocked by long socket closing time Key: ZOOKEEPER-3774 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3774 Project: ZooKeeper Issue Type: Sub-task Components: server Reporter: Jie Huang Fix For: 3.7.0 In ZOOKEEPER-3574 we close the quorum sockets on followers asynchronously when a leader is partitioned away, so the shutdown process is not stalled by a long socket closing time and the followers can quickly establish a new quorum to serve client requests. We've found that a long socket closing time can cause trouble on the leader too when a follower is partitioned away, if the partition is detected by PingLaggingDetector. When the ping thread detects the partition, it tries to disconnect the follower. If the socket closing time is long, the ping thread is blocked and no ping is sent to any follower, even the ones still connected to the leader, since the ping thread is responsible for sending pings to all followers. When followers don't receive pings, they don't send ping responses. When the leader doesn't receive ping responses, the sessions expire. To prevent good sessions from expiring, we need to close the socket asynchronously on the leader too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3683) Discard requests that are delayed longer than a configured threshold
Jie Huang created ZOOKEEPER-3683: Summary: Discard requests that are delayed longer than a configured threshold Key: ZOOKEEPER-3683 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3683 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Jie Huang Fix For: 3.6.0 The RequestThrottler ensures that no more requests than the system can handle are fed into the request processor pipeline. In the meantime, the throttler queues all incoming requests, and nothing instructs the clients to slow down. This new feature marks every request that waits in the RequestThrottler longer than the specified throttledOpWaitTime as throttled. Such requests receive no processing other than being passed down the pipeline, which preserves the order of all requests. The FinalProcessor issues an error response (new error code: ZTHROTTLEDOP) for these unprocessed requests. The intent is for the clients not to retry them immediately. Because throttled requests are not processed, the pipeline as a whole also gets through its work faster. Throttled requests are not communicated between servers and only travel through the server they belong to. -- This message was sent by Atlassian Jira (v8.3.4#803005)
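The marking scheme can be sketched as below. This is a Python illustration, not the ZooKeeper code: the `Request` fields, `drain`, and `final_response` are hypothetical names, and only throttledOpWaitTime and ZTHROTTLEDOP come from the description above. The point it shows is that a stale request is flagged rather than dropped, so request order is preserved all the way to the final stage.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    enqueued_ms: int
    throttled: bool = False


def drain(queue, now_ms, throttled_op_wait_ms):
    """Pass every request on in FIFO order, flagging the ones that waited too long."""
    out = []
    while queue:
        req = queue.popleft()
        if now_ms - req.enqueued_ms > throttled_op_wait_ms:
            req.throttled = True  # flagged, but still forwarded in order
        out.append(req)
    return out


def final_response(req):
    """Final stage: throttled requests get an error instead of a result."""
    return "ZTHROTTLEDOP" if req.throttled else "OK"
```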
[jira] [Created] (ZOOKEEPER-3682) Stop initializing new SSL connection if ZK server is shutting down
Jie Huang created ZOOKEEPER-3682: Summary: Stop initializing new SSL connection if ZK server is shutting down Key: ZOOKEEPER-3682 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3682 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Jie Huang Fix For: 3.6.0 ZK keeps accepting new connections while it's being shut down, then immediately closes them when it finds out that the ZK server is not running. This was not a big deal before SSL was enabled, since creating TCP connections is relatively cheap. With SSL widely enabled, creating a connection involves a handshake that takes non-trivial CPU time, which is wasted since the connections are closed right after. This JIRA stops initiating the TLS handshake if the zkServer is not serving, to save resources. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3575) Moving sending packets in Learner to a separate thread
Jie Huang created ZOOKEEPER-3575: Summary: Moving sending packets in Learner to a separate thread Key: ZOOKEEPER-3575 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3575 Project: ZooKeeper Issue Type: Sub-task Components: server Affects Versions: 3.6.0 Reporter: Jie Huang After changing to close the socket asynchronously, the shutdown process can proceed while the socket is being closed. However, the shutdown process could still stall if a thread being shut down is writing to the socket. For example, the SyncRequestProcessor flushes all ACK packets in its queue when shutdown is called, which calls Learner.writePacket(), which will not return (with an IO exception) until the socket finishes closing. So it's still delayed by the socket closing time. To get around the delay, we move Learner.writePacket() to a separate thread. The tricky part is handling the IO exception thrown by Learner.writePacket(). Currently, the IO exception is caught by different callers in different ways. For example, if an IO exception is caught during revalidateSession, the session is closed and removed. In other cases, such as in FollowerRequestProcessor and SendAckRequestProcessor, the quorum socket is closed when the IO exception is caught. After moving it to a thread, the callers won't be able to catch and handle the exception, so we need to handle it within the sending function. We reason that if an IO exception is thrown on the quorum socket of a follower, it only makes sense to shut down the server. So we make the sending thread a ZooKeeperCriticalThread. -- This message was sent by Atlassian Jira (v8.3.4#803005)
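The shape of this change can be sketched as a queue plus a dedicated sender thread. The Python below is an illustration only: `PacketSender`, `write_packet`, and `on_fatal` are hypothetical names, and `on_fatal` stands in for the "shut down the server" behaviour that the description assigns to a ZooKeeperCriticalThread. Callers only enqueue, so they never block on the socket, and the I/O error is handled in one place inside the sender.

```python
import queue
import threading


class PacketSender(threading.Thread):
    _STOP = object()  # sentinel that tells the thread to exit

    def __init__(self, write_packet, on_fatal):
        super().__init__(daemon=True)
        self._queue = queue.Queue()
        self._write_packet = write_packet  # the blocking socket write
        self._on_fatal = on_fatal          # called once if the write fails

    def send(self, packet):
        # Enqueue only; the caller never touches the socket.
        self._queue.put(packet)

    def shutdown(self):
        self._queue.put(self._STOP)

    def run(self):
        while True:
            packet = self._queue.get()
            if packet is self._STOP:
                return
            try:
                self._write_packet(packet)
            except OSError as err:
                # Callers can no longer catch this, so escalate here and stop.
                self._on_fatal(err)
                return
```

Packets queued after a failure are never written, which is consistent with treating a broken quorum socket as fatal.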
[jira] [Created] (ZOOKEEPER-3574) Close quorum socket asynchronously to avoid shutdown stalled by long socket closing time
Jie Huang created ZOOKEEPER-3574: Summary: Close quorum socket asynchronously to avoid shutdown stalled by long socket closing time Key: ZOOKEEPER-3574 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3574 Project: ZooKeeper Issue Type: Sub-task Components: server Affects Versions: 3.6.0 Reporter: Jie Huang Since we can't use the SO_LINGER option or find a substitute to close a TLS socket quickly in JDK 11, we call close() asynchronously so that the shutdown can proceed and a new leader election can start while the socket is being closed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
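A minimal sketch of closing asynchronously, written in Python for illustration. The helper name `close_async` is hypothetical and this is not the ZooKeeper implementation; it only shows that handing `close()` to another thread lets the caller continue while a slow close is still in progress.

```python
import threading


def close_async(sock):
    """Start closing sock on a background thread and return that thread."""

    def _close():
        try:
            sock.close()
        except OSError:
            pass  # the connection is being abandoned; nothing else to do

    closer = threading.Thread(target=_close, name="socket-closer", daemon=True)
    closer.start()
    return closer
```

The caller may ignore the returned thread entirely, or join it later if it needs to know the close has finished.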
[jira] [Created] (ZOOKEEPER-3573) Dealing with long TLS connection closing time without SO_LINGER option
Jie Huang created ZOOKEEPER-3573: Summary: Dealing with long TLS connection closing time without SO_LINGER option Key: ZOOKEEPER-3573 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3573 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.6.0 Reporter: Jie Huang As described in ZOOKEEPER-3384, with SSL sockets, a close_notify must be sent before closing the write side of a connection. When the send buffer is full and the write is blocked, it takes a long time to send close_notify and thus a long time to close the socket. The long closing time on followers with a partitioned-away leader would stall the shutdown process and delay the leader election needed to establish a new quorum. As a result, the ensemble would be unavailable for a long time. In ZOOKEEPER-3384, the SO_LINGER option is used to close the socket quickly (and potentially uncleanly). In JDK 11, however, the SO_LINGER option is not honored, so we need a new way to avoid the long quorum unavailability. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ZOOKEEPER-3547) Add detailed documentation on throttling
Jie Huang created ZOOKEEPER-3547: Summary: Add detailed documentation on throttling Key: ZOOKEEPER-3547 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3547 Project: ZooKeeper Issue Type: Improvement Components: documentation Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ZOOKEEPER-3503) Add server-side large request protection
Jie Huang created ZOOKEEPER-3503: Summary: Add server-side large request protection Key: ZOOKEEPER-3503 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3503 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.6.0 Reporter: Jie Huang This task adds a new request limiting mechanism to ZooKeeper that aims to protect ZooKeeper from accepting too many large requests and crashing because it runs out of memory. This is designed to augment the connection throttling (ZOOKEEPER-3242) and request throttling (ZOOKEEPER-3243), which focus on limiting the number rather than size of requests. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
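One way to limit by size rather than by count is to cap the total bytes of large requests in flight. The Python sketch below illustrates that idea under assumptions: `LargeRequestLimiter`, `threshold_bytes`, and `max_inflight_bytes` are hypothetical names, the choice to exempt small requests from accounting is a guess, and the sketch is single-threaded (a real server would need synchronization).

```python
class LargeRequestLimiter:
    def __init__(self, threshold_bytes, max_inflight_bytes):
        self.threshold_bytes = threshold_bytes        # at or above this, a request is "large"
        self.max_inflight_bytes = max_inflight_bytes  # cap on large bytes being processed
        self.inflight_bytes = 0

    def try_admit(self, size):
        if size < self.threshold_bytes:
            return True  # assumption: small requests are not counted
        if self.inflight_bytes + size > self.max_inflight_bytes:
            return False  # admitting this one would exceed the memory budget
        self.inflight_bytes += size
        return True

    def done(self, size):
        """Release the budget once an admitted large request is finished."""
        if size >= self.threshold_bytes:
            self.inflight_bytes -= size
```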
[jira] [Created] (ZOOKEEPER-3493) Deflake testConcurrentRequestProcessingInCommitProcessor in CommitProcessorMetricsTest
Jie Huang created ZOOKEEPER-3493: Summary: Deflake testConcurrentRequestProcessingInCommitProcessor in CommitProcessorMetricsTest Key: ZOOKEEPER-3493 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3493 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.6.0 Reporter: Jie Huang -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ZOOKEEPER-3492) Add weights to server side connection throttling
Jie Huang created ZOOKEEPER-3492: Summary: Add weights to server side connection throttling Key: ZOOKEEPER-3492 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3492 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Jie Huang Fix For: 3.6.0 In ZOOKEEPER-3242, we introduced connection throttling to protect the server from being overloaded. We realized that the costs of creating a local session, creating a global session, and reconnecting are different, so we assign weights to these costs when throttling. For example, with the same setting, the throttler allows more connections to be created if they are local. This allows the server resources to be fully utilized. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
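The effect of weights can be sketched with a token budget in which each connection type costs a different number of tokens. This Python example is illustrative only: the weight values, the `WeightedThrottle` class, and the omission of token refill are assumptions, not the actual throttler or its settings.

```python
# Hypothetical relative costs; the real weights are configuration-dependent.
WEIGHTS = {"local": 1, "renew": 2, "global": 3}


class WeightedThrottle:
    def __init__(self, tokens):
        self.tokens = tokens  # budget available in the current window

    def try_admit(self, kind):
        cost = WEIGHTS[kind]
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True
```

With the same budget of 6 tokens, this sketch admits six local sessions but only two global ones, which is the behaviour the description gives as its example.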
[jira] [Created] (ZOOKEEPER-3437) Improve sync throttling on a learner master
Jie Huang created ZOOKEEPER-3437: Summary: Improve sync throttling on a learner master Key: ZOOKEEPER-3437 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3437 Project: ZooKeeper Issue Type: Improvement Components: quorum Affects Versions: 3.6.0 Reporter: Jie Huang Fix For: 3.6.0 As described in ZOOKEEPER-1928, a leader can become overloaded if it sends too many snapshots concurrently during sync time. Sending too many diffs at the same time can also overload it. In this JIRA, we will: (1) add diff sync throttling in addition to snap sync throttling; (2) extend the protection to followers that serve observers; (3) improve the counting of concurrent snap syncs/diff syncs to avoid double counting or missed counts. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
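The counting concern in the last item can be illustrated with a throttler that tracks in-progress syncs by learner id instead of with a bare counter. This Python sketch is an assumption about one possible approach, not the actual change; `SyncThrottler` and the learner ids are hypothetical. One instance would be used for snap syncs and another, with its own limit, for diff syncs.

```python
class SyncThrottler:
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_progress = set()  # learner ids; a set cannot count one learner twice

    def begin(self, learner_id):
        if learner_id in self.in_progress:
            return True  # already counted, do not count again
        if len(self.in_progress) >= self.max_concurrent:
            return False
        self.in_progress.add(learner_id)
        return True

    def end(self, learner_id):
        # discard() makes a repeated end() harmless, so the count never drifts.
        self.in_progress.discard(learner_id)
```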
[jira] [Created] (ZOOKEEPER-3401) Fix metric PROPOSAL_ACK_CREATION_LATENCY
Jie Huang created ZOOKEEPER-3401: Summary: Fix metric PROPOSAL_ACK_CREATION_LATENCY Key: ZOOKEEPER-3401 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3401 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3383) Improve prep processor metric accuracy and de-flaky unit test
Jie Huang created ZOOKEEPER-3383: Summary: Improve prep processor metric accuracy and de-flaky unit test Key: ZOOKEEPER-3383 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3383 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3379) De-flaky test in Quorum Packet Metrics
Jie Huang created ZOOKEEPER-3379: Summary: De-flaky test in Quorum Packet Metrics Key: ZOOKEEPER-3379 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3379 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ZOOKEEPER-3316) Remove unused code in SyncRequestProcessor
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang resolved ZOOKEEPER-3316. -- Resolution: Invalid > Remove unused code in SyncRequestProcessor > -- > > Key: ZOOKEEPER-3316 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3316 > Project: ZooKeeper > Issue Type: Bug > Components: server >Reporter: Jie Huang >Priority: Minor > Labels: pull-request-available > Fix For: 3.6.0 > > Time Spent: 2h 50m > Remaining Estimate: 0h > > to make spotbugs happy -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ZOOKEEPER-3324) Add read/write metrics for top level znodes
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang updated ZOOKEEPER-3324: - Description: These metrics provide bytes read from each branch under the root and bytes written to each branch under the root. We use top level znodes not only to manage applications that share an ensemble but also to organize data on a dedicated ensemble. These metrics help us to do quota management, ACL management, etc at the top znode level. > Add read/write metrics for top level znodes > --- > > Key: ZOOKEEPER-3324 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3324 > Project: ZooKeeper > Issue Type: Sub-task > Components: metric system >Reporter: Jie Huang >Priority: Minor > Labels: pull-request-available > Fix For: 3.6.0 > > Time Spent: 50m > Remaining Estimate: 0h > > These metrics provide bytes read from each branch under the root and bytes > written to each branch under the root. We use top level znodes not only to > manage applications that share an ensemble but also to organize data on a > dedicated ensemble. These metrics help us to do quota management, ACL > management, etc at the top znode level. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
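A small Python sketch of per-branch accounting keyed by the first path component under the root. The names `top_namespace` and `TopLevelMetrics` are hypothetical, and skipping operations on the root itself is an assumption; the actual metric names and wiring are not given in this message.

```python
from collections import Counter


def top_namespace(path):
    """Return the first component under the root, or None for the root itself."""
    parts = path.split("/")
    return parts[1] if len(parts) > 1 and parts[1] else None


class TopLevelMetrics:
    def __init__(self):
        self.read_bytes = Counter()   # branch name -> bytes read
        self.write_bytes = Counter()  # branch name -> bytes written

    def on_read(self, path, nbytes):
        ns = top_namespace(path)
        if ns is not None:
            self.read_bytes[ns] += nbytes

    def on_write(self, path, nbytes):
        ns = top_namespace(path)
        if ns is not None:
            self.write_bytes[ns] += nbytes
```

Totals per branch are what make quota-style decisions at the top znode level possible.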
[jira] [Resolved] (ZOOKEEPER-3328) misc metrics
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang resolved ZOOKEEPER-3328. -- Resolution: Not A Problem It turns out that I don't have any left over metrics intended for this category. > misc metrics > > > Key: ZOOKEEPER-3328 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3328 > Project: ZooKeeper > Issue Type: Sub-task > Components: metric system >Reporter: Jie Huang >Priority: Minor > Fix For: 3.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ZOOKEEPER-3325) Add unavailable time metrics for quorum peers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang resolved ZOOKEEPER-3325. -- Resolution: Later These two metrics require ZabState and should be upstreamed together with it. > Add unavailable time metrics for quorum peers > - > > Key: ZOOKEEPER-3325 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3325 > Project: ZooKeeper > Issue Type: Sub-task > Components: metric system >Reporter: Jie Huang >Priority: Minor > Fix For: 3.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3328) misc metrics
Jie Huang created ZOOKEEPER-3328: Summary: misc metrics Key: ZOOKEEPER-3328 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3328 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3327) Add unrecoverable error count
Jie Huang created ZOOKEEPER-3327: Summary: Add unrecoverable error count Key: ZOOKEEPER-3327 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3327 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3326) Add session/connection related metrics
Jie Huang created ZOOKEEPER-3326: Summary: Add session/connection related metrics Key: ZOOKEEPER-3326 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3326 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3325) Add unavailable time metrics for quorum peers
Jie Huang created ZOOKEEPER-3325: Summary: Add unavailable time metrics for quorum peers Key: ZOOKEEPER-3325 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3325 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3324) Add read/write metrics for top level znodes
Jie Huang created ZOOKEEPER-3324: Summary: Add read/write metrics for top level znodes Key: ZOOKEEPER-3324 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3324 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3323) Add TxnSnapLog metrics
Jie Huang created ZOOKEEPER-3323: Summary: Add TxnSnapLog metrics Key: ZOOKEEPER-3323 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3323 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3321) Add metrics for Leader
Jie Huang created ZOOKEEPER-3321: Summary: Add metrics for Leader Key: ZOOKEEPER-3321 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3321 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3319) Add metrics for follower and observer
Jie Huang created ZOOKEEPER-3319: Summary: Add metrics for follower and observer Key: ZOOKEEPER-3319 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3319 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ZOOKEEPER-3313) Upgrade a few metrics to percentile counter
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang resolved ZOOKEEPER-3313. -- Resolution: Not A Problem > Upgrade a few metrics to percentile counter > --- > > Key: ZOOKEEPER-3313 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3313 > Project: ZooKeeper > Issue Type: Sub-task > Components: metric system >Reporter: Jie Huang >Priority: Minor > Fix For: 3.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-3313) Upgrade a few metrics to percentile counter
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794407#comment-16794407 ] Jie Huang commented on ZOOKEEPER-3313: -- I was planning to update READ_LATENCY, UPDATE_LATENCY, and PROPAGATION_LATENCY, but found out that they already use percentile counters in master. > Upgrade a few metrics to percentile counter > --- > > Key: ZOOKEEPER-3313 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3313 > Project: ZooKeeper > Issue Type: Sub-task > Components: metric system >Reporter: Jie Huang >Priority: Minor > Fix For: 3.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3316) Remove unused code in SyncRequestProcessor
Jie Huang created ZOOKEEPER-3316: Summary: Remove unused code in SyncRequestProcessor Key: ZOOKEEPER-3316 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3316 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Jie Huang Fix For: 3.6.0 to make spotbugs happy -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3313) Upgrade a few metrics to percentile counter
Jie Huang created ZOOKEEPER-3313: Summary: Upgrade a few metrics to percentile counter Key: ZOOKEEPER-3313 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3313 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3310) Add metrics for prep processor
Jie Huang created ZOOKEEPER-3310: Summary: Add metrics for prep processor Key: ZOOKEEPER-3310 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3310 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3309) Add sync processor metrics
Jie Huang created ZOOKEEPER-3309: Summary: Add sync processor metrics Key: ZOOKEEPER-3309 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3309 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3305) Add Quorum Packet metrics
Jie Huang created ZOOKEEPER-3305: Summary: Add Quorum Packet metrics Key: ZOOKEEPER-3305 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3305 Project: ZooKeeper Issue Type: Sub-task Components: metric system Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-3268) Add commit processor metrics
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760024#comment-16760024 ] Jie Huang commented on ZOOKEEPER-3268: -- Add metrics for requests queued in the commit processor, time spent in the commit processor, and so on. > Add commit processor metrics > > > Key: ZOOKEEPER-3268 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3268 > Project: ZooKeeper > Issue Type: Sub-task > Components: server >Reporter: Jie Huang >Priority: Minor > Labels: pull-request-available > Fix For: 3.6.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-3267) Add watcher metrics
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759626#comment-16759626 ] Jie Huang commented on ZOOKEEPER-3267: -- Add metrics for fired watch counts, metrics for dead watchers (cleared count, cleaner latency, etc) in DeadWatcherListener > Add watcher metrics > --- > > Key: ZOOKEEPER-3267 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3267 > Project: ZooKeeper > Issue Type: Sub-task > Components: server >Affects Versions: 3.6.0 >Reporter: Jie Huang >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (ZOOKEEPER-3268) Add commit processor metrics
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang updated ZOOKEEPER-3268: - Comment: was deleted (was: Add metrics for fired watch counts, metrics for dead watchers (cleared count, cleaner latency, etc) in DeadWatcherListener ) > Add commit processor metrics > > > Key: ZOOKEEPER-3268 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3268 > Project: ZooKeeper > Issue Type: Sub-task > Components: server >Reporter: Jie Huang >Priority: Minor > Fix For: 3.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-3268) Add commit processor metrics
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759618#comment-16759618 ] Jie Huang commented on ZOOKEEPER-3268: -- Add metrics for fired watch counts, metrics for dead watchers in DeadWatcherListener > Add commit processor metrics > > > Key: ZOOKEEPER-3268 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3268 > Project: ZooKeeper > Issue Type: Sub-task > Components: server >Reporter: Jie Huang >Priority: Minor > Fix For: 3.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ZOOKEEPER-3268) Add commit processor metrics
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759618#comment-16759618 ] Jie Huang edited comment on ZOOKEEPER-3268 at 2/4/19 4:22 AM: -- Add metrics for fired watch counts, metrics for dead watchers (cleared count, cleaner latency, etc) in DeadWatcherListener was (Author: jiehuang): Add metrics for fired watch counts, metrics for dead watchers in DeadWatcherListener > Add commit processor metrics > > > Key: ZOOKEEPER-3268 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3268 > Project: ZooKeeper > Issue Type: Sub-task > Components: server >Reporter: Jie Huang >Priority: Minor > Fix For: 3.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ZOOKEEPER-3267) Add watcher metrics
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Huang updated ZOOKEEPER-3267: - Summary: Add watcher metrics (was: Add watch metrics) > Add watcher metrics > --- > > Key: ZOOKEEPER-3267 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3267 > Project: ZooKeeper > Issue Type: Sub-task > Components: server >Affects Versions: 3.6.0 >Reporter: Jie Huang >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3267) Add watch metrics
Jie Huang created ZOOKEEPER-3267: Summary: Add watch metrics Key: ZOOKEEPER-3267 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3267 Project: ZooKeeper Issue Type: Sub-task Components: server Affects Versions: 3.6.0 Reporter: Jie Huang -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3268) Add commit processor metrics
Jie Huang created ZOOKEEPER-3268: Summary: Add commit processor metrics Key: ZOOKEEPER-3268 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3268 Project: ZooKeeper Issue Type: Sub-task Components: server Reporter: Jie Huang Fix For: 3.6.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-3245) Add useful metrics for ZK pipeline and request/server states
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746924#comment-16746924 ] Jie Huang commented on ZOOKEEPER-3245: -- Splitting this Jira into smaller children tasks > Add useful metrics for ZK pipeline and request/server states > > > Key: ZOOKEEPER-3245 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3245 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Jie Huang >Priority: Minor > Labels: pull-request-available > Fix For: 3.6.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Add metrics to track time spent in the commit processor, watch counts and > fire rates, how long a Zookeeper server is unavailable between elections, > quorum packet size and time spent in the queue, aggregate request > states/flow, request throttle, sync processor queue time, per-connection read > and write request counts, commit processor queue sizes(read/write/commit), > final request processor read/write times, watch manager cnxn/path counts, > latencies at different points in pipeline for commits/informs, split up > request type counters for more request types, export sum metrics for all > AvgMinMax counters, per-connection watch fired counts, ack latency for each > follower, percentile metrics to zeus latency counters, proposal count, number > of outstanding changes, snapshot and txns loading time during startup, > number of non-voting followers, leader unavailable time, etc. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3251) Add new server metric types: percentile counter and counter set
Jie Huang created ZOOKEEPER-3251: Summary: Add new server metric types: percentile counter and counter set Key: ZOOKEEPER-3251 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3251 Project: ZooKeeper Issue Type: Sub-task Components: server Reporter: Jie Huang Fix For: 3.6.0 This will add three metric types: AvgMinMaxCounterSet AvgMinMaxPercentileCounter AvgMinMaxPercentileCounterSet The percentile metrics allow us to get a better sense of the latency distribution. They are more expensive than AvgMinMax counters and are restricted to latency measurements for now. The counter set allows the grouping of metrics such as write per namespace, read per namespace. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
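To illustrate the two ideas in ZOOKEEPER-3251, here is a minimal sketch of a latency counter that reports avg/min/max plus a percentile, and of a "set" that groups such counters by key (for example, by namespace). The class and method names (LatencySketch, LatencySketchSet, percentile) are hypothetical and are not the ZooKeeper metric classes; the nearest-rank percentile over stored samples is an assumption chosen for clarity, and it shows why percentile metrics cost more than plain AvgMinMax counters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the ZooKeeper AvgMinMaxPercentileCounter API.
class LatencySketch {
    private final List<Long> samples = new ArrayList<>(); // kept only to compute percentiles
    private long sum = 0;

    synchronized void add(long value) {
        samples.add(value);
        sum += value;
    }

    synchronized double avg() {
        return samples.isEmpty() ? 0.0 : (double) sum / samples.size();
    }

    synchronized long min() {
        return samples.isEmpty() ? 0 : Collections.min(samples);
    }

    synchronized long max() {
        return samples.isEmpty() ? 0 : Collections.max(samples);
    }

    // Nearest-rank percentile: the smallest sample such that at least p percent
    // of all samples are less than or equal to it. Sorting makes this the
    // expensive part compared with avg/min/max.
    synchronized long percentile(int p) {
        if (samples.isEmpty()) {
            return 0;
        }
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size() / 100.0);
        return sorted.get(Math.max(rank, 1) - 1);
    }
}

// Hypothetical "counter set": one counter per key, created on first use.
class LatencySketchSet {
    private final Map<String, LatencySketch> counters = new TreeMap<>();

    synchronized void add(String key, long value) {
        counters.computeIfAbsent(key, k -> new LatencySketch()).add(value);
    }

    synchronized LatencySketch get(String key) {
        return counters.get(key);
    }
}
```

A caller would record, say, set.add("namespaceA", latencyMs) on each write and later read set.get("namespaceA").percentile(99).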
[jira] [Created] (ZOOKEEPER-3245) Add useful metrics for ZK pipeline and request/server states
Jie Huang created ZOOKEEPER-3245: Summary: Add useful metrics for ZK pipeline and request/server states Key: ZOOKEEPER-3245 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3245 Project: ZooKeeper Issue Type: Improvement Reporter: Jie Huang Fix For: 3.6.0 Add metrics to track time spent in the commit processor, watch counts and fire rates, how long a Zookeeper server is unavailable between elections, quorum packet size and time spent in the queue, aggregate request states/flow, request throttle, sync processor queue time, per-connection read and write request counts, commit processor queue sizes(read/write/commit), final request processor read/write times, watch manager cnxn/path counts, latencies at different points in pipeline for commits/informs, split up request type counters for more request types, export sum metrics for all AvgMinMax counters, per-connection watch fired counts, ack latency for each follower, percentile metrics to zeus latency counters, proposal count, number of outstanding changes, snapshot and txns loading time during startup, number of non-voting followers, leader unavailable time, etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3243) Add server side request throttling
Jie Huang created ZOOKEEPER-3243: Summary: Add server side request throttling Key: ZOOKEEPER-3243 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3243 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Jie Huang Fix For: 3.6.0 On-going performance investigation at Facebook has demonstrated that Zookeeper is easily overwhelmed by spikes in connection rates and/or write request rates. Zookeeper performance gets progressively worse, clients time out and try to reconnect (exacerbating the problem), and things enter a death spiral. To solve this problem, we need to add load protection to Zookeeper via rate limiting and work shedding. This JIRA task adds a new request throttling mechanism (RequestThrottler) to Zookeeper in hopes of preventing Zookeeper from becoming overwhelmed during request spikes. When enabled, the RequestThrottler limits the number of outstanding requests currently submitted to the request processor pipeline. The throttler augments the limit imposed by the globalOutstandingLimit that is enforced by the connection layer (NIOServerCnxn, NettyServerCnxn). The connection layer limit applies backpressure against the TCP connection by disabling selection on connections once the request limit is reached. However, the connection layer always allows a connection to send at least one request before disabling selection on that connection. Thus, in a scenario with 4 client connections, the total number of requests in flight may be as high as 4 even if the globalOutstandingLimit was set lower. The RequestThrottler addresses this issue by adding additional queueing. When enabled, client connections no longer submit requests directly to the request processor pipeline but instead to the RequestThrottler. The RequestThrottler is then responsible for issuing requests to the request processors, and enforces a separate maxRequests limit. 
If the total number of outstanding requests is higher than maxRequests, the throttler will repeatedly stall for stallTime milliseconds until the count is back under the limit. The RequestThrottler can also optionally drop stale requests rather than submit them to the processor pipeline. A stale request is a request sent by a connection that is already closed, and/or a request whose latency will end up being higher than its associated session timeout. To ensure ordering guarantees, if a request is ever dropped from a connection, that connection is closed and flagged as invalid. All subsequent requests in flight from that connection are then dropped as well. The notion of staleness is configurable: connection staleness and latency staleness can each be enabled or disabled individually. Both these settings and the various throttle settings (limit, stall time, stale drop) can be configured via system properties as well as at runtime via JMX. The throttler has been tested and benchmarked at Facebook. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
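The throttling and stale-drop behaviour described in ZOOKEEPER-3243 can be sketched roughly as follows. This is not the RequestThrottler implementation: the class ThrottlerSketch, its fields, and the drain() method are hypothetical, time is passed in explicitly instead of using a background thread, and latency staleness is approximated as "already queued longer than the session timeout", which is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of the queueing idea; not ZooKeeper's RequestThrottler.
class ThrottlerSketch {
    static final class Request {
        final String id;
        final String connection;
        final long enqueuedAtMs;
        final long sessionTimeoutMs;

        Request(String id, String connection, long enqueuedAtMs, long sessionTimeoutMs) {
            this.id = id;
            this.connection = connection;
            this.enqueuedAtMs = enqueuedAtMs;
            this.sessionTimeoutMs = sessionTimeoutMs;
        }
    }

    private final int maxRequests;   // 0 means "no limit"
    private final boolean dropStale;
    private final Queue<Request> pending = new ArrayDeque<>();
    private final Set<String> invalidConnections = new HashSet<>();
    private int outstanding = 0;

    final List<String> submitted = new ArrayList<>(); // stands in for the processor pipeline
    final List<String> dropped = new ArrayList<>();

    ThrottlerSketch(int maxRequests, boolean dropStale) {
        this.maxRequests = maxRequests;
        this.dropStale = dropStale;
    }

    // Connections hand requests to the throttler instead of the pipeline.
    void enqueue(Request r) {
        pending.add(r);
    }

    // Called when the pipeline finishes a request.
    void requestFinished() {
        if (outstanding > 0) {
            outstanding--;
        }
    }

    // One pass of the throttler loop. Returns true when the limit was hit,
    // in which case a real loop would sleep for stallTime ms and retry.
    boolean drain(long nowMs) {
        while (!pending.isEmpty()) {
            if (maxRequests > 0 && outstanding >= maxRequests) {
                return true;
            }
            Request r = pending.poll();
            if (dropStale && isStale(r, nowMs)) {
                // Dropping one request invalidates the connection, so every
                // later request from it is dropped too (preserves ordering).
                invalidConnections.add(r.connection);
                dropped.add(r.id);
                continue;
            }
            outstanding++;
            submitted.add(r.id);
        }
        return false;
    }

    private boolean isStale(Request r, long nowMs) {
        return invalidConnections.contains(r.connection)
                || nowMs - r.enqueuedAtMs > r.sessionTimeoutMs;
    }
}
```

Keeping the limit check ahead of the poll means a stalled pass leaves the head request queued, so order within a connection is unchanged when the loop resumes.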
[jira] [Created] (ZOOKEEPER-3242) Add server side connecting throttling
Jie Huang created ZOOKEEPER-3242: Summary: Add server side connecting throttling Key: ZOOKEEPER-3242 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3242 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Jie Huang Fix For: 3.6.0 On-going performance investigation at Facebook has demonstrated that Zookeeper is easily overwhelmed by spikes in connection rates and/or write request rates. Zookeeper performance gets progressively worse, clients time out and try to reconnect (exacerbating the problem), and things enter a death spiral. To solve this problem, we need to add load protection to Zookeeper via rate limiting and work shedding. This Jira adds a new connection rate limiting mechanism to Zookeeper in hopes of preventing Zookeeper from becoming overwhelmed during connection spikes. The new throttle is focused on limiting connections per second. The throttle is implemented as a token bucket with optional probabilistic dropping based on the BLUE queue management algorithm. The token-bucket design lets short bursts pass while still capping the total number of requests per second. However, an issue with a token-bucket approach is that the wall-clock arrival time of requests affects the probability of a request being allowed to pass or not. Under constant load this can lead to starvation for requests that constantly arrive later than the majority. The optional probabilistic dropping mechanism is designed to combat this, making rejections a random event with little skew based on arrival time. A more verbose description can be found in the comments in org.apache.zookeeper.server.BlueThrottle. By default, both the token bucket and the probabilistic dropping mechanism are disabled. Enabling and tuning the throttles can be done via Java system properties or against a running node via JMX. The throttle has been tested and benchmarked at Facebook. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
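As a rough illustration of the design in ZOOKEEPER-3242, the sketch below combines a token bucket with a BLUE-style drop probability that rises when the bucket runs dry and decays when the bucket is full. It is not the BlueThrottle code: the class name TokenBucketSketch, the parameter names, and the exact rules for adjusting the drop chance are assumptions made for the example, and the clock and random source are injected so the behaviour is deterministic.

```java
import java.util.function.DoubleSupplier;

// Hypothetical sketch; see org.apache.zookeeper.server.BlueThrottle for the real logic.
class TokenBucketSketch {
    private final int maxTokens;        // bucket capacity, i.e. the largest burst allowed
    private final int fillCount;        // tokens added per fill interval
    private final long fillTimeMs;      // length of one fill interval
    private final double dropIncrease;  // added to the drop chance when the bucket is empty
    private final double dropDecrease;  // subtracted from the drop chance when the bucket is full
    private final DoubleSupplier random; // returns values in [0, 1)

    private int tokens;
    private long lastFillMs;
    private double dropChance = 0.0;

    TokenBucketSketch(int maxTokens, int fillCount, long fillTimeMs,
                      double dropIncrease, double dropDecrease,
                      DoubleSupplier random, long nowMs) {
        this.maxTokens = maxTokens;
        this.fillCount = fillCount;
        this.fillTimeMs = fillTimeMs;
        this.dropIncrease = dropIncrease;
        this.dropDecrease = dropDecrease;
        this.random = random;
        this.tokens = maxTokens;
        this.lastFillMs = nowMs;
    }

    // Returns true if the connection may proceed, false if it is rejected.
    boolean tryAcquire(long nowMs) {
        refill(nowMs);
        if (tokens == 0) {
            // Overload signal: make random drops more likely from now on.
            dropChance = Math.min(1.0, dropChance + dropIncrease);
            return false;
        }
        if (tokens == maxTokens) {
            // Idle signal: relax the drop chance.
            dropChance = Math.max(0.0, dropChance - dropDecrease);
        }
        if (dropChance > 0.0 && random.getAsDouble() < dropChance) {
            return false; // probabilistic rejection; no token is consumed
        }
        tokens--;
        return true;
    }

    double dropChance() {
        return dropChance;
    }

    // Add fillCount tokens for every whole fill interval that has elapsed.
    private void refill(long nowMs) {
        long intervals = (nowMs - lastFillMs) / fillTimeMs;
        if (intervals > 0) {
            tokens = (int) Math.min(maxTokens, tokens + intervals * fillCount);
            lastFillMs += intervals * fillTimeMs;
        }
    }
}
```

With dropIncrease set to 0 the class behaves as a plain token bucket; a non-zero value spreads rejections across arrivals instead of always penalising the requests that arrive just after the bucket empties.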
[jira] [Created] (ZOOKEEPER-3239) Adding EnsembleAuthProvider to verify the ensemble name
Jie Huang created ZOOKEEPER-3239: Summary: Adding EnsembleAuthProvider to verify the ensemble name Key: ZOOKEEPER-3239 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3239 Project: ZooKeeper Issue Type: Improvement Reporter: Jie Huang Fix For: 3.6.0 This AuthenticationProvider checks that the ensemble name the client intends to connect to matches the name that the server thinks it belongs to. If the name does not match, this provider will close the connection. This AuthenticationProvider does not "authenticate" the client. It prevents the client from accidentally connecting to the wrong ensemble. This feature has been implemented in the Facebook internal branch and I'm going to upstream it to the trunk. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
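The check described in ZOOKEEPER-3239 amounts to a string comparison whose failure closes the connection. A minimal sketch follows; the class EnsembleCheckSketch, the Result enum, and the choice to skip the check when the server has no configured name are all assumptions for illustration and are not taken from the actual provider.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the name check; not the real AuthenticationProvider.
class EnsembleCheckSketch {
    enum Result { OK, CLOSE_CONNECTION }

    // serverEnsembleName: the name this server believes it belongs to.
    // authData: the ensemble name sent by the client, as raw bytes.
    static Result check(String serverEnsembleName, byte[] authData) {
        if (serverEnsembleName == null || serverEnsembleName.isEmpty()) {
            // Assumption: with no name configured there is nothing to verify.
            return Result.OK;
        }
        if (authData == null) {
            return Result.CLOSE_CONNECTION;
        }
        String clientName = new String(authData, StandardCharsets.UTF_8).trim();
        // No credentials are verified here; a mismatch only means the client
        // reached an ensemble other than the one it intended.
        return serverEnsembleName.equals(clientName) ? Result.OK : Result.CLOSE_CONNECTION;
    }
}
```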
[jira] [Commented] (ZOOKEEPER-3216) Make init/sync limit tunable via JMX
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725345#comment-16725345 ] Jie Huang commented on ZOOKEEPER-3216: -- link to the PR: https://github.com/apache/zookeeper/pull/738 > Make init/sync limit tunable via JMX > > > Key: ZOOKEEPER-3216 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3216 > Project: ZooKeeper > Issue Type: Improvement > Components: jmx >Reporter: Jie Huang >Priority: Minor > > Add beans for initLimit and syncLimit so they can be adjusted through JMX -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-3216) Make init/sync limit tunable via JMX
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722018#comment-16722018 ] Jie Huang commented on ZOOKEEPER-3216: -- This will allow us to fix syncing issues when they happen without restart > Make init/sync limit tunable via JMX > > > Key: ZOOKEEPER-3216 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3216 > Project: ZooKeeper > Issue Type: Improvement > Components: jmx >Reporter: Jie Huang >Priority: Minor > > Add beans for initLimit and syncLimit so they can be adjusted through JMX -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-3216) Make init/sync limit tunable via JMX
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720702#comment-16720702 ] Jie Huang commented on ZOOKEEPER-3216: -- This feature has been implemented in the Facebook internal branch and I'm going to upstream it to the trunk. > Make init/sync limit tunable via JMX > > > Key: ZOOKEEPER-3216 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3216 > Project: ZooKeeper > Issue Type: Improvement > Components: jmx >Reporter: Jie Huang >Priority: Minor > > Add beans for initLimit and syncLimit so they can be adjusted through JMX -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3216) Make init/sync limit tunable via JMX
Jie Huang created ZOOKEEPER-3216: Summary: Make init/sync limit tunable via JMX Key: ZOOKEEPER-3216 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3216 Project: ZooKeeper Issue Type: Improvement Components: jmx Reporter: Jie Huang Add beans for initLimit and syncLimit so they can be adjusted through JMX -- This message was sent by Atlassian JIRA (v7.6.3#76005)