[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @hanm Got a few spare cycles to address my comments? ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 Agreed @zodvik . The scope of this Jira is to limit it to a generic slow query log and identify the outliers if any. For that reason, I am printing the actual request to the log as well. An improvement would be definitely to separate based on request types as others have also suggested. If this gets in, I would like to work on the improvement part of using t-digest(https://github.com/tdunning/t-digest) to give out detailed insight and not have the problem of setting a fixed threshold. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @hanm Thank you! I have rebased the patch. The jenkins build has been failing for a while and my patch doesn't introduce new failure. Thanks once again! ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/307 looks like the branch has merge conflicts. do you mind update your patch with latest master and I'll review @karanmehta93 ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user zodvik commented on the issue: https://github.com/apache/zookeeper/pull/307 > reasons why any given read or write operation may become slow: a software bug, a protocol problem, a hardware issue with the commit log(s), a network issue. If the problem is constant it is trivial to come to an understanding of the cause. However in order to diagnose intermittent problems we often don't know where, or when, to begin looking. We need some sort of timestamped indication of the problem IMO, slow query log is useful for those systems were the response times can vary due to type of query (long scans, improper index, poor query plan, etc.). For the type of issues mentioned, we can expect all operations to be impacted and fill up the log. If the intent is to get an time stamp indication of the problem, a metric reporting mechanism will be more useful. If we have rolling interval percentiles (per command?) published either to `stat` or `jmx`, that would give a good visibility. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 Hello everyone, Appreciate your efforts in reviewing this patch. @hanm @tdunning @eribeiro @skamille Is there any possibility that the patch will get merged in (with minor changes if required) or shall we 'never' this JIRA and close this PR? Thanks! ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 > Can we at least make it so that the process of "push a metric to a buffer, have a thread that wakes up periodically and flushes information out of that buffer" is usable by multiple parts of the system, instead of coupling it to the one metric of request time? I am ready to put in the effort of making this framework more generic, if you feel its worth the time and effort to put into it. I would also like to hear others opinions on this. The inspiration for this JIRA came from HBase, where it has been used to log slow requests at the basic level. I feel that it can be useful to keep track of latency spikes that are seen in production. @apurtell can suggest use-cases. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user skamille commented on the issue: https://github.com/apache/zookeeper/pull/307 I appreciate the work that you are trying to do here, and this still feels like an incomplete approach to a problem that would be worth investing further into. It seems like we all agree that it would be good to have clearer metrics about the performance behavior of the internals of the system. This implements some basic infrastructure for a metric recording framework, coupled to the recording of one metric. Can we at least make it so that the process of "push a metric to a buffer, have a thread that wakes up periodically and flushes information out of that buffer" is usable by multiple parts of the system, instead of coupling it to the one metric of request time? ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 Ping @hanm @tdunning @eribeiro @skamille for a review if you got some time. Thanks! ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 Due to some reason, Jenkins didn't build it automatically. I triggered a manual build with my latest commit ID SHA1 and the results are at https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1123/ All the Findbugs warnings have been resolved, and the single failing test seems to be unrelated to my change (or possibly a flapper since it is passing on my local machine). This commit is a squashed commit based on the earlier inputs provided by everyone. Thank you! As of now, the default value for `requestWarnThresholdMs` is 10 seconds, which is too high. Users can set it according to the average latency values they typically see from the serverstats. The patch also limits logging rate, so if the value is set too low, it will print out a message every 60 seconds showing the number of high latency requests in past minute. The motivation of this JIRA is specified in description at https://issues.apache.org/jira/browse/ZOOKEEPER-2770. If you can review this PR, it would be great. @hanm @tdunning @eribeiro @skamille If we can get in this JIRA, the future plan is to work on @tdunning t-digest and use it to determine high latency requests specifically based on its type. Thanks! ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @tdunning Sorry about this. I rebased my branch with the current master and pushed my patch along with it. Now it is showing up all those commits as my commits. Let me try to figure out a way to resolve this. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user tdunning commented on the issue: https://github.com/apache/zookeeper/pull/307 What is this pull request? 45 commits? 95 changed files? Lots of commits without associated issues? (danger, danger) ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/307 There are -1 on findbugs, release audits, and test failures reported by Jekins. Please investigate. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @hanm @eribeiro @tdunning @skamille Please review. Now that I have added rate limiting to logging, can we also turn this on by default? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @hanm @eribeiro @tdunning I have fixed the whitespace issues and final patch is ready, Please review whenever convenient. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @hanm > I think at some point after merge this you need create another pull request targeting stable branch (branch-3.4) where the version number should be 3.4.11 (hopefully, if this can get in before next release) I will create a new PR for that branch as well. I am hoping this to get this out in next release if possible. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/307 >> And for the documentation, this property is being introduced from 3.5.4 version, Is that correct? Yes, for this specific pull request targeting master. I think at some point after merge this you need create another pull request targeting stable branch (branch-3.4) where the version number should be 3.4.11 (hopefully, if this can get in before next release). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @eribeiro Thank you for your comments. Will post an updated patch soon. A small concern regarding the `ServerStats.updateLatency()` method. Should the updated patch have both the methods for now and file another JIRA to look into it? And for the documentation, this property is being introduced from `3.5.4` version, Is that correct? @hanm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user eribeiro commented on the issue: https://github.com/apache/zookeeper/pull/307 @karanmehta93 hi, mate. You don't need to touch `ZooKeeperAdmin.html` You change `ZooKeeperAdmin.xml` **only** and Apache Forrest generate html anf PDF. Also, this xml change should be part of this patch, okay? Cheers! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 @hanm @eribeiro Thank you for your comments. I have added a rebased patch with a test in `ZooKeeperServerMainTest` class. Please review whenever convenient. > It's odd to have two (or in future even more) types of latency checks scattered around which creates fragmentation w.r.t. the definition of what a request latency means. We can decide based on opinion from other people and I can update the patch accordingly. I will create new JIRA's based on @tdunning suggestions, after this one gets approved. I am eager to take them up. Documentation is still pending. Will add a patch for it soon. I am not sure if I have to modify the `ZooKeeperAdmin.xml` file or `ZooKeeperAdmin.html` file. Can you point me to any relevant stuff? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/307 I think we should consolidate the latency check in `zks.serverStats().updateLatency`. It's odd to have two (or in future even more) types of latency checks scattered around which creates fragmentation w.r.t. the definition of what a request latency means. The existing latency measurement in ServerStats measures the time between a request creation and a request landing at final request processor; the patch instead measures end to end time of a request from its start to finish processing. I am fine with the end to end processing time, though I'd like to double check with a few folks around to make sure the regression and impact of this change is limited. I think ServerStats is a good place to put the DS Ted recommended. I think it's a good idea to scope the JIRA so it's easier to get it reviewed and committed. What this patch is doing is a positive improvement to the operational aspects of ZK so that can be the scope of this PR. On top of that future improvements could be what Edward and Ted suggested (JMX, distribution of latencies / histogram etc). These work can be tracked by making them sub tasks under current JIRA. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/307 @karanmehta93 Thanks for update the pull request with more context information. Do you mind put the context description in the pull request description? Regarding test cases, I think you can just add a new test file under zookeeper/test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 From my understanding, `QuorumPeer` class will be the candidate class to maintain this data structure. Let me know alternatives. I will look into the data-structure ideas you provided till then. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user tdunning commented on the issue: https://github.com/apache/zookeeper/pull/307 So here is an alternative suggestion. For each class of operation (GET, PUT, MULTIPUT, DELETE and so on) maintain a data structure (such as t-digest or Floathistogram see https://github.com/tdunning/t-digest) that keeps track of the distribution of latencies for that operation. Log any transactions that are above a settable percentile latency for operations of that class. Default could be 99.99%. Operations longer than 2x the 99.9%-ile latency should not be entered into the historical distribution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user karanmehta93 commented on the issue: https://github.com/apache/zookeeper/pull/307 ZooKeeper is a complex distributed application. There are many reasons why any given read or write operation may become slow: a software bug, a protocol problem, a hardware issue with the commit log(s), a network issue. If the problem is constant it is trivial to come to an understanding of the cause. However in order to diagnose intermittent problems we often don't know where, or when, to begin looking. We need some sort of timestamped indication of the problem. Although ZooKeeper is not a datastore, it does persist data, and can suffer intermittent performance degradation, and should consider implementing a 'slow query' log, a feature very common to services which persist information on behalf of clients which may be sensitive to latency while waiting for confirmation of successful persistence. @hanm FYI --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/307 General comment: please add more description on the pull request on what's the motivation of change and how it impacts performance / improve debugging etc. Or add those to the JIRA and post a digest on the pull request. That will make others who does not have a lot of knowledge on your use cases understand the motivation / impact of the change. Here is a good example w.r.t. pull request description: https://github.com/apache/zookeeper/pull/306, for reference. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---