[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2018-05-24 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm Got a few spare cycles to address my comments?


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2018-05-13 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
Agreed @zodvik . The scope of this Jira is to limit it to a generic slow 
query log and identify the outliers if any. For that reason, I am printing the 
actual request to the log as well. An improvement would be definitely to 
separate based on request types as others have also suggested. If this gets in, 
I would like to work on the improvement part of using 
t-digest(https://github.com/tdunning/t-digest) to give out detailed insight and 
not have the problem of setting a fixed threshold. 


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2018-04-15 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm Thank you! I have rebased the patch. The jenkins build has been 
failing for a while and my patch doesn't introduce new failure. Thanks once 
again!


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2018-04-15 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
looks like the branch has merge conflicts. do you mind update your patch 
with latest master and I'll review @karanmehta93 


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2018-03-08 Thread zodvik
Github user zodvik commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
> reasons why any given read or write operation may become slow: a software 
bug, a protocol problem, a hardware issue with the commit log(s), a network 
issue. If the problem is constant it is trivial to come to an understanding of 
the cause. However in order to diagnose intermittent problems we often don't 
know where, or when, to begin looking. We need some sort of timestamped 
indication of the problem

IMO, slow query log is useful for those systems were the response times can 
vary due to type of query (long scans, improper index, poor query plan, etc.). 
For the type of issues mentioned, we can expect all operations to be impacted 
and fill up the log. 

If the intent is to get an time stamp indication of the problem, a metric 
reporting mechanism will be more useful. If we have rolling interval 
percentiles (per command?) published either to `stat` or `jmx`, that would give 
a good visibility.



---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2018-03-07 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
Hello everyone,
Appreciate your efforts in reviewing this patch. @hanm @tdunning @eribeiro 
@skamille 
Is there any possibility that the patch will get merged in (with minor 
changes if required) or shall we 'never' this JIRA and close this PR?
Thanks!


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-11-22 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
>  Can we at least make it so that the process of "push a metric to a 
buffer, have a thread that wakes up periodically and flushes information out of 
that buffer" is usable by multiple parts of the system, instead of coupling it 
to the one metric of request time?

I am ready to put in the effort of making this framework more generic, if 
you feel its worth the time and effort to put into it. I would also like to 
hear others opinions on this. 
The inspiration for this JIRA came from HBase, where it has been used to 
log slow requests at the basic level. I feel that it can be useful to keep 
track of latency spikes that are seen in production. @apurtell can suggest 
use-cases.


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-11-21 Thread skamille
Github user skamille commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
I appreciate the work that you are trying to do here, and this still feels 
like an incomplete approach to a problem that would be worth investing further 
into. It seems like we all agree that it would be good to have clearer metrics 
about the performance behavior of the internals of the system. This implements 
some basic infrastructure for a metric recording framework, coupled to the 
recording of one metric. Can we at least make it so that the process of "push a 
metric to a buffer, have a thread that wakes up periodically and flushes 
information out of that buffer" is usable by multiple parts of the system, 
instead of coupling it to the one metric of request time?


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-11-08 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
Ping @hanm @tdunning @eribeiro @skamille for a review if you got some time. 
Thanks!


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-10-26 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
Due to some reason, Jenkins didn't build it automatically. I triggered a 
manual build with my latest commit ID SHA1 and the results are at 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1123/

All the Findbugs warnings have been resolved, and the single failing test 
seems to be unrelated to my change (or possibly a flapper since it is passing 
on my local machine).

This commit is a squashed commit based on the earlier inputs provided by 
everyone. Thank you!

As of now, the default value for `requestWarnThresholdMs` is 10 seconds, 
which is too high. Users can set it according to the average latency values 
they typically see from the serverstats. The patch also limits logging rate, so 
if the value is set too low, it will print out a message every 60 seconds 
showing the number of high latency requests in past minute. The motivation of 
this JIRA is specified in description at  
https://issues.apache.org/jira/browse/ZOOKEEPER-2770.

If you can review this PR, it would be great.
@hanm @tdunning @eribeiro @skamille 

If we can get in this JIRA, the future plan is to work on @tdunning 
t-digest and use it to determine high latency requests specifically based on 
its type. Thanks!


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-10-25 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@tdunning Sorry about this. I rebased my branch with the current master and 
pushed my patch along with it. Now it is showing up all those commits as my 
commits. Let me try to figure out a way to resolve this.


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-10-25 Thread tdunning
Github user tdunning commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
What is this pull request?  45 commits? 95 changed files? Lots of commits 
without associated issues?

(danger, danger)


---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-08-18 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
There are -1 on findbugs, release audits, and test failures reported by 
Jekins. Please investigate.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-08-17 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm @eribeiro @tdunning @skamille 
Please review.
Now that I have added rate limiting to logging, can we also turn this on by 
default?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-24 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm @eribeiro @tdunning 
I have fixed the whitespace issues and final patch is ready, Please review 
whenever convenient.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-18 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm 
> I think at some point after merge this you need create another pull 
request targeting stable branch (branch-3.4) where the version number should be 
3.4.11 (hopefully, if this can get in before next release)

I will create a new PR for that branch as well. I am hoping this to get 
this out in next release if possible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-18 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
>> And for the documentation, this property is being introduced from 3.5.4 
version, Is that correct?

Yes, for this specific pull request targeting master. I think at some point 
after merge this you need create another pull request targeting stable branch 
(branch-3.4) where the version number should be 3.4.11 (hopefully, if this can 
get in before next release).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-16 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@eribeiro Thank you for your comments. Will post an updated patch soon.
A small concern regarding the `ServerStats.updateLatency()` method. 
Should the updated patch have both the methods for now and file another 
JIRA to look into it?
And for the documentation, this property is being introduced from `3.5.4` 
version, Is that correct?
@hanm 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-16 Thread eribeiro
Github user eribeiro commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@karanmehta93 hi, mate.

You don't need to touch `ZooKeeperAdmin.html` You change 
`ZooKeeperAdmin.xml` **only** and Apache Forrest generate html anf PDF. Also, 
this xml change should be part of this patch, okay?

Cheers!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-15 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@hanm @eribeiro 
Thank you for your comments. I have added a rebased patch with a test in 
`ZooKeeperServerMainTest` class. Please review whenever convenient. 

> It's odd to have two (or in future even more) types of latency checks 
scattered around which creates fragmentation w.r.t. the definition of what a 
request latency means. 

We can decide based on opinion from other people and I can update the patch 
accordingly. I will create new JIRA's based on @tdunning suggestions, after 
this one gets approved. I am eager to take them up.

Documentation is still pending. Will add a patch for it soon. I am not sure 
if I have to modify the `ZooKeeperAdmin.xml` file or `ZooKeeperAdmin.html` 
file. Can you point me to any relevant stuff?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-13 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
I think we should consolidate the latency check in 
`zks.serverStats().updateLatency`. It's odd to have two (or in future even 
more) types of latency checks scattered around which creates fragmentation 
w.r.t. the definition of what a request latency means. The existing latency 
measurement in ServerStats measures the time between a request creation and a 
request landing at final request processor; the patch instead measures end to 
end time of a request from its start to finish processing. I am fine with the 
end to end processing time, though I'd like to double check with a few folks 
around to make sure the regression and impact of this change is limited.

I think ServerStats is a good place to put the DS Ted recommended. 

I think it's a good idea to scope the JIRA so it's easier to get it 
reviewed and committed. What this patch is doing is a positive improvement to 
the operational aspects of ZK so that can be the scope of this PR. On top of 
that future improvements could be what Edward and Ted suggested (JMX, 
distribution of latencies / histogram etc). These work can be tracked by making 
them sub tasks under current JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-12 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
@karanmehta93 Thanks for update the pull request with more context 
information. Do you mind put the context description in the pull request 
description? 

Regarding test cases, I think you can just add a new test file under 
zookeeper/test. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-12 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
From my understanding, `QuorumPeer` class will be the candidate class to 
maintain this data structure. Let me know alternatives. 
I will look into the data-structure ideas you provided till then.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-12 Thread tdunning
Github user tdunning commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
So here is an alternative suggestion.

For each class of operation (GET, PUT, MULTIPUT, DELETE and so on) maintain 
a data structure (such as t-digest or Floathistogram see 
https://github.com/tdunning/t-digest) that keeps track of the distribution of 
latencies for that operation.

Log any transactions that are above a settable percentile latency for 
operations of that class. Default could be 99.99%.

Operations longer than 2x the 99.9%-ile latency should not be entered into 
the historical distribution.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-12 Thread karanmehta93
Github user karanmehta93 commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
ZooKeeper is a complex distributed application. There are many reasons why 
any given read or write operation may become slow: a software bug, a protocol 
problem, a hardware issue with the commit log(s), a network issue. If the 
problem is constant it is trivial to come to an understanding of the cause. 
However in order to diagnose intermittent problems we often don't know where, 
or when, to begin looking. We need some sort of timestamped indication of the 
problem. Although ZooKeeper is not a datastore, it does persist data, and can 
suffer intermittent performance degradation, and should consider implementing a 
'slow query' log, a feature very common to services which persist information 
on behalf of clients which may be sensitive to latency while waiting for 
confirmation of successful persistence.

@hanm FYI


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] zookeeper issue #307: ZOOKEEPER-2770 ZooKeeper slow operation log

2017-07-11 Thread hanm
Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/307
  
General comment: please add more description on the pull request on what's 
the motivation of change and how it impacts performance / improve debugging 
etc. Or add those to the JIRA and post a digest on the pull request. That will 
make others who does not have a lot of knowledge on your use cases understand 
the motivation / impact of the change.

Here is a good example w.r.t. pull request description: 
https://github.com/apache/zookeeper/pull/306, for reference.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---