[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100640#comment-16100640
 ] 

Camille Fournier commented on ZOOKEEPER-2770:
---------------------------------------------

Are there really 10s long slow requests? It's defaults like this that make me 
skeptical about the usefulness of this particular implementation. If we have a 
request through ZK that takes 10s to process your whole system is completely 
effed. 

I don't think we should add complexity to the code base without suitable 
justification for the value of the new feature. With that in mind, I'd like to 
understand what, specifically, the circumstances we're trying to measure are. 
It looks like processing time for a request through the ZK quorum alone, 
correct? The only network time that might be captured would be, in the case of 
a write, the quorum voting time.

I'm all for making ZK more operable and exposing metrics but I don't think 
exposing low-value metrics is worth the additional code complexity without 
justification.

> ZooKeeper slow operation log
> ----------------------------
>
>                 Key: ZOOKEEPER-2770
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2770
>             Project: ZooKeeper
>          Issue Type: Improvement
>            Reporter: Karan Mehta
>            Assignee: Karan Mehta
>         Attachments: ZOOKEEPER-2770.001.patch, ZOOKEEPER-2770.002.patch, 
> ZOOKEEPER-2770.003.patch
>
>
> ZooKeeper is a complex distributed application. There are many reasons why 
> any given read or write operation may become slow: a software bug, a protocol 
> problem, a hardware issue with the commit log(s), a network issue. If the 
> problem is constant it is trivial to come to an understanding of the cause. 
> However in order to diagnose intermittent problems we often don't know where, 
> or when, to begin looking. We need some sort of timestamped indication of the 
> problem. Although ZooKeeper is not a datastore, it does persist data, and can 
> suffer intermittent performance degradation, and should consider implementing 
> a 'slow query' log, a feature very common to services which persist 
> information on behalf of clients which may be sensitive to latency while 
> waiting for confirmation of successful persistence.
> Log the client and request details if the server discovers, when finally 
> processing the request, that the current time minus arrival time of the 
> request is beyond a configured threshold. 
> Look at the HBase {{responseTooSlow}} feature for inspiration. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to