[jira] [Commented] (CASSANDRA-10245) Provide after the fact visibility into the reliability of the environment C* operates in

Ariel Weisberg (JIRA) Tue, 01 Sep 2015 14:41:16 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-10245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726246#comment-14726246
 ]


Ariel Weisberg commented on CASSANDRA-10245:
--------------------------------------------

To measure network performance, every 5 milliseconds (or whatever interval) 
send a heartbeat message to every other node in the cluster, or to some subset 
(but do cover all nodes eventually). In the heartbeat, include the wall clock 
time at which the message was sent.

The thread that wakes up periodically to send messages should keep a histogram 
of how far each wakeup is from its target time. Also track the delta between 
when remote heartbeats claim to have been sent and when they are received, as 
well as the delta between the expected time since the last heartbeat was 
received and the actual elapsed time. Combining these facts across nodes will 
give you visibility into the difference between node wide pauses and network 
related pauses.
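
As a rough illustration of the bookkeeping described above, here is a minimal 
sketch in Python (not C* code). The class name HeartbeatStats, its fields, and 
the 1 ms bucket width are all hypothetical, and a real implementation would 
probably use a proper histogram library rather than a Counter.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class HeartbeatStats:
    """Hypothetical per-node bookkeeping; every name here is an assumption."""
    interval_ms: float
    bucket_ms: float = 1.0
    wakeup_drift: Counter = field(default_factory=Counter)
    send_to_receive: Counter = field(default_factory=Counter)
    arrival_gap_error: Counter = field(default_factory=Counter)
    _last_recv_mono: dict = field(default_factory=dict)

    def _bucket(self, delta_ms):
        # Fixed-width buckets; negative deltas land in negative buckets.
        return int(delta_ms // self.bucket_ms)

    def record_wakeup(self, target_ms, actual_ms):
        # How far the sender thread woke up from its target time.
        self.wakeup_drift[self._bucket(actual_ms - target_ms)] += 1

    def record_heartbeat(self, peer, sent_wall_ms, recv_wall_ms, recv_mono_ms):
        # Delta between the remote node's claimed send time and local receipt.
        self.send_to_receive[self._bucket(recv_wall_ms - sent_wall_ms)] += 1
        # Delta between the expected and actual gap since the peer's previous
        # heartbeat, measured on a local monotonic clock.
        last = self._last_recv_mono.get(peer)
        if last is not None:
            gap_error = (recv_mono_ms - last) - self.interval_ms
            self.arrival_gap_error[self._bucket(gap_error)] += 1
        self._last_recv_mono[peer] = recv_mono_ms
```

If wakeup_drift on one node and arrival_gap_error for that node on its peers 
both show an outlier at the same time, that points at a node wide pause. A gap 
error seen by only some peers, with clean wakeup drift on the sender, points 
at the network instead.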

You can also look at clock skew. If a node reliably delivers its messages on 
the expected interval, but the timestamps are not as expected, you can guess 
that there is some clock skew.
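
The skew guess could look something like the following sketch. The function 
name guess_clock_skew_ms and the tolerance parameter are hypothetical, and the 
result still includes one-way network latency, so it is only an approximation 
of the real skew.

```python
from statistics import median


def guess_clock_skew_ms(samples, interval_ms, tolerance_ms=1.0):
    """samples: list of (sent_wall_ms, received_wall_ms) from one peer, in
    arrival order. Returns a skew guess in ms, or None if arrivals are too
    irregular to blame the clock. Hypothetical helper, not an existing API."""
    if len(samples) < 2:
        return None
    received = [r for _, r in samples]
    gaps = [b - a for a, b in zip(received, received[1:])]
    # Only trust the offset when heartbeats arrive on the expected interval;
    # otherwise the offset is more likely a pause or network delay.
    if any(abs(g - interval_ms) > tolerance_ms for g in gaps):
        return None
    # The median offset resists a single delayed heartbeat.
    return median(r - s for s, r in samples)
```

For example, heartbeats that arrive exactly 5 ms apart but always carry a 
timestamp 250 ms in the past suggest roughly 250 ms of skew (plus latency).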

You can set thresholds for when to be chatty about conditions and start dumping 
histograms, percentiles, or whatever to a human readable log.
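
A sketch of what threshold-triggered reporting might look like. The names 
percentile and report_if_degraded, the logger name "envhealth", and the choice 
of p99 as the trigger are assumptions for illustration only.

```python
import logging
import math


def percentile(ordered, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def report_if_degraded(samples_ms, p99_threshold_ms,
                       log=logging.getLogger("envhealth")):
    """Log a human readable summary only when p99 exceeds the threshold.
    Returns True if something was logged."""
    if not samples_ms:
        return False
    ordered = sorted(samples_ms)
    p99 = percentile(ordered, 99)
    if p99 <= p99_threshold_ms:
        return False  # stay quiet while things look normal
    log.warning("heartbeat delay degraded: p50=%.1fms p99=%.1fms max=%.1fms n=%d",
                percentile(ordered, 50), p99, ordered[-1], len(ordered))
    return True
```

Staying silent below the threshold keeps the log useful for post-hoc analysis 
without drowning it in routine measurements.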

There is overlap between this and jHiccup, but we need to run something out of 
process anyway to track JVM pauses. jHiccup also comes with some 
reporting/visualization.

> Provide after the fact visibility into the reliability of the environment C* 
> operates in
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10245
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10245
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Ariel Weisberg
>             Fix For: 3.x
>
>
> I think that by default databases should not be completely dependent on 
> operator provided tools for monitoring node and network health.
> The database should be able to detect and report on several dimensions of 
> performance in its environment, and more specifically report on deviations 
> from acceptable performance.
> * Node wide pauses
> * JVM wide pauses
> * Latency, and roundtrip time to all endpoints
> * Block device IO latency
> If flight recorder were available for use in production I would say as a 
> start just turn that on, add jHiccup (inside and outside the server process), 
> and a daemon inside the server to measure network performance between 
> endpoints.
> FR is not available (requires a license in production) so instead focus on 
> adding instrumentation for the most useful facets of flight recorder in 
> diagnosing performance issues. I think we can get pretty far because what we 
> need to do is not quite as undirected as the exploration FR and JMC 
> facilitate.
> Until we dial in how we measure and how to signal without false positives I 
> would expect this kind of logging to be in the background for post-hoc 
> analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
