Simon Zhou created CASSANDRA-13491:
--------------------------------------

             Summary: Emit metrics for JVM safepoint pause
                 Key: CASSANDRA-13491
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13491
             Project: Cassandra
          Issue Type: New Feature
            Reporter: Simon Zhou

GC pauses are not the only source of latency from the JVM. In one of our recent production issues, the GC metrics looked fine (some pauses >200ms, the longest 500ms), but the GC logs showed periodic pauses like this:

{code}
2017-04-26T01:51:29.420+0000: 352535.998: Total time for which application threads were stopped: 19.8835870 seconds, Stopping threads took: 19.7842073 seconds
{code}

Almost all of that time was spent stopping the threads, that is, reaching the safepoint, rather than in the GC work itself. A delay this large is presumably a JVM malfunction, but it caused some requests to time out. So I'm suggesting that we add metrics for safepoint pauses for better observability.

Two problems though:

1. This depends on the JVM. Some JVMs may not expose these internal MBeans. This is the same situation as with the existing GCInspector.
2. Hotspot has HotspotRuntime as an internal MBean, from which we can get safepoint pauses. However, it has no notification support. I got the error "MBean sun.management:type=HotspotRuntime does not implement javax.management.NotificationBroadcaster" when trying to register a listener. This means we will need to poll HotspotRuntime periodically for the safepoint pauses.

Reference: http://blog.ragozin.info/2012/10/safepoints-in-hotspot-jvm.html

Does anyone think we should support this?
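To make the polling idea in point 2 concrete, here is a minimal sketch, not a proposed patch. It assumes the sun.management:type=HotspotRuntime MBean is already registered in the platform MBeanServer, and that it exposes cumulative attributes named SafepointCount, TotalSafepointTime and SafepointSyncTime (the latter two in milliseconds). Those names are assumptions about Hotspot internals and may differ or be absent on other JVMs. The class and method names (SafepointPoller, Sample, read, delta) are hypothetical. Because the counters are cumulative, each poll reports the difference from the previous sample.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper: polls the internal HotspotRuntime MBean for safepoint counters.
final class SafepointPoller {
    // MBean name taken from the error message quoted in this ticket.
    static final String MBEAN_NAME = "sun.management:type=HotspotRuntime";

    /** One reading of the cumulative safepoint counters. */
    public static final class Sample {
        public final long count;       // number of safepoints so far
        public final long totalTimeMs; // total time spent in safepoints (assumed ms)
        public final long syncTimeMs;  // time spent stopping threads (assumed ms)

        public Sample(long count, long totalTimeMs, long syncTimeMs) {
            this.count = count;
            this.totalTimeMs = totalTimeMs;
            this.syncTimeMs = syncTimeMs;
        }
    }

    /** Reads the counters, or returns null if the MBean or its attributes are unavailable. */
    public static Sample read(MBeanServer server) {
        try {
            ObjectName name = new ObjectName(MBEAN_NAME);
            if (!server.isRegistered(name)) {
                return null; // non-Hotspot JVM, or the internal MBean was never registered
            }
            // Attribute names are assumptions about the internal Hotspot MBean.
            long count = ((Number) server.getAttribute(name, "SafepointCount")).longValue();
            long total = ((Number) server.getAttribute(name, "TotalSafepointTime")).longValue();
            long sync = ((Number) server.getAttribute(name, "SafepointSyncTime")).longValue();
            return new Sample(count, total, sync);
        } catch (Exception e) {
            return null; // treat any JMX failure as "not supported on this JVM"
        }
    }

    /** Converts two cumulative samples into the activity between them. */
    public static Sample delta(Sample prev, Sample cur) {
        return new Sample(cur.count - prev.count,
                          cur.totalTimeMs - prev.totalTimeMs,
                          cur.syncTimeMs - prev.syncTimeMs);
    }

    public static void main(String[] args) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Sample first = read(server);
        if (first == null) {
            System.err.println("HotspotRuntime MBean not available; nothing to poll");
            return;
        }
        Sample[] last = { first };
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        // Poll once per second and print only the intervals that contained a safepoint.
        ses.scheduleAtFixedRate(() -> {
            Sample cur = read(server);
            if (cur == null) {
                return;
            }
            Sample d = delta(last[0], cur);
            last[0] = cur;
            if (d.count > 0) {
                System.out.printf("safepoints=%d total=%dms sync=%dms%n",
                                  d.count, d.totalTimeMs, d.syncTimeMs);
            }
        }, 1, 1, TimeUnit.SECONDS);
    }
}
```

With this approach a single long pause is only visible as a large delta within one polling interval, so the polling period bounds how precisely an individual pause can be attributed. A real implementation would feed the deltas into the metrics registry instead of printing them.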