Thanks,
Paul
*From: *Jeremy Manson <[email protected]>
*Date: *Monday, October 23, 2017 at 11:50 AM
*To: *"Hohensee, Paul" <[email protected]>, Derek Thomson
<[email protected]>, Wessam Hassanein <[email protected]>, Man Cao
<[email protected]>
*Cc: *"[email protected]"
<[email protected]>
*Subject: *Re: GcInfo: longest pause duration?
Yes, this information is useful.
We've done a couple of things to get at it, which folks might find
interesting.
1) We extended the hsperfdata to include a large number of stats for
CMS. Parsing hsperfdata is ridiculously easy, and hasn't changed in
years, so we just parse it directly and report the values to
monitoring tools. I'm too lazy to retype what we export, so are the
relevant CMS variables from our code; what they do should be
relatively obvious:
_concurrent_mode_failure_count =
PerfDataManager::create_counter(NULL_NS,
"concurrent-mode-failure-count",
PerfData::U_Events, CHECK);
_concurrent_mode_failure_due_to_fragmentation_count =
PerfDataManager::create_counter(NULL_NS,
"concurrent-mode-failure-due-to-fragmentation-count",
PerfData::U_Events, CHECK);
_concurrent_mode_failure_time =
PerfDataManager::create_counter(NULL_NS,
"concurrent-mode-failure-time",
PerfData::U_Ticks, CHECK);
_cms_initial_mark_count =
PerfDataManager::create_counter(NULL_NS,
"*cms-initial-mark-count*",
PerfData::U_Events, CHECK);
_cms_initial_mark_time =
PerfDataManager::create_counter(NULL_NS, "cms-initial-mark-time",
PerfData::U_Ticks, CHECK);
_cms_remark_count =
PerfDataManager::create_counter(NULL_NS, "cms-remark-count",
PerfData::U_Events, CHECK);
_cms_remark_time =
PerfDataManager::create_counter(NULL_NS, "cms-remark-time",
PerfData::U_Ticks, CHECK);
_cms_full_gc_count =
PerfDataManager::create_counter(NULL_NS, "cms-full-gc-count",
PerfData::U_Events, CHECK);
_cms_full_gc_time =
PerfDataManager::create_counter(NULL_NS, "cms-full-gc-time",
PerfData::U_Ticks, CHECK);
_cms_compacting_full_gc_count =
PerfDataManager::create_counter(NULL_NS,
"cms-compacting-full-gc-count",
PerfData::U_Events, CHECK);
_cms_compacting_full_gc_time =
PerfDataManager::create_counter(NULL_NS,
"cms-compacting-full-gc-time",
PerfData::U_Ticks, CHECK);
_cms_noncompacting_full_gc_count =
PerfDataManager::create_counter(NULL_NS,
"cms-noncompacting-full-gc-count",
PerfData::U_Events, CHECK);
_cms_noncompacting_full_gc_time =
PerfDataManager::create_counter(NULL_NS,
"cms-noncompacting-full-gc-time",
PerfData::U_Ticks, CHECK);
_cms_conc_mark_count =
PerfDataManager::create_counter(NULL_NS, "cms-conc-mark-count",
PerfData::U_Events, CHECK);
_cms_conc_mark_time =
PerfDataManager::create_counter(NULL_NS, "cms-conc-mark-time",
PerfData::U_Ticks, CHECK);
_cms_conc_preclean_count =
PerfDataManager::create_counter(NULL_NS,
"cms-conc-preclean-count",
PerfData::U_Events, CHECK);
_cms_conc_preclean_time =
PerfDataManager::create_counter(NULL_NS, "cms-conc-preclean-time",
PerfData::U_Ticks, CHECK);
_cms_conc_abortable_preclean_count =
PerfDataManager::create_counter(NULL_NS,
"cms-conc-abortable-preclean-count",
PerfData::U_Events, CHECK);
_cms_conc_abortable_preclean_time =
PerfDataManager::create_counter(NULL_NS,
"cms-conc-abortable-preclean-time",
PerfData::U_Ticks, CHECK);
_cms_conc_sweep_count =
PerfDataManager::create_counter(NULL_NS, "cms-conc-sweep-count",
PerfData::U_Events, CHECK);
_cms_conc_sweep_time =
PerfDataManager::create_counter(NULL_NS, "cms-conc-sweep-time",
PerfData::U_Ticks, CHECK);
_cms_conc_reset_count =
PerfDataManager::create_counter(NULL_NS, "cms-conc-reset-count",
PerfData::U_Events, CHECK);
_cms_conc_reset_time =
PerfDataManager::create_counter(NULL_NS, "cms-conc-reset-time",
PerfData::U_Ticks, CHECK);
_cms_collection_count =
PerfDataManager::create_counter(NULL_NS, "cms-collection-count",
PerfData::U_Events, CHECK);
_cms_collection_time =
PerfDataManager::create_counter(NULL_NS, "cms-collection-time",
PerfData::U_Ticks, CHECK);
_cms_ref_proc_count =
PerfDataManager::create_counter(NULL_NS, "cms-ref-proc-count",
PerfData::U_Events, CHECK);
_cms_ref_proc_time =
PerfDataManager::create_counter(NULL_NS, "cms-ref-proc-time",
PerfData::U_Ticks, CHECK);
_yg_alloc_bytes =
PerfDataManager::create_variable(NULL_NS,
"jvm-gc-cms-young-gen-alloc-bytes",
PerfData::U_None, CHECK);
_yg_promo_bytes =
PerfDataManager::create_variable(NULL_NS,
"jvm-gc-cms-young-gen-promo-bytes",
PerfData::U_None, CHECK);
_og_direct_alloc_bytes =
PerfDataManager::create_variable(NULL_NS,
"jvm-gc-cms-old-gen-directalloc-bytes",
PerfData::U_None, CHECK);
_og_alloc_bytes =
PerfDataManager::create_variable(NULL_NS,
"jvm-gc-cms-old-gen-alloc-bytes",
PerfData::U_None, CHECK);
_total_alloc_bytes =
PerfDataManager::create_variable(NULL_NS,
"jvm-gc-cms-total-alloc-bytes",
PerfData::U_None, CHECK);
2) We also implemented an API built on JFR's handling of GC events.
We use it to export the GC history to monitoring tools in a format
that doesn't involve parsing logs that can change.
(Prior to JFR, we had a bespoke implementation where we instrumented
everywhere a log happened, but we fortunately don't have to do that
anymore.)
We'd love both of these things to be something that was more generally
usable.
Jeremy
On Mon, Oct 23, 2017 at 9:31 AM, Hohensee, Paul <[email protected]
<mailto:[email protected]>> wrote:
I’d like to solicit opinions on reporting GC pause duration
(stopped-world pause time) via JMX. This info would be useful in
figuring out whether or not GC pause times are factors in failing
to meet response time SLAs. The info is of course available
directly from GC logs, but parsing logs is fraught and JMX doesn’t
seem to report the equivalent info.
GcInfo
https://docs.oracle.com/javase/9/docs/api/com/sun/management/GcInfo.html
has a getDuration() method which works fine for the non-concurrent
collectors (since they’re STW), but for CMS and G1 it appears to
report the duration of an entire concurrent cycle, which isn’t
what I want. The number of STW pauses during a concurrent cycle
varies by collector, so ideally there would be a method that
reports cause (as a string) and duration for each STW pause. If
that’s too much, perhaps the minimum might be a
getMaxPauseDuration() method that reports the maximum pause
duration of all the STW pauses that happen during a concurrent cycle.
Relatedly, the full compacting GCs that happen as a result of CMS
and G1 concurrent mode failure aren’t reported separately from
concurrent cycles. It would be useful to differentiate these from
“ConcurrentMarkSweep” and “G1 Old Generation”. Perhaps add
collector types to CMS and G1, vis. “MarkSweepCompact” (which
already exists and is literally what’s executed by CMS) and a new
“G1 MarkSweepCompact” collector for G1.
If there’s a consensus that something should be done about either
of these issues, I’d be happy to file RFE(s) and do the work.
Thanks,
Paul