One of the challenges in using GC logs to find out whether you have TTSP issues 
is that GC events are extremely infrequent, so not having seen a high TTSP 
in the log doesn't mean you won't hit one soon... It's like trying to deduce 
the likelihood of hitting something bad from 1000 data points collected with 
a 1-in-a-million random sampler. Coverage is very low.
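
For what it's worth, on a JDK 8 era HotSpot the closest you get out of the 
box is the safepoint logging, with something along the lines of the flags 
below (exact spellings vary by version, and later JDKs replace them with 
-Xlog:safepoint):

  -XX:+PrintGCApplicationStoppedTime
      (the "Stopping threads took: ..." part of that output is the TTSP)
  -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
      (a per-safepoint spin/block/sync breakdown)
  -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=1000
      (names the threads that still haven't reached the safepoint after ~1 sec)

But those still only report on safepoints that actually happened, so they 
have the same coverage problem.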

In Zing, we have a built-in TTSP profiler for exactly this reason. 
Coverage. We initially used it internally, to detect and clean up all the 
JVM code paths that were problematic, and now we use it with customers, 
mostly to prove and point fingers at the system-related causes of TTSP when 
they occur. The TTSP profiler is basically a tick-based profiler: each tick 
records its timestamp and arms a thread-local "go to a no-op safepoint" 
request whose only purpose is to report the amount of time elapsed from the 
original tick to that thread reaching a safepoint. The thread leaves the 
safepoint immediately, so no pauses happen, and we keep track of the call 
stacks for e.g. the top 50 TTSP times seen since the last profile reset. 
With this technique we can reliably and completely cover and report on any 
experienced code path with a TTSP larger than e.g. 5 msec, and we get 
fairly good coverage even at lower levels (just not 100% coverage, unless 
you are willing to take e.g. a TTSP-profiling tick 2000 times per second on 
each running thread).
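
To make the mechanism concrete, here is a rough user-level analogue in Java 
(the class and method names are made up for the sketch). This is not the 
actual implementation, which lives inside the JVM and arms a real, no-op 
safepoint; here a volatile flag stands in for the thread-local safepoint 
request, and each worker thread owns its own sampler:

  // Sketch only: a user-level analogue of a tick-based TTSP profiler.
  // One instance per worker thread. poll() stands in for the safepoint
  // poll the JIT would emit; races between ticks are ignored, and the
  // real profiler would also capture the call stack at the poll and
  // keep e.g. the top 50 worst times seen since the last reset.
  import java.util.concurrent.atomic.AtomicLong;

  final class TtspSampler {
      private volatile long armedAtNanos;              // 0 == not armed
      private final AtomicLong worstTtspNanos = new AtomicLong();

      // Called by the profiler thread on each profiling tick.
      void arm() {
          armedAtNanos = System.nanoTime();
      }

      // Called by the owning worker thread at each cooperative poll point.
      void poll() {
          long armed = armedAtNanos;
          if (armed != 0L) {
              long ttsp = System.nanoTime() - armed;   // tick-to-"safepoint" time
              worstTtspNanos.accumulateAndGet(ttsp, Math::max);
              armedAtNanos = 0L;                       // "leave the safepoint" immediately
          }
      }

      long worstTtspNanos() {
          return worstTtspNanos.get();
      }
  }

A profiler thread calls arm() on every worker's sampler some number of times 
per second, and anything a worker does between poll points (a long counted 
loop, a page fault on mapped-file i/o, getting descheduled) shows up as a 
large reported value.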

I expect that with HotSpot finally looking at adding thread-specific 
safepointing and checkpoints, similar tooling will be needed there.

On Tuesday, December 5, 2017 at 12:11:44 PM UTC+1, Kirk Pepperdine wrote:
>
> Hi,
>
>
> On Dec 5, 2017, at 11:53 AM, Gil Tene <g...@azul.com> wrote:
>
> Page faults in mapped file i/o and counted loops are certainly two common 
> causes of long TTSP. But there are many other paths that *could* cause it 
> as well in HotSpot. Without catching it and looking at the stack trace, 
> it's hard to know which ones to blame. Once you knock out one cause, you'll 
> see if there is another.
>
> In the specific stack trace you showed [assuming that trace was taken 
> during a long TTSP], mapped file i/o is the most likely culprit. Your trace 
> seems to be around making the page writable for the first time and 
> updating the file time (which takes a lock), but even without needing the 
> lock, the fault itself could end up waiting for the i/o to complete (read 
> the page from disk), and that (when Murphy pays you a visit) can end up 
> waiting behind 100s of other i/o operations (e.g. when your i/o happens at 
> the same time the kernel decides to flush some dirty pages in the cache), 
> leading to TTSPs in the 100s of msec.
>
> As I'm sure you already know, one simple way to get around mapped-file 
> related TTSP is to not use mapped files. Explicit random i/o calls are 
> always done while at a safepoint, so they can't cause high TTSPs.
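>
> For illustration, a minimal sketch of what that looks like (the class and 
> method names are made up for the example): a positional FileChannel read 
> instead of touching a MappedByteBuffer, so the disk wait happens inside an 
> ordinary blocking call made while the thread is safepoint-safe, rather than 
> inside a page fault on a random memory access:
>
>   import java.io.IOException;
>   import java.nio.ByteBuffer;
>   import java.nio.channels.FileChannel;
>   import java.nio.file.Paths;
>   import java.nio.file.StandardOpenOption;
>
>   // Sketch: explicit positional reads instead of mapped-file access.
>   // A slow disk stalls only this thread here, instead of stretching
>   // everyone's time-to-safepoint.
>   final class ExplicitIo {
>       static long readLongAt(FileChannel ch, long position) throws IOException {
>           ByteBuffer buf = ByteBuffer.allocateDirect(Long.BYTES);
>           while (buf.hasRemaining()) {
>               if (ch.read(buf, position + buf.position()) < 0) {
>                   throw new IOException("unexpected EOF at " + position);
>               }
>           }
>           buf.flip();
>           return buf.getLong();
>       }
>
>       public static void main(String[] args) throws IOException {
>           try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
>                   StandardOpenOption.READ)) {
>               System.out.println(readLongAt(ch, 0L));
>           }
>       }
>   }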
>
>
> I guess another way to avoid the long TTSP is to not safe point.. ;-)
>
> Most of the long TTSP’s IME are either due to outside scheduling 
> interference or loops that are compiled to the point where they turn into 
> counted loops. Finding this with traditional tooling is impossible, because 
> to get a sample showing why you’re waiting for a safe point you need the 
> very safe point that has most likely been compiled out of the code. Using 
> an async profiler (like Honest Profiler) leaves you wanting some 
> correlation between calls for a safe point and the samples you’re looking 
> at. So, I don’t really have a good answer as to how to find these things. 
> What I have done in the past is to use the GC log to get an idea of the 
> toxicity level of the environment you’re running in, and then use that 
> information to decide if I need to look for internal or external causes. 
> If the causes are internal then it’s most likely a counted loop. Using 
> JITWatch in this case you can get a candidate list of counted loops to 
> consider. After that I’ve binary searched through that list. If the 
> environment is toxic.. then you need to clean up the environment. If you 
> are brave, you add some instrumentation to the JVM to sample the last 
> thread to reach the safe point and use that in your prod environment.
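>
> For concreteness, the classic shape of that counted-loop problem looks 
> something like the sketch below (method names made up; whether the poll 
> really disappears depends on the JIT and JVM version, so treat it as an 
> illustration only):
>
>   // With C2, a simple int-indexed loop like this is a "counted loop" and
>   // typically gets no safepoint poll on its back edge, so a thread stuck
>   // in it holds up time-to-safepoint for every other thread:
>   static long sumCounted(int[] a) {
>       long sum = 0;
>       for (int i = 0; i < a.length; i++) {    // back-edge poll elided
>           sum += a[i];
>       }
>       return sum;
>   }
>
>   // Using a long induction variable is one common way to keep the loop
>   // uncounted, so the back-edge safepoint poll stays in (at some cost):
>   static long sumUncounted(int[] a) {
>       long sum = 0;
>       for (long i = 0; i < a.length; i++) {   // poll kept on the back edge
>           sum += a[(int) i];
>       }
>       return sum;
>   }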
>
> Kind regards,
> Kirk
>
>
> On Tuesday, December 5, 2017 at 10:30:57 AM UTC+1, Mark Price wrote:
>>
>> Hi Aleksey,
>> thanks for the response. The I/O is definitely one problem, but I was 
>> trying to figure out whether it was contributing to the long TTSP times, or 
>> whether I might have some code that was misbehaving (e.g. NonCountedLoops).
>>
>> Your response aligns with my guesswork, so hopefully I just have the one 
>> problem to solve ;)
>>
>>
>>
>> Cheers,
>>
>> Mark
>>
>> On Tuesday, 5 December 2017 09:24:33 UTC, Aleksey Shipilev wrote:
>>>
>>> On 12/05/2017 09:26 AM, Mark Price wrote: 
>>> > I'm investigating some long time-to-safepoint pauses in 
>>> oracle/openjdk. The application in question 
>>> > is also suffering from some fairly nasty I/O problems where 
>>> latency-sensitive threads are being 
>>> > descheduled in uninterruptible sleep state due to needing a 
>>> file-system lock. 
>>> > 
>>> > My question: can the JVM detect that a thread is in 
>>> signal/interrupt-handler code and thus treat it 
>>> > as though it is at a safepoint (as I believe happens when a thread is 
>>> in native code via a JNI call)? 
>>> > 
>>> > For instance, given the stack trace below, will the JVM need to wait 
>>> for the thread to be scheduled 
>>> > back on to CPU in order to come to a safepoint, or will it be treated 
>>> as "in-native"? 
>>> > 
>>> >         7fff81714cd9 __schedule ([kernel.kallsyms]) 
>>> >         7fff817151e5 schedule ([kernel.kallsyms]) 
>>> >         7fff81717a4b rwsem_down_write_failed ([kernel.kallsyms]) 
>>> >         7fff813556e7 call_rwsem_down_write_failed ([kernel.kallsyms]) 
>>> >         7fff817172ad down_write ([kernel.kallsyms]) 
>>> >         7fffa0403dcf xfs_ilock ([kernel.kallsyms]) 
>>> >         7fffa04018fe xfs_vn_update_time ([kernel.kallsyms]) 
>>> >         7fff8122cc5d file_update_time ([kernel.kallsyms]) 
>>> >         7fffa03f7183 xfs_filemap_page_mkwrite ([kernel.kallsyms]) 
>>> >         7fff811ba935 do_page_mkwrite ([kernel.kallsyms]) 
>>> >         7fff811bda74 handle_pte_fault ([kernel.kallsyms]) 
>>> >         7fff811c041b handle_mm_fault ([kernel.kallsyms]) 
>>> >         7fff8106adbe __do_page_fault ([kernel.kallsyms]) 
>>> >         7fff8106b0c0 do_page_fault ([kernel.kallsyms]) 
>>> >         7fff8171af48 page_fault ([kernel.kallsyms]) 
>>> >         ---- java stack trace ends here ---- 
>>>
>>> I am pretty sure an out-of-band page fault in a Java thread does not yield a 
>>> safepoint. At the very least because safepoint polls happen at given 
>>> locations in the generated code: we need the pointer map as part of the 
>>> machine state, and that is generated by Hotspot (only) around the 
>>> safepoint polls. Page faulting on random read/write insns does not have 
>>> that luxury. Even if the JVM had intercepted that fault, there is not 
>>> enough metadata to work with. 
>>>
>>> The stacktrace above seems to say you have page faulted and this 
>>> incurred disk I/O? This is 
>>> swapping, I think, and all performance bets are off at that point. 
>>>
>>> Thanks, 
>>> -Aleksey 
>>>
>>>
