Here is an idea for a quick way to analyze this. Zing has a built-in TTSP 
profiler (since we pretty much coined the term "TTSP"). You could just 
download the (free) Zing trial version, run your application on it, see if 
the TTSP behavior is still there, and potentially get some quick clarity 
about what is causing it.

Zing will, by default, print out "Detected TTSP" warnings to stderr when it 
sees any thread experience a TTSP above a detection threshold (the warning 
threshold defaults to 1 second, but can be set with 
-XX:SafepointProfilerThresholdMS=<val>). You don't have to wait for a real 
pause to hit you to find these. Zing runs all threads through a background 
checkpoint every 5 seconds, so code paths with long TTSPs have a hard time 
hiding from the detector. Each thread crosses the checkpoint individually 
at its next safepoint polling opportunity, with no actual pause, and no 
actual delay in execution. The only time a thread bears a real cost is when 
its TTSP is so long that it crosses the threshold, in which case signals are 
used to collect context and report it.
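
For example, to catch shorter stalls you could lower the detection threshold 
(a hypothetical launch line; only the threshold flag is the relevant part):

  java -XX:SafepointProfilerThresholdMS=100 -jar yourapp.jar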

For an example of the kind of output you'd see when e.g. mapped file page 
faults are causing long TTSPs, see below [note the clear finger pointing at 
org.apache.lucene.store.ByteBufferIndexInput.readByte() as the culprit in 
this example].

-- Gil.

Detected TTSP issue: start: 41.768 wait: 1748.384
Dumping stack for thread 0x0000440019600000
"elasticsearch[localhost][search][T#49]" id: 171575 prio: 5 os_prio: 0 sched
: SCHED_OTHER allowed_cpus: 000000000ffffffffffffffff
lock_release: 43.516 last_cpu: 9 cpu_time: 25
  0 0x00007fc63245f5bc SafepointProfilerBuf::record_sync_stack(JavaThread*)
  1 0x00007fc6324817e5 JavaThread::poll_at_safepoint_static(JavaThread*)
  2 0x000000003000eae9 StubRoutines::safepoint_handler
  3 0x00000000300dfc73 org.apache.lucene.store.DataInput.readVInt()I
  4 0x0000000031fa91df org.apache.lucene.codecs.blocktree.
SegmentTermsEnumFrame.loadBlock()V
  5 0x0000000031f99664 org.apache.lucene.codecs.blocktree.SegmentTermsEnum.
seekExact(Lorg/apache/lucene/util/BytesRef;)Z
  6 0x0000000031fad584 org.apache.lucene.index.TermContext.build(Lorg/apache
/lucene/index/IndexReaderContext;Lorg/apache/lucene/index/Term;)Lorg/apache/
lucene/index/TermContext;
  7 0x00000000319a40f6 interpreter
  8 0x00000000319a3cba interpreter


signal_sent: 42.768 signal_responded: 43.516 state: D wchan: wait_on_page_bit_killable last_cpu: 9 cpu_time: 25
  0 0x00000000300926f2 org.apache.lucene.store.ByteBufferIndexInput.readByte()B
  1 0x00000000300dfacd org.apache.lucene.store.DataInput.readVInt()I
  2 0x0000000031fa9119 org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V
  3 0x0000000031f99664 org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z
  4 0x0000000031fad584 org.apache.lucene.index.TermContext.build(Lorg/apache/lucene/index/IndexReaderContext;Lorg/apache/lucene/index/Term;)Lorg/apache/lucene/index/TermContext;
  5 0x00000000319a40f6 interpreter
  6 0x00000000319a3cba interpreter
  7 0x00000000319a3cba interpreter
  8 0x00000000319a3cba interpreter
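
The second stack above is the telling one: the thread is blocked in the kernel 
on a page fault (state D, wchan wait_on_page_bit_killable) in the middle of a 
mapped-buffer read, and cannot reach its next safepoint poll until the disk 
read completes. A minimal sketch of that pattern in plain Java (hypothetical 
file path; nothing Lucene-specific about it):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedReadTtspDemo {
    public static void main(String[] args) throws Exception {
        // Map a file larger than free RAM so that reads actually fault to disk.
        try (RandomAccessFile raf = new RandomAccessFile("/data/huge.bin", "r");
             FileChannel ch = raf.getChannel()) {
            long size = Math.min(ch.size(), Integer.MAX_VALUE);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long sum = 0;
            for (int i = 0; i < buf.limit(); i++) {
                // Each get() can page-fault. The faulting thread then sits in
                // the kernel, unable to reach a safepoint poll until the read
                // completes -- exactly the stall the profiler flagged above.
                sum += buf.get(i);
            }
            System.out.println(sum);
        }
    }
}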


On Thursday, June 4, 2020 at 2:07:10 PM UTC-7, Zac Blanco wrote:
>
> Hi All,
>
> I'm hoping someone might be able to point me in the right direction with 
> this or offer some ideas.
> I've done a lot of research but haven't found any other threads describing 
> similar behavior. The closest one I could find was this 
> <https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE>, 
> which never seems to have been solved.
>
> I've been working on this issue for a few weeks. We have a JVM-based server 
> application which is experiencing long pauses (anywhere from 10 seconds to 
> multiple minutes). Below is an example of the safepoint statistics output:
>
>          vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 298.236: no vm operation                  [     385          0              0    ]      [     0     0 21638     1     0    ]  0
> 320.877: no vm operation                  [     336          0              0    ]      [     0     0  1978     1     0    ]  0
> 336.859: no vm operation                  [     309          1              2    ]      [     0     0  6497     1     0    ]  0
> 351.362: no vm operation                  [     294          0              0    ]      [     0     0 32821     0     0    ]  0
> 401.189: no vm operation                  [     296          3              2    ]      [     0     0  3471     0     0    ]  0
> 427.675: no vm operation                  [     303          0              0    ]      [     0     0 10414     1     0    ]  0
> 446.098: no vm operation                  [     346          2              1    ]      [     0     0 48768     1     0    ]  0
>
>
> These are all printed back-to-back while the application is running. The GC 
> logs don't even indicate that these pauses were occurring:
>
> 2020-06-04T00:49:40.008+0000: 268.104: Total time for which application threads were stopped: 1.5127234 seconds, Stopping threads took: 1.5089316 seconds
> 2020-06-04T00:49:42.411+0000: 270.507: Total time for which application threads were stopped: 0.0108752 seconds, Stopping threads took: 0.0065995 seconds
> 2020-06-04T00:49:42.760+0000: 270.856: [GC (Allocation Failure) [PSYoungGen: 8389120K->1016108K(9786880K)] 8550808K->1177813K(32156672K), 1.8051522 secs] [Times: user=6.83 sys=4.62, real=1.80 secs]
> 2020-06-04T00:49:44.566+0000: 272.662: Total time for which application threads were stopped: 1.8379623 seconds, Stopping threads took: 0.0267499 seconds
> 2020-06-04T00:49:46.114+0000: 274.210: Total time for which application threads were stopped: 0.5484401 seconds, Stopping threads took: 0.5438942 seconds
> 2020-06-04T00:49:47.120+0000: 275.216: Total time for which application threads were stopped: 0.0055404 seconds, Stopping threads took: 0.0007684 seconds
> 2020-06-04T00:49:50.127+0000: 278.222: Total time for which application threads were stopped: 0.0059341 seconds, Stopping threads took: 0.0005394 seconds
> 2020-06-04T00:50:02.136+0000: 290.232: Total time for which application threads were stopped: 0.0048284 seconds, Stopping threads took: 0.0008109 seconds
> 2020-06-04T00:50:07.140+0000: 295.235: Total time for which application threads were stopped: 0.0029092 seconds, Stopping threads took: 0.0004694 seconds
> 2020-06-04T00:50:31.781+0000: 319.877: Total time for which application threads were stopped: 0.0035001 seconds, Stopping threads took: 0.0012628 seconds
> 2020-06-04T00:50:34.761+0000: 322.857: Total time for which application threads were stopped: 0.0014956 seconds, Stopping threads took: 0.0002727 seconds
> 2020-06-04T00:50:55.261+0000: 343.357: Total time for which application threads were stopped: 0.0016439 seconds, Stopping threads took: 0.0004175 seconds
> 2020-06-04T00:51:36.088+0000: 384.184: Total time for which application threads were stopped: 0.0011020 seconds, Stopping threads took: 0.0001815 seconds
> 2020-06-04T00:51:56.565+0000: 404.661: Total time for which application threads were stopped: 0.0012547 seconds, Stopping threads took: 0.0003102 seconds
> 2020-06-04T00:52:29.996+0000: 438.092: Total time for which application threads were stopped: 0.0016347 seconds, Stopping threads took: 0.0002464 seconds
> 2020-06-04T00:53:26.772+0000: 494.868: Total time for which application threads were stopped: 0.0021074 seconds, Stopping threads took: 0.0003538 seconds
>
> During the time period of ~270-494 seconds the application pretty much 
> stopped serving all IO requests and did not respond to any jstack operations.
>
> Hardware information:
> OS: RHEL 7.8 (kernel version 3.10.0-1127)
> JVM: Oracle JDK 8u161
> Machine specs: 72 cores, 754GiB RAM.
>
> About 650 GiB of the memory is allocated to a ramdisk; the JVM application (as 
> you will see below) has a max heap of 32 GiB and direct memory of 24 GiB. This 
> is the only application running on this server aside from a metrics collection 
> daemon.
>
> The JVM has the following flags specified:
>
> -Xmx32g
> -Xms32g
> -XX:MaxDirectMemorySize=24g
> -XX:+PrintGC
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintSafepointStatistics
> -XX:PrintSafepointStatisticsCount=1
> -XX:+SafepointTimeout
> -XX:SafepointTimeoutDelay=10000
> -XX:-UseBiasedLocking
> -Xloggc:<path>/logs/gc_worker.log
> -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.local.only=false
> -Dcom.sun.management.jmxremote.port=7093
> -Dcom.sun.management.jmxremote.rmi.port=7093
> -Djava.net.preferIPv4Stack=true
>
>
> I already read through most of the discussion at these links regarding 
> somewhat
> similar behavior:
> - https://stackoverflow.com/questions/30393470/what-causes-long-spin-and-sync-times-in-java
> - https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ
> - https://groups.google.com/forum/#!msg/mechanical-sympathy/htQ3Rc1JEKk/VzSV8iBvBgAJ
> - https://groups.google.com/forum/#!topic/mechanical-sympathy/9oeDihc2WiE
> - http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-May/002262.html
> - http://mail-archives.apache.org/mod_mbox/cassandra-user/201610.mbox/%3c1475948419.268606.749851841.16dfb...@webmail.messagingengine.com%3e
>
> The application itself does a lot of mapped file IO against spinning disks. 
> Each mmap is usually just a few MiB, spread across 12 enterprise-class disk 
> drives or a ramdisk. However, we've previously been able to max out 10Gbit 
> connections with the same application on much smaller machines (8-16 cores, 
> 64GiB RAM) without any issue. There are 144 threads doing the mmapped IO at 
> any given time on this hardware.
>
> One of the points that stands out the most in this situation is that
> -XX:+SafepointTimeout doesn't actually print any information about threads 
> failing to reach a safepoint. From my understanding, any time I get a 
> safepoint statistics log where the "sync" time is larger than 10000 ms, I 
> should get output on stdout listing the names of the threads which failed to 
> reach the safepoint. In my logs, there is no such output.
>
> Any ideas for debugging this are greatly appreciated. One thing I have not yet 
> tried is using perf and kernel uprobes to capture the stacks (as suggested here 
> <https://groups.google.com/forum/#!msg/mechanical-sympathy/tepoA7PRFRU/-4-FjKbUBwAJ>),
> roughly along the lines of the sketch below. The reason I have not tried this 
> is that this software actually runs on a cluster of ~90 nodes (all the same 
> hardware and software stack) and the occurrence of the pauses is seemingly 
> random across all of the nodes. Setting up the uprobes and collecting the data 
> would be a complex task which I haven't had the time for yet.
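>
> For reference, a minimal sketch of what I believe that approach looks like 
> (untested; the libjvm.so path and the perf-generated event names are 
> assumptions, and the SafepointSynchronize symbols come from the HotSpot 
> sources):
>
> # add uprobes on the JVM's safepoint begin/end functions
> sudo perf probe -x "$JAVA_HOME/jre/lib/amd64/server/libjvm.so" \
>     --add 'SafepointSynchronize::begin' --add 'SafepointSynchronize::end'
> # record those events (with call stacks) from the running JVM for a minute
> sudo perf record -e 'probe_libjvm:*' -g -p <jvm-pid> -- sleep 60
> # long gaps between a begin and its matching end bracket the bad TTSPs
> sudo perf script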
>  
> I'd appreciate any ideas anyone has to offer.
>
> Thanks,
> Zac
>
>
