I set up a six-node Ignite cluster to test performance and stability. Here's my setup (the closing tags were cut off when I pasted; the nesting is as shown):

<bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="segmentationPolicy" value="RESTART_JVM"/>
    <property name="peerClassLoadingEnabled" value="true"/>
    <property name="failureDetectionTimeout" value="60000"/>
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="default_Region"/>
                    <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                    <property name="maxSize" value="#{500L * 1024 * 1024 * 1024}"/>
                    <property name="persistenceEnabled" value="true"/>
                    <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>
                </bean>
            </property>
            <property name="walMode" value="BACKGROUND"/>
            <property name="walFlushFrequency" value="5000"/>
            <property name="checkpointFrequency" value="600000"/>
            <property name="writeThrottlingEnabled" value="true"/>
        </bean>
    </property>
</bean>
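For reference, the SpEL size expressions above evaluate to the following byte counts (just a sanity-check snippet I ran, not part of the cluster code):

```java
// Sanity check of the SpEL size expressions in the data region config above.
public class RegionSizes {
    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;          // 1 GiB in bytes
        long initialSize = 100 * gib;            // initialSize              -> 100 GiB
        long maxSize = 500 * gib;                // maxSize                  -> 500 GiB
        long checkpointPageBufferSize = 2 * gib; // checkpointPageBufferSize -> 2 GiB
        System.out.println(initialSize + " " + maxSize + " " + checkpointPageBufferSize);
        // prints: 107374182400 536870912000 2147483648
    }
}
```

So each node's off-heap data region may grow to 500 GiB, far beyond the 32 GB heap.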
I start each Ignite node with this command:

./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps -J-XX:+PrintAdaptiveSizePolicy -J-Xloggc:/ignitegc-$(date +%Y_%m_%d-%H_%M).log config/persistent-config.xml

One of the nodes just dropped out of the topology. Here is its log for the last three minutes before the node went down:

[08:39:58,982][INFO][grid-timeout-worker-#119][IgniteKernal] Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=8333aa56, uptime=02:34:01.948]
    ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
    ^-- CPU [cur=41%, avg=33.18%, GC=0%]
    ^-- PageMemory [pages=8912687]
    ^-- Heap [used=8942MB, free=72.05%, comm=32000MB]
    ^-- Non heap [used=70MB, free=95.35%, comm=73MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
[08:40:51,945][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1, pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904, len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms, fsync=965250ms, total=978172ms]
[08:40:53,086][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=14d929ac-1b5c-4df2-a71f-002d5eb41f14, startPtr=FileWALPointer [idx=2242, fileOff=65211837, len=15981], checkpointLockWait=0ms, checkpointLockHoldTime=39ms, walCpRecordFsyncDuration=720ms, pages=2110545, reason='timeout']
[08:40:57,793][INFO][data-streamer-stripe-1-#58][PageMemoryImpl] Throttling is applied to page modifications [percentOfPartTime=0.22, markDirty=7192 pages/sec, checkpointWrite=2450 pages/sec, estIdealMarkDirty=139543 pages/sec, curDirty=0.00, maxDirty=0.17, avgParkTime=1732784 ns, pages: (total=2110545, evicted=0, written=875069, synced=0, cpBufUsed=92,
cpBufTotal=518215)]
[08:40:58,991][INFO][grid-timeout-worker-#119][IgniteKernal] Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=8333aa56, uptime=02:35:01.957]
    ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
    ^-- CPU [cur=9.3%, avg=33%, GC=0%]
    ^-- PageMemory [pages=8920631]
    ^-- Heap [used=13262MB, free=58.55%, comm=32000MB]
    ^-- Non heap [used=70MB, free=95.34%, comm=73MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
[08:41:29,050][WARNING][jvm-pause-detector-worker][] Possible too long JVM pause: 22667 milliseconds.
[08:41:29,050][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.29.41.23:32815, rmtPort=32815
[08:41:29,052][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.29.41.25:46515, rmtPort=46515
[08:41:30,063][INFO][data-streamer-stripe-3-#60][PageMemoryImpl] Throttling is applied to page modifications [percentOfPartTime=0.49, markDirty=26945 pages/sec, checkpointWrite=2612 pages/sec, estIdealMarkDirty=210815 pages/sec, curDirty=0.00, maxDirty=0.34, avgParkTime=1024456 ns, pages: (total=2110545, evicted=0, written=1861330, synced=0, cpBufUsed=8657, cpBufTotal=518215)]
[08:42:42,276][WARNING][jvm-pause-detector-worker][] Possible too long JVM pause: 67967 milliseconds.
[08:42:42,277][INFO][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node seems to be disconnected from topology (failure detection timeout is reached) [failureDetectionTimeout=60000, connCheckFreq=20000]
[08:42:42,280][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.29.42.49:36509, rmtPort=36509
[08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.45, rmtPort=38712]
[08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.45, rmtPort=38712]
[08:42:42,287][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712]
[08:42:42,289][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[08:42:42,290][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712
[08:42:42,290][WARNING][disco-event-worker-#161][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=8333aa56-8bf4-4558-a387-809b1d2e2e5b, addrs=[10.29.42.44, 127.0.0.1], sockAddrs=[sap-datanode1/10.29.42.44:49500, /127.0.0.1:49500], discPort=49500, order=1, intOrder=1, lastExchangeTime=1528447362286, loc=true, ver=2.5.0#20180523-sha1:86e110c7, isClient=false]
[08:42:42,294][SEVERE][tcp-disco-srvr-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[08:42:42,294][SEVERE][tcp-disco-srvr-#2][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]

As you can see from the log, the JVM paused for 67967 milliseconds, so that is why the node went down. But I also looked at the GC log; here is the last GC entry before the pause:

2018-06-08T08:38:56.654+0000: 9199.215: [GC pause (G1 Evacuation Pause) (young)
 9199.215: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 32828, predicted base time: 18.85 ms, remaining time: 181.15 ms, target pause time: 200.00 ms]
 9199.215: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 2350 regions, survivors: 50 regions, predicted young region time: 79.31 ms]
 9199.215: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 2350 regions, survivors: 50 regions, old: 0 regions, predicted pause time: 98.16 ms, target pause time: 200.00 ms]
, 0.0417411 secs]
   [Parallel Time: 32.8 ms, GC Workers: 38]
      [GC Worker Start (ms): Min: 9199215.4, Avg: 9199215.5, Max: 9199215.7, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 0.4, Avg: 0.6, Max: 1.0, Diff: 0.6, Sum: 21.8]
      [Update RS (ms): Min: 2.1, Avg: 4.2, Max: 5.3, Diff: 3.3, Sum: 158.3]
         [Processed Buffers: Min: 1, Avg: 6.0, Max: 23, Diff: 22, Sum: 228]
      [Scan RS (ms): Min: 0.1, Avg: 0.5, Max: 0.9, Diff: 0.8, Sum: 18.6]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
      [Object Copy (ms): Min: 25.9, Avg: 26.8, Max: 28.5, Diff: 2.6, Sum: 1017.5]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 38]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.4, Sum: 8.9]
      [GC Worker Total (ms): Min: 32.0, Avg: 32.3, Max:
32.5, Diff: 0.6, Sum: 1225.7]
      [GC Worker End (ms): Min: 9199247.6, Avg: 9199247.8, Max: 9199248.0, Diff: 0.4]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 2.7 ms]
   [Other: 6.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.6 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.4 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 4.4 ms]
   [Eden: 18.4G(18.4G)->0.0B(18.5G) Survivors: 400.0M->232.0M Heap: 21.6G(31.2G)->3110.7M(31.2G)]
 [Times: user=1.28 sys=0.01, real=0.04 secs]
Heap
 garbage-first heap   total 32768000K, used 14908229K [0x00007f9b38000000, 0x00007f9b38807d00, 0x00007fa308000000)
  region size 8192K, 1458 young (11943936K), 29 survivors (237568K)
 Metaspace       used 43388K, capacity 44288K, committed 44492K, reserved 1089536K
  class space    used 4241K, capacity 4487K, committed 4556K, reserved 1048576K

Everything looks normal in this GC entry, but it happened at 08:38, while the JVM froze at around 08:40. This picture shows what the Java heap looked like in the last minutes before the node went down (please add four hours to match the log times; the monitoring tool was in a different timezone):

Screen_Shot_2018-06-08_at_17.png <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/Screen_Shot_2018-06-08_at_17.png>

So my questions are: Why did this node go down? What was the JVM doing in the last minutes before the node went down? Why are there no log entries at all during those last minutes? And why isn't segmentationPolicy=RESTART_JVM working?

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
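P.S. One thing I did check myself: the 67967 ms pause reported by the pause detector is longer than my configured failureDetectionTimeout of 60000 ms, so the segmentation itself is at least consistent with the pause. A trivial check, with both numbers taken from the config and log above:

```java
// Trivial consistency check: detected JVM pause vs. configured failure detection timeout.
public class PauseVsTimeout {
    public static void main(String[] args) {
        long jvmPauseMs = 67_967;                // from the jvm-pause-detector-worker warning
        long failureDetectionTimeoutMs = 60_000; // from failureDetectionTimeout in the config
        // A pause longer than the timeout means the rest of the cluster
        // is expected to consider this node failed before it wakes up.
        System.out.println(jvmPauseMs > failureDetectionTimeoutMs); // prints: true
    }
}
```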