I set up a six-node Ignite cluster to test performance and stability. Here's my setup (the closing tags were cut off when I pasted; the nesting is as shown):

<bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="segmentationPolicy" value="RESTART_JVM"/>
    <property name="peerClassLoadingEnabled" value="true"/>
    <property name="failureDetectionTimeout" value="60000"/>
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="default_Region"/>
                    <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
                    <property name="maxSize" value="#{500L * 1024 * 1024 * 1024}"/>
                    <property name="persistenceEnabled" value="true"/>
                    <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>
                </bean>
            </property>
            <property name="walMode" value="BACKGROUND"/>
            <property name="walFlushFrequency" value="5000"/>
            <property name="checkpointFrequency" value="600000"/>
            <property name="writeThrottlingEnabled" value="true"/>
        </bean>
    </property>
</bean>
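For reference, the SpEL size expressions above evaluate to the following byte counts (just a sanity-check snippet I ran, not part of the cluster code):

```java
// Sanity check of the SpEL size expressions in the data region config above.
public class RegionSizes {
    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;          // 1 GiB in bytes
        long initialSize = 100 * gib;            // initialSize              -> 100 GiB
        long maxSize = 500 * gib;                // maxSize                  -> 500 GiB
        long checkpointPageBufferSize = 2 * gib; // checkpointPageBufferSize -> 2 GiB
        System.out.println(initialSize + " " + maxSize + " " + checkpointPageBufferSize);
        // prints: 107374182400 536870912000 2147483648
    }
}
```

So each node's off-heap data region may grow to 500 GiB, far beyond the 32 GB heap.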
I start each Ignite node with this command:

./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps -J-XX:+PrintAdaptiveSizePolicy -J-Xloggc:/ignitegc-$(date +%Y_%m_%d-%H_%M).log config/persistent-config.xml

One of the nodes just dropped out of the topology. Here is its log for the last three minutes before the node went down:

[08:39:58,982][INFO][grid-timeout-worker-#119][IgniteKernal] Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=8333aa56, uptime=02:34:01.948]
    ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
    ^-- CPU [cur=41%, avg=33.18%, GC=0%]
    ^-- PageMemory [pages=8912687]
    ^-- Heap [used=8942MB, free=72.05%, comm=32000MB]
    ^-- Non heap [used=70MB, free=95.35%, comm=73MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
[08:40:51,945][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1, pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904, len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms, fsync=965250ms, total=978172ms]
[08:40:53,086][INFO][db-checkpoint-thread-#178][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=14d929ac-1b5c-4df2-a71f-002d5eb41f14, startPtr=FileWALPointer [idx=2242, fileOff=65211837, len=15981], checkpointLockWait=0ms, checkpointLockHoldTime=39ms, walCpRecordFsyncDuration=720ms, pages=2110545, reason='timeout']
[08:40:57,793][INFO][data-streamer-stripe-1-#58][PageMemoryImpl] Throttling is applied to page modifications [percentOfPartTime=0.22, markDirty=7192 pages/sec, checkpointWrite=2450 pages/sec, estIdealMarkDirty=139543 pages/sec, curDirty=0.00, maxDirty=0.17, avgParkTime=1732784 ns, pages: (total=2110545, evicted=0, written=875069, synced=0, cpBufUsed=92,
cpBufTotal=518215)]
[08:40:58,991][INFO][grid-timeout-worker-#119][IgniteKernal] Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=8333aa56, uptime=02:35:01.957]
    ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
    ^-- CPU [cur=9.3%, avg=33%, GC=0%]
    ^-- PageMemory [pages=8920631]
    ^-- Heap [used=13262MB, free=58.55%, comm=32000MB]
    ^-- Non heap [used=70MB, free=95.34%, comm=73MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
[08:41:29,050][WARNING][jvm-pause-detector-worker][] Possible too long JVM pause: 22667 milliseconds.
[08:41:29,050][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.29.41.23:32815, rmtPort=32815
[08:41:29,052][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.29.41.25:46515, rmtPort=46515
[08:41:30,063][INFO][data-streamer-stripe-3-#60][PageMemoryImpl] Throttling is applied to page modifications [percentOfPartTime=0.49, markDirty=26945 pages/sec, checkpointWrite=2612 pages/sec, estIdealMarkDirty=210815 pages/sec, curDirty=0.00, maxDirty=0.34, avgParkTime=1024456 ns, pages: (total=2110545, evicted=0, written=1861330, synced=0, cpBufUsed=8657, cpBufTotal=518215)]
[08:42:42,276][WARNING][jvm-pause-detector-worker][] Possible too long JVM pause: 67967 milliseconds.
[08:42:42,277][INFO][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node seems to be disconnected from topology (failure detection timeout is reached) [failureDetectionTimeout=60000, connCheckFreq=20000]
[08:42:42,280][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.29.42.49:36509, rmtPort=36509
[08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/10.29.42.45, rmtPort=38712]
[08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/10.29.42.45, rmtPort=38712]
[08:42:42,287][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712]
[08:42:42,289][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[08:42:42,290][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712
[08:42:42,290][WARNING][disco-event-worker-#161][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=8333aa56-8bf4-4558-a387-809b1d2e2e5b, addrs=[10.29.42.44, 127.0.0.1], sockAddrs=[sap-datanode1/10.29.42.44:49500, /127.0.0.1:49500], discPort=49500, order=1, intOrder=1, lastExchangeTime=1528447362286, loc=true, ver=2.5.0#20180523-sha1:86e110c7, isClient=false]
[08:42:42,294][SEVERE][tcp-disco-srvr-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[08:42:42,294][SEVERE][tcp-disco-srvr-#2][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]

As you can see from the log, the JVM paused for 67967 milliseconds, so that is why the node went down. But I also looked at the GC log; here is the last GC entry before the pause:

2018-06-08T08:38:56.654+0000: 9199.215: [GC pause (G1 Evacuation Pause) (young)
 9199.215: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 32828, predicted base time: 18.85 ms, remaining time: 181.15 ms, target pause time: 200.00 ms]
 9199.215: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 2350 regions, survivors: 50 regions, predicted young region time: 79.31 ms]
 9199.215: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 2350 regions, survivors: 50 regions, old: 0 regions, predicted pause time: 98.16 ms, target pause time: 200.00 ms]
, 0.0417411 secs]
   [Parallel Time: 32.8 ms, GC Workers: 38]
      [GC Worker Start (ms): Min: 9199215.4, Avg: 9199215.5, Max: 9199215.7, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 0.4, Avg: 0.6, Max: 1.0, Diff: 0.6, Sum: 21.8]
      [Update RS (ms): Min: 2.1, Avg: 4.2, Max: 5.3, Diff: 3.3, Sum: 158.3]
         [Processed Buffers: Min: 1, Avg: 6.0, Max: 23, Diff: 22, Sum: 228]
      [Scan RS (ms): Min: 0.1, Avg: 0.5, Max: 0.9, Diff: 0.8, Sum: 18.6]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
      [Object Copy (ms): Min: 25.9, Avg: 26.8, Max: 28.5, Diff: 2.6, Sum: 1017.5]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 38]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.4, Sum: 8.9]
      [GC Worker Total (ms): Min: 32.0, Avg: 32.3, Max:
32.5, Diff: 0.6, Sum: 1225.7]
      [GC Worker End (ms): Min: 9199247.6, Avg: 9199247.8, Max: 9199248.0, Diff: 0.4]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 2.7 ms]
   [Other: 6.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.6 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.4 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 4.4 ms]
   [Eden: 18.4G(18.4G)->0.0B(18.5G) Survivors: 400.0M->232.0M Heap: 21.6G(31.2G)->3110.7M(31.2G)]
 [Times: user=1.28 sys=0.01, real=0.04 secs]
Heap
 garbage-first heap   total 32768000K, used 14908229K [0x00007f9b38000000, 0x00007f9b38807d00, 0x00007fa308000000)
  region size 8192K, 1458 young (11943936K), 29 survivors (237568K)
 Metaspace       used 43388K, capacity 44288K, committed 44492K, reserved 1089536K
  class space    used 4241K, capacity 4487K, committed 4556K, reserved 1048576K

Everything looks normal in this GC entry, but it happened at 08:38, while the JVM froze at around 08:40. This picture shows what the Java heap looked like in the last minutes before the node went down (please add four hours to match the log times; the monitoring tool was in a different timezone):

Screen_Shot_2018-06-08_at_17.png <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/Screen_Shot_2018-06-08_at_17.png>

So my questions are: Why did this node go down? What was the JVM doing in the last minutes before the node went down? Why are there no log entries at all during those last minutes? And why isn't segmentationPolicy=RESTART_JVM working?

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
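P.S. One thing I did check myself: the 67967 ms pause reported by the pause detector is longer than my configured failureDetectionTimeout of 60000 ms, so the segmentation itself is at least consistent with the pause. A trivial check, with both numbers taken from the config and log above:

```java
// Trivial consistency check: detected JVM pause vs. configured failure detection timeout.
public class PauseVsTimeout {
    public static void main(String[] args) {
        long jvmPauseMs = 67_967;                // from the jvm-pause-detector-worker warning
        long failureDetectionTimeoutMs = 60_000; // from failureDetectionTimeout in the config
        // A pause longer than the timeout means the rest of the cluster
        // is expected to consider this node failed before it wakes up.
        System.out.println(jvmPauseMs > failureDetectionTimeoutMs); // prints: true
    }
}
```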