Hi,

It looks like the node was segmented due to a long JVM pause.
There are two "long JVM pause" messages and a suspiciously long checkpoint
(fsync alone accounts for ~965 of the ~978 seconds total):

Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
fsync=965250ms, total=978172ms]


Do you use HDDs? Are you sure no swapping is happening?
Do you have enough free space on disk?
Is there any other heavy process on the server that may take too much CPU time,
so that the JVM gets descheduled from the CPU for too long?
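
For context, the "Possible too long JVM pause" warning is not tied to GC: roughly
speaking, the jvm-pause-detector-worker thread sleeps for a short interval and
reports whenever the sleep overshoots by a large margin, so it also fires when the
whole process is stalled by swapping, a slow disk or CPU starvation. A rough sketch
of that idea in plain Java (my own illustration, not Ignite's actual code; the class
name, interval and threshold are made up):

public class PauseWatchdog {
    private static final long SLEEP_MS = 50;       // expected sleep interval
    private static final long THRESHOLD_MS = 500;  // report pauses longer than this

    public static void main(String[] args) throws InterruptedException {
        Thread watchdog = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                long start = System.nanoTime();
                try {
                    Thread.sleep(SLEEP_MS);
                }
                catch (InterruptedException e) {
                    return;
                }
                // Any extra time beyond the requested sleep means the whole JVM
                // was frozen: GC, swap-in, or the process being descheduled.
                long pauseMs = (System.nanoTime() - start) / 1_000_000 - SLEEP_MS;

                if (pauseMs > THRESHOLD_MS)
                    System.err.println("Possible too long JVM pause: " + pauseMs + " ms");
            }
        }, "pause-watchdog");

        watchdog.setDaemon(true);
        watchdog.start();

        Thread.sleep(60_000); // keep the demo alive for a minute
    }
}

That is why a ~68 second pause next to a clean GC log usually points to the OS level
(swap, disk, scheduling) rather than to the collector.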


It looks weird that the JVM was not restarted; we have to check that case.
Judging by your log, the SYSTEM_WORKER_TERMINATION failure was handled by
StopNodeOrHaltFailureHandler, which halts the JVM immediately, so the RESTART_JVM
segmentation policy never got a chance to act.
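
For reference, here is how both knobs look when set programmatically; a minimal
sketch only, assuming Ignite 2.5 and a node started via ignite.sh (the class name
is made up, the Ignite APIs used are the regular public ones):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.RestartProcessFailureHandler;
import org.apache.ignite.plugin.segmentation.SegmentationPolicy;

public class RestartOnFailureExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Reaction to segmentation (the node dropped out of the topology):
        // RESTART_JVM signals the ignite.sh wrapper script to restart the process.
        cfg.setSegmentationPolicy(SegmentationPolicy.RESTART_JVM);

        // Reaction to critical failures such as SYSTEM_WORKER_TERMINATION.
        // The handler seen in your log is StopNodeOrHaltFailureHandler, which halts
        // the JVM; RestartProcessFailureHandler instead asks the wrapper script to
        // restart the process. Both restart paths only work when the node is
        // started through ignite.sh.
        cfg.setFailureHandler(new RestartProcessFailureHandler());

        Ignition.start(cfg);
    }
}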

On Fri, Jun 8, 2018 at 12:32 PM, Ray <ray...@cisco.com> wrote:

> I set up a six-node Ignite cluster to test performance and stability.
> Here's my setup.
>     <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
>         <property name="segmentationPolicy" value="RESTART_JVM"/>
>         <property name="peerClassLoadingEnabled" value="true"/>
>         <property name="failureDetectionTimeout" value="60000"/>
>         <property name="dataStorageConfiguration">
>             <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>                 <property name="defaultDataRegionConfiguration">
>                     <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>                         <property name="name" value="default_Region"/>
>                         <property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
>                         <property name="maxSize" value="#{500L * 1024 * 1024 * 1024}"/>
>                         <property name="persistenceEnabled" value="true"/>
>                         <property name="checkpointPageBufferSize" value="#{2L * 1024 * 1024 * 1024}"/>
>                     </bean>
>                 </property>
>                 <property name="walMode" value="BACKGROUND"/>
>                 <property name="walFlushFrequency" value="5000"/>
>                 <property name="checkpointFrequency" value="600000"/>
>                 <property name="writeThrottlingEnabled" value="true"/>
>
> And I used this command to start the Ignite node.
> ./ignite.sh -J-Xmx32000m -J-Xms32000m -J-XX:+UseG1GC
> -J-XX:+ScavengeBeforeFullGC -J-XX:+DisableExplicitGC -J-XX:+AlwaysPreTouch
> -J-XX:+PrintGCDetails -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDateStamps
> -J-XX:+PrintAdaptiveSizePolicy
> -J-Xloggc:/ignitegc-$(date +%Y_%m_%d-%H_%M).log config/persistent-config.xml
>
> One of the nodes just dropped from the topology. Here's the log for the last
> three minutes before this node went down.
> [08:39:58,982][INFO][grid-timeout-worker-#119][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=8333aa56, uptime=02:34:01.948]
>     ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
>     ^-- CPU [cur=41%, avg=33.18%, GC=0%]
>     ^-- PageMemory [pages=8912687]
>     ^-- Heap [used=8942MB, free=72.05%, comm=32000MB]
>     ^-- Non heap [used=70MB, free=95.35%, comm=73MB]
>     ^-- Outbound messages queue [size=0]
>     ^-- Public thread pool [active=0, idle=0, qSize=0]
>     ^-- System thread pool [active=0, idle=6, qSize=0]
> [08:40:51,945][INFO][db-checkpoint-thread-#178][
> GridCacheDatabaseSharedManager]
> Checkpoint finished [cpId=77cf2fa2-2a9f-48ea-bdeb-dda81b15dac1,
> pages=2050858, markPos=FileWALPointer [idx=2051, fileOff=38583904,
> len=15981], walSegmentsCleared=0, markDuration=920ms, pagesWrite=12002ms,
> fsync=965250ms, total=978172ms]
> [08:40:53,086][INFO][db-checkpoint-thread-#178][
> GridCacheDatabaseSharedManager]
> Checkpoint started [checkpointId=14d929ac-1b5c-4df2-a71f-002d5eb41f14,
> startPtr=FileWALPointer [idx=2242, fileOff=65211837, len=15981],
> checkpointLockWait=0ms, checkpointLockHoldTime=39ms,
> walCpRecordFsyncDuration=720ms, pages=2110545, reason='timeout']
> [08:40:57,793][INFO][data-streamer-stripe-1-#58][PageMemoryImpl]
> Throttling
> is applied to page modifications [percentOfPartTime=0.22, markDirty=7192
> pages/sec, checkpointWrite=2450 pages/sec, estIdealMarkDirty=139543
> pages/sec, curDirty=0.00, maxDirty=0.17, avgParkTime=1732784 ns, pages:
> (total=2110545, evicted=0, written=875069, synced=0, cpBufUsed=92,
> cpBufTotal=518215)]
> [08:40:58,991][INFO][grid-timeout-worker-#119][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=8333aa56, uptime=02:35:01.957]
>     ^-- H/N/C [hosts=9, nodes=16, CPUs=552]
>     ^-- CPU [cur=9.3%, avg=33%, GC=0%]
>     ^-- PageMemory [pages=8920631]
>     ^-- Heap [used=13262MB, free=58.55%, comm=32000MB]
>     ^-- Non heap [used=70MB, free=95.34%, comm=73MB]
>     ^-- Outbound messages queue [size=0]
>     ^-- Public thread pool [active=0, idle=0, qSize=0]
>     ^-- System thread pool [active=0, idle=6, qSize=0]
> [08:41:29,050][WARNING][jvm-pause-detector-worker][] Possible too long JVM
> pause: 22667 milliseconds.
> [08:41:29,050][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished
> serving remote node connection [rmtAddr=/10.29.41.23:32815, rmtPort=32815
> [08:41:29,052][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Finished
> serving remote node connection [rmtAddr=/10.29.41.25:46515, rmtPort=46515
> [08:41:30,063][INFO][data-streamer-stripe-3-#60][PageMemoryImpl]
> Throttling
> is applied to page modifications [percentOfPartTime=0.49, markDirty=26945
> pages/sec, checkpointWrite=2612 pages/sec, estIdealMarkDirty=210815
> pages/sec, curDirty=0.00, maxDirty=0.34, avgParkTime=1024456 ns, pages:
> (total=2110545, evicted=0, written=1861330, synced=0, cpBufUsed=8657,
> cpBufTotal=518215)]
> [08:42:42,276][WARNING][jvm-pause-detector-worker][] Possible too long JVM
> pause: 67967 milliseconds.
> [08:42:42,277][INFO][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node
> seems to be disconnected from topology (failure detection timeout is
> reached) [failureDetectionTimeout=60000, connCheckFreq=20000]
> [08:42:42,280][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Finished
> serving remote node connection [rmtAddr=/10.29.42.49:36509, rmtPort=36509
> [08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
> accepted incoming connection [rmtAddr=/10.29.42.45, rmtPort=38712]
> [08:42:42,286][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
> spawning a new thread for connection [rmtAddr=/10.29.42.45, rmtPort=38712]
> [08:42:42,287][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Started
> serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712]
> [08:42:42,289][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is
> out of topology (probably, due to short-time network problems).
> [08:42:42,290][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Finished
> serving remote node connection [rmtAddr=/10.29.42.45:38712, rmtPort=38712
> [08:42:42,290][WARNING][disco-event-worker-#161][GridDiscoveryManager]
> Local
> node SEGMENTED: TcpDiscoveryNode [id=8333aa56-8bf4-4558-a387-809b1d2e2e5b,
> addrs=[10.29.42.44, 127.0.0.1], sockAddrs=[sap-datanode1/10.29.42.44:49500
> ,
> /127.0.0.1:49500], discPort=49500, order=1, intOrder=1,
> lastExchangeTime=1528447362286, loc=true, ver=2.5.0#20180523-sha1:
> 86e110c7,
> isClient=false]
> [08:42:42,294][SEVERE][tcp-disco-srvr-#2][] Critical system error
> detected.
> Will be handled accordingly to configured handler [hnd=class
> o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext
> [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException:
> Thread
> tcp-disco-srvr-#2 is terminated unexpectedly.]]
> java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated
> unexpectedly.
>         at
> org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.
> java:5686)
>         at
> org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> [08:42:42,294][SEVERE][tcp-disco-srvr-#2][] JVM will be halted immediately
> due to the failure: [failureCtx=FailureContext
> [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException:
> Thread
> tcp-disco-srvr-#2 is terminated unexpectedly.]]
>
> As you can see from the log, the JVM paused for 67967 milliseconds, so this is
> the reason why this node went down.
> But I looked at the GC log; here's the last GC entry.
>
> 2018-06-08T08:38:56.654+0000: 9199.215: [GC pause (G1 Evacuation Pause)
> (young) 9199.215: [G1Ergonomics (CSet Construction) start choosing CSet,
> _pending_cards: 32828, predicted base time: 18.85 ms, remaining time:
> 181.15
> ms, target pause time: 200.00 ms]
>  9199.215: [G1Ergonomics (CSet Construction) add young regions to CSet,
> eden: 2350 regions, survivors: 50 regions, predicted young region time:
> 79.31 ms]
>  9199.215: [G1Ergonomics (CSet Construction) finish choosing CSet, eden:
> 2350 regions, survivors: 50 regions, old: 0 regions, predicted pause time:
> 98.16 ms, target pause time: 200.00 ms]
> , 0.0417411 secs]
>    [Parallel Time: 32.8 ms, GC Workers: 38]
>       [GC Worker Start (ms): Min: 9199215.4, Avg: 9199215.5, Max:
> 9199215.7,
> Diff: 0.3]
>       [Ext Root Scanning (ms): Min: 0.4, Avg: 0.6, Max: 1.0, Diff: 0.6,
> Sum:
> 21.8]
>       [Update RS (ms): Min: 2.1, Avg: 4.2, Max: 5.3, Diff: 3.3, Sum: 158.3]
>          [Processed Buffers: Min: 1, Avg: 6.0, Max: 23, Diff: 22, Sum: 228]
>       [Scan RS (ms): Min: 0.1, Avg: 0.5, Max: 0.9, Diff: 0.8, Sum: 18.6]
>       [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
> Sum: 0.2]
>       [Object Copy (ms): Min: 25.9, Avg: 26.8, Max: 28.5, Diff: 2.6, Sum:
> 1017.5]
>       [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
>          [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 38]
>       [GC Worker Other (ms): Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.4, Sum:
> 8.9]
>       [GC Worker Total (ms): Min: 32.0, Avg: 32.3, Max: 32.5, Diff: 0.6,
> Sum: 1225.7]
>       [GC Worker End (ms): Min: 9199247.6, Avg: 9199247.8, Max: 9199248.0,
> Diff: 0.4]
>    [Code Root Fixup: 0.2 ms]
>    [Code Root Purge: 0.0 ms]
>    [Clear CT: 2.7 ms]
>    [Other: 6.1 ms]
>       [Choose CSet: 0.0 ms]
>       [Ref Proc: 0.6 ms]
>       [Ref Enq: 0.0 ms]
>       [Redirty Cards: 0.4 ms]
>       [Humongous Register: 0.1 ms]
>       [Humongous Reclaim: 0.0 ms]
>       [Free CSet: 4.4 ms]
>    [Eden: 18.4G(18.4G)->0.0B(18.5G) Survivors: 400.0M->232.0M Heap:
> 21.6G(31.2G)->3110.7M(31.2G)]
>  [Times: user=1.28 sys=0.01, real=0.04 secs]
> Heap
>  garbage-first heap   total 32768000K, used 14908229K [0x00007f9b38000000,
> 0x00007f9b38807d00, 0x00007fa308000000)
>   region size 8192K, 1458 young (11943936K), 29 survivors (237568K)
>  Metaspace       used 43388K, capacity 44288K, committed 44492K, reserved
> 1089536K
>   class space    used 4241K, capacity 4487K, committed 4556K, reserved
> 1048576K
>
>
> Everything looks normal for this GC, but this GC happened at 8:38 and the
> JVM freeze happened at around 8:40.
> And this picture shows what the Java heap looked like in the last minutes
> before the node went down. (Please add four hours to match the log time,
> because of the different timezone.)
> Screen_Shot_2018-06-08_at_17.png
> <http://apache-ignite-users.70518.x6.nabble.com/file/t1346/Screen_Shot_2018-06-08_at_17.png>
>
> So my questions are: why did this node go down?
> What was the JVM doing in the last minutes before the node went down?
> Why aren't there any logs from those last minutes?
> And why isn't the segmentationPolicy RESTART_JVM working?
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>



-- 
Best regards,
Andrey V. Mashenkov
