Re: cluster hanged when client node put data into the caches
Hello,

You don't have to synchronize access to the Ignite cache API yourself: the cache instance is already thread-safe. I recommend reading the following documentation page: https://apacheignite.readme.io/docs/jvm-and-system-tuning

On Sun, Dec 10, 2017 at 4:49 PM, Aurora <2565003...@qq.com> wrote:
> @Nikolai
> Your suggestion is right. "to-space exhausted" occurred in the GC logs on
> 3 out of 6 server nodes.
> [GC logs snipped]
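The thread-safety advice above means one cache instance can be shared by many writer threads with no synchronized blocks around put(). A minimal self-contained sketch of that pattern, using a ConcurrentHashMap as a stand-in for an IgniteCache (which gives the same thread-safety guarantee) so it runs without an Ignite dependency:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentPuts {
    // Stand-in for a shared IgniteCache<Integer, String>: one thread-safe instance.
    static final Map<Integer, String> cache = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            final int base = t * 1000;
            // No external locking around put(): the instance itself is thread-safe.
            pool.submit(() -> {
                for (int i = 0; i < 1000; i++)
                    cache.put(base + i, "value-" + (base + i));
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println(cache.size()); // 4 threads x 1000 distinct keys
    }
}
```

The class name and key layout are illustrative; the point is only that the writers share one instance without wrapping calls in synchronized.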
Re: cluster hanged when client node put data into the caches
@Nikolai
Your suggestion is right. "to-space exhausted" occurred in the GC logs on 3 out of 6 server nodes, as follows.

Node 1:

2017-12-05T18:36:48.716+0800: 6870.705: [GC pause (G1 Evacuation Pause) (young), 0.4171797 secs]
   [Parallel Time: 412.1 ms, GC Workers: 13]
      [GC Worker Start (ms): Min: 6870705.5, Avg: 6870705.7, Max: 6870705.8, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 0.6, Avg: 0.7, Max: 0.9, Diff: 0.4, Sum: 9.6]
      [Update RS (ms): Min: 56.8, Avg: 58.6, Max: 60.2, Diff: 3.4, Sum: 761.8]
         [Processed Buffers: Min: 69, Avg: 76.7, Max: 91, Diff: 22, Sum: 997]
      [Scan RS (ms): Min: 1.4, Avg: 2.8, Max: 4.7, Diff: 3.3, Sum: 36.7]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.2]
      [Object Copy (ms): Min: 349.1, Avg: 349.2, Max: 349.3, Diff: 0.2, Sum: 4540.2]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
         [Termination Attempts: Min: 1, Avg: 40.3, Max: 75, Diff: 74, Sum: 524]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.2, Sum: 1.7]
      [GC Worker Total (ms): Min: 411.4, Avg: 411.6, Max: 411.8, Diff: 0.4, Sum: 5350.7]
      [GC Worker End (ms): Min: 6871117.2, Avg: 6871117.3, Max: 6871117.4, Diff: 0.2]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.9 ms]
   [Other: 4.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.4 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.9 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 1.9 ms]
   [Eden: 8280.0M(8280.0M)->0.0B(8032.0M) Survivors: 968.0M->968.0M Heap: 13.4G(16.0G)->5680.8M(16.0G)]
 [Times: user=5.37 sys=0.00, real=0.42 secs]
2017-12-05T18:40:45.218+0800: 7107.207: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 19.5379052 secs]
   [Parallel Time: 1.0 ms, GC Workers: 13]
      [GC Worker Start (ms): Min: 7107207.2, Avg: 7107207.4, Max: 7107207.5, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 0.6, Avg: 0.7, Max: 0.9, Diff: 0.3, Sum: 9.1]
      [Update RS (ms): Min: 143.9, Avg: 144.0, Max: 144.1, Diff: 0.2, Sum: 1871.6]
         [Processed Buffers: Min: 78, Avg: 87.2, Max: 108, Diff: 30, Sum: 1134]
      [Scan RS (ms): Min: 6.1, Avg: 6.2, Max: 6.3, Diff: 0.2, Sum: 80.3]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
      [Object Copy (ms): Min: 16474.9, Avg: 16481.7, Max: 16514.7, Diff: 39.8, Sum: 214262.3]
      [Termination (ms): Min: 0.0, Avg: 32.9, Max: 39.6, Diff: 39.6, Sum: 427.8]
         [Termination Attempts: Min: 1, Avg: 1.5, Max: 4, Diff: 3, Sum: 19]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum: 0.9]
      [GC Worker Total (ms): Min: 16665.4, Avg: 16665.6, Max: 16665.7, Diff: 0.3, Sum: 216652.3]
      [GC Worker End (ms): Min: 7123872.9, Avg: 7123872.9, Max: 7123873.0, Diff: 0.1]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.8 ms]
   [Other: 2871.0 ms]
      [Evacuation Failure: 2861.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.4 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 6.9 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 1.3 ms]
   [Eden: 8032.0M(8032.0M)->0.0B(328.0M) Survivors: 968.0M->1504.0M Heap: 13.5G(16.0G)->12.5G(16.0G)]
 [Times: user=43.38 sys=7.36, real=19.53 secs]
2017-12-05T18:41:05.194+0800: 7127.183: [GC pause (G1 Evacuation Pause) (young) (initial-mark), 0.9987695 secs]
   [Parallel Time: 990.4 ms, GC Workers: 13]
      [GC Worker Start (ms): Min: 7127183.8, Avg: 7127183.9, Max: 7127184.0, Diff: 0.2]
      [Ext Root Scanning (ms): Min: 0.9, Avg: 1.1, Max: 1.3, Diff: 0.4, Sum: 14.3]
      [Update RS (ms): Min: 563.4, Avg: 563.6, Max: 564.4, Diff: 1.0, Sum: 7327.2]
         [Processed Buffers: Min: 1025, Avg: 1339.8, Max: 1575, Diff: 550, Sum: 17418]
      [Scan RS (ms): Min: 15.0, Avg: 15.7, Max: 16.0, Diff: 1.0, Sum: 204.6]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 409.2, Avg: 409.4, Max: 409.4, Diff: 0.2, Sum: 5321.6]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.6]
         [Termination Attempts: Min: 1, Avg: 33.3, Max: 48, Diff: 47, Sum: 433]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 1.2]
      [GC Worker Total (ms): Min: 989.8, Avg: 990.0, Max: 990.2, Diff: 0.4, Sum: 12869.6]
      [GC Worker End (ms): Min: 7128173.8, Avg: 7128173.9, Max: 7128174.0, Diff: 0.2]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.3 ms]
   [Other: 7.9 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.4 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 4.8 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.8 ms]
   [Eden: 328.0M(328.0M)->0.0B(336.0M) Survivors: 1504.0M->280.0M Heap: 12.9G(16.0G)->12.8G(16.0G)]
 [Times: user=12.93 sys=0.00, real=1.00 secs]

Node 2:

2017-12-05T18:37:14.177+0800: 9497.100
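A note on reading the logs above: "(to-space exhausted)" plus "[Evacuation Failure: 2861.6 ms]" means G1 ran out of free regions to evacuate live objects into mid-collection, which is what turns a ~0.4 s young pause into a 19.5 s one. The JVM tuning page linked earlier suggests the usual mitigations: keep more heap head-room reserved for evacuation and start the concurrent marking cycle earlier. A hedged example of flags to experiment with (the values are illustrative starting points, not prescriptions; the heap size matches the 16 G seen in the logs):

```
# G1 options commonly tried when logs show "to-space exhausted"; tune per workload.
-Xms16g -Xmx16g
-XX:+UseG1GC
-XX:G1ReservePercent=25                  # reserve more free regions for evacuation (default 10)
-XX:InitiatingHeapOccupancyPercent=35    # start concurrent marking earlier (default 45)
-XX:MaxGCPauseMillis=200                 # pause-time goal (best effort)
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log
```

If the pauses exceed the discovery/failure-detection timeouts, nodes get kicked from the topology, which matches the "cluster hanged after client disconnect" symptom in this thread.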
Re: cluster hanged when client node put data into the caches
thread dump:

Attaching to process ID 12837, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.65-b01
Deadlock Detection:

No deadlocks found.

----------------- 12838 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9ce5355  Unsafe_Park + 0xf5
0x7fe3e5015754  * sun.misc.Unsafe.park(boolean, long) bci:0 (Interpreted frame)
0x7fe3e500798d  * java.util.concurrent.locks.LockSupport.park(java.lang.Object) bci:14 line:175 (Interpreted frame)
0x7fe3e500798d  * java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() bci:1 line:836 (Interpreted frame)
0x7fe3e5007160  * java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(int) bci:72 line:997 (Interpreted frame)
0x7fe3e500798d  * java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int) bci:24 line:1304 (Interpreted frame)
0x7fe3e500798d  * java.util.concurrent.CountDownLatch.await() bci:5 line:231 (Interpreted frame)
0x7fe3e500798d  * org.apache.ignite.startup.cmdline.CommandLineStartup.main(java.lang.String[]) bci:309 line:335 (Interpreted frame)
0x7fe3e50004e7
0x7fe3f98e1be6  _ZN9JavaCalls11call_helperEP9JavaValueP12methodHandleP17JavaCallArgumentsP6Thread + 0x1056
0x7fe3f9923182  _ZL17jni_invoke_staticP7JNIEnv_P9JavaValueP8_jobject11JNICallTypeP10_jmethodIDP18JNI_ArgumentPusherP6Thread + 0x362
0x7fe3f993fb8a  jni_CallStaticVoidMethod + 0x17a
0x7fe3fa23abcc  JavaMain + 0x80c

----------------- 12839 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108

----------------- 12840 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108

----------------- 12841 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108

----------------- 12842 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108

----------------- 12843 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108

----------------- 12844 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108

----------------- 12845 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108

----------------- 12846 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108

----------------- 12847 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108

----------------- 12848 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108

----------------- 12849 -----------------
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
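For readers of the dump above: "No deadlocks found" with the main thread parked on a CountDownLatch inside CommandLineStartup is normal for a running Ignite node, and the numbered native threads waiting in GangWorker::loop are GC worker threads. The same deadlock check the tool performed can be run programmatically from inside any JVM via the standard ThreadMXBean; a minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    /** Ids of threads caught in a monitor / ownable-synchronizer deadlock, or null if none. */
    public static long[] findDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Same cycle detection jstack reports as "Deadlock Detection".
        return threads.findDeadlockedThreads();
    }

    public static void main(String[] args) {
        long[] deadlocked = findDeadlocks();
        System.out.println(deadlocked == null
            ? "No deadlocks found."
            : deadlocked.length + " deadlocked threads");
    }
}
```

This can be wired into a watchdog or a health-check endpoint so a hung node reports on itself instead of requiring an external attach.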
Re: cluster hanged when client node put data into the caches
@Nikolai, thank you for your response. Some errors on the client node:

2017-12-06 13:35:03.721 -ERROR 9688 [order] com.a.u.IgniteAffinityHelper : error:
org.apache.ignite.IgniteClientDisconnectedException: Client node disconnected: null
        at org.apache.ignite.internal.GridKernalGatewayImpl.readLock(GridKernalGatewayImpl.java:92)
        at org.apache.ignite.internal.IgniteKernal.guard(IgniteKernal.java:3709)
        at org.apache.ignite.internal.IgniteKernal.cache(IgniteKernal.java:2745)
        at com.a.util.IgniteAffinityHelper.save(IgniteAffinityHelper.java:161)
        at com.a.util.IgniteClient.run(KafkaToIgnite.java:15)
        at java.lang.Thread.run(Thread.java:745)

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
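An editorial note on the exception above: IgniteClientDisconnectedException exposes a reconnect future (reconnectFuture()), and the documented pattern is to catch the exception, wait on that future, and retry the cache operation. A self-contained sketch of that retry loop follows; DisconnectedException and its reconnectFuture field are hypothetical stand-ins for the Ignite types so the sketch runs without Ignite on the classpath:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public class ReconnectRetry {
    /** Hypothetical stand-in for IgniteClientDisconnectedException. */
    static class DisconnectedException extends RuntimeException {
        final CompletableFuture<Void> reconnectFuture;
        DisconnectedException(CompletableFuture<Void> f) { this.reconnectFuture = f; }
    }

    /** Run op; on disconnect, block until the client has rejoined, then retry once. */
    static <T> T withReconnect(Supplier<T> op) {
        try {
            return op.get();
        } catch (DisconnectedException e) {
            e.reconnectFuture.join(); // completes when the client reconnects
            return op.get();          // retry the cache operation
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = withReconnect(() -> {
            if (++calls[0] == 1) {
                // Simulate a disconnect whose reconnect future is already complete.
                throw new DisconnectedException(CompletableFuture.completedFuture(null));
            }
            return "saved";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In real code the Supplier would wrap the cache put in IgniteAffinityHelper.save, and a production version would bound the number of retries.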
Re: cluster hanged when client node put data into the caches
Hello,

Can you share thread dumps from all nodes? They are needed for further investigation.

On Thu, Dec 7, 2017 at 7:41 PM, Aurora <2565003...@qq.com> wrote:
> Hi guys,
> Our project ran into this critical issue.
> Ignite version 2.2, 10 nodes on 5 servers.
> A client node consumed data from Kafka and put it into caches.
> Sometimes the client node was disconnected suddenly, and then the cluster
> of 10 server nodes hung; CPU usage on the 5 servers reached almost 100%.
>
> I would appreciate a hint on solving this issue.
>
> Thanks.
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
cluster hanged when client node put data into the caches
Hi guys,

Our project ran into this critical issue.
Ignite version 2.2, 10 nodes on 5 servers.
A client node consumed data from Kafka and put it into caches.
Sometimes the client node was disconnected suddenly, and then the cluster of 10 server nodes hung; CPU usage on the 5 servers reached almost 100%.

I would appreciate a hint on solving this issue.

Thanks.

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
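Editorial note on the write path described above: one put() per Kafka record is the most allocation- and network-heavy option, which feeds exactly the GC pressure discussed later in this thread; batching into bulk putAll() calls (or, in Ignite itself, an IgniteDataStreamer) is the usual fix. A self-contained sketch of the batching pattern, with a plain Queue standing in for the Kafka consumer and a Map standing in for the cache:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

public class BatchingLoader {
    static final int BATCH_SIZE = 512; // illustrative; tune to record size

    /** Drain records into the cache in batches instead of one put() per record. */
    static int drain(Queue<Map.Entry<String, String>> records, Map<String, String> cache) {
        Map<String, String> batch = new HashMap<>();
        int flushed = 0;
        Map.Entry<String, String> rec;
        while ((rec = records.poll()) != null) {
            batch.put(rec.getKey(), rec.getValue());
            if (batch.size() >= BATCH_SIZE) {
                cache.putAll(batch);   // one bulk call instead of BATCH_SIZE calls
                flushed += batch.size();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {        // flush the tail
            cache.putAll(batch);
            flushed += batch.size();
        }
        return flushed;
    }

    public static void main(String[] args) {
        Queue<Map.Entry<String, String>> records = new ArrayDeque<>();
        for (int i = 0; i < 1300; i++)
            records.add(Map.entry("key-" + i, "value-" + i));
        Map<String, String> cache = new ConcurrentHashMap<>();
        System.out.println(drain(records, cache) + " entries flushed");
    }
}
```

All names here are illustrative; with Ignite the Map would be an IgniteCache and the bulk call would be cache.putAll or a data streamer's addData.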