Re: cluster hanged when client node put data into the caches

2017-12-12 Thread Nikolai Tikhonov
Hello,

You don't have to use synchronization for Ignite Cache API. This instance
already is threadsafe. I recommend to read the following documentation
page: https://apacheignite.readme.io/docs/jvm-and-system-tuning

On Sun, Dec 10, 2017 at 4:49 PM, Aurora <2565003...@qq.com> wrote:

> @Nikolai
> Your suggestion is right.
> "to-space exhausted" occured  in GC Logs on 3 out of 6 server nodes, as:
> Node 1:
> 2017-12-05T18:36:48.716+0800: 6870.705: [GC pause (G1 Evacuation Pause)
> (young), 0.4171797 secs]
>[Parallel Time: 412.1 ms, GC Workers: 13]
>   [GC Worker Start (ms): Min: 6870705.5, Avg: 6870705.7, Max:
> 6870705.8,
> Diff: 0.3]
>   [Ext Root Scanning (ms): Min: 0.6, Avg: 0.7, Max: 0.9, Diff: 0.4,
> Sum:
> 9.6]
>   [Update RS (ms): Min: 56.8, Avg: 58.6, Max: 60.2, Diff: 3.4, Sum:
> 761.8]
>  [Processed Buffers: Min: 69, Avg: 76.7, Max: 91, Diff: 22, Sum:
> 997]
>   [Scan RS (ms): Min: 1.4, Avg: 2.8, Max: 4.7, Diff: 3.3, Sum: 36.7]
>   [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1,
> Sum: 0.2]
>   [Object Copy (ms): Min: 349.1, Avg: 349.2, Max: 349.3, Diff: 0.2,
> Sum:
> 4540.2]
>   [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>  [Termination Attempts: Min: 1, Avg: 40.3, Max: 75, Diff: 74, Sum:
> 524]
>   [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.2, Sum:
> 1.7]
>   [GC Worker Total (ms): Min: 411.4, Avg: 411.6, Max: 411.8, Diff: 0.4,
> Sum: 5350.7]
>   [GC Worker End (ms): Min: 6871117.2, Avg: 6871117.3, Max: 6871117.4,
> Diff: 0.2]
>[Code Root Fixup: 0.1 ms]
>[Code Root Purge: 0.0 ms]
>[Clear CT: 0.9 ms]
>[Other: 4.1 ms]
>   [Choose CSet: 0.0 ms]
>   [Ref Proc: 0.4 ms]
>   [Ref Enq: 0.0 ms]
>   [Redirty Cards: 0.9 ms]
>   [Humongous Register: 0.1 ms]
>   [Humongous Reclaim: 0.0 ms]
>   [Free CSet: 1.9 ms]
>[Eden: 8280.0M(8280.0M)->0.0B(8032.0M) Survivors: 968.0M->968.0M Heap:
> 13.4G(16.0G)->5680.8M(16.0G)]
>  [Times: user=5.37 sys=0.00, real=0.42 secs]
> 2017-12-05T18:40:45.218+0800: 7107.207: [GC pause (G1 Evacuation Pause)
> (young) (to-space exhausted), 19.5379052 secs]
>[Parallel Time: 1.0 ms, GC Workers: 13]
>   [GC Worker Start (ms): Min: 7107207.2, Avg: 7107207.4, Max:
> 7107207.5,
> Diff: 0.3]
>   [Ext Root Scanning (ms): Min: 0.6, Avg: 0.7, Max: 0.9, Diff: 0.3,
> Sum:
> 9.1]
>   [Update RS (ms): Min: 143.9, Avg: 144.0, Max: 144.1, Diff: 0.2, Sum:
> 1871.6]
>  [Processed Buffers: Min: 78, Avg: 87.2, Max: 108, Diff: 30, Sum:
> 1134]
>   [Scan RS (ms): Min: 6.1, Avg: 6.2, Max: 6.3, Diff: 0.2, Sum: 80.3]
>   [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
> Sum: 0.2]
>   [Object Copy (ms): Min: 16474.9, Avg: 16481.7, Max: 16514.7, Diff:
> 39.8, Sum: 214262.3]
>   [Termination (ms): Min: 0.0, Avg: 32.9, Max: 39.6, Diff: 39.6, Sum:
> 427.8]
>  [Termination Attempts: Min: 1, Avg: 1.5, Max: 4, Diff: 3, Sum: 19]
>   [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum:
> 0.9]
>   [GC Worker Total (ms): Min: 16665.4, Avg: 16665.6, Max: 16665.7,
> Diff:
> 0.3, Sum: 216652.3]
>   [GC Worker End (ms): Min: 7123872.9, Avg: 7123872.9, Max: 7123873.0,
> Diff: 0.1]
>[Code Root Fixup: 0.1 ms]
>[Code Root Purge: 0.0 ms]
>[Clear CT: 0.8 ms]
>[Other: 2871.0 ms]
>   [Evacuation Failure: 2861.6 ms]
>   [Choose CSet: 0.0 ms]
>   [Ref Proc: 0.4 ms]
>   [Ref Enq: 0.0 ms]
>   [Redirty Cards: 6.9 ms]
>   [Humongous Register: 0.1 ms]
>   [Humongous Reclaim: 0.0 ms]
>   [Free CSet: 1.3 ms]
>[Eden: 8032.0M(8032.0M)->0.0B(328.0M) Survivors: 968.0M->1504.0M Heap:
> 13.5G(16.0G)->12.5G(16.0G)]
>  [Times: user=43.38 sys=7.36, real=19.53 secs]
> 2017-12-05T18:41:05.194+0800: 7127.183: [GC pause (G1 Evacuation Pause)
> (young) (initial-mark), 0.9987695 secs]
>[Parallel Time: 990.4 ms, GC Workers: 13]
>   [GC Worker Start (ms): Min: 7127183.8, Avg: 7127183.9, Max:
> 7127184.0,
> Diff: 0.2]
>   [Ext Root Scanning (ms): Min: 0.9, Avg: 1.1, Max: 1.3, Diff: 0.4,
> Sum:
> 14.3]
>   [Update RS (ms): Min: 563.4, Avg: 563.6, Max: 564.4, Diff: 1.0, Sum:
> 7327.2]
>  [Processed Buffers: Min: 1025, Avg: 1339.8, Max: 1575, Diff: 550,
> Sum: 17418]
>   [Scan RS (ms): Min: 15.0, Avg: 15.7, Max: 16.0, Diff: 1.0, Sum:
> 204.6]
>   [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
> Sum: 0.1]
>   [Object Copy (ms): Min: 409.2, Avg: 409.4, Max: 409.4, Diff: 0.2,
> Sum:
> 5321.6]
>   [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.6]
>  [Termination Attempts: Min: 1, Avg: 33.3, Max: 48, Diff: 47, Sum:
> 433]
>   [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum:
> 1.2]
>   [GC Worker Total (ms): Min: 989.8, Avg: 990.0, Max: 990.2, Diff: 0.4,
> Sum: 12869.6]
>   [GC Worker End (ms): Min: 7128173.8, Avg: 7128173.9, Max:

Re: cluster hanged when client node put data into the caches

2017-12-10 Thread Aurora
@Nikolai
Your suggestion is right.
"to-space exhausted" occured  in GC Logs on 3 out of 6 server nodes, as:
Node 1:
2017-12-05T18:36:48.716+0800: 6870.705: [GC pause (G1 Evacuation Pause)
(young), 0.4171797 secs]
   [Parallel Time: 412.1 ms, GC Workers: 13]
  [GC Worker Start (ms): Min: 6870705.5, Avg: 6870705.7, Max: 6870705.8,
Diff: 0.3]
  [Ext Root Scanning (ms): Min: 0.6, Avg: 0.7, Max: 0.9, Diff: 0.4, Sum:
9.6]
  [Update RS (ms): Min: 56.8, Avg: 58.6, Max: 60.2, Diff: 3.4, Sum:
761.8]
 [Processed Buffers: Min: 69, Avg: 76.7, Max: 91, Diff: 22, Sum:
997]
  [Scan RS (ms): Min: 1.4, Avg: 2.8, Max: 4.7, Diff: 3.3, Sum: 36.7]
  [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1,
Sum: 0.2]
  [Object Copy (ms): Min: 349.1, Avg: 349.2, Max: 349.3, Diff: 0.2, Sum:
4540.2]
  [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
 [Termination Attempts: Min: 1, Avg: 40.3, Max: 75, Diff: 74, Sum:
524]
  [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.2, Sum:
1.7]
  [GC Worker Total (ms): Min: 411.4, Avg: 411.6, Max: 411.8, Diff: 0.4,
Sum: 5350.7]
  [GC Worker End (ms): Min: 6871117.2, Avg: 6871117.3, Max: 6871117.4,
Diff: 0.2]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.9 ms]
   [Other: 4.1 ms]
  [Choose CSet: 0.0 ms]
  [Ref Proc: 0.4 ms]
  [Ref Enq: 0.0 ms]
  [Redirty Cards: 0.9 ms]
  [Humongous Register: 0.1 ms]
  [Humongous Reclaim: 0.0 ms]
  [Free CSet: 1.9 ms]
   [Eden: 8280.0M(8280.0M)->0.0B(8032.0M) Survivors: 968.0M->968.0M Heap:
13.4G(16.0G)->5680.8M(16.0G)]
 [Times: user=5.37 sys=0.00, real=0.42 secs] 
2017-12-05T18:40:45.218+0800: 7107.207: [GC pause (G1 Evacuation Pause)
(young) (to-space exhausted), 19.5379052 secs]
   [Parallel Time: 1.0 ms, GC Workers: 13]
  [GC Worker Start (ms): Min: 7107207.2, Avg: 7107207.4, Max: 7107207.5,
Diff: 0.3]
  [Ext Root Scanning (ms): Min: 0.6, Avg: 0.7, Max: 0.9, Diff: 0.3, Sum:
9.1]
  [Update RS (ms): Min: 143.9, Avg: 144.0, Max: 144.1, Diff: 0.2, Sum:
1871.6]
 [Processed Buffers: Min: 78, Avg: 87.2, Max: 108, Diff: 30, Sum:
1134]
  [Scan RS (ms): Min: 6.1, Avg: 6.2, Max: 6.3, Diff: 0.2, Sum: 80.3]
  [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
Sum: 0.2]
  [Object Copy (ms): Min: 16474.9, Avg: 16481.7, Max: 16514.7, Diff:
39.8, Sum: 214262.3]
  [Termination (ms): Min: 0.0, Avg: 32.9, Max: 39.6, Diff: 39.6, Sum:
427.8]
 [Termination Attempts: Min: 1, Avg: 1.5, Max: 4, Diff: 3, Sum: 19]
  [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum:
0.9]
  [GC Worker Total (ms): Min: 16665.4, Avg: 16665.6, Max: 16665.7, Diff:
0.3, Sum: 216652.3]
  [GC Worker End (ms): Min: 7123872.9, Avg: 7123872.9, Max: 7123873.0,
Diff: 0.1]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.8 ms]
   [Other: 2871.0 ms]
  [Evacuation Failure: 2861.6 ms]
  [Choose CSet: 0.0 ms]
  [Ref Proc: 0.4 ms]
  [Ref Enq: 0.0 ms]
  [Redirty Cards: 6.9 ms]
  [Humongous Register: 0.1 ms]
  [Humongous Reclaim: 0.0 ms]
  [Free CSet: 1.3 ms]
   [Eden: 8032.0M(8032.0M)->0.0B(328.0M) Survivors: 968.0M->1504.0M Heap:
13.5G(16.0G)->12.5G(16.0G)]
 [Times: user=43.38 sys=7.36, real=19.53 secs] 
2017-12-05T18:41:05.194+0800: 7127.183: [GC pause (G1 Evacuation Pause)
(young) (initial-mark), 0.9987695 secs]
   [Parallel Time: 990.4 ms, GC Workers: 13]
  [GC Worker Start (ms): Min: 7127183.8, Avg: 7127183.9, Max: 7127184.0,
Diff: 0.2]
  [Ext Root Scanning (ms): Min: 0.9, Avg: 1.1, Max: 1.3, Diff: 0.4, Sum:
14.3]
  [Update RS (ms): Min: 563.4, Avg: 563.6, Max: 564.4, Diff: 1.0, Sum:
7327.2]
 [Processed Buffers: Min: 1025, Avg: 1339.8, Max: 1575, Diff: 550,
Sum: 17418]
  [Scan RS (ms): Min: 15.0, Avg: 15.7, Max: 16.0, Diff: 1.0, Sum: 204.6]
  [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
Sum: 0.1]
  [Object Copy (ms): Min: 409.2, Avg: 409.4, Max: 409.4, Diff: 0.2, Sum:
5321.6]
  [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.6]
 [Termination Attempts: Min: 1, Avg: 33.3, Max: 48, Diff: 47, Sum:
433]
  [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum:
1.2]
  [GC Worker Total (ms): Min: 989.8, Avg: 990.0, Max: 990.2, Diff: 0.4,
Sum: 12869.6]
  [GC Worker End (ms): Min: 7128173.8, Avg: 7128173.9, Max: 7128174.0,
Diff: 0.2]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.3 ms]
   [Other: 7.9 ms]
  [Choose CSet: 0.0 ms]
  [Ref Proc: 0.4 ms]
  [Ref Enq: 0.0 ms]
  [Redirty Cards: 4.8 ms]
  [Humongous Register: 0.0 ms]
  [Humongous Reclaim: 0.0 ms]
  [Free CSet: 0.8 ms]
   [Eden: 328.0M(328.0M)->0.0B(336.0M) Survivors: 1504.0M->280.0M Heap:
12.9G(16.0G)->12.8G(16.0G)]
 [Times: user=12.93 sys=0.00, real=1.00 secs] 

Node 2:
2017-12-05T18:37:14.177+0800: 9497.100

Re: cluster hanged when client node put data into the caches

2017-12-07 Thread Aurora
thread dump:


Attaching to process ID 12837, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.65-b01
Deadlock Detection:

No deadlocks found.

- 12838 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9ce5355  Unsafe_Park + 0xf5
0x7fe3e5015754  * sun.misc.Unsafe.park(boolean, long) bci:0 (Interpreted
frame)
0x7fe3e500798d  *
java.util.concurrent.locks.LockSupport.park(java.lang.Object) bci:14
line:175 (Interpreted frame)
0x7fe3e500798d  *
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt()
bci:1 line:836 (Interpreted frame)
0x7fe3e5007160  *
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(int)
bci:72 line:997 (Interpreted frame)
0x7fe3e500798d  *
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int)
bci:24 line:1304 (Interpreted frame)
0x7fe3e500798d  * java.util.concurrent.CountDownLatch.await() bci:5
line:231 (Interpreted frame)
0x7fe3e500798d  *
org.apache.ignite.startup.cmdline.CommandLineStartup.main(java.lang.String[])
bci:309 line:335 (Interpreted frame)
0x7fe3e50004e7  
0x7fe3f98e1be6
_ZN9JavaCalls11call_helperEP9JavaValueP12methodHandleP17JavaCallArgumentsP6Thread
+ 0x1056
0x7fe3f9923182
_ZL17jni_invoke_staticP7JNIEnv_P9JavaValueP8_jobject11JNICallTypeP10_jmethodIDP18JNI_ArgumentPusherP6Thread
+ 0x362
0x7fe3f993fb8a  jni_CallStaticVoidMethod + 0x17a
0x7fe3fa23abcc  JavaMain + 0x80c
- 12839 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108
- 12840 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108
- 12841 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108
- 12842 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108
- 12843 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108
- 12844 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108
- 12845 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108
- 12846 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108
- 12847 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108
- 12848 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7
0x7fe3f9b2d746  _ZN7Monitor4waitEblb + 0x256
0x7fe3f9d2fbeb  _ZN10GangWorker4loopEv + 0x5b
0x7fe3f9b72b88  _ZL10java_startP6Thread + 0x108
- 12849 -
0x00311860b68c  __pthread_cond_wait + 0xcc
0x7fe3f9b2ce27  _ZN7Monitor5IWaitEP6Threadl + 0xf7

Re: cluster hanged when client node put data into the caches

2017-12-07 Thread Aurora
@Nikolai
thank you for your response.

some error in client node:
2017-12-06 13:35:03.721 -ERROR 9688 [order] com.a.u.IgniteAffinityHelper
 
: error:
org.apache.ignite.IgniteClientDisconnectedException: Client node
disconnected: null
at
org.apache.ignite.internal.GridKernalGatewayImpl.readLock(GridKernalGatewayImpl.java:92)
at org.apache.ignite.internal.IgniteKernal.guard(IgniteKernal.java:3709)
at org.apache.ignite.internal.IgniteKernal.cache(IgniteKernal.java:2745)
at com.a.util.IgniteAffinityHelper.save(IgniteAffinityHelper.java:161)
at com.a.util.IgniteClient.run(KafkaToIgnite.java:15)
at java.lang.Thread.run(Thread.java:745)





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: cluster hanged when client node put data into the caches

2017-12-07 Thread Nikolai Tikhonov
Hello,

Can you share thread dumps from all nodes? It need for further
investigation.

On Thu, Dec 7, 2017 at 7:41 PM, Aurora <2565003...@qq.com> wrote:

> Hi guys.
> our project was stunned into this critical issue.
> Ignite version 2.2, 10 nodes on 5 servers.
> Client node consumed data from Kafka, and put them into caches.
> Sometimes client node was disconnected suddenly, then the cluster with 10
> server nodes hanged,
> CPU usage on 5 servers reached almost 100%.
>
> I appreciate if you could give me a hint on solving this issue.
>
> thanks.
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


cluster hanged when client node put data into the caches

2017-12-07 Thread Aurora
Hi guys.
our project was stunned into this critical issue.
Ignite version 2.2, 10 nodes on 5 servers.
Client node consumed data from Kafka, and put them into caches.
Sometimes client node was disconnected suddenly, then the cluster with 10
server nodes hanged,
CPU usage on 5 servers reached almost 100%.

I appreciate if you could give me a hint on solving this issue.

thanks.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/