Re: Ignite Node failure - Node out of topology (SEGMENTED)

2020-04-14 Thread VeenaMithare
Hi Evgenii,

Thank you for the reply and suggestion.

regards,
Veena.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Ignite Node failure - Node out of topology (SEGMENTED)

2020-04-14 Thread Evgenii Zhuravlev
Hi,

The segmentation plugin won't help with the issue itself. If you have a long GC
pause, it means the node is unresponsive for all of that time. If a GC pause is
longer than 10 seconds, the node will be dropped from the cluster (by default).
If you have long GC pauses, your load is probably too big for your
configuration, and it makes sense to increase the heap or add more nodes to the
cluster.
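
For reference, a common way to give each server node more heap is via its JVM
options (the sizes below are purely illustrative; pick values that fit your
data set and hardware):

-Xms8g -Xmx8g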

Evgenii

Tue, 14 Apr 2020 at 03:50, VeenaMithare :

> Hi Dmitry,
>
> Would having a segmentation plugin help to resolve segmentation caused by GC
> pauses?
>
> Or is the best resolution to fix the long GC pauses themselves so that they
> stay within the failure detection timeout?
>
> regards,
> Veena.
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


RE: Ignite Node failure - Node out of topology (SEGMENTED)

2020-04-14 Thread VeenaMithare
Hi Dmitry,

Would having a segmentation plugin help to resolve segmentation caused by GC
pauses?

Or is the best resolution to fix the long GC pauses themselves so that they
stay within the failure detection timeout?

regards,
Veena.




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Ignite Node failure - Node out of topology (SEGMENTED)

2018-08-27 Thread luqmanahmad
See [1] for a free network segmentation plugin.

[1]  https://github.com/luqmanahmad/ignite-plugins
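
For context, a segmentation resolver is just a small check a node can run
against some external reference point. Below is a rough sketch of what one
might look like (the witness address and timeout are illustrative assumptions;
stock Ignite ships the SegmentationResolver interface, and a plugin such as the
one linked above is what actually invokes resolvers):

import java.io.IOException;
import java.net.InetAddress;
import org.apache.ignite.IgniteCheckedException;
import org.apache.ignite.plugin.segmentation.SegmentationResolver;

// Considers the local segment valid while a fixed "witness" host is reachable.
public class ReachabilityResolver implements SegmentationResolver {
    @Override public boolean isValidSegment() throws IgniteCheckedException {
        try {
            return InetAddress.getByName("10.40.173.1").isReachable(2_000);
        }
        catch (IOException e) {
            throw new IgniteCheckedException("Reachability check failed", e);
        }
    }
}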
  



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


RE: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-27 Thread dkarachentsev
Naresh,

GC logs show not only GC pauses but all application stop (safepoint) pauses as
well. Try these parameters:
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime 
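
If it helps, the same flags can be pointed at a file, e.g.
-Xloggc:/var/log/ignite/gc.log (the path is just an example). In the resulting
log, lines of the form "Total time for which application threads were stopped:
... seconds" show every stop-the-world pause, not only GC.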

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


RE: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-26 Thread naresh.goty
Thanks for the recommendation, but we already identified and addressed the
issues with GC pauses in the JVM, and now we cannot find any long GC activity
at the time of the node failure due to network segmentation (please find
attached a screenshot of GC activity from the Dynatrace agent).

From the screenshot, there is only young-generation GC collection, and even
that is < 100 ms.

We can still enable GC logs, but I strongly suspect the issue is beyond JVM
pauses.

Thanks,
Naresh

 



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


RE: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-26 Thread dkarachentsev
Hi Naresh,

Actually, any JVM process hang can lead to segmentation. If a node is
unresponsive for longer than failureDetectionTimeout, it will be kicked out of
the cluster to prevent performance degradation across the whole grid.

It works according to the following scenario. Let's say we have 3 nodes in a
ring: n1 -> n2 -> n3. Discovery messages, along with metrics and connection
checks, travel around the ring at a predefined interval. Node 2 starts
experiencing issues, such as a GC pause or an OS failure, that stop the
process. During that time node 1 is unable to deliver a message to n2 (it
doesn't receive an ack). n1 waits for failureDetectionTimeout and then
establishes a connection to n3, so the ring becomes n1 -> n3, with n2 no
longer connected.

The cluster treats n2 as failed. When n2 comes back, it tries to connect to n3
and send a message across the ring, whereupon it learns that it is out of the
grid. For n2 that means it was segmented, and the best thing it can do is
stop.

To check whether there were long JVM or system pauses, you may enable GC logs.
If the pauses are longer than failureDetectionTimeout, the node will be
segmented.

The best way is to solve the pauses, but as a workaround you can increase the
timeout.
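
As a minimal sketch of that workaround in code (the 30-second value is only an
illustration; it should exceed your worst observed pause):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StartServerNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // Default failure detection timeout is 10 s; a node paused longer than this is dropped.
        cfg.setFailureDetectionTimeout(30_000);
        Ignition.start(cfg);
    }
}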

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


RE: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-25 Thread naresh.goty
Hi Dmitry,

We are again seeing a segmentation failure on one of the nodes in our prod env.
This time we did not run jmap, but the node still failed.

-> CPU, memory utilization and network are in an optimal state.

We observed page faults in memory at the same time as the segmentation
failure, as reported by the Dynatrace agent (screenshot attached).

Can you please confirm whether page faults could result in network
segmentation on a node? We do see page faults on nodes, but they do not always
result in a segmentation failure.


Logs from Failed Agent:

INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=3f568bb8, name=delivery, uptime=24:31:12.859]
^-- H/N/C [hosts=9, nodes=9, CPUs=18]
^-- CPU [cur=7%, avg=9.06%, GC=0%]
^-- PageMemory [pages=30244]
^-- Heap [used=3184MB, free=22.09%, comm=4087MB]
^-- Non heap [used=213MB, free=-1%, comm=222MB]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=5, qSize=0]
^-- Outbound messages queue [size=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4879, reusePages=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: TCP discovery accepted incoming connection [rmtAddr=/10.40.173.14,
rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: TCP discovery spawning a new thread for connection
[rmtAddr=/10.40.173.14, rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Started serving remote node connection [rmtAddr=/10.40.173.14:33762,
rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node is out of topology (probably, due to short-time network
problems).
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Local node SEGMENTED: TcpDiscoveryNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54,
intOrder=32, lastExchangeTime=152978
6434361, loc=true, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished serving remote node connection [rmtAddr=/10.40.173.14:33762,
rmtPort=33762
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished serving remote node connection [rmtAddr=/10.40.173.41:52584,
rmtPort=52584
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Stopping local node according to configured segmentation policy.
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1529050123714,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Command protocol successfully stopped: TCP binary
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=56, servers=8, clients=0, CPUs=16, heap=28.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=a26de809-dde1-41b8-87a3-d5576851a0be, addrs=[10.40.173.56, 127.0.0.1],
sockAddrs=[/10.40.173.56:47500, /127.0.0.1:47500], discPort=47500, order=23,
intOrder=16, lastExchangeTime=1529050123735,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=57, servers=7, clients=0, CPUs=14, heap=26.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=910ea19f-af5c-4745-a035-b24a3bb48206, addrs=[10.40.173.88, 127.0.0.1],
sockAddrs=[/10.40.173.88:47500, /127.0.0.1:47500], discPort=47500, order=25,
intOrder=17, lastExchangeTime=1529050123735,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=58, servers=6, clients=0, CPUs=12, heap=24.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=17f3ba9c-e32e-47e4-9ca2-136338d8c4ac, addrs=[10.40.173.39, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.39:47500], discPort=47500, order=30,
intOrder=19, lastExchangeTime=1529050123735, loc=false,

RE: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-18 Thread naresh.goty
Thanks Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


RE: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-13 Thread dkarachentsev
Hi Naresh,

The recommendation will be the same: increase failureDetectionTimeout until
nodes stop segmenting, or use gdb (or remove the "live" option from the jmap
command so it doesn't trigger a full GC).
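
For illustration, the two jmap forms differ only in the "live" option (pid is a
placeholder):

jmap -dump:live,format=b,file=heap.hprof <pid>   (forces a full GC before dumping)
jmap -dump:format=b,file=heap.hprof <pid>        (dumps without forcing a full GC)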

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


RE: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-13 Thread naresh.goty
Thanks, Stan.

We have enabled actionable alerts to generate dumps when memory utilization
reaches a certain threshold on all cache nodes. Whenever the alert is
triggered, the cache node gets segmented. So, essentially, we cannot take
dumps on a live node. Even increasing the socket timeout may not work, as the
heap dump process takes minutes to complete and the node is unresponsive
during that time. How do we get around this issue? (We could not evaluate gdb
yet.)

Regards,
Naresh





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


RE: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-11 Thread Stanislav Lukyanov
Quick googling suggests using gdb:
https://stackoverflow.com/a/37375351/4153863

Again, Ignite doesn’t play any role here – whatever works for any Java 
application should work for Ignite as well.
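
Roughly, the gdb-based approach is to snapshot the process with gcore and then
extract the heap from the core file offline, so the JVM is only paused for the
snapshot (pid and paths below are placeholders):

gcore -o /tmp/ignite-core <pid>
jmap -dump:format=b,file=heap.hprof $JAVA_HOME/bin/java /tmp/ignite-core.<pid>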

Stan

From: naresh.goty
Sent: 11 June 2018, 20:56
To: user@ignite.apache.org
Subject: Re: Ignite Node failure - Node out of topology (SEGMENTED)

Hi All,

We found that when the jmap utility is triggered to generate heap dumps on an
application node, a NODE_SEGMENTATION event is fired from that node. Can
someone please let us know how to safely take heap dumps on a live node, with
the Ignite cache running in embedded mode, without crashing the node due to a
segmentation failure?

Regards,
Naresh



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/



Re: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-11 Thread naresh.goty
Hi All,

We found that when the jmap utility is triggered to generate heap dumps on an
application node, a NODE_SEGMENTATION event is fired from that node. Can
someone please let us know how to safely take heap dumps on a live node, with
the Ignite cache running in embedded mode, without crashing the node due to a
segmentation failure?

Regards,
Naresh



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


RE: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-11 Thread Stanislav Lukyanov
I see messages in the log that are not from Ignite, but which also suggest that 
there is some network issue:

2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request


In any case, given that there were no messages in the logs for ~35 minutes, it
is unlikely that it is an Ignite issue – Ignite would at least print metrics or
errors if it was running and getting any CPU time.

Stan

From: naresh.goty
Sent: 9 June 2018, 18:16
To: user@ignite.apache.org
Subject: Re: Ignite Node failure - Node out of topology (SEGMENTED)

We are still seeing the NODE SEGMENTATION issue on one of the nodes in our
production environment even after the JVM option
-Djava.net.preferIPv4Stack=true is enabled.

We don't see any activity reported in the logs for a period of ~30 min after
the node failed. The logs below are from the failed node; it can be observed
that up to this timestamp (9:43:25) the node was up and running, and after
that there are no messages until 10:19:59 in the Catalina logs. Though the
node failed at 09:44, the segmentation was reported at 10:19 AM.

Right before the node failure, the metrics clearly show that CPU and memory
utilization were very low. Long GC pauses are also not an issue.
=
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=9bbe0362, name=delivery, uptime=56:24:59.907]
^-- H/N/C [hosts=9, nodes=9, CPUs=18]
^-- CPU [cur=1.67%, avg=2.71%, GC=0%]
^-- PageMemory [pages=43003]
^-- Heap [used=605MB, free=85.11%, comm=4065MB]
^-- Non heap [used=204MB, free=-1%, comm=213MB]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=6, qSize=0]
^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
2018-06-09 10:19:59 [a9fc3882] warning [java  ] ... last message repeated 1
time ...
2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request
Jun 09, 2018 10:19:59 AM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Failed to send message: class o.a.i.IgniteCheckedException: Failed
to send message (connection was closed): GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0,
bytesRcvd=24290335228, bytesSent=51080178562, bytesRcvd0=25800,
bytesSent0=183862, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-0, igniteInstanceName=delivery,
finished=false, hashCode=1045175627, interrupted=false,
runner=grid-nio-worker-tcp-comm-0-#25%delivery%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/10.40.173.21:50320,
rmtAddr=/10.40.173.14:47100, createTime=1528334304232,
closeTime=1528537460337, bytesSent=13308284327, bytesRcvd=12937081481,
bytesSent0=52340, bytesRcvd0=7812, sndSchedTime=1528334304232

Re: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-09 Thread naresh.goty
We are still seeing the NODE SEGMENTATION issue on one of the nodes in our
production environment even after the JVM option
-Djava.net.preferIPv4Stack=true is enabled.

We don't see any activity reported in the logs for a period of ~30 min after
the node failed. The logs below are from the failed node; it can be observed
that up to this timestamp (9:43:25) the node was up and running, and after
that there are no messages until 10:19:59 in the Catalina logs. Though the
node failed at 09:44, the segmentation was reported at 10:19 AM.

Right before the node failure, the metrics clearly show that CPU and memory
utilization were very low. Long GC pauses are also not an issue.
=
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=9bbe0362, name=delivery, uptime=56:24:59.907]
^-- H/N/C [hosts=9, nodes=9, CPUs=18]
^-- CPU [cur=1.67%, avg=2.71%, GC=0%]
^-- PageMemory [pages=43003]
^-- Heap [used=605MB, free=85.11%, comm=4065MB]
^-- Non heap [used=204MB, free=-1%, comm=213MB]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=6, qSize=0]
^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
2018-06-09 10:19:59 [a9fc3882] warning [java  ] ... last message repeated 1
time ...
2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request
Jun 09, 2018 10:19:59 AM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Failed to send message: class o.a.i.IgniteCheckedException: Failed
to send message (connection was closed): GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0,
bytesRcvd=24290335228, bytesSent=51080178562, bytesRcvd0=25800,
bytesSent0=183862, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-0, igniteInstanceName=delivery,
finished=false, hashCode=1045175627, interrupted=false,
runner=grid-nio-worker-tcp-comm-0-#25%delivery%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/10.40.173.21:50320,
rmtAddr=/10.40.173.14:47100, createTime=1528334304232,
closeTime=1528537460337, bytesSent=13308284327, bytesRcvd=12937081481,
bytesSent0=52340, bytesRcvd0=7812, sndSchedTime=1528334304232,
lastSndTime=1528537460337, lastRcvTime=1528537460337, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@7d2bc944, directMode=true],
GridConnectionBytesVerifyFilter], accepted=false]]


From one of the active nodes, it was reported that the other node (above)
failed immediately at 09:44 (as shown in the logs below):

Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=b392f3c6, name=delivery, uptime=56:18:06.778]
^-- H/N/C [hosts=9, nodes=9, CPUs=18]
^-- CPU [cur=9.17%, avg=9%, GC=0%]
^-- PageMemory [pages=41038]
^-- Heap 

Re: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-07 Thread Andrey Mashenkov
Hi,

It seems there is a bug with IPv6 usage [1]; it has to be investigated.
There is also a discussion [2].

[1] https://issues.apache.org/jira/browse/IGNITE-6503
[2]
http://apache-ignite-developers.2346864.n4.nabble.com/Issues-if-Djava-net-preferIPv4Stack-true-is-not-set-td22372.html

On Wed, Jun 6, 2018 at 9:24 PM, naresh.goty  wrote:

> Thanks.
> We have enabled the IPv4 JVM option in our non-production environment and
> found no segmentation issues reported. Our main concern is that the issue
> happens only in production, and we are very much interested in finding the
> real root cause (we can rule out GC pauses, CPU spikes and network latencies,
> as none of them is the cause).
>
> 1) Please give us any useful tips for identifying the source of the problem,
> so that we can avoid it altogether instead of taking remediation steps (such
> as restarting the JVM) when the issue happens.
>
> 2) Do let us know if any timeout configurations should be increased to
> mitigate the problem.
>
> Regards
> Naresh
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>



-- 
Best regards,
Andrey V. Mashenkov


Re: Ignite Node failure - Node out of topology (SEGMENTED)

2018-06-06 Thread naresh.goty
Thanks.
We have enabled the IPv4 JVM option in our non-production environment and
found no segmentation issues reported. Our main concern is that the issue
happens only in production, and we are very much interested in finding the
real root cause (we can rule out GC pauses, CPU spikes and network latencies,
as none of them is the cause).

1) Please give us any useful tips for identifying the source of the problem,
so that we can avoid it altogether instead of taking remediation steps (such
as restarting the JVM) when the issue happens.

2) Do let us know if any timeout configurations should be increased to
mitigate the problem.

Regards
Naresh





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Ignite Node failure - Node out of topology (SEGMENTED)

2018-04-27 Thread Andrey Mashenkov
Hi,

Try to disable IPv6 on all nodes via the JVM option
-Djava.net.preferIPv4Stack=true [1], as using both IPv4 and IPv6 can cause
grid segmentation.


[1]
https://stackoverflow.com/questions/11850655/how-can-i-disable-ipv6-stack-use-for-ipv4-ips-on-jre
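
Since the nodes in this thread run embedded in a Tomcat-based application, one
place to add the flag (assuming a standard Tomcat layout) is bin/setenv.sh:

export CATALINA_OPTS="$CATALINA_OPTS -Djava.net.preferIPv4Stack=true"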

On Fri, Apr 27, 2018 at 8:52 AM, naresh.goty  wrote:

> Hi,
>
> We are running apache ignite (v2.3) in embedded mode in a java based
> application with 9 node cluster in our production environment in AWS cloud
> infrastructure.
>
> Most of the time we don't see any issue with node communication, but
> occasionally we find one of the nodes failing with the error message below.
>
> WARNING: Node is out of topology (probably, due to short-time network
> problems).
> Apr 16, 2018 5:19:24 AM org.apache.ignite.logger.java.JavaLogger warning
> WARNING: Local node SEGMENTED: TcpDiscoveryNode
> [id=13b6f3ec-a759-408f-9d3f-62f2381c649b, addrs=[0:0:0:0:0:0:0:1%lo,
> 10.40.173.93, 127.0.0.1], sockAddrs=[/10.40.173.93:47500,
> /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=157,
> intOrder=83, lastExchangeTime=1523855964541, loc=true,
> ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
>
> Our analysis so far:
> 1) We are constantly monitoring the GC activities of the node, and can
> confirm that there is no long GC pauses occurred during the time frame of
> the node failure.
>
> 2) There is also no abnormal network spikes reported in AWS instance
> monitors as well.
>
> 3) CPU utilization on the affected node is low. No blocked threads reported
> from thread dumps.
>
> Attached Tomcat Logs of two nodes from the cluster of 9
> TomcatLogs_Node1: provided log details of Network Segmentation failure
> TomcatLogs_Node2: other node provided log info of discovery message
> ApplicationLogs_Node1: Detailed logs of Node stopping exceptions
> Two thread dumps
>
> Could someone provide any insights on how to trace the root cause of this
> issue and prevent it from happening again?
>
> Thanks
> Naresh
>
>
> TomcatLog_Node1.txt
>  t1286/TomcatLog_Node1.txt>
> TomcatLog_Node2.txt
>  t1286/TomcatLog_Node2.txt>
> ApplicationLog_Node1.txt
>  t1286/ApplicationLog_Node1.txt>
> threaddump_1.threaddump_1
>  t1286/threaddump_1.threaddump_1>
> threaddump_2.threaddump_2
>  t1286/threaddump_2.threaddump_2>
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>



-- 
Best regards,
Andrey V. Mashenkov