Hello!

[04:45:53,179][WARNING][tcp-disco-msg-worker-#2%StaticGrid_NG_Dev%][TcpDiscoverySpi] Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=10000, rmtAddr=/10.201.30.64:47603, rmtPort=47603]
[04:45:53,180][WARNING][tcp-disco-msg-worker-#2%StaticGrid_NG_Dev%][TcpDiscoverySpi] Failed to send message to next node [msg=TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=47aa2976-0a02-4ffe-9c8d-3f0fbfcc532b, addrs=[10.201.30.173], sockAddrs=[/10.201.30.173:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1542861943131, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=true], dataPacket=o.a.i.spi.discovery.tcp.internal.DiscoveryDataPacket@6ce6ae2, super=TcpDiscoveryAbstractMessage [sndNodeId=8a825790-a987-42c3-acb0-b3ea270143e1, id=5e14ec53761-47aa2976-0a02-4ffe-9c8d-3f0fbfcc532b, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]], next=TcpDiscoveryNode [id=d7782a2e-4cfc-4427-8ba7-a9af3954ae3f, addrs=[10.201.30.64], sockAddrs=[/10.201.30.64:47603], discPort=47603, order=53, intOrder=32, lastExchangeTime=1542272829304, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=47aa2976-0a02-4ffe-9c8d-3f0fbfcc532b, addrs=[10.201.30.173], sockAddrs=[/10.201.30.173:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1542861943131, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=true], dataPacket=o.a.i.spi.discovery.tcp.internal.DiscoveryDataPacket@6ce6ae2, super=TcpDiscoveryAbstractMessage [sndNodeId=8a825790-a987-42c3-acb0-b3ea270143e1, id=5e14ec53761-47aa2976-0a02-4ffe-9c8d-3f0fbfcc532b, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]], next=ClusterNode [id=d7782a2e-4cfc-4427-8ba7-a9af3954ae3f, order=53, addr=[10.201.30.64], daemon=true]]]
[04:45:53,190][WARNING][tcp-disco-msg-worker-#2%StaticGrid_NG_Dev%][TcpDiscoverySpi] Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'

and then, on another node:

[04:45:58,335][WARNING][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=8a825790-a987-42c3-acb0-b3ea270143e1, addrs=[10.201.30.63], sockAddrs=[/10.201.30.63:47600], discPort=47600, order=42, intOrder=23, lastExchangeTime=1542861958327, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=false]

I think that you either have long GC pauses or a flaky network (or the system goes into swapping and such). Consider increasing 'ackTimeout' and/or 'failureDetectionTimeout'. Also consider collecting GC logs for your nodes and looking into them for a root cause.
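For illustration, a minimal sketch of how these two properties can be raised through the Java API; the 30-second values are placeholders for the example, not tuned recommendations:

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

    public class DiscoveryTimeoutTuning {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Raise the overall failure-detection budget (default: 10_000 ms).
            cfg.setFailureDetectionTimeout(30_000);

            // Or tune the discovery SPI directly.
            TcpDiscoverySpi disco = new TcpDiscoverySpi();
            disco.setAckTimeout(30_000);
            cfg.setDiscoverySpi(disco);

            // To collect GC logs on Java 8, start each node's JVM with e.g.:
            //   -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps

            Ignition.start(cfg);
        }
    }

Note that once 'ackTimeout' is set explicitly on the SPI, it takes precedence over the value derived from 'failureDetectionTimeout', so it is usually enough to tune one or the other. The same properties can also be set in the Spring XML configuration.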
Regards,
--
Ilya Kasnacheev

On Fri, 30 Nov 2018 at 14:01, Hemasundara Rao <hemasundara....@travelcentrictechnology.com> wrote:

Hi Ilya Kasnacheev,

I am attaching all logs from the second server (10.201.30.64).
Please let me know if you need any other details.

Thanks and Regards,
Hemasundar.

On Fri, 30 Nov 2018 at 09:40, Hemasundara Rao <hemasundara....@travelcentrictechnology.com> wrote:

Hi Ilya Kasnacheev,

We are running one cluster node (10.201.30.63). I am attaching all logs from this server.
Please let me know if you need any other details.

Thanks and Regards,
Hemasundar.

On Thu, 29 Nov 2018 at 20:07, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:

Hello!

It is not clear from this log alone why this node became segmented. Do you have the log from the other server node in the topology? It was the coordinator, so maybe it was the one experiencing problems.

Regards,
--
Ilya Kasnacheev

On Wed, 28 Nov 2018 at 13:56, Hemasundara Rao <hemasundara....@travelcentrictechnology.com> wrote:

Hi Ilya Kasnacheev,

Did you get a chance to go through the attached log?
This is one of the critical issues we are facing in our dev environment. Your input would be of great help to us in finding what is causing this issue and a probable solution to it.

Thanks and Regards,
Hemasundar.

On Mon, 26 Nov 2018 at 16:54, Hemasundara Rao <hemasundara....@travelcentrictechnology.com> wrote:

Hi Ilya Kasnacheev,
I have attached the log file.

Regards,
Hemasundar.

On Mon, 26 Nov 2018 at 16:50, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:

Hello!

Maybe you have some data in your caches which causes runaway heap usage in your own code, and previously you did not have such data, or code that would react in such a fashion.

It's hard to say; can you provide more logs from the node before it segments?

Regards,
--
Ilya Kasnacheev

On Mon, 26 Nov 2018 at 14:17, Hemasundara Rao <hemasundara....@travelcentrictechnology.com> wrote:

Thank you very much, Ilya Kasnacheev, for your response.

We are loading data initially; after that, only small delta changes are applied.
The grid-down issue happens after the grid has been running successfully for 2 to 3 days. Once the issue starts, it repeats frequently, and we are not getting any clue.

Thanks and Regards,
Hemasundar.

On Mon, 26 Nov 2018 at 13:43, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:

Hello!

A node will get segmented if other nodes fail to wait for a Discovery response from that node. This usually means either network problems or long GC pauses caused by insufficient heap on one of the nodes.

Make sure your data load process does not cause heap usage spikes.
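As a rough illustration of that last point, a data streamer bounds heap pressure during bulk loading by batching updates per node instead of issuing many individual puts; a minimal sketch, where the cache name 'staticData' and the entry types are hypothetical:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;
    import org.apache.ignite.Ignition;

    public class StaticDataLoad {
        public static void main(String[] args) {
            Ignite ignite = Ignition.start();

            // The cache must exist before a streamer can be opened for it.
            ignite.getOrCreateCache("staticData"); // hypothetical cache name

            // try-with-resources: close() flushes any remaining buffered entries.
            try (IgniteDataStreamer<Integer, String> streamer =
                     ignite.dataStreamer("staticData")) {
                streamer.perNodeBufferSize(1024); // entries buffered per node before send
                for (int i = 0; i < 1_000_000; i++)
                    streamer.addData(i, "value-" + i);
            }
        }
    }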
Regards,
--
Ilya Kasnacheev

On Fri, 23 Nov 2018 at 07:54, Hemasundara Rao <hemasundara....@travelcentrictechnology.com> wrote:

Hi All,

We are running a two-node Ignite server cluster. It ran without any issue for almost 5 days. We are using this grid for static data. The Ignite process runs with around 8 GB of memory after we load our data.
Suddenly the grid server nodes started going down. We tried running the server nodes and loading the static data 3 times; the server nodes go down again and again.

Please let us know how to overcome this kind of issue.

Attached are the log file and the configuration file.

Following is part of the log from one server:

[04:45:58,335][WARNING][tcp-disco-msg-worker-#2%StaticGrid_NG_Dev%][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[04:45:58,335][WARNING][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=8a825790-a987-42c3-acb0-b3ea270143e1, addrs=[10.201.30.63], sockAddrs=[/10.201.30.63:47600], discPort=47600, order=42, intOrder=23, lastExchangeTime=1542861958327, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[04:45:58,335][INFO][tcp-disco-sock-reader-#78%StaticGrid_NG_Dev%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.201.30.64:36695, rmtPort=36695]
[04:45:58,337][INFO][tcp-disco-sock-reader-#70%StaticGrid_NG_Dev%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.201.30.172:58418, rmtPort=58418]
[04:45:58,337][INFO][tcp-disco-sock-reader-#74%StaticGrid_NG_Dev%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.201.10.125:63403, rmtPort=63403]
[04:46:01,516][INFO][tcp-comm-worker-#1%StaticGrid_NG_Dev%][TcpDiscoverySpi] Pinging node: 6a603d8b-f8bf-40bf-af50-6c04a56b572e
[04:46:01,546][INFO][tcp-comm-worker-#1%StaticGrid_NG_Dev%][TcpDiscoverySpi] Finished node ping [nodeId=6a603d8b-f8bf-40bf-af50-6c04a56b572e, res=true, time=49ms]
[04:46:02,482][INFO][tcp-comm-worker-#1%StaticGrid_NG_Dev%][TcpDiscoverySpi] Pinging node: 5ec6ee69-075e-4829-84ca-ae40411c7bc3
[04:46:02,482][INFO][tcp-comm-worker-#1%StaticGrid_NG_Dev%][TcpDiscoverySpi] Finished node ping [nodeId=5ec6ee69-075e-4829-84ca-ae40411c7bc3, res=false, time=7ms]
[04:46:08,283][INFO][tcp-disco-sock-reader-#4%StaticGrid_NG_Dev%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/10.201.30.64:48038, rmtPort=48038]
[04:46:08,367][WARNING][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Restarting JVM according to configured segmentation policy.
[04:46:08,388][WARNING][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=20687a72-b5c7-48bf-a5ab-37bd3f7fa064, addrs=[10.201.30.64], sockAddrs=[/10.201.30.64:47601], discPort=47601, order=41, intOrder=22, lastExchangeTime=1542262724642, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[04:46:08,389][INFO][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Topology snapshot [ver=680, servers=1, clients=17, CPUs=36, offheap=8.0GB, heap=84.0GB]
[04:46:08,389][INFO][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Data Regions Configured:
[04:46:08,389][INFO][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager]   ^-- Default_Region [initSize=256.0 MiB, maxSize=8.0 GiB, persistenceEnabled=false]
[04:46:08,396][INFO][exchange-worker-#42%StaticGrid_NG_Dev%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=680, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=20687a72-b5c7-48bf-a5ab-37bd3f7fa064, customEvt=null, allowMerge=true]
[04:46:08,398][INFO][exchange-worker-#42%StaticGrid_NG_Dev%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=680, minorTopVer=0], waitTime=0ms, futInfo=NA]
[04:46:08,398][INFO][exchange-worker-#42%StaticGrid_NG_Dev%][GridDhtPartitionsExchangeFuture] Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=680, minorTopVer=0]]
[04:46:08,398][INFO][exchange-worker-#42%StaticGrid_NG_Dev%][GridCachePartitionExchangeManager] Stop merge, custom task found: WalStateNodeLeaveExchangeTask [node=TcpDiscoveryNode [id=20687a72-b5c7-48bf-a5ab-37bd3f7fa064, addrs=[10.201.30.64], sockAddrs=[/10.201.30.64:47601], discPort=47601, order=41, intOrder=22, lastExchangeTime=1542262724642, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]]
[04:46:08,398][INFO][exchange-worker-#42%StaticGrid_NG_Dev%][GridDhtPartitionsExchangeFuture] finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=680, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=680, minorTopVer=0]]
[04:46:08,512][WARNING][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=6a603d8b-f8bf-40bf-af50-6c04a56b572e, addrs=[10.201.30.172], sockAddrs=[BLRVM-HHNG01.devdom/10.201.30.172:0], discPort=0, order=98, intOrder=53, lastExchangeTime=1542348596592, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=true]
[04:46:08,512][INFO][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Topology snapshot [ver=683, servers=1, clients=16, CPUs=36, offheap=8.0GB, heap=78.0GB]
[04:46:08,512][INFO][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Data Regions Configured:
[04:46:08,512][INFO][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager]   ^-- Default_Region [initSize=256.0 MiB, maxSize=8.0 GiB, persistenceEnabled=false]
[04:46:08,513][WARNING][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=5ec6ee69-075e-4829-84ca-ae40411c7bc3, addrs=[10.201.30.172], sockAddrs=[BLRVM-HHNG01.devdom/10.201.30.172:0], discPort=0, order=129, intOrder=71, lastExchangeTime=1542360580600, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=true]
[04:46:08,513][INFO][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Topology snapshot [ver=684, servers=1, clients=15, CPUs=36, offheap=8.0GB, heap=72.0GB]
[04:46:08,513][INFO][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Data Regions Configured:
[04:46:08,513][INFO][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager]   ^-- Default_Region [initSize=256.0 MiB, maxSize=8.0 GiB, persistenceEnabled=false]
[04:46:08,514][WARNING][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=224648a6-e515-479e-88e4-44f7bceaeb14, addrs=[10.201.50.96], sockAddrs=[BLRWSVERMA3420.devdom/10.201.50.96:0], discPort=0, order=175, intOrder=96, lastExchangeTime=1542365246419, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=true]
[04:46:08,514][INFO][disco-event-worker-#41%StaticGrid_NG_Dev%][GridDiscoveryManager] Topology snapshot [ver=685, servers=1, clients=14, CPUs=32, offheap=8.0GB, heap=71.0GB]
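For context, the 'Default_Region' reported in the snapshots above corresponds to a data region configured roughly as follows. This is a sketch reconstructed only from the logged values (region name, initial size, max size, no persistence); the builder-style Java form and class name are assumptions, since the actual configuration is in the attached file:

    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class RegionFromLogs {
        public static IgniteConfiguration configuration() {
            // Mirrors the logged region: initSize=256.0 MiB, maxSize=8.0 GiB,
            // persistenceEnabled=false (a purely in-memory region).
            DataRegionConfiguration region = new DataRegionConfiguration()
                .setName("Default_Region")
                .setInitialSize(256L * 1024 * 1024)
                .setMaxSize(8L * 1024 * 1024 * 1024)
                .setPersistenceEnabled(false);

            DataStorageConfiguration storage = new DataStorageConfiguration()
                .setDefaultDataRegionConfiguration(region);

            return new IgniteConfiguration()
                .setDataStorageConfiguration(storage);
        }
    }

Since persistenceEnabled=false, data in this region does not survive the JVM restart triggered by the segmentation policy, so the static data has to be reloaded after every segmentation incident.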