Re: Grid suddenly went in bad state

2019-09-27 Thread Ilya Kasnacheev
Hello!

I'm not really sure; maybe it is because the nodes tried to enter PME when they
were already unable to communicate, and therefore not all cache operations
were completed.

Anyway, since your logs don't start from the beginning, it's impossible to know
whether there were any other clues. Currently there are none.

Please also consider https://issues.apache.org/jira/browse/IGNITE-11365
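
If the nodes were pausing (long GC, for example) rather than the network actually
dropping, raising the failure detection timeout sometimes gives enough slack to
avoid spurious topology changes. A minimal sketch, not your config; the value is
only an illustration:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class FailureDetectionTimeoutSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Give slow nodes more time before they are treated as failed and a
        // topology change (and PME) is triggered. 30 s is only an illustration;
        // SPI timeouts fall back to this value unless set explicitly.
        cfg.setFailureDetectionTimeout(30_000);

        Ignition.start(cfg);
    }
}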

Regards,
-- 
Ilya Kasnacheev


Thu, 26 Sep 2019 at 19:11, Abhishek Gupta (BLOOMBERG/ 731 LEX) <
agupta...@bloomberg.net>:

> Thanks for the response, Ilya.
> So from a sequence of events perspective, first the logs show "Partition
> states validation has failed for group" for many minutes. And only after
> that we see the "Failed to read data from remote connection" caused by
> "java.nio.channels.ClosedChannelException". So the question remains - what
> could cause "Partition states validation has failed for group" in the first
> place?
>
> Would also appreciate insights into my question 2 below about a 'client'
> being nominated as the coordinator. Is that by design?
>
> Thanks,
> Abhishek
>
>
>
> From: ilya.kasnach...@gmail.com At: 09/26/19 11:33:36
> To: Abhishek Gupta (BLOOMBERG/ 731 LEX ) 
> Cc: user@ignite.apache.org
> Subject: Re: Grid suddenly went in bad state
>
> Hello!
>
> "Failed to read data from remote connection" in absence of other errors
> points to potential network problems. Maybe you have short idle timeout for
> TCP connections? Maybe they get blockaded?
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> Tue, 24 Sep 2019 at 20:46, Abhishek Gupta (BLOOMBERG/ 731 LEX) <
> agupta...@bloomberg.net>:
>
>> Hello Folks,
>> Would really appreciate any suggestions you could provide about the below.
>>
>>
>> Thanks,
>> Abhishek
>>
>> From: user@ignite.apache.org At: 09/20/19 15:11:33
>> To: user@ignite.apache.org
>> Subject: Re: Grid suddenly went in bad state
>>
>>
>> Find attached the logs from 3 of the nodes and their GC graphs. The logs
>> from the other nodes look pretty much the same.
>>
>> Some questions -
>> 1. What could be the trigger for the "Partition states validation has
>> failed for group" warning on node 1? It seems to have come on suddenly.
>> 2. If you look at the logs, there seems to be a change in coordinator
>> 3698 2019-09-19 15:07:04.487 [INFO ] [disco-event-worker-#175]
>> GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode
>> [id=d667641c-3213-42ce-aea7-2fa232e972d6, addrs=[10.115.226.147, 127.0.0.1,
>> 10.126.191.211], order=91, loc=false, client=true], cur=ZookeeperCluste
>> rNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110,
>> 10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]
>> 3713 2019-09-19 15:09:19.813 [INFO ] [disco-event-worker-#175]
>> GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode
>> [id=2c4a25d1-7701-407f-b728-4d9bcef3cb5b, addrs=[10.115.226.148,
>> 10.126.191.212, 127.0.0.1], order=94, loc=false, client=true],
>> cur=ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d,
>> addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false,
>> client=false]]
>>
>> What is curious is that the logs seem to suggest a client was the coordinator.
>> Is that by design? Are clients allowed to be coordinators?
>>
>>
>> 3. It just seems like the grid went into a tailspin, as shown in the logs
>> for node 1. Any help in understanding what triggered this series of events
>> would be very helpful.
>>
>>
>> Thanks,
>> Abhishek
>>
>>
>>
>>
>> From: user@ignite.apache.org At: 09/20/19 05:24:59
>> To: user@ignite.apache.org
>> Subject: Re: Grid suddenly went in bad state
>>
>> Hi,
>>
>> Could you please also attach logs from the other nodes? And what version of
>> Ignite are you currently using?
>>
>> Also, you've mentioned high GC activity; is it possible to provide the GC logs?
>>
>> Regards,
>> Igor
>>
>> On Fri, Sep 20, 2019 at 1:17 AM Abhishek Gupta (BLOOMBERG/ 731 LEX) <
>> agupta...@bloomberg.net> wrote:
>>
>>> Hello,
>>> I've got a 6 node grid with maxSize (dataregionconfig) set to 300G each.
>>> The grid seemed to be performing normally until at one point it started
>>> logging "Partition states validation has failed for group" warning - see
>>> attached log file. This kept happening for about 3 minutes and then stopped
>>> (see line 85 in the attached log file). Just then a client seems to have
>>> connected

Re: Grid suddenly went in bad state

2019-09-26 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Thanks for the response, Ilya.
So from a sequence of events perspective, first the logs show "Partition states 
validation has failed for group" for many minutes. And only after that we see 
the  "Failed to read data from remote connection" caused by 
"java.nio.channels.ClosedChannelException".  So the question remains - what 
could cause  "Partition states validation has failed for group" in the first 
place? 

Would also appreciate insights into my question 2 below about a 'client' being
nominated as the coordinator. Is that by design?

Thanks,
Abhishek


From: ilya.kasnach...@gmail.com At: 09/26/19 11:33:36
To: Abhishek Gupta (BLOOMBERG/ 731 LEX)
Cc: user@ignite.apache.org
Subject: Re: Grid suddenly went in bad state

Hello!

"Failed to read data from remote connection" in absence of other errors points 
to potential network problems. Maybe you have short idle timeout for TCP 
connections? Maybe they get blockaded?

Regards,
-- 
Ilya Kasnacheev


Tue, 24 Sep 2019 at 20:46, Abhishek Gupta (BLOOMBERG/ 731 LEX):

Hello Folks,
  Would really appreciate any suggestions you could provide about the below.


Thanks,
Abhishek

From: user@ignite.apache.org At: 09/20/19 15:11:33
To: user@ignite.apache.org
Subject: Re: Grid suddenly went in bad state


Find attached the logs from 3 of the nodes and their GC graphs. The logs from 
the other nodes look pretty much the same. 

Some questions - 
1. What could be the trigger for the "Partition states validation has failed
for group" warning on node 1? It seems to have come on suddenly.
2. If you look at the logs, there seems to be a change in coordinator 
   3698 2019-09-19 15:07:04.487 [INFO ] [disco-event-worker-#175] 
GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode 
[id=d667641c-3213-42ce-aea7-2fa232e972d6, addrs=[10.115.226.147, 127.0.0.1, 
10.126.191.211], order=91, loc=false, client=true], cur=ZookeeperCluste
rNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 
10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]
   3713 2019-09-19 15:09:19.813 [INFO ] [disco-event-worker-#175] 
GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode 
[id=2c4a25d1-7701-407f-b728-4d9bcef3cb5b, addrs=[10.115.226.148, 
10.126.191.212, 127.0.0.1], order=94, loc=false, client=true], 
cur=ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, 
addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false, 
client=false]]

What is curious is that the logs seem to suggest a client was the coordinator. Is
that by design? Are clients allowed to be coordinators?


3. It just seems like the grid went into a tailspin, as shown in the logs for
node 1. Any help in understanding what triggered this series of events would be
very helpful.


Thanks,
Abhishek


From: user@ignite.apache.org At: 09/20/19 05:24:59
To: user@ignite.apache.org
Subject: Re: Grid suddenly went in bad state

Hi,

Could you please also attach logs from the other nodes? And what version of Ignite
are you currently using?

Also, you've mentioned high GC activity; is it possible to provide the GC logs?

Regards,
Igor
On Fri, Sep 20, 2019 at 1:17 AM Abhishek Gupta (BLOOMBERG/ 731 LEX) 
 wrote:

Hello,
  I've got a 6 node grid with maxSize (dataregionconfig) set to 300G each. 
The grid seemed to be performing normally until at one point it started logging 
"Partition states validation has failed for group" warning - see attached log 
file.  This kept happening for about 3 minutes and then stopped (see line 85 in 
the attached log file).  Just then a client seems to have connected (see line 
135 where connection was accepted). But soon after, it kept logging the below 
exception. After a while (~1 hour), it started logging "Partition states
validation has failed for group" again (line 284).


2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] 
GridDhtPartitionsExchangeFuture - Completed partition exchange 
[localNode=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, 
exchange=GridDhtPartitionsExchangeFuture [topVer
=AffinityTopologyVersion [topVer=126, minorTopVer=0], evt=NODE_JOINED, 
evtNode=ZookeeperClusterNode [id=af5f33f4-842a-4691-8e84-da4fb19eafb2, 
addrs=[10.126.90.78, 10.115.76.13, 127.0.0.1], order=126, loc=false, clie
nt=true], done=true], topVer=AffinityTopologyVersion [topVer=126, 
minorTopVer=0], durationFromInit=0]
2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] time - Finished exchange 
init [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], crd=true]
2019-09-19 13:28:28.602 [INFO ] [exchange-worker-#176] 
GridCachePartitionExchangeManager - Skipping rebalancing (nothing scheduled) 
[top=AffinityTopologyVersion [topVer=126, minorTopVer=0], force=false, 
evt=NODE_JOI
NED, node=af5f33f4-842a-4691-8e84-da4fb19eafb2]
2019-09-19 13:28:29.513 [INFO ] [grid-nio-worker-tcp-comm-14-#130] 
TcpCommunicationSpi - Accepted incoming communication 

Re: Grid suddenly went in bad state

2019-09-26 Thread Ilya Kasnacheev
Hello!

"Failed to read data from remote connection" in absence of other errors
points to potential network problems. Maybe you have short idle timeout for
TCP connections? Maybe they get blockaded?
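
If it is the idle timeout, the knob lives on the communication SPI. A minimal
sketch, assuming the default TcpCommunicationSpi and programmatic configuration;
the one-hour value is only an example, not a recommendation:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class IdleConnectionTimeoutSketch {
    public static void main(String[] args) {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

        // Keep idle communication connections open longer so they are not
        // closed while a peer still expects to reuse them.
        commSpi.setIdleConnectionTimeout(60 * 60 * 1000L); // 1 hour, example only

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);

        Ignition.start(cfg);
    }
}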

Regards,
-- 
Ilya Kasnacheev


Tue, 24 Sep 2019 at 20:46, Abhishek Gupta (BLOOMBERG/ 731 LEX) <
agupta...@bloomberg.net>:

> Hello Folks,
> Would really appreciate any suggestions you could provide about the below.
>
>
> Thanks,
> Abhishek
>
> From: user@ignite.apache.org At: 09/20/19 15:11:33
> To: user@ignite.apache.org
> Subject: Re: Grid suddenly went in bad state
>
>
> Find attached the logs from 3 of the nodes and their GC graphs. The logs
> from the other nodes look pretty much the same.
>
> Some questions -
> 1. What could be the trigger for the "Partition states validation has
> failed for group" warning on node 1? It seems to have come on suddenly.
> 2. If you look at the logs, there seems to be a change in coordinator
> 3698 2019-09-19 15:07:04.487 [INFO ] [disco-event-worker-#175]
> GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode
> [id=d667641c-3213-42ce-aea7-2fa232e972d6, addrs=[10.115.226.147, 127.0.0.1,
> 10.126.191.211], order=91, loc=false, client=true], cur=ZookeeperCluste
> rNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110,
> 10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]
> 3713 2019-09-19 15:09:19.813 [INFO ] [disco-event-worker-#175]
> GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode
> [id=2c4a25d1-7701-407f-b728-4d9bcef3cb5b, addrs=[10.115.226.148,
> 10.126.191.212, 127.0.0.1], order=94, loc=false, client=true],
> cur=ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d,
> addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false,
> client=false]]
>
> What is curious is that the logs seem to suggest a client was the coordinator.
> Is that by design? Are clients allowed to be coordinators?
>
>
> 3. It just seems like the grid went into a tailspin, as shown in the logs
> for node 1. Any help in understanding what triggered this series of events
> would be very helpful.
>
>
> Thanks,
> Abhishek
>
>
>
>
> From: user@ignite.apache.org At: 09/20/19 05:24:59
> To: user@ignite.apache.org
> Subject: Re: Grid suddenly went in bad state
>
> Hi,
>
> Could you please also attach logs from the other nodes? And what version of
> Ignite are you currently using?
>
> Also, you've mentioned high GC activity; is it possible to provide the GC logs?
>
> Regards,
> Igor
>
> On Fri, Sep 20, 2019 at 1:17 AM Abhishek Gupta (BLOOMBERG/ 731 LEX) <
> agupta...@bloomberg.net> wrote:
>
>> Hello,
>> I've got a 6 node grid with maxSize (dataregionconfig) set to 300G each.
>> The grid seemed to be performing normally until at one point it started
>> logging "Partition states validation has failed for group" warning - see
>> attached log file. This kept happening for about 3 minutes and then stopped
>> (see line 85 in the attached log file). Just then a client seems to have
>> connected (see line 135 where connection was accepted). But soon after, it
>> kept logging the below exception. After a while (~1 hour), it started
>> logging "Partition states validation has failed for group" again
>> (line 284).
>>
>>
>> 2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176]
>> GridDhtPartitionsExchangeFuture - Completed partition exchange
>> [localNode=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d,
>> exchange=GridDhtPartitionsExchangeFuture [topVer
>> =AffinityTopologyVersion [topVer=126, minorTopVer=0], evt=NODE_JOINED,
>> evtNode=ZookeeperClusterNode [id=af5f33f4-842a-4691-8e84-da4fb19eafb2,
>> addrs=[10.126.90.78, 10.115.76.13, 127.0.0.1], order=126, loc=false, clie
>> nt=true], done=true], topVer=AffinityTopologyVersion [topVer=126,
>> minorTopVer=0], durationFromInit=0]
>> 2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] time - Finished
>> exchange init [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0],
>> crd=true]
>> 2019-09-19 13:28:28.602 [INFO ] [exchange-worker-#176]
>> GridCachePartitionExchangeManager - Skipping rebalancing (nothing
>> scheduled) [top=AffinityTopologyVersion [topVer=126, minorTopVer=0],
>> force=false, evt=NODE_JOI
>> NED, node=af5f33f4-842a-4691-8e84-da4fb19eafb2]
>> 2019-09-19 13:28:29.513 [INFO ] [grid-nio-worker-tcp-comm-14-#130]
>> TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/
>> 10.115.248.110:12122, rmtAddr=/10.115.76.13:45464]
>> 2019-09-19 13:28:29.540 [INFO ] [grid-nio-worker-tcp-comm-15-#131]
>> TcpCommunicationSpi - Accepted

Re: Grid suddenly went in bad state

2019-09-20 Thread Igor Belyakov
Hi,

Could you please also attach logs from the other nodes? And what version of Ignite
are you currently using?

Also, you've mentioned high GC activity; is it possible to provide the GC logs?

Regards,
Igor

On Fri, Sep 20, 2019 at 1:17 AM Abhishek Gupta (BLOOMBERG/ 731 LEX) <
agupta...@bloomberg.net> wrote:

> Hello,
> I've got a 6 node grid with maxSize (dataregionconfig) set to 300G each.
> The grid seemed to be performing normally until at one point it started
> logging "Partition states validation has failed for group" warning - see
> attached log file. This kept happening for about 3 minutes and then stopped
> (see line 85 in the attached log file). Just then a client seems to have
> connected (see line 135 where connection was accepted). But soon after, it
> kept logging the below exception. After a while (~1 hour), it started
> logging "Partition states validation has failed for group" again
> (line 284).
>
>
> 2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176]
> GridDhtPartitionsExchangeFuture - Completed partition exchange
> [localNode=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d,
> exchange=GridDhtPartitionsExchangeFuture [topVer
> =AffinityTopologyVersion [topVer=126, minorTopVer=0], evt=NODE_JOINED,
> evtNode=ZookeeperClusterNode [id=af5f33f4-842a-4691-8e84-da4fb19eafb2,
> addrs=[10.126.90.78, 10.115.76.13, 127.0.0.1], order=126, loc=false, clie
> nt=true], done=true], topVer=AffinityTopologyVersion [topVer=126,
> minorTopVer=0], durationFromInit=0]
> 2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] time - Finished
> exchange init [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0],
> crd=true]
> 2019-09-19 13:28:28.602 [INFO ] [exchange-worker-#176]
> GridCachePartitionExchangeManager - Skipping rebalancing (nothing
> scheduled) [top=AffinityTopologyVersion [topVer=126, minorTopVer=0],
> force=false, evt=NODE_JOI
> NED, node=af5f33f4-842a-4691-8e84-da4fb19eafb2]
> 2019-09-19 13:28:29.513 [INFO ] [grid-nio-worker-tcp-comm-14-#130]
> TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/
> 10.115.248.110:12122, rmtAddr=/10.115.76.13:45464]
> 2019-09-19 13:28:29.540 [INFO ] [grid-nio-worker-tcp-comm-15-#131]
> TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/
> 10.115.248.110:12122, rmtAddr=/10.115.76.13:45466]
> 2019-09-19 13:28:29.600 [INFO ] [grid-nio-worker-tcp-comm-16-#132]
> TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/
> 10.115.248.110:12122, rmtAddr=/10.115.76.13:45472]
> 2019-09-19 13:28:51.624 [ERROR] [grid-nio-worker-tcp-comm-17-#133]
> TcpCommunicationSpi - Failed to read data from remote connection (will wait
> for 2000ms).
> org.apache.ignite.IgniteCheckedException: Failed to select events on
> selector.
> at
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2182)
> ~[ignite-core-2.7.5-0-2.jar:2.7.5]
> at
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
> [ignite-core-2.7.5-0-2.jar:2.7.5]
> at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
> [ignite-core-2.7.5-0-2.jar:2.7.5]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
> Caused by: java.nio.channels.ClosedChannelException
> at
> java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197)
> ~[?:1.8.0_172]
> at
> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997)
> ~[ignite-core-2.7.5-0-2.jar:2.7.5]
> ... 3 more
>
>
> After a lot of these exceptions and warnings, the node started throwing the
> error below (a client had started ingestion using a data streamer), and the
> same exceptions were seen on all the nodes:
>
> 2019-09-19 15:10:38.922 [ERROR] [grid-timeout-worker-#115] - Critical
> system error detected. Will be handled accordingly to configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=Abst
> ractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
> [name=data-str
> eamer-stripe-42, igniteInstanceName=null, finished=false,
> heartbeatTs=1568920228643]]]
> org.apache.ignite.IgniteException: GridWorker
> [name=data-streamer-stripe-42, igniteInstanceName=null, finished=false,
> heartbeatTs=1568920228643]
> at
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
> [ignite-core-2.7.5-0-2.jar:2.7.5]
> at
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
> [ignite-core-2.7.5-0-2.jar:2.7.5]
> at
> org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
> [ignite-core-2.7.5-0-2.jar:2.7.5]
> at
> org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)
> [ignite-core-2.7.5-0-2.jar:2.7.5]
> at
> 

Grid suddenly went in bad state

2019-09-19 Thread Abhishek Gupta (BLOOMBERG/ 731 LEX)
Hello,
  I've got a 6-node grid with maxSize (DataRegionConfiguration) set to 300G on
each node; the region setup is sketched after this paragraph. The grid seemed to
be performing normally until at one point it started logging the "Partition states
validation has failed for group" warning - see the attached log file. This kept
happening for about 3 minutes and then stopped (see line 85 in the attached log
file). Just then a client seems to have connected (see line 135, where the
connection was accepted). But soon after, it kept logging the exception shown
below the sketch. After a while (~1 hour), it started logging "Partition states
validation has failed for group" again (line 284).
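
For reference, the data region is configured roughly like this. A minimal sketch
only; the region name is a placeholder and persistence settings are omitted:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class DataRegionSketch {
    public static void main(String[] args) {
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("default_region")              // placeholder name
            .setMaxSize(300L * 1024 * 1024 * 1024); // 300G off-heap per node

        DataStorageConfiguration storage = new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(region);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storage);

        Ignition.start(cfg);
    }
}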


2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176]
GridDhtPartitionsExchangeFuture - Completed partition exchange
[localNode=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=126, minorTopVer=0], evt=NODE_JOINED,
evtNode=ZookeeperClusterNode [id=af5f33f4-842a-4691-8e84-da4fb19eafb2,
addrs=[10.126.90.78, 10.115.76.13, 127.0.0.1], order=126, loc=false, client=true],
done=true], topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0],
durationFromInit=0]
2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] time - Finished exchange 
init [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], crd=true]
2019-09-19 13:28:28.602 [INFO ] [exchange-worker-#176]
GridCachePartitionExchangeManager - Skipping rebalancing (nothing scheduled)
[top=AffinityTopologyVersion [topVer=126, minorTopVer=0], force=false,
evt=NODE_JOINED, node=af5f33f4-842a-4691-8e84-da4fb19eafb2]
2019-09-19 13:28:29.513 [INFO ] [grid-nio-worker-tcp-comm-14-#130] 
TcpCommunicationSpi - Accepted incoming communication connection 
[locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45464]
2019-09-19 13:28:29.540 [INFO ] [grid-nio-worker-tcp-comm-15-#131] 
TcpCommunicationSpi - Accepted incoming communication connection 
[locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45466]
2019-09-19 13:28:29.600 [INFO ] [grid-nio-worker-tcp-comm-16-#132] 
TcpCommunicationSpi - Accepted incoming communication connection 
[locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45472]
2019-09-19 13:28:51.624 [ERROR] [grid-nio-worker-tcp-comm-17-#133] 
TcpCommunicationSpi - Failed to read data from remote connection (will wait for 
2000ms).
org.apache.ignite.IgniteCheckedException: Failed to select events on selector.
at 
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2182)
 ~[ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) 
[ignite-core-2.7.5-0-2.jar:2.7.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Caused by: java.nio.channels.ClosedChannelException
at 
java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197)
 ~[?:1.8.0_172]
at 
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997)
 ~[ignite-core-2.7.5-0-2.jar:2.7.5]
... 3 more


After a lot of these exceptions and warnings, the node started throwing the error
shown below the streamer sketch (a client had just started ingestion using a data
streamer), and the same exceptions were seen on all the nodes:
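
The ingestion on the client side is a plain data streamer loop along these lines.
A sketch only; the cache name, key/value types, and config file are placeholders,
not our actual ones:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerClientSketch {
    public static void main(String[] args) {
        Ignition.setClientMode(true);

        // "client-config.xml" stands in for the real client configuration file.
        try (Ignite client = Ignition.start("client-config.xml")) {
            // "myCache" is a placeholder cache name.
            try (IgniteDataStreamer<Long, String> streamer = client.dataStreamer("myCache")) {
                streamer.allowOverwrite(false); // keep the default: existing keys are not overwritten

                for (long key = 0; key < 1_000_000; key++)
                    streamer.addData(key, "value-" + key);
            } // close() flushes whatever is still buffered
        }
    }
}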

2019-09-19 15:10:38.922 [ERROR] [grid-timeout-worker-#115] - Critical system
error detected. Will be handled accordingly to configured handler
[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=data-streamer-stripe-42, igniteInstanceName=null, finished=false,
heartbeatTs=1568920228643]]]
org.apache.ignite.IgniteException: GridWorker [name=data-streamer-stripe-42, 
igniteInstanceName=null, finished=false, heartbeatTs=1568920228643]
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) 
[ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:221)
 [ignite-core-2.7.5-0-2.jar:2.7.5]
at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) 
[ignite-core-2.7.5-0-2.jar:2.7.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
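
The handler named in the log above is the default one; for completeness, wiring it
up explicitly would look roughly like this. A sketch matching what the log prints
(tryStop=false, timeout=0, and the two ignored failure types), assuming the
setIgnoredFailureTypes setter is available in this version:

import java.util.EnumSet;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlerSketch {
    public static void main(String[] args) {
        // tryStop=false, timeout=0, as reported in the log record above.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler(false, 0);

        // With these types ignored, a blocked data-streamer stripe is logged as a
        // critical error but does not stop or halt the node.
        hnd.setIgnoredFailureTypes(EnumSet.of(
            FailureType.SYSTEM_WORKER_BLOCKED,
            FailureType.SYSTEM_CRITICAL_OPERATION_TIMEOUT));

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setFailureHandler(hnd);

        Ignition.start(cfg);
    }
}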


There