[ https://issues.apache.org/jira/browse/IGNITE-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ilya Kasnacheev updated IGNITE-7476:
------------------------------------
    Description:
Sometimes server node will fail with the following trace:
{code:java}
SEVERE: TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
java.lang.NullPointerException
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.cacheMetrics(GridDiscoveryManager.java:1149)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMetricsUpdateMessage(ServerImpl.java:5022)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2690)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2491)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6675)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2574)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62){code}
Two problems here:
 * Uncaught exception in cacheMetrics() leads to unconditional failure of the node, because it happens in the discovery thread. Should probably wrap all non-trivial code in try-catch.
 * Lack of proper locking when destroying cache (see also IGNITE-6580, IGNITE-7278 and IGNITE-7165)

  was:
Sometimes server node will fail with the following trace:
{code:java}
SEVERE: TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
java.lang.NullPointerException
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.cacheMetrics(GridDiscoveryManager.java:1149)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMetricsUpdateMessage(ServerImpl.java:5022)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2690)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2491)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6675)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2574)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62){code}
Two problems here:
 * Uncaught exception in cacheMetrics() leads to unconditional failure of the node, because it happens in the discovery thread. Should probably wrap all non-trivial code in try-catch.
 * Lack of proper locking when destroying cache (see also IGNITE-6423 and IGNITE-7165)


> Server node will join with failure gathering metrics
> ----------------------------------------------------
>
>                 Key: IGNITE-7476
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7476
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ilya Kasnacheev
>            Priority: Critical
>
> Sometimes server node will fail with the following trace:
> {code:java}
> SEVERE: TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
> java.lang.NullPointerException
>     at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$7.cacheMetrics(GridDiscoveryManager.java:1149)
>     at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMetricsUpdateMessage(ServerImpl.java:5022)
>     at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2690)
>     at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2491)
>     at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6675)
>     at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2574)
>     at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62){code}
> Two problems here:
>  * Uncaught exception in cacheMetrics() leads to unconditional failure of the node, because it happens in the discovery thread. Should probably wrap all non-trivial code in try-catch.
>  * Lack of proper locking when destroying cache (see also IGNITE-6580, IGNITE-7278 and IGNITE-7165)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
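
The first point in the description (wrapping metrics gathering in try-catch so that an exception cannot take down the discovery thread) could look roughly like the sketch below. This is not Ignite code: SafeMetrics, collect() and the Supplier-based provider are hypothetical names chosen for illustration, and the sketch assumes that skipping one metrics update is acceptable. It does nothing about the second point, the missing locking on cache destroy.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper for illustration only; not part of Ignite. */
final class SafeMetrics {
    private static final Logger LOG = Logger.getLogger(SafeMetrics.class.getName());

    private SafeMetrics() {
        // Static helper, no instances.
    }

    /**
     * Calls the metrics provider and returns an empty map if it throws or
     * returns null, so the calling thread keeps running instead of dying
     * on an uncaught exception.
     */
    static <K, V> Map<K, V> collect(Supplier<Map<K, V>> provider) {
        try {
            Map<K, V> metrics = provider.get();

            return metrics != null ? metrics : Collections.emptyMap();
        }
        catch (RuntimeException e) {
            // Log and skip this update; the underlying race still has to be fixed separately.
            LOG.log(Level.WARNING, "Failed to gather cache metrics, skipping this update", e);

            return Collections.emptyMap();
        }
    }

    public static void main(String[] args) {
        // A provider that works normally.
        Map<Integer, String> ok = collect(() -> Collections.singletonMap(1, "cache-1 metrics"));

        // A provider that fails the way the trace above does (assumed cause: cache destroyed concurrently).
        Map<Integer, String> failed = collect(() -> {
            throw new NullPointerException("cache destroyed concurrently");
        });

        System.out.println(ok.size() + " " + failed.size());
    }
}
```

Catching only RuntimeException is a deliberate choice in this sketch: an Error such as OutOfMemoryError still propagates, since continuing after one is rarely safe.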