Re: Split brain in 2.9.0?
Hello!

It looks like your nodes have rejoined with different consistent IDs/data dirs, and thus some of your data was not accessible. Please make sure that your nodes preserve their consistent IDs/data dirs across restarts.

It also looks like the one failing node formed this topology, rather than the two surviving ones, which seem to have restarted and rejoined it.

What was the specific order of events?

Regards,
--
Ilya Kasnacheev

On Fri, Jun 11, 2021 at 04:30, Devin Bost wrote:
Split brain in 2.9.0?
We encountered a situation after a node unexpectedly went down and came back up. After it came back, none of our transactions were going through (due to rollbacks), and we started getting a lot of exceptions in the logs. (I've added the exceptions at the bottom of this message.)

We were getting "Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost)", so we tried to reset the partitions (since these are persistent caches), and the commands succeeded, but we kept seeing errors.

We checked the cluster state, and it looks like we have two nodes that came up with different IDs.

Cluster state: active
Current topology version: 1170
Baseline auto adjustment disabled: softTimeout=30

Current topology version: 1170 (Coordinator: ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1)

Baseline nodes:
    ConsistentId=1a0aa611-58b7-479a-b1e6-735e31f87ed9, State=ONLINE, Order=1169
    ConsistentId=92bc8407-30f1-433d-9c32-5eeb759c73be, State=OFFLINE
    ConsistentId=b5875ab9-7923-46c9-b3f3-1550455a24e5, State=OFFLINE

Number of baseline nodes: 3

Other nodes:
    ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1
    ConsistentId=5e8f3b03-aa20-45aa-892a-37e988e3741f, Order=2

Could this be a split-brain scenario?
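To make the mismatch in the cluster state above easier to see, here is a minimal Python sketch that groups the listed consistent IDs by baseline membership and state. It only parses the text exactly as posted; the function name `summarize` and the result keys are made up for illustration, and the layout of the output may differ in other Ignite versions.

```python
import re

# Cluster state as posted in the message above.
BASELINE_OUTPUT = """\
Cluster state: active
Current topology version: 1170
Baseline auto adjustment disabled: softTimeout=30

Current topology version: 1170 (Coordinator: ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1)

Baseline nodes:
    ConsistentId=1a0aa611-58b7-479a-b1e6-735e31f87ed9, State=ONLINE, Order=1169
    ConsistentId=92bc8407-30f1-433d-9c32-5eeb759c73be, State=OFFLINE
    ConsistentId=b5875ab9-7923-46c9-b3f3-1550455a24e5, State=OFFLINE

Number of baseline nodes: 3

Other nodes:
    ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1
    ConsistentId=5e8f3b03-aa20-45aa-892a-37e988e3741f, Order=2
"""

# Matches a node line that starts with "ConsistentId=<id>".
# "State=..." is only present for baseline nodes, so it is optional.
NODE_RE = re.compile(r"^\s*ConsistentId=([^,\s]+)(?:,\s*State=(\w+))?")


def summarize(text):
    """Group consistent IDs by baseline membership and state (hypothetical helper)."""
    result = {"online_baseline": [], "offline_baseline": [], "outside_baseline": []}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Baseline nodes:"):
            section = "baseline"
        elif stripped.startswith("Other nodes:"):
            section = "other"
        else:
            match = NODE_RE.match(line)
            if match is None or section is None:
                continue  # header lines, blank lines, the coordinator line
            consistent_id, state = match.groups()
            if section == "other":
                result["outside_baseline"].append(consistent_id)
            elif state == "ONLINE":
                result["online_baseline"].append(consistent_id)
            else:
                result["offline_baseline"].append(consistent_id)
    return result


if __name__ == "__main__":
    for group, ids in summarize(BASELINE_OUTPUT).items():
        print(group, len(ids), ids)
```

For the state above, this reports one baseline node online, two baseline nodes offline, and two nodes outside the baseline. That pattern would be consistent with two nodes having restarted under new consistent IDs, though the output alone does not prove it.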
Here are the more complete logs:

javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=430, key=com.company.PropensityKey [idHash=362342248, hash=42458921, customerId=142045188, variant=MODEL_A]]
    at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270)
    at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083)
    at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110)
    at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676)
    at org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41)
    at org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99)
    at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202)
    at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56)
    at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279)
    at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
    at org.apache.ignite.internal.util.nio.GridNioAsyncNotifyFilter$3.body(GridNioAsyncNotifyFilter.java:97)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.internal.util.worker.GridWorkerPool$1.run(GridWorkerPool.java:70)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=430, key=com.company.PropensityKey [idHash=362342248, hash=42458921, customerId=142045188, variant=MODEL_A]]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:209)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:128)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.validate(GridPartitionedSingleGetFuture.java:859)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.map(GridPartitionedSingleGetFuture.java:277)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.init(GridPartitionedSingleGetFuture.java:244)
    at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.getAsync(GridDhtColocatedCache.java:297)
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:4844)
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGet(GridCacheAdapter.java:4810)
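When many of these exceptions pile up, it can help to see which caches and partitions they mention. The following is a rough Python sketch that pulls `cacheName` and `partition` out of the exception message format shown above. The helper name `lost_partitions` and the `SAMPLE_LOG` text are illustrative assumptions, and the pattern is only matched against the message wording that appears in this thread.

```python
import re
from collections import defaultdict

# Pattern based on the message text in the trace above; whitespace is
# matched loosely so that wrapped log lines still match.
LOST_RE = re.compile(
    r"partition\s+data\s+has\s+been\s+lost\)\s+"
    r"\[cacheName=([^,\]]+),\s+partition=(\d+)"
)

# Shortened sample built from the exception messages in this thread.
SAMPLE_LOG = (
    "javax.cache.CacheException: class "
    "org.apache.ignite.internal.processors.cache.CacheInvalidStateException: "
    "Failed to execute the cache operation (all partition owners have left the "
    "grid, partition data has been lost) [cacheName=propensity-customer, "
    "partition=430, key=com.company.PropensityKey [idHash=362342248]]\n"
    "Caused by: class "
    "org.apache.ignite.internal.processors.cache.CacheInvalidStateException: "
    "Failed to execute the cache operation (all partition owners have left the "
    "grid, partition data has been lost) [cacheName=propensity-customer, "
    "partition=430, key=com.company.PropensityKey [idHash=362342248]]\n"
)


def lost_partitions(log_text):
    """Map each cache name to the sorted partition numbers reported as lost."""
    found = defaultdict(set)
    for cache_name, partition in LOST_RE.findall(log_text):
        found[cache_name].add(int(partition))
    return {name: sorted(parts) for name, parts in found.items()}


if __name__ == "__main__":
    for name, parts in lost_partitions(SAMPLE_LOG).items():
        print(name, parts)
```

Running the same summary before and after a fix attempt gives a quick way to check whether the set of affected partitions has changed.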