Re: Split brain in 2.9.0?

2021-06-11 Thread Ilya Kasnacheev
Hello!

It looks like your nodes have rejoined with different consistent IDs/data
dirs, and thus some of your data was not accessible.

Please make sure that your nodes keep the same consistent IDs/data dirs
across restarts.
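
One way to keep the identity stable is to set the consistent ID and the
storage paths explicitly in the node configuration. The following is a
minimal sketch, assuming Ignite 2.x with native persistence and the Ignite
jars on the classpath. The ID "node-1" and the /var/lib/ignite paths are
placeholders, not values from this thread; the paths need to live on storage
that survives a restart.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StableNodeStartup {
    public static void main(String[] args) {
        // Persistent default data region with explicit, durable locations.
        // All paths below are hypothetical placeholders.
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
        storage.setStoragePath("/var/lib/ignite/db");
        storage.setWalPath("/var/lib/ignite/wal");
        storage.setWalArchivePath("/var/lib/ignite/wal-archive");

        IgniteConfiguration cfg = new IgniteConfiguration()
            // Fixed ID (placeholder): without it the node derives an ID
            // from what it finds in its work directory.
            .setConsistentId("node-1")
            .setWorkDirectory("/var/lib/ignite/work")
            .setDataStorageConfiguration(storage);

        Ignite ignite = Ignition.start(cfg);

        // Compare this value with the baseline list after every restart.
        System.out.println("Consistent ID: " + ignite.cluster().localNode().consistentId());
    }
}
```

Each node needs its own distinct consistent ID, and the same ID has to come
back with the same data directory.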

It also looks like the one failing node formed this topology, rather than
the two surviving ones, which seem to have restarted and rejoined it. What
was the exact order of events?

Regards,
-- 
Ilya Kasnacheev


Fri, Jun 11, 2021 at 04:30, Devin Bost:

> We encountered a situation after a node unexpectedly went down and came
> back up.
> After it came back, none of our transactions were going through (due to
> rollbacks), and we started getting a lot of exceptions in the logs. (I've
> added the exceptions at the bottom of this message.)
> We were getting "Failed to execute the cache operation (all partition
> owners have left the grid, partition data has been lost)", so we tried to
> reset the partitions (since these are persistent caches), and the commands
> succeeded, but we kept seeing errors.
>
> We checked the cluster state, and it looks like we have two nodes that
> came up with different IDs.
>
> Cluster state: active
> Current topology version: 1170
> Baseline auto adjustment disabled: softTimeout=30
> Current topology version: 1170 (Coordinator: ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1)
> Baseline nodes:
> ConsistentId=1a0aa611-58b7-479a-b1e6-735e31f87ed9, State=ONLINE, Order=1169
> ConsistentId=92bc8407-30f1-433d-9c32-5eeb759c73be, State=OFFLINE
> ConsistentId=b5875ab9-7923-46c9-b3f3-1550455a24e5, State=OFFLINE
>
> 
> Number of baseline nodes: 3
> Other nodes:
> ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1
> ConsistentId=5e8f3b03-aa20-45aa-892a-37e988e3741f, Order=2
>
>
> Could this be a split-brain scenario?
>
> Here are the more complete logs:
>
> javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=430, key=com.company.PropensityKey [idHash=362342248, hash=42458921, customerId=142045188, variant=MODEL_A]]
>     at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110)
>     at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676)
>     at org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41)
>     at org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99)
>     at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202)
>     at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56)
>     at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279)
>     at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
>     at org.apache.ignite.internal.util.nio.GridNioAsyncNotifyFilter$3.body(GridNioAsyncNotifyFilter.java:97)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>     at org.apache.ignite.internal.util.worker.GridWorkerPool$1.run(GridWorkerPool.java:70)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=430, key=com.company.PropensityKey [idHash=362342248, hash=42458921, customerId=142045188, variant=MODEL_A]]
>     at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:209)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:128)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.validate(GridPartitionedSingleGetFuture.java:859)
>

Split brain in 2.9.0?

2021-06-10 Thread Devin Bost
We encountered a situation after a node unexpectedly went down and came
back up.
After it came back, none of our transactions were going through (due to
rollbacks), and we started getting a lot of exceptions in the logs. (I've
added the exceptions at the bottom of this message.)
We were getting "Failed to execute the cache operation (all partition
owners have left the grid, partition data has been lost)", so we tried to
reset the partitions (since these are persistent caches), and the commands
succeeded, but we kept seeing errors.

We checked the cluster state, and it looks like we have two nodes that came
up with different IDs.

Cluster state: active
Current topology version: 1170
Baseline auto adjustment disabled: softTimeout=30
Current topology version: 1170 (Coordinator: ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1)
Baseline nodes:
ConsistentId=1a0aa611-58b7-479a-b1e6-735e31f87ed9, State=ONLINE, Order=1169
ConsistentId=92bc8407-30f1-433d-9c32-5eeb759c73be, State=OFFLINE
ConsistentId=b5875ab9-7923-46c9-b3f3-1550455a24e5, State=OFFLINE

Number of baseline nodes: 3
Other nodes:
ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1
ConsistentId=5e8f3b03-aa20-45aa-892a-37e988e3741f, Order=2
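
The pattern in this output (baseline nodes reported OFFLINE while other
nodes are online under consistent IDs that are not in the baseline) can be
checked mechanically. Below is a minimal sketch in plain Java, assuming the
output keeps the line layout shown above. The class name BaselineCheck and
its methods are made up for illustration and are not part of Ignite.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Splits baseline-style output into offline baseline nodes and non-baseline nodes. */
final class BaselineCheck {
    /** One node entry; the State field is only present for baseline nodes. */
    private static final Pattern NODE =
        Pattern.compile("ConsistentId=([^,\\s)]+)(?:,\\s*State=(\\w+))?");

    /** Sample input copied from the cluster state in this message. */
    static final String SAMPLE = String.join("\n",
        "Current topology version: 1170 (Coordinator: ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1)",
        "Baseline nodes:",
        "ConsistentId=1a0aa611-58b7-479a-b1e6-735e31f87ed9, State=ONLINE, Order=1169",
        "ConsistentId=92bc8407-30f1-433d-9c32-5eeb759c73be, State=OFFLINE",
        "ConsistentId=b5875ab9-7923-46c9-b3f3-1550455a24e5, State=OFFLINE",
        "Number of baseline nodes: 3",
        "Other nodes:",
        "ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1",
        "ConsistentId=5e8f3b03-aa20-45aa-892a-37e988e3741f, Order=2");

    final List<String> offlineBaseline = new ArrayList<>();
    final List<String> onlineOutsideBaseline = new ArrayList<>();

    static BaselineCheck parse(String output) {
        BaselineCheck result = new BaselineCheck();
        String section = ""; // entries before "Baseline nodes:" (the coordinator line) are skipped
        for (String line : output.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("Baseline nodes:")) { section = "baseline"; continue; }
            if (trimmed.startsWith("Other nodes:")) { section = "other"; continue; }
            Matcher m = NODE.matcher(trimmed);
            if (!m.find()) continue;
            if (section.equals("baseline") && "OFFLINE".equals(m.group(2)))
                result.offlineBaseline.add(m.group(1));
            else if (section.equals("other"))
                result.onlineOutsideBaseline.add(m.group(1));
        }
        return result;
    }

    /** Heuristic only: baseline members missing while unknown IDs are online. */
    boolean looksLikeIdentityChange() {
        return !offlineBaseline.isEmpty() && !onlineOutsideBaseline.isEmpty();
    }

    public static void main(String[] args) {
        BaselineCheck check = parse(SAMPLE);
        System.out.println("Offline baseline nodes: " + check.offlineBaseline);
        System.out.println("Online nodes outside baseline: " + check.onlineOutsideBaseline);
        System.out.println("Possible identity change: " + check.looksLikeIdentityChange());
    }
}
```

The check is only a heuristic: it cannot tell a node that restarted under a
new ID from a genuinely new node that was never added to the baseline.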


Could this be a split-brain scenario?

Here are the more complete logs:

javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=430, key=com.company.PropensityKey [idHash=362342248, hash=42458921, customerId=142045188, variant=MODEL_A]]
    at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270)
    at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083)
    at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110)
    at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676)
    at org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41)
    at org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99)
    at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202)
    at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56)
    at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279)
    at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
    at org.apache.ignite.internal.util.nio.GridNioAsyncNotifyFilter$3.body(GridNioAsyncNotifyFilter.java:97)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.internal.util.worker.GridWorkerPool$1.run(GridWorkerPool.java:70)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=430, key=com.company.PropensityKey [idHash=362342248, hash=42458921, customerId=142045188, variant=MODEL_A]]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:209)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:128)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.validate(GridPartitionedSingleGetFuture.java:859)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.map(GridPartitionedSingleGetFuture.java:277)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.init(GridPartitionedSingleGetFuture.java:244)
    at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.getAsync(GridDhtColocatedCache.java:297)
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:4844)
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGet(GridCacheAdapter.java:4810)