Hello,

In general, there are two possible ways to handle lost partitions for a
cluster that uses Ignite Native Persistence:
1.
   - Return all failed nodes to the baseline topology.
   - Call resetLostPartitions.

2.
   - Stop all remaining nodes in the cluster.
   - Start all nodes in the cluster (including the previously failed nodes) and
activate the cluster (a minimal activation sketch follows below).
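
For the second option, in recent Ignite versions (2.9 or later) the cluster can
be re-activated from code as well as from control.sh (control.sh --set-state
ACTIVE). A minimal sketch, assuming a node has already been started from an XML
config whose path is purely illustrative:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;

public class ActivateClusterExample {
    public static void main(String[] args) {
        // Start (or connect to) a node; the config path is an assumption.
        Ignite ignite = Ignition.start("config/ignite-config.xml");

        // Once all nodes are up again, activate the cluster.
        // Equivalent to: control.sh --set-state ACTIVE
        ignite.cluster().state(ClusterState.ACTIVE);
    }
}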

It’s important to return all failed nodes to the topology before calling
resetLostPartitions; otherwise the cluster could end up with stale data.
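
For the first option, here is a minimal sketch of clearing the lost state from
the Java API, assuming the affected cache is named "City" (the cache name and
config path are illustrative); the same can be done with
control.sh --cache reset_lost_partitions City:

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ResetLostPartitionsExample {
    public static void main(String[] args) {
        // Connect to the cluster; the config path is an assumption.
        Ignite ignite = Ignition.start("config/ignite-config.xml");

        // Once all baseline owners are back in the topology,
        // clear the LOST state for the affected cache(s).
        ignite.resetLostPartitions(Collections.singleton("City"));
    }
}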

If some owners cannot be returned to the topology for some reason, they
should be excluded from the baseline before attempting to reset the lost
partition state, or a ClusterTopologyCheckedException will be thrown
with the message "Cannot reset lost partitions because no baseline nodes are
online [cache=someCache, partition=someLostPart]", indicating that safe
recovery is not possible.
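
If permanently failed owners do have to be dropped, here is a rough sketch of
shrinking the baseline to the server nodes currently in topology before the
reset (it assumes baseline auto-adjust is disabled and that "ignite" is an
already-started instance):

import org.apache.ignite.Ignite;

public class ExcludeFailedOwnersExample {
    static void excludeOfflineOwners(Ignite ignite) {
        // Rebuild the baseline from the server nodes that are currently alive;
        // owners that cannot be returned are thereby excluded.
        ignite.cluster().setBaselineTopology(ignite.cluster().forServers().nodes());
    }
}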

In your particular case, the cache has no backups, so returning the node that
holds a lost partition should not lead to data inconsistencies. This case can
be detected and automatically "resolved". I will file a JIRA ticket to address
this improvement.

Thanks,
Slava.

Mon, Sep 26, 2022 at 16:51, 38797715 <38797...@qq.com>:

> hello,
>
> Start two nodes with native persistence enabled, and then activate the cluster.
>
> Create a table with no backups, with SQL as follows:
>
> CREATE TABLE City (
>   ID INT,
>   Name VARCHAR,
>   CountryCode CHAR(3),
>   District VARCHAR,
>   Population INT,
>   PRIMARY KEY (ID, CountryCode)
> ) WITH "template=partitioned, affinityKey=CountryCode, CACHE_NAME=City,
> KEY_TYPE=demo.model.CityKey, VALUE_TYPE=demo.model.City";
>
> INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES
> (1,'Kabul','AFG','Kabol',1780000);
> INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES
> (2,'Qandahar','AFG','Qandahar',237500);
>
> Then execute SELECT COUNT(*) FROM city;
>
> Normal.
>
> Then kill one node.
>
> Then execute SELECT COUNT(*) FROM city;
>
> Failed to execute query because cache partition has been lostPart
> [cacheName=City, part=0]
>
> This is also normal.
>
> Next, start the node that was shut down before.
>
> Then execute SELECT COUNT(*) FROM city;
>
> Failed to execute query because cache partition has been lostPart
> [cacheName=City, part=0]
>
> At this time, all partitions have been recovered, and all baseline nodes
> are ONLINE. Why is this error still reported? It is very confusing. Executing
> the reset_lost_partitions operation at this point seems redundant. Are there
> any special considerations here?
>
> If the whole cluster is restarted at this point and SELECT COUNT(*)
> FROM city; is executed again, it works normally. This state is the same as
> the previous state, but the behavior is different.
>