Hello,

In general, there are two possible ways to handle lost partitions for a
cluster that uses Ignite Native Persistence:

1. - Return all failed nodes to the baseline topology.
   - Call resetLostPartitions.
2. - Stop all remaining nodes in the cluster.
   - Start all nodes in the cluster (including the previously failed ones)
     and activate the cluster.

It is important to return all failed nodes to the topology before calling
resetLostPartitions, otherwise the cluster could end up with stale data (a
minimal Java sketch of option 1 follows at the end of this message). If some
owners cannot be returned to the topology for some reason, they should be
excluded from the baseline before attempting to reset the lost partition
state, or a ClusterTopologyCheckedException will be thrown with the message
"Cannot reset lost partitions because no baseline nodes are online
[cache=someCache, partition=someLostPart]", indicating that safe recovery is
not possible.

In your particular case, the cache has no backups, so returning the node
that holds the lost partition should not lead to data inconsistencies. This
particular case could be detected and automatically "resolved"; I will file
a JIRA ticket to address this improvement.

Thanks,
Slava.

Mon, 26 Sep 2022 at 16:51, 38797715 <38797...@qq.com>:

> Hello,
>
> Start two nodes with native persistence enabled, and then activate the
> cluster.
>
> Create a table with no backups, using SQL like the following:
>
> CREATE TABLE City (
>   ID INT,
>   Name VARCHAR,
>   CountryCode CHAR(3),
>   District VARCHAR,
>   Population INT,
>   PRIMARY KEY (ID, CountryCode)
> ) WITH "template=partitioned, affinityKey=CountryCode, CACHE_NAME=City,
> KEY_TYPE=demo.model.CityKey, VALUE_TYPE=demo.model.City";
>
> INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES
> (1,'Kabul','AFG','Kabol',1780000);
> INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES
> (2,'Qandahar','AFG','Qandahar',237500);
>
> Then execute SELECT COUNT(*) FROM city;
>
> Normal.
>
> Then kill one node.
>
> Then execute SELECT COUNT(*) FROM city;
>
> Failed to execute query because cache partition has been lostPart
> [cacheName=City, part=0]
>
> This is also normal.
>
> Next, start the node that was shut down before.
>
> Then execute SELECT COUNT(*) FROM city;
>
> Failed to execute query because cache partition has been lostPart
> [cacheName=City, part=0]
>
> At this point, all partitions have been recovered and all baseline nodes
> are ONLINE. Why is this error still reported? It is very confusing.
> Executing the reset_lost_partitions operation at this point seems
> redundant. Are there any special considerations here?
>
> If the whole cluster is restarted at this point and SELECT COUNT(*) FROM
> city; is executed again, it works normally. The cluster state is the same
> as before, but the behavior is different.
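For reference, a minimal sketch of recovery option 1 through the Java API
could look like the following. The class name and configuration file path
are just placeholders, and the cache name "City" is taken from the example
in the quoted message:

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class ResetLostPartitionsSketch {
    public static void main(String[] args) {
        // Start a node using a placeholder configuration file.
        try (Ignite ignite = Ignition.start("client-config.xml")) {
            IgniteCache<?, ?> cache = ignite.cache("City");

            // Partitions currently marked as LOST for this cache.
            System.out.println("Lost before reset: " + cache.lostPartitions());

            // Succeeds only if all baseline owners of the lost partitions
            // are back online (or have been removed from the baseline).
            ignite.resetLostPartitions(Collections.singleton("City"));

            // After a successful reset the collection should be empty and
            // SELECT COUNT(*) FROM City works again.
            System.out.println("Lost after reset: " + cache.lostPartitions());
        }
    }
}

The same reset can also be triggered from the command line via control.sh
(the reset_lost_partitions cache command mentioned in the quoted message).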
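And if an owner can never be returned (for example, its storage is gone), a
sketch of excluding it from the baseline before the reset could look like
this. It simply re-sets the baseline to the server nodes that are currently
alive, which drops the offline node, and it assumes the baseline is managed
manually (i.e. baseline auto-adjust is not enabled):

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCluster;
import org.apache.ignite.Ignition;

public class ExcludeFailedOwnerSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start("client-config.xml")) {
            IgniteCluster cluster = ignite.cluster();

            // Shrink the baseline to the currently alive server nodes,
            // removing the permanently failed owner from it.
            cluster.setBaselineTopology(cluster.forServers().nodes());

            // The data that lived only on the removed node is accepted as
            // lost, and the lost partition state can now be reset.
            ignite.resetLostPartitions(Collections.singleton("City"));
        }
    }
}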