Hi Igniters,

Can you please comment on or correct the scenario described below? I assume
zero downtime and protection against data loss on node crashes are key
features of Ignite, but the procedure in failure scenarios is not clear
to me.

Thanks!

On 19.06.22 11:32, jay.et...@gmx.de wrote:
Hi,

yes, your 4-point initial setup is how we've been operating our cluster so far. 
For maintenance we go in reverse order (leave out point 2).

I'm very curious myself about the second part of your mail on how to correctly 
restore the cluster after a node crash.

Can anyone comment on this? Any help appreciated here.

Jay



-----Original Message-----
From: don.tequ...@gmx.de <don.tequ...@gmx.de>
Sent: Saturday, 11 June 2022 14:33
To: user@ignite.apache.org
Subject: Live fallback/backup scenario

Hi,

I'm experimenting with an Ignite cluster with multiple server nodes and multiple 
client nodes. My understanding is that with Ignite I can avoid data loss for all 
persistent caches and avoid downtime for all clients.

If the above assumption is correct, how do I manage the servers and baseline 
topology for this scenario?

Caches are configured with persistence enabled and:
              <property name="cacheMode" value="PARTITIONED" />
              <property name="backups" value="1" />
              <property name="atomicityMode" value="TRANSACTIONAL"/>
              <property name="writeSynchronizationMode" value="FULL_SYNC"/>
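
For context, these properties sit inside a Spring configuration roughly like 
the sketch below. The cache name "myCache" is a placeholder, and enabling 
persistence on the default data region is an assumption about the setup; the 
actual file may be laid out differently.

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd">
  <bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Assumption: native persistence is enabled on the default data region. -->
    <property name="dataStorageConfiguration">
      <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="defaultDataRegionConfiguration">
          <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
            <property name="persistenceEnabled" value="true"/>
          </bean>
        </property>
      </bean>
    </property>
    <property name="cacheConfiguration">
      <list>
        <bean class="org.apache.ignite.configuration.CacheConfiguration">
          <!-- "myCache" is a hypothetical name used only for illustration. -->
          <property name="name" value="myCache"/>
          <!-- Partitioned cache with one backup copy of every partition. -->
          <property name="cacheMode" value="PARTITIONED"/>
          <property name="backups" value="1"/>
          <property name="atomicityMode" value="TRANSACTIONAL"/>
          <!-- Writes wait for both the primary and the backup copy. -->
          <property name="writeSynchronizationMode" value="FULL_SYNC"/>
        </bean>
      </list>
    </property>
  </bean>
</beans>
```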

Is this procedure correct for initial setup?

1. Start all server nodes and keep the cluster inactive.
2. Once all server nodes are connected, set the baseline topology to all server 
nodes.
3. Activate the cluster.
4. Connect clients and start application operation with compute and persistent 
caches.

Let's say one server node crashes. I can see that operation continues without 
interruption and without data loss. However, what happens after the crashed 
server node is restarted and reconnects to the cluster? I can see that it does 
not automatically become a member of the baseline topology.

What's the correct procedure in the two scenarios below?

a) The persistence directory on the crashed server is still available (but most 
likely is not up to date with the latest data from the other caches). Do I add 
the server to the baseline topology, and will it then automatically get the 
data that it missed during its downtime?

b) The server was wiped and the persistence directory is empty when it's 
restarted. Do I add the server to the baseline topology, and will it then 
automatically participate in the shared data caches and become a potential 
full backup in case another server crashes?

How long does such a data sync take for the restarted server? Is there an event 
that signals when it has finished?

Thanks for some background on this scenario.
