Can you tell two stories, each starting with all of the nodes in the
intended cluster configuration down: one resulting in a successful cluster
startup, and the other detecting an invalid configuration and refusing to
start?

I can anticipate problems understanding what to do when the first node
attempts to start, but only has its own AZ represented in the topology. How
can this first node know whether future nodes will be able to fulfill the
condition AZ_count >= backup_replicas + 1? The general case, allowing
elastic deployment, requires individual Ignite nodes to work in a
best-effort capacity.
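
As a best-effort illustration only (the attribute name "AVAILABILITY_ZONE",
the backup count, and the config file name below are assumptions, not
something established in this thread), a node or an external watcher could
inspect the live topology after startup and refuse to proceed when the
visible AZs cannot satisfy that condition:

import java.util.HashSet;
import java.util.Set;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterNode;

public class AzTopologyCheck {
    public static void main(String[] args) {
        int backups = 2;                     // assumed backup count
        String azAttr = "AVAILABILITY_ZONE"; // assumed node attribute name

        try (Ignite ignite = Ignition.start("ignite-config.xml")) {
            // Collect the distinct AZ attribute values visible right now.
            Set<Object> azs = new HashSet<>();
            for (ClusterNode node : ignite.cluster().forServers().nodes()) {
                azs.add(node.attribute(azAttr));
            }
            // Best effort: early nodes cannot know whether more AZs will
            // join later, so this can only warn or stop this node.
            if (azs.size() < backups + 1) {
                throw new IllegalStateException("Only " + azs.size()
                    + " AZ(s) visible; need at least " + (backups + 1)
                    + " to place every copy in a distinct AZ.");
            }
        }
    }
}

The catch is that such a check is only as good as the topology at the
moment it runs, which is exactly the elasticity problem above.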

I would approach this from a DevOps perspective and just validate the
deployment before starting up any infrastructure: look at all of the
relevant config files that would be deployed, enumerate a projection of the
deployed nodes and their AZs, compare that against the desired backup
filter configuration, and fail with a deployment automation tool exception
before starting any Ignite nodes.
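
A minimal sketch of that pre-deployment check, assuming your tooling has
already extracted a planned node-to-AZ mapping and the backup count from
the config files (the names here are hypothetical):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DeploymentValidator {
    // Fail the deployment when the planned topology cannot host every copy
    // (primary + backups) in a distinct AZ.
    public static void validate(Map<String, String> plannedAzByNode, int backups) {
        Set<String> azs = new HashSet<>(plannedAzByNode.values());
        if (azs.size() < backups + 1) {
            throw new IllegalStateException("Planned topology spans "
                + azs.size() + " AZ(s) but " + (backups + 1)
                + " are needed to place " + backups
                + " backup(s) in distinct AZs; refusing to deploy.");
        }
    }

    public static void main(String[] args) {
        // Example from the thread: 3 nodes across 2 AZs with backups=2
        // fails fast here, before any Ignite node is started.
        validate(Map.of("node1", "az1", "node2", "az1", "node3", "az2"), 2);
    }
}

That gives the "refuses to start" story at the automation layer, before any
Ignite process comes up; the "successful startup" story is simply the same
check passing because the projection spans enough AZs.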

On Tue, Nov 1, 2022 at 9:49 AM Surinder Mehra <redni...@gmail.com> wrote:

> Thanks for your reply. Let me try to answer your 2 questions below.
> 1. I understand that it sacrifices the backups in case it can't place
> backups appropriately. The question is: is it possible to fail the
> deployment rather than risk having only a single copy of the data in the
> cluster? If that only copy goes down, we will have downtime, as the data
> won't be present in the cluster. We would rather throw an error when not
> enough hardware is present than risk a data unavailability issue during
> business activity.
>
> 2. Why do we want 3 copies of data? It's a design choice. We want to
> ensure that even if 2 nodes go down, we still have a 3rd present to serve
> the data.
>
> Hope I answered your questions.
>
> On Tue, 1 Nov 2022, 19:40 Jeremy McMillan, <jeremy.mcmil...@gridgain.com>
> wrote:
>
>> This question is a design question.
>>
>> What kinds of fault states do you expect to tolerate? What is your
>> failure budget?
>>
>> Why are you trying to distribute more than 2 copies of the data across
>> only two failure domains?
>>
>> Also, "fail fast" means discovering your implementation defects faster
>> than your release cycle, not how fast you can cause data loss.
>>
>> On Tue, Nov 1, 2022, 09:01 Surinder Mehra <redni...@gmail.com> wrote:
>>
>>> Gentle reminder.
>>> One additional question: we have observed that if the number of available
>>> AZs is less than the backups count, Ignite skips creating backups. Is this
>>> a correct understanding? If yes, how can we fail fast when backups cannot
>>> be placed due to the AZ limitation?
>>>
>>> On Mon, Oct 31, 2022 at 6:30 PM Surinder Mehra <redni...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> As per the link attached, to ensure primary and backup partitions are not
>>>> stored on the same node, we used the AWS AZ as a backup filter. Now I can
>>>> see that if I start two Ignite nodes on the same machine, primary
>>>> partitions are evenly distributed but backups are always zero, which is
>>>> expected.
>>>>
>>>>
>>>> https://www.gridgain.com/docs/latest/installation-guide/aws/multiple-availability-zone-aws
>>>>
>>>> My question is: what would happen if AZ-1 has 2 machines, AZ-2 has 1
>>>> machine, and the Ignite cluster has only 3 nodes, with each machine
>>>> hosting one Ignite node?
>>>>
>>>> Node1[AZ1] - keys 1-100
>>>> Node2[AZ1] - keys 101-200
>>>> Node3[AZ2] - keys 201-300
>>>>
>>>> In the above scenario, if the backup count is 2, how would backup
>>>> partitions be distributed?
>>>>
>>>> 1. Would it mean node 3 will have 2 backup copies of the primary
>>>> partitions of nodes 1 and 2?
>>>> 2. If we have a 4-node cluster with 2 nodes in each AZ, would backup
>>>> copies also be placed on different nodes (in other words, does the backup
>>>> filter also apply to how backup copies are placed on nodes)?
>>>>
>>>>
>>>>
