Thanks for the suggestions. I will make sure the infrastructure is validated as suggested, and I will explore whether the topology validator can be used for this.
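To make the intent concrete, here is a minimal sketch of a TopologyValidator along those lines: the cache stays writable only while the current topology spans at least backups + 1 distinct availability zones. This is an assumption-laden sketch rather than working code from our deployment: the attribute name AVAILABILITY_ZONE and the hard-coded required-copies value are placeholders that would have to match the real cache configuration. As far as I understand, a TopologyValidator does not stop nodes from joining; when validation fails it only marks the cache invalid so updates are rejected, which is a weaker form of fail-fast than failing the deployment itself. A second sketch, showing how the AZ-based affinity backup filter from the linked GridGain guide is wired up, is included below the quoted thread for reference.

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.configuration.TopologyValidator;

// Sketch: the cache accepts writes only while the current topology spans at
// least REQUIRED_COPIES (backups + 1) distinct availability zones.
public class AzCountTopologyValidator implements TopologyValidator {
    // Assumed node attribute carrying the availability zone; must match the
    // attribute used by the affinity backup filter.
    private static final String AZ_ATTR = "AVAILABILITY_ZONE";

    // backups (2) + 1 primary copy, per the scenario discussed below.
    private static final int REQUIRED_COPIES = 3;

    @Override public boolean validate(Collection<ClusterNode> nodes) {
        Set<Object> zones = new HashSet<>();

        for (ClusterNode node : nodes) {
            if (!node.isClient()) {
                Object az = node.attribute(AZ_ATTR);

                if (az != null)
                    zones.add(az);
            }
        }

        // Returning false marks the cache invalid: updates are rejected until
        // enough distinct AZs have joined the topology.
        return zones.size() >= REQUIRED_COPIES;
    }
}

The validator would then be registered per cache, e.g. cacheCfg.setTopologyValidator(new AzCountTopologyValidator()).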
On Tue, 1 Nov 2022, 21:51 Jeremy McMillan, <jeremy.mcmil...@gridgain.com> wrote:

> Can you tell two stories which both start out with all nodes in the
> intended cluster configuration down, one story resulting in a successful
> cluster startup, and the other detecting an invalid configuration and
> refusing to start?
>
> I can anticipate problems understanding what to do when the first node
> attempts to start but only has its own AZ represented in the topology.
> How can this first node know whether future nodes will be able to fulfill
> the condition backup_replicas + 1 >= AZ_count? The general case, allowing
> elastic deployment, requires individual Ignite nodes to work in a
> best-effort capacity.
>
> I would approach this from a DevOps perspective and just validate the
> deployment before starting up any infrastructure. Look at all of the
> relevant config files which would be deployed. Enumerate a projection of
> the deployed nodes and their AZs. Compare this against the desired backup
> filter configuration, and fail before starting any Ignite nodes with a
> deployment automation tool exception.
>
> On Tue, Nov 1, 2022 at 9:49 AM Surinder Mehra <redni...@gmail.com> wrote:
>
>> Thanks for your reply. Let me try to answer your 2 questions below.
>>
>> 1. I understand that it sacrifices the backups in case it can't place
>> them appropriately. The question is: is it possible to fail the
>> deployment rather than risk having only a single copy of the data in
>> the cluster? If this only copy goes down, we will have downtime, as the
>> data won't be present in the cluster. We would rather throw an error
>> when enough hardware is not present than risk a data-unavailability
>> issue during business activity.
>>
>> 2. Why do we want 3 copies of the data? It's a design choice. We want
>> to ensure that even if 2 nodes go down, we still have a 3rd present to
>> serve the data.
>>
>> Hope I answered your questions.
>>
>> On Tue, 1 Nov 2022, 19:40 Jeremy McMillan, <jeremy.mcmil...@gridgain.com>
>> wrote:
>>
>>> This question is a design question.
>>>
>>> What kinds of fault states do you expect to tolerate? What is your
>>> failure budget?
>>>
>>> Why are you trying to distribute more than 2 copies of the data across
>>> only two failure domains?
>>>
>>> Also, "fail fast" means discovering your implementation defects faster
>>> than your release cycle, not how fast you can cause data loss.
>>>
>>> On Tue, Nov 1, 2022, 09:01 Surinder Mehra <redni...@gmail.com> wrote:
>>>
>>>> Gentle reminder.
>>>>
>>>> One additional question: we have observed that if the number of
>>>> available AZs is less than the backup count, Ignite skips creating
>>>> the backups. Is this the correct understanding? If yes, how can we
>>>> fail fast when backups cannot be placed due to the AZ limitation?
>>>>
>>>> On Mon, Oct 31, 2022 at 6:30 PM Surinder Mehra <redni...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> As per the link attached, to ensure primary and backup partitions
>>>>> are not stored on the same node, we used the AWS AZ as a backup
>>>>> filter. Now I can see that if I start two Ignite nodes on the same
>>>>> machine, primary partitions are evenly distributed but backups are
>>>>> always zero, which is expected.
>>>>>
>>>>> https://www.gridgain.com/docs/latest/installation-guide/aws/multiple-availability-zone-aws
>>>>>
>>>>> My question is: what would happen if AZ-1 has 2 machines, AZ-2 has 1
>>>>> machine, and the Ignite cluster has only 3 nodes, each machine
>>>>> having one Ignite node?
>>>>>
>>>>> Node1[AZ1] - keys 1-100
>>>>> Node2[AZ1] - keys 101-200
>>>>> Node3[AZ2] - keys 201-300
>>>>>
>>>>> In the above scenario, if the backup count is 2, how would backup
>>>>> partitions be distributed?
>>>>>
>>>>> 1. Would it mean Node3 will have 2 backup copies of the primary
>>>>> partitions of Node1 and Node2?
>>>>>
>>>>> 2. If we have a 4-node cluster with 2 nodes in each AZ, would backup
>>>>> copies also be placed on different nodes (in other words, does the
>>>>> backup filter also apply to how backup copies are placed on nodes)?
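For reference, since the questions above build on it, this is roughly how the AZ-based backup filter from the linked GridGain guide is wired up in code. It is a sketch under assumptions: the attribute name AVAILABILITY_ZONE, the cache name exampleCache, and backups = 2 are illustrative and have to match the real deployment. The filter only assigns a backup to a node whose AZ attribute differs from the primary's and from any already-chosen backups, which is why backups = 2 needs at least three distinct AZs in the topology; otherwise those backup copies are simply not assigned, matching the behaviour observed above.

import java.util.Collections;

import org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class AzAwareCacheConfig {
    // Assumed attribute name; must be set identically on every node and
    // referenced by the backup filter (and any topology validator).
    private static final String AZ_ATTR = "AVAILABILITY_ZONE";

    public static IgniteConfiguration configure(String localAz) {
        IgniteConfiguration igniteCfg = new IgniteConfiguration();

        // Each node advertises its own availability zone as a user attribute
        // (in the AWS setup this value would come from instance metadata).
        igniteCfg.setUserAttributes(Collections.singletonMap(AZ_ATTR, localAz));

        // Backups may only land on nodes whose AZ attribute differs from the
        // primary's and from already-selected backups, so backups = 2 needs
        // at least three distinct AZs; otherwise those copies are skipped.
        RendezvousAffinityFunction affinity = new RendezvousAffinityFunction();
        affinity.setAffinityBackupFilter(new ClusterNodeAttributeAffinityBackupFilter(AZ_ATTR));

        CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("exampleCache");
        cacheCfg.setBackups(2);
        cacheCfg.setAffinity(affinity);
        // cacheCfg.setTopologyValidator(new AzCountTopologyValidator()); // from the sketch above

        igniteCfg.setCacheConfiguration(cacheCfg);

        return igniteCfg;
    }
}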