Seems I've got the picture now, thanks.
There should be only two ways to reset a lost partition: to gain an owner
from a resurrected node first, or to remove the ex-owner from the baseline
(the partition will be reassigned).
And we should make a decision for every lost partition before calling the
reset.
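
A minimal sketch of the second path (assuming the Ignite 2.x Java API;
imports omitted, the cache name "myCache" is illustrative, and the baseline
can also be changed via control.sh --baseline):

    Ignite ignite = Ignition.ignite();

    // Drop dead ex-owners by resetting the baseline to the alive server
    // nodes (ClusterNode extends BaselineNode, so the collection is accepted).
    ignite.cluster().setBaselineTopology(
        new ArrayList<>(ignite.cluster().forServers().nodes()));

    // Every lost partition now either has an owner or has been reassigned,
    // so the reset is allowed to proceed.
    ignite.resetLostPartitions(Collections.singleton("myCache"));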

On Wed, May 6, 2020 at 8:02 PM Alexei Scherbakov <
alexey.scherbak...@gmail.com> wrote:

> Wed, May 6, 2020 at 12:54, Anton Vinogradov <a...@apache.org>:
>
> > Alexei,
> >
> > 1,2,4,5 - looks good to me, no objections here.
> >
> > >> 3. Lost state is impossible to reset if a topology doesn't have at least
> > >> one owner for each lost partition.
> >
> > Do you mean that, according to your example, where
> > >> a node2 has left, soon a node3 has left. If the node2 is returned to
> > >> the topology first, it would have stale data for some keys.
> > we have to have node2 in the cluster to be able to reset "lost" to node2's
> > data?
> >
>
> Not sure if I understand the question, but I'll try to answer using an example:
> Assume 3 nodes n1, n2, n3, 1 backup, persistence enabled, partition p is
> owned by n2 and n3.
> 1. Topology is activated.
> 2. cache.put(p, 0) // n2 and n3 have p->0, updateCounter=1
> 3. n2 has failed.
> 4. cache.put(p, 1) // n3 has p->1, updateCounter=2
> 5. n3 has failed, partition loss has happened.
> 6. n2 joins the topology; it has stale data (p->0)
>
> We actually have 2 issues:
> 7. cache.put(p, 2) will succeed: n2 has p->2 while n3 still has p->1; the
> data has diverged and will not be adjusted by counter rebalancing if n3
> later joins the topology.
> or
> 8. n3 joins the topology; it has the actual data (p->1), but rebalancing
> will not work because the joining node has the highest counter (it can only
> be a demander in this scenario).
>
> In both cases rebalancing by counters will not work, causing data divergence
> between copies.
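>
> The sequence above as a rough test-style sketch (stopGrid/startGrid stand
> for test-framework-style helpers and p is a key mapping to the affected
> partition; both are illustrative):
>
>     IgniteCache<Integer, Integer> cache = ignite.cache("c"); // 1 backup
>     cache.put(p, 0);  // n2, n3: p->0, updateCounter=1
>     stopGrid("n2");   // n2 fails
>     cache.put(p, 1);  // n3: p->1, updateCounter=2
>     stopGrid("n3");   // last owner gone: partition loss
>     startGrid("n2");  // n2 returns with stale data (p->0)
>     cache.put(p, 2);  // issue 7: succeeds; n2: p->2, offline n3 keeps p->1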
>
>
> >
> > >> at least one owner for each lost partition.
> > What is the reason to have owners for all lost partitions when we want to
> > reset only some (the available ones)?
> >
>
> It was never possible to reset only a subset of lost partitions. The reason
> is to keep the guarantee of the resetLostPartitions method: all cache
> operations are resumed and the data is correct.
>
>
> > Will it be possible to perform operations on non-lost partitions when the
> > cluster has at least one lost partition?
> >
>
> Yes, it will be.
>
>
> >
> > On Wed, May 6, 2020 at 11:45 AM Alexei Scherbakov <
> > alexey.scherbak...@gmail.com> wrote:
> >
> > > Folks,
> > >
> > > I've almost finished a patch bringing some improvements to the data loss
> > > handling code, and I wish to discuss the proposed changes with the
> > > community before submitting.
> > >
> > > *The issue*
> > >
> > > During the grid's lifetime, it's possible to get into a situation when
> > > some data nodes have failed or were mistakenly stopped. If the number
> > > of stopped nodes exceeds a certain threshold depending on the
> > > configured backups count, a data loss will occur. For example, a grid
> > > having one backup (meaning at least two copies of each data partition
> > > exist at the same time) can tolerate only one node loss at a time.
> > > Generally, data loss is guaranteed to occur if backups + 1 or more
> > > nodes have failed simultaneously, using the default affinity function.
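> > >
> > > A minimal configuration sketch of the example above (imports omitted;
> > > the cache name and types are illustrative):
> > >
> > >     // One backup means two copies of each partition: the grid tolerates
> > >     // losing one node at a time, but not backups + 1 = 2 nodes at once.
> > >     CacheConfiguration<Integer, Integer> ccfg =
> > >         new CacheConfiguration<>("deposits");
> > >     ccfg.setBackups(1);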
> > >
> > > For in-memory caches, data is lost forever. For persistent caches, data
> > > is not physically lost and becomes accessible again after the failed
> > > nodes are returned to the topology.
> > >
> > > Possible data loss should be taken into consideration while designing an
> > > application.
> > >
> > >
> > >
> > > *Consider an example: money is transferred from one deposit to another,
> > > and all nodes holding data for one of the deposits are gone. In such a
> > > case, a transaction temporarily cannot be completed until the cluster
> > > is recovered from the data loss state. Ignoring this can cause data
> > > inconsistency.*
> > > It is necessary to have an API telling us if an operation is safe to
> > > complete from the perspective of data loss.
> > >
> > > Such an API has existed for some time [1] [2] [3]. In short, a grid can
> > > be configured to switch caches to the partial availability mode if data
> > > loss is detected.
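> > >
> > > A sketch of how the existing API [2] [3] can be used (the cache name is
> > > illustrative):
> > >
> > >     // Check which partitions of the cache are marked as lost.
> > >     Collection<Integer> lost = ignite.cache("deposits").lostPartitions();
> > >
> > >     // Once the data is recoverable again, resume full availability.
> > >     if (!lost.isEmpty())
> > >         ignite.resetLostPartitions(Collections.singleton("deposits"));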
> > >
> > > Let's give two definitions according to the Javadoc for
> > > *PartitionLossPolicy*:
> > >
> > > ·   *Safe* (data loss handling) *policy* - cache operations are only
> > > available for non-lost partitions (PartitionLossPolicy != IGNORE).
> > >
> > > ·   *Unsafe policy* - cache operations are always possible
> > > (PartitionLossPolicy = IGNORE). If the unsafe policy is configured, lost
> > > partitions are automatically re-created on the remaining nodes if
> > > needed, or immediately owned if the last supplier has left during
> > > rebalancing.
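> > >
> > > The policy is set per cache via [1]; a sketch of the safe variant
> > > (continuing the illustrative configuration above):
> > >
> > >     // READ_WRITE_SAFE: reads and writes fail for lost partitions only;
> > >     // IGNORE would silently re-create them and hide the loss.
> > >     ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);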
> > >
> > > *What needs to be fixed*
> > >
> > > 1. The default loss policy is unsafe in the current implementation, even
> > > for persistent caches. It can result in unintentional data loss and the
> > > failure of business invariants.
> > >
> > > 2. Node restarts in a persistent grid with detected data loss will cause
> > > automatic resetting of the LOST state after the restart, even if the
> > > safe policy is configured. It can result in data loss or partition
> > > desync if not all nodes are returned to the topology, or if they are
> > > returned in the wrong order.
> > >
> > >
> > > *An example: a grid has three nodes, one backup. The grid is under load.
> > > First, node2 has left; soon node3 has left. If node2 is returned to the
> > > topology first, it will have stale data for some keys. The most recent
> > > data is on node3, which is not in the topology yet. Because the lost
> > > state was reset, all caches are fully available and will most probably
> > > become inconsistent even in safe mode.*
> > > 3. The configured loss policy doesn't provide the guarantees described
> > > in the Javadoc, depending on the cluster configuration [4]. In
> > > particular, the unsafe policy (IGNORE) cannot be guaranteed if the
> > > baseline is fixed (not automatically readjusted on node leave), because
> > > partitions do not automatically get reassigned on topology change, and
> > > no nodes exist to fulfill a read/write request. The same applies to
> > > READ_ONLY_ALL and READ_WRITE_ALL.
> > >
> > > 4. Calling resetLostPartitions doesn't provide a guarantee of full cache
> > > operation availability if the topology doesn't have at least one owner
> > > for each lost partition.
> > >
> > > The ultimate goal of the patch is to fix API inconsistencies and the
> > > most crucial bugs related to data loss handling.
> > >
> > > *The planned changes are:*
> > >
> > > 1. The safe policy is used by default, except for in-memory grids with
> > > baseline auto-adjust enabled [5] with zero timeout [6]. In the latter
> > > case, the unsafe policy is used by default. This protects from
> > > unintentional data loss.
> > >
> > > 2. The lost state is never reset in the case of grid node restarts (even
> > > a full restart). This makes real data loss impossible in persistent
> > > grids if the recovery instructions are followed.
> > >
> > > 3. The lost state is impossible to reset if the topology doesn't have at
> > > least one owner for each lost partition. If nodes are physically dead,
> > > they should first be removed from the baseline before calling
> > > resetLostPartitions.
> > >
> > > 4. READ_WRITE_ALL and READ_ONLY_ALL are subject to deprecation because
> > > their guarantees are impossible to fulfill when not on the full
> > > baseline.
> > >
> > > 5. Any operation that failed due to data loss contains
> > > CacheInvalidStateException as a root cause; a detection sketch follows.
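> > >
> > > A sketch of how a caller could detect this, assuming the behavior in
> > > item 5 (note that CacheInvalidStateException is an internal class
> > > today):
> > >
> > >     try {
> > >         cache.put(key, val);
> > >     }
> > >     catch (CacheException e) {
> > >         // Walk the cause chain looking for the data loss marker.
> > >         for (Throwable t = e; t != null; t = t.getCause()) {
> > >             if (t instanceof CacheInvalidStateException) {
> > >                 // The key's partition is lost: wait for recovery or
> > >                 // call resetLostPartitions after fixing the baseline.
> > >             }
> > >         }
> > >     }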
> > >
> > > In addition to code fixes, I plan to write a tutorial for safe data
> > > loss recovery in the persistent mode in the Ignite wiki.
> > >
> > > Any comments on the proposed changes are welcome.
> > >
> > > [1]
> > > org.apache.ignite.configuration.CacheConfiguration#setPartitionLossPolicy(PartitionLossPolicy
> > > partLossPlc)
> > > [2] org.apache.ignite.Ignite#resetLostPartitions(caches)
> > > [3] org.apache.ignite.IgniteCache#lostPartitions
> > > [4] https://issues.apache.org/jira/browse/IGNITE-10041
> > > [5] org.apache.ignite.IgniteCluster#baselineAutoAdjustEnabled(boolean)
> > > [6] org.apache.ignite.IgniteCluster#baselineAutoAdjustTimeout(long)
> > >
> >
>
>
> --
>
> Best regards,
> Alexei Scherbakov
>
