[jira] [Resolved] (IGNITE-18612) Make RAFT snapshot streaming resistant to network glitches

Roman Puchkovskiy (Jira) Mon, 01 Jul 2024 06:47:09 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-18612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Roman Puchkovskiy resolved IGNITE-18612.
----------------------------------------
    Resolution: Won't Fix

Not relevant, as network messages keep being delivered until the recipient is 
removed from the topology.

> Make RAFT snapshot streaming resistant to network glitches
> ----------------------------------------------------------
>
>                 Key: IGNITE-18612
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18612
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> Network is inherently unreliable (see IGNITE-18605). RAFT snapshot streaming 
> might take dozens of minutes. The current implementation breaks on the first 
> lost or improperly handled message, so if we get a glitch in the middle of a 
> long snapshot installation, we'll waste a lot of resources.
> The idea is to:
>  # Make requests idempotent (so that we can repeat a lost request). 
> {{SnapshotMetaRequest}} is already idempotent. {{SnapshotMvDataRequest}} and 
> {{SnapshotTxDataRequest}} can be provided with a sequenceNo (incremented on 
> the receiving side for each new request, but not for a retry of a 
> previous request). As there is only one receiver, and it always works 
> sequentially, the sending side only has to additionally remember 
> the previous response and its sequenceNo to support idempotent retries that 
> do not cause excessive cursor advancement.
>  # On the receiving side, specify a sane timeout (like a few seconds) per 
> request and retry requests that fail or time out (using the correct 
> sequenceNo).
>  # If an error happens while processing a request on the sending side, return 
> an indication of an error to the receiver instead of just dropping the 
> message (so that the receiver learns sooner that a retry is needed, AND 
> the receiver can see whether it should stop retrying if the 
> failure is fatal).
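
The first two steps of the proposal can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Ignite implementation: the names IdempotentChunkSender, ChunkTransport and RetryingChunkReceiver are hypothetical, chunks are plain strings, a null chunk stands for "no more data", and a per-request timeout is assumed to surface as an exception thrown by the transport. The error indication with a fatal/non-fatal flag (step 3) is not modelled.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sending side: serves chunks from a cursor and replays the last response on a retry. */
class IdempotentChunkSender {
    private final Iterator<String> cursor;
    private long lastSeqNo = -1;   // sequenceNo of the last request that advanced the cursor
    private String lastResponse;   // response cached for that sequenceNo

    IdempotentChunkSender(Iterator<String> cursor) {
        this.cursor = cursor;
    }

    /** Returns the chunk for seqNo, or null once the cursor is exhausted. */
    synchronized String handle(long seqNo) {
        if (seqNo == lastSeqNo) {
            return lastResponse;   // retry of the previous request: do not advance the cursor
        }
        if (seqNo != lastSeqNo + 1) {
            throw new IllegalStateException("Unexpected sequenceNo " + seqNo);
        }
        lastResponse = cursor.hasNext() ? cursor.next() : null;
        lastSeqNo = seqNo;
        return lastResponse;
    }
}

/** Hypothetical transport; assumed to throw on a lost message or a per-request timeout. */
interface ChunkTransport {
    String request(long seqNo) throws Exception;
}

/** Hypothetical receiving side: works sequentially and retries with the same sequenceNo. */
class RetryingChunkReceiver {
    static List<String> receiveAll(ChunkTransport transport, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        List<String> chunks = new ArrayList<>();
        for (long seqNo = 0; ; seqNo++) {   // incremented only after a successful response
            String chunk = requestWithRetry(transport, seqNo, maxAttempts);
            if (chunk == null) {
                return chunks;              // sender reported the end of the data
            }
            chunks.add(chunk);
        }
    }

    private static String requestWithRetry(ChunkTransport transport, long seqNo, int maxAttempts) {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return transport.request(seqNo);
            } catch (Exception e) {
                last = e;                   // lost message or timeout: retry with the same seqNo
            }
        }
        throw new IllegalStateException(
                "Request " + seqNo + " failed after " + maxAttempts + " attempts", last);
    }
}
```

The case this is meant to cover is a response that is lost after the sender has already advanced its cursor: because the retry carries the same sequenceNo, the sender replays the cached response rather than skipping a chunk.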



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
