[ 
https://issues.apache.org/jira/browse/IGNITE-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18072320#comment-18072320
 ] 

Roman Puchkovskiy commented on IGNITE-28225:
--------------------------------------------

Here is a proposal.

All exceptions happening during handshake can be partitioned into 2 classes:
 # 'no use to retry until something changes on the other side' (of I/O 
exceptions those are ConnectException and NoRouteToHostException; plus 
HandshakeException that signals that the other side does not understand our 
handshake protocol); let's call it T (for Terminal)
 # and 'others'; let's call it O
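
A minimal sketch of this T/O split, to make the classification concrete. The 
class and enum names are hypothetical, the nested HandshakeException is only a 
stand-in for the real one so that the snippet compiles on its own, and walking 
the cause chain is my assumption (failures often arrive wrapped):

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;

/** Hypothetical helper; not part of the Ignite API. */
final class HandshakeFailureClassifier {
    /** T (Terminal) and O (Others) from the proposal. */
    enum FailureClass { TERMINAL, OTHER }

    /** Stand-in for the real HandshakeException, declared here only for self-containment. */
    static class HandshakeException extends RuntimeException {
        HandshakeException(String message) {
            super(message);
        }
    }

    static FailureClass classify(Throwable error) {
        // Assumption: inspect the whole cause chain, e.g. to see through CompletionException.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof ConnectException
                    || t instanceof NoRouteToHostException
                    || t instanceof HandshakeException) {
                return FailureClass.TERMINAL;
            }
        }
        // Everything else is considered worth retrying.
        return FailureClass.OTHER;
    }
}
```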

What should happen in send/invoke of MessagingService:
 # O: always retry
 # T:
 ## in send by address, never retry, just communicate the exception to the 
caller
 ## in send/invoke by name, retry until the recipient falls off the Physical 
Topology (PT); after that, return RecipientLeftException
 ## in send/invoke by ephemeral ID, retry until we know that the ID has become 
stale (in which case return RecipientLeftException)
 ### in invoke by ephemeral ID, additionally track the timeout (if it fires, 
return TimeoutException just in case, as the higher level will trigger a 
TimeoutException anyway)
 ### in send by ephemeral ID, if we never learn that the ID has become stale, 
we'll retry forever
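
The rules above can be read as a decision table. Here is a rough sketch of it 
as a pure function; all names are hypothetical, 'recipientKnownGone' stands 
for "fell off the PT" (by name) or "ID is known to be stale" (by ephemeral ID), 
and the invoke timeout is deliberately left out:

```java
/** Hypothetical decision table for the proposal; not actual Ignite code. */
final class HandshakeRetryPolicy {
    enum FailureClass { TERMINAL, OTHER }

    enum Addressing { BY_ADDRESS, BY_NAME, BY_EPHEMERAL_ID }

    enum Decision { RETRY, FAIL_WITH_ORIGINAL_ERROR, FAIL_RECIPIENT_LEFT }

    static Decision decide(FailureClass failure, Addressing addressing, boolean recipientKnownGone) {
        // O: always retry, whatever the addressing mode.
        if (failure == FailureClass.OTHER) {
            return Decision.RETRY;
        }

        // T: depends on how the recipient was addressed.
        switch (addressing) {
            case BY_ADDRESS:
                // Never retry, hand the exception to the caller.
                return Decision.FAIL_WITH_ORIGINAL_ERROR;
            case BY_NAME:
            case BY_EPHEMERAL_ID:
                // Retry until we know the recipient is gone, then RecipientLeftException.
                return recipientKnownGone ? Decision.FAIL_RECIPIENT_LEFT : Decision.RETRY;
            default:
                throw new AssertionError("Unknown addressing: " + addressing);
        }
    }
}
```

With this shape, the 'retry forever' case is visible directly: for 
BY_EPHEMERAL_ID the function keeps answering RETRY for as long as 
'recipientKnownGone' stays false.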

The last sub-sub-item is unpleasant. But I see that usages of send by ephemeral 
ID fall into 2 groups:
 # Most of them are in components that operate on Logical Topology (LT) nodes
 # Just one is in the cluster initializer

So we could split send-by-ephemeral-ID into 2 methods:
 # Send by ID that respects LT. It keeps retrying T only until the node falls 
off the LT
 # Send by ID that doesn't care about LT - it would be used by the cluster 
initializer; there, potential infinite retries don't seem to be a problem
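
A possible shape for the two variants, expressed only as their 'keep retrying 
T?' conditions. This is a sketch under assumptions: the names are made up, the 
ephemeral ID is modelled as a UUID, and the LT is modelled as a plain set of 
IDs that the caller keeps up to date:

```java
import java.util.Set;
import java.util.UUID;
import java.util.function.Predicate;

/** Hypothetical retry conditions for the two send-by-ephemeral-ID variants. */
final class EphemeralIdSendModes {
    /** LT-respecting variant: keep retrying T only while the recipient is in the LT. */
    static Predicate<UUID> retryWhileInLogicalTopology(Set<UUID> logicalTopologyIds) {
        // The set is consulted on every check, so later LT changes are observed.
        return logicalTopologyIds::contains;
    }

    /** LT-agnostic variant (cluster initializer): never stops retrying T on its own. */
    static Predicate<UUID> retryRegardlessOfLogicalTopology() {
        return id -> true;
    }
}
```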

> Retry handshake if a network issue appeared
> -------------------------------------------
>
>                 Key: IGNITE-28225
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28225
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> In 
> `org.apache.ignite.internal.network.DefaultMessagingService#sendViaNetwork` 
> we only retry opening the channel in `sender.send`, while expecting that 
> `channelFuture` must be acquired successfully.
> This might not be the case; we should retry the channel creation with a retry 
> strategy that depends on the specific API call (send by address / name / 
> ID / etc.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
