[
https://issues.apache.org/jira/browse/IGNITE-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18072320#comment-18072320
]
Roman Puchkovskiy commented on IGNITE-28225:
--------------------------------------------
Here is a proposal.
All exceptions happening during handshake can be partitioned into 2 classes:
# 'no use to retry until something changes on the other side' (of I/O
exceptions those are ConnectException and NoRouteToHostException; plus
HandshakeException that signals that the other side does not understand our
handshake protocol); let's call it T (for Terminal)
# and 'others'; let's call it O
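A minimal sketch of this partition, to make the two classes concrete. The nested `HandshakeException` here is a hypothetical stand-in for the real handshake exception, and walking the cause chain (with a depth bound) is an assumption, not something the proposal prescribes:

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;

/** Sketch only: classifies a handshake failure as T (terminal) or O (other). */
final class HandshakeFailures {
    /** Hypothetical stand-in for the exception that signals a protocol mismatch. */
    static class HandshakeException extends RuntimeException {
        HandshakeException(String message) {
            super(message);
        }
    }

    /** The two classes from the proposal. */
    enum FailureClass { TERMINAL, OTHER }

    /** Assumed bound on how deep the cause chain is inspected. */
    private static final int MAX_CAUSE_DEPTH = 32;

    private HandshakeFailures() {
    }

    static FailureClass classify(Throwable error) {
        Throwable current = error;
        // Failures are often wrapped (for example in CompletionException),
        // so inspect the causes as well as the top-level exception.
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof ConnectException
                    || current instanceof NoRouteToHostException
                    || current instanceof HandshakeException) {
                return FailureClass.TERMINAL;
            }
            current = current.getCause();
        }
        return FailureClass.OTHER;
    }
}
```

With this sketch, a plain `IOException` (say, a connection reset) is classified as O, while a `ConnectException` wrapped in a `CompletionException` is classified as T.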
What should happen in send/invoke of MessagingService:
# O we always retry
# T:
## in send by address, never retry, just communicate the exception to the
caller
## in send/invoke by name, retry until the recipient falls off the Physical
Topology (PT); after that, return RecipientLeftException
## in send/invoke by ephemeral ID, retry until we know that the ID has become
stale (in which case return RecipientLeftException)
### in invoke by ephemeral ID, additionally track the timeout (if it's
triggered, return TimeoutException just in case, as on the higher level a
TimeoutException will be triggered)
### in send by ephemeral ID, if we never get information about a node getting
stale, we'll retry forever
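The rules above can be read as a small decision table. The sketch below encodes them as a pure function; all names (`Target`, `Decision`, `decide`) are hypothetical and not part of MessagingService. Checking staleness before the invoke timeout is an assumption, since the proposal does not fix an order:

```java
/** Sketch only: the retry rules expressed as a decision function. */
final class RetryDecisionSketch {
    enum FailureClass { TERMINAL, OTHER }

    /** How the recipient was specified by the caller. */
    enum Target { ADDRESS, NAME, EPHEMERAL_ID }

    enum Decision { RETRY, FAIL_WITH_ORIGINAL, FAIL_RECIPIENT_LEFT, FAIL_TIMEOUT }

    private RetryDecisionSketch() {
    }

    /**
     * @param recipientGone for NAME: the node fell off the Physical Topology;
     *                      for EPHEMERAL_ID: the ID is known to be stale.
     * @param invokeTimedOut true only for an invoke whose timeout has fired;
     *                       plain sends always pass false.
     */
    static Decision decide(FailureClass failure, Target target,
                           boolean recipientGone, boolean invokeTimedOut) {
        if (failure == FailureClass.OTHER) {
            return Decision.RETRY; // O: always retry
        }
        switch (target) {
            case ADDRESS:
                return Decision.FAIL_WITH_ORIGINAL; // never retry T by address
            case NAME:
                return recipientGone ? Decision.FAIL_RECIPIENT_LEFT : Decision.RETRY;
            case EPHEMERAL_ID:
                if (recipientGone) {
                    return Decision.FAIL_RECIPIENT_LEFT;
                }
                if (invokeTimedOut) {
                    return Decision.FAIL_TIMEOUT;
                }
                // A send by ID with no staleness information ends up here forever.
                return Decision.RETRY;
            default:
                throw new IllegalArgumentException("Unknown target: " + target);
        }
    }
}
```

For a send by ephemeral ID, `invokeTimedOut` is always false, so the only exit from the retry loop is `recipientGone` becoming true.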
The last sub-sub-item is unpleasant. But I see that usages of send by ephemeral
ID fall into 2 groups:
# Most of them are for components that operate on Logical Topology (LT)
nodes
# Just one is for cluster initializer
So we could split send-by-ephemeral-ID into 2 methods:
# Send by ID that respects LT. It keeps retrying T only until the node falls
off the LT
# Send by ID that doesn't care about LT - it would be used by the cluster
initializer; there, potential infinite retries don't seem to be a problem
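One possible shape for this split, as a sketch: the two methods differ only in the predicate that answers "keep retrying T for this recipient?". The names below are hypothetical, and representing the ephemeral ID as a `UUID` is an assumption. The stale-ID check from the rules above would still apply to both variants and is left out here:

```java
import java.util.Set;
import java.util.UUID;
import java.util.function.Predicate;

/** Sketch only: retry predicates for the two proposed send-by-ID variants. */
final class SendByIdSplitSketch {
    private SendByIdSplitSketch() {
    }

    /**
     * Variant that respects the LT: keep retrying T only while the recipient
     * is still present in the given (live) view of logical topology node IDs.
     */
    static Predicate<UUID> respectingLogicalTopology(Set<UUID> logicalTopologyIds) {
        // The set is consulted on every call, so later topology changes are seen.
        return logicalTopologyIds::contains;
    }

    /**
     * Variant that ignores the LT, intended for the cluster initializer:
     * T is retried unconditionally.
     */
    static Predicate<UUID> ignoringLogicalTopology() {
        return id -> true;
    }
}
```

Components that operate on LT nodes would use the first predicate and stop retrying once the node leaves the LT; the cluster initializer would use the second and accept potentially infinite retries.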
> Retry handshake if a network issue appeared
> -------------------------------------------
>
> Key: IGNITE-28225
> URL: https://issues.apache.org/jira/browse/IGNITE-28225
> Project: Ignite
> Issue Type: Bug
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
> In
> `org.apache.ignite.internal.network.DefaultMessagingService#sendViaNetwork`
> we only retry opening the channel in `sender.send`, while expecting that
> `channelFuture` must be acquired successfully.
> This might not be the case, we should retry the channel creation with a retry
> strategy that would depend on a specific API call (send by address / name /
> ID / etc.)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)