[ https://issues.apache.org/jira/browse/IGNITE-20914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Puchkovskiy reassigned IGNITE-20914:
------------------------------------------

    Assignee: Roman Puchkovskiy

> Make ScaleCube's metadataTimeout configurable
> ---------------------------------------------
>
>                 Key: IGNITE-20914
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20914
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> ScaleCube's MembershipProtocolImpl periodically fetches each node's metadata
> (using GetMetaDataRequest). If it does not get a response before
> metadataTimeout expires, it appears to treat the node as no longer alive
> and generates a REMOVED event:
> [2023-11-17T00:20:22,153][WARN ][sc-cluster-3345-2][MembershipProtocol] [default:sqllogic1:1ca7b2f5308489d@10.233.107.205:3345][updateMembership][MEMBERSHIP_GOSSIP] Skipping to add/update member: {m: default:sqllogic0:6a78c57fcd0a496d@10.233.107.205:3344, s: ALIVE, inc: 9}, due to failed fetchMetadata call (cause: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 1000ms in 'source(MonoDefer)' (and no fallback has been configured))
> [2023-11-17T00:20:29,189][INFO ][sc-cluster-3345-2][MembershipProtocol] [default:sqllogic1:1ca7b2f5308489d@10.233.107.205:3345] Member left without notification: default:sqllogic0:6a78c57fcd0a496d@10.233.107.205:3344
> [2023-11-17T00:20:29,190][INFO ][sc-cluster-3345-2][MembershipProtocol] [default:sqllogic1:1ca7b2f5308489d@10.233.107.205:3345][publishEvent] MembershipEvent[type=REMOVED, member=default:sqllogic0:6a78c57fcd0a496d@10.233.107.205:3344, oldMetadata=1e61c6c8-154, newMetadata=null, timestamp=2023-11-17T00:20:29.189Z]
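The first step in the log (a metadata fetch that times out, after which the gossiped member update is skipped) can be modelled in isolation. This is a minimal sketch, not ScaleCube code: the class, the `fetchOrSkip` method and the `Outcome` enum are hypothetical, and `CompletableFuture.orTimeout` (Java 9+) stands in for the Reactor `timeout` operator that the log message suggests ScaleCube uses. It does not model how ScaleCube later declares the member REMOVED; it only shows that an unanswered request with a short timeout turns into a skipped update.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical model of a membership update guarded by a metadata fetch timeout (not ScaleCube code). */
public class MetadataFetchSketch {
    /** What happens to a gossiped member after trying to fetch its metadata. */
    public enum Outcome { UPDATED, SKIPPED_TIMEOUT }

    /**
     * Waits at most metadataTimeout for the metadata response.
     * On timeout the update is skipped, mirroring the "Skipping to add/update member" log line.
     */
    public static Outcome fetchOrSkip(CompletableFuture<String> metadataResponse, Duration metadataTimeout) {
        try {
            // orTimeout completes the future exceptionally with TimeoutException if no response arrives in time.
            metadataResponse.orTimeout(metadataTimeout.toMillis(), TimeUnit.MILLISECONDS).join();
            return Outcome.UPDATED;
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return Outcome.SKIPPED_TIMEOUT;
            }
            throw e; // any other failure is not modelled here
        }
    }

    public static void main(String[] args) {
        // A responsive node: the response is already available.
        System.out.println(fetchOrSkip(CompletableFuture.completedFuture("meta"), Duration.ofSeconds(1))); // prints UPDATED
        // A node under load that does not answer within the timeout.
        System.out.println(fetchOrSkip(new CompletableFuture<>(), Duration.ofMillis(100))); // prints SKIPPED_TIMEOUT
    }
}
```

A larger timeout only changes how long the caller waits before giving up; a slow but alive node then has a chance to answer instead of being skipped.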
>
> We should avoid such removals. It seems that 1 second might be too short for
> a node under load to respond.
> We should make this timeout configurable via Ignite configuration.
> It probably also makes sense to set a higher default (something like 10
> seconds). If the timeout expires, the node is removed from the physical
> topology and cannot return to it without a restart (this is what our
> connection establishment protocol requires). This makes the timeout critical
> for the stability of Ignite, even though it is probably not critical for an
> average ScaleCube-based application.
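A configurable timeout with the proposed 10-second default could be resolved roughly as below. This is a hypothetical sketch: `MetadataTimeoutSettings` and `fromConfiguredMillis` are illustrative names, not Ignite's actual configuration schema, and how the resulting value would be passed to ScaleCube is not specified in this issue. The only facts taken from the issue are the idea of a user-configurable value and the suggested 10-second default.

```java
import java.time.Duration;

/** Hypothetical settings holder for the proposed option; not Ignite's actual configuration schema. */
public final class MetadataTimeoutSettings {
    /** Proposed default from this issue: 10 seconds instead of the 1000 ms seen in the log. */
    public static final Duration DEFAULT_METADATA_TIMEOUT = Duration.ofSeconds(10);

    private final Duration metadataTimeout;

    private MetadataTimeoutSettings(Duration metadataTimeout) {
        this.metadataTimeout = metadataTimeout;
    }

    /**
     * Resolves the timeout from a user-supplied value in milliseconds (null if not configured).
     * Falls back to the default and rejects non-positive values.
     */
    public static MetadataTimeoutSettings fromConfiguredMillis(Long configuredMillis) {
        Duration timeout = (configuredMillis == null)
                ? DEFAULT_METADATA_TIMEOUT
                : Duration.ofMillis(configuredMillis);
        if (timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("metadataTimeout must be positive, got " + timeout.toMillis() + " ms");
        }
        return new MetadataTimeoutSettings(timeout);
    }

    /** Timeout in milliseconds, to be handed to the membership layer. */
    public long metadataTimeoutMillis() {
        return metadataTimeout.toMillis();
    }

    public static void main(String[] args) {
        System.out.println(fromConfiguredMillis(null).metadataTimeoutMillis());   // prints 10000
        System.out.println(fromConfiguredMillis(3_000L).metadataTimeoutMillis()); // prints 3000
    }
}
```

Validating at configuration time keeps a zero or negative timeout from silently turning every metadata fetch into an immediate failure.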



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
