[
https://issues.apache.org/jira/browse/IGNITE-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987353#comment-17987353
]
Roman Puchkovskiy commented on IGNITE-24942:
--------------------------------------------
Retries are now scheduled instead of being executed right away.
Exception handling improvements have been split out into IGNITE-25815.
The race between starting a Raft node and registering a replica (found
while working on this issue) is described in IGNITE-25814.
> StackOverflowError in PartitionMover
> ------------------------------------
>
> Key: IGNITE-24942
> URL: https://issues.apache.org/jira/browse/IGNITE-24942
> Project: Ignite
> Issue Type: Bug
> Reporter: Roman Puchkovskiy
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
> Fix For: 3.1
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> PartitionMover retries when it gets an exception. Retries are made on every
> exception (including non-retryable ones), there is no retry limit,
> and the retries may happen in the same thread, which sometimes leads to
> unbounded recursion (resulting in StackOverflowError) if something is broken.
> # We need to differentiate between retryable and non-retryable exceptions.
> # For non-retryable ones, we should call FailureManager right away and stop
> retrying.
> # For retryable ones, we should add a retry counter and stop treating an
> exception as retryable when the counter reaches some limit (that is, stop
> retrying and notify FailureManager).
> # Maybe we should initiate retries in a separate thread pool to avoid a stack
> overflow if there are many retries (or simply pick a max retry count that is
> small enough not to trigger a stack overflow).
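
The four points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual PartitionMover change: the class name BoundedScheduledRetry, the retryable predicate, and the failureHandler callback (a stand-in for notifying FailureManager) are all hypothetical. Each retry is handed to a ScheduledExecutorService, so it starts on a fresh stack instead of nesting inside the previous attempt.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper illustrating the proposal; not the actual Ignite code. */
final class BoundedScheduledRetry {
    private final ScheduledExecutorService scheduler;
    private final Predicate<Throwable> retryable;     // assumption: the caller classifies exceptions
    private final int maxRetries;
    private final long delayMillis;
    private final Consumer<Throwable> failureHandler; // stand-in for notifying FailureManager

    BoundedScheduledRetry(ScheduledExecutorService scheduler, Predicate<Throwable> retryable,
            int maxRetries, long delayMillis, Consumer<Throwable> failureHandler) {
        this.scheduler = scheduler;
        this.retryable = retryable;
        this.maxRetries = maxRetries;
        this.delayMillis = delayMillis;
        this.failureHandler = failureHandler;
    }

    /** Runs the operation, retrying retryable failures at most maxRetries times. */
    <T> CompletableFuture<T> execute(Supplier<CompletableFuture<T>> operation) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(operation, result, 0);
        return result;
    }

    private <T> void attempt(Supplier<CompletableFuture<T>> operation,
            CompletableFuture<T> result, int retriesSoFar) {
        CompletableFuture<T> future;
        try {
            future = operation.get();
        } catch (RuntimeException e) {
            // Treat a synchronous throw the same way as a failed future.
            future = CompletableFuture.failedFuture(e);
        }

        future.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
                return;
            }
            Throwable cause = unwrap(error);
            if (!retryable.test(cause) || retriesSoFar >= maxRetries) {
                // Non-retryable, or the retry budget is exhausted: report and stop.
                failureHandler.accept(cause);
                result.completeExceptionally(cause);
                return;
            }
            try {
                // The next attempt runs later on the scheduler thread, not in this call stack.
                scheduler.schedule(() -> attempt(operation, result, retriesSoFar + 1),
                        delayMillis, TimeUnit.MILLISECONDS);
            } catch (RuntimeException e) {
                // For example, the scheduler has been shut down and rejects the task.
                failureHandler.accept(e);
                result.completeExceptionally(e);
            }
        });
    }

    private static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }
}
```

With maxRetries set to 3, an operation that always fails with a retryable exception is invoked four times in total (the initial attempt plus three retries) before the failure handler is notified; a non-retryable exception is reported after the first attempt.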