[ https://issues.apache.org/jira/browse/IGNITE-25137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vadim Pakhnushev updated IGNITE-25137:
--------------------------------------
Description:
Even after IGNITE-24910 is fixed, the test still fails.
This happens when a remote compute job is submitted from node1 to node2 and
the cancel query is executed on node3. In this case the {{ExecutionManager}} on
node3 doesn't contain the execution, so {{ComputeComponentImpl}} broadcasts the
cancel request to node1 and node2.
node1 holds a {{RemoteJobExecution}} which doesn't participate in the
cancellation after IGNITE-24910.
In addition, node1 holds a {{FailSafeJobExecution}} which sends a message
to node2.
node2 holds a {{DelegatingJobExecution}}.
One of the requests succeeds and cancels the job, returning {{true}}; the other
returns {{false}}.
The broadcast method in {{ComputeMessaging}} completes the result future with
the first received response, which may be {{false}}. When the {{true}}
response arrives, the future is already complete, so the result of the
cancel on node3 is {{false}}.
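The race described above can be reproduced in isolation. The following is a minimal sketch using only {{java.util.concurrent}}; the class {{FirstResponseWins}} and the method {{firstResponse}} are hypothetical names for illustration and are not the actual {{ComputeMessaging}} code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class FirstResponseWins {
    /**
     * Hypothetical stand-in for the broadcast: completes the result with
     * whichever response arrives first.
     */
    static CompletableFuture<Boolean> firstResponse(List<CompletableFuture<Boolean>> responses) {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        for (CompletableFuture<Boolean> response : responses) {
            // complete() only takes effect on the first call; later calls are no-ops.
            response.thenAccept(result::complete);
        }
        return result;
    }

    public static void main(String[] args) {
        CompletableFuture<Boolean> fromNode1 = new CompletableFuture<>();
        CompletableFuture<Boolean> fromNode2 = new CompletableFuture<>();
        CompletableFuture<Boolean> result = firstResponse(List.of(fromNode1, fromNode2));

        fromNode1.complete(false); // arrives first: this request did not cancel the job
        fromNode2.complete(true);  // arrives second: the job was cancelled, but the result is already set

        System.out.println(result.join()); // prints "false"
    }
}
```

If the responses arrive in the opposite order, the same code prints {{true}}, which matches the flaky behavior of the test.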
The solution is to assign a unique job id to the {{FailSafeJobExecution}}
rather than copying it from the underlying job.
This ensures that only the locally running {{DelegatingJobExecution}} is
cancelled, and that it is cancelled only once.
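The effect of the unique id can be sketched as follows. This is an illustrative model only: {{UniqueWrapperId}}, {{NodeRegistry}} and {{nodesActingOnCancel}} are hypothetical names, and the per-node map is an assumption about how executions are looked up by job id, not the real {{ExecutionManager}}.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class UniqueWrapperId {
    /** Hypothetical per-node registry: job id -> kind of execution registered under it. */
    static final class NodeRegistry {
        private final Map<UUID, String> executions = new HashMap<>();

        void register(UUID id, String executionKind) {
            executions.put(id, executionKind);
        }

        boolean handlesCancel(UUID id) {
            return executions.containsKey(id);
        }
    }

    /** Counts how many nodes would act on a cancel request broadcast for the given id. */
    static long nodesActingOnCancel(List<NodeRegistry> nodes, UUID id) {
        return nodes.stream().filter(node -> node.handlesCancel(id)).count();
    }

    public static void main(String[] args) {
        UUID jobId = UUID.randomUUID();

        // Before the fix: the wrapper on node1 copies the id of the underlying job.
        NodeRegistry node1 = new NodeRegistry();
        NodeRegistry node2 = new NodeRegistry();
        node1.register(jobId, "FailSafeJobExecution");
        node2.register(jobId, "DelegatingJobExecution");
        System.out.println(nodesActingOnCancel(List.of(node1, node2), jobId)); // prints "2"

        // After the fix: the wrapper on node1 is registered under its own unique id.
        NodeRegistry fixedNode1 = new NodeRegistry();
        fixedNode1.register(UUID.randomUUID(), "FailSafeJobExecution");
        System.out.println(nodesActingOnCancel(List.of(fixedNode1, node2), jobId)); // prints "1"
    }
}
```

With a single match, only one cancel reaches the job, so there is no competing {{false}} response from a second cancel attempt.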
was: Even after IGNITE-24910 is fixed, the test still fails sometimes.
> ItSqlKillCommandTest#killComputeJobFromRemote is flaky
> ------------------------------------------------------
>
> Key: IGNITE-25137
> URL: https://issues.apache.org/jira/browse/IGNITE-25137
> Project: Ignite
> Issue Type: Bug
> Components: compute
> Reporter: Vadim Pakhnushev
> Assignee: Vadim Pakhnushev
> Priority: Major
> Labels: ignite-3
> Time Spent: 10m
> Remaining Estimate: 0h