Andrzej Bialecki created SOLR-12480:
----------------------------------------
Summary: TriggerAction failures may cause inconsistent trigger behavior
Key: SOLR-12480
URL: https://issues.apache.org/jira/browse/SOLR-12480
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: AutoScaling
Affects Versions: 7.4, master (8.0)
Reporter: Andrzej Bialecki
The following issue occasionally appears when running
{{TestLargeCluster.testNodeLost}}.
The test kills a large number of nodes, waiting for a certain time between
kills. Depending on the kill sequence and the length of {{waitFor}}, the target
node of a MOVEREPLICA operation may itself have just been killed by the time
{{ExecutePlanAction}} processes it. This results in an exception and a FAILED
status of the action.
However, this failure is not reported back to the trigger as an unprocessed
event, because it happens asynchronously in the action executor (in
{{ScheduledTriggers}}). The trigger therefore resets its internal state and no
longer tracks the lost node. As a result, the replicas remain lost: even if
there’s a Policy violation the event will not be generated again, and the
number of replicas won’t return to the original count.
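The sequence above can be illustrated with a minimal, self-contained Java sketch. It does not use any Solr classes: {{LostNodeTracker}}, {{moveReplica}} and the {{reportFailureToTrigger}} flag are hypothetical stand-ins, and re-tracking the node on failure is only one conceivable remedy, not necessarily how this issue should be fixed.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TriggerFailureSketch {

  /** Hypothetical stand-in for the trigger's internal state (not a Solr class). */
  static class LostNodeTracker {
    private final Set<String> lostNodes = new HashSet<>();
    synchronized void markLost(String node) { lostNodes.add(node); }
    synchronized void forget(String node) { lostNodes.remove(node); }
    synchronized boolean isTracking(String node) { return lostNodes.contains(node); }
  }

  /** Hypothetical action: fails when the target node has just been killed. */
  static void moveReplica(String targetNode, boolean targetAlive) {
    if (!targetAlive) {
      throw new IllegalStateException("target node " + targetNode + " is down");
    }
  }

  /** Returns true if the lost node is still tracked after the action has run. */
  static boolean simulate(boolean reportFailureToTrigger) {
    LostNodeTracker tracker = new LostNodeTracker();
    tracker.markLost("node1");

    // The event is accepted for asynchronous processing, so the trigger
    // resets its state right away and stops tracking the lost node.
    tracker.forget("node1");

    ExecutorService actionExecutor = Executors.newSingleThreadExecutor();
    actionExecutor.execute(() -> {
      try {
        moveReplica("node2", false); // the target node was killed in the meantime
      } catch (IllegalStateException e) {
        // Log-only handling: the trigger never learns about the failure.
        System.err.println("action FAILED: " + e.getMessage());
        if (reportFailureToTrigger) {
          // Assumed remedy: hand the event back so the node is tracked again.
          tracker.markLost("node1");
        }
      }
    });
    actionExecutor.shutdown();
    try {
      actionExecutor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for the action", e);
    }
    return tracker.isTracking("node1");
  }

  public static void main(String[] args) {
    System.out.println("log only: still tracking = " + simulate(false));
    System.out.println("failure reported: still tracking = " + simulate(true));
  }
}
```

With log-only handling the tracker has forgotten {{node1}} even though the move failed, so nothing will generate the event again. When the failure is handed back, the node stays tracked and a later run can retry.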
Also, {{ScheduledTriggers:311}} and 323 only log the exception but don’t
fire listeners with FAILED status, which is a bug.
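The difference between only logging and also notifying listeners can be sketched as follows. The {{Stage}} enum, the {{Listener}} interface and {{runAction}} are hypothetical simplifications written for this illustration. They are not the actual code at the referenced lines of {{ScheduledTriggers}}.

```java
import java.util.ArrayList;
import java.util.List;

class ListenerOnFailureSketch {

  /** Hypothetical processing stages reported to listeners. */
  enum Stage { STARTED, SUCCEEDED, FAILED }

  /** Hypothetical listener callback. */
  interface Listener {
    void onEvent(Stage stage, String actionName, Exception error);
  }

  /** Runs one action and returns the stages that the listener observed. */
  static List<Stage> runAction(Runnable action, String name, boolean notifyOnFailure) {
    List<Stage> seen = new ArrayList<>();
    Listener listener = (stage, actionName, error) -> seen.add(stage);

    listener.onEvent(Stage.STARTED, name, null);
    try {
      action.run();
      listener.onEvent(Stage.SUCCEEDED, name, null);
    } catch (RuntimeException e) {
      // Reported behavior: the exception is only logged here.
      System.err.println("action " + name + " failed: " + e.getMessage());
      if (notifyOnFailure) {
        // Expected behavior: listeners also see the FAILED stage.
        listener.onEvent(Stage.FAILED, name, e);
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    Runnable failing = () -> { throw new IllegalStateException("target node is down"); };
    System.out.println("log only: " + runAction(failing, "execute_plan", false));
    System.out.println("with notification: " + runAction(failing, "execute_plan", true));
  }
}
```

In the log-only case a listener sees STARTED and nothing else, so anything that depends on listener notifications cannot tell that the action failed.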
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)