[
https://issues.apache.org/jira/browse/CAUSEWAY-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Keir Haywood updated CAUSEWAY-3775:
------------------------------------------
Description:
was: Suspect that failure of background command results in locks being held as
attempt to sync exception.
having looked at the code, can't find a root cause for this per se, but there
were several bugs:
1) we call transactionService.callInTran(...) 3 times, if there's a failure
then the two innermost correctly detect and result in setting the xactn to
MUST_ABORT, the outer most one didn't and attempted to commit ... but then the
MUST_ABORT is noticed and so it is effectively a commit after all.
2) if there is a failure then the startedAt on the CommandLogEntry, which
initially is popluated when the execution starts, but ends up being rolled back
to null if there's a failure. Meantime, the recovery processing was setting
the completedAt and the exception stacktrace, but the startedAt is left
untouched, ie null.
Fortuitiously, the query to find non-started queries looks at the startedAt
rather than completedAt, and so re-executes.
3) the logic to execute each command in its own interaction was using
REQUIRES_NEW, which suspends the top-level transaction that's started
automatically when the interactionService.call..() is called. the failure
within the transactions sets that new transaction as MUST_ABORT, but once
finished, the original transaction is resumed and that is still healthy -
resulting in the onCompletion publishing to be performed as normal. This would
include the audit trail (EntityChangeTrackerDefault).
the fixes are:
1) make sure that rollback is called for the outermost.
2) provide a new config option
(causeway.extensions.command-log.run-background-commands.on-failure-policy) to
decide whether to STOP_THE_LINE (this was the accidental but most safe original
behaviour) or to CONTINUE_WITH_NEXT. If the latter, then we do correctly log
the exception and move on.
3) remove the unnecessary nested transactions with REQUIRES_NEW, so that the
failure correctly is captured. This means that no command complete listeners
(eg auditing) is performed if there was a failure.
to summarise, the new behaviour is (or is intended to be):
* if onFailurePolicy = STOP_THE_LINE (the safe, fail-fast mode)
** execute command (CommandLogEntry)
*** if succeeds, then run publish on interactoin completion (auditing etc)
*** if deadlocks, attempt up to 3x else treat as failure
*** if fails, then do nothing.
** The quartz job will pick up the _same_ command again in 10 seconds.
* if onFailurePolicy = CONTINUE_WITH_NEXT
** execute command
*** if succeeds, then run publish on interactoin completion (auditing etc)
*** if deadlocks, attempt up to 3x else treat as failure
*** if fails, then update CommandLogEntry as completed and its exception prop
to hold the stack trace.
** the quartz job will pick up the next command in 10 seconds.
**
> Improve background commands job failure handling
> ------------------------------------------------
>
> Key: CAUSEWAY-3775
> URL: https://issues.apache.org/jira/browse/CAUSEWAY-3775
> Project: Causeway
> Issue Type: Bug
> Components: Core, Ext Core CommandLog
> Affects Versions: 2.0.0
> Reporter: Daniel Keir Haywood
> Assignee: Daniel Keir Haywood
> Priority: Minor
> Fix For: 2.1.0
>
>
> was: Suspect that failure of background command results in locks being held
> as attempt to sync exception.
>
> having looked at the code, can't find a root cause for this per se, but there
> were several bugs:
> 1) we call transactionService.callInTran(...) 3 times, if there's a failure
> then the two innermost correctly detect and result in setting the xactn to
> MUST_ABORT, the outer most one didn't and attempted to commit ... but then
> the MUST_ABORT is noticed and so it is effectively a commit after all.
> 2) if there is a failure then the startedAt on the CommandLogEntry, which
> initially is popluated when the execution starts, but ends up being rolled
> back to null if there's a failure. Meantime, the recovery processing was
> setting the completedAt and the exception stacktrace, but the startedAt is
> left untouched, ie null.
> Fortuitiously, the query to find non-started queries looks at the startedAt
> rather than completedAt, and so re-executes.
> 3) the logic to execute each command in its own interaction was using
> REQUIRES_NEW, which suspends the top-level transaction that's started
> automatically when the interactionService.call..() is called. the failure
> within the transactions sets that new transaction as MUST_ABORT, but once
> finished, the original transaction is resumed and that is still healthy -
> resulting in the onCompletion publishing to be performed as normal. This
> would include the audit trail (EntityChangeTrackerDefault).
> the fixes are:
> 1) make sure that rollback is called for the outermost.
> 2) provide a new config option
> (causeway.extensions.command-log.run-background-commands.on-failure-policy)
> to decide whether to STOP_THE_LINE (this was the accidental but most safe
> original behaviour) or to CONTINUE_WITH_NEXT. If the latter, then we do
> correctly log the exception and move on.
> 3) remove the unnecessary nested transactions with REQUIRES_NEW, so that the
> failure correctly is captured. This means that no command complete listeners
> (eg auditing) is performed if there was a failure.
>
> to summarise, the new behaviour is (or is intended to be):
> * if onFailurePolicy = STOP_THE_LINE (the safe, fail-fast mode)
> ** execute command (CommandLogEntry)
> *** if succeeds, then run publish on interactoin completion (auditing etc)
> *** if deadlocks, attempt up to 3x else treat as failure
> *** if fails, then do nothing.
> ** The quartz job will pick up the _same_ command again in 10 seconds.
> * if onFailurePolicy = CONTINUE_WITH_NEXT
> ** execute command
> *** if succeeds, then run publish on interactoin completion (auditing etc)
> *** if deadlocks, attempt up to 3x else treat as failure
> *** if fails, then update CommandLogEntry as completed and its exception
> prop to hold the stack trace.
> ** the quartz job will pick up the next command in 10 seconds.
> **
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)