[jira] [Updated] (CAUSEWAY-3775) Improve background commands job failure handling

Daniel Keir Haywood (Jira) Wed, 12 Jun 2024 00:58:43 -0700


     [ 
https://issues.apache.org/jira/browse/CAUSEWAY-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Daniel Keir Haywood updated CAUSEWAY-3775:
------------------------------------------
    Description: 
was: Suspect that failure of background command results in locks being held as 
attempt to sync exception.

 

having looked at the code, can't find a root cause for this per se, but there 
were several bugs:

1) we call transactionService.callInTran(...) 3 times.  If there's a failure 
then the two innermost correctly detect it and set the xactn to MUST_ABORT; 
the outermost one didn't and attempted to commit ... but then the MUST_ABORT 
is noticed and so it is effectively a rollback after all.

2) if there is a failure then the startedAt on the CommandLogEntry, which is 
initially populated when the execution starts, ends up being rolled back to 
null.  Meanwhile, the recovery processing was setting the completedAt and the 
exception stacktrace, but the startedAt was left untouched, ie null.

Fortuitously, the query to find non-started commands looks at the startedAt 
rather than completedAt, and so the command is re-executed.

3) the logic to execute each command in its own interaction was using 
REQUIRES_NEW, which suspends the top-level transaction that's started 
automatically when the interactionService.call..() is called.  The failure 
within the nested transaction sets that new transaction to MUST_ABORT, but once 
it has finished, the original transaction is resumed and that is still healthy, 
resulting in the onCompletion publishing being performed as normal.  This would 
include the audit trail (EntityChangeTrackerDefault).
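
Bug 3 can be illustrated with a toy model.  This is only a sketch: the Tx 
class and the runInNewTx / runInSameTx helpers are hypothetical stand-ins, 
not Causeway's or Spring's API.  It shows why a failure inside a REQUIRES_NEW 
transaction leaves the suspended outer transaction healthy, so completion 
publishing still runs.

```java
/** Toy model of transaction propagation; all names here are hypothetical. */
final class RequiresNewSketch {

    enum TxState { HEALTHY, MUST_ABORT }

    /** Minimal stand-in for a transaction: it only carries a state flag. */
    static final class Tx {
        TxState state = TxState.HEALTHY;
    }

    /** Models REQUIRES_NEW: the work runs in a fresh transaction while the outer one is suspended. */
    static void runInNewTx(Runnable work) {
        Tx inner = new Tx();
        try {
            work.run();
        } catch (RuntimeException e) {
            inner.state = TxState.MUST_ABORT; // only the inner transaction is marked
        }
        // the inner transaction ends here; the outer one resumes untouched
    }

    /** Models joining the existing transaction: a failure marks that same transaction. */
    static void runInSameTx(Tx outer, Runnable work) {
        try {
            work.run();
        } catch (RuntimeException e) {
            outer.state = TxState.MUST_ABORT;
        }
    }

    /** Returns true if completion publishing (eg auditing) would still run after a failing command. */
    static boolean publishesOnCompletion(boolean useRequiresNew) {
        Tx outer = new Tx();
        Runnable failingCommand = () -> { throw new IllegalStateException("command failed"); };
        if (useRequiresNew) {
            runInNewTx(failingCommand);
        } else {
            runInSameTx(outer, failingCommand);
        }
        return outer.state == TxState.HEALTHY;
    }

    public static void main(String[] args) {
        System.out.println("REQUIRES_NEW, publishes after failure: " + publishesOnCompletion(true));
        System.out.println("joined tx, publishes after failure: " + publishesOnCompletion(false));
    }
}
```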

the fixes are:

1) make sure that rollback is called for the outermost.

2) provide a new config option 
(causeway.extensions.command-log.run-background-commands.on-failure-policy) to 
decide whether to STOP_THE_LINE (this was the accidental but most safe original 
behaviour) or to CONTINUE_WITH_NEXT.  If the latter, then we do correctly log 
the exception and move on.

3) remove the unnecessary nested transactions with REQUIRES_NEW, so that the 
failure is correctly captured.  This means that no command completion listeners 
(eg auditing) are invoked if there was a failure.
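
For the config option in fix 2, the following sketch shows the property key 
and its two values being read from properties text.  The OnFailurePolicy enum 
name and the read(...) helper are assumptions for illustration; how Causeway 
itself binds the property is not shown here, and no default value is assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Properties;

/** Sketch only: reads the policy from properties text using plain java.util.Properties. */
final class OnFailurePolicyConfigSketch {

    /** Hypothetical enum mirroring the two values named in this issue. */
    enum OnFailurePolicy { STOP_THE_LINE, CONTINUE_WITH_NEXT }

    /** The key as given in this issue. */
    static final String KEY =
            "causeway.extensions.command-log.run-background-commands.on-failure-policy";

    /** Returns the configured policy, or empty if the key is absent. */
    static Optional<OnFailurePolicy> read(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty(KEY);
        if (raw == null) {
            return Optional.empty();
        }
        return Optional.of(OnFailurePolicy.valueOf(raw.trim()));
    }

    public static void main(String[] args) {
        String text = KEY + "=CONTINUE_WITH_NEXT\n";
        System.out.println(read(text));
    }
}
```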

 

to summarise, the new behaviour is (or is intended to be):
 * if onFailurePolicy = STOP_THE_LINE   (the safe, fail-fast mode)
 ** execute command (CommandLogEntry)
 *** if succeeds, then run publish on interaction completion (auditing etc)
 *** if deadlocks, attempt up to 3x else treat as failure
 *** if fails, then do nothing.  
 ** The quartz job will pick up the _same_ command again in 10 seconds.  
 * if onFailurePolicy = CONTINUE_WITH_NEXT
 ** execute command
 *** if succeeds, then run publish on interaction completion (auditing etc)
 *** if deadlocks, attempt up to 3x else treat as failure
 *** if fails, then update CommandLogEntry as completed and its exception prop 
to hold the stack trace.  
 ** the quartz job will pick up the next command in 10 seconds.
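
The summarised behaviour can be sketched as a small simulation.  This is not 
the actual job code: Entry, DeadlockException and tick(...) are hypothetical 
names, an entry counts as pending while it is not completed (a 
simplification of the startedAt query), and it is assumed that one job run 
ends at the first failure.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulation of the two failure policies; all types here are illustrative stand-ins. */
final class BackgroundCommandsPolicySketch {

    enum OnFailurePolicy { STOP_THE_LINE, CONTINUE_WITH_NEXT }

    /** Hypothetical marker for a deadlock reported while executing a command. */
    static final class DeadlockException extends RuntimeException { }

    /** Simplified stand-in for a CommandLogEntry. */
    static final class Entry {
        final String name;
        final Runnable action;
        boolean completed;
        boolean published;      // completion publishing (auditing etc) ran
        Throwable exception;    // the real entry would hold the stack trace
        Entry(String name, Runnable action) {
            this.name = name;
            this.action = action;
        }
    }

    static final int MAX_ATTEMPTS = 3;

    /** Runs the command, retrying on deadlock up to 3 attempts; returns null on success. */
    static RuntimeException tryExecute(Entry entry) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                entry.action.run();
                return null;
            } catch (DeadlockException ex) {
                last = ex;      // deadlock: try again
            } catch (RuntimeException ex) {
                return ex;      // any other failure: no retry
            }
        }
        return last;            // still deadlocking after 3 attempts: treat as failure
    }

    /** One run of the job over the log, oldest entry first. */
    static void tick(List<Entry> log, OnFailurePolicy policy) {
        for (Entry entry : log) {
            if (entry.completed) {
                continue;
            }
            RuntimeException failure = tryExecute(entry);
            if (failure == null) {
                entry.completed = true;
                entry.published = true;
                continue;
            }
            if (policy == OnFailurePolicy.CONTINUE_WITH_NEXT) {
                entry.completed = true;     // the next run moves on to the next command
                entry.exception = failure;
            }
            // STOP_THE_LINE: do nothing, so the same entry is picked up by the next run
            return;
        }
    }

    public static void main(String[] args) {
        List<Entry> log = new ArrayList<>();
        log.add(new Entry("bad", () -> { throw new IllegalStateException("boom"); }));
        log.add(new Entry("good", () -> { }));
        tick(log, OnFailurePolicy.CONTINUE_WITH_NEXT);
        tick(log, OnFailurePolicy.CONTINUE_WITH_NEXT);
        for (Entry e : log) {
            System.out.println(e.name + " completed=" + e.completed + " published=" + e.published);
        }
    }
}
```

With STOP_THE_LINE the failing entry is left untouched, so it blocks the 
queue until someone intervenes; with CONTINUE_WITH_NEXT it is marked 
completed with its exception recorded and is never published.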

 

 

 

> Improve background commands job failure handling
> ------------------------------------------------
>
>                 Key: CAUSEWAY-3775
>                 URL: https://issues.apache.org/jira/browse/CAUSEWAY-3775
>             Project: Causeway
>          Issue Type: Bug
>          Components: Core, Ext Core CommandLog
>    Affects Versions: 2.0.0
>            Reporter: Daniel Keir Haywood
>            Assignee: Daniel Keir Haywood
>            Priority: Minor
>             Fix For: 2.1.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
