[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068555#comment-16068555 ]
Andras Piros edited comment on OOZIE-2854 at 6/29/17 4:30 PM:
--------------------------------------------------------------

All functionalities are covered here. A blacklist-based {{javax.persistence.PersistenceException}} retry predicate provides a database-agnostic way of identifying which {{Exception}} instances originating from the JPA / DBCP / JDBC layers are good candidates to begin or continue retrying (see the sketch below this comment).

Tests covered in code:
* unit tests
** testing the retry handler, the retry predicate filter, and parallel calls to the JPA {{EntityManager}} (mostly Oozie database reads and writes) while injecting failures
* integration tests
** using the {{MiniOozieTestCase}} framework
** fixing it so that asynchronous workflow applications (those using {{CallableQueueService}}) can also be run
** covering the following workflow scenarios:
*** a very simple one consisting only of a {{<start/>}} and an {{<end/>}} node
*** a more sophisticated one consisting of multiple synchronous {{<fs/>}} nodes and a {{<decision/>}} node
*** the most complex one consisting of a {{<decision/>}} node and two branches, each with an {{<fs/>}} node and an asynchronous {{<shell/>}} node

Test cases run:
{noformat}
mvn clean test -Dtest=TestDBOperationRetryHandler,TestPersistenceExceptionSubclassFilterRetryPredicate,TestParallelJPAOperationRetries,TestWorkflow,TestWorkflowRetries,TestJPAService
{noformat}

Functional and stress tests were performed on a four-node MySQL cluster. The MySQL daemon was stopped, killed and restarted several times, and firewall rules were modified temporarily to simulate network outages.
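For illustration, here is a minimal sketch of what such a blacklist-based predicate could look like. The class name mirrors the {{TestPersistenceExceptionSubclassFilterRetryPredicate}} test listed above, but the blacklist contents, the use of {{java.util.function.Predicate}}, and all member names are assumptions rather than the actual Oozie implementation:

{code:java}
// Illustrative sketch only: class and member names are assumptions,
// not the actual Oozie implementation.
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

import javax.persistence.EntityExistsException;
import javax.persistence.EntityNotFoundException;
import javax.persistence.NoResultException;
import javax.persistence.NonUniqueResultException;
import javax.persistence.PersistenceException;

public class PersistenceExceptionSubclassFilterRetryPredicate implements Predicate<Throwable> {

    // Blacklisted PersistenceException subclasses: these signal application-level
    // problems rather than transient database failures, so retrying is pointless.
    private static final List<Class<? extends PersistenceException>> BLACKLIST = Arrays.asList(
            EntityExistsException.class,
            EntityNotFoundException.class,
            NoResultException.class,
            NonUniqueResultException.class);

    @Override
    public boolean test(final Throwable throwable) {
        if (throwable instanceof PersistenceException) {
            // Retry every PersistenceException whose class is not blacklisted
            return BLACKLIST.stream().noneMatch(clazz -> clazz.isInstance(throwable));
        }
        // In this sketch, failures coming from the lower layers (DBCP / JDBC) are also retried
        return true;
    }
}
{code}

Using a blacklist instead of a whitelist keeps the filter database-agnostic: any {{PersistenceException}} subclass that is not explicitly known to be non-transient is retried by default.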
> Oozie should handle transient database problems
> -----------------------------------------------
>
>                 Key: OOZIE-2854
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2854
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>            Reporter: Peter Bacsko
>            Assignee: Andras Piros
>         Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch, OOZIE-2854-003.patch, OOZIE-2854-004.patch, OOZIE-2854-005.patch, OOZIE-2854.006.patch, OOZIE-2854-POC-001.patch
>
> There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups:
> * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking, which might cause a transaction to roll back if two or more parallel transactions are running and one of them cannot complete because of a conflict.
> * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get a "Communications link failure" exception during the failover.
> The problem is that failed DB transactions might later cause a workflow (started or re-started by RecoveryService) to get stuck. It is not entirely clear to us how this happens, but it has to do with the fact that certain DB updates are not executed.
> The solution is to use some sort of retry logic with exponential backoff if the DB update fails. We could start with a 100 ms wait time which is doubled at every retry. The operation can be considered a failure if it still fails after 10 attempts. These values could be configurable; we should discuss the initial values in the scope of this JIRA.
> Note that this solution is to handle *transient* failures. If the DB is down for a longer period of time, we have to accept that the internal state of Oozie is corrupted.
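For reference, a minimal sketch of the retry loop described above: a 100 ms initial wait that is doubled at every retry, giving up after 10 attempts, gated by a retry predicate such as the one sketched earlier. The class name mirrors the {{TestDBOperationRetryHandler}} test, but the {{Callable}}-based API, field names, and defaults are assumptions rather than the actual Oozie implementation:

{code:java}
// Illustrative sketch only: a generic retry handler with exponential backoff.
// Names, defaults and the Callable-based API are assumptions, not the final Oozie code.
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

public class DBOperationRetryHandler {

    private final int maxRetries;          // e.g. 10 attempts
    private final long initialWaitMillis;  // e.g. 100 ms
    private final Predicate<Throwable> retryPredicate;

    public DBOperationRetryHandler(final int maxRetries, final long initialWaitMillis,
                                   final Predicate<Throwable> retryPredicate) {
        this.maxRetries = maxRetries;
        this.initialWaitMillis = initialWaitMillis;
        this.retryPredicate = retryPredicate;
    }

    public <T> T executeWithRetries(final Callable<T> dbOperation) throws Exception {
        long wait = initialWaitMillis;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return dbOperation.call();
            } catch (final Exception e) {
                if (!retryPredicate.test(e)) {
                    throw e;  // non-transient failure: do not retry
                }
                lastFailure = e;
                TimeUnit.MILLISECONDS.sleep(wait);
                wait *= 2;  // exponential backoff: 100 ms, 200 ms, 400 ms, ...
            }
        }
        throw lastFailure;  // still failing after maxRetries attempts
    }
}
{code}

A caller could then wrap each JPA operation, for example {{retryHandler.executeWithRetries(() -> entityManager.merge(bean))}}, so that transient Galera or failover errors are absorbed while blacklisted, non-transient errors surface immediately.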