[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068555#comment-16068555 ]
Andras Piros edited comment on OOZIE-2854 at 6/29/17 4:30 PM:
--------------------------------------------------------------

All functionalities are covered here. A blacklist-based {{javax.persistence.PersistenceException}} retry predicate provides a database-agnostic way of identifying which {{Exception}} instances originating from the JPA / DBCP / JDBC layers are good candidates to begin or continue retrying (see the sketch below this comment).

Tests covered in code:
* unit tests
** testing the retry handler, the retry predicate filter, and parallel calls to the JPA {{EntityManager}} (mostly Oozie database reads and writes) while injecting failures
* integration tests
** using the {{MiniOozieTestCase}} framework
** fixing it so that asynchronous workflow applications (those using {{CallableQueueService}}) can also be run
** covering the following workflow scenarios:
*** a very simple one consisting only of a {{<start/>}} and an {{<end/>}} node
*** a more sophisticated one consisting of multiple synchronous {{<fs/>}} nodes and a {{<decision/>}} node
*** the most complex one consisting of a {{<decision/>}} node and two branches, each with an {{<fs/>}} node and an asynchronous {{<shell/>}} node

Test cases run:
{noformat}
mvn clean test -Dtest=TestDBOperationRetryHandler,TestPersistenceExceptionSubclassFilterRetryPredicate,TestParallelJPAOperationRetries,TestWorkflow,TestWorkflowRetries,TestJPAService
{noformat}

Functional and stress tests were performed on a four-node MySQL cluster. The MySQL daemon was stopped, killed and restarted several times, and firewall rules were modified temporarily to simulate network outages.
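For illustration, here is a minimal sketch of what such a blacklist-based predicate could look like. The class name mirrors the {{TestPersistenceExceptionSubclassFilterRetryPredicate}} test listed above, but the blacklist contents, the use of {{java.util.function.Predicate}}, and all member names are assumptions rather than the actual Oozie implementation:

{code:java}
// Illustrative sketch only: class and member names are assumptions,
// not the actual Oozie implementation.
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

import javax.persistence.EntityExistsException;
import javax.persistence.EntityNotFoundException;
import javax.persistence.NoResultException;
import javax.persistence.NonUniqueResultException;
import javax.persistence.PersistenceException;

public class PersistenceExceptionSubclassFilterRetryPredicate implements Predicate<Throwable> {

    // Blacklisted PersistenceException subclasses: these signal application-level
    // problems rather than transient database failures, so retrying is pointless.
    private static final List<Class<? extends PersistenceException>> BLACKLIST = Arrays.asList(
            EntityExistsException.class,
            EntityNotFoundException.class,
            NoResultException.class,
            NonUniqueResultException.class);

    @Override
    public boolean test(final Throwable throwable) {
        if (throwable instanceof PersistenceException) {
            // Retry every PersistenceException whose class is not blacklisted
            return BLACKLIST.stream().noneMatch(clazz -> clazz.isInstance(throwable));
        }
        // In this sketch, failures coming from the lower layers (DBCP / JDBC) are also retried
        return true;
    }
}
{code}

Using a blacklist instead of a whitelist keeps the filter database-agnostic: any {{PersistenceException}} subclass that is not explicitly known to be non-transient is retried by default.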
> Oozie should handle transient database problems
> -----------------------------------------------
>
>                 Key: OOZIE-2854
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2854
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>            Reporter: Peter Bacsko
>            Assignee: Andras Piros
>         Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch, OOZIE-2854-003.patch, OOZIE-2854-004.patch, OOZIE-2854-005.patch, OOZIE-2854.006.patch, OOZIE-2854-POC-001.patch
>
> There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups:
> * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking, which might cause a transaction to roll back if two or more parallel transactions are running and one of them cannot complete because of a conflict.
> * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get a "Communications link failure" exception during the failover.
> The problem is that failed DB transactions might later cause a workflow (started or re-started by RecoveryService) to get stuck. It is not entirely clear to us how this happens, but it has to do with the fact that certain DB updates are not executed.
> The solution is to use some sort of retry logic with exponential backoff if the DB update fails. We could start with a 100 ms wait time which is doubled at every retry. The operation can be considered a failure if it still fails after 10 attempts. These values could be configurable; we should discuss the initial values in the scope of this JIRA.
> Note that this solution is to handle *transient* failures. If the DB is down for a longer period of time, we have to accept that the internal state of Oozie is corrupted.
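For reference, a minimal sketch of the retry loop described above: a 100 ms initial wait that is doubled at every retry, giving up after 10 attempts, gated by a retry predicate such as the one sketched earlier. The class name mirrors the {{TestDBOperationRetryHandler}} test, but the {{Callable}}-based API, field names, and defaults are assumptions rather than the actual Oozie implementation:

{code:java}
// Illustrative sketch only: a generic retry handler with exponential backoff.
// Names, defaults and the Callable-based API are assumptions, not the final Oozie code.
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

public class DBOperationRetryHandler {

    private final int maxRetries;          // e.g. 10 attempts
    private final long initialWaitMillis;  // e.g. 100 ms
    private final Predicate<Throwable> retryPredicate;

    public DBOperationRetryHandler(final int maxRetries, final long initialWaitMillis,
                                   final Predicate<Throwable> retryPredicate) {
        this.maxRetries = maxRetries;
        this.initialWaitMillis = initialWaitMillis;
        this.retryPredicate = retryPredicate;
    }

    public <T> T executeWithRetries(final Callable<T> dbOperation) throws Exception {
        long wait = initialWaitMillis;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return dbOperation.call();
            } catch (final Exception e) {
                if (!retryPredicate.test(e)) {
                    throw e;  // non-transient failure: do not retry
                }
                lastFailure = e;
                TimeUnit.MILLISECONDS.sleep(wait);
                wait *= 2;  // exponential backoff: 100 ms, 200 ms, 400 ms, ...
            }
        }
        throw lastFailure;  // still failing after maxRetries attempts
    }
}
{code}

A caller could then wrap each JPA operation, for example {{retryHandler.executeWithRetries(() -> entityManager.merge(bean))}}, so that transient Galera or failover errors are absorbed while blacklisted, non-transient errors surface immediately.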