Peter Bacsko created OOZIE-2854:
-----------------------------------

             Summary: Oozie should handle transient DB problems
                 Key: OOZIE-2854
                 URL: https://issues.apache.org/jira/browse/OOZIE-2854
             Project: Oozie
          Issue Type: Improvement
          Components: core
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


There can be problems when Oozie cannot update the database properly. Recently, 
we have experienced erratic behavior with two setups:

* MySQL was set up with the Galera cluster manager. Galera uses cluster-wide 
optimistic locking, which might cause a transaction to roll back if there are 
two or more parallel transactions running and one of them cannot complete 
because of a conflict.

* Another setup is MySQL with Percona XtraDB Cluster. If one of the MySQL 
instances is killed, Oozie might get a "Communications link failure" exception.
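
To illustrate how these two failure modes could be told apart from permanent 
errors, here is a minimal sketch. The class name TransientDbErrors and the 
method isTransient are hypothetical and not part of Oozie. The sketch assumes 
that the failure eventually surfaces as a java.sql.SQLException somewhere in 
the cause chain (for example wrapped by the persistence layer) and that the 
driver reports a standard SQLSTATE: class 08 (connection exception) for a lost 
connection and class 40 (transaction rollback) for a conflict-induced 
rollback. Whether a given driver and cluster setup actually reports these 
states should be verified before relying on it.

```java
import java.sql.SQLException;
import java.sql.SQLRecoverableException;
import java.sql.SQLTransientException;

// Hypothetical helper, not an existing Oozie class.
final class TransientDbErrors {

    // Upper bound on how far we follow getCause(), to guard against cycles.
    private static final int MAX_CAUSE_DEPTH = 20;

    /** Returns true if any exception in the cause chain looks like a transient DB failure. */
    static boolean isTransient(Throwable t) {
        Throwable current = t;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            // Standard JDBC subclasses that signal "retrying may succeed".
            if (current instanceof SQLTransientException
                    || current instanceof SQLRecoverableException) {
                return true;
            }
            if (current instanceof SQLException) {
                String state = ((SQLException) current).getSQLState();
                // Assumption: SQLSTATE class 08 = connection exception,
                // class 40 = transaction rollback (e.g. deadlock or conflict).
                if (state != null && (state.startsWith("08") || state.startsWith("40"))) {
                    return true;
                }
            }
            current = current.getCause();
        }
        return false;
    }
}
```

Anything that does not match (for example a syntax error or a constraint 
violation) is treated as permanent here, so it would not be retried.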

The problem is that failed DB transactions might later cause workflows (which 
are started/restarted by RecoveryService) to get stuck. It's not clear to us 
how this happens, but it has to do with the fact that certain DB updates are 
not executed.

The solution is to use some sort of retry logic with exponential backoff if the 
DB update fails. We could start with a 100ms wait time which is doubled at 
every retry. The operation can be considered a failure if it still fails after 
10 attempts. These values could be configurable. We should discuss initial 
values in the scope of this JIRA.
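
The retry scheme described above can be sketched as follows. This is only an 
illustration under stated assumptions, not a proposed patch: the class 
DbRetrySketch and its methods are hypothetical names, the decision about 
which exceptions count as transient is left to a caller-supplied predicate, 
and the values 10 and 100 are the ones suggested in this description.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical sketch, not an existing Oozie class.
final class DbRetrySketch {

    /**
     * Runs the operation, retrying on transient failures. The wait starts at
     * initialWaitMs and is doubled after every failed attempt. After
     * maxAttempts failed attempts the last exception is rethrown.
     */
    static <T> T callWithRetry(Callable<T> operation, Predicate<Exception> isTransient,
                               int maxAttempts, long initialWaitMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long waitMs = initialWaitMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Give up on the last attempt or on a non-transient error.
                if (attempt >= maxAttempts || !isTransient.test(e)) {
                    throw e;
                }
                Thread.sleep(waitMs);
                waitMs *= 2; // exponential backoff
            }
        }
    }

    /** Total time spent sleeping if every attempt fails (maxAttempts - 1 waits). */
    static long worstCaseWaitMillis(int maxAttempts, long initialWaitMs) {
        long total = 0;
        long waitMs = initialWaitMs;
        for (int i = 1; i < maxAttempts; i++) {
            total += waitMs;
            waitMs *= 2;
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated DB update that fails twice before succeeding.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated transient failure");
            }
            return "updated";
        }, e -> e instanceof IllegalStateException, 10, 100L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With 10 attempts and a 100ms initial wait there are 9 waits (100ms, 200ms, 
..., 25600ms), so an operation that never succeeds is given up on after about 
51 seconds of waiting. This may be a useful reference point when discussing 
the default values.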

Note that this solution is meant to handle *transient* failures. If the DB is 
down for a longer period of time, we have to accept that the internal state of 
Oozie is corrupted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
