[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957289#comment-15957289 ]
Steven Hancz commented on OOZIE-2854:
-------------------------------------

Peter,

I am glad that you found this issue. I just built two labs to test an HA solution for cluster metadata storage. It makes no sense to have a Hadoop cluster that a single MySQL failure can bring down. So far I have built a Galera MySQL cluster and a Galera MariaDB cluster, but I have not pointed Hadoop at them yet; I have only imported the data. NDB clustering will not work because it does not use the InnoDB engine, so the import will fail. I also have a case open with Cloudera on this matter that I am waiting on.

Galera Cluster states that it is synchronous (https://mariadb.com/kb/en/mariadb/about-galera-replication/), but your findings are different. All I want is a metadata store that is not prone to a single point of failure. Have you run similar tests with Oracle RAC?

Steven

> Oozie should handle transient DB problems
> -----------------------------------------
>
>                 Key: OOZIE-2854
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2854
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>
> There can be problems when Oozie cannot update the database properly.
> Recently, we have experienced erratic behavior with two setups:
> * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic
> locking, which might cause a transaction to roll back if there are two or more
> parallel transactions running and one of them cannot complete because of a
> conflict.
> * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed,
> Oozie might get a "Communications link failure" exception during the failover.
> The problem is that failed DB transactions might later cause workflows
> (which are started/re-started by RecoveryService) to get stuck. It's not
> clear to us how this happens, but it has to do with the fact that certain DB
> updates are not executed.
> The solution is to use some sort of retry logic with exponential backoff if
> the DB update fails. We could start with a 100ms wait time which is doubled
> at every retry. The operation can be considered a failure if it still fails
> after 10 attempts. These values could be configurable. We should discuss
> initial values in the scope of this JIRA.
> Note that this solution is to handle *transient* failures. If the DB is down
> for a longer period of time, we have to accept that the internal state of
> Oozie is corrupted.
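
To make the proposed retry policy concrete, here is a minimal sketch in Java of retry with exponential backoff, using the values from the description (100 ms initial wait, doubled at every retry, give up after 10 attempts). The class and method names (RetryWithBackoff, execute, waitBeforeRetryMs) are hypothetical and are not taken from the Oozie code base. For simplicity the sketch retries on any exception; a real implementation would presumably retry only on failures known to be transient.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of Oozie.
class RetryWithBackoff {

    // Wait before retry number `retry` (1-based): the initial wait,
    // doubled once for every retry that has already happened.
    static long waitBeforeRetryMs(long initialWaitMs, int retry) {
        return initialWaitMs << (retry - 1);
    }

    // Runs `operation` up to `maxAttempts` times. Sleeps between attempts,
    // and rethrows the last exception if every attempt fails.
    static <T> T execute(Callable<T> operation, int maxAttempts, long initialWaitMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Assumption: every exception is treated as transient here.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(waitBeforeRetryMs(initialWaitMs, attempt));
                }
            }
        }
        throw last;
    }
}
```

With 10 attempts and a 100 ms initial wait there are at most 9 waits (100 ms, 200 ms, ..., 25600 ms), so an operation is reported as failed after roughly 51 seconds of waiting plus the time spent in the attempts themselves. That gives a rough idea of how long a failover this policy could ride out.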