[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems
[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated OOZIE-2854: Attachment: OOZIE-2854-005.patch > Oozie should handle transient DB problems > - > > Key: OOZIE-2854 > URL: https://issues.apache.org/jira/browse/OOZIE-2854 > Project: Oozie > Issue Type: Improvement > Components: core >Reporter: Peter Bacsko >Assignee: Peter Bacsko > Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch, > OOZIE-2854-003.patch, OOZIE-2854-004.patch, OOZIE-2854-005.patch, > OOZIE-2854-POC-001.patch > > > There can be problems when Oozie cannot update the database properly. > Recently, we have experienced erratic behavior with two setups: > * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic > locking which might cause a transaction to rollback if there are two or more > parallel transaction running and one of them cannot complete because of a > conflict. > * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, > Oozie might get "Communications link failure" exception during the failover. > The problem is that failed DB transactions later might cause a workflow > (which are started/re-started by RecoveryService) to get stuck. It's not > clear to us how this happens but it has to do with the fact that certain DB > updates are not executed. > The solution is to use some sort of retry logic with exponential backoff if > the DB update fails. We could start with a 100ms wait time which is doubled > at every retry. The operation can be considered a failure if it still fails > after 10 attempts. These values could be configurable. We should discuss > initial values in the scope of this JIRA. > Note that this solution is to handle *transient* failures. If the DB is down > for a longer period of time, we have to accept that the internal state of > Oozie is corrupted. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems
[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated OOZIE-2854: Attachment: OOZIE-2854-004.patch > Oozie should handle transient DB problems > - > > Key: OOZIE-2854 > URL: https://issues.apache.org/jira/browse/OOZIE-2854 > Project: Oozie > Issue Type: Improvement > Components: core >Reporter: Peter Bacsko >Assignee: Peter Bacsko > Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch, > OOZIE-2854-003.patch, OOZIE-2854-004.patch, OOZIE-2854-POC-001.patch > > > There can be problems when Oozie cannot update the database properly. > Recently, we have experienced erratic behavior with two setups: > * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic > locking which might cause a transaction to rollback if there are two or more > parallel transaction running and one of them cannot complete because of a > conflict. > * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, > Oozie might get "Communications link failure" exception during the failover. > The problem is that failed DB transactions later might cause a workflow > (which are started/re-started by RecoveryService) to get stuck. It's not > clear to us how this happens but it has to do with the fact that certain DB > updates are not executed. > The solution is to use some sort of retry logic with exponential backoff if > the DB update fails. We could start with a 100ms wait time which is doubled > at every retry. The operation can be considered a failure if it still fails > after 10 attempts. These values could be configurable. We should discuss > initial values in the scope of this JIRA. > Note that this solution is to handle *transient* failures. If the DB is down > for a longer period of time, we have to accept that the internal state of > Oozie is corrupted. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems
[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated OOZIE-2854: Attachment: OOZIE-2854-003.patch > Oozie should handle transient DB problems > - > > Key: OOZIE-2854 > URL: https://issues.apache.org/jira/browse/OOZIE-2854 > Project: Oozie > Issue Type: Improvement > Components: core >Reporter: Peter Bacsko >Assignee: Peter Bacsko > Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch, > OOZIE-2854-003.patch, OOZIE-2854-POC-001.patch > > > There can be problems when Oozie cannot update the database properly. > Recently, we have experienced erratic behavior with two setups: > * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic > locking which might cause a transaction to rollback if there are two or more > parallel transaction running and one of them cannot complete because of a > conflict. > * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, > Oozie might get "Communications link failure" exception during the failover. > The problem is that failed DB transactions later might cause a workflow > (which are started/re-started by RecoveryService) to get stuck. It's not > clear to us how this happens but it has to do with the fact that certain DB > updates are not executed. > The solution is to use some sort of retry logic with exponential backoff if > the DB update fails. We could start with a 100ms wait time which is doubled > at every retry. The operation can be considered a failure if it still fails > after 10 attempts. These values could be configurable. We should discuss > initial values in the scope of this JIRA. > Note that this solution is to handle *transient* failures. If the DB is down > for a longer period of time, we have to accept that the internal state of > Oozie is corrupted. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems
[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated OOZIE-2854: Attachment: OOZIE-2854-002.patch > Oozie should handle transient DB problems > - > > Key: OOZIE-2854 > URL: https://issues.apache.org/jira/browse/OOZIE-2854 > Project: Oozie > Issue Type: Improvement > Components: core >Reporter: Peter Bacsko >Assignee: Peter Bacsko > Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch, > OOZIE-2854-POC-001.patch > > > There can be problems when Oozie cannot update the database properly. > Recently, we have experienced erratic behavior with two setups: > * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic > locking which might cause a transaction to rollback if there are two or more > parallel transaction running and one of them cannot complete because of a > conflict. > * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, > Oozie might get "Communications link failure" exception during the failover. > The problem is that failed DB transactions later might cause a workflow > (which are started/re-started by RecoveryService) to get stuck. It's not > clear to us how this happens but it has to do with the fact that certain DB > updates are not executed. > The solution is to use some sort of retry logic with exponential backoff if > the DB update fails. We could start with a 100ms wait time which is doubled > at every retry. The operation can be considered a failure if it still fails > after 10 attempts. These values could be configurable. We should discuss > initial values in the scope of this JIRA. > Note that this solution is to handle *transient* failures. If the DB is down > for a longer period of time, we have to accept that the internal state of > Oozie is corrupted. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems
[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated OOZIE-2854: Attachment: OOZIE-2854-001.patch > Oozie should handle transient DB problems > - > > Key: OOZIE-2854 > URL: https://issues.apache.org/jira/browse/OOZIE-2854 > Project: Oozie > Issue Type: Improvement > Components: core >Reporter: Peter Bacsko >Assignee: Peter Bacsko > Attachments: OOZIE-2854-001.patch, OOZIE-2854-POC-001.patch > > > There can be problems when Oozie cannot update the database properly. > Recently, we have experienced erratic behavior with two setups: > * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic > locking which might cause a transaction to rollback if there are two or more > parallel transaction running and one of them cannot complete because of a > conflict. > * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, > Oozie might get "Communications link failure" exception during the failover. > The problem is that failed DB transactions later might cause a workflow > (which are started/re-started by RecoveryService) to get stuck. It's not > clear to us how this happens but it has to do with the fact that certain DB > updates are not executed. > The solution is to use some sort of retry logic with exponential backoff if > the DB update fails. We could start with a 100ms wait time which is doubled > at every retry. The operation can be considered a failure if it still fails > after 10 attempts. These values could be configurable. We should discuss > initial values in the scope of this JIRA. > Note that this solution is to handle *transient* failures. If the DB is down > for a longer period of time, we have to accept that the internal state of > Oozie is corrupted. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems
[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated OOZIE-2854: Attachment: OOZIE-2854-POC-001.patch > Oozie should handle transient DB problems > - > > Key: OOZIE-2854 > URL: https://issues.apache.org/jira/browse/OOZIE-2854 > Project: Oozie > Issue Type: Improvement > Components: core >Reporter: Peter Bacsko >Assignee: Peter Bacsko > Attachments: OOZIE-2854-POC-001.patch > > > There can be problems when Oozie cannot update the database properly. > Recently, we have experienced erratic behavior with two setups: > * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic > locking which might cause a transaction to rollback if there are two or more > parallel transaction running and one of them cannot complete because of a > conflict. > * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, > Oozie might get "Communications link failure" exception during the failover. > The problem is that failed DB transactions later might cause a workflow > (which are started/re-started by RecoveryService) to get stuck. It's not > clear to us how this happens but it has to do with the fact that certain DB > updates are not executed. > The solution is to use some sort of retry logic with exponential backoff if > the DB update fails. We could start with a 100ms wait time which is doubled > at every retry. The operation can be considered a failure if it still fails > after 10 attempts. These values could be configurable. We should discuss > initial values in the scope of this JIRA. > Note that this solution is to handle *transient* failures. If the DB is down > for a longer period of time, we have to accept that the internal state of > Oozie is corrupted. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems
[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Hancz updated OOZIE-2854: Hi Peter, Yes I found the article and I set up a lab to test it. It seems that is works well if I only run it on two nodes. Meaning the locking is not an issue if only to active nodes are used. But executing it on the third one will generate a deadlock on of them.. One way to fix this is to use a DNS or ProxyHA and send write traffic to only two nodes. Sending to only one as in the blog limits the throughput. Also I am working with Cloudera on this take a look at case 132505. Steven -- This email is intended only for the individual(s) to whom it is addressed and may be a confidential communication protected by law. Any unauthorized use, dissemination, distribution, disclosure, or copying is prohibited. Please notify the sender immediately by return email, if you believe you have received this message in error, and please delete it from your system. > Oozie should handle transient DB problems > - > > Key: OOZIE-2854 > URL: https://issues.apache.org/jira/browse/OOZIE-2854 > Project: Oozie > Issue Type: Improvement > Components: core >Reporter: Peter Bacsko >Assignee: Peter Bacsko > > There can be problems when Oozie cannot update the database properly. > Recently, we have experienced erratic behavior with two setups: > * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic > locking which might cause a transaction to rollback if there are two or more > parallel transaction running and one of them cannot complete because of a > conflict. > * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, > Oozie might get "Communications link failure" exception during the failover. > The problem is that failed DB transactions later might cause a workflow > (which are started/re-started by RecoveryService) to get stuck. It's not > clear to us how this happens but it has to do with the fact that certain DB > updates are not executed. > The solution is to use some sort of retry logic with exponential backoff if > the DB update fails. We could start with a 100ms wait time which is doubled > at every retry. The operation can be considered a failure if it still fails > after 10 attempts. These values could be configurable. We should discuss > initial values in the scope of this JIRA. > Note that this solution is to handle *transient* failures. If the DB is down > for a longer period of time, we have to accept that the internal state of > Oozie is corrupted. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems
[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated OOZIE-2854: Description: There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups: * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking which might cause a transaction to rollback if there are two or more parallel transaction running and one of them cannot complete because of a conflict. * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get "Communications link failure" exception during the failover. The problem is that failed DB transactions later might cause a workflow (which are started/re-started by RecoveryService) to get stuck. It's not clear to us how this happens but it has to do with the fact that certain DB updates are not executed. The solution is to use some sort of retry logic with exponential backoff if the DB update fails. We could start with a 100ms wait time which is doubled at every retry. The operation can be considered a failure if it still fails after 10 attempts. These values could be configurable. We should discuss initial values in the scope of this JIRA. Note that this solution is to handle *transient* failures. If the DB is down for a longer period of time, we have to accept that the internal state of Oozie is corrupted. was: There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups: * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking which might cause a transaction to rollback if there are two or more parallel transaction running and one of them cannot complete because of a conflict. * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get "Communications link failure" exception during the failover. The problem is that failed DB transactions later might cause a workflow (which are started/re-started by RecoveryService) to get stuck. It's not clear to us how this happens but it has to do with the fact that certain DB updates are not executed. The solution is to use some sort of retry logic with exponential backoff if the DB update fails. We could start with a 100ms wait time which is doubled at every retry. The operation can be considered a failure if it still fails after 10 attempts. These values could be configurable. We should discuss initial values in the scope of this JIRA. Note that this solution is to handle *transient* failures. If the DB is long for a longer period of time, we have to accept that the internal state of Oozie is corrupted. > Oozie should handle transient DB problems > - > > Key: OOZIE-2854 > URL: https://issues.apache.org/jira/browse/OOZIE-2854 > Project: Oozie > Issue Type: Improvement > Components: core >Reporter: Peter Bacsko >Assignee: Peter Bacsko > > There can be problems when Oozie cannot update the database properly. > Recently, we have experienced erratic behavior with two setups: > * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic > locking which might cause a transaction to rollback if there are two or more > parallel transaction running and one of them cannot complete because of a > conflict. > * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, > Oozie might get "Communications link failure" exception during the failover. > The problem is that failed DB transactions later might cause a workflow > (which are started/re-started by RecoveryService) to get stuck. It's not > clear to us how this happens but it has to do with the fact that certain DB > updates are not executed. > The solution is to use some sort of retry logic with exponential backoff if > the DB update fails. We could start with a 100ms wait time which is doubled > at every retry. The operation can be considered a failure if it still fails > after 10 attempts. These values could be configurable. We should discuss > initial values in the scope of this JIRA. > Note that this solution is to handle *transient* failures. If the DB is down > for a longer period of time, we have to accept that the internal state of > Oozie is corrupted. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems
[ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated OOZIE-2854: Description: There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups: * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking which might cause a transaction to rollback if there are two or more parallel transaction running and one of them cannot complete because of a conflict. * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get "Communications link failure" exception during the failover. The problem is that failed DB transactions later might cause a workflow (which are started/re-started by RecoveryService) to get stuck. It's not clear to us how this happens but it has to do with the fact that certain DB updates are not executed. The solution is to use some sort of retry logic with exponential backoff if the DB update fails. We could start with a 100ms wait time which is doubled at every retry. The operation can be considered a failure if it still fails after 10 attempts. These values could be configurable. We should discuss initial values in the scope of this JIRA. Note that this solution is to handle *transient* failures. If the DB is long for a longer period of time, we have to accept that the internal state of Oozie is corrupted. was: There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups: * MySQL was set up with the Galera cluster manager. Galera uses cluster-wide optimistic locking which might cause a transaction to rollback if there are two or more parallel transaction running and one of them cannot complete because of a conflict. * Another setup is MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get "Communications link failure" exception. The problem is that failed DB transactions later might cause a workflow (which are started/re-started by RecoveryService) to get stuck. It's not clear to us how this happens but it has to do with the fact that certain DB updates are not executed. The solution is to use some sort of retry logic with exponential backoff if the DB update fails. We could start with a 100ms wait time which is doubled at every retry. The operation can be considered a failure if it still fails after 10 attempts. These values could be configurable. We should discuss initial values in the scope of this JIRA. Note that this solution is to handle *transient* failures. If the DB is long for a longer period of time, we have to accept that the internal state of Oozie is corrupted. > Oozie should handle transient DB problems > - > > Key: OOZIE-2854 > URL: https://issues.apache.org/jira/browse/OOZIE-2854 > Project: Oozie > Issue Type: Improvement > Components: core >Reporter: Peter Bacsko >Assignee: Peter Bacsko > > There can be problems when Oozie cannot update the database properly. > Recently, we have experienced erratic behavior with two setups: > * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic > locking which might cause a transaction to rollback if there are two or more > parallel transaction running and one of them cannot complete because of a > conflict. > * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, > Oozie might get "Communications link failure" exception during the failover. > The problem is that failed DB transactions later might cause a workflow > (which are started/re-started by RecoveryService) to get stuck. It's not > clear to us how this happens but it has to do with the fact that certain DB > updates are not executed. > The solution is to use some sort of retry logic with exponential backoff if > the DB update fails. We could start with a 100ms wait time which is doubled > at every retry. The operation can be considered a failure if it still fails > after 10 attempts. These values could be configurable. We should discuss > initial values in the scope of this JIRA. > Note that this solution is to handle *transient* failures. If the DB is long > for a longer period of time, we have to accept that the internal state of > Oozie is corrupted. -- This message was sent by Atlassian JIRA (v6.3.15#6346)