[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems

2017-05-11 Thread Peter Bacsko (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-2854:

Attachment: OOZIE-2854-005.patch

> Oozie should handle transient DB problems
> -
>
> Key: OOZIE-2854
> URL: https://issues.apache.org/jira/browse/OOZIE-2854
> Project: Oozie
>  Issue Type: Improvement
>  Components: core
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
> Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch, 
> OOZIE-2854-003.patch, OOZIE-2854-004.patch, OOZIE-2854-005.patch, 
> OOZIE-2854-POC-001.patch
>
>
> There can be problems when Oozie cannot update the database properly. 
> Recently, we have experienced erratic behavior with two setups:
> * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic 
> locking, which might cause a transaction to roll back if there are two or more 
> parallel transactions running and one of them cannot complete because of a 
> conflict.
> * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, 
> Oozie might get a "Communications link failure" exception during the failover.
> The problem is that failed DB transactions might later cause workflows 
> (which are started/re-started by RecoveryService) to get stuck. It's not 
> clear to us how this happens, but it has to do with the fact that certain DB 
> updates are not executed.
> The solution is to use some sort of retry logic with exponential backoff if 
> the DB update fails. We could start with a 100ms wait time which is doubled 
> at every retry. The operation can be considered a failure if it still fails 
> after 10 attempts. These values could be configurable. We should discuss 
> initial values in the scope of this JIRA.
> Note that this solution is to handle *transient* failures. If the DB is down 
> for a longer period of time, we have to accept that the internal state of 
> Oozie is corrupted.
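
As a rough illustration of the retry scheme proposed in the description (100 ms initial wait, doubled at every retry, give up after 10 attempts), here is a minimal Java sketch. The class RetryWithBackoff, its methods, and the isTransient predicate are hypothetical names chosen for this example; they are not taken from the attached patches or from Oozie's code base.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper for illustration only; not part of Oozie. */
final class RetryWithBackoff {

    private RetryWithBackoff() { }

    /** Wait applied after failed attempt number `attempt` (1-based): initial, 2x, 4x, ... */
    static long waitMillis(long initialWaitMs, int attempt) {
        return initialWaitMs << (attempt - 1);
    }

    /**
     * Runs the operation, retrying with exponential backoff while it fails
     * with an exception that the caller classifies as transient.
     * Assumption: no wait is applied after the final attempt.
     */
    static <T> T execute(Callable<T> operation,
                         Predicate<Exception> isTransient,
                         int maxAttempts,
                         long initialWaitMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Non-transient failures are rethrown immediately, without retrying.
                if (!isTransient.test(e)) {
                    throw e;
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMillis(initialWaitMs, attempt));
                }
            }
        }
        // Every attempt failed with a transient error: report the last one.
        throw last;
    }
}
```

Assuming no wait follows the final attempt, the proposed defaults give waits of 100 ms, 200 ms, and so on up to 25600 ms between the 10 attempts, which is 100 ms * (2^9 - 1) = 51100 ms, or about 51 seconds, before the operation is reported as failed.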



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems

2017-05-11 Thread Peter Bacsko (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-2854:

Attachment: OOZIE-2854-004.patch






[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems

2017-05-09 Thread Peter Bacsko (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-2854:

Attachment: OOZIE-2854-003.patch






[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems

2017-05-09 Thread Peter Bacsko (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-2854:

Attachment: OOZIE-2854-002.patch






[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems

2017-05-04 Thread Peter Bacsko (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-2854:

Attachment: OOZIE-2854-001.patch






[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems

2017-05-03 Thread Peter Bacsko (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-2854:

Attachment: OOZIE-2854-POC-001.patch






[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems

2017-04-06 Thread Steven Hancz (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Hancz updated OOZIE-2854:


Hi Peter,

Yes, I found the article and set up a lab to test it.
It seems that it works well if I only run it on two nodes, meaning that
locking is not an issue if only two active nodes are used. But executing it
on a third one will generate a deadlock on one of them. One way to fix this
is to use DNS or HAProxy and send write traffic to only two nodes.
Sending to only one, as in the blog, limits the throughput. Also, I am working
with Cloudera on this; take a look at case 132505.

Steven












[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems

2017-04-03 Thread Peter Bacsko (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-2854:

Description: 
There can be problems when Oozie cannot update the database properly. Recently, 
we have experienced erratic behavior with two setups:

* MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic 
locking which might cause a transaction to rollback if there are two or more 
parallel transaction running and one of them cannot complete because of a 
conflict.

* MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, 
Oozie might get "Communications link failure" exception during the failover.

The problem is that failed DB transactions later might cause a workflow (which 
are started/re-started by RecoveryService) to get stuck. It's not clear to us 
how this happens but it has to do with the fact that certain DB updates are not 
executed.

The solution is to use some sort of retry logic with exponential backoff if the 
DB update fails. We could start with a 100ms wait time which is doubled at 
every retry. The operation can be considered a failure if it still fails after 
10 attempts. These values could be configurable. We should discuss initial 
values in the scope of this JIRA.

Note that this solution is to handle *transient* failures. If the DB is down 
for a longer period of time, we have to accept that the internal state of Oozie 
is corrupted.

  was:
There can be problems when Oozie cannot update the database properly. Recently, 
we have experienced erratic behavior with two setups:

* MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic 
locking which might cause a transaction to rollback if there are two or more 
parallel transaction running and one of them cannot complete because of a 
conflict.

* MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, 
Oozie might get "Communications link failure" exception during the failover.

The problem is that failed DB transactions later might cause a workflow (which 
are started/re-started by RecoveryService) to get stuck. It's not clear to us 
how this happens but it has to do with the fact that certain DB updates are not 
executed.

The solution is to use some sort of retry logic with exponential backoff if the 
DB update fails. We could start with a 100ms wait time which is doubled at 
every retry. The operation can be considered a failure if it still fails after 
10 attempts. These values could be configurable. We should discuss initial 
values in the scope of this JIRA.

Note that this solution is to handle *transient* failures. If the DB is long 
for a longer period of time, we have to accept that the internal state of Oozie 
is corrupted.







[jira] [Updated] (OOZIE-2854) Oozie should handle transient DB problems

2017-04-03 Thread Peter Bacsko (JIRA)

 [ 
https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated OOZIE-2854:

Description: 
There can be problems when Oozie cannot update the database properly. Recently, 
we have experienced erratic behavior with two setups:

* MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic 
locking which might cause a transaction to rollback if there are two or more 
parallel transaction running and one of them cannot complete because of a 
conflict.

* MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, 
Oozie might get "Communications link failure" exception during the failover.

The problem is that failed DB transactions later might cause a workflow (which 
are started/re-started by RecoveryService) to get stuck. It's not clear to us 
how this happens but it has to do with the fact that certain DB updates are not 
executed.

The solution is to use some sort of retry logic with exponential backoff if the 
DB update fails. We could start with a 100ms wait time which is doubled at 
every retry. The operation can be considered a failure if it still fails after 
10 attempts. These values could be configurable. We should discuss initial 
values in the scope of this JIRA.

Note that this solution is to handle *transient* failures. If the DB is long 
for a longer period of time, we have to accept that the internal state of Oozie 
is corrupted.

  was:
There can be problems when Oozie cannot update the database properly. Recently, 
we have experienced erratic behavior with two setups:

* MySQL was set up with the Galera cluster manager. Galera uses cluster-wide 
optimistic locking which might cause a transaction to rollback if there are two 
or more parallel transaction running and one of them cannot complete because of 
a conflict.

* Another setup is MySQL with Percona XtraDB Cluster. If one of the MySQL 
instances is killed, Oozie might get "Communications link failure" exception. 

The problem is that failed DB transactions later might cause a workflow (which 
are started/re-started by RecoveryService) to get stuck. It's not clear to us 
how this happens but it has to do with the fact that certain DB updates are not 
executed.

The solution is to use some sort of retry logic with exponential backoff if the 
DB update fails. We could start with a 100ms wait time which is doubled at 
every retry. The operation can be considered a failure if it still fails after 
10 attempts. These values could be configurable. We should discuss initial 
values in the scope of this JIRA.

Note that this solution is to handle *transient* failures. If the DB is long 
for a longer period of time, we have to accept that the internal state of Oozie 
is corrupted.




