[ 
https://issues.apache.org/jira/browse/OOZIE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078041#comment-14078041
 ] 

Purshotam Shah commented on OOZIE-1940:
---------------------------------------

Here is  plan.

1. Optimized SQL, one SQL call to get all coord/bundle in pending state ( one 
sql for coord and one for bundle).
Fix this. OOZIE-1885

2. StatusTransitService need to acquire StatusTransitService lock, so that no 
other service can run in HA.

3. Have two new commands CoordStatusTransitXCommand and 
BundleStatusTransitXCommand.

4. StatusTransitService will call CoordStatusTransitXCommand and 
BundleStatusTransitXCommand to change job status synchronously.

5. *StatusTransitXCommand will try to acquire lock, but with no delay and 
requeue set to false. If command can't acquire lock, its ignored and next run 
will pick the job.


> StatusTransitService has race condition
> ---------------------------------------
>
>                 Key: OOZIE-1940
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1940
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Purshotam Shah
>
> StatusTransitService doesn't acquire lock while updating DB. 
> We noticed one such issue while doing HA testing, thanks to [~mchiang]
> We issue a change command to change pause time, which got executed on one 
> server. While change command was running on one server, other server started 
> executing StatusTransitService.
> Server 1 log
> {code}
> 2014-07-16 17:28:05,268  INFO StatusTransitService$StatusTransitRunnable:539 
> [pool-1-thread-13] - USER[-] GROUP[-] Acquired lock for 
> [org.apache.oozie.service.StatusTransitService]
> 2014-07-16 17:28:09,694  INFO StatusTransitService$StatusTransitRunnable:539 
> [pool-1-thread-13] - USER[-] GROUP[-] Set coordinator job 
> [0011385-140716042555-oozie-oozi-C] status to 'SUCCEEDED' from 'RUNNING' 
> 2014-07-16 17:28:15,416  INFO StatusTransitService$StatusTransitRunnable:539 
> [pool-1-thread-13] - USER[-] GROUP[-] Released lock for 
> [org.apache.oozie.service.StatusTransitService]
> {code}
> Server 2 log
> {code}
> 2014-07-16 17:28:06,499 DEBUG CoordChangeXCommand:545 [http-0.0.0.0-4443-5] - 
> USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] 
> JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] New pause/end date is : Wed 
> Jul 16 17:30:00 UTC 2014 and last action number is : 3
> 2014-07-16 17:28:06,508  INFO CoordChangeXCommand:539 [http-0.0.0.0-4443-5] - 
> USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] 
> JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] ENDED CoordChangeXCommand 
> for jobId=0011385-140716042555-oozie-oozi-C
> {code}
> CoordMaterializeTransitionXCommand has created all actions( few were in 
> waiting and few were in running state) and set doneMaterialization to true.
> Change command deletes all waiting coords, except 3 running/SUCCEEDED action 
> and reset doneMaterialization.
> StatusTransitService first loads a set of pending jobs and for each job it 
> make DB calls to check coord action status. Coord jobs are loaded only once 
> in beginning.
> This is what happened.
> 1.StatusTransitService loads the coord job which doneMaterialization is set 
> to true at 17:28:05,268 (server 1)
> 2.Change command deletes waiting cation and reset  doneMaterialization at  
> 17:28:06,508 (server 2)
> 3.StatusTransitService load actions for job, only 3 and in SUCCEEDED status. 
> It never reload the doneMaterialization at 17:28:09,694 (server 1)
> StatusTransitService overrides set job status to SUCCEEDED, bcz it's 
> doneMaterialization and all action are SUCCEEDED.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to