[ https://issues.apache.org/jira/browse/OOZIE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138388#comment-14138388 ]
Hadoop QA commented on OOZIE-1940:
----------------------------------

Testing JIRA OOZIE-1940

Cleaning local git workspace

----------------------------

{color:green}+1 PATCH_APPLIES{color}
{color:green}+1 CLEAN{color}
{color:red}-1 RAW_PATCH_ANALYSIS{color}
.    {color:green}+1{color} the patch does not introduce any @author tags
.    {color:green}+1{color} the patch does not introduce any tabs
.    {color:green}+1{color} the patch does not introduce any trailing spaces
.    {color:red}-1{color} the patch contains 5 line(s) longer than 132 characters
.    {color:green}+1{color} the patch does adds/modifies 1 testcase(s)
{color:green}+1 RAT{color}
.    {color:green}+1{color} the patch does not seem to introduce new RAT warnings
{color:green}+1 JAVADOC{color}
.    {color:green}+1{color} the patch does not seem to introduce new Javadoc warnings
{color:green}+1 COMPILE{color}
.    {color:green}+1{color} HEAD compiles
.    {color:green}+1{color} patch compiles
.    {color:green}+1{color} the patch does not seem to introduce new javac warnings
{color:green}+1 BACKWARDS_COMPATIBILITY{color}
.    {color:green}+1{color} the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
.    {color:green}+1{color} the patch does not modify JPA files
{color:red}-1 TESTS{color}
.    Tests run: 1532
.    Tests failed: 2
.    Tests errors: 2

.    The patch failed the following testcases:

.      testBundleStartNegative2(org.apache.oozie.command.bundle.TestBundleStartXCommand)
.      testBundleStartWithFailedCoordinator(org.apache.oozie.command.bundle.TestBundleStartXCommand)

{color:green}+1 DISTRO{color}
.    {color:green}+1{color} distro tarball builds with the patch

----------------------------
{color:red}*-1 Overall result, please check the reported -1(s)*{color}

The full output of the test-patch run is available at
https://builds.apache.org/job/oozie-trunk-precommit-build/1989/

> StatusTransitService has race condition
> ---------------------------------------
>
>                 Key: OOZIE-1940
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1940
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Purshotam Shah
>            Assignee: Purshotam Shah
>         Attachments: OOZIE-1940-V5.patch
>
>
> StatusTransitService doesn't acquire a lock while updating the DB.
> We noticed one such issue while doing HA testing, thanks to [~mchiang].
> We issued a change command to change the pause time, which was executed on
> one server. While the change command was running on that server, the other
> server started executing StatusTransitService.
>
> Server 1 log
> {code}
> 2014-07-16 17:28:05,268 INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Acquired lock for [org.apache.oozie.service.StatusTransitService]
> 2014-07-16 17:28:09,694 INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Set coordinator job [0011385-140716042555-oozie-oozi-C] status to 'SUCCEEDED' from 'RUNNING'
> 2014-07-16 17:28:15,416 INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Released lock for [org.apache.oozie.service.StatusTransitService]
> {code}
>
> Server 2 log
> {code}
> 2014-07-16 17:28:06,499 DEBUG CoordChangeXCommand:545 [http-0.0.0.0-4443-5] - USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] New pause/end date is : Wed Jul 16 17:30:00 UTC 2014 and last action number is : 3
> 2014-07-16 17:28:06,508 INFO CoordChangeXCommand:539 [http-0.0.0.0-4443-5] - USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] ENDED CoordChangeXCommand for jobId=0011385-140716042555-oozie-oozi-C
> {code}
>
> CoordMaterializeTransitionXCommand had created all actions (a few were in
> waiting state and a few were in running state) and set doneMaterialization
> to true.
> The change command deletes all waiting coord actions, leaving only the 3
> running/SUCCEEDED actions, and resets doneMaterialization.
> StatusTransitService first loads a set of pending jobs, and for each job it
> makes DB calls to check the coord action status. Coord jobs are loaded only
> once, at the beginning.
>
> This is what happened:
> 1. StatusTransitService loads the coord job, whose doneMaterialization is
> set to true, at 17:28:05,268 (server 1).
> 2. The change command deletes the waiting actions and resets
> doneMaterialization at 17:28:06,508 (server 2).
> 3. StatusTransitService loads the actions for the job, finding only 3, all
> in SUCCEEDED status, at 17:28:09,694 (server 1). It never reloads
> doneMaterialization.
>
> StatusTransitService then overwrites the job status with SUCCEEDED, because
> its stale copy of doneMaterialization is still true and all remaining
> actions are SUCCEEDED.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
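
The interleaving in steps 1 to 3 of the quoted description can be replayed deterministically in a few lines of Java. This is only a minimal sketch of the stale-read pattern and is not Oozie code: `JobRow`, `changeCommand`, `runUnsafe`, and `runWithReload` are hypothetical names, the in-process `ReentrantLock` merely stands in for whatever per-job lock an HA deployment would use, and the re-read-under-lock variant is one possible remedy rather than a description of what the attached patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class StaleReadSketch {

    /** Hypothetical stand-in for one coordinator job row plus its actions in the DB. */
    static final class JobRow {
        boolean doneMaterialization = true;
        String status = "RUNNING";
        final List<String> actionStatuses = new ArrayList<>(
                Arrays.asList("SUCCEEDED", "SUCCEEDED", "SUCCEEDED", "WAITING", "WAITING"));
        // Assumption: a per-job lock; a real HA setup would need a distributed lock.
        final ReentrantLock lock = new ReentrantLock();
    }

    /** Step 2: the change command deletes waiting actions and resets the flag. */
    static void changeCommand(JobRow db) {
        db.lock.lock();
        try {
            db.actionStatuses.removeIf(s -> s.equals("WAITING"));
            db.doneMaterialization = false;
        } finally {
            db.lock.unlock();
        }
    }

    static boolean allSucceeded(JobRow db) {
        return db.actionStatuses.stream().allMatch(s -> s.equals("SUCCEEDED"));
    }

    /** Replays the reported interleaving: the job flag is read once and never reloaded. */
    static String runUnsafe() {
        JobRow db = new JobRow();
        boolean staleDone = db.doneMaterialization; // step 1: job loaded once, no job lock held
        changeCommand(db);                          // step 2: runs on the other server
        if (staleDone && allSucceeded(db)) {        // step 3: fresh actions, stale flag
            db.status = "SUCCEEDED";
        }
        return db.status;
    }

    /** Same interleaving, but the flag is re-read under the job lock before deciding. */
    static String runWithReload() {
        JobRow db = new JobRow();
        boolean staleDone = db.doneMaterialization; // early snapshot, deliberately ignored below
        changeCommand(db);
        db.lock.lock();
        try {
            boolean freshDone = db.doneMaterialization; // reload just before the update
            if (freshDone && allSucceeded(db)) {
                db.status = "SUCCEEDED";
            }
        } finally {
            db.lock.unlock();
        }
        return db.status;
    }

    public static void main(String[] args) {
        System.out.println("stale snapshot:    " + runUnsafe());
        System.out.println("reload under lock: " + runWithReload());
    }
}
```

With the stale snapshot the job is marked SUCCEEDED even though materialization was reset; with the reload under the lock it stays RUNNING, which matches the behaviour the report expects.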