[ https://issues.apache.org/jira/browse/AMBARI-18240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Hurley updated AMBARI-18240:
-------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

{code}
commit 04a534ceacb1887c4666c97ea0d1a2670fe4a1cd (HEAD -> trunk, origin/trunk, origin/HEAD)
Author: Jonathan Hurley <jhur...@hortonworks.com>
Date:   Tue Aug 23 12:03:19 2016 -0400

    AMBARI-18240 - During a Rolling Downgrade Oozie Long Running Jobs Can Fail (jonathanhurley)
{code}

> During a Rolling Downgrade Oozie Long Running Jobs Can Fail
> -----------------------------------------------------------
>
>                 Key: AMBARI-18240
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18240
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Blocker
>             Fix For: trunk
>
>         Attachments: AMBARI-18240.patch
>
>
> - Install HDP-2.3.2.0-2950 with Ambari 2.4.0
> - Begin a long-running job (LRJ) in Oozie
> - Start upgrading to HDP-2.5.0.0-1235
> - Before the finalizing step, start downgrading to HDP-2.3.2.0-2950.
> Sometimes, the LRJ will fail:
> {code}
> /usr/hdp/current/oozie-client/bin/oozie job -oozie
> http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie
> -info 0000001-160821214718970-oozie-oozi-C@248
> ID : 0000001-160821214718970-oozie-oozi-C@248
> ------------------------------------------------------------------------------------------------------------------------------------
> Action Number            : 248
> Console URL              : -
> Error Code               : -
> Error Message            : -
> External ID              : 0000030-160822042035608-oozie-oozi-W
> External Status          : -
> Job ID                   : 0000001-160821214718970-oozie-oozi-C
> Tracker URI              : -
> Created                  : 2016-08-22 00:37 GMT
> Nominal Time             : 2009-01-01 21:35 GMT
> Status                   : FAILED
> Last Modified            : 2016-08-22 05:15 GMT
> First Missing Dependency : -
> ------------------------------------------------------------------------------------------------------------------------------------
> [hrt_qa@natr66-grls-dlm10toeriedwngdsec-r6-21 ~]$
> /usr/hdp/current/oozie-client/bin/oozie job -oozie
> http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie
> -info 0000030-160822042035608-oozie-oozi-W
> Job ID : 0000030-160822042035608-oozie-oozi-W
> ------------------------------------------------------------------------------------------------------------------------------------
> Workflow Name : wordcount
> App Path      : hdfs://nameservice/user/hrt_qa/test_oozie_long_running
> Status        : FAILED
> Run           : 0
> User          : hrt_qa
> Group         : -
> Created       : 2016-08-22 05:08 GMT
> Started       : 2016-08-22 05:08 GMT
> Last Modified : 2016-08-22 05:15 GMT
> Ended         : 2016-08-22 05:15 GMT
> CoordAction ID: 0000001-160821214718970-oozie-oozi-C@248
> Actions
> ------------------------------------------------------------------------------------------------------------------------------------
> ID                                            Status    Ext ID                  Ext Status  Err Code
> ------------------------------------------------------------------------------------------------------------------------------------
> 0000030-160822042035608-oozie-oozi-W@wc       FAILED    job_1471842441396_0002  FAILED      JA017
> ------------------------------------------------------------------------------------------------------------------------------------
> 0000030-160822042035608-oozie-oozi-W@:start:  OK        -                       OK          -
> ------------------------------------------------------------------------------------------------------------------------------------
> {code}
> This is caused by an outage of both NameNodes during the downgrade.
> - We have two NNs at the "Finalize Upgrade" state:
> -- nn1 is standby (out of safemode)
> -- nn2 is active (out of safemode)
> - A downgrade begins and we restart nn1
> -- After the restart of nn1, it hasn't come online yet. Our code tries to
> contact it and can't, so we move on to nn2.
> -- nn2 is online, active, and out of safemode (because it hasn't been
> downgraded yet), so we let the downgrade continue
> - The downgrade continues and we restart nn2
> -- However, nn1 is still coming online and isn't even standby yet
> Now we have an nn1 which isn't fully loaded and an nn2 which is restarting
> and trying to figure out whether to be active or standby. It's during this
> gap that the tests must be failing.
> So, it seems like we need to be a little bit smarter about waiting for the
> namenode to restart; we can't just look at the "active" one and say things
> are OK because it might be the next one to restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
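
Illustrative note on the root cause described in the issue: the sketch below shows one way to wait on the NameNode that was just restarted, rather than accepting any other NameNode that happens to be active. It is not the committed patch. The names `wait_for_namenode` and `haadmin_state`, the timeout values, and the use of `hdfs haadmin -getServiceState` as the probe are all assumptions made for illustration.

```python
import subprocess
import time


def haadmin_state(namenode_id):
    """Hypothetical probe: ask HDFS for the HA state of one NameNode.

    Returns 'active', 'standby', or None if the NameNode cannot be reached.
    """
    try:
        result = subprocess.run(
            ["hdfs", "haadmin", "-getServiceState", namenode_id],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def wait_for_namenode(get_state, namenode_id, timeout=600, interval=10,
                      sleep=time.sleep, clock=time.monotonic):
    """Block until the restarted NameNode itself reports active or standby.

    The state of the *other* NameNode is deliberately ignored, because it
    may be the next one to restart.
    """
    deadline = clock() + timeout
    while True:
        state = get_state(namenode_id)
        if state in ("active", "standby"):
            return state
        if clock() >= deadline:
            raise RuntimeError(
                "NameNode %s did not become active or standby within %s seconds"
                % (namenode_id, timeout)
            )
        sleep(interval)
```

Under these assumptions, a downgrade step would call `wait_for_namenode(haadmin_state, "nn1")` after restarting nn1, and only then move on to restart nn2.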