[jira] [Updated] (AMBARI-18240) During a Rolling Downgrade Oozie Long Running Jobs Can Fail

Jonathan Hurley (JIRA) Tue, 23 Aug 2016 14:07:44 -0700

     [ https://issues.apache.org/jira/browse/AMBARI-18240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jonathan Hurley updated AMBARI-18240:
-------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

{code}
commit 04a534ceacb1887c4666c97ea0d1a2670fe4a1cd (HEAD -> trunk, origin/trunk, origin/HEAD)
Author: Jonathan Hurley <jhur...@hortonworks.com>
Date:   Tue Aug 23 12:03:19 2016 -0400

    AMBARI-18240 - During a Rolling Downgrade Oozie Long Running Jobs Can Fail (jonathanhurley)
{code}

> During a Rolling Downgrade Oozie Long Running Jobs Can Fail
> -----------------------------------------------------------
>
>                 Key: AMBARI-18240
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18240
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Blocker
>             Fix For: trunk
>
>         Attachments: AMBARI-18240.patch
>
>
> - Install HDP-2.3.2.0-2950 with Ambari 2.4.0
> - Begin a long-running job (LRJ) in Oozie
> - Start upgrading to HDP-2.5.0.0-1235
> - Before the finalize step, start downgrading to HDP-2.3.2.0-2950.
> Sometimes, the LRJ will fail:
> {code}
> /usr/hdp/current/oozie-client/bin/oozie job -oozie http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie -info 0000001-160821214718970-oozie-oozi-C@248
> ID : 0000001-160821214718970-oozie-oozi-C@248
> ------------------------------------------------------------------------------------------------------------------------------------
> Action Number        : 248
> Console URL          : -
> Error Code           : -
> Error Message        : -
> External ID          : 0000030-160822042035608-oozie-oozi-W
> External Status      : -
> Job ID               : 0000001-160821214718970-oozie-oozi-C
> Tracker URI          : -
> Created              : 2016-08-22 00:37 GMT
> Nominal Time         : 2009-01-01 21:35 GMT
> Status               : FAILED
> Last Modified        : 2016-08-22 05:15 GMT
> First Missing Dependency : -
> ------------------------------------------------------------------------------------------------------------------------------------
> [hrt_qa@natr66-grls-dlm10toeriedwngdsec-r6-21 ~]$ /usr/hdp/current/oozie-client/bin/oozie job -oozie http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie -info 0000030-160822042035608-oozie-oozi-W
> Job ID : 0000030-160822042035608-oozie-oozi-W
> ------------------------------------------------------------------------------------------------------------------------------------
> Workflow Name : wordcount
> App Path      : hdfs://nameservice/user/hrt_qa/test_oozie_long_running
> Status        : FAILED
> Run           : 0
> User          : hrt_qa
> Group         : -
> Created       : 2016-08-22 05:08 GMT
> Started       : 2016-08-22 05:08 GMT
> Last Modified : 2016-08-22 05:15 GMT
> Ended         : 2016-08-22 05:15 GMT
> CoordAction ID: 0000001-160821214718970-oozie-oozi-C@248
> Actions
> ------------------------------------------------------------------------------------------------------------------------------------
> ID                                            Status    Ext ID                 Ext Status Err Code
> ------------------------------------------------------------------------------------------------------------------------------------
> 0000030-160822042035608-oozie-oozi-W@wc       FAILED    job_1471842441396_0002 FAILED     JA017
> ------------------------------------------------------------------------------------------------------------------------------------
> 0000030-160822042035608-oozie-oozi-W@:start:  OK        -                      OK         -
> ------------------------------------------------------------------------------------------------------------------------------------
> {code}
> This is caused by an outage of both NameNodes during the downgrade. 
> - We have two NNs at the "Finalize Upgrade" state; 
> -- nn1 is standby (out of safemode)
> -- nn2 is active (out of safemode)
> - A downgrade begins and we restart nn1
> -- After the restart of nn1, it hasn't come online yet. Our code tries to 
> contact it and can't, so we move onto nn2.
> -- nn2 is online and active and out of safemode (because it hasn't been 
> downgraded yet), so we let the downgrade continue
> - The downgrade continues and we restart nn2
> -- However, nn1 is still coming online and isn't even standby yet
> Now we have an nn1 which isn't fully loaded and an nn2 which is restarting 
> and trying to figure out whether to be active or standby. It's during this 
> gap that the tests must be failing. 
> So, it seems like we need to be a little bit smarter about waiting for the 
> namenode to restart; we can't just look at the "active" one and say things 
> are OK because it might be the next one to restart. 
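
The difference between the two checks described above can be sketched as follows. This is only an illustration, not the contents of AMBARI-18240.patch: the function names, the `get_state` probe, and the retry defaults are all hypothetical. The probe is assumed to return "active" or "standby" once a NameNode is up and out of safemode, and None while it is unreachable or still loading.

```python
import time

# Hypothetical state labels returned by the injected probe.
ACTIVE = "active"
STANDBY = "standby"


def naive_check(get_state, restarted, other):
    """Pass as soon as either NameNode reports active.

    Mirrors the problem described above: if the restarted NameNode is
    unreachable, the check falls through to the other one and passes,
    even though that other NameNode is the next one to be restarted.
    """
    for nn in (restarted, other):
        if get_state(nn) == ACTIVE:
            return True
    return False


def wait_for_restarted_namenode(get_state, restarted, retries=30,
                                delay_seconds=10.0, sleep=time.sleep):
    """Poll only the NameNode that was just restarted.

    Returns its HA state once it reports active or standby; raises
    RuntimeError if it never does within the allowed attempts.
    """
    for attempt in range(retries):
        state = get_state(restarted)
        if state in (ACTIVE, STANDBY):
            return state
        if attempt < retries - 1:
            sleep(delay_seconds)  # back off before probing again
    raise RuntimeError(
        "NameNode %s did not become active or standby after %d attempts"
        % (restarted, retries)
    )
```

With a probe for which nn1 is still unreachable and nn2 is active, `naive_check` passes immediately, while `wait_for_restarted_namenode` holds the downgrade until nn1 itself reports a state, so nn2 is not restarted while nn1 is still loading.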



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
