Added: oozie/site/trunk/content/resources/docs/5.0.0/WorkflowFunctionalSpec.html
URL: http://svn.apache.org/viewvc/oozie/site/trunk/content/resources/docs/5.0.0/WorkflowFunctionalSpec.html?rev=1828722&view=auto
==============================================================================
--- oozie/site/trunk/content/resources/docs/5.0.0/WorkflowFunctionalSpec.html (added)
+++ oozie/site/trunk/content/resources/docs/5.0.0/WorkflowFunctionalSpec.html Mon Apr 9 14:12:36 2018
@@ -0,0 +1,5648 @@
+<!DOCTYPE html>
+<!--
+ | Generated by Apache Maven Doxia at Apr 9, 2018
+ | Rendered using Apache Maven Fluido Skin 1.4
+-->
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
+  <head>
+    <meta charset="UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <meta http-equiv="Content-Language" content="en" />
+    <title>Oozie</title>
+    <link rel="stylesheet" href="./css/apache-maven-fluido-1.4.min.css" />
+    <link rel="stylesheet" href="./css/site.css" />
+    <link rel="stylesheet" href="./css/print.css" media="print" />
+    <script type="text/javascript" src="./js/apache-maven-fluido-1.4.min.js"></script>
+  </head>
+  <body class="topBarDisabled">
+    <div class="container-fluid">
+      <div class="row-fluid">
+        <div id="bodyColumn" class="span10" >
+<p><a href="./index.html">::Go back to Oozie Documentation Index::</a>
+</p>
+<hr />
+<a name="Oozie_Specification_a_Hadoop_Workflow_System"></a>
+<div class="section"><h2> Oozie Specification, a Hadoop Workflow System</h2>
+<p><b><center>(v5.0)</center></b>
+</p>
+<p>The goal of this document is to define a workflow engine system specialized in coordinating the execution of Hadoop
+Map/Reduce and Pig jobs.</p>
+<p><ul><ul><li><a href="#Changelog">Changelog</a>
+</li>
+<li><a href="#a0_Definitions">0 Definitions</a>
+</li>
+<li><a href="#a1_Specification_Highlights">1 Specification Highlights</a>
+</li>
+<li><a href="#a2_Workflow_Definition">2 Workflow Definition</a>
+<ul><li><a href="#a2.1_Cycles_in_Workflow_Definitions">2.1 Cycles in Workflow Definitions</a>
+</li>
+</ul>
+</li>
+<li><a href="#a3_Workflow_Nodes">3 Workflow Nodes</a>
+<ul><li><a href="#a3.1_Control_Flow_Nodes">3.1 Control Flow Nodes</a>
+<ul><li><a href="#a3.1.1_Start_Control_Node">3.1.1 Start Control Node</a>
+</li>
+<li><a href="#a3.1.2_End_Control_Node">3.1.2 End Control Node</a>
+</li>
+<li><a href="#a3.1.3_Kill_Control_Node">3.1.3 Kill Control Node</a>
+</li>
+<li><a href="#a3.1.4_Decision_Control_Node">3.1.4 Decision Control Node</a>
+</li>
+<li><a href="#a3.1.5_Fork_and_Join_Control_Nodes">3.1.5 Fork and Join Control Nodes</a>
+</li>
+</ul>
+</li>
+<li><a href="#a3.2_Workflow_Action_Nodes">3.2 Workflow Action Nodes</a>
+<ul><li><a href="#a3.2.1_Action_Basis">3.2.1 Action Basis</a>
+<ul><li><a href="#a3.2.1.1_Action_ComputationProcessing_Is_Always_Remote">3.2.1.1 Action Computation/Processing Is Always Remote</a>
+</li>
+<li><a href="#a3.2.1.2_Actions_Are_Asynchronous">3.2.1.2 Actions Are Asynchronous</a>
+</li>
+<li><a href="#a3.2.1.3_Actions_Have_2_Transitions_ok_and_error">3.2.1.3 Actions Have 2 Transitions, <tt>ok</tt> and <tt>error</tt></a>
+</li>
+<li><a href="#a3.2.1.4_Action_Recovery">3.2.1.4 Action Recovery</a>
+</li>
+</ul>
+</li>
+<li><a href="#a3.2.2_Map-Reduce_Action">3.2.2 Map-Reduce Action</a>
+<ul><li><a href="#a3.2.2.1_Adding_Files_and_Archives_for_the_Job">3.2.2.1 Adding Files and Archives for the Job</a>
+</li>
+<li><a href="#a3.2.2.2_Configuring_the_MapReduce_action_with_Java_code">3.2.2.2 Configuring the MapReduce action with Java code</a>
+</li>
+<li><a href="#a3.2.2.3_Streaming">3.2.2.3 Streaming</a>
+</li>
+<li><a href="#a3.2.2.4_Pipes">3.2.2.4 Pipes</a>
+</li>
+<li><a href="#a3.2.2.5_Syntax">3.2.2.5 Syntax</a>
+</li>
+</ul>
+</li>
+<li><a href="#a3.2.3_Pig_Action">3.2.3 Pig Action</a>
+</li>
+<li><a href="#a3.2.4_Fs_HDFS_action">3.2.4 Fs (HDFS) action</a>
+</li>
+<li><a href="#a3.2.5_Sub-workflow_Action">3.2.5 Sub-workflow Action</a>
+</li>
+<li><a href="#a3.2.6_Java_Action">3.2.6 Java Action</a>
+<ul><li><a href="#a3.2.6.1_Overriding_an_actions_Main_class">3.2.6.1 Overriding an action's Main class</a>
+</li>
+</ul>
+</li>
+</ul>
+</li>
+</ul>
+</li>
+<li><a href="#a4_Parameterization_of_Workflows">4 Parameterization of Workflows</a>
+<ul><li><a href="#a4.1_Workflow_Job_Properties_or_Parameters">4.1 Workflow Job Properties (or Parameters)</a>
+</li>
+<li><a href="#a4.2_Expression_Language_Functions">4.2 Expression Language Functions</a>
+<ul><li><a href="#a4.2.1_Basic_EL_Constants">4.2.1 Basic EL Constants</a>
+</li>
+<li><a href="#a4.2.2_Basic_EL_Functions">4.2.2 Basic EL Functions</a>
+</li>
+<li><a href="#a4.2.3_Workflow_EL_Functions">4.2.3 Workflow EL Functions</a>
+</li>
+<li><a href="#a4.2.4_Hadoop_EL_Constants">4.2.4 Hadoop EL Constants</a>
+</li>
+<li><a href="#a4.2.5_Hadoop_EL_Functions">4.2.5 Hadoop EL Functions</a>
+</li>
+<li><a href="#a4.2.6_Hadoop_Jobs_EL_Function">4.2.6 Hadoop Jobs EL Function</a>
+</li>
+<li><a href="#a4.2.7_HDFS_EL_Functions">4.2.7 HDFS EL Functions</a>
+</li>
+<li><a href="#a4.2.8_HCatalog_EL_Functions">4.2.8 HCatalog EL Functions</a>
+</li>
+</ul>
+</li>
+</ul>
+</li>
+<li><a href="#a5_Workflow_Notifications">5 Workflow Notifications</a>
+<ul><li><a href="#a5.1_Workflow_Job_Status_Notification">5.1 Workflow Job Status Notification</a>
+</li>
+<li><a href="#a5.2_Node_Start_and_End_Notifications">5.2 Node Start and End Notifications</a>
+</li>
+</ul>
+</li>
+<li><a href="#a6_User_Propagation">6 User Propagation</a>
+</li>
+<li><a href="#a7_Workflow_Application_Deployment">7 Workflow Application Deployment</a>
+</li>
+<li><a
href="#a8_External_Data_Assumptions">8 External Data Assumptions</a> +</li> +<li><a href="#a9_Workflow_Jobs_Lifecycle">9 Workflow Jobs Lifecycle</a> +<ul><li><a href="#a9.1_Workflow_Job_Lifecycle">9.1 Workflow Job Lifecycle</a> +</li> +<li><a href="#a9.2_Workflow_Action_Lifecycle">9.2 Workflow Action Lifecycle</a> +</li> +</ul> +</li> +<li><a href="#a10_Workflow_Jobs_Recovery_re-run">10 Workflow Jobs Recovery (re-run)</a> +</li> +<li><a href="#a11_Oozie_Web_Services_API">11 Oozie Web Services API</a> +</li> +<li><a href="#a12_Client_API">12 Client API</a> +</li> +<li><a href="#a13_Command_Line_Tools">13 Command Line Tools</a> +</li> +<li><a href="#a14_Web_UI_Console">14 Web UI Console</a> +</li> +<li><a href="#a15_Customizing_Oozie_with_Extensions">15 Customizing Oozie with Extensions</a> +</li> +<li><a href="#a16_Workflow_Jobs_Priority">16 Workflow Jobs Priority</a> +</li> +<li><a href="#a17_HDFS_Share_Libraries_for_Workflow_Applications_since_Oozie_2.3">17 HDFS Share Libraries for Workflow Applications (since Oozie 2.3)</a> +<ul><li><a href="#a17.1_Action_Share_Library_Override_since_Oozie_3.3">17.1 Action Share Library Override (since Oozie 3.3)</a> +</li> +</ul> +</li> +<li><a href="#a18_User-Retry_for_Workflow_Actions_since_Oozie_3.1">18 User-Retry for Workflow Actions (since Oozie 3.1)</a> +</li> +<li><a href="#a19_Global_Configurations">19 Global Configurations</a> +</li> +<li><a href="#a20_Suspend_On_Nodes">20 Suspend On Nodes</a> +</li> +<li><a href="#Appendixes">Appendixes</a> +<ul><li><a href="#Appendix_A_Oozie_Workflow_and_Common_XML_Schemas">Appendix A, Oozie Workflow and Common XML Schemas</a> +<ul><li><a href="#Oozie_Workflow_Schema_Version_1.0">Oozie Workflow Schema Version 1.0</a> +</li> +<li><a href="#Oozie_Common_Schema_Version_1.0">Oozie Common Schema Version 1.0</a> +</li> +<li><a href="#Oozie_Workflow_Schema_Version_0.5">Oozie Workflow Schema Version 0.5</a> +</li> +<li><a href="#Oozie_Workflow_Schema_Version_0.4.5">Oozie Workflow Schema Version 0.4.5</a> +</li> +<li><a href="#Oozie_Workflow_Schema_Version_0.4">Oozie Workflow Schema Version 0.4</a> +</li> +<li><a href="#Oozie_Workflow_Schema_Version_0.3">Oozie Workflow Schema Version 0.3</a> +</li> +<li><a href="#Oozie_Workflow_Schema_Version_0.2.5">Oozie Workflow Schema Version 0.2.5</a> +</li> +<li><a href="#Oozie_Workflow_Schema_Version_0.2">Oozie Workflow Schema Version 0.2</a> +</li> +<li><a href="#Oozie_SLA_Version_0.2">Oozie SLA Version 0.2</a> +</li> +<li><a href="#Oozie_SLA_Version_0.1">Oozie SLA Version 0.1</a> +</li> +<li><a href="#Oozie_Workflow_Schema_Version_0.1">Oozie Workflow Schema Version 0.1</a> +</li> +</ul> +</li> +<li><a href="#Appendix_B_Workflow_Examples">Appendix B, Workflow Examples</a> +<ul></ul> +</li> +</ul> +</li> +</ul> +</ul> +</p> +<a name="Changelog"></a> +<div class="section"><h3>Changelog</h3> +<a name="a2016FEB19"></a> +<div class="section"><h4> 2016FEB19</h4> +<p><ul><li>#3.2.7 Updated notes on System.exit(int n) behavior</li> +</ul> +</p> +<a name="a2015APR29"></a> +</div> +<div class="section"><h4> 2015APR29</h4> +<p><ul><li>#3.2.1.4 Added notes about Java action retries</li> +<li>#3.2.7 Added notes about Java action retries</li> +</ul> +</p> +<a name="a2014MAY08"></a> +</div> +<div class="section"><h4> 2014MAY08</h4> +<p><ul><li>#3.2.2.4 Added support for fully qualified job-xml path</li> +</ul> +</p> +<a name="a2013JUL03"></a> +</div> +<div class="section"><h4> 2013JUL03</h4> +<p><ul><li>#Appendix A, Added new workflow schema 0.5 and SLA schema 0.2</li> +</ul> +</p> +<a 
name="a2012AUG30"></a> +</div> +<div class="section"><h4> 2012AUG30</h4> +<p><ul><li>#4.2.2 Added two EL functions (replaceAll and appendAll)</li> +</ul> +</p> +<a name="a2012JUL26"></a> +</div> +<div class="section"><h4> 2012JUL26</h4> +<p><ul><li>#Appendix A, updated XML schema 0.4 to include <tt>parameters</tt> + element</li> +<li>#4.1 Updated to mention about <tt>parameters</tt> + element as of schema 0.4</li> +</ul> +</p> +<a name="a2012JUL23"></a> +</div> +<div class="section"><h4> 2012JUL23</h4> +<p><ul><li>#Appendix A, updated XML schema 0.4 (Fs action)</li> +<li>#3.2.4 Updated to mention that a <tt>name-node</tt> +, a <tt>job-xml</tt> +, and a <tt>configuration</tt> + element are allowed in the Fs action as of</li> +</ul> +schema 0.4</p> +<a name="a2012JUN19"></a> +</div> +<div class="section"><h4> 2012JUN19</h4> +<p><ul><li>#Appendix A, added XML schema 0.4</li> +<li>#3.2.2.4 Updated to mention that multiple <tt>job-xml</tt> + elements are allowed as of schema 0.4</li> +<li>#3.2.3 Updated to mention that multiple <tt>job-xml</tt> + elements are allowed as of schema 0.4</li> +</ul> +</p> +<a name="a2011AUG17"></a> +</div> +<div class="section"><h4> 2011AUG17</h4> +<p><ul><li>#3.2.4 fs 'chmod' xml closing element typo in Example corrected</li> +</ul> +</p> +<a name="a2011AUG12"></a> +</div> +<div class="section"><h4> 2011AUG12</h4> +<p><ul><li>#3.2.4 fs 'move' action characteristics updated, to allow for consistent source and target paths and existing target path only if directory</li> +<li>#18, Update the doc for user-retry of workflow action.</li> +</ul> +</p> +<a name="a2011FEB19"></a> +</div> +<div class="section"><h4> 2011FEB19</h4> +<p><ul><li>#10, Update the doc to rerun from the failed node.</li> +</ul> +</p> +<a name="a2010OCT31"></a> +</div> +<div class="section"><h4> 2010OCT31</h4> +<p><ul><li>#17, Added new section on Shared Libraries</li> +</ul> +</p> +<a name="a2010APR27"></a> +</div> +<div class="section"><h4> 2010APR27</h4> +<p><ul><li>#3.2.3 Added new "arguments" tag to PIG actions</li> +<li>#3.2.5 SSH actions are deprecated in Oozie schema 0.1 and removed in Oozie schema 0.2</li> +<li>#Appendix A, Added schema version 0.2</li> +</ul> +</p> +<a name="a2009OCT20"></a> +</div> +<div class="section"><h4> 2009OCT20</h4> +<p><ul><li>#Appendix A, updated XML schema</li> +</ul> +</p> +<a name="a2009SEP15"></a> +</div> +<div class="section"><h4> 2009SEP15</h4> +<p><ul><li>#3.2.6 Removing support for sub-workflow in a different Oozie instance (removing the 'oozie' element)</li> +</ul> +</p> +<a name="a2009SEP07"></a> +</div> +<div class="section"><h4> 2009SEP07</h4> +<p><ul><li>#3.2.2.3 Added Map Reduce Pipes specifications.</li> +<li>#3.2.2.4 Map-Reduce Examples. 
Previously was 3.2.2.3.</li>
+</ul>
+</p>
+<a name="a2009SEP02"></a>
+</div>
+<div class="section"><h4> 2009SEP02</h4>
+<p><ul><li>#10 Added missing skip nodes property name.</li>
+<li>#3.2.1.4 Reworded action recovery explanation.</li>
+</ul>
+</p>
+<a name="a2009AUG26"></a>
+</div>
+<div class="section"><h4> 2009AUG26</h4>
+<p><ul><li>#3.2.9 Added <tt>java</tt> action type</li>
+<li>#3.1.4 Example uses EL constant to refer to counter group/name</li>
+</ul>
+</p>
+<a name="a2009JUN09"></a>
+</div>
+<div class="section"><h4> 2009JUN09</h4>
+<p><ul><li>#12.2.4 Added build version resource to admin end-point</li>
+<li>#3.2.6 Added flag to propagate workflow configuration to sub-workflows</li>
+<li>#10 Added behavior for workflow job parameters given in the rerun</li>
+<li>#11.3.4 workflows info returns pagination information</li>
+</ul>
+</p>
+<a name="a2009MAY18"></a>
+</div>
+<div class="section"><h4> 2009MAY18</h4>
+<p><ul><li>#3.1.4 decision node, 'default' element, 'name' attribute changed to 'to'</li>
+<li>#3.1.5 fork node, 'transition' element changed to 'start', 'to' attribute changed to 'path'</li>
+<li>#3.1.5 join node, 'transition' element removed, added 'to' attribute to 'join' element</li>
+<li>#3.2.1.4 Rewording on action recovery section</li>
+<li>#3.2.2 map-reduce action, added 'job-tracker', 'name-node', 'file' and 'archive' elements</li>
+<li>#3.2.2.1 map-reduce action, removed 'file' and 'archive' elements from the 'streaming' element</li>
+<li>#3.2.2.2 map-reduce action, reorganized streaming section</li>
+<li>#3.2.3 pig action, removed information about implementation (SSH), changed element names</li>
+<li>#3.2.4 fs action, removed 'fs-uri' and 'user-name' elements, file system URI is now specified in path, user is propagated</li>
+<li>#3.2.6 sub-workflow action, renamed elements 'oozie-url' to 'oozie' and 'workflow-app' to 'app-path'</li>
+<li>#4 Properties that are valid Java identifiers can be used as ${NAME}</li>
+<li>#4.1 Renamed default properties file from 'configuration.xml' to 'default-configuration.xml'</li>
+<li>#4.2 Changes in EL Constants and Functions</li>
+<li>#5 Updated notification behavior and tokens</li>
+<li>#6 Changed user propagation behavior</li>
+<li>#7 Changed application packaging from ZIP to HDFS directory</li>
+<li>Removed application lifecycle and self containment model sections</li>
+<li>#10 Changed workflow job recovery, simplified recovery behavior</li>
+<li>#11 Detailed Web Services API</li>
+<li>#12 Updated Client API section</li>
+<li>#15 Updated Action Executor API section</li>
+<li>#Appendix A XML namespace updated to 'uri:oozie:workflow:0.1'</li>
+<li>#Appendix A Updated XML schema to changes in map-reduce/pig/fs/ssh actions</li>
+<li>#Appendix B Updated workflow example to schema changes</li>
+</ul>
+</p>
+<a name="a2009MAR25"></a>
+</div>
+<div class="section"><h4> 2009MAR25</h4>
+<p><ul><li>Changing all references of HWS to Oozie (project name)</li>
+<li>Typos, XML Formatting</li>
+<li>XML Schema URI correction</li>
+</ul>
+</p>
+<a name="a2009MAR09"></a>
+</div>
+<div class="section"><h4> 2009MAR09</h4>
+<p><ul><li>Changed <tt>CREATED</tt> job state to <tt>PREP</tt> to have same states as Hadoop</li>
+<li>Renamed 'hadoop-workflow' element to 'workflow-app'</li>
+<li>Decision syntax changed to be 'switch/case' with no transition indirection</li>
+<li>Action nodes common root element 'action', with the action type as sub-element (using a single built-in XML schema)</li>
+<li>Action nodes have 2 explicit
transitions 'ok to' and 'error to' enforced by XML schema</li>
+<li>Renamed 'fail' action element to 'kill'</li>
+<li>Renamed 'hadoop' action element to 'map-reduce'</li>
+<li>Renamed 'hdfs' action element to 'fs'</li>
+<li>Updated all XML snippets and examples</li>
+<li>Made user propagation simpler and consistent</li>
+<li>Added Oozie XML schema to Appendix A</li>
+<li>Added workflow example to Appendix B</li>
+</ul>
+</p>
+<a name="a2009FEB22"></a>
+</div>
+<div class="section"><h4> 2009FEB22</h4>
+<p><ul><li>Opened <a class="externalLink" href="https://issues.apache.org/jira/browse/HADOOP-5303">JIRA HADOOP-5303</a>
+</li>
+</ul>
+</p>
+<a name="a27DEC2012:"></a>
+</div>
+<div class="section"><h4> 2012DEC27</h4>
+<p><ul><li>Added information on dropping hcatalog table partitions in prepare block</li>
+<li>Added hcatalog EL functions section</li>
+</ul>
+</p>
+<a name="a0_Definitions"></a>
+</div>
+</div>
+<div class="section"><h3>0 Definitions</h3>
+<p><b>Action:</b> An execution/computation task (Map-Reduce job, Pig job, a shell command). It can also be referred to as a task or
+'action node'.</p>
+<p><b>Workflow:</b> A collection of actions arranged in a control dependency DAG (Directed Acyclic Graph). "control dependency"
+from one action to another means that the second action can't run until the first action has completed.</p>
+<p><b>Workflow Definition:</b> A programmatic description of a workflow that can be executed.</p>
+<p><b>Workflow Definition Language:</b> The language used to define a Workflow Definition.</p>
+<p><b>Workflow Job:</b> An executable instance of a workflow definition.</p>
+<p><b>Workflow Engine:</b> A system that executes workflow jobs. It can also be referred to as a DAG engine.</p>
+<a name="a1_Specification_Highlights"></a>
+</div>
+<div class="section"><h3>1 Specification Highlights</h3>
+<p>A Workflow application is a DAG that coordinates the following types of actions: Hadoop, Pig, and
+sub-workflows.</p>
+<p>Flow control operations within the workflow applications can be done using decision, fork and join nodes. Cycles in
+workflows are not supported.</p>
+<p>Actions and decisions can be parameterized with job properties, action output (e.g. Hadoop counters) and file information (file exists, file size, etc.). Formal parameters are expressed in the workflow
+definition as <tt>${VAR}</tt> variables.</p>
+<p>A Workflow application is a ZIP file that contains the workflow definition (an XML file) and all the necessary files to
+run all the actions: JAR files for Map/Reduce jobs, shells for streaming Map/Reduce jobs, native libraries, Pig
+scripts, and other resource files.</p>
+<p>Before running a workflow job, the corresponding workflow application must be deployed in Oozie.</p>
+<p>Deploying a workflow application and running workflow jobs can be done via command line tools, a WS API and a Java API.</p>
+<p>Monitoring the system and workflow jobs can be done via a web console, command line tools, a WS API and a Java API.</p>
+<p>When submitting a workflow job, a set of properties resolving all the formal parameters in the workflow definitions
+must be provided. This set of properties is a Hadoop configuration.</p>
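+<p>For example, the submission properties for a workflow could look like the following sketch (the host
+names, the application path and the <tt>firstJobReducers</tt> parameter are hypothetical values; only
+<tt>oozie.wf.application.path</tt> is a fixed Oozie property name):</p>
+<p><pre>
+oozie.wf.application.path=hdfs://bar:8020/user/tucu/myapp
+resourceManager=foo:8032
+nameNode=hdfs://bar:8020
+firstJobReducers=10
+</pre></p>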
+<p>Possible states for a workflow job are: <tt>PREP</tt>, <tt>RUNNING</tt>, <tt>SUSPENDED</tt>, <tt>SUCCEEDED</tt>, <tt>KILLED</tt> and <tt>FAILED</tt>.</p>
+<p>In the case of an action start failure in a workflow job, depending on the type of failure, Oozie will attempt automatic
+retries, request a manual retry, or fail the workflow job.</p>
+<p>Oozie can make HTTP callback notifications on action start/end/failure events and workflow end/failure events.</p>
+<p>In the case of workflow job failure, the workflow job can be resubmitted skipping previously completed actions.
+Before doing a resubmission the workflow application could be updated with a patch to fix a problem in the workflow
+application code.</p>
+<p><a name="WorkflowDefinition"></a>
+</p>
+<a name="a2_Workflow_Definition"></a>
+</div>
+<div class="section"><h3>2 Workflow Definition</h3>
+<p>A workflow definition is a DAG with control flow nodes (start, end, decision, fork, join, kill) or action nodes
+(map-reduce, pig, etc.); nodes are connected by transition arrows.</p>
+<p>The workflow definition language is XML based and it is called hPDL (Hadoop Process Definition Language).</p>
+<p>Refer to Appendix A for the <a href="./WorkflowFunctionalSpec.html#OozieWFSchema">Oozie Workflow Definition XML Schema</a>. Appendix
+B has <a href="./WorkflowFunctionalSpec.html#OozieWFExamples">Workflow Definition Examples</a>.</p>
+<a name="a2.1_Cycles_in_Workflow_Definitions"></a>
+<div class="section"><h4>2.1 Cycles in Workflow Definitions</h4>
+<p>Oozie does not support cycles in workflow definitions; workflow definitions must be a strict DAG.</p>
+<p>At workflow application deployment time, if Oozie detects a cycle in the workflow definition it must fail the
+deployment.</p>
+<a name="a3_Workflow_Nodes"></a>
+</div>
+</div>
+<div class="section"><h3>3 Workflow Nodes</h3>
+<p>Workflow nodes are classified as control flow nodes and action nodes:</p>
+<p><ul><li><b>Control flow nodes:</b> nodes that control the start and end of the workflow and workflow job execution path.</li>
+<li><b>Action nodes:</b> nodes that trigger the execution of a computation/processing task.</li>
+</ul>
+</p>
+<p>Node names and transitions must conform to the pattern <tt>[a-zA-Z][\-_a-zA-Z0-9]*</tt>, of up to 20 characters
+long.</p>
+<a name="a3.1_Control_Flow_Nodes"></a>
+<div class="section"><h4>3.1 Control Flow Nodes</h4>
+<p>Control flow nodes define the beginning and the end of a workflow (the <tt>start</tt>, <tt>end</tt> and <tt>kill</tt> nodes) and provide a
+mechanism to control the workflow execution path (the <tt>decision</tt>, <tt>fork</tt> and <tt>join</tt> nodes).</p>
+<p><a name="StartNode"></a>
+</p>
+<a name="a3.1.1_Start_Control_Node"></a>
+<div class="section"><h5>3.1.1 Start Control Node</h5>
+<p>The <tt>start</tt> node is the entry point for a workflow job; it indicates the first workflow node the workflow job must
+transition to.</p>
+<p>When a workflow is started, it automatically transitions to the node specified in the <tt>start</tt> node.</p>
+<p>A workflow definition must have one <tt>start</tt> node.</p>
+<p><b>Syntax:</b>
+</p>
+<p><pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <start to="[NODE-NAME]"/>
+    ...
+</workflow-app>
+</pre></p>
+<p>The <tt>to</tt> attribute is the name of the first workflow node to execute.</p>
+<p><b>Example:</b>
+</p>
+<p><pre>
+<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <start to="firstHadoopJob"/>
+    ...
+</workflow-app>
+</pre></p>
+<p><a name="EndNode"></a>
+</p>
+<a name="a3.1.2_End_Control_Node"></a>
+</div>
+<div class="section"><h5>3.1.2 End Control Node</h5>
+<p>The <tt>end</tt> node is the end of a workflow job; it indicates that the workflow job has completed successfully.</p>
+<p>When a workflow job reaches the <tt>end</tt> node it finishes successfully (<tt>SUCCEEDED</tt>).</p>
+<p>If one or more actions started by the workflow job are executing when the <tt>end</tt> node is reached, the actions will be
+killed. In this scenario the workflow job is still considered to have run successfully.</p>
+<p>A workflow definition must have one <tt>end</tt> node.</p>
+<p><b>Syntax:</b>
+</p>
+<p><pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <end name="[NODE-NAME]"/>
+    ...
+</workflow-app>
+</pre></p>
+<p>The <tt>name</tt> attribute is the name of the <tt>end</tt> node; transitioning to it ends the workflow job.</p>
+<p><b>Example:</b>
+</p>
+<p><pre>
+<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <end name="end"/>
+</workflow-app>
+</pre></p>
+<p><a name="KillNode"></a>
+</p>
+<a name="a3.1.3_Kill_Control_Node"></a>
+</div>
+<div class="section"><h5>3.1.3 Kill Control Node</h5>
+<p>The <tt>kill</tt> node allows a workflow job to kill itself.</p>
+<p>When a workflow job reaches a <tt>kill</tt> node it finishes in error (<tt>KILLED</tt>).</p>
+<p>If one or more actions started by the workflow job are executing when the <tt>kill</tt> node is reached, the actions will be
+killed.</p>
+<p>A workflow definition may have zero or more <tt>kill</tt> nodes.</p>
+<p><b>Syntax:</b>
+</p>
+<p><pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <kill name="[NODE-NAME]">
+        <message>[MESSAGE-TO-LOG]</message>
+    </kill>
+    ...
+</workflow-app>
+</pre></p>
+<p>The <tt>name</tt> attribute in the <tt>kill</tt> node is the name of the kill node.</p>
+<p>The content of the <tt>message</tt> element will be logged as the kill reason for the workflow job.</p>
+<p>A <tt>kill</tt> node does not have transition elements because it ends the workflow job, as <tt>KILLED</tt>.</p>
+<p><b>Example:</b>
+</p>
+<p><pre>
+<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <kill name="killBecauseNoInput">
+        <message>Input unavailable</message>
+    </kill>
+    ...
+</workflow-app>
+</pre></p>
+<p><a name="DecisionNode"></a>
+</p>
+<a name="a3.1.4_Decision_Control_Node"></a>
+</div>
+<div class="section"><h5>3.1.4 Decision Control Node</h5>
+<p>A <tt>decision</tt> node enables a workflow to make a selection on the execution path to follow.</p>
+<p>The behavior of a <tt>decision</tt> node can be seen as a switch-case statement.</p>
+<p>A <tt>decision</tt> node consists of a list of predicate-transition pairs plus a default transition. Predicates are evaluated
+in order of appearance until one of them evaluates to <tt>true</tt> and the corresponding transition is taken. If none of the
+predicates evaluates to <tt>true</tt> the <tt>default</tt> transition is taken.</p>
+<p>Predicates are JSP Expression Language (EL) expressions (refer to section 4.2 of this document) that resolve into a
+boolean value, <tt>true</tt> or <tt>false</tt>.
For example:</p>
+<p><pre>
+    ${fs:fileSize('/usr/foo/myinputdir') gt 10 * GB}
+</pre></p>
+<p><b>Syntax:</b>
+</p>
+<p><pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <decision name="[NODE-NAME]">
+        <switch>
+            <case to="[NODE_NAME]">[PREDICATE]</case>
+            ...
+            <case to="[NODE_NAME]">[PREDICATE]</case>
+            <default to="[NODE_NAME]"/>
+        </switch>
+    </decision>
+    ...
+</workflow-app>
+</pre></p>
+<p>The <tt>name</tt> attribute in the <tt>decision</tt> node is the name of the decision node.</p>
+<p>Each <tt>case</tt> element contains a predicate and a transition name. The predicate ELs are evaluated
+in order until one returns <tt>true</tt> and the corresponding transition is taken.</p>
+<p>The <tt>default</tt> element indicates the transition to take if none of the predicates evaluates
+to <tt>true</tt>.</p>
+<p>All decision nodes must have a <tt>default</tt> element to avoid bringing the workflow into an error
+state if none of the predicates evaluates to true.</p>
+<p><b>Example:</b>
+</p>
+<p><pre>
+<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <decision name="mydecision">
+        <switch>
+            <case to="reconsolidatejob">
+                ${fs:fileSize(secondjobOutputDir) gt 10 * GB}
+            </case>
+            <case to="rexpandjob">
+                ${fs:fileSize(secondjobOutputDir) lt 100 * MB}
+            </case>
+            <case to="recomputejob">
+                ${ hadoop:counters('secondjob')[RECORDS][REDUCE_OUT] lt 1000000 }
+            </case>
+            <default to="end"/>
+        </switch>
+    </decision>
+    ...
+</workflow-app>
+</pre></p>
+<p><a name="ForkJoinNodes"></a>
+</p>
+<a name="a3.1.5_Fork_and_Join_Control_Nodes"></a>
+</div>
+<div class="section"><h5>3.1.5 Fork and Join Control Nodes</h5>
+<p>A <tt>fork</tt> node splits one path of execution into multiple concurrent paths of execution.</p>
+<p>A <tt>join</tt> node waits until every concurrent execution path of a previous <tt>fork</tt> node arrives at it.</p>
+<p>The <tt>fork</tt> and <tt>join</tt> nodes must be used in pairs. The <tt>join</tt> node assumes concurrent execution paths are children of
+the same <tt>fork</tt> node.</p>
+<p><b>Syntax:</b>
+</p>
+<p><pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <fork name="[FORK-NODE-NAME]">
+        <path start="[NODE-NAME]" />
+        ...
+        <path start="[NODE-NAME]" />
+    </fork>
+    ...
+    <join name="[JOIN-NODE-NAME]" to="[NODE-NAME]" />
+    ...
+</workflow-app>
+</pre></p>
+<p>The <tt>name</tt> attribute in the <tt>fork</tt> node is the name of the workflow fork node. The <tt>start</tt> attribute in the <tt>path</tt>
+elements in the <tt>fork</tt> node indicates the name of the workflow node that will be part of the concurrent execution paths.</p>
+<p>The <tt>name</tt> attribute in the <tt>join</tt> node is the name of the workflow join node. The <tt>to</tt> attribute in the <tt>join</tt> node
+indicates the name of the workflow node that will be executed after all concurrent execution paths of the corresponding
+fork arrive at the join node.</p>
+<p><b>Example:</b>
+</p>
+<p><pre>
+<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <fork name="forking">
+        <path start="firstparalleljob"/>
+        <path start="secondparalleljob"/>
+    </fork>
+    <action name="firstparalleljob">
+        <map-reduce>
+            <resource-manager>foo:8032</resource-manager>
+            <name-node>bar:8020</name-node>
+            <job-xml>job1.xml</job-xml>
+        </map-reduce>
+        <ok to="joining"/>
+        <error to="kill"/>
+    </action>
+    <action name="secondparalleljob">
+        <map-reduce>
+            <resource-manager>foo:8032</resource-manager>
+            <name-node>bar:8020</name-node>
+            <job-xml>job2.xml</job-xml>
+        </map-reduce>
+        <ok to="joining"/>
+        <error to="kill"/>
+    </action>
+    <join name="joining" to="nextaction"/>
+    ...
+</workflow-app>
+</pre></p>
+<p>By default, Oozie performs some validation that any forking in a workflow is valid and won't lead to any incorrect behavior or
+instability. However, if Oozie is preventing a workflow from being submitted and you are very certain that it should work, you can
+disable forkjoin validation so that Oozie will accept the workflow. To disable this validation just for a specific workflow, simply
+set <tt>oozie.wf.validate.ForkJoin</tt> to <tt>false</tt> in the job.properties file. To disable this validation for all workflows, simply set
+<tt>oozie.validate.ForkJoin</tt> to <tt>false</tt> in the oozie-site.xml file. Whether validation runs is the AND of these two
+properties: it is enabled only if both are set to <tt>true</tt> (or not specified), and disabled if either or both are set to <tt>false</tt>.</p>
+<p><a name="ActionNodes"></a>
+</p>
+<a name="a3.2_Workflow_Action_Nodes"></a>
+</div>
+</div>
+<div class="section"><h4>3.2 Workflow Action Nodes</h4>
+<p>Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task.</p>
+<a name="a3.2.1_Action_Basis"></a>
+<div class="section"><h5>3.2.1 Action Basis</h5>
+<p>The following sub-sections define common behavior and capabilities for all action types.</p>
+<a name="a3.2.1.1_Action_ComputationProcessing_Is_Always_Remote"></a>
+<div class="section"><h6>3.2.1.1 Action Computation/Processing Is Always Remote</h6>
+<p>All computation/processing tasks triggered by an action node are remote to Oozie. No workflow application specific
+computation/processing task is executed within Oozie.</p>
+<a name="a3.2.1.2_Actions_Are_Asynchronous"></a>
+</div>
+<div class="section"><h6>3.2.1.2 Actions Are Asynchronous</h6>
+<p>All computation/processing tasks triggered by an action node are executed asynchronously by Oozie. For most types of
+computation/processing tasks triggered by a workflow action, the workflow job has to wait until the
+computation/processing task completes before transitioning to the following node in the workflow.</p>
+<p>The exception is the <tt>fs</tt> action that is handled as a synchronous action.</p>
+<p>Oozie can detect completion of computation/processing tasks by two different means, callbacks and polling.</p>
+<p>When a computation/processing task is started by Oozie, Oozie provides a unique callback URL to the task; the task
+should invoke the given URL to notify its completion.</p>
+<p>For cases in which the task failed to invoke the callback URL for any reason (e.g.
a transient network failure) or when
+the type of task cannot invoke the callback URL upon completion, Oozie has a mechanism to poll computation/processing
+tasks for completion.</p>
+<a name="a3.2.1.3_Actions_Have_2_Transitions_ok_and_error"></a>
+</div>
+<div class="section"><h6>3.2.1.3 Actions Have 2 Transitions, <tt>ok</tt> and <tt>error</tt></h6>
+<p>If a computation/processing task, triggered by a workflow, completes successfully, it transitions to <tt>ok</tt>.</p>
+<p>If a computation/processing task, triggered by a workflow, fails to complete successfully, it transitions to <tt>error</tt>.</p>
+<p>If a computation/processing task exits in error, the computation/processing task must provide <tt>error-code</tt> and
+<tt>error-message</tt> information to Oozie. This information can be used from <tt>decision</tt> nodes to implement fine-grained
+error handling at the workflow application level.</p>
+<p>Each action type must clearly define all the error codes it can produce.</p>
+<a name="a3.2.1.4_Action_Recovery"></a>
+</div>
+<div class="section"><h6>3.2.1.4 Action Recovery</h6>
+<p>Oozie provides recovery capabilities when starting or ending actions.</p>
+<p>Once an action starts successfully Oozie will not retry starting the action if the action fails during its execution.
+The assumption is that the external system (e.g. Hadoop) executing the action has enough resilience to recover jobs
+once it has started (e.g. Hadoop task retries).</p>
+<p>Java actions are a special case with regard to retries. Although Oozie itself does not retry Java actions
+should they fail after they have successfully started, Hadoop itself can cause the action to be restarted due to a
+map task retry on the map task running the Java application. See the Java Action section below for more detail.</p>
+<p>For failures that occur prior to the start of the job, Oozie will have different recovery strategies depending on the
+nature of the failure.</p>
+<p>If the failure is of transient nature, Oozie will perform retries after a pre-defined time interval. The number of
+retries and the time interval for a type of action must be pre-configured at Oozie level. Workflow jobs can override such
+configuration (see the sketch at the end of this section).</p>
+<p>Examples of transient failures are network problems or a remote system being temporarily unavailable.</p>
+<p>If the failure is of non-transient nature, Oozie will suspend the workflow job until a manual or programmatic
+intervention resumes the workflow job and the action start or end is retried. It is the responsibility of an
+administrator or an external managing system to perform any necessary cleanup before resuming the workflow job.</p>
+<p>If the failure is an error and a retry will not resolve the problem, Oozie will perform the error transition for the
+action.</p>
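+<p>As a minimal sketch of such a workflow-level override, the user-retry attributes described in
+section 18 can be set on an action node (the node name and values here are hypothetical):</p>
+<p><pre>
+<action name="myAction" retry-max="3" retry-interval="1">
+    ...
+</action>
+</pre></p>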
+<p><a name="MapReduceAction"></a>
+</p>
+<a name="a3.2.2_Map-Reduce_Action"></a>
+</div>
+</div>
+<div class="section"><h5>3.2.2 Map-Reduce Action</h5>
+<p>The <tt>map-reduce</tt> action starts a Hadoop map/reduce job from a workflow. Hadoop jobs can be Java Map/Reduce jobs or
+streaming jobs.</p>
+<p>A <tt>map-reduce</tt> action can be configured to perform file system cleanup and directory creation before starting the
+map reduce job. This capability enables Oozie to retry a Hadoop job in the situation of a transient failure (Hadoop
+checks the non-existence of the job output directory and then creates it when the Hadoop job is starting, thus a retry
+without cleanup of the job output directory would fail).</p>
+<p>The workflow job will wait until the Hadoop map/reduce job completes before continuing to the next action in the
+workflow execution path.</p>
+<p>The counters of the Hadoop job and the job exit status (<tt>FAILED</tt>, <tt>KILLED</tt> or <tt>SUCCEEDED</tt>) must be available to the
+workflow job after the Hadoop job ends. This information can be used from within decision nodes and other action
+configurations.</p>
+<p>The <tt>map-reduce</tt> action has to be configured with all the necessary Hadoop JobConf properties to run the Hadoop
+map/reduce job.</p>
+<p>Hadoop JobConf properties can be specified as part of:<ul><li>the <tt>config-default.xml</tt> or</li>
+<li>JobConf XML file bundled with the workflow application or</li>
+<li><global> tag in workflow definition or</li>
+<li>Inline <tt>map-reduce</tt> action configuration or</li>
+<li>An implementation of OozieActionConfigurator specified by the <config-class> tag in workflow definition.</li>
+</ul>
+</p>
+<p>The configuration properties are loaded in the above order, i.e. <tt>streaming</tt>, <tt>job-xml</tt>, <tt>configuration</tt>,
+and <tt>config-class</tt>; later values override earlier values.</p>
+<p>Streaming and inline property values can be parameterized (templatized) using EL expressions.</p>
+<p>The Hadoop <tt>mapred.job.tracker</tt> and <tt>fs.default.name</tt> properties must not be present in the job-xml and inline
+configuration.</p>
+<p><a name="FilesArchives"></a>
+</p>
+<a name="a3.2.2.1_Adding_Files_and_Archives_for_the_Job"></a>
+<div class="section"><h6>3.2.2.1 Adding Files and Archives for the Job</h6>
+<p>The <tt>file</tt> and <tt>archive</tt> elements make files and archives available to map-reduce jobs. If the specified path is
+relative, it is assumed the file or archive is within the application directory, in the corresponding sub-path.
+If the path is absolute, the file or archive is expected at the given absolute path.</p>
+<p>Files specified with the <tt>file</tt> element will be symbolic links in the home directory of the task.</p>
+<p>If a file is a native library (an '.so' or a '.so.#' file), it will be symlinked as an '.so' file in the task running
+directory, thus available to the task JVM.</p>
+<p>To force a symlink for a file on the task running directory, use a '#' followed by the symlink name. For example
+'mycat.sh#cat'.</p>
+<p>Refer to the Hadoop distributed cache documentation for more details on files and archives.</p>
+<a name="a3.2.2.2_Configuring_the_MapReduce_action_with_Java_code"></a>
+</div>
+<div class="section"><h6>3.2.2.2 Configuring the MapReduce action with Java code</h6>
+<p>Java code can be used to further configure the MapReduce action. This can be useful if you already have "driver" code for your
+MapReduce action, if you're more familiar with MapReduce's Java API, if there's some configuration that requires logic, or some
+configuration that's difficult to do in straight XML (e.g. Avro).</p>
+<p>Create a class that implements the org.apache.oozie.action.hadoop.OozieActionConfigurator interface from the "oozie-sharelib-oozie"
+artifact. It contains a single method that receives a <tt>JobConf</tt> as an argument.
Any configuration properties set on this <tt>JobConf</tt>
+will be used by the MapReduce action.</p>
+<p>The OozieActionConfigurator has this signature:
+<pre>
+public interface OozieActionConfigurator {
+    public void configure(JobConf actionConf) throws OozieActionConfiguratorException;
+}
+</pre>
+where <tt>actionConf</tt> is the <tt>JobConf</tt> you can update. If you need to throw an Exception, you can wrap it in
+an <tt>OozieActionConfiguratorException</tt>, also in the "oozie-sharelib-oozie" artifact.</p>
+<p>For example:
+<pre>
+package com.example;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.mapred.FileInputFormat;
+import org.apache.hadoop.mapred.FileOutputFormat;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.oozie.action.hadoop.OozieActionConfigurator;
+import org.apache.oozie.action.hadoop.OozieActionConfiguratorException;
+import org.apache.oozie.example.SampleMapper;
+import org.apache.oozie.example.SampleReducer;
+
+public class MyConfigClass implements OozieActionConfigurator {
+
+    @Override
+    public void configure(JobConf actionConf) throws OozieActionConfiguratorException {
+        if (actionConf.getUser() == null) {
+            throw new OozieActionConfiguratorException("No user set");
+        }
+        actionConf.setMapperClass(SampleMapper.class);
+        actionConf.setReducerClass(SampleReducer.class);
+        FileInputFormat.setInputPaths(actionConf, new Path("/user/" + actionConf.getUser() + "/input-data"));
+        FileOutputFormat.setOutputPath(actionConf, new Path("/user/" + actionConf.getUser() + "/output"));
+        ...
+    }
+}
+</pre>
+</p>
+<p>To use your config class in your MapReduce action, simply compile it into a jar, make the jar available to your action, and specify
+the class name in the <tt>config-class</tt> element (this requires at least schema 0.5):
+<pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="[NODE-NAME]">
+        <map-reduce>
+            ...
+            <job-xml>[JOB-XML-FILE]</job-xml>
+            <configuration>
+                <property>
+                    <name>[PROPERTY-NAME]</name>
+                    <value>[PROPERTY-VALUE]</value>
+                </property>
+                ...
+            </configuration>
+            <config-class>com.example.MyConfigClass</config-class>
+            ...
+        </map-reduce>
+        <ok to="[NODE-NAME]"/>
+        <error to="[NODE-NAME]"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p>Another example of this can be found in the "map-reduce" example that comes with Oozie.</p>
+<p>A useful tip: The initial <tt>JobConf</tt> passed to the <tt>configure</tt> method includes all of the properties listed in the <tt>configuration</tt>
+section of the MR action in a workflow. If you need to pass any information to your OozieActionConfigurator, you can simply put
+it there.</p>
+<p><a name="StreamingMapReduceAction"></a>
+</p>
+<a name="a3.2.2.3_Streaming"></a>
+</div>
+<div class="section"><h6>3.2.2.3 Streaming</h6>
+<p>Streaming information can be specified in the <tt>streaming</tt> element.</p>
+<p>The <tt>mapper</tt> and <tt>reducer</tt> elements are used to specify the executable/script to be used as mapper and reducer.</p>
+<p>User defined scripts must be bundled with the workflow application and they must be declared in the <tt>files</tt> element of
+the streaming configuration. If they are not declared in the <tt>files</tt> element of the configuration it is assumed they
+will be available (and in the command PATH) on the Hadoop slave machines.</p>
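+<p>A minimal sketch of a <tt>streaming</tt> block inside a <tt>map-reduce</tt> action (the script names and
+paths are hypothetical; the scripts are shipped and symlinked via <tt>file</tt> elements):</p>
+<p><pre>
+<streaming>
+    <mapper>python mapper.py</mapper>
+    <reducer>python reducer.py</reducer>
+</streaming>
+...
+<file>scripts/mapper.py#mapper.py</file>
+<file>scripts/reducer.py#reducer.py</file>
+</pre></p>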
+<p>Some streaming jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using
+the <tt>file</tt> and <tt>archive</tt> elements described in the previous section.</p>
+<p>The Mapper/Reducer can be overridden by the <tt>mapred.mapper.class</tt> or <tt>mapred.reducer.class</tt> properties in the <tt>job-xml</tt>
+file or <tt>configuration</tt> elements.</p>
+<p><a name="PipesMapReduceAction"></a>
+</p>
+<a name="a3.2.2.4_Pipes"></a>
+</div>
+<div class="section"><h6>3.2.2.4 Pipes</h6>
+<p>Pipes information can be specified in the <tt>pipes</tt> element.</p>
+<p>A subset of the command line options which can be used while using the Hadoop Pipes Submitter can be specified
+via elements - <tt>map</tt>, <tt>reduce</tt>, <tt>inputformat</tt>, <tt>partitioner</tt>, <tt>writer</tt>, <tt>program</tt>.</p>
+<p>The <tt>program</tt> element is used to specify the executable/script to be used.</p>
+<p>The user-defined program must be bundled with the workflow application.</p>
+<p>Some pipes jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using
+the <tt>file</tt> and <tt>archive</tt> elements described in the previous section.</p>
+<p>Pipes properties can be overridden by specifying them in the <tt>job-xml</tt> file or <tt>configuration</tt> element.</p>
+<a name="a3.2.2.5_Syntax"></a>
+</div>
+<div class="section"><h6>3.2.2.5 Syntax</h6>
+<p><pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="[NODE-NAME]">
+        <map-reduce>
+            <resource-manager>[RESOURCE-MANAGER]</resource-manager>
+            <name-node>[NAME-NODE]</name-node>
+            <prepare>
+                <delete path="[PATH]"/>
+                ...
+                <mkdir path="[PATH]"/>
+                ...
+            </prepare>
+            <streaming>
+                <mapper>[MAPPER-PROCESS]</mapper>
+                <reducer>[REDUCER-PROCESS]</reducer>
+                <record-reader>[RECORD-READER-CLASS]</record-reader>
+                <record-reader-mapping>[NAME=VALUE]</record-reader-mapping>
+                ...
+                <env>[NAME=VALUE]</env>
+                ...
+            </streaming>
+            <!-- Either streaming or pipes can be specified for an action, not both -->
+            <pipes>
+                <map>[MAPPER]</map>
+                <reduce>[REDUCER]</reduce>
+                <inputformat>[INPUTFORMAT]</inputformat>
+                <partitioner>[PARTITIONER]</partitioner>
+                <writer>[OUTPUTFORMAT]</writer>
+                <program>[EXECUTABLE]</program>
+            </pipes>
+            <job-xml>[JOB-XML-FILE]</job-xml>
+            <configuration>
+                <property>
+                    <name>[PROPERTY-NAME]</name>
+                    <value>[PROPERTY-VALUE]</value>
+                </property>
+                ...
+            </configuration>
+            <config-class>com.example.MyConfigClass</config-class>
+            <file>[FILE-PATH]</file>
+            ...
+            <archive>[FILE-PATH]</archive>
+            ...
+        </map-reduce>
+        <ok to="[NODE-NAME]"/>
+        <error to="[NODE-NAME]"/>
+    </action>
+    ...
+</workflow-app>
+</pre>
+</p>
+<p>The <tt>prepare</tt> element, if present, indicates a list of paths to delete before starting the job. This should be used
+exclusively for directory cleanup or dropping of hcatalog tables or table partitions for the job to be executed. The delete operation
+will be performed in the <tt>fs.default.name</tt> filesystem for hdfs URIs. The format for specifying a hcatalog table URI is
+hcat://[metastore server]:[port]/[database name]/[table name] and the format for a hcatalog table partition URI is
+hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value].
+For a hcatalog URI, the hive-site.xml needs to be shipped using the <tt>file</tt> element, and the hcatalog and hive jars
+need to be placed in the workflow lib directory or specified using the <tt>archive</tt> element.</p>
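+<p>For example, a <tt>prepare</tt> block that drops a hypothetical table partition before the job runs
+could look like this sketch (the metastore host, database, table and partition are made-up values):</p>
+<p><pre>
+<prepare>
+    <delete path="hcat://foo:9083/mydb/mytable/dt=20180101"/>
+</prepare>
+</pre></p>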
+<p>The <tt>job-xml</tt> element, if present, must refer to a Hadoop JobConf <tt>job.xml</tt> file bundled in the workflow application.
+By default the <tt>job.xml</tt> file is taken from the workflow application namenode, regardless of the namenode specified for the action.
+To specify a <tt>job.xml</tt> on another namenode use a fully qualified file path.
+The <tt>job-xml</tt> element is optional and, as of schema 0.4, multiple <tt>job-xml</tt> elements are allowed in order to specify multiple Hadoop JobConf <tt>job.xml</tt> files.</p>
+<p>The <tt>configuration</tt> element, if present, contains JobConf properties for the Hadoop job.</p>
+<p>Properties specified in the <tt>configuration</tt> element override properties specified in the file specified in the
+<tt>job-xml</tt> element.</p>
+<p>As of schema 0.5, the <tt>config-class</tt> element, if present, contains a class that implements OozieActionConfigurator that can be used
+to further configure the MapReduce job.</p>
+<p>Properties specified in the <tt>config-class</tt> class override properties specified in the <tt>configuration</tt> element.</p>
+<p>External Stats can be turned on/off by specifying the property <i>oozie.action.external.stats.write</i> as <i>true</i> or <i>false</i> in the configuration element of workflow.xml. The default value for this property is <i>false</i>.</p>
+<p>The <tt>file</tt> element, if present, must specify the target symbolic link for binaries by separating the original file and target with a # (file#target-sym-link). This is not required for libraries.</p>
+<p>The <tt>mapper</tt> and <tt>reducer</tt> processes for streaming jobs should specify the executable command with URL encoding, e.g. '%' should be replaced by '%25'.</p>
+<p><b>Example:</b>
+</p>
+<p><pre>
+<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="myfirstHadoopJob">
+        <map-reduce>
+            <resource-manager>foo:8032</resource-manager>
+            <name-node>bar:8020</name-node>
+            <prepare>
+                <delete path="hdfs://foo:8020/usr/tucu/output-data"/>
+            </prepare>
+            <job-xml>/myfirstjob.xml</job-xml>
+            <configuration>
+                <property>
+                    <name>mapred.input.dir</name>
+                    <value>/usr/tucu/input-data</value>
+                </property>
+                <property>
+                    <name>mapred.output.dir</name>
+                    <value>/usr/tucu/output-data</value>
+                </property>
+                <property>
+                    <name>mapred.reduce.tasks</name>
+                    <value>${firstJobReducers}</value>
+                </property>
+                <property>
+                    <name>oozie.action.external.stats.write</name>
+                    <value>true</value>
+                </property>
+            </configuration>
+        </map-reduce>
+        <ok to="myNextAction"/>
+        <error to="errorCleanup"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p>In the above example, the number of Reducers to be used by the Map/Reduce job has to be specified as a parameter of
+the workflow job configuration when creating the workflow job.</p>
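+<p>For instance, the workflow job submission properties could include a line like the following sketch
+(the value is hypothetical):</p>
+<p><pre>
+firstJobReducers=10
+</pre></p>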
+<p><b>Streaming Example:</b>
+</p>
+<p><pre>
+<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="firstjob">
+        <map-reduce>
+            <resource-manager>foo:8032</resource-manager>
+            <name-node>bar:8020</name-node>
+            <prepare>
+                <delete path="${output}"/>
+            </prepare>
+            <streaming>
+                <mapper>/bin/bash testarchive/bin/mapper.sh testfile</mapper>
+                <reducer>/bin/bash testarchive/bin/reducer.sh</reducer>
+            </streaming>
+            <configuration>
+                <property>
+                    <name>mapred.input.dir</name>
+                    <value>${input}</value>
+                </property>
+                <property>
+                    <name>mapred.output.dir</name>
+                    <value>${output}</value>
+                </property>
+                <property>
+                    <name>stream.num.map.output.key.fields</name>
+                    <value>3</value>
+                </property>
+            </configuration>
+            <file>/users/blabla/testfile.sh#testfile</file>
+            <archive>/users/blabla/testarchive.jar#testarchive</archive>
+        </map-reduce>
+        <ok to="end"/>
+        <error to="kill"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p><b>Pipes Example:</b>
+</p>
+<p><pre>
+<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="firstjob">
+        <map-reduce>
+            <resource-manager>foo:8032</resource-manager>
+            <name-node>bar:8020</name-node>
+            <prepare>
+                <delete path="${output}"/>
+            </prepare>
+            <pipes>
+                <program>bin/wordcount-simple#wordcount-simple</program>
+            </pipes>
+            <configuration>
+                <property>
+                    <name>mapred.input.dir</name>
+                    <value>${input}</value>
+                </property>
+                <property>
+                    <name>mapred.output.dir</name>
+                    <value>${output}</value>
+                </property>
+            </configuration>
+            <archive>/users/blabla/testarchive.jar#testarchive</archive>
+        </map-reduce>
+        <ok to="end"/>
+        <error to="kill"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p><a name="PigAction"></a>
+</p>
+<a name="a3.2.3_Pig_Action"></a>
+</div>
+</div>
+<div class="section"><h5>3.2.3 Pig Action</h5>
+<p>The <tt>pig</tt> action starts a Pig job.</p>
+<p>The workflow job will wait until the pig job completes before continuing to the next action.</p>
+<p>The <tt>pig</tt> action has to be configured with the resource-manager, name-node, pig script and the necessary parameters and
+configuration to run the Pig job.</p>
+<p>A <tt>pig</tt> action can be configured to perform HDFS files/directories cleanup or HCatalog partitions cleanup before
+starting the Pig job. This capability enables Oozie to retry a Pig job in the situation of a transient failure (Pig
+creates temporary directories for intermediate data, thus a retry without cleanup would fail).</p>
+<p>Hadoop JobConf properties can be specified as part of:<ul><li>the <tt>config-default.xml</tt> or</li>
+<li>JobConf XML file bundled with the workflow application or</li>
+<li><global> tag in workflow definition or</li>
+<li>Inline <tt>pig</tt> action configuration.</li>
+</ul>
+</p>
+<p>The configuration properties are loaded in the above order, i.e. <tt>job-xml</tt> and <tt>configuration</tt>; later values
+override earlier values.</p>
+<p>Inline property values can be parameterized (templatized) using EL expressions.</p>
+<p>The YARN <tt>yarn.resourcemanager.address</tt> and HDFS <tt>fs.default.name</tt> properties must not be present in the job-xml and inline
+configuration.</p>
+<p>As with Hadoop map-reduce jobs, it is possible to add files and archives to be available to the Pig job; refer to
+the section <a href="#FilesArchives">Adding Files and Archives for the Job</a>.</p>
+<p><b>Syntax for Pig actions in Oozie schema 1.0:</b>
+<pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="[NODE-NAME]">
+        <pig>
+            <resource-manager>[RESOURCE-MANAGER]</resource-manager>
+            <name-node>[NAME-NODE]</name-node>
+            <prepare>
+                <delete path="[PATH]"/>
+                ...
+                <mkdir path="[PATH]"/>
+                ...
+            </prepare>
+            <job-xml>[JOB-XML-FILE]</job-xml>
+            <configuration>
+                <property>
+                    <name>[PROPERTY-NAME]</name>
+                    <value>[PROPERTY-VALUE]</value>
+                </property>
+                ...
+            </configuration>
+            <script>[PIG-SCRIPT]</script>
+            <param>[PARAM-VALUE]</param>
+            ...
+            <param>[PARAM-VALUE]</param>
+            <argument>[ARGUMENT-VALUE]</argument>
+            ...
+            <argument>[ARGUMENT-VALUE]</argument>
+            <file>[FILE-PATH]</file>
+            ...
+            <archive>[FILE-PATH]</archive>
+            ...
+        </pig>
+        <ok to="[NODE-NAME]"/>
+        <error to="[NODE-NAME]"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p><b>Syntax for Pig actions in Oozie schema 0.2:</b>
+<pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.2">
+    ...
+    <action name="[NODE-NAME]">
+        <pig>
+            <job-tracker>[JOB-TRACKER]</job-tracker>
+            <name-node>[NAME-NODE]</name-node>
+            <prepare>
+                <delete path="[PATH]"/>
+                ...
+                <mkdir path="[PATH]"/>
+                ...
+            </prepare>
+            <job-xml>[JOB-XML-FILE]</job-xml>
+            <configuration>
+                <property>
+                    <name>[PROPERTY-NAME]</name>
+                    <value>[PROPERTY-VALUE]</value>
+                </property>
+                ...
+            </configuration>
+            <script>[PIG-SCRIPT]</script>
+            <param>[PARAM-VALUE]</param>
+            ...
+            <param>[PARAM-VALUE]</param>
+            <argument>[ARGUMENT-VALUE]</argument>
+            ...
+            <argument>[ARGUMENT-VALUE]</argument>
+            <file>[FILE-PATH]</file>
+            ...
+            <archive>[FILE-PATH]</archive>
+            ...
+        </pig>
+        <ok to="[NODE-NAME]"/>
+        <error to="[NODE-NAME]"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p><b>Syntax for Pig actions in Oozie schema 0.1:</b>
+</p>
+<p><pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
+    ...
+    <action name="[NODE-NAME]">
+        <pig>
+            <job-tracker>[JOB-TRACKER]</job-tracker>
+            <name-node>[NAME-NODE]</name-node>
+            <prepare>
+                <delete path="[PATH]"/>
+                ...
+                <mkdir path="[PATH]"/>
+                ...
+            </prepare>
+            <job-xml>[JOB-XML-FILE]</job-xml>
+            <configuration>
+                <property>
+                    <name>[PROPERTY-NAME]</name>
+                    <value>[PROPERTY-VALUE]</value>
+                </property>
+                ...
+            </configuration>
+            <script>[PIG-SCRIPT]</script>
+            <param>[PARAM-VALUE]</param>
+            ...
+            <param>[PARAM-VALUE]</param>
+            <file>[FILE-PATH]</file>
+            ...
+            <archive>[FILE-PATH]</archive>
+            ...
+        </pig>
+        <ok to="[NODE-NAME]"/>
+        <error to="[NODE-NAME]"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p>The <tt>prepare</tt> element, if present, indicates a list of paths to delete before starting the job. This should be used
+exclusively for directory cleanup or dropping of hcatalog tables or table partitions for the job to be executed. The delete operation
+will be performed in the <tt>fs.default.name</tt> filesystem for hdfs URIs. The format for specifying a hcatalog table URI is
+hcat://[metastore server]:[port]/[database name]/[table name] and the format for a hcatalog table partition URI is
+hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value].
+For a hcatalog URI, the hive-site.xml needs to be shipped using the <tt>file</tt> element, and the hcatalog and hive jars
+need to be placed in the workflow lib directory or specified using the <tt>archive</tt> element.</p>
+<p>The <tt>job-xml</tt> element, if present, must refer to a Hadoop JobConf <tt>job.xml</tt> file bundled in the workflow application.
+The <tt>job-xml</tt> element is optional and, as of schema 0.4, multiple <tt>job-xml</tt> elements are allowed in order to specify multiple Hadoop JobConf <tt>job.xml</tt> files.</p>
+<p>The <tt>configuration</tt> element, if present, contains JobConf properties for the underlying Hadoop jobs.</p>
+<p>Properties specified in the <tt>configuration</tt> element override properties specified in the file specified in the
+<tt>job-xml</tt> element.</p>
+<p>External Stats can be turned on/off by specifying the property <i>oozie.action.external.stats.write</i> as <i>true</i> or <i>false</i> in the configuration element of workflow.xml. The default value for this property is <i>false</i>.</p>
+<p>The inline and job-xml configuration properties are passed to the Hadoop jobs submitted by the Pig runtime.</p>
+<p>The <tt>script</tt> element contains the pig script to execute. The pig script can be templatized with variables of the
+form <tt>${VARIABLE}</tt>. The values of these variables can then be specified using <tt>param</tt> elements.</p>
+<p>NOTE: Oozie will perform the parameter substitution before firing the pig job. This is different from the
+<a class="externalLink" href="http://wiki.apache.org/pig/ParameterSubstitution">parameter substitution mechanism provided by Pig</a>, which has a
+few limitations.</p>
+<p>The <tt>param</tt> elements, if present, contain parameters to be passed to the pig script.</p>
+<p><b>In Oozie schema 0.2:</b>
+The <tt>argument</tt> elements, if present, contain arguments to be passed to the pig script.</p>
+<p>All the above elements can be parameterized (templatized) using EL expressions.</p>
+<p><b>Example for Oozie schema 0.2:</b>
+</p>
+<p><pre>
+<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.2">
+    ...
+    <action name="myfirstpigjob">
+        <pig>
+            <job-tracker>foo:8021</job-tracker>
+            <name-node>bar:8020</name-node>
+            <prepare>
+                <delete path="${jobOutput}"/>
+            </prepare>
+            <configuration>
+                <property>
+                    <name>mapred.compress.map.output</name>
+                    <value>true</value>
+                </property>
+                <property>
+                    <name>oozie.action.external.stats.write</name>
+                    <value>true</value>
+                </property>
+            </configuration>
+            <script>/mypigscript.pig</script>
+            <argument>-param</argument>
+            <argument>INPUT=${inputDir}</argument>
+            <argument>-param</argument>
+            <argument>OUTPUT=${outputDir}/pig-output3</argument>
+        </pig>
+        <ok to="myotherjob"/>
+        <error to="errorcleanup"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p><b>Example for Oozie schema 0.1:</b>
+</p>
+<p><pre>
+<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
+    ...
+    <action name="myfirstpigjob">
+        <pig>
+            <job-tracker>foo:8021</job-tracker>
+            <name-node>bar:8020</name-node>
+            <prepare>
+                <delete path="${jobOutput}"/>
+            </prepare>
+            <configuration>
+                <property>
+                    <name>mapred.compress.map.output</name>
+                    <value>true</value>
+                </property>
+            </configuration>
+            <script>/mypigscript.pig</script>
+            <param>InputDir=/home/tucu/input-data</param>
+            <param>OutputDir=${jobOutput}</param>
+        </pig>
+        <ok to="myotherjob"/>
+        <error to="errorcleanup"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p><a name="FsAction"></a>
+</p>
+<a name="a3.2.4_Fs_HDFS_action"></a>
+</div>
+<div class="section"><h5>3.2.4 Fs (HDFS) action</h5>
+<p>The <tt>fs</tt> action allows manipulating files and directories in HDFS from a workflow application.
+<p><a name="FsAction"></a>
+</p>
+<a name="a3.2.4_Fs_HDFS_action"></a>
+</div>
+<div class="section"><h5>3.2.4 Fs (HDFS) action</h5>
+<p>The <tt>fs</tt> action allows manipulation of files and directories in HDFS from a workflow application.
+The supported commands are <tt>move</tt>, <tt>delete</tt>, <tt>mkdir</tt>, <tt>chmod</tt>, <tt>touchz</tt>,
+<tt>setrep</tt> and <tt>chgrp</tt>.</p>
+<p>The FS commands are executed synchronously from within the FS action; the workflow job will wait until the
+specified file commands are completed before continuing to the next action.</p>
+<p>Path names specified in the <tt>fs</tt> action can be parameterized (templatized) using EL expressions.
+Path names should be specified as absolute paths. For the <tt>move</tt>, <tt>delete</tt>, <tt>chmod</tt> and
+<tt>chgrp</tt> commands, a glob pattern can also be specified instead of an absolute path.
+For <tt>move</tt>, a glob pattern can only be specified for the source path, not the target.</p>
+<p>Each file path must specify the file system URI; for move operations, the target must not specify the system URI.</p>
+<p><b>IMPORTANT:</b> For the purposes of copying files within a cluster it is recommended to use the <tt>distcp</tt> action
+instead. Refer to the <a href="./DG_DistCpActionExtension.html">distcp</a> action to copy files within a cluster.</p>
+<p><b>IMPORTANT:</b> The commands within an <tt>fs</tt> action do not execute atomically; if an <tt>fs</tt> action fails
+halfway through the commands being executed, the successfully executed commands are not rolled back. However, before
+executing any command, the <tt>fs</tt> action checks that source paths exist and that target paths do not exist (the
+constraint on the target is relaxed for the <tt>move</tt> action, see below for details), and fails before executing any
+command otherwise. Because the validity of all paths specified in one <tt>fs</tt> action is evaluated before any of the
+file operations are executed, there is less chance of an error occurring while the <tt>fs</tt> action executes.</p>
+<p><b>Syntax:</b>
+</p>
+<p><pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="[NODE-NAME]">
+        <fs>
+            <delete path='[PATH]' skip-trash='[true/false]'/>
+            ...
+            <mkdir path='[PATH]'/>
+            ...
+            <move source='[SOURCE-PATH]' target='[TARGET-PATH]'/>
+            ...
+            <chmod path='[PATH]' permissions='[PERMISSIONS]' dir-files='false' />
+            ...
+            <touchz path='[PATH]' />
+            ...
+            <chgrp path='[PATH]' group='[GROUP]' dir-files='false' />
+            ...
+            <setrep path='[PATH]' replication-factor='2'/>
+        </fs>
+        <ok to="[NODE-NAME]"/>
+        <error to="[NODE-NAME]"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p>The <tt>delete</tt> command deletes the specified path; if it is a directory, it recursively deletes all its content
+and then deletes the directory. By default it skips the trash; deleted data can be moved to the trash instead by setting
+the value of skip-trash to 'false'. The command can also be used to drop HCatalog tables/partitions; this is the only FS
+command which supports HCatalog URIs as well. For example:
+<pre>
+<delete path='hcat://[metastore server]:[port]/[database name]/[table name]'/>
+OR
+<delete path='hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value];...'/>
+</pre></p>
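+<p>For instance, a minimal sketch of a delete that moves a directory to the trash rather than removing it permanently
+(the path below is a placeholder):</p>
+<p><pre>
+<delete path='hdfs://foo:8020/usr/tucu/temp-data' skip-trash='false'/>
+</pre></p>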
+<p>The <tt>mkdir</tt> command creates the specified directory; it creates all missing directories in the path. If the
+directory already exists, it does a no-op.</p>
+<p>In the <tt>move</tt> command the <tt>source</tt> path must exist. The following scenarios are addressed for a
+<tt>move</tt>:</p>
+<p><ul><li>The file system URI (e.g. <tt>hdfs://{nameNode}</tt>) can be skipped in the <tt>target</tt> path. It is
+understood to be the same as that of the source. But if the target path does contain the system URI, it cannot be
+different from that of the source.</li>
+<li>The parent directory of the <tt>target</tt> path must exist.</li>
+<li>For the <tt>target</tt> path, if it is a file, then it must not already exist.</li>
+<li>However, if the <tt>target</tt> path is an already existing directory, the <tt>move</tt> action will place your
+<tt>source</tt> as a child of the <tt>target</tt> directory.</li>
+</ul>
+</p>
+<p>The <tt>chmod</tt> command changes the permissions for the specified path. Permissions can be specified using the
+Unix symbolic representation (e.g. -rwxrw-rw-) or an octal representation (755).
+When doing a <tt>chmod</tt> command on a directory, by default the command is applied to the directory and the files one
+level within the directory. To apply the <tt>chmod</tt> command to the directory, without affecting the files within it,
+the <tt>dir-files</tt> attribute must be set to <tt>false</tt>. To apply the <tt>chmod</tt> command recursively to all
+levels within a directory, put a <tt>recursive</tt> element inside the <chmod> element.</p>
+<p>The <tt>touchz</tt> command creates a zero length file in the specified path if none exists. If one already exists,
+then touchz will perform a touch operation. Touchz works only for absolute paths.</p>
+<p>The <tt>chgrp</tt> command changes the group for the specified path.
+When doing a <tt>chgrp</tt> command on a directory, by default the command is applied to the directory and the files one
+level within the directory. To apply the <tt>chgrp</tt> command to the directory, without affecting the files within it,
+the <tt>dir-files</tt> attribute must be set to <tt>false</tt>.
+To apply the <tt>chgrp</tt> command recursively to all levels within a directory, put a <tt>recursive</tt> element inside
+the <chgrp> element.</p>
+<p>The <tt>setrep</tt> command changes the replication factor of HDFS files. Changing the replication factor of
+directories or symlinks is not supported; this command requires the <tt>replication-factor</tt> attribute.</p>
+<p><b>Example:</b>
+</p>
+<p><pre>
+<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="hdfscommands">
+        <fs>
+            <delete path='hdfs://foo:8020/usr/tucu/temp-data'/>
+            <mkdir path='archives/${wf:id()}'/>
+            <move source='${jobInput}' target='archives/${wf:id()}/processed-input'/>
+            <chmod path='${jobOutput}' permissions='-rwxrw-rw-' dir-files='true'><recursive/></chmod>
+            <chgrp path='${jobOutput}' group='testgroup' dir-files='true'><recursive/></chgrp>
+            <setrep path='archives/${wf:id()}/filename' replication-factor='2'/>
+        </fs>
+        <ok to="myotherjob"/>
+        <error to="errorcleanup"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p>In the above example, a directory named after the workflow job ID is created and the input of the job, passed as a
+workflow configuration parameter, is archived under the previously created directory.</p>
+<p>As of schema 0.4, if a <tt>name-node</tt> element is specified, then it is not necessary for any of the paths to
+start with the file system URI as it is taken from the <tt>name-node</tt> element. This is also true if the name-node
+is specified in the global section (see
+<a href="./WorkflowFunctionalSpec.html#GlobalConfigurations">Global Configurations</a>).</p>
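+<p>As a minimal sketch of the latter (assuming the <tt>global</tt> element described in the Global Configurations
+section; host and port below are placeholders), a workflow can supply the name-node once for all of its actions:</p>
+<p><pre>
+<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
+    <global>
+        <name-node>hdfs://foo:8020</name-node>
+    </global>
+    ...
+</workflow-app>
+</pre></p>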
+<p>As of schema 0.4, zero or more <tt>job-xml</tt> elements can be specified; these must refer to Hadoop JobConf
+<tt>job.xml</tt> formatted files bundled in the workflow application. They can be used to set additional properties
+for the FileSystem instance.</p>
+<p>As of schema 0.4, if a <tt>configuration</tt> element is specified, then it will also be used to set additional
+JobConf properties for the FileSystem instance. Properties specified in the <tt>configuration</tt> element override
+properties specified in the files specified by any <tt>job-xml</tt> elements.</p>
+<p><b>Example:</b>
+</p>
+<p><pre>
+<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.4">
+    ...
+    <action name="hdfscommands">
+        <fs>
+            <name-node>hdfs://foo:8020</name-node>
+            <job-xml>fs-info.xml</job-xml>
+            <configuration>
+                <property>
+                    <name>some.property</name>
+                    <value>some.value</value>
+                </property>
+            </configuration>
+            <delete path='/usr/tucu/temp-data'/>
+        </fs>
+        <ok to="myotherjob"/>
+        <error to="errorcleanup"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p><a name="SubWorkflowAction"></a>
+</p>
+<a name="a3.2.5_Sub-workflow_Action"></a>
+</div>
+<div class="section"><h5>3.2.5 Sub-workflow Action</h5>
+<p>The <tt>sub-workflow</tt> action runs a child workflow job; the child workflow job can be in the same Oozie system
+or in another Oozie system.</p>
+<p>The parent workflow job will wait until the child workflow job has completed.</p>
+<p>There can be several sub-workflows defined within a single workflow, each under its own action element.</p>
+<p><b>Syntax:</b>
+</p>
+<p><pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="[NODE-NAME]">
+        <sub-workflow>
+            <app-path>[WF-APPLICATION-PATH]</app-path>
+            <propagate-configuration/>
+            <configuration>
+                <property>
+                    <name>[PROPERTY-NAME]</name>
+                    <value>[PROPERTY-VALUE]</value>
+                </property>
+                ...
+            </configuration>
+        </sub-workflow>
+        <ok to="[NODE-NAME]"/>
+        <error to="[NODE-NAME]"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p>The child workflow job runs in the same Oozie system instance where the parent workflow job is running.</p>
+<p>The <tt>app-path</tt> element specifies the path to the workflow application of the child workflow job.</p>
+<p>The <tt>propagate-configuration</tt> flag, if present, indicates that the workflow job configuration should be
+propagated to the child workflow.</p>
+<p>The <tt>configuration</tt> section can be used to specify the job properties that are required to run the child
+workflow job.</p>
+<p>The configuration of the <tt>sub-workflow</tt> action can be parameterized (templatized) using EL expressions.</p>
+<p><b>Example:</b>
+</p>
+<p><pre>
+<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="a">
+        <sub-workflow>
+            <app-path>child-wf</app-path>
+            <configuration>
+                <property>
+                    <name>input.dir</name>
+                    <value>${wf:id()}/second-mr-output</value>
+                </property>
+            </configuration>
+        </sub-workflow>
+        <ok to="end"/>
+        <error to="kill"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p>In the above example, the workflow definition with the name <tt>child-wf</tt> will be run on the same Oozie instance
+where the parent workflow is running (e.g. <tt>http://myhost:11000/oozie</tt>). The specified workflow application must
+already be deployed on that Oozie instance.</p>
+<p>A configuration parameter <tt>input.dir</tt> is being passed as a job property to the child workflow job.</p>
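+<p>If the parent's full configuration should also be visible to the child, the sub-workflow definition above can add
+the <tt>propagate-configuration</tt> flag; a minimal variation of the example (illustrative only):</p>
+<p><pre>
+        <sub-workflow>
+            <app-path>child-wf</app-path>
+            <propagate-configuration/>
+        </sub-workflow>
+</pre></p>
+<p>With the flag present, the child workflow job receives the parent workflow job configuration in addition to any
+properties set in the <tt>configuration</tt> section.</p>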
+<p>The subworkflow can inherit the lib jars from the parent workflow by setting <tt>oozie.subworkflow.classpath.inheritance</tt>
+to true in oozie-site.xml or, on a per-job basis, by setting <tt>oozie.wf.subworkflow.classpath.inheritance</tt> to true
+in a job.properties file. If both are specified, <tt>oozie.wf.subworkflow.classpath.inheritance</tt> has priority. If the
+subworkflow and the parent have conflicting jars, the subworkflow's jar has priority. By default,
+<tt>oozie.wf.subworkflow.classpath.inheritance</tt> is set to false.</p>
+<p>To prevent errant workflows from starting infinitely recursive subworkflows, <tt>oozie.action.subworkflow.max.depth</tt>
+can be specified in oozie-site.xml to set the maximum depth of subworkflow calls. For example, if set to 3, then a
+workflow can start subwf1, which can start subwf2, which can start subwf3; but if subwf3 tries to start subwf4, then
+the action will fail. The default is 50.</p>
+<p><a name="JavaAction"></a>
+</p>
+<a name="a3.2.6_Java_Action"></a>
+</div>
+<div class="section"><h5>3.2.6 Java Action</h5>
+<p>The <tt>java</tt> action will execute the <tt>public static void main(String[] args)</tt> method of the specified
+main Java class.</p>
+<p>Java applications are executed in the Hadoop cluster as a map-reduce job with a single Mapper task.</p>
+<p>The workflow job will wait until the java application completes its execution before continuing to the next action.</p>
+<p>The <tt>java</tt> action has to be configured with the resource-manager, name-node, main Java class, JVM options
+and arguments.</p>
+<p>To indicate an <tt>ok</tt> action transition, the main Java class must complete the <tt>main</tt> method invocation
+gracefully.</p>
+<p>To indicate an <tt>error</tt> action transition, the main Java class must throw an exception.</p>
+<p>The main Java class can call <tt>System.exit(int n)</tt>. Exit code zero is regarded as OK, while non-zero exit
+codes will cause the <tt>java</tt> action to do an <tt>error</tt> transition and exit.</p>
+<p>A <tt>java</tt> action can be configured to perform HDFS files/directories cleanup or HCatalog partitions cleanup
+before starting the Java application. This capability enables Oozie to retry a Java application after a transient or
+non-transient failure (it can be used to clean up any temporary data which may have been created by the Java
+application in case of failure).</p>
+<p>A <tt>java</tt> action can create a Hadoop configuration for interacting with a cluster (e.g. launching a map-reduce
+job). Oozie prepares a Hadoop configuration file which includes the environment's site configuration files (e.g.
+hdfs-site.xml, mapred-site.xml, etc.) plus the properties added to the <tt><configuration></tt> section of the
+<tt>java</tt> action. The Hadoop configuration file is made available as a local file to the Java application in its
+running directory. It can be added to the <tt>java</tt> action's Hadoop configuration by referencing the system
+property <tt>oozie.action.conf.xml</tt>.
+For example:</p>
+<p><pre>
+// loading the action conf prepared by Oozie
+// (uses org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.Path)
+Configuration actionConf = new Configuration(false);
+actionConf.addResource(new Path("file:///", System.getProperty("oozie.action.conf.xml")));
+</pre></p>
+<p>If <tt>oozie.action.conf.xml</tt> is not added, then the job will pick up the mapred-default properties and this may
+result in unexpected behaviour. For repeated configuration properties, later values override earlier ones.</p>
+<p>Inline property values can be parameterized (templatized) using EL expressions.</p>
+<p>The YARN <tt>yarn.resourcemanager.address</tt> (<tt>resource-manager</tt>) and HDFS <tt>fs.default.name</tt>
+(<tt>name-node</tt>) properties must not be present in the <tt>job-xml</tt> or in the inline configuration.</p>
+<p>As with <tt>map-reduce</tt> and <tt>pig</tt> actions, it is possible to add files and archives to be available to
+the Java application. Refer to the <a href="#FilesArchives">Adding Files and Archives for the Job</a> section.</p>
+<p>The <tt>capture-output</tt> element can be used to propagate values back into the Oozie context, which can then be
+accessed via EL functions. These values need to be written out as a Java properties format file. The file name is
+obtained via a system property specified by the constant <tt>oozie.action.output.properties</tt> (a minimal sketch
+appears before the syntax below).</p>
+<p><b>IMPORTANT:</b> In order for a Java action to succeed on a secure cluster, it must propagate the Hadoop delegation
+token as in the following code snippet (this is benign on non-secure clusters):
+<pre>
+// propagate delegation related props from launcher job to MR job,
+// where jobConf is the configuration of the MR job to be submitted
+if (System.getenv("HADOOP_TOKEN_FILE_LOCATION") != null) {
+    jobConf.set("mapreduce.job.credentials.binary", System.getenv("HADOOP_TOKEN_FILE_LOCATION"));
+}
+</pre></p>
+<p><b>IMPORTANT:</b> Because the Java application is run from within a Map-Reduce job, from Hadoop 0.20 onwards a queue
+must be assigned to it. The queue name must be specified as a configuration property.</p>
+<p><b>IMPORTANT:</b> The Java application from a Java action is executed in a single map task. If the task is abnormally
+terminated, such as due to a TaskTracker restart (e.g. during cluster maintenance), the task will be retried via the
+normal Hadoop task retry mechanism. To avoid workflow failure, the application should be written in a fashion that is
+resilient to such retries, for example by detecting and deleting incomplete outputs or picking back up from complete
+outputs. Furthermore, if a Java action spawns asynchronous activity outside the JVM of the action itself (such as by
+launching additional MapReduce jobs), the application must consider the possibility of collisions with activity spawned
+by the new instance.</p>
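+<p>As an illustration of the <tt>capture-output</tt> contract described above, here is a minimal sketch of a main
+class that writes a value back to Oozie; the class and property names are illustrative, not part of the
+specification:</p>
+<p><pre>
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.OutputStream;
+import java.util.Properties;
+
+public class CaptureOutputMain {
+    public static void main(String[] args) throws Exception {
+        Properties props = new Properties();
+        props.setProperty("myOutput", "some-value");
+        // Oozie tells the action where to write its output via this system property
+        File outputFile = new File(System.getProperty("oozie.action.output.properties"));
+        try (OutputStream os = new FileOutputStream(outputFile)) {
+            props.store(os, "");
+        }
+    }
+}
+</pre></p>
+<p>A property written this way can later be read in the workflow definition, for example through the
+<tt>wf:actionData()</tt> EL function described in the Expression Language Functions section.</p>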
+<p><b>Syntax:</b>
+</p>
+<p><pre>
+<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:1.0">
+    ...
+    <action name="[NODE-NAME]">
+        <java>
+            <resource-manager>[RESOURCE-MANAGER]</resource-manager>
+            <name-node>[NAME-NODE]</name-node>
+            <prepare>
+                <delete path="[PATH]"/>
+                ...
+                <mkdir path="[PATH]"/>
+                ...
+            </prepare>
+            <job-xml>[JOB-XML]</job-xml>
+            <configuration>
+                <property>
+                    <name>[PROPERTY-NAME]</name>
+                    <value>[PROPERTY-VALUE]</value>
+                </property>
+                ...
+            </configuration>
+            <main-class>[MAIN-CLASS]</main-class>
+            <java-opts>[JAVA-STARTUP-OPTS]</java-opts>
+            <arg>ARGUMENT</arg>
+            ...
+            <file>[FILE-PATH]</file>
+            ...
+            <archive>[FILE-PATH]</archive>
+            ...
+            <capture-output />
+        </java>
+        <ok to="[NODE-NAME]"/>
+        <error to="[NODE-NAME]"/>
+    </action>
+    ...
+</workflow-app>
+</pre></p>
+<p>The <tt>prepare</tt> element, if present, indicates a list of paths to delete before starting the Java application.
+This should be used exclusively for directory cleanup or for dropping HCatalog tables or table partitions needed by the
+Java application. In case of <tt>delete</tt>, a glob pattern can be used to specify the path. The format for specifying
+an HCatalog table URI is
+hcat://[metastore server]:[port]/[database name]/[table name] and the format for specifying an HCatalog table partition
+URI is hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value].
+In case of an HCatalog URI, the hive-site.xml needs to be shipped using the <tt>file</tt> tag and the HCatalog and Hive
+jars need to be placed in the workflow lib directory or specified using the <tt>archive</tt> tag.</p>
+<p>The <tt>java-opts</tt> and <tt>java-opt</tt> elements, if present, contain the command line parameters which are to
+be used to start the JVM that will execute the Java application. Using this element is equivalent to using the
+<tt>mapred.child.java.opts</tt> or <tt>mapreduce.map.java.opts</tt>
[... 3669 lines stripped ...]