Ottomata has submitted this change and it was merged. Change subject: Adding oozie files to control addition and deletion of webrequest table partitions ......................................................................
Adding oozie files to control addition and deletion of webrequest table partitions util/hive/partition files are generic enough to be used with any external hive table. webrequest/partition coordinators actually schedule the hive partition workflows that add and drop partitions. Note that this is intended to be used with a different table setup than we currently have in Hive. I intend to merge the 4 webrequest_* tables into a single webrequest table, with a partition for each particular webrequest_source. Data after Camus imports will look like: webrequest/webrequest_<source>/hourly/<year>/<month>/<day>/<hour> The webrequest table will then have 5 partition keys: 4 for the date and one for the webrequest_source (e.g. text, bits, mobile, etc.). Change-Id: I7d658b431c103eb8f73afac136369d401a3958ba --- A oozie/util/hive/README.md A oozie/util/hive/hive-site.xml A oozie/util/hive/partition/add/add_partition.sql A oozie/util/hive/partition/add/workflow.properties A oozie/util/hive/partition/add/workflow.xml A oozie/util/hive/partition/drop/drop_partition.sql A oozie/util/hive/partition/drop/workflow.properties A oozie/util/hive/partition/drop/workflow.xml A oozie/webrequest/README.md A oozie/webrequest/datasets.xml A oozie/webrequest/partition/add/coordinator.properties A oozie/webrequest/partition/add/coordinator.xml A oozie/webrequest/partition/drop/coordinator.properties A oozie/webrequest/partition/drop/coordinator.xml 14 files changed, 585 insertions(+), 0 deletions(-) Approvals: Ottomata: Verified; Looks good to me, approved diff --git a/oozie/util/hive/README.md b/oozie/util/hive/README.md new file mode 100644 index 0000000..93d922c --- /dev/null +++ b/oozie/util/hive/README.md @@ -0,0 +1,10 @@ +Oozie workflows in this directory run Hive actions. The workflows +here should be generic enough to run on their own via good +parameterization. I.e. they should not refer to specific tables +or data in HDFS. 
+ +hive-site.xml is a symlink to /etc/hive/conf/hive-site.xml. If you +run ```hdfs dfs -put``` for the whole oozie/ directory, hive-site.xml +will be put as the full text of the original file in HDFS. +This file is referenced from various files relative to an ```${oozieDirectory}``` +property. diff --git a/oozie/util/hive/hive-site.xml b/oozie/util/hive/hive-site.xml new file mode 120000 index 0000000..651e45a --- /dev/null +++ b/oozie/util/hive/hive-site.xml @@ -0,0 +1 @@ +/etc/hive/conf/hive-site.xml \ No newline at end of file diff --git a/oozie/util/hive/partition/add/add_partition.sql b/oozie/util/hive/partition/add/add_partition.sql new file mode 100644 index 0000000..a5d1758 --- /dev/null +++ b/oozie/util/hive/partition/add/add_partition.sql @@ -0,0 +1,7 @@ +ADD JAR ${serde_jar}; +USE ${database}; +ALTER TABLE ${table} + ADD IF NOT EXISTS + PARTITION (${partition_spec}) + LOCATION '${location}' +; diff --git a/oozie/util/hive/partition/add/workflow.properties b/oozie/util/hive/partition/add/workflow.properties new file mode 100644 index 0000000..5decd40 --- /dev/null +++ b/oozie/util/hive/partition/add/workflow.properties @@ -0,0 +1,35 @@ +# Configures a workflow app to add a single Hive partition to a Hive table. +# Any of the following properties are overridable with -D. Some properties +# are required to be set via the CLI: 'location' and 'partition_spec'. +# +# -Dlocation="hdfs://namenode.example.org:8020/path/to/data/directory/webrequest/webrequest_mobile/2014/04/02/01" +# -Dpartition_spec="webrequest_source='mobile',year=2014,month=04,day=02,hour=01" + + +# hadoopMaster will be used for <name-node> and <job-tracker> elements. +hadoopMaster = analytics1010.eqiad.wmnet +nameNode = hdfs://${hadoopMaster}:8020 +jobTracker = ${hadoopMaster}:8032 +queueName = standard + +# Base path in HDFS to oozie files. +# Other files will be used relative to this path. +oozieDirectory = ${nameNode}/wmf/kraken/current/oozie + +# HDFS path to hive-site.xml file. 
This is needed to run hive actions. +hive_site_xml = ${oozieDirectory}/util/hive/hive-site.xml + +# HDFS path to SerDe jar that a table uses. +serde_jar = ${nameNode}/wmf/kraken/artifacts/hcatalog-core-0.5.0-cdh4.3.1.jar + +# Hive database name. +database = wmf + +# Hive table name. +table = webrequest + +# Workflow app to run. +oozie.wf.application.path = ${oozieDirectory}/util/hive/partition/add/workflow.xml +oozie.libpath = ${nameNode}/user/oozie/share/lib +oozie.use.system.libpath = true +oozie.action.external.stats.write = true diff --git a/oozie/util/hive/partition/add/workflow.xml b/oozie/util/hive/partition/add/workflow.xml new file mode 100644 index 0000000..c944b95 --- /dev/null +++ b/oozie/util/hive/partition/add/workflow.xml @@ -0,0 +1,75 @@ +<workflow-app xmlns="uri:oozie:workflow:0.4" + name="hive_add_partition-${table}_${webrequest_source}-wf"> + + <parameters> + <property> + <name>hive_script</name> + <!-- This is relative to the containing directory of this file. --> + <value>add_partition.sql</value> + <description>Hive script to run.</description> + </property> + <property> + <name>queueName</name> + <value>adhoc</value> + </property> + + <!-- Required properties --> + <property><name>nameNode</name></property> + <property><name>jobTracker</name></property> + + <property> + <name>hive_site_xml</name> + <description>hive-site.xml file path in HDFS</description> + </property> + <property> + <name>serde_jar</name> + <description>Path in HDFS for SerDe .jar to use.</description> + </property> + <property> + <name>database</name> + <description>Hive database to use.</description> + </property> + <property> + <name>table</name> + <description>Hive table to partition.</description> + </property> + <property> + <name>partition_spec</name> + <description>Full partition spec to pass to Hive script. E.g. 
'year=2014, month=04,...'</description> + </property> + <property> + <name>location</name> + <description>HDFS path(s) naming the input dataset.</description> + </property> + </parameters> + + <start to="add_partition"/> + + <action name="add_partition"> + <hive xmlns="uri:oozie:hive-action:0.2"> + <job-tracker>${jobTracker}</job-tracker> + <name-node>${nameNode}</name-node> + <job-xml>${hive_site_xml}</job-xml> + <configuration> + <property> + <name>mapreduce.job.queuename</name> + <value>${queueName}</value> + </property> + </configuration> + + <script>${hive_script}</script> + <param>database=${database}</param> + <param>table=${table}</param> + <param>serde_jar=${serde_jar}</param> + <param>location=${location}</param> + <param>partition_spec=${partition_spec}</param> + </hive> + <ok to="end"/> + <error to="kill"/> + </action> + + <kill name="kill"> + <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> + </kill> + <end name="end"/> +</workflow-app> diff --git a/oozie/util/hive/partition/drop/drop_partition.sql b/oozie/util/hive/partition/drop/drop_partition.sql new file mode 100644 index 0000000..800266e --- /dev/null +++ b/oozie/util/hive/partition/drop/drop_partition.sql @@ -0,0 +1,6 @@ +ADD JAR ${serde_jar}; +USE ${database}; +ALTER TABLE ${table} + DROP IF EXISTS + PARTITION (${partition_spec}) +; \ No newline at end of file diff --git a/oozie/util/hive/partition/drop/workflow.properties b/oozie/util/hive/partition/drop/workflow.properties new file mode 100644 index 0000000..f4e3ab9 --- /dev/null +++ b/oozie/util/hive/partition/drop/workflow.properties @@ -0,0 +1,36 @@ +# Configures a workflow app to drop a single Hive partition from a Hive table +# and to delete the partition location from HDFS. # Any of the following properties +# are overridable with -D. Some properties are required to be set via the CLI: +# 'location' and 'partition_spec'. 
+# +# -Dlocation="hdfs://namenode.example.org:8020/path/to/data/directory/webrequest/webrequest_mobile/2014/04/02/01" +# -Dpartition_spec="webrequest_source='mobile',year=2014,month=04,day=02,hour=01" + + +# hadoopMaster will be used for <name-node> and <job-tracker> elements. +hadoopMaster = analytics1010.eqiad.wmnet +nameNode = hdfs://${hadoopMaster}:8020 +jobTracker = ${hadoopMaster}:8032 +queueName = standard + +# Base path in HDFS to oozie files. +# Other files will be used relative to this path. +oozieDirectory = ${nameNode}/wmf/kraken/current/oozie + +# HDFS path to hive-site.xml file. This is needed to run hive actions. +hive_site_xml = ${oozieDirectory}/util/hive/hive-site.xml + +# HDFS path to SerDe jar that a table uses. +serde_jar = ${nameNode}/wmf/kraken/artifacts/hcatalog-core-0.5.0-cdh4.3.1.jar + +# Hive database name. +database = wmf + +# Hive table name. +table = webrequest + +# Workflow app to run. +oozie.wf.application.path = ${oozieDirectory}/util/hive/partition/drop/workflow.xml +oozie.libpath = ${nameNode}/user/oozie/share/lib +oozie.use.system.libpath = true +oozie.action.external.stats.write = true diff --git a/oozie/util/hive/partition/drop/workflow.xml b/oozie/util/hive/partition/drop/workflow.xml new file mode 100644 index 0000000..b0454a6 --- /dev/null +++ b/oozie/util/hive/partition/drop/workflow.xml @@ -0,0 +1,78 @@ +<workflow-app xmlns="uri:oozie:workflow:0.4" + name="hive_drop_partition-${table}_${webrequest_source}-wf"> + + <parameters> + <property> + <name>hive_script</name> + <!-- This is relative to the parent directory of this file. 
--> + <value>drop_partition.sql</value> + <description>Hive script to run.</description> + </property> + <property> + <name>queueName</name> + <value>adhoc</value> + </property> + + <!-- Required properties --> + <property><name>nameNode</name></property> + <property><name>jobTracker</name></property> + + <property> + <name>hive_site_xml</name> + <description>hive-site.xml file path in HDFS</description> + </property> + <property> + <name>serde_jar</name> + <description>Path in HDFS for SerDe .jar to use.</description> + </property> + <property> + <name>database</name> + <description>Hive database to use.</description> + </property> + <property> + <name>table</name> + <description>Hive table to partition.</description> + </property> + <property> + <name>partition_spec</name> + <description>Full partition spec to pass to Hive script. E.g. 'year=2014, month=04,...'</description> + </property> + <property> + <name>location</name> + <description>HDFS path to delete.</description> + </property> + + </parameters> + + <start to="delete_location_and_drop_partition"/> + + <!-- + Run ${hive_script} to drop the partition + specified by ${partition_spec}. + --> + <action name="delete_location_and_drop_partition"> + <hive xmlns="uri:oozie:hive-action:0.2"> + <job-tracker>${jobTracker}</job-tracker> + <name-node>${nameNode}</name-node> + + <!-- Delete the location as we drop the partition from the table. 
--> + <prepare> + <delete path="${location}"/> + </prepare> + + <job-xml>${hive_site_xml}</job-xml> + <script>${hive_script}</script> + <param>database=${database}</param> + <param>table=${table}</param> + <param>serde_jar=${serde_jar}</param> + <param>partition_spec=${partition_spec}</param> + </hive> + <ok to="end"/> + <error to="kill"/> + </action> + + <kill name="kill"> + <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> + </kill> + <end name="end"/> +</workflow-app> diff --git a/oozie/webrequest/README.md b/oozie/webrequest/README.md new file mode 100644 index 0000000..4af5ca6 --- /dev/null +++ b/oozie/webrequest/README.md @@ -0,0 +1,2 @@ +This directory contains the dataset definition and coordinators that launch +jobs specific to the wmf.webrequest Hive table. diff --git a/oozie/webrequest/datasets.xml b/oozie/webrequest/datasets.xml new file mode 100644 index 0000000..3706de6 --- /dev/null +++ b/oozie/webrequest/datasets.xml @@ -0,0 +1,19 @@ +<!-- +Defines a reusable dataset for raw webrequest data. +Use this dataset in your coordinator.xml files by setting: + + ${startTime} - the initial instance of your data. + Example: 2014-04-01T00:00Z + ${dataDirectory} - Path to directory where data is time bucketed. + Example: /wmf/data/external/webrequest/webrequest_mobile/hourly +--> + +<datasets> +<dataset name="webrequest" + frequency="${coord:hours(1)}" + initial-instance="${startTime}" + timezone="Universal"> + <uri-template>${dataDirectory}/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template> + <done-flag></done-flag> + </dataset> +</datasets> diff --git a/oozie/webrequest/partition/add/coordinator.properties b/oozie/webrequest/partition/add/coordinator.properties new file mode 100644 index 0000000..64df591 --- /dev/null +++ b/oozie/webrequest/partition/add/coordinator.properties @@ -0,0 +1,54 @@ +# Configures a coordinator to manage automatically adding Hive partitions to +# a webrequest table. 
Any of the following properties are overridable with -D. +# The 'webrequest_source' property is required and must be set on the CLI when +# submitting this coordinator. +# +# Usage: +# oozie job -Dwebrequest_source=mobile -submit -config oozie/webrequest/partition/add/coordinator.properties +# +# NOTE: The $oozieDirectory must be synced to HDFS so that all relevant +# .xml files exist there when this job is submitted. + + +# hadoopMaster will be used for <name-node> and <job-tracker> elements. +hadoopMaster = analytics1010.eqiad.wmnet +nameNode = hdfs://${hadoopMaster}:8020 +jobTracker = ${hadoopMaster}:8032 +queueName = standard + +# Base path in HDFS to oozie files. +# Other files will be used relative to this path. +oozieDirectory = ${nameNode}/wmf/kraken/current/oozie + +# HDFS path to workflow to run. +workflowFile = ${oozieDirectory}/util/hive/partition/add/workflow.xml + +# HDFS path to webrequest dataset definition +datasetsFile = ${oozieDirectory}/webrequest/datasets.xml + +# Initial import time of the webrequest dataset. +startTime = 2014-04-01T00:00Z + +# Time to stop running this coordinator. Year 3000 == never! +stopTime = 3000-01-01T00:00Z + +# HDFS path to hive-site.xml file. This is needed to run hive actions. +hive_site_xml = ${oozieDirectory}/util/hive/hive-site.xml + +# HDFS path to SerDe jar that the webrequest table uses. +serde_jar = ${nameNode}/wmf/kraken/artifacts/hcatalog-core-0.5.0-cdh4.3.1.jar + +# Hive database name. +database = wmf + +# Hive table name. +table = webrequest + +# HDFS path to directory where webrequest data is time bucketed. +dataDirectory = ${nameNode}/wmf/data/external/${table}/webrequest_${webrequest_source}/hourly + +# Coordinator to start. 
+oozie.coord.application.path = ${oozieDirectory}/webrequest/partition/add/coordinator.xml +oozie.libpath = ${nameNode}/user/oozie/share/lib +oozie.use.system.libpath = true +oozie.action.external.stats.write = true diff --git a/oozie/webrequest/partition/add/coordinator.xml b/oozie/webrequest/partition/add/coordinator.xml new file mode 100644 index 0000000..0ec13b8 --- /dev/null +++ b/oozie/webrequest/partition/add/coordinator.xml @@ -0,0 +1,98 @@ +<coordinator-app xmlns="uri:oozie:coordinator:0.4" + name="hive_add_partition-${table}_${webrequest_source}-coord" + frequency="${coord:hours(1)}" + start="${startTime}" + end="${stopTime}" + timezone="Universal"> + + <parameters> + <property> + <name>queueName</name> + <value>standard</value> + </property> + + <!-- Required properties. --> + <property><name>nameNode</name></property> + <property><name>jobTracker</name></property> + <property><name>workflowFile</name></property> + <property><name>startTime</name></property> + <property><name>stopTime</name></property> + <property><name>dataDirectory</name></property> + + <property><name>hive_site_xml</name></property> + <property><name>serde_jar</name></property> + <property><name>database</name></property> + <property><name>table</name></property> + <property><name>webrequest_source</name></property> + </parameters> + + <controls> + <!-- + Altering hive partitions is cheap. If there is an + occasion where we have to add a bunch all at once, + do it! + --> + <concurrency>168</concurrency> + </controls> + + <datasets> + <!-- + Include the given datasetsFile file. This should + define the "webrequest" dataset for this coordinator. + --> + <include>${datasetsFile}</include> + </datasets> + + <input-events> + <data-in name="input" dataset="webrequest"> + <instance>${coord:current(0)}</instance> + </data-in> + <!-- + Having a data-in defined for current(1) will + keep the workflow from running until the next + hour's directory is created. 
+ --> + <data-in name="ready_indicator" dataset="webrequest"> + <instance>${coord:current(1)}</instance> + </data-in> + </input-events> + + <action> + <workflow> + <app-path>${workflowFile}</app-path> + <configuration> + + <!-- Pass these properties through to the workflow --> + <property><name>nameNode</name><value>${nameNode}</value></property> + <property><name>jobTracker</name><value>${jobTracker}</value></property> + <property><name>queueName</name><value>${queueName}</value></property> + + <property> + <name>hive_site_xml</name> + <value>${hive_site_xml}</value> + </property> + <property> + <name>serde_jar</name> + <value>${serde_jar}</value> + </property> + <property> + <name>database</name> + <value>${database}</value> + </property> + <property> + <name>table</name> + <value>${table}</value> + </property> + <property> + <name>partition_spec</name> + <value>webrequest_source='${webrequest_source}',${coord:formatTime(coord:dateOffset(coord:nominalTime(), 0, 'HOUR'), "'year='yyyy,'month='MM,'day='dd,'hour='HH")}</value> + </property> + <property> + <name>location</name> + <value>${coord:dataIn('input')}</value> + </property> + + </configuration> + </workflow> + </action> +</coordinator-app> diff --git a/oozie/webrequest/partition/drop/coordinator.properties b/oozie/webrequest/partition/drop/coordinator.properties new file mode 100644 index 0000000..b4fc657 --- /dev/null +++ b/oozie/webrequest/partition/drop/coordinator.properties @@ -0,0 +1,62 @@ +# Configures a coordinator to manage automatically dropping old Hive partitions +# for a webrequest table, and deleting their respective data locations in +# HDFS. Any of the following properties are overridable with -D. +# The 'webrequest_source' property is required and must be set on the CLI when +# submitting this coordinator. +# +# Usage: +# oozie job -Dwebrequest_source=mobile -submit -config oozie/webrequest/partition/drop/coordinator.properties 
+# +# NOTE: The $oozieDirectory must be synced to HDFS so that all relevant +# .xml files exist there when this job is submitted. + + +# hadoopMaster will be used for <name-node> and <job-tracker> elements. +hadoopMaster = analytics1010.eqiad.wmnet +nameNode = hdfs://${hadoopMaster}:8020 +jobTracker = ${hadoopMaster}:8032 +queueName = standard + +# Base path in HDFS to oozie files. +# Other files will be used relative to this path. +oozieDirectory = ${nameNode}/wmf/kraken/current/oozie + +# HDFS path to workflow to run. +workflowFile = ${oozieDirectory}/util/hive/partition/drop/workflow.xml + +# HDFS path to webrequest dataset definition +datasetsFile = ${oozieDirectory}/webrequest/datasets.xml + +# Initial import time of the webrequest dataset. +# Note: This is intentionally set to 1440 hours +# after the add partition coordinator's +# initial instance. Each drop workflow app +# will remove the partition 1440 hours before +# the current instance. So if we want to delete +# partitions that old, we need to wait until +# that much time has passed. +startTime = 2014-05-31T00:00Z + +# Time to stop running this coordinator. Year 3000 == never! +stopTime = 3000-01-01T00:00Z + +# HDFS path to hive-site.xml file. This is needed to run hive actions. +hive_site_xml = ${oozieDirectory}/util/hive/hive-site.xml + +# HDFS path to SerDe jar that the webrequest table uses. +serde_jar = ${nameNode}/wmf/kraken/artifacts/hcatalog-core-0.5.0-cdh4.3.1.jar + +# Hive database name. +database = wmf + +# Hive table name. +table = webrequest + +# HDFS path to directory where webrequest data is time bucketed. 
+dataDirectory = ${nameNode}/wmf/data/external/${table}/webrequest_${webrequest_source}/hourly + +# Coordinator to start +oozie.coord.application.path = ${oozieDirectory}/webrequest/partition/drop/coordinator.xml +oozie.libpath = ${nameNode}/user/oozie/share/lib +oozie.use.system.libpath = true +oozie.action.external.stats.write = true \ No newline at end of file diff --git a/oozie/webrequest/partition/drop/coordinator.xml b/oozie/webrequest/partition/drop/coordinator.xml new file mode 100644 index 0000000..194fb1d --- /dev/null +++ b/oozie/webrequest/partition/drop/coordinator.xml @@ -0,0 +1,102 @@ +<coordinator-app xmlns="uri:oozie:coordinator:0.4" + name="hive_drop_partition-${table}_${webrequest_source}-coord" + frequency="${coord:hours(1)}" + start="${startTime}" + end="${stopTime}" + timezone="Universal"> + + <parameters> + <property> + <name>queueName</name> + <value>adhoc</value> + </property> + + <!-- Required properties. --> + <property><name>nameNode</name></property> + <property><name>jobTracker</name></property> + <property><name>workflowFile</name></property> + <property><name>startTime</name></property> + <property><name>stopTime</name></property> + <property><name>dataDirectory</name></property> + + <property><name>hive_site_xml</name></property> + <property><name>serde_jar</name></property> + <property><name>database</name></property> + <property><name>table</name></property> + <property><name>webrequest_source</name></property> + </parameters> + + <controls> + <!-- + Altering hive partitions is cheap. If there is an + occasion where we have to drop a bunch all at once, + do it! + --> + <concurrency>168</concurrency> + </controls> + + <datasets> + <!-- + Include the given datasetsFile file. This should + define the "webrequest" dataset for this coordinator. 
+ --> + <include>${datasetsFile}</include> + </datasets> + + <input-events> + <data-in name="toDelete" dataset="webrequest"> + <!-- + The partition_spec to delete is inferred from the current + instance - 1440 hours. This amounts to 60 days ago. + Please make sure ${startTime} is set to 1440 hours beyond + your dataset's actual initial instance. (It won't hurt + to incorrectly set ${startTime}, you'll just end up with + a lot of failed jobs until there is actually data old + enough to delete.) + --> + <instance>${coord:current(0)}</instance> + </data-in> + </input-events> + + <action> + <workflow> + <app-path>${workflowFile}</app-path> + <configuration> + + <!-- Pass these properties through to the workflow --> + <property><name>nameNode</name><value>${nameNode}</value></property> + <property><name>jobTracker</name><value>${jobTracker}</value></property> + <property><name>queueName</name><value>${queueName}</value></property> + + <property> + <name>hive_site_xml</name> + <value>${hive_site_xml}</value> + </property> + <property> + <name>serde_jar</name> + <value>${serde_jar}</value> + </property> + <property> + <name>database</name> + <value>${database}</value> + </property> + <property> + <name>table</name> + <value>${table}</value> + </property> + <property> + <name>partition_spec</name> + <!-- + -1440 hours is 60 days ago. Delete the partition that is 60 days old. 
+ --> + <value>webrequest_source='${webrequest_source}',${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1440, 'HOUR'), "'year='yyyy,'month='MM,'day='dd,'hour='HH")}</value> + </property> + <property> + <name>location</name> + <value>${coord:dataIn('toDelete')}</value> + </property> + + </configuration> + </workflow> + </action> +</coordinator-app> -- To view, visit https://gerrit.wikimedia.org/r/131208 To unsubscribe, visit https://gerrit.wikimedia.org/r/settings Gerrit-MessageType: merged Gerrit-Change-Id: I7d658b431c103eb8f73afac136369d401a3958ba Gerrit-PatchSet: 5 Gerrit-Project: analytics/kraken Gerrit-Branch: master Gerrit-Owner: Ottomata <o...@wikimedia.org> Gerrit-Reviewer: Ottomata <o...@wikimedia.org> Gerrit-Reviewer: QChris <christ...@quelltextlich.at> Gerrit-Reviewer: jenkins-bot <> _______________________________________________ MediaWiki-commits mailing list MediaWiki-commits@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
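For reference, the `partition_spec` and `location` values that these coordinators build with Oozie EL (`coord:dateOffset` plus `coord:formatTime`) can be sketched in plain Python. The helper names below are illustrative only, not part of this change; an `hour_offset` of 0 mirrors the add coordinator and -1440 mirrors the drop coordinator's 60-day lag:

```python
from datetime import datetime, timedelta

def partition_spec(webrequest_source, nominal_time, hour_offset=0):
    # Mirrors the coordinator's EL value:
    # webrequest_source='X',year=YYYY,month=MM,day=DD,hour=HH
    # (month/day/hour zero-padded, as the MM/dd/HH patterns produce).
    t = nominal_time + timedelta(hours=hour_offset)
    return (
        "webrequest_source='%s',year=%d,month=%02d,day=%02d,hour=%02d"
        % (webrequest_source, t.year, t.month, t.day, t.hour)
    )

def partition_location(data_directory, nominal_time, hour_offset=0):
    # Mirrors the dataset uri-template:
    # ${dataDirectory}/${YEAR}/${MONTH}/${DAY}/${HOUR}
    t = nominal_time + timedelta(hours=hour_offset)
    return "%s/%04d/%02d/%02d/%02d" % (
        data_directory, t.year, t.month, t.day, t.hour)

# The example spec from the workflow.properties comments:
print(partition_spec('mobile', datetime(2014, 4, 2, 1)))
# webrequest_source='mobile',year=2014,month=04,day=02,hour=01
```

Note how the -1440 hour offset lines up with the drop coordinator's `startTime` comment: its first instance (2014-05-31T00:00Z) drops the partition for the add coordinator's first instance (2014-04-01T00:00Z).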