Ottomata has submitted this change and it was merged. Change subject: Adding oozie files to control addition and deletion of webrequest table partitions ......................................................................
Adding oozie files to control addition and deletion of webrequest table partitions util/hive/partition files are generic enough to be used with any external hive table. webrequest/partition coordinators actually schedule the hive partition workflows that add and drop partitions. Note that this is intended to be used with a different table setup than we currently have in Hive. I intend to merge the 4 webrequest_* tables into a single webrequest table, with a partition for each particular webrequest_source. Data after Camus imports will look like: webrequest/webrequest_<source>/hourly/<year>/<month>/<day>/<hour> The webrequest table will then have 5 partition keys: 4 for the date and one for the webrequest_source (e.g. text, bits, mobile, etc.). Change-Id: I7d658b431c103eb8f73afac136369d401a3958ba --- A oozie/util/hive/README.md A oozie/util/hive/hive-site.xml A oozie/util/hive/partition/add/add_partition.sql A oozie/util/hive/partition/add/workflow.properties A oozie/util/hive/partition/add/workflow.xml A oozie/util/hive/partition/drop/drop_partition.sql A oozie/util/hive/partition/drop/workflow.properties A oozie/util/hive/partition/drop/workflow.xml A oozie/webrequest/README.md A oozie/webrequest/datasets.xml A oozie/webrequest/partition/add/coordinator.properties A oozie/webrequest/partition/add/coordinator.xml A oozie/webrequest/partition/drop/coordinator.properties A oozie/webrequest/partition/drop/coordinator.xml 14 files changed, 585 insertions(+), 0 deletions(-) Approvals: Ottomata: Verified; Looks good to me, approved diff --git a/oozie/util/hive/README.md b/oozie/util/hive/README.md new file mode 100644 index 0000000..93d922c --- /dev/null +++ b/oozie/util/hive/README.md @@ -0,0 +1,10 @@ +Oozie workflows in this directory run Hive actions. The workflows +here should be generic enough to run on their own via good +parameterization. I.e. they should not refer to specific tables +or data in HDFS. 
+ +hive-site.xml is a symlink to /etc/hive/conf/hive-site.xml. If you +run ```hdfs dfs -put``` for the whole oozie/ directory, hive-site.xml +will be put as the full text of the original file in HDFS. +This file is referenced from various files relative to an ```${oozieDirectory}``` +property. diff --git a/oozie/util/hive/hive-site.xml b/oozie/util/hive/hive-site.xml new file mode 120000 index 0000000..651e45a --- /dev/null +++ b/oozie/util/hive/hive-site.xml @@ -0,0 +1 @@ +/etc/hive/conf/hive-site.xml \ No newline at end of file diff --git a/oozie/util/hive/partition/add/add_partition.sql b/oozie/util/hive/partition/add/add_partition.sql new file mode 100644 index 0000000..a5d1758 --- /dev/null +++ b/oozie/util/hive/partition/add/add_partition.sql @@ -0,0 +1,7 @@ +ADD JAR ${serde_jar}; +USE ${database}; +ALTER TABLE ${table} + ADD IF NOT EXISTS + PARTITION (${partition_spec}) + LOCATION '${location}' +; diff --git a/oozie/util/hive/partition/add/workflow.properties b/oozie/util/hive/partition/add/workflow.properties new file mode 100644 index 0000000..5decd40 --- /dev/null +++ b/oozie/util/hive/partition/add/workflow.properties @@ -0,0 +1,35 @@ +# Configures a workflow app to add a single Hive partition to a Hive table. +# Any of the following properties are overridable with -D. Some properties +# are required to be set via the CLI: 'location' and 'partition_spec'. +# +# -Dlocation="hdfs://namenode.example.org:8020/path/to/data/directory/webrequest/webrequest_mobile/2014/04/02/01" +# -Dpartition_spec="webrequest_source='mobile',year=2014,month=04,day=02,hour=01" + + +# hadoopMaster will be used for <name-node> and <job-tracker> elements. +hadoopMaster = analytics1010.eqiad.wmnet +nameNode = hdfs://${hadoopMaster}:8020 +jobTracker = ${hadoopMaster}:8032 +queueName = standard + +# Base path in HDFS to oozie files. +# Other files will be used relative to this path. +oozieDirectory = ${nameNode}/wmf/kraken/current/oozie + +# HDFS path to hive-site.xml file. 
This is needed to run hive actions. +hive_site_xml = ${oozieDirectory}/util/hive/hive-site.xml + +# HDFS path to SerDe jar that a table uses. +serde_jar = ${nameNode}/wmf/kraken/artifacts/hcatalog-core-0.5.0-cdh4.3.1.jar + +# Hive database name. +database = wmf + +# Hive table name. +table = webrequest + +# Workflow app to run. +oozie.wf.application.path = ${oozieDirectory}/util/hive/partition/add/workflow.xml +oozie.libpath = ${nameNode}/user/oozie/share/lib +oozie.use.system.libpath = true +oozie.action.external.stats.write = true diff --git a/oozie/util/hive/partition/add/workflow.xml b/oozie/util/hive/partition/add/workflow.xml new file mode 100644 index 0000000..c944b95 --- /dev/null +++ b/oozie/util/hive/partition/add/workflow.xml @@ -0,0 +1,75 @@ +<workflow-app xmlns="uri:oozie:workflow:0.4" + name="hive_add_partition-${table}_${webrequest_source}-wf"> + + <parameters> + <property> + <name>hive_script</name> + <!-- This is relative to the containing directory of this file. --> + <value>add_partition.sql</value> + <description>Hive script to run.</description> + </property> + <property> + <name>queueName</name> + <value>adhoc</value> + </property> + + <!-- Required properties --> + <property><name>nameNode</name></property> + <property><name>jobTracker</name></property> + + <property> + <name>hive_site_xml</name> + <description>hive-site.xml file path in HDFS</description> + </property> + <property> + <name>serde_jar</name> + <description>Path in HDFS for SerDe .jar to use.</description> + </property> + <property> + <name>database</name> + <description>Hive database to use.</description> + </property> + <property> + <name>table</name> + <description>Hive table to partition.</description> + </property> + <property> + <name>partition_spec</name> + <description>Full partition spec to pass to Hive script. E.g. 
'year=2014, month=04,...'</description> + </property> + <property> + <name>location</name> + <description>HDFS path(s) naming the input dataset.</description> + </property> + </parameters> + + <start to="add_partition"/> + + <action name="add_partition"> + <hive xmlns="uri:oozie:hive-action:0.2"> + <job-tracker>${jobTracker}</job-tracker> + <name-node>${nameNode}</name-node> + <job-xml>${hive_site_xml}</job-xml> + <configuration> + <property> + <name>mapreduce.job.queuename</name> + <value>${queueName}</value> + </property> + </configuration> + + <script>${hive_script}</script> + <param>database=${database}</param> + <param>table=${table}</param> + <param>serde_jar=${serde_jar}</param> + <param>location=${location}</param> + <param>partition_spec=${partition_spec}</param> + </hive> + <ok to="end"/> + <error to="kill"/> + </action> + + <kill name="kill"> + <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> + </kill> + <end name="end"/> +</workflow-app> diff --git a/oozie/util/hive/partition/drop/drop_partition.sql b/oozie/util/hive/partition/drop/drop_partition.sql new file mode 100644 index 0000000..800266e --- /dev/null +++ b/oozie/util/hive/partition/drop/drop_partition.sql @@ -0,0 +1,6 @@ +ADD JAR ${serde_jar}; +USE ${database}; +ALTER TABLE ${table} + DROP IF EXISTS + PARTITION (${partition_spec}) +; \ No newline at end of file diff --git a/oozie/util/hive/partition/drop/workflow.properties b/oozie/util/hive/partition/drop/workflow.properties new file mode 100644 index 0000000..f4e3ab9 --- /dev/null +++ b/oozie/util/hive/partition/drop/workflow.properties @@ -0,0 +1,36 @@ +# Configures a workflow app to drop a single Hive partition from a Hive table +# and to delete the partition location from HDFS. # Any of the following properties +# are overridable with -D. Some properties are required to be set via the CLI: +# 'location' and 'partition_spec'. 
+# +# -Dlocation="hdfs://namenode.example.org:8020/path/to/data/directory/webrequest/webrequest_mobile/2014/04/02/01" +# -Dpartition_spec="webrequest_source='mobile',year=2014,month=04,day=02,hour=01" + + +# hadoopMaster will be used for <name-node> and <job-tracker> elements. +hadoopMaster = analytics1010.eqiad.wmnet +nameNode = hdfs://${hadoopMaster}:8020 +jobTracker = ${hadoopMaster}:8032 +queueName = standard + +# Base path in HDFS to oozie files. +# Other files will be used relative to this path. +oozieDirectory = ${nameNode}/wmf/kraken/current/oozie + +# HDFS path to hive-site.xml file. This is needed to run hive actions. +hive_site_xml = ${oozieDirectory}/util/hive/hive-site.xml + +# HDFS path to SerDe jar that a table uses. +serde_jar = ${nameNode}/wmf/kraken/artifacts/hcatalog-core-0.5.0-cdh4.3.1.jar + +# Hive database name. +database = wmf + +# Hive table name. +table = webrequest + +# Workflow app to run. +oozie.wf.application.path = ${oozieDirectory}/util/hive/partition/drop/workflow.xml +oozie.libpath = ${nameNode}/user/oozie/share/lib +oozie.use.system.libpath = true +oozie.action.external.stats.write = true diff --git a/oozie/util/hive/partition/drop/workflow.xml b/oozie/util/hive/partition/drop/workflow.xml new file mode 100644 index 0000000..b0454a6 --- /dev/null +++ b/oozie/util/hive/partition/drop/workflow.xml @@ -0,0 +1,78 @@ +<workflow-app xmlns="uri:oozie:workflow:0.4" + name="hive_drop_partition-${table}_${webrequest_source}-wf"> + + <parameters> + <property> + <name>hive_script</name> + <!-- This is relative to the parent directory of this file. 
--> + <value>drop_partition.sql</value> + <description>Hive script to run.</description> + </property> + <property> + <name>queueName</name> + <value>adhoc</value> + </property> + + <!-- Required properties --> + <property><name>nameNode</name></property> + <property><name>jobTracker</name></property> + + <property> + <name>hive_site_xml</name> + <description>hive-site.xml file path in HDFS</description> + </property> + <property> + <name>serde_jar</name> + <description>Path in HDFS for SerDe .jar to use.</description> + </property> + <property> + <name>database</name> + <description>Hive database to use.</description> + </property> + <property> + <name>table</name> + <description>Hive table to partition.</description> + </property> + <property> + <name>partition_spec</name> + <description>Full partition spec to pass to Hive script. E.g. 'year=2014, month=04,...'</description> + </property> + <property> + <name>location</name> + <description>HDFS path to delete.</description> + </property> + + </parameters> + + <start to="delete_location_and_drop_partition"/> + + <!-- + Run ${hive_script} to drop the partition + specified by ${partition_spec}. + --> + <action name="delete_location_and_drop_partition"> + <hive xmlns="uri:oozie:hive-action:0.2"> + <job-tracker>${jobTracker}</job-tracker> + <name-node>${nameNode}</name-node> + + <!-- Delete the location as we drop the partition from the table. 
--> + <prepare> + <delete path="${location}"/> + </prepare> + + <job-xml>${hive_site_xml}</job-xml> + <script>${hive_script}</script> + <param>database=${database}</param> + <param>table=${table}</param> + <param>serde_jar=${serde_jar}</param> + <param>partition_spec=${partition_spec}</param> + </hive> + <ok to="end"/> + <error to="kill"/> + </action> + + <kill name="kill"> + <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> + </kill> + <end name="end"/> +</workflow-app> diff --git a/oozie/webrequest/README.md b/oozie/webrequest/README.md new file mode 100644 index 0000000..4af5ca6 --- /dev/null +++ b/oozie/webrequest/README.md @@ -0,0 +1,2 @@ +This directory contains the dataset definition and coordinators that launch +jobs specific to the wmf.webrequest Hive table. diff --git a/oozie/webrequest/datasets.xml b/oozie/webrequest/datasets.xml new file mode 100644 index 0000000..3706de6 --- /dev/null +++ b/oozie/webrequest/datasets.xml @@ -0,0 +1,19 @@ +<!-- +Defines a reusable dataset for raw webrequest data. +Use this dataset in your coordinator.xml files by setting: + + ${startTime} - the initial instance of your data. + Example: 2014-04-01T00:00Z + ${dataDirectory} - Path to directory where data is time bucketed. + Example: /wmf/data/external/webrequest/webrequest_mobile/hourly +--> + +<datasets> +<dataset name="webrequest" + frequency="${coord:hours(1)}" + initial-instance="${startTime}" + timezone="Universal"> + <uri-template>${dataDirectory}/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template> + <done-flag></done-flag> + </dataset> +</datasets> diff --git a/oozie/webrequest/partition/add/coordinator.properties b/oozie/webrequest/partition/add/coordinator.properties new file mode 100644 index 0000000..64df591 --- /dev/null +++ b/oozie/webrequest/partition/add/coordinator.properties @@ -0,0 +1,54 @@ +# Configures a coordinator to manage automatically adding Hive partitions to +# a webrequest table. 
Any of the following properties are overridable with -D. +# The 'webrequest_source' property is required and must be set on the CLI when +# submitting this coordinator. +# +# Usage: +# oozie job -Dwebrequest_source=mobile -submit -config oozie/webrequest/partition/add/coordinator.properties +# +# NOTE: The $oozieDirectory must be synced to HDFS so that all relevant +# .xml files exist there when this job is submitted. + + +# hadoopMaster will be used for <name-node> and <job-tracker> elements. +hadoopMaster = analytics1010.eqiad.wmnet +nameNode = hdfs://${hadoopMaster}:8020 +jobTracker = ${hadoopMaster}:8032 +queueName = standard + +# Base path in HDFS to oozie files. +# Other files will be used relative to this path. +oozieDirectory = ${nameNode}/wmf/kraken/current/oozie + +# HDFS path to workflow to run. +workflowFile = ${oozieDirectory}/util/hive/partition/add/workflow.xml + +# HDFS path to webrequest dataset definition +datasetsFile = ${oozieDirectory}/webrequest/datasets.xml + +# Initial import time of the webrequest dataset. +startTime = 2014-04-01T00:00Z + +# Time to stop running this coordinator. Year 3000 == never! +stopTime = 3000-01-01T00:00Z + +# HDFS path to hive-site.xml file. This is needed to run hive actions. +hive_site_xml = ${oozieDirectory}/util/hive/hive-site.xml + +# HDFS path to SerDe jar that the webrequest table uses. +serde_jar = ${nameNode}/wmf/kraken/artifacts/hcatalog-core-0.5.0-cdh4.3.1.jar + +# Hive database name. +database = wmf + +# Hive table name. +table = webrequest + +# HDFS path to directory where webrequest data is time bucketed. +dataDirectory = ${nameNode}/wmf/data/external/${table}/webrequest_${webrequest_source}/hourly + +# Coordinator to start. 
+oozie.coord.application.path = ${oozieDirectory}/webrequest/partition/add/coordinator.xml +oozie.libpath = ${nameNode}/user/oozie/share/lib +oozie.use.system.libpath = true +oozie.action.external.stats.write = true diff --git a/oozie/webrequest/partition/add/coordinator.xml b/oozie/webrequest/partition/add/coordinator.xml new file mode 100644 index 0000000..0ec13b8 --- /dev/null +++ b/oozie/webrequest/partition/add/coordinator.xml @@ -0,0 +1,98 @@ +<coordinator-app xmlns="uri:oozie:coordinator:0.4" + name="hive_add_partition-${table}_${webrequest_source}-coord" + frequency="${coord:hours(1)}" + start="${startTime}" + end="${stopTime}" + timezone="Universal"> + + <parameters> + <property> + <name>queueName</name> + <value>standard</value> + </property> + + <!-- Required properties. --> + <property><name>nameNode</name></property> + <property><name>jobTracker</name></property> + <property><name>workflowFile</name></property> + <property><name>startTime</name></property> + <property><name>stopTime</name></property> + <property><name>dataDirectory</name></property> + + <property><name>hive_site_xml</name></property> + <property><name>serde_jar</name></property> + <property><name>database</name></property> + <property><name>table</name></property> + <property><name>webrequest_source</name></property> + </parameters> + + <controls> + <!-- + Altering hive partitions is cheap. If there is an + occasion where we have to add a bunch all at once, + do it! + --> + <concurrency>168</concurrency> + </controls> + + <datasets> + <!-- + Include the given datasetsFile file. This should + define the "webrequest" dataset for this coordinator. + --> + <include>${datasetsFile}</include> + </datasets> + + <input-events> + <data-in name="input" dataset="webrequest"> + <instance>${coord:current(0)}</instance> + </data-in> + <!-- + Having a data-in defined for current(1) will + keep the workflow from running until the next + hour's directory is created. 
+ --> + <data-in name="ready_indicator" dataset="webrequest"> + <instance>${coord:current(1)}</instance> + </data-in> + </input-events> + + <action> + <workflow> + <app-path>${workflowFile}</app-path> + <configuration> + + <!-- Pass these properties through to the workflow --> + <property><name>nameNode</name><value>${nameNode}</value></property> + <property><name>jobTracker</name><value>${jobTracker}</value></property> + <property><name>queueName</name><value>${queueName}</value></property> + + <property> + <name>hive_site_xml</name> + <value>${hive_site_xml}</value> + </property> + <property> + <name>serde_jar</name> + <value>${serde_jar}</value> + </property> + <property> + <name>database</name> + <value>${database}</value> + </property> + <property> + <name>table</name> + <value>${table}</value> + </property> + <property> + <name>partition_spec</name> + <value>webrequest_source='${webrequest_source}',${coord:formatTime(coord:dateOffset(coord:nominalTime(), 0, 'HOUR'), "'year='yyyy,'month='MM,'day='dd,'hour='HH")}</value> + </property> + <property> + <name>location</name> + <value>${coord:dataIn('input')}</value> + </property> + + </configuration> + </workflow> + </action> +</coordinator-app> diff --git a/oozie/webrequest/partition/drop/coordinator.properties b/oozie/webrequest/partition/drop/coordinator.properties new file mode 100644 index 0000000..b4fc657 --- /dev/null +++ b/oozie/webrequest/partition/drop/coordinator.properties @@ -0,0 +1,62 @@ +# Configures a coordinator to manage automatically dropping old Hive partitions +# for a webrequest table, and deleting their respective data locations in +# HDFS. Any of the following properties are overridable with -D. +# The 'webrequest_source' property is required and must be set on the CLI when +# submitting this coordinator. +# +# Usage: +# oozie job -Dwebrequest_source=mobile -submit -config oozie/webrequest/partition/drop/coordinator.properties 
+# +# NOTE: The $oozieDirectory must be synced to HDFS so that all relevant +# .xml files exist there when this job is submitted. + + +# hadoopMaster will be used for <name-node> and <job-tracker> elements. +hadoopMaster = analytics1010.eqiad.wmnet +nameNode = hdfs://${hadoopMaster}:8020 +jobTracker = ${hadoopMaster}:8032 +queueName = standard + +# Base path in HDFS to oozie files. +# Other files will be used relative to this path. +oozieDirectory = ${nameNode}/wmf/kraken/current/oozie + +# HDFS path to workflow to run. +workflowFile = ${oozieDirectory}/util/hive/partition/drop/workflow.xml + +# HDFS path to webrequest dataset definition +datasetsFile = ${oozieDirectory}/webrequest/datasets.xml + +# Initial import time of the webrequest dataset. +# Note: This is intentionally set to 1440 hours +# after the add partition coordinator's +# initial instance. Each drop workflow app +# will remove the partition 1440 hours before +# the current instance. So if we want to delete +# partitions that old, we need to wait until +# that much time has passed. +startTime = 2014-05-31T00:00Z + +# Time to stop running this coordinator. Year 3000 == never! +stopTime = 3000-01-01T00:00Z + +# HDFS path to hive-site.xml file. This is needed to run hive actions. +hive_site_xml = ${oozieDirectory}/util/hive/hive-site.xml + +# HDFS path to SerDe jar that the webrequest table uses. +serde_jar = ${nameNode}/wmf/kraken/artifacts/hcatalog-core-0.5.0-cdh4.3.1.jar + +# Hive database name. +database = wmf + +# Hive table name. +table = webrequest + +# HDFS path to directory where webrequest data is time bucketed. 
+dataDirectory = ${nameNode}/wmf/data/external/${table}/webrequest_${webrequest_source}/hourly + +# Coordinator to start +oozie.coord.application.path = ${oozieDirectory}/webrequest/partition/drop/coordinator.xml +oozie.libpath = ${nameNode}/user/oozie/share/lib +oozie.use.system.libpath = true +oozie.action.external.stats.write = true \ No newline at end of file diff --git a/oozie/webrequest/partition/drop/coordinator.xml b/oozie/webrequest/partition/drop/coordinator.xml new file mode 100644 index 0000000..194fb1d --- /dev/null +++ b/oozie/webrequest/partition/drop/coordinator.xml @@ -0,0 +1,102 @@ +<coordinator-app xmlns="uri:oozie:coordinator:0.4" + name="hive_drop_partition-${table}_${webrequest_source}-coord" + frequency="${coord:hours(1)}" + start="${startTime}" + end="${stopTime}" + timezone="Universal"> + + <parameters> + <property> + <name>queueName</name> + <value>adhoc</value> + </property> + + <!-- Required properties. --> + <property><name>nameNode</name></property> + <property><name>jobTracker</name></property> + <property><name>workflowFile</name></property> + <property><name>startTime</name></property> + <property><name>stopTime</name></property> + <property><name>dataDirectory</name></property> + + <property><name>hive_site_xml</name></property> + <property><name>serde_jar</name></property> + <property><name>database</name></property> + <property><name>table</name></property> + <property><name>webrequest_source</name></property> + </parameters> + + <controls> + <!-- + Altering hive partitions is cheap. If there is an + occasion where we have to drop a bunch all at once, + do it! + --> + <concurrency>168</concurrency> + </controls> + + <datasets> + <!-- + Include the given datasetsFile file. This should + define the "webrequest" dataset for this coordinator. 
+ --> + <include>${datasetsFile}</include> + </datasets> + + <input-events> + <data-in name="toDelete" dataset="webrequest"> + <!-- + The partition_spec to delete is inferred from the current + instance - 1440 hours. This amounts to 60 days ago. + Please make sure ${startTime} is set to 1440 hours beyond + your dataset's actual initial instance. (It won't hurt + to incorrectly set ${startTime}, you'll just end up with + a lot of failed jobs until there is actually data old + enough to delete.) + --> + <instance>${coord:current(0)}</instance> + </data-in> + </input-events> + + <action> + <workflow> + <app-path>${workflowFile}</app-path> + <configuration> + + <!-- Pass these properties through to the workflow --> + <property><name>nameNode</name><value>${nameNode}</value></property> + <property><name>jobTracker</name><value>${jobTracker}</value></property> + <property><name>queueName</name><value>${queueName}</value></property> + + <property> + <name>hive_site_xml</name> + <value>${hive_site_xml}</value> + </property> + <property> + <name>serde_jar</name> + <value>${serde_jar}</value> + </property> + <property> + <name>database</name> + <value>${database}</value> + </property> + <property> + <name>table</name> + <value>${table}</value> + </property> + <property> + <name>partition_spec</name> + <!-- + -1440 hours is 60 days ago. Delete the partition that is 60 days old. 
+ --> + <value>webrequest_source='${webrequest_source}',${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1440, 'HOUR'), "'year='yyyy,'month='MM,'day='dd,'hour='HH")}</value> + </property> + <property> + <name>location</name> + <value>${coord:dataIn('toDelete')}</value> + </property> + + </configuration> + </workflow> + </action> +</coordinator-app> -- To view, visit https://gerrit.wikimedia.org/r/131208 To unsubscribe, visit https://gerrit.wikimedia.org/r/settings Gerrit-MessageType: merged Gerrit-Change-Id: I7d658b431c103eb8f73afac136369d401a3958ba Gerrit-PatchSet: 5 Gerrit-Project: analytics/kraken Gerrit-Branch: master Gerrit-Owner: Ottomata <o...@wikimedia.org> Gerrit-Reviewer: Ottomata <o...@wikimedia.org> Gerrit-Reviewer: QChris <christ...@quelltextlich.at> Gerrit-Reviewer: jenkins-bot <> _______________________________________________ MediaWiki-commits mailing list MediaWiki-commits@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
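For reference, the `partition_spec` and `location` values that these coordinators build with Oozie EL (`coord:dateOffset` plus `coord:formatTime`) can be sketched in plain Python. The helper names below are illustrative only, not part of this change; an `hour_offset` of 0 mirrors the add coordinator and -1440 mirrors the drop coordinator's 60-day lag:

```python
from datetime import datetime, timedelta

def partition_spec(webrequest_source, nominal_time, hour_offset=0):
    # Mirrors the coordinator's EL value:
    # webrequest_source='X',year=YYYY,month=MM,day=DD,hour=HH
    # (month/day/hour zero-padded, as the MM/dd/HH patterns produce).
    t = nominal_time + timedelta(hours=hour_offset)
    return (
        "webrequest_source='%s',year=%d,month=%02d,day=%02d,hour=%02d"
        % (webrequest_source, t.year, t.month, t.day, t.hour)
    )

def partition_location(data_directory, nominal_time, hour_offset=0):
    # Mirrors the dataset uri-template:
    # ${dataDirectory}/${YEAR}/${MONTH}/${DAY}/${HOUR}
    t = nominal_time + timedelta(hours=hour_offset)
    return "%s/%04d/%02d/%02d/%02d" % (
        data_directory, t.year, t.month, t.day, t.hour)

# The example spec from the workflow.properties comments:
print(partition_spec('mobile', datetime(2014, 4, 2, 1)))
# webrequest_source='mobile',year=2014,month=04,day=02,hour=01
```

Note how the -1440 hour offset lines up with the drop coordinator's `startTime` comment: its first instance (2014-05-31T00:00Z) drops the partition for the add coordinator's first instance (2014-04-01T00:00Z).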