[MediaWiki-commits] [Gerrit] Oozie-fy Country Breakdown Pageview Report - change (analytics/refinery)

Milimetric (Code Review) Wed, 16 Dec 2015 10:21:06 -0800

Milimetric has submitted this change and it was merged.

Change subject: Oozie-fy Country Breakdown Pageview Report
......................................................................



Oozie-fy Country Breakdown Pageview Report

The test for this coordinator has job id:
0129354-150922143436497-oozie-oozi-C

The sample output is at:
/user/milimetric/archive/projectview/geo/
hourly/2015/2015-10/projectviews-geo-2015101-00000.gz

Bug: T118323
Change-Id: I57ee315b2738717255bd42fe0f541f04351f0af3
---
A oozie/projectview/geo/README.md
A oozie/projectview/geo/archive_projectview_geo_hourly.hql
A oozie/projectview/geo/coordinator.properties
A oozie/projectview/geo/coordinator.xml
A oozie/projectview/geo/workflow.xml
5 files changed, 417 insertions(+), 0 deletions(-)

Approvals:
  Milimetric: Verified; Looks good to me, approved



diff --git a/oozie/projectview/geo/README.md b/oozie/projectview/geo/README.md
new file mode 100644
index 0000000..2ce05c8
--- /dev/null
+++ b/oozie/projectview/geo/README.md
@@ -0,0 +1,17 @@
+# Aggregate projectviews geographically
+
+This job is responsible for creating archives that aggregate pageviews by
+the geographic origin of their requests, at the project level
+
+Output is archived into /wmf/data/archive/projectview/geo
+
+# Outline
+
+* ```coordinator.properties``` define parameters for the archive job
+* ```coordinator.xml``` injects the aggregation workflow for each dataset
+* ```workflow.xml```
+  * Runs a hive query to aggregate projectview geographically
+
+Note that this job uses the projectview dataset.  If a projectview job
+does not have the _SUCCESS done-flag in the directory, the data for that
+hour will not be aggregated until it does.
diff --git a/oozie/projectview/geo/archive_projectview_geo_hourly.hql 
b/oozie/projectview/geo/archive_projectview_geo_hourly.hql
new file mode 100644
index 0000000..5c41ddd
--- /dev/null
+++ b/oozie/projectview/geo/archive_projectview_geo_hourly.hql
@@ -0,0 +1,63 @@
+-- Parameters:
+--     source_table          -- Fully qualified table name with a
+--                              country_code column
+--     destination_directory -- Directory to write CSV output to
+--     year                  -- year of partition to compute statistics for.
+--     month                 -- month of partition to compute statistics for.
+--     day                   -- day of partition to compute statistics for.
+--     hour                  -- hour of partition to compute statistics for.
+--
+-- Usage:
+--     hive -f archive_projectview_geo_hourly.hql               \
+--         -d source_table=wmf.pageview_hourly                  \
+--         -d destination_directory=/tmp/example                \
+--         -d artifacts_directory=/path/to/refinery/artifacts   \
+--         -d year=2015                                         \
+--         -d month=6                                           \
+--         -d day=1                                             \
+--         -d hour=1
+--
+
+ADD JAR 
${artifacts_directory}/org/wikimedia/analytics/refinery/refinery-hive-${refinery_jar_version}.jar;
+CREATE TEMPORARY FUNCTION country_name as 
'org.wikimedia.analytics.refinery.hive.CountryNameUDF';
+
+SET hive.exec.compress.output=true;
+SET 
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
+
+ INSERT OVERWRITE DIRECTORY '${destination_directory}'
+ SELECT CONCAT_WS('\t',
+            continent,
+            country_name,
+            country_code,
+            project,
+            access_method,
+            agent_type,
+            CAST(view_count AS STRING)
+        ) line
+   FROM (SELECT continent,
+                country_code,
+                country_name(country_code) AS country_name,
+                project,
+                access_method,
+                agent_type,
+                sum(view_count) AS view_count
+
+           FROM ${source_table}
+
+          WHERE year=${year} AND month=${month} AND day=${day} AND hour=${hour}
+          GROUP BY
+                continent,
+                country_code,
+                project,
+                access_method,
+                agent_type
+          ORDER BY
+                continent,
+                country_code,
+                project,
+                access_method,
+                agent_type
+
+          LIMIT 100000000
+        ) rows
+;
diff --git a/oozie/projectview/geo/coordinator.properties 
b/oozie/projectview/geo/coordinator.properties
new file mode 100644
index 0000000..f8d35dd
--- /dev/null
+++ b/oozie/projectview/geo/coordinator.properties
@@ -0,0 +1,62 @@
+# Configures a coordinator to manage automatically archiving geographically 
aggregated
+# data from the projectview table. Any of the following properties are 
overidable with -D.
+# Usage:
+#   oozie job -Duser=$USER -Dstart_time=2015-01-05T00:00Z -submit -config 
oozie/projectview/geo/coordinator.properties
+#
+# NOTE:  The $oozie_directory must be synced to HDFS so that all relevant
+#        .xml files exist there when this job is submitted.
+
+
+name_node                         = hdfs://analytics-hadoop
+job_tracker                       = resourcemanager.analytics.eqiad.wmnet:8032
+queue_name                        = default
+
+user                              = hdfs
+
+# Base path in HDFS to refinery.
+# When submitting this job for production, you should
+# override this to point directly at a deployed
+# directory name, and not the 'symbolic' 'current' directory.
+# E.g.  /wmf/refinery/2015-01-05T17.59.18Z--7bb7f07
+refinery_directory                = ${name_node}/wmf/refinery/current
+# Base path in HDFS to oozie files.
+# Other files will be used relative to this path.
+oozie_directory                   = ${refinery_directory}/oozie
+# HDFS path to coordinator to run for each webrequest_source.
+coordinator_file                  = 
${oozie_directory}/projectview/geo/coordinator.xml
+
+# HDFS path to workflow to run.
+workflow_file                     = 
${oozie_directory}/projectview/geo/workflow.xml
+
+# First data available in the projectview dataset
+start_time                        = 2015-05-01T00:00Z
+# Time to stop running this coordinator.  Year 3000 == never!
+stop_time                         = 3000-01-01T00:00Z
+
+# HDFS path to projectview dataset definitions
+projectview_datasets_file         = ${oozie_directory}/projectview/datasets.xml
+projectview_data_directory        = ${name_node}/wmf/data/wmf/projectview
+
+# HDFS path to hive-site.xml file.  This is needed to run hive actions.
+hive_site_xml                     = ${oozie_directory}/util/hive/hive-site.xml
+# Fully qualified Hive table name.
+source_table                      = wmf.projectview_hourly
+
+# Temporary directory for archiving
+temporary_directory               = ${name_node}/tmp
+# Archive base directory
+archive_directory                 = ${name_node}/wmf/data/archive
+# Archive directory for projectview_hourly_legacy_format
+geo_hourly_directory              = ${archive_directory}/projectview/geo/hourly
+
+# HDFS path to workflow to mark a directory as done
+mark_directory_done_workflow_file = 
${oozie_directory}/util/mark_directory_done/workflow.xml
+# HDFS path to workflow to archive output.
+archive_job_output_workflow_file  = 
${oozie_directory}/util/archive_job_output/workflow.xml
+# HDFS path to workflow to email errors
+send_error_email_workflow_file    = 
${oozie_directory}/util/send_error_email/workflow.xml
+
+# Coordintator to start.
+oozie.coord.application.path      = ${coordinator_file}
+oozie.use.system.libpath          = true
+oozie.action.external.stats.write = true
diff --git a/oozie/projectview/geo/coordinator.xml 
b/oozie/projectview/geo/coordinator.xml
new file mode 100644
index 0000000..fc6e607
--- /dev/null
+++ b/oozie/projectview/geo/coordinator.xml
@@ -0,0 +1,102 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<coordinator-app xmlns="uri:oozie:coordinator:0.4"
+    name="projectview_geo-coord"
+    frequency="${coord:hours(1)}"
+    start="${start_time}"
+    end="${stop_time}"
+    timezone="Universal">
+
+    <parameters>
+
+        <!-- Required properties. -->
+        <property><name>name_node</name></property>
+        <property><name>job_tracker</name></property>
+        <property><name>queue_name</name></property>
+
+        <property><name>workflow_file</name></property>
+
+        <property><name>start_time</name></property>
+        <property><name>stop_time</name></property>
+
+        <property><name>projectview_datasets_file</name></property>
+        <property><name>projectview_data_directory</name></property>
+
+        <property><name>hive_site_xml</name></property>
+        <property><name>source_table</name></property>
+
+        <property><name>temporary_directory</name></property>
+        <property><name>geo_hourly_directory</name></property>
+
+        <property><name>mark_directory_done_workflow_file</name></property>
+        <property><name>archive_job_output_workflow_file</name></property>
+    </parameters>
+
+    <controls>
+        <!--
+        By having materialized jobs not timeout, we ease backfilling incidents
+        after recoverable hiccups on the dataset producers.
+        -->
+        <timeout>-1</timeout>
+
+        <!--
+        projectview aggregation is really small, but we limit
+        concurrency for resource sharing.
+
+        Also note, that back-filling is not limited by the
+        coordinator's frequency, so back-filling works nicely
+        even-though the concurrency is low.
+        -->
+        <concurrency>4</concurrency>
+
+
+        <!--
+        Since we expect only one incarnation per hourly dataset, the
+        default throttle of 12 is way to high, and there is not need
+        to keep that many materialized jobs around.
+
+        By resorting to 2, we keep the hdfs checks on the datasets
+        low, while still being able to easily feed the concurrency.
+        -->
+        <throttle>4</throttle>
+    </controls>
+
+    <datasets>
+        <!--
+        Include projectview dataset file.
+        $projectview_datasets_file will be used as the input events
+        -->
+        <include>${projectview_datasets_file}</include>
+    </datasets>
+
+    <input-events>
+        <data-in name="projectview_hourly_input" dataset="projectview_hourly">
+            <instance>${coord:current(0)}</instance>
+        </data-in>
+    </input-events>
+
+    <action>
+        <workflow>
+            <app-path>${workflow_file}</app-path>
+            <configuration>
+
+                <property>
+                    <name>year</name>
+                    <value>${coord:formatTime(coord:nominalTime(), 
"yyyy")}</value>
+                </property>
+                <property>
+                    <name>month</name>
+                    <value>${coord:formatTime(coord:nominalTime(), 
"MM")}</value>
+                </property>
+                <property>
+                    <name>day</name>
+                    <value>${coord:formatTime(coord:nominalTime(), 
"dd")}</value>
+                </property>
+                <property>
+                    <name>hour</name>
+                    <value>${coord:formatTime(coord:nominalTime(), 
"HH")}</value>
+                </property>
+
+            </configuration>
+        </workflow>
+    </action>
+</coordinator-app>
diff --git a/oozie/projectview/geo/workflow.xml 
b/oozie/projectview/geo/workflow.xml
new file mode 100644
index 0000000..1d519f5
--- /dev/null
+++ b/oozie/projectview/geo/workflow.xml
@@ -0,0 +1,173 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<workflow-app xmlns="uri:oozie:workflow:0.4"
+    
name="projectview-geo-hourly-${source_table}->archive-${year},${month},${day},${hour}-wf">
+
+    <parameters>
+
+        <!-- Default values for inner oozie settings -->
+        <property>
+            <name>oozie_launcher_queue_name</name>
+            <value>${queue_name}</value>
+        </property>
+        <property>
+            <name>oozie_launcher_memory</name>
+            <value>256</value>
+        </property>
+
+        <!-- Required properties -->
+        <property><name>queue_name</name></property>
+        <property><name>name_node</name></property>
+        <property><name>job_tracker</name></property>
+
+
+        <!-- Configuration properties for hourly geographic breakdown output 
-->
+        <property>
+            <name>hive_projectview_geo_hourly_script</name>
+            <!-- This is relative to the containing directory of this file. -->
+            <value>archive_projectview_geo_hourly.hql</value>
+            <description>Hive script to run.</description>
+        </property>
+
+        <property>
+            <name>hive_site_xml</name>
+            <description>hive-site.xml file path in HDFS</description>
+        </property>
+        <property>
+            <name>source_table</name>
+            <description>The source table to aggregate by 
geography</description>
+        </property>
+        <property>
+            <name>geo_hourly_directory</name>
+            <description>The destination directory for aggregate geo 
data</description>
+        </property>
+
+        <property>
+            <name>year</name>
+            <description>The partition's year</description>
+        </property>
+        <property>
+            <name>month</name>
+            <description>The partition's month</description>
+        </property>
+        <property>
+            <name>day</name>
+            <description>The partition's day</description>
+        </property>
+        <property>
+            <name>hour</name>
+            <description>The partition's hour</description>
+        </property>
+        <property>
+            <name>send_error_email_workflow_file</name>
+            <description>Workflow for sending an email</description>
+        </property>
+
+        <property>
+            <name>mark_directory_done_workflow_file</name>
+            <description>Workflow for marking a directory done</description>
+        </property>
+        <property>
+            <name>temporary_directory</name>
+            <description>A directory in HDFS for temporary files</description>
+        </property>
+
+    </parameters>
+
+    <start to="aggregate"/>
+
+    <action name="aggregate">
+        <hive xmlns="uri:oozie:hive-action:0.2">
+            <job-tracker>${job_tracker}</job-tracker>
+            <name-node>${name_node}</name-node>
+            <job-xml>${hive_site_xml}</job-xml>
+            <configuration>
+                <property>
+                    <name>mapreduce.job.queuename</name>
+                    <value>${queue_name}</value>
+                </property>
+                <!--make sure oozie:launcher runs in a low priority queue -->
+                <property>
+                    <name>oozie.launcher.mapred.job.queue.name</name>
+                    <value>${oozie_launcher_queue_name}</value>
+                </property>
+                <property>
+                    <name>oozie.launcher.mapreduce.map.memory.mb</name>
+                    <value>${oozie_launcher_memory}</value>
+                </property>
+                <property>
+                    <name>hive.exec.scratchdir</name>
+                    <value>/tmp/hive-${user}</value>
+                </property>
+            </configuration>
+
+            <script>${hive_projectview_geo_hourly_script}</script>
+
+            <param>source_table=${source_table}</param>
+            <param>year=${year}</param>
+            <param>month=${month}</param>
+            <param>day=${day}</param>
+            <param>hour=${hour}</param>
+            
<param>destination_directory=${temporary_directory}/${wf:id()}</param>
+        </hive>
+
+        <ok to="mark_aggregated_geo_hourly_done"/>
+        <error to="send_error_email"/>
+    </action>
+
+    <action name="mark_aggregated_geo_hourly_done">
+        <sub-workflow>
+            <app-path>${mark_directory_done_workflow_file}</app-path>
+            <configuration>
+                <property>
+                    <name>directory</name>
+                    <value>${temporary_directory}/${wf:id()}</value>
+                </property>
+            </configuration>
+        </sub-workflow>
+        <ok to="move_data_to_destination"/>
+        <error to="send_error_email"/>
+    </action>
+
+    <action name="move_data_to_destination">
+        <sub-workflow>
+            <app-path>${archive_job_output_workflow_file}</app-path>
+            <propagate-configuration/>
+            <configuration>
+                <property>
+                    <name>source_directory</name>
+                    <value>${temporary_directory}/${wf:id()}</value>
+                </property>
+                <property>
+                    <name>expected_filename_ending</name>
+                    <value>EMPTY</value>
+                </property>
+                <property>
+                    <name>archive_file</name>
+                    
<value>${geo_hourly_directory}/${year}/${year}-${month}/projectviews-geo-${year}${month}${day}-${hour}0000.gz</value>
+                </property>
+            </configuration>
+        </sub-workflow>
+        <ok to="end"/>
+        <error to="send_error_email"/>
+    </action>
+
+    <action name="send_error_email">
+        <sub-workflow>
+            <app-path>${send_error_email_workflow_file}</app-path>
+            <propagate-configuration/>
+            <configuration>
+                <property>
+                    <name>parent_name</name>
+                    <value>${wf:name()}</value>
+                </property>
+            </configuration>
+        </sub-workflow>
+        <ok to="kill"/>
+        <error to="kill"/>
+    </action>
+
+    <kill name="kill">
+        <message>Action failed, error 
message[${wf:errorMessage(wf:lastErrorNode())}]</message>
+    </kill>
+    <end name="end"/>
+</workflow-app>

-- 
To view, visit https://gerrit.wikimedia.org/r/256355
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I57ee315b2738717255bd42fe0f541f04351f0af3
Gerrit-PatchSet: 6
Gerrit-Project: analytics/refinery
Gerrit-Branch: master
Gerrit-Owner: Milimetric <[email protected]>
Gerrit-Reviewer: Joal <[email protected]>
Gerrit-Reviewer: Madhuvishy <[email protected]>
Gerrit-Reviewer: Milimetric <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits

[MediaWiki-commits] [Gerrit] Oozie-fy Country Breakdown Pageview Report - change (analytics/refinery)

Reply via email to