Repository: incubator-griffin-site
Updated Branches:
  refs/heads/asf-site 6eaa9c6f0 -> 5558bdcb3


Updated asf-site site from master (0777296868773f3456019df24829827a90b46fde)


Project: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/repo
Commit: 
http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/commit/5558bdcb
Tree: 
http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/tree/5558bdcb
Diff: 
http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/diff/5558bdcb

Branch: refs/heads/asf-site
Commit: 5558bdcb38389a82d8adc73f8f7d0a55a82e3e48
Parents: 6eaa9c6
Author: William Guo <gu...@apache.org>
Authored: Tue Sep 18 15:56:32 2018 +0800
Committer: William Guo <gu...@apache.org>
Committed: Tue Sep 18 15:56:32 2018 +0800

----------------------------------------------------------------------
 docs/profiling.html | 162 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 161 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/blob/5558bdcb/docs/profiling.html
----------------------------------------------------------------------
diff --git a/docs/profiling.html b/docs/profiling.html
index 273452f..1826c21 100644
--- a/docs/profiling.html
+++ b/docs/profiling.html
@@ -130,7 +130,167 @@ under the License.
       </div>
      <div class="col-xs-6 col-sm-9 page-main-content" style="margin-left: -15px" id="loadcontent">
         <h1 class="page-header" style="margin-top: 0px">Profiling Use Case</h1>
-        
+        <h2 id="user-story">User Story</h2>
+<p>Say we have a data set (demo_src), partitioned by hour, and we want to know what the data looks like for each hour.</p>
+
+<p>For simplicity, suppose the data set has the following schema:</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id                      bigint
+age                     int                                         
+desc                    string                                      
+dt                      string                                      
+hour                    string 
+</code></pre></div></div>
+<p>Both dt and hour are partition columns:</p>
+
+<p>each day has one daily partition dt (like 20180912),</p>
+
+<p>and each day has 24 hourly partitions hour (like 00, 01, 02, …, 23).</p>
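+
+<p>For orientation, with Hive's usual partition layout the hourly data ends up in directories like the following (an illustrative sketch; the actual location is set in the table definition below):</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hdfs:///griffin/data/batch/demo_src/dt=20180912/hour=00/
+hdfs:///griffin/data/batch/demo_src/dt=20180912/hour=01/
+...
+hdfs:///griffin/data/batch/demo_src/dt=20180912/hour=23/
+</code></pre></div></div>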
+
+<h2 id="environment-preparation">Environment Preparation</h2>
+<p>You need to prepare the environment for the Apache Griffin measure module, including the following software:</p>
+<ul>
+  <li>JDK (1.8+)</li>
+  <li>Hadoop (2.6.0+)</li>
+  <li>Spark (2.2.1+)</li>
+  <li>Hive (2.2.0)</li>
+</ul>
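+
+<p>To double-check the environment, you can print each tool's version from the command line (a quick sketch, assuming the commands are on your PATH):</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># verify the installed versions match the requirements above
+java -version
+hadoop version
+spark-submit --version
+hive --version
+</code></pre></div></div>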
+
+<h2 id="build-griffin-measure-module">Build Griffin Measure Module</h2>
+<ol>
+  <li>Download Griffin source package <a 
href="https://www.apache.org/dist/incubator/griffin/0.3.0-incubating";>here</a>.</li>
+  <li>Unzip the source package.
+    <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip griffin-0.3.0-incubating-source-release.zip
+cd griffin-0.3.0-incubating-source-release
+</code></pre></div>    </div>
+  </li>
+  <li>Build Griffin jars.
+    <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn clean install
+</code></pre></div>    </div>
+
+    <p>Move the built Griffin measure jar to your work path.</p>
+
+    <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mv measure/target/measure-0.3.0-incubating.jar &lt;work path&gt;/griffin-measure.jar
+</code></pre></div>    </div>
+  </li>
+</ol>
+
+<h2 id="data-preparation">Data Preparation</h2>
+
+<p>For this quick start, we will generate a Hive table demo_src.</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--create hive tables here. hql script
+--Note: replace hdfs location with your own path
+CREATE EXTERNAL TABLE `demo_src`(
+  `id` bigint,
+  `age` int,
+  `desc` string) 
+PARTITIONED BY (
+  `dt` string,
+  `hour` string)
+ROW FORMAT DELIMITED
+  FIELDS TERMINATED BY '|'
+LOCATION
+  'hdfs:///griffin/data/batch/demo_src';
+</code></pre></div></div>
+<p>The data could be generated like this:</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1|18|student
+2|23|engineer
+3|42|cook
+...
+</code></pre></div></div>
+<p>You can download the <a href="/data/batch">demo data</a> and execute <code class="highlighter-rouge">./gen_demo_data.sh</code> to get the data source file.
+Then we will load the data into the Hive table for each hour.</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LOAD DATA LOCAL INPATH 'demo_src' INTO TABLE demo_src PARTITION (dt='20180912',hour='09');
+</code></pre></div></div>
+<p>Or you can just execute <code class="highlighter-rouge">./gen-hive-data.sh</code> in the downloaded directory above to generate and load data into the table hourly.</p>
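+
+<p>Either way, you can run a quick sanity check in Hive to confirm the partition was loaded (a sketch using the partition values from the example above):</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- list loaded partitions and peek at a few rows
+SHOW PARTITIONS demo_src;
+SELECT * FROM demo_src WHERE dt='20180912' AND hour='09' LIMIT 10;
+</code></pre></div></div>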
+
+<h2 id="define-data-quality-measure">Define data quality measure</h2>
+
+<h4 id="griffin-env-configuration">Griffin env configuration</h4>
+<p>The environment config file: env.json</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
+  "spark": {
+    "log.level": "WARN"
+  },
+  "sinks": [
+    {
+      "type": "console"
+    },
+    {
+      "type": "hdfs",
+      "config": {
+        "path": "hdfs:///griffin/persist"
+      }
+    },
+    {
+      "type": "elasticsearch",
+      "config": {
+        "method": "post",
+        "api": "http://es:9200/griffin/accuracy";
+      }
+    }
+  ]
+}
+</code></pre></div></div>
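+
+<p>If you keep the elasticsearch sink, it can save debugging time to confirm the endpoint host from env.json is reachable first (the host es:9200 is taken from the config above):</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># quick connectivity check against the elasticsearch host
+curl -XGET 'http://es:9200'
+</code></pre></div></div>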
+
+<h4 id="define-griffin-data-quality">Define griffin data quality</h4>
+<p>The DQ config file: dq.json</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
+  "name": "batch_prof",
+  "process.type": "batch",
+  "data.sources": [
+    {
+      "name": "src",
+      "baseline": true,
+      "connectors": [
+        {
+          "type": "hive",
+          "version": "1.2",
+          "config": {
+            "database": "default",
+            "table.name": "demo_tgt"
+          }
+        }
+      ]
+    }
+  ],
+  "evaluate.rule": {
+    "rules": [
+      {
+        "dsl.type": "griffin-dsl",
+        "dq.type": "profiling",
+        "out.dataframe.name": "prof",
+        "rule": "src.id.count() AS id_count, src.age.max() AS age_max, 
src.desc.length().max() AS desc_length_max",
+        "out": [
+          {
+            "type": "metric",
+            "name": "prof"
+          }
+        ]
+      }
+    ]
+  },
+  "sinks": ["CONSOLE", "HDFS"]
+}
+</code></pre></div></div>
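+
+<p>For intuition, the griffin-dsl profiling rule above corresponds roughly to an aggregation like the following SQL over the src data source (an illustrative sketch, not necessarily the exact query Griffin generates):</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT COUNT(src.id)           AS id_count,
+       MAX(src.age)            AS age_max,
+       MAX(LENGTH(src.`desc`)) AS desc_length_max
+FROM   src;
+</code></pre></div></div>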
+
+<h2 id="measure-data-quality">Measure data quality</h2>
+<p>Submit the measure job to Spark, with config file paths as parameters.</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
+--driver-memory 1g --executor-memory 1g --num-executors 2 \
+&lt;path&gt;/griffin-measure.jar \
+&lt;path&gt;/env.json &lt;path&gt;/dq.json
+</code></pre></div></div>
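+
+<p>While the job runs, you can watch its progress with the standard YARN CLI (not Griffin-specific):</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># list running applications on the YARN cluster
+yarn application -list -appStates RUNNING
+</code></pre></div></div>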
+
+<h2 id="report-data-quality-metrics">Report data quality metrics</h2>
+<p>You can watch the calculation log in the console; after the job finishes, the result metrics are printed. The metrics will also be saved in HDFS at <code class="highlighter-rouge">hdfs:///griffin/persist/&lt;job name&gt;/&lt;timestamp&gt;/_METRICS</code>.</p>
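+
+<p>To inspect the persisted metrics, the standard HDFS CLI works; with the job name batch_prof from dq.json, the path looks like this (timestamp left as a placeholder):</p>
+<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># browse the persisted results and print the metrics file
+hdfs dfs -ls hdfs:///griffin/persist/batch_prof/
+hdfs dfs -cat hdfs:///griffin/persist/batch_prof/&lt;timestamp&gt;/_METRICS
+</code></pre></div></div>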
+
+<h2 id="refine-data-quality-report">Refine Data Quality report</h2>
+<p>Depending on your business needs, you might need to refine your data quality measure further until you are satisfied.</p>
+
+<h2 id="more-details">More Details</h2>
+<p>For more details about Griffin measures, you can visit our documentation on <a href="https://github.com/apache/incubator-griffin/tree/master/griffin-doc">GitHub</a>.</p>
 
       </div><!--end of loadcontent-->
     </div>
