Added: websites/staging/crunch/trunk/content/crunch/css/bootstrap-2.1.0.min.css
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/css/bootstrap-2.1.0.min.css (added)
+++ websites/staging/crunch/trunk/content/crunch/css/bootstrap-2.1.0.min.css Sun Sep 16 18:50:04 2012
@@ -0,0 +1,9 @@
+/*!
+ * Bootstrap v2.1.0
+ *
+ * Copyright 2012 Twitter, Inc
+ * Licensed under the Apache License v2.0
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Designed and built with all the love in the world @twitter by @mdo and @fat.

[... 2 lines stripped ...]
Added: websites/staging/crunch/trunk/content/crunch/css/crunch.css
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/css/crunch.css (added)
+++ websites/staging/crunch/trunk/content/crunch/css/crunch.css Sun Sep 16 18:50:04 2012
@@ -0,0 +1,4 @@
+.nav-list {
+  padding-left: 5px;
+  padding-right: 5px;
+}

Added: websites/staging/crunch/trunk/content/crunch/future-work.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/future-work.html (added)
+++ websites/staging/crunch/trunk/content/crunch/future-work.html Sun Sep 16 18:50:04 2012
@@ -0,0 +1,141 @@
+<!DOCTYPE html>
+
+
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
+  <head>
+    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <meta http-equiv="Content-Language" content="en" />
+
+    <title>Apache Crunch - Current Limitations and Future Work</title>
+
+    <link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
+    <link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
+    <script type="text/javascript" src="/crunch/js/bootstrap-2.1.0.min.js"></script>
+  </head>
+  <body>
+
+    <div class="navbar navbar-inverse navbar-static-top">
+      
+        <div class="container-fluid">
+
+          <a class="nav pull-right brand" href="http://incubator.apache.org">
+            <img src="http://incubator.apache.org/images/egg-logo.png" alt="Apache Incubator Logo" />
+          </a>
+
+        </div>
+      
+    </div>
+
+    <ul class="breadcrumb">
+      <li>
+        <a href="/">Incubator</a>
+       <span class="divider">&raquo;</span>
+      </li>
+      <li>
+        <a href="/crunch/">Crunch</a>
+      </li>
+      
+    </ul>
+
+    <div class="container-fluid">
+      <div class="row-fluid">
+
+        <!-- SIDEBAR AREA -->
+        <div class="span2">
+          <div class="sidebar-nav">
+            <ul class="nav nav-list">
+              
+                
+                  <li class="nav-header">Apache Crunch</li>
+                
+              
+                
+                  
+                    <li><a href="/crunch/index.html">Overview</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/apidocs/">API</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="https://cwiki.apache.org/confluence/display/CRUNCH/">Wiki</a></li>
+                  
+                
+              
+                
+                  <li class="nav-header">Project</li>
+                
+              
+                
+                  
+                    <li><a href="/crunch/source-repository.html">Source Code</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/mailing-lists.html">Mailing Lists</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://issues.apache.org/jira/browse/CRUNCH">Issue Tracking</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
+                  
+                
+              
+            </ul>
+          </div> <!-- /well -->
+        </div> <!-- /span -->
+
+        <!-- CONTENT AREA -->
+        <div class="span10">
+          <h1 class="title">
+            Current Limitations and Future Work
+            
+          </h1>
+
+          <p>This section lists known limitations of Crunch and plans for future work. The list is almost certainly incomplete.</p>
+<ul>
+<li>We would like to have easy support for reading and writing data from/to HCatalog.</li>
+<li>The decision of how to split up processing tasks between dependent MapReduce jobs is very naive right now: we simply
+delegate all of the work to the reduce stage of the predecessor job. We should take advantage of information about the
+expected size of different PCollections to optimize this processing.</li>
+<li>The Crunch optimizer does not yet merge different groupByKey operations that run over the same input data into a single
+MapReduce job. Implementing this optimization should provide a major performance benefit for a number of problems.</li>
+</ul>
+        </div> <!-- /span -->
+
+      </div> <!-- /row-fluid -->
+
+    </div>
+
+    <hr/>
+
+    <footer>
+      <div class="container-fluid">
+        <div class="row span12">Copyright &copy; 2012
+          <a href="http://www.apache.org/">The Apache Software Foundation</a>,
+          licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+         <p><small>Apache Incubator, Apache Hadoop, Hadoop, Apache, and the
+         Apache feather logo are trademarks of The Apache Software Foundation.
+         Other names appearing on the site may be trademarks of their
+         respective owners.</small></p>
+        </div>
+      </div>
+    </footer>
+
+  </body>
+</html>

Modified: websites/staging/crunch/trunk/content/crunch/index.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/index.html (original)
+++ websites/staging/crunch/trunk/content/crunch/index.html Sun Sep 16 18:50:04 2012
@@ -1,56 +1,161 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
-<html lang="en">
-  <head>
-    <title>Home Page</title>
+<!DOCTYPE html>
 
-    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
-    <meta property="og:image" content="http://www.apache.org/images/asf_logo.gif" />
 
-    <link rel="stylesheet" type="text/css" media="screen" href="http://www.apache.org/css/style.css">
-    <link rel="stylesheet" type="text/css" media="screen" href="http://www.apache.org/css/code.css">
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
+  <head>
+    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <meta http-equiv="Content-Language" content="en" />
+
+    <title>Apache Crunch - Apache Crunch</title>
+
+    <link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
+    <link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
+    <script type="text/javascript" src="/crunch/js/bootstrap-2.1.0.min.js"></script>
+  </head>
+  <body>
 
-    
+    <div class="navbar navbar-inverse navbar-static-top">
+      
+        <div class="container-fluid">
+
+          <a class="nav pull-right brand" href="http://incubator.apache.org">
+            <img src="http://incubator.apache.org/images/egg-logo.png" alt="Apache Incubator Logo" />
+          </a>
 
-    
-    <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the &quot;License&quot;); you may not use this file except in compliance with the License.  You may obtain a copy of the License at . http://www.apache.org/licenses/LICENSE-2.0 . Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an &quot;AS IS&quot; BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License. -->
-  </head>
+        </div>
+      
+    </div>
 
-  <body>
-    <div id="page" class="container_16">
-      <div id="header" class="grid_8">
-        <img src="http://www.apache.org/images/feather-small.gif" alt="The Apache Software Foundation">
-        <h1>The Apache Software Foundation</h1>
-        <h2>Home Page</h2>
-      </div>
-      <div id="nav" class="grid_8">
-        <ul>
-          <!-- <li><a href="/" title="Welcome!">Home</a></li> -->
-          <li><a href="http://www.apache.org/foundation/" title="The Foundation">Foundation</a></li>
-          <li><a href="http://projects.apache.org" title="The Projects">Projects</a></li>
-          <li><a href="http://people.apache.org" title="The People">People</a></li>
-          <li><a href="http://www.apache.org/foundation/getinvolved.html" title="Get Involved">Get Involved</a></li>
-          <li><a href="http://www.apache.org/dyn/closer.cgi" title="Download">Download</a></li>
-          <li><a href="http://www.apache.org/foundation/sponsorship.html" title="Support Apache">Support Apache</a></li>
-        </ul>
-        <p><a href="/">Home</a>&nbsp;&raquo&nbsp;<a href="/crunch/">Crunch</a></p>
-        <form name="search" id="search" action="http://www.google.com/search" method="get">
-          <input value="*.apache.org" name="sitesearch" type="hidden"/>
-          <input type="text" name="q" id="query">
-          <input type="submit" id="submit" value="Search">
-        </form>
-      </div>
-      <div class="clear"></div>
-      <div id="content" class="grid_16"><div class="section-content"><h1 id="welcome">Welcome</h1>
-<p>Welcome to the Apache CMS.  Please see the following resources for further help:</p>
+    <ul class="breadcrumb">
+      <li>
+        <a href="/">Incubator</a>
+       <span class="divider">&raquo;</span>
+      </li>
+      <li>
+        <a href="/crunch/">Crunch</a>
+      </li>
+      
+    </ul>
+
+    <div class="container-fluid">
+      <div class="row-fluid">
+
+        <!-- SIDEBAR AREA -->
+        <div class="span2">
+          <div class="sidebar-nav">
+            <ul class="nav nav-list">
+              
+                
+                  <li class="nav-header">Apache Crunch</li>
+                
+              
+                
+                  
+                    <li><b>Overview</b></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/apidocs/">API</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="https://cwiki.apache.org/confluence/display/CRUNCH/">Wiki</a></li>
+                  
+                
+              
+                
+                  <li class="nav-header">Project</li>
+                
+              
+                
+                  
+                    <li><a href="/crunch/source-repository.html">Source Code</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/mailing-lists.html">Mailing Lists</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://issues.apache.org/jira/browse/CRUNCH">Issue Tracking</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
+                  
+                
+              
+            </ul>
+          </div> <!-- /well -->
+        </div> <!-- /span -->
+
+        <!-- CONTENT AREA -->
+        <div class="span10">
+          <h1 class="title">
+            Apache Crunch
+            
+              <small>Simple and Efficient MapReduce Pipelines</small>
+            
+          </h1>
+
+          <hr />
+<blockquote>
+<p><em>Apache Crunch (incubating)</em> is a Java library for writing, testing, and
+running MapReduce pipelines, based on Google's FlumeJava. Its goal is to make
+pipelines that are composed of many user-defined functions simple to write,
+easy to test, and efficient to run.</p>
+</blockquote>
+<hr />
+<p>Running on top of <a href="http://hadoop.apache.org/mapreduce/">Hadoop MapReduce</a>, Apache
+Crunch provides a simple Java API for tasks like joining and data aggregation
+that are tedious to implement on plain MapReduce. For Scala users, there is also
+Scrunch, an idiomatic Scala API to Crunch.</p>
+<h2 id="documentation">Documentation</h2>
 <ul>
-<li><a href="http://www.apache.org/dev/cmsref.html">http://www.apache.org/dev/cmsref.html</a></li>
-<li><a href="http://wiki.apache.org/general/ApacheCms2010">http://wiki.apache.org/general/ApacheCms2010</a></li>
-</ul></div></div>
-      <div class="clear"></div>
-    </div>
+<li><a href="intro.html">Introduction to Apache Crunch</a></li>
+<li><a href="scrunch.html">Introduction to Scrunch</a></li>
+<li><a href="future-work.html">Current Limitations and Future Work</a></li>
+</ul>
+<h2 id="disclaimer">Disclaimer</h2>
+<p>Apache Crunch is an effort undergoing incubation at <a href="http://apache.org/">The Apache Software Foundation
+(ASF)</a> sponsored by the <a href="http://incubator.apache.org/">Apache Incubator PMC</a>.
+Incubation is required of all newly accepted projects until a further review
+indicates that the infrastructure, communications, and decision making process
+have stabilized in a manner consistent with other successful ASF projects.
+While incubation status is not necessarily a reflection of the completeness or
+stability of the code, it does indicate that the project has yet to be fully
+endorsed by the ASF.</p>
+        </div> <!-- /span -->
+
+      </div> <!-- /row-fluid -->
 
-    <div id="copyright" class="container_16">
-      <p>Copyright &#169; 2011 The Apache Software Foundation, Licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.<br/>Apache and the Apache feather logo are trademarks of The Apache Software Foundation.</p>
     </div>
+
+    <hr/>
+
+    <footer>
+      <div class="container-fluid">
+        <div class="row span12">Copyright &copy; 2012
+          <a href="http://www.apache.org/">The Apache Software Foundation</a>,
+          licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+         <p><small>Apache Incubator, Apache Hadoop, Hadoop, Apache, and the
+         Apache feather logo are trademarks of The Apache Software Foundation.
+         Other names appearing on the site may be trademarks of their
+         respective owners.</small></p>
+        </div>
+      </div>
+    </footer>
+
   </body>
 </html>

Added: websites/staging/crunch/trunk/content/crunch/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/intro.html (added)
+++ websites/staging/crunch/trunk/content/crunch/intro.html Sun Sep 16 18:50:04 2012
@@ -0,0 +1,298 @@
+<!DOCTYPE html>
+
+
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
+  <head>
+    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <meta http-equiv="Content-Language" content="en" />
+
+    <title>Apache Crunch - Introduction to Apache Crunch</title>
+
+    <link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
+    <link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
+    <script type="text/javascript" src="/crunch/js/bootstrap-2.1.0.min.js"></script>
+  </head>
+  <body>
+
+    <div class="navbar navbar-inverse navbar-static-top">
+      
+        <div class="container-fluid">
+
+          <a class="nav pull-right brand" href="http://incubator.apache.org">
+            <img src="http://incubator.apache.org/images/egg-logo.png" alt="Apache Incubator Logo" />
+          </a>
+
+        </div>
+      
+    </div>
+
+    <ul class="breadcrumb">
+      <li>
+        <a href="/">Incubator</a>
+       <span class="divider">&raquo;</span>
+      </li>
+      <li>
+        <a href="/crunch/">Crunch</a>
+      </li>
+      
+    </ul>
+
+    <div class="container-fluid">
+      <div class="row-fluid">
+
+        <!-- SIDEBAR AREA -->
+        <div class="span2">
+          <div class="sidebar-nav">
+            <ul class="nav nav-list">
+              
+                
+                  <li class="nav-header">Apache Crunch</li>
+                
+              
+                
+                  
+                    <li><a href="/crunch/index.html">Overview</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/apidocs/">API</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="https://cwiki.apache.org/confluence/display/CRUNCH/">Wiki</a></li>
+                  
+                
+              
+                
+                  <li class="nav-header">Project</li>
+                
+              
+                
+                  
+                    <li><a href="/crunch/source-repository.html">Source Code</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="/crunch/mailing-lists.html">Mailing Lists</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://issues.apache.org/jira/browse/CRUNCH">Issue Tracking</a></li>
+                  
+                
+              
+                
+                  
+                    <li><a href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
+                  
+                
+              
+            </ul>
+          </div> <!-- /well -->
+        </div> <!-- /span -->
+
+        <!-- CONTENT AREA -->
+        <div class="span10">
+          <h1 class="title">
+            Introduction to Apache Crunch
+            
+          </h1>
+
+          <h2 id="build-and-installation">Build and Installation</h2>
+<p>To use Crunch, you first have to build the source code using Maven and install
+it in your local repository:</p>
+<div class="codehilite"><pre><span class="n">mvn</span> <span 
class="n">clean</span> <span class="n">install</span>
+</pre></div>
+
+
+<p>This also runs the integration test suite, which takes a while. Afterwards
+you can run the bundled example applications:</p>
+<div class="codehilite"><pre><span class="n">hadoop</span> <span 
class="n">jar</span> <span class="n">examples</span><span 
class="sr">/target/c</span><span class="n">runch</span><span 
class="o">-</span><span class="n">examples</span><span 
class="o">-*-</span><span class="n">job</span><span class="o">.</span><span 
class="n">jar</span> <span class="n">org</span><span class="o">.</span><span 
class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">examples</span><span class="o">.</span><span 
class="n">WordCount</span> <span class="sr">&lt;inputfile&gt;</span> <span 
class="sr">&lt;outputdir&gt;</span>
+</pre></div>
+
+
+<h2 id="high-level-concepts">High Level Concepts</h2>
+<h3 id="data-model-and-operators">Data Model and Operators</h3>
+<p>Crunch is centered around three interfaces that represent distributed datasets: <code>PCollection&lt;T&gt;</code>, <code>PTable&lt;K, V&gt;</code>, and <code>PGroupedTable&lt;K, V&gt;</code>.</p>
+<p>A <code>PCollection&lt;T&gt;</code> represents a distributed, unordered collection of elements of type T. For example, we represent a text file in Crunch as a
+<code>PCollection&lt;String&gt;</code> object. PCollection provides a method, <code>parallelDo</code>, that applies a function to each element in a PCollection in parallel,
+and returns a new PCollection as its result.</p>
+<p>A <code>PTable&lt;K, V&gt;</code> is a sub-interface of PCollection that represents a distributed, unordered multimap of its key type K to its value type V.
+In addition to the parallelDo operation, PTable provides a <code>groupByKey</code> operation that aggregates all of the values in the PTable that
+have the same key into a single record. It is the groupByKey operation that triggers the sort phase of a MapReduce job.</p>
+<p>The result of a groupByKey operation is a <code>PGroupedTable&lt;K, V&gt;</code> object, which is a distributed, sorted map of keys of type K to an Iterable
+collection of values of type V. In addition to parallelDo, the PGroupedTable provides a <code>combineValues</code> operation, which applies
+a commutative and associative aggregation operator to the values of the PGroupedTable instance on both the map side and the
+reduce side of a MapReduce job.</p>
+<p>Finally, PCollection, PTable, and PGroupedTable all support a <code>union</code> operation, which takes a series of distinct PCollections and treats
+them as a single, virtual PCollection. The union operator is required for operations that combine multiple inputs, such as cogroups and
+joins.</p>
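+<p>To make the semantics of <code>groupByKey</code> and <code>combineValues</code> concrete, here is a minimal in-memory sketch in plain Java. It does not use the Crunch API: the class <code>GroupAndCombineSketch</code> and its two static methods are hypothetical stand-ins that only model, on a single machine, what the two operations compute.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Hypothetical single-machine model; none of these names come from Crunch.
class GroupAndCombineSketch {

  // Models groupByKey: collect every value that shares a key, keeping keys sorted.
  static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> table) {
    Map<K, List<V>> grouped = new TreeMap<>();
    for (Map.Entry<K, V> entry : table) {
      grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
    }
    return grouped;
  }

  // Models combineValues: fold the values of each key with an operator that is
  // assumed to be associative and commutative.
  static <K, V> Map<K, V> combineValues(Map<K, List<V>> grouped, BinaryOperator<V> op) {
    Map<K, V> combined = new LinkedHashMap<>();
    for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
      entry.getValue().stream().reduce(op).ifPresent(v -> combined.put(entry.getKey(), v));
    }
    return combined;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Long>> table =
        List.of(Map.entry("b", 1L), Map.entry("a", 1L), Map.entry("b", 1L));
    Map<String, List<Long>> grouped = groupByKey(table);
    System.out.println(grouped);                            // {a=[1], b=[1, 1]}
    System.out.println(combineValues(grouped, Long::sum));  // {a=1, b=2}
  }
}
```

+<p>Because the operator is associative and commutative, partial results can be computed on the map side and merged again on the reduce side without changing the final value for a key.</p>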
+<h3 id="pipeline-building-and-execution">Pipeline Building and Execution</h3>
+<p>Every Crunch pipeline starts with a <code>Pipeline</code> object that is used to coordinate building the pipeline and executing the underlying MapReduce
+jobs. For efficiency, Crunch uses lazy evaluation, so it will only construct MapReduce jobs from the different stages of the pipeline when
+the Pipeline object's <code>run</code> or <code>done</code> methods are called.</p>
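+<p>The idea of lazy evaluation can be sketched in a few lines of plain Java. This is only an illustration of deferred execution, not how Crunch plans jobs: <code>LazyPipelineSketch</code>, <code>addStage</code>, and its <code>run</code> method are hypothetical names chosen for this example.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of deferred execution; not part of the Crunch API.
class LazyPipelineSketch {
  private final List<Runnable> pending = new ArrayList<>();

  // Registering a stage only records it; no work happens here.
  void addStage(Runnable stage) {
    pending.add(stage);
  }

  // Executes every recorded stage in order and returns how many were run.
  int run() {
    int executed = pending.size();
    for (Runnable stage : pending) {
      stage.run();
    }
    pending.clear();
    return executed;
  }

  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    LazyPipelineSketch pipeline = new LazyPipelineSketch();
    pipeline.addStage(() -> log.add("split"));
    pipeline.addStage(() -> log.add("count"));
    System.out.println(log);  // [] because nothing has executed yet
    pipeline.run();
    System.out.println(log);  // [split, count]
  }
}
```

+<p>Deferring the work until <code>run</code> gives the planner a view of all stages at once, which is what makes it possible to pack several stages into one job.</p>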
+<h2 id="a-detailed-example">A Detailed Example</h2>
+<p>Here is the classic WordCount application using Crunch:</p>
+<div class="codehilite"><pre><span class="nb">import</span> <span 
class="n">org</span><span class="o">.</span><span class="n">apache</span><span 
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span 
class="n">DoFn</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">Emitter</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">PCollection</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">PTable</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">Pipeline</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span class="n">impl</span><span 
class="o">.</span><span class="n">mr</span><span class="o">.</span><span 
class="n">MRPipeline</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span class="n">lib</span><span 
class="o">.</span><span class="n">Aggregate</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">types</span><span class="o">.</span><span 
class="n">writable</span><span class="o">.</span><span 
class="n">Writables</span><span class="p">;</span>
+
+<span class="n">public</span> <span class="n">class</span> <span class="n">WordCount</span> <span class="p">{</span>
+  <span class="n">public</span> <span class="n">static</span> <span class="n">void</span> <span class="n">main</span><span class="p">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="p">)</span> <span class="n">throws</span> <span class="n">Exception</span> <span class="p">{</span>
+    <span class="n">Pipeline</span> <span class="n">pipeline</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MRPipeline</span><span class="p">(</span><span class="n">WordCount</span><span class="o">.</span><span class="n">class</span><span class="p">);</span>
+    <span class="n">PCollection</span><span class="sr">&lt;String&gt;</span> <span class="n">lines</span> <span class="o">=</span> <span class="n">pipeline</span><span class="o">.</span><span class="n">readTextFile</span><span class="p">(</span><span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
+
+    <span class="n">PCollection</span><span class="sr">&lt;String&gt;</span> <span class="n">words</span> <span class="o">=</span> <span class="n">lines</span><span class="o">.</span><span class="n">parallelDo</span><span class="p">(</span><span class="s">&quot;my splitter&quot;</span><span class="p">,</span> <span class="k">new</span> <span class="n">DoFn</span><span class="o">&lt;</span><span class="n">String</span><span class="p">,</span> <span class="n">String</span><span class="o">&gt;</span><span class="p">()</span> <span class="p">{</span>
+      <span class="n">public</span> <span class="n">void</span> <span class="n">process</span><span class="p">(</span><span class="n">String</span> <span class="n">line</span><span class="p">,</span> <span class="n">Emitter</span><span class="sr">&lt;String&gt;</span> <span class="n">emitter</span><span class="p">)</span> <span class="p">{</span>
+        <span class="k">for</span> <span class="p">(</span><span class="n">String</span> <span class="n">word</span> <span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="nb">split</span><span class="p">(</span><span class="s">&quot;\\s+&quot;</span><span class="p">))</span> <span class="p">{</span>
+          <span class="n">emitter</span><span class="o">.</span><span class="n">emit</span><span class="p">(</span><span class="n">word</span><span class="p">);</span>
+        <span class="p">}</span>
+      <span class="p">}</span>
+    <span class="p">},</span> <span class="n">Writables</span><span class="o">.</span><span class="n">strings</span><span class="p">());</span>
+
+    <span class="n">PTable</span><span class="o">&lt;</span><span class="n">String</span><span class="p">,</span> <span class="n">Long</span><span class="o">&gt;</span> <span class="n">counts</span> <span class="o">=</span> <span class="n">Aggregate</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">words</span><span class="p">);</span>
+
+    <span class="n">pipeline</span><span class="o">.</span><span class="n">writeTextFile</span><span class="p">(</span><span class="n">counts</span><span class="p">,</span> <span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span>
+    <span class="n">pipeline</span><span class="o">.</span><span class="n">run</span><span class="p">();</span>
+  <span class="p">}</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>Let's walk through the example step by step.</p>
+<h3 id="step-1-creating-a-pipeline-and-referencing-a-text-file">Step 1: Creating a Pipeline and referencing a text file</h3>
+<p>The <code>MRPipeline</code> implementation of the Pipeline interface compiles the individual stages of a
+pipeline into a series of MapReduce jobs. The MRPipeline constructor takes a class argument
+that is used to tell Hadoop where to find the code that is used in the pipeline execution.</p>
+<p>We now need to tell the Pipeline about the inputs it will be consuming. The Pipeline interface
+defines a <code>readTextFile</code> method that takes the path of the input as a String and returns a PCollection of Strings.
+In addition to text files, Crunch supports reading data from SequenceFiles and Avro container files,
+via the <code>SequenceFileSource</code> and <code>AvroFileSource</code> classes defined in the org.apache.crunch.io package.</p>
+<p>Note that each PCollection is a <em>reference</em> to a source of data; no data is actually loaded into a
+PCollection on the client machine.</p>
+<h3 id="step-2-splitting-the-lines-of-text-into-words">Step 2: Splitting the lines of text into words</h3>
+<p>Crunch defines a small set of primitive operations that can be composed in order to build complex data
+pipelines. The first of these primitives is the <code>parallelDo</code> function, which applies a function (defined
+by a subclass of <code>DoFn</code>) to every record in a PCollection, and returns a new PCollection that contains
+the results.</p>
+<p>The first argument to parallelDo is a string that is used to identify this step in the pipeline. When
+a pipeline is compiled into a series of MapReduce jobs, it is often the case that multiple stages will
+run within the same Mapper or Reducer. Having a string that identifies each processing step is useful
+for debugging errors that occur in a running pipeline.</p>
+<p>The second argument to parallelDo is an anonymous subclass of DoFn. Each DoFn subclass must override
+the <code>process</code> method, which takes in a record from the input PCollection and an <code>Emitter</code> object that
+may have any number of output values written to it. In this case, our DoFn splits each line into
+words, using runs of whitespace as the separator, and emits the words from the split to the output PCollection.</p>
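+<p>The splitting behaviour can be tried without Hadoop. The following is a minimal plain-Java sketch, not Crunch code: <code>SplitterSketch</code> and its <code>Collector</code> interface are hypothetical stand-ins for the DoFn and the Emitter. One detail worth knowing is that <code>String.split("\\s+")</code> returns a leading empty string when a line starts with whitespace; the sketch filters it out, which is an addition that the WordCount example does not make.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the splitting DoFn; not part of the Crunch API.
class SplitterSketch {

  // Plays the role of an emitter: it just receives each output value.
  interface Collector<T> {
    void emit(T value);
  }

  // Splits one line on runs of whitespace and emits every non-empty token.
  static void process(String line, Collector<String> out) {
    for (String word : line.split("\\s+")) {
      if (!word.isEmpty()) {  // extra guard, assumed desirable; absent from the original example
        out.emit(word);
      }
    }
  }

  // Applies process to every line and gathers the emitted words in order.
  static List<String> splitAll(List<String> lines) {
    List<String> words = new ArrayList<>();
    for (String line : lines) {
      process(line, words::add);
    }
    return words;
  }

  public static void main(String[] args) {
    System.out.println("  not to be".split("\\s+").length);              // 4, the first token is ""
    System.out.println(splitAll(List.of("to be  or", "  not to be")));   // [to, be, or, not, to, be]
  }
}
```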
+<p>The last argument to parallelDo is an instance of the <code>PType</code> interface, which specifies how the data
+in the output PCollection is serialized. While Crunch takes advantage of Java Generics to provide
+compile-time type safety, the generic type information is not available at runtime. Crunch needs to know
+how to map the records stored in each PCollection into a Hadoop-supported serialization format in order
+to read and write data to disk. Two serialization implementations are supported in Crunch via the
+<code>PTypeFamily</code> interface: a Writable-based system that is defined in the org.apache.crunch.types.writable
+package, and an Avro-based system that is defined in the org.apache.crunch.types.avro package. Each
+implementation provides convenience methods for working with the common PTypes (Strings, longs, bytes, etc.)
+as well as utility methods for creating PTypes from existing Writable classes or Avro schemas.</p>
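+<p>The reason an explicit type description has to be passed can be seen in a small plain-Java sketch of type erasure. It is an illustration only: <code>ErasureSketch</code> and <code>TypeTag</code> are hypothetical names, and <code>TypeTag</code> merely hints at the role an explicit descriptor plays; it is not how PType is implemented.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of type erasure; not part of the Crunch API.
class ErasureSketch {

  // At runtime both lists are plain ArrayList instances: the element type is erased.
  static boolean sameRuntimeClass(List<?> a, List<?> b) {
    return a.getClass() == b.getClass();
  }

  // An explicit descriptor carries the element type as an ordinary object,
  // so it is still available when the program runs.
  static final class TypeTag<T> {
    final Class<T> type;

    TypeTag(Class<T> type) {
      this.type = type;
    }
  }

  public static void main(String[] args) {
    List<String> strings = new ArrayList<>();
    List<Long> longs = new ArrayList<>();
    System.out.println(sameRuntimeClass(strings, longs));                  // true
    System.out.println(new TypeTag<>(String.class).type.getSimpleName());  // String
  }
}
```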
+<h3 id="step-3-counting-the-words">Step 3: Counting the words</h3>
+<p>Out of Crunch's simple primitive operations, we can build arbitrarily complex chains of operations in order
+to perform higher-level operations, like aggregations and joins, that can work on any type of input data.
+Let's look at the implementation of the <code>Aggregate.count</code> function:</p>
+<div class="codehilite"><pre><span class="nb">package</span> <span 
class="n">org</span><span class="o">.</span><span class="n">apache</span><span 
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span 
class="n">lib</span><span class="p">;</span>
+
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">CombineFn</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">MapFn</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">PCollection</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">PGroupedTable</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">PTable</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span class="n">Pair</span><span 
class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">crunch</span><span class="o">.</span><span 
class="n">types</span><span class="o">.</span><span 
class="n">PTypeFamily</span><span class="p">;</span>
+
+<span class="n">public</span> <span class="n">class</span> <span 
class="n">Aggregate</span> <span class="p">{</span>
+
+  <span class="n">private</span> <span class="n">static</span> <span 
class="n">class</span> <span class="n">Counter</span><span 
class="sr">&lt;S&gt;</span> <span class="n">extends</span> <span 
class="n">MapFn</span><span class="o">&lt;</span><span class="n">S</span><span 
class="p">,</span> <span class="n">Pair</span><span class="o">&lt;</span><span 
class="n">S</span><span class="p">,</span> <span class="n">Long</span><span 
class="o">&gt;&gt;</span> <span class="p">{</span>
+    <span class="n">public</span> <span class="n">Pair</span><span 
class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span 
class="n">Long</span><span class="o">&gt;</span> <span 
class="nb">map</span><span class="p">(</span><span class="n">S</span> <span 
class="n">input</span><span class="p">)</span> <span class="p">{</span>
+          <span class="k">return</span> <span class="n">Pair</span><span 
class="o">.</span><span class="n">of</span><span class="p">(</span><span 
class="n">input</span><span class="p">,</span> <span class="mi">1</span><span 
class="n">L</span><span class="p">);</span>
+    <span class="p">}</span>
+  <span class="p">}</span>
+
+  <span class="n">public</span> <span class="n">static</span> <span 
class="sr">&lt;S&gt;</span> <span class="n">PTable</span><span 
class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span 
class="n">Long</span><span class="o">&gt;</span> <span 
class="n">count</span><span class="p">(</span><span 
class="n">PCollection</span><span class="sr">&lt;S&gt;</span> <span 
class="n">collect</span><span class="p">)</span> <span class="p">{</span>
+    <span class="n">PTypeFamily</span> <span class="n">tf</span> <span 
class="o">=</span> <span class="n">collect</span><span class="o">.</span><span 
class="n">getTypeFamily</span><span class="p">();</span>
+
+    <span class="sr">//</span> <span class="n">Create</span> <span 
class="n">a</span> <span class="n">PTable</span> <span class="n">from</span> 
<span class="n">the</span> <span class="n">PCollection</span> <span 
class="n">by</span> <span class="n">mapping</span> <span class="nb">each</span> 
<span class="n">element</span>
+    <span class="sr">//</span> <span class="n">to</span> <span 
class="n">a</span> <span class="n">key</span> <span class="n">of</span> <span 
class="n">the</span> <span class="n">PTable</span> <span class="n">with</span> 
<span class="n">the</span> <span class="n">value</span> <span 
class="n">equal</span> <span class="n">to</span> <span class="mi">1</span><span 
class="n">L</span>
+    <span class="n">PTable</span><span class="o">&lt;</span><span 
class="n">S</span><span class="p">,</span> <span class="n">Long</span><span 
class="o">&gt;</span> <span class="n">withCounts</span> <span 
class="o">=</span> <span class="n">collect</span><span class="o">.</span><span 
class="n">parallelDo</span><span class="p">(</span><span 
class="s">&quot;count:&quot;</span> <span class="o">+</span> <span 
class="n">collect</span><span class="o">.</span><span 
class="n">getName</span><span class="p">(),</span>
+        <span class="k">new</span> <span class="n">Counter</span><span 
class="sr">&lt;S&gt;</span><span class="p">(),</span> <span 
class="n">tf</span><span class="o">.</span><span class="n">tableOf</span><span 
class="p">(</span><span class="n">collect</span><span class="o">.</span><span 
class="n">getPType</span><span class="p">(),</span> <span 
class="n">tf</span><span class="o">.</span><span class="n">longs</span><span 
class="p">()));</span>
+
+    <span class="sr">//</span> <span class="n">Group</span> <span 
class="n">the</span> <span class="n">records</span> <span class="n">of</span> 
<span class="n">the</span> <span class="n">PTable</span> <span 
class="n">based</span> <span class="n">on</span> <span class="n">their</span> 
<span class="n">key</span><span class="o">.</span>
+    <span class="n">PGroupedTable</span><span class="o">&lt;</span><span 
class="n">S</span><span class="p">,</span> <span class="n">Long</span><span 
class="o">&gt;</span> <span class="n">grouped</span> <span class="o">=</span> 
<span class="n">withCounts</span><span class="o">.</span><span 
class="n">groupByKey</span><span class="p">();</span>
+
+    <span class="sr">//</span> <span class="n">Sum</span> <span 
class="n">the</span> <span class="mi">1</span><span class="n">L</span> <span 
class="nb">values</span> <span class="n">associated</span> <span 
class="n">with</span> <span class="n">the</span> <span class="nb">keys</span> 
<span class="n">to</span> <span class="n">get</span> <span class="n">the</span>
+    <span class="sr">//</span> <span class="n">count</span> <span 
class="n">of</span> <span class="nb">each</span> <span class="n">element</span> 
<span class="n">in</span> <span class="n">this</span> <span 
class="n">PCollection</span><span class="p">,</span> <span 
class="ow">and</span> <span class="k">return</span> <span class="n">it</span>
+    <span class="sr">//</span> <span class="n">as</span> <span 
class="n">a</span> <span class="n">PTable</span> <span class="n">so</span> 
<span class="n">that</span> <span class="n">it</span> <span 
class="n">may</span> <span class="n">be</span> <span class="n">processed</span> 
<span class="n">further</span> <span class="ow">or</span> <span 
class="n">written</span>
+    <span class="sr">//</span> <span class="n">out</span> <span 
class="k">for</span> <span class="n">storage</span><span class="o">.</span>
+    <span class="k">return</span> <span class="n">grouped</span><span 
class="o">.</span><span class="n">combineValues</span><span 
class="p">(</span><span class="n">CombineFn</span><span class="o">.</span><span 
class="sr">&lt;S&gt;</span><span class="n">SUM_LONGS</span><span 
class="p">());</span>
+  <span class="p">}</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>First, we get the PTypeFamily that is associated with the PType for the 
collection. The
+call to parallelDo converts each record in this PCollection into a Pair of the 
input record
+and the number one by extending the <code>MapFn</code> convenience subclass of 
DoFn, and uses the
+<code>tableOf</code> method of the PTypeFamily to specify that the returned 
PCollection should be a
+PTable instance, with the key being the PType of the PCollection and the value 
being the Long
+implementation for this PTypeFamily.</p>
+<p>The next line features the second of Crunch's four operations, 
<code>groupByKey</code>. The groupByKey
+operation may only be applied to a PTable, and returns an instance of the 
<code>PGroupedTable</code>
+interface, which references the grouping of all of the values in the PTable 
that have the same key.
+The groupByKey operation is what triggers the reduce phase of a MapReduce job 
within Crunch.</p>
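+<p>To make the effect of grouping concrete, here is a minimal in-memory sketch in plain Java. It
+does not use Crunch: <code>GroupSketch</code>, its <code>groupByKey</code> method, and
+<code>demo</code> are hypothetical names for this example, and a list of
+<code>Map.Entry</code> objects stands in for a PTable.</p>

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupSketch {
  // Collects all values that share a key into one list per key.
  static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> table) {
    Map<K, List<V>> grouped = new LinkedHashMap<>();
    for (Map.Entry<K, V> record : table) {
      grouped.computeIfAbsent(record.getKey(), k -> new ArrayList<>())
          .add(record.getValue());
    }
    return grouped;
  }

  // Sample table of (word, 1L) pairs, as produced by the counting step.
  static Map<String, List<Long>> demo() {
    List<Map.Entry<String, Long>> table = Arrays.asList(
        new SimpleEntry<String, Long>("a", 1L),
        new SimpleEntry<String, Long>("b", 1L),
        new SimpleEntry<String, Long>("a", 1L));
    return groupByKey(table);
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```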
+<p>The last line in the function returns the output of the third of Crunch's 
four operations,
+<code>combineValues</code>. The combineValues operator takes a 
<code>CombineFn</code> as an argument, which is a
+specialized subclass of DoFn that operates on an implementation of Java's 
Iterable interface. The
+use of combineValues (as opposed to parallelDo) signals to Crunch that the 
CombineFn may be used to
+aggregate values for the same key on the map side of a MapReduce job as well 
as the reduce side.</p>
+<h3 id="step-4-writing-the-output-and-running-the-pipeline">Step 4: Writing 
the output and running the pipeline</h3>
+<p>The Pipeline object also provides a <code>writeTextFile</code> convenience 
method for indicating that a
+PCollection should be written to a text file. There are also output targets 
for SequenceFiles and
+Avro container files, available in the org.apache.crunch.io package.</p>
+<p>After you are finished constructing a pipeline and specifying the output 
destinations, call the
+pipeline's blocking <code>run</code> method in order to compile the pipeline 
into one or more MapReduce
+jobs and execute them.</p>
+<h2 id="more-information">More Information</h2>
+<p><a href="pipelines.html">Writing Your Own Pipelines</a></p>
+        </div> <!-- /span -->
+
+      </div> <!-- /row-fluid -->
+
+    </div>
+
+    <hr/>
+
+    <footer>
+      <div class="container-fluid">
+        <div class="row span12">Copyright &copy; 2012
+          <a href="http://www.apache.org/">The Apache Software Foundation</a>,
+          licensed under the <a 
href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 
2.0</a>.
+         <p><small>Apache Incubator, Apache Hadoop, Hadoop, Apache, and the
+         Apache feather logo are trademarks of The Apache Software Foundation.
+         Other names appearing on the site may be trademarks of their
+         respective owners.</small></p>
+        </div>
+      </div>
+    </footer>
+
+  </body>
+</html>

