Author: buildbot
Date: Fri Dec 14 01:27:53 2012
New Revision: 842237
Log:
Staging update by buildbot for crunch
Modified:
websites/staging/crunch/trunk/content/ (props changed)
websites/staging/crunch/trunk/content/crunch/download.html
websites/staging/crunch/trunk/content/crunch/future-work.html
websites/staging/crunch/trunk/content/crunch/getting-started.html
websites/staging/crunch/trunk/content/crunch/index.html
websites/staging/crunch/trunk/content/crunch/intro.html
websites/staging/crunch/trunk/content/crunch/mailing-lists.html
websites/staging/crunch/trunk/content/crunch/pipelines.html
websites/staging/crunch/trunk/content/crunch/scrunch.html
websites/staging/crunch/trunk/content/crunch/source-repository.html
Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Fri Dec 14 01:27:53 2012
@@ -1 +1 @@
-1410987
+1421632
Modified: websites/staging/crunch/trunk/content/crunch/download.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/download.html (original)
+++ websites/staging/crunch/trunk/content/crunch/download.html Fri Dec 14
01:27:53 2012
@@ -119,7 +119,7 @@
</h1>
- <p>Apache Crunch is distributed under the <a
href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License
2.0</a>.</p>
+ <p>The Apache Crunch (incubating) libraries are distributed under
the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License
2.0</a>.</p>
<p>The link in the Download column takes you to a list of mirrors based on
your location. Checksum and signature are located on Apache's main
distribution site.</p>
Modified: websites/staging/crunch/trunk/content/crunch/future-work.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/future-work.html (original)
+++ websites/staging/crunch/trunk/content/crunch/future-work.html Fri Dec 14
01:27:53 2012
@@ -119,13 +119,11 @@
</h1>
- <p>This section contains an almost certainly incomplete list of
known limitations of Crunch and plans for future work.</p>
+ <p>This section contains an almost certainly incomplete list of
known limitations and plans for future work.</p>
<ul>
-<li>We would like to have easy support for reading and writing data from/to
HCatalog.</li>
-<li>The decision of how to split up processing tasks between dependent
MapReduce jobs is very naiive right now- we simply
-delegate all of the work to the reduce stage of the predecessor job. We should
take advantage of information about the
-expected size of different PCollections to optimize this processing.</li>
-<li>The Crunch optimizer does not yet merge different groupByKey operations
that run over the same input data into a single
+<li>We would like to have easy support for reading and writing data from/to
the Hive metastore via the HCatalog
+APIs.</li>
+<li>The optimizer does not yet merge different groupByKey operations that run
over the same input data into a single
MapReduce job. Implementing this optimization will provide a major performance
benefit for a number of problems.</li>
</ul>
</div> <!-- /span -->
Modified: websites/staging/crunch/trunk/content/crunch/getting-started.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/getting-started.html (original)
+++ websites/staging/crunch/trunk/content/crunch/getting-started.html Fri Dec
14 01:27:53 2012
@@ -119,19 +119,19 @@
</h1>
- <p>Crunch is developed against Apache Hadoop version 1.0.3 and is
also tested against
-Apache Hadoop 2.0.0-alpha. Crunch should work with any version of Hadoop
-after 1.0.3 or 2.0.0-alpha, and is also known to work with distributions from
-vendors like Cloudera, Hortonworks, and IBM. Crunch is <em>not</em> compatible
with
-versions of Hadoop prior to 1.0.x or 2.0.x, such as Apache Hadoop 0.20.x.</p>
-<p>The easiest way to get started with Crunch is to use its Maven archetype
+ <p>The Apache Crunch (incubating) library is developed against
version 1.0.3 of the Apache Hadoop library,
+and is also tested against version 2.0.0-alpha. The library should work with
any version
+after 1.0.3 or 2.0.0-alpha, and is known to work with distributions from
vendors like Cloudera,
+Hortonworks, and IBM. The library is <em>not</em> compatible with versions of
Hadoop prior to 1.0.x or 2.0.x,
+such as version 0.20.x.</p>
+<p>The easiest way to get started with the library is to use the Maven
archetype
to generate a simple project. The archetype is available from Maven Central;
just enter the following command, answer a few questions, and you're ready to
go:</p>
<pre>
$ <strong>mvn archetype:generate
-Dfilter=org.apache.crunch:crunch-archetype</strong>
[...]
-1: remote -> org.apache.crunch:crunch-archetype (Create a basic,
self-contained job for Apache Crunch.)
+1: remote -> org.apache.crunch:crunch-archetype (Create a basic,
self-contained job with the core library.)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive
contains): : <strong>1</strong>
Define value for property 'groupId': : <strong>com.example</strong>
Define value for property 'artifactId': : <strong>crunch-demo</strong>
@@ -172,7 +172,7 @@ $ <strong>tree</strong>
`-- TokenizerTest.java
</pre>
-<p>The <code>WordCount.java</code> file contains the main class that defines a
Crunch-based
+<p>The <code>WordCount.java</code> file contains the main class that defines a
pipeline
application which is referenced from <code>pom.xml</code>.</p>
<p>Build the code:</p>
<pre>
@@ -189,8 +189,8 @@ $ <strong>hadoop jar target/hadoop-job-d
</pre>
<p>The <code><in></code> parameter references a text file or a directory
containing text
-files, while <code><out></code> is a directory where Crunch writes the
final results to.</p>
-<p>Crunch also lets you run applications from within an IDE, either as
standalone
+files, while <code><out></code> is a directory where the pipeline writes
the final results.</p>
+<p>The library also supports running applications from within an IDE, either
as standalone
Java applications or from unit tests. All required dependencies are on Maven's
classpath so you can run the <code>WordCount</code> class directly without any
additional
setup.</p>
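
Since the library supports in-process execution, here is a minimal sketch of
running pipeline code from an IDE or unit test without a cluster. It assumes
the in-memory MemPipeline implementation that ships with the library; the
input literals are made up for illustration.

    import org.apache.crunch.PCollection;
    import org.apache.crunch.impl.mem.MemPipeline;

    public class InMemoryRun {
      public static void main(String[] args) {
        // MemPipeline evaluates pipeline logic in the local JVM, which is
        // what makes IDE and unit-test execution possible.
        PCollection<String> lines =
            MemPipeline.collectionOf("hello world", "hello crunch");
        // materialize() returns the collection's contents to the client.
        for (String line : lines.materialize()) {
          System.out.println(line);
        }
      }
    }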
Modified: websites/staging/crunch/trunk/content/crunch/index.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/index.html (original)
+++ websites/staging/crunch/trunk/content/crunch/index.html Fri Dec 14 01:27:53
2012
@@ -7,7 +7,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta http-equiv="Content-Language" content="en" />
- <title>Apache Crunch - Apache Crunch</title>
+ <title>Apache Crunch - Apache Crunch &trade;</title>
<link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
<link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
@@ -115,7 +115,7 @@
<!-- CONTENT AREA -->
<div class="span10">
<h1 class="title">
- Apache Crunch
+ Apache Crunch &trade;
<small>Simple and Efficient MapReduce Pipelines</small>
@@ -123,20 +123,23 @@
<hr />
<blockquote>
-<p><em>Apache Crunch (incubating)</em> is a Java library for writing, testing,
and
-running MapReduce pipelines, based on Google's FlumeJava. Its goal is to make
+<p>The <em>Apache Crunch (incubating)</em> Java library provides a framework
for writing, testing, and
+running MapReduce pipelines, and is based on Google's FlumeJava library. Its
goal is to make
pipelines that are composed of many user-defined functions simple to write,
easy to test, and efficient to run.</p>
</blockquote>
<hr />
-<p>Running on top of <a href="http://hadoop.apache.org/mapreduce/">Hadoop
MapReduce</a>, Apache
-Crunch provides a simple Java API for tasks like joining and data aggregation
-that are tedious to implement on plain MapReduce. For Scala users, there is
also
-Scrunch, an idiomatic Scala API to Crunch.</p>
+<p>Running on top of <a href="http://hadoop.apache.org/mapreduce/">Hadoop
MapReduce</a>, the Apache
+Crunch library is a simple Java API for tasks like joining and data aggregation
+that are tedious to implement on plain MapReduce. The APIs are especially
useful when
+processing data that does not fit naturally into the relational model, such as
time series,
+serialized object formats like protocol buffers or Avro records, and HBase
rows and columns.
+For Scala users, there is the Scrunch API, which is built on top of the Java
APIs and
+includes a REPL (read-eval-print loop) for creating MapReduce pipelines.</p>
<h2 id="documentation">Documentation</h2>
<ul>
-<li><a href="intro.html">Introduction to Apache Crunch</a></li>
-<li><a href="scrunch.html">Introduction to Scrunch</a></li>
+<li><a href="intro.html">Introduction to the Apache Crunch API</a></li>
+<li><a href="scrunch.html">Introduction to the Scrunch API</a></li>
<li><a href="future-work.html">Current Limitations and Future Work</a></li>
</ul>
<h2 id="disclaimer">Disclaimer</h2>
Modified: websites/staging/crunch/trunk/content/crunch/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/intro.html (original)
+++ websites/staging/crunch/trunk/content/crunch/intro.html Fri Dec 14 01:27:53
2012
@@ -120,13 +120,15 @@
</h1>
<h2 id="build-and-installation">Build and Installation</h2>
-<p>To use Crunch you first have to build the source code using Maven and
install
+<p>You can download the most recently released libraries from the <a
href="download.html">Download</a> page or from the Maven
+Central Repository.</p>
+<p>If you prefer, you can also build the libraries from the source code using
Maven and install
it in your local repository:</p>
<div class="codehilite"><pre><span class="n">mvn</span> <span
class="n">clean</span> <span class="n">install</span>
</pre></div>
-<p>This also runs the integration test suite which will take a while.
Afterwards
+<p>This also runs the integration test suite, which will take a while to
complete. Afterwards
you can run the bundled example applications such as WordCount:</p>
<div class="codehilite"><pre><span class="n">hadoop</span> <span
class="n">jar</span> <span class="n">crunch</span><span class="o">-</span><span
class="n">examples</span><span class="sr">/target/c</span><span
class="n">runch</span><span class="o">-</span><span
class="n">examples</span><span class="o">-*-</span><span
class="n">job</span><span class="o">.</span><span class="n">jar</span> <span
class="n">org</span><span class="o">.</span><span class="n">apache</span><span
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span
class="n">examples</span><span class="o">.</span><span
class="n">WordCount</span> <span class="sr"><inputfile></span> <span
class="sr"><outputdir></span>
</pre></div>
@@ -137,8 +139,8 @@ AverageBytesByIP and TotalBytesByIP take
crunch-examples/src/main/resources/access_logs.tar.gz. WordAggregationHBase
requires an Apache HBase cluster but no input data.</p>
<h2 id="high-level-concepts">High Level Concepts</h2>
<h3 id="data-model-and-operators">Data Model and Operators</h3>
-<p>Crunch is centered around three interfaces that represent distributed
datasets: <code>PCollection<T></code>, <code>PTable<K, V></code>,
and <code>PGroupedTable<K, V></code>.</p>
-<p>A <code>PCollection<T></code> represents a distributed, unordered
collection of elements of type T. For example, we represent a text file in
Crunch as a
+<p>The Java API is centered around three interfaces that represent distributed
datasets: <code>PCollection<T></code>, <code>PTable<K, V></code>,
and <code>PGroupedTable<K, V></code>.</p>
+<p>A <code>PCollection<T></code> represents a distributed, unordered
collection of elements of type T. For example, we represent a text file as a
<code>PCollection<String></code> object. PCollection provides a method,
<code>parallelDo</code>, that applies a function to each element in a
PCollection in parallel,
and returns a new PCollection as its result.</p>
<p>A <code>PTable<K, V></code> is a sub-interface of PCollection that
represents a distributed, unordered multimap of its key type K to its value
type V.
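
To make the parallelDo contract concrete, a minimal sketch (the path and the
anonymous DoFn are hypothetical; Writables.strings() is the Writable-based
string PType discussed later on this page):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class ParallelDoSketch {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(ParallelDoSketch.class);
        PCollection<String> lines = pipeline.readTextFile("/data/in.txt");
        // parallelDo applies the DoFn to every element in parallel and
        // returns a new PCollection described by the PType argument.
        PCollection<String> upper =
            lines.parallelDo(new DoFn<String, String>() {
              @Override
              public void process(String line, Emitter<String> emitter) {
                emitter.emit(line.toUpperCase());
              }
            }, Writables.strings());
        pipeline.writeTextFile(upper, "/data/out");
        pipeline.done();
      }
    }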
@@ -152,11 +154,11 @@ reduce side of a MapReduce job.</p>
them as a single, virtual PCollection. The union operator is required for
operations that combine multiple inputs, such as cogroups and
joins.</p>
<h3 id="pipeline-building-and-execution">Pipeline Building and Execution</h3>
-<p>Every Crunch pipeline starts with a <code>Pipeline</code> object that is
used to coordinate building the pipeline and executing the underlying MapReduce
-jobs. For efficiency, Crunch uses lazy evaluation, so it will only construct
MapReduce jobs from the different stages of the pipelines when
+<p>Every pipeline starts with a <code>Pipeline</code> object that is used to
coordinate building the pipeline and executing the underlying MapReduce
+jobs. For efficiency, the library uses lazy evaluation, so it will only
construct MapReduce jobs from the different stages of the pipelines when
the Pipeline object's <code>run</code> or <code>done</code> methods are
called.</p>
<h2 id="a-detailed-example">A Detailed Example</h2>
-<p>Here is the classic WordCount application using Crunch:</p>
+<p>Here is the classic WordCount application using the APIs:</p>
<div class="codehilite"><pre><span class="nb">import</span> <span
class="n">org</span><span class="o">.</span><span class="n">apache</span><span
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span
class="n">DoFn</span><span class="p">;</span>
<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">Emitter</span><span class="p">;</span>
<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">PCollection</span><span class="p">;</span>
@@ -195,12 +197,12 @@ pipeline into a series of MapReduce jobs
that is used to tell Hadoop where to find the code that is used in the
pipeline execution.</p>
<p>We now need to tell the Pipeline about the inputs it will be consuming. The
Pipeline interface
defines a <code>readTextFile</code> method that takes in a String and returns
a PCollection of Strings.
-In addition to text files, Crunch supports reading data from SequenceFiles and
Avro container files,
+In addition to text files, the library supports reading data from
SequenceFiles and Avro container files,
via the <code>SequenceFileSource</code> and <code>AvroFileSource</code>
classes defined in the org.apache.crunch.io package.</p>
<p>Note that each PCollection is a <em>reference</em> to a source of data; no
data is actually loaded into a
PCollection on the client machine.</p>
<h3 id="step-2-splitting-the-lines-of-text-into-words">Step 2: Splitting the
lines of text into words</h3>
-<p>Crunch defines a small set of primitive operations that can be composed in
order to build complex data
+<p>The library defines a small set of primitive operations that can be
composed in order to build complex data
pipelines. The first of these primitives is the <code>parallelDo</code>
function, which applies a function (defined
by a subclass of <code>DoFn</code>) to every record in a PCollection, and
returns a new PCollection that contains
the results.</p>
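
A sketch of reading from a non-text source, assuming the From factory methods
in org.apache.crunch.io of this era (the paths are hypothetical):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.types.writable.Writables;

    public class SourceSketch {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(SourceSketch.class);
        // readTextFile is shorthand for reading a text Source.
        PCollection<String> text = pipeline.readTextFile("/data/logs");
        // Other formats are read via explicit Sources; here, a
        // SequenceFile whose values deserialize as Writable strings.
        PCollection<String> seq = pipeline.read(
            From.sequenceFile("/data/seq", Writables.strings()));
        pipeline.writeTextFile(text.union(seq), "/data/out");
        pipeline.done();
      }
    }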
@@ -213,8 +215,8 @@ the <code>process</code> method, which t
may have any number of output values written to it. In this case, our DoFn
splits each line up into
words, using a blank space as a separator, and emits the words from the split
to the output PCollection.</p>
<p>The last argument to parallelDo is an instance of the <code>PType</code>
interface, which specifies how the data
-in the output PCollection is serialized. While Crunch takes advantage of Java
Generics to provide
-compile-time type safety, the generic type information is not available at
runtime. Crunch needs to know
+in the output PCollection is serialized. While the API takes advantage of Java
Generics to provide
+compile-time type safety, the generic type information is not available at
runtime. The job planner needs to know
how to map the records stored in each PCollection into a Hadoop-supported
serialization format in order
to read and write data to disk. Two serialization implementations are
supported in Crunch via the
<code>PTypeFamily</code> interface: a Writable-based system that is defined in
the org.apache.crunch.types.writable
@@ -222,7 +224,7 @@ package, and an Avro-based system that i
implementation provides convenience methods for working with the common PTypes
(Strings, longs, bytes, etc.)
as well as utility methods for creating PTypes from existing Writable classes
or Avro schemas.</p>
<h3 id="step-3-counting-the-words">Step 3: Counting the words</h3>
-<p>Out of Crunch's simple primitive operations, we can build arbitrarily
complex chains of operations in order
+<p>Out of the simple primitive operations, we can build arbitrarily complex
chains of operations in order
to perform higher-level operations, like aggregations and joins, that can work
on any type of input data.
Let's look at the implementation of the <code>Aggregate.count</code>
function:</p>
<div class="codehilite"><pre><span class="nb">package</span> <span
class="n">org</span><span class="o">.</span><span class="n">apache</span><span
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span
class="n">lib</span><span class="p">;</span>
@@ -270,14 +272,14 @@ and the number one by extending the <cod
<code>tableOf</code> method of the PTypeFamily to specify that the returned
PCollection should be a
PTable instance, with the key being the PType of the PCollection and the value
being the Long
implementation for this PTypeFamily.</p>
-<p>The next line features the second of Crunch's four operations,
<code>groupByKey</code>. The groupByKey
+<p>The next line features the second of the four primary operations,
<code>groupByKey</code>. The groupByKey
operation may only be applied to a PTable, and returns an instance of the
<code>PGroupedTable</code>
interface, which references the grouping of all of the values in the PTable
that have the same key.
-The groupByKey operation is what triggers the reduce phase of a MapReduce
within Crunch.</p>
-<p>The last line in the function returns the output of the third of Crunch's
four operations,
+The groupByKey operation is what triggers the reduce phase of a MapReduce.</p>
+<p>The last line in the function returns the output of the third of the four
primary operations,
<code>combineValues</code>. The combineValues operator takes a
<code>CombineFn</code> as an argument, which is a
specialized subclass of DoFn that operates on an implementation of Java's
Iterable interface. The
-use of combineValues (as opposed to parallelDo) signals to Crunch that the
CombineFn may be used to
+use of combineValues (as opposed to parallelDo) signals to the planner that
the CombineFn may be used to
aggregate values for the same key on the map side of a MapReduce job as well
as the reduce side.</p>
<h3 id="step-4-writing-the-output-and-running-the-pipeline">Step 4: Writing
the output and running the pipeline</h3>
<p>The Pipeline object also provides a <code>writeTextFile</code> convenience
method for indicating that a
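
Putting the pieces together, a condensed sketch of the WordCount flow walked
through above (paths hypothetical; Aggregate.count bundles the parallelDo,
groupByKey, and combineValues steps just described, and the Avros type family
could be substituted for Writables):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.lib.Aggregate;
    import org.apache.crunch.types.writable.Writables;

    public class WordCountSketch {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(WordCountSketch.class);
        PCollection<String> lines = pipeline.readTextFile("/data/shakespeare");
        // Primitive #1, parallelDo: split each line into words.
        PCollection<String> words =
            lines.parallelDo(new DoFn<String, String>() {
              @Override
              public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                  emitter.emit(word);
                }
              }
            }, Writables.strings());
        // Aggregate.count wraps parallelDo + groupByKey + combineValues, so
        // the CombineFn may run on the map side as well as the reduce side.
        PTable<String, Long> counts = Aggregate.count(words);
        pipeline.writeTextFile(counts, "/data/counts");
        pipeline.done();
      }
    }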
Modified: websites/staging/crunch/trunk/content/crunch/mailing-lists.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/mailing-lists.html (original)
+++ websites/staging/crunch/trunk/content/crunch/mailing-lists.html Fri Dec 14
01:27:53 2012
@@ -124,7 +124,7 @@
so we use plain HTML tables.
-->
-<p>There are several mailing lists for Apache Crunch. To subscribe or
unsubscribe
+<p>There are several mailing lists for the Apache Crunch project. To subscribe
or unsubscribe
to a list send mail to the respective administrative address given below. You
will then receive a confirmation mail with further instructions.</p>
<table class="table">
Modified: websites/staging/crunch/trunk/content/crunch/pipelines.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/pipelines.html (original)
+++ websites/staging/crunch/trunk/content/crunch/pipelines.html Fri Dec 14
01:27:53 2012
@@ -119,12 +119,12 @@
</h1>
- <p>This section discusses the different steps of creating your own
Crunch pipelines in more detail.</p>
+ <p>This section discusses the different steps of creating your own
pipelines in more detail.</p>
<h2 id="writing-a-dofn">Writing a DoFn</h2>
<p>The DoFn class is designed to keep the complexity of the MapReduce APIs out
of your way when you
don't need them, while still keeping them accessible when you do.</p>
<h3 id="serialization">Serialization</h3>
-<p>First, all DoFn instances are required to be
<code>java.io.Serializable</code>. This is a key aspect of Crunch's design:
+<p>First, all DoFn instances are required to be
<code>java.io.Serializable</code>. This is a key aspect of the library's design:
once a particular DoFn is assigned to the Map or Reduce stage of a MapReduce
job, all of the state
of that DoFn is serialized so that it may be distributed to all of the nodes
in the Hadoop cluster that
will be running that task. There are two important implications of this for
developers:</p>
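
One common consequence of the Serializable requirement: state that cannot or
should not be serialized is best re-created in the DoFn's initialize() method
after the function has been shipped to the cluster. A sketch (ExtractMatchFn
and its regex are hypothetical):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    public class ExtractMatchFn extends DoFn<String, String> {
      private final String regex;        // serialized with the DoFn
      private transient Pattern pattern; // rebuilt per task, not serialized

      public ExtractMatchFn(String regex) {
        this.regex = regex;
      }

      @Override
      public void initialize() {
        // Called on each task after deserialization, before process().
        this.pattern = Pattern.compile(regex);
      }

      @Override
      public void process(String line, Emitter<String> emitter) {
        Matcher m = pattern.matcher(line);
        if (m.find()) {
          emitter.emit(m.group());
        }
      }
    }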
@@ -146,13 +146,13 @@ split processing tasks between the Map a
<p>The DoFn base class provides convenience methods for accessing the
<code>Configuration</code> and <code>Counter</code> objects that
are associated with a MapReduce stage, so that they may be accessed during
initialization, processing, and cleanup.</p>
<h3 id="performing-cogroups-and-joins">Performing Cogroups and Joins</h3>
-<p>In Crunch, cogroups and joins are performed on PTable instances that have
the same key type. This section walks through
-the basic flow of a cogroup operation, explaining how this higher-level
operation is composed of Crunch's four primitives.
-In general, these common operations are provided as part of the core Crunch
library or in extensions, you do not need
+<p>Cogroups and joins are performed on PTable instances that have the same key
type. This section walks through
+the basic flow of a cogroup operation, explaining how this higher-level
operation is composed of the four primitive operations.
+In general, these common operations are provided as part of the core library
or in extensions, so you do not need
to write them yourself. But it can be useful to understand how they work under
the covers.</p>
<p>Assume we have a <code>PTable<K, U></code> named "a" and a different
<code>PTable<K, V></code> named "b" that we would like to combine into a
single <code>PTable<K, Pair<Collection<U>,
Collection<V>>></code>. First, we need to apply parallelDo
operations to a and b that
-convert them into the same Crunch type, <code>PTable<K, Pair<U,
V>></code>:</p>
+convert them into the same PType, <code>PTable<K, Pair<U,
V>></code>:</p>
<div class="codehilite"><pre><span class="sr">//</span> <span
class="n">Perform</span> <span class="n">the</span> <span
class="s">"tagging"</span> <span class="n">operation</span> <span
class="n">as</span> <span class="n">a</span> <span class="n">parallelDo</span>
<span class="n">on</span> <span class="n">PTable</span> <span class="n">a</span>
<span class="n">PTable</span><span class="o"><</span><span
class="n">K</span><span class="p">,</span> <span class="n">Pair</span><span
class="o"><</span><span class="n">U</span><span class="p">,</span> <span
class="n">V</span><span class="o">>></span> <span class="n">aPrime</span>
<span class="o">=</span> <span class="n">a</span><span class="o">.</span><span
class="n">parallelDo</span><span class="p">(</span><span
class="s">"taga"</span><span class="p">,</span> <span
class="k">new</span> <span class="n">MapFn</span><span
class="o"><</span><span class="n">Pair</span><span
class="o"><</span><span class="n">K</span><span class="p">,</span> <span
class="n">U</span><span class="o">></span><span class="p">,</span> <span
class="n">Pair</span><span class="o"><</span><span class="n">K</span><span
class="p">,</span> <span class="n">Pair</span><span class="o"><</span><span
class="n">U</span><span class="p">,</span> <span class="n">V</span><span clas
s="o">>>></span><span class="p">()</span> <span class="p">{</span>
<span class="n">public</span> <span class="n">Pair</span><span
class="o"><</span><span class="n">K</span><span class="p">,</span> <span
class="n">Pair</span><span class="o"><</span><span class="n">U</span><span
class="p">,</span> <span class="n">V</span><span class="o">>></span>
<span class="nb">map</span><span class="p">(</span><span
class="n">Pair</span><span class="o"><</span><span class="n">K</span><span
class="p">,</span> <span class="n">U</span><span class="o">></span> <span
class="n">input</span><span class="p">)</span> <span class="p">{</span>
Modified: websites/staging/crunch/trunk/content/crunch/scrunch.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/scrunch.html (original)
+++ websites/staging/crunch/trunk/content/crunch/scrunch.html Fri Dec 14
01:27:53 2012
@@ -117,19 +117,19 @@
<h1 class="title">
Scrunch
- <small>A Scala Wrapper for Apache Crunch</small>
+ <small>A Scala Wrapper for the Apache Crunch (incubating) Java
API</small>
</h1>
<h2 id="introduction">Introduction</h2>
-<p>Scrunch is an experimental Scala wrapper for Crunch, based on the same
ideas as the
-<a href="http://days2011.scala-lang.org/node/138/282">Cascade</a> project at
Google, which created
-a Scala wrapper for FlumeJava.</p>
+<p>Scrunch is an experimental Scala wrapper for the Apache Crunch (incubating)
Java API, based on the same ideas as the
+<a href="http://days2011.scala-lang.org/node/138/282">Cascade</a> project at
Google, which created a Scala wrapper for
+FlumeJava.</p>
<h2 id="why-scala">Why Scala?</h2>
-<p>In many ways, Scala is the perfect language for writing Crunch pipelines.
Scala supports
+<p>In many ways, Scala is the perfect language for writing MapReduce
pipelines. Scala supports
a mixture of functional and object-oriented programming styles and has
powerful type-inference
capabilities, allowing us to create complex pipelines using very few
keystrokes. Here is
-the Scrunch analogue of the classic WordCount problem:</p>
+an implementation of the classic WordCount problem using the Scrunch API:</p>
<div class="codehilite"><pre><span class="nb">import</span> <span
class="n">org</span><span class="o">.</span><span class="n">apache</span><span
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span
class="n">io</span><span class="o">.</span><span class="p">{</span><span
class="n">From</span> <span class="o">=></span> <span
class="n">from</span><span class="p">}</span>
<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">scrunch</span><span class="o">.</span><span class="n">_</span>
<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">scrunch</span><span class="o">.</span><span
class="n">Conversions_</span> <span class="c1"># For implicit type
conversions</span>
@@ -148,7 +148,7 @@ the Scrunch analogue of the classic Word
<p>The Scala compiler can infer the return type of the flatMap function as an
Array[String], and
-the Scrunch wrapper uses the type inference mechanism to figure out how to
serialize the
+the Scrunch wrapper code uses the type inference mechanism to figure out how
to serialize the
data between the Map and Reduce stages. Here's a slightly more complex
example, in which we
get the word counts for two different files and compute the deltas of how
often different
words occur, and then only return the words where the first file had more
occurrences than
@@ -163,12 +163,9 @@ the second:</p>
</pre></div>
-<p>Note that all of the functions are using Scala Tuples, not Crunch Tuples.
Under the covers,
-Scrunch uses Scala's implicit type conversion mechanism to transparently
convert data from the
-Crunch format to the Scala format and back again.</p>
<h2 id="materializing-job-outputs">Materializing Job Outputs</h2>
-<p>Scrunch also incorporates Crunch's materialize functionality, which allows
us to easily read
-the output of a Crunch pipeline into the client:</p>
+<p>The Scrunch API also incorporates the Java library's
<code>materialize</code> functionality, which allows us to easily read
+the output of a MapReduce pipeline into the client:</p>
<div class="codehilite"><pre><span class="n">class</span> <span
class="n">WordCountExample</span> <span class="p">{</span>
<span class="n">def</span> <span class="n">hasHamlet</span> <span
class="o">=</span> <span class="n">wordGt</span><span class="p">(</span><span
class="s">"shakespeare.txt"</span><span class="p">,</span> <span
class="s">"maugham.txt"</span><span class="p">)</span><span
class="o">.</span><span class="n">materialize</span><span
class="o">.</span><span class="nb">exists</span><span class="p">(</span><span
class="n">_</span> <span class="o">==</span> <span
class="s">"hamlet"</span><span class="p">)</span>
<span class="p">}</span>
@@ -176,15 +173,11 @@ the output of a Crunch pipeline into the
<h2 id="notes-and-thanks">Notes and Thanks</h2>
-<p>Scrunch is alpha-quality code, written by someone who was learning Scala on
the fly. There will be bugs,
-rough edges, and non-idiomatic Scala usage all over the place. This will
improve with time, and we welcome
-contributions from Scala experts who are interested in helping us make Scrunch
into a first-class project.</p>
<p>Scrunch emerged out of conversations with <a
href="http://twitter.com/#!/squarecog">Dmitriy Ryaboy</a>,
<a href="http://twitter.com/#!/posco">Oscar Boykin</a>, and <a
href="http://twitter.com/#!/avibryant">Avi Bryant</a> from Twitter.
Many thanks to them for their feedback, guidance, and encouragement. We are
also grateful to
<a href="http://twitter.com/#!/matei_zaharia">Matei Zaharia</a>, whose <a
href="http://www.spark-project.org/">Spark Project</a>
-inspired much of our implementation and was kind enough to loan us the
ClosureCleaner implementation
-Spark developed for use in Scrunch.</p>
+inspired much of the original Scrunch API implementation.</p>
</div> <!-- /span -->
</div> <!-- /row-fluid -->
Modified: websites/staging/crunch/trunk/content/crunch/source-repository.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/source-repository.html
(original)
+++ websites/staging/crunch/trunk/content/crunch/source-repository.html Fri Dec
14 01:27:53 2012
@@ -119,7 +119,7 @@
</h1>
- <p>Apache Crunch uses <a href="http://git-scm.com/">Git</a> for
version control. Run the
+ <p>The Apache Crunch (incubating) Project uses <a
href="http://git-scm.com/">Git</a> for version control. Run the
following command to clone the repository:</p>
<div class="codehilite"><pre><span class="n">git</span> <span
class="n">clone</span> <span class="n">https:</span><span
class="sr">//gi</span><span class="n">t</span><span class="o">-</span><span
class="n">wip</span><span class="o">-</span><span class="n">us</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">org</span><span class="sr">/repos/</span><span
class="n">asf</span><span class="o">/</span><span
class="n">incubator</span><span class="o">-</span><span
class="n">crunch</span><span class="o">.</span><span class="n">git</span>
</pre></div>