Author: buildbot
Date: Fri Dec 14 01:27:53 2012
New Revision: 842237
Log:
Staging update by buildbot for crunch
Modified:
websites/staging/crunch/trunk/content/ (props changed)
websites/staging/crunch/trunk/content/crunch/download.html
websites/staging/crunch/trunk/content/crunch/future-work.html
websites/staging/crunch/trunk/content/crunch/getting-started.html
websites/staging/crunch/trunk/content/crunch/index.html
websites/staging/crunch/trunk/content/crunch/intro.html
websites/staging/crunch/trunk/content/crunch/mailing-lists.html
websites/staging/crunch/trunk/content/crunch/pipelines.html
websites/staging/crunch/trunk/content/crunch/scrunch.html
websites/staging/crunch/trunk/content/crunch/source-repository.html
Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Fri Dec 14 01:27:53 2012
@@ -1 +1 @@
-1410987
+1421632
Modified: websites/staging/crunch/trunk/content/crunch/download.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/download.html (original)
+++ websites/staging/crunch/trunk/content/crunch/download.html Fri Dec 14
01:27:53 2012
@@ -119,7 +119,7 @@
</h1>
- <p>Apache Crunch is distributed under the <a
href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License
2.0</a>.</p>
+ <p>The Apache Crunch (incubating) libraries are distributed under
the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License
2.0</a>.</p>
<p>The link in the Download column takes you to a list of mirrors based on
your location. Checksum and signature are located on Apache's main
distribution site.</p>
Modified: websites/staging/crunch/trunk/content/crunch/future-work.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/future-work.html (original)
+++ websites/staging/crunch/trunk/content/crunch/future-work.html Fri Dec 14
01:27:53 2012
@@ -119,13 +119,11 @@
</h1>
- <p>This section contains an almost certainly incomplete list of
known limitations of Crunch and plans for future work.</p>
+ <p>This section contains an almost certainly incomplete list of
known limitations and plans for future work.</p>
<ul>
-<li>We would like to have easy support for reading and writing data from/to
HCatalog.</li>
-<li>The decision of how to split up processing tasks between dependent
MapReduce jobs is very naiive right now- we simply
-delegate all of the work to the reduce stage of the predecessor job. We should
take advantage of information about the
-expected size of different PCollections to optimize this processing.</li>
-<li>The Crunch optimizer does not yet merge different groupByKey operations
that run over the same input data into a single
+<li>We would like to have easy support for reading and writing data from/to
the Hive metastore via the HCatalog
+APIs.</li>
+<li>The optimizer does not yet merge different groupByKey operations that run
over the same input data into a single
MapReduce job. Implementing this optimization will provide a major performance
benefit for a number of problems.</li>
</ul>
</div> <!-- /span -->
Modified: websites/staging/crunch/trunk/content/crunch/getting-started.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/getting-started.html (original)
+++ websites/staging/crunch/trunk/content/crunch/getting-started.html Fri Dec
14 01:27:53 2012
@@ -119,19 +119,19 @@
</h1>
- <p>Crunch is developed against Apache Hadoop version 1.0.3 and is
also tested against
-Apache Hadoop 2.0.0-alpha. Crunch should work with any version of Hadoop
-after 1.0.3 or 2.0.0-alpha, and is also known to work with distributions from
-vendors like Cloudera, Hortonworks, and IBM. Crunch is <em>not</em> compatible
with
-versions of Hadoop prior to 1.0.x or 2.0.x, such as Apache Hadoop 0.20.x.</p>
-<p>The easiest way to get started with Crunch is to use its Maven archetype
+ <p>The Apache Crunch (incubating) library is developed against
version 1.0.3 of the Apache Hadoop library,
+and is also tested against version 2.0.0-alpha. The library should work with
any version
+after 1.0.3 or 2.0.0-alpha, and is known to work with distributions from
vendors like Cloudera,
+Hortonworks, and IBM. The library is <em>not</em> compatible with versions of
Hadoop prior to 1.0.x or 2.0.x,
+such as version 0.20.x.</p>
+<p>The easiest way to get started with the library is to use the Maven
archetype
to generate a simple project. The archetype is available from Maven Central;
just enter the following command, answer a few questions, and you're ready to
go:</p>
<pre>
$ <strong>mvn archetype:generate
-Dfilter=org.apache.crunch:crunch-archetype</strong>
[...]
-1: remote -> org.apache.crunch:crunch-archetype (Create a basic,
self-contained job for Apache Crunch.)
+1: remote -> org.apache.crunch:crunch-archetype (Create a basic,
self-contained job with the core library.)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive
contains): : <strong>1</strong>
Define value for property 'groupId': : <strong>com.example</strong>
Define value for property 'artifactId': : <strong>crunch-demo</strong>
@@ -172,7 +172,7 @@ $ <strong>tree</strong>
`-- TokenizerTest.java
</pre>
-<p>The <code>WordCount.java</code> file contains the main class that defines a
Crunch-based
+<p>The <code>WordCount.java</code> file contains the main class that defines a
pipeline
application which is referenced from <code>pom.xml</code>.</p>
<p>Build the code:</p>
<pre>
@@ -189,8 +189,8 @@ $ <strong>hadoop jar target/hadoop-job-d
</pre>
<p>The <code><in></code> parameter references a text file or a directory
containing text
-files, while <code><out></code> is a directory where Crunch writes the
final results to.</p>
-<p>Crunch also lets you run applications from within an IDE, either as
standalone
+files, while <code><out></code> is a directory where the pipeline writes
the final results.</p>
+<p>The library also supports running applications from within an IDE, either
as standalone
Java applications or from unit tests. All required dependencies are on Maven's
classpath so you can run the <code>WordCount</code> class directly without any
additional
setup.</p>
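
Since the library supports in-process execution, here is a minimal sketch of
running pipeline code from an IDE or unit test without a cluster. It assumes
the in-memory MemPipeline implementation that ships with the library; the
input literals are made up for illustration.

    import org.apache.crunch.PCollection;
    import org.apache.crunch.impl.mem.MemPipeline;

    public class InMemoryRun {
      public static void main(String[] args) {
        // MemPipeline evaluates pipeline logic in the local JVM, which is
        // what makes IDE and unit-test execution possible.
        PCollection<String> lines =
            MemPipeline.collectionOf("hello world", "hello crunch");
        // materialize() returns the collection's contents to the client.
        for (String line : lines.materialize()) {
          System.out.println(line);
        }
      }
    }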
Modified: websites/staging/crunch/trunk/content/crunch/index.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/index.html (original)
+++ websites/staging/crunch/trunk/content/crunch/index.html Fri Dec 14 01:27:53
2012
@@ -7,7 +7,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta http-equiv="Content-Language" content="en" />
- <title>Apache Crunch - Apache Crunch</title>
+ <title>Apache Crunch - Apache Crunch &trade;</title>
<link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
<link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
@@ -115,7 +115,7 @@
<!-- CONTENT AREA -->
<div class="span10">
<h1 class="title">
- Apache Crunch
+ Apache Crunch &trade;
<small>Simple and Efficient MapReduce Pipelines</small>
@@ -123,20 +123,23 @@
<hr />
<blockquote>
-<p><em>Apache Crunch (incubating)</em> is a Java library for writing, testing,
and
-running MapReduce pipelines, based on Google's FlumeJava. Its goal is to make
+<p>The <em>Apache Crunch (incubating)</em> Java library provides a framework
for writing, testing, and
+running MapReduce pipelines, and is based on Google's FlumeJava library. Its
goal is to make
pipelines that are composed of many user-defined functions simple to write,
easy to test, and efficient to run.</p>
</blockquote>
<hr />
-<p>Running on top of <a href="http://hadoop.apache.org/mapreduce/">Hadoop
MapReduce</a>, Apache
-Crunch provides a simple Java API for tasks like joining and data aggregation
-that are tedious to implement on plain MapReduce. For Scala users, there is
also
-Scrunch, an idiomatic Scala API to Crunch.</p>
+<p>Running on top of <a href="http://hadoop.apache.org/mapreduce/">Hadoop
MapReduce</a>, the Apache
+Crunch library is a simple Java API for tasks like joining and data aggregation
+that are tedious to implement on plain MapReduce. The APIs are especially
useful when
+processing data that does not fit naturally into the relational model, such as
time series,
+serialized object formats like protocol buffers or Avro records, and HBase
rows and columns.
+For Scala users, there is the Scrunch API, which is built on top of the Java
APIs and
+includes a REPL (read-eval-print loop) for creating MapReduce pipelines.</p>
<h2 id="documentation">Documentation</h2>
<ul>
-<li><a href="intro.html">Introduction to Apache Crunch</a></li>
-<li><a href="scrunch.html">Introduction to Scrunch</a></li>
+<li><a href="intro.html">Introduction to the Apache Crunch API</a></li>
+<li><a href="scrunch.html">Introduction to the Scrunch API</a></li>
<li><a href="future-work.html">Current Limitations and Future Work</a></li>
</ul>
<h2 id="disclaimer">Disclaimer</h2>
Modified: websites/staging/crunch/trunk/content/crunch/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/intro.html (original)
+++ websites/staging/crunch/trunk/content/crunch/intro.html Fri Dec 14 01:27:53
2012
@@ -120,13 +120,15 @@
</h1>
<h2 id="build-and-installation">Build and Installation</h2>
-<p>To use Crunch you first have to build the source code using Maven and
install
+<p>You can download the most recently released libraries from the <a
href="download.html">Download</a> page or from the Maven
+Central Repository.</p>
+<p>If you prefer, you can also build the libraries from the source code using
Maven and install
it in your local repository:</p>
<div class="codehilite"><pre><span class="n">mvn</span> <span
class="n">clean</span> <span class="n">install</span>
</pre></div>
-<p>This also runs the integration test suite which will take a while.
Afterwards
+<p>This also runs the integration test suite, which will take a while to
complete. Afterwards
you can run the bundled example applications such as WordCount:</p>
<div class="codehilite"><pre><span class="n">hadoop</span> <span
class="n">jar</span> <span class="n">crunch</span><span class="o">-</span><span
class="n">examples</span><span class="sr">/target/c</span><span
class="n">runch</span><span class="o">-</span><span
class="n">examples</span><span class="o">-*-</span><span
class="n">job</span><span class="o">.</span><span class="n">jar</span> <span
class="n">org</span><span class="o">.</span><span class="n">apache</span><span
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span
class="n">examples</span><span class="o">.</span><span
class="n">WordCount</span> <span class="sr"><inputfile></span> <span
class="sr"><outputdir></span>
</pre></div>
@@ -137,8 +139,8 @@ AverageBytesByIP and TotalBytesByIP take
crunch-examples/src/main/resources/access_logs.tar.gz. WordAggregationHBase
requires an Apache HBase cluster but no input data.</p>
<h2 id="high-level-concepts">High Level Concepts</h2>
<h3 id="data-model-and-operators">Data Model and Operators</h3>
-<p>Crunch is centered around three interfaces that represent distributed
datasets: <code>PCollection<T></code>, <code>PTable<K, V></code>,
and <code>PGroupedTable<K, V></code>.</p>
-<p>A <code>PCollection<T></code> represents a distributed, unordered
collection of elements of type T. For example, we represent a text file in
Crunch as a
+<p>The Java API is centered around three interfaces that represent distributed
datasets: <code>PCollection<T></code>, <code>PTable<K, V></code>,
and <code>PGroupedTable<K, V></code>.</p>
+<p>A <code>PCollection<T></code> represents a distributed, unordered
collection of elements of type T. For example, we represent a text file as a
<code>PCollection<String></code> object. PCollection provides a method,
<code>parallelDo</code>, that applies a function to each element in a
PCollection in parallel,
and returns a new PCollection as its result.</p>
<p>A <code>PTable<K, V></code> is a sub-interface of PCollection that
represents a distributed, unordered multimap of its key type K to its value
type V.
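
To make the parallelDo contract concrete, a minimal sketch (the path and the
anonymous DoFn are hypothetical; Writables.strings() is the Writable-based
string PType discussed later on this page):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class ParallelDoSketch {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(ParallelDoSketch.class);
        PCollection<String> lines = pipeline.readTextFile("/data/in.txt");
        // parallelDo applies the DoFn to every element in parallel and
        // returns a new PCollection described by the PType argument.
        PCollection<String> upper =
            lines.parallelDo(new DoFn<String, String>() {
              @Override
              public void process(String line, Emitter<String> emitter) {
                emitter.emit(line.toUpperCase());
              }
            }, Writables.strings());
        pipeline.writeTextFile(upper, "/data/out");
        pipeline.done();
      }
    }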
@@ -152,11 +154,11 @@ reduce side of a MapReduce job.</p>
them as a single, virtual PCollection. The union operator is required for
operations that combine multiple inputs, such as cogroups and
joins.</p>
<h3 id="pipeline-building-and-execution">Pipeline Building and Execution</h3>
-<p>Every Crunch pipeline starts with a <code>Pipeline</code> object that is
used to coordinate building the pipeline and executing the underlying MapReduce
-jobs. For efficiency, Crunch uses lazy evaluation, so it will only construct
MapReduce jobs from the different stages of the pipelines when
+<p>Every pipeline starts with a <code>Pipeline</code> object that is used to
coordinate building the pipeline and executing the underlying MapReduce
+jobs. For efficiency, the library uses lazy evaluation, so it will only
construct MapReduce jobs from the different stages of the pipelines when
the Pipeline object's <code>run</code> or <code>done</code> methods are
called.</p>
<h2 id="a-detailed-example">A Detailed Example</h2>
-<p>Here is the classic WordCount application using Crunch:</p>
+<p>Here is the classic WordCount application using the APIs:</p>
<div class="codehilite"><pre><span class="nb">import</span> <span
class="n">org</span><span class="o">.</span><span class="n">apache</span><span
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span
class="n">DoFn</span><span class="p">;</span>
<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">Emitter</span><span class="p">;</span>
<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">PCollection</span><span class="p">;</span>
@@ -195,12 +197,12 @@ pipeline into a series of MapReduce jobs
that is used to tell Hadoop where to find the code that is used in the
pipeline execution.</p>
<p>We now need to tell the Pipeline about the inputs it will be consuming. The
Pipeline interface
defines a <code>readTextFile</code> method that takes in a String and returns
a PCollection of Strings.
-In addition to text files, Crunch supports reading data from SequenceFiles and
Avro container files,
+In addition to text files, the library supports reading data from
SequenceFiles and Avro container files,
via the <code>SequenceFileSource</code> and <code>AvroFileSource</code>
classes defined in the org.apache.crunch.io package.</p>
<p>Note that each PCollection is a <em>reference</em> to a source of data; no
data is actually loaded into a
PCollection on the client machine.</p>
<h3 id="step-2-splitting-the-lines-of-text-into-words">Step 2: Splitting the
lines of text into words</h3>
-<p>Crunch defines a small set of primitive operations that can be composed in
order to build complex data
+<p>The library defines a small set of primitive operations that can be
composed in order to build complex data
pipelines. The first of these primitives is the <code>parallelDo</code>
function, which applies a function (defined
by a subclass of <code>DoFn</code>) to every record in a PCollection, and
returns a new PCollection that contains
the results.</p>
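
A sketch of reading from a non-text source, assuming the From factory methods
in org.apache.crunch.io of this era (the paths are hypothetical):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.types.writable.Writables;

    public class SourceSketch {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(SourceSketch.class);
        // readTextFile is shorthand for reading a text Source.
        PCollection<String> text = pipeline.readTextFile("/data/logs");
        // Other formats are read via explicit Sources; here, a
        // SequenceFile whose values deserialize as Writable strings.
        PCollection<String> seq = pipeline.read(
            From.sequenceFile("/data/seq", Writables.strings()));
        pipeline.writeTextFile(text.union(seq), "/data/out");
        pipeline.done();
      }
    }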
@@ -213,8 +215,8 @@ the <code>process</code> method, which t
may have any number of output values written to it. In this case, our DoFn
splits each line up into
words, using a blank space as a separator, and emits the words from the split
to the output PCollection.</p>
<p>The last argument to parallelDo is an instance of the <code>PType</code>
interface, which specifies how the data
-in the output PCollection is serialized. While Crunch takes advantage of Java
Generics to provide
-compile-time type safety, the generic type information is not available at
runtime. Crunch needs to know
+in the output PCollection is serialized. While the API takes advantage of Java
Generics to provide
+compile-time type safety, the generic type information is not available at
runtime. The job planner needs to know
how to map the records stored in each PCollection into a Hadoop-supported
serialization format in order
to read and write data to disk. Two serialization implementations are
supported in Crunch via the
<code>PTypeFamily</code> interface: a Writable-based system that is defined in
the org.apache.crunch.types.writable
@@ -222,7 +224,7 @@ package, and an Avro-based system that i
implementation provides convenience methods for working with the common PTypes
(Strings, longs, bytes, etc.)
as well as utility methods for creating PTypes from existing Writable classes
or Avro schemas.</p>
<h3 id="step-3-counting-the-words">Step 3: Counting the words</h3>
-<p>Out of Crunch's simple primitive operations, we can build arbitrarily
complex chains of operations in order
+<p>Out of the simple primitive operations, we can build arbitrarily complex
chains of operations in order
to perform higher-level operations, like aggregations and joins, that can work
on any type of input data.
Let's look at the implementation of the <code>Aggregate.count</code>
function:</p>
<div class="codehilite"><pre><span class="nb">package</span> <span
class="n">org</span><span class="o">.</span><span class="n">apache</span><span
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span
class="n">lib</span><span class="p">;</span>
@@ -270,14 +272,14 @@ and the number one by extending the <cod
<code>tableOf</code> method of the PTypeFamily to specify that the returned
PCollection should be a
PTable instance, with the key being the PType of the PCollection and the value
being the Long
implementation for this PTypeFamily.</p>
-<p>The next line features the second of Crunch's four operations,
<code>groupByKey</code>. The groupByKey
+<p>The next line features the second of the four primary operations,
<code>groupByKey</code>. The groupByKey
operation may only be applied to a PTable, and returns an instance of the
<code>PGroupedTable</code>
interface, which references the grouping of all of the values in the PTable
that have the same key.
-The groupByKey operation is what triggers the reduce phase of a MapReduce
within Crunch.</p>
-<p>The last line in the function returns the output of the third of Crunch's
four operations,
+The groupByKey operation is what triggers the reduce phase of a MapReduce.</p>
+<p>The last line in the function returns the output of the third of the four
primary operations,
<code>combineValues</code>. The combineValues operator takes a
<code>CombineFn</code> as an argument, which is a
specialized subclass of DoFn that operates on an implementation of Java's
Iterable interface. The
-use of combineValues (as opposed to parallelDo) signals to Crunch that the
CombineFn may be used to
+use of combineValues (as opposed to parallelDo) signals to the planner that
the CombineFn may be used to
aggregate values for the same key on the map side of a MapReduce job as well
as the reduce side.</p>
<h3 id="step-4-writing-the-output-and-running-the-pipeline">Step 4: Writing
the output and running the pipeline</h3>
<p>The Pipeline object also provides a <code>writeTextFile</code> convenience
method for indicating that a
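
Putting the pieces together, a condensed sketch of the WordCount flow walked
through above (paths hypothetical; Aggregate.count bundles the parallelDo,
groupByKey, and combineValues steps just described, and the Avros type family
could be substituted for Writables):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.lib.Aggregate;
    import org.apache.crunch.types.writable.Writables;

    public class WordCountSketch {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(WordCountSketch.class);
        PCollection<String> lines = pipeline.readTextFile("/data/shakespeare");
        // Primitive #1, parallelDo: split each line into words.
        PCollection<String> words =
            lines.parallelDo(new DoFn<String, String>() {
              @Override
              public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                  emitter.emit(word);
                }
              }
            }, Writables.strings());
        // Aggregate.count wraps parallelDo + groupByKey + combineValues, so
        // the CombineFn may run on the map side as well as the reduce side.
        PTable<String, Long> counts = Aggregate.count(words);
        pipeline.writeTextFile(counts, "/data/counts");
        pipeline.done();
      }
    }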
Modified: websites/staging/crunch/trunk/content/crunch/mailing-lists.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/mailing-lists.html (original)
+++ websites/staging/crunch/trunk/content/crunch/mailing-lists.html Fri Dec 14
01:27:53 2012
@@ -124,7 +124,7 @@
so we use plain HTML tables.
-->
-<p>There are several mailing lists for Apache Crunch. To subscribe or
unsubscribe
+<p>There are several mailing lists for the Apache Crunch project. To subscribe
or unsubscribe
to a list send mail to the respective administrative address given below. You
will then receive a confirmation mail with further instructions.</p>
<table class="table">
Modified: websites/staging/crunch/trunk/content/crunch/pipelines.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/pipelines.html (original)
+++ websites/staging/crunch/trunk/content/crunch/pipelines.html Fri Dec 14
01:27:53 2012
@@ -119,12 +119,12 @@
</h1>
- <p>This section discusses the different steps of creating your own
Crunch pipelines in more detail.</p>
+ <p>This section discusses the different steps of creating your own
pipelines in more detail.</p>
<h2 id="writing-a-dofn">Writing a DoFn</h2>
<p>The DoFn class is designed to keep the complexity of the MapReduce APIs out
of your way when you
don't need them, while still keeping them accessible when you do.</p>
<h3 id="serialization">Serialization</h3>
-<p>First, all DoFn instances are required to be
<code>java.io.Serializable</code>. This is a key aspect of Crunch's design:
+<p>First, all DoFn instances are required to be
<code>java.io.Serializable</code>. This is a key aspect of the library's design:
once a particular DoFn is assigned to the Map or Reduce stage of a MapReduce
job, all of the state
of that DoFn is serialized so that it may be distributed to all of the nodes
in the Hadoop cluster that
will be running that task. There are two important implications of this for
developers:</p>
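
One common consequence of the Serializable requirement: state that cannot or
should not be serialized is best re-created in the DoFn's initialize() method
after the function has been shipped to the cluster. A sketch (ExtractMatchFn
and its regex are hypothetical):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    public class ExtractMatchFn extends DoFn<String, String> {
      private final String regex;        // serialized with the DoFn
      private transient Pattern pattern; // rebuilt per task, not serialized

      public ExtractMatchFn(String regex) {
        this.regex = regex;
      }

      @Override
      public void initialize() {
        // Called on each task after deserialization, before process().
        this.pattern = Pattern.compile(regex);
      }

      @Override
      public void process(String line, Emitter<String> emitter) {
        Matcher m = pattern.matcher(line);
        if (m.find()) {
          emitter.emit(m.group());
        }
      }
    }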
@@ -146,13 +146,13 @@ split processing tasks between the Map a
<p>The DoFn base class provides convenience methods for accessing the
<code>Configuration</code> and <code>Counter</code> objects that
are associated with a MapReduce stage, so that they may be accessed during
initialization, processing, and cleanup.</p>
<h3 id="performing-cogroups-and-joins">Performing Cogroups and Joins</h3>
-<p>In Crunch, cogroups and joins are performed on PTable instances that have
the same key type. This section walks through
-the basic flow of a cogroup operation, explaining how this higher-level
operation is composed of Crunch's four primitives.
-In general, these common operations are provided as part of the core Crunch
library or in extensions, you do not need
+<p>Cogroups and joins are performed on PTable instances that have the same key
type. This section walks through
+the basic flow of a cogroup operation, explaining how this higher-level
operation is composed of the four primitive operations.
+In general, these common operations are provided as part of the core library
or in extensions, so you do not need
to write them yourself. But it can be useful to understand how they work under
the covers.</p>
<p>Assume we have a <code>PTable<K, U></code> named "a" and a different
<code>PTable<K, V></code> named "b" that we would like to combine into a
single <code>PTable<K, Pair<Collection<U>,
Collection<V>>></code>. First, we need to apply parallelDo
operations to a and b that
-convert them into the same Crunch type, <code>PTable<K, Pair<U,
V>></code>:</p>
+convert them into the same PType, <code>PTable<K, Pair<U,
V>></code>:</p>
<div class="codehilite"><pre><span class="sr">//</span> <span
class="n">Perform</span> <span class="n">the</span> <span
class="s">"tagging"</span> <span class="n">operation</span> <span
class="n">as</span> <span class="n">a</span> <span class="n">parallelDo</span>
<span class="n">on</span> <span class="n">PTable</span> <span class="n">a</span>
<span class="n">PTable</span><span class="o"><</span><span
class="n">K</span><span class="p">,</span> <span class="n">Pair</span><span
class="o"><</span><span class="n">U</span><span class="p">,</span> <span
class="n">V</span><span class="o">>></span> <span class="n">aPrime</span>
<span class="o">=</span> <span class="n">a</span><span class="o">.</span><span
class="n">parallelDo</span><span class="p">(</span><span
class="s">"taga"</span><span class="p">,</span> <span
class="k">new</span> <span class="n">MapFn</span><span
class="o"><</span><span class="n">Pair</span><span
class="o"><</span><span class="n">K</span><span class="p">,</span> <span
class="n">U</span><span class="o">></span><span class="p">,</span> <span
class="n">Pair</span><span class="o"><</span><span class="n">K</span><span
class="p">,</span> <span class="n">Pair</span><span class="o"><</span><span
class="n">U</span><span class="p">,</span> <span class="n">V</span><span clas
s="o">>>></span><span class="p">()</span> <span class="p">{</span>
<span class="n">public</span> <span class="n">Pair</span><span
class="o"><</span><span class="n">K</span><span class="p">,</span> <span
class="n">Pair</span><span class="o"><</span><span class="n">U</span><span
class="p">,</span> <span class="n">V</span><span class="o">>></span>
<span class="nb">map</span><span class="p">(</span><span
class="n">Pair</span><span class="o"><</span><span class="n">K</span><span
class="p">,</span> <span class="n">U</span><span class="o">></span> <span
class="n">input</span><span class="p">)</span> <span class="p">{</span>
Modified: websites/staging/crunch/trunk/content/crunch/scrunch.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/scrunch.html (original)
+++ websites/staging/crunch/trunk/content/crunch/scrunch.html Fri Dec 14
01:27:53 2012
@@ -117,19 +117,19 @@
<h1 class="title">
Scrunch
- <small>A Scala Wrapper for Apache Crunch</small>
+ <small>A Scala Wrapper for the Apache Crunch (incubating) Java
API</small>
</h1>
<h2 id="introduction">Introduction</h2>
-<p>Scrunch is an experimental Scala wrapper for Crunch, based on the same
ideas as the
-<a href="http://days2011.scala-lang.org/node/138/282">Cascade</a> project at
Google, which created
-a Scala wrapper for FlumeJava.</p>
+<p>Scrunch is an experimental Scala wrapper for the Apache Crunch (incubating)
Java API, based on the same ideas as the
+<a href="http://days2011.scala-lang.org/node/138/282">Cascade</a> project at
Google, which created a Scala wrapper for
+FlumeJava.</p>
<h2 id="why-scala">Why Scala?</h2>
-<p>In many ways, Scala is the perfect language for writing Crunch pipelines.
Scala supports
+<p>In many ways, Scala is the perfect language for writing MapReduce
pipelines. Scala supports
a mixture of functional and object-oriented programming styles and has
powerful type-inference
capabilities, allowing us to create complex pipelines using very few
keystrokes. Here is
-the Scrunch analogue of the classic WordCount problem:</p>
+an implementation of the classic WordCount problem using the Scrunch API:</p>
<div class="codehilite"><pre><span class="nb">import</span> <span
class="n">org</span><span class="o">.</span><span class="n">apache</span><span
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span
class="n">io</span><span class="o">.</span><span class="p">{</span><span
class="n">From</span> <span class="o">=></span> <span
class="n">from</span><span class="p">}</span>
<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">scrunch</span><span class="o">.</span><span class="n">_</span>
<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">scrunch</span><span class="o">.</span><span
class="n">Conversions_</span> <span class="c1"># For implicit type
conversions</span>
@@ -148,7 +148,7 @@ the Scrunch analogue of the classic Word
<p>The Scala compiler can infer the return type of the flatMap function as an
Array[String], and
-the Scrunch wrapper uses the type inference mechanism to figure out how to
serialize the
+the Scrunch wrapper code uses the type inference mechanism to figure out how
to serialize the
data between the Map and Reduce stages. Here's a slightly more complex
example, in which we
get the word counts for two different files and compute the deltas of how
often different
words occur, and then only return the words where the first file had more
occurrences than
@@ -163,12 +163,9 @@ the second:</p>
</pre></div>
-<p>Note that all of the functions are using Scala Tuples, not Crunch Tuples.
Under the covers,
-Scrunch uses Scala's implicit type conversion mechanism to transparently
convert data from the
-Crunch format to the Scala format and back again.</p>
<h2 id="materializing-job-outputs">Materializing Job Outputs</h2>
-<p>Scrunch also incorporates Crunch's materialize functionality, which allows
us to easily read
-the output of a Crunch pipeline into the client:</p>
+<p>The Scrunch API also incorporates the Java library's
<code>materialize</code> functionality, which allows us to easily read
+the output of a MapReduce pipeline into the client:</p>
<div class="codehilite"><pre><span class="n">class</span> <span
class="n">WordCountExample</span> <span class="p">{</span>
<span class="n">def</span> <span class="n">hasHamlet</span> <span
class="o">=</span> <span class="n">wordGt</span><span class="p">(</span><span
class="s">"shakespeare.txt"</span><span class="p">,</span> <span
class="s">"maugham.txt"</span><span class="p">)</span><span
class="o">.</span><span class="n">materialize</span><span
class="o">.</span><span class="nb">exists</span><span class="p">(</span><span
class="n">_</span> <span class="o">==</span> <span
class="s">"hamlet"</span><span class="p">)</span>
<span class="p">}</span>
@@ -176,15 +173,11 @@ the output of a Crunch pipeline into the
<h2 id="notes-and-thanks">Notes and Thanks</h2>
-<p>Scrunch is alpha-quality code, written by someone who was learning Scala on
the fly. There will be bugs,
-rough edges, and non-idiomatic Scala usage all over the place. This will
improve with time, and we welcome
-contributions from Scala experts who are interested in helping us make Scrunch
into a first-class project.</p>
<p>Scrunch emerged out of conversations with <a
href="http://twitter.com/#!/squarecog">Dmitriy Ryaboy</a>,
<a href="http://twitter.com/#!/posco">Oscar Boykin</a>, and <a
href="http://twitter.com/#!/avibryant">Avi Bryant</a> from Twitter.
Many thanks to them for their feedback, guidance, and encouragement. We are
also grateful to
<a href="http://twitter.com/#!/matei_zaharia">Matei Zaharia</a>, whose <a
href="http://www.spark-project.org/">Spark Project</a>
-inspired much of our implementation and was kind enough to loan us the
ClosureCleaner implementation
-Spark developed for use in Scrunch.</p>
+inspired much of the original Scrunch API implementation.</p>
</div> <!-- /span -->
</div> <!-- /row-fluid -->
Modified: websites/staging/crunch/trunk/content/crunch/source-repository.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/source-repository.html
(original)
+++ websites/staging/crunch/trunk/content/crunch/source-repository.html Fri Dec
14 01:27:53 2012
@@ -119,7 +119,7 @@
</h1>
- <p>Apache Crunch uses <a href="http://git-scm.com/">Git</a> for
version control. Run the
+ <p>The Apache Crunch (incubating) Project uses <a
href="http://git-scm.com/">Git</a> for version control. Run the
following command to clone the repository:</p>
<div class="codehilite"><pre><span class="n">git</span> <span
class="n">clone</span> <span class="n">https:</span><span
class="sr">//gi</span><span class="n">t</span><span class="o">-</span><span
class="n">wip</span><span class="o">-</span><span class="n">us</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">org</span><span class="sr">/repos/</span><span
class="n">asf</span><span class="o">/</span><span
class="n">incubator</span><span class="o">-</span><span
class="n">crunch</span><span class="o">.</span><span class="n">git</span>
</pre></div>