Repository: spark
Updated Branches:
  refs/heads/branch-2.0 b031ea7dc -> 273f3d052


[SPARK-15333][DOCS] Reorganize building-spark.md; rationalize vs wiki

## What changes were proposed in this pull request?

See JIRA for the motivation. The changes are almost entirely movement of text 
and edits to sections. Minor changes to text include:

- Copying in / merging text from the "Useful Developer Tools" wiki, in areas of
  - Docker
  - R
  - Running one test
- standardizing on ./build/mvn not mvn, and likewise for ./build/sbt
- correcting some typos
- standardizing code block formatting

No text has been removed from this doc; text has been imported from the 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools wiki

## How was this patch tested?

Jekyll doc build and inspection of resulting HTML in browser.

Author: Sean Owen <so...@cloudera.com>

Closes #13124 from srowen/SPARK-15333.

(cherry picked from commit 932d8002931d352dd2ec87184e6c84ec5fa859cd)
Signed-off-by: Sean Owen <so...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/273f3d05
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/273f3d05
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/273f3d05

Branch: refs/heads/branch-2.0
Commit: 273f3d05294f8fcd8f3f4e116afcd96bd4b50920
Parents: b031ea7
Author: Sean Owen <so...@cloudera.com>
Authored: Tue May 17 16:40:38 2016 +0100
Committer: Sean Owen <so...@cloudera.com>
Committed: Tue May 17 16:40:48 2016 +0100

----------------------------------------------------------------------
 docs/building-spark.md | 295 +++++++++++++++++++++++---------------------
 1 file changed, 156 insertions(+), 139 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/273f3d05/docs/building-spark.md
----------------------------------------------------------------------
diff --git a/docs/building-spark.md b/docs/building-spark.md
index 63532c7..2c987cf 100644
--- a/docs/building-spark.md
+++ b/docs/building-spark.md
@@ -7,48 +7,18 @@ redirect_from: "building-with-maven.html"
 * This will become a table of contents (this text will be scraped).
 {:toc}
 
-Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+.
-The Spark build can supply a suitable Maven binary; see below.
-
-# Building with `build/mvn`
-
-Spark now comes packaged with a self-contained Maven installation to ease 
building and deployment of Spark from source located under the `build/` 
directory. This script will automatically download and setup all necessary 
build requirements ([Maven](https://maven.apache.org/), 
[Scala](http://www.scala-lang.org/), and 
[Zinc](https://github.com/typesafehub/zinc)) locally within the `build/` 
directory itself. It honors any `mvn` binary if present already, however, will 
pull down its own copy of Scala and Zinc regardless to ensure proper version 
requirements are met. `build/mvn` execution acts as a pass through to the `mvn` 
call allowing easy transition from previous build methods. As an example, one 
can build a version of Spark as follows:
-
-{% highlight bash %}
-build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
-{% endhighlight %}
-
-Other build examples can be found below.
-
-**Note:** When building on an encrypted filesystem (if your home directory is 
encrypted, for example), then the Spark build might fail with a "Filename too 
long" error. As a workaround, add the following in the configuration args of 
the `scala-maven-plugin` in the project `pom.xml`:
-
-    <arg>-Xmax-classfile-name</arg>
-    <arg>128</arg>
-
-and in `project/SparkBuild.scala` add:
-
-    scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
-
-to the `sharedSettings` val. See also [this 
PR](https://github.com/apache/spark/pull/2883/files) if you are unsure of where 
to add these lines.
-
-# Building a Runnable Distribution
+# Building Apache Spark
 
-To create a Spark distribution like those distributed by the
-[Spark Downloads](http://spark.apache.org/downloads.html) page, and that is 
laid out so as
-to be runnable, use `./dev/make-distribution.sh` in the project root 
directory. It can be configured
-with Maven profile settings and so on like the direct Maven build. Example:
+## Apache Maven
 
-    ./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn
-
-For more information on usage, run `./dev/make-distribution.sh --help`
+The Maven-based build is the build of reference for Apache Spark.
+Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+.
 
-# Setting up Maven's Memory Usage
+### Setting up Maven's Memory Usage
 
 You'll need to configure Maven to use more memory than usual by setting 
`MAVEN_OPTS`. We recommend the following settings:
 
-{% highlight bash %}
-export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
-{% endhighlight %}
+    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
 
 If you don't run this, you may see errors like the following:
 
@@ -65,7 +35,26 @@ You can fix this by setting the `MAVEN_OPTS` variable as 
discussed before.
 * For Java 8 and above this step is not required.
 * If using `build/mvn` with no `MAVEN_OPTS` set, the script will automate this 
for you.
 
-# Specifying the Hadoop Version
+### build/mvn
+
+Spark now comes packaged with a self-contained Maven installation, located under the `build/`
+directory, to ease building and deploying Spark from source. This script will automatically
+download and set up all necessary build requirements ([Maven](https://maven.apache.org/),
+[Scala](http://www.scala-lang.org/), and [Zinc](https://github.com/typesafehub/zinc)) locally
+within the `build/` directory itself. It honors any `mvn` binary already present, but will pull
+down its own copy of Scala and Zinc regardless, to ensure the proper version requirements are
+met. `build/mvn` acts as a pass-through to the `mvn` call, allowing an easy transition from
+previous build methods. As an example, one can build a version of Spark as follows:
+
+    ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
+
+Other build examples can be found below.
+
+## Building a Runnable Distribution
+
+To create a Spark distribution like those distributed by the
+[Spark Downloads](http://spark.apache.org/downloads.html) page, and that is 
laid out so as
+to be runnable, use `./dev/make-distribution.sh` in the project root 
directory. It can be configured
+with Maven profile settings and so on like the direct Maven build. Example:
+
+    ./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn
+
+For more information on usage, run `./dev/make-distribution.sh --help`
+
+## Specifying the Hadoop Version
 
 Because HDFS is not protocol-compatible across versions, if you want to read 
from HDFS, you'll need to build Spark against the specific HDFS version in your 
environment. You can do this through the `hadoop.version` property. If unset, 
Spark will build against Hadoop 2.2.0 by default. Note that certain build 
profiles are required for particular Hadoop versions:
 
@@ -87,87 +76,63 @@ You can enable the `yarn` profile and optionally set the 
`yarn.version` property
 
 Examples:
 
-{% highlight bash %}
+    # Apache Hadoop 2.2.X
+    ./build/mvn -Pyarn -Phadoop-2.2 -DskipTests clean package
 
-# Apache Hadoop 2.2.X
-mvn -Pyarn -Phadoop-2.2 -DskipTests clean package
+    # Apache Hadoop 2.3.X
+    ./build/mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
 
-# Apache Hadoop 2.3.X
-mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
+    # Apache Hadoop 2.4.X or 2.5.X
+    ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package
 
-# Apache Hadoop 2.4.X or 2.5.X
-mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package
+    # Apache Hadoop 2.6.X
+    ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
 
-# Apache Hadoop 2.6.X
-mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
+    # Apache Hadoop 2.7.X and later
+    ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=VERSION -DskipTests clean package
 
-# Apache Hadoop 2.7.X and later
-mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=VERSION -DskipTests clean package
+    # Different versions of HDFS and YARN.
+    ./build/mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=2.2.0 -DskipTests clean package
 
-# Different versions of HDFS and YARN.
-mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=2.2.0 -DskipTests clean package
-{% endhighlight %}
+## Building With Hive and JDBC Support
 
-# Building With Hive and JDBC Support
 To enable Hive integration for Spark SQL along with its JDBC server and CLI,
 add the `-Phive` and `-Phive-thriftserver` profiles to your existing build options.
 By default Spark will build with Hive 1.2.1 bindings.
-{% highlight bash %}
-# Apache Hadoop 2.4.X with Hive 1.2.1 support
-mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
-{% endhighlight %}
-
-# Building for Scala 2.10
-To produce a Spark package compiled with Scala 2.10, use the `-Dscala-2.10` 
property:
-
-    ./dev/change-scala-version.sh 2.10
-    mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
-
-# PySpark Tests with Maven
 
-If you are building PySpark and wish to run the PySpark tests you will need to 
build Spark with hive support.
+    # Apache Hadoop 2.4.X with Hive 1.2.1 support
+    ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
 
-{% highlight bash %}
-build/mvn -DskipTests clean package -Phive
-./python/run-tests
-{% endhighlight %}
+## Packaging without Hadoop Dependencies for YARN
 
-The run-tests script also can be limited to a specific Python version or a 
specific module
-
-    ./python/run-tests --python-executables=python --modules=pyspark-sql
-
-**Note:** You can also run Python tests with an sbt build, provided you build 
Spark with hive support.
-
-# Spark Tests in Maven
-
-Tests are run by default via the [ScalaTest Maven 
plugin](http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin).
+The assembly directory produced by `mvn package` will, by default, include all 
of Spark's 
+dependencies, including Hadoop and some of its ecosystem projects. On YARN 
deployments, this 
+causes multiple versions of these to appear on executor classpaths: the 
version packaged in 
+the Spark assembly and the version on each node, included with 
`yarn.application.classpath`.
+The `hadoop-provided` profile builds the assembly without including 
Hadoop-ecosystem projects, 
+like ZooKeeper and Hadoop itself.
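+
+For example, a sketch of a build that leaves Hadoop and its ecosystem to be provided by the
+cluster; the other profiles shown are illustrative, so combine `-Phadoop-provided` with whatever
+profiles you actually need:
+
+    ./build/mvn -Pyarn -Phadoop-2.7 -Phadoop-provided -DskipTests clean package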
 
-Some of the tests require Spark to be packaged first, so always run `mvn 
package` with `-DskipTests` the first time.  The following is an example of a 
correct (build, test) sequence:
-
-    mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive -Phive-thriftserver clean package
-    mvn -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver test
-
-The ScalaTest plugin also supports running only a specific test suite as 
follows:
+## Building for Scala 2.10
+To produce a Spark package compiled with Scala 2.10, use the `-Dscala-2.10` 
property:
 
-    mvn -Dhadoop.version=... -DwildcardSuites=org.apache.spark.repl.ReplSuite test
+    ./dev/change-scala-version.sh 2.10
+    ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
 
-# Building submodules individually
+## Building submodules individually
 
 It's possible to build Spark sub-modules using the `mvn -pl` option.
 
 For instance, you can build the Spark Streaming module using:
 
-{% highlight bash %}
-mvn -pl :spark-streaming_2.11 clean install
-{% endhighlight %}
+    ./build/mvn -pl :spark-streaming_2.11 clean install
 
 where `spark-streaming_2.11` is the `artifactId` as defined in the `streaming/pom.xml` file.
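+
+If the module's dependencies are not already in your local Maven repository, Maven's standard
+`-am` ("also make") flag can build them in the same invocation; this is plain Maven behavior
+rather than anything Spark-specific, e.g.:
+
+    ./build/mvn -pl :spark-streaming_2.11 -am clean install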
 
-# Continuous Compilation
+## Continuous Compilation
 
 We use the scala-maven-plugin which supports incremental and continuous 
compilation. E.g.
 
-    mvn scala:cc
+    ./build/mvn scala:cc
 
 should run continuous compilation (i.e. wait for changes). However, this has 
not been tested
 extensively. A couple of gotchas to note:
@@ -182,86 +147,138 @@ the `spark-parent` module).
 
 Thus, the full flow for running continuous-compilation of the `core` submodule 
may look more like:
 
-    $ mvn install
+    $ ./build/mvn install
     $ cd core
-    $ mvn scala:cc
+    $ ../build/mvn scala:cc
 
-# Building Spark with IntelliJ IDEA or Eclipse
+## Speeding up Compilation with Zinc
 
-For help in setting up IntelliJ IDEA or Eclipse for Spark development, and 
troubleshooting, refer to the
-[wiki page for IDE 
setup](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup).
+[Zinc](https://github.com/typesafehub/zinc) is a long-running server version 
of SBT's incremental
+compiler. When run locally as a background process, it speeds up builds of 
Scala-based projects
+like Spark. Developers who regularly recompile Spark with Maven will be the 
most interested in
+Zinc. The project site gives instructions for building and running `zinc`; OS 
X users can
+install it using `brew install zinc`.
 
-# Running Java 8 Test Suites
+If using the `build/mvn` script, `zinc` will automatically be downloaded and leveraged for all
+builds. This process will auto-start after the first time `build/mvn` is 
called and bind to port
+3030 unless the `ZINC_PORT` environment variable is set. The `zinc` process 
can subsequently be
+shut down at any time by running `build/zinc-<version>/bin/zinc -shutdown` and 
will automatically
+restart whenever `build/mvn` is called.
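+
+For instance, a sketch of the two knobs mentioned above; the `<version>` placeholder stands for
+whatever Zinc directory is actually present under `build/`:
+
+    # have build/mvn start (or reuse) a Zinc server on a non-default port
+    ZINC_PORT=3031 ./build/mvn -DskipTests clean package
+
+    # shut the background Zinc server down when you are done
+    ./build/zinc-<version>/bin/zinc -shutdown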
 
-Running only Java 8 tests and nothing else.
+## Building with SBT
 
-    mvn install -DskipTests
-    mvn -pl :java8-tests_2.11 test
+Maven is the official build tool recommended for packaging Spark, and is the 
*build of reference*.
+But SBT is supported for day-to-day development since it can provide much 
faster iterative
+compilation. More advanced developers may wish to use SBT.
 
-or
+The SBT build is derived from the Maven POM files, and so the same Maven 
profiles and variables
+can be set to control the SBT build. For example:
 
-    sbt java8-tests/test
+    ./build/sbt -Pyarn -Phadoop-2.3 package
 
-Java 8 tests are automatically enabled when a Java 8 JDK is detected.
-If you have JDK 8 installed but it is not the system default, you can set 
JAVA_HOME to point to JDK 8 before running the tests.
+To avoid the overhead of launching sbt each time you need to re-compile, you 
can launch sbt
+in interactive mode by running `build/sbt`, and then run all build commands at 
the command
+prompt. For more recommendations on reducing build time, refer to the
+[wiki 
page](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-ReducingBuildTimes).
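+
+As a sketch, an interactive session might look like the following, with the `>` lines typed at
+the sbt prompt; the tasks shown are the same ones used elsewhere in this document:
+
+    ./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver
+    > package
+    > core/test
+    > test-only org.apache.spark.repl.ReplSuite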
 
-# Running Docker based Integration Test Suites
+## Encrypted Filesystems
 
-Running only docker based integration tests and nothing else.
+When building on an encrypted filesystem (if your home directory is encrypted, 
for example), then the Spark build might fail with a "Filename too long" error. 
As a workaround, add the following in the configuration args of the 
`scala-maven-plugin` in the project `pom.xml`:
 
-    mvn install -DskipTests
-    mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11
+    <arg>-Xmax-classfile-name</arg>
+    <arg>128</arg>
 
-or
+and in `project/SparkBuild.scala` add:
 
-    sbt docker-integration-tests/test
+    scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
 
+to the `sharedSettings` val. See also [this 
PR](https://github.com/apache/spark/pull/2883/files) if you are unsure of where 
to add these lines.
 
-# Packaging without Hadoop Dependencies for YARN
+## IntelliJ IDEA or Eclipse
 
-The assembly directory produced by `mvn package` will, by default, include all 
of Spark's dependencies, including Hadoop and some of its ecosystem projects. 
On YARN deployments, this causes multiple versions of these to appear on 
executor classpaths: the version packaged in the Spark assembly and the version 
on each node, included with `yarn.application.classpath`.  The 
`hadoop-provided` profile builds the assembly without including 
Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
+For help in setting up IntelliJ IDEA or Eclipse for Spark development, and 
troubleshooting, refer to the
+[wiki page for IDE 
setup](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup).
 
-# Building with SBT
 
-Maven is the official build tool recommended for packaging Spark, and is the 
*build of reference*.
-But SBT is supported for day-to-day development since it can provide much 
faster iterative
-compilation. More advanced developers may wish to use SBT.
+# Running Tests
 
-The SBT build is derived from the Maven POM files, and so the same Maven 
profiles and variables
-can be set to control the SBT build. For example:
+Tests are run by default via the [ScalaTest Maven 
plugin](http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin).
 
-    build/sbt -Pyarn -Phadoop-2.3 package
+Some of the tests require Spark to be packaged first, so always run `mvn 
package` with `-DskipTests` the first time.  The following is an example of a 
correct (build, test) sequence:
 
-To avoid the overhead of launching sbt each time you need to re-compile, you 
can launch sbt
-in interactive mode by running `build/sbt`, and then run all build commands at 
the command
-prompt. For more recommendations on reducing build time, refer to the
-[wiki 
page](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-ReducingBuildTimes).
+    ./build/mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive -Phive-thriftserver clean package
+    ./build/mvn -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver test
+
+The ScalaTest plugin also supports running only a specific Scala test suite as 
follows:
+
+    ./build/mvn -P... -Dtest=none -DwildcardSuites=org.apache.spark.repl.ReplSuite test
+    ./build/mvn -P... -Dtest=none -DwildcardSuites=org.apache.spark.repl.* test
+
+or a Java test:
 
-# Testing with SBT
+    ./build/mvn test -P... -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite
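+
+You can also restrict the run to a single submodule with `-pl` (described earlier); as a sketch,
+with an illustrative module and suite:
+
+    ./build/mvn -pl :spark-core_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test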
+
+## Testing with SBT
 
 Some of the tests require Spark to be packaged first, so always run `build/sbt 
package` the first time.  The following is an example of a correct (build, 
test) sequence:
 
-    build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver package
-    build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver test
+    ./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver package
+    ./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver test
 
 To run only a specific test suite:
 
-    build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver "test-only org.apache.spark.repl.ReplSuite"
+    ./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver "test-only org.apache.spark.repl.ReplSuite"
+    ./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver "test-only org.apache.spark.repl.*"
 
 To run the test suites of a specific sub-project:
 
-    build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver core/test
+    ./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver core/test
 
-# Speeding up Compilation with Zinc
+## Running Java 8 Test Suites
 
-[Zinc](https://github.com/typesafehub/zinc) is a long-running server version 
of SBT's incremental
-compiler. When run locally as a background process, it speeds up builds of 
Scala-based projects
-like Spark. Developers who regularly recompile Spark with Maven will be the 
most interested in
-Zinc. The project site gives instructions for building and running `zinc`; OS 
X users can
-install it using `brew install zinc`.
+Running only Java 8 tests and nothing else.
 
-If using the `build/mvn` package `zinc` will automatically be downloaded and 
leveraged for all
-builds. This process will auto-start after the first time `build/mvn` is 
called and bind to port
-3030 unless the `ZINC_PORT` environment variable is set. The `zinc` process 
can subsequently be
-shut down at any time by running `build/zinc-<version>/bin/zinc -shutdown` and 
will automatically
-restart whenever `build/mvn` is called.
+    ./build/mvn install -DskipTests
+    ./build/mvn -pl :java8-tests_2.11 test
+
+or
+
+    ./build/sbt java8-tests/test
+
+Java 8 tests are automatically enabled when a Java 8 JDK is detected.
+If you have JDK 8 installed but it is not the system default, you can set 
JAVA_HOME to point to JDK 8 before running the tests.
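+
+For example (a sketch; the JDK path below is illustrative, so substitute your own JDK 8 location):
+
+    JAVA_HOME=/usr/lib/jvm/java-8-openjdk ./build/mvn -pl :java8-tests_2.11 test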
+
+## PySpark Tests with Maven
+
+If you are building PySpark and wish to run the PySpark tests, you will need to build Spark with Hive support.
+
+    ./build/mvn -DskipTests clean package -Phive
+    ./python/run-tests
+
+The run-tests script can also be limited to a specific Python version or a specific module:
+
+    ./python/run-tests --python-executables=python --modules=pyspark-sql
+
+**Note:** You can also run Python tests with an sbt build, provided you build 
Spark with Hive support.
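+
+A sketch of that sbt variant:
+
+    ./build/sbt -Phive clean package
+    ./python/run-tests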
+
+## Running R Tests
+
+To run the SparkR tests, you will need to install the R package `testthat`
+(run `install.packages("testthat")` from the R shell). You can run just the SparkR
+tests using the command:
+
+    ./R/run-tests.sh
+
+## Running Docker-based Integration Test Suites
+
+In order to run Docker integration tests, you have to install the `docker` 
engine on your box. 
+The instructions for installation can be found at [the Docker 
site](https://docs.docker.com/engine/installation/). 
+Once installed, the `docker` service needs to be started, if not already 
running. 
+On Linux, this can be done by `sudo service docker start`.
+
+    ./build/mvn install -DskipTests
+    ./build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11
+
+or
+
+    ./build/sbt docker-integration-tests/test


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
