Repository: incubator-zeppelin

Updated Branches:
  refs/heads/master 617eb947b -> d16ec20fc
Fix pyspark to work on yarn mode when spark version is lower than or equal to 1.4.x

### What is this PR for?
pyspark.zip and py4j-\*.zip must be distributed to the YARN nodes for pyspark to work, but this has been broken since #463 because the [`if (pythonLibs.length == pythonLibUris.size())`](https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java#L329) condition can never be true. This PR fixes the issue by changing the condition to `pythonLibUris.size() == 2`, where 2 refers to the two libraries pyspark.zip and py4j-\*.zip. In addition, the yarn-install documentation has been updated.

### What type of PR is it?
Bug Fix

### Is there a relevant Jira issue?
No, but the issue was reported via the [user mailing list](http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/Can-t-get-Pyspark-1-4-1-interpreter-to-work-on-Zeppelin-0-6-td2229.html#a2259) by Ian Maloney.

### Questions:
* Do the license files need to be updated? No
* Are there breaking changes for older versions? No
* Does this need documentation?
No

Author: Mina Lee <[email protected]>

Closes #736 from minahlee/fix/pyspark_on_yarn and squashes the following commits:

e588f7b [Mina Lee] Merge branch 'master' of https://github.com/apache/incubator-zeppelin into fix/pyspark_on_yarn
2710c46 [Mina Lee] [DOC] Remove invalid information of installation location
c544dec [Mina Lee] [DOC] Remove redundant Zeppelin build information from yarn_install.md
[DOC] Guide users to set SPARK_HOME to use spark in yarn mode
[DOC] Change spark version to the latest in yarn config example
[DOC] Add note that spark for cdh4 doesn't support yarn
[DOC] Remove spark properties `spark.home` and `spark.yarn.jar` from doc which doesn't work on zeppelin anymore
[DOC] Fix typos
[DOC] Add info that embedded spark doesn't work on yarn mode anymore when Spark version is 1.5.0 or higher in README.md
6465ba8 [Mina Lee] Change condition to make pyspark, py4j libraries be distributed to yarn executors

Project: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/commit/d16ec20f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/tree/d16ec20f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/diff/d16ec20f

Branch: refs/heads/master
Commit: d16ec20fcf7c69a97bd90b3faac634098dc58214
Parents: 617eb94
Author: Mina Lee <[email protected]>
Authored: Tue Feb 23 13:35:19 2016 +0900
Committer: Lee moon soo <[email protected]>
Committed: Wed Feb 24 08:44:50 2016 -0800

----------------------------------------------------------------------
 README.md                                       |   2 +-
 docs/install/install.md                         |  14 +-
 docs/install/yarn_install.md                    | 132 ++++---------------
 .../apache/zeppelin/spark/SparkInterpreter.java |   8 +-
 4 files changed, 40 insertions(+), 116 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index ce5926f..cca45d4 100644
--- a/README.md
+++ b/README.md
@@ -104,7 +104,7 @@ minor version can be adjusted by `-Dhadoop.version=x.x.x`
 
 ##### -Pyarn (optional)
 
 enable YARN support for local mode
-
+> YARN for local mode is not supported for Spark v1.5.0 or higher. Set SPARK_HOME instead.
 
 ##### -Ppyspark (optional)


http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/docs/install/install.md
----------------------------------------------------------------------
diff --git a/docs/install/install.md b/docs/install/install.md
index 38752f5..b86c5bb 100644
--- a/docs/install/install.md
+++ b/docs/install/install.md
@@ -22,9 +22,9 @@ limitations under the License.
 
 ## Zeppelin Installation
 
-Welcome to your first trial to explore Zeppelin !
+Welcome to your first trial to explore Zeppelin!
 
-In this documentation, we will explain how you can install Zeppelin from **Binary Package** or build from **Source** by yourself. Plus, you can see all of Zeppelin's configurations in the **Zeppelin Configuration** section below.
+In this documentation, we will explain how you can install Zeppelin from **Binary Package** or build from **Source** by yourself. Plus, you can see all of Zeppelin's configurations in the [Zeppelin Configuration](install.html#zeppelin-configuration) section below.
 
 ### Install with Binary Package
 
@@ -32,9 +32,17 @@ If you want to install Zeppelin with latest binary package, please visit [this p
 
 ### Build from Zeppelin Source
 
-You can also build Zeppelin from the source. Please check instructions in `README.md` in [Zeppelin github](https://github.com/apache/incubator-zeppelin/blob/master/README.md).
+You can also build Zeppelin from the source.
+#### Prerequisites for build
+ * Java 1.7
+ * Git
+ * Maven(3.1.x or higher)
+ * Node.js Package Manager
+If you don't have requirements prepared, please check instructions in [README.md](https://github.com/apache/incubator-zeppelin/blob/master/README.md) for the details.
+
+<a name="zeppelin-configuration"> </a>
 ## Zeppelin Configuration
 
 You can configure Zeppelin with both **environment variables** in `conf/zeppelin-env.sh` and **java properties** in `conf/zeppelin-site.xml`. If both are defined, then the **environment variables** will be used priorly.


http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/docs/install/yarn_install.md
----------------------------------------------------------------------
diff --git a/docs/install/yarn_install.md b/docs/install/yarn_install.md
index 723291f..dd86467 100644
--- a/docs/install/yarn_install.md
+++ b/docs/install/yarn_install.md
@@ -20,7 +20,7 @@ limitations under the License.
 {% include JB/setup %}
 
 ## Introduction
-This page describes how to pre-configure a bare metal node, build & configure Zeppelin on it, configure Zeppelin and connect it to existing YARN cluster running Hortonworks flavour of Hadoop. It also describes steps to configure Spark & Hive interpreter of Zeppelin.
+This page describes how to pre-configure a bare metal node, configure Zeppelin and connect it to existing YARN cluster running Hortonworks flavour of Hadoop. It also describes steps to configure Spark & Hive interpreter of Zeppelin.
 
 ## Prepare Node
 
@@ -44,84 +44,16 @@ Its assumed in the rest of the document that zeppelin user is indeed created and
 
 ### List of Prerequisites
 
- * CentOS 6.x
- * Git
- * Java 1.7
- * Apache Maven
- * Hadoop client.
- * Spark.
+ * CentOS 6.x, Mac OSX, Ubuntu 14.X
+ * Java 1.7
+ * Hadoop client
+ * Spark
  * Internet connection is required.
 
-Its assumed that the node has CentOS 6.x installed on it. Although any version of Linux distribution should work fine. The working directory of all prerequisite pacakges is /home/zeppelin/prerequisites, although any location could be used.
-
-#### Git
-Intall latest stable version of Git. This document describes installation of version 2.4.8
-
-```bash
-yum install curl-devel expat-devel gettext-devel openssl-devel zlib-devel
-yum install gcc perl-ExtUtils-MakeMaker
-yum remove git
-cd /home/zeppelin/prerequisites
-wget https://github.com/git/git/archive/v2.4.8.tar.gz
-tar xzf git-2.0.4.tar.gz
-cd git-2.0.4
-make prefix=/home/zeppelin/prerequisites/git all
-make prefix=/home/zeppelin/prerequisites/git install
-echo "export PATH=$PATH:/home/zeppelin/prerequisites/bin" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-git --version
-```
-
-Assuming all the packages are successfully installed, running the version option with git command should display
-
-```bash
-git version 2.4.8
-```
-
-#### Java
-Zeppelin works well with 1.7.x version of Java runtime. Download JDK version 7 and a stable update and follow below instructions to install it.
-
-```bash
-cd /home/zeppelin/prerequisites/
-#Download JDK 1.7, Assume JDK 7 update 79 is downloaded.
-tar -xf jdk-7u79-linux-x64.tar.gz
-echo "export JAVA_HOME=/home/zeppelin/prerequisites/jdk1.7.0_79" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-echo $JAVA_HOME
-```
-Assuming all the packages are successfully installed, echoing JAVA_HOME environment variable should display
-
-```bash
-/home/zeppelin/prerequisites/jdk1.7.0_79
-```
-
-#### Apache Maven
-Download and install a stable version of Maven.
-
-```bash
-cd /home/zeppelin/prerequisites/
-wget ftp://mirror.reverse.net/pub/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
-tar -xf apache-maven-3.3.3-bin.tar.gz
-cd apache-maven-3.3.3
-export MAVEN_HOME=/home/zeppelin/prerequisites/apache-maven-3.3.3
-echo "export PATH=$PATH:/home/zeppelin/prerequisites/apache-maven-3.3.3/bin" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-mvn -version
-```
-
-Assuming all the packages are successfully installed, running the version option with mvn command should display
-
-```bash
-Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T04:57:37-07:00)
-Maven home: /home/zeppelin/prerequisites/apache-maven-3.3.3
-Java version: 1.7.0_79, vendor: Oracle Corporation
-Java home: /home/zeppelin/prerequisites/jdk1.7.0_79/jre
-Default locale: en_US, platform encoding: UTF-8
-OS name: "linux", version: "2.6.32-358.el6.x86_64", arch: "amd64", family: "unix"
-```
+It's assumed that the node has CentOS 6.x installed on it. Although any version of Linux distribution should work fine.
 
 #### Hadoop client
-Zeppelin can work with multiple versions & distributions of Hadoop. A complete list [is available here.](https://github.com/apache/incubator-zeppelin#build) This document assumes Hadoop 2.7.x client libraries including configuration files are installed on Zeppelin node. It also assumes /etc/hadoop/conf contains various Hadoop configuration files. The location of Hadoop configuration files may vary, hence use appropriate location.
+Zeppelin can work with multiple versions & distributions of Hadoop. A complete list is available [here](https://github.com/apache/incubator-zeppelin#build). This document assumes Hadoop 2.7.x client libraries including configuration files are installed on Zeppelin node. It also assumes /etc/hadoop/conf contains various Hadoop configuration files. The location of Hadoop configuration files may vary, hence use appropriate location.
 ```bash
 hadoop version
@@ -134,32 +66,21 @@ This command was run using /usr/hdp/2.3.1.0-2574/hadoop/lib/hadoop-common-2.7.1.
 ```
 
 #### Spark
-Zeppelin can work with multiple versions Spark. A complete list [is available here.](https://github.com/apache/incubator-zeppelin#build) This document assumes Spark 1.3.1 is installed on Zeppelin node at /home/zeppelin/prerequisites/spark.
-
-## Build
+Spark is supported out of the box and to take advantage of this, you need to Download appropriate version of Spark binary packages from [Spark Download page](http://spark.apache.org/downloads.html) and unzip it.
+Zeppelin can work with multiple versions of Spark. A complete list is available [here](https://github.com/apache/incubator-zeppelin#build).
+This document assumes Spark 1.6.0 is installed at /usr/lib/spark.
+> Note: Spark should be installed on the same node as Zeppelin.
-Checkout source code from [git://git.apache.org/incubator-zeppelin.git](git://git.apache.org/incubator-zeppelin.git).
+> Note: Spark's pre-built package for CDH 4 doesn't support yarn.
-
-```bash
-cd /home/zeppelin/
-git clone git://git.apache.org/incubator-zeppelin.git
-```
-Zeppelin package is available at `/home/zeppelin/incubator-zeppelin` after the checkout completes.
-
-### Cluster mode
+#### Zeppelin
-As its assumed Hadoop 2.7.x is installed on the YARN cluster & Spark 1.3.1 is installed on Zeppelin node. Hence appropriate options are chosen to build Zeppelin. This is very important as Zeppelin will bundle corresponding Hadoop & Spark libraries and they must match the ones present on YARN cluster & Zeppelin Spark installation.
-
-Zeppelin is a maven project and hence must be built with Apache Maven.
-
-```bash
-cd /home/zeppelin/incubator-zeppelin
-mvn clean package -Pspark-1.3 -Dspark.version=1.3.1 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests
-```
-Building Zeppelin for first time downloads various dependencies and hence takes few minutes to complete.
+Checkout source code from [git://git.apache.org/incubator-zeppelin.git](https://github.com/apache/incubator-zeppelin.git) or download binary package from [Download page](https://zeppelin.incubator.apache.org/download.html).
+You can refer [Install](install.html) page for the details.
+This document assumes that Zeppelin is located under `/home/zeppelin/incubator-zeppelin`.
 
 ## Zeppelin Configuration
-Zeppelin configurations needs to be modified to connect to YARN cluster. Create a copy of zeppelin environment XML
+Zeppelin configuration needs to be modified to connect to YARN cluster. Create a copy of zeppelin environment shell script.
 
 ```bash
 cp /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh.template /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh
@@ -168,9 +89,10 @@ cp /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh.template /home/zeppeli
 
 Set the following properties
 
 ```bash
-export JAVA_HOME=/home/zeppelin/prerequisites/jdk1.7.0_79
-export HADOOP_CONF_DIR=/etc/hadoop/conf
+export JAVA_HOME="/usr/java/jdk1.7.0_79"
+export HADOOP_CONF_DIR="/etc/hadoop/conf"
 export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.1.0-2574"
+export SPARK_HOME="/usr/lib/spark"
 ```
 
 As /etc/hadoop/conf contains various configurations of YARN cluster, Zeppelin can now submit Spark/Hive jobs on YARN cluster form its web interface. The value of hdp.version is set to 2.3.1.0-2574. This can be obtained by running the following command
@@ -196,7 +118,7 @@ bin/zeppelin-daemon.sh stop
 ```
 
 ## Interpreter
-Zeppelin provides to various distributed processing frameworks to process data that ranges from Spark, Hive, Tajo, Ignite and Lens to name a few. This document describes to configure Hive & Spark interpreters.
+Zeppelin provides various distributed processing frameworks to process data that ranges from Spark, Hive, Tajo, Ignite and Lens to name a few. This document describes to configure Hive & Spark interpreters.
 ### Hive
 Zeppelin supports Hive interpreter and hence copy hive-site.xml that should be present at /etc/hive/conf to the configuration folder of Zeppelin. Once Zeppelin is built it will have conf folder under /home/zeppelin/incubator-zeppelin.
@@ -209,7 +131,7 @@ Once Zeppelin server has started successfully, visit http://[zeppelin-server-hos
 Click on Save button. Once these configurations are updated, Zeppelin will prompt you to restart the interpreter. Accept the prompt and the interpreter will reload the configurations.
 
 ### Spark
-Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spark is installed at /home/zeppelin/prerequisites/spark. Look for Spark configrations and click edit button to add the following properties
+It was assumed that 1.6.0 version of Spark is installed at /usr/lib/spark. Look for Spark configurations and click edit button to add the following properties
 
 <table class="table-configuration">
   <tr>
@@ -223,11 +145,6 @@ Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spa
     <td>In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.</td>
   </tr>
   <tr>
-    <td>spark.home</td>
-    <td>/home/zeppelin/prerequisites/spark</td>
-    <td></td>
-  </tr>
-  <tr>
     <td>spark.driver.extraJavaOptions</td>
     <td>-Dhdp.version=2.3.1.0-2574</td>
     <td></td>
@@ -237,11 +154,6 @@ Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spa
     <td>-Dhdp.version=2.3.1.0-2574</td>
     <td></td>
   </tr>
-  <tr>
-    <td>spark.yarn.jar</td>
-    <td>/home/zeppelin/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar</td>
-    <td></td>
-  </tr>
 </table>
 
 Click on Save button. Once these configurations are updated, Zeppelin will prompt you to restart the interpreter. Accept the prompt and the interpreter will reload the configurations.
http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
----------------------------------------------------------------------
diff --git a/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java b/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
index a905fb7..1923186 100644
--- a/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
+++ b/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
@@ -326,7 +326,10 @@ public class SparkInterpreter extends Interpreter {
         }
       }
       pythonLibUris.trimToSize();
-      if (pythonLibs.length == pythonLibUris.size()) {
+
+      // Distribute two libraries(pyspark.zip and py4j-*.zip) to workers
+      // when spark version is less than or equal to 1.4.1
+      if (pythonLibUris.size() == 2) {
         try {
           String confValue = conf.get("spark.yarn.dist.files");
           conf.set("spark.yarn.dist.files", confValue + "," + Joiner.on(",").join(pythonLibUris));
@@ -339,7 +342,8 @@ public class SparkInterpreter extends Interpreter {
         conf.set("spark.submit.pyArchives", Joiner.on(":").join(pythonLibs));
       }
 
-      // Distributes needed libraries to workers.
+      // Distributes needed libraries to workers
+      // when spark version is greater than or equal to 1.5.0
       if (getProperty("master").equals("yarn-client")) {
         conf.set("spark.yarn.isPython", "true");
       }
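For readers following the patch above, the effect of the new condition can be sketched standalone. This is a simplified illustration, not the actual interpreter code, and the zip paths below are hypothetical examples of the two python libraries the patch distributes:

```java
import java.util.ArrayList;
import java.util.List;

public class PySparkDistCheck {
    // Sketch of the patched condition: both python libraries
    // (pyspark.zip and a py4j-*.zip) must have been located before
    // their URIs are appended to spark.yarn.dist.files.
    static boolean shouldDistribute(List<String> pythonLibUris) {
        return pythonLibUris.size() == 2;
    }

    public static void main(String[] args) {
        List<String> uris = new ArrayList<>();
        // Hypothetical locations; actual paths depend on the Spark install.
        uris.add("local:/usr/lib/spark/python/lib/pyspark.zip");
        uris.add("local:/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip");
        System.out.println(shouldDistribute(uris)); // prints "true"
    }
}
```

Per the PR description, the old comparison against `pythonLibs.length` could never hold after #463 once the two lists diverged, so checking for exactly the two expected archives is what makes the distribution step fire on Spark 1.4.x and below.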
