Repository: incubator-zeppelin

Updated Branches:
  refs/heads/master 617eb947b -> d16ec20fc
Fix pyspark to work on yarn mode when spark version is lower than or equal to 1.4.x

### What is this PR for?
pyspark.zip and py4j-\*.zip must be distributed to the YARN nodes for pyspark to work, but this has been broken since #463 because the [`if (pythonLibs.length == pythonLibUris.size())`](https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java#L329) condition can never be true. This PR fixes the issue by changing the condition to `pythonLibUris.size() == 2`, where 2 refers to the two libraries pyspark.zip and py4j-\*.zip. In addition, the yarn-install documentation has been updated.

### What type of PR is it?
Bug Fix

### Is there a relevant Jira issue?
No, but the issue was reported via the [user mailing list](http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/Can-t-get-Pyspark-1-4-1-interpreter-to-work-on-Zeppelin-0-6-td2229.html#a2259) by Ian Maloney.

### Questions:
* Do the license files need to be updated? No
* Are there breaking changes for older versions? No
* Does this need documentation?
No

Author: Mina Lee <[email protected]>

Closes #736 from minahlee/fix/pyspark_on_yarn and squashes the following commits:

e588f7b [Mina Lee] Merge branch 'master' of https://github.com/apache/incubator-zeppelin into fix/pyspark_on_yarn
2710c46 [Mina Lee] [DOC] Remove invalid information of installation location
c544dec [Mina Lee] [DOC] Remove redundant Zeppelin build information from yarn_install.md
[DOC] Guide users to set SPARK_HOME to use spark in yarn mode
[DOC] Change spark version to the latest in yarn config example
[DOC] Add note that spark for cdh4 doesn't support yarn
[DOC] Remove spark properties `spark.home` and `spark.yarn.jar` from doc which doesn't work on zeppelin anymore
[DOC] Fix typos
[DOC] Add info that embedded spark doesn't work on yarn mode anymore when Spark version is 1.5.0 or higher in README.md
6465ba8 [Mina Lee] Change condition to make pyspark, py4j libraries be distributed to yarn executors

Project: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/commit/d16ec20f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/tree/d16ec20f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/diff/d16ec20f

Branch: refs/heads/master
Commit: d16ec20fcf7c69a97bd90b3faac634098dc58214
Parents: 617eb94
Author: Mina Lee <[email protected]>
Authored: Tue Feb 23 13:35:19 2016 +0900
Committer: Lee moon soo <[email protected]>
Committed: Wed Feb 24 08:44:50 2016 -0800

----------------------------------------------------------------------
 README.md                                       |   2 +-
 docs/install/install.md                         |  14 +-
 docs/install/yarn_install.md                    | 132 ++++---------------
 .../apache/zeppelin/spark/SparkInterpreter.java |   8 +-
 4 files changed, 40 insertions(+), 116 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index ce5926f..cca45d4 100644
--- a/README.md
+++ b/README.md
@@ -104,7 +104,7 @@ minor version can be adjusted by `-Dhadoop.version=x.x.x`
 
 ##### -Pyarn (optional)
 
 enable YARN support for local mode
-
+> YARN for local mode is not supported for Spark v1.5.0 or higher. Set SPARK_HOME instead.
 
 ##### -Ppyspark (optional)


http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/docs/install/install.md
----------------------------------------------------------------------
diff --git a/docs/install/install.md b/docs/install/install.md
index 38752f5..b86c5bb 100644
--- a/docs/install/install.md
+++ b/docs/install/install.md
@@ -22,9 +22,9 @@ limitations under the License.
 
 ## Zeppelin Installation
 
-Welcome to your first trial to explore Zeppelin !
+Welcome to your first trial to explore Zeppelin!
 
-In this documentation, we will explain how you can install Zeppelin from **Binary Package** or build from **Source** by yourself. Plus, you can see all of Zeppelin's configurations in the **Zeppelin Configuration** section below.
+In this documentation, we will explain how you can install Zeppelin from **Binary Package** or build from **Source** by yourself. Plus, you can see all of Zeppelin's configurations in the [Zeppelin Configuration](install.html#zeppelin-configuration) section below.
 
 ### Install with Binary Package
 
@@ -32,9 +32,17 @@ If you want to install Zeppelin with latest binary package, please visit [this p
 
 ### Build from Zeppelin Source
 
-You can also build Zeppelin from the source. Please check instructions in `README.md` in [Zeppelin github](https://github.com/apache/incubator-zeppelin/blob/master/README.md).
+You can also build Zeppelin from the source.
+#### Prerequisites for build
+ * Java 1.7
+ * Git
+ * Maven(3.1.x or higher)
+ * Node.js Package Manager
+If you don't have requirements prepared, please check instructions in [README.md](https://github.com/apache/incubator-zeppelin/blob/master/README.md) for the details.
+
+<a name="zeppelin-configuration"> </a>
 ## Zeppelin Configuration
 
 You can configure Zeppelin with both **environment variables** in `conf/zeppelin-env.sh` and **java properties** in `conf/zeppelin-site.xml`. If both are defined, then the **environment variables** will be used priorly.


http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/docs/install/yarn_install.md
----------------------------------------------------------------------
diff --git a/docs/install/yarn_install.md b/docs/install/yarn_install.md
index 723291f..dd86467 100644
--- a/docs/install/yarn_install.md
+++ b/docs/install/yarn_install.md
@@ -20,7 +20,7 @@ limitations under the License.
 {% include JB/setup %}
 
 ## Introduction
-This page describes how to pre-configure a bare metal node, build & configure Zeppelin on it, configure Zeppelin and connect it to existing YARN cluster running Hortonworks flavour of Hadoop. It also describes steps to configure Spark & Hive interpreter of Zeppelin.
+This page describes how to pre-configure a bare metal node, configure Zeppelin and connect it to existing YARN cluster running Hortonworks flavour of Hadoop. It also describes steps to configure Spark & Hive interpreter of Zeppelin.
 
 ## Prepare Node
 
@@ -44,84 +44,16 @@ Its assumed in the rest of the document that zeppelin user is indeed created and
 
 ### List of Prerequisites
 
- * CentOS 6.x
- * Git
- * Java 1.7
- * Apache Maven
- * Hadoop client.
- * Spark.
+ * CentOS 6.x, Mac OSX, Ubuntu 14.X
+ * Java 1.7
+ * Hadoop client
+ * Spark
  * Internet connection is required.
 
-Its assumed that the node has CentOS 6.x installed on it. Although any version of Linux distribution should work fine. The working directory of all prerequisite pacakges is /home/zeppelin/prerequisites, although any location could be used.
-
-#### Git
-Intall latest stable version of Git. This document describes installation of version 2.4.8
-
-```bash
-yum install curl-devel expat-devel gettext-devel openssl-devel zlib-devel
-yum install gcc perl-ExtUtils-MakeMaker
-yum remove git
-cd /home/zeppelin/prerequisites
-wget https://github.com/git/git/archive/v2.4.8.tar.gz
-tar xzf git-2.0.4.tar.gz
-cd git-2.0.4
-make prefix=/home/zeppelin/prerequisites/git all
-make prefix=/home/zeppelin/prerequisites/git install
-echo "export PATH=$PATH:/home/zeppelin/prerequisites/bin" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-git --version
-```
-
-Assuming all the packages are successfully installed, running the version option with git command should display
-
-```bash
-git version 2.4.8
-```
-
-#### Java
-Zeppelin works well with 1.7.x version of Java runtime. Download JDK version 7 and a stable update and follow below instructions to install it.
-
-```bash
-cd /home/zeppelin/prerequisites/
-#Download JDK 1.7, Assume JDK 7 update 79 is downloaded.
-tar -xf jdk-7u79-linux-x64.tar.gz
-echo "export JAVA_HOME=/home/zeppelin/prerequisites/jdk1.7.0_79" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-echo $JAVA_HOME
-```
-Assuming all the packages are successfully installed, echoing JAVA_HOME environment variable should display
-
-```bash
-/home/zeppelin/prerequisites/jdk1.7.0_79
-```
-
-#### Apache Maven
-Download and install a stable version of Maven.
-
-```bash
-cd /home/zeppelin/prerequisites/
-wget ftp://mirror.reverse.net/pub/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
-tar -xf apache-maven-3.3.3-bin.tar.gz
-cd apache-maven-3.3.3
-export MAVEN_HOME=/home/zeppelin/prerequisites/apache-maven-3.3.3
-echo "export PATH=$PATH:/home/zeppelin/prerequisites/apache-maven-3.3.3/bin" >> /home/zeppelin/.bashrc
-source /home/zeppelin/.bashrc
-mvn -version
-```
-
-Assuming all the packages are successfully installed, running the version option with mvn command should display
-
-```bash
-Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T04:57:37-07:00)
-Maven home: /home/zeppelin/prerequisites/apache-maven-3.3.3
-Java version: 1.7.0_79, vendor: Oracle Corporation
-Java home: /home/zeppelin/prerequisites/jdk1.7.0_79/jre
-Default locale: en_US, platform encoding: UTF-8
-OS name: "linux", version: "2.6.32-358.el6.x86_64", arch: "amd64", family: "unix"
-```
+It's assumed that the node has CentOS 6.x installed on it. Although any version of Linux distribution should work fine.
 
 #### Hadoop client
-Zeppelin can work with multiple versions & distributions of Hadoop. A complete list [is available here.](https://github.com/apache/incubator-zeppelin#build) This document assumes Hadoop 2.7.x client libraries including configuration files are installed on Zeppelin node. It also assumes /etc/hadoop/conf contains various Hadoop configuration files. The location of Hadoop configuration files may vary, hence use appropriate location.
+Zeppelin can work with multiple versions & distributions of Hadoop. A complete list is available [here](https://github.com/apache/incubator-zeppelin#build). This document assumes Hadoop 2.7.x client libraries including configuration files are installed on Zeppelin node. It also assumes /etc/hadoop/conf contains various Hadoop configuration files. The location of Hadoop configuration files may vary, hence use appropriate location.
 ```bash
 hadoop version
@@ -134,32 +66,21 @@ This command was run using /usr/hdp/2.3.1.0-2574/hadoop/lib/hadoop-common-2.7.1.
 ```
 
 #### Spark
-Zeppelin can work with multiple versions Spark. A complete list [is available here.](https://github.com/apache/incubator-zeppelin#build) This document assumes Spark 1.3.1 is installed on Zeppelin node at /home/zeppelin/prerequisites/spark.
-
-## Build
+Spark is supported out of the box and to take advantage of this, you need to Download appropriate version of Spark binary packages from [Spark Download page](http://spark.apache.org/downloads.html) and unzip it.
+Zeppelin can work with multiple versions of Spark. A complete list is available [here](https://github.com/apache/incubator-zeppelin#build).
+This document assumes Spark 1.6.0 is installed at /usr/lib/spark.
+> Note: Spark should be installed on the same node as Zeppelin.
-Checkout source code from [git://git.apache.org/incubator-zeppelin.git](git://git.apache.org/incubator-zeppelin.git).
+> Note: Spark's pre-built package for CDH 4 doesn't support yarn.
-
-```bash
-cd /home/zeppelin/
-git clone git://git.apache.org/incubator-zeppelin.git
-```
-Zeppelin package is available at `/home/zeppelin/incubator-zeppelin` after the checkout completes.
-
-### Cluster mode
+#### Zeppelin
-As its assumed Hadoop 2.7.x is installed on the YARN cluster & Spark 1.3.1 is installed on Zeppelin node. Hence appropriate options are chosen to build Zeppelin. This is very important as Zeppelin will bundle corresponding Hadoop & Spark libraries and they must match the ones present on YARN cluster & Zeppelin Spark installation.
-
-Zeppelin is a maven project and hence must be built with Apache Maven.
-
-```bash
-cd /home/zeppelin/incubator-zeppelin
-mvn clean package -Pspark-1.3 -Dspark.version=1.3.1 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests
-```
-Building Zeppelin for first time downloads various dependencies and hence takes few minutes to complete.
+Checkout source code from [git://git.apache.org/incubator-zeppelin.git](https://github.com/apache/incubator-zeppelin.git) or download binary package from [Download page](https://zeppelin.incubator.apache.org/download.html).
+You can refer [Install](install.html) page for the details.
+This document assumes that Zeppelin is located under `/home/zeppelin/incubator-zeppelin`.
 
 ## Zeppelin Configuration
-Zeppelin configurations needs to be modified to connect to YARN cluster. Create a copy of zeppelin environment XML
+Zeppelin configuration needs to be modified to connect to YARN cluster. Create a copy of zeppelin environment shell script.
 
 ```bash
 cp /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh.template /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh
@@ -168,9 +89,10 @@ cp /home/zeppelin/incubator-zeppelin/conf/zeppelin-env.sh.template /home/zeppeli
 
 Set the following properties
 
 ```bash
-export JAVA_HOME=/home/zeppelin/prerequisites/jdk1.7.0_79
-export HADOOP_CONF_DIR=/etc/hadoop/conf
+export JAVA_HOME="/usr/java/jdk1.7.0_79"
+export HADOOP_CONF_DIR="/etc/hadoop/conf"
 export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.3.1.0-2574"
+export SPARK_HOME="/usr/lib/spark"
 ```
 
 As /etc/hadoop/conf contains various configurations of YARN cluster, Zeppelin can now submit Spark/Hive jobs on YARN cluster form its web interface. The value of hdp.version is set to 2.3.1.0-2574. This can be obtained by running the following command
@@ -196,7 +118,7 @@ bin/zeppelin-daemon.sh stop
 ```
 
 ## Interpreter
-Zeppelin provides to various distributed processing frameworks to process data that ranges from Spark, Hive, Tajo, Ignite and Lens to name a few. This document describes to configure Hive & Spark interpreters.
+Zeppelin provides various distributed processing frameworks to process data that ranges from Spark, Hive, Tajo, Ignite and Lens to name a few. This document describes to configure Hive & Spark interpreters.
 ### Hive
 Zeppelin supports Hive interpreter and hence copy hive-site.xml that should be present at /etc/hive/conf to the configuration folder of Zeppelin. Once Zeppelin is built it will have conf folder under /home/zeppelin/incubator-zeppelin.
@@ -209,7 +131,7 @@ Once Zeppelin server has started successfully, visit http://[zeppelin-server-hos
 Click on Save button. Once these configurations are updated, Zeppelin will prompt you to restart the interpreter. Accept the prompt and the interpreter will reload the configurations.
 
 ### Spark
-Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spark is installed at /home/zeppelin/prerequisites/spark. Look for Spark configrations and click edit button to add the following properties
+It was assumed that 1.6.0 version of Spark is installed at /usr/lib/spark. Look for Spark configurations and click edit button to add the following properties
 
 <table class="table-configuration">
   <tr>
@@ -223,11 +145,6 @@ Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spa
     <td>In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.</td>
   </tr>
   <tr>
-    <td>spark.home</td>
-    <td>/home/zeppelin/prerequisites/spark</td>
-    <td></td>
-  </tr>
-  <tr>
     <td>spark.driver.extraJavaOptions</td>
     <td>-Dhdp.version=2.3.1.0-2574</td>
     <td></td>
@@ -237,11 +154,6 @@ Zeppelin was built with Spark 1.3.1 and it was assumed that 1.3.1 version of Spa
     <td>-Dhdp.version=2.3.1.0-2574</td>
     <td></td>
   </tr>
-  <tr>
-    <td>spark.yarn.jar</td>
-    <td>/home/zeppelin/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar</td>
-    <td></td>
-  </tr>
 </table>
 
 Click on Save button. Once these configurations are updated, Zeppelin will prompt you to restart the interpreter. Accept the prompt and the interpreter will reload the configurations.
http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/d16ec20f/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
----------------------------------------------------------------------
diff --git a/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java b/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
index a905fb7..1923186 100644
--- a/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
+++ b/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
@@ -326,7 +326,10 @@ public class SparkInterpreter extends Interpreter {
         }
       }
       pythonLibUris.trimToSize();
-      if (pythonLibs.length == pythonLibUris.size()) {
+
+      // Distribute two libraries(pyspark.zip and py4j-*.zip) to workers
+      // when spark version is less than or equal to 1.4.1
+      if (pythonLibUris.size() == 2) {
         try {
           String confValue = conf.get("spark.yarn.dist.files");
           conf.set("spark.yarn.dist.files", confValue + "," + Joiner.on(",").join(pythonLibUris));
@@ -339,7 +342,8 @@ public class SparkInterpreter extends Interpreter {
         conf.set("spark.submit.pyArchives", Joiner.on(":").join(pythonLibs));
       }
 
-      // Distributes needed libraries to workers.
+      // Distributes needed libraries to workers
+      // when spark version is greater than or equal to 1.5.0
       if (getProperty("master").equals("yarn-client")) {
         conf.set("spark.yarn.isPython", "true");
       }
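For readers following the patch above, the effect of the new condition can be sketched standalone. This is a simplified illustration, not the actual interpreter code, and the zip paths below are hypothetical examples of the two python libraries the patch distributes:

```java
import java.util.ArrayList;
import java.util.List;

public class PySparkDistCheck {
    // Sketch of the patched condition: both python libraries
    // (pyspark.zip and a py4j-*.zip) must have been located before
    // their URIs are appended to spark.yarn.dist.files.
    static boolean shouldDistribute(List<String> pythonLibUris) {
        return pythonLibUris.size() == 2;
    }

    public static void main(String[] args) {
        List<String> uris = new ArrayList<>();
        // Hypothetical locations; actual paths depend on the Spark install.
        uris.add("local:/usr/lib/spark/python/lib/pyspark.zip");
        uris.add("local:/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip");
        System.out.println(shouldDistribute(uris)); // prints "true"
    }
}
```

Per the PR description, the old comparison against `pythonLibs.length` could never hold after #463 once the two lists diverged, so checking for exactly the two expected archives is what makes the distribution step fire on Spark 1.4.x and below.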
