This is an automated email from the ASF dual-hosted git repository.

zjffdu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/zeppelin.git


The following commit(s) were added to refs/heads/master by this push:
     new f08c75c  [ZEPPELIN-4440]. Update spark document
f08c75c is described below

commit f08c75ceacacc819ebf51582449ac4eb498e6279
Author: Jeff Zhang <zjf...@apache.org>
AuthorDate: Sat Nov 9 21:51:04 2019 +0800

    [ZEPPELIN-4440]. Update spark document
    
    ### What is this PR for?
    
    This PR refines the Spark document.
    
    ### What type of PR is it?
    [Documentation]
    
    ### Todos
    * [ ] - Task
    
    ### What is the Jira issue?
    * https://issues.apache.org/jira/browse/ZEPPELIN-4440
    
    ### How should this be tested?
    * CI pass
    
    ### Screenshots (if appropriate)
    
    ### Questions:
    * Do the license files need updating? No
    * Are there breaking changes for older versions? No
    * Does this need documentation? No
    
    Author: Jeff Zhang <zjf...@apache.org>
    
    Closes #3577 from zjffdu/ZEPPELIN-4440 and squashes the following commits:
    
    88f7ef725 [Jeff Zhang] [ZEPPELIN-4440]. Update spark document
---
 .../zeppelin/img/docs-img/spark_SPARK_HOME16.png   | Bin 0 -> 123514 bytes
 .../zeppelin/img/docs-img/spark_SPARK_HOME24.png   | Bin 0 -> 122833 bytes
 .../img/docs-img/spark_inline_configuration.png    | Bin 0 -> 38073 bytes
 .../img/docs-img/spark_user_impersonation.png      | Bin 0 -> 68387 bytes
 docs/interpreter/spark.md                          | 335 ++++++++++++++-------
 docs/usage/interpreter/overview.md                 |   2 +-
 .../src/main/resources/interpreter-setting.json    | 125 +++++---
 7 files changed, 301 insertions(+), 161 deletions(-)

diff --git a/docs/assets/themes/zeppelin/img/docs-img/spark_SPARK_HOME16.png 
b/docs/assets/themes/zeppelin/img/docs-img/spark_SPARK_HOME16.png
new file mode 100644
index 0000000..f925d47
Binary files /dev/null and 
b/docs/assets/themes/zeppelin/img/docs-img/spark_SPARK_HOME16.png differ
diff --git a/docs/assets/themes/zeppelin/img/docs-img/spark_SPARK_HOME24.png 
b/docs/assets/themes/zeppelin/img/docs-img/spark_SPARK_HOME24.png
new file mode 100644
index 0000000..0eaa063
Binary files /dev/null and 
b/docs/assets/themes/zeppelin/img/docs-img/spark_SPARK_HOME24.png differ
diff --git 
a/docs/assets/themes/zeppelin/img/docs-img/spark_inline_configuration.png 
b/docs/assets/themes/zeppelin/img/docs-img/spark_inline_configuration.png
new file mode 100644
index 0000000..c02785b
Binary files /dev/null and 
b/docs/assets/themes/zeppelin/img/docs-img/spark_inline_configuration.png differ
diff --git 
a/docs/assets/themes/zeppelin/img/docs-img/spark_user_impersonation.png 
b/docs/assets/themes/zeppelin/img/docs-img/spark_user_impersonation.png
new file mode 100644
index 0000000..f16f402
Binary files /dev/null and 
b/docs/assets/themes/zeppelin/img/docs-img/spark_user_impersonation.png differ
diff --git a/docs/interpreter/spark.md b/docs/interpreter/spark.md
index bd50cb0..ef79959 100644
--- a/docs/interpreter/spark.md
+++ b/docs/interpreter/spark.md
@@ -37,12 +37,7 @@ Apache Spark is supported in Zeppelin with Spark interpreter 
group which consist
   <tr>
     <td>%spark</td>
     <td>SparkInterpreter</td>
-    <td>Creates a SparkContext and provides a Scala environment</td>
-  </tr>
-  <tr>
-    <td>%spark.kotlin</td>
-    <td>KotlinSparkInterpreter</td>
-    <td>Provides a Kotlin environment</td>
+    <td>Creates a SparkContext/SparkSession and provides a Scala 
environment</td>
   </tr>
   <tr>
     <td>%spark.pyspark</td>
@@ -50,6 +45,11 @@ Apache Spark is supported in Zeppelin with Spark interpreter 
group which consist
     <td>Provides a Python environment</td>
   </tr>
   <tr>
+    <td>%spark.ipyspark</td>
+    <td>IPySparkInterpreter</td>
+    <td>Provides an IPython environment</td>
+  </tr>
+  <tr>
     <td>%spark.r</td>
     <td>SparkRInterpreter</td>
     <td>Provides an R environment with SparkR support</td>
@@ -60,9 +60,9 @@ Apache Spark is supported in Zeppelin with Spark interpreter 
group which consist
     <td>Provides a SQL environment</td>
   </tr>
   <tr>
-    <td>%spark.dep</td>
-    <td>DepInterpreter</td>
-    <td>Dependency loader</td>
+    <td>%spark.kotlin</td>
+    <td>KotlinSparkInterpreter</td>
+    <td>Provides a Kotlin environment</td>
   </tr>
 </table>
 
@@ -76,42 +76,58 @@ You can also set other Spark properties which are not 
listed in the table. For a
     <th>Description</th>
   </tr>
   <tr>
-    <td>args</td>
+    <td>`SPARK_HOME`</td>
     <td></td>
-    <td>Spark commandline args</td>
-  </tr>
+    <td>Location of Spark distribution</td>
+  </tr>
+  <tr>
     <td>master</td>
     <td>local[*]</td>
-    <td>Spark master uri. <br/> ex) spark://masterhost:7077</td>
+    <td>Spark master uri. <br/> e.g. spark://master_host:7077</td>
   <tr>
     <td>spark.app.name</td>
     <td>Zeppelin</td>
     <td>The name of spark application.</td>
   </tr>
   <tr>
-    <td>spark.cores.max</td>
-    <td></td>
-    <td>Total number of cores to use. <br/> Empty value uses all available 
core.</td>
+    <td>spark.driver.cores</td>
+    <td>1</td>
+    <td>Number of cores to use for the driver process, only in cluster 
mode.</td>
   </tr>
   <tr>
-    <td>spark.executor.memory </td>
+    <td>spark.driver.memory</td>
     <td>1g</td>
-    <td>Executor memory per worker instance. <br/> ex) 512m, 32g</td>
+    <td>Amount of memory to use for the driver process, i.e. where 
SparkContext is initialized, in the same format as JVM memory strings with a 
size unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g).</td>
   </tr>
   <tr>
-    <td>zeppelin.dep.additionalRemoteRepository</td>
-    <td>spark-packages, <br/> http://dl.bintray.com/spark-packages/maven, 
<br/> false;</td>
-    <td>A list of `id,remote-repository-URL,is-snapshot;` <br/> for each 
remote repository.</td>
+    <td>spark.executor.cores</td>
+    <td>1</td>
+    <td>The number of cores to use on each executor</td>
   </tr>
   <tr>
-    <td>zeppelin.dep.localrepo</td>
-    <td>local-repo</td>
-    <td>Local repository for dependency loader</td>
+    <td>spark.executor.memory</td>
+    <td>1g</td>
+    <td>Executor memory per worker instance. <br/> e.g. 512m, 32g</td>
+  </tr>
+  <tr>
+    <td>spark.files</td>
+    <td></td>
+    <td>Comma-separated list of files to be placed in the working directory of 
each executor. Globs are allowed.</td>
+  </tr>
+  <tr>
+    <td>spark.jars</td>
+    <td></td>
+    <td>Comma-separated list of jars to include on the driver and executor 
classpaths. Globs are allowed.</td>
+  </tr>
+  <tr>
+    <td>spark.jars.packages</td>
+    <td></td>
+    <td>Comma-separated list of Maven coordinates of jars to include on the 
driver and executor classpaths. The coordinates should be 
groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will 
be resolved according to the configuration in the file, otherwise artifacts 
will be searched for in the local maven repo, then maven central and finally 
any additional remote repositories given by the command-line option 
--repositories.</td>
   </tr>
   <tr>
     <td>`PYSPARK_PYTHON`</td>
     <td>python</td>
-    <td>Python binary executable to use for PySpark in both driver and workers 
(default is <code>python</code>).
+    <td>Python binary executable to use for PySpark in both driver and 
executors (default is <code>python</code>).
             Property <code>spark.pyspark.python</code> take precedence if it 
is set</td>
   </tr>
   <tr>
@@ -121,6 +137,16 @@ You can also set other Spark properties which are not 
listed in the table. For a
             Property <code>spark.pyspark.driver.python</code> take precedence 
if it is set</td>
   </tr>
   <tr>
+    <td>zeppelin.pyspark.useIPython</td>
+    <td>true</td>
+    <td>Whether to use IPython when the IPython prerequisites are met in `%spark.pyspark`</td>
+  </tr>
+  <tr>
+    <td>zeppelin.R.cmd</td>
+    <td>R</td>
+    <td>R binary executable path.</td>
+  </tr>  
+  <tr>
     <td>zeppelin.spark.concurrentSQL</td>
     <td>false</td>
     <td>Execute multiple SQL concurrently if set true.</td>
@@ -133,22 +159,17 @@ You can also set other Spark properties which are not 
listed in the table. For a
   <tr>
     <td>zeppelin.spark.maxResult</td>
     <td>1000</td>
-    <td>Max number of Spark SQL result to display.</td>
+    <td>Max number of rows of Spark SQL result to display.</td>
   </tr>
   <tr>
     <td>zeppelin.spark.printREPLOutput</td>
     <td>true</td>
-    <td>Print REPL output</td>
+    <td>Print Scala REPL output</td>
   </tr>
   <tr>
     <td>zeppelin.spark.useHiveContext</td>
     <td>true</td>
-    <td>Use HiveContext instead of SQLContext if it is true.</td>
-  </tr>
-  <tr>
-    <td>zeppelin.spark.importImplicit</td>
-    <td>true</td>
-    <td>Import implicits, UDF collection, and sql if set true.</td>
+    <td>Use HiveContext instead of SQLContext if it is true, i.e. enable Hive support for SparkSession.</td>
   </tr>
   <tr>
     <td>zeppelin.spark.enableSupportedVersionCheck</td>
@@ -158,47 +179,68 @@ You can also set other Spark properties which are not 
listed in the table. For a
   <tr>
     <td>zeppelin.spark.sql.interpolation</td>
     <td>false</td>
-    <td>Enable ZeppelinContext variable interpolation into paragraph text</td>
+    <td>Enable ZeppelinContext variable interpolation into Spark SQL</td>
   </tr>
   <tr>
   <td>zeppelin.spark.uiWebUrl</td>
     <td></td>
     <td>Overrides Spark UI default URL. Value should be a full URL (ex: 
http://{hostName}/{uniquePath}</td>
   </tr>
-  <td>zeppelin.spark.scala.color</td>
-    <td>true</td>
-    <td>Whether to enable color output of spark scala interpreter</td>
-  </tr>
 </table>
 
 Without any configuration, Spark interpreter works out of box in local mode. 
But if you want to connect to your Spark cluster, you'll need to follow below 
two simple steps.
 
-### 1. Export SPARK_HOME
-In `conf/zeppelin-env.sh`, export `SPARK_HOME` environment variable with your 
Spark installation path.
+### Export SPARK_HOME
 
-For example,
+There are several options for setting `SPARK_HOME`.
+
+* Set `SPARK_HOME` in `zeppelin-env.sh`
+* Set `SPARK_HOME` in Interpreter setting page
+* Set `SPARK_HOME` via [inline generic 
configuration](../usage/interpreter/overview.html#inline-generic-confinterpreter)
 
+
+#### 1. Set `SPARK_HOME` in `zeppelin-env.sh`
+
+If you work with only one version of Spark, then you can set `SPARK_HOME` in `zeppelin-env.sh`, because any setting in `zeppelin-env.sh` is applied globally.
+
+e.g. 
 
 ```bash
 export SPARK_HOME=/usr/lib/spark
 ```
 
-You can optionally set more environment variables
+You can optionally set more environment variables in `zeppelin-env.sh`
 
 ```bash
 # set hadoop conf dir
 export HADOOP_CONF_DIR=/usr/lib/hadoop
 
-# set options to pass spark-submit command
-export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0"
-
-# extra classpath. e.g. set classpath for hive-site.xml
-export ZEPPELIN_INTP_CLASSPATH_OVERRIDES=/etc/hive/conf
 ```
 
-For Windows, ensure you have `winutils.exe` in `%HADOOP_HOME%\bin`. Please see 
[Problems running Hadoop on 
Windows](https://wiki.apache.org/hadoop/WindowsProblems) for the details.
 
-### 2. Set master in Interpreter menu
-After start Zeppelin, go to **Interpreter** menu and edit **master** property 
in your Spark interpreter setting. The value may vary depending on your Spark 
cluster deployment type.
+#### 2. Set `SPARK_HOME` in Interpreter setting page
+
+If you want to use multiple versions of Spark, then you need to create multiple Spark interpreters and set `SPARK_HOME` for each of them. e.g.
+Create a new Spark interpreter `spark24` for Spark 2.4 and set `SPARK_HOME` in its interpreter setting page:
+<center>
+<img 
src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/spark_SPARK_HOME24.png" 
width="80%">
+</center>
+
+Create a new Spark interpreter `spark16` for Spark 1.6 and set `SPARK_HOME` in its interpreter setting page:
+<center>
+<img 
src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/spark_SPARK_HOME16.png" 
width="80%">
+</center>
+
+
+#### 3. Set `SPARK_HOME` via [inline generic 
configuration](../usage/interpreter/overview.html#inline-generic-confinterpreter)
 
+
+Besides setting `SPARK_HOME` in the interpreter setting page, you can also use inline generic configuration to keep the
+configuration together with the code for more flexibility. e.g.
+<center>
+<img 
src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/spark_inline_configuration.png"
 width="80%">
+</center>
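+
+In text form, a minimal sketch of such a paragraph (assuming the conf interpreter is bound as `%spark.conf`; the path is illustrative) looks like:
+
+```
+%spark.conf
+SPARK_HOME /path/to/another/spark
+```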
+
+### Set master in Interpreter menu
+After starting Zeppelin, go to the **Interpreter** menu and edit the **master** property in your Spark interpreter setting. The value may vary depending on your Spark cluster deployment type.
 
 For example,
 
@@ -213,93 +255,132 @@ For the further information about Spark & Zeppelin 
version compatibility, please
 
 > Note that without exporting `SPARK_HOME`, it's running in local mode with 
 > included version of Spark. The included version may vary depending on the 
 > build profile.
 
-### 3. Yarn mode
-Zeppelin support both yarn client and yarn cluster mode (yarn cluster mode is 
supported from 0.8.0). For yarn mode, you must specify `SPARK_HOME` & 
`HADOOP_CONF_DIR`.
-You can either specify them in `zeppelin-env.sh`, or in interpreter setting 
page. Specifying them in `zeppelin-env.sh` means you can use only one version 
of `spark` & `hadoop`. Specifying them
-in interpreter setting page means you can use multiple versions of `spark` & 
`hadoop` in one zeppelin instance.
-
-### 4. New Version of SparkInterpreter
-Starting from 0.9, we totally removed the old spark interpreter 
implementation, and make the new spark interpreter as the official spark 
interpreter.
-
 ## SparkContext, SQLContext, SparkSession, ZeppelinContext
-SparkContext, SQLContext and ZeppelinContext are automatically created and 
exposed as variable names `sc`, `sqlContext` and `z`, respectively, in Scala, 
Kotlin, Python and R environments.
-Staring from 0.6.1 SparkSession is available as variable `spark` when you are 
using Spark 2.x.
-
-> Note that Scala/Python/R environment shares the same SparkContext, 
SQLContext and ZeppelinContext instance.
 
-<a name="dependencyloading"> </a>
+SparkContext, SQLContext, SparkSession (for spark 2.x) and ZeppelinContext are 
automatically created and exposed as variable names `sc`, `sqlContext`, `spark` 
and `z`, respectively, in Scala, Kotlin, Python and R environments.
 
-### How to pass property to SparkConf
 
-There're 2 kinds of properties that would be passed to SparkConf
+> Note that the Scala/Python/R environments share the same SparkContext, SQLContext, SparkSession and ZeppelinContext instance.
 
- * Standard spark property (prefix with `spark.`). e.g. 
`spark.executor.memory` will be passed to `SparkConf`
- * Non-standard spark property (prefix with `zeppelin.spark.`).  e.g. 
`zeppelin.spark.property_1`, `property_1` will be passed to `SparkConf`
+## YARN Mode
+Zeppelin supports both yarn-client and yarn-cluster mode (yarn-cluster mode is supported from 0.8.0). For yarn mode, you must specify `SPARK_HOME` & `HADOOP_CONF_DIR`.
+Usually you only have one Hadoop cluster, so you can set `HADOOP_CONF_DIR` in `zeppelin-env.sh`, which is applied to all Spark interpreters. If you want to use Spark against multiple Hadoop clusters, then you need to define
+`HADOOP_CONF_DIR` in the interpreter setting or via inline generic configuration, as sketched below.
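+
+As a rough sketch (again assuming the conf interpreter is bound as `%spark.conf`; the path is illustrative), pointing one Spark interpreter at a different Hadoop cluster could look like:
+
+```
+%spark.conf
+HADOOP_CONF_DIR /path/to/other/hadoop/conf
+```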
 
 ## Dependency Management
 
-For spark interpreter, you should not use Zeppelin's [Dependency 
Management](../usage/interpreter/dependency_management.html) for managing 
-third party dependencies, (`%spark.dep` also is not the recommended approach 
starting from Zeppelin 0.8). Instead you should set spark properties 
(`spark.jars`, `spark.files`, `spark.jars.packages`) in 2 ways.
+For the Spark interpreter, it is not recommended to use Zeppelin's [Dependency Management](../usage/interpreter/dependency_management.html) for managing
+third-party dependencies (`%spark.dep` is removed in Zeppelin 0.9 as well). Instead, you should set the standard Spark properties.
 
 <table class="table-configuration">
   <tr>
-    <th>spark-defaults.conf</th>
-    <th>SPARK_SUBMIT_OPTIONS</th>
+    <th>Spark Property</th>
+    <th>Spark Submit Argument</th>
     <th>Description</th>
   </tr>
   <tr>
+    <td>spark.files</td>
+    <td>--files</td>
+    <td>Comma-separated list of files to be placed in the working directory of 
each executor. Globs are allowed.</td>
+  </tr>
+  <tr>
     <td>spark.jars</td>
     <td>--jars</td>
-    <td>Comma-separated list of local jars to include on the driver and 
executor classpaths.</td>
+    <td>Comma-separated list of jars to include on the driver and executor 
classpaths. Globs are allowed.</td>
   </tr>
   <tr>
     <td>spark.jars.packages</td>
     <td>--packages</td>
-    <td>Comma-separated list of maven coordinates of jars to include on the 
driver and executor classpaths. Will search the local maven repo, then maven 
central and any additional remote repositories given by --repositories. The 
format for the coordinates should be 
<code>groupId:artifactId:version</code>.</td>
-  </tr>
-  <tr>
-    <td>spark.files</td>
-    <td>--files</td>
-    <td>Comma-separated list of files to be placed in the working directory of 
each executor.</td>
+    <td>Comma-separated list of Maven coordinates of jars to include on the 
driver and executor classpaths. The coordinates should be 
groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will 
be resolved according to the configuration in the file, otherwise artifacts 
will be searched for in the local maven repo, then maven central and finally 
any additional remote repositories given by the command-line option 
--repositories.</td>
   </tr>
 </table>
 
-### 1. Set spark properties in zeppelin side.
+You can either set Spark properties in the interpreter setting page or set Spark submit arguments in `zeppelin-env.sh` via the environment variable `SPARK_SUBMIT_OPTIONS`.
+For example:
+
+```bash
+export SPARK_SUBMIT_OPTIONS="--files <my_file> --jars <my_jar> --packages 
<my_package>"
+```
+
+However, it is not recommended to set them in `SPARK_SUBMIT_OPTIONS`, because it is shared by all Spark interpreters, which means you cannot set different dependencies for different users. Setting Spark properties on the Zeppelin side is preferred, as sketched below.
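+
+For instance, a rough sketch using inline configuration (assuming the conf interpreter is bound as `%spark.conf`; the jar paths and Maven coordinate are only examples) could be:
+
+```
+%spark.conf
+spark.jars /path/mylib1.jar,/path/mylib2.jar
+spark.jars.packages com.databricks:spark-csv_2.10:1.2.0
+```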
 
-In zeppelin side, you can either set them in spark interpreter setting page or 
via [Generic ConfInterpreter](../usage/interpreter/overview.html).
-It is not recommended to set them in `SPARK_SUBMIT_OPTIONS`. Because it will 
be shared by all spark interpreters, you can not set different dependencies for 
different users.
 
-### 2. Set spark properties in spark side.
+## PySpark
 
-In spark side, you can set them in `spark-defaults.conf`.
+There are two ways to use PySpark in Zeppelin:
 
-e.g.
+* Vanilla PySpark
+* IPySpark
 
-  ```
-    spark.jars        /path/mylib1.jar,/path/mylib2.jar
-    spark.jars.packages   com.databricks:spark-csv_2.10:1.2.0
-    spark.files       /path/mylib1.py,/path/mylib2.egg,/path/mylib3.zip
-  ```
+### Vanilla PySpark (Not Recommended)
+The vanilla PySpark interpreter is almost the same as the vanilla Python interpreter, except that Zeppelin injects SparkContext, SQLContext and SparkSession via the variables `sc`, `sqlContext` and `spark`.
 
+By default, Zeppelin uses IPython in `%spark.pyspark` when IPython is available; otherwise it falls back to the original PySpark implementation.
+If you don't want to use IPython, you can set `zeppelin.pyspark.useIPython` to `false` in the interpreter setting. For the IPython features, refer to the
+[Python Interpreter](python.html) documentation.
 
-## ZeppelinContext
-Zeppelin automatically injects `ZeppelinContext` as variable `z` in your 
Scala/Python environment. `ZeppelinContext` provides some additional functions 
and utilities.
-See [Zeppelin-Context](../usage/other_features/zeppelin_context.html) for more 
details.
+### IPySpark (Recommended)
+You can use IPySpark explicitly via `%spark.ipyspark`. The IPySpark interpreter is almost the same as the IPython interpreter, except that Zeppelin injects SparkContext, SQLContext and SparkSession via the variables `sc`, `sqlContext` and `spark`.
+For the IPython features, refer to the [Python Interpreter](python.html) documentation.
+
+## SparkR
+
+Zeppelin supports SparkR via `%spark.r`. Here is the configuration for the SparkR interpreter.
+
+<table class="table-configuration">
+  <tr>
+    <th>Spark Property</th>
+    <th>Default</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>zeppelin.R.cmd</td>
+    <td>R</td>
+    <td>R binary executable path.</td>
+  </tr>
+  <tr>
+    <td>zeppelin.R.knitr</td>
+    <td>true</td>
+    <td>Whether to use knitr or not. (It is recommended to install knitr and use it in Zeppelin.)</td>
+  </tr>
+  <tr>
+    <td>zeppelin.R.image.width</td>
+    <td>100%</td>
+    <td>R plotting image width.</td>
+  </tr>
+  <tr>
+    <td>zeppelin.R.render.options</td>
+    <td>out.format = 'html', comment = NA, echo = FALSE, results = 'asis', 
message = F, warning = F, fig.retina = 2</td>
+    <td>R plotting options.</td>
+  </tr>
+</table>
+
+
+## SparkSql
 
-## Matplotlib Integration (pyspark)
-Both the `python` and `pyspark` interpreters have built-in support for inline 
visualization using `matplotlib`,
-a popular plotting library for python. More details can be found in the 
[python interpreter documentation](../interpreter/python.html),
-since matplotlib support is identical. More advanced interactive plotting can 
be done with pyspark through
-utilizing Zeppelin's built-in [Angular Display 
System](../usage/display_system/angular_backend.html), as shown below:
+The Spark SQL interpreter shares the same SparkContext/SparkSession with the other Spark interpreters. That means any table registered in Scala, Python or R code can be accessed by Spark SQL.
+For example:
 
-<img class="img-responsive" 
src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/matplotlibAngularExample.gif"
 />
+```scala
+%spark
+
+case class People(name: String, age: Int)
+val df = spark.createDataFrame(List(People("jeff", 23), People("andy", 20)))
+df.createOrReplaceTempView("people")
+```
+
+```sql
+
+%spark.sql
+
+select * from people
+```
 
-## Running spark sql concurrently
 By default, each sql statement would run sequentially in `%spark.sql`. But you 
can run them concurrently by following setup.
 
-1. set `zeppelin.spark.concurrentSQL` to true to enable the sql concurrent 
feature, underneath zeppelin will change to use fairscheduler for spark. And 
also set `zeppelin.spark.concurrentSQL.max` to control the max number of sql 
statements running concurrently.
-2. configure pools by creating `fairscheduler.xml` under your 
`SPARK_CONF_DIR`, check the offical spark doc [Configuring Pool 
Properties](http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties)
-3. set pool property via setting paragraph property. e.g.
+1. Set `zeppelin.spark.concurrentSQL` to true to enable the concurrent SQL feature; under the hood, Zeppelin will switch Spark to the fair scheduler. Also set `zeppelin.spark.concurrentSQL.max` to control the max number of SQL statements running concurrently.
+2. Configure pools by creating `fairscheduler.xml` under your `SPARK_CONF_DIR`; see the official Spark doc [Configuring Pool Properties](http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties).
+3. Set the pool property via the paragraph property, e.g.
 
 ```
 %spark(pool=pool1)
@@ -307,19 +388,44 @@ By default, each sql statement would run sequentially in 
`%spark.sql`. But you c
 sql statement
 ```
 
-This feature is available for both all versions of scala spark, pyspark. For 
sparkr, it is only available starting from 2.3.0.
+This pool feature is also available for all versions of Scala Spark and PySpark. For SparkR, it is only available starting from Spark 2.3.0.
  
-## Interpreter setting option
+## Interpreter Setting Option
 
-You can choose one of `shared`, `scoped` and `isolated` options wheh you 
configure Spark interpreter.
-Spark interpreter creates separated Scala compiler per each notebook but share 
a single SparkContext in `scoped` mode (experimental).
-It creates separated SparkContext per each notebook in `isolated` mode.
+You can choose one of the `shared`, `scoped` and `isolated` options when you configure the Spark interpreter.
+e.g.
 
-## IPython support
+* In `scoped` per-user mode, Zeppelin creates a separate Scala compiler for each user but shares a single SparkContext.
+* In `isolated` per-user mode, Zeppelin creates a separate SparkContext for each user.
 
-By default, zeppelin would use IPython in `pyspark` when IPython is available, 
Otherwise it would fall back to the original PySpark implementation.
-If you don't want to use IPython, then you can set 
`zeppelin.pyspark.useIPython` as `false` in interpreter setting. For the 
IPython features, you can refer doc
-[Python Interpreter](python.html)
+## ZeppelinContext
+Zeppelin automatically injects `ZeppelinContext` as variable `z` in your 
Scala/Python environment. `ZeppelinContext` provides some additional functions 
and utilities.
+See [Zeppelin-Context](../usage/other_features/zeppelin_context.html) for more 
details.
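+
+For instance, a minimal Scala sketch (the DataFrame below is illustrative; `z.show` renders it with Zeppelin's built-in table display):
+
+```scala
+%spark
+
+// create a small DataFrame and display it as a Zeppelin table
+val df = spark.range(3).toDF("id")
+z.show(df)
+```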
+
+## User Impersonation
+
+In yarn mode, the user who launches the Zeppelin server is used to launch the Spark yarn application. This is not a good practice.
+Most of the time, you will enable Shiro in Zeppelin and would like to use the login user to submit the Spark yarn app. For this purpose,
+you need to enable user impersonation for finer-grained security control. To enable user impersonation, follow these steps:
+
+**Step 1** Enable user impersonation in Hadoop's `core-site.xml`. E.g. if you are using user `zeppelin` to launch Zeppelin, then add the following to `core-site.xml` and restart both HDFS and YARN.
+
+```
+<property>
+  <name>hadoop.proxyuser.zeppelin.groups</name>
+  <value>*</value>
+</property>
+<property>
+  <name>hadoop.proxyuser.zeppelin.hosts</name>
+  <value>*</value>
+</property>
+```
+
+**Step 2** Enable user impersonation in the Spark interpreter's setting. (Enable Shiro first, of course.)
+<img 
src="{{BASE_PATH}}/assets/themes/zeppelin/img/docs-img/spark_user_impersonation.png">
+
+**Step 3 (Optional)** If you are using a Kerberos cluster, then you need to set `zeppelin.server.kerberos.keytab` and `zeppelin.server.kerberos.principal` in `zeppelin-site.xml` to the user (i.e. the user in Step 1) you want to
+impersonate.
 
 
 ## Setting up Zeppelin with Kerberos
@@ -338,10 +444,7 @@ You can get rid of this message by setting 
`zeppelin.spark.deprecatedMsg.show` t
 1. On the server that Zeppelin is installed, install Kerberos client modules 
and configuration, krb5.conf.
 This is to make the server communicate with KDC.
 
-2. Set `SPARK_HOME` in `[ZEPPELIN_HOME]/conf/zeppelin-env.sh` to use 
spark-submit
-(Additionally, you might have to set `export HADOOP_CONF_DIR=/etc/hadoop/conf`)
-
-3. Add the two properties below to Spark configuration 
(`[SPARK_HOME]/conf/spark-defaults.conf`):
+2. Add the two properties below to Spark configuration 
(`[SPARK_HOME]/conf/spark-defaults.conf`):
 
     ```
     spark.yarn.principal
@@ -350,5 +453,5 @@ This is to make the server communicate with KDC.
 
   > **NOTE:** If you do not have permission to access for the above 
spark-defaults.conf file, optionally, you can add the above lines to the Spark 
Interpreter setting through the Interpreter tab in the Zeppelin UI.
 
-4. That's it. Play with Zeppelin!
+3. That's it. Play with Zeppelin!
 
diff --git a/docs/usage/interpreter/overview.md 
b/docs/usage/interpreter/overview.md
index 3fe0f5f..ef2eda9 100644
--- a/docs/usage/interpreter/overview.md
+++ b/docs/usage/interpreter/overview.md
@@ -132,7 +132,7 @@ Before 0.8.0, Zeppelin didn't have lifecycle management for 
interpreters. Users
 Users can change this threshold via the 
`zeppelin.interpreter.lifecyclemanager.timeout.threshold` setting. 
`TimeoutLifecycleManager` is the default lifecycle manager, and users can 
change it via `zeppelin.interpreter.lifecyclemanager.class`.
 
 
-## Generic ConfInterpreter
+## Inline Generic ConfInterpreter
 
 Zeppelin's interpreter setting is shared by all users and notes, if you want 
to have different settings, you have to create a new interpreter, e.g. you can 
create `spark_jar1` for running Spark with dependency jar1 and `spark_jar2` for 
running Spark with dependency jar2.
 This approach works, but is not particularly convenient. `ConfInterpreter` can 
provide more fine-grained control on interpreter settings and more flexibility. 
diff --git a/spark/interpreter/src/main/resources/interpreter-setting.json 
b/spark/interpreter/src/main/resources/interpreter-setting.json
index 7739221..5fbccaf 100644
--- a/spark/interpreter/src/main/resources/interpreter-setting.json
+++ b/spark/interpreter/src/main/resources/interpreter-setting.json
@@ -5,6 +5,48 @@
     "className": "org.apache.zeppelin.spark.SparkInterpreter",
     "defaultInterpreter": true,
     "properties": {
+      "SPARK_HOME": {
+        "envName": "SPARK_HOME",
+        "propertyName": "SPARK_HOME",
+        "defaultValue": "",
+        "description": "Location of spark distribution",
+        "type": "string"
+      },
+      "master": {
+        "envName": "",
+        "propertyName": "spark.master",
+        "defaultValue": "local[*]",
+        "description": "Spark master uri. ex) spark://master_host:7077",
+        "type": "string"
+      },
+      "spark.app.name": {
+        "envName": "",
+        "propertyName": "spark.app.name",
+        "defaultValue": "Zeppelin",
+        "description": "The name of spark application.",
+        "type": "string"
+      },
+      "spark.driver.cores": {
+        "envName": "",
+        "propertyName": "spark.driver.cores",
+        "defaultValue": "1",
+        "description": "Number of cores to use for the driver process, only in 
cluster mode.",
+        "type": "int"
+      },
+      "spark.driver.memory": {
+        "envName": "",
+        "propertyName": "spark.driver.memory",
+        "defaultValue": "1g",
+        "description": "Amount of memory to use for the driver process, i.e. 
where SparkContext is initialized, in the same format as JVM memory strings 
with a size unit suffix (\"k\", \"m\", \"g\" or \"t\") (e.g. 512m, 2g).",
+        "type": "string"
+      },
+      "spark.executor.cores": {
+        "envName": null,
+        "propertyName": "spark.executor.cores",
+        "defaultValue": "1",
+        "description": "The number of cores to use on each executor",
+        "type": "int"
+      },
       "spark.executor.memory": {
         "envName": null,
         "propertyName": "spark.executor.memory",
@@ -12,55 +54,50 @@
         "description": "Executor memory per worker instance. ex) 512m, 32g",
         "type": "string"
       },
-      "args": {
+      "spark.files": {
         "envName": null,
-        "propertyName": null,
+        "propertyName": "spark.files",
         "defaultValue": "",
-        "description": "spark commandline args",
-        "type": "textarea"
+        "description": "Comma-separated list of files to be placed in the 
working directory of each executor. Globs are allowed.",
+        "type": "string"
+      },
+      "spark.jars": {
+        "envName": null,
+        "propertyName": "spark.jars",
+        "defaultValue": "",
+        "description": "Comma-separated list of jars to include on the driver 
and executor classpaths. Globs are allowed.",
+        "type": "string"
+      },
+      "spark.jars.packages": {
+        "envName": null,
+        "propertyName": "spark.jars.packages",
+        "defaultValue": "",
+        "description": "Comma-separated list of Maven coordinates of jars to 
include on the driver and executor classpaths. The coordinates should be 
groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will 
be resolved according to the configuration in the file, otherwise artifacts 
will be searched for in the local maven repo, then maven central and finally 
any additional remote repositories given by the command-line option 
--repositories.",
+        "type": "string"
       },
       "zeppelin.spark.useHiveContext": {
-        "envName": "ZEPPELIN_SPARK_USEHIVECONTEXT",
+        "envName": null,
         "propertyName": "zeppelin.spark.useHiveContext",
         "defaultValue": true,
-        "description": "Use HiveContext instead of SQLContext if it is true.",
+        "description": "Use HiveContext instead of SQLContext if it is true. 
Enable hive for SparkSession.",
         "type": "checkbox"
       },
-      "spark.app.name": {
-        "envName": "SPARK_APP_NAME",
-        "propertyName": "spark.app.name",
-        "defaultValue": "Zeppelin",
-        "description": "The name of spark application.",
-        "type": "string"
-      },
+
       "zeppelin.spark.printREPLOutput": {
         "envName": null,
         "propertyName": "zeppelin.spark.printREPLOutput",
         "defaultValue": true,
-        "description": "Print REPL output",
+        "description": "Print scala REPL output",
         "type": "checkbox"
       },
-      "spark.cores.max": {
-        "envName": null,
-        "propertyName": "spark.cores.max",
-        "defaultValue": "",
-        "description": "Total number of cores to use. Empty value uses all 
available core.",
-        "type": "number"
-      },
       "zeppelin.spark.maxResult": {
-        "envName": "ZEPPELIN_SPARK_MAXRESULT",
+        "envName": null,
         "propertyName": "zeppelin.spark.maxResult",
         "defaultValue": "1000",
         "description": "Max number of Spark SQL result to display.",
         "type": "number"
       },
-      "master": {
-        "envName": "MASTER",
-        "propertyName": "spark.master",
-        "defaultValue": "local[*]",
-        "description": "Spark master uri. ex) spark://masterhost:7077",
-        "type": "string"
-      },
+
       "zeppelin.spark.enableSupportedVersionCheck": {
         "envName": null,
         "propertyName": "zeppelin.spark.enableSupportedVersionCheck",
@@ -110,21 +147,21 @@
     "className": "org.apache.zeppelin.spark.SparkSqlInterpreter",
     "properties": {
       "zeppelin.spark.concurrentSQL": {
-        "envName": "ZEPPELIN_SPARK_CONCURRENTSQL",
+        "envName": null,
         "propertyName": "zeppelin.spark.concurrentSQL",
         "defaultValue": false,
         "description": "Execute multiple SQL concurrently if set true.",
         "type": "checkbox"
       },
       "zeppelin.spark.concurrentSQL.max": {
-        "envName": "ZEPPELIN_SPARK_CONCURRENTSQL_MAX",
+        "envName": null,
         "propertyName": "zeppelin.spark.concurrentSQL.max",
         "defaultValue": 10,
         "description": "Max number of SQL concurrently executed",
         "type": "number"
       },
       "zeppelin.spark.sql.stacktrace": {
-        "envName": "ZEPPELIN_SPARK_SQL_STACKTRACE",
+        "envName": null,
         "propertyName": "zeppelin.spark.sql.stacktrace",
         "defaultValue": false,
         "description": "Show full exception stacktrace for SQL queries if set 
to true.",
@@ -134,18 +171,18 @@
         "envName": null,
         "propertyName": "zeppelin.spark.sql.interpolation",
         "defaultValue": false,
-        "description": "Enable ZeppelinContext variable interpolation into 
paragraph text",
+        "description": "Enable ZeppelinContext variable interpolation into 
spark sql",
         "type": "checkbox"
       },
       "zeppelin.spark.maxResult": {
-        "envName": "ZEPPELIN_SPARK_MAXRESULT",
+        "envName": null,
         "propertyName": "zeppelin.spark.maxResult",
         "defaultValue": "1000",
         "description": "Max number of Spark SQL result to display.",
         "type": "number"
       },
       "zeppelin.spark.importImplicit": {
-        "envName": "ZEPPELIN_SPARK_IMPORTIMPLICIT",
+        "envName": null,
         "propertyName": "zeppelin.spark.importImplicit",
         "defaultValue": true,
         "description": "Import implicits, UDF collection, and sql if set true. 
true by default.",
@@ -168,21 +205,21 @@
         "envName": "PYSPARK_PYTHON",
         "propertyName": "PYSPARK_PYTHON",
         "defaultValue": "python",
-        "description": "Python command to run pyspark with",
+        "description": "Python binary executable to use for PySpark in driver 
only (default is `PYSPARK_PYTHON`). Property 
<code>spark.pyspark.driver.python</code> take precedence if it is set",
         "type": "string"
       },
       "PYSPARK_DRIVER_PYTHON": {
         "envName": "PYSPARK_DRIVER_PYTHON",
         "propertyName": "PYSPARK_DRIVER_PYTHON",
         "defaultValue": "python",
-        "description": "Python command to run pyspark with",
+        "description": "Python binary executable to use for PySpark in driver 
only (default is `PYSPARK_PYTHON`). Property 
<code>spark.pyspark.driver.python</code> take precedence if it is set",
         "type": "string"
       },
       "zeppelin.pyspark.useIPython": {
         "envName": null,
         "propertyName": "zeppelin.pyspark.useIPython",
         "defaultValue": true,
-        "description": "whether use IPython when it is available",
+        "description": "Whether use IPython when it is available",
         "type": "checkbox"
       }
     },
@@ -210,28 +247,28 @@
     "className": "org.apache.zeppelin.spark.SparkRInterpreter",
     "properties": {
       "zeppelin.R.knitr": {
-        "envName": "ZEPPELIN_R_KNITR",
+        "envName": null,
         "propertyName": "zeppelin.R.knitr",
         "defaultValue": true,
-        "description": "whether use knitr or not",
+        "description": "Whether use knitr or not",
         "type": "checkbox"
       },
       "zeppelin.R.cmd": {
-        "envName": "ZEPPELIN_R_CMD",
+        "envName": null,
         "propertyName": "zeppelin.R.cmd",
         "defaultValue": "R",
-        "description": "R repl path",
+        "description": "R binary executable path",
         "type": "string"
       },
       "zeppelin.R.image.width": {
-        "envName": "ZEPPELIN_R_IMAGE_WIDTH",
+        "envName": null,
         "propertyName": "zeppelin.R.image.width",
         "defaultValue": "100%",
         "description": "",
         "type": "number"
       },
       "zeppelin.R.render.options": {
-        "envName": "ZEPPELIN_R_RENDER_OPTIONS",
+        "envName": null,
         "propertyName": "zeppelin.R.render.options",
         "defaultValue": "out.format = 'html', comment = NA, echo = FALSE, 
results = 'asis', message = F, warning = F, fig.retina = 2",
         "description": "",
