This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git
The following commit(s) were added to refs/heads/master by this push:
     new d02ff60  [SPARK-40513][DOCS] Add apache/spark docker image overview

d02ff60 is described below

commit d02ff6091835311a32c7ccc73d8ebae1d5817ecc
Author: Yikun Jiang <yikunk...@gmail.com>
AuthorDate: Tue Jun 27 14:28:21 2023 +0800

    [SPARK-40513][DOCS] Add apache/spark docker image overview

    ### What changes were proposed in this pull request?
    This PR adds `OVERVIEW.md`.

    ### Why are the changes needed?
    It will be used on the https://hub.docker.com/r/apache/spark page to introduce the Spark Docker image and its tags.

    ### Does this PR introduce _any_ user-facing change?
    Yes, doc only.

    ### How was this patch tested?
    Doc only, review.

    Closes #34 from Yikun/overview.

    Authored-by: Yikun Jiang <yikunk...@gmail.com>
    Signed-off-by: Yikun Jiang <yikunk...@gmail.com>
---
 OVERVIEW.md | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/OVERVIEW.md b/OVERVIEW.md
new file mode 100644
index 0000000..0465555
--- /dev/null
+++ b/OVERVIEW.md
@@ -0,0 +1,83 @@

# What is Apache Spark™?

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

https://spark.apache.org/

## Online Documentation

You can find the latest Spark documentation, including a programming guide, on the [project web page](https://spark.apache.org/documentation.html). This overview contains only basic setup instructions.
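To check that the image works on your machine, you can pull it and print the Spark version it ships. A minimal sketch (`spark-submit --version` just prints the version banner and exits; `--rm` removes the container afterwards):

```
docker pull apache/spark
docker run -it --rm apache/spark /opt/spark/bin/spark-submit --version
```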
## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

```
docker run -it apache/spark /opt/spark/bin/spark-shell
```

Try the following command, which should return 1,000,000,000:

```
scala> spark.range(1000 * 1000 * 1000).count()
```

## Interactive Python Shell

The easiest way to start using PySpark is through the Python shell:

```
docker run -it apache/spark /opt/spark/bin/pyspark
```

And run the following command, which should also return 1,000,000,000:

```
>>> spark.range(1000 * 1000 * 1000).count()
```

## Interactive R Shell

The easiest way to start using R on Spark is through the R shell:

```
docker run -it apache/spark:r /opt/spark/bin/sparkR
```
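As in the Scala and Python shells, you can then run a quick sanity check. A minimal sketch using SparkR's `as.DataFrame` on R's built-in `faithful` dataset, so `count` should return 272 (the number of rows in that dataset):

```
> df <- as.DataFrame(faithful)
> count(df)
```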
## Running Spark on Kubernetes

For running Spark on Kubernetes, see https://spark.apache.org/docs/latest/running-on-kubernetes.html

## Supported tags and respective Dockerfile links

Currently, the `apache/spark` Docker image comes in four variants for each version. For example, for v3.4.0:

- [3.4.0-scala2.12-java11-python3-ubuntu, 3.4.0-python3, 3.4.0, python3, latest](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-python3-ubuntu)
- [3.4.0-scala2.12-java11-r-ubuntu, 3.4.0-r, r](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-r-ubuntu)
- [3.4.0-scala2.12-java11-ubuntu, 3.4.0-scala, scala](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-ubuntu)
- [3.4.0-scala2.12-java11-python3-r-ubuntu](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-python3-r-ubuntu)

## Environment Variables

The environment variables recognized by `entrypoint.sh` are listed below:

| Environment Variable | Meaning |
|----------------------|---------|
| SPARK_EXTRA_CLASSPATH | Extra entries to add to the classpath; see also https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management |
| PYSPARK_PYTHON | Python binary executable to use for PySpark in both driver and workers (default is `python3` if available, otherwise `python`). The property `spark.pyspark.python` takes precedence if it is set |
| PYSPARK_DRIVER_PYTHON | Python binary executable to use for PySpark in the driver only (default is `PYSPARK_PYTHON`). The property `spark.pyspark.driver.python` takes precedence if it is set |
| SPARK_DIST_CLASSPATH | Distribution-defined classpath to add to processes |
| SPARK_DRIVER_BIND_ADDRESS | Hostname or IP address on which to bind listening sockets; see also `spark.driver.bindAddress` |
| SPARK_EXECUTOR_JAVA_OPTS | Java options for the Spark executor |
| SPARK_APPLICATION_ID | A unique identifier for the Spark application |
| SPARK_EXECUTOR_POD_IP | The pod IP address of the Spark executor |
| SPARK_RESOURCE_PROFILE_ID | The resource profile ID |
| SPARK_EXECUTOR_POD_NAME | The executor pod name |
| SPARK_CONF_DIR | Alternate conf dir (default: `${SPARK_HOME}/conf`) |
| SPARK_EXECUTOR_CORES | Number of cores for the executors (default: 1) |
| SPARK_EXECUTOR_MEMORY | Memory per executor (e.g. 1000M, 2G) (default: 1G) |
| SPARK_DRIVER_MEMORY | Memory for the driver (e.g. 1000M, 2G) (default: 1G) |

See also https://spark.apache.org/docs/latest/configuration.html and https://spark.apache.org/docs/latest/running-on-kubernetes.html
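For illustration, such variables can be set when starting the container with docker's `-e` flag. A minimal sketch (the interpreter path `/usr/bin/python3` is an assumption about the image contents, not something this repository guarantees):

```
# Hypothetical example: pin the Python interpreter used by the PySpark shell
docker run -it -e PYSPARK_PYTHON=/usr/bin/python3 apache/spark /opt/spark/bin/pyspark
```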