This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git
The following commit(s) were added to refs/heads/master by this push:
     new 6459e3d  [SPARK-40528] Support dockerfile template
6459e3d is described below

commit 6459e3d09a2e009573be355e63c404bb35139d28
Author: Yikun Jiang <yikunk...@gmail.com>
AuthorDate: Mon Oct 17 16:23:23 2022 +0800

[SPARK-40528] Support dockerfile template

### What changes were proposed in this pull request?
This patch:
- Adds a Dockerfile template: `Dockerfile.template`, a Jinja template with the variables `BASE_IMAGE` (base image name), `SPARK_VERSION` (Spark release to install), `HAVE_PY` (add Python support), and `HAVE_R` (add SparkR support).
- Adds a script, `add-dockerfiles.sh`: run `./add-dockerfiles.sh 3.3.0` to generate the Dockerfiles for a given Spark version.
- Adds a tool, `template.py`, which renders a Dockerfile from the Jinja template.

### Why are the changes needed?
To automate Dockerfile generation, so new Spark versions and image variants can be added without hand-editing each Dockerfile.

### Does this PR introduce _any_ user-facing change?
No, dev only.

### How was this patch tested?
```shell
# Prepare a fresh env
python3 -m venv ~/xxx
source ~/xxx/bin/activate
pip install -r ./tools/requirements.txt

# Generate the 3.3.0 Dockerfiles
./add-dockerfiles.sh 3.3.0

# Expect no diff against the checked-in files
git diff
```

lint:
```
$ flake8 ./tools/template.py
$ black ./tools/template.py
All done! ✨ 🍰 ✨
1 file left unchanged.
```

Closes #12 from Yikun/SPARK-40528.

Authored-by: Yikun Jiang <yikunk...@gmail.com>
Signed-off-by: Yikun Jiang <yikunk...@gmail.com>
---
 Dockerfile.template    |  98 ++++++++++++++++++++++++++++++++++++++++
 add-dockerfiles.sh     |  53 ++++++++++++++++++++
 entrypoint.sh.template | 114 ++++++++++++++++++++++++++++++++++++++++++
 tools/requirements.txt |   1 +
 tools/template.py      |  84 +++++++++++++++++++++++++++++++
 5 files changed, 350 insertions(+)

diff --git a/Dockerfile.template b/Dockerfile.template
new file mode 100644
index 0000000..2001281
--- /dev/null
+++ b/Dockerfile.template
@@ -0,0 +1,98 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM {{ BASE_IMAGE }}
+
+ARG spark_uid=185
+
+RUN groupadd --system --gid=${spark_uid} spark && \
+    useradd --system --uid=${spark_uid} --gid=spark spark
+
+RUN set -ex && \
+    apt-get update && \
+    ln -s /lib /lib64 && \
+    apt-get install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu && \
+    {%- if HAVE_PY %}
+    apt-get install -y python3 python3-pip && \
+    pip3 install --upgrade pip setuptools && \
+    {%- endif %}
+    {%- if HAVE_R %}
+    apt-get install -y r-base r-base-dev && \
+    {%- endif %}
+    mkdir -p /opt/spark && \
+    {%- if HAVE_PY %}
+    mkdir /opt/spark/python && \
+    {%- endif %}
+    mkdir -p /opt/spark/examples && \
+    mkdir -p /opt/spark/work-dir && \
+    touch /opt/spark/RELEASE && \
+    chown -R spark:spark /opt/spark && \
+    rm /bin/sh && \
+    ln -sv /bin/bash /bin/sh && \
+    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
+    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
+    rm -rf /var/cache/apt/*
+
+# Install Apache Spark
+# https://downloads.apache.org/spark/KEYS
+ENV SPARK_TGZ_URL=https://dlcdn.apache.org/spark/spark-{{ SPARK_VERSION }}/spark-{{ SPARK_VERSION }}-bin-hadoop3.tgz \
+    SPARK_TGZ_ASC_URL=https://downloads.apache.org/spark/spark-{{ SPARK_VERSION }}/spark-{{ SPARK_VERSION }}-bin-hadoop3.tgz.asc \
+    GPG_KEY=E298A3A825C0D65DFD57CBB651716619E084DAB9
+
+RUN set -ex; \
+    export SPARK_TMP="$(mktemp -d)"; \
+    cd $SPARK_TMP; \
+    wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
+    wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
+    export GNUPGHOME="$(mktemp -d)"; \
+    gpg --keyserver hkps://keyserver.pgp.com --recv-keys "$GPG_KEY" || \
+    gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
+    gpg --batch --verify spark.tgz.asc spark.tgz; \
+    gpgconf --kill all; \
+    rm -rf "$GNUPGHOME" spark.tgz.asc; \
+    \
+    tar -xf spark.tgz --strip-components=1; \
+    chown -R spark:spark .; \
+    mv jars /opt/spark/; \
+    mv bin /opt/spark/; \
+    mv sbin /opt/spark/; \
+    mv kubernetes/dockerfiles/spark/decom.sh /opt/; \
+    mv examples /opt/spark/; \
+    mv kubernetes/tests /opt/spark/; \
+    mv data /opt/spark/; \
+    {%- if HAVE_PY %}
+    mv python/pyspark /opt/spark/python/pyspark/; \
+    mv python/lib /opt/spark/python/lib/; \
+    {%- endif %}
+    {%- if HAVE_R %}
+    mv R /opt/spark/; \
+    {%- endif %}
+    cd ..; \
+    rm -rf "$SPARK_TMP";
+
+COPY entrypoint.sh /opt/
+
+ENV SPARK_HOME /opt/spark
+{%- if HAVE_R %}
+ENV R_HOME /usr/lib/R
+{%- endif %}
+
+WORKDIR /opt/spark/work-dir
+RUN chmod g+w /opt/spark/work-dir
+RUN chmod a+x /opt/decom.sh
+RUN chmod a+x /opt/entrypoint.sh
+
+ENTRYPOINT [ "/opt/entrypoint.sh" ]
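[Editor's note, not part of the patch: a minimal way to exercise this template end to end is to generate the Dockerfiles with the `add-dockerfiles.sh` script added below, then build and version-check one variant. The `spark-test` image tag is illustrative only.]

```shell
# Build one generated variant and confirm Spark starts via the entrypoint's
# pass-through mode (the "spark-test" tag is an arbitrary local name).
./add-dockerfiles.sh 3.3.0
docker build -t spark-test:scala2.12-java11-ubuntu 3.3.0/scala2.12-java11-ubuntu
docker run --rm spark-test:scala2.12-java11-ubuntu /opt/spark/bin/spark-submit --version
```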
diff --git a/add-dockerfiles.sh b/add-dockerfiles.sh
new file mode 100755
index 0000000..4829ecd
--- /dev/null
+++ b/add-dockerfiles.sh
@@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Usage: $0 [version]
+# Generate the Dockerfiles for the specified Spark version.
+#
+# Examples:
+# - Add 3.3.0 dockerfiles:
+#   $ ./add-dockerfiles.sh
+# - Add 3.3.1 dockerfiles:
+#   $ ./add-dockerfiles.sh 3.3.1
+
+VERSION=${1:-"3.3.0"}
+
+TAGS="
+scala2.12-java11-python3-r-ubuntu
+scala2.12-java11-python3-ubuntu
+scala2.12-java11-r-ubuntu
+scala2.12-java11-ubuntu
+"
+
+for TAG in $TAGS; do
+    OPTS=""
+    if echo $TAG | grep -q "python"; then
+        OPTS+=" --pyspark"
+    fi
+
+    if echo $TAG | grep -q "r-"; then
+        OPTS+=" --sparkr"
+    fi
+
+    OPTS+=" --spark-version $VERSION"
+
+    mkdir -p $VERSION/$TAG
+    cp -f entrypoint.sh.template $VERSION/$TAG/entrypoint.sh
+    python3 tools/template.py $OPTS > $VERSION/$TAG/Dockerfile
+done
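[Editor's sketch, derived from the loop above: the script produces one directory per tag, each holding a generated Dockerfile plus a copy of the entrypoint.]

```shell
$ ./add-dockerfiles.sh 3.3.0
$ find 3.3.0 -type f | sort
3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
3.3.0/scala2.12-java11-python3-r-ubuntu/entrypoint.sh
3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile
3.3.0/scala2.12-java11-python3-ubuntu/entrypoint.sh
3.3.0/scala2.12-java11-r-ubuntu/Dockerfile
3.3.0/scala2.12-java11-r-ubuntu/entrypoint.sh
3.3.0/scala2.12-java11-ubuntu/Dockerfile
3.3.0/scala2.12-java11-ubuntu/entrypoint.sh
```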
diff --git a/entrypoint.sh.template b/entrypoint.sh.template
new file mode 100644
index 0000000..4bb1557
--- /dev/null
+++ b/entrypoint.sh.template
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+# turn off -e for getent because it will return error code in anonymous uid case
+set +e
+uidentry=$(getent passwd $myuid)
+set -e
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+    if [ -w /etc/passwd ] ; then
+        echo "$myuid:x:$myuid:$mygid:${SPARK_USER_NAME:-anonymous uid}:$SPARK_HOME:/bin/false" >> /etc/passwd
+    else
+        echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+    fi
+fi
+
+if [ -z "$JAVA_HOME" ]; then
+  JAVA_HOME=$(java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home' | awk '{print $3}')
+fi
+
+SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
+env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt
+readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+
+if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
+  SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
+fi
+
+if ! [ -z ${PYSPARK_PYTHON+x} ]; then
+    export PYSPARK_PYTHON
+fi
+if ! [ -z ${PYSPARK_DRIVER_PYTHON+x} ]; then
+    export PYSPARK_DRIVER_PYTHON
+fi
+
+# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
+# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
+if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
+  export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
+fi
+
+if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
+  SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
+fi
+
+if ! [ -z ${SPARK_CONF_DIR+x} ]; then
+  SPARK_CLASSPATH="$SPARK_CONF_DIR:$SPARK_CLASSPATH";
+elif ! [ -z ${SPARK_HOME+x} ]; then
+  SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
+fi
+
+case "$1" in
+  driver)
+    shift 1
+    CMD=(
+      "$SPARK_HOME/bin/spark-submit"
+      --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+      --deploy-mode client
+      "$@"
+    )
+    ;;
+  executor)
+    shift 1
+    CMD=(
+      ${JAVA_HOME}/bin/java
+      "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
+      -Xms$SPARK_EXECUTOR_MEMORY
+      -Xmx$SPARK_EXECUTOR_MEMORY
+      -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
+      org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend
+      --driver-url $SPARK_DRIVER_URL
+      --executor-id $SPARK_EXECUTOR_ID
+      --cores $SPARK_EXECUTOR_CORES
+      --app-id $SPARK_APPLICATION_ID
+      --hostname $SPARK_EXECUTOR_POD_IP
+      --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
+      --podName $SPARK_EXECUTOR_POD_NAME
+    )
+    ;;
+
+  *)
+    # Non-spark-on-k8s command provided, proceeding in pass-through mode...
+    CMD=("$@")
+    ;;
+esac
+
+# Switch to spark if no USER specified (root by default) otherwise use USER directly
+switch_spark_if_root() {
+  if [ $(id -u) -eq 0 ]; then
+     echo gosu spark
+  fi
+}
+
+# Execute the container CMD under tini for better hygiene
+exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
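[Editor's sketch of the dispatch behavior above, assuming the illustrative `spark-test` image built earlier: `driver` and `executor` are normally passed as the first argument by the Spark-on-Kubernetes scheduler; any other command falls through to pass-through mode, and since the generated Dockerfile never sets `USER`, the container starts as root and `gosu` drops to the `spark` user.]

```shell
# Pass-through mode: the first argument is neither "driver" nor "executor",
# so CMD=("$@") is exec'd under tini; "id -un" should print "spark" because
# switch_spark_if_root prepends "gosu spark" for a root-started container.
docker run --rm spark-test:scala2.12-java11-ubuntu id -un
```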
diff --git a/tools/requirements.txt b/tools/requirements.txt
new file mode 100644
index 0000000..7f7afbf
--- /dev/null
+++ b/tools/requirements.txt
@@ -0,0 +1 @@
+jinja2
diff --git a/tools/template.py b/tools/template.py
new file mode 100755
index 0000000..e5e9a65
--- /dev/null
+++ b/tools/template.py
@@ -0,0 +1,84 @@
+#!/usr/bin/env python3
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from argparse import ArgumentParser
+
+from jinja2 import Environment, FileSystemLoader
+
+
+def parse_opts():
+    parser = ArgumentParser(prog="template")
+
+    parser.add_argument(
+        "-f",
+        "--template-file",
+        help="The Dockerfile template file path.",
+        default="Dockerfile.template",
+    )
+
+    parser.add_argument(
+        "-v",
+        "--spark-version",
+        help="The Spark version of Dockerfile.",
+        default="3.3.0",
+    )
+
+    parser.add_argument(
+        "-i",
+        "--image",
+        help="The base image tag of Dockerfile.",
+        default="eclipse-temurin:11-jre-focal",
+    )
+
+    parser.add_argument(
+        "-p",
+        "--pyspark",
+        action="store_true",
+        help="Have PySpark support or not.",
+    )
+
+    parser.add_argument(
+        "-r",
+        "--sparkr",
+        action="store_true",
+        help="Have SparkR support or not.",
+    )
+
+    args, unknown = parser.parse_known_args()
+    if unknown:
+        parser.error("Unsupported arguments: %s" % " ".join(unknown))
+    return args
+
+
+def main():
+    opts = parse_opts()
+    env = Environment(loader=FileSystemLoader("./"))
+    template = env.get_template(opts.template_file)
+    print(
+        template.render(
+            BASE_IMAGE=opts.image,
+            HAVE_PY=opts.pyspark,
+            HAVE_R=opts.sparkr,
+            SPARK_VERSION=opts.spark_version,
+        )
+    )
+
+
+if __name__ == "__main__":
+    main()
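[For completeness, an editor's usage sketch of the tool on its own, run from the repository root so `FileSystemLoader("./")` can resolve `Dockerfile.template`; each flag maps onto one template variable: `--image` → `BASE_IMAGE`, `--spark-version` → `SPARK_VERSION`, `--pyspark` → `HAVE_PY`, `--sparkr` → `HAVE_R`.]

```shell
# Render a PySpark-enabled Dockerfile to stdout and peek at the header.
python3 tools/template.py --spark-version 3.3.0 --pyspark | head -n 20
```

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org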