[jira] [Commented] (SPARK-29574) spark with user provided hadoop doesn't work on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-29574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224165#comment-17224165 ]

Apache Spark commented on SPARK-29574:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30214

> spark with user provided hadoop doesn't work on kubernetes
> ----------------------------------------------------------
>
>                 Key: SPARK-29574
>                 URL: https://issues.apache.org/jira/browse/SPARK-29574
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Michał Wesołowski
>            Assignee: Shahin Shakeri
>            Priority: Major
>             Fix For: 3.0.0
>
> When spark-submit is run with an image built from "hadoop free" Spark and user-provided Hadoop, it fails on Kubernetes (the Hadoop libraries are not on Spark's classpath).
> I downloaded Spark [Pre-built with user-provided Apache Hadoop|https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-without-hadoop.tgz].
> I created a Docker image using [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh].
> Based on this image (2.4.4-without-hadoop) I created another one with this Dockerfile:
> {code:java}
> FROM spark-py:2.4.4-without-hadoop
> ENV SPARK_HOME=/opt/spark/
> # This is needed for newer kubernetes versions
> ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.6.1/kubernetes-client-4.6.1.jar $SPARK_HOME/jars
> COPY spark-env.sh /opt/spark/conf/spark-env.sh
> RUN chmod +x /opt/spark/conf/spark-env.sh
> RUN wget -qO- https://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz | tar xz -C /opt/
> ENV HADOOP_HOME=/opt/hadoop-3.2.1
> ENV PATH=${HADOOP_HOME}/bin:${PATH}
> {code}
> Contents of spark-env.sh:
> {code:java}
> #!/usr/bin/env bash
> export SPARK_DIST_CLASSPATH=$(hadoop classpath):$HADOOP_HOME/share/hadoop/tools/lib/*
> export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
> {code}
> spark-submit run with an image created this way fails, since spark-env.sh is overwritten by the [volume created when the pod starts|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L108].
> As a quick workaround I tried modifying the [entrypoint script|https://github.com/apache/spark/blob/ea8b5df47476fe66b63bd7f7bcd15acfb80bde78/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh] to run spark-env.sh during startup, after moving spark-env.sh to a different directory.
> The driver starts without issues in this setup; however, even though SPARK_DIST_CLASSPATH is set, the executor is constantly failing:
> {code:java}
> PS C:\Sandbox\projekty\roboticdrive-analytics\components\docker-images\spark-rda> kubectl logs rda-script-1571835692837-exec-12
> ++ id -u
> + myuid=0
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 0
> + uidentry=root:x:0:0:root:/root:/bin/ash
> + set -e
> + '[' -z root:x:0:0:root:/root:/bin/ash ']'
> + source /opt/spark-env.sh
> +++ hadoop classpath
> ++ export 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> ++ SPARK_DIST_CLASSPATH='/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> ++ export LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> ++ LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> ++ echo 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> {code}
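The root cause reported above is that the ConfigMap volume is mounted over $SPARK_HOME/conf, shadowing the image's spark-env.sh. The workaround the reporter describes (relocate spark-env.sh outside conf/ and source it from the entrypoint) can be sketched as follows; the file paths and classpath values are illustrative assumptions, not the exact image layout:

```shell
#!/bin/sh
# Sketch only: simulates keeping spark-env.sh outside $SPARK_HOME/conf so a
# ConfigMap mounted over conf/ cannot shadow it. Paths are illustrative.
set -eu

# The relocated spark-env.sh (e.g. /opt/spark-env.sh in the image).
ENV_FILE=$(mktemp)
cat > "$ENV_FILE" <<'EOF'
export SPARK_DIST_CLASSPATH="/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/*"
export LD_LIBRARY_PATH="/opt/hadoop-3.2.1/lib/native"
EOF

# Entrypoint fragment: source the relocated file before building the
# driver/executor command, so the variables survive the ConfigMap mount.
. "$ENV_FILE"
echo "SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH"
```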
[jira] [Commented] (SPARK-29574) spark with user provided hadoop doesn't work on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-29574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073145#comment-17073145 ]

Devin Boyer commented on SPARK-29574:
-------------------------------------

Will this change be, or can it be, backported to future 2.4.x releases? Doing so would mean that I won't have to manually patch or fork the entrypoint.sh file in my docker images. It's unclear to me whether this introduces a backwards-incompatible change.
[jira] [Commented] (SPARK-29574) spark with user provided hadoop doesn't work on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-29574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959849#comment-16959849 ]

Michał Wesołowski commented on SPARK-29574:
-------------------------------------------

I investigated the executor issue. The executor doesn't handle the SPARK_DIST_CLASSPATH environment variable because on Kubernetes org.apache.spark.executor.CoarseGrainedExecutorBackend is invoked directly, and it does not respect that variable. For the executor to "see" the user-provided Hadoop dependencies, I modified the entrypoint script so that in the SPARK_K8S_CMD executor case it builds the classpath with $SPARK_DIST_CLASSPATH:
{code:java}
...
  executor)
    CMD=(
      ${JAVA_HOME}/bin/java
      "${SPARK_EXECUTOR_JAVA_OPTS[@]}"
      -Xms$SPARK_EXECUTOR_MEMORY
      -Xmx$SPARK_EXECUTOR_MEMORY
      -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
      org.apache.spark.executor.CoarseGrainedExecutorBackend
      --driver-url $SPARK_DRIVER_URL
      --executor-id $SPARK_EXECUTOR_ID
      --cores $SPARK_EXECUTOR_CORES
      --app-id $SPARK_APPLICATION_ID
      --hostname $SPARK_EXECUTOR_POD_IP
    )
{code}
So there are two problems: the driver doesn't see environment variables from $SPARK_HOME/conf/spark-env.sh because the file is hidden by the mounted config map, and the executor doesn't take $SPARK_DIST_CLASSPATH into account at all.
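The entrypoint patch in the last comment boils down to colon-joining the container-local Spark classpath with SPARK_DIST_CLASSPATH when building the executor's -cp argument. A minimal sketch of that composition, with illustrative values standing in for the real container environment:

```shell
#!/bin/sh
# Sketch only: the classpath values below are illustrative assumptions.
set -eu

# What the image provides vs. what spark-env.sh contributes.
SPARK_CLASSPATH='/opt/spark/jars/*'
SPARK_DIST_CLASSPATH='/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'

# The patched executor command passes both to the JVM, colon-joined, so
# CoarseGrainedExecutorBackend can load the user-provided Hadoop jars.
CP="$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
echo "-cp $CP"
```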