The Spark documentation describes spark-submit --files as:

--files FILES: Comma-separated list of files to be placed in the working
directory of each executor.


OK, I have implemented this for Kubernetes as per the Spark doc
<https://spark.apache.org/docs/latest/running-on-kubernetes.html> as
follows:


export VOLUME_TYPE=hostPath
export VOLUME_NAME=minikube-mount
export SOURCE_DIR=/d4T/hduser/minikube
export MOUNT_PATH=$SOURCE_DIR/mnt


        spark-submit --verbose \
           --master k8s://$K8S_SERVER \
           --deploy-mode cluster \
           --name pytest \
           --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip,hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/dependencies_short.zip \
           --files config.yml \
           --conf spark.kubernetes.namespace=spark \
           --conf spark.executor.instances=2 \
           --conf spark.kubernetes.driver.limit.cores=1 \
           --conf spark.executor.cores=1 \
           --conf spark.executor.memory=500m \
           --conf spark.kubernetes.container.image=${IMAGE} \
           --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
           --conf spark.kubernetes.file.upload.path=$SOURCE_DIR \
           --conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
           --conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
           --conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
           --conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
           hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${APPLICATION}

What does it do?


You put a file (in my case a file called config.yml) in $SOURCE_DIR on the
driver host and tell spark-submit to pick it up with --files config.yml,
placing it in every executor's working directory.
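
The usual way to consume such a file from the job code is via the standard
PySpark SparkFiles API. A minimal sketch, assuming only that the job was
submitted with --files config.yml as above:

import yaml
from pyspark import SparkFiles

# SparkFiles.get() resolves a file shipped with --files to its local path
config_path = SparkFiles.get("config.yml")
with open(config_path) as f:
    config = yaml.safe_load(f)
print(config_path)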


My $APPLICATION file, testpackages.py, has this code:

import sys
import os
import pkgutil
import pkg_resources
import yaml
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark import SparkFiles
from pyspark import SparkConf, SparkContext

def main():
    spark = SparkSession.builder \
        .enableHiveSupport() \
        .getOrCreate()
    sc = SparkContext.getOrCreate()
    sc.setLogLevel("ERROR")
    # check os path
    from os import listdir
    from os.path import isfile, join
    dirpath = "/d4T/hduser/minikube"
    onlyfiles = [f for f in listdir(dirpath) if isfile(join(dirpath, f))]
    print(onlyfiles)
    print("==> End looking at loaded files")

main()


When it is run, both files are created (see the host directory listing
further below), but Spark claims it cannot create the SparkContext.

From

DRIVER_POD_NAME=`kubectl get pods -n spark | grep driver | awk '{print $1}'`
kubectl logs $DRIVER_POD_NAME -n spark

we can see the problem:


2021-07-24 10:26:26,106 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://pytest-3bbfa67ad80d2534-driver-svc.spark.svc:4040
2021-07-24 10:26:26,118 ERROR spark.SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/d4T/hduser/minikube/spark-upload-065d87cf-a1ee-4448-8199-5ec018aacfde/config.yml does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:666)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
        at org.apache.spark.SparkContext.addFile(SparkContext.scala:1604)
        at org.apache.spark.SparkContext.$anonfun$new$13(SparkContext.scala:508)
        at org.apache.spark.SparkContext.$anonfun$new$13$adapted(SparkContext.scala:508)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:508)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Unknown Source)
2021-07-24 10:26:26,125 INFO server.AbstractConnector: Stopped Spark@694ea73d{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2021-07-24 10:26:26,126 INFO ui.SparkUI: Stopped Spark web UI at http://pytest-3bbfa67ad80d2534-driver-svc.spark.svc:4040
2021-07-24 10:26:26,142 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
2021-07-24 10:26:26,151 INFO memory.MemoryStore: MemoryStore cleared
2021-07-24 10:26:26,151 INFO storage.BlockManager: BlockManager stopped
2021-07-24 10:26:26,157 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
2021-07-24 10:26:26,157 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
2021-07-24 10:26:26,159 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
2021-07-24 10:26:26,207 INFO spark.SparkContext: Successfully stopped SparkContext

Traceback (most recent call last):
  File "/tmp/spark-2041787b-aee8-4bbd-a8d1-e2cc0339665e/testpackages.py", line 74, in <module>
    main()
  File "/tmp/spark-2041787b-aee8-4bbd-a8d1-e2cc0339665e/testpackages.py", line 15, in main
    spark = SparkSession.builder \
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 228, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 384, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 147, in __init__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 209, in _do_init
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 321, in _initialize_context
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1569, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.FileNotFoundException: File file:/d4T/hduser/minikube/spark-upload-065d87cf-a1ee-4448-8199-5ec018aacfde/config.yml does not exist

        (Java stack trace identical to the one above)

However, that file is created outside, on the host, under the mount
directory $SOURCE_DIR = /d4T/hduser/minikube:


ls -l /d4T/hduser/minikube/
total 20
drwxr-xr-x. 12 hduser hadoop 4096 Jul 24 10:09 ..
-rw-r--r--.  1 hduser hadoop 4433 Jul 24 10:12 config.yml
drwxr-xr-x.  3 hduser hadoop 4096 Jul 24 11:26 .
drwxr-xr-x.  2 hduser hadoop 4096 Jul 24 11:26 spark-upload-065d87cf-a1ee-4448-8199-5ec018aacfde

config.yml is the one I put there, and if we look under
spark-upload-065d87cf-a1ee-4448-8199-5ec018aacfde, we see the copy:

ls -l /d4T/hduser/minikube/spark-upload-065d87cf-a1ee-4448-8199-5ec018aacfde
total 16
-rw-r--r--. 1 hduser hadoop 4433 Jul 24 11:26 config.yml

Sounds like a bug.
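
One further check that might help pin this down, if the driver pod stays up
long enough to exec into: compare what the driver container actually sees
at the upload path versus the hostPath mount. A rough sketch with
subprocess and kubectl (the pod name is a placeholder to be substituted
from the kubectl get pods output above):

import subprocess

# Placeholder; use the real name from: kubectl get pods -n spark | grep driver
POD = "pytest-xxxx-driver"
PATHS = [
    "/d4T/hduser/minikube",         # spark.kubernetes.file.upload.path ($SOURCE_DIR)
    "/d4T/hduser/minikube/mnt",     # the hostPath mount.path ($MOUNT_PATH)
]
for p in PATHS:
    # List each path from inside the driver container
    r = subprocess.run(
        ["kubectl", "exec", "-n", "spark", POD, "--", "ls", "-l", p],
        capture_output=True, text=True,
    )
    print(p, "->", (r.stdout or r.stderr).strip())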


