Hi,

I am aware that some fellow members of this dev group were involved in creating scripts for running Spark on Kubernetes.
# To build additional PySpark docker image
$ ./bin/docker-image-tool.sh -r <repo> -t my-tag -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

The problem, as I have explained, is being able to unpack and use packages like yaml and pandas inside k8s. I am using:

spark-submit --verbose \
  --master k8s://$K8S_SERVER \
  --archives=hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.tar.gz \
  --deploy-mode cluster \
  --name pytest \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.driver.limit.cores=1 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=500m \
  --conf spark.kubernetes.container.image=${IMAGE} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
  --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
  hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${APPLICATION}

The directory containing the code is zipped as DSBQ.zip and Spark reads it OK. However, in verbose mode it says:

2021-07-21 17:01:29,038 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz from /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to /opt/spark/work-dir/./pyspark_venv.tar.gz

In this case it tries to use pandas. The module ${APPLICATION} has this code:

import sys
import os
import pkgutil
import pkg_resources

def main():
    print("\n printing sys.path")
    for p in sys.path:
        print(p)
    user_paths = os.environ['PYTHONPATH'].split(os.pathsep)
    print("\n Printing user_paths")
    for p in user_paths:
        print(p)
    v = sys.version
    print("\n python version")
    print(v)
    print("\nlooping over pkg_resources.working_set")
    for r in pkg_resources.working_set:
        print(r)
    import pandas

if __name__ == "__main__":
    main()

The output is shown below:

Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz from /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to /opt/spark/work-dir/./pyspark_venv.tar.gz

printing sys.path
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar
/usr/lib/python37.zip
/usr/lib/python3.7
/usr/lib/python3.7/lib-dynload
/usr/local/lib/python3.7/dist-packages
/usr/lib/python3/dist-packages

Printing user_paths
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar

python version
3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0]

looping over pkg_resources.working_set
setuptools 57.2.0
pip 21.1.3
wheel 0.32.3
six 1.12.0
SecretStorage 2.3.1
pyxdg 0.25
PyGObject 3.30.4
pycrypto 2.6.1
keyrings.alt 3.1.1
keyring 17.1.1
entrypoints 0.3
cryptography 2.6.1
asn1crypto 0.24.0

Traceback (most recent call last):
  File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py", line 24, in <module>
    main()
  File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py", line 21, in main
    import pandas
ModuleNotFoundError: No module named 'pandas'

I should add that if I go inside the docker container and run pip3 list:

185@4a6747d59ff2:/opt/spark/work-dir$ pip3 list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
pip           21.1.3
pycrypto      2.6.1
PyGObject     3.30.4
pyxdg         0.25
SecretStorage 2.3.1
setuptools    57.2.0
six           1.12.0
wheel         0.32.3

I don't get any external packages! I opened an SO thread for this as well:

https://stackoverflow.com/questions/68461865/unpacking-and-using-external-modules-with-pyspark-inside-kubernetes

Do I need to hack the Dockerfile to install requirements.txt etc.?
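What I have in mind by hacking is extending the image built above, along these lines. This is only a sketch: requirements.txt is a hypothetical file listing pyyaml, pandas and so on, and the base tag follows from the build command above:

# Sketch: bake the required Python packages into the PySpark image.
# requirements.txt is hypothetical; the base tag follows the build command above.
FROM <repo>/spark-py:my-tag
USER root
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt
# the stock image runs as UID 185 (see the pip3 list prompt above)
USER 185

That would bake the packages into the image itself rather than shipping them with --archives.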
Thanks

---------- Forwarded message ---------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 20 Jul 2021 at 22:51
Subject: Unpacking and using external modules with PySpark inside k8s
To: user @spark <u...@spark.apache.org>

I have been struggling with this. Kubernetes (not that it matters, this is minikube) is working fine. In one of the modules, called configure.py, I am importing the yaml module:

import yaml

This is throwing errors:

import yaml
ModuleNotFoundError: No module named 'yaml'

I have been through a number of loops. First I created a virtual environment, pyspark_venv.tar.gz, that includes the yaml module.
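For reference, the tarball was created roughly along these lines (a sketch; venv-pack is the tool the Spark documentation suggests for this, and the package list is illustrative):

# Sketch: build a virtual environment and pack it for --archives.
# The package list is illustrative; venv-pack itself must be pip-installed.
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pyyaml pandas venv-pack
venv-pack -o pyspark_venv.tar.gz

One thing I am not sure about: if I read the Spark docs correctly, the job must also be pointed at the unpacked venv's interpreter, e.g. --conf spark.pyspark.python=./pyspark_venv/bin/python, otherwise the container's system Python is used regardless of what was unpacked.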
File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line 18, in <module> main() File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line 15, in main import yaml ModuleNotFoundError: No module named 'yaml' Well it does not matter if it is yaml or numpy. It just cannot find the modules. How can I find out if the gz file is unpacked OK? Thanks view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.