You may recall that I raised a few questions here and on Stack Overflow regarding two items, both related to running PySpark inside Kubernetes.
The challenges were:

1. Loading third-party packages like tensorflow, numpy and pyyaml into a running job in k8s
2. Reading a yaml file to load initialisation variables into the PySpark job

The option using --archives pyspark_venv.tar.gz#pyspark_venv etc. did not work. It could not unzip and untar the file. Moreover, the archive is only loaded after the SparkContext is created, and the job was crashing all the time.

Having thought about this and talked to a fellow forum member, I was advised to, and decided to, add the packages to the Dockerfile that generates the PySpark image. So basically this, in the file $SPARK_HOME/kubernetes/dockerfiles/spark/bindings/python/Dockerfile:

    RUN pip install pyyaml numpy cx_Oracle pyspark tensorflow

When the image is built, you get:

    Step 8/20 : RUN pip install pyyaml numpy cx_Oracle pyspark tensorflow
     ---> Running in 9efc3cb25f25

The next challenge was to put config.yml somewhere in the image where I could pick it up. I also wanted to be consistent: whatever the run mode (k8s, YARN, local etc.), I should be able to pick up this config.yml file seamlessly without changing the code (a short sketch of reading it is in the P.S. below). On prem this file was read from the directory /home/hduser/dba/bin/python/DSBQ/conf. I also wanted to be able to edit it as the spark_uid (185), so I needed to install vim as well:

    RUN ["apt-get","install","-y","vim"]
    RUN mkdir -p /home/hduser/dba/bin/python/DSBQ/conf
    RUN chmod g+w /home/hduser/dba/bin/python/DSBQ/conf
    COPY config.yml /home/hduser/dba/bin/python/DSBQ/conf/config.yml
    RUN chmod g+w /home/hduser/dba/bin/python/DSBQ/conf/config.yml

Note that for this to work you will need to copy config.yml to somewhere under $SPARK_HOME (no soft link) for it to be copied into the docker image persistently.

Once the image is generated, you can log in to it like below:

    docker run -it be686602970d bash

    185@5b4c23427cfd:/opt/spark/work-dir$ ls -l /home/hduser/dba/bin/python/DSBQ/conf
    total 12
    -rw-rw-r--. 1 root root 4433 Jul 29 19:32 config.yml
    -rw-rw-r--. 1 root root  824 Jul 29 20:10 config_test.yml

Now you can edit those two files, as the user is a member of the root group:

    185@5b4c23427cfd:/opt/spark/work-dir$ id
    uid=185(185) gid=0(root) groups=0(root)

You can also log in to the image as root:

    docker run -u 0 -it be686602970d bash

    root@addbe3ffb9fa:/opt/spark/work-dir# id
    uid=0(root) gid=0(root) groups=0(root)

Hope this helps someone. It is a hack, but it works for now.
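P.S. For completeness, a minimal sketch of how the config file can be read before the SparkSession is created. It assumes the path baked into the image above; the helper name load_config and the key "appName" are illustrative, substitute whatever your config.yml actually defines:

    import yaml  # PyYAML, installed into the image via pip above
    from pyspark.sql import SparkSession

    # Same path in the image as on prem, so this code is identical in all run modes
    CONFIG_PATH = "/home/hduser/dba/bin/python/DSBQ/conf/config.yml"

    def load_config(path=CONFIG_PATH):
        # Load the initialisation variables from the yaml file
        with open(path) as f:
            return yaml.safe_load(f)

    config = load_config()

    # "appName" is a hypothetical key; use whatever your config.yml defines
    spark = SparkSession.builder \
        .appName(config.get("appName", "DSBQ")) \
        .getOrCreate()

Because the file lives at the same path everywhere, the job code never changes between k8s, YARN and local modes, which was the whole point of COPYing it into the image.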
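And a quick sanity check, assuming the SparkSession from the sketch above, that the pip-installed packages are importable on the executors and not just on the driver (the function name worker_versions is illustrative):

    # Import the packages inside a task so the check runs on an executor
    def worker_versions(_):
        import numpy
        import yaml
        import tensorflow as tf
        yield (numpy.__version__, yaml.__version__, tf.__version__)

    print(spark.sparkContext.parallelize([0], numSlices=1)
              .mapPartitions(worker_versions)
              .collect())

If the image was built correctly, this prints one tuple of version strings rather than raising an ImportError.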
