You may recall that I raised a few questions here and on Stacktrace
regarding two items, both related to running PySpark inside Kubernetes.

The challenges were:


   1. Loading third-party packages such as tensorflow, numpy and pyyaml
   into a job running in k8s
   2. Reading initialisation variables into the PySpark job from a YAML
   file


The option of shipping a virtual environment with --archives
pyspark_venv.tar.gz#pyspark_venv did not work: the file was never unzipped
and untarred. Moreover, the archive is only extracted after the
SparkContext is created, and the job crashed every time.

Having thought about this and after talking to a fellow forum member, I
decided to bake the packages into the Dockerfile that generates the
PySpark image. So basically this:

File $SPARK_HOME/kubernetes/dockerfiles/spark/bindings/python/Dockerfile

RUN pip install pyyaml numpy cx_Oracle pyspark tensorflow
When the image is built you get:

Step 8/20 : RUN pip install pyyaml numpy cx_Oracle pyspark tensorflow
 ---> Running in 9efc3cb25f25

The next challenge was to add config.yml somewhere in the image where I
could pick it up. I also wanted to be consistent, so that whatever the run
mode (k8s, YARN, local etc.), the job could pick up this config.yml file
seamlessly without any code changes.

On-prem this file was read from the directory
/home/hduser/dba/bin/python/DSBQ/conf
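Because the image recreates the same path, the job reads the file
identically in every run mode. A minimal sketch of that read; the
"appName" key is a hypothetical example, not from my actual config:

import yaml  # pyyaml, installed via the pip line above

CONFIG_PATH = "/home/hduser/dba/bin/python/DSBQ/conf/config.yml"

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)  # dict of initialisation variables

# "appName" is an illustrative key only
print(config.get("appName"))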

I also wanted to be able to edit the file as the container user
(spark_uid=185), so I needed to install vim as well.

RUN ["apt-get","install","-y","vim"]
RUN mkdir -p /home/hduser/dba/bin/python/DSBQ/conf
RUN chmod g+w /home/hduser/dba/bin/python/DSBQ/conf
COPY config.yml /home/hduser/dba/bin/python/DSBQ/conf/config.yml
RUN chmod g+w /home/hduser/dba/bin/python/DSBQ/conf/config.yml

Note that for this to work you will need to copy config.yml itself (not a
soft link) to somewhere under $SPARK_HOME, so that it falls inside the
Docker build context and is copied into the image permanently.
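With those lines in place, the image is rebuilt with Spark's own
docker-image-tool.sh; the repo and tag below are placeholders:

cd $SPARK_HOME
./bin/docker-image-tool.sh -r <your-repo> -t <your-tag> \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build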

Once the image is generated, you can log in to it like below:

docker run -it be686602970d bash
185@5b4c23427cfd:/opt/spark/work-dir$ ls -l
/home/hduser/dba/bin/python/DSBQ/conf
total 12
-rw-rw-r--. 1 root root 4433 Jul 29 19:32 config.yml
-rw-rw-r--. 1 root root  824 Jul 29 20:10 config_test.yml

Now you can edit those two files, as the user is a member of the root
group:

185@5b4c23427cfd:/opt/spark/work-dir$ id
uid=185(185) gid=0(root) groups=0(root)

You can also log in to the container as root:

docker run -u 0 -it be686602970d bash
root@addbe3ffb9fa:/opt/spark/work-dir# id
uid=0(root) gid=0(root) groups=0(root)

Hope this helps someone. It is a hack but it works for now.


