
I am aware that some fellow members of this dev group were involved in
creating the scripts for running Spark on Kubernetes:

# To build additional PySpark docker image
$ ./bin/docker-image-tool.sh -r <repo> -t my-tag -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

The problem I have described is getting packages such as yaml and pandas
unpacked and importable inside k8s.

I am using

        spark-submit --verbose \
           --master k8s://$K8S_SERVER \
           --deploy-mode cluster \
           --name pytest \
           --conf spark.kubernetes.namespace=spark \
           --conf spark.executor.instances=1 \
           --conf spark.kubernetes.driver.limit.cores=1 \
           --conf spark.executor.cores=1 \
           --conf spark.executor.memory=500m \
           --conf spark.kubernetes.container.image=${IMAGE} \
           --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \

The directory containing the code is zipped as DSBQ.zip, and Spark reads it
OK.

However, in verbose mode it reports:

2021-07-21 17:01:29,038 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes where
Unpacking an archive hdfs:// from
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to

In this case it tries to use pandas

The module ${APPLICATION} has this code

import sys
import os
import pkgutil
import pkg_resources

def main():
    # Show where this interpreter looks for modules
    print("\n printing sys.path")
    for p in sys.path:
        print(p)
    user_paths = os.environ['PYTHONPATH'].split(os.pathsep)
    print("\n Printing user_paths")
    for p in user_paths:
        print(p)
    v = sys.version
    print("\n python version")
    print(v)
    # List every distribution visible to this interpreter
    print("\nlooping over pkg_resources.working_set")
    for r in pkg_resources.working_set:
        print(r)
    import pandas

if __name__ == "__main__":
    main()

The output is shown below

Unpacking an archive hdfs:// from
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to

 printing sys.path

 Printing user_paths

 python version
3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0]

looping over pkg_resources.working_set
setuptools 57.2.0
pip 21.1.3
wheel 0.32.3
six 1.12.0
SecretStorage 2.3.1
pyxdg 0.25
PyGObject 3.30.4
pycrypto 2.6.1
keyrings.alt 3.1.1
keyring 17.1.1
entrypoints 0.3
cryptography 2.6.1
asn1crypto 0.24.0
Traceback (most recent call last):
  File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py",
line 24, in <module>
  File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py",
line 21, in main
    import pandas
ModuleNotFoundError: No module named 'pandas'
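
One way to narrow this down, offered as a minimal sketch rather than a confirmed diagnosis, is to report which interpreter the driver is actually running and whether each module can be located, without letting the first missing import abort the script. The helper name `report_modules` and the module list are illustrative assumptions and are not part of the original testpackages.py.

```python
import importlib.util
import sys


def report_modules(names):
    """Map each module name to where it would load from, or None if absent.

    The modules are located but not imported.
    """
    found = {}
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            found[name] = None
        else:
            # Namespace packages have no single origin file.
            found[name] = spec.origin or "<namespace package>"
    return found


if __name__ == "__main__":
    # sys.executable shows which python binary is running this script,
    # e.g. the image's own python3 or one from an unpacked environment.
    print("interpreter:", sys.executable)
    print("prefix:", sys.prefix)
    for name, origin in report_modules(["pandas", "yaml"]).items():
        print(name, "->", origin or "NOT FOUND")
```

If the interpreter path points at the image's system Python, the packages listed will be those baked into the image, whatever archive was shipped alongside the job.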

In addition, if I open a shell inside the Docker container and run pip3 list:

185@4a6747d59ff2:/opt/spark/work-dir$ pip3 list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
pip           21.1.3
pycrypto      2.6.1
PyGObject     3.30.4
pyxdg         0.25
SecretStorage 2.3.1
setuptools    57.2.0
six           1.12.0
wheel         0.32.3

I don't get any external packages!

I opened a Stack Overflow thread for this as well.


Do I need to modify the Dockerfile so that it installs the packages listed in
requirements.txt?
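
Before changing the image, it may help to compare a requirements file against what the container's interpreter can see. The following is a minimal sketch, assuming a pip-style file with one requirement per line. `missing_requirements`, `requirement_names`, `installed_names` and the example package names are hypothetical, and versions are deliberately not compared.

```python
import re

_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name):
    """Normalise a distribution name: lowercase, runs of -_. become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_names(lines):
    """Extract normalised project names from requirement lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = _NAME.match(line)
        if match:
            names.append(normalize(match.group(1)))
    return names


def installed_names():
    """Return the normalised names of distributions this interpreter sees."""
    try:
        from importlib import metadata  # Python 3.8+
        names = [d.metadata["Name"] for d in metadata.distributions()]
    except ImportError:
        import pkg_resources  # older interpreters such as 3.7
        names = [d.project_name for d in pkg_resources.working_set]
    return {normalize(n) for n in names if n}


def missing_requirements(lines, installed):
    """Return required names that are absent from the installed set."""
    return [n for n in requirement_names(lines) if n not in installed]


if __name__ == "__main__":
    # Illustrative requirement lines, not taken from a real requirements file.
    wanted = ["pandas", "PyYAML"]
    print("missing:", missing_requirements(wanted, installed_names()))
```

Run inside the container, an empty list would suggest the image already provides everything the file names; anything printed is a candidate for installation in the image or for shipping with the job.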


   view my Linkedin profile

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

I have been struggling with this.

Kubernetes (minikube, not that it matters) is working fine. In one of the
modules, called configure.py, I am importing the yaml module:

import yaml

This throws the following error:

    import yaml
ModuleNotFoundError: No module named 'yaml'

I have been through a number of loops.

First I created a virtual environment archive, pyspark_venv.tar.gz, that
includes the yaml module and passed it to spark-submit as follows:

+ spark-submit --verbose --master k8s://
--deploy-mode cluster --name pytest --conf
'spark.kubernetes.namespace=spark' --conf 'spark.executor.instances=1'
--conf 'spark.kubernetes.driver.limit.cores=1' --conf
'spark.executor.cores=1' --conf 'spark.executor.memory=500m' --conf
'spark.kubernetes.container.image=pytest-repo/spark-py:3.1.1' --conf
--py-files hdfs:// hdfs://

Parsed arguments:
  master                  k8s://
  deployMode              cluster
  executorMemory          500m
  executorCores           1
  totalExecutorCores      null
  propertiesFile          /opt/spark/conf/spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    $SPARK_HOME/jars/*.jar
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            1
  files                   null
  pyFiles                 hdfs://
  archives                hdfs://
  mainClass               null
  primaryResource         hdfs://
  name                    pytest
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Unpacking an archive hdfs:// from
/tmp/spark-d339a76e-090c-4670-89aa-da723d6e9fbc/pyspark_venv.tar.gz to

printing sys.path

 Printing user_paths
checking yaml
Traceback (most recent call last):
  File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line
18, in <module>
  File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line
15, in main
    import yaml
ModuleNotFoundError: No module named 'yaml'

Well it does not matter if it is yaml or numpy. It just cannot find the
modules. How can I find out if the gz file is unpacked OK?
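
One way to check, sketched under assumptions: list the top-level entries of the archive before submitting it, and inside the pod search a directory tree for a `site-packages` directory that contains the module. The function names and the search root are hypothetical; the log above is cut off before the unpack destination, so the root to search has to be supplied by the caller.

```python
import os
import tarfile


def archive_top_level(path):
    """Return the sorted top-level entry names of a tar archive."""
    tops = set()
    with tarfile.open(path) as tar:
        for member in tar.getmembers():
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                tops.add(parts[0])
    return sorted(tops)


def find_module_dirs(root, module):
    """Return site-packages directories under root that contain `module`."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) == "site-packages":
            if module in dirnames or module + ".py" in filenames:
                hits.append(dirpath)
            dirnames[:] = []  # do not descend into the packages themselves
    return hits
```

Calling `find_module_dirs(os.getcwd(), "yaml")` from the driver script and printing the result would show whether an unpacked environment with yaml exists under the working directory. An empty list means either the archive was not unpacked there or the module is not in it, and `archive_top_level` on the local tar.gz helps tell the two apart.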



