Fwd: Unpacking and using external modules with PySpark inside k8s

Mich Talebzadeh Wed, 21 Jul 2021 10:11:21 -0700

Hi,

I am aware that some fellow members in this dev group were involved in
creating scripts for running spark on kubernetes


# To build additional PySpark docker image$ ./bin/docker-image-tool.sh
-r <repo> -t my-tag -p
./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build


The problem I have explained is to be able to unpack packages like yaml and
pandas inside k8s


I am using


        spark-submit --verbose \
           --master k8s://$K8S_SERVER \

 --archives=hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.tar.gz
\
           --deploy-mode cluster \
           --name pytest \
           --conf spark.kubernetes.namespace=spark \
           --conf spark.executor.instances=1 \
           --conf spark.kubernetes.driver.limit.cores=1 \
           --conf spark.executor.cores=1 \
           --conf spark.executor.memory=500m \
           --conf spark.kubernetes.container.image=${IMAGE} \
           --conf
spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount
\
           --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
           hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${APPLICATION}


The directory containing code is zipped as DSBQ.zip and it reads it ok.


However, it says in verbose mode


2021-07-21 17:01:29,038 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
Unpacking an archive hdfs://
50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz from
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to
/opt/spark/work-dir/./pyspark_venv.tar.gz


In this case it tries to use pandas


The module ${APPLICATION} has this code


import sys
import os
import pkgutil
import pkg_resources

def main():
    print("\n printing sys.path")
    for p in sys.path:
       print(p)
    user_paths = os.environ['PYTHONPATH'].split(os.pathsep)
    print("\n Printing user_paths")
    for p in user_paths:
       print(p)
    v = sys.version
    print("\n python version")
    print(v)
    print("\nlooping over pkg_resources.working_set")
    for r in pkg_resources.working_set:
       print(r)
    import pandas

if __name__ == "__main__":
  main()


The output is shown below

Unpacking an archive hdfs://
50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz from
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to
/opt/spark/work-dir/./pyspark_venv.tar.gz

 printing sys.path
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar
/usr/lib/python37.zip
/usr/lib/python3.7
/usr/lib/python3.7/lib-dynload
/usr/local/lib/python3.7/dist-packages
/usr/lib/python3/dist-packages

 Printing user_paths
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar

 python version
3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0]

looping over pkg_resources.working_set
setuptools 57.2.0
pip 21.1.3
wheel 0.32.3
six 1.12.0
SecretStorage 2.3.1
pyxdg 0.25
PyGObject 3.30.4
pycrypto 2.6.1
keyrings.alt 3.1.1
keyring 17.1.1
entrypoints 0.3
cryptography 2.6.1
asn1crypto 0.24.0
Traceback (most recent call last):
  File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py",
line 24, in <module>
    main()
  File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py",
line 21, in main
    import pandas
ModuleNotFoundError: No module named 'pandas'


Adding that if I go inside the docker and do


185@4a6747d59ff2:/opt/spark/work-dir$ pip3 list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
pip           21.1.3
pycrypto      2.6.1
PyGObject     3.30.4
pyxdg         0.25
SecretStorage 2.3.1
setuptools    57.2.0
six           1.12.0
wheel         0.32.3


I don't get any external packages!


I opened a SO thead for this as well.


https://stackoverflow.com/questions/68461865/unpacking-and-using-external-modules-with-pyspark-inside-kubernetes


Do I need to hack Dockerfile to install the requirement.txt etc?


Thanks



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




---------- Forwarded message ---------
From: Mich Talebzadeh <[email protected]>
Date: Tue, 20 Jul 2021 at 22:51
Subject: Unpacking and using external modules with PySpark inside k8s
To: user @spark <[email protected]>



I have been struggling with this.


Kubernetes (not that matters minikube is working fine. In one of the module
called configure.py  I am importing yaml module


import yaml


This is throwing errors


    import yaml
ModuleNotFoundError: No module named 'yaml'


I have been through a number of loops.


First I created  virtual environment pyspark_venv.tar.gz that includes yaml
module and past it to spark-submit as follows


+ spark-submit --verbose --master k8s://192.168.49.2:8443
'--archives=hdfs://
50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv'
--deploy-mode cluster --name pytest --conf
'spark.kubernetes.namespace=spark' --conf 'spark.executor.instances=1'
--conf 'spark.kubernetes.driver.limit.cores=1' --conf
'spark.executor.cores=1' --conf 'spark.executor.memory=500m' --conf
'spark.kubernetes.container.image=pytest-repo/spark-py:3.1.1' --conf
'spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount'
--py-files hdfs://50.140.197.220:9000/minikube/codes/DSBQ.zip hdfs://
50.140.197.220:9000/minikube/codes/testyml.py


Parsed arguments:
  master                  k8s://192.168.49.2:8443
  deployMode              cluster
  executorMemory          500m
  executorCores           1
  totalExecutorCores      null
  propertiesFile          /opt/spark/conf/spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    $SPARK_HOME/jars/*.jar
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            1
  files                   null
  pyFiles                 hdfs://50.140.197.220:9000/minikube/codes/DSBQ.zip
  archives                hdfs://
50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv
  mainClass               null
  primaryResource         hdfs://
50.140.197.220:9000/minikube/codes/testyml.py
  name                    pytest
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true


Unpacking an archive hdfs://
50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv from
/tmp/spark-d339a76e-090c-4670-89aa-da723d6e9fbc/pyspark_venv.tar.gz to
/opt/spark/work-dir/./pyspark_venv


printing sys.path
/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc
/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar
/usr/lib/python37.zip
/usr/lib/python3.7
/usr/lib/python3.7/lib-dynload
/usr/local/lib/python3.7/dist-packages
/usr/lib/python3/dist-packages

 Printing user_paths
['/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/DSBQ.zip',
'/opt/spark/python/lib/pyspark.zip',
'/opt/spark/python/lib/py4j-0.10.9-src.zip',
'/opt/spark/jars/spark-core_2.12-3.1.1.jar']
checking yaml
Traceback (most recent call last):
  File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line
18, in <module>
    main()
  File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line
15, in main
    import yaml
ModuleNotFoundError: No module named 'yaml'


Well it does not matter if it is yaml or numpy. It just cannot find the
modules. How can I find out if the gz file is unpacked OK?


Thanks


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Fwd: Unpacking and using external modules with PySpark inside k8s

Reply via email to