I have made some observations on running Spark on Kubernetes (aka k8s).

Spark on Kubernetes works on the basis of the "one-container-per-Pod" model
<https://kubernetes.io/docs/concepts/workloads/pods/>, meaning that one node
of the cluster runs the driver and each remaining node runs one executor. I
tested this on Google Kubernetes Engine (GKE) with dynamic adjustment of the
number of nodes using the following command:


gcloud container clusters resize $GKE_NAME --num-nodes=$NODES --zone $ZONE --quiet


Thus you can create a k8s cluster with $NODES nodes. Once your spark-submit
has finished, you can resize the cluster to zero to save money:


gcloud container clusters resize spark-on-gke --num-nodes=0 --zone $ZONE --quiet


Note that you cannot shut down a GKE cluster, but you can effectively reduce
it to zero nodes.
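
To show the whole cycle, here is a minimal wrapper sketch, assuming the
variable names above; run_job.sh is a hypothetical wrapper around
spark-submit (a sketch of the spark-submit itself appears further down):

#!/bin/bash
# Sketch: scale the GKE cluster up, run the Spark job, scale back to zero
# GKE_NAME, ZONE and NODES are assumed to be set as in the examples above
gcloud container clusters resize $GKE_NAME --num-nodes=$NODES --zone $ZONE --quiet
./run_job.sh   # hypothetical wrapper around spark-submit
gcloud container clusters resize $GKE_NAME --num-nodes=0 --zone $ZONE --quiet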


So as a rule of thumb, if you have a 5-node k8s cluster, you will set the
following parameters, where the number of executors is one less than the
number of nodes:

 NODES=5

 NEXEC=$((NODES-1))

 --conf spark.executor.instances=$NEXEC \
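
For completeness, a minimal spark-submit sketch using these variables
follows. The image name and registry are assumptions, and the application
name and script name are only inferred from the pod listing below; replace
them with your own:

 K8S_SERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')

 spark-submit \
    --master k8s://$K8S_SERVER \
    --deploy-mode cluster \
    --name sparkBQ \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.executor.instances=$NEXEC \
    --conf spark.kubernetes.container.image=$DOCKER_IMAGE \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    local:///opt/spark/work-dir/RandomDataBigQuery.py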


 In this model, increasing the number of executors beyond the nodes available
for them results in extra executor pods that are never scheduled and remain
in Pending status, as shown below:


kubectl get pod -n spark

NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-b40dd67c791417bf-exec-1   1/1     Running   0          65s
randomdatabigquery-b40dd67c791417bf-exec-2   1/1     Running   0          65s
randomdatabigquery-b40dd67c791417bf-exec-3   1/1     Running   0          65s
randomdatabigquery-b40dd67c791417bf-exec-4   1/1     Running   0          65s
randomdatabigquery-b40dd67c791417bf-exec-5   0/1     Pending   0          65s
sparkbq-13d8857c7913e1d0-driver              1/1     Running   0          81s
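
If you want to confirm why the fifth executor is stuck, kubectl describe
prints the scheduler's events for the pod (typically a FailedScheduling
event citing insufficient CPU or memory on the remaining nodes):

kubectl describe pod randomdatabigquery-b40dd67c791417bf-exec-5 -n spark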

The Spark GUI can be accessed through the following port forwarding once the
driver pod has been created (i.e. at run time):


NAMESPACE="spark"

DRIVER_POD_NAME=$(kubectl get pods -n $NAMESPACE | grep driver | awk '{print $1}')

kubectl port-forward $DRIVER_POD_NAME 4040:4040 -n $NAMESPACE &


For this purpose I built a Docker image with Java 8, Spark 3.1.1 and Scala
2.12. Data was randomly generated in PySpark and saved to Google BigQuery
(GBQ). The choice of Java 8 and Spark 3.1.1 was for compatibility with GBQ.
In my experience you will need to build your own Docker image.
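
For what it is worth, the Spark 3.1.1 distribution ships with
bin/docker-image-tool.sh for building such images. A sketch follows, where
the registry path is an assumption and java_image_tag pins the base image to
Java 8:

cd $SPARK_HOME
# Build a PySpark-enabled image on a Java 8 base and push it
./bin/docker-image-tool.sh -r eu.gcr.io/$PROJECT_ID -t 3.1.1 \
    -b java_image_tag=8-jre-slim \
    -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
./bin/docker-image-tool.sh -r eu.gcr.io/$PROJECT_ID -t 3.1.1 push

You would presumably still need to add the GBQ connector jar on top of this
base image, which is why a custom image is needed in practice.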


Performance-wise, it typically takes around 30 seconds to create the driver
pod, followed by the executors, so there is still some work to do there. The
major advantage that k8s offers for Spark is flexibility: you create the k8s
cluster once, dynamically adjust the number of nodes, and reduce it to zero
after running the job. To submit the Spark job you need another VM, or
anything else that has the Spark binaries installed on it; for this purpose
I installed Spark 3.1.1 on a VM with 2 vCPUs and 5.5 GB of memory. Note that
as long as you have a deployable Docker image, that image can be shared by
multiple k8s clusters, or you can have multiple Docker images to deploy.


Let me know your thoughts






