Hi, I have a basic question to ask.
I am running a Google Kubernetes Engine (GKE) cluster with three nodes, each an e2-standard-2 (2 vCPUs, 8 GB memory). spark-submit is launched from another node (actually a Dataproc single-node cluster that I have just upgraded to e2-custom, 4 vCPUs, 8 GB memory). We call this the launch node.

OK, I know the cluster is not much, but Google was complaining about the launch node hitting 100% CPU, so I added two more vCPUs to it. It appears that despite using k8s as the computational cluster, the burden falls on the launch node!

The CPU utilisation for the launch node is shown below:

[image: launch node CPU utilisation chart]

The dip is when the two extra vCPUs were added (the node had to reboot); usage sits at around 70%. The combined CPU usage for the GKE nodes is shown below:

[image: combined GKE nodes CPU utilisation chart]

It never goes above 20%! I can see the driver and executors as below:

k get pods -n spark
NAME                                         READY   STATUS    RESTARTS   AGE
pytest-c958c97b2c52b6ed-driver               1/1     Running   0          69s
randomdatabigquery-e68a8a7b2c52f468-exec-1   1/1     Running   0          51s
randomdatabigquery-e68a8a7b2c52f468-exec-2   1/1     Running   0          51s
randomdatabigquery-e68a8a7b2c52f468-exec-3   0/1     Pending   0          51s

It is a PySpark 3.1.1 image using Java 8, pushing randomly generated data into the Google BigQuery data warehouse. The last executor (exec-3) seems to be stuck in Pending.

The spark-submit is as below:

spark-submit --verbose \
  --properties-file ${property_file} \
  --master k8s://https://$KUBERNETES_MASTER_IP:443 \
  --deploy-mode cluster \
  --name pytest \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
  --py-files $CODE_DIRECTORY/DSBQ.zip \
  --conf spark.kubernetes.namespace=$NAMESPACE \
  --conf spark.executor.memory=5000m \
  --conf spark.network.timeout=300 \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.driver.limit.cores=1 \
  --conf spark.driver.cores=1 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=2000m \
  --conf spark.kubernetes.driver.docker.image=${IMAGEGCP} \
  --conf spark.kubernetes.executor.docker.image=${IMAGEGCP} \
  --conf spark.kubernetes.container.image=${IMAGEGCP} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
  --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.sql.execution.arrow.pyspark.enabled="true" \
  $CODE_DIRECTORY/${APPLICATION}

With --deploy-mode cluster, aren't the driver and the executors supposed to be running on the k8s cluster? So why is the launch node so heavily used while the k8s cluster stays underutilised?

Thanks
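P.S. In case it helps, the job itself does roughly the following. This is only a simplified sketch, not the actual code, and the dataset, table and bucket names below are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, randn

spark = SparkSession.builder.appName("RandomDataBigQuery").getOrCreate()

# Build a DataFrame of random rows (the real job generates something similar).
df = (spark.range(0, 100000)
      .withColumn("random_value", rand(seed=42))
      .withColumn("random_gauss", randn(seed=42)))

# Write to BigQuery through the spark-bigquery connector. The indirect write
# path stages the data in a temporary GCS bucket before loading it into BigQuery.
(df.write
   .format("bigquery")
   .option("table", "test.randomData")                   # placeholder dataset.table
   .option("temporaryGcsBucket", "tmp-staging-bucket")   # placeholder bucket
   .mode("append")
   .save())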