Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-14 Thread Mich Talebzadeh
Hi Khalid and David. Thanks for your comments. I believe I have found the source of the high CPU utilisation on the host submitting spark-submit, which I referred to as the launch node. This node was the master node in what is known as a Google Dataproc cluster. According to this link …

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-11 Thread David Diebold
Hi Mich, I don't quite understand why the driver node is using so much CPU, but it may be unrelated to your executors being underused. About the executors being underused, I would first check that your job generates enough tasks. Then I would check the spark.executor.cores and spark.task.cpus parameters t…
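
A minimal PySpark sketch of the check David describes (the config values and data are illustrative, not taken from the thread): Spark runs up to spark.executor.cores / spark.task.cpus concurrent tasks per executor, so a stage with fewer tasks than total task slots leaves cores idle.

    from pyspark.sql import SparkSession

    # Illustrative settings: with 2 cores per executor and 1 CPU per task,
    # each executor can run 2 tasks at a time.
    spark = (
        SparkSession.builder
        .appName("parallelism-check")
        .config("spark.executor.cores", "2")
        .config("spark.task.cpus", "1")
        .getOrCreate()
    )

    df = spark.range(0, 10_000_000)

    # If this number is below (executors * executor_cores / task_cpus),
    # some task slots sit idle during this stage.
    print("partitions:", df.rdd.getNumPartitions())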

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-10 Thread Khalid Mammadov
Hi Mich, I think you need to check your code. If the code does not use the PySpark API effectively you may get this, i.e. if you use the pure Python/pandas API rather than PySpark transformations (transform -> transform -> action), e.g. df.select(..).withColumn(...)...count(). Hope this helps to put you in the right direction. Che…
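
To illustrate the pattern Khalid describes, a small sketch with hypothetical toy data: the work stays on the executors when it is expressed as PySpark transformations ending in a single action, whereas collecting to pandas moves it onto the driver.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pyspark-api-example").getOrCreate()

    # Hypothetical toy data standing in for the real job's input.
    df = spark.createDataFrame([(1, 40), (2, 75), (3, 120)], ["id", "amount"])

    # Distributed: transform -> transform -> action, evaluated lazily on executors.
    result = (
        df.select("id", "amount")
          .withColumn("amount_x2", F.col("amount") * 2)
          .filter(F.col("amount_x2") > 100)
          .count()        # the single action that triggers the job
    )

    # Anti-pattern: df.toPandas() pulls every row to the driver first,
    # leaving executors idle and loading the submitting host instead.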

Performance of PySpark jobs on the Kubernetes cluster

2021-08-09 Thread Mich Talebzadeh
Hi, I have a basic question to ask. I am running a Google k8s cluster (AKA GKE) with three nodes, each with the configuration below: e2-standard-2 (2 vCPUs, 8 GB memory). spark-submit is launched from another node (actually a Dataproc single node that I have just upgraded to e2-custom (4 vCPUs, 8…
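
For context, a driver running on such an external launch node points Spark at the Kubernetes API server. A minimal client-mode sketch of that setup (the endpoint, image, and namespace are placeholders, not values from the thread):

    from pyspark.sql import SparkSession

    # All placeholders: substitute the real GKE API server, image, and namespace.
    spark = (
        SparkSession.builder
        .master("k8s://https://<gke-api-server>:443")
        .appName("gke-pyspark-job")
        .config("spark.kubernetes.container.image", "<registry>/spark-py:<tag>")
        .config("spark.kubernetes.namespace", "spark")
        .config("spark.executor.instances", "3")
        .getOrCreate()
    )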