Hi Khalid and David.
Thanks for your comments. I believe I have found the source of the high CPU
utilisation on the host submitting spark-submit, which I referred to as the
launch node.
This node was the master node of a Google Dataproc cluster.
According to this link
Hi Mich,
I don't quite understand why the driver node is using so much CPU, but it
may be unrelated to your executors being underused.
Regarding the underused executors, I would first check that your job
generates enough tasks.
Then I would check the spark.executor.cores and spark.task.cpus parameters.
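A minimal sketch of how those two parameters bound parallelism. The numbers below are assumptions for illustration (e.g. one executor per node on a three-node cluster), not values read from Mich's cluster:

```python
# Sketch: how spark.executor.cores and spark.task.cpus bound the number
# of tasks that can run at the same time. All values are assumed.
num_executors = 3        # e.g. one executor per GKE node
executor_cores = 2       # spark.executor.cores
task_cpus = 1            # spark.task.cpus (Spark's default is 1)

# Each executor offers executor_cores // task_cpus task slots.
slots_per_executor = executor_cores // task_cpus
max_concurrent_tasks = num_executors * slots_per_executor
print(max_concurrent_tasks)  # 6 with the assumed values
```

If the job produces fewer tasks than this slot count (for example a DataFrame with only one or two partitions), some executors will sit idle no matter how the cluster is sized.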
Hi Mich,
I think you need to check your code.
If the code does not use the PySpark API effectively you may get this, i.e.
if you use the pure Python/pandas API rather than PySpark's
transform->transform->action pattern, e.g. df.select(..).withColumn(...)...count()
Hope this helps to put you in the right direction.
Hi,
I have a basic question to ask.
I am running a Google Kubernetes cluster (AKA GKE) with three nodes, each
with the configuration below:
e2-standard-2 (2 vCPUs, 8 GB memory)
spark-submit is launched from another node (actually a Dataproc single
node that I have just upgraded to e2-custom (4 vCPUs, 8