Hi Khalid and David,

Thanks for your comments. I believe I have found the source of the high CPU utilisation on the host submitting spark-submit, which I referred to as the launch node. This node was the master in a Google Dataproc cluster. According to this link
Hi Mich,

I think you need to check your code. If the code does not use the PySpark API effectively, you can see this behaviour, i.e. if you use the pure Python/pandas API rather than the PySpark pattern of transform -> transform -> action, e.g. df.select(..).withColumn(...)...count().

Hope this helps to point you in the right direction.
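To illustrate the point above without needing a Spark cluster, here is a minimal pure-Python sketch (hypothetical data and column names, not Spark code) of the difference between eagerly materialising every intermediate step, as pandas-style code does, and a lazy transform -> transform -> action chain of the kind PySpark builds:

```python
def eager_pipeline(rows):
    # pandas-style: every step builds a full intermediate list in memory
    selected = [{"id": r["id"], "v": r["v"]} for r in rows]
    with_col = [dict(r, doubled=r["v"] * 2) for r in selected]
    return len(with_col)

def lazy_pipeline(rows):
    # PySpark-style: transformations are deferred (here, generators);
    # nothing is evaluated until the terminal "action" below
    selected = ({"id": r["id"], "v": r["v"]} for r in rows)
    with_col = (dict(r, doubled=r["v"] * 2) for r in selected)
    return sum(1 for _ in with_col)  # the action triggers evaluation

data = [{"id": i, "v": i} for i in range(1000)]
eager_count = eager_pipeline(data)
lazy_count = lazy_pipeline(data)
```

Both return the same count, but the lazy chain streams one row at a time, which is the shape Spark can optimise and distribute; driving the work through eager Python/pandas steps instead keeps it all on the submitting host, which would show up as CPU load there.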
Hi,

I have a basic question to ask. I am running a Google Kubernetes cluster (AKA GKE) with three nodes, each with the configuration below:

e2-standard-2 (2 vCPUs, 8 GB memory)

spark-submit is launched from another node (actually a Dataproc single node that I have just upgraded to e2-custom (4 vCPUs, 8