Hi Mich,

I don't quite understand why the driver node is using so much CPU, but that may be unrelated to your executors being under-used. On the executor side, I would first check that your job generates enough tasks. Then I would look at the spark.executor.cores and spark.task.cpus parameters to see whether I could give more work to the executors.
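For reference, the number of tasks Spark can run at once is executors × spark.executor.cores ÷ spark.task.cpus. A back-of-envelope sketch with the values quoted later in this thread (3 executor instances, 1 core each, spark.task.cpus left at its default of 1):

```python
# Concurrency back-of-envelope using the settings from this thread.
executor_instances = 3   # spark.executor.instances
executor_cores = 1       # spark.executor.cores
task_cpus = 1            # spark.task.cpus (default)

# Total concurrent task slots across the cluster.
task_slots = executor_instances * executor_cores // task_cpus
print(task_slots)  # -> 3
```

So at most 3 tasks run in parallel, however many tasks the job generates; if a stage has fewer partitions than slots, some executors sit idle.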
Cheers,
David

On Tue, 10 Aug 2021 at 12:20, Khalid Mammadov <khalidmammad...@gmail.com> wrote:

> Hi Mich,
>
> I think you need to check your code. If the code does not use the PySpark
> API effectively you may get this, i.e. if you use the pure Python/pandas
> API rather than PySpark's transform -> transform -> action pattern, e.g.
> df.select(..).withColumn(...)...count()
>
> Hope this helps to put you in the right direction.
>
> Cheers,
> Khalid
>
> On Mon, 9 Aug 2021, 20:20 Mich Talebzadeh, <mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a basic question to ask.
>>
>> I am running a Google k8s cluster (AKA GKE) with three nodes, each with
>> the configuration below:
>>
>> e2-standard-2 (2 vCPUs, 8 GB memory)
>>
>> spark-submit is launched from another node (actually a Dataproc single
>> node that I have just upgraded to e2-custom, 4 vCPUs, 8 GB mem). We call
>> this the launch node.
>>
>> OK, I know the cluster is not much, but Google was complaining about the
>> launch node hitting 100% CPU, so I added two more CPUs to it.
>>
>> It appears that despite using k8s as the computational cluster, the
>> burden falls upon the launch node!
>>
>> The CPU utilisation for the launch node is shown below:
>>
>> [image: image.png]
>>
>> The dip is when the 2 extra CPUs were added, which required a reboot;
>> usage is around 70%.
>>
>> The combined CPU usage for the GKE nodes is shown below:
>>
>> [image: image.png]
>>
>> It never goes above 20%!
>>
>> I can see the driver and executors as below:
>>
>> k get pods -n spark
>> NAME                                         READY   STATUS    RESTARTS   AGE
>> pytest-c958c97b2c52b6ed-driver               1/1     Running   0          69s
>> randomdatabigquery-e68a8a7b2c52f468-exec-1   1/1     Running   0          51s
>> randomdatabigquery-e68a8a7b2c52f468-exec-2   1/1     Running   0          51s
>> randomdatabigquery-e68a8a7b2c52f468-exec-3   0/1     Pending   0          51s
>>
>> It is a PySpark 3.1.1 image using Java 8, pushing randomly generated
>> data into the Google BigQuery data warehouse. The last executor (exec-3)
>> seems to be stuck in Pending.
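Khalid's transform -> transform -> action point can be mimicked in plain Python (an analogy only, not actual PySpark): chained lazy transforms do no work until a terminal action pulls data through the pipeline, whereas collecting everything to the driver for pandas/pure-Python processing makes the driver do all the work and leaves the executors idle.

```python
# Plain-Python analogy for lazy transform -> transform -> action chaining
# (like df.select(...).withColumn(...)...count() in PySpark).
data = range(1_000_000)

# transform -> transform: nothing is computed yet (lazy, like DataFrame ops)
squared = map(lambda x: x * x, data)
evens = filter(lambda x: x % 2 == 0, squared)

# action: only now does the whole pipeline actually execute
count = sum(1 for _ in evens)
print(count)  # -> 500000 (squares are even exactly when x is even)
```

In real PySpark the same shape matters because each lazy transform stays on the executors; a `.toPandas()` or Python-side loop in the middle drags the data back to the driver.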
>> The spark-submit is as below:
>>
>> spark-submit --verbose \
>>   --properties-file ${property_file} \
>>   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
>>   --deploy-mode cluster \
>>   --name pytest \
>>   --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
>>   --py-files $CODE_DIRECTORY/DSBQ.zip \
>>   --conf spark.kubernetes.namespace=$NAMESPACE \
>>   --conf spark.executor.memory=5000m \
>>   --conf spark.network.timeout=300 \
>>   --conf spark.executor.instances=3 \
>>   --conf spark.kubernetes.driver.limit.cores=1 \
>>   --conf spark.driver.cores=1 \
>>   --conf spark.executor.cores=1 \
>>   --conf spark.executor.memory=2000m \
>>   --conf spark.kubernetes.driver.docker.image=${IMAGEGCP} \
>>   --conf spark.kubernetes.executor.docker.image=${IMAGEGCP} \
>>   --conf spark.kubernetes.container.image=${IMAGEGCP} \
>>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
>>   --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>>   --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>>   --conf spark.sql.execution.arrow.pyspark.enabled="true" \
>>   $CODE_DIRECTORY/${APPLICATION}
>>
>> Aren't the driver and executors running on the k8s cluster? So why is the
>> launch node heavily used while the k8s cluster is under-utilised?
>>
>> Thanks
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
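Two things stand out in the command above. First, spark.executor.memory is set twice (5000m, then 2000m); the duplication is probably unintentional. Second, a plausible reason exec-3 stays Pending (an assumption, not confirmed in the thread): on small 2-vCPU GKE nodes, system reservations leave each node with roughly 1.9 allocatable cores, so a node can fit only one 1-core Spark pod. A back-of-envelope sketch, where the ~1930m allocatable figure is an assumed typical value for e2-standard-2, not something stated in the thread:

```python
# Rough scheduling check for 3 x e2-standard-2 nodes (2 vCPU each).
# ASSUMPTION: ~1.93 CPU allocatable per node after GKE system reservations;
# the real figure depends on GKE version and node configuration.
allocatable_cpu_per_node = 1.93
pod_cpu_request = 1.0    # spark.driver.cores / spark.executor.cores = 1

pods_per_node = int(allocatable_cpu_per_node // pod_cpu_request)
schedulable_pods = 3 * pods_per_node   # 3 nodes
requested_pods = 1 + 3                 # 1 driver + spark.executor.instances=3
pending = requested_pods - schedulable_pods
print(pods_per_node, schedulable_pods, pending)  # -> 1 3 1
```

Under those assumptions only three 1-core pods fit, matching the observed one driver plus two Running executors, with the fourth pod left Pending. `kubectl describe pod` on exec-3 would show the actual scheduling reason.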