Hi Mich,

I don't quite understand why the driver node is using so much CPU, but that may be unrelated to your executors being under-used. On the executor side, I would first check that your job generates enough tasks. Then I would look at the spark.executor.cores and spark.task.cpus parameters to see whether I could give more work to the executors.
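For reference, the number of tasks Spark can run at once is executors × spark.executor.cores ÷ spark.task.cpus. A back-of-envelope sketch with the values quoted later in this thread (3 executor instances, 1 core each, spark.task.cpus left at its default of 1):

```python
# Concurrency back-of-envelope using the settings from this thread.
executor_instances = 3   # spark.executor.instances
executor_cores = 1       # spark.executor.cores
task_cpus = 1            # spark.task.cpus (default)

# Total concurrent task slots across the cluster.
task_slots = executor_instances * executor_cores // task_cpus
print(task_slots)  # -> 3
```

So at most 3 tasks run in parallel, however many tasks the job generates; if a stage has fewer partitions than slots, some executors sit idle.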
Cheers,
David

On Tue, 10 Aug 2021 at 12:20, Khalid Mammadov <khalidmammad...@gmail.com> wrote:

> Hi Mich,
>
> I think you need to check your code. If the code does not use the PySpark
> API effectively you may get this, i.e. if you use the pure Python/pandas
> API rather than PySpark's transform -> transform -> action pattern, e.g.
> df.select(..).withColumn(...)...count()
>
> Hope this helps to put you in the right direction.
>
> Cheers,
> Khalid
>
> On Mon, 9 Aug 2021, 20:20 Mich Talebzadeh, <mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a basic question to ask.
>>
>> I am running a Google k8s cluster (AKA GKE) with three nodes, each with
>> the configuration below:
>>
>> e2-standard-2 (2 vCPUs, 8 GB memory)
>>
>> spark-submit is launched from another node (actually a Dataproc single
>> node that I have just upgraded to e2-custom, 4 vCPUs, 8 GB mem). We call
>> this the launch node.
>>
>> OK, I know the cluster is not much, but Google was complaining about the
>> launch node hitting 100% CPU, so I added two more CPUs to it.
>>
>> It appears that despite using k8s as the computational cluster, the
>> burden falls upon the launch node!
>>
>> The CPU utilisation for the launch node is shown below:
>>
>> [image: image.png]
>>
>> The dip is when the 2 extra CPUs were added, which required a reboot;
>> usage is around 70%.
>>
>> The combined CPU usage for the GKE nodes is shown below:
>>
>> [image: image.png]
>>
>> It never goes above 20%!
>>
>> I can see the driver and executors as below:
>>
>> k get pods -n spark
>> NAME                                         READY   STATUS    RESTARTS   AGE
>> pytest-c958c97b2c52b6ed-driver               1/1     Running   0          69s
>> randomdatabigquery-e68a8a7b2c52f468-exec-1   1/1     Running   0          51s
>> randomdatabigquery-e68a8a7b2c52f468-exec-2   1/1     Running   0          51s
>> randomdatabigquery-e68a8a7b2c52f468-exec-3   0/1     Pending   0          51s
>>
>> It is a PySpark 3.1.1 image using Java 8, pushing randomly generated
>> data into the Google BigQuery data warehouse. The last executor (exec-3)
>> seems to be stuck in Pending.
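Khalid's transform -> transform -> action point can be mimicked in plain Python (an analogy only, not actual PySpark): chained lazy transforms do no work until a terminal action pulls data through the pipeline, whereas collecting everything to the driver for pandas/pure-Python processing makes the driver do all the work and leaves the executors idle.

```python
# Plain-Python analogy for lazy transform -> transform -> action chaining
# (like df.select(...).withColumn(...)...count() in PySpark).
data = range(1_000_000)

# transform -> transform: nothing is computed yet (lazy, like DataFrame ops)
squared = map(lambda x: x * x, data)
evens = filter(lambda x: x % 2 == 0, squared)

# action: only now does the whole pipeline actually execute
count = sum(1 for _ in evens)
print(count)  # -> 500000 (squares are even exactly when x is even)
```

In real PySpark the same shape matters because each lazy transform stays on the executors; a `.toPandas()` or Python-side loop in the middle drags the data back to the driver.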
>> The spark-submit is as below:
>>
>> spark-submit --verbose \
>>   --properties-file ${property_file} \
>>   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
>>   --deploy-mode cluster \
>>   --name pytest \
>>   --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
>>   --py-files $CODE_DIRECTORY/DSBQ.zip \
>>   --conf spark.kubernetes.namespace=$NAMESPACE \
>>   --conf spark.executor.memory=5000m \
>>   --conf spark.network.timeout=300 \
>>   --conf spark.executor.instances=3 \
>>   --conf spark.kubernetes.driver.limit.cores=1 \
>>   --conf spark.driver.cores=1 \
>>   --conf spark.executor.cores=1 \
>>   --conf spark.executor.memory=2000m \
>>   --conf spark.kubernetes.driver.docker.image=${IMAGEGCP} \
>>   --conf spark.kubernetes.executor.docker.image=${IMAGEGCP} \
>>   --conf spark.kubernetes.container.image=${IMAGEGCP} \
>>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
>>   --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>>   --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
>>   --conf spark.sql.execution.arrow.pyspark.enabled="true" \
>>   $CODE_DIRECTORY/${APPLICATION}
>>
>> Aren't the driver and executors running on the k8s cluster? So why is the
>> launch node heavily used while the k8s cluster is under-utilised?
>>
>> Thanks
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
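Two things stand out in the command above. First, spark.executor.memory is set twice (5000m, then 2000m); the duplication is probably unintentional. Second, a plausible reason exec-3 stays Pending (an assumption, not confirmed in the thread): on small 2-vCPU GKE nodes, system reservations leave each node with roughly 1.9 allocatable cores, so a node can fit only one 1-core Spark pod. A back-of-envelope sketch, where the ~1930m allocatable figure is an assumed typical value for e2-standard-2, not something stated in the thread:

```python
# Rough scheduling check for 3 x e2-standard-2 nodes (2 vCPU each).
# ASSUMPTION: ~1.93 CPU allocatable per node after GKE system reservations;
# the real figure depends on GKE version and node configuration.
allocatable_cpu_per_node = 1.93
pod_cpu_request = 1.0    # spark.driver.cores / spark.executor.cores = 1

pods_per_node = int(allocatable_cpu_per_node // pod_cpu_request)
schedulable_pods = 3 * pods_per_node   # 3 nodes
requested_pods = 1 + 3                 # 1 driver + spark.executor.instances=3
pending = requested_pods - schedulable_pods
print(pods_per_node, schedulable_pods, pending)  # -> 1 3 1
```

Under those assumptions only three 1-core pods fit, matching the observed one driver plus two Running executors, with the fourth pod left Pending. `kubectl describe pod` on exec-3 would show the actual scheduling reason.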