The reduce phase of a large shuffle is often more network- and I/O-bound
than the map phase, so some underutilization there is not unusual.
A couple of suggestions you may want to consider:
1. Setting the number of shuffle partitions to 18K may be way too high
(the default is only 200). You may want to start with the default and
enable adaptive query execution (AQE), which will coalesce shuffle
partitions automatically as needed.
2. Turn on dynamic resource allocation (DRA)
(https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation).
It allows executors that have finished their map tasks to return their
resources (e.g. RAM, CPU cores) to the cluster, so they can be
reallocated to the reduce tasks. Since Spark 3.0 this feature is also
available on Kubernetes, but it is turned off by default.
3. With DRA turned on, you may also want to try a smaller number of
executors/nodes, which reduces cross-node shuffling, given that each
node only has 128GB RAM.
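As a rough sketch, the three suggestions above map to configuration
flags along these lines (the min/max executor counts are illustrative
placeholders, not recommendations for your workload):

```shell
spark-submit \
  # 1. Let AQE pick/coalesce shuffle partitions instead of a fixed 18K
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  # 2. Dynamic resource allocation; shuffle tracking is needed on K8s,
  #    which has no external shuffle service
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  # 3. Cap the executor count lower to reduce cross-node shuffle traffic
  #    (example values only)
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=15 \
  ...
```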
Hope this helps...
On 9/29/22 2:12 PM, Igor Calabria wrote:
Hi Everyone,
I'm running Spark 3.2 on Kubernetes and have a job with a decently
sized shuffle of almost 4TB. The relevant cluster config is as follows:
- 30 executors, 16 physical cores each, configured with 32 cores for Spark
- 128 GB RAM
- shuffle.partitions is 18k, which gives me tasks of around 150~180MB
The job runs fine, but I'm bothered by how underutilized the cluster
gets during the reduce phase. During the map phase (reading data from
S3 and writing the shuffle data), CPU usage, disk throughput, and
network usage are as expected, but during the reduce phase they get
really low. The main bottleneck seems to be reading shuffle data from
other nodes: task statistics report values ranging from 25s to several
minutes (the task sizes are really close; they aren't skewed). I've
tried increasing "spark.reducer.maxSizeInFlight" and
"spark.shuffle.io.numConnectionsPerPeer", which improved performance a
little, but not enough to saturate the cluster resources.
Did I miss some more tuning parameters that could help?
One obvious option would be to scale the machines up vertically and use
fewer nodes to minimize traffic, but 30 nodes doesn't seem like much,
even considering 30x30 connections.
Thanks in advance!