Igor,

what exact instance types do you use? Unless you use local instance storage
and have actually configured your Kubernetes and Spark to use instance
storage, your 30x30 exchange can run into EBS IOPS limits. You can
investigate that by going to an instance, then to volume, and see
monitoring charts.

Another thought is that you're essentially giving 4GB per core. That sounds
pretty low, in my experience.



On Thu, Sep 29, 2022 at 9:13 PM Igor Calabria <igor.calab...@gmail.com>
wrote:

> Hi Everyone,
>
> I'm running spark 3.2 on kubernetes and have a job with a decently sized
> shuffle of almost 4TB. The relevant cluster config is as follows:
>
> - 30 Executors. 16 physical cores, configured with 32 Cores for spark
> - 128 GB RAM
> -  shuffle.partitions is 18k which gives me tasks of around 150~180MB
>
> The job runs fine but I'm bothered by how underutilized the cluster gets
> during the reduce phase. During the map(reading data from s3 and writing
> the shuffle data) CPU usage, disk throughput and network usage is as
> expected, but during the reduce phase it gets really low. It seems the main
> bottleneck is reading shuffle data from other nodes, task statistics
> reports values ranging from 25s to several minutes(the task sizes are
> really close, they aren't skewed). I've tried increasing
> "spark.reducer.maxSizeInFlight" and
> "spark.shuffle.io.numConnectionsPerPeer" and it did improve performance by
> a little, but not enough to saturate the cluster resources.
>
> Did I miss some more tuning parameters that could help?
> One obvious thing would be to vertically increase the machines and use
> less nodes to minimize traffic, but 30 nodes doesn't seem like much even
> considering 30x30 connections.
>
> Thanks in advance!
>
>

-- 
Vladimir Prus
http://vladimirprus.com

Reply via email to