Have you tried taking several thread dumps across executors to see if the executors are consistently waiting for a resource?
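If it helps, here is a minimal sketch of how the comparison across dumps might be automated, assuming each dump (e.g. copied from the Thread Dump tab on an executor's page in the Spark UI) has been reduced to a thread-name -> thread-state mapping; the helper name and the sample thread names are illustrative:

```python
from collections import Counter

def consistently_waiting(dumps):
    """Given several thread dumps taken over time, each reduced to a
    {thread name: thread state} dict, return the threads that were
    waiting or blocked in every single dump."""
    waiting_states = {"WAITING", "TIMED_WAITING", "BLOCKED"}
    counts = Counter()
    for dump in dumps:
        for name, state in dump.items():
            if state in waiting_states:
                counts[name] += 1
    return sorted(name for name, c in counts.items() if c == len(dumps))

# Two synthetic dumps: the task worker is parked both times,
# the dispatcher only once.
dumps = [
    {"Executor task launch worker-0": "WAITING", "dispatcher-event-loop-0": "RUNNABLE"},
    {"Executor task launch worker-0": "WAITING", "dispatcher-event-loop-0": "WAITING"},
]
print(consistently_waiting(dumps))  # ['Executor task launch worker-0']
```

Threads that show up in that list across dumps taken seconds apart are the ones worth inspecting stack-by-stack.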
I suspect it's S3. S3's list operation doesn't scale with the number of keys in a folder. You aren't being throttled by S3; S3 is just slow when you have a lot of small objects. I suspect you'll just see the threads waiting for S3 to send back a response.

From: Denarian Kislata <denar...@gmail.com>
Date: Thursday, May 5, 2022 at 8:03 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [EXTERNAL] Something about Spark which has bothered me for a very long time, which I've never understood

Greetings, and thanks in advance.

For ~8 years I've been a Spark user, and I've seen this same problem at more SaaS startups than I can count, and although it's straightforward to fix, I've never understood _why_ it happens. I'm hoping someone can explain the why behind it.

Unfortunately I don't have screenshots handy, but the common problem is succinct: someone created a table stored as a large number of small S3 objects in a single bucket, partitioned by date. Some job loads this table by date partition, and inevitably at some point there is a scan/exchange in an execution plan which has ~1M partitions in the Spark UI.

Common issue with a known solution, but the odd thing I always see in this case, and wonder about, is: why do the Ganglia metrics look so strange for that particular part of the job? To clarify what strange means: for some cluster of some size, CPU utilization sits at about 40% user / 2% sys while the network is at about 5% of quoted throughput. It's not a long-fat network, and I usually haven't seen high retransmission rates or throttling from S3 in the logs. In these cases, heap is usually ~25% of maximum.
It's so strange to see utilization like this across the board, because some resource must be saturated, and although I haven't gotten to the point of connecting to an executor and reading saturation metrics, my intuition is that it's actually some pooled resource or semaphore that is exhausted. An obvious culprit would be something related to s3a, but whenever I peek at that, nothing really stands out configuration-wise. Maybe there's something I don't understand about the connection and thread configuration for s3a and I should go read that implementation tomorrow, but I thought I'd ask here first.

Has anyone else seen this? I'm not interested in _fixing_ it really (the right fix is to correct the input data...), but I'd love to understand _why_ it happens.

Thanks!
Den
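For what it's worth, the point above about S3 listing can be made concrete with back-of-the-envelope arithmetic: ListObjectsV2 returns at most 1,000 keys per request, and each page's continuation token depends on the previous response, so listing a prefix with ~1M objects costs ~1,000 strictly sequential round trips per listing thread. The latency figure below is an assumption for illustration, not a measurement:

```python
import math

n_objects = 1_000_000   # ~1M small objects under one prefix
keys_per_page = 1000    # S3 ListObjectsV2 hard cap per request
rtt_s = 0.05            # assumed ~50 ms per LIST round trip

# Pages must be fetched one after another (each needs the prior token).
pages = math.ceil(n_objects / keys_per_page)
print(pages, pages * rtt_s)  # 1000 50.0 -> ~50 s of pure request latency
```

During that time the executor burns almost no CPU and moves almost no data, which lines up with the ~40% CPU / ~5% network pattern described above: the threads are simply parked waiting on responses.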
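On the s3a pooled-resource hunch: the s3a client's HTTP connection pool and thread pool are bounded by configuration, and an exhausted pool would show up exactly as threads parked waiting rather than as CPU or network load. A sketch of the relevant knobs, passed through Spark's `spark.hadoop.` prefix; the option keys are real hadoop-aws settings, but the values here are illustrative assumptions to tune, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-pool-sizing")
    # Bound on the s3a HTTP connection pool.
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    # Threads used for parallel S3 operations per filesystem instance.
    .config("spark.hadoop.fs.s3a.threads.max", "64")
    # Queued S3 operations allowed before callers block.
    .config("spark.hadoop.fs.s3a.max.total.tasks", "128")
    .getOrCreate()
)
```

If thread dumps show many threads blocked inside the s3a or AWS SDK connection-pool code, these limits are the first place to look.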