Have you tried taking several thread dumps across executors to see if the executors are consistently waiting for a resource?
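If it helps, here is a minimal sketch of how the comparison across dumps might be automated, assuming each dump (e.g. copied from the Thread Dump tab on an executor's page in the Spark UI) has been reduced to a thread-name -> thread-state mapping; the helper name and the sample thread names are illustrative:

```python
from collections import Counter

def consistently_waiting(dumps):
    """Given several thread dumps taken over time, each reduced to a
    {thread name: thread state} dict, return the threads that were
    waiting or blocked in every single dump."""
    waiting_states = {"WAITING", "TIMED_WAITING", "BLOCKED"}
    counts = Counter()
    for dump in dumps:
        for name, state in dump.items():
            if state in waiting_states:
                counts[name] += 1
    return sorted(name for name, c in counts.items() if c == len(dumps))

# Two synthetic dumps: the task worker is parked both times,
# the dispatcher only once.
dumps = [
    {"Executor task launch worker-0": "WAITING", "dispatcher-event-loop-0": "RUNNABLE"},
    {"Executor task launch worker-0": "WAITING", "dispatcher-event-loop-0": "WAITING"},
]
print(consistently_waiting(dumps))  # ['Executor task launch worker-0']
```

Threads that show up in that list across dumps taken seconds apart are the ones worth inspecting stack-by-stack.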
I suspect it's S3. S3's list operation doesn't scale with the number of keys in a folder. You aren't being throttled by S3; S3 is just slow when you have a lot of small objects. I suspect you'll just see the threads waiting for S3 to send back a response.

From: Denarian Kislata <denar...@gmail.com>
Date: Thursday, May 5, 2022 at 8:03 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [EXTERNAL] Something about Spark which has bothered me for a very long time, which I've never understood

Greetings, and thanks in advance.

For ~8 years I've been a Spark user, and I've seen this same problem at more SaaS startups than I can count, and although it's straightforward to fix, I've never understood _why_ it happens. I'm hoping someone can explain the why behind it.

Unfortunately I don't have screenshots handy, but the common problem is succinct: someone created a table stored as a large number of small S3 objects in a single bucket, partitioned by date. Some job loads this table by date partition, and inevitably at some point there is a scan/exchange in an execution plan which has ~1M partitions in the Spark UI.

Common issue with a known solution, but the odd thing I always see in this case, and wonder about, is: why do the Ganglia metrics look so strange for that particular part of the job? To clarify what strange means: for some cluster of some size, CPU utilization sits at about 40% user / 2% sys while the network is at about 5% of quoted throughput. It's not a long-fat network, and I usually haven't seen high retransmission rates or throttling from S3 in the logs. In these cases, heap is usually ~25% of maximum.
It's so strange to see utilization like this across the board, because some resource must be saturated, and although I haven't gotten to the point of connecting to an executor and reading saturation metrics, my intuition is that it's actually some pooled resource or semaphore that is exhausted. An obvious culprit would be something related to s3a, but whenever I peek at that, nothing really stands out configuration-wise. Maybe there's something I don't understand about the connection and thread configuration for s3a and I should go read that implementation tomorrow, but I thought I'd ask here first.

Has anyone else seen this? I'm not interested in _fixing_ it really (the right fix is to correct the input data...), but I'd love to understand _why_ it happens.

Thanks!
Den
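For what it's worth, the point above about S3 listing can be made concrete with back-of-the-envelope arithmetic: ListObjectsV2 returns at most 1,000 keys per request, and each page's continuation token depends on the previous response, so listing a prefix with ~1M objects costs ~1,000 strictly sequential round trips per listing thread. The latency figure below is an assumption for illustration, not a measurement:

```python
import math

n_objects = 1_000_000   # ~1M small objects under one prefix
keys_per_page = 1000    # S3 ListObjectsV2 hard cap per request
rtt_s = 0.05            # assumed ~50 ms per LIST round trip

# Pages must be fetched one after another (each needs the prior token).
pages = math.ceil(n_objects / keys_per_page)
print(pages, pages * rtt_s)  # 1000 50.0 -> ~50 s of pure request latency
```

During that time the executor burns almost no CPU and moves almost no data, which lines up with the ~40% CPU / ~5% network pattern described above: the threads are simply parked waiting on responses.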
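On the s3a pooled-resource hunch: the s3a client's HTTP connection pool and thread pool are bounded by configuration, and an exhausted pool would show up exactly as threads parked waiting rather than as CPU or network load. A sketch of the relevant knobs, passed through Spark's `spark.hadoop.` prefix; the option keys are real hadoop-aws settings, but the values here are illustrative assumptions to tune, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-pool-sizing")
    # Bound on the s3a HTTP connection pool.
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    # Threads used for parallel S3 operations per filesystem instance.
    .config("spark.hadoop.fs.s3a.threads.max", "64")
    # Queued S3 operations allowed before callers block.
    .config("spark.hadoop.fs.s3a.max.total.tasks", "128")
    .getOrCreate()
)
```

If thread dumps show many threads blocked inside the s3a or AWS SDK connection-pool code, these limits are the first place to look.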