Greetings, and thanks in advance.

I've been a Spark user for ~8 years, and I've seen this same problem at
more SaaS startups than I can count. Although it's straightforward to
fix, I've never understood _why_ it happens.

I'm hoping someone can explain the why behind it.

Unfortunately I don't have screenshots handy, but the problem is easy to
describe:

Someone created a table stored as a large number of small S3 objects in a
single bucket, partitioned by date. Some job loads this table by date
partition, and inevitably at some point there is a scan/exchange in the
execution plan that shows ~1M partitions in the Spark UI.
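To make the shape concrete, it's roughly the pattern below. The bucket name,
path, file format, and date filter are all made up for illustration:

  import org.apache.spark.sql.SparkSession

  // Hypothetical layout: s3a://some-bucket/events/dt=2024-01-01/part-00000.snappy.parquet
  // ...and on the order of a million more small objects like it across the dates.
  val spark = SparkSession.builder().getOrCreate()

  val events = spark.read
    .parquet("s3a://some-bucket/events")                 // date-partitioned table of many tiny files
    .where("dt >= '2024-01-01' AND dt < '2024-07-01'")   // job reads a range of date partitions

  // The scan/exchange feeding this aggregation is where the Spark UI shows ~1M partitions.
  events.groupBy("dt").count().show()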

It's a common issue with a known solution, but the odd thing I always see
in this case, and wonder about, is why the Ganglia metrics look so strange
for that particular part of the job.

By strange I mean (for some cluster of some size) CPU utilization sitting
at about 40% user / 2% sys while the network sits at about 5% of quoted
throughput. It's not a long-fat network, and I usually don't see high
retransmission rates or S3 throttling in the logs.

In these cases, heap is usually ~25% of maximum. It's strange to see
utilization like this across the board, because some resource must be
saturated. I haven't gotten to the point of connecting to an executor and
reading saturation metrics, but my intuition is that it's some pooled
resource or semaphore that's exhausted. An obvious culprit would be
something related to s3a, but whenever I peek at that, nothing really
stands out configuration-wise. Maybe there's something I don't understand
about the connections and threads configuration for s3a and I should go
read that implementation tomorrow, but I thought I'd ask here first.
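For reference, the s3a settings I'm thinking of are along these lines. This
is just a sketch to show which knobs I mean; the values are guesses, not
recommendations:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("s3a-pool-check")
    // Size of the HTTP connection pool shared by S3 requests from a filesystem instance.
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    // Maximum threads s3a uses for queued filesystem operations (uploads, etc.).
    .config("spark.hadoop.fs.s3a.threads.max", "64")
    // How many operations can sit queued waiting for those threads.
    .config("spark.hadoop.fs.s3a.max.total.tasks", "128")
    .getOrCreate()

  // My (unverified) suspicion: with ~1M tiny objects, each task pays per-object
  // HEAD/GET latency, and if the connection pool is small relative to the number
  // of task threads per executor, threads block waiting for a pooled connection
  // instead of using CPU or network, which would look like the flat utilization above.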

Has anyone else seen this? I'm not really interested in _fixing_ it (the
right fix is to correct the input data...), but I'd love to understand
_why_ it happens.

Thanks!
Den

