Hard to say without a lot more info, but 76.5K tasks is very large. How big
are the tasks / how long do they take? If they are very short, you should
repartition down.
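
As a rough sketch of what I mean, assuming a DataFrame called df with small
rows (the target partition count here is only illustrative):

    # Collapse tens of thousands of tiny partitions into a few thousand
    # before counting. coalesce() avoids a full shuffle; use
    # repartition() instead if the data needs to be spread out evenly.
    df = df.coalesce(3200)
    df.count()
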
Do you end up with 800 executors? If so, why 2 per machine? That is
generally a loss with workers of this size. I'm also confused that 4,000
tasks are running, which would be only about 5 per executor (10 per
machine).
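
If you do want one larger executor per 8-vCPU worker instead of two, the
settings would look roughly like this (a sketch only; the memory value
assumes you leave headroom out of the 32 GB per machine and isn't tuned
for your job):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # one executor per e2-standard-8 worker: all 8 cores and
             # most of the RAM, leaving room for OS and overhead
             .config("spark.executor.cores", "8")
             .config("spark.executor.memory", "24g")
             .getOrCreate())

On YARN you'd typically pass these at submit time (e.g. via --conf) rather
than set them in code.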
What is the data input format? It's far faster to count() Parquet, since
that is essentially just a metadata read.
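That is, a bare count over Parquet, something like the sketch below (the
path is made up), can come back in seconds because Spark only has to read
the file footers:

    # No filters or transformations before count(), so Spark can answer
    # largely from Parquet metadata instead of scanning every row.
    spark.read.parquet("gs://some-bucket/some-path/").count()
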
Is anything else happening besides count() after the data is read?


On Tue, Apr 6, 2021 at 2:00 AM Krishna Chakka <krishnachakka...@gmail.com>
wrote:

> Hi,
>
> I am working on a Spark job. The count() alone takes 10 minutes.
> *Question: how can I make it faster?*
>
> From the above image, what I understand is that 4,001 tasks are running
> in parallel, out of 76,553 total tasks.
>
> Here are the parameters that I am using for the job
>     - master machine type - e2-standard-16
>     - worker machine type - e2-standard-8 (8 vcpus, 32 GB memory)
>     - number of workers - 400
>     - spark.executor.cores - 4
>     - spark.executor.memory - 11g
>     - spark.sql.shuffle.partitions - 10000
>
>
> Please advise: how can I make this faster?
>
> Thanks
>
