I am working with Parquet files, and metadata reading there is quite fast,
as there are at most 16 files (a couple of gigabytes each).

I find it very hard to answer the question "how many partitions do you
have?": many Spark operations do not preserve partitioning, and I have a lot
of filtering and grouping going on.
What I *can* say is that I set spark.sql.shuffle.partitions to 30,000.
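
The closest I can get to an answer is to spot-check individual DataFrames
at a few points in the pipeline, along these lines (just a sketch; the path
and column names are placeholders, and "spark" is the SparkSession):

    import org.apache.spark.sql.functions.col

    // Spot-check partitioning at a few points; path/column names are placeholders.
    val df = spark.read.parquet("/path/to/input")            // the ~16 parquet files
    println(s"after read:    ${df.rdd.getNumPartitions}")    // driven by file splits

    val grouped = df.filter(col("some_flag") === true)
                    .groupBy("some_key")
                    .count()
    println(s"after groupBy: ${grouped.rdd.getNumPartitions}")
    // after a shuffle this is typically spark.sql.shuffle.partitions (30,000 here)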

I am not worried that there are too few partitions to keep the cores
busy. Having said that, I do see that high utilisation correlates heavily
with shuffle read/write, whereas low utilisation correlates with no
shuffling.
This leads me to the conclusion that, compared to the amount of shuffling,
the cluster is doing very little actual work.

The question is: what can I do about it?
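
For what it's worth, the only concrete knob I have thought of trying so far
is lowering the shuffle fan-out, so that each task does more real work
relative to the shuffle overhead. Roughly (the numbers are illustrative,
not tuned):

    // Illustrative only: with 300 cores, 30,000 shuffle partitions means ~100
    // waves of very small tasks; a few multiples of the core count may be saner.
    spark.conf.set("spark.sql.shuffle.partitions", "900")

    // Where an intermediate result ends up over-partitioned, collapse it
    // without another shuffle ("wideDf" is a placeholder):
    val compacted = wideDf.coalesce(900)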

On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> Can you shed more light on what kind of processing you are doing?
>
>
>
> One common pattern I have seen where active core/executor utilization
> drops to zero is while reading ORC data, when the driver seems (I think)
> to be doing schema validation.
>
> In my case I would have hundreds of thousands of ORC data files, and there
> would be dead silence for about 1-2 hours.
>
> I have tried providing a schema and disabling schema validation while
> reading the ORC data, but that does not seem to help (Spark 2.2.1).
>
>
>
> And as you know, in most cases there is a linear relationship between the
> number of partitions in your data and the number of concurrently active executors.
>
>
>
> Another thing I would suggest is to use the following two API calls/methods;
> they will annotate the Spark stages and jobs in the Spark UI with what is
> being executed.
>
> SparkContext.setJobGroup(….)
>
> SparkContext.setJobDescription(….)
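>
> For example, something along these lines (just a sketch; "sc" is the
> SparkContext, and the group id / description strings are whatever is
> meaningful for your job):
>
>     sc.setJobGroup("nightly-etl", "filter + group over parquet inputs")
>     sc.setJobDescription("stage 1: read and filter")
>     // ...first set of actions...
>     sc.setJobDescription("stage 2: groupBy and write")
>     // ...second set of actions...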
>
>
>
> *From: *Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
> *Date: *Thursday, November 15, 2018 at 8:51 AM
> *To: *user <user@spark.apache.org>
> *Cc: *David Markovitz <dudu.markov...@microsoft.com>
> *Subject: *How to address seemingly low core utilization on a spark
> workload?
>
>
>
> I have a workload that runs on a cluster of 300 cores.
>
> Below is a plot of the number of active tasks over time during the
> execution of this workload:
>
>
>
> [image: plot of the number of active tasks over time]
>
>
>
> What I deduce is that there are substantial intervals where the cores are
> heavily under-utilised.
>
>
>
> What actions can I take to:
>
>    - Increase the efficiency (== core utilisation) of the cluster?
>    - Understand the root causes behind the drops in core utilisation?
>
>
