Hi, I have a simple use case where I have to join two feeds. I have two worker nodes, each with 96 GB of memory and 24 cores. I am running Spark (1.1.0) on YARN (2.4.0). I have allocated 80% of the cluster resources to the Spark queue, and my Spark config looks like this:

    spark.executor.cores=18
    spark.executor.memory=66g
    spark.executor.instances=2
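For reference, this is roughly how that configuration would be set programmatically; a minimal sketch, where the app name ("FeedJoin") is a placeholder (the same values can equally be passed as spark-submit flags):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the configuration described above; "FeedJoin" is a
    // placeholder app name, not the real job name.
    val conf = new SparkConf()
      .setAppName("FeedJoin")
      .set("spark.executor.cores", "18")     // 18 of the 24 cores per node
      .set("spark.executor.memory", "66g")   // ~80% of the 96 GB per node
      .set("spark.executor.instances", "2")  // one executor per worker node
    val sc = new SparkContext(conf)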
My job schedules 400 tasks, and close to 40 tasks run in parallel at a time. The job is IO-bound and CPU utilisation stays below 40%. The Spark tuning page recommends configuring 2-3 tasks per CPU core, so I changed spark.default.parallelism to 80, but it is still running only ~40 tasks at a time. How can I run more tasks in parallel?

One more question: I have to cache the combined RDD after the join. Should I instead run 4 executors with 32 GB of memory each and set -XX:+UseCompressedOops? What are the pros and cons of doing that? Thanks.
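For context, the join-and-cache step looks roughly like this; the feed paths, key extraction, and the partition count of 80 are illustrative assumptions, not the real job:

    import org.apache.spark.SparkContext._  // pair-RDD functions such as join (Spark 1.x)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // sc is the SparkContext from the config sketch above.
    // Hypothetical keyed feeds; paths and key extraction are simplified.
    val feedA: RDD[(String, String)] = sc.textFile("hdfs:///feeds/a")
      .map(line => (line.split(",")(0), line))
    val feedB: RDD[(String, String)] = sc.textFile("hdfs:///feeds/b")
      .map(line => (line.split(",")(0), line))

    // join accepts an explicit partition count, which sets the number of
    // shuffle tasks for this stage independently of spark.default.parallelism.
    val combined = feedA.join(feedB, 80)

    // Cache the joined RDD for reuse; MEMORY_ONLY is the default level.
    combined.persist(StorageLevel.MEMORY_ONLY)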