I am running a Spark job that processes an 80 GB file in HDFS.
I have 5 slaves, each with:
32 cores / 256 GB RAM / 7 physical disks

I have tried many different YARN configurations, for example:
yarn.nodemanager.resource.memory-mb 196Gb
yarn.nodemanager.resource.cpu-vcores 24
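
As a rough sanity check on these limits, here is a small sketch of how many executors the cluster could hold at once. It assumes one vcore per executor, ignores the extra per-container memory overhead that YARN adds on top of the executor memory, and assumes the scheduler actually takes vcores into account; none of that is verified against my setup.

```python
NODES = 5
NODE_MEMORY_GB = 196  # yarn.nodemanager.resource.memory-mb, read as GB
NODE_VCORES = 24      # yarn.nodemanager.resource.cpu-vcores


def max_executors(executor_memory_gb, cores_per_executor=1):
    """Upper bound on concurrent executors across the cluster.

    Assumption: no container memory overhead, and both memory and
    vcores are enforced per node.
    """
    by_memory = NODE_MEMORY_GB // executor_memory_gb
    by_cores = NODE_VCORES // cores_per_executor
    return NODES * min(by_memory, by_cores)


for mem in (1, 4):
    print(mem, "GB per executor ->", max_executors(mem), "executors")
```

Under these assumptions the vcores are the limiting factor for both 1g and 4g executors, so 60 executors should still fit within the configured capacity.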

I have tried running the job with different numbers of executors and
different amounts of executor memory (1-4g).
With 20 executors, each iteration (128 MB) takes 25s and there is never
a really long wait because of GC.

When I run around 60 executors, the processing time is about 45s and
some tasks take up to one minute because of GC.

I have no idea why GC kicks in when I run more executors simultaneously.
My other question is why each block takes longer to process. My theory
is that there are only 7 physical disks per node, and having 5 processes
writing is not the same as having 20.
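
To put rough numbers on the disk theory, here is a back-of-the-envelope sketch. It assumes the executors are spread evenly over the 5 nodes and that every executor writes to local disks at the same time, which I have not verified.

```python
NODES = 5
DISKS_PER_NODE = 7


def executors_per_disk(total_executors, nodes=NODES, disks=DISKS_PER_NODE):
    """Average number of executors competing for each physical disk.

    Assumption: executors are distributed evenly across nodes.
    """
    per_node = total_executors / nodes
    return per_node / disks


for n in (20, 60):
    print(n, "executors ->", round(executors_per_disk(n), 2), "per disk")
```

With 20 executors that is about 0.57 executors per disk, while with 60 it is about 1.71, so in the second case some disks would be shared by more than one writer.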

The code is pretty simple: it's just a map function that parses a line
and writes the output to HDFS. There are a lot of substring calls inside
the function, which could be what causes the GC.
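
For illustration only, a function of this shape might look like the sketch below. The field names and offsets are hypothetical, not the ones from the actual job, and the sketch is plain Python rather than the job's real code.

```python
# Hypothetical fixed-width layout: (field name, start, end).
# These fields are made up for illustration; they are not the real schema.
FIELDS = [("id", 0, 8), ("date", 8, 16), ("value", 16, 24)]


def parse_line(line):
    """Cut one input line into fields and join them with commas.

    Every slice creates a new short-lived string, which is the kind of
    allocation that adds up when the function runs once per line.
    """
    parts = [line[start:end].strip() for _, start, end in FIELDS]
    return ",".join(parts)


print(parse_line("00000042201501011234.50 "))
```

In a PySpark job, a function like this would be passed to something like `rdd.map(parse_line)` before saving the result with `saveAsTextFile`.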

Any theories about what is going on?
