I'm running a Spark job that processes an 80 GB file in HDFS. I have 5 slaves, each with 32 cores, 256 GB of RAM, and 7 physical disks.
I have been trying many different configurations with YARN:

yarn.nodemanager.resource.memory-mb: 196 GB
yarn.nodemanager.resource.cpu-vcores: 24

I have tried running the job with different numbers of executors and different amounts of executor memory (1-4 GB). With 20 executors, each iteration (one 128 MB block) takes about 25 s, and tasks never spend a really long time waiting on GC. With around 60 executors, the processing time per block is about 45 s, and some tasks take up to a minute because of GC.

I have no idea why GC kicks in when I run more executors simultaneously. My other question is why each block takes longer to process. My theory is that there are only 7 physical disks per node, and having 20 processes writing at once is not the same as having 5.

The code is pretty simple: it's just a map function that parses each line and writes the output to HDFS. There are a lot of substring calls inside the function, which could be causing the GC.

Any theories?
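
A back-of-the-envelope sketch of the disk theory, in Python, using only the figures quoted above. It assumes executors are spread evenly across the 5 nodes and that each executor runs one task at a time with 4 GB of memory; none of that is confirmed by the job itself, YARN's per-container memory overhead is ignored, and the function name and parameters are made up for illustration.

```python
import math

# Figures quoted in the message
FILE_SIZE_GB = 80
BLOCK_SIZE_MB = 128
NODES = 5
DISKS_PER_NODE = 7
VCORES_PER_NODE = 24   # yarn.nodemanager.resource.cpu-vcores
YARN_MEMORY_GB = 196   # yarn.nodemanager.resource.memory-mb


def per_node_load(executors, cores_per_executor=1, executor_memory_gb=4):
    """Estimate the load one node carries (hypothetical helper).

    Assumption: executors are spread evenly over the nodes and each core
    runs one task at a time. YARN memory overhead is not counted.
    """
    executors_per_node = math.ceil(executors / NODES)
    concurrent_tasks = executors_per_node * cores_per_executor
    memory_gb = executors_per_node * executor_memory_gb
    return {
        "executors_per_node": executors_per_node,
        "tasks_per_disk": concurrent_tasks / DISKS_PER_NODE,
        "memory_gb": memory_gb,
        "fits_in_yarn": (concurrent_tasks <= VCORES_PER_NODE
                         and memory_gb <= YARN_MEMORY_GB),
    }


# Number of 128 MB blocks (and so map tasks) in the 80 GB file
total_blocks = FILE_SIZE_GB * 1024 // BLOCK_SIZE_MB

for executors in (20, 60):
    load = per_node_load(executors)
    print(executors, "executors:", load)
```

Under these assumptions, 20 executors leave fewer than one concurrent task per disk on each node, while 60 executors put more than one task on each disk, so the disks are a plausible place for contention. Both setups stay well inside the YARN memory and vcore limits, so the sketch does not by itself explain the extra GC time.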