Hi, I have a simple use case where I have to join two feeds. I have two worker nodes, each with 96 GB of memory and 24 cores. I am running Spark (1.1.0) on YARN (2.4.0). I have allocated 80% of the cluster resources to the Spark queue, and my Spark config looks like this:

    spark.executor.cores=18
    spark.executor.memory=66g
    spark.executor.instances=2
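For reference, this is roughly how that configuration would be set programmatically; a minimal sketch, where the app name ("FeedJoin") is a placeholder (the same values can equally be passed as spark-submit flags):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the configuration described above; "FeedJoin" is a
    // placeholder app name, not the real job name.
    val conf = new SparkConf()
      .setAppName("FeedJoin")
      .set("spark.executor.cores", "18")     // 18 of the 24 cores per node
      .set("spark.executor.memory", "66g")   // ~80% of the 96 GB per node
      .set("spark.executor.instances", "2")  // one executor per worker node
    val sc = new SparkContext(conf)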
My job schedules 400 tasks, and close to 40 tasks run in parallel at a time. The job is IO-bound and CPU utilisation stays below 40%. The Spark tuning page recommends configuring 2-3 tasks per CPU core, so I changed spark.default.parallelism to 80, but it is still running only ~40 tasks at a time. How can I run more tasks in parallel?

One more question: I have to cache the combined RDD after the join. Should I instead run 4 executors with 32 GB of memory each and set -XX:+UseCompressedOops? What are the pros and cons of doing that? Thanks.
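For context, the join-and-cache step looks roughly like this; the feed paths, key extraction, and the partition count of 80 are illustrative assumptions, not the real job:

    import org.apache.spark.SparkContext._  // pair-RDD functions such as join (Spark 1.x)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // sc is the SparkContext from the config sketch above.
    // Hypothetical keyed feeds; paths and key extraction are simplified.
    val feedA: RDD[(String, String)] = sc.textFile("hdfs:///feeds/a")
      .map(line => (line.split(",")(0), line))
    val feedB: RDD[(String, String)] = sc.textFile("hdfs:///feeds/b")
      .map(line => (line.split(",")(0), line))

    // join accepts an explicit partition count, which sets the number of
    // shuffle tasks for this stage independently of spark.default.parallelism.
    val combined = feedA.join(feedB, 80)

    // Cache the joined RDD for reuse; MEMORY_ONLY is the default level.
    combined.persist(StorageLevel.MEMORY_ONLY)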