Hi spark-user, I am using Spark 1.6 to build a reverse index for one month of Twitter data (~50 GB). The HDFS split size is 1 GB, so by default sc.textFile creates 50 partitions. I'd like to increase parallelism by increasing the number of input partitions, so I use textFile(..., 200) to get 200 partitions.
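For reference, a minimal sketch of the setup described above (the input path and app name are hypothetical; minPartitions is a hint, so Hadoop may produce slightly more splits than requested):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ReverseIndex")
val sc = new SparkContext(conf)

// Ask for at least 200 input partitions instead of the default 50.
val tweets = sc.textFile("hdfs:///data/twitter/month/*", minPartitions = 200)

// Alternative: keep the default input splits and repartition afterwards.
// This costs an extra shuffle but yields exactly 200 partitions.
val repartitioned = sc.textFile("hdfs:///data/twitter/month/*").repartition(200)
```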
I found significant GC overhead in the stage that builds the reverse index (which involves a large shuffle): more than 80% of task time is spent in GC. I tried decreasing the number of cores per executor from 8 to 5, which reduced the GC time, but it remained high (more than 50% of task time). However, with the default number of partitions (50), there is no GC overhead at all. The machines running the executors have more than 100 GB of memory, and I set the executor memory to 32 GB. I can confirm that no more than one executor runs on each machine. Why is there such significant GC overhead after I increase the number of input partitions? Thanks, J.
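In case it helps others reproduce or diagnose this, here is a sketch of the submit settings described above, with GC logging enabled so the per-collection pauses show up in the executor logs (the jar name and main class are hypothetical; the memory, cores, and GC flags match the configuration in this mail):

```shell
spark-submit \
  --class example.ReverseIndex \
  --conf spark.executor.memory=32g \
  --conf spark.executor.cores=5 \
  --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  reverse-index.jar
```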