Hi folks,

I've run into a very strange problem and could really use some help. Here is my situation:

--------------------------------------------------------------------
Case:
Assign keys to two datasets (one 96 GB with 2.7 billion records, the other 1.5 GB with 30k records) via mapPartitions first, then join them together on those keys (roughly the structure sketched below).
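A minimal sketch of how the job is structured (the paths, the tab-delimited key extraction, and the names here are simplified placeholders, not my real code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object KeyAndJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KeyAndJoin"))

    // Large dataset: 96 GB, ~2.7 billion records (placeholder path)
    val big   = sc.textFile("hdfs:///data/big")
    // Small dataset: 1.5 GB, ~30k records (placeholder path)
    val small = sc.textFile("hdfs:///data/small")

    // Assign a key to each record via mapPartitions; the split on '\t'
    // stands in for whatever per-partition keying logic the real job uses.
    def assignKeys(lines: RDD[String]): RDD[(String, String)] =
      lines.mapPartitions { iter =>
        iter.map { line =>
          val key = line.split('\t')(0)   // placeholder key extraction
          (key, line)
        }
      }

    // Join the two keyed datasets; this is the stage where the
    // "remote Akka client disassociated" errors show up.
    val joined = assignKeys(big).join(assignKeys(small))

    joined.saveAsTextFile("hdfs:///data/joined")   // placeholder output path
    sc.stop()
  }
}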
--------------------------------------------------------------------
Environment:
Standalone Spark on Amazon EC2
Master * 1: 13 GB, 8 cores
Worker * 16: 13 GB, 8 cores each (after hitting this problem, I switched to Worker * 16 with 59 GB, 8 cores each)
Read and write on HDFS (same cluster)

----------------------------------------------------------------------
Problem:

At the beginning -----------------------
The mapPartitions stage looks fine. But when Spark runs the join of the two datasets, the console says
*"ERROR TaskSchedulerImpl: Lost executor 4 on ip-172-31-27-174.us-west-2.compute.internal: remote Akka client disassociated"*
I then went back to that worker and checked its log. It contained something like "Master said remote Akka client disassociated and asked to kill executor ***, and then the worker killed this executor." (Sorry, I deleted that log and only remember the gist.) There are no other errors before the Akka client disassociates, on either the master or the worker.

Then -------------------------------------------
I tried a 62 GB dataset with the 1.5 GB dataset, and the job ran smoothly. *HOWEVER, I noticed one thing: if I set spark.shuffle.memoryFraction to zero, the same error happens on this 62 GB dataset* (config sketch in the P.S. below).

Then ---------------------------------------------------
I switched my workers to Worker * 16 with 59 GB, 8 cores each. Now the error even happens while Spark runs the mapPartitions stage!

Some metrics I found ----------------------------------------------------------------------------
*When I run the mapPartitions or the join on the 96 GB data, its shuffle write is around 100 GB. When I cache the 96 GB data, its cached size is around 530 GB.*
*Garbage collection time for the 96 GB dataset during the map or join is around 12 seconds.*

My analysis ----------------------------------------------
This problem might be caused by the large shuffle write. The large shuffle write causes heavy I/O on disk; if the shuffle write cannot finish within some timeout period, the master may decide this executor has disassociated. But I don't know how to confirm or fix this.

-------------------------------------------------------------------
Any help will be appreciated!

Thanks,
Jia
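P.S. In case it helps, a sketch of the relevant configuration. The shuffle fraction is what I actually changed in the experiment above; the two timeouts are only guesses on my part, based on my theory that slow shuffle writes trip a liveness timeout, and I have not verified that raising them helps.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("KeyAndJoin")
  // Setting this to 0 forces all shuffle data to spill to disk; with it at 0
  // the disassociation error also shows up on the 62 GB dataset.
  .set("spark.shuffle.memoryFraction", "0")               // default is 0.2
  // Guesses, not verified: timeouts I am wondering whether to raise.
  .set("spark.core.connection.ack.wait.timeout", "600")   // seconds (assumed)
  .set("spark.akka.timeout", "600")                       // seconds (assumed)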