As part of my processing, I have the following code:

    rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 100000)
    rdd.count()
The s3 directory has about 8GB of data and 61,878 files. I am using Spark 2.1, running on EMR with 15 m3.xlarge nodes. The job fails with this error:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 35532 in stage 0.0
    failed 4 times, most recent failure: Lost task 35532.3 in stage 0.0 (TID 35543,
    ip-172-31-36-192.us-west-2.compute.internal, executor 6): ExecutorLostFailure (executor 6
    exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding
    memory limits. 7.4 GB of 5.5 GB physical memory used. Consider boosting
    spark.yarn.executor.memoryOverhead.

I have run it dozens of times, increasing the number of partitions and reducing the size of my data set (the original is 60GB), but I get the same error each time.

In contrast, if I run a simple:

    rdd = sc.textFile("s3://paulhtremblay/noaa_tmp/")
    rdd.count()

the job finishes in 15 minutes, even with just 3 nodes.

Thanks

--
Paul Henry Tremblay
Robert Half Technology
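P.S. For reference, I assume the setting the error message mentions would be passed at submit time roughly like this; the 4096 MB value and the script name are just placeholders, not something I have verified:

```shell
# Bump the off-heap headroom YARN grants each executor container,
# per the "Consider boosting spark.yarn.executor.memoryOverhead" hint.
# 4096 (MB) is a guess; "count_noaa.py" is a placeholder script name.
spark-submit \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  count_noaa.py
```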