My job is a 1 TB join with a 10 GB table on Spark 1.6.1, running in YARN mode:

*1. If I enable the external shuffle service, the error is:*

Job aborted due to stage failure: ShuffleMapStage 2 (writeToDirectory at NativeMethodAccessorImpl.java:-2) has failed the maximum allowable number of times: 4. Most recent failure reason:
org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Executor is not registered (appId=application_1473819702737_1239, execId=52)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:105)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
    at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)
*2. If I disable the shuffle service and set spark.executor.instances to 80, the error is:*

ExecutorLostFailure (executor 71 exited caused by one of the running tasks) Reason: Container marked as failed: container_1473819702737_1432_01_406847560 on host: nmg01-spark-a0021.nmg01.baidu.com. Exit status: 52. Diagnostics: Exception from container-launch:
ExitCodeException exitCode=52:

Both errors occur during the shuffle stage. My data is skewed: some ids have 400 million rows, while other ids have only 1 million rows. Does anybody have ideas on how to solve this problem?

*3. My config is:*

I use tungsten-sort in off-heap mode; in on-heap mode the OOM problem is even worse.

spark.driver.cores 4
spark.driver.memory 8g  # used in client mode
spark.yarn.am.memory 8g
spark.yarn.am.cores 4
spark.executor.memory 8g
spark.executor.cores 4
spark.yarn.executor.memoryOverhead 6144
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 4000000000

Best & Regards
Cyanny LIANG
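One approach I am considering for the skew is key salting: spread each hot key of the large table across N salted partitions, and replicate every row of the small table once per salt so each salted key still finds its match. Here is a minimal pure-Python sketch of the idea, with plain lists standing in for RDDs (all names and the salt count are illustrative, not Spark API):

```python
import random
from collections import defaultdict

# Illustrative sketch of key salting for a skewed join.
# N_SALTS is a hypothetical tuning knob, not a Spark setting.
N_SALTS = 4

def salt_large(rows):
    # rows: (key, value) pairs from the large, skewed table.
    # Each row gets a random salt appended to its key, so a hot key
    # is spread across N_SALTS composite keys (i.e. partitions).
    return [((key, random.randrange(N_SALTS)), value) for key, value in rows]

def explode_small(rows):
    # Replicate every small-table row once per salt value, so each
    # salted composite key on the large side has a matching row.
    return [((key, s), value) for key, value in rows for s in range(N_SALTS)]

def join(left, right):
    # Simple hash join on the (key, salt) composite key.
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k[0], (lv, rv)) for k, lv in left for rv in index[k]]

# Toy data: one hot id with many rows, one rare id with a single row.
large = [("hot_id", i) for i in range(8)] + [("rare_id", 0)]
small = [("hot_id", "dim_a"), ("rare_id", "dim_b")]

joined = join(salt_large(large), explode_small(small))
# Every large-table row still joins exactly once, but the hot key's
# work is now spread over up to N_SALTS composite keys.
assert len(joined) == len(large)
```

In Spark this would mean mapping the big RDD's keys to (key, randomSalt) and exploding the 10 GB table with a flatMap over the salt range before the join; it trades N-fold replication of the small table for much smaller per-task shuffle partitions on the hot keys.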