Re: 1TB shuffle failed with executor lost failure

2016-09-19 Thread Divya Gehlot
The exit code 52 comes from org.apache.spark.util.SparkExitCode, where it is
defined as val OOM = 52, i.e. an OutOfMemoryError.
Refer to:
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/util/SparkExitCode.scala
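
For reference, the relevant constants in SparkExitCode.scala look roughly like
this (an abridged sketch of the linked file):

    private[spark] object SparkExitCode {
      /** The default uncaught exception handler was reached. */
      val UNCAUGHT_EXCEPTION = 50

      /** OutOfMemoryError. */
      val OOM = 52
    }

So a YARN container that exits with status 52 means the executor JVM died with
an OutOfMemoryError.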



On 19 September 2016 at 14:57, Cyanny LIANG  wrote:

> My job is a 1 TB table joined with a 10 GB table on Spark 1.6.1,
> running in YARN mode:
>
> *1. If I enable the shuffle service, the error is:*
> Job aborted due to stage failure: ShuffleMapStage 2 (writeToDirectory at
> NativeMethodAccessorImpl.java:-2) has failed the maximum allowable number
> of times: 4. Most recent failure reason:
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
> Executor is not registered (appId=application_1473819702737_1239, execId=52)
> at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:105)
> at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
> at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
> at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87)
> at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)
>
> *2. If I disable the shuffle service and set spark.executor.instances 80,*
> the error is:
> ExecutorLostFailure (executor 71 exited caused by one of the running tasks)
> Reason: Container marked as failed: container_1473819702737_1432_01_406847560
> on host: nmg01-spark-a0021.nmg01.baidu.com. Exit status: 52. Diagnostics:
> Exception from container-launch: ExitCodeException exitCode=52:
> ExitCodeException exitCode=52:
>
> These errors are reported during the shuffle stage.
> My data is skewed: some ids have 400 million rows while others have only
> 1 million rows. Does anybody have ideas on how to solve this problem?
>
>
> *3. My config is:*
> I use tungsten-sort in off-heap mode; in on-heap mode, the OOM problem is
> even more serious.
>
> spark.driver.cores 4
> spark.driver.memory 8g
>
> # used in client mode
> spark.yarn.am.memory 8g
> spark.yarn.am.cores 4
>
> spark.executor.memory 8g
> spark.executor.cores 4
> spark.yarn.executor.memoryOverhead 6144
>
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 40
>
> Best & Regards
> Cyanny LIANG
>


1TB shuffle failed with executor lost failure

2016-09-19 Thread Cyanny LIANG
My job is a 1 TB table joined with a 10 GB table on Spark 1.6.1,
running in YARN mode:

*1. If I enable the shuffle service, the error is:*
Job aborted due to stage failure: ShuffleMapStage 2 (writeToDirectory at
NativeMethodAccessorImpl.java:-2) has failed the maximum allowable number
of times: 4. Most recent failure reason:
org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
Executor is not registered (appId=application_1473819702737_1239, execId=52)
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:105)
at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)
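
(For context, the shuffle service referred to here is Spark's external shuffle
service. It is normally enabled with something like the following; this is a
generic sketch of the usual YARN setup, not this cluster's actual configuration:

    # spark-defaults.conf
    spark.shuffle.service.enabled true

    # yarn-site.xml on each NodeManager, per the Spark running-on-YARN guide
    yarn.nodemanager.aux-services = mapreduce_shuffle,spark_shuffle
    yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService

The "Executor is not registered" message above is thrown by this service when it
has no registration for the executor whose shuffle output is being requested.)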

*2. If I disable the shuffle service and set spark.executor.instances 80,*
the error is:
ExecutorLostFailure (executor 71 exited caused by one of the running tasks)
Reason: Container marked as failed:
container_1473819702737_1432_01_406847560 on host:
nmg01-spark-a0021.nmg01.baidu.com. Exit status: 52. Diagnostics: Exception
from container-launch: ExitCodeException exitCode=52:
ExitCodeException exitCode=52:

These errors are reported during the shuffle stage.
My data is skewed: some ids have 400 million rows while others have only
1 million rows. Does anybody have ideas on how to solve this problem?
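
One common workaround for this kind of join-key skew is to salt the hot keys so
that a single id is spread over several reducers. A minimal sketch on Spark 1.6
RDDs follows; the names, types, and salt factor are illustrative, not taken
from this job:

    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // Illustrative sketch: big is the 1 TB side, small is the 10 GB side,
    // both keyed by id. SALT = 16 is a guess and needs tuning to the skew.
    val SALT = 16

    def saltedJoin(big: RDD[(String, String)],
                   small: RDD[(String, String)]): RDD[(String, (String, String))] = {
      // Spread each key on the large side across SALT sub-keys.
      val saltedBig = big.map { case (k, v) => (s"$k#${Random.nextInt(SALT)}", v) }
      // Replicate the small side once per salt value so every sub-key finds a match.
      val saltedSmall = small.flatMap { case (k, v) =>
        (0 until SALT).map(s => (s"$k#$s", v))
      }
      // Join on the salted keys, then strip the salt off again.
      saltedBig.join(saltedSmall).map { case (sk, vw) =>
        (sk.substring(0, sk.lastIndexOf('#')), vw)
      }
    }

This trades a SALT-fold blow-up of the small side for reduce tasks of roughly
even size, which is usually cheaper than one reducer holding a 400-million-row key.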


*3. My config is:*
I use tungsten-sort in off-heap mode; in on-heap mode, the OOM problem is
even more serious.

spark.driver.cores 4
spark.driver.memory 8g

# used in client mode
spark.yarn.am.memory 8g
spark.yarn.am.cores 4

spark.executor.memory 8g
spark.executor.cores 4
spark.yarn.executor.memoryOverhead 6144

spark.memory.offHeap.enabled true
spark.memory.offHeap.size 40
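
One note on the last line: in Spark 1.6, spark.memory.offHeap.size is an
absolute number of bytes, so a usable off-heap region is normally set in the
gigabyte range. A sketch of how these settings are typically wired up
programmatically; the application name and the 4 GB figure are illustrative,
not a recommendation for this job:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: off-heap size is given in bytes on Spark 1.6, and the YARN
    // memory overhead has to be large enough to also cover the off-heap region.
    // The 4 GB value is illustrative, not tuned for this workload.
    val conf = new SparkConf()
      .setAppName("1tb-join")
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", (4L * 1024 * 1024 * 1024).toString) // 4 GB in bytes
      .set("spark.yarn.executor.memoryOverhead", "6144")                    // MB

    val sc = new SparkContext(conf)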

Best & Regards
Cyanny LIANG