[ 
https://issues.apache.org/jira/browse/SPARK-35006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huy updated SPARK-35006:
------------------------
    Description: 
*Issue Description*

I have a Spark application with 1 driver (on bare metal) + 1 executor (on 
K8s) (please note this setup is for testing purposes only). The corresponding 
configuration can be found below.

When I run a task that loads and processes an XML file, the executor hits an 
OOM error because the XML file is large (which was intentional):

# Aborting due to java.lang.OutOfMemoryError: Java heap space
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (debug.cpp:308), pid=19, tid=0x00007fff765ae700
# fatal error: OutOfMemory encountered: Java heap space

 

However, the driver does not recognize this error as a task failure. 
Instead, it treats it as a framework issue and keeps retrying the task:

INFO TaskSchedulerImpl:57 - Executor 1 on 10.87.88.44 killed by driver.
INFO TaskSetManager:57 - task 0.0 in stage 0.0 (TID 0) failed because while it 
was being computed, its executor exited for a reason unrelated to the task. 
*{color:#de350b}Not counting this failure towards the maximum number of 
failures for the task{color}.*
INFO BlockManagerMasterEndpoint:57 - Trying to remove executor 1 from 
BlockManagerMaster.

 

As a result, the Spark application keeps retrying the task indefinitely and 
blocks other tasks from running.
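
To illustrate why an uncounted failure leads to endless retries, here is a 
minimal sketch of retry accounting. It is a simplified model, not Spark's 
scheduler code: TaskAttemptResult and run_with_retries are hypothetical names, 
and the limit of 4 is assumed from the default of spark.task.maxFailures.

```python
from dataclasses import dataclass

MAX_TASK_FAILURES = 4  # assumed: default value of spark.task.maxFailures


@dataclass
class TaskAttemptResult:
    """Outcome of one attempt of a task (hypothetical model)."""
    succeeded: bool
    counts_towards_failures: bool = True


def run_with_retries(attempt_outcomes, max_failures=MAX_TASK_FAILURES):
    """Replay a sequence of attempt outcomes and report how scheduling ends.

    Returns a (status, attempts_used) tuple where status is one of
    "succeeded", "aborted" or "still retrying".
    """
    counted_failures = 0
    attempts = 0
    for outcome in attempt_outcomes:
        attempts += 1
        if outcome.succeeded:
            return "succeeded", attempts
        # Only failures attributed to the task move it closer to the limit.
        if outcome.counts_towards_failures:
            counted_failures += 1
            if counted_failures >= max_failures:
                return "aborted", attempts
    # The input ran out before the task either succeeded or was aborted.
    return "still retrying", attempts


# Every attempt ends in an executor OOM; only the classification differs.
oom_counted = [TaskAttemptResult(False, True)] * 10
oom_not_counted = [TaskAttemptResult(False, False)] * 10

print(run_with_retries(oom_counted))      # ('aborted', 4)
print(run_with_retries(oom_not_counted))  # ('still retrying', 10)
```

In the second case the counter never moves, so no number of attempts can 
abort the task, which matches the behaviour described above.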

 

*Expectation* 

The Spark driver should classify an OOM on the executor pod as a task failure 
and count it towards the maximum number of failures for the task.
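
One way to picture the expected classification is the sketch below. It is 
only an illustration of the requested behaviour, not a proposed patch: 
ExecutorExit, exit_caused_by_app and all field names are hypothetical, the 
exit codes in the usage lines are made up, and matching on the log text is 
just an assumed way to detect the OOM.

```python
from dataclasses import dataclass


@dataclass
class ExecutorExit:
    """Hypothetical summary of why an executor went away."""
    exit_code: int
    killed_by_driver: bool   # the driver asked for this executor to be removed
    log_tail: str = ""       # last lines of the executor log, if available


def exit_caused_by_app(info: ExecutorExit) -> bool:
    """Return True if the exit should count as a failure of the running task."""
    # An OOM in the executor JVM is attributed to the task, even when the
    # driver subsequently removes the executor.
    if "java.lang.OutOfMemoryError" in info.log_tail:
        return True
    # An executor removed on the driver's initiative is a framework event.
    if info.killed_by_driver:
        return False
    # Any other abnormal exit is treated as caused by the application.
    return info.exit_code != 0


oom_exit = ExecutorExit(
    exit_code=134,  # illustrative value only
    killed_by_driver=True,
    log_tail="#Aborting due to java.lang.OutOfMemoryError: Java heap space",
)
scale_down = ExecutorExit(exit_code=143, killed_by_driver=True)

print(exit_caused_by_app(oom_exit))    # True
print(exit_caused_by_app(scale_down))  # False
```

With a classification like this, the OOM above would count towards the 
failure limit and the task would eventually be aborted instead of retried 
forever.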

 

*Configuration*

spark.kubernetes.container.image: "spark_image_path"
spark.kubernetes.container.image.pullPolicy: "Always"
spark.kubernetes.namespace: "qa-namespace"
spark.kubernetes.authenticate.driver.serviceAccountName: "svc-account"
spark.kubernetes.executor.request.cores: "2"
spark.kubernetes.executor.limit.cores: "2"
spark.executorEnv.SPARK_ENV: "dev"
spark.executor.memoryOverhead: "1G"
spark.executor.memory: "6g"
spark.executor.cores: "2"
spark.executor.instances: "3"
spark.driver.maxResultSize: "1g"
spark.driver.memory: "10g"
spark.driver.cores: "2"
spark.eventLog.enabled: 'true'
spark.driver.extraJavaOptions: "-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -XX:+UseG1GC \
  -XX:+PrintFlagsFinal \
  -XX:+PrintReferenceGC -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps \
  -XX:+PrintAdaptiveSizePolicy \
  -XX:+UnlockDiagnosticVMOptions \
  -XX:+G1SummarizeConcMark \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:ConcGCThreads=20 \
  -XX:+PrintGCCause \
  -XX:+AlwaysPreTouch \
  -Dlog4j.debug=true -Dlog4j.configuration=[file:///].... "
spark.sql.session.timeZone: UTC
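
For reference, the memory settings above can be worked through as follows. 
This is a small sketch assuming the usual behaviour on Kubernetes, where the 
executor heap is capped at spark.executor.memory and the pod memory request 
is that value plus spark.executor.memoryOverhead. The to_mib helper is 
hypothetical and only handles the simple size strings used here.

```python
import re

# Size suffixes are treated as binary units (1g = 1024m).
_UNIT_TO_MIB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def to_mib(size: str) -> int:
    """Convert a size string such as '6g' or '1G' to MiB (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", size.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size string: {size!r}")
    number, unit = match.groups()
    return int(int(number) * _UNIT_TO_MIB[unit])


executor_heap_mib = to_mib("6g")    # spark.executor.memory
overhead_mib = to_mib("1G")         # spark.executor.memoryOverhead
pod_request_mib = executor_heap_mib + overhead_mib

print(executor_heap_mib)  # 6144
print(pod_request_mib)    # 7168
```

Since the reported error is "Java heap space", the relevant limit is the 
6144 MiB heap rather than the 7168 MiB requested for the pod.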

 

 

 


> Spark driver mistakenly classifies OOM error of executor (on K8s pod) as 
> framework error
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-35006
>                 URL: https://issues.apache.org/jira/browse/SPARK-35006
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.1.1
>            Reporter: Huy
>            Priority: Major
>


