[ https://issues.apache.org/jira/browse/SPARK-35006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Huy updated SPARK-35006:
------------------------
Description:
*Issue Description*

I have a Spark application with 1 driver (on bare metal) + 1 executor (on K8s); please note this setup is just for testing purposes. The corresponding configuration can be found below.

When I run a task that loads and computes on an XML file, the executor hits an OOM error because the XML file is large (which is intentional):

{noformat}
#Aborting due to java.lang.OutOfMemoryError: Java heap space
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (debug.cpp:308), pid=19, tid=0x00007fff765ae700
# fatal error: OutOfMemory encountered: Java heap space
{noformat}

However, the driver doesn't recognize this error as a task failure. Instead, it treats it as a framework issue and keeps retrying the task:

INFO TaskSchedulerImpl:57 - Executor 1 on 10.87.88.44 killed by driver.
INFO TaskSetManager:57 - task 0.0 in stage 0.0 (TID 0) failed because while it was being computed, its executor exited for a reason unrelated to the task. *{color:#de350b}Not counting this failure towards the maximum number of failures for the task{color}.*
INFO BlockManagerMasterEndpoint:57 - Trying to remove executor 1 from BlockManagerMaster.

As a result, the Spark application keeps retrying the task forever and blocks other tasks from running.

*Expectation*

The Spark driver should classify an OOM on an executor pod as a failure caused by the task and count it towards the task's maximum number of failures.

*Configuration*

{noformat}
spark.kubernetes.container.image: "spark_image_path"
spark.kubernetes.container.image.pullPolicy: "Always"
spark.kubernetes.namespace: "qa-namespace"
spark.kubernetes.authenticate.driver.serviceAccountName: "svc-account"
spark.kubernetes.executor.request.cores: "2"
spark.kubernetes.executor.limit.cores: "2"
spark.executorEnv.SPARK_ENV: "dev"
spark.executor.memoryOverhead: "1G"
spark.executor.memory: "6g"
spark.executor.cores: "2"
spark.executor.instances: "3"
spark.driver.maxResultSize: "1g"
spark.driver.memory: "10g"
spark.driver.cores: "2"
spark.eventLog.enabled: 'true'
spark.driver.extraJavaOptions: "-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -XX:+UseG1GC \
  -XX:+PrintFlagsFinal \
  -XX:+PrintReferenceGC -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps \
  -XX:+PrintAdaptiveSizePolicy \
  -XX:+UnlockDiagnosticVMOptions \
  -XX:+G1SummarizeConcMark \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:ConcGCThreads=20 \
  -XX:+PrintGCCause \
  -XX:+AlwaysPreTouch \
  -Dlog4j.debug=true -Dlog4j.configuration=[file:///].... "
spark.sql.session.timeZone: UTC
{noformat}
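To reproduce this without a large XML file, here is a minimal sketch that deliberately exhausts the executor heap so the driver has to classify the exit (the object name and the 64 MB chunk size are mine, not from the original report):

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal repro sketch: force a java.lang.OutOfMemoryError on the executor
// so the driver must decide whether the loss counts as a task failure.
object OomRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("executor-oom-repro")
      .getOrCreate()

    // A single task keeps appending 64 MB chunks until the executor heap
    // (6g in the configuration above) is exhausted and the JVM aborts.
    spark.sparkContext.parallelize(1 to 1, numSlices = 1).foreach { _ =>
      val hog = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
      while (true) {
        hog += new Array[Byte](64 * 1024 * 1024)
      }
    }
    spark.stop()
  }
}
{code}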
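For context, my understanding of where this classification happens, as an illustrative paraphrase rather than the exact source: TaskSetManager.executorLost asks the scheduler backend whether the exit was caused by the application, and the "killed by driver" path reports it was not. The types below are self-contained stand-ins mirroring Spark's private[spark] ExecutorLossReason hierarchy in org.apache.spark.scheduler:

{code:scala}
// Stand-ins for Spark's internal executor-loss reasons (names mirror the
// real private[spark] classes; simplified for illustration).
sealed trait ExecutorLossReason
case class ExecutorExited(exitCode: Int, exitCausedByApp: Boolean, reason: String)
  extends ExecutorLossReason
case object ExecutorKilled extends ExecutorLossReason

// Paraphrase of the decision in TaskSetManager.executorLost: only losses
// the backend blames on the application count toward spark.task.maxFailures.
def countsTowardMaxFailures(reason: ExecutorLossReason): Boolean = reason match {
  case e: ExecutorExited => e.exitCausedByApp // the backend's verdict
  case ExecutorKilled    => false             // "killed by driver", as logged above
  case _                 => true
}
{code}

When this returns false, the task is resubmitted with its failure count unchanged, which matches the "Not counting this failure towards the maximum number of failures" log line.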
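For completeness, the retry cap I expect the OOM to trip is spark.task.maxFailures (default 4). A sketch of pinning it explicitly alongside the memory settings above (the app name is hypothetical):

{code:scala}
import org.apache.spark.sql.SparkSession

// The cap is only consumed when the driver decides a failure counts
// (exitCausedByApp == true); with the current behavior it never trips.
val spark = SparkSession.builder()
  .appName("oom-classification-test")            // hypothetical name
  .config("spark.task.maxFailures", "4")         // the default, shown explicitly
  .config("spark.executor.memory", "6g")         // from the reported config
  .config("spark.executor.memoryOverhead", "1G")
  .getOrCreate()
{code}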
> Spark driver mistakenly classifies OOM error of executor (on K8s pod) as framework error
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-35006
>                 URL: https://issues.apache.org/jira/browse/SPARK-35006
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.1.1
>            Reporter: Huy
>            Priority: Major