Huy created SPARK-35006:
---------------------------

             Summary: Spark driver mistakenly classifies OOM error of executor (on K8s pod) as framework error
                 Key: SPARK-35006
                 URL: https://issues.apache.org/jira/browse/SPARK-35006
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.1.1
            Reporter: Huy
*Issue Description*

I have a Spark application with 1 driver (on bare metal) and 1 executor (on a K8s pod); please note this setup is just for testing purposes. The corresponding configuration can be found below. When I run a task that loads and computes on an XML file, the executor hits an OOM error because the XML file is large (which is intentional):

#Aborting due to java.lang.OutOfMemoryError: Java heap space
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (debug.cpp:308), pid=19, tid=0x00007fff765ae700
# fatal error: OutOfMemory encountered: Java heap space

However, the driver does not recognize this error as a task failure. Instead, it treats it as a framework issue and keeps retrying the task:

INFO TaskSchedulerImpl:57 - Executor 1 on 10.87.88.44 killed by driver.
INFO TaskSetManager:57 - task 0.0 in stage 0.0 (TID 0) failed because while it was being computed, its executor exited for a reason unrelated to the task. *{color:#de350b}Not counting this failure towards the maximum number of failures for the task{color}.*
INFO BlockManagerMasterEndpoint:57 - Trying to remove executor 1 from BlockManagerMaster.

As a result, the Spark application keeps retrying the task forever and blocks subsequent tasks from running.

*Expectation*

The Spark driver should classify an OOM on an executor pod as a task failure and count it towards the maximum number of task failures. (An illustrative model of the classification decision and a reproduction sketch follow the configuration below.)

*Configuration*

spark.kubernetes.container.image: "spark_image_path"
spark.kubernetes.container.image.pullPolicy: "Always"
spark.kubernetes.namespace: "qa-namespace"
spark.kubernetes.authenticate.driver.serviceAccountName: "svc-account"
spark.kubernetes.executor.request.cores: "2"
spark.kubernetes.executor.limit.cores: "2"
spark.executorEnv.SPARK_ENV: "dev"
spark.executor.memoryOverhead: "1G"
spark.executor.memory: "6g"
spark.executor.cores: "2"
spark.executor.instances: "3"
spark.driver.maxResultSize: "1g"
spark.driver.memory: "10g"
spark.driver.cores: "2"
spark.eventLog.enabled: 'true'
spark.driver.extraJavaOptions: "-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -XX:+UseG1GC \
  -XX:+PrintFlagsFinal \
  -XX:+PrintReferenceGC -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps \
  -XX:+PrintAdaptiveSizePolicy \
  -XX:+UnlockDiagnosticVMOptions \
  -XX:+G1SummarizeConcMark \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:ConcGCThreads=20 \
  -XX:+PrintGCCause \
  -XX:+AlwaysPreTouch \
  -Dlog4j.debug=true -Dlog4j.configuration=file:///.... "
spark.sql.session.timeZone: UTC
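*Where the misclassification happens (illustrative)*

The highlighted "Not counting this failure..." log line comes from the driver's handling of a lost executor: the scheduler models an executor loss with a flag saying whether the exit was caused by the application, and losses with that flag unset are not charged against spark.task.maxFailures. The sketch below is a simplified model of that decision, not the actual Spark source; the names are paraphrased (the real counterparts are org.apache.spark.scheduler.ExecutorExited and the executor-loss handling in TaskSetManager).

{code:scala}
// Simplified model, NOT Spark source: paraphrases the shape of
// org.apache.spark.scheduler.ExecutorExited and the accounting decision
// the driver makes when an executor is lost.
final case class ExecutorExitedModel(
    exitCode: Int,
    exitCausedByApp: Boolean, // false => "exited for a reason unrelated to the task"
    reason: String)

object FailureAccounting {
  // A lost executor is charged against spark.task.maxFailures only when the
  // exit is attributed to the application. In this bug the OOM-killed pod is
  // reported with exitCausedByApp = false, so the failure counter never
  // increments and the driver retries the task forever.
  def countsTowardsMaxFailures(exit: ExecutorExitedModel): Boolean =
    exit.exitCausedByApp
}
{code}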
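*Reproduction sketch (illustrative)*

A minimal job of the kind described above, assuming the spark-xml data source is on the classpath; the row tag "record" and the input path /data/large.xml are hypothetical placeholders. Reading XML whose rows are large relative to spark.executor.memory is one way to drive the executor into the heap-space OOM quoted in the description.

{code:scala}
import org.apache.spark.sql.SparkSession

object ExecutorOomRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("executor-oom-repro")
      .getOrCreate()

    // spark-xml materializes each rowTag element as one row; elements that
    // are large relative to the executor heap push it towards
    // java.lang.OutOfMemoryError: Java heap space.
    val df = spark.read
      .format("xml")              // short name for com.databricks.spark.xml
      .option("rowTag", "record") // hypothetical row tag
      .load("/data/large.xml")    // hypothetical path

    // Any action forces the load & computation onto the executor.
    println(df.count())

    spark.stop()
  }
}
{code}

Submitted with spark-submit against the Kubernetes master using the configuration above, this should reproduce the endless retry loop shown in the description.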