[ 
https://issues.apache.org/jira/browse/SPARK-49079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-49079:
----------------------------------
    Component/s: Kubernetes
                     (was: k8s)

> Spark jobs failing with UnknownHostException on executors if driver readiness 
> timeout elapsed
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-49079
>                 URL: https://issues.apache.org/jira/browse/SPARK-49079
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.5.0
>         Environment: Running spark jobs inside EMR on EKS offering from AWS, 
> which is using 3.5.0 under the hood
>            Reporter: Oscar Torreno
>            Priority: Major
>
> We have seen cases where Spark jobs fail to run when ExecutorPodsAllocator 
> times out while waiting for the driver pod to reach the READY status. When 
> that happens, we have observed 2 possible scenarios leading to the same 
> result (executors failing with an UnknownHostException while trying to 
> resolve the k8s service for the Spark driver, and the job failing because 
> the maximum number of executor failures was reached):
>  * The Kubernetes service is not created at all (confirmed with the k8s 
> service created event/metric available in Grafana)
>  * The Kubernetes service exists but the executors still cannot resolve its 
> hostname (possibly the service only becomes fully resolvable once the driver 
> pod is ready, and the executors tried to resolve the hostname before that)
> The particular part of the code in question is 
> [https://github.com/apache/spark/blob/v3.5.0/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L130]
>  
> {code:java}
>     driverPod.foreach { pod =>
>       // Wait until the driver pod is ready before starting executors, as the headless service won't
>       // be resolvable by DNS until the driver pod is ready.
>       Utils.tryLogNonFatalError {
>         kubernetesClient
>           .pods()
>           .inNamespace(namespace)
>           .withName(pod.getMetadata.getName)
>           .waitUntilReady(driverPodReadinessTimeout, TimeUnit.SECONDS)
>       }
>     }
> {code}
> Interestingly enough, the comment says to wait until the driver pod is ready 
> because otherwise the headless service will not be resolvable by the 
> executors, yet we still let the run continue.
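> To make that concrete: Utils.tryLogNonFatalError only logs and swallows 
> non-fatal exceptions, so a readiness timeout does not stop allocation. The 
> snippet below is a rough paraphrase of that behaviour for illustration (it 
> uses println instead of Spark's internal logging and is not the actual Spark 
> source):
> {code:java}
> import scala.util.control.NonFatal
> 
> // Paraphrased sketch: any non-fatal exception thrown by the block, including the
> // readiness timeout, is logged and swallowed, so the caller continues as if the
> // driver pod were ready.
> def tryLogNonFatalErrorSketch(block: => Unit): Unit = {
>   try {
>     block
>   } catch {
>     case NonFatal(t) =>
>       println(s"Uncaught exception in thread ${Thread.currentThread().getName}: $t")
>   }
> }
> {code}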
> Also worth mentioning is the documentation of this readiness timeout config 
> ([https://github.com/apache/spark/blob/v3.5.0/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala#L454])
> {code:java}
>   val KUBERNETES_ALLOCATION_DRIVER_READINESS_TIMEOUT =
>     ConfigBuilder("spark.kubernetes.allocation.driver.readinessTimeout")
>       .doc("Time to wait for driver pod to get ready before creating executor pods. This wait " +
>         "only happens on application start. If timeout happens, executor pods will still be " +
>         "created.")
>       .version("3.1.3")
>       .timeConf(TimeUnit.SECONDS)
>       .checkValue(value => value > 0, "Allocation driver readiness timeout must be a positive "
>         + "time value.")
>       .createWithDefaultString("1s")
> {code}
> Please note the "If timeout happens, executor pods will still be created", 
> which conflicts (at least in my head) with the code comment on the wait we 
> have for the driver pod.
> The question is: is this the intended behaviour? It looks like a bug; maybe 
> we should check once again whether the driver pod is ready before creating 
> the executors, and otherwise fail the job?
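> A rough sketch of that idea (not a patch, only for illustration; it assumes 
> fabric8's isReady() on the pod resource and simply re-checks readiness after 
> the wait so the job fails fast instead of silently continuing):
> {code:java}
>     driverPod.foreach { pod =>
>       Utils.tryLogNonFatalError {
>         kubernetesClient
>           .pods()
>           .inNamespace(namespace)
>           .withName(pod.getMetadata.getName)
>           .waitUntilReady(driverPodReadinessTimeout, TimeUnit.SECONDS)
>       }
>       // Hypothetical addition: re-check readiness after the wait and fail the job early,
>       // instead of starting executors that will only die with UnknownHostException.
>       val ready = kubernetesClient
>         .pods()
>         .inNamespace(namespace)
>         .withName(pod.getMetadata.getName)
>         .isReady()
>       if (!ready) {
>         throw new SparkException(
>           s"Driver pod ${pod.getMetadata.getName} was not ready after " +
>             s"$driverPodReadinessTimeout seconds; executors would not be able to " +
>             "resolve the driver service.")
>       }
>     }
> {code}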
> For now we are trying to mitigate this by increasing the readiness timeout 
> value as a band-aid fix.
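> For reference, the mitigation is just raising 
> spark.kubernetes.allocation.driver.readinessTimeout from its 1s default at 
> submit time; the 120s below is an arbitrary illustration value, not a 
> recommendation:
> {code:java}
> import org.apache.spark.SparkConf
> 
> // Band-aid only: give the driver pod more time to become ready before executor
> // allocation starts. Tune the value to how long the driver usually takes to be ready.
> val conf = new SparkConf()
>   .set("spark.kubernetes.allocation.driver.readinessTimeout", "120s")
> {code}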



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
