[
https://issues.apache.org/jira/browse/SPARK-49079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun resolved SPARK-49079.
-----------------------------------
Resolution: Not A Problem
This is more like a question. And, yes, the configuration only sets a limit on
how long we wait for the driver pod before starting the next steps. It doesn't
aim to fix underlying infra issues such as an unknown host.
> Spark jobs failing with UnknownHostException on executors if driver readiness
> timeout elapsed
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-49079
> URL: https://issues.apache.org/jira/browse/SPARK-49079
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.5.0
> Environment: Running spark jobs inside EMR on EKS offering from AWS,
> which is using 3.5.0 under the hood
> Reporter: Oscar Torreno
> Priority: Major
>
> We have seen cases where Spark jobs fail to run when ExecutorPodsAllocator
> times out while waiting for the driver pod to reach the READY status. When
> that happens, we have seen two possible scenarios leading to the same result
> (executors failing with an UnknownHostException while trying to resolve the
> k8s service for the Spark driver, and the job failing because the maximum
> number of executor failures was reached):
> * The Kubernetes service not getting created (confirmed via the k8s service
> created event/metric available in Grafana)
> * The Kubernetes service existing but the executors still being unable to
> resolve its hostname (possibly because the service only becomes fully
> available once the driver pod is ready, and the executors tried to resolve
> the hostname before that)
> The particular part of the code in question is
> [https://github.com/apache/spark/blob/v3.5.0/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L130]
>
> {code:java}
> driverPod.foreach { pod =>
>   // Wait until the driver pod is ready before starting executors, as the
>   // headless service won't be resolvable by DNS until the driver pod is ready.
>   Utils.tryLogNonFatalError {
>     kubernetesClient
>       .pods()
>       .inNamespace(namespace)
>       .withName(pod.getMetadata.getName)
>       .waitUntilReady(driverPodReadinessTimeout, TimeUnit.SECONDS)
>   }
> }
> {code}
> Interestingly enough, the comment says to wait until the driver pod is ready
> because otherwise the service will not be resolvable by the executors, yet we
> still let the run continue after the timeout.
> Also worth mentioning is the documentation for this readiness timeout config
> ([https://github.com/apache/spark/blob/v3.5.0/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala#L454])
> {code:java}
> val KUBERNETES_ALLOCATION_DRIVER_READINESS_TIMEOUT =
>   ConfigBuilder("spark.kubernetes.allocation.driver.readinessTimeout")
>     .doc("Time to wait for driver pod to get ready before creating executor pods. This wait " +
>       "only happens on application start. If timeout happens, executor pods will still be " +
>       "created.")
>     .version("3.1.3")
>     .timeConf(TimeUnit.SECONDS)
>     .checkValue(value => value > 0, "Allocation driver readiness timeout must be a positive "
>       + "time value.")
>     .createWithDefaultString("1s")
> {code}
> Please note the "If timeout happens, executor pods will still be created",
> which conflicts (at least in my head) with the code comment on the await we
> have for the driver pod.
> The question is: is this intended behaviour? It looks like a bug; maybe we
> should check once more whether the driver pod is ready before creating the
> executors, and fail the job otherwise?
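> One possible shape for such a fail-fast check (a hypothetical sketch, not a
> tested patch; it reuses the names from the snippet above and assumes
> waitUntilReady throws when the timeout elapses):
> {code:java}
> driverPod.foreach { pod =>
>   try {
>     kubernetesClient
>       .pods()
>       .inNamespace(namespace)
>       .withName(pod.getMetadata.getName)
>       .waitUntilReady(driverPodReadinessTimeout, TimeUnit.SECONDS)
>   } catch {
>     case NonFatal(e) =>
>       // Fail the application instead of logging and continuing, since the
>       // executors cannot resolve the driver service anyway.
>       throw new SparkException(
>         s"Driver pod was not ready within ${driverPodReadinessTimeout}s", e)
>   }
> }
> {code}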
> For now we are mitigating this by increasing the readiness timeout value as a
> band-aid fix.
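> For reference, the timeout can be raised at submit time, e.g. (the 120s value
> is just an illustration, not a recommendation):
> {code}
> spark-submit \
>   --master k8s://https://<k8s-apiserver>:6443 \
>   --deploy-mode cluster \
>   --conf spark.kubernetes.allocation.driver.readinessTimeout=120s \
>   ...
> {code}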
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]