[jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
[ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127234#comment-17127234 ]

Prabhakar commented on SPARK-29640:
-----------------------------------

Is there a way to explicitly configure the Kubernetes API server URL? If so, specifying the fully qualified DNS name, e.g. kubernetes.default.svc.cluster.local, might help.

> [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in
> Spark driver
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-29640
>                 URL: https://issues.apache.org/jira/browse/SPARK-29640
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Andy Grove
>            Priority: Major
>
> We are running into intermittent DNS issues where the Spark driver fails to
> resolve "kubernetes.default.svc" when trying to create executors. We are
> running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS.
> This happens approximately 10% of the time.
> Here is the stack trace:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
>     at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
>     at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
>     at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
>     at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
>     at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
>     at scala.Option.getOrElse(Option.scala:121)
>     at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
>     at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
>     at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>     at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [wf-5-69674f15d0fc45-1571354060179-driver] in namespace: [tenant-8-workflows] failed.
>     at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>     at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>     at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
>     at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
>     at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
>     at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
>     at scala.Option.map(Option.scala:146)
>     at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
>     at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
>     at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
>     ... 20 more
> Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
>     at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
>     at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
>     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
>     at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1193)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1127)
>     at okhttp3.Dns$1.lookup(Dns.java:39)
>     at
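On the question of explicitly configuring the API server address: later Spark releases (3.x) document a `spark.kubernetes.driver.master` property for the in-cluster API server address the driver uses (it defaults to `https://kubernetes.default.svc`); whether it is available on 2.4.x, and whether the fully qualified name avoids the lookup failure, is an assumption. A sketch of what trying it might look like (placeholder endpoint, not from this issue):

{code}
spark-submit \
  --master k8s://https://<api-server-endpoint>:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.master=https://kubernetes.default.svc.cluster.local \
  ...
{code}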
[jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
[ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994510#comment-16994510 ]

Andy Grove commented on SPARK-29640:
------------------------------------

Closing this as not a bug. We have confirmed that this is due to the way certain EKS clusters are set up. The issue only happens when Spark is on the same node as a CoreDNS pod, and even then only intermittently. We have experienced the same issue with applications other than Spark as well.
[jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
[ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994257#comment-16994257 ]

Andy Grove commented on SPARK-29640:
------------------------------------

We were finally able to get to a root cause on this, so I'm documenting it here in the hope that it helps someone else in the future.

The issue was due to the way routing was set up on our EKS clusters, combined with the fact that we were using an NLB rather than an ELB, along with nginx ingress controllers. Specifically, NLB does not support "hairpinning", as explained in [https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html]

In layman's terms: if pod A tries to communicate with pod B, both pods are on the same node, and the request egresses from the node and is then routed back to that same node via the NLB and nginx controller, then the request can never succeed and will time out.

Switching to an ELB resolves the issue, but a better solution is to use cluster-local addressing so that communication between pods on the same node uses the local network.
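To make the hairpinning scenario concrete, here is the difference between the two addressing styles (hostnames and namespace are illustrative, not from this cluster):

{code}
# Reaching a service via its external NLB + nginx ingress hostname:
# traffic egresses the node and must hairpin back through the NLB,
# which NLB does not support when client and backend share a node.
curl https://myservice.example.com/healthz

# Reaching the same service via its cluster-local DNS name:
# traffic stays on the pod network and never leaves the node.
curl http://myservice.tenant-8-workflows.svc.cluster.local/healthz
{code}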
[jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
[ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963175#comment-16963175 ]

Andy Grove commented on SPARK-29640:
------------------------------------

A hacky workaround is to wait for DNS to resolve before creating the Spark context:

{code:java}
def waitForDns(): Unit = {
  val host = "kubernetes.default.svc"
  println(s"Resolving $host ...")
  val t1 = System.currentTimeMillis()
  var attempts = 0
  while (System.currentTimeMillis() - t1 < 15000) {
    try {
      attempts += 1
      val address = InetAddress.getByName(host)
      println(s"Resolved $host as ${address.getHostAddress()} after $attempts attempt(s)")
      return
    } catch {
      case _: UnknownHostException =>
        println(s"Failed to resolve $host due to UnknownHostException (attempt $attempts)")
        Thread.sleep(100)
    }
  }
}
{code}

> [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in
> Spark driver
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-29640
>                 URL: https://issues.apache.org/jira/browse/SPARK-29640
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.4
>            Reporter: Andy Grove
>            Priority: Major
>             Fix For: 2.4.5
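For reference, the same retry loop as a self-contained Java helper (class and method names here are mine, not from the issue; the host and timeout are parameters so the behavior can be exercised against any hostname):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class WaitForDns {
    // Retry DNS resolution for up to timeoutMs milliseconds, sleeping briefly
    // between attempts; returns whether the host became resolvable in time.
    static boolean waitForDns(String host, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        int attempts = 0;
        while (System.currentTimeMillis() < deadline) {
            attempts++;
            try {
                InetAddress address = InetAddress.getByName(host);
                System.out.println("Resolved " + host + " as "
                        + address.getHostAddress() + " after " + attempts + " attempt(s)");
                return true;
            } catch (UnknownHostException e) {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

Note the workaround only papers over the race: if CoreDNS on the node stays broken longer than the timeout, SparkContext creation will still fail with the same exception.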