[jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver

2020-06-06 Thread Prabhakar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127234#comment-17127234
 ] 

Prabhakar commented on SPARK-29640:
---

Is there a way to explicitly configure the Kube API server URL? If so, 
specifying the complete DNS name, e.g. kubernetes.default.svc.cluster.local, 
might help.
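
For what it's worth, Spark 3.0+ adds a spark.kubernetes.driver.master setting 
(default https://kubernetes.default.svc) that overrides the internal API 
server address the driver uses to request executors; it is not available in 
2.4.4. A minimal sketch, assuming Spark 3.0+:
{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only, assuming Spark 3.0+: spark.kubernetes.driver.master overrides
// the API server address the driver-side scheduler backend talks to, here
// using the fully qualified cluster-local name as suggested above.
val spark = SparkSession.builder()
  .appName("explicit-api-server")
  .config("spark.kubernetes.driver.master",
    "https://kubernetes.default.svc.cluster.local")
  .getOrCreate()
{code}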

> [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in 
> Spark driver
> --
>
> Key: SPARK-29640
> URL: https://issues.apache.org/jira/browse/SPARK-29640
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.4
>Reporter: Andy Grove
>Priority: Major
>
> We are running into intermittent DNS issues where the Spark driver fails to 
> resolve "kubernetes.default.svc" when trying to create executors. We are 
> running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS.
> This happens approximately 10% of the time.
> Here is the stack trace:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: External 
> scheduler cannot be instantiated
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
>   at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
>   at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: 
> [get]  for kind: [Pod]  with name: 
> [wf-5-69674f15d0fc45-1571354060179-driver]  in namespace: 
> [tenant-8-workflows]  failed.
>   at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>   at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>   at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
>   at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
>   at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
>   at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
>   at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
>   ... 20 more
> Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
>   at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
>   at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
>   at 
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
>   at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
>   at java.net.InetAddress.getAllByName(InetAddress.java:1193)
>   at java.net.InetAddress.getAllByName(InetAddress.java:1127)
>   at okhttp3.Dns$1.lookup(Dns.java:39)
>   at 
> 

[jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver

2019-12-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994510#comment-16994510
 ] 

Andy Grove commented on SPARK-29640:


Closing this as not a bug. We have confirmed that this is due to the way 
certain EKS clusters are set up. The issue only happens when Spark is on the 
same node as a CoreDNS pod, and even then only intermittently. We have 
experienced the same issue with applications other than Spark as well.


[jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver

2019-12-11 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994257#comment-16994257
 ] 

Andy Grove commented on SPARK-29640:


We were finally able to get to a root cause on this, so I'm documenting it 
here in the hope that it helps someone else in the future.

The issue was due to the way routing was set up on our EKS clusters, combined 
with the fact that we were using an NLB rather than an ELB, along with nginx 
ingress controllers.

Specifically, NLB does not support "hairpinning", as explained in 
[https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html]

In layman's terms: if pod A tries to communicate with pod B, both pods are on 
the same node, and the request egresses from the node and is then routed back 
to the same node via the NLB and nginx controller, then the request can never 
succeed and will time out.

Switching to an ELB resolves the issue, but a better solution is to use 
cluster-local addressing so that communication between pods on the same node 
uses the local network.
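
As a hypothetical illustration (the service and namespace names here are made 
up), the difference between the two addressing schemes:
{code:java}
// Hypothetical names, for illustration only.

// Leaves the node via the NLB + nginx ingress; on an NLB the request can
// hairpin back to the originating node and the connection times out:
val externalUrl = "https://api.example.com/v1/status"

// Cluster-local service DNS name: kube-proxy routes the call over the
// cluster network, so same-node pod-to-pod traffic stays on the node:
val clusterLocalUrl = "http://api-service.my-namespace.svc.cluster.local/v1/status"
{code}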


[jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver

2019-10-30 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963175#comment-16963175
 ] 

Andy Grove commented on SPARK-29640:


A hacky workaround is to wait for DNS to resolve before creating the Spark 
context:
{code:java}
import java.net.{InetAddress, UnknownHostException}

def waitForDns(): Unit = {
  val host = "kubernetes.default.svc"

  println(s"Resolving $host ...")
  val t1 = System.currentTimeMillis()
  var attempts = 0
  // Retry for up to 15 seconds, sleeping 100 ms between failed lookups.
  while (System.currentTimeMillis() - t1 < 15000) {
    try {
      attempts += 1
      val address = InetAddress.getByName(host)
      println(s"Resolved $host as ${address.getHostAddress} after $attempts attempt(s)")
      return
    } catch {
      case _: UnknownHostException =>
        println(s"Failed to resolve $host due to UnknownHostException (attempt $attempts)")
        Thread.sleep(100)
    }
  }
} {code}
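
Usage is just to call it before the session is created, so that the first 
Kubernetes API call finds DNS already resolving (sketch; the app details are 
illustrative only):
{code:java}
def main(args: Array[String]): Unit = {
  // Block until kubernetes.default.svc resolves (or the 15 s budget expires),
  // then let Spark create the K8s scheduler backend as usual.
  waitForDns()
  val spark = org.apache.spark.sql.SparkSession.builder()
    .appName("spark-pi")
    .getOrCreate()
  // ... job logic ...
  spark.stop()
}
{code}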
