Re: [External Sender] Re: Driver pods stuck in running state indefinitely

2020-04-10 Thread Prudhvi Chennuru (CONT)
No, there was no internal domain issue. As I mentioned, I saw this issue
only on a few nodes in the cluster.
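
If it really is node-specific, one quick way to narrow it down is to resolve the driver service from a pod pinned to a suspect node. This is only an illustrative sketch; the probe pod name, image tag and <suspect-node> placeholder are assumptions, not details from this thread:

```
# Pin a throwaway busybox pod to a suspect node and try to resolve the
# driver service name the executors were failing on.
kubectl -n fractal-segmentation run dns-probe --rm -it --restart=Never \
  --image=busybox:1.31 \
  --overrides='{"spec":{"nodeName":"<suspect-node>"}}' \
  -- nslookup spark-1586333186571-driver-svc.fractal-segmentation.svc
```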

On Thu, Apr 9, 2020 at 10:49 PM Wei Zhang  wrote:

> Is there any internal domain name resolving issues?
>
> > Caused by:  java.net.UnknownHostException:
> spark-1586333186571-driver-svc.fractal-segmentation.svc
>
> -z
> ________
> From: Prudhvi Chennuru (CONT) 
> Sent: Friday, April 10, 2020 2:44
> To: user
> Subject: Driver pods stuck in running state indefinitely
>
>
> Hi,
>
> We are running Spark batch jobs on K8s.
> Kubernetes version: 1.11.5
> Spark version: 2.3.2
> Docker version: 19.3.8
>
> Issue: a few driver pods are stuck in the running state indefinitely with the error
>
> ```
> Initial job has not accepted any resources; check your cluster UI
> to ensure that workers are registered and have sufficient resources.
> ```
>
> Below is the log from one of the failed executor pods:
>
>   ```
>Exception in thread "main"
> java.lang.reflect.UndeclaredThrowableException
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1858)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:63)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: org.apache.spark.SparkException: Exception thrown in
> awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:63)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
> ... 4 more
> Caused by: java.io.IOException: Failed to connect to
> spark-1586333186571-driver-svc.fractal-segmentation.svc:7078
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
> at
> org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException:
> spark-1586333186571-driver-svc.fractal-segmentation.svc
> at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
> at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> at java.net.InetAddress.getByName(InetAddress.java:1076)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
> at java.security.AccessController.doPrivileged(Native Method)
> at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
> at
> io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
> at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
> at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
> at
> io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
> at
> io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
> at
> io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
> at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
> at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
> at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
> at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
> at
> io.netty.util.concurrent.

Driver pods stuck in running state indefinitely

2020-04-09 Thread Prudhvi Chennuru (CONT)
Hi,

   *We are running Spark batch jobs on K8s.*
   *Kubernetes version:* 1.11.5
   *Spark version:* 2.3.2
   *Docker version:* 19.3.8

   *Issue: a few driver pods are stuck in the running state indefinitely with the error*

   ```
   Initial job has not accepted any resources; check your cluster UI to
   ensure that workers are registered and have sufficient resources.
   ```

*Below is the log from one of the failed executor pods:*

  ```
   Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1858)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:63)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in
awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:63)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
... 4 more
Caused by: java.io.IOException: Failed to connect to
spark-1586333186571-driver-svc.fractal-segmentation.svc:7078
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at
org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException:
spark-1586333186571-driver-svc.fractal-segmentation.svc
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
at java.security.AccessController.doPrivileged(Native Method)
at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
at
io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
at
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
at
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
at
io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
at
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at
io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at
io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
at
io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
at
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
at
io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
at
io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
at
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
```
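
Since the executor failure above is an UnknownHostException on the driver's headless service, a useful first check is whether that Service object exists and resolves inside the namespace. A minimal sketch, assuming kubectl access to the cluster; the busybox probe pod is only an illustration:

```
# Does the headless driver service exist, and does it point at the driver pod?
kubectl -n fractal-segmentation get svc | grep driver-svc
kubectl -n fractal-segmentation get endpoints spark-1586333186571-driver-svc

# Can another pod in the same namespace resolve the name the executors use?
kubectl -n fractal-segmentation run dns-check --rm -it --restart=Never \
  --image=busybox:1.31 \
  -- nslookup spark-1586333186571-driver-svc.fractal-segmentation.svc
```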

MultiObjectDeleteException

2019-10-29 Thread Prudhvi Chennuru (CONT)
Hi,

I am running Spark batch jobs on a Kubernetes cluster and intermittently I
am seeing MultiObjectDeleteException.

spark version: 2.3.0
kubernetes version: 1.11.5
aws-java-sdk: 1.7.4.jar
hadoop-aws: 2.7.3.jar

I even added the *spark.hadoop.fs.s3a.multiobjectdelete.enable=false* property
to disable multi-object deletion, but it does not seem to take effect. Is there
anything else I can do to avoid this issue, and which version of Spark supports
this property?
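
For reference, *fs.s3a.multiobjectdelete.enable* is a Hadoop S3A option rather than a Spark one, so whether it takes effect depends on the hadoop-aws version on the classpath more than on the Spark version. A minimal sketch of how it is usually passed, assuming a plain spark-submit setup; the class and jar names are placeholders:

```
# Pass the S3A option through Spark's Hadoop-configuration prefix at submit time.
spark-submit \
  --conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false \
  --class <your.main.Class> \
  <your-application.jar>

# Equivalent setting from inside the job (Scala):
#   spark.sparkContext.hadoopConfiguration
#     .set("fs.s3a.multiobjectdelete.enable", "false")
```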

```
2019-10-29 06:21:31 ERROR FileFormatWriter:91 - Aborting job null.
com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code: 0,
AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error
Message: One or more objects could not be deleted, S3 Extended Request ID:
null
at
com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:1745)
at org.apache.hadoop.fs.s3a.S3AFileSystem.delete(S3AFileSystem.java:687)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:463)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:367)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
at
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:47)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:213)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
at
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
```

-- 
*Thanks,*
*Prudhvi Chennuru.*


Re: [External Sender] Spark Executor pod not getting created on kubernetes cluster

2019-10-01 Thread Prudhvi Chennuru (CONT)
If you are passing the service account for the executors as a Spark property,
then the executors will use the one you are passing, not the default service
account. Did you check the API server logs?
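
For context, on Spark 2.x the relevant setting is the driver's service account, which the driver then uses to create the executor pods, and the RBAC side can be checked directly against the API server. A rough sketch, assuming cluster mode and kubectl access; the account, namespace and application names are placeholders:

```
# Submit with an explicit service account; the driver uses it to create executor pods.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=<spark-sa> \
  --class <your.main.Class> \
  <your-application.jar>

# Check whether that service account is actually allowed to create pods:
kubectl auth can-i create pods -n <namespace> \
  --as=system:serviceaccount:<namespace>:<spark-sa>
```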

On Tue, Oct 1, 2019 at 11:07 AM manish gupta 
wrote:

> While launching the driver pod I am passing the service account which has
> cluster role and has all the required permissions to create a new pod. So
> will driver pass the same details to API server while creating executor pod
> OR executors will be created with default service account?
>
> Regards
> Manish Gupta
>
> On Tue, Oct 1, 2019 at 8:01 PM Prudhvi Chennuru (CONT) <
> prudhvi.chenn...@capitalone.com> wrote:
>
>> By default, executors use default service account in the namespace you
>> are creating the driver and executors so i am guessing that executors don't
>> have access to run on the cluster, if you check the kube-apisever logs you
>> will know the issue
>> and try giving privileged access to default service account in the
>> namespace you are creating the executors it should work.
>>
>> On Tue, Oct 1, 2019 at 10:25 AM manish gupta 
>> wrote:
>>
>>> Hi Prudhvi
>>>
>>> I can see this issue consistently. I am doing a POC wherein I am trying
>>> to create a dynamic spark cluster to run my job using spark submit on
>>> Kubernetes. On Minikube it works fine but on rbac enabled kubernetes it
>>> fails to launch executor pod. It is able to launch driver pod but not sure
>>> why it cannot launch executor pod even though it has ample resources.I dont
>>> see any error message in the logs apart from the warning message that I
>>> have provided above.
>>> Not even a single executor pod is getting launched.
>>>
>>> Regards
>>> Manish Gupta
>>>
>>> On Tue, Oct 1, 2019 at 6:31 PM Prudhvi Chennuru (CONT) <
>>> prudhvi.chenn...@capitalone.com> wrote:
>>>
>>>> Hi Manish,
>>>>
>>>> Are you seeing this issue consistently or sporadically? and
>>>> when you say executors are not launched not even a single executor created
>>>> for that driver pod?
>>>>
>>>> On Tue, Oct 1, 2019 at 1:43 AM manish gupta 
>>>> wrote:
>>>>
>>>>> Hi Team
>>>>>
>>>>> I am trying to create a spark cluster on kubernetes with rbac enabled
>>>>> using spark submit job. I am using spark-2.4.1 version.
>>>>> Spark submit is able to launch the driver pod by contacting Kubernetes
>>>>> API server but executor Pod is not getting launched. I can see the below
>>>>> warning message in the driver pod logs.
>>>>>
>>>>>
>>>>> *19/09/27 10:16:01 INFO TaskSchedulerImpl: Adding task set 0.0 with 3
>>>>> tasks19/09/27 10:16:16 WARN TaskSchedulerImpl: Initial job has not 
>>>>> accepted
>>>>> any resources; check your cluster UI to ensure that workers are registered
>>>>> and have sufficient resources*
>>>>>
>>>>> I have faced this issue in standalone spark clusters and resolved it
>>>>> but not sure how to resolve this issue in kubernetes. I have not given any
>>>>> ResourceQuota configuration in kubernetes rbac yaml file and there is 
>>>>> ample
>>>>> memory and cpu available for any new pod/container to be launched.
>>>>>
>>>>> Any leads/pointers to resolve this issue would be of great help.
>>>>>
>>>>> Thanks and Regards
>>>>> Manish Gupta
>>>>>
>>>>
>>>>
>>>> --
>>>> *Thanks,*
>>>> *Prudhvi Chennuru.*

Re: [External Sender] Spark Executor pod not getting created on kubernetes cluster

2019-10-01 Thread Prudhvi Chennuru (CONT)
By default, executors use the default service account in the namespace where
you are creating the driver and executors, so I am guessing that the executors
don't have access to run on the cluster. If you check the kube-apiserver logs
you will know the issue. Try giving privileged access to the default service
account in the namespace where you are creating the executors; it should work.
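
One concrete way to do that is a namespaced RoleBinding on the built-in edit ClusterRole rather than anything cluster-wide. A sketch with a placeholder namespace:

```
# Grant the default service account in the job namespace broad but namespaced rights.
kubectl create rolebinding spark-default-edit \
  --clusterrole=edit \
  --serviceaccount=<namespace>:default \
  --namespace=<namespace>

# Verify that it can now create pods:
kubectl auth can-i create pods -n <namespace> \
  --as=system:serviceaccount:<namespace>:default
```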

On Tue, Oct 1, 2019 at 10:25 AM manish gupta 
wrote:

> Hi Prudhvi
>
> I can see this issue consistently. I am doing a POC wherein I am trying to
> create a dynamic spark cluster to run my job using spark submit on
> Kubernetes. On Minikube it works fine but on rbac enabled kubernetes it
> fails to launch executor pod. It is able to launch driver pod but not sure
> why it cannot launch executor pod even though it has ample resources.I dont
> see any error message in the logs apart from the warning message that I
> have provided above.
> Not even a single executor pod is getting launched.
>
> Regards
> Manish Gupta
>
> On Tue, Oct 1, 2019 at 6:31 PM Prudhvi Chennuru (CONT) <
> prudhvi.chenn...@capitalone.com> wrote:
>
>> Hi Manish,
>>
>> Are you seeing this issue consistently or sporadically? and
>> when you say executors are not launched not even a single executor created
>> for that driver pod?
>>
>> On Tue, Oct 1, 2019 at 1:43 AM manish gupta 
>> wrote:
>>
>>> Hi Team
>>>
>>> I am trying to create a spark cluster on kubernetes with rbac enabled
>>> using spark submit job. I am using spark-2.4.1 version.
>>> Spark submit is able to launch the driver pod by contacting Kubernetes
>>> API server but executor Pod is not getting launched. I can see the below
>>> warning message in the driver pod logs.
>>>
>>>
>>> *19/09/27 10:16:01 INFO TaskSchedulerImpl: Adding task set 0.0 with 3
>>> tasks19/09/27 10:16:16 WARN TaskSchedulerImpl: Initial job has not accepted
>>> any resources; check your cluster UI to ensure that workers are registered
>>> and have sufficient resources*
>>>
>>> I have faced this issue in standalone spark clusters and resolved it but
>>> not sure how to resolve this issue in kubernetes. I have not given any
>>> ResourceQuota configuration in kubernetes rbac yaml file and there is ample
>>> memory and cpu available for any new pod/container to be launched.
>>>
>>> Any leads/pointers to resolve this issue would be of great help.
>>>
>>> Thanks and Regards
>>> Manish Gupta
>>>
>>
>>
>> --
>> *Thanks,*
>> *Prudhvi Chennuru.*

-- 
*Thanks,*
*Prudhvi Chennuru.*


Re: [External Sender] Spark Executor pod not getting created on kubernetes cluster

2019-10-01 Thread Prudhvi Chennuru (CONT)
Hi Manish,

Are you seeing this issue consistently or sporadically? And when you say
executors are not launched, do you mean not even a single executor is created
for that driver pod?

On Tue, Oct 1, 2019 at 1:43 AM manish gupta 
wrote:

> Hi Team
>
> I am trying to create a spark cluster on kubernetes with rbac enabled
> using spark submit job. I am using spark-2.4.1 version.
> Spark submit is able to launch the driver pod by contacting Kubernetes API
> server but executor Pod is not getting launched. I can see the below
> warning message in the driver pod logs.
>
>
> *19/09/27 10:16:01 INFO TaskSchedulerImpl: Adding task set 0.0 with 3
> tasks19/09/27 10:16:16 WARN TaskSchedulerImpl: Initial job has not accepted
> any resources; check your cluster UI to ensure that workers are registered
> and have sufficient resources*
>
> I have faced this issue in standalone spark clusters and resolved it but
> not sure how to resolve this issue in kubernetes. I have not given any
> ResourceQuota configuration in kubernetes rbac yaml file and there is ample
> memory and cpu available for any new pod/container to be launched.
>
> Any leads/pointers to resolve this issue would be of great help.
>
> Thanks and Regards
> Manish Gupta
>


-- 
*Thanks,*
*Prudhvi Chennuru.*


Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-06-18 Thread Prudhvi Chennuru (CONT)
Thanks for the response, Olivier.

I am facing this issue intermittently. Once in a while I don't see the service
being created for the respective Spark driver (*I don't see the service for
that driver on the Kubernetes dashboard, nor via kubectl, but in the driver
logs I see the service endpoint*). By default the driver requests executors in
batches of 5, and as soon as those 5 executors are created they fail with the
error below.


Caused by: java.io.IOException: Failed to connect to
group9990-features-282526d440ab3f12a68746fbef289c95-driver-svc.experimental.svc:7078
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at
org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException:
group9990-features-282526d440ab3f12a68746fbef289c95-driver-svc.experimental.svc
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)

Did you face the same problem or were you able to see the service for the
driver pod on your cluster?
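
One mitigation along the lines of what is suggested further down this thread is to stop the executor JVMs from caching the first failed lookup of the driver service, so a service that shows up a few seconds late can still be resolved. This is only a sketch, assuming a JDK 8 based executor image whose entrypoint can be extended (the java.security path differs on newer JDKs):

```
# In the executor image entrypoint, before the JVM starts: disable negative
# DNS caching so a failed first lookup of the driver service is retried
# instead of being cached for the lifetime of the JVM.
echo "networkaddress.cache.negative.ttl=0" \
  >> "${JAVA_HOME}/jre/lib/security/java.security"
```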


On Tue, Jun 18, 2019 at 8:00 AM Jose Luis Pedrosa <
jose.pedr...@microsoft.com> wrote:

> Hi guys
>
>
>
> There’s also an interesting one that we found in a similar case. In our
> case the service ip ranges takes more time to be reachable, so DNS was
> timing out. The approach that I was suggesting was:
>
>1. Add retries in the connection from the executor to the driver:
>https://github.com/apache/spark/pull/24702
>
>2. Disable negative DNS caching at JVM level, on the entrypoint.sh
>
>
>
> JL
>
>
>
>
>
> *From: *Olivier Girardot 
> *Date: *Tuesday 18 June 2019 at 10:06
> *To: *"Prudhvi Chennuru (CONT)" 
> *Cc: *Li Gao , dev , user <
> user@spark.apache.org>
> *Subject: *Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS
> resolution of driver fails
>
>
>
> Hi Prudhvi,
>
> not really but we took a drastic approach mitigating this, modifying the
> bundled launch script to be more resilient.
>
> In the kubernetes/dockerfiles/spark/entrypoint.sh in the executor case we
> added something like that :
>
>
>
>   executor)
>     DRIVER_HOST=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 1)
>     DRIVER_PORT=$(echo $SPARK_DRIVER_URL | cut -d "@" -f 2 | cut -d ":" -f 2)
>
>     for i in $(seq 1 20); do
>       nc -zvw1 $DRIVER_HOST $DRIVER_PORT
>       status=$?
>       if [ $status -eq 0 ]; then
>         echo "Driver is accessible, let's rock'n'roll."
>         break
>       else
>         echo "Driver not accessible :-| napping for a while..."
>         sleep 3
>       fi
>     done
>
>     CMD=(
>       ${JAVA_HOME}/bin/java
>
>
>
>
> That way the executor will not start before the driver is really
> connectable.
>
> That's kind of a hack but we did not experience the issue anymore, so I
> guess I'll keep it for now.
>
>
>
> Regards,
>
>
>
> Olivier.
>
>
>
> Le mar. 11 juin 2019 à 18:23, Prudhvi Chennuru (CONT) <
> prudhvi.chenn...@capitalone.com> a écrit :
>
> Hey Oliver,
>
>
>
>  I am also facing the same issue on my kubernetes
> cluster(v1.11.5)  on AWS with spark version 2.3.3, any luck in figuring out
> the root cause?
>
>
>
> On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
> Hi,
>
> I did not try on another vendor, so I can't say if it's only related to
> gke, and no, I did not notice anything on the kubelet or kube-dns
> processes...
>
>
>
> Regards
>
>
>
> Le ven. 3 mai 2019 à 03:05, Li Gao  a écrit :
>
&g

Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-06-11 Thread Prudhvi Chennuru (CONT)
Hey Olivier,

 I am also facing the same issue on my Kubernetes
cluster (v1.11.5) on AWS with Spark version 2.3.3. Any luck in figuring out
the root cause?

On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi,
> I did not try on another vendor, so I can't say if it's only related to
> gke, and no, I did not notice anything on the kubelet or kube-dns
> processes...
>
> Regards
>
> Le ven. 3 mai 2019 à 03:05, Li Gao  a écrit :
>
>> hi Olivier,
>>
>> This seems a GKE specific issue? have you tried on other vendors ? Also
>> on the kubelet nodes did you notice any pressure on the DNS side?
>>
>> Li
>>
>>
>> On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> I have ~300 spark job on Kubernetes (GKE) using the cluster auto-scaler,
>>> and sometimes while running these jobs a pretty bad thing happens, the
>>> driver (in cluster mode) gets scheduled on Kubernetes and launches many
>>> executor pods.
>>> So far so good, but the k8s "Service" associated to the driver does not
>>> seem to be propagated in terms of DNS resolution, so all the executors fail
>>> with a "spark-application-..cluster.svc.local" does not exist.
>>>
>>> All executors failing the driver should be failing too, but it considers
>>> that it's a "pending" initial allocation and stay stuck forever in a loop
>>> of "Initial job has not accepted any resources, please check Cluster UI"
>>>
>>> Has anyone else observed this kind of behaviour?
>>> We had it on 2.3.1 and I upgraded to 2.4.1 but this issue still seems to
>>> exist even after the "big refactoring" in the kubernetes cluster scheduler
>>> backend.
>>>
>>> I can work on a fix / workaround but I'd like to check with you the
>>> proper way forward :
>>>
>>>- Some processes (like the airflow helm recipe) rely on a "sleep
>>>30s" before launching the dependent pods (that could be added to
>>>/opt/entrypoint.sh used in the kubernetes packing)
>>>- We can add a simple step to the init container trying to do the
>>>DNS resolution and failing after 60s if it did not work
>>>
>>> But these steps won't change the fact that the driver will stay stuck
>>> thinking we're still in the case of the Initial allocation delay.
>>>
>>> Thoughts ?
>>>
>>> --
>>> *Olivier Girardot*
>>> o.girar...@lateral-thoughts.com
>>>
>>

-- 
*Thanks,*
*Prudhvi Chennuru.*




Fwd: Spark driver pod scheduling fails on auto scaled node

2019-01-31 Thread Prudhvi Chennuru (CONT)
Hi,

I am using Kubernetes *v1.11.5* and Spark *v2.3.0*, with
*Calico (DaemonSet)* as the overlay network plugin and the Kubernetes *cluster
autoscaler* to scale the cluster when needed. When the cluster scales up,
Calico pods are scheduled on the new nodes but are not ready for 40 to 50
seconds, and the driver and executor pods scheduled on those nodes fail
because Calico is not ready.
   So is there a way to overcome this issue, either by not scheduling
driver and executor pods until *Calico* is ready, or by introducing a delay
before driver and executor pods are scheduled on those nodes?

*I am not using spark operator.*
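
A crude workaround, in the same spirit as the entrypoint wait-loop discussed earlier in this archive, would be to make the Spark image wait until cluster DNS answers (which only happens once Calico has set up the pod network) before handing off to the stock entrypoint. This is only a sketch, assuming the image's entrypoint can be wrapped and that nslookup is available in it:

```
#!/usr/bin/env bash
# Hypothetical wrapper around the stock Spark entrypoint: block until cluster
# DNS is reachable (i.e. the pod network is actually usable on a freshly
# scaled node) before starting the driver/executor.
for i in $(seq 1 30); do
  if nslookup kubernetes.default.svc.cluster.local >/dev/null 2>&1; then
    echo "Cluster DNS reachable, starting Spark."
    break
  fi
  echo "Cluster DNS not reachable yet, retrying..."
  sleep 2
done
exec /opt/entrypoint.sh "$@"
```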

-- 
*Thanks,*
*Prudhvi Chennuru.*

