My assumption would be that your master process is getting killed for some
reason, possibly due to OOM kills by the kernel. (I'm assuming you are
running your driver program on the master node itself.)
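
One quick way to test that hypothesis is to look for traces of the kernel
OOM killer on the master node. A rough sketch (assuming a Linux master
with dmesg available; the exact log wording varies by kernel and
distribution, so treat this as a starting point, not a definitive check):

    import subprocess

    # Scan the kernel ring buffer for OOM-killer activity.
    # Matching on "oom" is deliberately loose, since the exact
    # message format differs between kernel versions.
    output = subprocess.check_output(["dmesg"]).decode("utf-8", "replace")
    for line in output.splitlines():
        if "oom" in line.lower():
            print(line)

If the OOM killer did fire, reducing the driver's memory footprint (e.g.
via spark-submit's --driver-memory option) or running the driver somewhere
other than the master node may help.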

Thanks
Best Regards

On Wed, Jan 14, 2015 at 11:25 PM, TJ Klein <[email protected]> wrote:

> Hi,
>
> I am running PySpark on a cluster. Generally it runs. However, I
> frequently get the following warning (and consequently the task is not
> executed):
>
> WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
> your cluster UI to ensure that workers are registered and have sufficient
> memory
>
> It is odd because all my nodes have the same specifications and the same
> data. Why does it work sometimes and not at other times?
>
> Looking at the log file I see stuff like this:
>
> 15/01/14 11:42:11 INFO Worker: Disassociated
> [akka.tcp://[email protected]:50198] ->
> [akka.tcp://sparkMaster@node001:7077] Disassociated !
> 15/01/14 11:42:11 ERROR Worker: Connection to master failed! Waiting for
> master to reconnect...
> 15/01/14 11:42:11 ERROR EndpointWriter: AssociationError
> [akka.tcp://[email protected]:50198] ->
> [akka.tcp://[email protected]:35231]: Error [Association
> failed
> with [akka.tcp://[email protected]:35231]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://[email protected]:35231]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: node001.cluster/172.16.6.101:35231
> ]
> 15/01/14 11:42:11 ERROR EndpointWriter: AssociationError
> [akka.tcp://[email protected]:50198] ->
> [akka.tcp://sparkMaster@node001:7077]: Error [Association failed with
> [akka.tcp://sparkMaster@node001:7077]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkMaster@node001:7077]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: node001/172.16.6.101:7077
> ]
> 15/01/14 11:42:11 INFO Worker: Disassociated
> [akka.tcp://[email protected]:50198] ->
> [akka.tcp://sparkMaster@node001:7077] Disassociated !
> 15/01/14 11:42:11 ERROR Worker: Connection to master failed! Waiting for
> master to reconnect...
> 15/01/14 11:42:11 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef:
> Message [org.apache.spark.deploy.DeployMessages$ExecutorStateChanged] from
> Actor[akka://sparkWorker/user/Worker#-1661660308] to
> Actor[akka://sparkWorker/deadLetters] was not delivered. [3] dead letters
> encountered. This logging can be turned off or adjusted with configuration
> settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'
>
>
> Maybe somebody has an idea? I would greatly appreciate it.
>
> -Tassilo
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-running-spark-on-cluster-tp21138.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
