[jira] [Comment Edited] (SPARK-12826) Spark Workers do not attempt reconnect or exit on connection failure.

2016-02-03 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113180#comment-15113180
 ] 

Alan Braithwaite edited comment on SPARK-12826 at 2/3/16 5:25 PM:
--

-Update to this:  We moved the spark-master out from behind the load balancer 
(statically provisioned it with a CNAME) and we're still observing the same 
behavior.-

-SPARK_PUBLIC_DNS is set to the CNAME.  Once again: no logs, no active 
connections.-

Edit: I think this one (non-proxied case) was just our scheduling framework 
restarting the master without HA enabled.


was (Author: abraithwaite):
Update to this:  We moved the spark-master out from behind the load balancer 
(statically provisioned it with a CNAME) and we're still observing the same 
behavior.

SPARK_PUBLIC_DNS is set to the CNAME.  Once again: no logs, no active 
connections.

> Spark Workers do not attempt reconnect or exit on connection failure.
> ---------------------------------------------------------------------
>
> Key: SPARK-12826
> URL: https://issues.apache.org/jira/browse/SPARK-12826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Alan Braithwaite
>Priority: Critical
>
> Spark version 1.6.0, Hadoop 2.6.0, CDH 5.4.2.
> We're running behind a TCP proxy (in this example, 10.14.12.11:7077 is the proxy's listen address, forwarding upstream to the Spark master, which listens on port 9682 at a different IP).
> To reproduce, I started a Spark worker, let it successfully connect to the master through the proxy, then killed the connection on the worker with tcpkill. Nothing is logged from the code handling reconnection attempts.
> {code}
> 16/01/14 18:23:30 INFO Worker: Connecting to master spark-master.example.com:7077...
> 16/01/14 18:23:30 DEBUG TransportClientFactory: Creating new connection to spark-master.example.com/10.14.12.11:7077
> 16/01/14 18:23:30 DEBUG TransportClientFactory: Connection to spark-master.example.com/10.14.12.11:7077 successful, running bootstraps...
> 16/01/14 18:23:30 DEBUG TransportClientFactory: Successfully created connection to spark-master.example.com/10.14.12.11:7077 after 1 ms (0 ms spent in bootstraps)
> 16/01/14 18:23:30 DEBUG Recycler: -Dio.netty.recycler.maxCapacity.default: 262144
> 16/01/14 18:23:30 INFO Worker: Successfully registered with master spark://0.0.0.0:9682
> 16/01/14 18:23:30 INFO Worker: Worker cleanup enabled; old application directories will be deleted in: /var/lib/spark/work
> 16/01/14 18:36:52 DEBUG SecurityManager: user=null aclsEnabled=false viewAcls=spark
> 16/01/14 18:36:52 DEBUG SecurityManager: user=null aclsEnabled=false viewAcls=spark
> 16/01/14 18:36:57 DEBUG SecurityManager: user=null aclsEnabled=false viewAcls=spark
> 16/01/14 18:36:57 DEBUG SecurityManager: user=null aclsEnabled=false viewAcls=spark
> 16/01/14 18:41:31 WARN TransportChannelHandler: Exception in connection from spark-master.example.com/10.14.12.11:7077
> java.io.IOException: Connection reset by peer
>   at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>   at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>   at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>   at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>   at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>   at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> -- nothing more is logged, going on 15 minutes --
> $ ag -C5 Disconn core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
> 313    registrationRetryTimer.foreach(_.cancel(true))
> 314    registrationRetryTimer = None
> 315  }
> 316
> 317  private def registerWithMaster() {
> 318    // onDisconnected may be triggered multiple times, so don't attempt registration
> 319    // if there are outstanding registration attempts scheduled.
> 320    registrationRetryTimer match {
> 321      case None =>
> 322        registered = false
> 323        registerMasterFutures = tryRegisterAllMasters()
> --
> 549        finishedExecutors.values.toList, drivers.values.toList,
> 550        finishedDrivers.values.toList, activeMasterUrl, cores, memory,
> 551        coresUsed, memoryUsed, activeMasterWebUiUrl))
> 552  }
> {code}
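The excerpt above shows the guard suspected of swallowing reconnect attempts: `registerWithMaster()` only schedules work when no `registrationRetryTimer` is outstanding. A minimal, self-contained sketch of that guard pattern (illustrative only, not Spark's actual code; the names `ReconnectGuard`, `onDisconnected`, and `attempts` are invented for the example):

```scala
import java.util.concurrent.{Executors, ScheduledFuture, ThreadFactory, TimeUnit}

// Illustrative sketch (not Spark's code) of the guard visible in the
// Worker.scala excerpt above: onDisconnected can fire several times for one
// dropped connection, so a registration attempt is scheduled only when no
// retry is already outstanding.
object ReconnectGuard {
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = { val t = new Thread(r); t.setDaemon(true); t }
  })
  private var registrationRetryTimer: Option[ScheduledFuture[_]] = None
  @volatile var attempts = 0 // exposed only so the example below can observe it

  def onDisconnected(): Unit = registerWithMaster()

  private def registerWithMaster(): Unit = synchronized {
    registrationRetryTimer match {
      case None =>
        // No retry in flight: schedule exactly one registration attempt.
        registrationRetryTimer = Some(scheduler.schedule(new Runnable {
          def run(): Unit = { attempts += 1 }
        }, 10, TimeUnit.MILLISECONDS))
      case Some(_) => // a retry is already scheduled; drop the duplicate event
    }
  }
}
```

Under this pattern, two back-to-back disconnect events produce a single scheduled attempt; if the timer were never cleared after a failed attempt, every later disconnect would be silently dropped too, which would match the "nothing more is logged" symptom described here.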

[jira] [Comment Edited] (SPARK-12826) Spark Workers do not attempt reconnect or exit on connection failure.

2016-01-14 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15099069#comment-15099069
 ] 

Shixiong Zhu edited comment on SPARK-12826 at 1/14/16 10:47 PM:


The issue here is that the worker cannot get the correct master address from the connection.

This line is weird: 16/01/14 18:23:30 INFO Worker: Successfully registered with master spark://0.0.0.0:9682

Did you use "SPARK_MASTER_HOST" or "-h" to set the master host to "0.0.0.0"?
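For context on why spark://0.0.0.0:9682 is suspicious, a hypothetical sketch (the `masterUrl` helper is invented for illustration, not Spark's actual code) of how a master whose host is set to the wildcard bind address would advertise an unroutable URL to its workers:

```scala
// Hypothetical illustration: if the master's host is set to the wildcard bind
// address (e.g. SPARK_MASTER_HOST=0.0.0.0 or `-h 0.0.0.0`), the spark:// URL
// built from it carries that address, which a worker cannot use to reconnect.
def masterUrl(host: String, port: Int): String = s"spark://$host:$port"

val advertised = masterUrl("0.0.0.0", 9682) // matches the worker's log line
```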



was (Author: zsxwing):
This line is weird: 16/01/14 18:23:30 INFO Worker: Successfully registered with 
master spark://0.0.0.0:9682

Did you use "SPARK_MASTER_HOST" or "-h" to set the master host to "0.0.0.0"?


