[ https://issues.apache.org/jira/browse/SPARK-16711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392135#comment-15392135 ]

Thomas Graves commented on SPARK-16711:
---------------------------------------

Note that what happens here is that the YarnShuffleService initializes and
immediately opens its port, so clients can connect right away. Really, we should
have read everything from our backup database and repopulated our state before
opening the port. I think we are currently relying on the NodeManager calling
initializeApplication again, but that is too late: by the time it happens, a
client could already have hit the NullPointerException and failed.
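
A minimal sketch of the ordering argued for above, assuming a hypothetical `RecoveringShuffleService` class whose recovery database is modeled as a plain `Map` (this is not Spark's actual YarnShuffleService code). State is restored from the recovery store before the service starts accepting clients, so a later `initializeApplication` call from the NodeManager only refreshes what is already there.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: illustrates init ordering only, not Spark's real service.
class RecoveringShuffleService {
    // Stand-in for the on-disk recovery database (appId -> secret).
    private final Map<String, String> recoveryDb;
    // In-memory secrets consulted when a client authenticates.
    private final Map<String, String> secrets = new ConcurrentHashMap<>();
    // Models whether the port has been opened to clients.
    private volatile boolean accepting = false;

    RecoveringShuffleService(Map<String, String> recoveryDb) {
        this.recoveryDb = recoveryDb;
    }

    void serviceInit() {
        // 1. Repopulate state from the recovery database first...
        secrets.putAll(recoveryDb);
        // 2. ...and only then open the port to clients.
        accepting = true;
    }

    /** Looks up an app's secret; only reachable once the port is open. */
    String lookupSecret(String appId) {
        if (!accepting) {
            throw new IllegalStateException("server not yet accepting connections");
        }
        return secrets.get(appId);
    }

    /** Models the NodeManager re-registering an app later; harmless if already restored. */
    void initializeApplication(String appId, String secret) {
        secrets.put(appId, secret);
        recoveryDb.put(appId, secret);
    }
}
```

Because recovery happens inside `serviceInit()` before `accepting` flips, no client can observe the window in which secrets are missing.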

> YarnShuffleService doesn't re-init properly on YARN rolling upgrade
> -------------------------------------------------------------------
>
>                 Key: SPARK-16711
>                 URL: https://issues.apache.org/jira/browse/SPARK-16711
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, YARN
>    Affects Versions: 1.5.2
>            Reporter: Thomas Graves
>
> When a YARN rolling upgrade happens, the Spark YarnShuffleService isn't
> re-initializing the tokens soon enough. Running applications then fail with
> NullPointerExceptions rather than IOExceptions, so clients don't retry, and the
> application fails completely when it should have just retried and succeeded.
>
> 2016-07-22 23:22:05,460 [shuffle-server-1] ERROR server.TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 6235606084052282795
> java.lang.NullPointerException: Password cannot be null if SASL is enabled
>         at org.spark-project.guava.base.Preconditions.checkNotNull(Preconditions.java:208)
>         at org.apache.spark.network.sasl.SparkSaslServer.encodePassword(SparkSaslServer.java:196)
>         at org.apache.spark.network.sasl.SparkSaslServer$DigestCallbackHandler.handle(SparkSaslServer.java:166)
>         at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>         at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at org.apache.spark.network.sasl.SparkSaslServer.response(SparkSaslServer.java:119)
>         at org.apache.spark.network.sasl.SaslRpcHandler.receive(SaslRpcHandler.java:101)
>         at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149)
>         at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
>         at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>         at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>         at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>         at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>         at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>         at java.lang.Thread.run(Thread.java:745)
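
The failure mode in the trace can be sketched as a simplified, hypothetical model. `RetryDemo`, `shouldRetry`, `withRetries` and `requirePassword` are illustrative names, and the rule that only `IOException`s (directly or as a cause) are retried is an assumption based on the description above, not a copy of Spark's client code. When the secret has not been restored yet, the password check throws a `NullPointerException` (as Guava's `checkNotNull` does), which the retry loop treats as fatal; a transient `IOException` would instead be retried.

```java
import java.io.IOException;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch; names and logic are illustrative, not Spark's actual classes.
final class RetryDemo {
    /** A single fetch attempt that may fail. */
    interface Attempt<T> {
        T run() throws Exception;
    }

    /** Assumed policy: only I/O failures (directly or as the cause) are transient. */
    static boolean shouldRetry(Throwable t) {
        return t instanceof IOException || t.getCause() instanceof IOException;
    }

    /** Runs the attempt, retrying up to maxRetries times on retriable failures. */
    static <T> T withRetries(Attempt<T> attempt, int maxRetries) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return attempt.run();
            } catch (Exception e) {
                if (retries >= maxRetries || !shouldRetry(e)) {
                    throw e; // non-retriable (e.g. NullPointerException): fail immediately
                }
                retries++;
            }
        }
    }

    /** Simplified stand-in for the server-side password check in the trace above. */
    static String requirePassword(Map<String, String> secrets, String appId) {
        // Objects.requireNonNull throws NullPointerException with this message,
        // mirroring the Preconditions.checkNotNull frame in the trace.
        return Objects.requireNonNull(secrets.get(appId),
            "Password cannot be null if SASL is enabled");
    }
}
```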



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
