[ https://issues.apache.org/jira/browse/SPARK-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782678#comment-16782678 ]
Mohamed Mehdi BEN AISSA commented on SPARK-24346:
-------------------------------------------------

Any news? I have exactly the same issue in the same context (HDP version):

ERROR TransportRequestHandler: Error opening block StreamChunkId{streamId=1377556883266, chunkIndex=9} for request from /10.147.167.40:39050
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readLong(DataInputStream.java:416)
    at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:209)
    at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:375)
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61)
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
    at org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:92)
    at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:137)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:745)

> Executors are unable to fetch remote cache blocks
> -------------------------------------------------
>
>                 Key: SPARK-24346
>                 URL: https://issues.apache.org/jira/browse/SPARK-24346
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.3.0
>         Environment: OS: CentOS 7.3
> Cluster: Hortonworks HDP 2.6.5 with Spark 2.3.0
>            Reporter: Truong Duc Kien
>            Priority: Major
>
> After we upgraded from Spark 2.2.1 to Spark 2.3.0, our Spark jobs took a
> massive performance hit because executors became unable to fetch remote cache
> blocks from each other. The scenario is:
> 1. An executor creates a connection and sends a ChunkFetchRequest message to
> another executor.
> 2. This request arrives at the target executor, which sends back a
> ChunkFetchSuccess response.
> 3. The ChunkFetchSuccess message never arrives.
> 4. The connection between these two executors is killed by the originating
> executor after 120s of idleness. At the same time, the other executor reports
> that it failed to send the ChunkFetchSuccess because the pipe is closed.
> This process repeats itself 3 times, delaying our jobs by 6 minutes. Then the
> originating executor decides to stop fetching and computes the block
> itself, and the job can continue.
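
For anyone reading the trace in the comment above: the `java.io.EOFException` is raised by `DataInputStream.readLong`, called from `IndexShuffleBlockResolver.getBlockData`. A plausible reading is that the index data being read was shorter than expected. The sketch below only illustrates that `readLong` behaviour on a truncated byte array. It assumes an index made of consecutive 8-byte offsets, which is my assumption and not confirmed by the issue; `IndexReadSketch`, `buildIndex` and `readRange` are hypothetical names, not Spark APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

public class IndexReadSketch {

    // Hypothetical helper: write the offsets as consecutive 8-byte longs.
    static byte[] buildIndex(long[] offsets) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (long offset : offsets) {
                out.writeLong(offset);
            }
        }
        return bytes.toByteArray();
    }

    // Hypothetical helper: read the [start, end) offsets of one partition.
    // readLong() throws EOFException if fewer than 8 bytes remain.
    static long[] readRange(byte[] index, int reduceId) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(index))) {
            in.skipBytes(reduceId * 8);
            long start = in.readLong();
            long end = in.readLong();
            return new long[] {start, end};
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] full = buildIndex(new long[] {0L, 10L, 25L});
        System.out.println(Arrays.toString(readRange(full, 1))); // [10, 25]

        // Drop the last 4 bytes to simulate a partially written index.
        byte[] truncated = Arrays.copyOf(full, full.length - 4);
        try {
            readRange(truncated, 1);
        } catch (EOFException e) {
            System.out.println("EOFException on truncated index");
        }
    }
}
```

If the real index file on the serving executor is incomplete in this way, the fetch would fail on that side before any ChunkFetchSuccess is sent. Whether that is the same failure as the timeout described in the issue is not established here.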