Spark101 created SPARK-36912:
--------------------------------

             Summary: Get Result time for task is taking very long time and 
timesout
                 Key: SPARK-36912
                 URL: https://issues.apache.org/jira/browse/SPARK-36912
             Project: Spark
          Issue Type: Question
          Components: Block Manager
    Affects Versions: 3.0.3
            Reporter: Spark101
         Attachments: Stage-result.pdf, Storage-result.pdf, environment.pdf, 
executors.pdf, thread-dump-exec3.pdf, threadDump-exc2.pdf

We use Spark on Kubernetes to run batch jobs that analyze flows and produce
insights. The flows are read from a time-series database. We have 3 executor
instances, each with 5g of memory, plus a driver with 5g of memory. We observe
the following warning, followed by timeout errors, after which the job fails.
We have been stuck on this for some time and are really hoping to get some
help from this forum:

2021-10-02T16:07:09.459ZGMT  WARN dispatcher-CoarseGrainedScheduler TaskSetManager - Stage 52 contains a task of very large size (2842 KiB). The maximum recommended task size is 1000 KiB.

2021-10-02T16:08:19.151ZGMT ERROR task-result-getter-0 RetryingBlockFetcher - Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /192.168.7.99:34259
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121)
        at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:143)
        at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:103)
        at org.apache.spark.storage.BlockManager.fetchRemoteManagedBuffer(BlockManager.scala:1010)
        at org.apache.spark.storage.BlockManager.$anonfun$getRemoteBlock$8(BlockManager.scala:954)
        at scala.Option.orElse(Option.scala:447)
        at org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:954)
        at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:1092)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:88)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: /192.168.7.99:34259
Caused by: java.net.ConnectException: Connection timed out
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
2021-10-02T16:08:19.151ZGMT ERROR task-result-getter-2 RetryingBlockFetcher - Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /192.168.6.167:42405
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121)
        at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:143)
        at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:103)
        at org.apache.spark.storage.BlockManager.fetchRemoteManagedBuffer(BlockManager.scala:1010)
        at org.apache.spark.storage.BlockManager.$anonfun$getRemoteBlock$8(BlockManager.scala:954)
        at scala.Option.orElse(Option.scala:447)
        at org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:954)
        at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:1092)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:88)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: /192.168.6.167:42
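
Both traces end in a TCP "Connection timed out" while the driver's task-result-getter threads try to fetch a block from an executor's block manager address. One way to narrow this down, shown here only as a minimal sketch, is to check from inside the driver pod whether that host and port are reachable at all. The helper name `can_connect` is hypothetical and not part of Spark, and the sketch assumes a Python interpreter is available in the driver container. The address in the usage comment is copied from the first trace and is specific to that run.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical diagnostic helper; it only tests TCP reachability and
    does not speak Spark's block transfer protocol.
    """
    try:
        # create_connection raises OSError (including timeouts and
        # connection refusals) when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage from the driver pod, with the executor address taken
# from the first stack trace above (valid for that run only):
#   print(can_connect("192.168.7.99", 34259))
```

A result of False here would point toward pod-to-pod networking between the driver and the executor rather than toward Spark itself; a result of True would suggest looking elsewhere, for example at the executor's state at the time of the fetch.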



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

