[jira] [Commented] (SPARK-23981) ShuffleBlockFetcherIterator - Spamming Logs
[ https://issues.apache.org/jira/browse/SPARK-23981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439614#comment-16439614 ]

BELUGA BEHR commented on SPARK-23981:
-------------------------------------

Or maybe lower the per-block logging to debug and produce one overall log message if fetches cannot be completed.

> ShuffleBlockFetcherIterator - Spamming Logs
> -------------------------------------------
>
>                 Key: SPARK-23981
>                 URL: https://issues.apache.org/jira/browse/SPARK-23981
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.3.0
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> If a remote host's shuffle service fails, Spark executors produce a huge amount of logging.
> {code:java}
> 2018-04-10 20:24:44,834 INFO [Block Fetch Retry-1] shuffle.RetryingBlockFetcher (RetryingBlockFetcher.java:initiateRetry(163)) - Retrying fetch (3/3) for 1753 outstanding blocks after 5000 ms
> 2018-04-10 20:24:49,865 ERROR [Block Fetch Retry-1] storage.ShuffleBlockFetcherIterator (Logging.scala:logError(95)) - Failed to get block(s) from myhost.local:7337
> java.io.IOException: Failed to connect to myhost.local/10.11.12.13:7337
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>     at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>     at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.ConnectException: Connection refused: myhost.local/12.13.14.15:7337
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>     at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>     ... 1 more
> {code}
> We can see from the code that if a block fetch fails, a "listener" is notified once for each block. From the error messages above, 1753 blocks were being fetched. Since the remote host has become unavailable, they all fail, and an error is reported for every single block.
> {code:java|title=RetryingBlockFetcher.java}
> if (shouldRetry(e)) {
>   initiateRetry();
> } else {
>   for (String bid : blockIdsToFetch) {
>     listener.onBlockFetchFailure(bid, e);
>   }
> }
> {code}
> {code:java|title=ShuffleBlockFetcherIterator.scala}
> override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
>   logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
>   results.put(new FailureFetchResult(BlockId(blockId), address, e))
> }
> {code}
> So what we get is 1753 ERROR stack traces in the log, all printing the same message:
> {quote}Failed to get block(s) from myhost.local:7337
> ...
> {quote}
> Perhaps it would be better if the {{onBlockFetchFailure}} method signature were changed to accept an entire collection of block IDs instead of being invoked one block at a time.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
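As a rough sketch of the signature change proposed in this ticket, the listener could be handed the whole batch of failed block IDs at once, so the receiver can log a single error for the batch. The interface and class names below are hypothetical illustrations, not Spark's actual BlockFetchingListener API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a batch-oriented variant of the fetch-failure callback.
// Instead of one onBlockFetchFailure call (and one ERROR stack trace) per
// block, the fetcher reports all failed block IDs in a single call.
public class BatchFailureSketch {

  interface BatchBlockFetchingListener {
    // Proposed shape: one callback for the entire failed batch.
    void onBlockFetchFailure(List<String> blockIds, Throwable e);
  }

  static int errorLogCount = 0;

  public static void main(String[] args) {
    BatchBlockFetchingListener listener = (blockIds, e) -> {
      // One ERROR record summarizing the batch, instead of 1753 identical ones.
      errorLogCount++;
      System.out.println("ERROR Failed to fetch " + blockIds.size()
          + " block(s) from myhost.local:7337: " + e.getMessage());
    };

    // Simulate a batch of outstanding blocks that all fail together when the
    // remote shuffle service is unreachable.
    List<String> outstanding = Arrays.asList("shuffle_0_1_2", "shuffle_0_3_2", "shuffle_0_5_2");
    listener.onBlockFetchFailure(outstanding, new java.io.IOException("Failed to connect"));

    // A single log record regardless of how many blocks were in the batch.
    System.out.println(errorLogCount);
  }
}
```

The per-block FailureFetchResult records the iterator needs could still be produced inside the single callback; only the logging collapses to one message.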
[jira] [Created] (SPARK-23994) Add Host To Blacklist If Shuffle Cannot Complete
BELUGA BEHR created SPARK-23994:
-----------------------------------

             Summary: Add Host To Blacklist If Shuffle Cannot Complete
                 Key: SPARK-23994
                 URL: https://issues.apache.org/jira/browse/SPARK-23994
             Project: Spark
          Issue Type: Improvement
          Components: Block Manager, Shuffle
    Affects Versions: 2.3.0
            Reporter: BELUGA BEHR

If a node cannot be reached for shuffling data, add the node to the blacklist and retry the current stage.

{code:java}
2018-04-10 20:25:55,065 ERROR [Block Fetch Retry-3] shuffle.RetryingBlockFetcher (RetryingBlockFetcher.java:fetchAllOutstanding(142)) - Exception while beginning fetch of 711 outstanding blocks (after 3 retries)
java.io.IOException: Failed to connect to host.local/10.11.12.13:7337
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
    at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused: host.local/10.11.12.13:7337
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    ... 1 more
{code}
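The tracking the ticket asks for could look roughly like the following. This is an illustrative sketch only, with hypothetical names, not Spark's actual BlacklistTracker or scheduler API: once RetryingBlockFetcher exhausts its retries against a host, the host is recorded so the scheduler can avoid it when the stage is retried.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch (hypothetical names, not Spark internals): remember
// shuffle hosts that failed after all fetch retries so later scheduling
// decisions can skip them.
public class ShuffleBlacklistSketch {

  // Thread-safe set: fetch retries run on executor threads.
  private final Set<String> blacklistedHosts = ConcurrentHashMap.newKeySet();

  // Would be called after the fetcher exhausts its retries for this host.
  void onFetchFailedAfterRetries(String host) {
    if (blacklistedHosts.add(host)) {
      // Log only on the first failure for the host, not once per batch.
      System.out.println("Blacklisting shuffle host " + host + "; stage will be retried");
    }
  }

  boolean isBlacklisted(String host) {
    return blacklistedHosts.contains(host);
  }

  public static void main(String[] args) {
    ShuffleBlacklistSketch tracker = new ShuffleBlacklistSketch();
    tracker.onFetchFailedAfterRetries("host.local");
    // A second failure from the same host does not log again.
    tracker.onFetchFailedAfterRetries("host.local");
    System.out.println(tracker.isBlacklisted("host.local"));
    System.out.println(tracker.isBlacklisted("other.local"));
  }
}
```

In a real implementation the entry would also need an expiry, since an external shuffle service can come back after a restart.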
[jira] [Created] (SPARK-23981) ShuffleBlockFetcherIterator - Spamming Logs
BELUGA BEHR created SPARK-23981:
-----------------------------------

             Summary: ShuffleBlockFetcherIterator - Spamming Logs
                 Key: SPARK-23981
                 URL: https://issues.apache.org/jira/browse/SPARK-23981
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 2.3.0
            Reporter: BELUGA BEHR

If a remote host's shuffle service fails, Spark executors produce a huge amount of logging.

{code:java}
2018-04-10 20:24:44,834 INFO [Block Fetch Retry-1] shuffle.RetryingBlockFetcher (RetryingBlockFetcher.java:initiateRetry(163)) - Retrying fetch (3/3) for 1753 outstanding blocks after 5000 ms
2018-04-10 20:24:49,865 ERROR [Block Fetch Retry-1] storage.ShuffleBlockFetcherIterator (Logging.scala:logError(95)) - Failed to get block(s) from myhost.local:7337
java.io.IOException: Failed to connect to myhost.local/10.11.12.13:7337
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
    at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused: myhost.local/12.13.14.15:7337
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    ... 1 more
{code}

We can see from the code that if a block fetch fails, a "listener" is notified once for each block. From the error messages above, 1753 blocks were being fetched. Since the remote host has become unavailable, they all fail, and an error is reported for every single block.

{code:java|title=RetryingBlockFetcher.java}
if (shouldRetry(e)) {
  initiateRetry();
} else {
  for (String bid : blockIdsToFetch) {
    listener.onBlockFetchFailure(bid, e);
  }
}
{code}
{code:java|title=ShuffleBlockFetcherIterator.scala}
override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
  logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
  results.put(new FailureFetchResult(BlockId(blockId), address, e))
}
{code}

So what we get is 1753 ERROR stack traces in the log, all printing the same message:

{quote}Failed to get block(s) from myhost.local:7337
...
{quote}

Perhaps it would be better if the {{onBlockFetchFailure}} method signature were changed to accept an entire collection of block IDs instead of being invoked one block at a time.
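The alternative raised in the comment thread on this issue, keeping the per-block detail but at DEBUG level and emitting a single ERROR summary for the whole failed fetch, can be sketched as follows. Log levels are simulated with a boolean flag, and all names here are hypothetical, not Spark's Logging trait:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the commenter's alternative: per-block messages drop to DEBUG,
// and a single ERROR summarizes the whole failed batch. Hypothetical names;
// a real logging framework's level check would replace the flag.
public class SummaryLoggingSketch {

  static final boolean DEBUG_ENABLED = false; // typical production setting

  static void logDebug(String msg) {
    if (DEBUG_ENABLED) System.out.println("DEBUG " + msg);
  }

  static void logError(String msg) {
    System.out.println("ERROR " + msg);
  }

  static void reportFailure(List<String> blockIds, Throwable e) {
    // Per-block detail survives, but only when DEBUG is enabled.
    for (String bid : blockIds) {
      logDebug("Failed to get block " + bid + ": " + e.getMessage());
    }
    // One overall ERROR with a count, instead of a stack trace per block.
    logError("Failed to get " + blockIds.size()
        + " block(s) from myhost.local:7337: " + e.getMessage());
  }

  public static void main(String[] args) {
    reportFailure(Arrays.asList("shuffle_0_0_0", "shuffle_0_1_0"),
        new java.io.IOException("Failed to connect to myhost.local/10.11.12.13:7337"));
  }
}
```

With DEBUG disabled, a batch of any size produces exactly one ERROR line; enabling DEBUG restores the per-block detail for troubleshooting.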