Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/7927#discussion_r36556494 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -590,10 +590,21 @@ private[spark] class BlockManager( private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = { require(blockId != null, "BlockId is null") val locations = Random.shuffle(master.getLocations(blockId)) + var failTimes = 0 for (loc <- locations) { logDebug(s"Getting remote block $blockId from $loc") - val data = blockTransferService.fetchBlockSync( - loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() + val data = try { + blockTransferService.fetchBlockSync( + loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() + } catch { + case e: IOException if failTimes < locations.size - 1 => + // Return null when IOException throw, so we can fetch block + // from another location if there still have locations + failTimes += 1 + logWarning(s"Try ${failTimes} times getting remote block $blockId from $loc failed:", e) --- End diff -- Hi @GraceH , sorry I didn't see your comment earlier. You are right, catching just `IOException` doesn't make sense, no matter what the exception we should keep trying all locations. I think we are in agreement that `doGetRemote` should tolerate exceptions, as long as it is able to get from one location (and all those exceptions should get logged). The question is, what should it do when it cannot fetch the block from any location? Should it (a) return `None`, and just log all the exceptions or (b) should it throw an exception? In some ways, it seems the answer should be (a), since the api already returns an `Option`, and it would seem that was the whole purpose. But on the other hand, the only way the existing code could return `None` is if `locations` were empty -- the actual behavior of the current version is to throw an exception if the block isn't available. I'm really torn between both options, I find I can convince myself of either one. Ultimately I am leaning slightly toward (b), just because we can get a slightly better exception.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org