Github user GraceH commented on a diff in the pull request: https://github.com/apache/spark/pull/7927#discussion_r36372781 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -590,10 +590,21 @@ private[spark] class BlockManager( private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = { require(blockId != null, "BlockId is null") val locations = Random.shuffle(master.getLocations(blockId)) + var failTimes = 0 for (loc <- locations) { logDebug(s"Getting remote block $blockId from $loc") - val data = blockTransferService.fetchBlockSync( - loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() + val data = try { + blockTransferService.fetchBlockSync( + loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() + } catch { + case e: IOException if failTimes < locations.size - 1 => + // Return null when IOException throw, so we can fetch block + // from another location if there still have locations + failTimes += 1 + logWarning(s"Try ${failTimes} times getting remote block $blockId from $loc failed:", e) --- End diff -- @squito We'd better to catch up the exception to avoid working flow to be broken. The single fetch attempt should not break the entire code path. No matter what the exception type is. Regarding the place to catch that exception. For the debugging or diagnosis, we'd better to mark down how many remote fetch failures there and why. And log out all the fetches are failed or not. It seems to be more reasonable to catch the exception in ```getRemote```. ```getRemote``` itself can tolerant certain fetch failures from parts of the remotes. It only requires single success. That is the design for ```getRemote```. The ```fetchBlockSync``` is a function to tell if fetch fail or success. If success, return back the data. If not, throw out the exception to indicate why. To swallow the exception there seems not so that reasonable. The upper level function can decide if the exception or fetch failure is acceptable or not. What do you think?
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org