[GitHub] spark pull request #13473: [SPARK-15736][CORE] Gracefully handle loss of Dis...
Github user tedyu commented on a diff in the pull request: https://github.com/apache/spark/pull/13473#discussion_r65649617

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---

    @@ -403,6 +403,17 @@ private[spark] class BlockManager(
       }

       /**
        * Cleanup code run in response to a failed local read.
        * Must be called while holding a read lock on the block.
        */
       private def handleLocalReadFailure(blockId: BlockId): Nothing = {
         releaseLock(blockId)
         // Remove the missing block so that its unavailability is reported to the driver
         removeBlock(blockId)

--- End diff --

Looking at BlockInfoManager#lockForWriting(), I think you're right.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13473: [SPARK-15736][CORE] Gracefully handle loss of Dis...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/13473#discussion_r65648664

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- (quoting the same `handleLocalReadFailure` hunk shown above)

No, I don't think so: internally, `removeBlock` acquires a write lock on the block, so if we called it before the `releaseLock` call then we'd be calling it while holding a read lock, which would cause us to deadlock.
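The hunk under discussion is truncated at the `removeBlock` call. Since the method is declared to return `Nothing`, it presumably ends by throwing so callers can never continue past a failed read. A sketch of the full helper consistent with the discussion above — the final exception and its message are my assumption, not shown in the quoted diff:

```scala
/**
 * Cleanup code run in response to a failed local read.
 * Must be called while holding a read lock on the block.
 */
private def handleLocalReadFailure(blockId: BlockId): Nothing = {
  // Release the read lock first: removeBlock() internally acquires a write
  // lock on the same block, so calling it while still holding our read lock
  // would deadlock (per BlockInfoManager#lockForWriting, the read lock is
  // not upgradable to a write lock).
  releaseLock(blockId)
  // Remove the missing block so that its unavailability is reported to the driver
  removeBlock(blockId)
  // Assumed ending: signal the failure so the caller cannot proceed
  throw new SparkException(s"Block $blockId was not found even though it was read-locked")
}
```

This ordering is the crux of the review exchange: release-then-remove is safe, while remove-then-release would self-deadlock on the write lock.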
[GitHub] spark pull request #13473: [SPARK-15736][CORE] Gracefully handle loss of Dis...
Github user tedyu commented on a diff in the pull request: https://github.com/apache/spark/pull/13473#discussion_r65648116

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- (quoting the same `handleLocalReadFailure` hunk shown above)

Should this be called before the releaseLock() call?
[GitHub] spark pull request #13473: [SPARK-15736][CORE] Gracefully handle loss of Dis...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13473
[GitHub] spark pull request #13473: [SPARK-15736][CORE] Gracefully handle loss of Dis...
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/13473

[SPARK-15736][CORE] Gracefully handle loss of DiskStore files

If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail and the missing partition is supposed to be recomputed by a new task attempt. However, the current behavior is to repeatedly re-attempt the read on the same machine without performing any recomputation, which leads to a complete job failure. In order to fix this problem, the executor with the missing file needs to properly mark the corresponding block as missing so that it stops advertising itself as a cache location for that block. This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark handle-missing-cache-files

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13473.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13473

commit 92bfce821c998c0b1a5b7ab3b674d0e59a7c0033
Author: Josh Rosen
Date: 2016-06-02T20:30:42Z

    Add failing regression test.

commit a104ac534be3df927a21aeb42fe6c11e1cd03636
Author: Josh Rosen
Date: 2016-06-02T21:08:43Z

    Add failing unit tests in BlockManagerSuite.

commit e26b2f6aa0571392730338960666ca66f5901f35
Author: Josh Rosen
Date: 2016-06-02T21:09:02Z

    Fix actual bug.

commit 0c51bb63e979549f2fad4fc14767ac6a19952e8c
Author: Josh Rosen
Date: 2016-06-02T21:21:22Z

    Update test names.
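The mechanism this PR describes can be summarized in a short sketch. This is not the actual Spark source: the surrounding read-path method, its return type, and the `diskStore` calls are illustrative assumptions; only `handleLocalReadFailure` comes from the diff discussed in this thread:

```scala
// Hypothetical sketch of a local read path using the new helper. When the
// DiskStore file has been lost, the executor removes the block (so the driver
// stops advertising this executor as a cache location) instead of letting
// later task attempts endlessly retry the same doomed local read.
private def getLocalBytes(blockId: BlockId): ChunkedByteBuffer = {
  if (diskStore.contains(blockId)) {
    try {
      diskStore.getBytes(blockId)
    } catch {
      case _: java.io.IOException =>
        // File is gone: release the read lock, remove the block, and fail
        // this read so a fresh task attempt recomputes the partition.
        handleLocalReadFailure(blockId)
    }
  } else {
    // Block metadata exists but no store actually holds the data anymore.
    handleLocalReadFailure(blockId)
  }
}
```

Because `handleLocalReadFailure` returns `Nothing`, both branches typecheck even though the failure path never produces a buffer.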