GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/19458
[SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp files

## What changes were proposed in this pull request?

Prior to this commit, getAllBlocks implicitly assumed that the directories managed by the DiskBlockManager contain only files whose names correspond to valid block IDs. In reality, this assumption was violated during shuffle, which produces temporary files in the same directory as the resulting blocks. As a result, calls to getAllBlocks during shuffle were unreliable: they could fail on a file name that does not parse as a block ID.

The fix could be made more efficient, but this is probably good enough.

## How was this patch tested?

`DiskBlockManagerSuite`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/criteo-forks/spark block-id-option

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19458.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19458

----
commit 9b9b86fed0e5949fd9e7abaefe08c3d9d986feb6
Author: Sergei Lebedev <s.lebe...@criteo.com>
Date:   2017-10-09T16:52:00Z

    [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp files

----
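For illustration, the idea described above (skip files whose names are not valid block IDs instead of failing on them) can be sketched as follows. This is a minimal, self-contained sketch and not the code from the patch: `SimpleBlockId`, `TolerantListing`, and the single shuffle name pattern are hypothetical stand-ins for Spark's `BlockId` and `DiskBlockManager`, and the temp file name in the usage note is made up.

```scala
import java.io.File
import scala.util.Try

// Hypothetical stand-in for Spark's BlockId; only one name pattern is modelled.
final case class SimpleBlockId(name: String)

object SimpleBlockId {
  private val Shuffle = "shuffle_([0-9]+)_([0-9]+)_([0-9]+)".r

  // Strict parser: throws on any name that is not a recognized block ID.
  def parse(name: String): SimpleBlockId = name match {
    case Shuffle(_, _, _) => SimpleBlockId(name)
    case _ => throw new IllegalArgumentException(s"Unrecognized block ID: $name")
  }
}

object TolerantListing {
  // Lists the block IDs found in `dir`, silently skipping files
  // (such as temporary shuffle files) whose names do not parse.
  def getAllBlocks(dir: File): Seq[SimpleBlockId] = {
    // listFiles() returns null if `dir` is not a readable directory.
    val files = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    files
      .filter(_.isFile)
      .flatMap(f => Try(SimpleBlockId.parse(f.getName)).toOption)
  }
}
```

With a directory containing `shuffle_1_2_3` and a file named, say, `temp_shuffle_example`, `TolerantListing.getAllBlocks` returns only the first. Wrapping the parse in `Try` builds and discards an exception for every skipped file, which is one plausible reading of the remark that the fix could be made more efficient.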