GitHub user peter-toth opened a pull request: https://github.com/apache/spark/pull/22603
SPARK-25062: clean up BlockLocations in InMemoryFileIndex

## What changes were proposed in this pull request?

`InMemoryFileIndex` caches the `FileStatus` objects for each path. Each `FileStatus` object can contain several `BlockLocation` objects. Depending on the parallel discovery threshold (`spark.sql.sources.parallelPartitionDiscovery.threshold`), the file listing can happen on the driver or on the executors.

If the listing happens on the executors, the block location objects are converted to simple `BlockLocation` objects to meet serialization requirements. If it happens on the driver, there is no conversion, and depending on the file system a `BlockLocation` object can be an instance of a subclass such as `HdfsBlockLocation`, which consumes more memory.

This PR applies the same conversion in the driver-side case, which reduces memory consumption.

## How was this patch tested?

Added unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/peter-toth/spark SPARK-25062

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22603.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22603

----
commit 5735dba9a43c336843ccc531831105d6c23b4586
Author: Peter Toth <peter.toth@...>
Date: 2018-09-30T13:01:36Z

    SPARK-25062: clean up BlockLocations in InMemoryFileIndex

----
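
For readers who want a concrete picture of the conversion described in the pull request, here is a minimal, self-contained sketch. It is not the code from the patch. `PlainBlockLocation` and `HeavyBlockLocation` are hypothetical stand-ins for Hadoop's `BlockLocation` and a file-system-specific subclass such as `HdfsBlockLocation`, and the choice of fields to copy (names, hosts, offset, length) is an assumption made for illustration.

```java
// Hypothetical stand-in for org.apache.hadoop.fs.BlockLocation.
// Assumption: names, hosts, offset and length are the fields worth keeping.
class PlainBlockLocation {
    private final String[] names;
    private final String[] hosts;
    private final long offset;
    private final long length;

    PlainBlockLocation(String[] names, String[] hosts, long offset, long length) {
        this.names = names;
        this.hosts = hosts;
        this.offset = offset;
        this.length = length;
    }

    String[] getNames() { return names; }
    String[] getHosts() { return hosts; }
    long getOffset() { return offset; }
    long getLength() { return length; }
}

// Hypothetical stand-in for a heavier subclass such as HdfsBlockLocation.
// The extra byte array models file-system-specific state that a cache
// of file statuses does not need.
class HeavyBlockLocation extends PlainBlockLocation {
    private final byte[] extraState;

    HeavyBlockLocation(String[] names, String[] hosts, long offset, long length,
                       byte[] extraState) {
        super(names, hosts, offset, length);
        this.extraState = extraState;
    }
}

final class BlockLocationCleanup {
    private BlockLocationCleanup() {}

    // Returns an array in which every element is exactly a PlainBlockLocation.
    // Instances of subclasses are copied field by field; plain instances are reused.
    static PlainBlockLocation[] toPlain(PlainBlockLocation[] locations) {
        PlainBlockLocation[] result = new PlainBlockLocation[locations.length];
        for (int i = 0; i < locations.length; i++) {
            PlainBlockLocation loc = locations[i];
            if (loc.getClass() == PlainBlockLocation.class) {
                result[i] = loc;
            } else {
                result[i] = new PlainBlockLocation(
                    loc.getNames(), loc.getHosts(), loc.getOffset(), loc.getLength());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PlainBlockLocation[] listed = {
            new HeavyBlockLocation(new String[] {"host1:50010"}, new String[] {"host1"},
                                   0L, 128L, new byte[1024]),
            new PlainBlockLocation(new String[] {"host2:50010"}, new String[] {"host2"},
                                   128L, 64L)
        };
        for (PlainBlockLocation loc : toPlain(listed)) {
            System.out.println(loc.getClass().getSimpleName() + " offset=" + loc.getOffset());
        }
    }
}
```

The exact-class check means objects that are already plain are reused rather than copied, so only subclass instances incur an allocation. Whether the real patch makes the same choice is not stated in the description above.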