GitHub user peter-toth opened a pull request:

    https://github.com/apache/spark/pull/22603

    SPARK-25062: clean up BlockLocations in InMemoryFileIndex

    ## What changes were proposed in this pull request?
    
    `InMemoryFileIndex` caches `FileStatus` objects by path. Each `FileStatus`
object can contain several `BlockLocation`s. Depending on the parallel
discovery threshold (`spark.sql.sources.parallelPartitionDiscovery.threshold`),
the file listing happens either on the driver or on the executors. If the
listing happens on the executors, the block locations are converted to plain
`BlockLocation` objects to satisfy serialization requirements. If it happens on
the driver, no conversion takes place, so depending on the file system a
`BlockLocation` can be an instance of a subclass such as `HdfsBlockLocation`
and consume more memory. This PR applies the same conversion in the driver-side
case, which decreases memory consumption.
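
The conversion described above can be modelled roughly as follows. This is
only an illustrative sketch in Python, not the code in this PR: the classes
are stand-ins for the Hadoop types, and `HdfsLikeBlockLocation`,
`located_block`, `to_plain` and `clean_locations` are hypothetical names
introduced here. It shows the idea of copying only the base fields so that
the heavier subclass instance is not kept in the cache.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockLocation:
    """Stand-in for a plain block location (assumed fields, not the Hadoop API)."""
    names: List[str]
    hosts: List[str]
    offset: int
    length: int


@dataclass
class HdfsLikeBlockLocation(BlockLocation):
    """Hypothetical subclass that carries extra file-system-specific state."""
    located_block: bytes = b""


def to_plain(loc: BlockLocation) -> BlockLocation:
    """Return loc as-is if it is already the base type, else copy the base fields."""
    if type(loc) is BlockLocation:
        return loc
    return BlockLocation(list(loc.names), list(loc.hosts), loc.offset, loc.length)


def clean_locations(locations: List[BlockLocation]) -> List[BlockLocation]:
    """Apply the conversion to every block location of one file."""
    return [to_plain(loc) for loc in locations]
```

In this model, a location that is already a plain `BlockLocation` is reused
unchanged, so the conversion only costs an allocation when a subclass
instance is actually present.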
    
    ## How was this patch tested?
    
    Added unit test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/peter-toth/spark SPARK-25062

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22603.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22603
    
----
commit 5735dba9a43c336843ccc531831105d6c23b4586
Author: Peter Toth <peter.toth@...>
Date:   2018-09-30T13:01:36Z

    SPARK-25062: clean up BlockLocations in InMemoryFileIndex

----

