[ https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641152#comment-16641152 ]
Peter Toth commented on SPARK-25062: ------------------------------------ Thanks [~dongjoon]. :) > Clean up BlockLocations in FileStatus objects > --------------------------------------------- > > Key: SPARK-25062 > URL: https://issues.apache.org/jira/browse/SPARK-25062 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.2.2 > Reporter: andrzej.stankev...@gmail.com > Assignee: Peter Toth > Priority: Major > Fix For: 3.0.0 > > Attachments: petertoth.png > > > When Spark lists collection of files it does it on a driver or creates tasks > to list files depending on number of files. here > [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170] > If spark creates tasks to list files each task creates one FileStatus object > per file. Before sending FileStatus to a driver Spark converts FileStatus to > SerializableFileStatus. On driver side Spark turns SerializableFileStatus > back to FileStatus and it also creates BlockLocation object for each > FileStatus using > > {code:java} > new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) > {code} > > After deserialization on a driver side BlockLocation doesn't have a lot of > information that original HDFSBlockLocation had. > > If Spark does listing on a driver side FileStatus object has > HSDFBlockLocation objects and they have a lot of info that Spark doesn't use. > Because of this FileStatus objects takes more memory than if it would created > on executor side. > > Later Spark puts all this objects into _SharedInMemoryCache_ and that cache > takes 2.2x more memory if files were listed on driver side than if they were > listed on executor side. > > In our case _SharedInMemoryCache_ takes 125M when we do scan on executors > and 270M when we do it on a driver. It is for about 19000 files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org