[jira] [Assigned] (SPARK-25062) Clean up BlockLocations in FileStatus objects
[ https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25062:

Assignee: Peter Toth

> Clean up BlockLocations in FileStatus objects
> -
>
> Key: SPARK-25062
> URL: https://issues.apache.org/jira/browse/SPARK-25062
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.2
> Reporter: andrzej.stankev...@gmail.com
> Assignee: Peter Toth
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: petertoth.png
>
> When Spark lists a collection of files, it either does the listing on the driver or creates tasks to list the files, depending on the number of files; see
> [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]
>
> If Spark creates tasks to list files, each task creates one FileStatus object per file. Before sending a FileStatus to the driver, Spark converts it to a SerializableFileStatus. On the driver side, Spark turns the SerializableFileStatus back into a FileStatus, and it also creates a BlockLocation object for each FileStatus using
>
> {code:java}
> new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length)
> {code}
>
> After this deserialization on the driver side, the BlockLocation lacks much of the information the original HdfsBlockLocation had.
>
> If Spark does the listing on the driver side, each FileStatus object instead holds HdfsBlockLocation objects, which carry a lot of information Spark doesn't use. Because of this, FileStatus objects take more memory than if they had been created on the executor side.
>
> Later, Spark puts all these objects into the _SharedInMemoryCache_, and that cache takes 2.2x more memory when files are listed on the driver side than when they are listed on the executor side.
>
> In our case the _SharedInMemoryCache_ takes 125M when we do the scan on executors and 270M when we do it on the driver, for about 19,000 files.
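The lossy round-trip described above can be sketched with a small, self-contained Scala example. The types below are hypothetical, simplified stand-ins for Hadoop's HdfsBlockLocation and Spark's SerializableFileStatus (not the real classes, which carry more fields); the sketch only illustrates why fields that are not copied into the serializable form cannot be recovered on the driver.

```scala
// Hypothetical stand-in for an HDFS-backed block location: it carries
// extra fields (cachedHosts, storageIds) that Spark never reads.
final case class RichBlockLocation(
    names: Array[String],       // "host:port" of each replica
    hosts: Array[String],       // hostname of each replica
    cachedHosts: Array[String], // HDFS-only detail, unused by Spark
    storageIds: Array[String],  // HDFS-only detail, unused by Spark
    offset: Long,
    length: Long)

// Hypothetical stand-in for the serializable form sent from executors
// to the driver: only the four fields Spark actually uses survive.
final case class SerializableBlockLocation(
    names: Array[String],
    hosts: Array[String],
    offset: Long,
    length: Long)

// Executor side: drop everything but the four used fields.
def toSerializable(loc: RichBlockLocation): SerializableBlockLocation =
  SerializableBlockLocation(loc.names, loc.hosts, loc.offset, loc.length)

// Driver side: rebuild a location; the HDFS-only fields are gone for good,
// so the rebuilt object is smaller than one created by driver-side listing.
def fromSerializable(loc: SerializableBlockLocation): RichBlockLocation =
  RichBlockLocation(loc.names, loc.hosts, Array.empty, Array.empty,
    loc.offset, loc.length)
```

This is the asymmetry the ticket describes: executor-side listing effectively performs this projection before caching, while driver-side listing caches the rich objects, hence the roughly 2.2x difference in _SharedInMemoryCache_ size.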
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25062) Clean up BlockLocations in FileStatus objects
[ https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25062:

Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-25062) Clean up BlockLocations in FileStatus objects
[ https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25062:

Assignee: Apache Spark