[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-24356:
------------------------------------

    Assignee: Apache Spark

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> ------------------------------------------------------------------
>
>                 Key: SPARK-24356
>                 URL: https://issues.apache.org/jira/browse/SPARK-24356
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.3.0
>            Reporter: Misha Dmitriev
>            Assignee: Apache Spark
>            Priority: Major
>         Attachments: SPARK-24356.01.patch
>
>
> I recently analyzed a heap dump of a YARN NodeManager that was suffering from
> high GC pressure due to high object churn. The analysis was done with the JXRay
> tool ([www.jxray.com|http://www.jxray.com/]), which checks a heap dump for a
> number of well-known memory issues. One problem it found in this dump is that
> 19.5% of memory is wasted on duplicate strings. Of these duplicates, more
> than half come from {{FileInputStream.path}} and {{File.path}}. All the
> {{FileInputStream}} objects that JXRay shows are garbage: it looks like they
> are used for a very short period and then discarded (I guess there is a
> separate question of whether that's a good pattern). But the {{File}} instances
> are traceable to the
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field.
> Here is the full reference chain:
>
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>
> The values of these {{File.path}} and {{FileInputStream.path}} strings look very
> similar, so I think the {{FileInputStream}}s are created by the
> {{FileSegmentManagedBuffer}} code. The {{File}} instances, in turn, likely
> come from
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>
> To avoid duplicate strings in {{File.path}} in this case, I suggest that the
> above code create a {{File}} from a complete, normalized pathname that has
> already been interned. The code inside {{java.io.File}} then has no reason to
> modify this string, so it keeps the interned copy and passes it on to
> {{FileInputStream}}. Essentially, the current line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)),
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) +
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
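
The replacement proposed in the description can be sketched as a small self-contained class. This is only an illustration under stated assumptions, not the attached patch: the class name InternedPathExample and the helper buildPath are hypothetical, and because the normalize method used inside java.io.File is not public API, the sketch obtains the normalized form through File.getPath() on a temporary File instead. Whether java.io.File then keeps the very same String instance is a JDK implementation detail, so the sketch only relies on String.intern() returning one canonical copy per distinct path.

```java
import java.io.File;

// Hypothetical illustration of the "build, normalize, intern" idea.
final class InternedPathExample {
    private InternedPathExample() {}

    // Builds the full pathname as one string, normalizes it, and interns it.
    // The "%02x" sub-directory format is taken from the line quoted in the issue.
    static String buildPath(String localDir, int subDirId, String filename) {
        String pathname = localDir + File.separator
            + String.format("%02x", subDirId) + File.separator + filename;
        // java.io.File normalizes the pathname in its constructor; getPath()
        // returns that normalized form. intern() then yields a single shared
        // String for all equal paths.
        return new File(pathname).getPath().intern();
    }

    // Creates the File from the already normalized, interned pathname, so the
    // constructor has nothing left to rewrite.
    static File getFile(String localDir, int subDirId, String filename) {
        return new File(buildPath(localDir, subDirId, filename));
    }
}
```

Calling getFile(localDir, subDirId, filename) repeatedly with the same arguments produces distinct File objects, but buildPath hands each of them the same interned String, which is the property the description relies on to remove the duplicates.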