Misha Dmitriev created SPARK-24356:
--------------------------------------

Summary: Duplicate strings in File.path managed by FileSegmentManagedBuffer
Key: SPARK-24356
URL: https://issues.apache.org/jira/browse/SPARK-24356
Project: Spark
Issue Type: Improvement
Components: Shuffle
Affects Versions: 2.3.0
Reporter: Misha Dmitriev

I recently analyzed a heap dump of a YARN NodeManager that was suffering from high GC pressure due to high object churn. The analysis was done with the JXRay tool ([www.jxray.com|http://www.jxray.com/]), which checks a heap dump for a number of well-known memory issues. One problem that it found in this dump is that 19.5% of memory is wasted on duplicate strings. Of these duplicates, more than half come from {{FileInputStream.path}} and {{File.path}}.

All the {{FileInputStream}} objects that JXRay shows are garbage: it looks like they are used for a very short period and then discarded (I guess there is a separate question of whether that's a good pattern). The {{File}} instances, however, are traceable to the {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here is the full reference chain:
{code:java}
↖java.io.File.path
↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
↖{j.u.ArrayList}
↖j.u.ArrayList$Itr.this$0
↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
↖{java.util.concurrent.ConcurrentHashMap}.values
↖org.apache.spark.network.server.OneForOneStreamManager.streams
↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
{code}
The values of these {{File.path}} and {{FileInputStream.path}} strings look very similar, so I think the {{FileInputStream}} objects are created by the {{FileSegmentManagedBuffer}} code. The {{File}} instances, in turn, likely come from [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]

To avoid duplicate strings in {{File.path}} in this case, I suggest that the above code create the {{File}} from a complete, normalized pathname that has already been interned.
This prevents the code inside {{java.io.File}} from modifying the string, so {{File}} keeps the interned copy and passes it on to {{FileInputStream}}. Essentially, the current line
{code:java}
return new File(new File(localDir, String.format("%02x", subDirId)), filename);
{code}
should be replaced with something like
{code:java}
String pathname = localDir + File.separator + String.format(...) + File.separator + filename;
pathname = fileSystem.normalize(pathname).intern();
return new File(pathname);
{code}
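
For illustration, here is a minimal, self-contained sketch of the suggested replacement. It is written under several assumptions and is not Spark's actual code: the class and method names are hypothetical, and because the normalizer used by {{java.io.File}} is, as far as I know, internal to the {{java.io}} package, the sketch substitutes {{new File(pathname).getPath()}} (which returns the path as normalized by the {{File}} constructor) for the {{fileSystem.normalize(...)}} call above.

```java
import java.io.File;

/** Hypothetical helper; the names are illustrative and not part of Spark. */
final class InternedShufflePaths {

    private InternedShufflePaths() {}

    /**
     * Builds the complete pathname, lets java.io.File normalize it, and interns
     * the result so that equal pathnames share a single String instance.
     */
    static String internedPathname(String localDir, int subDirId, String filename) {
        String pathname = localDir + File.separator
            + String.format("%02x", subDirId) + File.separator + filename;
        // Assumption: stand-in for the internal normalizer. File normalizes the
        // pathname in its constructor, and getPath() returns the normalized form.
        String normalized = new File(pathname).getPath();
        return normalized.intern();
    }

    /** Creates the File from the interned, already-normalized pathname. */
    static File getFile(String localDir, int subDirId, String filename) {
        return new File(internedPathname(localDir, subDirId, filename));
    }
}
```

Two calls to {{internedPathname}} with the same arguments return the same {{String}} instance. Whether the resulting {{File}} objects then share that instance depends on the JDK returning an already-normalized pathname unchanged, which is the premise of the suggestion above. The temporary {{File}} used for normalization is itself short-lived garbage, so this variant trades a small allocation for fewer retained duplicates.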