[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579184#comment-16579184 ]

Imran Rashid commented on SPARK-24356:
--------------------------------------

Somewhat related to SPARK-24938, which explains why these buffers are on the heap at all, given that Spark configures netty to use off-heap buffers by default.

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> ------------------------------------------------------------------
>
>                 Key: SPARK-24356
>                 URL: https://issues.apache.org/jira/browse/SPARK-24356
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.3.0
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: SPARK-24356.01.patch, dup-file-strings-details.png
>
>
> I recently analyzed a heap dump of a Yarn Node Manager that was suffering from
> high GC pressure due to high object churn. Analysis was done with the jxray
> tool ([www.jxray.com|http://www.jxray.com/]), which checks a heap dump for a
> number of well-known memory issues. One problem that it found in this dump is
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more
> than half come from {{FileInputStream.path}} and {{File.path}}. All the
> {{FileInputStream}} objects that JXRay shows are garbage; it looks like they
> are used for a very short period and then discarded (I guess there is a
> separate question of whether that's a good pattern). But the {{File}} instances
> are traceable to the
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here
> is the full reference chain:
>
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very
> similar, so I think the {{FileInputStream}}s are generated by the
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely
> come from
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested
> that in the above code we create a File with a complete, normalized pathname
> that has already been interned. This will prevent the code inside
> {{java.io.File}} from modifying this string, and thus it will use the
> interned copy, and will pass it to FileInputStream. Essentially the current
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)),
>     filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) +
>     File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
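The replacement proposed in the description can be sketched as a small standalone helper. This is only an illustration under stated assumptions, not the code from the attached patch or the pull request: the class name `InternedPaths`, the method names, and the regex-based normalization are hypothetical, and a Unix-style "/" separator is assumed so that the example behaves the same on any platform (the description's `fileSystem.normalize` is not a public API, so the sketch normalizes by hand).

```java
import java.io.File;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the idea; not the code from the patch. */
final class InternedPaths {
    // Assumption: Unix-style separator, hard-coded to keep the example deterministic.
    private static final String SEP = "/";
    private static final Pattern REPEATED_SEPARATORS = Pattern.compile("/{2,}");

    private InternedPaths() {}

    /** Joins the three components, normalizes the result, and interns it. */
    static String build(String localDir, String subDir, String filename) {
        String pathname = localDir + SEP + subDir + SEP + filename;
        // Collapse runs of separators so that java.io.File has nothing left to rewrite.
        pathname = REPEATED_SEPARATORS.matcher(pathname).replaceAll(SEP);
        // Drop a single trailing separator, but keep a bare root "/" intact.
        if (pathname.length() > 1 && pathname.endsWith(SEP)) {
            pathname = pathname.substring(0, pathname.length() - 1);
        }
        // Equal pathnames now share one canonical String instance.
        return pathname.intern();
    }

    /** Mirrors the shape of the line quoted in the description. */
    static File file(String localDir, int subDirId, String filename) {
        return new File(build(localDir, String.format("%02x", subDirId), filename));
    }

    public static void main(String[] args) {
        String a = build("/data//local/", "0c", "shuffle_1_2_0.data");
        String b = build("/data/local", "0c", "shuffle_1_2_0.data");
        System.out.println(a);      // prints /data/local/0c/shuffle_1_2_0.data
        System.out.println(a == b); // prints true: both refer to the interned copy
    }
}
```

Whether `java.io.File` keeps the exact instance it was given depends on the JDK's platform-specific normalization, so the saving should be confirmed with a heap dump rather than assumed.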
[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494184#comment-16494184 ]

Apache Spark commented on SPARK-24356:
--------------------------------------

User 'countmdm' has created a pull request for this issue:
https://github.com/apache/spark/pull/21456
[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494058#comment-16494058 ]

Ruslan Dautkhanov commented on SPARK-24356:
-------------------------------------------

Another change that we saw could decrease GC pressure on YARN NodeManagers is lowering io.netty.allocator.maxOrder from its default of 11 to 8, which shrinks each netty buffer from 16 MB to 2 MB. Thanks to [~mi...@cloudera.com] for helping to identify this one too:

{quote}
Netty code responsible for highly underutilized buffers that we discussed. Long story short, I think I found the variables that control these byte[] arrays referenced by io.netty.buffer.PoolChunk.memory. Check the code of http://netty.io/4.0/xref/io/netty/buffer/PooledByteBufAllocator.html: lines 39-40 look like:

    private static final int DEFAULT_PAGE_SIZE;
    private static final int DEFAULT_MAX_ORDER; // 8192 << 11 = 16 MiB per chunk

A little below you can see:

    int defaultPageSize = SystemPropertyUtil.getInt("io.netty.allocator.pageSize", 8192);
    ... // Some validation
    DEFAULT_PAGE_SIZE = defaultPageSize;
    int defaultMaxOrder = SystemPropertyUtil.getInt("io.netty.allocator.maxOrder", 11);
    ... // Some validation
    DEFAULT_MAX_ORDER = defaultMaxOrder;

And then from the rest of the code in this class, as well as PoolChunk, PoolChunkList and PoolArena, it is clear that the size of the said buffers is set as pageSize * (2^maxOrder), with the default values as above. 8192 bytes * (2^11) = 16 MB, which agrees with the buffer size obtained from the jxray report that I previously mentioned.

So it looks like, to reduce the amount of memory wasted by these underutilized netty buffers, it's best to run the Yarn NM JVM with "io.netty.allocator.maxOrder" explicitly set to something less than the default value of 11. Decreasing this number by 1 will reduce the amount of memory consumed by these buffers by a factor of 2. I would suggest starting with a property value of 9 or 8; that seems like a reasonable balance between savings and safety.
{quote}

I was surprised to learn that the YARN NM actually runs some Spark code (e.g. org.apache.spark.network.yarn.YarnShuffleService), so this issue could be common to the YARN NM and the Spark shuffle service. However, we did not check whether underutilized netty buffers also affect the Spark shuffle service; it might be a good idea to open another JIRA.

jxray seems to be a great tool for finding issues like these.
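The chunk-size arithmetic from the quoted analysis, pageSize * 2^maxOrder, can be checked with a few lines of Java. This is a minimal sketch that does not call netty at all: the class `NettyChunkSize` and its method are hypothetical names, and the 8192-byte page size is simply the default cited in the quote.

```java
/** Hypothetical calculator for the pageSize * 2^maxOrder formula quoted above. */
final class NettyChunkSize {
    // Default of io.netty.allocator.pageSize, as cited in the quoted analysis.
    static final int DEFAULT_PAGE_SIZE = 8192;

    private NettyChunkSize() {}

    /** Bytes per pooled chunk: pageSize shifted left by maxOrder. */
    static long chunkBytes(int pageSize, int maxOrder) {
        return ((long) pageSize) << maxOrder;
    }

    public static void main(String[] args) {
        // Each step down in maxOrder halves the chunk size: 16, 8, 4, 2 MiB.
        for (int maxOrder = 11; maxOrder >= 8; maxOrder--) {
            long bytes = chunkBytes(DEFAULT_PAGE_SIZE, maxOrder);
            System.out.printf("maxOrder=%d -> %d MiB per chunk%n", maxOrder, bytes >> 20);
        }
    }
}
```

Since `io.netty.allocator.maxOrder` is read as a JVM system property, it would be passed to the NodeManager JVM as `-Dio.netty.allocator.maxOrder=8`; where that option is configured depends on the deployment.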
[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492960#comment-16492960 ]

Edwina Lu commented on SPARK-24356:
-----------------------------------

Thanks! This is interesting and could help with some of our shuffle service issues; we can give this a try.
[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492931#comment-16492931 ]

Imran Rashid commented on SPARK-24356:
--------------------------------------

Yeah, Misha has a change ready. I sent him a message about submitting a PR instead of a patch, and will follow up tomorrow.
[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491788#comment-16491788 ]

Felix Cheung commented on SPARK-24356:
--------------------------------------

Interesting. We will definitely look into this. Is the plan to turn this into a PR to fix in Spark?
[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489417#comment-16489417 ]

Imran Rashid commented on SPARK-24356:
--------------------------------------

cc [~jinxing6...@126.com] [~elu] [~felixcheung]: this could be a nice win for decreasing GC pressure on the shuffle service, and might be related to issues you are running into.
[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486580#comment-16486580 ]

Misha Dmitriev commented on SPARK-24356:
----------------------------------------

I plan to work on this feature.