[jira] [Created] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

Misha Dmitriev (JIRA) Tue, 22 May 2018 14:27:21 -0700

Misha Dmitriev created SPARK-24356:
--------------------------------------

             Summary: Duplicate strings in File.path managed by FileSegmentManagedBuffer
                 Key: SPARK-24356
                 URL: https://issues.apache.org/jira/browse/SPARK-24356
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle
    Affects Versions: 2.3.0
            Reporter: Misha Dmitriev



I recently analyzed a heap dump of a YARN NodeManager that was suffering from 
high GC pressure due to high object churn. The analysis was done with the JXRay 
tool ([www.jxray.com|http://www.jxray.com/]), which checks a heap dump for a 
number of well-known memory issues. One problem it found in this dump is that 
19.5% of memory is wasted on duplicate strings. More than half of these 
duplicates come from {{FileInputStream.path}} and {{File.path}}. All the 
{{FileInputStream}} objects that JXRay shows are garbage: it looks like they 
are used for a very short period and then discarded (I guess there is a 
separate question of whether that's a good pattern). But the {{File}} instances 
are traceable to the 
{{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
is the full reference chain:
 
{code:java}
↖java.io.File.path
↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
↖{j.u.ArrayList}
↖j.u.ArrayList$Itr.this$0
↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
↖{java.util.concurrent.ConcurrentHashMap}.values
↖org.apache.spark.network.server.OneForOneStreamManager.streams
↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
{code}
 
The values of these {{File.path}} and {{FileInputStream.path}} strings look 
very similar, so I think the {{FileInputStream}}s are created by the 
{{FileSegmentManagedBuffer}} code. The {{File}} instances, in turn, likely come 
from 
[https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
 
To avoid duplicate strings in {{File.path}} in this case, I suggest that the 
above code create the {{File}} from a complete, normalized pathname that has 
already been interned. The code inside {{java.io.File}} then has no reason to 
modify this string, so it will keep the interned copy and pass it on to 
{{FileInputStream}}. Essentially, the current line
{code:java}
return new File(new File(localDir, String.format("%02x", subDirId)), filename);
{code}
should be replaced with something like
{code:java}
String pathname = localDir + File.separator + String.format(...) + 
File.separator + filename;
pathname = fileSystem.normalize(pathname).intern();
return new File(pathname);{code}
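
A slightly fuller sketch of the same idea, in case it helps the discussion. Note that {{java.io.FileSystem}} is package-private, so {{fileSystem.normalize}} cannot be called from Spark code directly; here I assume that {{new File(raw).getPath()}} is an acceptable way to get the normalized form. The class and method names are hypothetical, and the fact that {{File}} keeps the exact {{String}} instance it is given when that string is already normalized is OpenJDK behavior as far as I can tell, not something the {{File}} specification promises.

```java
import java.io.File;

// Hypothetical helper for illustration only; not Spark's actual code.
class InternedPathSketch {

  // Builds the shuffle file path as one string, lets java.io.File normalize it,
  // then interns the normalized form so that equal paths share one String.
  static File getFile(String localDir, int subDirId, String filename) {
    String raw = localDir + File.separator
        + String.format("%02x", subDirId) + File.separator + filename;
    // File.getPath() returns the pathname as normalized by the File constructor.
    String normalized = new File(raw).getPath().intern();
    // Assumption: for an already-normalized string, File stores that same
    // instance (OpenJDK behavior, not guaranteed by the specification).
    return new File(normalized);
  }

  public static void main(String[] args) {
    File a = getFile("/tmp/spark-local", 10, "shuffle_0_0_0.data");
    File b = getFile("/tmp/spark-local", 10, "shuffle_0_0_0.data");
    System.out.println(a.getPath());
    // Reference comparison on purpose: checks whether both File objects
    // point at the same String instance rather than at equal copies.
    System.out.println(a.getPath() == b.getPath());
  }
}
```

This creates one short-lived temporary {{File}} per call to do the normalization, which trades a little extra garbage for not having to reimplement the normalization rules by hand.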
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
