[ 
https://issues.apache.org/jira/browse/TEZ-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142893#comment-15142893
 ] 

Jason Lowe commented on TEZ-3115:
---------------------------------

When auto-parallelism kicks in, we're going to see many copies of the same 
upstream task attempt IDs, host:port strings, etc.  We should at least 
consider interning or otherwise sharing these, or potentially just storing 
the raw ID and generating the string on the fly when necessary.  MapHost is 
another example of this redundancy, since it stores the fully qualified host 
name and port at least three times (as part of baseUrl, identifier, and 
hostIdentifier).  I wonder if it would be better overall to store MapHost 
more compactly and generate the URLs and identifiers on demand.
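
As a minimal sketch of the idea above, and not the existing MapHost code: the 
class below keeps one shared copy of each host name plus an int port, and 
builds the identifier and URL strings only when asked.  The class name 
CompactHost, its methods, and the URL layout are all hypothetical and chosen 
for illustration only.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical compact host record; NOT the actual Tez MapHost class. */
final class CompactHost {
  // Canonical host-name instances, so equal names share a single String.
  // (Assumption: the set of distinct hosts is small and bounded.)
  private static final ConcurrentMap<String, String> HOSTS =
      new ConcurrentHashMap<>();

  private final String host; // stored once, shared across records
  private final int port;    // kept as a primitive instead of inside strings

  CompactHost(String host, int port) {
    // putIfAbsent returns the previously stored instance, or null if
    // this name was not present and has just been added.
    String prior = HOSTS.putIfAbsent(host, host);
    this.host = (prior != null) ? prior : host;
    this.port = port;
  }

  String getHost() {
    return host;
  }

  int getPort() {
    return port;
  }

  /** Built on demand rather than held as a field. */
  String getIdentifier() {
    return host + ":" + port;
  }

  /** Built on demand; scheme and path are illustrative parameters. */
  String getBaseUrl(String scheme, String path) {
    return scheme + "://" + host + ":" + port + "/" + path;
  }
}
```

With this layout each record costs one reference and one int no matter how 
many identifier or URL variants callers need.  The trade-off is extra 
short-lived garbage each time a string is generated, so whether it wins 
overall depends on how often those strings are requested.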


> Shuffle string handling adds significant memory overhead
> --------------------------------------------------------
>
>                 Key: TEZ-3115
>                 URL: https://issues.apache.org/jira/browse/TEZ-3115
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>
> While investigating the OOM heap dump from TEZ-3114 I noticed that the 
> ShuffleManager and other shuffle-related objects were holding onto many 
> strings that added up to over a hundred megabytes of memory.



