[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364712#comment-14364712 ]
Apache Spark commented on SPARK-5523:
-------------------------------------

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5064

> TaskMetrics and TaskInfo have innumerable copies of the hostname string
> -----------------------------------------------------------------------
>
>                 Key: SPARK-5523
>                 URL: https://issues.apache.org/jira/browse/SPARK-5523
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Streaming
>            Reporter: Tathagata Das
>
> TaskMetrics and TaskInfo objects have the hostname associated with the task.
> As these are created (directly or through deserialization of RPC messages),
> each of them has a separate String object for the hostname, even though most
> of them hold the same string data. This results in thousands of duplicate
> string objects, increasing the memory requirement of the driver.
> This can be easily deduped when deserializing a TaskMetrics object, or when
> creating a TaskInfo object.
> This affects streaming particularly badly due to the rate of job/stage/task
> generation.
> For a solution, see how this dedup is done for StorageLevel.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
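
The dedup described in the issue could look roughly like the sketch below. It assumes a concurrent-map cache in the spirit of the StorageLevel dedup the issue points to. `HostnameCache` and `dedup` are hypothetical names chosen for illustration; they are not Spark APIs, and this is not necessarily how the linked pull request solves the problem.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical helper (not part of Spark): keeps one canonical String
// instance per distinct hostname so equal hostnames share a single object.
object HostnameCache {
  private val cache = new ConcurrentHashMap[String, String]()

  /** Returns the canonical instance equal to `host`, registering it on first sight. */
  def dedup(host: String): String = {
    // putIfAbsent returns the previously stored value, or null if `host` was new.
    val previous = cache.putIfAbsent(host, host)
    if (previous == null) host else previous
  }
}

object HostnameCacheDemo {
  def main(args: Array[String]): Unit = {
    // new String(...) forces distinct objects, mimicking hostnames that were
    // deserialized separately from different RPC messages.
    val a = HostnameCache.dedup(new String("worker-1.example.com"))
    val b = HostnameCache.dedup(new String("worker-1.example.com"))
    // Reference equality: both callers now hold the same String object.
    println(a eq b)
  }
}
```

In such a design, `dedup` would be called at the two points the issue names: when a TaskMetrics object is deserialized and when a TaskInfo object is created. The cache grows by one entry per distinct hostname, which should be bounded by the number of hosts in the cluster.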