[ https://issues.apache.org/jira/browse/SPARK-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matei Zaharia resolved SPARK-824.
---------------------------------
    Resolution: Fixed

This is a pretty old issue that no longer affects the newest block manager and Netty code.

> Make less copies of blocks during remote reads
> ----------------------------------------------
>
>                 Key: SPARK-824
>                 URL: https://issues.apache.org/jira/browse/SPARK-824
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager
>            Reporter: Charles Reiss
>
> To satisfy a getRemote() when the source block is stored in deserialized
> form, Spark will make at least two copies: it will create a byte buffer of
> the entire serialized block on the source node and another on the
> destination node. If the block consists of a single object (as occurs
> frequently in Shark and apparently PySpark), then the total effect is that
> remote reads require extra memory equal to three times the block size. As a
> result, especially with relatively coarse partitioning, Spark can require a
> surprisingly large amount of non-cache headroom.
>
> One copy could be avoided by serializing blocks incrementally as they are
> sent over the network. This would require connection manager and block
> manager API changes and might have negative effects if serialization is
> expensive relative to the connection speed.
>
> Similarly, a copy might be avoided by deserializing blocks incrementally as
> they are received over the network, but this probably would hurt some
> applications (e.g. if they process items from a block much more slowly than
> the network delivers them).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
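The copy pattern the issue describes can be sketched in plain Python (a hypothetical simplification using pickle as a stand-in for Spark's serializer, not the actual block manager code):

```python
import pickle

# Hypothetical single-object block stored in deserialized form on the
# source node, as occurs in Shark/PySpark workloads.
block = {"rows": list(range(100_000))}

# Copy 1: the source node serializes the entire block into one byte buffer.
serialized = pickle.dumps(block)

# Copy 2: the destination node buffers the full serialized block as it
# arrives over the network (simulated here by copying the bytes).
received = bytes(serialized)

# Copy 3: the destination deserializes the buffer back into objects.
result = pickle.loads(received)

# For a single-object block, the two byte buffers plus the rebuilt objects
# mean remote reads transiently need roughly three times the block size.
```

Streaming (de)serialization, as proposed in the issue, would replace the all-at-once `dumps`/`loads` with incremental reads and writes so that neither side ever holds a full extra copy.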