[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088953#comment-14088953 ]
Larry Xiao commented on SPARK-1987: ----------------------------------- Hi Ankur What do you mean by custom sort routine? and parallel arrays? I understand the overhead of JVM objects, so do you want to use 3 separate primitivevector of srcID, dstId and Attr? I think I can implement EdgePartitionBuilder using three arrays. Some concern: Will this harm locality? (I also noticed in EdgePartition, srcID, dstId and Attr are stored in three arrays) BTW. I came across Storage Strategies for Collections in Dynamically Typed Languages, I think maybe this can be solved in JVM. > More memory-efficient graph construction > ---------------------------------------- > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX > Reporter: Ankur Dave > Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org