[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228208#comment-14228208 ] Takeshi Yamamuro commented on SPARK-1987: - Thanks for your review! :)) What's the change in the system? Anyway, if no problem, I'll send PR. Thanks, again. takeshi > More memory-efficient graph construction > > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228143#comment-14228143 ] Larry Xiao commented on SPARK-1987: --- [~maropu] I think it needs slight change in build system. I see your patch, cool idea, didn't know about timsort before, and your code looks very clear. :) > More memory-efficient graph construction > > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228136#comment-14228136 ] Takeshi Yamamuro commented on SPARK-1987: - What is the status of this patch? This is related to a issue I created (https://issues.apache.org/jira/browse/SPARK-4646). I refactored this patch based on my patch, it is as follows: https://github.com/maropu/spark/commit/77e34424a5e6cf2bfd6300ab35f329bdaba6e775 Thanks :) > More memory-efficient graph construction > > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138970#comment-14138970 ] Apache Spark commented on SPARK-1987: - User 'larryxiao' has created a pull request for this issue: https://github.com/apache/spark/pull/2446 > More memory-efficient graph construction > > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099476#comment-14099476 ] Larry Xiao commented on SPARK-1987: --- ok. I understand. I'll try to implement it > More memory-efficient graph construction > > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099438#comment-14099438 ] Ankur Dave commented on SPARK-1987: --- [~larryxiao] I was thinking of sorting the 3 primitive arrays directly rather than first putting them into an array of Edge objects. Though each edge will then be spread across 3 arrays, I think it shouldn't hurt locality too much, since we already have to access 2 memory locations per edge (the pointer and the referenced Edge object). Also, it will be more compact and hopefully make better use of the cache. > More memory-efficient graph construction > > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088953#comment-14088953 ] Larry Xiao commented on SPARK-1987: --- Hi Ankur What do you mean by custom sort routine? and parallel arrays? I understand the overhead of JVM objects, so do you want to use 3 separate primitivevector of srcID, dstId and Attr? I think I can implement EdgePartitionBuilder using three arrays. Some concern: Will this harm locality? (I also noticed in EdgePartition, srcID, dstId and Attr are stored in three arrays) BTW. I came across Storage Strategies for Collections in Dynamically Typed Languages, I think maybe this can be solved in JVM. > More memory-efficient graph construction > > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org