[jira] [Comment Edited] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231734#comment-14231734 ] Larry Xiao edited comment on SPARK-3523 at 12/2/14 4:50 PM: Hi [~lianhuiwang], that's great! Currently, to allow more complex partitioning, I changed the interface of partitioner, because default partitioners are hash based, and didn't need much information about graph. An option is to partition the graph externally, as [~ankurd] suggested. What do you think? was (Author: larryxiao): Hi [~lianhuiwang], that's great! As you can see, to allow more complex partitioning, the interface need to be change, because default partitioners are hash based. Another option is to partition the graph externally, as [~ankurd] suggested. What do you think? > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of the vertices are > adjacent to nearly half of the edges. > * For high-degree vertex, one vertex concentrates vast resources. So the > workload on few high-degree vertex should be decomposed by all machines > * For low-degree vertex, The computation on one vertex is quite small. Thus > should exploit the locality of the computation on low-degree vertex. > h5. Algorithm Description > * HybridCut > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! > * HybridCutPlus > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! > * Greedy BiCut > * a heuristic algorithm for bipartite > h5. Result > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%! > * The left Y axis is replication factor, right axis is the balance > (measured using CV, coefficient of variation) of either vertices or edges of > all partitions. The balance of edges can infer computation balance, and the > balance of vertices can infer communication balance. > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! > > * This is an example of a balanced partitioning achieving 20% saving on > communication. > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! > * This is a simple partitioning result of BiCut. > * in-2.0-1m is a generated power law graph with alpha equals 2.0 > h5. Code > * > https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 > * Because the implementation breaks the current separation with > PartitionStrategy.scala, so need to think of a way to support access to graph. > h5. Reference > - Bipartite-oriented Distributed Graph Partitioning for Big Learning. > - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed > Graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231734#comment-14231734 ] Larry Xiao commented on SPARK-3523: --- Hi [~lianhuiwang], that's great! As you can see, to allow more complex partitioning, the interface need to be change, because default partitioners are hash based. Another option is to partition the graph externally, as [~ankurd] suggested. What do you think? > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of the vertices are > adjacent to nearly half of the edges. > * For high-degree vertex, one vertex concentrates vast resources. So the > workload on few high-degree vertex should be decomposed by all machines > * For low-degree vertex, The computation on one vertex is quite small. Thus > should exploit the locality of the computation on low-degree vertex. > h5. Algorithm Description > * HybridCut > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! > * HybridCutPlus > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! > * Greedy BiCut > * a heuristic algorithm for bipartite > h5. Result > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%! > * The left Y axis is replication factor, right axis is the balance > (measured using CV, coefficient of variation) of either vertices or edges of > all partitions. The balance of edges can infer computation balance, and the > balance of vertices can infer communication balance. > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! > > * This is an example of a balanced partitioning achieving 20% saving on > communication. > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! > * This is a simple partitioning result of BiCut. > * in-2.0-1m is a generated power law graph with alpha equals 2.0 > h5. Code > * > https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 > * Because the implementation breaks the current separation with > PartitionStrategy.scala, so need to think of a way to support access to graph. > h5. Reference > - Bipartite-oriented Distributed Graph Partitioning for Big Learning. > - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed > Graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228143#comment-14228143 ] Larry Xiao commented on SPARK-1987: --- [~maropu] I think it needs slight change in build system. I see your patch, cool idea, didn't know about timsort before, and your code looks very clear. :) > More memory-efficient graph construction > > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Shepherd: (was: ankurdave) > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of the vertices are > adjacent to nearly half of the edges. > * For high-degree vertex, one vertex concentrates vast resources. So the > workload on few high-degree vertex should be decomposed by all machines > * For low-degree vertex, The computation on one vertex is quite small. Thus > should exploit the locality of the computation on low-degree vertex. > h5. Algorithm Description > * HybridCut > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! > * HybridCutPlus > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! > * Greedy BiCut > * a heuristic algorithm for bipartite > h5. Result > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%! > * The left Y axis is replication factor, right axis is the balance > (measured using CV, coefficient of variation) of either vertices or edges of > all partitions. The balance of edges can infer computation balance, and the > balance of vertices can infer communication balance. > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! > > * This is an example of a balanced partitioning achieving 20% saving on > communication. > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! > * This is a simple partitioning result of BiCut. > * in-2.0-1m is a generated power law graph with alpha equals 2.0 > h5. Code > * > https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 > * Because the implementation breaks the current separation with > PartitionStrategy.scala, so need to think of a way to support access to graph. > h5. Reference > - Bipartite-oriented Distributed Graph Partitioning for Big Learning. > - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed > Graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%! * The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! * This is an example of a balanced partitioning achieving 20% saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * This is a simple partitioning result of BiCut. * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation with PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%! * The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! * This is an example of a balanced partitioning's saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * This is a simple partitioning result of BiCut. * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation with PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Mot
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%! * The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! * This is an example of a balanced partitioning's saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * This is a simple partitioning result of BiCut. * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation with PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%! * The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! * This is an example of a balanced partitioning's saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * This is a simple partitioning result of BiCut. * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > *
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%! * The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! * This is an example of a balanced partitioning's saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * This is a simple partitioning result of BiCut. * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png! * The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! * This is an example of a balanced partitioning's saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * This is a simple partitioning result of BiCut. * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in r
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png! * The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! * This is an example of a balanced partitioning's saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * This is a simple partitioning result of BiCut. * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=600! * The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! * This is an example of a balanced partitioning's saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * This is a simple partitioning result of BiCut. * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in re
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=600! * The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! * This is an example of a balanced partitioning's saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * This is a simple partitioning result of BiCut. * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360! The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! This is an example of a balanced partitioning's saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of t
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360! The left Y axis is replication factor, right axis is the balance (measured using CV, coefficient of variation) of either vertices or edges of all partitions. The balance of edges can infer computation balance, and the balance of vertices can infer communication balance. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! This is an example of a balanced partitioning's saving on communication. * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! * in-2.0-1m is a generated power law graph with alpha equals 2.0 h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of the vertices are > adjacent to nearly half of the edges. > * For high-degree vertex, one vertex concentrates vast resources. So the > workload on few high-degree vertex should be decomposed by all machines > * For low-degree vertex, The computation on one vertex is quite small. Thus > should exploit the locality of the computation on low-degree vertex. > h5. Algorithm Description > * HybridCut > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Hyb
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of the vertices are > adjacent to nearly half of the edges. > * For high-degree vertex, one vertex concentrates vast resources. So the > workload on few high-degree vertex should be decomposed by all machines > * For low-degree vertex, The computation on one vertex is quite small. Thus > should exploit the locality of the computation on low-degree vertex. > h5. Algorithm Description > * HybridCut > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! > * HybridCutPlus > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! > * Greedy BiCut > * a heuristic algorithm for bipartite > h5. Result > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360! > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png! * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|scale=50%! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|scale=50%! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|scale=50%! h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of the vertices are > adjacent to nearly half of the edges. > * For high-degree vertex, one vertex concentrates vast resources. So the > workload on few high-degree vertex should be decomposed by all machines > * For low-degree vertex, The computation on one vertex is quite small. Thus > should exploit the locality of the computation on low-degree vertex. > h5. Algorithm Description > * HybridCut > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! > * HybridCutPlus > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! > * Greedy BiCut > * a heuristic algorithm for bipartite > h5. Result > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360! > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png! * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|scale=50%! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|scale=50%! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|scale=50%! h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png! * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png! h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of the vertices are > adjacent to nearly half of the edges. > * For high-degree vertex, one vertex concentrates vast resources. So the > workload on few high-degree vertex should be decomposed by all machines > * For low-degree vertex, The computation on one vertex is quite small. Thus > should exploit the locality of the computation on low-degree vertex. > h5. Algorithm Description > * HybridCut > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png! > * HybridCutPlus > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png! > * Greedy BiCut > * a heuristic algorithm for bipartite > h5. Result > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|scale=50%! > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|scale=50%! > !https://raw.githubusercontent.com/larryxiao/spark/Graph
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. H5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. H5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png * Greedy BiCut * a heuristic algorithm for bipartite H5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png H5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. H5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of the vertices are > adjacent to nearly half of the edges. > * For high-degree vertex, one vertex concentrates vast resources. So the > workload on few high-degree vertex should be decomposed by all machines > * For low-degree vertex, The computation on one vertex is quite small. Thus > should exploit the locality of the computation on low-degree vertex. > h5. Algorithm Description > * HybridCut > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png > * HybridCutPlus > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png > * Greedy BiCut > * a heuristic algorithm for bipartite > h5. Result > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png > h5. Code > * > https://github.
[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy
[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-3523: -- Description: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png! * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png! * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png! !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png! h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs was: We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. h5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. h5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png * Greedy BiCut * a heuristic algorithm for bipartite h5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png h5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. h5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs > GraphX graph partitioning strategy > -- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of the vertices are > adjacent to nearly half of the edges. > * For high-degree vertex, one vertex concentrates vast resources. So the > workload on few high-degree vertex should be decomposed by all machines > * For low-degree vertex, The computation on one vertex is quite small. Thus > should exploit the locality of the computation on low-degree vertex. > h5. Algorithm Description > * HybridCut > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png! > * HybridCutPlus > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png! > * Greedy BiCut > * a heuristic algorithm for bipartite > h5. Result > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png! > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png! > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png! > h5. Code > * > https
[jira] [Created] (SPARK-3523) GraphX graph partitioning strategy
Larry Xiao created SPARK-3523: - Summary: GraphX graph partitioning strategy Key: SPARK-3523 URL: https://issues.apache.org/jira/browse/SPARK-3523 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.2 Reporter: Larry Xiao We implemented some algorithms for partitioning on GraphX, and evaluated. And find the partitioning has space of improving. Seek opinion and advice. H5. Motivation * Graph in real world follow power law. Eg. On twitter 1% of the vertices are adjacent to nearly half of the edges. * For high-degree vertex, one vertex concentrates vast resources. So the workload on few high-degree vertex should be decomposed by all machines * For low-degree vertex, The computation on one vertex is quite small. Thus should exploit the locality of the computation on low-degree vertex. H5. Algorithm Description * HybridCut * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png * HybridCutPlus * !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png * Greedy BiCut * a heuristic algorithm for bipartite H5. Result !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png H5. Code * https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 * Because the implementation breaks the current separation using PartitionStrategy.scala, so need to think of a way to support access to graph. H5. Reference - Bipartite-oriented Distributed Graph Partitioning for Big Learning. - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099476#comment-14099476 ] Larry Xiao commented on SPARK-1987: --- ok. I understand. I'll try to implement it > More memory-efficient graph construction > > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2981) PartitionStrategy: VertexID hash overflow
[ https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-2981: -- Description: In EdgePartition1D, a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. The Long is overflowed, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3. for example when you partition to 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) so the vertices are partitioned to 0,3 for 6; and 0 for 9 {quote} I think solution is to cast after mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} was: In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. The Long is overflowed, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3. for example when you partition to 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) so the vertices are partitioned to 0,3 for 6; and 0 for 9 {quote} I think solution is to cast after mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} > PartitionStrategy: VertexID hash overflow > - > > Key: SPARK-2981 > URL: https://issues.apache.org/jira/browse/SPARK-2981 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > Labels: newbie > Original Estimate: 1h > Remaining Estimate: 1h > > In EdgePartition1D, a PartitionID is calculated by multiplying VertexId with > a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. > The Long is overflowed, and when cast to Int: > {quote} > scala> (1125899906842597L*1).toInt > res1: Int = -27 > scala> (1125899906842597L*2).toInt > res2: Int = -54 > scala> (1125899906842597L*3).toInt > res3: Int = -81 > {quote} > As the cast produce number that are multiplies of 3, the partition is not > useable when partitioning to multiples of 3. > for example when you partition to 6 or 9 parts: > {quote} > 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), > (1,0), (2,0), (3,3832578), (4,0), (5,0)) > 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), > (1,0), (2,0), (3,3832578), (4,0), (5,0)) > 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), > (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) > 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), > (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) > so the vertices are partitioned to 0,3 for 6; and 0 for 9 > {quote} > I think solution is to cast after mod. > {quote} > scala> (1125899906842597L*3) > res4: Long = 3377699720527791 > scala> (1125899906842597L*3) % 9 > res5: Long = 3 > scala> ((1125899906842597L*3) % 9).toInt > res5: Int = 3 > {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2981) PartitionStrategy: VertexID hash overflow
[ https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Larry Xiao updated SPARK-2981: -- Description: In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. The Long is overflowed, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3. for example when you partition to 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) so the vertices are partitioned to 0,3 for 6; and 0 for 9 {quote} I think solution is to cast after mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} was: In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. The Long is overflowed, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3. for example when you partition to 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) {quote} I think solution is to cast after mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} > PartitionStrategy: VertexID hash overflow > - > > Key: SPARK-2981 > URL: https://issues.apache.org/jira/browse/SPARK-2981 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.0.2 >Reporter: Larry Xiao > Labels: newbie > Original Estimate: 1h > Remaining Estimate: 1h > > In PartitionStrategy.scala a PartitionID is calculated by multiplying > VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod > numParts. > The Long is overflowed, and when cast to Int: > {quote} > scala> (1125899906842597L*1).toInt > res1: Int = -27 > scala> (1125899906842597L*2).toInt > res2: Int = -54 > scala> (1125899906842597L*3).toInt > res3: Int = -81 > {quote} > As the cast produce number that are multiplies of 3, the partition is not > useable when partitioning to multiples of 3. > for example when you partition to 6 or 9 parts: > {quote} > 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), > (1,0), (2,0), (3,3832578), (4,0), (5,0)) > 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), > (1,0), (2,0), (3,3832578), (4,0), (5,0)) > 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), > (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) > 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), > (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) > so the vertices are partitioned to 0,3 for 6; and 0 for 9 > {quote} > I think solution is to cast after mod. > {quote} > scala> (1125899906842597L*3) > res4: Long = 3377699720527791 > scala> (1125899906842597L*3) % 9 > res5: Long = 3 > scala> ((1125899906842597L*3) % 9).toInt > res5: Int = 3 > {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2981) PartitionStrategy: VertexID hash overflow
Larry Xiao created SPARK-2981: - Summary: PartitionStrategy: VertexID hash overflow Key: SPARK-2981 URL: https://issues.apache.org/jira/browse/SPARK-2981 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Reporter: Larry Xiao In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts. The Long is overflowed, and when cast to Int: {quote} scala> (1125899906842597L*1).toInt res1: Int = -27 scala> (1125899906842597L*2).toInt res2: Int = -54 scala> (1125899906842597L*3).toInt res3: Int = -81 {quote} As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3. for example when you partition to 6 or 9 parts: {quote} 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) {quote} I think solution is to cast after mod. {quote} scala> (1125899906842597L*3) res4: Long = 3377699720527791 scala> (1125899906842597L*3) % 9 res5: Long = 3 scala> ((1125899906842597L*3) % 9).toInt res5: Int = 3 {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2062) VertexRDD.apply does not use the mergeFunc
[ https://issues.apache.org/jira/browse/SPARK-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092381#comment-14092381 ] Larry Xiao commented on SPARK-2062: --- Is anyone working on it? I want to take it. My plan is to add a pass to do the merge, is it ok? [~ankurd] > VertexRDD.apply does not use the mergeFunc > -- > > Key: SPARK-2062 > URL: https://issues.apache.org/jira/browse/SPARK-2062 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > Here: > https://github.com/apache/spark/blob/b1feb60209174433262de2a26d39616ba00edcc8/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala#L410 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1987) More memory-efficient graph construction
[ https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088953#comment-14088953 ] Larry Xiao commented on SPARK-1987: --- Hi Ankur What do you mean by custom sort routine? and parallel arrays? I understand the overhead of JVM objects, so do you want to use 3 separate primitivevector of srcID, dstId and Attr? I think I can implement EdgePartitionBuilder using three arrays. Some concern: Will this harm locality? (I also noticed in EdgePartition, srcID, dstId and Attr are stored in three arrays) BTW. I came across Storage Strategies for Collections in Dynamically Typed Languages, I think maybe this can be solved in JVM. > More memory-efficient graph construction > > > Key: SPARK-1987 > URL: https://issues.apache.org/jira/browse/SPARK-1987 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > A graph's edges are usually the largest component of the graph. GraphX > currently stores edges in parallel primitive arrays, so each edge should only > take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the > current implementation in EdgePartitionBuilder uses an array of Edge objects > as an intermediate representation for sorting, so each edge additionally > takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr > (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This > unnecessarily increases GraphX's memory requirements by a factor of 3. > To save memory, EdgePartitionBuilder should instead use a custom sort routine > that operates directly on the three parallel arrays. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
[ https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085676#comment-14085676 ] Larry Xiao commented on SPARK-1153: --- I like npanj's approach. It's universal. You treat UUID as attribute. Like the procedure from http://spark.apache.org/docs/latest/graphx-programming-guide.html // Connect to the Spark cluster == Build Graph (build VertexID if necessary) // Load my user data and parse into tuples of user id and attribute list // Parse the edge data which is already in userId -> userId format // Attach the user attributes == Clean Graph // Some users may not have attributes so we set them as empty, Restrict the graph to users with usernames and names == Compute // Compute the PageRank == Get Result // Get the attributes of the top pagerank users > Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs. > -- > > Key: SPARK-1153 > URL: https://issues.apache.org/jira/browse/SPARK-1153 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 0.9.0 >Reporter: Deepak Nulu > > Currently, {{VertexId}} is a type-synonym for {{Long}}. I would like to be > able to use {{UUID}} as the vertex ID type because the data I want to process > with GraphX uses that type for its primay-keys. Others might have a different > type for their primary-keys. Generalizing {{VertexId}} (with a type class) > will help in such cases. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1986) lib.Analytics should be in org.apache.spark.examples
[ https://issues.apache.org/jira/browse/SPARK-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084423#comment-14084423 ] Larry Xiao commented on SPARK-1986: --- Hi Ankur Yes I like Analytics to be in examples And I'd like to take on this issue > lib.Analytics should be in org.apache.spark.examples > > > Key: SPARK-1986 > URL: https://issues.apache.org/jira/browse/SPARK-1986 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > The org.apache.spark.graphx.lib.Analytics driver is currently hard to run; > the user has to figure out the correct invocation involving spark-submit. > Instead, it should be put into the examples package to enable running it > using bin/run-example. > Here is how Analytics must be invoked currently: > {code} > ~/spark/bin/spark-submit --master spark://$(wget -q -O - > http://169.254.169.254/latest/meta-data/public-hostname):7077 --class > org.apache.spark.graphx.lib.Analytics > ~/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar > triangles /soc-LiveJournal1.txt --numEPart=256 > {code} > Any JAR can be supplied in place of the assembly jar, as long as it exists. > Here is how it will be invoked after this issue is fixed: > {code} > MASTER=spark://$(wget -q -O - > http://169.254.169.254/latest/meta-data/public-hostname):7077 > ~/spark/bin/run-example GraphXAnalytics triangles /soc-LiveJournal1.txt > --numEPart=256 > {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080594#comment-14080594 ] Larry Xiao edited comment on SPARK-2728 at 7/31/14 7:24 AM: I checked-out newest commit <5a110da25f15694773d6f7c6ee63c5b08ada4eb0> and find it's very different now. It uses Int still, but don't have multiplication like before. Can you test again? [~huangjs] was (Author: larryxiao): I checked-out newest commit <5a110da25f15694773d6f7c6ee63c5b08ada4eb0> and find it's very different now. It uses Int still, but don't have multiplication like before. Can you test on it? > Integer overflow in partition index calculation RangePartitioner > > > Key: SPARK-2728 > URL: https://issues.apache.org/jira/browse/SPARK-2728 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: Spark 1.0.1 >Reporter: Jianshi Huang > Labels: easyfix > > If the partition number are greater than 10362, then spark will report > ArrayOutofIndex error. > The reason is in the partition index calculation in rangeBounds: > #Line: 112 > val bounds = new Array[K](partitions - 1) > for (i <- 0 until partitions - 1) { > val index = (rddSample.length - 1) * (i + 1) / partitions > bounds(i) = rddSample(index) > } > Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. > Cast rddSample.length - 1 to Long should be enough for a fix? > Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080594#comment-14080594 ] Larry Xiao edited comment on SPARK-2728 at 7/31/14 7:23 AM: I checked-out newest commit <5a110da25f15694773d6f7c6ee63c5b08ada4eb0> and find it's very different now. It uses Int still, but don't have multiplication like before. Can you test on it? was (Author: larryxiao): I checked-out newest commit <5a110da25f15694773d6f7c6ee63c5b08ada4eb0> and find it's very different now. > Integer overflow in partition index calculation RangePartitioner > > > Key: SPARK-2728 > URL: https://issues.apache.org/jira/browse/SPARK-2728 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: Spark 1.0.1 >Reporter: Jianshi Huang > Labels: easyfix > > If the partition number are greater than 10362, then spark will report > ArrayOutofIndex error. > The reason is in the partition index calculation in rangeBounds: > #Line: 112 > val bounds = new Array[K](partitions - 1) > for (i <- 0 until partitions - 1) { > val index = (rddSample.length - 1) * (i + 1) / partitions > bounds(i) = rddSample(index) > } > Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. > Cast rddSample.length - 1 to Long should be enough for a fix? > Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080594#comment-14080594 ] Larry Xiao commented on SPARK-2728: --- I checked-out newest commit <5a110da25f15694773d6f7c6ee63c5b08ada4eb0> and find it's very different now. > Integer overflow in partition index calculation RangePartitioner > > > Key: SPARK-2728 > URL: https://issues.apache.org/jira/browse/SPARK-2728 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: Spark 1.0.1 >Reporter: Jianshi Huang > Labels: easyfix > > If the partition number are greater than 10362, then spark will report > ArrayOutofIndex error. > The reason is in the partition index calculation in rangeBounds: > #Line: 112 > val bounds = new Array[K](partitions - 1) > for (i <- 0 until partitions - 1) { > val index = (rddSample.length - 1) * (i + 1) / partitions > bounds(i) = rddSample(index) > } > Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. > Cast rddSample.length - 1 to Long should be enough for a fix? > Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner
[ https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080452#comment-14080452 ] Larry Xiao commented on SPARK-2728: --- I want to try :) > Integer overflow in partition index calculation RangePartitioner > > > Key: SPARK-2728 > URL: https://issues.apache.org/jira/browse/SPARK-2728 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: Spark 1.0.1 >Reporter: Jianshi Huang > Labels: easyfix > > If the partition number are greater than 10362, then spark will report > ArrayOutofIndex error. > The reason is in the partition index calculation in rangeBounds: > #Line: 112 > val bounds = new Array[K](partitions - 1) > for (i <- 0 until partitions - 1) { > val index = (rddSample.length - 1) * (i + 1) / partitions > bounds(i) = rddSample(index) > } > Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int. > Cast rddSample.length - 1 to Long should be enough for a fix? > Jianshi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2660) Enable pretty-printing SchemaRDD Rows
[ https://issues.apache.org/jira/browse/SPARK-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072687#comment-14072687 ] Larry Xiao commented on SPARK-2660: --- I think this one is suitable for newbie like me, though I think it's alright the way it is now :) I'm not familiar with the code, and I found printRDDElement in pipe for RDD. Is it the correct place? > Enable pretty-printing SchemaRDD Rows > - > > Key: SPARK-2660 > URL: https://issues.apache.org/jira/browse/SPARK-2660 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Aaron Davidson > > Right now, "printing a Row" results in something like "[a,b,c]". It would be > cool if there was a way to pretty-print them similar to JSON, into something > like > {code} > { > Col1 : a, > Col2 : b, > Col3 : c > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)