[ https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231734#comment-14231734 ]
Larry Xiao commented on SPARK-3523: ----------------------------------- Hi [~lianhuiwang], that's great! As you can see, to allow more complex partitioning, the interface need to be change, because default partitioners are hash based. Another option is to partition the graph externally, as [~ankurd] suggested. What do you think? > GraphX graph partitioning strategy > ---------------------------------- > > Key: SPARK-3523 > URL: https://issues.apache.org/jira/browse/SPARK-3523 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 1.0.2 > Reporter: Larry Xiao > > We implemented some algorithms for partitioning on GraphX, and evaluated. And > find the partitioning has space of improving. Seek opinion and advice. > h5. Motivation > * Graph in real world follow power law. Eg. On twitter 1% of the vertices are > adjacent to nearly half of the edges. > * For high-degree vertex, one vertex concentrates vast resources. So the > workload on few high-degree vertex should be decomposed by all machines > * For low-degree vertex, The computation on one vertex is quite small. Thus > should exploit the locality of the computation on low-degree vertex. > h5. Algorithm Description > * HybridCut > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360! > * HybridCutPlus > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360! > * Greedy BiCut > * a heuristic algorithm for bipartite > h5. Result > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%! > * The left Y axis is replication factor, right axis is the balance > (measured using CV, coefficient of variation) of either vertices or edges of > all partitions. The balance of edges can infer computation balance, and the > balance of vertices can infer communication balance. > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360! > > * This is an example of a balanced partitioning achieving 20% saving on > communication. > * > !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360! > * This is a simple partitioning result of BiCut. > * in-2.0-1m is a generated power law graph with alpha equals 2.0 > h5. Code > * > https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173 > * Because the implementation breaks the current separation with > PartitionStrategy.scala, so need to think of a way to support access to graph. > h5. Reference > - Bipartite-oriented Distributed Graph Partitioning for Big Learning. > - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed > Graphs -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org