[ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231734#comment-14231734
 ] 

Larry Xiao commented on SPARK-3523:
-----------------------------------

Hi [~lianhuiwang], that's great!
As you can see, to allow more complex partitioning, the interface need to be 
change, because default partitioners are hash based.
Another option is to partition the graph externally, as [~ankurd] suggested. 
What do you think?

> GraphX graph partitioning strategy
> ----------------------------------
>
>                 Key: SPARK-3523
>                 URL: https://issues.apache.org/jira/browse/SPARK-3523
>             Project: Spark
>          Issue Type: Improvement
>          Components: GraphX
>    Affects Versions: 1.0.2
>            Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the 
> workload on few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus 
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
> * HybridCutPlus 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
>   * The left Y axis is replication factor, right axis is the balance 
> (measured using CV, coefficient of variation) of either vertices or edges of 
> all partitions. The balance of edges can infer computation balance, and the 
> balance of vertices can infer communication balance.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
>  
>   * This is an example of a balanced partitioning achieving 20% saving on 
> communication.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
>   * This is a simple partitioning result of BiCut.
> * in-2.0-1m is a generated power law graph with alpha equals 2.0
> h5. Code
> * 
> https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
> * Because the implementation breaks the current separation with 
> PartitionStrategy.scala, so need to think of a way to support access to graph.
> h5. Reference
> - Bipartite-oriented Distributed Graph Partitioning for Big Learning.
> - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed 
> Graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to