[jira] [Comment Edited] (SPARK-3523) GraphX graph partitioning strategy

2014-12-02 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231734#comment-14231734
 ] 

Larry Xiao edited comment on SPARK-3523 at 12/2/14 4:50 PM:


Hi [~lianhuiwang], that's great!
Currently, to allow more complex partitioning, I changed the interface of 
partitioner, because default partitioners are hash based, and didn't need much 
information about graph.
An option is to partition the graph externally, as [~ankurd] suggested. 
What do you think?


was (Author: larryxiao):
Hi [~lianhuiwang], that's great!
As you can see, to allow more complex partitioning, the interface need to be 
change, because default partitioners are hash based.
Another option is to partition the graph externally, as [~ankurd] suggested. 
What do you think?

> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the 
> workload on few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus 
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
> * HybridCutPlus 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
>   * The left Y axis is replication factor, right axis is the balance 
> (measured using CV, coefficient of variation) of either vertices or edges of 
> all partitions. The balance of edges can infer computation balance, and the 
> balance of vertices can infer communication balance.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
>  
>   * This is an example of a balanced partitioning achieving 20% saving on 
> communication.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
>   * This is a simple partitioning result of BiCut.
> * in-2.0-1m is a generated power law graph with alpha equals 2.0
> h5. Code
> * 
> https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
> * Because the implementation breaks the current separation with 
> PartitionStrategy.scala, so need to think of a way to support access to graph.
> h5. Reference
> - Bipartite-oriented Distributed Graph Partitioning for Big Learning.
> - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed 
> Graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3523) GraphX graph partitioning strategy

2014-12-02 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231734#comment-14231734
 ] 

Larry Xiao commented on SPARK-3523:
---

Hi [~lianhuiwang], that's great!
As you can see, to allow more complex partitioning, the interface need to be 
change, because default partitioners are hash based.
Another option is to partition the graph externally, as [~ankurd] suggested. 
What do you think?

> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the 
> workload on few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus 
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
> * HybridCutPlus 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
>   * The left Y axis is replication factor, right axis is the balance 
> (measured using CV, coefficient of variation) of either vertices or edges of 
> all partitions. The balance of edges can infer computation balance, and the 
> balance of vertices can infer communication balance.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
>  
>   * This is an example of a balanced partitioning achieving 20% saving on 
> communication.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
>   * This is a simple partitioning result of BiCut.
> * in-2.0-1m is a generated power law graph with alpha equals 2.0
> h5. Code
> * 
> https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
> * Because the implementation breaks the current separation with 
> PartitionStrategy.scala, so need to think of a way to support access to graph.
> h5. Reference
> - Bipartite-oriented Distributed Graph Partitioning for Big Learning.
> - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed 
> Graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1987) More memory-efficient graph construction

2014-11-28 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228143#comment-14228143
 ] 

Larry Xiao commented on SPARK-1987:
---

[~maropu] I think it needs slight change in build system.
I see your patch, cool idea, didn't know about timsort before, and your code 
looks very clear. :)

> More memory-efficient graph construction
> 
>
> Key: SPARK-1987
> URL: https://issues.apache.org/jira/browse/SPARK-1987
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> A graph's edges are usually the largest component of the graph. GraphX 
> currently stores edges in parallel primitive arrays, so each edge should only 
> take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the 
> current implementation in EdgePartitionBuilder uses an array of Edge objects 
> as an intermediate representation for sorting, so each edge additionally 
> takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr 
> (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This 
> unnecessarily increases GraphX's memory requirements by a factor of 3.
> To save memory, EdgePartitionBuilder should instead use a custom sort routine 
> that operates directly on the three parallel arrays.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-28 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Shepherd:   (was: ankurdave)

> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the 
> workload on few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus 
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
> * HybridCutPlus 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
>   * The left Y axis is replication factor, right axis is the balance 
> (measured using CV, coefficient of variation) of either vertices or edges of 
> all partitions. The balance of edges can infer computation balance, and the 
> balance of vertices can infer communication balance.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
>  
>   * This is an example of a balanced partitioning achieving 20% saving on 
> communication.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
>   * This is a simple partitioning result of BiCut.
> * in-2.0-1m is a generated power law graph with alpha equals 2.0
> h5. Code
> * 
> https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
> * Because the implementation breaks the current separation with 
> PartitionStrategy.scala, so need to think of a way to support access to graph.
> h5. Reference
> - Bipartite-oriented Distributed Graph Partitioning for Big Learning.
> - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed 
> Graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-15 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
  * The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 
  * This is an example of a balanced partitioning achieving 20% saving on 
communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation with 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
  * The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 
  * This is an example of a balanced partitioning's saving on communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation with 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Mot

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
  * The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 
  * This is an example of a balanced partitioning's saving on communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation with 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
  * The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 
  * This is an example of a balanced partitioning's saving on communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * 

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
  * The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 
  * This is an example of a balanced partitioning's saving on communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png!
  * The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 
  * This is an example of a balanced partitioning's saving on communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in r

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png!
  * The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 
  * This is an example of a balanced partitioning's saving on communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=600!
  * The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 
  * This is an example of a balanced partitioning's saving on communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in re

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=600!
  * The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 
  * This is an example of a balanced partitioning's saving on communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
  * This is a simple partitioning result of BiCut.
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
 The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 This is an example of a balanced partitioning's saving on communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of t

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
 The left Y axis is replication factor, right axis is the balance (measured 
using CV, coefficient of variation) of either vertices or edges of all 
partitions. The balance of edges can infer computation balance, and the balance 
of vertices can infer communication balance.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
 This is an example of a balanced partitioning's saving on communication.
* 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
* in-2.0-1m is a generated power law graph with alpha equals 2.0

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the 
> workload on few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus 
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Hyb

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the 
> workload on few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus 
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
> * HybridCutPlus 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png!
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|scale=50%!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|scale=50%!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|scale=50%!

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the 
> workload on few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus 
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut
>   * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
> * HybridCutPlus
>   * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=360!
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png!
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|scale=50%!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|scale=50%!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|scale=50%!

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png!
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png!

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the 
> workload on few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus 
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut
>   * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png!
> * HybridCutPlus
>   * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|scale=50%!
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|scale=50%!
> !https://raw.githubusercontent.com/larryxiao/spark/Graph

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

H5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

H5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png
* Greedy BiCut
  * a heuristic algorithm for bipartite

H5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png

H5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

H5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the 
> workload on few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus 
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut
>   * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png
> * HybridCutPlus
>   * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png
> h5. Code
> * 
> https://github.

[jira] [Updated] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-3523:
--
Description: 
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png!
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png!
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png!
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png!

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs

  was:
We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

h5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

h5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png
* Greedy BiCut
  * a heuristic algorithm for bipartite

h5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png

h5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

h5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs


> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented some algorithms for partitioning on GraphX, and evaluated. And 
> find the partitioning has space of improving. Seek opinion and advice.
> h5. Motivation
> * Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
> adjacent to nearly half of the edges.
> * For high-degree vertex, one vertex concentrates vast resources. So the 
> workload on few high-degree vertex should be decomposed by all machines
> *  For low-degree vertex, The computation on one vertex is  quite small. Thus 
> should exploit the locality of the computation on low-degree vertex.
> h5. Algorithm Description
> * HybridCut
>   * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png!
> * HybridCutPlus
>   * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png!
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png!
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png!
> h5. Code
> * 
> https

[jira] [Created] (SPARK-3523) GraphX graph partitioning strategy

2014-09-14 Thread Larry Xiao (JIRA)
Larry Xiao created SPARK-3523:
-

 Summary: GraphX graph partitioning strategy
 Key: SPARK-3523
 URL: https://issues.apache.org/jira/browse/SPARK-3523
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.2
Reporter: Larry Xiao


We implemented some algorithms for partitioning on GraphX, and evaluated. And 
find the partitioning has space of improving. Seek opinion and advice.

H5. Motivation
* Graph in real world follow power law. Eg. On twitter 1% of the vertices are 
adjacent to nearly half of the edges.
* For high-degree vertex, one vertex concentrates vast resources. So the 
workload on few high-degree vertex should be decomposed by all machines
*  For low-degree vertex, The computation on one vertex is  quite small. Thus 
should exploit the locality of the computation on low-degree vertex.

H5. Algorithm Description
* HybridCut
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png
* HybridCutPlus
  * 
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png
* Greedy BiCut
  * a heuristic algorithm for bipartite

H5. Result
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png
!https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png

H5. Code
* 
https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
* Because the implementation breaks the current separation using 
PartitionStrategy.scala, so need to think of a way to support access to graph.

H5. Reference
- Bipartite-oriented Distributed Graph Partitioning for Big Learning.
- PowerLyra : Differentiated Graph Computation and Partitioning on Skewed Graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1987) More memory-efficient graph construction

2014-08-15 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099476#comment-14099476
 ] 

Larry Xiao commented on SPARK-1987:
---

ok. I understand.
I'll try to implement it

> More memory-efficient graph construction
> 
>
> Key: SPARK-1987
> URL: https://issues.apache.org/jira/browse/SPARK-1987
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> A graph's edges are usually the largest component of the graph. GraphX 
> currently stores edges in parallel primitive arrays, so each edge should only 
> take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the 
> current implementation in EdgePartitionBuilder uses an array of Edge objects 
> as an intermediate representation for sorting, so each edge additionally 
> takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr 
> (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This 
> unnecessarily increases GraphX's memory requirements by a factor of 3.
> To save memory, EdgePartitionBuilder should instead use a custom sort routine 
> that operates directly on the three parallel arrays.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2981) PartitionStrategy: VertexID hash overflow

2014-08-11 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-2981:
--

Description: 
In EdgePartition1D, a PartitionID is calculated by multiplying VertexId with a 
mixingPrime (1125899906842597L) then cast to Int, and mod numParts.

The Long is overflowed, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
As the cast produce number that are multiplies of 3, the partition is not 
useable when partitioning to multiples of 3.

for example when you partition to 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 

so the vertices are partitioned to 0,3 for 6; and 0 for 9
{quote}

I think solution is to cast after mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res5: Int = 3
{quote}

  was:
In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId 
with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts.

The Long is overflowed, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
As the cast produce number that are multiplies of 3, the partition is not 
useable when partitioning to multiples of 3.

for example when you partition to 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 

so the vertices are partitioned to 0,3 for 6; and 0 for 9
{quote}

I think solution is to cast after mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res5: Int = 3
{quote}


> PartitionStrategy: VertexID hash overflow
> -
>
> Key: SPARK-2981
> URL: https://issues.apache.org/jira/browse/SPARK-2981
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>  Labels: newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In EdgePartition1D, a PartitionID is calculated by multiplying VertexId with 
> a mixingPrime (1125899906842597L) then cast to Int, and mod numParts.
> The Long is overflowed, and when cast to Int:
> {quote}
> scala> (1125899906842597L*1).toInt
> res1: Int = -27
> scala> (1125899906842597L*2).toInt
> res2: Int = -54
> scala> (1125899906842597L*3).toInt
> res3: Int = -81
> {quote}
> As the cast produce number that are multiplies of 3, the partition is not 
> useable when partitioning to multiples of 3.
> for example when you partition to 6 or 9 parts:
> {quote}
> 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), 
> (1,0), (2,0), (3,3832578), (4,0), (5,0))
> 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), 
> (1,0), (2,0), (3,3832578), (4,0), (5,0)) 
> 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), 
> (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
> 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), 
> (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
> so the vertices are partitioned to 0,3 for 6; and 0 for 9
> {quote}
> I think solution is to cast after mod.
> {quote}
> scala> (1125899906842597L*3)
> res4: Long = 3377699720527791
> scala> (1125899906842597L*3) % 9
> res5: Long = 3
> scala> ((1125899906842597L*3) % 9).toInt
> res5: Int = 3
> {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2981) PartitionStrategy: VertexID hash overflow

2014-08-11 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-2981:
--

Description: 
In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId 
with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts.

The Long is overflowed, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
As the cast produce number that are multiplies of 3, the partition is not 
useable when partitioning to multiples of 3.

for example when you partition to 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 

so the vertices are partitioned to 0,3 for 6; and 0 for 9
{quote}

I think solution is to cast after mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res5: Int = 3
{quote}

  was:
In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId 
with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts.

The Long is overflowed, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
As the cast produce number that are multiplies of 3, the partition is not 
useable when partitioning to multiples of 3.

for example when you partition to 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
{quote}
I think solution is to cast after mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res5: Int = 3
{quote}


> PartitionStrategy: VertexID hash overflow
> -
>
> Key: SPARK-2981
> URL: https://issues.apache.org/jira/browse/SPARK-2981
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>  Labels: newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In PartitionStrategy.scala a PartitionID is calculated by multiplying 
> VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod 
> numParts.
> The Long is overflowed, and when cast to Int:
> {quote}
> scala> (1125899906842597L*1).toInt
> res1: Int = -27
> scala> (1125899906842597L*2).toInt
> res2: Int = -54
> scala> (1125899906842597L*3).toInt
> res3: Int = -81
> {quote}
> As the cast produce number that are multiplies of 3, the partition is not 
> useable when partitioning to multiples of 3.
> for example when you partition to 6 or 9 parts:
> {quote}
> 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), 
> (1,0), (2,0), (3,3832578), (4,0), (5,0))
> 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), 
> (1,0), (2,0), (3,3832578), (4,0), (5,0)) 
> 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), 
> (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
> 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), 
> (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
> so the vertices are partitioned to 0,3 for 6; and 0 for 9
> {quote}
> I think solution is to cast after mod.
> {quote}
> scala> (1125899906842597L*3)
> res4: Long = 3377699720527791
> scala> (1125899906842597L*3) % 9
> res5: Long = 3
> scala> ((1125899906842597L*3) % 9).toInt
> res5: Int = 3
> {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2981) PartitionStrategy: VertexID hash overflow

2014-08-11 Thread Larry Xiao (JIRA)
Larry Xiao created SPARK-2981:
-

 Summary: PartitionStrategy: VertexID hash overflow
 Key: SPARK-2981
 URL: https://issues.apache.org/jira/browse/SPARK-2981
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.2
Reporter: Larry Xiao


In PartitionStrategy.scala a PartitionID is calculated by multiplying VertexId 
with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts.

The Long is overflowed, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
As the cast produce number that are multiplies of 3, the partition is not 
useable when partitioning to multiples of 3.

for example when you partition to 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
{quote}
I think solution is to cast after mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res5: Int = 3
{quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2062) VertexRDD.apply does not use the mergeFunc

2014-08-10 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092381#comment-14092381
 ] 

Larry Xiao commented on SPARK-2062:
---

Is anyone working on it? I want to take it.
My plan is to add a pass to do the merge, is it ok? [~ankurd]

> VertexRDD.apply does not use the mergeFunc
> --
>
> Key: SPARK-2062
> URL: https://issues.apache.org/jira/browse/SPARK-2062
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Here: 
> https://github.com/apache/spark/blob/b1feb60209174433262de2a26d39616ba00edcc8/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala#L410



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1987) More memory-efficient graph construction

2014-08-07 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088953#comment-14088953
 ] 

Larry Xiao commented on SPARK-1987:
---

Hi Ankur

What do you mean by custom sort routine? and parallel arrays?

I understand the overhead of JVM objects, so do you want to use 3 separate 
primitivevector of srcID, dstId and Attr?

I think I can implement EdgePartitionBuilder using three arrays. 

Some concern:
Will this harm locality?
(I also noticed in EdgePartition, srcID, dstId and Attr are stored in three 
arrays)

BTW. I came across Storage Strategies for Collections in Dynamically Typed 
Languages, I think maybe this can be solved in JVM.

> More memory-efficient graph construction
> 
>
> Key: SPARK-1987
> URL: https://issues.apache.org/jira/browse/SPARK-1987
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> A graph's edges are usually the largest component of the graph. GraphX 
> currently stores edges in parallel primitive arrays, so each edge should only 
> take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the 
> current implementation in EdgePartitionBuilder uses an array of Edge objects 
> as an intermediate representation for sorting, so each edge additionally 
> takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr 
> (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This 
> unnecessarily increases GraphX's memory requirements by a factor of 3.
> To save memory, EdgePartitionBuilder should instead use a custom sort routine 
> that operates directly on the three parallel arrays.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.

2014-08-04 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085676#comment-14085676
 ] 

Larry Xiao commented on SPARK-1153:
---

I like npanj's approach.
It's universal. You treat UUID as attribute.

Like the procedure from 
http://spark.apache.org/docs/latest/graphx-programming-guide.html

// Connect to the Spark cluster
== Build Graph (build VertexID if necessary)
// Load my user data and parse into tuples of user id and attribute list
// Parse the edge data which is already in userId -> userId format
// Attach the user attributes
== Clean Graph
// Some users may not have attributes so we set them as empty, Restrict the 
graph to users with usernames and names
== Compute
// Compute the PageRank
== Get Result
// Get the attributes of the top pagerank users

> Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
> --
>
> Key: SPARK-1153
> URL: https://issues.apache.org/jira/browse/SPARK-1153
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 0.9.0
>Reporter: Deepak Nulu
>
> Currently, {{VertexId}} is a type-synonym for {{Long}}. I would like to be 
> able to use {{UUID}} as the vertex ID type because the data I want to process 
> with GraphX uses that type for its primay-keys. Others might have a different 
> type for their primary-keys. Generalizing {{VertexId}} (with a type class) 
> will help in such cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1986) lib.Analytics should be in org.apache.spark.examples

2014-08-04 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084423#comment-14084423
 ] 

Larry Xiao commented on SPARK-1986:
---

Hi Ankur
Yes I like Analytics to be in examples
And I'd like to take on this issue

> lib.Analytics should be in org.apache.spark.examples
> 
>
> Key: SPARK-1986
> URL: https://issues.apache.org/jira/browse/SPARK-1986
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> The org.apache.spark.graphx.lib.Analytics driver is currently hard to run; 
> the user has to figure out the correct invocation involving spark-submit. 
> Instead, it should be put into the examples package to enable running it 
> using bin/run-example.
> Here is how Analytics must be invoked currently:
> {code}
> ~/spark/bin/spark-submit --master spark://$(wget -q -O - 
> http://169.254.169.254/latest/meta-data/public-hostname):7077 --class 
> org.apache.spark.graphx.lib.Analytics 
> ~/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar
>  triangles /soc-LiveJournal1.txt --numEPart=256
> {code}
> Any JAR can be supplied in place of the assembly jar, as long as it exists.
> Here is how it will be invoked after this issue is fixed:
> {code}
> MASTER=spark://$(wget -q -O - 
> http://169.254.169.254/latest/meta-data/public-hostname):7077 
> ~/spark/bin/run-example GraphXAnalytics triangles /soc-LiveJournal1.txt 
> --numEPart=256
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080594#comment-14080594
 ] 

Larry Xiao edited comment on SPARK-2728 at 7/31/14 7:24 AM:


I checked-out newest commit <5a110da25f15694773d6f7c6ee63c5b08ada4eb0> and find 
it's very different now. It uses Int still, but don't have multiplication like 
before.

Can you test again? [~huangjs]


was (Author: larryxiao):
I checked-out newest commit <5a110da25f15694773d6f7c6ee63c5b08ada4eb0> and find 
it's very different now. It uses Int still, but don't have multiplication like 
before.

Can you test on it?

> Integer overflow in partition index calculation RangePartitioner
> 
>
> Key: SPARK-2728
> URL: https://issues.apache.org/jira/browse/SPARK-2728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Spark 1.0.1
>Reporter: Jianshi Huang
>  Labels: easyfix
>
> If the partition number are greater than 10362, then spark will report 
> ArrayOutofIndex error. 
> The reason is in the partition index calculation in rangeBounds:
> #Line: 112
> val bounds = new Array[K](partitions - 1)
> for (i <- 0 until partitions - 1) {
>   val index = (rddSample.length - 1) * (i + 1) / partitions
>   bounds(i) = rddSample(index)
> }
> Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
> Cast rddSample.length - 1 to Long should be enough for a fix?
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080594#comment-14080594
 ] 

Larry Xiao edited comment on SPARK-2728 at 7/31/14 7:23 AM:


I checked-out newest commit <5a110da25f15694773d6f7c6ee63c5b08ada4eb0> and find 
it's very different now. It uses Int still, but don't have multiplication like 
before.

Can you test on it?


was (Author: larryxiao):
I checked-out newest commit <5a110da25f15694773d6f7c6ee63c5b08ada4eb0> and find 
it's very different now.

> Integer overflow in partition index calculation RangePartitioner
> 
>
> Key: SPARK-2728
> URL: https://issues.apache.org/jira/browse/SPARK-2728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Spark 1.0.1
>Reporter: Jianshi Huang
>  Labels: easyfix
>
> If the partition number are greater than 10362, then spark will report 
> ArrayOutofIndex error. 
> The reason is in the partition index calculation in rangeBounds:
> #Line: 112
> val bounds = new Array[K](partitions - 1)
> for (i <- 0 until partitions - 1) {
>   val index = (rddSample.length - 1) * (i + 1) / partitions
>   bounds(i) = rddSample(index)
> }
> Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
> Cast rddSample.length - 1 to Long should be enough for a fix?
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-31 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080594#comment-14080594
 ] 

Larry Xiao commented on SPARK-2728:
---

I checked-out newest commit <5a110da25f15694773d6f7c6ee63c5b08ada4eb0> and find 
it's very different now.

> Integer overflow in partition index calculation RangePartitioner
> 
>
> Key: SPARK-2728
> URL: https://issues.apache.org/jira/browse/SPARK-2728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Spark 1.0.1
>Reporter: Jianshi Huang
>  Labels: easyfix
>
> If the partition number are greater than 10362, then spark will report 
> ArrayOutofIndex error. 
> The reason is in the partition index calculation in rangeBounds:
> #Line: 112
> val bounds = new Array[K](partitions - 1)
> for (i <- 0 until partitions - 1) {
>   val index = (rddSample.length - 1) * (i + 1) / partitions
>   bounds(i) = rddSample(index)
> }
> Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
> Cast rddSample.length - 1 to Long should be enough for a fix?
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2728) Integer overflow in partition index calculation RangePartitioner

2014-07-30 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080452#comment-14080452
 ] 

Larry Xiao commented on SPARK-2728:
---

I want to try :)

> Integer overflow in partition index calculation RangePartitioner
> 
>
> Key: SPARK-2728
> URL: https://issues.apache.org/jira/browse/SPARK-2728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Spark 1.0.1
>Reporter: Jianshi Huang
>  Labels: easyfix
>
> If the partition number are greater than 10362, then spark will report 
> ArrayOutofIndex error. 
> The reason is in the partition index calculation in rangeBounds:
> #Line: 112
> val bounds = new Array[K](partitions - 1)
> for (i <- 0 until partitions - 1) {
>   val index = (rddSample.length - 1) * (i + 1) / partitions
>   bounds(i) = rddSample(index)
> }
> Here (rddSample.length - 1) * (i + 1) will overflow to a negative Int.
> Cast rddSample.length - 1 to Long should be enough for a fix?
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2660) Enable pretty-printing SchemaRDD Rows

2014-07-23 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072687#comment-14072687
 ] 

Larry Xiao commented on SPARK-2660:
---

I think this one is suitable for newbie like me, though I think it's alright 
the way it is now :)

I'm not familiar with the code, and I found printRDDElement in pipe for RDD.
Is it the correct place?

> Enable pretty-printing SchemaRDD Rows
> -
>
> Key: SPARK-2660
> URL: https://issues.apache.org/jira/browse/SPARK-2660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Aaron Davidson
>
> Right now, "printing a Row" results in something like "[a,b,c]". It would be 
> cool if there was a way to pretty-print them similar to JSON, into something 
> like
> {code}
> {
>   Col1 : a,
>   Col2 : b,
>   Col3 : c
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)