Re: GraphX Snapshot Partitioning

2015-03-14 Thread Takeshi Yamamuro
Large edge partitions could cause java.lang.OutOfMemoryError, and then
Spark tasks fail.

FWIW, each edge partition can hold at most 2^32 edges, because the 64-bit
vertex IDs are mapped to 32-bit local IDs within each partition. If #edges
exceeds that limit, GraphX could throw an ArrayIndexOutOfBoundsException
or something similar. So, depending on how the edges land, a single
partition can end up holding more edges than you expect.
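
If it helps, here is a minimal sketch (Scala) for counting the edges held
by each partition, so oversized ones can be spotted before they hit the
OOM or the 2^32 limit. The SparkContext `sc` and the input path are just
assumptions for illustration:

import org.apache.spark.graphx.GraphLoader

// Load a graph from an edge-list file (path is illustrative).
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

// Count edges per partition, using Long so big partitions don't overflow Int.
val edgesPerPartition = graph.edges
  .mapPartitionsWithIndex { (pid, iter) =>
    var n = 0L
    iter.foreach(_ => n += 1)
    Iterator((pid, n))
  }
  .collect()

edgesPerPartition.foreach { case (pid, n) =>
  println(s"partition $pid: $n edges")
}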

On Wed, Mar 11, 2015 at 11:42 PM, Matthew Bucci mrbucci...@gmail.com
wrote:

 Hi,

 Thanks for the response! That answered some questions I had, but the last
 thing I was wondering is what happens if you run a partition strategy and
 one of the partitions ends up too large. For example, let's say a partition
 can hold 64MB (actually, knowing the maximum possible size of a partition
 would also be helpful to me). You try to partition the edges of a graph
 into 3 separate partitions, but the edges assigned to the first partition
 amount to 80MB, so they cannot all fit. Would the extra 16MB spill over
 into a new 4th partition, would the system try to split the edges so that
 the 1st and 4th partitions each hold 40MB, or would the partition strategy
 just fail with a memory error?

 Thank You,
 Matthew Bucci

 On Mon, Mar 9, 2015 at 11:07 PM, Takeshi Yamamuro linguin@gmail.com
 wrote:

 Hi,

 Vertices are simply hash-partitioned by their 64-bit IDs, so
 they are spread evenly over partitions.

 As for edges, GraphLoader#edgeListFile builds edge partitions
 through hadoopFile(), so the initial partitions depend on the
 InputFormat#getSplits implementation
 (e.g., partitions mostly correspond to 64MB blocks on HDFS).

 Edges can be re-partitioned with a PartitionStrategy;
 the graph structure is taken into account, and the source and
 destination IDs are used as the partition keys.
 The partitions might still suffer from skew, depending on
 graph properties (hub nodes, for example).

 Thanks,
 takeshi


 On Tue, Mar 10, 2015 at 2:21 AM, Matthew Bucci mrbucci...@gmail.com
 wrote:

 Hello,

 I am working on a project where we want to split graphs of data into
 snapshots across partitions, and I was wondering what would happen if one
 of the snapshots was too large to fit into a single partition. Would the
 snapshot be split equally over two partitions, for example? How is a
 single snapshot spread over multiple partitions?

 Thank You,
 Matthew Bucci




 --
 ---
 Takeshi Yamamuro

-- 
---
Takeshi Yamamuro


Re: GraphX Snapshot Partitioning

2015-03-11 Thread Matthew Bucci
Hi,

Thanks for the response! That answered some questions I had, but the last
thing I was wondering is what happens if you run a partition strategy and
one of the partitions ends up too large. For example, let's say a partition
can hold 64MB (actually, knowing the maximum possible size of a partition
would also be helpful to me). You try to partition the edges of a graph
into 3 separate partitions, but the edges assigned to the first partition
amount to 80MB, so they cannot all fit. Would the extra 16MB spill over
into a new 4th partition, would the system try to split the edges so that
the 1st and 4th partitions each hold 40MB, or would the partition strategy
just fail with a memory error?

Thank You,
Matthew Bucci

On Mon, Mar 9, 2015 at 11:07 PM, Takeshi Yamamuro linguin@gmail.com
wrote:

 Hi,

 Vertices are simply hash-partitioned by their 64-bit IDs, so
 they are spread evenly over partitions.

 As for edges, GraphLoader#edgeListFile builds edge partitions
 through hadoopFile(), so the initial partitions depend on the
 InputFormat#getSplits implementation
 (e.g., partitions mostly correspond to 64MB blocks on HDFS).

 Edges can be re-partitioned with a PartitionStrategy;
 the graph structure is taken into account, and the source and
 destination IDs are used as the partition keys.
 The partitions might still suffer from skew, depending on
 graph properties (hub nodes, for example).

 Thanks,
 takeshi


 On Tue, Mar 10, 2015 at 2:21 AM, Matthew Bucci mrbucci...@gmail.com
 wrote:

 Hello,

 I am working on a project where we want to split graphs of data into
 snapshots across partitions, and I was wondering what would happen if one
 of the snapshots was too large to fit into a single partition. Would the
 snapshot be split equally over two partitions, for example? How is a
 single snapshot spread over multiple partitions?

 Thank You,
 Matthew Bucci





 --
 ---
 Takeshi Yamamuro



GraphX Snapshot Partitioning

2015-03-09 Thread Matthew Bucci
Hello,

I am working on a project where we want to split graphs of data into
snapshots across partitions, and I was wondering what would happen if one
of the snapshots was too large to fit into a single partition. Would the
snapshot be split equally over two partitions, for example? How is a
single snapshot spread over multiple partitions?

Thank You,
Matthew Bucci



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Snapshot-Partitioning-tp21977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: GraphX Snapshot Partitioning

2015-03-09 Thread Takeshi Yamamuro
Hi,

Vertices are simply hash-partitioned by their 64-bit IDs, so
they are spread evenly over partitions.

As for edges, GraphLoader#edgeListFile builds edge partitions
through hadoopFile(), so the initial partitions depend on the
InputFormat#getSplits implementation
(e.g., partitions mostly correspond to 64MB blocks on HDFS).

Edges can be re-partitioned with a PartitionStrategy;
the graph structure is taken into account, and the source and
destination IDs are used as the partition keys.
The partitions might still suffer from skew, depending on
graph properties (hub nodes, for example).
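
As a minimal sketch of that re-partitioning step (Scala; the SparkContext
`sc` and the file path are assumed, and EdgePartition2D is just one of the
built-in strategies):

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Initial edge partitions come from the hadoopFile() splits described above.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

// Re-partition the edges by (srcId, dstId) with a built-in strategy;
// hub vertices can still leave some partitions more loaded than others.
val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D, 16)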

Thanks,
takeshi


On Tue, Mar 10, 2015 at 2:21 AM, Matthew Bucci mrbucci...@gmail.com wrote:

 Hello,

 I am working on a project where we want to split graphs of data into
 snapshots across partitions, and I was wondering what would happen if one
 of the snapshots was too large to fit into a single partition. Would the
 snapshot be split equally over two partitions, for example? How is a
 single snapshot spread over multiple partitions?

 Thank You,
 Matthew Bucci





-- 
---
Takeshi Yamamuro