Re: GraphX Snapshot Partitioning
Large edge partitions can cause a java.lang.OutOfMemoryError, and the Spark tasks then fail. FWIW, each edge partition can hold at most 2^32 edges, because the 64-bit vertex IDs are mapped to 32-bit local IDs within each partition. If the number of edges exceeds that limit, GraphX may throw an ArrayIndexOutOfBoundsException or a similar error. So each partition can hold fewer edges than you might expect.

--
---
Takeshi Yamamuro
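The 64-bit-to-32-bit mapping described above can be sketched roughly as follows. This is a simplified, hypothetical model for illustration, not GraphX's actual EdgePartition implementation; the class and field names are made up:

```scala
// Sketch: why per-partition edge capacity is bounded.
// An edge partition stores arrays of *local* vertex IDs (Ints) plus a
// table mapping global 64-bit IDs to those local Ints. Java arrays are
// Int-indexed, so these structures cannot exceed ~2^31 entries,
// regardless of how much heap the executor has.
import scala.collection.mutable

class TinyEdgePartition {
  private val global2local = mutable.Map[Long, Int]()   // 64-bit ID -> local ID
  private val localSrcIds  = mutable.ArrayBuffer[Int]() // one entry per edge
  private val localDstIds  = mutable.ArrayBuffer[Int]()

  private def localId(globalId: Long): Int =
    global2local.getOrElseUpdate(globalId, global2local.size)

  def addEdge(src: Long, dst: Long): Unit = {
    localSrcIds += localId(src)
    localDstIds += localId(dst)
  }

  def numEdges: Int = localSrcIds.size // an Int, so bounded by the index width
}
```

Because both the local IDs and the array indices are Ints, neither the number of distinct vertices referenced by a partition nor the number of edges it holds can grow past what a 32-bit index can address, even when memory would otherwise allow it.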
Re: GraphX Snapshot Partitioning
Hi,

Thanks for the response! That answered some of my questions, but the last thing I was wondering is what happens if you run a partition strategy and one of the partitions ends up being too large. For example, let's say a partition can hold 64 MB (knowing the actual maximum possible size of a partition would also be helpful to me). You try to partition the edges of a graph into 3 separate partitions, but the edges assigned to the first partition come to 80 MB, so they cannot all fit. Would the extra 16 MB spill over into a new 4th partition, would the system try to split the edges so that the 1st and 4th partitions each hold 40 MB, or would the partition strategy just fail with a memory error?

Thank You,
Matthew Bucci
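As far as I understand, there is no fixed 64 MB capacity for partitions to "flood over": an edge partition simply grows on its executor until memory (or the 32-bit index limit) runs out, so the practical knob is to request more, smaller partitions up front. A hedged sketch of that, using the GraphX 1.x API (the HDFS path and partition counts here are made up for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}

// Assumes an existing SparkContext `sc` and an edge-list file at a
// hypothetical path. Nothing spills automatically between partitions;
// you choose the partition count yourself.
def loadWithMorePartitions(sc: SparkContext): Graph[Int, Int] = {
  // Ask for more (hence smaller) initial partitions at load time...
  val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
    minEdgePartitions = 12)
  // ...and/or re-spread the edges across an explicit number of partitions.
  graph.partitionBy(PartitionStrategy.EdgePartition2D, 12)
}
```

With more partitions, each one holds a smaller slice of the edges, which reduces the chance of any single partition exhausting executor memory.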
GraphX Snapshot Partitioning
Hello,

I am working on a project where we want to split graphs of data into snapshots across partitions, and I was wondering what would happen if one of the snapshots was too large to fit into a single partition. Would the snapshot be split evenly over two partitions, for example? More generally, how is a single snapshot spread over multiple partitions?

Thank You,
Matthew Bucci

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Snapshot-Partitioning-tp21977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: GraphX Snapshot Partitioning
Hi,

Vertices are simply hash-partitioned by their 64-bit IDs, so they are spread evenly over the partitions. As for edges, GraphLoader#edgeListFile builds edge partitions through hadoopFile(), so the initial partitioning depends on the InputFormat#getSplits implementation (e.g., for HDFS the partitions mostly correspond to 64 MB blocks). Edges can then be re-partitioned with a PartitionStrategy; the graph is partitioned with its structure in mind, using the source ID and the destination ID as partition keys. The resulting partitions might still suffer from skew, depending on graph properties (hub nodes, for example).

Thanks,
takeshi

--
---
Takeshi Yamamuro
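A PartitionStrategy is just a function from (source ID, destination ID, number of partitions) to a partition ID, so you can also plug in your own when the built-in ones skew badly on your graph. A minimal hedged sketch (the strategy below is illustrative, similar in spirit to EdgePartition1D but not the built-in implementation):

```scala
import org.apache.spark.graphx.{PartitionID, PartitionStrategy, VertexId}

// Custom strategy: co-locate all edges that share a source vertex.
// Hub vertices with a huge out-degree will still skew this, which is
// why EdgePartition2D instead spreads each vertex's edges over roughly
// sqrt(numParts) partitions to bound the worst case.
object SourceHashStrategy extends PartitionStrategy {
  override def getPartition(src: VertexId, dst: VertexId,
                            numParts: PartitionID): PartitionID =
    // Non-negative modulo of the 64-bit source ID.
    ((src % numParts + numParts) % numParts).toInt
}
```

It would be used the same way as the built-ins, e.g. `graph.partitionBy(SourceHashStrategy, 8)`, with the caveat that any purely ID-based scheme inherits the skew of the underlying degree distribution.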