On Fri, Jul 18, 2014 at 9:13 AM, Yifan LI <iamyifa...@gmail.com> wrote:

> Yes, is it possible to define a custom partition strategy?


Yes, you just need to create a subclass of PartitionStrategy as follows:

import org.apache.spark.graphx._
object MyPartitionStrategy extends PartitionStrategy {
  override def getPartition(src: VertexId, dst: VertexId,
      numParts: PartitionID): PartitionID = {
    // put your hash function here; for example, a simple symmetric hash:
    (math.abs(src + dst) % numParts).toInt
  }
}
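Once defined, you can apply it with Graph.partitionBy. A minimal sketch (the edge-file path is a placeholder, and `sc` is assumed to be your SparkContext):

```scala
import org.apache.spark.graphx._

// Load an edge list, then repartition its edges with the custom strategy.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt")
val repartitioned = graph.partitionBy(MyPartitionStrategy)
```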

> If I load one edges file (5 GB) without any cores/partitions setting, what
> is the default partitioning during graph construction, and how many cores
> will be used?
> Or, what if the file is 50 GB (more than available memory, still without a
> partition setting)?


If you don't specify a number of partitions, it will default to 2 or the
number of blocks in the input file, whichever is greater (this is the same
behavior as sc.textFile
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1243>).
I'm not sure exactly how many this will be for your file, but you can find
out by running sc.textFile("...").partitions.length.

The default partitioning strategy is to leave the edges in the partitions
into which they were initially loaded.

For the 50 GB file, everything is the same except the default number of
partitions will probably be larger. Hopefully each partition will
individually still fit in memory, which will allow GraphX to proceed.
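If you'd rather set the edge partition count explicitly than rely on the textFile default, GraphLoader.edgeListFile also takes a partition-count argument (its parameter name varies across versions, so this sketch passes it positionally; the path and the count of 64 are placeholders):

```scala
import org.apache.spark.graphx._

// Load the edge list into (at least) 64 edge partitions.
// Arguments: SparkContext, path, canonicalOrientation, partition count.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt", false, 64)
```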

> The whole vertex table VA will be replicated to all these 3 edge
> partitions, since each of them refers to some vertices in VA?


Yes, for that graph and partitioning, all vertices will be replicated to
all edge partitions. It won't be quite as bad for most graphs, and
EdgePartition2D
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.PartitionStrategy$$EdgePartition2D$>
provides a bound on the replication factor, as described in the docs.
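Switching to it is a one-liner; a sketch assuming `graph` is any Graph[VD, ED] (the scaladoc states the bound as 2 * sqrt(numParts) copies per vertex):

```scala
// Repartition with the 2D scheme to cap vertex replication.
val g2 = graph.partitionBy(PartitionStrategy.EdgePartition2D)
```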

> You mean that each core is restricted to accessing only its own partition
> in memory?
> How do they communicate when the required data (edges?) is in another
> partition?


Right, there is one task per core, and each task only accesses its own
partition, never the partitions of other tasks.

Tasks communicate in two ways:
1. Vertex replication (multicast). When an edge needs to access its
neighboring vertices (to construct the triplets or perform the map step in
mapReduceTriplets), GraphX replicates vertices to all edge partitions where
they are referenced.
2. Vertex aggregation. When an edge needs to update the value of a
neighboring vertex (to perform the reduce step in mapReduceTriplets),
GraphX aggregates vertex updates from the edge partitions back to the
vertices.

Both of these communication steps happen using Spark shuffles, which work
by writing messages to disk and notifying other tasks to read them from
disk.

Note that edges never need to access other edges directly, and all
communication is done through the vertices.
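Both communication steps are visible in a typical mapReduceTriplets call. A sketch computing in-degrees (assuming `graph` is any Graph[VD, ED]):

```scala
// Map step: reading triplet.dstId relies on vertex replication -- the
// vertex values were multicast to the edge partitions beforehand.
// Reduce step: the emitted messages are aggregated back to each vertex.
val inDegrees = graph.mapReduceTriplets[Int](
  triplet => Iterator((triplet.dstId, 1)), // one message per in-edge
  (a, b) => a + b                          // sum the messages per vertex
)
```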

Ankur <http://www.ankurdave.com/>
