On Jul 15, 2014, at 12:06 PM, Yifan LI <iamyifa...@gmail.com> wrote:

> Btw, is there any possibility to customise the partition strategy as we
> expect?


I'm not sure I understand. Are you asking about defining a custom
partition strategy
<http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-how-to-specify-partition-strategy-td9309.html#a9338>?
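If so: a custom strategy is just a Serializable object implementing
GraphX's PartitionStrategy trait, which maps each edge's (src, dst) pair
to a partition ID. Here is a minimal sketch; the SourceVertexCut name and
the mixing prime are my own illustration, modeled on the built-in
EdgePartition1D strategy:

    import org.apache.spark.graphx.{PartitionID, PartitionStrategy, VertexId}

    // Hypothetical strategy: assign each edge to a partition based only
    // on its source vertex, so all out-edges of a vertex land together.
    object SourceVertexCut extends PartitionStrategy {
      override def getPartition(src: VertexId, dst: VertexId,
                                numParts: PartitionID): PartitionID = {
        // Multiply by a large prime to scatter consecutive vertex IDs,
        // as the built-in EdgePartition1D does.
        val mixingPrime: VertexId = 1125899906842597L
        (math.abs(src * mixingPrime) % numParts).toInt
      }
    }

You would then apply it with graph.partitionBy(SourceVertexCut).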

On Tue, Jul 15, 2014 at 6:20 AM, Yifan LI <iamyifa...@gmail.com> wrote:

> when I load the file using sc.textFile (minPartitions = *16*,
> PartitionStrategy.RandomVertexCut)


The easiest way to load the edge file would actually be to use
GraphLoader.edgeListFile(sc, path, minEdgePartitions = 16).partitionBy(PartitionStrategy.RandomVertexCut).
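Spelled out with imports, that's roughly the following (a sketch; the
path is a placeholder):

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    // Load the edge list into at least 16 edge partitions, then
    // repartition the edges using a random vertex cut.
    val graph = GraphLoader
      .edgeListFile(sc, "hdfs://.../edges.txt", minEdgePartitions = 16)
      .partitionBy(PartitionStrategy.RandomVertexCut)

Note that sc.textFile doesn't accept a PartitionStrategy; the strategy is
applied to the constructed graph via partitionBy.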

> 1) How much data will be loaded into memory?


The exact size of the graph (vertices + edges) in memory depends on the
graph's structure, the partition function, and the average vertex degree,
because each vertex must be replicated to all partitions where it is
referenced.

It's easier to estimate the size of just the edges, which I did on the
mailing list
<http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-partition-problem-tp6248p6363.html>
a while ago. To summarize, each edge takes 60 bytes during graph
construction, and 20 bytes once the graph is constructed.
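To make that concrete: a graph with 100 million edges would occupy
roughly 100M × 60 bytes ≈ 6 GB across the cluster during construction,
dropping to about 2 GB once construction finishes (not counting the
vertices).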

> 2) How many partitions will be stored in memory?


Once you call cache() on the graph, all 16 partitions will be stored in
memory. You can also tell GraphX to spill them to disk in case of memory
pressure by passing edgeStorageLevel=StorageLevel.MEMORY_AND_DISK to
GraphLoader.edgeListFile.
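For example (a sketch; the path is a placeholder, and vertexStorageLevel
is the analogous option for the vertex partitions):

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
    import org.apache.spark.storage.StorageLevel

    // Let the edge and vertex partitions spill to disk under memory
    // pressure rather than being recomputed from scratch.
    val graph = GraphLoader
      .edgeListFile(sc, "hdfs://.../edges.txt",
        minEdgePartitions = 16,
        edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
        vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
      .partitionBy(PartitionStrategy.RandomVertexCut)
      .cache()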

> 3) Will the thread/task on each core read only *one* edge from memory
> at a time and then compute on it?


Yes, each task is single-threaded, so it will read the edges in its
partition sequentially.

> 3.1) Which edges in memory are read into the (CPU) cache?


In theory, each core should independently read the edges in its partition
into its own cache. Cache issues should be much less of a concern than in
most applications because different tasks (cores) operate on independent
partitions; there is no shared-memory parallelism. The tradeoff is heavier
reliance on shuffles.

> 3.2) How are those partitions being scheduled?


Spark handles the scheduling. There are details in Section 5.1 of the Spark
NSDI paper
<https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf>;
in short, tasks are scheduled to run on the same machine as their data
partition, and by default each machine can accept at most one task per core.

Ankur <http://www.ankurdave.com/>
