Hi Ankur,

Thanks so much! :))

Yes — is it possible to define a custom partition strategy?
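As far as I can tell from the API, PartitionStrategy is just a trait with a single getPartition method, so a custom one could presumably be passed to Graph.partitionBy. A rough sketch of what I have in mind (SourceCut is my own made-up example, not an existing strategy):

```scala
import org.apache.spark.graphx.{PartitionID, PartitionStrategy, VertexId}

// A hypothetical custom strategy: co-locate all out-edges of a vertex
// by hashing only the source id (a simple "source cut").
object SourceCut extends PartitionStrategy {
  override def getPartition(src: VertexId, dst: VertexId,
                            numParts: PartitionID): PartitionID = {
    // Mix the bits with a large prime before taking the modulus,
    // so nearby vertex ids don't all land in the same partition.
    val mixingPrime = 1125899906842597L
    (math.abs(src * mixingPrime) % numParts).toInt
  }
}

// Usage would then be: graph.partitionBy(SourceCut)
```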

And some other questions (on a 2 × 4-core machine with 24 GB memory):

- If I load a single edge file (5 GB) without any core/partition settings, what is the default number of partitions during graph construction, and how many cores will be used? And what if the file is 50 GB (more than the available memory), again with no partition settings?
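(For context, here is how I've been inspecting the defaults myself — the path and master URL are just placeholders, and I may be misreading what the counts mean:)

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader

// With no explicit settings, the edge partition count comes from the
// underlying Hadoop input splits (roughly one per HDFS block);
// minEdgePartitions only sets a lower bound.
val sc = new SparkContext("local[8]", "partition-check") // 2 x 4 cores
val graph = GraphLoader.edgeListFile(sc, "hdfs://host/edges.txt")

println(graph.edges.partitions.length) // actual number of edge partitions
println(sc.defaultParallelism)         // cores available for tasks
```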

- "because each vertex must be replicated to all partitions where it is 
referenced."
I don't understand this. For instance, suppose we have 3 edge partitions (EA: a -> b, a -> c; EB: a -> d, a -> e; EC: d -> c) and 2 vertex partitions (VA: a, b, c; VB: d, e). Will the whole vertex partition VA be replicated to all 3 edge partitions, since each of them refers to some vertices in VA?
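(Trying to reason it through myself with plain Scala: if replication really is per-reference, then only the vertices an edge partition actually touches would be mirrored there, not the whole vertex table — e.g. in my toy example, a would get 2 replicas, not 3. Am I reading it right?)

```scala
// Toy example from above: which vertices end up replicated where?
val edgePartitions = Map(
  "EA" -> Seq(("a", "b"), ("a", "c")),
  "EB" -> Seq(("a", "d"), ("a", "e")),
  "EC" -> Seq(("d", "c"))
)

// For each vertex, the set of edge partitions that reference it.
val replicas: Map[String, Set[String]] =
  edgePartitions.toSeq
    .flatMap { case (part, edges) =>
      edges.flatMap { case (s, d) => Seq(s -> part, d -> part) }
    }
    .groupBy(_._1)
    .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

// a is referenced by EA and EB but not EC -> 2 replicas;
// b is referenced only by EA -> 1 replica.
println(replicas)
```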

- "there is no shared-memory parallelism."
Do you mean that each core is restricted to accessing only its own partition in memory? How do they communicate when the required data (edges?) is in another partition?



On Jul 15, 2014, at 9:30 PM, Ankur Dave <ankurd...@gmail.com> wrote:

> On Jul 15, 2014, at 12:06 PM, Yifan LI <iamyifa...@gmail.com> wrote: 
> Btw, is there any possibility to customise the partition strategy as we 
> expect?
> 
> I'm not sure I understand. Are you asking about defining a custom partition 
> strategy?
> 
> On Tue, Jul 15, 2014 at 6:20 AM, Yifan LI <iamyifa...@gmail.com> wrote:
> when I load the file using sc.textFile (minPartitions = 16, 
> PartitionStrategy.RandomVertexCut)
> 
> The easiest way to load the edge file would actually be to use 
> GraphLoader.edgeListFile(sc, path, minEdgePartitions = 
> 16).partitionBy(PartitionStrategy.RandomVertexCut).
> 
> 1) how much data will be loaded into memory?
> 
> The exact size of the graph (vertices + edges) in memory depends on the 
> graph's structure, the partition function, and the average vertex degree, 
> because each vertex must be replicated to all partitions where it is 
> referenced.
> 
> It's easier to estimate the size of just the edges, which I did on the 
> mailing list a while ago. To summarize, during graph construction each edge 
> takes 60 bytes, and once the graph is constructed it takes 20 bytes.
> 
> 2) how many partitions will be stored in memory?
> 
> Once you call cache() on the graph, all 16 partitions will be stored in 
> memory. You can also tell GraphX to spill them to disk in case of memory 
> pressure by passing edgeStorageLevel=StorageLevel.MEMORY_AND_DISK to 
> GraphLoader.edgeListFile.
> 
> 3) If the thread/task on each core will read only one edge from memory and 
> then compute it at every time?
> 
> Yes, each task is single-threaded, so it will read the edges in its partition 
> sequentially.
> 
> 3.1) which edge on memory was determined to read into cache?
> 
> In theory, each core should independently read the edges in its partition 
> into its own cache. Cache issues should be much less of a concern than in 
> most applications because different tasks (cores) operate on independent 
> partitions; there is no shared-memory parallelism. The tradeoff is heavier 
> reliance on shuffles.
> 
> 3.2) how are those partitions being scheduled?
> 
> Spark handles the scheduling. There are details in Section 5.1 of the Spark 
> NSDI paper; in short, tasks are scheduled to run on the same machine as their 
> data partition, and by default each machine can accept at most one task per 
> core.
> 
> Ankur
> 
