Hi, answers are inline.
On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns <martin.jungha...@gmx.net> wrote: > Hi group, > > I got a question concerning the graph partitioning step. If I understood > the code correctly, the graph is distributed to n partitions by using > vertexID.hashCode() & n. I got two questions concerning that step. > > 1) Is the whole graph loaded and partitioned only by the Master? This > would mean, the whole data has to be moved to that Master map job and then > moved to the physical node the specific worker for the partition runs on. > As this sounds like a huge overhead, I further inspected the code: > I saw that there is also a WorkerGraphPartitioner and I assume he calls > the partitioning method on his local data (lets say his local HDFS blocks) > and if the resulting partition for a vertex is not himself, the data gets > moved to that worker, which reduces the overhead. Is this assumption > correct? > That is correct, workers forward vertex data to the correct worker who is responsible for that vertex via hash-partitioning (by default), meaning that the master is not involved. > > 2) Let's say the graph is already partitioned in the file system, e.g. > blocks on physical nodes contain logical connected graph nodes. Is it > possible to just read the data as it is and skip the partitioning step? In > that case I currently assume, that the vertexID should contain the > partitionID and the custom partitioning would be an identity function in > that case (instead of hashing or range). > In principle you can. You would need to organize splits so that they contain all the data for each particular worker, and then assign relevant splits to the corresponding worker. > > Thanks for your time and help! > > Cheers, > Martin > -- Claudio Martella