When running Giraph on MapReduce, you have no control over where the worker
tasks will be hosted on the cluster. As a result, the default hash
partitioning is not aware of co-located blocks, and a Giraph job does a fair
amount of time-consuming network shuffling of data during initialization.
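
To make that concrete, here is a minimal sketch (my own illustration, not
Giraph's actual HashPartitionerFactory code) of why hashing ignores
locality: the partition is computed from the vertex id alone, so two
vertices stored on the same HDFS block can land on any two workers:

    // Minimal sketch of id-hash partitioning, ignoring HDFS block locality.
    public class HashPartitionSketch {
        // Map a vertex id to one of numPartitions partitions.
        static int partitionFor(long vertexId, int numPartitions) {
            // floorMod keeps the result non-negative for negative ids.
            return (int) Math.floorMod(vertexId, (long) numPartitions);
        }

        public static void main(String[] args) {
            // Two vertices that might sit on the same HDFS block...
            long vertex1 = 12L, vertex2 = 13L;
            // ...are routed to different partitions purely by their ids.
            System.out.println(partitionFor(vertex1, 4)); // 0
            System.out.println(partitionFor(vertex2, 4)); // 1
        }
    }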

What Giraph does do is this: as each worker task spins up on the cluster, it
attempts to claim input splits that happen to be local to the DataNode the
worker runs on. This speeds up the initial ingestion of graph data quite a
bit, but it does not help us much when it comes to distributing the data to
the worker that owns that data's assigned partition.
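
The claiming step looks roughly like the sketch below. It is a hedged
illustration built on Hadoop's real InputSplit.getLocations() API; the
chooseSplit helper and its fallback behavior are my own simplification, not
Giraph's actual code:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.mapreduce.InputSplit;

    public class LocalSplitChooser {
        // Prefer a split with an HDFS block replica on this worker's host;
        // fall back to any remaining split when nothing local is left.
        static InputSplit chooseSplit(List<InputSplit> unclaimed, String localHost)
                throws IOException, InterruptedException {
            for (InputSplit split : unclaimed) {
                if (Arrays.asList(split.getLocations()).contains(localHost)) {
                    return split;
                }
            }
            return unclaimed.isEmpty() ? null : unclaimed.get(0);
        }
    }

The fallback matters: when no local split is left, the worker still has to
claim a remote one and pull those blocks over the network.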

Only when all of the data has been pushed to the appropriate workers can the
Giraph job actually begin. Data that happens to belong to a host-local
partition is not sent over the network, but in many cases there is no way to
avoid the shuffle without replacing hash partitioning with an alternative.
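
As a hedged sketch of one such alternative, consider simple range
partitioning. It only pays off under an assumption the sketch makes
explicit: that your vertex ids are numbered so vertices on the same HDFS
block have nearby ids. Giraph does let you plug in your own partitioner via
GraphPartitionerFactory; the class below illustrates just the assignment
function, not that interface:

    // Hypothetical range partitioner: contiguous id ranges map to the same
    // partition, so block-local vertices with nearby ids stay together.
    // Assumes ids run from 0 to maxId; names here are illustrative only.
    public class RangePartitionSketch {
        static int partitionFor(long vertexId, long maxId, int numPartitions) {
            // Ceiling of (maxId + 1) / numPartitions: vertices per partition.
            long rangeSize = (maxId + numPartitions) / numPartitions;
            return (int) Math.min(vertexId / rangeSize, numPartitions - 1);
        }

        public static void main(String[] args) {
            // Ids 0..99 across 4 partitions: 12 and 13 now share partition 0.
            System.out.println(partitionFor(12L, 99L, 4)); // 0
            System.out.println(partitionFor(13L, 99L, 4)); // 0
        }
    }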


On Sat, Nov 16, 2013 at 12:22 PM, David J Garcia <djch...@utexas.edu> wrote:

> hello, I was wondering if there was a way to ensure that vertices located
> on the same data block (on hdfs) are co-located with each other?
>
> Also, will the vertices in input-splits (splits that are located on the
> same DataNode) have a reasonable chance of being partitioned to the same id?
>
> for example, suppose that I have vertex_1 located on data_block_i, and
> vertex_2 located on data_block_k.  Let's suppose that both of the data
> blocks are located on the same DataNode machine.  Is there a reasonably
> good chance that the vertex_1 and vertex_2 will partition to the same id?
>
> I'm doing a research project and I'm trying to show the benefits of graph
> data-locality.
>
> -David
>
