Actually, the data locality story is a bit different in Giraph. When running
the non-YARN Giraph profiles, your workers are placed by the Hadoop framework
anywhere on the cluster, but once the Giraph workers launch, each one attempts
to claim an input split of the total input data (whose blocks are presumably
spread all over the cluster) using Apache ZooKeeper.
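
To make the claiming step concrete, here is a rough Java sketch of how a
worker can claim one split through ZooKeeper ephemeral znodes. This is
illustrative only; the znode path and class name are made up and this is not
Giraph's actual implementation:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/** Illustrative sketch: claim the first unclaimed input split via ZooKeeper. */
public class SplitClaimSketch {
  /**
   * Tries to claim a split by creating an ephemeral znode named after it.
   * Returns the id of the claimed split, or null if every split is taken.
   */
  static String claimSplit(ZooKeeper zk, List<String> splitIds) throws Exception {
    for (String splitId : splitIds) {
      try {
        // Ephemeral node: the claim is released automatically if this worker dies.
        zk.create("/giraph-demo/claimedSplits/" + splitId, new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        return splitId; // claim succeeded
      } catch (KeeperException.NodeExistsException alreadyClaimed) {
        // Another worker claimed this split first; try the next one.
      }
    }
    return null;
  }
}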

At that point, the worker must load whichever blocks map to the input split it
claimed, and those blocks could be located anywhere on the cluster. The worker
first checks whether any of the blocks belonging to its split happen to reside
on the node it was started on; in that lucky situation it reads those blocks
locally. All remaining data for the input split is pulled across the network
from the nodes where the other blocks reside.
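
As a simple illustration of that preference (again just a sketch in plain
Java, not Giraph code; the class, method, and parameter names are made up):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: order a claimed split's blocks so local ones load first. */
public class LocalBlockOrderSketch {
  /**
   * @param blockHosts map from block id to the hosts holding a replica of it
   * @param localHost  hostname of the node this worker happened to start on
   * @return block ids with locally replicated blocks first, remote ones after
   */
  static List<String> orderBlocks(Map<String, List<String>> blockHosts,
      String localHost) {
    List<String> local = new ArrayList<>();
    List<String> remote = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : blockHosts.entrySet()) {
      if (entry.getValue().contains(localHost)) {
        local.add(entry.getKey());   // can be read from the local DataNode
      } else {
        remote.add(entry.getKey());  // will be pulled across the network
      }
    }
    local.addAll(remote);            // local reads first, then remote fetches
    return local;
  }
}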

In practice, this form of data locality is more limited than MapReduce
locality, but it still reduces the input-stage running time of Giraph jobs in
many cases. When there is little input data, or few workers relative to the
cluster size, the locality scheme is not very effective, but with a job that
small the input stage should be short anyway.

On Wed, Sep 25, 2013 at 11:31 PM, Marco Aurelio Barbosa Fagnani Lotz <
m.a.b.l...@stu12.qmul.ac.uk> wrote:

>  Hello Alok,
>
>  About question 3.a, I guess the framework will indeed try to allocate the
> workers locally. Each worker is actually a map-only task, and because of the
> behaviour of the Hadoop framework, it will aim for data locality. Therefore,
> the framework will try to run the map tasks (and thus the workers) on nodes
> that have local blocks.
>
>  Best regards,
> Marco Aurelio Lotz
>
> Sent from my iPhone
>
> On 17 Sep 2013, at 18:19, "kumbh...@usc.edu" <kumbh...@usc.edu> wrote:
>
>   Hi,
> We have a moderately sized data set (~20M vertices, ~45M edges).
>
> We are running several basic algorithms, e.g. connected components,
> single-source shortest paths, PageRank, etc. The setup includes 12 nodes with
> 8 cores and 16 GB RAM each. We allow at most three mappers per node (-Xmx5G)
> and run up to 24 Giraph workers for the experiments. We are using the trunk
> version, pulled on 9/2 from GitHub, on Hadoop 1.2.1. We use HDFS as the data
> store (the file is ~980 MB; with a 64 MB block size we get around 15 HDFS
> blocks).
>
>  Input data is in an adjacency-list JSON format. We use the built-in
> org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat as
> the input format.
>
>  Given this setup, we have a few questions and would appreciate any help in
> optimizing the execution:
>
>
>    1. We observed that most of the vertices in the dataset (>90%) have an
>       out-degree < 20, and some have between 20 and 1,000. However, there are
>       a few vertices (<0.5%) with a very high out-degree (>100,000).
>       a. Because of this, most of the workers load their data fairly quickly
>          (~20-30 secs), but a couple of workers take much longer (~800 secs)
>          to complete just the data input step. Is there a way to handle such
>          vertices? Or do you suggest using a different input format?
>    2. Is there, in general, a guide for choosing the various optimization
>       parameters?
>       a. e.g. the number of input/compute/output threads
>    3. Data locality and in-memory messages:
>       a. Is there any attempt at data locality when running workers?
>          Basically, out of 12 nodes, if the HDFS blocks for a file are stored
>          on only, say, 8 nodes and I run 8 workers, is it guaranteed that the
>          workers will run on those 8 nodes?
>       b. Also, if two vertices are located on the same worker, are messages
>          between them transferred in memory?
>    4. Partitioning: we wish to study the effect of different partitioning
>       schemes on the runtime.
>       a. Is there a Partitioner we can use that will try to collocate
>          neighboring vertices on the same worker while balancing the
>          partitions? (Basically a METIS-style partitioner.)
>       b. If we pre-process the data file and store neighboring vertices close
>          to each other in the file, so that different HDFS blocks
>          approximately contain neighboring vertices, and then use the default
>          Giraph partitioner, will that help?
>
>
>
>  I know this is a long mail, and we truly appreciate your help.
>
>  Thanks,
> Alok
>
>  Sent from Windows Mail
>
>
