Thank you all for such detailed responses. I guess moving to YARN will help us out with some of these issues. Is there a document that enumerates YARN specific optimization/advantages and how to take advantage of those.
Thanks, Alok On Sun, Sep 29, 2013 at 11:43 AM, Eli Reisman <apache.mail...@gmail.com>wrote: > Actually, the data locality is a bit different in Giraph. What happens is > that when running the non-YARN Giraph profiles, your workers are > distributed by the Hadoop framework anywhere on the cluster, but once the > Giraph workers launch, they attempt to claim an Input Split of the total > input data (which has blocks spread all over the cluster presumably) using > Apache Zookeeper. > > At this point, the worker must load whichever blocks the map to the input > split the worker claimed. These could potentially be located anywhere on > the cluster. At this point, the Giraph worker attempts to find some of the > blocks it claimed on the node it happened to be started on. In this lucky > situation, it will load those blocks locally. All other data for the input > split will be pulled across the network from other nodes where the blocks > reside. > > In practice, this form of data locality is more limited than MapReduce > locality, but does reduce the input stage running time of Giraph jobs in > many cases. In cases where there is not much input data or few job workers > relative to the cluster size, this locality scheme is not very effective. > One assumes the data input stage would be short due to the small scale of > the job in these cases. > > > > > > On Wed, Sep 25, 2013 at 11:31 PM, Marco Aurelio Barbosa Fagnani Lotz < > m.a.b.l...@stu12.qmul.ac.uk> wrote: > >> Hello Alok, >> >> about the question 3.a, i guess the framework will indeed try to >> allocate the local workers. >> Each worker is actually a map only task. Due to the behaviour of the >> Hadoop framework, it will aim for data locality. Therefore, the framework >> will try to run the map tasks (and thus the workers) in nodes that have >> local blocks. >> >> Best regards, >> Marco Aurelio Lotz >> >> Sent from my iPhone >> >> On 17 Sep 2013, at 18:19, "kumbh...@usc.edu" <kumbh...@usc.edu> wrote: >> >> Hi, >> We have a moderately sized data set (~20M vertices, ~45M edges). >> >> We are running several basic algorithms e.g connected components, >> single source shortest paths, page rank etc. The setup includes 12 nodes >> with 8 core, 16GB ram each. We allow max three mappers per node (-Xmx5G) >> and run upto 24 giraph workers for the experiments. We are using the trunk >> version, pulled on 9/2 from the github on hadoop 1.2.1. We use HDFS data >> store (the file is ~980 MB, with 64MB block size, we get around 15 HDFS >> blocks) >> >> Input data is in an adjacency list, json format. We use the built in >> org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat as >> the input format. >> >> Given this setup, we have a few questions and appreciate any help to >> optimize the execution: >> >> >> 1. We observed that the dataset contains most of the vertices (>90%) >> with out degree < 20, and some have between 20-1000. However there a few >> vertices (<0.5%) with a very high out degree (>100,000). >> 1. Due to this, most of the workers load data fairly quickly >> (~20-30secs), however a couple of workers take a much longer time >> (~800secs) to complete just the data input step. Is there a way to >> handle >> such vertices? Or do you suggest using any other input format? >> 2. Another question we have is if, in general, there is a guide >> for choosing various optimization parameters? >> 1. e.g. number of input/compute/output threads >> 3. Data Locality and in memory messages: >> 1. Is there any data locality attempt while running worker? >> Basically, out of 12 nodes, if the HDFS blocks for a file are stored >> only >> on say 8 nodes and I run 8 workers, is it guaranteed that the workers >> will >> run on those 8 nodes? >> 2. Also, if the vertices are located on the same worker, do we >> have in memory message transfer between those vertices. >> 4. Partitioning: We wish to study the effect of different >> partitioning schemes on the runtime. >> 1. Is there a Partitioner we can use that will try to collocate >> neighboring vertices on the same worker while balancing different >> partitions? (Basically a METIS Partitioner) >> 2. If we do pre-processing of the data file and store neighboring >> vertices close to each other in the file, implying different HDFS >> blocks >> will approximately contain neighboring vertices, and use the default >> giaph >> partitioner, will that help? >> >> >> >> I know this is a long mail, and we truly appreciate your help. >> >> Thanks, >> Alok >> >> Sent from Windows Mail >> >> >