Hi,

We have a moderately sized data set (~20M vertices, ~45M edges).
We are running several basic algorithms, e.g. connected components, single-source shortest paths, PageRank, etc.

The setup includes 12 nodes with 8 cores and 16 GB RAM each. We allow a maximum of three mappers per node (-Xmx5G) and run up to 24 Giraph workers for the experiments. We are using the trunk version, pulled on 9/2 from GitHub, on Hadoop 1.2.1. We use HDFS as the data store (the file is ~980 MB; with a 64 MB block size, we get around 15 HDFS blocks). The input data is an adjacency list in JSON format. We use the built-in org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat as the input format.

Given this setup, we have a few questions and would appreciate any help in optimizing the execution:

We observed that most of the vertices in the dataset (>90%) have an out-degree < 20, and some have between 20 and 1000. However, there are a few vertices (<0.5%) with a very high out-degree (>100,000). Because of this, most of the workers load their data fairly quickly (~20-30 secs), but a couple of workers take much longer (~800 secs) to complete just the data input step. Is there a way to handle such vertices? Or would you suggest using a different input format?

Another question: is there a general guide for choosing the various optimization parameters, e.g. the number of input/compute/output threads?

Data locality and in-memory messages: Does Giraph make any attempt at data locality when running workers? For example, out of 12 nodes, if the HDFS blocks for a file are stored on only, say, 8 nodes and I run 8 workers, is it guaranteed that the workers will run on those 8 nodes? Also, if two vertices are located on the same worker, are messages between them transferred in memory?

Partitioning: We wish to study the effect of different partitioning schemes on the runtime. Is there a Partitioner we can use that will try to co-locate neighboring vertices on the same worker while keeping the partitions balanced?
(Basically a METIS partitioner.) If we pre-process the data file and store neighboring vertices close to each other in the file, so that each HDFS block approximately contains neighboring vertices, and then use the default Giraph partitioner, will that help?

I know this is a long mail, and we truly appreciate your help.

Thanks,
Alok
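
For reference, the out-degree skew described above can be checked with a small script before loading the file into Giraph. This is only a minimal sketch, not the tooling used to produce the numbers in this mail. It assumes one vertex per line in the form [vertexId, vertexValue, [[targetId, edgeValue], ...]], which is the layout JsonLongDoubleFloatDoubleVertexInputFormat is understood to read. The function names, the bucket boundaries, and the three-line sample are illustrative and are not part of Giraph.

```python
import io
import json
from collections import Counter


def bucket(out_degree):
    """Map an out-degree to a coarse bucket (boundaries are illustrative)."""
    if out_degree < 20:
        return "<20"
    if out_degree <= 1000:
        return "20-1000"
    if out_degree <= 100_000:
        return "1001-100000"
    return ">100000"


def out_degree_histogram(lines):
    """Count vertices per out-degree bucket.

    Assumes each non-empty line is a JSON array of the form
    [vertex_id, vertex_value, [[target_id, edge_value], ...]].
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _vertex_id, _vertex_value, edges = json.loads(line)
        counts[bucket(len(edges))] += 1
    return counts


# Tiny made-up sample standing in for the real adjacency-list file.
sample = io.StringIO(
    "[0,0,[[1,1],[3,3]]]\n"
    "[1,0,[[0,1],[2,2],[3,1]]]\n"
    "[2,0,[]]\n"
)
hist = out_degree_histogram(sample)
print(dict(hist))
```

On a real file, passing an open file handle instead of the sample works the same way, since the function only iterates over lines; vertices that land in the top bucket are the ones likely to dominate the input step on whichever worker reads them.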