Handling vertices with huge number of outgoing edges (and other optimization).

kumbhare Tue, 17 Sep 2013 09:19:48 -0700

Hi,

We have a moderately sized data set (~20M vertices, ~45M edges).



We are running several basic algorithms, e.g. connected components, single-source 
shortest paths, PageRank, etc. The setup includes 12 nodes with 8 cores and 16 GB 
RAM each. We allow at most three mappers per node (-Xmx5G) and run up to 24 Giraph 
workers for the experiments. We are using the trunk version, pulled on 9/2 from 
GitHub, on Hadoop 1.2.1. The input is stored in HDFS (the file is ~980 MB; with a 
64 MB block size, we get around 15 HDFS blocks).


Input data is an adjacency list in JSON format. We use the built-in 
org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat as the 
input format.


Given this setup, we have a few questions and would appreciate any help 
optimizing the execution:


We observed that most vertices in the dataset (>90%) have an out-degree below 
20, and some have an out-degree between 20 and 1000. However, there are a few 
vertices (<0.5%) with a very high out-degree (>100,000). 


Because of this, most workers load their data fairly quickly (~20-30 seconds), 
but a couple of workers take much longer (~800 seconds) to complete just the 
data input step. Is there a way to handle such vertices? Or would you suggest 
using a different input format?
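
As an illustration of the skew described above, the out-degree distribution can be bucketed ahead of loading. This is only a minimal sketch: the class name DegreeSkew, the bucket boundaries (taken from the figures in this mail), and the sample degrees are illustrative assumptions and not part of Giraph.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DegreeSkew {
    // Bucket boundaries mirror the ranges quoted in this mail (an assumption).
    static String bucket(long outDegree) {
        if (outDegree < 20) return "<20";
        if (outDegree <= 1000) return "20-1000";
        if (outDegree <= 100_000) return "1001-100000";
        return ">100000";
    }

    // Counts how many vertices fall into each out-degree bucket.
    static Map<String, Integer> histogram(long[] outDegrees) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String b : new String[] {"<20", "20-1000", "1001-100000", ">100000"}) {
            counts.put(b, 0);
        }
        for (long d : outDegrees) {
            counts.merge(bucket(d), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] sample = {3, 7, 19, 20, 450, 1000, 250_000}; // made-up degrees
        System.out.println(histogram(sample));
    }
}
```

Knowing which input blocks hold the vertices in the last bucket would show whether the slow workers are the ones reading them.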

Another question we have is whether, in general, there is a guide for choosing 
the various optimization parameters, e.g. the number of input/compute/output 
threads.
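
For context, the thread counts in question correspond, as far as we can tell, to the Giraph options giraph.numInputThreads, giraph.numComputeThreads, and giraph.numOutputThreads, which can be supplied as Hadoop-style -D flags. The sketch below only assembles such flags; the class name ThreadFlags and all values are hypothetical placeholders, not recommendations.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class ThreadFlags {
    // Renders each option as a Hadoop-style -Dkey=value flag.
    static String toFlags(Map<String, Integer> options) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, Integer> e : options.entrySet()) {
            joiner.add("-D" + e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> options = new LinkedHashMap<>();
        // Placeholder values only; suitable settings are exactly what is being asked.
        options.put("giraph.numInputThreads", 2);
        options.put("giraph.numComputeThreads", 4);
        options.put("giraph.numOutputThreads", 2);
        System.out.println(toFlags(options));
    }
}
```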


Data locality and in-memory messages:


Does Giraph make any attempt at data locality when placing workers? For example, 
out of 12 nodes, if the HDFS blocks for a file are stored on only, say, 8 nodes 
and I run 8 workers, is it guaranteed that the workers will run on those 8 nodes?


Also, if two vertices are located on the same worker, are messages between them 
transferred in memory?


Partitioning: We wish to study the effect of different partitioning schemes on 
the runtime. 


Is there a partitioner we can use that tries to co-locate neighboring vertices 
on the same worker while keeping the partitions balanced (basically a 
METIS-style partitioner)?


If we pre-process the data file so that neighboring vertices are stored close 
to each other, meaning that each HDFS block approximately contains neighboring 
vertices, and use the default Giraph partitioner, will that help? 
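
To make the question concrete, the sketch below counts the edges whose endpoints land in the same partition under two schemes: assignment by vertex-ID hash modulo the partition count (which we assume approximates the default Giraph partitioner) and assignment by contiguous ID range. The class PartitionLocality, the chain graph, and both assignment functions are illustrative assumptions, not Giraph code.

```java
import java.util.function.LongToIntFunction;

public class PartitionLocality {
    // Hash-modulo assignment (assumed to resemble the default partitioner).
    static int hashPartition(long vertexId, int partitionCount) {
        return Math.abs(Long.hashCode(vertexId) % partitionCount);
    }

    // Contiguous-range assignment: IDs in [0, vertexCount) are cut into equal slices.
    static int rangePartition(long vertexId, long vertexCount, int partitionCount) {
        long sliceSize = (vertexCount + partitionCount - 1) / partitionCount;
        return (int) (vertexId / sliceSize);
    }

    // Counts edges whose source and target fall in the same partition.
    static int localEdges(long[][] edges, LongToIntFunction assign) {
        int local = 0;
        for (long[] e : edges) {
            if (assign.applyAsInt(e[0]) == assign.applyAsInt(e[1])) {
                local++;
            }
        }
        return local;
    }

    public static void main(String[] args) {
        int vertexCount = 8;
        int partitions = 4;
        // Chain graph 0-1-2-...-7: neighbors have adjacent IDs (made-up example).
        long[][] chain = new long[vertexCount - 1][];
        for (int i = 0; i < vertexCount - 1; i++) {
            chain[i] = new long[] {i, i + 1};
        }
        int byHash = localEdges(chain, id -> hashPartition(id, partitions));
        int byRange = localEdges(chain, id -> rangePartition(id, vertexCount, partitions));
        System.out.println("local edges (hash):  " + byHash);
        System.out.println("local edges (range): " + byRange);
    }
}
```

If the assumption about hash-based assignment holds, placement depends on the vertex ID rather than on the position in the file, so reordering the file alone would not be expected to change which vertices share a worker.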




I know this is a long mail, and we truly appreciate your help.


Thanks,

Alok





