I implemented an algorithm in Hadoop to process 25GB of graph data and
calculate its average separation length.
The input format is V1(tab)V2 (where V2 is a friend of V1).
My goal is to first randomly select some seed nodes, and then, for each
seed node, calculate the shortest paths from that node to all other
nodes in the graph.

To do this, I first run a simple Python script on a single machine to
pick some random seed nodes.
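In case it helps, this is roughly what the seed-selection script does (a
minimal sketch; pick_seeds and edges.txt are just placeholder names, and
it assumes the set of node IDs fits in memory):

    import random

    def pick_seeds(edge_file, k):
        # Collect the distinct node IDs from the V1<tab>V2 edge list.
        nodes = set()
        with open(edge_file) as f:
            for line in f:
                v1, v2 = line.rstrip("\n").split("\t")
                nodes.add(v1)
                nodes.add(v2)
        # Sample k seed nodes uniformly at random.
        return random.sample(sorted(nodes), k)

    seeds = pick_seeds("edges.txt", 100)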
Then I run a Hadoop job to generate an adjacency list for each node as
the input for the second job.
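The adjacency-list job is just a group-by on the source node. A minimal
sketch using Hadoop Streaming (the jar path depends on your install, and
adjlist_reducer.py is a placeholder name; `cat` serves as the identity
mapper since each input line is already key<tab>value):

    #!/usr/bin/env python
    # adjlist_reducer.py -- streaming reducer: lines arrive sorted by key,
    # as V1<tab>V2; output one line per node: V1<tab>friend1,friend2,...
    import sys

    current, friends = None, []
    for line in sys.stdin:
        node, friend = line.rstrip("\n").split("\t")
        if node != current:
            if current is not None:
                print("%s\t%s" % (current, ",".join(friends)))
            current, friends = node, []
        friends.append(friend)
    if current is not None:
        print("%s\t%s" % (current, ",".join(friends)))

Run with something like:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input edges -output adjlist \
        -mapper cat -reducer adjlist_reducer.py -file adjlist_reducer.py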

The second job takes the adjacency list as input and outputs the first
level of the breadth-first search: the nodes that are friends of the
seed node get distance 1. That output then becomes the input for the
next Hadoop job, and so on, until all reachable nodes have been visited.
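One BFS level per job can be sketched like this (again with Hadoop
Streaming; the record format node<tab>friends<tab>distance, with
distance -1 for unreached nodes, is my own convention, not something
fixed by Hadoop):

    #!/usr/bin/env python
    # bfs_mapper.py -- expand the frontier by one hop.
    import sys

    for line in sys.stdin:
        node, adj, dist = line.rstrip("\n").split("\t")
        d = int(dist)
        # Re-emit the node's record so its adjacency list survives the round.
        print("%s\t%s\t%d" % (node, adj, d))
        # Every friend of a reached node is at most d+1 away.
        if d >= 0 and adj:
            for friend in adj.split(","):
                print("%s\t\t%d" % (friend, d + 1))

    #!/usr/bin/env python
    # bfs_reducer.py -- per node, keep the adjacency list and the minimum
    # distance proposed by any mapper record (-1 means still unreached).
    import sys

    current, adj, best = None, "", -1

    def emit():
        if current is not None:
            print("%s\t%s\t%d" % (current, adj, best))

    for line in sys.stdin:
        node, a, dist = line.rstrip("\n").split("\t")
        d = int(dist)
        if node != current:
            emit()
            current, adj, best = node, "", -1
        if a:
            adj = a
        if d >= 0 and (best < 0 or d < best):
            best = d
    emit()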

I generated a simulated graph with only 100 nodes for testing. Plain
Python code can find the separation length within 1 second (100 seed
nodes), but the Hadoop version took almost 3 hours to do the same
(pseudo-distributed mode on one machine)!

Is there a more efficient way to do breadth-first search in Hadoop?
Outputting so many intermediate results is very inefficient: in total
there would be seedNodeNumber*levelNumber+1 jobs and
seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?

Please help.  Thanks!

Wei
