I also blog about how to do Single Source Shortest Path here at http://horicky.blogspot.com/2010/02/nosql-graphdb.html
One MR algorithm is based on Dijkstra and the other is based on BFS. I think the first one is more efficient than the second one.

Rgds,
Ricky

-----Original Message-----
From: Peng, Wei [mailto:wei.p...@xerox.com]
Sent: Monday, December 20, 2010 5:50 PM
To: common-user@hadoop.apache.org
Subject: breadth-first search

I implemented an algorithm in Hadoop on a 25GB graph dataset to calculate its average separation length. The input format is V1(tab)V2 (where V2 is a friend of V1). My goal is to first randomly select some seed nodes and then, for each seed node, calculate the shortest paths from that node to all other nodes in the graph.

To do this, I first run a simple Python script on a single machine to pick the random seed nodes. Then I run a Hadoop job to generate the adjacency list for each node as the input for the second job. The second job takes the adjacency-list input and outputs the first level of the breadth-first search: the nodes that are friends of the seed node get distance 1. That output then becomes the input for the next Hadoop job, and so on, until all nodes have been reached.

I generated a simulated graph for testing. It has only 100 nodes. A plain Python script finds the separation length within 1 second (100 seed nodes), but Hadoop took almost 3 hours to do the same thing (pseudo-distributed mode on one machine)!

I wonder if there is a more efficient way to do breadth-first search in Hadoop? It is very inefficient to output so many intermediate results: in total there would be seedNodeNumber*levelNumber+1 jobs and seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?

Please help. Thanks!
Wei
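For reference, one round of the level-by-level BFS described above can be sketched in Python. This is a minimal, hypothetical simulation of what a Hadoop Streaming mapper and reducer would do, not Wei's actual code: I assume a record format of `node<TAB>distance<TAB>comma-separated-neighbors`, with distance -1 for nodes not yet reached, and I run the map/shuffle/reduce round in memory instead of over stdin/stdout.

```python
from collections import defaultdict

def bfs_map(line):
    """Mapper for one BFS level: re-emit the node's own record, and for every
    node already reached, emit a tentative distance (dist + 1) to each
    neighbor.  Record format (assumed): node \t distance \t n1,n2,..."""
    node, dist, adj = line.split("\t")
    dist = int(dist)
    yield node, (dist, adj)                # pass the node's structure through
    if dist >= 0:                          # node is on or behind the frontier
        for nbr in adj.split(","):
            if nbr:
                yield nbr, (dist + 1, "")  # tentative distance, no adjacency

def bfs_reduce(node, values):
    """Reducer: keep the minimum non-negative distance seen for the node
    (or -1 if still unreached) and recover its adjacency list."""
    best = min((d for d, _ in values if d >= 0), default=-1)
    adj = next((a for _, a in values if a), "")
    return f"{node}\t{best}\t{adj}"

def run_round(lines):
    """Simulate one map -> shuffle -> reduce round in memory."""
    groups = defaultdict(list)
    for line in lines:
        for node, val in bfs_map(line):
            groups[node].append(val)
    return sorted(bfs_reduce(n, vs) for n, vs in groups.items())
```

Each call to `run_round` advances the search by one level, which is exactly why the chained-jobs scheme needs one Hadoop job per level per seed: for a tiny 100-node graph, the fixed per-job startup and HDFS I/O overhead dwarfs the actual computation, which is why a single-machine script wins by such a margin.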