Thank you all for the replies.

I know I can run 100 source nodes in parallel. It is just surprising to me that 
the Python code and the Hadoop code have such a big efficiency difference (both 
run sequentially over the 100 source nodes). Hadoop is useful when the data is 
huge and cannot fit into memory, but it does not seem to be a real-time 
solution.

There is so much to learn. Thank you again for all your kind help.

Wei

-----Original Message-----
From: Ricky Ho [mailto:rickyphyl...@yahoo.com] 
Sent: Tuesday, December 21, 2010 11:53 AM
To: common-user@hadoop.apache.org
Subject: RE: breadth-first search

Exactly, it should be done in 4 iterations instead of 4 * 100 iterations.

But in general, graph algorithms are harder to do in Map/Reduce, and the 
stateless nature of the model typically requires shuffling the whole graph in 
each iteration.  Jimmy Lin has described a technique called Schimmy to avoid 
that, which I described here:
http://horicky.blogspot.com/2010/07/graph-processing-in-map-reduce.html
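
For anyone reading along, here is a minimal sketch of the reduce-side merge 
idea behind Schimmy, in plain Python with made-up names (not Jimmy Lin's 
actual code): the adjacency lists are pre-partitioned and sorted on HDFS with 
the same partitioner as the shuffle, so each reducer streams its own graph 
partition and merges it with the shuffled messages instead of re-shuffling 
the structure every iteration.

# Sketch of a Schimmy-style reducer merge (illustrative names only).
# Assumption: graph_partition is this reducer's share of the adjacency
# lists, sorted by node id in the same order as the shuffled keys.

def emit(key, value):
    # Stand-in for the MapReduce output collector.
    print("%s\t%s" % (key, value))

def schimmy_reduce(shuffled_groups, graph_partition):
    # shuffled_groups: iterable of (node_id, [distance messages]), key-sorted.
    # graph_partition: iterable of (node_id, adjacency_list), same order.
    graph_iter = iter(graph_partition)
    g_id, g_adj = next(graph_iter, (None, None))
    for node_id, distances in shuffled_groups:
        # Pass through structure records for nodes that got no messages.
        while g_id is not None and g_id < node_id:
            emit(g_id, (None, g_adj))
            g_id, g_adj = next(graph_iter, (None, None))
        best = min(distances)  # combine the shuffled distance messages
        if g_id == node_id:
            emit(node_id, (best, g_adj))  # updated distance + structure
            g_id, g_adj = next(graph_iter, (None, None))
        else:
            emit(node_id, (best, None))   # message for a node not in this partition
    while g_id is not None:  # flush the remaining structure records
        emit(g_id, (None, g_adj))
        g_id, g_adj = next(graph_iter, (None, None))

The point is that only the small per-iteration messages go through the 
shuffle; the bulky adjacency lists stay put on HDFS.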

And of course, the BSP style is another good way to achieve this, which is 
what Google's Pregel model is based on.  I also blogged about this here:
http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html
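
As a rough illustration of that style (not the real Pregel or Hama API; the 
class and method names here are made up), a BSP vertex program for BFS keeps 
the distance as per-vertex state across supersteps and only sends messages 
when the distance improves:

# Illustrative Pregel/BSP-style vertex program for BFS (hypothetical API).
INF = float("inf")

class BFSVertex(object):
    def __init__(self, vertex_id, neighbors, is_source=False):
        self.id = vertex_id
        self.neighbors = neighbors
        self.distance = 0 if is_source else INF
        self.outbox = []    # (target, distance) messages for the next superstep
        self.active = True

    def compute(self, messages, superstep):
        # Take the best distance offered by incoming messages.
        offered = min(messages) if messages else INF
        if superstep == 0:
            offered = self.distance          # the source starts the wave
        if offered < self.distance or (superstep == 0 and offered == 0):
            self.distance = min(self.distance, offered)
            for n in self.neighbors:
                self.outbox.append((n, self.distance + 1))
        self.active = False  # vote to halt; a later message reactivates the vertex

The framework delivers the outbox messages, runs supersteps until every vertex 
has halted with no pending messages, and the graph itself never moves between 
iterations.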

Notice that "scalability" and "speed" are different animals.  Hadoop is about 
scalability, not speed.  If your data can be squeezed into a single machine, 
then Hadoop is not for you.  Otherwise, Hadoop is a great tool to parallelize 
your processing.  It has a pretty constant (but significant) overhead that is 
justifiable when you have a large amount of data.  For example, Hadoop is not 
for processing real-time streams.
http://horicky.blogspot.com/2010/11/map-reduce-and-stream-processing.html


Notice that ... Hadoop is just one of the tools that you can use, and you have 
to decide whether there is a better one for your problem.


Rgds,
Ricky

-----Original Message-----
From: Ted Dunning [mailto:tdunn...@maprtech.com] 
Sent: Tuesday, December 21, 2010 7:29 AM
To: common-user@hadoop.apache.org
Subject: Re: breadth-first search
 
Ahh... I see what you mean.
 
This algorithm can be implemented with all of the iterations for all points
proceeding in parallel.  You should only need 4 map-reduce steps, not 400.
 This will still take several minutes on Hadoop, but as your problem
increases in size, this overhead becomes less important.
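
To make that concrete, here is a sketch in plain Python (not the original 
poster's code; the record format is my own assumption) of one such iteration: 
each per-node record carries a dict of {seed: distance}, so all 100 seeds 
advance one BFS level in a single map/reduce pass, and the whole search needs 
only as many passes as the graph has levels.

# Sketch of one multi-source BFS iteration in map/reduce style.
# Each record is (node, adjacency_list, {seed: distance}); the dict lets
# every seed's frontier advance in the same pass.
from collections import defaultdict

def map_phase(records):
    for node, adjacency, dists in records:
        yield node, ("GRAPH", adjacency)                  # re-emit structure
        for seed, dist in dists.items():
            yield node, ("DIST", seed, dist)              # keep what we know
            for neighbor in adjacency:
                yield neighbor, ("DIST", seed, dist + 1)  # expand the frontier

def reduce_phase(mapped):
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    for node, values in grouped.items():
        adjacency, dists = [], {}
        for value in values:
            if value[0] == "GRAPH":
                adjacency = value[1]
            else:
                _, seed, dist = value
                dists[seed] = min(dist, dists.get(seed, dist))
        yield node, adjacency, dists

# Example: two seeds ("a" and "b") advance together, one level per pass.
records = [("a", ["b"], {"a": 0}), ("b", ["c"], {"b": 0}), ("c", [], {})]
for _ in range(2):                       # number of levels, 4 in Wei's graph
    records = list(reduce_phase(map_phase(records)))

Each pass corresponds to one Hadoop job, so a 4-level graph needs 4 jobs no 
matter how many seed nodes there are.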
 
2010/12/21 Peng, Wei <wei.p...@xerox.com>
 
> The graph that my BFS algorithm is running on only needs 4 levels to reach
> all nodes. The reason I say "many iterations" is that there are 100 source
> nodes, so 400 iterations in total. The algorithm should be right, and I
> cannot think of anything to reduce the number of iterations.
> 
> Ted, I will check out the links that you sent to me.
> I really appreciate your help.
> 
> Wei
> -----Original Message-----
> From: Edward J. Yoon [mailto:edw...@udanax.org]
> Sent: Tuesday, December 21, 2010 1:27 AM
> To: common-user@hadoop.apache.org
> Subject: Re: breadth-first search
> 
> There's no release yet.
> 
> But I have tested BFS using Hama and HBase.
> 
> Sent from my iPhone
> 
> On 2010. 12. 21., at 11:30 AM, "Peng, Wei" <wei.p...@xerox.com> wrote:
> 
> > Yoon,
> >
> > Can I use HAMA now, or is it still in development?
> >
> > Thanks
> >
> > Wei
> >
> > -----Original Message-----
> > From: Edward J. Yoon [mailto:edwardy...@apache.org]
> > Sent: Monday, December 20, 2010 6:23 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: breadth-first search
> >
> > Check this slide out -
> > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
> >
> > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <wei.p...@xerox.com> wrote:
> >>
> >> I implemented an algorithm in Hadoop on 25GB of graph data to
> >> calculate its average separation length.
> >> The input format is V1(tab)V2 (where V2 is a friend of V1).
> >> My purpose is to first randomly select some seed nodes, and then, for
> >> each seed node, calculate the shortest paths from that node to all other
> >> nodes in the graph.
> >>
> >> To do this, I first run a simple Python script on a single machine to get
> >> some random seed nodes.
> >> Then I run a Hadoop job to generate an adjacency list for each node as the
> >> input for the second job.
> >>
> >> The second job takes the adjacency-list input and outputs the first-level
> >> breadth-first search result. The nodes that are friends of the seed
> >> node have distance 1. This output then becomes the input for the next Hadoop
> >> job, and so on, until all the nodes are reached.
> >>
> >> I generated a simulated graph for testing. This data has only 100 nodes.
> >> Plain Python code can find the separation length within 1 second (100
> >> seed nodes). However, Hadoop took almost 3 hours to do that
> >> (pseudo-distributed mode on one machine)!!
> >>
> >> I wonder if there is a more efficient way to do breadth-first search in
> >> Hadoop? It is very inefficient to output so many intermediate results.
> >> In total there would be seedNodeNumber*levelNumber+1 jobs and
> >> seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?
> >>
> >> Please help.  Thanks!
> >>
> >> Wei
> >>
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > edwardy...@apache.org
> > http://blog.udanax.org
> 


      
