Re: breadth-first search

2010-12-21 Thread Edward J. Yoon
> I don't see much evidence that HAMA will ever solve anything, so I wouldn't
> recommend pinning your hopes on that.

Hello Ted.

What is your opinion based on? Have you tried to run the HAMA TRUNK?

If not, you don't know about our progress in the last year. :-)

On Tue, Dec 21, 2010 at 3:18 PM, Ted Dunning  wrote:
> On Mon, Dec 20, 2010 at 9:43 PM, Peng, Wei  wrote:
>
>> ...
>> Currently, most of the matrix data (graph matrices, document-word matrices)
>> that we are dealing with is sparse.
>>
>
> Good.
>
>
>> The matrix decomposition often needs many iterations to converge, and the
>> intermediate results then have to be saved to serve as the input for the next
>> iteration.
>>
>
> I think you are thinking of the wrong algorithms.  The ones that I am
> talking about converge
> in a fixed and small number of steps.  See
> https://issues.apache.org/jira/browse/MAHOUT-376 for the work
> in progress on this.
>
>
>> This is super inefficient. As I mentioned, the BFS algorithm written in
>> Python took 1 second to run; however, Hadoop took almost 3 hours. I
>> would expect Hadoop to be slower, but not this slow.
>>
>
> I think you have a combination of factors here.  But, even accounting for
> having too many
> iterations in your BFS algorithm, iterations with stock Hadoop take 10s of
> seconds even if
> they do nothing.  If you arrange your computation to need many iterations,
> it will be slow.
>
>
>> All the Hadoop applications that I have seen are very simple calculations;
>> I wonder how it can be applied to machine learning/data mining
>> algorithms.
>>
>
> Check out Mahout.  There is a lot of machine learning going on there both on
> Hadoop and using other scalable methods.
>
>
>> Is HAMA the only way to solve it? If it is not ready to use yet, then
>> can I assume hadoop is not a good solution for multiple iteration
>> algorithms now?
>>
>
> I don't see much evidence that HAMA will ever solve anything, so I wouldn't
> recommend pinning your hopes on that.
>
> For fast, iterative map-reduce, you really need to keep your mappers and
> reducers live between iterations.  Check out Twister for that:
> http://www.iterativemapreduce.org/
>



-- 
Best Regards, Edward J. Yoon
edwardy...@apache.org
http://blog.udanax.org


RE: breadth-first search

2010-12-21 Thread Peng, Wei
I was just trying to run 100 source nodes in multiple threads, but the
MapReduce tasks still appear to run sequentially.
Do I need to configure Hadoop somehow for multiple threads? Assign more
task slots? How?
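
For reference, the kind of concurrent submission I mean is sketched below
(buildBfsJob() and the input/output paths are just stand-ins for my per-seed
job setup, not the actual code). The jobs are submitted without waiting on each
other, yet their tasks still appear to run one after another; as I understand
it, the per-tasktracker slot limits (mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum, which default to 2) also cap how much
can overlap on a single pseudo-distributed machine.

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class SubmitAllSeeds {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      List<Job> jobs = new ArrayList<Job>();
      // Submit one BFS job per seed without blocking on the previous one.
      for (String seed : args) {
        Job job = buildBfsJob(conf, seed);
        job.submit();             // returns immediately, unlike waitForCompletion()
        jobs.add(job);
      }
      // Then wait for all of the submitted jobs to finish.
      for (Job job : jobs) {
        while (!job.isComplete()) {
          Thread.sleep(5000);
        }
      }
    }

    // Stand-in for the real per-seed job setup (identity map, assumed paths).
    private static Job buildBfsJob(Configuration conf, String seed) throws Exception {
      Job job = new Job(conf, "bfs-" + seed);
      job.setJarByClass(SubmitAllSeeds.class);
      job.setMapperClass(Mapper.class);          // placeholder identity mapper
      job.setNumReduceTasks(0);
      FileInputFormat.addInputPath(job, new Path("input/" + seed));
      FileOutputFormat.setOutputPath(job, new Path("output/" + seed));
      return job;
    }
  }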


Thanks

-Original Message-
From: Ted Dunning [mailto:tdunn...@maprtech.com] 
Sent: Tuesday, December 21, 2010 1:10 PM
To: common-user@hadoop.apache.org
Subject: Re: breadth-first search

Absolutely true.  Nobody should pretend otherwise.

On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei  wrote:

> Hadoop is useful when the data is huge and cannot fit into memory, but it
> does not seem to be a real-time solution.


Re: breadth-first search

2010-12-21 Thread Ted Dunning
Absolutely true.  Nobody should pretend otherwise.

On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei  wrote:

> Hadoop is useful when the data is huge and cannot fit into memory, but it
> does not seem to be a real-time solution.


RE: breadth-first search

2010-12-21 Thread Peng, Wei
Thank you all for all the replies.

I know I can run 100 source nodes in parallel. It is just surprising to me that
the Python code and the Hadoop code have such a big efficiency difference (they
both run sequentially over the 100 source nodes). Hadoop is useful when the data
is huge and cannot fit into memory, but it does not seem to be a real-time
solution.

There is so much to learn. Thank you again for all your kind help.

Wei

-Original Message-
From: Ricky Ho [mailto:rickyphyl...@yahoo.com] 
Sent: Tuesday, December 21, 2010 11:53 AM
To: common-user@hadoop.apache.org
Subject: RE: breadth-first search

Exactly, it should be done in 4 iterations instead of 4 * 100 iterations.

But in general, graph algorithms are harder to do in Map/Reduce, and its
stateless nature typically requires shuffling the whole graph in each
iteration.  Jimmy Lin has described a technique called Schimmy to avoid that,
which I described here at
http://horicky.blogspot.com/2010/07/graph-processing-in-map-reduce.html

And of course, the BSP style is another good way to achieve this, which is what
Google's Pregel model is based on. I also blog about this here at
http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html

Notice that "scalability" and "speed" are different animals.  Hadoop is about 
scalability but not speed.  If your data can be squeezed into a single machine, 
then Hadoop is  not for you.  Otherwise, Hadoop is one great tool to 
parallelize 
your  processing.  It has a pretty constant (but significant) overhead that is  
justifiable when you have large amount of data.  For example, Hadoop is not for 
processing real-time streams.
http://horicky.blogspot.com/2010/11/map-reduce-and-stream-processing.html


Notice that ... Hadoop is just one of the tools that you can use, and you have
to decide whether there are better ones for your problem.


Rgds,
Ricky

-Original Message-
From: Ted Dunning [mailto:tdunn...@maprtech.com] 
Sent: Tuesday, December 21, 2010 7:29 AM
To: common-user@hadoop.apache.org
Subject: Re: breadth-first search
 
Ahh... I see what you mean.
 
This algorithm can be implemented with all of the iterations for all points
proceeding in parallel.  You should only need 4 map-reduce steps, not 400.
 This will still take several minutes on Hadoop, but as your problem
increases in size, this overhead becomes less important.
 
2010/12/21 Peng, Wei 
 
> The graph that my BFS algorithm is running on only needs 4 levels to reach
> all nodes. The reason I say "many iterations" is that there are 100 source
> nodes, so 400 iterations in total. The algorithm should be right, and I
> cannot think of anything to reduce the number of iterations.
> 
> Ted, I will check out the links that you sent to me.
> I really appreciate your help.
> 
> Wei
> -Original Message-
> From: Edward J. Yoon [mailto:edw...@udanax.org]
> Sent: Tuesday, December 21, 2010 1:27 AM
> To: common-user@hadoop.apache.org
> Subject: Re: breadth-first search
> 
> There's no release yet.
> 
> But I have tested BFS using HAMA and HBase.
> 
> Sent from my iPhone
> 
> On 2010. 12. 21., at 11:30 AM, "Peng, Wei"  wrote:
> 
> > Yoon,
> >
> > Can I use HAMA now, or is it still in development?
> >
> > Thanks
> >
> > Wei
> >
> > -Original Message-
> > From: Edward J. Yoon [mailto:edwardy...@apache.org]
> > Sent: Monday, December 20, 2010 6:23 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: breadth-first search
> >
> > Check this slide out -
> > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
> >
> > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei  wrote:
> >>
> >>  I implemented an algorithm to run on Hadoop over a 25GB graph to
> >> calculate its average separation length.
> >> The input format is V1(tab)V2 (where V2 is the friend of V1).
> >> My purpose is to first randomly select some seed nodes, and then for
> >> each seed node, calculate the shortest paths from this node to all other
> >> nodes on the graph.
> >>
> >> To do this, I first run a simple Python script on a single machine to get
> >> some random seed nodes.
> >> Then I run a Hadoop job to generate an adjacency list for each node as the
> >> input for the second job.
> >>
> >> The second job takes the adjacency-list input and outputs the first-level
> >> breadth-first search result. The nodes which are the friends of the seed
> >> node have distance 1. This output is then the input for the next Hadoop
> >> job, and so on, until all the nodes are reached.
> >>
> >> I generated a simulated graph for testing. This data has only 100 nodes.
> >> Plain Python code can find the separation length within 1 second (100
> >> seed nodes). However, Hadoop took almost 3 hours to do the same
> >> (pseudo-distributed mode on one machine)!!
> >>
> >> I wonder if there is a more efficient way to do breadth-first search in
> >> Hadoop? It is very inefficient to output so many intermediate results.
> >> In total there would be seedNodeNumber*levelNumber+1 jobs,
>

RE: breadth-first search

2010-12-21 Thread Ricky Ho
Exactly, it should be done in 4 iterations instead of 4 * 100 iterations.

But in general, graph algorithms are harder to do in Map/Reduce, and its
stateless nature typically requires shuffling the whole graph in each
iteration.  Jimmy Lin has described a technique called Schimmy to avoid that,
which I described here at
http://horicky.blogspot.com/2010/07/graph-processing-in-map-reduce.html

And of course, the BSP style is another good way to achieve this, which is what
Google's Pregel model is based on. I also blog about this here at
http://horicky.blogspot.com/2010/07/google-pregel-graph-processing.html
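
To make the BSP/Pregel idea concrete, the per-vertex logic for BFS looks roughly
like the sketch below. The class and the compute() signature are purely
illustrative of the model in the Pregel paper, not the API of any particular
framework:

  // Illustrative only: a toy Pregel-style vertex for BFS from a single source.
  // The framework pieces (superstep loop, message routing, halting) are assumed.
  public class BfsVertex {
    private final long id;
    private final long[] neighbors;              // adjacency stays with the vertex
    private int distance = Integer.MAX_VALUE;    // "not reached yet"

    public BfsVertex(long id, long[] neighbors) {
      this.id = id;
      this.neighbors = neighbors;
    }

    // Called once per superstep with the candidate distances sent by neighbors.
    // Returns the distance to propose to every neighbor, or -1 to go idle
    // (the equivalent of "vote to halt") when nothing improved.
    public int compute(int superstep, int[] incoming) {
      int best = (superstep == 0 && id == 0) ? 0 : distance;   // vertex 0 = source
      for (int d : incoming) {
        if (d < best) best = d;
      }
      if (best < distance) {
        distance = best;
        return distance + 1;   // the framework would send this to every neighbor
      }
      return -1;               // no change: stay quiet until new messages arrive
    }

    public int getDistance() { return distance; }
  }

The point is that the adjacency list stays resident with its vertex across
supersteps; only the small distance messages move between machines, instead of
the whole graph being reshuffled on every iteration as in a stateless
Map/Reduce pass.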

Notice that "scalability" and "speed" are different animals.  Hadoop is about 
scalability but not speed.  If your data can be squeezed into a single machine, 
then Hadoop is  not for you.  Otherwise, Hadoop is one great tool to 
parallelize 
your  processing.  It has a pretty constant (but significant) overhead that is  
justifiable when you have large amount of data.  For example, Hadoop is not for 
processing real-time streams.
http://horicky.blogspot.com/2010/11/map-reduce-and-stream-processing.html


Notice that ... Hadoop is just one of the tools that you can use, and you have
to decide whether there are better ones for your problem.


Rgds,
Ricky

-Original Message-
From: Ted Dunning [mailto:tdunn...@maprtech.com] 
Sent: Tuesday, December 21, 2010 7:29 AM
To: common-user@hadoop.apache.org
Subject: Re: breadth-first search
 
Ahh... I see what you mean.
 
This algorithm can be implemented with all of the iterations for all points
proceeding in parallel.  You should only need 4 map-reduce steps, not 400.
 This will still take several minutes on Hadoop, but as your problem
increases in size, this overhead becomes less important.
 
2010/12/21 Peng, Wei 
 
> The graph that my BFS algorithm is running on only needs 4 levels to reach
> all nodes. The reason I say "many iterations" is that there are 100 source
> nodes, so 400 iterations in total. The algorithm should be right, and I
> cannot think of anything to reduce the number of iterations.
> 
> Ted, I will check out the links that you sent to me.
> I really appreciate your help.
> 
> Wei
> -Original Message-
> From: Edward J. Yoon [mailto:edw...@udanax.org]
> Sent: Tuesday, December 21, 2010 1:27 AM
> To: common-user@hadoop.apache.org
> Subject: Re: breadth-first search
> 
> There's no release yet.
> 
> But I have tested BFS using HAMA and HBase.
> 
> Sent from my iPhone
> 
> On 2010. 12. 21., at 11:30 AM, "Peng, Wei"  wrote:
> 
> > Yoon,
> >
> > Can I use HAMA now, or is it still in development?
> >
> > Thanks
> >
> > Wei
> >
> > -Original Message-
> > From: Edward J. Yoon [mailto:edwardy...@apache.org]
> > Sent: Monday, December 20, 2010 6:23 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: breadth-first search
> >
> > Check this slide out -
> > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
> >
> > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei  wrote:
> >>
> >>  I implemented an algorithm to run on Hadoop over a 25GB graph to
> >> calculate its average separation length.
> >> The input format is V1(tab)V2 (where V2 is the friend of V1).
> >> My purpose is to first randomly select some seed nodes, and then for
> >> each seed node, calculate the shortest paths from this node to all other
> >> nodes on the graph.
> >>
> >> To do this, I first run a simple Python script on a single machine to get
> >> some random seed nodes.
> >> Then I run a Hadoop job to generate an adjacency list for each node as the
> >> input for the second job.
> >>
> >> The second job takes the adjacency-list input and outputs the first-level
> >> breadth-first search result. The nodes which are the friends of the seed
> >> node have distance 1. This output is then the input for the next Hadoop
> >> job, and so on, until all the nodes are reached.
> >>
> >> I generated a simulated graph for testing. This data has only 100 nodes.
> >> Plain Python code can find the separation length within 1 second (100
> >> seed nodes). However, Hadoop took almost 3 hours to do the same
> >> (pseudo-distributed mode on one machine)!!
> >>
> >> I wonder if there is a more efficient way to do breadth-first search in
> >> Hadoop? It is very inefficient to output so many intermediate results.
> >> In total there would be seedNodeNumber*levelNumber+1 jobs and
> >> seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?
> >>
> >> Please help.  Thanks!
> >>
> >> Wei
> >>
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > edwardy...@apache.org
> > http://blog.udanax.org
> 





Re: breadth-first search

2010-12-21 Thread Ted Dunning
Ahh... I see what you mean.

This algorithm can be implemented with all of the iterations for all points
proceeding in parallel.  You should only need 4 map-reduce steps, not 400.
 This will still take several minutes on Hadoop, but as your problem
increases in size, this overhead becomes less important.
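
Roughly, the idea is something like the sketch below (the record format
nodeId<TAB>src1:d1,src2:d2,...|neighbor,neighbor,... is assumed for
illustration, not taken from your actual jobs): each node's value carries its
best known distance from every seed, so a single map-reduce pass advances all
100 BFS frontiers by one level, and four chained passes cover the whole graph.

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MultiSourceBfs {

    public static class BfsMapper extends Mapper<LongWritable, Text, Text, Text> {
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        String[] nodeAndRest = value.toString().split("\t", 2);
        String[] distsAndAdj = nodeAndRest[1].split("\\|", -1);
        // Pass the node record through so the reducer keeps its adjacency list.
        ctx.write(new Text(nodeAndRest[0]), new Text("N|" + nodeAndRest[1]));
        if (distsAndAdj[0].isEmpty() || distsAndAdj[1].isEmpty()) {
          return;
        }
        // Propose distance+1 from every known seed to every neighbor at once.
        StringBuilder relaxed = new StringBuilder();
        for (String entry : distsAndAdj[0].split(",")) {
          String[] srcDist = entry.split(":");
          if (relaxed.length() > 0) relaxed.append(',');
          relaxed.append(srcDist[0]).append(':').append(Integer.parseInt(srcDist[1]) + 1);
        }
        for (String neighbor : distsAndAdj[1].split(",")) {
          ctx.write(new Text(neighbor), new Text("D|" + relaxed));
        }
      }
    }

    public static class BfsReducer extends Reducer<Text, Text, Text, Text> {
      protected void reduce(Text node, Iterable<Text> values, Context ctx)
          throws IOException, InterruptedException {
        String adjacency = "";
        Map<String, Integer> best = new HashMap<String, Integer>();
        for (Text v : values) {
          String[] parts = v.toString().split("\\|", 3);
          if (parts[0].equals("N")) {
            adjacency = parts[2];
          }
          merge(best, parts[1]);   // keep the minimum known distance per seed
        }
        StringBuilder dists = new StringBuilder();
        for (Map.Entry<String, Integer> e : best.entrySet()) {
          if (dists.length() > 0) dists.append(',');
          dists.append(e.getKey()).append(':').append(e.getValue());
        }
        ctx.write(node, new Text(dists + "|" + adjacency));
      }
    }

    private static void merge(Map<String, Integer> best, String dists) {
      if (dists.isEmpty()) return;
      for (String entry : dists.split(",")) {
        String[] srcDist = entry.split(":");
        int d = Integer.parseInt(srcDist[1]);
        Integer old = best.get(srcDist[0]);
        if (old == null || d < old) best.put(srcDist[0], d);
      }
    }

    // Driver: one job per BFS level; the graph in question needs only 4 levels.
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      String in = args[0];               // adjacency lists seeded with "seedId:0"
      for (int level = 1; level <= 4; level++) {
        String out = args[1] + "/level" + level;
        Job job = new Job(conf, "multi-source-bfs-" + level);
        job.setJarByClass(MultiSourceBfs.class);
        job.setMapperClass(BfsMapper.class);
        job.setReducerClass(BfsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(in));
        FileOutputFormat.setOutputPath(job, new Path(out));
        job.waitForCompletion(true);
        in = out;                        // this level's output feeds the next level
      }
    }
  }

Each seed starts out with a single entry for itself ("42:0") and every other
node with an empty distance list; after the fourth pass, every node's list
holds its distance from each of the 100 seeds, and averaging those gives the
separation length.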

2010/12/21 Peng, Wei 

> The graph that my BFS algorithm is running on only needs 4 levels to reach
> all nodes. The reason I say "many iterations" is that there are 100 source
> nodes, so 400 iterations in total. The algorithm should be right, and I
> cannot think of anything to reduce the number of iterations.
>
> Ted, I will check out the links that you sent to me.
> I really appreciate your help.
>
> Wei
> -Original Message-
> From: Edward J. Yoon [mailto:edw...@udanax.org]
> Sent: Tuesday, December 21, 2010 1:27 AM
> To: common-user@hadoop.apache.org
> Subject: Re: breadth-first search
>
> There's no release yet.
>
> But I have tested BFS using HAMA and HBase.
>
> Sent from my iPhone
>
> On 2010. 12. 21., at 11:30 AM, "Peng, Wei"  wrote:
>
> > Yoon,
> >
> > Can I use HAMA now, or is it still in development?
> >
> > Thanks
> >
> > Wei
> >
> > -Original Message-
> > From: Edward J. Yoon [mailto:edwardy...@apache.org]
> > Sent: Monday, December 20, 2010 6:23 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: breadth-first search
> >
> > Check this slide out -
> > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
> >
> > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei  wrote:
> >>
> >>  I implemented an algorithm to run on Hadoop over a 25GB graph to
> >> calculate its average separation length.
> >> The input format is V1(tab)V2 (where V2 is the friend of V1).
> >> My purpose is to first randomly select some seed nodes, and then for
> >> each seed node, calculate the shortest paths from this node to all other
> >> nodes on the graph.
> >>
> >> To do this, I first run a simple Python script on a single machine to get
> >> some random seed nodes.
> >> Then I run a Hadoop job to generate an adjacency list for each node as the
> >> input for the second job.
> >>
> >> The second job takes the adjacency-list input and outputs the first-level
> >> breadth-first search result. The nodes which are the friends of the seed
> >> node have distance 1. This output is then the input for the next Hadoop
> >> job, and so on, until all the nodes are reached.
> >>
> >> I generated a simulated graph for testing. This data has only 100 nodes.
> >> Plain Python code can find the separation length within 1 second (100
> >> seed nodes). However, Hadoop took almost 3 hours to do the same
> >> (pseudo-distributed mode on one machine)!!
> >>
> >> I wonder if there is a more efficient way to do breadth-first search in
> >> Hadoop? It is very inefficient to output so many intermediate results.
> >> In total there would be seedNodeNumber*levelNumber+1 jobs and
> >> seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?
> >>
> >> Please help.  Thanks!
> >>
> >> Wei
> >>
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > edwardy...@apache.org
> > http://blog.udanax.org
>


RE: breadth-first search

2010-12-21 Thread Peng, Wei
The graph that my BFS algorithm is running on only needs 4 levels to reach all 
nodes. The reason I say "many iterations" is that there are 100 source nodes,
so 400 iterations in total. The algorithm should be right, and I cannot think of
anything to reduce the number of iterations.

Ted, I will check out the links that you sent to me.
I really appreciate your help.

Wei
-Original Message-
From: Edward J. Yoon [mailto:edw...@udanax.org] 
Sent: Tuesday, December 21, 2010 1:27 AM
To: common-user@hadoop.apache.org
Subject: Re: breadth-first search

There's no release yet.

But I have tested BFS using HAMA and HBase.

Sent from my iPhone

On 2010. 12. 21., at 11:30 AM, "Peng, Wei"  wrote:

> Yoon,
> 
> Can I use HAMA now, or is it still in development?
> 
> Thanks
> 
> Wei
> 
> -Original Message-
> From: Edward J. Yoon [mailto:edwardy...@apache.org] 
> Sent: Monday, December 20, 2010 6:23 PM
> To: common-user@hadoop.apache.org
> Subject: Re: breadth-first search
> 
> Check this slide out -
> http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
> 
> On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei  wrote:
>> 
>>  I implemented an algorithm to run on Hadoop over a 25GB graph to
>> calculate its average separation length.
>> The input format is V1(tab)V2 (where V2 is the friend of V1).
>> My purpose is to first randomly select some seed nodes, and then for
>> each seed node, calculate the shortest paths from this node to all other
>> nodes on the graph.
>>
>> To do this, I first run a simple Python script on a single machine to get
>> some random seed nodes.
>> Then I run a Hadoop job to generate an adjacency list for each node as the
>> input for the second job.
>>
>> The second job takes the adjacency-list input and outputs the first-level
>> breadth-first search result. The nodes which are the friends of the seed
>> node have distance 1. This output is then the input for the next Hadoop
>> job, and so on, until all the nodes are reached.
>>
>> I generated a simulated graph for testing. This data has only 100 nodes.
>> Plain Python code can find the separation length within 1 second (100
>> seed nodes). However, Hadoop took almost 3 hours to do the same
>> (pseudo-distributed mode on one machine)!!
>>
>> I wonder if there is a more efficient way to do breadth-first search in
>> Hadoop? It is very inefficient to output so many intermediate results.
>> In total there would be seedNodeNumber*levelNumber+1 jobs and
>> seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?
>> 
>> Please help.  Thanks!
>> 
>> Wei
>> 
> 
> 
> 
> -- 
> Best Regards, Edward J. Yoon
> edwardy...@apache.org
> http://blog.udanax.org