Re: GSOC Mahout.GA, next steps ?

2008-05-28 Thread Grant Ingersoll
This sounds good.  I don't know a lot about GAs, so if others have  
insight, that would be great.  It would also be handy if you could put  
up a section on the Wiki about GAs and maybe post some links to basic  
papers there, so people that aren't familiar can go do some background  
reading.


I will try to get to MAHOUT-56 this week, but others can jump in and  
review as well.


-Grant

On May 27, 2008, at 4:52 AM, deneche abdelhakim wrote:

In a GA there are many things that can be distributed, and one  
should always start with the most compute demanding task . This is  
very problem dependent, but in most cases the fitness evaluation  
function (FEF) "is" the part to distribute.


The FEF evaluates each single individual in the population, and it  
may need some datas (D) to do so. For example in the traveling  
Salesman Problem, the problem is defined by a set of cities and the  
distances between them, the FEF needs those distances to evaluate  
the individuals.


I see 2 ways to distribute the FEF:

A. if the datas D is not big and can fit in each single cluster  
node, then the easiest solution is to use each Mapper to evaluate  
one individual and to pass the Datas D to all the mappers (using  
some Job parameter or the DistributedCache). The input of the job is  
the population of individuals. For someone used to work with  
Watchmaker, the solution A is straightforward, he needs to change  
one line of code.


B. if the datas D are really big and span over multiple nodes, then  
the FEF should be writen in the form of Mappers-Reducers, the  
population of individuals is passed to all the mappers (again using  
the DistributedCache or a Job parameter) and the datas D are now the  
input of the Job.


[MAHOUT-56] contains a possible implementation for solution A. Now I  
should start thinking about solution B and all I need is a problem  
that uses very big datasets. I already proposed one in my GSoC  
proposal, it consists of using a Genetic Algorithm to find good  
binary classification rule for a given dataset. But I am open to any  
other suggestion.


__
Do You Yahoo!?
En finir avec le spam? Yahoo! Mail vous offre la meilleure  
protection possible contre les messages non sollicités

http://mail.yahoo.fr Yahoo! Mail





Re: GSOC Mahout.GA, next steps ?

2008-05-28 Thread Ted Dunning
Conceptually, at least, it would be good to have the option for fitness
functions to be expressed as map-reduce programs.  Unfortunately, having
mappers spawn MR programs runs the real risk of dead-lock, especially on
less than grandiose clusters.

To me, that indicates that if the fitness function is nasty enough to
require map-reduce to compute, then either:

a) the executive that manages the population and generates mutations should
be written in sequential form

or

b) the evolutionary algorithm has to be written in such a way as to be able
to manipulate a map-reduce program so that evolution and evaluation can be
merged into a single (composite) map-reduce program.

I vote for (a) because if fitness computations are so complex as to need MR,
then the cost of sorting the population will be negligible.

This raises the question of how the population should be communicated to the
parallel evaluator.


On Tue, May 27, 2008 at 1:52 AM, deneche abdelhakim <[EMAIL PROTECTED]>
wrote:

> In a GA there are many things that can be distributed, and one should
> always start with the most compute demanding task . This is very problem
> dependent, but in most cases the fitness evaluation function (FEF) "is" the
> part to distribute.
>
> The FEF evaluates each single individual in the population, and it may need
> some datas (D) to do so. For example in the traveling Salesman Problem, the
> problem is defined by a set of cities and the distances between them, the
> FEF needs those distances to evaluate the individuals.
>
> I see 2 ways to distribute the FEF:
>
> A. if the datas D is not big and can fit in each single cluster node, then
> the easiest solution is to use each Mapper to evaluate one individual and to
> pass the Datas D to all the mappers (using some Job parameter or the
> DistributedCache). The input of the job is the population of individuals.
> For someone used to work with Watchmaker, the solution A is straightforward,
> he needs to change one line of code.
>
> B. if the datas D are really big and span over multiple nodes, then the FEF
> should be writen in the form of Mappers-Reducers, the population of
> individuals is passed to all the mappers (again using the DistributedCache
> or a Job parameter) and the datas D are now the input of the Job.
>
> [MAHOUT-56] contains a possible implementation for solution A. Now I should
> start thinking about solution B and all I need is a problem that uses very
> big datasets. I already proposed one in my GSoC proposal, it consists of
> using a Genetic Algorithm to find good binary classification rule for a
> given dataset. But I am open to any other suggestion.
>
> __
> Do You Yahoo!?
> En finir avec le spam? Yahoo! Mail vous offre la meilleure protection
> possible contre les messages non sollicités
> http://mail.yahoo.fr Yahoo! Mail
>



-- 
ted


Re: GSOC Mahout.GA, next steps ?

2008-05-28 Thread deneche abdelhakim
> Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> Conceptually, at least, it would be good to have the option for fitness
> functions to be expressed as map-reduce programs.  Unfortunately, having
> mappers spawn MR programs runs the real risk of dead-lock, especially on
> less than grandiose clusters.
> 
> To me, that indicates that if the fitness function is nasty enough to
> require map-reduce to compute, then either:
> 
> a) the executive that manages the population and generates mutations 
> should be written in sequential form
> 
> or
> 
> b) the evolutionary algorithm has to be written in such a way as to be 
> able to manipulate a map-reduce program so that evolution and evaluation > 
> can be merged into a single (composite) map-reduce program.
> 
> I vote for (a) because if fitness computations are so complex as to need > 
> MR, then the cost of sorting the population will be negligible.


(a) has another advantage too: one can start by writing its program in a 
sequential form, test it with a small dataset, then rewrite only the fitness 
function in a M-R form.


> This raises the question of how the population should be communicated to > 
> the parallel evaluator.


I don't know if there are many ways to do it in Hadoop, but how about writing 
the population into a file and pass it with the DistributedCache ?

-- 
abdelhakim

__
Do You Yahoo!?
En finir avec le spam? Yahoo! Mail vous offre la meilleure protection possible 
contre les messages non sollicités 
http://mail.yahoo.fr Yahoo! Mail


Re: GSOC Mahout.GA, next steps ?

2008-05-28 Thread Ted Dunning
How about writing the population to the file and using it as input to
map-reduce directly?  Evaluations that fit into a map can obviously handle
that.  Evaluations that need to be parallelized, can have the map emit
multiple copies with sequential keys.  This will put a copy of each input
into multiple reduces that can do a piece of the evaluation.  The pieces can
be put back together in a second MR step.

I even think that you should be able to demonstrate that any MR program that
evaluates a single population member can be applied to multiple population
members with slight modifications.

On Wed, May 28, 2008 at 11:44 AM, deneche abdelhakim <[EMAIL PROTECTED]>
wrote:

>
>
> I don't know if there are many ways to do it in Hadoop, but how about
> writing the population into a file and pass it with the DistributedCache ?
>
>

-- 
ted