Cool, thanks! Sorry I have been so quiet. Do feel free to ask more
questions if you have them. I hope to finally have a bunch of
personal things behind me very soon, and then I will be able to focus
some time on Mahout again.
-Grant
On Jun 9, 2008, at 6:14 AM, deneche abdelhakim wrote:
I found a cool introduction to evolutionary algorithms and added it
to the wiki, in case anyone is interested...
--- On Wed 28.5.08, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
From: Grant Ingersoll <[EMAIL PROTECTED]>
Subject: Re: GSOC Mahout.GA, next steps ?
To: mahout-dev@lucene.apache.org
Date: Wednesday 28 May 2008, 13:11
This sounds good. I don't know a lot about GAs, so if others have
insight, that would be great. It would also be handy if you could put
up a section on the Wiki about GAs and maybe post some links to basic
papers there, so people that aren't familiar can go do some background
reading.
I will try to get to MAHOUT-56 this week, but others can jump in and
review as well.
-Grant
On May 27, 2008, at 4:52 AM, deneche abdelhakim wrote:
In a GA there are many things that can be distributed, and one
should always start with the most compute-demanding task. This is
very problem dependent, but in most cases the fitness evaluation
function (FEF) is the part to distribute.
The FEF evaluates every individual in the population, and it may
need some data (D) to do so. For example, in the Traveling Salesman
Problem, the problem is defined by a set of cities and the distances
between them; the FEF needs those distances to evaluate the
individuals.
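To make the TSP example concrete, here is a minimal sketch in plain Java. It does not use Watchmaker or Hadoop classes; the names `TspFitness` and `tourLength` are hypothetical and chosen only for illustration. The distance matrix plays the role of the data D, an individual is assumed to be a permutation of city indices, and its fitness is taken to be the length of the closed tour (shorter is better).

```java
public class TspFitness {

    /**
     * Length of the closed tour described by the individual.
     * tour      : permutation of city indices (the individual)
     * distances : distances[i][j] is the distance from city i to city j (the data D)
     */
    public static double tourLength(int[] tour, double[][] distances) {
        double total = 0.0;
        for (int i = 0; i < tour.length; i++) {
            int from = tour[i];
            int to = tour[(i + 1) % tour.length]; // wrap around to the start city
            total += distances[from][to];
        }
        return total;
    }

    public static void main(String[] args) {
        // Three cities with symmetric distances (made-up numbers).
        double[][] distances = {
            {0, 1, 4},
            {1, 0, 2},
            {4, 2, 0}
        };
        // Tour 0 -> 1 -> 2 -> 0 has length 1 + 2 + 4.
        System.out.println(tourLength(new int[] {0, 1, 2}, distances)); // prints 7.0
    }
}
```

Note that the evaluation of one individual only reads D and never touches other individuals, which is what makes the FEF easy to run in parallel.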
I see 2 ways to distribute the FEF:
A. If the data D are not big and can fit on every cluster node, then
the easiest solution is to use each Mapper to evaluate one individual
and to pass the data D to all the mappers (using a Job parameter or
the DistributedCache). The input of the job is the population of
individuals. For someone used to working with Watchmaker, solution A
is straightforward: only one line of code needs to change.
B. If the data D are really big and span multiple nodes, then the
FEF should be written in the form of Mappers-Reducers: the population
of individuals is passed to all the mappers (again using the
DistributedCache or a Job parameter), and the data D are now the
input of the Job.
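The difference between A and B can be sketched in plain Java, without the Hadoop API; the loops below only stand in for map and reduce calls. Everything here is an assumption made for illustration: the class and method names are hypothetical, an individual is a toy threshold rule "value > threshold", and the fitness is the number of rows it classifies correctly. Solution B as sketched also assumes that the fitness decomposes into a sum over the rows of D, so that partial scores from different splits can simply be added.

```java
import java.util.Arrays;
import java.util.List;

public class DistributedFitnessSketch {

    /** One labeled row of the data D. */
    public static final class Row {
        final double value;
        final boolean label;
        public Row(double value, boolean label) {
            this.value = value;
            this.label = label;
        }
    }

    /** 1 if the rule "value > threshold" classifies the row correctly, else 0. */
    public static int score(double threshold, Row row) {
        return ((row.value > threshold) == row.label) ? 1 : 0;
    }

    /** Solution A: the input is the population; each "map" call sees all of D. */
    public static int[] evaluateByIndividual(double[] population, List<Row> data) {
        int[] fitness = new int[population.length];
        for (int i = 0; i < population.length; i++) { // one map call per individual
            for (Row row : data) {
                fitness[i] += score(population[i], row);
            }
        }
        return fitness;
    }

    /** Solution B: the input is D in splits; each "map" call sees the whole population. */
    public static int[] evaluateByData(double[] population, List<List<Row>> splits) {
        int[] fitness = new int[population.length];
        for (List<Row> split : splits) {                  // one map call per data split
            int[] partial = new int[population.length];   // (individual index, partial score)
            for (Row row : split) {
                for (int i = 0; i < population.length; i++) {
                    partial[i] += score(population[i], row);
                }
            }
            for (int i = 0; i < population.length; i++) { // "reduce": sum partials per individual
                fitness[i] += partial[i];
            }
        }
        return fitness;
    }

    public static void main(String[] args) {
        List<Row> data = Arrays.asList(
            new Row(1.0, false), new Row(2.0, false),
            new Row(3.0, true), new Row(4.0, true));
        double[] population = {0.5, 2.5, 3.5};
        int[] a = evaluateByIndividual(population, data);
        int[] b = evaluateByData(population,
            Arrays.asList(data.subList(0, 2), data.subList(2, 4)));
        System.out.println(Arrays.equals(a, b)); // both strategies give the same fitness values
    }
}
```

In A the shared, read-only piece is D and the work is split by individual; in B the shared piece is the population and the work is split by data, with a reduce step to combine the partial scores.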
[MAHOUT-56] contains a possible implementation for solution A. Now I
should start thinking about solution B, and all I need is a problem
that uses very big datasets. I already proposed one in my GSoC
proposal: using a Genetic Algorithm to find good binary
classification rules for a given dataset. But I am open to any other
suggestions.
--
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ