I've written my proposal, and because I can no longer change it after I submit 
it to GSoC, I'm posting it here first.
If anyone has suggestions, you are welcome to share them.
I will wait until Saturday morning before submitting it to GSoC.

**************************************************************************************
Application for Summer of Code 2008    Mahout Project

Deneche Abdel Hakim

Codename Mahout.GA


I. Synopsis

I will add a genetic algorithm (GA) for binary classification on large datasets 
to the Mahout project. To save time I will use an existing genetic algorithm 
framework, WatchMaker [WatchMaker], which is available under the Apache 
Software License. I will also add a parallelized measure that indicates the 
quality of classification rules on a given dataset; this measure will be 
available independently of the GA. And if I have enough time, I will make the 
GA more generic and apply it to a different problem (multiclass classification).


II. Project

A GA works by evolving a population of individuals toward a desired goal. To 
get a satisfying solution, the GA needs to run thousands of iterations with 
hundreds of individuals. For each iteration and each individual a fitness is 
calculated; it indicates how close that individual is to the desired solution. 
The main advantage of GAs is their ability to find solutions to problems given 
only a fitness measure (and, of course, sufficient CPU power); this is 
particularly helpful when the problem is complex and no mathematical solution 
is available.
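The evaluate/select/recombine cycle described above can be sketched in plain Java. This is only an illustrative toy, not code from Mahout or WatchMaker: the fitness here simply counts 1-bits (a stand-in for a real rule-quality measure), selection is a two-way tournament, and mutation flips a single bit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch of the GA cycle: evaluate fitness, select parents,
// mutate to produce the next generation. All names are illustrative.
public class SimpleGA {
    static final Random RND = new Random(42);

    // Toy fitness: closeness to the all-ones goal (count of true bits).
    static double fitness(boolean[] ind) {
        int ones = 0;
        for (boolean b : ind) if (b) ones++;
        return ones;
    }

    // Two-way tournament selection: pick two at random, keep the fitter.
    static boolean[] tournament(List<boolean[]> pop) {
        boolean[] a = pop.get(RND.nextInt(pop.size()));
        boolean[] b = pop.get(RND.nextInt(pop.size()));
        return fitness(a) >= fitness(b) ? a : b;
    }

    // Mutation: flip one random bit of a copy of the parent.
    static boolean[] mutate(boolean[] parent) {
        boolean[] child = parent.clone();
        int i = RND.nextInt(child.length);
        child[i] = !child[i];
        return child;
    }

    public static boolean[] evolve(int popSize, int bits, int generations) {
        // Random initial population.
        List<boolean[]> pop = new ArrayList<>();
        for (int i = 0; i < popSize; i++) {
            boolean[] ind = new boolean[bits];
            for (int j = 0; j < bits; j++) ind[j] = RND.nextBoolean();
            pop.add(ind);
        }
        // Main loop: each generation is built from selected, mutated parents.
        for (int g = 0; g < generations; g++) {
            List<boolean[]> next = new ArrayList<>();
            for (int i = 0; i < popSize; i++) next.add(mutate(tournament(pop)));
            pop = next;
        }
        // Return the fittest individual of the final generation.
        boolean[] best = pop.get(0);
        for (boolean[] ind : pop) if (fitness(ind) > fitness(best)) best = ind;
        return best;
    }
}
```

In the actual project the cycle itself would come from WatchMaker; only the fitness evaluation would be custom, which is exactly the part that will be distributed with Hadoop.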

My primary goal is to implement the GA described in [GA]. It uses a fitness 
function that is easy to implement and can benefit from the Map-Reduce strategy 
to exploit distributed computing (when the training dataset is very large). It 
will be available as a ready-to-use tool (Mahout.GA) that discovers binary 
classification rules for any given dataset. Concretely, the main program will 
launch the GA using WatchMaker; each time the GA needs to evaluate the fitness 
of the population, it will call a specific class provided by us, which will 
configure and launch a Hadoop Job on a distributed cluster.
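To illustrate why this kind of rule fitness maps naturally onto Map-Reduce, here is a plain-Java sketch (not actual Hadoop code; the threshold rule and all class names are hypothetical): each "map" counts a rule's confusion-matrix entries on one partition of the dataset, and the "reduce" sums the partial counts before computing a score such as sensitivity x specificity, one common quality measure for binary classification rules.

```java
import java.util.List;

// Hypothetical sketch of a partition-wise rule evaluation.
public class RuleFitnessSketch {
    // One training instance: attribute values plus the true binary label.
    public static class Instance {
        final double[] attrs;
        final boolean label;
        public Instance(double[] attrs, boolean label) {
            this.attrs = attrs;
            this.label = label;
        }
    }

    // Partial confusion-matrix counts produced by a "map" over one partition.
    public static class Counts {
        long tp, fp, tn, fn;
        Counts add(Counts o) { tp += o.tp; fp += o.fp; tn += o.tn; fn += o.fn; return this; }
    }

    // "Map" step: apply a toy rule (predict positive when attrs[attr] >
    // threshold) to every instance of one partition and count the outcomes.
    public static Counts map(List<Instance> partition, int attr, double threshold) {
        Counts c = new Counts();
        for (Instance in : partition) {
            boolean predicted = in.attrs[attr] > threshold;
            if (predicted && in.label) c.tp++;
            else if (predicted) c.fp++;
            else if (in.label) c.fn++;
            else c.tn++;
        }
        return c;
    }

    // "Reduce" step: sum the partial counts, then compute
    // sensitivity x specificity from the totals.
    public static double fitness(List<Counts> partials) {
        Counts total = new Counts();
        for (Counts c : partials) total.add(c);
        double se = total.tp + total.fn == 0 ? 0 : (double) total.tp / (total.tp + total.fn);
        double sp = total.tn + total.fp == 0 ? 0 : (double) total.tn / (total.tn + total.fp);
        return se * sp;
    }
}
```

Because the per-partition counts are independent and combine by simple addition, the map step can run on as many slave nodes as the dataset has partitions; only the small count tuples travel to the reducer.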

My secondary goal is to make Mahout.GA problem independent, thus allowing us to 
use it for different problems such as multiclass classification, optimization, 
or clustering. This will be done by implementing a ready-to-use generic fitness 
function for WatchMaker that internally calls Hadoop. As a proof of concept, I 
will use it for multiclass classification (if I don't run out of time, of 
course!).
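One possible shape for such a generic fitness function, sketched in plain Java with hypothetical names (none of this is actual Mahout, WatchMaker, or Hadoop API): the user supplies only a per-partition evaluator, and a generic driver handles the splitting, which is the part a real Hadoop Job would distribute across the cluster.

```java
import java.util.List;

// Hypothetical sketch of separating the problem-dependent part (the
// per-partition evaluator) from the generic driver.
public class GenericFitnessSketch {
    // The only piece a user would implement for a new problem.
    public interface PartitionEvaluator<T> {
        // Partial fitness of one candidate on one partition of the data.
        double partialFitness(T candidate, List<double[]> partition);
    }

    // Generic driver: split the data into partitions, evaluate each one
    // (the step a Hadoop job would distribute), and sum the partial scores.
    public static <T> double fitness(T candidate, List<double[]> data,
                                     int numPartitions, PartitionEvaluator<T> eval) {
        double total = 0;
        int chunk = (data.size() + numPartitions - 1) / numPartitions;
        for (int start = 0; start < data.size(); start += chunk) {
            List<double[]> partition =
                data.subList(start, Math.min(start + chunk, data.size()));
            total += eval.partialFitness(candidate, partition);
        }
        return total;
    }
}
```

The key design property is that the result must not depend on how the data is partitioned, so the partial scores can be computed on different nodes in any order.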


III. Benefits for Mahout

1.The GA will be integrated into Mahout as a ready-to-use rule-discovery tool 
for binary classification;
2.Explore the integration of existing frameworks with Mahout, for example how 
to design the program so that the framework libraries are not needed on the 
slave nodes (technically it's feasible, but I still need to learn how to do 
it);
3.The parallelized fitness function can be used independently of Mahout.GA. 
It's a good measure of the quality of binary classification rules;
4.Simplify the process of using Mahout.GA for other problems. The user will 
still need to design the solution representation and implement a fitness 
function, but all the Hadoop plumbing should be hidden, or at least made 
simpler;
5.Apply the generalized Mahout.GA to multiclass classification and write a 
corresponding tutorial that explains how to use Mahout.GA to solve new problems.


IV. Success Criteria

Main goals
  1.Implement the parallelized fitness function described in [GA] and validate 
its results on a small dataset;
  2.Implement Mahout.GA for binary classification rule discovery. A simpler 
(non-parallelized) version of this algorithm will also be implemented to 
validate the results of Mahout.GA;

Secondary goals
  1.Allow the parallelized fitness function to be used independently of 
Mahout.GA;
  2.Use Mahout.GA on a different problem (multiclass classification) and write 
a corresponding tutorial.


V. Roadmap

[April, 14: accepted students known]
  1.Familiarize myself with Hadoop
    Modify one of Hadoop's examples to simulate an iterative process. For 
each iteration, a new Job is executed with different parameters, and its 
results are read back by the program.
  2.Implement the GA without parallelism
    a.Start by implementing the tutorial example that comes with WatchMaker;
    b.Implement my own Individual and Fitness function classes;
    c.Validate the algorithm using a small dataset, and find the parameters 
that will give acceptable results.
  3.Prepare whatever I may need in the development period
[May, 26 coding starts]
  4.Implement the parallelized fitness function
    a.Use Hadoop Map-Reduce to implement it [2 weeks];
    b.Validate it on a small dataset [1 week].
  5.Implement Mahout.GA
    a.Write an intermediary component between WatchMaker and the parallelized 
fitness function. This component takes a population, configures and launches a 
Job, waits for its end, then returns the calculated fitness values [2 weeks];
    b.Validate Mahout.GA by comparing its results with the GA without 
parallelism [1 week].
[July, 7-14 mid term evaluation]
  6.Generic Mahout.GA
    a.Identify the problem-dependent components, and make them as independent 
of Hadoop as possible [2 weeks];
    b.Implement the components for the multiclass classification problem and 
validate Mahout.GA on a given dataset [2 weeks];
    c.Write a tutorial that explains how to use Mahout.GA to solve new problems 
(in this case the multiclass classification problem) [in parallel with 5.b].
[August, 11 suggested 'pencils down' date]
  Clean the code and arrange the documentation.
[August, 18 final evaluations]

Note that this plan may change given my interaction with my Mentor and the 
Mahout community.

VI. Biography

I am a PhD student at the University Mentouri of Constantine. My primary 
research goal is a framework that helps build Intelligent Adaptive Systems. I 
am still in my first year, and there is a good chance that I will be working 
on Distributed Evolutionary Algorithms for the next three years.

For my Master's, I worked on Artificial Immune Systems. I applied them to 
handwritten digit recognition [PatternRecognition] and Multiple Sequence 
Alignment (bioinformatics) [BioInformatics]. I also built a feature selection 
operator for Yale (but for lack of time I never published it), and 
participated in an internship at the LIFL laboratory (Lille, France), where I 
implemented several operators for a C++ evolutionary computation framework 
[ParadisEO].

In parallel with my Master's, I worked as a freelance programmer for my 
university. I developed a Java scholar management system using Eclipse, 
TortoiseSVN and many open source libraries. I gained good experience in 
project management (how to make a realistic plan and stick to it) and open 
source development (how to choose a good open source library, use it, and 
work around known bugs).

VII. References
[GA]    Bojarczuk CC, Lopes HS, and Freitas AA. "Discovering comprehensible 
classification rules using genetic programming: a case study in a medical 
domain". Proc. Genetic and Evolutionary Computation Conference GECCO99, 
953-958. Orlando, FL, USA, July 1999.

[WatchMaker]    https://watchmaker.dev.java.net/

[PatternRecognition]    S. Meshoul, A. Deneche, M. Batouche, "Combining an 
Artificial Immune System with a Clustering Method for Effective Pattern 
Recognition", International Conference on Machine Intelligence ICMI’05, pp. 
782-787, Tunis 2005.

[BioInformatics]    A. Layeb, A. Deneche, "Multiple Sequence Alignment by 
Immune Artificial System", ACS/IEEE International Conference on Computer 
Systems and Applications AICCSA’07, Jordan 2007.

[ParadisEO]    
http://paradiseo.gforge.inria.fr/index.php?n=Paradiseo.Home?from=Main.HomePage

----------------------------------------------------------------------------------------------
This proposal is inspired by the excellent one by Konstantin Kafer 
[http://drupal.org/files/application.pdf]
*********************************************************************************************************



      