I've written my proposal. Because I can no longer change it once I submit it to GSoC, I am posting it here first; if anyone has suggestions, you are welcome. I will wait until Saturday morning to submit it to GSoC.
**************************************************************************************
Application for Summer of Code 2008
Mahout Project

Deneche Abdel Hakim
Codename Mahout.GA

I. Synopsis

I will add a genetic algorithm (GA) for binary classification on large datasets to the Mahout project. To save time I will use an existing genetic algorithm framework, WatchMaker [WatchMaker], which is available under an Apache Software License. I will also add a parallelized measure that indicates the quality of classification rules on a given dataset; this measure will be available independently of the GA. If I have enough time, I will make the GA more generic and apply it to a different problem (multiclass classification).

II. Project

A GA works by evolving a population of individuals toward a desired goal. To reach a satisfying solution, the GA needs to run thousands of iterations with hundreds of individuals. At each iteration the fitness of every individual is calculated; it indicates how close that individual is to the desired solution. The main advantage of GAs is their ability to find solutions to problems given only a fitness measure (and, of course, sufficient CPU power). This is particularly helpful when the problem is complex and no mathematical solution is available.

My primary goal is to implement the GA described in [GA]. It uses a fitness function that is easy to implement and can benefit from the Map-Reduce strategy to exploit distributed computing (when the training dataset is very large). It will be available as a ready-to-use tool (Mahout.GA) that discovers binary classification rules for any given dataset. Concretely, the main program will launch the GA using WatchMaker. Each time the GA needs to evaluate the fitness of the population, it calls a specific class that we provide; this class configures and launches a Hadoop Job on a distributed cluster.
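To illustrate the kind of computation involved, here is a minimal sketch, not the final implementation. It assumes that the fitness of a rule is sensitivity times specificity (an assumption: the text above does not spell out the formula of [GA]), and it simulates the map and reduce steps in plain Java instead of calling Hadoop. The names RuleFitnessSketch, Example, map, reduce, fitness and evaluate are hypothetical and chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class RuleFitnessSketch {
    // Positions of the four confusion counts inside a long[4].
    static final int TP = 0, FP = 1, TN = 2, FN = 3;

    /** One training example: numeric attributes plus its true binary class. */
    static final class Example {
        final double[] attributes;
        final boolean positive;

        Example(double[] attributes, boolean positive) {
            this.attributes = attributes;
            this.positive = positive;
        }
    }

    /** "Map" step: turn one example into confusion counts for the given rule. */
    static long[] map(Predicate<double[]> rule, Example example) {
        boolean predicted = rule.test(example.attributes);
        long[] counts = new long[4];
        if (predicted) {
            counts[example.positive ? TP : FP] = 1;
        } else {
            counts[example.positive ? FN : TN] = 1;
        }
        return counts;
    }

    /** "Reduce" step: add two partial count vectors without mutating either. */
    static long[] reduce(long[] a, long[] b) {
        long[] sum = new long[4];
        for (int i = 0; i < 4; i++) {
            sum[i] = a[i] + b[i];
        }
        return sum;
    }

    /** Assumed fitness: sensitivity * specificity, 0 if a class is absent. */
    static double fitness(long[] c) {
        long positives = c[TP] + c[FN];
        long negatives = c[TN] + c[FP];
        if (positives == 0 || negatives == 0) {
            return 0.0;
        }
        double sensitivity = (double) c[TP] / positives;
        double specificity = (double) c[TN] / negatives;
        return sensitivity * specificity;
    }

    /** Runs map over every example, reduces the counts, returns the fitness. */
    static double evaluate(Predicate<double[]> rule, List<Example> data) {
        long[] total = data.stream()
                .map(e -> map(rule, e))
                .reduce(new long[4], RuleFitnessSketch::reduce);
        return fitness(total);
    }

    public static void main(String[] args) {
        List<Example> data = Arrays.asList(
                new Example(new double[] {7}, true),
                new Example(new double[] {6}, true),
                new Example(new double[] {3}, true),
                new Example(new double[] {2}, false),
                new Example(new double[] {8}, false),
                new Example(new double[] {1}, false),
                new Example(new double[] {4}, false));
        // Candidate rule: "class is positive if the first attribute exceeds 5".
        Predicate<double[]> rule = a -> a[0] > 5;
        // TP=2, FN=1, TN=3, FP=1, so sensitivity is 2/3 and specificity is 3/4.
        System.out.println("fitness = " + evaluate(rule, data));
    }
}
```

Because the per-example counts are simply summed, the dataset can be split across many nodes and the partial sums combined in any order, which is what makes this kind of fitness function a natural fit for Map-Reduce.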

My secondary goal is to make Mahout.GA problem independent, so that it can be used for different problems such as multiclass classification, optimization, or clustering. This will be done by implementing a ready-to-use generic fitness function for WatchMaker that calls Hadoop internally. As a proof of concept I will use it for multiclass classification (if I don't run out of time, of course!).

III. Benefits for Mahout

1. The GA will be integrated with Mahout as a ready-to-use rule discovery tool for binary classification;
2. Explore the integration of existing frameworks with Mahout, for example how to design the program so that the framework libraries are not needed on the slave nodes (technically it's feasible, but I still need to learn how to do it);
3. The parallelized fitness function can be used independently of Mahout.GA. It's a good measure of the quality of binary classification rules;
4. Simplify the process of using Mahout.GA for other problems. The user will still need to design the representation of the solutions and implement a fitness function, but all the Hadoop-specific work should be hidden or at least made simpler;
5. Apply the generalized Mahout.GA to multiclass classification and write a corresponding tutorial that explains how to use Mahout.GA to solve new problems.

IV. Success Criteria

Main goals
1. Implement the parallelized fitness function described in [GA] and validate its results on a small dataset;
2. Implement Mahout.GA for binary classification rule discovery. A simpler (not parallelized) version of this algorithm should also be implemented to validate the results of Mahout.GA;

Secondary goals
1. Allow the parallelized fitness function to be used independently of Mahout.GA;
2. Use Mahout.GA on a different problem (multiclass classification) and write a corresponding tutorial.

V. Roadmap

[April 14: accepted students known]

1. Familiarize myself with Hadoop
   Modify one of the Hadoop examples to simulate an iterative process.
   For each iteration, a new Job is executed with different parameters, and its results are imported back by the program.
2. Implement the GA without parallelism
   a. Start by implementing the tutorial example that comes with WatchMaker;
   b. Implement my own Individual and Fitness function classes;
   c. Validate the algorithm using a small dataset, and find the parameters that give acceptable results.
3. Prepare whatever I may need for the development period

[May 26: coding starts]

4. Implement the parallelized fitness function
   a. Use Hadoop Map-Reduce to implement it [2 weeks];
   b. Validate it on a small dataset [1 week].
5. Implement Mahout.GA
   a. Write an intermediary component between WatchMaker and the parallelized fitness function. This component takes a population, configures and launches a Job, waits for it to finish, then returns the calculated fitness values [2 weeks];
   b. Validate Mahout.GA by comparing its results with the GA without parallelism [1 week].

[July 7-14: midterm evaluation]

6. Generic Mahout.GA
   a. Identify the components that are problem dependent, and make them as independent of Hadoop as possible [2 weeks];
   b. Implement the components for the multiclass classification problem and validate Mahout.GA on a given dataset [2 weeks];
   c. Write a tutorial that explains how to use Mahout.GA to solve new problems (in this case the multiclass classification problem) [in parallel with 5.b].

[August 11: suggested 'pencils down' date]

Clean up the code and organize the documentation.

[August 18: final evaluations]

Note that this plan may change depending on my interaction with my mentor and the Mahout community.

VI. Biography

I am a PhD student at the University Mentouri of Constantine. My primary research goal is a framework to help build Intelligent Adaptive Systems. I am still in my first year, and there is a good chance that I will be working on Distributed Evolutionary Algorithms for the next three years. For my Master's degree, I worked on Artificial Immune Systems.
I applied them to handwritten digit recognition [PatternRecognition] and Multiple Sequence Alignment (bioinformatics) [BioInformatics]. I also built a feature selection operator for Yale (for lack of time I never published it), and took part in an internship at the LIFL laboratory (Lille, France), where I implemented several operators for a C++ evolutionary computation framework [ParadisEO].

In parallel with my Master's degree, I worked as a freelance programmer for my university. I developed a Java school management system using Eclipse, TortoiseSVN and many open source libraries. I gained good experience in project management (how to make a realistic plan and stick to it) and open source development (how to choose a good open source library, use it, and work around known bugs).

VII. References

[GA] Bojarczuk CC, Lopes HS, and Freitas AA. "Discovering comprehensible classification rules using genetic programming: a case study in a medical domain". Proc. Genetic and Evolutionary Computation Conference GECCO99, 953-958. Orlando, FL, USA, July 1999.
[WatchMaker] https://watchmaker.dev.java.net/
[PatternRecognition] S. Meshoul, A. Deneche, M. Batouche, "Combining an Artificial Immune System with a Clustering Method for Effective Pattern Recognition", International Conference on Machine Intelligence ICMI'05, pp. 782-787, Tunis 2005.
[BioInformatics] A. Layeb, A. Deneche, "Multiple Sequence Alignment by Immune Artificial System", ACS/IEEE International Conference on Computer Systems and Applications AICCSA'07, Jordan 2007.
[ParadisEO] http://paradiseo.gforge.inria.fr/index.php?n=Paradiseo.Home?from=Main.HomePage

----------------------------------------------------------------------------------------------
This proposal is inspired by the excellent one by Konstantin Kafer [http://drupal.org/files/application.pdf]
*********************************************************************************************************