[ https://issues.apache.org/jira/browse/MAHOUT-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855104#action_12855104 ]
zhao zhendong commented on MAHOUT-334:
--------------------------------------

Is there any suggestion or comment on my proposal?

Proposal for GSoC2010 (Linear SVM for Mahout)
---------------------------------------------

Key: MAHOUT-334
URL: https://issues.apache.org/jira/browse/MAHOUT-334
Project: Mahout
Issue Type: Task
Reporter: zhao zhendong

Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
Student: Zhen-Dong Zhao
Student e-mail: zha...@comp.nus.edu.sg
Student Major: Multimedia Information Retrieval / Computer Science
Student Degree: Master; Graduation: NUS '10
Organization: Hadoop

0 Abstract

Linear Support Vector Machines (SVMs) are very useful in applications with large-scale datasets or high-dimensional features. This proposal will port one of the best-known linear SVM solvers, LIBLINEAR [1], to Mahout, behind the same unified interface as Pegasos [2], another linear SVM solver that I have almost finished implementing for Mahout. The two distinct contributions would be: 1) introducing LIBLINEAR to Mahout; 2) a unified interface for linear SVM classifiers.

1 Motivation

As one of the top 10 algorithms in the data mining community [3], the Support Vector Machine is a very powerful machine learning tool, widely adopted in data mining, pattern recognition, and information retrieval.

However, the SVM training procedure is quite slow, especially on large-scale datasets. Several recent papers propose SVM solvers with a linear kernel that can handle large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I have implemented a prototype linear SVM classifier based on Pegasos [2] for Mahout (issue MAHOUT-232). Nevertheless, as the winner of the linear SVM track of the ICML 2008 large-scale learning challenge (http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] ought to be incorporated into Mahout as well. Currently, the LIBLINEAR package supports:

(1) L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
(2) L1-regularized classifiers: L2-loss linear SVM and logistic regression (LR)

The main features of LIBLINEAR are the following:

(1) Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
(2) Cross validation for model selection
(3) Probability estimates (logistic regression only)
(4) Weights for unbalanced data

All of these functionalities are expected to be implemented except probability estimates and weights for unbalanced data (time permitting, I would like to implement those as well).

2 Unified Interfaces

The linear SVM classifier based on the Pegasos package on Mahout already provides the following functionality (http://issues.apache.org/jira/browse/MAHOUT-232):

(1) Sequential binary (two-class) classification, including sequential training and prediction;
(2) Sequential regression;
(3) Parallel and sequential multi-class classification, including one-vs.-one and one-vs.-others schemes.

Apparently, the functionality of the Pegasos package on Mahout and that of LIBLINEAR are quite similar. As mentioned above, in this section I introduce a unified interface for linear SVM classifiers on Mahout that will cover both Pegasos and LIBLINEAR.

The unified interface has two main parts: 1) a dataset loader; 2) the algorithms. I will introduce them separately.
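To fix ideas, here is a minimal sketch of the solver-facing side of such a unified interface. All names in it (LabeledVector, LinearSvmSolver, SvmModel) are hypothetical illustrations, not existing Mahout classes; only org.apache.mahout.math.Vector is real.

{code:java}
import org.apache.mahout.math.Vector;

/** One training example: a class label paired with a (possibly sparse) feature vector. */
class LabeledVector {
  final double label;   // class label; +1 / -1 for binary problems
  final Vector features;

  LabeledVector(double label, Vector features) {
    this.label = label;
    this.features = features;
  }
}

/** The contract that both the Pegasos and the LIBLINEAR solvers would implement. */
interface LinearSvmSolver {
  /** Fit a model on examples supplied by a data loader (section 2.1). */
  SvmModel train(Iterable<LabeledVector> examples);
}

/** Trained models are solver-independent, so prediction and evaluation code can be shared. */
interface SvmModel {
  /** Decision value w.x + b; its sign gives the predicted binary class. */
  double decisionValue(Vector features);
}
{code}

With this split, multi-class schemes and cross validation only ever see the LinearSvmSolver contract, so Pegasos and LIBLINEAR become interchangeable back ends.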
2.1 Data Handler

The dataset may be stored on a personal computer or on a Hadoop cluster. This framework provides a high-performance random loader and a sequential loader for accessing large-scale data.

2.2 Sequential Algorithms

The sequential algorithms will include binary classification and regression based on Pegasos and LIBLINEAR, behind the unified interface.

2.3 Parallel Algorithms

It is widely accepted that parallelizing a binary SVM classifier is hard. For multi-class classification, however, a coarse-grained scheme (e.g., each mapper or reducer runs one independent binary SVM classifier) can achieve a large improvement much more easily. Besides, cross validation for model selection can also take advantage of such coarse-grained parallelism. I will introduce a unified interface for all of them; a rough mapper sketch appears after the references below.

3 Biography

I am a graduating master's student working on multimedia information retrieval at the National University of Singapore. My research involves large-scale SVM classifiers.

I have worked with Hadoop and MapReduce for about a year, and I have dedicated much of my spare time to the sequential SVM (Pegasos) for Mahout (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken part in setting up and maintaining a Hadoop cluster of around 70 nodes in our group.

4 Timeline

Weeks 1-4 (May 24 ~ June 18): Implement the binary classifier.
Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class classification and cross validation for model selection.
Week 8 (July 12 ~ July 16): Submit the mid-term evaluation.
Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance tuning.
Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation, and testing.

5 References

[1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874, 2008.
[2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 807-814, New York, NY, USA, 2007. ACM.
[3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., 14(1):1-37, 2007.
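To make the coarse-grained scheme of section 2.3 concrete, here is a rough sketch of a one-vs-rest training mapper against Hadoop's old org.apache.hadoop.mapred API (current as of 2010). LabeledVector, LinearSvmSolver, and SvmModel come from the hypothetical sketch in section 2, and the helper methods here are placeholders as well; this is an illustration of the scheme, not the proposed implementation.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * One input record per class label: each map() call trains one
 * independent one-vs-rest binary classifier, so a k-class problem
 * trains k models in parallel across the mappers.
 */
public class OneVsRestTrainMapper extends MapReduceBase
    implements Mapper<IntWritable, Text, IntWritable, Text> {

  @Override
  public void map(IntWritable classLabel, Text trainingDataPath,
                  OutputCollector<IntWritable, Text> output,
                  Reporter reporter) throws IOException {
    // Relabel the shared dataset: +1 for this class, -1 for all others.
    List<LabeledVector> binary = new ArrayList<LabeledVector>();
    for (LabeledVector example : loadExamples(trainingDataPath.toString())) {
      double label = (example.label == classLabel.get()) ? 1.0 : -1.0;
      binary.add(new LabeledVector(label, example.features));
    }
    // Any solver behind the unified contract (Pegasos or LIBLINEAR) fits here.
    SvmModel model = createSolver().train(binary);
    output.collect(classLabel, new Text(serialize(model)));
  }

  // Placeholders: a real job would read vectors from HDFS, pick the solver
  // from the JobConf, and write the model in a proper serialized form.
  private LinearSvmSolver createSolver() { throw new UnsupportedOperationException("sketch"); }
  private Iterable<LabeledVector> loadExamples(String path) { throw new UnsupportedOperationException("sketch"); }
  private String serialize(SvmModel model) { throw new UnsupportedOperationException("sketch"); }
}
{code}

The same pattern would carry over to cross validation: each mapper trains on one fold split, and a reducer aggregates the validation accuracies to pick the best regularization parameter.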