[ 
https://issues.apache.org/jira/browse/MAHOUT-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855104#action_12855104
 ] 

zhao zhendong commented on MAHOUT-334:
--------------------------------------

Are there any suggestions or comments on my proposal?

> Proposal for GSoC2010 (Linear SVM for Mahout)
> ---------------------------------------------
>
>                 Key: MAHOUT-334
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-334
>             Project: Mahout
>          Issue Type: Task
>            Reporter: zhao zhendong
>
> Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> Student: Zhen-Dong Zhao
> Student e-mail: zha...@comp.nus.edu.sg
> Student Major: Multimedia Information Retrieval / Computer Science
> Student Degree: Master
> Student Graduation: NUS'10
> Organization: Hadoop
> 0 Abstract
> Linear Support Vector Machines (SVMs) are useful in applications with 
> large-scale datasets or high-dimensional features. This proposal will port 
> LIBLINEAR [1], one of the best-known linear SVM solvers, to Mahout, giving it 
> the same unified interface as the Pegasos [2] implementation in Mahout, 
> another linear SVM solver that I have almost finished. The two distinct 
> contributions are: 1) introducing LIBLINEAR to Mahout; 2) unified interfaces 
> for linear SVM classifiers.
> 1 Motivation
> As one of the top 10 algorithms identified by the data mining community [3], 
> the Support Vector Machine is a powerful machine learning tool, widely 
> adopted in data mining, pattern recognition, and information retrieval.
> SVM training is slow, however, especially on large-scale datasets. Several 
> recent papers propose linear-kernel SVM solvers that can handle large-scale 
> learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I have 
> implemented a prototype linear SVM classifier based on Pegasos [2] for Mahout 
> (issue: MAHOUT-232). Nevertheless, as the winner of the ICML 2008 large-scale 
> learning challenge (linear SVM track, 
> http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] should be 
> incorporated in Mahout too. Currently, the LIBLINEAR package supports:
>   (1) L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, 
> and logistic regression (LR)
>   (2) L1-regularized classifiers: L2-loss linear SVM and logistic regression 
> (LR)
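For reference, the formulations listed above correspond to the following optimization problems, as I understand them from the LIBLINEAR paper [1] and its documentation. The notation is my own summary rather than part of the proposal: (x_i, y_i), i = 1..l, are the training instances with labels y_i in {-1, +1}, w is the weight vector, and C > 0 is the penalty parameter.

```latex
\begin{align*}
\text{L2-regularized, L1-loss SVM:} \quad
  & \min_{w} \; \tfrac{1}{2} w^{\top} w
    + C \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i w^{\top} x_i\bigr) \\
\text{L2-regularized, L2-loss SVM:} \quad
  & \min_{w} \; \tfrac{1}{2} w^{\top} w
    + C \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i w^{\top} x_i\bigr)^{2} \\
\text{L2-regularized LR:} \quad
  & \min_{w} \; \tfrac{1}{2} w^{\top} w
    + C \sum_{i=1}^{l} \log\bigl(1 + e^{-y_i w^{\top} x_i}\bigr) \\
\text{L1-regularized, L2-loss SVM:} \quad
  & \min_{w} \; \lVert w \rVert_{1}
    + C \sum_{i=1}^{l} \max\bigl(0,\, 1 - y_i w^{\top} x_i\bigr)^{2} \\
\text{L1-regularized LR:} \quad
  & \min_{w} \; \lVert w \rVert_{1}
    + C \sum_{i=1}^{l} \log\bigl(1 + e^{-y_i w^{\top} x_i}\bigr)
\end{align*}
```

The L1-regularized variants differ from the L2-regularized ones only in the regularization term, which tends to produce sparse weight vectors.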
> The main features of LIBLINEAR are the following:
>   (1) Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
>   (2) Cross validation for model selection
>   (3) Probability estimates (logistic regression only)
>   (4) Weights for unbalanced data
> All of these features are to be implemented except probability estimates 
> and weights for unbalanced data (time permitting, I would like to implement 
> those as well).
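To make the cross-validation feature concrete, here is a minimal sketch of k-fold cross validation. It is written in Python only for brevity and is not LIBLINEAR's or Mahout's code. The function names (`k_fold_indices`, `cross_validate`) and the toy midpoint learner are hypothetical stand-ins for a real solver.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Split range(n) into k shuffled folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(train_fn, predict_fn, xs, ys, k=5, seed=0):
    """Return the fraction of held-out examples predicted correctly."""
    correct = 0
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = train_fn(train_x, train_y)
        correct += sum(predict_fn(model, xs[i]) == ys[i] for i in held_out)
    return correct / len(xs)


# Toy 1-D learner (an assumption for illustration, not an SVM):
# the model is the midpoint between the two class means.
def train_midpoint(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict_midpoint(model, x):
    return 1 if x > model else -1


# Two well-separated clusters on the real line.
xs = [i / 10 for i in range(10)] + [5 + i / 10 for i in range(10)]
ys = [-1] * 10 + [1] * 10
cv_accuracy = cross_validate(train_midpoint, predict_midpoint, xs, ys, k=5)
```

For model selection one would call `cross_validate` once per candidate parameter value (for example, per value of C) and keep the value with the best held-out accuracy.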
> 2 Unified Interfaces
> The Pegasos-based linear SVM classifier for Mahout already provides the 
> following functionality (http://issues.apache.org/jira/browse/MAHOUT-232):
>   (1) Sequential binary (two-class) classification, including sequential 
> training and prediction;
>   (2) Sequential regression;
>   (3) Parallel and sequential multi-class classification, including 
> One-vs.-One and One-vs.-Others schemes.
> The functionality of the Pegasos package for Mahout and that of LIBLINEAR 
> are clearly quite similar. This section therefore introduces a unified 
> interface for linear SVM classifiers on Mahout, which will incorporate both 
> Pegasos and LIBLINEAR. 
> The unified interface has two main parts: 1) dataset loaders; 2) algorithms. 
> I describe them separately below.
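As a rough illustration of what a unified interface could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the class names `LinearClassifier` and `PerceptronStandIn` and the helper `accuracy` are hypothetical, and a plain perceptron stands in for a real solver such as Pegasos or LIBLINEAR. It is not the interface from MAHOUT-232.

```python
from abc import ABC, abstractmethod


def dot(w, x):
    """Dense dot product of two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))


class LinearClassifier(ABC):
    """Hypothetical common interface shared by all linear solvers."""

    def __init__(self):
        self.weights = None

    @abstractmethod
    def train(self, xs, ys):
        """Fit self.weights from vectors xs and labels ys in {-1, +1}."""

    def predict(self, x):
        # Prediction is identical for every linear model: sign of <w, x>.
        return 1 if dot(self.weights, x) >= 0 else -1


class PerceptronStandIn(LinearClassifier):
    """Placeholder solver; a real one would plug in behind the same API."""

    def __init__(self, epochs=10):
        super().__init__()
        self.epochs = epochs

    def train(self, xs, ys):
        self.weights = [0.0] * len(xs[0])
        for _ in range(self.epochs):
            for x, y in zip(xs, ys):
                if y * dot(self.weights, x) <= 0:  # misclassified: update
                    self.weights = [wj + y * xj
                                    for wj, xj in zip(self.weights, x)]
        return self


def accuracy(classifier, xs, ys):
    """Works with any LinearClassifier, regardless of the solver."""
    return sum(classifier.predict(x) == y for x, y in zip(xs, ys)) / len(xs)


# Toy data; the first component is a constant bias feature.
xs = [(1.0, 2.0), (1.0, 3.0), (1.0, -2.0), (1.0, -3.0)]
ys = [1, 1, -1, -1]
model = PerceptronStandIn().train(xs, ys)
```

The point of the design is that calling code such as `accuracy` depends only on the shared `train`/`predict` contract, so swapping one solver for another needs no changes elsewhere.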
> 2.1 Data Handler
> The dataset can be stored on a personal computer or on a Hadoop cluster. The 
> framework provides a high-performance Random Loader and Sequential Loader for 
> accessing large-scale data.
> 2.2 Sequential Algorithms
> The sequential algorithms will include binary classification and regression 
> based on Pegasos and LIBLINEAR, behind the unified interface.
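For readers unfamiliar with Pegasos, the core of the sequential binary trainer can be sketched as below. This is a simplified Python rendering of the single-example stochastic sub-gradient step described in [2], without the optional projection step. It is not the MAHOUT-232 code, and the names `pegasos_train` and `predict` and the default parameter values are my own assumptions.

```python
import random


def pegasos_train(xs, ys, lam=0.1, iterations=1000, seed=0):
    """Pegasos-style stochastic sub-gradient descent for a linear SVM.

    xs: list of dense feature vectors; ys: labels in {-1, +1};
    lam: regularization parameter (lambda in the paper).
    """
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    for t in range(1, iterations + 1):
        i = rng.randrange(len(xs))           # pick one example at random
        eta = 1.0 / (lam * t)                # step size 1 / (lambda * t)
        margin = ys[i] * sum(wj * xj for wj, xj in zip(w, xs[i]))
        w = [(1.0 - eta * lam) * wj for wj in w]   # shrink (regularizer)
        if margin < 1.0:                     # hinge loss is active
            w = [wj + eta * ys[i] * xj for wj, xj in zip(w, xs[i])]
    return w


def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Toy linearly separable data.
xs = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
ys = [1, 1, -1, -1]
w = pegasos_train(xs, ys)
```

Each iteration touches a single example, which is why the method scales to large datasets when run sequentially.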
> 2.3 Parallel Algorithms
> It is widely accepted that parallelizing a binary SVM classifier is hard. For 
> multi-class classification, however, a coarse-grained scheme (e.g. each 
> Mapper or Reducer runs one independent binary SVM classifier) makes a large 
> improvement easier to achieve. Cross validation for model selection can also 
> take advantage of such coarse-grained parallelism. I will introduce a unified 
> interface for all of them.
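The coarse-grained scheme can be sketched as follows: one independent binary training task per class, with the tasks handed to a pool of workers. This Python sketch uses a local thread pool purely as a stand-in for Hadoop Mappers or Reducers, and a difference-of-class-means rule as a placeholder for the binary SVM solver. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def train_binary(xs, ys):
    """Placeholder binary trainer (difference of class means), NOT an SVM."""
    dim = len(xs[0])
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    mean_pos = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mean_neg = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [p - n for p, n in zip(mean_pos, mean_neg)]
    # Bias places the decision boundary at the midpoint of the two means.
    bias = -sum(wj * (p + n) / 2.0
                for wj, p, n in zip(w, mean_pos, mean_neg))
    return w, bias


def train_one_class(task):
    """One independent unit of work: 'label' versus all other classes."""
    label, xs, labels = task
    ys = [1 if l == label else -1 for l in labels]
    return label, train_binary(xs, ys)


def train_one_vs_rest(xs, labels, workers=4):
    tasks = [(label, xs, labels) for label in sorted(set(labels))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(train_one_class, tasks))


def predict(models, x):
    """Pick the class whose binary model gives the highest score."""
    def score(label):
        w, bias = models[label]
        return sum(wj * xj for wj, xj in zip(w, x)) + bias
    return max(models, key=score)


# Toy three-class data.
xs = [(4.0, 0.0), (5.0, 0.0), (0.0, 4.0), (0.0, 5.0),
      (-4.0, -4.0), (-5.0, -5.0)]
labels = ["a", "a", "b", "b", "c", "c"]
models = train_one_vs_rest(xs, labels)
```

Because the per-class tasks share no state, the same decomposition applies to cross validation, where each fold (or each parameter value) becomes one independent task.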
> 3 Biography:
> I am a graduating master's student in Multimedia Information Retrieval System 
> at the National University of Singapore. My research has involved 
> large-scale SVM classifiers.
> I have worked with Hadoop and MapReduce for about a year, and I have 
> dedicated much of my spare time to the sequential SVM (Pegasos) for Mahout 
> (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken part in 
> setting up and maintaining a Hadoop cluster of around 70 nodes in our group.
> 4 Timeline:
> Weeks 1-4 (May 24 ~ June 18): Implement binary classifier 
> Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class classification 
> and cross validation for model selection. 
> Week 8 (July 12 ~ July 16): Submission of mid-term evaluation
> Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance 
> tuning
> Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation, and testing. 
> 5 References
> [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen 
> Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. 
> Res., 9:1871-1874, 2008.
> [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal 
> estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th 
> International Conference on Machine Learning, pages 807-814, New York, NY, 
> USA, 2007. ACM.
> [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, 
> Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, 
> Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 
> algorithms in data mining. Knowl. Inf. Syst., 14(1):1-37, 2007.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
