[
https://issues.apache.org/jira/browse/MAHOUT-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855168#action_12855168
]
Ted Dunning commented on MAHOUT-334:
------------------------------------
There is little to say. This is an excellent proposal. You clearly have the
background and the capability to succeed.
A key limitation will be whether we have enough mentors.
> Proposal for GSoC2010 (Linear SVM for Mahout)
> ---------------------------------------------
>
> Key: MAHOUT-334
> URL: https://issues.apache.org/jira/browse/MAHOUT-334
> Project: Mahout
> Issue Type: Task
> Reporter: zhao zhendong
>
> Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
> Student: Zhen-Dong Zhao
> Student e-mail: [email protected]
> Student Major: Multimedia Information Retrieval /Computer Science
> Student Degree: Master's student; Graduation: NUS '10
> Organization: Hadoop
> 0 Abstract
> Linear Support Vector Machines (SVMs) are highly useful in applications
> with large-scale datasets or high-dimensional features. This proposal
> will port one of the best-known linear SVM solvers, LIBLINEAR [1], to
> Mahout, with a unified interface shared with Pegasos [2], another linear
> SVM solver that I have nearly finished implementing for Mahout. The two
> distinct contributions would be: 1) introducing LIBLINEAR to Mahout;
> 2) a unified interface for linear SVM classifiers.
> 1 Motivation
> As one of the top 10 algorithms in the data mining community [3], the
> Support Vector Machine is a powerful machine learning tool, widely
> adopted in data mining, pattern recognition, and information retrieval.
> SVM training is slow, however, especially on large-scale datasets.
> Several recent works propose SVM solvers with a linear kernel that can
> handle large-scale learning problems, for instance LIBLINEAR [1] and
> Pegasos [2]. I have implemented a prototype of a linear SVM classifier
> based on Pegasos [2] for Mahout (issue: MAHOUT-232). As the winner of
> the linear SVM track of the ICML 2008 large-scale learning challenge
> (http://largescale.first.fraunhofer.de/summary/), LIBLINEAR [1] should
> be incorporated into Mahout as well. Currently, the LIBLINEAR package
> supports:
> (1) L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM,
> and logistic regression (LR)
> (2) L1-regularized classifiers: L2-loss linear SVM and logistic
> regression (LR)
> The main features of LIBLINEAR are:
> (1) Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
> (2) Cross validation for model selection
> (3) Probability estimates (logistic regression only)
> (4) Weights for unbalanced data
> All of these functionalities are to be implemented except probability
> estimates and weights for unbalanced data (time permitting, I would
> like to implement those as well).
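For reference, the heart of LIBLINEAR's solver for the L2-regularized L1-loss case is a dual coordinate descent step (Hsieh et al., ICML 2008). The sketch below is my own illustration, not LIBLINEAR or Mahout code: all class and method names are made up, and it assumes dense feature vectors and labels in {-1, +1}.

```java
// Sketch of dual coordinate descent for L2-regularized L1-loss linear SVM.
// Dual: min_a (1/2) a'Qa - e'a  s.t. 0 <= a_i <= C, with w = sum_i a_i y_i x_i.
public class DualCDSketch {
    static double[] train(double[][] x, int[] y, double C, int epochs) {
        int n = x.length, d = x[0].length;
        double[] w = new double[d];
        double[] alpha = new double[n];
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < n; i++) {
                double qii = dot(x[i], x[i]);
                if (qii == 0) continue;
                double g = y[i] * dot(w, x[i]) - 1.0;   // partial gradient of the dual
                double aOld = alpha[i];
                // one-variable Newton step, projected onto the box [0, C]
                double aNew = Math.min(Math.max(aOld - g / qii, 0.0), C);
                if (aNew != aOld) {
                    // maintain w = sum_i alpha_i y_i x_i incrementally
                    for (int j = 0; j < d; j++) w[j] += (aNew - aOld) * y[i] * x[i][j];
                    alpha[i] = aNew;
                }
            }
        }
        return w;
    }
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += a[j] * b[j];
        return s;
    }
    public static void main(String[] args) {
        // Toy separable data: sign of the first coordinate decides the class.
        double[][] x = {{2, 1}, {1, 0.5}, {-1, -1}, {-2, -0.5}};
        int[] y = {1, 1, -1, -1};
        double[] w = train(x, y, 1.0, 50);
        for (int i = 0; i < x.length; i++)
            System.out.println(y[i] * dot(w, x[i]) > 0);  // correctly classified?
    }
}
```

The per-coordinate update touches only one example at a time, which is why this solver scales to large, sparse datasets.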
> 2 Unified Interfaces
> The linear SVM classifier based on the Pegasos package in Mahout
> (http://issues.apache.org/jira/browse/MAHOUT-232) already provides the
> following functionality:
> (1) Sequential binary classification (two-class classification),
> including sequential training and prediction;
> (2) Sequential regression;
> (3) Parallel & sequential multi-class classification, including
> One-vs.-One and One-vs.-Others schemes.
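The solver underneath this functionality is the Pegasos stochastic subgradient step from [2]. A minimal sketch of that update (my own names, not the MAHOUT-232 classes) for dense vectors and labels in {-1, +1}:

```java
// Pegasos: stochastic subgradient descent on the primal SVM objective
//   lambda/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w.x_i)
public class PegasosSketch {
    static double[] train(double[][] x, int[] y, double lambda, int iters, long seed) {
        java.util.Random rnd = new java.util.Random(seed);
        double[] w = new double[x[0].length];
        for (int t = 1; t <= iters; t++) {
            int i = rnd.nextInt(x.length);               // pick one example at random
            double eta = 1.0 / (lambda * t);             // Pegasos step-size schedule
            boolean violated = y[i] * dot(w, x[i]) < 1;  // inside the margin?
            for (int j = 0; j < w.length; j++) {
                w[j] *= 1.0 - eta * lambda;              // shrink: regularizer gradient
                if (violated) w[j] += eta * y[i] * x[i][j]; // hinge-loss subgradient step
            }
        }
        return w;
    }
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += a[j] * b[j];
        return s;
    }
}
```

Because each step looks at a single example, the training cost per iteration is independent of the dataset size, which is what makes Pegasos attractive at scale.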
> The functionality of the Pegasos package in Mahout and that of
> LIBLINEAR are quite similar. As mentioned above, in this section I will
> introduce a unified interface for linear SVM classifiers in Mahout that
> incorporates both Pegasos and LIBLINEAR.
> The unified interface has two main parts: 1) the dataset loader; 2) the
> algorithms. I will introduce them separately.
> 2.1 Data Handler
> The dataset may be stored on a personal computer or on a Hadoop
> cluster. The framework provides a high-performance Random Loader and a
> Sequential Loader for accessing large-scale data.
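One hypothetical shape for these two loaders is sketched below; the interface and class names are my own illustration, not existing Mahout types. A sequential loader streams examples once, front to back (a natural fit for HDFS streams), while a random loader offers indexed access (a natural fit for local files or an in-memory cache).

```java
import java.util.Iterator;
import java.util.List;

// One labeled training example.
class LabeledPoint {
    final int label;
    final double[] features;
    LabeledPoint(int label, double[] features) { this.label = label; this.features = features; }
}

// Sequential loader: one forward pass over the data.
interface SequentialLoader extends Iterator<LabeledPoint> {}

// Random loader: indexed access to any example.
interface RandomLoader {
    int size();
    LabeledPoint get(int index);
}

// In-memory stand-in satisfying both access patterns, for illustration.
class InMemoryLoader implements RandomLoader {
    private final List<LabeledPoint> points;
    InMemoryLoader(List<LabeledPoint> points) { this.points = points; }
    public int size() { return points.size(); }
    public LabeledPoint get(int index) { return points.get(index); }
    public SequentialLoader sequential() {
        Iterator<LabeledPoint> it = points.iterator();
        return new SequentialLoader() {
            public boolean hasNext() { return it.hasNext(); }
            public LabeledPoint next() { return it.next(); }
        };
    }
}
```

Splitting the two access patterns lets a sequential solver such as Pegasos run off a stream, while algorithms that revisit examples can require the random variant.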
> 2.2 Sequential Algorithms
> The sequential algorithms will include binary classification and
> regression based on Pegasos and LIBLINEAR, behind the unified
> interface.
> 2.3 Parallel Algorithms
> It is widely accepted that parallelizing a binary SVM classifier is
> hard. For multi-class classification, however, a coarse-grained scheme
> (e.g. each Mapper or Reducer trains one independent binary SVM
> classifier) can more easily achieve a large improvement. In addition,
> cross validation for model selection can also take advantage of such
> coarse-grained parallelism. I will introduce a unified interface for
> all of these.
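The coarse-grained scheme can be sketched as follows: each task trains one independent binary classifier, here one-vs-rest, with a thread pool standing in for the Mappers and a simple perceptron standing in for the SVM solver. All names are illustrative, not Mahout API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OneVsRestSketch {
    // Train numClasses binary classifiers in parallel; task c learns to
    // separate class c from all others. Tasks share no mutable state.
    static double[][] train(double[][] x, int[] labels, int numClasses, int epochs) {
        ExecutorService pool = Executors.newFixedThreadPool(numClasses);
        try {
            List<Future<double[]>> futures = new ArrayList<>();
            for (int c = 0; c < numClasses; c++) {
                final int cls = c;
                futures.add(pool.submit(() -> binaryPerceptron(x, labels, cls, epochs)));
            }
            double[][] w = new double[numClasses][];
            for (int c = 0; c < numClasses; c++) w[c] = futures.get(c).get();
            return w;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in binary trainer (a perceptron, not an SVM) so the sketch runs.
    static double[] binaryPerceptron(double[][] x, int[] labels, int cls, int epochs) {
        double[] w = new double[x[0].length];
        for (int e = 0; e < epochs; e++)
            for (int i = 0; i < x.length; i++) {
                int y = labels[i] == cls ? 1 : -1;
                if (y * dot(w, x[i]) <= 0)                         // mistake-driven update
                    for (int j = 0; j < w.length; j++) w[j] += y * x[i][j];
            }
        return w;
    }

    // Predict the class whose binary classifier scores highest.
    static int predict(double[][] w, double[] xi) {
        int best = 0;
        for (int c = 1; c < w.length; c++)
            if (dot(w[c], xi) > dot(w[best], xi)) best = c;
        return best;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += a[j] * b[j];
        return s;
    }
}
```

Cross validation parallelizes the same way: each fold-parameter pair becomes one independent training task, so the thread pool here maps directly onto a set of Mappers.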
> 3 Biography:
> I am a graduating master's student in multimedia information retrieval
> at the National University of Singapore. My research has involved
> large-scale SVM classifiers.
> I have worked with Hadoop and MapReduce for about a year, and I have
> dedicated much of my spare time to the sequential SVM (Pegasos) for
> Mahout (http://issues.apache.org/jira/browse/MAHOUT-232). I have taken
> part in setting up and maintaining a Hadoop cluster of around 70 nodes
> in our group.
> 4 Timeline:
> Weeks 1-4 (May 24 ~ June 18): Implement binary classifier
> Weeks 5-7 (June 21 ~ July 12): Implement parallel multi-class classification
> and Implement cross validation for model selection.
> Week 8 (July 12 ~ July 16): Submit mid-term evaluation
> Weeks 9-11 (July 16 ~ August 9): Interface refactoring and performance
> tuning
> Weeks 11-12 (August 9 ~ August 16): Code cleanup, documentation, and
> testing.
> 5 References
> [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and
> Chih-Jen Lin. LIBLINEAR: A library for large linear classification. J.
> Mach. Learn. Res., 9:1871-1874, 2008.
> [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos:
> Primal estimated sub-gradient solver for SVM. In ICML '07: Proceedings
> of the 24th International Conference on Machine Learning, pages
> 807-814, New York, NY, USA, 2007. ACM.
> [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
> Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
> Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10
> algorithms in data mining. Knowl. Inf. Syst., 14(1):1-37, 2007.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.