Hi all,

Robin told me about this great chance to keep contributing code here (many
thanks to Robin). Since I am still working on the sequential SVM (MAHOUT-232)
and would like to extend it into a unified framework that incorporates other
state-of-the-art linear SVM classifiers, I propose "Linear Support Vector
Machine (SVM) Framework based on Mahout".

I would appreciate any comments! :)

Cheers,
Zhendong

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

 Linear Support Vector Machine (SVM) Framework based on Mahout

— Proposal for Google Summer of Code™ 2010

Abstract

This proposal introduces a linear Support Vector Machine (SVM) framework to
Mahout. It provides a unified framework, built on Mahout, for diverse
algorithms such as Pegasos [2] and LIBLINEAR [1]. The contribution is twofold:
1) a unified framework for linear SVM classifiers; 2) introducing LIBLINEAR to
Mahout.

1 Motivation

Support Vector Machine is a powerful machine learning tool, widely adopted in
the Data Mining, Pattern Recognition and Information Retrieval communities.
Recently, SVM was chosen as one of the top 10 algorithms in data mining [3].

The SVM training procedure is quite slow, especially when the number of samples
is huge. Several recent papers propose linear SVM solvers that can handle
large-scale learning problems, for instance LIBLINEAR [1] and Pegasos [2]. I
have already implemented a prototype linear SVM classifier based on Pegasos
[2]; LIBLINEAR [1], as the winner of the linear SVM track of the ICML 2008
large-scale learning challenge (http://largescale.first.fraunhofer.de/summary/),
should be incorporated into Mahout as well. Currently, the LIBLINEAR package
supports:

·       L2-regularized classifiers [L2-loss linear SVM, L1-loss linear SVM,
and logistic regression (LR)]

·       L1-regularized classifiers [L2-loss linear SVM and logistic
regression (LR)]
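
For reference (these are the standard formulations from the LIBLINEAR paper
[1], nothing new in this proposal): given training pairs (x_i, y_i) with
y_i in {+1, -1}, the two L2-regularized SVM variants above solve

\[
\min_{w}\ \tfrac{1}{2} w^{\top} w \;+\; C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i\, w^{\top} x_i\bigr)
\qquad \text{(L1-loss)},
\]
\[
\min_{w}\ \tfrac{1}{2} w^{\top} w \;+\; C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i\, w^{\top} x_i\bigr)^{2}
\qquad \text{(L2-loss)},
\]

while the L1-regularized variants replace \tfrac{1}{2} w^{\top} w with
\|w\|_{1}, and logistic regression replaces the hinge term with
\log(1 + e^{-y_i w^{\top} x_i}).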

The main features of LIBLINEAR are as follows:

·       Multi-class classification: 1) one-vs.-the-rest, 2) Crammer & Singer

·       Cross validation for model selection

·       Probability estimates (logistic regression only)

·       Weights for unbalanced data

The linear SVM classifier based on Pegasos that is already on Mahout
(MAHOUT-232, http://issues.apache.org/jira/browse/MAHOUT-232) provides the
following functionalities:

·       Sequential Binary Classification (Two-class Classification), including
sequential training and prediction;

·       Sequential Regression;

·       Parallel & Sequential Multi-class Classification, including One-vs.-One
and One-vs.-Others schemes.

Clearly, a unified framework for linear SVM classifiers should be introduced
into the Mahout platform.

2 Framework

As mentioned above, in this section I propose a linear SVM classifier framework
for Mahout which will incorporate Pegasos and LIBLINEAR (see also
http://issues.apache.org/jira/browse/MAHOUT-228). The overall framework is
illustrated in Figure 1.

The framework has two main parts: 1) data accessing and pre-processing; 2)
algorithms. I will introduce them separately.

2.1 Data Processing Layer

The dataset can be stored on a personal computer or on a Hadoop cluster. The
framework provides high-performance Random and Sequential Loaders for accessing
large-scale data. These loaders support sequential vectors, the Gson format,
and the raw dataset format (the same format used by SVMlight and LIBSVM); a
parsing sketch for the raw format follows Figure 1.



Figure 1: The framework of linear SVM on Mahout
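
As a rough illustration of the raw-format support mentioned above, the sketch
below (hypothetical class name, not existing Mahout code) parses one
SVMlight/LIBSVM-style line, "label index:value index:value ...", into a label
plus a Mahout sparse vector; the actual Random/Sequential Loaders would wrap
this kind of parsing and, on Hadoop, read the lines from HDFS.

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

/** Hypothetical helper: one SVMlight/LIBSVM-format line, e.g. "+1 3:0.5 7:1.2". */
public final class SvmLightLine {

  private final double label;
  private final Vector features;

  public SvmLightLine(String line, int cardinality) {
    String[] tokens = line.trim().split("\\s+");
    // The first token is the label; the rest are sparse "index:value" pairs.
    this.label = Double.parseDouble(tokens[0]);
    this.features = new RandomAccessSparseVector(cardinality);
    for (int i = 1; i < tokens.length; i++) {
      int colon = tokens[i].indexOf(':');
      int index = Integer.parseInt(tokens[i].substring(0, colon)) - 1; // SVMlight indices are 1-based
      double value = Double.parseDouble(tokens[i].substring(colon + 1));
      features.setQuick(index, value);
    }
  }

  public double getLabel() { return label; }

  public Vector getFeatures() { return features; }
}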

2.2 Sequential Algorithms

The sequential algorithms will include binary classification and regression
based on Pegasos and LIBLINEAR, behind a unified interface; a sketch of such an
interface is given below.
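
A minimal sketch of what the unified interface could look like (hypothetical
names and package, not committed Mahout API): Pegasos and each LIBLINEAR solver
would supply their own train() implementation while sharing the same prediction
path, so callers never depend on a specific solver.

package org.apache.mahout.classifier.svm;   // hypothetical location

import org.apache.mahout.math.Vector;

/** One binary linear SVM solver (Pegasos, LIBLINEAR L1-loss, L2-loss, ...). */
public interface LinearSvmLearner {

  /** Learn the hyperplane from training vectors and their +1/-1 labels. */
  void train(Iterable<Vector> examples, Iterable<Double> labels);

  /** Raw decision value w^T x + b; its sign is the predicted class. */
  double decisionValue(Vector instance);
}

Regression and multi-class training would then be layered on top of this binary
interface.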

2.3 Parallel Algorithms

It is widely accepted that parallelizing a binary SVM classifier is hard. For
multi-class classification, however, a coarse-grained scheme (e.g., each Mapper
or Reducer runs one independent binary SVM classifier) can achieve a large
speedup much more easily. Cross validation for model selection can also take
advantage of such coarse-grained parallelism. I will introduce a unified
interface for all of them; a rough sketch of the training side is given below.
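
To make the coarse-grained scheme concrete, here is a rough sketch (hypothetical
class, building on the LinearSvmLearner and SvmLightLine sketches above, and
assuming Hadoop's new-style mapreduce API): the mapper side would emit each
training line once per class label, and every reduce() call then owns one
independent one-vs.-rest sub-problem.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.Vector;

/** Hypothetical: trains one one-vs.-rest binary classifier per reduce key. */
public abstract class OneVsRestTrainingReducer
    extends Reducer<IntWritable, Text, IntWritable, Text> {

  /** Supplies the concrete solver (Pegasos, LIBLINEAR, ...); left abstract in this sketch. */
  protected abstract LinearSvmLearner newLearner();

  /** Feature-space dimensionality, assumed known from the dataset header. */
  protected abstract int cardinality();

  @Override
  protected void reduce(IntWritable positiveClass, Iterable<Text> lines, Context context)
      throws IOException, InterruptedException {
    List<Vector> examples = new ArrayList<Vector>();
    List<Double> labels = new ArrayList<Double>();
    for (Text line : lines) {
      SvmLightLine parsed = new SvmLightLine(line.toString(), cardinality());
      examples.add(parsed.getFeatures());
      // Relabel for this sub-problem: the key's class is +1, everything else is -1.
      labels.add(parsed.getLabel() == positiveClass.get() ? 1.0 : -1.0);
    }
    LinearSvmLearner learner = newLearner();
    learner.train(examples, labels);
    // Emit the learned model so the driver can assemble the multi-class classifier.
    context.write(positiveClass, new Text(learner.toString()));
  }
}

Cross validation would reuse the same structure, with the key identifying a
(sub-problem, fold) pair instead of a single class.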

References

[1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
Lin. Liblinear: A library for large linear classification. J. Mach. Learn.
Res., 9:1871–1874, 2008.

[2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal
estimated sub-gradient solver for svm. In ICML ’07: Proceedings of the 24th
international conference on Machine learning, pages 807–814, New York, NY,
USA, 2007. ACM.

[3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10
algorithms in data mining. Knowl. Inf. Syst., 14(1):1–37, 2007.
-- 
-------------------------------------------------------------

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>>>>>

Department of Computer Science
School of Computing
National University of Singapore

>>>>>>><><><><><><><><<><>><><<<<<<
