 Linear Support Vector Machine (SVM) Framework based on Mahout

Proposal for Google Summer of Code™ 2010


This proposal introduces a linear Support Vector Machine (SVM) framework to
Mahout. It provides a unified framework on top of Mahout for diverse
algorithms, such as Pegasos [2] and LIBLINEAR [1]. The contribution is
twofold: 1) a unified framework for linear SVM classifiers; 2) the
introduction of LIBLINEAR to Mahout.

1 Motivation

The Support Vector Machine (SVM) is a powerful machine learning tool that is
widely adopted in the data mining, pattern recognition, and information
retrieval communities. SVM was recently chosen as one of the top 10
algorithms in data mining [3].

SVM training is slow, especially when the number of samples is huge. Several
recent papers propose linear SVM solvers that can handle large-scale learning
problems, for instance LIBLINEAR [1] and Pegasos [2]. Although I have already
implemented a prototype linear SVM classifier based on Pegasos [2],
LIBLINEAR [1], as the winner of the ICML 2008 large-scale learning challenge
(linear SVM track, <http://largescale.first.fraunhofer.de/summary/>), should
also be incorporated into Mahout. Currently, the LIBLINEAR package supports:

·       L2-regularized classifiers [L2-loss linear SVM, L1-loss linear SVM,
and logistic regression (LR)]

·       L1-regularized classifiers [L2-loss linear SVM and logistic
regression (LR)]
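
For reference, the optimization problems behind the two groups above can be
written as follows. This is a summary in the notation of the LIBLINEAR
paper [1], where (x_i, y_i), i = 1..l, are the training samples with labels
y_i in {-1, +1} and C > 0 is the penalty parameter; it is meant as a reading
aid, not as a specification of what the Mahout port will expose.

```latex
% L2-regularized classifiers
\min_{w} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi(w; x_i, y_i)

% L1-regularized classifiers
\min_{w} \; \lVert w \rVert_{1} + C \sum_{i=1}^{l} \xi(w; x_i, y_i)

% Loss functions \xi(w; x_i, y_i)
\text{L1-loss SVM:} \quad \max\bigl(0,\; 1 - y_i w^{T} x_i\bigr)
\text{L2-loss SVM:} \quad \max\bigl(0,\; 1 - y_i w^{T} x_i\bigr)^{2}
\text{Logistic regression:} \quad \log\bigl(1 + e^{-y_i w^{T} x_i}\bigr)
```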

The main features of LIBLINEAR are the following:

·       Multi-class classification: 1) one-vs-the rest, 2) Crammer & Singer

·       Cross validation for model selection

·       Probability estimates (logistic regression only)

·       Weights for unbalanced data

The Pegasos-based linear SVM classifier package on Mahout
<http://issues.apache.org/jira/browse/MAHOUT-232> provides the following
functionalities:

·       Sequential Binary Classification (Two-class Classification), including
sequential training and prediction;

·       Sequential Regression;

·       Parallel & Sequential Multi-class Classification, including
One-vs.-One and One-vs.-Others

Since the two packages overlap in functionality but expose it differently, a
unified framework for linear SVM classifiers should be introduced into the
Mahout platform.

2 Framework

In this section I propose a linear SVM classifier framework for Mahout, which
will incorporate Pegasos and LIBLINEAR
<http://issues.apache.org/jira/browse/MAHOUT-228>. The overall structure of
the framework is illustrated in Figure 1.

The framework has two main parts: 1) data access and pre-processing;
2) algorithms. I describe them separately.

2.1 Data Processing Layer

The dataset can be stored on a personal computer or on a Hadoop cluster. The
framework provides a high-performance Random Loader and Sequential Loader for
accessing large-scale data. These loaders support the sequential vector
format, the Gson format, and the raw dataset format (the same one used by
SVMlight and LIBSVM).


Figure 1: The framework of linear SVM on Mahout

2.2 Sequential Algorithms

The sequential algorithms will include binary classification and regression,
based on both Pegasos and LIBLINEAR, behind a unified interface.
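
As an illustration of what a unified interface with one solver behind it
might look like, the sketch below defines a hypothetical LinearBinaryTrainer
interface and a Pegasos-style implementation. The interface name, its
signature, and the parameter values are assumptions made for this example.
The trainer is the basic single-sample stochastic sub-gradient variant
described in [2], without a bias term and without the optional projection
step; it is not the code from the existing prototype.

```java
import java.util.Random;

/** Hypothetical unified interface; illustrative only, not Mahout API. */
interface LinearBinaryTrainer {
    /** Trains on dense rows x with labels y in {+1, -1}; returns weights w. */
    double[] train(double[][] x, int[] y);
}

/** Pegasos-style stochastic sub-gradient trainer (no bias, no projection). */
final class PegasosTrainer implements LinearBinaryTrainer {
    private final double lambda;  // regularization strength
    private final int iterations; // number of stochastic steps
    private final long seed;      // fixed seed for reproducible sampling

    PegasosTrainer(double lambda, int iterations, long seed) {
        this.lambda = lambda;
        this.iterations = iterations;
        this.seed = seed;
    }

    @Override
    public double[] train(double[][] x, int[] y) {
        double[] w = new double[x[0].length];
        Random rnd = new Random(seed);
        for (int t = 1; t <= iterations; t++) {
            int i = rnd.nextInt(x.length);   // pick one training sample
            double eta = 1.0 / (lambda * t); // step size 1 / (lambda * t)
            // Is the hinge loss active for this sample at the current w?
            boolean violated = y[i] * dot(w, x[i]) < 1.0;
            for (int j = 0; j < w.length; j++) {
                w[j] *= 1.0 - eta * lambda;  // shrink due to the regularizer
                if (violated) {
                    w[j] += eta * y[i] * x[i][j];
                }
            }
        }
        return w;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += a[j] * b[j];
        }
        return sum;
    }

    /** Predicts +1 or -1 from the sign of the decision value. */
    static int predict(double[] w, double[] x) {
        return dot(w, x) >= 0 ? 1 : -1;
    }
}

final class PegasosDemo {
    public static void main(String[] args) {
        double[][] x = {{2, 1}, {1, 2}, {-1, -2}, {-2, -1}};
        int[] y = {1, 1, -1, -1};
        LinearBinaryTrainer trainer = new PegasosTrainer(0.1, 1000, 7L);
        double[] w = trainer.train(x, y);
        System.out.println(PegasosTrainer.predict(w, new double[] {3, 3}));
    }
}
```

A LIBLINEAR-backed trainer could implement the same interface, so that
callers such as cross validation or multi-class wrappers do not depend on the
solver in use.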

2.3 Parallel Algorithms

It is widely accepted that parallelizing a single binary SVM classifier is
hard. For multi-class classification, however, a coarse-grained scheme (e.g.
each Mapper or Reducer trains one independent binary SVM classifier) is much
easier to implement and can still give a large speed-up. Cross validation for
model selection can take advantage of the same coarse-grained parallelism. I
will introduce a unified interface for all of them.
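
The coarse-grained scheme can be sketched on a single machine as follows.
This is only an illustration: a thread pool stands in for Hadoop Mappers or
Reducers, each task trains one independent one-vs-rest binary classifier, and
the class name OneVsRestParallel, the compact Pegasos-style trainer, and the
parameter values are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class OneVsRestParallel {
    /** Compact Pegasos-style binary trainer; labels are +1/-1, no bias. */
    static double[] trainBinary(double[][] x, int[] y, double lambda,
                                int iters, long seed) {
        double[] w = new double[x[0].length];
        Random rnd = new Random(seed);
        for (int t = 1; t <= iters; t++) {
            int i = rnd.nextInt(x.length);
            double eta = 1.0 / (lambda * t);
            boolean violated = y[i] * dot(w, x[i]) < 1.0;
            for (int j = 0; j < w.length; j++) {
                w[j] *= 1.0 - eta * lambda;
                if (violated) {
                    w[j] += eta * y[i] * x[i][j];
                }
            }
        }
        return w;
    }

    /** Trains one binary model per class, each as an independent task. */
    static double[][] trainOneVsRest(double[][] x, int[] labels,
                                     int numClasses, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<double[]>> futures = new ArrayList<>();
            for (int c = 0; c < numClasses; c++) {
                final int cls = c;
                futures.add(pool.submit(() -> {
                    // Relabel: the current class is +1, all others are -1.
                    int[] y = new int[labels.length];
                    for (int i = 0; i < y.length; i++) {
                        y[i] = labels[i] == cls ? 1 : -1;
                    }
                    return trainBinary(x, y, 0.01, 2000, 42L + cls);
                }));
            }
            double[][] models = new double[numClasses][];
            for (int c = 0; c < numClasses; c++) {
                models[c] = futures.get(c).get(); // wait for task c
            }
            return models;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("training task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Picks the class whose binary model gives the largest decision value. */
    static int predict(double[][] models, double[] x) {
        int best = 0;
        for (int c = 1; c < models.length; c++) {
            if (dot(models[c], x) > dot(models[best], x)) {
                best = c;
            }
        }
        return best;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += a[j] * b[j];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 0, 0}, {2, 0, 0}, {0, 1, 0},
                        {0, 2, 0}, {0, 0, 1}, {0, 0, 2}};
        int[] labels = {0, 0, 1, 1, 2, 2};
        double[][] models = trainOneVsRest(x, labels, 3, 3);
        System.out.println(predict(models, new double[] {0, 0.5, 0}));
    }
}
```

Because the tasks share no state, the same decomposition applies to cross
validation by submitting one task per fold instead of one task per class.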


[1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn.
Res., 9:1871–1874, 2008.

[2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal
estimated sub-gradient solver for SVM. In ICML ’07: Proceedings of the 24th
International Conference on Machine Learning, pages 807–814, New York, NY,
USA, 2007. ACM.

[3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10
algorithms in data mining. Knowl. Inf. Syst., 14(1):1–37, 2007.

