Hi, see my responses inline below:
On Sun, Feb 21, 2010 at 3:53 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> This seems like a good idea for a project, but I see two issues:
>
> a) it seems very ambitious for one summer. This is good and bad. Good
> because you are excited and want to accomplish something grand, bad if it
> is too ambitious and would cause you to officially fail while still
> accomplishing parts of good things. Perhaps my perception is due to item
> (b) and you have a more limited goal than it seems.

That's true. Do you think porting LIBLINEAR to Mahout would be substantial
enough on its own for this proposal? I really don't know how big is big
enough. :) If so, I can move the rest to future work.

> b) there doesn't seem to be a specific goal. You say "introduce a unifying
> framework", but this is a little bit non-specific. Do you mean to augment
> your Pegasos implementation by adding a Java-based liblinear
> implementation? Or do you just mean to build a framework that would ALLOW
> somebody else to call each of these uniformly?

Yes, I will make this part more specific. I mean that I will add a
Java-based LIBLINEAR implementation to the current package and let users
call the algorithms (Pegasos, LIBLINEAR, etc.) through a unified interface
(data pre-processor, loader, and command line).

> c) Liblinear is in C++. Mahout is committed to portability and currently
> has no C++ code. What is your plan?

Pure Java. I found that Benedikt has already re-implemented LIBLINEAR in
Java (http://www.bwaldvogel.de/liblinear-java/); I want to port this code
to Mahout using Mahout Collections, etc.

> On Sat, Feb 20, 2010 at 10:00 AM, zhao zhendong <zhaozhend...@gmail.com> wrote:
>
> > Hi all,
> >
> > Robin told me about this great chance to keep contributing code here
> > (many thanks to Robin).
> > Because I am still working on the sequential SVM (MAHOUT-232), I would
> > like to extend it into a unified framework that incorporates other
> > state-of-the-art linear SVM classifiers. I therefore propose "Linear
> > Support Vector Machine (SVM) Framework based on Mahout".
> >
> > I would appreciate any comments! :)
> >
> > Cheers,
> > Zhendong
> >
> > Linear Support Vector Machine (SVM) Framework based on Mahout
> >
> > — Proposal for Google Summer of Code™ 2010
> >
> > Abstract
> >
> > This proposal introduces a linear Support Vector Machine (SVM)
> > framework to Mahout. The framework provides a unified basis for
> > diverse algorithms such as Pegasos [2] and LIBLINEAR [1]. The
> > contribution is twofold: 1) a unified framework for linear SVM
> > classifiers; 2) a port of LIBLINEAR to Mahout.
> >
> > 1 Motivation
> >
> > The Support Vector Machine is a powerful machine learning tool, widely
> > adopted in the data mining, pattern recognition, and information
> > retrieval communities. SVM was recently chosen as one of the top 10
> > algorithms in data mining [3].
> >
> > The SVM training procedure is quite slow, especially on datasets with
> > a huge number of samples. Several recent papers propose linear SVM
> > solvers that can handle large-scale learning problems, for instance
> > LIBLINEAR [1] and Pegasos [2]. I have already implemented a prototype
> > linear SVM classifier based on Pegasos [2]; since LIBLINEAR [1] won
> > the linear SVM track of the ICML 2008 large-scale learning challenge
> > (http://largescale.first.fraunhofer.de/summary/), it should be
> > incorporated into Mahout as well.
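(Inline note: to make the Pegasos part concrete, here is a minimal sketch of the stochastic sub-gradient step the algorithm is built around, with plain double[] arrays standing in for Mahout vectors. Class and method names are illustrative only, not the actual Mahout or MAHOUT-232 API.)

```java
// Minimal sketch of one Pegasos sub-gradient step (Shalev-Shwartz et al. [2]).
// All names are illustrative; a real implementation would use Mahout vectors.
public class PegasosStep {

    /**
     * One stochastic update: w <- (1 - eta*lambda) * w, plus eta*y*x
     * when the example (x, y) violates the margin (y * <w, x> < 1).
     * eta_t = 1 / (lambda * t) is the standard Pegasos step size.
     */
    public static double[] update(double[] w, double[] x, double y,
                                  double lambda, int t) {
        double eta = 1.0 / (lambda * t);
        double margin = y * dot(w, x);
        double[] next = new double[w.length];
        for (int j = 0; j < w.length; j++) {
            next[j] = (1.0 - eta * lambda) * w[j];  // regularization shrink
            if (margin < 1.0) {
                next[j] += eta * y * x[j];          // hinge-loss sub-gradient
            }
        }
        return next;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) {
            s += a[j] * b[j];
        }
        return s;
    }
}
```

(Pegasos additionally projects w onto a ball of radius 1/sqrt(lambda); that step is omitted here for brevity.)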
> > Currently, the LIBLINEAR package supports:
> >
> > · L2-regularized classifiers [L2-loss linear SVM, L1-loss linear
> > SVM, and logistic regression (LR)]
> >
> > · L1-regularized classifiers [L2-loss linear SVM and logistic
> > regression (LR)]
> >
> > The main features of LIBLINEAR are:
> >
> > · Multi-class classification: 1) one-vs.-the-rest, 2) Crammer & Singer
> >
> > · Cross-validation for model selection
> >
> > · Probability estimates (logistic regression only)
> >
> > · Weights for unbalanced data
> >
> > The linear SVM classifier based on Pegasos in Mahout
> > (http://issues.apache.org/jira/browse/MAHOUT-232) provides the
> > following functionalities:
> >
> > · Sequential binary (two-class) classification, including sequential
> > training and prediction;
> >
> > · Sequential regression;
> >
> > · Parallel and sequential multi-class classification, including
> > one-vs.-one and one-vs.-others schemes.
> >
> > Clearly, a unified framework for linear SVM classifiers should be
> > introduced into the Mahout platform.
> >
> > 2 Framework
> >
> > As mentioned above, in this section I propose a linear SVM classifier
> > framework for Mahout that will incorporate Pegasos and LIBLINEAR
> > (http://issues.apache.org/jira/browse/MAHOUT-228). The whole
> > framework is illustrated in Figure 1.
> >
> > The framework has two main parts: 1) data access and pre-processing;
> > 2) algorithms. I will introduce them separately.
> >
> > 2.1 Data Processing Layer
> >
> > The dataset can be stored on a personal computer or on a Hadoop
> > cluster. The framework provides a high-performance Random Loader and
> > a Sequential Loader for accessing large-scale data. These loaders
> > support the sequential vector and Gson formats as well as raw dataset
> > formats (the same as SVMlight and LIBSVM).
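(Inline note: for reference, a minimal sketch of parsing one line of the SVMlight/LIBSVM text format mentioned above, i.e. "label index:value index:value ...". A real Mahout loader would emit a sparse vector from Mahout Collections rather than a Map; the class name is illustrative.)

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative parser for one SVMlight/LIBSVM-format line, e.g. "+1 1:0.5 3:1.2".
public class LibsvmLineParser {

    /** The first whitespace-separated token is the label (e.g. +1 or -1). */
    public static double parseLabel(String line) {
        return Double.parseDouble(line.trim().split("\\s+")[0]);
    }

    /** Remaining tokens are 1-based "index:value" pairs; absent indices are zero. */
    public static Map<Integer, Double> parseFeatures(String line) {
        String[] tokens = line.trim().split("\\s+");
        Map<Integer, Double> features = new HashMap<Integer, Double>();
        for (int i = 1; i < tokens.length; i++) {  // token 0 is the label
            String[] pair = tokens[i].split(":");
            features.put(Integer.parseInt(pair[0]), Double.parseDouble(pair[1]));
        }
        return features;
    }
}
```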
> > [Figure 1: The framework of linear SVM on Mahout]
> >
> > 2.2 Sequential Algorithms
> >
> > The sequential algorithms will include binary classification and
> > regression based on Pegasos and LIBLINEAR, behind a unified interface.
> >
> > 2.3 Parallel Algorithms
> >
> > It is widely accepted that parallelizing a binary SVM classifier is
> > hard. For multi-class classification, however, a coarse-grained
> > scheme (e.g., each Mapper or Reducer holds one independent binary SVM
> > classifier) can achieve a large improvement much more easily.
> > Cross-validation for model selection can also take advantage of such
> > coarse-grained parallelism. I will introduce a unified interface for
> > all of them.
> >
> > References
> >
> > [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and
> > Chih-Jen Lin. LIBLINEAR: A library for large linear classification.
> > J. Mach. Learn. Res., 9:1871–1874, 2008.
> >
> > [2] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos:
> > Primal estimated sub-gradient solver for SVM. In ICML '07:
> > Proceedings of the 24th International Conference on Machine Learning,
> > pages 807–814, New York, NY, USA, 2007. ACM.
> >
> > [3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang
> > Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu,
> > Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan
> > Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst.,
> > 14(1):1–37, 2007.
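(Inline note: to make the coarse-grained one-vs.-rest scheme of section 2.3 concrete, here is a minimal prediction-side sketch. Each row of classWeights would come from one independently trained binary model, e.g. one per Mapper; the class name and the double[][] model layout are illustrative, not any existing Mahout API.)

```java
// Sketch of one-vs.-rest prediction: K independent binary linear models,
// one per class; the predicted class is the argmax of the K linear scores.
public class OneVsRest {

    /** Index of the class whose linear model gives x the highest score. */
    public static int predict(double[][] classWeights, double[] x) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < classWeights.length; k++) {
            double score = 0.0;
            for (int j = 0; j < x.length; j++) {
                score += classWeights[k][j] * x[j];
            }
            if (score > bestScore) {
                bestScore = score;
                best = k;
            }
        }
        return best;
    }
}
```

Training the K binary models is embarrassingly parallel, which is why the coarse-grained Mapper-per-classifier scheme fits MapReduce well.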
> > --
> > Zhen-Dong Zhao (Maxim)
> > Department of Computer Science
> > School of Computing
> > National University of Singapore
>
> --
> Ted Dunning, CTO
> DeepDyve

--
Zhen-Dong Zhao (Maxim)
Department of Computer Science
School of Computing
National University of Singapore