[ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795813#action_12795813 ]
Ted Dunning commented on MAHOUT-232:
------------------------------------

The 0.1 patch compiles for me, but the 0.2 patch produces this problem:

{noformat}
/Users/tdunning/Apache/mahout-trunk/core/src/main/java/org/apache/mahout/classifier/svm/DataSetHandler.java:[195,8] cannot find symbol
symbol  : variable HDFSConfig
location: class org.apache.mahout.classifier.svm.DataSetHandler
/Users/tdunning/Apache/mahout-trunk/core/src/main/java/org/apache/mahout/classifier/svm/DataSetHandler.java:[244,8] cannot find symbol
symbol  : variable HDFSConfig
location: class org.apache.mahout.classifier.svm.DataSetHandler
{noformat}

It seems that something has been dropped from the patch.

> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: zhao zhendong
>         Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.patch
>
>
> After discussion with people in this community, I decided to re-implement a
> sequential SVM solver based on Pegasos for the Mahout platform (Mahout
> command-line style, SparseMatrix, SparseVector, etc.). Eventually, it will
> support HDFS.
> Sequential SVM based on Pegasos.
> Maxim zhao (zhaozhendong at gmail dot com)
> -------------------------------------------------------------------------------------------
> Currently, this package provides (Features):
> -------------------------------------------------------------------------------------------
> 1. Sequential SVM linear solver, including training and testing.
> 2. Support for general file systems and HDFS right now.
> 3. Support for large-scale data sets.
> Because Pegasos only needs to draw a limited number of samples, this
> package pre-fetches a fixed number of samples (e.g. the maximum iteration
> count) into memory.
> For example: if the data set has 100,000,000 samples, then because the
> default maximum iteration count is 10,000, it randomly loads 10,000 samples
> into memory.
> 4. Sequential data set testing, so the package can support large-scale data
> sets in both the training and testing processes.
> -------------------------------------------------------------------------------------------
> TODO:
> -------------------------------------------------------------------------------------------
> 1. HDFS write function for storing the model file to HDFS.
> 2. Parallel testing algorithm based on the MapReduce framework.
> 3. Regression.
> 4. Multi-classification.
> -------------------------------------------------------------------------------------------
> Usage:
> -------------------------------------------------------------------------------------------
> Training:
> SVMPegasosTraining.java
> I have hard-coded the arguments in this file. If you want to customize the
> arguments yourself, please uncomment the first line in the main function.
> The default argument is:
> -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> [For the case where the training data set is on HDFS:]
> >>>>>>>
> 1 Ensure that your training data set has been uploaded to HDFS:
> hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> 2 Revise the argument:
> -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
> >>>>>>>
> Testing:
> SVMPegasosTesting.java
> I have hard-coded the arguments in this file. If you want to customize the
> arguments yourself, please uncomment the first line in the main function.
> The default argument is:
> -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> -------------------------------------------------------------------------------------------
> Experimental Results:
> -------------------------------------------------------------------------------------------
> Data set:
> name           | source  | type           | class | training size | testing size | feature
> -----------------------------------------------------------------------------------------------
> rcv1.binary    | [DL04b] | classification | 2     | 20,242        | 677,399      | 47,236
> covtype.binary | UCI     | classification | 2     | 581,012       |              | 54
> a9a            | UCI     | classification | 2     | 32,561        | 16,281       | 123
> w8a            | [JP98a] | classification | 2     | 49,749        | 14,951       | 300
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set       | Accuracy | Training Time | Testing Time |
> rcv1.binary    | 94.67%   | 19 Sec        | 2 min 25 Sec |
> covtype.binary |          | 19 Sec        |              |
> a9a            | 84.72%   | 14 Sec        | 12 Sec       |
> w8a            | 89.8%    | 14 Sec        | 8 Sec        |
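
Feature 3 in the quoted description says the package randomly loads a fixed number of samples (10,000 by default) from a much larger data set. The description does not say how the patch draws them. One common way to do it in a single sequential pass is reservoir sampling, sketched below. The class name ReservoirSampler and its method are hypothetical and are not taken from the patch; this is only an illustration of the idea, assuming the samples arrive as an iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the MAHOUT-232 patch.
// Draws k items uniformly at random from a stream of unknown length
// while holding at most k items in memory (reservoir sampling, Algorithm R).
final class ReservoirSampler {

    private ReservoirSampler() {
    }

    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0; // number of items read from the stream so far
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // The first k items fill the reservoir directly.
                reservoir.add(item);
            } else {
                // Pick a slot in [0, seen); the new item is kept only if the
                // slot falls inside the reservoir, i.e. with probability k/seen.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

With k set to the maximum iteration count, memory use stays bounded by k regardless of the size of the training file, which matches the behaviour the description claims for large data sets.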