[ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhao zhendong updated MAHOUT-232:
---------------------------------

Description:

After discussing with people in this community, I decided to re-implement a sequential SVM solver based on Pegasos for the Mahout platform (Mahout command-line style, SparseMatrix, SparseVector, etc.). Eventually it will support HDFS.

Sequential SVM based on Pegasos.
Maxim zhao (zhaozhendong at gmail dot com)

-------------------------------------------------------------------------------------------
Currently, this package provides (features):
-------------------------------------------------------------------------------------------
1. A sequential linear SVM solver, including training and testing.
2. Support for both the general file system and HDFS.
3. Support for large-scale data sets. Because Pegasos only needs to sample a certain number of examples, this package pre-fetches only that many samples (the maximum iteration count) into memory. For example, if the data set has 100,000,000 samples and the default maximum iteration count is 10,000, it randomly loads only 10,000 samples into memory.
4. Sequential data-set testing, so the package can handle large-scale data sets in both the training and the testing process.
-------------------------------------------------------------------------------------------
TODO:
-------------------------------------------------------------------------------------------
1. HDFS write function for storing the model file to HDFS.
2. Parallel testing algorithm based on the MapReduce framework.
3. Regression.
4. Multi-class classification.
-------------------------------------------------------------------------------------------
Usage:
-------------------------------------------------------------------------------------------
Training: SVMPegasosTraining.java

The arguments are hard-coded in this file; if you want to customize them yourself, please uncomment the first line of the main function.
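The sampling idea in feature 3 is the heart of Pegasos: each iteration touches only one randomly drawn example, so at most maxIter samples ever need to be resident in memory. A minimal sketch of the sequential update, assuming dense double arrays and hypothetical class/method names (the actual patch's API may differ):

```java
import java.util.Random;

/** Minimal sketch of the sequential Pegasos SVM solver.
 *  Labels must be +1 or -1. Hypothetical class name, not the patch's API. */
public class PegasosSketch {

    /** Runs maxIter Pegasos iterations on (x, y); returns the weight vector. */
    public static double[] train(double[][] x, int[] y,
                                 double lambda, int maxIter, long seed) {
        int dim = x[0].length;
        double[] w = new double[dim];
        Random rnd = new Random(seed);
        for (int t = 1; t <= maxIter; t++) {
            // Pegasos samples a single random example per iteration, which is
            // why pre-fetching only maxIter samples is enough.
            int i = rnd.nextInt(x.length);
            double eta = 1.0 / (lambda * t);       // step size 1/(lambda * t)
            double margin = y[i] * dot(w, x[i]);
            for (int d = 0; d < dim; d++) {
                w[d] *= (1.0 - eta * lambda);      // shrink (regularizer term)
                if (margin < 1.0) {                // hinge-loss sub-gradient
                    w[d] += eta * y[i] * x[i][d];
                }
            }
        }
        return w;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) s += a[d] * b[d];
        return s;
    }

    /** Predicts +1 or -1 for a single sample. */
    public static int predict(double[] w, double[] x) {
        return dot(w, x) >= 0.0 ? 1 : -1;
    }

    public static void main(String[] args) {
        // Tiny linearly separable toy set, just to exercise the update rule.
        double[][] x = {{1, 1}, {2, 1}, {-1, -1}, {-2, -1}};
        int[] y = {1, 1, -1, -1};
        double[] w = train(x, y, 0.1, 1000, 42L);
        System.out.println(predict(w, new double[]{3, 2}));
        System.out.println(predict(w, new double[]{-3, -2}));
    }
}
```

The real solver would read SparseVector rows instead of dense arrays, but the per-iteration cost and the 1/(lambda*t) step schedule are the same.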
The default arguments are:
-tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model

[For the case that the training data set is on HDFS:]
1. Make sure that your training data set has been uploaded to HDFS:
   hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
2. Revise the arguments:
   -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009

Testing: SVMPegasosTesting.java

The arguments are hard-coded in this file; if you want to customize them yourself, please uncomment the first line of the main function.

The default arguments are:
-te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model

-------------------------------------------------------------------------------------------
Experimental Results:
-------------------------------------------------------------------------------------------
Data sets (from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/):

name           | source  | type           | classes | training size | testing size | features
---------------|---------|----------------|---------|---------------|--------------|---------
rcv1.binary    | [DL04b] | classification | 2       | 20,242        | 677,399      | 47,236
covtype.binary | UCI     | classification | 2       | 581,012       |              | 54
a9a            | UCI     | classification | 2       | 32,561        | 16,281       | 123
w8a            | [JP98a] | classification | 2       | 49,749        | 14,951       | 300

Data set       | Accuracy | Training Time | Testing Time
---------------|----------|---------------|-------------
rcv1.binary    | 94.67%   | 19 sec        | 2 min 25 sec
covtype.binary |          | 19 sec        |
a9a            | 84.72%   | 14 sec        | 12 sec
w8a            | 89.8%    | 14 sec        | 8 sec

> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
> Key: MAHOUT-232
> URL:
https://issues.apache.org/jira/browse/MAHOUT-232
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Affects Versions: 0.2
> Reporter: zhao zhendong
> Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.patch
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.