[ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhao zhendong updated MAHOUT-232:
---------------------------------

    Affects Version/s: 0.1
               Status: Patch Available  (was: Open)

Sequential SVM based on Pegasos.

-------------------------------------------------------------------------------------------
Currently, this package provides (Features):
-------------------------------------------------------------------------------------------
1. A sequential linear SVM solver, including training and testing.

2. Support for a general file system only for now; HDFS support is near-future work.

3. Support for large-scale data sets (requires setting the "trainSampleNum" argument).
   Because Pegasos only needs to draw a limited number of samples, this package can pre-fetch a fixed number of samples (e.g. the maximum number of iterations) into memory.
   For example: if the data set has 100,000,000 samples and the maximum number of iterations is left at its default of 10,000, the package randomly loads only 10,000 samples into memory.

-------------------------------------------------------------------------------------------
TODO:
-------------------------------------------------------------------------------------------
1. Support HDFS.

2. Because I adopted mahout.math.SparseMatrix and mahout.math.SparseVectorUnsafe, I must specify the cardinality of the matrix when creating it. This makes it awkward to read data sets in the SVM-light or libsvm formats, which are very popular in the machine learning community, because such files do not store the number of samples or the number of dimensions. Currently I use a naive approach: read the data into a map<> first, then dump it into a SparseMatrix. Does anyone know a smarter method, or another matrix type, that supports this kind of operation?
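To make feature 3 and TODO item 2 concrete, here is a minimal sketch in plain Java of a single pass over libsvm-style lines that keeps at most trainSampleNum randomly chosen samples and records the largest feature index seen, so the cardinality is known before any matrix is created. This is not the code from the patch: the names LibsvmReservoirSketch, Sample, Loaded, parseLine and load are hypothetical, no Mahout classes are used, and reservoir sampling is an assumption about how the random pre-fetch could be done, since the description above does not say which method the patch uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the MAHOUT-232 patch.
final class LibsvmReservoirSketch {

    /** One sparse sample: a label plus parallel index/value arrays (indices as written in the file). */
    static final class Sample {
        final double label;
        final int[] indices;
        final double[] values;

        Sample(double label, int[] indices, double[] values) {
            this.label = label;
            this.indices = indices;
            this.values = values;
        }
    }

    /** Result of one pass: the retained samples and the largest feature index in the whole input. */
    static final class Loaded {
        final List<Sample> samples;
        final int cardinality;

        Loaded(List<Sample> samples, int cardinality) {
            this.samples = samples;
            this.cardinality = cardinality;
        }
    }

    /** Parses "label idx:val idx:val ..." ; the caller has already removed comments and blank lines. */
    static Sample parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        int n = tokens.length - 1;
        int[] indices = new int[n];
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            String[] pair = tokens[i + 1].split(":");
            indices[i] = Integer.parseInt(pair[0]);
            values[i] = Double.parseDouble(pair[1]);
        }
        return new Sample(label, indices, values);
    }

    /** Single pass: reservoir-samples up to trainSampleNum lines and tracks the maximum feature index. */
    static Loaded load(Iterable<String> lines, int trainSampleNum, Random rng) {
        List<Sample> reservoir = new ArrayList<>();
        int maxIndex = 0;
        long seen = 0;
        for (String raw : lines) {
            int hash = raw.indexOf('#');               // drop a trailing "# comment", if any
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            Sample s = parseLine(line);
            for (int idx : s.indices) {
                maxIndex = Math.max(maxIndex, idx);
            }
            seen++;
            if (reservoir.size() < trainSampleNum) {
                reservoir.add(s);                      // fill the reservoir first
            } else {
                long j = (long) (rng.nextDouble() * seen);   // uniform in [0, seen)
                if (j < trainSampleNum) {
                    reservoir.set((int) j, s);         // replace a random slot
                }
            }
        }
        return new Loaded(reservoir, maxIndex);
    }
}
```

After the pass, Loaded.cardinality and Loaded.samples.size() give the two dimensions needed to allocate a sparse matrix, and only the retained samples are ever held in memory. The lines could come from any source that yields one String per record, for example a BufferedReader wrapped in an Iterable.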
-------------------------------------------------------------------------------------------
Usage:
-------------------------------------------------------------------------------------------
Training: SVMPegasosTraining.java
The arguments are hard-coded in this file. If you want to supply your own arguments, please uncomment the first line in the main function.
The default arguments are:
-tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model

Testing: SVMPegasosTesting.java
The arguments are hard-coded in this file. If you want to supply your own arguments, please uncomment the first line in the main function.
The default arguments are:
-te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model

> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.1
>            Reporter: zhao zhendong
>
> After discussing this with people in the community, I decided to re-implement a
> sequential SVM solver based on Pegasos for the Mahout platform (Mahout command-line
> style, SparseMatrix and SparseVector, etc.). Eventually, it will
> support HDFS.
> The plan for Sequential Pegasos:
> 1 Support the general file system (almost finished);
> 2 Support HDFS;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.