[ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830843#action_12830843 ]
Robin Anil commented on MAHOUT-232:
-----------------------------------

Some comments:

* Try using the Mahout collections (OpenIntDoubleHashMap etc.). I have seen large memory savings with them compared to the Java collections. The WeightVector memory footprint would halve.
* Package names should be all lower case, not camel case. I saw
  import org.apache.mahout.classifier.svm.MapReduce.Testing.TestRawKeyValueIterator;
  which should have been org.apache.mahout.classifier.svm.mapreduce.TestRawKeyValueIterator, and in the test directory, not main.
* Move all test classes to the test directory.
* No author tags. See any class in Mahout for reference.
* Your test classes could be reused. We already have a dummy output collector and a dummy status reporter in common. How about moving the testing classes there, or reusing the existing ones? Feel free to modify them or add functionality.
* Organize imports. Are you using the Mahout (Lucene-based) code formatter? It's here: https://issues.apache.org/jira/browse/MAHOUT-233
* Is there a need for a parameter parser? Check out common.parameter.*; you could reuse the parameter classes there. See KMeansMapper for usage.
* In HDFSWriter I see hardcoded paths such as "/user/maximzhao/test.t". These should be configurable.
* I don't think using the HDFSWriter class is the best way to write to HDFS. A FileSystem object would select the appropriate filesystem based on the Hadoop Configuration. HDFSWriter forces your classes to read and write HDFS via the namenode, which makes the code unusable for local execution. Plus, it really shouldn't be used when running a Map/Reduce job, where the underlying FileSystem object is already pointing to HDFS. Creating socket connections is not a good thing when Map/Reducing.
* LibSVMFormatParser could be moved to the utils package, not core. Like the ARFF format reader, we can have a libsvm format reader.
* Move the readme to Package.html so that javadoc generates the package summary.
* Also, could you separate the dataset from the patch and upload two separate files?
  I think others might have issues (read: legal) with including Reuters data in Mahout trunk.

> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: zhao zhendong
>             Fix For: 0.3
>
>         Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.2.patch, SequentialSVM_0.3.patch, SequentialSVM_0.4.patch
>
>
> After discussing with people in this community, I decided to re-implement a
> sequential SVM solver based on Pegasos for the Mahout platform (Mahout command-line
> style, SparseMatrix and SparseVector, etc.). Eventually, it will support HDFS.
>
> Sequential SVM based on Pegasos.
> Maxim zhao (zhaozhendong at gmail dot com)
> -------------------------------------------------------------------------------------------
> Currently, this package provides (Features):
> -------------------------------------------------------------------------------------------
> 1. Sequential SVM linear solver, including training and testing.
> 2. Support for general file systems and HDFS.
> 3. Support for training on large-scale data sets.
>    Because Pegasos only needs to draw a limited number of samples, this package
>    can pre-fetch a fixed number of samples (e.g. the maximum iteration count) into memory.
>    For example, if the data set has 100,000,000 samples and the default maximum
>    iteration count is 10,000, the package randomly loads only 10,000 samples into memory.
> 4. Sequential data set testing, so the package supports large-scale data
>    sets for both training and testing.
> 5. Support for parallel classification (testing phase only) based on the
>    Map-Reduce framework.
> 6. Support for multi-classification based on the Map-Reduce framework (fully
>    parallelized version).
> 7. Support for regression.
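Feature 3 above describes pre-fetching a bounded random subset of a very large training file. The patch's own loading code is not shown in this message, so the following is only a minimal sketch of one way to do it, using reservoir sampling (Algorithm R) as a stand-in technique. The class and method names are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not from the patch): keeps a uniform random subset of a stream. */
final class ReservoirSampleSketch {

  /**
   * Returns up to k items chosen uniformly at random from the iterator,
   * holding at most k items in memory regardless of the stream length.
   */
  static <T> List<T> sample(Iterator<T> items, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    long seen = 0;
    while (items.hasNext()) {
      T item = items.next();
      seen++;
      if (reservoir.size() < k) {
        // Fill phase: the first k items are always kept.
        reservoir.add(item);
      } else {
        // Replacement phase: keep the new item with probability k / seen,
        // overwriting a uniformly chosen slot.
        long j = (long) (rng.nextDouble() * seen);
        if (j < k) {
          reservoir.set((int) j, item);
        }
      }
    }
    return reservoir;
  }
}
```

With k set to the maximum iteration count (10,000 by default, per the description), a call such as sample(lines, 10000, new Random()) over the lines of a 100,000,000-sample file would never hold more than 10,000 samples at once.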
> -------------------------------------------------------------------------------------------
> TODO:
> -------------------------------------------------------------------------------------------
> 1. Multi-classification Probability Prediction
> 2. Performance Testing
> -------------------------------------------------------------------------------------------
> Usage:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Training: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTraining.java
> The default arguments are:
> -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ When the training data set is on HDFS: @
> ~~~~~~~~~~~~~~~~~~~~~~
> 1. Ensure that your training data set has been uploaded to HDFS:
>    hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> 2. Revise the arguments:
>    -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Multi-class Training [Based on MapReduce Framework]: @
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein -m /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 -mhs -Xmx1000M -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Testing: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTesting.java
> I have hard-coded the arguments in this file. If you want to customize the
> arguments yourself, please uncomment the first line in the main function.
> The default arguments are:
> -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel Testing (Classification): @
> ~~~~~~~~~~~~~~~~~~~~~~
> ParallelClassifierDriver.java
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel multi-classification: @
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver -if /user/maximzhao/dataset/protein.t -of /user/maximzhao/proteinpredictionResult -m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080
> Note: the parameter -ms 241572968 is obtained from the equation: ms = input file size / number of mappers.
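The note above gives ms as the input size divided by the number of mappers. A minimal sketch of that arithmetic follows. The helper name is hypothetical, and rounding up is an assumption made here so that a remainder does not spill into an extra split; the note itself does not say how a remainder is handled.

```java
/** Hypothetical helper (not from the patch) for choosing a value to pass as -ms. */
final class SplitSizeSketch {

  /**
   * ms = input size in bytes / number of mappers.
   * Rounded up (an assumption) so that mappers * ms always covers the whole input.
   */
  static long maxSplitSize(long inputBytes, int mappers) {
    if (inputBytes < 0 || mappers <= 0) {
      throw new IllegalArgumentException("inputBytes must be >= 0 and mappers must be > 0");
    }
    return (inputBytes + mappers - 1) / mappers;
  }

  public static void main(String[] args) {
    // Illustrative numbers only: a 1,000,000,000-byte input spread over 4 mappers.
    System.out.println(maxSplitSize(1_000_000_000L, 4)); // prints 250000000
  }
}
```

The input size itself can be read with bin/hadoop fs -ls, as in the training steps earlier in the description.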
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> SVMPegasosTraining.java
> -tr ../examples/src/test/resources/svmdataset/abalone_scale -m ../examples/src/test/resources/svmdataset/SVMregression.model -s 1
> -------------------------------------------------------------------------------------------
> Experimental Results:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name           | source  | type           | class | training size | testing size | feature
> -----------------------------------------------------------------------------------------------
> rcv1.binary    | [DL04b] | classification | 2     | 20,242        | 677,399      | 47,236
> covtype.binary | UCI     | classification | 2     | 581,012       |              | 54
> a9a            | UCI     | classification | 2     | 32,561        | 16,281       | 123
> w8a            | [JP98a] | classification | 2     | 49,749        | 14,951       | 300
>
> Data set       | Accuracy | Training Time | Testing Time |
> -----------------------------------------------------------------------------------------------
> rcv1.binary    | 94.67%   | 19 Sec        | 2 min 25 Sec |
> covtype.binary |          | 19 Sec        |              |
> a9a            | 84.72%   | 14 Sec        | 12 Sec       |
> w8a            | 89.8%    | 14 Sec        | 8 Sec        |
>
> Parallel Classification (Testing)
> Data set       | Accuracy | Training Time | Testing Time            |
> -----------------------------------------------------------------------------------------------
> rcv1.binary    | 94.98%   | 19 Sec        | 3 min 29 Sec (one node) |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Multi-classification Based on MapReduce Framework:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name    | source   | type           | class | training size | testing size | feature
> -----------------------------------------------------------------------------------------------
> poker   | UCI      | classification | 10    | 25,010        | 1,000,000    | 10
> protein | [JYW02a] | classification | 3     | 17,766        | 6,621        | 357
>
> Data set | Accuracy vs. (Libsvm with linear kernel) |
> -----------------------------------------------------------------------------------------------
> poker    | 50.14% vs. (49.952%)                     |
> protein  | 68.14% vs. (64.93%)                      |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name      | source  | type       | class | training size | testing size | feature
> -----------------------------------------------------------------------------------------------
> abalone   | UCI     | regression |       | 4,177         |              | 8
> triazines | UCI     | regression |       | 186           |              | 60
> cadata    | StatLib | regression |       | 20,640        |              | 8
>
> Data set  | Mean squared error vs. (Libsvm with linear kernel) | Training Time | Test Time |
> -----------------------------------------------------------------------------------------------
> abalone   | 6.01 vs. (5.25)                                    | 13 Sec        |           |
> triazines | 0.031 vs. (0.0276)                                 | 14 Sec        |           |
> cadata    | 5.61e+10 vs. (1.40e+10)                            | 20 Sec        |           |

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.