[ 
https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhao zhendong updated MAHOUT-232:
---------------------------------

          Description: 
After discussing this with members of the community, I decided to re-implement a 
sequential SVM solver based on Pegasos for the Mahout platform (Mahout command-line 
style, SparseMatrix, SparseVector, etc.). Eventually, it will support HDFS. 

Sequential SVM based on Pegasos.
Maxim zhao (zhaozhendong at gmail dot com)

-------------------------------------------------------------------------------------------
Currently, this package provides (Features):
-------------------------------------------------------------------------------------------

1. Sequential linear SVM solver, including training and testing.

2. Currently supports both a general file system and HDFS.

3. Support for large-scale data sets.
   Because Pegasos only needs to sample a limited number of examples, this package 
   can pre-fetch a fixed number of samples (e.g. the maximum number of iterations) 
   into memory.
   For example: if the data set has 100,000,000 samples and the maximum number of 
   iterations is left at its default of 10,000, the package randomly loads only 
   10,000 samples into memory. 

4. Sequential testing of the data set, so the package supports large-scale data 
sets for both training and testing.
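
The pre-fetching described in item 3 can be illustrated with reservoir sampling, a 
standard way to draw a fixed-size uniform random subset from a stream in one pass. 
This is only a sketch: the patch may sample differently, and the class and method 
names below are hypothetical, not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not from the patch): keeps k uniformly chosen items from a stream. */
final class ReservoirSampler {
  private ReservoirSampler() {}

  /**
   * Reads the stream once and returns at most k items, each item of the
   * stream being equally likely to end up in the result.
   */
  static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    long seen = 0;
    while (stream.hasNext()) {
      T item = stream.next();
      seen++;
      if (reservoir.size() < k) {
        // Fill the reservoir with the first k items.
        reservoir.add(item);
      } else {
        // Keep the new item with probability k / seen by overwriting a random slot.
        long slot = (long) (rng.nextDouble() * seen);
        if (slot < k) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }
}
```

With k set to the maximum number of iterations, memory use depends on k rather than 
on the size of the data set.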

-------------------------------------------------------------------------------------------
TODO:
-------------------------------------------------------------------------------------------
1. HDFS write function for storing the model file on HDFS.
2. Parallel testing algorithm based on the MapReduce framework.
3. Regression.
4. Multi-class classification.

-------------------------------------------------------------------------------------------
Usage:
-------------------------------------------------------------------------------------------
Training:
SVMPegasosTraining.java
The arguments are hard-coded in this file. To supply your own arguments, 
uncomment the first line of the main function. 
The default arguments are:

-tr ../examples/src/test/resources/svmdataset/train.dat -m 
../examples/src/test/resources/svmdataset/SVM.model

[If the training data set is on HDFS:]
>>>>>>>
1. Make sure that your training data set has been uploaded to HDFS:
hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset

2. Revise the arguments:
-tr /user/hadoop/train.dat -m 
../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
>>>>>>>
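
For orientation, the Pegasos update that a sequential trainer of this kind performs 
can be sketched as follows. This is not the code from the patch: it uses dense 
arrays instead of Mahout's SparseVector, the class, method and parameter names are 
hypothetical, and the regularization parameter lambda is an assumed input.

```java
import java.util.Random;

/** Hypothetical sketch of the Pegasos update for a linear SVM; not the code from the patch. */
final class PegasosSketch {
  private PegasosSketch() {}

  static double dot(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Trains on rows of x with labels y in {+1, -1}; returns the weight vector. */
  static double[] train(double[][] x, int[] y, double lambda, int maxIter, Random rng) {
    double[] w = new double[x[0].length];
    double radius = 1.0 / Math.sqrt(lambda);
    for (int t = 1; t <= maxIter; t++) {
      int i = rng.nextInt(x.length);        // pick one training example at random
      double eta = 1.0 / (lambda * t);      // step size decays as 1 / (lambda * t)
      double margin = y[i] * dot(w, x[i]);  // margin under the current weights
      double shrink = 1.0 - eta * lambda;   // regularization shrinks w every step
      for (int j = 0; j < w.length; j++) {
        w[j] *= shrink;
        if (margin < 1.0) {
          // Hinge loss is active: move w towards classifying example i correctly.
          w[j] += eta * y[i] * x[i][j];
        }
      }
      // Optional projection onto the ball of radius 1 / sqrt(lambda).
      double norm = Math.sqrt(dot(w, w));
      if (norm > radius) {
        double scale = radius / norm;
        for (int j = 0; j < w.length; j++) {
          w[j] *= scale;
        }
      }
    }
    return w;
  }
}
```

Each iteration touches a single example, so the cost of training is governed by 
maxIter rather than by the number of examples available.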

Testing:
SVMPegasosTesting.java
The arguments are hard-coded in this file. To supply your own arguments, 
uncomment the first line of the main function.
The default arguments are:
-te ../examples/src/test/resources/svmdataset/test.dat -m 
../examples/src/test/resources/svmdataset/SVM.model
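
Sequential testing can be sketched as a single pass over the test file that keeps 
only running counters in memory. The sketch below assumes LIBSVM-style lines 
("label index:value ..." with 1-based indices) and a dense weight vector; both are 
assumptions, and the class and method names are hypothetical rather than taken 
from the patch.

```java
import java.util.Iterator;

/** Hypothetical sketch of one-pass testing; not the code from the patch. */
final class SequentialTestSketch {
  private SequentialTestSketch() {}

  /** Returns the fraction of lines whose label sign matches the sign of w . x. */
  static double accuracy(Iterator<String> lines, double[] w) {
    long total = 0;
    long correct = 0;
    while (lines.hasNext()) {
      String line = lines.next().trim();
      if (line.isEmpty()) {
        continue;  // skip blank lines
      }
      String[] tokens = line.split("\\s+");
      boolean positiveLabel = Double.parseDouble(tokens[0]) > 0;
      double score = 0.0;
      for (int k = 1; k < tokens.length; k++) {
        int colon = tokens[k].indexOf(':');
        int index = Integer.parseInt(tokens[k].substring(0, colon)) - 1;  // to 0-based
        double value = Double.parseDouble(tokens[k].substring(colon + 1));
        if (index >= 0 && index < w.length) {
          score += w[index] * value;  // features outside the model are ignored
        }
      }
      if ((score >= 0) == positiveLabel) {
        correct++;
      }
      total++;
    }
    return total == 0 ? 0.0 : (double) correct / total;
  }
}
```

Because only one line is held at a time, the size of the test set does not affect 
memory use; on Java 8 or later the iterator could come from BufferedReader.lines().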

-------------------------------------------------------------------------------------------
Experimental Results:
-------------------------------------------------------------------------------------------
Data set:
name           | source  | type           | classes | training size | testing size | features
-----------------------------------------------------------------------------------------------
rcv1.binary    | [DL04b] | classification | 2       | 20,242        | 677,399      | 47,236
covtype.binary | UCI     | classification | 2       | 581,012       |              | 54
a9a            | UCI     | classification | 2       | 32,561        | 16,281       | 123
w8a            | [JP98a] | classification | 2       | 49,749        | 14,951       | 300
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Data set       | Accuracy | Training Time | Testing Time
-----------------------------------------------------------------------------------------------
rcv1.binary    | 94.67%   | 19 sec        | 2 min 25 sec
covtype.binary |          | 19 sec        |
a9a            | 84.72%   | 14 sec        | 12 sec
w8a            | 89.8%    | 14 sec        | 8 sec

    Affects Version/s:     (was: 0.1)
                       0.2

> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: zhao zhendong
>         Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
