[ 
https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhao zhendong updated MAHOUT-232:
---------------------------------

    Description: 
After discussing this with members of the community, I decided to re-implement a
sequential SVM solver based on Pegasos for the Mahout platform (Mahout
command-line style, SparseMatrix, SparseVector, etc.). Eventually, it will
support HDFS.

Sequential SVM based on Pegasos.
Maxim zhao (zhaozhendong at gmail dot com)

-------------------------------------------------------------------------------------------
Currently, this package provides (Features):
-------------------------------------------------------------------------------------------

1. Sequential linear SVM solver, including training and testing.

2. Support for both general file systems and HDFS.

3. Support for large-scale data sets.
   Because Pegasos only needs to draw a limited number of samples, this
package pre-fetches a fixed number of samples (e.g. the maximum iteration
count) into memory.
   For example, if the data set has 100,000,000 samples and the default
maximum iteration count is 10,000, it randomly loads 10,000 samples into
memory.

4. Sequential data set testing, so the package can handle large-scale data
sets in both training and testing.
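
The description does not say how the random pre-fetch in item 3 is done. One
common way to draw a fixed-size uniform sample from a data set too large for
memory is reservoir sampling, sketched below. This is an assumption for
illustration, not the code in the patch; the class name ReservoirSampler and
its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not part of the patch): uniform sample of k items from a stream. */
final class ReservoirSampler {
  private ReservoirSampler() {}

  /**
   * Reads the stream once and returns at most k items, each stream element
   * having the same probability of ending up in the result.
   */
  static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    long seen = 0;
    while (stream.hasNext()) {
      T item = stream.next();
      seen++;
      if (reservoir.size() < k) {
        // Fill the reservoir with the first k items.
        reservoir.add(item);
      } else {
        // Pick a slot in [0, seen); the item is kept only if the slot is inside the reservoir.
        long slot = (long) (rng.nextDouble() * seen);
        if (slot < k) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }
}
```

With k set to the maximum iteration count (10,000 by default according to the
description), only k samples are held in memory regardless of the size of the
data set.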

-------------------------------------------------------------------------------------------
TODO:
-------------------------------------------------------------------------------------------
1. HDFS write function for storing the model file to HDFS.
2. Parallel testing algorithm based on the MapReduce framework.
3. Regression.
4. Multi-class classification.

-------------------------------------------------------------------------------------------
Usage:
-------------------------------------------------------------------------------------------
Training:
SVMPegasosTraining.java
The arguments are hard-coded in this file. If you want to customize the
arguments yourself, please uncomment the first line in the main function.
The default arguments are:

-tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model

[If the training data set is on HDFS:]
>>>>>>>
1. Make sure that your training data set has been uploaded to HDFS:
hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset

2. Revise the arguments:
-tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
>>>>>>>
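
For readers who want to see what the training step computes, here is a minimal
sketch of the Pegasos stochastic sub-gradient update for a linear SVM. It is
not the code from the patch: it uses dense double[] arrays instead of
SparseVector, and the class name PegasosSketch, the method names, and the
lambda value in the usage note are assumptions made for illustration.

```java
import java.util.Random;

/** Hypothetical dense-array sketch of Pegasos; the patch itself works on sparse vectors. */
final class PegasosSketch {
  private PegasosSketch() {}

  /** Trains on samples x with labels y in {+1, -1}; lambda is the regularization weight. */
  static double[] train(double[][] x, int[] y, double lambda, int maxIter, Random rng) {
    int d = x[0].length;
    double[] w = new double[d];
    for (int t = 1; t <= maxIter; t++) {
      int i = rng.nextInt(x.length);           // draw one sample at random
      double eta = 1.0 / (lambda * t);         // step size for iteration t
      double margin = y[i] * dot(w, x[i]);     // margin under the current weights
      double shrink = 1.0 - eta * lambda;      // equals 1 - 1/t
      for (int j = 0; j < d; j++) {
        w[j] *= shrink;
      }
      if (margin < 1.0) {                      // hinge loss is active: move toward the sample
        for (int j = 0; j < d; j++) {
          w[j] += eta * y[i] * x[i][j];
        }
      }
      // Optional projection onto the ball of radius 1 / sqrt(lambda).
      double norm = Math.sqrt(dot(w, w));
      double radius = 1.0 / Math.sqrt(lambda);
      if (norm > radius) {
        double scale = radius / norm;
        for (int j = 0; j < d; j++) {
          w[j] *= scale;
        }
      }
    }
    return w;
  }

  static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int j = 0; j < a.length; j++) {
      s += a[j] * b[j];
    }
    return s;
  }

  /** Predicts +1 or -1 from the sign of the linear score. */
  static int predict(double[] w, double[] x) {
    return dot(w, x) >= 0 ? 1 : -1;
  }
}
```

A call such as PegasosSketch.train(x, y, 0.01, 10000, new Random()) touches one
sample per iteration, which is why a pre-fetched sample of about maxIter rows
can be enough for training.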

Testing:
SVMPegasosTesting.java
The arguments are hard-coded in this file. If you want to customize the
arguments yourself, please uncomment the first line in the main function.
The default arguments are:
-te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
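
Sequential testing can be pictured as scoring one line at a time, so that the
test set never has to fit in memory. The sketch below assumes the LIBSVM
sparse text format ("label index:value index:value ...", 1-based indices),
since the data sets below come from the LIBSVM site; the actual file format
and classes used by the patch may differ, and the name SequentialTestSketch is
hypothetical.

```java
import java.util.Iterator;

/** Hypothetical sketch of line-by-line testing against a dense weight vector. */
final class SequentialTestSketch {
  private SequentialTestSketch() {}

  /** Linear score of one LIBSVM-style line; features outside w are ignored. */
  static double score(String line, double[] w) {
    String[] tokens = line.trim().split("\\s+");
    double s = 0.0;
    for (int k = 1; k < tokens.length; k++) {   // token 0 is the label
      int colon = tokens[k].indexOf(':');
      int index = Integer.parseInt(tokens[k].substring(0, colon)) - 1;
      double value = Double.parseDouble(tokens[k].substring(colon + 1));
      if (index >= 0 && index < w.length) {
        s += w[index] * value;
      }
    }
    return s;
  }

  /** Fraction of non-empty lines whose predicted sign matches the label. */
  static double accuracy(Iterator<String> lines, double[] w) {
    long total = 0;
    long correct = 0;
    while (lines.hasNext()) {
      String line = lines.next().trim();
      if (line.isEmpty()) {
        continue;
      }
      double label = Double.parseDouble(line.split("\\s+")[0]);
      int expected = label > 0 ? 1 : -1;
      int predicted = score(line, w) >= 0 ? 1 : -1;
      if (expected == predicted) {
        correct++;
      }
      total++;
    }
    return total == 0 ? 0.0 : (double) correct / total;
  }
}
```

For a file, the iterator can come from BufferedReader.lines().iterator(), which
reads lazily and keeps only the current line in memory.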

-------------------------------------------------------------------------------------------
Experimental Results:
-------------------------------------------------------------------------------------------
Data set:
name           | source  | type           | classes | training size | testing size | features
-----------------------------------------------------------------------------------------------
rcv1.binary    | [DL04b] | classification | 2       | 20,242        | 677,399      | 47,236
covtype.binary | UCI     | classification | 2       | 581,012       |              | 54
a9a            | UCI     | classification | 2       | 32,561        | 16,281       | 123
w8a            | [JP98a] | classification | 2       | 49,749        | 14,951       | 300
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Data set       | Accuracy | Training Time | Testing Time
---------------------------------------------------------
rcv1.binary    | 94.67%   | 19 Sec        | 2 min 25 Sec
covtype.binary |          | 19 Sec        |
a9a            | 84.72%   | 14 Sec        | 12 Sec
w8a            | 89.8%    | 14 Sec        | 8 Sec


> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: zhao zhendong
>         Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.patch
>
