[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707830#comment-13707830
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/13/13 8:57 PM:
-

Test on the libimseti dataset (http://www.occamslab.com/petricek/data/); 
libimseti is a Czech dating website.
This dataset is used in a live example in the book 'Mahout in Action', 
page 71, written by a few guys hanging around this site.

parameters:
  private final static double lambda = 0.1;
  private final static int rank = 16;
  
  private static int numALSIterations=5;
  private static int numEpochs=20;

(for ratingSGD)
  double randomNoise=0.02;
  double learningRate=0.01;
  double learningDecayRate=1;

(for parallelSGD)
  double mu0=1;
  double decayFactor=1;
  int stepOffset=100;
  double forgettingExponent=-1;
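
The parallelSGD parameter names suggest a decaying step size, but the comment does not state the formula. The sketch below shows one plausible reading, assuming the rate for epoch i is mu0 * decayFactor^(i-1) * (i + stepOffset)^forgettingExponent. The class name, method name, and formula are assumptions for illustration, not code taken from the attached patch.

```java
/** Hypothetical helper; the formula is an assumed reading of the parameter names. */
final class LearningRateSchedule {
    private LearningRateSchedule() {}

    /** Step size for a 1-based epoch: mu0 * decayFactor^(epoch-1) * (epoch + stepOffset)^forgettingExponent. */
    static double rate(int epoch, double mu0, double decayFactor, int stepOffset, double forgettingExponent) {
        return mu0 * Math.pow(decayFactor, epoch - 1) * Math.pow(epoch + stepOffset, forgettingExponent);
    }

    public static void main(String[] args) {
        // Values from the comment: mu0=1, decayFactor=1, stepOffset=100, forgettingExponent=-1.
        System.out.printf("epoch 1:  %.6f%n", rate(1, 1.0, 1.0, 100, -1.0));
        System.out.printf("epoch 20: %.6f%n", rate(20, 1.0, 1.0, 100, -1.0));
    }
}
```

Under this assumed formula the listed values reduce to 1 / (i + 100), so the rate starts just under 0.01, which is close to the fixed learningRate used for ratingSGD.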

Results (average absolute difference; ratings are on a 1-10 scale):

INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 time spent: 41.24s===
(Note that the number of ALS iterations is much smaller than the number of 
epochs used by the other factorizers, which leads to a suboptimal result, but 
that is not the point of this test.)
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 time spent: 118.188s===
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 1.2798905733917445 time spent: 21.806s
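
For readers unfamiliar with the metric, average absolute difference is the mean of |actual - predicted| over the held-out ratings. The following is a minimal standalone sketch with made-up ratings on a 1-10 scale; it is not the evaluator that produced the numbers above, and the class name is hypothetical.

```java
/** Hypothetical helper illustrating the metric; not Mahout's evaluator. */
final class AverageAbsoluteDifference {
    private AverageAbsoluteDifference() {}

    /** Mean of |actual[i] - predicted[i]| over all test ratings. */
    static double of(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    public static void main(String[] args) {
        // Made-up ratings; absolute differences are 1, 1.5, 1 and 0.5.
        double[] actual = {8, 3, 10, 5};
        double[] predicted = {7, 4.5, 9, 5.5};
        System.out.println(of(actual, predicted));
    }
}
```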

This is the best result I can get. The book claims a best result of 1.12 on 
this dataset, which I have never achieved. If you have also experimented and 
found a better parameter set, please post it here.

  
> Parallel SGD matrix factorizer for SVDrecommender
> -
>
> Key: MAHOUT-1272
> URL: https://issues.apache.org/jira/browse/MAHOUT-1272
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Reporter: Peng Cheng
> Assignee: Sean Owen
> Labels: features, patch, test
> Fix For: 0.8
>
> Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
> libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
> ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
> ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> A parallel factorizer based on MAHOUT-1089 may achieve better performance on 
> multicore processors.
> The existing code is single-threaded and may still be outperformed by the 
> default ALS-WR.
> In addition, its hardcoded online-to-batch conversion prevents it from being 
> used by an online recommender. An online SGD implementation may help build a 
> high-performance online recommender as a replacement for the outdated 
> slope-one.
> The new factorizer can implement either DSGD 
> (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
> Hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
> Related discussion has been going on for a while but remains inconclusive:
> http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.

[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-08 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702175#comment-13702175
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/8/13 6:06 PM:


Hey Sebastian, Hudson, thank you so much for pushing things that hard. I owe 
you for this.
I'll test on more GroupLens data. Since Sebastian has taken over the code, new 
test cases will only be posted as code snippets.

  was (Author: peng):
Hey Sebastian, Hudson, thank you so much for pushing things that hard. I 
owe you for this.
Testing on the Netflix dataset has run into some trouble: namely, I don't know 
where to download it :-<. I would greatly appreciate it if anyone could share 
their judging.txt. In the meantime I'll try more GroupLens data.
Since Sebastian has taken over the code, new test cases will only be posted as 
code snippets.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701682#comment-13701682
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/7/13 10:21 PM:
-

New parameters:
lambda = 0.001
rank of the rating matrix (number of features in each user/item vector) = 5
number of iterations/epochs = 20
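
To make the roles of lambda and the rank concrete, here is a sketch of the textbook regularized SGD step for matrix factorization on a single observed rating. It is an illustration only: the attached factorizers may differ (for example by adding bias terms), and the class name, the learning rate of 0.01, and the toy rating are assumptions.

```java
import java.util.Arrays;

/** Hypothetical sketch of one textbook SGD step; not the code in the patch. */
final class SgdUpdateSketch {
    private SgdUpdateSketch() {}

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int f = 0; f < a.length; f++) {
            s += a[f] * b[f];
        }
        return s;
    }

    /**
     * Applies one gradient step to both vectors in place and returns the
     * prediction error measured before the update.
     */
    static double update(double[] user, double[] item, double rating, double learningRate, double lambda) {
        double err = rating - dot(user, item);
        for (int f = 0; f < user.length; f++) {
            double u = user[f];
            double v = item[f];
            // lambda shrinks each feature toward zero (L2 regularization).
            user[f] = u + learningRate * (err * v - lambda * u);
            item[f] = v + learningRate * (err * u - lambda * v);
        }
        return err;
    }

    public static void main(String[] args) {
        int rank = 5;          // number of features per vector, as in the comment
        double lambda = 0.001; // regularization weight, as in the comment
        double[] user = new double[rank];
        double[] item = new double[rank];
        Arrays.fill(user, 0.1);
        Arrays.fill(item, 0.1);
        for (int step = 1; step <= 20; step++) {
            double err = update(user, item, 4.0, 0.01, lambda);
            System.out.printf("step %d: error before update = %.4f%n", step, err);
        }
    }
}
```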

Results on movielens-10m (all evaluations use RMSE):
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 0.8115207244832938 time spent: 8.747s
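
RMSE here is the square root of the mean squared difference between actual and predicted ratings. A minimal standalone sketch with made-up ratings follows; it is not the evaluator that produced the numbers above, and the class name is hypothetical.

```java
/** Hypothetical helper illustrating the metric; not Mahout's evaluator. */
final class RootMeanSquaredError {
    private RootMeanSquaredError() {}

    /** sqrt(mean((actual[i] - predicted[i])^2)) over all test ratings. */
    static double of(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }

    public static void main(String[] args) {
        // Made-up ratings; squared differences are 1, 0, 1 and 4.
        double[] actual = {4, 3, 5, 2};
        double[] predicted = {3, 3, 4, 4};
        System.out.println(of(actual, predicted));
    }
}
```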

This is fast and accurate enough; I'm moving on to the Netflix Prize dataset.




[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-06 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701233#comment-13701233
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/6/13 2:43 PM:


Hey, I have finished the class and test for a parallel SGD factorizer for a 
matrix-completion-based recommender (not MapReduce, just single-machine 
multi-threaded). It is loosely based on vanilla SGD and Hogwild!. I have only 
tested it on toy and synthetic data (2000 users * 1000 items), but it is pretty 
fast: 3-5x faster than vanilla SGD with 8 cores. (It never exceeds 6x; 
apparently the executor induces a high allocation overhead.) It is also 
definitely faster than single-machine ALSWR.

I'm submitting my Java files and patch for review.
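
For readers who have not seen the idea, the sketch below illustrates a Hogwild!-style epoch: several threads apply plain SGD updates to shared factor arrays without any locking, accepting occasional stale reads and lost updates. It is not the attached ParallelSGDFactorizer; the class, the striped work split, the synthetic data sizes, and the hyperparameters in main are all assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical Hogwild!-style sketch; not the attached ParallelSGDFactorizer. */
final class HogwildStyleSgdSketch {
    final double[][] users; // one factor vector per user
    final double[][] items; // one factor vector per item

    HogwildStyleSgdSketch(int numUsers, int numItems, int rank, long seed) {
        Random rnd = new Random(seed);
        users = randomMatrix(numUsers, rank, 0.1, rnd);
        items = randomMatrix(numItems, rank, 0.1, rnd);
    }

    static double[][] randomMatrix(int rows, int cols, double scale, Random rnd) {
        double[][] m = new double[rows][cols];
        for (double[] row : m) {
            for (int c = 0; c < cols; c++) {
                row[c] = scale * rnd.nextDouble();
            }
        }
        return m;
    }

    /** Dense toy data from random low-rank factors; each row is {userIndex, itemIndex, rating}. */
    static double[][] syntheticRatings(int numUsers, int numItems, int rank, long seed) {
        Random rnd = new Random(seed);
        double[][] u = randomMatrix(numUsers, rank, 1.0, rnd);
        double[][] v = randomMatrix(numItems, rank, 1.0, rnd);
        double[][] ratings = new double[numUsers * numItems][];
        for (int a = 0; a < numUsers; a++) {
            for (int b = 0; b < numItems; b++) {
                ratings[a * numItems + b] = new double[] {a, b, dot(u[a], v[b])};
            }
        }
        return ratings;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int f = 0; f < a.length; f++) {
            s += a[f] * b[f];
        }
        return s;
    }

    /** Deliberately unsynchronized: threads may read stale values or overwrite each other. */
    void update(double[] r, double lr, double lambda) {
        double[] p = users[(int) r[0]];
        double[] q = items[(int) r[1]];
        double err = r[2] - dot(p, q);
        for (int f = 0; f < p.length; f++) {
            double pf = p[f];
            double qf = q[f];
            p[f] = pf + lr * (err * qf - lambda * pf);
            q[f] = qf + lr * (err * pf - lambda * qf);
        }
    }

    void train(double[][] ratings, int epochs, int threads, double lr, double lambda) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int e = 0; e < epochs; e++) {
                List<Future<?>> pending = new ArrayList<>();
                for (int t = 0; t < threads; t++) {
                    final int offset = t;
                    // Each task walks its own stripe of the ratings array.
                    pending.add(pool.submit(() -> {
                        for (int k = offset; k < ratings.length; k += threads) {
                            update(ratings[k], lr, lambda);
                        }
                    }));
                }
                for (Future<?> f : pending) {
                    f.get(); // wait for every stripe before starting the next epoch
                }
            }
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException(ex);
        } finally {
            pool.shutdown();
        }
    }

    double rmse(double[][] ratings) {
        double sum = 0.0;
        for (double[] r : ratings) {
            double diff = r[2] - dot(users[(int) r[0]], items[(int) r[1]]);
            sum += diff * diff;
        }
        return Math.sqrt(sum / ratings.length);
    }

    public static void main(String[] args) {
        double[][] ratings = syntheticRatings(200, 100, 5, 42L);
        HogwildStyleSgdSketch model = new HogwildStyleSgdSketch(200, 100, 5, 7L);
        System.out.println("RMSE before: " + model.rmse(ratings));
        model.train(ratings, 20, 4, 0.01, 0.001);
        System.out.println("RMSE after:  " + model.rmse(ratings));
    }
}
```

Because the updates race, the exact numbers vary from run to run. Submitting one task per thread per epoch also means the executor is invoked repeatedly, which is one place where the overhead mentioned above could come from.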

