[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-09-04 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758103#comment-13758103
 ] 

Peng Cheng commented on MAHOUT-1286:


The existing open addressing hash table is for 1d arrays, not 2d matrices. I 
can get the concurrency done by next week, but there are simply too many 
pending optimizations; e.g. if you set the load factor to 1.2 it is pretty 
slow. If you can help work through the TODO list in the code, that would be 
awesome.

Not sure about the consequences, as the 2d matrix interface has an int 
(32-bit) index, but DataModel has a long (64-bit) index. If you don't mind 
adding more things to mahout-math, then it should be all right.

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
> Semifinal-implementation-added.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementations in the current CF component use hash maps to 
> enable fast 2d indexing and updates. This is not memory-efficient for big 
> data sets; e.g. the Netflix prize dataset takes 11G of heap space as a 
> FileDataModel. An improved DataModel implementation should use a more 
> compact data structure (like arrays); this can trade a little time 
> complexity in 2d indexing for a vast improvement in memory efficiency. In 
> addition, any online recommender or online-to-batch converted recommender 
> will not be affected by this in the training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-29 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Attachment: Semifinal-implementation-added.patch

Sorry about the late reply. Please note that the code can still be optimized 
in many places; I'll keep maintaining it and keep an ear open for all 
suggestions.



[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-29 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754269#comment-13754269
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Gokhan,

No problem, but it only has two files; I'll post the patch immediately. -Yours 
Peng



[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-16 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742714#comment-13742714
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Dr Dunning,

Much appreciated. I watched your Berlin talk on YouTube and finally have a 
clue about what is going on here.

If I understand correctly, the core concept is to use Solr as a sparse matrix 
multiplier. So theoretically it can encapsulate any recommendation engine (not 
necessarily CF) as long as the recommendation phase can be cast as linear 
multiplication. The co-occurrence matrix is one instance; other types of 
recommendation are possible, but slightly harder, sometimes requiring multiple 
queries. The following four cases should cover most classical CF instances:

1. Item-based CF (result = Sim(A,A) * h, where A is the rating matrix and 
Sim() is the item-to-item similarity matrix between all pairs of items): this 
is the easiest and has already been addressed in your talk: calculate Sim(A,A) 
beforehand, import it into Solr, and run a query ranked by weighted frequency.

2. User-based CF (result = A^T * Sim(A,h), where Sim() is the user-to-user 
similarity vector between the new user and all old users): slightly more 
complex; run the first query on A ranked by the customized similarity 
function, then use its result to run the second query on A^T ranked by 
weighted frequency.

3. SVD-based CF: not doable if the new user is not known beforehand; AFAIK 
Solr doesn't have any form of matrix pseudoinversion or optimization function, 
so determining a new user's projection in the SV subspace is impossible given 
only its dot products with some old items. However, if the user in question is 
old, or the new user can be merged into the model in real time, Solr can just 
look up its vector in the SV subspace by a full-match search.

4. Ensemble: obviously another linear operation; it can be interpreted as a 
query with a mixed ranking function or as multiple queries. Multi-model 
recommendation, as a juxtaposition of rating matrices (A_1 | A_2), was never a 
problem either, using old-style CF or recommendation-as-search.
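Case 1, for example, boils down to a plain matrix-vector product. A minimal sketch in Java (the class name, matrix, and vector here are made up for illustration; this is not Mahout or Solr API):

```java
// Sketch of case 1: score = Sim(A,A) * h, where h holds the new user's
// ratings and Sim(A,A) is a precomputed item-item similarity matrix.
public class ItemBasedSketch {
    // score[i] = sum over j of sim[i][j] * prefs[j]
    public static double[] score(double[][] sim, double[] prefs) {
        double[] out = new double[sim.length];
        for (int i = 0; i < sim.length; i++) {
            for (int j = 0; j < prefs.length; j++) {
                out[i] += sim[i][j] * prefs[j];
            }
        }
        return out; // rank items by descending score
    }
}
```

In the Solr formulation this product is never computed explicitly; the query ranked by weighted frequency plays the role of the multiplication.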

Judging by the sheer performance and scalability of Solr, this could 
potentially make recommendation-as-search a superior option. However, as 
Gokhan inferred, we will likely still use the old algorithms for training, and 
Solr only for recommendation. So I'm going back to MAHOUT-1274 anyway, using 
the posted DataModel as temporary glue. It won't be hard for me or anybody 
else to refactor it for the Solr interface.

-Yours Peng



[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737563#comment-13737563
 ] 

Peng Cheng commented on MAHOUT-1286:


Also, please note that the first patch is still not fully optimized; many 
improvements can be made to make it smaller and faster (see the TODO list in 
the code). But I'm trying to get back to MAHOUT-1274; if we expect large-scale 
refactoring of all recommenders in favor of recommendation-as-search, I'll 
have to suspend it until the refactoring is finished.

I'm waiting online for Dr Dunning's plan.



[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737553#comment-13737553
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Gentlemen,

Thanks a lot for proving my point, Gokhan; yeah, I mean either user or item 
preference extraction can be fast, but not both.

Sorry, I should have proposed it in our last hangout, but I missed the 
invitation :-< Still, I tried to understand your proposal on 
recommendation-as-search.

From what I heard on YouTube, the new architecture is proposed as an easier 
and faster replacement for all existing recommenders that take a DataModel. 
Each item is a weighted 'bag of words' generated by co-occurrence 
analysis/item similarity on previous ratings. A new user's ratings are 
converted into a weighted tuple of existing words and matched with the items 
that have the highest sum of hits.

My concerns are: 1) does it support all types of recommenders and their 
ensembles? I know modern search engines like Google and Yandex have fairly 
complex ensemble search and ranking algorithms that look similar to an 
ensemble recommender, but IMHO Lucene is built only for text search, and I'm 
not sure to what extent it is customizable. 2) does it support online 
learning? This feature is more important to SVDRecommender, as a new user's 
recommendation is only known once this user is merged into the model. (Of 
course, an option is to project a new user into the user subspace by 
minimizing its distance given its dot products with existing items, but nobody 
has tested its performance before.)



[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737033#comment-13737033
 ] 

Peng Cheng commented on MAHOUT-1286:


The data structure used here (1d hashing for the row/column index + 2d 
hopscotch hashing for the values) should be more efficient than any sparse 
matrix representation in mahout-math, and it does not prioritize rows or 
columns (so extracting a single row/column or a submatrix is equally fast). 
I'm wondering if the same technique can be merged into mahout-math for other 
purposes.
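One way to make 2d indexing row/column-neutral (a hedged sketch of the general idea, not the attached implementation) is to fold both int indices into a single long key for a flat hash table:

```java
// Hypothetical helper: pack a (row, col) pair into one long so a 1d hash
// table can act as a 2d map. Both coordinates stay recoverable from the
// key, so neither rows nor columns are privileged by the layout.
public class LongKey2D {
    public static long pack(int row, int col) {
        return ((long) row << 32) | (col & 0xFFFFFFFFL);
    }
    public static int row(long key) { return (int) (key >>> 32); }
    public static int col(long key) { return (int) key; }
}
```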



[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736962#comment-13736962
 ] 

Peng Cheng edited comment on MAHOUT-1286 at 8/12/13 4:54 PM:
-

The idea of an ArrayMap has been discarded due to its impractical time 
consumption for insertion (O(n) for a batch insertion) and query (O(log n)). 
I have moved back to a HashMap. For the same reason, I feel that using a 
sparse row/column matrix may have the same problem.
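That cost profile can be illustrated with a toy sorted-array insert (illustrative code, not the discarded ArrayMap itself): lookup is O(log n) by binary search, but every insertion has to shift the tail of the array, which is O(n).

```java
import java.util.Arrays;

// Toy sorted-array key insertion: binary search finds the slot in
// O(log n), but the arraycopy calls move O(n) elements on every insert.
public class SortedArrayInsert {
    public static long[] insert(long[] sorted, long key) {
        int pos = Arrays.binarySearch(sorted, key); // O(log n) lookup
        if (pos < 0) pos = -pos - 1;                // insertion point
        long[] out = new long[sorted.length + 1];
        System.arraycopy(sorted, 0, out, 0, pos);   // O(n) shift
        out[pos] = key;
        System.arraycopy(sorted, pos, out, pos + 1, sorted.length - pos);
        return out;
    }
}
```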



[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737023#comment-13737023
 ] 

Peng Cheng commented on MAHOUT-1286:


Well, I mean, I partially agree that the effort I spent on this probably won't 
pay off, as few will use an in-memory/file DataModel in production; most 
people will choose a database-backed one. I'm just trying to solve it because 
it's a blocker.



[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737019#comment-13737019
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Dr Dunning,

Indeed, both Gokhan and I have experimented with that, but I've run into some 
difficulties, namely: 1) a columnar form doesn't support fast extraction of 
rows, yet a DataModel should allow quick getPreferencesFromUser() and 
getPreferencesForItem(); 2) a columnar form doesn't support fast online 
updates (time complexity is O(n), at best O(log n) if using block copy and the 
columns are sorted); 3) to create such a DataModel we need to initialize a 
HashMap first, which uses twice as much heap space during initialization and 
could defeat the purpose.
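Difficulty 1 can be sketched like this (hypothetical field and method names; a toy item-major layout, not the actual DataModel code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy item-major (columnar) preference store: per-item parallel arrays.
// Reading one item's preferences is a direct slice, but collecting one
// user's preferences forces a scan over every item column.
public class ColumnarPrefs {
    long[][] userIds;  // userIds[item][k]: the k-th user who rated item
    float[][] ratings; // ratings[item][k]: that user's rating

    float[] preferencesForItem(int item) {       // fast: one array slice
        return ratings[item];
    }
    List<Float> preferencesFromUser(long user) { // slow: full scan
        List<Float> out = new ArrayList<>();
        for (int item = 0; item < userIds.length; item++) {
            for (int k = 0; k < userIds[item].length; k++) {
                if (userIds[item][k] == user) {
                    out.add(ratings[item][k]);
                }
            }
        }
        return out;
    }
}
```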

I'm not sure if Gokhan has encountered the same problem. Didn't hear from him 
for some time.

The search-based recommender is indeed a very tempting solution. I'm quite 
sure it is an all-around improvement for similarity-based recommenders. But 
low-rank matrix-factorization-based ones should merge preferences from new 
users immediately into the prediction model; of course you can just project 
them into the low-rank subspace, but this reduces the performance a little 
bit.

I'm not sure how well Lucene supports online updates of its indices, but 
according to the guys I'm working with, online recommenders seem to be in 
demand these days.



[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737019#comment-13737019
 ] 

Peng Cheng edited comment on MAHOUT-1286 at 8/12/13 4:44 PM:
-

Hi Dr Dunning,

Indeed, both Gokhan and I have experimented with that, but I've run into some 
difficulties, namely: 1) a columnar form doesn't support fast extraction of 
rows, yet a DataModel should allow quick getPreferencesFromUser() and 
getPreferencesForItem(); 2) a columnar form doesn't support fast online 
updates (time complexity is O(n), at best O(log n) if using block copy and the 
columns are sorted); 3) to create such a DataModel we need to initialize a 
HashMap first, which uses twice as much heap space during initialization and 
could defeat the purpose.

I'm not sure if Gokhan has encountered the same problem. Didn't hear from him 
for some time.

The search-based recommender is indeed a very tempting solution. I'm quite 
sure it is an all-around improvement for similarity-based recommenders. But 
low-rank matrix-factorization-based ones should merge preferences from new 
users immediately into the prediction model; of course you can just project 
them into the low-rank subspace, but this reduces the performance a little 
bit.

I'm not sure how well Lucene supports online updates of its indices, but 
according to the guys I'm working with, online recommenders seem to be in 
demand these days.



[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Fix Version/s: 0.9
   Labels: collaborative-filtering datamodel patch recommender  (was: )
   Status: Patch Available  (was: Open)

According to my test, it can load the entire Netflix dataset into memory using 
only 3G heap space.



[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736992#comment-13736992
 ] 

Peng Cheng commented on MAHOUT-1286:


Here is my final solution after numerous experiments: a combination of double 
hashing for storing user/item IDs, and 2d hopscotch hashing 
(http://mcg.cs.tau.ac.il/papers/disc2008-hopscotch.pdf) for storing 
preferences as a map from the user/item indices in the double hashing table. 
Hopscotch hashing maintains strong locality and a high load factor, and each 
dimension uses an independent hash function. As a result, it can quickly 
extract a submatrix or a single row or column.

This is the smallest implementation I can think of; apparently only a bloom 
map can achieve a smaller memory footprint, but it has many other problems.
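For reference, the double-hashing part can be sketched as a toy open-addressing set for long IDs (the hash choices here are invented; the attached patch differs):

```java
// Toy double-hashing set: colliding keys probe with a step derived from a
// second hash, so they follow different probe sequences. Capacity should
// be prime (so every step cycles all slots) and the table must never fill.
public class DoubleHashSet {
    private final long[] slots;
    private final boolean[] used;

    public DoubleHashSet(int capacity) {
        slots = new long[capacity];
        used = new boolean[capacity];
    }
    private int h1(long k) { return (int) Math.floorMod(k, (long) slots.length); }
    private int h2(long k) { // step in [1, capacity-1], nonzero by construction
        return 1 + (int) Math.floorMod(k >>> 16, (long) (slots.length - 1));
    }
    public void add(long k) {
        int i = h1(k), step = h2(k);
        while (used[i] && slots[i] != k) i = (i + step) % slots.length;
        slots[i] = k;
        used[i] = true;
    }
    public boolean contains(long k) {
        int i = h1(k), step = h2(k);
        for (int probes = 0; probes < slots.length; probes++) {
            if (!used[i]) return false;
            if (slots[i] == k) return true;
            i = (i + step) % slots.length;
        }
        return false;
    }
}
```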



[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Attachment: InMemoryDataModelTest.java
InMemoryDataModel.java

See the uploaded files for details.

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
> Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736962#comment-13736962
 ] 

Peng Cheng commented on MAHOUT-1286:


The idea of ArrayMap has been discarded due to its impractical insertion cost 
(O(n) per batch insertion) and query cost (O(log n)). I have moved back to 
HashMap. For the same reason, I suspect that a sparse row/column matrix would 
have the same problem.

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-23 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717659#comment-13717659
 ] 

Peng Cheng commented on MAHOUT-1286:


Aye aye, I just did. It turns out that instances of 
PreferenceArray$PreferenceView have taken 1.7 GB. Quite unexpected, right? 
Thanks a lot for the advice.
My next experiment will use GenericPreference[] directly; there will be no 
more PreferenceArray.

Class Name                                                                      |    Objects |  Shallow Heap |    Retained Heap
--------------------------------------------------------------------------------------------------------------------------------
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView | 72,237,632 | 1,733,703,168 | >= 1,733,703,168
long[]                                                                          |    480,199 |   818,209,680 |   >= 818,209,680
float[]                                                                         |    480,190 |   410,563,592 |   >= 410,563,592
java.lang.Object[]                                                              |     18,230 |   361,525,488 | >= 2,443,647,088
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray                |    480,189 |    15,366,048 | >= 1,237,456,672
java.util.ArrayList                                                             |     17,811 |       427,464 | >= 2,092,416,104
char[]                                                                          |      2,150 |       272,632 |     >= 272,632
byte[]                                                                          |        141 |        54,048 |      >= 54,048
java.lang.String                                                                |      2,119 |        50,856 |     >= 271,920
java.util.concurrent.ConcurrentHashMap$HashEntry                                |        673 |        21,536 |      >= 38,104
java.net.URL                                                                    |        229 |        14,656 |      >= 40,720
java.util.HashMap$Entry                                                         |        344 |        11,008 |      >= 68,760
--------------------------------------------------------------------------------------------------------------------------------
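A quick sanity check on the dominant row above (my own arithmetic, not from the profiler): 1,733,703,168 bytes of shallow heap across 72,237,632 PreferenceView instances works out to exactly 24 bytes per object, i.e. pure per-object overhead for what is logically one (userID, itemID, value) triple.

```java
// Hypothetical helper to verify the heap-dump numbers quoted above.
public class ShallowHeapCheck {
    // Average shallow bytes per instance for one row of the dominator report.
    public static long bytesPerObject(long shallowHeapBytes, long objectCount) {
        return shallowHeapBytes / objectCount;
    }

    public static void main(String[] args) {
        // PreferenceView row: 1,733,703,168 bytes across 72,237,632 objects.
        System.out.println(bytesPerObject(1_733_703_168L, 72_237_632L)); // prints 24
    }
}
```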


> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715912#comment-13715912
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Sebastian, Gokhan, what do you think is the cause of the memory-efficiency 
problem? Should we discuss it privately? I'm also interested in your 
experiment results.

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715906#comment-13715906
 ] 

Peng Cheng edited comment on MAHOUT-1286 at 7/23/13 12:26 AM:
--

On the other hand, I tried to solve the problem by implementing 
FastByIDArrayMap, a slightly more compact Map implementation than FastByIDMap. 
It uses binary search to arrange all entries in a tight array, so its 
worst-case time complexity for get, put and delete is O(log n) (much slower 
than double hashing's average O(1)), but it has a (marginally) smaller memory 
footprint and faster iteration. It passes all unit tests, but its real 
performance can only be shown when embedded in FileDataModel. I'll post the 
results shortly.

However, I don't feel this is the right direction. If Sean Owen did everything 
right in his FastByIDMap, then reducing the memory footprint to 0.66 times is 
not worth the speed loss.
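For illustration only (this is my own minimal sketch, not the attached FastByIDArrayMap), the binary-search idea reduces to two parallel sorted arrays; note how put pays O(n) array copying per insertion, which is exactly the trade-off described above.

```java
import java.util.Arrays;

// Hypothetical sketch of a sorted-array map from long IDs to float values.
// get is O(log n) via binary search; put is O(n) due to array shifting;
// iteration is a tight linear scan with no per-entry objects.
public class SortedArrayFloatMap {
    private long[] keys = new long[0];
    private float[] values = new float[0];

    public Float get(long key) {
        int i = Arrays.binarySearch(keys, key);
        return i >= 0 ? values[i] : null;
    }

    public void put(long key, float value) {
        int i = Arrays.binarySearch(keys, key);
        if (i >= 0) {           // key already present: overwrite in place
            values[i] = value;
            return;
        }
        int at = -i - 1;        // insertion point keeping the arrays sorted
        long[] nk = new long[keys.length + 1];
        float[] nv = new float[values.length + 1];
        System.arraycopy(keys, 0, nk, 0, at);
        System.arraycopy(values, 0, nv, 0, at);
        nk[at] = key;
        nv[at] = value;
        System.arraycopy(keys, at, nk, at + 1, keys.length - at);
        System.arraycopy(values, at, nv, at + 1, values.length - at);
        keys = nk;
        values = nv;
    }

    public int size() {
        return keys.length;
    }
}
```

The primitive arrays avoid boxed Long/Float entries entirely, which is where the (marginal) footprint saving over an open hash table comes from.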

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715906#comment-13715906
 ] 

Peng Cheng commented on MAHOUT-1286:


On the other hand, I tried to solve the problem by implementing 
FastByIDArrayMap, a slightly more compact Map implementation than FastByIDMap. 
It uses binary search to arrange all entries in a tight array, so its 
worst-case time complexity for get, put and delete is O(log n) (much slower 
than double hashing's average O(1)), but it has a (marginally) smaller memory 
footprint and faster iteration. It passes all unit tests, but its real 
performance can only be shown when embedded in FileDataModel. I'll post the 
results shortly.

However, I don't feel this is the right direction. If Sean Owen did everything 
right in his FastByIDMap, then reducing the memory footprint to 0.66 times is 
not worth the speed loss.

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715885#comment-13715885
 ] 

Peng Cheng commented on MAHOUT-1286:


On second thought, the hash map is very likely not the culprit for the poor 
memory efficiency here; apologies for the misinformation. The double-hashing 
algorithm in FastByIDMap, as described in Donald Knuth's 'The Art of Computer 
Programming', has a default loadFactor of 1.5, which means the size of the 
array is only 1.5 times the number of keys. So theoretically the heap size of 
GenericDataModel should never exceed 3 times the size of 
FactorizablePreferences. I'm still unclear about parts of FastByIDMap's 
implementation, e.g. how it handles deletion of entries, so I cannot tell 
whether my observation on the Netflix dataset is caused by GC (e.g. 
constructing new arrays too often), by deletion, or by extra space allocated 
for timestamps. We probably have to run Netflix in debug mode to identify the 
problem.

I'll try to bring up this topic in the next hangout. Please give me some hints 
if you are familiar with those FastMap implementations.
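To make that claim concrete, here is a hedged back-of-envelope estimate (my own helper and numbers, nothing measured): ~100M Netflix-prize ratings in parallel long-key/float-value arrays at loadFactor 1.5 need roughly 1.7 GB of raw array payload, nowhere near the 11 GB observed.

```java
// Hypothetical back-of-envelope estimate; assumes one 8-byte long key and one
// 4-byte float value per slot, and ignores object headers, timestamps and GC.
public class HeapEstimate {
    public static long estimateBytes(long numRatings, double loadFactor) {
        long slots = (long) (numRatings * loadFactor); // array sized by loadFactor
        return slots * (8 + 4);                        // long key + float value
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(100_000_000L, 1.5); // ~Netflix-prize scale
        System.out.println(bytes / (1024L * 1024L) + " MB"); // prints "1716 MB"
    }
}
```

The gap between this estimate and the measured 11 GB is what points the finger at per-object overhead rather than the hash table itself.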

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--


[jira] [Created] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and iteration

2013-07-17 Thread Peng Cheng (JIRA)
Peng Cheng created MAHOUT-1286:
--

 Summary: Memory-efficient DataModel, supporting fast online 
updates and iteration
 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen


Most DataModel implementations in the current CF component use hash maps to 
enable fast 2d indexing and updates. This is not memory-efficient for big data 
sets; e.g. the Netflix prize dataset takes 11 GB of heap space as a 
FileDataModel.

An improved implementation of DataModel should use a more compact data 
structure (like arrays); this trades a little time complexity in 2d indexing 
for a vast improvement in memory efficiency. In addition, any online 
recommender, or online-to-batch converted recommender, will not be affected by 
this in its training process.

--


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-17 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Summary: Memory-efficient DataModel, supporting fast online updates and 
element-wise iteration  (was: Memory-efficient DataModel, supporting fast 
online updates and iteration)

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--


[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: NetflixRecomenderEvaluatorRunner.java

A runnable component for testing ParallelSGDFactorizer on the Netflix training 
dataset (only the training set generated by NetflixDatasetConverter — I cannot 
get judging.txt for validation, but my purpose is just to test its efficiency 
at extreme scale, so that doesn't matter).

Warning! To run it safely you need to allocate at least 12 GB of heap space to 
the JVM with the following VM parameters:

-Xms12288M -Xmx12288M

In addition, 16 GB+ of RAM is MANDATORY, otherwise garbage collection or 
swapping (or both) will kill you. I almost burned my laptop on this (it has 
only 8 GB of RAM). As a result, I won't be able to post any results before I 
can get a better machine. But since its number of ratings is about 6 times 
that of the movielens-10m or libimseti dataset, and SGD scales linearly in 
this number, I estimate the running time to be between 2.5 and 3 minutes.

I would be much obliged to anybody who can try it and post the results here 
(if your machine can handle it, of course). But as Sebastian has pointed out, 
our FileDataModel needs some serious optimization to handle this scale.

Hey Sebastian, can you try this out in your lab? That would be most helpful.

> Parallel SGD matrix factorizer for SVDrecommender
> -
>
> Key: MAHOUT-1272
> URL: https://issues.apache.org/jira/browse/MAHOUT-1272
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Peng Cheng
>Assignee: Sean Owen
>  Labels: features, patch, test
> Fix For: 0.8
>
> Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
> libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
> NetflixRecomenderEvaluatorRunner.java, ParallelSGDFactorizer.java, 
> ParallelSGDFactorizer.java, ParallelSGDFactorizerTest.java, 
> ParallelSGDFactorizerTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
> multicore processor.
> existing code is single-thread and perhaps may still be outperformed by the 
> default ALS-WR.
> In addition, its hardcoded online-to-batch-conversion prevents it to be used 
> by an online recommender. An online SGD implementation may help build 
> high-performance online recommender as a replacement of the outdated 
> slope-one.
> The new factorizer can implement either DSGD 
> (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
> hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
> Related discussion has been carried on for a while but remain inconclusive:
> http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--


[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707830#comment-13707830
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/13/13 8:57 PM:
-

Test on the libimseti dataset (http://www.occamslab.com/petricek/data/); 
libimseti is a Czech dating website.
This dataset is used in a live example in the book 'Mahout in Action', page 
71, written by a few guys hanging around this site.

parameters:
  private final static double lambda = 0.1;
  private final static int rank = 16;
  
  private static int numALSIterations=5;
  private static int numEpochs=20;

(for ratingSGD)
  double randomNoise=0.02;
  double learningRate=0.01;
  double learningDecayRate=1;

(for parallelSGD)
  double mu0=1;
  double decayFactor=1;
  int stepOffset=100;
  double forgettingExponent=-1;

result (using average absolute difference, the rating is based on a 1-10 scale):

INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 
time spent: 41.24s=== (note that the number of ALS 
iterations is much smaller than for the others, which leads to a suboptimal 
result, but that is not the point of this test)
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 
time spent: 118.188s===
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
1.2798905733917445 time spent: 21.806s

This is already the best result I can get; the book claims a best result of 
1.12 on this dataset, which I have never achieved. If you have also 
experimented and found a better parameter set, please post it here.

> Parallel SGD matrix factorizer for SVDrecommender
> -
>
> Key: MAHOUT-1272
> URL: https://issues.apache.org/jira/browse/MAHOUT-1272
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Peng Cheng
>Assignee: Sean Owen
>  Labels: features, patch, test
> Fix For: 0.8
>
> Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
> libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
> ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
> ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
> multicore processor.
> existing code is single-thread and perhaps may still be outperformed by the 
> default ALS-WR.
> In addition, its hardcoded online-to-batch-conversion prevents it to be used 
> by an online recommender. An online SGD implementation may help build 
> high-performance online recommender as a replacement of the outdated 
> slope-one.
> The new factorizer can implement either DSGD 
> (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
> hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
> Related discussion has been carried on for a while but remain inconclusive:
> http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--

[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: libimsetiSVDRecomenderEvaluatorRunner.java

Here is the component for testing on the libimseti dataset.

> Parallel SGD matrix factorizer for SVDrecommender
> -
>
> Key: MAHOUT-1272
> URL: https://issues.apache.org/jira/browse/MAHOUT-1272
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Peng Cheng
>Assignee: Sean Owen
>  Labels: features, patch, test
> Fix For: 0.8
>
> Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
> libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
> ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
> ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
> multicore processor.
> existing code is single-thread and perhaps may still be outperformed by the 
> default ALS-WR.
> In addition, its hardcoded online-to-batch-conversion prevents it to be used 
> by an online recommender. An online SGD implementation may help build 
> high-performance online recommender as a replacement of the outdated 
> slope-one.
> The new factorizer can implement either DSGD 
> (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
> hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
> Related discussion has been carried on for a while but remain inconclusive:
> http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707830#comment-13707830
 ] 

Peng Cheng commented on MAHOUT-1272:


Test on the libimseti dataset (http://www.occamslab.com/petricek/data/); 
libimseti is a Czech dating website.
This dataset is used in a live example in the book 'Mahout in Action', page 
71, written by a few guys hanging around this site.

parameters:
  private final static double lambda = 0.1;
  private final static int rank = 16;
  
  private static int numALSIterations=5;
  private static int numEpochs=20;

  double randomNoise=0.02;
  double learningRate=0.01;
  double learningDecayRate=1;

result (using average absolute difference, the rating is based on a 1-10 scale):

INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 
time spent: 41.24s=== (note that the number of ALS 
iterations is much smaller than for the others, which leads to a suboptimal 
result, but that is not the point of this test)
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 
time spent: 118.188s===
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
1.2798905733917445 time spent: 21.806s

This is already the best result I can get; the book claims a best result of 
1.12 on this dataset, which I have never achieved. If you have also 
experimented and found a better parameter set, please post it here.


> Parallel SGD matrix factorizer for SVDrecommender
> -
>
> Key: MAHOUT-1272
> URL: https://issues.apache.org/jira/browse/MAHOUT-1272
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Peng Cheng
>Assignee: Sean Owen
>  Labels: features, patch, test
> Fix For: 0.8
>
> Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
> mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
> ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
> multicore processor.
> existing code is single-thread and perhaps may still be outperformed by the 
> default ALS-WR.
> In addition, its hardcoded online-to-batch-conversion prevents it to be used 
> by an online recommender. An online SGD implementation may help build 
> high-performance online recommender as a replacement of the outdated 
> slope-one.
> The new factorizer can implement either DSGD 
> (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
> hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
> Related discussion has been carried on for a while but remain inconclusive:
> http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--


[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707104#comment-13707104
 ] 

Peng Cheng commented on MAHOUT-1274:


Totally agree. I don't know about other DataModels, but the current 
GenericDataModel uses two maps of PreferenceArray, which is counterintuitive. 
I thought it could be a double FastByIDMap allowing O(1) random access, but I 
must have missed some other requirement.

I haven't read FactorizablePreferences yet; thanks a lot for your advice.

> SGD-based Online SVD recommender
> 
>
> Key: MAHOUT-1274
> URL: https://issues.apache.org/jira/browse/MAHOUT-1274
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Peng Cheng
>Assignee: Sean Owen
>  Labels: collaborative-filtering, features, machine_learning, svd
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> An online SVD recommender is otherwise similar to an offline SVD recommender 
> except that, upon receiving one or several new ratings, it can add them to 
> the training dataModel and update the factorization accordingly in real 
> time.
> An online SVD recommender should override setPreference(...) and 
> removePreference(...) in AbstractRecommender such that the factorization 
> result is updated in O(1) time, without retraining.
> Right now the SlopeOneRecommender is the only component possessing such a 
> capability.
> Since SGD is intrinsically an online algorithm and its CF implementation is 
> available in core-0.8 (see MAHOUT-1089, MAHOUT-1272), I presume it would be 
> a good time to convert it. Such a feature could come in handy for some 
> websites.
> Implementation: adding new users or items, or increasing the rating matrix 
> rank, just increases the size of the user and item matrices. Reducing the 
> rating matrix rank involves just one SVD. The real challenge is that SGD is 
> NOT a one-pass algorithm: multiple passes are required to achieve acceptable 
> optimality, and even more so if the hyperparameters are bad. But here are 
> two possible workarounds:
> 1. Use a one-pass algorithm like averaged SGD. I am not sure it can ever 
> work, as applying a stochastic convex-optimization algorithm to a non-convex 
> problem is anarchy, so it may be a long shot.
> 2. Run incomplete passes in each online update using ratings randomly 
> sampled (but not uniformly sampled) from the latest dataModel. I don't know 
> exactly how this should be done, but new ratings should be sampled more 
> frequently: uniform sampling would result in old ratings being used more 
> than new ratings in total. If somebody has worked on this batch-to-online 
> conversion before and can share their insight, that would be awesome. This 
> seems to be the most viable option, if I can get a non-uniform pseudorandom 
> generator that maintains the cumulative uniform distribution I want.
> I found a very old ticket (MAHOUT-572) mentioning an online SVD recommender, 
> but it didn't pay off. Hopefully it's not a bad idea to submit a new ticket 
> here.
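The recency-biased sampler in point 2 above could, under one possible reading, weight each rating geometrically by its age. A hypothetical sketch; the class name and decay scheme are assumptions, not anything in Mahout:

```java
import java.util.Random;

/** Hypothetical recency-biased sampler: rating i (0 = oldest, n-1 = newest)
 *  is drawn with probability proportional to decay^(n-1-i), so newer ratings
 *  are replayed more often during an incomplete online pass. */
class RecencySampler {
  private final Random rnd = new Random();

  /** Sample an index in [0, n) with geometric recency weights. */
  int sample(int n, double decay) {
    // total weight: decay^0 + ... + decay^(n-1), a geometric series
    double total = (decay == 1.0) ? n : (1 - Math.pow(decay, n)) / (1 - decay);
    double r = rnd.nextDouble() * total;
    double cum = 0, w = 1.0;           // w = decay^age, age 0 = newest rating
    for (int age = 0; age < n; age++) {
      cum += w;
      if (r < cum) return n - 1 - age; // convert age back to rating index
      w *= decay;
    }
    return 0;                          // numerical fallback: oldest rating
  }
}
```

The linear scan is O(n) per draw; a constant-time variant could invert the geometric CDF analytically instead of scanning, which matters if this runs inside every online update.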



[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707065#comment-13707065
 ] 

Peng Cheng commented on MAHOUT-1274:


The main component is finished. The new factorizer and recommender can support 
adding new users and items, and can update user/item vectors in only one GD 
step (this is very suboptimal, but I will improve this part very soon).

But I don't know how to test it: the sandbox GenericDataModel doesn't support 
setPreference(...) and removePreference(...) yet (SlopeOneRecommenderTest 
doesn't test this part either). Could someone tell me if there is an 
alternative that avoids this problem?

As Sebastian foretold, now is not the best time to add support for an online 
recommender: the SlopeOneRecommender is half-dead, many dependencies are 
incomplete, and everybody's attention is drawn to the core-0.8 release. 
Regardless, I'll try to solve it myself and spend some time on other tickets.



[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-08 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702175#comment-13702175
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/8/13 6:06 PM:


Hey Sebastian, Hudson, thank you so much for pushing things this hard. I owe 
you one.
I'll test more GroupLens data. Since Sebastian has taken over the code, new 
test cases will only be posted as code snippets.

  was (Author: peng):
Hey Sebastian, Hudson, thank you so much for pushing things this hard. I owe 
you one.
Testing on the Netflix dataset has encountered some trouble; namely, I don't 
know where to download it :-<. Great appreciation for anyone who can share 
their judging.txt. In the meantime I'll try more GroupLens data.
Since Sebastian has taken over the code, new test cases will only be posted as 
code snippets.
  


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-08 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702175#comment-13702175
 ] 

Peng Cheng commented on MAHOUT-1272:


Hey Sebastian, Hudson, thank you so much for pushing things this hard. I owe 
you one.
Testing on the Netflix dataset has encountered some trouble; namely, I don't 
know where to download it :-<. Great appreciation for anyone who can share 
their judging.txt. In the meantime I'll try more GroupLens data.
Since Sebastian has taken over the code, new test cases will only be posted as 
code snippets.



[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701688#comment-13701688
 ] 

Peng Cheng commented on MAHOUT-1272:


Hi Sebastian,

Really? I would break my fingers to squeeze into the 0.8 release. (Not RC1 of 
course, but there is still RC2 :->) A few guys I work with are also kicking me 
for the online recommender, so I can work very hard and undistracted. Just 
tell me what to do next and I'll be thrilled to oblige.



[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701682#comment-13701682
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/7/13 10:21 PM:
-

New parameters:
lambda = 0.001
rank of the rating matrix / number of features of each user/item vector = 5
number of iterations/epochs = 20

result on movielens-10m (all evaluations use RMSE):
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8115207244832938 time spent: 8.747s

This is fast and accurate enough; I'm advancing to the Netflix prize dataset.

  was (Author: peng):
New parameters:
lambda = 0.001
rank of the rating matrix / number of features of each user/item vector = 5
number of iterations/epochs = 20

result on movielens-10m:
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8115207244832938 time spent: 8.747s

This is fast and accurate enough; I'm advancing to the Netflix prize dataset.
  


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701682#comment-13701682
 ] 

Peng Cheng commented on MAHOUT-1272:


New parameters:
lambda = 0.001
rank of the rating matrix / number of features of each user/item vector = 5
number of iterations/epochs = 20

result on movielens-10m:
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8115207244832938 time spent: 8.747s

This is fast and accurate enough; I'm advancing to the Netflix prize dataset.



[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701679#comment-13701679
 ] 

Peng Cheng commented on MAHOUT-1272:


Hi Sebastian, may I ask a question? I dug up some old posts and found that the 
best result should be RMSE ~= 0.85; do you know the parameters that were used?



[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: ParallelSGDFactorizerTest.java
ParallelSGDFactorizer.java
GroupLensSVDRecomenderEvaluatorRunner.java

My laptop is an HP Pavilion with an Intel® Core™ i7-3610QM CPU @ 2.30GHz × 8 
and 8 GB of memory.



[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701672#comment-13701672
 ] 

Peng Cheng commented on MAHOUT-1272:


Hey honoured contributors, I've got some crude test results for the new 
parallel SGD factorizer for CF:

1. parameters:
lambda = 1e-10
rank of the rating matrix / number of features of each user/item vector = 50
number of biases: 3 (average rating + user bias + item bias)
number of iterations/epochs = 2 (for all factorizers, including ALSWR, 
RatingSGD and the proposed ParallelSGD)
initial mu/learning rate = 0.01 (for RatingSGD and the proposed ParallelSGD)
decay rate of mu = 1 (does not decay) (for RatingSGD and the proposed 
ParallelSGD)
other parameters are set to defaults.

2. result on movielens-10m (I don't know what the hell happened to ALSWR; the 
default hyperparameters must screw up real bad, but my point is the speed 
edge):
  a. RMSE

Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ALSWRFactorizer: 3.7709163950800665E21 
time spent: 6.179s===
Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8847393972529887 time spent: 6.179s===
Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8805947464818478 time spent: 3.084s

  b. Absolute Average

INFO: ==Recommender With ALSWRFactorizer: 1.2085420449917682E19 
time spent: 7.444s===
Jul 07, 2013 5:22:39 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.675685274206 time spent: 7.444s===
Jul 07, 2013 5:22:39 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.6775774766740665 time spent: 2.365s

3. result on movielens-1m (on average SGD performs worse here than on 
movielens-10m; perhaps I could use more iterations/epochs)

  a. RMSE

Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ALSWRFactorizer: 1.3514189134383086E20 
time spent: 0.637s===
Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.9312989913558529 time spent: 0.637s===
Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.9529995632658007 time spent: 0.305s

  b. Absolute Average

Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ALSWRFactorizer: 
1.58934499216789965E18 time spent: 0.626s===
Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.7459565635961599 time spent: 0.626s===
Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.7420818642753416 time spent: 0.297s

Great thanks to Sebastian for his guidance. I'll upload the EvaluatorRunner 
class as a mahout-example component and the formatted code shortly.


[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-06 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701380#comment-13701380
 ] 

Peng Cheng commented on MAHOUT-1274:


BTW, may I ask (noobishly) why you have deprecated the SlopeOneRecommender in 
the latest core-0.8 snapshot? I must have missed a lot in the 
mahout-development emails before I joined, so apologies if it's a stupid 
question.



[jira] [Created] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-06 Thread Peng Cheng (JIRA)
Peng Cheng created MAHOUT-1274:
--

 Summary: SGD-based Online SVD recommender
 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen


an online SVD recommender is otherwise similar to an offline SVD recommender 
except that, upon receiving one or several new recommendations, it can add them 
into the training dataModel and update the result accordingly in real time.

an online SVD recommender should override setPreference(...) and 
removePreference(...) in AbstractRecommender such that the factorization result 
is updated in O(1) time and without retraining.

Right now the slopeOneRecommender is the only component possessing such a 
capability.

Since SGD is intrinsically an online algorithm and its CF implementation is 
available in core-0.8 (see MAHOUT-1089, MAHOUT-1272), I presume it would be a 
good time to convert it. Such a feature could come in handy for some websites.

Implementation: adding new users or items, or increasing the rank of the rating 
matrix, just means growing the user and item matrices. Reducing the rank 
involves just one SVD. The real challenge here is that SGD is NOT a one-pass 
algorithm: multiple passes are required to achieve acceptable optimality, even 
more so if the hyperparameters are bad. But here are two possible workarounds:

1. Use a one-pass algorithm like averaged SGD. I'm not sure it can ever work, 
since applying a stochastic convex-optimization algorithm to a non-convex 
problem is uncharted territory, so it may be a long shot.

2. Run incomplete passes in each online update, using ratings randomly sampled 
(but not uniformly sampled) from the latest dataModel. I don't know exactly how 
this should be done, but new ratings should be sampled more frequently; uniform 
sampling would result in old ratings being used more than new ratings in total. 
If somebody has worked on this batch-to-online conversion before and could 
share their insight, that would be awesome. This seems to be the most viable 
option, provided I can get the non-uniform pseudorandom generator that 
maintains a cumulative uniform distribution.
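The recency bias of option 2 could be sketched as follows. This is only an illustration under the assumption that ratings are stored in arrival order; the class name and skew parameter are hypothetical:

```java
import java.util.Random;

/**
 * Hypothetical recency-biased sampler for option 2: assuming ratings are kept
 * in arrival order, newer ratings (higher indices) are drawn more often.
 */
public class RecencyBiasedSampler {
  private final Random rng;
  private final double skew; // skew > 1 biases sampling toward recent indices

  public RecencyBiasedSampler(long seed, double skew) {
    this.rng = new Random(seed);
    this.skew = skew;
  }

  /** Returns an index in [0, n); larger indices are more likely when skew > 1. */
  public int next(int n) {
    // For uniform u, u^(1/skew) concentrates near 1 when skew > 1,
    // so recent (large) indices are sampled more frequently.
    double u = Math.pow(rng.nextDouble(), 1.0 / skew);
    int idx = (int) (u * n);
    return Math.min(idx, n - 1);
  }
}
```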

I found a very old ticket (MAHOUT-572) mentioning an online SVD recommender, 
but it didn't pay off. Hopefully it's not a bad idea to submit a new ticket.



[jira] [Updated] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-06 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1274:
---

Description: 
An online SVD recommender is otherwise similar to an offline SVD recommender 
except that, upon receiving one or several new preferences, it can add them to 
the training dataModel and update the factorization accordingly in real time.

An online SVD recommender should override setPreference(...) and 
removePreference(...) in AbstractRecommender such that the factorization result 
is updated in O(1) time and without retraining.

Right now the slopeOneRecommender is the only component possessing such a 
capability.

Since SGD is intrinsically an online algorithm and its CF implementation is 
available in core-0.8 (see MAHOUT-1089, MAHOUT-1272), I presume it would be a 
good time to convert it. Such a feature could come in handy for some websites.

Implementation: adding new users or items, or increasing the rank of the rating 
matrix, just means growing the user and item matrices. Reducing the rank 
involves just one SVD. The real challenge here is that SGD is NOT a one-pass 
algorithm: multiple passes are required to achieve acceptable optimality, even 
more so if the hyperparameters are bad. But here are two possible workarounds:

1. Use a one-pass algorithm like averaged SGD. I'm not sure it can ever work, 
since applying a stochastic convex-optimization algorithm to a non-convex 
problem is uncharted territory, so it may be a long shot.

2. Run incomplete passes in each online update, using ratings randomly sampled 
(but not uniformly sampled) from the latest dataModel. I don't know exactly how 
this should be done, but new ratings should be sampled more frequently; uniform 
sampling would result in old ratings being used more than new ratings in total. 
If somebody has worked on this batch-to-online conversion before and could 
share their insight, that would be awesome. This seems to be the most viable 
option, provided I can get the non-uniform pseudorandom generator that 
maintains a cumulative uniform distribution.

I found a very old ticket (MAHOUT-572) mentioning an online SVD recommender, 
but it didn't pay off. Hopefully it's not a bad idea to submit a new ticket here.



> SGD-based Online SVD recommender
> 
>
> Key: MAHOUT-1274
> URL: https://issues.apache.org/jira/browse/MAHOUT-1274
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Peng Cheng
>Assignee: Sean Owen
>  Labels: collaborative-filtering, features, machine_learning, svd
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> an online SVD recommender is otherwise similar to an offline SVD recommender 
> except that, upon receiving one or several new recommendations, it can add 
> th

[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-06 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701233#comment-13701233
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/6/13 2:43 PM:


Hey, I have finished the class and test for the parallel SGD factorizer for the 
matrix-completion-based recommender (not MapReduced, just single-machine 
multi-threaded); it is loosely based on vanilla SGD and HOGWILD!. I have only 
tested it on toy and synthetic data (2000 users * 1000 items), but it is pretty 
fast: 3-5x faster than vanilla SGD with 8 cores (never exceeding 6x; apparently 
the executor induces a high allocation overhead), and definitely faster than 
single-machine ALS-WR.

I'm submitting my java files and patch for review.

> Parallel SGD matrix factorizer for SVDrecommender
> -
>
> Key: MAHOUT-1272
> URL: https://issues.apache.org/jira/browse/MAHOUT-1272
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Peng Cheng
>Assignee: Sean Owen
>  Labels: features, patch, test
> Attachments: mahout.patch, ParallelSGDFactorizer.java, 
> ParallelSGDFactorizerTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> A parallel factorizer based on MAHOUT-1089 may achieve better performance on 
> multicore processors.
> The existing code is single-threaded and may still be outperformed by the 
> default ALS-WR.
> In addition, its hardcoded online-to-batch conversion prevents it from being 
> used by an online recommender. An online SGD implementation may help build a 
> high-performance online recommender as a replacement for the outdated 
> slope-one.
> The new factorizer can implement either DSGD 
> (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
> HOGWILD! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
> Related discussion has been carried on for a while but remains inconclusive:
> http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl



[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701247#comment-13701247
 ] 

Peng Cheng commented on MAHOUT-1272:


Aye aye, more tests on the way. Much obliged for the quick suggestion.







[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701241#comment-13701241
 ] 

Peng Cheng commented on MAHOUT-1272:


The next step would be to create an online version of this factorizer (and of 
the recommender): SGD is an online algorithm, but it currently works only for 
batch recommenders.
In the meantime the only online recommender in Mahout is slope-one, which is 
kind of a shame.
Will create a new JIRA ticket tomorrow.



[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: mahout.patch

patch



[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: ParallelSGDFactorizerTest.java
ParallelSGDFactorizer.java

java file



[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Labels: features patch test  (was: )
Status: Patch Available  (was: Open)

Hey, I have finished the class and test for the parallel SGD factorizer for the 
matrix-completion-based recommender (not MapReduced, just single-machine 
multi-threaded); it is loosely based on vanilla SGD and HOGWILD!. I have only 
tested it on toy and synthetic data (2000 users * 1000 items), but it is pretty 
fast: 3-5x faster than vanilla SGD with 8 cores (never exceeding 6x; apparently 
the executor induces a high allocation overhead), and definitely faster than 
single-machine ALS-WR.

I'm submitting my java files and patch for review.
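The single-machine multi-threaded scheme described here can be illustrated with a minimal HOGWILD!-style sketch: worker threads share the factor matrices and apply SGD updates without any locking, tolerating occasional races. This is a toy illustration on synthetic rank-2 data, not the attached ParallelSGDFactorizer; all names below are hypothetical.

```java
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy HOGWILD!-style parallel SGD: lock-free updates on shared factor matrices. */
public class HogwildSketch {

  /** Trains rank-4 factors on a synthetic rank-2 rating matrix with 4 lock-free
      worker threads; returns {initialRmse, finalRmse}. */
  public static double[] train() {
    final int rank = 4, users = 50, items = 40, threads = 4;
    final Random init = new Random(7);
    // synthetic rank-2 ground-truth factors generate the "observed" ratings
    final double[][] gu = randomMatrix(init, users, 2);
    final double[][] gv = randomMatrix(init, items, 2);
    final double[][] u = randomMatrix(init, users, rank);
    final double[][] v = randomMatrix(init, items, rank);
    scale(u, 0.1);
    scale(v, 0.1);

    double before = rmse(gu, gv, u, v);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      final long seed = t;
      pool.execute(() -> {
        Random rng = new Random(seed);
        for (int step = 0; step < 50000; step++) {
          int i = rng.nextInt(users), j = rng.nextInt(items);
          double r = dot(gu[i], gv[j]);
          double err = r - dot(u[i], v[j]);
          for (int k = 0; k < rank; k++) { // lock-free: races are tolerated
            double uk = u[i][k];
            u[i][k] += 0.02 * (err * v[j][k] - 0.01 * uk);
            v[j][k] += 0.02 * (err * uk - 0.01 * v[j][k]);
          }
        }
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return new double[] {before, rmse(gu, gv, u, v)};
  }

  private static double[][] randomMatrix(Random r, int rows, int cols) {
    double[][] m = new double[rows][cols];
    for (double[] row : m) for (int k = 0; k < cols; k++) row[k] = r.nextDouble();
    return m;
  }

  private static void scale(double[][] m, double s) {
    for (double[] row : m) for (int k = 0; k < row.length; k++) row[k] *= s;
  }

  private static double dot(double[] a, double[] b) {
    double p = 0;
    for (int k = 0; k < a.length; k++) p += a[k] * b[k];
    return p;
  }

  private static double rmse(double[][] gu, double[][] gv, double[][] u, double[][] v) {
    double se = 0;
    for (int i = 0; i < gu.length; i++)
      for (int j = 0; j < gv.length; j++) {
        double e = dot(gu[i], gv[j]) - dot(u[i], v[j]);
        se += e * e;
      }
    return Math.sqrt(se / (gu.length * gv.length));
  }
}
```

The point of the HOGWILD! design is exactly what the comment observes: with sparse updates, unsynchronized writes rarely collide, so the speedup comes almost for free, minus the executor overhead.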



[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-01 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696932#comment-13696932
 ] 

Peng Cheng commented on MAHOUT-1272:


Looks like the 1/n learning rate doesn't work at all on the SGD factorizer; 
maybe the convergence guarantees of stochastic optimization can't be applied to 
the non-convex MF problem. Can someone show me a paper discussing convergence 
bounds for such problems? Much appreciated.



[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-06-29 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696155#comment-13696155
 ] 

Peng Cheng commented on MAHOUT-1272:


The learning rate/step size is set to be identical to the ~.classifier.sgd 
package: the old learning rate decays exponentially with a constant factor. 
This setting seems to work only for smooth functions (proved by Nesterov?), and 
I'm not sure whether that holds in CF. Otherwise, use either 1/sqrt(n) for 
convex f or 1/n for strongly convex f.
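The three schedules mentioned can be written down explicitly. This is an illustrative helper, not Mahout's actual learning-rate API; the class and method names are hypothetical:

```java
/** Illustrative SGD learning-rate schedules (hypothetical helper, not Mahout API). */
public class LearningRates {

  /** Exponential decay with a constant factor, in the ~.classifier.sgd style. */
  public static double exponential(double eta0, double decay, long n) {
    return eta0 * Math.pow(decay, n);
  }

  /** O(1/sqrt(n)) schedule, the standard choice for convex objectives. */
  public static double invSqrt(double eta0, long n) {
    return eta0 / Math.sqrt(n + 1);
  }

  /** O(1/n) schedule, the standard choice for strongly convex objectives. */
  public static double inv(double eta0, long n) {
    return eta0 / (n + 1);
  }
}
```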



[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-06-28 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13695818#comment-13695818
 ] 

Peng Cheng commented on MAHOUT-1272:


Thanks a lot for the hint! Is it in org.apache.mahout.math.als? I can't find 
any other implementation in core-0.7.
Yeah, I think this would be a good practice to start with, regardless of 
whether it has any performance edge.
I'll try to do something this weekend.



[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-06-28 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13695800#comment-13695800
 ] 

Peng Cheng commented on MAHOUT-1272:


I'm reading the source code of ALS-WR; apparently it uses an ExecutorService to 
distribute ALS across cores.
There is no MR here. I just started using it a few days ago, please correct me 
if I'm wrong.



[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-06-28 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13695791#comment-13695791
 ] 

Peng Cheng commented on MAHOUT-1272:


I presume it would be single-machine multi-core? Many people in the discussion 
have voted against iterative MR. Not sure though...



[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-06-28 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Description: 
a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
multicore processor.

existing code is single-thread and perhaps may still be outperformed by the 
default ALS-WR.

In addition, its hardcoded online-to-batch-conversion prevents it to be used by 
an online recommender. An online SGD implementation may help build 
high-performance online recommender as a replacement of the outdated slope-one.

The new factorizer can implement either DSGD 
(http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).

Related discussion has been carried on for a while but remain inconclusive:
http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl





[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-06-28 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Description: 
a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
multicore processor.

existing code is single-thread and perhaps may still be outperformed by the 
default ALS-WR.

In addition, its hardcoded online-to-batch-conversion prevents it to be used by 
an online recommender. An online SGD implementation may help building 
high-performance online recommender as a replacement of the outdated slope-one.

The new factorizer can implement either DSGD 
(http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).

Related discussion has been carried on for a while but remain inconclusive:
http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

  was:
A parallel factorizer based on MAHOUT-1089 
(https://issues.apache.org/jira/browse/MAHOUT-1089) may achieve better 
performance on multicore processors.

The current patch of MAHOUT-1089 is single-threaded and may still be 
outperformed by the default ALS-WR.

In addition, its hardcoded online-to-batch conversion prevents it from being 
used by an online recommender. An online SGD implementation could help build a 
high-performance online recommender to replace the outdated slope-one.

The new factorizer can implement either DSGD 
(http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
Hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).

Related discussion has been going on for a while but remains inconclusive:
http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl


> Parallel SGD matrix factorizer for SVDrecommender
> -
>
> Key: MAHOUT-1272
> URL: https://issues.apache.org/jira/browse/MAHOUT-1272
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Peng Cheng
>Assignee: Sean Owen
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> A parallel factorizer based on MAHOUT-1089 may achieve better performance on 
> multicore processors.
> The existing code is single-threaded and may still be outperformed by the 
> default ALS-WR.
> In addition, its hardcoded online-to-batch conversion prevents it from being 
> used by an online recommender. An online SGD implementation could help build 
> a high-performance online recommender to replace the outdated 
> slope-one.
> The new factorizer can implement either DSGD 
> (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
> Hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
> Related discussion has been going on for a while but remains inconclusive:
> http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl



[jira] [Created] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-06-28 Thread Peng Cheng (JIRA)
Peng Cheng created MAHOUT-1272:
--

 Summary: Parallel SGD matrix factorizer for SVDrecommender
 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen


A parallel factorizer based on MAHOUT-1089 
(https://issues.apache.org/jira/browse/MAHOUT-1089) may achieve better 
performance on multicore processors.

The current patch of MAHOUT-1089 is single-threaded and may still be 
outperformed by the default ALS-WR.

In addition, its hardcoded online-to-batch conversion prevents it from being 
used by an online recommender. An online SGD implementation could help build a 
high-performance online recommender to replace the outdated slope-one.

The new factorizer can implement either DSGD 
(http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
Hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).

Related discussion has been going on for a while but remains inconclusive:
http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl
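For reference, the Hogwild! approach proposed above can be sketched as follows: several worker threads apply SGD updates to shared factor matrices with no locking, tolerating occasional lost updates because each rating touches only one user row and one item row. This is an illustrative, self-contained sketch, not Mahout's Factorizer API; the class name HogwildSketch and the learning-rate/regularization constants are assumptions chosen for the example.

```java
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hogwild!-style sketch: threads run SGD on shared arrays without locks. */
public class HogwildSketch {
    static final int K = 8;                       // number of latent features
    static final double RATE = 0.005, LAMBDA = 0.02;
    final double[][] users, items;                // shared, updated without locks

    HogwildSketch(int numUsers, int numItems, long seed) {
        Random rnd = new Random(seed);
        users = new double[numUsers][K];
        items = new double[numItems][K];
        for (double[] r : users) for (int f = 0; f < K; f++) r[f] = 0.1 * rnd.nextGaussian();
        for (double[] r : items) for (int f = 0; f < K; f++) r[f] = 0.1 * rnd.nextGaussian();
    }

    double predict(int u, int i) {
        double p = 0;
        for (int f = 0; f < K; f++) p += users[u][f] * items[i][f];
        return p;
    }

    void sgdStep(int u, int i, double rating) {
        double err = rating - predict(u, i);
        for (int f = 0; f < K; f++) {
            double uf = users[u][f], vf = items[i][f];
            users[u][f] += RATE * (err * vf - LAMBDA * uf); // racy write, by design
            items[i][f] += RATE * (err * uf - LAMBDA * vf);
        }
    }

    /** Each worker sweeps the whole rating set; no synchronization at all. */
    void train(int[][] triples, int threads, int epochs) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int e = 0; e < epochs; e++)
                    for (int[] r : triples) sgdStep(r[0], r[1], r[2]);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws InterruptedException {
        // toy data: user 0 likes item 0, user 1 likes item 1
        int[][] ratings = {{0, 0, 5}, {0, 1, 1}, {1, 0, 1}, {1, 1, 5}};
        HogwildSketch m = new HogwildSketch(2, 2, 42L);
        m.train(ratings, 4, 500);
        System.out.println(m.predict(0, 0) > m.predict(0, 1)); // learned preference order
    }
}
```

DSGD would differ mainly in the scheduling: it partitions the rating matrix into blocks and assigns non-conflicting blocks to threads, so no two threads ever write the same row, whereas Hogwild! simply accepts the races.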



[jira] [Commented] (MAHOUT-1089) SGD matrix factorization for rating prediction with user and item biases

2013-06-28 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13695745#comment-13695745
 ] 

Peng Cheng commented on MAHOUT-1089:


Code is slick! But apparently there is no multi-threading yet.
The proposal for it has been there for a long time:
http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

Is somebody working on its implementation?
apparently using hogwild or vanilla DSGD has no big impact on performance.

> SGD matrix factorization for rating prediction with user and item biases
> 
>
> Key: MAHOUT-1089
> URL: https://issues.apache.org/jira/browse/MAHOUT-1089
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Zeno Gantner
>Assignee: Sebastian Schelter
> Attachments: MAHOUT-1089.patch, RatingSGDFactorizer.java, 
> RatingSGDFactorizer.java
>
>
> A matrix factorization that is trained with standard SGD on all features at 
> the same time, in contrast to ExpectationMaximizationFactorizer, which learns 
> feature by feature.
> In addition to the free features, it models a rating bias for each user and 
> item.
