Re: PyMahout (incore) (alpha v0.1)

2021-01-06 Thread Peng Zhang
Well done Trevor.

-peng

On Thu, Jan 7, 2021 at 04:45 Trevor Grant  wrote:

> Hey all,
>
> I made a branch for a thing I'm toying with. PyMahout.
>
> See https://github.com/rawkintrevo/pymahout/tree/trunk
>
> Right now, it's sort of dumb - it just makes a couple of random in-core
> matrices... but it _does_ make them.
>
> Next I want to show I can do something with DRMs.
>
> Once I know it's all possible - I'll make a batch of JIRA tickets and we can
> start implementing a Python-like package so that in theory in a PySpark
> workbook you could
>
> ```jupyter
> !pip install pymahout
>
> import pymahout
>
> # do pymahout things here... in python.
>
> ```
>
> So if you're interested in helping / playing - reach out on here or directly -
> if there is a bunch of interest I can commit all of this to a branch as we
> play with it.
>
> Thanks!
> tg
>


Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair

2018-07-19 Thread Peng Zhang
Congrats Andrew!

On Thu, Jul 19, 2018 at 04:01 Andrew Musselman 
wrote:

> Thanks Andy, looking forward to it! Thank you too for your support and
> dedication the past two years; here's to continued progress!
>
> Best
> Andrew
>
> On Wed, Jul 18, 2018 at 1:30 PM, Andrew Palumbo 
> wrote:
> > Please join me in congratulating Andrew Musselman as the new Chair of
> > the
> > Apache Mahout Project Management Committee. I would like to thank
> > Andrew
> > for stepping up, all of us who have worked with him over the years
> > know his
> > dedication to the project to be invaluable.  I look forward to Andrew
> > taking the project into the future.
> >
> > Thank you,
> >
> > Andy
>


Re: Any idea which approaches to non-linear svm are easily parallelizable?

2014-10-22 Thread peng
And I agree with Ted: the non-linearity induced by most kernel functions
is overly complex and can easily overfit. Deep learning is a more reliable
abstraction.


On 10/22/2014 01:42 PM, peng wrote:
Is the kernel projection referring to online/incremental incomplete
Cholesky decomposition? Sorry, I haven't used SVM for a long time and
haven't kept up with the SotA.


If that's true, I haven't found an out-of-the-box implementation, but
this should be easy.


Yours Peng

On 10/22/2014 01:32 PM, Dmitriy Lyubimov wrote:

Andrew, thanks a bunch for the pointers!



On Wed, Oct 22, 2014 at 10:14 AM, Andrew Palumbo ap@outlook.com 
wrote:



If you do want to stick with SVM-

This is a question that I keep coming back to myself and unfortunately
have forgotten more (and lost more literature) than I’ve retained.
I believe that the most easily parallelizable sections of libSVM for small
datasets are (for C-SVC(R), RBF and polynomial kernels):

 1. The kernel projections
 2. The hyper-parameter grid search for C, \gamma (I believe this
is now included in LibSVM - I haven't looked at it in a while)
 3. For multi-class SVC: the concurrent computation of each SVM for
each one-against-one class vote.

I’m unfamiliar with any easily parallelizable method for QP itself.
Unfortunately for (2), (3) this involves broadcasting the entire dataset
out to each node of a cluster (or working in a shared memory environment),
so it may not be practical depending on the size of your data set. I’ve only
ever implemented (2) for relatively small datasets using MPI and with a
pure Java socket implementation.
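For (2) concretely, a minimal single-machine sketch of the idea - evaluating the
(C, \gamma) grid concurrently - in plain Python on top of scikit-learn's libSVM
wrapper (illustrative only, not Mahout code; each grid point is independent and
could just as well be shipped to a cluster node):

```python
# Hypothetical sketch: parallel hyper-parameter grid search for an RBF C-SVC.
# scikit-learn's SVC wraps libSVM; each (C, gamma) cell is trained and scored
# independently, so the cells can run in separate processes (or on separate nodes).
from concurrent.futures import ProcessPoolExecutor
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def evaluate(params):
    C, gamma = params
    score = cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"), X, y, cv=3).mean()
    return params, score

if __name__ == "__main__":
    grid = list(product([0.1, 1, 10, 100], [1e-3, 1e-2, 1e-1, 1]))
    with ProcessPoolExecutor() as pool:
        best = max(pool.map(evaluate, grid), key=lambda r: r[1])
    print("best (C, gamma):", best[0], "cv accuracy:", best[1])
```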

Other approaches (further from simple LibSVM), which are more applicable
to large datasets (I’m less familiar with these):

 4. Divide and conquer the QP/SMO problem and solve (as I’ve said,
I’m unfamiliar with this and I don’t know of any standard)

 5. Break the training set into subsets and solve.

For (5) there are several approaches; two that I know of are ensemble
approaches and those that accumulate support vectors from each partition
and heuristically keep/reject them until the model converges. As well, I’ve
recently read some research on implementing this in a MapReduce
style [2].

I came across this paper [1] last night which you may find interesting as
well; it is an interesting comparison of some SVM parallelization
strategies. In particular it discusses (1) for a shared memory environment
and for offloading work to GPUs (using OpenMP and CUDA). It also cites
several other nice papers discussing SVM parallelization strategies,
especially for (5), and then goes on to discuss a more purely linear
algebra approach to optimizing SVMs (sec. 5).

Also regarding (5) you may be interested in [2] (something I’ve only
looked over briefly).

[1] http://arxiv.org/pdf/1404.1066v1.pdf
[2] http://arxiv.org/pdf/1301.0082.pdf



From: ted.dunn...@gmail.com
Date: Tue, 21 Oct 2014 17:32:22 -0700
Subject: Re: Any idea which approaches to non-liniear svm are easily

parallelizable?

To: dev@mahout.apache.org

Last I heard, the best methods pre-project and do linear SVM.
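One common reading of "pre-project": approximate the kernel with an explicit
random feature map and then train a linear SVM on the projection. A minimal
sketch, with scikit-learn standing in for whatever implementation is actually
used (illustrative only):

```python
# Illustrative sketch of "pre-project and do linear SVM": approximate the RBF
# kernel with random Fourier features, then train a linear SVM on the projection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
projector = RBFSampler(gamma=0.5, n_components=300, random_state=0)
Z = projector.fit_transform(X)   # explicit feature map approximating the RBF kernel
clf = LinearSVC().fit(Z, y)      # linear SVM in the projected space
print("train accuracy:", clf.score(Z, y))
```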

Beyond that, I would guess that deep learning techniques would subsume
non-linear SVM pretty easily.  The best parallel implementation I know
for that is in H2O.



On Tue, Oct 21, 2014 at 4:12 PM, Dmitriy Lyubimov dlie...@gmail.com

wrote:

in particular, from libSVM --
http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf ?

thanks.
-d







Re: Any idea which approaches to non-linear svm are easily parallelizable?

2014-10-22 Thread peng
Yes I am. In fact, my question is just about whether approximation is
used to make the total workload of computing the matrix sub-quadratic in
the training set size.


On 10/22/2014 02:21 PM, Andrew Palumbo wrote:

Peng, I'm not sure if you were referring to what I wrote:

  1. The kernel projections

If so - I was talking about parallelizing the computation of e.g. the RBF kernels.
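To make that concrete, a tiny sketch of the computation (plain numpy, not Mahout
code): the RBF Gram matrix K(x, z) = exp(-gamma * ||x - z||^2) can be built in
independent row blocks, and those blocks are exactly the part that parallelizes.

```python
# Illustrative only: build the RBF kernel matrix in row blocks.
import numpy as np

def rbf_kernel_block(X_block, X, gamma):
    # squared Euclidean distances between the block's rows and all rows
    d2 = ((X_block[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

X = np.random.rand(100, 5)
# each block below is independent, so it could be computed on its own worker/node
K = np.vstack([rbf_kernel_block(X[i:i + 25], X, gamma=0.5)
               for i in range(0, 100, 25)])
assert K.shape == (100, 100)
```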


Date: Wed, 22 Oct 2014 13:42:45 -0400
From: pc...@uowmail.edu.au
To: dev@mahout.apache.org
CC: dlie...@gmail.com
Subject: Re: Any idea which approaches to non-liniear svm are easily 
parallelizable?

Is the kernel projection referring to online/incremental incomplete
Cholesky decomposition? Sorry I haven't used SVM for a long time and
didn't keep up with SotA.

If that's true, I haven't find an out-of-the-box implementation, but
this should be easy.

Yours Peng

On 10/22/2014 01:32 PM, Dmitriy Lyubimov wrote:

Andrew, thanks a bunch for the pointers!



On Wed, Oct 22, 2014 at 10:14 AM, Andrew Palumbo ap@outlook.com wrote:


If you do want to stick with SVM-

This is a question that I keep coming back to myself and unfortunately
have forgotten more (and lost more literature) than I’ve retained.
I believe that the most easily parallelizable sections of libSVM for small
datasets are (for C-SVC(R), RBF and polynomial kernels):

  2. The hyper-parameter grid search for C, \gamma (I believe this
is now included in LibSVM - I haven't looked at it in a while)
  3. For multi-class SVC: the concurrent computation of each SVM for
each one-against-one class vote.

I’m unfamiliar with any easily parallelizable method for QP itself.
Unfortunately for (2), (3) this involves broadcasting the entire dataset
out to each node of a cluster (or working in a shared memory environment),
so it may not be practical depending on the size of your data set.  I’ve only
ever implemented (2) for relatively small datasets using MPI and with a
pure Java socket implementation.

Other approaches (further from simple LibSVM), which are more applicable
to large datasets (I’m less familiar with these):

  4. Divide and conquer the QP/SMO problem and solve (as I’ve said,
I’m unfamiliar with this and I don’t know of any standard)

  5.Break the training set into subsets and solve.

For (5) there are several approaches; two that I know of are ensemble
approaches and those that accumulate support vectors from each partition
and heuristically keep/reject them until the model converges.  As well, I’ve
recently read some research on implementing this in a MapReduce
style [2].

I came across this paper [1] last night which you may find interesting as
well; it is an interesting comparison of some SVM parallelization
strategies. In particular it discusses (1) for a shared memory environment
and for offloading work to GPUs (using OpenMP and CUDA). It also cites
several other nice papers discussing SVM parallelization strategies,
especially for (5), and then goes on to discuss a more purely linear
algebra approach to optimizing SVMs (sec. 5).

Also regarding (5) you may be interested in [2] (something I’ve only
looked over briefly).

[1] http://arxiv.org/pdf/1404.1066v1.pdf
[2] http://arxiv.org/pdf/1301.0082.pdf



From: ted.dunn...@gmail.com
Date: Tue, 21 Oct 2014 17:32:22 -0700
Subject: Re: Any idea which approaches to non-liniear svm are easily

parallelizable?

To: dev@mahout.apache.org

Last I heard, the best methods pre-project and do linear SVM.

Beyond that, I would guess that deep learning techniques would subsume
non-linear SVM pretty easily.  The best parallel implementation I know
for that is in H2O.



On Tue, Oct 21, 2014 at 4:12 PM, Dmitriy Lyubimov dlie...@gmail.com

wrote:

in particular, from libSVM --
http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf ?

thanks.
-d







Re: Upgrade to Spark 1.1.0?

2014-10-19 Thread peng
From my experience 1.1.0 is quite stable, plus it has some performance
improvements that are totally worth the effort.


On 10/19/2014 06:30 PM, Ted Dunning wrote:

On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel p...@occamsmachete.com wrote:


Getting off the dubious Spark 1.0.1 version is turning out to be a bit of
work. Does anyone object to upgrading our Spark dependency? I’m not sure if
Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean upgrading
your Spark cluster.


It is going to have to happen sooner or later.

Sooner may actually be less total pain.





Re: Why is mahout moving to spark?

2014-10-15 Thread peng
No it's not; Spark is a superset of MapReduce. Besides, 'Hadoop
MapReduce' here denotes a specific implementation rather than an
architecture.


On 10/15/2014 03:44 PM, thejas prasad wrote:

Hey all,

  I am curious why mahout is moving away from spark? I mean, it says here: "The
Mahout community decided to move its codebase onto modern data processing
systems that offer a richer programming model and more efficient execution
than Hadoop MapReduce." But why did this happen?

And also, is there a place where I can see all the previous emails in the user/dev
list?

Thanks,
Thejas





Re: Upgrade to spark 1.0.x

2014-08-09 Thread Peng Cheng

+1

1.0.0 is recommended. Many releases after 1.0.1 had a short test cycle,
and 1.0.2 apparently reverted many fixes for causing more serious problems.


On 14-08-09 04:51 PM, Ted Dunning wrote:

+1

Until we release a version that uses spark, we should stay with what helps
us.  Once a release goes out then tracking whichever version of spark that
the big distros put out becomes more important.



On Sat, Aug 9, 2014 at 9:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:


+1

Seems like we ought to keep up to the bleeding edge until the next Mahout
release, that’s when the pain of upgrade gets spread much wider. In fact if
Spark gets moved to Scala 2.11 before our release we probably should
consider upgrading Scala too.




Spark-shell web UI powered by IPython-notebook, maybe useful in promoting Mahout DRM environment.

2014-07-30 Thread Peng Cheng

Dudes,

For those who can't afford the glamourous DataBricks Spark Cloud, or got 
pissed by its incompatibility with Mahout DRM, you may consider this 
project as an alternative:


https://github.com/tribbloid/ISpark

Support for Mahout DRM is still being implemented and will be delivered 
in a few days.


Yours Peng


Re: VOTE: moving commits to git-wp.o.a github PR features.

2014-05-17 Thread Peng Cheng

+1

On Sat 17 May 2014 02:18:56 PM EDT, Gokhan Capan wrote:

+1

Sent from my iPhone


On May 16, 2014, at 21:38, Dmitriy Lyubimov dlie...@gmail.com wrote:

Hi,

I would like to initiate a procedural vote moving to git as our primary
commit system, and using github PRs as described in Jake Farrel's email to
@dev [1]

[1]
https://blogs.apache.org/infra/entry/improved_integration_between_apache_and

If voting succeeds, i will file a ticket with infra to commence necessary
changes and to move our project to git-wp as primary source for commits as
well as add github integration features [1]. (I assume pure git commits
will be required after that's done, with no svn commits allowed).

The motivation is to engage GIT and github PR features as described, and
avoid git mirror history messes like we've seen associated with authors.txt
file fluctuations.

PMC and committers have binding votes, so please vote. Lazy consensus with
minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time
for weekend (i.e. Tuesday afternoon PST) .

here is my +1

-d


Re: Plan for 1.0

2014-03-19 Thread peng

I'm free Saturday, Sunday, and after 17:00 EDT.

On Wed 19 Mar 2014 12:30:49 PM EDT, Andrew Musselman wrote:

Friday afternoon Pacific Time is good for me too.


On Mar 19, 2014, at 12:14 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:

I'm on pacific standard time and am free Sundays late afternoon

Sent from my iPhone


On Mar 19, 2014, at 12:13 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

i am on vacation, so most of the pacific daylight ranges on any day should
work for me.



On Wed, Mar 19, 2014 at 12:07 AM, Sebastian Schelter s...@apache.org wrote:

Friday would also work for me.



On 03/19/2014 08:05 AM, Suneel Marthi wrote:

Same here, travelling next week and in Amsterdam the first week of April.  I
avoid Sundays and weekends for obvious reasons. How about this Friday?

Sent from my iPhone


On Mar 19, 2014, at 3:02 AM, Sebastian Schelter s...@apache.org wrote:

Would some time on sunday work? I'll be traveling the next two weeks
starting from Tuesday.

Best,
Sebastian



On 03/19/2014 07:55 AM, Suneel Marthi wrote:
I had a hangout set up for 0.9, not sure if it's still valid; I can
check on that or can set one up now. When would people wanna have it?
Mondays and Wednesdays don't work for me.  Would Tuesdays 6pm Eastern
Time work?





On Wednesday, March 19, 2014 2:45 AM, Sebastian Schelter 
s...@apache.org wrote:

Hi Saikat,

1) I think that MAHOUT-1248 and 1249 are still very important features
that I would love to see in the codebase, as they would greatly improve
the usability of our ALS code.

2) I think the last discussion item regarding h2o was to find a way to
compare it against existing or Spark-related algorithm implementations to
get a better picture of the programming model and performance. I also don't
feel that a final decision has been reached about this.


3) We should have the hangout, can someone step up and organize it?

Best,
Sebastian





On 03/19/2014 04:45 AM, Saikat Kanjilal wrote:
Hi Guys,
I read through the email threads with the weigh-ins for the inclusion
of H2O as well as Spark and wanted to circle back on the plan for folks to
meet around 1.0, so a few questions:

1) How does the inclusion of H2O and Spark weigh in importance versus
the existing JIRA items for potentially new feature work
to be done in Mahout (in my case JIRA 1248/1249)?
2) From reading all the responses it doesn't seem like there's full
consensus on what the next steps are for H2O and how that relates to the
roadmap around 1.0; please correct me if I'm misunderstanding. Can someone
outline whether any concrete decisions have been made on whether or not
Mahout 1.0 will include H2O bindings?
3) Are we moving forward with the Google hangout? I didn't receive
anything about this yet.




Thanks in advance.




Re: contributing to mahout

2014-03-06 Thread peng

Hi Hardik,

I'm forwarding the previous thread about the 1.0 release plan to you. As a
Spark user you will see many things to be done.


Yours Peng

On Thu 06 Mar 2014 12:57:20 PM EST, Sebastian Schelter wrote:

Hi Hardik,

At the moment, we are working heavily on polishing our documentation.
A very welcome contribution would be a nice-to-read writeup of how to use
an algorithm in Mahout to solve an exemplary problem.

E.g. taking a movie ratings dataset and showing how to compute
recommendations on it.

Best,
Sebastian

On 03/06/2014 06:54 PM, Hardik Pandya wrote:

Hi all,

I am new to Mahout and wanted to contribute to the Mahout dev
community; any
initial pointers for newcomers like me are appreciated.

Thanks,
Hardik Pandya


On Thu, Mar 6, 2014 at 12:44 PM, Hardik Pandya
smarty.ju...@gmail.comwrote:


Hi all,

I am new to mahout and wanted to contribute into mahout dev
community, any
initial pointers for new comers like me appreciated

Thanks,
Hardik Pandya







Re: [jira] [Updated] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-03-02 Thread peng

Wow, I've been waiting for this for a long time; finally fixed.

On Sun 02 Mar 2014 05:01:26 PM EST, Suneel Marthi (JIRA) wrote:


  [ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1178:
--

 Fix Version/s: (was: Backlog)
1.0


GSOC 2013: Improve Lucene support in Mahout
---

 Key: MAHOUT-1178
 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
 Project: Mahout
  Issue Type: New Feature
Reporter: Dan Filimon
Assignee: Gokhan Capan
  Labels: gsoc2013, mentor
 Fix For: 1.0

 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch


[via Ted Dunning]
It should be possible to view a Lucene index as a matrix.  This would
require that we standardize on a way to convert documents to rows.  There
are many choices, the discussion of which should be deferred to the actual
work on the project, but there are a few obvious constraints:
a) it should be possible to get the same result as dumping the term vectors
for each document each to a line and converting that result using standard
Mahout methods.
b) numeric fields ought to work somehow.
c) if there are multiple text fields that ought to work sensibly as well.
  Two options include dumping multiple matrices or to convert the fields
into a single row of a single matrix.
d) it should be possible to refer back from a row of the matrix to find the
correct document.  This might be because we remember the Lucene doc number
or because a field is named as holding a unique id.
e) named vectors and matrices should be used if plausible.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Mahout 1.0 goals

2014-03-02 Thread peng

Hi Dr Dunning,

I'm reluctant to admit that my feeling is similar to that of many of Sean's
customers. As a user of mahout and lucene-solr, I see a lot of
similarities in their cases:

lucene | mahout
indexing takes text as sparse vectors and builds an inverted index | training takes data as sparse vectors and builds a model
the inverted index exists in memory/HDFS | the model exists in memory/HDFS
used by inputting text and returning matches with scores | used by inputting test data and returning scores/labels
model selection compares the ordinal number of scores with ground truth | model selection compares scores/labels with ground truth


Then lucene/solr/elasticsearch evolved to become highly successful
flagship products (as buggy and incomplete as they are, they still gained wide
usage, which mahout never achieved). Yet mahout still looks like it was
assembled with glue and duct tape. The major difficulties I encountered
are:


1. Components are not interchangeable: e.g. the data and model
representation for single-node CF is vastly different from MR CF. New
features sometimes add backward-incompatible representations. This
drastically demoralizes users seeking to integrate with it while expecting
improvement.
2. Components have strong dependencies on others: e.g. cross-validation
of CF can only use the in-memory DataModel, which SlopeOneRecommender
cannot update properly (it's removed, but you get my point). Such designs
never drew enough attention apart from a 'won't fix' resolution.
3. Many models can only be used internally and cannot be exported or
reused in other applications. This is true of solr as well, but its
RESTful API is very universal and many ETL tools have been built for it.
In contrast mahout has a very hard learning curve for non-Java
developers.


It's not bad to see mahout as a service on top of a library, if it
doesn't take too much effort.


Yours Peng

On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:

Ravi,

Good points.

On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla ravi.mummu...@gmail.comwrote:


- Natively support Windows (guidance, etc. No documentation exists today,
for instance)



There is a bit of demand for that.

- Faster time to first application (from discovery to first application

currently takes a non-trivial amount of effort; how can we lower the bar
and reduce the friction for adoption?)



There is huge evidence that this is important.



  - Better documenting use cases with working samples/examples
(Documentation
on https://mahout.apache.org/users/basics/algorithms.html is spread out
and
there is too much focus on algorithms as opposed to use cases - this is an
adoption blocker)



This is also important.



- Uniformity of the API set across all algorithms (are we providing the
same experience across all APIs?)



And many people have been tripped up by this.



  - Measuring/publishing scalability metrics of various algorithms (why
would
we want users to adopt Mahout vs. other frameworks for ML at scale?)



I don't see this as important as some of your other points, but is still
useful.



Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms

2014-02-27 Thread peng
That should be easy. But that defeats the purpose of using mahout, as
there are already enough implementations of single-node backpropagation
(in which case a GPU is much faster).


Yexi:

Regarding downpour SGD and sandblaster, may I suggest that the
implementation would be better off with no parameter server? It's obviously a single
point of failure and, in terms of bandwidth, a bottleneck. I heard that
MLlib on top of Spark has a functional implementation (I never read or
tested it), and it's possible to build the workflow on top of YARN. None of
those frameworks has a heterogeneous topology.


Yours Peng

On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13913488#comment-13913488
 ]

Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
---

I've read the papers. I didn't think about distributed network. I had in mind 
network that will fit into memory, but will require significant amount of 
computations.

I understand that there are better options for neural networks than map reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense. (Doing a 
non-map-reduce neural network in Mahout would be of substantial
interest.)
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining (RBM or/and 
Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i implemented it i 
didn't know that the learning algorithm is not suited for MR, so I think there is no 
point including the patch.


was (Author: maciejmazur):
I've read the papers. I didn't think about distributed network. I had in mind 
network that will fit into memory, but will require significant amount of 
computations.

I understand that there are better options for neural networks than map reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense.
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining (RBM or/and 
Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i implemented it i 
didn't know that the learning algorithm is not suited for MR, so I think there is no 
point including the patch.


GSOC 2013 Neural network algorithms
---

 Key: MAHOUT-1426
 URL: https://issues.apache.org/jira/browse/MAHOUT-1426
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Reporter: Maciej Mazur

I would like to ask about possibilites of implementing neural network 
algorithms in mahout during GSOC.
There is a classifier.mlp package with neural network.
I can't see neighter RBM  nor Autoencoder in these classes.
There is only one word about Autoencoders in NeuralNetwork class.
As far as I know Mahout doesn't support convolutional networks.
Is it a good idea to implement one of these algorithms?
Is it a reasonable amount of work?
How hard is it to get GSOC in Mahout?
Did anyone succeed last year?




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms

2014-02-27 Thread peng
With pleasure! The original downpour paper proposes a parameter server
from which subnodes download shards of the old model and upload gradients.
So if the parameter server is down, the process has to be delayed; it
also requires that all model parameters be stored and atomically
updated on (and fetched from) a single machine, imposing asymmetric HDD
and bandwidth requirements. This design is necessary only because each
-=delta operation has to be atomic, which cannot be ensured across the
network (e.g. on HDFS).


But that doesn't mean the operation cannot be decentralized:
parameters can be sharded across multiple nodes, and multiple
accumulator instances can handle parts of the vector subtraction. This
should be easy if you create a buffer for the stream of gradients and
allocate proper numbers of producers and consumers on each machine to
make sure it doesn't overflow. Obviously this is far from the MR framework,
but at least it can be made homogeneous and slightly faster (because
sparse data can be distributed in a way that minimizes overlap,
so gradients don't have to go across the network that frequently).
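A toy single-process sketch of that idea (plain Python threads and queues,
purely illustrative; a real system would place the shards and buffers on
different machines): parameters are sharded across several accumulators,
producers push gradient slices into bounded buffers, and one consumer per
shard applies the -=delta, so atomicity is only needed per shard.

```python
# Illustrative sketch: sharded parameters with producer/consumer gradient buffers.
import queue
import threading
import numpy as np

NUM_SHARDS, DIM = 4, 16
SLICE = DIM // NUM_SHARDS
shards = [np.zeros(SLICE) for _ in range(NUM_SHARDS)]            # parameter shards
buffers = [queue.Queue(maxsize=128) for _ in range(NUM_SHARDS)]  # bounded, so producers block instead of overflowing

def consumer(s):
    while True:
        grad = buffers[s].get()
        if grad is None:          # poison pill: shut down
            break
        shards[s] -= grad         # the "-=delta" only needs to be atomic per shard

def producer(num_steps):
    for _ in range(num_steps):
        full_grad = 0.01 * np.random.randn(DIM)    # stand-in for a computed gradient
        for s in range(NUM_SHARDS):                # route each slice to its shard's buffer
            buffers[s].put(full_grad[s * SLICE:(s + 1) * SLICE])

consumers = [threading.Thread(target=consumer, args=(s,)) for s in range(NUM_SHARDS)]
for t in consumers:
    t.start()
producer(100)
for s in range(NUM_SHARDS):
    buffers[s].put(None)
for t in consumers:
    t.join()
```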


If we instead use a centralized architecture, then there must be >=1
backup parameter servers for mission-critical training.


Yours Peng

e.g. we can simply use a producer/consumer pattern

If we use a producer/consumer pattern for all gradients,

On Thu 27 Feb 2014 05:09:52 PM EST, Yexi Jiang wrote:

Peng,

Can you provide more details about your thought?

Regards,


2014-02-27 16:00 GMT-05:00 peng pc...@uowmail.edu.au:


That should be easy. But that defeats the purpose of using mahout as there
are already enough implementations of single node backpropagation (in which
case GPU is much faster).

Yexi:

Regarding downpour SGD and sandblaster, may I suggest that the
implementation better has no parameter server? It's obviously a single
point of failure and in terms of bandwidth, a bottleneck. I heard that
MLlib on top of Spark has a functional implementation (never read or test
it), and its possible to build the workflow on top of YARN. Non of those
framework has an heterogeneous topology.

Yours Peng


On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:



  [ https://issues.apache.org/jira/browse/MAHOUT-1426?page=
com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanelfocusedCommentId=13913488#comment-13913488 ]

Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
---

I've read the papers. I didn't think about distributed network. I had in
mind network that will fit into memory, but will require significant amount
of computations.

I understand that there are better options for neural networks than map
reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense. (Doing a
non-map-reduce neural network in Mahout would be of substantial
interest.)
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining (RBM
or/and Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i
implemented it i didn't know that the learning algorithm is not suited for
MR, so I think there is no point including the patch.


was (Author: maciejmazur):
I've read the papers. I didn't think about distributed network. I had in
mind network that will fit into memory, but will require significant amount
of computations.

I understand that there are better options for neural networks than map
reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense.
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining (RBM
or/and Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i
implemented it i didn't know that the learning algorithm is not suited for
MR, so I think there is no point including the patch.

  GSOC 2013 Neural network algorithms

---

  Key: MAHOUT-1426
  URL: https://issues.apache.org/jira/browse/MAHOUT-1426
  Project: Mahout
   Issue Type: Improvement
   Components: Classification
 Reporter: Maciej Mazur

I would like to ask about possibilites of implementing neural network
algorithms in mahout during GSOC.
There is a classifier.mlp package with neural network.
I can't see neighter RBM  nor Autoencoder in these classes.
There is only one word about Autoencoders in NeuralNetwork class.
As far as I know Mahout doesn't support convolutional networks.
Is it a good idea to implement one of these algorithms?
Is it a reasonable amount of work?
How hard

Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms

2014-02-27 Thread peng

Hi Yexi,

I was reading your code and found the MLP class is abstract-ish (both
train functions throw exceptions). Is there a thread or ticket for a
shippable implementation?


Yours Peng

On Thu 27 Feb 2014 06:56:51 PM EST, peng wrote:

With pleasure! the original downpour paper propose a parameter server
from which subnodes download shards of old model and upload gradients.
So if the parameter server is down, the process has to be delayed, it
also requires that all model parameters to be stored and atomically
updated on (and fetched from) a single machine, imposing asymmetric
HDD and bandwidth requirement. This design is necessary only because
each -=delta operation has to be atomic. Which cannot be ensured
across network (e.g. on HDFS).

But it doesn't mean that the operation cannot be decentralized:
parameters can be sharded across multiple nodes and multiple
accumulator instances can handle parts of the vector subtraction. This
should be easy if you create a buffer for the stream of gradient, and
allocate proper numbers of producers and consumers on each machine to
make sure it doesn't overflow. Obviously this is far from MR
framework, but at least it can be made homogeneous and slightly faster
(because sparse data can be distributed in a way to minimize their
overlapping, so gradients doesn't have to go across the network that
frequent).

If we instead using a centralized architect. Then there must be =1
backup parameter server for mission critical training.

Yours Peng

e.g. we can simply use a producer/consumer pattern

If we use a producer/consumer pattern for all gradients,

On Thu 27 Feb 2014 05:09:52 PM EST, Yexi Jiang wrote:

Peng,

Can you provide more details about your thought?

Regards,


2014-02-27 16:00 GMT-05:00 peng pc...@uowmail.edu.au:


That should be easy. But that defeats the purpose of using mahout as
there
are already enough implementations of single node backpropagation
(in which
case GPU is much faster).

Yexi:

Regarding downpour SGD and sandblaster, may I suggest that the
implementation better has no parameter server? It's obviously a single
point of failure and in terms of bandwidth, a bottleneck. I heard that
MLlib on top of Spark has a functional implementation (never read or
test
it), and its possible to build the workflow on top of YARN. Non of
those
framework has an heterogeneous topology.

Yours Peng


On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:



  [ https://issues.apache.org/jira/browse/MAHOUT-1426?page=
com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanelfocusedCommentId=13913488#comment-13913488 ]

Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
---

I've read the papers. I didn't think about distributed network. I
had in
mind network that will fit into memory, but will require
significant amount
of computations.

I understand that there are better options for neural networks than
map
reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense. (Doing a
non-map-reduce neural network in Mahout would be of substantial
interest.)
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining
(RBM
or/and Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i
implemented it i didn't know that the learning algorithm is not
suited for
MR, so I think there is no point including the patch.


was (Author: maciejmazur):
I've read the papers. I didn't think about distributed network. I
had in
mind network that will fit into memory, but will require
significant amount
of computations.

I understand that there are better options for neural networks than
map
reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense.
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining
(RBM
or/and Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i
implemented it i didn't know that the learning algorithm is not
suited for
MR, so I think there is no point including the patch.

  GSOC 2013 Neural network algorithms

---

  Key: MAHOUT-1426
  URL:
https://issues.apache.org/jira/browse/MAHOUT-1426
  Project: Mahout
   Issue Type: Improvement
   Components: Classification
 Reporter: Maciej Mazur

I would like to ask about possibilites of implementing neural network
algorithms in mahout during GSOC.
There is a classifier.mlp package with neural network.
I can't see neighter RBM  nor Autoencoder in these classes

Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms

2014-02-27 Thread peng

Oh, thanks a lot, I missed that one :)
+1 on implementing the easiest one first. I haven't thought about the difficulty
issue; I need to read more about the YARN extension.


Yours Peng

On Thu 27 Feb 2014 08:06:27 PM EST, Yexi Jiang wrote:

Hi, Peng,

Do you mean the MultilayerPerceptron? There are three 'train' method, and
only one (the one without the parameters trackingKey and groupKey) is
implemented. In current implementation, they are not used.

Regards,
Yexi


2014-02-27 19:31 GMT-05:00 Ted Dunning ted.dunn...@gmail.com:


Generally for training models like this, there is an assumption that fault
tolerance is not particularly necessary because the low risk of failure
trades against algorithmic speed.  For reasonably small chance of failure,
simply re-running the training is just fine.  If there is high risk of
failure, simply checkpointing the parameter server is sufficient to allow
restarts without redundancy.

Sharding the parameter is quite possible and is reasonable when the
parameter vector exceed 10's or 100's of millions of parameters, but isn't
likely much necessary below that.

The asymmetry is similarly not a big deal.  The traffic to and from the
parameter server isn't enormous.


Building something simple and working first is a good thing.


On Thu, Feb 27, 2014 at 3:56 PM, peng pc...@uowmail.edu.au wrote:


With pleasure! the original downpour paper propose a parameter server

from

which subnodes download shards of old model and upload gradients. So if

the

parameter server is down, the process has to be delayed, it also requires
that all model parameters to be stored and atomically updated on (and
fetched from) a single machine, imposing asymmetric HDD and bandwidth
requirement. This design is necessary only because each -=delta operation
has to be atomic. Which cannot be ensured across network (e.g. on HDFS).

But it doesn't mean that the operation cannot be decentralized:

parameters

can be sharded across multiple nodes and multiple accumulator instances

can

handle parts of the vector subtraction. This should be easy if you

create a

buffer for the stream of gradient, and allocate proper numbers of

producers

and consumers on each machine to make sure it doesn't overflow. Obviously
this is far from MR framework, but at least it can be made homogeneous

and

slightly faster (because sparse data can be distributed in a way to
minimize their overlapping, so gradients doesn't have to go across the
network that frequent).

If we instead using a centralized architect. Then there must be =1

backup

parameter server for mission critical training.

Yours Peng

e.g. we can simply use a producer/consumer pattern

If we use a producer/consumer pattern for all gradients,

On Thu 27 Feb 2014 05:09:52 PM EST, Yexi Jiang wrote:


Peng,

Can you provide more details about your thought?

Regards,


2014-02-27 16:00 GMT-05:00 peng pc...@uowmail.edu.au:

  That should be easy. But that defeats the purpose of using mahout as

there
are already enough implementations of single node backpropagation (in
which
case GPU is much faster).

Yexi:

Regarding downpour SGD and sandblaster, may I suggest that the
implementation better has no parameter server? It's obviously a single
point of failure and in terms of bandwidth, a bottleneck. I heard that
MLlib on top of Spark has a functional implementation (never read or

test

it), and its possible to build the workflow on top of YARN. Non of

those

framework has an heterogeneous topology.

Yours Peng


On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:



   [ https://issues.apache.org/jira/browse/MAHOUT-1426?page=
com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanelfocusedCommentId=13913488#comment-13913488 ]

Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
---

I've read the papers. I didn't think about distributed network. I had

in

mind network that will fit into memory, but will require significant
amount
of computations.

I understand that there are better options for neural networks than

map

reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense. (Doing a
non-map-reduce neural network in Mahout would be of substantial
interest.)
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining

(RBM

or/and Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i
implemented it i didn't know that the learning algorithm is not suited
for
MR, so I think there is no point including the patch.


was (Author: maciejmazur):
I've read the papers. I didn't think about distributed network. I had

in

mind network that will fit into memory, but will require significant
amount
of computations.

I understand that there are better options

Re: Mahout on Spark?

2014-02-19 Thread peng
It was suggested that I switch to MLlib for its performance, but I doubt
that it is production-ready; even if it is, I would still favour Hadoop's
sturdiness and self-healing.
But maybe mahout can include contribs that M/R is not fit for, like
downpour SGD or graph-based algorithms?


On Wed 19 Feb 2014 07:52:22 AM EST, Sean Owen wrote:

To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.

You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas jayunit...@gmail.com wrote:

+100 for this, different execution engines, like the direction  pig and crunch 
take

Sent from my iPhone


On Feb 19, 2014, at 5:19 AM, Gokhan Capan gkhn...@gmail.com wrote:

I imagine in Mahout offering an option to the users to select from
different execution engines (just like we currently do by giving M/R or
sequential options), and starting from Spark. I am not sure what changes
needed in the codebase, though. Maybe following MLI (or alike) and
implementing some more stuff, such as common interfaces for iterating over
data (the M/R way and the Spark way).

IMO, another effort might be porting pre-online machine learning (such as
transforming text into vectors based on the dictionary generated by
seq2sparse before), machine learning based on mini-batches, and streaming
summarization stuff in Mahout to Spark Streaming.

Best,
Gokhan

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov dlie...@gmail.comwrote:


PS I am moving along a cost optimizer for Spark-backed DRMs on some
multiplicative pipelines that is capable of figuring out different cost-based
rewrites, and an R-like DSL that mixes in-core and distributed matrix
representations and blocks, but it is painfully slow; I'm really only doing it
a couple of nights a month. It does not look like I will be doing it on
company time any time soon (and even if I did, the company doesn't seem to
be inclined to contribute anything new I do on their time). It is
all painfully slow; there's no direct funding for it anywhere with no
strings attached. That probably will be the primary reason why Mahout would not
be able to get much traction compared to university-based contributions.


On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov dlie...@gmail.com

wrote:



Unfortunately methinks the prospects of something like a Mahout/MLlib merge
seem very unlikely due to vastly diverged approaches to the basics of
linear algebra (and other things). Just like one cannot grow a single tree out of
two trunks -- not easily, anyway.

It is fairly easy to port (and subsequently beat) MLlib at this point from
a collection-of-algorithms point of view. But IMO the goal should be more
MLI-like first, and port second. And be very careful with concepts.
Something that I so far don't see happening with MLlib. MLlib seems to be
an old-style, Mahout-like rush to become a collection of basic algorithms
rather than a coherent foundation. Admittedly, I haven't looked very
closely.



On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter s...@apache.org
wrote:


I'm also convinced that Spark is a superior platform for executing
distributed ML algorithms. We've had a discussion about a change from
Hadoop to another platform some time ago, but at that point in time it

was

not clear which of the upcoming dataflow processing systems (Spark,
Hyracks, Stratosphere) would establish itself amongst the users. To me

it

seems pretty obvious that Spark made the race.

I concur with Ted, it would be great to have the communities work
together. I know that at least 4 mahout committers (including me) are
already following Spark's mailinglist and actively participating in the
discussions.

What are the ideas how a fruitful cooperation look like?

Best,
Sebastian

PS:

I ported LLR-based cooccurrence analysis (aka item-based recommendation)
to Spark some time ago, but I haven't had time to test my code on a
large dataset yet. I'd be happy to see someone help with that.







On 02/19/2014 08:04 AM, Nick Pentreath wrote:

I know the Spark/Mllib devs can occasionally be quite set in ways of
doing certain things, but we'd welcome as many Mahout devs as possible

to

work together.


It may be too late, but perhaps a GSoC project to look at a port of
some stuff like the co-occurrence recommender and streaming k-means?




N
--
Sent from Mailbox for iPhone

On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com
wrote:

On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath 

nick.pentre...@gmail.comwrote:


My (admittedly heavily biased) view is Spark is a superior platform
overall
for ML. If the two communities can work together to leverage the
strengths
of Spark, and the large amount of good stuff in Mahout (as well as

the

fantastic depth of 

Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-18 Thread Peng Cheng
Really? I guess PageRank in mahout was removed due to the inherent network
bottleneck of MapReduce. But I didn't know MLlib has the implementation.
Is the MLlib implementation based on Lanczos or SSVD? Just curious...


On 17/02/2014 11:11 PM, Dmitriy Lyubimov wrote:

I bet page rank in mllib in spark finds stationary distribution much faster.
On Feb 17, 2014 1:33 PM, peng pc...@uowmail.edu.au wrote:


Agreed, and this is the case where Lanczos algorithm is obsolete.
My point is: if SSVD is unable to find the eigenvector of asymmetric
matrix (this is a common formulation of PageRank, and some random walks,
and many other things), then we still have to rely on large-scale Lanczos
algorithm.

On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote:


For the symmetric case, SVD is eigen decomposition.




On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote:

  If SSVD is not designed for such eigenvector problem. Then I would vote

for retaining the Lanczos algorithm.
However, I would like to see the opposite case, I have tested both
algorithms on symmetric case and SSVD is much faster and more accurate
than
its competitor.

Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

  In PageRank I'm afraid I have no other option than eigenvector

\lambda, but not singular vector u & v :) The PageRank in Mahout was
removed with other graph-based algorithm.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

  SSVD is very probably better than Lanczos for any large decomposition.

That said, it does SVD, not eigen decomposition which means that the
question of symmetrical matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

   Just asking for possible replacement of our Lanczos-based PageRank


implementation. - Peng







Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-18 Thread peng
Thanks a lot Sebastian, Ted and Dmitriy; I'll try Giraph for a
performance benchmark.
You are right, power iteration is just the simplest form of Lanczos, so
it shouldn't be in the scope.

On Tue 18 Feb 2014 03:59:57 AM EST, Sebastian Schelter wrote:

You can also use giraph for a superfast PageRank implementation. Giraph
even runs on standard hadoop clusters.

Pagerank is usually computed by power iteration, which is much simpler than
lanczos or ssvd and only gives the eigenvector associated with the largest
eigenvalue.
On 18.02.2014 09:33, Peng Cheng pc...@uowmail.edu.au wrote:


Really? I guess PageRank in mahout was removed due to inherited network
bottleneck of mapreduce. But I didn't know MLlib has the implementation. Is
mllib implementation based on Lanczos or SSVD? Just curious...

On 17/02/2014 11:11 PM, Dmitriy Lyubimov wrote:


I bet page rank in mllib in spark finds stationary distribution much
faster.
On Feb 17, 2014 1:33 PM, peng pc...@uowmail.edu.au wrote:

  Agreed, and this is the case where Lanczos algorithm is obsolete.

My point is: if SSVD is unable to find the eigenvector of asymmetric
matrix (this is a common formulation of PageRank, and some random walks,
and many other things), then we still have to rely on large-scale Lanczos
algorithm.

On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote:

  For the symmetric case, SVD is eigen decomposition.





On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote:

   If SSVD is not designed for such eigenvector problem. Then I would
vote


for retaining the Lanczos algorithm.
However, I would like to see the opposite case, I have tested both
algorithms on symmetric case and SSVD is much faster and more accurate
than
its competitor.

Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

   In PageRank I'm afraid I have no other option than eigenvector


\lambda, but not singular vector u & v :) The PageRank in Mahout was
removed with other graph-based algorithm.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

   SSVD is very probably better than Lanczos for any large
decomposition.


 That said, it does SVD, not eigen decomposition which means that
the
question of symmetrical matrices or positive definiteness doesn't
much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

Just asking for possible replacement of our Lanczos-based PageRank

  implementation. - Peng











Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread peng
If SSVD is not designed for such eigenvector problems, then I would vote
for retaining the Lanczos algorithm.
However, I would like to see the opposite case: I have tested both
algorithms on the symmetric case and SSVD is much faster and more accurate
than its competitor.


Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

In PageRank I'm afraid I have no other option than eigenvector
\lambda, but not singular vector u & v :) The PageRank in Mahout was
removed with other graph-based algorithm.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition.
  That said, it does SVD, not eigen decomposition which means that the
question of symmetrical matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:


Just asking for possible replacement of our Lanczos-based PageRank
implementation. - Peng





Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread peng

Agreed, and this is the case where the Lanczos algorithm is obsolete.
My point is: if SSVD is unable to find the eigenvector of an asymmetric
matrix (this is a common formulation of PageRank, and some random
walks, and many other things), then we still have to rely on the
large-scale Lanczos algorithm.


On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote:

For the symmetric case, SVD is eigen decomposition.
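A short note on why the two coincide when A is symmetric, with A = Q \Lambda Q^T
its eigendecomposition (the signs of the eigenvalues are absorbed into the left
singular vectors; take sign(0) = 1 for zero eigenvalues):

```latex
% Eigendecomposition of a symmetric A rearranged into an SVD:
A = Q \Lambda Q^{\mathsf T}
  = \underbrace{\bigl(Q\,\operatorname{sign}(\Lambda)\bigr)}_{U}\,
    \underbrace{\lvert\Lambda\rvert}_{\Sigma}\,
    \underbrace{Q^{\mathsf T}}_{V^{\mathsf T}},
\qquad \sigma_i = \lvert\lambda_i\rvert .
```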




On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote:


If SSVD is not designed for such eigenvector problem. Then I would vote
for retaining the Lanczos algorithm.
However, I would like to see the opposite case, I have tested both
algorithms on symmetric case and SSVD is much faster and more accurate than
its competitor.

Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:


In PageRank I'm afraid I have no other option than eigenvector
\lambda, but not singular vector u & v :) The PageRank in Mahout was
removed with other graph-based algorithm.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:


SSVD is very probably better than Lanczos for any large decomposition.
   That said, it does SVD, not eigen decomposition which means that the
question of symmetrical matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

  Just asking for possible replacement of our Lanczos-based PageRank

implementation. - Peng








Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-12 Thread peng
In PageRank I'm afraid I have no option other than the eigenvector for \lambda,
not the singular vectors u & v :) The PageRank in Mahout was removed with
the other graph-based algorithms.


On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition.
  That said, it does SVD, not eigen decomposition which means that the
question of symmetrical matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:


Just asking for possible replacement of our Lanczos-based PageRank
implementation. - Peng





Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-11 Thread peng
Just asking for possible replacement of our Lanczos-based PageRank 
implementation. - Peng


Learning to rank support in Mahout and Solr integration?

2014-02-09 Thread peng

This is what I believe to be a typical learning to rank model:

1. Create many weak rankers/scorers (a.k.a. feature engineering; in Solr
these are queries/function queries).
2. Test those scorers on a ground-truth dataset, generating feature
vectors for the top-n results annotated by humans.
3. Use an existing classifier/regressor (e.g. support vector ranking,
GBDT, random forest etc.) on those feature vectors to get a ranking model
(see the sketch below).
4. Export this ranking model back to Solr as a custom ensemble query (a
BooleanQuery with custom boosting factors for a linear model, or a
CustomScoreQuery with a custom scoring function for a non-linear model),
push it to the Solr server, register it with a QParser. Push it to production.
End of.
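A minimal sketch of step 3 only (plain Python with scikit-learn as a stand-in
for whichever learner you prefer; the feature names and data are made up for
illustration). The learned linear weights then become the boost factors for the
ensemble BooleanQuery in step 4:

```python
# Illustrative sketch of step 3: learn a linear ranking model from weak-scorer
# feature vectors, then read its weights off as per-feature boosts for step 4.
# Feature names, data and labels are synthetic; any regressor/classifier works here.
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
feature_names = ["bm25_title", "bm25_body", "recency", "popularity"]  # the weak scorers
X = np.random.rand(200, len(feature_names))     # scorer outputs for the top-n results
y = X @ np.array([2.0, 1.0, 0.5, 0.2]) > 1.8    # stand-in for human relevance judgments

model = LogisticRegression().fit(X, y)
boosts = dict(zip(feature_names, model.coef_[0]))
print(boosts)  # -> boost factors to plug into the ensemble BooleanQuery
```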


But I didn't find this workflow quite easy to implement with mahout-solr
integration (is it discouraged for some reason?). Namely, there is no
pipeline from the results of the scorers to a Mahout-compatible vector form, and
there is no pipeline from the ranking model back to an ensemble query. (I only
found the lucene2seq class, and the upcoming recommendation support,
which don't quite fit the scenario). So what's the best practice
for easily implementing a realtime, learning-to-rank search engine in
this case? I've worked in a bunch of startups and such an appliance seems
to be in high demand. (Remember that Solr-based collaborative filtering
model proposed by Dr Dunning? This is the content-based counterpart of it.)


I'm looking forward to streamlining this process to make my upcoming work
easier. I think Mahout/Solr is the undisputed instrument of choice due
to their scalability and the machine learning background of many of their
top committers. Can we talk about it at some point?


Yours Peng


Re: Learning to rank support in Mahout and Solr integration?

2014-02-09 Thread peng

Hi Dr Dunning,

Thanks a lot! I was trying to make the model generalizable enough, but
I'm also afraid I may 'abuse' it a bit. Here is my existing solution:


1. Wrap any scorer in a ValueSource (many exist out-of-the-box in
lucene-solr; extensions are possible but they don't have to be
registered with a ValueSourceParser - they won't be used independently).
2. Extend CustomScoreQuery to have a flat and straightforward
explanation form. Use this as a wrapper of filters (as sub-queries) and
scorers (as function queries).
3. Write a converter to print the flat explanation to Mahout-compatible
vectors.
4. Run a job to 'explain()' those ground truths on an index and dump
the result vectors.

5. (optional) Run other jobs to get non-content-based score vectors.
6. Join them, feed them into a classifier/regressor, do some model
selection.
7. (from this point I haven't done anything) Try to 'migrate' this
model into another CustomScoreQuery, which has a strong scorer that
ensembles features in the same way the model suggested.

8. Push it into the SolrCloud server. Register it with a QParser.

What I found to be hard:

1. Using explanations this way is kind of abusive; they are only designed 
for manual tweaking. I constantly run into problems where the 'explain()' 
implementation was looked down upon by developers and filled with stub 
code. Notably, ToParentBlockJoin won't show nested scores, and 
ToChildBlockJoin simply doesn't work.
2. There is no automatic way to 'migrate' the model into an ensemble 
query. Though I haven't proceeded that far, I'm already afraid of the 
difficulty.
3. As a NoSQL database optimized to the core for text processing, Solr 
extensions are not at all intuitive and are hard to debug and maintain. We 
try to keep this part minimal but still stall at some point.


The environment is built on CDH 5.0beta2 with YARN and Cloudera Search 
(Solr 4.4); some bugs then forced me to uninstall it and install Solr 
Cloud 4.6. I wonder if there are more 'out-of-the-box' solutions?


Yours Peng

On Sun 09 Feb 2014 05:53:20 PM EST, Ted Dunning wrote:

I think that this is a bit of an idiosyncratic model for learning to rank,
but it is a reasonably viable one.

It would be good to have a discussion of what you find hard or easy and
what you think is needed to make this work.

Let's talk.



On Sun, Feb 9, 2014 at 2:26 PM, peng pc...@uowmail.edu.au wrote:


This is what I believe to be a typical learning to rank model:

1. Create many weak rankers/scorers (a.k.a feature engineering, in Solr
these are queries/function queries).
2. Test those scorers on a ground truth dataset. Generating feature
vectors for top-n results annotated by human.
3. Use an existing classifier/regressor (e.g. support vector ranking,
GBDT, random forest etc.) on those feature vectors to get a ranking model.
4. Export this ranking model back to Solr as a custom ensemble query (a
BooleanQuery with custom boosting factor for linear model, or a
CustomScoreQuery with custom scoring function for non-linear model), push
it to Solr server, register with QParser. Push it to production. End of.

But I didn't find this workflow quite easy to implement in mahout-solr
integration (is it discouraged for some reason?). Namely, there is no
pipeline from results of scorers to a Mahout-compatible vector form, and
there is no pipeline from ranking model back to ensemble query. (I only
found the lucene2seq class, and the upcoming recommendation support, which
don't quite fit into the scenario). So what's the best practice for easily
implementing a realtime, learning to rank search engine in this case? I've
worked in a bunch of startups and such appliance seems to be in high
demand. (Remember that solr-based collaborative filtering model proposed by
Dr Dunning? This is the content-based counterpart of it)

I'm looking forward to streamline this process to make my upcoming work
easier. I think Mahout/Solr is the undisputed instrument of choice due to
their scalability and machine learning background of many of their top
committers. Can we talk about it at some point?

Yours Peng





Re: Mahout 0.9 Release

2014-01-29 Thread peng

+1, can't see a bad side.

On Wed 29 Jan 2014 11:33:02 AM EST, Suneel Marthi wrote:

+1 from me





On Wednesday, January 29, 2014 8:58 AM, Sebastian Schelter s...@apache.org 
wrote:

+1


On 01/29/2014 05:25 AM, Andrew Musselman wrote:

Looks good.

+1


On Tue, Jan 28, 2014 at 8:07 PM, Andrew Palumbo ap@outlook.com wrote:


a), b), c), d) all passed here.

CosineDistance of clustered points from cluster-reuters.sh -1 kmeans were
within the range [0,1].


Date: Tue, 28 Jan 2014 16:45:42 -0800
From: suneel_mar...@yahoo.com
Subject: Mahout 0.9 Release
To: u...@mahout.apache.org; dev@mahout.apache.org

Fixed the issues that were reported with Clustering code this past week,

upgraded codebase to Lucene 4.6.1 that was released today.


Here's the URL for the 0.9 release in staging:-


https://repository.apache.org/content/repositories/orgapachemahout-1004/org/apache/mahout/mahout-distribution/0.9/


The artifacts have been signed with the following key:
https://people.apache.org/keys/committer/smarthi.asc

Please:-
a) Verify that u can unpack the release (tar or zip)
b) Verify u r able to compile the distro
c)  Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run

through all the different options in each script.


Need a minimum of 3 '+1' votes from PMC for the release to be finalized.







Sebastian: On the subject of efficient In-memory DataModel for recommendation engine.

2013-10-20 Thread Peng Cheng

Hi Sebastian,

Sorry I dropped out from the Hangout for a few minutes; when I got back 
it was already over.


Well, let's continue the conversation on the DataModel improvement:

I was looking into your KDDCupFactorizablePreferences and found that it 
doesn't load any data into memory; the only data structure in that class 
is the dataFile used to generate a stream of preferences from disk. I 
think this is why you can load it into 1G of memory without a heap-space 
overflow.


However, I think it is only good for saving memory at the expense of lots 
of things (e.g. random access, random insert, delete and update, 
concurrency). This justifies the need to load things into memory. 
Theoretically, a preference array of Netflix size will cost at least:


[8 bytes (userID : long) + 8 bytes (itemID : long) + 4 bytes (value : 
float)] * 100,480,507 = 2,009,610,140 bytes, roughly 1.87 GB (1916.5 MB)


...plus overhead. But I would rather have it a bit bigger to trade for 
O(1) random access/update, though not as big as the current row/column 
sparse matrix-ish implementation that duplicates everything.
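
For reference, a minimal sketch of the parallel-array layout that the 
estimate above assumes (the class is my own illustration, not the attached 
patch): roughly 20 bytes per preference plus array overhead, with O(1) 
access by position.

```java
public class CompactPreferenceStore {
  private final long[] userIDs;    // 8 bytes per preference
  private final long[] itemIDs;    // 8 bytes per preference
  private final float[] values;    // 4 bytes per preference

  public CompactPreferenceStore(int numPreferences) {
    userIDs = new long[numPreferences];
    itemIDs = new long[numPreferences];
    values = new float[numPreferences];
  }

  public void set(int index, long userID, long itemID, float value) {
    userIDs[index] = userID;
    itemIDs[index] = itemID;
    values[index] = value;
  }

  public float getValue(int index) {
    return values[index];
  }
}
```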


That's my concern. I have several ideas for optimizing my in-memory 
dataModel, but never had time to do them. Please give me a few more 
weeks; when the code is optimized to the teeth and supports concurrent 
access, I'll submit it again for review. Gokhan has also done a lot of 
work on this part, so it's good to have many options.


Yours Peng




Why Kahan summation was not used anywhere?

2013-09-18 Thread Peng Cheng
For a large-scale computational engine this seems surprising. Most 
summation/average and dot-product code for vectors still uses naive 
summation despite its O(n) error growth.


Is there a reason?
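
For reference, a minimal Kahan (compensated) summation sketch in plain 
Java; the compensation term keeps the accumulated rounding error roughly 
constant instead of growing with n:

```java
public final class KahanSum {
  public static double sum(double[] values) {
    double sum = 0.0;
    double compensation = 0.0;            // running compensation for lost low-order bits
    for (double value : values) {
      double y = value - compensation;    // subtract the error carried over from the last step
      double t = sum + y;                 // sum is big, y is small: low-order digits of y are lost
      compensation = (t - sum) - y;       // algebraically zero; numerically it recovers the loss
      sum = t;
    }
    return sum;
  }
}
```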

All the best,
Yours Peng



[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-09-04 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758103#comment-13758103
 ] 

Peng Cheng commented on MAHOUT-1286:


The existing open-addressing hash table is for 1-d arrays, not 2-d matrices. I can get 
the concurrency done by next week, but there are simply too many pending optimizations, 
e.g. if you set the load factor to 1.2 it is pretty slow. If you can help improve the 
items on the TODO list in the code, that would be awesome.

I'm not sure about the consequences, as the 2-d matrix interface has an int (32-bit) 
index but the dataModel has a long (64-bit) index. If you don't mind adding more things 
to mahout-math, then it should be alright.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
 Semifinal-implementation-added.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-29 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754269#comment-13754269
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Gokhan,

No problem; it only has two files, so I'll post the patch immediately. -Yours 
Peng

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-29 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Attachment: Semifinal-implementation-added.patch

Sorry about the late reply, and please note that the code can still be optimized in 
many places; I'll keep maintaining it and keep an ear open for all suggestions.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
 Semifinal-implementation-added.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: You are invited to Apache Mahout meet-up

2013-08-22 Thread Peng Cheng
Is the presentation going to be uploaded on Youtube or Slideshare? Sorry 
I cannot be there.


On 13-08-22 08:46 AM, Yexi Jiang wrote:

A great event. I wish I were in Bay area.


2013/8/22 Shannon Quinn squ...@gatech.edu


I'm only sorry I'm not in the Bay area. Sounds great!


On 8/22/13 3:38 AM, Stevo Slavić wrote:


Retweeted meetup invite. Have fun!

Kind regards,
Stevo Slavic.


On Thu, Aug 22, 2013 at 8:34 AM, Ted Dunning ted.dunn...@gmail.com
wrote:

  Very cool.

Would love to see folks turn out for this.


On Wed, Aug 21, 2013 at 9:38 PM, Ellen Friedman
b.ellen.fried...@gmail.com wrote:

  The Apache Mahout user group has been re-activated. If you are in the Bay
Area in California, join us on Aug 27 (Redwood City).

Sebastian Schelter will be the main speaker, talking about new directions
with Mahout recommendation. Grant Ingersoll, Ted Dunning and I will be there
to do a short introduction for the meet-up and an update on the 0.8 release.

Here's the link to rsvp: http://bit.ly/16K32hg

Hope you can come, and please spread the word.

Ellen









[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-16 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742714#comment-13742714
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Dr Dunning,

Much appreciated. I watched your talk in Berlin on YouTube and finally have a clue 
about what is going on here.

If I understand correctly, the core concept is to use Solr as a sparse matrix 
multiplier. So theoretically it can encapsulate any recommendation engine (not 
necessarily CF) as long as the recommendation phase can be cast as a linear 
multiplication. The co-occurrence matrix is one instance; other types of 
recommendation are possible, but slightly harder, and sometimes require multiple 
queries. The following 3 cases should cover most classical CF instances:

1. Item-based CF (result = Sim(A,A) * h, where A is the rating matrix and Sim() is 
the item-to-item similarity matrix between all pairs of items): this is the easiest 
and has already been addressed in your talk: calculate Sim(A,A) beforehand, import it 
into Solr, and run a query ranked by weighted frequency (a minimal query sketch 
follows after this list).

2. User-based CF (result = A^T * Sim(A,h), where Sim() is the user-to-user similarity 
vector between the new user and all old users): slightly more complex; run the first 
query on A ranked by the customized similarity function, then use its result to run 
the second query on A^T ranked by weighted frequency.

3. SVD-based CF: no can do if the new user is not known beforehand; AFAIK Solr doesn't 
have any form of matrix pseudoinversion or optimization function, so determining a new 
user's projection in the SV subspace is impossible given only its dot products with 
some old items. However, if the user in question is old, or the new user can be merged 
into the model in real time, Solr can just look up its vector in the SV subspace with 
a full-match search.

4. Ensemble: obviously another linear operation; it can be expressed as a query with a 
mixed ranking function or as multiple queries. Multi-model recommendation, as a 
juxtaposition of rating matrices (A_1 | A_2), was never a problem either, using 
old-style CF or recommendation-as-search.
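
A minimal sketch of case 1 (the field name "indicators" and the query shape are my own 
assumptions, not an existing Mahout/Solr API): the co-occurrence indicators of each 
item are indexed as terms, a user's history h becomes a disjunction, and the ranking 
then approximates Sim(A,A) * h by weighted term-match frequency.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ItemBasedAsSearch {
  /** One SHOULD clause per item in the user's history, against the indicator field. */
  public static Query historyQuery(String[] itemIdsInUserHistory) {
    BooleanQuery query = new BooleanQuery();   // Lucene 4.x mutable BooleanQuery
    for (String itemId : itemIdsInUserHistory) {
      query.add(new TermQuery(new Term("indicators", itemId)), BooleanClause.Occur.SHOULD);
    }
    return query;
  }
}
```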

Judging by the sheer performance and scalability of Solr, this could potentially make 
recommendation-as-search a superior option. However, as Gokhan inferred, we will likely 
still use the old algorithms for training, but Solr for recommendation. So I'm going 
back to 1274 anyway, using the posted DataModel as temporary glue. It won't be hard for 
me or anybody else to refactor it for the Solr interface.

-Yours Peng

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736962#comment-13736962
 ] 

Peng Cheng commented on MAHOUT-1286:


The idea of ArrayMap has been discarded due to its impractical insertion time (O(n) 
for a batch insertion) and query time (O(log n)). I have moved back to HashMap. For the 
same reason, I suspect that using a sparse row/column matrix would have the same 
problem.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Attachment: InMemoryDataModelTest.java
InMemoryDataModel.java

See the uploaded files for details

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736992#comment-13736992
 ] 

Peng Cheng commented on MAHOUT-1286:


Here is my final solution after numerous experiments: a combination of double hashing 
for storing user/item IDs and 2-d hopscotch hashing 
(http://mcg.cs.tau.ac.il/papers/disc2008-hopscotch.pdf) for storing preferences as a 
map keyed by the user/item indices in the double hashing table. Hopscotch hashing 
maintains strong locality and a high load factor, and each dimension uses an 
independent hash function. As a result, it can quickly extract a submatrix or a single 
row or column.
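
For reference, a generic double-hashing probe sketch (my own illustration, not the 
actual FastByIDMap or hopscotch code): the i-th probe for a key is (h1 + i * h2) mod 
capacity, with h2 forced odd so the probe sequence covers a power-of-two table.

```java
public final class DoubleHashProbe {
  /** capacity is assumed to be a power of two. */
  public static int probe(long key, int attempt, int capacity) {
    int h1 = (int) (key ^ (key >>> 32));                    // first hash: fold the long
    int h2 = 1 | (int) (key * 0x9E3779B97F4A7C15L >>> 33);  // second hash: odd step size
    return (h1 + attempt * h2) & (capacity - 1);            // mask keeps the result in range
  }
}
```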

This is the smallest implementation I can think of; apparently only a Bloom map could 
achieve a smaller memory footprint, but it has many other problems.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Fix Version/s: 0.9
   Labels: collaborative-filtering datamodel patch recommender  (was: )
   Status: Patch Available  (was: Open)

According to my test, it can load the entire Netflix dataset into memory using 
only 3G heap space.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: patch, collaborative-filtering, datamodel, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737019#comment-13737019
 ] 

Peng Cheng edited comment on MAHOUT-1286 at 8/12/13 4:44 PM:
-

Hi Dr Dunning,

Indeed both Gokhan and I have experimented with that, but I've run into some 
difficulties, namely: 1) a columnar form doesn't support fast extraction of rows, yet a 
dataModel should allow quick getPreferencesFromUser() and getPreferencesForItem(); 2) a 
columnar form doesn't support fast online updates (time complexity is O( n ), or 
O( log n ) at best if using block copy and the columns are sorted); 3) to create such a 
dataModel we need to initialize a HashMap first, which uses twice as much heap space 
during initialization and could defeat the purpose.

I'm not sure if Gokhan has encountered the same problems. I didn't hear from him for 
some time.

The search-based recommender is indeed a very tempting solution. I'm quite sure it is 
an all-around improvement for similarity-based recommenders. But low-rank 
matrix-factorization based ones should merge preferences from new users into the 
prediction model immediately; of course you can just project a new user into the 
low-rank subspace, but this reduces performance a little bit.

I'm not sure how well Lucene supports online updates of indices, but according to the 
people I'm working with, online recommenders seem to be in demand these days.

  was (Author: peng):
Hi Dr Dunning,

Indeed both Gokhan and me have experimented on that, but I've run into some 
difficulties, namely 1) a columnar form doesn't support fast extraction of 
rows, yet dataModel should allow quick getPreferencesFromUser() and 
getPreferencesForItem(). 2) a columnar form doesn't support fast online update 
(time complexity is O(n), maximally O(n) if using block copy and columns are 
sorted). 3) To create such dataModel we need to initialize a HashMap first, 
this uses twice as much as heap space for initialization, could defeat the 
purpose though.

I'm not sure if Gokhan has encountered the same problem. Didn't hear from him 
for some time.

The search based recommender is indeed a very tempting solution. I'm very sure 
it is an all-improving solution to similarity-based recommenders. But low rank 
matrix-factorization based ones should merge preferences from the new users 
immediately into the prediction model, of course you can just project it into 
the low rank subspace, but this reduces the performance a little bit.

I'm not sure how much Lucene supports online update of indices, but according 
to guys I'm working with the online recommender seems to be in demand these 
days.
  
 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737019#comment-13737019
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Dr Dunning,

Indeed both Gokhan and me have experimented on that, but I've run into some 
difficulties, namely 1) a columnar form doesn't support fast extraction of 
rows, yet dataModel should allow quick getPreferencesFromUser() and 
getPreferencesForItem(). 2) a columnar form doesn't support fast online update 
(time complexity is O(n), maximally O(n) if using block copy and columns are 
sorted). 3) To create such dataModel we need to initialize a HashMap first, 
this uses twice as much as heap space for initialization, could defeat the 
purpose though.

I'm not sure if Gokhan has encountered the same problem. Didn't hear from him 
for some time.

The search based recommender is indeed a very tempting solution. I'm very sure 
it is an all-improving solution to similarity-based recommenders. But low rank 
matrix-factorization based ones should merge preferences from the new users 
immediately into the prediction model, of course you can just project it into 
the low rank subspace, but this reduces the performance a little bit.

I'm not sure how much Lucene supports online update of indices, but according 
to guys I'm working with the online recommender seems to be in demand these 
days.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737023#comment-13737023
 ] 

Peng Cheng commented on MAHOUT-1286:


Well, I mean, I partially agree that the effort I spent on this probably won't pay 
off, as few will use an in-memory/file dataModel in production; most will choose a 
database-backed one. I'm just trying to solve it because it's a blocker.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736962#comment-13736962
 ] 

Peng Cheng edited comment on MAHOUT-1286 at 8/12/13 4:54 PM:
-

The idea of ArrayMap has been discarded due to its impractical time consumption 
of insertion (O( n ) for a batch insertion) and query (O(logn)). I have moved 
back to HashMap. Due to the same reason, I feel that using Sparse Row/Column 
matrix may have the same problem.

  was (Author: peng):
The idea of ArrayMap has been discarded due to its impractical time 
consumption of insertion (O(n) for a batch insertion) and query (O(logn)). I 
have moved back to HashMap. Due to the same reason, I feel that using Sparse 
Row/Column matrix may have the same problem.
  
 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737553#comment-13737553
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Gentlemen,

Thanks a lot for proving my point, Gokhan; yes, I mean that either user or item 
preference extraction can be fast, but not both.

Sorry, I should have proposed it in our last hangout, but I missed the invitation. 
Still, I have tried to understand your proposal on recommendation-as-search.

From what I heard on YouTube, the new architecture is proposed as an easier and faster 
replacement for all existing recommenders that take a DataModel. Each item is a 
weighted 'bag of words' generated by co-occurrence analysis/item similarity on previous 
ratings. A new user's ratings are converted into a weighted tuple of existing words and 
matched against the items that have the highest sum of hits.

My concerns are: 1) does it support all types of recommenders and their ensembles? I 
know modern search engines like Google and Yandex have fairly complex ensemble search 
and ranking algorithms that look similar to an ensemble recommender, but IMHO Lucene is 
built only for text search, and I'm not sure to what extent it is customizable. 2) Does 
it support online learning? This feature is more important for SVDRecommender, as a new 
user's recommendations are only known once this user is merged into the model. (Of 
course, an option is to project a new user into the user subspace by minimizing its 
distance given its dot products with existing items, but nobody has tested its 
performance before; see the note below.)
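
For reference, that fold-in projection written out (assuming an item-factor matrix 
\(V\), a regularization weight \(\lambda\), and the new user's observed ratings \(r\) 
restricted to the items it rated; this is the standard ridge-regression fold-in, not 
something already in Mahout):

\[
u_{\text{new}} \;=\; \arg\min_{u} \; \lVert r - V u \rVert^{2} + \lambda \lVert u \rVert^{2}
\;=\; (V^{\top} V + \lambda I)^{-1} V^{\top} r
\]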

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737563#comment-13737563
 ] 

Peng Cheng commented on MAHOUT-1286:


Also, please note that the first patch is still not optimized to the extreme; many 
improvements can be made to make it smaller and faster (see the TODO list in the code). 
But I'm trying to get back to MAHOUT-1274; if we expect large-scale refactoring of all 
recommenders in favor of recommendation-as-search, I'll have to suspend it until the 
refactoring is finished.

I'm waiting online for Dr Dunning's plan.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: apache-math dependency

2013-08-12 Thread Peng Cheng
Apologies, I mistook apache-math for mahout-math and didn't know what 
I was talking about :)


On 13-08-12 07:08 PM, Ted Dunning wrote:

Yes.  Apache Math linear algebra is very difficult for us to use because
their matrices are non-extensible.

But there is actually quite a lot of code to do with random distributions,
optimization and quadrature. Those are much more likely to be useful to us.


On Mon, Aug 12, 2013 at 3:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:


Larger part of mahout-math is linear algebra, which is currently broken for
sparse part of the equation and which we don't use at all.

One part of the problem is that our use for that library is always a fringe
case, and as far as i can tell, will always continue to be such.

Another part of the problem is that keeping dependency will invite
bypassing Mahout's solvers and, as a result, architecture inconsistency.

That said, I guess Ted's argument (which is mainly cost, as i gathered),
trumps the two above.


On Mon, Aug 12, 2013 at 3:20 PM, Peng Cheng pc...@uowmail.edu.au wrote:


Seriously, I would prefer keeping the dependency as a good architectural pattern.
It encourages other people to use/contribute to it and avoid repetitive
work.


On 13-08-12 06:16 PM, Ted Dunning wrote:


I am fine with it staying.


On Mon, Aug 12, 2013 at 3:14 PM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

  So you are ok with apache-math dependency to stay?


On Mon, Aug 12, 2013 at 3:09 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

  So I checked on these.  The non-trivial issues with replacing Commons Math include:

- Poisson and negative binomial distributions.  This would be several hours' work to
write and test (we have a Colt-inherited negative binomial distribution, but it takes
no longer to write a new one than to test an old one).

- random number generators.  This is about an hour or two of work to pull the
MersenneTwister implementation into our code.

- next prime number finder.  Not a big deal to replicate, but it would take a few
hours to do.

- quadrature.  We use an adaptive integration routine to check distribution
properties.  This, again, would take a few hours to replace.

I really don't see the benefit to this work.

On Mon, Aug 12, 2013 at 2:53 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

  2 distribution.PoissonDistribution;
 2 distribution.PascalDistribution;
 2 distribution.NormalDistribution;
 1 util.FastMath;
 1 random.RandomGenerator;
 1 random.MersenneTwister;
 1 primes.Primes;
 1 linear.RealMatrix;
 1 linear.EigenDecomposition;
 1 linear.Array2DRowRealMatrix;
 1 distribution.RealDistribution;
 1 distribution.IntegerDistribution;
 1 analysis.integration.UnivariateIntegrator;
 1 analysis.integration.RombergIntegrator;
 1 analysis.UnivariateFunction;









Re: Hangout on Monday

2013-08-05 Thread Peng Cheng

Strange, I didn't see any invitation.

On 13-08-05 06:54 PM, Ted Dunning wrote:

Just sent invite to Mahout dev list.


On Mon, Aug 5, 2013 at 3:53 PM, Ted Dunning ted.dunn...@gmail.com wrote:


It is for both.

If you have g+ installed you can participate.  If not, you can watch.



On Mon, Aug 5, 2013 at 3:51 PM, Sebastian Schelter s...@apache.org wrote:


Is the link only for watching or also for participation? Never did a
hangout before :)

2013/8/5 Andrew Musselman andrew.mussel...@gmail.com


Can't make it alas


On Mon, Aug 5, 2013 at 3:12 PM, Michael Kun Yang kuny...@stanford.edu

wrote:
what's the addr of the hangout?


On Sun, Aug 4, 2013 at 10:37 AM, Peng Cheng pc...@uowmail.edu.au

wrote:

Nice, I'll be there.


On 13-08-03 02:51 PM, Andrew Musselman wrote:


Sounds good


On Sat, Aug 3, 2013 at 12:04 AM, Ted Dunning 

ted.dunn...@gmail.com

wrote:

  Yes.  1600 PDT

I got that right in the linked doc, just not on the more important

email.




On Fri, Aug 2, 2013 at 3:30 PM, Andrew Psaltis 
andrew.psal...@webtrends.com


wrote:
On 8/2/13 4:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:

  Let's have the hangout at 1600 on Monday, August 5th.
Maybe asking the obvious here so I apologize for the spam. The

timezone

is


PDT, correct?












Re: Hangout on Monday

2013-08-05 Thread Peng Cheng
So buggy; the program acts as if I'm in the meeting (showing a push-to-talk 
button), but it doesn't do anything.


On 13-08-05 08:02 PM, Ted Dunning wrote:

Hangouts clearly do not work the way I thought they did.  The URL that I
sent out was for the archived version of the meeting.


On Mon, Aug 5, 2013 at 5:00 PM, Peng Cheng pc...@uowmail.edu.au wrote:


Strange, I didn't see any invitation.


On 13-08-05 06:54 PM, Ted Dunning wrote:


Just sent invite to Mahout dev list.


On Mon, Aug 5, 2013 at 3:53 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

  It is for both.

If you have g+ installed you can participate.  If not, you can watch.



On Mon, Aug 5, 2013 at 3:51 PM, Sebastian Schelter s...@apache.org
wrote:

  Is the link only for watching or also for participation? Never did a

hangout before :)

2013/8/5 Andrew Musselman andrew.mussel...@gmail.com

  Can't make it alas


On Mon, Aug 5, 2013 at 3:12 PM, Michael Kun Yang kuny...@stanford.edu


wrote:
what's the addr of the hangout?


On Sun, Aug 4, 2013 at 10:37 AM, Peng Cheng pc...@uowmail.edu.au


wrote:


Nice, I'll be there.


On 13-08-03 02:51 PM, Andrew Musselman wrote:

  Sounds good


On Sat, Aug 3, 2013 at 12:04 AM, Ted Dunning 


ted.dunn...@gmail.com

  wrote:

   Yes.  1600 PDT


I got that right in the linked doc, just not on the more important


email.


On Fri, Aug 2, 2013 at 3:30 PM, Andrew Psaltis 
andrew.psal...@webtrends.com

  wrote:

On 8/2/13 4:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:

   Let's have the hangout at 1600 on Monday, August 5th.
Maybe asking the obvious here so I apologize for the spam. The


timezone

is

  PDT, correct?











Re: Hangout on Monday

2013-08-05 Thread Peng Cheng
Oh sorry, I figured out the problem: my Google+ account uses my Gmail address. 
I'll change that right away.


On 13-08-05 08:16 PM, Ted Dunning wrote:

Peng,

It looks like you are not actually on google plus.  I have you in my Mahout
circle under your iowa email address, but I am unable to add you to a
hangout.


On Mon, Aug 5, 2013 at 5:07 PM, Peng Cheng pc...@uowmail.edu.au wrote:


So buggy, the program act as i'm in the meeting (showing a push to talk
button), but it doesn't do anything.


On 13-08-05 08:02 PM, Ted Dunning wrote:


Hangouts clearly do not work the way I thought they did.  The URL that I
sent out was for the archived version of the meeting.


On Mon, Aug 5, 2013 at 5:00 PM, Peng Cheng pc...@uowmail.edu.au wrote:

  Strange, I didn't see any invitation.


On 13-08-05 06:54 PM, Ted Dunning wrote:

  Just sent invite to Mahout dev list.


On Mon, Aug 5, 2013 at 3:53 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

   It is for both.


If you have g+ installed you can participate.  If not, you can watch.



On Mon, Aug 5, 2013 at 3:51 PM, Sebastian Schelter s...@apache.org
wrote:

   Is the link only for watching or also for participation? Never did a


hangout before :)

2013/8/5 Andrew Musselman andrew.mussel...@gmail.com

   Can't make it alas


On Mon, Aug 5, 2013 at 3:12 PM, Michael Kun Yang 
kuny...@stanford.edu

  wrote:

what's the addr of the hangout?


On Sun, Aug 4, 2013 at 10:37 AM, Peng Cheng pc...@uowmail.edu.au

  wrote:

  Nice, I'll be there.

On 13-08-03 02:51 PM, Andrew Musselman wrote:

   Sounds good


On Sat, Aug 3, 2013 at 12:04 AM, Ted Dunning 

  ted.dunn...@gmail.com

   wrote:
Yes.  1600 PDT

  I got that right in the linked doc, just not on the more

important

  email.
On Fri, Aug 2, 2013 at 3:30 PM, Andrew Psaltis 
andrew.psal...@webtrends.com

   wrote:


On 8/2/13 4:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Let's have the hangout at 1600 on Monday, August 5th.
Maybe asking the obvious here so I apologize for the spam. The

  timezone

is
   PDT, correct?











Re: Hangout on Monday

2013-08-04 Thread Peng Cheng

Nice, I'll be there.

On 13-08-03 02:51 PM, Andrew Musselman wrote:

Sounds good


On Sat, Aug 3, 2013 at 12:04 AM, Ted Dunning ted.dunn...@gmail.com wrote:


Yes.  1600 PDT

I got that right in the linked doc, just not on the more important email.




On Fri, Aug 2, 2013 at 3:30 PM, Andrew Psaltis 
andrew.psal...@webtrends.com

wrote:
On 8/2/13 4:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:


Let's have the hangout at 1600 on Monday, August 5th.

Maybe asking the obvious here so I apologize for the spam. The timezone

is

PDT, correct?







Re: [jira] [Created] (MAHOUT-1298) SparseRowMatrix,SparseColMatrix: optimize transpose()

2013-07-29 Thread Peng Cheng

+1, we have type conversion anyway.

On 29/07/2013 6:40 PM, Sebastian Schelter wrote:

+1

2013/7/29 Dmitriy Lyubimov (JIRA) j...@apache.org


Dmitriy Lyubimov created MAHOUT-1298:


  Summary: SparseRowMatrix,SparseColMatrix: optimize transpose()
  Key: MAHOUT-1298
  URL: https://issues.apache.org/jira/browse/MAHOUT-1298
  Project: Mahout
   Issue Type: New Feature
   Components: Math
 Affects Versions: 0.8
 Reporter: Dmitriy Lyubimov
 Assignee: Dmitriy Lyubimov
  Fix For: 0.9


these matrices lack an optimized transpose and rely on AbstractMatrix's
O(mn) implementation, which is not cool for very sparse subblocks.

the proposal is to implement a custom transpose with two things in mind:

1) the transpose of a row-sparse matrix should be a column-sparse matrix, and
vice versa (and not whatever the default like() implementation would produce);

2) obviously, iterate only through the non-zero elements of all
rows (columns).
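
A rough sketch of that proposal (assuming Mahout 0.8's Matrix.viewRow() and
Vector.iterateNonZero(); treat the exact class names and signatures as assumptions,
not the final patch): transpose a row-sparse matrix into a column-sparse one,
touching only the non-zero cells.

```java
import java.util.Iterator;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SparseColumnMatrix;
import org.apache.mahout.math.SparseRowMatrix;
import org.apache.mahout.math.Vector;

public class SparseTranspose {
  public static Matrix transpose(SparseRowMatrix m) {
    Matrix result = new SparseColumnMatrix(m.columnSize(), m.rowSize());
    for (int row = 0; row < m.rowSize(); row++) {
      Iterator<Vector.Element> it = m.viewRow(row).iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        result.set(e.index(), row, e.get());   // (column, row) in the transposed matrix
      }
    }
    return result;
  }
}
```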

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA
administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-23 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717659#comment-13717659
 ] 

Peng Cheng commented on MAHOUT-1286:


Aye aye, I just did; it turns out that instances of PreferenceArray$PreferenceView 
have taken 1.7G. Quite unexpected, right? Thanks a lot for the advice.
My next experiment will just use GenericPreference[] directly; there will be 
no more PreferenceArray.

Class Name                                                                       | Objects    | Shallow Heap  | Retained Heap
------------------------------------------------------------------------------------------------------------------------------
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView | 72,237,632 | 1,733,703,168 | 1,733,703,168
long[]                                                                           |    480,199 |   818,209,680 |   818,209,680
float[]                                                                          |    480,190 |   410,563,592 |   410,563,592
java.lang.Object[]                                                               |     18,230 |   361,525,488 | 2,443,647,088
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray                 |    480,189 |    15,366,048 | 1,237,456,672
java.util.ArrayList                                                              |     17,811 |       427,464 | 2,092,416,104
char[]                                                                           |      2,150 |       272,632 |       272,632
byte[]                                                                           |        141 |        54,048 |        54,048
java.lang.String                                                                 |      2,119 |        50,856 |       271,920
java.util.concurrent.ConcurrentHashMap$HashEntry                                 |        673 |        21,536 |        38,104
java.net.URL                                                                     |        229 |        14,656 |        40,720
java.util.HashMap$Entry                                                          |        344 |        11,008 |        68,760
------------------------------------------------------------------------------------------------------------------------------


 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: [jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-23 Thread Peng Cheng
That's exactly what I'm trying to do right now :) (I'm testing 
FastByIDArrayMap), but we probably have more problems than just the 
HashMap; based on the heap dump analysis, PreferenceArray will probably be 
our next target. This is awesome, since your FactorizablePreferences didn't 
use it in the first place.


Yours Peng

On 13-07-23 05:46 PM, Sebastian Schelter wrote:

IMHO you will always have memory issues if you try to provide constant time
random access. Thats why I proposed to created a special memory efficient
DataModel for sequential access.


2013/7/23 Peng Cheng (JIRA) j...@apache.org


 [
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717659#comment-13717659]

Peng Cheng commented on MAHOUT-1286:


Aye aye, I just did, turns out that instances of
PreferenceArray$PreferenceView has taken 1.7G. Quite unexpected right?
Thanks a lot for the advice.
My next experiment will just use GenericPreference [] directly, there will
be no more PreferenceArray.

Class Name
 |Objects |  Shallow Heap |Retained Heap

---
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView|
72,237,632 | 1,733,703,168 | = 1,733,703,168
long[]
 |480,199 |   818,209,680 |   = 818,209,680
float[]
  |480,190 |   410,563,592 |   = 410,563,592
java.lang.Object[]
 | 18,230 |   361,525,488 | = 2,443,647,088
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray
 |480,189 |15,366,048 | = 1,237,456,672
java.util.ArrayList
  | 17,811 |   427,464 | = 2,092,416,104
char[]
 |  2,150 |   272,632 |   = 272,632
byte[]
 |141 |54,048 |= 54,048
java.lang.String
 |  2,119 |50,856 |   = 271,920
java.util.concurrent.ConcurrentHashMap$HashEntry
 |673 |21,536 |= 38,104
java.net.URL
 |229 |14,656 |= 40,720
java.util.HashMap$Entry
  |344 |11,008 |= 68,760

---



Memory-efficient DataModel, supporting fast online updates and

element-wise iteration
-

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

Most DataModel implementation in current CF component use hash map to

enable fast 2d indexing and update. This is not memory-efficient for big
data set. e.g. Netflix prize dataset takes 11G heap space as a
FileDataModel.

Improved implementation of DataModel should use more compact data

structure (like arrays), this can trade a little of time complexity in 2d
indexing for vast improvement in memory efficiency. In addition, any online
recommender or online-to-batch converted recommender will not be affected
by this in training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA
administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715885#comment-13715885
 ] 

Peng Cheng commented on MAHOUT-1286:


On second thought, the hash map is very likely not the culprit for poor memory 
efficiency here; apologies for the misinformation. The double hashing algorithm in 
FastByIDMap, as described in Don Knuth's book 'The Art of Computer Programming', has a 
default loadFactor of 1.5, which means the size of the array is only 1.5 times the 
number of keys. So theoretically the heap size of GenericDataModel should never exceed 
3 times the size of FactorizablePreferences. I'm still very unclear about FastByIDMap's 
implementation, like how it handles deletion of entries, so I cannot tell whether my 
observation on Netflix is caused by GC (e.g. constructing new arrays too often), by 
deletion, or by the extra space allocated for timestamps. We probably have to run 
Netflix in debug mode to identify the problem.

I'll try to bring this topic up in the next hangout. Please give me some hints if you 
are an expert in those FastMap implementations.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715906#comment-13715906
 ] 

Peng Cheng commented on MAHOUT-1286:


On the other hand, I tried to solve the problem by implementing FastByIDArrayMap, a 
slightly more compact Map implementation than FastByIDMap. It uses binary 
search to arrange all entries into a tight array, so its worst-case time 
complexity for get, put and delete is O(log n) (much slower than double hashing's 
average O(1)), but it has a (marginally) smaller memory footprint and faster 
iteration. It has no problem passing all unit tests, but its real performance 
can only be shown when embedded in FileDataModel. I'll post the results shortly.

However, I don't feel this is the right direction. If Sean Owen did everything 
right in his FastByIDMap, then reducing the memory footprint to 0.66 times isn't 
worth the speed loss.
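
A rough sketch of the idea (a hypothetical simplification; the actual FastByIDArrayMap in the patch also has to handle growth and deletion):

```java
// Simplified sorted-array map: keys kept sorted, lookups via binary search (O(log n)),
// inserts shift the tail of the arrays. A real implementation would grow the arrays in
// amortized chunks and support deletion; this is just the core idea, not the patch code.
import java.util.Arrays;

final class SortedArrayMapSketch {

  private long[] keys = new long[0];
  private Object[] values = new Object[0];

  Object get(long key) {
    int pos = Arrays.binarySearch(keys, key);
    return pos >= 0 ? values[pos] : null;
  }

  void put(long key, Object value) {
    int pos = Arrays.binarySearch(keys, key);
    if (pos >= 0) {
      values[pos] = value;            // overwrite the existing entry
      return;
    }
    int insertAt = -pos - 1;          // insertion point, per the binarySearch contract
    long[] newKeys = new long[keys.length + 1];
    Object[] newValues = new Object[values.length + 1];
    System.arraycopy(keys, 0, newKeys, 0, insertAt);
    System.arraycopy(values, 0, newValues, 0, insertAt);
    newKeys[insertAt] = key;
    newValues[insertAt] = value;
    System.arraycopy(keys, insertAt, newKeys, insertAt + 1, keys.length - insertAt);
    System.arraycopy(values, insertAt, newValues, insertAt + 1, values.length - insertAt);
    keys = newKeys;
    values = newValues;
  }
}
```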

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715906#comment-13715906
 ] 

Peng Cheng edited comment on MAHOUT-1286 at 7/23/13 12:26 AM:
--

In another hand, I try to solve the problem by implementing FastByIDArrayMap, a 
slightly more compact Map implementation than FastByIDMap, it uses binary 
search to arrange all entries into a tight array, so its worst-case time 
complexity for get, put and delete is log( n ) (much slower than double 
hashing's average O(1)). But has a (marginally) smaller memory footprint and 
faster iteration. It has no problem passing all unit tests. But its real 
performance can only be shown when embedded in FileDataModel. I'll post the 
result shortly.

However, I don't feel this is the right direction. If Sean Owen did everything 
right in his FastByIDMap, then reducing memory footage to 0.66 times doesn't 
worth the speed loss.

  was (Author: peng):
In another hand, I try to solve the problem by implementing 
FastByIDArrayMap, a slightly more compact Map implementation than FastByIDMap, 
it uses binary search to arrange all entries into a tight array, so its 
worst-case time complexity for get, put and delete is log(n) (much slower than 
double hashing's average O(1)). But has a (marginally) smaller memory footprint 
and faster iteration. It has no problem passing all unit tests. But its real 
performance can only be shown when embedded in FileDataModel. I'll post the 
result shortly.

However, I don't feel this is the right direction. If Sean Owen did everything 
right in his FastByIDMap, then reducing memory footage to 0.66 times doesn't 
worth the speed loss.
  
 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715912#comment-13715912
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Sebastian, Gokhan, how do you feel about the cause of the memory efficiency 
problem? Do you think we should talk privately? I'm also interested in your 
experimentation results.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Regarding Online Recommenders

2013-07-19 Thread Peng Cheng

Hi,

Just one simple question: Is the 
org.apache.mahout.math.BinarySearch.binarySearch() function an optimized 
version of Arrays.binarySearch()? If it is not, why implement it again?


Yours Peng

On 13-07-17 06:31 PM, Sebastian Schelter wrote:

You are completely right, the simple interface would only be usable for
readonly / batch-updatable recommenders. Online recommenders might need
something different. I tried to widen the discussion here to discuss all
kinds of API changes in the recommenders that would be necessary in the
future.



2013/7/17 Peng Cheng pc...@uowmail.edu.au


One thing that suddenly comes to my mind is that, for a simple interface
like FactorizablePreferences, maybe sequential READ in real time is
possible, but sequential WRITE in O(1) time is Utopia. Because you need to
flush out old preference with same user and item ID (in worst case it could
be an interpolation search), otherwise you are permitting a user rating an
item twice with different values. Considering how FileDataModel suppose to
work (new files flush old files), maybe using the simple interface has less
advantages than we used to believe.


On 13-07-17 04:58 PM, Sebastian Schelter wrote:


Hi Peng,

I never wanted to discard the old interface, I just wanted to split it up.
I want to have a simple interface that only supports sequential access
(and
allows for very memory efficient implementions, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what is already does).

Than a recommender such as SGD could state that it only needs sequential
access to the preferences and you can either feed it a DataModel (so we
dont break backwards compatibility) or a memory efficient sequential
access thingy.

Does that make sense for you?


2013/7/17 Peng Cheng pc...@uowmail.edu.au

  I see, OK so we shouldn't use the old implementation. But I mean, the old

interface doesn't have to be discarded. The discrepancy between your
FactorizablePreferences and DataModel is that, your model supports
getPreferences(), which returns all preferences as an iterator, and
DataModel supports a few old functions that returns preferences for an
individual user or item.

My point is that, it is not hard for each of them to implement what they
lack of: old DataModel can implement getPreferences() just by a a loop in
abstract class. Your new FactorizablePreferences can implement those old
functions by a binary search that takes O(log n) time, or an
interpolation
search that takes O(log log n) time in average. So does the online
update.
It will just be a matter of different speed and space, but not different
interface standard, we can use old unit tests, old examples, old
everything. And we will be more flexible in writing ensemble recommender.

Just a few thoughts, I'll have to validate the idea first before creating
a new JIRA ticket.

Yours Peng



On 13-07-16 02:51 PM, Sebastian Schelter wrote:

  I completely agree, Netflix is less than one gigabye in a smart

representation, 12x more memory is a nogo. The techniques used in
FactorizablePreferences allow a much more memory efficient
representation,
tested on KDD Music dataset which is approx 2.5 times Netflix and fits
into
3GB with that approach.


2013/7/16 Ted Dunning ted.dunn...@gmail.com

   Netflix is a small dataset.  12G for that seems quite excessive.


Note also that this is before you have done any work.

Ideally, 100million observations should take  1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
wrote:

   The second idea is indeed splendid, we should separate
time-complexity


first and space-complexity first implementation. What I'm not quite
sure,
is that if we really need to create two interfaces instead of one.
Personally, I think 12G heap space is not that high right? Most new

  laptop

  can already handle that (emphasis on laptop). And if we replace hash

map
(the culprit of high memory consumption) with list/linkedList, it
would
simply degrade time complexity for a linear search to O(n), not too
bad
either. The current DataModel is a result of careful thoughts and has
underwent extensive test, it is easier to expand on top of it instead
of
subverting it.









Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
For a low-rank matrix-factorization-based recommender, a new preference 
is not stored as itself, but as a dot product of two vectors in the low-dimensional 
space, so it needs no projection. The user and item vectors, however, may 
need to be projected into a lower-dimensional space, if and only if you 
want to reduce the rank of the preference matrix. The refactorization 
step in SGD is super fast--that's the charm of SGD. So, yes, we will 
refactorize on every update.
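
To make the per-update refactorization concrete, here is a minimal sketch of a single-preference SGD step (the standard regularized update, not the exact code of the factorizer under discussion):

```java
// One SGD step for a single observed (user, item, rating) triple: the prediction is the
// dot product of the two factor vectors, and both vectors are nudged along the gradient
// of the squared error with L2 regularization. Textbook update, not the factorizer's code.
final class SgdStepSketch {
  static void sgdStep(double[] userFactors, double[] itemFactors,
                      float rating, double learningRate, double lambda) {
    double prediction = 0.0;
    for (int k = 0; k < userFactors.length; k++) {
      prediction += userFactors[k] * itemFactors[k];
    }
    double err = rating - prediction;
    for (int k = 0; k < userFactors.length; k++) {
      double u = userFactors[k];
      double v = itemFactors[k];
      userFactors[k] = u + learningRate * (err * v - lambda * u);
      itemFactors[k] = v + learningRate * (err * u - lambda * v);
    }
  }
}
```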


Yours Peng

On 13-07-18 11:34 AM, Pat Ferrel wrote:

On Jul 17, 2013, at 1:19 PM, Gokhan Capan gkhn...@gmail.com wrote:

Hi Pat, please see my response inline.

Best,
Gokhan


On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:


May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch
mode, then there are periodic updates? How are the updates handled?


If you are referring to the recommender of discussion here, no, updating
the model can be done with a single preference, using stochastic gradient
descent, by updating the particular user and item factors simultaneously.


Aren't there two different things needed to truly update the model: 1) add the 
new preference to the lower dimensional space 2) refactorize the all 
preferences. #2 only needs to be done periodically--afaik. #1 would be super 
fast and could be done at runtime.  Am I wrong or are you planning to 
incrementally refactorize the entire preference array with every new preference?






Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
If I remember right, a highlight of the 0.8 release is an online clustering 
algorithm. I'm not sure if it can be used in an item-based recommender, but 
this is definitely something I would like to pursue. It's probably the only 
advantage a non-hadoop implementation can offer in the future.


Many non-hadoop recommenders are pretty fast. But the existing in-memory 
GenericDataModel and FileDataModel are largely implemented for 
sandboxes; IMHO they are the culprit of the scalability problem.


May I ask about the scale of your dataset? How many ratings does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:

Well, with itembased the only problem is new items. New users can
immediately be served by the model (although this is not well supported by
the API in Mahout). For the majority of usecases I saw, it is perfectly
fine to have a short delay until new items enter the recommender, usually
this happens after a retraining in batch. You have to care for cold-start
and collect some interactions anyway.


2013/7/18 Pat Ferrel pat.fer...@gmail.com


Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender not only
factorized. Ted talks about using Solr for this, which we're experimenting
with alongside Myrrix. I suspect Solr works but it does require a bit of
tinkering and doesn't have quite the same set of options--no llr similarity
for instance.

On the same subject I recently attended a workshop in Seattle for UAI2013
where Walmart reported similar results using a factorized recommender. They
had to increase the factor number past where it would perform well. Along
the way they saw increasing performance measuring precision offline. They
eventually gave up on a factorized solution. This decision seems odd but
anyway… In the case of Walmart and our data set they are quite diverse. The
best idea is probably to create different recommenders for separate parts
of the catalog but if you create one model on all items our intuition is
that item-based works better than factorized. Again caveat--no A/B tests to
support this yet.

Doing an online item-based recommender would quickly run into scaling
problems, no? We put together the simple Mahout in-memory version and it
could not really handle more than a down-sampled few months of our data.
Down-sampling lost us 20% of our precision scores so we moved to the hadoop
version. Now we have use-cases for an online recommender that handles
anonymous new users and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote:

Hi Pat

I think we should provide a simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For itembased recommenders, its
straightforward to compute recommendations, for userbased you have to
search through all users once, for latent factor models, you have to fold
the user vector into the low dimensional space.

I think Sean already added this method in myrrix and I have some code for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel pat.fer...@gmail.com


May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch
mode, then there are periodic updates? How are the updates handled? Do

you

plan to require batch model refactorization for any update? Or perform

some

partial update by maybe just transforming new data into the LF space
already in place then doing full refactorization every so often in batch
mode?

By 'anonymous users' I mean users with some history that is not yet
incorporated in the LF model. This could be history from a new user asked
to pick a few items to start the rec process, or an old user with some

new

action history not yet in the model. Are you going to allow for passing

the

entire history vector or userID+incremental new history to the

recommender?

I hope so.

For what it's worth we did a comparison of Mahout Item based CF to Mahout
ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months

of

data. The data was purchase data from a diverse ecom source with a large
variety of products from electronics to clothes. We found Item based CF

did

far better than ALS. As we increased the number of latent factors the
results got better but were never within 10% of item based (we used MAP

as

the offline metric). Not sure why but maybe it has to do with the

diversity

of the item types.

I understand that a full item based online recommender has very different
tradeoffs and anyway others may not have seen this disparity of results.
Furthermore we don't have A/B test results yet to validate the offline
metric.

On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote:

Peng,

This is the reason I separated out the DataModel, and only put the

learner

stuff there. The learner I

Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
Strange, it's just a little bit larger than the libimseti dataset (17m 
ratings). Did you encounter an OutOfMemory or GCTimeOut exception? 
Allocating more heap space usually helps.


Yours Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote:

It was about 2.5M users and 500K items with 25M actions over 6 months of data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

If I remember right, a highlight of 0.8 release is an online clustering 
algorithm. I'm not sure if it can be used in item-based recommender, but this 
is definitely I would like to pursue. It's probably the only advantage a 
non-hadoop implementation can offer in the future.

Many non-hadoop recommenders are pretty fast. But existing in-memory 
GenericDataModel and FileDataModel are largely implemented for sandboxes, IMHO 
they are the culprit of scalability problem.

May I ask about the scale of your dataset? how many rating does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:

Well, with itembased the only problem is new items. New users can
immediately be served by the model (although this is not well supported by
the API in Mahout). For the majority of usecases I saw, it is perfectly
fine to have a short delay until new items enter the recommender, usually
this happens after a retraining in batch. You have to care for cold-start
and collect some interactions anyway.


2013/7/18 Pat Ferrel pat.fer...@gmail.com


Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender not only
factorized. Ted talks about using Solr for this, which we're experimenting
with alongside Myrrix. I suspect Solr works but it does require a bit of
tinkering and doesn't have quite the same set of options--no llr similarity
for instance.

On the same subject I recently attended a workshop in Seattle for UAI2013
where Walmart reported similar results using a factorized recommender. They
had to increase the factor number past where it would perform well. Along
the way they saw increasing performance measuring precision offline. They
eventually gave up on a factorized solution. This decision seems odd but
anyway… In the case of Walmart and our data set they are quite diverse. The
best idea is probably to create different recommenders for separate parts
of the catalog but if you create one model on all items our intuition is
that item-based works better than factorized. Again caveat--no A/B tests to
support this yet.

Doing an online item-based recommender would quickly run into scaling
problems, no? We put together the simple Mahout in-memory version and it
could not really handle more than a down-sampled few months of our data.
Down-sampling lost us 20% of our precision scores so we moved to the hadoop
version. Now we have use-cases for an online recommender that handles
anonymous new users and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote:

Hi Pat

I think we should provide a simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For itembased recommenders, its
straightforward to compute recommendations, for userbased you have to
search through all users once, for latent factor models, you have to fold
the user vector into the low dimensional space.

I think Sean already added this method in myrrix and I have some code for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel pat.fer...@gmail.com


May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch
mode, then there are periodic updates? How are the updates handled? Do

you

plan to require batch model refactorization for any update? Or perform

some

partial update by maybe just transforming new data into the LF space
already in place then doing full refactorization every so often in batch
mode?

By 'anonymous users' I mean users with some history that is not yet
incorporated in the LF model. This could be history from a new user asked
to pick a few items to start the rec process, or an old user with some

new

action history not yet in the model. Are you going to allow for passing

the

entire history vector or userID+incremental new history to the

recommender?

I hope so.

For what it's worth we did a comparison of Mahout Item based CF to Mahout
ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months

of

data. The data was purchase data from a diverse ecom source with a large
variety of products from electronics to clothes. We found Item based CF

did

far better than ALS. As we increased the number of latent factors the
results got better but were never within 10% of item based (we used MAP

as

the offline metric). Not sure why but maybe it has to do with the

diversity

of the item types.

I understand that a full

Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
I see, sorry I was too presumptuous. I only recently worked on and tested 
SVDRecommender, so I couldn't have known how efficient it is compared to an 
item-based recommender. Maybe there is room for algorithmic optimization.


The online recommender Gokhan is working on is also an SVDRecommender. 
An online user-based or item-based recommender based on a clustering 
technique would definitely be critical, but we need an expert to 
volunteer :)


Perhaps Dr Dunning can say a few words? He announced the online 
clustering component.


Yours Peng

On 13-07-18 03:54 PM, Pat Ferrel wrote:

No it was CPU bound not memory. I gave it something like 14G heap. It was 
running, just too slow to be of any real use. We switched to the hadoop version 
and stored precalculated recs in a db for every user.

On Jul 18, 2013, at 12:06 PM, Peng Cheng pc...@uowmail.edu.au wrote:

Strange, its just a little bit larger than limibseti dataset (17m ratings), did 
you encountered an outOfMemory or GCTimeOut exception? Allocating more heap 
space usually help.

Yours Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote:

It was about 2.5M users and 500K items with 25M actions over 6 months of data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

If I remember right, a highlight of 0.8 release is an online clustering 
algorithm. I'm not sure if it can be used in item-based recommender, but this 
is definitely I would like to pursue. It's probably the only advantage a 
non-hadoop implementation can offer in the future.

Many non-hadoop recommenders are pretty fast. But existing in-memory 
GenericDataModel and FileDataModel are largely implemented for sandboxes, IMHO 
they are the culprit of scalability problem.

May I ask about the scale of your dataset? how many rating does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:

Well, with itembased the only problem is new items. New users can
immediately be served by the model (although this is not well supported by
the API in Mahout). For the majority of usecases I saw, it is perfectly
fine to have a short delay until new items enter the recommender, usually
this happens after a retraining in batch. You have to care for cold-start
and collect some interactions anyway.


2013/7/18 Pat Ferrel pat.fer...@gmail.com


Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender not only
factorized. Ted talks about using Solr for this, which we're experimenting
with alongside Myrrix. I suspect Solr works but it does require a bit of
tinkering and doesn't have quite the same set of options--no llr similarity
for instance.

On the same subject I recently attended a workshop in Seattle for UAI2013
where Walmart reported similar results using a factorized recommender. They
had to increase the factor number past where it would perform well. Along
the way they saw increasing performance measuring precision offline. They
eventually gave up on a factorized solution. This decision seems odd but
anyway… In the case of Walmart and our data set they are quite diverse. The
best idea is probably to create different recommenders for separate parts
of the catalog but if you create one model on all items our intuition is
that item-based works better than factorized. Again caveat--no A/B tests to
support this yet.

Doing an online item-based recommender would quickly run into scaling
problems, no? We put together the simple Mahout in-memory version and it
could not really handle more than a down-sampled few months of our data.
Down-sampling lost us 20% of our precision scores so we moved to the hadoop
version. Now we have use-cases for an online recommender that handles
anonymous new users and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote:

Hi Pat

I think we should provide a simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For itembased recommenders, its
straightforward to compute recommendations, for userbased you have to
search through all users once, for latent factor models, you have to fold
the user vector into the low dimensional space.

I think Sean already added this method in myrrix and I have some code for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel pat.fer...@gmail.com


May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch
mode, then there are periodic updates? How are the updates handled? Do

you

plan to require batch model refactorization for any update? Or perform

some

partial update by maybe just transforming new data into the LF space
already in place then doing full refactorization every so often in batch
mode?

By 'anonymous users' I mean users with some history that is not yet
incorporated in the LF model

Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng

Wow, that's lightning fast.

Is it a SparseMatrix or DenseMatrix?

On 13-07-18 07:23 PM, Gokhan Capan wrote:

I just started to implement a Matrix backed data model and pushed it, to
check the performance and memory considerations.

I believe I can try it on some data tomorrow.

Best

Gokhan


On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng pc...@uowmail.edu.au wrote:


I see, sorry I was too presumptuous. I only recently worked and tested
SVDRecommender, never could have known its efficiency using an item-based
recommender. Maybe there is space for algorithmic optimization.

The online recommender Gokhan is working on is also an SVDRecommender. An
online user-based or item-based recommender based on clustering technique
would definitely be critical, but we need an expert to volunteer :)

Perhaps Dr Dunning can have a few words? He announced the online
clustering component.

Yours Peng


On 13-07-18 03:54 PM, Pat Ferrel wrote:


No it was CPU bound not memory. I gave it something like 14G heap. It was
running, just too slow to be of any real use. We switched to the hadoop
version and stored precalculated recs in a db for every user.

On Jul 18, 2013, at 12:06 PM, Peng Cheng pc...@uowmail.edu.au wrote:

Strange, its just a little bit larger than limibseti dataset (17m
ratings), did you encountered an outOfMemory or GCTimeOut exception?
Allocating more heap space usually help.

Yours Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote:


It was about 2.5M users and 500K items with 25M actions over 6 months of
data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

If I remember right, a highlight of 0.8 release is an online clustering
algorithm. I'm not sure if it can be used in item-based recommender, but
this is definitely I would like to pursue. It's probably the only advantage
a non-hadoop implementation can offer in the future.

Many non-hadoop recommenders are pretty fast. But existing in-memory
GenericDataModel and FileDataModel are largely implemented for sandboxes,
IMHO they are the culprit of scalability problem.

May I ask about the scale of your dataset? how many rating does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:


Well, with itembased the only problem is new items. New users can
immediately be served by the model (although this is not well supported
by
the API in Mahout). For the majority of usecases I saw, it is perfectly
fine to have a short delay until new items enter the recommender,
usually
this happens after a retraining in batch. You have to care for
cold-start
and collect some interactions anyway.


2013/7/18 Pat Ferrel pat.fer...@gmail.com

  Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender not only
factorized. Ted talks about using Solr for this, which we're
experimenting
with alongside Myrrix. I suspect Solr works but it does require a bit
of
tinkering and doesn't have quite the same set of options--no llr
similarity
for instance.

On the same subject I recently attended a workshop in Seattle for
UAI2013
where Walmart reported similar results using a factorized recommender.
They
had to increase the factor number past where it would perform well.
Along
the way they saw increasing performance measuring precision offline.
They
eventually gave up on a factorized solution. This decision seems odd
but
anyway… In the case of Walmart and our data set they are quite
diverse. The
best idea is probably to create different recommenders for separate
parts
of the catalog but if you create one model on all items our intuition
is
that item-based works better than factorized. Again caveat--no A/B
tests to
support this yet.

Doing an online item-based recommender would quickly run into scaling
problems, no? We put together the simple Mahout in-memory version and
it
could not really handle more than a down-sampled few months of our
data.
Down-sampling lost us 20% of our precision scores so we moved to the
hadoop
version. Now we have use-cases for an online recommender that handles
anonymous new users and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org
wrote:

Hi Pat

I think we should provide a simple support for recommending to
anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For itembased recommenders, its
straightforward to compute recommendations, for userbased you have to
search through all users once, for latent factor models, you have to
fold
the user vector into the low dimensional space.

I think Sean already added this method in myrrix and I have some code
for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel pat.fer...@gmail.com

  May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch
mode, then there are periodic updates? How

Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng
I see, OK so we shouldn't use the old implementation. But I mean, the 
old interface doesn't have to be discarded. The discrepancy between your 
FactorizablePreferences and DataModel is that your model supports 
getPreferences(), which returns all preferences as an iterator, while 
DataModel supports a few old functions that return preferences for an 
individual user or item.


My point is that it is not hard for each of them to implement what they 
lack: the old DataModel can implement getPreferences() just by a loop 
in the abstract class, and your new FactorizablePreferences can implement those 
old functions by a binary search that takes O(log n) time, or an 
interpolation search that takes O(log log n) time on average. The same goes for 
the online update. It would just be a matter of different speed and 
space, not of different interface standards; we could use the old unit tests, 
old examples, old everything. And we would be more flexible in writing 
ensemble recommenders.


Just a few thoughts, I'll have to validate the idea first before 
creating a new JIRA ticket.
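
For illustration, a minimal sketch of the "loop in the abstract class" direction, using the existing Taste DataModel API to provide sequential access over all preferences:

```java
// Sketch: sequential iteration over all preferences, derived from the existing
// random-access DataModel API (getUserIDs / getPreferencesFromUser). A memory-efficient
// sequential interface could be satisfied by a default loop like this one.
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;

final class SequentialAccessSketch {
  static void forEachPreference(DataModel model) throws TasteException {
    LongPrimitiveIterator userIDs = model.getUserIDs();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      for (Preference pref : model.getPreferencesFromUser(userID)) {
        // consume (pref.getUserID(), pref.getItemID(), pref.getValue()) here
      }
    }
  }
}
```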


Yours Peng


On 13-07-16 02:51 PM, Sebastian Schelter wrote:

I completely agree, Netflix is less than one gigabye in a smart
representation, 12x more memory is a nogo. The techniques used in
FactorizablePreferences allow a much more memory efficient representation,
tested on KDD Music dataset which is approx 2.5 times Netflix and fits into
3GB with that approach.


2013/7/16 Ted Dunning ted.dunn...@gmail.com


Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100million observations should take  1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote:


The second idea is indeed splendid, we should separate time-complexity
first and space-complexity first implementation. What I'm not quite sure,
is that if we really need to create two interfaces instead of one.
Personally, I think 12G heap space is not that high right? Most new

laptop

can already handle that (emphasis on laptop). And if we replace hash map
(the culprit of high memory consumption) with list/linkedList, it would
simply degrade time complexity for a linear search to O(n), not too bad
either. The current DataModel is a result of careful thoughts and has
underwent extensive test, it is easier to expand on top of it instead of
subverting it.





Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng
Mmm, you are right, the simplest solution is usually the best. I'm 
creating a new JIRA ticket.


Yours Peng

On 13-07-17 04:58 PM, Sebastian Schelter wrote:

Hi Peng,

I never wanted to discard the old interface, I just wanted to split it up.
I want to have a simple interface that only supports sequential access (and
allows for very memory efficient implementions, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what is already does).

Than a recommender such as SGD could state that it only needs sequential
access to the preferences and you can either feed it a DataModel (so we
dont break backwards compatibility) or a memory efficient sequential
access thingy.

Does that make sense for you?


2013/7/17 Peng Cheng pc...@uowmail.edu.au


I see, OK so we shouldn't use the old implementation. But I mean, the old
interface doesn't have to be discarded. The discrepancy between your
FactorizablePreferences and DataModel is that, your model supports
getPreferences(), which returns all preferences as an iterator, and
DataModel supports a few old functions that returns preferences for an
individual user or item.

My point is that, it is not hard for each of them to implement what they
lack of: old DataModel can implement getPreferences() just by a a loop in
abstract class. Your new FactorizablePreferences can implement those old
functions by a binary search that takes O(log n) time, or an interpolation
search that takes O(log log n) time in average. So does the online update.
It will just be a matter of different speed and space, but not different
interface standard, we can use old unit tests, old examples, old
everything. And we will be more flexible in writing ensemble recommender.

Just a few thoughts, I'll have to validate the idea first before creating
a new JIRA ticket.

Yours Peng



On 13-07-16 02:51 PM, Sebastian Schelter wrote:


I completely agree, Netflix is less than one gigabye in a smart
representation, 12x more memory is a nogo. The techniques used in
FactorizablePreferences allow a much more memory efficient representation,
tested on KDD Music dataset which is approx 2.5 times Netflix and fits
into
3GB with that approach.


2013/7/16 Ted Dunning ted.dunn...@gmail.com

  Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100million observations should take  1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
wrote:

  The second idea is indeed splendid, we should separate time-complexity

first and space-complexity first implementation. What I'm not quite
sure,
is that if we really need to create two interfaces instead of one.
Personally, I think 12G heap space is not that high right? Most new


laptop


can already handle that (emphasis on laptop). And if we replace hash map
(the culprit of high memory consumption) with list/linkedList, it would
simply degrade time complexity for a linear search to O(n), not too bad
either. The current DataModel is a result of careful thoughts and has
underwent extensive test, it is easier to expand on top of it instead of
subverting it.








Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng

Awesome! Your reinforcements are highly appreciated.

On 13-07-17 01:29 AM, Abhishek Sharma wrote:

Sorry to interrupt guys, but I just wanted to bring it to your notice that
I am also interested in contributing to this idea. I am planning to
participate in ASF-ICFOSS mentor-ship
programmehttps://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme.
(this is very similar to GSOC)

I do have strong concepts in machine learning (have done the ML course by
Andrew NG on coursera) also, I am good in programming (have 2.5 yrs of work
experience). I am not really sure of how can I approach this problem (but I
do have a strong interest to work on this problem) hence would like to pair
up on this. I am currently working as a research intern at Indian Institute
of Science (IISc), Bangalore India and can put up 15-20 hrs per week.

Please let me know your thoughts if I can be a part of this.

Thanks  Regards,
Abhishek Sharma
http://www.linkedin.com/in/abhi21
https://github.com/abhi21


On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan gkhn...@gmail.com wrote:


Peng,

This is the reason I separated out the DataModel, and only put the learner
stuff there. The learner I mentioned yesterday just stores the
parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
where preferences are stored.

I, kind of, agree with the multi-level DataModel approach:
One for iterating over all preferences, one for if one wants to deploy a
recommender and perform a lot of top-N recommendation tasks.

(Or one DataModel with a strategy that might reduce existing memory
consumption, while still providing fast access, I am not sure. Let me try a
matrix-backed DataModel approach)

Gokhan


On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org
wrote:


I completely agree, Netflix is less than one gigabye in a smart
representation, 12x more memory is a nogo. The techniques used in
FactorizablePreferences allow a much more memory efficient

representation,

tested on KDD Music dataset which is approx 2.5 times Netflix and fits

into

3GB with that approach.


2013/7/16 Ted Dunning ted.dunn...@gmail.com


Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100million observations should take  1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au

wrote:

The second idea is indeed splendid, we should separate

time-complexity

first and space-complexity first implementation. What I'm not quite

sure,

is that if we really need to create two interfaces instead of one.
Personally, I think 12G heap space is not that high right? Most new

laptop

can already handle that (emphasis on laptop). And if we replace hash

map

(the culprit of high memory consumption) with list/linkedList, it

would

simply degrade time complexity for a linear search to O(n), not too

bad

either. The current DataModel is a result of careful thoughts and has
underwent extensive test, it is easier to expand on top of it instead

of

subverting it.








[jira] [Created] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and iteration

2013-07-17 Thread Peng Cheng (JIRA)
Peng Cheng created MAHOUT-1286:
--

 Summary: Memory-efficient DataModel, supporting fast online 
updates and iteration
 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen


Most DataModel implementation in current CF component use hash map to enable 
fast 2d indexing and update. This is not memory-efficient for big data set. 
e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.

Improved implementation of DataModel should use more compact data structure 
(like arrays), this can trade a little of time complexity in 2d indexing for 
vast improvement in memory efficiency. In addition, any online recommender or 
online-to-batch converted recommender will not be affected by this in training 
process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-17 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Summary: Memory-efficient DataModel, supporting fast online updates and 
element-wise iteration  (was: Memory-efficient DataModel, supporting fast 
online updates and iteration)

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng
One thing that suddenly comes to my mind is that, for a simple interface 
like FactorizablePreferences, maybe sequential READ in real time is 
possible, but sequential WRITE in O(1) time is utopian. You would need 
to flush out the old preference with the same user and item ID (in the worst case 
this requires a search, e.g. an interpolation search); otherwise you are permitting 
a user to rate an item twice with different values. Considering how 
FileDataModel is supposed to work (new files flush old files), maybe using 
the simple interface has fewer advantages than we used to believe.
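
For concreteness, here is a sketch of the interpolation search over sorted IDs I have in mind (average O(log log n) on roughly uniformly distributed IDs; illustrative only, not patch code):

```java
// Interpolation search over a sorted long[]: the probe position is estimated from the
// key's value relative to the current range, giving ~O(log log n) average on roughly
// uniformly distributed IDs (worst case O(n)). Illustrative sketch only.
final class InterpolationSearchSketch {
  static int indexOf(long[] sortedIDs, long target) {
    int low = 0;
    int high = sortedIDs.length - 1;
    while (low <= high && target >= sortedIDs[low] && target <= sortedIDs[high]) {
      if (sortedIDs[high] == sortedIDs[low]) {
        break;  // remaining range is constant; avoid division by zero
      }
      int pos = low + (int) ((double) (target - sortedIDs[low])
          / (sortedIDs[high] - sortedIDs[low]) * (high - low));
      if (sortedIDs[pos] == target) {
        return pos;
      } else if (sortedIDs[pos] < target) {
        low = pos + 1;
      } else {
        high = pos - 1;
      }
    }
    return (low <= high && sortedIDs[low] == target) ? low : -1;
  }
}
```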


On 13-07-17 04:58 PM, Sebastian Schelter wrote:

Hi Peng,

I never wanted to discard the old interface, I just wanted to split it up.
I want to have a simple interface that only supports sequential access (and
allows for very memory efficient implementions, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what is already does).

Than a recommender such as SGD could state that it only needs sequential
access to the preferences and you can either feed it a DataModel (so we
dont break backwards compatibility) or a memory efficient sequential
access thingy.

Does that make sense for you?


2013/7/17 Peng Cheng pc...@uowmail.edu.au


I see, OK so we shouldn't use the old implementation. But I mean, the old
interface doesn't have to be discarded. The discrepancy between your
FactorizablePreferences and DataModel is that, your model supports
getPreferences(), which returns all preferences as an iterator, and
DataModel supports a few old functions that returns preferences for an
individual user or item.

My point is that, it is not hard for each of them to implement what they
lack of: old DataModel can implement getPreferences() just by a a loop in
abstract class. Your new FactorizablePreferences can implement those old
functions by a binary search that takes O(log n) time, or an interpolation
search that takes O(log log n) time in average. So does the online update.
It will just be a matter of different speed and space, but not different
interface standard, we can use old unit tests, old examples, old
everything. And we will be more flexible in writing ensemble recommender.

Just a few thoughts, I'll have to validate the idea first before creating
a new JIRA ticket.

Yours Peng



On 13-07-16 02:51 PM, Sebastian Schelter wrote:


I completely agree, Netflix is less than one gigabye in a smart
representation, 12x more memory is a nogo. The techniques used in
FactorizablePreferences allow a much more memory efficient representation,
tested on KDD Music dataset which is approx 2.5 times Netflix and fits
into
3GB with that approach.


2013/7/16 Ted Dunning ted.dunn...@gmail.com

  Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100million observations should take  1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
wrote:

  The second idea is indeed splendid, we should separate time-complexity

first and space-complexity first implementation. What I'm not quite
sure,
is that if we really need to create two interfaces instead of one.
Personally, I think 12G heap space is not that high right? Most new


laptop


can already handle that (emphasis on laptop). And if we replace hash map
(the culprit of high memory consumption) with list/linkedList, it would
simply degrade time complexity for a linear search to O(n), not too bad
either. The current DataModel is a result of careful thoughts and has
underwent extensive test, it is easier to expand on top of it instead of
subverting it.








Re: Regarding Online Recommenders

2013-07-16 Thread Peng Cheng
Yeah, setPreference() and removePreference() shouldn't be there, but 
injecting the Recommender back into the DataModel is kind of a strong dependency, 
which may intermingle components with different concerns. Maybe we can do 
something to the RefreshHelper class? E.g. push something into a swap field 
so the downstream of a refreshable chain can read it out. I have read 
Gokhan's UpdateAwareDataModel, and feel that it's probably too 
heavyweight for a model selector, as every time he changes the algorithm 
he has to re-register it.


The second idea is indeed splendid; we should separate time-complexity-first 
and space-complexity-first implementations. What I'm not quite 
sure about is whether we really need to create two interfaces instead of one. 
Personally, I think 12G heap space is not that high, right? Most new 
laptops can already handle that (emphasis on laptop). And if we replace the 
hash map (the culprit of high memory consumption) with a list/linkedList, it 
would simply degrade the time complexity of a lookup to a linear search, O(n), not 
too bad either. The current DataModel is a result of careful thought and 
has undergone extensive testing; it is easier to expand on top of it 
instead of subverting it.


All the best,
Yours Peng

On 13-07-16 01:05 AM, Sebastian Schelter wrote:

Hi Gokhan,

I like your proposals and I think this is an important discussion. Peng
is also interested in working on online recommenders, so we should try
to team up our efforts. I'd like to extend the discussion a little to
related API changes, that I think are necessary.

What do you think about completely removing the setPreference() and
removePreference() methods from Recommender? I think they don't belong
there for two reasons: First,  they duplicate functionality from
DataModel and second, a lot of recommenders are read-only/train-once and
cannot handle single preference updates anyway.

I think we should have a DataModel implementation that can be updated
and an online learning recommender should be able to register to be
notified with updates.

We should further more split up the DataModel interface into a hierarchy
of three parts:

First, a simple readonly interface that allows sequential access to the
data (similar to FactorizablePreferences). This allows us to create
memory efficient implementations. E.g. Cheng reported in MAHOUT-1272
that the current DataModel needs 12GB heap for the Netflix dataset (100M
ratings) which is unacceptable. I was able to fit the KDD Music dataset
(250M ratings) into 3GB with FactorizablePreferences.

The second interface would extend the readonly interface and should
resemble what DataModel is today: An easy-to-use in-memory
implementation that trades high memory consumption for convenient random
access.

And finally the third interface would extend the second and provide
tooling for online updates of the data.

What do you think of that? Does it sound reasonable?

--sebastian



The DataModel I imagine would follow the current API, where underlying
preference storage is replaced with a matrix.

A Recommender would then use the DataModel and the OnlineLearner, where
Recommender#setPreference is delegated to DataModel#setPreference (like it
does now), and DataModel#setPreference triggers OnlineLearner#train.
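
To visualize the three-level split Sebastian describes above, here is a purely hypothetical sketch (interface and method names are illustrative, not an agreed API):

```java
// Hypothetical sketch of the proposed hierarchy: sequential read-only access at the
// bottom, today's random-access DataModel style in the middle, online updates on top.
// Names are illustrative only.
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.Preference;

interface SequentialPreferences {
  Iterable<Preference> getPreferences() throws TasteException;  // one pass over all ratings
  int getNumUsers() throws TasteException;
  int getNumItems() throws TasteException;
}

interface RandomAccessPreferences extends SequentialPreferences {
  // roughly what org.apache.mahout.cf.taste.model.DataModel offers today
  Iterable<Preference> getPreferencesFromUser(long userID) throws TasteException;
  Iterable<Preference> getPreferencesForItem(long itemID) throws TasteException;
}

interface UpdatablePreferences extends RandomAccessPreferences {
  void setPreference(long userID, long itemID, float value) throws TasteException;
  void removePreference(long userID, long itemID) throws TasteException;
}
```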









[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707830#comment-13707830
 ] 

Peng Cheng commented on MAHOUT-1272:


Test on the libimseti dataset (http://www.occamslab.com/petricek/data/); libimseti 
is a Czech dating website.
This dataset has been used in a live example described in the book 'Mahout in 
Action', page 71, written by a few guys hanging around this site.

parameters:
  private final static double lambda = 0.1;
  private final static int rank = 16;
  
  private static int numALSIterations=5;
  private static int numEpochs=20;

  double randomNoise=0.02;
  double learningRate=0.01;
  double learningDecayRate=1;

result (using average absolute difference, the rating is based on a 1-10 scale):

INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 
time spent: 41.24s=== (should be noted the number of ALS 
iteration is much smaller than others, which leads to suboptimal result, but 
this is not the point of this test)
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 
time spent: 118.188s===
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
1.2798905733917445 time spent: 21.806s

This is already the best result I can get; the original book claims a best 
result of 1.12 on this dataset, which I never achieved. If you have also 
experimented and found a better parameter set, please post it here.
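
For anyone who wants to reproduce the setup, here is a rough sketch of how such an evaluation run is wired up. The attached *SVDRecomenderEvaluatorRunner.java files are the authoritative versions; the data file path below is a placeholder, and the SGD runs are built the same way by swapping in the respective factorizer from the patch.

```java
// Rough shape of the evaluator run behind the ALSWR number above; not the attached runner.
import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class LibimsetiEvalSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));  // placeholder path
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

    RecommenderBuilder alswr = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        // rank = 16, lambda = 0.1, 5 ALS iterations, as listed above
        return new SVDRecommender(dataModel, new ALSWRFactorizer(dataModel, 16, 0.1, 5));
      }
    };

    // train on 90% of each user's ratings, evaluate on the rest, using all users
    double score = evaluator.evaluate(alswr, null, model, 0.9, 1.0);
    System.out.println("ALSWRFactorizer average absolute difference: " + score);
  }
}
```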


 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: libimsetiSVDRecomenderEvaluatorRunner.java

Here is the component for testing on the libimseti dataset.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
 ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707830#comment-13707830
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/13/13 8:57 PM:
-

Test on the libimseti dataset (http://www.occamslab.com/petricek/data/); libimseti 
is a Czech dating website.
This dataset has been used in a live example described in the book 'Mahout in 
Action', page 71, written by a few guys hanging around this site.

parameters:
  private final static double lambda = 0.1;
  private final static int rank = 16;
  
  private static int numALSIterations=5;
  private static int numEpochs=20;

(for ratingSGD)
  double randomNoise=0.02;
  double learningRate=0.01;
  double learningDecayRate=1;

(for parallelSGD)
  double mu0=1;
  double decayFactor=1;
  int stepOffset=100;
  double forgettingExponent=-1;
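
(For context, these four parallelSGD knobs describe a step-size decay schedule. The 
sketch below is my reading of the parameter names, not necessarily the exact formula in 
the committed factorizer; with decayFactor = 1 and forgettingExponent = -1 it reduces to 
mu0 / (stepOffset + step).)

  // Assumed schedule: per-epoch multiplicative decay times (stepOffset + step)^forgettingExponent.
  static double learningRate(double mu0, double decayFactor, int stepOffset,
                             double forgettingExponent, int epoch, long step) {
    return mu0
        * Math.pow(decayFactor, epoch - 1)
        * Math.pow(stepOffset + step, forgettingExponent);
  }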

result (using average absolute difference, the rating is based on a 1-10 scale):

INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 
time spent: 41.24s=== (it should be noted that the number of ALS 
iterations is much smaller than for the others, which leads to a suboptimal result, but 
this is not the point of this test)
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 
time spent: 118.188s===
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
1.2798905733917445 time spent: 21.806s

This is already the best result I can get; the original book claims a best 
result of 1.12 on this dataset, which I have never achieved. If you have also 
experimented and found a better parameter set, please post it here.

  was (Author: peng):
Test on libimseti dataset (http://www.occamslab.com/petricek/data/), 
libimseti is a czech dating website.
This dataset has been used in a live example described in book 'Mahout in 
Action', page 71, written by a few guys hanging around this site.

parameters:
  private final static double lambda = 0.1;
  private final static int rank = 16;
  
  private static int numALSIterations=5;
  private static int numEpochs=20;

  double randomNoise=0.02;
  double learningRate=0.01;
  double learningDecayRate=1;

result (using average absolute difference, the rating is based on a 1-10 scale):

INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 
time spent: 41.24s=== (should be noted the number of ALS 
iteration is much smaller than others, which leads to suboptimal result, but 
this is not the point of this test)
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 
time spent: 118.188s===
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
1.2798905733917445 time spent: 21.806s

This is already the best result I can get, the original book claims a best 
result of 1.12 on this dataset, which I never achieve. If you have also 
experimented and find a better parameter set, please post here.

  
 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
 ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent

[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: NetflixRecomenderEvaluatorRunner.java

Runnable component for testing ParallelSGDFactorizer on the Netflix training 
dataset (yeah, only the trainingSet generated by NetflixDatasetConverter; I 
cannot get judging.txt for validation, but my purpose is just to test its 
efficiency at extreme scale, so whatever).

Warning! To run it safely you need to allocate at least 12G of heap 
space to the JVM by using the following VM parameters:

-Xms12288M -Xmx12288M

In addition, 16G+ RAM is MANDATORY, otherwise either garbage collection or swap 
will kill you (or both). I almost burned my laptop on this (it has only 8G 
RAM). As a result, I won't be able to post any results before I can get a better 
machine. But since its number of ratings is about 6 times that of the 
movielens-10m or libimseti datasets, and SGD scales linearly in this number, I 
estimate the running time to be between 2.5 and 3 minutes.

I will be most obliged to anybody who can try it and post the result here (of 
course, if your machine can handle it). But obviously, as Sebastian has pointed 
out, our FileDataModel needs some serious optimization to handle such scale.

Hey Sebastian, can you try this out in your lab? That would be most helpful.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
 NetflixRecomenderEvaluatorRunner.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizer.java, ParallelSGDFactorizerTest.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707065#comment-13707065
 ] 

Peng Cheng commented on MAHOUT-1274:


Main component finished. The new factorizer and recommender can support adding 
new users and items, and can update user/item vectors in a single GD step (this is 
very suboptimal, but I will improve this part very soon).

But I don't know how to test it: the sandbox GenericDataModel doesn't support 
setPreference(...) and removePreference(...) yet (SlopeOneRecommenderTest 
doesn't test this part either). Could someone tell me if there is an 
alternative that avoids this problem?

As Sebastian has foretold, now is not the best time for adding support for an 
online recommender: the SlopeOneRecommender is half-dead, many dependencies are 
incomplete, and everybody's attention is drawn to the core-0.8 release. 
Regardless, I'll try to solve it myself and spend some time on other tickets.
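
To make the single-GD-step update concrete, here is a rough sketch of what 
setPreference(userID, itemID, value) could trigger against the in-memory factorization; 
the method and parameter names below are illustrative, not the actual Mahout API.

// Illustrative only: apply one SGD step to the affected rows when a new
// preference (u, i, r) arrives. userFeatures/itemFeatures are the factor
// matrices produced by the factorizer; mu is the learning rate, lambda the
// regularization weight.
void onSetPreference(int u, int i, double r,
                     double[][] userFeatures, double[][] itemFeatures,
                     double mu, double lambda) {
  double[] pu = userFeatures[u];
  double[] qi = itemFeatures[i];
  double estimate = 0.0;
  for (int k = 0; k < pu.length; k++) {
    estimate += pu[k] * qi[k];
  }
  double err = r - estimate;
  for (int k = 0; k < pu.length; k++) {
    double puk = pu[k];
    pu[k] += mu * (err * qi[k] - lambda * puk);  // user-factor update
    qi[k] += mu * (err * puk - lambda * qi[k]);  // item-factor update
  }
}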

 SGD-based Online SVD recommender
 

 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, features, machine_learning, svd
   Original Estimate: 336h
  Remaining Estimate: 336h

 an online SVD recommender is otherwise similar to an offline SVD recommender 
 except that, upon receiving one or several new recommendations, it can add 
 them into the training dataModel and update the result accordingly in real 
 time.
 an online SVD recommender should override setPreference(...) and 
 removePreference(...) in AbstractRecommender such that the factorization 
 result is updated in O(1) time and without retraining.
 Right now the slopeOneRecommender is the only component possessing such 
 capability.
 Since SGD is intrinsically an online algorithm and its CF implementation is 
 available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a 
 good time to convert it. Such feature could come in handy for some websites.
 Implementation: Adding new users, items, or increasing rating matrix rank are 
 just increasing size of user and item matrices. Reducing rating matrix rank 
 involves just one svd. The real challenge here is that sgd is NO ONE-PASS 
 algorithm, multiple passes are required to achieve an acceptable optimality 
 and even more so if hyperparameters are bad. But here are two possible 
 circumvents:
 1. Use one-pass algorithms like averaged-SGD, not sure if it can ever work as 
 applying stochastic convex-opt algorithm to non-convex problem is anarchy. 
 But it may be a long shot.
 2. Run incomplete passes in each online update using ratings randomly sampled 
 (but not uniformly sampled) from latest dataModel. I don't know how exactly 
 this should be done but new rating should be sampled more frequently. Uniform 
 sampling will results in old ratings being used more than new ratings in 
 total. If somebody has worked on this batch-to-online conversion before and 
 share his insight that would be awesome. This seems to be the most viable 
 option, if I get the non-uniform pseudorandom generator that maintains a 
 cumulative uniform distribution I want.
 I found a very old ticket (MAHOUT-572) mentioning online SVD recommender but 
 it didn't pay off. Hopefully its not a bad idea to submit a new ticket here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707104#comment-13707104
 ] 

Peng Cheng commented on MAHOUT-1274:


Totally agree. I don't know about other DataModels, but the current GenericDataModel 
uses two maps of PreferenceArray, which is counterintuitive. I thought it could be 
a double FastByIDMap that allows O(1) random access, but I must have missed some 
other requirements.

Haven't read FactorizablePreferences yet, thanks a lot for your advice.
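
For what it's worth, the 'double FastByIDMap' idea could look roughly like the sketch 
below; this is only an illustration of the data structure, not a proposal for the actual 
GenericDataModel internals.

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;

// Sketch: nested FastByIDMaps keyed by userID then itemID give O(1) random access
// to a single rating, which is what an online setPreference/removePreference needs.
final class InMemoryRatings {
  private final FastByIDMap<FastByIDMap<Float>> ratings = new FastByIDMap<FastByIDMap<Float>>();

  void set(long userID, long itemID, float value) {
    FastByIDMap<Float> row = ratings.get(userID);
    if (row == null) {
      row = new FastByIDMap<Float>();
      ratings.put(userID, row);
    }
    row.put(itemID, value);
  }

  Float get(long userID, long itemID) {
    FastByIDMap<Float> row = ratings.get(userID);
    return row == null ? null : row.get(itemID);
  }

  void remove(long userID, long itemID) {
    FastByIDMap<Float> row = ratings.get(userID);
    if (row != null) {
      row.remove(itemID);
    }
  }
}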

 SGD-based Online SVD recommender
 

 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, features, machine_learning, svd
   Original Estimate: 336h
  Remaining Estimate: 336h

 an online SVD recommender is otherwise similar to an offline SVD recommender 
 except that, upon receiving one or several new recommendations, it can add 
 them into the training dataModel and update the result accordingly in real 
 time.
 an online SVD recommender should override setPreference(...) and 
 removePreference(...) in AbstractRecommender such that the factorization 
 result is updated in O(1) time and without retraining.
 Right now the slopeOneRecommender is the only component possessing such 
 capability.
 Since SGD is intrinsically an online algorithm and its CF implementation is 
 available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a 
 good time to convert it. Such feature could come in handy for some websites.
 Implementation: Adding new users, items, or increasing rating matrix rank are 
 just increasing size of user and item matrices. Reducing rating matrix rank 
 involves just one svd. The real challenge here is that sgd is NO ONE-PASS 
 algorithm, multiple passes are required to achieve an acceptable optimality 
 and even more so if hyperparameters are bad. But here are two possible 
 circumvents:
 1. Use one-pass algorithms like averaged-SGD, not sure if it can ever work as 
 applying stochastic convex-opt algorithm to non-convex problem is anarchy. 
 But it may be a long shot.
 2. Run incomplete passes in each online update using ratings randomly sampled 
 (but not uniformly sampled) from latest dataModel. I don't know how exactly 
 this should be done but new rating should be sampled more frequently. Uniform 
 sampling will results in old ratings being used more than new ratings in 
 total. If somebody has worked on this batch-to-online conversion before and 
 share his insight that would be awesome. This seems to be the most viable 
 option, if I get the non-uniform pseudorandom generator that maintains a 
 cumulative uniform distribution I want.
 I found a very old ticket (MAHOUT-572) mentioning online SVD recommender but 
 it didn't pay off. Hopefully its not a bad idea to submit a new ticket here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: (Bi-)Weekly/Monthly Dev Sessions

2013-07-09 Thread Peng Cheng
Sorry I missed the meeting; I really wanted to listen to your discussion, 
but yesterday a thunderstorm cut off my electricity.


On 13-07-08 08:29 PM, Andrew Musselman wrote:

I'm getting an error when I build after doing svn up:

$ mvn package
[INFO] Scanning for projects...
[ERROR] The build could not read 1 project - [Help 1]
[ERROR]
[ERROR]   The project  (/home/akm/mahout/pom.xml) has 1 error
[ERROR] Non-readable POM /home/akm/mahout/pom.xml: no more data
available - expected end tag /project to close start tag project from
line 2, parser stopped on END_TAG seen .../reporting\n/project\n...
@1030:1

But there's a /project tag at the end of that..


On Mon, Jul 8, 2013 at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote:


Hmm, seems like that old link doesn't work.  Here's a new one:
https://plus.google.com/hangouts/_/899b63ca1b3864c749886348cdddfcd80d00bb0b?hl=en

-Grant

On Jul 7, 2013, at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote:


How about tomorrow (Monday) night at 8:30 pm EDT?

Anyone who wants to join, can browse to

https://plus.google.com/hangouts/_/1aa32da8d1f9b1669cf6b5ec8bce123d12aec409?hl=en
 If for some reason that doesn't work, ping me on IRC (gsingers) in the
#mahout channel on Freenode.


Agenda:

0.8 Release Testing

-Grant


On Jun 25, 2013, at 6:17 PM, Suneel Marthi suneel_mar...@yahoo.com

wrote:

Is today's Hangout happening?



On Wed, Jun 12, 2013 at 4:26 AM, Grant Ingersoll gsing...@apache.org

wrote:
Hi,

One of the things we kicked around at Buzzwords was having a
weekly/bi-weekly/monthly dev session via Google hangout (Drill does

this

with good success, I believe).  Since we are so spread out, I thought

I

would throw out a Doodle (scheduling tool for those unfamiliar) to see

what

times work best for the majority of people interested in such a thing.
   Anyone is free to participate, but this is not a Q and A session,

but is

instead focused on writing code, fixing bugs, triaging JIRA,

releasing,

etc.

If you are interested, please fill out

http://doodle.com/gatxxkm7f25fq5y8 (note, all times are Eastern Time

Zone

since I did the poll!)  I just

grabbed a sampling of hours throughout the day.  I also picked 1 week

as

being representative of this being on a repeating schedule.  If none

of

the

times work for you, but you are still interested, please respond

here.  I

would imagine we would meet for 1-2 hours.

Also, please reply with the frequency at which you would like to meet:

[]  Weekly
[]  Bi-weekly (every 2 weeks)
[]  Monthly

My vote is every two weeks.

-Grant




--
Thanks,
Pradeep


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Grant Ingersoll | @gsingers
http://www.lucidworks.com











Re: 0.8 progress

2013-07-08 Thread Peng Cheng

Hi Sebastian,

I'm sorry for the entirely noobish questions: where can I download the 
judging.txt ground truth set? (Netflix is pulling it off everywhere; so 
far I can only get the legacy trainingSet and qualifying.txt.)
And how do I inject the ParallelAlsFactorizationJob into a common 
recommender class?
I was trying to reproduce your result (I own a small cluster), but I don't 
even know where to start. The only related thing I found in 
mahout-examples is a format converter.


Thanks a lot if you can give me a hint.

- Yours Peng

On 13-07-01 01:24 AM, Sebastian Schelter wrote:

I successfully ran the ALS and cooccurrence-based recommenders on the
Netflix dataset on a 26 machine cluster using Hadoop 1.0.4.

--sebastian


On 28.06.2013 21:31, Jake Mannix wrote:

I can run LDA on Twitter's cluster, on both reuters and some real data,
as well as LR/SGD.


On Fri, Jun 28, 2013 at 11:51 AM, Grant Ingersoll gsing...@apache.orgwrote:


We really should setup a VM that we can run a couple of nodes (perhaps at
ASF?) on that we can share w/ everyone that makes it easy to test our stuff
on Hadoop for the specific version that we ship.

On Jun 28, 2013, at 2:41 PM, Robin Anil robin.a...@gmail.com wrote:


Can someone (if you have time and experience). Write a small shim to run
all examples one after the other on a cluster and write up instructions

on

how to do it.?

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Fri, Jun 28, 2013 at 1:11 PM, Sebastian Schelter s...@apache.org

wrote:

Its crucial that we retest everything on a real cluster before the

release.

I will do this for the recommenders code next week.

--sebastian
Am 28.06.2013 14:03 schrieb Grant Ingersoll gsing...@apache.org:


I should have time next week to do the release, if we can get these
knocked out.  If not next week, the following.

On Jun 28, 2013, at 5:46 AM, Suneel Marthi suneel_mar...@yahoo.com
wrote:


1. Could someone look at Mahout-1257? There is a patch that's been

submitted but I am not sure if this has been superseded by Sean's

against

Mahout-1239.

2. Stevo, I am for fixing the findbugs excludes as part of 0.8

release,

I see that the number of warnings has gone up over the last few builds.

3. I am more concerned about the cause of the mysterious cosmic rays

that randomly fail unit tests (since we have moved to running parallel
tests).  I see that happening on my local repository too.





From: Stevo Slavić ssla...@gmail.com
To: dev@mahout.apache.org
Sent: Friday, June 28, 2013 3:21 AM
Subject: Re: 0.8 progress


Well done team!

Build is unstable, oscillates, IMO regardless of changes made. Judging

from

logs I suspect that some of the Jenkins nodes are not configured well,

/tmp

directory security related issues, and file size constraints. Could be

also

issue with our tests.

Javadoc was reported earlier not to be OK (not all modules in

aggregated

javadoc), and code quality reports are not working OK, e.g. findbugs
doesn't respect excludes - plan to work on this during weekend.

Do we want to fix these before or after 0.8 release?

Kind regards,
Stevo Slavić.


On Fri, Jun 28, 2013 at 12:32 AM, Robin Anil robin.a...@gmail.com

wrote:

All Done

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Jun 23, 2013 at 11:36 PM, Robin Anil robin.a...@gmail.com

wrote:

I sent the comments. The code is good. But without the matrix/vector

input

we cant ship it in the release. Hope Yiqun and Da Zhang can make

those

changes quickly.


Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Jun 23, 2013 at 8:46 PM, Grant Ingersoll 

gsing...@apache.org

wrote:


I see 1 issue left: MAHOUT-1214.  It is assigned to Robin.  Any

chance

we

can finish this up this week?

-Grant

On Jun 23, 2013, at 9:26 AM, Suneel Marthi 

suneel_mar...@yahoo.com

wrote:


Finally got to finishing up M-833, the changes can be reviewed at

https://reviews.apache.org/r/11774/diff/3/.






From: Grant Ingersoll gsing...@apache.org
To: dev@mahout.apache.org
Sent: Tuesday, June 11, 2013 10:09 AM
Subject: Re: 0.8 progress


I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by

Thursday, I can roll an RC on Thursday.

-Grant

On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org

wrote:

Down to 4 issues!  I would say what they are, but JIRA is flaking

out

again.

My instinct is that 1030 and 1233 can be pushed.  Suneel has been

working hard to get M-833 in.  Not sure on M-1214, Robin?

-G

On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org

wrote:

On Jun 9, 2013, at 6:02 PM, Grant Ingersoll 

gsing...@apache.org

wrote:

M-1067 -- Dmitriy  --  This is an enhancement, should we push?

Looks like this was committed already.





Grant Ingersoll | @gsingers
http://www.lucidworks.com


Grant Ingersoll


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-08 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702175#comment-13702175
 ] 

Peng Cheng commented on MAHOUT-1272:


Hey Sebastian, Hudson, thank you so much for pushing things that hard. I owe 
you one.
Testing on the Netflix dataset has run into some trouble, namely, I don't know 
where to download it :-. Great appreciation for anyone who can share their 
judging.txt. In the meantime I'll try more GroupLens data.
Since Sebastian has taken over the code, new test cases will only be posted as 
code snippets.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-08 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702175#comment-13702175
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/8/13 6:06 PM:


Hey Sebastian, Hudson, thank you so much for pushing things that hard. I owe 
you one.
I'll test more GroupLens data. Since Sebastian has taken over the code, new 
test cases will only be posted as code snippets.

  was (Author: peng):
Hey Sebastian, Hudson, Thank you so much for on pushing things that hard. I 
own you this.
testing on netflix dataset has encountered some trouble, namely, I don't know 
where to download it :-. Great appreciation for anyone who can share his 
judging.txt. In the mean time I'll try more grouplens data.
Since Sebastian has taken over the code, new test cases will only be posted as 
code snippets.
  
 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701672#comment-13701672
 ] 

Peng Cheng commented on MAHOUT-1272:


Hey honoured contributors, I've got some crude test results for the new parallel 
SGD factorizer for CF:

1. parameters:
lambda = 1e-10
rank of the rating matrix/number of features of each user/item vectors = 50
number of biases: 3 (average rating + user bias + item bias)
number of iterations/epochs = 2 (for all factorizers including ALSWR, 
ratingSGD and the proposed parallelSGD)
initial mu/learning rate = 0.01 (for ratingSGD and proposed parallelSGD)
decay rate of mu = 1 (does not decay) (for ratingSGD and proposed 
parallelSGD)
other parameters are set to default.

2. result on movielens-10m (I don't know what the hell happened to ALSWR, the 
default hyperparameters must have screwed things up real bad, but my point is the speed edge):
  a. RMSE

Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ALSWRFactorizer: 3.7709163950800665E21 
time spent: 6.179s===
Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8847393972529887 time spent: 6.179s===
Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8805947464818478 time spent: 3.084s

  b. Absolute Average

INFO: ==Recommender With ALSWRFactorizer: 1.2085420449917682E19 
time spent: 7.444s===
Jul 07, 2013 5:22:39 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.675685274206 time spent: 7.444s===
Jul 07, 2013 5:22:39 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.6775774766740665 time spent: 2.365s

3. result on movielens-1m (on average SGD works worse on it compared to 
movielens-10m; perhaps I could use more iterations/epochs)

  a. RMSE

Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ALSWRFactorizer: 1.3514189134383086E20 
time spent: 0.637s===
Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.9312989913558529 time spent: 0.637s===
Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.9529995632658007 time spent: 0.305s

  b. Absolute Average

Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ALSWRFactorizer: 
1.58934499216789965E18 time spent: 0.626s===
Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.7459565635961599 time spent: 0.626s===
Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.7420818642753416 time spent: 0.297s

Great thanks to Sebastian for his guidance. I'll upload the EvaluatorRunner 
class as a mahout-examples component and the formatted code shortly.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: mahout.patch, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http

[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: ParallelSGDFactorizerTest.java
ParallelSGDFactorizer.java
GroupLensSVDRecomenderEvaluatorRunner.java

My laptop is an HP Pavilion with an Intel® Core™ i7-3610QM CPU @ 2.30GHz × 8 and 8G 
of memory.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701679#comment-13701679
 ] 

Peng Cheng commented on MAHOUT-1272:


Hi Sebastian, may I ask a question? I dug up some old posts and found that the best 
result should be RMSE ~= 0.85; do you know the parameters that were used?

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701682#comment-13701682
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/7/13 10:21 PM:
-

New parameters:
lambda = 0.001
rank of the rating matrix/number of features of each user/item vector = 5
number of iterations/epochs = 20

result on movielens-10m; all evaluations use RMSE:
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8115207244832938 time spent: 8.747s

This is fast and accurate enough; I'm advancing to the Netflix prize dataset.

  was (Author: peng):
New parameter:
lambda = 0.001
rank of the rating matrix/number of features of each user/item vectors = 5
number of iterations/epochs = 20

result on movielens-10m:
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8115207244832938 time spent: 8.747s

This is fast and accurate enough, I'm advancing to netflix prize dataset.
  
 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701682#comment-13701682
 ] 

Peng Cheng commented on MAHOUT-1272:


New parameters:
lambda = 0.001
rank of the rating matrix/number of features of each user/item vector = 5
number of iterations/epochs = 20

result on movielens-10m:
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8115207244832938 time spent: 8.747s

This is fast and accurate enough; I'm advancing to the Netflix prize dataset.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701688#comment-13701688
 ] 

Peng Cheng commented on MAHOUT-1272:


Hi Sebastian,

Really? I would break my fingers to squeeze into the 0.8 release (not RC1 of 
course, but there is still RC2 :-). A few guys I work with are also pushing me 
for the online recommender, so I can work very hard and undistracted. You just 
tell me what to do next and I'll be thrilled to oblige.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Code Freeze for 0.8

2013-07-07 Thread Peng Cheng

Hi Dr Dunning,

I recently joined the team and am working on tickets 1272 and 1274 right 
now. I was planning to commit to core-0.8 rc2, but the time frame seems 
harsh. Could you tell me if it is practical? I'm a hard worker.


PS I was there at your presentation in Toronto this year. Not ashamed to 
say, it was one of the funniest lectures of my life.


-Yours Peng

On 13-07-07 05:19 PM, Grant Ingersoll wrote:

Working on the release now.  If anyone wants to join in, I'm on IRC as well.

-Grant


On Jul 5, 2013, at 12:40 PM, Sebastian Schelters...@apache.org  wrote:


+1

On 05.07.2013 18:06, Jake Mannix wrote:

+1



On Fri, Jul 5, 2013 at 8:47 AM, Ted Dunningted.dunn...@gmail.com  wrote:


+1


On Fri, Jul 5, 2013 at 7:43 AM, Suneel Marthi suneel_mar...@yahoo.com

wrote:
+1




From: Grant Ingersollgsing...@apache.org
To:dev@mahout.apache.org  dev@mahout.apache.org
Sent: Friday, July 5, 2013 10:36 AM
Subject: Code Freeze for 0.8


I know it's short notice, but I'd like to suggest a code freeze for 0.8
today or tomorrow and I will do a 0.8 RC on Sunday.  Based on JIRA, etc.,
it looks like this should be fine, but let me know if there are any
objections.

Thanks,
Grant





Grant Ingersoll | @gsingers
http://www.lucidworks.com











Re: Code Freeze for 0.8

2013-07-07 Thread Peng Cheng

Hi Dr Dunning,

Thanks a lot. I just read that the deadline is within 7 days and 
immediately realized how ill-conceived my plan was. There will be no rc1 or 
rc2, just an rc.

Will have to cram in a bunch of tests over the next few days.

- Peng

On 07/07/2013 10:12 PM, Ted Dunning wrote:

Peng,

Strictly speaking, the code is frozen already.  Sebastian seems to think
some can get in, but even that is pushing things.


On Sun, Jul 7, 2013 at 3:59 PM, Peng Cheng pc...@uowmail.edu.au wrote:


Hi Dr Dunning,

I recently joined the team and am working on tickets 1272 and 1274 right
now. I was planning to commit to core-0.8 rc2 but the time frame seems
harsh. Could you tell me if is it practical? I'm a hard worker.

PS I was there at your presentation in Toronto this year. Not ashamed to
say, one of the funniest lecture in my life.

-Yours Peng


On 13-07-07 05:19 PM, Grant Ingersoll wrote:


Working on the release now.  If anyone wants to join in, I'm on IRC as
well.

-Grant


On Jul 5, 2013, at 12:40 PM, Sebastian Schelters...@apache.org  wrote:

  +1

On 05.07.2013 18:06, Jake Mannix wrote:


+1



On Fri, Jul 5, 2013 at 8:47 AM, Ted Dunningted.dunn...@gmail.com
  wrote:

  +1


On Fri, Jul 5, 2013 at 7:43 AM, Suneel Marthi suneel_mar...@yahoo.com


wrote:
+1



__**__
From: Grant Ingersollgsing...@apache.org
To:dev@mahout.apache.org  dev@mahout.apache.org
Sent: Friday, July 5, 2013 10:36 AM
Subject: Code Freeze for 0.8


I know it's short notice, but I'd like to suggest a code freeze for
0.8
today or tomorrow and I will do a 0.8 RC on Sunday.  Based on JIRA,
etc.,
it looks like this should be fine, but let me know if there are any
objections.

Thanks,
Grant



  --**--

Grant Ingersoll | @gsingers
http://www.lucidworks.com














[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-06 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701233#comment-13701233
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/6/13 2:43 PM:


Hey, I have finished the class and test for the parallel SGD factorizer for the 
matrix-completion based recommender (not MapReduced, just single-machine 
multi-threaded); it is loosely based on vanilla SGD and hogwild!. I have only 
tested on toy and synthetic data (2000 users * 1000 items) but it is pretty 
fast, 3-5x faster than vanilla SGD with 8 cores (never exceeding 6x; 
apparently the executor induces a high allocation overhead), and definitely 
faster than single-machine ALSWR.

I'm submitting my java files and patch for review.

  was (Author: peng):
Hey I have finished the class and test for parallel sgd factorizer for 
matrix-completion based recommender (not mapreduced, just single machine 
multi-thread), it is loosely based on vanilla sgd and hogwild!. I have only 
tested on toy and synthetic data (2000users * 1000 times) but it is pretty 
fast, 3-5x times faster than vanilla sgd with 8 cores. (never exceed 6x, 
apparently the executor induces high overhead allocation cost) And definitely 
faster than single machine ALSWR. 

I'm submitting my java files and patch for review.
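
Since the factorizer above is described as loosely based on vanilla SGD and hogwild!, 
here is a minimal sketch of the hogwild! idea for readers who haven't seen it: several 
workers run SGD over interleaved slices of the ratings and write to the shared factor 
matrices without locking, accepting occasional lost updates. This is a simplified 
illustration, not the submitted ParallelSGDFactorizer.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal hogwild!-style loop: workers update shared user/item factors without locks.
public final class HogwildSketch {

  public static void run(final int[][] ratings,          // rows of {userIndex, itemIndex, rating}
                         final double[][] userFeatures,  // shared, updated concurrently
                         final double[][] itemFeatures,
                         final double mu, final double lambda,
                         int numThreads, int numEpochs) throws InterruptedException {
    for (int epoch = 0; epoch < numEpochs; epoch++) {
      ExecutorService pool = Executors.newFixedThreadPool(numThreads);
      for (int t = 0; t < numThreads; t++) {
        final int offset = t;
        final int stride = numThreads;
        pool.execute(new Runnable() {
          @Override
          public void run() {
            for (int n = offset; n < ratings.length; n += stride) {
              double[] pu = userFeatures[ratings[n][0]];
              double[] qi = itemFeatures[ratings[n][1]];
              double estimate = 0.0;
              for (int k = 0; k < pu.length; k++) {
                estimate += pu[k] * qi[k];
              }
              double err = ratings[n][2] - estimate;
              for (int k = 0; k < pu.length; k++) {
                double puk = pu[k];
                pu[k] += mu * (err * qi[k] - lambda * puk);  // racy on purpose: no locks
                qi[k] += mu * (err * puk - lambda * qi[k]);
              }
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);
    }
  }
}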
  
 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: mahout.patch, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-06 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1274:
---

Description: 
an online SVD recommender is otherwise similar to an offline SVD recommender 
except that, upon receiving one or several new recommendations, it can add them 
into the training dataModel and update the result accordingly in real time.

an online SVD recommender should override setPreference(...) and 
removePreference(...) in AbstractRecommender such that the factorization result 
is updated in O(1) time and without retraining.

Right now the slopeOneRecommender is the only component possessing such 
capability.

Since SGD is intrinsically an online algorithm and its CF implementation is 
available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a 
good time to convert it. Such feature could come in handy for some websites.

Implementation: Adding new users or items, or increasing the rating matrix rank, 
just increases the size of the user and item matrices. Reducing the rating matrix rank 
involves just one SVD. The real challenge here is that SGD is NOT a ONE-PASS 
algorithm; multiple passes are required to achieve an acceptable optimality, and 
even more so if the hyperparameters are bad. But here are two possible workarounds:

1. Use one-pass algorithms like averaged-SGD. I'm not sure if this can ever work, as 
applying a stochastic convex-opt algorithm to a non-convex problem is anarchy. But 
it may be a long shot.

2. Run incomplete passes in each online update using ratings randomly sampled 
(but not uniformly sampled) from the latest dataModel. I don't know exactly how 
this should be done, but new ratings should be sampled more frequently. Uniform 
sampling will result in old ratings being used more than new ratings in total. 
If somebody has worked on this batch-to-online conversion before and can share their 
insight, that would be awesome. This seems to be the most viable option, if I can 
get the non-uniform pseudorandom generator that maintains the cumulative uniform 
distribution I want.

I found a very old ticket (MAHOUT-572) mentioning an online SVD recommender, but it 
didn't pay off. Hopefully it's not a bad idea to submit a new ticket here.

  was:
an online SVD recommender is otherwise similar to an offline SVD recommender 
except that, upon receiving one or several new recommendations, it can add them 
into the training dataModel and update the result accordingly in real time.

an online SVD recommender should override setPreference(...) and 
removePreference(...) in AbstractRecommender such that the factorization result 
is updated in O(1) time and without retraining.

Right now the slopeOneRecommender is the only component possessing such 
capability.

Since SGD is intrinsically an online algorithm and its CF implementation is 
available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a 
good time to convert it. Such feature could come in handy for some websites.

Implementation: Adding new users, items, or increasing rating matrix rank are 
just increasing size of user and item matrices. Reducing rating matrix rank 
involves just one svd. The real challenge here is that sgd is NO ONE-PASS 
algorithm, multiple passes are required to achieve an acceptable optimality and 
even more so if hyperparameters are bad. But here are two possible circumvents:

1. Use one-pass algorithms like averaged-SGD, not sure if it can ever work as 
applying stochastic convex-opt algorithm to non-convex problem is anarchy. But 
it may be a long shot.

2. Run incomplete passes in each online update using ratings randomly sampled 
(but not uniformly sampled) from latest dataModel. I don't know how exactly 
this should be done but new rating should be sampled more frequently. Uniform 
sampling will results in old ratings being used more than new ratings in total. 
If somebody has worked on this batch-to-online conversion before and share his 
insight that would be awesome. This seems to be the most viable option, if I 
get the non-uniform pseudorandom generator that maintains a cumulative uniform 
distribution I want.

I found a very old ticket (MAHOUT-572) mentioning online SVD recommender but it 
didn't pay off. Hopefully its not a bad idea to submit a new tickets.
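
As a rough illustration of the O(1)-per-rating update described above, 
overriding setPreference could reduce to a few SGD steps that touch only the 
affected user and item vectors. This is a minimal sketch with hypothetical 
names; it is not the Mahout recommender API.

```java
// Hypothetical sketch of an O(1)-per-rating update: a new preference only nudges
// the affected user and item latent vectors with a few SGD steps, no retraining.
// Field and method names are illustrative; this is not the Mahout recommender API.
public class OnlineFoldInSketch {

  private final double[][] userF;   // user latent vectors, one row per user
  private final double[][] itemF;   // item latent vectors, one row per item
  private final double eta = 0.01, lambda = 0.05;
  private final int stepsPerUpdate = 5;

  public OnlineFoldInSketch(double[][] userF, double[][] itemF) {
    this.userF = userF;
    this.itemF = itemF;
  }

  /** Fold a single new rating into the factorization; cost is O(rank), not O(#ratings). */
  public void setPreference(int userId, int itemId, double rating) {
    double[] u = userF[userId];
    double[] v = itemF[itemId];
    for (int s = 0; s < stepsPerUpdate; s++) {
      double pred = 0;
      for (int k = 0; k < u.length; k++) pred += u[k] * v[k];
      double err = rating - pred;
      for (int k = 0; k < u.length; k++) {
        double uk = u[k], vk = v[k];
        u[k] += eta * (err * vk - lambda * uk);
        v[k] += eta * (err * uk - lambda * vk);
      }
    }
  }

  public static void main(String[] args) {
    java.util.Random rnd = new java.util.Random(1);
    double[][] u = new double[10][4], v = new double[20][4];
    for (double[] row : u) for (int k = 0; k < 4; k++) row[k] = 0.1 * rnd.nextGaussian();
    for (double[] row : v) for (int k = 0; k < 4; k++) row[k] = 0.1 * rnd.nextGaussian();
    new OnlineFoldInSketch(u, v).setPreference(3, 7, 4.0);
  }
}
```

Because only two rows of length rank are touched, the cost per incoming 
preference stays constant no matter how many ratings are already in the 
DataModel.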


 SGD-based Online SVD recommender
 

 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, features, machine_learning, svd
   Original Estimate: 336h
  Remaining Estimate: 336h

 an online SVD recommender is otherwise similar to an offline SVD recommender 
 except that, upon receiving one or several new recommendations, it can add 
 them

[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-06 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701380#comment-13701380
 ] 

Peng Cheng commented on MAHOUT-1274:


BTW, may I ask (noobishly) why you have deprecated the SlopeOneRecommender in 
the latest core-0.8 snapshot? I must have missed a lot in previous 
mahout-development emails before I joined, so apologies if it's a stupid question.

 SGD-based Online SVD recommender
 

 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, features, machine_learning, svd
   Original Estimate: 336h
  Remaining Estimate: 336h

 an online SVD recommender is otherwise similar to an offline SVD recommender 
 except that, upon receiving one or several new recommendations, it can add 
 them into the training dataModel and update the result accordingly in real 
 time.
 an online SVD recommender should override setPreference(...) and 
 removePreference(...) in AbstractRecommender such that the factorization 
 result is updated in O(1) time and without retraining.
 Right now the slopeOneRecommender is the only component possessing such 
 capability.
 Since SGD is intrinsically an online algorithm and its CF implementation is 
 available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a 
 good time to convert it. Such feature could come in handy for some websites.
 Implementation: Adding new users, items, or increasing rating matrix rank are 
 just increasing size of user and item matrices. Reducing rating matrix rank 
 involves just one svd. The real challenge here is that sgd is NO ONE-PASS 
 algorithm, multiple passes are required to achieve an acceptable optimality 
 and even more so if hyperparameters are bad. But here are two possible 
 circumvents:
 1. Use one-pass algorithms like averaged-SGD, not sure if it can ever work as 
 applying stochastic convex-opt algorithm to non-convex problem is anarchy. 
 But it may be a long shot.
 2. Run incomplete passes in each online update using ratings randomly sampled 
 (but not uniformly sampled) from latest dataModel. I don't know how exactly 
 this should be done but new rating should be sampled more frequently. Uniform 
 sampling will results in old ratings being used more than new ratings in 
 total. If somebody has worked on this batch-to-online conversion before and 
 share his insight that would be awesome. This seems to be the most viable 
 option, if I get the non-uniform pseudorandom generator that maintains a 
 cumulative uniform distribution I want.
 I found a very old ticket (MAHOUT-572) mentioning online SVD recommender but 
 it didn't pay off. Hopefully its not a bad idea to submit a new ticket here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: [jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-06 Thread Peng Cheng

Hi Sebastian,

Thanks a lot for the help! Do you mean core-1.0 or bundle-1.0? I hope I can 
work hard enough to catch the next release. Also, what do you think 
about the proposed online pseudorandom sampling problem?


I was digging through old threads and found MAHOUT-1069, which already did a lot 
of the work I need right now and used a lot of code optimization 
techniques, but was eventually rejected for being too complex and 
drastic. :-(


I wonder if overengineering is a researcher's most dangerous bane; it has 
happened to a lot of people.
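
For concreteness, here is one way the recency-biased sampling asked about above 
could look, with made-up class and method names (an assumption on my part, not 
an existing or proposed Mahout class): each incoming rating gets a geometrically 
growing weight, and sampling is done against the cumulative weights, so new 
ratings dominate without old ones being frozen out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical recency-biased sampler: items added later carry geometrically
// larger weights, so they are drawn more often, while old items still appear.
public class RecencyBiasedSampler<T> {

  private final List<T> items = new ArrayList<>();
  private final List<Double> cumulative = new ArrayList<>(); // running sum of weights
  private final double growth;     // > 1.0 biases sampling toward recent items
  private final Random rnd = new Random();
  private double nextWeight = 1.0;

  public RecencyBiasedSampler(double growth) { this.growth = growth; }

  public void add(T item) {
    items.add(item);
    double total = cumulative.isEmpty() ? 0.0 : cumulative.get(cumulative.size() - 1);
    cumulative.add(total + nextWeight);
    nextWeight *= growth;
  }

  public T sample() {
    double u = rnd.nextDouble() * cumulative.get(cumulative.size() - 1);
    int lo = 0, hi = cumulative.size() - 1;
    while (lo < hi) {                 // binary search for the first cumulative weight >= u
      int mid = (lo + hi) >>> 1;
      if (cumulative.get(mid) < u) lo = mid + 1; else hi = mid;
    }
    return items.get(lo);
  }

  public static void main(String[] args) {
    RecencyBiasedSampler<Integer> s = new RecencyBiasedSampler<>(1.01);
    for (int i = 0; i < 1000; i++) s.add(i);
    System.out.println("sampled rating index: " + s.sample()); // usually a recent index
  }
}
```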


On 13-07-06 01:31 PM, Sebastian Schelter wrote:

Hi Peng,

We deprecated a lot of algorithms that we found to be not much used, in order to
streamline our codebase for the coming 1.0 release.
On 06.07.2013 10:25, Peng Cheng (JIRA) j...@apache.org wrote:


 [
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701380#comment-13701380]

Peng Cheng commented on MAHOUT-1274:


BTW may I ask (noobishly) that why you have deprecated the
SlopeOneRecommender in the latest core-0.8 snapshot? i must have missed a
lot in previous mahout-development emails before i join so apologies if its
a stupid question.


SGD-based Online SVD recommender


 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, features,

machine_learning, svd

   Original Estimate: 336h
  Remaining Estimate: 336h

an online SVD recommender is otherwise similar to an offline SVD

recommender except that, upon receiving one or several new recommendations,
it can add them into the training dataModel and update the result
accordingly in real time.

an online SVD recommender should override setPreference(...) and

removePreference(...) in AbstractRecommender such that the factorization
result is updated in O(1) time and without retraining.

Right now the slopeOneRecommender is the only component possessing such

capability.

Since SGD is intrinsically an online algorithm and its CF implementation

is available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would
be a good time to convert it. Such feature could come in handy for some
websites.

Implementation: Adding new users, items, or increasing rating matrix

rank are just increasing size of user and item matrices. Reducing rating
matrix rank involves just one svd. The real challenge here is that sgd is
NO ONE-PASS algorithm, multiple passes are required to achieve an
acceptable optimality and even more so if hyperparameters are bad. But here
are two possible circumvents:

1. Use one-pass algorithms like averaged-SGD, not sure if it can ever

work as applying stochastic convex-opt algorithm to non-convex problem is
anarchy. But it may be a long shot.

2. Run incomplete passes in each online update using ratings randomly

sampled (but not uniformly sampled) from latest dataModel. I don't know how
exactly this should be done but new rating should be sampled more
frequently. Uniform sampling will results in old ratings being used more
than new ratings in total. If somebody has worked on this batch-to-online
conversion before and share his insight that would be awesome. This seems
to be the most viable option, if I get the non-uniform pseudorandom
generator that maintains a cumulative uniform distribution I want.

I found a very old ticket (MAHOUT-572) mentioning online SVD recommender

but it didn't pay off. Hopefully its not a bad idea to submit a new ticket
here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA
administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Labels: features patch test  (was: )
Status: Patch Available  (was: Open)

Hey, I have finished the class and test for a parallel SGD factorizer for 
matrix-completion based recommenders (not MapReduce, just single-machine 
multi-threading); it is loosely based on vanilla SGD and Hogwild!. I have only 
tested it on toy and synthetic data (2000 users x 1000 items), but it is pretty 
fast: 3-5x faster than vanilla SGD with 8 cores (it never exceeds 6x; 
apparently the executor induces a high allocation overhead), and definitely 
faster than single-machine ALS-WR. 

I'm submitting my java files and patch for review.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: patch, test, features
   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: ParallelSGDFactorizerTest.java
ParallelSGDFactorizer.java

java file

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: mahout.patch

patch

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: mahout.patch, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701241#comment-13701241
 ] 

Peng Cheng commented on MAHOUT-1272:


The next step would be to create an online version of this (and of the recommender): 
SGD is an online algorithm, but right now it works only in a batch recommender.
In the meantime the only online recommender in Mahout is slope-one, which is kind 
of a shame.
Will create a new JIRA ticket tomorrow.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: mahout.patch, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701247#comment-13701247
 ] 

Peng Cheng commented on MAHOUT-1272:


Aye aye, more tests on the way. Much obliged for the quick suggestion.





 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: mahout.patch, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-01 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13696932#comment-13696932
 ] 

Peng Cheng commented on MAHOUT-1272:


Looks like the 1/n learning rate doesn't work at all on the SGD factorizer; maybe 
the convergence results for stochastic optimization can't be applied to the 
non-convex MF problem. Can someone point me to a paper discussing convergence 
bounds for such a problem? Much appreciated.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-06-29 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13696155#comment-13696155
 ] 

Peng Cheng commented on MAHOUT-1272:


The learning rate/step size is set to be identical to the one in package 
~.classifier.sgd: the old learning rate decays exponentially with a constant 
factor. This setting seems to work only for smooth functions (proved by 
Nesterov?), and I'm not sure it holds in CF. Otherwise, use either 1/sqrt(n) 
for convex f or 1/n for strongly convex f.
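
For reference, the three schedules side by side; the constants below are made 
up for illustration and are not the classifier.sgd defaults.

```java
// Illustrative comparison of the three step-size schedules mentioned above.
public class StepSizeSchedules {

  static double exponentialDecay(long n, double eta0, double decay) {
    return eta0 * Math.pow(decay, n);          // eta_n = eta0 * decay^n, with 0 < decay < 1
  }

  static double inverseSqrt(long n, double eta0) {
    return eta0 / Math.sqrt(n + 1);            // O(1/sqrt(n)): typical for convex objectives
  }

  static double inverse(long n, double eta0) {
    return eta0 / (n + 1);                     // O(1/n): typical for strongly convex objectives
  }

  public static void main(String[] args) {
    for (long n : new long[] {0, 10, 100, 1000, 10_000}) {
      System.out.printf("n=%5d  exp=%.6f  1/sqrt=%.6f  1/n=%.6f%n",
          n, exponentialDecay(n, 0.05, 0.999), inverseSqrt(n, 0.05), inverse(n, 0.05));
    }
  }
}
```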

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1089) SGD matrix factorization for rating prediction with user and item biases

2013-06-28 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13695745#comment-13695745
 ] 

Peng Cheng commented on MAHOUT-1089:


Code is slick! But apparently there is no multi-threading yet.
The proposal for it has been there for a long time:
http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

Is somebody working on its implementation?
Apparently using Hogwild! or vanilla DSGD makes no big difference in performance.

 SGD matrix factorization for rating prediction with user and item biases
 

 Key: MAHOUT-1089
 URL: https://issues.apache.org/jira/browse/MAHOUT-1089
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Zeno Gantner
Assignee: Sebastian Schelter
 Attachments: MAHOUT-1089.patch, RatingSGDFactorizer.java, 
 RatingSGDFactorizer.java


 A matrix factorization that is trained with standard SGD on all features at 
 the same time, in contrast to ExpectationMaximizationFactorizer, which learns 
 feature by feature.
 Additionally to the free features it models a rating bias for each user and 
 item.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

