Re: Regarding Online Recommenders

2013-07-24 Thread Gokhan Capan
Ok, I tested the MatrixBackedDataModel, and the heap size is reduced to 7G
for the Netflix data -- still large.

The same history is encoded in two SparseRowMatrices, one row-indexed by
users and one by items.

It has serious concurrency issues in several places, though (sets and
removes need to be thread-safe).
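A minimal sketch of what such a dual-indexed model could look like (this is my illustration, not the actual MatrixBackedDataModel; it uses ConcurrentHashMap rather than SparseRowMatrix, and the class and method names are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a preference store row-indexed both ways, so that
// user-oriented and item-oriented access are both cheap. Each individual map
// operation is thread-safe, but the pair of writes in set()/remove() is not
// atomic together -- a concurrent reader can observe one index updated and
// not the other, which is exactly the kind of concurrency issue noted above.
public class DualIndexedPreferences {

  private final Map<Long, Map<Long, Float>> byUser = new ConcurrentHashMap<>();
  private final Map<Long, Map<Long, Float>> byItem = new ConcurrentHashMap<>();

  public void set(long userId, long itemId, float rating) {
    byUser.computeIfAbsent(userId, k -> new ConcurrentHashMap<>()).put(itemId, rating);
    byItem.computeIfAbsent(itemId, k -> new ConcurrentHashMap<>()).put(userId, rating);
  }

  public void remove(long userId, long itemId) {
    Map<Long, Float> userRow = byUser.get(userId);
    if (userRow != null) {
      userRow.remove(itemId);
    }
    Map<Long, Float> itemRow = byItem.get(itemId);
    if (itemRow != null) {
      itemRow.remove(userId);
    }
  }

  public Float get(long userId, long itemId) {
    Map<Long, Float> row = byUser.get(userId);
    return row == null ? null : row.get(itemId);
  }

  public static void main(String[] args) {
    DualIndexedPreferences prefs = new DualIndexedPreferences();
    prefs.set(1L, 42L, 5.0f);
    assert prefs.get(1L, 42L) == 5.0f;
    prefs.remove(1L, 42L);
    assert prefs.get(1L, 42L) == null;
    System.out.println("ok");
  }
}
```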

Best

Gokhan


On Sat, Jul 20, 2013 at 12:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

 Hi,

 Just one simple question: Is the
 org.apache.mahout.math.BinarySearch.binarySearch() function an optimized
 version of Arrays.binarySearch()? If it is not, why implement it again?

 Yours Peng


 On 13-07-17 06:31 PM, Sebastian Schelter wrote:

 You are completely right, the simple interface would only be usable for
 readonly / batch-updatable recommenders. Online recommenders might need
 something different. I tried to widen the discussion here to discuss all
 kinds of API changes in the recommenders that would be necessary in the
 future.



 2013/7/17 Peng Cheng pc...@uowmail.edu.au

  One thing that suddenly comes to my mind is that, for a simple interface
 like FactorizablePreferences, maybe sequential READ in real time is
 possible, but sequential WRITE in O(1) time is utopia: you need to flush
 out the old preference with the same user and item ID (in the worst case
 via an interpolation search), otherwise you permit a user rating an item
 twice with different values. Considering how FileDataModel is supposed to
 work (new files flush old files), maybe using the simple interface has
 fewer advantages than we used to believe.


 On 13-07-17 04:58 PM, Sebastian Schelter wrote:

  Hi Peng,

 I never wanted to discard the old interface, I just wanted to split it up.
 I want to have a simple interface that only supports sequential access
 (and allows for very memory-efficient implementations, e.g. by the use of
 primitive arrays). DataModel should *extend* this interface and provide
 sequential and random access (basically what it already does).

 Then a recommender such as SGD could state that it only needs sequential
 access to the preferences, and you can either feed it a DataModel (so we
 don't break backwards compatibility) or a memory-efficient sequential
 access thingy.

 Does that make sense to you?
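The split described above might look something like this (interface and method names are my guesses, not committed Mahout API):

```java
import java.util.Iterator;

// Hypothetical shape of the proposed split: a minimal sequential-access
// contract that memory-efficient, primitive-array-backed implementations can
// satisfy, with the full random-access model layered on top of it.
interface FactorizablePreferences {
  int numPreferences();
  Iterator<long[]> getPreferences(); // {userID, itemID} pairs, sequential only
}

// DataModel would *extend* the sequential contract and add random access.
interface RandomAccessPreferences extends FactorizablePreferences {
  Float getPreferenceValue(long userID, long itemID);
}

// A tiny array-backed implementation of the sequential contract
// (ratings omitted for brevity).
class ArrayPreferences implements FactorizablePreferences {
  private final long[] userIDs;
  private final long[] itemIDs;

  ArrayPreferences(long[] userIDs, long[] itemIDs) {
    this.userIDs = userIDs;
    this.itemIDs = itemIDs;
  }

  public int numPreferences() {
    return userIDs.length;
  }

  public Iterator<long[]> getPreferences() {
    return new Iterator<long[]>() {
      private int next = 0;
      public boolean hasNext() { return next < userIDs.length; }
      public long[] next() { return new long[] {userIDs[next], itemIDs[next++]}; }
    };
  }
}

public class InterfaceSplitDemo {
  public static void main(String[] args) {
    FactorizablePreferences prefs =
        new ArrayPreferences(new long[] {1, 1, 2}, new long[] {10, 11, 10});
    int count = 0;
    for (Iterator<long[]> it = prefs.getPreferences(); it.hasNext(); it.next()) {
      count++;
    }
    assert count == prefs.numPreferences();
    System.out.println("ok");
  }
}
```

An SGD recommender would accept the narrow `FactorizablePreferences` type, so either a full DataModel or the array-backed store can be passed in.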


 2013/7/17 Peng Cheng pc...@uowmail.edu.au

   I see, OK, so we shouldn't use the old implementation. But I mean, the old
 interface doesn't have to be discarded. The discrepancy between your
 FactorizablePreferences and DataModel is that your model supports
 getPreferences(), which returns all preferences as an iterator, while
 DataModel supports a few old functions that return preferences for an
 individual user or item.

 My point is that it is not hard for each of them to implement what the
 other lacks: the old DataModel can implement getPreferences() just by a
 loop in an abstract class. Your new FactorizablePreferences can implement
 those old functions by a binary search that takes O(log n) time, or an
 interpolation search that takes O(log log n) time on average. The same
 goes for the online update. It will just be a matter of different speed
 and space, not a different interface standard; we can use old unit tests,
 old examples, old everything. And we will be more flexible in writing
 ensemble recommenders.
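The O(log n) fallback mentioned above can be sketched with plain Arrays.binarySearch over a user's item IDs kept sorted (illustrative data, not Mahout internals):

```java
import java.util.Arrays;

// Sketch: if a user's preferences are stored in parallel arrays sorted by
// item ID, single-preference random access costs O(log n) on top of purely
// sequential storage.
public class RandomAccessOverSequential {
  public static void main(String[] args) {
    long[] sortedItemIDs = {3L, 7L, 12L, 40L};
    float[] ratings = {4.0f, 2.5f, 5.0f, 3.0f};

    // Hit: binarySearch returns the index of the stored preference.
    int idx = Arrays.binarySearch(sortedItemIDs, 12L);
    assert idx == 2 && ratings[idx] == 5.0f;

    // Miss: returns -(insertionPoint) - 1, i.e. the slot where an online
    // update would have to insert (or flush out) a preference.
    assert Arrays.binarySearch(sortedItemIDs, 10L) == -3;
    System.out.println("ok");
  }
}
```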

 Just a few thoughts, I'll have to validate the idea first before
 creating
 a new JIRA ticket.

 Yours Peng



 On 13-07-16 02:51 PM, Sebastian Schelter wrote:

   I completely agree, Netflix is less than one gigabyte in a smart
 representation; 12x more memory is a no-go. The techniques used in
 FactorizablePreferences allow a much more memory-efficient representation,
 tested on the KDD Music dataset, which is approx 2.5 times Netflix and
 fits into 3GB with that approach.


 2013/7/16 Ted Dunning ted.dunn...@gmail.com

Netflix is a small dataset.  12G for that seems quite excessive.

  Note also that this is before you have done any work.

 Ideally, 100 million observations should take < 1GB.
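As a rough sanity check of that target (my arithmetic, not from the thread), parallel primitive arrays already land near it:

```java
// Back-of-envelope memory estimate for 100 million observations held in
// parallel primitive arrays -- the kind of representation being discussed,
// not any particular Mahout class.
public class MemoryEstimate {
  public static void main(String[] args) {
    long observations = 100_000_000L;
    // int userID + int itemID + float rating = 12 bytes per observation
    long naiveBytes = observations * (4 + 4 + 4);
    // grouping by user amortizes the user ID away: item ID + rating only
    long groupedBytes = observations * (4 + 4);
    assert naiveBytes == 1_200_000_000L;  // ~1.2 GB
    assert groupedBytes == 800_000_000L;  // ~0.8 GB, under the 1 GB target
    System.out.println("ok");
  }
}
```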

 On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
 wrote:

 The second idea is indeed splendid; we should separate time-complexity-first
 and space-complexity-first implementations. What I'm not quite sure about
 is whether we really need to create two interfaces instead of one.
 Personally, I think 12G of heap space is not that high, right? Most new
 laptops can already handle that (emphasis on laptop). And if we replace the
 hash map (the culprit of high memory consumption) with a list/linkedList,
 it would simply degrade the time complexity of a linear search to O(n),
 not too bad either. The current DataModel is a result of careful thought
 and has undergone extensive testing; it is easier to expand on top of it
 than to subvert it.








Re: Regarding Online Recommenders

2013-07-19 Thread Peng Cheng

Hi,

Just one simple question: Is the 
org.apache.mahout.math.BinarySearch.binarySearch() function an optimized 
version of Arrays.binarySearch()? If it is not, why implement it again?


Yours Peng



Re: Regarding Online Recommenders

2013-07-18 Thread Pat Ferrel
 
 On Jul 17, 2013, at 1:19 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
 Hi Pat, please see my response inline.
 
 Best,
 Gokhan
 
 
 On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 May I ask how you plan to support model updates and 'anonymous' users?
 
 I assume the latent factors model is calculated offline still in batch
 mode, then there are periodic updates? How are the updates handled?
 
 
  If you are referring to the recommender under discussion here, no: updating
  the model can be done with a single preference, using stochastic gradient
  descent, by updating the particular user and item factors simultaneously.
 

Aren't there two different things needed to truly update the model: 1) add the 
new preference to the lower-dimensional space, 2) refactorize all preferences? 
#2 only needs to be done periodically--afaik. #1 would be super fast and could 
be done at runtime. Am I wrong, or are you planning to incrementally 
refactorize the entire preference array with every new preference?


Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
For a low-rank matrix-factorization-based recommender, a new preference is 
not stored as itself but predicted as a dot product of two vectors in the 
low-dimensional space, so it needs no projection. The user and item vectors, 
however, may need to be projected into a lower-dimensional space, if and only 
if you want to reduce the rank of the preference matrix. The refactorization 
step in SGD is super fast--that's the charm of SGD. So, yes, we will 
refactorize on every update.
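The single-preference SGD update being discussed can be sketched as follows (a minimal illustration with placeholder learning rate and regularization values, not the actual code under discussion):

```java
// Sketch of updating a model from one observed rating: both the user's and
// the item's factor vectors move a small step along the gradient of the
// squared prediction error, with L2 regularization.
public class SgdUpdate {

  static void update(double[] userFactors, double[] itemFactors,
                     double rating, double learningRate, double lambda) {
    double prediction = 0;
    for (int f = 0; f < userFactors.length; f++) {
      prediction += userFactors[f] * itemFactors[f];
    }
    double err = rating - prediction;
    // update both factor vectors simultaneously from this one preference
    for (int f = 0; f < userFactors.length; f++) {
      double u = userFactors[f];
      userFactors[f] += learningRate * (err * itemFactors[f] - lambda * u);
      itemFactors[f] += learningRate * (err * u - lambda * itemFactors[f]);
    }
  }

  public static void main(String[] args) {
    double[] u = {0.1, 0.1};
    double[] v = {0.1, 0.1};
    double before = Math.abs(4.0 - (u[0] * v[0] + u[1] * v[1]));
    for (int i = 0; i < 100; i++) {
      update(u, v, 4.0, 0.05, 0.001);
    }
    double after = Math.abs(4.0 - (u[0] * v[0] + u[1] * v[1]));
    assert after < before; // repeated updates shrink the error on this rating
    System.out.println("ok");
  }
}
```

Each update touches only 2k parameters (k = number of factors), which is why a single online update is so cheap compared with a full batch refactorization.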


Yours Peng







Re: Regarding Online Recommenders

2013-07-18 Thread Pat Ferrel
Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender not only 
factorized. Ted talks about using Solr for this, which we're experimenting with 
alongside Myrrix. I suspect Solr works but it does require a bit of tinkering 
and doesn't have quite the same set of options--no llr similarity for instance. 

On the same subject I recently attended a workshop in Seattle for UAI2013 where 
Walmart reported similar results using a factorized recommender. They had to 
increase the factor number past where it would perform well. Along the way they 
saw increasing performance measuring precision offline. They eventually gave up 
on a factorized solution. This decision seems odd but anyway… Walmart's data 
set and ours are both quite diverse. The best idea is probably to create 
different recommenders for separate parts of the catalog, but if you create 
one model on all items, our intuition is that item-based works better than 
factorized. Again, caveat--no A/B tests to support this yet.

Doing an online item-based recommender would quickly run into scaling problems, 
no? We put together the simple Mahout in-memory version and it could not really 
handle more than a down-sampled few months of our data. Down-sampling lost us 
20% of our precision scores so we moved to the hadoop version. Now we have 
use-cases for an online recommender that handles anonymous new users and that 
takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote:

Hi Pat

I think we should provide simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as an argument. For itembased recommenders, it's
straightforward to compute recommendations; for userbased, you have to
search through all users once; for latent factor models, you have to fold
the user vector into the low-dimensional space.
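The fold-in step for latent factor models could look roughly like this (my simplification: a few SGD passes against fixed item factors, rather than the closed-form least-squares fold-in; all names and values are illustrative):

```java
// Sketch: fold an anonymous user's preference history into the latent space
// by fitting a fresh user vector against the already-trained item factors,
// which stay fixed. The resulting vector can then be scored against all items.
public class FoldIn {

  static double[] foldIn(double[][] itemFactors, int[] ratedItems,
                         double[] ratings, int k) {
    double[] u = new double[k];
    java.util.Arrays.fill(u, 0.1);
    double learningRate = 0.05;
    for (int pass = 0; pass < 200; pass++) {
      for (int i = 0; i < ratedItems.length; i++) {
        double[] v = itemFactors[ratedItems[i]];
        double pred = 0;
        for (int f = 0; f < k; f++) {
          pred += u[f] * v[f];
        }
        double err = ratings[i] - pred;
        for (int f = 0; f < k; f++) {
          u[f] += learningRate * err * v[f]; // item factors stay fixed
        }
      }
    }
    return u;
  }

  public static void main(String[] args) {
    // toy model: two items with orthonormal factor vectors
    double[][] V = {{1.0, 0.0}, {0.0, 1.0}};
    double[] u = foldIn(V, new int[] {0, 1}, new double[] {4.0, 1.0}, 2);
    // the folded-in vector should reproduce the anonymous user's ratings
    assert Math.abs(u[0] - 4.0) < 0.1 && Math.abs(u[1] - 1.0) < 0.1;
    System.out.println("ok");
  }
}
```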

I think Sean already added this method in myrrix and I have some code for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel pat.fer...@gmail.com

 May I ask how you plan to support model updates and 'anonymous' users?
 
 I assume the latent factors model is calculated offline still in batch
 mode, then there are periodic updates? How are the updates handled? Do you
 plan to require batch model refactorization for any update? Or perform some
 partial update by maybe just transforming new data into the LF space
 already in place then doing full refactorization every so often in batch
 mode?
 
 By 'anonymous users' I mean users with some history that is not yet
 incorporated in the LF model. This could be history from a new user asked
 to pick a few items to start the rec process, or an old user with some new
 action history not yet in the model. Are you going to allow for passing the
 entire history vector or userID+incremental new history to the recommender?
 I hope so.
 
 For what it's worth we did a comparison of Mahout Item based CF to Mahout
 ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of
 data. The data was purchase data from a diverse ecom source with a large
 variety of products from electronics to clothes. We found Item based CF did
 far better than ALS. As we increased the number of latent factors the
 results got better but were never within 10% of item based (we used MAP as
 the offline metric). Not sure why but maybe it has to do with the diversity
 of the item types.
 
 I understand that a full item based online recommender has very different
 tradeoffs and anyway others may not have seen this disparity of results.
 Furthermore we don't have A/B test results yet to validate the offline
 metric.
 
 On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
 Peng,
 
 This is the reason I separated out the DataModel, and only put the learner
 stuff there. The learner I mentioned yesterday just stores the
 parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
 where preferences are stored.
 
  I, kind of, agree with the multi-level DataModel approach: one for
  iterating over all preferences, and one for when one wants to deploy a
  recommender and perform a lot of top-N recommendation tasks.
  
  (Or one DataModel with a strategy that might reduce existing memory
  consumption while still providing fast access, I am not sure. Let me try a
  matrix-backed DataModel approach.)
 
 Gokhan
 
 

Re: Regarding Online Recommenders

2013-07-18 Thread Sebastian Schelter
Well, with itembased the only problem is new items. New users can
immediately be served by the model (although this is not well supported by
the API in Mahout). For the majority of use cases I've seen, it is perfectly
fine to have a short delay until new items enter the recommender; usually
this happens after a retraining in batch. You have to care for cold-start
and collect some interactions anyway.


2013/7/18 Pat Ferrel pat.fer...@gmail.com

 Yes, what Myrrix does is good.

 My last aside was a wish for an item-based online recommender not only
 factorized. Ted talks about using Solr for this, which we're experimenting
 with alongside Myrrix. I suspect Solr works but it does require a bit of
 tinkering and doesn't have quite the same set of options--no llr similarity
 for instance.

 On the same subject I recently attended a workshop in Seattle for UAI2013
 where Walmart reported similar results using a factorized recommender. They
 had to increase the factor number past where it would perform well. Along
 the way they saw increasing performance measuring precision offline. They
 eventually gave up on a factorized solution. This decision seems odd but
 anyway… In the case of Walmart and our data set they are quite diverse. The
 best idea is probably to create different recommenders for separate parts
 of the catalog but if you create one model on all items our intuition is
 that item-based works better than factorized. Again caveat--no A/B tests to
 support this yet.

 Doing an online item-based recommender would quickly run into scaling
 problems, no? We put together the simple Mahout in-memory version and it
 could not really handle more than a down-sampled few months of our data.
 Down-sampling lost us 20% of our precision scores so we moved to the hadoop
 version. Now we have use-cases for an online recommender that handles
 anonymous new users and that takes the story full circle.

 On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote:

 Hi Pat

 I think we should provide a simple support for recommending to anonymous
 users. We should have a method recommendToAnonymous() that takes a
 PreferenceArray as argument. For itembased recommenders, its
 straightforward to compute recommendations, for userbased you have to
 search through all users once, for latent factor models, you have to fold
 the user vector into the low dimensional space.

 I think Sean already added this method in myrrix and I have some code for
 my kornakapi project (a simple weblayer for mahout).

 Would such a method fit your needs?

 Best,
 Sebastian



 2013/7/17 Pat Ferrel pat.fer...@gmail.com

  May I ask how you plan to support model updates and 'anonymous' users?
 
  I assume the latent factors model is calculated offline still in batch
  mode, then there are periodic updates? How are the updates handled? Do
 you
  plan to require batch model refactorization for any update? Or perform
 some
  partial update by maybe just transforming new data into the LF space
  already in place then doing full refactorization every so often in batch
  mode?
 
  By 'anonymous users' I mean users with some history that is not yet
  incorporated in the LF model. This could be history from a new user asked
  to pick a few items to start the rec process, or an old user with some
 new
  action history not yet in the model. Are you going to allow for passing
 the
  entire history vector or userID+incremental new history to the
 recommender?
  I hope so.
 
  For what it's worth we did a comparison of Mahout Item based CF to Mahout
  ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months
 of
  data. The data was purchase data from a diverse ecom source with a large
  variety of products from electronics to clothes. We found Item based CF
 did
  far better than ALS. As we increased the number of latent factors the
  results got better but were never within 10% of item based (we used MAP
 as
  the offline metric). Not sure why but maybe it has to do with the
 diversity
  of the item types.
 
  I understand that a full item based online recommender has very different
  tradeoffs and anyway others may not have seen this disparity of results.
  Furthermore we don't have A/B test results yet to validate the offline
  metric.
 
  On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
  Peng,
 
  This is the reason I separated out the DataModel, and only put the
 learner
  stuff there. The learner I mentioned yesterday just stores the
  parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
  where preferences are stored.
 
  I, kind of, agree with the multi-level DataModel approach:
  One for iterating over all preferences, one for if one wants to deploy
 a
  recommender and perform a lot of top-N recommendation tasks.
 
  (Or one DataModel with a strategy that might reduce existing memory
  consumption, while still providing fast access, I am not sure. Let me
 try a
  matrix-backed DataModel 

Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
If I remember right, a highlight of the 0.8 release is an online clustering 
algorithm. I'm not sure if it can be used in an item-based recommender, but 
this is definitely something I would like to pursue. It's probably the only 
advantage a non-hadoop implementation can offer in the future.


Many non-hadoop recommenders are pretty fast. But the existing in-memory 
GenericDataModel and FileDataModel are largely implemented for sandboxes; 
IMHO they are the culprit of the scalability problem.


May I ask about the scale of your dataset? How many ratings does it have?

Yours Peng


Re: Regarding Online Recommenders

2013-07-18 Thread Pat Ferrel
It was about 2.5M users and 500K items with 25M actions over 6 months of data.


Re: Regarding Online Recommenders

2013-07-18 Thread Abhishek Sharma
Thanks for the encouragement guys, and my apologies for deviating from the
discussion again.

As suggested by Sebastian, I looked into Mahout's recommenders and
implemented a user-based recommender as well. IMHO, I now have a fair
understanding of the recommenders.

In order to be a part of the ASF-ICFOSS mentorship programme
(https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme)
I am required to submit a proposal discussing my plan for working on this
idea. My understanding, based on the discussion on this thread so far, is
that we are still in the process of finalizing an approach and have yet to
start working on the idea.

I need your help on the following:
1) Could you please suggest the sub-features that I would be able to
contribute to, so that I can mention them under expected deliverables?
2) Also, I would be really grateful if one of you could be my mentor
(http://community.apache.org/mentoringprogramme.html) as part of this
programme (as a mentee I would need only some pointers/directions to
proceed further). Please let me know.

I am really excited and looking forward to a positive response to work on
this feature.

Thanks & Regards,
Abhishek Sharma
http://www.linkedin.com/in/abhi21
https://github.com/abhi21

On Thu, Jul 18, 2013 at 2:39 AM, Peng Cheng pc...@uowmail.edu.au wrote:

 Awesome! Your reinforcements are highly appreciated.


 On 13-07-17 01:29 AM, Abhishek Sharma wrote:

 Sorry to interrupt guys, but I just wanted to bring to your notice that
 I am also interested in contributing to this idea. I am planning to
 participate in the ASF-ICFOSS mentorship programme
 (https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme),
 which is very similar to GSoC.

 I have strong concepts in machine learning (I have done the ML course by
 Andrew Ng on Coursera), and I am good at programming (2.5 yrs of work
 experience). I am not really sure how I can approach this problem (but I
 do have a strong interest in working on it), hence I would like to pair
 up on this. I am currently working as a research intern at the Indian
 Institute of Science (IISc), Bangalore, India and can put in 15-20 hrs
 per week.

 Please let me know your thoughts on whether I can be a part of this.

 Thanks & Regards,
 Abhishek Sharma
 http://www.linkedin.com/in/abhi21
 https://github.com/abhi21


 On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan gkhn...@gmail.com wrote:

  Peng,

 This is the reason I separated out the DataModel and only put the learner
 stuff there. The learner I mentioned yesterday just stores the
 parameters, (noOfUsers + noOfItems) * noOfLatentFactors of them, and does
 not care where preferences are stored.

 I, kind of, agree with the multi-level DataModel approach:
 one for iterating over all preferences, and one for when one wants to
 deploy a recommender and perform a lot of top-N recommendation tasks.

 (Or one DataModel with a strategy that might reduce existing memory
 consumption while still providing fast access, I am not sure. Let me try a
 matrix-backed DataModel approach.)

 Gokhan


 On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org
 wrote:

 I completely agree, Netflix is less than one gigabyte in a smart
 representation, 12x more memory is a no-go. The techniques used in
 FactorizablePreferences allow a much more memory-efficient representation;
 tested on the KDD Music dataset, which is approx 2.5 times Netflix, it
 fits into 3GB with that approach.
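These figures are easy to sanity-check. A hypothetical back-of-envelope sketch, assuming ratings are grouped by user in parallel primitive arrays so each rating costs one int (item id) plus one float (value), with the user id implicit in the row grouping; the dataset sizes are rough public figures used only for illustration:

```java
// Back-of-envelope memory estimate for a primitive-array rating store:
// one int (item id) + one float (value) per rating, user id implicit.
public class RatingMemoryEstimate {

    static long bytesForRatings(long numRatings) {
        long perRating = Integer.BYTES + Float.BYTES; // 4 + 4 bytes
        return numRatings * perRating;
    }

    public static void main(String[] args) {
        long netflix = 100_000_000L;   // ~100M ratings
        long kddMusic = 250_000_000L;  // ~2.5x Netflix
        System.out.printf("Netflix:   %.1f GB%n", bytesForRatings(netflix) / 1e9);
        System.out.printf("KDD Music: %.1f GB%n", bytesForRatings(kddMusic) / 1e9);
    }
}
```

That gives roughly 0.8 GB for Netflix-scale data and 2.0 GB for KDD-Music-scale data, consistent with the "less than one gigabyte" and "fits into 3GB" figures in this thread (object headers and index structures add some overhead on top).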


 2013/7/16 Ted Dunning ted.dunn...@gmail.com

 Netflix is a small dataset. 12G for that seems quite excessive.

 Note also that this is before you have done any work.

 Ideally, 100 million observations should take less than 1GB.

 On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au

 wrote:

 The second idea is indeed splendid, we should separate the
 time-complexity-first and space-complexity-first implementations. What I'm
 not quite sure about is whether we really need to create two interfaces
 instead of one. Personally, I think 12G heap space is not that high,
 right? Most new laptops can already handle that (emphasis on laptop). And
 if we replace the hash map (the culprit of the high memory consumption)
 with a list/linked list, it would simply degrade lookup to a linear O(n)
 search, not too bad either. The current DataModel is a result of careful
 thought and has undergone extensive testing; it is easier to expand on top
 of it than to subvert it.
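There is also a middle ground between the hash map and the linked list: keeping each user's preferences in sorted parallel primitive arrays gives O(log n) lookup via binary search with no per-entry object overhead. A hypothetical sketch, not the actual Mahout GenericDataModel layout:

```java
import java.util.Arrays;

// Illustrative per-user preference store: item ids kept sorted in a
// primitive array, values in a parallel array. Lookup is an O(log n)
// binary search rather than an O(n) scan.
public class SortedPreferenceArray {
    private final int[] itemIds;   // sorted ascending
    private final float[] values;  // parallel to itemIds

    SortedPreferenceArray(int[] sortedItemIds, float[] values) {
        this.itemIds = sortedItemIds;
        this.values = values;
    }

    // Returns the stored preference, or NaN if this user never rated the item.
    float get(int itemId) {
        int idx = Arrays.binarySearch(itemIds, itemId);
        return idx >= 0 ? values[idx] : Float.NaN;
    }

    // Overwrites an existing preference in place. Inserting a brand-new
    // (user, item) pair would require shifting both arrays -- the
    // O(1)-write difficulty discussed earlier in this thread.
    boolean set(int itemId, float value) {
        int idx = Arrays.binarySearch(itemIds, itemId);
        if (idx < 0) {
            return false;
        }
        values[idx] = value;
        return true;
    }
}
```

The tradeoff is exactly the one under discussion: reads stay fast and memory stays tight, but arbitrary writes are no longer cheap.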








-- 
Abhishek Sharma
ThoughtWorks


Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
Strange, it's just a little bit larger than the Libimseti dataset (17M
ratings). Did you encounter an OutOfMemory or GC-timeout exception?
Allocating more heap space usually helps.


Yours Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote:

It was about 2.5M users and 500K items with 25M actions over 6 months of data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

If I remember right, a highlight of the 0.8 release is an online clustering
algorithm. I'm not sure if it can be used in an item-based recommender, but
this is definitely something I would like to pursue. It's probably the only
advantage a non-Hadoop implementation can offer in the future.

Many non-Hadoop recommenders are pretty fast. But the existing in-memory
GenericDataModel and FileDataModel are largely implemented for sandboxes;
IMHO they are the culprit of the scalability problem.

May I ask about the scale of your dataset? How many ratings does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:

Well, with item-based the only problem is new items. New users can
immediately be served by the model (although this is not well supported by
the API in Mahout). For the majority of use cases I saw, it is perfectly
fine to have a short delay until new items enter the recommender; usually
this happens after a retraining in batch. You have to care for cold-start
and collect some interactions anyway.
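The reason item-based models serve new users immediately can be made concrete: the model is just a precomputed item-item similarity matrix, so a brand-new user's score for a candidate item is a similarity-weighted sum over whatever interactions have been collected so far. A sketch with illustrative names, not the Mahout API:

```java
// Scoring a candidate item for a new user against a precomputed item-item
// similarity matrix: no retraining needed, only the user's fresh history.
public class ItemBasedNewUser {

    // sim[i][j] = precomputed similarity between items i and j (e.g. LLR,
    // cosine); userItems/userPrefs hold the new user's interactions so far.
    static double score(double[][] sim, int candidate, int[] userItems, double[] userPrefs) {
        double s = 0.0;
        for (int i = 0; i < userItems.length; i++) {
            s += userPrefs[i] * sim[candidate][userItems[i]];
        }
        return s;
    }
}
```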


2013/7/18 Pat Ferrel pat.fer...@gmail.com


Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender, not only a
factorized one. Ted talks about using Solr for this, which we're
experimenting with alongside Myrrix. I suspect Solr works, but it does
require a bit of tinkering and doesn't have quite the same set of
options--no LLR similarity, for instance.

On the same subject, I recently attended a workshop in Seattle for UAI 2013
where Walmart reported similar results using a factorized recommender. They
had to increase the factor number past where it would perform well. Along
the way they saw increasing performance measuring precision offline. They
eventually gave up on a factorized solution. This decision seems odd but
anyway… In the case of Walmart and our data set, the items are quite
diverse. The best idea is probably to create different recommenders for
separate parts of the catalog, but if you create one model on all items,
our intuition is that item-based works better than factorized. Again,
caveat--no A/B tests to support this yet.

Doing an online item-based recommender would quickly run into scaling
problems, no? We put together the simple Mahout in-memory version and it
could not really handle more than a down-sampled few months of our data.
Down-sampling lost us 20% of our precision scores, so we moved to the
Hadoop version. Now we have use cases for an online recommender that
handles anonymous new users, and that takes the story full circle.


Re: Regarding Online Recommenders

2013-07-18 Thread Pat Ferrel
No, it was CPU-bound, not memory. I gave it something like a 14G heap. It
was running, just too slow to be of any real use. We switched to the Hadoop
version and stored precalculated recs in a DB for every user.

On Jul 18, 2013, at 12:06 PM, Peng Cheng pc...@uowmail.edu.au wrote:

Strange, it's just a little bit larger than the Libimseti dataset (17M
ratings). Did you encounter an OutOfMemory or GC-timeout exception?
Allocating more heap space usually helps.

Yours Peng


Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
I see, sorry I was too presumptuous. I have only recently worked with and
tested SVDRecommender, so I could not have known its efficiency relative to
an item-based recommender. Maybe there is room for algorithmic
optimization.


The online recommender Gokhan is working on is also an SVDRecommender. An
online user-based or item-based recommender based on a clustering
technique would definitely be critical, but we need an expert to
volunteer :)


Perhaps Dr. Dunning can have a few words? He announced the online
clustering component.


Yours Peng

On 13-07-18 03:54 PM, Pat Ferrel wrote:

No it was CPU bound not memory. I gave it something like 14G heap. It was 
running, just too slow to be of any real use. We switched to the hadoop version 
and stored precalculated recs in a db for every user.


Re: Regarding Online Recommenders

2013-07-18 Thread Gokhan Capan
I just started to implement a Matrix-backed data model and pushed it, to
check the performance and memory considerations.

I believe I can try it on some data tomorrow.

Best

Gokhan


On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng pc...@uowmail.edu.au wrote:

 I see, sorry I was too presumptuous. I only recently worked and tested
 SVDRecommender, never could have known its efficiency using an item-based
 recommender. Maybe there is space for algorithmic optimization.

 The online recommender Gokhan is working on is also an SVDRecommender. An
 online user-based or item-based recommender based on clustering technique
 would definitely be critical, but we need an expert to volunteer :)

 Perhaps Dr Dunning can have a few words? He announced the online
 clustering component.

 Yours Peng



Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng

Wow, that's lightning fast.

Is it a SparseMatrix or DenseMatrix?

On 13-07-18 07:23 PM, Gokhan Capan wrote:

I just started to implement a Matrix backed data model and pushed it, to
check the performance and memory considerations.

I believe I can try it on some data tomorrow.

Best

Gokhan



Re: Regarding Online Recommenders

2013-07-18 Thread Gokhan Capan
It is two SparseRowMatrices, Peng. But I don't want to comment on it before
actually trying it. This is essentially a first step for me to choose my
side in the DataModel implementation discussion :)
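The dual-indexed idea behind the two matrices can be sketched with plain collections: the same preferences stored twice, once keyed by user and once by item, so both "all ratings of a user" and "all ratings of an item" are single row lookups. Nested maps stand in here for the two SparseRowMatrices, and writes are synchronized so the two views cannot drift apart; this is illustrative only, not the actual pushed code:

```java
import java.util.HashMap;
import java.util.Map;

// Dual-indexed preference store: each rating lives in a user-indexed view
// and an item-indexed view. Writes update both views under one lock.
public class DualIndexedModel {
    private final Map<Integer, Map<Integer, Float>> byUser = new HashMap<>();
    private final Map<Integer, Map<Integer, Float>> byItem = new HashMap<>();

    synchronized void setPreference(int userId, int itemId, float value) {
        byUser.computeIfAbsent(userId, u -> new HashMap<>()).put(itemId, value);
        byItem.computeIfAbsent(itemId, i -> new HashMap<>()).put(userId, value);
    }

    synchronized void removePreference(int userId, int itemId) {
        Map<Integer, Float> row = byUser.get(userId);
        if (row != null) {
            row.remove(itemId);
        }
        Map<Integer, Float> col = byItem.get(itemId);
        if (col != null) {
            col.remove(userId);
        }
    }

    Map<Integer, Float> preferencesFromUser(int userId) {
        return byUser.getOrDefault(userId, Map.of());
    }

    Map<Integer, Float> preferencesForItem(int itemId) {
        return byItem.getOrDefault(itemId, Map.of());
    }
}
```

The cost of the fast item-indexed reads is the doubled storage and the need to keep every set and remove atomic across both structures.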

Gokhan

On Fri, Jul 19, 2013 at 2:25 AM, Peng Cheng pc...@uowmail.edu.au wrote:

 Wow, that's lightning fast.

 Is it a SparseMatrix or DenseMatrix?


 On 13-07-18 07:23 PM, Gokhan Capan wrote:

 I just started to implement a Matrix backed data model and pushed it, to
 check the performance and memory considerations.

 I believe I can try it on some data tomorrow.

 Best

 Gokhan


 On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng pc...@uowmail.edu.au
 wrote:

  I see, sorry I was too presumptuous. I only recently worked and tested
 SVDRecommender, never could have known its efficiency using an item-based
 recommender. Maybe there is space for algorithmic optimization.

 The online recommender Gokhan is working on is also an SVDRecommender. An
 online user-based or item-based recommender based on clustering technique
 would definitely be critical, but we need an expert to volunteer :)

 Perhaps Dr Dunning can have a few words? He announced the online
 clustering component.

 Yours Peng


 On 13-07-18 03:54 PM, Pat Ferrel wrote:

  No it was CPU bound not memory. I gave it something like 14G heap. It
 was
 running, just too slow to be of any real use. We switched to the hadoop
 version and stored precalculated recs in a db for every user.

 On Jul 18, 2013, at 12:06 PM, Peng Cheng pc...@uowmail.edu.au wrote:

 Strange, its just a little bit larger than limibseti dataset (17m
 ratings), did you encountered an outOfMemory or GCTimeOut exception?
 Allocating more heap space usually help.

 Yours Peng

 On 13-07-18 02:27 PM, Pat Ferrel wrote:

  It was about 2.5M users and 500K items with 25M actions over 6 months
 of
 data.

 On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

 If I remember right, a highlight of the 0.8 release is an online clustering
 algorithm. I'm not sure if it can be used in an item-based recommender, but
 this is definitely something I would like to pursue. It's probably the only
 advantage a non-Hadoop implementation can offer in the future.

 Many non-Hadoop recommenders are pretty fast. But the existing in-memory
 GenericDataModel and FileDataModel are largely implemented for sandboxes;
 IMHO they are the culprit of the scalability problem.

 May I ask about the scale of your dataset? How many ratings does it
 have?

 Yours Peng

 On 13-07-18 12:14 PM, Sebastian Schelter wrote:

  Well, with itembased the only problem is new items. New users can
 immediately be served by the model (although this is not well supported
 by the API in Mahout). For the majority of use cases I saw, it is
 perfectly fine to have a short delay until new items enter the
 recommender; usually this happens after a retraining in batch. You have
 to care for cold-start and collect some interactions anyway.


 2013/7/18 Pat Ferrel pat.fer...@gmail.com

   Yes, what Myrrix does is good.

 My last aside was a wish for an item-based online recommender, not only a
 factorized one. Ted talks about using Solr for this, which we're
 experimenting with alongside Myrrix. I suspect Solr works, but it does
 require a bit of tinkering and doesn't have quite the same set of
 options--no LLR similarity, for instance.

 On the same subject, I recently attended a workshop in Seattle for UAI 2013
 where Walmart reported similar results using a factorized recommender. They
 had to increase the number of factors past the point where it performed
 well. Along the way they saw increasing performance when measuring
 precision offline. They eventually gave up on a factorized solution. This
 decision seems odd but anyway… Both Walmart's data and our data set are
 quite diverse. The best idea is probably to create different recommenders
 for separate parts of the catalog, but if you create one model on all
 items our intuition is that item-based works better than factorized.
 Again, caveat--no A/B tests to support this yet.

 Doing an online item-based recommender would quickly run into scaling
 problems, no? We put together the simple Mahout in-memory version and it
 could not really handle more than a down-sampled few months of our data.
 Down-sampling lost us 20% of our precision scores, so we moved to the
 Hadoop version. Now we have use cases for an online recommender that
 handles anonymous new users, and that takes the story full circle.

 On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org
 wrote:

 Hi Pat

 I think we should provide simple support for recommending to anonymous
 users. We should have a method recommendToAnonymous() that takes a
 PreferenceArray as argument. For itembased recommenders, it's
 straightforward to compute recommendations; for userbased you have to
 search through all users once; for latent factor models, you have to fold
 the user vector into the low-dimensional space.
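For latent factor models, that fold-in step can be sketched as a small regularized least-squares solve against the fixed item factors. The following is my own illustration (class and method names are invented, they are not the Mahout API):

```java
/** Hypothetical sketch: fold a new user's ratings into an existing latent
 *  factor space by solving the regularized normal equations
 *  (V'V + lambda*I) u = V'r against the fixed item factors V. */
public class FoldIn {

  /** v holds the factor vectors (length k) of the items the anonymous user
   *  rated; r holds the corresponding ratings. k is assumed tiny, so plain
   *  Gaussian elimination on the k x k system is fine. */
  public static double[] foldInUser(double[][] v, double[] r, double lambda) {
    int k = v[0].length;
    double[][] a = new double[k][k + 1];       // augmented system [V'V + lI | V'r]
    for (int i = 0; i < k; i++) {
      for (int j = 0; j < k; j++) {
        for (double[] vi : v) a[i][j] += vi[i] * vi[j];
      }
      a[i][i] += lambda;
      for (int n = 0; n < v.length; n++) a[i][k] += v[n][i] * r[n];
    }
    // Gaussian elimination with partial pivoting
    for (int col = 0; col < k; col++) {
      int piv = col;
      for (int row = col + 1; row < k; row++) {
        if (Math.abs(a[row][col]) > Math.abs(a[piv][col])) piv = row;
      }
      double[] tmp = a[piv]; a[piv] = a[col]; a[col] = tmp;
      for (int row = col + 1; row < k; row++) {
        double f = a[row][col] / a[col][col];
        for (int j = col; j <= k; j++) a[row][j] -= f * a[col][j];
      }
    }
    double[] u = new double[k];                // back substitution
    for (int i = k - 1; i >= 0; i--) {
      u[i] = a[i][k];
      for (int j = i + 1; j < k; j++) u[i] -= a[i][j] * u[j];
      u[i] /= a[i][i];
    }
    return u;
  }

  /** Predicted preference for an item with factor vector vi. */
  public static double predict(double[] u, double[] vi) {
    double s = 0;
    for (int i = 0; i < u.length; i++) s += u[i] * vi[i];
    return s;
  }
}
```

Once u is computed, scoring all items is just a dot product per item, so an anonymous user can be served without touching the trained model.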

 I think Sean already added 

Re: Regarding Online Recommenders

2013-07-17 Thread Pat Ferrel
May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch mode, 
then there are periodic updates? How are the updates handled? Do you plan to 
require batch model refactorization for any update? Or perform some partial 
update by maybe just transforming new data into the LF space already in place 
then doing full refactorization every so often in batch mode?

By 'anonymous users' I mean users with some history that is not yet 
incorporated in the LF model. This could be history from a new user asked to 
pick a few items to start the rec process, or an old user with some new action 
history not yet in the model. Are you going to allow for passing the entire 
history vector or userID+incremental new history to the recommender? I hope so.

For what it's worth, we did a comparison of Mahout item-based CF to Mahout 
ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6 
months of data. The data was purchase data from a diverse ecom source with a 
large variety of products, from electronics to clothes. We found item-based CF 
did far better than ALS. As we increased the number of latent factors the 
results got better, but were never within 10% of item-based (we used MAP as 
the offline metric). Not sure why, but maybe it has to do with the diversity 
of the item types.

I understand that a full item based online recommender has very different 
tradeoffs and anyway others may not have seen this disparity of results. 
Furthermore we don't have A/B test results yet to validate the offline metric.

On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote:

Peng,

This is the reason I separated out the DataModel, and only put the learner
stuff there. The learner I mentioned yesterday just stores the
parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
where preferences are stored.

I, kind of, agree with the multi-level DataModel approach:
one for iterating over all preferences, and one for deploying a
recommender and performing a lot of top-N recommendation tasks.

(Or one DataModel with a strategy that might reduce existing memory
consumption, while still providing fast access, I am not sure. Let me try a
matrix-backed DataModel approach)

Gokhan


On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org wrote:

 I completely agree, Netflix is less than one gigabyte in a smart
 representation; 12x more memory is a no-go. The techniques used in
 FactorizablePreferences allow a much more memory-efficient representation,
 tested on the KDD Music dataset, which is approx 2.5 times Netflix and fits
 into 3GB with that approach.
 
 
 2013/7/16 Ted Dunning ted.dunn...@gmail.com
 
 Netflix is a small dataset.  12G for that seems quite excessive.
 
 Note also that this is before you have done any work.
 
 Ideally, 100 million observations should take < 1GB.
 
 On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
 wrote:
 
 The second idea is indeed splendid: we should separate the
 time-complexity-first and the space-complexity-first implementations.
 What I'm not quite sure about is whether we really need to create two
 interfaces instead of one. Personally, I think 12G of heap space is not
 that high, right? Most new laptops can already handle that (emphasis on
 laptop). And if we replaced the hash map (the culprit of the high memory
 consumption) with a list/linked list, it would simply degrade the time
 complexity of a linear search to O(n), not too bad either. The current
 DataModel is the result of careful thought and has undergone extensive
 testing; it is easier to expand on top of it than to subvert it.
 
 



Re: Regarding Online Recommenders

2013-07-17 Thread Gokhan Capan
Hi Pat, please see my response inline.

Best,
Gokhan


On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 May I ask how you plan to support model updates and 'anonymous' users?

 I assume the latent factors model is calculated offline still in batch
 mode, then there are periodic updates? How are the updates handled?


If you are referring to the recommender of discussion here, no, updating
the model can be done with a single preference, using stochastic gradient
descent, by updating the particular user and item factors simultaneously.
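For illustration, such a single-preference SGD update might look like the following sketch (my own names and constants, not Gokhan's actual code); it nudges both factor vectors toward reducing the prediction error on one observation:

```java
/** Illustrative sketch of one stochastic gradient descent step on a single
 *  (user, item, rating) observation, updating both factor vectors in place
 *  with learning rate lr and L2 regularization lambda. */
public class SgdStep {

  public static void train(double[] u, double[] v,
                           double rating, double lr, double lambda) {
    double err = rating - dot(u, v);
    for (int i = 0; i < u.length; i++) {
      double ui = u[i];                       // keep old value for the v update
      u[i] += lr * (err * v[i] - lambda * ui);
      v[i] += lr * (err * ui - lambda * v[i]);
    }
  }

  public static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
    return s;
  }
}
```

Since each step touches only one user vector and one item vector, the model can absorb a new preference without retraining in batch.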

Do you plan to require batch model refactorization for any update? Or
 perform some partial update by maybe just transforming new data into the LF
 space already in place then doing full refactorization every so often in
 batch mode?

 By 'anonymous users' I mean users with some history that is not yet
 incorporated in the LF model. This could be history from a new user asked
 to pick a few items to start the rec process, or an old user with some new
 action history not yet in the model. Are you going to allow for passing the
 entire history vector or userID+incremental new history to the recommender?
 I hope so.


 For what it's worth we did a comparison of Mahout Item based CF to Mahout
 ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of
 data. The data was purchase data from a diverse ecom source with a large
 variety of products from electronics to clothes. We found Item based CF did
 far better than ALS. As we increased the number of latent factors the
 results got better but were never within 10% of item based (we used MAP as
 the offline metric). Not sure why but maybe it has to do with the diversity
 of the item types.


My first question: are those actions only positive, like the purchases you
mentioned?


 I understand that a full item based online recommender has very different
 tradeoffs and anyway others may not have seen this disparity of results.
 Furthermore we don't have A/B test results yet to validate the offline
 metric.


I personally think an A/B test is the best way to evaluate a recommender,
and if you are able to share it, I look forward to seeing the results. I
believe that would be a great contribution to some future decisions.






Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng
I see, OK, so we shouldn't use the old implementation. But I mean, the 
old interface doesn't have to be discarded. The discrepancy between your 
FactorizablePreferences and DataModel is that your model supports 
getPreferences(), which returns all preferences as an iterator, while 
DataModel supports a few old functions that return preferences for an 
individual user or item.


My point is that it is not hard for each of them to implement what the 
other has: the old DataModel can implement getPreferences() just by a 
loop in the abstract class, and your new FactorizablePreferences can 
implement those old functions by a binary search that takes O(log n) 
time, or an interpolation search that takes O(log log n) time on 
average. The same goes for the online update. It would just be a matter 
of different speed and space, not a different interface standard; we 
could use the old unit tests, old examples, old everything. And we would 
be more flexible in writing ensemble recommenders.
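As a sketch of that idea (hypothetical names, not an actual Mahout class): with preferences kept in primitive arrays sorted by user ID, the old per-user lookup can be served by Arrays.binarySearch() plus a local scan over the run of equal IDs:

```java
import java.util.Arrays;

/** Hypothetical sketch: preferences stored as parallel primitive arrays
 *  sorted by user ID (the memory-efficient "sequential" layout) can still
 *  serve per-user random access in O(log n) via binary search. */
public class SortedPrefs {
  private final long[] userIds;   // sorted ascending, may contain runs of duplicates
  private final long[] itemIds;
  private final float[] values;

  public SortedPrefs(long[] userIds, long[] itemIds, float[] values) {
    this.userIds = userIds; this.itemIds = itemIds; this.values = values;
  }

  /** Preference values of one user: binary search finds some index inside
   *  the user's contiguous run, then we widen to the run's boundaries. */
  public float[] preferenceValuesFromUser(long userId) {
    int idx = Arrays.binarySearch(userIds, userId);
    if (idx < 0) return new float[0];          // user unknown
    int lo = idx, hi = idx;
    while (lo > 0 && userIds[lo - 1] == userId) lo--;
    while (hi < userIds.length - 1 && userIds[hi + 1] == userId) hi++;
    return Arrays.copyOfRange(values, lo, hi + 1);
  }
}
```

The same trick works for the item-indexed view; only the sort key changes.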


Just a few thoughts, I'll have to validate the idea first before 
creating a new JIRA ticket.


Yours Peng







Re: Regarding Online Recommenders

2013-07-17 Thread Sebastian Schelter
Hi Peng,

I never wanted to discard the old interface, I just wanted to split it up.
I want to have a simple interface that only supports sequential access (and
allows for very memory-efficient implementations, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what it already does).

Then a recommender such as SGD could state that it only needs sequential
access to the preferences, and you could either feed it a DataModel (so we
don't break backwards compatibility) or a memory-efficient sequential
access thingy.

Does that make sense for you?





Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng
Mmm, you are right, the simplest solution is usually the best. I'm 
creating a new JIRA ticket.


Yours Peng









Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng

Awesome! Your reinforcements are highly appreciated.

On 13-07-17 01:29 AM, Abhishek Sharma wrote:

Sorry to interrupt, guys, but I just wanted to bring to your notice that
I am also interested in contributing to this idea. I am planning to
participate in the ASF-ICFOSS mentorship programme
(https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme,
very similar to GSoC).

I have strong concepts in machine learning (I have done the ML course by
Andrew Ng on Coursera) and I am good at programming (2.5 years of work
experience). I am not really sure how I can approach this problem (but I
do have a strong interest in working on it), hence I would like to pair
up on this. I am currently working as a research intern at the Indian
Institute of Science (IISc), Bangalore, India and can put in 15-20 hours
per week.

Please let me know your thoughts on whether I can be a part of this.

Thanks & Regards,
Abhishek Sharma
http://www.linkedin.com/in/abhi21
https://github.com/abhi21










Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng
One thing that suddenly comes to my mind is that, for a simple interface 
like FactorizablePreferences, maybe sequential READ in real time is 
possible, but sequential WRITE in O(1) time is utopian. You need to 
flush out the old preference with the same user and item IDs (in the 
worst case this could be an interpolation search), otherwise you are 
permitting a user to rate an item twice with different values. 
Considering how FileDataModel is supposed to work (new files flush old 
files), maybe using the simple interface has fewer advantages than we 
used to believe.
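A toy sketch of that write path (entirely illustrative, not Mahout code): before storing a preference, an implementation has to look for an existing (user, item) entry and overwrite it, so a write is always a search plus an update, never a plain O(1) append:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of why preference writes are not O(1): setPreference
 *  must first look for an existing (user, item) entry and overwrite it,
 *  otherwise the same user ends up rating the item twice. */
public class WritePath {

  static final class Pref {
    long user, item; float value;
    Pref(long u, long i, float v) { user = u; item = i; value = v; }
  }

  private final List<Pref> prefs = new ArrayList<>();

  /** Linear scan for simplicity; a sorted layout would make this a binary
   *  or interpolation search, but the lookup can never be skipped. */
  public void setPreference(long user, long item, float value) {
    for (Pref p : prefs) {
      if (p.user == user && p.item == item) { p.value = value; return; }
    }
    prefs.add(new Pref(user, item, value));
  }

  public float getPreference(long user, long item) {
    for (Pref p : prefs) {
      if (p.user == user && p.item == item) return p.value;
    }
    return Float.NaN;
  }

  public int size() { return prefs.size(); }
}
```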










Re: Regarding Online Recommenders

2013-07-17 Thread Sebastian Schelter
You are completely right, the simple interface would only be usable for
readonly / batch-updatable recommenders. Online recommenders might need
something different. I tried to widen the discussion here to cover all
kinds of API changes in the recommenders that would be necessary in the
future.










Re: Regarding Online Recommenders

2013-07-16 Thread Peng Cheng
Yeah, setPreference() and removePreference() shouldn't be there, but 
injecting a Recommender back into the DataModel is kind of a strong 
dependency, which may intermingle components with different concerns. 
Maybe we can do something with the RefreshHelper class? E.g. push 
something into a swap field so the downstream of a refreshable chain can 
read it out. I have read Gokhan's UpdateAwareDataModel, and feel that 
it's probably too heavyweight for a model selector, as every time one 
changes the algorithm one has to re-register it.


The second idea is indeed splendid: we should separate the 
time-complexity-first and the space-complexity-first implementations. 
What I'm not quite sure about is whether we really need to create two 
interfaces instead of one. Personally, I think 12G of heap space is not 
that high, right? Most new laptops can already handle that (emphasis on 
laptop). And if we replaced the hash map (the culprit of the high memory 
consumption) with a list/linked list, it would simply degrade the time 
complexity of a linear search to O(n), not too bad either. The current 
DataModel is the result of careful thought and has undergone extensive 
testing; it is easier to expand on top of it than to subvert it.


All the best,
Yours Peng
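The trade-off described above can be sketched as two tiny preference stores (all names hypothetical, not Mahout API): a boxed hash map versus sorted parallel primitive arrays, where the arrays save memory but pay a binary search per lookup.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: two ways to store one user's ratings.
public class PreferenceStoreSketch {

  // Convenient but heavy: boxed keys/values plus per-entry objects.
  static Map<Long, Float> hashStore = new HashMap<>();

  // Compact: parallel primitive arrays kept sorted by item ID;
  // lookups degrade from O(1) to O(log n) via binary search.
  static long[] itemIDs = {3L, 7L, 42L};
  static float[] ratings = {4.0f, 2.5f, 5.0f};

  static float arrayGet(long itemID) {
    int idx = Arrays.binarySearch(itemIDs, itemID);
    return idx >= 0 ? ratings[idx] : Float.NaN;
  }

  public static void main(String[] args) {
    hashStore.put(42L, 5.0f);
    // Both stores answer the same query; they differ in memory footprint.
    System.out.println(hashStore.get(42L)); // 5.0
    System.out.println(arrayGet(42L));      // 5.0
  }
}
```

An in-place update (flushing out the old preference for the same user/item pair) would likewise need a search to find the slot, which is where the hope of an O(1) sequential write breaks down.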

On 13-07-16 01:05 AM, Sebastian Schelter wrote:

Hi Gokhan,

I like your proposals and I think this is an important discussion. Peng
is also interested in working on online recommenders, so we should try
to team up our efforts. I'd like to extend the discussion a little to
related API changes, that I think are necessary.

What do you think about completely removing the setPreference() and
removePreference() methods from Recommender? I think they don't belong
there for two reasons: First, they duplicate functionality from
DataModel and second, a lot of recommenders are read-only/train-once and
cannot handle single preference updates anyway.

I think we should have a DataModel implementation that can be updated
and an online learning recommender should be able to register to be
notified with updates.

We should furthermore split up the DataModel interface into a hierarchy
of three parts:

First, a simple readonly interface that allows sequential access to the
data (similar to FactorizablePreferences). This allows us to create
memory efficient implementations. E.g. Cheng reported in MAHOUT-1272
that the current DataModel needs 12GB heap for the Netflix dataset (100M
ratings) which is unacceptable. I was able to fit the KDD Music dataset
(250M ratings) into 3GB with FactorizablePreferences.

The second interface would extend the readonly interface and should
resemble what DataModel is today: An easy-to-use in-memory
implementation that trades high memory consumption for convenient random
access.

And finally the third interface would extend the second and provide
tooling for online updates of the data.

What do you think of that? Does it sound reasonable?

--sebastian




Re: Regarding Online Recommenders

2013-07-16 Thread Ted Dunning
Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100 million observations should take < 1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote:

 The second idea is indeed splendid; we should separate time-complexity-first
 and space-complexity-first implementations. What I'm not quite sure about
 is whether we really need to create two interfaces instead of one.
 Personally, I think a 12G heap is not that high, right? Most new laptops
 can already handle that (emphasis on laptop). And if we replace the hash
 map (the culprit of the high memory consumption) with a list/linked list,
 it would simply degrade time complexity to an O(n) linear search, not too
 bad either. The current DataModel is the result of careful thought and has
 undergone extensive testing; it is easier to expand on top of it than to
 subvert it.


Re: Regarding Online Recommenders

2013-07-16 Thread Sebastian Schelter
I completely agree, Netflix is less than one gigabyte in a smart
representation, 12x more memory is a no-go. The techniques used in
FactorizablePreferences allow a much more memory-efficient representation,
tested on the KDD Music dataset, which is approx 2.5 times Netflix and fits
into 3GB with that approach.
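As a back-of-the-envelope check of these figures (a hypothetical CSR-style layout, not the actual FactorizablePreferences internals): one int item ID plus one float rating per observation, plus one long offset per user, comes to roughly 8 bytes per rating.

```java
// Rough memory estimate for a CSR-like preference layout (illustrative only).
public class MemoryEstimate {

  static long bytesNeeded(long numRatings, long numUsers) {
    long perRating = 4 + 4; // 4-byte int item ID + 4-byte float rating
    long perUser = 8;       // 8-byte long offset into the rating arrays
    return numRatings * perRating + numUsers * perUser;
  }

  public static void main(String[] args) {
    // Netflix: ~100M ratings, ~480K users
    System.out.println(bytesNeeded(100_000_000L, 480_000L) / (1 << 20) + " MB"); // 766 MB
  }
}
```

That is consistent with the "under 1GB" figure for Netflix; the 250M-rating KDD set lands around 2GB with this layout, before any auxiliary indices.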


2013/7/16 Ted Dunning ted.dunn...@gmail.com

 Netflix is a small dataset.  12G for that seems quite excessive.

 Note also that this is before you have done any work.

 Ideally, 100 million observations should take < 1GB.




Re: Regarding Online Recommenders

2013-07-16 Thread Gokhan Capan
Peng,

This is the reason I separated the DataModel out and only put the learner
stuff there. The learner I mentioned yesterday just stores the parameters,
(noOfUsers+noOfItems)*noOfLatentFactors of them, and does not care
where the preferences are stored.

I, kind of, agree with the multi-level DataModel approach:
one for iterating over all preferences, and one for when one wants to
deploy a recommender and perform a lot of top-N recommendation tasks.

(Or one DataModel with a strategy that might reduce the existing memory
consumption while still providing fast access, I am not sure. Let me try a
matrix-backed DataModel approach.)

Gokhan
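A minimal sketch of such a learner (hypothetical code, not Gokhan's actual patch): it holds only the (noOfUsers+noOfItems)*noOfLatentFactors parameters and trains on one observation at a time, regardless of where the preferences themselves live.

```java
import java.util.Random;

// Illustrative online matrix-factorization learner; stores only the factors.
public class OnlineFactorizationSketch {

  final double[][] userFactors;
  final double[][] itemFactors;
  final double learningRate = 0.01;
  final double lambda = 0.05; // regularization

  OnlineFactorizationSketch(int numUsers, int numItems, int numFactors) {
    Random random = new Random(42);
    userFactors = new double[numUsers][numFactors];
    itemFactors = new double[numItems][numFactors];
    for (double[] row : userFactors)
      for (int k = 0; k < numFactors; k++) row[k] = random.nextGaussian() * 0.01;
    for (double[] row : itemFactors)
      for (int k = 0; k < numFactors; k++) row[k] = random.nextGaussian() * 0.01;
  }

  double predict(int user, int item) {
    double dot = 0;
    for (int k = 0; k < userFactors[user].length; k++)
      dot += userFactors[user][k] * itemFactors[item][k];
    return dot;
  }

  // One SGD step; this is what DataModel#setPreference could trigger.
  void train(int user, int item, double rating) {
    double err = rating - predict(user, item);
    for (int k = 0; k < userFactors[user].length; k++) {
      double u = userFactors[user][k];
      double v = itemFactors[item][k];
      userFactors[user][k] += learningRate * (err * v - lambda * u);
      itemFactors[item][k] += learningRate * (err * u - lambda * v);
    }
  }

  public static void main(String[] args) {
    OnlineFactorizationSketch learner = new OnlineFactorizationSketch(10, 10, 5);
    for (int i = 0; i < 1000; i++) learner.train(0, 0, 4.0);
    System.out.printf("%.2f%n", learner.predict(0, 0)); // close to 4.0
  }
}
```

The point is the decoupling: the learner's memory is fixed by the factor count, so the preference store behind it can be as compact or as convenient as you like.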


On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org wrote:

 I completely agree, Netflix is less than one gigabyte in a smart
 representation, 12x more memory is a no-go. The techniques used in
 FactorizablePreferences allow a much more memory-efficient representation,
 tested on the KDD Music dataset, which is approx 2.5 times Netflix and fits
 into 3GB with that approach.





Re: Regarding Online Recommenders

2013-07-16 Thread Abhishek Sharma
Sorry to interrupt guys, but I just wanted to bring it to your notice that
I am also interested in contributing to this idea. I am planning to
participate in the ASF-ICFOSS mentorship programme:
https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme
(this is very similar to GSoC).

I have a solid grounding in machine learning (I have done the ML course by
Andrew Ng on Coursera) and I am good at programming (2.5 yrs of work
experience). I am not really sure how I can approach this problem (but I
do have a strong interest in working on it), hence I would like to pair up
on this. I am currently working as a research intern at the Indian Institute
of Science (IISc), Bangalore, India, and can put in 15-20 hrs per week.

Please let me know your thoughts on whether I can be a part of this.

Thanks & Regards,
Abhishek Sharma
http://www.linkedin.com/in/abhi21
https://github.com/abhi21


On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan gkhn...@gmail.com wrote:

 Peng,

 This is the reason I separated the DataModel out and only put the learner
 stuff there. The learner I mentioned yesterday just stores the parameters,
 (noOfUsers+noOfItems)*noOfLatentFactors of them, and does not care
 where the preferences are stored.

 I, kind of, agree with the multi-level DataModel approach:
 one for iterating over all preferences, and one for when one wants to
 deploy a recommender and perform a lot of top-N recommendation tasks.

 (Or one DataModel with a strategy that might reduce the existing memory
 consumption while still providing fast access, I am not sure. Let me try a
 matrix-backed DataModel approach.)

 Gokhan






--
Abhishek Sharma
ThoughtWorks


Re: Regarding Online Recommenders

2013-07-16 Thread Sebastian Schelter
Hi Abhishek,

Great to hear that you're willing to put some work into this! Have you
ever worked with Mahout's recommenders before? If not, then a good first
step would be to get familiar with them and code up a few examples.

Best,
Sebastian

On 17.07.2013 07:29, Abhishek Sharma wrote:
 Sorry to interrupt guys, but I just wanted to bring it to your notice that
 I am also interested in contributing to this idea. I am planning to
 participate in the ASF-ICFOSS mentorship programme:
 https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme
 (this is very similar to GSoC).

 I have a solid grounding in machine learning (I have done the ML course by
 Andrew Ng on Coursera) and I am good at programming (2.5 yrs of work
 experience). I am not really sure how I can approach this problem (but I
 do have a strong interest in working on it), hence I would like to pair up
 on this. I am currently working as a research intern at the Indian
 Institute of Science (IISc), Bangalore, India, and can put in 15-20 hrs
 per week.

 Please let me know your thoughts on whether I can be a part of this.

 Thanks & Regards,
 Abhishek Sharma
 http://www.linkedin.com/in/abhi21
 https://github.com/abhi21
 
 


Re: Regarding Online Recommenders

2013-07-15 Thread Sebastian Schelter
Hi Gokhan,

I like your proposals and I think this is an important discussion. Peng
is also interested in working on online recommenders, so we should try
to team up our efforts. I'd like to extend the discussion a little to
related API changes, that I think are necessary.

What do you think about completely removing the setPreference() and
removePreference() methods from Recommender? I think they don't belong
there for two reasons: First, they duplicate functionality from
DataModel and second, a lot of recommenders are read-only/train-once and
cannot handle single preference updates anyway.

I think we should have a DataModel implementation that can be updated
and an online learning recommender should be able to register to be
notified with updates.

We should furthermore split up the DataModel interface into a hierarchy
of three parts:

First, a simple readonly interface that allows sequential access to the
data (similar to FactorizablePreferences). This allows us to create
memory efficient implementations. E.g. Cheng reported in MAHOUT-1272
that the current DataModel needs 12GB heap for the Netflix dataset (100M
ratings) which is unacceptable. I was able to fit the KDD Music dataset
(250M ratings) into 3GB with FactorizablePreferences.

The second interface would extend the readonly interface and should
resemble what DataModel is today: An easy-to-use in-memory
implementation that trades high memory consumption for convenient random
access.

And finally the third interface would extend the second and provide
tooling for online updates of the data.

What do you think of that? Does it sound reasonable?

--sebastian
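The three levels might look roughly like this (all names hypothetical; only meant to make the proposal concrete), including a toy in-memory implementation of the updatable level that notifies registered listeners:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed three-level hierarchy (names invented).
public class DataModelHierarchySketch {

  // Level 1: read-only, sequential access; implementable over primitive arrays.
  interface SequentialPreferences {
    long numPreferences();
  }

  // Level 2: roughly today's DataModel -- adds random access,
  // trading memory for convenience.
  interface RandomAccessPreferences extends SequentialPreferences {
    Float preferenceValue(long userID, long itemID);
  }

  // Level 3: adds online updates; an online recommender registers a
  // listener and is notified (so it can train) on every change.
  interface UpdatablePreferences extends RandomAccessPreferences {
    void setPreference(long userID, long itemID, float value);
    void addChangeListener(ChangeListener listener);
  }

  interface ChangeListener {
    void onPreferenceChanged(long userID, long itemID, float value);
  }

  // Toy in-memory implementation of level 3.
  static class InMemoryPreferences implements UpdatablePreferences {
    private final Map<Long, Map<Long, Float>> prefs = new HashMap<>();
    private final List<ChangeListener> listeners = new ArrayList<>();

    public long numPreferences() {
      long n = 0;
      for (Map<Long, Float> userPrefs : prefs.values()) n += userPrefs.size();
      return n;
    }

    public Float preferenceValue(long userID, long itemID) {
      Map<Long, Float> userPrefs = prefs.get(userID);
      return userPrefs == null ? null : userPrefs.get(itemID);
    }

    public void setPreference(long userID, long itemID, float value) {
      prefs.computeIfAbsent(userID, k -> new HashMap<>()).put(itemID, value);
      // Push the update downstream, e.g. to an online learner's train step.
      for (ChangeListener l : listeners) l.onPreferenceChanged(userID, itemID, value);
    }

    public void addChangeListener(ChangeListener listener) {
      listeners.add(listener);
    }
  }

  public static void main(String[] args) {
    InMemoryPreferences model = new InMemoryPreferences();
    model.addChangeListener((u, i, v) ->
        System.out.println("train on " + u + "," + i + " -> " + v));
    model.setPreference(1L, 42L, 5.0f); // listener fires here
    System.out.println(model.numPreferences()); // 1
  }
}
```

This keeps setPreference() off the Recommender itself: the updatable model owns mutation, and an online learner merely subscribes.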


 The DataModel I imagine would follow the current API, where underlying
 preference storage is replaced with a matrix.
 
 A Recommender would then use the DataModel and the OnlineLearner, where
 Recommender#setPreference is delegated to DataModel#setPreference (like it
 does now), and DataModel#setPreference triggers OnlineLearner#train.