Wow, that's lightning fast.

Is it a SparseMatrix or DenseMatrix?

On 13-07-18 07:23 PM, Gokhan Capan wrote:
I just started implementing a Matrix-backed data model and pushed it, to
check the performance and memory characteristics.

I believe I can try it on some data tomorrow.

Best

Gokhan


On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng <[email protected]> wrote:

I see, sorry, I was too presumptuous. I have only recently worked with and
tested SVDRecommender, so I couldn't have known its efficiency compared to an
item-based recommender. Maybe there is room for algorithmic optimization.

The online recommender Gokhan is working on is also an SVDRecommender. An
online user-based or item-based recommender based on clustering techniques
would definitely be valuable, but we need an expert to volunteer :)

Perhaps Dr. Dunning can say a few words? He announced the online clustering
component.

Yours Peng


On 13-07-18 03:54 PM, Pat Ferrel wrote:

No, it was CPU-bound, not memory-bound. I gave it something like a 14G heap.
It was running, just too slow to be of any real use. We switched to the
Hadoop version and stored precalculated recs in a DB for every user.

On Jul 18, 2013, at 12:06 PM, Peng Cheng <[email protected]> wrote:

Strange, it's just a little larger than the Libimseti dataset (17M ratings).
Did you encounter an OutOfMemoryError or a GC-overhead-limit exception?
Allocating more heap space usually helps.

Yours Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote:

It was about 2.5M users and 500K items with 25M actions over 6 months of
data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng <[email protected]> wrote:

If I remember right, a highlight of the 0.8 release is an online clustering
algorithm. I'm not sure whether it can be used in an item-based recommender,
but this is definitely something I would like to pursue. It's probably the
only advantage a non-Hadoop implementation can offer in the future.

Many non-Hadoop recommenders are pretty fast, but the existing in-memory
GenericDataModel and FileDataModel are largely implemented for sandboxes;
IMHO they are the culprit of the scalability problem.

May I ask about the scale of your dataset? How many ratings does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:

Well, with item-based the only problem is new items. New users can
immediately be served by the model (although this is not well supported by
the API in Mahout). For the majority of use cases I have seen, it is
perfectly fine to have a short delay until new items "enter" the recommender;
usually this happens after a retraining in batch. You have to care for
cold-start and collect some interactions anyway.


2013/7/18 Pat Ferrel <[email protected]>

Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender, not only a
factorized one. Ted talks about using Solr for this, which we're
experimenting with alongside Myrrix. I suspect Solr works, but it does
require a bit of tinkering and doesn't have quite the same set of options--no
LLR similarity, for instance.

On the same subject, I recently attended a workshop in Seattle for UAI 2013
where Walmart reported similar results using a factorized recommender. They
had to increase the number of factors past the point where it would perform
well. Along the way they saw increasing performance measuring precision
offline. They eventually gave up on a factorized solution. This decision
seems odd, but anyway… Both in Walmart's case and in our data set, the items
are quite diverse. The best idea is probably to create different recommenders
for separate parts of the catalog, but if you create one model on all items,
our intuition is that item-based works better than factorized. Again,
caveat--no A/B tests to support this yet.

Doing an online item-based recommender would quickly run into scaling
problems, no? We put together the simple Mahout in-memory version and it
could not really handle more than a down-sampled few months of our data.
Down-sampling lost us 20% of our precision scores, so we moved to the Hadoop
version. Now we have use cases for an online recommender that handles
anonymous new users, and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter <[email protected]>
wrote:

Hi Pat

I think we should provide simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as an argument. For item-based recommenders, it's
straightforward to compute recommendations; for user-based ones, you have to
search through all users once; for latent factor models, you have to fold
the user vector into the low-dimensional space.
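For the latent-factor case, that fold-in amounts to a small regularized
least-squares solve against the item factor matrix. Here is a self-contained
sketch of the idea; the class and method names are mine, not Mahout's API,
and a real implementation would use a linear algebra library:

```java
import java.util.Arrays;

/**
 * Hypothetical sketch of folding an anonymous user's ratings into an
 * existing latent factor model: solve u = argmin ||r - V_s u||^2 + lambda
 * ||u||^2 over the items the user has rated, then score items as dot(V, u).
 */
public class FoldIn {

  /** itemFactors: one k-dim row per item; ratedItems/ratings: the history. */
  static double[] foldIn(double[][] itemFactors, int[] ratedItems,
                         double[] ratings, double lambda) {
    int k = itemFactors[0].length;
    double[][] a = new double[k][k];   // V_s^T V_s + lambda * I
    double[] b = new double[k];        // V_s^T r
    for (int t = 0; t < ratedItems.length; t++) {
      double[] v = itemFactors[ratedItems[t]];
      for (int i = 0; i < k; i++) {
        b[i] += v[i] * ratings[t];
        for (int j = 0; j < k; j++) {
          a[i][j] += v[i] * v[j];
        }
      }
    }
    for (int i = 0; i < k; i++) {
      a[i][i] += lambda;
    }
    return solve(a, b);                // u, the folded-in user vector
  }

  /** Gaussian elimination with partial pivoting for the small k-by-k system. */
  static double[] solve(double[][] a, double[] b) {
    int n = b.length;
    for (int col = 0; col < n; col++) {
      int pivot = col;
      for (int row = col + 1; row < n; row++) {
        if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) pivot = row;
      }
      double[] tmpRow = a[col]; a[col] = a[pivot]; a[pivot] = tmpRow;
      double tmp = b[col]; b[col] = b[pivot]; b[pivot] = tmp;
      for (int row = col + 1; row < n; row++) {
        double f = a[row][col] / a[col][col];
        b[row] -= f * b[col];
        for (int j = col; j < n; j++) a[row][j] -= f * a[col][j];
      }
    }
    double[] x = new double[n];
    for (int i = n - 1; i >= 0; i--) {
      double s = b[i];
      for (int j = i + 1; j < n; j++) s -= a[i][j] * x[j];
      x[i] = s / a[i][i];
    }
    return x;
  }

  public static void main(String[] args) {
    // Toy model: 3 items, k = 2 factors.
    double[][] itemFactors = {{1.0, 0.0}, {0.0, 1.0}, {1.0, 1.0}};
    double[] u = foldIn(itemFactors, new int[]{0, 1},
                        new double[]{5.0, 1.0}, 0.0);
    System.out.println(Arrays.toString(u));  // prints [5.0, 1.0]
    // Score the unseen item 2 as the dot product with its factors:
    System.out.println(u[0] * itemFactors[2][0]
        + u[1] * itemFactors[2][1]);         // prints 6.0
  }
}
```

The cost is one pass over the user's history plus a k-by-k solve, so it is
cheap enough to do per request.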

I think Sean already added this method in Myrrix, and I have some code for
my kornakapi project (a simple web layer for Mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel <[email protected]>

  May I ask how you plan to support model updates and 'anonymous' users?
I assume the latent factors model is still calculated offline in batch mode,
and then there are periodic updates? How are the updates handled? Do you
plan to require batch model refactorization for any update? Or perform some
partial update, maybe by just transforming new data into the LF space
already in place and then doing a full refactorization every so often in
batch mode?

By 'anonymous users' I mean users with some history that is not yet
incorporated in the LF model. This could be history from a new user asked
to pick a few items to start the rec process, or an old user with some new
action history not yet in the model. Are you going to allow passing the
entire history vector, or userID + incremental new history, to the
recommender? I hope so.

For what it's worth, we did a comparison of Mahout item-based CF to Mahout
ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6
months of data. The data was purchase data from a diverse ecom source with a
large variety of products, from electronics to clothes. We found item-based
CF did far better than ALS. As we increased the number of latent factors the
results got better, but they were never within 10% of item-based (we used
MAP as the offline metric). Not sure why, but maybe it has to do with the
diversity of the item types.

I understand that a full item-based online recommender has very different
tradeoffs, and anyway others may not have seen this disparity of results.
Furthermore, we don't have A/B test results yet to validate the offline
metric.

On Jul 16, 2013, at 2:41 PM, Gokhan Capan <[email protected]> wrote:

Peng,

This is the reason I separated out the DataModel and only put the learner
stuff there. The learner I mentioned yesterday just stores the parameters,
(noOfUsers + noOfItems) * noOfLatentFactors of them, and does not care
where preferences are stored.

I, kind of, agree with the multi-level DataModel approach: one for iterating
over "all" preferences, and one for when one wants to deploy a recommender
and perform a lot of top-N recommendation tasks.

(Or one DataModel with a strategy that might reduce existing memory
consumption while still providing fast access, I am not sure. Let me try a
matrix-backed DataModel approach.)
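To make the parameter count concrete: a learner in this style holds just two
dense factor matrices, independent of the preference store. A hypothetical
sketch (class and field names are mine, not the actual code Gokhan pushed):

```java
/**
 * Hypothetical sketch of a learner that stores only its parameters:
 * (noOfUsers + noOfItems) * noOfLatentFactors doubles, regardless of
 * where the preferences themselves live.
 */
public class FactorizationParameters {
  final double[][] userFactors;  // noOfUsers x k
  final double[][] itemFactors;  // noOfItems x k

  FactorizationParameters(int noOfUsers, int noOfItems, int k) {
    userFactors = new double[noOfUsers][k];
    itemFactors = new double[noOfItems][k];
  }

  /** Raw footprint of the parameters alone: 8 bytes per double. */
  long parameterBytes() {
    long k = userFactors[0].length;
    return (userFactors.length + itemFactors.length) * k * 8L;
  }

  public static void main(String[] args) {
    // Netflix-scale example: ~480K users, ~17.7K items, 50 factors.
    FactorizationParameters p =
        new FactorizationParameters(480_000, 17_700, 50);
    System.out.println(p.parameterBytes() + " bytes");  // ~190 MB of doubles
  }
}
```

So the model itself stays small even at Netflix scale; the memory question in
this thread is really about where and how the preferences are held.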

Gokhan


On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter <[email protected]>
wrote:

  I completely agree. Netflix is less than one gigabyte in a smart
representation; 12x more memory is a no-go. The techniques used in
FactorizablePreferences allow a much more memory-efficient representation,
tested on the KDD Music dataset, which is approximately 2.5 times Netflix
and fits into 3GB with that approach.


2013/7/16 Ted Dunning <[email protected]>

  Netflix is a small dataset.  12G for that seems quite excessive.
Note also that this is before you have done any work.

Ideally, 100million observations should take << 1GB.
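A back-of-the-envelope check on these sizes, using assumed encodings (not any
actual Mahout layout), shows where the gap between 12G and <<1GB comes from:

```java
// Rough memory arithmetic for 100 million observations under three
// hypothetical encodings; the per-entry costs are assumptions.
public class MemoryEstimate {
  public static void main(String[] args) {
    long n = 100_000_000L;
    // Boxed HashMap-of-objects entries easily cost ~50 bytes each.
    long boxed = n * 50;
    // Packed primitive triples: int user + int item + float rating.
    long packedTriples = n * (4 + 4 + 4);
    // CSR-style: users implicit via offsets, int item + byte-quantized rating.
    long csr = n * (4 + 1);
    System.out.println(boxed / 1_000_000 + " MB boxed");          // 5000 MB
    System.out.println(packedTriples / 1_000_000 + " MB packed"); // 1200 MB
    System.out.println(csr / 1_000_000 + " MB CSR-style");        // 500 MB
  }
}
```

Only the compressed, index-based layouts get anywhere near the <<1GB figure;
object-per-preference storage is an order of magnitude off.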

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <[email protected]>

wrote:

The second idea is indeed splendid: we should separate time-complexity-first
and space-complexity-first implementations. What I'm not quite sure about is
whether we really need to create two interfaces instead of one. Personally,
I think 12G of heap space is not that high, right? Most new laptops can
already handle that (emphasis on laptop). And if we replaced the hash map
(the culprit of the high memory consumption) with a list/linked list, it
would simply degrade lookup to a linear search, O(n), not too bad either.
The current DataModel is the result of careful thought and has undergone
extensive testing; it is easier to expand on top of it than to subvert it.
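The space-vs-time tradeoff of dropping the hash map can be sketched like this
(a hypothetical structure, not Mahout's GenericDataModel): parallel primitive
arrays per user, with a linear scan instead of a hash lookup.

```java
/**
 * Hypothetical compact preference store: a user's preferences as parallel
 * primitive arrays. Lookup degrades to an O(n) scan, but there is no
 * per-entry boxing or hash-table overhead.
 */
public class CompactPreferenceArray {
  final long[] itemIDs;   // sorted or insertion order, ids of rated items
  final float[] values;   // values[i] is the rating for itemIDs[i]

  CompactPreferenceArray(long[] itemIDs, float[] values) {
    this.itemIDs = itemIDs;
    this.values = values;
  }

  /** O(n) linear search; a HashMap would be O(1) at several times the memory. */
  Float getPreference(long itemID) {
    for (int i = 0; i < itemIDs.length; i++) {
      if (itemIDs[i] == itemID) return values[i];
    }
    return null;  // user has no preference for this item
  }
}
```

Keeping the item IDs sorted would allow a binary search, recovering O(log n)
lookup without giving up the compact layout.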





