Re: Need some pointers towards algorithm capabilities.

Niels Basjes Thu, 16 Dec 2010 08:29:28 -0800

Hi,

2010/12/16 Sean Owen <[email protected]>:
> This should address much of that:
> https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation
> As does the book yes.


Thanks. I already found the book. I think I should get it. :)

> The answer also depends on whether you want a Hadoop-based job, for which
> there is not much written yet, or the more mature non-distributed version.

The dataset is really huge. I'm currently up against >5M
items, >2M users and several million "item views" per day.
All of those dimensions are growing.
Can the non-distributed way handle that kind of volume?

> For Hadoop there is an item-based recommender with pluggable similarity
> metrics.
> For non-distributed there's much more.

Is there an overview? ... or is that in your book?

> Explicit vs implicit ratings and factoring in time are "out of scope" -- you
> can collect data however you want and adjust it however you want. What
> matters is what's fed into the framework. So the answer is, yes, that's
> supported just fine, but not within the framework itself.

This information really helps me! Thanks.
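
If I understand this correctly, both the implicit-to-explicit conversion and
the "aging" would be done outside Mahout, before the data is fed in. A minimal
sketch of what I think that looks like, in plain Python. The half-life value,
the function names and the user,item,value output layout are my own
assumptions for illustration, not anything taken from the framework:

```python
from collections import defaultdict

HALF_LIFE_DAYS = 30.0  # assumed value; would need tuning on real data


def decayed_preferences(views, now_day, half_life=HALF_LIFE_DAYS):
    """Turn raw item views into one aged preference value per (user, item).

    views: iterable of (user_id, item_id, day_number) tuples.
    Each view contributes 0.5 ** (age / half_life), so a view that is
    one half-life old counts half as much as a view from today.
    """
    prefs = defaultdict(float)
    for user, item, day in views:
        age = now_day - day
        prefs[(user, item)] += 0.5 ** (age / half_life)
    return dict(prefs)


def to_csv_lines(prefs):
    """Format the preferences as user,item,value lines (assumed layout)."""
    return [f"{u},{i},{v:.4f}" for (u, i), v in sorted(prefs.items())]
```

For example, with views [(1, 10, 100), (1, 10, 70), (2, 11, 40)] and
now_day=100, user 1 gets 1.5 for item 10 (one fresh view plus one view that
is a half-life old) and user 2 gets 0.25 for item 11.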

> Long-tail issues are fine if you choose the right algorithms, and they are
> going to vary a lot in this regard.

That's what I would expect.

> For example a user-based or item-based
> recommender with log-likelihood similarity, or an SVD-based recommender,
> doesn't suffer as much from these issues.

Ok .... you just lost this newbie ...
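
(For other newcomers reading along: as far as I can tell, log-likelihood
similarity is built on a log-likelihood ratio computed over a 2x2 table of
co-occurrence counts, which asks how unlikely it is that two items overlap
by chance alone. Below is a minimal sketch of that ratio in plain Python. It
is my own illustration and not Mahout's code; the function names and the
meaning I give to k11..k22 are assumptions.)

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts: N ln N - sum(k ln k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: users who viewed both item A and item B
    k12: users who viewed A but not B
    k21: users who viewed B but not A
    k22: users who viewed neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb tiny negative values from rounding.
    return max(0.0, 2.0 * (row + col - mat))
```

A table where the two items look independent, such as (1, 1, 1, 1), scores
about zero, while a table where they always occur together, such as
(10, 0, 0, 10), scores high. Because the score depends on the counts and not
only on the proportions, overlap backed by very few views gets a low score,
which I assume is why it is mentioned in the long-tail context.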

> The distributed version is necessarily batch -- it's Hadoop after all.
> The non-distributed version is all real-time, incremental updates.

Can the incremental versions handle the volume I mentioned?

> I am not sure what you mean by preprocessing daily data sets?

I've seen that in some cases of "large volume processing" it pays to
do part of the processing per "day" of input data and then aggregate
over the whole period. As I have almost no understanding of the kind of
algorithms used here, this remark could very well be meaningless.
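
To make the idea concrete, this is roughly what I have in mind, sketched in
plain Python. Nothing here is Mahout-specific; the function names and the
optional per-day weights are made up for illustration:

```python
from collections import Counter


def aggregate_day(events):
    """Reduce one day's raw (user_id, item_id) view events to counts."""
    return Counter(events)


def combine_days(daily_counts, weights=None):
    """Merge per-day counts into one total per (user_id, item_id).

    daily_counts: list of Counters, one per day.
    weights: optional list of per-day multipliers (e.g. to count older
             days less); defaults to 1.0 for every day.
    """
    total = Counter()
    for idx, day in enumerate(daily_counts):
        w = 1.0 if weights is None else weights[idx]
        for key, n in day.items():
            total[key] += w * n
    return total
```

Each day's raw log would be reduced once and stored; the combined view over
any period is then a cheap merge of the small per-day results instead of a
full pass over all raw events.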

Niels Basjes

> On Thu, Dec 16, 2010 at 10:35 AM, Niels Basjes <[email protected]> wrote:
>
>> Hi,
>>
>> I'm an experienced developer yet a complete newbie when it comes to
>> the type of functionality Mahout offers.
>> I do have some experience in designing and writing MapReduce jobs in
>> Hadoop so I understand enough of the base platform that is used.
>>
>> I want to investigate and experiment with both the item-item and
>> user-item recommenders in Mahout.
>> The problem I have is that I'm having a hard time finding a good
>> overview of the capabilities of the various algorithms.
>> Most Wikipedia articles immediately dive into the underlying
>> mathematical foundations instead of the practical implications I'm
>> looking for.
>> I've also not been able to find what I'm looking for in the Mahout
>> Wiki/Confluence.
>>
>> Putting it simply I'm looking for a comprehensive overview of
>> - the kind of things you can and cannot do with the various algorithms
>> that are available in Mahout.
>>    - can it handle both "Implicit" and "Explicit" ratings.
>>    - can I 'age' the relevance of the (implicit) ratings? I.e.
>> Recommendations should change with the changing taste.
>>    - how does it handle long tail situations (with millions of
>> items, most are only viewed/rated very infrequently)?
>> - what are the scaling properties of the algorithms.
>>    - is it always batch or can I do real-time incremental updates
>> with new ratings?
>>    - can I preprocess a daily dataset and then combine the daily sets
>> into "what I need"?
>>
>> Thanks for any info you can point me to.
>>
>> --
>> Best regards,
>>
>> Niels Basjes
>>
>



-- 
Met vriendelijke groeten,

Niels Basjes
