Re: Mahout performance issues

Manuel Blechschmidt Thu, 01 Dec 2011 00:52:44 -0800

Hello,

On 01.12.2011, at 09:37, Sebastian Schelter wrote:


> Daniel, can you plot two curves showing the distribution of
> interactions per user and the distribution of interactions per item? I
> think we need to get a better picture of your data first.
> 
> Generally I always recommend to use precomputed similarities. You can
> still serve new users with realtime recommendations, the only
> disadvantages are the higher complexity and a delayed inclusion of new
> items.
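
To make the two requested curves concrete, here is a minimal sketch of how the distributions could be computed before plotting. It assumes the interactions are available as (user, item) pairs and that the data is boolean, so each pair counts once. The function name and the sample data are hypothetical and not taken from Daniel's application.

```python
from collections import Counter


def interaction_distributions(interactions):
    """Return two histograms mapping 'number of interactions' to
    'how many users (or items) have exactly that many'."""
    unique_pairs = set(interactions)  # boolean data: count each pair once
    per_user = Counter(user for user, _ in unique_pairs)
    per_item = Counter(item for _, item in unique_pairs)
    user_hist = Counter(per_user.values())
    item_hist = Counter(per_item.values())
    return dict(user_hist), dict(item_hist)


# Hypothetical sample data: (user, item) pairs
sample = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"),
    ("u3", "a"), ("u3", "b"),
    ("u4", "a"),
]

user_hist, item_hist = interaction_distributions(sample)
for count in sorted(user_hist):
    print(f"{user_hist[count]} user(s) with {count} interaction(s)")
for count in sorted(item_hist):
    print(f"{item_hist[count]} item(s) with {count} interaction(s)")
```

Plotting both histograms on a log-log scale usually makes a long tail of 'heavy users' or very popular items easy to spot.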

In this paper:
Fast Online Learning through Offline Initialization for Time-sensitive 
Recommendation
http://users.cs.fiu.edu/~lzhen001/activities/KDD_USB_key_2010/docs/p703.pdf
Deepak Agarwal et al. describe how to include new items quickly 
in the recommendations.
This approach is used for personalizing the news stories on the Yahoo start page.

@Daniel: I would also recommend profiling your application with JVisualVM:
http://visualvm.java.net/

After I did this with my recommender, I figured out that the default cache size 
for item similarities was
far too low. The details are described in this ticket:
https://issues.apache.org/jira/browse/MAHOUT-905
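
To illustrate why a cache that is too small hurts so much, here is a minimal sketch of an LRU cache around a pairwise similarity function. It is not Mahout's implementation: the class name, the toy Jaccard similarity, and the data are all hypothetical, and the similarity is assumed to be symmetric.

```python
from collections import OrderedDict


class CachingSimilarity:
    """Illustrative LRU cache around a symmetric similarity function."""

    def __init__(self, similarity, max_entries):
        self._similarity = similarity
        self._max_entries = max_entries
        self._cache = OrderedDict()
        self.misses = 0  # number of times the similarity had to be recomputed

    def __call__(self, a, b):
        key = (a, b) if a <= b else (b, a)  # one entry per unordered pair
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.misses += 1
        value = self._similarity(*key)
        self._cache[key] = value
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return value


# Hypothetical boolean data: item -> set of users who interacted with it
USERS_BY_ITEM = {"a": {1, 2, 3}, "b": {2, 3}, "c": {3, 4}, "d": {1}}


def jaccard(a, b):
    users_a, users_b = USERS_BY_ITEM[a], USERS_BY_ITEM[b]
    return len(users_a & users_b) / len(users_a | users_b)


def count_misses(max_entries):
    """Replay the same four item pairs three times and count recomputations."""
    cached = CachingSimilarity(jaccard, max_entries)
    pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")] * 3
    for a, b in pairs:
        cached(a, b)
    return cached.misses


print(count_misses(2))  # 12: the cache thrashes, every lookup is recomputed
print(count_misses(4))  # 4: each pair is computed once and then reused
```

With a cache smaller than the working set, LRU eviction throws away every entry just before it is needed again, so the cache adds overhead without saving any computation.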


> 
> --sebastian

/Manuel

> 
> 2011/11/30 Sean Owen:
>> The simple answer is that:
>> 
>> Mahout absorbed a non-distributed recommender project called Taste, which
>> scales up to a point which may be sufficient for a lot of users. It
>> certainly is a lot simpler. Yes it is realistic to do near-real-time
>> recommendations, though it gets harder and harder and requires more tuning,
>> tradeoffs and optimization as this thread shows.
>> 
>> The rest, written from scratch, is almost all distributed and Hadoop-based,
>> including distributed re-implementations of the same algorithms.
>> 
>> On Wed, Nov 30, 2011 at 8:23 PM, Dan Beaulieu wrote:
>> 
>>> Hi all, this is a tangent and can mostly be ignored by the people
>>> interested in this problem.
>>> 
>>> I'm new to Machine Learning and especially Mahout. Following this
>>> discussion has made me a bit confused.
>>> Isn't Mahout used for large datasets where it makes sense to distribute the
>>> work? Why then isn't anyone pointing
>>> out that the problem may be the use of one single Mahout node? Is it
>>> because it's boolean based? Is it because the data set
>>> isn't really that large?
>>> 
>>> Even if for whatever reason a single node will do for this case, is it
>>> really expected that the recommendation process would finish in less than
>>> half a second?
>>> This makes me think if that is the expectation then the data set is
>>> actually small and Mahout might be overkill...
>>> 
>>> What obvious piece of the Mahout puzzle am I missing?
>>> 
>>> Thanks.
>>> 
>>> Dan
>>> 
>>> On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen wrote:
>>> 
>>>> Have you used CachingItemSimilarity? That will hold common similarities
>>>> in memory. It's a lot easier than pre-computing and might help.
>>>> 
>>>> I think something like your change is a good one (Sebastian what do you
>>>> think) in that it gives you the ultimate lever to control how many
>>>> candidates are evaluated. That ought to make it go as fast as you like,
>>>> but it trades off quality. Still I'd be really surprised if there's no
>>>> viable middle ground -- this works fine at smaller scale, where 100s of
>>>> candidates are evaluated, perhaps, and you can use your lever to get to
>>>> 100s of candidates at your scale too. Is that still both slow and
>>>> inaccurate?
>>>> 
>>>> On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar wrote:
>>>> 
>>>>> I just tested the app with Mahout 0.6.
>>>>> There seems to be a small performance improvement, but still
>>>>> recommendations for the 'heavy users' take between 1-5 seconds.
>>>>> 
>>>>> 
>>>> 
>>> 

-- 
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B
