In the original case you were using the Hadoop command line tool, which produces recs for all
users, not just one. Since the recs are all precalculated, they only need to be stored and
retrieved, which is very fast. Put them in a DB; when the user visits, show the precalculated
recs, which costs no more than a single DB fetch.
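
For example, the lookup side can be as simple as this sketch, assuming the batch output has
already been loaded into a table keyed by user ID (the table and column names are made up
for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class PrecalculatedRecs {
  // Fetch the precalculated recs for one user, best score first.
  // "recommendations", "user_id", "item_id", and "score" are hypothetical names.
  public static List<Long> recsForUser(Connection conn, long userId) throws SQLException {
    List<Long> itemIds = new ArrayList<Long>();
    String sql = "SELECT item_id FROM recommendations WHERE user_id = ? ORDER BY score DESC";
    PreparedStatement ps = conn.prepareStatement(sql);
    try {
      ps.setLong(1, userId);
      ResultSet rs = ps.executeQuery();
      while (rs.next()) {
        itemIds.add(rs.getLong("item_id"));
      }
    } finally {
      ps.close();
    }
    return itemIds;
  }
}

That single query is the only cost at request time; the heavy lifting stays in the periodic
Hadoop run.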

Sebastian is talking about the in-memory recommender for one machine and medium-sized
datasets. It will produce recommendations for a specific user very quickly as long as the
data is not too big, beyond which performance drops off.
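
A minimal sketch of that single-machine path, using the same log-likelihood similarity the
Hadoop job uses (the file name and user ID here are placeholders):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class InMemoryRecommender {
  public static void main(String[] args) throws Exception {
    // Loads the whole ratings file into memory, so this only works while it fits.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // Top 3 recs for one user, computed on demand.
    List<RecommendedItem> recs = recommender.recommend(123L, 3);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " " + rec.getValue());
    }
  }
}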

The third way to do this is to take the core data structure created by ItemSimilarityJob,
translate the Mahout IDs into your item IDs, and index it with Solr. Then you can use a
user's history as a query against Solr in realtime, which will return an ordered list of
recs. This scales indefinitely as Solr scales and is very fast. It is also nice because you
can bias results toward metadata like category, genre, or catalog section with the query; no
new model creation is required. You'll find a tool to help with this in mahout/examples or
here: https://github.com/pferrel/solr-recommender
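
The query side then looks roughly like this with SolrJ; the Solr URL, the example item IDs,
and the "indicators" field holding the indexed similarity data are assumptions for
illustration, not something the tool dictates:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrRecQuery {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/recs");
    // The user's recent item IDs become the query against the indicator field.
    SolrQuery query = new SolrQuery("indicators:(item42 item77 item101)");
    // Optional: restrict results by metadata like genre; a boost could bias instead of filter.
    query.addFilterQuery("genre:scifi");
    query.setRows(10);
    QueryResponse rsp = solr.query(query);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("id"));
    }
  }
}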

One of those should fit; they are all fast in the right environment. They all require some
background, non-realtime model calculation, but that is only done periodically.


On Jun 6, 2014, at 5:33 AM, Sebastian Schelter <s...@apache.org> wrote:

Mahout has single machine and distributed recommenders.


On 06/06/2014 02:31 PM, Warunika Ranaweera wrote:
> I agree with your suggestion though. I have already implemented a Java
> recommender and it performed better. But, due to scalability problems that
> are predicted to occur in the future, we thought of moving to Mahout.
> However, it seems like, for now, it's better to go with the single machine
> implementation.
> 
> Thanks for your suggestions,
> Warunika
> 
> 
> 
> On Fri, Jun 6, 2014 at 3:36 PM, Sebastian Schelter <s...@apache.org> wrote:
> 
>> 1M ratings take up something like 20 megabytes. This is a datasize where
>> it does not make any sense to use Hadoop. Just try the single machine
>> implementation.
>> 
>> --sebastian
>> 
>> 
>> 
>> 
>> On 06/06/2014 12:01 PM, Warunika Ranaweera wrote:
>> 
>>> Hi Sebastian,
>>> 
>>> Thanks for your prompt response. It's just a sample data set from our
>>> database and it may expand up to 6 million ratings. Since the performance
>>> was low for a smaller data set, I thought it would be even worse for a
>>> larger data set. As per your suggestion, I also applied the same command on
>>> 1 million user ratings for approx. 6000 users and got the same performance
>>> level.
>>> 
>>> What is the average running time for the Mahout distributed recommendation
>>> job on 1 million ratings? Does it usually take more than 1 minute?
>>> 
>>> Thanks in advance,
>>> Warunika
>>> 
>>> 
>>> On Fri, Jun 6, 2014 at 2:42 PM, Sebastian Schelter <s...@apache.org>
>>> wrote:
>>> 
>>>> You should not use Hadoop for such a tiny dataset. Use the
>>>> GenericItemBasedRecommender on a single machine in Java.
>>>> 
>>>> --sebastian
>>>> 
>>>> 
>>>> On 06/06/2014 11:10 AM, Warunika Ranaweera wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am using Mahout's recommenditembased algorithm on a data set with
>>>>> nearly
>>>>> 10,000 (implicit) user ratings. This is the command I used:
>>>>> *mahout recommenditembased --input ratings.csv --output recommendation
>>>>> --usersFile users.dat --tempDir temp --similarityClassname
>>>>> SIMILARITY_LOGLIKELIHOOD --numRecommendations 3*
>>>>> 
>>>>> 
>>>>> Although the output is successfully generated, this process takes nearly
>>>>> 7 minutes to produce recommendations for a single user. The Hadoop cluster
>>>>> has 8 nodes and the machine on which Mahout is invoked is an AWS EC2
>>>>> c3.2xlarge server. When I tracked the mapreduce jobs, I noticed that more
>>>>> than one machine is *not* utilized at a time, and the *recommenditembased*
>>>>> command takes 9 mapreduce jobs altogether with approx. 45 seconds taken
>>>>> per job.
>>>>> 
>>>>> Since the performance is too slow for real-time recommendations, it would
>>>>> be really helpful to know whether I'm missing any additional commands or
>>>>> configurations that enable faster performance.
>>>>> 
>>>>> Thanks,
>>>>> Warunika
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 

