Hi Pat,

Thanks for your detailed reply. I tried out some of your suggestions, especially the in-memory recommender using the Mahout libraries, and it works well for now. Once the data grows large enough to hurt the in-memory recommender's performance, we are hoping to move to the distributed recommender.
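For anyone following along, the single-machine approach discussed in this thread boils down to: build item-item similarities once, then score a user's unseen items against the similarities of the items in their history. Below is a minimal pure-Python sketch of that idea over implicit ratings — this is an illustration, not the Mahout API; Jaccard similarity stands in for whatever similarity measure you configure, and the data shapes are made up for the example:

```python
# Sketch of single-machine item-based recommendation over implicit ratings
# (the idea behind Mahout's in-memory recommender; not Mahout code).
from collections import defaultdict
from itertools import combinations

def item_similarities(user_items):
    """Jaccard similarity between item pairs, from per-user item sets."""
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)
    sims = defaultdict(dict)
    for a, b in combinations(item_users, 2):
        inter = len(item_users[a] & item_users[b])
        if inter:
            union = len(item_users[a] | item_users[b])
            sims[a][b] = sims[b][a] = inter / union
    return sims

def recommend(user, user_items, sims, n=3):
    """Score each unseen item by summed similarity to the user's items."""
    seen = user_items[user]
    scores = defaultdict(float)
    for item in seen:
        for other, s in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += s
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Because the similarity table lives in memory, a single user's recommendations are just a few dictionary lookups — which is why this stays fast until the item-item matrix itself outgrows RAM.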
Thanks for your help,
Warunika

On Fri, Jun 6, 2014 at 7:10 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> In the original case you were using a Hadoop command-line tool, which
> produces all recs for all users, not just one. Since the recs are ALL
> calculated they just need to be stored and retrieved—very fast. Put them in
> a DB; when the user visits, show the precalculated recs, which is as fast
> as a single DB fetch.
>
> Sebastian talks about the in-memory recommender for one machine and
> medium-sized datasets. It will produce recommendations for a specific user
> very fast as long as the data is not too big, at which point the
> performance drops off.
>
> The third way to do this is to break out the core data structure created
> by ItemSimilarityJob, translate the Mahout IDs into your item IDs, and
> index it with Solr. Then you can use a user's history as a query to Solr in
> realtime, which will return an ordered list of recs. This scales
> indefinitely as Solr scales and is very fast. It is also nice because you
> can bias results towards metadata like category, genre, or catalog section
> with the query; no new model creation required. You'll find a tool to help
> with this in mahout/examples or here:
> https://github.com/pferrel/solr-recommender
>
> One of those should fit; they are all fast in the right environment. They
> all do require some background non-realtime model calculation, but this is
> done only periodically.
>
>
> On Jun 6, 2014, at 5:33 AM, Sebastian Schelter <s...@apache.org> wrote:
>
> Mahout has single machine and distributed recommenders.
>
>
> On 06/06/2014 02:31 PM, Warunika Ranaweera wrote:
> > I agree with your suggestion, though. I have already implemented a Java
> > recommender and it performed better. But, due to scalability problems
> > that are predicted to occur in the future, we thought of moving to
> > Mahout. However, it seems like, for now, it's better to go with the
> > single machine implementation.
> >
> > Thanks for your suggestions,
> > Warunika
> >
> >
> > On Fri, Jun 6, 2014 at 3:36 PM, Sebastian Schelter <s...@apache.org> wrote:
> >
> >> 1M ratings take up something like 20 megabytes. This is a data size
> >> where it does not make any sense to use Hadoop. Just try the single
> >> machine implementation.
> >>
> >> --sebastian
> >>
> >>
> >> On 06/06/2014 12:01 PM, Warunika Ranaweera wrote:
> >>
> >>> Hi Sebastian,
> >>>
> >>> Thanks for your prompt response. It's just a sample data set from our
> >>> database, and it may expand up to 6 million ratings. Since the
> >>> performance was low for a smaller data set, I thought it would be even
> >>> worse for a larger data set. As per your suggestion, I also ran the
> >>> same command on 1 million user ratings for approx. 6,000 users and got
> >>> the same performance level.
> >>>
> >>> What is the average running time for the Mahout distributed
> >>> recommendation job on 1 million ratings? Does it usually take more
> >>> than 1 minute?
> >>>
> >>> Thanks in advance,
> >>> Warunika
> >>>
> >>>
> >>> On Fri, Jun 6, 2014 at 2:42 PM, Sebastian Schelter <s...@apache.org> wrote:
> >>>
> >>>> You should not use Hadoop for such a tiny dataset. Use the
> >>>> GenericItemBasedRecommender on a single machine in Java.
> >>>>
> >>>> --sebastian
> >>>>
> >>>>
> >>>> On 06/06/2014 11:10 AM, Warunika Ranaweera wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am using Mahout's recommenditembased algorithm on a data set with
> >>>>> nearly 10,000 (implicit) user ratings. This is the command I used:
> >>>>>
> >>>>> mahout recommenditembased --input ratings.csv --output recommendation \
> >>>>>   --usersFile users.dat --tempDir temp \
> >>>>>   --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>   --numRecommendations 3
> >>>>>
> >>>>> Although the output is successfully generated, this process takes
> >>>>> nearly 7 minutes to produce recommendations for a single user.
> >>>>> The Hadoop cluster has 8 nodes, and the machine on which Mahout is
> >>>>> invoked is an AWS EC2 c3.2xlarge server. When I tracked the MapReduce
> >>>>> jobs, I noticed that no more than one machine is utilized at a time,
> >>>>> and the *recommenditembased* command runs 9 MapReduce jobs
> >>>>> altogether, with approx. 45 seconds taken per job.
> >>>>>
> >>>>> Since the performance is too slow for real-time recommendations, it
> >>>>> would be really helpful to know whether I'm missing any additional
> >>>>> commands or configurations that would enable faster performance.
> >>>>>
> >>>>> Thanks,
> >>>>> Warunika
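For anyone curious what SIMILARITY_LOGLIKELIHOOD actually computes: it scores each item pair by the log-likelihood ratio (Dunning's G² statistic) of its 2x2 user co-occurrence table, so pairs that co-occur more often than chance predicts get high scores. A minimal pure-Python sketch of that statistic — an illustration of the formula, not Mahout's implementation:

```python
# Log-likelihood ratio (Dunning's G^2) for a 2x2 co-occurrence table:
# k11 = users who interacted with both items, k12/k21 = only one of them,
# k22 = users who interacted with neither.  G^2 = 2 * sum(O * ln(O / E)),
# where E is the count expected if the two items were independent.
import math

def llr(k11, k12, k21, k22):
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)   # marginal totals for item A yes/no
    cols = (k11 + k21, k12 + k22)   # marginal totals for item B yes/no
    g2 = 0.0
    for obs, r, c in ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
                      (k21, rows[1], cols[0]), (k22, rows[1], cols[1])):
        if obs > 0:  # a zero cell contributes nothing (0 * ln 0 -> 0)
            g2 += 2.0 * obs * math.log(obs * total / (r * c))
    return g2
```

Independent counts score near zero, and the score grows as the evidence of co-occurrence grows — which is why this measure works well on implicit, count-style data like yours.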