Hi Oren,

If you use an item-based approach, it's sufficient to keep only the top-k
similar items per item (with k somewhere between 25 and 100). That means the
data to hold in memory is num_items * k data points.
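As a back-of-the-envelope check of that num_items * k figure (all numbers here are hypothetical; the ~12 bytes per entry assumes a 4-byte item index, a 4-byte similarity score, and some JVM overhead):

```java
public class NeighborhoodMemoryEstimate {
    public static void main(String[] args) {
        long numItems = 1_000_000L;  // hypothetical catalog size
        int k = 50;                  // neighbors kept per item

        // Rough cost per entry: 4-byte item index + 4-byte similarity
        // score, padded to ~12 bytes for array/object overhead on a JVM.
        long bytesPerEntry = 12;

        long entries = numItems * k;            // 50 million entries
        long totalBytes = entries * bytesPerEntry;

        System.out.printf("%d entries, ~%d MB%n", entries, totalBytes >> 20);
    }
}
```

Even a million-item catalog with k=50 stays in the hundreds of megabytes, which matches Sebastian's point that a few hundred million such data points fit in a few gigabytes of RAM.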
While this is a theoretical limitation, it should not be a problem in practical
scenarios, as you can easily fit a few hundred million such data points in a
few gigabytes of RAM.

--sebastian

On 05.04.2012 09:27, Razon, Oren wrote:
> Ok, so here is the point I'm still not getting.
>
> The architecture we are talking about pushes the heavy computation into
> offline work, for which I could utilize the Hadoop part. Besides that,
> there is an online part, which retrieves recommendations from the
> pre-computed results, or even does some more computation online to try to
> adjust the recommendations to the current user context.
> But as you said about the JDBC connector, in order to serve recommendations
> fast, the online recommender needs to have all pre-computed results
> in memory. So isn't that a limitation to scaling up? It means that as my
> recommender service grows, I will need more memory in order to hold it all
> in memory in the online part...
> Am I wrong here?
>
> -----Original Message-----
> From: Sean Owen [mailto:sro...@gmail.com]
> Sent: Thursday, March 22, 2012 17:57
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> A distributed and a non-distributed recommender are really quite separate.
> They perform the same task in quite different ways. I don't think you
> would mix them per se.
>
> It depends on what you mean by a model-based recommender... I would call
> the matrix-factorization-based and clustering-based approaches
> "model-based" in the sense that they assume the existence of some
> underlying structure and discover it. There are no Bayesian-style
> approaches in the code.
>
> They scale in different ways; I am not sure they are unilaterally a
> solution to scale, no. I do agree in general that these have good scaling
> properties for real-world use cases, like the matrix-factorization
> approaches.
>
>
> A "real" scalable architecture would have a real-time component and a big
> distributed computation component.
> Mahout has elements of both and can be the basis for piecing that
> together, but it's not a question of strapping together the distributed
> and non-distributed implementations. It's a bit harder than that.
>
>
> I am actually quite close to being ready to show off something in this
> area -- I have been working separately on a more complete rec system that
> has the real-time element, but integrated directly with a distributed
> element to handle the large-scale computation. I think this is typical of
> big data architectures. You have (at least) a real-time distributed
> "Serving Layer" and a big distributed batch "Computation Layer". More on
> this in about... 2 weeks.
>
>
> On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <oren.ra...@intel.com> wrote:
>> Hi Sean,
>> Thanks for your fast response. I really appreciate the quality of your
>> book ("Mahout in Action") and the support you give in such forums.
>> Just to clarify my second question...
>> I want to build a recommender framework that will support different use
>> cases. So my intention is to have both a distributed and a
>> non-distributed solution in one framework. The question is: is it a good
>> design to put them both on the same machine (one of the machines in the
>> Hadoop cluster)?
>>
>> BTW, another question: it seems that a good solution to recommender
>> scalability would be to use model-based recommenders.
>> Given that, I wonder why there are so few model-based recommenders,
>> especially considering the fact that Mahout already contains several
>> implemented data mining models?
>>
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:sro...@gmail.com]
>> Sent: Thursday, March 22, 2012 13:51
>> To: user@mahout.apache.org
>> Subject: Re: Mahout beginner questions...
>>
>> 1. These are the JDBC-related classes. For example, see
>> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/.
>>
>> 2. The distributed and non-distributed code are quite separate.
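The top-k item-item precomputation discussed earlier in this thread can be sketched in plain Java, with no Mahout dependency. This is a brute-force toy (cosine similarity over tiny in-memory rating vectors, made-up data); a real system would run this step as a distributed batch job:

```java
import java.util.*;

// Illustrative only (not Mahout code): for each item, keep only the k
// most similar other items, using cosine similarity over item rating
// vectors shaped as itemID -> {userID -> rating}.
public class TopKItemSimilarity {

    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;  // shared user
        }
        for (double v : b.values()) nb += v * v;
        return dot == 0 ? 0 : dot / Math.sqrt(na * nb);
    }

    static Map<Integer, List<Integer>> topK(
            Map<Integer, Map<Integer, Double>> items, int k) {
        Map<Integer, List<Integer>> neighbors = new HashMap<>();
        for (Integer i : items.keySet()) {
            List<Integer> others = new ArrayList<>();
            for (Integer j : items.keySet()) if (!j.equals(i)) others.add(j);
            // Sort candidates by descending similarity to item i, keep k.
            others.sort((x, y) -> Double.compare(
                    cosine(items.get(i), items.get(y)),
                    cosine(items.get(i), items.get(x))));
            neighbors.put(i, others.subList(0, Math.min(k, others.size())));
        }
        return neighbors;
    }

    public static void main(String[] args) {
        // Toy ratings: items 1 and 2 are rated similarly by the same users.
        Map<Integer, Map<Integer, Double>> items = new HashMap<>();
        items.put(1, Map.of(100, 5.0, 101, 4.0));
        items.put(2, Map.of(100, 5.0, 101, 5.0));
        items.put(3, Map.of(102, 1.0));
        System.out.println(topK(items, 1));
    }
}
```

Only the resulting neighbor lists (num_items * k entries) need to live in memory in the online part; the full rating matrix does not.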
>> At this scale I don't think you can use the non-distributed code to a
>> meaningful degree. For example, you could pre-compute item-item
>> similarities over this data and use a non-distributed item-based
>> recommender, but you probably have enough items that this will strain
>> memory. You would probably be looking at pre-computing recommendations
>> in batch.
>>
>> 3. I don't think Netezza will help much here. It's still not fast enough
>> at this scale to use with a real-time recommender (nothing is). If it's
>> just a place you store data to feed into Hadoop, it's not adding value.
>> All the JDBC-related integrations ultimately load data into memory, and
>> that's out of the question with 500M data points.
>>
>> I'd also suggest you think about whether you "really" have 500M data
>> points. Often you can know that most of the data is noise or not useful,
>> and can get useful recommendations on a fraction of the data (maybe 5M).
>> That makes a lot of things easier.
>>
>> On Thu, Mar 22, 2012 at 11:35 AM, Razon, Oren <oren.ra...@intel.com> wrote:
>>> Hi,
>>> As a data mining developer who needs to build a recommender engine POC
>>> (proof of concept) to support several future use cases, I've found the
>>> Mahout framework an appealing place to start. But as I'm new to Mahout
>>> and Hadoop in general, I have a couple of questions...
>>>
>>> 1. In "Mahout in Action", section 3.2.5 (Database-based data) says:
>>> "...Several classes in Mahout's recommender implementation will attempt
>>> to push computations into the database for performance...". I've looked
>>> in the documentation and in the code itself, but haven't found any
>>> reference to which calculations are pushed into the DB. Could you
>>> please explain what can be done inside the DB?
>>> 2.
>>> My future use will include use cases with small-to-medium data volumes
>>> (where I guess the non-distributed algorithms will do the job), but
>>> also use cases that involve huge amounts of data (over 500,000,000
>>> ratings). From my understanding, this is where the distributed code
>>> should come in handy. My question here is: since I will need to use
>>> both distributed and non-distributed code, how can I build a good
>>> design here?
>>> Should I build two different solutions on different machines? Could I
>>> do part of the job distributed (for example, the similarity
>>> calculation) and have the output used by the non-distributed code? Is
>>> that a BKM (best-known method)? Also, if I deploy the entire Mahout
>>> code on a Hadoop environment, what does that mean for the
>>> non-distributed code -- will it all run as a different Java process on
>>> the name node?
>>> 3. For now, besides the Hadoop cluster we are building, we have some
>>> strong SQL machines (a Netezza appliance) that can handle big
>>> (structured) data and integrate well with third-party analytics
>>> providers or Java development, but don't include a rich recommender
>>> framework like Mahout. I'm trying to understand how I could utilize
>>> both solutions (Netezza and Mahout) to handle big data recommender
>>> system use cases. I thought maybe to move the data into Netezza, do all
>>> the data manipulation and transformation there, and in the end prepare
>>> a file containing the classic data model structure needed by Mahout.
>>> But can you think of a better solution or architecture? Maybe keeping
>>> the data only inside Netezza and extracting it to Mahout via JDBC when
>>> needed? I will be glad to hear your ideas :)
>>>
>>> Thanks,
>>> Oren
>>>
>>> ---------------------------------------------------------------------
>>> Intel Electronics Ltd.
>>>
>>> This e-mail and any attachments may contain confidential material for
>>> the sole use of the intended recipient(s).
>>> Any review or distribution by others is strictly prohibited. If you are
>>> not the intended recipient, please contact the sender and delete all
>>> copies.
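[Editor's note] The "classic data model structure" Oren refers to is the plain CSV layout that Mahout's FileDataModel consumes: one userID,itemID,preference triple per line. A minimal, Mahout-free sketch of producing and re-parsing such a file with the JDK (file name and ratings are made up for illustration):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

// Write a tiny ratings file in the userID,itemID,preference format and
// parse it back, just to show the layout a Netezza export would target.
public class DataModelFileSketch {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("ratings", ".csv");
        Files.write(path, List.of(
                "1,101,5.0",   // user 1 rated item 101 with preference 5.0
                "1,102,3.0",
                "2,101,2.0"));

        int count = 0;
        for (String line : Files.readAllLines(path)) {
            String[] parts = line.split(",");
            long userId = Long.parseLong(parts[0]);
            long itemId = Long.parseLong(parts[1]);
            float preference = Float.parseFloat(parts[2]);
            count++;
        }
        System.out.println(count + " ratings parsed");  // 3 ratings parsed
        Files.delete(path);
    }
}
```

Any system that can emit this three-column file (Netezza included) can feed a Mahout data model without a live JDBC connection.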