Hi Oren,

If you use an item-based approach, it's sufficient to keep only the top-k
similar items per item (with k somewhere between 25 and 100). That means the
data to hold in memory is num_items * k data points.
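As a back-of-the-envelope check of that num_items * k figure (all numbers here are hypothetical; the ~12 bytes per entry assumes a 4-byte item index, a 4-byte similarity score, and some JVM overhead):

```java
public class NeighborhoodMemoryEstimate {
    public static void main(String[] args) {
        long numItems = 1_000_000L;  // hypothetical catalog size
        int k = 50;                  // neighbors kept per item

        // Rough cost per entry: 4-byte item index + 4-byte similarity
        // score, padded to ~12 bytes for array/object overhead on a JVM.
        long bytesPerEntry = 12;

        long entries = numItems * k;            // 50 million entries
        long totalBytes = entries * bytesPerEntry;

        System.out.printf("%d entries, ~%d MB%n", entries, totalBytes >> 20);
    }
}
```

Even a million-item catalog with k=50 stays in the hundreds of megabytes, which matches Sebastian's point that a few hundred million such data points fit in a few gigabytes of RAM.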
While this is a theoretical limitation, it should not be a problem in practical
scenarios, as you can easily fit a few hundred million such data points in a
few gigabytes of RAM.

--sebastian

On 05.04.2012 09:27, Razon, Oren wrote:
> Ok, so here is the point I'm still not getting.
>
> The architecture we are talking about pushes the heavy computation into
> offline work, for which I could utilize the Hadoop part. Besides that,
> there is an online part, which retrieves recommendations from the
> pre-computed results, or even does some more computation online to try to
> adjust the recommendations to the current user context.
> But as you said about the JDBC connector, in order to serve recommendations
> fast, the online recommender needs to have all pre-computed results
> in memory. So isn't that a limitation to scaling up? It means that as my
> recommender service grows, I will need more memory in order to hold it all
> in memory in the online part...
> Am I wrong here?
>
> -----Original Message-----
> From: Sean Owen [mailto:sro...@gmail.com]
> Sent: Thursday, March 22, 2012 17:57
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> A distributed and a non-distributed recommender are really quite separate.
> They perform the same task in quite different ways. I don't think you
> would mix them per se.
>
> It depends on what you mean by a model-based recommender... I would call
> the matrix-factorization-based and clustering-based approaches
> "model-based" in the sense that they assume the existence of some
> underlying structure and discover it. There are no Bayesian-style
> approaches in the code.
>
> They scale in different ways; I am not sure they are unilaterally a
> solution to scale, no. I do agree in general that these have good scaling
> properties for real-world use cases, like the matrix-factorization
> approaches.
>
>
> A "real" scalable architecture would have a real-time component and a big
> distributed computation component.
> Mahout has elements of both and can be the basis for piecing that
> together, but it's not a question of strapping together the distributed
> and non-distributed implementations. It's a bit harder than that.
>
>
> I am actually quite close to being ready to show off something in this
> area -- I have been working separately on a more complete rec system that
> has the real-time element, but integrated directly with a distributed
> element to handle the large-scale computation. I think this is typical of
> big data architectures. You have (at least) a real-time distributed
> "Serving Layer" and a big distributed batch "Computation Layer". More on
> this in about... 2 weeks.
>
>
> On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <oren.ra...@intel.com> wrote:
>> Hi Sean,
>> Thanks for your fast response. I really appreciate the quality of your
>> book ("Mahout in Action") and the support you give in such forums.
>> Just to clarify my second question...
>> I want to build a recommender framework that will support different use
>> cases. So my intention is to have both a distributed and a
>> non-distributed solution in one framework. The question is: is it a good
>> design to put them both on the same machine (one of the machines in the
>> Hadoop cluster)?
>>
>> BTW, another question: it seems that a good solution to recommender
>> scalability would be to use model-based recommenders.
>> Given that, I wonder why there are so few model-based recommenders,
>> especially considering the fact that Mahout already contains several
>> implemented data mining models?
>>
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:sro...@gmail.com]
>> Sent: Thursday, March 22, 2012 13:51
>> To: user@mahout.apache.org
>> Subject: Re: Mahout beginner questions...
>>
>> 1. These are the JDBC-related classes. For example, see
>> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/.
>>
>> 2. The distributed and non-distributed code are quite separate.
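The top-k item-item precomputation discussed earlier in this thread can be sketched in plain Java, with no Mahout dependency. This is a brute-force toy (cosine similarity over tiny in-memory rating vectors, made-up data); a real system would run this step as a distributed batch job:

```java
import java.util.*;

// Illustrative only (not Mahout code): for each item, keep only the k
// most similar other items, using cosine similarity over item rating
// vectors shaped as itemID -> {userID -> rating}.
public class TopKItemSimilarity {

    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;  // shared user
        }
        for (double v : b.values()) nb += v * v;
        return dot == 0 ? 0 : dot / Math.sqrt(na * nb);
    }

    static Map<Integer, List<Integer>> topK(
            Map<Integer, Map<Integer, Double>> items, int k) {
        Map<Integer, List<Integer>> neighbors = new HashMap<>();
        for (Integer i : items.keySet()) {
            List<Integer> others = new ArrayList<>();
            for (Integer j : items.keySet()) if (!j.equals(i)) others.add(j);
            // Sort candidates by descending similarity to item i, keep k.
            others.sort((x, y) -> Double.compare(
                    cosine(items.get(i), items.get(y)),
                    cosine(items.get(i), items.get(x))));
            neighbors.put(i, others.subList(0, Math.min(k, others.size())));
        }
        return neighbors;
    }

    public static void main(String[] args) {
        // Toy ratings: items 1 and 2 are rated similarly by the same users.
        Map<Integer, Map<Integer, Double>> items = new HashMap<>();
        items.put(1, Map.of(100, 5.0, 101, 4.0));
        items.put(2, Map.of(100, 5.0, 101, 5.0));
        items.put(3, Map.of(102, 1.0));
        System.out.println(topK(items, 1));
    }
}
```

Only the resulting neighbor lists (num_items * k entries) need to live in memory in the online part; the full rating matrix does not.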
>> At this scale I don't think you can use the non-distributed code to a
>> meaningful degree. For example, you could pre-compute item-item
>> similarities over this data and use a non-distributed item-based
>> recommender, but you probably have enough items that this will strain
>> memory. You would probably be looking at pre-computing recommendations
>> in batch.
>>
>> 3. I don't think Netezza will help much here. It's still not fast enough
>> at this scale to use with a real-time recommender (nothing is). If it's
>> just a place you store data to feed into Hadoop, it's not adding value.
>> All the JDBC-related integrations ultimately load data into memory, and
>> that's out of the question with 500M data points.
>>
>> I'd also suggest you think about whether you "really" have 500M data
>> points. Often you can know that most of the data is noise or not useful,
>> and can get useful recommendations on a fraction of the data (maybe 5M).
>> That makes a lot of things easier.
>>
>> On Thu, Mar 22, 2012 at 11:35 AM, Razon, Oren <oren.ra...@intel.com> wrote:
>>> Hi,
>>> As a data mining developer who needs to build a recommender engine POC
>>> (proof of concept) to support several future use cases, I've found the
>>> Mahout framework an appealing place to start. But as I'm new to Mahout
>>> and Hadoop in general, I have a couple of questions...
>>>
>>> 1. In "Mahout in Action", section 3.2.5 (Database-based data) says:
>>> "...Several classes in Mahout's recommender implementation will attempt
>>> to push computations into the database for performance...". I've looked
>>> in the documentation and in the code itself, but haven't found any
>>> reference to which calculations are pushed into the DB. Could you
>>> please explain what can be done inside the DB?
>>> 2.
>>> My future use will include use cases with small-to-medium data volumes
>>> (where I guess the non-distributed algorithms will do the job), but
>>> also use cases that involve huge amounts of data (over 500,000,000
>>> ratings). From my understanding, this is where the distributed code
>>> should come in handy. My question here is: since I will need to use
>>> both distributed and non-distributed code, how can I build a good
>>> design here?
>>> Should I build two different solutions on different machines? Could I
>>> do part of the job distributed (for example, the similarity
>>> calculation) and have the output used by the non-distributed code? Is
>>> that a BKM (best-known method)? Also, if I deploy the entire Mahout
>>> code on a Hadoop environment, what does that mean for the
>>> non-distributed code -- will it all run as a different Java process on
>>> the name node?
>>> 3. For now, besides the Hadoop cluster we are building, we have some
>>> strong SQL machines (a Netezza appliance) that can handle big
>>> (structured) data and integrate well with third-party analytics
>>> providers or Java development, but don't include a rich recommender
>>> framework like Mahout. I'm trying to understand how I could utilize
>>> both solutions (Netezza and Mahout) to handle big data recommender
>>> system use cases. I thought maybe to move the data into Netezza, do all
>>> the data manipulation and transformation there, and in the end prepare
>>> a file containing the classic data model structure needed by Mahout.
>>> But can you think of a better solution or architecture? Maybe keeping
>>> the data only inside Netezza and extracting it to Mahout via JDBC when
>>> needed? I will be glad to hear your ideas :)
>>>
>>> Thanks,
>>> Oren
>>>
>>> ---------------------------------------------------------------------
>>> Intel Electronics Ltd.
>>>
>>> This e-mail and any attachments may contain confidential material for
>>> the sole use of the intended recipient(s).
>>> Any review or distribution by others is strictly prohibited. If you are
>>> not the intended recipient, please contact the sender and delete all
>>> copies.
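[Editor's note] The "classic data model structure" Oren refers to is the plain CSV layout that Mahout's FileDataModel consumes: one userID,itemID,preference triple per line. A minimal, Mahout-free sketch of producing and re-parsing such a file with the JDK (file name and ratings are made up for illustration):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

// Write a tiny ratings file in the userID,itemID,preference format and
// parse it back, just to show the layout a Netezza export would target.
public class DataModelFileSketch {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("ratings", ".csv");
        Files.write(path, List.of(
                "1,101,5.0",   // user 1 rated item 101 with preference 5.0
                "1,102,3.0",
                "2,101,2.0"));

        int count = 0;
        for (String line : Files.readAllLines(path)) {
            String[] parts = line.split(",");
            long userId = Long.parseLong(parts[0]);
            long itemId = Long.parseLong(parts[1]);
            float preference = Float.parseFloat(parts[2]);
            count++;
        }
        System.out.println(count + " ratings parsed");  // 3 ratings parsed
        Files.delete(path);
    }
}
```

Any system that can emit this three-column file (Netezza included) can feed a Mahout data model without a live JDBC connection.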