Thanks for the answer, but still...
I will still need to keep the rating matrix in memory, so that I can use the
ratings a user gave to items together with the item similarities.

-----Original Message-----
From: Sebastian Schelter [mailto:s...@apache.org] 
Sent: Thursday, April 05, 2012 10:34
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

Hi Oren,

If you use an item-based approach, it's sufficient to use the top-k
similar items per item (with k somewhere between 25 and 100). That means
the data to hold in memory is num_items * k data points.

While this is a theoretical limitation, it should not be a problem in
practical scenarios, as you can easily fit several hundred million such
data points in a few gigabytes of RAM.
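To make that estimate concrete, here is a quick back-of-the-envelope check in plain Python. The 16 bytes per entry is an assumption (an 8-byte item id plus an 8-byte double similarity); real overhead per entry will vary with the data structure used.

```python
# Rough memory estimate for holding the top-k similar items per item.
# Assumes 16 bytes per (item_id, similarity) entry: 8-byte long + 8-byte double.
def topk_memory_bytes(num_items, k, bytes_per_entry=16):
    return num_items * k * bytes_per_entry

# e.g. 2 million items, k = 50 neighbors each:
mem = topk_memory_bytes(2_000_000, 50)
print(mem)  # 1600000000 bytes, i.e. about 1.6 GB
```

So even a catalog of millions of items stays within commodity RAM, which is the point Sebastian is making.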

--sebastian


On 05.04.2012 09:27, Razon, Oren wrote:
> Ok, so here is the point I'm still not getting.
> 
> The architecture we are talking about pushes the heavy computation to 
> offline work; for that I could utilize the Hadoop part.
> Besides that, there is an online part, which will retrieve recommendations from 
> the pre-computed results, or even do some more computation online to try 
> to adjust the recommendations to the current user context. 
> But as you said for the JDBC connector, in order to serve recommendations 
> fast, the online recommender needs to have all pre-computed results in memory. 
> So isn't that a limitation on scaling up? It means that as my recommender 
> service grows, I will need more memory in order to hold it all in memory 
> in the online part...
> Am I wrong here?  
> 
> -----Original Message-----
> From: Sean Owen [mailto:sro...@gmail.com] 
> Sent: Thursday, March 22, 2012 17:57
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
> 
> A distributed and non-distributed recommender are really quite
> separate. They perform the same task in quite different ways. I don't
> think you would mix them per se.
> 
> Depends on what you mean by a model-based recommender... I would call
> the matrix-factorization-based and clustering-based approaches
> "model-based" in the sense that they assume the existence of some
> underlying structure and discover it. There are no Bayesian-style
> approaches in the code.
> 
> They scale in different ways; I am not sure they are universally a
> solution to scale, no. I do agree in general that these have good
> scaling properties for real-world use cases, like the
> matrix-factorization approaches.
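To illustrate what "model-based" means here, the following is a minimal matrix-factorization sketch in plain Python. This is not Mahout's implementation; the data, rank, and hyperparameters are made up for illustration. The idea is that each user and item gets a small vector of latent factors, and a missing rating is predicted from their dot product, i.e. from the discovered underlying structure.

```python
import random

def factorize(ratings, num_users, num_items, rank=2, lr=0.01, reg=0.02, epochs=1000):
    """Learn user/item latent factors by SGD on observed (user, item, rating) triples."""
    random.seed(0)
    U = [[random.uniform(0.5, 1.0) for _ in range(rank)] for _ in range(num_users)]
    V = [[random.uniform(0.5, 1.0) for _ in range(rank)] for _ in range(num_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(rank))
            err = r - pred
            for f in range(rank):
                uf, vf = U[u][f], V[i][f]
                # Gradient step with a small L2 penalty on the factors.
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

# Six observed ratings in a 3x3 user-item matrix; the rest are unknown.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 4.0), (2, 2, 5.0)]
U, V = factorize(ratings, num_users=3, num_items=3)

# Predict an unobserved cell (user 0, item 2) from the learned structure:
pred = sum(U[0][f] * V[2][f] for f in range(2))
```

The model compresses the rating matrix into num_users * rank + num_items * rank numbers, which is one reason factorization approaches scale well.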
> 
> 
> A "real" scalable architecture would have a real-time component and a
> big distributed computation component. Mahout has elements of both and
> can be the basis for piecing that together, but it's not a question of
> strapping together the distributed and non-distributed implementation.
> It's a bit harder than that.
> 
> 
> I am actually quite close to being ready to show off something in this
> area -- I have been working separately on a more complete rec system
> that has both the real-time element but integrated directly with a
> distributed element to handle the large-scale computation. I think
> this is typical of big data architectures. You have (at least) a
> real-time distributed "Serving Layer" and a big distributed batch
> "Computation Layer". More on this in about... 2 weeks.
> 
> 
> On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <oren.ra...@intel.com> wrote:
>> Hi Sean,
>> Thanks for your fast response; I really appreciate the quality of your book 
>> ("Mahout in Action") and the support you give in forums like this.
>> Just to clarify my second question...
>> I want to build a recommender framework that will support different use 
>> cases. So my intention is to have both a distributed and a non-distributed 
>> solution in one framework. The question is: is it good design to put them 
>> both on the same machine (one of the machines in the Hadoop cluster)?
>>
>> BTW... another question: it seems that a good solution to recommender 
>> scalability would be to use model-based recommenders.
>> That said, I wonder why there are so few model-based recommenders, 
>> especially considering that Mahout already contains several data mining 
>> models?
>>
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:sro...@gmail.com]
>> Sent: Thursday, March 22, 2012 13:51
>> To: user@mahout.apache.org
>> Subject: Re: Mahout beginner questions...
>>
>> 1. These are the JDBC-related classes. For example see
>> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/
>>
>> 2. The distributed and non-distributed code are quite separate. At
>> this scale I don't think you can use the non-distributed code to a
>> meaningful degree. For example you could pre-compute item-item
>> similarities over this data and use a non-distributed item-based
>> recommender but you probably have enough items that this will strain
>> memory. You would probably be looking at pre-computing recommendations
>> in batch.
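The precompute-then-serve split Sean describes can be sketched as follows, in plain Python with made-up data structures. In practice the offline step would run on Hadoop over the full data set, and only the compact top-k neighborhoods would be loaded by the online recommender.

```python
import heapq
import math
from collections import defaultdict

def topk_similarities(ratings_by_item, k):
    """Offline step: cosine similarity between item rating vectors, keep top-k per item."""
    items = list(ratings_by_item)
    sims = defaultdict(list)
    for a_pos, a in enumerate(items):
        for b in items[a_pos + 1:]:
            va, vb = ratings_by_item[a], ratings_by_item[b]
            common = set(va) & set(vb)  # users who rated both items
            if not common:
                continue
            dot = sum(va[u] * vb[u] for u in common)
            norm_a = math.sqrt(sum(x * x for x in va.values()))
            norm_b = math.sqrt(sum(x * x for x in vb.values()))
            s = dot / (norm_a * norm_b)
            sims[a].append((s, b))
            sims[b].append((s, a))
    return {i: heapq.nlargest(k, lst) for i, lst in sims.items()}

def recommend(user_ratings, topk, n=3):
    """Online step: score unseen items by similarity-weighted ratings of the user's items."""
    scores, weights = defaultdict(float), defaultdict(float)
    for item, r in user_ratings.items():
        for s, other in topk.get(item, []):
            if other in user_ratings:
                continue  # never recommend something already rated
            scores[other] += s * r
            weights[other] += abs(s)
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [i for _, i in ranked[:n]]

# Toy data: item -> {user: rating}
ratings_by_item = {
    'A': {'u1': 5.0, 'u2': 4.0},
    'B': {'u1': 5.0, 'u2': 5.0},
    'C': {'u3': 5.0},
    'D': {'u1': 2.0, 'u3': 4.0},
}
topk = topk_similarities(ratings_by_item, k=2)
recs = recommend({'A': 5.0, 'C': 1.0}, topk, n=1)
```

The point is that only `topk` needs to live in memory online; the expensive pairwise pass happens offline, in batch.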
>>
>> 3. I don't think Netezza will help much here. It's still not fast
>> enough at this scale to use with a real-time recommender (nothing is).
>> If it's just a place you store data to feed into Hadoop it's not
>> adding value. All the JDBC-related integrations ultimately load data
>> into memory and that's out of the question with 500M data points.
>>
>> I'd also suggest you have a think about whether you "really" have 500M
>> data points. Often you can know that most of the data is noise or not
>> useful, and can get useful recommendations on a fraction of the data
>> (maybe 5M). That makes a lot of things easier.
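The pruning idea can be as simple as dropping ratings from near-inactive users and rarely-rated items before any modeling. A sketch, with assumed thresholds (a real pipeline might repeat the filter until the counts stabilize, since dropping items changes user counts and vice versa):

```python
from collections import Counter

def prune(ratings, min_user_ratings=2, min_item_ratings=2):
    """Keep only ratings from users and items with enough activity to be informative."""
    user_counts = Counter(u for u, _, _ in ratings)
    item_counts = Counter(i for _, i, _ in ratings)
    return [(u, i, r) for u, i, r in ratings
            if user_counts[u] >= min_user_ratings
            and item_counts[i] >= min_item_ratings]

# User 3 rated only one item, and item 'c' was rated only once; both are dropped.
data = [(1, 'a', 5), (1, 'b', 4), (2, 'a', 3), (2, 'b', 2), (3, 'c', 1)]
kept = prune(data)
```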
>>
>> On Thu, Mar 22, 2012 at 11:35 AM, Razon, Oren <oren.ra...@intel.com> wrote:
>>> Hi,
>>> As a data mining developer who needs to build a recommender engine POC 
>>> (Proof Of Concept) to support several future use cases, I've found the Mahout 
>>> framework an appealing place to start. But as I'm new to Mahout and 
>>> Hadoop in general, I have a couple of questions...
>>>
>>> 1.      In "Mahout in Action", under section 3.2.5 (Database-based data), it 
>>> says: "...Several classes in Mahout's recommender implementation will 
>>> attempt to push computations into the database for performance...". I've 
>>> looked in the documentation and in the code itself, but didn't find 
>>> any reference to which calculations are pushed into 
>>> the DB. Could you please explain what can be done inside the DB?
>>> 2.      My future use will include use cases with small-medium data volumes 
>>> (where I guess the non-distributed algorithms will do the job), but also 
>>> use cases that involve huge amounts of data (over 500,000,000 ratings). 
>>> From my understanding, this is where the distributed code should come in 
>>> handy. My question is: since I will need to use both the distributed and 
>>> the non-distributed code, how can I build a good design here?
>>>      Should I build two different solutions on different machines? Could I 
>>> do part of the job distributed (for example, the similarity calculation) and 
>>> use the output in the non-distributed code? Is that a BKM? Also, if I 
>>> deploy the entire Mahout code on a Hadoop environment, what does that mean for 
>>> the non-distributed code? Will it all run as a separate Java process on 
>>> the name node?
>>> 3.      As of now, besides the Hadoop cluster we are building, we have 
>>> some strong SQL machines (Netezza appliances) that can handle big 
>>> (structured) data and integrate well with 3rd-party analytics 
>>> providers and Java development, but don't include such a rich 
>>> recommender framework as Mahout. I'm trying to understand how I could 
>>> utilize both solutions (Netezza & Mahout) to handle big data recommender 
>>> system use cases. I thought maybe to move the data into Netezza, do all the 
>>> data manipulation and transformation there, and in the end prepare a file that 
>>> contains the classic data model structure needed by Mahout. But can you 
>>> think of a better solution \ architecture? Maybe keeping the data only inside 
>>> Netezza and extracting it to Mahout using JDBC when needed? I will be glad 
>>> to hear your ideas :)
>>>
>>> Thanks,
>>> Oren
>>>
>>> ---------------------------------------------------------------------
>>> Intel Electronics Ltd.
>>>
>>> This e-mail and any attachments may contain confidential material for
>>> the sole use of the intended recipient(s). Any review or distribution
>>> by others is strictly prohibited. If you are not the intended
>>> recipient, please contact the sender and delete all copies.

