RE: Mahout beginner questions...

2012-04-05 Thread Razon, Oren
: Thursday, March 22, 2012 13:51 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... 1. These are the JDBC-related classes. For example see MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/ 2. The distributed and non-distributed code are quite separate

Re: Mahout beginner questions...

2012-04-05 Thread Sebastian Schelter
? -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Thursday, March 22, 2012 17:57 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... A distributed and non-distributed recommender are really quite separate. They perform the same task in quite different

RE: Mahout beginner questions...

2012-04-05 Thread Razon, Oren
To: user@mahout.apache.org Subject: Re: Mahout beginner questions... Hi Oren, If you use an item-based approach, it's sufficient to use the top-k similar items per item (with k somewhere between 25 and 100). That means the data to hold in memory is num_items * k data points. While
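
[Editor's sketch, not part of the thread.] The top-k idea in the snippet above can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions: the function names `truncate_top_k` and `model_size` and the toy similarity values are hypothetical, and none of this is Mahout's API. It only shows why keeping k neighbours per item bounds the in-memory model at num_items * k data points.

```python
import heapq


def truncate_top_k(similarities, k):
    """Keep only the k most similar neighbours for each item.

    similarities: dict mapping item -> {other_item: similarity}.
    Returns: dict mapping item -> list of (other_item, similarity),
    sorted from most to least similar.
    """
    return {
        item: heapq.nlargest(k, neighbours.items(), key=lambda pair: pair[1])
        for item, neighbours in similarities.items()
    }


def model_size(num_items, k):
    """Upper bound on the number of stored similarity data points."""
    return num_items * k


# Toy similarity table with made-up values (hypothetical data).
toy = {
    "A": {"B": 0.9, "C": 0.2, "D": 0.5},
    "B": {"A": 0.9, "C": 0.4, "D": 0.1},
}
top2 = truncate_top_k(toy, k=2)
```

With, say, one million items and k = 50 (both figures are illustrative assumptions), `model_size(1_000_000, 50)` gives 50 million data points, regardless of how many users or interactions went into the off-line computation.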

Re: Mahout beginner questions...

2012-04-05 Thread Sebastian Schelter
...@apache.org] Sent: Thursday, April 05, 2012 10:34 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... Hi Oren, If you use an item-based approach, it's sufficient to use the top-k similar items per item (with k somewhere between 25 and 100). That means the data to hold in memory

Re: Mahout beginner questions...

2012-04-05 Thread Sean Owen
It might or might not be interesting to comment on this discussion in light of the new product/project I mentioned last night, Myrrix. It's definitely an example of precisely this two-layered architecture we've been discussing on this thread. http://myrrix.com/design/ The nice thing about a

RE: Mahout beginner questions...

2012-03-26 Thread Razon, Oren
Subject: Re: Mahout beginner questions... On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren oren.ra...@intel.com wrote: ... The system I need should of course give the recommendation itself in no time. ... But because I'm talking about very large scales, I guess that I want to push much of my

Re: Mahout beginner questions...

2012-03-26 Thread Sean Owen
I'm sure he's referring to the off-line model-building bit, not an online component. On Mon, Mar 26, 2012 at 9:27 AM, Razon, Oren oren.ra...@intel.com wrote: By saying: At Veoh, we built our models from several billion interactions on a tiny cluster you meant that you used the distributed

RE: Mahout beginner questions...

2012-03-26 Thread Razon, Oren
Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Monday, March 26, 2012 11:48 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... I'm sure he's referring to the off-line model-building bit, not an online component. On Mon, Mar 26, 2012 at 9:27 AM, Razon, Oren oren.ra

Re: Mahout beginner questions...

2012-03-26 Thread Sean Owen
necessarily need to load the entire intermediate file (similarity results) into the memory?! -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Monday, March 26, 2012 11:48 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... I'm sure he's referring

Re: Mahout beginner questions...

2012-03-26 Thread Ted Dunning
: Mahout beginner questions... On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren oren.ra...@intel.com wrote: ... The system I need should of course give the recommendation itself in no time. ... But because I'm talking about very large scales, I guess that I want to push much of my model

RE: Mahout beginner questions...

2012-03-26 Thread Razon, Oren
from the DB into your memory. So what are the pros of doing so? When should I consider it? -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Monday, March 26, 2012 15:52 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... No. I meant that I used

Re: Mahout beginner questions...

2012-03-26 Thread Sean Owen
Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Monday, March 26, 2012 15:52 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... No.  I meant that I used the same sort of combined offline and online processes that I have recommended to you.  The cluster did

RE: Mahout beginner questions...

2012-03-25 Thread Razon, Oren
, meaning I could scale up), or is it because of the recommendation time it takes? Thanks, Oren -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Thursday, March 22, 2012 17:57 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... A distributed and non

RE: Mahout beginner questions...

2012-03-25 Thread Razon, Oren
[mailto:sro...@gmail.com] Sent: Sunday, March 25, 2012 21:25 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... It is memory. You will need a pretty large heap to put 100M data in memory -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM). You can go bigger if you

Re: Mahout beginner questions...

2012-03-25 Thread Ted Dunning
It sounds like the original poster isn't clear about the division between off-line and on-line work. Almost all production recommendation systems have a large off-line component which analyzes logs of behavior and produces a recommendation model. This model typically consists of item-item
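
[Editor's sketch, not part of the thread.] The off-line/on-line split described in the snippet above can be sketched as follows. This is a generic weighted-average item-based scorer written for illustration, assuming the off-line job has already produced a per-item neighbour list; the function name `recommend`, the data layout, and the toy values are hypothetical and are not taken from Mahout or from the posters' systems.

```python
from collections import defaultdict


def recommend(user_ratings, item_neighbours, n=10):
    """Score unseen items for one user from a precomputed item-item model.

    user_ratings: {item: rating} for the current user (on-line input).
    item_neighbours: {item: [(other_item, similarity), ...]} (off-line output).
    Returns up to n (item, predicted_rating) pairs, best first.
    """
    scores = defaultdict(float)
    weights = defaultdict(float)
    for item, rating in user_ratings.items():
        for other, sim in item_neighbours.get(item, []):
            if other in user_ratings:
                continue  # do not recommend items the user already rated
            scores[other] += sim * rating
            weights[other] += abs(sim)
    ranked = sorted(
        ((other, scores[other] / weights[other])
         for other in scores if weights[other] > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]


# Toy model and user history with made-up values (hypothetical data).
neighbours = {"A": [("C", 0.5), ("D", 0.25)], "B": [("C", 0.5)]}
ratings = {"A": 4.0, "B": 2.0}
result = recommend(ratings, neighbours)
```

The on-line step only touches the neighbour lists of the items this user has rated, which is why it can answer quickly even when the model was built from a very large interaction log.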

Re: Mahout beginner questions...

2012-03-25 Thread Ted Dunning
the recommendations in advance (refresh them every X minutes/hours) and always recommend using the most up-to-date recommendations, right?! -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Sunday, March 25, 2012 21:25 To: user@mahout.apache.org Subject: Re: Mahout beginner

Re: Mahout beginner questions...

2012-03-25 Thread Sean Owen
, March 25, 2012 21:25 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... It is memory. You will need a pretty large heap to put 100M data in memory -- probably 4GB, if not a little more (so the machine would need 8GB+ RAM). You can go bigger if you have more memory

RE: Mahout beginner questions...

2012-03-25 Thread Razon, Oren
the reading from the DB offline so I'm not too afraid of losing some of my speed... -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Sunday, March 25, 2012 21:35 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... Not really. See my previous posting

Re: Mahout beginner questions...

2012-03-25 Thread Ted Dunning
On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren oren.ra...@intel.com wrote: ... The system I need should of course give the recommendation itself in no time. ... But because I'm talking about very large scales, I guess that I want to push much of my model computation to offline mode (which

RE: Mahout beginner questions...

2012-03-25 Thread Razon, Oren
@mahout.apache.org Subject: Re: Mahout beginner questions... On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren oren.ra...@intel.com wrote: ... The system I need should of course give the recommendation itself in no time. ... But because I'm talking about very large scales, I guess that I want to push much of my

Re: Mahout beginner questions...

2012-03-25 Thread Ted Dunning
On Sun, Mar 25, 2012 at 4:02 PM, Razon, Oren oren.ra...@intel.com wrote: So let's continue with your example... I will compute an item-to-item (I2I) similarity matrix on Hadoop and then do online recommendation based on it and the user's ranked items. Yes. So where will the online part sit? Is it

Re: Mahout beginner questions...

2012-03-25 Thread Sean Owen
, March 25, 2012 21:35 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... Not really. See my previous posting. The best way to get fast recommendations is to use an item-based recommender. Pre-computing recommendations for all users is not usually a win because you wind up

Re: Mahout beginner questions...

2012-03-22 Thread Sean Owen
1. These are the JDBC-related classes. For example see MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/ 2. The distributed and non-distributed code are quite separate. At this scale I don't think you can use the non-distributed code to a meaningful degree. For example you could

RE: Mahout beginner questions...

2012-03-22 Thread Razon, Oren
@mahout.apache.org Subject: Re: Mahout beginner questions... 1. These are the JDBC-related classes. For example see MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/ 2. The distributed and non-distributed code are quite separate. At this scale I don't think you can use the non-distributed code

Re: Mahout beginner questions...

2012-03-22 Thread Sean Owen
, March 22, 2012 13:51 To: user@mahout.apache.org Subject: Re: Mahout beginner questions... 1. These are the JDBC-related classes. For example see MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/ 2. The distributed and non-distributed code are quite separate. At this scale I don't