Another question that crossed my mind.
Considering all you said below, I'm still not sure when I would want to use a SQL
machine as my data source at all.
Response-time perspective --> you said it will take much longer than reading from a
file.
Memory perspective --> in the end you still need to move the data from the DB into
your memory anyway.

So what are the pros of doing so? When should I consider it?
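
Just so it's clear what I mean by "moving the data into memory", this is the kind of
pattern I had in mind -- only a sketch, assuming a MySQL Connector/J DataSource and
placeholder table/column names:

  import javax.sql.DataSource;
  import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
  import org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel;
  import org.apache.mahout.cf.taste.model.DataModel;
  import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

  public class JdbcBackedModel {
    public static void main(String[] args) throws Exception {
      // Placeholder connection details.
      MysqlDataSource ds = new MysqlDataSource();
      ds.setServerName("dbhost");
      ds.setDatabaseName("reco");
      ds.setUser("mahout");
      ds.setPassword("secret");

      // MySQLJDBCDataModel reads preferences over JDBC;
      // ReloadFromJDBCDataModel then caches everything in memory,
      // so the DB ends up being only the system of record, not the hot path.
      DataModel model = new ReloadFromJDBCDataModel(
          new MySQLJDBCDataModel(ds, "taste_preferences",
              "user_id", "item_id", "preference", "timestamp"));
      System.out.println(model.getNumUsers() + " users loaded into memory");
    }
  }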

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
Sent: Monday, March 26, 2012 15:52
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

No.  I meant that I used the same sort of combined offline and online processes 
that I have recommended to you.  The cluster did the offline part and a web 
tier did the online part. 

Sent from my iPhone

On Mar 26, 2012, at 1:27 AM, "Razon, Oren" <oren.ra...@intel.com> wrote:

> By saying "At Veoh, we built our models from several billion interactions on 
> a tiny cluster", did you mean that you used the distributed code on your cluster 
> as an online recommender?
> From what I've understood so far, I can't rely on the Hadoop part alone if I 
> want a truly real-time recommender that updates its recommendations and 
> models on every click of the user (because you would need to rebuild the data in 
> HDFS, run your batch job, and only then return an answer).
> 
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
> Sent: Monday, March 26, 2012 00:56
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
> 
> On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren <oren.ra...@intel.com> wrote:
> 
>> ...
>> The system I need should of course give the recommendation itself in no
>> time.
>> ...
> 
>> But because I'm talking about very large scales, I guess that I want to
>> push much of my model computation to offline mode (which will be refreshed
>> every X minutes).
>> 
> 
> Actually, you aren't talking about all that large a scale.  At Veoh, we
> built our models from several billion interactions on a tiny cluster.
> 
> 
>> So my options are like this (considering I want to build a truly scalable
>> solution):
>> Use the non-distributed \ distributed code to compute some of my model in
>> advance (for example similarity between items \ KNN for each user) --> I
>> guess that for this part, since it is offline, the MapReduce code is
>> ideal, because of its scalability.
>> 
> 
> Repeating what I said earlier, the offline part produces item-item
> information only.  It does not produce KNN data for any users.  There is no
> reference to a user in the result.
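>
> Roughly, the offline piece is this kind of job -- a sketch only; the paths are
> placeholders and the flag names are from memory, so check --help for your
> Mahout version:
>
>   import org.apache.hadoop.util.ToolRunner;
>   import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;
>
>   public class PrecomputeItemSimilarities {
>     public static void main(String[] args) throws Exception {
>       // Same as running the job from the mahout launcher script.
>       // Input is userID,itemID,pref lines in HDFS; output is
>       // item-item similarity pairs -- no user appears in the result.
>       ToolRunner.run(new ItemSimilarityJob(), new String[] {
>           "--input", "/reco/in/preferences.csv",
>           "--output", "/reco/out/item-similarities",
>           "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
>           "--maxSimilaritiesPerItem", "50"
>       });
>     }
>   }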
> 
> 
>> Then use non-distributed online code to calculate the final
>> recommendations based on the precomputed part and do some final
>> computation (weighting the KNN ratings for items my user hasn't experienced
>> yet).
>> 
> 
> All that happens here is that item => item* lists are combined.
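>
> Concretely, the online side can be as small as this -- a sketch; the file names
> are placeholders, and depending on your version you may need to massage the
> offline output into the itemID1,itemID2,similarity form FileItemSimilarity
> expects:
>
>   import java.io.File;
>   import java.util.List;
>   import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
>   import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
>   import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
>   import org.apache.mahout.cf.taste.model.DataModel;
>   import org.apache.mahout.cf.taste.recommender.RecommendedItem;
>   import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
>
>   public class OnlineItemBasedRecommender {
>     public static void main(String[] args) throws Exception {
>       // Current user histories, small enough to sit in memory on a web box.
>       DataModel model = new FileDataModel(new File("preferences.csv"));
>       // Item-item similarities precomputed offline on the cluster.
>       ItemSimilarity sim = new FileItemSimilarity(new File("item-similarities.csv"));
>       GenericItemBasedRecommender rec = new GenericItemBasedRecommender(model, sim);
>       // At request time this only merges the item => item* lists for the
>       // items this user has touched.
>       List<RecommendedItem> top = rec.recommend(12345L, 10);
>       for (RecommendedItem item : top) {
>         System.out.println(item.getItemID() + "\t" + item.getValue());
>       }
>     }
>   }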
> 
> 
>> In order to be able to do so, I will probably need a machine that has
>> high memory capacity to hold all the calculations in memory.
>> 
> 
> Not really.
> 
> 
>> I can even go further and prepare a cached recommender that will be
>> refreshed whenever I really want my recommendations to be updated.
>> 
> 
> This is correct.
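>
> And that part is essentially one line in Taste -- a sketch, continuing from the
> item-based recommender above:
>
>   import org.apache.mahout.cf.taste.common.TasteException;
>   import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
>   import org.apache.mahout.cf.taste.recommender.Recommender;
>
>   public class Caching {
>     // Wrap any recommender; repeated requests for the same user are served
>     // from the cache instead of being recomputed.
>     public static Recommender wrap(Recommender delegate) throws TasteException {
>       return new CachingRecommender(delegate);
>     }
>     // When fresh offline output lands (every X minutes), call
>     // recommender.refresh(null) so the cache and the underlying
>     // DataModel/ItemSimilarity reload their inputs.
>   }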
> 
> 
>> ...
>> I know the "glue" between the two parts is not quite there (as Sean said),
>> but my question is: how much does the current framework support this kind
>> of architecture?
> 
> 
> Yes.
> 
> 
>> Meaning, what kinds of things can I really prepare in advance before
>> continuing to the final computation? And besides the co-occurrence matrix
>> and matrix factorization, what other computations are available to me to do
>> in a MapReduce manner? Does it mean I will have two separate machines in
>> that case, one as a Hadoop cluster for the offline computation and an
>> online one that uses the distributed output to do the final recommendations
>> (but then it means I need to move data between machines, which is not so
>> ideal...)?
>> 
> 
> Yes.  You will need off-line and on-line machines if you want to have
> serious guarantees about response times.  And yes, you will need to do some
> copying if you use standard Hadoop.  If you use MapR's version of Hadoop,
> you can serve data directly out of the cluster with no copying because you
> can access files via NFS.
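>
> For the copying step with stock Hadoop, something like this is enough -- a
> sketch; the paths are placeholders for the output of the offline job:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class PullSimilarities {
>     public static void main(String[] args) throws Exception {
>       // Pull the offline output down to the online box; with MapR you could
>       // skip this and read the same files straight over NFS.
>       FileSystem fs = FileSystem.get(new Configuration());
>       fs.copyToLocalFile(new Path("/reco/out/item-similarities/part-r-00000"),
>                          new Path("/srv/reco/item-similarities.csv"));
>     }
>   }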
> 
> 
>> 
>> Also, as I mentioned earlier, I might need to store my data in a SQL
>> machine. If so, what drivers are currently supported? I saw only JDBC &
>> PostgreSQL; are there any others?
>> 
> 
> You don't need to store your data ONLY on an SQL machine and storing logs
> in SQL is generally a bad mistake.
> 
> 
>> As you said in the book, using a SQL machine will probably slow things
>> down because of the data movement through the drivers... Could you estimate
>> how much slower it is compared to using a file?
> 
> 
> 100x, roughly.  SQL is generally not usable as the source for parallel
> computations.
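>
> If the data does land in SQL first, the usual move is to dump it once into a
> flat file and let the file feed everything else -- a sketch; the connection
> string, table and column names are placeholders:
>
>   import java.io.PrintWriter;
>   import java.sql.Connection;
>   import java.sql.DriverManager;
>   import java.sql.ResultSet;
>
>   public class DumpPreferences {
>     public static void main(String[] args) throws Exception {
>       Connection conn = DriverManager.getConnection(
>           "jdbc:mysql://dbhost/reco", "mahout", "secret");
>       try {
>         ResultSet rs = conn.createStatement().executeQuery(
>             "SELECT user_id, item_id, preference FROM taste_preferences");
>         PrintWriter out = new PrintWriter("preferences.csv", "UTF-8");
>         // One userID,itemID,value line per row -- the format FileDataModel
>         // and the Hadoop jobs read directly.
>         while (rs.next()) {
>           out.println(rs.getLong(1) + "," + rs.getLong(2) + "," + rs.getFloat(3));
>         }
>         out.close();
>       } finally {
>         conn.close();
>       }
>     }
>   }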
---------------------------------------------------------------------
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
