Re: Discussion Of ML environments/MR, Mahout

Dmitriy Lyubimov Mon, 11 Mar 2013 12:19:11 -0700

sorry for typos in the subj
-d


On Mon, Mar 11, 2013 at 12:01 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

>
>
> ---------- Forwarded message ----------
> From: Dmitriy Lyubimov <dlie...@gmail.com>
> Date: Mon, Mar 11, 2013 at 11:38 AM
> Subject: Re: Missing Mahout board report
> To: priv...@mahout.apache.org
>
>
>
>
>
> On Mon, Mar 11, 2013 at 11:27 AM, Sean Owen <sro...@gmail.com> wrote:
>
>>
>>
>
>> On Mon, Mar 11, 2013 at 6:06 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>> wrote:
>> >
>> > Hadoop MR platform of course, 'cause those are all flaws of the Hadoop
>> MR.
>> > So Mahout just suffers the ills of MR and that's why the flagships of
>> > distributed CF algorithms frankly do not shine here
>> >
>>
>> FWIW I think there's a big performance difference between an M/R job, and
>> an optimized one. It takes a lot of honing, tuning, and cheating to make
>> them run fast, and, that's the practical problem. But I'd hate to
>> necessarily conflate what's in this project with what's possible in M/R.
>>
>> >
>> >
>> > So it does call for a new distributed environment to use -- other than
>> "MR
>> > 1.0" -- if distributed stuff is to be presented in Mahout on par with
>> > competition. I don't know how feasible that is though.ps for good.
>> >
>> >
>> Depends on your goal -- if building a tool for academia or for fun or for
>> a
>> purpose-built project, any tool is in bounds, maybe even niche or alpha
>> ones. You can pick the tool that is optimal just for the problem being
>> solved. Hadoop is the devil people know though. If you're writing a
>> product
>> / project for the broad market in 2013 I think it's still Hadoop-based.
>> Some of these alternatives look like they will become mature, but niche,
>> or
>> broadly applicable but not mature. Most of what I'm seeing still feels to
>> be of the form "I solved this problem with a specialized framework and it's
>> faster than a bad M/R implementation" which is good but not game-changing.
>> A generalized M/R (a la YARN) is my personal bet, but probably will be
>> worth building around later this year.
>>
>
> Sure. For many pragmatic projects Apache's MR will be just good enough.
> Familiarity beats additional hardware costs; super large problems are not
> that common.
>
> The problem is still partly about how to make ALS-like stuff practical.
> As far as I recall, Sebastian did not recommend that stuff in Mahout (as
> opposed to Giraph): for one thing, it is not practical to run it enough
> times to find a good regularization parameter automatically.
>
> Many such problems are not just slow-startup/high-I/O type of things. In
> many cases it is about the MR shuffle-and-sort logic itself.
>
> Imagine for a moment we wanted to solve a problem of deinterlacing an NTSC
> signal.
>
> So we get two fields, the first one containing odd lines and the second
> containing even lines. The MR way of solving that is to key every line with
> (field#, line#) and then do shuffle-and-sort. The sort component adds a log
> factor to the asymptotic complexity, whereas it is clear that any streaming
> merge algorithm wouldn't need to sort at all and could capitalize on the
> structure we already know. (Sure, you can do it map-side with specific
> streaming join logic, but that would not be pure MR but rather some map
> task acrobatics.)
>
> A lot of things we do with blocked matrix arithmetic are exactly like
> that. They have structure, but we cannot use it and forward it appropriately
> at scale unless we run through a sort.
>
>
>
>
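
The deinterlacing argument in the quoted message can be made concrete with a
small sketch. This is only an illustration, not Mahout or Hadoop code: the
function names are made up, each field is assumed to be an in-memory list of
lines already in order, and the sort key is assumed to be the line's position
in the full frame. The first function mimics the "key everything, then sort"
approach at O(n log n); the second uses the known structure of the two fields
and interleaves them in a single O(n) streaming pass.

```python
from itertools import zip_longest

# Sentinel for "this field has run out of lines" (hypothetical helper).
_MISSING = object()


def deinterlace_by_sort(odd_lines, even_lines):
    """MR-style: key every line by its frame position, then sort by key."""
    keyed = [(2 * i + 1, line) for i, line in enumerate(odd_lines)]
    keyed += [(2 * i + 2, line) for i, line in enumerate(even_lines)]
    keyed.sort(key=lambda kv: kv[0])  # the O(n log n) step
    return [line for _, line in keyed]


def deinterlace_by_merge(odd_lines, even_lines):
    """Streaming: alternate between the two fields, no sort needed."""
    for odd, even in zip_longest(odd_lines, even_lines, fillvalue=_MISSING):
        if odd is not _MISSING:
            yield odd
        if even is not _MISSING:
            yield even


if __name__ == "__main__":
    odd = ["line1", "line3", "line5"]
    even = ["line2", "line4"]
    print(deinterlace_by_sort(odd, even))
    print(list(deinterlace_by_merge(odd, even)))
```

Both functions produce the same frame; the difference is that the merge
version never compares keys, which is the structure the quoted message says a
pure shuffle-and-sort cannot exploit.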
