sorry for typos in the subj -d
On Mon, Mar 11, 2013 at 12:01 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> ---------- Forwarded message ----------
> From: Dmitriy Lyubimov <dlie...@gmail.com>
> Date: Mon, Mar 11, 2013 at 11:38 AM
> Subject: Re: Missing Mahout board report
> To: priv...@mahout.apache.org
>
> On Mon, Mar 11, 2013 at 11:27 AM, Sean Owen <sro...@gmail.com> wrote:
>
>> On Mon, Mar 11, 2013 at 6:06 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>> >
>> > Hadoop MR platform of course, 'cause those are all flaws of the Hadoop MR.
>> > So Mahout just suffers the ills of MR and that's why the flagships of
>> > distributed CF algorithms frankly do not shine here
>> >
>>
>> FWIW I think there's a big performance difference between an M/R job, and
>> an optimized one. It takes a lot of honing, tuning, and cheating to make
>> them run fast, and, that's the practical problem. But I'd hate to
>> necessarily conflate what's in this project with what's possible in M/R.
>>
>> >
>> > So it does call for a new distributed environment to use -- other than
>> > "MR 1.0" -- if distributed stuff to be presented in Mahout on par with
>> > competition. I don't know how feasible that is though.ps for good.
>> >
>>
>> Depends on your goal -- if building a tool for academia or for fun or for a
>> purpose-built project, any tool is in bounds, maybe even niche or alpha
>> ones. You can pick the tool that is optimal just for the problem being
>> solved. Hadoop is the devil people know though. If you're writing a product
>> / project for the broad market in 2013 I think it's still Hadoop-based.
>> Some of these alternatives look like they will become mature, but niche, or
>> broadly applicable but not mature. Most of what I'm seeing still feels to
>> be of the form "I solved this problem with a specialized framework and its
>> faster than a bad M/R implementation" which is good but not game-changing.
>> A generalized M/R (a la YARN) is my personal bet, but probably will be
>> worth building around later this year.
>>
>
> Sure. For many pragmatic projects Apache's MR will be just good enough.
> Familiarity beats additional hardware costs; super large problems are not
> that common.
>
> The problem is still a little bit about how to make ALS-like stuff
> practical. As far as I can recall, Sebastian did not recommend that
> stuff in Mahout (as opposed to Giraph): for one thing, it is not practical
> to run it enough times to find a good regularization parameter
> automatically.
>
> Many such problems are not just slow startup/high I/O type of things. In
> many cases it is about the MR shuffle and sort logic itself.
>
> Imagine for a moment we wanted to solve the problem of deinterlacing an
> NTSC signal.
>
> So we get two fields, the first one containing the odd lines and the second
> containing the even lines. The MR way of solving that is to key every line
> with (field#, line#) and then do shuffle-and-sort. The sort adds a log
> factor to the asymptotic complexity, whereas it is clear that any streaming
> merge algorithm would not need to sort at all and could capitalize on the
> structure we already know. (Sure, you can do it map-side with specific
> streaming join logic, but that would not be pure MR but rather some map
> task acrobatics.)
>
> A lot of things we do with blocked matrix arithmetic are exactly like
> that. They have structure, but we cannot use it and forward it
> appropriately at scale unless we run through a sort.
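
To make the deinterlacing example concrete, here is a rough single-machine sketch in Python. It is only an illustration, not Hadoop or Mahout code: the function names are made up, both fields are assumed to have the same number of lines, and the "sort" version merely imitates what shuffle-and-sort does with (field#, line#) keys.

```python
def deinterlace_by_sort(odd_field, even_field):
    """MR-style: key every line with (field#, line#), then sort on the key."""
    keyed = [((1, i), line) for i, line in enumerate(odd_field)]
    keyed += [((2, i), line) for i, line in enumerate(even_field)]
    # Stand-in for shuffle-and-sort: order by line# first, then field#.
    # This is the O(n log n) step that ignores the order we already have.
    keyed.sort(key=lambda kv: (kv[0][1], kv[0][0]))
    return [line for _, line in keyed]


def deinterlace_by_merge(odd_field, even_field):
    """Streaming: each field is already in order, so just interleave them.

    One linear pass, no sort, no buffering of the whole frame.
    Assumes both fields have the same length (zip stops at the shorter one).
    """
    for odd_line, even_line in zip(odd_field, even_field):
        yield odd_line
        yield even_line


if __name__ == "__main__":
    odd = ["line1", "line3", "line5"]
    even = ["line2", "line4", "line6"]
    print(deinterlace_by_sort(odd, even))
    print(list(deinterlace_by_merge(odd, even)))
```

Both functions produce the same frame; the difference is that the second one never pays for a sort because it relies on the structure of the input, which is the part plain MR gives no way to express.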