I'd like us to cut a 1.0.0 or 0.10.0 release with the Spark work, then commit to regular maintenance/point releases and a semi-yearly major release cycle, and agree that publicizing it with talks and articles is essential.
I don't think changing the name would do anything to reinvigorate or clarify interest and perception. Even though Mahout's "elephant driving" legacy is deprecated, it has brand recognition behind it. There are some good things in the code base, especially the linear algebra work and the DSL, that, as you guys mention, are just not in other tools right now. I like the idea of clearly defining a contrib package like what Pig has, to incorporate purpose-built jobs.

On Wed, Feb 25, 2015 at 10:42 AM, Dmitriy Lyubimov <[email protected]> wrote:

> I think a release with some value in it and a talk clarifying status will
> suffice for starters.
>
> Name change IMO is immaterial if there's the value and talks clarify
> general philosophy sufficiently. Nobody else can tell people better what
> it is all about; it is the lack of a release, and of the information that
> follows it, that turns people to speculation or a legacy understanding of
> things.
>
> General philosophy -- yes, that of R-base + R packages. Or, what I
> actually like more, that of Julia (+ which can run on different
> distributed shared-nothing programming models). People use off-the-shelf
> stuff but people also do their own programming. I found that I have to
> customize methodologies in some way in at least 80% of cases, which is why
> the value for me is shifting towards 'R-base' rather than a set of
> packages. As R demonstrates, do the former right-ish, and the latter will
> follow.
>
> I don't care for comparisons and don't spend time thinking on collating
> algorithm names. I'm strictly 100% pragmatically driven. If there's a
> black box thing and it fits, I just take it. If not (and 80% of the time
> it is the "not"), I'd have to do something of my own. Take SPCA, for
> example. There's no strict publication that describes its exact flow
> (known to me). It is just a 2-step derivation of Stochastic SVD (which is,
> in itself, a 2-step derivation/customization of the random projection
> paper).
> These customizations and small derivations are actually incredibly
> numerous in practice.
>
> On MLlib, there's probably little value in chasing the MLlib set of
> things -- at least not by "mahout-base" implementors, and not for the
> Spark backend. Since in Spark's case we act as an "add-on", all black box
> MLlib things are already in our scope. They are, literally, available to
> the programming environment of Mahout. But yes, probably some gentleman's
> survival kit should eventually be present, even if it may repeat some of
> the MLlib methods, as it is not automatically in scope for Flink
> (although, again, Flink has stuff like K-means too). Kinda hoped the Flink
> guys could help with this one day.
>
> On Wed, Feb 25, 2015 at 9:50 AM, Pat Ferrel <[email protected]> wrote:
>
> > Looking back over the last year Mahout has gone through a lot of
> > changes. Most users are still using the legacy mapreduce code and new
> > users have mostly looked elsewhere.
> >
> > The fact that people as knowledgeable as former committers compare
> > Mahout to Oryx or MLlib seems odd to me because Mahout is neither a
> > server nor a loose collection of algorithms. It was the latter until
> > all of mapreduce was moved to legacy and “no new mapreduce” was the
> > rule.
> >
> > But what is it now? What is unique and of value? Is it destined to be
> > late to the party and chasing the algo checklists of things like MLlib?
> >
> > First a slight digression. I looked at moving itemsimilarity to raw
> > Spark if only to remove mrlegacy from the dependencies. At about the
> > same time another Mahouter asked the Spark list how to transpose a
> > matrix. He got the answer “why would you want to do that?” The fairly
> > high performance algorithm behind spark-itemsimilarity was designed by
> > Sebastian and requires an optimized A’A, A’B, A’C… and
> > spark-rowsimilarity requires AA’. None of these are provided by MLlib.
> > No actual transpose is required, so these two things should be seen as
> > separate comments about MLlib. The moral: unless I want to write
> > optimized matrix transpose-and-multiply solvers I will stick with
> > Mahout.
> >
> > So back to Mahout’s unique value. Mahout today is a general linear
> > algebra lib and environment that performs optimized calculations on
> > modern engines like Spark. It is something like a Scala-fied R on Spark
> > (or other engine).
> >
> > If this is true then spark-itemsimilarity can be seen as a
> > package/add-on that requires Mahout’s core Linear Algebra.
> >
> > Why use Mahout? Use it if you need scalable general linear algebra.
> > That’s not what MLlib does well.
> >
> > Should we be chasing MLlib’s algo list? Why would we? If we need some
> > algo, why not consume it directly from MLlib or somewhere else? Why is
> > a reimplementation important, all else being equal?
> >
> > Is general scalable linear algebra sufficient for all important ML
> > algos? Certainly not. For instance, streaming ones, and in particular
> > online updated streaming algos, may have little to gain from Mahout as
> > it is today.
> >
> > If the above is true then Mahout is nothing like what it was in 0.9 and
> > is being unfairly compared to 0.9 and other things like that. This
> > misunderstanding of what Mahout _is_ leads to misapplied criticism and
> > lack of use for what it does well. At the very least this all implies a
> > very different description on the CMS; at most, maybe something as
> > drastic as a name change.
