+1 for keeping the name
-1 for incubation
On Thu, Feb 26, 2015 at 5:24 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Along with workspaces, code completion, +1 for visualization and extended
> (bayesian, stats, etc.) ops. Anything that is scalable and general seems
> fair game.
>
> Also -1 for incubation. This is all an evolution of loosely collected
> algos into generalizations and extensions of legacy stuff on new ground.
>
> Also +1 for separating out packages more formally, like
> spark-itemsimilarity and other things that aren’t general. They may come
> with generalized bits (like similarity) but have package-like delivery
> mechanisms. We should be able to have something better than contrib,
> especially since these may come with math and core extensions that are
> generally useful. No need to separate that until the core is done.
>
> However, a new identity would be a big boost to being able to communicate
> the new mission, and it is a new mission. If the issue is about support
> for legacy, that doesn’t seem to be a problem. If we stay a top-level
> project we can support legacy; in fact we have to.
>
>
> On Feb 25, 2015, at 6:21 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> -1 on incubation as well. The website and docs and user lists and this
> champion and mentor stuff, and logos and promotions for committers
> absolutely do not make any sense at this point. From what I hear, people
> are pretty busy without having that as it is. It would probably make more
> sense to take both Andrews :) and committers who actively pursue the
> programming environment vision to PMC, and for people who feel that they
> have no valuable input for the new philosophy of the project to just go
> emeritus and give up their voting rights. "Power of do", as they say.
>
> There's no major change in philosophy either -- mahout has been proclaiming
> "scalable machine learning", which is what we will continue doing. Only
> doing it (hopefully) a bit easier and with a new set of backend tools.
>
> I want to emphasize that I'd seek math environment status in a more
> general sense: not just algebraic, but also connect this to stats,
> samplers, optimizers (including bayesian opts), feature extractors, i.e.
> all basic big ml tools. Adapt Spark's DataFrame to these tools where
> appropriate. Viewing it as solely distributed algebra is a bit skewed
> away from reality. On private branches, I have previously developed a lot
> of that functionality (except for the visual stuff) and it is in practice
> very useful; it creates a common umbrella for people with an R background.
>
> I would very much want to integrate something for visualization, as it is
> important for the environment. Unfortunately, I don't see any mature
> science plotting for jvm stuff around. Scatter plots at best. I want at
> least to be able to plot 2d maps and KDEs with contours or density levels.
> There are ways to visualize massive datasets (and their parts). I see no
> tools for this around at all. Maybe some clever way to integrate with
> ggplot2 or shiny server? Even that would've been better, even if it
> required 3rd party software installation, than nothing at all.
>
> I don't expect methodologies to go to contrib, actually. Slightly
> different modules, maybe, but not so extreme as contrib.
>
>
> On Wed, Feb 25, 2015 at 5:18 PM, Andrew Musselman
> <andrew.mussel...@gmail.com> wrote:
>
> > How much would be involved in changing the name of a top-level project?
> >
> > I'd prefer to avoid the overhead of going back into incubation.
> >
> > I agree 0.10 makes more sense.
> >
> > On Wed, Feb 25, 2015 at 12:16 PM, Sean Owen <sro...@gmail.com> wrote:
> >
> >> My $0.02:
> >>
> >> There is no shortage of algorithm libraries that are in some way
> >> runnable on Hadoop out there, and not as many easy-to-use distributed
> >> matrix operation libraries. I think it's more additive to the
> >> ecosystem to solve that narrow, and deep, linear algebra problem and
> >> really nail it.
> >> That's a pretty good 'identity' to claim. It seems like an
> >> appropriate scope.
> >>
> >> I do think the project has changed so much that it's more confusing to
> >> keep calling it Mahout than to change the name. I can't think of one
> >> person I've talked to about Mahout in the last 6 months that was not
> >> under the impression that what is in 0.9 has simply been ported to
> >> Spark. It's different enough that it could even be its own incubator
> >> project (under a different name).
> >>
> >> The brand recognition is for the deprecated part so keeping that is
> >> almost the problem. It's not crazy to just change the name. Or even
> >> consider a re-incubation. It might give some latitude to more fully
> >> reboot.
> >>
> >> Releasing 1.0.0 on the other hand means committing to the APIs (and
> >> name) for some fairly new code and fairly soon. Given that this is
> >> sort of a 0.1 of a new project, going to 1.0 feels semantically wrong.
> >> But a release would be good. Personally I'd suggest 0.10.
> >>
> >> On Wed, Feb 25, 2015 at 5:50 PM, Pat Ferrel <p...@occamsmachete.com>
> >> wrote:
> >>> Looking back over the last year Mahout has gone through a lot of
> >>> changes. Most users are still using the legacy mapreduce code and new
> >>> users have mostly looked elsewhere.
> >>>
> >>> The fact that people as knowledgeable as former committers compare
> >>> Mahout to Oryx or MLlib seems odd to me because Mahout is neither a
> >>> server nor a loose collection of algorithms. It was the latter until
> >>> all of mapreduce was moved to legacy and “no new mapreduce” was the
> >>> rule.
> >>>
> >>> But what is it now? What is unique and of value? Is it destined to be
> >>> late to the party and chasing the algo checklists of things like MLlib?
> >>>
> >>> First a slight digression. I looked at moving itemsimilarity to raw
> >>> Spark if only to remove mrlegacy from the dependencies.
> >>> At about the same time another Mahouter asked the Spark list how to
> >>> transpose a matrix. He got the answer “why would you want to do
> >>> that?” The fairly high performance algorithm behind
> >>> spark-itemsimilarity was designed by Sebastian and requires an
> >>> optimized A’A, A’B, A’C… and spark-rowsimilarity requires AA’. None
> >>> of these are provided by MLlib. No actual transpose is required so
> >>> these two things should be seen as separate comments about MLlib. The
> >>> moral: unless I want to write optimized matrix transpose-and-multiply
> >>> solvers I will stick with Mahout.
> >>>
> >>> So back to Mahout’s unique value. Mahout today is a general linear
> >>> algebra lib and environment that performs optimized calculations on
> >>> modern engines like Spark. It is something like a Scala-fied R on
> >>> Spark (or other engine).
> >>>
> >>> If this is true then spark-itemsimilarity can be seen as a
> >>> package/add-on that requires Mahout’s core Linear Algebra.
> >>>
> >>> Why use Mahout? Use it if you need scalable general linear algebra.
> >>> That’s not what MLlib does well.
> >>>
> >>> Should we be chasing MLlib’s algo list? Why would we? If we need some
> >>> algo, why not consume it directly from MLlib or somewhere else? Why
> >>> is a reimplementation important, all else being equal?
> >>>
> >>> Is general scalable linear algebra sufficient for all important ML
> >>> algos? Certainly not. For instance streaming ones, and in particular
> >>> online updated streaming algos, may have little to gain from Mahout
> >>> as it is today.
> >>>
> >>> If the above is true then Mahout is nothing like what it was in 0.9
> >>> and is being unfairly compared to 0.9 and other things like that.
> >>> This misunderstanding of what Mahout _is_ leads to misapplied
> >>> criticism and lack of use for what it does well. At the very least
> >>> this all implies a very different description on the CMS; at most,
> >>> maybe something as drastic as a name change.
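
As a side illustration of the remark in the quoted message that A'A, A'B and AA' need no actual transpose: the sketch below is plain Python written for this note, not Mahout's or MLlib's implementation. It assumes a small dense matrix stored as a list of rows, and the function names (gram_ata, cross_atb, gram_aat) are hypothetical. A'A is accumulated as a sum of per-row outer products, A'B pairs up matching rows of A and B, and AA' is just row-by-row dot products, so each needs only row access to the inputs.

```python
def gram_ata(rows, n_cols):
    """A'A as a sum of row outer products; A is never transposed."""
    ata = [[0.0] * n_cols for _ in range(n_cols)]
    for row in rows:
        for i, x in enumerate(row):
            if x == 0:
                continue  # a zero entry contributes nothing to row i of A'A
            for j, y in enumerate(row):
                ata[i][j] += x * y
    return ata


def cross_atb(rows_a, rows_b, n_cols_a, n_cols_b):
    """A'B from matching rows of A and B (both must have the same row count)."""
    atb = [[0.0] * n_cols_b for _ in range(n_cols_a)]
    for row_a, row_b in zip(rows_a, rows_b):
        for i, x in enumerate(row_a):
            if x == 0:
                continue
            for j, y in enumerate(row_b):
                atb[i][j] += x * y
    return atb


def gram_aat(rows):
    """AA' as pairwise dot products between rows of A."""
    return [[float(sum(x * y for x, y in zip(r1, r2))) for r2 in rows]
            for r1 in rows]


# Tiny hypothetical inputs, chosen only to make the arithmetic easy to check.
A = [[1, 2], [3, 4]]
B = [[1, 0], [0, 1]]
ata = gram_ata(A, 2)
atb = cross_atb(A, B, 2, 2)
aat = gram_aat(A)
```

Because A'A and A'B are sums over rows, partial results computed on separate row partitions can simply be added together, which is one reason a distributed engine can optimize these products without materializing a transposed matrix.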