Jim, let me start by saying it's an (unexpected, on my side) honor. Are you willing to get hands-on with numerical problems at this point (or do you have resources that can get hands-on)?
Short modern Mahout story (as short as it can possibly be).

Most nagging problem: lack of support from industry and/or academia. We have capable committers, but far fewer capable backers in terms of willingness to sanction contributions.

Current Mahout development goes two ways: (a) the platform (aka `samsara`); and (b) useful, preferably end-to-end use case scenarios, or just methodology implementations. Note that while (b) is intended to use (a) (and gain backend portability as a bonus), that is not strictly required, as long as the backend-specific code can be fairly easily ported to other backends. Still, if we come across a need for custom code, we try to analyze whether it is a fairly common abstraction, so we can add it to the list of formalisms in the platform and avoid repetition in the future. A platform primer can be found on the site, so I won't get into that here (though there's a minimal DSL sketch at the end of this note).

In the platform, problem #1 right now is performance. Not that it is generally bad, but some pieces are limited by the backends. We have done some in-memory work to integrate more performant backends there, but the effort is constrained by our immediate capacity to contribute, and the most glaring issue (as one of the visitors duly noted in jira) is that the distributed backends we are trying to run on are severely limited for interconnected algebraic problems. We have ideas about what to do here, though. It is precisely the distributed performance of the current backends (flink, spark) on interconnected numerical problems that precludes Mahout from being a pragmatic platform for implementing, say, deep learning at scale. I suppose in-memory performance should be OK for that purpose once we have added GPU support and DL-specific GPU primitives. The in-memory improvements are not complete for everything that would be ideal, but there has been some notable progress there.

With methodologies, well, there's no single most pressing problem; it is really defined by the pragmatic problem one has at hand. Currently, Trevor does most of this outstanding work. It simply (and preferably) should be more cutting-edge than what most distributed packages offer. E.g., decent-to-good bayesian optimization for hyperparameters; or, say, I have been suggesting for a few years that we experiment with LRFM recommendation techniques, since they significantly expand the types of predictors the method can take, and their treatment, compared to things like COO or implicit-feedback behavior-based recommenders. Another example: there is no good coverage in clustering in terms of the _type_ of clustering -- mixtures, density, spectral, not just the traditional centroid methods. Visualization techniques, even ones as simple as 2d density estimators for big datasets (sketched below), are also in demand. Generally speaking, industry has stepped far ahead of what is commonly available in open source software in terms of visualization approaches.

Bottom line, the only guidance I see here is: "don't be trivial; seek a unique value proposition." But the most guiding principle so far has been people's pragmatism: "I have an actual production use case and/or very specific requirements, I want to use methodology X for it, and I don't seem to be able to find it elsewhere under the management of a distributed platform Y."
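To make the platform point a bit more concrete, here is a minimal sketch of the Samsara DSL along the lines of the primer on the site -- distributed algebra written once, with the backend (spark, in this sketch) bound only at the session level. The matrix values are made up for illustration:

    // Minimal Samsara DSL sketch (Spark backend assumed; values are made up).
    import org.apache.mahout.math._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    object SamsaraSketch extends App {

      // A distributed context binds the algebra to a backend (Spark here);
      // the algebraic expressions below never mention Spark at all.
      implicit val ctx = mahoutSparkContext(masterUrl = "local",
        appName = "samsara-sketch")

      // Build a small in-core matrix and distribute it as a DRM.
      val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5), (5, 6, 7)))

      // Backend-agnostic algebra: thin A'A, evaluated lazily by the optimizer.
      val drmAtA = drmA.t %*% drmA

      // An action triggers the physical plan; collect brings the result in-core.
      println(drmAtA.collect)
    }

The backend-portability bonus of (a) is exactly this: the same expression runs unchanged on another backend by swapping only the context creation.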
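And since I mentioned 2d density estimators: purely as an illustration of how simple the core of such a thing is (this is not an existing Mahout API; the names are made up), a single fixed-grid binning pass over a big point stream looks like:

    // Hypothetical sketch of a "simple 2d density estimator": bin a large
    // point stream into a fixed grid and count per cell. Illustrative only.
    def density2d(points: Iterator[(Double, Double)],
                  xMin: Double, xMax: Double,
                  yMin: Double, yMax: Double,
                  bins: Int = 256): Array[Array[Long]] = {
      val grid = Array.ofDim[Long](bins, bins)
      val xw = (xMax - xMin) / bins
      val yw = (yMax - yMin) / bins
      for ((x, y) <- points) {
        // Clamp to the last bin so points on the upper edge are kept;
        // points below the lower bounds are simply skipped.
        val i = math.min(((x - xMin) / xw).toInt, bins - 1)
        val j = math.min(((y - yMin) / yw).toInt, bins - 1)
        if (i >= 0 && j >= 0) grid(i)(j) += 1
      }
      grid
    }

The grid then goes to whatever plotting front end. The part that actually needs platform support in practice is computing and merging the partial grids across a cluster, which is where a distributed implementation would add the value.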
-d

On Thu, Feb 9, 2017 at 6:34 AM, Jim Jagielski <j...@jagunet.com> wrote:
>
> > On Feb 8, 2017, at 11:50 PM, Suneel Marthi <smar...@apache.org> wrote:
> >
> > Curious JimJag,
> > Did some dude from CapitalOne poke u about Mahout
> >
>
> Not really, no...
>