On Fri, Apr 16, 2010 at 11:56 AM, Sean Owen <sro...@gmail.com> wrote:
> On Fri, Apr 16, 2010 at 7:39 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> > I will start playing around with Anthony's github-based stuff, and
> > see where a patch can be made. The question is where it would
> > go? It's a fully functioning project already over on its own.
> >
> I suppose that's my question too -- what is being fixed by a move?
>

What is being fixed is that we currently have an open ticket for providing LSA hooks for Solr, and lsa4solr provides an end-to-end solution for that particular task, along with a bunch of other nice things (clojure wrapping and thus the REPL).

If we want to say "MAHOUT-343 is a Don't Fix", and that's the consensus, then that's fine. It could also be implemented in pure java inside of mahout-core, but I don't see anyone stepping up to the plate to write that, and here's someone who's done it in another JVM language we could use.

> The point about integrating with the ML community by having a
> 'LISP-speaking' module, to be friendlier, is a good one. It does call
> into question the Mahout identity -- is it for tinkering with in a lab
> to explore new algorithms (for which Clojure/LISP makes sense)? or is
> it for engineers and production systems at scale -- where Hadoop/Java
> is the lingua franca? Yeah, this is not just another language, but for
> a somewhat different audience.
>
> Maybe "both" is nice.

I think "both" is *necessary*. At least at all the smallish bay-area startups doing some scalable production ML these days, there is not much distinction between "researcher" and "production code-monkey" - you prototype in whatever language you can, you try it out (maybe using hadoop streaming) on bigger data sets, then you productionalize it (usually the same engineers involved in all steps). If Mahout could be helping along with that entire process, that would be fantastic.
We'd have shell scripts and an actual REPL to tinker with, and then when it came time to optimize performance, the same exact libraries could be used and extended, no more "first we do stuff with Matlab or R, then port all of the code over to java/c++ later".

  -jake