Ah, now back to freely babbling on the dev list. My Mahout wishlist:

1) Scaling: I don't get the need for R integration or for running without Hadoop or Spark. You can run Hadoop in local mode on your native file system, even under a debugger, then run the exact same code on a cluster (a rough config sketch follows this list). If you don't care about scaling there are plenty of great libraries for R already, so why bother with Mahout? One project I worked on started with the in-memory recommender but within months had hopelessly outgrown it. If there hadn't been at least a path to scaling, we would never have started with Mahout. Non-scalable code is fine and solves many problems, but I hope it isn't the primary design point.

2) Speed: read below. Hadoop now (speed means buying more computers); more Spark later (buy fewer computers).

3) Ease of data input/output: converting external ids into Mahout's sequential integer ids is deceptively difficult and has to be re-created for every project (see the id-dictionary sketch after this list). I'm trying to submit an example that includes an input/output pipeline which is mostly scalable. It takes delimited logfiles with external ids, creates Mahout input, then takes Mahout's output and converts it back to external ids. It isn't worthy of core inclusion, but it is at least a prototype or example of how to do this.
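To make the "same code locally and on a cluster" point in 1) concrete, here is a rough sketch (assuming Hadoop 2-style configuration keys; the object and method names are mine, not anything in Mahout). The only thing that changes between debugging on your laptop and running on a cluster is the Configuration the job is handed:

    import org.apache.hadoop.conf.Configuration

    object LocalModeSketch {
      // Sketch only: the same job code runs locally (in-process, debuggable,
      // native filesystem) or on a cluster, depending on its configuration.
      def localConf(): Configuration = {
        val conf = new Configuration()
        conf.set("mapreduce.framework.name", "local") // run map/reduce in-process under a debugger
        conf.set("fs.defaultFS", "file:///")          // read/write the native local filesystem
        conf
      }

      def clusterConf(): Configuration = {
        val conf = new Configuration()
        conf.set("mapreduce.framework.name", "yarn")     // hand the identical job to YARN
        conf.set("fs.defaultFS", "hdfs://namenode:8020") // hypothetical HDFS namenode address
        conf
      }
    }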
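And here is roughly the shape of the id-dictionary piece of the pipeline in 3), just a sketch in Scala (the names are mine, not from the example I'm submitting): external string ids get a sequential int the first time they are seen, and the same table translates Mahout's integer output back to external ids.

    import scala.collection.mutable

    // Hypothetical helper: a two-way dictionary between external (string) ids
    // and Mahout's contiguous integer ids.
    class IdDictionary {
      private val toInternal = mutable.HashMap.empty[String, Int]
      private val toExternal = mutable.ArrayBuffer.empty[String]

      // Assign the next sequential int the first time an external id is seen.
      def internalId(externalId: String): Int =
        toInternal.getOrElseUpdate(externalId, {
          toExternal += externalId
          toExternal.size - 1
        })

      // Translate Mahout output back to the original external id.
      def externalId(internalId: Int): String = toExternal(internalId)
    }

    // Usage sketch: parse delimited log lines like "user123\titem456",
    // build Mahout input from the ints, then map recommendations back.
    //   val userDict = new IdDictionary(); val itemDict = new IdDictionary()
    //   val (u, i) = (userDict.internalId("user123"), itemDict.internalId("item456"))
    //   ... run Mahout ...
    //   val recommendedItem = itemDict.externalId(someMahoutItemId)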
My $0.02 worth about the future of Mahout:

1) The future lies in moving much of the current code to Spark, and that may not be the end of it. If yet another, faster platform emerges, Mahout will have to go there too. If Mahout doesn't move (pretty quickly), someone else will fill the gap and Mahout will be left behind.

2) The future of Mahout is tied to big data, or at least I hope so. Ask yourself this: is Mahout a sandbox for experimentation with cutting-edge algorithms, or is it a scalable, performant ML library targeted at production environments? I hope most people think it is the latter.