John,

This is well said and is a critical need.
There are some beginnings to this. The recommender side of the house already works the way you say. The classifier and hashed encoding APIs are beginning to work that way. The naive Bayes classifiers pretty much do not, and the classifier APIs are just beginning to have an API-centric form.

On Mon, Feb 13, 2012 at 5:31 PM, John Conwell <j...@iamjohn.me> wrote:

> From my perspective, I'd really like to see the Mahout API migrate away
> from the command-line-centric design it currently uses and move toward
> a library-centric API design. I think this would go a long way toward
> getting Mahout adopted into real-life commercial applications.
>
> While there might be a few algorithm drivers that you interact with by
> creating an instance of a class and calling some method(s) on the instance
> (I haven't actually seen one like that, but there might be a few), many
> algorithms are invoked by calling some static function on a class that
> takes ~37 typed arguments. But what's worse, many drivers are invoked by
> having to create a String array with ~37 arguments as string values and
> calling the static main function on the class.
>
> Now I'm not saying that having a static main function available to invoke
> an algorithm from the command line isn't useful. It is, when you're testing
> an algorithm. But once you want to integrate the algorithm into a
> commercial workflow it kind of sucks.
>
> For example, imagine if the API for invoking Math.max were designed the way
> many of the Mahout algorithms currently are. You'd have something like
> this:
>
>     String[] args = new String[3];
>     args[0] = "max";
>     args[1] = "7";
>     args[2] = "4";
>     int max = Math.main(args);
>
> It makes your code a horrible mess and very hard to maintain, as well as
> very prone to bugs.
>
> When I see a bunch of static main functions as the only way to interact
> with a library, no matter what the quality of the library is, my initial
> impression is that this has to be some minimally supported effort by a few
> PhD candidates still in academia, who will drop the project as soon as they
> graduate. And while this might not be the case, it is one of the first
> impressions it gives, and can lead a company to drop the library from
> consideration before they do any due diligence into its quality and
> utility.
>
> I think as Mahout matures and gets closer to a 1.0 release, this kind of
> API redesign will become more and more necessary, especially if you want a
> higher Mahout integration rate into commercial applications and workflows.
>
> Also, I hope I don't sound too negative. I'm very impressed with Mahout and
> its capabilities. I really like that there is a well-thought-out class
> library of primitives for designing new serial and distributed machine
> learning algorithms. And I think it has a high utility for integration
> into highly visible commercial projects. But its high-level public API
> really is a barrier to entry when trying to design commercial applications.
>
> On Sun, Feb 12, 2012 at 12:20 AM, Jeff Eastman
> <j...@windwardsolutions.com> wrote:
>
> > We have a couple JIRAs that relate here: We want to factor all the (-cl)
> > classification steps out of all of the driver classes (MAHOUT-930) and
> > into a separate job to remove duplicated code; MAHOUT-931 is to add a
> > pluggable outlier removal capability to this job; and MAHOUT-933 is aimed
> > at factoring all the iteration mechanics from each driver class into the
> > ClusterIterator, which uses a ClusterClassifier which is itself an
> > OnlineLearner. This will hopefully allow semi-supervised classifier
> > applications to be constructed by feeding cluster-derived models into the
> > classification process. Still kind of fuzzy at this point but promising
> > too.
> >
> > On 2/11/12 2:29 PM, Frank Scholten wrote:
> >
> >> ...
> >>
> >> What kind of clustering refactoring do you mean here? I did some work on
> >> creating bean configurations in the past (MAHOUT-612). I underestimated
> >> the amount of work required to do the entire refactoring. If this can be
> >> contributed and committed on a per-job basis I would like to help out.
> >>
> >>> ...
> >
>
> --
>
> Thanks,
> John C
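
To make the Math.max comparison quoted above a bit more concrete, here is a minimal sketch of one way the two styles could coexist: a typed, library-centric method that application code calls directly, plus a thin string-array wrapper kept around for command-line testing. The class and method names (ApiStyleSketch, max, maxFromArgs) are made up for illustration and are not part of Mahout or the JDK; only Math.max and Integer.parseInt are real calls.

```java
// Hypothetical sketch: ApiStyleSketch, max and maxFromArgs are illustrative
// names only. They are not Mahout classes and not a proposed Mahout API.
final class ApiStyleSketch {

    /** Library-centric entry point: typed arguments, checked by the compiler. */
    static int max(int a, int b) {
        return Math.max(a, b);
    }

    /**
     * Command-line-centric entry point: a thin wrapper that parses the
     * string arguments and delegates to the typed method above.
     */
    static int maxFromArgs(String[] args) {
        if (args.length != 3 || !"max".equals(args[0])) {
            throw new IllegalArgumentException("usage: max <int> <int>");
        }
        // Integer.parseInt throws NumberFormatException on malformed input,
        // so a bad argument only shows up at run time.
        return max(Integer.parseInt(args[1]), Integer.parseInt(args[2]));
    }

    public static void main(String[] args) {
        // Application code calls the typed method directly.
        System.out.println(max(7, 4));
        // The string form stays available for command-line style testing.
        System.out.println(maxFromArgs(new String[] {"max", "7", "4"}));
    }
}
```

The point of the split is that the string parsing lives in exactly one place, so callers integrating the algorithm into a workflow never have to build an argument array, while the command-line path keeps working.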