Ah, now back to freely babbling on the dev list.

Mahout wishlist:
1) scaling: I don’t get the need for R integration or for running without Hadoop 
or Spark. You can run Hadoop in local mode on your native filesystem, even 
under a debugger, then run the exact same code on a cluster. If you don’t care 
about scaling there are plenty of great R libraries already, so why bother with 
Mahout? One project I worked on started with the in-memory recommender but 
within months had hopelessly outgrown it. If there weren’t at least a path to 
scaling, we would never have started with Mahout. Non-scalable code is fine and 
solves many problems, but I hope it’s not the primary design point.
2) speed: read below. Hadoop now (speed means buying more computers), more 
Spark later (buy fewer computers).
3) ease of data input/output: the conversion of external ids into Mahout’s 
sequential integer ids is deceptively difficult and has to be re-created for 
every project. I’m trying to submit an example that includes a mostly scalable 
input/output pipeline. It takes delimited logfiles with external ids, creates 
Mahout input, then converts Mahout’s output back to external ids. It is not 
worthy of core inclusion but is at least a prototype or example of how to do 
this.
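To make point 3 concrete, here is a minimal sketch of the id-translation step: a two-way dictionary that assigns sequential integer ids (the row/column indices Mahout expects) to arbitrary external ids and translates results back. The class and method names are illustrative, not part of Mahout’s API, and this in-memory version sidesteps the hard part — in a real pipeline the dictionary itself must be built as a distributed job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Mahout code: maps external ids (e.g. from
// delimited logfiles) to dense sequential ints and back again.
public class IdDictionary {
    private final Map<String, Integer> toInternal = new HashMap<>();
    private final List<String> toExternal = new ArrayList<>();

    // Return the sequential int for an external id, assigning a new one
    // (the next unused index) if this id has not been seen before.
    public int internalId(String externalId) {
        Integer id = toInternal.get(externalId);
        if (id == null) {
            id = toExternal.size();
            toInternal.put(externalId, id);
            toExternal.add(externalId);
        }
        return id;
    }

    // Translate a Mahout output row/column index back to the external id.
    public String externalId(int internalId) {
        return toExternal.get(internalId);
    }

    public static void main(String[] args) {
        IdDictionary users = new IdDictionary();
        int a = users.internalId("user-9f3b");   // first id seen -> 0
        int b = users.internalId("user-0c71");   // next id seen  -> 1
        int a2 = users.internalId("user-9f3b");  // repeat -> same int, 0
        System.out.println(a + " " + b + " " + a2);  // 0 1 0
        System.out.println(users.externalId(1));     // user-0c71
    }
}
```

The deceptive difficulty the list item describes is exactly what this toy hides: keeping such a mapping consistent across mappers, and joining it back onto Mahout’s output, is what has to be re-created per project.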

My $0.02 worth about the future of Mahout:
1) the future will be in moving much of the current code to Spark, and that may 
not be the end of it. If yet another, faster platform emerges, Mahout will have 
to go there too. If Mahout doesn’t move (pretty quickly), someone will fill the 
gap and Mahout will be left behind.
2) the future of Mahout is tied to big data, at least I hope so.

Ask yourself this: Is Mahout a sandbox for experimentation on cutting-edge 
algorithms, or is it a scalable, performant ML library targeted at production 
environments?

I hope most people think it is the latter.
