On Saturday, March 8, 2014 5:41 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
Ah, now back to freely babbling on the dev list.
Mahout wishlist:
1) scaling: I don’t get the need for R integration or running without Hadoop
or Spark. You can run Hadoop in local mode on your native file system even
using a debugger--then run the exact same code on a cluster. If you don’t care
about scaling there are plenty of great libs for R already, why worry about
Mahout? One project I worked on started with the in-memory recommender but
within months had hopelessly outgrown it. If there isn’t at least a path to
scaling we would never have started with Mahout. Non-scalable code is fine and
solves many applications but I hope it’s not the primary design point.
2) speed: read below. Hadoop now (where speed means buying more computers),
more Spark later (buy fewer computers).
3) ease of data input/output. The conversion of external ids into Mahout
sequential integers is deceptively difficult and has to be re-created with
every project. I’m trying to submit an example, which includes an input/output
pipeline that is mostly scalable. It takes delimited logfiles with external
ids, creates Mahout input, then takes the output of Mahout and converts back to
external ids. It is not worthy of core inclusion but is at least a prototype or
example of how to do this.
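
To make item 3 concrete, here is a minimal sketch of the id round trip, not the example being submitted. It assumes two-column tab-delimited log lines (user id, item id) and keeps the dictionaries in memory, so it does not scale; on a cluster the same mapping would have to be built as a distributed job. The names IdDictionary, to_mahout_input and to_external are hypothetical and are not part of Mahout.

```python
import csv


class IdDictionary:
    """Two-way map between external string ids and sequential integers.

    Hypothetical helper for illustration; held entirely in memory.
    """

    def __init__(self):
        self._to_int = {}   # external id -> sequential integer
        self._to_ext = []   # sequential integer -> external id

    def encode(self, external_id):
        """Return the integer for an external id, assigning the next one if new."""
        idx = self._to_int.get(external_id)
        if idx is None:
            idx = len(self._to_ext)
            self._to_int[external_id] = idx
            self._to_ext.append(external_id)
        return idx

    def decode(self, idx):
        """Return the external id that was assigned this integer."""
        return self._to_ext[idx]


def to_mahout_input(log_lines, users, items, delimiter="\t"):
    """Turn delimited (user, item) log lines into integer id pairs."""
    rows = []
    for user, item in csv.reader(log_lines, delimiter=delimiter):
        rows.append((users.encode(user), items.encode(item)))
    return rows


def to_external(pairs, users, items):
    """Map integer (user, item) pairs back to their external ids."""
    return [(users.decode(u), items.decode(i)) for u, i in pairs]


users, items = IdDictionary(), IdDictionary()
log = ["u-alice\tsku-9", "u-bob\tsku-7", "u-alice\tsku-7"]

encoded = to_mahout_input(log, users, items)
print(encoded)  # [(0, 0), (1, 1), (0, 1)]
print(to_external(encoded, users, items))
```

The deceptively difficult part is that both dictionaries have to be persisted alongside the model, otherwise the integers in the output cannot be mapped back.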
My $0.02 worth about the future of Mahout:
1) the future will be in moving lots of the current code to Spark and that may
not be the end of it. If yet another faster platform emerges Mahout will have
to go there too. If Mahout doesn’t move (pretty quickly) someone will fill the
gap and Mahout will be left behind.
2) the future of Mahout is tied to big data, at least I hope so.
Ask yourself this: Is Mahout a sandbox for experimentation on cutting edge
algorithms or is Mahout a scalable, performant ML library that is targeted for
production environments?
>> Agree with the latter and given that the future is moving existing
>> implementations to Spark, all the more reason to make Mahout less of an
>> experimental sandbox.
I hope most people think it is the latter.