On Fri, Mar 14, 2014 at 10:38 AM, Pat Ferrel <[email protected]> wrote:
> So with very little work we could have RSJ, Matrix ops, SSVD+PCA running
> on Spark in the mainline of Mahout? Honestly?

What makes you doubt it? There's a unit test there that runs it in local mode. Good benchmarking is what it lacks, of course. It may require some presplit tuning (e.g. for cases when HDFS splits are so large that they affect the run time of an individual task), but that's an improvement, as everything is. The point is that since the programming model is very palatable, one would be able to tweak these things with ease.

The distributed PCA version, I think, was not yet committed though. I think it still sits on the dev branch. It is not very active development, more like a POC. But given the results of other ML projects on Spark, I don't see much reason to think the performance will be significantly different from those that already run on Spark. Again, it is more about the environment optimizer, ease of use and prototyping, and the programming model.
