Mahout has changed a lot in the past couple of years, becoming more focused on data workers and scientists who need to experiment with large matrix-math problems. To that end we've broadened the execution engines that distribute the computation to include Spark and Flink, and we're weighing how many pre-built algorithms to include in the library versus working on performance behind the scenes.
There is a new declarative language that is R/MATLAB-like and allows for interactive sessions at scale; see the "Mahout-Samsara" tab in the navigation on the home page http://mahout.apache.org. This book, written by two of the major contributors to the new declarative language, is worth a look: https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785

Thanks for your interest; we'll be happy to help as you proceed if you have any other questions.

On Fri, Sep 16, 2016 at 5:03 PM, Reth RM <reth.ik...@gmail.com> wrote:

> Hi,
>
> I am trying to learn the key differences between Mahout ML and Spark ML,
> and then the Mahout-Spark integration, specifically for clustering
> algorithms. I learned through forums and blog posts that one of the major
> differences is that Mahout runs as a batch process while Spark is backed
> by streaming APIs. But I do see a Mahout-Spark integration as well, so
> I'm slightly confused and would like to know the major differences that
> should be considered (looked into).
>
> Background:
> I'm working on a new research project that requires clustering of
> documents (50M webpages for now), and the focus is only on clustering
> algorithms and the LSH implementation. Right now, I have started
> experimenting with Mahout k-means (standalone, not streaming k-means) and
> have also looked into LSH, which is again available in both frameworks,
> so the above questions are arising at this point.
>
> Looking forward to hearing thoughts and insights from all users here.
> Thank you.
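To make the LSH idea from the question above concrete: a common scheme for grouping similar documents is random-hyperplane hashing for cosine similarity, where each hyperplane contributes one bit of a signature and documents with equal signatures become candidate neighbors. This is only a minimal plain-Python sketch of that general technique; the function names are made up for the example, and it is not Mahout's or Spark's implementation.

```python
import random

def make_hyperplanes(num_planes, dim, seed=42):
    # Each hyperplane is a random Gaussian vector; num_planes controls
    # the signature length (more planes -> finer buckets).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_planes)]

def lsh_signature(vec, planes):
    # One bit per hyperplane: which side of the plane the vector lies on.
    # The sign of the dot product is invariant under positive scaling, so
    # vectors pointing the same way get the same signature.
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def bucket(docs, planes):
    # Group documents by signature; only documents sharing a bucket need
    # an exact similarity comparison, which is the point of LSH at scale.
    buckets = {}
    for doc_id, vec in docs.items():
        buckets.setdefault(lsh_signature(vec, planes), []).append(doc_id)
    return buckets
```

In practice one would use several independent hash tables (several sets of hyperplanes) so that near neighbors that disagree on a single bit still collide somewhere; a distributed engine like Spark would compute the signatures per partition and shuffle by bucket key.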