Hi, I am trying to learn the key differences between Mahout ML and Spark ML, and then the Mahout-Spark integration, specifically for clustering algorithms. I learned from forums and blog posts that one of the major differences is that Mahout runs as a batch process, while Spark is backed by streaming APIs. But I also see that a Mahout-Spark integration exists, so I'm slightly confused. What are the major differences I should consider or look into?
Background: I'm working on a new research project that requires clustering documents (50M web pages for now), and the focus is solely on clustering algorithms and the LSH implementation. So far I have started experimenting with Mahout k-means (the standalone version, not streaming k-means) and have also looked into LSH, which is again available in both frameworks, which is why these questions are coming up now. Looking forward to hearing thoughts and insights from all users here. Thank you.