Seems like we need the top list to be responded to also. Agree about similarity but a completely different method is needed for cosine and the other actual distance measures. The way the old Hadoop code did it is more appropriate. I’ll put it on my list.
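To make the distinction concrete: LLR works on cooccurrence counts alone, while cosine and the other true distance measures need the actual vector magnitudes, which is why a different method is needed. A minimal in-memory sketch of column-wise cosine similarity in plain Scala (illustrative only, not the Mahout implementation and not distributed) might look like:

```scala
// Illustrative sketch only -- not the Mahout implementation.
// Cosine similarity of two item (column) vectors; unlike LLR it depends
// on the actual element magnitudes, not just cooccurrence counts.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}
```

A distributed version would compute these dot products and norms per column pair across partitions, which is presumably what the old Hadoop code did.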
> On Mar 5, 2015, at 9:46 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>
> Agree with Suneel's comments.
>
> So you're proposing these four things for 0.10, right? I'm good with these.
>
> 1) mrlegacy & scala dependency reduction and possible split
> 2) sync with most widely used Spark version (implies frequent releases to stay synced with big distros I suspect)
> 3) the release build is completely broken. No artifacts are created for scala, spark, or h2o. No hosted scaladocs are created afaik.
> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than anything like what Mahout is today.
>
> On Thu, Mar 5, 2015 at 9:31 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>
> Agree with most of the points outlined below, next steps would be to work towards 0.10.
>
>> From: Pat Ferrel <p...@occamsmachete.com>
>> To: Suneel Marthi <suneel_mar...@yahoo.com>; ap.dev <ap....@outlook.com>; Andrew Musselman <andrew.mussel...@gmail.com>
>> Sent: Thursday, March 5, 2015 12:11 PM
>> Subject: Next release
>>
>> I’d send this to @dev if it won’t turn into a public argument. Maybe leave out the wishlist?
>>
>> Hopefully people will chime in with opinions or status but here’s what it looks like to me:
>>
>> 1) The DSL needs the mrlegacy pruning that is ready but held up by external issues. This would be required if we do a project split. Also the external deps have been reduced to nearly the minimum and are written to a smallish jar in the spark module. It is possible to do more fine grained class-level shading but not sure it’s needed.
>> 2) significant DSL additions are held up by external issues but there is already SSVD, PCA, QR and pretty mature linear algebra ops.
>> 3) similarity, item (column) and row seem to be fine with LLR only, and therefore are mainly for recommender use cases.
>>>> It would be nice to generalize this to be able to use any similarity measure before next release.
>> 4) only a partial Naive Bayes pipeline for text classification is implemented in Scala, but NB itself is working; TF-IDF is in progress
>> 5) There is some distributed aggregation work that is waiting in a PR and seems to be stalled. I’d vote to see this included.
>>> +1
>> What is a minimum release?
>>
>> Sort of an odd question without a clear idea of what Mahout is. I see its future as a scalable R-like environment integrated with Scala and distributed computation engines like Spark. Put another way, it is a distributed, optimized linear algebra environment and library with some important higher level algorithms. It is general where things like MLlib do not attempt to be.
>>
>> When would you use Mahout vs MLlib or H2O? If you need deep learning, look at H2O; if you need KMeans, look at MLlib; if you require or want to mix in a general linear algebra engine, look at Mahout’s DSL since it plays well with MLlib and to some degree H2O.
>>
>> What is a minimum release given the above definition?
>>
>> Seems like polishing up the 5 things mentioned above along with:
>> 1) mrlegacy & scala dependency reduction and possible split
>> 2) sync with most widely used Spark version (implies frequent releases to stay synced with big distros I suspect)
>> 3) the release build is completely broken. No artifacts are created for scala, spark, or h2o. No hosted scaladocs are created afaik.
>> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than anything like what Mahout is today.
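For context on the LLR similarity mentioned in (3): it is Dunning's log-likelihood ratio over a 2x2 cooccurrence contingency table. A minimal sketch following the usual entropy-based formulation (illustrative only, not copied from Mahout's source):

```scala
// Sketch of Dunning's log-likelihood ratio for a 2x2 contingency table:
// k11 = both events, k12/k21 = one event only, k22 = neither.
def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x.toDouble)
def entropy(counts: Long*): Double = xLogX(counts.sum) - counts.map(xLogX).sum

def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val rowEntropy = entropy(k11 + k12, k21 + k22)
  val colEntropy = entropy(k11 + k21, k12 + k22)
  val matEntropy = entropy(k11, k12, k21, k22)
  // clamp tiny negative round-off to zero
  math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
}
```

Independent counts score near zero; strongly cooccurring items score high, which is what makes LLR robust on raw counts but inapplicable as a general distance measure.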
>>
>> Not sure we should go down this rat hole right now, so feel free to ignore this, but my intermediate-term and post-release wishlist is:
>>
>> 1) more stats and polish to the shell (savable workspaces, etc)
>> 2) some helpers/conversions to make accessing MLlib easier. For instance a few lines of code would make KMeans usable with DRMs
>> 3) a lightweight package formalization for adding new contributor-based high level algorithms—maybe along the lines of Examples which pull in code from github and include their own build mechanism.
> +1
>> 4) finish the text pipeline
> +1, would explore the new text processing features available in Lucene 5. Please don't go by how MLlib does this
>> 5) integrate Spark dataframes with DRMs and IndexedDatasets
> +1
>> 6) retire sequence files for PMML, JSON (SchemaRDD/Dataframes), CSV—whatever. These are only needed as input and output, not intermediate results anymore, so why have sequence files when supporting IO to other tools like Hive, Spark SQL, Solr/ES and others is more important?
> +100, sequence files have been Mahout's nemesis all along
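On (4), finishing the text pipeline: for reference, the TF-IDF step being discussed reduces to the following toy in-memory computation (illustrative only; the real pipeline would be distributed and, per the comment above, would likely lean on Lucene's analyzers for tokenization):

```scala
// Toy in-memory TF-IDF, illustrative only. Input: tokenized documents.
// Output: per-document term weights tf(t, d) * ln(N / df(t)).
def tfidf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
  val n = docs.size.toDouble
  // document frequency: number of docs containing each term
  val df: Map[String, Double] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => t -> xs.size.toDouble }
  docs.map { doc =>
    // term frequency within this document
    val tf = doc.groupBy(identity).map { case (t, xs) => t -> xs.size.toDouble / doc.size }
    tf.map { case (t, f) => t -> f * math.log(n / df(t)) }
  }
}
```

Terms appearing in every document get weight zero, which is the behavior the pipeline needs to wire into the NB trainer.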