Seems like the top list needs responses too.

Agree about similarity, but cosine and the other true distance measures need a 
completely different method; the way the old Hadoop code did it is more 
appropriate. I’ll put it on my list.


> On Mar 5, 2015, at 9:46 AM, Andrew Musselman <andrew.mussel...@gmail.com> 
> wrote:
> 
> Agree with Suneel's comments.
> 
> So you're proposing these four things for 0.10, right?  I'm good with these.
> 
> 1) mrlegacy & scala dependency reduction and possible split
> 2) sync with most widely used Spark version (implies frequent releases to 
> stay synced with big distros I suspect)
> 3) fix the release build, which is completely broken: no artifacts are 
> created for scala, spark, or h2o, and no hosted scaladocs afaik.
> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than 
> anything like what Mahout is today.
> 
> 
> On Thu, Mar 5, 2015 at 9:31 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> 
> Agree with most of the points outlined below; next steps would be to work 
> towards 0.10.
> 
>> From: Pat Ferrel <p...@occamsmachete.com>
>> To: Suneel Marthi <suneel_mar...@yahoo.com>; ap.dev <ap....@outlook.com>; 
>> Andrew Musselman <andrew.mussel...@gmail.com>
>> Sent: Thursday, March 5, 2015 12:11 PM
>> Subject: Next release
>> 
>> I’d send this to @dev if it won’t turn into a public argument. Maybe leave 
>> out the wishlist?
>> 
>> Hopefully people will chime in with opinions or status but here’s what it 
>> looks like to me:
>> 
>> 1) The DSL needs the mrlegacy pruning that is ready but held up by external 
>> issues; this would be required if we do a project split. Also, the external 
>> deps have been reduced to nearly the minimum and are written to a smallish 
>> jar in the spark module. It is possible to do more fine-grained class-level 
>> shading but I’m not sure it’s needed.
>> 2) significant DSL additions are held up by external issues, but there are 
>> already SSVD, PCA, QR, and pretty mature linear algebra ops (rough DSL 
>> sketch below, after this list).
>> 3) similarity, item (column) and row, seem to be fine with LLR only, and 
>> are therefore mainly for recommender use cases.
> >>>> It would be nice to generalize this to be able to use any similarity 
> >>>> measure before the next release.
> 
>> 4) Naive Bayes: only a partial pipeline for text classification is 
>> implemented in Scala, but NB itself is working; TF-IDF is in progress
>> 5) There is some distributed aggregation work that is waiting in a PR and 
>> seems to be stalled. I’d vote to see this included.
>> 
> >>> +1
> 
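
For reference on (2), here is roughly what the DSL already handles, as an 
untested sketch from memory (imports and call signatures may be slightly off):

  import org.apache.mahout.math.scalabindings._
  import org.apache.mahout.math.scalabindings.RLikeOps._
  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._
  import org.apache.mahout.math.decompositions._
  import org.apache.mahout.sparkbindings._

  implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "dsl-sketch")

  // distribute a small in-core matrix as a DRM
  val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5), (5, 6, 7)), numPartitions = 2)

  // R-like distributed linear algebra, e.g. the Gram matrix A' A
  val drmAtA = drmA.t %*% drmA

  // decompositions already in the DSL: thin QR and stochastic SVD (dspca for PCA)
  val (drmQ, inCoreR) = dqrThin(drmA)
  val (drmU, drmV, s) = dssvd(drmA, k = 2, p = 1)
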
>> What is a minimum release?
>> 
>> Sort of an odd question without a clear idea of what Mahout is. I see its 
>> future as a scalable R-like environment integrated with Scala and 
>> distributed computation engines like Spark. Put another way, it is a 
>> distributed, optimized linear algebra environment and library with some 
>> important higher-level algorithms. It is general where things like MLlib do 
>> not attempt to be.
>> 
>> When would you use Mahout vs MLlib or H2O? If you need deep learning, look 
>> at H2O; if you need KMeans, look at MLlib; if you require or want to mix in 
>> a general linear algebra engine, look at Mahout’s DSL, since it plays well 
>> with MLlib and, to some degree, H2O.
>> 
>> What is a minimum release given the above definition?
>> 
>> Seems like it would be polishing up the five things mentioned above, along with:
>> 1) mrlegacy & scala dependency reduction and possible split
>> 2) sync with most widely used Spark version (implies frequent releases to 
>> stay synced with big distros I suspect)
>> 3) fix the release build, which is completely broken: no artifacts are 
>> created for scala, spark, or h2o, and no hosted scaladocs afaik.
>> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than 
>> anything like what Mahout is today.
>> 
>> Not sure we should go down this rat hole right now, so feel free to ignore 
>> this, but my intermediate-term, post-release wishlist is:
>> 
>> 1) more stats and polish to the shell (savable workspaces, etc)
>> 2) some helpers/conversions to make accessing MLlib easier. For instance, a 
>> few lines of code would make KMeans usable with DRMs (first sketch after 
>> this list)
>> 3) a lightweight package formalization for adding new contributor-based 
>> high-level algorithms, maybe along the lines of Examples, which pull in 
>> code from github and include their own build mechanism.
> +1
>> 4) finish the text pipeline
> +1, would explore the new text processing features available in Lucene 5. 
> Please don't go by how MLlib does this
>> 5) integrate Spark DataFrames with DRMs and IndexedDatasets (second sketch 
>> after this list)
> +1
>> 6) retire sequence files in favor of PMML, JSON (SchemaRDD/DataFrames), 
>> CSV, whatever. These are now only needed for input and output, not 
>> intermediate results, so why keep sequence files when supporting IO to 
>> other tools like Hive, Spark SQL, Solr/ES, and others is more important?
>> 
> +100, SequenceFiles have been Mahout's nemesis all along
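
On wishlist item 2, this is the kind of helper I mean, as a rough untested 
sketch; the accessor for the underlying row RDD of a Spark-backed DRM is an 
assumption here and may be named differently:

  import org.apache.mahout.math.drm.DrmLike
  import org.apache.mahout.sparkbindings._
  import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
  import org.apache.spark.mllib.linalg.{Vectors => MLVectors}

  // Run MLlib's KMeans on the rows of a Spark-backed DRM.
  def kmeansOnDrm(drmA: DrmLike[Int], k: Int, iterations: Int = 20): KMeansModel = {
    // assumed: the checkpointed Spark DRM exposes its RDD[(key, mahoutVector)]
    val rowRdd = drmA.checkpoint().rdd
    val points = rowRdd.map { case (_, v) =>
      MLVectors.dense(Array.tabulate(v.size)(i => v.get(i)))
    }
    KMeans.train(points, k, iterations)
  }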
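
And on wishlist item 5, wrapping a DataFrame (or SchemaRDD) into a DRM could 
look roughly like this, again untested; the (intKey, f1, ..., fn) column 
layout is just an example:

  import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
  import org.apache.mahout.sparkbindings._
  import org.apache.spark.sql.DataFrame

  // Wrap a DataFrame laid out as (intKey, f1, f2, ..., fn) into a DRM.
  // A real helper would inspect the schema instead of assuming the layout.
  def dataFrameToDrm(df: DataFrame) = {
    val ncol = df.columns.length - 1
    val rows = df.rdd.map { r =>
      val vec: MahoutVector = new DenseVector(Array.tabulate(ncol)(i => r.getDouble(i + 1)))
      r.getInt(0) -> vec
    }
    drmWrap(rows, ncol = ncol)
  }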