Yes to splitting builds into legacy and scala (mostly). D can speak to his stuff better, but it sounds like the Java Math module will be required but nothing from mrlegacy, afaik. So a legacy and ??? build would overlap in the one module. We talked about using sbt, but I’m not sure that’s required for a release; whatever is easiest. I’d rather make a clean break and use the best-suited tools since we are all-in with Scala, but I don’t know if there’s any need to rush.
+1 to Jira cleanup.

On Mar 5, 2015, at 10:20 AM, Andrew Palumbo <ap....@outlook.com> wrote:

I agree as well with pretty much everything. Though I'm not sure exactly what you mean by a split between mrlegacy and scala. If you're talking about complete compartmentalization between these two sides of the project for this release, I'm all for it. I think the only intersection at the moment is in spark-shell and spark (there's nothing in math, right?), and they should be completely split with Dmitriy's refactoring (he said he was working on this).

>> 3) the release build is completely broken. No artifacts are created for
>> scala, spark, or h2o. No hosted scaladocs are created afaik.

right now i think only mrlegacy docs are being published.

>> 4) Naive Bayes only partial pipeline for text classification is implemented
>> in Scala but NB itself is working, TF-IDF in progress

>> 4) finish the text pipeline
>> +1, would explore the new text processing features available in Lucene 5.
>> Please don't go by how MLlib does this

agreed. also +1 for Lucene for text processing. I've been looking into this and we talked a bit about implementing a Lucene analyzer based vectorizer for text in the spark module. I've been thinking about trying to work something up that would support both IndexedDatasets and SchemaRDDs but have to get more familiar with both.

>> 5) There is some distributed aggregation work that is waiting in a PR and
>> seems to be stalled. I’d vote to see this included.

also +1

>> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than
>> anything like what Mahout is today.

+1 -- very important. we should really have a template or some standard to make the docs easier to follow.

>> 1) more stats and polish to the shell (savable workspaces, etc)

+1 to visualization here also.

Yes also agree to getting a release out. We do have several MRLegacy bugfixes as well.
I counted 45 the other day since the last release, with several in JIRA; I've been meaning to post to the dev list about this. I think it would be good for us to get back to using JIRA more regularly again, with a release coming up (and to clean out the backlog of won't-fixes).