My take is that legacy is just a module (aka a Maven artifact), just like it is now. We just need to re-route (cut) the dependencies on it.
On Fri, Mar 6, 2015 at 2:56 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> The simplest way to split the project is into engines: hadoop and spark. What
> is happening with H2O? Is it being used? Flink isn’t anything like ready for
> a release.
>
> Again the simplest would be two packaged builds, one for legacy stuff, which
> would not require Scala or Spark at all.
>
> The other would be a maven based Scala + Spark + java math module. So this
> would be mostly Scala with only the math module overlap. It requires the
> refactoring work that Dmitriy has done, which would make it stand-alone. An
> sbt build is clearly optional here but would be in keeping with our all-in
> Scala approach. Personally I like sbt a lot better than maven but it is less
> mature.
>
> The benefit would be:
> 1) potentially separate release schedules, hadoop not so often and eventually
> not at all, spark every few days if you follow their schedule (not suggesting
> this)
> 2) much faster build times for either branch; as anyone knows, building with
> tests is starting to take a long time.
> 3) possible use of new tool chain like sbt in scala branch
> 4) much simpler launcher script; mahout’s is getting to be a mess and doesn’t
> run at all on Windows. Requiring it to support both engines is not making things
> easy and much work goes into getting around old ideas like the classpath and
> job.jars. Creating one for each engine would seem to reduce complexity.
> 5) easier to support. If we really are going to have 4 engines the current
> build and launch mechanisms along with release schedules can’t really be
> maintained and even 2 is ugly.
>
> On Mar 6, 2015, at 11:52 AM, Suneel Marthi <suneel.mar...@gmail.com> wrote:
>
> On Fri, Mar 6, 2015 at 1:41 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>
>>
>> On 03/06/2015 12:44 PM, Pat Ferrel wrote:
>>
>>> This is great.
>>>
>>> So we’ve talked about a name change and shortly we’ll be forced to come
>>> up with something that describes what Mahout has become. Most past users
>>> think of it as a scalable ML library on Hadoop. That may describe
>>> Mahout-Legacy but it seems like we need a name for the Scala
>>> DSL/Spark/other? part of the project. Lots of projects have sub-projects so
>>> we know there is no issue with naming sub-projects. So my question to
>>> everyone is:
>>>
>>> Should (or can) the Top Level Project be renamed? If so to what?
>>>
>> I don't like the idea of a top level name change. I think that it would
>> be a much better idea to direct our resources at polishing and developing
>> what we have now. As well, especially for this release, I think that it
>> would do a disservice to the "legacy" components (which as you point out
>> have not been deprecated) with ~45 completed bugfixes and several more in
>> the pipe.
>>
>> I don't like the idea of renaming Mahout either and agree with AP.
>
>>
>>> If we don’t rename the TLP then what should we call legacy (not very
>>> appealing) and scala/DSL (not a name really)
>>>
>> Agreed. Legacy is not the most appealing name. Maybe something like
>> Mahout-MapReduce? Though that could cause some confusion regarding the "no
>> new MapReduce code"
>>
>> My opinion:
>>> Since we are deemphasizing legacy I’m not sure there is a need to call
>>> attention to it by giving it a subproject name. However it is not
>>> deprecated so we need to include it in releases and even fix the minimum of
>>> critical bugs for some time to come.
>>>
>> Agreed regarding fixing critical legacy bugs. Looking through the issues
>> last night there didn't seem to be a lot of critical bugs, and probably a
>> good amount of issues can be closed out as won't fix/not an issue.
>>
>
> +1
>
>
>>
>>> Mahout is getting beat up in the circles of those who talk about such
>>> things and much of this is because people don’t understand what it has
>>> become.
>>> Therefore I’d like to see a project rename to reset expectations.
>>> Leave the name Mahout for legacy stuff and give a new name to the Scala
>>> environment. Split the builds and create new docs for the Scala stuff. This
>>> would seem to make it easier to document since legacy is most of what the
>>> CMS documents, we could create a whole new template for the new project name.
>>>
>> What is the upside to splitting the builds? I'm not against it; I'm just
>> not sure I understand.
>>
>>>
>>> Failing this, many of the same benefits could be gained by creating
>>> legacy and scala sub-projects with better names. This I know we can do and
>>> recall that things like MLlib are generally not tied to Spark when speaking
>>> about them. So a subproject could have very much its own identity.
>>>
>>> Looking at the long history of Mahout it seems like the current
>>> generality was hard gained through implementing many special purpose
>>> algorithms, some of which were grad student projects. This is where MLlib
>>> is today in some ways. So a general framework and environment makes a lot
>>> of sense as the evolution of Mahout. Let’s give it a name, something better
>>> than DSL.
>>>
>> I think that a pretty clear description of what the other side of the
>> project is has been emerging recently. IMO we need to start getting it out
>> there. Probably a good start would be to update the front page of the
>> mahout site.
>
>
> +1
>
>> I don't have any good ideas regarding names for this side of the project.
>>
>>
>>
>>> On Mar 5, 2015, at 7:43 PM, Andrew Musselman <andrew.mussel...@gmail.com>
>>> wrote:
>>>
>>> Thanks AP
>>>
>>> On Thursday, March 5, 2015, Andrew Palumbo <ap....@outlook.com> wrote:
>>>
>>> I went through all of the unresolved JIRA issues and marked all with at
>>>> least a "legacy" or "scala" (for lack of a better name for all that is
>>>> not legacy) label. Hopefully I got them all.
>>>>
>>>> Some are labelled with both (math, build, documentation related to both
>>>> or neither, etc.)
>>>>
>>>> "scala" issues:
>>>>
>>>> https://issues.apache.org/jira/browse/MAHOUT-1522?jql=project%20%3D%20MAHOUT%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20scala%20ORDER%20BY%20priority%20DESC
>>>>
>>>> legacy issues:
>>>>
>>>> https://issues.apache.org/jira/browse/MAHOUT-1522?jql=project%20%3D%20MAHOUT%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20legacy%20ORDER%20BY%20priority%20DESC
>>>>
>>>> Hopefully this will help us get started closing up some old issues. I'll
>>>> try to make another pass over them tomorrow and try to find some that
>>>> need to be closed out.
>>>>
>>>>
>>
>