Why not put this argument to bed with a vote? Straw poll or not, it will make the consensus visible so we can get on with things. I know that many are on vacation now, but please take time to vote; we really need a large sample of active committers. Feel free to give a short defense of your position too. I further propose we keep this at the 1000 meter level and not start quoting code. Let’s look at the forest instead of the trees.
The choice, as far as I can tell, is:

1) Merge the h2o implementation of math-scala and the h2o modules into mainstream Mahout. I suppose this implies accepting h2o-specific code too, though someone can contradict me here.
2) Support h2o in integrating math and math-scala with their engine project (even as an artifact), and be welcoming and responsive with this support.
3) Break the DSL into its own project, give it a name like Mahout-core, and make all tests engine independent or have them live in other project code (like h2o or Flink). Then all the rest is implemented on Spark (the rest of Mahout), h2o, or Flink. This is the Linux kernel approach: many distros but one kernel.

I support #2. The reasons:

1) Engine-specific work should be done by the experts, and work done on one engine should never affect work done on another.
2) math-scala is the closest thing to an engine-independent component we have, but it is not complete. Changes to it will need to be negotiated and cannot be forced into a single commit, as they would be if breakage in h2o also broke the build.
3) Every committer should not have to understand all engines. Currently work, whether outside the DSL or not, often requires additions to the DSL and often requires committers to pick an engine or design a new abstraction. This work of finding abstractions should not be forced into a single commit.
4) Mahout gets no known advantage by merging this PR. The alternative is that h2o merges it with their project. We still get the benefit of being (at least at the algebra / R-like API / DSL level) a multi-engine project. In other words, we will have proven our stated desire to support other engines.
5) Be welcoming. Providing a key component with the optimizer and DSL (along with all future improvements) to any and all engines, and agreeing to support it and jointly work to keep it core, seems very supportive of the open source community and mentality. There are many ways to work together, and some bad ones.
6) Keeping the engine work separated by project boundaries but supported by mutual PRs will be a much more maintainable and productive way to cooperate. This is the model of choice for most modern OSS projects, especially on GitHub. Git was made for this.
7) When Flink (Stratosphere) looks at cooperating with Mahout, as they have already indicated, isn’t option #2 a much better way to deal with them too? Again, the burden of integration should be with the engine, not Mahout. By merging h2o we would be committing to merging every other viable engine. It’s a slippery slope: the DSL alone may be able to support every engine, but a core team cannot.

I don’t favor #3 because the DSL is not complete, and Mahout Spark, as its reference implementation, should have the easiest path to modify it. Maybe some day this will be the better alternative.

A word about bona fides. I’m one of a very small number of people to push Scala or Spark code. I’m working on ItemSimilarity and a framework for readers/writers for tuples and DRMs (text-delimited is the first), as well as the core cooccurrence, whose primary author was Sebastian. Plans include a revamp of the item-based recommenders based on earlier hadoop+mahout+solr work. My work is generally outside the DSL but has required several changes or additions to it.
