Call for vote on integrating h2o

Pat Ferrel Sat, 12 Jul 2014 10:59:31 -0700

Why not put this argument to bed with a vote? Straw poll or not, it will make the 
consensus visible so we can get on with things. I know that many are on 
vacation now, but please take time to vote; we really need a large sample of 
active committers. Feel free to give a short defense of your position too. I 
further propose we keep this at the 1000 meter level and not start quoting 
code; let’s look at the forest instead of the trees.


The choice as far as I can tell is:

1) merge the h2o implementation of math-scala and h2o modules into mainstream 
Mahout. I suppose this implies accepting h2o specific code too, though someone 
can contradict me here.
2) support h2o in integrating math and math-scala with their engine project 
(even as an artifact) and be welcoming and responsive with this support.
3) break the DSL into its own project, give it a name like Mahout-core, make 
all tests engine independent or move them into other project code (like h2o or 
Flink). Then All-the-rest implements on Spark (the rest of Mahout), h2o, or 
Flink. This is the Linux kernel approach: many distros but one kernel.

I support #2. The reasons:

1) engine specific work should be done by the experts and work done on one 
engine should never affect work done on another. 
2) math-scala is the closest thing we have to an engine-independent module, but 
it is not complete. Changes to it will need to be negotiated and cannot be forced 
into a single commit, as they would be if breakage in h2o also broke the build.
3) Committers should not have to understand every engine. Currently work, 
outside the DSL or not, often requires additions to the DSL and also often 
requires the committers to pick an engine or design a new abstraction. This work 
of finding abstractions should not be forced into a single commit.
4) Mahout gets no known advantage by merging this PR. The alternative is that 
h2o merges it into their project. We still get the benefit of being (at least at 
the algebra / R-like API / DSL level) a multi-engine project. In other words, we 
will have proven our stated desire to support other engines.
5) Be welcoming. Providing a key component with the optimizer and DSL (along 
with all future improvements) to any and all engines, and agreeing to support it 
and jointly work to keep it core, seems very supportive of the open source 
community and mentality. There are many ways to work together and some bad ones.
6) Keeping the engine work separated by project boundaries but supported by 
mutual PRs will be a much more maintainable and productive way to cooperate.  
This is the model of choice for most modern OSS projects, especially on GitHub. 
Git was made for this.
7) When Flink (Stratosphere) looks at cooperating with Mahout, as they have 
already indicated, isn’t option #2 a much better way to deal with them too? 
Again, the burden of integration should be with the engine, not Mahout. By 
merging h2o we would be committing to merging every other viable engine. It’s a 
slippery slope that the DSL alone may be able to pull off, but not a core team 
supporting every engine.

I don’t favor #3 because the DSL is not complete, and Mahout Spark, as its 
reference implementation, should have the easiest path to modify it. Maybe some 
day this will be the better alternative.

A word about bona fides. I’m one of a very small number of people to push Scala 
or Spark code. I’m working on ItemSimilarity and a framework for 
readers/writers for tuples and DRMs (text-delimited is the first) as well as 
the core cooccurrence, whose primary author was Sebastian. Plans include a 
revamp of the item-based recommenders based on earlier hadoop+mahout+solr work. 
My work is generally outside the DSL but has required several changes or 
additions to it.
