I'll write longer, but in general, +1 to Anand.

Sent from my iPhone
> On Jul 11, 2014, at 20:54, Anand Avati <[email protected]> wrote:
>
>> On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel <[email protected]> wrote:
>>
>> Duplicated from a comment on the PR:
>>
>> Beyond these details (specific merge issues) I have a bigger problem
>> with merging this. Now every time the DSL is changed it may break
>> things in h2o-specific code. Merging this would require every
>> committer who might touch the DSL to sign up for fixing any broken
>> tests on both engines.
>>
>> To solve this, the entire data prep pipeline must be virtualized to
>> run on either engine, so that the tests for things like CF and
>> ItemSimilarity or matrix factorization (and the multitude of others
>> to come) pass and are engine independent. As it stands, any DSL
>> change that breaks the build will have to rely on a contributor's
>> fix. Even if one of you guys were made a committer, we would still
>> have this problem where a needed change breaks engine-specific code
>> for one engine or the other. Unless 99% of the entire pipeline is
>> engine neutral, the build will be unmaintainable.
>>
>> For instance, I am making a small DSL change that is required for
>> cooccurrence and ItemSimilarity to work. This would break
>> ItemSimilarity and its tests, which are in the spark module, but
>> since I'm working on that I can fix everything. If someone working on
>> an h2o-specific thing had to change the DSL in a way that broke spark
>> code like ItemSimilarity, you might not be able to fix it, and I
>> certainly do not want to fix stuff in h2o-specific code when I change
>> the DSL. I have a hard enough time keeping mine running :-)
>
> The way I interpret the above points, the problem you are trying to
> highlight is with having multiple backends in general, not with this
> backend specifically? Hypothetically, even if this backend were
> abandoned for the above "problems", as more backends get added in the
> future, the same "problems" will continue to apply to all of them.
>
>> Crudely speaking, this means doing away with all references to a
>> SparkContext and any use of it. So it's not just a matter of
>> reproducing the spark module but of reducing the need for one: making
>> it so small that breakages in one or the other engine's code will be
>> infrequent, and changes to neutral code will only rarely break an
>> engine that the committer is unfamiliar with.
>
> I think things are already very close to this "ideal" situation you
> describe. As pipeline implementors we should just use
> DistributedContext, not SparkContext. And we need an engine-neutral
> way to get hold of a DistributedContext from within the math-scala
> module, like this pseudocode:
>
>     import org.apache.mahout.math.drm._
>
>     val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
>                                       System.getenv("BACKEND_ID"), opts...)
>
> If the environment variables are not set, DistributedContextCreate
> could default to Spark and local. But all of the pipeline code should
> ideally live outside any engine-specific module.
>
>> I raised this red flag a long time ago but in the heat of other
>> issues it got lost. I don't think this can be ignored anymore.
>
> The only missing piece, I think, is having a DistributedContextCreate()
> call such as the above? I don't think things are in such a dire state,
> really... Am I missing something?
>
>> I would propose that we remain two separate projects with a mostly
>> shared DSL until the maintainability issues are resolved. This seems
>> way too early to merge.
>
> Call me an optimist, but I was hoping for more of a "let's work
> together now to make the DSL abstractions easier for future
> contributors". I will explore such a DistributedContextCreate() method
> in math-scala. That might also be the answer for test cases to remain
> in math-scala.
>
> Thanks
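
For concreteness, here is a minimal self-contained Scala sketch of how such
a DistributedContextCreate() factory could behave. Only the factory name,
the two environment variables, and the DistributedContext trait name come
from the mail above; the trait body and the two engine bindings are
illustrative stand-ins, not Mahout's actual classes:

    object DistributedContextSketch extends App {

      // Minimal stand-in for Mahout's DistributedContext trait.
      trait DistributedContext {
        def engine: String
        def close(): Unit
      }

      // Hypothetical engine bindings; real implementations would wrap a
      // SparkContext or an H2O cloud handle.
      class SparkLikeContext(master: String) extends DistributedContext {
        val engine = s"spark($master)"
        def close(): Unit = ()
      }

      class H2OLikeContext(cloud: String) extends DistributedContext {
        val engine = s"h2o($cloud)"
        def close(): Unit = ()
      }

      // Engine-neutral factory: pick the backend by name, defaulting to
      // Spark in local mode when the environment variables are unset.
      def DistributedContextCreate(backend: Option[String],
                                   backendId: Option[String]): DistributedContext =
        backend.map(_.toLowerCase).getOrElse("spark") match {
          case "spark" => new SparkLikeContext(backendId.getOrElse("local"))
          case "h2o"   => new H2OLikeContext(backendId.getOrElse("local"))
          case other   => throw new IllegalArgumentException(s"unknown backend: $other")
        }

      // Mirrors the pseudocode in the mail: backend selection is driven by
      // the environment, and pipeline code sees only the trait.
      val dc = DistributedContextCreate(sys.env.get("MAHOUT_BACKEND"),
                                        sys.env.get("BACKEND_ID"))
      println(s"running on: ${dc.engine}")
      dc.close()
    }

The point of this shape is that pipeline code depends only on the
DistributedContext trait, so adding a backend means adding one case to the
factory rather than touching engine-neutral code.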
