Re: H2O integration - completion and review
This is fine, but why is Mahout integrating all of the engines? Isn't that the job of the engine people? Anand and Dmitriy have already said they are not interested in anything but the DSL, so give it to any project that wants it and agree to support it with them. Be welcoming! Committing this will affect every committer who plans to work with the DSL because it's incomplete, and changes to it will often break the build in two places. Also, everything done outside of the core DSL (much of it is still engine dependent) will be difficult or impossible to make engine independent without a fair amount of work. To me this all argues for treating the core math as an engine-independent module that we provide to other projects. This keeps the engine-dependent work in the engine projects. Since this is a very special commit I propose a vote. This is one where a single committer should not have the power to put such a burden on all development.

On Jul 11, 2014, at 9:12 PM, Ted Dunning wrote:

On Fri, Jul 11, 2014 at 2:00 PM, Dmitriy Lyubimov wrote: > And most importantly, as i always was saying, the main benefit to me is not > that there's a CF algorithm in Mahout, but that i can write a custom one of > my own with less effort than just writing directly to any given engine api. > That part is very conceptual (drm api-dependent) and would be common > regardless of my deployment infrastructure. In other words, everyone can > just write their own co-occurrence analysis version hopefully easier than > just writing it directly for Spark or directly to H2O if he/she wanted to. > that's the real story (at least in my talk). >

I agree with this 110%. In fact, I would argue that if all the engines worked just the same there would be absolutely no reason to have more than one. It is precisely the differences that make having more than one important.
The goal of the DSL is to minimize the gratuitous aspects of those differences so that you can relatively easily make use of the virtues of one engine or another.
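Ted's point can be made concrete with a small, self-contained sketch of the write-once idea. `DrmLike`, `InCoreDrm`, and `cooccurrence` here are hypothetical stand-ins, not the actual math-scala API; the point is only that the algebraic expression `a.t %*% a` names no engine:

```scala
// Hypothetical stand-in for the math-scala DRM abstraction.
trait DrmLike {
  def t: DrmLike                    // transpose
  def %*%(other: DrmLike): DrmLike  // matrix multiply
  def collect: Array[Array[Double]] // materialize in core
}

// A toy in-memory "engine"; a Spark or H2O backend would implement the
// same trait over its own distributed representation.
case class InCoreDrm(rows: Array[Array[Double]]) extends DrmLike {
  def t: DrmLike =
    InCoreDrm(rows.head.indices.map(j => rows.map(_(j))).toArray)
  def %*%(other: DrmLike): DrmLike = {
    val b = other.collect
    InCoreDrm(rows.map { r =>
      b.head.indices.map(j => r.indices.map(k => r(k) * b(k)(j)).sum).toArray
    })
  }
  def collect: Array[Array[Double]] = rows
}

// The engine-independent part: A'A, the algebraic heart of
// cooccurrence analysis. Nothing here mentions Spark or H2O.
def cooccurrence(a: DrmLike): DrmLike = a.t %*% a
```

Because `cooccurrence` is written only against the trait, it runs unchanged on whichever backend supplies the operators; only the operators' implementations differ per engine.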
Re: H2O integration - completion and review
On Fri, Jul 11, 2014 at 2:00 PM, Dmitriy Lyubimov wrote: > And most importantly, as i always was saying, the main benefit to me is not > that there's a CF algorithm in Mahout, but that i can write a custom one of > my own with less effort than just writing directly to any given engine api. > That part is very conceptual (drm api-dependent) and would be common > regardless of my deployment infrastructure. In other words, everyone can > just write their own co-occurrence analysis version hopefully easier than > just writing it directly for Spark or directly to H2O if he/she wanted to. > that's the real story (at least in my talk). >

I agree with this 110%. In fact, I would argue that if all the engines worked just the same there would be absolutely no reason to have more than one. It is precisely the differences that make having more than one important. The goal of the DSL is to minimize the gratuitous aspects of those differences so that you can relatively easily make use of the virtues of one engine or another.
Re: H2O integration - completion and review
The concern would be if it creates fragmentation for the project, as in, say, a collection of semi-consistent things (Mahout 1.0). But there's more commonality in that effort than not. E.g. there are pure algebraic algorithms in math-scala that one can run to compare how they would behave in both cases (for the most part, I assume it is a function of in-core algebra though). And most importantly, as I always was saying, the main benefit to me is not that there's a CF algorithm in Mahout, but that I can write a custom one of my own with less effort than just writing directly to any given engine api. That part is very conceptual (drm api-dependent) and would be common regardless of my deployment infrastructure. In other words, everyone can just write their own co-occurrence analysis version hopefully more easily than just writing it directly for Spark or directly to H2O if he/she wanted to. That's the real story (at least in my talk).

On Fri, Jul 11, 2014 at 1:50 PM, Pat Ferrel wrote: > So given what Dmitriy said and Anand's answers below why are we trying to > merge this into Mahout? > > Why don’t you Anand/0xdata create Mahout2O and take what has been > delivered. We create Mahout v2 for Spark. We keep Spark specific code in > the spark module. You get the algebraic DSL and all the java math stuff. > You get all the future solver optimizations and anything you want. > > The benefit of keeping things separate for now are that Mahout2O only has > to deal with h2o duplicated code and Mahout v2 only has to deal with Spark. > You deal with h2o optimization and we deal with spark. You know the old > saying that it’s not 2x it’s x^2. > > This seems so obvious. If in the future we work out the impedance > mismatch, IO engine neutrality, etc then we talk again of merge. > > For that matter I’d be interested in seeing how to make ItemSimilarity > work on Mahout2O. A performance comparison would help push this one way or > the other.
> > On Jul 11, 2014, at 1:36 PM, Dmitriy Lyubimov wrote: > > On Fri, Jul 11, 2014 at 1:28 PM, Anand Avati wrote: > > > > > c) abandon the pretense/goal that Mahout aims to be backend independent > and > > admit/become Spark specific. > > > > > For the record, there has never been said the work is backend-independent > for _anything_. The claim has always been much more modest. It was said it > was backend-independent for R-(matlab)-like algebraic expressions. which it > is. And it is not even main side of the story. > > I suspect there are couple of more areas in general math beyond "base" R > algebra where abstractions can also be built and be useful and be engine > independent. > > Like i said, the true solution is probably ports of non-algebraic portions > of quasi-algebraic solution (i.e. b + doing something h2o specific for that > work if desired). Smart componentization of concerns may (or may not) go a > long way here (just like in tests). > >
Re: H2O integration - completion and review
So given what Dmitriy said and Anand's answers below why are we trying to merge this into Mahout? Why don’t you Anand/0xdata create Mahout2O and take what has been delivered. We create Mahout v2 for Spark. We keep Spark specific code in the spark module. You get the algebraic DSL and all the java math stuff. You get all the future solver optimizations and anything you want. The benefit of keeping things separate for now is that Mahout2O only has to deal with h2o duplicated code and Mahout v2 only has to deal with Spark. You deal with h2o optimization and we deal with spark. You know the old saying that it’s not 2x it’s x^2. This seems so obvious. If in the future we work out the impedance mismatch, IO engine neutrality, etc then we talk again of merge. For that matter I’d be interested in seeing how to make ItemSimilarity work on Mahout2O. A performance comparison would help push this one way or the other.

On Jul 11, 2014, at 1:36 PM, Dmitriy Lyubimov wrote:

On Fri, Jul 11, 2014 at 1:28 PM, Anand Avati wrote: > > c) abandon the pretense/goal that Mahout aims to be backend independent and > admit/become Spark specific. > > For the record, it has never been said that the work is backend-independent for _anything_. The claim has always been much more modest. It was said it was backend-independent for R-(matlab)-like algebraic expressions, which it is. And that is not even the main side of the story. I suspect there are a couple more areas in general math beyond "base" R algebra where abstractions can also be built and be useful and be engine independent. Like I said, the true solution is probably ports of the non-algebraic portions of a quasi-algebraic solution (i.e. option (b), plus doing something h2o specific for that work if desired). Smart componentization of concerns may (or may not) go a long way here (just like in tests).
Re: H2O integration - completion and review
On Fri, Jul 11, 2014 at 1:36 PM, Dmitriy Lyubimov wrote: > > Like i said, the true solution is probably ports of non-algebraic portions > of quasi-algebraic solution (i.e. b + doing something h2o specific for that > work if desired). Smart componentization of concerns may (or may not) go a > long way here (just like in tests). > I find that a reasonable stance. I too don't see a reason to fret that drivers don't work on all backends.
Re: H2O integration - completion and review
On Fri, Jul 11, 2014 at 1:28 PM, Anand Avati wrote: > > c) abandon the pretense/goal that Mahout aims to be backend independent and > admit/become Spark specific. > >

For the record, it has never been said that the work is backend-independent for _anything_. The claim has always been much more modest. It was said it was backend-independent for R-(matlab)-like algebraic expressions, which it is. And that is not even the main side of the story. I suspect there are a couple more areas in general math beyond "base" R algebra where abstractions can also be built and be useful and be engine independent. Like I said, the true solution is probably ports of the non-algebraic portions of a quasi-algebraic solution (i.e. option (b), plus doing something h2o specific for that work if desired). Smart componentization of concerns may (or may not) go a long way here (just like in tests).
Re: H2O integration - completion and review
You are speaking of two issues in the same breath.

1. Duplication of test case code. This is already being addressed by Dmitriy's https://github.com/apache/mahout/pull/28. With that change all the algo test code will co-reside with the algos in math-scala.

2. Driver code for various backends. As you show in the code snippet below, you are clearly using Spark specific calls (mc.textFile()) and bypassing the DSL. This code is obviously Spark specific and will not run on any other backend. Now, are you asking that you want to continue using Spark specific functionality in the driver code, and therefore how do you reconcile that with multiple backends? I see only three options:

a) re-implement the driver to use only the DSL and avoid making backend specific calls in the driver (not sure that is possible.)

b) continue with Spark specific calls in your driver and have a per-backend driver for each algo. This probably makes sense in a way, as not all algos run best on all backends. Just having the core of the algo backend independent is nice enough by itself. So don't sweat about the full pipeline not working everywhere (i.e. it need not).

c) abandon the pretense/goal that Mahout aims to be backend independent and admit/become Spark specific.

The choice is not mine though.

On Fri, Jul 11, 2014 at 12:53 PM, Pat Ferrel wrote: > > > > On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel > wrote: > > > >> Duplicated from a comment on the PR: > >> > >> Beyond these details (specific merge issues) I have a bigger problem > with > >> merging this. Now every time the DSL is changed it may break things in > h2o > >> specific code. Merging this would require every committer who might > touch > >> the DSL to sign up for fixing any broken tests on both engines.
> >> > >> To solve this the entire data prep pipeline must be virtualized to run > on > >> either engine so the tests for things like CF and ItemSimilarity or > matrix > >> factorization (and the multitude of others to come) pass and are engine > >> independent. As it stands any DSL change that breaks the build will > have to > >> rely on a contributor's fix. Even if one of you guys was made a > committer > >> we will have this problem where a needed change breaks one or the other > >> engine specific code. Unless 99% of the entire pipeline is engine > neutral > >> the build will be unmaintainable. > >> > >> For instance I am making a small DSL change that is required for > >> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity > >> and its tests, which are in the spark module but since I’m working on > that > >> I can fix everything. If someone working on an h2o specific thing had to > >> change the DSL in a way that broke spark code like ItemSimilarity you > might > >> not be able to fix it and I certainly do not want to fix stuff in h2o > >> specific code when I change the DSL. I have a hard enough time keeping > mine > >> running :-) > >> > > > > The way I interpret the above points, the problem you are trying to > > highlight is with having multiple backends in general, and not this > backend > > in specific? Hypothetically, even if this backend is abandoned for the > > above "problems", as more backends get added in the future, the same > > "problems" will continue to apply to all of them. > > > > yes, exactly. Adding backends is only maintainable if backend specific > code (code > in the spark module for now) is squeezed down to near zero. The more that > is there > the more code there will be duplicated in the h2o modules. Test breakage > illustrates > the problem it does not express the breadth or depth of the problem. > > > > >> Crudely speaking this means doing away with all references to a > >> SparkContext and any use of it. 
So it's not just a matter of reproducing > >> the spark module but reducing the need for one. Making it so small that > >> breakages in one or the other engines code will be infrequent and > changes > >> to neutral code will only rarely break an engine that the committer is > >> unfamiliar with. > >> > > > > I think things are already very close to this "ideal" situation you > > describe above. As a pipeline implementor we should just use > > DistributedContext, and not SparkContext. And we need an engine neutral > way > > to get hold of a DistributedContext from within the math-scala module, > like > > this pseudocode: > > > > import org.apache.mahout.math.drm._ > > > > val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"), > > System.getenv("BACKEND_ID"), opts...) > > > > If environment variables are not set, DistributedContextCreate could > > default to Spark and local. But all of the pipeline code should ideally > > exist outside any engine specific module. > > The Readers and Writers rely on > > var columns = mc.textFile(source).map { line => line.split(delimiter) } > > This will not run unless the DistributedContext is actually implemented by > SparkContext. > > Running item similarity on epinions dataset requires Spark executor memory to be 5g in the SparkConf so this has to be passed in to Spark, what is it for h2o?
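The `DistributedContextCreate` pseudocode above can be sketched concretely. This is a toy, self-contained model rather than Mahout code: the real factory would live in math-scala and load whichever engine module is on the classpath, and the backend names here are assumptions:

```scala
// Engine-neutral handle; the real DistributedContext in math-scala
// carries much more (DRM I/O, broadcast variables, etc.).
trait DistributedContext {
  def engineName: String
}

// Toy factory: pick the backend from an optional environment value,
// defaulting to Spark when nothing is set, as the pseudocode suggests.
object DistributedContextCreate {
  def apply(backend: Option[String]): DistributedContext = {
    val name = backend.map(_.toLowerCase).getOrElse("spark")
    require(Set("spark", "h2o").contains(name), s"unknown backend: $name")
    new DistributedContext { val engineName = name }
  }
}

// Pipeline code would then obtain a context with no engine named:
//   val dc = DistributedContextCreate(sys.env.get("MAHOUT_BACKEND"))
```

The design choice being debated in this thread is exactly how far such a trait can go: an `engineName` and DRM algebra are easy; `textFile`-style raw I/O and engine tuning knobs are the parts that keep leaking Spark specifics into drivers.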
Re: H2O integration - completion and review
The initial idea is to provide only algebraic independence. We may want to ask engines to support persistence operations to/from HDFS, as a couple dozen projects dealing with distributed data already do, but in general algebraic expressions are agnostic to how inputs come into existence. When building an e2e distributed application, naturally, algebra is not enough. Heck, it is not even enough for moderately involved logic inside an algorithm, so quasi-algebraic algorithms are expected. This is an obstinate reality. But engine independence, or even partial portability, is only one side of the story, and it is not the biggest one. So the hope is that (1) the algebraic part is hopefully still significant enough that the non-algebraic part of an algorithm can be more easily ported if needed; or (2) for folks like me, one version of an algorithm is quite enough, and the engine independence side of the story becomes a much smaller story, making other sides of the story (i.e. the convenience and semantics of the algebraic translation itself) much more prominent. What this means is that quasi-portable algorithms are expected to happen, and I wouldn't be overly heartbroken about adding things only to the Spark side of things, either as a first port, or even for good. After all, I am all for solving problems that actually exist. I probably have a need for cooccurrence work with a Spark deployment, but I have no need for CF on H2O, so I wouldn't care if no quasi-port exists for h2o. Folks who do are welcome to contribute a quasi-algebraic port.

On Fri, Jul 11, 2014 at 12:53 PM, Pat Ferrel wrote: > > > > On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel > wrote: > > > >> Duplicated from a comment on the PR: > >> > >> Beyond these details (specific merge issues) I have a bigger problem > with > >> merging this. Now every time the DSL is changed it may break things in > h2o > >> specific code.
Merging this would require every committer who might > touch > >> the DSL to sign up for fixing any broken tests on both engines. > >> > >> To solve this the entire data prep pipeline must be virtualized to run > on > >> either engine so the tests for things like CF and ItemSimilarity or > matrix > >> factorization (and the multitude of others to come) pass and are engine > >> independent. As it stands any DSL change that breaks the build will > have to > >> rely on a contributor's fix. Even if one of you guys was made a > committer > >> we will have this problem where a needed change breaks one or the other > >> engine specific code. Unless 99% of the entire pipeline is engine > neutral > >> the build will be unmaintainable. > >> > >> For instance I am making a small DSL change that is required for > >> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity > >> and its tests, which are in the spark module but since I’m working on > that > >> I can fix everything. If someone working on an h2o specific thing had to > >> change the DSL in a way that broke spark code like ItemSimilarity you > might > >> not be able to fix it and I certainly do not want to fix stuff in h2o > >> specific code when I change the DSL. I have a hard enough time keeping > mine > >> running :-) > >> > > > > The way I interpret the above points, the problem you are trying to > > highlight is with having multiple backends in general, and not this > backend > > in specific? Hypothetically, even if this backend is abandoned for the > > above "problems", as more backends get added in the future, the same > > "problems" will continue to apply to all of them. > > > > yes, exactly. Adding backends is only maintainable if backend specific > code (code > in the spark module for now) is squeezed down to near zero. The more that > is there > the more code there will be duplicated in the h2o modules. 
Test breakage > illustrates > the problem it does not express the breadth or depth of the problem. > > > > >> Crudely speaking this means doing away with all references to a > >> SparkContext and any use of it. So it's not just a matter of reproducing > >> the spark module but reducing the need for one. Making it so small that > >> breakages in one or the other engines code will be infrequent and > changes > >> to neutral code will only rarely break an engine that the committer is > >> unfamiliar with. > >> > > > > I think things are already very close to this "ideal" situation you > > describe above. As a pipeline implementor we should just use > > DistributedContext, and not SparkContext. And we need an engine neutral > way > > to get hold of a DistributedContext from within the math-scala module, > like > > this pseudocode: > > > > import org.apache.mahout.math.drm._ > > > > val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"), > > System.getenv("BACKEND_ID"), opts...) > > > > If environment variables are not set, DistributedContextCreate could > > default to Spark and local. But all of the pipeline code should ideally > > exist outside any engine specific module.
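The "quasi-algebraic" split Dmitriy describes, an engine-independent algebraic core plus a small per-engine port of the non-algebraic step, can be sketched with a hypothetical trait. None of these names are Mahout API; the integers stand in for DRMs purely to show the wiring:

```scala
// The shape of a quasi-algebraic algorithm: the algebraic core is
// written once; only the non-algebraic step is ported per engine.
trait QuasiAlgebraic[Drm] {
  // Written once against the DSL; identical on every engine.
  def algebraicCore(input: Drm): Drm

  // The per-engine port: downsampling, custom I/O, whatever the
  // algebra can't express.
  def engineSpecificStep(intermediate: Drm): Drm

  final def run(input: Drm): Drm =
    engineSpecificStep(algebraicCore(input))
}

// Toy instance over Int, standing in for a real DRM type.
object ToyAlgo extends QuasiAlgebraic[Int] {
  def algebraicCore(input: Int): Int = input * 2         // the "A'A" part
  def engineSpecificStep(intermediate: Int): Int = intermediate + 1
}
```

Under this factoring, porting the algorithm to a new engine means reimplementing only `engineSpecificStep`, which is the "smart componentization of concerns" Dmitriy mentions.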
Re: H2O integration - completion and review
> > On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel wrote: > >> Duplicated from a comment on the PR: >> >> Beyond these details (specific merge issues) I have a bigger problem with >> merging this. Now every time the DSL is changed it may break things in h2o >> specific code. Merging this would require every committer who might touch >> the DSL to sign up for fixing any broken tests on both engines. >> >> To solve this the entire data prep pipeline must be virtualized to run on >> either engine so the tests for things like CF and ItemSimilarity or matrix >> factorization (and the multitude of others to come) pass and are engine >> independent. As it stands any DSL change that breaks the build will have to >> rely on a contributor's fix. Even if one of you guys was made a committer >> we will have this problem where a needed change breaks one or the other >> engine specific code. Unless 99% of the entire pipeline is engine neutral >> the build will be unmaintainable. >> >> For instance I am making a small DSL change that is required for >> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity >> and its tests, which are in the spark module but since I’m working on that >> I can fix everything. If someone working on an h2o specific thing had to >> change the DSL in a way that broke spark code like ItemSimilarity you might >> not be able to fix it and I certainly do not want to fix stuff in h2o >> specific code when I change the DSL. I have a hard enough time keeping mine >> running :-) >> > > The way I interpret the above points, the problem you are trying to > highlight is with having multiple backends in general, and not this backend > in specific? Hypothetically, even if this backend is abandoned for the > above "problems", as more backends get added in the future, the same > "problems" will continue to apply to all of them. > yes, exactly. 
Adding backends is only maintainable if backend specific code (code in the spark module for now) is squeezed down to near zero. The more that is there, the more code will be duplicated in the h2o modules. Test breakage illustrates the problem; it does not express the breadth or depth of the problem. > >> Crudely speaking this means doing away with all references to a >> SparkContext and any use of it. So it's not just a matter of reproducing >> the spark module but reducing the need for one. Making it so small that >> breakages in one or the other engines code will be infrequent and changes >> to neutral code will only rarely break an engine that the committer is >> unfamiliar with. >> > > I think things are already very close to this "ideal" situation you > describe above. As a pipeline implementor we should just use > DistributedContext, and not SparkContext. And we need an engine neutral way > to get hold of a DistributedContext from within the math-scala module, like > this pseudocode: > > import org.apache.mahout.math.drm._ > > val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"), > System.getenv("BACKEND_ID"), opts...) > > If environment variables are not set, DistributedContextCreate could > default to Spark and local. But all of the pipeline code should ideally > exist outside any engine specific module. The Readers and Writers rely on var columns = mc.textFile(source).map { line => line.split(delimiter) } This will not run unless the DistributedContext is actually implemented by SparkContext. Running item similarity on the epinions dataset requires Spark executor memory to be 5g in the SparkConf, so this has to be passed in to Spark; what is it for h2o? Do I as the implementor have to figure out important tuning factors for every engine? I need a serializer for HashBiMap to be registered with Kryo or the Spark version will not run; what analogous problems exist for h2o? How much time will it take me to figure it out? Answers are moot.
The fact that questions come up so often is the issue. It took me a fair amount of time to discover these tuning and setup issues with only one engine. The more duplicated code, the bigger this problem is, and the greater the impedance mismatch between spark and h2o, the bigger the problem is. This directly affects how fast Mahout is moving. If there were some clear reason for taking this productivity hit, other than some idea that engine independence sounds clean or good, it would be easier to accept. Still so many questions, and we are being asked to merge this into the mainstream? I am tired of debating this so I’ll just say that until the spark and h2o modules are tiny and trivial, two engines will be a major productivity hit, and so until the "ideal" is met, -1 on merge. If people want to work on making the spark and h2o modules small, increasing engine independence, great. But ask yourself why. Seems like if Anand has a build that works on both, we should be able to run some non-trivial standard data through them on identical clusters and compare speed.
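For reference, the Spark-side setup Pat is describing looks roughly like the fragment below. The property keys are standard Spark 1.x configuration; the registrator class name follows Mahout's sparkbindings layout but should be treated as an assumption rather than confirmed API:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Spark-specific tuning with no engine-neutral expression today:
// executor memory for the epinions run, plus Kryo serialization with a
// registrator that knows how to handle types like Guava's HashBiMap.
val conf = new SparkConf()
  .setAppName("ItemSimilarity")
  .set("spark.executor.memory", "5g")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
       "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")

val sc = new SparkContext(conf)
```

An h2o deployment would need its own analogue of each of these knobs, which is exactly the per-engine knowledge this message argues every driver author would have to acquire twice.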
Re: H2O integration - completion and review
I'll write longer, but in general, +1 to Anand Sent from my iPhone > On Jul 11, 2014, at 20:54, Anand Avati wrote: > >> On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel wrote: >> >> Duplicated from a comment on the PR: >> >> Beyond these details (specific merge issues) I have a bigger problem with >> merging this. Now every time the DSL is changed it may break things in h2o >> specific code. Merging this would require every committer who might touch >> the DSL to sign up for fixing any broken tests on both engines. >> >> To solve this the entire data prep pipeline must be virtualized to run on >> either engine so the tests for things like CF and ItemSimilarity or matrix >> factorization (and the multitude of others to come) pass and are engine >> independent. As it stands any DSL change that breaks the build will have to >> rely on a contributor's fix. Even if one of you guys was made a committer >> we will have this problem where a needed change breaks one or the other >> engine specific code. Unless 99% of the entire pipeline is engine neutral >> the build will be unmaintainable. >> >> For instance I am making a small DSL change that is required for >> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity >> and its tests, which are in the spark module but since I’m working on that >> I can fix everything. If someone working on an h2o specific thing had to >> change the DSL in a way that broke spark code like ItemSimilarity you might >> not be able to fix it and I certainly do not want to fix stuff in h2o >> specific code when I change the DSL. I have a hard enough time keeping mine >> running :-) > > The way I interpret the above points, the problem you are trying to > highlight is with having multiple backends in general, and not this backend > in specific? Hypothetically, even if this backend is abandoned for the > above "problems", as more backends get added in the future, the same > "problems" will continue to apply to all of them. 
> > >> Crudely speaking this means doing away with all references to a >> SparkContext and any use of it. So it's not just a matter of reproducing >> the spark module but reducing the need for one. Making it so small that >> breakages in one or the other engines code will be infrequent and changes >> to neutral code will only rarely break an engine that the committer is >> unfamiliar with. > > I think things are already very close to this "ideal" situation you > describe above. As a pipeline implementor we should just use > DistributedContext, and not SparkContext. And we need an engine neutral way > to get hold of a DistributedContext from within the math-scala module, like > this pseudocode: > > import org.apache.mahout.math.drm._ > > val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"), > System.getenv("BACKEND_ID"), opts...) > > If environment variables are not set, DistributedContextCreate could > default to Spark and local. But all of the pipeline code should ideally > exist outside any engine specific module. > > > >> I raised this red flag a long time ago but in the heat of other issues it >> got lost. I don't think this can be ignored anymore. > > The only missing piece I think is having a DistributedContextCreate() call > such as above? I don't think things are in such a dire state really.. Am I > missing something? > > >> I would propose that we should remain two separate projects with a mostly >> shared DSL until the maintainability issues are resolved. This seems way to >> early to merge. > > Call me an optimist, but I was hoping more of a "let's work together now to > make the DSL abstractions easier for future contributors". I will explore > such a DistributedContextCreate() method in math-scala. That might also be > the answer for test cases to remain in math-scala. > > Thanks
Re: H2O integration - completion and review
On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel wrote: > Duplicated from a comment on the PR: > > Beyond these details (specific merge issues) I have a bigger problem with > merging this. Now every time the DSL is changed it may break things in h2o > specific code. Merging this would require every committer who might touch > the DSL to sign up for fixing any broken tests on both engines. > > To solve this the entire data prep pipeline must be virtualized to run on > either engine so the tests for things like CF and ItemSimilarity or matrix > factorization (and the multitude of others to come) pass and are engine > independent. As it stands any DSL change that breaks the build will have to > rely on a contributor's fix. Even if one of you guys was made a committer > we will have this problem where a needed change breaks one or the other > engine specific code. Unless 99% of the entire pipeline is engine neutral > the build will be unmaintainable. > > For instance I am making a small DSL change that is required for > cooccurrence and ItemSimilarity to work. This would break ItemSimilarity > and its tests, which are in the spark module but since I’m working on that > I can fix everything. If someone working on an h2o specific thing had to > change the DSL in a way that broke spark code like ItemSimilarity you might > not be able to fix it and I certainly do not want to fix stuff in h2o > specific code when I change the DSL. I have a hard enough time keeping mine > running :-) > The way I interpret the above points, the problem you are trying to highlight is with having multiple backends in general, and not this backend in specific? Hypothetically, even if this backend is abandoned for the above "problems", as more backends get added in the future, the same "problems" will continue to apply to all of them. > Crudely speaking this means doing away with all references to a > SparkContext and any use of it. 
So it's not just a matter of reproducing > the spark module but reducing the need for one. Making it so small that > breakages in one or the other engines code will be infrequent and changes > to neutral code will only rarely break an engine that the committer is > unfamiliar with. > I think things are already very close to this "ideal" situation you describe above. As a pipeline implementor we should just use DistributedContext, and not SparkContext. And we need an engine neutral way to get hold of a DistributedContext from within the math-scala module, like this pseudocode: import org.apache.mahout.math.drm._ val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"), System.getenv("BACKEND_ID"), opts...) If environment variables are not set, DistributedContextCreate could default to Spark and local. But all of the pipeline code should ideally exist outside any engine specific module. > I raised this red flag a long time ago but in the heat of other issues it > got lost. I don't think this can be ignored anymore. > The only missing piece I think is having a DistributedContextCreate() call such as above? I don't think things are in such a dire state really.. Am I missing something? > I would propose that we should remain two separate projects with a mostly > shared DSL until the maintainability issues are resolved. This seems way to > early to merge. > Call me an optimist, but I was hoping more of a "let's work together now to make the DSL abstractions easier for future contributors". I will explore such a DistributedContextCreate() method in math-scala. That might also be the answer for test cases to remain in math-scala. Thanks
Re: H2O integration - completion and review
Duplicated from a comment on the PR: Beyond these details (specific merge issues) I have a bigger problem with merging this. Now every time the DSL is changed it may break things in h2o specific code. Merging this would require every committer who might touch the DSL to sign up for fixing any broken tests on both engines. To solve this the entire data prep pipeline must be virtualized to run on either engine so the tests for things like CF and ItemSimilarity or matrix factorization (and the multitude of others to come) pass and are engine independent. As it stands any DSL change that breaks the build will have to rely on a contributor's fix. Even if one of you guys was made a committer we will have this problem where a needed change breaks one or the other engine specific code. Unless 99% of the entire pipeline is engine neutral the build will be unmaintainable. For instance I am making a small DSL change that is required for cooccurrence and ItemSimilarity to work. This would break ItemSimilarity and its tests, which are in the spark module but since I’m working on that I can fix everything. If someone working on an h2o specific thing had to change the DSL in a way that broke spark code like ItemSimilarity you might not be able to fix it and I certainly do not want to fix stuff in h2o specific code when I change the DSL. I have a hard enough time keeping mine running :-) Crudely speaking this means doing away with all references to a SparkContext and any use of it. So it's not just a matter of reproducing the spark module but reducing the need for one. Making it so small that breakages in one or the other engines code will be infrequent and changes to neutral code will only rarely break an engine that the committer is unfamiliar with. I raised this red flag a long time ago but in the heat of other issues it got lost. I don't think this can be ignored anymore. 
I would propose that we remain two separate projects with a mostly shared DSL until the maintainability issues are resolved. This seems way too early to merge.

On Jul 11, 2014, at 2:40 AM, Anand Avati wrote:

Hi all, The H2O integration is now feature complete to date and is ready for final review. All the test cases are passing. The pull request https://github.com/apache/mahout/pull/21 has been updated with the latest code. Please treat this PR as a candidate for merge. I have written a brief document on how to set up and use/test the integration at https://github.com/avati/mahout/blob/MAHOUT-1500/h2o/README.md. That includes instructions to test in both local and distributed mode. I would really appreciate it if folks could review the work, provide feedback, and suggest next steps. Thanks, Avati
H2O integration - completion and review
Hi all, The H2O integration is now feature complete to date and is ready for final review. All the test cases are passing. The pull request https://github.com/apache/mahout/pull/21 has been updated with the latest code. Please treat this PR as a candidate for merge. I have written a brief document on how to set up and use/test the integration at https://github.com/avati/mahout/blob/MAHOUT-1500/h2o/README.md. That includes instructions to test in both local and distributed mode. I would really appreciate it if folks could review the work, provide feedback, and suggest next steps. Thanks, Avati