Re: H2O integration - completion and review

2014-07-12 Thread Pat Ferrel
This is fine, but why is Mahout integrating all of the engines? Isn't that the
job of the engine people? Anand and Dmitriy have already said they are not
interested in anything but the DSL, so give it to any project that wants it and
agree to support it with them. Be welcoming!

Committing this will affect every committer who plans to work with the DSL,
because the DSL is incomplete and changes to it will often break the build in
two places. Also, everything done outside of the core DSL (much of it is still
engine dependent) will be difficult or impossible to make engine independent
without a fair amount of work.

To me this all argues for treating the core math as an engine independent 
module that we provide to other projects. This keeps the engine dependent work 
in the engine projects.

Since this is a very special commit, I propose a vote. This is one case where
a single committer should not have the power to put such a burden on all
development.

On Jul 11, 2014, at 9:12 PM, Ted Dunning  wrote:

On Fri, Jul 11, 2014 at 2:00 PM, Dmitriy Lyubimov  wrote:

> And most importantly, as I have always said, the main benefit to me is not
> that there's a CF algorithm in Mahout, but that I can write a custom one of
> my own with less effort than writing directly to any given engine API.
> That part is very conceptual (DRM-API dependent) and would be common
> regardless of my deployment infrastructure. In other words, everyone can
> write their own co-occurrence analysis version, hopefully more easily than
> writing it directly for Spark or directly for H2O if they wanted to.
> That's the real story (at least in my talk).
> 


I agree with this 110%.

In fact, I would argue that if all the engines worked just the same, there
would be absolutely no reason to have more than one.  It is precisely
the differences that make having more than one important.

The goal of the DSL is to minimize the gratuitous aspects of those
differences so that you can relatively easily make use of the virtues of
one engine or another.



Re: H2O integration - completion and review

2014-07-11 Thread Ted Dunning
On Fri, Jul 11, 2014 at 2:00 PM, Dmitriy Lyubimov  wrote:

> And most importantly, as I have always said, the main benefit to me is not
> that there's a CF algorithm in Mahout, but that I can write a custom one of
> my own with less effort than writing directly to any given engine API.
> That part is very conceptual (DRM-API dependent) and would be common
> regardless of my deployment infrastructure. In other words, everyone can
> write their own co-occurrence analysis version, hopefully more easily than
> writing it directly for Spark or directly for H2O if they wanted to.
> That's the real story (at least in my talk).
>


I agree with this 110%.

In fact, I would argue that if all the engines worked just the same, there
would be absolutely no reason to have more than one.  It is precisely
the differences that make having more than one important.

The goal of the DSL is to minimize the gratuitous aspects of those
differences so that you can relatively easily make use of the virtues of
one engine or another.


Re: H2O integration - completion and review

2014-07-11 Thread Dmitriy Lyubimov
The concern would be if it creates fragmentation for the project, as in,
say, a collection of semi-consistent things (Mahout 1.0).

But there's more commonality in that effort than not. E.g. there are
pure algebraic algorithms in math-scala that one can run to compare how
they would behave in both cases. (For the most part, I assume it is a
function of in-core algebra, though.)

And most importantly, as I have always said, the main benefit to me is not
that there's a CF algorithm in Mahout, but that I can write a custom one of
my own with less effort than writing directly to any given engine API.
That part is very conceptual (DRM-API dependent) and would be common
regardless of my deployment infrastructure. In other words, everyone can
write their own co-occurrence analysis version, hopefully more easily than
writing it directly for Spark or directly for H2O if they wanted to.
That's the real story (at least in my talk).
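
As a rough illustration of that point: the core of such a co-occurrence
computation, written against the DRM DSL, names no engine at all. A minimal
sketch, assuming the math-scala drm package and its R-like operators:

  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._

  // drmA: users x items interactions, produced by whichever engine's
  // bindings are on the classpath; nothing below refers to Spark or H2O.
  def cooccurrence(drmA: DrmLike[Int]): DrmLike[Int] =
    (drmA.t %*% drmA).checkpoint()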



On Fri, Jul 11, 2014 at 1:50 PM, Pat Ferrel  wrote:

> So given what Dmitriy said and Anand's answers below, why are we trying to
> merge this into Mahout?
>
> Why don't you Anand/0xdata create Mahout2O and take what has been
> delivered? We create Mahout v2 for Spark. We keep Spark-specific code in
> the spark module. You get the algebraic DSL and all the Java math stuff.
> You get all the future solver optimizations and anything you want.
>
> The benefit of keeping things separate for now is that Mahout2O only has
> to deal with duplicated H2O code and Mahout v2 only has to deal with Spark.
> You deal with H2O optimization and we deal with Spark. You know the old
> saying: it's not 2x, it's x^2.
>
> This seems so obvious. If in the future we work out the impedance
> mismatch, IO engine neutrality, etc., then we can talk again about merging.
>
> For that matter I'd be interested in seeing how to make ItemSimilarity
> work on Mahout2O. A performance comparison would help push this one way or
> the other.
>
> On Jul 11, 2014, at 1:36 PM, Dmitriy Lyubimov  wrote:
>
> On Fri, Jul 11, 2014 at 1:28 PM, Anand Avati  wrote:
>
> >
> > c) abandon the pretense/goal that Mahout aims to be backend independent
> and
> > admit/become Spark specific.
> >
> >
> For the record, it has never been said that the work is backend-independent
> for _anything_. The claim has always been much more modest. It was said that
> it was backend-independent for R-(Matlab)-like algebraic expressions, which
> it is. And that is not even the main side of the story.
>
> I suspect there are a couple more areas in general math beyond "base" R
> algebra where abstractions can also be built, be useful, and be engine
> independent.
>
> Like I said, the true solution is probably ports of the non-algebraic
> portions of a quasi-algebraic solution (i.e., option (b) plus doing something
> H2O-specific for that work if desired). Smart componentization of concerns
> may (or may not) go a long way here (just like in tests).
>
>


Re: H2O integration - completion and review

2014-07-11 Thread Pat Ferrel
So given what Dmitriy said and Anand's answers below, why are we trying to
merge this into Mahout?

Why don't you Anand/0xdata create Mahout2O and take what has been delivered?
We create Mahout v2 for Spark. We keep Spark-specific code in the spark
module. You get the algebraic DSL and all the Java math stuff. You get all
the future solver optimizations and anything you want.

The benefit of keeping things separate for now is that Mahout2O only has to
deal with duplicated H2O code and Mahout v2 only has to deal with Spark. You
deal with H2O optimization and we deal with Spark. You know the old saying:
it's not 2x, it's x^2.

This seems so obvious. If in the future we work out the impedance mismatch,
IO engine neutrality, etc., then we can talk again about merging.

For that matter I'd be interested in seeing how to make ItemSimilarity work
on Mahout2O. A performance comparison would help push this one way or the
other.

On Jul 11, 2014, at 1:36 PM, Dmitriy Lyubimov  wrote:

On Fri, Jul 11, 2014 at 1:28 PM, Anand Avati  wrote:

> 
> c) abandon the pretense/goal that Mahout aims to be backend independent and
> admit/become Spark specific.
> 
> 
For the record, it has never been said that the work is backend-independent
for _anything_. The claim has always been much more modest. It was said that
it was backend-independent for R-(Matlab)-like algebraic expressions, which
it is. And that is not even the main side of the story.

I suspect there are a couple more areas in general math beyond "base" R
algebra where abstractions can also be built, be useful, and be engine
independent.

Like I said, the true solution is probably ports of the non-algebraic
portions of a quasi-algebraic solution (i.e., option (b) plus doing something
H2O-specific for that work if desired). Smart componentization of concerns
may (or may not) go a long way here (just like in tests).



Re: H2O integration - completion and review

2014-07-11 Thread Anand Avati
On Fri, Jul 11, 2014 at 1:36 PM, Dmitriy Lyubimov  wrote:

>
> Like I said, the true solution is probably ports of the non-algebraic
> portions of a quasi-algebraic solution (i.e., option (b) plus doing something
> H2O-specific for that work if desired). Smart componentization of concerns
> may (or may not) go a long way here (just like in tests).
>

I find that a reasonable stance. I too don't see a reason to fret that
drivers don't work on all backends.


Re: H2O integration - completion and review

2014-07-11 Thread Dmitriy Lyubimov
On Fri, Jul 11, 2014 at 1:28 PM, Anand Avati  wrote:

>
> c) abandon the pretense/goal that Mahout aims to be backend independent and
> admit/become Spark specific.
>
>
For the record, it has never been said that the work is backend-independent
for _anything_. The claim has always been much more modest. It was said that
it was backend-independent for R-(Matlab)-like algebraic expressions, which
it is. And that is not even the main side of the story.

I suspect there are a couple more areas in general math beyond "base" R
algebra where abstractions can also be built, be useful, and be engine
independent.

Like I said, the true solution is probably ports of the non-algebraic
portions of a quasi-algebraic solution (i.e., option (b) plus doing something
H2O-specific for that work if desired). Smart componentization of concerns
may (or may not) go a long way here (just like in tests).


Re: H2O integration - completion and review

2014-07-11 Thread Anand Avati
You are speaking of two issues in the same breath.

1. Duplication of test case code. This is already being addressed by
Dmitriy's https://github.com/apache/mahout/pull/28. With that change all
the algo test code will co-reside with the algos in math-scala (a sketch of
that layout follows at the end of this message).

2. Driver code for various backends. As you show in the code snippet below,
you are clearly using Spark-specific calls (mc.textFile()) and bypassing the
DSL. This code is obviously Spark-specific and will not run on any other
backend. Now, are you saying that you want to continue using Spark-specific
functionality in the driver code, and asking how to reconcile that with
multiple backends? I see only three options:

a) re-implement the driver to use only the DSL and avoid making
backend-specific calls in the driver (not sure that is possible).

b) continue with Spark-specific calls in your driver and have a per-backend
driver for each algo. This probably makes sense in a way, as not all algos
run best on all backends. Just having the core of the algo be backend
independent is nice enough by itself. So don't sweat the full pipeline not
working everywhere (i.e., it need not).

c) abandon the pretense/goal that Mahout aims to be backend independent and
admit/become Spark specific.

The choice is not mine though.
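
To make point 1 concrete: a rough sketch of how shared test code can be laid
out so the test logic lives in math-scala and only the wiring is per engine.
Suite and trait names here are illustrative, not the actual ones from the PR:

  import org.scalatest.{FunSuite, Matchers}
  import org.apache.mahout.math.scalabindings._
  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._

  // math-scala: test logic written purely against the DSL; whichever engine
  // runs the suite supplies the DistributedContext.
  trait CooccurrenceSuiteBase extends Matchers { this: FunSuite =>
    implicit def dc: DistributedContext

    test("A'A counts co-occurrences") {
      val drmA = drmParallelize(dense((1, 0, 1), (0, 1, 1)), numPartitions = 2)
      val inCore = (drmA.t %*% drmA).collect
      inCore(0, 2) shouldBe 1.0 // items 0 and 2 co-occur in exactly one row
    }
  }

  // spark module: nothing but the binding-specific mixin, e.g.
  //   class CooccurrenceSuite extends FunSuite with CooccurrenceSuiteBase
  //     with DistributedSparkSuite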

On Fri, Jul 11, 2014 at 12:53 PM, Pat Ferrel  wrote:

> >
> > On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel 
> wrote:
> >
> >> Duplicated from a comment on the PR:
> >>
> >> Beyond these details (specific merge issues)  I have a bigger problem
> with
> >> merging this. Now every time the DSL is changed it may break things in
> h2o
> >> specific code. Merging this would require every committer who might
> touch
> >> the DSL to sign up for fixing any broken tests on both engines.
> >>
> >> To solve this the entire data prep pipeline must be virtualized to run
> on
> >> either engine so the tests for things like CF and ItemSimilarity or
> matrix
> >> factorization (and the multitude of others to come) pass and are engine
> >> independent. As it stands any DSL change that breaks the build will
> have to
> >> rely on a contributor's fix. Even if one of you guys was made a
> committer
> >> we will have this problem where a needed change breaks one or the other
> >> engine specific code. Unless 99% of the entire pipeline is engine
> neutral
> >> the build will be unmaintainable.
> >>
> >> For instance I am making a small DSL change that is required for
> >> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
> >> and its tests, which are in the spark module but since I’m working on
> that
> >> I can fix everything. If someone working on an h2o specific thing had to
> >> change the DSL in a way that broke spark code like ItemSimilarity you
> might
> >> not be able to fix it and I certainly do not want to fix stuff in h2o
> >> specific code when I change the DSL. I have a hard enough time keeping
> mine
> >> running :-)
> >>
> >
> > The way I interpret the above points, the problem you are trying to
> > highlight is with having multiple backends in general, and not this
> backend
> > in specific? Hypothetically, even if this backend is abandoned for the
> > above "problems", as more backends get added in the future, the same
> > "problems" will continue to apply to all of them.
> >
>
> Yes, exactly. Adding backends is only maintainable if backend-specific code
> (code in the spark module for now) is squeezed down to near zero. The more
> that is there, the more code will be duplicated in the h2o modules. Test
> breakage illustrates the problem; it does not express its breadth or depth.
>
> >
> >> Crudely speaking this means doing away with all references to a
> >> SparkContext and any use of it. So it's not just a matter of reproducing
> >> the spark module but reducing the need for one. Making it so small that
> >> breakages in one or the other engines code will be infrequent and
> changes
> >> to neutral code will only rarely break an engine that the committer is
> >> unfamiliar with.
> >>
> >
> > I think things are already very close to this "ideal" situation you
> > describe above. As a pipeline implementor we should just use
> > DistributedContext, and not SparkContext. And we need an engine neutral
> way
> > to get hold of a DistributedContext from within the math-scala module,
> like
> > this pseudocode:
> >
> >  import org.apache.mahout.math.drm._
> >
> >  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> > System.getenv("BACKEND_ID"), opts...)
> >
> > If environment variables are not set, DistributedContextCreate could
> > default to Spark and local. But all of the pipeline code should ideally
> > exist outside any engine specific module.
>
> The Readers and Writers rely on
>
> var columns = mc.textFile(source).map { line => line.split(delimiter) }
>
> This will not run unless the DistributedContext is actually implemented by
> SparkContext.
>
> Running item similarity on epinions

Re: H2O integration - completion and review

2014-07-11 Thread Dmitriy Lyubimov
The initial idea is to provide only algebraic independence. We may want to
ask engines to support persistence operations to/from (as it stands) HDFS,
as a couple dozen other projects dealing with distributed data already do,
but in general algebraic expressions are agnostic about how their inputs
come into existence.

When building an end-to-end distributed application, naturally, algebra is
not enough. Heck, it is not enough even for moderately involved logic inside
an algorithm, so quasi-algebraic algorithms are expected. This is an
obstinate reality. But engine independence, or even partial portability, is
only one side of the story, and it is not the biggest one. So the hope is
that

(1) the algebraic part is still significant enough that the non-algebraic
part of an algorithm could be more easily ported if needed; or
(2) for folks like me, one version of an algorithm is quite enough, and the
engine-independence side of the story becomes much smaller, making other
sides of the story (i.e. the convenience and semantics of the algebraic
translation itself) much more prominent.

What this means is that quasi-portable algorithms are expected to happen,
and I wouldn't be overly heartbroken about adding things only to the Spark
side of things, either as a first port or even for good. After all, I am all
for solving problems that actually exist. I probably have a need for
cooccurrence work with a Spark deployment, but I have no need for CF on H2O,
so I wouldn't care whether a quasi-port exists for H2O. Folks who do are
welcome to contribute a quasi-algebraic port.
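
A sketch of the quasi-algebraic split being described: the algebra stays
engine-neutral, and only a small non-algebraic step gets ported per engine.
Trait and method names here are hypothetical:

  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._

  // The non-algebraic portion: each engine ports only this small piece.
  trait EngineOps {
    def downsample(drm: DrmLike[Int], maxPerRow: Int): DrmLike[Int]
  }

  // The quasi-algebraic algorithm: everything after the downsample call is
  // pure DRM algebra and runs unchanged wherever EngineOps is implemented.
  def cooccurrence(drmA: DrmLike[Int], ops: EngineOps): DrmLike[Int] = {
    val sampled = ops.downsample(drmA, maxPerRow = 500)
    (sampled.t %*% sampled).checkpoint()
  }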




On Fri, Jul 11, 2014 at 12:53 PM, Pat Ferrel  wrote:

> >
> > On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel 
> wrote:
> >
> >> Duplicated from a comment on the PR:
> >>
> >> Beyond these details (specific merge issues)  I have a bigger problem
> with
> >> merging this. Now every time the DSL is changed it may break things in
> h2o
> >> specific code. Merging this would require every committer who might
> touch
> >> the DSL to sign up for fixing any broken tests on both engines.
> >>
> >> To solve this the entire data prep pipeline must be virtualized to run
> on
> >> either engine so the tests for things like CF and ItemSimilarity or
> matrix
> >> factorization (and the multitude of others to come) pass and are engine
> >> independent. As it stands any DSL change that breaks the build will
> have to
> >> rely on a contributor's fix. Even if one of you guys was made a
> committer
> >> we will have this problem where a needed change breaks one or the other
> >> engine specific code. Unless 99% of the entire pipeline is engine
> neutral
> >> the build will be unmaintainable.
> >>
> >> For instance I am making a small DSL change that is required for
> >> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
> >> and its tests, which are in the spark module but since I’m working on
> that
> >> I can fix everything. If someone working on an h2o specific thing had to
> >> change the DSL in a way that broke spark code like ItemSimilarity you
> might
> >> not be able to fix it and I certainly do not want to fix stuff in h2o
> >> specific code when I change the DSL. I have a hard enough time keeping
> mine
> >> running :-)
> >>
> >
> > The way I interpret the above points, the problem you are trying to
> > highlight is with having multiple backends in general, and not this
> backend
> > in specific? Hypothetically, even if this backend is abandoned for the
> > above "problems", as more backends get added in the future, the same
> > "problems" will continue to apply to all of them.
> >
>
> Yes, exactly. Adding backends is only maintainable if backend-specific code
> (code in the spark module for now) is squeezed down to near zero. The more
> that is there, the more code will be duplicated in the h2o modules. Test
> breakage illustrates the problem; it does not express its breadth or depth.
>
> >
> >> Crudely speaking this means doing away with all references to a
> >> SparkContext and any use of it. So it's not just a matter of reproducing
> >> the spark module but reducing the need for one. Making it so small that
> >> breakages in one or the other engines code will be infrequent and
> changes
> >> to neutral code will only rarely break an engine that the committer is
> >> unfamiliar with.
> >>
> >
> > I think things are already very close to this "ideal" situation you
> > describe above. As a pipeline implementor we should just use
> > DistributedContext, and not SparkContext. And we need an engine neutral
> way
> > to get hold of a DistributedContext from within the math-scala module,
> like
> > this pseudocode:
> >
> >  import org.apache.mahout.math.drm._
> >
> >  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> > System.getenv("BACKEND_ID"), opts...)
> >
> > If environment variables are not set, DistributedContextCreate could
> > default to Spark and local. But all of the pipeline code should ideally
> > exist outside any engine specific module.

Re: H2O integration - completion and review

2014-07-11 Thread Pat Ferrel
> 
> On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel  wrote:
> 
>> Duplicated from a comment on the PR:
>> 
>> Beyond these details (specific merge issues)  I have a bigger problem with
>> merging this. Now every time the DSL is changed it may break things in h2o
>> specific code. Merging this would require every committer who might touch
>> the DSL to sign up for fixing any broken tests on both engines.
>> 
>> To solve this the entire data prep pipeline must be virtualized to run on
>> either engine so the tests for things like CF and ItemSimilarity or matrix
>> factorization (and the multitude of others to come) pass and are engine
>> independent. As it stands any DSL change that breaks the build will have to
>> rely on a contributor's fix. Even if one of you guys was made a committer
>> we will have this problem where a needed change breaks one or the other
>> engine specific code. Unless 99% of the entire pipeline is engine neutral
>> the build will be unmaintainable.
>> 
>> For instance I am making a small DSL change that is required for
>> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
>> and its tests, which are in the spark module but since I’m working on that
>> I can fix everything. If someone working on an h2o specific thing had to
>> change the DSL in a way that broke spark code like ItemSimilarity you might
>> not be able to fix it and I certainly do not want to fix stuff in h2o
>> specific code when I change the DSL. I have a hard enough time keeping mine
>> running :-)
>> 
> 
> The way I interpret the above points, the problem you are trying to
> highlight is with having multiple backends in general, and not this backend
> in specific? Hypothetically, even if this backend is abandoned for the
> above "problems", as more backends get added in the future, the same
> "problems" will continue to apply to all of them.
> 

Yes, exactly. Adding backends is only maintainable if backend-specific code
(code in the spark module for now) is squeezed down to near zero. The more
that is there, the more code will be duplicated in the h2o modules. Test
breakage illustrates the problem; it does not express its breadth or depth.

> 
>> Crudely speaking this means doing away with all references to a
>> SparkContext and any use of it. So it's not just a matter of reproducing
>> the spark module but reducing the need for one. Making it so small that
>> breakages in one or the other engines code will be infrequent and changes
>> to neutral code will only rarely break an engine that the committer is
>> unfamiliar with.
>> 
> 
> I think things are already very close to this "ideal" situation you
> describe above. As a pipeline implementor we should just use
> DistributedContext, and not SparkContext. And we need an engine neutral way
> to get hold of a DistributedContext from within the math-scala module, like
> this pseudocode:
> 
>  import org.apache.mahout.math.drm._
> 
>  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> System.getenv("BACKEND_ID"), opts...)
> 
> If environment variables are not set, DistributedContextCreate could
> default to Spark and local. But all of the pipeline code should ideally
> exist outside any engine specific module.

The Readers and Writers rely on

var columns = mc.textFile(source).map { line => line.split(delimiter) }

This will not run unless the DistributedContext is actually implemented by 
SparkContext. 
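
To spell the dependency out: the call above only works because the context
can be unwrapped to a SparkContext, roughly as below. A sketch, assuming the
Spark bindings wrap the SparkContext in a SparkDistributedContext:

  import org.apache.mahout.math.drm.DistributedContext
  import org.apache.mahout.sparkbindings.SparkDistributedContext

  def readColumns(dc: DistributedContext, source: String, delimiter: String) = {
    // The engine leak: this cast throws on any non-Spark backend.
    val sc = dc.asInstanceOf[SparkDistributedContext].sc
    sc.textFile(source).map(line => line.split(delimiter))
  }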

Running item similarity on the epinions dataset requires Spark executor
memory to be set to 5g in the SparkConf, so this has to be passed in to
Spark. What is the equivalent for H2O? Do I, as the implementor, have to
figure out important tuning factors for every engine?

I need a serializer for HashBiMap to be registered with Kryo or the Spark
version will not run. What analogous problems exist for H2O? How much time
will it take me to figure them out?
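
For reference, the Spark-side setup being described looks roughly like this
sketch; the registrator name is made up, and the actual HashBiMap serializer
implementation is elided:

  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.KryoRegistrator
  import com.esotericsoftware.kryo.Kryo
  import com.google.common.collect.HashBiMap

  // Hypothetical registrator; a real one would pass a custom Serializer for
  // HashBiMap as the second argument to register().
  class ItemSimilarityKryoRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit =
      kryo.register(classOf[HashBiMap[String, Int]])
  }

  val conf = new SparkConf()
    .set("spark.executor.memory", "5g") // needed for the epinions run
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator",
      classOf[ItemSimilarityKryoRegistrator].getName)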

The answers are moot; the fact that such questions come up so often is the
issue. It took me a fair amount of time to discover these tuning and setup
issues with only one engine.

The more code is duplicated, and the greater the impedance mismatch between
Spark and H2O, the bigger this problem becomes. This directly affects how
fast Mahout is moving. If there were some clear reason for taking this
productivity hit, other than the idea that engine independence sounds clean
or good, it would be easier to accept. There are still so many questions,
and we are being asked to merge this into the mainstream?

I am tired of debating this, so I'll just say that until the spark and h2o
modules are tiny and trivial, two engines will be a major productivity hit.
Until the “ideal” is met: -1 on merge.

If people want to work on making the spark and h2o modules small, increasing
engine independence, great. But ask yourself why. It seems that if Anand has
a build that works on both, we should be able to run some non-trivial
standard data through them on identical clusters and compare speed.



Re: H2O integration - completion and review

2014-07-11 Thread Gokhan Capan
I'll write a longer reply later, but in general, +1 to Anand

Sent from my iPhone

> On Jul 11, 2014, at 20:54, Anand Avati  wrote:
>
>> On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel  wrote:
>>
>> Duplicated from a comment on the PR:
>>
>> Beyond these details (specific merge issues)  I have a bigger problem with
>> merging this. Now every time the DSL is changed it may break things in h2o
>> specific code. Merging this would require every committer who might touch
>> the DSL to sign up for fixing any broken tests on both engines.
>>
>> To solve this the entire data prep pipeline must be virtualized to run on
>> either engine so the tests for things like CF and ItemSimilarity or matrix
>> factorization (and the multitude of others to come) pass and are engine
>> independent. As it stands any DSL change that breaks the build will have to
>> rely on a contributor's fix. Even if one of you guys was made a committer
>> we will have this problem where a needed change breaks one or the other
>> engine specific code. Unless 99% of the entire pipeline is engine neutral
>> the build will be unmaintainable.
>>
>> For instance I am making a small DSL change that is required for
>> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
>> and its tests, which are in the spark module but since I’m working on that
>> I can fix everything. If someone working on an h2o specific thing had to
>> change the DSL in a way that broke spark code like ItemSimilarity you might
>> not be able to fix it and I certainly do not want to fix stuff in h2o
>> specific code when I change the DSL. I have a hard enough time keeping mine
>> running :-)
>
> The way I interpret the above points, the problem you are trying to
> highlight is with having multiple backends in general, and not this backend
> in specific? Hypothetically, even if this backend is abandoned for the
> above "problems", as more backends get added in the future, the same
> "problems" will continue to apply to all of them.
>
>
>> Crudely speaking this means doing away with all references to a
>> SparkContext and any use of it. So it's not just a matter of reproducing
>> the spark module but reducing the need for one. Making it so small that
>> breakages in one or the other engines code will be infrequent and changes
>> to neutral code will only rarely break an engine that the committer is
>> unfamiliar with.
>
> I think things are already very close to this "ideal" situation you
> describe above. As a pipeline implementor we should just use
> DistributedContext, and not SparkContext. And we need an engine neutral way
> to get hold of a DistributedContext from within the math-scala module, like
> this pseudocode:
>
>  import org.apache.mahout.math.drm._
>
>  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> System.getenv("BACKEND_ID"), opts...)
>
> If environment variables are not set, DistributedContextCreate could
> default to Spark and local. But all of the pipeline code should ideally
> exist outside any engine specific module.
>
>
>
>> I raised this red flag a long time ago but in the heat of other issues it
>> got lost. I don't think this can be ignored anymore.
>
> The only missing piece, I think, is having a DistributedContextCreate()
> call such as above? I don't think things are in such a dire state really.
> Am I missing something?
>
>
>> I would propose that we remain two separate projects with a mostly shared
>> DSL until the maintainability issues are resolved. This seems way too
>> early to merge.
>
> Call me an optimist, but I was hoping more of a "let's work together now to
> make the DSL abstractions easier for future contributors". I will explore
> such a DistributedContextCreate() method in math-scala. That might also be
> the answer for test cases to remain in math-scala.
>
> Thanks


Re: H2O integration - completion and review

2014-07-11 Thread Anand Avati
On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel  wrote:

> Duplicated from a comment on the PR:
>
> Beyond these details (specific merge issues)  I have a bigger problem with
> merging this. Now every time the DSL is changed it may break things in h2o
> specific code. Merging this would require every committer who might touch
> the DSL to sign up for fixing any broken tests on both engines.
>
> To solve this the entire data prep pipeline must be virtualized to run on
> either engine so the tests for things like CF and ItemSimilarity or matrix
> factorization (and the multitude of others to come) pass and are engine
> independent. As it stands any DSL change that breaks the build will have to
> rely on a contributor's fix. Even if one of you guys was made a committer
> we will have this problem where a needed change breaks one or the other
> engine specific code. Unless 99% of the entire pipeline is engine neutral
> the build will be unmaintainable.
>
> For instance I am making a small DSL change that is required for
> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
> and its tests, which are in the spark module but since I’m working on that
> I can fix everything. If someone working on an h2o specific thing had to
> change the DSL in a way that broke spark code like ItemSimilarity you might
> not be able to fix it and I certainly do not want to fix stuff in h2o
> specific code when I change the DSL. I have a hard enough time keeping mine
> running :-)
>

The way I interpret the above points, the problem you are trying to
highlight is with having multiple backends in general, and not this backend
in specific? Hypothetically, even if this backend is abandoned for the
above "problems", as more backends get added in the future, the same
"problems" will continue to apply to all of them.


> Crudely speaking this means doing away with all references to a
> SparkContext and any use of it. So it's not just a matter of reproducing
> the spark module but reducing the need for one. Making it so small that
> breakages in one or the other engines code will be infrequent and changes
> to neutral code will only rarely break an engine that the committer is
> unfamiliar with.
>

I think things are already very close to this "ideal" situation you
describe above. As a pipeline implementor we should just use
DistributedContext, and not SparkContext. And we need an engine neutral way
to get hold of a DistributedContext from within the math-scala module, like
this pseudocode:

  import org.apache.mahout.math.drm._

  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
System.getenv("BACKEND_ID"), opts...)

If environment variables are not set, DistributedContextCreate could
default to Spark and local. But all of the pipeline code should ideally
exist outside any engine specific module.
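
A minimal sketch of what such a factory could look like, assuming each
engine's bindings expose a context constructor it can dispatch to (the
mahoutSparkContext helper is the sparkbindings entry point as I understand
it; the H2O branch is left unwired):

  import org.apache.mahout.math.drm.DistributedContext
  import org.apache.mahout.sparkbindings.mahoutSparkContext

  object DistributedContextCreate {
    // backend and backendId come from MAHOUT_BACKEND / BACKEND_ID and may
    // both be null; default to Spark running locally.
    def apply(backend: String, backendId: String): DistributedContext =
      Option(backend).getOrElse("spark") match {
        case "spark" =>
          mahoutSparkContext(
            masterUrl = Option(backendId).getOrElse("local"),
            appName = "MahoutPipeline")
        case "h2o" =>
          throw new UnsupportedOperationException("h2o factory not wired up")
        case other =>
          throw new IllegalArgumentException("unknown backend: " + other)
      }
  }

  // usage, following the pseudocode above:
  val dc = DistributedContextCreate(
    System.getenv("MAHOUT_BACKEND"), System.getenv("BACKEND_ID"))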



> I raised this red flag a long time ago but in the heat of other issues it
> got lost. I don't think this can be ignored anymore.
>

The only missing piece, I think, is having a DistributedContextCreate()
call such as above? I don't think things are in such a dire state really.
Am I missing something?


> I would propose that we remain two separate projects with a mostly shared
> DSL until the maintainability issues are resolved. This seems way too
> early to merge.
>

Call me an optimist, but I was hoping more of a "let's work together now to
make the DSL abstractions easier for future contributors". I will explore
such a DistributedContextCreate() method in math-scala. That might also be
the answer for test cases to remain in math-scala.

Thanks


Re: H2O integration - completion and review

2014-07-11 Thread Pat Ferrel
Duplicated from a comment on the PR:

Beyond these details (specific merge issues), I have a bigger problem with
merging this: now every time the DSL is changed it may break things in
h2o-specific code. Merging this would require every committer who might touch
the DSL to sign up for fixing any broken tests on both engines.

To solve this the entire data prep pipeline must be virtualized to run on 
either engine so the tests for things like CF and ItemSimilarity or matrix 
factorization (and the multitude of others to come) pass and are engine 
independent. As it stands any DSL change that breaks the build will have to 
rely on a contributor's fix. Even if one of you guys was made a committer we 
will have this problem where a needed change breaks one or the other engine 
specific code. Unless 99% of the entire pipeline is engine neutral the build 
will be unmaintainable.

For instance I am making a small DSL change that is required for cooccurrence 
and ItemSimilarity to work. This would break ItemSimilarity and its tests, 
which are in the spark module but since I’m working on that I can fix 
everything. If someone working on an h2o specific thing had to change the DSL 
in a way that broke spark code like ItemSimilarity you might not be able to fix 
it and I certainly do not want to fix stuff in h2o specific code when I change 
the DSL. I have a hard enough time keeping mine running :-) 

Crudely speaking, this means doing away with all references to a SparkContext
and any use of it. So it's not just a matter of reproducing the spark module
but of reducing the need for one: making it so small that breakages in one or
the other engine's code will be infrequent, and changes to neutral code will
only rarely break an engine that the committer is unfamiliar with.

I raised this red flag a long time ago but in the heat of other issues it got 
lost. I don't think this can be ignored anymore.

I would propose that we remain two separate projects with a mostly shared
DSL until the maintainability issues are resolved. This seems way too early
to merge.


On Jul 11, 2014, at 2:40 AM, Anand Avati  wrote:

Hi all,
The H2O integration is now feature complete and is ready for
final review. All the test cases are passing. The pull request
https://github.com/apache/mahout/pull/21 has been updated with the latest
code. Please treat this PR as a candidate for merge.

I have written a brief document on how to set up and use/test the
integration at
https://github.com/avati/mahout/blob/MAHOUT-1500/h2o/README.md. That
includes instructions to test in both local and distributed mode.

I would really appreciate it if folks could review the work, provide
feedback, and suggest next steps.

Thanks,
Avati



H2O integration - completion and review

2014-07-11 Thread Anand Avati
Hi all,
The H2O integration is now feature complete and is ready for
final review. All the test cases are passing. The pull request
https://github.com/apache/mahout/pull/21 has been updated with the latest
code. Please treat this PR as a candidate for merge.

I have written a brief document on how to set up and use/test the
integration at
https://github.com/avati/mahout/blob/MAHOUT-1500/h2o/README.md. That
includes instructions to test in both local and distributed mode.

I would really appreciate it if folks could review the work, provide
feedback, and suggest next steps.

Thanks,
Avati