Re: H2O integration - completion and review

2014-07-11 Thread Ted Dunning
On Fri, Jul 11, 2014 at 2:00 PM, Dmitriy Lyubimov  wrote:

> And most importantly, as i always was saying, the main benefit to me is not
> that there's a CF algorithm in Mahout, but that i can write a custom one of
> my own with less effort than just writing directly to any given engine api.
> That part is very conceptual (drm api-dependent) and would be common
> regardless of my deployment infrastructure. In other words, everyone can
> just write their own co-occurrence analysis version hopefully easier than
> just writing it directly for Spark or directly to H2O if he/she wanted to.
> That's the real story (at least in my talk).
>


I agree with this 110%.

In fact, I would argue that if all the engines worked just the same, there
would be absolutely no reason to have more than one.  It is precisely
the differences that make having more than one important.

The goal of the DSL is to minimize the gratuitous aspects of those
differences so that you can relatively easily make use of the virtues of
one engine or another.
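The idea can be pictured as a toy sketch (all names here are invented for illustration, not the actual Mahout API): algorithm code is written once against a small engine trait, each backend supplies its own implementation, and an engine's particular virtues stay reachable through the concrete type.

```scala
// Toy sketch of the DSL idea (hypothetical names, not Mahout code):
// one abstract engine trait, two backends, one engine-neutral routine.
trait Engine {
  def name: String
  // Toy stand-in for a distributed element-wise transform.
  def mapVec(xs: Seq[Double])(f: Double => Double): Seq[Double]
}

object SparkLike extends Engine {
  val name = "spark"
  def mapVec(xs: Seq[Double])(f: Double => Double): Seq[Double] = xs.map(f)
}

object H2OLike extends Engine {
  val name = "h2o"
  def mapVec(xs: Seq[Double])(f: Double => Double): Seq[Double] = xs.map(f)
}

// Written once; runs unchanged against either backend.
def scale(e: Engine, xs: Seq[Double], k: Double): Seq[Double] =
  e.mapVec(xs)(_ * k)
```

Only the gratuitous surface differences are hidden; nothing stops a caller from reaching for the concrete engine when its specific strengths matter.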


Re: H2O integration - completion and review

2014-07-11 Thread Dmitriy Lyubimov
The concern would be if it creates fragmentation for the project, as in,
say, a collection of semi-consistent things (Mahout 1.0).

But there's more commonality in that effort than not. E.g. there are
pure algebraic algorithms in math-scala that one can run to compare how
they would behave in both cases. (for the most part, i assume it is a
function of in-core algebra though).

And most importantly, as i always was saying, the main benefit to me is not
that there's a CF algorithm in Mahout, but that i can write a custom one of
my own with less effort than just writing directly to any given engine api.
That part is very conceptual (drm api-dependent) and would be common
regardless of my deployment infrastructure. In other words, everyone can
just write their own co-occurrence analysis version hopefully easier than
just writing it directly for Spark or directly to H2O if he/she wanted to.
That's the real story (at least in my talk).



On Fri, Jul 11, 2014 at 1:50 PM, Pat Ferrel  wrote:

> So given what Dmitriy said and Anand's answers below why are we trying to
> merge this into Mahout?
>
> Why don’t you Anand/0xdata create Mahout2O and take what has been
> delivered. We create Mahout v2 for Spark. We keep Spark specific code in
> the spark module. You get the algebraic DSL and all the java math stuff.
> You get all the future solver optimizations and anything you want.
>
> The benefit of keeping things separate for now is that Mahout2O only has
> to deal with h2o duplicated code and Mahout v2 only has to deal with Spark.
> You deal with h2o optimization and we deal with spark. You know the old
> saying that it’s not 2x it’s x^2.
>
> This seems so obvious. If in the future we work out the impedance
> mismatch, IO engine neutrality, etc then we talk again of merge.
>
> For that matter I’d be interested in seeing how to make ItemSimilarity
> work on Mahout2O. A performance comparison would help push this one way or
> the other.
>
> On Jul 11, 2014, at 1:36 PM, Dmitriy Lyubimov  wrote:
>
> On Fri, Jul 11, 2014 at 1:28 PM, Anand Avati  wrote:
>
> >
> > c) abandon the pretense/goal that Mahout aims to be backend independent
> and
> > admit/become Spark specific.
> >
> >
> For the record, it has never been said that the work is backend-independent
> for anything and _everything_. The claim has always been much more modest.
> It was said it was backend-independent for R-(matlab)-like algebraic
> expressions, which it is. And that is not even the main side of the story.
>
> I suspect there are a couple more areas in general math beyond "base" R
> algebra where abstractions can also be built, be useful, and be engine
> independent.
>
> Like i said, the true solution is probably ports of the non-algebraic
> portions of a quasi-algebraic solution (i.e. option (b) plus doing
> something h2o specific for that work if desired). Smart componentization of
> concerns may (or may not) go a long way here (just like in tests).
>
>


Re: H2O integration - completion and review

2014-07-11 Thread Pat Ferrel
So given what Dmitriy said and Anand's answers below why are we trying to merge 
this into Mahout?

Why don’t you Anand/0xdata create Mahout2O and take what has been delivered. We 
create Mahout v2 for Spark. We keep Spark specific code in the spark module. 
You get the algebraic DSL and all the java math stuff. You get all the future 
solver optimizations and anything you want.

The benefit of keeping things separate for now is that Mahout2O only has to 
deal with h2o duplicated code and Mahout v2 only has to deal with Spark. You 
deal with h2o optimization and we deal with spark. You know the old saying that 
it’s not 2x it’s x^2.

This seems so obvious. If in the future we work out the impedance mismatch, IO 
engine neutrality, etc then we talk again of merge. 

For that matter I’d be interested in seeing how to make ItemSimilarity work on 
Mahout2O. A performance comparison would help push this one way or the other.

On Jul 11, 2014, at 1:36 PM, Dmitriy Lyubimov  wrote:

On Fri, Jul 11, 2014 at 1:28 PM, Anand Avati  wrote:

> 
> c) abandon the pretense/goal that Mahout aims to be backend independent and
> admit/become Spark specific.
> 
> 
For the record, it has never been said that the work is backend-independent
for anything and _everything_. The claim has always been much more modest.
It was said it was backend-independent for R-(matlab)-like algebraic
expressions, which it is. And that is not even the main side of the story.

I suspect there are a couple more areas in general math beyond "base" R
algebra where abstractions can also be built, be useful, and be engine
independent.

Like i said, the true solution is probably ports of the non-algebraic
portions of a quasi-algebraic solution (i.e. option (b) plus doing something
h2o specific for that work if desired). Smart componentization of concerns
may (or may not) go a long way here (just like in tests).



Re: H2O integration - completion and review

2014-07-11 Thread Anand Avati
On Fri, Jul 11, 2014 at 1:36 PM, Dmitriy Lyubimov  wrote:

>
> Like i said, the true solution is probably ports of the non-algebraic
> portions of a quasi-algebraic solution (i.e. option (b) plus doing
> something h2o specific for that work if desired). Smart componentization of
> concerns may (or may not) go a long way here (just like in tests).
>

I find that a reasonable stance. I too don't see a reason to fret that
drivers don't work on all backends.


Re: H2O integration - completion and review

2014-07-11 Thread Dmitriy Lyubimov
On Fri, Jul 11, 2014 at 1:28 PM, Anand Avati  wrote:

>
> c) abandon the pretense/goal that Mahout aims to be backend independent and
> admit/become Spark specific.
>
>
For the record, it has never been said that the work is backend-independent
for anything and _everything_. The claim has always been much more modest.
It was said it was backend-independent for R-(matlab)-like algebraic
expressions, which it is. And that is not even the main side of the story.

I suspect there are a couple more areas in general math beyond "base" R
algebra where abstractions can also be built, be useful, and be engine
independent.

Like i said, the true solution is probably ports of the non-algebraic
portions of a quasi-algebraic solution (i.e. option (b) plus doing something
h2o specific for that work if desired). Smart componentization of concerns
may (or may not) go a long way here (just like in tests).


Re: H2O integration - completion and review

2014-07-11 Thread Anand Avati
You are speaking of two issues in the same breath.

1. Duplication of test case code. This is already being addressed by
Dmitriy's https://github.com/apache/mahout/pull/28. With that change all
the algo test code will co-reside with algos in math-scala.

2. Driver code for various backends. As you show in the code snippet below,
you are clearly using Spark specific calls (mc.textFile()) and bypassing
the DSL. This code is obviously spark specific and will not run on any
other backend. Now, are you asking how, if you want to continue using Spark
specific functionality in the driver code, you reconcile that with multiple
backends? I see only three options -

a) re-implement the driver to only use the DSL and avoid making backend
specific calls in the driver (not sure that is possible.)

b) continue with Spark specific calls in your driver and have a per-backend
driver for each algo. This probably makes sense in a way, as not all algos
run best on all backends. Just having the core of the algo backend
independent is nice enough by itself. So don't sweat about the full
pipeline not working everywhere (i.e., it need not)

c) abandon the pretense/goal that Mahout aims to be backend independent and
admit/become Spark specific.

The choice is not mine though.

On Fri, Jul 11, 2014 at 12:53 PM, Pat Ferrel  wrote:

> >
> > On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel 
> wrote:
> >
> >> Duplicated from a comment on the PR:
> >>
> >> Beyond these details (specific merge issues)  I have a bigger problem
> with
> >> merging this. Now every time the DSL is changed it may break things in
> h2o
> >> specific code. Merging this would require every committer who might
> touch
> >> the DSL to sign up for fixing any broken tests on both engines.
> >>
> >> To solve this the entire data prep pipeline must be virtualized to run
> on
> >> either engine so the tests for things like CF and ItemSimilarity or
> matrix
> >> factorization (and the multitude of others to come) pass and are engine
> >> independent. As it stands any DSL change that breaks the build will
> have to
> >> rely on a contributor's fix. Even if one of you guys was made a
> committer
> >> we will have this problem where a needed change breaks one or the other
> >> engine specific code. Unless 99% of the entire pipeline is engine
> neutral
> >> the build will be unmaintainable.
> >>
> >> For instance I am making a small DSL change that is required for
> >> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
> >> and its tests, which are in the spark module but since I’m working on
> that
> >> I can fix everything. If someone working on an h2o specific thing had to
> >> change the DSL in a way that broke spark code like ItemSimilarity you
> might
> >> not be able to fix it and I certainly do not want to fix stuff in h2o
> >> specific code when I change the DSL. I have a hard enough time keeping
> mine
> >> running :-)
> >>
> >
> > The way I interpret the above points, the problem you are trying to
> > highlight is with having multiple backends in general, and not this
> backend
> > in specific? Hypothetically, even if this backend is abandoned for the
> > above "problems", as more backends get added in the future, the same
> > "problems" will continue to apply to all of them.
> >
>
> yes, exactly. Adding backends is only maintainable if backend specific
> code (code in the spark module for now) is squeezed down to near zero. The
> more that is there, the more code will be duplicated in the h2o modules.
> Test breakage illustrates the problem; it does not express the breadth or
> depth of the problem.
>
> >
> >> Crudely speaking this means doing away with all references to a
> >> SparkContext and any use of it. So it's not just a matter of reproducing
> >> the spark module but reducing the need for one. Making it so small that
> >> breakages in one or the other engines code will be infrequent and
> changes
> >> to neutral code will only rarely break an engine that the committer is
> >> unfamiliar with.
> >>
> >
> > I think things are already very close to this "ideal" situation you
> > describe above. As a pipeline implementor we should just use
> > DistributedContext, and not SparkContext. And we need an engine neutral
> way
> > to get hold of a DistributedContext from within the math-scala module,
> like
> > this pseudocode:
> >
> >  import org.apache.mahout.math.drm._
> >
> >  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> > System.getenv("BACKEND_ID"), opts...)
> >
> > If environment variables are not set, DistributedContextCreate could
> > default to Spark and local. But all of the pipeline code should ideally
> > exist outside any engine specific module.
>
> The Readers and Writers rely on
>
> var columns = mc.textFile(source).map { line => line.split(delimiter) }
>
> This will not run unless the DistributedContext is actually implemented by
> SparkContext.
>
> Running item similarity on epinions

Re: H2O integration - completion and review

2014-07-11 Thread Dmitriy Lyubimov
The initial idea is to provide only algebraic independence. We may want to
ask engines to support persistence operations to/from HDFS, as a couple
dozen other projects dealing with distributed data currently ask as well,
but in general algebraic expressions are agnostic about how inputs come
into existence.

When building an e2e distributed application, naturally, algebra is not
enough. Heck, it is not even enough for moderately involved logic inside an
algorithm, so quasi-algebraic algorithms are expected. This is an obstinate
reality. But engine independence, or even partial portability, is only one
side of the story, and it is not the biggest one. So the hope is that

(1) the algebraic part is hopefully still significant enough that the
non-algebraic part of an algorithm could be more easily ported if needed; or
(2) for folks like me, one version of an algorithm is quite enough, and the
engine independence side of the story becomes a much smaller story, making
other sides of the story (i.e. the convenience and semantics of the
algebraic translation itself) much more prominent.

What this means is that quasi-portable algorithms are expected to happen,
and i wouldn't be overly heartbroken about adding things only to the spark
side of things -- either as a first port, or even for good. After all, I am
all for solving problems that actually exist. I probably have a need for
cooccurrence work with a Spark deployment but i have no need for CF on H2O,
so i wouldn't care whether a quasi-port exists for H2O. Folks who do are
welcome to contribute a quasi-algebraic port.
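The quasi-algebraic shape described above can be sketched in a few lines (invented names; this illustrates the structure only, it is not Mahout code): a shared algebraic core, with the non-algebraic step injected per engine.

```scala
// Hypothetical sketch of a quasi-algebraic algorithm: the algebraic
// core is engine-neutral; only the injected step would be ported.
def algebraicCore(xs: Seq[Double]): Double = xs.map(x => x * x).sum

def quasiAlgebraic(xs: Seq[Double], nonAlgebraicStep: Double => Double): Double =
  nonAlgebraicStep(algebraicCore(xs))

// Engine-flavoured hooks differ; the core does not.
val sparkHook: Double => Double = math.sqrt   // e.g. some spark-side postprocessing
val h2oHook: Double => Double = identity      // e.g. nothing extra on h2o
```

Porting the algorithm to a new engine then means rewriting only the hook, not the core.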




On Fri, Jul 11, 2014 at 12:53 PM, Pat Ferrel  wrote:

> >
> > On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel 
> wrote:
> >
> >> Duplicated from a comment on the PR:
> >>
> >> Beyond these details (specific merge issues)  I have a bigger problem
> with
> >> merging this. Now every time the DSL is changed it may break things in
> h2o
> >> specific code. Merging this would require every committer who might
> touch
> >> the DSL to sign up for fixing any broken tests on both engines.
> >>
> >> To solve this the entire data prep pipeline must be virtualized to run
> on
> >> either engine so the tests for things like CF and ItemSimilarity or
> matrix
> >> factorization (and the multitude of others to come) pass and are engine
> >> independent. As it stands any DSL change that breaks the build will
> have to
> >> rely on a contributor's fix. Even if one of you guys was made a
> committer
> >> we will have this problem where a needed change breaks one or the other
> >> engine specific code. Unless 99% of the entire pipeline is engine
> neutral
> >> the build will be unmaintainable.
> >>
> >> For instance I am making a small DSL change that is required for
> >> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
> >> and its tests, which are in the spark module but since I’m working on
> that
> >> I can fix everything. If someone working on an h2o specific thing had to
> >> change the DSL in a way that broke spark code like ItemSimilarity you
> might
> >> not be able to fix it and I certainly do not want to fix stuff in h2o
> >> specific code when I change the DSL. I have a hard enough time keeping
> mine
> >> running :-)
> >>
> >
> > The way I interpret the above points, the problem you are trying to
> > highlight is with having multiple backends in general, and not this
> backend
> > in specific? Hypothetically, even if this backend is abandoned for the
> > above "problems", as more backends get added in the future, the same
> > "problems" will continue to apply to all of them.
> >
>
> yes, exactly. Adding backends is only maintainable if backend specific
> code (code in the spark module for now) is squeezed down to near zero. The
> more that is there, the more code will be duplicated in the h2o modules.
> Test breakage illustrates the problem; it does not express the breadth or
> depth of the problem.
>
> >
> >> Crudely speaking this means doing away with all references to a
> >> SparkContext and any use of it. So it's not just a matter of reproducing
> >> the spark module but reducing the need for one. Making it so small that
> >> breakages in one or the other engines code will be infrequent and
> changes
> >> to neutral code will only rarely break an engine that the committer is
> >> unfamiliar with.
> >>
> >
> > I think things are already very close to this "ideal" situation you
> > describe above. As a pipeline implementor we should just use
> > DistributedContext, and not SparkContext. And we need an engine neutral
> way
> > to get hold of a DistributedContext from within the math-scala module,
> like
> > this pseudocode:
> >
> >  import org.apache.mahout.math.drm._
> >
> >  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> > System.getenv("BACKEND_ID"), opts...)
> >
> > If environment variables are not set, DistributedContextCreate could
> > default to Spark and local. But all of the pipeline code should ideally
> > exist outside any engine specific module.

Re: H2O integration - completion and review

2014-07-11 Thread Pat Ferrel
> 
> On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel  wrote:
> 
>> Duplicated from a comment on the PR:
>> 
>> Beyond these details (specific merge issues)  I have a bigger problem with
>> merging this. Now every time the DSL is changed it may break things in h2o
>> specific code. Merging this would require every committer who might touch
>> the DSL to sign up for fixing any broken tests on both engines.
>> 
>> To solve this the entire data prep pipeline must be virtualized to run on
>> either engine so the tests for things like CF and ItemSimilarity or matrix
>> factorization (and the multitude of others to come) pass and are engine
>> independent. As it stands any DSL change that breaks the build will have to
>> rely on a contributor's fix. Even if one of you guys was made a committer
>> we will have this problem where a needed change breaks one or the other
>> engine specific code. Unless 99% of the entire pipeline is engine neutral
>> the build will be unmaintainable.
>> 
>> For instance I am making a small DSL change that is required for
>> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
>> and its tests, which are in the spark module but since I’m working on that
>> I can fix everything. If someone working on an h2o specific thing had to
>> change the DSL in a way that broke spark code like ItemSimilarity you might
>> not be able to fix it and I certainly do not want to fix stuff in h2o
>> specific code when I change the DSL. I have a hard enough time keeping mine
>> running :-)
>> 
> 
> The way I interpret the above points, the problem you are trying to
> highlight is with having multiple backends in general, and not this backend
> in specific? Hypothetically, even if this backend is abandoned for the
> above "problems", as more backends get added in the future, the same
> "problems" will continue to apply to all of them.
> 

yes, exactly. Adding backends is only maintainable if backend specific code 
(code in the spark module for now) is squeezed down to near zero. The more 
that is there, the more code will be duplicated in the h2o modules. Test 
breakage illustrates the problem; it does not express the breadth or depth 
of the problem.

> 
>> Crudely speaking this means doing away with all references to a
>> SparkContext and any use of it. So it's not just a matter of reproducing
>> the spark module but reducing the need for one. Making it so small that
>> breakages in one or the other engines code will be infrequent and changes
>> to neutral code will only rarely break an engine that the committer is
>> unfamiliar with.
>> 
> 
> I think things are already very close to this "ideal" situation you
> describe above. As a pipeline implementor we should just use
> DistributedContext, and not SparkContext. And we need an engine neutral way
> to get hold of a DistributedContext from within the math-scala module, like
> this pseudocode:
> 
>  import org.apache.mahout.math.drm._
> 
>  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> System.getenv("BACKEND_ID"), opts...)
> 
> If environment variables are not set, DistributedContextCreate could
> default to Spark and local. But all of the pipeline code should ideally
> exist outside any engine specific module.
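A self-contained sketch of the factory this pseudocode describes (the names and the defaulting behavior are assumptions for illustration, not an existing Mahout API):

```scala
// Hypothetical backend factory: resolve an environment value to a
// context, defaulting to spark/local when nothing is configured.
sealed trait Ctx { def engine: String }
case object SparkLocalCtx extends Ctx { val engine = "spark" }
case object H2OCtx extends Ctx { val engine = "h2o" }

object DistributedContextCreate {
  def apply(backend: Option[String]): Ctx =
    backend.map(_.trim.toLowerCase).filter(_.nonEmpty) match {
      case None | Some("spark") => SparkLocalCtx
      case Some("h2o")          => H2OCtx
      case Some(other)          => sys.error(s"unknown backend: $other")
    }
}

// Pipeline code would then do something like:
// val dc = DistributedContextCreate(Option(System.getenv("MAHOUT_BACKEND")))
```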

The Readers and Writers rely on

var columns = mc.textFile(source).map { line => line.split(delimiter) }

This will not run unless the DistributedContext is actually implemented by 
SparkContext. 
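The snippet above is Spark-bound because `textFile` lives on `SparkContext`. One conceivable shape for making it neutral (a sketch only; none of these names exist in Mahout) is to put the read primitive on the neutral context itself, so reader code never sees an engine type:

```scala
// Hypothetical: the neutral context exposes the IO primitive, so the
// reader compiles against the trait rather than SparkContext.
trait NeutralContext {
  def textFile(source: String): Seq[String]  // toy stand-in for an RDD/Frame
}

// Local fake backend for illustration; a spark backend would delegate
// to sc.textFile, an h2o backend to its own reader.
class LocalContext(files: Map[String, Seq[String]]) extends NeutralContext {
  def textFile(source: String): Seq[String] = files(source)
}

// Engine-neutral version of the reader line above.
def readColumns(mc: NeutralContext, source: String, delimiter: String): Seq[Array[String]] =
  mc.textFile(source).map(line => line.split(delimiter))
```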

Running item similarity on the epinions dataset requires Spark executor 
memory to be 5g in the SparkConf, so this has to be passed in to Spark. 
What is the equivalent for h2o? Do I as the implementor have to figure out 
important tuning factors for every engine?

I need a serializer for HashBiMap to be registered with Kryo or the Spark 
version will not run. What are the analogous problems for h2o? How much 
time will it take me to figure that out?

Answers are moot. The fact that questions come up so often is the issue. It 
took me a fair amount of time to discover these tuning and setup
issues with only one engine.

The more duplicated code there is, the bigger this problem; and the greater 
the impedance mismatch between spark and h2o, the bigger the problem. This 
directly affects how fast Mahout is moving. If there were some clear reason 
for taking this productivity hit other than some idea that engine 
independence sounds clean or good, it would be easier to accept. Still so 
many questions, and we are being asked to merge this into the mainstream?

I am tired of debating this, so I’ll just say that until the spark and h2o 
modules are tiny and trivial, two engines will be a major productivity hit, 
and so until the “ideal” is met, -1 on merge.

If people want to work on making the spark and h2o modules small, increasing 
engine independence, great. But ask yourself why? Seems like if Anand has a 
build that works on both, we should be able to run some non-trivial standard 
data through them on identical clusters and compare speed.



[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059226#comment-14059226
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user avati commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48771908
  
On Fri, Jul 11, 2014 at 11:46 AM, Pat Ferrel 
wrote:

> So you don't see how changing the drm API or storage format will now break
> code in two places written for two different engines?
>
Changing the DRM API? Yes, of course - that is the nature of the beast of
supporting multiple implementations behind a single abstraction. A change in
the abstraction API will need a corresponding change in all backends. That's
the reason why APIs must be designed carefully, so that future changes to
them are kept to a minimum. I don't see how this by itself qualifies as an
objection.

Storage format? Neither spark nor h2o is defining any storage formats. The
current APIs read and write to sequence files whose formats are very well
defined and standardized. As far as they both read and write that common
format from engine-neutral locations, I don't see any problems at all.

If I make the change to drm I can fix spark breakage but not h2o. This bit
> of code is extremely stable and super simple for spark so may be a bad
> example but new code will not be so stable just the opposite. For each new
> IO operation (SparkContext dependent) or engine tuning (SparkConf
> dependent) we will grow the problem. The core will become untouchable or
> breakage will happen in places one engineer will not be able to fix.
>
Can you please provide a more concrete example for both "make the change to
drm" and "new IO operation (SparkContext dependent)"? It is hard for me to
visualize the problems you are foreseeing without more specifics.

This is a real issue, I need to change code in math-scala today, already
> have but it isn't pushed. Who knows what that will break in h2o
> implementations? I will be changing cooccurrence tests, so have to make
> them in two places. Maybe I can do that but when they diverge further than
> this example I won't be able to.
>
Well, as long as you are fixing a bug in cf logic, that should be engine
independent. However if you are adding a new DRM API or modifying an
existing DRM API - that will need corresponding changes in all the engines.
There's no getting around that. That's something we all have to live with,
no matter what project it is.


> You guys need to address these issues as if you were supporting two
> engines for all Mahout code or you will never see what Mahout committers
> problems will be.
>

As I said before, please provide a concrete example of what the issues are.
I don't know *what* to fix yet.

Thanks


> H2O integration
> ---
>
> Key: MAHOUT-1500
> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Anand Avati
> Fix For: 1.0
>
>
> Provide H2O backend for the Mahout DSL



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059204#comment-14059204
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48769459
  
@pferrel  (in case you are talking to me) sorry, i don't have time to read 
the whole discussion. if you can point me to concrete places in the code and 
what you think needs to be done and why, i may be able to try to figure it 
out. But as far as the h2o issue goes, the independent tests have nothing 
really new that @avati hasn't already done (except he cut-and-pasted them, 
and now he just needs to remove all the cut-and-paste and pull in a trait 
from math-scala).




[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059183#comment-14059183
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48767998
  
So you don't see how changing the drm API or storage format will now break 
code in two places, written for two different engines? If I make the change 
to drm I can fix spark breakage but not h2o. This bit of code is extremely 
stable and super simple for spark, so it may be a bad example, but new code 
will not be so stable; just the opposite. For each new IO operation 
(SparkContext dependent) or engine tuning (SparkConf dependent) we will grow 
the problem. The core will become untouchable, or breakage will happen in 
places one engineer will not be able to fix.

This is a real issue; I need to change code in math-scala today, already 
have but it isn't pushed. Who knows what that will break in h2o 
implementations? I will be changing cooccurrence tests, so I have to make 
them in two places. Maybe I can do that, but when they diverge further than 
this example I won't be able to.

You guys need to address these issues as if you were supporting two engines 
for all Mahout code, or you will never see what Mahout committers' problems 
will be.




[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059159#comment-14059159
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48766569
  
Look at #28. Just spent 30 mins doing quick refactoring; it should help you 
with test independence. Every engine should run some common asserts, which 
are included in the `*SuiteBase` traits.
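The shape of that refactoring, sketched with invented names (in Mahout the shared traits live in math-scala as the `*SuiteBase` traits): the common asserts are written once, and each engine module's suite supplies only its fixture.

```scala
// Hypothetical sketch of the shared-test-trait pattern.
trait DrmSuiteBase {
  // Per-engine fixture; each backend builds its matrix its own way.
  def makeMatrix(rows: Int, cols: Int): Array[Array[Double]]

  // Common asserts, written once for all engines.
  def runCommonAsserts(): Unit = {
    val m = makeMatrix(2, 3)
    assert(m.length == 2 && m.forall(_.length == 3))
  }
}

// What a spark (or h2o) module's suite reduces to: fixture only.
object SparkDrmSuite extends DrmSuiteBase {
  def makeMatrix(rows: Int, cols: Int): Array[Array[Double]] =
    Array.fill(rows, cols)(0.0)
}
```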




[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059110#comment-14059110
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user avati commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48762753
  
On Fri, Jul 11, 2014 at 10:46 AM, Pat Ferrel 
wrote:

> Exactly, thanks. I see you've done the same for CF also great.
>
> But this illustrates the problem. I need to change 50% of the tests in CF
> cooccurrence because they were not catching a bug. Now the tests live in
> two places h2o and spark. And unless I change the tests in both places the
> build will break. The files look virtually identical except for the
> imports, which is good. If that's true, I wonder if we could we use a 
Scala
> macro to keep the code all in one file? We might be able to take the same
> code and produce two artifacts that are both run at build time. That would
> reduce the load on devs for this kind of thing.
>
As we discussed on another email thread, I'm independently working on how
to move tests back into math-scala. That effort should address this concern
I think?

> However currently almost all IO code is spark specific. You must have
> re-implemented drm.writeDrm for h2o. Until this is *not* a
> re-implementation but is engine neutral we are going to have a growing
> problem.
>
Why is this a problem? drm.writeDrm() accepts an engine neutral path, like
"hdfs://.." or "file://..." and the content of what gets written is the
well defined sequencefile format no matter what the runtime backend is. And
as long as the path and file content are engine neutral, why should
pipeline code worry how the IO implementation is done? Again, am I missing
something?
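That contract can be stated as a tiny sketch (hypothetical names and a toy in-memory backend; the real on-disk format is the standardized sequence file): write takes a neutral path, and any backend producing the same logical content for the same path is interchangeable.

```scala
// Hypothetical engine-neutral persistence contract: the pipeline sees
// only a URI-like path and the shared (key -> row vector) content.
trait DrmWriter {
  def writeDrm(path: String, rows: Map[Int, Seq[Double]]): Unit
}

// Toy in-memory backend standing in for either engine; spark and h2o
// implementations would differ internally but agree on the content.
class InMemoryWriter extends DrmWriter {
  val store = scala.collection.mutable.Map.empty[String, Map[Int, Seq[Double]]]
  def writeDrm(path: String, rows: Map[Int, Seq[Double]]): Unit =
    store(path) = rows
}
```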




Re: H2O integration - completion and review

2014-07-11 Thread Gokhan Capan
I'll write longer, but in general, +1 to Anand

Sent from my iPhone

> On Jul 11, 2014, at 20:54, Anand Avati  wrote:
>
>> On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel  wrote:
>>
>> Duplicated from a comment on the PR:
>>
>> Beyond these details (specific merge issues)  I have a bigger problem with
>> merging this. Now every time the DSL is changed it may break things in h2o
>> specific code. Merging this would require every committer who might touch
>> the DSL to sign up for fixing any broken tests on both engines.
>>
>> To solve this the entire data prep pipeline must be virtualized to run on
>> either engine so the tests for things like CF and ItemSimilarity or matrix
>> factorization (and the multitude of others to come) pass and are engine
>> independent. As it stands any DSL change that breaks the build will have to
>> rely on a contributor's fix. Even if one of you guys was made a committer
>> we will have this problem where a needed change breaks one or the other
>> engine specific code. Unless 99% of the entire pipeline is engine neutral
>> the build will be unmaintainable.
>>
>> For instance I am making a small DSL change that is required for
>> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
>> and its tests, which are in the spark module but since I’m working on that
>> I can fix everything. If someone working on an h2o specific thing had to
>> change the DSL in a way that broke spark code like ItemSimilarity you might
>> not be able to fix it and I certainly do not want to fix stuff in h2o
>> specific code when I change the DSL. I have a hard enough time keeping mine
>> running :-)
>
> The way I interpret the above points, the problem you are trying to
> highlight is with having multiple backends in general, and not this backend
> in specific? Hypothetically, even if this backend is abandoned for the
> above "problems", as more backends get added in the future, the same
> "problems" will continue to apply to all of them.
>
>
>> Crudely speaking this means doing away with all references to a
>> SparkContext and any use of it. So it's not just a matter of reproducing
>> the spark module but reducing the need for one. Making it so small that
>> breakages in one or the other engines code will be infrequent and changes
>> to neutral code will only rarely break an engine that the committer is
>> unfamiliar with.
>
> I think things are already very close to this "ideal" situation you
> describe above. As a pipeline implementor we should just use
> DistributedContext, and not SparkContext. And we need an engine neutral way
> to get hold of a DistributedContext from within the math-scala module, like
> this pseudocode:
>
>  import org.apache.mahout.math.drm._
>
>  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> System.getenv("BACKEND_ID"), opts...)
>
> If environment variables are not set, DistributedContextCreate could
> default to Spark and local. But all of the pipeline code should ideally
> exist outside any engine specific module.
>
>
>
>> I raised this red flag a long time ago but in the heat of other issues it
>> got lost. I don't think this can be ignored anymore.
>
> The only missing piece I think is having a DistributedContextCreate() call
> such as above? I don't think things are in such a dire state really.. Am I
> missing something?
>
>
>> I would propose that we should remain two separate projects with a mostly
> shared DSL until the maintainability issues are resolved. This seems way too
>> early to merge.
>
> Call me an optimist, but I was hoping more of a "let's work together now to
> make the DSL abstractions easier for future contributors". I will explore
> such a DistributedContextCreate() method in math-scala. That might also be
> the answer for test cases to remain in math-scala.
>
> Thanks


Re: H2O integration - completion and review

2014-07-11 Thread Anand Avati
On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel  wrote:

> Duplicated from a comment on the PR:
>
> Beyond these details (specific merge issues)  I have a bigger problem with
> merging this. Now every time the DSL is changed it may break things in h2o
> specific code. Merging this would require every committer who might touch
> the DSL to sign up for fixing any broken tests on both engines.
>
> To solve this the entire data prep pipeline must be virtualized to run on
> either engine so the tests for things like CF and ItemSimilarity or matrix
> factorization (and the multitude of others to come) pass and are engine
> independent. As it stands any DSL change that breaks the build will have to
> rely on a contributor's fix. Even if one of you guys was made a committer
> we will have this problem where a needed change breaks one or the other
> engine specific code. Unless 99% of the entire pipeline is engine neutral
> the build will be unmaintainable.
>
> For instance I am making a small DSL change that is required for
> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
> and its tests, which are in the spark module but since I’m working on that
> I can fix everything. If someone working on an h2o specific thing had to
> change the DSL in a way that broke spark code like ItemSimilarity you might
> not be able to fix it and I certainly do not want to fix stuff in h2o
> specific code when I change the DSL. I have a hard enough time keeping mine
> running :-)
>

The way I interpret the above points, the problem you are trying to
highlight is with having multiple backends in general, and not this backend
in specific? Hypothetically, even if this backend is abandoned for the
above "problems", as more backends get added in the future, the same
"problems" will continue to apply to all of them.


> Crudely speaking this means doing away with all references to a
> SparkContext and any use of it. So it's not just a matter of reproducing
> the spark module but reducing the need for one. Making it so small that
> breakages in one or the other engines code will be infrequent and changes
> to neutral code will only rarely break an engine that the committer is
> unfamiliar with.
>

I think things are already very close to this "ideal" situation you
describe above. As a pipeline implementor we should just use
DistributedContext, and not SparkContext. And we need an engine neutral way
to get hold of a DistributedContext from within the math-scala module, like
this pseudocode:

  import org.apache.mahout.math.drm._

  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
System.getenv("BACKEND_ID"), opts...)

If environment variables are not set, DistributedContextCreate could
default to Spark and local. But all of the pipeline code should ideally
exist outside any engine specific module.
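
Fleshing out that pseudocode as a hypothetical factory (neither the object nor the backend registry exists in Mahout yet; the backend constructors in the comments are assumptions):

```scala
import org.apache.mahout.math.drm.DistributedContext

// Hypothetical engine-neutral factory, sketched from the pseudocode above.
object DistributedContextCreate {
  def apply(backend: String, masterId: String): DistributedContext =
    backend match {
      case "spark" | null => ??? // e.g. mahoutSparkContext(masterId, "app")
      case "h2o"          => ??? // e.g. a new H2O distributed context
      case other          => sys.error(s"Unknown backend: $other")
    }
}

// Pipeline code then stays engine neutral:
val dc = DistributedContextCreate(
  System.getenv("MAHOUT_BACKEND"), System.getenv("BACKEND_ID"))
```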



> I raised this red flag a long time ago but in the heat of other issues it
> got lost. I don't think this can be ignored anymore.
>

The only missing piece I think is having a DistributedContextCreate() call
such as above? I don't think things are in such a dire state really.. Am I
missing something?


> I would propose that we should remain two separate projects with a mostly
> shared DSL until the maintainability issues are resolved. This seems way too
> early to merge.
>

Call me an optimist, but I was hoping more of a "let's work together now to
make the DSL abstractions easier for future contributors". I will explore
such a DistributedContextCreate() method in math-scala. That might also be
the answer for test cases to remain in math-scala.

Thanks


[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059083#comment-14059083
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48761123
  
Exactly, thanks. I see you've done the same for CF, also great.

But this illustrates the problem. I need to change 50% of the tests in CF 
cooccurrence because they were not catching a bug. Now the tests live in two 
places h2o and spark. And unless I change the tests in both places the build 
will break. The files look virtually identical except for the imports, which is 
good. If that's true, I wonder if we could use a Scala macro to keep the 
code all in one file? We might be able to take the same code and produce two 
artifacts that are both run at build time. That would reduce the load on devs 
for this kind of thing. 
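
Short of macros, a plain abstract suite in math-scala with a per-engine mixin would get the same single-source effect. A sketch (hedged: `DistributedContextSupport` and the per-engine support traits are hypothetical, not existing code):

```scala
import org.scalatest.FunSuite

// Hypothetical fixture: the only engine-specific piece is how `dc` is made.
trait DistributedContextSupport {
  implicit def dc: org.apache.mahout.math.drm.DistributedContext
}

// math-scala: tests written once, purely against the DSL.
trait CooccurrenceSuiteBase extends FunSuite { this: DistributedContextSupport =>
  test("cooccurrence finds expected indicators") {
    // ... body uses only drm ops, never SparkContext or H2O types ...
  }
}

// spark module: class SparkCooccurrenceSuite extends CooccurrenceSuiteBase
//   with SparkContextSupport
// h2o module:   class H2OCooccurrenceSuite extends CooccurrenceSuiteBase
//   with H2OContextSupport
```

That way a test fix lands in one file and both engine builds pick it up.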

However currently almost all IO code is spark specific. You must have 
re-implemented drm.writeDrm for h2o.  Until this is **not** a re-implementation 
but is engine neutral we are going to have a growing problem. I am the only 
person currently working in spark specific land and only Dmitriy and Sebastian 
are writing for V2. When other committers get past the Scala barrier and start 
committing similar stuff they will immediately face this. 

BTW I am very interested in seeing how the Spark ItemSimilarityDriver compares 
to an h2o version. IMO this is the kind of motivation we have to see. If you 
implemented the driver or the reader/writers we could compare speed on h2o and 
spark. We have a large enough dataset to make it interesting.




[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059029#comment-14059029
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user avati commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48757859
  
> The test issue is with the tests in the spark module that actually test
> stuff in the math-scala module. Remember our discussion about splitting
> impl from test for cf? There are several things that cannot be tested
> without the engine in place.
>
I think we are talking about the same tests here. Please compare for
yourself -

https://github.com/avati/mahout/blob/MAHOUT-1500/h2o/src/test/scala/org/apache/mahout/math/decompositions/MathSuite.scala
and

https://github.com/avati/mahout/blob/MAHOUT-1500/spark/src/test/scala/org/apache/mahout/math/decompositions/MathSuite.scala




Build failed in Jenkins: Mahout-Quality #2694

2014-07-11 Thread Apache Jenkins Server
See 

--
Started by timer
Building remotely on ubuntu2 (Ubuntu ubuntu) in workspace 

 > git rev-parse --is-inside-work-tree
Fetching changes from the remote Git repository
 > git config remote.origin.url 
 > https://git-wip-us.apache.org/repos/asf/mahout.git
Fetching upstream changes from 
https://git-wip-us.apache.org/repos/asf/mahout.git
 > git --version
 > git fetch --tags --progress 
 > https://git-wip-us.apache.org/repos/asf/mahout.git 
 > +refs/heads/*:refs/remotes/origin/*
 > git rev-parse origin/master^{commit}
Checking out Revision e4ba7887fc6dbf17c3d73f8d4aa1045eeb48d53e (origin/master)
 > git config core.sparsecheckout
 > git checkout -f e4ba7887fc6dbf17c3d73f8d4aa1045eeb48d53e
FATAL: Could not checkout null with start point 
e4ba7887fc6dbf17c3d73f8d4aa1045eeb48d53e
hudson.plugins.git.GitException: Could not checkout null with start point 
e4ba7887fc6dbf17c3d73f8d4aa1045eeb48d53e
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$8.execute(CliGitAPIImpl.java:1473)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:118)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:328)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: hudson.plugins.git.GitException: Command "git checkout -f 
e4ba7887fc6dbf17c3d73f8d4aa1045eeb48d53e" returned status code 1:
stdout: 
stderr: error: unable to create file 
mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/als/ParallelALSFactorizationJob.java
 (Input/output error)
Checking out files:  37% (13/35)   
error: unable to create file 
mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/als/PredictionMapper.java
 (Input/output error)
Checking out files:  40% (14/35)   
error: unable to create file 
mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java
 (Input/output error)
Checking out files: 100% (35/35), done.
Previous HEAD position was 37b8525... MAHOUT-1565 Remove mapreduce-job-specific 
MAHOUT_OPTS (nishkamravi via gcapan): this closes apache/mahout#4
HEAD is now at e4ba788... MAHOUT-1529: third collection of various edits 
against private branch

at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1301)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1277)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1273)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1078)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1088)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$8.execute(CliGitAPIImpl.java:1468)
... 10 more


[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058986#comment-14058986
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48754615
  
The test issue is with the tests in the spark module that actually test 
stuff in the math-scala module. Remember our discussion about splitting impl 
from test for cf? There are several things that cannot be tested without the 
engine in place.

I will be vocal about objecting to TBD for pipelines. The build will be 
unmaintainable unless the spark module is reduced to trivial and tiny bits. Any 
change to the DSL could break things I do not know how to fix and really don't 
want to sign up for--namely h2o specific TBD stuff.




[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058966#comment-14058966
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user avati commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48753061
  
>
> Are the scalatests implemented in the Spark module that cover math-scala
> code implemented here somewhere? I'd vote against merge until those are all
> in place and passing.
>
Yes, those were the first tests to pass. You can find them in
h2o/src/test/org/apache/mahout/math/.

Also I may have missed it but there should be clear instructions for how to
> build this and run it. This is like a heart transplant. Before you release
> the patient make sure all systems are working correctly, the DSL is not 
the
> whole body. There should at least be some end-to-end pipelines in examples
> that anyone can run from a local installation.
>
As mentioned in the email, there is a somewhat simple "how to build and
test" for both local and distributed mode in h2o/README.md. Larger
end-to-end pipelines and examples are TBD.




Re: H2O integration - completion and review

2014-07-11 Thread Pat Ferrel
Duplicated from a comment on the PR:

Beyond these details (specific merge issues)  I have a bigger problem with 
merging this. Now every time the DSL is changed it may break things in h2o 
specific code. Merging this would require every committer who might touch the 
DSL to sign up for fixing any broken tests on both engines. 

To solve this the entire data prep pipeline must be virtualized to run on 
either engine so the tests for things like CF and ItemSimilarity or matrix 
factorization (and the multitude of others to come) pass and are engine 
independent. As it stands any DSL change that breaks the build will have to 
rely on a contributor's fix. Even if one of you guys was made a committer we 
will have this problem where a needed change breaks one or the other engine 
specific code. Unless 99% of the entire pipeline is engine neutral the build 
will be unmaintainable.

For instance I am making a small DSL change that is required for cooccurrence 
and ItemSimilarity to work. This would break ItemSimilarity and its tests, 
which are in the spark module but since I’m working on that I can fix 
everything. If someone working on an h2o specific thing had to change the DSL 
in a way that broke spark code like ItemSimilarity you might not be able to fix 
it and I certainly do not want to fix stuff in h2o specific code when I change 
the DSL. I have a hard enough time keeping mine running :-) 

Crudely speaking this means doing away with all references to a SparkContext 
and any use of it. So it's not just a matter of reproducing the spark module 
but reducing the need for one. Making it so small that breakages in one or the 
other engines code will be infrequent and changes to neutral code will only 
rarely break an engine that the committer is unfamiliar with.

I raised this red flag a long time ago but in the heat of other issues it got 
lost. I don't think this can be ignored anymore.

I would propose that we should remain two separate projects with a mostly 
shared DSL until the maintainability issues are resolved. This seems way too 
early to merge.


On Jul 11, 2014, at 2:40 AM, Anand Avati  wrote:

Hi all,
The H2O integration is now feature complete till date and is ready for
final review. All the test cases are passing. The pull request
https://github.com/apache/mahout/pull/21 has been updated with the latest
code. Please treat this PR as a candidate for merge.

I have written a brief document on how to set up and use/test the
integration at
https://github.com/avati/mahout/blob/MAHOUT-1500/h2o/README.md. That
includes instructions to test in both local and distributed mode.

I would really appreciate if folks can review the work and provide
feedback, and the next steps.

Thanks,
Avati



[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058934#comment-14058934
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48749896
  
Are the scalatests implemented in the Spark module that cover math-scala 
code implemented here somewhere? I'd vote against merge until those are all 
in place and passing.

The cf stuff has a rather major bug that I'm working on so I wouldn't move 
this into math-scala just yet, although it would make an interesting speed 
comparison once completed. The cf changes will require DSL additions that will 
be under separate review. Don't have a pr number yet.

Also I may have missed it but there should be clear instructions for how to 
build this and run it.  This is like a heart transplant. Before you release the 
patient make sure all systems are working correctly, the DSL is not the whole 
body. There should at least be some end-to-end pipelines in examples that 
anyone can run from a local installation.

Beyond these details I have a bigger issue with merging this. Now every 
time the DSL is changed it may break things in h2o specific code. It already 
does in cf for instance but I've signed up to fix those for spark. No committer 
has signed up to fix code in both Spark and H2O. IMO this is untenable. 

To solve this the entire data prep pipeline must be virtualized to run on 
either engine so the tests for things like CF and ItemSimilarity (and the 
multitude of others to come) pass and are engine independent. As it stands any 
DSL change that breaks the build will have to rely on a contributor's fix. Even 
if one of you guys was made a committer we will have this problem where a 
needed change breaks one or the other engine specific code. Unless 99% of the 
entire pipeline is engine neutral the build will be unmaintainable.

Crudely speaking this means doing away with all references to a 
SparkContext and any use of it. So it's not just a matter of reproducing the 
spark module but reducing the need for one. Making it so small that breakages 
in one or the other engines code will be infrequent. 

I raised this red flag long ago, but in the heat of other issues it seemed 
minor. I don't think it can be ignored anymore.




[jira] [Commented] (MAHOUT-1500) H2O integration

2014-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058589#comment-14058589
 ] 

ASF GitHub Bot commented on MAHOUT-1500:


Github user avati commented on the pull request:

https://github.com/apache/mahout/pull/21#issuecomment-48712178
  
All the points in the previous comments are now completed. This PR is ready 
for final review.




H2O integration - completion and review

2014-07-11 Thread Anand Avati
Hi all,
The H2O integration is now feature complete till date and is ready for
final review. All the test cases are passing. The pull request
https://github.com/apache/mahout/pull/21 has been updated with the latest
code. Please treat this PR as a candidate for merge.

I have written a brief document on how to set up and use/test the
integration at
https://github.com/avati/mahout/blob/MAHOUT-1500/h2o/README.md. That
includes instructions to test in both local and distributed mode.

I would really appreciate if folks can review the work and provide
feedback, and the next steps.

Thanks,
Avati