I'll write longer, but in general, +1 to Anand.

Sent from my iPhone
> On Jul 11, 2014, at 20:54, Anand Avati <[email protected]> wrote:
>
>> On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel <[email protected]> wrote:
>>
>> Duplicated from a comment on the PR:
>>
>> Beyond these details (specific merge issues) I have a bigger problem
>> with merging this. Now every time the DSL is changed it may break
>> things in h2o-specific code. Merging this would require every
>> committer who might touch the DSL to sign up for fixing any broken
>> tests on both engines.
>>
>> To solve this, the entire data prep pipeline must be virtualized to
>> run on either engine, so that the tests for things like CF and
>> ItemSimilarity or matrix factorization (and the multitude of others
>> to come) pass and are engine independent. As it stands, any DSL
>> change that breaks the build will have to rely on a contributor's
>> fix. Even if one of you guys were made a committer, we would still
>> have this problem where a needed change breaks engine-specific code
>> for one engine or the other. Unless 99% of the entire pipeline is
>> engine neutral, the build will be unmaintainable.
>>
>> For instance, I am making a small DSL change that is required for
>> cooccurrence and ItemSimilarity to work. This would break
>> ItemSimilarity and its tests, which are in the spark module, but
>> since I'm working on that I can fix everything. If someone working on
>> an h2o-specific thing had to change the DSL in a way that broke spark
>> code like ItemSimilarity, you might not be able to fix it, and I
>> certainly do not want to fix stuff in h2o-specific code when I change
>> the DSL. I have a hard enough time keeping mine running :-)
>
> The way I interpret the above points, the problem you are trying to
> highlight is with having multiple backends in general, not with this
> backend specifically? Hypothetically, even if this backend were
> abandoned for the above "problems", as more backends get added in the
> future, the same "problems" will continue to apply to all of them.
>
>> Crudely speaking, this means doing away with all references to a
>> SparkContext and any use of it. So it's not just a matter of
>> reproducing the spark module but of reducing the need for one: making
>> it so small that breakages in one or the other engine's code will be
>> infrequent, and changes to neutral code will only rarely break an
>> engine that the committer is unfamiliar with.
>
> I think things are already very close to this "ideal" situation you
> describe. As pipeline implementors we should just use
> DistributedContext, not SparkContext. And we need an engine-neutral
> way to get hold of a DistributedContext from within the math-scala
> module, like this pseudocode:
>
>     import org.apache.mahout.math.drm._
>
>     val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
>                                       System.getenv("BACKEND_ID"), opts...)
>
> If the environment variables are not set, DistributedContextCreate
> could default to Spark and local. But all of the pipeline code should
> ideally live outside any engine-specific module.
>
>> I raised this red flag a long time ago but in the heat of other
>> issues it got lost. I don't think this can be ignored anymore.
>
> The only missing piece, I think, is having a DistributedContextCreate()
> call such as the above? I don't think things are in such a dire state,
> really... Am I missing something?
>
>> I would propose that we remain two separate projects with a mostly
>> shared DSL until the maintainability issues are resolved. This seems
>> way too early to merge.
>
> Call me an optimist, but I was hoping for more of a "let's work
> together now to make the DSL abstractions easier for future
> contributors". I will explore such a DistributedContextCreate() method
> in math-scala. That might also be the answer for test cases to remain
> in math-scala.
>
> Thanks
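
For concreteness, here is a minimal self-contained Scala sketch of how such
a DistributedContextCreate() factory could behave. Only the factory name,
the two environment variables, and the DistributedContext trait name come
from the mail above; the trait body and the two engine bindings are
illustrative stand-ins, not Mahout's actual classes:

    object DistributedContextSketch extends App {

      // Minimal stand-in for Mahout's DistributedContext trait.
      trait DistributedContext {
        def engine: String
        def close(): Unit
      }

      // Hypothetical engine bindings; real implementations would wrap a
      // SparkContext or an H2O cloud handle.
      class SparkLikeContext(master: String) extends DistributedContext {
        val engine = s"spark($master)"
        def close(): Unit = ()
      }

      class H2OLikeContext(cloud: String) extends DistributedContext {
        val engine = s"h2o($cloud)"
        def close(): Unit = ()
      }

      // Engine-neutral factory: pick the backend by name, defaulting to
      // Spark in local mode when the environment variables are unset.
      def DistributedContextCreate(backend: Option[String],
                                   backendId: Option[String]): DistributedContext =
        backend.map(_.toLowerCase).getOrElse("spark") match {
          case "spark" => new SparkLikeContext(backendId.getOrElse("local"))
          case "h2o"   => new H2OLikeContext(backendId.getOrElse("local"))
          case other   => throw new IllegalArgumentException(s"unknown backend: $other")
        }

      // Mirrors the pseudocode in the mail: backend selection is driven by
      // the environment, and pipeline code sees only the trait.
      val dc = DistributedContextCreate(sys.env.get("MAHOUT_BACKEND"),
                                        sys.env.get("BACKEND_ID"))
      println(s"running on: ${dc.engine}")
      dc.close()
    }

The point of this shape is that pipeline code depends only on the
DistributedContext trait, so adding a backend means adding one case to the
factory rather than touching engine-neutral code.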
