Re: Making it easier to use Mahout algorithms with Apache Spark pipelines

Andrew Palumbo Sun, 09 Jul 2017 15:40:06 -0700

Holden, great to have you here.  This sounds great!  Easier interoperability
with Spark and easing the Mahout learning curve are, IMO, huge priorities.

I am conceptually +1 on this as well (my only minor concerns are with our goal
of preserving engine neutrality as best we can).  Given the precedent of Spark
getting favorable treatment, as Trevor pointed out, this should not be much of
a problem.

> Also- I don't see this affecting anything outside of the spark bindings, so
> engine neutrality should be maintained (with spark getting some favorable
> treatment, but at this point... we've pushed Flink to its own profile and
> we keep h2o around because it's not causing any trouble).

I believe that this could fit into our high-level algorithm framework (in
math-scala)...

https://github.com/apache/mahout/tree/master/math-scala/src/main/scala/org/apache/mahout/math/algorithms

It seems so.  IMO the best approach would be to keep the pipeline interfaces
in a high-level module, drop down to the spark module and extend them for
Spark only (which in this case would likely be most of the work), and then add
stubs for Flink and h2o for future developers who may be interested.

There is precedent here as well, e.g. `IndexedDataset`s.
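
To make the layering concrete, here is a rough sketch of what I mean.  All of
the names below (PipelineStage, SparkScaleStage, FlinkScaleStage) are made up
for illustration and are not existing Mahout or Spark APIs, and a plain
Seq[Double] stands in for whatever distributed dataset type the engine would
actually use:

```scala
// Hypothetical names throughout; none of these exist in Mahout or Spark.

// Engine-neutral contract: the kind of interface that could live in a
// high-level module with no engine dependencies.
trait PipelineStage[In, Out] {
  def name: String
  def transform(input: In): Out
}

// Spark-side extension. A real version would presumably wrap a Spark ML
// pipeline stage; here Seq[Double] is only a stand-in for a dataset.
class SparkScaleStage(factor: Double)
    extends PipelineStage[Seq[Double], Seq[Double]] {
  val name = "spark-scale"
  def transform(input: Seq[Double]): Seq[Double] = input.map(_ * factor)
}

// Stub for an engine nobody has implemented yet: it satisfies the same
// contract but fails loudly if someone tries to use it.
class FlinkScaleStage(factor: Double)
    extends PipelineStage[Seq[Double], Seq[Double]] {
  val name = "flink-scale"
  def transform(input: Seq[Double]): Seq[Double] =
    throw new UnsupportedOperationException(s"$name is not implemented yet")
}
```

Callers would code against PipelineStage only, so the Spark implementation can
carry most of the work while the Flink and h2o stubs mark where a future
contributor would plug in.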

Tangentially- @all - I'm just going to throw in that we should consider a
profile for h2o for symmetry, but that is another discussion.

--andy

________________________________
From: holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden Karau 
<hol...@pigscanfly.ca>
Sent: Friday, July 7, 2017 8:22:12 PM
To: dev@mahout.apache.org
Subject: Re: Making it easier to use Mahout algorithms with Apache Spark 
pipelines

The version creep is certainly an issue; normally it's solved by having a
2.X directory for things that are only supported in 2.X and only including
that in the 2.X build. That being said, the pipeline stuff has been around
since 1.3 (albeit as an alpha component), so we could probably make it work
for 1.3+ (but it might make sense to only bother doing it for the 2.X series,
since the rest of the pipeline stages in Spark weren't really well fleshed
out in the 1.X branch).

On Fri, Jul 7, 2017 at 3:33 PM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:

> +1 on this.
>
> There's precedent for spark interoperability in the various drmWrap
> functions.
>
> We've discussed pipelines in the past, and rolling our own vs. utilizing the
> underlying engine. Inter-operating with other pipelines (Spark) doesn't
> preclude that.
>
> The goal of the pipeline discussion iirc, was to eventually get towards
> automated hyper-parameter tuning.  Again, I don't see conflict- maybe a way
> to work in at some point?
>
> In addition to all of this- I think convenience methods and interfaces for
> more advanced spark operations will make the Mahout learning curve less
> steep, and hopefully drive adoption.
>
> The only concern I can think of is version creep- which opens a whole other
> discussion on 'how long will we support Spark 1.6' (I'm not proposing to
> stop anytime soon), but as I understand it a lot of the advanced pipeline stuff
> came about in 2.x. I think this can be easily handled- the Spark
> Interpreter in Apache Zeppelin is rife with multi version support examples
> (1.2 - 2.1)
>
> Also- I don't see this affecting anything outside of the spark bindings, so
> engine neutrality should be maintained (with spark getting some favorable
> treatment, but at this point... we've pushed Flink to its own profile and
> we keep h2o around because it's not causing any trouble).
>
>
>
>
> On Fri, Jul 7, 2017 at 4:32 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> > Hi y'all,
> >
> > Trevor and I had been talking a bit and one of the things I'm interested
> > in doing is trying to make it easier for the different ML libraries to be
> > used in Spark. Spark ML has this unified pipeline interface (which is
> > certainly far from perfect), but I was thinking I'd take a crack at trying
> > to expose some of Mahout's algorithms so that they could be
> > used/configured with Spark ML's pipeline interface.
> >
> > I'd like to take a stab at doing that inside the mahout project, but if
> > it's something people feel would be better to live outside I'm happy to
> > do that as well.
> >
> > Cheers,
> >
> > Holden
> >
> > For reference:
> >
> > https://spark.apache.org/docs/latest/ml-pipeline.html
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> >
>

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
