Holden, sounds good to me; the only thing I'd be cautious of is how dependent we get on that other project, but I don't think it's a big risk.
Thanks!

On Sun, Jul 9, 2017 at 3:33 PM, Andrew Palumbo <ap....@outlook.com> wrote:

> Holden, great to have you here. This sounds great! Easier interoperability
> with Spark and easing the Mahout learning curve are, IMO, huge priorities.
>
> I am conceptually +1 on this as well (my only minor concerns are with our
> goals of preserving engine neutrality as best we can). With the precedent
> of Spark having favorable treatment, as Trevor pointed out, this should
> not be much of a problem.
>
> Also, I don't see this affecting anything outside of the spark bindings,
> so engine neutrality should be maintained (with Spark getting some
> favorable treatment, but at this point we've pushed Flink to its own
> profile and we keep h2o around because it's not causing any trouble).
>
> I believe that this could fit into our high-level algorithm framework (in
> math-scala):
>
> https://github.com/apache/mahout/tree/master/math-scala/src/main/scala/org/apache/mahout/math/algorithms
>
> It seems so. Keeping pipeline interfaces in a high-level module, dropping
> down to the spark module and extending for Spark only (which in this case
> would likely be most of the work), and then adding stubs for Flink and h2o
> for future developers who may have interest would be best, IMO.
>
> There is precedent here as well, e.g. `IndexedDataset`s.
>
> Tangentially, @all: I'm just going to throw in that we should consider a
> profile for h2o for symmetry, but that is another discussion.
>
> --andy
>
> ________________________________
> From: holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden Karau <hol...@pigscanfly.ca>
> Sent: Friday, July 7, 2017 8:22:12 PM
> To: dev@mahout.apache.org
> Subject: Re: Making it easier to use Mahout algorithms with Apache Spark pipelines
>
> The version creep is certainly an issue; normally it's solved by having a
> 2.X directory for things that are only supported in 2.X and including that
> only in the 2.X build. That being said, the pipeline stuff has been around
> since 1.3 (albeit as an alpha component), so we could probably make it
> work for 1.3+ (though it might make sense to bother only with the 2.X
> series, since the rest of the pipeline stages in Spark weren't really well
> fleshed out in the 1.X branch).
>
> On Fri, Jul 7, 2017 at 3:33 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
>
> > +1 on this.
> >
> > There's precedent for Spark interoperability in the various drmWrap
> > functions.
> >
> > We've discussed pipelines in the past: roll our own vs. utilize the
> > underlying engine's. Interoperating with other pipelines (Spark's)
> > doesn't preclude that.
> >
> > The goal of the pipeline discussion, IIRC, was to eventually get toward
> > automated hyperparameter tuning. Again, I don't see a conflict; maybe a
> > way to work it in at some point?
> >
> > In addition to all of this, I think convenience methods and interfaces
> > for more advanced Spark operations will make the Mahout learning curve
> > less steep, and hopefully drive adoption.
> >
> > The only concern I can think of is version creep, which opens a whole
> > other discussion on how long we will support Spark 1.6 (I'm not
> > proposing to stop anytime soon), but as I understand it, a lot of the
> > advanced pipeline stuff came about in 2.x.
> > I think this can be handled easily; the Spark Interpreter in Apache
> > Zeppelin is rife with multi-version support examples (1.2 - 2.1).
> >
> > Also, I don't see this affecting anything outside of the spark bindings,
> > so engine neutrality should be maintained (with Spark getting some
> > favorable treatment, but at this point we've pushed Flink to its own
> > profile and we keep h2o around because it's not causing any trouble).
> >
> > On Fri, Jul 7, 2017 at 4:32 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> >
> > > Hi y'all,
> > >
> > > Trevor and I had been talking a bit, and one of the things I'm
> > > interested in doing is trying to make it easier for the different ML
> > > libraries to be used in Spark. Spark ML has a unified pipeline
> > > interface (which is certainly far from perfect), but I was thinking
> > > I'd take a crack at trying to expose some of Mahout's algorithms so
> > > that they could be used/configured with Spark ML's pipeline interface.
> > >
> > > I'd like to take a stab at doing that inside the Mahout project, but
> > > if it's something people feel would be better living outside, I'm
> > > happy to do that as well.
> > >
> > > Cheers,
> > >
> > > Holden
> > >
> > > For reference:
> > >
> > > https://spark.apache.org/docs/latest/ml-pipeline.html
> > >
> > > --
> > > Twitter: https://twitter.com/holdenkarau
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
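For concreteness, a minimal sketch of what the proposal might look like in the spark module: a Mahout algorithm exposed through Spark ML's `Estimator`/`Model` pipeline interface. This is an illustrative assumption, not an agreed design; the class names (`MahoutOLSEstimator`, `MahoutOLSModel`) are hypothetical, the Mahout fitting logic is elided, and it assumes the Spark 2.x `spark.ml` APIs on the classpath, so it will not run standalone.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical Estimator whose fit() would drop down to a Mahout
// algorithm defined in math-scala.
class MahoutOLSEstimator(override val uid: String)
    extends Estimator[MahoutOLSModel] {

  def this() = this(Identifiable.randomUID("mahoutOLS"))

  override def fit(dataset: Dataset[_]): MahoutOLSModel = {
    // This is where the spark module would convert the Dataset's feature
    // column into a Mahout DRM (cf. the existing drmWrap functions) and
    // call the math-scala fitter; elided in this sketch.
    new MahoutOLSModel(uid)
  }

  override def copy(extra: ParamMap): MahoutOLSEstimator = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
}

// Hypothetical fitted model, usable as a pipeline Transformer.
class MahoutOLSModel(override val uid: String) extends Model[MahoutOLSModel] {
  override def transform(dataset: Dataset[_]): DataFrame = {
    // Apply the fitted Mahout model to produce a prediction column; elided.
    dataset.toDF()
  }
  override def copy(extra: ParamMap): MahoutOLSModel = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
}
```

Once wrapped this way, the stage composes with ordinary Spark stages, e.g. `new Pipeline().setStages(Array(assembler, new MahoutOLSEstimator()))`, which is the interoperability the thread is after.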