The version creep is certainly an issue; normally it's solved by having a 2.x directory for things that are only supported in 2.x and only including that in the 2.x build. That being said, the pipeline stuff has been around since 1.3 (albeit as an alpha component), so we could probably make it work for 1.3+ (though it might make sense to only bother doing this for the 2.x series, since the rest of the pipeline stages in Spark weren't really well fleshed out in the 1.x branch).
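For what it's worth, the per-version source directory approach is usually wired up in Maven with a profile plus the build-helper plugin. A minimal sketch (the spark-2.x profile id and the src/main/spark-2.x path are illustrative, not Mahout's actual build config):

```xml
<!-- Sketch: compile 2.x-only sources via a Maven profile.
     Profile id and directory path are hypothetical examples. -->
<profiles>
  <profile>
    <id>spark-2.x</id>
    <build>
      <plugins>
        <plugin>
          <groupId>org.codehaus.mojo</groupId>
          <artifactId>build-helper-maven-plugin</artifactId>
          <executions>
            <execution>
              <id>add-spark-2x-sources</id>
              <phase>generate-sources</phase>
              <goals>
                <!-- add-source appends an extra source root to the compile path -->
                <goal>add-source</goal>
              </goals>
              <configuration>
                <sources>
                  <source>src/main/spark-2.x/scala</source>
                </sources>
              </configuration>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>
  </profile>
</profiles>
```

Building with `mvn -Pspark-2.x package` would then pull in the 2.x-only code, while the default build stays compatible with older Spark.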
On Fri, Jul 7, 2017 at 3:33 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

> +1 on this.
>
> There's precedent with Spark interoperability in the various drmWrap
> functions.
>
> We've discussed pipelines in the past and roll-our-own vs. utilizing the
> underlying engine. Interoperating with other pipelines (Spark) doesn't
> preclude that.
>
> The goal of the pipeline discussion, iirc, was to eventually get towards
> automated hyperparameter tuning. Again, I don't see a conflict; maybe a
> way to work it in at some point?
>
> In addition to all of this, I think convenience methods and interfaces for
> more advanced Spark operations will make the Mahout learning curve less
> steep, and hopefully drive adoption.
>
> The only concern I can think of is version creep, which opens a whole
> other discussion on "how long will we support Spark 1.6?" (I'm not
> proposing we stop anytime soon), but as I understand it, a lot of the
> advanced pipeline stuff came about in 2.x. I think this can be handled
> easily; the Spark interpreter in Apache Zeppelin is rife with
> multi-version support examples (1.2 - 2.1).
>
> Also, I don't see this affecting anything outside of the Spark bindings,
> so engine neutrality should be maintained (with Spark getting some
> favorable treatment, but at this point... we've pushed Flink to its own
> profile and we keep H2O around because it's not causing any trouble).
>
> On Fri, Jul 7, 2017 at 4:32 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> > Hi y'all,
> >
> > Trevor and I had been talking a bit, and one of the things I'm
> > interested in doing is trying to make it easier for the different ML
> > libraries to be used in Spark. Spark ML has this unified pipeline
> > interface (which is certainly far from perfect), but I was thinking I'd
> > take a crack at trying to expose some of Mahout's algorithms so that
> > they could be used/configured with Spark ML's pipeline interface.
> > I'd like to take a stab at doing that inside the Mahout project, but if
> > it's something people feel would be better to live outside, I'm happy
> > to do that as well.
> >
> > Cheers,
> >
> > Holden
> >
> > For reference:
> >
> > https://spark.apache.org/docs/latest/ml-pipeline.html
> >
> > --
> > Twitter: https://twitter.com/holdenkarau

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau