Holden, great to have you here. This sounds great! Easier interoperability with Spark and an easier Mahout learning curve are, IMO, huge priorities.
I am conceptually +1 on this as well; my only minor concern is with our goal of preserving engine neutrality as best we can. Given the precedent of Spark getting favorable treatment, as Trevor pointed out, this should not be much of a problem.

> Also- I don't see this affecting anything outside of the spark bindings, so
> engine neutrality should be maintained (with spark getting some favorable
> treatment, but at this point... we've pushed Flink to its own profile and
> we keep h2o around because it's not causing any trouble).

I believe that this could fit into our high level algorithm framework (in math-scala)...

https://github.com/apache/mahout/tree/master/math-scala/src/main/scala/org/apache/mahout/math/algorithms

It seems so. Keeping the pipeline interfaces in a high level module, dropping down to the spark module and extending them for Spark only (which in this case would likely be most of the work), and then adding stubs for Flink and h2o for future developers who may have interest would be best IMO. There is precedent here as well, e.g. `IndexedDataset`s.

Tangentially- @all - I'm just going to throw in that we should consider a profile for h2o for symmetry, but that is another discussion.

--andy

________________________________
From: holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden Karau <hol...@pigscanfly.ca>
Sent: Friday, July 7, 2017 8:22:12 PM
To: dev@mahout.apache.org
Subject: Re: Making it easier to use Mahout algorithms with Apache Spark pipelines

The version creep is certainly an issue; normally it's solved by having a 2.X directory for things that are only supported in 2.X and only including that in the 2.X build. That being said, the pipeline stuff has been around since 1.3 (albeit as an alpha component), so we could probably make it work for 1.3+ (but it might make sense to only bother doing it for the 2.X series, since the rest of the pipeline stages in Spark weren't really well fleshed out in the 1.X branch).

On Fri, Jul 7, 2017 at 3:33 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

> +1 on this.
>
> There's precedent for Spark interoperability in the various drmWrap
> functions.
>
> We've discussed pipelines in the past and roll-our-own vs. utilize
> underlying engine. Inter-operating with other pipelines (Spark) doesn't
> preclude that.
>
> The goal of the pipeline discussion, iirc, was to eventually get towards
> automated hyper-parameter tuning. Again, I don't see conflict- maybe a way
> to work in at some point?
>
> In addition to all of this- I think convenience methods and interfaces for
> more advanced spark operations will make the Mahout learning curve less
> steep, and hopefully drive adoption.
>
> The only concern I can think of is version creep- which opens a whole other
> discussion on 'how long will we support Spark 1.6' (I'm not proposing to
> stop anytime soon), but as I understand a lot of the advanced pipeline stuff
> came about in 2.x. I think this can be easily handled- the Spark
> Interpreter in Apache Zeppelin is rife with multi version support examples
> (1.2 - 2.1)
>
> Also- I don't see this affecting anything outside of the spark bindings, so
> engine neutrality should be maintained (with spark getting some favorable
> treatment, but at this point... we've pushed Flink to its own profile and
> we keep h2o around because it's not causing any trouble).
>
> On Fri, Jul 7, 2017 at 4:32 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> > Hi y'all,
> >
> > Trevor and I had been talking a bit and one of the things I'm interested in
> > doing is trying to make it easier for the different ML libraries to be used
> > in Spark. Spark ML has this unified pipeline interface (which is certainly
> > far from perfect), but I was thinking I'd take a crack at trying to expose
> > some of Mahout's algorithms so that they could be used/configured with
> > Spark ML's pipeline interface.
> >
> > I'd like to take a stab at doing that inside the mahout project, but if
> > it's something people feel would be better to live outside I'm happy to do
> > that as well.
> >
> > Cheers,
> >
> > Holden
> >
> > For reference:
> >
> > https://spark.apache.org/docs/latest/ml-pipeline.html
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
>

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau