Holden, sounds good to me; the only thing I'd be cautious of is how dependent we get on that other project, but I don't think it's a big risk.
Thanks!

On Sun, Jul 9, 2017 at 3:33 PM, Andrew Palumbo <ap....@outlook.com> wrote:

> Holden, great to have you here. This sounds great! Easier interoperability
> with Spark and easing the Mahout learning curve are, IMO, huge priorities.
>
> I am conceptually +1 on this as well (my only minor concerns are with our
> goals of preserving engine neutrality as best we can). With the precedent
> of Spark having favorable treatment, as Trevor pointed out, this should
> not be much of a problem.
>
> Also, I don't see this affecting anything outside of the spark bindings,
> so engine neutrality should be maintained (with Spark getting some
> favorable treatment, but at this point we've pushed Flink to its own
> profile and we keep h2o around because it's not causing any trouble).
>
> I believe that this could fit into our high-level algorithm framework (in
> math-scala):
>
> https://github.com/apache/mahout/tree/master/math-scala/src/main/scala/org/apache/mahout/math/algorithms
>
> It seems so. Keeping pipeline interfaces in a high-level module, dropping
> down to the spark module and extending for Spark only (which in this case
> would likely be most of the work), and then adding stubs for Flink and h2o
> for future developers who may have interest would be best, IMO.
>
> There is precedent here as well, e.g. `IndexedDataset`s.
>
> Tangentially, @all: I'm just going to throw in that we should consider a
> profile for h2o for symmetry, but that is another discussion.
>
> --andy
>
> ________________________________
> From: holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden Karau <hol...@pigscanfly.ca>
> Sent: Friday, July 7, 2017 8:22:12 PM
> To: dev@mahout.apache.org
> Subject: Re: Making it easier to use Mahout algorithms with Apache Spark pipelines
>
> The version creep is certainly an issue; normally it's solved by having a
> 2.X directory for things that are only supported in 2.X and including that
> only in the 2.X build. That being said, the pipeline stuff has been around
> since 1.3 (albeit as an alpha component), so we could probably make it
> work for 1.3+ (though it might make sense to bother only with the 2.X
> series, since the rest of the pipeline stages in Spark weren't really well
> fleshed out in the 1.X branch).
>
> On Fri, Jul 7, 2017 at 3:33 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
>
> > +1 on this.
> >
> > There's precedent for Spark interoperability in the various drmWrap
> > functions.
> >
> > We've discussed pipelines in the past: roll our own vs. utilize the
> > underlying engine's. Interoperating with other pipelines (Spark's)
> > doesn't preclude that.
> >
> > The goal of the pipeline discussion, IIRC, was to eventually get toward
> > automated hyperparameter tuning. Again, I don't see a conflict; maybe a
> > way to work it in at some point?
> >
> > In addition to all of this, I think convenience methods and interfaces
> > for more advanced Spark operations will make the Mahout learning curve
> > less steep, and hopefully drive adoption.
> >
> > The only concern I can think of is version creep, which opens a whole
> > other discussion on how long we will support Spark 1.6 (I'm not
> > proposing to stop anytime soon), but as I understand it, a lot of the
> > advanced pipeline stuff came about in 2.x.
> > I think this can be handled easily; the Spark Interpreter in Apache
> > Zeppelin is rife with multi-version support examples (1.2 - 2.1).
> >
> > Also, I don't see this affecting anything outside of the spark bindings,
> > so engine neutrality should be maintained (with Spark getting some
> > favorable treatment, but at this point we've pushed Flink to its own
> > profile and we keep h2o around because it's not causing any trouble).
> >
> > On Fri, Jul 7, 2017 at 4:32 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> >
> > > Hi y'all,
> > >
> > > Trevor and I had been talking a bit, and one of the things I'm
> > > interested in doing is trying to make it easier for the different ML
> > > libraries to be used in Spark. Spark ML has a unified pipeline
> > > interface (which is certainly far from perfect), but I was thinking
> > > I'd take a crack at trying to expose some of Mahout's algorithms so
> > > that they could be used/configured with Spark ML's pipeline interface.
> > >
> > > I'd like to take a stab at doing that inside the Mahout project, but
> > > if it's something people feel would be better living outside, I'm
> > > happy to do that as well.
> > >
> > > Cheers,
> > >
> > > Holden
> > >
> > > For reference:
> > >
> > > https://spark.apache.org/docs/latest/ml-pipeline.html
> > >
> > > --
> > > Twitter: https://twitter.com/holdenkarau
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
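For concreteness, a minimal sketch of what the proposal might look like in the spark module: a Mahout algorithm exposed through Spark ML's `Estimator`/`Model` pipeline interface. This is an illustrative assumption, not an agreed design; the class names (`MahoutOLSEstimator`, `MahoutOLSModel`) are hypothetical, the Mahout fitting logic is elided, and it assumes the Spark 2.x `spark.ml` APIs on the classpath, so it will not run standalone.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical Estimator whose fit() would drop down to a Mahout
// algorithm defined in math-scala.
class MahoutOLSEstimator(override val uid: String)
    extends Estimator[MahoutOLSModel] {

  def this() = this(Identifiable.randomUID("mahoutOLS"))

  override def fit(dataset: Dataset[_]): MahoutOLSModel = {
    // This is where the spark module would convert the Dataset's feature
    // column into a Mahout DRM (cf. the existing drmWrap functions) and
    // call the math-scala fitter; elided in this sketch.
    new MahoutOLSModel(uid)
  }

  override def copy(extra: ParamMap): MahoutOLSEstimator = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
}

// Hypothetical fitted model, usable as a pipeline Transformer.
class MahoutOLSModel(override val uid: String) extends Model[MahoutOLSModel] {
  override def transform(dataset: Dataset[_]): DataFrame = {
    // Apply the fitted Mahout model to produce a prediction column; elided.
    dataset.toDF()
  }
  override def copy(extra: ParamMap): MahoutOLSModel = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
}
```

Once wrapped this way, the stage composes with ordinary Spark stages, e.g. `new Pipeline().setStages(Array(assembler, new MahoutOLSEstimator()))`, which is the interoperability the thread is after.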