Hey- sorry for the delayed reply, I unplugged for the holiday weekend.

Question 1)... how will Mahout use MADlib?

MADlib will be the abstraction layer that lets users run Mahout on SQL.

> Is the plan for Mahout to just expose a wrapper that will call a MADlib
> function internally?

Yes- this all happens in the engine bindings.

> As suggested by you (if I understand correctly), we must either convert a
> Mahout vector to MADlib's convention at Mahout's or MADlib's end. But if
> MADlib does not have the kind of parallelization that Mahout currently has
> for linear algebra, then you will be limited by MADlib's capabilities right?

Not necessarily. The only conversion that HAS to be done is when someone multiplies a DRM (distributed row matrix, in this case a MADlib matrix which is backed by a SQL table) times an in-core vector or matrix. The incoming in-core matrix/vector will be Mahout style and so will have to be converted to the MADlib equivalent.

> I am assuming that Mahout's linear algebra is way more powerful than what
> MADlib has, especially since Mahout kind of specializes in that! But I
> presume what you are talking about is not such a simple wrapper. My lack of
> experience with Mahout/engine bindings/MapBlock just makes it harder for me
> to understand.

"Simple" is relative I suppose- but yes, I think the simple wrapper approach will get the job done.

Question 2)... how MADlib would use Mahout's super powers.

If MADlib moved from its own concept of Vectors and Matrices to using Mahout Vectors and Matrices- in theory it would be able to leverage all of the work we've done wrt various BLAS packs. Other Mahout folks who are more familiar with this can chime in- it's not a requirement but an option.

> MADlib works under the principle that people don't have to move their data
> out of their database for analytics, but rather do it in-database. Since
> Mahout does not currently run on a SQL database engine, I am not sure how
> MADlib can leverage what Mahout is already good at (including use of GPU).

In essence Mahout would run on MADlib. MADlib would be the SQL database engine (or more accurately it would be the abstraction layer).
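
A rough sketch of the in-core conversion from Question 1, for illustration only. This is not Mahout or MADlib code. It assumes an in-core vector is either a dense list or a sparse {index: value} map, and that the SQL side wants one PostgreSQL array literal per row, keyed by a 1-based row id. The (row_id, row_vec) layout and every function name below are assumptions, not taken from either project:

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple, Union

# Hypothetical stand-in for a Mahout in-core vector:
# dense values, or a sparse {index: value} mapping.
InCoreVector = Union[Sequence[float], Dict[int, float]]


def to_dense(vec: InCoreVector, size: Optional[int] = None) -> List[float]:
    """Expand a vector to a dense list of floats.

    `size` only applies to sparse input; if omitted, the length is
    inferred from the largest index present.
    """
    if isinstance(vec, dict):
        n = size if size is not None else (max(vec) + 1 if vec else 0)
        if any(i < 0 or i >= n for i in vec):
            raise ValueError("sparse index out of range")
        dense = [0.0] * n
        for i, v in vec.items():
            dense[i] = float(v)
    else:
        dense = [float(v) for v in vec]
    # NaN/Infinity need their own handling on the SQL side; this sketch rejects them.
    if not all(math.isfinite(v) for v in dense):
        raise ValueError("non-finite values are not handled in this sketch")
    return dense


def to_pg_array_literal(vec: InCoreVector, size: Optional[int] = None) -> str:
    """Render a vector as a PostgreSQL array literal, e.g. {1.0,0.0,2.5}."""
    return "{" + ",".join(repr(v) for v in to_dense(vec, size)) + "}"


def matrix_to_rows(
    matrix: Sequence[InCoreVector], ncol: Optional[int] = None
) -> List[Tuple[int, str]]:
    """Turn an in-core matrix into (row_id, array literal) pairs.

    Row ids start at 1 (an assumed convention for the target table).
    """
    return [
        (row_id, to_pg_array_literal(row, ncol))
        for row_id, row in enumerate(matrix, start=1)
    ]
```

For example, to_pg_array_literal({0: 1.0, 2: 2.5}, size=4) gives {1.0,0.0,2.5,0.0}, which could be bound as a query parameter and cast to float8[] on the SQL side. The point is only that the conversion is cheap and lives in the bindings; the distributed side never leaves the database.
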
The advantage for MADlib would be more visibility and flexibility for users. The GPU usage would require MADlib to change over to Mahout Vectors/Matrices- and even then I'm not sure how straightforward this would be.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

On Thu, May 25, 2017 at 5:50 PM, Nandish Jayaram <njaya...@pivotal.io> wrote:

> Thank you for initiating this thread Trevor.
> The possibility of two Apache projects collaborating together is wonderful,
> and I was just trying to wrap my head around how we could do that with
> Mahout and MADlib. Thanks to my ignorance, I think I have more questions
> than answers now. :-/
>
> The first question is how will Mahout use MADlib? Is the plan for Mahout to
> just expose a wrapper that will call a MADlib function internally? As
> suggested by you (if I understand correctly), we must either convert a
> Mahout vector to MADlib's convention at Mahout's or MADlib's end. But if
> MADlib does not have the kind of parallelization that Mahout currently has
> for linear algebra, then you will be limited by MADlib's capabilities right?
> I am assuming that Mahout's linear algebra is way more powerful than what
> MADlib has, especially since Mahout kind of specializes in that! But I
> presume what you are talking about is not such a simple wrapper. My lack of
> experience with Mahout/engine bindings/MapBlock just makes it harder for me
> to understand.
>
> The second question is about how MADlib would use Mahout's super powers.
> MADlib works under the principle that people don't have to move their data
> out of their database for analytics, but rather do it in-database. Since
> Mahout does not currently run on a SQL database engine, I am not sure how
> MADlib can leverage what Mahout is already good at (including use of GPU).
> I am clearly missing something here, can you please shed some light on this
> too?
>
> Nandish
>
> On Mon, May 22, 2017 at 12:33 PM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
>
> > Nice call out.
> >
> > So there is precedent for NOT utilizing the Mahout inCore matrix/vector
> > structure in Mahout Bindings- See H2O bindings.
> >
> > In this case- we let the underlying engine (in this case MADlib) utilize
> > its own concept of a Matrix.
> >
> > Makes quicker work of writing bindings and, since most of the deep stuff
> > in MADlib is CPP, I assume there's fairly good performance there anyway.
> > (Mahout is JVM under the hood, so without the accelerators, performance
> > was not spectacular).
> >
> > On Sun, May 21, 2017 at 9:05 PM, Jim Nasby <jim.na...@openscg.com> wrote:
> >
> > > On 5/21/17 7:38 PM, Trevor Grant wrote:
> > >
> > >> I don't think a PhD in math/ML is required at all for this little
> > >> venture. Mainly just a knowledge of basic BLAS operations
> > >> (Matrix A %*% Matrix B, Matrix A %*% Vector, etc.)
> > >
> > > Related to that, there's also been discussion[1] on the Postgres hackers
> > > list about adding a true matrix data type. Having that would allow
> > > plCUDA to do direct GPU matrix math with the bare minimum of fuss.
> > >
> > > Madlib would presumably need some other solution for non-postgres stuff
> > > (though, the matrix type could potentially be pulled into GPDB with
> > > minimal fuss).
> > >
> > > 1: https://www.postgresql.org/message-id/flat/9A28C8860F777E439AA12E8AEA7694F8011F52EF%40BPXM15GP.gisp.nec.co.jp
> > > --
> > > Jim Nasby, Chief Data Architect, Austin TX
> > > OpenSCG http://OpenSCG.com