Re: Follow up from ApacheCon

2017-05-30 Thread Trevor Grant
Hey- sorry for the delayed reply, I unplugged for the holiday weekend.

Question 1)... how will Mahout use MADlib?
MADlib will be the abstraction layer that lets users run Mahout on SQL.

Is the plan for Mahout to just expose a wrapper that will call a MADlib
function internally?
Yes- this all happens in the engine bindings

As suggested by you (if I understand correctly), we must either convert a
Mahout vector to MADlib's convention at Mahout's or MADlib's end. But if
MADlib does not have the kind of parallelization that Mahout currently has
for linear algebra, then you will be limited by MADlib's capabilities
right?
Not necessarily.  The only conversion that HAS to be done will be if
someone is multiplying a DRM (distributed row matrix, in this case a MADlib
matrix which is backed by a SQL table) times an "in-core vector or matrix".
The incoming in-core matrix/vector will be Mahout style and so will have to
be converted to the MADlib equivalent.
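
As a rough illustration of that one conversion, here is a minimal Python
sketch. It assumes the MADlib side wants a sparse matrix table with
(row_id, col_id, value) columns and 1-based indices; the table name
`mahout_tmp_b` and both helper names are made up for the example, and none
of this is existing Mahout or MADlib code.

```python
def to_sparse_triples(matrix, base=1):
    """Flatten an in-core dense matrix (a list of rows) into
    (row_id, col_id, value) triples, skipping zero entries.

    Assumption: the SQL side indexes rows and columns from `base` (1 here).
    Note that dropping zeros loses the matrix dimensions, so a real binding
    would have to carry nrow/ncol separately.
    """
    return [
        (i + base, j + base, float(v))
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if v != 0
    ]


def to_insert_sql(table, triples):
    """Render the triples as one multi-row INSERT statement.

    Illustration only: real code should use parameterized queries or COPY
    rather than string formatting.
    """
    values = ", ".join(f"({r}, {c}, {v!r})" for r, c, v in triples)
    return f"INSERT INTO {table} (row_id, col_id, value) VALUES {values};"


# A 2x2 in-core matrix standing in for a Mahout in-core matrix.
in_core = [[1.0, 0.0], [0.0, 2.5]]
triples = to_sparse_triples(in_core)
sql = to_insert_sql("mahout_tmp_b", triples)  # hypothetical table name
```

Once the in-core operand sits in a table like this, the multiplication
itself can stay entirely on the SQL side.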

I am assuming that Mahout's linear algebra is way more powerful than what
MADlib has, especially since Mahout kind of specializes in that! But I
presume what you are talking about is not such a simple wrapper. My lack of
experience with Mahout/engine bindings/MapBlock just makes it harder for me
to understand.

"Simple" is relative I suppose- but yes, I think the simple wrapper
approach will get the job done.

Question 2) ... how MADlib would use Mahout's super powers.
If MADlib moved from its own concept of Vectors and Matrices to using
Mahout Vectors and Matrices- in theory it would be able to leverage all of
the work we've done wrt various BLAS packs. Other Mahout folks who are more
familiar with this can chime in- it's not a requirement but an option.


MADlib works under the principle that people don't have to move their data out
of their database for analytics, but rather do it in-database. Since Mahout
does not currently run on a SQL database engine, I am not sure how MADlib
can leverage what Mahout is already good at (including use of GPU).

In essence Mahout would run on MADlib. MADlib would be the SQL database
engine (or more accurately it would be the abstraction layer).  The
advantage for MADlib would be more visibility and flexibility for users.

The GPU usage would require MADlib to change over to Mahout
Vectors/Matrices- and even then I'm not sure how straightforward this
would be.



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Thu, May 25, 2017 at 5:50 PM, Nandish Jayaram 
wrote:

> Thank you for initiating this thread Trevor.
> The possibility of two Apache projects collaborating together is wonderful,
> and I was just trying to wrap my head around how we could do that with
> Mahout
> and MADlib. Thanks to my ignorance, I think I have more questions than
> answers now. :-/
>
> The first question is how will Mahout use MADlib? Is the plan for Mahout to
> just expose a wrapper that will call a MADlib function internally? As
> suggested
> by you (if I understand correctly), we must either convert a Mahout vector
> to
> MADlib's convention at Mahout's or MADlib's end. But if MADlib does not
> have the kind of parallelization that Mahout currently has for linear
> algebra,
> then you will be limited by MADlib's capabilities right? I am assuming that
> Mahout's linear algebra is way more powerful than what MADlib has,
> especially
> since Mahout kind of specializes in that! But I presume what you
> are talking about is not such a simple wrapper. My lack of experience with
> Mahout/engine bindings/MapBlock just makes it harder for me to understand.
>
> The second question is about how MADlib would use Mahout's super powers.
> MADlib works under the principle that people don't have to move their data
> out of their database for analytics, but rather do it in-database. Since
> Mahout does
> not currently run on a SQL database engine, I am not sure how MADlib can
> leverage
> what Mahout is already good at (including use of GPU). I am clearly missing
> something here, can you please shed some light on this too?
>
> Nandish
>
> On Mon, May 22, 2017 at 12:33 PM, Trevor Grant 
> wrote:
>
> > Nice call out.
> >
> > So there is precedent for NOT utilizing the Mahout inCore matrix/vector
> > structure in Mahout Bindings- See H2O bindings.
> >
> > In this case- we let the underlying engine (in this case MADlib) utilize
> > its own concept of a Matrix.
> >
> > Makes quicker work of writing bindings and, since most of the deep stuff in
> > MADlib is CPP, I assume there's fairly good performance there anyway.
> > (Mahout is JVM under the hood, so without the accelerators, performance
> > was not spectacular).
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able

Re: Follow up from ApacheCon

2017-05-25 Thread Nandish Jayaram
Thank you for initiating this thread Trevor.
The possibility of two Apache projects collaborating together is wonderful,
and I was just trying to wrap my head around how we could do that with
Mahout
and MADlib. Thanks to my ignorance, I think I have more questions than
answers now. :-/

The first question is how will Mahout use MADlib? Is the plan for Mahout to
just expose a wrapper that will call a MADlib function internally? As
suggested
by you (if I understand correctly), we must either convert a Mahout vector
to
MADlib's convention at Mahout's or MADlib's end. But if MADlib does not
have the kind of parallelization that Mahout currently has for linear
algebra,
then you will be limited by MADlib's capabilities right? I am assuming that
Mahout's linear algebra is way more powerful than what MADlib has,
especially
since Mahout kind of specializes in that! But I presume what you
are talking about is not such a simple wrapper. My lack of experience with
Mahout/engine bindings/MapBlock just makes it harder for me to understand.

The second question is about how MADlib would use Mahout's super powers.
MADlib works under the principle that people don't have to move their data
out of their database for analytics, but rather do it in-database. Since
Mahout does
not currently run on a SQL database engine, I am not sure how MADlib can
leverage
what Mahout is already good at (including use of GPU). I am clearly missing
something here, can you please shed some light on this too?

Nandish

On Mon, May 22, 2017 at 12:33 PM, Trevor Grant 
wrote:

> Nice call out.
>
> So there is precedent for NOT utilizing the Mahout inCore matrix/vector
> structure in Mahout Bindings- See H2O bindings.
>
> In this case- we let the underlying engine (in this case MADlib) utilize
> its own concept of a Matrix.
>
> Makes quicker work of writing bindings and, since most of the deep stuff in
> MADlib is CPP, I assume there's fairly good performance there anyway.
> (Mahout is JVM under the hood, so without the accelerators, performance
> was not spectacular).
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Sun, May 21, 2017 at 9:05 PM, Jim Nasby  wrote:
>
> > On 5/21/17 7:38 PM, Trevor Grant wrote:
> >
> >> I don't think a PhD in math/ML is required at all for this little
> venture.
> >> Mainly just a knowledge of basic BLAS operations (Matrix A %*% Matrix B,
> >> Matrix A %*% Vector, etc.)
> >>
> >
> > Related to that, there's also been discussion[1] on the Postgres hackers
> > list about adding a true matrix data type. Having that would allow plCUDA
> > to do direct GPU matrix math with the bare minimum of fuss.
> >
> > Madlib would presumably need some other solution for non-postgres stuff
> > (though, the matrix type could potentially be pulled into GPDB with
> minimal
> > fuss).
> >
> > 1: https://www.postgresql.org/message-id/flat/9A28C8860F777E439AA12E8AEA7694F8011F52EF%40BPXM15GP.gisp.nec.co.jp
> > --
> > Jim Nasby, Chief Data Architect, Austin TX
> > OpenSCG http://OpenSCG.com
> >
>


Re: Follow up from ApacheCon

2017-05-22 Thread Trevor Grant
Nice call out.

So there is precedent for NOT utilizing the Mahout inCore matrix/vector
structure in Mahout Bindings- See H2O bindings.

In this case- we let the underlying engine (in this case MADlib) utilize
its own concept of a Matrix.

Makes quicker work of writing bindings and, since most of the deep stuff in
MADlib is CPP, I assume there's fairly good performance there anyway.
(Mahout is JVM under the hood, so without the accelerators, performance
was not spectacular).


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Sun, May 21, 2017 at 9:05 PM, Jim Nasby  wrote:

> On 5/21/17 7:38 PM, Trevor Grant wrote:
>
>> I don't think a PhD in math/ML is required at all for this little venture.
>> Mainly just a knowledge of basic BLAS operations (Matrix A %*% Matrix B,
>> Matrix A %*% Vector, etc.)
>>
>
> Related to that, there's also been discussion[1] on the Postgres hackers
> list about adding a true matrix data type. Having that would allow plCUDA
> to do direct GPU matrix math with the bare minimum of fuss.
>
> Madlib would presumably need some other solution for non-postgres stuff
> (though, the matrix type could potentially be pulled into GPDB with minimal
> fuss).
>
> 1: https://www.postgresql.org/message-id/flat/9A28C8860F777E439AA12E8AEA7694F8011F52EF%40BPXM15GP.gisp.nec.co.jp
> --
> Jim Nasby, Chief Data Architect, Austin TX
> OpenSCG http://OpenSCG.com
>


Re: Follow up from ApacheCon

2017-05-21 Thread Trevor Grant
Awesome!!

I don't think a PhD in math/ML is required at all for this little venture.
Mainly just a knowledge of basic BLAS operations (Matrix A %*% Matrix B,
Matrix A %*% Vector, etc.)

The keys to success here are going to be:
- Making CPP/Scala/SQL all talk to each other (no big deal... lol).
- Being able to work with respective communities and tap their knowledge.

To create bindings in Mahout see:
https://github.com/apache/mahout/tree/master/flink/src/main/scala/org/apache/mahout/flinkbindings/blas
https://github.com/apache/mahout/tree/master/spark/src/main/scala/org/apache/mahout/sparkbindings/blas

https://github.com/apache/mahout/tree/master/h2o/src/main/java/org/apache/mahout/h2obindings/ops

Those types of operations need to be implemented.
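
To give a feel for what one of those operators means on a SQL-backed
matrix, here is a minimal Python sketch of A %*% B where both operands are
stored as (row, col, value) triples. It is the in-memory analogue of
joining a to b on a.col = b.row and summing a.value * b.value grouped by
(a.row, b.col). The triple layout and the function name are assumptions
for illustration, not the actual Mahout or MADlib implementation.

```python
from collections import defaultdict


def sparse_matmul(a, b):
    """Multiply two matrices given as (row, col, value) triples.

    Equivalent in spirit to:
      SELECT a.row, b.col, SUM(a.value * b.value)
      FROM a JOIN b ON a.col = b.row
      GROUP BY a.row, b.col
    """
    # Index b by its row id so the "join" on a.col = b.row is a lookup.
    b_by_row = defaultdict(list)
    for k, j, w in b:
        b_by_row[k].append((j, w))

    # Accumulate partial products per output cell (the GROUP BY + SUM).
    out = defaultdict(float)
    for i, k, v in a:
        for j, w in b_by_row.get(k, ()):
            out[(i, j)] += v * w

    return sorted((i, j, v) for (i, j), v in out.items())


# A = [[1, 2], [0, 3]] and B = column vector [4, 5], both 1-indexed.
a = [(1, 1, 1.0), (1, 2, 2.0), (2, 2, 3.0)]
b = [(1, 1, 4.0), (2, 1, 5.0)]
product = sparse_matmul(a, b)
```

The same join-and-aggregate shape covers Matrix %*% Vector, since a vector
is just a matrix with a single column.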

As I dig around on MADlib a little more I find this:
http://madlib.incubator.apache.org/docs/latest/group__grp__matrix.html

Again, I'm just sticking my toes in the water- but it appears that most of
the 'hard' stuff is done, just need a wrapper.  There will either need to
be a way to serialize a Mahout Vector to look like a MADlib vector (easy
way) or MADlib will need to implement Mahout Vectors (much more convoluted,
but adds GPU acceleration to MADlib).  Also need to figure out the MADlib
equivalent of a MapBlock-like operation (apply an anonymous function to each
row).
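
For anyone unfamiliar with MapBlock, here is a minimal Python sketch of the
semantics a SQL-side equivalent would need to reproduce: rows are handed to
a user function in blocks of (keys, block), and the function returns the
transformed (keys, block). The name `map_block`, the `block_size` parameter,
and the list-of-lists representation are assumptions made for this
example; how MADlib would actually do this is the open question.

```python
def map_block(rows, block_size, fn):
    """Apply fn(keys, block) to each contiguous block of rows.

    rows: list of (key, vector) pairs, standing in for a distributed
          row matrix.
    fn:   function taking (keys, block) and returning (new_keys, new_block),
          where block is a list of row vectors.
    """
    out = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        keys = [k for k, _ in chunk]
        block = [list(v) for _, v in chunk]  # copy so fn cannot alias input
        new_keys, new_block = fn(keys, block)
        out.extend(zip(new_keys, new_block))
    return out


# Three rows processed in blocks of two: add 1.0 to every element.
rows = [(1, [1.0, 2.0]), (2, [3.0, 4.0]), (3, [5.0, 6.0])]
plus_one = map_block(
    rows,
    2,
    lambda keys, block: (keys, [[x + 1.0 for x in r] for r in block]),
)
```

In a database the blocks would presumably correspond to however the table
is partitioned, with the function running as a user-defined function close
to the data.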

Having never worked with MADlib nor written my own bindings in Mahout-
everyone is encouraged to chime in and sharp shoot my naivety in thinking
this isn't going to be too painful :)


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Sun, May 21, 2017 at 6:08 PM, Jim Jagielski  wrote:

> ME ME. I *really* want to get more involved in both, but as
> a serious interested volunteer (this would be all on my copious
> amounts of free time)! This area intrigues me and would love
> to be able to hack on it, but am by no means a PhD in ML.
>
> > On May 19, 2017, at 2:05 AM, Trevor Grant 
> wrote:
> >
> > Saw a really awesome shark tank talk today at ApacheCon.
> >
> > Had a conversation after and wanted to follow up.
> >
> > The Apache MADlib-incubator project is Machine Learning on SQL. (also
> close
> > to graduation as I understand)
> >
> > The Apache Mahout project is engine-neutral, roll-your-own machine learning
> > / statistical algorithms (with a quickly increasing canon of 'precanned'
> > algorithms).
> >
> > (Both projects have a lot of other cool tricks, but let's table that for
> > now).
> >
> > Based on a one off discussion, it is highly likely that the 'hard part'
> of
> > writing engine bindings in Mahout, has already been done by MADlib as a
> > course of business. (That is linear algebra like operations on 'matrices'
> > backed by SQL).
> >
> > Mahout also brings some cool things like GPU acceleration to the table.
> > (FYI Mahout GPU, as I understand is CPP at the low level, just to get
> your
> > wheels turning) (MADlib project, Mahout uses JavaCPP and other Java
> > wrappers for CPP libraries at the very low level for implementing GPU
> > acceleration)
> >
> > There are numerous more benefits I can think of- but that's the high level
> > so everyone on each project gets the gist of it.
> >
> > I think an integration (MADlib-based SQL bindings, for lack of a better term)
> > is potentially an easy win that would yield big advantages for both
> > projects, and would like to propose some exploratory collaboration.
> >
> > "Roll your own GPU accelerated statistical algorithms on PostgreSQL and
> > other SQL engines- brought to you by Apache Mahout+ Apache
> > MADlib-incubator" - or Apache MADlib-incubator + Apache Mahout, depending
> > on who is giving the conference talk ;)
> >
> > Encouraging anyone interested to sign up for the appropriate dev list.
>
>


Re: Follow up from ApacheCon

2017-05-21 Thread Jim Jagielski
ME ME. I *really* want to get more involved in both, but as
a serious interested volunteer (this would be all on my copious
amounts of free time)! This area intrigues me and would love
to be able to hack on it, but am by no means a PhD in ML.

> On May 19, 2017, at 2:05 AM, Trevor Grant  wrote:
> 
> Saw a really awesome shark tank talk today at ApacheCon.
> 
> Had a conversation after and wanted to follow up.
> 
> The Apache MADlib-incubator project is Machine Learning on SQL. (also close
> to graduation as I understand)
> 
> The Apache Mahout project is engine-neutral, roll-your-own machine learning
> / statistical algorithms (with a quickly increasing canon of 'precanned'
> algorithms).
> 
> (Both projects have a lot of other cool tricks, but let's table that for
> now).
> 
> Based on a one off discussion, it is highly likely that the 'hard part' of
> writing engine bindings in Mahout, has already been done by MADlib as a
> course of business. (That is linear algebra like operations on 'matrices'
> backed by SQL).
> 
> Mahout also brings some cool things like GPU acceleration to the table.
> (FYI Mahout GPU, as I understand is CPP at the low level, just to get your
> wheels turning) (MADlib project, Mahout uses JavaCPP and other Java
> wrappers for CPP libraries at the very low level for implementing GPU
> acceleration)
> 
> There are numerous more benefits I can think of- but that's the high level
> so everyone on each project gets the gist of it.
> 
> I think an integration (MADlib-based SQL bindings, for lack of a better term)
> is potentially an easy win that would yield big advantages for both
> projects, and would like to propose some exploratory collaboration.
> 
> "Roll your own GPU accelerated statistical algorithms on PostgreSQL and
> other SQL engines- brought to you by Apache Mahout+ Apache
> MADlib-incubator" - or Apache MADlib-incubator + Apache Mahout, depending
> on who is giving the conference talk ;)
> 
> Encouraging anyone interested to sign up for the appropriate dev list.