+1, piggybacking sounds reasonable and like a quick win.

On 01/06/16 18:17, Trevor Grant wrote:
Hey Eric,

The 'piggyback' or 'patch' approach is a lot easier and less invasive to
implement in practice, and has the Zeppelin community blessing.

When the Flink version comes online, it will also be super easy to
replicate the effort.  And even doing two (or more) 'piggybacks' will be
easier to maintain than one stand-alone Mahout interpreter.  Also,
'piggybacking' opens up the possibility of sharing between contexts,
minimizes user configuration, etc.

The differential is about 20 new lines of code for a piggyback on any
underlying engine, vs. about 300 lines of code for a stand-alone
interpreter which must be kept up to date with its Spark/Flink
counterparts.

Philosophically the stand-alone makes sense, practically the piggyback
does. *shruggie*

It is possible that somewhere down the road we'll refactor the
piggyback(s) into a stand-alone interpreter, at which point none of the current
effort will be wasted; it will just be moving some code around.  So the
other advantage of the piggyback is that it quickly fields a minimum viable
product, without having to pay much for it later on down the road.

This is in part due to the way Zeppelin implemented its interpreters which
involves a lot of code repetition.

I'm open to further discussion, but after playing in the Zeppelin code for
a while and really grokking different approaches, I think this one is best. I
do invite critiques because I believe I have considered most angles and can
properly defend the current path, and if there is something I haven't
thought of, I'd rather it be brought to light sooner than later.

tg


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Wed, Jun 1, 2016 at 11:00 AM, Eric Charles <e...@apache.org> wrote:

Hi Suneel, an independent interpreter makes sense as Mahout is supposed to run on
various backends, so not only Spark.

Yes, I am following the Mahout mailing list (and not abroad this year - this
may change in the future).

On 30/05/16 05:47, Suneel Marthi wrote:

Hi Eric,

We're talking about the same PR, which is a tweak of the existing Spark-Zeppelin
interpreter. What we're looking at is a specific Mahout-Spark-Zeppelin interpreter that
is independent of the above.

BTW Eric, nice to see you on the Mahout mailing lists; you didn't make it to
Vancouver this time?

On Sun, May 29, 2016 at 10:57 PM, Eric Charles <e...@apache.org> wrote:

Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?

https://github.com/apache/incubator-zeppelin/pull/928

It declares in the spark interpreter the mahout deps, and creates the sdc
(spark distributed context).

On 29/05/16 19:16, Suneel Marthi wrote:

On Sun, May 29, 2016 at 12:07 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
duplicating efforts.

Two things:

1- The blog post referenced the linear-regression example notebook twice -
I've updated it to reference the ggplot integration. E.g. import this
note:

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json

(I still need to update with a blurb about sampling, however it is done in
that note...) So to any who tried the blog, a huge apology, because that
notebook is where all of the 'magic happened' (all of the screenshots /
gg-plots / etc. happened there).

2- I have a working prototype of the Zeppelin integration: the
'mahout-terp' branch of
https://github.com/rawkintrevo/incubator-zeppelin.
If you build, and set 'spark.mahout' to 'true' in the Spark interpreter
properties, you have a Mahout interpreter. This is the minimally invasive
way to do it. I'll be opening a PR soon; we'll see what the gang over at
Zeppelin say.
I'll still need docs and an example notebook, but I'm waiting to make sure
I don't need to do a major refactor before I get carried away with those
activities.

In essence when 'spark.mahout' is 'true' you jump right in on the R-like DSL
and you have a sdc declared based on the underlying sc.


I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
interpreter is gonna go down well with the Spark insanity.  I would prefer
having a separate Mahout-Spark-Zeppelin interpreter under the Zeppelin project,
if that's acceptable to the Zeppelin folks, even though most of it might be
repeated.

What do others have to say?


have a good holiday weekend,


tg





On Sun, May 29, 2016 at 10:49 AM, Andrew Palumbo <ap....@outlook.com>
wrote:

Thx Trevor,

Re: m-1854, it was something that we started when we were first discussing
using the Smile plots and trying to pipe them over to Zeppelin. As
far as I know there was no progress started on it; I've unassigned it.

Feel free to assign any Jiras to yourself.  I think that m-1854 is similar
to the mahout-spark-shell, so I may be able to help out there.


________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Saturday, May 28, 2016 11:21:44 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Created a subtask on 1855 for tsv strings.

Looking at 1854 assigned to Pat Ferrel, what's your progress to date? How
can I help?

tg





On Thu, May 26, 2016 at 2:34 PM, Andrew Palumbo <ap....@outlook.com>
wrote:

Great!


When you free up and have the time, could you create some Jiras for these?

We actually have MAHOUT-1852 open for Histograms already, and MAHOUT-1854
and MAHOUT-1855 (early Zeppelin integration Jiras).  I can close m-1854
and m-1855 out and we can start new ones if they're not relevant anymore, or we
can just go with those.

Thanks

________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Thursday, May 26, 2016 3:17:22 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Short answer: it is high priority. I think it will be a Mahout interpreter
into Zeppelin, and given that plans are on hold for a Flink-Mahout in the
short term, I think it should be a piggy-back spark interpreter (e.g.
exposed through something like %spark.mahout).   So I have thoughts, but no
plan.  Been busy with a couple of other commitments.

On the Mahout side we need:
A function that will convert small matrices into TSV strings
Convenience functions for sampling super-large matrices into things like
histograms, etc., that one would want to plot. I.e. histogram bucketing?
(less important for the moment)

On the Zeppelin side we need:
an interpreter.
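The "small matrix -> TSV string" helper could be sketched in plain Scala; the object/method names and the Array[Array[Double]] stand-in for a sampled in-core Mahout Matrix are assumptions for illustration, not the eventual Mahout API:

```scala
// Hypothetical sketch: render a small (already-sampled) in-core matrix as a
// tab-separated string with a header row, which Zeppelin's %table display
// can consume. Names and types here are illustrative only.
object MatrixTsv {
  def toTsv(m: Array[Array[Double]], colNames: Seq[String]): String = {
    val header = colNames.mkString("\t")    // column names become the header row
    val rows   = m.map(_.mkString("\t"))    // one TSV line per matrix row
    (header +: rows).mkString("\n")
  }
}
```

For example, MatrixTsv.toTsv(Array(Array(1.0, 2.0)), Seq("x", "y")) yields a header line "x\ty" followed by "1.0\t2.0".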




On Thu, May 26, 2016 at 1:22 PM, Suneel Marthi <smar...@apache.org> wrote:


While on this subject, do we have a plan yet for integrating Zeppelin into
Mahout (or the converse) of having a Mahout-specific interpreter for
Zeppelin?  I think that should be high priority in the short term.

On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:


Ahh, like the "Sample From Matrix" paragraph in the notebook.

Yea that seems like a good add. If not this afternoon, I'll include it
Saturday.







On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <ap....@outlook.com> wrote:


Trevor, I was reading over your blog last night again- first time since
you updated. It is great!

I have one suggestion: adding in a code line on how the sampling of the
DRM -> in-core Matrix is done:

https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148


e.g. something like:

       val mxSin = drmSampleKRows(drmSin, 1000, replacement = false)

Maybe you omitted this intentionally?

Andy

________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Friday, May 20, 2016 7:56:20 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version is
uninformative to me. I'd say if possible, your first troubleshooting
measure would be to re-clone or do a "git fetch upstream" to get up to the
very latest.

Sorry for delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <andrew.mussel...@gmail.com> wrote:


Trevor, my zeppelin source is at this version:


     <groupId>org.apache.zeppelin</groupId>
     <artifactId>zeppelin</artifactId>
     <packaging>pom</packaging>
     <version>0.6.0-incubating-SNAPSHOT</version>
     <name>Zeppelin</name>
     <description>Zeppelin project</description>
     <url>http://zeppelin.incubator.apache.org/</url>

And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?


On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:


No parentheses.

import o.a.m.sparkbindings._
...
val myRdd = myDrm.rdd

On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <smar...@apache.org> wrote:



On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

Hey Pat,

If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer.  In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is, for the most part, also
matplotlib with some convenience functions).

https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u

In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context; you can create RDDs in one
language and access them / work on them in another (so I understand).

So in Mahout can you "save" a matrix as an RDD? e.g. something like

val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()

And would 'myRDD' then exist in the spark context?

yes it will be in sparkContext







On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Agreed.

BTW I don’t want to stall progress, but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark, though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?

So 2 questions:
1) what does the inter-language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?

On May 20, 2016, at 9:54 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

Dmitriy really nailed it on the head in his reply to the post, which I'll
rebroadcast below. In essence, the whole reason you are (theoretically)
using Mahout is that the data is too big to fit in memory. If it's too big
to fit in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels).   For the example I
randomly sampled a matrix.
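The sampling step can be sketched in plain Scala. This stand-alone version over an in-memory Seq is only illustrative; over a real DRM, Mahout's drmSampleKRows (linked earlier in the thread) does the distributed equivalent:

```scala
import scala.util.Random

// Illustrative only: sample k rows without replacement before plotting.
// A DRM needs a distributed sampler; this works for in-memory demo data.
def sampleKRows[T](rows: Seq[T], k: Int, seed: Long = 42L): Seq[T] = {
  val rnd = new Random(seed)   // fixed seed keeps the demo reproducible
  rnd.shuffle(rows).take(k)    // shuffle-then-take gives sampling without replacement
}
```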


So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.

For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv-like string of the data we want plotted.

I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
dev@Zeppelin and didn't get any replies. I'm going to take that to mean we
can do it in a way that makes the most sense to Mahout users...

First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.

I have some general ideas on possible approaches to making an honest-mahout
interpreter, but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.


...(2) not sure what is the point of supporting distributed anything. It
is distributed presumably because it is hard to keep it in memory.
Therefore, plotting anything distributed potentially presents 2 problems:
storage space and overplotting due to number of points. The idea is that
we have to work out algorithms that condense big data information into
small plottable information (like density grids, for example, or
histograms)....
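The "condense before plotting" idea can be illustrated with a toy histogram bucketing function in plain Scala (my own sketch, not a Mahout API); the point is that only (bin, count) pairs, not every raw point, have to reach the plotting layer:

```scala
// Toy sketch: bucket values into nBins fixed-width bins so a huge collection
// collapses to at most nBins (binStart, count) pairs for plotting.
// Assumes values is non-empty and not all-equal; illustrative only.
def histogram(values: Seq[Double], nBins: Int): Seq[(Double, Int)] = {
  val lo    = values.min
  val width = (values.max - lo) / nBins
  values
    .map(v => math.min(((v - lo) / width).toInt, nBins - 1)) // clamp the max into the last bin
    .groupBy(identity)
    .map { case (bin, vs) => (lo + bin * width, vs.size) }   // bin start and its count
    .toSeq
    .sortBy(_._1)
}
```

In a distributed setting the same reduction would run engine-side, so only the tiny (bin, count) table crosses into the notebook.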







On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Great job Trevor, we’ll need this detail to smooth out the sharp edges,
and any guidance from you or the Zeppelin community will be a big help.




On May 20, 2016, at 8:13 AM, Shannon Quinn <squ...@gatech.edu> wrote:

Agreed, thoroughly enjoying the blog post.

On 5/19/16 12:01 AM, Andrew Palumbo wrote:

Well done, Trevor!  I've not yet had a chance to try this in zeppelin,
but I just read the blog, which is great!



-------- Original message --------
From: Trevor Grant <trevor.d.gr...@gmail.com>
Date: 05/18/2016 2:44 PM (GMT-05:00)
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Ah thank you.

Fixing now.






On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <ap....@outlook.com> wrote:

Hey Trevor- Just refreshed your readme.  The jar that I mentioned is
actually:

/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

rather than:

/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

(In the spark module that is)
________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Wednesday, May 18, 2016 11:02:43 AM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

ah yes- I remember you pointing that out to me too.

I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*).

I'm going to add that dependency to the readme.md now.

thanks,
tg





On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <ap....@outlook.com> wrote:


Trevor this is very cool- I have not been able to look at it closely yet,
but just a small point: I believe that you'll also need to add the

mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

for things like the classification stats, confusion matrix, and t-digest.

Andy

________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Wednesday, May 18, 2016 10:47:21 AM
To: dev@mahout.apache.org


