Hi Suneel, an independent interpreter makes sense, as Mahout is supposed to run on various backends, not only Spark.

Yes, I am following the Mahout mailing list (and not abroad this year - this may change in the future).

On 30/05/16 05:47, Suneel Marthi wrote:
Hi Eric,

We are talking about the same PR, which is a tweak of the existing Spark-Zeppelin
interpreter. What we are looking at is a specific Mahout-Spark-Zeppelin interpreter
that is independent of the above.

BTW Eric, nice to see you on the Mahout mailing lists; you didn't make it to
Vancouver this time?

On Sun, May 29, 2016 at 10:57 PM, Eric Charles <e...@apache.org> wrote:

Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?

https://github.com/apache/incubator-zeppelin/pull/928

It declares the Mahout dependencies in the Spark interpreter and creates the sdc
(Spark distributed context).
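
If it helps, the gist of what that PR wires up is roughly the following (a minimal
sketch, not the PR's actual code, assuming the Mahout artifacts are already on the
Spark interpreter classpath and sc is Zeppelin's SparkContext):

    // Wrap Zeppelin's existing SparkContext in a Mahout distributed context.
    import org.apache.mahout.math.drm._
    import org.apache.mahout.sparkbindings._

    // sc2sdc is the conversion provided by the sparkbindings package object;
    // the implicit sdc is what Samsara's distributed operators pick up.
    implicit val sdc: SparkDistributedContext = sc2sdc(sc)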

On 29/05/16 19:16, Suneel Marthi wrote:

On Sun, May 29, 2016 at 12:07 PM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:

OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
duplicating efforts.

Two things:

1- The blog post referenced the linear-regression example notebook twice; I've
updated it to reference the ggplot integration, e.g. import this note:

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json

(I still need to update it with a blurb about sampling, however it is done in
that note...) So to anyone who tried the blog, a huge apology, because that
notebook is where all of the 'magic happened' (all of the screenshots /
gg-plots / etc. happened there).

2- I have a working prototype of the Zeppelin integration: the 'mahout-terp' branch of
https://github.com/rawkintrevo/incubator-zeppelin
If you build it and set 'spark.mahout' to 'true' in the Spark Interpreter
properties, you have a Mahout interpreter. This is the minimally invasive
way to do it; I'll be opening a PR soon, and we'll see what the gang over at
Zeppelin says.
I'll still need docs and an example notebook, but I'm waiting to make sure
I don't need to do a major refactor before I get carried away with those
activities.

In essence, when 'spark.mahout' is 'true' you jump right in on the R-like DSL
and you have an sdc declared based on the underlying sc.
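
For illustration (not lifted from the branch), a paragraph in that mode could look
roughly like this; drmParallelize, dense, t, %*% and collect are the standard
Samsara calls, and sdc is already in scope:

    // Hypothetical notebook paragraph with spark.mahout=true.
    val drmA = drmParallelize(dense((1.0, 2.0), (3.0, 4.0)), numPartitions = 2)

    // A' %*% A runs distributed; collect pulls the small result in-core.
    val mxAtA = (drmA.t %*% drmA).collect
    println(mxAtA)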


I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
interpreter is gonna go down well with the Spark insanity.  I would prefer
having a separate Mahout-Spark-Zeppelin interpreter under the Zeppelin project,
if that's acceptable to the Zeppelin folks, even though most of it might be
repeated.

What do others have to say?


have a good holiday weekend,

tg



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Sun, May 29, 2016 at 10:49 AM, Andrew Palumbo <ap....@outlook.com>
wrote:

Thx Trevor,
Re: m-1854, it was something that we started when we were first discussing
using the Smile plots and trying to pipe them over to Zeppelin. As far as I
know no progress was made on it.. I've unassigned it.

Feel free to assign any Jiras to yourself.  I think that m-1854 is similar
to the mahout-spark-shell, so I may be able to help out there.


________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Saturday, May 28, 2016 11:21:44 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Created a subtask on 1855 for TSV strings.

Looking at 1854, assigned to Pat Ferrel: what's your progress to date? How can I help?

tg



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Thu, May 26, 2016 at 2:34 PM, Andrew Palumbo <ap....@outlook.com>
wrote:

Great!

When you free up and have the time, could you create some Jiras for these?

We actually have MAHOUT-1852 open for histograms already, and MAHOUT-1854
and MAHOUT-1855 (early Zeppelin integration Jiras).  I can close m-1854 and
m-1855 out and we can start new ones if they're not relevant anymore, or we
can just go with those.

Thanks

________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Thursday, May 26, 2016 3:17:22 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Short answer: it is high priority. I think it will be a Mahout interpreter
in Zeppelin, and given that plans are on hold for a Flink-Mahout in the
short term, I think it should be a piggy-back Spark interpreter (e.g.
exposed through something like %spark.mahout).   So I have thoughts, but no
plan.  Been busy with a couple of other commitments.

On the Mahout side we need:
A function that will convert small matrices into TSV strings (see the sketch after this list)
Convenience functions for sampling super-large matrices into things like
histograms, etc., that one would want to plot, i.e. histogram bucketing?
(less important for the moment)

On the Zeppelin side we need:
an interpreter.
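
For concreteness, the first item could look something like the sketch below; the
name matrixToTsv and the header convention are placeholders, not a settled API:

    // Hypothetical helper: render a small in-core Mahout Matrix as a TSV string.
    import org.apache.mahout.math.Matrix

    def matrixToTsv(m: Matrix): String = {
      // Placeholder column names c0..cN; a real version would take labels.
      val header = (0 until m.numCols()).map(i => s"c$i").mkString("\t")
      val rows = (0 until m.numRows()).map { r =>
        (0 until m.numCols()).map(c => m.get(r, c)).mkString("\t")
      }
      (header +: rows).mkString("\n")
    }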


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Thu, May 26, 2016 at 1:22 PM, Suneel Marthi <smar...@apache.org> wrote:


While on this subject, do we have a plan yet for integrating Zeppelin into
Mahout (or, conversely, having a Mahout-specific interpreter for Zeppelin)?
I think that should be high priority in the short term.

On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

Ahh, like the "Sample From Matrix" paragraph in the notebook.

Yea, that seems like a good add. If not this afternoon, I'll include it Saturday.


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."

-Virgil*



On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <ap....@outlook.com> wrote:

Trevor, I was reading over your blog again last night - first time since you
updated it. It is great!

I have one suggestion: adding a code line showing how the sampling of the
DRM -> in-core Matrix is done:

https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148

e.g. something like:

      val mxSin = drmSampleKRows(drmSin, 1000, replacement = false)

Maybe you omitted this intentionally?

Andy

________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Friday, May 20, 2016 7:56:20 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Unfortunately Zeppelin dev has been so rapid that 0.6-SNAPSHOT as a version is
uninformative to me. I'd say, if possible, your first troubleshooting measure
would be to re-clone or do a "git fetch upstream" to get up to the very latest.

Sorry for the delayed reply,
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <andrew.mussel...@gmail.com> wrote:

Trevor, my zeppelin source is at this version:

    <groupId>org.apache.zeppelin</groupId>
    <artifactId>zeppelin</artifactId>
    <packaging>pom</packaging>
    <version>0.6.0-incubating-SNAPSHOT</version>
    <name>Zeppelin</name>
    <description>Zeppelin project</description>
    <url>http://zeppelin.incubator.apache.org/</url>

And yes, you're right, the artifacts weren't added to the dependencies; is
that a feature in more modern zep?

On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

No parentheses:

import org.apache.mahout.sparkbindings._
// ....
val myRdd = myDrm.rdd


On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <smar...@apache.org> wrote:

On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

Hey Pat,

If you spit out a TSV, you can import it into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer.  In fact you could import the TSV into pandas and use all of
the pandas plotting as well (though I think it is, for the most part, also
matplotlib with some convenience functions).

https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
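
The Mahout/Scala half of that hand-off could look roughly like this (sketch only:
drmBig is a stand-in for whatever DRM you're plotting, matrixToTsv is the
hypothetical helper sketched earlier in the thread, and z is Zeppelin's
ZeppelinContext):

    // Sample a huge DRM down to something plottable, render it as TSV, and
    // publish it to Zeppelin's resource pool for the pyspark/sparkr paragraphs.
    import org.apache.mahout.math.drm._

    val mxSample = drmSampleKRows(drmBig, 1000, replacement = false)
    val tsvString = matrixToTsv(mxSample)

    // A %pyspark paragraph can read this back with z.get("mahoutTsv") and feed
    // it to pandas / matplotlib; %sparkr could do the same for ggplot2.
    z.put("mahoutTsv", tsvString)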


In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context; you can create RDDs in one
language and access them / work on them in another (so I understand).

So in Mahout, can you "save" a matrix as an RDD? E.g. something like

val myRDD = myDRM.asRDD()

val myRDD = myDRM.rdd()

And would 'myRDD' then exist in the spark context?

yes it will be in sparkContext
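
Something like this should work (a sketch only; myDrm is assumed to be an existing
checkpointed DRM, and drmWrap's optional arguments may differ by version):

    // DRM -> RDD and back, with the sparkbindings implicits in scope.
    import org.apache.mahout.math.drm._
    import org.apache.mahout.sparkbindings._

    // Exposes the underlying (key, row-vector) RDD to plain Spark code.
    val myRdd = myDrm.rdd

    // Wraps a row-vector RDD back into a DRM for Samsara operations.
    val drmAgain = drmWrap(myRdd)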



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of

things."

-Virgil*



On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Agreed.

BTW I don’t want to stall progress but, being the most ignorant of plot libs,
I’ll ask if we should consider python and matplotlib. In another project we
use python because of the RDD support on Spark, though the visualizations are
extremely limited in our case. If we can pass an RDD to pyspark it would allow
custom reductions in python before plotting, even though we will support many
natively in Mahout. I’m guessing that this would cross a context boundary and
require a write to disk?

So 2 questions:
1) what does the inter-language support look like with Spark python vs SparkR, can we transfer RDDs?
2) are the plot libs significantly different?

On May 20, 2016, at 9:54 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

Dmitriy really nailed it on the head in his reply to the post, which I'll
rebroadcast below. In essence, the whole reason you are (theoretically) using
Mahout is that the data is too big to fit in memory. If it's too big to fit in
memory, well then it's probably too big to plot each point (e.g. trillions of
rows, you only have so many pixels).   For the example I randomly sampled a
matrix.

So, as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.

For the Zeppelin-plotting thing, we need to have a function that will spit
out a TSV-like string of the data we want plotted.

I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
dev@Zeppelin and didn't get any replies. I'm going to take that to mean we
can do it in a way that makes the most sense to Mahout users...

First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a TSV string.


I have some general ideas on possible approaches to making an honest-Mahout
interpreter, but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.


...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage space
and overplotting due to number of points. The idea is that we have to work
out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)....
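
To make that concrete, one such condensing function might look like the sketch
below; the name, signature, and fixed-range bucketing policy are all assumptions,
not an existing Mahout API:

    // Hypothetical: condense a huge single-column DRM into nBuckets counts over
    // a known [lo, hi) range, which is tiny and therefore easy to plot.
    import scala.reflect.ClassTag
    import org.apache.mahout.math._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.scalabindings._
    import RLikeOps._
    import RLikeDrmOps._

    def drmHistogram[K: ClassTag](drmX: DrmLike[K], lo: Double, hi: Double, nBuckets: Int): Vector = {
      val width = (hi - lo) / nBuckets
      // Turn each row's value (column 0) into a one-hot bucket-indicator row;
      // mapBlock keeps the row count, so this is still a valid DRM transform.
      val indicators = drmX.mapBlock(ncol = nBuckets) { case (keys, block) =>
        val out = new DenseMatrix(block.nrow, nBuckets)
        for (r <- 0 until block.nrow) {
          val b = math.min(nBuckets - 1, math.max(0, ((block(r, 0) - lo) / width).toInt))
          out(r, b) = 1.0
        }
        keys -> out
      }
      // Column sums of the indicator matrix are the global bucket counts.
      indicators.colSums()
    }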


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of

things."

-Virgil*



On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Great job Trevor, we’ll need this detail to smooth out the sharp edges, and
any guidance from you or the Zeppelin community will be a big help.



On May 20, 2016, at 8:13 AM, Shannon Quinn <squ...@gatech.edu> wrote:


Agreed, thoroughly enjoying the blog post.

On 5/19/16 12:01 AM, Andrew Palumbo wrote:

Well done, Trevor!  I've not yet had a chance to try this in Zeppelin, but I
just read the blog, which is great!


-------- Original message --------
From: Trevor Grant <trevor.d.gr...@gmail.com>
Date: 05/18/2016 2:44 PM (GMT-05:00)
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Ah thank you.

Fixing now.


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of

things."

-Virgil*



On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <ap....@outlook.com> wrote:


Hey Trevor- Just refreshed your readme.  The jar that I mentioned is actually:

/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

rather than:

/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

(In the spark module that is)
________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Wednesday, May 18, 2016 11:02:43 AM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

ah yes- I remember you pointing that out to me too.

I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*).

I'm going to add that dependency to the readme.md now.


thanks,
tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes

of

things."

-Virgil*



On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <ap....@outlook.com> wrote:

Trevor this is very cool- I have not been able to look at it closely yet, but
just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar for things like the
classification stats, confusion matrix, and t-digest.
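
One way to pull that jar into a notebook (assuming the %dep interpreter is enabled;
it has to run before the Spark interpreter starts) would be something like:

    %dep
    z.load("/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar")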


Andy

________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Wednesday, May 18, 2016 10:47:21 AM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

I still need to update my readme/env per Pat's comments below; however,
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2


