+1, piggybacking sounds reasonable and like a quick win.

On 01/06/16 18:17, Trevor Grant wrote:
Hey Eric,

The 'piggyback' or 'patch' approach is a lot easier and less invasive to
implement in practice, and has the Zeppelin community blessing.

When the Flink version comes online, it will also be super easy to
replicate the effort.  And even doing two (or more) 'piggybacks' will be
easier to maintain than one stand-alone Mahout interpreter.  Also,
'piggybacking' opens up the possibility of sharing between contexts,
minimizes user configuration, etc.

The differential is about 20 new lines of code for a piggyback on any
underlying engine, vs. about 300 lines of code for a stand-alone
interpreter which must be kept up to date with its Spark/Flink
counterparts.

Philosophically the stand-alone makes sense, practically the piggyback
does. *shruggie*

It is possible that somewhere down the road we'll refactor the
piggyback(s) into a stand-alone interpreter, at which point none of the current
effort will be wasted; it will just be moving some code around.  So the
other advantage of the piggyback is that it quickly fields a minimum viable
product, without having to pay much for it later on down the road.

This is in part due to the way Zeppelin implemented its interpreters which
involves a lot of code repetition.

I'm open to further discussion, but after playing in the Zeppelin code for
a while and really grokking different approaches, I think this one is best. I
do invite critiques because I believe I have considered most angles and can
properly defend the current path, and if there is something I haven't
thought of, I'd rather it be brought to light sooner than later.

tg


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Wed, Jun 1, 2016 at 11:00 AM, Eric Charles <e...@apache.org> wrote:

Hi Suneel, an independent interpreter makes sense as Mahout is supposed to run on
various backends, so not only Spark.

Yes, I am following the Mahout mailing list (and not abroad this year - this
may change in the future).

On 30/05/16 05:47, Suneel Marthi wrote:

Hi Eric,

We're talking about the same PR, which is a tweak of the existing Spark-Zeppelin
interpreter. What we're looking at is a specific Mahout-Spark-Zeppelin interpreter that
is independent of the above.

BTW Eric, nice to see you on the Mahout mailing lists; you didn't make it to
Vancouver this time?

On Sun, May 29, 2016 at 10:57 PM, Eric Charles <e...@apache.org> wrote:

Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?

https://github.com/apache/incubator-zeppelin/pull/928

It declares in the spark interpreter the mahout deps, and creates the sdc
(spark distributed context).

On 29/05/16 19:16, Suneel Marthi wrote:

On Sun, May 29, 2016 at 12:07 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
duplicating efforts.

Two things:

1- The blog post referenced the linear-regression example notebook twice -
I've updated it to reference the ggplot integration. E.g. import this
note:

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json

(I still need to update with a blurb about sampling, however it is done in
that note...) So to any who tried the blog, a huge apology, because that
notebook is where all of the 'magic happened' (all of the screenshots /
gg-plots / etc. happened there).

2- I have a working prototype of the Zeppelin integration: the
'mahout-terp' branch of
https://github.com/rawkintrevo/incubator-zeppelin.
If you build, and set 'spark.mahout' to 'true' in the Spark interpreter
properties, you have a Mahout interpreter. This is the minimally invasive
way to do it. I'll be opening a PR soon; we'll see what the gang over at
Zeppelin say.
I'll still need docs and an example notebook, but I'm waiting to make sure
I don't need to do a major refactor before I get carried away with those
activities.

In essence when 'spark.mahout' is 'true' you jump right in on the R-like DSL
and you have a sdc declared based on the underlying sc.


I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
interpreter is gonna go down well with the Spark insanity.  I would prefer
having a separate Mahout-Spark-Zeppelin interpreter under the Zeppelin project,
if that's acceptable to the Zeppelin folks, even though most of it might be
repeated.

What do others have to say?


have a good holiday weekend,


tg





On Sun, May 29, 2016 at 10:49 AM, Andrew Palumbo <ap....@outlook.com>
wrote:

Thx Trevor,

Re: m-1854, it was something that we started when we were first discussing
using the Smile plots and trying to pipe them over to Zeppelin. As
far as I know there was no progress started on it; I've unassigned it.

Feel free to assign any Jiras to yourself.  I think that m-1854 is similar
to the mahout-spark-shell, so I may be able to help out there.


________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Saturday, May 28, 2016 11:21:44 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Created a subtask on 1855 for tsv strings.

Looking at 1854 assigned to Pat Ferrel, what's your progress to date? How
can I help?

tg





On Thu, May 26, 2016 at 2:34 PM, Andrew Palumbo <ap....@outlook.com>
wrote:

Great!


When you free up and have the time, could you create some Jiras for these?

We actually have MAHOUT-1852 open for Histograms already, and MAHOUT-1854
and MAHOUT-1855 (early Zeppelin integration Jiras).  I can close m-1854
and m-1855 out and we can start new ones if they're not relevant anymore, or we
can just go with those.

Thanks

________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Thursday, May 26, 2016 3:17:22 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Short answer: it is high priority. I think it will be a Mahout interpreter
into Zeppelin, and given that plans are on hold for a Flink-Mahout in the
short term, I think it should be a piggy-back spark interpreter (e.g.
exposed through something like %spark.mahout).   So I have thoughts, but no
plan.  Been busy with a couple of other commitments.

On the Mahout side we need:
A function that will convert small matrices into TSV strings
Convenience functions for sampling super-large matrices into things like
histograms, etc., that one would want to plot. I.e. histogram bucketing?
(less important for the moment)

On the Zeppelin side we need:
an interpreter.
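The "small matrix -> TSV string" helper could be sketched in plain Scala; the object/method names and the Array[Array[Double]] stand-in for a sampled in-core Mahout Matrix are assumptions for illustration, not the eventual Mahout API:

```scala
// Hypothetical sketch: render a small (already-sampled) in-core matrix as a
// tab-separated string with a header row, which Zeppelin's %table display
// can consume. Names and types here are illustrative only.
object MatrixTsv {
  def toTsv(m: Array[Array[Double]], colNames: Seq[String]): String = {
    val header = colNames.mkString("\t")    // column names become the header row
    val rows   = m.map(_.mkString("\t"))    // one TSV line per matrix row
    (header +: rows).mkString("\n")
  }
}
```

For example, MatrixTsv.toTsv(Array(Array(1.0, 2.0)), Seq("x", "y")) yields a header line "x\ty" followed by "1.0\t2.0".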




On Thu, May 26, 2016 at 1:22 PM, Suneel Marthi <smar...@apache.org> wrote:


While on this subject, do we have a plan yet for integrating Zeppelin into
Mahout (or the converse) of having a Mahout-specific interpreter for
Zeppelin?  I think that should be high priority in the short term.

On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:


Ahh, like the "Sample From Matrix" paragraph in the notebook.

Yea that seems like a good add. If not this afternoon, I'll include it
Saturday.







On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <ap....@outlook.com> wrote:


Trevor, I was reading over your blog last night again- first time since
you updated. It is great!

I have one suggestion: adding in a code line on how the sampling of the
DRM -> in-core Matrix is done:

https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148


e.g. something like:

       val mxSin = drmSampleKRows(drmSin, 1000, replacement = false)

Maybe you omitted this intentionally?

Andy

________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Friday, May 20, 2016 7:56:20 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version is
uninformative to me. I'd say if possible, your first troubleshooting
measure would be to re-clone or do a "git fetch upstream" to get up to the
very latest.

Sorry for delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <andrew.mussel...@gmail.com> wrote:


Trevor, my zeppelin source is at this version:


     <groupId>org.apache.zeppelin</groupId>
     <artifactId>zeppelin</artifactId>
     <packaging>pom</packaging>
     <version>0.6.0-incubating-SNAPSHOT</version>
     <name>Zeppelin</name>
     <description>Zeppelin project</description>
     <url>http://zeppelin.incubator.apache.org/</url>

And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?


On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:


No parentheses.

import o.a.m.sparkbindings._
...
val myRdd = myDrm.rdd

On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <smar...@apache.org> wrote:



On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

Hey Pat,

If you spit out a TSV - you can import into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer.  In fact you could import the tsv into pandas and use all of
the pandas plotting as well (though I think it is, for the most part, also
matplotlib with some convenience functions).

https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u

In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same spark context; you can create RDDs in one
language and access them / work on them in another (so I understand).

So in Mahout can you "save" a matrix as an RDD? e.g. something like

val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()

And would 'myRDD' then exist in the spark context?

yes it will be in sparkContext







On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Agreed.

BTW I don’t want to stall progress, but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark, though the
visualizations are extremely limited in our case. If we can pass an RDD to
pyspark it would allow custom reductions in python before plotting, even
though we will support many natively in Mahout. I’m guessing that this
would cross a context boundary and require a write to disk?

So 2 questions:
1) what does the inter-language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?

On May 20, 2016, at 9:54 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

Dmitriy really nailed it on the head in his reply to the post, which I'll
rebroadcast below. In essence, the whole reason you are (theoretically)
using Mahout is that the data is too big to fit in memory. If it's too big
to fit in memory, well then it's probably too big to plot each point (e.g.
trillions of rows, you only have so many pixels).   For the example I
randomly sampled a matrix.
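The sampling step can be sketched in plain Scala. This stand-alone version over an in-memory Seq is only illustrative; over a real DRM, Mahout's drmSampleKRows (linked earlier in the thread) does the distributed equivalent:

```scala
import scala.util.Random

// Illustrative only: sample k rows without replacement before plotting.
// A DRM needs a distributed sampler; this works for in-memory demo data.
def sampleKRows[T](rows: Seq[T], k: Int, seed: Long = 42L): Seq[T] = {
  val rnd = new Random(seed)   // fixed seed keeps the demo reproducible
  rnd.shuffle(rows).take(k)    // shuffle-then-take gives sampling without replacement
}
```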


So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.

For the Zeppelin-Plotting thing, we need to have a function that will spit
out a tsv-like string of the data we want plotted.

I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
dev@Zeppelin and didn't get any replies. I'm going to take that to mean we
can do it in a way that makes the most sense to Mahout users...

First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.

I have some general ideas on possible approaches to making an honest-mahout
interpreter, but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.


...(2) not sure what is the point of supporting distributed anything. It
is distributed presumably because it is hard to keep it in memory.
Therefore, plotting anything distributed potentially presents 2 problems:
storage space and overplotting due to number of points. The idea is that
we have to work out algorithms that condense big data information into
small plottable information (like density grids, for example, or
histograms)....
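The "condense before plotting" idea can be illustrated with a toy histogram bucketing function in plain Scala (my own sketch, not a Mahout API); the point is that only (bin, count) pairs, not every raw point, have to reach the plotting layer:

```scala
// Toy sketch: bucket values into nBins fixed-width bins so a huge collection
// collapses to at most nBins (binStart, count) pairs for plotting.
// Assumes values is non-empty and not all-equal; illustrative only.
def histogram(values: Seq[Double], nBins: Int): Seq[(Double, Int)] = {
  val lo    = values.min
  val width = (values.max - lo) / nBins
  values
    .map(v => math.min(((v - lo) / width).toInt, nBins - 1)) // clamp the max into the last bin
    .groupBy(identity)
    .map { case (bin, vs) => (lo + bin * width, vs.size) }   // bin start and its count
    .toSeq
    .sortBy(_._1)
}
```

In a distributed setting the same reduction would run engine-side, so only the tiny (bin, count) table crosses into the notebook.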







On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Great job Trevor, we’ll need this detail to smooth out the sharp edges,
and any guidance from you or the Zeppelin community will be a big help.




On May 20, 2016, at 8:13 AM, Shannon Quinn <squ...@gatech.edu> wrote:

Agreed, thoroughly enjoying the blog post.

On 5/19/16 12:01 AM, Andrew Palumbo wrote:

Well done, Trevor!  I've not yet had a chance to try this in zeppelin,
but I just read the blog, which is great!



-------- Original message --------
From: Trevor Grant <trevor.d.gr...@gmail.com>
Date: 05/18/2016 2:44 PM (GMT-05:00)
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

Ah thank you.

Fixing now.






On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <ap....@outlook.com> wrote:

Hey Trevor- Just refreshed your readme.  The jar that I mentioned is
actually:

/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

rather than:

/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

(In the spark module that is)
________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Wednesday, May 18, 2016 11:02:43 AM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

ah yes- I remember you pointing that out to me too.

I got sidetracked yesterday for most of the day on an adventure in getting
Zeppelin to work right after I accidentally updated to the new snapshot
(free hint: the secret was to clear my cache *face-palm*).

I'm going to add that dependency to the readme.md now.

thanks,
tg





On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <ap....@outlook.com> wrote:


Trevor this is very cool- I have not been able to look at it closely yet,
but just a small point: I believe that you'll also need to add the

mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

for things like the classification stats, confusion matrix, and t-digest.

Andy

________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Wednesday, May 18, 2016 10:47:21 AM
To: dev@mahout.apache.org


