Trevor, this is very cool. I have not been able to look at it closely yet, but
just a small point: I believe that you'll also need to add the

mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar

For things like the classification stats, confusion matrix, and t-digest.

Andy

________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Wednesday, May 18, 2016 10:47:21 AM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

I still need to update my readme/env per Pat's comments below; however,
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2

https://github.com/rawkintrevo/mahout-zeppelin

Supposing you have a somewhat recent version of Zeppelin 0.6 with sparkr
support running already, you may import the following raw notes directly
into Zeppelin:

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json

So my thoughts on next steps, which I'm positing only as a starting point
for discussion and which are in no particular order of importance:

- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a tsv
string. (with some sanity, eg a sample of a matrix)
- Figure out with Zeppelin community what deeper integration feels like -
e.g. build-profile vs. tutorial
  - I think the case for making a build-profile is that Zeppelin is first
and foremost a data science tool for non-technical users.
  - If we go that route I'll need some more support finding out what is the
absolute minimum 'bare-bones' mahout we can include, e.g. does the user
have to have mahout installed? To be discussed.
- Add matplotlib (python) "support" -> paragraph showing how to do the same
thing in Python.

The basic deal here is we are:
1) Setting up a standard Zeppelin Spark Interpreter to act like a Mahout
interpreter
    - This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) Doing Mahout things as you normally would
3) Exporting a table to a tsv string, which is passed to a resource pool
   - This could be written to disk if you didn't have Zeppelin
4) Reading the tsv from the resource pool (or disk if you didn't have
Zeppelin) in R (Python soon) and creating a plot with <plot package of your
choice>
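
As a rough illustration of steps 3 and 4, here is a minimal Scala sketch of
the TSV hand-off. Everything in it is an assumption made for illustration:
`TsvPipelineSketch`, `toTsv` and `fromTsv` are hypothetical names, not Mahout
or Zeppelin APIs, and a plain `Seq[Array[Double]]` stands in for a Mahout
matrix. In a Zeppelin Spark paragraph the resulting string would presumably be
handed to the resource pool with something like `z.put("name", tsv)`; that
part is left out so the sketch runs in plain Scala.

```scala
import scala.util.Random

object TsvPipelineSketch {

  // Hypothetical helper (not a Mahout API): render a dense, row-major matrix
  // as a TSV string, keeping at most maxRows randomly sampled rows so that a
  // large matrix is not shipped to the plotting side in full.
  def toTsv(rows: Seq[Array[Double]], header: Seq[String],
            maxRows: Int, seed: Long = 42L): String = {
    val kept: Seq[Array[Double]] =
      if (rows.length <= maxRows) rows
      else {
        // Sample row indices without replacement, then restore row order.
        val idx = new Random(seed).shuffle(rows.indices.toList).take(maxRows).sorted
        idx.map(i => rows(i))
      }
    val lines = kept.map(_.mkString("\t"))
    (header.mkString("\t") +: lines).mkString("\n")
  }

  // Stand-in for the consumer side (R or Python in the real pipeline):
  // parse the TSV string back into a header and numeric rows.
  def fromTsv(tsv: String): (Seq[String], Seq[Array[Double]]) = {
    val lines = tsv.split("\n").toSeq
    val header = lines.head.split("\t").toSeq
    val rows = lines.tail.map(_.split("\t").map(_.toDouble))
    (header, rows)
  }
}
```

The sampling step is the "sanity" part: only a bounded number of rows ever
reaches the plotting side, whatever the size of the original matrix.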

To Pat's point: this is a kind of clumsy pipeline; however, the Zeppelin
wrapper at least makes it *feel* less so.


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Seems like there is plenty to use in ggplot or python but the pipeline is
> a little convoluted (so maybe no need for Angular integration). To get
> graphics out of Mahout it would be nice to not require knowledge of R
> and/or python. Knowing Mahout is already bad enough but I guess the API
> from the Mahout side for plotting could be Scala syntactic sugar. What and
> how this all is installed and setup is the next question.
>
> BTW this is what I use elsewhere (Mahout as a lib to this code)
>
>     "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
>     "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
>     "spark.kryo.referenceTracking": "false",
>     "spark.kryoserializer.buffer": "300m",
>
> AFAIK you will only see whether Kryo is working when you have to serialize a
> Mahout-specific data type like a vector or DRM, something registered with
> Kryo.
>
>
> On May 16, 2016, at 6:18 PM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
>
> As a quick recap- we're trying to leverage Zeppelin for charting.
>
> It seems as though this can be achieved by
> - Adding properties to the Spark Interpreter
> - Adding dependency jars to the spark interpreter
> - importing in a spark paragraph
>
> All seems to be working well, but I've fooled myself into thinking things
> were 'working' before because I wasn't actually integrating. Below I will
> outline the imports/properties; please look them over and tell me if I'm
> theoretically missing anything.
>
> The next phase for me will be
> 1) Convert a matrix to some sort of serializable object that I can easily
> unpack from R
> 2) use Zeppelin's resource buffers to pass the object
> 3) collect the object in an R paragraph, convert it to a dataframe then map
> using ggplot
>
> Once I have a working prototype I will work to add some syntactic sugar to
> prepare the matrix from the scala side and pass it to zeppelin (using resource
> pools so the same functionality can be reused in Flink) and an R library
> containing some functions which will pull the data out of the resource pool
> and spit out a dataframe.
>
> Once it's in a dataframe in R, go nuts with any plotting package you like.
> Likewise, it should be possible to do the same thing with matplotlib and
> python (https://gist.github.com/andershammar/9070e0f6916a0fbda7a5)
>
> None of this necessarily requires changing the Zeppelin source code, and
> it isn't very intrusive or difficult to set up. I'll make a blog post, but
> it's almost a textbook entry-level tutorial on using imports in Zeppelin
> (e.g. a tutorial would be just as at home on the Zeppelin site as it would
> on the Mahout site).
>
> Now, there has been some talk of using Zeppelin's angularJS.  Things get a
> little more hairy in that case, but we could make an optional build profile
> that would make zeppelin recognize matrices as tables and expose all of the
> built in charting features of Zeppelin.
>
> If you're not adding a bunch of custom charts to Zeppelin (which would be
> somewhat tedious), you're going to end up with a lot of examples where you
> create a table in Mahout/Spark pass it to AngularJS then some AngularJS
> code charts it for you.  At that point, however, you're doing just as much
> work, if not more, than it would take to simply pass to R or Python and let
> ggplot or matplotlib do the work for you.
>
> Finally, I haven't run into any errors yet using Kryo (which in part is
> what makes me fear I'm not doing this right... it was too easy...) If
> anything seems redundant or missing, please call it out.
>
> Add Properties to Spark interp:
>
> spark.kryo.registrator    org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
> spark.serializer          org.apache.spark.serializer.KryoSerializer
>
> Add artifacts (need to change these to maven not local, also need to
> add/change one jar per below, however this does run):
>
>
> /home/trevor/.m2/repository/org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
>
> /home/trevor/.m2/repository/org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
>
> /home/trevor/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
>
> /home/trevor/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
>
> Add following code to first paragraph of notebook:
> ```
> %spark
> import org.apache.mahout.math._
> import org.apache.mahout.math.scalabindings._
> import org.apache.mahout.math.drm._
> import org.apache.mahout.math.scalabindings.RLikeOps._
> import org.apache.mahout.math.drm.RLikeDrmOps._
> import org.apache.mahout.sparkbindings._
>
> implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
> sc2sdc(sc)
> ```
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Mon, May 16, 2016 at 6:42 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> > Creating an mc used to do some Kryo setup, like registering serializers or
> > serializer factories IIRC. Also there is the Spark conf for allocating
> > memory for the Kryo buffer. Look at the mc creation code in the Spark
> > package helpers. All can be done in straight Spark and passed in to
> > create the mc when needed. Again from old weak brain cells but I think that
> > is part of what makes the Mahout shell different than the Spark shell plus
> > imports, it auto-creates the mc instead of or along with an sc.
> >
> > When I get back to my computer I can check.
> >
> > On May 16, 2016, at 3:40 PM, Andrew Palumbo <ap....@outlook.com> wrote:
> >
> > Trevor,
> >
> > Could you post any kryo errors that you may be having?
> >
> > ________________________________
> > From: Andrew Palumbo <ap....@outlook.com>
> > Sent: Monday, May 16, 2016 6:25:07 PM
> > To: mahout
> > Subject: Future Mahout - Zeppelin work
> >
> >
> >
> >
> > To Dmitriy's point, I agree ggplot is def the priority. The Mahout plots
> > are at this point really just a POC, but at some point we may want
> > to integrate some data transformation features into the Mahout plots
> > classes, so they're really more future work.
> >
> >
> > long story short:
> >
> >
> >> OK. I'll read through the examples and try to do something with some
> > data, then do a ggplot and/or an angular plot on it (probably ggplot).
> >
> >> I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
> > issue about whether we want to go ahead and add another interpreter.
> >
> >
> > Sounds great.
> >
> >
> > Thank you.
> >
> > ________________________________
> > From: Trevor Grant <trevor.d.gr...@gmail.com>
> > Sent: Monday, May 16, 2016 5:49:17 PM
> > To: Dmitriy Lyubimov
> > Cc: Andrew Palumbo; Pat Ferrel; Suneel Marthi
> > Subject: Re: Intro - Future Mahout - Zeppelin work
> >
> > I just signed up for dev, should i just reply all and cc dev or start a
> > new thread?
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > "Fortunate is he, who is able to know the causes of things."  -Virgil
> >
> >
> > On Mon, May 16, 2016 at 4:46 PM, Dmitriy Lyubimov <dlie...@gmail.com
> > <mailto:dlie...@gmail.com>> wrote:
> > fwiw ggplot2 is pretty darn advanced:) i am a bit skeptical smile would
> > have something that ggplot2 would not, the other way around is much more
> > expected by me:)
> >
> > anyhow if ggplot2 and matplotlib are available in Zeppelin without major
> > limitations, it sounds like Zeppelin should be an all around very nice
> > venue then.
> >
> > On Mon, May 16, 2016 at 2:42 PM, Andrew Palumbo <ap....@outlook.com
> > <mailto:ap....@outlook.com>> wrote:
> >
> > yeah we should probably move this over to dev@
> >
> >
> > sorry- answering a question from a couple emails back on the thread.
> >
> >
> > If possible,  I think it would be great to eventually have both (native
> > mahout/smile plots and ggplot), since in the future we're going to be
> > adding more visualization features rather than simple scatter plots etc
> > that may not be covered by ggplot.
> >
> >
> > That's why we were thinking about using angular and the pngs.
> >
> >
> > But what you're saying in your last email would be great!
> >
> >
> > Thank you!
> >
> >
> > ________________________________
> > From: Trevor Grant <trevor.d.gr...@gmail.com<mailto:
> > trevor.d.gr...@gmail.com>>
> > Sent: Monday, May 16, 2016 5:33:12 PM
> > To: Andrew Palumbo
> > Cc: Pat Ferrel; Suneel Marthi; Dmitriy Lyubimov
> >
> > Subject: Re: Intro - Future Mahout - Zeppelin work
> >
> > I somehow replied to your last email without seeing it...
> >
> > OK. I'll read through the examples and try to do something with some data,
> > then do a ggplot and/or an angular plot on it (probably ggplot).
> >
> > I'll do a quick tutorial. Then I'll reopen discussion on that Zeppelin
> > issue about whether we want to go ahead and add another interpreter.
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > "Fortunate is he, who is able to know the causes of things."  -Virgil
> >
> >
> > On Mon, May 16, 2016 at 4:26 PM, Trevor Grant <trevor.d.gr...@gmail.com
> > <mailto:trevor.d.gr...@gmail.com>> wrote:
> > sorry for double email but are you thinking visualization should be a
> > library internal to mahout or should we leverage zeppelins visualization
> > capabilities?
> >
> > Also, should we move this discussion to dev?
> >
> > tg
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > "Fortunate is he, who is able to know the causes of things."  -Virgil
> >
> >
> > On Mon, May 16, 2016 at 4:14 PM, Andrew Palumbo <ap....@outlook.com
> > <mailto:ap....@outlook.com>> wrote:
> >
> > Sorry, to be a little more clear: part of what we're trying to do is to get
> > the new plotting features integrated with Zeppelin. We plan on adding more
> > advanced plotting.
> >
> >
> > ________________________________
> > From: Andrew Palumbo <ap....@outlook.com<mailto:ap....@outlook.com>>
> > Sent: Monday, May 16, 2016 5:04:49 PM
> > To: Pat Ferrel; Trevor Grant
> > Cc: Suneel Marthi; Dmitriy Lyubimov
> > Subject: Re: Intro - Future Mahout - Zeppelin work
> >
> >
> > Awesome!
> >
> >
> > Most of the hard work was done by Dmitriy; I've just reworked it a
> > couple of times to keep up with Spark's refactoring.
> >
> >
> > I think that you will also need to include:
> >
> >
> >   mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
> >
> >
> > For the new plotting features that we're working on.
> >
> >
> > the plotting is still a work in progress, and the grid and surface plots
> > are not working properly.  The plots are swing based and can currently be
> > exported as  PNGs.  There are a few examples on the closed PR:
> > https://github.com/apache/mahout/pull/230
> >
> >
> > There is an example script in examples/bin/spark-shell-plot.mscala
> > (committed to master):
> >
> > https://github.com/apache/mahout/blob/master/examples/bin/spark-shell-plot.mscala
> >
> >
> > Thanks!
> >
> >
> >
> > ________________________________
> > From: Pat Ferrel <p...@occamsmachete.com<mailto:p...@occamsmachete.com>>
> > Sent: Monday, May 16, 2016 4:54:15 PM
> > To: Trevor Grant
> > Cc: Andrew Palumbo; Suneel Marthi; Dmitriy Lyubimov
> > Subject: Re: Intro - Future Mahout - Zeppelin work
> >
> > This is only the beginning. Andy has been using Smile as a visualization
> > lib since it is pretty rich in ML support. We are looking at integrating
> > some of that with Zeppelin then adding code to feed the new visualizations
> > in Mahout. I’m here because I’m fairly familiar with AngularJS if that’s
> > the way to go. Smile is swing based but can output pngs, maybe other image
> > formats. Andy?
> >
> > BTW Dmitriy is still very involved but has trouble getting permission to
> > donate code.
> >
> >
> > On May 16, 2016, at 1:45 PM, Trevor Grant <trevor.d.gr...@gmail.com
> > <mailto:trevor.d.gr...@gmail.com>> wrote:
> >
> > Hey Andrew,
> >
> > thanks- you basically did all of the hard work for me!
> >
> > I've got the linear regression example working from:
> > http://mahout.apache.org/users/sparkbindings/play-with-shell.html
> >
> > My Java is sketchy at best; I tend to over-import. I pulled in the
> > following jars:
> >
> > org/apache/mahout/mahout-math/0.12.1-SNAPSHOT/mahout-math-0.12.1-SNAPSHOT.jar
> >
> > org/apache/mahout/mahout-math-scala_2.10/0.12.1-SNAPSHOT/mahout-math-scala_2.10-0.12.1-SNAPSHOT.jar
> >
> > org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT.jar
> >
> > org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark-shell_2.10-0.12.1-SNAPSHOT.jar
> >
> > I think those are all necessary...  should I be pulling in more?
> >
> > I hate to say it (but will do so bc this isn't public) this integration is
> > super easy from a user perspective, almost too easy, e.g. why not let the
> > user add it themselves...  Add the appropriate maven artifacts, restart the
> > interpreter and run the following in a notebook:
> > ```
> > import org.apache.mahout.math._
> > import org.apache.mahout.math.scalabindings._
> > import org.apache.mahout.math.drm._
> > import org.apache.mahout.math.scalabindings.RLikeOps._
> > import org.apache.mahout.math.drm.RLikeDrmOps._
> > import org.apache.mahout.sparkbindings._
> >
> > implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext
> > = sc2sdc(sc)
> > ```
> > Then whatever code you want and you're off to the races...
> >
> > that said, adding a build profile like -PsparkMahout and creating an
> > interpreter like %spark.mahout should be fairly straightforward.
> >
> > Second question, do you have an example that would be more 'visualization
> > friendly'? I could pass the results to Angular or R just to show off how to
> > do it.
> >
> > Which leads back to the question, is this even worth building a full
> > interpreter for or just make a really nice blog post with examples on how
> > to integrate with R...?
> >
> >
> >
> >
> >
> >
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org<http://trevorgrant.org/>
> >
> > "Fortunate is he, who is able to know the causes of things."  -Virgil
> >
> >
> > On Mon, May 16, 2016 at 2:09 PM, Andrew Palumbo <ap....@outlook.com
> > <mailto:ap....@outlook.com>> wrote:
> > Hi Trevor, welcome!
> >
> > It's great to have you helping out, thanks very much.  I've done a good
> > amount of work on our mahout spark shell, so let me know if you have any
> > questions about what we did there.
> >
> > Thanks a lot!
> >
> > Andy
> >
> >
> > -------- Original message --------
> > From: Suneel Marthi <smar...@apache.org<mailto:smar...@apache.org>>
> > Date: 05/16/2016 2:44 PM (GMT-05:00)
> > To: Trevor Grant <trevor.d.gr...@gmail.com<mailto:trevor.d.gr...@gmail.com>>
> > Cc: Suneel Marthi <smar...@apache.org<mailto:smar...@apache.org>>, Pat
> > Ferrel <p...@occamsmachete.com<mailto:p...@occamsmachete.com>>, Andrew
> > Palumbo <ap....@outlook.com<mailto:ap....@outlook.com>>
> > Subject: Re: Intro - Future Mahout - Zeppelin work
> >
> > Oh yes, he's around. I see him online.
> >
> > On Mon, May 16, 2016 at 2:42 PM, Trevor Grant <trevor.d.gr...@gmail.com
> > <mailto:trevor.d.gr...@gmail.com>> wrote:
> > Is Dmitriy Lyubimov still around?
> >
> > Looks like he created this issue for Zeppelin a while ago. (The old lost
> > code to which you were referring?)
> >
> > https://issues.apache.org/jira/browse/ZEPPELIN-116
> >
> >
> > tg
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org<http://trevorgrant.org/>
> >
> > "Fortunate is he, who is able to know the causes of things."  -Virgil
> >
> >
> > On Mon, May 16, 2016 at 1:37 PM, Suneel Marthi <smar...@apache.org
> > <mailto:smar...@apache.org>> wrote:
> > Welcome to the party TG !!
> >
> > On Mon, May 16, 2016 at 2:28 PM, Trevor Grant <trevor.d.gr...@gmail.com
> > <mailto:trevor.d.gr...@gmail.com>> wrote:
> > Hey all,
> >
> > I'm excited for a chance to help out.  I'm actually getting ready to
> > download now and start playing around.
> >
> > I had talked about this briefly, but given a properly functioning
> > Zeppelin interpreter for Apache Mahout, one could leverage all of the
> > Zeppelin visualizations, anything in AngularJS, or anything in R (through
> > clever use of Zeppelin's Resource Pools).
> >
> > I'll work on getting logged in to the slack channel as well.
> >
> > Nice to meet you all, looking forward to helping out!
> >
> > tg
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org<http://trevorgrant.org/>
> >
> > "Fortunate is he, who is able to know the causes of things."  -Virgil
> >
> >
> > On Sun, May 15, 2016 at 12:56 PM, Suneel Marthi <smar...@apache.org
> > <mailto:smar...@apache.org>> wrote:
> > FYI...
> > Trevor was there for my talk, so he has some idea of Mahout Samsara.
> >
> > On Sun, May 15, 2016 at 1:51 PM, Pat Ferrel <p...@occamsmachete.com
> > <mailto:p...@occamsmachete.com>> wrote:
> > Hey Trevor,
> >
> > Good to meet you. As you probably know Mahout-Samsara is a reincarnation
> > of the project in a new body, which is less a collection of algorithms than
> > a roll-your-own math/algorithm tool. The major benefit is that during
> > experimentation and later in production the code is by nature scalable on
> > Spark and Flink. Most of the Mahout DSL is R-like and supports tensor math
> > but we are now looking at streaming online algo support too.
> >
> > In any case you probably know we have a Mahout version of the Spark Shell,
> > which has been integrated with an old version of Zeppelin (code is lost).
> > Recently Andy has experimented with some very nice visualizations of ML
> > data (not just analytics data). We as a project are interested in Zeppelin
> > integration of our shell and graphics. From what I understand the graphics
> > extension mechanism of Zeppelin is based on AngularJS, which I have some
> > experience with.
> >
> > So, we’d like to start the conversation about how to proceed. We would
> > love some help but will move ahead in any case.
> >
> > Pat
> >
> >
> > On May 15, 2016, at 9:52 AM, Suneel Marthi <smar...@apache.org<mailto:
> > smar...@apache.org>> wrote:
> >
> > Hi Trevor,
> >
> > Nice meeting u last week in Vancouver.  Per our conversation, I wanted to
> > introduce u to Andrew Palumbo (Mahout Chair) and Pat Ferrel (Mahout PMC).
> >
> > As I mentioned in my talk, we are actively looking at Zeppelin integration
> > with Mahout (primarily for spark) and would appreciate your help (as also
> > all things DL and ML).
> >
> > We definitely can use all your help as we r revamping the Mahout project
> > and shedding its legacy MapReduce image.
> >
> > I sent u an invite to the Mahout slack channel, mahout.apache.org
> > (http://mahout.apache.org/) - that's where we all hang out without having
> > to worry about avoiding naughty words.
> >
> > Looking forward to working with you
> >
> > Suneel
> >
> >
>
>
