duplicating efforts.
Two things:
1- The blog post referenced the linear-regression example notebook
twice. I've updated it to reference the ggplot integration. E.g. import
this note:
https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
(I still need to update the post with a blurb about sampling; however,
it is done in that note...) So to anyone who tried the blog, a huge
apology, because that notebook is where all of the 'magic happened'
(all of the screenshots / ggplots / etc. happened there).
2- I have a working prototype of the Zeppelin integration: the
'mahout-terp' branch of:
https://github.com/rawkintrevo/incubator-zeppelin
If you build it and set 'spark.mahout' to 'true' in the Spark
Interpreter properties, you have a Mahout interpreter. This is the
minimally invasive way to do it. I'll be opening a PR soon; we'll see
what the gang over at Zeppelin says.
I'll still need docs and an example notebook, but I'm waiting to make
sure I don't need to do a major refactor before I get carried away with
those activities.
In essence, when 'spark.mahout' is 'true' you jump right into the
R-like DSL and you have an sdc declared based on the underlying sc.
I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Sun, May 29, 2016 at 10:49 AM, Andrew Palumbo <ap....@outlook.com>
wrote:
Thx Trevor,
Re: m-1854, it was something that we started when we were first
discussing using the smile plots and trying to pipe them over to
Zeppelin. As far as I know, no progress was made on it. I've
unassigned it. Feel free to assign any Jiras to yourself. I think that
m-1854 is similar to the mahout-spark-shell, so I may be able to help
out there.
________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Saturday, May 28, 2016 11:21:44 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work
Created a subtask on 1855 for tsv strings.
Looking at 1854 assigned to Pat Ferrel, what's your progress to date?
How can I help?
tg
On Thu, May 26, 2016 at 2:34 PM, Andrew Palumbo <ap....@outlook.com>
wrote:
Great!
When you free up and have the time, could you create some Jiras for
these? We actually have MAHOUT-1852 open for Histograms already, and
MAHOUT-1854 and MAHOUT-1855 (early Zeppelin integration Jiras). I can
close m-1854 and m-1855 out and we can start new ones if they're not
relevant anymore, or we can just go with those.
Thanks
________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Thursday, May 26, 2016 3:17:22 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work
Short answer: it is high priority. I think it will be a Mahout
interpreter in Zeppelin, and given that plans are on hold for a
Flink-Mahout in the short term, I think it should be a piggy-back Spark
interpreter (e.g. exposed through something like %spark.mahout). So I
have thoughts, but no plan. Been busy with a couple of other
commitments.
On the Mahout side we need:
- A function that will convert small matrices into TSV strings
- Convenience functions for sampling super-large matrices into things
  like histograms, etc., that one would want to plot. I.e. histogram
  bucketing? (less important for the moment)
On the Zeppelin side we need:
- An interpreter.
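A rough sketch of the first Mahout-side item (small matrix to TSV
string), written in plain Python rather than the Mahout Scala DSL so it
runs with no dependencies. The name `matrix_to_tsv` and the optional
header argument are made up for illustration; this is not an existing
Mahout or Zeppelin API.

```python
def matrix_to_tsv(rows, header=None):
    """Render a small in-memory matrix (a list of numeric rows) as a TSV string.

    Hypothetical helper: columns are separated by tabs, rows by newlines.
    """
    lines = []
    if header is not None:
        # Optional first line with the column names.
        lines.append("\t".join(header))
    for row in rows:
        # repr(float(v)) keeps full precision, e.g. 1 -> '1.0'.
        lines.append("\t".join(repr(float(v)) for v in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(matrix_to_tsv([[1, 2], [3.5, 4]], header=["x", "y"]))
```

Zeppelin's display system treats paragraph output that begins with
`%table` followed by tab-separated rows as a table, so a string like
this could be printed right after a `%table` line.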
On Thu, May 26, 2016 at 1:22 PM, Suneel Marthi <smar...@apache.org>
wrote:
While on this subject, do we have a plan yet for integrating Zeppelin
into Mahout, or (the converse) for having a Mahout-specific interpreter
for Zeppelin? I think that should be high priority in the short term.
On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:
Ahh, like the "Sample From Matrix" paragraph in the notebook.
Yea, that seems like a good add. If not this afternoon, I'll include it
Saturday.
On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <ap....@outlook.com>
wrote:
Trevor, I was reading over your blog last night again, first time since
you updated. It is great!
I have one suggestion: adding in a code line on how the sampling of the
DRM -> in-core Matrix is done:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
e.g. something like:
val mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
Maybe you omitted this intentionally?
Andy
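For anyone who wants the idea without a Mahout install, here is a
plain-Python analogue of sampling k rows from a matrix, with or without
replacement. It is only an in-memory illustration of what the arguments
mean, not how Mahout does it on a distributed matrix; `sample_k_rows`
and its `seed` argument are made up for this sketch.

```python
import random


def sample_k_rows(rows, k, replacement=False, seed=None):
    """Return k rows picked uniformly at random from an in-memory matrix.

    Hypothetical helper. Without replacement, at most len(rows) rows
    are returned and no row is picked twice.
    """
    rng = random.Random(seed)
    n = len(rows)
    if n == 0:
        return []
    if replacement:
        # Each draw is independent, so the same row may appear repeatedly.
        idx = [rng.randrange(n) for _ in range(k)]
    else:
        # random.sample never repeats an index.
        idx = rng.sample(range(n), min(k, n))
    return [rows[i] for i in idx]


if __name__ == "__main__":
    matrix = [[i, i * i] for i in range(10000)]
    print(len(sample_k_rows(matrix, 1000, replacement=False, seed=1)))
```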
________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Friday, May 20, 2016 7:56:20 PM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work
Unfortunately Zeppelin dev has been so rapid that 0.6-SNAPSHOT as a
version is uninformative to me. I'd say, if possible, your first
troubleshooting measure would be to re-clone or do a "git fetch
upstream" to get up to the very latest.
Sorry for the delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" <andrew.mussel...@gmail.com>
wrote:
Trevor, my zeppelin source is at this version:
<groupId>org.apache.zeppelin</groupId>
<artifactId>zeppelin</artifactId>
<packaging>pom</packaging>
<version>0.6.0-incubating-SNAPSHOT</version>
<name>Zeppelin</name>
<description>Zeppelin project</description>
<url>http://zeppelin.incubator.apache.org/</url>
And yes, you're right, the artifacts weren't added to the dependencies;
is that a feature in more modern zep?
On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:
No parentheses:
import org.apache.mahout.sparkbindings._
....
val myRdd = myDrm.rdd
On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <smar...@apache.org>
wrote:
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:
Hey Pat,
If you spit out a TSV, you can import it into pyspark / matplotlib from
the resource pool in essentially the same way and use that plotting
library if you prefer. In fact you could import the tsv into pandas and
use all of the pandas plotting as well (though I think it is, for the
most part, also matplotlib with some convenience functions).
https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
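To make the Python side of that concrete, here is a small sketch that
parses a TSV string back into named numeric columns using only the
standard library. With pandas installed, `pandas.read_csv` with
`sep="\t"` would do the same job; `tsv_to_columns` is a made-up name,
and the sketch assumes the first line is a header and every cell is
numeric.

```python
import csv
import io


def tsv_to_columns(tsv_text):
    """Parse a TSV string with a header line into {column name: [floats]}.

    Hypothetical helper; assumes every data cell parses as a float.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, None)
    if header is None:
        # Empty input: no header, no columns.
        return {}
    columns = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            columns[name].append(float(cell))
    return columns


if __name__ == "__main__":
    cols = tsv_to_columns("x\ty\n1.0\t2.0\n3.5\t4.0")
    print(cols["x"], cols["y"])
```

The resulting lists can be handed straight to a plotting call that
takes x and y sequences.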
In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql,
and scala-spark all share the same spark context, so you can create
RDDs in one language and access them / work on them in another (so I
understand).
So in Mahout can you "save" a matrix as an RDD? E.g. something like
val myRDD = myDRM.asRDD()
val myRDD = myDRM.rdd()
And would 'myRDD' then exist in the spark context?
yes it will be in sparkContext
On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <p...@occamsmachete.com>
wrote:
Agreed.
BTW I don’t want to stall progress, but being the most ignorant of plot
libs, I’ll ask if we should consider python and matplotlib. In another
project we use python because of the RDD support on Spark, though the
visualizations are extremely limited in our case. If we can pass an RDD
to pyspark it would allow custom reductions in python before plotting,
even though we will support many natively in Mahout. I’m guessing that
this would cross a context boundary and require a write to disk?
So 2 questions:
1) what does the inter-language support look like with Spark python vs
SparkR, can we transfer RDDs?
2) are the plot libs significantly different?
On May 20, 2016, at 9:54 AM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:
Dmitriy really hit the nail on the head in his reply to the post, which
I'll rebroadcast below. In essence the whole reason you are
(theoretically) using Mahout is that the data is too big to fit in
memory. If it's too big to fit in memory, well then it's probably too
big to plot each point (e.g. trillions of rows; you only have so many
pixels). For the example I randomly sampled a matrix.
So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.
For the Zeppelin-plotting thing, we need to have a function that will
spit out a tsv-like string of the data we want plotted.
I agree an honest Mahout interpreter in Zeppelin is probably worth
doing. There are a couple of ways to go about it. I opened up the
discussion on dev@Zeppelin and didn't get any replies. I'm going to
take that to mean we can do it in a way that makes the most sense to
Mahout users...
First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
I have some general ideas on possible approaches to making an
honest-mahout interpreter, but I want to play in the code and look at
the Flink-Mahout shell a bit before I try to organize my thoughts and
present them.
...(2) not sure what is the point of supporting distributed anything.
It is distributed presumably because it is hard to keep it in memory.
Therefore, plotting anything distributed potentially presents 2
problems: storage space and overplotting due to number of points. The
idea is that we have to work out algorithms that condense big data
information into small plottable information (like density grids, for
example, or histograms)....
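As a rough illustration of that "condense before plotting" idea, the
sketch below buckets an arbitrarily long stream of values into a fixed
number of histogram bins in one pass, and merges per-partition counts
the way a distributed job might. It is plain Python, not Mahout code;
`histogram` and `merge_counts` are made-up names, and the bin range
[lo, hi] is assumed to be known up front.

```python
def histogram(values, lo, hi, bins):
    """Count values into `bins` equal-width buckets over [lo, hi].

    Hypothetical helper. Values outside the range are ignored; the
    value `hi` itself lands in the last bucket.
    """
    if bins <= 0 or hi <= lo:
        raise ValueError("need bins > 0 and hi > lo")
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:  # single pass, so `values` may be a generator
        if v < lo or v > hi:
            continue
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts


def merge_counts(a, b):
    """Combine two histograms built with the same lo, hi and bins."""
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    part1 = histogram((x / 10 for x in range(50)), 0, 10, 5)
    part2 = histogram((x / 10 for x in range(50, 101)), 0, 10, 5)
    print(merge_counts(part1, part2))
```

However many points go in, only `bins` numbers come out, which is
small enough to turn into a TSV string and plot.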
On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <p...@occamsmachete.com>
wrote:
Great job Trevor, we’ll need this detail to smooth out the sharp edges,
and any guidance from you or the Zeppelin community will be a big help.
On May 20, 2016, at 8:13 AM, Shannon Quinn <squ...@gatech.edu> wrote:
Agreed, thoroughly enjoying the blog post.
On 5/19/16 12:01 AM, Andrew Palumbo wrote:
Well done, Trevor! I've not yet had a chance to try this in zeppelin
but I just read the blog, which is great!
-------- Original message --------
From: Trevor Grant <trevor.d.gr...@gmail.com>
Date: 05/18/2016 2:44 PM (GMT-05:00)
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work
Ah thank you.
Fixing now.
On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <ap....@outlook.com>
wrote:
Hey Trevor- Just refreshed your readme. The jar that I mentioned is
actually:
/home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
rather than:
/home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
(In the spark module that is)
________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Wednesday, May 18, 2016 11:02:43 AM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work
ah yes- I remember you pointing that out to me too.
I got sidetracked yesterday for most of the day on an adventure in
getting Zeppelin to work right after I accidentally updated to the new
snapshot (free hint: the secret was to clear my cache *face-palm*)
I'm going to add that dependency to the readme.md now.
thanks,
tg
On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <ap....@outlook.com>
wrote:
Trevor this is very cool- I have not been able to look at it closely
yet but just a small point: I believe that you'll also need to add the
mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
for things like the classification stats, confusion matrix, and
t-digest.
Andy
________________________________________
From: Trevor Grant <trevor.d.gr...@gmail.com>
Sent: Wednesday, May 18, 2016 10:47:21 AM
To: dev@mahout.apache.org