Re: Future Mahout - Zeppelin work

Suneel Marthi Sun, 29 May 2016 20:47:57 -0700

Hi Eric,

We are talking about the same PR, which is a tweak of the existing
Spark-Zeppelin interpreter.
What we are looking at is a specific Mahout-Spark-Zeppelin interpreter that
is independent of the above.


BTW Eric, nice to see you on the Mahout mailing lists. You didn't make it to
Vancouver this time?

On Sun, May 29, 2016 at 10:57 PM, Eric Charles <[email protected]> wrote:

> Have you seen [ZEPPELIN-116] Add Mahout Support for Spark Interpreter?
>
> https://github.com/apache/incubator-zeppelin/pull/928
>
> It declares the Mahout deps in the Spark interpreter and creates the sdc
> (Spark distributed context).
>
> On 29/05/16 19:16, Suneel Marthi wrote:
>
>> On Sun, May 29, 2016 at 12:07 PM, Trevor Grant <[email protected]>
>> wrote:
>>
>> OK cool. Just wanted to make sure I wasn't stealing anyone's baby or
>>> duplicating efforts.
>>>
>>> Two things:
>>>
>>> 1- The blog post referenced the linear-regression example notebook twice-
>>> I've updated it to reference the ggplot integration. E.g. import this
>>> note:
>>>
>>>
>>> https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json
>>> (I still need to update with a blurb about sampling, however it is done in
>>> that note...) So to any who tried the blog, a huge apology, because that
>>> notebook is where all of the 'magic happened' (all of the screen shots /
>>> gg-plots / etc. happened there).
>>>
>>> 2- I have a working prototype of the Zeppelin integration:
>>> 'mahout-terp' branch of:
>>> https://github.com/rawkintrevo/incubator-zeppelin
>>> If you build it and set 'spark.mahout' to 'true' in the Spark interpreter
>>> properties, you have a Mahout interpreter. This is the minimally invasive
>>> way to do it. I'll be opening a PR soon; we'll see what the gang over at
>>> Zeppelin say.
>>> I'll still need docs and an example notebook, but I'm waiting to make sure
>>> I don't need to do a major refactor before I get carried away with those
>>> activities.
>>>
>>> In essence, when 'spark.mahout' is 'true' you jump right in on the R-like
>>> DSL and you have an sdc declared based on the underlying sc.
>>>
>>>
>> I am not sure if messing with the very "sacrosanct" Zeppelin-Spark
>> interpreter is gonna go down well with the Spark insanity.  I would prefer
>> having a separate Mahout-Spark-Zeppelin interpreter under the Zeppelin
>> project if that's acceptable to the Zeppelin folks, even though most of it
>> might be repeated.
>>
>> What do others have to say?
>>
>>
>> have a good holiday weekend,
>>>
>>> tg
>>>
>>>
>>>
>>> Trevor Grant
>>> Data Scientist
>>> https://github.com/rawkintrevo
>>> http://stackexchange.com/users/3002022/rawkintrevo
>>> http://trevorgrant.org
>>>
>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>>
>>>
>>> On Sun, May 29, 2016 at 10:49 AM, Andrew Palumbo <[email protected]>
>>> wrote:
>>>
>>> Thx Trevor,
>>>> Re: m-1854, it was something that we started when we were first discussing
>>>> using the smile plots and trying to pipe them over to Zeppelin. As
>>>> far as I know, no progress was made on it. I've unassigned it.
>>>>
>>>> Feel free to assign any Jiras to yourself.  I think that m-1854 is similar
>>>> to the mahout-spark-shell, so I may be able to help out there.
>>>>
>>>>
>>>> ________________________________________
>>>> From: Trevor Grant <[email protected]>
>>>> Sent: Saturday, May 28, 2016 11:21:44 PM
>>>> To: [email protected]
>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>
>>>> Created a subtask on 1855 for tsv strings.
>>>>
>>>> Looking at 1854 assigned to Pat Ferrel, what's your progress to date? How
>>>> can I help?
>>>>
>>>> tg
>>>>
>>>>
>>>>
>>>> Trevor Grant
>>>> Data Scientist
>>>> https://github.com/rawkintrevo
>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>> http://trevorgrant.org
>>>>
>>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>>>
>>>>
>>>> On Thu, May 26, 2016 at 2:34 PM, Andrew Palumbo <[email protected]>
>>>> wrote:
>>>>
>>>> Great!
>>>>>
>>>>> When you free up and have the time, could you create some Jiras for these?
>>>>>
>>>>> We actually have MAHOUT-1852 open for Histograms already, and MAHOUT-1854
>>>>> and MAHOUT-1855 (early Zeppelin integration Jiras).  I can close m-1854 and
>>>>> m-1855 out and we can start new ones if they're not relevant anymore or we
>>>>> can just go with those.
>>>>>
>>>>> Thanks
>>>>>
>>>>> ________________________________________
>>>>> From: Trevor Grant <[email protected]>
>>>>> Sent: Thursday, May 26, 2016 3:17:22 PM
>>>>> To: [email protected]
>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>
>>>>> Short answer: it is high priority. I think it will be a Mahout interpreter
>>>>> into Zeppelin, and given that plans are on hold for a Flink-Mahout in the
>>>>> short term, I think it should be a piggy-back spark interpreter (e.g.
>>>>> exposed through something like %spark.mahout).   So I have thoughts, but no
>>>>> plan.  Been busy with a couple of other commitments.
>>>>>
>>>>> On the Mahout side we need:
>>>>> A function that will convert small matrices into TSV strings
>>>>> Convenience functions for sampling super-large matrices into things like
>>>>> histograms, etc, that one would want to plot. I.e. histogram bucketing?
>>>>> (less important for the moment)
>>>>>
>>>>> On the Zeppelin side we need:
>>>>> an interpreter.
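
[Illustration] A minimal Scala sketch of the "small matrix to TSV string"
helper listed above. This is not Mahout code: TsvSketch and toTsv are
hypothetical names, and a plain Seq[Seq[Double]] stands in for an in-core
matrix. Zeppelin's display system renders paragraph output that starts with
%table, followed by tab-separated rows, as a table, so a TSV string is enough
to feed the built-in charts.

```scala
// Hypothetical helper (not part of Mahout): render a small in-memory matrix
// as a tab-separated string with one header row.
object TsvSketch {
  def toTsv(header: Seq[String], rows: Seq[Seq[Double]]): String = {
    // Every data row must have exactly one value per header column.
    require(rows.forall(_.length == header.length),
      "every row must match the header width")
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    lines.mkString("\n")
  }
}
```

In a Zeppelin Scala paragraph this could be used as
println("%table " + TsvSketch.toTsv(Seq("x", "y"), rows)), assuming rows has
already been sampled down to a plottable size.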
>>>>>
>>>>>
>>>>> Trevor Grant
>>>>> Data Scientist
>>>>> https://github.com/rawkintrevo
>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>> http://trevorgrant.org
>>>>>
>>>>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>>>>
>>>>>
>>>>> On Thu, May 26, 2016 at 1:22 PM, Suneel Marthi <[email protected]>
>>>>>
>>>> wrote:
>>>>
>>>>>
>>>>>> While on this subject, do we have a plan yet for integrating Zeppelin into
>>>>>> Mahout, or (the converse) for having a Mahout-specific interpreter for
>>>>>> Zeppelin?  I think that should be high priority in the short term.
>>>>>>
>>>>>> On Thu, May 26, 2016 at 1:17 PM, Trevor Grant <
>>>>>>
>>>>> [email protected]>
>>>>
>>>>> wrote:
>>>>>>
>>>>>> Ahh, like the "Sample From Matrix" paragraph in the notebook.
>>>>>>>
>>>>>>> Yea that seems like a good add. If not this afternoon, I'll include it
>>>>>>> Saturday.
>>>>>>>
>>>>>>>
>>>>>>> Trevor Grant
>>>>>>> Data Scientist
>>>>>>> https://github.com/rawkintrevo
>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>> http://trevorgrant.org
>>>>>>>
>>>>>>> *"Fortunate is he, who is able to know the causes of things."
>>>>>>>
>>>>>> -Virgil*
>>>>
>>>>>
>>>>>>>
>>>>>>> On Thu, May 26, 2016 at 11:52 AM, Andrew Palumbo <
>>>>>>>
>>>>>> [email protected]
>>>
>>>>
>>>>> wrote:
>>>>>>>
>>>>>>>> Trevor, I was reading over your blog last night again- first time since
>>>>>>>> you updated. It is great!
>>>>>>>>
>>>>>>>> I have one suggestion: adding a code line on how the sampling
>>>>>>>> of the DRM -> in-core Matrix is done:
>>>>>>>>
>>>>>>>> https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L148
>>>>>>>>
>>>>>>>> e.g. something like:
>>>>>>>>
>>>>>>>>      mxSin = drmSampleKRows(drmSin, 1000, replacement = false)
>>>>>>>>
>>>>>>>> Maybe you omitted this intentionally?
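
[Illustration] The drmSampleKRows call above does the sampling inside Mahout.
Purely to show what "draw k rows without replacement" means, here is a
stdlib-only Scala sketch using reservoir sampling. It is an assumption-laden
stand-in, not how Mahout implements drmSampleKRows (this thread does not show
that); SampleSketch and sampleKRows are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical illustration (not Mahout's implementation): reservoir sampling
// keeps k rows chosen uniformly, without replacement, from a row iterator
// whose length is not known in advance.
object SampleSketch {
  def sampleKRows[A](rows: Iterator[A], k: Int, rng: Random): Vector[A] = {
    require(k >= 0, "k must be non-negative")
    val reservoir = ArrayBuffer.empty[A]
    var seen = 0
    rows.foreach { row =>
      seen += 1
      if (reservoir.length < k) {
        reservoir += row              // fill the reservoir with the first k rows
      } else {
        val j = rng.nextInt(seen)     // uniform index in [0, seen)
        if (j < k) reservoir(j) = row // replace an existing row with probability k/seen
      }
    }
    reservoir.toVector
  }
}
```

If the input has fewer than k rows, all of them are returned.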
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> ________________________________________
>>>>>>>> From: Trevor Grant <[email protected]>
>>>>>>>> Sent: Friday, May 20, 2016 7:56:20 PM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>
>>>>>>>> Unfortunately Zeppelin dev has been so rapid, 0.6-SNAPSHOT as a version is
>>>>>>>> uninformative to me. I'd say, if possible, your first troubleshooting
>>>>>>>> measure would be to re-clone or do a "git fetch upstream" to get up to the
>>>>>>>> very latest.
>>>>>>>>
>>>>>>>> Sorry for delayed reply
>>>>>>>> Tg
>>>>>>>> On May 20, 2016 5:36 PM, "Andrew Musselman" <
>>>>>>>>
>>>>>>> [email protected]>
>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Trevor, my zeppelin source is at this version:
>>>>>>>>>
>>>>>>>>>    <groupId>org.apache.zeppelin</groupId>
>>>>>>>>>    <artifactId>zeppelin</artifactId>
>>>>>>>>>    <packaging>pom</packaging>
>>>>>>>>>    <version>0.6.0-incubating-SNAPSHOT</version>
>>>>>>>>>    <name>Zeppelin</name>
>>>>>>>>>    <description>Zeppelin project</description>
>>>>>>>>>    <url>http://zeppelin.incubator.apache.org/</url>
>>>>>>>>>
>>>>>>>>> And yes you're right the artifacts weren't added to the dependencies; is
>>>>>>>>> that a feature in more modern zep?
>>>>>>>>>
>>>>>>>>> On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov <
>>>>>>>>>
>>>>>>>> [email protected]
>>>>>
>>>>>>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> no parenthesis.
>>>>>>>>>>
>>>>>>>>>> import o.a.m.sparkbindings._
>>>>>>>>>> ....
>>>>>>>>>> myRdd = myDrm.rdd
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi <
>>>>>>>>>>
>>>>>>>>> [email protected]
>>>>>
>>>>>>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
>>>>>>>>>>>
>>>>>>>>>> [email protected]>
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hey Pat,
>>>>>>>>>>>>
>>>>>>>>>>>> If you spit out a TSV - you can import into pyspark / matplotlib from the
>>>>>>>>>>>> resource pool in essentially the same way and use that plotting library if
>>>>>>>>>>>> you prefer.  In fact you could import the tsv into pandas and use all of
>>>>>>>>>>>> the pandas plotting as well (though I think it is for the most part also
>>>>>>>>>>>> matplotlib with some convenience functions).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
>>>
>>>>
>>>>>>>>>>>> In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql,
>>>>>>>>>>>> and scala-spark all share the same spark context, so you can create RDDs
>>>>>>>>>>>> in one language and access them / work on them in another (so I
>>>>>>>>>>>> understand).
>>>>>>>>>>>>
>>>>>>>>>>>> So in Mahout can you "save" a matrix as an RDD? e.g. something like
>>>>>>>
>>>>>>>>
>>>>>>>>>>>> val myRDD = myDRM.asRDD()
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> val myRDD = myDRM.rdd()
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> And would 'myRDD' then exist in the spark context?
>>>>>>>>>>>>
>>>>>>>>>>>> yes it will be in sparkContext
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>
>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of
>>>>>>>>>>>>
>>>>>>>>>>> things."
>>>>
>>>>> -Virgil*
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel <
>>>>>>>>>>>>
>>>>>>>>>>> [email protected]>
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Agreed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW I don’t want to stall progress but being the most ignorant of plot
>>>>>>>>>>>>> libs, I’ll ask if we should consider python and matplotlib. In another
>>>>>>>>>>>>> project we use python because of the RDD support on Spark though the
>>>>>>>>>>>>> visualizations are extremely limited in our case. If we can pass an RDD to
>>>>>>>>>>>>> pyspark it would allow custom reductions in python before plotting, even
>>>>>>>>>>>>> though we will support many natively in Mahout. I’m guessing that this
>>>>>>>>>>>>> would cross a context boundary and require a write to disk?
>>>>>>>>>>>>>
>>>>>>>>>>>>> So 2 questions:
>>>>>>>>>>>>> 1) what does the inter language support look like with Spark python vs
>>>>>>>>>>>>> SparkR, can we transfer RDDs?
>>>>>>>>>>>>> 2) are the plot libs significantly different?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On May 20, 2016, at 9:54 AM, Trevor Grant <
>>>>>>>>>>>>>
>>>>>>>>>>>> [email protected]>
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dmitriy really nailed it on the head in his reply to the post, which I'll
>>>>>>>>>>>>> rebroadcast below. In essence the whole reason you are (theoretically)
>>>>>>>>>>>>> using Mahout is the data is too big to fit in memory. If it's too big to
>>>>>>>>>>>>> fit in memory, well then it's probably too big to plot each point (e.g.
>>>>>>>>>>>>> trillions of rows, you only have so many pixels).   For the example I
>>>>>>>>>>>>> randomly sampled a matrix.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So as Dmitriy says, in Mahout we need to have functions that will
>>>>>>>>>>>>> 'preprocess' the data into something plottable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the Zeppelin-Plotting thing, we need to have a function that will spit
>>>>>>>>>>>>> out a tsv like string of the data we wanted plotted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
>>>>>>>>>>>>> There are a couple of ways to go about it. I opened up the discussion on
>>>>>>>>>>>>> dev@Zeppelin and didn't get any replies. I'm going to take that to mean we
>>>>>>>>>>>>> can do it in a way that makes the most sense to Mahout users...
>>>>>>>>>>>>>
>>>>>>>>>>>>> First steps are to include some methods in Mahout that will do that
>>>>>>>>>>>>> preprocessing, and one that will turn something into a tsv string.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have some general ideas on possible approaches to making an
>>>>>>>>>>>>> honest-mahout interpreter but I want to play in the code and look at the
>>>>>>>>>>>>> Flink-Mahout shell a bit before I try to organize my thoughts and present
>>>>>>>>>>>>> them.
>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>>> ...(2) not sure what is the point of supporting distributed anything. It
>>>>>>>>>>>>> is distributed presumably because it is hard to keep it in memory.
>>>>>>>>>>>>> Therefore, plotting anything distributed potentially presents 2 problems:
>>>>>>>>>>>>> storage space and overplotting due to number of points. The idea is that
>>>>>>>>>>>>> we have to work out algorithms that condense big data information into
>>>>>>>>>>>>> small plottable information (like density grids, for example, or
>>>>>>>>>>>>> histograms)....
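
[Illustration] One way to picture "condensing" many points into something
plottable is equal-width histogram bucketing. The Scala sketch below is
stdlib-only and runs on an in-memory Seq[Double]; it is not a Mahout API and
says nothing about how a distributed version would be built. HistogramSketch
and histogram are hypothetical names.

```scala
// Hypothetical illustration (not a Mahout API): reduce a column of values to
// a fixed number of equal-width bins, returned as (lower bin edge, count).
object HistogramSketch {
  def histogram(values: Seq[Double], bins: Int): Seq[(Double, Int)] = {
    require(bins > 0 && values.nonEmpty, "need at least one bin and one value")
    val lo = values.min
    val hi = values.max
    // If every value is identical, fall back to a unit width so all points
    // land in the first bin instead of dividing by zero.
    val width = if (hi > lo) (hi - lo) / bins else 1.0
    val counts = new Array[Int](bins)
    values.foreach { v =>
      // Clamp so the maximum value falls into the last bin.
      val idx = math.min(((v - lo) / width).toInt, bins - 1)
      counts(idx) += 1
    }
    counts.toSeq.zipWithIndex.map { case (c, i) => (lo + i * width, c) }
  }
}
```

However many rows go in, the output has only `bins` entries, which is the
property that makes it safe to hand to a plotting front end.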
>>>>>>>
>>>>>>>>
>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>
>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of
>>>>>>>>>>>>>
>>>>>>>>>>>> things."
>>>>>
>>>>>> -Virgil*
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
>>>>>>>>>>>>>
>>>>>>>>>>>> [email protected]>
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Great job Trevor, we’ll need this detail to smooth out the sharp edges and
>>>>>>>>>>>>>> any guidance from you or the Zeppelin community will be a big help.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On May 20, 2016, at 8:13 AM, Shannon Quinn <
>>>>>>>>>>>>>>
>>>>>>>>>>>>> [email protected]>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>> Agreed, thoroughly enjoying the blog post.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 5/19/16 12:01 AM, Andrew Palumbo wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Well done, Trevor!  I've not yet had a chance to try this in zeppelin
>>>>>>>>>>>>>>> but I just read the blog which is great!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -------- Original message --------
>>>>>>>>>>>>>>> From: Trevor Grant <[email protected]>
>>>>>>>>>>>>>>> Date: 05/18/2016 2:44 PM (GMT-05:00)
>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ah thank you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Fixing now.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes of
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> things."
>>>>>>>
>>>>>>>> -Virgil*
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Trevor- Just refreshed your readme.  The jar that I mentioned is
>>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> rather than:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>
>>>>>>>>>>>>>>>> (In the spark module that is)
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Trevor Grant <[email protected]>
>>>>>>>>>>>>>>>> Sent: Wednesday, May 18, 2016 11:02:43 AM
>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ah yes- I remember you pointing that out to me too.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I got side tracked yesterday for most of the day on an adventure in
>>>>>>>>>>>>>>>> getting Zeppelin to work right after I accidentally updated to the new
>>>>>>>>>>>>>>>> snapshot (free hint: the secret was to clear my cache *face-palm*)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm going to add that dependency to the readme.md now.
>>>>
>>>>>
>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>> tg
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Trevor Grant
>>>>>>>>>>>>>>>> Data Scientist
>>>>>>>>>>>>>>>> https://github.com/rawkintrevo
>>>>>>>>>>>>>>>> http://stackexchange.com/users/3002022/rawkintrevo
>>>>>>>>>>>>>>>> http://trevorgrant.org
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *"Fortunate is he, who is able to know the causes
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> of
>>>
>>>> things."
>>>>>>>
>>>>>>>> -Virgil*
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo <
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Trevor this is very cool- I have not been able to look at it closely yet
>>>>>>>>>>>>>>>>> but just a small point: I believe that you'll also need to add the
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For things like the classification stats, confusion matrix, and
>>>>>>>>>>>>>>>>> t-digest.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>> From: Trevor Grant <[email protected]>
>>>>>>>>>>>>>>>>> Sent: Wednesday, May 18, 2016 10:47:21 AM
>>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>>> Subject: Re: Future Mahout - Zeppelin work
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I still need to update my readme/env per Pat's comments below; however,
>>>>>>>>>>>>>>>>> without further ado, I present two notebooks that integrate Mahout +
>>>>>>>>>>>>>>>>> Spark + Zeppelin + ggplot2
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
