Re: Future Mahout - Zeppelin work

2016-05-20 Thread Trevor Grant
Unfortunately, Zeppelin dev has been so rapid that 0.6-SNAPSHOT as a version
is uninformative to me. I'd say, if possible, your first troubleshooting
measure would be to re-clone or do a "git fetch upstream" to get up to the
very latest.
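In concrete terms, that troubleshooting step could look like this (a sketch;
it assumes your clone has an "upstream" remote pointing at the Apache
Zeppelin repository):

```shell
# Bring the local Zeppelin clone up to date with upstream
git fetch upstream
git checkout master
git rebase upstream/master

# Rebuild so you are testing against the very latest sources
mvn clean package -DskipTests
```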

Sorry for delayed reply
Tg
On May 20, 2016 5:36 PM, "Andrew Musselman" 
wrote:

> Trevor, my zeppelin source is at this version:
>
>   org.apache.zeppelin
>   zeppelin
>   pom
>   0.6.0-incubating-SNAPSHOT
>   Zeppelin
>   Zeppelin project
>   http://zeppelin.incubator.apache.org/
>
> And yes you're right the artifacts weren't added to the dependencies; is
> that a feature in more modern zep?
>
> On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov 
> wrote:
>
> > no parenthesis.
> >
> > import o.a.m.sparkbindings._
> > 
> > myRdd = myDrm.rdd
> >
> >
> > On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi 
> wrote:
> >
> > > On Fri, May 20, 2016 at 3:18 PM, Trevor Grant <
> trevor.d.gr...@gmail.com>
> > > wrote:
> > >
> > > > Hey Pat,
> > > >
> > > > If you spit out a TSV, you can import it into pyspark / matplotlib
> > > > from the resource pool in essentially the same way and use that
> > > > plotting library if you prefer. In fact, you could import the TSV
> > > > into pandas and use all of the pandas plotting as well (though I
> > > > think that is, for the most part, also matplotlib with some
> > > > convenience functions).
> > > >
> > > >
> > > >
> > >
> >
> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
> > > >
> > > > In Zeppelin, unless you specify otherwise, pyspark, sparkr,
> > > > spark-sql, and scala-spark all share the same Spark context; you can
> > > > create RDDs in one language and access them / work on them in another
> > > > (so I understand).
> > > >
> > > > So in Mahout can you "save" a matrix as a RDD? e.g. something like
> > > >
> > > > val myRDD = myDRM.asRDD()
> > > >
> > >
> > > val myRDD = myDRM.rdd()
> > >
> > > >
> > > > And would 'myRDD' then exist in the spark context?
> > > >
> > > > yes it will be in sparkContext
> > >
> > > >
> > > > Trevor Grant
> > > > Data Scientist
> > > > https://github.com/rawkintrevo
> > > > http://stackexchange.com/users/3002022/rawkintrevo
> > > > http://trevorgrant.org
> > > >
> > > > *"Fortunate is he, who is able to know the causes of things."
> -Virgil*
> > > >
> > > >
> > > > On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel 
> > > > wrote:
> > > >
> > > > > Agreed.
> > > > >
> > > > > BTW I don’t want to stall progress but being the most ignorant of
> > plot
> > > > > libs, I’ll ask if we should consider python and matplotlib. In
> > another
> > > > > project we use python because of the RDD support on Spark though
> the
> > > > > visualizations are extremely limited in our case. If we can pass an
> > RDD
> > > > to
> > > > > pyspark it would allow custom reductions in python before plotting,
> > > even
> > > > > though we will support many natively in Mahout. I’m guessing that
> > this
> > > > > would cross a context boundary and require a write to disk?
> > > > >
> > > > > So 2 questions:
> > > > > 1) What does the inter-language support look like with Spark
> > > > > Python vs. SparkR? Can we transfer RDDs?
> > > > > 2) are the plot libs significantly different?
> > > > >
> > > > > On May 20, 2016, at 9:54 AM, Trevor Grant <
> trevor.d.gr...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > Dmitriy really nailed it on the head in his reply to the post,
> > > > > which I'll rebroadcast below. In essence, the whole reason you are
> > > > > (theoretically) using Mahout is that the data is too big to fit in
> > > > > memory. If it's too big to fit in memory, then it's probably too
> > > > > big to plot each point (e.g. trillions of rows, you only have so
> > > > > many pixels). For the example I randomly sampled a matrix.
> > > > >
> > > > > So as Dmitriy says, in Mahout we need to have functions that will
> > > > > 'preprocess' the data into something plottable.
> > > > >
> > > > > For the Zeppelin-plotting thing, we need to have a function that
> > > > > will spit out a TSV-like string of the data we want plotted.
> > > > >
> > > > > I agree an honest Mahout interpreter in Zeppelin is probably worth
> > > doing.
> > > > > There are a couple of ways to go about it. I opened up the
> discussion
> > > on
> > > > > dev@Zeppelin and didn't get any replies. I'm going to take that to
> > > mean
> > > > we
> > > > > can do it in a way that makes the most sense to Mahout users...
> > > > >
> > > > > First steps are to include some methods in Mahout that will do that
> > > > > preprocessing, and one that will turn something into a tsv string.
> > > > >
> > > > > I have some general ideas on possible approaches to making an
> > > > honest-mahout
> > > 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Andrew Musselman
Nope

Created spark context..
Spark context is available as "val sc".
Mahout distributed context is available as "implicit val sdc".
16/05/20 15:48:46 WARN ObjectStore: Version information not found in
metastore. hive.metastore.schema.verification is not enabled so recording
the schema version 0.13.1aa
SQL context available as "val sqlContext".
mahout> import org.apache.mahout.math._
import org.apache.mahout.math._
mahout> import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings._
mahout> import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm._
mahout> import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.scalabindings.RLikeOps._
mahout> import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
mahout> import org.apache.mahout.sparkbindings._
import org.apache.mahout.sparkbindings._
mahout>
mahout> implicit val sdc:
org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
org.apache.mahout.sparkbindings.SparkDistributedContext@1737d824
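
Once those imports and the implicit context load cleanly, the DRM-to-RDD
hand-off discussed elsewhere in this thread looks roughly like this (a
sketch, not standalone code; `drmParallelize` and `dense` come from the
Mahout bindings imported above, and `.rdd` is the no-parentheses accessor
Dmitriy mentions):

```scala
// Continuing the session above: build a small DRM and expose its RDD.
// Requires the running Spark/Mahout shell context with the imports shown.
val myDrm = drmParallelize(dense((1, 2), (3, 4)), numPartitions = 2)
val myRdd = myDrm.rdd   // the backing Spark RDD, visible to the shared context
```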


On Fri, May 20, 2016 at 3:48 PM, Suneel Marthi  wrote:

> R u seeing a similar thing in plain Mahout-Spark shell too ?
>
> On Fri, May 20, 2016 at 6:46 PM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
>
> > Now this definitely would help to clarify the instructions; let me know
> > if I can help.
> >
> > import org.apache.mahout.math._
> > import org.apache.mahout.math.scalabindings._
> > import org.apache.mahout.math.drm._
> > import org.apache.mahout.math.scalabindings.RLikeOps._
> > import org.apache.mahout.math.drm.RLikeDrmOps._
> > import org.apache.mahout.sparkbindings._
> > java.lang.NoClassDefFoundError: org/apache/mahout/math/AbstractMatrix
> > at org.apache.mahout.sparkbindings.SparkDistributedContext.(SparkDistributedContext.scala:25)
> > at org.apache.mahout.sparkbindings.package$.sc2sdc(package.scala:98)
> > [... repeated REPL $iwC wrapper frames trimmed; the full trace appears
> > in the original message further down the thread ...]

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Suneel Marthi
R u seeing a similar thing in plain Mahout-Spark shell too ?

On Fri, May 20, 2016 at 6:46 PM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Now this definitely would help to clarify the instructions; let me know if
> I can help.
>
> import org.apache.mahout.math._
> import org.apache.mahout.math.scalabindings._
> import org.apache.mahout.math.drm._
> import org.apache.mahout.math.scalabindings.RLikeOps._
> import org.apache.mahout.math.drm.RLikeDrmOps._
> import org.apache.mahout.sparkbindings._
> java.lang.NoClassDefFoundError: org/apache/mahout/math/AbstractMatrix
> at org.apache.mahout.sparkbindings.SparkDistributedContext.(SparkDistributedContext.scala:25)
> at org.apache.mahout.sparkbindings.package$.sc2sdc(package.scala:98)
> [... repeated REPL $iwC wrapper and interpreter frames trimmed; the full
> trace appears in the original message further down the thread ...]

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Andrew Musselman
Now this definitely would help to clarify the instructions; let me know if
I can help.

import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._
java.lang.NoClassDefFoundError: org/apache/mahout/math/AbstractMatrix
at
org.apache.mahout.sparkbindings.SparkDistributedContext.(SparkDistributedContext.scala:25)
at org.apache.mahout.sparkbindings.package$.sc2sdc(package.scala:98)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:59)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:68)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:70)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:72)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:74)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:76)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:78)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:80)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:82)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:84)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:86)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:88)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:90)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:92)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:94)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:96)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:98)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:100)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:102)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:104)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:106)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:108)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:110)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:112)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:114)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:116)
at $iwC$$iwC$$iwC$$iwC.(:118)
at $iwC$$iwC$$iwC.(:120)
at $iwC$$iwC.(:122)
at $iwC.(:124)
at (:126)
at .(:130)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at
org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:812)
at
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:755)
at
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:748)
at
org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at
org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:331)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Andrew Musselman
Oh might have been a browser cache issue; even after a couple hard refresh
methods using another browser has the import link.

On Fri, May 20, 2016 at 3:36 PM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Trevor, my zeppelin source is at this version:
>
>   org.apache.zeppelin
>   zeppelin
>   pom
>   0.6.0-incubating-SNAPSHOT
>   Zeppelin
>   Zeppelin project
>   http://zeppelin.incubator.apache.org/
>
> And yes you're right the artifacts weren't added to the dependencies; is
> that a feature in more modern zep?
>
> On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov 
> wrote:
>
>> no parenthesis.
>>
>> import o.a.m.sparkbindings._
>> 
>> myRdd = myDrm.rdd
>>
>>
>> On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi 
>> wrote:
>>
>> > On Fri, May 20, 2016 at 3:18 PM, Trevor Grant > >
>> > wrote:
>> >
>> > > Hey Pat,
>> > >
>> > > If you spit out a TSV, you can import it into pyspark / matplotlib
>> > > from the resource pool in essentially the same way and use that
>> > > plotting library if you prefer. In fact, you could import the TSV into
>> > > pandas and use all of the pandas plotting as well (though I think that
>> > > is, for the most part, also matplotlib with some convenience
>> > > functions).
>> > >
>> > >
>> > >
>> >
>> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
>> > >
>> > > In Zeppelin, unless you specify otherwise, pyspark, sparkr,
>> > > spark-sql, and scala-spark all share the same Spark context; you can
>> > > create RDDs in one language and access them / work on them in another
>> > > (so I understand).
>> > >
>> > > So in Mahout can you "save" a matrix as a RDD? e.g. something like
>> > >
>> > > val myRDD = myDRM.asRDD()
>> > >
>> >
>> > val myRDD = myDRM.rdd()
>> >
>> > >
>> > > And would 'myRDD' then exist in the spark context?
>> > >
>> > > yes it will be in sparkContext
>> >
>> > >
>> > > Trevor Grant
>> > > Data Scientist
>> > > https://github.com/rawkintrevo
>> > > http://stackexchange.com/users/3002022/rawkintrevo
>> > > http://trevorgrant.org
>> > >
>> > > *"Fortunate is he, who is able to know the causes of things."
>> -Virgil*
>> > >
>> > >
>> > > On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel 
>> > > wrote:
>> > >
>> > > > Agreed.
>> > > >
>> > > > BTW I don’t want to stall progress but being the most ignorant of
>> plot
>> > > > libs, I’ll ask if we should consider python and matplotlib. In
>> another
>> > > > project we use python because of the RDD support on Spark though the
>> > > > visualizations are extremely limited in our case. If we can pass an
>> RDD
>> > > to
>> > > > pyspark it would allow custom reductions in python before plotting,
>> > even
>> > > > though we will support many natively in Mahout. I’m guessing that
>> this
>> > > > would cross a context boundary and require a write to disk?
>> > > >
>> > > > So 2 questions:
>> > > > 1) What does the inter-language support look like with Spark Python
>> > > > vs. SparkR? Can we transfer RDDs?
>> > > > 2) are the plot libs significantly different?
>> > > >
>> > > > On May 20, 2016, at 9:54 AM, Trevor Grant > >
>> > > > wrote:
>> > > >
>> > > > Dmitriy really nailed it on the head in his reply to the post,
>> > > > which I'll rebroadcast below. In essence, the whole reason you are
>> > > > (theoretically) using Mahout is that the data is too big to fit in
>> > > > memory. If it's too big to fit in memory, then it's probably too big
>> > > > to plot each point (e.g. trillions of rows, you only have so many
>> > > > pixels). For the example I randomly sampled a matrix.
>> > > >
>> > > > So as Dmitriy says, in Mahout we need to have functions that will
>> > > > 'preprocess' the data into something plottable.
>> > > >
>> > > > For the Zeppelin-plotting thing, we need to have a function that
>> > > > will spit out a TSV-like string of the data we want plotted.
>> > > >
>> > > > I agree an honest Mahout interpreter in Zeppelin is probably worth
>> > doing.
>> > > > There are a couple of ways to go about it. I opened up the
>> discussion
>> > on
>> > > > dev@Zeppelin and didn't get any replies. I'm going to take that to
>> > mean
>> > > we
>> > > > can do it in a way that makes the most sense to Mahout users...
>> > > >
>> > > > First steps are to include some methods in Mahout that will do that
>> > > > preprocessing, and one that will turn something into a tsv string.
>> > > >
>> > > > I have some general ideas on possible approaches to making an
>> > > honest-mahout
>> > > > interpreter but I want to play in the code and look at the
>> Flink-Mahout
>> > > > shell a bit before I try to organize my thoughts and present them.
>> > > >
>> > > > ...(2) not sure what is the point of supporting distributed
>> anything.
>> 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Andrew Musselman
Trevor, my zeppelin source is at this version:

  org.apache.zeppelin
  zeppelin
  pom
  0.6.0-incubating-SNAPSHOT
  Zeppelin
  Zeppelin project
  http://zeppelin.incubator.apache.org/

And yes you're right the artifacts weren't added to the dependencies; is
that a feature in more modern zep?

On Fri, May 20, 2016 at 3:02 PM, Dmitriy Lyubimov  wrote:

> no parenthesis.
>
> import o.a.m.sparkbindings._
> 
> myRdd = myDrm.rdd
>
>
> On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi  wrote:
>
> > On Fri, May 20, 2016 at 3:18 PM, Trevor Grant 
> > wrote:
> >
> > > Hey Pat,
> > >
> > > If you spit out a TSV, you can import it into pyspark / matplotlib
> > > from the resource pool in essentially the same way and use that
> > > plotting library if you prefer. In fact, you could import the TSV into
> > > pandas and use all of the pandas plotting as well (though I think that
> > > is, for the most part, also matplotlib with some convenience
> > > functions).
> > >
> > >
> > >
> >
> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
> > >
> > > In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql,
> > > and scala-spark all share the same Spark context; you can create RDDs
> > > in one language and access them / work on them in another (so I
> > > understand).
> > >
> > > So in Mahout can you "save" a matrix as a RDD? e.g. something like
> > >
> > > val myRDD = myDRM.asRDD()
> > >
> >
> > val myRDD = myDRM.rdd()
> >
> > >
> > > And would 'myRDD' then exist in the spark context?
> > >
> > > yes it will be in sparkContext
> >
> > >
> > > Trevor Grant
> > > Data Scientist
> > > https://github.com/rawkintrevo
> > > http://stackexchange.com/users/3002022/rawkintrevo
> > > http://trevorgrant.org
> > >
> > > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> > >
> > >
> > > On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel 
> > > wrote:
> > >
> > > > Agreed.
> > > >
> > > > BTW I don’t want to stall progress but being the most ignorant of
> plot
> > > > libs, I’ll ask if we should consider python and matplotlib. In
> another
> > > > project we use python because of the RDD support on Spark though the
> > > > visualizations are extremely limited in our case. If we can pass an
> RDD
> > > to
> > > > pyspark it would allow custom reductions in python before plotting,
> > even
> > > > though we will support many natively in Mahout. I’m guessing that
> this
> > > > would cross a context boundary and require a write to disk?
> > > >
> > > > So 2 questions:
> > > > 1) What does the inter-language support look like with Spark Python
> > > > vs. SparkR? Can we transfer RDDs?
> > > > 2) are the plot libs significantly different?
> > > >
> > > > On May 20, 2016, at 9:54 AM, Trevor Grant 
> > > > wrote:
> > > >
> > > > Dmitriy really nailed it on the head in his reply to the post, which
> > > > I'll rebroadcast below. In essence, the whole reason you are
> > > > (theoretically) using Mahout is that the data is too big to fit in
> > > > memory. If it's too big to fit in memory, then it's probably too big
> > > > to plot each point (e.g. trillions of rows, you only have so many
> > > > pixels). For the example I randomly sampled a matrix.
> > > >
> > > > So as Dmitriy says, in Mahout we need to have functions that will
> > > > 'preprocess' the data into something plottable.
> > > >
> > > > For the Zeppelin-plotting thing, we need to have a function that
> > > > will spit out a TSV-like string of the data we want plotted.
> > > >
> > > > I agree an honest Mahout interpreter in Zeppelin is probably worth
> > doing.
> > > > There are a couple of ways to go about it. I opened up the discussion
> > on
> > > > dev@Zeppelin and didn't get any replies. I'm going to take that to
> > mean
> > > we
> > > > can do it in a way that makes the most sense to Mahout users...
> > > >
> > > > First steps are to include some methods in Mahout that will do that
> > > > preprocessing, and one that will turn something into a tsv string.
> > > >
> > > > I have some general ideas on possible approaches to making an
> > > honest-mahout
> > > > interpreter but I want to play in the code and look at the
> Flink-Mahout
> > > > shell a bit before I try to organize my thoughts and present them.
> > > >
> > > > ...(2) not sure what is the point of supporting distributed anything.
> > It
> > > is
> > > > distributed presumably because it is hard to keep it in memory.
> > > Therefore,
> > > > plotting anything distributed potentially presents 2 problems:
> storage
> > > > space and overplotting due to number of points. The idea is that we
> > have
> > > to
> > > > work out algorithms that condense big data information into small
> > > plottable
> > > > information 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Dmitriy Lyubimov
no parenthesis.

import o.a.m.sparkbindings._

myRdd = myDrm.rdd
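
Trevor's "TSV-like string" idea from the quoted discussion below can be
sketched in plain Scala; the helper name `sampleToTsv` and the in-memory
`Seq[Seq[Double]]` stand-in for a down-sampled matrix are illustrative
assumptions, not Mahout API:

```scala
import scala.util.Random

// Down-sample rows (you only have so many pixels) and emit a TSV string
// that a plotting layer such as Zeppelin's table display could consume.
def sampleToTsv(rows: Seq[Seq[Double]], maxRows: Int, seed: Long = 42L): String = {
  val rnd = new Random(seed)
  val sampled =
    if (rows.size <= maxRows) rows
    else rnd.shuffle(rows).take(maxRows)   // random sample, as in the example
  sampled.map(_.mkString("\t")).mkString("\n")
}

val m = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0), Seq(5.0, 6.0))
println(sampleToTsv(m, maxRows = 2).split("\n").length)   // 2 rows survive
```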


On Fri, May 20, 2016 at 2:57 PM, Suneel Marthi  wrote:

> On Fri, May 20, 2016 at 3:18 PM, Trevor Grant 
> wrote:
>
> > Hey Pat,
> >
> > If you spit out a TSV, you can import it into pyspark / matplotlib from
> > the resource pool in essentially the same way and use that plotting
> > library if you prefer. In fact, you could import the TSV into pandas and
> > use all of the pandas plotting as well (though I think that is, for the
> > most part, also matplotlib with some convenience functions).
> >
> >
> >
> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
> >
> > In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql,
> > and scala-spark all share the same Spark context; you can create RDDs in
> > one language and access them / work on them in another (so I understand).
> >
> > So in Mahout can you "save" a matrix as a RDD? e.g. something like
> >
> > val myRDD = myDRM.asRDD()
> >
>
> val myRDD = myDRM.rdd()
>
> >
> > And would 'myRDD' then exist in the spark context?
> >
> > yes it will be in sparkContext
>
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel 
> > wrote:
> >
> > > Agreed.
> > >
> > > BTW I don’t want to stall progress but being the most ignorant of plot
> > > libs, I’ll ask if we should consider python and matplotlib. In another
> > > project we use python because of the RDD support on Spark though the
> > > visualizations are extremely limited in our case. If we can pass an RDD
> > to
> > > pyspark it would allow custom reductions in python before plotting,
> even
> > > though we will support many natively in Mahout. I’m guessing that this
> > > would cross a context boundary and require a write to disk?
> > >
> > > So 2 questions:
> > > 1) what does the inter language support look like with Spark python vs
> > > SparkR, can we transfer RDDs?
> > > 2) are the plot libs significantly different?
> > >
> > > On May 20, 2016, at 9:54 AM, Trevor Grant 
> > > wrote:
> > >
> > > Dmitriy really nailed it on the head in his reply to the post which
> I'll
> > > rebroadcast below. In essence, the whole reason you are (theoretically)
> > > using Mahout is that the data is too big to fit in memory.  If it's too big to
> > fit
> > > in memory, well then it's probably too big to plot each point (e.g.
> > > trillions of rows; you only have so many pixels).   For the example I
> > > randomly sampled a matrix.
> > >
> > > So as Dmitriy says, in Mahout we need to have functions that will
> > > 'preprocess' the data into something plottable.
> > >
> > > For the Zeppelin-Plotting thing, we need to have a function that will
> spit
> > > out a TSV-like string of the data we want plotted.
> > >
> > > I agree an honest Mahout interpreter in Zeppelin is probably worth
> doing.
> > > There are a couple of ways to go about it. I opened up the discussion
> on
> > > dev@Zeppelin and didn't get any replies. I'm going to take that to
> mean
> > we
> > > can do it in a way that makes the most sense to Mahout users...
> > >
> > > First steps are to include some methods in Mahout that will do that
> > > preprocessing, and one that will turn something into a tsv string.
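
That "turn something into a TSV string" helper might look like the following self-contained sketch (no Mahout dependency; `matrixToTSV` is a made-up name, and a plain `Seq[Seq[Double]]` stands in for sampled or condensed DRM rows). Zeppelin renders any paragraph output beginning with `%table` as a chartable table:

```scala
// Sketch: condense in-memory rows (stand-in for preprocessed DRM output)
// into a tab-separated string that Zeppelin's %table display can render.
object TsvSketch {
  def matrixToTSV(header: Seq[String], rows: Seq[Seq[Double]]): String = {
    val head = header.mkString("\t")
    val body = rows.map(_.mkString("\t")).mkString("\n")
    s"%table\n$head\n$body"
  }

  def main(args: Array[String]): Unit =
    // prints the %table directive, then the header, then one row per line
    println(matrixToTSV(Seq("x", "y"), Seq(Seq(1.0, 2.0), Seq(3.0, 4.0))))
}
```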
> > >
> > > I have some general ideas on possible approaches to making an
> > honest-mahout
> > > interpreter but I want to play in the code and look at the Flink-Mahout
> > > shell a bit before I try to organize my thoughts and present them.
> > >
> > > ...(2) not sure what is the point of supporting distributed anything.
> It
> > is
> > > distributed presumably because it is hard to keep it in memory.
> > Therefore,
> > > plotting anything distributed potentially presents 2 problems: storage
> > > space and overplotting due to number of points. The idea is that we
> have
> > to
> > > work out algorithms that condense big data information into small
> > plottable
> > > information (like density grids, for example, or histograms)
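
The "condense into something small and plottable" idea can be sketched without any cluster at all. The binning below is sequential; on a real DRM the same per-bin arithmetic would run as a distributed map/reduce over blocks (all names here are illustrative, not Mahout API):

```scala
// Sketch: reduce an arbitrarily large set of points to a fixed-size
// histogram -- the kind of condensation Dmitriy describes.
object HistogramSketch {
  def histogram(xs: Seq[Double], min: Double, max: Double, bins: Int): Array[Long] = {
    require(bins > 0 && max > min)
    val counts = Array.fill(bins)(0L)
    val width = (max - min) / bins
    for (x <- xs if x >= min && x <= max) {
      // clamp so x == max lands in the last bin instead of overflowing
      val b = math.min(((x - min) / width).toInt, bins - 1)
      counts(b) += 1L
    }
    counts
  }

  def main(args: Array[String]): Unit = {
    val xs = (0 until 1000).map(_.toDouble)
    println(histogram(xs, 0.0, 1000.0, 4).mkString(","))  // 250,250,250,250
  }
}
```

However many input points there are, the plot-side payload stays at `bins` numbers, which is exactly what makes it safe to hand to a notebook.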
> > >
> > > Trevor Grant
> > > Data Scientist
> > > https://github.com/rawkintrevo
> > > http://stackexchange.com/users/3002022/rawkintrevo
> > > http://trevorgrant.org
> > >
> > > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> > >
> > >
> > > On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel 
> > > wrote:
> > >
> > > > Great job Trevor, we’ll need this detail to smooth out the sharp
> edges
> > > and
> > > > any guidance from you or the Zeppelin community will be a big help.
> > > >
> > > >
> > > > On May 20, 2016, at 8:13 AM, 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Suneel Marthi
On Fri, May 20, 2016 at 3:18 PM, Trevor Grant 
wrote:

> Hey Pat,
>
> If you spit out a TSV - you can import into pyspark / matplotlib from the
> resource pool in essentially the same way and use that plotting library if
> you prefer.  In fact you could import the tsv into pandas and use all of
> the pandas plotting as well (though I think it is, for the most part, also
> matplotlib with some convenience functions).
>
>
> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u
>
> In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
> scala-spark all share the same Spark context, so you can create RDDs in one
> language and access them / work on them in another (so I understand).
>
> So in Mahout can you "save" a matrix as an RDD? e.g. something like
>
> val myRDD = myDRM.asRDD()
>

val myRDD = myDRM.rdd()

>
> And would 'myRDD' then exist in the spark context?
>
> yes it will be in sparkContext

>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel 
> wrote:
>
> > Agreed.
> >
> > BTW I don’t want to stall progress but being the most ignorant of plot
> > libs, I’ll ask if we should consider python and matplotlib. In another
> > project we use python because of the RDD support on Spark though the
> > visualizations are extremely limited in our case. If we can pass an RDD
> to
> > pyspark it would allow custom reductions in python before plotting, even
> > though we will support many natively in Mahout. I’m guessing that this
> > would cross a context boundary and require a write to disk?
> >
> > So 2 questions:
> > 1) what does the inter language support look like with Spark python vs
> > SparkR, can we transfer RDDs?
> > 2) are the plot libs significantly different?
> >
> > On May 20, 2016, at 9:54 AM, Trevor Grant 
> > wrote:
> >
> > Dmitriy really nailed it on the head in his reply to the post which I'll
> > rebroadcast below. In essence, the whole reason you are (theoretically)
> > using Mahout is that the data is too big to fit in memory.  If it's too big to
> fit
> > in memory, well then it's probably too big to plot each point (e.g.
> > trillions of rows; you only have so many pixels).   For the example I
> > randomly sampled a matrix.
> >
> > So as Dmitriy says, in Mahout we need to have functions that will
> > 'preprocess' the data into something plottable.
> >
> > For the Zeppelin-Plotting thing, we need to have a function that will spit
> > out a TSV-like string of the data we want plotted.
> >
> > I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
> > There are a couple of ways to go about it. I opened up the discussion on
> > dev@Zeppelin and didn't get any replies. I'm going to take that to mean
> we
> > can do it in a way that makes the most sense to Mahout users...
> >
> > First steps are to include some methods in Mahout that will do that
> > preprocessing, and one that will turn something into a tsv string.
> >
> > I have some general ideas on possible approaches to making an
> honest-mahout
> > interpreter but I want to play in the code and look at the Flink-Mahout
> > shell a bit before I try to organize my thoughts and present them.
> >
> > ...(2) not sure what is the point of supporting distributed anything. It
> is
> > distributed presumably because it is hard to keep it in memory.
> Therefore,
> > plotting anything distributed potentially presents 2 problems: storage
> > space and overplotting due to number of points. The idea is that we have
> to
> > work out algorithms that condense big data information into small
> plottable
> > information (like density grids, for example, or histograms)
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel 
> > wrote:
> >
> > > Great job Trevor, we’ll need this detail to smooth out the sharp edges
> > and
> > > any guidance from you or the Zeppelin community will be a big help.
> > >
> > >
> > > On May 20, 2016, at 8:13 AM, Shannon Quinn  wrote:
> > >
> > > Agreed, thoroughly enjoying the blog post.
> > >
> > > On 5/19/16 12:01 AM, Andrew Palumbo wrote:
> > >> Well done, Trevor!  I've not yet had a chance to try this in zeppelin
> > > but I just read the blog which is great!
> > >>
> > >>  Original message 
> > >> From: Trevor Grant 
> > >> Date: 05/18/2016 2:44 PM 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Suneel Marthi
I concur. It seems to work for me on the present Zeppelin snapshot version.

On Fri, May 20, 2016 at 5:54 PM, Trevor Grant 
wrote:

> It appears the jars aren't loading.
>
> Did you add those artifacts?
>
> If your version is the one cloned from Till's, that's fairly ancient.
>
> I need to update that post badly.
>
> Do a fresh git clone from apache/incubator-zeppelin; the point of my last
> post was to get Flink 0.10 working with a pre-release Zeppelin. The Zeppelin
> snapshot is now on 1.0.
> On May 20, 2016 4:35 PM, "Andrew Musselman" 
> wrote:
>
> > In any case, still getting this error in the console when I run this
> block:
> >
> > "import org.apache.mahout.math._
> > import org.apache.mahout.math.scalabindings._
> > import org.apache.mahout.math.drm._
> > import org.apache.mahout.math.scalabindings.RLikeOps._
> > import org.apache.mahout.math.drm.RLikeDrmOps._
> > import org.apache.mahout.sparkbindings._
> >
> > implicit val sdc:
> org.apache.mahout.sparkbindings.SparkDistributedContext =
> > sc2sdc(sc)"
> >
> > ":21: error: object mahout is not a member of package org.apache
> > import org.apache.mahout.math._"
> >
> > On Fri, May 20, 2016 at 2:31 PM, Andrew Musselman <
> > andrew.mussel...@gmail.com> wrote:
> >
> > > Ah, well I cloned the Till branch per your Nov 3 article..
> > >
> > > git clone https://github.com/tillrohrmann/incubator-zeppelin.git
> > >
> > > On Fri, May 20, 2016 at 2:28 PM, Trevor Grant <
> trevor.d.gr...@gmail.com>
> > > wrote:
> > >
> > >> That's a "new" feature in the 0.6-snapshot... Say within the last
> month
> > or
> > >> two, how long has it been since you did a git pull?
> > >>
> > >> I'll update soon with a note on that.
> > >>
> > >> I can also create a gist with the code.
> > >> On May 20, 2016 4:24 PM, "Andrew Musselman" <
> andrew.mussel...@gmail.com
> > >
> > >> wrote:
> > >>
> > >> > At this step of the tutorial I'm stuck because I don't have an
> "Import
> > >> > Note" link in my Zeppelin home:
> > >> >
> > >> > "I’m going to do you another favor. Go to the Zeppelin home page and
> > >> click
> > >> > on ‘Import Note’. When given the option between URL and json, click
> on
> > >> URL
> > >> > and enter the following link:
> > >> >
> > >> >
> > >> >
> > >>
> >
> https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
> > >> > "
> > >> >
> > >> > On Fri, May 20, 2016 at 12:35 PM, Trevor Grant <
> > >> trevor.d.gr...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > FYI:
> > >> > >
> > >> > > Looks like Flink shell is fixed :D
> > >> > >
> > >> > > https://github.com/apache/flink/pull/1913
> > >> > >
> > >> > > (I tested; it's working well).
> > >> > >
> > >> > >
> > >> > >
> > >> > > Trevor Grant
> > >> > > Data Scientist
> > >> > > https://github.com/rawkintrevo
> > >> > > http://stackexchange.com/users/3002022/rawkintrevo
> > >> > > http://trevorgrant.org
> > >> > >
> > >> > > *"Fortunate is he, who is able to know the causes of things."
> > >> -Virgil*
> > >> > >
> > >> > >
> > >> > > On Fri, May 20, 2016 at 1:46 PM, Suneel Marthi <
> smar...@apache.org>
> > >> > wrote:
> > >> > >
> > >> > > > On Fri, May 20, 2016 at 12:54 PM, Trevor Grant <
> > >> > trevor.d.gr...@gmail.com
> > >> > > >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Dmitriy really nailed it on the head in his reply to the post
> > >> which
> > >> > > I'll
> > >> > > > > rebroadcast below. In essence, the whole reason you are
> > >> > (theoretically)
> > >> > > > > using Mahout is that the data is too big to fit in memory.  If it's
> too
> > >> big
> > >> > to
> > >> > > > fit
> > >> > > > > in memory, well then it's probably too big to plot each point
> > (e.g.
> > >> > > > > trillions of rows; you only have so many pixels).   For the
> > >> example I
> > >> > > > > randomly sampled a matrix.
> > >> > > > >
> > >> > > > > So as Dmitriy says, in Mahout we need to have functions that
> > will
> > >> > > > > 'preprocess' the data into something plottable.
> > >> > > > >
> > >> > > > > For the Zeppelin-Plotting thing, we need to have a function
> that
> > >> will
> > >> > > spit
> > >> > > > > out a TSV-like string of the data we want plotted.
> > >> > > > >
> > >> > > > > I agree an honest Mahout interpreter in Zeppelin is probably
> > worth
> > >> > > doing.
> > >> > > > > There are a couple of ways to go about it. I opened up the
> > >> discussion
> > >> > > on
> > >> > > > > dev@Zeppelin and didn't get any replies. I'm going to take
> that
> > >> to
> > >> > > mean
> > >> > > > we
> > >> > > > > can do it in a way that makes the most sense to Mahout
> users...
> > >> > > > >
> > >> > > > > First steps are to include some methods in Mahout that will do
> > >> that
> > >> > > > > preprocessing, and one that will turn something into a tsv
> > string.
> > >> > > > >
> > >> > > > > I have some general ideas on possible approaches to making an
> > >> > > > honest-mahout
> > >> > > > > 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Trevor Grant
It appears the jars aren't loading.

Did you add those artifacts?

If your version is the one cloned from Till's, that's fairly ancient.

I need to update that post badly.

Do a fresh git clone from apache/incubator-zeppelin; the point of my last
post was to get Flink 0.10 working with a pre-release Zeppelin. The Zeppelin
snapshot is now on 1.0.
On May 20, 2016 4:35 PM, "Andrew Musselman" 
wrote:

> In any case, still getting this error in the console when I run this block:
>
> "import org.apache.mahout.math._
> import org.apache.mahout.math.scalabindings._
> import org.apache.mahout.math.drm._
> import org.apache.mahout.math.scalabindings.RLikeOps._
> import org.apache.mahout.math.drm.RLikeDrmOps._
> import org.apache.mahout.sparkbindings._
>
> implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
> sc2sdc(sc)"
>
> ":21: error: object mahout is not a member of package org.apache
> import org.apache.mahout.math._"
>
> On Fri, May 20, 2016 at 2:31 PM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
>
> > Ah, well I cloned the Till branch per your Nov 3 article..
> >
> > git clone https://github.com/tillrohrmann/incubator-zeppelin.git
> >
> > On Fri, May 20, 2016 at 2:28 PM, Trevor Grant 
> > wrote:
> >
> >> That's a "new" feature in the 0.6-snapshot... Say within the last month
> or
> >> two, how long has it been since you did a git pull?
> >>
> >> I'll update soon with a note on that.
> >>
> >> I can also create a gist with the code.
> >> On May 20, 2016 4:24 PM, "Andrew Musselman"  >
> >> wrote:
> >>
> >> > At this step of the tutorial I'm stuck because I don't have an "Import
> >> > Note" link in my Zeppelin home:
> >> >
> >> > "I’m going to do you another favor. Go to the Zeppelin home page and
> >> click
> >> > on ‘Import Note’. When given the option between URL and json, click on
> >> URL
> >> > and enter the following link:
> >> >
> >> >
> >> >
> >>
> https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
> >> > "
> >> >
> >> > On Fri, May 20, 2016 at 12:35 PM, Trevor Grant <
> >> trevor.d.gr...@gmail.com>
> >> > wrote:
> >> >
> >> > > FYI:
> >> > >
> >> > > Looks like Flink shell is fixed :D
> >> > >
> >> > > https://github.com/apache/flink/pull/1913
> >> > >
> >> > > (I tested; it's working well).
> >> > >
> >> > >
> >> > >
> >> > > Trevor Grant
> >> > > Data Scientist
> >> > > https://github.com/rawkintrevo
> >> > > http://stackexchange.com/users/3002022/rawkintrevo
> >> > > http://trevorgrant.org
> >> > >
> >> > > *"Fortunate is he, who is able to know the causes of things."
> >> -Virgil*
> >> > >
> >> > >
> >> > > On Fri, May 20, 2016 at 1:46 PM, Suneel Marthi 
> >> > wrote:
> >> > >
> >> > > > On Fri, May 20, 2016 at 12:54 PM, Trevor Grant <
> >> > trevor.d.gr...@gmail.com
> >> > > >
> >> > > > wrote:
> >> > > >
> >> > > > > Dmitriy really nailed it on the head in his reply to the post
> >> which
> >> > > I'll
> >> > > > > rebroadcast below. In essence, the whole reason you are
> >> > (theoretically)
> >> > > > > using Mahout is that the data is too big to fit in memory.  If it's too
> >> big
> >> > to
> >> > > > fit
> >> > > > > in memory, well then it's probably too big to plot each point
> (e.g.
> >> > > > > trillions of rows; you only have so many pixels).   For the
> >> example I
> >> > > > > randomly sampled a matrix.
> >> > > > >
> >> > > > > So as Dmitriy says, in Mahout we need to have functions that
> will
> >> > > > > 'preprocess' the data into something plottable.
> >> > > > >
> >> > > > > For the Zeppelin-Plotting thing, we need to have a function that
> >> will
> >> > > spit
> >> > > > > out a TSV-like string of the data we want plotted.
> >> > > > >
> >> > > > > I agree an honest Mahout interpreter in Zeppelin is probably
> worth
> >> > > doing.
> >> > > > > There are a couple of ways to go about it. I opened up the
> >> discussion
> >> > > on
> >> > > > > dev@Zeppelin and didn't get any replies. I'm going to take that
> >> to
> >> > > mean
> >> > > > we
> >> > > > > can do it in a way that makes the most sense to Mahout users...
> >> > > > >
> >> > > > > First steps are to include some methods in Mahout that will do
> >> that
> >> > > > > preprocessing, and one that will turn something into a tsv
> string.
> >> > > > >
> >> > > > > I have some general ideas on possible approaches to making an
> >> > > > honest-mahout
> >> > > > > interpreter but I want to play in the code and look at the
> >> > Flink-Mahout
> >> > > > > shell a bit before I try to organize my thoughts and present
> them.
> >> > > > >
> >> > > >
> >> > > > FYI Trevor, there's no Flink-Mahout shell today; in large part
> >> because
> >> > > the
> >> > > > Flink Shell is still busted on their end and we on the Mahout end
> >> have
> >> > > not
> >> > > > had time to muck with it.  What exists today is the Mahout-Spark
> >> 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Andrew Musselman
In any case, still getting this error in the console when I run this block:

"import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext =
sc2sdc(sc)"

":21: error: object mahout is not a member of package org.apache
import org.apache.mahout.math._"
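
The "object mahout is not a member of package org.apache" error means the Mahout jars never made it onto the Spark interpreter's classpath. In 0.6-era Zeppelin snapshots there are two ways to add them: through the interpreter dependency settings in the UI, or with a %dep paragraph run before the first %spark paragraph. A sketch of the latter — the artifact coordinates and version below are assumptions for illustration, not confirmed from this thread:

```
%dep
z.reset()
// Hypothetical coordinates -- match these to your actual Mahout build
z.load("org.apache.mahout:mahout-math:0.12.0")
z.load("org.apache.mahout:mahout-math-scala_2.10:0.12.0")
z.load("org.apache.mahout:mahout-spark_2.10:0.12.0")
```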

On Fri, May 20, 2016 at 2:31 PM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Ah, well I cloned the Till branch per your Nov 3 article..
>
> git clone https://github.com/tillrohrmann/incubator-zeppelin.git
>
> On Fri, May 20, 2016 at 2:28 PM, Trevor Grant 
> wrote:
>
>> That's a "new" feature in the 0.6-snapshot... Say within the last month or
>> two, how long has it been since you did a git pull?
>>
>> I'll update soon with a note on that.
>>
>> I can also create a gist with the code.
>> On May 20, 2016 4:24 PM, "Andrew Musselman" 
>> wrote:
>>
>> > At this step of the tutorial I'm stuck because I don't have an "Import
>> > Note" link in my Zeppelin home:
>> >
>> > "I’m going to do you another favor. Go to the Zeppelin home page and
>> click
>> > on ‘Import Note’. When given the option between URL and json, click on
>> URL
>> > and enter the following link:
>> >
>> >
>> >
>> https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
>> > "
>> >
>> > On Fri, May 20, 2016 at 12:35 PM, Trevor Grant <
>> trevor.d.gr...@gmail.com>
>> > wrote:
>> >
>> > > FYI:
>> > >
>> > > Looks like Flink shell is fixed :D
>> > >
>> > > https://github.com/apache/flink/pull/1913
>> > >
>> > > (I tested; it's working well).
>> > >
>> > >
>> > >
>> > > Trevor Grant
>> > > Data Scientist
>> > > https://github.com/rawkintrevo
>> > > http://stackexchange.com/users/3002022/rawkintrevo
>> > > http://trevorgrant.org
>> > >
>> > > *"Fortunate is he, who is able to know the causes of things."
>> -Virgil*
>> > >
>> > >
>> > > On Fri, May 20, 2016 at 1:46 PM, Suneel Marthi 
>> > wrote:
>> > >
>> > > > On Fri, May 20, 2016 at 12:54 PM, Trevor Grant <
>> > trevor.d.gr...@gmail.com
>> > > >
>> > > > wrote:
>> > > >
>> > > > > Dmitriy really nailed it on the head in his reply to the post
>> which
>> > > I'll
>> > > > > rebroadcast below. In essence, the whole reason you are
>> > (theoretically)
>> > > > > using Mahout is that the data is too big to fit in memory.  If it's too
>> big
>> > to
>> > > > fit
>> > > > > in memory, well then it's probably too big to plot each point (e.g.
>> > > > > trillions of rows; you only have so many pixels).   For the
>> example I
>> > > > > randomly sampled a matrix.
>> > > > >
>> > > > > So as Dmitriy says, in Mahout we need to have functions that will
>> > > > > 'preprocess' the data into something plottable.
>> > > > >
>> > > > > For the Zeppelin-Plotting thing, we need to have a function that
>> will
>> > > spit
>> > > > > out a TSV-like string of the data we want plotted.
>> > > > >
>> > > > > I agree an honest Mahout interpreter in Zeppelin is probably worth
>> > > doing.
>> > > > > There are a couple of ways to go about it. I opened up the
>> discussion
>> > > on
>> > > > > dev@Zeppelin and didn't get any replies. I'm going to take that
>> to
>> > > mean
>> > > > we
>> > > > > can do it in a way that makes the most sense to Mahout users...
>> > > > >
>> > > > > First steps are to include some methods in Mahout that will do
>> that
>> > > > > preprocessing, and one that will turn something into a tsv string.
>> > > > >
>> > > > > I have some general ideas on possible approaches to making an
>> > > > honest-mahout
>> > > > > interpreter but I want to play in the code and look at the
>> > Flink-Mahout
>> > > > > shell a bit before I try to organize my thoughts and present them.
>> > > > >
>> > > >
>> > > > FYI Trevor, there's no Flink-Mahout shell today; in large part
>> because
>> > > the
>> > > > Flink Shell is still busted on their end and we on the Mahout end
>> have
>> > > not
>> > > > had time to muck with it.  What exists today is the Mahout-Spark
>> shell.
>> > > >
>> > > > >
>> > > > > ...(2) not sure what is the point of supporting distributed
>> anything.
>> > > It
>> > > > is
>> > > > > distributed presumably because it is hard to keep it in memory.
>> > > > Therefore,
>> > > > > plotting anything distributed potentially presents 2 problems:
>> > storage
>> > > > > space and overplotting due to number of points. The idea is that
>> we
>> > > have
>> > > > to
>> > > > > work out algorithms that condense big data information into small
>> > > > plottable
>> > > > > information (like density grids, for example, or histograms)
>> > > > >
>> > > >
>> > > > Agreed, something like sampling x% of points from a DRM (like the
>> > 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Andrew Musselman
Ah, well I cloned the Till branch per your Nov 3 article..

git clone https://github.com/tillrohrmann/incubator-zeppelin.git

On Fri, May 20, 2016 at 2:28 PM, Trevor Grant 
wrote:

> That's a "new" feature in the 0.6-snapshot... Say within the last month or
> two, how long has it been since you did a git pull?
>
> I'll update soon with a note on that.
>
> I can also create a gist with the code.
> On May 20, 2016 4:24 PM, "Andrew Musselman" 
> wrote:
>
> > At this step of the tutorial I'm stuck because I don't have an "Import
> > Note" link in my Zeppelin home:
> >
> > "I’m going to do you another favor. Go to the Zeppelin home page and
> click
> > on ‘Import Note’. When given the option between URL and json, click on
> URL
> > and enter the following link:
> >
> >
> >
> https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
> > "
> >
> > On Fri, May 20, 2016 at 12:35 PM, Trevor Grant  >
> > wrote:
> >
> > > FYI:
> > >
> > > Looks like Flink shell is fixed :D
> > >
> > > https://github.com/apache/flink/pull/1913
> > >
> > > (I tested; it's working well).
> > >
> > >
> > >
> > > Trevor Grant
> > > Data Scientist
> > > https://github.com/rawkintrevo
> > > http://stackexchange.com/users/3002022/rawkintrevo
> > > http://trevorgrant.org
> > >
> > > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> > >
> > >
> > > On Fri, May 20, 2016 at 1:46 PM, Suneel Marthi 
> > wrote:
> > >
> > > > On Fri, May 20, 2016 at 12:54 PM, Trevor Grant <
> > trevor.d.gr...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Dmitriy really nailed it on the head in his reply to the post which
> > > I'll
> > > > > rebroadcast below. In essence, the whole reason you are
> > (theoretically)
> > > > > using Mahout is that the data is too big to fit in memory.  If it's too
> big
> > to
> > > > fit
> > > > > in memory, well then it's probably too big to plot each point (e.g.
> > > > > trillions of rows; you only have so many pixels).   For the example
> I
> > > > > randomly sampled a matrix.
> > > > >
> > > > > So as Dmitriy says, in Mahout we need to have functions that will
> > > > > 'preprocess' the data into something plottable.
> > > > >
> > > > > For the Zeppelin-Plotting thing, we need to have a function that
> will
> > > spit
> > > > > out a TSV-like string of the data we want plotted.
> > > > >
> > > > > I agree an honest Mahout interpreter in Zeppelin is probably worth
> > > doing.
> > > > > There are a couple of ways to go about it. I opened up the
> discussion
> > > on
> > > > > dev@Zeppelin and didn't get any replies. I'm going to take that to
> > > mean
> > > > we
> > > > > can do it in a way that makes the most sense to Mahout users...
> > > > >
> > > > > First steps are to include some methods in Mahout that will do that
> > > > > preprocessing, and one that will turn something into a tsv string.
> > > > >
> > > > I have some general ideas on possible approaches to making an
> > > > honest-mahout
> > > > > interpreter but I want to play in the code and look at the
> > Flink-Mahout
> > > > > shell a bit before I try to organize my thoughts and present them.
> > > > >
> > > >
> > > > FYI Trevor, there's no Flink-Mahout shell today; in large part
> because
> > > the
> > > > Flink Shell is still busted on their end and we on the Mahout end
> have
> > > not
> > > > had time to muck with it.  What exists today is the Mahout-Spark
> shell.
> > > >
> > > > >
> > > > > ...(2) not sure what is the point of supporting distributed
> anything.
> > > It
> > > > is
> > > > > distributed presumably because it is hard to keep it in memory.
> > > > Therefore,
> > > > > plotting anything distributed potentially presents 2 problems:
> > storage
> > > > > space and overplotting due to number of points. The idea is that we
> > > have
> > > > to
> > > > > work out algorithms that condense big data information into small
> > > > plottable
> > > > > information (like density grids, for example, or histograms)
> > > > >
> > > >
> > > > Agreed, something like sampling x% of points from a DRM (like the
> > > visuals I
> > > > had from Palumbo for the talk in Vancouver that demonstrated the
> > concept)
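
Suneel's "sample x% of points" suggestion is also easy to sketch. Here a plain Seq stands in for the DRM's rows; on a real DRM the same Bernoulli filter would run row-wise and distributed (the names are illustrative, and the sampler is seeded for reproducibility):

```scala
import scala.util.Random

// Sketch: keep roughly `fraction` of the rows for plotting.
object SampleSketch {
  def sampleRows[A](rows: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)  // Bernoulli row sampling
  }

  def main(args: Array[String]): Unit = {
    val kept = sampleRows((1 to 10000).toSeq, 0.01, seed = 42L)
    println(kept.size)  // roughly 100 of 10000
  }
}
```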
> > > >
> > > >
> > > > >
> > > > > Trevor Grant
> > > > > Data Scientist
> > > > > https://github.com/rawkintrevo
> > > > > http://stackexchange.com/users/3002022/rawkintrevo
> > > > > http://trevorgrant.org
> > > > >
> > > > > *"Fortunate is he, who is able to know the causes of things."
> > -Virgil*
> > > > >
> > > > >
> > > > > On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel <
> p...@occamsmachete.com>
> > > > > wrote:
> > > > >
> > > > > > Great job Trevor, we’ll need this detail to smooth out the sharp
> > > edges
> > > > > and
> > > > > > any guidance from you or the Zeppelin community will be a big
> help.
> > > > > >
> > > > > >
> > > > > > On May 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Trevor Grant
That's a "new" feature in the 0.6-snapshot... Say within the last month or
two, how long has it been since you did a git pull?

I'll update soon with a note on that.

I can also create a gist with the code.
On May 20, 2016 4:24 PM, "Andrew Musselman" 
wrote:

> At this step of the tutorial I'm stuck because I don't have an "Import
> Note" link in my Zeppelin home:
>
> "I’m going to do you another favor. Go to the Zeppelin home page and click
> on ‘Import Note’. When given the option between URL and json, click on URL
> and enter the following link:
>
>
> https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
> "
>
> On Fri, May 20, 2016 at 12:35 PM, Trevor Grant 
> wrote:
>
> > FYI:
> >
> > Looks like Flink shell is fixed :D
> >
> > https://github.com/apache/flink/pull/1913
> >
> > (I tested; it's working well).
> >
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Fri, May 20, 2016 at 1:46 PM, Suneel Marthi 
> wrote:
> >
> > > On Fri, May 20, 2016 at 12:54 PM, Trevor Grant <
> trevor.d.gr...@gmail.com
> > >
> > > wrote:
> > >
> > > > Dmitriy really nailed it on the head in his reply to the post which
> > I'll
> > > > rebroadcast below. In essence, the whole reason you are
> (theoretically)
> > > > using Mahout is that the data is too big to fit in memory.  If it's too big
> to
> > > fit
> > > > in memory, well then it's probably too big to plot each point (e.g.
> > > > trillions of rows; you only have so many pixels).   For the example I
> > > > randomly sampled a matrix.
> > > >
> > > > So as Dmitriy says, in Mahout we need to have functions that will
> > > > 'preprocess' the data into something plottable.
> > > >
> > > > For the Zeppelin-Plotting thing, we need to have a function that will
> > spit
> > > > out a TSV-like string of the data we want plotted.
> > > >
> > > > I agree an honest Mahout interpreter in Zeppelin is probably worth
> > doing.
> > > > There are a couple of ways to go about it. I opened up the discussion
> > on
> > > > dev@Zeppelin and didn't get any replies. I'm going to take that to
> > mean
> > > we
> > > > can do it in a way that makes the most sense to Mahout users...
> > > >
> > > > First steps are to include some methods in Mahout that will do that
> > > > preprocessing, and one that will turn something into a tsv string.
> > > >
> > > > I have some general ideas on possible approaches to making an
> > > honest-mahout
> > > > interpreter but I want to play in the code and look at the
> Flink-Mahout
> > > > shell a bit before I try to organize my thoughts and present them.
> > > >
> > >
> > > FYI Trevor, there's no Flink-Mahout shell today; in large part because
> > the
> > > Flink Shell is still busted on their end and we on the Mahout end have
> > not
> > > had time to muck with it.  What exists today is the Mahout-Spark shell.
> > >
> > > >
> > > > ...(2) not sure what is the point of supporting distributed anything.
> > It
> > > is
> > > > distributed presumably because it is hard to keep it in memory.
> > > Therefore,
> > > > plotting anything distributed potentially presents 2 problems:
> storage
> > > > space and overplotting due to number of points. The idea is that we
> > have
> > > to
> > > > work out algorithms that condense big data information into small
> > > plottable
> > > > information (like density grids, for example, or histograms)
> > > >
> > >
> > > Agreed, something like sampling x% of points from a DRM (like the
> > visuals I
> > > had from Palumbo for the talk in Vancouver that demonstrated the
> concept)
> > >
> > >
> > > >
> > > > Trevor Grant
> > > > Data Scientist
> > > > https://github.com/rawkintrevo
> > > > http://stackexchange.com/users/3002022/rawkintrevo
> > > > http://trevorgrant.org
> > > >
> > > > *"Fortunate is he, who is able to know the causes of things."
> -Virgil*
> > > >
> > > >
> > > > On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel 
> > > > wrote:
> > > >
> > > > > Great job Trevor, we’ll need this detail to smooth out the sharp
> > edges
> > > > and
> > > > > any guidance from you or the Zeppelin community will be a big help.
> > > > >
> > > > >
> > > > > On May 20, 2016, at 8:13 AM, Shannon Quinn 
> > wrote:
> > > > >
> > > > > Agreed, thoroughly enjoying the blog post.
> > > > >
> > > > > On 5/19/16 12:01 AM, Andrew Palumbo wrote:
> > > > > > Well done, Trevor!  I've not yet had a chance to try this in
> > zeppelin
> > > > > but I just read the blog which is great!
> > > > > >
> > > > > >  Original message 
> > > > > > From: Trevor Grant 
> > > > > > Date: 05/18/2016 2:44 PM (GMT-05:00)
> > > > > > 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Andrew Musselman
At this step of the tutorial I'm stuck because I don't have an "Import
Note" link in my Zeppelin home:

"I’m going to do you another favor. Go to the Zeppelin home page and click
on ‘Import Note’. When given the option between URL and json, click on URL
and enter the following link:

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json
"

On Fri, May 20, 2016 at 12:35 PM, Trevor Grant 
wrote:

> FYI:
>
> Looks like Flink shell is fixed :D
>
> https://github.com/apache/flink/pull/1913
>
> (I tested, is working good).
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Fri, May 20, 2016 at 1:46 PM, Suneel Marthi  wrote:
>
> > On Fri, May 20, 2016 at 12:54 PM, Trevor Grant  >
> > wrote:
> >
> > > Dmitriy really nailed it on the head in his reply to the post which
> I'll
> > > rebroadcast below. In essence the whole reason you are (theoretically)
> > > using Mahout is the data is to big to fit in memory.  If it's to big to
> > fit
> > > in memory, well then its probably too big to plot each point (e.g.
> > > trillions of row, you only have so many pixels).   For the example I
> > > randomly sampled a matrix.
> > >
> > > So as Dmitriy says, in Mahout we need to have functions that will
> > > 'preprocess' the data into something plotable.
> > >
> > > For the Zepplin-Plotting thing, we need to have a function that will
> spit
> > > out a tsv like string of the data we wanted plotted.
> > >
> > > I agree an honest Mahout interpreter in Zeppelin is probably worth
> doing.
> > > There are a couple of ways to go about it. I opened up the discussion
> on
> > > dev@Zeppelin and didn't get any replies. I'm going to take that to
> mean
> > we
> > > can do it in a way that makes the most sense to Mahout users...
> > >
> > > First steps are to include some methods in Mahout that will do that
> > > preprocessing, and one that will turn something into a tsv string.
> > >
> > > I have some general ideas on possible approached to making an
> > honest-mahout
> > > interpreter but I want to play in the code and look at the Flink-Mahout
> > > shell a bit before I try to organize my thoughts and present them.
> > >
> >
> > FYI Trevor, there's no Flink-Mahout shell today; in large part because
> the
> > Flink Shell is still busted on their end and we on the Mahout end have
> not
> > had time to muck with it.  What exists today is the Mahout-Spark shell.
> >
> > >
> > > ...(2) not sure what is the point of supporting distributed anything.
> It
> > is
> > > distributed presumably because it is hard to keep it in memory.
> > Therefore,
> > > plotting anything distributed potentially presents 2 problems: storage
> > > space and overplotting due to number of points. The idea is that we
> have
> > to
> > > work out algorithms that condense big data information into small
> > plottable
> > > information (like density grids, for example, or histograms)
> > >
> >
> > Agreed, something like sampling x% of points from a DRM (like the
> visuals I
> > had from Palumbo for the talk in Vancouver that demonstrated the concept)
> >
> >
> > >
> > > Trevor Grant
> > > Data Scientist
> > > https://github.com/rawkintrevo
> > > http://stackexchange.com/users/3002022/rawkintrevo
> > > http://trevorgrant.org
> > >
> > > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> > >
> > >
> > > On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel 
> > > wrote:
> > >
> > > > Great job Trevor, we’ll need this detail to smooth out the sharp
> edges
> > > and
> > > > any guidance from you or the Zeppelin community will be a big help.
> > > >
> > > >
> > > > On May 20, 2016, at 8:13 AM, Shannon Quinn 
> wrote:
> > > >
> > > > Agreed, thoroughly enjoying the blog post.
> > > >
> > > > On 5/19/16 12:01 AM, Andrew Palumbo wrote:
> > > > > Well done, Trevor!  I've not yet had a chance to try this in
> zeppelin
> > > > but I just read the blog which is great!
> > > > >
> > > > >  Original message 
> > > > > From: Trevor Grant 
> > > > > Date: 05/18/2016 2:44 PM (GMT-05:00)
> > > > > To: dev@mahout.apache.org
> > > > > Subject: Re: Future Mahout - Zeppelin work
> > > > >
> > > > > Ah thank you.
> > > > >
> > > > > Fixing now.
> > > > >
> > > > >
> > > > > Trevor Grant
> > > > > Data Scientist
> > > > > https://github.com/rawkintrevo
> > > > > http://stackexchange.com/users/3002022/rawkintrevo
> > > > > http://trevorgrant.org
> > > > >
> > > > > *"Fortunate is he, who is able to know the causes of things."
> > -Virgil*
> > > > >
> > > > >
> > > > > On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo <
> ap@outlook.com>
> > > > wrote:
> > 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Trevor Grant
FYI:

Looks like Flink shell is fixed :D

https://github.com/apache/flink/pull/1913

(I tested; it's working well.)



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Fri, May 20, 2016 at 1:46 PM, Suneel Marthi  wrote:

> On Fri, May 20, 2016 at 12:54 PM, Trevor Grant 
> wrote:
>
> > Dmitriy really nailed it on the head in his reply to the post which I'll
> > rebroadcast below. In essence the whole reason you are (theoretically)
> > using Mahout is the data is to big to fit in memory.  If it's to big to
> fit
> > in memory, well then its probably too big to plot each point (e.g.
> > trillions of row, you only have so many pixels).   For the example I
> > randomly sampled a matrix.
> >
> > So as Dmitriy says, in Mahout we need to have functions that will
> > 'preprocess' the data into something plotable.
> >
> > For the Zepplin-Plotting thing, we need to have a function that will spit
> > out a tsv like string of the data we wanted plotted.
> >
> > I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
> > There are a couple of ways to go about it. I opened up the discussion on
> > dev@Zeppelin and didn't get any replies. I'm going to take that to mean
> we
> > can do it in a way that makes the most sense to Mahout users...
> >
> > First steps are to include some methods in Mahout that will do that
> > preprocessing, and one that will turn something into a tsv string.
> >
> > I have some general ideas on possible approached to making an
> honest-mahout
> > interpreter but I want to play in the code and look at the Flink-Mahout
> > shell a bit before I try to organize my thoughts and present them.
> >
>
> FYI Trevor, there's no Flink-Mahout shell today; in large part because the
> Flink Shell is still busted on their end and we on the Mahout end have not
> had time to muck with it.  What exists today is the Mahout-Spark shell.
>
> >
> > ...(2) not sure what is the point of supporting distributed anything. It
> is
> > distributed presumably because it is hard to keep it in memory.
> Therefore,
> > plotting anything distributed potentially presents 2 problems: storage
> > space and overplotting due to number of points. The idea is that we have
> to
> > work out algorithms that condense big data information into small
> plottable
> > information (like density grids, for example, or histograms)
> >
>
> Agreed, something like sampling x% of points from a DRM (like the visuals I
> had from Palumbo for the talk in Vancouver that demonstrated the concept)
>
>
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel 
> > wrote:
> >
> > > Great job Trevor, we’ll need this detail to smooth out the sharp edges
> > and
> > > any guidance from you or the Zeppelin community will be a big help.
> > >
> > >
> > > On May 20, 2016, at 8:13 AM, Shannon Quinn  wrote:
> > >
> > > Agreed, thoroughly enjoying the blog post.
> > >
> > > On 5/19/16 12:01 AM, Andrew Palumbo wrote:
> > > > Well done, Trevor!  I've not yet had a chance to try this in zeppelin
> > > but I just read the blog which is great!
> > > >
> > > >  Original message 
> > > > From: Trevor Grant 
> > > > Date: 05/18/2016 2:44 PM (GMT-05:00)
> > > > To: dev@mahout.apache.org
> > > > Subject: Re: Future Mahout - Zeppelin work
> > > >
> > > > Ah thank you.
> > > >
> > > > Fixing now.
> > > >
> > > >
> > > > Trevor Grant
> > > > Data Scientist
> > > > https://github.com/rawkintrevo
> > > > http://stackexchange.com/users/3002022/rawkintrevo
> > > > http://trevorgrant.org
> > > >
> > > > *"Fortunate is he, who is able to know the causes of things."
> -Virgil*
> > > >
> > > >
> > > > On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo 
> > > wrote:
> > > >
> > > >> Hey Trevor- Just refreshed your readme.  The jar that I mentioned is
> > > >> actually:
> > > >>
> > > >>
> > > >>
> > >
> >
> /home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
> > > >>
> > > >> rather than:
> > > >>
> > > >>
> > > >>
> > >
> >
> /home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
> > > >>
> > > >> (In the spark module that is)
> > > >> 
> > > >> From: Trevor Grant 
> > > >> Sent: Wednesday, May 18, 2016 11:02:43 AM
> > > >> To: dev@mahout.apache.org
> > > >> Subject: Re: Future Mahout - 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Trevor Grant
Hey Pat,

If you spit out a TSV, you can import it into pyspark / matplotlib from the
resource pool in essentially the same way and use that plotting library if
you prefer.  In fact you could import the TSV into pandas and use all of
the pandas plotting as well (though I think that is, for the most part, also
matplotlib with some convenience functions).

https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2ZlbGl4Y2hldW5nL3NwYXJrLW5vdGVib29rLWV4YW1wbGVzL21hc3Rlci9aZXBwZWxpbl9ub3RlYm9vay8yQU1YNUNWQ1Uvbm90ZS5qc29u

In Zeppelin, unless you specify otherwise, pyspark, sparkr, spark-sql, and
scala-spark all share the same Spark context, so you can create RDDs in one
language and access them / work on them in another (as I understand it).

So in Mahout can you "save" a matrix as an RDD? e.g. something like

val myRDD = myDRM.asRDD()

And would 'myRDD' then exist in the spark context?
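To make the hand-off concrete, here is a minimal Python sketch of the consuming side: once a TSV string lands in the resource pool, pulling it into pandas takes one call (the column names and values are made up for illustration; nothing below is Mahout API):

```python
import io

import pandas as pd

# Hypothetical TSV string, e.g. exported from a downsampled Mahout matrix.
tsv = "x\ty\n0.1\t0.2\n0.5\t0.7\n0.9\t0.4\n"

# Parse the TSV into a DataFrame; from here all pandas/matplotlib
# plotting is available, e.g. df.plot.scatter(x="x", y="y").
df = pd.read_csv(io.StringIO(tsv), sep="\t")
print(df.shape)  # (3, 2)
```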


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Fri, May 20, 2016 at 12:21 PM, Pat Ferrel  wrote:

> Agreed.
>
> BTW I don’t want to stall progress but being the most ignorant of plot
> libs, I’ll ask if we should consider python and matplotlib. In another
> project we use python because of the RDD support on Spark though the
> visualizations are extremely limited in our case. If we can pass an RDD to
> pyspark it would allow custom reductions in python before plotting, even
> though we will support many natively in Mahout. I’m guessing that this
> would cross a context boundary and require a write to disk?
>
> So 2 questions:
> 1) what does the inter language support look like with Spark python vs
> SparkR, can we transfer RDDs?
> 2) are the plot libs significantly different?
>
> On May 20, 2016, at 9:54 AM, Trevor Grant 
> wrote:
>
> Dmitriy really nailed it on the head in his reply to the post which I'll
> rebroadcast below. In essence the whole reason you are (theoretically)
> using Mahout is the data is to big to fit in memory.  If it's to big to fit
> in memory, well then its probably too big to plot each point (e.g.
> trillions of row, you only have so many pixels).   For the example I
> randomly sampled a matrix.
>
> So as Dmitriy says, in Mahout we need to have functions that will
> 'preprocess' the data into something plotable.
>
> For the Zepplin-Plotting thing, we need to have a function that will spit
> out a tsv like string of the data we wanted plotted.
>
> I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
> There are a couple of ways to go about it. I opened up the discussion on
> dev@Zeppelin and didn't get any replies. I'm going to take that to mean we
> can do it in a way that makes the most sense to Mahout users...
>
> First steps are to include some methods in Mahout that will do that
> preprocessing, and one that will turn something into a tsv string.
>
> I have some general ideas on possible approached to making an honest-mahout
> interpreter but I want to play in the code and look at the Flink-Mahout
> shell a bit before I try to organize my thoughts and present them.
>
> ...(2) not sure what is the point of supporting distributed anything. It is
> distributed presumably because it is hard to keep it in memory. Therefore,
> plotting anything distributed potentially presents 2 problems: storage
> space and overplotting due to number of points. The idea is that we have to
> work out algorithms that condense big data information into small plottable
> information (like density grids, for example, or histograms)
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel 
> wrote:
>
> > Great job Trevor, we’ll need this detail to smooth out the sharp edges
> and
> > any guidance from you or the Zeppelin community will be a big help.
> >
> >
> > On May 20, 2016, at 8:13 AM, Shannon Quinn  wrote:
> >
> > Agreed, thoroughly enjoying the blog post.
> >
> > On 5/19/16 12:01 AM, Andrew Palumbo wrote:
> >> Well done, Trevor!  I've not yet had a chance to try this in zeppelin
> > but I just read the blog which is great!
> >>
> >>  Original message 
> >> From: Trevor Grant 
> >> Date: 05/18/2016 2:44 PM (GMT-05:00)
> >> To: dev@mahout.apache.org
> >> Subject: Re: Future Mahout - Zeppelin work
> >>
> >> Ah thank you.
> >>
> >> Fixing now.
> >>
> >>
> >> Trevor Grant
> >> Data Scientist
> >> https://github.com/rawkintrevo
> >> http://stackexchange.com/users/3002022/rawkintrevo
> >> http://trevorgrant.org
> >>
> >> *"Fortunate is he, who is able to know the causes of 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Suneel Marthi
On Fri, May 20, 2016 at 12:54 PM, Trevor Grant 
wrote:

> Dmitriy really nailed it on the head in his reply to the post which I'll
> rebroadcast below. In essence the whole reason you are (theoretically)
> using Mahout is the data is to big to fit in memory.  If it's to big to fit
> in memory, well then its probably too big to plot each point (e.g.
> trillions of row, you only have so many pixels).   For the example I
> randomly sampled a matrix.
>
> So as Dmitriy says, in Mahout we need to have functions that will
> 'preprocess' the data into something plotable.
>
> For the Zepplin-Plotting thing, we need to have a function that will spit
> out a tsv like string of the data we wanted plotted.
>
> I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
> There are a couple of ways to go about it. I opened up the discussion on
> dev@Zeppelin and didn't get any replies. I'm going to take that to mean we
> can do it in a way that makes the most sense to Mahout users...
>
> First steps are to include some methods in Mahout that will do that
> preprocessing, and one that will turn something into a tsv string.
>
> I have some general ideas on possible approached to making an honest-mahout
> interpreter but I want to play in the code and look at the Flink-Mahout
> shell a bit before I try to organize my thoughts and present them.
>

FYI Trevor, there's no Flink-Mahout shell today; in large part because the
Flink Shell is still busted on their end and we on the Mahout end have not
had time to muck with it.  What exists today is the Mahout-Spark shell.

>
> ...(2) not sure what is the point of supporting distributed anything. It is
> distributed presumably because it is hard to keep it in memory. Therefore,
> plotting anything distributed potentially presents 2 problems: storage
> space and overplotting due to number of points. The idea is that we have to
> work out algorithms that condense big data information into small plottable
> information (like density grids, for example, or histograms)
>

Agreed, something like sampling x% of points from a DRM (like the visuals I
had from Palumbo for the talk in Vancouver that demonstrated the concept)
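The sampling idea can be sketched in a few lines of plain Python/NumPy standing in for what a DRM-side operation would do (the function name and the in-memory array are assumptions, not Mahout API):

```python
import numpy as np

def sample_rows(m, fraction, seed=None):
    """Keep roughly `fraction` of the rows of `m`, chosen at random.

    Stand-in for sampling a distributed DRM down to a plottable size;
    here `m` is just an in-memory NumPy array.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(m.shape[0]) < fraction
    return m[mask]

big = np.random.rand(100_000, 2)         # pretend this is too big to plot
small = sample_rows(big, 0.01, seed=42)  # keep ~1% of the points
print(small.shape)                       # roughly (1000, 2)
```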


>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>
>
> On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel 
> wrote:
>
> > Great job Trevor, we’ll need this detail to smooth out the sharp edges
> and
> > any guidance from you or the Zeppelin community will be a big help.
> >
> >
> > On May 20, 2016, at 8:13 AM, Shannon Quinn  wrote:
> >
> > Agreed, thoroughly enjoying the blog post.
> >
> > On 5/19/16 12:01 AM, Andrew Palumbo wrote:
> > > Well done, Trevor!  I've not yet had a chance to try this in zeppelin
> > but I just read the blog which is great!
> > >
> > >  Original message 
> > > From: Trevor Grant 
> > > Date: 05/18/2016 2:44 PM (GMT-05:00)
> > > To: dev@mahout.apache.org
> > > Subject: Re: Future Mahout - Zeppelin work
> > >
> > > Ah thank you.
> > >
> > > Fixing now.
> > >
> > >
> > > Trevor Grant
> > > Data Scientist
> > > https://github.com/rawkintrevo
> > > http://stackexchange.com/users/3002022/rawkintrevo
> > > http://trevorgrant.org
> > >
> > > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> > >
> > >
> > > On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo 
> > wrote:
> > >
> > >> Hey Trevor- Just refreshed your readme.  The jar that I mentioned is
> > >> actually:
> > >>
> > >>
> > >>
> >
> /home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
> > >>
> > >> rather than:
> > >>
> > >>
> > >>
> >
> /home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
> > >>
> > >> (In the spark module that is)
> > >> 
> > >> From: Trevor Grant 
> > >> Sent: Wednesday, May 18, 2016 11:02:43 AM
> > >> To: dev@mahout.apache.org
> > >> Subject: Re: Future Mahout - Zeppelin work
> > >>
> > >> ah yes- I remember you pointing that out to me too.
> > >>
> > >> I got side tracked yesterday for most of the day on an adventure in
> > getting
> > >> Zeppelin to work right after I accidently updated to the new snapshot
> > (free
> > >> hint: the secret was to clear my cache *face-palm*)
> > >>
> > >> I'm going to add that dependency to the readme.md now.
> > >>
> > >> thanks,
> > >> tg
> > >>
> > >> Trevor Grant
> > >> Data Scientist
> > >> https://github.com/rawkintrevo
> > >> http://stackexchange.com/users/3002022/rawkintrevo
> > >> http://trevorgrant.org
> > >>
> > >> *"Fortunate is he, who is able to know the 

Re: Future Mahout - Zeppelin work

2016-05-20 Thread Pat Ferrel
Agreed.

BTW I don’t want to stall progress but, being the most ignorant of plot libs, 
I’ll ask whether we should consider Python and matplotlib. In another project we use 
Python because of the RDD support on Spark, though the visualizations are 
extremely limited in our case. If we can pass an RDD to pyspark it would allow 
custom reductions in Python before plotting, even though we will support many 
natively in Mahout. I’m guessing that this would cross a context boundary and 
require a write to disk?

So 2 questions:
1) What does the inter-language support look like with Spark Python vs SparkR; 
can we transfer RDDs? 
2) Are the plot libs significantly different?

On May 20, 2016, at 9:54 AM, Trevor Grant  wrote:

Dmitriy really nailed it on the head in his reply to the post which I'll
rebroadcast below. In essence the whole reason you are (theoretically)
using Mahout is the data is to big to fit in memory.  If it's to big to fit
in memory, well then its probably too big to plot each point (e.g.
trillions of row, you only have so many pixels).   For the example I
randomly sampled a matrix.

So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plotable.

For the Zepplin-Plotting thing, we need to have a function that will spit
out a tsv like string of the data we wanted plotted.

I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
dev@Zeppelin and didn't get any replies. I'm going to take that to mean we
can do it in a way that makes the most sense to Mahout users...

First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.

I have some general ideas on possible approached to making an honest-mahout
interpreter but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.

...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel  wrote:

> Great job Trevor, we’ll need this detail to smooth out the sharp edges and
> any guidance from you or the Zeppelin community will be a big help.
> 
> 
> On May 20, 2016, at 8:13 AM, Shannon Quinn  wrote:
> 
> Agreed, thoroughly enjoying the blog post.
> 
> On 5/19/16 12:01 AM, Andrew Palumbo wrote:
>> Well done, Trevor!  I've not yet had a chance to try this in zeppelin
> but I just read the blog which is great!
>> 
>>  Original message 
>> From: Trevor Grant 
>> Date: 05/18/2016 2:44 PM (GMT-05:00)
>> To: dev@mahout.apache.org
>> Subject: Re: Future Mahout - Zeppelin work
>> 
>> Ah thank you.
>> 
>> Fixing now.
>> 
>> 
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/rawkintrevo
>> http://trevorgrant.org
>> 
>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>> 
>> 
>> On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo 
> wrote:
>> 
>>> Hey Trevor- Just refreshed your readme.  The jar that I mentioned is
>>> actually:
>>> 
>>> 
>>> 
> /home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>> 
>>> rather than:
>>> 
>>> 
>>> 
> /home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
>>> 
>>> (In the spark module that is)
>>> 
>>> From: Trevor Grant 
>>> Sent: Wednesday, May 18, 2016 11:02:43 AM
>>> To: dev@mahout.apache.org
>>> Subject: Re: Future Mahout - Zeppelin work
>>> 
>>> ah yes- I remember you pointing that out to me too.
>>> 
>>> I got side tracked yesterday for most of the day on an adventure in
> getting
>>> Zeppelin to work right after I accidently updated to the new snapshot
> (free
>>> hint: the secret was to clear my cache *face-palm*)
>>> 
>>> I'm going to add that dependency to the readme.md now.
>>> 
>>> thanks,
>>> tg
>>> 
>>> Trevor Grant
>>> Data Scientist
>>> https://github.com/rawkintrevo
>>> http://stackexchange.com/users/3002022/rawkintrevo
>>> http://trevorgrant.org
>>> 
>>> *"Fortunate is he, who 

Re: [Hello] from NASa

2016-05-20 Thread Suneel Marthi
Welcome to the project Steven!!

On Fri, May 20, 2016 at 10:07 AM, Steven NASa  wrote:

> Hi Folk & Masters,
>
> My name is *NASa*. I am now working for an e-commerce B2C company in China,
> dealing with Transaction Process development in C++ & Java on Linux
> environment.
>
> As you know, *Recommender System* is quite valuable and important to an
> e-commerce online shopping website like Amazon. I was told and required to
> design and implement a Recommender System which can bring some value to my
> Company. Our system is based on C++ code, so I was searching for a robust
> machine learning framework in C++ that could help me easily implement a
> Recommender System. I did not find one that satisfies my
> requirements, only some C++ math libraries.
>
> Our system is built on internal distributed frameworks, like RPC and DB
> access, in a Linux environment using C++. But I find
> it really inconvenient to implement a Recommender System in C++ from
> scratch without distributed computing library support, like
> implementing *Collaborative
> Filtering* with SVD in a distributed way. So I am trying to find
> a framework/library which is designed for distributed systems. That is how I
> came to *Mahout*.
>
> I wish I can build a library that can help people easily and quickly build
> up a Recommender System based on Distributed System and also use the
> Machine Learning Algorithms in distributed way. Apache has many amazing
> projects which can help people to build up robust distributed system
> easily. So I am moving to using “Java” environment.
>
> I am new to *Mahout* and *Hadoop*, *Spark*, *Scala* and I learned Andrew
> Ng’s “Machine Learning” from Coursera
> . So I have
> the basic knowledge of Machine Learning, and now I am moving forward to
> *Deep
> Learning*, *Convex Optimization*, and other mathematical optimization
> implementations. I am still learning and getting familiar with Mahout. I
> hope I can contribute some code to Mahout in the near future,
> learning by coding and coding by learning.
> NASa 2016/05/20
> ​
>


Re: [Hello] from NASa

2016-05-20 Thread Andrew Musselman
Steven, thanks for reaching out, and welcome to the project!

If you want to discuss how to build a recommender system, the user list is
probably more appropriate, and we all hang out there too.

If you'd like to contribute to the project, dev is the right list. Let us
know if you have any trouble getting up and running and we can help out.

Best
Andrew

On Fri, May 20, 2016 at 7:07 AM, Steven NASa  wrote:

> Hi Folk & Masters,
>
> My name is *NASa*. I am now working for an e-commerce B2C company in China,
> dealing with Transaction Process development in C++ & Java on Linux
> environment.
>
> As you know, *Recommender System* is quite valuable and important to an
> e-commerce online shopping website like Amazon. I was told and required to
> design and implement a Recommender System which can bring some value to my
> Company. Our system is based on C++ code, so I was searching for a robust
> machine learning framework in C++ that could help me easily implement a
> Recommender System. I did not find one that satisfies my
> requirements, only some C++ math libraries.
>
> Our system is built on internal distributed frameworks, like RPC and DB
> access, in a Linux environment using C++. But I find
> it really inconvenient to implement a Recommender System in C++ from
> scratch without distributed computing library support, like
> implementing *Collaborative
> Filtering* with SVD in a distributed way. So I am trying to find
> a framework/library which is designed for distributed systems. That is how I
> came to *Mahout*.
>
> I wish I can build a library that can help people easily and quickly build
> up a Recommender System based on Distributed System and also use the
> Machine Learning Algorithms in distributed way. Apache has many amazing
> projects which can help people to build up robust distributed system
> easily. So I am moving to using “Java” environment.
>
> I am new to *Mahout* and *Hadoop*, *Spark*, *Scala* and I learned Andrew
> Ng’s “Machine Learning” from Coursera
> . So I have
> the basic knowledge of Machine Learning, and now I am moving forward to
> *Deep
> Learning*, *Convex Optimization*, and other mathematical optimization
> implementations. I am still learning and getting familiar with Mahout. I
> hope I can contribute some code to Mahout in the near future,
> learning by coding and coding by learning.
> NASa 2016/05/20
> ​
>


Re: Future Mahout - Zeppelin work

2016-05-20 Thread Trevor Grant
Dmitriy really nailed it on the head in his reply to the post, which I'll
rebroadcast below. In essence, the whole reason you are (theoretically)
using Mahout is that the data is too big to fit in memory.  If it's too big to fit
in memory, then it's probably too big to plot each point (e.g.
trillions of rows; you only have so many pixels).  For the example I
randomly sampled a matrix.

So as Dmitriy says, in Mahout we need to have functions that will
'preprocess' the data into something plottable.

For the Zeppelin-plotting piece, we need to have a function that will spit
out a TSV-like string of the data we want plotted.

I agree an honest Mahout interpreter in Zeppelin is probably worth doing.
There are a couple of ways to go about it. I opened up the discussion on
dev@Zeppelin and didn't get any replies. I'm going to take that to mean we
can do it in a way that makes the most sense to Mahout users...

First steps are to include some methods in Mahout that will do that
preprocessing, and one that will turn something into a tsv string.
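A rough sketch of that second piece, the "turn something into a TSV string" helper, in Python (the `%table` prefix is what Zeppelin's display system recognizes as tabular output; the function itself is hypothetical, not existing Mahout or Zeppelin API):

```python
def matrix_to_tsv(rows, headers):
    """Render a small, already-condensed matrix as a Zeppelin table string.

    `rows` is any iterable of row iterables; `headers` names the columns.
    """
    lines = ["\t".join(headers)]
    for row in rows:
        lines.append("\t".join(str(v) for v in row))
    # Zeppelin renders paragraph output that starts with %table as a table.
    return "%table\n" + "\n".join(lines)

print(matrix_to_tsv([[1, 2], [3, 4]], ["x", "y"]))
```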

I have some general ideas on possible approaches to making an honest-Mahout
interpreter, but I want to play in the code and look at the Flink-Mahout
shell a bit before I try to organize my thoughts and present them.

...(2) not sure what is the point of supporting distributed anything. It is
distributed presumably because it is hard to keep it in memory. Therefore,
plotting anything distributed potentially presents 2 problems: storage
space and overplotting due to number of points. The idea is that we have to
work out algorithms that condense big data information into small plottable
information (like density grids, for example, or histograms)
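As one concrete example of that condensation step, a density grid reduces an arbitrarily large point cloud to a fixed-size array of counts; a NumPy sketch with stand-in data (not a distributed implementation):

```python
import numpy as np

# Stand-in for a big point cloud: one million 2-D points.
pts = np.random.randn(1_000_000, 2)

# Condense to a 50x50 grid of counts -- constant size no matter how
# many input points there were, and cheap to plot as a heatmap.
grid, xedges, yedges = np.histogram2d(pts[:, 0], pts[:, 1], bins=50)
print(grid.shape)  # (50, 50)
```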

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Fri, May 20, 2016 at 10:22 AM, Pat Ferrel  wrote:

> Great job Trevor, we’ll need this detail to smooth out the sharp edges and
> any guidance from you or the Zeppelin community will be a big help.
>
>
> On May 20, 2016, at 8:13 AM, Shannon Quinn  wrote:
>
> Agreed, thoroughly enjoying the blog post.
>
> On 5/19/16 12:01 AM, Andrew Palumbo wrote:
> > Well done, Trevor!  I've not yet had a chance to try this in zeppelin
> but I just read the blog which is great!
> >
> >  Original message 
> > From: Trevor Grant 
> > Date: 05/18/2016 2:44 PM (GMT-05:00)
> > To: dev@mahout.apache.org
> > Subject: Re: Future Mahout - Zeppelin work
> >
> > Ah thank you.
> >
> > Fixing now.
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Wed, May 18, 2016 at 1:04 PM, Andrew Palumbo 
> wrote:
> >
> >> Hey Trevor- Just refreshed your readme.  The jar that I mentioned is
> >> actually:
> >>
> >>
> >>
> /home/username/.m2/repository/org/apache/mahout/mahout-spark_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
> >>
> >> rather than:
> >>
> >>
> >>
> /home/username/.m2/repository/org/apache/mahout/mahout-spark-shell_2.10/0.12.1-SNAPSHOT/mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
> >>
> >> (In the spark module that is)
> >> 
> >> From: Trevor Grant 
> >> Sent: Wednesday, May 18, 2016 11:02:43 AM
> >> To: dev@mahout.apache.org
> >> Subject: Re: Future Mahout - Zeppelin work
> >>
> >> ah yes- I remember you pointing that out to me too.
> >>
> >> I got side tracked yesterday for most of the day on an adventure in
> getting
> >> Zeppelin to work right after I accidently updated to the new snapshot
> (free
> >> hint: the secret was to clear my cache *face-palm*)
> >>
> >> I'm going to add that dependency to the readme.md now.
> >>
> >> thanks,
> >> tg
> >>
> >> Trevor Grant
> >> Data Scientist
> >> https://github.com/rawkintrevo
> >> http://stackexchange.com/users/3002022/rawkintrevo
> >> http://trevorgrant.org
> >>
> >> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >>
> >>
> >> On Wed, May 18, 2016 at 9:59 AM, Andrew Palumbo 
> >> wrote:
> >>
> >>> Trevor this is very cool- I have not been able to look at it closely
> yet
> >>> but just a small point: I believe that you'll also need to add the
> >>>
> >>> mahout-spark_2.10-0.12.1-SNAPSHOT-dependency-reduced.jar
> >>>
> >>> For things like the classification stats, confusion matrix, and
> t-digest.
> >>>
> >>> Andy
> >>>
> >>> 
> >>> From: Trevor Grant 
> >>> Sent: Wednesday, May 18, 2016 10:47:21 AM
> >>> To: dev@mahout.apache.org
> >>> Subject: Re: Future Mahout - Zeppelin work
> 

Re: [Hello] from NASa

2016-05-20 Thread Khurrum Nasim
Sounds more like demand prediction to me.   

However, your system should be able to interact with other non-C/C++ systems.
There is something called Apache Thrift.

Which brings me to the following - would it be a valuable feature for the
Mahout library to provide connectivity with other systems using Thrift?

Thoughts?

Khurrum

p.s. Andrew Ng can put you to sleep easily. 


> On May 20, 2016, at 10:07 AM, Steven NASa  wrote:
> 
> Hi Folks & Masters,
>
> My name is *NASa*. I am now working for an e-commerce B2C company in China,
> dealing with transaction-process development in C++ & Java on Linux.
>
> As you know, a *Recommender System* is quite valuable and important to an
> e-commerce online shopping website like Amazon. I was asked to design and
> implement a Recommender System that can bring some value to my company. Our
> system is written in C++, so I was searching for a robust machine learning
> framework in C++ which could help me easily implement a Recommender System.
> I did not find one which satisfies my requirements, only some C++ math
> libraries.
>
> Our system is built on internal distributed frameworks (RPC, DB access) on
> Linux, in C++. But I find it really inconvenient to implement a Recommender
> System in C++ from scratch without distributed-computing library support,
> e.g. implementing *Collaborative Filtering* with SVD in a distributed way.
> So I am trying to find a framework/library which is designed for
> distributed systems. That is how I came to *Mahout*.
>
> I hope I can build a library that helps people easily and quickly build up
> a Recommender System on a distributed system and also use machine learning
> algorithms in a distributed way. Apache has many amazing projects which
> help people build robust distributed systems easily. So I am moving to the
> Java environment.
>
> I am new to *Mahout*, *Hadoop*, *Spark*, and *Scala*, and I learned Andrew
> Ng's "Machine Learning" from Coursera. So I have the basic knowledge of
> machine learning, and I am now moving forward to *Deep Learning*, *Convex
> Optimization*, and other mathematical optimization topics. I am still
> learning and getting familiar with Mahout. I hope I can contribute some
> code to Mahout in the near future, learning by coding and coding by
> learning.
>
> NASa 2016/05/20



Re: Future Mahout - Zeppelin work

2016-05-20 Thread Pat Ferrel
Great job Trevor, we’ll need this detail to smooth out the sharp edges and any 
guidance from you or the Zeppelin community will be a big help.



Re: Future Mahout - Zeppelin work

2016-05-20 Thread Shannon Quinn

Agreed, thoroughly enjoying the blog post.

From: Trevor Grant 
Sent: Wednesday, May 18, 2016 10:47:21 AM
To: dev@mahout.apache.org
Subject: Re: Future Mahout - Zeppelin work

I still need to update my readme/env per Pat's comments below; however,
without further ado, I present two notebooks that integrate Mahout + Spark +
Zeppelin + ggplot2:

https://github.com/rawkintrevo/mahout-zeppelin

Supposing you have a somewhat recent version of Zeppelin 0.6 with SparkR
support running already, you may import the following raw notebooks directly
into Zeppelin:




https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DLinear%20Regression%20in%20Spark.json

https://raw.githubusercontent.com/rawkintrevo/mahout-zeppelin/master/%5BMAHOUT%5D%5BPROVING-GROUNDS%5DSpark-Mahout%2Bggplot2.json

So my thoughts on next steps, which I'm positing only as a starting point
for discussion, in no particular order of importance:

- Blog on HOWTO for everyman (assumes no familiarity with Mahout, and only
enough familiarity with Zeppelin to have Zeppelin + SparkR support)
- Some syntactic sugar somewhere in Mahout to convert a matrix into a TSV
string (with some sanity checks, e.g. a sample of a matrix)
- Figure out with the Zeppelin community what deeper integration feels like -
e.g. build profile vs. tutorial
   - I think the case for making a build profile is that Zeppelin is first
and foremost a data science tool for non-technical users.
   - If we go that route I'll need some more support finding out what the
absolute minimum 'bare-bones' Mahout is that we can include, e.g. does the
user have to have Mahout installed? To be discussed.
- Add matplotlib (Python) "support" -> a paragraph showing how to do the same
thing in Python.

The basic deal here is we are:
1) Setting up a standard Zeppelin Spark interpreter to act like a Mahout
interpreter
 - This is taken care of by setting some env. variables, adding some
dependencies, and importing relevant packages
2) do Mahout things as you do
3) export the table to a TSV string, which is passed to a resource pool
- This could be done to disk if you didn't have Zeppelin
4) read the TSV from the resource pool (or disk if you didn't have
Zeppelin) in R (Python soon) and create a 

To Pat's point - this is a kind of clumsy pipeline; however, the Zeppelin
wrapper at least makes it *feel* less so.
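Steps 3 and 4 of that pipeline boil down to a TSV round trip; here it is sketched in plain Python with a temp file standing in for Zeppelin's resource pool (the file name is made up):

```python
import csv
import os
import tempfile

# Step 3: the Mahout/Scala side exports its table as a TSV string.
tsv = "x\ty\n1\t2.0\n3\t4.5"
path = os.path.join(tempfile.gettempdir(), "mahout_export.tsv")
with open(path, "w") as f:
    f.write(tsv)

# Step 4: the R/Python side reads the TSV back and rebuilds rows for plotting.
with open(path) as f:
    rows = list(csv.reader(f, delimiter="\t"))
header, data = rows[0], [[float(v) for v in r] for r in rows[1:]]
```

With Zeppelin's resource pool the disk hop goes away, but the hand-off contract is the same: a header line plus tab-separated rows.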


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Tue, May 17, 2016 at 1:17 PM, Pat Ferrel wrote:

Seems like there is plenty to use in ggplot or python but the pipeline is a 

[Hello] from NASa

2016-05-20 Thread Steven NASa
Hi Folks & Masters,

My name is *NASa*. I am now working for an e-commerce B2C company in China,
dealing with transaction-process development in C++ & Java on Linux.

As you know, a *Recommender System* is quite valuable and important to an
e-commerce online shopping website like Amazon. I was asked to design and
implement a Recommender System that can bring some value to my company. Our
system is written in C++, so I was searching for a robust machine learning
framework in C++ which could help me easily implement a Recommender System.
I did not find one which satisfies my requirements, only some C++ math
libraries.

Our system is built on internal distributed frameworks (RPC, DB access) on
Linux, in C++. But I find it really inconvenient to implement a Recommender
System in C++ from scratch without distributed-computing library support,
e.g. implementing *Collaborative Filtering* with SVD in a distributed way.
So I am trying to find a framework/library which is designed for distributed
systems. That is how I came to *Mahout*.

I hope I can build a library that helps people easily and quickly build up a
Recommender System on a distributed system and also use machine learning
algorithms in a distributed way. Apache has many amazing projects which help
people build robust distributed systems easily. So I am moving to the Java
environment.

I am new to *Mahout*, *Hadoop*, *Spark*, and *Scala*, and I learned Andrew
Ng's "Machine Learning" from Coursera. So I have the basic knowledge of
machine learning, and I am now moving forward to *Deep Learning*, *Convex
Optimization*, and other mathematical optimization topics. I am still
learning and getting familiar with Mahout. I hope I can contribute some code
to Mahout in the near future, learning by coding and coding by learning.

NASa 2016/05/20