Re: Construct model matrix from SchemaRDD automatically

2015-03-05 Thread Evan R. Sparks
Hi Wush,

I'm CC'ing user@spark.apache.org (which is the new list) and BCC'ing
u...@spark.incubator.apache.org.

In Spark 1.3, schemaRDD is in fact being renamed to DataFrame (see:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
)

As for a model.matrix, you might have a look at the new pipelines API in
spark 1.2 (to be further improved in 1.3) which provides facilities for
repeatable data transformation as input to ML algorithms. That said -
something to handle the case of automatically one-hot encoding all the
categorical variables in a DataFrame might be a welcome addition.
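
In the meantime, a rough hand-rolled sketch of one-hot encoding a single
categorical column (not the pipelines API; the variable df and the column
positions are hypothetical, with the category in column 0 and a numeric label
in column 1):

// df is assumed to be a SchemaRDD/DataFrame, i.e. a collection of Rows.
val categories = df.map(_.getString(0)).distinct().collect()
val index = categories.zipWithIndex.toMap // category -> slot in the encoded vector
val encoded = df.map { row =>
  val vec = Array.fill(categories.length)(0.0)
  vec(index(row.getString(0))) = 1.0
  (vec, row.getDouble(1)) // (features, label)
}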

- Evan

On Thu, Mar 5, 2015 at 8:43 PM, Wush Wu w...@bridgewell.com wrote:

 Dear all,

 I am a new spark user from R.

 After exploring the schemaRDD, I notice that it is similar to data.frame.
 Is there a feature like `model.matrix` in R to convert schemaRDD to model
 matrix automatically according to the type without explicitly converting
 them one by one?

 Thanks,
 Wush








Re: Spark on teradata?

2015-01-08 Thread Evan R. Sparks
Have you taken a look at the TeradataDBInputFormat? Spark is compatible
with arbitrary hadoop input formats - so this might work for you:
http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw
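
The general mechanism looks something like the sketch below, shown with
TextInputFormat just to keep it runnable - you'd swap in the Teradata
connector's InputFormat class and set its connection properties on the conf
(and use sc.hadoopRDD instead if the connector implements the old mapred API).
The config key shown is the Hadoop 2 name.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration()
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///some/path")
// Swap TextInputFormat (and the key/value classes) for the connector's classes.
val rdd = sc.newAPIHadoopRDD(conf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])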

On Thu, Jan 8, 2015 at 10:53 AM, gen tang gen.tan...@gmail.com wrote:

 Thanks a lot for your reply.
 In fact, I need to work on almost all the data in teradata (~100T). So, I
 don't think that jdbcRDD is a good choice.

 Cheers
 Gen


 On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin r...@databricks.com wrote:

 Depending on your use cases. If the use case is to extract small amount
 of data out of teradata, then you can use the JdbcRDD and soon a jdbc input
 source based on the new Spark SQL external data source API.



 On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I have a stupid question:
 Is it possible to use spark on Teradata data warehouse, please? I read
 some news on the internet which says yes. However, I didn't find any examples
 about this.

 Thanks in advance.

 Cheers
 Gen






Re: Spark and Stanford CoreNLP

2014-11-25 Thread Evan R. Sparks
Chris,

Thanks for stopping by! Here's a simple example. Imagine I've got a corpus
of data, which is an RDD[String], and I want to do some POS tagging on it.
In naive spark, that might look like this:

val props = new Properties()
props.setProperty("annotators", "pos")
val proc = new StanfordCoreNLP(props)
val data = sc.textFile("hdfs://some/distributed/corpus")

def processData(s: String): Annotation = {
  val a = new Annotation(s)
  proc.annotate(a)
  a
}

val processedData = data.map(processData) //Note that this is actually
executed lazily.

Under the covers, spark takes the closure (processData), serializes it and
all objects/methods that it references (including the proc), and ships
the serialized closure off to workers so that they can run it on their
local partitions of the corpus. The issue at hand is that since the
StanfordCoreNLP object isn't serializable, *this will fail at runtime.* Hence
the solutions to this problem suggested in this thread, which all come down
to initializing the processor on the worker side (preferably once).
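
A minimal sketch of that worker-side fix, building the processor once per
partition (same imports as above, for completeness):

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val processedData = data.mapPartitions { iter =>
  // Build the (non-serializable) processor on the worker, once per partition.
  val props = new Properties()
  props.setProperty("annotators", "pos")
  val proc = new StanfordCoreNLP(props)
  iter.map { s =>
    val a = new Annotation(s)
    proc.annotate(a)
    a
  }
}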

Your intuition about not wanting to serialize huge objects is fine. This
issue is not unique to CoreNLP - any Java library which has
non-serializable objects will face this issue.

HTH,
Evan


On Tue, Nov 25, 2014 at 8:05 AM, Christopher Manning mann...@stanford.edu
wrote:

 I’m not (yet!) an active Spark user, but saw this thread on twitter … and
 am involved with Stanford CoreNLP.

 Could someone explain how things need to be to work better with Spark —
 since that would be a useful goal.

 That is, while Stanford CoreNLP is not quite uniform (being developed by
 various people for over a decade), the general approach has always been
 that models should be serializable but that processors should not be. This
 makes sense to me intuitively. It doesn’t really make sense to serialize a
 processor, which often has large mutable data structures used for
 processing.

 But does that not work well with Spark? Do processors need to be
 serializable, and then one needs to go through and make all the elements of
 the processor transient?

 Or what?

 Thanks!

 Chris


  On Nov 25, 2014, at 7:54 AM, Evan Sparks evan.spa...@gmail.com wrote:
 
  If you only mark it as transient, then the object won't be serialized,
 and on the worker the field will be null. When the worker goes to use it,
 you get an NPE.
 
  Marking it lazy defers initialization to first use. If that use happens
 to be after serialization time (e.g. on the worker), then the worker will
 first check to see if it's initialized, and then initialize it if not.
 
  I think if you *do* reference the lazy val before serializing you will
 likely get an NPE.
 
 
  On Nov 25, 2014, at 1:05 AM, Theodore Vasiloudis 
 theodoros.vasilou...@gmail.com wrote:
 
  Great, Ian's approach seems to work fine.
 
  Can anyone provide an explanation as to why this works, but passing the
  CoreNLP object itself
  as transient does not?
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Stanford-CoreNLP-tp19654p19739.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 





Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan R. Sparks
This is probably not the right venue for general questions on CoreNLP - the
project website (http://nlp.stanford.edu/software/corenlp.shtml) provides
documentation and links to mailing lists/stack overflow topics.

On Mon, Nov 24, 2014 at 9:08 AM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:

 Hello,

 I'm new to Stanford CoreNLP. Could anyone share good training material
 and examples (Java or Scala) on NLP?

 Regards,
 Rajesh

 On Mon, Nov 24, 2014 at 9:38 PM, Ian O'Connell i...@ianoconnell.com
 wrote:


 object MyCoreNLP {
   @transient lazy val coreNLP = new coreNLP()
 }

 and then refer to it from your map/reduce/mapPartitions and it
 should be fine (presuming it's thread safe); it will only be initialized
 once per classloader per JVM

 On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks evan.spa...@gmail.com
 wrote:

 We have gotten this to work, but it requires instantiating the CoreNLP
 object on the worker side. Because of the initialization time it makes a
 lot of sense to do this inside of a .mapPartitions instead of a .map, for
 example.

 As an aside, if you're using it from Scala, have a look at sistanlp,
 which provided a nicer, scala-friendly interface to CoreNLP.


  On Nov 24, 2014, at 7:46 AM, tvas theodoros.vasilou...@gmail.com
 wrote:
 
  Hello,
 
  I was wondering if anyone has gotten the Stanford CoreNLP Java library
 to
  work with Spark.
 
  My attempts to use the parser/annotator fail because of task
 serialization
  errors since the class
  StanfordCoreNLP cannot be serialized.
 
  I've tried the remedies of registering StanfordCoreNLP through kryo,
 as well
  as using chill.MeatLocker,
  but these still produce serialization errors.
  Passing the StanfordCoreNLP object as transient leads to a
  NullPointerException instead.
 
  Has anybody managed to get this work?
 
  Regards,
  Theodore
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Stanford-CoreNLP-tp19654.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Evan R. Sparks
Additionally - I strongly recommend using OpenBLAS over the Atlas build
from the default Ubuntu repositories. Alternatively, you can build ATLAS on
the hardware you're actually going to be running the matrix ops on (the
master/workers), but we've seen modest performance gains doing this vs.
OpenBLAS, at least on the bigger EC2 machines (e.g. cc2.8xlarge,
c3.8xlarge).

On Mon, Nov 24, 2014 at 11:26 AM, Xiangrui Meng men...@gmail.com wrote:

 Try building Spark with -Pnetlib-lgpl, which includes the JNI library
 in the Spark assembly jar. This is the simplest approach. If you want
 to include it as part of your project, make sure the library is inside
 the assembly jar or you specify it via `--jars` with spark-submit.
 -Xiangrui

 On Mon, Nov 24, 2014 at 8:51 AM, agg212 alexander_galaka...@brown.edu
 wrote:
  Hi, i'm trying to improve performance for Spark's Mllib, and I am having
  trouble getting native netlib-java libraries installed/recognized by
 Spark.
  I am running on a single machine, Ubuntu 14.04 and here is what I've
 tried:
 
  sudo apt-get install libgfortran3
  sudo apt-get install libatlas3-base libopenblas-base (this is how
  netlib-java's website says to install it)
 
  I also double checked and it looks like the libraries are linked
 correctly
  in /usr/lib (see below):
  /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
  /usr/lib/liblapack.so.3 -> /etc/alternatives/liblapack.so.3
 
 
  The Dependencies section on Spark's Mllib website also says to include
  com.github.fommil.netlib:all:1.1.2 as a dependency.  I therefore tried
  adding this to my sbt file like so:
 
  libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2"
 
  After all this, i'm still seeing the following error message.  Does
 anyone
  have more detailed installation instructions?
 
  14/11/24 16:49:29 WARN BLAS: Failed to load implementation from:
  com.github.fommil.netlib.NativeSystemBLAS
  14/11/24 16:49:29 WARN BLAS: Failed to load implementation from:
  com.github.fommil.netlib.NativeRefBLAS
 
  Thanks!
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan R. Sparks
Neat hack! This is cute and actually seems to work. The fact that it works
is a little surprising and somewhat unintuitive.

On Mon, Nov 24, 2014 at 8:08 AM, Ian O'Connell i...@ianoconnell.com wrote:


 object MyCoreNLP {
   @transient lazy val coreNLP = new coreNLP()
 }

 and then refer to it from your map/reduce/mapPartitions and it should
 be fine (presuming it's thread safe); it will only be initialized once per
 classloader per JVM

 On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks evan.spa...@gmail.com
 wrote:

 We have gotten this to work, but it requires instantiating the CoreNLP
 object on the worker side. Because of the initialization time it makes a
 lot of sense to do this inside of a .mapPartitions instead of a .map, for
 example.

 As an aside, if you're using it from Scala, have a look at sistanlp,
 which provided a nicer, scala-friendly interface to CoreNLP.


  On Nov 24, 2014, at 7:46 AM, tvas theodoros.vasilou...@gmail.com
 wrote:
 
  Hello,
 
  I was wondering if anyone has gotten the Stanford CoreNLP Java library
 to
  work with Spark.
 
  My attempts to use the parser/annotator fail because of task
 serialization
  errors since the class
  StanfordCoreNLP cannot be serialized.
 
  I've tried the remedies of registering StanfordCoreNLP through kryo, as
 well
  as using chill.MeatLocker,
  but these still produce serialization errors.
  Passing the StanfordCoreNLP object as transient leads to a
  NullPointerException instead.
 
  Has anybody managed to get this work?
 
  Regards,
  Theodore
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Stanford-CoreNLP-tp19654.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Evan R. Sparks
You can try recompiling spark with that option, and doing an sbt/sbt
publish-local, then change your spark version from 1.1.0 to 1.2.0-SNAPSHOT
(assuming you're building from the 1.1 branch) - sbt or maven (whichever
you're compiling your app with) will pick up the version of spark that you
just built.
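
For example, the only change in your application's build file would be
something like this (the exact version string depends on what your checkout
publishes):

// build.sbt of your application, pointing at the locally published snapshot
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0-SNAPSHOT"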

On Mon, Nov 24, 2014 at 6:31 PM, agg212 alexander_galaka...@brown.edu
wrote:

 I am running it in local.  How can I use the built version (in local mode)
 so
 that I can use the native libraries?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662p19705.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Evan R. Sparks
For sharing RDDs across multiple jobs - you could also have a look at
Tachyon. It provides an HDFS compatible in-memory storage layer that keeps
data in memory across multiple jobs/frameworks - http://tachyon-project.org/
.
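
A rough sketch of the idea (hypothetical host/paths; Tachyon has to be set up
as described in its docs):

// One job writes an RDD to Tachyon's in-memory storage...
val rdd = sc.parallelize(1 to 100).map(_.toString)
rdd.saveAsTextFile("tachyon://tachyon-master:19998/shared/corpus")
// ...and a later job (even from a different SparkContext) reads it back.
val shared = sc.textFile("tachyon://tachyon-master:19998/shared/corpus")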

-

On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal sonalgoy...@gmail.com wrote:

 I believe the Spark Job Server by Ooyala can help you share data across
 multiple jobs, take a look at
 http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It
 seems to fit closely to what you need.

 Best Regards,
 Sonal
 Founder, Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal



 On Tue, Nov 11, 2014 at 7:20 PM, bethesda swearinge...@mac.com wrote:

 We are relatively new to spark and so far have been manually submitting
 single jobs at a time for ML training, during our development process,
 using
 spark-submit.  Each job accepts a small user-submitted data set and
 compares
 it to every data set in our hdfs corpus, which only changes incrementally
 on
 a daily basis.  (that detail is relevant to question 3 below)

 Now we are ready to start building out the front-end, which will allow a
 team of data scientists to submit their problems to the system via a web
 front-end (web tier will be java).  Users could of course be submitting
 jobs
 more or less simultaneously.  We want to make sure we understand how to
 best
 structure this.

 Questions:

 1 - Does a new SparkContext get created in the web tier for each new
 request
 for processing?

 2 - If so, how much time should we expect it to take for setting up the
 context?  Our goal is to return a response to the users in under 10
 seconds,
 but if it takes many seconds to create a new context or otherwise set up
 the
 job, then we need to adjust our expectations for what is possible.  From
 using spark-shell one might conclude that it might take more than 10
 seconds
 to create a context, however it's not clear how much of that is
 context-creation vs other things.

 3 - (This last question perhaps deserves a post in and of itself:) if
 every
 job is always comparing some little data structure to the same HDFS corpus
 of data, what is the best pattern to use to cache the RDD's from HDFS so
 they don't have to always be re-constituted from disk?  I.e. how can RDD's
 be shared from the context of one job to the context of subsequent jobs?
 Or does something like memcache have to be used?

 Thanks!
 David



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Evan R. Sparks
There are a few examples where this is the case. Let's take ALS, where the
result is a MatrixFactorizationModel, which is assumed to be big - the
model consists of two matrices, one (users x k) and one (k x products).
These are represented as RDDs.

You can save these RDDs out to disk by doing something like

model.userFeatures.saveAsObjectFile(...) and
model.productFeatures.saveAsObjectFile(...)

to save out to HDFS or Tachyon or S3.

Then, when you want to reload you'd have to instantiate them into a class
of MatrixFactorizationModel. That class is package private to MLlib right
now, so you'd need to copy the logic over to a new class, but that's the
basic idea.
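
A rough sketch of the round trip (paths are placeholders; model is your
trained MatrixFactorizationModel, and the reload side assumes you wrap the
two RDDs in your own class as described above):

// Save the two factor RDDs.
model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

// Later: reload them and plug them into your own prediction logic.
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")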

That said - using spark to serve these recommendations on a point-by-point
basis might not be optimal. There's some work going on in the AMPLab to
address this issue.

On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
 merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
  These all seem basically equivalent to Java serialization to me.

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








Re: why decision trees do binary split?

2014-11-06 Thread Evan R. Sparks
You can imagine this same logic applying to the continuous case. E.g. what
if all the quartiles or deciles of a particular value have different
behavior - this could capture that too. Or what if some combination of
features was highly discriminative, but only into n buckets rather than
two... you can see there are lots of different options here.

In general in MLlib, we're trying to support widely accepted and frequently
used ML models, and simply offer a platform to efficiently train these with
spark. While decision trees with n-ary splits might be a sensible thing to
explore, they are not widely used in practice, and I'd want to see some
compelling results from proper ML/stats researchers before shipping them as
a default feature.

If you're looking for a way to control variance and pick up nuance in your
dataset that's not covered by plain decision trees, I recommend looking at
Random Forests - a well studied extension to decision trees that's also
widely used in practice - and coming to MLlib soon!

On Thu, Nov 6, 2014 at 3:29 AM, Tamas Jambor jambo...@gmail.com wrote:

 Thanks for the reply, Sean.

 I can see that splitting on all the categories would probably overfit
 the tree, on the other hand, it might give more insight on the
 subcategories (probably only would work if the data is uniformly
 distributed between the categories).

 I haven't really found any comparison between the two methods in terms
 of performance and interpretability.

 thanks,

 On Thu, Nov 6, 2014 at 9:46 AM, Sean Owen so...@cloudera.com wrote:
  I haven't seen that done before, which may be most of the reason - I am
 not
  sure that is common practice.
 
  I can see upsides - you need not pick candidate splits to test since
 there
  is only one N-way rule possible. The binary split equivalent is N levels
  instead of 1.
 
  The big problem is that you are always segregating the data set entirely,
  and making the equivalent of those N binary rules, even when you would
 not
  otherwise bother because they don't add information about the target. The
  subsets matching each child are therefore unnecessarily small and this
 makes
  learning on each independent subset weaker.
 
  On Nov 6, 2014 9:36 AM, jamborta jambo...@gmail.com wrote:
 
  I meant above, that in the case of categorical variables it might be
 more
  efficient to create a node on each categorical value. Is there a reason
  why
  spark went down the binary route?
 
  thanks,
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/why-decision-trees-do-binary-split-tp18188p18265.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Evan R. Sparks
Plain old java serialization is one straightforward approach if you're in
java/scala.
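
For example, something like this (a sketch; the path is arbitrary, model is
your trained Word2VecModel, and it assumes the model class is
java-serializable):

import java.io._
import org.apache.spark.mllib.feature.Word2VecModel

// Save: serialize the trained model to a file on the driver.
val oos = new ObjectOutputStream(new FileOutputStream("/tmp/word2vec.model"))
oos.writeObject(model)
oos.close()

// Load it back later.
val ois = new ObjectInputStream(new FileInputStream("/tmp/word2vec.model"))
val restored = ois.readObject().asInstanceOf[Word2VecModel]
ois.close()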

On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Evan R. Sparks
There's some work going on to support PMML -
https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
merged into master.

What are you used to doing in other environments? In R I'm used to running
save(), same with matlab. In python either pickling things or dumping to
json seems pretty common. (even the scikit-learn docs recommend pickling -
http://scikit-learn.org/stable/modules/model_persistence.html). These all
seem basically equivalent to Java serialization to me.

Would some helper functions (in, say, mllib.util.modelpersistence or
something) make sense to add?

On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're in
 java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: How to run kmeans after pca?

2014-09-30 Thread Evan R. Sparks
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized - so it is both doing the multiply and the first pass of
KMeans all at once. To isolate which part is slow you can run
cachedRows.count() to force this to be materialized before you run KMeans.

Also, KMeans is optimized to run quickly on both sparse and dense data. The
result of PCA is going to be dense, but if your input data has #nnz ~=
size(pca data), performance might be about the same. (I haven't actually
verified this last point.)

Finally, speed is partially going to be dependent on how much data you have
relative to scheduler overheads - if your input data is small it could be
that the costs of distributing your task are greater than the time spent
actually computing - usually this would manifest itself in the stages
taking about the same amount of time even though you're passing datasets
that have different dimensionality.

On Tue, Sep 30, 2014 at 9:00 AM, st553 sthompson...@gmail.com wrote:

 Thanks for your response Burak it was very helpful.

 I am noticing that if I run PCA before KMeans that the KMeans algorithm
 will
 actually take longer to run than if I had just run KMeans without PCA. I
 was
 hoping that by using PCA first it would actually speed up the KMeans
 algorithm.

 I have followed the steps you've outlined but Im wondering if I need to
 cache/persist the RDD[Vector] rows of the RowMatrix returned after
 multiplying. Something like:

 val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)
 val cachedRows = newData.rows.persist()
 KMeans.run(cachedRows)
 cachedRows.unpersist()

 It doesnt seem intuitive to me that a smaller dimensional version of my
 data
 set would take longer for KMeans... unless Im missing something?

 Thanks!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-kmeans-after-pca-tp14473p15409.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark1.0 principal component analysis

2014-09-23 Thread Evan R. Sparks
In its current implementation, the principal components are computed in
MLlib in two steps:
1) In a distributed fashion, compute the covariance matrix - the result is
a local matrix.
2) On this local matrix, compute the SVD.

The sorting comes from the SVD. If you want to get the eigenvalues out, you
can simply run step 1 yourself on your RowMatrix via the (experimental)
computeCovariance() method, and then run SVD on the result using a library
like breeze.
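
Concretely, something like the following sketch (made-up data; breeze ships
with MLlib, and the singular values of a covariance matrix are its
eigenvalues, sorted in decreasing order):

import breeze.linalg.{svd, DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)))
val mat = new RowMatrix(rows)

// Step 1: distributed covariance computation (returns a small local Matrix).
val cov = mat.computeCovariance()

// Step 2: local SVD of the covariance via breeze (both are column-major).
val covBreeze = new BDM(cov.numRows, cov.numCols, cov.toArray)
val svd.SVD(u, s, vt) = svd(covBreeze)
println(s) // eigenvalues of the covariance matrix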

- Evan



On Tue, Sep 23, 2014 at 12:49 PM, st553 sthompson...@gmail.com wrote:

 sowen wrote
  it seems that the singular values from the SVD aren't returned, so I
 don't
  know that you can access this directly

 It's not clear to me why these aren't returned. The S matrix would be useful
 to determine a reasonable value for K.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-principal-component-analysis-tp9249p14919.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Message Passing among workers

2014-09-03 Thread Evan R. Sparks
Asynchrony is not supported directly - spark's programming model is
naturally BSP. I have seen cases where people have instantiated actors with
akka on worker nodes to enable message passing, or even used spark's own
ActorSystem to do this. But, I do not recommend this, since you lose a
bunch of benefits of spark - e.g. fault tolerance.

Instead, I would think about whether your algorithm can be cast as a BSP
one, or think about how frequently you really need to synchronize state
among your workers. It may be that having the occasional synchronization
barrier is OK.


On Wed, Sep 3, 2014 at 7:28 AM, laxmanvemula laxman8...@gmail.com wrote:

 Hi,

 I would like to implement an asynchronous distributed optimization
 algorithm
 where workers communicate among one another. It is similar to belief
 propagation where each worker is a vertex in the graph. Can some one let me
 know if this is possible using spark?

 Thanks,
 Laxman



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Message-Passing-among-workers-tp13355.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: mllib performance on cluster

2014-09-03 Thread Evan R. Sparks
I spoke with SK offline about this, it looks like the difference in timings
came from the fact that he was training 100 models for 100 iterations and
taking the total time (vs. my example which trains a single model for 100
iterations). I'm posting my response here, though, because I think it's
worth documenting:

Benchmarking on a dataset this small on this many cores is probably not
going to give you any meaningful information about how the algorithms scale
to real data problems.

In this case, you've thrown 200 cores at 5.6kb of data - 200
low-dimensional data points. The overheads of scheduling tasks, sending
them out to each worker, and network latencies between the nodes, which are
essentially fixed regardless of problem size, are COMPLETELY dominating the
time spent computing - which in the first two cases is 9-10 flops per data
point and in the last case is a couple of array lookups and adds per data
point.

It would make a lot more sense to find or generate a dataset that's 10 or
100GB and see how performance scales there. You can do this with the code I
pasted earlier, just change the second, third, and fourth arguments to an
appropriate number of elements, dimensionality, and number of partitions
that matches the number of cores you have on your cluster.

In short, don't use a cluster unless you need one :).

Hope this helps!


On Tue, Sep 2, 2014 at 3:51 PM, SK skrishna...@gmail.com wrote:

 The dataset is quite small : 5.6 KB.  It has 200 rows and 3 features, and 1
 column of labels.  From this dataset, I split 80% for training set and 20%
 for test set. The features are integer counts and labels are binary (1/0).

 thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290p13311.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
How many iterations are you running? Can you provide the exact details
about the size of the dataset? (how many data points, how many features) Is
this sparse or dense - and for the sparse case, how many non-zeroes? How
many partitions is your data RDD?

For very small datasets the scheduling overheads of shipping tasks across
the cluster and delays due to stragglers can dominate the time actually
doing your parallel computation. If you have too few partitions, you won't
be taking advantage of cluster parallelism, and if you have too many you're
introducing even more of the aforementioned overheads.



On Tue, Sep 2, 2014 at 11:24 AM, SK skrishna...@gmail.com wrote:

 Hi,

 I evaluated the runtime performance of some of the MLlib classification
 algorithms on a local machine and a cluster with 10 nodes. I used
 standalone
 mode and Spark 1.0.1 in both cases. Here are the results for the total
 runtime:
                         Local      Cluster
  Logistic regression    138 sec    336 sec
  SVM                    138 sec    336 sec
  Decision tree           50 sec    132 sec

 My dataset is quite small and my programs are very similar to the mllib
 examples that are included in the Spark distribution. Why is the runtime on
 the cluster significantly higher (almost 3 times) than that on the local
 machine even though the former uses more memory and more nodes? Is it
 because of the communication overhead on the cluster? I would like to know
 if there is something I need to be doing to optimize the performance on the
 cluster or if others have also been getting similar results.

 thanks




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
Hmm... something is fishy here.

That's a *really* small dataset for a spark job, so almost all your time
will be spent in these overheads, but still you should be able to train a
logistic regression model with the default options and 100 iterations in
1s on a single machine.
Are you caching your dataset before training the classifier on it? It's
possible that you're rereading it from disk (or across the internet, maybe)
on every iteration?

From spark-shell:

import org.apache.spark.mllib.util.LogisticRegressionDataGenerator

val dat = LogisticRegressionDataGenerator.generateLogisticRDD(sc, 200, 3,
1e-4, 4, 0.2).cache()

println(dat.count()) //should give 200

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val start = System.currentTimeMillis; val model =
LogisticRegressionWithSGD.train(dat, 100); val delta =
System.currentTimeMillis - start;

println(delta) //On my laptop, 863ms.








On Tue, Sep 2, 2014 at 3:51 PM, SK skrishna...@gmail.com wrote:

 The dataset is quite small : 5.6 KB.  It has 200 rows and 3 features, and 1
 column of labels.  From this dataset, I split 80% for training set and 20%
 for test set. The features are integer counts and labels are binary (1/0).

 thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290p13311.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: reduceByKey to get all associated values

2014-08-07 Thread Evan R. Sparks
Specifically, reduceByKey expects a commutative/associative reduce
operation, and will automatically do this locally before a shuffle, which
means it acts like a combiner in MapReduce terms -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
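
e.g., a minimal sketch of what that buys you (made-up data):

// The reduce function runs within each partition first (map-side combine),
// so only one partially aggregated value per key gets shuffled.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
val sums = pairs.reduceByKey(_ + _) // commutative and associative
sums.collect() // Array((a,4), (b,6))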




On Thu, Aug 7, 2014 at 8:15 AM, Cheng Lian lian.cs@gmail.com wrote:

 The point is that in many cases the operation passed to reduceByKey
 aggregates data into much smaller size, say + and * for integer. String
 concatenation doesn’t actually “shrink” data, thus in your case,
  rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer from a similar performance
  issue. In general, don’t do these unless you have to.

  And in Konstantin’s case, I guess he knows what he’s doing. At least we
  can’t know whether we can help to optimize unless further information
  about the business logic is provided.

 On Aug 7, 2014, at 10:22 PM, chutium teng@gmail.com wrote:

  a long time ago, in Spark Summit 2013, Patrick Wendell said in his talk
 about
  performance
  (
 http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/
 )
 
  that, reduceByKey will be more efficient than groupByKey... he mentioned
  groupByKey copies all data over network.
 
   is that still true? which one should we choose? because actually we can
   replace all uses of groupByKey with reduceByKey
 
  for example, if we want to use groupByKey on a RDD[ String, String ], to
 get
  a RDD[ String, Seq[String] ],
 
  we can also do it with reduceByKey:
  at first, map RDD[ String, String ] to RDD[ String, Seq[String] ]
  then, reduceByKey(_ ++ _) on this RDD[ String, Seq[String] ]
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKey-to-get-all-associated-values-tp11645p11652.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Evan R. Sparks
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny)
SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
which is in MLlib (Spark 1.0), with a distributed sparse SVD coming in Spark
1.1 (https://issues.apache.org/jira/browse/SPARK-1782). If your data is
sparse (which it often is in social networks), you may have better luck
with this.

I haven't tried the GraphX implementation, but those algorithms are often
well-suited for power-law distributed graphs as you might see in social
networks.

FWIW, I believe you need to square elements of the sigma matrix from the
SVD to get the eigenvalues.
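
e.g., a sketch of that squaring step on a small RowMatrix (the data here is
made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 1.0),
  Vectors.dense(2.0, 1.0, 4.0))))
val svd = mat.computeSVD(3, computeU = false)
// Squaring the singular values of A gives the eigenvalues of A^T A.
val eigsOfAtA = svd.s.toArray.map(s => s * s)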




On Thu, Aug 7, 2014 at 10:20 AM, Sean Owen so...@cloudera.com wrote:

 (-incubator, +user)

 If your matrix is symmetric (and real I presume), and if my linear
 algebra isn't too rusty, then its SVD is its eigendecomposition. The
 SingularValueDecomposition object you get back has U and V, both of
 which have columns that are the eigenvectors.

 There are a few SVDs in the Spark code. The one in mllib is not
 distributed (right?) and is probably not an efficient means of
 computing eigenvectors if you really just want a decomposition of a
 symmetric matrix.

 The one I see in graphx is distributed? I haven't used it though.
 Maybe it could be part of a solution.



 On Thu, Aug 7, 2014 at 2:21 PM, yaochunnan yaochun...@gmail.com wrote:
  Our lab need to do some simulation on online social networks. We need to
  handle a 5000*5000 adjacency matrix, namely, to get its largest
 eigenvalue
  and corresponding eigenvector. Matlab can be used but it is
 time-consuming.
  Is Spark effective in linear algebra calculations and transformations?
 Later
  we would have 500*500 matrix processed. It seems emergent that we
  should find some distributed computation platform.
 
  I see SVD has been implemented and I can get eigenvalues of a matrix
 through
  this API.  But when I want to get both eigenvalues and eigenvectors or at
  least the biggest eigenvalue and the corresponding eigenvector, it seems
  that current Spark doesn't have such API. Is it possible that I write
  eigenvalue decomposition from scratch? What should I do? Thanks a lot!
 
 
  Miles Yao
 
  
  View this message in context: How can I implement eigenvalue
 decomposition
  in Spark?
  Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: [MLLib]:choosing the Loss function

2014-08-07 Thread Evan R. Sparks
The loss functions are reflected in the names of the model
families: SVM is hinge loss, LogisticRegression is logistic loss,
LinearRegression is squared (least squares) loss. These are used internally as arguments to
the SGD and L-BFGS optimizers.
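
In other words, you pick the loss by picking the trainer. A sketch with
made-up data:

import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, SVMWithSGD}
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(1.0, Vectors.dense(0.2, 1.5)))).cache()

val hingeModel    = SVMWithSGD.train(data, 100)                 // hinge loss
val logisticModel = LogisticRegressionWithSGD.train(data, 100)  // logistic loss
val squaredModel  = LinearRegressionWithSGD.train(data, 100)    // squared loss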


On Thu, Aug 7, 2014 at 6:31 PM, SK skrishna...@gmail.com wrote:

 Hi,

 According to the MLLib guide, there seems to be support for different loss
 functions. But I could not find a command line parameter to choose the loss
 function but only found regType to choose the regularization. Does MLLib
 support a parameter to choose  the loss function?

 thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-choosing-the-Loss-function-tp11738.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Computing mean and standard deviation by key

2014-08-01 Thread Evan R. Sparks
Computing the variance is similar to this example, you just need to keep
around the sum of squares as well.

The formula for variance is (sumsq/n) - (sum/n)^2

But with big datasets or large values, you can quickly run into overflow
issues - MLlib handles this by maintaining the average sum of squares
in an online fashion. (see:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala#L83
)

You might consider just calling into the MLlib stats module directly.
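
For the by-key case, one option is to keep a summarizer per key (a sketch;
it assumes MultivariateOnlineSummarizer is accessible in your Spark version -
it is marked DeveloperApi in newer releases):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// Each value becomes a 1-dimensional sample; one summarizer is kept per key.
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 10.0), ("b", 14.0)))
val statsByKey = pairs.combineByKey(
    (v: Double) => new MultivariateOnlineSummarizer().add(Vectors.dense(v)),
    (summ: MultivariateOnlineSummarizer, v: Double) => summ.add(Vectors.dense(v)),
    (s1: MultivariateOnlineSummarizer, s2: MultivariateOnlineSummarizer) => s1.merge(s2))
  .mapValues(s => (s.mean(0), math.sqrt(s.variance(0))))
statsByKey.collect() // Array((key, (mean, stddev)), ...)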


On Fri, Aug 1, 2014 at 1:48 PM, Xu (Simon) Chen xche...@gmail.com wrote:

 I meant not sure how to do variance in one shot :-)

 With mean in hand, you can obvious broadcast the variable, and do another
 map/reduce to calculate variance per key.


 On Fri, Aug 1, 2014 at 4:39 PM, Xu (Simon) Chen xche...@gmail.com wrote:

  val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x, y) =>
  (x._1 + y._1, x._2 + y._2)).collect

 This gives you a list of (key, (tot, count)), which you can easily
 calculate the mean. Not sure about variance.


 On Fri, Aug 1, 2014 at 2:55 PM, kriskalish k...@kalish.net wrote:

 I have what seems like a relatively straightforward task to accomplish,
 but I
 cannot seem to figure it out from the Spark documentation or searching
 the
 mailing list.

 I have an RDD[(String, MyClass)] that I would like to group by the key,
 and
 calculate the mean and standard deviation of the foo field of MyClass.
 It
 feels like I should be able to use group by to get an RDD for each
 unique
 key, but it gives me an iterable.

 As in:

 val grouped = rdd.groupByKey()

  grouped.foreach{ g =>
     val mean = g.map( x => x.foo).mean()
     val dev = g.map( x => x.foo ).stddev()
     // do fancy things with the mean and deviation
  }

 However, there seems to be no way to convert the iterable into an RDD. Is
 there some other technique for doing this? I'm to the point where I'm
 considering copying and pasting the StatCollector class and changing the
 type from Double to MyClass (or making it generic).

 Am I going down the wrong path?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Computing-mean-and-standard-deviation-by-key-tp11192.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: Computing mean and standard deviation by key

2014-08-01 Thread Evan R. Sparks
Ignoring my warning about overflow - even more functional - just use a
reduceByKey.

Since your main operation is just a bunch of summing, you've got a
commutative-associative reduce operation and spark will do everything
cluster-parallel, and then shuffle the (small) result set and merge
appropriately.

For example:
input
  .map { case (k, v) => (k, (1, v, v*v)) }
  .reduceByKey { case ((c1, s1, ss1), (c2, s2, ss2)) => (c1+c2, s1+s2, ss1+ss2) }
  .map { case (k, (count, sum, sumsq)) => (k, sumsq/count - (sum/count * sum/count)) }

This is by no means the most memory/time efficient way to do it, but I
think it's a nice example of how to think about using spark at a higher
level of abstraction.

- Evan



On Fri, Aug 1, 2014 at 2:00 PM, Sean Owen so...@cloudera.com wrote:

 Here's the more functional programming-friendly take on the
 computation (but yeah this is the naive formula):

  rdd.groupByKey.mapValues { mcs =>
    val values = mcs.map(_.foo.toDouble)
    val n = values.size
    val sum = values.sum
    val sumSquares = values.map(x => x * x).sum
    math.sqrt(n * sumSquares - sum * sum) / n
  }

 This gives you a bunch of (key,stdev). I think you want to compute
 this RDD and *then* do something to save it if you like. Sure, that
 could be collecting it locally and saving to a DB. Or you could use
 foreach to do something remotely for every key-value pair. More
 efficient would be to mapPartitions and do something to a whole
 partition of key-value pairs at a time.


 On Fri, Aug 1, 2014 at 9:56 PM, kriskalish k...@kalish.net wrote:
  So if I do something like this, spark handles the parallelization and
  recombination of sum and count on the cluster automatically? I started
  peeking into the source and see that foreach does submit a job to the
  cluster, but it looked like the inner function needed to return
 something to
  work properly.
 
   val grouped = rdd.groupByKey()
   grouped.foreach{ x =>
     val iterable = x._2
     var sum = 0.0
     var count = 0
     iterable.foreach{ y =>
       sum = sum + y.foo
       count = count + 1
     }
     val mean = sum/count;
     // save mean to database...
   }
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Computing-mean-and-standard-deviation-by-key-tp11192p11207.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Decision tree classifier in MLlib

2014-07-25 Thread Evan R. Sparks
Can you share the dataset via a gist or something and we can take a look at
what's going on?


On Fri, Jul 25, 2014 at 10:51 AM, SK skrishna...@gmail.com wrote:

 yes, the output  is continuous. So I used a threshold to get binary labels.
 If prediction  threshold, then class is 0 else 1. I use this binary label
 to then compute the accuracy. Even with this binary transformation, the
 accuracy with decision tree model is low compared to LR or SVM (for the
 specific dataset I used).



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Decision-tree-classifier-in-MLlib-tp9457p10678.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Getting the number of slaves

2014-07-24 Thread Evan R. Sparks
Try sc.getExecutorStorageStatus.length

SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will
give you back an object per executor - the StorageStatus objects are what
drives a lot of the Spark Web UI.

https://spark.apache.org/docs/1.0.1/api/scala/index.html#org.apache.spark.SparkContext


On Thu, Jul 24, 2014 at 11:16 AM, Nicolas Mai nicolas@gmail.com wrote:

 Hi,

 Is there a way to get the number of slaves/workers during runtime?

 I searched online but didn't find anything :/ The application I'm working
 will run on different clusters corresponding to different deployment stages
 (beta - prod). It would be great to get the number of slaves currently in
 use, in order set the level of parallelism and RDD partitions, based on
 that
 number.

 Thanks!
 Nicolas



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-number-of-slaves-tp10604.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Evan R. Sparks
To be clear - each of the RDDs is still a distributed dataset and each of
the individual SVM models will be trained in parallel across the cluster.
Sean's suggestion effectively has you submitting multiple spark jobs
simultaneously, which, depending on your cluster configuration and the size
of your dataset, may or may not be a good idea.

There are some tricks you can do to make training multiple models on the
same dataset faster, which we're hoping to expose to users in an upcoming
release.

- Evan


On Sat, Jul 5, 2014 at 1:50 AM, Sean Owen so...@cloudera.com wrote:

 If you call .par on data_kfolded it will become a parallel collection in
 Scala and so the maps will happen in parallel .
 On Jul 5, 2014 9:35 AM, sparkuser2345 hm.spark.u...@gmail.com wrote:

 Hi,

 I am trying to fit a logistic regression model with cross validation in
 Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where
 each element is a pair of RDDs containing the training and test data:

 (training_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint],
 test_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint])

 scala data_kfolded
 res21:

 Array[(org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],
 org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint])]
 =
 Array((MappedRDD[9] at map at console:24,MappedRDD[7] at map at
 console:23), (MappedRDD[13] at map at console:24,MappedRDD[11] at map
 at
 console:23), (MappedRDD[17] at map at console:24,MappedRDD[15] at map
 at
 console:23))

 Everything works fine when using data_kfolded:

  val validationErrors =
  data_kfolded.map { datafold =>
    val svmAlg = new SVMWithSGD()
    val model_reg = svmAlg.run(datafold._1)
    val labelAndPreds = datafold._2.map { point =>
      val prediction = model_reg.predict(point.features)
      (point.label, prediction)
    }
    val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble /
      datafold._2.count
    trainErr.toDouble
  }

 scala validationErrors
 res1: Array[Double] = Array(0.8819836785938481, 0.07082521117608837,
 0.29833546734955185)

 However, I have understood that the models are not fitted in parallel as
 data_kfolded is not an RDD (although it's an array of pairs of RDDs). When
 running the same code where data_kfolded has been replaced with
 sc.parallelize(data_kfolded), I get a null pointer exception from the line
 where the run method of the SVMWithSGD object is called with the traning
 data. I guess this is somehow related to the fact that RDDs can't be
 accessed from inside a closure. I fail to understand though why the first
 version works and the second doesn't. Most importantly, is there a way to
 fit the models in parallel? I would really appreciate your help.

  val validationErrors =
  sc.parallelize(data_kfolded).map { datafold =>
    val svmAlg = new SVMWithSGD()
    val model_reg = svmAlg.run(datafold._1) // This line gives null pointer exception
    val labelAndPreds = datafold._2.map { point =>
      val prediction = model_reg.predict(point.features)
      (point.label, prediction)
    }
    val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble /
      datafold._2.count
    trainErr.toDouble
  }
 validationErrors.collect

 java.lang.NullPointerException
 at org.apache.spark.rdd.RDD.firstParent(RDD.scala:971)
 at
 org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
 at org.apache.spark.rdd.RDD.take(RDD.scala:824)
 at org.apache.spark.rdd.RDD.first(RDD.scala:856)
 at

 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:121)
 at
 $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:36)
 at
 $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:34)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at 

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread Evan R. Sparks
There is a method in org.apache.spark.mllib.util.MLUtils called kFold
which will automatically partition your dataset for you into k train/test
splits at which point you can build k different models and aggregate the
results.

For example (a very rough sketch - assuming I want to do 10-fold cross
validation on a binary classification model on a file with 1000 features in
it):

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.util.LabelParsers
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val dat = MLUtils.loadLibSVMFile(sc, "path/to/data", false, 1000)

val cvdat = MLUtils.kFold(dat, 10, 42)

val modelErrors = cvdat.map { case (train, test) =>
  val model = LogisticRegressionWithSGD.train(train, 100, 0.1, 1.0)
  val error = computeError(model, test)
  (model, error)
}

// Average error:
val avgError = modelErrors.map(_._2).reduce(_ + _) / modelErrors.length

Here, I'm assuming you've got some computeError function defined. Note
that many of these APIs are marked experimental and thus might change in
a future spark release.
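
For completeness, computeError could be as simple as this sketch (0/1 error
on a binary classifier):

import org.apache.spark.mllib.classification.ClassificationModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Fraction of misclassified points on the held-out fold.
def computeError(model: ClassificationModel, test: RDD[LabeledPoint]): Double =
  test.filter(p => model.predict(p.features) != p.label).count.toDouble / test.count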


On Tue, Jun 24, 2014 at 6:44 AM, Eustache DIEMERT eusta...@diemert.fr
wrote:

 I'm interested in this topic too :)

 Are the MLLib core devs on this list ?

 E/


 2014-06-24 14:19 GMT+02:00 holdingonrobin robinholdin...@gmail.com:

 Anyone knows anything about it? Or should I actually move this topic to a
 MLlib specif mailing list? Any information is appreciated! Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-K-fold-validation-in-spark-1-0-tp8142p8172.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Performance problems on SQL JOIN

2014-06-20 Thread Evan R. Sparks
Also - you could consider caching your data after the first split (before
the first filter), this will prevent you from retrieving the data from s3
twice.
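
i.e. something like this sketch (bucket/path made up):

// Read from S3 once, split once, and keep the split rows in memory for both filters.
val file = sc.textFile("s3n://bucket/path/bookings.csv")
val data = file.map(_.split('|')).cache()
val rooms2 = data.filter(_(0) == "2")
val rooms3 = data.filter(_(0) == "3")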


On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng men...@gmail.com wrote:

 Your data source is S3 and data is used twice. m1.large does not have very
 good network performance. Please try file.count() and see how fast it goes.
 -Xiangrui

  On Jun 20, 2014, at 8:16 AM, mathias math...@socialsignificance.co.uk
 wrote:
 
  Hi there,
 
  We're trying out Spark and are experiencing some performance issues using
  Spark SQL.
  Anyone who can tell us if our results are normal?
 
  We are using the Amazon EC2 scripts to create a cluster with 3
  workers/executors (m1.large).
  Tried both spark 1.0.0 as well as the git master; the Scala as well as
 the
  Python shells.
 
  Running the following code takes about 5 minutes, which seems a long time
  for this query.
 
   val file = sc.textFile("s3n:// ... .csv");
   val data = file.map(x => x.split('|')); // 300k rows

   case class BookingInfo(num_rooms: String, hotelId: String, toDate: String, ...);
   val rooms2 = data.filter(x => x(0) == "2").map(x => BookingInfo(x(0), x(1), ... , x(9))); // 50k rows
   val rooms3 = data.filter(x => x(0) == "3").map(x => BookingInfo(x(0), x(1), ... , x(9))); // 30k rows

   rooms2.registerAsTable("rooms2");
   cacheTable("rooms2");
   rooms3.registerAsTable("rooms3");
   cacheTable("rooms3");

   sql("SELECT * FROM rooms2 LEFT JOIN rooms3 ON rooms2.hotelId =
   rooms3.hotelId AND rooms2.toDate = rooms3.toDate").count();
 
 
  Are we doing something wrong here?
  Thanks!
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Performance-problems-on-SQL-JOIN-tp8001.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-19 Thread Evan R. Sparks
Larry,

I don't see any reference to Spark in particular there.

Additionally, the benchmark only scales up to datasets that are roughly
10gb (though I realize they've picked some fairly computationally intensive
tasks), and they don't present their results on more than 4 nodes. This can
hide things like, for example, a communication pattern that is O(n^2) in
the number of cluster nodes.

Obviously they've gotten some great performance out of SciDB, but I don't
think this answers the MPI vs. Spark question directly.

My own experience suggests that as long as your algorithm fits in a BSP
programming model, with Spark you can achieve performance that is
comparable to a tuned C++/MPI codebase by leveraging the right libraries
locally and thinking carefully about what and when you have to communicate.

- Evan


On Thu, Jun 19, 2014 at 8:48 AM, ldmtwo larry.d.moore...@intel.com wrote:


 Here is a partial comparison.


 http://dspace.mit.edu/bitstream/handle/1721.1/82517/MIT-CSAIL-TR-2013-028.pdf?sequence=2

 SciDB uses MPI with Intel HW and libraries. Amazing performance at the cost
 of more work.

 In case the link stops working:
 A Complex Analytics Genomics Benchmark. Rebecca Taft, Manasi Vartak,
 Nadathur Rajagopalan Satish, Narayanan Sundaram, Samuel Madden, and Michael
 Stonebraker



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-There-Any-Benchmarks-Comparing-C-MPI-with-Spark-tp7661p7919.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: How do you run your spark app?

2014-06-19 Thread Evan R. Sparks
I use SBT, create an assembly, and then add the assembly jars when I create
my spark context. I run the main driver with something like java -cp ...
MyDriver.

That said - as of spark 1.0 the preferred way to run spark applications is
via spark-submit -
http://spark.apache.org/docs/latest/submitting-applications.html
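
A rough sketch of the assembly approach (the jar path, class name, and master
URL below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Driver launched with: java -cp target/scala-2.10/myapp-assembly-0.1.jar MyDriver
val conf = new SparkConf()
  .setAppName("MyDriver")
  .setMaster("spark://master:7077")
  .setJars(Seq("target/scala-2.10/myapp-assembly-0.1.jar")) // ship the assembly to the executors
val sc = new SparkContext(conf)

// Or, with spark-submit (the preferred route as of 1.0):
//   bin/spark-submit --class MyDriver --master spark://master:7077 \
//     target/scala-2.10/myapp-assembly-0.1.jar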


On Thu, Jun 19, 2014 at 11:36 AM, ldmtwo ldm...@gmail.com wrote:

 I want to ask this, not because I can't read endless documentation and
 several tutorials, but because there seems to be many ways of doing things
 and I keep having issues. How do you run /your /spark app?

 I had it working when I was only using yarn+hadoop1 (Cloudera), then I had
 to get Spark and Shark working and ended upgrading everything and dropped
 CDH support. Anyways, this is what I used with master=yarn-client and
 app_jar being Scala code compiled with Maven.

 java -cp $CLASSPATH -Dspark.jars=$APP_JAR -Dspark.master=$MASTER $CLASSNAME
 $ARGS

 Do you use this? or something else? I could never figure out this method.
 SPARK_HOME/bin/spark jar APP_JAR ARGS

 For example:
 bin/spark-class jar
 /usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
 pi 10 10

 Do you use SBT or Maven to compile? or something else?


 ** It seems that I can't get subscribed to the mailing list and I tried both
 my work email and personal.
 my work email and personal.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-run-your-spark-app-tp7935.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Evan R. Sparks
This looks like a job for SparkSQL!


val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int, hits: Long)
val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
  MyRecord("USA", "Bob", 55, 108), MyRecord("France", "Remi", 33, 72)))
data.registerAsTable("MyRecords")
val results = sql("SELECT t.country, AVG(t.age), SUM(t.hits) FROM MyRecords t GROUP BY t.country").collect

Now results contains:

Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])
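
And for reference, the same two aggregations in one pass without SQL, along the
lines of the aggregateByKey() suggestion further down the thread (a rough
sketch over the data RDD above; aggregateByKey is a newer pair-RDD API, so this
assumes a recent Spark build):

// Accumulator per country: (sum of ages, row count, sum of hits).
// In a compiled job you'd also need: import org.apache.spark.SparkContext._
val perCountry = data
  .map(r => (r.country, r))
  .aggregateByKey((0L, 0L, 0L))(
    (acc, r) => (acc._1 + r.age, acc._2 + 1, acc._3 + r.hits),
    (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
val avgAgeAndTotalHits = perCountry
  .mapValues { case (ageSum, n, hits) => (ageSum.toDouble / n, hits) }
  .collect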


On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin doris.s@gmail.com wrote:

 Hi Nick,

 Instead of using reduceByKey(), you might want to look into using
 aggregateByKey(), which allows you to return a different value type U
 instead of the input value type V for each input tuple (K, V). You can
 define U to be a datatype that holds both the average and total and have
 seqOp update both fields of U in a single pass.

 Hope this makes sense,
 Doris


 On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas nicholas.cham...@gmail.com
 wrote:

 The following is a simplified example of what I am trying to accomplish.

 Say I have an RDD of objects like this:

 {
 country: USA,
 name: Franklin,
 age: 24,
 hits: 224}
 {

 country: USA,
 name: Bob,
 age: 55,
 hits: 108}
 {

 country: France,
 name: Remi,
 age: 33,
 hits: 72}

 I want to find the average age and total number of hits per country.
 Ideally, I would like to scan the data once and perform both aggregations
 simultaneously.

 What is a good approach to doing this?

 I’m thinking that we’d want to keyBy(country), and then somehow
 reduceByKey(). The problem is, I don’t know how to approach writing a
 function that can be passed to reduceByKey() and that will track a
 running average and total simultaneously.

 Nick
 ​

 --
 View this message in context: Patterns for making multiple aggregations
 in one pass
 http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-tp7874.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.





Re: pmml with augustus

2014-06-10 Thread Evan R. Sparks
I should point out that if you don't want to take a polyglot approach to
languages and reside solely in the JVM, then you can just use plain old
java serialization on the Model objects that come out of MLlib's APIs from
Java or Scala, load them up in another process, and call the relevant
.predict() method when it comes time to serve. The same approach would
probably also work for models trained via MLlib's python APIs, but I
haven't tried that.

Native PMML serialization would be a nice feature to add to MLlib as a
mechanism to transfer models to other environments for further
analysis/serving. There's a JIRA discussion about this here:
https://issues.apache.org/jira/browse/SPARK-1406
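
For anyone curious, a rough sketch of the plain-serialization route (the file
path is a placeholder, and model / someFeatures are assumed to come from your
own training and serving code):

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.classification.LogisticRegressionModel

// Training side: persist the fitted model with plain Java serialization.
val out = new ObjectOutputStream(new FileOutputStream("/tmp/lr-model.ser"))
out.writeObject(model) // e.g. a LogisticRegressionModel from LogisticRegressionWithSGD.train(...)
out.close()

// Serving side (possibly a different process): load it back and predict.
val in = new ObjectInputStream(new FileInputStream("/tmp/lr-model.ser"))
val loaded = in.readObject().asInstanceOf[LogisticRegressionModel]
in.close()
val score = loaded.predict(someFeatures) // someFeatures: an MLlib Vector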


On Tue, Jun 10, 2014 at 10:53 AM, filipus floe...@gmail.com wrote:

 Thank you very much

 the cascading project i didn't recognize it at all till now

 this project is very interesting

 also I got the idea of the usage of scala as a language for spark - because
 i can integrate jvm based libraries very easily/naturally when I got it right

 mh... but I could also use spark as a model engine, augustus for the
 serializer and a third party product for the prediction engine like using
 jpmml

 mh... got the feeling that i need to do java, scala and python at the same
 time...

 first things first - augustus for an pmml output from spark :-)





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/pmml-with-augustus-tp7313p7335.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Random Forest on Spark

2014-04-18 Thread Evan R. Sparks
Interesting, and thanks for the thoughts.

I think we're on the same page with 100s of millions of records. We've
tested the tree implementation in mllib on 1b rows and up to 100 features -
though this isn't hitting the 1000s of features you mention.

Obviously multi class support isn't there yet, but I can see your point
about deeper trees for many class problems. Will try them out on some image
processing stuff with 1k classes we're doing in the lab once they are more
developed to get a sense for where the issues are.

If you're only allocating 2GB/worker you're going to have a hard time
getting the real advantages of Spark.

For your 1k features causing heap exceptions at depth 5  - are these
categorical or continuous? The categorical vars create much smaller
histograms.

If you're fitting all continuous features, the memory requirements are
O(b*d*2^l) where b=number of histogram bins, d=number of features, and l =
level of the tree. Even accounting for object overhead, with the default
number of bins, the histograms at this depth should be order of 10s of MB,
not 2GB - so I'm guessing your cached data is occupying a significant chunk
of that 2GB? In the tree PR - Hirakendu Das tested down to depth 10 on 500m
data points with 20 continuous features and was able to run without running
into memory issues (and scaling properties got better as the depth grew).
His worker mem was 7.5GB and 30% of that was reserved for caching. If you
wanted to go 1000 features at depth 10 I'd estimate a couple of gigs
necessary for heap space for the worker to compute/store the histograms,
and I guess 2x that on the master to do the reduce.
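
As a back-of-the-envelope check on that estimate (the bin count and
bytes-per-bin here are assumptions, just to show the order of magnitude):

val bins = 32          // assumed number of histogram bins
val features = 1000
val depth = 5          // 2^5 = 32 nodes at that level
val statsPerBin = 3    // rough per-bin aggregates, e.g. count/sum/sumSq (an assumption)
val bytes = bins.toLong * features * (1L << depth) * statsPerBin * 8L // 8 bytes per double
println(bytes / (1024 * 1024) + " MB") // ~23 MB - tens of MB, not gigabytes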

Again 2GB per worker is pretty tight, because there are overheads of just
starting the jvm, launching a worker, loading libraries, etc.

- Evan

On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:

Yes, it should be data specific and perhaps we're biased toward the data
sets that we are playing with. To put things in perspective, we're highly
interested in (and I believe, our customers are):

1. large (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target categories are
common and even thousands in some cases - you could imagine that this is a
big reason for us requiring more 'complex' models
3. high dimensional with thousands of descriptive and sort-of-independent
features

From the theoretical perspective, I would argue that it's usually in the
best interest to prune as little as possible. I believe that pruning
inherently increases bias of an individual tree, which RF can't do anything
about while decreasing variance - which is what RF is for.

The default pruning criteria for R's reference implementation is min-node
of 1 (meaning fully-grown tree) for classification, and 5 for regression.
I'd imagine they did at least some empirical testing to justify these
values at the time - although at a time of small datasets :).

FYI, we are also considering the MLLib decision tree for our Gradient
Boosting implementation, however, the memory requirement is still a bit too
steep (we were getting heap exceptions at depth limit of 5 with 2GB per
worker with approximately 1000 features). Now 2GB per worker is about what
we expect our typical customers would tolerate and I don't think that it's
unreasonable for shallow trees.



On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks evan.spa...@gmail.comwrote:

 What kind of data are you training on? These effects are *highly* data
 dependent, and while saying the depth of 10 is simply not adequate to
 build high-accuracy models may be accurate for the particular problem
 you're modeling, it is not true in general. From a statistical perspective,
 I consider each node in each tree an additional degree of freedom for the
 model, and all else equal I'd expect a model with fewer degrees of freedom
 to generalize better. Regardless, if there are lots of use cases for really
 deep trees, we'd like to hear about them so that we can decide how
 important they are to support!

 In the context of CART - pruning very specifically refers to a step
 *after* a tree has been constructed to some depth using cross-validation.
 This was a variance reduction technique in the original tree work that is
 unnecessary and computationally expensive in the context of forests. In the
 original Random Forests paper, there are still stopping criteria - usually
 either minimum leaf size or minimum split improvement (or both), so
 training to maximum depth doesn't mean train until you've completely
 divided your dataset and there's one point per leaf. My point is that if
 you set minimum leaf size to something like 0.2% of the dataset, then
 you're not going to get deeper than 10 or 12 levels with a reasonably
 balanced tree.

 With respect to PLANET - our implementation is very much in the spirit of
 planet, but has some key differences - there's good documentation on
 exactly what the differences are forthcoming, so I

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
Sorry - I meant to say that Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon.


On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks evan.spa...@gmail.comwrote:

 Multiclass classification, Gradient Boosting, and Random Forest support
 for based on the recent Decision Tree implementation in MLlib.

 Sung - I'd be curious to hear about your use of decision trees (and
 forests) where you want to go to 100+ depth. My experience with random
 forests has been that people typically build hundreds of shallow trees
 (maybe depth 7 or 8), rather than a few (or many) really deep trees.

 Generally speaking, we save passes over the data by computing histograms
 per variable per split at each *level* of a decision tree. This can blow up
 as the level of the decision tree gets deep, but I'd recommend a lot more
 memory than 2-4GB per worker for most big data workloads.





 On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung 
 coded...@cs.stanford.edu wrote:

 Debasish, we've tested the MLLib decision tree a bit and it eats up too
 much memory for RF purposes.
 Once the tree got to depth 8~9, it was easy to get heap exception, even
 with 2~4 GB of memory per worker.

 With RF, it's very easy to get 100+ depth in RF with even only 100,000+
 rows (because trees usually are not balanced). Additionally, the lack of
 multi-class classification limits its applicability.

 Also, RF requires random features per tree node to be effective (not just
 bootstrap samples), and MLLib decision tree doesn't support that.


 On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das 
 debasish.da...@gmail.comwrote:

 Mllib has decision tree...there is a rf pr which is not active
 now...take that and swap the tree builder with the fast tree builder
 that's in mllib...search for the spark jira...the code is based on google
 planet paper...

 I am sure people in devlist are already working on it...send an email to
 know the status over there...

 There is also a rf in cloudera oryx but we could not run it on our data
 yet

 Weka 3.7.10 has a multi thread rf that is good to do some adhoc runs but
 it does not scale...
  On Apr 17, 2014 2:45 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:

 Hi,

 For one of my application, I want to use Random forests(RF) on top of
 spark. I see that currenlty MLLib does not have implementation for RF. What
 other opensource RF implementations will be great to use with spark in
 terms of speed?

 Regards,
 Laeeq Ahmed,
 KTH, Sweden.






Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
What kind of data are you training on? These effects are *highly* data
dependent, and while saying the depth of 10 is simply not adequate to
build high-accuracy models may be accurate for the particular problem
you're modeling, it is not true in general. From a statistical perspective,
I consider each node in each tree an additional degree of freedom for the
model, and all else equal I'd expect a model with fewer degrees of freedom
to generalize better. Regardless, if there are lots of use cases for really
deep trees, we'd like to hear about them so that we can decide how
important they are to support!

In the context of CART - pruning very specifically refers to a step *after*
a tree has been constructed to some depth using cross-validation. This was
a variance reduction technique in the original tree work that is
unnecessary and computationally expensive in the context of forests. In the
original Random Forests paper, there are still stopping criteria - usually
either minimum leaf size or minimum split improvement (or both), so
training to maximum depth doesn't mean train until you've completely
divided your dataset and there's one point per leaf. My point is that if
you set minimum leaf size to something like 0.2% of the dataset, then
you're not going to get deeper than 10 or 12 levels with a reasonably
balanced tree.
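
The quick arithmetic behind that claim (a balanced-tree approximation):

val minLeafFraction = 0.002                         // minimum leaf size of 0.2% of the data
val maxLeaves = 1.0 / minLeafFraction               // at most ~500 leaves
val maxDepth = math.log(maxLeaves) / math.log(2.0)  // ~9 levels for a balanced tree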

With respect to PLANET - our implementation is very much in the spirit of
planet, but has some key differences - there's good documentation on
exactly what the differences are forthcoming, so I won't belabor these
here. The differences are designed to 1) avoid data shuffling, and 2)
minimize number of passes over the training data. Of course, there are
tradeoffs involved, and there is at least one really good trick in the
PLANET work that we should leverage but aren't yet - namely, once the
nodes get small enough for the data to fit easily on a single machine, the
data can be shuffled and then the remainder of the tree can be trained in
parallel from each lower node on a single machine. This would actually help
with the memory overheads in model training when trees get deep - if someone wants
to modify the current implementation of trees in MLlib and contribute this
optimization as a pull request, it would be welcome!

At any rate, we'll take this feedback into account with respect to
improving the tree implementation, but if anyone can send over use cases or
(even better) datasets where really deep trees are necessary, that would be
great!




On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung
coded...@cs.stanford.eduwrote:

 Well, if you read the original paper,
 http://oz.berkeley.edu/~breiman/randomforest2001.pdf
 Grow the tree using CART methodology to maximum size and do not prune.

 Now, the elements of statistical learning book on page 598 says that you
 could potentially overfit fully-grown regression random forest. However,
 this effect is very slight, and likely negligible for classifications.
 http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

 In our experiments however, if the pruning is drastic, then the
 performance actually becomes much worse. This makes intuitive sense IMO
 because a decision tree is a non-parametric model, and the expressibility
 of a tree depends on the number of nodes.

 With a huge amount of data (millions or even billions of rows), we found
 that the depth of 10 is simply not adequate to build high-accuracy models.


 On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks evan.spa...@gmail.comwrote:

 Hmm... can you provide some pointers to examples where deep trees are
 helpful?

 Typically with Decision Trees you limit depth (either directly or
 indirectly with minimum node size and minimum improvement criteria) to
 avoid overfitting. I agree with the assessment that forests are a variance
 reduction technique, but I'd be a little surprised if a bunch of hugely
 deep trees don't overfit to training data. I guess I view limiting tree
 depth as an analogue to regularization in linear models.


 On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung 
 coded...@cs.stanford.edu wrote:

 Evan,

 I actually haven't heard of 'shallow' random forest. I think that the
 only scenarios where shallow trees are useful are boosting scenarios.

 AFAIK, Random Forest is a variance reducing technique and doesn't do
 much about bias (although some people claim that it does have some bias
 reducing effect). Because shallow trees typically have higher bias than
 fully-grown trees, people don't often use shallow trees with RF.

 You can confirm this through some experiments with R's random forest
 implementation as well. They allow you to set some limits of depth and/or
 pruning.

 In contrast, boosting is a bias reduction technique (and increases
 variance), so people typically use shallow trees.

 Our empirical experiments also confirmed that shallow trees resulted in
 drastically lower accuracy for random forests.

 There are some papers that mix boosting-like

Re: Status of MLI?

2014-04-07 Thread Evan R. Sparks
That work is under submission at an academic conference and will be made
available if/when the paper is published.

In terms of algorithms for hyperparameter tuning, we consider Grid Search,
Random Search, a couple of older derivative-free optimization methods, and
a few newer methods - TPE (aka HyperOpt from James Bergstra), SMAC (from
Frank Hutter's group), and Spearmint (Jasper Snoek's method) - the short
answer is that in our hands Random Search works surprisingly well for the
low-dimensional problems we looked at, but TPE and SMAC perform slightly
better. I've got a private branch with TPE (as well as random and grid
search) integrated with MLI, but the code is research quality right now and
not extremely general.
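
Since that branch isn't public yet, here is only a toy sketch of what random
search over two hyperparameters looks like with plain MLlib (train, test, and
computeError are assumed to be defined elsewhere; none of this is the MLI code
referred to above):

import scala.util.Random
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val rng = new Random(42)
// Sample 20 (stepSize, numIterations) configurations at random.
val configs = Seq.fill(20)((math.pow(10.0, -3.0 + 3.0 * rng.nextDouble), 50 + rng.nextInt(150)))
val scored = configs.map { case (step, iters) =>
  val m = LogisticRegressionWithSGD.train(train, iters, step, 1.0)
  ((step, iters), computeError(m, test))
}
val (bestConfig, bestError) = scored.minBy(_._2)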

We're actively working on bringing these things up to snuff for a proper
open source release.




On Fri, Apr 4, 2014 at 11:28 AM, Yi Zou yi.zou.li...@gmail.com wrote:

 Hi, Evan,

 Just noticed this thread, do you mind sharing more details regarding
 algorithms targetted at hyperparameter tuning/model selection? or a link
 to dev git repo for that work.

 thanks,
 yi


 On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks evan.spa...@gmail.comwrote:

 Targeting 0.9.0 should work out of the box (just a change to the
 build.sbt) - I'll push some changes I've been sitting on to the public repo
 in the next couple of days.


 On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote:

 Thanks for the update Evan! In terms of using MLI, I see that the Github
 code is linked to Spark 0.8; will it not work with 0.9 (which is what I
 have set up) or higher versions?


 On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User
 List] [hidden email]http://user/SendEmail.jtp?type=nodenode=3632i=0
  wrote:

 Hi there,

 MLlib is the first component of MLbase - MLI and the higher levels of
 the stack are still being developed. Look for updates in terms of our
 progress on the hyperparameter tuning/model selection problem in the next
 month or so!

 - Evan


 On Tue, Apr 1, 2014 at 8:05 PM, Krakna H [hidden 
 email]http://user/SendEmail.jtp?type=nodenode=3615i=0
  wrote:

 Hi Nan,

 I was actually referring to MLI/MLBase (http://www.mlbase.org); is
 this being actively developed?

 I'm familiar with mllib and have been looking at its documentation.

 Thanks!


 On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List]
 [hidden email] http://user/SendEmail.jtp?type=nodenode=3612i=0wrote:

  mllib has been part of Spark distribution (under mllib directory),
 also check http://spark.apache.org/docs/latest/mllib-guide.html

 and for JIRA, because of the recent migration to apache JIRA, I think
 all mllib-related issues should be under the Spark umbrella,
 https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

 --
 Nan Zhu

 On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:

 What is the current development status of MLI/MLBase? I see that the
 github repo is lying dormant (https://github.com/amplab/MLI) and
 JIRA has had no activity in the last 30 days (
 https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
 Is the plan to add a lot of this into mllib itself without needing a
 separate API?

 Thanks!

 --
 View this message in context: Status of 
 MLI?http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html
 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at
 Nabble.com.







 --
 View this message in context: Re: Status of 
 MLI?http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html

 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at 
 Nabble.com.





Re: Status of MLI?

2014-04-02 Thread Evan R. Sparks
Targeting 0.9.0 should work out of the box (just a change to the build.sbt)
- I'll push some changes I've been sitting on to the public repo in the
next couple of days.


On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote:

 Thanks for the update Evan! In terms of using MLI, I see that the Github
 code is linked to Spark 0.8; will it not work with 0.9 (which is what I
 have set up) or higher versions?


 On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User
 List] [hidden email] 
 http://user/SendEmail.jtp?type=nodenode=3632i=0wrote:

 Hi there,

 MLlib is the first component of MLbase - MLI and the higher levels of the
 stack are still being developed. Look for updates in terms of our progress
 on the hyperparameter tuning/model selection problem in the next month or
 so!

 - Evan


 On Tue, Apr 1, 2014 at 8:05 PM, Krakna H [hidden 
 email]http://user/SendEmail.jtp?type=nodenode=3615i=0
  wrote:

 Hi Nan,

 I was actually referring to MLI/MLBase (http://www.mlbase.org); is this
 being actively developed?

 I'm familiar with mllib and have been looking at its documentation.

 Thanks!


 On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] 
 [hidden
 email] http://user/SendEmail.jtp?type=nodenode=3612i=0 wrote:

  mllib has been part of Spark distribution (under mllib directory),
 also check http://spark.apache.org/docs/latest/mllib-guide.html

 and for JIRA, because of the recent migration to apache JIRA, I think
 all mllib-related issues should be under the Spark umbrella,
 https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

 --
 Nan Zhu

 On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:

 What is the current development status of MLI/MLBase? I see that the
 github repo is lying dormant (https://github.com/amplab/MLI) and JIRA
 has had no activity in the last 30 days (
 https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
 Is the plan to add a lot of this into mllib itself without needing a
 separate API?

 Thanks!

 --
 View this message in context: Status of 
 MLI?http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html
 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at
 Nabble.com.







 --
 View this message in context: Re: Status of 
 MLI?http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html

 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at Nabble.com.







 --
 View this message in context: Re: Status of 
 MLI?http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3632.html
 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at Nabble.com.



Re: Status of MLI?

2014-04-01 Thread Evan R. Sparks
Hi there,

MLlib is the first component of MLbase - MLI and the higher levels of the
stack are still being developed. Look for updates in terms of our progress
on the hyperparameter tuning/model selection problem in the next month or
so!

- Evan


On Tue, Apr 1, 2014 at 8:05 PM, Krakna H shankark+...@gmail.com wrote:

 Hi Nan,

 I was actually referring to MLI/MLBase (http://www.mlbase.org); is this
 being actively developed?

 I'm familiar with mllib and have been looking at its documentation.

 Thanks!


 On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] [hidden
 email] http://user/SendEmail.jtp?type=nodenode=3612i=0 wrote:

  mllib has been part of Spark distribution (under mllib directory), also
 check http://spark.apache.org/docs/latest/mllib-guide.html

 and for JIRA, because of the recent migration to apache JIRA, I think all
 mllib-related issues should be under the Spark umbrella,
 https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

 --
 Nan Zhu

 On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:

 What is the current development status of MLI/MLBase? I see that the
 github repo is lying dormant (https://github.com/amplab/MLI) and JIRA
 has had no activity in the last 30 days (
 https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
 Is the plan to add a lot of this into mllib itself without needing a
 separate API?

 Thanks!

 --
 View this message in context: Status of 
 MLI?http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html
 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at
 Nabble.com.







 --
 View this message in context: Re: Status of 
 MLI?http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html

 Sent from the Apache Spark User List mailing list 
 archivehttp://apache-spark-user-list.1001560.n3.nabble.com/at Nabble.com.



Re: [HELP] ask for some information about public data set

2014-02-25 Thread Evan R. Sparks
Hi hyqgod,

This is probably a better question for the spark user's list than the dev
list (cc'ing user and bcc'ing dev on this reply).

To answer your question, though:

Amazon's Public Datasets Page is a nice place to start:
http://aws.amazon.com/datasets/ - these work well with spark because
they're often stored on s3 (which spark can read from natively) and it's
very easy to spin up a spark cluster on EC2 to begin experimenting with the
data.
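
For example, reading one of those S3-hosted datasets from a Spark shell looks
roughly like this (the bucket and path are placeholders, not a real dataset):

// s3n:// URLs work out of the box as long as your AWS credentials are configured,
// e.g. on a cluster launched with the spark-ec2 scripts.
val lines = sc.textFile("s3n://some-public-bucket/some-dataset/*.csv")
println(lines.count())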

There's also a pretty good list of (mostly big) datasets that google has
released over the years here:
http://svonava.com/post/62186512058/datasets-released-by-google

- Evan

On Tue, Feb 25, 2014 at 6:33 PM, 黄远强 hyq...@163.com wrote:

 Hi all:
 I am a freshman in the Spark community. I dream of becoming an expert in the
 field of big data, but I have no idea where to start after going through the
 published documents on the Spark website and the examples in the Spark source
 code. I want to know if there are some public data sets on the internet that
 can be used to learn Spark and to test some new ideas of mine based on
 Spark.
   Thanks a lot.


 ---
 Best regards
 hyqgod