Re: Construct model matrix from SchemaRDD automatically

2015-03-05 Thread Evan R. Sparks
Hi Wush, I'm CC'ing user@spark.apache.org (which is the new list) and BCC'ing u...@spark.incubator.apache.org. In Spark 1.3, SchemaRDD is in fact being renamed to DataFrame (see: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html). As for a

Re: Spark on teradata?

2015-01-08 Thread Evan R. Sparks
Have you taken a look at the TeradataDBInputFormat? Spark is compatible with arbitrary hadoop input formats - so this might work for you: http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw On Thu, Jan 8, 2015 at 10:53 AM, gen tang gen.tan...@gmail.com

Re: Spark and Stanford CoreNLP

2014-11-25 Thread Evan R. Sparks
Chris, Thanks for stopping by! Here's a simple example. Imagine I've got a corpus of data, which is an RDD[String], and I want to do some POS tagging on it. In naive spark, that might look like this: val props = new Properties(); props.setProperty("annotators", "pos"); val proc = new StanfordCoreNLP(props); val data =
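
A minimal sketch of the naive version described above (annotator list and input path are assumed). It compiles, but typically fails at runtime with a task-not-serializable error, because the StanfordCoreNLP pipeline built on the driver is captured in the map closure and cannot be shipped to the executors:

    import java.util.Properties
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
    import org.apache.spark.SparkContext

    def tagNaively(sc: SparkContext): Unit = {
      val props = new Properties()
      props.setProperty("annotators", "tokenize, ssplit, pos")
      val proc = new StanfordCoreNLP(props)        // built on the driver
      val data = sc.textFile("corpus.txt")         // RDD[String], path assumed
      val tagged = data.map { text =>
        val doc = new Annotation(text)
        proc.annotate(doc)                         // proc is captured in the closure -- this is the catch
        doc
      }
      tagged.count()                               // forces evaluation
    }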

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan R. Sparks
This is probably not the right venue for general questions on CoreNLP - the project website (http://nlp.stanford.edu/software/corenlp.shtml) provides documentation and links to mailing lists/stack overflow topics. On Mon, Nov 24, 2014 at 9:08 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Evan R. Sparks
Additionally - I strongly recommend using OpenBLAS over the Atlas build from the default Ubuntu repositories. Alternatively, you can build ATLAS on the hardware you're actually going to be running the matrix ops on (the master/workers), but we've seen modest performance gains doing this vs.

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan R. Sparks
Neat hack! This is cute and actually seems to work, though the fact that it does is a little surprising and somewhat unintuitive. On Mon, Nov 24, 2014 at 8:08 AM, Ian O'Connell i...@ianoconnell.com wrote: object MyCoreNLP { @transient lazy val coreNLP = new coreNLP() } and then refer to it
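
A sketch of Ian's pattern with assumed names and annotators. The reason it works: the closure only refers to the singleton object, which is never serialized; each executor JVM loads the class itself and the @transient lazy val is initialized locally on first use, so the non-serializable pipeline never has to be shipped from the driver:

    import java.util.Properties
    import edu.stanford.nlp.pipeline.StanfordCoreNLP
    import org.apache.spark.rdd.RDD

    object MyCoreNLP {
      @transient lazy val pipeline: StanfordCoreNLP = {
        val props = new Properties()
        props.setProperty("annotators", "tokenize, ssplit, pos")
        new StanfordCoreNLP(props)                 // built lazily, once per executor JVM
      }
    }

    def tag(data: RDD[String]) = data.map(text => MyCoreNLP.pipeline.process(text))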

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Evan R. Sparks
You can try recompiling spark with that option, and doing an sbt/sbt publish-local, then change your spark version from 1.1.0 to 1.2.0-SNAPSHOT (assuming you're building from the 1.1 branch) - sbt or maven (whichever you're compiling your app with) will pick up the version of spark that you just
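
On the application side, that might look roughly like this in build.sbt (the exact version string is whatever your local Spark checkout publishes with sbt/sbt publish-local):

    // build.sbt of the downstream application, not of Spark itself
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.2.0-SNAPSHOT",
      "org.apache.spark" %% "spark-mllib" % "1.2.0-SNAPSHOT"
    )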

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Evan R. Sparks
For sharing RDDs across multiple jobs - you could also have a look at Tachyon. It provides an HDFS compatible in-memory storage layer that keeps data in memory across multiple jobs/frameworks - http://tachyon-project.org/ . - On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal sonalgoy...@gmail.com

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Evan R. Sparks
, save). and at some point during run time these sub-models merge into the master model, which also loads, trains, and saves at the master level. much appreciated. On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com wrote: There's some work going on to support PMML

Re: why decision trees do binary split?

2014-11-06 Thread Evan R. Sparks
You can imagine this same logic applying to the continuous case. E.g. what if all the quartiles or deciles of a particular value have different behavior - this could capture that too. Or what if some combination of features was highly discriminative but only into n buckets, rather than two... you

Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Evan R. Sparks
Plain old java serialization is one straightforward approach if you're in java/scala. On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote: what is the best way to save an mllib model that you just trained and reload it in the future? specifically, i'm using the mllib word2vec
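
A minimal sketch of that approach, using Word2VecModel as the example and assuming the model class is java-serializable (file path is up to you):

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.mllib.feature.Word2VecModel

    def saveModel(model: Word2VecModel, path: String): Unit = {
      val out = new ObjectOutputStream(new FileOutputStream(path))
      try out.writeObject(model) finally out.close()
    }

    def loadModel(path: String): Word2VecModel = {
      val in = new ObjectInputStream(new FileInputStream(path))
      try in.readObject().asInstanceOf[Word2VecModel] finally in.close()
    }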

Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Evan R. Sparks
, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote: that works. is there a better way in spark? this seems like the most common feature for any machine learning work - to be able to save your model after training it and load it later. On Fri, Nov 7, 2014 at 2:30 AM, Evan R

Re: How to run kmeans after pca?

2014-09-30 Thread Evan R. Sparks
Caching after doing the multiply is a good idea. Keep in mind that during the first iteration of KMeans, the cached rows haven't yet been materialized - so it is both doing the multiply and the first pass of KMeans all at once. To isolate which part is slow you can run cachedRows.numRows() to
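
A rough sketch of that flow (number of components, k, and iteration counts are assumed): project the rows, cache, force materialization, and only then time KMeans:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def pcaThenKMeans(rows: RDD[Vector]) = {
      val mat = new RowMatrix(rows)
      val pc = mat.computePrincipalComponents(20)      // top-20 principal components
      val cachedRows = mat.multiply(pc).rows.cache()   // projected rows, marked for caching
      cachedRows.count()                               // materializes the multiply up front
      KMeans.train(cachedRows, 10, 20)                 // k = 10, maxIterations = 20
    }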

Re: spark1.0 principal component analysis

2014-09-23 Thread Evan R. Sparks
In its current implementation, the principal components are computed in MLlib in two steps: 1) In a distributed fashion, compute the covariance matrix - the result is a local matrix. 2) On this local matrix, compute the SVD. The sorting comes from the SVD. If you want to get the eigenvalues out,
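
From the user's side of the API, this is wrapped up in computePrincipalComponents; a minimal sketch (k assumed):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def project(vectors: RDD[Vector]): RowMatrix = {
      val mat = new RowMatrix(vectors)
      val pc = mat.computePrincipalComponents(10)   // local Matrix whose columns are the top-10 PCs
      mat.multiply(pc)                              // the data expressed in the principal-component basis
    }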

Re: Message Passing among workers

2014-09-03 Thread Evan R. Sparks
Asynchrony is not supported directly - spark's programming model is naturally BSP. I have seen cases where people have instantiated actors with akka on worker nodes to enable message passing, or even used spark's own ActorSystem to do this. But, I do not recommend this, since you lose a bunch of

Re: mllib performance on cluster

2014-09-03 Thread Evan R. Sparks
I spoke with SK offline about this, it looks like the difference in timings came from the fact that he was training 100 models for 100 iterations and taking the total time (vs. my example which trains a single model for 100 iterations). I'm posting my response here, though, because I think it's

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
How many iterations are you running? Can you provide the exact details about the size of the dataset (how many data points, how many features)? Is this sparse or dense - and for the sparse case, how many non-zeroes? How many partitions does your data RDD have? For very small datasets the scheduling

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
Hmm... something is fishy here. That's a *really* small dataset for a spark job, so almost all your time will be spent in these overheads, but still you should be able to train a logistic regression model with the default options and 100 iterations in 1s on a single machine. Are you caching your

Re: reduceByKey to get all associated values

2014-08-07 Thread Evan R. Sparks
Specifically, reduceByKey expects a commutative/associative reduce operation, and will automatically do this locally before a shuffle, which means it acts like a combiner in MapReduce terms - http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions On Thu,
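
A toy spark-shell illustration (data assumed) of the difference between the two:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val sums = pairs.reduceByKey(_ + _)      // partial sums combined map-side: ("a", 4), ("b", 2)
    val allValues = pairs.groupByKey()       // all values per key: ("a", [1, 3]), ("b", [2]) -- full shuffle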

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Evan R. Sparks
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny) SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html), which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark 1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data
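
A minimal sketch of the tall-and-skinny SVD call (k assumed):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def tallSkinnySVD(rows: RDD[Vector]) = {
      val mat = new RowMatrix(rows)                  // n rows x d columns, n >> d
      val svd = mat.computeSVD(20, computeU = true)  // keep the top 20 singular values
      (svd.U, svd.s, svd.V)                          // distributed U, local singular values, local V
    }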

Re: [MLLib]:choosing the Loss function

2014-08-07 Thread Evan R. Sparks
The loss functions are represented in the various names of the model families. SVM is hinge loss, LogisticRegression is logistic loss, LinearRegression is squared (least-squares) loss. These are used internally as arguments to the SGD and L-BFGS optimizers. On Thu, Aug 7, 2014 at 6:31 PM, SK
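
In other words, picking the model family picks the loss; a sketch with assumed data and iteration counts:

    import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, SVMWithSGD}
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    def trainAll(data: RDD[LabeledPoint]): Unit = {
      val svm = SVMWithSGD.train(data, 100)                      // hinge loss
      val logistic = LogisticRegressionWithSGD.train(data, 100)  // logistic loss
      val linear = LinearRegressionWithSGD.train(data, 100)      // squared (least-squares) loss
    }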

Re: Computing mean and standard deviation by key

2014-08-01 Thread Evan R. Sparks
Computing the variance is similar to this example; you just need to keep around the sum of squares as well. The formula for variance is (sumsq/n) - (sum/n)^2. But with big datasets or large values, you can quickly run into overflow issues - MLlib handles this by maintaining the average sum of

Re: Computing mean and standard deviation by key

2014-08-01 Thread Evan R. Sparks
Ignoring my warning about overflow - even more functional - just use a reduceByKey. Since your main operation is just a bunch of summing, you've got a commutative-associative reduce operation and spark will do everything cluster-parallel, and then shuffle the (small) result set and merge
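
A sketch of the "bunch of summing" version (key/value types assumed), using the naive variance formula from the earlier message - subject to the same overflow/precision caveat:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def meanAndStdevByKey(data: RDD[(String, Double)]): RDD[(String, (Double, Double))] = {
      data
        .mapValues(v => (1L, v, v * v))              // (count, sum, sum of squares)
        .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) => (n1 + n2, s1 + s2, q1 + q2) }
        .mapValues { case (n, sum, sumSq) =>
          val mean = sum / n
          (mean, math.sqrt(sumSq / n - mean * mean)) // (mean, population standard deviation)
        }
    }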

Re: Decision tree classifier in MLlib

2014-07-25 Thread Evan R. Sparks
Can you share the dataset via a gist or something and we can take a look at what's going on? On Fri, Jul 25, 2014 at 10:51 AM, SK skrishna...@gmail.com wrote: yes, the output is continuous. So I used a threshold to get binary labels. If prediction < threshold, then class is 0, else 1. I use

Re: Getting the number of slaves

2014-07-24 Thread Evan R. Sparks
Try sc.getExecutorStorageStatus.length. SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will give you back an object per executor - the StorageStatus objects are what drive a lot of the Spark Web UI.

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Evan R. Sparks
To be clear - each of the RDDs is still a distributed dataset and each of the individual SVM models will be trained in parallel across the cluster. Sean's suggestion effectively has you submitting multiple spark jobs simultaneously, which, depending on your cluster configuration and the size of

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread Evan R. Sparks
There is a method in org.apache.spark.mllib.util.MLUtils called kFold which will automatically partition your dataset for you into k train/test splits at which point you can build k different models and aggregate the results. For example (a very rough sketch - assuming I want to do 10-fold cross
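
A rough sketch of that (model family, error metric, and seed are assumed):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    def crossValidate(data: RDD[LabeledPoint]): Double = {
      val folds = MLUtils.kFold(data, numFolds = 10, seed = 42)
      val errors = folds.map { case (training, validation) =>
        val model = SVMWithSGD.train(training, 100)
        val wrong = validation.filter(p => model.predict(p.features) != p.label).count()
        wrong.toDouble / validation.count()
      }
      errors.sum / errors.length                  // mean held-out error across the 10 folds
    }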

Re: Performance problems on SQL JOIN

2014-06-20 Thread Evan R. Sparks
Also - you could consider caching your data after the first split (before the first filter), this will prevent you from retrieving the data from s3 twice. On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng men...@gmail.com wrote: Your data source is S3 and data is used twice. m1.large does not
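
A tiny spark-shell sketch of the caching suggestion (path and filter predicates assumed): read from S3 once, cache, and let both downstream branches reuse the in-memory partitions:

    val raw = sc.textFile("s3n://my-bucket/events/").cache()
    val errors = raw.filter(_.contains("ERROR"))     // first pass reads S3 and populates the cache
    val warnings = raw.filter(_.contains("WARN"))    // second pass hits the cached partitions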

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-19 Thread Evan R. Sparks
Larry, I don't see any reference to Spark in particular there. Additionally, the benchmark only scales up to datasets that are roughly 10gb (though I realize they've picked some fairly computationally intensive tasks), and they don't present their results on more than 4 nodes. This can hide

Re: How do you run your spark app?

2014-06-19 Thread Evan R. Sparks
I use SBT, create an assembly, and then add the assembly jars when I create my spark context. The main executor I run with something like java -cp ... MyDriver. That said - as of spark 1.0 the preferred way to run spark applications is via spark-submit -

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Evan R. Sparks
This looks like a job for SparkSQL! val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class MyRecord(country: String, name: String, age: Int, hits: Long) val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234), MyRecord("USA", "Bob", 55, 108),
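
A rough completion of that sketch (table name and aggregates assumed): register the RDD as a table and compute several aggregations in one GROUP BY pass. Note that Spark 1.0/1.1 used registerAsTable; later releases renamed it registerTempTable:

    data.registerAsTable("people")
    val aggregates = sqlContext.sql(
      "SELECT country, COUNT(*), AVG(age), SUM(hits) FROM people GROUP BY country")
    aggregates.collect().foreach(println)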

Re: pmml with augustus

2014-06-10 Thread Evan R. Sparks
I should point out that if you don't want to take a polyglot approach to languages and would rather stay solely in the JVM, then you can just use plain old java serialization on the Model objects that come out of MLlib's APIs from Java or Scala, load them up in another process, and call the relevant

Re: Random Forest on Spark

2014-04-18 Thread Evan R. Sparks
and I don't think that it's unreasonable for shallow trees. On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks evan.spa...@gmail.com wrote: What kind of data are you training on? These effects are *highly* data dependent, and while saying the depth of 10 is simply not adequate to build high-accuracy

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
Sorry - I meant to say that Multiclass classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib is planned and coming soon. On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Multiclass classification

Re: Random Forest on Spark

2014-04-17 Thread Evan R. Sparks
. With a huge amount of data (millions or even billions of rows), we found that the depth of 10 is simply not adequate to build high-accuracy models. On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Hmm... can you provide some pointers to examples where deep trees

Re: Status of MLI?

2014-04-07 Thread Evan R. Sparks
: Hi, Evan, Just noticed this thread, do you mind sharing more details regarding algorithms targeted at hyperparameter tuning/model selection? or a link to dev git repo for that work. thanks, yi On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Targeting 0.9.0

Re: Status of MLI?

2014-04-02 Thread Evan R. Sparks
see that the Github code is linked to Spark 0.8; will it not work with 0.9 (which is what I have set up) or higher versions? On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User List] [hidden email] wrote: Hi there, MLlib

Re: Status of MLI?

2014-04-01 Thread Evan R. Sparks
Hi there, MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so! - Evan On Tue, Apr 1, 2014 at 8:05 PM, Krakna H

Re: [HELP] ask for some information about public data set

2014-02-25 Thread Evan R. Sparks
Hi hyqgod, This is probably a better question for the spark user's list than the dev list (cc'ing user and bcc'ing dev on this reply). To answer your question, though: Amazon's Public Datasets Page is a nice place to start: http://aws.amazon.com/datasets/ - these work well with spark because