Hi Wush,
I'm CC'ing user@spark.apache.org (which is the new list) and BCC'ing
u...@spark.incubator.apache.org.
In Spark 1.3, SchemaRDD is in fact being renamed to DataFrame (see:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
)
As for a
Have you taken a look at the TeradataDBInputFormat? Spark is compatible
with arbitrary Hadoop InputFormats - so this might work for you:
http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw
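For illustration, here's a rough sketch of the general mechanism, using
TextInputFormat as a stand-in - you'd substitute the Teradata connector's
InputFormat class and set its connection properties on the Configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Spark can read through any Hadoop InputFormat via newAPIHadoopRDD.
val conf = new Configuration()
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///some/path")
val rows = sc.newAPIHadoopRDD(conf,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])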
On Thu, Jan 8, 2015 at 10:53 AM, gen tang gen.tan...@gmail.com
Chris,
Thanks for stopping by! Here's a simple example. Imagine I've got a corpus
of data, which is an RDD[String], and I want to do some POS tagging on it.
In naive spark, that might look like this:
import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos")
val proc = new StanfordCoreNLP(props)
val data =
This is probably not the right venue for general questions on CoreNLP - the
project website (http://nlp.stanford.edu/software/corenlp.shtml) provides
documentation and links to mailing lists/stack overflow topics.
On Mon, Nov 24, 2014 at 9:08 AM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com
Additionally - I strongly recommend using OpenBLAS over the ATLAS build
from the default Ubuntu repositories. Alternatively, you can build ATLAS on
the hardware you're actually going to be running the matrix ops on (the
master/workers), but we've seen modest performance gains doing this vs.
Neat hack! This is cute and actually seems to work - though the fact that
it does is a little surprising and somewhat unintuitive.
On Mon, Nov 24, 2014 at 8:08 AM, Ian O'Connell i...@ianoconnell.com wrote:
object MyCoreNLP {
  @transient lazy val coreNLP = new StanfordCoreNLP()
}
and then refer to it
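For example, a minimal sketch of doing just that (assuming data is the
RDD[String] of sentences from the earlier example):

// The lazy val is initialized once per executor JVM rather than being
// serialized and shipped from the driver - that's why the trick works.
val annotated = data.mapPartitions { iter =>
  val nlp = MyCoreNLP.coreNLP
  iter.map(sentence => nlp.process(sentence))
}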
You can try recompiling Spark with that option, and doing an sbt/sbt
publish-local, then change your Spark version from 1.1.0 to 1.2.0-SNAPSHOT
(assuming you're building from the 1.1 branch) - sbt or maven (whichever
you're compiling your app with) will pick up the version of Spark that you
just
For sharing RDDs across multiple jobs - you could also have a look at
Tachyon. It provides an HDFS-compatible in-memory storage layer that keeps
data in memory across multiple jobs/frameworks: http://tachyon-project.org/
On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal sonalgoy...@gmail.com
, save). And at some point during runtime these
sub-models merge into the master model, which also loads, trains, and saves
at the master level.
much appreciated.
On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
wrote:
There's some work going on to support PMML
You can imagine this same logic applying to the continuous case. E.g. what
if all the quartiles or deciles of a particular value have different
behavior - this could capture that too. Or what if some combination of
features was highly discriminative, but only into n buckets rather than
two... you
Plain old Java serialization is one straightforward approach if you're in
Java/Scala.
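A minimal sketch, with the helper names and path purely illustrative:

import java.io._

// Write any Serializable model (e.g. an MLlib Word2VecModel) to disk
// with plain Java serialization, and read it back later.
def saveModel(model: Serializable, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

def loadModel[T](path: String): T = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[T] finally in.close()
}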
On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:
what is the best way to save an mllib model that you just trained and
reload
it in the future? specifically, i'm using the mllib word2vec
On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote:
that works. is there a better way in spark? this seems like the most
common feature for any machine learning work - to be able to save your
model after training it and load it later.
On Fri, Nov 7, 2014 at 2:30 AM, Evan R
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized - so it is both doing the multiply and the first pass of
KMeans all at once. To isolate which part is slow you can run
cachedRows.numRows() to
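Roughly, the pattern is (assuming mat is a RowMatrix and b is the local
matrix you're multiplying by):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Cache the product's rows, then force materialization with a cheap
// action so the multiply cost isn't folded into KMeans' first pass.
val cachedRows = mat.multiply(b)   // b: a local Matrix
cachedRows.rows.cache()
cachedRows.rows.count()            // triggers the multiply, fills the cache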
In its current implementation, the principal components are computed in
MLlib in two steps:
1) In a distributed fashion, compute the covariance matrix - the result is
a local matrix.
2) On this local matrix, compute the SVD.
The sorting comes from the SVD. If you want to get the eigenvalues out,
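As a sketch of those two steps (mat is assumed to be a RowMatrix; Breeze
is what MLlib uses under the hood):

import breeze.linalg.{svd => brzSvd, DenseMatrix => BDM}

// Step 1: compute the covariance matrix in a distributed fashion;
// the result is a local Matrix.
val cov = mat.computeCovariance()

// Step 2: local SVD. For a symmetric PSD matrix like the covariance,
// the singular values are its eigenvalues, returned in decreasing
// order - which is where the sorting comes from.
// (Tuple destructuring matches the Breeze 0.7 API of that era; newer
// Breeze returns an svd.SVD case class instead.)
val brzCov = new BDM[Double](cov.numRows, cov.numCols, cov.toArray)
val (u, s, _) = brzSvd(brzCov)
// s holds the eigenvalues; the columns of u are the principal components.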
Asynchrony is not supported directly - Spark's programming model is
naturally BSP. I have seen cases where people have instantiated actors with
Akka on worker nodes to enable message passing, or even used Spark's own
ActorSystem to do this. But I do not recommend this, since you lose a
bunch of
I spoke with SK offline about this, it looks like the difference in timings
came from the fact that he was training 100 models for 100 iterations and
taking the total time (vs. my example which trains a single model for 100
iterations). I'm posting my response here, though, because I think it's
How many iterations are you running? Can you provide the exact details
about the size of the dataset? (how many data points, how many features) Is
this sparse or dense - and for the sparse case, how many non-zeroes? How
many partitions is your data RDD?
For very small datasets the scheduling
Hmm... something is fishy here.
That's a *really* small dataset for a Spark job, so almost all your time
will be spent in these overheads. But still, you should be able to train a
logistic regression model with the default options and 100 iterations in
under a second on a single machine.
Are you caching your
Specifically, reduceByKey expects a commutative/associative reduce
operation, and will automatically do this locally before a shuffle, which
means it acts like a combiner in MapReduce terms -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
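A tiny example of what that buys you:

import org.apache.spark.SparkContext._   // pair RDD functions (Spark 1.x)

// The per-key sums are computed locally within each partition first
// (the combiner step), and only the partial results get shuffled.
val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
  .reduceByKey(_ + _)
counts.collect()   // Array((a,2), (b,1))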
On Thu,
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny)
SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark
1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data
The loss functions are represented in the various names of the model
families. SVM is hinge loss, LogisticRegression is logistic loss,
LinearRegression is squared loss. These are used internally as arguments to
the SGD and L-BFGS optimizers.
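For instance, here's roughly how the hinge loss gets exercised through
SVMWithSGD (the parameter values are illustrative):

import org.apache.spark.mllib.classification.SVMWithSGD

// SVMWithSGD wires a HingeGradient into its gradient descent optimizer
// internally; the optimizer is exposed if you want to tune it.
val svm = new SVMWithSGD()
svm.optimizer
  .setNumIterations(100)
  .setRegParam(0.01)
val model = svm.run(trainingData)   // trainingData: RDD[LabeledPoint]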
On Thu, Aug 7, 2014 at 6:31 PM, SK
Computing the variance is similar to this example - you just need to keep
around the sum of squares as well.
The formula for variance is (sumsq/n) - (sum/n)^2
But with big datasets or large values, you can quickly run into overflow
issues - MLlib handles this by maintaining the average sum of
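In the naive form (ignoring the overflow caveat), that's one pass over
an RDD[Double]:

// Collect (count, sum, sum of squares) in a single reduce, then apply
// variance = sumsq/n - (sum/n)^2 from above.
val (n, sum, sumsq) = data.map(x => (1L, x, x * x))
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
val variance = sumsq / n - math.pow(sum / n, 2)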
Ignoring my warning about overflow - even more functional - just use a
reduceByKey.
Since your main operation is just a bunch of summing, you've got a
commutative-associative reduce operation, and Spark will do everything
cluster-parallel, then shuffle the (small) result set and merge
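Sketched out, with pairs assumed to be an RDD[(String, Double)]:

import org.apache.spark.SparkContext._   // pair RDD functions (Spark 1.x)

// Per-key (sum, sum of squares, count), merged with a commutative-
// associative reduce, then the variance formula applied per key.
val variances = pairs
  .mapValues(x => (x, x * x, 1L))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
  .mapValues { case (s, sq, n) => sq / n - math.pow(s / n, 2) }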
Can you share the dataset via a gist or something and we can take a look at
what's going on?
On Fri, Jul 25, 2014 at 10:51 AM, SK skrishna...@gmail.com wrote:
yes, the output is continuous. So I used a threshold to get binary labels.
If prediction < threshold, then class is 0, else 1. I use
Try sc.getExecutorStorageStatus.length
SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will
give you back an object per executor - the StorageStatus objects are what
drives a lot of the Spark Web UI.
To be clear - each of the RDDs is still a distributed dataset and each of
the individual SVM models will be trained in parallel across the cluster.
Sean's suggestion effectively has you submitting multiple spark jobs
simultaneously, which, depending on your cluster configuration and the size
of
There is a method in org.apache.spark.mllib.util.MLUtils called kFold
which will automatically partition your dataset for you into k train/test
splits at which point you can build k different models and aggregate the
results.
For example (a very rough sketch - assuming I want to do 10-fold cross
validation):
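import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

// Split data (an RDD[LabeledPoint]) into 10 train/test folds, train a
// model per fold, and average a simple accuracy metric across folds.
// The model choice and metric here are just for illustration.
val folds = MLUtils.kFold(data, numFolds = 10, seed = 42)
val accuracies = folds.map { case (training, validation) =>
  val model = SVMWithSGD.train(training, numIterations = 100)
  val correct = validation.filter(p => model.predict(p.features) == p.label).count()
  correct.toDouble / validation.count()
}
val meanAccuracy = accuracies.sum / accuracies.length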
Also - you could consider caching your data after the first split (before
the first filter); this will prevent you from retrieving the data from S3
twice.
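Something like this (the parsing and S3 path are illustrative):

// Read from S3 once, parse, and cache before the filters fork the
// pipeline - otherwise each branch re-reads the source.
val parsed = sc.textFile("s3n://my-bucket/data.csv")
  .map(_.split(','))
  .cache()
val train = parsed.filter(r => r(0) == "train")
val test = parsed.filter(r => r(0) == "test")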
On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng men...@gmail.com wrote:
Your data source is S3 and data is used twice. m1.large does not
Larry,
I don't see any reference to Spark in particular there.
Additionally, the benchmark only scales up to datasets that are roughly
10GB (though I realize they've picked some fairly computationally intensive
tasks), and they don't present their results on more than 4 nodes. This can
hide
I use SBT, create an assembly, and then add the assembly jars when I create
my Spark context. The main driver I run with something like java -cp ...
MyDriver.
That said - as of Spark 1.0, the preferred way to run Spark applications is
via spark-submit -
This looks like a job for SparkSQL!
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int, hits: Long)
val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
  MyRecord("USA", "Bob", 55, 108),
I should point out that if you don't want to take a polyglot approach to
languages and reside solely in the JVM, then you can just use plain old
Java serialization on the Model objects that come out of MLlib's APIs from
Java or Scala and load them up in another process and call the relevant
and I don't think that it's
unreasonable for shallow trees.
On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
What kind of data are you training on? These effects are *highly* data
dependent, and while saying "the depth of 10 is simply not adequate to
build high-accuracy
Sorry - I meant to say that Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon.
On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
Multiclass classification
With a huge amount of data (millions or even billions of rows), we found
that the depth of 10 is simply not adequate to build high-accuracy models.
On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
Hmm... can you provide some pointers to examples where deep trees
Hi, Evan,
Just noticed this thread - do you mind sharing more details regarding the
algorithms targeted at hyperparameter tuning/model selection, or a link
to the dev git repo for that work?
thanks,
yi
On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
Targeting 0.9.0
see that the GitHub
code is linked to Spark 0.8; will it not work with 0.9 (which is what I
have set up) or higher versions?
On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User
List] [hidden email] wrote:
Hi there,
MLlib
Hi there,
MLlib is the first component of MLbase - MLI and the higher levels of the
stack are still being developed. Look for updates in terms of our progress
on the hyperparameter tuning/model selection problem in the next month or
so!
- Evan
On Tue, Apr 1, 2014 at 8:05 PM, Krakna H
Hi hyqgod,
This is probably a better question for the Spark user list than the dev
list (cc'ing user and bcc'ing dev on this reply).
To answer your question, though:
Amazon's Public Datasets page is a nice place to start:
http://aws.amazon.com/datasets/ - these work well with Spark because