Re: Construct model matrix from SchemaRDD automatically
Hi Wush, I'm CC'ing user@spark.apache.org (which is the new list) and BCC'ing u...@spark.incubator.apache.org.

In Spark 1.3, SchemaRDD is in fact being renamed to DataFrame (see: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html ).

As for a model.matrix, you might have a look at the new pipelines API in Spark 1.2 (to be further improved in 1.3), which provides facilities for repeatable data transformation as input to ML algorithms. That said, something to handle the case of automatically one-hot encoding all the categorical variables in a DataFrame might be a welcome addition.

- Evan

On Thu, Mar 5, 2015 at 8:43 PM, Wush Wu w...@bridgewell.com wrote:
Dear all, I am a new Spark user coming from R. After exploring SchemaRDD, I noticed that it is similar to R's data.frame. Is there a feature like `model.matrix` in R to convert a SchemaRDD to a model matrix automatically according to the column types, without explicitly converting them one by one? Thanks, Wush
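As a rough illustration of the one-hot encoding step discussed above, here is a plain-Scala sketch of what a model.matrix-style transformation does for one categorical column. The object and method names are made up for illustration, and the logic runs on a local Seq rather than a SchemaRDD/DataFrame:

```scala
object OneHotSketch {
  // Map each distinct level to its 0/1 indicator vector (levels sorted for determinism).
  def oneHot(values: Seq[String]): Map[String, Array[Double]] = {
    val levels = values.distinct.sorted
    levels.map { level =>
      level -> levels.map(l => if (l == level) 1.0 else 0.0).toArray
    }.toMap
  }

  // Replace every categorical value in the column with its indicator vector.
  def encode(column: Seq[String]): Seq[Array[Double]] = {
    val table = oneHot(column)
    column.map(table)
  }
}
```

A distributed version would first collect the distinct levels of the column, broadcast them, and then map each row to its 0/1 vector.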
Re: Spark on teradata?
Have you taken a look at the TeradataDBInputFormat? Spark is compatible with arbitrary Hadoop input formats, so this might work for you: http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw

On Thu, Jan 8, 2015 at 10:53 AM, gen tang gen.tan...@gmail.com wrote:
Thanks a lot for your reply. In fact, I need to work on almost all the data in Teradata (~100T), so I don't think that JdbcRDD is a good choice. Cheers, Gen

On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin r...@databricks.com wrote:
It depends on your use case. If the use case is to extract a small amount of data out of Teradata, then you can use the JdbcRDD, and soon a JDBC input source based on the new Spark SQL external data source API.

On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote:
Hi, I have a stupid question: Is it possible to use Spark on a Teradata data warehouse? I read some news on the internet that says yes; however, I didn't find any example of this. Thanks in advance. Cheers, Gen
Re: Spark and Stanford CoreNLP
Chris, thanks for stopping by! Here's a simple example. Imagine I've got a corpus of data as an RDD[String], and I want to do some POS tagging on it. In naive Spark, that might look like this:

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos")
val proc = new StanfordCoreNLP(props)

val data = sc.textFile("hdfs://some/distributed/corpus")

def processData(s: String): Annotation = {
  val a = new Annotation(s)
  proc.annotate(a)
  a
}

val processedData = data.map(processData) // Note that this is actually executed lazily.

Under the covers, Spark takes the closure (processData), serializes it and all objects/methods that it references (including proc), and ships the serialized closure off to workers so that they can run it on their local partitions of the corpus. The issue at hand is that since the StanfordCoreNLP object isn't serializable, *this will fail at runtime.* Hence the solutions suggested in this thread, which all come down to initializing the processor on the worker side (preferably once). Your intuition about not wanting to serialize huge objects is fine. This issue is not unique to CoreNLP: any Java library with non-serializable objects will face it. HTH, Evan

On Tue, Nov 25, 2014 at 8:05 AM, Christopher Manning mann...@stanford.edu wrote:
I'm not (yet!) an active Spark user, but saw this thread on twitter ... and am involved with Stanford CoreNLP. Could someone explain how things need to be structured to work better with Spark, since that would be a useful goal? That is, while Stanford CoreNLP is not quite uniform (having been developed by various people for over a decade), the general approach has always been that models should be serializable but that processors should not be. This makes sense to me intuitively: it doesn't really make sense to serialize a processor, which often has large mutable data structures used for processing. But does that not work well with Spark?
Do processors need to be serializable, and then one needs to go through and make all the elements of the processor transient? Or what? Thanks! Chris

On Nov 25, 2014, at 7:54 AM, Evan Sparks evan.spa...@gmail.com wrote:
If you only mark it as transient, then the object won't be serialized, and on the worker the field will be null. When the worker goes to use it, you get an NPE. Marking it lazy defers initialization to first use. If that use happens after serialization (e.g. on the worker), then the worker will first check whether it's initialized, and initialize it if not. I think if you *do* reference the lazy val before serializing, you will likely get an NPE.

On Nov 25, 2014, at 1:05 AM, Theodore Vasiloudis theodoros.vasilou...@gmail.com wrote:
Great, Ian's approach seems to work fine. Can anyone provide an explanation as to why this works, but marking the CoreNLP object itself as transient does not?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Stanford-CoreNLP-tp19654p19739.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
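The behavior Evan describes (lazy initialization on first use, at most once per JVM) can be demonstrated without Spark or CoreNLP at all. A minimal plain-Scala sketch, with a cheap function standing in for the real processor (the names here are made up for illustration):

```scala
object InitCounter {
  var initializations = 0 // visible side effect so we can observe the init count
}

object HeavyProcessor {
  // Stand-in for: @transient lazy val proc = new StanfordCoreNLP(props)
  // The lazy val body runs at most once per JVM/classloader, on first use.
  @transient lazy val proc: String => String = {
    InitCounter.initializations += 1 // expensive model loading would happen here
    (s: String) => s.toUpperCase     // stand-in for proc.annotate(...)
  }
}

object TaskRunner {
  // Many "tasks" on the same JVM share the single initialized instance.
  def runTasks(data: Seq[String]): Seq[String] = data.map(HeavyProcessor.proc)
}
```

On a Spark worker, the first task to touch the lazy val pays the initialization cost locally; nothing heavy ever crosses the serialization boundary.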
Re: Spark and Stanford CoreNLP
This is probably not the right venue for general questions on CoreNLP - the project website (http://nlp.stanford.edu/software/corenlp.shtml) provides documentation and links to mailing lists/stack overflow topics.

On Mon, Nov 24, 2014 at 9:08 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote:
Hello, I'm new to Stanford CoreNLP. Could anyone share good training material and examples (Java or Scala) on NLP? Regards, Rajesh

On Mon, Nov 24, 2014 at 9:38 PM, Ian O'Connell i...@ianoconnell.com wrote:

object MyCoreNLP {
  @transient lazy val coreNLP = new StanfordCoreNLP(props) // props built locally
}

and then refer to it from your map/reduce/mapPartitions and it should be fine (presuming it's thread safe); it will only be initialized once per classloader per JVM.

On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks evan.spa...@gmail.com wrote:
We have gotten this to work, but it requires instantiating the CoreNLP object on the worker side. Because of the initialization time, it makes a lot of sense to do this inside of a .mapPartitions instead of a .map, for example. As an aside, if you're using it from Scala, have a look at sistanlp, which provides a nicer, Scala-friendly interface to CoreNLP.

On Nov 24, 2014, at 7:46 AM, tvas theodoros.vasilou...@gmail.com wrote:
Hello, I was wondering if anyone has gotten the Stanford CoreNLP Java library to work with Spark. My attempts to use the parser/annotator fail because of task serialization errors, since the class StanfordCoreNLP cannot be serialized. I've tried the remedies of registering StanfordCoreNLP through Kryo, as well as using chill.MeatLocker, but these still produce serialization errors. Marking the StanfordCoreNLP object as transient leads to a NullPointerException instead. Has anybody managed to get this to work? Regards, Theodore

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Stanford-CoreNLP-tp19654.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Mllib native netlib-java/OpenBLAS
Additionally - I strongly recommend using OpenBLAS over the ATLAS build from the default Ubuntu repositories. Alternatively, you can build ATLAS on the hardware you're actually going to be running the matrix ops on (the master/workers), but we've seen only modest performance gains doing this vs. OpenBLAS, at least on the bigger EC2 machines (e.g. cc2.8xlarge, c3.8xlarge).

On Mon, Nov 24, 2014 at 11:26 AM, Xiangrui Meng men...@gmail.com wrote:
Try building Spark with -Pnetlib-lgpl, which includes the JNI library in the Spark assembly jar. This is the simplest approach. If you want to include it as part of your project, make sure the library is inside the assembly jar, or specify it via `--jars` with spark-submit. -Xiangrui

On Mon, Nov 24, 2014 at 8:51 AM, agg212 alexander_galaka...@brown.edu wrote:
Hi, I'm trying to improve performance for Spark's MLlib, and I am having trouble getting native netlib-java libraries installed/recognized by Spark. I am running on a single machine, Ubuntu 14.04, and here is what I've tried:

sudo apt-get install libgfortran3
sudo apt-get install libatlas3-base libopenblas-base

(this is how netlib-java's website says to install it). I also double-checked, and it looks like the libraries are linked correctly in /usr/lib (see below):

/usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
/usr/lib/liblapack.so.3 -> /etc/alternatives/liblapack.so.3

The Dependencies section of Spark's MLlib website also says to include com.github.fommil.netlib:all:1.1.2 as a dependency. I therefore tried adding this to my sbt file like so:

libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2"

After all this, I'm still seeing the following error message. Does anyone have more detailed installation instructions?

14/11/24 16:49:29 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
14/11/24 16:49:29 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

Thanks!
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
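For completeness, the sbt wiring usually looks something like the following. This is a hedged sketch based on netlib-java's published instructions; the `pomOnly()` suffix reflects that the `all` artifact is a pom-only dependency (verify against the netlib-java README for your version):

```scala
// build.sbt (fragment) - pulls in netlib-java with all native backends
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
```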
Re: Spark and Stanford CoreNLP
Neat hack! This is cute and actually seems to work. The fact that it works is a little surprising and somewhat unintuitive.

On Mon, Nov 24, 2014 at 8:08 AM, Ian O'Connell i...@ianoconnell.com wrote:

object MyCoreNLP {
  @transient lazy val coreNLP = new StanfordCoreNLP(props) // props built locally
}

and then refer to it from your map/reduce/mapPartitions and it should be fine (presuming it's thread safe); it will only be initialized once per classloader per JVM.

On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks evan.spa...@gmail.com wrote:
We have gotten this to work, but it requires instantiating the CoreNLP object on the worker side. Because of the initialization time, it makes a lot of sense to do this inside of a .mapPartitions instead of a .map, for example. As an aside, if you're using it from Scala, have a look at sistanlp, which provides a nicer, Scala-friendly interface to CoreNLP.

On Nov 24, 2014, at 7:46 AM, tvas theodoros.vasilou...@gmail.com wrote:
Hello, I was wondering if anyone has gotten the Stanford CoreNLP Java library to work with Spark. My attempts to use the parser/annotator fail because of task serialization errors, since the class StanfordCoreNLP cannot be serialized. I've tried the remedies of registering StanfordCoreNLP through Kryo, as well as using chill.MeatLocker, but these still produce serialization errors. Marking the StanfordCoreNLP object as transient leads to a NullPointerException instead. Has anybody managed to get this to work? Regards, Theodore

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Stanford-CoreNLP-tp19654.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Mllib native netlib-java/OpenBLAS
You can try recompiling Spark with that option and doing an sbt/sbt publish-local, then changing your Spark version from 1.1.0 to the version you just built (e.g. 1.2.0-SNAPSHOT if you're building from master) - sbt or maven (whichever you're compiling your app with) will pick up the version of Spark that you just built.

On Mon, Nov 24, 2014 at 6:31 PM, agg212 alexander_galaka...@brown.edu wrote:
I am running it in local mode. How can I use the built version (in local mode) so that I can use the native libraries?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662p19705.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Best practice for multi-user web controller in front of Spark
For sharing RDDs across multiple jobs, you could also have a look at Tachyon. It provides an HDFS-compatible in-memory storage layer that keeps data in memory across multiple jobs/frameworks - http://tachyon-project.org/

On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal sonalgoy...@gmail.com wrote:
I believe the Spark Job Server by Ooyala can help you share data across multiple jobs; take a look at http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems to fit closely with what you need. Best Regards, Sonal (Founder, Nube Technologies, http://www.nubetech.co, http://in.linkedin.com/in/sonalgoyal)

On Tue, Nov 11, 2014 at 7:20 PM, bethesda swearinge...@mac.com wrote:
We are relatively new to Spark, and so far, during our development process, have been manually submitting single jobs at a time for ML training using spark-submit. Each job accepts a small user-submitted data set and compares it to every data set in our HDFS corpus, which only changes incrementally on a daily basis (that detail is relevant to question 3 below). Now we are ready to start building out the front-end, which will allow a team of data scientists to submit their problems to the system via a web front-end (the web tier will be Java). Users could of course be submitting jobs more or less simultaneously. We want to make sure we understand how to best structure this. Questions:

1 - Does a new SparkContext get created in the web tier for each new request for processing?

2 - If so, how much time should we expect it to take to set up the context? Our goal is to return a response to the users in under 10 seconds, but if it takes many seconds to create a new context or otherwise set up the job, then we need to adjust our expectations for what is possible. From using spark-shell, one might conclude that it takes more than 10 seconds to create a context; however, it's not clear how much of that is context creation vs. other things.
3 - (This last question perhaps deserves a post in and of itself.) If every job is always comparing some little data structure to the same HDFS corpus of data, what is the best pattern to use to cache the RDDs from HDFS so they don't always have to be re-constituted from disk? I.e., how can RDDs be shared from the context of one job to the context of subsequent jobs? Or does something like memcache have to be used? Thanks! David

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: word2vec: how to save an mllib model and reload it?
There are a few examples where this is the case. Let's take ALS, where the result is a MatrixFactorizationModel, which is assumed to be big: the model consists of two matrices, one (users x k) and one (k x products), represented as RDDs. You can save these RDDs out to disk by doing something like

model.userFeatures.saveAsObjectFile(...)
model.productFeatures.saveAsObjectFile(...)

to save out to HDFS or Tachyon or S3. Then, when you want to reload, you'd have to instantiate them into a MatrixFactorizationModel. That class is package private to MLlib right now, so you'd need to copy the logic over to a new class, but that's the basic idea.

That said, using Spark to serve these recommendations on a point-by-point basis might not be optimal. There's some work going on in the AMPLab to address this issue.

On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote:
You're right, serialization works. What is your suggestion on saving a distributed model, where part of the model is in one cluster and other parts are in other clusters? During runtime, these sub-models run independently in their own clusters (load, train, save), and at some point during runtime these sub-models merge into the master model, which also loads, trains, and saves at the master level. Much appreciated.

On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com wrote:
There's some work going on to support PMML - https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been merged into master. What are you used to doing in other environments? In R I'm used to running save(), same with MATLAB. In Python, either pickling things or dumping to JSON seems pretty common (even the scikit-learn docs recommend pickling - http://scikit-learn.org/stable/modules/model_persistence.html). These all seem basically equivalent to Java serialization to me..
Would some helper functions (in, say, mllib.util.modelpersistence or something) make sense to add?

On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote:
That works. Is there a better way in Spark? This seems like the most common feature for any machine learning work: to be able to save your model after training it and load it later.

On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com wrote:
Plain old Java serialization is one straightforward approach if you're in Java/Scala.

On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:
What is the best way to save an MLlib model that you just trained and reload it in the future? Specifically, I'm using the MLlib word2vec model... Thanks.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: why decision trees do binary split?
You can imagine this same logic applying to the continuous case. E.g. what if all the quartiles or deciles of a particular value have different behavior - this could capture that too. Or what if some combination of features was highly discriminative, but only split into n buckets.. you can see there are lots of different options here.

In general in MLlib, we're trying to support widely accepted and frequently used ML models, and simply offer a platform to efficiently train these with Spark. While decision trees with n-ary splits might be a sensible thing to explore, they are not widely used in practice, and I'd want to see some compelling results from proper ML/stats researchers before shipping them as a default feature. If you're looking for a way to control variance and pick up nuance in your dataset that's not covered by plain decision trees, I recommend looking at Random Forests - a well-studied extension to decision trees that's also widely used in practice - and coming to MLlib soon!

On Thu, Nov 6, 2014 at 3:29 AM, Tamas Jambor jambo...@gmail.com wrote:
Thanks for the reply, Sean. I can see that splitting on all the categories would probably overfit the tree; on the other hand, it might give more insight into the subcategories (it probably only would work if the data is uniformly distributed between the categories). I haven't really found any comparison between the two methods in terms of performance and interpretability. Thanks,

On Thu, Nov 6, 2014 at 9:46 AM, Sean Owen so...@cloudera.com wrote:
I haven't seen that done before, which may be most of the reason - I am not sure that is common practice. I can see upsides - you need not pick candidate splits to test, since there is only one N-way rule possible. The binary split equivalent is N levels instead of 1.
The big problem is that you are always segregating the data set entirely, making the equivalent of those N binary rules even when you would not otherwise bother, because they don't add information about the target. The subsets matching each child are therefore unnecessarily small, and this makes learning on each independent subset weaker.

On Nov 6, 2014 9:36 AM, jamborta jambo...@gmail.com wrote:
I meant above that, in the case of categorical variables, it might be more efficient to create a node on each categorical value. Is there a reason why Spark went down the binary route? Thanks,

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-decision-trees-do-binary-split-tp18188p18265.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
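The binary-split approach on a categorical feature can be sketched in a few lines of plain Scala: instead of one N-way split, consider binary partitions of the category set and pick the one that best separates the labels. The impurity measure (Gini) and toy data are illustrative choices here, not MLlib's exact code:

```scala
object CategoricalSplit {
  // Gini impurity of a set of 0/1 labels.
  def gini(labels: Seq[Int]): Double =
    if (labels.isEmpty) 0.0
    else {
      val p = labels.sum.toDouble / labels.size
      2 * p * (1 - p)
    }

  // Weighted impurity after sending the categories in leftCats to the left child.
  def splitImpurity(rows: Seq[(String, Int)], leftCats: Set[String]): Double = {
    val (l, r) = rows.partition { case (c, _) => leftCats(c) }
    val n = rows.size.toDouble
    (l.size / n) * gini(l.map(_._2)) + (r.size / n) * gini(r.map(_._2))
  }

  // Try every non-trivial binary partition of the category set, keep the best.
  def bestBinarySplit(rows: Seq[(String, Int)]): (Set[String], Double) = {
    val cats = rows.map(_._1).distinct
    val candidates = (1 until (1 << cats.size) - 1).map { mask =>
      cats.zipWithIndex.collect { case (c, i) if ((mask >> i) & 1) == 1 => c }.toSet
    }
    candidates.map(s => (s, splitImpurity(rows, s))).minBy(_._2)
  }
}
```

This exhaustive search is exponential in the number of categories, which is one reason real implementations order categories by label statistics rather than enumerating all subsets.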
Re: word2vec: how to save an mllib model and reload it?
Plain old Java serialization is one straightforward approach if you're in Java/Scala.

On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:
What is the best way to save an MLlib model that you just trained and reload it in the future? Specifically, I'm using the MLlib word2vec model... Thanks.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
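The "plain old Java serialization" suggestion can be sketched as a small pair of helpers. The helper names (saveModel/loadModel) are made up here, and a plain Map of weights stands in for a real MLlib model object:

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

object ModelIO {
  // Write any Serializable model to a local path.
  def saveModel[T <: Serializable](model: T, path: String): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(model) finally out.close()
  }

  // Read it back; the caller asserts the expected type.
  def loadModel[T](path: String): T = {
    val in = new ObjectInputStream(new FileInputStream(path))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

The obvious caveat is that Java serialization ties the saved bytes to the class definition, so it works best when the same library version reads and writes the model.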
Re: word2vec: how to save an mllib model and reload it?
There's some work going on to support PMML - https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been merged into master. What are you used to doing in other environments? In R I'm used to running save(), same with MATLAB. In Python, either pickling things or dumping to JSON seems pretty common (even the scikit-learn docs recommend pickling - http://scikit-learn.org/stable/modules/model_persistence.html). These all seem basically equivalent to Java serialization to me.. Would some helper functions (in, say, mllib.util.modelpersistence or something) make sense to add?

On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote:
That works. Is there a better way in Spark? This seems like the most common feature for any machine learning work: to be able to save your model after training it and load it later.

On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com wrote:
Plain old Java serialization is one straightforward approach if you're in Java/Scala.

On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:
What is the best way to save an MLlib model that you just trained and reload it in the future? Specifically, I'm using the MLlib word2vec model... Thanks.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to run kmeans after pca?
Caching after doing the multiply is a good idea. Keep in mind that during the first iteration of KMeans, the cached rows haven't yet been materialized, so that iteration is doing both the multiply and the first pass of KMeans at once. To isolate which part is slow, you can run cachedRows.numRows() to force the rows to be materialized before you run KMeans.

Also, KMeans is optimized to run quickly on both sparse and dense data. The result of PCA is going to be dense, but if your input data has #nnz ~= size(pca data), performance might be about the same. (I haven't actually verified this last point.)

Finally, speed is partially going to depend on how much data you have relative to scheduler overheads - if your input data is small, it could be that the cost of distributing your tasks is greater than the time spent actually computing. Usually this manifests as the stages taking about the same amount of time even though you're passing datasets of different dimensionality.

On Tue, Sep 30, 2014 at 9:00 AM, st553 sthompson...@gmail.com wrote:
Thanks for your response Burak, it was very helpful. I am noticing that if I run PCA before KMeans, the KMeans algorithm will actually take longer to run than if I had just run KMeans without PCA. I was hoping that using PCA first would actually speed up the KMeans algorithm. I have followed the steps you've outlined, but I'm wondering if I need to cache/persist the RDD[Vector] rows of the RowMatrix returned after multiplying. Something like:

val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)
val cachedRows = newData.rows.persist()
KMeans.run(cachedRows)
cachedRows.unpersist()

It doesn't seem intuitive to me that a smaller-dimensional version of my data set would take longer for KMeans... unless I'm missing something? Thanks!
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-kmeans-after-pca-tp14473p15409.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark1.0 principal component analysis
In its current implementation, the principal components are computed in MLlib in two steps: 1) In a distributed fashion, compute the covariance matrix - the result is a local matrix. 2) On this local matrix, compute the SVD. The sorting comes from the SVD. If you want to get the eigenvalues out, you can simply run step 1 yourself on your RowMatrix via the (experimental) computeCovariance() method, and then run SVD on the result using a library like breeze. - Evan

On Tue, Sep 23, 2014 at 12:49 PM, st553 sthompson...@gmail.com wrote:
sowen wrote: "it seems that the singular values from the SVD aren't returned, so I don't know that you can access this directly". It's not clear to me why these aren't returned? The S matrix would be useful to determine a reasonable value for K.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-principal-component-analysis-tp9249p14919.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
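The two-step recipe above can be sketched on a tiny local dataset in plain Scala: (1) compute the covariance matrix, (2) eigendecompose it. For the 2x2 case the eigenvalues have a closed form; a real implementation would use computeCovariance() for step 1 and breeze for step 2, as suggested in the thread:

```scala
object PcaSketch {
  // Sample covariance matrix of the rows (d x d), computed locally.
  def covariance(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val n = rows.size
    val d = rows.head.length
    val mean = Array.tabulate(d)(j => rows.map(_(j)).sum / n)
    Array.tabulate(d, d) { (i, j) =>
      rows.map(r => (r(i) - mean(i)) * (r(j) - mean(j))).sum / (n - 1)
    }
  }

  // Closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]], largest first.
  def eigenvalues2x2(m: Array[Array[Double]]): (Double, Double) = {
    val (a, b, c) = (m(0)(0), m(0)(1), m(1)(1))
    val half = (a + c) / 2
    val disc = math.sqrt(half * half - (a * c - b * b))
    (half + disc, half - disc)
  }
}
```

Inspecting the eigenvalue spectrum this way is exactly what the poster wants the S matrix for: perfectly correlated columns give one large eigenvalue and one near zero, so K = 1 suffices.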
Re: Message Passing among workers
Asynchrony is not supported directly - Spark's programming model is naturally BSP. I have seen cases where people have instantiated actors with Akka on worker nodes to enable message passing, or even used Spark's own ActorSystem to do this. But I do not recommend this, since you lose a bunch of the benefits of Spark, e.g. fault tolerance. Instead, I would think about whether your algorithm can be cast as a BSP one, or about how frequently you really need to synchronize state among your workers. It may be that having the occasional synchronization barrier is OK.

On Wed, Sep 3, 2014 at 7:28 AM, laxmanvemula laxman8...@gmail.com wrote:
Hi, I would like to implement an asynchronous distributed optimization algorithm where workers communicate with one another. It is similar to belief propagation, where each worker is a vertex in the graph. Can someone let me know if this is possible using Spark? Thanks, Laxman

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Message-Passing-among-workers-tp13355.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: mllib performance on cluster
I spoke with SK offline about this; it looks like the difference in timings came from the fact that he was training 100 models for 100 iterations and taking the total time (vs. my example, which trains a single model for 100 iterations). I'm posting my response here, though, because I think it's worth documenting:

Benchmarking on a dataset this small with this many cores is probably not going to give you any meaningful information about how the algorithms scale to real data problems. In this case, you've thrown 200 cores at 5.6 KB of data - 200 low-dimensional data points. The overheads of scheduling tasks, sending them out to each worker, and network latencies between the nodes - which are essentially fixed regardless of problem size - completely dominate the time spent computing, which in the first two cases is 9-10 flops per data point and in the last case a couple of array lookups and adds per data point.

It would make a lot more sense to find or generate a dataset that's 10 or 100 GB and see how performance scales there. You can do this with the code I pasted earlier; just change the second, third, and fourth arguments to an appropriate number of examples, dimensionality, and number of partitions matching the number of cores you have on your cluster. In short, don't use a cluster unless you need one :). Hope this helps!

On Tue, Sep 2, 2014 at 3:51 PM, SK skrishna...@gmail.com wrote:
The dataset is quite small: 5.6 KB. It has 200 rows, 3 feature columns, and 1 column of labels. From this dataset, I split 80% for the training set and 20% for the test set. The features are integer counts and the labels are binary (1/0). Thanks

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290p13311.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: mllib performance on cluster
How many iterations are you running? Can you provide the exact details about the size of the dataset? (How many data points, how many features?) Is this sparse or dense - and in the sparse case, how many non-zeroes? How many partitions does your data RDD have?

For very small datasets, the scheduling overheads of shipping tasks across the cluster and delays due to stragglers can dominate the time actually spent doing your parallel computation. If you have too few partitions, you won't be taking advantage of cluster parallelism, and if you have too many, you're introducing even more of the aforementioned overheads.

On Tue, Sep 2, 2014 at 11:24 AM, SK skrishna...@gmail.com wrote:
Hi, I evaluated the runtime performance of some of the MLlib classification algorithms on a local machine and a cluster with 10 nodes. I used standalone mode and Spark 1.0.1 in both cases. Here are the results for the total runtime:

                      Local     Cluster
Logistic regression   138 sec   336 sec
SVM                   138 sec   336 sec
Decision tree          50 sec   132 sec

My dataset is quite small, and my programs are very similar to the MLlib examples that are included in the Spark distribution. Why is the runtime on the cluster significantly higher (almost 3 times) than that on the local machine, even though the former uses more memory and more nodes? Is it because of the communication overhead on the cluster? I would like to know if there is something I need to be doing to optimize the performance on the cluster, or if others have also been getting similar results. Thanks

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: mllib performance on cluster
Hmm... something is fishy here. That's a *really* small dataset for a spark job, so almost all your time will be spent in these overheads, but still - you should be able to train a logistic regression model with the default options and 100 iterations in about 1s on a single machine. Are you caching your dataset before training the classifier on it? It's possible that you're re-reading it from disk (or across the network, maybe) on every iteration. From spark-shell:

import org.apache.spark.mllib.util.LogisticRegressionDataGenerator
val dat = LogisticRegressionDataGenerator.generateLogisticRDD(sc, 200, 3, 1e-4, 4, 0.2).cache()
println(dat.count()) // should give 200

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
val start = System.currentTimeMillis
val model = LogisticRegressionWithSGD.train(dat, 100)
val delta = System.currentTimeMillis - start
println(delta) // On my laptop, 863ms.

On Tue, Sep 2, 2014 at 3:51 PM, SK skrishna...@gmail.com wrote: The dataset is quite small: 5.6 KB. It has 200 rows and 3 features, and 1 column of labels. From this dataset, I split 80% for the training set and 20% for the test set. The features are integer counts and the labels are binary (1/0). thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290p13311.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: reduceByKey to get all associated values
Specifically, reduceByKey expects a commutative/associative reduce operation, and will automatically perform it locally before a shuffle, which means it acts like a combiner in MapReduce terms - http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions On Thu, Aug 7, 2014 at 8:15 AM, Cheng Lian lian.cs@gmail.com wrote: The point is that in many cases the operation passed to reduceByKey aggregates data into a much smaller size, say + and * for integers. String concatenation doesn't actually "shrink" data, thus in your case, rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer similar performance issues. In general, don't do these unless you have to. And in Konstantin's case, I guess he knows what he's doing. At least we can't know whether we can help to optimize until further information about the business logic is provided. On Aug 7, 2014, at 10:22 PM, chutium teng@gmail.com wrote: a long time ago, in Spark Summit 2013, Patrick Wendell said in his talk about performance ( http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/ ) that reduceByKey will be more efficient than groupByKey... he mentioned groupByKey copies all data over the network. is that still true? which one should we choose? because actually we can replace all uses of groupByKey with reduceByKey. for example, if we want to use groupByKey on an RDD[ String, String ] to get an RDD[ String, Seq[String] ], we can also do it with reduceByKey: first, map RDD[ String, String ] to RDD[ String, Seq[String] ]; then reduceByKey(_ ++ _) on this RDD[ String, Seq[String] ] -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKey-to-get-all-associated-values-tp11645p11652.html Sent from the Apache Spark User List mailing list archive at Nabble.com. 
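The map-side combine that reduceByKey performs can be sketched in plain Python (a toy model of partitions, not Spark's actual implementation): each partition is reduced locally first, so at most one (key, value) pair per key per partition crosses the "shuffle" boundary.

```python
def reduce_by_key(partitions, op):
    """Toy model of reduceByKey: combine locally per partition, then merge."""
    # Map side: one combined value per key per partition (the "combiner" step).
    locally_combined = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = op(acc[k], v) if k in acc else v
        locally_combined.append(acc)
    # "Shuffle" + reduce side: merge the small per-partition maps.
    merged = {}
    for acc in locally_combined:
        for k, v in acc.items():
            merged[k] = op(merged[k], v) if k in merged else v
    return merged

partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
print(reduce_by_key(partitions, lambda x, y: x + y))  # {'a': 8, 'b': 7}
```

Note how with string concatenation as `op`, the locally combined values would be just as big as the inputs, which is exactly Cheng's point about why `_ ++ _` doesn't help.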
Re: How can I implement eigenvalue decomposition in Spark?
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny) SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html), which is in MLlib as of Spark 1.0, with a distributed sparse SVD coming in Spark 1.1 (https://issues.apache.org/jira/browse/SPARK-1782). If your data is sparse (which it often is in social networks), you may have better luck with this. I haven't tried the GraphX implementation, but those algorithms are often well-suited for power-law distributed graphs as you might see in social networks. FWIW, I believe you need to square the elements of the sigma matrix from the SVD to get the eigenvalues. On Thu, Aug 7, 2014 at 10:20 AM, Sean Owen so...@cloudera.com wrote: (-incubator, +user) If your matrix is symmetric (and real, I presume), and if my linear algebra isn't too rusty, then its SVD is its eigendecomposition. The SingularValueDecomposition object you get back has U and V, both of which have columns that are the eigenvectors. There are a few SVDs in the Spark code. The one in mllib is not distributed (right?) and is probably not an efficient means of computing eigenvectors if you really just want a decomposition of a symmetric matrix. The one I see in graphx is distributed? I haven't used it though. Maybe it could be part of a solution. On Thu, Aug 7, 2014 at 2:21 PM, yaochunnan yaochun...@gmail.com wrote: Our lab needs to do some simulation on online social networks. We need to handle a 5000*5000 adjacency matrix, namely, to get its largest eigenvalue and corresponding eigenvector. Matlab can be used but it is time-consuming. Is Spark effective in linear algebra calculations and transformations? Later we would have a 500*500 matrix processed. It seems urgent that we find some distributed computation platform. I see SVD has been implemented and I can get eigenvalues of a matrix through this API. 
But when I want to get both eigenvalues and eigenvectors, or at least the biggest eigenvalue and the corresponding eigenvector, it seems that current Spark doesn't have such an API. Is it possible for me to write eigenvalue decomposition from scratch? What should I do? Thanks a lot! Miles Yao View this message in context: How can I implement eigenvalue decomposition in Spark? Sent from the Apache Spark User List mailing list archive at Nabble.com.
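For a quick sense of what "largest eigenvalue plus eigenvector" takes computationally, here is a minimal power-iteration sketch in plain Python (no Spark; assumes a symmetric matrix with a well-separated dominant eigenvalue). Each step is just a matrix-vector multiply, which is exactly the kind of operation that distributes well:

```python
def mat_vec(A, v):
    # Dense matrix-vector product: one row dot-product per output entry.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, iters=100):
    """Largest-magnitude eigenvalue and eigenvector of a symmetric matrix A."""
    n = len(A)
    v = [1.0] * n
    for _ in range(iters):
        w = mat_vec(A, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]   # re-normalize each step
    # Rayleigh quotient v.Av gives the eigenvalue estimate.
    lam = sum(x * y for x, y in zip(v, mat_vec(A, v)))
    return lam, v

A = [[2.0, 1.0], [1.0, 2.0]]  # eigenvalues are 3 and 1
lam, v = power_iteration(A)
print(round(lam, 6))  # 3.0
```

For a 5000x5000 matrix, each iteration is ~25M multiply-adds, which is small enough that a single machine with a good BLAS would likely beat a cluster here, echoing the advice elsewhere in this digest.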
Re: [MLLib]:choosing the Loss function
The loss functions are implied by the names of the various model families: SVM uses hinge loss, LogisticRegression uses logistic loss, and LinearRegression uses squared (least-squares) loss. These are used internally as arguments to the SGD and L-BFGS optimizers. On Thu, Aug 7, 2014 at 6:31 PM, SK skrishna...@gmail.com wrote: Hi, According to the MLLib guide, there seems to be support for different loss functions. But I could not find a command line parameter to choose the loss function; I only found regType to choose the regularization. Does MLLib support a parameter to choose the loss function? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-choosing-the-Loss-function-tp11738.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
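For reference, the three losses mentioned above written out in plain Python for a single example (labels in {-1, +1} for hinge/logistic, following the usual convention; this is a sketch of the math, not MLlib's internal code):

```python
import math

def hinge_loss(label, margin):        # SVM
    return max(0.0, 1.0 - label * margin)

def logistic_loss(label, margin):     # LogisticRegression
    return math.log(1.0 + math.exp(-label * margin))

def squared_loss(label, prediction):  # LinearRegression (least squares)
    return 0.5 * (prediction - label) ** 2

print(hinge_loss(1.0, 0.25))    # 0.75
print(squared_loss(2.0, 3.0))   # 0.5
```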
Re: Computing mean and standard deviation by key
Computing the variance is similar to this example - you just need to keep around the sum of squares as well. The formula for the (population) variance is (sumsq/n) - (sum/n)^2. But with big datasets or large values, you can quickly run into overflow issues - MLlib handles this by maintaining the average sum of squares in an online fashion (see: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala#L83 ). You might consider just calling into the MLlib stats module directly. On Fri, Aug 1, 2014 at 1:48 PM, Xu (Simon) Chen xche...@gmail.com wrote: I meant not sure how to do variance in one shot :-) With the mean in hand, you can obviously broadcast the variable and do another map/reduce to calculate variance per key. On Fri, Aug 1, 2014 at 4:39 PM, Xu (Simon) Chen xche...@gmail.com wrote:

val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).collect

This gives you a list of (key, (sum, count)), from which you can easily calculate the mean. Not sure about variance. On Fri, Aug 1, 2014 at 2:55 PM, kriskalish k...@kalish.net wrote: I have what seems like a relatively straightforward task to accomplish, but I cannot seem to figure it out from the Spark documentation or searching the mailing list. I have an RDD[(String, MyClass)] that I would like to group by the key, and calculate the mean and standard deviation of the foo field of MyClass. It feels like I should be able to use groupBy to get an RDD for each unique key, but it gives me an iterable. As in:

val grouped = rdd.groupByKey()
grouped.foreach { g =>
  val mean = g.map(x => x.foo).mean()
  val dev = g.map(x => x.foo).stddev()
  // do fancy things with the mean and deviation
}

However, there seems to be no way to convert the iterable into an RDD. Is there some other technique for doing this? I'm to the point where I'm considering copying and pasting the StatCollector class and changing the type from Double to MyClass (or making it generic). Am I going down the wrong path? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Computing-mean-and-standard-deviation-by-key-tp11192.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
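The online trick MLlib uses can be sketched as follows (a simplified single-variable version of Welford's algorithm; the linked summarizer maintains this kind of running statistic per dimension): instead of accumulating a raw sum of squares, keep a running mean and a running sum of squared deviations, which stays numerically stable even when the values themselves are huge.

```python
def online_mean_var(xs):
    """Welford's online update: numerically stable mean and population variance."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / n

# Large offset: the naive sumsq/n - (sum/n)^2 formula loses precision here.
mean, var = online_mean_var([1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16])
print(mean - 1e9, var)  # 10.0 22.5
```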
Re: Computing mean and standard deviation by key
Ignoring my warning about overflow - even more functional: just use a reduceByKey. Since your main operation is just a bunch of summing, you've got a commutative-associative reduce operation, and Spark will do everything cluster-parallel, then shuffle the (small) result set and merge appropriately. For example:

input
  .map { case (k, v) => (k, (1, v, v*v)) }
  .reduceByKey { case ((c1, s1, ss1), (c2, s2, ss2)) => (c1+c2, s1+s2, ss1+ss2) }
  .map { case (k, (count, sum, sumsq)) => (k, sumsq/count - (sum/count * sum/count)) }

This is by no means the most memory/time efficient way to do it, but I think it's a nice example of how to think about using spark at a higher level of abstraction. - Evan On Fri, Aug 1, 2014 at 2:00 PM, Sean Owen so...@cloudera.com wrote: Here's the more functional programming-friendly take on the computation (but yeah, this is the naive formula):

rdd.groupByKey.mapValues { mcs =>
  val values = mcs.map(_.foo.toDouble)
  val n = values.size
  val sum = values.sum
  val sumSquares = values.map(x => x * x).sum
  math.sqrt(n * sumSquares - sum * sum) / n
}

This gives you a bunch of (key, stdev). I think you want to compute this RDD and *then* do something to save it if you like. Sure, that could be collecting it locally and saving to a DB. Or you could use foreach to do something remotely for every key-value pair. More efficient would be to mapPartitions and do something to a whole partition of key-value pairs at a time. On Fri, Aug 1, 2014 at 9:56 PM, kriskalish k...@kalish.net wrote: So if I do something like this, spark handles the parallelization and recombination of sum and count on the cluster automatically? I started peeking into the source and see that foreach does submit a job to the cluster, but it looked like the inner function needed to return something to work properly. 
val grouped = rdd.groupByKey()
grouped.foreach { x =>
  val iterable = x._2
  var sum = 0.0
  var count = 0
  iterable.foreach { y =>
    sum = sum + y.foo
    count = count + 1
  }
  val mean = sum / count
  // save mean to database...
}
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Computing-mean-and-standard-deviation-by-key-tp11192p11207.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
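The same single-pass (count, sum, sum-of-squares) pattern transliterated into plain Python, with a dict standing in for the shuffled reduce (hypothetical data; same overflow caveat as above applies to huge values):

```python
def variance_by_key(pairs):
    """One pass: accumulate (count, sum, sum of squares) per key, then finish."""
    acc = {}
    for k, v in pairs:
        c, s, ss = acc.get(k, (0, 0.0, 0.0))
        acc[k] = (c + 1, s + v, ss + v * v)
    # Population variance: E[x^2] - E[x]^2.
    return {k: ss / c - (s / c) ** 2 for k, (c, s, ss) in acc.items()}

data = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 10.0)]
print(variance_by_key(data))  # {'a': 1.0, 'b': 0.0}
```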
Re: Decision tree classifier in MLlib
Can you share the dataset via a gist or something, and we can take a look at what's going on? On Fri, Jul 25, 2014 at 10:51 AM, SK skrishna...@gmail.com wrote: yes, the output is continuous. So I used a threshold to get binary labels: if prediction < threshold, then class is 0, else 1. I use this binary label to then compute the accuracy. Even with this binary transformation, the accuracy with the decision tree model is low compared to LR or SVM (for the specific dataset I used). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Decision-tree-classifier-in-MLlib-tp9457p10678.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Getting the number of slaves
Try sc.getExecutorStorageStatus.length (note that this count includes the driver). SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will give you back an object per executor - the StorageStatus objects are what drives a lot of the Spark Web UI. https://spark.apache.org/docs/1.0.1/api/scala/index.html#org.apache.spark.SparkContext On Thu, Jul 24, 2014 at 11:16 AM, Nicolas Mai nicolas@gmail.com wrote: Hi, Is there a way to get the number of slaves/workers during runtime? I searched online but didn't find anything :/ The application I'm working on will run on different clusters corresponding to different deployment stages (beta - prod). It would be great to get the number of slaves currently in use, in order to set the level of parallelism and RDD partitions based on that number. Thanks! Nicolas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-number-of-slaves-tp10604.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to parallelize model fitting with different cross-validation folds?
To be clear - each of the RDDs is still a distributed dataset, and each of the individual SVM models will be trained in parallel across the cluster. Sean's suggestion effectively has you submitting multiple spark jobs simultaneously, which, depending on your cluster configuration and the size of your dataset, may or may not be a good idea. There are some tricks you can do to make training multiple models on the same dataset faster, which we're hoping to expose to users in an upcoming release. - Evan On Sat, Jul 5, 2014 at 1:50 AM, Sean Owen so...@cloudera.com wrote: If you call .par on data_kfolded it will become a parallel collection in Scala, and so the maps will happen in parallel. On Jul 5, 2014 9:35 AM, sparkuser2345 hm.spark.u...@gmail.com wrote: Hi, I am trying to fit a logistic regression model with cross validation in Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where each element is a pair of RDDs containing the training and test data: (training_data: RDD[org.apache.spark.mllib.regression.LabeledPoint], test_data: RDD[org.apache.spark.mllib.regression.LabeledPoint]) scala> data_kfolded res21: Array[(org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint], org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint])] = Array((MappedRDD[9] at map at console:24,MappedRDD[7] at map at console:23), (MappedRDD[13] at map at console:24,MappedRDD[11] at map at console:23), (MappedRDD[17] at map at console:24,MappedRDD[15] at map at console:23)) Everything works fine when using data_kfolded:

val validationErrors = data_kfolded.map { datafold =>
  val svmAlg = new SVMWithSGD()
  val model_reg = svmAlg.run(datafold._1)
  val labelAndPreds = datafold._2.map { point =>
    val prediction = model_reg.predict(point.features)
    (point.label, prediction)
  }
  val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / datafold._2.count
  trainErr.toDouble
}

scala> validationErrors res1: Array[Double] = Array(0.8819836785938481, 0.07082521117608837, 0.29833546734955185) However, I have understood that the models are not fitted in parallel, as data_kfolded is not an RDD (although it's an array of pairs of RDDs). When running the same code where data_kfolded has been replaced with sc.parallelize(data_kfolded), I get a null pointer exception from the line where the run method of the SVMWithSGD object is called with the training data. I guess this is somehow related to the fact that RDDs can't be accessed from inside a closure. I fail to understand, though, why the first version works and the second doesn't. Most importantly, is there a way to fit the models in parallel? I would really appreciate your help.

val validationErrors = sc.parallelize(data_kfolded).map { datafold =>
  val svmAlg = new SVMWithSGD()
  val model_reg = svmAlg.run(datafold._1) // This line gives null pointer exception
  val labelAndPreds = datafold._2.map { point =>
    val prediction = model_reg.predict(point.features)
    (point.label, prediction)
  }
  val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / datafold._2.count
  trainErr.toDouble
}
validationErrors.collect

java.lang.NullPointerException at org.apache.spark.rdd.RDD.firstParent(RDD.scala:971) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.RDD.take(RDD.scala:824) at org.apache.spark.rdd.RDD.first(RDD.scala:856) at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:121) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:36) at $line28.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:34) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at
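The driver-side `.par` trick Sean describes - submit one job per fold concurrently, rather than nesting RDDs inside an RDD - can be illustrated with plain Python threads; here `train_and_eval` is a toy stand-in for the per-fold SVM training and scoring:

```python
from concurrent.futures import ThreadPoolExecutor

def train_and_eval(fold):
    # Stand-in for: train SVMWithSGD on the fold's training set, score on its test set.
    train, test = fold
    model = sum(train) / len(train)           # "model" = the mean, purely illustrative
    return sum(abs(x - model) for x in test)  # "validation error"

folds = [([1.0, 2.0], [1.5]), ([2.0, 4.0], [3.0]), ([0.0, 2.0], [1.0])]

# Like data_kfolded.par.map(...): each call can launch its own job concurrently
# from the driver, instead of trying to use one RDD from inside another's closure.
with ThreadPoolExecutor(max_workers=3) as pool:
    errors = list(pool.map(train_and_eval, folds))
print(errors)  # [0.0, 0.0, 0.0]
```

The NullPointerException above is the general symptom of the nested-RDD attempt: RDDs (and the SparkContext) only exist on the driver, so they can't be used from inside a task running on a worker.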
Re: How to use K-fold validation in spark-1.0?
There is a method in org.apache.spark.mllib.util.MLUtils called kFold which will automatically partition your dataset into k train/test splits, at which point you can build k different models and aggregate the results. For example (a very rough sketch - assuming I want to do 10-fold cross validation on a binary classification model, on a file with 1000 features in it):

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val dat = MLUtils.loadLibSVMFile(sc, "path/to/data", false, 1000)
val cvdat = MLUtils.kFold(dat, 10, 42)
val modelErrors = cvdat.map { case (train, test) =>
  val model = LogisticRegressionWithSGD.train(train, 100, 0.1, 1.0)
  val error = computeError(model, test)
  (model, error)
}
// Average error:
val avgError = modelErrors.map(_._2).reduce(_ + _) / modelErrors.length

Here, I'm assuming you've got some computeError function defined. Note that many of these APIs are marked experimental and thus might change in a future spark release. On Tue, Jun 24, 2014 at 6:44 AM, Eustache DIEMERT eusta...@diemert.fr wrote: I'm interested in this topic too :) Are the MLLib core devs on this list? E/ 2014-06-24 14:19 GMT+02:00 holdingonrobin robinholdin...@gmail.com: Anyone know anything about it? Or should I actually move this topic to an MLlib-specific mailing list? Any information is appreciated! Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-K-fold-validation-in-spark-1-0-tp8142p8172.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
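The splitting that kFold performs can be sketched in plain Python (a local, simplified version; MLlib's actual implementation does seeded random sampling over the RDD per fold): each element is assigned to exactly one test fold, and the other k-1 folds form its training set.

```python
import random

def k_fold(data, k, seed):
    """Return k (train, test) splits; each element lands in exactly one test set."""
    rng = random.Random(seed)
    assignment = [rng.randrange(k) for _ in data]  # fold index per element
    splits = []
    for fold in range(k):
        train = [x for x, a in zip(data, assignment) if a != fold]
        test = [x for x, a in zip(data, assignment) if a == fold]
        splits.append((train, test))
    return splits

splits = k_fold(list(range(100)), 10, seed=42)
# The test sets partition the data; train + test always covers everything.
assert sum(len(test) for _, test in splits) == 100
assert all(len(train) + len(test) == 100 for train, test in splits)
```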
Re: Performance problems on SQL JOIN
Also - you could consider caching your data after the first map (before the first filter); this will prevent you from retrieving the data from S3 twice. On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng men...@gmail.com wrote: Your data source is S3 and the data is used twice. m1.large does not have very good network performance. Please try file.count() and see how fast it goes. -Xiangrui On Jun 20, 2014, at 8:16 AM, mathias math...@socialsignificance.co.uk wrote: Hi there, We're trying out Spark and are experiencing some performance issues using Spark SQL. Can anyone tell us if our results are normal? We are using the Amazon EC2 scripts to create a cluster with 3 workers/executors (m1.large). We tried both spark 1.0.0 as well as the git master, and the Scala as well as the Python shells. Running the following code takes about 5 minutes, which seems a long time for this query.

val file = sc.textFile("s3n:// ... .csv")
val data = file.map(x => x.split('|')) // 300k rows
case class BookingInfo(num_rooms: String, hotelId: String, toDate: String, ...)
val rooms2 = data.filter(x => x(0) == "2").map(x => BookingInfo(x(0), x(1), ... , x(9))) // 50k rows
val rooms3 = data.filter(x => x(0) == "3").map(x => BookingInfo(x(0), x(1), ... , x(9))) // 30k rows
rooms2.registerAsTable("rooms2")
cacheTable("rooms2")
rooms3.registerAsTable("rooms3")
cacheTable("rooms3")
sql("SELECT * FROM rooms2 LEFT JOIN rooms3 ON rooms2.hotelId = rooms3.hotelId AND rooms2.toDate = rooms3.toDate").count()

Are we doing something wrong here? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-problems-on-SQL-JOIN-tp8001.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Is There Any Benchmarks Comparing C++ MPI with Spark
Larry, I don't see any reference to Spark in particular there. Additionally, the benchmark only scales up to datasets that are roughly 10GB (though I realize they've picked some fairly computationally intensive tasks), and they don't present their results on more than 4 nodes. This can hide things like, for example, a communication pattern that is O(n^2) in the number of cluster nodes. Obviously they've gotten some great performance out of SciDB, but I don't think this answers the MPI vs. Spark question directly. My own experience suggests that as long as your algorithm fits in a BSP programming model, with Spark you can achieve performance that is comparable to a tuned C++/MPI codebase by leveraging the right libraries locally and thinking carefully about what and when you have to communicate. - Evan On Thu, Jun 19, 2014 at 8:48 AM, ldmtwo larry.d.moore...@intel.com wrote: Here is a partial comparison. http://dspace.mit.edu/bitstream/handle/1721.1/82517/MIT-CSAIL-TR-2013-028.pdf?sequence=2 SciDB uses MPI with Intel HW and libraries. Amazing performance at the cost of more work. In case the link stops working: A Complex Analytics Genomics Benchmark, Rebecca Taft, Manasi Vartak, Nadathur Rajagopalan Satish, Narayanan Sundaram, Samuel Madden, and Michael Stonebraker -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-There-Any-Benchmarks-Comparing-C-MPI-with-Spark-tp7661p7919.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How do you run your spark app?
I use SBT, create an assembly, and then add the assembly jars when I create my spark context. The main executor I run with something like java -cp ... MyDriver. That said - as of spark 1.0, the preferred way to run spark applications is via spark-submit - http://spark.apache.org/docs/latest/submitting-applications.html On Thu, Jun 19, 2014 at 11:36 AM, ldmtwo ldm...@gmail.com wrote: I want to ask this, not because I can't read endless documentation and several tutorials, but because there seem to be many ways of doing things and I keep having issues. How do you run your spark app? I had it working when I was only using yarn+hadoop1 (Cloudera); then I had to get Spark and Shark working, and ended up upgrading everything and dropped CDH support. Anyway, this is what I used, with master=yarn-client and app_jar being Scala code compiled with Maven: java -cp $CLASSPATH -Dspark.jars=$APP_JAR -Dspark.master=$MASTER $CLASSNAME $ARGS Do you use this? Or something else? I could never figure out this method: SPARK_HOME/bin/spark jar APP_JAR ARGS For example: bin/spark-class jar /usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 10 Do you use SBT or Maven to compile? Or something else? ** It seems that I can't get subscribed to the mailing list, and I tried both my work email and personal. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-run-your-spark-app-tp7935.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Patterns for making multiple aggregations in one pass
This looks like a job for SparkSQL!

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int, hits: Long)
val data = sc.parallelize(Array(
  MyRecord("USA", "Franklin", 24, 234),
  MyRecord("USA", "Bob", 55, 108),
  MyRecord("France", "Remi", 33, 72)))
data.registerAsTable("MyRecords")
val results = sql("SELECT t.country, AVG(t.age), SUM(t.hits) FROM MyRecords t GROUP BY t.country").collect()

Now results contains: Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342]) On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin doris.s@gmail.com wrote: Hi Nick, Instead of using reduceByKey(), you might want to look into using aggregateByKey(), which allows you to return a different value type U instead of the input value type V for each input tuple (K, V). You can define U to be a datatype that holds both the average and total, and have seqOp update both fields of U in a single pass. Hope this makes sense, Doris On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas nicholas.cham...@gmail.com wrote: The following is a simplified example of what I am trying to accomplish. Say I have an RDD of objects like this:

{ "country": "USA", "name": "Franklin", "age": 24, "hits": 224 }
{ "country": "USA", "name": "Bob", "age": 55, "hits": 108 }
{ "country": "France", "name": "Remi", "age": 33, "hits": 72 }

I want to find the average age and total number of hits per country. Ideally, I would like to scan the data once and perform both aggregations simultaneously. What is a good approach to doing this? I'm thinking that we'd want to keyBy(country), and then somehow reduceByKey(). The problem is, I don't know how to approach writing a function that can be passed to reduceByKey() and that will track a running average and total simultaneously. 
Nick -- View this message in context: Patterns for making multiple aggregations in one pass http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-tp7874.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
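Both aggregations can indeed be done in one scan with a single combined accumulator per key, which is what the aggregateByKey suggestion amounts to; a plain-Python sketch of the idea (using the record values from Evan's snippet above):

```python
records = [
    {"country": "USA", "name": "Franklin", "age": 24, "hits": 234},
    {"country": "USA", "name": "Bob", "age": 55, "hits": 108},
    {"country": "France", "name": "Remi", "age": 33, "hits": 72},
]

# One accumulator per key holds everything both aggregates need:
# (count, age sum, hit sum). A single pass updates all three together.
acc = {}
for r in records:
    c, ages, hits = acc.get(r["country"], (0, 0, 0))
    acc[r["country"]] = (c + 1, ages + r["age"], hits + r["hits"])

# Finish: average age and total hits per country.
result = {k: (ages / c, hits) for k, (c, ages, hits) in acc.items()}
print(result)  # {'USA': (39.5, 342), 'France': (33.0, 72)}
```

The (count, sum, sum) triple plays the role of the U type Doris describes, and the per-record update is the seqOp.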
Re: pmml with augustus
I should point out that if you don't want to take a polyglot approach to languages and reside solely in the JVM, then you can just use plain old Java serialization on the Model objects that come out of MLlib's APIs from Java or Scala, load them up in another process, and call the relevant .predict() method when it comes time to serve. The same approach would probably also work for models trained via MLlib's Python APIs, but I haven't tried that. Native PMML serialization would be a nice feature to add to MLlib as a mechanism to transfer models to other environments for further analysis/serving. There's a JIRA discussion about this here: https://issues.apache.org/jira/browse/SPARK-1406 On Tue, Jun 10, 2014 at 10:53 AM, filipus floe...@gmail.com wrote: Thank you very much - I didn't recognize the Cascading project at all till now; this project is very interesting. Also, I got the idea of using Scala as a language for Spark - because I can integrate JVM-based libraries very easily/naturally, if I got it right. mh... but I could also use Spark as a model engine, Augustus for the serializer, and a third-party product for the prediction engine, like jpmml. mh... got the feeling that I need to do Java, Scala and Python at the same time... first things first - Augustus for a PMML output from Spark :-) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pmml-with-augustus-tp7313p7335.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Random Forest on Spark
Interesting, and thanks for the thoughts. I think we're on the same page with 100s of millions of records. We've tested the tree implementation in mllib on 1b rows and up to 100 features - though this isn't hitting the 1000s of features you mention. Obviously multi-class support isn't there yet, but I can see your point about deeper trees for many-class problems. Once they're more developed, I'll try them out on some image processing stuff with 1k classes we're doing in the lab, to get a sense for where the issues are. If you're only allocating 2GB/worker you're going to have a hard time getting the real advantages of Spark. For your 1k features causing heap exceptions at depth 5 - are these categorical or continuous? The categorical vars create much smaller histograms. If you're fitting all continuous features, the memory requirements are O(b*d*2^l), where b = number of histogram bins, d = number of features, and l = level of the tree. Even accounting for object overhead, with the default number of bins, the histograms at this depth should be on the order of 10s of MB, not 2GB - so I'm guessing your cached data is occupying a significant chunk of that 2GB? In the tree PR, Hirakendu Das tested down to depth 10 on 500m data points with 20 continuous features and was able to run without running into memory issues (and scaling properties got better as the depth grew). His worker mem was 7.5GB and 30% of that was reserved for caching. If you wanted to go to 1000 features at depth 10, I'd estimate a couple of gigs necessary for heap space for the worker to compute/store the histograms, and I guess 2x that on the master to do the reduce. Again, 2GB per worker is pretty tight, because there are overheads of just starting the jvm, launching a worker, loading libraries, etc. - Evan On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Yes, it should be data specific and perhaps we're biased toward the data sets that we are playing with. 
To put things in perspective, we're highly interested in (and I believe our customers are):
1. large (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target categories are common and even thousands in some cases - you could imagine that this is a big reason for us requiring more 'complex' models
3. high dimensional, with thousands of descriptive and sort-of-independent features
From the theoretical perspective, I would argue that it's usually in the best interest to prune as little as possible. I believe that pruning inherently increases the bias of an individual tree, which RF can't do anything about, while decreasing variance - which is what RF is for. The default pruning criteria for R's reference implementation is a min-node of 1 (meaning a fully-grown tree) for classification, and 5 for regression. I'd imagine they did at least some empirical testing to justify these values at the time - although at a time of small datasets :). FYI, we are also considering the MLLib decision tree for our Gradient Boosting implementation; however, the memory requirement is still a bit too steep (we were getting heap exceptions at a depth limit of 5 with 2GB per worker with approximately 1000 features). Now, 2GB per worker is about what we expect our typical customers would tolerate, and I don't think that it's unreasonable for shallow trees. On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks evan.spa...@gmail.com wrote: What kind of data are you training on? These effects are *highly* data dependent, and while saying "the depth of 10 is simply not adequate to build high-accuracy models" may be accurate for the particular problem you're modeling, it is not true in general. From a statistical perspective, I consider each node in each tree an additional degree of freedom for the model, and all else equal I'd expect a model with fewer degrees of freedom to generalize better. 
Regardless, if there are lots of use cases for really deep trees, we'd like to hear about them so that we can decide how important they are to support!

In the context of CART, pruning very specifically refers to a step *after* a tree has been constructed to some depth, using cross-validation. This was a variance reduction technique in the original tree work that is unnecessary and computationally expensive in the context of forests. In the original Random Forests paper, there are still stopping criteria - usually either minimum leaf size or minimum split improvement (or both) - so training to maximum depth doesn't mean training until you've completely divided your dataset and there's one point per leaf. My point is that if you set minimum leaf size to something like 0.2% of the dataset, then you're not going to get deeper than 10 or 12 levels with a reasonably balanced tree.

With respect to PLANET - our implementation is very much in the spirit of PLANET, but has some key differences - there's good documentation on exactly what the differences are forthcoming, so I won't belabor these here.
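Evan's O(b*d*2^l) histogram estimate above can be sanity-checked numerically. This is an illustrative sketch only: the 8 bytes per entry, the 4x object-overhead factor, and the 32-bin count are assumptions for the sake of the arithmetic, not numbers taken from MLlib internals.

```python
# Sanity check of the O(b * d * 2^l) histogram memory estimate.
# b = number of histogram bins, d = number of features, l = tree level
# (a balanced tree has 2^l nodes at level l, each with its own histograms).

def histogram_bytes(bins, features, level, bytes_per_entry=8, overhead=4):
    """Estimate total memory for the per-node histograms at one tree level."""
    entries = bins * features * (2 ** level)
    return entries * bytes_per_entry * overhead

# Assuming 32 bins, 1000 continuous features, level 5 (the heap-exception case):
size_mb = histogram_bytes(bins=32, features=1000, level=5) / 1e6
# ~33 MB - consistent with "order of 10s of MB, not 2GB" above.
```

Under these assumptions the histograms are indeed tens of MB, which supports the suggestion that cached data, not the histograms themselves, was filling the 2GB heap.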
Re: Random Forest on Spark
Sorry - I meant to say that multiclass classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib is planned and coming soon.

On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Multiclass classification, Gradient Boosting, and Random Forest support for based on the recent Decision Tree implementation in MLlib.

Sung - I'd be curious to hear about your use of decision trees (and forests) where you want to go to 100+ depth. My experience with random forests has been that people typically build hundreds of shallow trees (maybe depth 7 or 8), rather than a few (or many) really deep trees. Generally speaking, we save passes over the data by computing histograms per variable per split at each *level* of a decision tree. This can blow up as the level of the decision tree gets deep, but I'd recommend a lot more memory than 2-4GB per worker for most big data workloads.

On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Debasish, we've tested the MLlib decision tree a bit and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to get a heap exception, even with 2~4 GB of memory per worker. With RF, it's very easy to get 100+ depth with even only 100,000+ rows (because trees usually are not balanced). Additionally, the lack of multi-class classification limits its applicability. Also, RF requires random features per tree node to be effective (not just bootstrap samples), and the MLlib decision tree doesn't support that.

On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das debasish.da...@gmail.com wrote: MLlib has a decision tree... there is an RF PR which is not active now... take that and swap the tree builder with the fast tree builder that's in MLlib... search for the Spark JIRA... the code is based on the Google PLANET paper... I am sure people on the dev list are already working on it... send an email to know the status over there...
There is also an RF in Cloudera Oryx, but we could not run it on our data yet. Weka 3.7.10 has a multi-threaded RF that is good for some ad-hoc runs, but it does not scale...

On Apr 17, 2014 2:45 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Hi, For one of my applications, I want to use Random Forests (RF) on top of Spark. I see that currently MLlib does not have an implementation for RF. What other open-source RF implementations would be good to use with Spark in terms of speed? Regards, Laeeq Ahmed, KTH, Sweden.
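Evan's remark above about computing histograms "per variable per split at each *level*" is the key trick: one pass over the data builds histograms for every node at the current level at once, so the number of passes grows with tree depth rather than node count. This is a toy sketch of that idea, not MLlib code; the binning scheme (values assumed in [0, 1)) and the `node_of` routing function are illustrative assumptions.

```python
# Toy sketch of level-wise histogram aggregation for decision tree training.
# One pass over the rows fills per-feature bin counts for EVERY node at the
# current level, keyed by the node each row currently falls into.
from collections import defaultdict

def level_histograms(rows, node_of, bins, n_features):
    """rows: iterable of (features, label) pairs.
    node_of: maps a feature vector to its current leaf-node id.
    Returns {node_id: [per-feature bin counts]} computed in a single pass."""
    hists = defaultdict(lambda: [[0] * bins for _ in range(n_features)])
    for features, label in rows:
        node = node_of(features)
        for f, value in enumerate(features):
            b = min(int(value * bins), bins - 1)  # assumes values in [0, 1)
            hists[node][f][b] += 1
    return dict(hists)
```

The blow-up Evan mentions is visible here: the number of node keys doubles at each level of a balanced tree, so the total histogram state grows as 2^l even though the data pass count stays at one per level.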
Re: Random Forest on Spark
What kind of data are you training on? These effects are *highly* data dependent, and while saying the depth of 10 is simply not adequate to build high-accuracy models may be accurate for the particular problem you're modeling, it is not true in general. From a statistical perspective, I consider each node in each tree an additional degree of freedom for the model, and all else equal I'd expect a model with fewer degrees of freedom to generalize better. Regardless, if there are lots of use cases for really deep trees, we'd like to hear about them so that we can decide how important they are to support!

In the context of CART, pruning very specifically refers to a step *after* a tree has been constructed to some depth, using cross-validation. This was a variance reduction technique in the original tree work that is unnecessary and computationally expensive in the context of forests. In the original Random Forests paper, there are still stopping criteria - usually either minimum leaf size or minimum split improvement (or both) - so training to maximum depth doesn't mean train until you've completely divided your dataset and there's one point per leaf. My point is that if you set minimum leaf size to something like 0.2% of the dataset, then you're not going to get deeper than 10 or 12 levels with a reasonably balanced tree.

With respect to PLANET - our implementation is very much in the spirit of PLANET, but has some key differences - there's good documentation on exactly what the differences are forthcoming, so I won't belabor these here. The differences are designed to 1) avoid data shuffling, and 2) minimize the number of passes over the training data.
Of course, there are tradeoffs involved, and there is at least one really good trick in the PLANET work that we should leverage but aren't yet - namely, once the nodes get small enough for their data to fit easily on a single machine, the data can be shuffled and then the remainder of the tree can be trained in parallel from each lower node on a single machine. This would actually help with the memory overheads in model training when trees get deep - if someone wants to modify the current implementation of trees in MLlib and contribute this optimization as a pull request, it would be welcome!

At any rate, we'll take this feedback into account with respect to improving the tree implementation, but if anyone can send over use cases or (even better) datasets where really deep trees are necessary, that would be great!

On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Well, if you read the original paper, http://oz.berkeley.edu/~breiman/randomforest2001.pdf "Grow the tree using CART methodology to maximum size and do not prune." Now, the Elements of Statistical Learning book on page 598 says that you could potentially overfit a fully-grown regression random forest. However, this effect is very slight, and likely negligible for classification. http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

In our experiments, however, if the pruning is drastic, then the performance actually becomes much worse. This makes intuitive sense IMO because a decision tree is a non-parametric model, and the expressibility of a tree depends on the number of nodes. With a huge amount of data (millions or even billions of rows), we found that a depth of 10 is simply not adequate to build high-accuracy models.

On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Hmm... can you provide some pointers to examples where deep trees are helpful?
Typically with decision trees you limit depth (either directly, or indirectly with minimum node size and minimum improvement criteria) to avoid overfitting. I agree with the assessment that forests are a variance reduction technique, but I'd be a little surprised if a bunch of hugely deep trees don't overfit to training data. I guess I view limiting tree depth as an analogue to regularization in linear models.

On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Evan, I actually haven't heard of 'shallow' random forests. I think that the only scenarios where shallow trees are useful are boosting scenarios. AFAIK, Random Forest is a variance-reducing technique and doesn't do much about bias (although some people claim that it does have some bias-reducing effect). Because shallow trees typically have higher bias than fully-grown trees, people don't often use shallow trees with RF. You can confirm this through some experiments with R's random forest implementation as well; it allows you to set limits on depth and/or pruning. In contrast, boosting is a bias reduction technique (and increases variance), so people typically use shallow trees there. Our empirical experiments also confirmed that shallow trees resulted in drastically lower accuracy for random forests. There are some papers that mix boosting-like
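Evan's claim above - that a minimum leaf size of 0.2% of the dataset caps a reasonably balanced tree at roughly 10-12 levels - can be checked with a one-line calculation, since each level of a balanced tree roughly halves the node size. This sketch assumes a perfectly balanced tree; real trees are somewhat imbalanced, which is why the practical bound lands a couple of levels deeper than the balanced-case number.

```python
# Depth bound for a balanced tree under a minimum leaf size constraint:
# at depth d each node holds a fraction (1/2)^d of the data, so we need
# (1/2)^d >= min_leaf_fraction, i.e. d <= log2(1 / min_leaf_fraction).
import math

def max_balanced_depth(min_leaf_fraction):
    """Deepest level a perfectly balanced tree can reach before its
    nodes drop below the minimum leaf size."""
    return math.floor(math.log2(1.0 / min_leaf_fraction))

depth = max_balanced_depth(0.002)  # log2(500) ~ 8.97 -> depth 8 if perfectly balanced
```

With imbalance pushing some branches a few levels deeper, this is consistent with the "10 or 12 levels" figure in the message above.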
Re: Status of MLI?
That work is under submission at an academic conference and will be made available if/when the paper is published. In terms of algorithms for hyperparameter tuning, we consider Grid Search, Random Search, a couple of older derivative-free optimization methods, and a few newer methods - TPE (aka HyperOpt, from James Bergstra), SMAC (from Frank Hutter's group), and Spearmint (Jasper Snoek's method). The short answer is that in our hands Random Search works surprisingly well for the low-dimensional problems we looked at, but TPE and SMAC perform slightly better. I've got a private branch with TPE (as well as random and grid search) integrated with MLI, but the code is research quality right now and not extremely general. We're actively working on bringing these things up to snuff for a proper open source release.

On Fri, Apr 4, 2014 at 11:28 AM, Yi Zou yi.zou.li...@gmail.com wrote: Hi Evan, Just noticed this thread - do you mind sharing more details regarding the algorithms targeted at hyperparameter tuning/model selection? Or a link to the dev git repo for that work. thanks, yi

On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Targeting 0.9.0 should work out of the box (just a change to the build.sbt) - I'll push some changes I've been sitting on to the public repo in the next couple of days.

On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote: Thanks for the update Evan! In terms of using MLI, I see that the Github code is linked to Spark 0.8; will it not work with 0.9 (which is what I have set up) or higher versions?

On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User List] wrote: Hi there, MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so!
- Evan

On Tue, Apr 1, 2014 at 8:05 PM, Krakna H wrote: Hi Nan, I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being actively developed? I'm familiar with mllib and have been looking at its documentation. Thanks!

On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] wrote: mllib has been part of the Spark distribution (under the mllib directory); also check http://spark.apache.org/docs/latest/mllib-guide.html and, for JIRA, because of the recent migration to the Apache JIRA, I think all mllib-related issues should be under the Spark umbrella: https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel -- Nan Zhu

On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote: What is the current development status of MLI/MLBase? I see that the github repo is lying dormant (https://github.com/amplab/MLI) and JIRA has had no activity in the last 30 days (https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel). Is the plan to add a lot of this into mllib itself without needing a separate API? Thanks!

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
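Evan's note above reports that Random Search works surprisingly well for low-dimensional hyperparameter tuning problems. For readers unfamiliar with the method, here is an illustrative sketch of random search itself - this is not the MLI branch mentioned above, just a minimal self-contained version of the idea, with a made-up search space for demonstration.

```python
# Minimal random-search sketch for hyperparameter tuning: sample configs
# uniformly at random from the search space and keep the best-scoring one.
import random

def random_search(evaluate, space, n_trials, seed=0):
    """space: {name: (low, high)} continuous ranges.
    evaluate: config dict -> score (higher is better).
    Returns (best_config, best_score)."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Toy objective: best score at x = 1.0 (stand-in for cross-validated accuracy).
best, score = random_search(lambda c: -(c["x"] - 1.0) ** 2,
                            {"x": (0.0, 2.0)}, n_trials=500)
```

Part of why random search holds up in low dimensions is exactly what this sketch shows: with a few hundred uniform samples over a small space, some draw lands close to the optimum with high probability, and no model of the objective is needed.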
Re: Status of MLI?
Targeting 0.9.0 should work out of the box (just a change to the build.sbt) - I'll push some changes I've been sitting on to the public repo in the next couple of days.

On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote: Thanks for the update Evan! In terms of using MLI, I see that the Github code is linked to Spark 0.8; will it not work with 0.9 (which is what I have set up) or higher versions?

On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User List] wrote: Hi there, MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so! - Evan

On Tue, Apr 1, 2014 at 8:05 PM, Krakna H wrote: Hi Nan, I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being actively developed? I'm familiar with mllib and have been looking at its documentation. Thanks!

On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] wrote: mllib has been part of the Spark distribution (under the mllib directory); also check http://spark.apache.org/docs/latest/mllib-guide.html and, for JIRA, because of the recent migration to the Apache JIRA, I think all mllib-related issues should be under the Spark umbrella: https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel -- Nan Zhu

On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote: What is the current development status of MLI/MLBase? I see that the github repo is lying dormant (https://github.com/amplab/MLI) and JIRA has had no activity in the last 30 days (https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
Is the plan to add a lot of this into mllib itself without needing a separate API? Thanks!
Re: Status of MLI?
Hi there, MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so! - Evan

On Tue, Apr 1, 2014 at 8:05 PM, Krakna H shankark+...@gmail.com wrote: Hi Nan, I was actually referring to MLI/MLBase (http://www.mlbase.org); is this being actively developed? I'm familiar with mllib and have been looking at its documentation. Thanks!

On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] wrote: mllib has been part of the Spark distribution (under the mllib directory); also check http://spark.apache.org/docs/latest/mllib-guide.html and, for JIRA, because of the recent migration to the Apache JIRA, I think all mllib-related issues should be under the Spark umbrella: https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel -- Nan Zhu

On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote: What is the current development status of MLI/MLBase? I see that the github repo is lying dormant (https://github.com/amplab/MLI) and JIRA has had no activity in the last 30 days (https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel). Is the plan to add a lot of this into mllib itself without needing a separate API? Thanks!
Re: [HELP] ask for some information about public data set
Hi hyqgod, This is probably a better question for the Spark user list than the dev list (cc'ing user and bcc'ing dev on this reply). To answer your question, though: Amazon's Public Datasets page is a nice place to start: http://aws.amazon.com/datasets/ - these work well with Spark because they're often stored on S3 (which Spark can read from natively), and it's very easy to spin up a Spark cluster on EC2 to begin experimenting with the data. There's also a pretty good list of (mostly big) datasets that Google has released over the years here: http://svonava.com/post/62186512058/datasets-released-by-google

- Evan

On Tue, Feb 25, 2014 at 6:33 PM, 黄远强 hyq...@163.com wrote: Hi all: I am a freshman in the Spark community. I dream of becoming an expert in the field of big data, but I have no idea where to start after going through the published documents on the Spark website and the examples in the Spark source code. I want to know if there are some public datasets on the internet that can be used to learn Spark and to test some new ideas of mine based on Spark. Thanks a lot. --- Best regards, hyqgod