Re: Python API Performance
We used Breeze in some early MLlib prototypes last year. It feels very Scala, which is a huge plus, but unfortunately we found that the object overhead, and the difficulty of tracking down performance problems caused by Breeze's heavy use of implicit conversions, made writing high-performance matrix code with it difficult. Further, at least for the early algorithms, we didn't need all the extra flexibility that Breeze provides, since our use cases were pretty straightforward.

On Feb 1, 2014, at 5:51 PM, 尹绪森 (Xusen Yin) yinxu...@gmail.com wrote:

How about Breeze (http://www.scalanlp.org/)? It is written in Scala and uses netlib-java as the backend (https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra#wiki-performance). I think Breeze is more like MATLAB and numpy/scipy in terms of ease of use. That is also a good aspect to benchmark.

2014-02-02 Ankur Chauhan achau...@brightcove.com:

How does Julia interact with Spark? I would be interested, mainly because I find Scala syntax a little obscure, and it would be great to see actual numbers comparing Scala, Python, and Julia workloads.

On Feb 1, 2014, at 16:08, Aureliano Buendia buendia...@gmail.com wrote:

A much (much) better solution than Python (and also Scala, if that doesn't upset you) is Julia. Libraries like numpy and scipy are bloated compared with Julia's C-like performance. Julia comes with everything that numpy + scipy come with, and more, without the performance hit. I hope we can see official support for Julia on Spark very soon.

On Thu, Jan 30, 2014 at 4:30 PM, nileshc nil...@nileshc.com wrote:

Hi there,

*Background:* I need to do some matrix multiplication inside the mappers, and I'm trying to choose between Python and Scala for writing the Spark MR jobs. I'm equally fluent with Python and Java, and find Scala pretty easy too, for what it's worth. Going with Python would let me use numpy + scipy, which is blazing fast compared to Java libraries like Colt etc.
Configuring Java with BLAS seems to be a pain compared to scipy (direct apt-get installs, or pip).

*Question:* I posted a couple of comments on this answer at StackOverflow: http://stackoverflow.com/questions/17236936/api-compatibility-between-scala-and-python. Basically it states that as of Spark 0.7.2, the Python API would be slower than Scala. What's the performance scenario now? The fork issue seems to be fixed. How about serialization? Can it match Java/Scala Writable-like serialization performance (having knowledge of the object type beforehand, reducing I/O)? Also, a probably silly question: loops seem to be slow in Python in general; do you think this can turn out to be an issue?

Bottom line: should I choose Python for computation-intensive algorithms like PageRank? Scipy gives me an edge, but does the framework kill it?

Any help, insights, or benchmarks will be much appreciated. :)

Cheers,
Nilesh

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-API-Performance-tp1048.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Best Regards
---
Xusen Yin (尹绪森)
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts and Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/
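[Editor's note: the "loops are slow in Python" concern above is easy to see concretely. Below is a minimal, pure-Python sketch - not Spark code, and not how numpy does it - of a naive matrix multiply: the O(n^3) interpreted-loop iterations are exactly the cost that numpy/scipy push down into compiled BLAS routines.]

```python
# Naive triple-loop matrix product over lists-of-lists. Every iteration
# runs in the Python interpreter, which is the overhead numpy avoids by
# delegating the whole product to a compiled BLAS call.

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for t in range(k):
                s += a[i][t] * b[t][j]
            out[i][j] = s
    return out

if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

Inside a mapper, the fix is the same as anywhere else in Python: keep the per-element work vectorized (numpy/scipy) so the interpreted loop count stays proportional to the number of records, not the number of arithmetic operations.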
Re: Tree classifiers in MLib
Yes - Manish Amde and Hirakendu Das have been working on a distributed tree classifier. We are taking the current version through large-scale testing and expect to merge it into the master branch soon. I expect that ensemble tree learners (random forests, GBDTs) will follow shortly.

On Dec 29, 2013, at 10:35 AM, Charles Earl charles.ce...@gmail.com wrote:

In the latest API docs off of the web page http://spark.incubator.apache.org/docs/latest/api/mllib/index.html#org.apache.spark.mllib.package I had not seen tree classifiers included. Are there plans to include decision trees etc. at some point? Is there an interest?

--
- Charles
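[Editor's note: for readers unfamiliar with what a tree classifier actually computes, here is a toy single-node "decision stump" in pure Python - emphatically not the distributed MLlib implementation mentioned above - illustrating the core operation a tree learner repeats at every node: search for the feature threshold that best separates the labels.]

```python
# A decision stump on one numeric feature: try every observed value as a
# threshold and every left/right label assignment, and keep the split
# with the fewest misclassified training points. A full decision tree
# applies this search recursively (and over many features) at each node.

def best_stump(xs, ys):
    """Return (threshold, left_label, right_label) minimizing training
    errors, for 1-D features xs and binary labels ys in {0, 1}."""
    best = None  # (errors, threshold, left_label, right_label)
    for thr in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= thr else right) != y
                for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    return best[1:]

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    ys = [0, 0, 0, 1, 1, 1]
    print(best_stump(xs, ys))  # (3.0, 0, 1): split cleanly at x <= 3.0
```

The distributed version's difficulty is not this search itself but computing the split statistics over partitioned data, which is what the work described above addresses.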
Re: Running Spark jar on EC2
I ran into a similar issue a few months back - pay careful attention to the order in which Spark decides to look for your jars. The root of my problem was a stale jar in SPARK_CLASSPATH on the worker nodes, which took precedence (IIRC) over jars passed in with the SparkContext constructor.

On Dec 20, 2013, at 8:49 PM, K. Shankari shank...@eecs.berkeley.edu wrote:

I don't think that you need to copy the jar to the rest of the cluster - you should be able to do addJar() in the SparkContext and Spark should automatically push the jars to the client for you. I don't know how set you are on running code through checking out and compiling, but here's what I do instead to get my own application to run:

- compile my code on my desktop and generate a jar
- scp the jar to the master
- modify runExample to include the jar in the classpath. I think that you can also just modify SPARK_CLASSPATH
- run using something like: $ runExample my.class.name arg1 arg2 arg3

Hope this helps!
Shankari

On Tue, Dec 10, 2013 at 12:15 PM, Jeff Higgens jefh...@gmail.com wrote:

I'm having trouble running my Spark program as a fat jar on EC2. This is the process I'm using:

(1) spark-ec2 script to launch cluster
(2) ssh to master, install sbt and git clone my project's source code
(3) update source to reference correct master and jar
(4) sbt assembly
(5) copy-dir to copy the jar to the rest of the cluster

I tried both running the jar (java -jar ...) and using sbt run, but I always end up with this error:

18:58:59.556 [spark-akka.actor.default-dispatcher-4] INFO o.a.s.d.client.Client$ClientActor - Connecting to master spark://ec2-50-16-80-0.compute-1.amazonaws.com:7077
18:58:59.838 [spark-akka.actor.default-dispatcher-4] ERROR o.a.s.d.client.Client$ClientActor - Connection to master failed; stopping client
18:58:59.839 [spark-akka.actor.default-dispatcher-4] ERROR o.a.s.s.c.SparkDeploySchedulerBackend - Disconnected from Spark cluster!
18:58:59.840 [spark-akka.actor.default-dispatcher-4] ERROR o.a.s.s.cluster.ClusterScheduler - Exiting due to error from cluster scheduler: Disconnected from Spark cluster
18:58:59.844 [delete Spark local dirs] DEBUG org.apache.spark.storage.DiskStore - Shutdown hook called

But when I use spark-shell it has no problems connecting to the master using the exact same URL:

13/12/10 18:59:40 INFO client.Client$ClientActor: Connecting to master spark://ec2-50-16-80-0.compute-1.amazonaws.com:7077
Spark context available as sc.

I'm probably missing something obvious, so any tips are very appreciated.
Re: MLBase Test
Hi Aslan,

You'll need to link against the spark-mllib artifact. The method we currently have for collaborative filtering is ALS. Documentation is available here: http://spark.incubator.apache.org/docs/latest/mllib-guide.html

We're working on a more complete ALS tutorial, and will link to it from that page when it's ready.

- Evan

On Nov 29, 2013, at 10:33 AM, Aslan Bekirov aslanbeki...@gmail.com wrote:

Hi All,

I am trying to do collaborative filtering with MLbase. I am using Spark 0.8.0 and have some basic questions.

1) I am using Maven and added this dependency to my pom:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.9.3</artifactId>
      <version>0.8.0-incubating</version>
    </dependency>

I could not see any MLbase-related classes in the downloaded jar, which is why I could not import the mli libraries. Am I missing something? Do I have to add some more dependencies for mli?

2) Does a Java API exist for MLbase?

Thanks in advance,
BR,
Aslan
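[Editor's note: to make the ALS recommendation concrete, here is a toy rank-1 alternating least squares in pure Python on a small, fully observed matrix. This is NOT MLlib's ALS - the real implementation is distributed, regularized, handles missing ratings, and uses higher-rank factors - it only illustrates the "alternate two closed-form least-squares solves" idea the algorithm is named for.]

```python
# Approximate a ratings matrix R by the outer product of vectors u and v.
# With v held fixed, each u[i] has a closed-form least-squares solution,
# and symmetrically for v with u fixed; ALS alternates these two steps.

def als_rank1(R, iters=20):
    """Return (u, v) such that R[i][j] ~= u[i] * v[j]."""
    n, m = len(R), len(R[0])
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        vv = sum(x * x for x in v)
        u = [sum(R[i][j] * v[j] for j in range(m)) / vv for i in range(n)]
        uu = sum(x * x for x in u)
        v = [sum(R[i][j] * u[i] for i in range(n)) / uu for j in range(m)]
    return u, v

if __name__ == "__main__":
    # An exactly rank-1 "ratings" matrix: R[i][j] = a[i] * b[j]
    a, b = [1.0, 2.0, 3.0], [2.0, 4.0]
    R = [[ai * bj for bj in b] for ai in a]
    u, v = als_rank1(R)
    err = max(abs(u[i] * v[j] - R[i][j]) for i in range(3) for j in range(2))
    print("max reconstruction error:", err)
```

In MLlib the same alternation runs over user-factor and product-factor RDDs rather than plain lists, which is why linking against spark-mllib (not just spark-core) is required.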
Re: SVM Prediction
Hi Prabeesh,

Once you have an SVM model trained, you can make predictions with it (via the model's .predict() method) on any new input data, as long as it's in the same format the model was trained with.

- Evan

On Nov 26, 2013, at 10:03 PM, prabeesh k prabsma...@gmail.com wrote:

Hi All,

Is it possible to do SVM prediction with DStream data? The SVM model is trained using an RDD; after that, is there any possibility of using DStream data for prediction? I am not that much aware of SVM. Please suggest.

Thanks in advance.
Regards,
Prabeesh
Re: SVM Prediction
Right - as long as the elements of the stream are (for example) Array[Double], you should be able to make a prediction on each point, if you trained the SVM on LabeledPoint examples that are comparable to what you're getting from the DStream.

On Nov 26, 2013, at 11:00 PM, prabeesh k prabsma...@gmail.com wrote:

Hi Evan,

Actually, the input data for prediction is streaming data. In the Spark example the training data is an RDD, but I want to use the model to predict on a DStream (streaming data). I think it is impossible to train the model using streaming data. So are we able to train the SVM using static data and make predictions using streaming data?

On Wed, Nov 27, 2013 at 12:18 PM, Evan Sparks evan.spa...@gmail.com wrote:

Hi Prabeesh,

Once you have an SVM model trained, you can make predictions with it (via the model's .predict() method) on any new input data, as long as it's in the same format the model was trained with.

- Evan

On Nov 26, 2013, at 10:03 PM, prabeesh k prabsma...@gmail.com wrote:

Hi All,

Is it possible to do SVM prediction with DStream data? The SVM model is trained using an RDD; after that, is there any possibility of using DStream data for prediction? I am not that much aware of SVM. Please suggest.

Thanks in advance.
Regards,
Prabeesh
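[Editor's note: the reason train-on-RDD / predict-on-DStream works is that prediction with a trained linear SVM is stateless: it is just a thresholded dot product of the learned weights with each incoming point, so it applies to each element of a stream exactly as to each element of a batch. A minimal pure-Python stand-in (not the MLlib API; the weights here are hypothetical, not trained):]

```python
# Binary linear-classifier prediction: sign of w.x + b. The model
# parameters (w, b) are fixed at training time; each streamed point is
# scored independently, so no training state changes at prediction time.

def predict(w, b, x):
    """Return 1 if w.x + b >= 0, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0

if __name__ == "__main__":
    w, b = [2.0, -1.0], -0.5                       # hypothetical trained parameters
    stream = [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]  # points arriving over time
    print([predict(w, b, x) for x in stream])      # [1, 0, 1]
```

In Spark terms this is why `dstream.map(model.predict)` is reasonable while training on a DStream is not: training must aggregate over the whole dataset, but prediction touches one point at a time.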
Re: Streaming JSON From S3?
You can always use some non-splittable file format (e.g. gzip) and then a binary input format to get the file-at-a-time behavior you're looking for.

On Aug 21, 2013, at 9:57 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Hi Paul,

On Aug 21, 2013, at 6:11 PM, Paul Snively psniv...@icloud.com wrote:

Just to understand, are you trying to do a real-time application (which is what the streaming in Spark Streaming is for), or just to read an input file into a batch job?

Well, it's an interesting case. I'm trying to take advantage of Spark Streaming's scanning of sources to automatically process new content, and possibly its sliding window support, e.g. do something with every 5 RDDs in the stream. So it's not so much that the requirements are real time - on the contrary, the processing in the middle will be pretty heavyweight - but rather that streaming offers a couple of desirable ancillary features.

Got it; that's fine as a use case for Spark Streaming.

That's essentially what I expected. When you say stream of Strings, is each String the entire contents of a file? If so, that would be perfectly suitable.

No, unfortunately each String is one line of text. You'd have to create a Hadoop InputFormat that returns one record per file if you wanted that. Maybe we should add that as a feature in Spark by default, because it does seem like a useful way to run it.

Matei
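[Editor's note: the "one record per file" behavior discussed above is simple to state outside Hadoop. Below is a plain local-filesystem sketch in Python - not a Hadoop InputFormat, and not how Spark reads S3 - showing the desired contract: each record is (filename, entire file contents) rather than one line of text, so a whole JSON file survives as a single unit.]

```python
# Read every file in a directory as a single (name, contents) record,
# mirroring the whole-file-per-record InputFormat behavior described in
# the thread, instead of the default one-line-per-record splitting.

import os

def whole_file_records(directory):
    """Return [(filename, contents), ...] for every regular file."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "r", encoding="utf-8") as f:
                records.append((name, f.read()))
    return records

if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        for name, text in [("a.json", '{"x": 1}\n{"x": 2}\n'),
                           ("b.json", '{"y": 3}\n')]:
            with open(os.path.join(d, name), "w", encoding="utf-8") as f:
                f.write(text)
        for name, contents in whole_file_records(d):
            print(name, "->", repr(contents))
```

An InputFormat implementing this contract would mark files non-splittable and emit one (path, bytes) pair per file, which is exactly what makes the gzip workaround in the first message work: a non-splittable file is forced through a single record reader.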