Re: Python API Performance

2014-02-01 Thread Evan Sparks
We used Breeze in some early MLlib prototypes last year. It feels very Scala-like, 
which is a huge plus, but unfortunately we found that the object overhead, and the 
difficulty of tracking down performance problems caused by Breeze's heavy use of 
implicit conversions, made writing high-performance matrix code with it difficult. 
Further, at least for the early algorithms, we didn't need all the extra flexibility 
that Breeze provides, since our use cases were pretty straightforward. 

 On Feb 1, 2014, at 5:51 PM, 尹绪森 yinxu...@gmail.com wrote:
 
 How about Breeze (http://www.scalanlp.org/)? It is written in Scala and uses 
 netlib-java as the backend 
 (https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra#wiki-performance).
 
 I think Breeze is closer to MATLAB and numpy/scipy in terms of ease of use. 
 That is also a good aspect to test.
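 
 To give a feel for that, here is a tiny, self-contained Breeze sketch (purely 
 illustrative; the values are made up):
 
 import breeze.linalg._
 
 val a = DenseMatrix((1.0, 2.0), (3.0, 4.0))
 val b = DenseVector(0.5, -1.0)
 val y = a * b          // matrix-vector product, BLAS-backed via netlib-java where available
 val s = sum(a(::, 0))  // column slice plus reduction, roughly a[:, 0].sum() in numpy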
 
 
 2014-02-02 Ankur Chauhan achau...@brightcove.com:
 How does Julia interact with Spark? I would be interested, mainly because I 
 find Scala syntax a little obscure, and it would be great to see actual 
 numbers comparing Scala, Python, and Julia workloads. 
 
 On Feb 1, 2014, at 16:08, Aureliano Buendia buendia...@gmail.com wrote:
 
 A much (much) better solution than Python (and also Scala, if that doesn't 
 make you upset) is Julia.
 
 Libraries like numpy and scipy are bloated compared with Julia's C-like 
 performance. Julia comes with everything that numpy + scipy offer, and more, 
 without the performance hit.
 
 I hope we can see official support for Julia on Spark very soon.
 
 
 On Thu, Jan 30, 2014 at 4:30 PM, nileshc nil...@nileshc.com wrote:
 Hi there,
 
 *Background:*
 I need to do some matrix multiplication inside the mappers, and I am trying 
 to choose between Python and Scala for writing the Spark MR jobs. I'm 
 equally fluent in Python and Java, and find Scala pretty easy too, for what 
 it's worth. Going with Python would let me use numpy + scipy, which is 
 blazing fast compared to Java libraries like Colt etc. Configuring Java 
 with BLAS seems to be a pain compared to scipy (direct apt-get installs, or 
 pip).
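 
 For what the Scala side of this comparison could look like, here is a minimal 
 sketch using Breeze (suggested elsewhere in this thread) inside mapPartitions. 
 The names (sc, data, weights) and dimensions are made up for illustration:
 
 import breeze.linalg.{DenseMatrix, DenseVector}
 
 val weights = DenseMatrix.rand(10, 10)   // small shared matrix, captured in the closure
 val data = sc.parallelize(Seq.fill(1000)(Array.fill(10)(math.random)))
 
 val products = data.mapPartitions { rows =>
   rows.map { arr =>
     val v = DenseVector(arr)
     (weights * v).toArray                // BLAS-backed matrix-vector multiply per record
   }
 }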
 
 *Question:*
 I posted a couple of comments on this answer at StackOverflow:
 http://stackoverflow.com/questions/17236936/api-compatibility-between-scala-and-python.
 Basically it states that as of Spark 0.7.2, the Python API would be slower
 than Scala. What's the performance scenario now? The fork issue seems to be
 fixed. How about serialization? Can it match Java/Scala Writable-like
 serialization (having knowledge of object type beforehand, reducing I/O)
 performance? Also, a probably silly question: loops seem to be slow in 
 Python in general; do you think this could turn out to be an issue?
 
 Bottom line: should I choose Python for computation-intensive algorithms 
 like PageRank? Scipy gives me an edge, but does the framework kill it?
 
 Any help, insights, benchmarks will be much appreciated. :)
 
 Cheers,
 Nilesh
 
 
 
 
 
 
 -- 
 Best Regards
 ---
 Xusen Yin (尹绪森)
 Beijing Key Laboratory of Intelligent Telecommunications Software and 
 Multimedia
 Beijing University of Posts & Telecommunications
 Intel Labs China
 Homepage: http://yinxusen.github.io/


Re: Tree classifiers in MLib

2013-12-29 Thread Evan Sparks
Yes - Manish Amde and Hirakendu Das have been working on a distributed tree 
classifier. We are taking the current version through large scale testing and 
expect to merge it into the master branch soon. I expect that ensemble tree 
learners (random forests, GBDTs) will follow shortly. 

 On Dec 29, 2013, at 10:35 AM, Charles Earl charles.ce...@gmail.com wrote:
 
 In the latest API docs on the web page
 http://spark.incubator.apache.org/docs/latest/api/mllib/index.html#org.apache.spark.mllib.package
 I had not seen tree classifiers included.
 Are there plans to include decision trees etc. at some point? Is there 
 interest?
 
 
 -- 
 - Charles


Re: Running Spark jar on EC2

2013-12-20 Thread Evan Sparks
I ran into a similar issue a few months back - pay careful attention to the 
order in which Spark decides to look for your jars. The root of my problem was 
a stale jar in SPARK_CLASSPATH on the worker nodes, which took precedence 
(IIRC) over jars passed in with the SparkContext constructor. 

 On Dec 20, 2013, at 8:49 PM, K. Shankari shank...@eecs.berkeley.edu wrote:
 
 I don't think that you need to copy the jar to the rest of the cluster - you 
 should be able to call addJar() on the SparkContext, and Spark should 
 automatically push the jars out to the workers for you.
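 
 A minimal sketch of both routes, assuming a fat jar at /path/to/app.jar (the 
 path is a placeholder, and the constructor signature is the Spark 0.8-era one 
 as best I recall):
 
 import org.apache.spark.SparkContext
 
 // Option 1: hand the jar to the constructor
 val sc = new SparkContext(
   "spark://ec2-50-16-80-0.compute-1.amazonaws.com:7077",  // master URL from below
   "MyApp",
   "/root/spark",                // SPARK_HOME on the cluster
   Seq("/path/to/app.jar"))      // jars shipped out to the workers
 
 // Option 2: add it after the context is created
 // sc.addJar("/path/to/app.jar")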
 
 I don't know how set you are on running code through checking out and 
 compiling, but here's what I do instead to get my own application to run:
 - compile my code on my desktop and generate a jar
 - scp the jar to the master
 - modify runExample to include the jar in the classpath. I think that you can 
 also just modify SPARK_CLASSPATH
 - run using something like:
 
 $ runExample my.class.name arg1 arg2 arg3
 
 Hope this helps!
 Shankari
 
 
 On Tue, Dec 10, 2013 at 12:15 PM, Jeff Higgens jefh...@gmail.com wrote:
 I'm having trouble running my Spark program as a fat jar on EC2.
 
 This is the process I'm using:
 (1) spark-ec2 script to launch cluster
 (2) ssh to master, install sbt and git clone my project's source code
 (3) update source to reference correct master and jar
 (4) sbt assembly
 (5) copy-dir to copy the jar to the rest of the cluster
 
 I tried both running the jar (java -jar ...) and using sbt run, but I always 
 end up with this error:
 
 18:58:59.556 [spark-akka.actor.default-dispatcher-4] INFO  
 o.a.s.d.client.Client$ClientActor - Connecting to master 
 spark://ec2-50-16-80-0.compute-1.amazonaws.com:7077
 18:58:59.838 [spark-akka.actor.default-dispatcher-4] ERROR 
 o.a.s.d.client.Client$ClientActor - Connection to master failed; stopping 
 client
 18:58:59.839 [spark-akka.actor.default-dispatcher-4] ERROR 
 o.a.s.s.c.SparkDeploySchedulerBackend - Disconnected from Spark cluster!
 18:58:59.840 [spark-akka.actor.default-dispatcher-4] ERROR 
 o.a.s.s.cluster.ClusterScheduler - Exiting due to error from cluster 
 scheduler: Disconnected from Spark cluster
 18:58:59.844 [delete Spark local dirs] DEBUG 
 org.apache.spark.storage.DiskStore - Shutdown hook called
 
 
 But when I use spark-shell, it has no problems connecting to the master using 
 the exact same URL: 
 
 13/12/10 18:59:40 INFO client.Client$ClientActor: Connecting to master 
 spark://ec2-50-16-80-0.compute-1.amazonaws.com:7077
 Spark context available as sc.
 
 I'm probably missing something obvious, so any tips are much appreciated.
 


Re: MLBase Test

2013-11-29 Thread Evan Sparks
Hi Aslan,

You'll need to link against the spark-mllib artifact. The method we have 
currently for collaborative filtering is ALS. 

Documentation is available here - 
http://spark.incubator.apache.org/docs/latest/mllib-guide.html

We're working on a more complete ALS tutorial, and will link to it from that 
page when it's ready. 

- Evan
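
In the meantime, a minimal sketch of basic ALS usage (the file path and 
parameter values are made up, and an existing SparkContext sc is assumed):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Ratings file: user,product,rating per line
val ratings = sc.textFile("hdfs:///path/to/ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

val model = ALS.train(ratings, 10, 20, 0.01)   // rank, iterations, lambda
val predicted = model.predict(1, 42)           // predicted rating of product 42 by user 1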

 On Nov 29, 2013, at 10:33 AM, Aslan Bekirov aslanbeki...@gmail.com wrote:
 
 Hi All,
 
 I am trying to do collaborative filtering with MLbase. I am using Spark 0.8.0.
 
 I have some basic questions.
 
 1) I am using Maven and added a dependency to my pom:
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.9.3</artifactId>
   <version>0.8.0-incubating</version>
 </dependency>
 
 I could not see any MLbase-related classes in the downloaded jar, which is why 
 I could not import the mli libraries. Am I missing something? Do I have to add 
 some additional dependency for mli?
 
 2) Does a Java API exist for MLBase? 
 
 Thanks in advance,
 
 BR,
 Aslan


Re: SVM Prediction

2013-11-26 Thread Evan Sparks
Hi Prabeesh,

Once you have an SVM model trained, you can make predictions with the model 
(via the model's .predict() method) with any new input data as long as it's in 
the same format that the model was trained with. 

- Evan
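
For concreteness, a minimal sketch against the MLlib API of that era, as best I 
recall (the file path, feature values, and iteration count are made up, and an 
existing SparkContext sc is assumed):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Training data: label followed by comma-separated features.
val training = sc.textFile("hdfs:///path/to/train.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, parts.tail)
}

val model = SVMWithSGD.train(training, 100)   // 100 iterations of SGD

// New points must use the same feature layout the model was trained on.
val label = model.predict(Array(0.2, 1.3, -0.7))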

 On Nov 26, 2013, at 10:03 PM, prabeesh k prabsma...@gmail.com wrote:
 
 Hi All,
     Is it possible to do SVM prediction with DStream data? The SVM model is 
  trained using an RDD; after that, is there any possibility of using DStream 
  data for prediction? I am not that familiar with SVMs. 
  Please suggest.
  
  Thanks in advance.
  
  Regards,
Prabeesh
   
   
   
 



Re: SVM Prediction

2013-11-26 Thread Evan Sparks
Right - as long as the elements of the stream are (for example) Array[Double], 
you should be able to make a prediction on each point, provided you trained the 
SVM on LabeledPoint examples that are comparable to what you're getting from the 
DStream. 
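
A rough sketch of that split (train on static data, predict on the stream), 
assuming an existing SparkContext sc, made-up paths, and incoming text files of 
comma-separated features:

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 1. Train once on a static RDD[LabeledPoint].
val training = sc.textFile("hdfs:///path/to/train.csv")
  .map(_.split(',').map(_.toDouble))
  .map(parts => LabeledPoint(parts.head, parts.tail))
val model = SVMWithSGD.train(training, 100)

// 2. Predict on each streamed point (each record parsed into an Array[Double] of features).
val ssc = new StreamingContext(sc, Seconds(10))
val featureStream = ssc.textFileStream("hdfs:///path/to/incoming")
  .map(_.split(',').map(_.toDouble))
featureStream.map(features => model.predict(features)).print()
ssc.start()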

 On Nov 26, 2013, at 11:00 PM, prabeesh k prabsma...@gmail.com wrote:
 
 Hi Evan,
     Actually, the input data for prediction is streaming data. In the Spark 
  example, the training data is an RDD, but I want to make predictions with the 
  model using a DStream (streaming data). I think it is impossible to train the 
  model using streaming data. So are we able to train the SVM using static data 
  and make predictions using streaming data?
 
 
 On Wed, Nov 27, 2013 at 12:18 PM, Evan Sparks evan.spa...@gmail.com wrote:
 Hi Prabeesh,
 
 Once you have an SVM model trained, you can make predictions with the model 
 (via the model's .predict() method) with any new input data as long as it's 
 in the same format that the model was trained with.
 
 - Evan
 
  On Nov 26, 2013, at 10:03 PM, prabeesh k prabsma...@gmail.com wrote:
 
  Hi All,
   Is it possible to do SVM prediction with DStream data? The SVM model is 
   trained using an RDD; after that, is there any possibility of using DStream 
   data for prediction? I am not that familiar with SVMs.
   Please suggest.
  
   Thanks in advance.
  
   Regards,
 Prabeesh
 
 
 
 


Re: Streaming JSON From S3?

2013-08-21 Thread Evan Sparks
You can always use a non-splittable file format (e.g. gzip) and then a binary 
input format to get the file-at-a-time behavior you're looking for.

On Aug 21, 2013, at 9:57 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Hi Paul,

 On Aug 21, 2013, at 6:11 PM, Paul Snively psniv...@icloud.com wrote:

 Just to understand, are you trying to do a real-time application (which is 
 what the streaming in Spark Streaming is for), or just to read an input 
 file into a batch job?

 Well, it's an interesting case. I'm trying to take advantage of Spark 
 Streaming's scanning of sources to automatically process new content, and 
 possibly its sliding window support, e.g. do something with every 5 RDDs in 
 the stream. So it's not so much that the requirements are real time—on the 
 contrary, the processing in the middle will be pretty heavyweight—but 
 rather that streaming offers a couple of desirable ancillary features.

 Got it; that's fine as a use case for Spark Streaming.

 That's essentially what I expected. When you say stream of Strings, is 
 each String the entire contents of a file? If so, that would be perfectly 
 suitable.

 No, unfortunately each String is one line of text. You'd have to create a 
 Hadoop InputFormat that returns one record per file if you wanted that. Maybe 
 we should add that as a feature in Spark by default, because it does seem 
 like a useful way to run it.

 Matei
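 
 For reference, a rough sketch of such a one-record-per-file InputFormat, using 
 the newer Hadoop mapreduce API. The class names and usage path are made up, and 
 this is an untested illustration rather than anything shipped with Spark:
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.io.{IOUtils, NullWritable, Text}
 import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
 import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}
 
 // One record per file: mark every file as non-splittable and emit its whole
 // contents as a single Text value.
 class WholeFileInputFormat extends FileInputFormat[NullWritable, Text] {
   override def isSplitable(context: JobContext, file: Path): Boolean = false
 
   override def createRecordReader(split: InputSplit,
                                   context: TaskAttemptContext): RecordReader[NullWritable, Text] =
     new WholeFileRecordReader
 }
 
 class WholeFileRecordReader extends RecordReader[NullWritable, Text] {
   private var split: FileSplit = _
   private var conf: Configuration = _
   private val value = new Text()
   private var processed = false
 
   override def initialize(inputSplit: InputSplit, context: TaskAttemptContext): Unit = {
     split = inputSplit.asInstanceOf[FileSplit]
     conf = context.getConfiguration
   }
 
   override def nextKeyValue(): Boolean = {
     if (processed) return false
     val path = split.getPath
     val in = path.getFileSystem(conf).open(path)
     try {
       val contents = new Array[Byte](split.getLength.toInt)
       IOUtils.readFully(in, contents, 0, contents.length)
       value.set(contents, 0, contents.length)
     } finally {
       in.close()
     }
     processed = true
     true
   }
 
   override def getCurrentKey(): NullWritable = NullWritable.get()
   override def getCurrentValue(): Text = value
   override def getProgress(): Float = if (processed) 1.0f else 0.0f
   override def close(): Unit = ()
 }
 
 // Hypothetical usage from Spark Streaming, yielding one String per file:
 // val perFile = ssc.fileStream[NullWritable, Text, WholeFileInputFormat]("s3n://bucket/path")
 //                  .map { case (_, contents) => contents.toString }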