Re: Python API Performance

2014-01-31 Thread Jeremy Freeman
The test I was referring to was the included KMeans algorithm, which uses NumPy for PySpark but can be done without jblas in Scala, so it tests basic performance rather than the matrix libraries. I can certainly try the ALS test, though note that the Scala example you pointed to uses Colt, whereas m

Re: Distributed streaming quantiles with PySpark

2014-01-31 Thread Nick Pentreath
Thanks Uri, I came across that and took a quick look, seems interesting. On a related note, it would be quite cool to have a sort of port of Algebird (or at least count-min, top-k and HLL, perhaps bloom filter) to Python, implemented monoid-style for use in PySpark... — Sent from Mailbox for iPhone
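[Editor's note: for illustration, a minimal self-contained count-min sketch in Scala showing the monoid-style structure meant here — an associative merge with an identity element. The depth/width defaults and hashing scheme are illustrative, not Algebird's actual API.]

```scala
import scala.util.hashing.MurmurHash3

// Minimal count-min sketch with a monoid-style merge; a sketch of the
// structure a Python port for PySpark could mirror.
class CountMin(val depth: Int, val width: Int, val table: Array[Array[Long]]) {

  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row) // one hash function per row
    ((h % width) + width) % width             // non-negative column index
  }

  // Returns a new sketch with one occurrence of `item` recorded.
  def add(item: String): CountMin = {
    val t = table.map(_.clone)
    for (row <- 0 until depth) t(row)(bucket(item, row)) += 1L
    new CountMin(depth, width, t)
  }

  // Associative merge with CountMin.zero as identity: the monoid property
  // that lets per-partition sketches be combined with a plain reduce.
  def ++(other: CountMin): CountMin = {
    require(depth == other.depth && width == other.width)
    val t = Array.tabulate(depth, width)((r, c) => table(r)(c) + other.table(r)(c))
    new CountMin(depth, width, t)
  }

  // Point estimate of an item's count (an over-count, never an under-count).
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}

object CountMin {
  def zero(depth: Int = 5, width: Int = 272): CountMin =
    new CountMin(depth, width, Array.ofDim[Long](depth, width))
}
```

[With an RDD[String] this composes as `rdd.aggregate(CountMin.zero())(_ add _, _ ++ _)`, which is exactly the property a PySpark port would want.]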

RE: Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend

2014-01-31 Thread Hussam_Jarada
I found the issue, which was due to my app looking for the wrong Spark jar. Thanks, Hussam From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: Monday, January 20, 2014 6:17 PM To: user@spark.incubator.apache.org Subject: Re: Error: Could not find or load main class org.apache.spark.executo

Distributed streaming quantiles with PySpark

2014-01-31 Thread Uri Laserson
Hi everyone, I implemented a version of distributed streaming quantiles for PySpark. It uses a count-min sketch approach. You can find the code here: https://github.com/laserson/dsq Thought it might be of interest... Uri -- Uri Laserson, PhD Data Scientist, Cloudera Twitter/GitHub: @laserso

Re: Single application using all the cores - preventing other applications from running

2014-01-31 Thread Timothee Besset
Thank you! TTimo On Fri, Jan 31, 2014 at 4:48 PM, Matei Zaharia wrote: > You can set the spark.cores.max property in your application to limit the > maximum number of cores it will take. Check out > http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling. > It's

Re: Single application using all the cores - preventing other applications from running

2014-01-31 Thread Matei Zaharia
You can set the spark.cores.max property in your application to limit the maximum number of cores it will take. Check out http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling. It's also possible to control scheduling in more detail within a Spark application,
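[Editor's note: a minimal sketch of the suggestion, assuming the SparkConf API (Spark 0.9+); the master URL and core count are placeholders. On earlier versions the same property can be set with System.setProperty before creating the SparkContext.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Cap this application at 4 cores on a standalone cluster so other
// applications can get the remaining cores.
val conf = new SparkConf()
  .setMaster("spark://master:7077")  // placeholder standalone master URL
  .setAppName("CappedApp")
  .set("spark.cores.max", "4")       // don't grab every core in the cluster
val sc = new SparkContext(conf)
```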

Re: Single application using all the cores - preventing other applications from running

2014-01-31 Thread Mayur Rustagi
Go for the fair scheduler with different weights. Default is FIFO. If you are feeling adventurous, try out the Sparrow scheduler. Regards Mayur On Feb 1, 2014 4:12 AM, "Timothee Besset" wrote: > Hello, > > What are my options to balance resources between multiple applications > running against a Spark clu
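[Editor's note: a minimal sketch of fair scheduling; the pool name is illustrative. Pools and their weights (<weight>, <minShare>) are declared in an allocation file such as conf/fairscheduler.xml.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Switch the in-application scheduler from the default FIFO to FAIR.
val conf = new SparkConf()
  .setAppName("FairShared")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Jobs submitted from this thread run in the "heavy" pool; other threads
// can submit to pools configured with different weights.
sc.setLocalProperty("spark.scheduler.pool", "heavy")
```

[Note that the fair scheduler balances jobs within one application; across applications on a standalone cluster, spark.cores.max (see Matei's reply above) is the main lever.]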

Single application using all the cores - preventing other applications from running

2014-01-31 Thread Timothee Besset
Hello, What are my options to balance resources between multiple applications running against a Spark cluster? I am using the standalone cluster [1] setup on my local machine, and starting a single application uses all the available cores. As long as that first application is running, no other ap

Spark app gets slower as it gets executed more times

2014-01-31 Thread Aureliano Buendia
Hi, I've noticed my Spark app (on EC2) gets slower and slower as I repeatedly execute it. With a fresh EC2 cluster it is snappy, taking about 15 mins to complete; after running the same app 4 times it slows down to 40 mins or more. While the cluster gets slower, the monitoring met

Re: Python API Performance

2014-01-31 Thread Josh Rosen
If anyone wants to benchmark PySpark against the Scala/Java APIs, it might be nice to add Python benchmarks to the spark-perf performance testing suite: https://github.com/amplab/spark-perf. On Thu, Jan 30, 2014 at 3:53 PM, nileshc wrote: > Hi Jeremy, > > Can you try doing a comparison of the S

Connecting to remote Spark cluster using Java+Maven

2014-01-31 Thread Guillermo Cabrera
Hi: I have a 2-node Spark cluster that I built with Hadoop 2.2.0 compatibility; I also have HDFS on both machines. Everything works great, and I can read files from HDFS through the Spark shell. My question is about what is required to connect to this cluster from a machine outside the cluster. So, i
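[Editor's note: a sketch of a driver running on an outside machine, in Scala for brevity (the Java API's JavaSparkContext takes the same settings). Hostnames, ports, and the jar path are placeholders; the outside machine must be able to reach the master, and the workers must be able to connect back to the driver.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")  // the cluster's standalone master URL
  .setAppName("RemoteClient")
  .setJars(Seq("target/my-app-1.0.jar"))  // ship the Maven-built jar to the workers
val sc = new SparkContext(conf)

// Read from the cluster's HDFS, as in the Spark shell.
val lines = sc.textFile("hdfs://namenode-host:9000/path/to/file")
println(lines.count())
```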

Re: MLLib Sparse Input

2014-01-31 Thread Xiangrui Meng
Hi Jason, Sorry, I didn't see this message before I replied in another thread, so the following is a copy-and-paste: We are currently working on sparse data support, one of the highest priority features for MLlib. All existing algorithms will support sparse input. We will open a JIRA ticket for

MLLib Sparse Input

2014-01-31 Thread jshao
Hi, Spark is absolutely amazing for machine learning as its iterative process is super fast. However, one big issue I've realized is that the MLlib API isn't suitable for sparse input at all, because it requires the feature vector to be a dense array. For example, I currently want to run a logi
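[Editor's note: for context, a minimal sketch of why this hurts under the MLlib API of the time (Spark 0.9), where LabeledPoint carries a dense Array[Double]. The dimensions and indices are illustrative.]

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext("local", "SparseDemo")

// A sample with only two non-zero features still has to allocate and
// shuffle all 100,000 doubles, since LabeledPoint takes a dense array.
val numFeatures = 100000
val dense = new Array[Double](numFeatures)
dense(17) = 1.0
dense(42035) = 3.5

val data = sc.parallelize(Seq(LabeledPoint(1.0, dense)))
val model = LogisticRegressionWithSGD.train(data, 20) // 20 SGD iterations
```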

Re: Cassandra composite and simple keys

2014-01-31 Thread Heiko Braun
Since C* + Spark seems to be used quite a lot, maybe you guys could update the docs & examples with a description of the different approaches, i.e. benefits and drawbacks, etc.? I think this would be of great help for the community. Regards, Heiko On 31 Jan 2014, at 16:49, Rohit Rai wrote: >

Re: Cassandra composite and simple keys

2014-01-31 Thread Rohit Rai
Hi Anton, I'd recommend using CqlPagingInputFormat instead of CFIF when dealing with composite keys, in which case Cassandra takes care of serializing the data back as simple columns. Shameless plug: use our C* + Spark library Calliope ( http://tuplejump.github.io/calliope/ ) if you are using Sp
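[Editor's note: a hedged sketch of reading via CqlPagingInputFormat, assuming the Cassandra 1.2/2.0-era Hadoop integration classes; the host, keyspace, and table names are placeholders.]

```scala
import java.nio.ByteBuffer
import org.apache.cassandra.hadoop.ConfigHelper
import org.apache.cassandra.hadoop.cql3.{CqlConfigHelper, CqlPagingInputFormat}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext

val sc = new SparkContext("local", "CqlDemo")

// Point the input format at the cluster and table to read.
val conf = new Configuration()
ConfigHelper.setInputInitialAddress(conf, "cassandra-host")
ConfigHelper.setInputRpcPort(conf, "9160")
ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_table")
ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner")
CqlConfigHelper.setInputCQLPageRowSize(conf, "1000")

// CqlPagingInputFormat hands each row back as name -> ByteBuffer maps,
// with composite keys already unpacked into simple columns.
val rows = sc.newAPIHadoopRDD(
  conf,
  classOf[CqlPagingInputFormat],
  classOf[java.util.Map[String, ByteBuffer]],
  classOf[java.util.Map[String, ByteBuffer]])

println(rows.count())
```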