The test I was referring to was the included KMeans algorithm, which uses
NumPy for PySpark but can be done without jBlas in Scala, so it is more a
test of basic performance than of matrix libraries.
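For reference, the kernel being timed in such a test is essentially the closest-center assignment step, which needs no matrix library (a simplified pure-Python sketch for illustration, not the bundled example's actual code):

```python
# Simplified k-means assignment step: pure Python, no matrix library needed.
def closest_center(point, centers):
    """Return the index of the center nearest to `point`
    by squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda i: sqdist(point, centers[i]))

centers = [(0.0, 0.0), (10.0, 10.0)]
print(closest_center((1.0, 2.0), centers))  # 0 (closer to the origin)
```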
I can certainly try the ALS test, though note that the Scala example you
pointed to uses Colt, whereas m
Thanks Uri, I came across that and took a quick look, seems interesting.
On a related note, it would be quite cool to have a sort of port of Algebird
(or at least count-min, top-k and HLL, perhaps bloom filter) to Python, with
monoid-style implementations for use in PySpark...
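To make the idea concrete, a minimal monoid-style count-min sketch in pure Python might look like this (a sketch of the concept only, not Algebird's actual API; class and method names are invented for illustration):

```python
# Minimal count-min sketch. Two sketches with the same shape and hashing
# combine by element-wise addition of their tables, which is what makes
# the structure a monoid (merge is associative; the all-zeros sketch is
# the identity).
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # One deterministic hash per row, derived from a row-specific seed.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # Min over rows bounds the overcount from hash collisions.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))

    def merge(self, other):
        # The monoid "plus": element-wise table addition.
        out = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            for c in range(self.width):
                out.table[r][c] = self.table[r][c] + other.table[r][c]
        return out
```

Because `merge` is associative, per-partition sketches could in principle be built and then combined with an aggregation such as `rdd.aggregate` in PySpark.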
—
Sent from Mailbox for iPhone
I found the issue, which was due to my app looking for the wrong Spark jar.
Thanks,
Hussam
From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Monday, January 20, 2014 6:17 PM
To: user@spark.incubator.apache.org
Subject: Re: Error: Could not find or load main class
org.apache.spark.executo
Hi everyone,
I implemented a version of distributed streaming quantiles for PySpark. It
uses a count-min sketch approach. You can find the code here:
https://github.com/laserson/dsq
Thought it might be of interest...
Uri
--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
Thank you!
TTimo
On Fri, Jan 31, 2014 at 4:48 PM, Matei Zaharia wrote:
> You can set the spark.cores.max property in your application to limit the
> maximum number of cores it will take. Check out
> http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling.
> It's
You can set the spark.cores.max property in your application to limit the
maximum number of cores it will take. Check out
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling.
It’s also possible to control scheduling in more detail within a Spark
application,
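One way to set that property from the submitting machine (a configuration sketch; the property name is from the linked docs, while the value 4 is an arbitrary example):

```shell
# Configuration sketch for standalone mode: cap each application launched
# from this machine at 4 cores by setting the spark.cores.max system
# property (e.g. in conf/spark-env.sh). The value 4 is only an example.
export SPARK_JAVA_OPTS="-Dspark.cores.max=4"
```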
Go for the Fair scheduler with different weights; the default is FIFO. If
you are feeling adventurous, try out the Sparrow scheduler.
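The fair scheduler pools mentioned here are configured (within a single application) through an allocations file pointed to by `spark.scheduler.allocation.file`; a hypothetical example with different weights, following the format in the Spark job scheduling docs:

```xml
<!-- conf/fairscheduler.xml: example pool definitions; names and numbers
     here are illustrative, not prescribed. -->
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="adhoc">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```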
Regards
Mayur
On Feb 1, 2014 4:12 AM, "Timothee Besset" wrote:
> Hello,
>
> What are my options to balance resources between multiple applications
> running against a Spark clu
Hello,
What are my options to balance resources between multiple applications
running against a Spark cluster?
I am using the standalone cluster [1] setup on my local machine, and
starting a single application uses all the available cores. As long as that
first application is running, no other ap
Hi,
I've noticed my Spark app (on EC2) gets slower and slower as I repeatedly
execute it.
With a fresh EC2 cluster it is snappy and takes about 15 mins to complete;
after running the same app 4 times it slows down to 40 mins or more.
While the cluster gets slower, the monitoring met
If anyone wants to benchmark PySpark against the Scala/Java APIs, it might
be nice to add Python benchmarks to the spark-perf performance testing
suite: https://github.com/amplab/spark-perf.
On Thu, Jan 30, 2014 at 3:53 PM, nileshc wrote:
> Hi Jeremy,
>
> Can you try doing a comparison of the S
Hi:
I have a 2-node Spark cluster that I built with Hadoop 2.2.0 compatibility,
and I also have HDFS on both machines. Everything works great; I can read
files from HDFS through the Spark shell. My question is about what is
required to connect to this cluster from a machine outside my cluster. So, i
Hi Jason,
Sorry, I didn't see this message before I replied in another thread.
So the following is copy-and-paste:
We are currently working on the sparse data support, one of the
highest priority features for MLlib. All existing algorithms will
support sparse input. We will open a JIRA ticket for
Hi,
Spark is absolutely amazing for machine learning, as its iterative
processing is super fast. However, one big issue I realized is that the
MLlib API isn't suitable for sparse inputs at all, because it requires the
feature vector to be a dense array.
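As a hypothetical illustration of the alternative (invented for this note, not MLlib's actual API), a sparse feature vector can be carried as an index-to-value dict instead of a dense array:

```python
# Hypothetical sparse representation: store only the nonzero entries
# as {index: value}. The helper name is invented for illustration.
def sparse_dot(sv, w):
    """Dot product of a sparse vector (dict) with a dense weight list."""
    return sum(v * w[i] for i, v in sv.items())

# A vector of dimension 6 with only two nonzero entries:
sv = {0: 1.0, 5: 2.0}
w = [0.5] * 6
print(sparse_dot(sv, w))  # 1.0*0.5 + 2.0*0.5 = 1.5
```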
For example, I currently want to run a logi
Since C* + Spark seems to be used quite a lot, maybe you guys could update
the docs & examples with a description of the different approaches, i.e.
benefits, drawbacks, etc.? I think this would be of great help for the
community.
Regards, Heiko
On 31 Jan 2014, at 16:49, Rohit Rai wrote:
>
Hi Anton,
I'd recommend using CqlPagingInputFormat instead of CFIF when dealing with
composite keys... In that case Cassandra takes care of serializing the data
back as simple columns.
Shameless Plug: Use our C*+Spark library Calliope (
http://tuplejump.github.io/calliope/ ) if you are using Sp