I’m also one of the authors of this paper, and I was responsible for its Spark
experiments. Thank you guys for the discussion!
(1)
Ignacio Zendejas wrote:
I should rephrase my question as it was poorly phrased: on average, how
much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
I’m one of the authors of this paper, and I just came across this thread.
I’m glad that Ignacio Zendejas noticed our paper!
First off, let me post a link to the published version of the paper, which is
likely slightly different from the version linked above:
@Ignacio, happy to share, here's a link to a library we've been developing
(https://github.com/freeman-lab/thunder). As just a couple of examples, we have
pipelines that use Fourier transforms and other signal processing from scipy,
and others that do massively parallel model fitting via Scikit
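The "massively parallel model fitting" pattern mentioned above can be sketched without Spark or thunder at all. Below is a minimal stdlib-only illustration of the idea (one independent model fit per record, farmed out to a worker pool); the `fit_line` helper and the synthetic data are purely hypothetical stand-ins for thunder's actual pipelines, which run on Spark RDDs and scipy/scikit routines.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_line(series):
    # Ordinary least squares for y = slope * x + intercept, with x = 0..n-1.
    n = len(series)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# One "record" per time series; each fit is independent of the others,
# so the records can be distributed to workers -- on Spark this step
# would be a map over an RDD of series rather than a local thread pool.
data = [[2.0 * i + j for i in range(50)] for j in range(8)]
with ThreadPoolExecutor() as pool:
    fits = list(pool.map(fit_line, data))
```

Each synthetic series here is an exact line `y = 2x + j`, so every fitted slope comes back as 2.0 and the intercepts recover `j`.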
Thanks, Jeremy! That's awesome. There's a group at Facebook that is
considering using Spark, so to have more projects to refer to is great.
And Matei, I completely agree. MLlib is very exciting. I respect how well
you guys are managing the project for quality. This will set the Spark
ecosystem
Has anyone had a chance to look at this paper (with title in subject)?
http://www.cs.rice.edu/~lp6/comparison.pdf
Interesting that they chose to use Python alone. Do we know how much faster
Scala is vs. Python in general, if at all?
As with any and all benchmarks, I'm sure there are caveats, but
They only compared their own implementations of a couple of algorithms on
different platforms, rather than comparing the platforms themselves (in the
case of Spark -- PySpark). I can write two variants of an algorithm on Spark
and make them perform drastically differently.
I have no doubt if
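Reynold's point, that two implementations of the same algorithm can differ drastically on the same platform, can be illustrated even in plain local Python. This is a hypothetical sketch, not code from the paper: both functions compute identical word counts, but one rescans the input once per distinct word while the other makes a single pass.

```python
def word_count_rescan(words):
    # Variant A: list.count() rescans the whole list for each distinct
    # word, so the total work is O(n * d) for d distinct words.
    return {w: words.count(w) for w in set(words)}

def word_count_single_pass(words):
    # Variant B: one pass with a running tally -- O(n).
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

words = ("spark pyspark spark mllib " * 1000).split()
assert word_count_rescan(words) == word_count_single_pass(words)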
Our experience matches Reynold's comments; pure-Python implementations of
anything are generally sub-optimal compared to pure-Scala implementations,
or Scala versions exposed to Python (which are faster, but still slower than
pure Scala). It also seems at first glance that some of the
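The gap described above (pure Python vs. a native implementation merely exposed to Python) has a loose stdlib analogy; this is not PySpark code, just an illustration of the same effect. A hand-written Python loop does every iteration in the interpreter, while the built-in `sum` runs its loop in C and Python only dispatches to it, much as PySpark can dispatch work to Scala/JVM code instead of running Python per record.

```python
import timeit

def py_sum(xs):
    # Pure-Python loop: every iteration and addition runs in the interpreter.
    total = 0
    for x in xs:
        total += x
    return total

xs = list(range(100_000))
assert py_sum(xs) == sum(xs)  # same answer either way

# Built-in sum() runs its loop in C; the wrapper call is cheap relative
# to the per-element interpreter overhead of py_sum.
t_pure = timeit.timeit(lambda: py_sum(xs), number=20)
t_native = timeit.timeit(lambda: sum(xs), number=20)
```

On a typical CPython build `t_native` comes out several times smaller than `t_pure`; the exact ratio varies by machine, which is why the analogy is only directional.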
On a related note, I recently heard about Distributed R
https://github.com/vertica/DistributedR, which is coming out of
HP/Vertica and seems to be their proposition for machine learning at scale.
It would be interesting to see some kind of comparison between that and
MLlib (and perhaps also
Actually I believe the same person started both projects.
The Distributed R project from HP was started by Shivaram Venkataraman when
he was there. He has since moved to the Berkeley AMPLab to pursue a PhD, and
SparkR is his latest project.
On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas
BTW you can find the original Presto (rebranded as Distributed R) paper
here:
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Venkataraman.pdf
On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin r...@databricks.com wrote:
Actually I believe the same person started both projects.
The
Yeah, I worked on DistributedR while I was an intern at HP Labs, but it has
evolved a lot since then. I don't think it's a direct comparison, as
DistributedR is a pure R implementation in a distributed setting, while
SparkR is a wrapper around the Scala/Java implementations in Spark.
That said, it
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas
ignacio.zendejas...@gmail.com wrote:
Yep, I thought it was a bogus comparison.
I should rephrase my question as it was poorly phrased: on average, how
much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
I've only used Spark
On Wed, Aug 13, 2014 at 2:31 PM, Davies Liu dav...@databricks.com wrote:
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas
ignacio.zendejas...@gmail.com wrote:
Yep, I thought it was a bogus comparison.
I should rephrase my question as it was poorly phrased: on average, how
much faster is