Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-09-20 Thread Seraph
I’m also one of the authors of this paper, and I am responsible for the Spark experiments in it. Thank you all for the discussion! (1) Ignacio Zendejas wrote: I should rephrase my question as it was poorly phrased: on average, how much faster is Spark v. PySpark (I didn't really mean

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-09-07 Thread cjermaine
I’m one of the authors of this paper, and I just came across this thread. I’m glad that Ignacio Zendejas noticed our paper! First off, let me post a link to the published version of the paper, which is likely slightly different than the version linked above:

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-14 Thread Jeremy Freeman
@Ignacio, happy to share, here's a link to a library we've been developing (https://github.com/freeman-lab/thunder). As just a couple of examples, we have pipelines that use Fourier transforms and other signal processing from scipy, and others that do massively parallel model fitting via Scikit

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-14 Thread Ignacio Zendejas
Thanks, Jeremy! That's awesome. There's a group at Facebook that is considering using Spark, so to have more projects to refer to is great. And Matei, I completely agree. MLlib is very exciting. I respect how well you guys are managing the project for quality. This will set the Spark ecosystem

A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Ignacio Zendejas
Has anyone had a chance to look at this paper (with title in subject)? http://www.cs.rice.edu/~lp6/comparison.pdf Interesting that they chose to use Python alone. Do we know how much faster Scala is vs. Python in general, if at all? As with any and all benchmarks, I'm sure there are caveats, but

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
They only compared their own implementations of a couple of algorithms on different platforms rather than comparing the different platforms themselves (in the case of Spark -- PySpark). I can write two variants of an algorithm on Spark and make them perform drastically differently. I have no doubt if

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Jeremy Freeman
Our experience matches Reynold's comments; pure-Python implementations of anything are generally sub-optimal compared to pure Scala implementations, or Scala versions exposed to Python (which are faster, but still slower than pure Scala). It also seems at first glance that some of the

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Nicholas Chammas
On a related note, I recently heard about Distributed R https://github.com/vertica/DistributedR, which is coming out of HP/Vertica and seems to be their proposition for machine learning at scale. It would be interesting to see some kind of comparison between that and MLlib (and perhaps also

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
Actually I believe the same person started both projects. The Distributed R project from HP was started by Shivaram Venkataraman when he was there. He has since moved to Berkeley AMPLab to pursue a PhD and SparkR was his latest project. On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
BTW you can find the original Presto (rebranded as Distributed R) paper here: http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Venkataraman.pdf On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin r...@databricks.com wrote: Actually I believe the same person started both projects. The

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Shivaram Venkataraman
Yeah I worked on DistributedR while I was an intern at HP Labs, but it has evolved a lot since then. I don't think it's a direct comparison as DistributedR is a pure R implementation in a distributed setting while SparkR is a wrapper around the Scala / Java implementations in Spark. That said, it

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote: Yep, I thought it was a bogus comparison. I should rephrase my question as it was poorly phrased: on average, how much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? I've only used Spark

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:31 PM, Davies Liu dav...@databricks.com wrote: On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote: Yep, I thought it was a bogus comparison. I should rephrase my question as it was poorly phrased: on average, how much faster is