Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-13 Thread Yu Ishikawa
Hi all, I am also interested in specifying a common framework. I am trying to implement hierarchical k-means and a hierarchical clustering method, similar to single-link, with LSH. https://issues.apache.org/jira/browse/SPARK-2966 If you have designed the standardized clustering algorithms API,

A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Ignacio Zendejas
Has anyone had a chance to look at this paper (with title in subject)? http://www.cs.rice.edu/~lp6/comparison.pdf Interesting that they chose to use Python alone. Do we know how much faster Scala is vs. Python in general, if at all? As with any and all benchmarks, I'm sure there are caveats, but

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
They only compared their own implementations of a couple of algorithms on different platforms rather than comparing the different platforms themselves (in the case of Spark -- PySpark). I can write two variants of an algorithm on Spark and make them perform drastically differently. I have no doubt if

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Jeremy Freeman
Our experience matches Reynold's comments; pure-Python implementations of anything are generally sub-optimal compared to pure Scala implementations, or Scala versions exposed to Python (which are faster, but still slower than pure Scala). It also seems at first glance that some of the

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Nicholas Chammas
On a related note, I recently heard about Distributed R https://github.com/vertica/DistributedR, which is coming out of HP/Vertica and seems to be their proposition for machine learning at scale. It would be interesting to see some kind of comparison between that and MLlib (and perhaps also

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
Actually, I believe the same person started both projects. The Distributed R project from HP was started by Shivaram Venkataraman when he was there. He has since moved to the Berkeley AMPLab to pursue a PhD, and SparkR is his latest project. On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
BTW you can find the original Presto (rebranded as Distributed R) paper here: http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Venkataraman.pdf On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin r...@databricks.com wrote: Actually I believe the same person started both projects. The

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Shivaram Venkataraman
Yeah, I worked on DistributedR while I was an intern at HP Labs, but it has evolved a lot since then. I don't think it's a direct comparison, as DistributedR is a pure R implementation in a distributed setting while SparkR is a wrapper around the Scala / Java implementations in Spark. That said, it

Added support for :cp jar to the Spark Shell

2014-08-13 Thread Robert C Senkbeil
I've created a new pull request, which can be found at https://github.com/apache/spark/pull/1929. Since Spark is using Scala 2.10.3 and there is a known issue with Scala 2.10.x not supporting the :cp command (https://issues.scala-lang.org/browse/SI-6502), the Spark shell does not have the

Re: Added support for :cp jar to the Spark Shell

2014-08-13 Thread Reynold Xin
I haven't read the code yet, but if it is what I think it is, this is SUPER, UBER, HUGELY useful. On a related note, I asked about this on the Scala dev list but never got a satisfactory answer https://groups.google.com/forum/#!msg/scala-internals/_cZ1pK7q6cU/xyBQA0DdcYwJ On Wed, Aug 13,

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote: Yep, I thought it was a bogus comparison. I should rephrase my question as it was poorly phrased: on average, how much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? I've only used Spark

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-13 Thread Graham Dennis
I now have a complete pull request for this issue that I'd like to get reviewed and committed. The PR is available here: https://github.com/apache/spark/pull/1890 and includes a test case for the issue I described. I've also submitted a related PR (https://github.com/apache/spark/pull/1827) that

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:31 PM, Davies Liu dav...@databricks.com wrote: On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote: Yep, I thought it was a bogus comparison. I should rephrase my question as it was poorly phrased: on average, how much faster is

Need info on Spark's Communication/Networking layer...

2014-08-13 Thread aniketadnaik
Hi, I am new to Spark and want to learn more about Spark's master-worker/cluster manager communication architecture. Any documents or code pointers would be helpful to start with. Thanks!

Re: Need info on Spark's Communication/Networking layer...

2014-08-13 Thread Rajiv Abraham
Hi Aniket, Perhaps this video will help: https://www.youtube.com/watch?v=HG2Yd-3r4-M&list=PLTPXxbhUt-YWGNTaDj6HSjnHMxiTD1HCR&index=1 You can find other up-to-date videos and slides here: http://spark-summit.org/2014/training Best regards, Rajiv 2014-08-13 19:36 GMT-04:00 aniketadnaik

acquire and give back resources dynamically

2014-08-13 Thread 牛兆捷
Dear all: Can Spark acquire resources from YARN and give them back dynamically? -- *Regards,* *Zhaojie*

proposal for pluggable block transfer interface

2014-08-13 Thread Reynold Xin
Hi devs, I posted a design doc proposing an interface for pluggable block transfer (used in shuffle, broadcast, block replication, etc.). This is expected to be done in the 1.2 time frame. It should make our code base cleaner, and enable us to provide alternative implementations of block transfers