Link not working

2014-04-21 Thread prabeesh k
For Spark-0.8.0, the download links are not working. Please update them. Regards, prabeesh

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Matei Zaharia
The wiki is actually maintained separately in https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. We restricted editing of the wiki because bots would automatically add stuff. I’ve given you permissions now. Matei On Apr 21, 2014, at 6:22 PM, Nan Zhu wrote: > I thought those are

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Nan Zhu
I thought those are files of spark.apache.org? -- Nan Zhu On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote: > The markdown files are under spark/docs. You can submit a PR for > changes. -Xiangrui > > On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza wrot

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
I thought this might be a good thing to add to the wiki's "How to contribute" page, as it's not tied to a release. On Mon, Apr 21, 2014 at 6:09 PM, Xiangrui Meng wrote: > The markdown files are under spark/docs. You can s

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Xiangrui Meng
The markdown files are under spark/docs. You can submit a PR for changes. -Xiangrui On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza wrote: > How do I get permissions to edit the wiki? > > > On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng wrote: > >> Cannot agree more with your words. Could you add on

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
How do I get permissions to edit the wiki? On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng wrote: > Cannot agree more with your words. Could you add one section about > "how and what to contribute" to MLlib's guide? -Xiangrui > > On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath > wrote: > > I'd

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Xiangrui Meng
Cannot agree more with your words. Could you add one section about "how and what to contribute" to MLlib's guide? -Xiangrui On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath wrote: > I'd say a section in the "how to contribute" page would be a good place to > put this. > > In general I'd say that

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Nick Pentreath
I'd say a section in the "how to contribute" page would be a good place to put this. In general I'd say that the criteria for inclusion of an algorithm are that it should be high quality, widely known, used and accepted (citations and concrete use cases as examples of this), scalable and parallelizab

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sandy Ryza
If it's not done already, would it make sense to codify this philosophy somewhere? I imagine this won't be the first time this discussion comes up, and it would be nice to have a doc to point to. I'd be happy to take a stab at this. On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng wrote: > +1

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Xiangrui Meng
+1 on Sean's comment. MLlib covers the basic algorithms but we definitely need to spend more time on how to make the design scalable. For example, think about current "ProblemWithAlgorithm" naming scheme. That being said, new algorithms are welcomed. I wish they are well-established and well-unders

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sean Owen
On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown wrote: > - MLlib as Mahout.next would be unfortunate. There are some gems in > Mahout, but there are also lots of rocks. Setting a minimal bar of > working, correctly implemented, and documented requires a surprising amount > of work. As someone wit

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Paul Brown
I agree that it will be good to see more algorithms added to the MLlib universe, although this does bring to mind a couple of comments: - MLlib as Mahout.next would be unfortunate. There are some gems in Mahout, but there are also lots of rocks. Setting a minimal bar of working, correctly impl

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Aliaksei Litouka
Thank you very much for the detailed answers. I can't but agree that a good MLlib core is a higher priority than algorithms built on top of it. I'll check if I can contribute anything to the core. I will also follow Nick Pentreath's recommendation to start a new GitHub project. Actually, here is a link

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Nick Pentreath
I am very much +1 on Sean's comment. I think the correct abstractions and API for Vectors, Matrices and distributed matrices (distributed row matrix etc) will, once bedded down and battle tested in the wild, allow a whole lot of flexibility for developers of algorithms on top of MLlib core. This

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sang Venkatraman
Hi, On a related note, I have not looked at the MLlib library in detail, but are there plans to reuse or port over parts of Apache Mahout? Thanks, Sang On Mon, Apr 21, 2014 at 12:07 PM, Evan R. Sparks wrote: > While DBSCAN and others would be welcome contributions, I couldn't agree > m

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Evan R. Sparks
While DBSCAN and others would be welcome contributions, I couldn't agree more with Sean. On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen wrote: > Nobody asked me, and this is a comment on a broader question, not this > one, but: > > In light of a number of recent items about adding more algorithms

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sean Owen
Nobody asked me, and this is a comment on a broader question, not this one, but: In light of a number of recent items about adding more algorithms, I'll say that I personally think an explosion of algorithms should come after the MLlib "core" is more fully baked. I'm thinking of finishing out the

Any plans for new clustering algorithms?

2014-04-21 Thread Aliaksei Litouka
Hi, Spark developers. Are there any plans for implementing new clustering algorithms in MLlib? As far as I understand, the current version of Spark ships with only one clustering algorithm - K-Means. I want to contribute to Spark and I'm thinking of adding more clustering algorithms - maybe DBSCAN

Re: all values for a key must fit in memory

2014-04-21 Thread Sandy Ryza
Thanks Matei and Mridul - was basically wondering whether we would be able to change the shuffle to accommodate this after 1.0, and from your answers it sounds like we can. On Mon, Apr 21, 2014 at 12:31 AM, Mridul Muralidharan wrote: > As Matei mentioned, the Values is now an Iterable : which ca

Re: all values for a key must fit in memory

2014-04-21 Thread Mridul Muralidharan
As Matei mentioned, the Values is now an Iterable, which can be disk backed. Does that not address the concern? @Patrick - we do have cases where the length of the sequence is large and size per value is also non-trivial, so we do need this :-) Note that join is a trivial example where this is