Contribution to MLlib

2014-07-09 Thread MEETHU MATHEW
Hi, I am interested in contributing a clustering algorithm to Spark's MLlib. I am focusing on the Gaussian Mixture Model, but I saw a JIRA at https://spark-project.atlassian.net/browse/SPARK-952 regarding the same. I would like to know whether the Gaussian Mixture Model is already implemented or

Re: Contribution to MLlib

2014-07-09 Thread RJ Nowling
Hi Meethu, There is no code for a Gaussian Mixture Model clustering algorithm in the repository, but I don't know if anyone is working on it. RJ On Wednesday, July 9, 2014, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I am interested in contributing a clustering algorithm towards MLlib

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread RJ Nowling
Thanks everyone for the input. So it seems what people want is: * Implement MiniBatch KMeans and Hierarchical KMeans (divide-and-conquer approach; look at the DecisionTree implementation as a reference) * Restructure 3 KMeans clustering algorithm implementations to prevent code duplication and

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread Nick Pentreath
Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great. I'd say it would make most sense to submit a PR for the standardised API first, agree on that with everyone, and then build on it for the specific implementations.

Unresponsive to PR/jira changes

2014-07-09 Thread Mridul Muralidharan
Hi, I noticed today that Gmail has been marking most of the mails I receive from Spark GitHub/JIRA as spam; I had assumed there was a lull in activity due to Spark Summit over the past few weeks! In case I have commented on specific PR/JIRA issues and not followed up, apologies for

Re: Contribution to MLlib

2014-07-09 Thread Xiangrui Meng
I don't know if anyone is working on it either. If that JIRA is not moved to Apache JIRA, feel free to create a new one and make a note that you are working on it. Thanks! -Xiangrui On Wed, Jul 9, 2014 at 4:56 AM, RJ Nowling rnowl...@gmail.com wrote: Hi Meethu, There is no code for a Gaussian

15 new MLlib algorithms

2014-07-09 Thread Michael Malak
At Spark Summit, Patrick Wendell indicated the number of MLlib algorithms would roughly double in 1.1 from the current approx. 15. http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf What are the planned additional algorithms? In Jira, I only see two when

CPU/Disk/network performance instrumentation

2014-07-09 Thread Kay Ousterhout
Hi all, I've been doing a bunch of performance measurement of Spark and, as part of doing this, added metrics that record the average CPU utilization, disk throughput and utilization for each block device, and network throughput while each task is running. These metrics are collected by reading

Re: CPU/Disk/network performance instrumentation

2014-07-09 Thread Reynold Xin
Maybe it's time to create an advanced mode in the ui. On Wed, Jul 9, 2014 at 12:23 PM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Hi all, I've been doing a bunch of performance measurement of Spark and, as part of doing this, added metrics that record the average CPU utilization, disk

Re: 15 new MLlib algorithms

2014-07-09 Thread Burak Yavuz
Hi, The roadmap for the 1.1 release and MLlib includes algorithms such as: Non-negative matrix factorization, Sparse SVD, Multiclass decision tree, Random Forests (?); optimizers such as: ADMM, Accelerated gradient methods; and also a statistical toolbox that includes: descriptive statistics,

Re: CPU/Disk/network performance instrumentation

2014-07-09 Thread Shivaram Venkataraman
I think it would be very useful to have this. We could put the UI display behind either a flag or a URL parameter. Shivaram On Wed, Jul 9, 2014 at 12:25 PM, Reynold Xin r...@databricks.com wrote: Maybe it's time to create an advanced mode in the ui. On Wed, Jul 9, 2014 at 12:23 PM, Kay

Re: ExecutorState.LOADING?

2014-07-09 Thread Kay Ousterhout
Git history to the rescue! It seems to have been added by Matei way back in July 2012: https://github.com/apache/spark/commit/5d1a887bed8423bd6c25660910d18d91880e01fe and then was removed a few months later (replaced by RUNNING) by the same Mr. Zaharia:

Re: ExecutorState.LOADING?

2014-07-09 Thread Mark Hamstra
Actually, I'm thinking about re-purposing it. There's a nasty behavior that I'll open a JIRA for soon, and that I'm thinking about addressing by introducing/using another ExecutorState transition. The basic problem is that Master can be overly aggressive in calling removeApplication on

Re: ExecutorState.LOADING?

2014-07-09 Thread Aaron Davidson
Agreed that the behavior of the Master killing off an Application when Executors from the same set of nodes repeatedly die is silly. This can also strike if a single node enters a state where any Executor created on it quickly dies (e.g., a block device becomes faulty). This prevents the

Testing period for better jenkins integration

2014-07-09 Thread Patrick Wendell
Just a heads up - I've added some better Jenkins integration that posts more useful messages on pull requests. We'll run this side-by-side with the current Jenkins messages for a while to make sure it's working well. Things may be a bit chatty while we are testing this - we can migrate over as

libgfortran Dependency

2014-07-09 Thread Taka Shinagawa
Hi, After testing Spark 1.0.1-RC2 on EC2 instances from the standard Ubuntu and Amazon Linux AMIs, I've noticed MLlib's dependency on the gfortran library (libgfortran.so.3). sbt assembly succeeds without this library installed, but sbt test fails as follows. I'm wondering if documenting this

Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-09 Thread qingyang li
Could I set a cache policy so that Spark loads data from Tachyon only once for all SQL queries, for example by using CacheAllPolicy, FIFOCachePolicy, or LRUCachePolicy? I have tried those three policies, but they did not help. I think that if Spark always loads data for each SQL query, it will impact

Re: libgfortran Dependency

2014-07-09 Thread Xiangrui Meng
It is documented in the official doc: http://spark.apache.org/docs/latest/mllib-guide.html On Wed, Jul 9, 2014 at 7:35 PM, Taka Shinagawa taka.epsi...@gmail.com wrote: Hi, After testing Spark 1.0.1-RC2 on EC2 instances from the standard Ubuntu and Amazon Linux AMIs, I've noticed the MLlib's

Re: libgfortran Dependency

2014-07-09 Thread Taka Shinagawa
Thanks for pointing me to the MLlib guide. I was looking only at the README and Spark docs. I also found it's already filed in JIRA: https://spark-project.atlassian.net/browse/SPARK-797 On Wed, Jul 9, 2014 at 7:45 PM, Xiangrui Meng men...@gmail.com wrote: It is documented in the official doc: