Re: KNN for large data set

2015-01-21 Thread Xiangrui Meng
For large datasets, you need hashing in order to compute k-nearest neighbors locally. A good starting point is a Google Scholar search for LSH + k-nearest: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote: Hi all, Please help me
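
The idea of hashing first and searching for neighbors only locally can be sketched in plain Scala. This is a minimal illustration, not code from the thread and not a Spark or MLlib API: it assumes cosine similarity, random-hyperplane (sign) hashing, and a single hash table, and every name in it (LshKnnSketch, randomPlanes, buildIndex, knn) is hypothetical.

```scala
import scala.util.Random

object LshKnnSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.indices.map(i => a(i) * b(i)).sum

  def cosine(a: Vec, b: Vec): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // Draw numPlanes random hyperplanes (as normal vectors) with a fixed seed.
  def randomPlanes(numPlanes: Int, dim: Int, seed: Long): Seq[Vec] = {
    val rng = new Random(seed)
    Seq.fill(numPlanes)(Array.fill(dim)(rng.nextGaussian()))
  }

  // One bit per hyperplane: which side of the plane the vector falls on.
  // Vectors separated by a small angle tend to share the same signature.
  def signature(v: Vec, planes: Seq[Vec]): String =
    planes.map(p => if (dot(v, p) >= 0) '1' else '0').mkString

  // Bucket point indices by signature.
  def buildIndex(points: IndexedSeq[Vec], planes: Seq[Vec]): Map[String, Seq[Int]] =
    points.indices.groupBy(i => signature(points(i), planes))

  // Approximate k-NN: rank only the points that share the query's bucket.
  def knn(query: Vec, k: Int, points: IndexedSeq[Vec],
          planes: Seq[Vec], index: Map[String, Seq[Int]]): Seq[Int] = {
    val candidates = index.getOrElse(signature(query, planes), Seq.empty)
    val scored = candidates.map(i => (i, cosine(query, points(i))))
    scored.sortWith(_._2 > _._2).take(k).map(_._1)
  }
}
```

Because only one bucket is searched, the result is approximate and may hold fewer than k points. LSH schemes usually build several tables with independent hyperplanes and merge the candidates to improve recall; in a distributed setting the signature could serve as the key that brings candidate neighbors into the same partition.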

Re: Issue with repartition and cache

2015-01-21 Thread Dirceu Semighini Filho
Hi Sandy, thanks for the reply. I tried running this code without the cache and it worked. It also works if I cache before repartitioning, so the problem seems to be related to the combination of repartition and caching. My train is a SchemaRDD, and if I make all my columns StringType, the error

Re: Issue with repartition and cache

2015-01-21 Thread Sandy Ryza
Hi Dirceu, Does the issue not show up if you run map(f => f(1).asInstanceOf[Int]).sum on the train RDD? It appears that f(1) is a String, not an Int. If you're looking to parse and convert it, toInt should be used instead of asInstanceOf. -Sandy On Wed, Jan 21, 2015 at 8:43 AM, Dirceu
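
The difference between asInstanceOf and toInt that Sandy points out can be seen without Spark. The sketch below is only an illustration: the row contents and the names CastVsParse, castToInt, viaCast and viaParse are hypothetical, and a plain Seq[Any] stands in for a row of the SchemaRDD.

```scala
import scala.util.Try

object CastVsParse {
  // Hypothetical row whose second column holds the text "42", not a number.
  val row: Seq[Any] = Seq("id-7", "42")

  // asInstanceOf does not convert the value; it only asserts its runtime type.
  // For a String this throws ClassCastException when the value is unboxed.
  def castToInt(v: Any): Int = v.asInstanceOf[Int]

  def viaCast(r: Seq[Any]): Try[Int] = Try(castToInt(r(1)))

  // toInt parses the characters of the String and produces a real Int.
  def viaParse(r: Seq[Any]): Try[Int] = Try(r(1).toString.toInt)
}
```

Here viaCast(row) comes back as a Failure wrapping a ClassCastException, while viaParse(row) succeeds. The same distinction applies inside map(f => ...) on an RDD whenever the column is actually stored as a String.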

Re: Standardized Spark dev environment

2015-01-21 Thread Patrick Wendell
Yep, I think it's only useful (and likely to be maintained) if we actually use this on Jenkins. So that was my proposal. Basically give people a docker file so they can understand exactly what versions of everything we use for our reference build. And if they don't want to use docker directly,

Issue with repartition and cache

2015-01-21 Thread Dirceu Semighini Filho
Hi guys, has anyone seen something like this? I have a training set, and when I repartition it and then call cache, it throws a ClassCastException when I try to execute anything that accesses it. val rep120 = train.repartition(120) val cached120 = rep120.cache cached120.map(f =

SPARK-5267 : Add a streaming module to ingest Apache Camel Messages from a configured endpoints

2015-01-21 Thread Stephen Brewin
Hi All Any thoughts, comments or questions regarding the proposal outlined at https://issues.apache.org/jira/browse/SPARK-5267? Cheers Steve

Test suites in the python wrapper of kmeans failing

2015-01-21 Thread Meethu Mathew
Hi, The test suites in the Kmeans class in clustering.py have not been updated to take the seed value and hence are failing. Shall I make the changes and submit them along with my PR (Python API for Gaussian Mixture Model), or create a JIRA? Regards, Meethu

Re: Test suites in the python wrapper of kmeans failing

2015-01-21 Thread Meethu Mathew
Hi, Sorry, it was my mistake. My code was not properly built. Regards, Meethu On Thursday 22 January 2015 10:39 AM, Meethu Mathew wrote: Hi, The test suites in the Kmeans class in clustering.py have not been updated to take the seed value and

Re: Standardized Spark dev environment

2015-01-21 Thread Patrick Wendell
If the goal is a reproducible test environment then I think that is what Jenkins is. Granted you can only ask it for a test. But presumably you get the same result if you start from the same VM image as Jenkins and run the same steps. But the issue is when users can't reproduce Jenkins