MLlib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Hi Sparkers, I am trying to run MLlib KMeans on a large dataset (50+ GB of data) with a large K, but I've encountered the following issues: - The Spark driver runs out of memory and dies because collect gets called as part of KMeans, which loads all data back into the driver's memory. - At the …
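For context, here is a minimal sketch of the kind of MLlib KMeans job being described, based on the standard Spark example. The path, K, and iteration count are placeholders rather than values from the thread, and `sc` is the spark-shell SparkContext; switching the initialization mode away from the default k-means|| is just one knob that may reduce driver-side work for very large K.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// sc is the spark-shell SparkContext; the path is a placeholder.
// One sample per line, features separated by spaces.
val parsed = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Illustrative large K; "random" init avoids the heavier k-means|| step.
val model = new KMeans()
  .setK(1000)
  .setMaxIterations(20)
  .setInitializationMode("random")
  .run(parsed)

println(s"Within-set sum of squared errors: ${model.computeCost(parsed)}")
```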

Re: MLlib KMeans on large dataset issues

2015-04-29 Thread Jeetendra Gangele
How are you passing the feature vectors to K-means? Are they in 2-D space or a 1-D array? Did you try using Streaming KMeans? Will you be able to paste the code here? On 29 April 2015 at 17:23, Sam Stoelinga sammiest...@gmail.com wrote: Hi Sparkers, I am trying to run MLlib KMeans on a large dataset (50+ GB …
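For reference, a rough sketch of the StreamingKMeans API mentioned above (available in MLlib since Spark 1.2). The directory, batch interval, K, and dimensionality below are made up for illustration; the model is updated incrementally per mini-batch instead of training on the whole RDD at once.

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

// sc is the spark-shell SparkContext; interval and path are placeholders.
val ssc = new StreamingContext(sc, Seconds(60))

// Each new file line is one sample: space-separated features.
val trainingData = ssc.textFileStream("hdfs:///data/stream/train")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

val model = new StreamingKMeans()
  .setK(1000)                                  // illustrative K
  .setDecayFactor(1.0)                         // weight all batches equally
  .setRandomCenters(100, 0.0)                  // dim must match the feature count

model.trainOn(trainingData)
ssc.start()
ssc.awaitTermination()
```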

Re: MLlib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
I'm mostly using example code, see here: http://paste.openstack.org/show/211966/ The data has 799305 dimensions and is space-separated. Please note that the issues I'm seeing come from the Scala implementation, in my opinion, since they also occur when using the Python wrappers. On Wed, Apr 29, 2015 at 8:00 …
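A quick sanity check (my own sketch, not from the linked paste) that would have caught the row/column mix-up admitted in the next message: confirm that the RDD's row count matches the number of samples and that each vector's size matches the number of features per sample. `parsed` here refers to the assumed RDD[Vector] from the earlier sketch.

```scala
// count() should equal the number of samples; every vector's size
// should equal the number of features per sample.
val rowCount = parsed.count()
val dims = parsed.map(_.size).distinct().collect()
println(s"samples (rows) = $rowCount, feature sizes seen = ${dims.mkString(", ")}")
// Seeing 799305-dimensional vectors with only a handful of rows would suggest
// the file is laid out feature-per-line rather than sample-per-line.
```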

Re: MLlib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Guys, great feedback for pointing out my stupidity :D Rows and columns got intermixed, hence the weird results I was seeing. Ignore my previous issues; I will reformat my data first. On Wed, Apr 29, 2015 at 8:47 PM, Sam Stoelinga sammiest...@gmail.com wrote: I'm mostly using example code, see here: …
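If the data really was written feature-per-line (i.e. transposed), one way to flip it back to sample-per-row inside Spark, rather than rewriting the source file, is to go through a CoordinateMatrix. This is only a hedged sketch with an assumed path and layout, not what Sam actually did.

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Assumed layout: each line holds one feature across all samples, space-separated.
val entries = sc.textFile("hdfs:///data/features_transposed.txt")
  .zipWithIndex()                      // (line, featureIndex)
  .flatMap { case (line, featureIdx) =>
    line.split(' ').zipWithIndex.map { case (value, sampleIdx) =>
      MatrixEntry(featureIdx, sampleIdx.toLong, value.toDouble)
    }
  }

// Transpose so that rows become samples; the result is an RDD[Vector]
// with one sample per row, suitable for KMeans.run.
val samplesAsRows = new CoordinateMatrix(entries).transpose().toRowMatrix().rows
```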