Re: KMeans with large clusters Java Heap Space

2015-01-30 Thread derrickburns
bytes (8 bytes per double). You should try my new generalized k-means clustering package, https://github.com/derrickburns/generalized-kmeans-clustering, which works on high-dimensional sparse data. You will want to use the RandomIndexing embedding: def sparseTrain(raw: RDD[Vector

spark challenge: zip with next???

2015-01-29 Thread derrickburns
Here is a Spark challenge for you! I have a data set where each entry has a date. I would like to identify gaps in the dates longer than a given length. For example, if the data were log entries, then the gaps would tell me when I was missing log data for long periods of time. What is the
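
The "zip with next" idea in the challenge above can be sketched on a local list. This is a minimal illustration in plain Python, not a Spark solution and not taken from the thread: the function name find_gaps, the sample timestamps, and the 3-day threshold are all hypothetical. On an RDD the pairing would also have to handle neighbours that straddle partition boundaries, which this sketch ignores.

```python
from datetime import datetime, timedelta


def find_gaps(dates, min_gap):
    """Return (start, end) pairs of consecutive dates more than min_gap apart."""
    ordered = sorted(dates)
    # Pair each date with its successor ("zip with next").
    pairs = zip(ordered, ordered[1:])
    return [(a, b) for a, b in pairs if b - a > min_gap]


# Hypothetical log timestamps with one long hole between Jan 2 and Jan 10.
log_times = [
    datetime(2015, 1, 10),
    datetime(2015, 1, 1),
    datetime(2015, 1, 2),
    datetime(2015, 1, 11),
]
gaps = find_gaps(log_times, timedelta(days=3))
```

Sorting first means the input order does not matter; each reported pair marks the last entry before a gap and the first entry after it.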

Announcement: Generalized K-Means Clustering on Spark

2015-01-25 Thread derrickburns
This project generalizes the Spark MLlib K-Means clusterer to support clustering of dense or sparse, low- or high-dimensional data using distance functions defined by Bregman divergences. https://github.com/derrickburns/generalized-kmeans-clustering
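
For readers unfamiliar with the term, a Bregman divergence is built from a convex function F as D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>. The sketch below is plain Python written for illustration only; it is not the package's API, and every function name in it is hypothetical. It shows two standard instances: F(x) = ||x||^2 gives squared Euclidean distance, and F(x) = sum of x_i log x_i gives the (generalized) Kullback-Leibler divergence.

```python
import math


def bregman(f, grad_f, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_f(q), p, q))
    return f(p) - f(q) - inner


# F(x) = ||x||^2 yields squared Euclidean distance.
def sq_norm(x):
    return sum(v * v for v in x)


def sq_norm_grad(x):
    return [2.0 * v for v in x]


# F(x) = sum x_i log x_i yields generalized KL divergence (needs x_i > 0).
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


p = [0.2, 0.8]
q = [0.5, 0.5]
d_euclid = bregman(sq_norm, sq_norm_grad, p, q)
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)
```

Swapping the pair (F, grad F) changes the distance function without touching the rest of the computation, which is the sense in which one clustering loop can serve many divergences.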