Re: K Means Clustering Explanation

2018-03-04 Thread Alessandro Solimando
. All the points belonging to a certain cluster are closer to its centroid than to the centroids of any other cluster. What I typically do is to convert the cluster centers back to the original input format or, if that is not possible, use the point nearest to the cluster center
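A minimal sketch of that nearest-point idea in Spark ML (assuming a fitted KMeansModel named `model` and a DataFrame `df` with a `features` column; "prediction" and "features" are the Spark ML default column names):

    import org.apache.spark.ml.clustering.KMeansModel
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.DataFrame

    // For each cluster, find the real data point closest to its center and
    // use it as a human-readable representative of the whole cluster.
    def representatives(model: KMeansModel, df: DataFrame): Map[Int, Vector] = {
      val centers = model.clusterCenters
      model.transform(df)                         // adds a "prediction" column
        .select("prediction", "features").rdd
        .map(r => (r.getInt(0), r.getAs[Vector](1)))
        .map { case (c, v) => (c, (v, Vectors.sqdist(v, centers(c)))) }
        .reduceByKey((a, b) => if (a._2 <= b._2) a else b)  // keep the nearest point per cluster
        .mapValues(_._1)
        .collect().toMap
    }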

Re: K Means Clustering Explanation

2018-03-02 Thread Matt Hicks
as a representation of the whole cluster. Can you be a little bit more specific about your use-case? Best, Christoph. On 01.03.2018 at 20:53, "Matt Hicks" <m...@outr.com> wrote: I'm using K Means clustering for a project right now, and it's working very well. However, I'd like to determine

Re: K Means Clustering Explanation

2018-03-02 Thread Alessandro Solimando
cluster. What I typically do is to convert the cluster centers back to the original input format or, if that is not possible, use the point nearest to the cluster center and use this as a representation of the whole cluster. Can you be a little bit more specific about your use-case?

Re: K Means Clustering Explanation

2018-03-01 Thread Christoph Brücke
representation of the whole cluster. Can you be a little bit more specific about your use-case? Best, Christoph. On 01.03.2018 at 20:53, "Matt Hicks" <m...@outr.com> wrote: I'm using K Means clustering for a project right now, and it's working very well

K Means Clustering Explanation

2018-03-01 Thread Matt Hicks
I'm using K Means clustering for a project right now, and it's working very well. However, I'd like to determine from the clusters which distinguishing information defines each cluster, so I can explain the "reasons" why data falls into a specific cluster. Is there a proper way to do this in Spark ML?
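Spark ML has no built-in "explain the cluster" call as far as the thread establishes; one common approximation (a sketch, not an official API; `featureNames` is a hypothetical mapping from vector index to column name) is to rank features by how far each center deviates from the mean of all centers:

    import org.apache.spark.ml.clustering.KMeansModel

    // The features where a center sits farthest from the average of all
    // centers are the ones that most "define" that cluster.
    def explain(model: KMeansModel, featureNames: Array[String]): Unit = {
      val centers = model.clusterCenters.map(_.toArray)
      val dims = centers.head.length
      val mean = Array.tabulate(dims)(j => centers.map(_(j)).sum / centers.length)
      centers.zipWithIndex.foreach { case (c, i) =>
        val top = (0 until dims).sortBy(j => -math.abs(c(j) - mean(j))).take(3)
        println(s"cluster $i defined by: " +
          top.map(j => s"${featureNames(j)}=${c(j)}").mkString(", "))
      }
    }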

Re: K means clustering in spark

2015-12-31 Thread Yanbo Liang
Hi Anjali, The main output of KMeansModel is clusterCenters, which is an Array[Vector]. It has k elements, where k is the number of clusters and each element is the center of the specified cluster. Yanbo 2015-12-31 12:52 GMT+08:00 : Hi, I am trying to use kmeans
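In Scala terms (a minimal sketch; `data` is an assumed RDD[Vector], and the PySpark API mirrors this):

    import org.apache.spark.mllib.clustering.KMeans

    // After training, clusterCenters has exactly k entries, one per cluster.
    val model = KMeans.train(data, 2, 20)  // k = 2, maxIterations = 20
    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      println(s"center of cluster $i: $center")
    }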

K means clustering in spark

2015-12-30 Thread anjali . gautam09
Hi, I am trying to use kmeans for clustering in Spark using Python. I implemented it on the data set which Spark ships with. It's a 3*4 matrix. Can anybody please help me with how the data should be oriented for kmeans? Also, how can I find out what the clusters and their members are? Thanks
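A sketch of the expected orientation, shown in Scala (the PySpark calls are analogous; `sc` is the SparkContext, values illustrative): each matrix row becomes one Vector, so a 3*4 matrix is an RDD of 3 four-dimensional points, and membership comes from predict:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // One observation per row, one feature per column.
    val rows = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0, 1.0, 2.0),
      Vectors.dense(5.0, 6.0, 1.0, 2.0),
      Vectors.dense(9.0, 9.0, 0.0, 1.0)))

    val model = KMeans.train(rows, 2, 20)

    // Membership: pair each point with its predicted cluster id.
    rows.map(v => (model.predict(v), v))
      .groupByKey()
      .collect()
      .foreach { case (c, vs) => println(s"cluster $c: ${vs.mkString(", ")}") }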

Distance Calculation in Spark K means clustering

2015-08-31 Thread ashensw
Hi all, I am currently working on a K means clustering project. I want to get the distances of each data point to its cluster center after building the K means model. Currently I get the cluster centers of each data point by sending the JavaRDD which includes all the data points to K means
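One way to compute this directly (a sketch against the mllib API; `model` and `data` are assumed to exist):

    import org.apache.spark.mllib.clustering.KMeansModel
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Euclidean distance from each point to the center of its assigned cluster.
    def distancesToCenter(model: KMeansModel, data: RDD[Vector]): RDD[Double] =
      data.map { v =>
        val center = model.clusterCenters(model.predict(v))
        math.sqrt(Vectors.sqdist(v, center))
      }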

Spark Taking too long on K-means clustering

2015-08-27 Thread masoom alam
Hi everyone, I am trying to run the KDD data set - basically chapter 5 of the Advanced Analytics with Spark book. The data set is 789 MB, but Spark is taking some 3 to 4 hours. Is this normal behaviour, or is some tuning required? The server RAM is 32 GB, but we can only give 4 GB RAM on 64 bit
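Hard to say without more detail, but a frequent cause is the input not being cached, so every k-means iteration re-reads and re-parses the file. A rough sketch of the usual first step, loosely following the book's parsing of the KDD data (path and parameters illustrative; `sc` is the SparkContext):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.storage.StorageLevel

    val parsed = sc.textFile("hdfs://.../kddcup.data").map { line =>
      val buffer = line.split(',').toBuffer
      buffer.remove(1, 3)              // drop the three categorical columns
      buffer.remove(buffer.length - 1) // drop the label
      Vectors.dense(buffer.map(_.toDouble).toArray)
    }.persist(StorageLevel.MEMORY_AND_DISK)

    parsed.count()  // materialize the cache before timing anything
    val model = KMeans.train(parsed, 100, 10)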

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-23 Thread Xiangrui Meng
/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k, maxIterations, runs, initializationMode, seed, initializationSteps, epsilon) 134 Train a k-means clustering model. 135 model = callMLlibFunc("trainKMeansModel", rdd.map(_convert_to_vector), k, maxIterations

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-19 Thread Rogers Jeffrey
, initializationMode, seed, initializationSteps, epsilon) 134 Train a k-means clustering model. 135 model = callMLlibFunc("trainKMeansModel", rdd.map(_convert_to_vector), k, maxIterations, --> 136 runs, initializationMode, seed

Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Rogers Jeffrey
in train(cls, rdd, k, maxIterations, runs, initializationMode, seed, initializationSteps, epsilon) 134 Train a k-means clustering model. 135 model = callMLlibFunc("trainKMeansModel", rdd.map(_convert_to_vector), k, maxIterations, --> 136 runs
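For reference, the call the traceback points at can also be written out with explicit settings; a Scala-side sketch (all values illustrative; `data` is assumed to be a cached RDD[Vector]):

    import org.apache.spark.mllib.clustering.KMeans

    val model = new KMeans()
      .setK(100)
      .setMaxIterations(20)
      .setInitializationMode(KMeans.K_MEANS_PARALLEL)  // "k-means||"
      .setEpsilon(1e-4)
      .setSeed(42L)
      .run(data)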

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Xiangrui Meng
, initializationMode="k-means||") /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k, maxIterations, runs, initializationMode, seed, initializationSteps, epsilon) 134 Train a k-means clustering model. 135 model = callMLlibFunc("trainKMeansModel", rdd.map

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Rogers Jeffrey
) 134 Train a k-means clustering model. 135 model = callMLlibFunc("trainKMeansModel", rdd.map(_convert_to_vector), k, maxIterations, --> 136 runs, initializationMode, seed, initializationSteps, epsilon) 137 centers

Announcement: Generalized K-Means Clustering on Spark

2015-01-25 Thread derrickburns
http://apache-spark-user-list.1001560.n3.nabble.com/Announcement-Generalized-K-Means-Clustering-on-Spark-tp21363.html

K-means clustering

2014-11-25 Thread amin mohebbi
I have generated a sparse matrix in Python, which has the size of 4000*174000 (.pkl); the following is a small part of this matrix: (0, 45) 1 (0, 413) 1 (0, 445) 1 (0, 107) 4 (0, 80) 2 (0, 352) 1 (0, 157) 1 (0, 191) 1 (0, 315) 1 (0, 395) 4 (0, 282) 3 (0, 184) 1 (0, 403) 1 (0,

Re: K-means clustering

2014-11-25 Thread Xiangrui Meng
There is a simple example here: https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py. You can take advantage of sparsity by computing the distance via inner products: http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2 -Xiangrui On Tue, Nov 25, 2014 at 2:39
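In Scala terms, sparse input needs no special handling beyond building SparseVectors; a sketch using the matrix entries quoted above (`sc` assumed, k and iterations illustrative):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // One 174000-dimensional row with a handful of non-zeros, stored sparsely.
    val row0 = Vectors.sparse(174000,
      Seq((45, 1.0), (413, 1.0), (445, 1.0), (107, 4.0)))

    val data = sc.parallelize(Seq(row0 /* , more rows ... */))
    val model = KMeans.train(data, 10, 20)  // accepts dense and sparse vectors alike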

Re: k-means clustering

2014-11-25 Thread Yanbo Liang
Pre-processing is a major workload before training the model. MLlib provides TF-IDF calculation, StandardScaler, and Normalizer, which are essential for preprocessing and a great help to model training. Take a look at http://spark.apache.org/docs/latest/mllib-feature-extraction.html
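A sketch of that pre-processing chain using the mllib feature transformers named above (`docs` is an assumed RDD[Seq[String]] of tokenized documents):

    import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val tf: RDD[Vector] = new HashingTF().transform(docs)
    tf.cache()                                  // IDF takes two passes over the data
    val tfidf = new IDF().fit(tf).transform(tf)
    val features = new Normalizer().transform(tfidf)  // unit-norm rows suit k-means on text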

Re: k-means clustering

2014-11-20 Thread Jun Yang
Guys, as to the question of pre-processing, you could just migrate your logic to Spark before using K-means. I have only used Scala on Spark and haven't used the Python binding, but I think the basic steps must be the same. BTW, if your data set is big with a huge, sparse feature dimension

k-means clustering

2014-11-18 Thread amin mohebbi
Hi there, I would like to do text clustering using k-means and Spark on a massive dataset. As you know, before running k-means, I have to apply pre-processing such as TF-IDF and NLTK on my big dataset. The following is my code in Python: if __name__ == '__main__': #

Re: Categorical Features for K-Means Clustering

2014-09-16 Thread st553
Does MLlib provide utility functions to do this kind of encoding?
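For what it's worth, later Spark releases added such utilities to spark.ml; a sketch with StringIndexer and OneHotEncoder (column names and `df` are illustrative; shown in the classic Spark 1.4-2.x form):

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // Index a string column, then expand the index into a one-hot vector.
    val indexer = new StringIndexer().setInputCol("color").setOutputCol("colorIndex")
    val indexed = indexer.fit(df).transform(df)

    val encoder = new OneHotEncoder().setInputCol("colorIndex").setOutputCol("colorVec")
    val encoded = encoder.transform(indexed)  // note: Spark 3 makes OneHotEncoder an estimator, so fit it first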

Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Sean Owen

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
at 1:48 AM, Ravishankar Rajagopalan viora...@gmail.com wrote: I am trying to use MLlib for K-Means clustering on a data set with 1 million rows and 50 columns (all columns have double values) which is on HDFS (raw txt file is 28 MB). I initially tried the following: val data3 = sc.textFile(hdfs

Re: Speeding up K-Means Clustering

2014-07-17 Thread Ravishankar Rajagopalan
trying to use MLlib for K-Means clustering on a data set with 1 million rows and 50 columns (all columns have double values) which is on HDFS (raw txt file is 28 MB). I initially tried the following: val data3 = sc.textFile(hdfs://...inputData.txt) val parsedData3 = data3.map( _.split('\t
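Completing that truncated snippet into the shape that usually fixes slow training (a reconstruction, not the poster's exact code; the cache() call is the key point):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data3 = sc.textFile("hdfs://.../inputData.txt")
    val parsedData3 = data3
      .map(_.split('\t').map(_.toDouble))
      .map(a => Vectors.dense(a))
      .cache()   // without this, every k-means iteration re-reads and re-parses HDFS

    val model = KMeans.train(parsedData3, 10, 20)  // k and iterations illustrative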

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
trying to use MLlib for K-Means clustering on a data set with 1 million rows and 50 columns (all columns have double values) which is on HDFS (raw txt file is 28 MB). I initially tried the following: val data3 = sc.textFile(hdfs://...inputData.txt) val parsedData3 = data3.map( _.split

Categorical Features for K-Means Clustering

2014-07-11 Thread Wen Phan
Hi Folks, Does anyone have experience or recommendations on incorporating categorical features (attributes) into k-means clustering in Spark? In other words, I want to cluster on a set of attributes that include categorical variables. I know I could probably implement some custom code

Re: Categorical Features for K-Means Clustering

2014-07-11 Thread Sean Owen
, 2014 at 3:07 PM, Wen Phan wen.p...@mac.com wrote: Hi Folks, Does anyone have experience or recommendations on incorporating categorical features (attributes) into k-means clustering in Spark? In other words, I want to cluster on a set of attributes that include categorical variables. I

Re: Categorical Features for K-Means Clustering

2014-07-11 Thread Wen Phan
categorical features (attributes) into k-means clustering in Spark? In other words, I want to cluster on a set of attributes that include categorical variables. I know I could probably implement some custom code to parse and calculate my own similarity function, but I wanted to reach out