Re: K Means Clustering Explanation
Hi Matt, unfortunately I have no code pointer at hand, but I will sketch how to accomplish this via the API; it should at least help you get started (a minimal code sketch is appended at the end of this message).
1) ETL + vectorization (I assume your feature vector is named "features").
2) You run a clustering algorithm (say KMeans: https://spark.apache.org/docs/2.2.0/ml-clustering.html); the "transform" method of the fitted model will add an extra column named "prediction", so that each row in the original dataframe now has a cluster id associated with it (you can control the name of the prediction column, as shown in the example; I assume the default).
3) You run a DT (https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#decision-tree-classifier) and specify .setLabelCol("prediction") and .setFeaturesCol("features") so that the output of KMeans (the cluster id) is used as the class label by the classification algorithm, which again uses the feature vector (always stored in the "features" column).
4) For a start, you can visualize the decision tree with the "toDebugString" method (note that you get indexed feature names, not the original names; check this for an idea of how to convert the indexes back: https://stackoverflow.com/questions/36122559/how-to-map-variable-names-to-features-after-pipeline).
The easiest insight you can get from the classifier is "feature importance", which gives you an approximate idea of which are the most relevant features used to classify. Otherwise you can inspect the model programmatically or manually, but you have to define precisely what you want to look at (coverage, precision, recall etc.) and rank the tree leaves accordingly (using the impurity stats of each node, for instance).
Hth,
Alessandro
On 2 March 2018 at 15:42, Matt Hicks wrote: > Thanks Alessandro and Christoph. I appreciate the feedback, but I'm still > having issues determining how to actually accomplish this with the API. > > Can anyone point me to an example in code showing how to accomplish this? > > > > On Fri, Mar 2, 2018 2:37 AM, Alessandro Solimando > alessandro.solima...@gmail.com wrote: > >> Hi Matt, >> similarly to what Christoph does, I first derive the cluster id for the >> elements of my original dataset, and then I use a classification algorithm >> (cluster ids being the classes here). >> >> For this method to be useful you need a "human-readable" model, >> tree-based models are generally a good choice (e.g., Decision Tree). >> >> However, since those models tend to be verbose, you still need a way to >> summarize them to facilitate readability (there must be some literature on >> this topic, although I have no pointers to provide). >> >> Hth, >> Alessandro >> >> >> >> >> >> On 1 March 2018 at 21:59, Christoph Brücke wrote: >> >> Hi Matt, >> >> I see. You could use the trained model to predict the cluster id for each >> training point. Now you should be able to create a dataset with your >> original input data and the associated cluster id for each data point in >> the input data. Now you can group this dataset by cluster id and aggregate >> over the original 5 features. E.g., get the mean for numerical data or the >> value that occurs the most for categorical data. >> >> The exact aggregation is use-case dependent. >> >> I hope this helps, >> Christoph >> >> Am 01.03.2018 21:40 schrieb "Matt Hicks" : >> >> Thanks for the response Christoph.
>> >> I'm converting large amounts of data into clustering training data and I'm >> just having a hard time reasoning about reversing the clusters (in code) >> back to the original format to properly understand the dominant values in >> each cluster. >> >> For example, if I have five fields of data and I've trained ten clusters >> of data, I'd like to output the five fields of data as represented by each >> of the ten clusters. >> >> >> >> On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote: >> >> Hi Matt, >> >> the clusters are defined by their centroids / cluster centers. All the >> points belonging to a certain cluster are closer to it than to the >> centroids of any other cluster. >> >> What I typically do is to convert the cluster centers back to the >> original input format or, if that is not possible, use the point nearest to >> the cluster center and use this as a representation of the whole cluster. >> >> Can you be a little bit more specific about your use-case? >> >> Best, >> Christoph >> >> Am 01.03.2018 20:53 schrieb "Matt Hicks" : >> >> I'm using K Means clustering for a project right now, and it's working >> very well. However, I'd like to determine from the clusters what >> distinguishing information defines each cluster so I can explain the "reasons" >> data fits into a specific cluster. >> >> Is there a proper way to do this in Spark ML? >> >> >>
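A minimal PySpark sketch of steps 2)-4) above. This is a sketch only: it assumes a DataFrame named "dataset" that already has a vector column called "features"; the number of clusters (10, as in Matt's example) and the variable names are illustrative.

from pyspark.ml.clustering import KMeans
from pyspark.ml.classification import DecisionTreeClassifier

# 2) cluster the data; transform() on the fitted model appends the "prediction" column
kmeans = KMeans(k=10, featuresCol="features", predictionCol="prediction")
clustered = kmeans.fit(dataset).transform(dataset)
# the cluster id comes back as an integer column; cast it so it can serve as a class label
clustered = clustered.withColumn("prediction", clustered["prediction"].cast("double"))

# 3) train a decision tree that predicts the cluster id from the same feature vector
dt = DecisionTreeClassifier(labelCol="prediction", featuresCol="features")
dt_model = dt.fit(clustered)

# 4) inspect the tree (feature names are index-based) and the per-feature importances
print(dt_model.toDebugString)
print(dt_model.featureImportances)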
Re: K Means Clustering Explanation
Thanks Alessandro and Christoph. I appreciate the feedback, but I'm still having issues determining how to actually accomplish this with the API. Can anyone point me to an example in code showing how to accomplish this? On Fri, Mar 2, 2018 2:37 AM, Alessandro Solimando alessandro.solima...@gmail.com wrote: Hi Matt, similarly to what Christoph does, I first derive the cluster id for the elements of my original dataset, and then I use a classification algorithm (cluster ids being the classes here). For this method to be useful you need a "human-readable" model, tree-based models are generally a good choice (e.g., Decision Tree). However, since those models tend to be verbose, you still need a way to summarize them to facilitate readability (there must be some literature on this topic, although I have no pointers to provide). Hth, Alessandro On 1 March 2018 at 21:59, Christoph Brücke wrote: Hi Matt, I see. You could use the trained model to predict the cluster id for each training point. Now you should be able to create a dataset with your original input data and the associated cluster id for each data point in the input data. Now you can group this dataset by cluster id and aggregate over the original 5 features. E.g., get the mean for numerical data or the value that occurs the most for categorical data. The exact aggregation is use-case dependent. I hope this helps, Christoph Am 01.03.2018 21:40 schrieb "Matt Hicks" : Thanks for the response Christoph. I'm converting large amounts of data into clustering training data and I'm just having a hard time reasoning about reversing the clusters (in code) back to the original format to properly understand the dominant values in each cluster. For example, if I have five fields of data and I've trained ten clusters of data, I'd like to output the five fields of data as represented by each of the ten clusters. On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote: Hi Matt, the clusters are defined by their centroids / cluster centers. All the points belonging to a certain cluster are closer to it than to the centroids of any other cluster. What I typically do is to convert the cluster centers back to the original input format or, if that is not possible, use the point nearest to the cluster center and use this as a representation of the whole cluster. Can you be a little bit more specific about your use-case? Best, Christoph Am 01.03.2018 20:53 schrieb "Matt Hicks" : I'm using K Means clustering for a project right now, and it's working very well. However, I'd like to determine from the clusters what distinguishing information defines each cluster so I can explain the "reasons" data fits into a specific cluster. Is there a proper way to do this in Spark ML?
Re: K Means Clustering Explanation
Hi Matt, similarly to what Christoph does, I first derive the cluster id for the elements of my original dataset, and then I use a classification algorithm (cluster ids being the classes here). For this method to be useful you need a "human-readable" model, tree-based models are generally a good choice (e.g., Decision Tree). However, since those models tend to be verbose, you still need a way to summarize them to facilitate readability (there must be some literature on this topic, although I have no pointers to provide). Hth, Alessandro On 1 March 2018 at 21:59, Christoph Brücke wrote: > Hi Matt, > > I see. You could use the trained model to predict the cluster id for each > training point. Now you should be able to create a dataset with your > original input data and the associated cluster id for each data point in > the input data. Now you can group this dataset by cluster id and aggregate > over the original 5 features. E.g., get the mean for numerical data or the > value that occurs the most for categorical data. > > The exact aggregation is use-case dependent. > > I hope this helps, > Christoph > > Am 01.03.2018 21:40 schrieb "Matt Hicks" : > > Thanks for the response Christoph. > > I'm converting large amounts of data into clustering training data and I'm just > having a hard time reasoning about reversing the clusters (in code) back to > the original format to properly understand the dominant values in each > cluster. > > For example, if I have five fields of data and I've trained ten clusters > of data, I'd like to output the five fields of data as represented by each > of the ten clusters. > > > > On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote: > >> Hi Matt, >> >> the clusters are defined by their centroids / cluster centers. All the >> points belonging to a certain cluster are closer to it than to the >> centroids of any other cluster. >> >> What I typically do is to convert the cluster centers back to the >> original input format or, if that is not possible, use the point nearest to >> the cluster center and use this as a representation of the whole cluster. >> >> Can you be a little bit more specific about your use-case? >> >> Best, >> Christoph >> >> Am 01.03.2018 20:53 schrieb "Matt Hicks" : >> >> I'm using K Means clustering for a project right now, and it's working >> very well. However, I'd like to determine from the clusters what >> distinguishing information defines each cluster so I can explain the "reasons" >> data fits into a specific cluster. >> >> Is there a proper way to do this in Spark ML? >> >> >
Re: K Means Clustering Explanation
Hi Matt, I see. You could use the trained model to predict the cluster id for each training point. Now you should be able to create a dataset with your original input data and the associated cluster id for each data point in the input data. Now you can group this dataset by cluster id and aggregate over the original 5 features. E.g., get the mean for numerical data or the value that occurs the most for categorical data (a short sketch of such an aggregation is appended at the end of this message). The exact aggregation is use-case dependent. I hope this helps, Christoph Am 01.03.2018 21:40 schrieb "Matt Hicks" : Thanks for the response Christoph. I'm converting large amounts of data into clustering training data and I'm just having a hard time reasoning about reversing the clusters (in code) back to the original format to properly understand the dominant values in each cluster. For example, if I have five fields of data and I've trained ten clusters of data, I'd like to output the five fields of data as represented by each of the ten clusters. On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote: > Hi Matt, > > the clusters are defined by their centroids / cluster centers. All the > points belonging to a certain cluster are closer to it than to the > centroids of any other cluster. > > What I typically do is to convert the cluster centers back to the original > input format or, if that is not possible, use the point nearest to the > cluster center and use this as a representation of the whole cluster. > > Can you be a little bit more specific about your use-case? > > Best, > Christoph > > Am 01.03.2018 20:53 schrieb "Matt Hicks" : > > I'm using K Means clustering for a project right now, and it's working > very well. However, I'd like to determine from the clusters what > distinguishing information defines each cluster so I can explain the "reasons" > data fits into a specific cluster. > > Is there a proper way to do this in Spark ML? > >
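A rough PySpark sketch of that per-cluster aggregation. This is a sketch only: "clustered" is assumed to be a DataFrame carrying the cluster id in a "prediction" column, and "age" (numeric) and "country" (categorical) are illustrative column names.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# mean of a numeric feature, per cluster
clustered.groupBy("prediction").agg(F.avg("age").alias("avg_age")).show()

# most frequent value of a categorical feature, per cluster
w = Window.partitionBy("prediction").orderBy(F.desc("count"))
(clustered.groupBy("prediction", "country").count()
    .withColumn("rank", F.row_number().over(w))
    .filter("rank = 1")
    .show())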
K Means Clustering Explanation
I'm using K Means clustering for a project right now, and it's working very well. However, I'd like to determine from the clusters what distinguishing information defines each cluster so I can explain the "reasons" data fits into a specific cluster. Is there a proper way to do this in Spark ML?
Re: K means clustering in spark
Hi Anjali, The main output of KMeansModel is clusterCenters, which is an Array[Vector]. It has k elements, where k is the number of clusters, and each element is the center of the corresponding cluster. Yanbo 2015-12-31 12:52 GMT+08:00 : > Hi, > > I am trying to use kmeans for clustering in spark using python. I > implemented it on the data set that ships with Spark. It's a 3*4 matrix. > > Can anybody please help me with how the data should be oriented for > kmeans? > Also, how can I find out what the clusters and their members are? > > Thanks > Anjali > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
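A small end-to-end PySpark example of the above, with made-up toy data (it assumes an existing SparkContext "sc"; each RDD element is one data point, i.e. one row of features):

from numpy import array
from pyspark.mllib.clustering import KMeans

# four points with three features each (rows are points, columns are features)
data = sc.parallelize([
    array([0.0, 0.0, 0.0]),
    array([0.1, 0.1, 0.1]),
    array([9.0, 9.0, 9.0]),
    array([9.1, 9.1, 9.1]),
])

model = KMeans.train(data, k=2, maxIterations=10)

print(model.clusterCenters)             # k centers, one per cluster
assignments = model.predict(data)       # cluster index for every input point
print(data.zip(assignments).collect())  # (point, cluster id) pairs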
K means clustering in spark
Hi, I am trying to use kmeans for clustering in spark using python. I implemented it on the data set that ships with Spark. It's a 3*4 matrix. Can anybody please help me with how the data should be oriented for kmeans? Also, how can I find out what the clusters and their members are? Thanks Anjali - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Distance Calculation in Spark K means clustering
Hi all, I am currently working on a K means clustering project. I want to get the distance of each data point to its cluster center after building the K means model. Currently I get the cluster centers of each data point by sending the JavaRDD which includes all the data points to the K means predict function. It then returns a JavaRDD which consists of the cluster index of each data point. Then I convert those JavaRDDs to lists using the collect function and use them to calculate Euclidean distances. But since this process involves a collect function, it seems to be rather time consuming. Is there any more efficient way to calculate the distance of each data point to its cluster center? And also I want to know the distance measure that the Spark K means algorithm uses to build the model. Is it Euclidean distance or squared Euclidean distance? Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Distance-Calculation-in-Spark-K-means-clustering-tp24516.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
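One way to avoid the collect is to compute the distances on the executors, since the model only carries the k cluster centers. Below is a rough PySpark sketch under that assumption ("points" is assumed to be an RDD of numpy feature arrays; names and k are illustrative; the Java/Scala API offers the same predict/clusterCenters building blocks). MLlib's k-means optimizes the sum of squared Euclidean distances (what computeCost reports), so plain Euclidean distance is used in the example.

import numpy as np
from pyspark.mllib.clustering import KMeans

model = KMeans.train(points, 10, maxIterations=20)
centers = [np.asarray(c) for c in model.clusterCenters]   # only k small vectors

def dist_to_own_center(p):
    i = model.predict(p)   # nearest-center lookup, evaluated locally on the executor
    return float(np.linalg.norm(np.asarray(p) - centers[i]))   # Euclidean distance

distances = points.map(dist_to_own_center)   # stays distributed; no collect() needed
print(distances.take(5))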
Spark Taking too long on K-means clustering
Hi everyone, I am trying to run the KDD data set - basically chapter 5 of the Advanced Analytics with Spark book. The data set is 789MB, but Spark is taking some 3 to 4 hours. Is this normal behaviour, or is some tuning required? The server RAM is 32 GB, but we can only give 4 GB RAM to Java on 64 bit Ubuntu. Please guide. Thanks
Re: Settings for K-Means Clustering in Mlib for large data set
non_nan_indices] >>> > dictionary=dict(zip(non_nan_indices[0],non_nan_values)) >>> > return Vectors.sparse (len(A),dictionary) >>> > >>> > X=[convert_into_sparse_vector(A) for A in complete_dataframe.values ] >>> > sc=SparkContext(appName="parallel_kmeans") >>> > data=sc.parallelize(X,10) >>> > model = KMeans.train(data, 1000, initializationMode="k-means||") >>> > >>> > where complete_dataframe is a pandas data frame that has my data. >>> > >>> > I get the error: Py4JNetworkError: An error occurred while trying to >>> > connect >>> > to the Java server. >>> > >>> > The error trace is as follows: >>> > >>> >> Exception happened during >>> >> processing of request from ('127.0.0.1', 41360) Traceback (most recent >>> >> call last): File "/usr/lib64/python2.6/SocketServer.py", line 283, >>> >> in _handle_request_noblock >>> >> self.process_request(request, client_address) File >>> >> "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request >>> >> self.finish_request(request, client_address) File >>> >> "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request >>> >> self.RequestHandlerClass(request, client_address, self) File >>> >> "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__ >>> >> self.handle() File "/root/spark/python/pyspark/accumulators.py", >>> >> line 235, in handle >>> >> num_updates = read_int(self.rfile) File >>> >> "/root/spark/python/pyspark/serializers.py", line 544, in read_int >>> >> raise EOFError EOFError >>> >> >>> >> >>> >> >>> >> --- >>> >> Py4JNetworkError Traceback (most recent call >>> >> last) in () >>> >> > 1 model = KMeans.train(data, 1000, >>> >> initializationMode="k-means||") >>> >> >>> >> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k, >>> >> maxIterations, runs, initializationMode, seed, initializationSteps, >>> >> epsilon) >>> >> 134 """Train a k-means clustering model.""" >>> >> 135 model = callMLlibFunc("trainKMeansModel", >>> >> rdd.map(_convert_to_vector), k, maxIterations, >>> >> --> 136 runs, initializationMode, seed, >>> >> initializationSteps, epsilon) >>> >> 137 centers = callJavaFunc(rdd.context, >>> >> model.clusterCenters) >>> >> 138 return KMeansModel([c.toArray() for c in centers]) >>> >> >>> >> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name, >>> >> *args) >>> >> 126 sc = SparkContext._active_spark_context >>> >> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name) >>> >> --> 128 return callJavaFunc(sc, api, *args) >>> >> 129 >>> >> 130 >>> >> >>> >> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, >>> >> *args) >>> >> 119 """ Call Java Function """ >>> >> 120 args = [_py2java(sc, a) for a in args] >>> >> --> 121 return _java2py(sc, func(*args)) >>> >> 122 >>> >> 123 >>> >> >>> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >>> >> __call__(self, *args) >>> >> 534 END_COMMAND_PART >>> >> 535 >>> >> --> 536 answer = self.gateway_client.send_command(command) >>> >> 537 return_value = get_return_value(answer, >>> >> self.gateway_client, >>> >> 538 self.target_id, self.name) >>> >> >>> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >>> >> send_command(self, command, retry) >>> >> 367 if retry: >>> >>
Re: Settings for K-Means Clustering in Mlib for large data set
processing of request from ('127.0.0.1', 41360) Traceback (most recent >> >> call last): File "/usr/lib64/python2.6/SocketServer.py", line 283, >> >> in _handle_request_noblock >> >> self.process_request(request, client_address) File >> >> "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request >> >> self.finish_request(request, client_address) File >> >> "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request >> >> self.RequestHandlerClass(request, client_address, self) File >> >> "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__ >> >> self.handle() File "/root/spark/python/pyspark/accumulators.py", >> >> line 235, in handle >> >> num_updates = read_int(self.rfile) File >> >> "/root/spark/python/pyspark/serializers.py", line 544, in read_int >> >> raise EOFError EOFError >> >> >> >> >> >> >> --- >> >> Py4JNetworkError Traceback (most recent call >> >> last) in () >> >> > 1 model = KMeans.train(data, 1000, >> initializationMode="k-means||") >> >> >> >> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k, >> >> maxIterations, runs, initializationMode, seed, initializationSteps, >> >> epsilon) >> >> 134 """Train a k-means clustering model.""" >> >> 135 model = callMLlibFunc("trainKMeansModel", >> >> rdd.map(_convert_to_vector), k, maxIterations, >> >> --> 136 runs, initializationMode, seed, >> >> initializationSteps, epsilon) >> >> 137 centers = callJavaFunc(rdd.context, >> model.clusterCenters) >> >> 138 return KMeansModel([c.toArray() for c in centers]) >> >> >> >> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name, >> >> *args) >> >> 126 sc = SparkContext._active_spark_context >> >> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name) >> >> --> 128 return callJavaFunc(sc, api, *args) >> >> 129 >> >> 130 >> >> >> >> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, >> >> *args) >> >> 119 """ Call Java Function """ >> >> 120 args = [_py2java(sc, a) for a in args] >> >> --> 121 return _java2py(sc, func(*args)) >> >> 122 >> >> 123 >> >> >> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >> >> __call__(self, *args) >> >> 534 END_COMMAND_PART >> >> 535 >> >> --> 536 answer = self.gateway_client.send_command(command) >> >> 537 return_value = get_return_value(answer, >> >> self.gateway_client, >> >> 538 self.target_id, self.name) >> >> >> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >> >> send_command(self, command, retry) >> >> 367 if retry: >> >> 368 #print_exc() >> >> --> 369 response = self.send_command(command) >> >> 370 else: >> >> 371 response = ERROR >> >> >> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >> >> send_command(self, command, retry) >> >> 360 the Py4J protocol. 
>> >> 361 """ >> >> --> 362 connection = self._get_connection() >> >> 363 try: >> >> 364 response = connection.send_command(command) >> >> >> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >> >> _get_connection(self) >> >> 316 connection = self.deque.pop() >> >> 317 except Exception: >> >> --> 318 connection = self._create_connection() >> >> 319 return connection >> >> 320 >> >> >> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >> >> _create_connection(self) >> >> 323 connection = GatewayConnection(self.address, self.port, >> >> 324 self.auto_close, self.gateway_property) >> >> --> 325 connection.start() >> >> 326 return connection >> >> 327 >> >> >> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >> >> start(self) >> >> 430 'server' >> >> 431 logger.exception(msg) >> >> --> 432 raise Py4JNetworkError(msg) >> >> 433 >> >> 434 def close(self): >> >> >> >> Py4JNetworkError: An error occurred while trying to connect to the >> >> Java server >> > >> > >> > Is there any specific setting that I am missing , that causes this >> error? >> > >> > Thanks and Regards, >> > Rogers Jeffrey L >> > >
Re: Settings for K-Means Clustering in Mlib for large data set
I am submitting the application from a python notebook. I am launching pyspark as follows: SPARK_PUBLIC_DNS=ec2-54-165-202-17.compute-1.amazonaws.com SPARK_WORKER_CORES=8 SPARK_WORKER_MEMORY=15g SPARK_MEM=30g OUR_JAVA_MEM=30g SPARK_DAEMON_JAVA_OPTS="-XX:MaxPermSize=30g -Xms30g -Xmx30g" IPYTHON=1 PYSPARK_PYTHON=/usr/bin/python SPARK_PRINT_LAUNCH_COMMAND=1 ./spark/bin/pyspark --master spark:// 54.165.202.17.compute-1.amazonaws.com:7077 --deploy-mode client I guess I should be adding another extra argument --conf "spark.driver.memory=15g" . Is that correct? Regards, Rogers Jeffrey L On Thu, Jun 18, 2015 at 7:50 PM, Xiangrui Meng wrote: > With 80,000 features and 1000 clusters, you need 80,000,000 doubles to > store the cluster centers. That is ~600MB. If there are 10 partitions, > you might need 6GB on the driver to collect updates from workers. I > guess the driver died. Did you specify driver memory with > spark-submit? -Xiangrui > > On Thu, Jun 18, 2015 at 12:22 PM, Rogers Jeffrey > wrote: > > Hi All, > > > > I am trying to run KMeans clustering on a large data set with 12,000 > points > > and 80,000 dimensions. I have a spark cluster in Ec2 stand alone mode > with > > 8 workers running on 2 slaves with 160 GB Ram and 40 VCPU. > > > > My Code is as Follows: > > > > def convert_into_sparse_vector(A): > > non_nan_indices=np.nonzero(~np.isnan(A) ) > > non_nan_values=A[non_nan_indices] > > dictionary=dict(zip(non_nan_indices[0],non_nan_values)) > > return Vectors.sparse (len(A),dictionary) > > > > X=[convert_into_sparse_vector(A) for A in complete_dataframe.values ] > > sc=SparkContext(appName="parallel_kmeans") > > data=sc.parallelize(X,10) > > model = KMeans.train(data, 1000, initializationMode="k-means||") > > > > where complete_dataframe is a pandas data frame that has my data. > > > > I get the error: Py4JNetworkError: An error occurred while trying to > connect > > to the Java server. 
> > > > The error trace is as follows: > > > >> Exception happened during > >> processing of request from ('127.0.0.1', 41360) Traceback (most recent > >> call last): File "/usr/lib64/python2.6/SocketServer.py", line 283, > >> in _handle_request_noblock > >> self.process_request(request, client_address) File > >> "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request > >> self.finish_request(request, client_address) File > >> "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request > >> self.RequestHandlerClass(request, client_address, self) File > >> "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__ > >> self.handle() File "/root/spark/python/pyspark/accumulators.py", > >> line 235, in handle > >> num_updates = read_int(self.rfile) File > >> "/root/spark/python/pyspark/serializers.py", line 544, in read_int > >> raise EOFError EOFError > >> > >> > >> > --- > >> Py4JNetworkError Traceback (most recent call > >> last) in () > >> > 1 model = KMeans.train(data, 1000, initializationMode="k-means||") > >> > >> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k, > >> maxIterations, runs, initializationMode, seed, initializationSteps, > >> epsilon) > >> 134 """Train a k-means clustering model.""" > >> 135 model = callMLlibFunc("trainKMeansModel", > >> rdd.map(_convert_to_vector), k, maxIterations, > >> --> 136 runs, initializationMode, seed, > >> initializationSteps, epsilon) > >> 137 centers = callJavaFunc(rdd.context, > model.clusterCenters) > >> 138 return KMeansModel([c.toArray() for c in centers]) > >> > >> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name, > >> *args) > >> 126 sc = SparkContext._active_spark_context > >> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name) > >> --> 128 return callJavaFunc(sc, api, *args) > >> 129 > >> 130 > >> > >> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, > >> *args) > >> 119 "
Re: Settings for K-Means Clustering in Mlib for large data set
With 80,000 features and 1000 clusters, you need 80,000,000 doubles to store the cluster centers. That is ~600MB. If there are 10 partitions, you might need 6GB on the driver to collect updates from workers. I guess the driver died. Did you specify driver memory with spark-submit? -Xiangrui On Thu, Jun 18, 2015 at 12:22 PM, Rogers Jeffrey wrote: > Hi All, > > I am trying to run KMeans clustering on a large data set with 12,000 points > and 80,000 dimensions. I have a spark cluster in Ec2 stand alone mode with > 8 workers running on 2 slaves with 160 GB Ram and 40 VCPU. > > My Code is as Follows: > > def convert_into_sparse_vector(A): > non_nan_indices=np.nonzero(~np.isnan(A) ) > non_nan_values=A[non_nan_indices] > dictionary=dict(zip(non_nan_indices[0],non_nan_values)) > return Vectors.sparse (len(A),dictionary) > > X=[convert_into_sparse_vector(A) for A in complete_dataframe.values ] > sc=SparkContext(appName="parallel_kmeans") > data=sc.parallelize(X,10) > model = KMeans.train(data, 1000, initializationMode="k-means||") > > where complete_dataframe is a pandas data frame that has my data. > > I get the error: Py4JNetworkError: An error occurred while trying to connect > to the Java server. > > The error trace is as follows: > >> Exception happened during >> processing of request from ('127.0.0.1', 41360) Traceback (most recent >> call last): File "/usr/lib64/python2.6/SocketServer.py", line 283, >> in _handle_request_noblock >> self.process_request(request, client_address) File >> "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request >> self.finish_request(request, client_address) File >> "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request >> self.RequestHandlerClass(request, client_address, self) File >> "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__ >> self.handle() File "/root/spark/python/pyspark/accumulators.py", >> line 235, in handle >> num_updates = read_int(self.rfile) File >> "/root/spark/python/pyspark/serializers.py", line 544, in read_int >> raise EOFError EOFError >> >> >> --- >> Py4JNetworkError Traceback (most recent call >> last) in () >> > 1 model = KMeans.train(data, 1000, initializationMode="k-means||") >> >> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k, >> maxIterations, runs, initializationMode, seed, initializationSteps, >> epsilon) >> 134 """Train a k-means clustering model.""" >> 135 model = callMLlibFunc("trainKMeansModel", >> rdd.map(_convert_to_vector), k, maxIterations, >> --> 136 runs, initializationMode, seed, >> initializationSteps, epsilon) >> 137 centers = callJavaFunc(rdd.context, model.clusterCenters) >> 138 return KMeansModel([c.toArray() for c in centers]) >> >> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name, >> *args) >> 126 sc = SparkContext._active_spark_context >> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name) >> --> 128 return callJavaFunc(sc, api, *args) >> 129 >> 130 >> >> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, >> *args) >> 119 """ Call Java Function """ >> 120 args = [_py2java(sc, a) for a in args] >> --> 121 return _java2py(sc, func(*args)) >> 122 >> 123 >> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >> __call__(self, *args) >> 534 END_COMMAND_PART >> 535 >> --> 536 answer = self.gateway_client.send_command(command) >> 537 return_value = get_return_value(answer, >> self.gateway_client, >> 538 self.target_id, self.name) >> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >> 
send_command(self, command, retry) >> 367 if retry: >> 368 #print_exc() >> --> 369 response = self.send_command(command) >> 370 else: >> 371 response = ERROR >> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in >>
Settings for K-Means Clustering in Mlib for large data set
Hi All, I am trying to run KMeans clustering on a large data set with 12,000 points and 80,000 dimensions. I have a spark cluster in Ec2 stand alone mode with 8 workers running on 2 slaves with 160 GB Ram and 40 VCPU. My Code is as Follows: def convert_into_sparse_vector(A): non_nan_indices=np.nonzero(~np.isnan(A) ) non_nan_values=A[non_nan_indices] dictionary=dict(zip(non_nan_indices[0],non_nan_values)) return Vectors.sparse (len(A),dictionary) X=[convert_into_sparse_vector(A) for A in complete_dataframe.values ] sc=SparkContext(appName="parallel_kmeans") data=sc.parallelize(X,10) model = KMeans.train(data, 1000, initializationMode="k-means||") where complete_dataframe is a pandas data frame that has my data. I get the error: *Py4JNetworkError: An error occurred while trying to connect to the **Java server.* The error trace is as follows: > Exception happened during processing > of request from ('127.0.0.1', 41360) Traceback (most recent > call last): File "/usr/lib64/python2.6/SocketServer.py", line 283, > in _handle_request_noblock > self.process_request(request, client_address) File > "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request > self.finish_request(request, client_address) File > "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request > self.RequestHandlerClass(request, client_address, self) File > "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__ > self.handle() File "/root/spark/python/pyspark/accumulators.py", line > 235, in handle > num_updates = read_int(self.rfile) File > "/root/spark/python/pyspark/serializers.py", line 544, in read_int > raise EOFError EOFError > > --- > Py4JNetworkError Traceback (most recent call > last) in () > > 1 model = KMeans.train(data, 1000, initializationMode="k-means||") > > /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k, > maxIterations, runs, initializationMode, seed, initializationSteps, > epsilon) > 134 """Train a k-means clustering model.""" > 135 model = callMLlibFunc("trainKMeansModel", > rdd.map(_convert_to_vector), k, maxIterations, > --> 136 runs, initializationMode, seed, > initializationSteps, epsilon) > 137 centers = callJavaFunc(rdd.context, model.clusterCenters) > 138 return KMeansModel([c.toArray() for c in centers]) > > /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name, > *args) > 126 sc = SparkContext._active_spark_context > 127 api = getattr(sc._jvm.PythonMLLibAPI(), name) > --> 128 return callJavaFunc(sc, api, *args) > 129 > 130 > > /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, > *args) > 119 """ Call Java Function """ > 120 args = [_py2java(sc, a) for a in args] > --> 121 return _java2py(sc, func(*args)) > 122 > 123 > > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 534 END_COMMAND_PART > 535 > --> 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > 538 self.target_id, self.name) > > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > send_command(self, command, retry) > 367 if retry: > 368 #print_exc() > --> 369 response = self.send_command(command) > 370 else: > 371 response = ERROR > > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > send_command(self, command, retry) > 360 the Py4J protocol. 
> 361 """ > --> 362 connection = self._get_connection() > 363 try: > 364 response = connection.send_command(command) > > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > _get_connection(self) > 316 connection = self.deque.pop() > 317 except Exception: > --> 318 connection = self._create_connection() > 319 return connection > 320 > > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > _create_connection(self) > 323 connection = GatewayConnection(se
Announcement: Generalized K-Means Clustering on Spark
This project generalizes the Spark MLLIB K-Means clusterer to support clustering of dense or sparse, low or high dimensional data using distance functions defined by Bregman divergences. https://github.com/derrickburns/generalized-kmeans-clustering -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Announcement-Generalized-K-Means-Clustering-on-Spark-tp21363.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: k-means clustering
Pre-processing is a major part of the workload before training the model. MLlib provides TF-IDF calculation, StandardScaler and Normalizer, which are essential for preprocessing and a great help to model training. Take a look at this: http://spark.apache.org/docs/latest/mllib-feature-extraction.html 2014-11-21 7:18 GMT+08:00 Jun Yang : > Guys, > > As to the questions of pre-processing, you could just migrate your logic > to Spark before using K-means. > > I only used Scala on Spark, and haven't used the Python binding on Spark, but > I think the basic steps must be the same. > > BTW, if your data set is big with huge sparse dimension feature vectors, > K-Means may not work as well as you expect. And I think this is still > the optimization direction of Spark MLLib. > > On Wed, Nov 19, 2014 at 2:21 PM, amin mohebbi > wrote: > >> Hi there, >> >> I would like to do "text clustering" using k-means and Spark on a >> massive dataset. As you know, before running the k-means, I have to do >> pre-processing methods such as TFIDF and NLTK on my big dataset. The >> following is my code in python : >> >> if __name__ == '__main__': # Cluster a bunch of text documents. import re >> import sys k = 6 vocab = {} xs = [] ns=[] cat=[] filename='2013-01.csv' >> with open(filename, newline='') as f: try: newsreader = csv.reader(f) for >> row in newsreader: ns.append(row[3]) cat.append(row[4]) except csv.Error >> as e: sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, >> e)) remove_spl_char_regex = re.compile('[%s]' % >> re.escape(string.punctuation)) # regex to remove special characters >> remove_num = re.compile('[\d]+') #nltk.download() stop_words= >> nltk.corpus.stopwords.words('english') for a in ns: x = defaultdict(float >> ) a1 = a.strip().lower() a2 = remove_spl_char_regex.sub(" ",a1) # >> Remove special characters a3 = remove_num.sub("", a2) #Remove numbers #Remove >> stop words words = a3.split() filter_stop_words = [w for w in words if >> not w in stop_words] stemed = [PorterStemmer().stem_word(w) for w in >> filter_stop_words] ws=sorted(stemed) #ws=re.findall(r"\w+", a1) for w in >> ws: vocab.setdefault(w, len(vocab)) x[vocab[w]] += 1 xs.append(x.items()) >> >> Can anyone explain to me how can I do the pre-processing step, before >> running the k-means using spark. >> >> >> Best Regards >> >> ... >> >> Amin Mohebbi >> >> PhD candidate in Software Engineering >> at university of Malaysia >> >> Tel : +60 18 2040 017 >> >> >> >> E-Mail : tp025...@ex.apiit.edu.my >> >> amin_...@me.com >> > > > > -- > yangjun...@gmail.com > http://hi.baidu.com/yjpro >
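As a concrete illustration of that doc page, here is a minimal PySpark sketch of TF-IDF vectorization ahead of k-means. This is a sketch only: "docs" is assumed to be an RDD where each element is the list of tokens of one document (e.g. the stemmed words produced by the pre-processing above), and k = 6 simply mirrors the original code.

from pyspark.mllib.feature import HashingTF, IDF, Normalizer
from pyspark.mllib.clustering import KMeans

tf = HashingTF(numFeatures=1 << 18).transform(docs)   # term-frequency vectors
tf.cache()                                            # reused by the IDF pass below
tfidf = IDF().fit(tf).transform(tf)                   # TF-IDF weighted vectors
normalized = Normalizer().transform(tfidf)            # L2-normalize each vector

model = KMeans.train(normalized, 6, maxIterations=20)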
Re: K-means clustering
There is a simple example here: https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py . You can take advantage of sparsity by computing the distance via inner products: http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2 -Xiangrui On Tue, Nov 25, 2014 at 2:39 AM, amin mohebbi wrote: > I have generated a sparse matrix by python, which has the size of > 4000*174000 (.pkl), the following is a small part of this matrix : > > (0, 45) 1 > (0, 413) 1 > (0, 445) 1 > (0, 107) 4 > (0, 80) 2 > (0, 352) 1 > (0, 157) 1 > (0, 191) 1 > (0, 315) 1 > (0, 395) 4 > (0, 282) 3 > (0, 184) 1 > (0, 403) 1 > (0, 169) 1 > (0, 267) 1 > (0, 148) 1 > (0, 449) 1 > (0, 241) 1 > (0, 303) 1 > (0, 364) 1 > (0, 257) 1 > (0, 372) 1 > (0, 73) 1 > (0, 64) 1 > (0, 427) 1 > : : > (2, 399) 1 > (2, 277) 1 > (2, 229) 1 > (2, 255) 1 > (2, 409) 1 > (2, 355) 1 > (2, 391) 1 > (2, 28) 1 > (2, 384) 1 > (2, 86) 1 > (2, 285) 2 > (2, 166) 1 > (2, 165) 1 > (2, 419) 1 > (2, 367) 2 > (2, 133) 1 > (2, 61) 1 > (2, 434) 1 > (2, 51) 1 > (2, 423) 1 > (2, 398) 1 > (2, 438) 1 > (2, 389) 1 > (2, 26) 1 > (2, 455) 1 > > I am new in Spark and would like to cluster this matrix by k-means > algorithm. Can anyone explain to me what kind of problems I might be faced. > Please note that I do not want to use Mllib and would like to write my own > k-means. > Best Regards > > ... > > Amin Mohebbi > > PhD candidate in Software Engineering > at university of Malaysia > > Tel : +60 18 2040 017 > > > > E-Mail : tp025...@ex.apiit.edu.my > > amin_...@me.com - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
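To illustrate the inner-product trick mentioned above: the squared distance can be expanded so that, with precomputed norms, only the non-zero entries of a sparse point contribute to the computation. A small sketch (the dimension and indices below are made up to mimic the matrix in the question):

import numpy as np
from pyspark.mllib.linalg import Vectors

# ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * <x, c>
def sq_dist(x, x_norm2, c, c_norm2):
    return x_norm2 + c_norm2 - 2.0 * x.dot(c)

x = Vectors.sparse(174000, {45: 1.0, 107: 4.0, 413: 1.0})   # one sparse row
c = np.zeros(174000)                                        # a dense candidate center
x_norm2 = float(np.dot(x.values, x.values))                 # uses only the stored non-zeros
print(sq_dist(x, x_norm2, c, float(np.dot(c, c))))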
K-means clustering
I have generated a sparse matrix by python, which has the size of 4000*174000 (.pkl), the following is a small part of this matrix : (0, 45) 1 (0, 413) 1 (0, 445) 1 (0, 107) 4 (0, 80) 2 (0, 352) 1 (0, 157) 1 (0, 191) 1 (0, 315) 1 (0, 395) 4 (0, 282) 3 (0, 184) 1 (0, 403) 1 (0, 169) 1 (0, 267) 1 (0, 148) 1 (0, 449) 1 (0, 241) 1 (0, 303) 1 (0, 364) 1 (0, 257) 1 (0, 372) 1 (0, 73) 1 (0, 64) 1 (0, 427) 1 : : (2, 399) 1 (2, 277) 1 (2, 229) 1 (2, 255) 1 (2, 409) 1 (2, 355) 1 (2, 391) 1 (2, 28) 1 (2, 384) 1 (2, 86) 1 (2, 285) 2 (2, 166) 1 (2, 165) 1 (2, 419) 1 (2, 367) 2 (2, 133) 1 (2, 61) 1 (2, 434) 1 (2, 51) 1 (2, 423) 1 (2, 398) 1 (2, 438) 1 (2, 389) 1 (2, 26) 1 (2, 455) 1 I am new in Spark and would like to cluster this matrix by k-means algorithm. Can anyone explain to me what kind of problems I might be faced. Please note that I do not want to use Mllib and would like to write my own k-means. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia Tel : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
Re: k-means clustering
Guys, As to the question of pre-processing, you could just migrate your logic to Spark before using K-means. I only used Scala on Spark, and haven't used the Python binding on Spark, but I think the basic steps must be the same. BTW, if your data set is big with huge sparse dimension feature vectors, K-Means may not work as well as you expect. And I think this is still the optimization direction of Spark MLLib. On Wed, Nov 19, 2014 at 2:21 PM, amin mohebbi wrote: > Hi there, > > I would like to do "text clustering" using k-means and Spark on a massive > dataset. As you know, before running the k-means, I have to do > pre-processing methods such as TFIDF and NLTK on my big dataset. The > following is my code in python : > > if __name__ == '__main__': # Cluster a bunch of text documents. import re > import sys k = 6 vocab = {} xs = [] ns=[] cat=[] filename='2013-01.csv' > with open(filename, newline='') as f: try: newsreader = csv.reader(f) for > row in newsreader: ns.append(row[3]) cat.append(row[4]) except csv.Error > as e: sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, > e)) remove_spl_char_regex = re.compile('[%s]' % > re.escape(string.punctuation)) # regex to remove special characters > remove_num = re.compile('[\d]+') #nltk.download() stop_words= > nltk.corpus.stopwords.words('english') for a in ns: x = defaultdict(float) > a1 = a.strip().lower() a2 = remove_spl_char_regex.sub(" ",a1) # Remove > special characters a3 = remove_num.sub("", a2) #Remove numbers #Remove > stop words words = a3.split() filter_stop_words = [w for w in words if not > w in stop_words] stemed = [PorterStemmer().stem_word(w) for w in > filter_stop_words] ws=sorted(stemed) #ws=re.findall(r"\w+", a1) for w in > ws: vocab.setdefault(w, len(vocab)) x[vocab[w]] += 1 xs.append(x.items()) > > Can anyone explain to me how can I do the pre-processing step, before > running the k-means using spark. > > > Best Regards > > ... > > Amin Mohebbi > > PhD candidate in Software Engineering > at university of Malaysia > > Tel : +60 18 2040 017 > > > > E-Mail : tp025...@ex.apiit.edu.my > > amin_...@me.com > -- yangjun...@gmail.com http://hi.baidu.com/yjpro
k-means clustering
Hi there, I would like to do "text clustering" using k-means and Spark on a massive dataset. As you know, before running the k-means, I have to do pre-processing methods such as TFIDF and NLTK on my big dataset. The following is my code in python :

if __name__ == '__main__':
    # Cluster a bunch of text documents.
    import re
    import sys

    k = 6
    vocab = {}
    xs = []
    ns = []
    cat = []
    filename = '2013-01.csv'
    with open(filename, newline='') as f:
        try:
            newsreader = csv.reader(f)
            for row in newsreader:
                ns.append(row[3])
                cat.append(row[4])
        except csv.Error as e:
            sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))

    remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))  # regex to remove special characters
    remove_num = re.compile('[\d]+')
    #nltk.download()
    stop_words = nltk.corpus.stopwords.words('english')

    for a in ns:
        x = defaultdict(float)

        a1 = a.strip().lower()
        a2 = remove_spl_char_regex.sub(" ", a1)  # Remove special characters
        a3 = remove_num.sub("", a2)  # Remove numbers
        # Remove stop words
        words = a3.split()
        filter_stop_words = [w for w in words if not w in stop_words]
        stemed = [PorterStemmer().stem_word(w) for w in filter_stop_words]
        ws = sorted(stemed)

        #ws = re.findall(r"\w+", a1)
        for w in ws:
            vocab.setdefault(w, len(vocab))
            x[vocab[w]] += 1
        xs.append(x.items())

Can anyone explain to me how can I do the pre-processing step, before running the k-means using spark. Best Regards ... Amin Mohebbi PhD candidate in Software Engineering at university of Malaysia Tel : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my amin_...@me.com
Re: Categorical Features for K-Means Clustering
Yeah - another vote here to do what's called One-Hot encoding, just convert the single categorical feature into N columns, where N is the number of distinct values of that feature, with a single one and all the other features/columns set to zero. On Tue, Sep 16, 2014 at 2:16 PM, Sean Owen wrote: > I think it's on the table but not yet merged? > https://issues.apache.org/jira/browse/SPARK-1216 > > On Tue, Sep 16, 2014 at 10:04 PM, st553 wrote: > > Does MLlib provide utility functions to do this kind of encoding? > > > > > > > > -- > > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html > > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > > > - > > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > > For additional commands, e-mail: user-h...@spark.apache.org > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
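A tiny sketch of that 1-of-N (one-hot) idea in Python; the category values are purely illustrative:

# a categorical feature with three distinct values becomes three 0/1 columns
categories = ["red", "green", "blue"]
index = {v: i for i, v in enumerate(categories)}

def one_hot(value):
    encoded = [0.0] * len(categories)
    encoded[index[value]] = 1.0   # a single 1, everything else stays 0
    return encoded

print(one_hot("green"))   # [0.0, 1.0, 0.0]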
Re: Categorical Features for K-Means Clustering
I think it's on the table but not yet merged? https://issues.apache.org/jira/browse/SPARK-1216 On Tue, Sep 16, 2014 at 10:04 PM, st553 wrote: > Does MLlib provide utility functions to do this kind of encoding? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Categorical Features for K-Means Clustering
Does MLlib provide utility functions to do this kind of encoding? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Speeding up K-Means Clustering
Please try val parsedData3 = data3.repartition(12).map(_.split("\t").map(_.toDouble)).cache() and check the storage and driver/executor memory in the WebUI. Make sure the data is fully cached. -Xiangrui On Thu, Jul 17, 2014 at 5:09 AM, Ravishankar Rajagopalan wrote: > Hi Xiangrui, > > Yes I am using Spark v0.9 and am not running it in local mode. > > I did the memory setting using "export SPARK_MEM=4G" before starting the > Spark instance. > > Also previously, I was starting it with -c 1 but changed it to -c 12 since > it is a 12 core machine. It did bring down the time taken to less than 200 > seconds from over 700 seconds. > > I am not sure how to repartition the data to match the CPU cores. How do I > do it? > > Thank you. > > Ravi > > > On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng wrote: >> >> Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g >> and repartition your data to match number of CPU cores such that the >> data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to storage >> the data. Make sure there are enough memory for caching. -Xiangrui >> >> On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan >> wrote: >> > I am trying to use MLlib for K-Means clustering on a data set with 1 >> > million >> > rows and 50 columns (all columns have double values) which is on HDFS >> > (raw >> > txt file is 28 MB) >> > >> > I initially tried the following: >> > >> > val data3 = sc.textFile("hdfs://...inputData.txt") >> > val parsedData3 = data3.map( _.split('\t').map(_.toDouble)) >> > val numIterations = 10 >> > val numClusters = 200 >> > val clusters = KMeans.train(parsedData3, numClusters, numIterations) >> > >> > This took me nearly 850 seconds. >> > >> > I tried using persist with MEMORY_ONLY option hoping that this would >> > significantly speed up the algorithm: >> > >> > val data3 = sc.textFile("hdfs://...inputData.txt") >> > val parsedData3 = data3.map( _.split('\t').map(_.toDouble)) >> > parsedData3.persist(MEMORY_ONLY) >> > val numIterations = 10 >> > val numClusters = 200 >> > val clusters = KMeans.train(parsedData3, numClusters, numIterations) >> > >> > This resulted in only a marginal improvement and took around 720 >> > seconds. >> > >> > Is there any other way to speed up the algorithm further? >> > >> > Thank you. >> > >> > Regards, >> > Ravi > >
Re: Speeding up K-Means Clustering
Hi Xiangrui, Yes I am using Spark v0.9 and am not running it in local mode. I did the memory setting using "export SPARK_MEM=4G" before starting the Spark instance. Also previously, I was starting it with -c 1 but changed it to -c 12 since it is a 12 core machine. It did bring down the time taken to less than 200 seconds from over 700 seconds. I am not sure how to repartition the data to match the CPU cores. How do I do it? Thank you. Ravi On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng wrote: > Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g > and repartition your data to match number of CPU cores such that the > data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to storage > the data. Make sure there are enough memory for caching. -Xiangrui > > On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan > wrote: > > I am trying to use MLlib for K-Means clustering on a data set with 1 > million > > rows and 50 columns (all columns have double values) which is on HDFS > (raw > > txt file is 28 MB) > > > > I initially tried the following: > > > > val data3 = sc.textFile("hdfs://...inputData.txt") > > val parsedData3 = data3.map( _.split('\t').map(_.toDouble)) > > val numIterations = 10 > > val numClusters = 200 > > val clusters = KMeans.train(parsedData3, numClusters, numIterations) > > > > This took me nearly 850 seconds. > > > > I tried using persist with MEMORY_ONLY option hoping that this would > > significantly speed up the algorithm: > > > > val data3 = sc.textFile("hdfs://...inputData.txt") > > val parsedData3 = data3.map( _.split('\t').map(_.toDouble)) > > parsedData3.persist(MEMORY_ONLY) > > val numIterations = 10 > > val numClusters = 200 > > val clusters = KMeans.train(parsedData3, numClusters, numIterations) > > > > This resulted in only a marginal improvement and took around 720 seconds. > > > > Is there any other way to speed up the algorithm further? > > > > Thank you. > > > > Regards, > > Ravi >
Re: Speeding up K-Means Clustering
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g and repartition your data to match number of CPU cores such that the data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to storage the data. Make sure there are enough memory for caching. -Xiangrui On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan wrote: > I am trying to use MLlib for K-Means clustering on a data set with 1 million > rows and 50 columns (all columns have double values) which is on HDFS (raw > txt file is 28 MB) > > I initially tried the following: > > val data3 = sc.textFile("hdfs://...inputData.txt") > val parsedData3 = data3.map( _.split('\t').map(_.toDouble)) > val numIterations = 10 > val numClusters = 200 > val clusters = KMeans.train(parsedData3, numClusters, numIterations) > > This took me nearly 850 seconds. > > I tried using persist with MEMORY_ONLY option hoping that this would > significantly speed up the algorithm: > > val data3 = sc.textFile("hdfs://...inputData.txt") > val parsedData3 = data3.map( _.split('\t').map(_.toDouble)) > parsedData3.persist(MEMORY_ONLY) > val numIterations = 10 > val numClusters = 200 > val clusters = KMeans.train(parsedData3, numClusters, numIterations) > > This resulted in only a marginal improvement and took around 720 seconds. > > Is there any other way to speed up the algorithm further? > > Thank you. > > Regards, > Ravi
Speeding up K-Means Clustering
I am trying to use MLlib for K-Means clustering on a data set with 1 million rows and 50 columns (all columns have double values) which is on HDFS (raw txt file is 28 MB) I initially tried the following: val data3 = sc.textFile("hdfs://...inputData.txt") val parsedData3 = data3.map( _.split('\t').map(_.toDouble)) val numIterations = 10 val numClusters = 200 val clusters = KMeans.train(parsedData3, numClusters, numIterations) This took me nearly 850 seconds. I tried using persist with MEMORY_ONLY option hoping that this would significantly speed up the algorithm: val data3 = sc.textFile("hdfs://...inputData.txt") val parsedData3 = data3.map( _.split('\t').map(_.toDouble)) parsedData3.persist(MEMORY_ONLY) val numIterations = 10 val numClusters = 200 val clusters = KMeans.train(parsedData3, numClusters, numIterations) This resulted in only a marginal improvement and took around 720 seconds. Is there any other way to speed up the algorithm further? Thank you. Regards, Ravi
Re: Categorical Features for K-Means Clustering
I see. So, basically, kind of like dummy variables like with regressions. Thanks, Sean. On Jul 11, 2014, at 10:11 AM, Sean Owen wrote: > Since you can't define your own distance function, you will need to > convert these to numeric dimensions. 1-of-n encoding can work OK, > depending on your use case. So a dimension that takes on 3 categorical > values, becomes 3 dimensions, of which all are 0 except one that has > value 1. > > On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan wrote: >> Hi Folks, >> >> Does any one have experience or recommendations on incorporating categorical >> features (attributes) into k-means clustering in Spark? In other words, I >> want to cluster on a set of attributes that include categorical variables. >> >> I know I could probably implement some custom code to parse and calculate my >> own similarity function, but I wanted to reach out before I did so. I’d >> also prefer to take advantage of the k-means\parallel initialization feature >> of the model in MLlib, so an MLlib-based implementation would be preferred. >> >> Thanks in advance. >> >> Best, >> >> -Wen signature.asc Description: Message signed with OpenPGP using GPGMail
Re: Categorical Features for K-Means Clustering
Since you can't define your own distance function, you will need to convert these to numeric dimensions. 1-of-n encoding can work OK, depending on your use case. So a dimension that takes on 3 categorical values, becomes 3 dimensions, of which all are 0 except one that has value 1. On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan wrote: > Hi Folks, > > Does any one have experience or recommendations on incorporating categorical > features (attributes) into k-means clustering in Spark? In other words, I > want to cluster on a set of attributes that include categorical variables. > > I know I could probably implement some custom code to parse and calculate my > own similarity function, but I wanted to reach out before I did so. I’d also > prefer to take advantage of the k-means\parallel initialization feature of > the model in MLlib, so an MLlib-based implementation would be preferred. > > Thanks in advance. > > Best, > > -Wen
Categorical Features for K-Means Clustering
Hi Folks, Does any one have experience or recommendations on incorporating categorical features (attributes) into k-means clustering in Spark? In other words, I want to cluster on a set of attributes that include categorical variables. I know I could probably implement some custom code to parse and calculate my own similarity function, but I wanted to reach out before I did so. I’d also prefer to take advantage of the k-means\parallel initialization feature of the model in MLlib, so an MLlib-based implementation would be preferred. Thanks in advance. Best, -Wen signature.asc Description: Message signed with OpenPGP using GPGMail