Re: K Means Clustering Explanation

2018-03-04 Thread Alessandro Solimando
Hi Matt,
unfortunately I have no code pointer at hand.

I will sketch how to accomplish this via the API; it should at least help you
get started.

1) ETL + vectorization (I assume your feature vector to be named "features")

2) You run a clustering algorithm (say KMeans:
https://spark.apache.org/docs/2.2.0/ml-clustering.html); the model returned by
"fit" adds, via "transform", an extra column named "prediction", so that each
row of the original dataframe now has a cluster id associated with it (you can
control the name of the prediction column, as shown in the example; I assume
the default here)

3) You run a DT
(https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#decision-tree-classifier),
and specify

.setLabelCol("prediction")
.setFeaturesCol("features")


so that the output of KMeans (the cluster id) is used as the class label by
the classification algorithm, which again uses the feature vector (still
stored in the "features" column).

4) For a start, you can visualize the decision tree with the "toDebugString"
method (note that it reports feature indexes rather than the original feature
names; see this for an idea of how to map the indexes back:
https://stackoverflow.com/questions/36122559/how-to-map-variable-names-to-features-after-pipeline)

The easiest insight you can get from the classifier is its "feature
importance", which gives you an approximate idea of which features are most
relevant for the classification.

Otherwise you can inspect the model programmatically or manually, but you
have to define precisely what you want to look at (coverage, precision,
recall, etc.) and rank the tree leaves accordingly (using the impurity stats
of each node, for instance).
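
For illustration, a minimal PySpark sketch of steps 2-4 (untested; it assumes
the DataFrame from step 1 is named "df" and has a "features" vector column,
k=10 is arbitrary, and the Scala API is analogous):

from pyspark.ml.clustering import KMeans
from pyspark.ml.classification import DecisionTreeClassifier

# Step 2: cluster and attach the cluster id as a "prediction" column.
kmeans = KMeans(k=10, featuresCol="features", predictionCol="prediction")
clustered = kmeans.fit(df).transform(df)

# Step 3: learn a decision tree that predicts the cluster id from the same features.
dt = DecisionTreeClassifier(labelCol="prediction", featuresCol="features")
dt_model = dt.fit(clustered)

# Step 4: inspect the tree and the feature importances.
print(dt_model.toDebugString)       # features appear as indexes here (see the SO link above)
print(dt_model.featureImportances)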

Hth,
Alessandro

On 2 March 2018 at 15:42, Matt Hicks  wrote:

> Thanks Alessandro and Christoph.  I appreciate the feedback, but I'm still
> having issues determining how to actually accomplish this with the API.
>
> Can anyone point me to an example in code showing how to accomplish this?
>
>
>
> On Fri, Mar 2, 2018 2:37 AM, Alessandro Solimando
> alessandro.solima...@gmail.com wrote:
>
>> Hi Matt,
>> similarly to what Christoph does, I first derive the cluster id for the
>> elements of my original dataset, and then I use a classification algorithm
>> (cluster ids being the classes here).
>>
>> For this method to be useful you need a "human-readable" model,
>> tree-based models are generally a good choice (e.g., Decision Tree).
>>
>> However, since those models tend to be verbose, you still need a way to
>> summarize them to facilitate readability (there must be some literature on
>> this topic, although I have no pointers to provide).
>>
>> Hth,
>> Alessandro
>>
>>
>>
>>
>>
>> On 1 March 2018 at 21:59, Christoph Brücke  wrote:
>>
>> Hi Matt,
>>
>> I see. You could use the trained model to predict the cluster id for each
>> training point. Now you should be able to create a dataset with your
>> original input data and the associated cluster id for each data point in
>> the input data. Now you can group this dataset by cluster id and aggregate
>> over the original 5 features. E.g., get the mean for numerical data or the
>> value that occurs the most for categorical data.
>>
>> The exact aggregation is use-case dependent.
>>
>> I hope this helps,
>> Christoph
>>
>> On 01.03.2018 21:40, "Matt Hicks" wrote:
>>
>> Thanks for the response Christoph.
>>
>> I'm converting large amounts of data into clustering training and I'm
>> just having a hard time reasoning about reversing the clusters (in code)
>> back to the original format to properly understand the dominant values in
>> each cluster.
>>
>> For example, if I have five fields of data and I've trained ten clusters
>> of data I'd like to output the five fields of data as represented by each
>> of the ten clusters.
>>
>>
>>
>> On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote:
>>
>> Hi Matt,
>>
>> the clusters are defined by their centroids / cluster centers. All the
>> points belonging to a certain cluster are closer to its centroid than to
>> the centroids of any other cluster.
>>
>> What I typically do is convert the cluster centers back to the original
>> input format or, if that is not possible, use the point nearest to the
>> cluster center as a representation of the whole cluster.
>>
>> Can you be a little bit more specific about your use-case?
>>
>> Best,
>> Christoph
>>
>> On 01.03.2018 20:53, "Matt Hicks" wrote:
>>
>> I'm using K Means clustering for a project right now, and it's working
>> very well.  However, I'd like to determine from the clusters what
>> information distinctions define each cluster so I can explain the "reasons"
>> data fits into a specific cluster.
>>
>> Is there a proper way to do this in Spark ML?
>>
>>
>>
>>


Re: K Means Clustering Explanation

2018-03-02 Thread Matt Hicks
Thanks Alessandro and Christoph.  I appreciate the feedback, but I'm still
having issues determining how to actually accomplish this with the API.
Can anyone point me to an example in code showing how to accomplish this?  





On Fri, Mar 2, 2018 2:37 AM, Alessandro Solimando 
alessandro.solima...@gmail.com 
 wrote:
Hi Matt,
similarly to what Christoph does, I first derive the cluster id for the
elements of my original dataset, and then I use a classification algorithm
(cluster ids being the classes here).
For this method to be useful you need a "human-readable" model, tree-based
models are generally a good choice (e.g., Decision Tree).

However, since those models tend to be verbose, you still need a way to
summarize them to facilitate readability (there must be some literature on this
topic, although I have no pointers to provide).
Hth,
Alessandro



On 1 March 2018 at 21:59, Christoph Brücke   wrote:
Hi Matt,
I see. You could use the trained model to predict the cluster id for each
training point. Now you should be able to create a dataset with your original
input data and the associated cluster id for each data point in the input data.
Now you can group this dataset by cluster id and aggregate over the original 5
features. E.g., get the mean for numerical data or the value that occurs the
most for categorical data.
The exact aggregation is use-case dependent.
I hope this helps,
Christoph

On 01.03.2018 21:40, "Matt Hicks" wrote:
Thanks for the response Christoph.
I'm converting large amounts of data into clustering training and I'm just
having a hard time reasoning about reversing the clusters (in code) back to the
original format to properly understand the dominant values in each cluster.
For example, if I have five fields of data and I've trained ten clusters of data
I'd like to output the five fields of data as represented by each of the ten
clusters.  





On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com  wrote:
Hi Matt,
the clusters are defined by their centroids / cluster centers. All the points
belonging to a certain cluster are closer to its centroid than to the centroids
of any other cluster.
What I typically do is convert the cluster centers back to the original input
format or, if that is not possible, use the point nearest to the cluster center
as a representation of the whole cluster.
Can you be a little bit more specific about your use-case?
Best,
Christoph
On 01.03.2018 20:53, "Matt Hicks" wrote:
I'm using K Means clustering for a project right now, and it's working very
well.  However, I'd like to determine from the clusters what information
distinctions define each cluster so I can explain the "reasons" data fits into a
specific cluster.
Is there a proper way to do this in Spark ML?

Re: K Means Clustering Explanation

2018-03-02 Thread Alessandro Solimando
Hi Matt,
similarly to what Christoph does, I first derive the cluster id for the
elements of my original dataset, and then I use a classification algorithm
(cluster ids being the classes here).

For this method to be useful you need a "human-readable" model, tree-based
models are generally a good choice (e.g., Decision Tree).

However, since those models tend to be verbose, you still need a way to
summarize them to facilitate readability (there must be some literature on
this topic, although I have no pointers to provide).

Hth,
Alessandro





On 1 March 2018 at 21:59, Christoph Brücke  wrote:

> Hi Matt,
>
> I see. You could use the trained model to predict the cluster id for each
> training point. Now you should be able to create a dataset with your
> original input data and the associated cluster id for each data point in
> the input data. Now you can group this dataset by cluster id and aggregate
> over the original 5 features. E.g., get the mean for numerical data or the
> value that occurs the most for categorical data.
>
> The exact aggregation is use-case dependent.
>
> I hope this helps,
> Christoph
>
> On 01.03.2018 21:40, "Matt Hicks" wrote:
>
> Thanks for the response Christoph.
>
> I'm converting large amounts of data into clustering training and I'm just
> having a hard time reasoning about reversing the clusters (in code) back to
> the original format to properly understand the dominant values in each
> cluster.
>
> For example, if I have five fields of data and I've trained ten clusters
> of data I'd like to output the five fields of data as represented by each
> of the ten clusters.
>
>
>
> On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote:
>
>> Hi Matt,
>>
>> the clusters are defined by their centroids / cluster centers. All the
>> points belonging to a certain cluster are closer to its centroid than to
>> the centroids of any other cluster.
>>
>> What I typically do is convert the cluster centers back to the original
>> input format or, if that is not possible, use the point nearest to the
>> cluster center as a representation of the whole cluster.
>>
>> Can you be a little bit more specific about your use-case?
>>
>> Best,
>> Christoph
>>
>> On 01.03.2018 20:53, "Matt Hicks" wrote:
>>
>> I'm using K Means clustering for a project right now, and it's working
>> very well.  However, I'd like to determine from the clusters what
>> information distinctions define each cluster so I can explain the "reasons"
>> data fits into a specific cluster.
>>
>> Is there a proper way to do this in Spark ML?
>>
>>
>


Re: K Means Clustering Explanation

2018-03-01 Thread Christoph Brücke
Hi Matt,

I see. You could use the trained model to predict the cluster id for each
training point. Now you should be able to create a dataset with your
original input data and the associated cluster id for each data point in
the input data. Now you can group this dataset by cluster id and aggregate
over the original 5 features. E.g., get the mean for numerical data or the
value that occurs the most for categorical data.

The exact aggregation is use-case dependent.
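
For illustration, the grouping could look roughly like this in PySpark (a
sketch only; "clustered" stands for the dataframe holding the original columns
plus a "cluster_id" column, and "age" / "country" are made-up stand-ins for
your 5 fields):

from pyspark.sql import functions as F

# Per-cluster size and mean of a numeric field
per_cluster = (clustered.groupBy("cluster_id")
               .agg(F.count("*").alias("size"),
                    F.avg("age").alias("avg_age")))
per_cluster.show()

# Most frequent value of a categorical field within each cluster
(clustered.groupBy("cluster_id", "country").count()
          .orderBy("cluster_id", F.desc("count"))
          .show())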

I hope this helps,
Christoph

On 01.03.2018 21:40, "Matt Hicks" wrote:

Thanks for the response Christoph.

I'm converting large amounts of data into clustering training and I'm just
having a hard time reasoning about reversing the clusters (in code) back to
the original format to properly understand the dominant values in each
cluster.

For example, if I have five fields of data and I've trained ten clusters of
data I'd like to output the five fields of data as represented by each of
the ten clusters.



On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote:

> Hi Matt,
>
> the clusters are defined by their centroids / cluster centers. All the
> points belonging to a certain cluster are closer to its centroid than to
> the centroids of any other cluster.
>
> What I typically do is convert the cluster centers back to the original
> input format or, if that is not possible, use the point nearest to the
> cluster center as a representation of the whole cluster.
>
> Can you be a little bit more specific about your use-case?
>
> Best,
> Christoph
>
> On 01.03.2018 20:53, "Matt Hicks" wrote:
>
> I'm using K Means clustering for a project right now, and it's working
> very well.  However, I'd like to determine from the clusters what
> information distinctions define each cluster so I can explain the "reasons"
> data fits into a specific cluster.
>
> Is there a proper way to do this in Spark ML?
>
>


K Means Clustering Explanation

2018-03-01 Thread Matt Hicks
I'm using K Means clustering for a project right now, and it's working very
well.  However, I'd like to determine from the clusters what information
distinctions define each cluster so I can explain the "reasons" data fits into a
specific cluster.
Is there a proper way to do this in Spark ML?

Re: K means clustering in spark

2015-12-31 Thread Yanbo Liang
Hi Anjali,

The main output of KMeansModel is clusterCenters, which is an Array[Vector].
It has k elements, where k is the number of clusters, and each element is the
center of the corresponding cluster.
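
For example, a small PySpark sketch (toy data, pyspark.mllib API; the variable
names are made up):

from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans_example")

# Each element is one point, so a "3*4 matrix" is simply 3 points with 4 features.
points = sc.parallelize([array([0.0, 0.0, 0.0, 0.0]),
                         array([0.1, 0.2, 0.1, 0.0]),
                         array([9.0, 9.0, 9.0, 9.0])])

model = KMeans.train(points, 2, maxIterations=10)

print(model.clusterCenters)                             # the k center vectors
members = points.map(lambda p: (model.predict(p), p))   # (cluster id, point) pairs
print(members.groupByKey().mapValues(list).collect())   # members of each cluster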

Yanbo


2015-12-31 12:52 GMT+08:00 :

> Hi,
>
> I am trying to use k-means for clustering in Spark using Python. I
> implemented it on the data set that ships with Spark. It's a 3*4 matrix.
>
> Can anybody please help me with how the data should be oriented for
> k-means?
> Also, how can I find out what the clusters are and which members belong to each?
>
> Thanks
> Anjali
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


K means clustering in spark

2015-12-30 Thread anjali . gautam09
Hi,

I am trying to use k-means for clustering in Spark using Python. I implemented
it on the data set that ships with Spark. It's a 3*4 matrix.

Can anybody please help me with how the data should be oriented for k-means?
Also, how can I find out what the clusters are and which members belong to each?

Thanks 
Anjali


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Distance Calculation in Spark K means clustering

2015-08-30 Thread ashensw
Hi all,

I am currently working on a K-means clustering project. I want to get the
distance of each data point to its cluster center after building the K-means
model. Currently I get the cluster index of each data point by sending the
JavaRDD that includes all the data points to the K-means predict function,
which returns a JavaRDD consisting of the cluster index of each data point.
Then I convert those JavaRDDs to lists using the collect function and use them
to calculate the Euclidean distances. But since this process involves a
collect, it seems rather time consuming.

Is there any other efficient way to calculate the distance of each data point
to its cluster center?

And I also want to know which distance measure the Spark K-means algorithm
uses to build the model. Is it Euclidean distance or squared Euclidean
distance?

Thank you



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Distance-Calculation-in-Spark-K-means-clustering-tp24516.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Taking too long on K-means clustering

2015-08-27 Thread masoom alam
HI every one,

I am trying to run the KDD data set - basically chapter 5 of the Advanced
Analytics with Spark book. The data set is 789 MB, but Spark is taking
some 3 to 4 hours. Is this normal behaviour, or is some tuning required?
The server has 32 GB of RAM, but we can only give 4 GB of RAM to Java on
64-bit Ubuntu.

Please guide.

Thanks


Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-23 Thread Xiangrui Meng
non_nan_indices]
>>> > dictionary=dict(zip(non_nan_indices[0],non_nan_values))
>>> > return Vectors.sparse (len(A),dictionary)
>>> >
>>> > X=[convert_into_sparse_vector(A) for A in complete_dataframe.values ]
>>> > sc=SparkContext(appName="parallel_kmeans")
>>> > data=sc.parallelize(X,10)
>>> > model = KMeans.train(data, 1000, initializationMode="k-means||")
>>> >
>>> > where complete_dataframe is a pandas data frame that has my data.
>>> >
>>> > I get the error: Py4JNetworkError: An error occurred while trying to
>>> > connect
>>> > to the Java server.
>>> >
>>> > The error  trace is as follows:
>>> >
>>> >>  Exception happened during
>>> >> processing of request from ('127.0.0.1', 41360) Traceback (most recent
>>> >> call last):   File "/usr/lib64/python2.6/SocketServer.py", line 283,
>>> >> in _handle_request_noblock
>>> >> self.process_request(request, client_address)   File
>>> >> "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request
>>> >> self.finish_request(request, client_address)   File
>>> >> "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
>>> >> self.RequestHandlerClass(request, client_address, self)   File
>>> >> "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
>>> >> self.handle()   File "/root/spark/python/pyspark/accumulators.py",
>>> >> line 235, in handle
>>> >> num_updates = read_int(self.rfile)   File
>>> >> "/root/spark/python/pyspark/serializers.py", line 544, in read_int
>>> >> raise EOFError EOFError
>>> >> 
>>> >>
>>> >>
>>> >> ---
>>> >> Py4JNetworkError  Traceback (most recent call
>>> >> last)  in ()
>>> >> > 1 model = KMeans.train(data, 1000,
>>> >> initializationMode="k-means||")
>>> >>
>>> >> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k,
>>> >> maxIterations, runs, initializationMode, seed, initializationSteps,
>>> >> epsilon)
>>> >> 134 """Train a k-means clustering model."""
>>> >> 135 model = callMLlibFunc("trainKMeansModel",
>>> >> rdd.map(_convert_to_vector), k, maxIterations,
>>> >> --> 136   runs, initializationMode, seed,
>>> >> initializationSteps, epsilon)
>>> >> 137 centers = callJavaFunc(rdd.context,
>>> >> model.clusterCenters)
>>> >> 138 return KMeansModel([c.toArray() for c in centers])
>>> >>
>>> >> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name,
>>> >> *args)
>>> >> 126 sc = SparkContext._active_spark_context
>>> >> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
>>> >> --> 128 return callJavaFunc(sc, api, *args)
>>> >> 129
>>> >> 130
>>> >>
>>> >> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func,
>>> >> *args)
>>> >> 119 """ Call Java Function """
>>> >> 120 args = [_py2java(sc, a) for a in args]
>>> >> --> 121 return _java2py(sc, func(*args))
>>> >> 122
>>> >> 123
>>> >>
>>> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>>> >> __call__(self, *args)
>>> >> 534 END_COMMAND_PART
>>> >> 535
>>> >> --> 536 answer = self.gateway_client.send_command(command)
>>> >> 537 return_value = get_return_value(answer,
>>> >> self.gateway_client,
>>> >> 538 self.target_id, self.name)
>>> >>
>>> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>>> >> send_command(self, command, retry)
>>> >> 367 if retry:
>>> >>   

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-19 Thread Rogers Jeffrey
 processing of request from ('127.0.0.1', 41360) Traceback (most recent
>> >> call last):   File "/usr/lib64/python2.6/SocketServer.py", line 283,
>> >> in _handle_request_noblock
>> >> self.process_request(request, client_address)   File
>> >> "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request
>> >> self.finish_request(request, client_address)   File
>> >> "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
>> >> self.RequestHandlerClass(request, client_address, self)   File
>> >> "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
>> >> self.handle()   File "/root/spark/python/pyspark/accumulators.py",
>> >> line 235, in handle
>> >> num_updates = read_int(self.rfile)   File
>> >> "/root/spark/python/pyspark/serializers.py", line 544, in read_int
>> >> raise EOFError EOFError
>> >> 
>> >>
>> >>
>> ---
>> >> Py4JNetworkError  Traceback (most recent call
>> >> last)  in ()
>> >> > 1 model = KMeans.train(data, 1000,
>> initializationMode="k-means||")
>> >>
>> >> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k,
>> >> maxIterations, runs, initializationMode, seed, initializationSteps,
>> >> epsilon)
>> >> 134 """Train a k-means clustering model."""
>> >> 135 model = callMLlibFunc("trainKMeansModel",
>> >> rdd.map(_convert_to_vector), k, maxIterations,
>> >> --> 136   runs, initializationMode, seed,
>> >> initializationSteps, epsilon)
>> >> 137 centers = callJavaFunc(rdd.context,
>> model.clusterCenters)
>> >> 138 return KMeansModel([c.toArray() for c in centers])
>> >>
>> >> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name,
>> >> *args)
>> >> 126 sc = SparkContext._active_spark_context
>> >> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
>> >> --> 128 return callJavaFunc(sc, api, *args)
>> >> 129
>> >> 130
>> >>
>> >> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func,
>> >> *args)
>> >> 119 """ Call Java Function """
>> >> 120 args = [_py2java(sc, a) for a in args]
>> >> --> 121 return _java2py(sc, func(*args))
>> >> 122
>> >> 123
>> >>
>> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>> >> __call__(self, *args)
>> >> 534 END_COMMAND_PART
>> >> 535
>> >> --> 536 answer = self.gateway_client.send_command(command)
>> >> 537 return_value = get_return_value(answer,
>> >> self.gateway_client,
>> >> 538 self.target_id, self.name)
>> >>
>> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>> >> send_command(self, command, retry)
>> >> 367 if retry:
>> >> 368 #print_exc()
>> >> --> 369 response = self.send_command(command)
>> >> 370 else:
>> >> 371 response = ERROR
>> >>
>> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>> >> send_command(self, command, retry)
>> >> 360  the Py4J protocol.
>> >> 361 """
>> >> --> 362 connection = self._get_connection()
>> >> 363 try:
>> >> 364 response = connection.send_command(command)
>> >>
>> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>> >> _get_connection(self)
>> >> 316 connection = self.deque.pop()
>> >> 317 except Exception:
>> >> --> 318 connection = self._create_connection()
>> >> 319 return connection
>> >> 320
>> >>
>> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>> >> _create_connection(self)
>> >> 323 connection = GatewayConnection(self.address, self.port,
>> >> 324 self.auto_close, self.gateway_property)
>> >> --> 325 connection.start()
>> >> 326 return connection
>> >> 327
>> >>
>> >> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>> >> start(self)
>> >> 430 'server'
>> >> 431 logger.exception(msg)
>> >> --> 432 raise Py4JNetworkError(msg)
>> >> 433
>> >> 434 def close(self):
>> >>
>> >> Py4JNetworkError: An error occurred while trying to connect to the
>> >> Java server
>> >
>> >
>> > Is there any specific setting that I am missing , that causes this
>> error?
>> >
>> > Thanks and Regards,
>> > Rogers Jeffrey L
>>
>
>


Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Rogers Jeffrey
I am submitting the application from a python notebook. I am launching
pyspark as follows:

SPARK_PUBLIC_DNS=ec2-54-165-202-17.compute-1.amazonaws.com
SPARK_WORKER_CORES=8 SPARK_WORKER_MEMORY=15g SPARK_MEM=30g OUR_JAVA_MEM=30g
  SPARK_DAEMON_JAVA_OPTS="-XX:MaxPermSize=30g -Xms30g -Xmx30g" IPYTHON=1
PYSPARK_PYTHON=/usr/bin/python SPARK_PRINT_LAUNCH_COMMAND=1
 ./spark/bin/pyspark --master spark://54.165.202.17.compute-1.amazonaws.com:7077 --deploy-mode client

I guess I should be adding another argument, --conf
"spark.driver.memory=15g". Is that correct?

Regards,
Rogers Jeffrey L

On Thu, Jun 18, 2015 at 7:50 PM, Xiangrui Meng  wrote:

> With 80,000 features and 1000 clusters, you need 80,000,000 doubles to
> store the cluster centers. That is ~600MB. If there are 10 partitions,
> you might need 6GB on the driver to collect updates from workers. I
> guess the driver died. Did you specify driver memory with
> spark-submit? -Xiangrui
>
> On Thu, Jun 18, 2015 at 12:22 PM, Rogers Jeffrey
>  wrote:
> > Hi All,
> >
> > I am trying to run KMeans clustering on a large data set with 12,000
> points
> > and 80,000 dimensions.  I have a spark cluster in Ec2 stand alone mode
> with
> > 8  workers running on 2 slaves with 160 GB Ram and 40 VCPU.
> >
> > My Code is as Follows:
> >
> > def convert_into_sparse_vector(A):
> > non_nan_indices=np.nonzero(~np.isnan(A) )
> > non_nan_values=A[non_nan_indices]
> > dictionary=dict(zip(non_nan_indices[0],non_nan_values))
> > return Vectors.sparse (len(A),dictionary)
> >
> > X=[convert_into_sparse_vector(A) for A in complete_dataframe.values ]
> > sc=SparkContext(appName="parallel_kmeans")
> > data=sc.parallelize(X,10)
> > model = KMeans.train(data, 1000, initializationMode="k-means||")
> >
> > where complete_dataframe is a pandas data frame that has my data.
> >
> > I get the error: Py4JNetworkError: An error occurred while trying to
> connect
> > to the Java server.
> >
> > The error  trace is as follows:
> >
> >>  Exception happened during
> >> processing of request from ('127.0.0.1', 41360) Traceback (most recent
> >> call last):   File "/usr/lib64/python2.6/SocketServer.py", line 283,
> >> in _handle_request_noblock
> >> self.process_request(request, client_address)   File
> >> "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request
> >> self.finish_request(request, client_address)   File
> >> "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
> >> self.RequestHandlerClass(request, client_address, self)   File
> >> "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
> >> self.handle()   File "/root/spark/python/pyspark/accumulators.py",
> >> line 235, in handle
> >> num_updates = read_int(self.rfile)   File
> >> "/root/spark/python/pyspark/serializers.py", line 544, in read_int
> >> raise EOFError EOFError
> >> 
> >>
> >>
> ---
> >> Py4JNetworkError  Traceback (most recent call
> >> last)  in ()
> >> > 1 model = KMeans.train(data, 1000, initializationMode="k-means||")
> >>
> >> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k,
> >> maxIterations, runs, initializationMode, seed, initializationSteps,
> >> epsilon)
> >> 134 """Train a k-means clustering model."""
> >> 135 model = callMLlibFunc("trainKMeansModel",
> >> rdd.map(_convert_to_vector), k, maxIterations,
> >> --> 136   runs, initializationMode, seed,
> >> initializationSteps, epsilon)
> >> 137 centers = callJavaFunc(rdd.context,
> model.clusterCenters)
> >> 138 return KMeansModel([c.toArray() for c in centers])
> >>
> >> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name,
> >> *args)
> >> 126 sc = SparkContext._active_spark_context
> >> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
> >> --> 128 return callJavaFunc(sc, api, *args)
> >> 129
> >> 130
> >>
> >> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func,
> >> *args)
> >> 119 "

Re: Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Xiangrui Meng
With 80,000 features and 1000 clusters, you need 80,000,000 doubles to
store the cluster centers. That is ~600MB. If there are 10 partitions,
you might need 6GB on the driver to collect updates from workers. I
guess the driver died. Did you specify driver memory with
spark-submit? -Xiangrui
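
As a back-of-the-envelope check of that arithmetic (a small sketch using the
numbers quoted in this thread):

features = 80000            # dimensions
clusters = 1000             # k
bytes_per_double = 8

centers_bytes = features * clusters * bytes_per_double
print(centers_bytes / 1e6)                 # ~640 MB for one copy of the cluster centers

partitions = 10
print(partitions * centers_bytes / 1e9)    # ~6.4 GB of updates collected on the driver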

On Thu, Jun 18, 2015 at 12:22 PM, Rogers Jeffrey
 wrote:
> Hi All,
>
> I am trying to run KMeans clustering on a large data set with 12,000 points
> and 80,000 dimensions.  I have a spark cluster in Ec2 stand alone mode  with
> 8  workers running on 2 slaves with 160 GB Ram and 40 VCPU.
>
> My Code is as Follows:
>
> def convert_into_sparse_vector(A):
> non_nan_indices=np.nonzero(~np.isnan(A) )
> non_nan_values=A[non_nan_indices]
> dictionary=dict(zip(non_nan_indices[0],non_nan_values))
> return Vectors.sparse (len(A),dictionary)
>
> X=[convert_into_sparse_vector(A) for A in complete_dataframe.values ]
> sc=SparkContext(appName="parallel_kmeans")
> data=sc.parallelize(X,10)
> model = KMeans.train(data, 1000, initializationMode="k-means||")
>
> where complete_dataframe is a pandas data frame that has my data.
>
> I get the error: Py4JNetworkError: An error occurred while trying to connect
> to the Java server.
>
> The error  trace is as follows:
>
>>  Exception happened during
>> processing of request from ('127.0.0.1', 41360) Traceback (most recent
>> call last):   File "/usr/lib64/python2.6/SocketServer.py", line 283,
>> in _handle_request_noblock
>> self.process_request(request, client_address)   File
>> "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request
>> self.finish_request(request, client_address)   File
>> "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
>> self.RequestHandlerClass(request, client_address, self)   File
>> "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
>> self.handle()   File "/root/spark/python/pyspark/accumulators.py",
>> line 235, in handle
>> num_updates = read_int(self.rfile)   File
>> "/root/spark/python/pyspark/serializers.py", line 544, in read_int
>> raise EOFError EOFError
>> 
>>
>> ---
>> Py4JNetworkError  Traceback (most recent call
>> last)  in ()
>> > 1 model = KMeans.train(data, 1000, initializationMode="k-means||")
>>
>> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k,
>> maxIterations, runs, initializationMode, seed, initializationSteps,
>> epsilon)
>> 134 """Train a k-means clustering model."""
>> 135 model = callMLlibFunc("trainKMeansModel",
>> rdd.map(_convert_to_vector), k, maxIterations,
>> --> 136   runs, initializationMode, seed,
>> initializationSteps, epsilon)
>> 137 centers = callJavaFunc(rdd.context, model.clusterCenters)
>> 138 return KMeansModel([c.toArray() for c in centers])
>>
>> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name,
>> *args)
>> 126 sc = SparkContext._active_spark_context
>> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
>> --> 128 return callJavaFunc(sc, api, *args)
>> 129
>> 130
>>
>> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func,
>> *args)
>> 119 """ Call Java Function """
>> 120 args = [_py2java(sc, a) for a in args]
>> --> 121 return _java2py(sc, func(*args))
>> 122
>> 123
>>
>> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>> __call__(self, *args)
>> 534 END_COMMAND_PART
>> 535
>> --> 536 answer = self.gateway_client.send_command(command)
>> 537 return_value = get_return_value(answer,
>> self.gateway_client,
>> 538 self.target_id, self.name)
>>
>> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>> send_command(self, command, retry)
>> 367 if retry:
>> 368 #print_exc()
>> --> 369 response = self.send_command(command)
>> 370 else:
>> 371 response = ERROR
>>
>> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
>>

Settings for K-Means Clustering in Mlib for large data set

2015-06-18 Thread Rogers Jeffrey
Hi All,

I am trying to run KMeans clustering on a large data set with 12,000 points
and 80,000 dimensions.  I have a spark cluster in Ec2 stand alone mode
 with 8  workers running on 2 slaves with 160 GB Ram and 40 VCPU.

My Code is as Follows:

def convert_into_sparse_vector(A):
non_nan_indices=np.nonzero(~np.isnan(A) )
non_nan_values=A[non_nan_indices]
dictionary=dict(zip(non_nan_indices[0],non_nan_values))
return Vectors.sparse (len(A),dictionary)

X=[convert_into_sparse_vector(A) for A in complete_dataframe.values ]
sc=SparkContext(appName="parallel_kmeans")
data=sc.parallelize(X,10)
model = KMeans.train(data, 1000, initializationMode="k-means||")

where complete_dataframe is a pandas data frame that has my data.

I get the error: *Py4JNetworkError: An error occurred while trying to
connect to the **Java server.*

The error  trace is as follows:

>  Exception happened during processing 
> of request from ('127.0.0.1', 41360) Traceback (most recent
> call last):   File "/usr/lib64/python2.6/SocketServer.py", line 283,
> in _handle_request_noblock
> self.process_request(request, client_address)   File 
> "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request
> self.finish_request(request, client_address)   File 
> "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
> self.RequestHandlerClass(request, client_address, self)   File 
> "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
> self.handle()   File "/root/spark/python/pyspark/accumulators.py", line 
> 235, in handle
> num_updates = read_int(self.rfile)   File 
> "/root/spark/python/pyspark/serializers.py", line 544, in read_int
> raise EOFError EOFError
> 
> --- 
> Py4JNetworkError  Traceback (most recent call
> last)  in ()
> > 1 model = KMeans.train(data, 1000, initializationMode="k-means||")
>
> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k,
> maxIterations, runs, initializationMode, seed, initializationSteps,
> epsilon)
> 134 """Train a k-means clustering model."""
> 135 model = callMLlibFunc("trainKMeansModel", 
> rdd.map(_convert_to_vector), k, maxIterations,
> --> 136   runs, initializationMode, seed, 
> initializationSteps, epsilon)
> 137 centers = callJavaFunc(rdd.context, model.clusterCenters)
> 138 return KMeansModel([c.toArray() for c in centers])
>
> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name,
> *args)
> 126 sc = SparkContext._active_spark_context
> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
> --> 128 return callJavaFunc(sc, api, *args)
> 129
> 130
>
> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func,
> *args)
> 119 """ Call Java Function """
> 120 args = [_py2java(sc, a) for a in args]
> --> 121 return _java2py(sc, func(*args))
> 122
> 123
>
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
> __call__(self, *args)
> 534 END_COMMAND_PART
> 535
> --> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> 538 self.target_id, self.name)
>
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
> send_command(self, command, retry)
> 367 if retry:
> 368 #print_exc()
> --> 369 response = self.send_command(command)
> 370 else:
> 371 response = ERROR
>
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
> send_command(self, command, retry)
> 360  the Py4J protocol.
> 361 """
> --> 362 connection = self._get_connection()
> 363 try:
> 364 response = connection.send_command(command)
>
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
> _get_connection(self)
> 316 connection = self.deque.pop()
> 317 except Exception:
> --> 318 connection = self._create_connection()
> 319 return connection
> 320
>
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
> _create_connection(self)
> 323 connection = GatewayConnection(se

Announcement: Generalized K-Means Clustering on Spark

2015-01-25 Thread derrickburns
This project generalizes the Spark MLLIB K-Means clusterer to support
clustering of dense or sparse, low or high dimensional data using distance
functions defined by Bregman divergences.

https://github.com/derrickburns/generalized-kmeans-clustering



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Announcement-Generalized-K-Means-Clustering-on-Spark-tp21363.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: k-means clustering

2014-11-25 Thread Yanbo Liang
Pre-processing is a major part of the workload before training the model.
MLlib provides TF-IDF calculation, StandardScaler and Normalizer, which are
essential for preprocessing and are a great help for model training.

Take a look at this
http://spark.apache.org/docs/latest/mllib-feature-extraction.html
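
For example, a rough PySpark sketch of that pre-processing chain (assuming a
Spark version where the pyspark.mllib.feature API is available; tokenization,
stop-word removal and stemming would still be done with your own NLTK code
beforehand):

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF, Normalizer
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="text_kmeans_preprocessing")

# docs: an RDD where each element is the list of (already cleaned/stemmed)
# tokens of one document
docs = sc.parallelize([["spark", "kmeans", "text", "clustering"],
                       ["tfidf", "before", "kmeans", "clustering"]])

tf = HashingTF().transform(docs)            # term-frequency vectors (sparse)
tf.cache()                                  # used by both IDF.fit and IDF.transform
tfidf = IDF().fit(tf).transform(tf)         # TF-IDF weighting
features = Normalizer().transform(tfidf)    # unit-length rows often help k-means on text

model = KMeans.train(features, 6, maxIterations=10)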

2014-11-21 7:18 GMT+08:00 Jun Yang :

> Guys,
>
> As to the question of pre-processing, you could just migrate your logic
> to Spark before using K-means.
>
> I have only used Scala on Spark and haven't used the Python binding, but
> I think the basic steps must be the same.
>
> BTW, if your data set is big, with huge sparse high-dimensional feature vectors,
> K-Means may not work as well as you expect. And I think this is still an
> optimization direction for Spark MLlib.
>
> On Wed, Nov 19, 2014 at 2:21 PM, amin mohebbi  > wrote:
>
>> Hi there,
>>
>> I would like to do "text clustering" using  k-means and Spark on a
>> massive dataset. As you know, before running the k-means, I have to do
>> pre-processing methods such as TFIDF and NLTK on my big dataset. The
>> following is my code in python :
>>
>> if __name__ == '__main__': # Cluster a bunch of text documents. import re
>> import sys k = 6 vocab = {} xs = [] ns=[] cat=[] filename='2013-01.csv'
>> with open(filename, newline='') as f: try: newsreader = csv.reader(f) for
>> row in newsreader: ns.append(row[3]) cat.append(row[4]) except csv.Error
>> as e: sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num,
>> e))  remove_spl_char_regex = re.compile('[%s]' %
>> re.escape(string.punctuation)) # regex to remove special characters
>> remove_num = re.compile('[\d]+') #nltk.download() stop_words=
>> nltk.corpus.stopwords.words('english') for a in ns: x = defaultdict(float
>> )  a1 = a.strip().lower() a2 = remove_spl_char_regex.sub(" ",a1) #
>> Remove special characters a3 = remove_num.sub("", a2) #Remove numbers #Remove
>> stop words words = a3.split() filter_stop_words = [w for w in words if
>> not w in stop_words] stemed = [PorterStemmer().stem_word(w) for w in
>> filter_stop_words] ws=sorted(stemed)  #ws=re.findall(r"\w+", a1) for w in
>> ws: vocab.setdefault(w, len(vocab)) x[vocab[w]] += 1 xs.append(x.items())
>>
>> Can anyone explain to me how I can do the pre-processing step before
>> running k-means using Spark?
>>
>>
>> Best Regards
>>
>> ...
>>
>> Amin Mohebbi
>>
>> PhD candidate in Software Engineering
>>  at university of Malaysia
>>
>> Tel : +60 18 2040 017
>>
>>
>>
>> E-Mail : tp025...@ex.apiit.edu.my
>>
>>   amin_...@me.com
>>
>
>
>
> --
> yangjun...@gmail.com
> http://hi.baidu.com/yjpro
>


Re: K-means clustering

2014-11-25 Thread Xiangrui Meng
There is a simple example here:
https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py
. You can take advantage of sparsity by computing the distance via
inner products:
http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2
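
The inner-product trick boils down to ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2,
so only the non-zero entries of a sparse x need to be touched. A small
NumPy/SciPy sketch of the idea (outside Spark, just to show the computation):

import numpy as np
from scipy.sparse import csr_matrix

def sq_dists_to_centers(x, centers, center_sq_norms):
    # squared Euclidean distance from one sparse row x to every center,
    # using ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2
    x_sq = x.multiply(x).sum()
    cross = np.asarray(x.dot(centers.T)).ravel()
    return x_sq - 2.0 * cross + center_sq_norms

# hypothetical tiny example
X = csr_matrix([[0.0, 1.0, 0.0, 2.0],
                [3.0, 0.0, 0.0, 0.0]])
centers = np.array([[0.0, 1.0, 0.0, 2.0],
                    [1.0, 0.0, 0.0, 0.0]])
center_sq_norms = (centers ** 2).sum(axis=1)

print(sq_dists_to_centers(X[0], centers, center_sq_norms))  # [0. 6.] -> closest is center 0
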
-Xiangrui

On Tue, Nov 25, 2014 at 2:39 AM, amin mohebbi
 wrote:
>  I  have generated a sparse matrix by python, which has the size of
> 4000*174000 (.pkl), the following is a small part of this matrix :
>
>  (0, 45) 1
>   (0, 413) 1
>   (0, 445) 1
>   (0, 107) 4
>   (0, 80) 2
>   (0, 352) 1
>   (0, 157) 1
>   (0, 191) 1
>   (0, 315) 1
>   (0, 395) 4
>   (0, 282) 3
>   (0, 184) 1
>   (0, 403) 1
>   (0, 169) 1
>   (0, 267) 1
>   (0, 148) 1
>   (0, 449) 1
>   (0, 241) 1
>   (0, 303) 1
>   (0, 364) 1
>   (0, 257) 1
>   (0, 372) 1
>   (0, 73) 1
>   (0, 64) 1
>   (0, 427) 1
>   : :
>   (2, 399) 1
>   (2, 277) 1
>   (2, 229) 1
>   (2, 255) 1
>   (2, 409) 1
>   (2, 355) 1
>   (2, 391) 1
>   (2, 28) 1
>   (2, 384) 1
>   (2, 86) 1
>   (2, 285) 2
>   (2, 166) 1
>   (2, 165) 1
>   (2, 419) 1
>   (2, 367) 2
>   (2, 133) 1
>   (2, 61) 1
>   (2, 434) 1
>   (2, 51) 1
>   (2, 423) 1
>   (2, 398) 1
>   (2, 438) 1
>   (2, 389) 1
>   (2, 26) 1
>   (2, 455) 1
>
> I am new to Spark and would like to cluster this matrix with the k-means
> algorithm. Can anyone explain to me what kinds of problems I might face?
> Please note that I do not want to use MLlib and would like to write my own
> k-means.
> Best Regards
>
> ...
>
> Amin Mohebbi
>
> PhD candidate in Software Engineering
>  at university of Malaysia
>
> Tel : +60 18 2040 017
>
>
>
> E-Mail : tp025...@ex.apiit.edu.my
>
>   amin_...@me.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



K-means clustering

2014-11-25 Thread amin mohebbi
I have generated a sparse matrix with Python, which has the size 4000*174000
(.pkl); the following is a small part of this matrix:
 (0, 45) 1  (0, 413) 1  (0, 445) 1  (0, 107) 4  (0, 80) 2  (0, 352) 1  (0, 157) 
1  (0, 191) 1  (0, 315) 1  (0, 395) 4  (0, 282) 3  (0, 184) 1  (0, 403) 1  (0, 
169) 1  (0, 267) 1  (0, 148) 1  (0, 449) 1  (0, 241) 1  (0, 303) 1  (0, 364) 1  
(0, 257) 1  (0, 372) 1  (0, 73) 1  (0, 64) 1  (0, 427) 1  : :  (2, 399) 1  (2, 
277) 1  (2, 229) 1  (2, 255) 1  (2, 409) 1  (2, 355) 1  (2, 391) 1  (2, 28) 1  
(2, 384) 1  (2, 86) 1  (2, 285) 2  (2, 166) 1  (2, 165) 1  (2, 419) 1  (2, 367) 
2  (2, 133) 1  (2, 61) 1  (2, 434) 1  (2, 51) 1  (2, 423) 1  (2, 398) 1  (2, 
438) 1  (2, 389) 1  (2, 26) 1  (2, 455) 1
I am new to Spark and would like to cluster this matrix with the k-means algorithm.
Can anyone explain to me what kinds of problems I might face? Please note
that I do not want to use MLlib and would like to write my own k-means.
Best Regards

...

Amin Mohebbi

PhD candidate in Software Engineering 
 at university of Malaysia  

Tel : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my

  amin_...@me.com

Re: k-means clustering

2014-11-20 Thread Jun Yang
Guys,

As to the question of pre-processing, you could just migrate your logic to
Spark before using K-means.

I have only used Scala on Spark and haven't used the Python binding, but I
think the basic steps must be the same.

BTW, if your data set is big, with huge sparse high-dimensional feature vectors,
K-Means may not work as well as you expect. And I think this is still an
optimization direction for Spark MLlib.

On Wed, Nov 19, 2014 at 2:21 PM, amin mohebbi 
wrote:

> Hi there,
>
> I would like to do "text clustering" using  k-means and Spark on a massive
> dataset. As you know, before running the k-means, I have to do
> pre-processing methods such as TFIDF and NLTK on my big dataset. The
> following is my code in python :
>
> if __name__ == '__main__': # Cluster a bunch of text documents. import re
> import sys k = 6 vocab = {} xs = [] ns=[] cat=[] filename='2013-01.csv'
> with open(filename, newline='') as f: try: newsreader = csv.reader(f) for
> row in newsreader: ns.append(row[3]) cat.append(row[4]) except csv.Error
> as e: sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num,
> e))  remove_spl_char_regex = re.compile('[%s]' %
> re.escape(string.punctuation)) # regex to remove special characters
> remove_num = re.compile('[\d]+') #nltk.download() stop_words=
> nltk.corpus.stopwords.words('english') for a in ns: x = defaultdict(float)
> a1 = a.strip().lower() a2 = remove_spl_char_regex.sub(" ",a1) # Remove
> special characters a3 = remove_num.sub("", a2) #Remove numbers #Remove
> stop words words = a3.split() filter_stop_words = [w for w in words if not
> w in stop_words] stemed = [PorterStemmer().stem_word(w) for w in
> filter_stop_words] ws=sorted(stemed)  #ws=re.findall(r"\w+", a1) for w in
> ws: vocab.setdefault(w, len(vocab)) x[vocab[w]] += 1 xs.append(x.items())
>
> Can anyone explain to me how I can do the pre-processing step before
> running k-means using Spark?
>
>
> Best Regards
>
> ...
>
> Amin Mohebbi
>
> PhD candidate in Software Engineering
>  at university of Malaysia
>
> Tel : +60 18 2040 017
>
>
>
> E-Mail : tp025...@ex.apiit.edu.my
>
>   amin_...@me.com
>



-- 
yangjun...@gmail.com
http://hi.baidu.com/yjpro


k-means clustering

2014-11-18 Thread amin mohebbi
Hi there,
I would like to do "text clustering" using  k-means and Spark on a massive 
dataset. As you know, before running the k-means, I have to do pre-processing 
methods such as TFIDF and NLTK on my big dataset. The following is my code in 
python :

if __name__ == '__main__':
    # Cluster a bunch of text documents.
    import re
    import sys
    # further imports used below
    import csv
    import string
    import nltk
    from collections import defaultdict
    from nltk.stem.porter import PorterStemmer

    k = 6
    vocab = {}
    xs = []
    ns = []
    cat = []
    filename = '2013-01.csv'
    with open(filename, newline='') as f:
        try:
            newsreader = csv.reader(f)
            for row in newsreader:
                ns.append(row[3])
                cat.append(row[4])
        except csv.Error as e:
            sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))

    remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))  # regex to remove special characters
    remove_num = re.compile('[\d]+')
    # nltk.download()
    stop_words = nltk.corpus.stopwords.words('english')

    for a in ns:
        x = defaultdict(float)

        a1 = a.strip().lower()
        a2 = remove_spl_char_regex.sub(" ", a1)  # Remove special characters
        a3 = remove_num.sub("", a2)  # Remove numbers
        # Remove stop words
        words = a3.split()
        filter_stop_words = [w for w in words if not w in stop_words]
        stemed = [PorterStemmer().stem_word(w) for w in filter_stop_words]
        ws = sorted(stemed)

        # ws = re.findall(r"\w+", a1)
        for w in ws:
            vocab.setdefault(w, len(vocab))
            x[vocab[w]] += 1
        xs.append(x.items())



Can anyone explain to me how I can do the pre-processing step before running
k-means using Spark?
 
Best Regards

...

Amin Mohebbi

PhD candidate in Software Engineering 
 at university of Malaysia  

Tel : +60 18 2040 017



E-Mail : tp025...@ex.apiit.edu.my

  amin_...@me.com

Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Aris
Yeah - another vote here for what's called one-hot encoding: just convert
the single categorical feature into N columns, where N is the number of
distinct values of that feature, with a single 1 in the column for the
observed value and all the other columns set to zero.

On Tue, Sep 16, 2014 at 2:16 PM, Sean Owen  wrote:

> I think it's on the table but not yet merged?
> https://issues.apache.org/jira/browse/SPARK-1216
>
> On Tue, Sep 16, 2014 at 10:04 PM, st553  wrote:
> > Does MLlib provide utility functions to do this kind of encoding?
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Sean Owen
I think it's on the table but not yet merged?
https://issues.apache.org/jira/browse/SPARK-1216

On Tue, Sep 16, 2014 at 10:04 PM, st553  wrote:
> Does MLlib provide utility functions to do this kind of encoding?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Categorical Features for K-Means Clustering

2014-09-16 Thread st553
Does MLlib provide utility functions to do this kind of encoding?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Categorical-Features-for-K-Means-Clustering-tp9416p14394.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Please try

val parsedData3 =
  data3.repartition(12).map(_.split('\t').map(_.toDouble)).cache()

and check the storage and driver/executor memory in the WebUI. Make
sure the data is fully cached.

-Xiangrui


On Thu, Jul 17, 2014 at 5:09 AM, Ravishankar Rajagopalan
 wrote:
> Hi Xiangrui,
>
> Yes I am using Spark v0.9 and am not running it in local mode.
>
> I did the memory setting using "export SPARK_MEM=4G" before starting the
> Spark instance.
>
> Also previously, I was starting it with -c 1 but changed it to -c 12 since
> it is a 12 core machine. It did bring down the time taken to less than 200
> seconds from over 700 seconds.
>
> I am not sure how to repartition the data to match the CPU cores. How do I
> do it?
>
> Thank you.
>
> Ravi
>
>
> On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng  wrote:
>>
>> Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
>> and repartition your data to match number of CPU cores such that the
>> data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to storage
>> the data. Make sure there are enough memory for caching. -Xiangrui
>>
>> On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
>>  wrote:
>> > I am trying to use MLlib for K-Means clustering on a data set with 1
>> > million
>> > rows and 50 columns (all columns have double values) which is on HDFS
>> > (raw
>> > txt file is 28 MB)
>> >
>> > I initially tried the following:
>> >
>> > val data3 = sc.textFile("hdfs://...inputData.txt")
>> > val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
>> > val numIterations = 10
>> > val numClusters = 200
>> > val clusters = KMeans.train(parsedData3, numClusters, numIterations)
>> >
>> > This took me nearly 850 seconds.
>> >
>> > I tried using persist with MEMORY_ONLY option hoping that this would
>> > significantly speed up the algorithm:
>> >
>> > val data3 = sc.textFile("hdfs://...inputData.txt")
>> > val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
>> > parsedData3.persist(MEMORY_ONLY)
>> > val numIterations = 10
>> > val numClusters = 200
>> > val clusters = KMeans.train(parsedData3, numClusters, numIterations)
>> >
>> > This resulted in only a marginal improvement and took around 720
>> > seconds.
>> >
>> > Is there any other way to speed up the algorithm further?
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Ravi
>
>


Re: Speeding up K-Means Clustering

2014-07-17 Thread Ravishankar Rajagopalan
Hi Xiangrui,

Yes I am using Spark v0.9 and am not running it in local mode.

I did the memory setting using "export SPARK_MEM=4G" before starting the  Spark
instance.

Also previously, I was starting it with -c 1 but changed it to -c 12 since
it is a 12 core machine. It did bring down the time taken to less than 200
seconds from over 700 seconds.

I am not sure how to repartition the data to match the CPU cores. How do I
do it?

Thank you.

Ravi


On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng  wrote:

> Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
> and repartition your data to match number of CPU cores such that the
> data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to storage
> the data. Make sure there are enough memory for caching. -Xiangrui
>
> On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
>  wrote:
> > I am trying to use MLlib for K-Means clustering on a data set with 1
> million
> > rows and 50 columns (all columns have double values) which is on HDFS
> (raw
> > txt file is 28 MB)
> >
> > I initially tried the following:
> >
> > val data3 = sc.textFile("hdfs://...inputData.txt")
> > val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
> > val numIterations = 10
> > val numClusters = 200
> > val clusters = KMeans.train(parsedData3, numClusters, numIterations)
> >
> > This took me nearly 850 seconds.
> >
> > I tried using persist with MEMORY_ONLY option hoping that this would
> > significantly speed up the algorithm:
> >
> > val data3 = sc.textFile("hdfs://...inputData.txt")
> > val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
> > parsedData3.persist(MEMORY_ONLY)
> > val numIterations = 10
> > val numClusters = 200
> > val clusters = KMeans.train(parsedData3, numClusters, numIterations)
> >
> > This resulted in only a marginal improvement and took around 720 seconds.
> >
> > Is there any other way to speed up the algorithm further?
> >
> > Thank you.
> >
> > Regards,
> > Ravi
>


Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
and repartition your data to match number of CPU cores such that the
data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to storage
the data. Make sure there are enough memory for caching. -Xiangrui

On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
 wrote:
> I am trying to use MLlib for K-Means clustering on a data set with 1 million
> rows and 50 columns (all columns have double values) which is on HDFS (raw
> txt file is 28 MB)
>
> I initially tried the following:
>
> val data3 = sc.textFile("hdfs://...inputData.txt")
> val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
> val numIterations = 10
> val numClusters = 200
> val clusters = KMeans.train(parsedData3, numClusters, numIterations)
>
> This took me nearly 850 seconds.
>
> I tried using persist with MEMORY_ONLY option hoping that this would
> significantly speed up the algorithm:
>
> val data3 = sc.textFile("hdfs://...inputData.txt")
> val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
> parsedData3.persist(MEMORY_ONLY)
> val numIterations = 10
> val numClusters = 200
> val clusters = KMeans.train(parsedData3, numClusters, numIterations)
>
> This resulted in only a marginal improvement and took around 720 seconds.
>
> Is there any other way to speed up the algorithm further?
>
> Thank you.
>
> Regards,
> Ravi


Speeding up K-Means Clustering

2014-07-17 Thread Ravishankar Rajagopalan
I am trying to use MLlib for K-Means clustering on a data set with 1
million rows and 50 columns (all columns have double values) which is on
HDFS (raw txt file is 28 MB)

I initially tried the following:

val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
val numIterations = 10
val numClusters = 200
val clusters = KMeans.train(parsedData3, numClusters, numIterations)

This took me nearly 850 seconds.

I tried using persist with MEMORY_ONLY option hoping that this would
significantly speed up the algorithm:

val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
parsedData3.persist(MEMORY_ONLY)
val numIterations = 10
val numClusters = 200
val clusters = KMeans.train(parsedData3, numClusters, numIterations)

This resulted in only a marginal improvement and took around 720 seconds.

Is there any other way to speed up the algorithm further?

Thank you.

Regards,
Ravi


Re: Categorical Features for K-Means Clustering

2014-07-11 Thread Wen Phan
I see. So, basically, kind of like dummy variables in regression. Thanks, Sean.

On Jul 11, 2014, at 10:11 AM, Sean Owen  wrote:

> Since you can't define your own distance function, you will need to
> convert these to numeric dimensions. 1-of-n encoding can work OK,
> depending on your use case. So a dimension that takes on 3 categorical
> values, becomes 3 dimensions, of which all are 0 except one that has
> value 1.
> 
> On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan  wrote:
>> Hi Folks,
>> 
>> Does any one have experience or recommendations on incorporating categorical 
>> features (attributes) into k-means clustering in Spark?  In other words, I 
>> want to cluster on a set of attributes that include categorical variables.
>> 
>> I know I could probably implement some custom code to parse and calculate my 
>> own similarity function, but I wanted to reach out before I did so.  I’d 
>> also prefer to take advantage of the k-means\parallel initialization feature 
>> of the model in MLlib, so an MLlib-based implementation would be preferred.
>> 
>> Thanks in advance.
>> 
>> Best,
>> 
>> -Wen



signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: Categorical Features for K-Means Clustering

2014-07-11 Thread Sean Owen
Since you can't define your own distance function, you will need to
convert these to numeric dimensions. 1-of-n encoding can work OK,
depending on your use case. So a dimension that takes on 3 categorical
values, becomes 3 dimensions, of which all are 0 except one that has
value 1.
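
For example, a minimal hand-rolled sketch of that 1-of-n encoding in PySpark
(the category values and the extra numeric column are made up for
illustration):

from pyspark import SparkContext

sc = SparkContext(appName="one_hot_example")

categories = ["red", "green", "blue"]          # 3 categorical values -> 3 dimensions
index = dict((c, i) for i, c in enumerate(categories))

def one_hot(value):
    v = [0.0] * len(categories)
    v[index[value]] = 1.0                      # a single 1, everything else 0
    return v

# each record: (categorical value, numeric feature)
rows = sc.parallelize([("red", 1.2), ("blue", 3.4), ("green", 0.5)])
encoded = rows.map(lambda r: one_hot(r[0]) + [r[1]])

print(encoded.collect())
# [[1.0, 0.0, 0.0, 1.2], [0.0, 0.0, 1.0, 3.4], [0.0, 1.0, 0.0, 0.5]]

The resulting RDD of purely numeric vectors can then be fed to KMeans.train as
usual.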

On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan  wrote:
> Hi Folks,
>
> Does any one have experience or recommendations on incorporating categorical 
> features (attributes) into k-means clustering in Spark?  In other words, I 
> want to cluster on a set of attributes that include categorical variables.
>
> I know I could probably implement some custom code to parse and calculate my 
> own similarity function, but I wanted to reach out before I did so.  I’d also 
> prefer to take advantage of the k-means\parallel initialization feature of 
> the model in MLlib, so an MLlib-based implementation would be preferred.
>
> Thanks in advance.
>
> Best,
>
> -Wen


Categorical Features for K-Means Clustering

2014-07-11 Thread Wen Phan
Hi Folks,

Does anyone have experience or recommendations on incorporating categorical
features (attributes) into k-means clustering in Spark? In other words, I want
to cluster on a set of attributes that include categorical variables.

I know I could probably implement some custom code to parse and calculate my
own similarity function, but I wanted to reach out before I did so. I’d also
prefer to take advantage of the k-means|| (k-means parallel) initialization
feature of the model in MLlib, so an MLlib-based implementation would be
preferred.

Thanks in advance.

Best,

-Wen


signature.asc
Description: Message signed with OpenPGP using GPGMail