[jira] [Commented] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster

2017-07-01 Thread Nassir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071447#comment-16071447
 ] 

Nassir commented on SPARK-21244:


Hi, 

The pyspark k-means implementation is run on the same 20 Newsgroups document set 
that sklearn k-means is run on. 

The pyspark version does not produce any meaningful clusters, unlike sklearn 
k-means (both using Euclidean distance as the distance measure).

The 'bug' is that pyspark k-means applied to tf-idf document vectors does not 
produce the expected results. I would be interested to know whether anyone has 
used k-means in Spark MLlib to cluster a standard document set such as the 
20 Newsgroups set. Do almost all of the documents clump into one cluster, as 
they do for me?
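
For reference, here is a minimal sketch of the kind of sklearn baseline I am 
comparing against (the category list and vectorizer settings below are 
illustrative, not my exact script):

# illustrative sklearn baseline - assumed parameters, not the exact original run
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

cats = ["alt.atheism", "comp.graphics", "rec.sport.baseball",
        "sci.med", "sci.space", "talk.politics.guns"]
data = fetch_20newsgroups(subset="train", categories=cats)

# tf-idf matrix over the documents, then plain Euclidean k-means
X = TfidfVectorizer(stop_words="english", max_features=10000).fit_transform(data.data)
km = KMeans(n_clusters=6, max_iter=20, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))  # cluster sizes

With a baseline along these lines the documents spread across all six clusters, 
which is what I expected from the pyspark pipeline as well.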







[jira] [Updated] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster

2017-06-28 Thread Nassir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nassir updated SPARK-21244:
---
Description: 
I have observed this problem for quite a while now regarding the implementation 
of pyspark KMeans on text documents - to cluster documents according to their 
TF-IDF vectors. The pyspark implementation - even on standard datasets - 
clusters almost all of the documents into one cluster. 

I implemented K-means on the same dataset with the same parameters using the 
sklearn library, and it clusters the documents very well. 

I recommend that anyone who is able to do so test the pyspark implementation of 
KMeans on text documents - it evidently has a bug in it somewhere.

(Currently I am converting my Spark dataframe to a pandas dataframe, running 
k-means, and converting back. However, this is of course not a parallel solution 
capable of handling huge amounts of data in the future.)

Here is a link to the question I posted a while back on Stack Overflow: 
https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one









[jira] [Created] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster

2017-06-28 Thread Nassir (JIRA)
Nassir created SPARK-21244:
--

 Summary: KMeans applied to processed text data clumps almost all 
documents into one cluster
 Key: SPARK-21244
 URL: https://issues.apache.org/jira/browse/SPARK-21244
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.1
Reporter: Nassir


I have observed this problem for quite a while now regarding the implementation 
of pyspark KMeans on text documents - to cluster documents according to their 
TF-IDF vectors. The pyspark implementation - even on standard datasets - 
clusters almost all of the documents into one cluster. 

I implemented K-means on the same dataset with the same parameters using the 
sklearn library, and it clusters the documents very well. 

I recommend that anyone who is able to do so test the pyspark implementation of 
KMeans on text documents - it evidently has a bug in it somewhere.

(Currently I am converting my Spark dataframe to a pandas dataframe, running 
k-means, and converting back. However, this is of course not a parallel solution 
capable of handling huge amounts of data in the future.)






[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster

2017-06-28 Thread Nassir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066652#comment-16066652
 ] 

Nassir commented on SPARK-20696:


Unfortunately, I have not found a place to make this known to the Spark 
community yet.

My workaround has been to convert the pyspark dataframe to a pandas dataframe, 
use sklearn's K-Means to cluster the documents (which works well), and then 
convert the pandas dataframe back to pyspark.
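
A minimal sketch of that round trip (the column names and k below are 
illustrative, and tfIdfVectors stands in for the TF-IDF dataframe from the 
issue description):

# illustrative sketch of the toPandas round-trip workaround - assumed names
import numpy as np
from sklearn.cluster import KMeans

pdf = tfIdfVectors.select("filePath", "features").toPandas()
X = np.vstack([v.toArray() for v in pdf["features"]])  # densify the Spark ML vectors
pdf["cluster"] = KMeans(n_clusters=6, random_state=0).fit_predict(X)
clustered_df = spark.createDataFrame(pdf[["filePath", "cluster"]])  # back to pyspark

Note that collecting and densifying the vectors like this is part of why it 
does not scale.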

It works in my situation, as the number of documents I am clustering is 
relatively small. However, I will want to process big data and would need a 
solution in pyspark with Spark Streaming in the future.

Nassir







[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster

2017-06-09 Thread Nassir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044356#comment-16044356
 ] 

Nassir commented on SPARK-20696:


This appears to be a problem with the implementation of the K-means algorithm in 
a distributed environment. The sklearn library produces distinct clusters on the 
same dataset, whilst pyspark k-means clustering fails. Has anyone actually done 
text clustering using K-means in Apache Spark? E.g. cluster the 20 Newsgroups 
set into 20 clusters and see whether you get multiple clusters or just one large 
one?
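
For anyone who tries this, a quick way to check for the one-large-cluster 
symptom (assuming a fitted model and dataframe named as in the code from the 
issue description):

# illustrative check of the cluster size distribution - assumed names
km_model.transform(l2NormData) \
    .groupBy("prediction") \
    .count() \
    .orderBy("count", ascending=False) \
    .show()
# one dominant count is the clumping described above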







[jira] [Created] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster

2017-05-10 Thread Nassir (JIRA)
Nassir created SPARK-20696:
--

 Summary: tf-idf document clustering with K-means in Apache Spark 
putting points into one cluster
 Key: SPARK-20696
 URL: https://issues.apache.org/jira/browse/SPARK-20696
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
Reporter: Nassir


I am trying to do the classic job of clustering text documents: pre-processing, 
generating a tf-idf matrix, and then applying K-means. However, testing this 
workflow on the classic 20 Newsgroups dataset results in most documents being 
clustered into one cluster. (I initially tried to cluster all documents from 6 
of the 20 groups, so I expected 6 clusters.)

I am implementing this in Apache Spark because my purpose is to utilise this 
technique on millions of documents. Here is the code, written in PySpark on 
Databricks:

#declare path to folder containing 6 of 20 news group categories
path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % 
MOUNT_NAME

#read all the text files from the 6 folders; each entity is an entire document
text_files = sc.wholeTextFiles(path).cache()

#convert rdd to dataframe
df = text_files.toDF(["filePath", "document"]).cache()

from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer 

#tokenize the document text
tokenizer = Tokenizer(inputCol="document", outputCol="tokens")
tokenized = tokenizer.transform(df).cache()

from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="tokens", outputCol="stopWordsRemovedTokens")
stopWordsRemoved_df = remover.transform(tokenized).cache()

#hash the filtered tokens into numFeatures buckets to get term-frequency vectors
hashingTF = HashingTF(inputCol="stopWordsRemovedTokens", outputCol="rawFeatures", numFeatures=20)
tfVectors = hashingTF.transform(stopWordsRemoved_df).cache()

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
idfModel = idf.fit(tfVectors)

tfIdfVectors = idfModel.transform(tfVectors).cache()

#note that I have also tried to use normalized data, but get the same result
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors

normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
l2NormData = normalizer.transform(tfIdfVectors)

from pyspark.ml.clustering import KMeans

# Trains a KMeans model.
kmeans = KMeans().setK(6).setMaxIter(20)
km_model = kmeans.fit(l2NormData)

clustersTable = km_model.transform(l2NormData)

Output showing that most documents get clustered into cluster 0:

ID  number_of_documents_in_cluster
0   3024
3   5
1   3
5   2
2   2
4   1

As you can see, most of my data points get clustered into cluster 0, and I 
cannot figure out what I am doing wrong, as all the tutorials and code I have 
come across online point to this method.

In addition, I have also tried normalizing the tf-idf matrix before K-means, but 
that produces the same result. I know cosine distance is a better measure to 
use, but I expected that standard K-means in Apache Spark would provide 
meaningful results.
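
(For what it's worth, on L2-normalized vectors Euclidean K-means and cosine 
similarity are directly related: for unit vectors a and b,

    ||a - b||^2 = 2 * (1 - a . b)

so after the Normalizer step above, Euclidean distances are a monotone function 
of cosine distances, and K-means on the normalized features should already 
behave like cosine-based clustering.)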

Can anyone help with whether I have a bug in my code, or whether something is 
missing in my data clustering pipeline?

(Question also asked on Stack Overflow before: 
http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one)

Thank you in advance!


