[jira] [Commented] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster
[ https://issues.apache.org/jira/browse/SPARK-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071447#comment-16071447 ] Nassir commented on SPARK-21244:

Hi, the pyspark k-means implementation is run on the same 20 Newsgroups document set that sklearn k-means is run on. The pyspark version does not produce any meaningful clusters, unlike sklearn k-means (both using Euclidean distance as the distance measure). The "bug" is that pyspark k-means applied to tf-idf documents does not produce the expected results. I would be interested to know whether anyone has used k-means in Spark MLlib to cluster a standard document set such as the 20 Newsgroups set. Do almost all of the documents clump into one cluster, as they do for me?

> KMeans applied to processed text data clumps almost all documents into one cluster
> ----------------------------------------------------------------------------------
>
> Key: SPARK-21244
> URL: https://issues.apache.org/jira/browse/SPARK-21244
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.1.1
> Reporter: Nassir
>
> I have observed this problem for quite a while now regarding the implementation of pyspark KMeans on text documents - clustering documents according to their tf-idf vectors. The pyspark implementation - even on standard datasets - clusters almost all of the documents into one cluster. I implemented k-means on the same dataset with the same parameters using the sklearn library, and it clusters the documents very well. I encourage anyone who is able to test the pyspark implementation of KMeans on text documents to do so - it evidently has a bug in it somewhere. (Currently I convert my Spark dataframe to a pandas dataframe, run k-means, and convert back. However, this is of course not a parallel solution capable of handling huge amounts of data in future.)
> Here is a link to the question I posted a while back on Stack Overflow:
> https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
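A side note on the distance measure discussed in this comment: the pipeline in these reports L2-normalizes the tf-idf vectors, and on unit vectors squared Euclidean distance and cosine similarity are monotonically related via the identity ||a - b||^2 = 2 - 2*cos(a, b), so Euclidean k-means on normalized vectors already behaves like spherical (cosine) k-means. A minimal stdlib sketch of the identity, using made-up toy vectors rather than anything from the issue:

```python
import math

def l2_normalize(v):
    # scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # for unit vectors, the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([1.0, 2.0, 0.0])
b = l2_normalize([0.5, 1.0, 3.0])

# On unit vectors: ||a - b||^2 == 2 - 2*cos(a, b)
print(abs(sq_euclidean(a, b) - (2 - 2 * cosine(a, b))) < 1e-9)  # -> True
```

This is why normalizing before Euclidean k-means (as the reporter tried) is the standard way to approximate cosine-based clustering.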
[jira] [Updated] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster
[ https://issues.apache.org/jira/browse/SPARK-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nassir updated SPARK-21244:

Description: updated to add the link to the earlier Stack Overflow question:
https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one
[jira] [Created] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster
Nassir created SPARK-21244:

Summary: KMeans applied to processed text data clumps almost all documents into one cluster
Key: SPARK-21244
URL: https://issues.apache.org/jira/browse/SPARK-21244
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 2.1.1
Reporter: Nassir
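Both issues in this thread concern k-means over tf-idf vectors. For reference, a toy pure-Python sketch of the tf-idf weighting involved, using the smoothed IDF formula that Spark's IDF estimator documents, log((N + 1) / (df + 1)); the three-document corpus is made up for illustration:

```python
import math
from collections import Counter

# Hypothetical three-document corpus (not the 20 Newsgroups data).
docs = [["spark", "kmeans", "cluster"],
        ["spark", "tfidf", "cluster"],
        ["news", "group", "dataset"]]
N = len(docs)
# document frequency: number of documents containing each term
df = Counter(t for d in docs for t in set(d))

def tfidf(doc):
    """Term frequency times smoothed inverse document frequency."""
    tf = Counter(doc)
    return {t: tf[t] * math.log((N + 1) / (df[t] + 1)) for t in tf}

weights = tfidf(docs[0])
# The rarer term "kmeans" (df=1) outweighs the common "spark" (df=2).
print(weights["kmeans"] > weights["spark"])  # -> True
```

The point of the weighting is that rare, discriminative terms dominate the vectors, which is exactly what k-means needs to separate topics.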
[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster
[ https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16066652#comment-16066652 ] Nassir commented on SPARK-20696:

Unfortunately, I have not found a place to make this known to the Spark community yet. My workaround has been to convert the pyspark dataframe to a pandas dataframe, use sklearn's K-Means to cluster the documents (which works well), then convert the pandas dataframe back to pyspark. This works in my situation because the number of documents I am clustering is relatively small. However, I will want to process big data and would need a solution in pyspark with Spark Streaming in future.

Nassir

> tf-idf document clustering with K-means in Apache Spark putting points into one cluster
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-20696
> URL: https://issues.apache.org/jira/browse/SPARK-20696
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Nassir
>
> I am trying to do the classic job of clustering text documents by pre-processing, generating a tf-idf matrix, and then applying K-means. However, testing this workflow on the classic 20 Newsgroups dataset results in most documents being clustered into one cluster. (I initially tried to cluster all documents from 6 of the 20 groups, so I expect 6 clusters.)
> I am implementing this in Apache Spark, as my purpose is to apply this technique to millions of documents. Here is the code, written in pyspark on Databricks:
>
> # declare path to folder containing 6 of the 20 news group categories
> path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % MOUNT_NAME
>
> # read all the text files from the 6 folders; each entry is an entire document
> text_files = sc.wholeTextFiles(path).cache()
>
> # convert rdd to dataframe
> df = text_files.toDF(["filePath", "document"]).cache()
>
> from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer
>
> # tokenize the document text
> tokenizer = Tokenizer(inputCol="document", outputCol="tokens")
> tokenized = tokenizer.transform(df).cache()
>
> from pyspark.ml.feature import StopWordsRemover
> remover = StopWordsRemover(inputCol="tokens", outputCol="stopWordsRemovedTokens")
> stopWordsRemoved_df = remover.transform(tokenized).cache()
>
> hashingTF = HashingTF(inputCol="stopWordsRemovedTokens", outputCol="rawFeatures", numFeatures=20)
> tfVectors = hashingTF.transform(stopWordsRemoved_df).cache()
>
> idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
> idfModel = idf.fit(tfVectors)
> tfIdfVectors = idfModel.transform(tfVectors).cache()
>
> # note that I have also tried normalized data, but get the same result
> from pyspark.ml.feature import Normalizer
> from pyspark.ml.linalg import Vectors
> normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
> l2NormData = normalizer.transform(tfIdfVectors)
>
> from pyspark.ml.clustering import KMeans
>
> # train a KMeans model
> kmeans = KMeans().setK(6).setMaxIter(20)
> km_model = kmeans.fit(l2NormData)
> clustersTable = km_model.transform(l2NormData)
>
> Output showing that most documents get clustered into cluster 0:
>
> ID   number_of_documents_in_cluster
> 0    3024
> 3    5
> 1    3
> 5    2
> 2    2
> 4    1
>
> As you can see, most of my data points get clustered into cluster 0, and I cannot figure out what I am doing wrong, as all the tutorials and code I have come across online point to this method. In addition, I have also tried normalizing the tf-idf matrix before K-means, but that produces the same result. I know cosine distance is a better measure to use, but I expected standard K-means in Apache Spark to provide meaningful results.
> Can anyone help with regard to whether I have a bug in my code, or whether something is missing in my data clustering pipeline?
> (Question also asked on Stack Overflow before:
> http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one)
> Thank you in advance!
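One detail in the pipeline above that plausibly explains the single giant cluster: HashingTF is constructed with numFeatures=20, so the entire vocabulary is hashed into just 20 buckets and nearly every term collides with many others, flattening all documents onto almost identical vectors before k-means ever runs. Spark's default is 2^18 = 262144. A stdlib sketch of the effect, with Python's built-in hash() standing in for Spark's MurmurHash3 (the pigeonhole argument is the same either way):

```python
# 1,000 distinct tokens, assigned feature indices the way a hashing
# vectorizer does: index = hash(token) % num_features.
tokens = ["word%d" % i for i in range(1000)]

for num_features in (20, 2 ** 18):  # 20 as in the report; 2**18 is Spark's default
    indices = {hash(t) % num_features for t in tokens}
    collisions = len(tokens) - len(indices)
    print(num_features, len(indices), collisions)
# With 20 buckets, at most 20 distinct indices exist, so at least 980 of
# the 1,000 tokens collide; with 2**18 buckets, collisions are rare.
```

Under that lens the behaviour is not a distributed-k-means bug but a feature-space bottleneck; raising numFeatures (or using CountVectorizer, which is already imported in the snippet but unused) would be the first thing to try.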
[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster
[ https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044356#comment-16044356 ] Nassir commented on SPARK-20696:

This appears to be a problem with the implementation of the K-means algorithm in a distributed environment. The sklearn library on the same dataset produces distinct clusters, while pyspark k-means clustering fails. Has anyone actually done text clustering using K-means in Apache Spark? E.g., cluster the 20 Newsgroups into 20 clusters and see whether you get multiple clusters or just one large one?
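On the question above of whether k-means ever yields multiple clusters: on well-separated inputs, plain Lloyd's iteration (the scheme underlying both the sklearn and MLlib solvers) does recover distinct clusters, which suggests the single-cluster result comes from the feature vectors rather than the algorithm itself. A toy pure-Python sketch on made-up 2-D vectors, not tf-idf data:

```python
def kmeans(points, centers, iters=10):
    """Plain Lloyd's algorithm; a toy stand-in for the MLlib / sklearn solvers."""
    k = len(centers)
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # recompute each center as its cluster mean (keep old center if empty)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return groups

# Two well-separated toy "document vectors"; one initial center per blob
# (Spark uses k-means|| seeding; fixed init just keeps the demo deterministic).
blob_a = [[1.0 + 0.01 * i, 0.0] for i in range(10)]
blob_b = [[0.0, 1.0 + 0.01 * i] for i in range(10)]
groups = kmeans(blob_a + blob_b, centers=[blob_a[0], blob_b[0]])
print(sorted(len(g) for g in groups))  # -> [10, 10]: balanced, not one clump
```

When the vectors themselves are nearly indistinguishable (as heavily collided hashed features would be), the same algorithm degenerates to one dominant cluster, matching the reported counts.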
[jira] [Created] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster
Nassir created SPARK-20696:

Summary: tf-idf document clustering with K-means in Apache Spark putting points into one cluster
Key: SPARK-20696
URL: https://issues.apache.org/jira/browse/SPARK-20696
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 2.1.0
Reporter: Nassir