Nassir created SPARK-20696:
------------------------------

             Summary: tf-idf document clustering with K-means in Apache Spark 
putting points into one cluster
                 Key: SPARK-20696
                 URL: https://issues.apache.org/jira/browse/SPARK-20696
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.1.0
            Reporter: Nassir


I am trying to do the classic job of clustering text documents by 
pre-processing, generating a tf-idf matrix, and then applying K-means. However, 
testing this workflow on the classic 20NewsGroup dataset results in most 
documents being clustered into one cluster. (I initially tried to cluster all 
documents from 6 of the 20 groups, so I expected 6 clusters.)

I am implementing this in Apache Spark because my purpose is to utilise this 
technique on millions of documents. Here is the code, written in PySpark on 
Databricks:

#declare path to the folder containing 6 of the 20 news group categories
path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % MOUNT_NAME

#read all the text files from the 6 folders; each entity is an entire document
text_files = sc.wholeTextFiles(path).cache()

#convert rdd to dataframe
df = text_files.toDF(["filePath", "document"]).cache()

from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer 

#tokenize the document text
tokenizer = Tokenizer(inputCol="document", outputCol="tokens")
tokenized = tokenizer.transform(df).cache()

from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="tokens", outputCol="stopWordsRemovedTokens")
stopWordsRemoved_df = remover.transform(tokenized).cache()

hashingTF = HashingTF(inputCol="stopWordsRemovedTokens", outputCol="rawFeatures", numFeatures=200000)
tfVectors = hashingTF.transform(stopWordsRemoved_df).cache()

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
idfModel = idf.fit(tfVectors)

tfIdfVectors = idfModel.transform(tfVectors).cache()

#note that I have also tried to use normalized data, but get the same result
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors

normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
l2NormData = normalizer.transform(tfIdfVectors)

from pyspark.ml.clustering import KMeans

# Trains a KMeans model.
kmeans = KMeans().setK(6).setMaxIter(20)
km_model = kmeans.fit(l2NormData)

clustersTable = km_model.transform(l2NormData)
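
(For reference, a minimal sketch of how the per-cluster counts below can be 
computed; "prediction" is the default output column of pyspark.ml's KMeans:)

#sketch: count documents per cluster, largest cluster first
clustersTable.groupBy("prediction").count().orderBy("count", ascending=False).show()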

Output showing most documents get clustered into cluster 0:

ID   number_of_documents_in_cluster
0    3024
3    5
1    3
5    2
2    2
4    1

As you can see, most of my data points end up in cluster 0, and I cannot 
figure out what I am doing wrong, as all the tutorials and code I have come 
across online point to this method.

In addition, I have tried normalizing the tf-idf matrix before K-means, but 
that produces the same result. I know cosine distance is a better measure to 
use, but I expected standard K-means in Apache Spark to still give meaningful 
results.
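
(To illustrate why I expected the normalized run to behave like cosine-based 
clustering, here is a small self-contained NumPy sketch, separate from the 
pipeline above, of the identity ||a - b||^2 = 2 * (1 - cos(a, b)) for unit 
vectors:)

#sketch (plain numpy, illustrative only): for L2-normalized vectors, squared
#Euclidean distance equals 2*(1 - cosine similarity), so Euclidean K-means on
#normalized vectors is effectively clustering by cosine similarity
import numpy as np
a = np.random.rand(5); a /= np.linalg.norm(a)
b = np.random.rand(5); b /= np.linalg.norm(b)
lhs = np.sum((a - b) ** 2)
rhs = 2.0 * (1.0 - np.dot(a, b))
assert np.isclose(lhs, rhs)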

Can anyone help with whether I have a bug in my code, or whether something is 
missing in my data clustering pipeline?
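
(In case it helps diagnosis, here is a minimal sketch, assuming the Spark 2.1 
pyspark.ml API, of the sanity checks I can run on the fitted model:)

#sketch: sanity checks on the fitted model
centers = km_model.clusterCenters()       #list of cluster centers (numpy arrays)
print(len(centers))                       #expect 6
wssse = km_model.computeCost(l2NormData)  #within-cluster sum of squared errors
print(wssse)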

(Question also asked on Stack Overflow: 
http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one)

Thank you in advance!


