Vasanthkumar Velayudham created SPARK-25219:
-----------------------------------------------

             Summary: KMeans Clustering - Text Data - Results are incorrect
                 Key: SPARK-25219
                 URL: https://issues.apache.org/jira/browse/SPARK-25219
             Project: Spark
          Issue Type: Bug
          Components: Spark Submit
    Affects Versions: 2.3.0
            Reporter: Vasanthkumar Velayudham


Hello Everyone,

I am facing issues using KMeans clustering on my text data. When I apply
clustering after transformations such as RegexTokenizer, stop-word removal,
HashingTF, and IDF, the generated clusters are not sensible: one cluster ends
up with a disproportionately large share of the data points assigned to it.
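
For illustration only, the pipeline described above might look like the
following minimal sketch. It assumes the PySpark ML API; the column names,
the values of k and numFeatures, and both helper function names are
hypothetical and are not taken from the reporter's code. The second helper
is plain Python and simply quantifies how lopsided the cluster assignments
are.

```python
from collections import Counter


def build_text_kmeans_pipeline(k=5, num_features=1 << 18):
    """Assemble the stages named in the report (requires pyspark)."""
    # Imported lazily so cluster_shares() below works without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import HashingTF, IDF, RegexTokenizer, StopWordsRemover

    # Assumption: the input DataFrame has a string column named "text".
    tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
    remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
    tf = HashingTF(inputCol="filtered", outputCol="tf", numFeatures=num_features)
    idf = IDF(inputCol="tf", outputCol="features")
    kmeans = KMeans(featuresCol="features", predictionCol="prediction", k=k, seed=1)
    return Pipeline(stages=[tokenizer, remover, tf, idf, kmeans])


def cluster_shares(assignments):
    """Return {cluster_id: fraction of points} for an iterable of cluster ids."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {cid: n / total for cid, n in counts.items()}
```

Given a DataFrame df, one could fit with build_text_kmeans_pipeline().fit(df),
collect the "prediction" column from the transformed output, and pass those
ids to cluster_shares() to report the fraction of points in each cluster.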

With similar preprocessing and the same attributes, I am able to cluster this
data using scikit-learn's KMeans algorithm.

Searching online, I see that many others have reported the same issue with
Spark's KMeans clustering library.

I request your help in fixing this issue.

Please let me know if you require any additional details.



