[ 
https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596109#comment-16596109
 ] 

Marco Gaido commented on SPARK-25219:
-------------------------------------

Well, there are many differences between the Spark ML and SKLearn code you've 
posted. First of all, the number of clusters is different. Moreover, the input 
data passed to KMeans may differ between the two.

Please store the data after the TF-IDF transformation, since that is the actual 
input to KMeans. Then take the KMeans results and the centroids, and check 
whether the distance from each point to the centroid it has been assigned to is 
lower than its distance to all the other centroids. If that is the case, there 
is no issue with KMeans. You may have to increase the number of runs, change 
the initialization method, change the seed and so on to get a different result, 
but there is no evident bug in the algorithm itself. If that is not the case, 
please share the input data to KMeans and the reproducer, and I can investigate 
the problem. Thanks.
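
For reference, the consistency check described above could be sketched roughly 
as follows. This is only an illustration in plain Python, not code from Spark 
or from the reporter: the helper names (squared_distance, misassigned_points), 
the tolerance and the toy data are all made up. It assumes the TF-IDF vectors, 
the centroids (e.g. from KMeansModel.clusterCenters() in PySpark) and the 
predicted cluster index of each point have already been collected as dense 
lists, and that Euclidean distance is the distance measure in use.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two dense vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def misassigned_points(
    points: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
    assignments: Sequence[int],
    tol: float = 1e-9,
) -> List[int]:
    """Return the indices of points whose assigned centroid is not the nearest.

    `tol` (an assumed value) absorbs floating-point noise and exact ties.
    """
    bad = []
    for i, (point, assigned) in enumerate(zip(points, assignments)):
        dists = [squared_distance(point, c) for c in centroids]
        # The assignment is consistent if no other centroid is strictly closer.
        if dists[assigned] > min(dists) + tol:
            bad.append(i)
    return bad


# Toy data, invented purely for illustration.
centroids = [[0.0, 0.0], [10.0, 10.0]]
points = [[1.0, 0.0], [9.0, 10.0], [0.0, 1.0]]

print(misassigned_points(points, centroids, [0, 1, 0]))  # prints []
print(misassigned_points(points, centroids, [0, 1, 1]))  # prints [2]
```

An empty result means every point sits with its nearest centroid, i.e. the 
assignment step is consistent and any unbalanced clusters come from the data or 
the parameters rather than from the assignment itself.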

> KMeans Clustering - Text Data - Results are incorrect
> -----------------------------------------------------
>
>                 Key: SPARK-25219
>                 URL: https://issues.apache.org/jira/browse/SPARK-25219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Vasanthkumar Velayudham
>            Priority: Major
>         Attachments: Apache_Logs_Results.xlsx, SKLearn_Kmeans.txt, 
> Spark_Kmeans.txt
>
>
> Hello Everyone,
> I am facing issues with the usage of KMeans Clustering on my text data. When 
> I apply clustering on my text data, after performing various transformations 
> such as RegexTokenizer, stop word processing, HashingTF and IDF, the generated 
> clusters are not proper and one cluster is found to have a lot of data points 
> assigned to it.
> I am able to perform clustering with a similar kind of processing and with the 
> same attributes using the SKLearn KMeans algorithm. 
> Upon searching the internet, I observe that many have reported the same issue 
> with the KMeans clustering library of Spark.
> Request your help in fixing this issue.
> Please let me know if you require any additional details.



