[ https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591423#comment-16591423 ]
Marco Gaido commented on SPARK-25219:
-------------------------------------

Hi [~VVasanth],

a JIRA like this is very difficult to work on: reporting only that something returns a result other than the expected one is not a great starting point for taking action. It would be great if you could provide a simple reproducer. The reproducer should involve only one thing if possible (in this case KMeans, without any other transformations), with a set of parameters that reproduces the problem and the expected result that the other libraries return for the same parameters. Once the problem is clearer, I am happy to work on it, but first we need to understand whether this is indeed an issue and how to reproduce it.

Thanks.

> KMeans Clustering - Text Data - Results are incorrect
> -----------------------------------------------------
>
>                 Key: SPARK-25219
>                 URL: https://issues.apache.org/jira/browse/SPARK-25219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Vasanthkumar Velayudham
>            Priority: Major
>
> Hello Everyone,
> I am facing issues with the usage of KMeans Clustering on my text data. When
> I apply clustering on my text data, after performing various transformations
> such as RegexTokenizer, Stopword Processing, HashingTF, IDF, the generated
> clusters are not proper and one cluster is found to have a lot of data points
> assigned to it.
> I am able to perform clustering with a similar kind of processing and with the
> same attributes using the SKLearn KMeans algorithm.
> Upon searching the internet, I observe that many have reported the same issue
> with the KMeans clustering library of Spark.
> Request your help in fixing this issue.
> Please let me know if you require any additional details.
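To illustrate the kind of reproducer requested in the comment, here is a minimal sketch of its "expected result" half: a fixed set of vectors, a fixed k, and fixed initial centers, run through a plain-Python Lloyd iteration so the reference assignment is fully deterministic. Everything in it is an assumption made for illustration (the four 2-D points, the helper names `assign`, `update`, and `lloyd_kmeans`, and the choice to pin the initial centers); none of it comes from the reporter's data or from Spark itself. The same points and k could then be passed to Spark's KMeans and to scikit-learn's KMeans to see whether either disagrees with this reference.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def update(points: Sequence[Point], labels: Sequence[int],
           centers: Sequence[Point]) -> List[Point]:
    """Move each center to the mean of its members; empty clusters stay put."""
    new_centers: List[Point] = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members)
                  for d in range(len(old)))
        )
    return new_centers


def lloyd_kmeans(points: Sequence[Point], initial_centers: Sequence[Point],
                 max_iter: int = 20) -> Tuple[List[int], List[Point]]:
    """Plain Lloyd iteration from explicitly given (hypothetical) centers."""
    centers = list(initial_centers)
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centers


# Hypothetical toy input: two well-separated groups of two points each.
points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
initial_centers = [(0.0, 0.0), (9.0, 8.0)]

labels, centers = lloyd_kmeans(points, initial_centers)
sizes = Counter(labels)

print("labels:", labels)
print("centers:", centers)
print("cluster sizes:", dict(sizes))
```

Reporting the cluster sizes alongside the labels matters here because the complaint in the issue is that one cluster absorbs most of the points; a reproducer that prints sizes for both libraries on identical input makes that claim checkable.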