[ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651793#comment-15651793 ]
nirav patel commented on SPARK-18374:
-------------------------------------

[~srowen] Do you mean how to tokenize words in a language-agnostic way? That is, is there a language where it makes sense to break "won't" into "won" and "t"? At least other tokenizers, such as the Lucene family, don't do this; it seems plainly incorrect. For a language where it makes sense to tokenize on "'" instead of removing it, there should be a separate tokenizer, and its stopwords list can then follow the same convention. In that regard the tokenizer should not be generic, and the same applies to the stopwords.

> Incorrect words in StopWords/english.txt
> ----------------------------------------
>
>                 Key: SPARK-18374
>                 URL: https://issues.apache.org/jira/browse/SPARK-18374
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.1
>            Reporter: nirav patel
>
> I was just double-checking english.txt for the list of stopwords, as I felt it was
> removing valid tokens like "won". I think the issue is that the english.txt list is
> missing the apostrophe character and all characters after the apostrophe. So "won't"
> became "won" in that list, and "wouldn't" became "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e., both "won't" and "wont" should be
> part of english.txt, since some tokenizers might remove special characters. But
> "won" obviously shouldn't be in this list.
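To make the reported behavior concrete, here is a minimal sketch (assuming the Spark 2.0.x ML API; the app name, input strings, and column names are illustrative) of how a RegexTokenizer configured to split on non-word characters strips the apostrophe, after which StopWordsRemover's default English list filters the bare stems "won" and "wouldn":

{code:scala}
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.SparkSession

// Splitting on non-word characters drops the apostrophe, so "won't"
// becomes ["won", "t"]; the default English stopwords list (which
// contains the bare stem "won") then removes "won" entirely.
val spark = SparkSession.builder()
  .appName("stopwords-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq("I won't go", "They wouldn't say").toDF("text")

val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .setPattern("\\W")  // split on any non-word character, including "'"

val remover = new StopWordsRemover()
  .setInputCol("tokens")
  .setOutputCol("filtered")
  .setStopWords(StopWordsRemover.loadDefaultStopWords("english"))

remover.transform(tokenizer.transform(df))
  .select("tokens", "filtered")
  .show(false)
// Roughly: [i, won, t, go] -> [go] and [they, wouldn, t, say] -> [say],
// i.e. the valid word "won" is silently filtered out.
{code}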
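And a hedged sketch of the fix the report proposes: supply a list carrying both the contracted form and its apostrophe-stripped variant, but not the bare stem. The list below is illustrative and truncated, not the actual corrected english.txt:

{code:scala}
import org.apache.spark.ml.feature.StopWordsRemover

// Illustrative only: both styles of each contraction, per the report's
// suggestion, so the list works whether a tokenizer splits on "'" or
// simply removes special characters.
val correctedStopWords = Array(
  "won't", "wont",
  "wouldn't", "wouldnt",
  "shouldn't", "shouldnt"
  // ... remaining contractions elided
)

val fixedRemover = new StopWordsRemover()
  .setInputCol("tokens")
  .setOutputCol("filtered")
  .setStopWords(correctedStopWords)
// With this list, "won" (the past tense of "win") is no longer filtered.
{code}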