[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367089#comment-14367089 ]
Abou Haydar Elias commented on SPARK-5874:
------------------------------------------

The Tokenizer currently converts the input string to lowercase and then splits it on whitespace only. I suggest making the Tokenizer pipeline stage more flexible, so that we can eventually add stemming and other text analysis directly into it.

There are many post-tokenization steps that can be done, including (but not limited to):

- [Stemming|http://en.wikipedia.org/wiki/Stemming]: Replacing words with their stems. For instance, with English stemming "bikes" is replaced with "bike", so a query for "bike" finds documents containing either "bike" or "bikes".
- Stop Words Filtering: Common words like "the", "and" and "a" rarely add any value to a search. Removing them shrinks the index size and improves performance. It may also reduce some "noise" and actually improve search quality.
- [Text Normalization|http://en.wikipedia.org/wiki/Text_normalization]: Stripping accents and other character markings can make for better searching.
- Synonym Expansion: Adding synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.

So what do you think?

> How to improve the current ML pipeline API?
> -------------------------------------------
>
>                 Key: SPARK-5874
>                 URL: https://issues.apache.org/jira/browse/SPARK-5874
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> I created this JIRA to collect feedback about the ML pipeline API we
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4
> with confidence, which requires valuable input from the community. I'll
> create sub-tasks for each major issue.
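
To make the post-tokenization steps suggested in the comment above concrete, here is a rough sketch in plain Python. It is not Spark's Tokenizer and not any existing ML pipeline API. The names analyze, naive_stem, STOP_WORDS and SYNONYMS are hypothetical, the stop-word list and synonym table are made up for illustration, and the stemmer is a deliberately naive stand-in for a real stemming algorithm.

```python
import unicodedata

# Hypothetical stop-word list and synonym table, for illustration only.
STOP_WORDS = {"the", "and", "a"}
SYNONYMS = {"bike": ["bicycle"]}


def tokenize(text):
    """Lowercase the input and split on whitespace, as the comment describes."""
    return text.lower().split()


def strip_accents(token):
    """Text normalization: decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Toy stemmer: drop a single trailing 's'. Not a real stemming algorithm."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Run the post-tokenization steps in order and return a flat token list."""
    out = []
    for token in tokenize(text):
        token = strip_accents(token)
        if token in STOP_WORDS:
            continue
        token = naive_stem(token)
        out.append(token)
        # Synonym expansion: a flat list cannot express "same token position",
        # so synonyms are simply appended right after the token they expand.
        out.extend(SYNONYMS.get(token, []))
    return out
```

With these assumptions, analyze("The bikes and a Café") returns ["bike", "bicycle", "cafe"]. The order of the steps (normalize, filter stop words, stem, expand synonyms) is one possible choice; a flexible Tokenizer stage would presumably let the user configure which steps run and in what order.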