[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367089#comment-14367089
 ] 

Abou Haydar Elias commented on SPARK-5874:
------------------------------------------

As of now, the Tokenizer converts the input string to lowercase and then splits 
it on whitespace only. 
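
To make the behavior described above concrete, here is a minimal plain-Python sketch. It is not Spark's implementation, and the function name `simple_tokenize` is hypothetical; it only mirrors the "lowercase, then split on whitespace" behavior.

```python
def simple_tokenize(text):
    """Lowercase the input, then split it on whitespace.

    Illustration only: str.split() with no arguments splits on runs of
    whitespace and drops empty strings, which may differ in detail from
    what the Spark Tokenizer does.
    """
    return text.lower().split()


if __name__ == "__main__":
    print(simple_tokenize("The Bikes are FAST"))
    # Punctuation stays attached to the tokens, because only whitespace
    # is treated as a separator.
    print(simple_tokenize("Hello, world!"))
```

Note that punctuation stays attached to the tokens ("hello," and "world!"), which is one reason a more flexible tokenizer would help.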

I suggest making the Tokenizer pipeline stage more flexible, so that we can 
eventually add stemming and text analysis directly to the Tokenizer.

There are many post-tokenization steps that can be done, including (but not 
limited to):

- [Stemming|http://en.wikipedia.org/wiki/Stemming] – Replacing words with their 
stems. For instance, with English stemming "bikes" is replaced with "bike", so a 
query for "bike" can find both documents containing "bike" and those containing 
"bikes".
- Stop Words Filtering – Common words like "the", "and" and "a" rarely add any 
value to a search. Removing them shrinks the index size and increases 
performance. It may also reduce some "noise" and actually improve search 
quality.
- [Text Normalization|http://en.wikipedia.org/wiki/Text_normalization] – 
Stripping accents and other character markings can make for better searching.
- Synonym Expansion – Adding in synonyms at the same token position as the 
current word can mean better matching when users search with words in the 
synonym set.
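
The four steps listed above could be chained roughly as follows. This is a plain-Python sketch under stated assumptions, not a proposal for the actual Spark API: the names `analyze`, `toy_stem`, `strip_accents`, `STOP_WORDS` and `SYNONYMS` are hypothetical, the stop-word list and synonym map are tiny made-up examples, and the "stemmer" is a toy rule that only strips a trailing "s" (a real one would use an algorithm such as Porter's).

```python
import unicodedata

STOP_WORDS = {"the", "and", "a"}     # hypothetical, deliberately tiny list
SYNONYMS = {"bike": ["bicycle"]}     # hypothetical synonym set


def strip_accents(token):
    """Text normalization: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    """Toy stemming rule (NOT a real stemmer): strip one trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Return (position, token) pairs after all post-tokenization steps.

    The position is the index of the token in the whitespace-split input,
    so removed stop words leave gaps and synonyms share the position of
    the word they expand.
    """
    result = []
    for position, token in enumerate(text.lower().split()):
        token = strip_accents(token)
        if token in STOP_WORDS:          # stop-word filtering
            continue
        stem = toy_stem(token)           # stemming
        result.append((position, stem))
        for synonym in SYNONYMS.get(stem, []):   # synonym expansion
            result.append((position, synonym))
    return result


if __name__ == "__main__":
    print(analyze("The bikes and a caf\u00e9"))
```

For the input "The bikes and a café", the stop words are dropped, "bikes" becomes "bike" with "bicycle" added at the same position, and "café" is normalized to "cafe". Making each step a separate function keeps them independently replaceable, which is the kind of flexibility suggested here.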

So, what do you think?

> How to improve the current ML pipeline API?
> -------------------------------------------
>
>                 Key: SPARK-5874
>                 URL: https://issues.apache.org/jira/browse/SPARK-5874
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> I created this JIRA to collect feedback about the ML pipeline API we 
> introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
> with confidence, which requires valuable input from the community. I'll 
> create sub-tasks for each major issue.



