[
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph K. Bradley updated SPARK-11069:
--------------------------------------
Shepherd: Joseph K. Bradley
> Add RegexTokenizer option to convert to lowercase
> -------------------------------------------------
>
> Key: SPARK-11069
> URL: https://issues.apache.org/jira/browse/SPARK-11069
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: yuhao yang
> Priority: Minor
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer
> does not. It would be nice to add an option to RegexTokenizer to convert to
> lowercase. Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat
> upper/lower case differently.
> --> I'd vote for conversion before matching. If a user needs full control,
> they can leave the option off and convert to lowercase manually.
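
To make the before/after question concrete, here is a minimal sketch in plain Python using only the standard `re` module. It is not the Spark API: the helper `tokenize` and its `lowercase_first` flag are hypothetical names chosen for illustration, and the sketch assumes the regex matches tokens rather than gaps.

```python
import re


def tokenize(text, pattern, lowercase_first):
    """Hypothetical helper (not part of Spark) showing both orderings.

    lowercase_first=True  -> lowercase the input, then apply the regex.
    lowercase_first=False -> apply the regex, then lowercase each token.
    """
    if lowercase_first:
        return re.findall(pattern, text.lower())
    return [token.lower() for token in re.findall(pattern, text)]


text = "Spark runs on Hadoop"

# A case-insensitive pattern gives the same tokens either way.
words = r"\w+"
print(tokenize(text, words, lowercase_first=True))
print(tokenize(text, words, lowercase_first=False))

# A case-sensitive pattern (capitalized words only) behaves differently:
# lowercasing first removes the capitals the pattern relies on.
capitalized = r"[A-Z][a-z]+"
print(tokenize(text, capitalized, lowercase_first=True))   # []
print(tokenize(text, capitalized, lowercase_first=False))  # ['spark', 'hadoop']
```

The two orderings only diverge when the pattern itself distinguishes upper from lower case, which is the situation the "After" option is meant to cover.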