Joseph K. Bradley created SPARK-11069:
-----------------------------------------

             Summary: Add RegexTokenizer option to convert to lowercase
                 Key: SPARK-11069
                 URL: https://issues.apache.org/jira/browse/SPARK-11069
             Project: Spark
          Issue Type: New Feature
          Components: ML
            Reporter: Joseph K. Bradley
            Priority: Minor


Tokenizer converts strings to lowercase automatically, but RegexTokenizer does 
not.  It would be nice to add an option to RegexTokenizer to convert to 
lowercase.  Proposal:
* call the Boolean Param "toLowercase"
* set default to false (so behavior does not change)

*Q*: Should conversion to lowercase happen before or after regex matching?
* Before: This is simpler.
* After: This gives the user full control since they can have the regex treat 
upper/lower case differently.
--> I'd vote for conversion before matching.  If a user needs full control, 
they can leave the option off and convert the tokens to lowercase themselves.
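
To make the before/after question concrete, here is a rough sketch in plain 
Python using the standard "re" module. It is not Spark code: the helper 
"tokenize" and its "to_lowercase" / "lowercase_first" arguments are 
hypothetical names made up for illustration, and it assumes the regex matches 
tokens rather than gaps.

```python
import re


def tokenize(text, pattern, to_lowercase=False, lowercase_first=True):
    """Return the substrings of `text` matching `pattern` (hypothetical helper)."""
    if to_lowercase and lowercase_first:
        # "Before": the regex only ever sees lowercased input.
        return re.findall(pattern, text.lower())
    tokens = re.findall(pattern, text)
    if to_lowercase:
        # "After": the regex sees the original case; tokens are lowercased afterwards.
        tokens = [t.lower() for t in tokens]
    return tokens


text = "Spark ML and Apache JIRA"
pattern = r"[A-Z]{2,}"  # case-sensitive: runs of two or more capital letters

default = tokenize(text, pattern)  # option off: behavior unchanged
before = tokenize(text, pattern, to_lowercase=True, lowercase_first=True)
after = tokenize(text, pattern, to_lowercase=True, lowercase_first=False)

print(default)  # ['ML', 'JIRA']
print(before)   # []
print(after)    # ['ml', 'jira']
```

With a case-sensitive pattern like this one, converting before matching 
returns nothing, because the regex never sees an uppercase letter. That is the 
"full control" case, which the sketch covers by matching first and lowercasing 
the tokens afterwards.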




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
