[ 
https://issues.apache.org/jira/browse/SPARK-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313996#comment-14313996
 ] 

Augustin Borsu edited comment on SPARK-5566 at 2/11/15 9:58 AM:
----------------------------------------------------------------

https://github.com/apache/spark/pull/4504
I propose a tokenizer loosely based on the NLTK regexTokenizer.
I didn't create a standalone tokenizer in mllib that I wrap in ml, since I 
don't think a standalone tokenizer is necessarily needed in mllib, but if 
people disagree I can change that.
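For illustration, here is a minimal sketch of the regex-splitting idea behind such a tokenizer, written without Spark's Param machinery (the object name, default pattern, and the "gaps" option are illustrative assumptions, loosely mirroring NLTK's RegexpTokenizer, not the actual PR code):

```scala
// Minimal sketch of a regex-based tokenizer (hypothetical, not the PR code).
// "gaps = true" treats the pattern as the separator between tokens;
// "gaps = false" treats the pattern as the tokens themselves,
// mirroring the matching modes of NLTK's RegexpTokenizer.
object RegexTokenizerSketch {
  def tokenize(text: String,
               pattern: String = "\\W+",
               gaps: Boolean = true): Seq[String] = {
    val re = pattern.r
    val tokens =
      if (gaps) re.split(text).toSeq   // pattern marks the gaps
      else re.findAllIn(text).toSeq    // pattern marks the tokens
    tokens.filter(_.nonEmpty)          // drop empties from leading separators
  }

  def main(args: Array[String]): Unit = {
    println(tokenize("Hello, Spark ML!"))                  // split on non-word runs
    println(tokenize("a1 b22 c333", "\\d+", gaps = false)) // extract digit runs
  }
}
```

Exposing `pattern` and `gaps` as ML Params (rather than constructor arguments) is what would let a cross-validator sweep over them.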


was (Author: augustinb):
We could use a tokenizer like this, but we would need to add regex and 
Array[String] parameter types to be able to change those parameters in a 
cross-validation.
https://github.com/apache/spark/pull/4504

> Tokenizer for mllib package
> ---------------------------
>
>                 Key: SPARK-5566
>                 URL: https://issues.apache.org/jira/browse/SPARK-5566
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> There exist tokenizer classes in the spark.ml.feature package and in the 
> LDAExample in the spark.examples.mllib package.  The Tokenizer in the 
> LDAExample is more advanced and should be made into a full-fledged public 
> class in spark.mllib.feature.  The spark.ml.feature.Tokenizer class should 
> become a wrapper around the new Tokenizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
