[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/9092 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-155248366 merging with master, branch-1.6 Thank you for the PR!
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-155243776
**[Test build #45439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45439/consoleFull)** for PR 9092 at commit [`2663cbf`](https://github.com/apache/spark/commit/2663cbf213548c0631e88d886e8010f4dcac163c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-155244180 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-155239682 LGTM pending tests
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-155231560 **[Test build #45439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45439/consoleFull)** for PR 9092 at commit [`2663cbf`](https://github.com/apache/spark/commit/2663cbf213548c0631e88d886e8010f4dcac163c).
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-155230865 Merged build started.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-155230847 Merged build triggered.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9092#discussion_r44311675
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala ---
    @@ -100,10 +100,25 @@ class RegexTokenizer(override val uid: String)
       /** @group getParam */
       def getPattern: String = $(pattern)
    -  setDefault(minTokenLength -> 1, gaps -> true, pattern -> "\\s+")
    +  /**
    +   * Indicates whether to convert all characters to lowercase before tokenizing.
    +   * Default: false
    --- End diff --
default needs to be updated
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-155150009 Looks good except that one outdated doc line
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154907032 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154906966
**[Test build #45328 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45328/consoleFull)** for PR 9092 at commit [`43fd8e9`](https://github.com/apache/spark/commit/43fd8e954b53599ece65c5ee48f24c9b036a75a6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154903263 **[Test build #45328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45328/consoleFull)** for PR 9092 at commit [`43fd8e9`](https://github.com/apache/spark/commit/43fd8e954b53599ece65c5ee48f24c9b036a75a6).
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154902998 Merged build started.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154902986 Merged build triggered.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154861500
**[Test build #45314 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45314/consoleFull)** for PR 9092 at commit [`0c07366`](https://github.com/apache/spark/commit/0c07366ea6d397859b5761fd67f31db851834629).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154861510 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154859228 **[Test build #45314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45314/consoleFull)** for PR 9092 at commit [`0c07366`](https://github.com/apache/spark/commit/0c07366ea6d397859b5761fd67f31db851834629).
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154858828 Merged build started.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154858819 Merged build triggered.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154832573
Yes, I agree.
1. Tokenizer and RegexTokenizer should have consistent behavior.
2. Whether to set toLower to true is a matter of preference. I assume for ML applications it's more common to have toLower as true.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-154743585 I'm wondering now if we should set it to convert to lowercase by default. I know it breaks behavior, but otherwise, we'll introduce an inconsistency in the API (between Tokenizer and RegexTokenizer) which will be around for a long time. What do you think?
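The trade-off raised in this comment can be illustrated with a small stand-alone sketch. It does not use Spark and is not code from the PR: `tokenizerLike` and `regexTokenizerLike` are hypothetical helpers, and splitting on `\\s+` in both is an assumption made so that lowercasing is the only difference between them.

```scala
// Hypothetical stand-ins for the two transformers; not Spark code.
object DefaultConsistencySketch {
  // Tokenizer as described in the JIRA: always lowercases before splitting.
  def tokenizerLike(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").toSeq

  // RegexTokenizer with the lowercasing option under discussion.
  def regexTokenizerLike(text: String, toLowercase: Boolean): Seq[String] = {
    val str = if (toLowercase) text.toLowerCase else text
    str.split("\\s+").toSeq
  }

  def main(args: Array[String]): Unit = {
    val text = "Spark ML Rocks"
    // With the option off, the two tokenizers disagree on case.
    println(regexTokenizerLike(text, toLowercase = false) == tokenizerLike(text))
    // With the option on, they agree.
    println(regexTokenizerLike(text, toLowercase = true) == tokenizerLike(text))
  }
}
```

Under these assumptions, a default of false preserves the old RegexTokenizer output but leaves the two classes inconsistent, while a default of true aligns them at the cost of changing existing output.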
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9092#discussion_r44216481
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala ---
    @@ -69,6 +69,18 @@ class RegexTokenizerSuite extends SparkFunSuite with MLlibTestSparkContext {
         ))
         testRegexTokenizer(tokenizer2, dataset2)
       }
    +
    +  test("RegexTokenizer with toLowercase true"){
    --- End diff --
style: space before brace at end of line
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9092#discussion_r44216479
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala ---
    @@ -100,10 +100,25 @@ class RegexTokenizer(override val uid: String)
       /** @group getParam */
       def getPattern: String = $(pattern)
    -  setDefault(minTokenLength -> 1, gaps -> true, pattern -> "\\s+")
    +  /**
    +   * Indicates whether to convert all characters to lowercase before tokenizing.
    +   * Default: false
    +   * @group param
    +   */
    +  val toLowercase: BooleanParam = new BooleanParam(this, "toLowercase",
    --- End diff --
final val
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-147640440 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-147640443 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43630/ Test PASSed.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-147639962
[Test build #43630 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43630/console) for PR 9092 at commit [`ce09ef5`](https://github.com/apache/spark/commit/ce09ef532f2ec633e508840097fd0ac1b5285284).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-147628224 [Test build #43630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43630/consoleFull) for PR 9092 at commit [`ce09ef5`](https://github.com/apache/spark/commit/ce09ef532f2ec633e508840097fd0ac1b5285284).
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-147627765 Merged build started.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9092#issuecomment-147627734 Merged build triggered.
[GitHub] spark pull request: [SPARK-11069] [ML] Add RegexTokenizer option t...
GitHub user hhbyyh opened a pull request: https://github.com/apache/spark/pull/9092
[SPARK-11069] [ML] Add RegexTokenizer option to convert to lowercase
jira: https://issues.apache.org/jira/browse/SPARK-11069
quotes from jira:
Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase.
Proposal:
 * call the Boolean Param "toLowercase"
 * set default to false (so behavior does not change)
Actually sklearn converts to lowercase before tokenizing too
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/hhbyyh/spark tokenLower
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/9092.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #9092
commit ce09ef532f2ec633e508840097fd0ac1b5285284
Author: Yuhao Yang
Date: 2015-10-13T07:14:55Z
    add tolowercase to regexTokenizer
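For readers who want to see what the proposed option amounts to, here is a minimal stand-alone sketch in plain Scala. It is not the code from this PR and does not depend on Spark: the `tokenize` helper is hypothetical, and its parameter names and defaults simply mirror the `pattern`, `gaps`, `minTokenLength` and `toLowercase` params quoted in the review diffs, with `toLowercase` defaulting to false as originally proposed.

```scala
// Hypothetical stand-alone sketch; not the code from the pull request.
object RegexTokenizerSketch {
  /** Tokenizes `text` with a regex, optionally lowercasing it first. */
  def tokenize(
      text: String,
      pattern: String = "\\s+",
      gaps: Boolean = true,
      minTokenLength: Int = 1,
      toLowercase: Boolean = false): Seq[String] = {
    val re = pattern.r
    // Lowercase before tokenizing, as the proposed param doc describes.
    val str = if (toLowercase) text.toLowerCase else text
    // gaps = true: the pattern matches separators; otherwise it matches tokens.
    val tokens = if (gaps) re.split(str).toSeq else re.findAllIn(str).toSeq
    // Drop tokens shorter than the minimum length (this also drops empty strings).
    tokens.filter(_.length >= minTokenLength)
  }

  def main(args: Array[String]): Unit = {
    // Case is preserved unless the option is switched on.
    println(tokenize("Te,st. Punct").mkString("|"))
    println(tokenize("Te,st. Punct", toLowercase = true).mkString("|"))
  }
}
```

Lowercasing the whole input once, before splitting or matching, keeps the option independent of `gaps`, so the same flag works whether the pattern describes separators or tokens.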