[GitHub] spark pull request: [SPARK-12708][UI] Sorting task error in Stages...
Github user yoshidakuy commented on the pull request: https://github.com/apache/spark/pull/10663#issuecomment-170205527 Thanks for the comments, and I agree. Will fix later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12504][SQL] [Backport-1.6] Masking cred...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10669#issuecomment-170203271 **[Test build #2355 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2355/consoleFull)** for PR 10669 at commit [`212b4db`](https://github.com/apache/spark/commit/212b4dbf3c3a33c884d019068bdc6eb7fd25190c).
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user viirya closed the pull request at: https://github.com/apache/spark/pull/10671
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170203137 okay. Close it now.
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170203047 Yea I took another look - I'd prefer not to do it for the sake of doing it, unless we have a real benefit here. The optimizer is pretty hard to get right.
[GitHub] spark pull request: [SPARK-12735] Consolidate & move spark-ec2 to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10673#issuecomment-170202936 **[Test build #49043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49043/consoleFull)** for PR 10673 at commit [`3228f07`](https://github.com/apache/spark/commit/3228f074926391ab837dbf3e8c59b4294b0cf62f).
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170202921 @rxin Thanks for the explanation! Actually this PR is a minor one; it just extracts common code into a few methods to avoid duplication. It is more like de-duplication than refactoring, as far as I can tell. If you still think we shouldn't change this part, please let me know and I will close it. Thanks.
[GitHub] spark pull request: [SPARK-12645] [SparkR] SparkR support hash fun...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10597
[GitHub] spark pull request: [SPARK-12645] [SparkR] SparkR support hash fun...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/10597#issuecomment-170202531 LGTM. Thanks @yanboliang - Merging this to master and `branch-1.6`
[GitHub] spark pull request: [SPARK-12224][SPARKR] R support for JDBC sourc...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/10480#issuecomment-170202498 @sun-rui Are there any more comments on this PR? @felixcheung Could you bring this up to date with `master`?
[GitHub] spark pull request: [SPARK-12735] Consolidate & move spark-ec2 to ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10673#issuecomment-170202409 This should be merged together with https://github.com/amplab/spark-ec2/pull/21
[GitHub] spark pull request: [SPARK-12735] Consolidate & move spark-ec2 to ...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/10673 [SPARK-12735] Consolidate & move spark-ec2 to AMPLab managed repository. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-12735 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10673.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10673 commit 3228f074926391ab837dbf3e8c59b4294b0cf62f Author: Reynold Xin Date: 2016-01-09T06:51:24Z [SPARK-12735] Consolidate & move spark-ec2 to AMPLab managed repository.
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170201972 **[Test build #49042 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49042/consoleFull)** for PR 10671 at commit [`8600a07`](https://github.com/apache/spark/commit/8600a07c155aa5340e9235e69d78589a53022778).
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170201871 Thanks for submitting this. Unless it is substantially better or super obvious to review, I'd avoid patches that refactor the optimizer for the sake of refactoring.
[GitHub] spark pull request: [SPARK-12734][BUILD] Fix Netty exclusion and u...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10672#issuecomment-170201685 **[Test build #49041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49041/consoleFull)** for PR 10672 at commit [`798441a`](https://github.com/apache/spark/commit/798441ae25936f61c431c01a3d5d3578dd8442c9).
[GitHub] spark pull request: [SPARK-12734][BUILD] Fix Netty exclusion and u...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/10672#discussion_r49261090 --- Diff: dev/test-dependencies.sh --- @@ -70,19 +70,10 @@ $MVN -q versions:set -DnewVersion=$TEMP_VERSION -DgenerateBackupPoms=false > /de # Generate manifests for each Hadoop profile: for HADOOP_PROFILE in "${HADOOP_PROFILES[@]}"; do echo "Performing Maven install for $HADOOP_PROFILE" - $MVN $HADOOP2_MODULE_PROFILES -P$HADOOP_PROFILE jar:jar install:install -q \ --pl '!assembly' \ --- End diff -- Also, note that we need to install dummy JARs and test JARs for all modules so that `mvn validate` doesn't fail during dependency resolution.
[GitHub] spark pull request: [SPARK-12734][BUILD] Fix Netty exclusion and u...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/10672#discussion_r49261085 --- Diff: dev/test-dependencies.sh --- @@ -70,19 +70,10 @@ $MVN -q versions:set -DnewVersion=$TEMP_VERSION -DgenerateBackupPoms=false > /de # Generate manifests for each Hadoop profile: for HADOOP_PROFILE in "${HADOOP_PROFILES[@]}"; do echo "Performing Maven install for $HADOOP_PROFILE" - $MVN $HADOOP2_MODULE_PROFILES -P$HADOOP_PROFILE jar:jar install:install -q \ --pl '!assembly' \ --- End diff -- @pwendell, this was from your original PR but I think it's no longer necessary because we don't run the compile phase.
[GitHub] spark pull request: [SPARK-12734][BUILD] Fix Netty exclusion and u...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/10672#issuecomment-170201532 I'd like to backport the `dev/test-dependencies` infrastructure as far back as `branch-1.5` so that we can merge a similar fix there as well. After this fix gets in, I think we should audit the build for other dependencies which should be banned via enforcer rules.
[GitHub] spark pull request: [SPARK-12734][BUILD] Fix Netty exclusion and u...
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/10672 [SPARK-12734][BUILD] Fix Netty exclusion and use Maven Enforcer to prevent future bugs Netty classes are published under artifacts with different names, so our build needs to exclude the `io.netty` and `org.jboss.netty` versions of the Netty artifact. However, our existing exclusions were incomplete, leading to situations where duplicate Netty classes would wind up on the classpath and cause compile errors (or worse). This patch fixes the exclusion issue by adding more exclusions and uses Maven Enforcer's [banned dependencies](https://maven.apache.org/enforcer/enforcer-rules/bannedDependencies.html) rule to prevent these classes from accidentally being reintroduced. I also updated `dev/test-dependencies.sh` to run `mvn validate` so that the enforcer rules can run as part of pull request builds. /cc @rxin @srowen @pwendell. I'd like to backport at least the exclusion portion of this fix to `branch-1.5` in order to fix the documentation publishing job, which fails nondeterministically due to incompatible versions of Netty classes taking precedence on the compile-time classpath. You can merge this pull request into a Git repository by running: $ git pull https://github.com/JoshRosen/spark enforce-netty-exclusions Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10672.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10672 commit f2e7a3fb46caee8537755a05e8371fcf8dfd6103 Author: Josh Rosen Date: 2016-01-09T04:50:39Z Enforce Netty exclusions. commit 64ce63624d07750c24fc5aaa4329bc1958c95f78 Author: Josh Rosen Date: 2016-01-09T06:09:42Z Add more exclusions and includes. commit 798441ae25936f61c431c01a3d5d3578dd8442c9 Author: Josh Rosen Date: 2016-01-09T06:19:01Z Add even more excludes; run mvn validate in deps test script. 
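The banned-dependencies approach described in the PR above can be sketched with Maven Enforcer's `bannedDependencies` rule. The fragment below is a hypothetical `pom.xml` excerpt, not Spark's actual build file; the exact coordinates Spark bans may differ.

```xml
<!-- Hypothetical pom.xml fragment: fail the build if a banned Netty artifact
     appears anywhere in the resolved dependency tree. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <executions>
    <execution>
      <id>enforce-banned-dependencies</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <bannedDependencies>
            <excludes>
              <!-- Duplicate copies of the Netty classes ship under these
                   older coordinates; ban them so only one copy can win. -->
              <exclude>io.netty:netty</exclude>
              <exclude>org.jboss.netty:*</exclude>
            </excludes>
          </bannedDependencies>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Because enforcer rules can run as early as `mvn validate`, having `dev/test-dependencies.sh` invoke that phase makes the check part of every pull request build, as the description notes.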
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170201362 cc @liancheng
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/10671 [SPARK-12733][SQL] Refactor duplicate codes in ProjectCollapsing JIRA: https://issues.apache.org/jira/browse/SPARK-12733 Minor PR to refactor duplicate codes in ProjectCollapsing. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 remove-dup-projectcollapse Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10671.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10671 commit 8600a07c155aa5340e9235e69d78589a53022778 Author: Liang-Chi Hsieh Date: 2016-01-09T06:19:34Z Remove duplicate codes in ProjectCollapsing.
[GitHub] spark pull request: [SPARK-12340] Fix overflow in various take fun...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10670#issuecomment-170200738 **[Test build #49040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49040/consoleFull)** for PR 10670 at commit [`d69b963`](https://github.com/apache/spark/commit/d69b96384487eeb077e2666799bd3117cfbfa9f2).
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170200677 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49039/ Test PASSed.
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170200676 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170200645 **[Test build #49039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49039/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12340] Fix overflow in various take fun...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/10670 [SPARK-12340] Fix overflow in various take functions. This is a follow-up for the original patch #10562. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-12340 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10670.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10670 commit 470d987f82f47e23dcf8fcdd162dbf713a5492b8 Author: Reynold Xin Date: 2016-01-09T05:54:25Z [SPARK-12340] Fix overflow in various take functions. This is a follow-up for the original patch #10562.
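For context, the class of bug such `take` fixes address is the usual int-overflow pattern when a limit or partition count is grown multiplicatively. The sketch below is illustrative only; the names are hypothetical, not Spark's actual code.

```java
// Hypothetical sketch of the int-overflow pattern that take()-style
// functions must guard against.
public class TakeOverflowSketch {
    public static void main(String[] args) {
        // Suppose repeated growth has pushed the partition count this high.
        int numPartsToTry = 1_500_000_000;

        // Naive doubling overflows int and wraps to a negative value,
        // which downstream code may treat as "scan nothing" or crash on.
        int naive = numPartsToTry * 2;
        System.out.println(naive < 0); // true: wrapped past Integer.MAX_VALUE

        // Doing the arithmetic in long and clamping avoids the wrap.
        int safe = (int) Math.min((long) numPartsToTry * 2L, Integer.MAX_VALUE);
        System.out.println(safe == Integer.MAX_VALUE); // true: clamped
    }
}
```

Widening to `long` before the multiply, then clamping to `Integer.MAX_VALUE`, is the standard defensive pattern here.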
[GitHub] spark pull request: [SPARK-12340] Fix overflow in various take fun...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10670#issuecomment-170200425 cc @srowen
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10620
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/10620#issuecomment-170200154 LGTM, merging into master, thanks!
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-170200024 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-170200025 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49036/ Test FAILed.
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-170200016 **[Test build #49036 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49036/consoleFull)** for PR 10238 at commit [`123d958`](https://github.com/apache/spark/commit/123d958ba05a36aebb2548f04418153979d243ed). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10620#issuecomment-170199898 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49037/ Test PASSed.
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10620#issuecomment-170199897 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10620#issuecomment-170199833 **[Test build #49037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49037/consoleFull)** for PR 10620 at commit [`119a055`](https://github.com/apache/spark/commit/119a055c7c3749ca6014635d280e3a28324e3b45). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10655#issuecomment-170195683 OK I think I figured out why. "acc" is a boolean column.
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10655#issuecomment-170195637 Thanks @RussellSpitzer. I will let @yhuai review and merge this. One question, do you know why the filter is "if (isnull(acc#2)) null else CASE 1000 WHEN 1 THEN acc#2 WHEN 0 THEN NOT acc#2 ELSE false"? Seems so complicated for "acc = 1000"
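The quoted predicate is easier to read outside Catalyst. Below is a hypothetical Python rendering of what the rewritten filter computes for a boolean column `acc` compared against an integer literal; the function name is my own, not anything in Spark:

```python
def boolean_eq_literal(acc, literal):
    """Hypothetical rendering of the quoted Catalyst filter:
    if (isnull(acc)) null else
    CASE literal WHEN 1 THEN acc WHEN 0 THEN NOT acc ELSE false"""
    if acc is None:       # isnull(acc) -> null
        return None
    if literal == 1:      # WHEN 1 THEN acc
        return acc
    if literal == 0:      # WHEN 0 THEN NOT acc
        return not acc
    return False          # ELSE false: 1000 matches no boolean value
```

For `acc = 1000` the CASE always falls through to `false` on non-null rows, which is why the rewritten predicate looks so elaborate for a comparison that can never be true.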
[GitHub] spark pull request: [SPARK-12504][SQL] [Backport-1.6] Masking cred...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10669#issuecomment-170195506 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4628][BUILD] Remove all non-Maven-Centr...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10659
[GitHub] spark pull request: [SPARK-4628][BUILD] Remove all non-Maven-Centr...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10659#issuecomment-170195406 Looks great. I'm going to merge this.
[GitHub] spark pull request: [SPARK-12504][SQL] [Backport-1.6] Masking cred...
GitHub user sureshthalamati opened a pull request: https://github.com/apache/spark/pull/10669 [SPARK-12504][SQL] [Backport-1.6] Masking credentials in the sql plan explain output for JDBC data sources. Currently, credentials in the JDBC URL/properties for JDBC data sources are included in the explain output. This fix removes credentials from the explain output and shows only the database table information. Backporting the fix to 1.6 from 2.0 as discussed in PR https://github.com/apache/spark/pull/10452 CC @marmbrus

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sureshthalamati/spark mask_jdbc_credentials_spark_1.6.0-12504

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10669.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10669

commit 212b4dbf3c3a33c884d019068bdc6eb7fd25190c
Author: sureshthalamati
Date: 2016-01-09T03:39:12Z

    masking jdbc datasource credentials from the plan output
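As a rough illustration of the masking idea (not the PR's actual implementation, which lives in the JDBC relation's string representation in Scala), a regex-based scrubber might look like the sketch below; the function name and the exact pattern are assumptions:

```python
import re

def mask_jdbc_credentials(url):
    # Replace the values of user/password style keys in a JDBC URL with a
    # placeholder, so credentials never reach the plan's explain output.
    return re.sub(r"(?i)(password|user)=([^&;]*)", r"\1=###", url)
```

The key point of the fix is that masking happens where the plan is rendered to a string, so the actual connection properties used at runtime are untouched.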
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10667
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170195154 LGTM. Merging in master.
[GitHub] spark pull request: [SPARK-9297][SQL] Add covar_pop and covar_samp
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/10029#discussion_r49260274

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Covariance.scala ---
@@ -0,0 +1,212 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.TypeUtils
+import org.apache.spark.sql.types._
+
+/**
+ * Compute the covariance between two expressions.
+ * When applied on empty data (i.e., count is zero), it returns NULL.
+ */
+abstract class Covariance(
+    left: Expression,
+    right: Expression,
+    mutableAggBufferOffset: Int,
+    inputAggBufferOffset: Int)
+  extends ImperativeAggregate with Serializable {
+
+  override def children: Seq[Expression] = Seq(left, right)
+
+  override def nullable: Boolean = false
+
+  override def dataType: DataType = DoubleType
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (left.dataType.isInstanceOf[DoubleType] && right.dataType.isInstanceOf[DoubleType]) {
+      TypeCheckResult.TypeCheckSuccess
+    } else {
+      TypeCheckResult.TypeCheckFailure(
+        s"covariance requires that both arguments are double type, " +
+          s"not (${left.dataType}, ${right.dataType}).")
+    }
+  }
+
+  override def aggBufferSchema: StructType = StructType.fromAttributes(aggBufferAttributes)
+
+  override def inputAggBufferAttributes: Seq[AttributeReference] = {
+    aggBufferAttributes.map(_.newInstance())
+  }
+
+  override val aggBufferAttributes: Seq[AttributeReference] = Seq(
+    AttributeReference("xAvg", DoubleType)(),
+    AttributeReference("yAvg", DoubleType)(),
+    AttributeReference("Ck", DoubleType)(),
+    AttributeReference("count", LongType)())
+
+  // Local cache of mutableAggBufferOffset(s) that will be used in update and merge
+  val mutableAggBufferOffsetPlus1 = mutableAggBufferOffset + 1
+  val mutableAggBufferOffsetPlus2 = mutableAggBufferOffset + 2
+  val mutableAggBufferOffsetPlus3 = mutableAggBufferOffset + 3
+
+  // Local cache of inputAggBufferOffset(s) that will be used in update and merge
+  val inputAggBufferOffsetPlus1 = inputAggBufferOffset + 1
+  val inputAggBufferOffsetPlus2 = inputAggBufferOffset + 2
+  val inputAggBufferOffsetPlus3 = inputAggBufferOffset + 3
+
+  override def initialize(buffer: MutableRow): Unit = {
+    buffer.setDouble(mutableAggBufferOffset, 0.0)
+    buffer.setDouble(mutableAggBufferOffsetPlus1, 0.0)
+    buffer.setDouble(mutableAggBufferOffsetPlus2, 0.0)
+    buffer.setLong(mutableAggBufferOffsetPlus3, 0L)
+  }
+
+  override def update(buffer: MutableRow, input: InternalRow): Unit = {
+    val leftEval = left.eval(input)
+    val rightEval = right.eval(input)
+
+    if (leftEval != null && rightEval != null) {
+      val x = leftEval.asInstanceOf[Double]
+      val y = rightEval.asInstanceOf[Double]
+
+      var xAvg = buffer.getDouble(mutableAggBufferOffset)
+      var yAvg = buffer.getDouble(mutableAggBufferOffsetPlus1)
+      var Ck = buffer.getDouble(mutableAggBufferOffsetPlus2)
+      var count = buffer.getLong(mutableAggBufferOffsetPlus3)
+
+      val deltaX = x - xAvg
+      val deltaY = y - yAvg
+      count += 1
+      xAvg += deltaX / count
+      yAvg += deltaY / count
+      Ck += deltaX * (y - yAvg)
+
+      buffer.setDouble(mutableAggBufferOffset, xAvg)
+      buffer.setDouble(mutableAggBufferOffsetPlus1, yAvg)
+      buffer.setDouble(mutableAggBufferOffsetPlus2, Ck)
+      b
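The quoted `update` implements a single-pass, Welford-style update of the co-moment `Ck`, from which the population covariance is `Ck / count`. A minimal Python sketch of the same step, with the aggregation buffer replaced by a tuple (names are mine):

```python
def update_covariance(state, x, y):
    # state = (xAvg, yAvg, Ck, count), mirroring the four agg-buffer slots
    x_avg, y_avg, ck, count = state
    delta_x = x - x_avg              # deltaX uses the *old* xAvg
    count += 1
    x_avg += delta_x / count
    y_avg += (y - y_avg) / count
    ck += delta_x * (y - y_avg)      # y_avg here is the *updated* mean
    return (x_avg, y_avg, ck, count)

state = (0.0, 0.0, 0.0, 0)
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]:
    state = update_covariance(state, x, y)
# population covariance of the stream so far = Ck / count
```

The asymmetry (old `xAvg`, updated `yAvg`) is what makes the update numerically stable in one pass; the merge step between buffers follows the analogous pairwise formula.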
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170193612 **[Test build #49039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49039/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686).
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170189804 In this update, the code changes include:

- Fixed a bug in `Range` by inheriting the trait `MultiInstanceRelation`.
- Added a de-duplication resolution for all the binary nodes: `Except`, `Union` and `Co-Group`, besides `Intersect` and `Join`.
- Added a new function `duplicateResolved` for all the binary nodes.
- Improved the analysis exception message when failing to resolve conflicting references.
- Resolved all the other comments.

The analysis procedure is kind of tricky. I am unable to directly include `duplicateResolved` in `childrenResolved`, because `resolve` is lazily evaluated. The resolution procedure needs to follow this order: resolve the children first, then the node itself, and then deduplicate the attributes' expression IDs in its children.
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170189725 **[Test build #49038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49038/consoleFull)** for PR 10630 at commit [`f820c61`](https://github.com/apache/spark/commit/f820c616fe217494ccaed0bf74a0a7410ce503cf).
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10620#issuecomment-170189419 **[Test build #49037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49037/consoleFull)** for PR 10620 at commit [`119a055`](https://github.com/apache/spark/commit/119a055c7c3749ca6014635d280e3a28324e3b45).
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/10620#discussion_r49259620

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -936,6 +936,35 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     assert(e.getMessage.contains("Distinct window functions are not supported"))
   }

+  test("window function: better support of parentheses") {
+    val data = Seq(
+      WindowData(1, "a", 5),
+      WindowData(2, "a", 6),
+      WindowData(3, "b", 7),
+      WindowData(4, "b", 8),
+      WindowData(5, "c", 9),
+      WindowData(6, "c", 10)
+    )
+    sparkContext.parallelize(data).toDF().registerTempTable("windowData")
+
+    checkAnswer(
+      sql(
+        """
+          |select month, area, product,
+          |sum(product + 1) over (partition by ((1) + (1 - 1) -
+          |(2 * 1 / 2) + (1) + product - (product)) order by 2)
--- End diff --

This query is in the test because we want to make sure some corner cases pass, e.g. (expression) op (expression op expression). I will try simpler ones.
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/10620#discussion_r49259453

--- Diff: sql/catalyst/src/main/antlr3/org/apache/spark/sql/catalyst/parser/ExpressionParser.g ---
@@ -223,7 +223,12 @@ precedenceUnaryPrefixExpression
     ;

 precedenceUnarySuffixExpression
-    : precedenceUnaryPrefixExpression (a=KW_IS nullCondition)?
+    :
+    (
+    (LPAREN precedenceUnaryPrefixExpression RPAREN) => LPAREN precedenceUnaryPrefixExpression (a=KW_IS nullCondition)? RPAREN
--- End diff --

Yes. I think so.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170182210 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49035/ Test PASSed.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170182209 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170182139 **[Test build #49035 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49035/consoleFull)** for PR 10668 at commit [`bbd9c0d`](https://github.com/apache/spark/commit/bbd9c0d9066a68286310bccb9e1fbe36d3375371). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170177470 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170177471 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49033/ Test PASSed.
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170177405 **[Test build #49033 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49033/consoleFull)** for PR 10667 at commit [`ef3ec50`](https://github.com/apache/spark/commit/ef3ec50181f1e6588eb748d7241f5caa26de82db). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-170176403 **[Test build #49036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49036/consoleFull)** for PR 10238 at commit [`123d958`](https://github.com/apache/spark/commit/123d958ba05a36aebb2548f04418153979d243ed).
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-170175271 wtf. retest this please
[GitHub] spark pull request: [SPARK-3369] [CORE] [STREAMING] Java mapPartit...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10413#issuecomment-170175021 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49030/ Test PASSed.
[GitHub] spark pull request: [SPARK-3369] [CORE] [STREAMING] Java mapPartit...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10413#issuecomment-170175020 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-3369] [CORE] [STREAMING] Java mapPartit...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10413#issuecomment-170174835 **[Test build #49030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49030/consoleFull)** for PR 10413 at commit [`c3e0375`](https://github.com/apache/spark/commit/c3e0375a58365b770df8d1499efedc418cf20115). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user ehsanmok commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49256609

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -332,18 +336,18 @@ class BlockMatrix @Since("1.3.0") (
     if (rowsPerBlock == other.rowsPerBlock && colsPerBlock == other.colsPerBlock) {
       val addedBlocks = blocks.cogroup(other.blocks, createPartitioner())
         .map { case ((blockRowIndex, blockColIndex), (a, b)) =>
-          if (a.size > 1 || b.size > 1) {
-            throw new SparkException("There are multiple MatrixBlocks with indices: " +
-              s"($blockRowIndex, $blockColIndex). Please remove them.")
-          }
-          if (a.isEmpty) {
-            new MatrixBlock((blockRowIndex, blockColIndex), b.head)
-          } else if (b.isEmpty) {
-            new MatrixBlock((blockRowIndex, blockColIndex), a.head)
-          } else {
-            val result = a.head.toBreeze + b.head.toBreeze
-            new MatrixBlock((blockRowIndex, blockColIndex), Matrices.fromBreeze(result))
-          }
+          if (a.size > 1 || b.size > 1) {
--- End diff --

Isn't it the same indentation [here](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala#L334)? I don't think I changed anything there!
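The quoted `cogroup`-based addition pairs up blocks by (row, column) index, keeps one-sided blocks as-is, and adds element-wise where both sides have a block. A plain-Python sketch of that logic, with dicts standing in for the cogrouped RDD (names are mine; the duplicate-block error branch is elided since dict keys are unique):

```python
def add_block_matrices(a_blocks, b_blocks):
    # a_blocks / b_blocks: {(blockRow, blockCol): 2-D list of floats}
    result = {}
    for key in set(a_blocks) | set(b_blocks):
        a, b = a_blocks.get(key), b_blocks.get(key)
        if a is None:        # block present only on the right side
            result[key] = b
        elif b is None:      # block present only on the left side
            result[key] = a
        else:                # both present: element-wise sum
            result[key] = [[x + y for x, y in zip(ra, rb)]
                           for ra, rb in zip(a, b)]
    return result
```

This also shows why the two matrices must agree on `rowsPerBlock`/`colsPerBlock`: otherwise blocks with the same index would not cover the same region and could not be summed element-wise.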
[GitHub] spark pull request: [SPARK-4628][BUILD] Remove all non-Maven-Centr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10659#issuecomment-170170011 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-4628][BUILD] Remove all non-Maven-Centr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10659#issuecomment-170170013 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49029/ Test PASSed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49255412

--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream

-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-           'PowerIterationClusteringModel', 'PowerIterationClustering',
-           'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
            'LDA', 'LDAModel']

 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.train(sc.parallelize(data), k=4)
+    >>> p = array([0.0, 0.0])
+    >>> model.predict(p) == model.predict(p)
+    True
+    >>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+    True
+    >>> model.k
+    4
+    >>> model.computeCost(array([0.0, 0.0]))
+    0.0
+    >>> model.k == len(model.clusterCenters)
+    True
+    >>> model = bskm.train(sc.parallelize(data), k=2)
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+    True
+    >>> model.k
+    2
+
+    .. versionadded:: 2.0.0
+    """
+
+    @property
+    @since('2.0.0')
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return [c.toArray() for c in self.call("clusterCenters")]
+
+    @property
+    @since('2.0.0')
+    def k(self):
+        """Get the number of clusters"""
+        return self.call("k")
+
+    @since('2.0.0')
+    def predict(self, x):
+        """
+        Find the cluster to which x belongs in this model.
+
+        :param x: Either the point to determine the cluster for or an RDD of
+                  points to determine the clusters for.
+        """
+        if isinstance(x, RDD):
+            vecs = x.map(_convert_to_vector)
+            return self.call("predict", vecs)
+
+        x = _convert_to_vector(x)
+        return self.call("predict", x)
+
+    @since('2.0.0')
+    def computeCost(self, point):
+        """
+        Return the Bisecting K-means cost (sum of squared distances of points to
+        their nearest center) for this model on the given data.
+
+        :param point: the point to compute the cost to
+        """
+        return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+    """
+    .. note:: Experimental
+
+    A bisecting k-means algorithm based on the paper "A comparison of document clustering
--- End diff --

Also, we have ~380 docstring lines over the length of 72. I'll file a cleanup JIRA for this.
[GitHub] spark pull request: [SPARK-4628][BUILD] Remove all non-Maven-Centr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10659#issuecomment-170169877 **[Test build #49029 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49029/consoleFull)** for PR 10659 at commit [`e125f50`](https://github.com/apache/spark/commit/e125f50f84e09bc3176f5d0bb96cab2f4dbc29a1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170168246 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170168247 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49034/ Test FAILed.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170167622 **[Test build #49035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49035/consoleFull)** for PR 10668 at commit [`bbd9c0d`](https://github.com/apache/spark/commit/bbd9c0d9066a68286310bccb9e1fbe36d3375371).
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user ajbozarth commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170163831 Screenshots:

Initial page load
![initial](https://cloud.githubusercontent.com/assets/13952758/12212748/96678fee-b622-11e5-8fb6-d60e71ed8303.png)

Sort by address
![sortaddrbottom](https://cloud.githubusercontent.com/assets/13952758/12212752/967c5fb4-b622-11e5-837e-acaad61ba70c.png)
![sortaddrtop](https://cloud.githubusercontent.com/assets/13952758/12212751/967c5492-b622-11e5-8255-ee3a9cf6cbc9.png)

Sort by ID
![sortidbottom](https://cloud.githubusercontent.com/assets/13952758/12212754/967ed0aa-b622-11e5-875c-6cb59a467184.png)
![sortidtop](https://cloud.githubusercontent.com/assets/13952758/12212750/967c4f1a-b622-11e5-9b06-95595b1ebdca.png)

Sort by Task Count
![sorttasksbottom](https://cloud.githubusercontent.com/assets/13952758/12212753/967c5cee-b622-11e5-82d1-37aa93e256ea.png)
![sorttaskstop](https://cloud.githubusercontent.com/assets/13952758/12212749/967ac4a6-b622-11e5-90c4-857dca45c80e.png)
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
GitHub user ajbozarth opened a pull request: https://github.com/apache/spark/pull/10668

[SPARK-12716] [Web UI] Add a TOTALS row to the Executors Web UI

Created a TOTALS row containing the totals of each column in the executors UI. By default the TOTALS row appears at the top of the table. When a column is sorted, the TOTALS row will always sort to either the top or bottom of the table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajbozarth/spark spark12716

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10668.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10668

commit f0a725d2bc3fd0d42af88bc1488241b41c552a6f
Author: Alex Bozarth
Date: 2016-01-08T20:37:57Z

    Added a TOTALS row to the executors UI
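The pinned-TOTALS sorting behavior described in the PR can be modeled in a few lines. This is only a Python sketch of the idea (the actual PR implements it in the executors page's table code; the row and key names here are made up for illustration):

```python
def sort_with_totals(rows, totals, key, reverse=False):
    """Sort the data rows by `key`, then pin the TOTALS row to an end
    of the table: top for ascending sorts, bottom for descending sorts,
    so it never lands in the middle of the data."""
    body = sorted(rows, key=key, reverse=reverse)
    return body + [totals] if reverse else [totals] + body


rows = [{"id": 2, "tasks": 7}, {"id": 1, "tasks": 3}]
totals = {"id": "TOTALS", "tasks": 10}

# Ascending sort by task count: TOTALS stays on top, data rows follow in order.
table = sort_with_totals(rows, totals, key=lambda r: r["tasks"])
```

The key design point is that the totals row is excluded from the comparison entirely rather than given a sentinel sort value, which keeps it pinned regardless of which column is sorted.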
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49253655

--- Diff: python/pyspark/mllib/clustering.py ---
(same hunk as quoted above; this comment is attached to the line `Find the cluster to which x belongs in this model.`)
--- End diff --

Agreed; this is, however, the same text as used in KMeansModel, so I'll update that one's docstring as well.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49253296

--- Diff: python/pyspark/mllib/clustering.py ---
(same hunk as quoted above; this comment is attached to the line `A bisecting k-means algorithm based on the paper "A comparison of document clustering`)
--- End diff --

Are we sure about the 74? Looking at PEP 8/PEP 257, it says 72 (although we extended the limit for code lines, so maybe we changed that too)? We could try to add a lint rule for this in the future.
[GitHub] spark pull request: [SPARK-12634][Python][MLlib][DOC] Update param...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10601#issuecomment-170159816 I just added a note to the parent JIRA about a formatting issue affecting all 5 PRs: [https://issues.apache.org/jira/browse/SPARK-11219?focusedCommentId=15090225&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15090225] Could you please check it out & ping when I should review again? Thank you!
[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10602#issuecomment-170159780 I just added a note to the parent JIRA about a formatting issue affecting all 5 PRs: [https://issues.apache.org/jira/browse/SPARK-11219?focusedCommentId=15090225&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15090225] Could you please check it out & ping when I should review again? Thank you!
[GitHub] spark pull request: [SPARK-12633][Python][MLlib][DOC] Update param...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10600#issuecomment-170159799 I just added a note to the parent JIRA about a formatting issue affecting all 5 PRs: [https://issues.apache.org/jira/browse/SPARK-11219?focusedCommentId=15090225&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15090225] Could you please check it out & ping when I should review again? Thank you!
[GitHub] spark pull request: [SPARK-12630][Python][MLlib][DOC] Update param...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10598#issuecomment-170159733 I just added a note to the parent JIRA about a formatting issue affecting all 5 PRs: [https://issues.apache.org/jira/browse/SPARK-11219?focusedCommentId=15090225&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15090225] Could you please check it out & ping when I should review again? Thank you!
[GitHub] spark pull request: [SPARK-12631] [PYSPARK] [DOC] PySpark clusteri...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10610#issuecomment-170159752 I just added a note to the parent JIRA about a formatting issue affecting all 5 PRs: [https://issues.apache.org/jira/browse/SPARK-11219?focusedCommentId=15090225&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15090225] Could you please check it out & ping when I should review again? Thank you!
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170159640 **[Test build #49033 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49033/consoleFull)** for PR 10667 at commit [`ef3ec50`](https://github.com/apache/spark/commit/ef3ec50181f1e6588eb748d7241f5caa26de82db).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49252449

--- Diff: python/pyspark/mllib/clustering.py ---
(same hunk as quoted above; this comment is attached to the line `A bisecting k-means algorithm based on the paper "A comparison of document clustering`)
--- End diff --

Update: it should actually be 74 chars. You can check by running ```pydoc pyspark``` from the spark/python directory with the terminal resized to 80 chars wide.
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/10667

[SPARK-12730][TESTS] De-duplicate some test code in BlockManagerSuite

This patch deduplicates some test code in BlockManagerSuite. I'm splitting this change off from a larger PR in order to make things easier to review.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark block-mgr-tests-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10667.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10667

commit ef3ec50181f1e6588eb748d7241f5caa26de82db
Author: Josh Rosen
Date: 2016-01-08T23:37:58Z

    First round of de-duplication
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170158610 /cc @andrewor14 for review.
[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....
Github user marmbrus closed the pull request at: https://github.com/apache/spark/pull/10650
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251948

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -317,14 +317,18 @@ class BlockMatrix @Since("1.3.0") (
   }

   /**
-   * Adds two block matrices together. The matrices must have the same size and matching
-   * `rowsPerBlock` and `colsPerBlock` values. If one of the blocks that are being added are
-   * instances of [[SparseMatrix]], the resulting sub matrix will also be a [[SparseMatrix]], even
-   * if it is being added to a [[DenseMatrix]]. If two dense matrices are added, the output will
-   * also be a [[DenseMatrix]].
+   * For given matrices `this` and `other` of compatible dimensions and compatible block
+   * dimensions, it applies an associative binary function on their corresponding blocks.
+   *
+   * @param other The BlockMatrix to operate on
+   * @param binMap An associative function taking two dense breeze matrices and returning a
--- End diff --

Not associative. Also, this should operate on any Breeze Matrix, not just dense ones, right?
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251953

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -332,18 +336,18 @@ class BlockMatrix @Since("1.3.0") (
     if (rowsPerBlock == other.rowsPerBlock && colsPerBlock == other.colsPerBlock) {
       val addedBlocks = blocks.cogroup(other.blocks, createPartitioner())
         .map { case ((blockRowIndex, blockColIndex), (a, b)) =>
-          if (a.size > 1 || b.size > 1) {
-            throw new SparkException("There are multiple MatrixBlocks with indices: " +
-              s"($blockRowIndex, $blockColIndex). Please remove them.")
-          }
-          if (a.isEmpty) {
-            new MatrixBlock((blockRowIndex, blockColIndex), b.head)
-          } else if (b.isEmpty) {
-            new MatrixBlock((blockRowIndex, blockColIndex), a.head)
-          } else {
-            val result = a.head.toBreeze + b.head.toBreeze
-            new MatrixBlock((blockRowIndex, blockColIndex), Matrices.fromBreeze(result))
-          }
+        if (a.size > 1 || b.size > 1) {
+          throw new SparkException("There are multiple MatrixBlocks with indices: " +
+            s"($blockRowIndex, $blockColIndex). Please remove them.")
+        }
+        if (a.isEmpty) {
+          new MatrixBlock((blockRowIndex, blockColIndex), b.head)
+        } else if (b.isEmpty) {
+          new MatrixBlock((blockRowIndex, blockColIndex), a.head)
--- End diff --

This and line 344 are incorrect. What if you write `a - b` but `a` has no block? Then the resulting block will be "b" but should be "-b". Before you fix this, I'd recommend improving the unit test to catch this case & fail; then you can fix it.
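The bug pointed out here can be sketched, together with its fix, in a few lines of Python (an illustration only, using nested lists in place of Spark's Breeze matrix blocks, with made-up names): when one side has no block at an index, substitute a zero matrix and still apply `binMap`, so that for subtraction a missing left block yields `-b` rather than `b`.

```python
def block_map(blocks_a, blocks_b, bin_map):
    """Merge two {(row, col): matrix} dicts by applying bin_map blockwise.

    A missing block is treated as a zero matrix of the same shape, which
    keeps non-commutative operators like subtraction correct."""

    def zeros_like(m):
        return [[0.0] * len(m[0]) for _ in m]

    def apply_elementwise(f, x, y):
        return [[f(u, v) for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]

    out = {}
    for idx in set(blocks_a) | set(blocks_b):
        a = blocks_a.get(idx)
        b = blocks_b.get(idx)
        if a is None:
            a = zeros_like(b)  # left side absent: use a zero block, not b itself
        elif b is None:
            b = zeros_like(a)
        out[idx] = apply_elementwise(bin_map, a, b)
    return out


sub = lambda x, y: x - y
left = {(0, 0): [[1.0, 2.0]]}
right = {(0, 0): [[0.5, 0.5]], (0, 1): [[3.0, 4.0]]}

# (0, 0) subtracts normally; (0, 1) has no left block, so the result is -right.
result = block_map(left, right, sub)
```

The real code operates on RDDs via `cogroup`, but the zero-fill semantics are the same idea as the fix jkbradley requests.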
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251958

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -351,6 +355,28 @@ class BlockMatrix @Since("1.3.0") (
     }
   }

+  /**
+   * Adds two block matrices together. The matrices must have the same size and matching
+   * `rowsPerBlock` and `colsPerBlock` values. If one of the blocks that are being added are
+   * instances of [[SparseMatrix]], the resulting sub matrix will also be a [[SparseMatrix]], even
+   * if it is being added to a [[DenseMatrix]]. If two dense matrices are added, the output will
+   * also be a [[DenseMatrix]].
+   */
+  @Since("1.3.0")
+  def add(other: BlockMatrix): BlockMatrix =
+    blockMap(other, (x: BM[Double], y: BM[Double]) => x + y)
+
+  /**
+   * Subtracts two block matrices together. The matrices must have the same size and matching
+   * `rowsPerBlock` and `colsPerBlock` values. If one of the blocks that are being added are
+   * instances of [[SparseMatrix]], the resulting sub matrix will also be a [[SparseMatrix]], even
+   * if it is being added to a [[DenseMatrix]]. If two dense matrices are added, the output will
+   * also be a [[DenseMatrix]].
+   */
+  @Since("1.6.0")
--- End diff --

Now needs to be updated to 2.0.0.
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251939

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
(same hunk as quoted above, @@ -317,14 +317,18 @@; this comment is attached to the line `it applies an associative binary function on their corresponding blocks.`)
--- End diff --

Not associative (subtraction is not).
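A quick concrete check of the point above: subtraction is a binary operation but not an associative one, so documenting `binMap` as taking an "associative function" would rule out `subtract`. Plain numbers stand in for matrix blocks here:

```python
# The binMap used by subtract() applied elementwise; scalars suffice to
# demonstrate the algebraic property.
sub = lambda x, y: x - y

a, b, c = 5.0, 3.0, 2.0
grouped_left = sub(sub(a, b), c)   # (5 - 3) - 2
grouped_right = sub(a, sub(b, c))  # 5 - (3 - 2)
# Associativity would require the two groupings to agree; they do not.
```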
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251943

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
(same hunk as quoted above, @@ -317,14 +317,18 @@; this comment is attached to the line `@param other The BlockMatrix to operate on`)
--- End diff --

"operate on" sounds like "other" is being modified. Rephrase: "The second BlockMatrix argument for the operator specified by `binMap`"
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251950 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -332,18 +336,18 @@ class BlockMatrix @Since("1.3.0") ( if (rowsPerBlock == other.rowsPerBlock && colsPerBlock == other.colsPerBlock) { val addedBlocks = blocks.cogroup(other.blocks, createPartitioner()) .map { case ((blockRowIndex, blockColIndex), (a, b)) => - if (a.size > 1 || b.size > 1) { -throw new SparkException("There are multiple MatrixBlocks with indices: " + - s"($blockRowIndex, $blockColIndex). Please remove them.") - } - if (a.isEmpty) { -new MatrixBlock((blockRowIndex, blockColIndex), b.head) - } else if (b.isEmpty) { -new MatrixBlock((blockRowIndex, blockColIndex), a.head) - } else { -val result = a.head.toBreeze + b.head.toBreeze -new MatrixBlock((blockRowIndex, blockColIndex), Matrices.fromBreeze(result)) - } +if (a.size > 1 || b.size > 1) { --- End diff -- style: Fix indentation (The change was incorrect, or accidental.)
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251954 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -351,6 +355,28 @@ class BlockMatrix @Since("1.3.0") ( } } + /** + * Adds two block matrices together. The matrices must have the same size and matching + * `rowsPerBlock` and `colsPerBlock` values. If one of the blocks that are being added are + * instances of [[SparseMatrix]], the resulting sub matrix will also be a [[SparseMatrix]], even + * if it is being added to a [[DenseMatrix]]. If two dense matrices are added, the output will + * also be a [[DenseMatrix]]. + */ + @Since("1.3.0") + def add(other: BlockMatrix): BlockMatrix = +blockMap(other, (x: BM[Double], y: BM[Double]) => x + y) + + /** + * Subtracts two block matrices together. The matrices must have the same size and matching --- End diff -- ```Subtracts two block matrices together.``` --> ```Subtracts the given block matrix `other` from this block matrix: `this - other`.```
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user ehsanmok commented on the pull request: https://github.com/apache/spark/pull/9916#issuecomment-170157265 @jkbradley thank you! I'm guessing that'd be suitable for Spark 1.6.1, so the `Since` annotations should be updated, right?
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/9916#issuecomment-170156407 @ehsanmok Apologies for the slow review. We constantly have ~100 pending PRs and many more JIRAs, so they can be hard to cover with limited reviewer bandwidth. I'll take a look now.
[GitHub] spark pull request: [SPARK-10509][PYSPARK] Reduce excessive param ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10216#issuecomment-170156115 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10509][PYSPARK] Reduce excessive param ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10216#issuecomment-170156117 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49032/
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251302 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) --- End diff -- Specify number of partitions for sc.parallelize; not doing so has caused flaky tests in the past (because of randomization interacting with partitioning).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170156045 @holdenk Thanks for the PR! That's all for now.
[GitHub] spark pull request: [SPARK-10509][PYSPARK] Reduce excessive param ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10216#issuecomment-170156009 **[Test build #49032 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49032/consoleFull)** for PR 10216 at commit [`e0f3f00`](https://github.com/apache/spark/commit/e0f3f00d761b0b53860dd0f06de320c9fdc84958). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251291 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) +True +>>> model.predict(sc.parallelize([p])).first() == model.predict(p) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model.
+ +:param x: Either the point to determine the cluster for or an RDD of points to determine --- End diff -- Confusing doc; reword. Also fix indentation on next line.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251293 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) +True +>>> model.predict(sc.parallelize([p])).first() == model.predict(p) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model.
+ +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. +""" +if isinstance(x, RDD): +vecs = x.map(_convert_to_vector) +return self.call("predict", vecs) + +x = _convert_to_vector(x) +return self.call("predict", x) + +@since('2.0.0') +def computeCost(self, point): +""" +Return the Bisecting K-means cost (sum of squared distances of points to +their nearest center) for this model on the given data. + +:param point: the point to compute the cost to +""" +return self.call("computeCost", _convert_to_vector(point)) + + +class BisectingKMeans: +""" +.. note:: Experimental + +A bisecting k-means algorithm based on the paper "A comparison of document clustering --- End diff -- I believe we try to limit doc lines in Python to <= 80 chars (unlike code, which is <= 100 chars). Could you please update this and other parts?
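The `predict` method quoted in the diff dispatches on its argument: an `RDD` of points is mapped through the model, while a single point is predicted directly. A pure-Python sketch of that dispatch pattern, with NumPy array dimensionality standing in for the `isinstance(x, RDD)` check and a simple nearest-center assignment standing in for the JVM call (the function name and logic are illustrative, not PySpark's implementation):

```python
import numpy as np


def predict(centers, x):
    """Dispatch on input shape: a single point yields one cluster index;
    a 2-D collection of points (standing in for an RDD) yields one index
    per point, mirroring the RDD-vs-point branch in the diff above."""
    def nearest(p):
        # Index of the closest center by squared Euclidean distance.
        return int(np.argmin([np.sum((p - c) ** 2) for c in centers]))

    arr = np.asarray(x, dtype=float)
    if arr.ndim == 2:            # many points: the "RDD" branch
        return [nearest(p) for p in arr]
    return nearest(arr)          # single point
```

A clearer `:param x:` doc, per the review comment, might read: "A single data point, or an RDD of points, to assign to the nearest cluster."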
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251288 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) --- End diff -- I'd write this as more of an example than a unit test. It's good to exercise all functionality, but unit test code should go in tests.py. (We have been inconsistent about this, but it'd be good to improve.)