[GitHub] spark pull request: [SPARK-12708][UI] Sorting task error in Stages...
Github user yoshidakuy commented on the pull request: https://github.com/apache/spark/pull/10663#issuecomment-170205527 Thanks for the comments, and I agree. Will fix later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12504][SQL] [Backport-1.6] Masking cred...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10669#issuecomment-170203271 **[Test build #2355 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2355/consoleFull)** for PR 10669 at commit [`212b4db`](https://github.com/apache/spark/commit/212b4dbf3c3a33c884d019068bdc6eb7fd25190c).
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user viirya closed the pull request at: https://github.com/apache/spark/pull/10671
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170203137 okay. Close it now.
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170203047 Yea I took another look - I'd prefer not to do it for the sake of doing it, unless we have a real benefit here. The optimizer is pretty hard to get right.
[GitHub] spark pull request: [SPARK-12735] Consolidate & move spark-ec2 to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10673#issuecomment-170202936 **[Test build #49043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49043/consoleFull)** for PR 10673 at commit [`3228f07`](https://github.com/apache/spark/commit/3228f074926391ab837dbf3e8c59b4294b0cf62f).
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170202921 @rxin Thanks for the explanation! Actually this PR is a minor one; it just extracts common code into a few methods to avoid duplication. It is more like de-duplication than refactoring, as far as I can tell. If you still think we shouldn't change this part, please let me know and I will close it. Thanks.
[GitHub] spark pull request: [SPARK-12645] [SparkR] SparkR support hash fun...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10597
[GitHub] spark pull request: [SPARK-12645] [SparkR] SparkR support hash fun...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/10597#issuecomment-170202531 LGTM. Thanks @yanboliang - Merging this to master and `branch-1.6`
[GitHub] spark pull request: [SPARK-12224][SPARKR] R support for JDBC sourc...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/10480#issuecomment-170202498 @sun-rui Are there any more comments on this PR? @felixcheung Could you bring this up to date with `master`?
[GitHub] spark pull request: [SPARK-12735] Consolidate & move spark-ec2 to ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10673#issuecomment-170202409 This should be merged together with https://github.com/amplab/spark-ec2/pull/21
[GitHub] spark pull request: [SPARK-12735] Consolidate & move spark-ec2 to ...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/10673 [SPARK-12735] Consolidate & move spark-ec2 to AMPLab managed repository. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-12735 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10673.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10673 commit 3228f074926391ab837dbf3e8c59b4294b0cf62f Author: Reynold Xin Date: 2016-01-09T06:51:24Z [SPARK-12735] Consolidate & move spark-ec2 to AMPLab managed repository.
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170201972 **[Test build #49042 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49042/consoleFull)** for PR 10671 at commit [`8600a07`](https://github.com/apache/spark/commit/8600a07c155aa5340e9235e69d78589a53022778).
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170201871 Thanks for submitting this. Unless it is substantially better or super obvious to review, I'd avoid patches that refactor the optimizer for the sake of refactoring.
[GitHub] spark pull request: [SPARK-12734][BUILD] Fix Netty exclusion and u...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10672#issuecomment-170201685 **[Test build #49041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49041/consoleFull)** for PR 10672 at commit [`798441a`](https://github.com/apache/spark/commit/798441ae25936f61c431c01a3d5d3578dd8442c9).
[GitHub] spark pull request: [SPARK-12734][BUILD] Fix Netty exclusion and u...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/10672#discussion_r49261090 --- Diff: dev/test-dependencies.sh --- @@ -70,19 +70,10 @@ $MVN -q versions:set -DnewVersion=$TEMP_VERSION -DgenerateBackupPoms=false > /de # Generate manifests for each Hadoop profile: for HADOOP_PROFILE in "${HADOOP_PROFILES[@]}"; do echo "Performing Maven install for $HADOOP_PROFILE" - $MVN $HADOOP2_MODULE_PROFILES -P$HADOOP_PROFILE jar:jar install:install -q \ --pl '!assembly' \ --- End diff -- Also, note that we need to install dummy JARs and test JARs for all modules so that `mvn validate` doesn't fail during dependency resolution.
[GitHub] spark pull request: [SPARK-12734][BUILD] Fix Netty exclusion and u...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/10672#discussion_r49261085 --- Diff: dev/test-dependencies.sh --- @@ -70,19 +70,10 @@ $MVN -q versions:set -DnewVersion=$TEMP_VERSION -DgenerateBackupPoms=false > /de # Generate manifests for each Hadoop profile: for HADOOP_PROFILE in "${HADOOP_PROFILES[@]}"; do echo "Performing Maven install for $HADOOP_PROFILE" - $MVN $HADOOP2_MODULE_PROFILES -P$HADOOP_PROFILE jar:jar install:install -q \ --pl '!assembly' \ --- End diff -- @pwendell, this was from your original PR but I think it's no longer necessary because we don't run the compile phase.
[GitHub] spark pull request: [SPARK-12734][BUILD] Fix Netty exclusion and u...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/10672#issuecomment-170201532 I'd like to backport the `dev/test-dependencies` infrastructure as far back as `branch-1.5` so that we can merge a similar fix there as well. After this fix gets in, I think we should audit the build for other dependencies which should be banned via enforcer rules.
[GitHub] spark pull request: [SPARK-12734][BUILD] Fix Netty exclusion and u...
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/10672 [SPARK-12734][BUILD] Fix Netty exclusion and use Maven Enforcer to prevent future bugs Netty classes are published under artifacts with different names, so our build needs to exclude the `io.netty` and `org.jboss.netty` versions of the Netty artifact. However, our existing exclusions were incomplete, leading to situations where duplicate Netty classes would wind up on the classpath and cause compile errors (or worse). This patch fixes the exclusion issue by adding more exclusions and uses Maven Enforcer's [banned dependencies](https://maven.apache.org/enforcer/enforcer-rules/bannedDependencies.html) rule to prevent these classes from accidentally being reintroduced. I also updated `dev/test-dependencies.sh` to run `mvn validate` so that the enforcer rules can run as part of pull request builds. /cc @rxin @srowen @pwendell. I'd like to backport at least the exclusion portion of this fix to `branch-1.5` in order to fix the documentation publishing job, which fails nondeterministically due to incompatible versions of Netty classes taking precedence on the compile-time classpath. You can merge this pull request into a Git repository by running: $ git pull https://github.com/JoshRosen/spark enforce-netty-exclusions Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10672.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10672 commit f2e7a3fb46caee8537755a05e8371fcf8dfd6103 Author: Josh Rosen Date: 2016-01-09T04:50:39Z Enforce Netty exclusions. commit 64ce63624d07750c24fc5aaa4329bc1958c95f78 Author: Josh Rosen Date: 2016-01-09T06:09:42Z Add more exclusions and includes. commit 798441ae25936f61c431c01a3d5d3578dd8442c9 Author: Josh Rosen Date: 2016-01-09T06:19:01Z Add even more excludes; run mvn validate in deps test script. 
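The banned-dependencies approach described in the PR above can be sketched with Maven Enforcer's `bannedDependencies` rule. The fragment below is a hypothetical `pom.xml` excerpt, not Spark's actual build file; the exact coordinates Spark bans may differ.

```xml
<!-- Hypothetical pom.xml fragment: fail the build if a banned Netty artifact
     appears anywhere in the resolved dependency tree. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <executions>
    <execution>
      <id>enforce-banned-dependencies</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <bannedDependencies>
            <excludes>
              <!-- Duplicate copies of the Netty classes ship under these
                   older coordinates; ban them so only one copy can win. -->
              <exclude>io.netty:netty</exclude>
              <exclude>org.jboss.netty:*</exclude>
            </excludes>
          </bannedDependencies>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Because enforcer rules can run as early as `mvn validate`, having `dev/test-dependencies.sh` invoke that phase makes the check part of every pull request build, as the description notes.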
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/10671#issuecomment-170201362 cc @liancheng
[GitHub] spark pull request: [SPARK-12733][SQL] Refactor duplicate codes in...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/10671 [SPARK-12733][SQL] Refactor duplicate codes in ProjectCollapsing JIRA: https://issues.apache.org/jira/browse/SPARK-12733 Minor PR to refactor duplicate codes in ProjectCollapsing. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 remove-dup-projectcollapse Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10671.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10671 commit 8600a07c155aa5340e9235e69d78589a53022778 Author: Liang-Chi Hsieh Date: 2016-01-09T06:19:34Z Remove duplicate codes in ProjectCollapsing.
[GitHub] spark pull request: [SPARK-12340] Fix overflow in various take fun...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10670#issuecomment-170200738 **[Test build #49040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49040/consoleFull)** for PR 10670 at commit [`d69b963`](https://github.com/apache/spark/commit/d69b96384487eeb077e2666799bd3117cfbfa9f2).
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170200677 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49039/ Test PASSed.
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170200676 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170200645 **[Test build #49039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49039/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12340] Fix overflow in various take fun...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/10670 [SPARK-12340] Fix overflow in various take functions. This is a follow-up for the original patch #10562. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-12340 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10670.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10670 commit 470d987f82f47e23dcf8fcdd162dbf713a5492b8 Author: Reynold Xin Date: 2016-01-09T05:54:25Z [SPARK-12340] Fix overflow in various take functions. This is a follow-up for the original patch #10562.
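For context, the class of bug such `take` fixes address is the usual int-overflow pattern when a limit or partition count is grown multiplicatively. The sketch below is illustrative only; the names are hypothetical, not Spark's actual code.

```java
// Hypothetical sketch of the int-overflow pattern that take()-style
// functions must guard against.
public class TakeOverflowSketch {
    public static void main(String[] args) {
        // Suppose repeated growth has pushed the partition count this high.
        int numPartsToTry = 1_500_000_000;

        // Naive doubling overflows int and wraps to a negative value,
        // which downstream code may treat as "scan nothing" or crash on.
        int naive = numPartsToTry * 2;
        System.out.println(naive < 0); // true: wrapped past Integer.MAX_VALUE

        // Doing the arithmetic in long and clamping avoids the wrap.
        int safe = (int) Math.min((long) numPartsToTry * 2L, Integer.MAX_VALUE);
        System.out.println(safe == Integer.MAX_VALUE); // true: clamped
    }
}
```

Widening to `long` before the multiply, then clamping to `Integer.MAX_VALUE`, is the standard defensive pattern here.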
[GitHub] spark pull request: [SPARK-12340] Fix overflow in various take fun...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10670#issuecomment-170200425 cc @srowen
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10620
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/10620#issuecomment-170200154 LGTM, merging into master, thanks!
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-170200024 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-170200025 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49036/ Test FAILed.
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-170200016 **[Test build #49036 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49036/consoleFull)** for PR 10238 at commit [`123d958`](https://github.com/apache/spark/commit/123d958ba05a36aebb2548f04418153979d243ed). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10620#issuecomment-170199898 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49037/ Test PASSed.
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10620#issuecomment-170199897 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10620#issuecomment-170199833 **[Test build #49037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49037/consoleFull)** for PR 10620 at commit [`119a055`](https://github.com/apache/spark/commit/119a055c7c3749ca6014635d280e3a28324e3b45). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10655#issuecomment-170195683 OK I think I figured out why. "acc" is a boolean column.
[GitHub] spark pull request: SPARK-12639 SQL Improve Explain for Datasource...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10655#issuecomment-170195637 Thanks @RussellSpitzer. I will let @yhuai review and merge this. One question, do you know why the filter is "if (isnull(acc#2)) null else CASE 1000 WHEN 1 THEN acc#2 WHEN 0 THEN NOT acc#2 ELSE false"? Seems so complicated for "acc = 1000"
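The quoted predicate is easier to read outside Catalyst. Below is a hypothetical Python rendering of what the rewritten filter computes for a boolean column `acc` compared against an integer literal; the function name is my own, not anything in Spark:

```python
def boolean_eq_literal(acc, literal):
    """Hypothetical rendering of the quoted Catalyst filter:
    if (isnull(acc)) null else
    CASE literal WHEN 1 THEN acc WHEN 0 THEN NOT acc ELSE false"""
    if acc is None:       # isnull(acc) -> null
        return None
    if literal == 1:      # WHEN 1 THEN acc
        return acc
    if literal == 0:      # WHEN 0 THEN NOT acc
        return not acc
    return False          # ELSE false: 1000 matches no boolean value
```

For `acc = 1000` the CASE always falls through to `false` on non-null rows, which is why the rewritten predicate looks so elaborate for a comparison that can never be true.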
[GitHub] spark pull request: [SPARK-12504][SQL] [Backport-1.6] Masking cred...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10669#issuecomment-170195506 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4628][BUILD] Remove all non-Maven-Centr...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10659
[GitHub] spark pull request: [SPARK-4628][BUILD] Remove all non-Maven-Centr...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10659#issuecomment-170195406 Looks great. I'm going to merge this.
[GitHub] spark pull request: [SPARK-12504][SQL] [Backport-1.6] Masking cred...
GitHub user sureshthalamati opened a pull request: https://github.com/apache/spark/pull/10669 [SPARK-12504][SQL] [Backport-1.6] Masking credentials in the sql plan explain output for JDBC data sources. Currently, credentials in the JDBC URL/properties for JDBC data sources are included in the explain output. This fix removes credentials from the explain output and shows only the database table information. Backporting the fix to 1.6 from 2.0 as discussed in PR https://github.com/apache/spark/pull/10452 CC @marmbrus

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sureshthalamati/spark mask_jdbc_credentials_spark_1.6.0-12504

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10669.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10669

commit 212b4dbf3c3a33c884d019068bdc6eb7fd25190c
Author: sureshthalamati
Date: 2016-01-09T03:39:12Z

    masking jdbc datasource credentials from the plan output
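As a rough illustration of the masking idea (not the PR's actual implementation, which lives in the JDBC relation's string representation in Scala), a regex-based scrubber might look like the sketch below; the function name and the exact pattern are assumptions:

```python
import re

def mask_jdbc_credentials(url):
    # Replace the values of user/password style keys in a JDBC URL with a
    # placeholder, so credentials never reach the plan's explain output.
    return re.sub(r"(?i)(password|user)=([^&;]*)", r"\1=###", url)
```

The key point of the fix is that masking happens where the plan is rendered to a string, so the actual connection properties used at runtime are untouched.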
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10667
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170195154 LGTM. Merging in master.
[GitHub] spark pull request: [SPARK-9297][SQL] Add covar_pop and covar_samp
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/10029#discussion_r49260274

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Covariance.scala ---
@@ -0,0 +1,212 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.TypeUtils
+import org.apache.spark.sql.types._
+
+/**
+ * Compute the covariance between two expressions.
+ * When applied on empty data (i.e., count is zero), it returns NULL.
+ */
+abstract class Covariance(
+    left: Expression,
+    right: Expression,
+    mutableAggBufferOffset: Int,
+    inputAggBufferOffset: Int)
+  extends ImperativeAggregate with Serializable {
+
+  override def children: Seq[Expression] = Seq(left, right)
+
+  override def nullable: Boolean = false
+
+  override def dataType: DataType = DoubleType
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (left.dataType.isInstanceOf[DoubleType] && right.dataType.isInstanceOf[DoubleType]) {
+      TypeCheckResult.TypeCheckSuccess
+    } else {
+      TypeCheckResult.TypeCheckFailure(
+        s"covariance requires that both arguments are double type, " +
+          s"not (${left.dataType}, ${right.dataType}).")
+    }
+  }
+
+  override def aggBufferSchema: StructType = StructType.fromAttributes(aggBufferAttributes)
+
+  override def inputAggBufferAttributes: Seq[AttributeReference] = {
+    aggBufferAttributes.map(_.newInstance())
+  }
+
+  override val aggBufferAttributes: Seq[AttributeReference] = Seq(
+    AttributeReference("xAvg", DoubleType)(),
+    AttributeReference("yAvg", DoubleType)(),
+    AttributeReference("Ck", DoubleType)(),
+    AttributeReference("count", LongType)())
+
+  // Local cache of mutableAggBufferOffset(s) that will be used in update and merge
+  val mutableAggBufferOffsetPlus1 = mutableAggBufferOffset + 1
+  val mutableAggBufferOffsetPlus2 = mutableAggBufferOffset + 2
+  val mutableAggBufferOffsetPlus3 = mutableAggBufferOffset + 3
+
+  // Local cache of inputAggBufferOffset(s) that will be used in update and merge
+  val inputAggBufferOffsetPlus1 = inputAggBufferOffset + 1
+  val inputAggBufferOffsetPlus2 = inputAggBufferOffset + 2
+  val inputAggBufferOffsetPlus3 = inputAggBufferOffset + 3
+
+  override def initialize(buffer: MutableRow): Unit = {
+    buffer.setDouble(mutableAggBufferOffset, 0.0)
+    buffer.setDouble(mutableAggBufferOffsetPlus1, 0.0)
+    buffer.setDouble(mutableAggBufferOffsetPlus2, 0.0)
+    buffer.setLong(mutableAggBufferOffsetPlus3, 0L)
+  }
+
+  override def update(buffer: MutableRow, input: InternalRow): Unit = {
+    val leftEval = left.eval(input)
+    val rightEval = right.eval(input)
+
+    if (leftEval != null && rightEval != null) {
+      val x = leftEval.asInstanceOf[Double]
+      val y = rightEval.asInstanceOf[Double]
+
+      var xAvg = buffer.getDouble(mutableAggBufferOffset)
+      var yAvg = buffer.getDouble(mutableAggBufferOffsetPlus1)
+      var Ck = buffer.getDouble(mutableAggBufferOffsetPlus2)
+      var count = buffer.getLong(mutableAggBufferOffsetPlus3)
+
+      val deltaX = x - xAvg
+      val deltaY = y - yAvg
+      count += 1
+      xAvg += deltaX / count
+      yAvg += deltaY / count
+      Ck += deltaX * (y - yAvg)
+
+      buffer.setDouble(mutableAggBufferOffset, xAvg)
+      buffer.setDouble(mutableAggBufferOffsetPlus1, yAvg)
+      buffer.setDouble(mutableAggBufferOffsetPlus2, Ck)
+      b
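The quoted `update` implements a single-pass, Welford-style update of the co-moment `Ck`, from which the population covariance is `Ck / count`. A minimal Python sketch of the same step, with the aggregation buffer replaced by a tuple (names are mine):

```python
def update_covariance(state, x, y):
    # state = (xAvg, yAvg, Ck, count), mirroring the four agg-buffer slots
    x_avg, y_avg, ck, count = state
    delta_x = x - x_avg              # deltaX uses the *old* xAvg
    count += 1
    x_avg += delta_x / count
    y_avg += (y - y_avg) / count
    ck += delta_x * (y - y_avg)      # y_avg here is the *updated* mean
    return (x_avg, y_avg, ck, count)

state = (0.0, 0.0, 0.0, 0)
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]:
    state = update_covariance(state, x, y)
# population covariance of the stream so far = Ck / count
```

The asymmetry (old `xAvg`, updated `yAvg`) is what makes the update numerically stable in one pass; the merge step between buffers follows the analogous pairwise formula.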
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170193612 **[Test build #49039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49039/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686).
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170189804 In this update, the code changes include:

- Fixed a bug in `Range` by inheriting the trait `MultiInstanceRelation`.
- Added a de-duplication resolution for all the binary nodes: `Except`, `Union` and `Co-Group`, besides `Intersect` and `Join`.
- Added a new function `duplicateResolved` for all the binary nodes.
- Improved the analysis exception message when failing to resolve conflicting references.
- Resolved all the other comments.

The analysis procedure is kind of tricky. I am unable to directly include `duplicateResolved` in `childrenResolved`, because `resolve` is lazily evaluated. The resolution procedure needs to follow this order: resolve the children first, then the node itself, and then deduplicate the attributes' expression IDs in its children.
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170189725 **[Test build #49038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49038/consoleFull)** for PR 10630 at commit [`f820c61`](https://github.com/apache/spark/commit/f820c616fe217494ccaed0bf74a0a7410ce503cf).
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10620#issuecomment-170189419 **[Test build #49037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49037/consoleFull)** for PR 10620 at commit [`119a055`](https://github.com/apache/spark/commit/119a055c7c3749ca6014635d280e3a28324e3b45).
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/10620#discussion_r49259620

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -936,6 +936,35 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     assert(e.getMessage.contains("Distinct window functions are not supported"))
   }

+  test("window function: better support of parentheses") {
+    val data = Seq(
+      WindowData(1, "a", 5),
+      WindowData(2, "a", 6),
+      WindowData(3, "b", 7),
+      WindowData(4, "b", 8),
+      WindowData(5, "c", 9),
+      WindowData(6, "c", 10)
+    )
+    sparkContext.parallelize(data).toDF().registerTempTable("windowData")
+
+    checkAnswer(
+      sql(
+        """
+          |select month, area, product,
+          |sum(product + 1) over (partition by ((1) + (1 - 1) -
+          |(2 * 1 / 2) + (1) + product - (product)) order by 2)
--- End diff --

This query is in the test because we want to make sure some corner cases pass, e.g. (expression) op (expression op expression). I will try simpler ones.
[GitHub] spark pull request: [SPARK-12577][SQL] Better support of parenthes...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/10620#discussion_r49259453

--- Diff: sql/catalyst/src/main/antlr3/org/apache/spark/sql/catalyst/parser/ExpressionParser.g ---
@@ -223,7 +223,12 @@ precedenceUnaryPrefixExpression
     ;

 precedenceUnarySuffixExpression
-    : precedenceUnaryPrefixExpression (a=KW_IS nullCondition)?
+    :
+    (
+    (LPAREN precedenceUnaryPrefixExpression RPAREN) => LPAREN precedenceUnaryPrefixExpression (a=KW_IS nullCondition)? RPAREN
--- End diff --

Yes. I think so.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170182210 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49035/ Test PASSed.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170182209 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170182139 **[Test build #49035 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49035/consoleFull)** for PR 10668 at commit [`bbd9c0d`](https://github.com/apache/spark/commit/bbd9c0d9066a68286310bccb9e1fbe36d3375371). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170177470 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170177471 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49033/ Test PASSed.
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170177405 **[Test build #49033 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49033/consoleFull)** for PR 10667 at commit [`ef3ec50`](https://github.com/apache/spark/commit/ef3ec50181f1e6588eb748d7241f5caa26de82db). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-170176403 **[Test build #49036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49036/consoleFull)** for PR 10238 at commit [`123d958`](https://github.com/apache/spark/commit/123d958ba05a36aebb2548f04418153979d243ed).
[GitHub] spark pull request: [SPARK-2750][WEB UI] Add https support to the ...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/10238#issuecomment-170175271 wtf. retest this please
[GitHub] spark pull request: [SPARK-3369] [CORE] [STREAMING] Java mapPartit...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10413#issuecomment-170175021 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49030/ Test PASSed.
[GitHub] spark pull request: [SPARK-3369] [CORE] [STREAMING] Java mapPartit...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10413#issuecomment-170175020 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-3369] [CORE] [STREAMING] Java mapPartit...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10413#issuecomment-170174835 **[Test build #49030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49030/consoleFull)** for PR 10413 at commit [`c3e0375`](https://github.com/apache/spark/commit/c3e0375a58365b770df8d1499efedc418cf20115). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user ehsanmok commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49256609

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -332,18 +336,18 @@ class BlockMatrix @Since("1.3.0") (
     if (rowsPerBlock == other.rowsPerBlock && colsPerBlock == other.colsPerBlock) {
       val addedBlocks = blocks.cogroup(other.blocks, createPartitioner())
         .map { case ((blockRowIndex, blockColIndex), (a, b)) =>
-          if (a.size > 1 || b.size > 1) {
-            throw new SparkException("There are multiple MatrixBlocks with indices: " +
-              s"($blockRowIndex, $blockColIndex). Please remove them.")
-          }
-          if (a.isEmpty) {
-            new MatrixBlock((blockRowIndex, blockColIndex), b.head)
-          } else if (b.isEmpty) {
-            new MatrixBlock((blockRowIndex, blockColIndex), a.head)
-          } else {
-            val result = a.head.toBreeze + b.head.toBreeze
-            new MatrixBlock((blockRowIndex, blockColIndex), Matrices.fromBreeze(result))
-          }
+          if (a.size > 1 || b.size > 1) {
--- End diff --

Isn't it the same indentation [here](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala#L334)? I don't think I changed anything there!
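The quoted `cogroup`-based addition pairs up blocks by (row, column) index, keeps one-sided blocks as-is, and adds element-wise where both sides have a block. A plain-Python sketch of that logic, with dicts standing in for the cogrouped RDD (names are mine; the duplicate-block error branch is elided since dict keys are unique):

```python
def add_block_matrices(a_blocks, b_blocks):
    # a_blocks / b_blocks: {(blockRow, blockCol): 2-D list of floats}
    result = {}
    for key in set(a_blocks) | set(b_blocks):
        a, b = a_blocks.get(key), b_blocks.get(key)
        if a is None:        # block present only on the right side
            result[key] = b
        elif b is None:      # block present only on the left side
            result[key] = a
        else:                # both present: element-wise sum
            result[key] = [[x + y for x, y in zip(ra, rb)]
                           for ra, rb in zip(a, b)]
    return result
```

This also shows why the two matrices must agree on `rowsPerBlock`/`colsPerBlock`: otherwise blocks with the same index would not cover the same region and could not be summed element-wise.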
[GitHub] spark pull request: [SPARK-4628][BUILD] Remove all non-Maven-Centr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10659#issuecomment-170170011 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-4628][BUILD] Remove all non-Maven-Centr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10659#issuecomment-170170013 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49029/ Test PASSed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49255412

--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream

-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-           'PowerIterationClusteringModel', 'PowerIterationClustering',
-           'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
            'LDA', 'LDAModel']

 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.train(sc.parallelize(data), k=4)
+    >>> p = array([0.0, 0.0])
+    >>> model.predict(p) == model.predict(p)
+    True
+    >>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+    True
+    >>> model.k
+    4
+    >>> model.computeCost(array([0.0, 0.0]))
+    0.0
+    >>> model.k == len(model.clusterCenters)
+    True
+    >>> model = bskm.train(sc.parallelize(data), k=2)
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+    True
+    >>> model.k
+    2
+
+    .. versionadded:: 2.0.0
+    """
+
+    @property
+    @since('2.0.0')
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return [c.toArray() for c in self.call("clusterCenters")]
+
+    @property
+    @since('2.0.0')
+    def k(self):
+        """Get the number of clusters"""
+        return self.call("k")
+
+    @since('2.0.0')
+    def predict(self, x):
+        """
+        Find the cluster to which x belongs in this model.
+
+        :param x: Either the point to determine the cluster for or an RDD of
+                  points to determine the clusters for.
+        """
+        if isinstance(x, RDD):
+            vecs = x.map(_convert_to_vector)
+            return self.call("predict", vecs)
+
+        x = _convert_to_vector(x)
+        return self.call("predict", x)
+
+    @since('2.0.0')
+    def computeCost(self, point):
+        """
+        Return the Bisecting K-means cost (sum of squared distances of points to
+        their nearest center) for this model on the given data.
+
+        :param point: the point to compute the cost to
+        """
+        return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+    """
+    .. note:: Experimental
+
+    A bisecting k-means algorithm based on the paper "A comparison of document clustering
--- End diff --

Also, we have ~380 docstring lines over the length of 72. I'll file a cleanup JIRA for this.
[GitHub] spark pull request: [SPARK-4628][BUILD] Remove all non-Maven-Centr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10659#issuecomment-170169877 **[Test build #49029 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49029/consoleFull)** for PR 10659 at commit [`e125f50`](https://github.com/apache/spark/commit/e125f50f84e09bc3176f5d0bb96cab2f4dbc29a1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170168246 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170168247 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49034/ Test FAILed.
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170167622 **[Test build #49035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49035/consoleFull)** for PR 10668 at commit [`bbd9c0d`](https://github.com/apache/spark/commit/bbd9c0d9066a68286310bccb9e1fbe36d3375371).
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
Github user ajbozarth commented on the pull request: https://github.com/apache/spark/pull/10668#issuecomment-170163831 Screenshots:

Initial page load
![initial](https://cloud.githubusercontent.com/assets/13952758/12212748/96678fee-b622-11e5-8fb6-d60e71ed8303.png)

Sort by address
![sortaddrbottom](https://cloud.githubusercontent.com/assets/13952758/12212752/967c5fb4-b622-11e5-837e-acaad61ba70c.png)
![sortaddrtop](https://cloud.githubusercontent.com/assets/13952758/12212751/967c5492-b622-11e5-8255-ee3a9cf6cbc9.png)

Sort by ID
![sortidbottom](https://cloud.githubusercontent.com/assets/13952758/12212754/967ed0aa-b622-11e5-875c-6cb59a467184.png)
![sortidtop](https://cloud.githubusercontent.com/assets/13952758/12212750/967c4f1a-b622-11e5-9b06-95595b1ebdca.png)

Sort by Task Count
![sorttasksbottom](https://cloud.githubusercontent.com/assets/13952758/12212753/967c5cee-b622-11e5-82d1-37aa93e256ea.png)
![sorttaskstop](https://cloud.githubusercontent.com/assets/13952758/12212749/967ac4a6-b622-11e5-90c4-857dca45c80e.png)
[GitHub] spark pull request: [SPARK-12716] [Web UI] Add a TOTALS row to the...
GitHub user ajbozarth opened a pull request: https://github.com/apache/spark/pull/10668

[SPARK-12716] [Web UI] Add a TOTALS row to the Executors Web UI

Created a TOTALS row containing the totals of each column in the executors UI. By default the TOTALS row appears at the top of the table. When a column is sorted, the TOTALS row will always sort to either the top or bottom of the table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajbozarth/spark spark12716

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10668.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10668

commit f0a725d2bc3fd0d42af88bc1488241b41c552a6f
Author: Alex Bozarth
Date: 2016-01-08T20:37:57Z

    Added a TOTALS row to the executors UI
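The pinned-TOTALS sorting behavior described in the PR can be modeled in a few lines. This is only a Python sketch of the idea (the actual PR implements it in the executors page's table code; the row and key names here are made up for illustration):

```python
def sort_with_totals(rows, totals, key, reverse=False):
    """Sort the data rows by `key`, then pin the TOTALS row to an end
    of the table: top for ascending sorts, bottom for descending sorts,
    so it never lands in the middle of the data."""
    body = sorted(rows, key=key, reverse=reverse)
    return body + [totals] if reverse else [totals] + body


rows = [{"id": 2, "tasks": 7}, {"id": 1, "tasks": 3}]
totals = {"id": "TOTALS", "tasks": 10}

# Ascending sort by task count: TOTALS stays on top, data rows follow in order.
table = sort_with_totals(rows, totals, key=lambda r: r["tasks"])
```

The key design point is that the totals row is excluded from the comparison entirely rather than given a sentinel sort value, which keeps it pinned regardless of which column is sorted.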
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49253655

--- Diff: python/pyspark/mllib/clustering.py ---
(same hunk as quoted above; this comment is attached to the line `Find the cluster to which x belongs in this model.`)
--- End diff --

Agreed; this is, however, the same text as used in KMeansModel, so I'll update that one's docstring as well.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49253296

--- Diff: python/pyspark/mllib/clustering.py ---
(same hunk as quoted above; this comment is attached to the line `A bisecting k-means algorithm based on the paper "A comparison of document clustering`)
--- End diff --

Are we sure about the 74? Looking at PEP 8/PEP 257, it says 72 (although we extended the limit for code lines, so maybe we changed that too)? We could try to add a lint rule for this in the future.
[GitHub] spark pull request: [SPARK-12634][Python][MLlib][DOC] Update param...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10601#issuecomment-170159816 I just added a note to the parent JIRA about a formatting issue affecting all 5 PRs: [https://issues.apache.org/jira/browse/SPARK-11219?focusedCommentId=15090225&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15090225] Could you please check it out & ping when I should review again? Thank you!
[GitHub] spark pull request: [SPARK-12632][Python][Make Parameter Descripti...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10602#issuecomment-170159780 I just added a note to the parent JIRA about a formatting issue affecting all 5 PRs: [https://issues.apache.org/jira/browse/SPARK-11219?focusedCommentId=15090225&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15090225] Could you please check it out & ping when I should review again? Thank you!
[GitHub] spark pull request: [SPARK-12633][Python][MLlib][DOC] Update param...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10600#issuecomment-170159799 I just added a note to the parent JIRA about a formatting issue affecting all 5 PRs: [https://issues.apache.org/jira/browse/SPARK-11219?focusedCommentId=15090225&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15090225] Could you please check it out & ping when I should review again? Thank you!
[GitHub] spark pull request: [SPARK-12630][Python][MLlib][DOC] Update param...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10598#issuecomment-170159733 I just added a note to the parent JIRA about a formatting issue affecting all 5 PRs: [https://issues.apache.org/jira/browse/SPARK-11219?focusedCommentId=15090225&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15090225] Could you please check it out & ping when I should review again? Thank you!
[GitHub] spark pull request: [SPARK-12631] [PYSPARK] [DOC] PySpark clusteri...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10610#issuecomment-170159752 I just added a note to the parent JIRA about a formatting issue affecting all 5 PRs: [https://issues.apache.org/jira/browse/SPARK-11219?focusedCommentId=15090225&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15090225] Could you please check it out & ping when I should review again? Thank you!
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170159640 **[Test build #49033 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49033/consoleFull)** for PR 10667 at commit [`ef3ec50`](https://github.com/apache/spark/commit/ef3ec50181f1e6588eb748d7241f5caa26de82db).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49252449

--- Diff: python/pyspark/mllib/clustering.py ---
(same hunk as quoted above; this comment is attached to the line `A bisecting k-means algorithm based on the paper "A comparison of document clustering`)
--- End diff --

Update: it should actually be 74 chars. You can check by running ```pydoc pyspark``` from the spark/python directory with the terminal resized to 80 chars wide.
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/10667

[SPARK-12730][TESTS] De-duplicate some test code in BlockManagerSuite

This patch deduplicates some test code in BlockManagerSuite. I'm splitting this change off from a larger PR in order to make things easier to review.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark block-mgr-tests-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10667.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10667

commit ef3ec50181f1e6588eb748d7241f5caa26de82db
Author: Josh Rosen
Date: 2016-01-08T23:37:58Z

    First round of de-duplication
[GitHub] spark pull request: [SPARK-12730][TESTS] De-duplicate some test co...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/10667#issuecomment-170158610 /cc @andrewor14 for review.
[GitHub] spark pull request: [SPARK-12696] Backport Dataset Bug fixes to 1....
Github user marmbrus closed the pull request at: https://github.com/apache/spark/pull/10650
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251948

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -317,14 +317,18 @@ class BlockMatrix @Since("1.3.0") (
   }

   /**
-   * Adds two block matrices together. The matrices must have the same size and matching
-   * `rowsPerBlock` and `colsPerBlock` values. If one of the blocks that are being added are
-   * instances of [[SparseMatrix]], the resulting sub matrix will also be a [[SparseMatrix]], even
-   * if it is being added to a [[DenseMatrix]]. If two dense matrices are added, the output will
-   * also be a [[DenseMatrix]].
+   * For given matrices `this` and `other` of compatible dimensions and compatible block
+   * dimensions, it applies an associative binary function on their corresponding blocks.
+   *
+   * @param other The BlockMatrix to operate on
+   * @param binMap An associative function taking two dense breeze matrices and returning a
--- End diff --

Not associative. Also, this should operate on any Breeze Matrix, not just dense ones, right?
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251953

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -332,18 +336,18 @@ class BlockMatrix @Since("1.3.0") (
     if (rowsPerBlock == other.rowsPerBlock && colsPerBlock == other.colsPerBlock) {
       val addedBlocks = blocks.cogroup(other.blocks, createPartitioner())
         .map { case ((blockRowIndex, blockColIndex), (a, b)) =>
-          if (a.size > 1 || b.size > 1) {
-            throw new SparkException("There are multiple MatrixBlocks with indices: " +
-              s"($blockRowIndex, $blockColIndex). Please remove them.")
-          }
-          if (a.isEmpty) {
-            new MatrixBlock((blockRowIndex, blockColIndex), b.head)
-          } else if (b.isEmpty) {
-            new MatrixBlock((blockRowIndex, blockColIndex), a.head)
-          } else {
-            val result = a.head.toBreeze + b.head.toBreeze
-            new MatrixBlock((blockRowIndex, blockColIndex), Matrices.fromBreeze(result))
-          }
+        if (a.size > 1 || b.size > 1) {
+          throw new SparkException("There are multiple MatrixBlocks with indices: " +
+            s"($blockRowIndex, $blockColIndex). Please remove them.")
+        }
+        if (a.isEmpty) {
+          new MatrixBlock((blockRowIndex, blockColIndex), b.head)
+        } else if (b.isEmpty) {
+          new MatrixBlock((blockRowIndex, blockColIndex), a.head)
--- End diff --

This and line 344 are incorrect. What if you write `a - b` but `a` has no block? Then the resulting block will be "b" but should be "-b". Before you fix this, I'd recommend improving the unit test to catch this case & fail; then you can fix it.
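The bug pointed out here can be sketched, together with its fix, in a few lines of Python (an illustration only, using nested lists in place of Spark's Breeze matrix blocks, with made-up names): when one side has no block at an index, substitute a zero matrix and still apply `binMap`, so that for subtraction a missing left block yields `-b` rather than `b`.

```python
def block_map(blocks_a, blocks_b, bin_map):
    """Merge two {(row, col): matrix} dicts by applying bin_map blockwise.

    A missing block is treated as a zero matrix of the same shape, which
    keeps non-commutative operators like subtraction correct."""

    def zeros_like(m):
        return [[0.0] * len(m[0]) for _ in m]

    def apply_elementwise(f, x, y):
        return [[f(u, v) for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]

    out = {}
    for idx in set(blocks_a) | set(blocks_b):
        a = blocks_a.get(idx)
        b = blocks_b.get(idx)
        if a is None:
            a = zeros_like(b)  # left side absent: use a zero block, not b itself
        elif b is None:
            b = zeros_like(a)
        out[idx] = apply_elementwise(bin_map, a, b)
    return out


sub = lambda x, y: x - y
left = {(0, 0): [[1.0, 2.0]]}
right = {(0, 0): [[0.5, 0.5]], (0, 1): [[3.0, 4.0]]}

# (0, 0) subtracts normally; (0, 1) has no left block, so the result is -right.
result = block_map(left, right, sub)
```

The real code operates on RDDs via `cogroup`, but the zero-fill semantics are the same idea as the fix jkbradley requests.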
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251958

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -351,6 +355,28 @@ class BlockMatrix @Since("1.3.0") (
     }
   }

+  /**
+   * Adds two block matrices together. The matrices must have the same size and matching
+   * `rowsPerBlock` and `colsPerBlock` values. If one of the blocks that are being added are
+   * instances of [[SparseMatrix]], the resulting sub matrix will also be a [[SparseMatrix]], even
+   * if it is being added to a [[DenseMatrix]]. If two dense matrices are added, the output will
+   * also be a [[DenseMatrix]].
+   */
+  @Since("1.3.0")
+  def add(other: BlockMatrix): BlockMatrix =
+    blockMap(other, (x: BM[Double], y: BM[Double]) => x + y)
+
+  /**
+   * Subtracts two block matrices together. The matrices must have the same size and matching
+   * `rowsPerBlock` and `colsPerBlock` values. If one of the blocks that are being added are
+   * instances of [[SparseMatrix]], the resulting sub matrix will also be a [[SparseMatrix]], even
+   * if it is being added to a [[DenseMatrix]]. If two dense matrices are added, the output will
+   * also be a [[DenseMatrix]].
+   */
+  @Since("1.6.0")
--- End diff --

Now needs to be updated to 2.0.0.
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251939

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
(same hunk as quoted above, @@ -317,14 +317,18 @@; this comment is attached to the line `it applies an associative binary function on their corresponding blocks.`)
--- End diff --

Not associative (subtraction is not).
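A quick concrete check of the point above: subtraction is a binary operation but not an associative one, so documenting `binMap` as taking an "associative function" would rule out `subtract`. Plain numbers stand in for matrix blocks here:

```python
# The binMap used by subtract() applied elementwise; scalars suffice to
# demonstrate the algebraic property.
sub = lambda x, y: x - y

a, b, c = 5.0, 3.0, 2.0
grouped_left = sub(sub(a, b), c)   # (5 - 3) - 2
grouped_right = sub(a, sub(b, c))  # 5 - (3 - 2)
# Associativity would require the two groupings to agree; they do not.
```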
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251943

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
(same hunk as quoted above, @@ -317,14 +317,18 @@; this comment is attached to the line `@param other The BlockMatrix to operate on`)
--- End diff --

"operate on" sounds like "other" is being modified. Rephrase: "The second BlockMatrix argument for the operator specified by `binMap`"
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251950 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -332,18 +336,18 @@ class BlockMatrix @Since("1.3.0") ( if (rowsPerBlock == other.rowsPerBlock && colsPerBlock == other.colsPerBlock) { val addedBlocks = blocks.cogroup(other.blocks, createPartitioner()) .map { case ((blockRowIndex, blockColIndex), (a, b)) => - if (a.size > 1 || b.size > 1) { -throw new SparkException("There are multiple MatrixBlocks with indices: " + - s"($blockRowIndex, $blockColIndex). Please remove them.") - } - if (a.isEmpty) { -new MatrixBlock((blockRowIndex, blockColIndex), b.head) - } else if (b.isEmpty) { -new MatrixBlock((blockRowIndex, blockColIndex), a.head) - } else { -val result = a.head.toBreeze + b.head.toBreeze -new MatrixBlock((blockRowIndex, blockColIndex), Matrices.fromBreeze(result)) - } +if (a.size > 1 || b.size > 1) { --- End diff -- style: Fix indentation (The change was incorrect, or accidental.)
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9916#discussion_r49251954 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -351,6 +355,28 @@ class BlockMatrix @Since("1.3.0") ( } } + /** + * Adds two block matrices together. The matrices must have the same size and matching + * `rowsPerBlock` and `colsPerBlock` values. If one of the blocks that are being added are + * instances of [[SparseMatrix]], the resulting sub matrix will also be a [[SparseMatrix]], even + * if it is being added to a [[DenseMatrix]]. If two dense matrices are added, the output will + * also be a [[DenseMatrix]]. + */ + @Since("1.3.0") + def add(other: BlockMatrix): BlockMatrix = +blockMap(other, (x: BM[Double], y: BM[Double]) => x + y) + + /** + * Subtracts two block matrices together. The matrices must have the same size and matching --- End diff -- ```Subtracts two block matrices together.``` --> ```Subtracts the given block matrix `other` from this block matrix: `this - other`.```
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user ehsanmok commented on the pull request: https://github.com/apache/spark/pull/9916#issuecomment-170157265 @jkbradley thank you! I'm guessing that'd be suitable for Spark 1.6.1, so the `Since` annotations should be updated, right?
[GitHub] spark pull request: [SPARK-11826][MLlib] Refactor add() and subtra...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/9916#issuecomment-170156407 @ehsanmok Apologies for the slow review. We constantly have ~100 pending PRs and many more JIRAs, so they can be hard to cover with limited reviewer bandwidth. I'll take a look now.
[GitHub] spark pull request: [SPARK-10509][PYSPARK] Reduce excessive param ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10216#issuecomment-170156115 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10509][PYSPARK] Reduce excessive param ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10216#issuecomment-170156117 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49032/
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251302 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) --- End diff -- Specify number of partitions for sc.parallelize; not doing so has caused flaky tests in the past (because of randomization interacting with partitioning).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170156045 @holdenk Thanks for the PR! That's all for now.
[GitHub] spark pull request: [SPARK-10509][PYSPARK] Reduce excessive param ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10216#issuecomment-170156009 **[Test build #49032 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49032/consoleFull)** for PR 10216 at commit [`e0f3f00`](https://github.com/apache/spark/commit/e0f3f00d761b0b53860dd0f06de320c9fdc84958). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251291 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) +True +>>> model.predict(sc.parallelize([p])).first() == model.predict(p) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model.
+ +:param x: Either the point to determine the cluster for or an RDD of points to determine --- End diff -- Confusing doc; reword. Also fix indentation on next line.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251293 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) +True +>>> model.predict(sc.parallelize([p])).first() == model.predict(p) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model.
+ +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. +""" +if isinstance(x, RDD): +vecs = x.map(_convert_to_vector) +return self.call("predict", vecs) + +x = _convert_to_vector(x) +return self.call("predict", x) + +@since('2.0.0') +def computeCost(self, point): +""" +Return the Bisecting K-means cost (sum of squared distances of points to +their nearest center) for this model on the given data. + +:param point: the point to compute the cost to +""" +return self.call("computeCost", _convert_to_vector(point)) + + +class BisectingKMeans: +""" +.. note:: Experimental + +A bisecting k-means algorithm based on the paper "A comparison of document clustering --- End diff -- I believe we try to limit doc lines in Python to <= 80 chars (unlike code, which is <= 100 chars). Could you please update this and other parts?
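The `predict` method quoted in the diff dispatches on its argument: an `RDD` of points is mapped through the model, while a single point is predicted directly. A pure-Python sketch of that dispatch pattern, with NumPy array dimensionality standing in for the `isinstance(x, RDD)` check and a simple nearest-center assignment standing in for the JVM call (the function name and logic are illustrative, not PySpark's implementation):

```python
import numpy as np


def predict(centers, x):
    """Dispatch on input shape: a single point yields one cluster index;
    a 2-D collection of points (standing in for an RDD) yields one index
    per point, mirroring the RDD-vs-point branch in the diff above."""
    def nearest(p):
        # Index of the closest center by squared Euclidean distance.
        return int(np.argmin([np.sum((p - c) ** 2) for c in centers]))

    arr = np.asarray(x, dtype=float)
    if arr.ndim == 2:            # many points: the "RDD" branch
        return [nearest(p) for p in arr]
    return nearest(arr)          # single point
```

A clearer `:param x:` doc, per the review comment, might read: "A single data point, or an RDD of points, to assign to the nearest cluster."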
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251288 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) --- End diff -- I'd write this as more of an example than a unit test. It's good to exercise all functionality, but unit test code should go in tests.py. (We have been inconsistent about this, but it'd be good to improve.)