[GitHub] spark issue #14732: [SPARK-16320] [DOC] Document G1 heap region's effect on ...

2016-08-22 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14732
  
Oh heh, too late. No problem; we may further improve the GC docs soon 
anyway. The existing link wasn't wrong.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14759: [SPARK-16577][SPARKR] Add CRAN documentation checks to r...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14759
  
**[Test build #64224 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64224/consoleFull)** for PR 14759 at commit [`349d95d`](https://github.com/apache/spark/commit/349d95d0ce933d6670d5326ab560ccef420b814e).





[GitHub] spark issue #14759: [SPARK-16577][SPARKR] Add CRAN documentation checks to r...

2016-08-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14759
  
cc @felixcheung @junyangq 





[GitHub] spark pull request #14759: [SPARK-16577][SPARKR] Add CRAN documentation chec...

2016-08-22 Thread shivaram
GitHub user shivaram opened a pull request:

https://github.com/apache/spark/pull/14759

[SPARK-16577][SPARKR] Add CRAN documentation checks to run-tests.sh

## What changes were proposed in this pull request?

This change adds CRAN documentation checks to be run as part of 
`R/run-tests.sh`. As this script is also used by Jenkins, this means that we 
will get documentation checks on every PR going forward.

## How was this patch tested?

The checks are exercised by `R/run-tests.sh` itself, which Jenkins runs for every PR.
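For context, a CRAN-style documentation check is typically driven by `R CMD check`. The sketch below only assembles and prints such a command; the flags and package path are assumptions for illustration (not necessarily what `R/run-tests.sh` uses), and actually running the check requires a local R installation.

```shell
# Hypothetical shape of a CRAN documentation check invocation; the flags
# and package directory are illustrative assumptions, not the exact ones
# used by R/run-tests.sh.
PKG_DIR="pkg/SparkR"
CHECK_CMD="R CMD check --as-cran --no-tests $PKG_DIR"
echo "$CHECK_CMD"
```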





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shivaram/spark-1 sparkr-cran-jenkins

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14759.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14759


commit 349d95d0ce933d6670d5326ab560ccef420b814e
Author: Shivaram Venkataraman 
Date:   2016-08-21T20:43:15Z

Add CRAN documentation checks to run-tests.sh







[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14079
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64215/





[GitHub] spark pull request #12889: [SPARK-15113][PySpark][ML] Add missing num featur...

2016-08-22 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12889#discussion_r75729345
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -788,6 +788,8 @@ class GeneralizedLinearRegressionModel private[ml] (
   @Since("2.0.0")
   override def write: MLWriter =
     new GeneralizedLinearRegressionModel.GeneralizedLinearRegressionModelWriter(this)
+
+  override val numFeatures: Int = coefficients.size
--- End diff --

Is that reflected in the documentation?





[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14079
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14732: [SPARK-16320] [DOC] Document G1 heap region's effect on ...

2016-08-22 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/14732
  
LGTM. Merging to master and branch 2.0. Thanks @srowen 





[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14079
  
**[Test build #64215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64215/consoleFull)** for PR 14079 at commit [`fc45f5b`](https://github.com/apache/spark/commit/fc45f5b2e2fc38aff0714f1465f03f5e0ba16e01).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #12889: [SPARK-15113][PySpark][ML] Add missing num featur...

2016-08-22 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/12889#discussion_r75728898
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -788,6 +788,8 @@ class GeneralizedLinearRegressionModel private[ml] (
   @Since("2.0.0")
   override def write: MLWriter =
     new GeneralizedLinearRegressionModel.GeneralizedLinearRegressionModelWriter(this)
+
+  override val numFeatures: Int = coefficients.size
--- End diff --

The base class has `@Since("1.6.0")` on the method - so it has been public 
since 1.6 already.





[GitHub] spark pull request #14732: [SPARK-16320] [DOC] Document G1 heap region's eff...

2016-08-22 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14732#discussion_r75728831
  
--- Diff: docs/tuning.md ---
@@ -217,14 +204,22 @@ temporary objects created during task execution. Some steps which may be useful
 * Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times for
   before a task completes, it means that there isn't enough memory available for executing tasks.
 
-* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
-  memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer
-  objects than to slow down task execution!
-
 * If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You
   can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden
   is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling
   up by 4/3 is to account for space used by survivor regions as well.)
+
+* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
+  memory used for caching by lowering `spark.memory.fraction`; it is better to cache fewer
+  objects than to slow down task execution. Alternatively, consider decreasing the size of
+  the Young generation. This means lowering `-Xmn` if you've set it as above. If not, try changing the
+  value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, meaning that the Old generation
+  occupies 2/3 of the heap. It should be large enough such that this fraction exceeds `spark.memory.fraction`.
--- End diff --

sounds good.





[GitHub] spark issue #14758: [SPARKR][MINOR] Add Xiangrui and Felix to maintainers

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14758
  
**[Test build #64223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64223/consoleFull)** for PR 14758 at commit [`3ab82a0`](https://github.com/apache/spark/commit/3ab82a0d3828faa084b7bf77aebb62c7d89db775).





[GitHub] spark pull request #14732: [SPARK-16320] [DOC] Document G1 heap region's eff...

2016-08-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14732#discussion_r75727927
  
--- Diff: docs/tuning.md ---
@@ -217,14 +204,22 @@ temporary objects created during task execution. Some steps which may be useful
 * Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times for
   before a task completes, it means that there isn't enough memory available for executing tasks.
 
-* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
-  memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer
-  objects than to slow down task execution!
-
 * If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You
   can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden
   is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling
   up by 4/3 is to account for space used by survivor regions as well.)
+
+* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
+  memory used for caching by lowering `spark.memory.fraction`; it is better to cache fewer
+  objects than to slow down task execution. Alternatively, consider decreasing the size of
+  the Young generation. This means lowering `-Xmn` if you've set it as above. If not, try changing the
+  value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, meaning that the Old generation
+  occupies 2/3 of the heap. It should be large enough such that this fraction exceeds `spark.memory.fraction`.
--- End diff --

I tried to retain all those ideas but reworded them, because the section 
where I moved this also contains some of this discussion. I believe the current 
discussion still captures the main idea: an old generation nearly full of 
cached data indicates that `spark.memory.fraction` (not just the fraction for 
storage) could be reduced. This section talks about `-Xmn`, which does 
something similar to `NewRatio`, so I tried to weave them into one coherent 
paragraph.
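As a concrete illustration of the sizing rules discussed above, the arithmetic can be sketched in a few lines of shell. The Eden estimate `E` and the `NewRatio` default are illustrative assumptions, not recommendations; actual values depend on the workload and JVM.

```shell
# Sketch of the Young-generation sizing arithmetic from the tuning docs.
# E_MIB is an assumed per-executor Eden estimate in MiB.
E_MIB=3072
YOUNG_MIB=$(( E_MIB * 4 / 3 ))      # -Xmn target: 4/3 * E to cover survivor space

# With the common JVM default NewRatio=2, the Old generation occupies
# NewRatio / (NewRatio + 1) = 2/3 of the heap; this fraction should
# exceed spark.memory.fraction.
NEW_RATIO=2
OLD_FRACTION=$(awk "BEGIN { printf \"%.3f\", $NEW_RATIO / ($NEW_RATIO + 1) }")

echo "$YOUNG_MIB $OLD_FRACTION"     # 4096 0.667
# The resulting size would then be applied with something along the lines of:
#   spark-submit --conf "spark.executor.extraJavaOptions=-Xmn${YOUNG_MIB}m" ...
```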





[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...

2016-08-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14735
  
Leaving it out of branch-2.0 sounds good to me. 





[GitHub] spark issue #14758: [SPARKR][MINOR] Add Xiangrui and Felix to maintainers

2016-08-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14758
  
cc @mengxr @felixcheung 

FYI - This is mostly to ensure that we can have more maintainers who can 
update the CRAN submissions. This shouldn't affect anything else on the 
development side.





[GitHub] spark pull request #14758: [SPARKR][MINOR] Add Xiangrui and Felix to maintai...

2016-08-22 Thread shivaram
GitHub user shivaram opened a pull request:

https://github.com/apache/spark/pull/14758

[SPARKR][MINOR] Add Xiangrui and Felix to maintainers

## What changes were proposed in this pull request?

This change adds Xiangrui Meng and Felix Cheung to the maintainers field in 
the package description.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)






You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shivaram/spark-1 sparkr-maintainers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14758.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14758


commit 3ab82a0d3828faa084b7bf77aebb62c7d89db775
Author: Shivaram Venkataraman 
Date:   2016-08-22T18:05:13Z

Add Xiangrui and Felix to maintainers







[GitHub] spark pull request #12889: [SPARK-15113][PySpark][ML] Add missing num featur...

2016-08-22 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12889#discussion_r75727539
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -788,6 +788,8 @@ class GeneralizedLinearRegressionModel private[ml] (
   @Since("2.0.0")
   override def write: MLWriter =
     new GeneralizedLinearRegressionModel.GeneralizedLinearRegressionModelWriter(this)
+
+  override val numFeatures: Int = coefficients.size
--- End diff --

We still need to add this, don't we? Otherwise it is the only public method 
in this class that doesn't have it.






[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14753
  
**[Test build #64213 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64213/consoleFull)** for PR 14753 at commit [`10861b2`](https://github.com/apache/spark/commit/10861b207e8cac0b7348b374d9054c4de03b7965).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate`





[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...

2016-08-22 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14735
  
I don't think they should be required for branch-2.0. Some of the 
signature changes with ... are likely good to have for consistency, but those 
might also be "breaking" for a *.0.1 release.

If we think we should - since we did make some changes like that in the 2.0.0 
branch - I could open a PR for the branch separately.







[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14753
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14753
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64213/





[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14753
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14753
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64211/





[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14753
  
**[Test build #64211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64211/consoleFull)** for PR 14753 at commit [`6efddad`](https://github.com/apache/spark/commit/6efddadcb8e6d48e9898a8980f4dcceee4894ebc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate`





[GitHub] spark issue #14757: [SPARK-17190] [SQL] Removal of HiveSharedState

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14757
  
**[Test build #64222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64222/consoleFull)** for PR 14757 at commit [`f63826e`](https://github.com/apache/spark/commit/f63826ed5c35b6f1b11c891415fe568c14bdfac7).





[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14750
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64216/





[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14750
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14750
  
**[Test build #64216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64216/consoleFull)** for PR 14750 at commit [`8fc6bcc`](https://github.com/apache/spark/commit/8fc6bccec1c4fe34116a262d20f3a97e87024e3a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14732: [SPARK-16320] [DOC] Document G1 heap region's eff...

2016-08-22 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14732#discussion_r75722532
  
--- Diff: docs/tuning.md ---
@@ -122,21 +122,8 @@ large records.
 `R` is the storage space within `M` where cached blocks immune to being evicted by execution.
 
 The value of `spark.memory.fraction` should be set in order to fit this amount of heap space
-comfortably within the JVM's old or "tenured" generation. Otherwise, when much of this space is
-used for caching and execution, the tenured generation will be full, which causes the JVM to
-significantly increase time spent in garbage collection. See
-<a href="https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html">Java GC sizing documentation</a>
-for more information.
--- End diff --

Should we keep the link to this reference in the `Advanced GC Tuning`?





[GitHub] spark pull request #14757: [SPARK-17190] [SQL] Removal of HiveSharedState

2016-08-22 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/14757

[SPARK-17190] [SQL] Removal of HiveSharedState

### What changes were proposed in this pull request?
Since `HiveClient` is used to interact with the Hive metastore, it should 
be hidden in `HiveExternalCatalog`. After moving `HiveClient` into 
`HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of 
`HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes 
straightforward. After removal of `HiveSharedState`, the reflection logic is 
directly applied on the choice of `ExternalCatalog` types, based on the 
configuration of `CATALOG_IMPLEMENTATION`. 

`HiveClient` is also used/invoked by other entities besides 
`HiveExternalCatalog`, so we define the following two APIs:
```Scala
  /**
   * Return the existing [[HiveClient]] used to interact with the metastore.
   */
  def getClient: HiveClient

  /**
   * Return a [[HiveClient]] as a new session
   */
  def getNewClient: HiveClient
```

### How was this patch tested?
The existing test cases

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark removeHiveClient

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14757.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14757









[GitHub] spark pull request #14732: [SPARK-16320] [DOC] Document G1 heap region's eff...

2016-08-22 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14732#discussion_r75722287
  
--- Diff: docs/tuning.md ---
@@ -217,14 +204,22 @@ temporary objects created during task execution. Some 
steps which may be useful
 * Check if there are too many garbage collections by collecting GC stats. 
If a full GC is invoked multiple times
   before a task completes, it means that there isn't enough memory 
available for executing tasks.
 
-* In the GC stats that are printed, if the OldGen is close to being full, 
reduce the amount of
-  memory used for caching by lowering `spark.memory.storageFraction`; it 
is better to cache fewer
-  objects than to slow down task execution!
-
 * If there are too many minor collections but not many major GCs, 
allocating more memory for Eden would help. You
   can set the size of the Eden to be an over-estimate of how much memory 
each task will need. If the size of Eden
   is determined to be `E`, then you can set the size of the Young 
generation using the option `-Xmn=4/3*E`. (The scaling
   up by 4/3 is to account for space used by survivor regions as well.)
+  
+* In the GC stats that are printed, if the OldGen is close to being full, 
reduce the amount of
+  memory used for caching by lowering `spark.memory.fraction`; it is 
better to cache fewer
+  objects than to slow down task execution. Alternatively, consider 
decreasing the size of
+  the Young generation. This means lowering `-Xmn` if you've set it as 
above. If not, try changing the 
+  value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, 
meaning that the Old generation 
+  occupies 2/3 of the heap. It should be large enough such that this 
fraction exceeds `spark.memory.fraction`.
--- End diff --

Do we need to keep the following paragraph?
```
So, by default, the tenured generation occupies 2/3 or about 0.66 of the 
heap. A value of
0.6 for `spark.memory.fraction` keeps storage and execution memory within 
the old generation with
room to spare. If `spark.memory.fraction` is increased to, say, 0.8, then 
`NewRatio` may have to
increase to 6 or more.
```
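The arithmetic in that quoted paragraph can be checked with a small back-of-the-envelope sketch (plain Python, not Spark code; the function names and the 0.05 headroom are illustrative assumptions):

```python
def old_gen_fraction(new_ratio):
    # With -XX:NewRatio=N the JVM sizes old:young as N:1, so the old
    # (tenured) generation occupies N / (N + 1) of the heap.
    return new_ratio / (new_ratio + 1.0)

def min_new_ratio(memory_fraction, headroom=0.05):
    # Smallest integer NewRatio whose old-gen share still exceeds
    # spark.memory.fraction with some room to spare.
    n = 1
    while old_gen_fraction(n) < memory_fraction + headroom:
        n += 1
    return n
```

With the JVM default `NewRatio=2` the old generation is about 0.66 of the heap, so `spark.memory.fraction=0.6` fits with room to spare; raising the fraction to 0.8 pushes the required `NewRatio` to 6 under this headroom, which lines up with the quoted "6 or more".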





[GitHub] spark issue #14756: [SPARK-17189][SQL][MINOR] Prefers InternalRow over Unsaf...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14756
  
**[Test build #64221 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64221/consoleFull)**
 for PR 14756 at commit 
[`d600e68`](https://github.com/apache/spark/commit/d600e681bde23925dedf1261654a34894713f042).





[GitHub] spark pull request #14732: [SPARK-16320] [DOC] Document G1 heap region's eff...

2016-08-22 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14732#discussion_r75722253
  
--- Diff: docs/tuning.md ---
@@ -217,14 +204,22 @@ temporary objects created during task execution. Some 
steps which may be useful
 * Check if there are too many garbage collections by collecting GC stats. 
If a full GC is invoked multiple times
   before a task completes, it means that there isn't enough memory 
available for executing tasks.
 
-* In the GC stats that are printed, if the OldGen is close to being full, 
reduce the amount of
-  memory used for caching by lowering `spark.memory.storageFraction`; it 
is better to cache fewer
-  objects than to slow down task execution!
-
 * If there are too many minor collections but not many major GCs, 
allocating more memory for Eden would help. You
   can set the size of the Eden to be an over-estimate of how much memory 
each task will need. If the size of Eden
   is determined to be `E`, then you can set the size of the Young 
generation using the option `-Xmn=4/3*E`. (The scaling
   up by 4/3 is to account for space used by survivor regions as well.)
+  
+* In the GC stats that are printed, if the OldGen is close to being full, 
reduce the amount of
+  memory used for caching by lowering `spark.memory.fraction`; it is 
better to cache fewer
+  objects than to slow down task execution. Alternatively, consider 
decreasing the size of
+  the Young generation. This means lowering `-Xmn` if you've set it as 
above. If not, try changing the 
+  value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, 
meaning that the Old generation 
+  occupies 2/3 of the heap. It should be large enough such that this 
fraction exceeds `spark.memory.fraction`.
+  
+* Try the G1GC garbage collector with `-XX:+UseG1GC`. It can improve 
performance in some situations where
+  garbage collection is a bottleneck. Note that with large executor heap 
sizes, it may be important to
+  increase the [G1 region 
size](https://blogs.oracle.com/g1gc/entry/g1_gc_tuning_a_case) 
+  with `-XX:G1HeapRegionSize`
--- End diff --

Do we need to keep the following paragraph?
```
So, by default, the tenured generation occupies 2/3 or about 0.66 of the 
heap. A value of
0.6 for `spark.memory.fraction` keeps storage and execution memory within 
the old generation with
room to spare. If `spark.memory.fraction` is increased to, say, 0.8, then 
`NewRatio` may have to
increase to 6 or more.
```





[GitHub] spark pull request #14756: [SPARK-17189][SQL][MINOR] Prefers InternalRow ove...

2016-08-22 Thread clockfly
GitHub user clockfly opened a pull request:

https://github.com/apache/spark/pull/14756

[SPARK-17189][SQL][MINOR] Prefers InternalRow over UnsafeRow if UnsafeRow 
specific interface is not used in AggregationIterator

## What changes were proposed in this pull request?

Minor change to use `InternalRow` instead of `UnsafeRow` in the method 
declaration of `AggregationIterator.generateResultProjection(...)`, as 
UnsafeRow-specific methods are not used.

### Before change:
```
protected def generateResultProjection(): (UnsafeRow, MutableRow) => 
UnsafeRow
```

### After change
```
protected def generateResultProjection(): (InternalRow, MutableRow) => 
UnsafeRow
```
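Why the looser signature is safe can be shown with a toy sketch (hypothetical stand-in classes, not Spark's actual row hierarchy): a function declared against the general type still accepts every instance of the specialized subtype.

```python
class InternalRow:
    # General row interface; only generic accessors live here.
    def __init__(self, values):
        self.values = values

    def num_fields(self):
        return len(self.values)

class UnsafeRow(InternalRow):
    # Specialized subclass with an extra, representation-specific method
    # (a stand-in for Spark's byte-backed UnsafeRow).
    def backing_bytes(self):
        return bytes(len(self.values))

def generate_result_projection(row):
    # Declared against the general InternalRow interface because it only
    # calls num_fields(); nothing here needs the UnsafeRow-specific API,
    # so both InternalRow and UnsafeRow instances are accepted.
    return f"{row.num_fields()} fields"
</imports>
</imports>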

## How was this patch tested?
 
Existing test.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/clockfly/spark loose_row_interface

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14756.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14756


commit d600e681bde23925dedf1261654a34894713f042
Author: Sean Zhong 
Date:   2016-08-22T17:18:05Z

Use a looser interface for InternalRow in result projection







[GitHub] spark issue #14572: [SPARK-16552] [FOLLOW-UP] [SQL] Store the Inferred Schem...

2016-08-22 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/14572
  
sorry. I missed this PR. Can you update?





[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...

2016-08-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14735
  
@felixcheung I didn't look at the code very closely, but will this change 
be required in `branch-2.0` as well ? If so the merge might be hard to





[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...

2016-08-22 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14734#discussion_r75720021
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3058,7 +3057,7 @@ setMethod("str",
 #' @note drop since 2.0.0
 setMethod("drop",
   signature(x = "SparkDataFrame"),
-  function(x, col, ...) {
--- End diff --

Thanks - that sounds good





[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...

2016-08-22 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14734#discussion_r75719798
  
--- Diff: R/pkg/NAMESPACE ---
@@ -1,5 +1,9 @@
 # Imports from base R
-importFrom(methods, setGeneric, setMethod, setOldClass)
+# Do not include stats:: "rpois", "runif" - causes error at runtime
+importFrom("methods", "setGeneric", "setMethod", "setOldClass")
+importFrom("methods", "is", "new", "signature", "show")
--- End diff --

I was wondering about this part as well. 





[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...

2016-08-22 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14734#discussion_r75719536
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3058,7 +3057,7 @@ setMethod("str",
 #' @note drop since 2.0.0
 setMethod("drop",
   signature(x = "SparkDataFrame"),
-  function(x, col, ...) {
--- End diff --

This actually follows from the discussion in #14705. A summary may be seen 
at https://github.com/apache/spark/pull/14735#discussion_r75661714





[GitHub] spark issue #14734: [SPARK-16508][SPARKR] doc updates and more CRAN check fi...

2016-08-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14734
  
LGTM. I had a couple of minor comments inline. 





[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...

2016-08-22 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14734#discussion_r75718809
  
--- Diff: R/pkg/R/generics.R ---
@@ -1339,7 +1339,6 @@ setGeneric("spark.naiveBayes", function(data, 
formula, ...) { standardGeneric("s
 setGeneric("spark.survreg", function(data, formula) { 
standardGeneric("spark.survreg") })
 
 #' @rdname spark.lda
-#' @param ... Additional parameters to tune LDA.
--- End diff --

never mind - I see that this is moved to mllib.R





[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...

2016-08-22 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14734#discussion_r75718694
  
--- Diff: R/pkg/R/generics.R ---
@@ -1339,7 +1339,6 @@ setGeneric("spark.naiveBayes", function(data, 
formula, ...) { standardGeneric("s
 setGeneric("spark.survreg", function(data, formula) { 
standardGeneric("spark.survreg") })
 
 #' @rdname spark.lda
-#' @param ... Additional parameters to tune LDA.
--- End diff --

Just checking - removing `...` here is intentional ?





[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...

2016-08-22 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14734#discussion_r75718243
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3058,7 +3057,7 @@ setMethod("str",
 #' @note drop since 2.0.0
 setMethod("drop",
   signature(x = "SparkDataFrame"),
-  function(x, col, ...) {
--- End diff --

just to clarify removing `...` is intentional ? Just wondering as we have 
the `@param` documentation above





[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...

2016-08-22 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/14734#discussion_r75717898
  
--- Diff: R/pkg/NAMESPACE ---
@@ -1,5 +1,9 @@
 # Imports from base R
-importFrom(methods, setGeneric, setMethod, setOldClass)
+# Do not include stats:: "rpois", "runif" - causes error at runtime
+importFrom("methods", "setGeneric", "setMethod", "setOldClass")
+importFrom("methods", "is", "new", "signature", "show")
--- End diff --

Do these things show up as CRAN warnings? I don't see them on my machine





[GitHub] spark issue #14755: [MINOR][SQL] Fix some typos in comments and test hints

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14755
  
**[Test build #64219 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64219/consoleFull)**
 for PR 14755 at commit 
[`ea2d0cc`](https://github.com/apache/spark/commit/ea2d0cc34fe5da6e7b15825e1feb3cca2838d626).





[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14079
  
**[Test build #64220 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64220/consoleFull)**
 for PR 14079 at commit 
[`f8b1bff`](https://github.com/apache/spark/commit/f8b1bffee588df45809519436983cb95c6a481f3).





[GitHub] spark pull request #14743: [SparkR][Minor] Fix Cache Folder Path in Windows

2016-08-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14743





[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...

2016-08-22 Thread junyangq
Github user junyangq commented on the issue:

https://github.com/apache/spark/pull/14735
  
LGTM





[GitHub] spark pull request #14755: [MINOR][SQL] Fix some typos in comments and test ...

2016-08-22 Thread clockfly
GitHub user clockfly opened a pull request:

https://github.com/apache/spark/pull/14755

[MINOR][SQL] Fix some typos in comments and test hints

## What changes were proposed in this pull request?

Fix some typos in comments and test hints

## How was this patch tested?

N/A.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/clockfly/spark fix_minor_typo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14755.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14755


commit ea2d0cc34fe5da6e7b15825e1feb3cca2838d626
Author: Sean Zhong 
Date:   2016-08-22T17:01:21Z

minor typo







[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-08-22 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/14079
  
also just realized that I forgot about @kayousterhout's comment to add in 
checks on the invariants for the confs -- I've added that now as well.





[GitHub] spark issue #14754: [SPARK-17188][SQL] Moves class QuantileSummaries to proj...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14754
  
**[Test build #64217 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64217/consoleFull)**
 for PR 14754 at commit 
[`8ae3789`](https://github.com/apache/spark/commit/8ae3789e5dcf0be97848b6baf591ee5cf6f7f243).





[GitHub] spark issue #10896: [SPARK-12978][SQL] Skip unnecessary final group-by when ...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/10896
  
**[Test build #64218 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64218/consoleFull)**
 for PR 10896 at commit 
[`86068d0`](https://github.com/apache/spark/commit/86068d0f9db2cd1be91e5ec0c56d6c7c074438c8).





[GitHub] spark pull request #14754: [SPARK-17188][SQL] Moves class QuantileSummaries ...

2016-08-22 Thread clockfly
GitHub user clockfly opened a pull request:

https://github.com/apache/spark/pull/14754

[SPARK-17188][SQL] Moves class QuantileSummaries to project catalyst for 
implementing percentile_approx

## What changes were proposed in this pull request?

This is a sub-task of SPARK-16283 (Implement percentile_approx SQL 
function). It moves class `QuantileSummaries` to the catalyst project so that 
it can be reused when implementing the aggregate function `percentile_approx`.

## How was this patch tested?

This PR only relocates the class; its implementation is not changed.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/clockfly/spark 
move_QuantileSummaries_to_catalyst

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14754.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14754


commit 8ae3789e5dcf0be97848b6baf591ee5cf6f7f243
Author: Sean Zhong 
Date:   2016-08-22T16:44:06Z

move class QuantileSummaries to catalyst







[GitHub] spark issue #14734: [SPARK-16508][SPARKR] doc updates and more CRAN check fi...

2016-08-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14734
  
@junyangq Could you take one more look ? I will also do a pass now





[GitHub] spark issue #14743: [SparkR][Minor] Fix Cache Folder Path in Windows

2016-08-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14743
  
BTW LGTM. Merging this PR into master, branch-2.0





[GitHub] spark issue #14743: [SparkR][Minor] Fix Cache Folder Path in Windows

2016-08-22 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14743
  
Thanks @HyukjinKwon -- this is a bit surprising as it was only recently 
that you fixed the Windows tests in 
https://github.com/apache/spark/commit/1c403733b89258e57daf7b8b0a2011981ad7ed8a

Let's file a separate JIRA for these test failures -- and I don't think we 
have Windows infrastructure in the AMPLab Jenkins cluster. If we can set up a 
Travis one that runs something like nightly / weekly that will be great





[GitHub] spark pull request #8880: [SPARK-5682][Core] Add encrypted shuffle in spark

2016-08-22 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8880#discussion_r75712664
  
--- Diff: core/src/main/scala/org/apache/spark/crypto/CryptoConf.scala ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.crypto
+
+import javax.crypto.KeyGenerator
+
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.security.Credentials
+
+import org.apache.spark.SparkConf
+
+/**
+ * CryptoConf is a class for Crypto configuration
+ */
+private[spark] object CryptoConf {
+  /**
+   * Constants and variables for spark shuffle file encryption
+   */
+  val SPARK_SHUFFLE_TOKEN = new Text("SPARK_SHUFFLE_TOKEN")
+  val SPARK_SHUFFLE_ENCRYPTION_ENABLED = "spark.shuffle.encryption.enabled"
--- End diff --

Actually, I take that back, since `spark.serializer` is used for more than 
just disk data...

Maybe `spark.io.encryption.*`?





[GitHub] spark pull request #8880: [SPARK-5682][Core] Add encrypted shuffle in spark

2016-08-22 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/8880#discussion_r75712428
  
--- Diff: core/src/main/scala/org/apache/spark/crypto/CryptoConf.scala ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.crypto
+
+import javax.crypto.KeyGenerator
+
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.security.Credentials
+
+import org.apache.spark.SparkConf
+
+/**
+ * CryptoConf is a class for Crypto configuration
+ */
+private[spark] object CryptoConf {
+  /**
+   * Constants and variables for spark shuffle file encryption
+   */
+  val SPARK_SHUFFLE_TOKEN = new Text("SPARK_SHUFFLE_TOKEN")
+  val SPARK_SHUFFLE_ENCRYPTION_ENABLED = "spark.shuffle.encryption.enabled"
--- End diff --

Sounds better; but I'd call it `spark.serializer.encryption.enabled` to 
follow other Spark config names.





[GitHub] spark issue #10896: [SPARK-12978][SQL] Skip unnecessary final group-by when ...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/10896
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64209/
Test FAILed.





[GitHub] spark issue #10896: [SPARK-12978][SQL] Skip unnecessary final group-by when ...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/10896
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...

2016-08-22 Thread rajeshbalamohan
Github user rajeshbalamohan commented on the issue:

https://github.com/apache/spark/pull/14537
  
For latest ORC, if the data was written out by Hive, it would have the same 
mapping. 





[GitHub] spark issue #10896: [SPARK-12978][SQL] Skip unnecessary final group-by when ...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/10896
  
**[Test build #64209 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64209/consoleFull)**
 for PR 10896 at commit 
[`0375ac6`](https://github.com/apache/spark/commit/0375ac69a517092a6ac6bb412b6ffb1509835c8a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...

2016-08-22 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14537
  
@rajeshbalamohan So for Orc 2.x files, would schema inference be 
unnecessary?





[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12004
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64206/
Test PASSed.





[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12004
  
Merged build finished. Test PASSed.





[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12004
  
**[Test build #64206 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64206/consoleFull)**
 for PR 12004 at commit 
[`63cf84f`](https://github.com/apache/spark/commit/63cf84f17d79813404b03c259a52bccb2dcb5853).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14298: [SPARK-16283][SQL] Implement `percentile_approx` ...

2016-08-22 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14298#discussion_r75709632
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileApprox.scala
 ---
@@ -0,0 +1,462 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.QuantileSummaries.Stats
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+
+/**
+ * Computes an approximate percentile (quantile) using the G-K algorithm 
(see below), for very
+ * large numbers of rows where the regular percentile() UDAF might run out 
of memory.
+ *
+ * The input is a single double value or an array of double values 
representing the percentiles
+ * requested. The output, corresponding to the input, is either a single 
double value or an
+ * array of doubles that are the percentile values.
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(col, p [, B]) - Returns an approximate pth percentile 
of a numeric column in the
+ group. The B parameter, which defaults to 1000, controls 
approximation accuracy at the cost of
+ memory; higher values yield better approximations.
+_FUNC_(col, array(p1 [, p2]...) [, B]) - Same as above, but accepts 
and returns an array of
+ percentile values instead of a single one.
+""")
+case class PercentileApprox(
+child: Expression,
+percentilesExpr: Expression,
+bExpr: Option[Expression],
+percentiles: Seq[Double],  // the extracted percentiles
+B: Int,// the extracted B
+resultAsArray: Boolean,// whether to return the result as an array
+mutableAggBufferOffset: Int = 0,
+inputAggBufferOffset: Int = 0) extends ImperativeAggregate {
+
+  private def this(child: Expression, percentilesExpr: Expression, bExpr: 
Option[Expression]) = {
+this(
+  child = child,
+  percentilesExpr = percentilesExpr,
+  bExpr = bExpr,
+  // validate and extract percentiles
+  percentiles = 
PercentileApprox.validatePercentilesLiteral(percentilesExpr)._1,
+  // validate and extract B
+  B = 
bExpr.map(PercentileApprox.validateBLiteral(_)).getOrElse(PercentileApprox.B_DEFAULT),
+  // validate and mark whether we should return results as array of 
double or not
+  resultAsArray = 
PercentileApprox.validatePercentilesLiteral(percentilesExpr)._2)
+  }
+
+  // Constructor for the "_FUNC_(col, p) / _FUNC_(col, array(p1, ...))" 
form
+  def this(child: Expression, percentilesExpr: Expression) = {
+this(child, percentilesExpr, None)
+  }
+
+  // Constructor for the "_FUNC_(col, p, B) / _FUNC_(col, array(p1, ...), 
B)" form
+  def this(child: Expression, percentilesExpr: Expression, bExpr: 
Expression) = {
+this(child, percentilesExpr, Some(bExpr))
+  }
+
+  override def prettyName: String = "percentile_approx"
+
+  override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: 
Int): ImperativeAggregate =
+copy(mutableAggBufferOffset = newMutableAggBufferOffset)
+
+  override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): 
ImperativeAggregate =
+copy(inputAggBufferOffset = newInputAggBufferOffset)
+
+  override def children: Seq[Expression] =
+bExpr.map(child :: percentilesExpr :: _ :: Nil).getOrElse(child :: 
percentilesExpr :: Nil)
+
+  // we would return null for empty inputs
+  override def nullable: Boolean = true
+
+  override def dataType: DataType = if (resultAsArray) 
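[Editor's note: the doc comment quoted in the diff above describes `percentile_approx(col, p [, B])`, which approximates the exact percentile under a memory bound. As a hedged illustration only, the Python sketch below computes the *exact* nearest-rank percentile that the G-K based implementation approximates; the function name and nearest-rank convention are illustrative choices, not Spark's API.]

```python
def exact_percentile(values, p):
    """Nearest-rank percentile: the value percentile_approx tries to approximate.

    `values` is a non-empty list of numbers; `p` is a fraction in [0.0, 1.0].
    """
    assert values and 0.0 <= p <= 1.0
    s = sorted(values)
    # Index of the p-th percentile among the n sorted values (nearest rank).
    k = min(len(s) - 1, int(round(p * (len(s) - 1))))
    return s[k]

print(exact_percentile(list(range(1, 101)), 0.5))
```

The approximate algorithm trades this full sort (O(n) memory) for a bounded summary, which is why the `B` accuracy/memory knob exists.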

[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14750
  
**[Test build #64216 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64216/consoleFull)**
 for PR 14750 at commit 
[`8fc6bcc`](https://github.com/apache/spark/commit/8fc6bccec1c4fe34116a262d20f3a97e87024e3a).





[GitHub] spark pull request #10896: [SPARK-12978][SQL] Skip unnecessary final group-b...

2016-08-22 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/10896#discussion_r75707785
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala 
---
@@ -27,26 +27,87 @@ import 
org.apache.spark.sql.execution.streaming.{StateStoreRestoreExec, StateSto
  */
 object AggUtils {
 
-  def planAggregateWithoutPartial(
+  private[execution] def isAggregate(operator: SparkPlan): Boolean = {
+operator.isInstanceOf[HashAggregateExec] || 
operator.isInstanceOf[SortAggregateExec]
+  }
+
+  private[execution] def supportPartialAggregate(operator: SparkPlan): 
Boolean = {
+assert(isAggregate(operator))
+def supportPartial(exprs: Seq[AggregateExpression]) =
+  exprs.map(_.aggregateFunction).forall(_.supportsPartial)
+operator match {
+  case agg @ HashAggregateExec(_, _, aggregateExpressions, _, _, _, _) 
=>
+supportPartial(aggregateExpressions)
+  case agg @ SortAggregateExec(_, _, aggregateExpressions, _, _, _, _) 
=>
+supportPartial(aggregateExpressions)
+}
+  }
+
+  private def createPartialAggregateExec(
   groupingExpressions: Seq[NamedExpression],
   aggregateExpressions: Seq[AggregateExpression],
-  resultExpressions: Seq[NamedExpression],
-  child: SparkPlan): Seq[SparkPlan] = {
+  child: SparkPlan): SparkPlan = {
+val groupingAttributes = groupingExpressions.map(_.toAttribute)
+val functionsWithDistinct = aggregateExpressions.filter(_.isDistinct)
+val partialAggregateExpressions = aggregateExpressions.map {
+  case agg @ AggregateExpression(_, _, false, _) if 
functionsWithDistinct.length > 0 =>
+agg.copy(mode = PartialMerge)
+  case agg =>
+agg.copy(mode = Partial)
+}
+val partialAggregateAttributes =
+  
partialAggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes)
+val partialResultExpressions =
+  groupingAttributes ++
+
partialAggregateExpressions.flatMap(_.aggregateFunction.inputAggBufferAttributes)
 
-val completeAggregateExpressions = 
aggregateExpressions.map(_.copy(mode = Complete))
-val completeAggregateAttributes = 
completeAggregateExpressions.map(_.resultAttribute)
-SortAggregateExec(
-  requiredChildDistributionExpressions = Some(groupingExpressions),
+createAggregateExec(
+  requiredChildDistributionExpressions = None,
   groupingExpressions = groupingExpressions,
-  aggregateExpressions = completeAggregateExpressions,
-  aggregateAttributes = completeAggregateAttributes,
-  initialInputBufferOffset = 0,
-  resultExpressions = resultExpressions,
-  child = child
-) :: Nil
+  aggregateExpressions = partialAggregateExpressions,
+  aggregateAttributes = partialAggregateAttributes,
+  initialInputBufferOffset = if (functionsWithDistinct.length > 0) {
+groupingExpressions.length + 
functionsWithDistinct.head.aggregateFunction.children.length
+  } else {
+0
+  },
+  resultExpressions = partialResultExpressions,
+  child = child)
+  }
+
+  private def updateMergeAggregateMode(aggregateExpressions: 
Seq[AggregateExpression]) = {
+def updateMode(mode: AggregateMode) = mode match {
+  case Partial => PartialMerge
+  case Complete => Final
+  case mode => mode
+}
+aggregateExpressions.map(e => e.copy(mode = updateMode(e.mode)))
+  }
+
+  private[execution] def createPartialAggregate(operator: SparkPlan)
--- End diff --

Much better
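[Editor's note: the `AggUtils` diff above rewrites aggregate modes (`Partial` to `PartialMerge`, `Complete` to `Final`) to skip an unnecessary final group-by. The Python sketch below, with illustrative names that are not Spark's API, shows the underlying two-phase idea: partial per-partition buffers that are safe to merge because the merge step is associative.]

```python
# Illustrative two-phase aggregation for an average: the Partial phase produces
# per-partition buffers (sum, count); the Final phase merges buffers and
# computes the result. Skipping redundant phases is sound only because merging
# buffers is associative and order-independent.
def partial(rows):
    return (sum(rows), len(rows))

def final(buffers):
    total, count = 0, 0
    for s, c in buffers:
        total += s
        count += c
    return total / count

partitions = [[1, 2, 3], [4, 5], [6]]
buffers = [partial(p) for p in partitions]
print(final(buffers))
```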





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14239
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14239
  
**[Test build #64214 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64214/consoleFull)**
 for PR 14239 at commit 
[`c97c12f`](https://github.com/apache/spark/commit/c97c12f213b0ccb25aea840e1abfdb6c61b7f6af).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14239
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64214/
Test FAILed.





[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14079
  
**[Test build #64215 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64215/consoleFull)**
 for PR 14079 at commit 
[`fc45f5b`](https://github.com/apache/spark/commit/fc45f5b2e2fc38aff0714f1465f03f5e0ba16e01).





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14239
  
**[Test build #64214 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64214/consoleFull)**
 for PR 14239 at commit 
[`c97c12f`](https://github.com/apache/spark/commit/c97c12f213b0ccb25aea840e1abfdb6c61b7f6af).





[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14753
  
**[Test build #64213 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64213/consoleFull)**
 for PR 14753 at commit 
[`10861b2`](https://github.com/apache/spark/commit/10861b207e8cac0b7348b374d9054c4de03b7965).





[GitHub] spark issue #14038: [SPARK-16317][SQL] Add a new interface to filter files i...

2016-08-22 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/14038
  
If my understanding is correct, `PathFilter` is not passed into 
`FileSystem.listFiles` inside `ListingFileCatalog#listLeafFiles`. Even so, does 
the performance degradation you pointed out still occur?





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14239
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64212/
Test FAILed.





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14239
  
**[Test build #64212 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64212/consoleFull)**
 for PR 14239 at commit 
[`b49be73`](https://github.com/apache/spark/commit/b49be73a476af75dd37c33378aef7352e0a4902c).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14239
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...

2016-08-22 Thread clockfly
Github user clockfly closed the pull request at:

https://github.com/apache/spark/pull/14723





[GitHub] spark pull request #10896: [SPARK-12978][SQL] Skip unnecessary final group-b...

2016-08-22 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/10896#discussion_r75701953
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/Aggregate.scala
 ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.aggregate
+
+import org.apache.spark.sql.catalyst.expressions._
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.SparkPlan
+
+/**
+ * A base class for aggregate implementation.
+ */
+trait Aggregate {
--- End diff --

Well, I think a superclass makes a bit more sense. A trait, to me, is a way 
to bolt on functionality. `Aggregate` contains core functionality for both the 
hash- and sort-based versions, and is the natural parent class of both.

I do have to admit that this is more of a personal preference.





[GitHub] spark pull request #14753: [SPARK-17187][SQL] Supports using arbitrary Java ...

2016-08-22 Thread clockfly
GitHub user clockfly opened a pull request:

https://github.com/apache/spark/pull/14753

[SPARK-17187][SQL] Supports using arbitrary Java object as internal 
aggregation buffer object

## What changes were proposed in this pull request?

This PR introduces an abstract class `TypedImperativeAggregate` so that an 
aggregation function can use an **arbitrary** user-defined Java object as its 
intermediate aggregation buffer.

**This has advantages like:**
1. It can now support a larger category of aggregation functions. For 
example, it becomes much easier to implement `percentile_approx`, which has a 
complex aggregation buffer definition.
2. It avoids serialization/de-serialization on every call of `update` or 
`merge` when converting a domain-specific aggregation object to the internal 
Spark SQL storage format.
3. It is easier to integrate with existing monoid libraries like Algebird, 
and supports more aggregation functions with high performance. 

Please see the Java doc of `TypedImperativeAggregate` and the JIRA ticket 
SPARK-17187 for more information.

## How was this patch tested?

Unit tests.
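[Editor's note: the key claim in this PR description is that keeping the buffer as a plain in-memory object avoids per-row serialization, paying the cost only when a partial result crosses a partition boundary. The Python sketch below illustrates that contract; the class and method names are hypothetical and are not Spark's `TypedImperativeAggregate` API.]

```python
import pickle

class ObjectBufferAgg:
    """Sketch of an object-buffer aggregate: the buffer stays a plain Python
    object across update() calls, and is (de)serialized only when a partial
    result is shipped between partitions (serialize/merge)."""

    def create_buffer(self):
        return []                       # arbitrary in-memory object

    def update(self, buf, row):
        buf.append(row)                 # no per-row serialization
        return buf

    def serialize(self, buf):
        return pickle.dumps(buf)        # once per partition, not once per row

    def merge(self, buf, other_bytes):
        buf.extend(pickle.loads(other_bytes))
        return buf

    def eval(self, buf):
        return sorted(buf)[len(buf) // 2]   # e.g. a median-style final result

agg = ObjectBufferAgg()
b = agg.create_buffer()
for r in [5, 1, 3]:
    b = agg.update(b, r)
# Simulate merging a serialized partial result from another partition.
b = agg.merge(b, agg.serialize(agg.update(agg.create_buffer(), 2)))
print(agg.eval(b))
```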



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/clockfly/spark object_aggregation_buffer_try_2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14753.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14753


commit 6efddadcb8e6d48e9898a8980f4dcceee4894ebc
Author: Sean Zhong 
Date:   2016-08-19T16:34:56Z

object aggregation buffer







[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14239
  
**[Test build #64212 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64212/consoleFull)**
 for PR 14239 at commit 
[`b49be73`](https://github.com/apache/spark/commit/b49be73a476af75dd37c33378aef7352e0a4902c).





[GitHub] spark issue #14723: [SQL][WIP][Test] Supports object-based aggregation funct...

2016-08-22 Thread clockfly
Github user clockfly commented on the issue:

https://github.com/apache/spark/pull/14723
  
@liancheng  @cloud-fan 
@yhuai  @hvanhovell  @gatorsmile 
This PR is superseded by #14753; please review the new PR instead.

The motivation behind the change is that the aggregation function is also 
used by `WindowExec`, which may do continuous `update` and `eval` calls. We have 
to override `eval` of `ImperativeAggregate` so that `eval` can accept an 
aggregation buffer which contains a generic Java object. 

For example:
```
agg.update(buffer, row1)
agg.eval(buffer)
agg.update(buffer, row2)
agg.eval(buffer)
```
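The interleaving above can be sketched with a toy running aggregate (hypothetical names, not Spark's API): a window evaluates the aggregate after every row, so `eval` must work directly on the live object buffer rather than round-tripping it through an internal row format.

```scala
// Toy object buffer for a running sum over a growing window frame.
final class SumBuffer { var sum: Long = 0L }
def update(b: SumBuffer, v: Long): Unit = b.sum += v
def eval(b: SumBuffer): Long = b.sum

val buf = new SumBuffer
// Interleave update and eval, as a window operator would.
val runningSums = Seq(1L, 2L, 3L).map { v => update(buf, v); eval(buf) }
// runningSums == Seq(1L, 3L, 6L)
```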







[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14753
  
**[Test build #64211 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64211/consoleFull)**
 for PR 14753 at commit 
[`6efddad`](https://github.com/apache/spark/commit/6efddadcb8e6d48e9898a8980f4dcceee4894ebc).





[GitHub] spark issue #14038: [SPARK-16317][SQL] Add a new interface to filter files i...

2016-08-22 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14038
  
Oh, I don't want to take on any more work... I just think you should make 
the predicate passed in something of type `FileStatus => Boolean` instead of 
`String => Boolean`, and do the filtering after the results come back.

Regarding speedup, we've seen 20x in simple test trees, but we don't have real 
data on how representative that is: 
[HADOOP-13208](https://issues.apache.org/jira/browse/HADOOP-13208)
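A minimal sketch of the suggestion, using a stand-in case class rather than Hadoop's real `FileStatus` (the field names here are assumptions for illustration): a `FileStatus => Boolean` predicate can consult metadata that a `String => Boolean` predicate cannot, and the filtering runs after the bulk listing returns.

```scala
// Toy stand-in for Hadoop's FileStatus, not the real class.
case class FileStatus(path: String, isDirectory: Boolean, len: Long)

// A richer predicate than String => Boolean: it can inspect type and size.
val pred: FileStatus => Boolean =
  s => !s.isDirectory && s.len > 0 && !s.path.startsWith("_")

val listed = Seq(
  FileStatus("part-00000", isDirectory = false, len = 10),
  FileStatus("_SUCCESS", isDirectory = false, len = 0),
  FileStatus("subdir", isDirectory = true, len = 0))

// Filter after the (possibly recursive) listing comes back.
val kept = listed.filter(pred).map(_.path)
// kept == Seq("part-00000")
```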






[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-08-22 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r75700225
  
--- Diff: core/src/main/scala/org/apache/spark/TaskEndReason.scala ---
@@ -204,6 +213,7 @@ case object TaskResultLost extends TaskFailedReason {
 @DeveloperApi
 case object TaskKilled extends TaskFailedReason {
   override def toErrorString: String = "TaskKilled (killed intentionally)"
+  override val countTowardsTaskFailures: Boolean = false
--- End diff --

the switch to a `val` came from an earlier discussion with @kayousterhout 
... there was some other confusion, and we thought changing to a `val` would make 
it clearer that it is a constant. But I don't think either of us feels strongly; 
the argument to switch to a `val` was pretty weak. I can change it back.
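The choice being discussed can be illustrated in a few lines (a minimal sketch, not Spark's full `TaskFailedReason` trait): in Scala a stable `val` can override a `def`, which signals that the value is a fixed constant.

```scala
// Minimal stand-in for the trait in the diff above.
trait TaskFailedReason {
  def countTowardsTaskFailures: Boolean = true
}

case object TaskKilled extends TaskFailedReason {
  // `val` (rather than `def`) emphasizes this is a constant.
  override val countTowardsTaskFailures: Boolean = false
}

val counts = TaskKilled.countTowardsTaskFailures
// counts == false
```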





[GitHub] spark issue #14738: [SPARK-17090][FOLLOW-UP][ML]Add expert param support to ...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14738
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64208/
Test PASSed.





[GitHub] spark issue #14738: [SPARK-17090][FOLLOW-UP][ML]Add expert param support to ...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14738
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14738: [SPARK-17090][FOLLOW-UP][ML]Add expert param support to ...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14738
  
**[Test build #64208 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64208/consoleFull)**
 for PR 14738 at commit 
[`b54b582`](https://github.com/apache/spark/commit/b54b582208554a37a68bc2a45fec6bdfed43405e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14729: [SPARK-17167] [SQL] Issue Exceptions when Analyze Table ...

2016-08-22 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14729
  
@viirya Yeah, a normal temporary table would be resolved as a `LogicalPlan`, 
so `ANALYZE TABLE` does not give us any benefit there. 

However, you are also allowed to do this:
```sql
CREATE TEMPORARY VIEW tmp1
USING parquet
OPTIONS(path 'some/location')
```
For these I would like to be able to collect statistics.






[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...

2016-08-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14723#discussion_r75700385
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/AggregateWithObjectAggregateBufferSuite.scala ---
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.AggregateWithObjectAggregateBufferSuite.MaxWithObjectAggregateBuffer
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, GenericMutableRow, MutableRow, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.aggregate.{ImperativeAggregate, WithObjectAggregateBuffer}
+import org.apache.spark.sql.execution.aggregate.{SortAggregateExec}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StructType}
+
+class AggregateWithObjectAggregateBufferSuite extends QueryTest with SharedSQLContext {
--- End diff --

oh right, I misread the code.





[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-08-22 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/14079#discussion_r75700278
  
--- Diff: docs/configuration.md ---
@@ -1178,6 +1178,80 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.blacklist.enabled
+  
+true in cluster mode; 
+false in local mode
+  
+  
+If set to "true", prevent Spark from scheduling tasks on executors 
that have been blacklisted
+due to too many task failures. The blacklisting algorithm can be 
further controlled by the
+other "spark.blacklist" configuration options.
+  
+
+
+  spark.blacklist.timeout
+  1h
+  
+(Experimental) How long a node or executor is blacklisted for the 
entire application, before it
+is unconditionally removed from the blacklist to attempt running new 
tasks.
+  
+
+
+  spark.blacklist.task.maxTaskAttemptsPerExecutor
+  2
--- End diff --

oops, forgot to update this -- good catch, thanks





[GitHub] spark pull request #14749: [SPARK-17182][SQL] Mark Collect as non-determinis...

2016-08-22 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14749#discussion_r75699347
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala ---
@@ -54,6 +54,10 @@ abstract class Collect extends ImperativeAggregate {
 
   override def inputAggBufferAttributes: Seq[AttributeReference] = Nil
 
+  // Both `CollectList` and `CollectSet` are non-deterministic since their
+  // results depend on the actual order of input rows.
+  override def deterministic: Boolean = false
--- End diff --

Is `collect_set` non-deterministic? It is backed by a `HashSet`, and the 
way elements are iterated over does not depend on the input order. 
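A toy illustration of the distinction (plain Scala collections, not Spark's `Collect` implementations): a set-based collect is insensitive to input row order in terms of *contents*, while a list-based collect is not. This does not by itself settle whether the underlying `HashSet`'s iteration order is stable, which is the open question here.

```scala
// Same multiset of values, different arrival orders.
val rows1 = Seq(1, 2, 2, 3)
val rows2 = Seq(3, 2, 1, 2)

def collectList(xs: Seq[Int]): Seq[Int] = xs
def collectSet(xs: Seq[Int]): Set[Int] = xs.toSet

val listsDiffer = collectList(rows1) != collectList(rows2) // order-sensitive
val setsEqual = collectSet(rows1) == collectSet(rows2)     // order-insensitive contents
// listsDiffer == true, setsEqual == true
```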





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14239
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14239
  
**[Test build #64210 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64210/consoleFull)**
 for PR 14239 at commit 
[`5e93297`](https://github.com/apache/spark/commit/5e9329735ce71eed6f649f1fa16ddfbedc079193).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14239
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64210/
Test FAILed.





[GitHub] spark issue #14749: [SPARK-17182][SQL] Mark Collect as non-deterministic

2016-08-22 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14749
  
hmm, I think aggregate functions don't need the concept of `deterministic`, 
as we never check this property for aggregate functions.





[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...

2016-08-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14239
  
**[Test build #64210 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64210/consoleFull)**
 for PR 14239 at commit 
[`5e93297`](https://github.com/apache/spark/commit/5e9329735ce71eed6f649f1fa16ddfbedc079193).





[GitHub] spark pull request #10896: [SPARK-12978][SQL] Skip unnecessary final group-b...

2016-08-22 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/10896#discussion_r75695126
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ---
@@ -19,34 +19,90 @@ package org.apache.spark.sql.execution.aggregate
 
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.physical.Distribution
+import org.apache.spark.sql.execution.aggregate.{Aggregate => AggregateExec}
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.streaming.{StateStoreRestoreExec, StateStoreSaveExec}
 
 /**
+ * A pattern that finds aggregate operators to support partial aggregations.
+ */
+object ExtractPartialAggregate {
--- End diff --

okay




