[GitHub] spark issue #14732: [SPARK-16320] [DOC] Document G1 heap region's effect on ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14732 Oh heh, too late. No problem, we may further improve the GC docs soon anyway. The existing link wasn't wrong. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14759: [SPARK-16577][SPARKR] Add CRAN documentation checks to r...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14759 **[Test build #64224 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64224/consoleFull)** for PR 14759 at commit [`349d95d`](https://github.com/apache/spark/commit/349d95d0ce933d6670d5326ab560ccef420b814e).
[GitHub] spark issue #14759: [SPARK-16577][SPARKR] Add CRAN documentation checks to r...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14759 cc @felixcheung @junyangq
[GitHub] spark pull request #14759: [SPARK-16577][SPARKR] Add CRAN documentation chec...
GitHub user shivaram opened a pull request: https://github.com/apache/spark/pull/14759 [SPARK-16577][SPARKR] Add CRAN documentation checks to run-tests.sh ## What changes were proposed in this pull request? This change adds CRAN documentation checks to be run as a part of `R/run-tests.sh`. ## How was this patch tested? As this script is also used by Jenkins, this means that we will get documentation checks on every PR going forward. You can merge this pull request into a Git repository by running: $ git pull https://github.com/shivaram/spark-1 sparkr-cran-jenkins Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14759.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14759 commit 349d95d0ce933d6670d5326ab560ccef420b814e Author: Shivaram Venkataraman Date: 2016-08-21T20:43:15Z Add CRAN documentation checks to run-tests.sh
[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14079 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64215/
[GitHub] spark pull request #12889: [SPARK-15113][PySpark][ML] Add missing num featur...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/12889#discussion_r75729345 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -788,6 +788,8 @@ class GeneralizedLinearRegressionModel private[ml] ( @Since("2.0.0") override def write: MLWriter = new GeneralizedLinearRegressionModel.GeneralizedLinearRegressionModelWriter(this) + + override val numFeatures: Int = coefficients.size --- End diff -- Is that reflected in the documentation?
[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14079 Merged build finished. Test PASSed.
[GitHub] spark issue #14732: [SPARK-16320] [DOC] Document G1 heap region's effect on ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14732 LGTM. Merging to master and branch 2.0. Thanks @srowen
[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14079 **[Test build #64215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64215/consoleFull)** for PR 14079 at commit [`fc45f5b`](https://github.com/apache/spark/commit/fc45f5b2e2fc38aff0714f1465f03f5e0ba16e01). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #12889: [SPARK-15113][PySpark][ML] Add missing num featur...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/12889#discussion_r75728898 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -788,6 +788,8 @@ class GeneralizedLinearRegressionModel private[ml] ( @Since("2.0.0") override def write: MLWriter = new GeneralizedLinearRegressionModel.GeneralizedLinearRegressionModelWriter(this) + + override val numFeatures: Int = coefficients.size --- End diff -- The base class has `@Since("1.6.0")` on the method - so it has been public since 1.6 already.
[GitHub] spark pull request #14732: [SPARK-16320] [DOC] Document G1 heap region's eff...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14732#discussion_r75728831 --- Diff: docs/tuning.md --- @@ -217,14 +204,22 @@ temporary objects created during task execution. Some steps which may be useful * Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times for before a task completes, it means that there isn't enough memory available for executing tasks. -* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of - memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer - objects than to slow down task execution! - * If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling up by 4/3 is to account for space used by survivor regions as well.) + +* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of + memory used for caching by lowering `spark.memory.fraction`; it is better to cache fewer + objects than to slow down task execution. Alternatively, consider decreasing the size of + the Young generation. This means lowering `-Xmn` if you've set it as above. If not, try changing the + value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, meaning that the Old generation + occupies 2/3 of the heap. It should be large enough such that this fraction exceeds `spark.memory.fraction`. --- End diff -- sounds good.
[GitHub] spark issue #14758: [SPARKR][MINOR] Add Xiangrui and Felix to maintainers
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14758 **[Test build #64223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64223/consoleFull)** for PR 14758 at commit [`3ab82a0`](https://github.com/apache/spark/commit/3ab82a0d3828faa084b7bf77aebb62c7d89db775).
[GitHub] spark pull request #14732: [SPARK-16320] [DOC] Document G1 heap region's eff...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14732#discussion_r75727927 --- Diff: docs/tuning.md --- @@ -217,14 +204,22 @@ temporary objects created during task execution. Some steps which may be useful * Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times for before a task completes, it means that there isn't enough memory available for executing tasks. -* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of - memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer - objects than to slow down task execution! - * If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling up by 4/3 is to account for space used by survivor regions as well.) + +* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of + memory used for caching by lowering `spark.memory.fraction`; it is better to cache fewer + objects than to slow down task execution. Alternatively, consider decreasing the size of + the Young generation. This means lowering `-Xmn` if you've set it as above. If not, try changing the + value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, meaning that the Old generation + occupies 2/3 of the heap. It should be large enough such that this fraction exceeds `spark.memory.fraction`. --- End diff -- I tried to retain all those ideas but reworded it, because the section where I moved it also contains some of this discussion.
I believe the current discussion still captures the main idea, that an old generation nearly full of cached data indicates `spark.memory.fraction` (not just the fraction for storage) could be reduced. This section talks about `Xmn`, and that does something similar to `NewRatio`, so I tried to weave them into one coherent paragraph.
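The `NewRatio` arithmetic discussed in this review thread can be sanity-checked with a small sketch. The helper names below are mine and purely illustrative; the only facts assumed are that `-XX:NewRatio=R` makes the old generation occupy R/(R+1) of the heap, and that Spark's default `spark.memory.fraction` is 0.6:

```python
import math

def old_gen_fraction(new_ratio: int) -> float:
    """Fraction of the heap given to the old generation under -XX:NewRatio=R."""
    return new_ratio / (new_ratio + 1)

def min_new_ratio(memory_fraction: float) -> int:
    """Smallest integer NewRatio whose old-generation share strictly exceeds
    spark.memory.fraction (the bare minimum, with no headroom beyond that)."""
    # R/(R+1) > f  <=>  R > f/(1-f)
    return math.floor(memory_fraction / (1 - memory_fraction)) + 1

# The common JVM default NewRatio=2 gives an old generation of 2/3 of the
# heap, which exceeds the default spark.memory.fraction of 0.6.
assert old_gen_fraction(2) > 0.6

# Raising spark.memory.fraction to 0.8 forces NewRatio up to at least 5;
# the review discussion suggests 6 or more to leave room to spare.
assert min_new_ratio(0.8) == 5
```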
[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14735 Leaving it out of branch-2.0 sounds good to me.
[GitHub] spark issue #14758: [SPARKR][MINOR] Add Xiangrui and Felix to maintainers
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14758 cc @mengxr @felixcheung FYI - This is mostly to ensure that we can have more maintainers who can update the CRAN submissions. This shouldn't affect anything else on the development side.
[GitHub] spark pull request #14758: [SPARKR][MINOR] Add Xiangrui and Felix to maintai...
GitHub user shivaram opened a pull request: https://github.com/apache/spark/pull/14758 [SPARKR][MINOR] Add Xiangrui and Felix to maintainers ## What changes were proposed in this pull request? This change adds Xiangrui Meng and Felix Cheung to the maintainers field in the package description. ## How was this patch tested? You can merge this pull request into a Git repository by running: $ git pull https://github.com/shivaram/spark-1 sparkr-maintainers Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14758.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14758 commit 3ab82a0d3828faa084b7bf77aebb62c7d89db775 Author: Shivaram Venkataraman Date: 2016-08-22T18:05:13Z Add Xiangrui and Felix to maintainers
[GitHub] spark pull request #12889: [SPARK-15113][PySpark][ML] Add missing num featur...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/12889#discussion_r75727539 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -788,6 +788,8 @@ class GeneralizedLinearRegressionModel private[ml] ( @Since("2.0.0") override def write: MLWriter = new GeneralizedLinearRegressionModel.GeneralizedLinearRegressionModelWriter(this) + + override val numFeatures: Int = coefficients.size --- End diff -- We still need to add this, don't we? Otherwise it is the only public method in this class that doesn't have it?
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14753 **[Test build #64213 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64213/consoleFull)** for PR 14753 at commit [`10861b2`](https://github.com/apache/spark/commit/10861b207e8cac0b7348b374d9054c4de03b7965). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate `
[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14735 I don't think they should be required for branch 2.0 - some part of the signature change with `...` is likely good to have for consistency, but those might also be "breaking" for a *.0.1 release. If we think we should - since we did make some changes like that in the 2.0.0 branch - I could open a PR for the branch separately.
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14753 Merged build finished. Test PASSed.
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14753 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64213/
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14753 Merged build finished. Test PASSed.
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14753 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64211/
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14753 **[Test build #64211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64211/consoleFull)** for PR 14753 at commit [`6efddad`](https://github.com/apache/spark/commit/6efddadcb8e6d48e9898a8980f4dcceee4894ebc). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class TypedImperativeAggregate[T >: Null] extends ImperativeAggregate `
[GitHub] spark issue #14757: [SPARK-17190] [SQL] Removal of HiveSharedState
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14757 **[Test build #64222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64222/consoleFull)** for PR 14757 at commit [`f63826e`](https://github.com/apache/spark/commit/f63826ed5c35b6f1b11c891415fe568c14bdfac7).
[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14750 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64216/
[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14750 Merged build finished. Test FAILed.
[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14750 **[Test build #64216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64216/consoleFull)** for PR 14750 at commit [`8fc6bcc`](https://github.com/apache/spark/commit/8fc6bccec1c4fe34116a262d20f3a97e87024e3a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14732: [SPARK-16320] [DOC] Document G1 heap region's eff...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14732#discussion_r75722532 --- Diff: docs/tuning.md --- @@ -122,21 +122,8 @@ large records. `R` is the storage space within `M` where cached blocks immune to being evicted by execution. The value of `spark.memory.fraction` should be set in order to fit this amount of heap space -comfortably within the JVM's old or "tenured" generation. Otherwise, when much of this space is -used for caching and execution, the tenured generation will be full, which causes the JVM to -significantly increase time spent in garbage collection. See -<a href="https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html">Java GC sizing documentation</a> -for more information. --- End diff -- Should we keep the link to this reference in the `Advanced GC Tuning`?
[GitHub] spark pull request #14757: [SPARK-17190] [SQL] Removal of HiveSharedState
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/14757 [SPARK-17190] [SQL] Removal of HiveSharedState ### What changes were proposed in this pull request? Since `HiveClient` is used to interact with the Hive metastore, it should be hidden in `HiveExternalCatalog`. After moving `HiveClient` into `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes straightforward. After removal of `HiveSharedState`, the reflection logic is applied directly to the choice of `ExternalCatalog` type, based on the configuration of `CATALOG_IMPLEMENTATION`. Since `HiveClient` is also used by other entities besides `HiveExternalCatalog`, we define the following two APIs:

```Scala
/** Return the existing [[HiveClient]] used to interact with the metastore. */
def getClient: HiveClient

/** Return a [[HiveClient]] as a new session. */
def getNewClient: HiveClient
```

### How was this patch tested? The existing test cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark removeHiveClient Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14757.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14757
[GitHub] spark pull request #14732: [SPARK-16320] [DOC] Document G1 heap region's eff...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14732#discussion_r75722287

--- Diff: docs/tuning.md ---
@@ -217,14 +204,22 @@ temporary objects created during task execution. Some steps which may be useful

* Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times before a task completes, it means that there isn't enough memory available for executing tasks.

-* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer objects than to slow down task execution!
-
* If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling up by 4/3 is to account for space used by survivor regions as well.)
+
+* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering `spark.memory.fraction`; it is better to cache fewer objects than to slow down task execution. Alternatively, consider decreasing the size of the Young generation. This means lowering `-Xmn` if you've set it as above. If not, try changing the value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, meaning that the Old generation occupies 2/3 of the heap. It should be large enough such that this fraction exceeds `spark.memory.fraction`.
--- End diff --

Do we need to keep the following paragraph?

```
So, by default, the tenured generation occupies 2/3 or about 0.66 of the heap. A value of 0.6 for `spark.memory.fraction` keeps storage and execution memory within the old generation with room to spare. If `spark.memory.fraction` is increased to, say, 0.8, then `NewRatio` may have to increase to 6 or more.
```

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
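The `NewRatio` arithmetic discussed above can be sanity-checked with a short sketch. This is illustrative only (not part of the patch); with `-XX:NewRatio=N`, the Old generation occupies N/(N+1) of the heap, and the quoted doc asks that this fraction exceed `spark.memory.fraction`:

```python
def old_gen_fraction(new_ratio):
    """Fraction of the heap occupied by the Old generation for -XX:NewRatio=new_ratio."""
    return new_ratio / (new_ratio + 1)

# Default NewRatio=2 -> Old gen is 2/3 (~0.667) of the heap, comfortably above
# the 0.6 default of spark.memory.fraction.
assert old_gen_fraction(2) > 0.6

# If spark.memory.fraction were raised to 0.8, NewRatio=2 would no longer leave
# enough Old-gen room, while NewRatio=6 (Old gen = 6/7, ~0.857) would.
assert old_gen_fraction(2) < 0.8
assert old_gen_fraction(6) > 0.8
```

This matches the paragraph being discussed: 2/3 ≈ 0.66 for the default, and roughly 6 or more once the fraction reaches 0.8.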
[GitHub] spark issue #14756: [SPARK-17189][SQL][MINOR] Prefers InternalRow over Unsaf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14756 **[Test build #64221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64221/consoleFull)** for PR 14756 at commit [`d600e68`](https://github.com/apache/spark/commit/d600e681bde23925dedf1261654a34894713f042).
[GitHub] spark pull request #14732: [SPARK-16320] [DOC] Document G1 heap region's eff...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14732#discussion_r75722253

--- Diff: docs/tuning.md ---
@@ -217,14 +204,22 @@ temporary objects created during task execution. Some steps which may be useful

* Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times before a task completes, it means that there isn't enough memory available for executing tasks.

-* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer objects than to slow down task execution!
-
* If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling up by 4/3 is to account for space used by survivor regions as well.)
+
+* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering `spark.memory.fraction`; it is better to cache fewer objects than to slow down task execution. Alternatively, consider decreasing the size of the Young generation. This means lowering `-Xmn` if you've set it as above. If not, try changing the value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, meaning that the Old generation occupies 2/3 of the heap. It should be large enough such that this fraction exceeds `spark.memory.fraction`.
+
+* Try the G1GC garbage collector with `-XX:+UseG1GC`. It can improve performance in some situations where garbage collection is a bottleneck. Note that with large executor heap sizes, it may be important to increase the [G1 region size](https://blogs.oracle.com/g1gc/entry/g1_gc_tuning_a_case) with `-XX:G1HeapRegionSize`
--- End diff --

Do we need to keep the following paragraph?

```
So, by default, the tenured generation occupies 2/3 or about 0.66 of the heap. A value of 0.6 for `spark.memory.fraction` keeps storage and execution memory within the old generation with room to spare. If `spark.memory.fraction` is increased to, say, 0.8, then `NewRatio` may have to increase to 6 or more.
```
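The Eden-sizing rule quoted in the diff above (`-Xmn=4/3*E`) can be sketched as a tiny helper. This is an illustrative calculation only; the function name is made up, and `E` is whatever over-estimate of per-task Eden need you arrive at:

```python
def young_gen_flag(eden_bytes):
    """Return a -Xmn JVM flag sized at 4/3 of the estimated Eden requirement,
    so the Young generation also covers the survivor regions."""
    young = int(eden_bytes * 4 / 3)
    return f"-Xmn{young}"

# e.g. an estimated 3 GiB of Eden per executor yields a 4 GiB Young generation:
E = 3 * 1024**3
print(young_gen_flag(E))  # prints -Xmn4294967296
```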
[GitHub] spark pull request #14756: [SPARK-17189][SQL][MINOR] Prefers InternalRow ove...
GitHub user clockfly opened a pull request: https://github.com/apache/spark/pull/14756

[SPARK-17189][SQL][MINOR] Prefers InternalRow over UnsafeRow if UnsafeRow specific interface is not used in AggregationIterator

## What changes were proposed in this pull request?

Minor change to use InternalRow instead of UnsafeRow in method declaration of `AggregationIterator.generateResultProjection(...)`, as UnsafeRow specific methods are not used.

### Before change:

```
protected def generateResultProjection(): (UnsafeRow, MutableRow) => UnsafeRow
```

### After change:

```
protected def generateResultProjection(): (InternalRow, MutableRow) => UnsafeRow
```

## How was this patch tested?

Existing test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clockfly/spark loose_row_interface

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14756.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14756

commit d600e681bde23925dedf1261654a34894713f042
Author: Sean Zhong
Date: 2016-08-22T17:18:05Z

    Use a looser interface for InternalRow in result projection
[GitHub] spark issue #14572: [SPARK-16552] [FOLLOW-UP] [SQL] Store the Inferred Schem...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14572 sorry. I missed this PR. Can you update?
[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14735 @felixcheung I didn't look at the code very closely, but will this change be required in `branch-2.0` as well ? If so the merge might be hard to
[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14734#discussion_r75720021

--- Diff: R/pkg/R/DataFrame.R ---
@@ -3058,7 +3057,7 @@ setMethod("str",
 #' @note drop since 2.0.0
 setMethod("drop",
           signature(x = "SparkDataFrame"),
-          function(x, col, ...) {
--- End diff --

Thanks - that sounds good
[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14734#discussion_r75719798

--- Diff: R/pkg/NAMESPACE ---
@@ -1,5 +1,9 @@
 # Imports from base R
-importFrom(methods, setGeneric, setMethod, setOldClass)
+# Do not include stats:: "rpois", "runif" - causes error at runtime
+importFrom("methods", "setGeneric", "setMethod", "setOldClass")
+importFrom("methods", "is", "new", "signature", "show")
--- End diff --

I was wondering about this part as well.
[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14734#discussion_r75719536

--- Diff: R/pkg/R/DataFrame.R ---
@@ -3058,7 +3057,7 @@ setMethod("str",
 #' @note drop since 2.0.0
 setMethod("drop",
           signature(x = "SparkDataFrame"),
-          function(x, col, ...) {
--- End diff --

This actually follows from the discussion in #14705. A summary may be seen at https://github.com/apache/spark/pull/14735#discussion_r75661714
[GitHub] spark issue #14734: [SPARK-16508][SPARKR] doc updates and more CRAN check fi...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14734 LGTM. I had a couple of minor comments inline.
[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14734#discussion_r75718809

--- Diff: R/pkg/R/generics.R ---
@@ -1339,7 +1339,6 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s
 setGeneric("spark.survreg", function(data, formula) { standardGeneric("spark.survreg") })
 #' @rdname spark.lda
-#' @param ... Additional parameters to tune LDA.
--- End diff --

never mind - I see that this is moved to mllib.R
[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14734#discussion_r75718694

--- Diff: R/pkg/R/generics.R ---
@@ -1339,7 +1339,6 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s
 setGeneric("spark.survreg", function(data, formula) { standardGeneric("spark.survreg") })
 #' @rdname spark.lda
-#' @param ... Additional parameters to tune LDA.
--- End diff --

Just checking - removing `...` here is intentional ?
[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14734#discussion_r75718243

--- Diff: R/pkg/R/DataFrame.R ---
@@ -3058,7 +3057,7 @@ setMethod("str",
 #' @note drop since 2.0.0
 setMethod("drop",
           signature(x = "SparkDataFrame"),
-          function(x, col, ...) {
--- End diff --

just to clarify removing `...` is intentional ? Just wondering as we have the `@param` documentation above
[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] doc updates and more CRAN c...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/14734#discussion_r75717898

--- Diff: R/pkg/NAMESPACE ---
@@ -1,5 +1,9 @@
 # Imports from base R
-importFrom(methods, setGeneric, setMethod, setOldClass)
+# Do not include stats:: "rpois", "runif" - causes error at runtime
+importFrom("methods", "setGeneric", "setMethod", "setOldClass")
+importFrom("methods", "is", "new", "signature", "show")
--- End diff --

Do these things show up as CRAN warnings ? I dont see them on my machine
[GitHub] spark issue #14755: [MINOR][SQL] Fix some typos in comments and test hints
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14755 **[Test build #64219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64219/consoleFull)** for PR 14755 at commit [`ea2d0cc`](https://github.com/apache/spark/commit/ea2d0cc34fe5da6e7b15825e1feb3cca2838d626).
[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14079 **[Test build #64220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64220/consoleFull)** for PR 14079 at commit [`f8b1bff`](https://github.com/apache/spark/commit/f8b1bffee588df45809519436983cb95c6a481f3).
[GitHub] spark pull request #14743: [SparkR][Minor] Fix Cache Folder Path in Windows
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14743
[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...
Github user junyangq commented on the issue: https://github.com/apache/spark/pull/14735 LGTM
[GitHub] spark pull request #14755: [MINOR][SQL] Fix some typos in comments and test ...
GitHub user clockfly opened a pull request: https://github.com/apache/spark/pull/14755

[MINOR][SQL] Fix some typos in comments and test hints

## What changes were proposed in this pull request?

Fix some typos in comments and test hints.

## How was this patch tested?

N/A.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clockfly/spark fix_minor_typo

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14755.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14755

commit ea2d0cc34fe5da6e7b15825e1feb3cca2838d626
Author: Sean Zhong
Date: 2016-08-22T17:01:21Z

    minor typo
[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user squito commented on the issue: https://github.com/apache/spark/pull/14079 also just realized that I forgot about @kayousterhout 's comment to add in checks on the invariants for the confs -- I've added that now as well.
[GitHub] spark issue #14754: [SPARK-17188][SQL] Moves class QuantileSummaries to proj...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14754 **[Test build #64217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64217/consoleFull)** for PR 14754 at commit [`8ae3789`](https://github.com/apache/spark/commit/8ae3789e5dcf0be97848b6baf591ee5cf6f7f243).
[GitHub] spark issue #10896: [SPARK-12978][SQL] Skip unnecessary final group-by when ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/10896 **[Test build #64218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64218/consoleFull)** for PR 10896 at commit [`86068d0`](https://github.com/apache/spark/commit/86068d0f9db2cd1be91e5ec0c56d6c7c074438c8).
[GitHub] spark pull request #14754: [SPARK-17188][SQL] Moves class QuantileSummaries ...
GitHub user clockfly opened a pull request: https://github.com/apache/spark/pull/14754

[SPARK-17188][SQL] Moves class QuantileSummaries to project catalyst for implementing percentile_approx

## What changes were proposed in this pull request?

This is a sub-task of SPARK-16283 (Implement percentile_approx SQL function), which moves class QuantileSummaries to project catalyst so that it can be reused when implementing aggregation function percentile_approx.

## How was this patch tested?

This PR only does class relocation, class implementation is not changed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clockfly/spark move_QuantileSummaries_to_catalyst

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14754.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14754

commit 8ae3789e5dcf0be97848b6baf591ee5cf6f7f243
Author: Sean Zhong
Date: 2016-08-22T16:44:06Z

    move class QuantileSummaries to catalyst
[GitHub] spark issue #14734: [SPARK-16508][SPARKR] doc updates and more CRAN check fi...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14734 @junyangq Could you take one more look ? I will also do a pass now
[GitHub] spark issue #14743: [SparkR][Minor] Fix Cache Folder Path in Windows
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14743 BTW LGTM. Merging this PR into master, branch-2.0
[GitHub] spark issue #14743: [SparkR][Minor] Fix Cache Folder Path in Windows
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14743 Thanks @HyukjinKwon -- this is a bit surprising as it was only recently that you fixed the windows tests in https://github.com/apache/spark/commit/1c403733b89258e57daf7b8b0a2011981ad7ed8a Lets file a separate JIRA for these test failures -- And I dont think we have Windows infrastructure in the AMPLab Jenkins cluster. If we can setup a travis one that runs something like nightly / weekly that will be great
[GitHub] spark pull request #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8880#discussion_r75712664

--- Diff: core/src/main/scala/org/apache/spark/crypto/CryptoConf.scala ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.crypto
+
+import javax.crypto.KeyGenerator
+
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.security.Credentials
+
+import org.apache.spark.SparkConf
+
+/**
+ * CryptoConf is a class for Crypto configuration
+ */
+private[spark] object CryptoConf {
+  /**
+   * Constants and variables for spark shuffle file encryption
+   */
+  val SPARK_SHUFFLE_TOKEN = new Text("SPARK_SHUFFLE_TOKEN")
+  val SPARK_SHUFFLE_ENCRYPTION_ENABLED = "spark.shuffle.encryption.enabled"
--- End diff --

Actually, I take that back, since `spark.serializer` is used for more than just disk data... Maybe `spark.io.encryption.*`?
[GitHub] spark pull request #8880: [SPARK-5682][Core] Add encrypted shuffle in spark
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/8880#discussion_r75712428

--- Diff: core/src/main/scala/org/apache/spark/crypto/CryptoConf.scala ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.crypto
+
+import javax.crypto.KeyGenerator
+
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.security.Credentials
+
+import org.apache.spark.SparkConf
+
+/**
+ * CryptoConf is a class for Crypto configuration
+ */
+private[spark] object CryptoConf {
+  /**
+   * Constants and variables for spark shuffle file encryption
+   */
+  val SPARK_SHUFFLE_TOKEN = new Text("SPARK_SHUFFLE_TOKEN")
+  val SPARK_SHUFFLE_ENCRYPTION_ENABLED = "spark.shuffle.encryption.enabled"
--- End diff --

Sounds better; but I'd call it `spark.serializer.encryption.enabled` to follow other Spark config names.
[GitHub] spark issue #10896: [SPARK-12978][SQL] Skip unnecessary final group-by when ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/10896 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64209/ Test FAILed.
[GitHub] spark issue #10896: [SPARK-12978][SQL] Skip unnecessary final group-by when ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/10896 Merged build finished. Test FAILed.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 For latest ORC, if the data was written out by Hive, it would have the same mapping.
[GitHub] spark issue #10896: [SPARK-12978][SQL] Skip unnecessary final group-by when ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/10896 **[Test build #64209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64209/consoleFull)** for PR 10896 at commit [`0375ac6`](https://github.com/apache/spark/commit/0375ac69a517092a6ac6bb412b6ffb1509835c8a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14537 @rajeshbalamohan So for Orc 2.x files, would schema inference be unnecessary?
[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12004 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64206/ Test PASSed.
[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12004 Merged build finished. Test PASSed.
[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12004 **[Test build #64206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64206/consoleFull)** for PR 12004 at commit [`63cf84f`](https://github.com/apache/spark/commit/63cf84f17d79813404b03c259a52bccb2dcb5853). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14298: [SPARK-16283][SQL] Implement `percentile_approx` ...
Github user clockfly commented on a diff in the pull request: https://github.com/apache/spark/pull/14298#discussion_r75709632

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileApprox.scala ---
@@ -0,0 +1,462 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate.QuantileSummaries.Stats
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+
+/**
+ * Computes an approximate percentile (quantile) using the G-K algorithm (see below), for very
+ * large numbers of rows where the regular percentile() UDAF might run out of memory.
+ *
+ * The input is a single double value or an array of double values representing the percentiles
+ * requested. The output, corresponding to the input, is either a single double value or an
+ * array of doubles that are the percentile values.
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(col, p [, B]) - Returns an approximate pth percentile of a numeric column in the
+    group. The B parameter, which defaults to 1000, controls approximation accuracy at the cost of
+    memory; higher values yield better approximations.
+    _FUNC_(col, array(p1 [, p2]...) [, B]) - Same as above, but accepts and returns an array of
+    percentile values instead of a single one.
+    """)
+case class PercentileApprox(
+    child: Expression,
+    percentilesExpr: Expression,
+    bExpr: Option[Expression],
+    percentiles: Seq[Double],       // the extracted percentiles
+    B: Int,                         // the extracted B
+    resultAsArray: Boolean,         // whether to return the result as an array
+    mutableAggBufferOffset: Int = 0,
+    inputAggBufferOffset: Int = 0) extends ImperativeAggregate {
+
+  private def this(child: Expression, percentilesExpr: Expression, bExpr: Option[Expression]) = {
+    this(
+      child = child,
+      percentilesExpr = percentilesExpr,
+      bExpr = bExpr,
+      // validate and extract percentiles
+      percentiles = PercentileApprox.validatePercentilesLiteral(percentilesExpr)._1,
+      // validate and extract B
+      B = bExpr.map(PercentileApprox.validateBLiteral(_)).getOrElse(PercentileApprox.B_DEFAULT),
+      // validate and mark whether we should return results as array of double or not
+      resultAsArray = PercentileApprox.validatePercentilesLiteral(percentilesExpr)._2)
+  }
+
+  // Constructor for the "_FUNC_(col, p) / _FUNC_(col, array(p1, ...))" form
+  def this(child: Expression, percentilesExpr: Expression) = {
+    this(child, percentilesExpr, None)
+  }
+
+  // Constructor for the "_FUNC_(col, p, B) / _FUNC_(col, array(p1, ...), B)" form
+  def this(child: Expression, percentilesExpr: Expression, bExpr: Expression) = {
+    this(child, percentilesExpr, Some(bExpr))
+  }
+
+  override def prettyName: String = "percentile_approx"
+
+  override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: Int): ImperativeAggregate =
+    copy(mutableAggBufferOffset = newMutableAggBufferOffset)
+
+  override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): ImperativeAggregate =
+    copy(inputAggBufferOffset = newInputAggBufferOffset)
+
+  override def children: Seq[Expression] =
+    bExpr.map(child :: percentilesExpr :: _ :: Nil).getOrElse(child :: percentilesExpr :: Nil)
+
+  // we would return null for empty inputs
+  override def nullable: Boolean = true
+
+  override def dataType: DataType = if (resultAsArray)
[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14750 **[Test build #64216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64216/consoleFull)** for PR 14750 at commit [`8fc6bcc`](https://github.com/apache/spark/commit/8fc6bccec1c4fe34116a262d20f3a97e87024e3a).
[GitHub] spark pull request #10896: [SPARK-12978][SQL] Skip unnecessary final group-b...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/10896#discussion_r75707785

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ---
@@ -27,26 +27,87 @@ import org.apache.spark.sql.execution.streaming.{StateStoreRestoreExec, StateSto
  */
 object AggUtils {
-  def planAggregateWithoutPartial(
+  private[execution] def isAggregate(operator: SparkPlan): Boolean = {
+    operator.isInstanceOf[HashAggregateExec] || operator.isInstanceOf[SortAggregateExec]
+  }
+
+  private[execution] def supportPartialAggregate(operator: SparkPlan): Boolean = {
+    assert(isAggregate(operator))
+    def supportPartial(exprs: Seq[AggregateExpression]) =
+      exprs.map(_.aggregateFunction).forall(_.supportsPartial)
+    operator match {
+      case agg @ HashAggregateExec(_, _, aggregateExpressions, _, _, _, _) =>
+        supportPartial(aggregateExpressions)
+      case agg @ SortAggregateExec(_, _, aggregateExpressions, _, _, _, _) =>
+        supportPartial(aggregateExpressions)
+    }
+  }
+
+  private def createPartialAggregateExec(
       groupingExpressions: Seq[NamedExpression],
       aggregateExpressions: Seq[AggregateExpression],
-      resultExpressions: Seq[NamedExpression],
-      child: SparkPlan): Seq[SparkPlan] = {
+      child: SparkPlan): SparkPlan = {
+    val groupingAttributes = groupingExpressions.map(_.toAttribute)
+    val functionsWithDistinct = aggregateExpressions.filter(_.isDistinct)
+    val partialAggregateExpressions = aggregateExpressions.map {
+      case agg @ AggregateExpression(_, _, false, _) if functionsWithDistinct.length > 0 =>
+        agg.copy(mode = PartialMerge)
+      case agg =>
+        agg.copy(mode = Partial)
+    }
+    val partialAggregateAttributes =
+      partialAggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes)
+    val partialResultExpressions =
+      groupingAttributes ++
+        partialAggregateExpressions.flatMap(_.aggregateFunction.inputAggBufferAttributes)
-    val completeAggregateExpressions = aggregateExpressions.map(_.copy(mode = Complete))
-    val completeAggregateAttributes = completeAggregateExpressions.map(_.resultAttribute)
-    SortAggregateExec(
-      requiredChildDistributionExpressions = Some(groupingExpressions),
+    createAggregateExec(
+      requiredChildDistributionExpressions = None,
       groupingExpressions = groupingExpressions,
-      aggregateExpressions = completeAggregateExpressions,
-      aggregateAttributes = completeAggregateAttributes,
-      initialInputBufferOffset = 0,
-      resultExpressions = resultExpressions,
-      child = child
-    ) :: Nil
+      aggregateExpressions = partialAggregateExpressions,
+      aggregateAttributes = partialAggregateAttributes,
+      initialInputBufferOffset = if (functionsWithDistinct.length > 0) {
+        groupingExpressions.length + functionsWithDistinct.head.aggregateFunction.children.length
+      } else {
+        0
+      },
+      resultExpressions = partialResultExpressions,
+      child = child)
+  }
+
+  private def updateMergeAggregateMode(aggregateExpressions: Seq[AggregateExpression]) = {
+    def updateMode(mode: AggregateMode) = mode match {
+      case Partial => PartialMerge
+      case Complete => Final
+      case mode => mode
+    }
+    aggregateExpressions.map(e => e.copy(mode = updateMode(e.mode)))
+  }
+
+  private[execution] def createPartialAggregate(operator: SparkPlan)

--- End diff --

Much better
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14239 Merged build finished. Test FAILed.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14239 **[Test build #64214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64214/consoleFull)** for PR 14239 at commit [`c97c12f`](https://github.com/apache/spark/commit/c97c12f213b0ccb25aea840e1abfdb6c61b7f6af). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14239 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64214/ Test FAILed.
[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14079 **[Test build #64215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64215/consoleFull)** for PR 14079 at commit [`fc45f5b`](https://github.com/apache/spark/commit/fc45f5b2e2fc38aff0714f1465f03f5e0ba16e01).
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14239 **[Test build #64214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64214/consoleFull)** for PR 14239 at commit [`c97c12f`](https://github.com/apache/spark/commit/c97c12f213b0ccb25aea840e1abfdb6c61b7f6af).
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14753 **[Test build #64213 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64213/consoleFull)** for PR 14753 at commit [`10861b2`](https://github.com/apache/spark/commit/10861b207e8cac0b7348b374d9054c4de03b7965).
[GitHub] spark issue #14038: [SPARK-16317][SQL] Add a new interface to filter files i...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/14038 If my understanding is correct, `PathFilter` is not passed into `FileSystem.listFiles` inside `ListingFileCatalog#listLeafFiles`. Even so, does the performance degradation you pointed out occur?
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14239 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64212/ Test FAILed.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14239 **[Test build #64212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64212/consoleFull)** for PR 14239 at commit [`b49be73`](https://github.com/apache/spark/commit/b49be73a476af75dd37c33378aef7352e0a4902c). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14239 Merged build finished. Test FAILed.
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user clockfly closed the pull request at: https://github.com/apache/spark/pull/14723
[GitHub] spark pull request #10896: [SPARK-12978][SQL] Skip unnecessary final group-b...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/10896#discussion_r75701953

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/Aggregate.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.aggregate
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.SparkPlan
+
+/**
+ * A base class for aggregate implementation.
+ */
+trait Aggregate {

--- End diff --

Well I think a super class makes a bit more sense. A trait to me is a way to bolt on functionality. The `Aggregate` contains core functionality for both the Hash and Sort based versions, and is the natural parent class of both. I do have to admit that this is more a personal preference.
[GitHub] spark pull request #14753: [SPARK-17187][SQL] Supports using arbitrary Java ...
GitHub user clockfly opened a pull request: https://github.com/apache/spark/pull/14753

[SPARK-17187][SQL] Supports using arbitrary Java object as internal aggregation buffer object

## What changes were proposed in this pull request?

This PR introduces an abstract class `TypedImperativeAggregate` so that an aggregation function extending `TypedImperativeAggregate` can use an **arbitrary** user-defined Java object as its intermediate aggregation buffer. **This has advantages like:**

1. It can now support a larger category of aggregation functions. For example, it will be much easier to implement the aggregation function `percentile_approx`, which has a complex aggregation buffer definition.
2. It can be used to avoid doing serialization/deserialization on every call of `update` or `merge` when converting a domain-specific aggregation object to the internal Spark SQL storage format.
3. It is easier to integrate with other existing monoid libraries like Algebird, and supports more aggregation functions with high performance.

Please see the Java doc of `TypedImperativeAggregate` and JIRA ticket SPARK-17187 for more information.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clockfly/spark object_aggregation_buffer_try_2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14753.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14753

commit 6efddadcb8e6d48e9898a8980f4dcceee4894ebc
Author: Sean Zhong
Date: 2016-08-19T16:34:56Z

    object aggregation buffer
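The object-valued buffer idea described in the PR can be illustrated with a small, self-contained Scala sketch. Note that `TypedAggregator` and `AvgBuffer` are hypothetical names invented here for illustration, not Spark's actual `TypedImperativeAggregate` API:

```scala
// Illustrative sketch only: `TypedAggregator` and `AvgBuffer` are made-up
// names, not Spark's actual TypedImperativeAggregate API. The point is that
// the intermediate state (BUF) is an arbitrary object, so no serialization
// to an internal row format happens on update/merge.
trait TypedAggregator[BUF, IN, OUT] {
  def createBuffer(): BUF                 // fresh aggregation state
  def update(buf: BUF, input: IN): BUF    // fold one input value into the state
  def merge(a: BUF, b: BUF): BUF          // combine two partial states
  def eval(buf: BUF): OUT                 // produce the final result
}

// A mutable, object-valued buffer (advantage 2 in the PR description).
final class AvgBuffer(var sum: Double = 0.0, var count: Long = 0L)

object TypedAverage extends TypedAggregator[AvgBuffer, Double, Double] {
  def createBuffer(): AvgBuffer = new AvgBuffer()
  def update(buf: AvgBuffer, x: Double): AvgBuffer = { buf.sum += x; buf.count += 1; buf }
  def merge(a: AvgBuffer, b: AvgBuffer): AvgBuffer = { a.sum += b.sum; a.count += b.count; a }
  def eval(buf: AvgBuffer): Double = if (buf.count == 0) Double.NaN else buf.sum / buf.count
}
```

Because `update` and `merge` mutate and return the same object, each input row costs only a field update, which is the serialization-avoidance benefit the description lists.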
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14239 **[Test build #64212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64212/consoleFull)** for PR 14239 at commit [`b49be73`](https://github.com/apache/spark/commit/b49be73a476af75dd37c33378aef7352e0a4902c).
[GitHub] spark issue #14723: [SQL][WIP][Test] Supports object-based aggregation funct...
Github user clockfly commented on the issue: https://github.com/apache/spark/pull/14723 @liancheng @cloud-fan @yhuai @hvanhovell @gatorsmile This PR is superseded by #14753; please review the new PR instead.

The motivation behind the change is that the aggregation function is also used by `WindowExec`, which may do continuous `update` and `eval` calls. We have to override `eval` of `ImperativeAggregate` so that `eval` can accept an aggregation buffer which contains a generic Java object. For example:

```
agg.update(buffer, row1)
agg.eval(buffer)
agg.update(buffer, row2)
agg.eval(buffer)
```
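The interleaved `update`/`eval` pattern described above can be sketched with a plain mutable buffer. `SumBuffer` and `RunningSum` are invented names for illustration, not Spark's `ImperativeAggregate` signatures; the key property is that `eval` reads the current result without consuming the buffer, so a window operator can keep updating the same object afterwards.

```scala
// Illustrative only: not Spark's ImperativeAggregate API.
final class SumBuffer(var total: Long = 0L)

object RunningSum {
  def update(buf: SumBuffer, x: Long): Unit = buf.total += x
  // eval must be non-destructive: it reads the current state and leaves
  // the buffer usable for further updates, as a window operator requires.
  def eval(buf: SumBuffer): Long = buf.total
}

val buffer = new SumBuffer()
RunningSum.update(buffer, 1L)
val afterRow1 = RunningSum.eval(buffer)  // running result after row 1
RunningSum.update(buffer, 2L)
val afterRow2 = RunningSum.eval(buffer)  // running result after rows 1 and 2
```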
[GitHub] spark issue #14753: [SPARK-17187][SQL] Supports using arbitrary Java object ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14753 **[Test build #64211 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64211/consoleFull)** for PR 14753 at commit [`6efddad`](https://github.com/apache/spark/commit/6efddadcb8e6d48e9898a8980f4dcceee4894ebc).
[GitHub] spark issue #14038: [SPARK-16317][SQL] Add a new interface to filter files i...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/14038 Oh, I don't want to take on any more work... I just think you should make the predicate passed in something that goes `FileStatus => Boolean` instead of `String => Boolean`, and do the filtering after the results come back. Regarding speedup, we've seen 20x in simple test trees, but don't have real data on how representative that is: [HADOOP-13208](https://issues.apache.org/jira/browse/HADOOP-13208)
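The difference between the two predicate shapes discussed above can be sketched as follows. `FileInfo` here is a simplified stand-in for Hadoop's `FileStatus`, invented for illustration:

```scala
// Simplified stand-in for Hadoop's FileStatus (illustrative only).
case class FileInfo(path: String, length: Long, isDirectory: Boolean)

// A String => Boolean predicate can only look at the path name...
def filterByName(listing: Seq[FileInfo], keep: String => Boolean): Seq[FileInfo] =
  listing.filter(f => keep(f.path))

// ...while a FileInfo => Boolean predicate, applied after the listing
// results come back, can also use length, file type, and so on.
def filterByStatus(listing: Seq[FileInfo], keep: FileInfo => Boolean): Seq[FileInfo] =
  listing.filter(keep)

val listed = Seq(
  FileInfo("/data/part-00000.orc", 1024L, isDirectory = false),
  FileInfo("/data/_SUCCESS", 0L, isDirectory = false))

// Example: drop zero-length marker files, which a name-only filter
// could not express without encoding the rule in the file name.
val useful = filterByStatus(listed, f => !f.isDirectory && f.length > 0)
```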
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r75700225

--- Diff: core/src/main/scala/org/apache/spark/TaskEndReason.scala ---
@@ -204,6 +213,7 @@ case object TaskResultLost extends TaskFailedReason {
 @DeveloperApi
 case object TaskKilled extends TaskFailedReason {
   override def toErrorString: String = "TaskKilled (killed intentionally)"
+  override val countTowardsTaskFailures: Boolean = false

--- End diff --

the switch to a `val` came from an earlier discussion with @kayousterhout ... there was some other confusion, and we thought maybe changing to a val would make it clearer that it is a constant. But I don't think either of us feels strongly; the argument to switch to a val was pretty weak. I can change it back
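The `val`-versus-`def` point in the diff above boils down to a small Scala pattern, sketched here in simplified form (not the full `TaskEndReason` hierarchy):

```scala
// Simplified sketch of the hierarchy touched by the diff above.
trait TaskFailedReason {
  // Default: a failure counts toward the task-failure limit.
  def countTowardsTaskFailures: Boolean = true
}

case object TaskKilled extends TaskFailedReason {
  // Overriding the def with a val is legal in Scala and signals that the
  // member is a constant, evaluated once rather than on every access.
  override val countTowardsTaskFailures: Boolean = false
}

case object SomeOtherFailure extends TaskFailedReason  // keeps the default
```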
[GitHub] spark issue #14738: [SPARK-17090][FOLLOW-UP][ML]Add expert param support to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14738 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64208/ Test PASSed.
[GitHub] spark issue #14738: [SPARK-17090][FOLLOW-UP][ML]Add expert param support to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14738 Merged build finished. Test PASSed.
[GitHub] spark issue #14738: [SPARK-17090][FOLLOW-UP][ML]Add expert param support to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14738 **[Test build #64208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64208/consoleFull)** for PR 14738 at commit [`b54b582`](https://github.com/apache/spark/commit/b54b582208554a37a68bc2a45fec6bdfed43405e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14729: [SPARK-17167] [SQL] Issue Exceptions when Analyze Table ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/14729 @viirya Yeah, a normal temporary table would be resolved as a LogicalPlan. Analyze Table does not give us any benefit there. However, you are also allowed to do this:

```sql
CREATE TEMPORARY VIEW tmp1 USING parquet OPTIONS(path 'some/location')
```

For these I would like to be able to collect statistics.
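A hypothetical end-to-end sequence for the case being discussed: a temporary view backed by a file source, followed by the statistics collection one would want to support. The `ANALYZE TABLE ... COMPUTE STATISTICS` syntax is Spark's existing command; whether it should be accepted for such temporary views is exactly what this PR is deciding, so treat this as an illustration rather than supported behavior:

```sql
-- A temporary view backed by files on disk, not a plain in-memory LogicalPlan
CREATE TEMPORARY VIEW tmp1 USING parquet OPTIONS(path 'some/location');

-- Collecting stats here could feed the optimizer (e.g. size-based join planning)
ANALYZE TABLE tmp1 COMPUTE STATISTICS;
```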
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75700385

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/AggregateWithObjectAggregateBufferSuite.scala ---
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.AggregateWithObjectAggregateBufferSuite.MaxWithObjectAggregateBuffer
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, GenericMutableRow, MutableRow, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.aggregate.{ImperativeAggregate, WithObjectAggregateBuffer}
+import org.apache.spark.sql.execution.aggregate.{SortAggregateExec}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StructType}
+
+class AggregateWithObjectAggregateBufferSuite extends QueryTest with SharedSQLContext {
--- End diff --

oh right, I misread the code.
[GitHub] spark pull request #14079: [SPARK-8425][CORE] New Blacklist Mechanism
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/14079#discussion_r75700278

--- Diff: docs/configuration.md ---
@@ -1178,6 +1178,80 @@ Apart from these, the following properties are also available, and may be useful

+ spark.blacklist.enabled
+   Default: true in cluster mode; false in local mode
+   If set to "true", prevent Spark from scheduling tasks on executors that have been blacklisted
+   due to too many task failures. The blacklisting algorithm can be further controlled by the
+   other "spark.blacklist" configuration options.
+
+ spark.blacklist.timeout
+   Default: 1h
+   (Experimental) How long a node or executor is blacklisted for the entire application, before it
+   is unconditionally removed from the blacklist to attempt running new tasks.
+
+ spark.blacklist.task.maxTaskAttemptsPerExecutor
+   Default: 2
--- End diff --

oops, forgot to update this -- good catch, thanks
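To make the documented properties concrete, here is a hypothetical submission that sets them explicitly. The property names come from the diff above; the application class, jar name, and chosen values are illustrative only:

```shell
# Illustrative only: enable blacklisting and tune the knobs documented above.
# com.example.MyApp and myapp.jar are placeholders.
spark-submit \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.timeout=1h \
  --conf spark.blacklist.task.maxTaskAttemptsPerExecutor=2 \
  --class com.example.MyApp \
  myapp.jar
```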
[GitHub] spark pull request #14749: [SPARK-17182][SQL] Mark Collect as non-determinis...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14749#discussion_r75699347

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala ---
@@ -54,6 +54,10 @@ abstract class Collect extends ImperativeAggregate {
   override def inputAggBufferAttributes: Seq[AttributeReference] = Nil

+  // Both `CollectList` and `CollectSet` are non-deterministic since their results depend on the
+  // actual order of input rows.
+  override def deterministic: Boolean = false
--- End diff --

Is `collect_set` non-deterministic? It is backed by a `HashSet`, and the way elements are iterated over does not rely on the input order.
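The question can be probed directly: for `scala.collection.mutable.HashSet`, iteration order is driven by element hashing rather than insertion order, so two sets holding the same elements usually iterate identically regardless of the order they were built in. (Order can still differ when hash collisions resolve differently, or across JVMs for types with unstable hash codes, which may be why the conservative marking was chosen.) A quick sketch:

```scala
import scala.collection.mutable

// Same elements, inserted in opposite orders
val a = mutable.HashSet.empty[Int]
Seq(1, 2, 3, 4, 5).foreach(a += _)

val b = mutable.HashSet.empty[Int]
Seq(5, 4, 3, 2, 1).foreach(b += _)

// Iteration order follows the hash table layout, not insertion order,
// so both sets typically enumerate their elements identically.
println(a.toList == b.toList)
```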
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14239 Merged build finished. Test FAILed.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14239 **[Test build #64210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64210/consoleFull)** for PR 14239 at commit [`5e93297`](https://github.com/apache/spark/commit/5e9329735ce71eed6f649f1fa16ddfbedc079193). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14239 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64210/ Test FAILed.
[GitHub] spark issue #14749: [SPARK-17182][SQL] Mark Collect as non-deterministic
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14749 hmm, I think aggregate functions don't need the concept of `deterministic`, as we never check this property for aggregate functions.
[GitHub] spark issue #14239: [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14239 **[Test build #64210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64210/consoleFull)** for PR 14239 at commit [`5e93297`](https://github.com/apache/spark/commit/5e9329735ce71eed6f649f1fa16ddfbedc079193).
[GitHub] spark pull request #10896: [SPARK-12978][SQL] Skip unnecessary final group-b...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/10896#discussion_r75695126

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala ---
@@ -19,34 +19,90 @@ package org.apache.spark.sql.execution.aggregate

 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.physical.Distribution
+import org.apache.spark.sql.execution.aggregate.{Aggregate => AggregateExec}
 import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.streaming.{StateStoreRestoreExec, StateStoreSaveExec}

 /**
+ * A pattern that finds aggregate operators to support partial aggregations.
+ */
+object ExtractPartialAggregate {
--- End diff --

okay