[GitHub] [spark] HyukjinKwon commented on a change in pull request #28114: [SPARK-31330] Automatically label PRs based on the paths they touch
HyukjinKwon commented on a change in pull request #28114: [SPARK-31330] Automatically label PRs based on the paths they touch URL: https://github.com/apache/spark/pull/28114#discussion_r404553040
## File path: .github/autolabeler.yml
## @@ -0,0 +1,54 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Bot page: https://github.com/apps/probot-autolabeler
+# The matching patterns follow the .gitignore spec.
+# See: https://git-scm.com/docs/gitignore#_pattern_format
+
+infra:
+  - ".github/"
+  - "appveyor.yml"
+  - "/tools/"
+build:
+  - "/dev/"
+  - "/build/"
+  - "/project/"
+release:
+  - "/dev/create-release/"
+docs:
+  - "docs/"
+  - "examples/"
+  - "/README.md"
+  - "/CONTRIBUTING.md"
+core:
+  - "/core/"
+sql:
+  - "sql/"
+ml:
+  - "ml/"
+  - "mllib/"
+  - "mllib-local/"
+streaming:
+  - "streaming/"
+python:
+  - "python/"
+java:
+  - "/common/"
+  - "java/"
+R:
+  - "r/"

Review comment: I think we should also add `/r/` because of paths like `sql/core/src/main/scala/org/apache/spark/sql/api/r/`.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
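Since the config above states that its patterns follow the .gitignore spec, the difference between `r/` and `/r/` is the anchoring rule: a leading slash anchors the pattern to the repository root, while an unanchored directory pattern matches a directory of that name at any depth. A minimal Python sketch of just this anchoring rule (literal directory names only; the real spec also supports wildcards and negation, and `Foo.scala` below is an illustrative placeholder filename):

```python
def dir_pattern_matches(pattern: str, file_path: str) -> bool:
    """Simplified .gitignore directory-pattern check (no wildcards)."""
    anchored = pattern.startswith("/")
    name = pattern.strip("/")
    dirs = file_path.split("/")[:-1]  # directories containing the file
    if anchored:
        # "/core/" only matches a top-level directory named "core"
        return bool(dirs) and dirs[0] == name
    # "r/" matches a directory named "r" at any depth
    return name in dirs
```

For example, `dir_pattern_matches("r/", "sql/core/src/main/scala/org/apache/spark/sql/api/r/Foo.scala")` is true while the anchored `"/r/"` variant is not, which is the distinction under discussion in the review comment.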
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28114: [SPARK-31330] Automatically label PRs based on the paths they touch
HyukjinKwon commented on a change in pull request #28114: [SPARK-31330] Automatically label PRs based on the paths they touch URL: https://github.com/apache/spark/pull/28114#discussion_r404552167
## File path: .github/autolabeler.yml
## @@ -0,0 +1,54 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Bot page: https://github.com/apps/probot-autolabeler
+# The matching patterns follow the .gitignore spec.
+# See: https://git-scm.com/docs/gitignore#_pattern_format
+
+infra:

Review comment: What about making the tags uppercase so they match the current tagging done by @dongjoon-hyun's script?
[GitHub] [spark] viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404551886
## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
## @@ -66,6 +68,19 @@ object FrequentItems extends Logging {
     }
   }

+  /** Helper function to resolve column to expr (if not yet) */
+  // TODO: it might be helpful to have this helper in Dataset.scala,
+  // e.g. `drop` function uses exactly the same flow to deal with
+  // `Column` arguments
+  private def resolveColumn(df: DataFrame, col: Column): Column = {
+    col match {
+      case Column(u: UnresolvedAttribute) =>
+        Column(df.queryExecution.analyzed.resolveQuoted(
+          u.name, df.sparkSession.sessionState.analyzer.resolver).getOrElse(u))
+      case Column(_expr: Expression) => col
+    }
+  }

Review comment: No, I mean that for now you only handle the `Column(UnresolvedAttribute)` case, but a `Column` can contain any unresolved expression, which may involve many `UnresolvedAttribute`s. For the latter, the added `resolveColumn` cannot resolve it correctly.
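viirya's objection is that pattern-matching only on `Column(u: UnresolvedAttribute)` catches a bare attribute reference but falls through for any compound expression, so attributes nested inside (say) an arithmetic expression never get resolved. The difference between matching only the root node and transforming the whole tree (analogous to Catalyst's `transformUp`) can be sketched language-agnostically in Python; the `UnresolvedAttribute`/`ResolvedAttribute`/`Add` node classes here are illustrative stand-ins, not Spark's actual classes:

```python
from dataclasses import dataclass

@dataclass
class UnresolvedAttribute:
    name: str

@dataclass
class ResolvedAttribute:
    name: str

@dataclass
class Add:  # stand-in for any compound expression wrapping attributes
    left: object
    right: object

def resolve_top_level(expr, schema):
    # Mirrors the reviewed helper: only a bare attribute is resolved;
    # anything else is returned unchanged.
    if isinstance(expr, UnresolvedAttribute) and expr.name in schema:
        return ResolvedAttribute(expr.name)
    return expr

def resolve_tree(expr, schema):
    # Recurse into the expression tree so nested attributes are
    # resolved too, which is what the review comment asks for.
    if isinstance(expr, UnresolvedAttribute) and expr.name in schema:
        return ResolvedAttribute(expr.name)
    if isinstance(expr, Add):
        return Add(resolve_tree(expr.left, schema),
                   resolve_tree(expr.right, schema))
    return expr
```

With `expr = Add(UnresolvedAttribute("a"), UnresolvedAttribute("b"))`, `resolve_top_level` returns the tree untouched (both attributes still unresolved), while `resolve_tree` resolves both leaves.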
[GitHub] [spark] SparkQA commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
SparkQA commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#issuecomment-610189628 **[Test build #120901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120901/testReport)** for PR 28133 at commit [`c580634`](https://github.com/apache/spark/commit/c580634e16b246af621dce1abf0ed26fa8449bb2).
[GitHub] [spark] cloud-fan commented on issue #28026: [SPARK-31257][SQL] Unify create table syntax
cloud-fan commented on issue #28026: [SPARK-31257][SQL] Unify create table syntax URL: https://github.com/apache/spark/pull/28026#issuecomment-610189239

> the conversion to v2 cannot simply ignore them without being a correctness bug

I agree, and that's why I propose "update ResolveCatalogs to fail if Hive-specific clauses are specified in the create statement plan for v2 catalogs". Then at least it's not a correctness bug.

> The option prefix is very small, but an important part of how we pass SERDEPROPERTIES.

Good to know that it's a small change. Can we do it in a separate PR? That would make the reviews more focused.
[GitHub] [spark] viirya commented on a change in pull request #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
viirya commented on a change in pull request #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#discussion_r404544033
## File path: docs/_data/menu-sql.yaml
## @@ -154,7 +154,9 @@
   url: sql-ref-syntax-qry-select-distribute-by.html
 - text: LIMIT Clause
   url: sql-ref-syntax-qry-select-limit.html
-- text: Join Hints
+- text: JOIN
+  url: sql-ref-syntax-qry-select-join.html
+- text: JOIN HINTS

Review comment: Why do we need to upper-case "hints"? We don't really have `HINTS` in a query.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610188316 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120900/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610188316 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120900/ Test PASSed.
[GitHub] [spark] HyukjinKwon closed pull request #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
HyukjinKwon closed pull request #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610188300 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610188216 **[Test build #120899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120899/testReport)** for PR 28130 at commit [`7df973a`](https://github.com/apache/spark/commit/7df973ab9143133320b04207e6d23b980f7d9b77).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
SparkQA commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610188222 **[Test build #120900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120900/testReport)** for PR 28120 at commit [`944afd5`](https://github.com/apache/spark/commit/944afd50f10a9fae8ecec4794c867372dcd62bd2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
SparkQA removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185078 **[Test build #120900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120900/testReport)** for PR 28120 at commit [`944afd5`](https://github.com/apache/spark/commit/944afd50f10a9fae8ecec4794c867372dcd62bd2).
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610188306 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120899/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610188306 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120899/ Test PASSed.
[GitHub] [spark] HyukjinKwon commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
HyukjinKwon commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610188206 Thank you @beliefer. I merged to branch-3.0 accordingly!
[GitHub] [spark] AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610188312 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610188300 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
SparkQA removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185077 **[Test build #120899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120899/testReport)** for PR 28130 at commit [`7df973a`](https://github.com/apache/spark/commit/7df973ab9143133320b04207e6d23b980f7d9b77).
[GitHub] [spark] AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610188312 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
AmplabJenkins commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#issuecomment-610187781 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25593/ Test PASSed.
[GitHub] [spark] HyukjinKwon commented on issue #27863: [SPARK-31109][MESOS][DOC] Add version information to the configuration of Mesos
HyukjinKwon commented on issue #27863: [SPARK-31109][MESOS][DOC] Add version information to the configuration of Mesos URL: https://github.com/apache/spark/pull/27863#issuecomment-610187646 Merged to branch-3.0 too.
[GitHub] [spark] HyukjinKwon commented on issue #27875: [SPARK-31118][K8S][DOC] Add version information to the configuration of K8S
HyukjinKwon commented on issue #27875: [SPARK-31118][K8S][DOC] Add version information to the configuration of K8S URL: https://github.com/apache/spark/pull/27875#issuecomment-610187700 Merged to master and branch-3.0.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
AmplabJenkins removed a comment on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#issuecomment-610187775 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
AmplabJenkins removed a comment on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#issuecomment-610187781 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25593/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
AmplabJenkins commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#issuecomment-610187775 Merged build finished. Test PASSed.
[GitHub] [spark] HyukjinKwon commented on issue #27856: [SPARK-31092][YARN][DOC] Add version information to the configuration of Yarn
HyukjinKwon commented on issue #27856: [SPARK-31092][YARN][DOC] Add version information to the configuration of Yarn URL: https://github.com/apache/spark/pull/27856#issuecomment-610187586 Merged to master and branch-3.0.
[GitHub] [spark] cloud-fan commented on a change in pull request #28129: [SPARK-31346][SQL]Add new configuration to make sure temporary directory cleaned
cloud-fan commented on a change in pull request #28129: [SPARK-31346][SQL]Add new configuration to make sure temporary directory cleaned URL: https://github.com/apache/spark/pull/28129#discussion_r404549412
## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
## @@ -140,7 +141,9 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     try {
       createdTempDir.foreach { path =>
         val fs = path.getFileSystem(hadoopConf)
-        if (fs.delete(path, true)) {
+        // Sometimes (e.g., when speculative task is enabled), temporary directories may be
+        // left uncleaned, confirmTempDirDeleted can confirm deleteOnExit.

Review comment: Do you mean that even if we delete the temp dir here, some tasks may re-create it later?
[GitHub] [spark] kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404548266
## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
## @@ -66,6 +68,19 @@ object FrequentItems extends Logging {
     }
   }

+  /** Helper function to resolve column to expr (if not yet) */
+  // TODO: it might be helpful to have this helper in Dataset.scala,
+  // e.g. `drop` function uses exactly the same flow to deal with
+  // `Column` arguments
+  private def resolveColumn(df: DataFrame, col: Column): Column = {
+    col match {
+      case Column(u: UnresolvedAttribute) =>
+        Column(df.queryExecution.analyzed.resolveQuoted(
+          u.name, df.sparkSession.sessionState.analyzer.resolver).getOrElse(u))
+      case Column(_expr: Expression) => col
+    }
+  }

Review comment: The code here tries to resolve the column if it has an `UnresolvedAttribute`. If it still does not provide clarity, I think it's fair to throw an exception, similar to how `Dataset.drop` works when the argument given is a column with an unresolved attribute.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185441 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25592/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185388 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25591/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185435 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185384 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185435 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185441 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25592/
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610184139 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25590/
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#discussion_r404547325

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ##

@@ -243,6 +245,22 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
       checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
     }
   }
+

Review comment: Please note that the following test case is executed twice: once in `OrcSourceSuite` and once in `HiveOrcSourceSuite`.
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185384 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185388 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25591/
[GitHub] [spark] kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404547265

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ##

@@ -66,6 +68,19 @@ object FrequentItems extends Logging {
     }
   }
+
+  /** Helper function to resolve column to expr (if not yet) */
+  // TODO: it might be helpful to have this helper in Dataset.scala,
+  // e.g. `drop` function uses exactly the same flow to deal with
+  // `Column` arguments

Review comment: The hope was to resolve this TODO before merging (either by keeping the code here and removing the TODO, or by moving it to another layer and likewise removing the TODO).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
HyukjinKwon commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404546852

## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ##

@@ -97,14 +97,38 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       cols: Array[String],
       probabilities: Array[Double],
       relativeError: Double): Array[Array[Double]] = {
-    StatFunctions.multipleApproxQuantiles(
-      df.select(cols.map(col): _*),
+    approxQuantile(cols.map(df.col), probabilities, relativeError)
   }
+
+  /**
+   * Calculates the approximate quantiles of numerical columns of a DataFrame.
+   * @see `approxQuantile(col:Str* approxQuantile)` for detailed description.
+   *
+   * @param cols the numerical columns
+   * @param probabilities a list of quantile probabilities
+   *   Each number must belong to [0, 1].
+   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
+   * @param relativeError The relative target precision to achieve (greater than or equal to 0).
+   *   If set to zero, the exact quantiles are computed, which could be very expensive.
+   *   Note that values greater than 1 are accepted but give the same result as 1.
+   * @return the approximate quantiles at the given probabilities of each column
+   *
+   * @note null and NaN values will be ignored in numerical columns before calculation. For
+   *   columns only containing null or NaN values, an empty array is returned.
+   *
+   * @since 3.0.0

Review comment: nit: 3.0.0 -> 3.1.0. New features will land in Spark 3.1.0 because `branch-3.0` for Spark 3.0 has already been cut and is code-frozen.
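The contract described in the quoted doc comment (probabilities in [0, 1]; 0 is the minimum, 0.5 the median, 1 the maximum, evaluated per column) can be illustrated with a toy exact nearest-rank quantile. This is only a sketch: Spark's `approxQuantile` uses an approximate algorithm with a relative-error bound, and the names below are illustrative.

```scala
// Toy exact quantile model of the approxQuantile contract: for each
// probability p in [0, 1], pick the nearest-rank element of the sorted data.
object QuantileSketch {
  def quantile(xs: Seq[Double], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, "probability must belong to [0, 1]")
    val sorted = xs.sorted
    // p = 0 selects the minimum, p = 1 the maximum.
    val idx = math.min(sorted.length - 1, (p * sorted.length).toInt)
    sorted(idx)
  }

  // Mirrors the per-probability list shape of the reviewed overload.
  def quantiles(xs: Seq[Double], ps: Seq[Double]): Seq[Double] =
    ps.map(quantile(xs, _))
}
```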
[GitHub] [spark] kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404546932

## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ##

@@ -132,7 +156,28 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * @since 1.4.0
    */
   def cov(col1: String, col2: String): Double = {
-    StatFunctions.calculateCov(df, Seq(col1, col2))
+    cov(df.col(col1), df.col(col2))
   }
+
+  /**
+   * Calculate the sample covariance of two numerical columns of a DataFrame.
+   * This version of cov accepts [[Column]] rather than names.

Review comment: I mentioned this because the docs for existing functions have the same comment. I will remove it.
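The API shape being reviewed, where the `String` overload just delegates to the `Column`-typed one, can be sketched without Spark. Everything below is a hypothetical model (the `CovSketch` object, the `Map`-based "frame", and the sample-covariance math), not the Spark implementation.

```scala
// Hypothetical model of the overload-delegation pattern in the diff:
// one overload holds the logic, the name-based overload just forwards.
object CovSketch {
  // Sample covariance of two equal-length numeric sequences
  // (divides by n - 1, matching the "sample covariance" wording).
  def cov(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.length > 1)
    val mx = xs.sum / xs.length
    val my = ys.sum / ys.length
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / (xs.length - 1)
  }

  // Mirrors `cov(col1: String, col2: String) = cov(df.col(col1), df.col(col2))`:
  // resolve names against a toy "frame" and forward to the typed overload.
  def cov(frame: Map[String, Seq[Double]], c1: String, c2: String): Double =
    cov(frame(c1), frame(c2))
}
```

The design point under discussion is exactly this: keeping one canonical implementation so the `String` and `Column` entry points cannot drift apart.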
[GitHub] [spark] SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185077 **[Test build #120899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120899/testReport)** for PR 28130 at commit [`7df973a`](https://github.com/apache/spark/commit/7df973ab9143133320b04207e6d23b980f7d9b77).
[GitHub] [spark] SparkQA commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
SparkQA commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185078 **[Test build #120900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120900/testReport)** for PR 28120 at commit [`944afd5`](https://github.com/apache/spark/commit/944afd50f10a9fae8ecec4794c867372dcd62bd2).
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610184139 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25590/
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610184136 Build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610184136 Build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
AmplabJenkins commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610183269 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25589/
[GitHub] [spark] AmplabJenkins removed a comment on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
AmplabJenkins removed a comment on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610183269 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25589/
[GitHub] [spark] AmplabJenkins commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
AmplabJenkins commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610183261 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
AmplabJenkins removed a comment on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610183261 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
SparkQA commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610182898 **[Test build #120897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120897/testReport)** for PR 28142 at commit [`18e6932`](https://github.com/apache/spark/commit/18e69325e299f33ad31856513417cf5d61625707).
[GitHub] [spark] dongjoon-hyun commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610182891 cc @cloud-fan and @HyukjinKwon. Also, cc @gatorsmile.
[GitHub] [spark] SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610182922 **[Test build #120898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120898/testReport)** for PR 28130 at commit [`c2bcf38`](https://github.com/apache/spark/commit/c2bcf3833537575e513664e998b4faf47205f88d).
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#discussion_r404543896

## File path: core/src/test/scala/org/apache/spark/util/VersionUtilsSuite.scala ##

@@ -73,4 +73,29 @@ class VersionUtilsSuite extends SparkFunSuite {
       }
     }
   }
+
+  test("Return short version number") {
+    assert(shortVersion("3.0.0") === "3.0.0")
+    assert(shortVersion("3.0.0-SNAPSHOT") === "3.0.0")

Review comment: I didn't change the version `3.0.x` in order to minimize the diff between `master` and `branch-2.4`.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#discussion_r404543896

## File path: core/src/test/scala/org/apache/spark/util/VersionUtilsSuite.scala ##

@@ -73,4 +73,29 @@ class VersionUtilsSuite extends SparkFunSuite {
       }
     }
   }
+
+  test("Return short version number") {
+    assert(shortVersion("3.0.0") === "3.0.0")
+    assert(shortVersion("3.0.0-SNAPSHOT") === "3.0.0")

Review comment: I didn't change the version example in order to minimize the diff between `master` and `branch-2.4`.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#discussion_r404543692

## File path: core/src/main/scala/org/apache/spark/util/VersionUtils.scala ##

@@ -36,6 +37,19 @@ private[spark] object VersionUtils {
    */
   def minorVersion(sparkVersion: String): Int = majorMinorVersion(sparkVersion)._2
+
+  /**
+   * Given a Spark version string, return the short version string.
+   * E.g., for 3.0.0-SNAPSHOT, return '3.0.0'.

Review comment: I didn't change this example in order to minimize the diff between `branch-2.4` and `master`.
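The helper under review strips a suffix such as `-SNAPSHOT` from a full version string, as the test case above exercises. A minimal standalone sketch of that behavior follows; the regex and the `VersionSketch` object are assumptions for illustration, not necessarily Spark's exact implementation.

```scala
// Hypothetical sketch of shortVersion: keep the leading
// major.minor.patch digits and drop any trailing suffix.
object VersionSketch {
  // major.minor.patch, optionally followed by a suffix like "-SNAPSHOT".
  private val shortVersionRegex = """^(\d+\.\d+\.\d+)(.*)$""".r

  def shortVersion(sparkVersion: String): String = sparkVersion match {
    case shortVersionRegex(short, _) => short
    case _ =>
      throw new IllegalArgumentException(
        s"Cannot find major/minor/maintenance numbers in '$sparkVersion'")
  }
}
```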
[GitHub] [spark] dongjoon-hyun opened a new pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun opened a new pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142

### What changes were proposed in this pull request?

Currently, Spark writes the Spark version number into Hive table properties with `spark.sql.create.version`.

```
parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761,
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
```

This PR aims to write Spark versions to ORC/Parquet file metadata with `org.apache.spark.sql.create.version` because we already used the `org.apache.` prefix in Parquet metadata. It's different from the Hive table property key `spark.sql.create.version`, but it seems that we cannot change the Hive table property for backward compatibility. After this PR, ORC and Parquet files generated by Spark will have the following metadata.

**ORC (`native` and `hive` implementation)**
```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0
```

**PARQUET**
```
$ parquet-tools meta /tmp/p
...
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: org.apache.spark.sql.create.version = 3.0.0
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```

### Why are the changes needed?

This backport helps us handle these files differently in Apache Spark 3.0.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with newly added test cases.
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404543342

## File path: docs/sql-ref-functions-builtin-aggregate.md ##

@@ -19,4 +19,616 @@ license: | limitations under the License. ---
-Aggregate functions \ No newline at end of file
+Spark SQL provides built-in aggregate functions defined in the Dataset API and SQL interface. Aggregate functions
+operate on a group of rows and return a single value.
+
+Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions.
+
+**Note:** All functions below have another signature which takes String as an expression.
+
+Function | Parameter Type(s) | Description
+{any | some | bool_or}(expression) | boolean | Returns true if at least one value is true.
+approx_count_distinct(expression[, relativeSD]) | (long, double) | relativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.
+{avg | mean}(expression) | numeric or string | Returns the average of values in the input expression.
+{bool_and | every}(expression) | boolean | Returns true if all values are true.
+collect_list(expression) | any | Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle.
+collect_set(expression) | any | Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle.
+corr(expression1, expression2) | double, double | Returns the Pearson coefficient of correlation between a set of number pairs.
+count([DISTINCT] {* | expression1[, expression2]}) | none; any | If DISTINCT is specified, returns the number of rows for which the supplied expression(s) are unique and not null; if `*` is specified, returns the total number of retrieved rows, including rows containing null; otherwise, returns the number of rows for which the supplied expression(s) are all not null.
+count_if(predicate) | expression that will be used for aggregation calculation | Returns the number of rows for which the predicate evaluates to `TRUE`.
+count_min_sketch(expression, eps, confidence, seed) | integral or string or binary, double, double, integer | eps and confidence are double values between 0.0 and 1.0; seed is a positive integer. Returns a count-min sketch of an expression with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
+covar_pop(expression1, expression2) | double, double

Review comment: done
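The boolean and conditional aggregates documented above have simple collection analogues, which can make their semantics concrete. The following is a hypothetical pure-Scala model; the `AggSketch` object and method names are illustrative, and these are not Spark API calls.

```scala
// Toy models of the documented aggregate semantics over an in-memory "group".
object AggSketch {
  // any / some / bool_or: true if at least one value is true.
  def boolOr(xs: Seq[Boolean]): Boolean = xs.exists(identity)

  // bool_and / every: true if all values are true.
  def boolAnd(xs: Seq[Boolean]): Boolean = xs.forall(identity)

  // count_if: the number of rows for which the predicate evaluates to true.
  def countIf[A](xs: Seq[A])(p: A => Boolean): Int = xs.count(p)
}
```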
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404543316

## File path: docs/sql-ref-functions-builtin-aggregate.md ##

@@ -19,4 +19,616 @@ (same hunk as above)
+corr(expression1, expression2) | double, double

Review comment: done
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
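The corr entry quoted above (and the covariance entries later in the same table) can be checked against a plain computation of the same formulas. The sketch below is pure Python, independent of Spark; the function names are illustrative only, mirroring how the sample covariance and Pearson correlation of a group of number pairs are defined:

```python
import math

def covar_samp(xs, ys):
    # sample covariance: sum((x - mx) * (y - my)) / (n - 1)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def corr(xs, ys):
    # Pearson correlation: covariance normalized by the two standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# xs and ys are perfectly linearly related, so corr is approximately 1.0
print(covar_samp([1, 2, 3], [2, 4, 6]))  # 2.0
print(corr([1, 2, 3], [2, 4, 6]))
```

In Spark these run as distributed aggregates over grouped rows; the formulas themselves are the same.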
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404543086 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ (the entries from `any` through `corr` are the same as quoted above; the quote continues:)
+
+ count([DISTINCT] {* | expression1[, expression2]})
+ none; any
+ If DISTINCT is specified, returns the number of rows for which the supplied expression(s) are unique and not null; if `*` is specified, returns the total number of retrieved rows, including rows containing null; otherwise, returns the number of rows for which the supplied expression(s) are all not null.
+
+ count_if(predicate)
+ expression used for the aggregation calculation
+ Returns the number of rows for which the predicate evaluates to `TRUE`.
+
+ count_min_sketch(expression, eps, confidence, seed)
+ integral or string or binary, double, double, integer
+ eps and confidence are double values between 0.0 and 1.0; seed is a positive integer. Returns a count-min sketch of the expression with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. A count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
+
+ covar_pop(expression1, expression2)
+ double, double
+ Returns the population covariance of a set of number pairs.
+
+ covar_samp(expression1, expression2)
+ double, double
+ Returns the sample covariance of a set of number pairs.
+
+ {first | first_value}(expression[, isIgnoreNull])
+ any, boolean
+ Returns the first value of the expression for a group of rows. If isIgnoreNull is true, returns only non-null values; the default is false. This function is non-deterministic.
+
+ kurtosis(expression)
+ double
+ Returns the kurtosis value calculated from the values of a group.
+
+ {last | last_value}(expression[, isIgnoreNull])
+ any, boolean
+ Returns the last value of the expression for a group of rows. If isIgnoreNull is true, returns only non-null values; the default is false. This function is non-deterministic.
+
+ max(expression)
+ any numeric, string, date/time, or arrays of these types
+ Returns the maximum value of the expression.
+
+ max_by(expression1, expression2)
+ any numeric, string, date/time, or arrays of these types
+ Returns the value of expression1 associated with the maximum value of expression2.
+
+ min(expression)
+ any numeric, string, date/time, or arrays of these types
+ Returns the minimum value of the expression.
+
+ min_by(expression1, expression2)
+ any numeric, string, date/time, or arrays of these types
+ Returns the value of expression1 associated with the minimum value of expression2.
+
+ percentile(expression, percentage [, frequency])
+ numeric type, double, integral type
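The isIgnoreNull flag on first/last described in the quoted table determines whether null values are skipped. A minimal Python model of that behavior (the function name is illustrative; in Spark these are aggregates over possibly shuffled rows, which is why they are non-deterministic):

```python
def first(values, is_ignore_null=False):
    # with is_ignore_null=True, skip leading nulls; otherwise return the
    # first value as-is, even when it is null
    for v in values:
        if v is not None or not is_ignore_null:
            return v
    return None  # all values were null (or the group was empty)

rows = [None, None, 7, 8]
print(first(rows))                       # None: default keeps nulls
print(first(rows, is_ignore_null=True))  # 7: first non-null value
```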
[GitHub] [spark] viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404531215 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ## @@ -66,6 +68,19 @@ object FrequentItems extends Logging { } }
+  /** Helper function to resolve column to expr (if not yet) */
+  // TODO: it might be helpful to have this helper in Dataset.scala,
+  // e.g. `drop` function uses exactly the same flow to deal with
+  // `Column` arguments
Review comment: We use either a block comment or several end-of-line comments; we don't mix the two like this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404541728 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ## @@ -66,6 +68,19 @@ object FrequentItems extends Logging { } }
+  /** Helper function to resolve column to expr (if not yet) */
+  // TODO: it might be helpful to have this helper in Dataset.scala,
+  // e.g. `drop` function uses exactly the same flow to deal with
+  // `Column` arguments
+  private def resolveColumn(df: DataFrame, col: Column): Column = {
+    col match {
+      case Column(u: UnresolvedAttribute) =>
+        Column(df.queryExecution.analyzed.resolveQuoted(
+          u.name, df.sparkSession.sessionState.analyzer.resolver).getOrElse(u))
+      case Column(_expr: Expression) => col
+    }
+  }
Review comment: The problem with Column is that it can contain an unresolved expression, for example `UnresolvedAttribute + UnresolvedAttribute ...`. When only column names are allowed, we can rely on `df.resolve(colName)` to resolve them. Once you extend the API to Column, you cannot do the same check as before. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
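The `resolveQuoted(...).getOrElse(u)` pattern in the snippet above uses the resolved attribute when the analyzed plan knows the name and otherwise keeps the unresolved one. A toy Python model of that fallback (the schema dict, attribute strings, and function name here are hypothetical stand-ins for the analyzed plan's resolver):

```python
def resolve_column(schema, name):
    # mirrors resolveQuoted(name, resolver).getOrElse(u): return the resolved
    # attribute if the "analyzer" knows the name, else keep the unresolved form
    resolved = schema.get(name)  # None models resolveQuoted finding nothing
    return resolved if resolved is not None else f"unresolved({name})"

schema = {"age": "age#12", "name": "name#7"}
print(resolve_column(schema, "age"))   # age#12
print(resolve_column(schema, "oops"))  # unresolved(oops)
```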
[GitHub] [spark] viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404532547 ## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ## @@ -132,7 +156,28 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { * @since 1.4.0 */ def cov(col1: String, col2: String): Double = { -StatFunctions.calculateCov(df, Seq(col1, col2)) +cov(df.col(col1), df.col(col2)) + } + + /** + * Calculate the sample covariance of two numerical columns of a DataFrame. + * This version of cov accepts [[Column]] rather than names. Review comment: I think we don't need to explicitly mention it. The function signature already tells it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404542675 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## (same diff context as quoted above)
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404542255 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## (same diff context as quoted above)
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404541183 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## (same diff context as quoted above, at the entry)
+ count([DISTINCT] {* | expression1[, expression2]})
+ none; any
Review comment: Ah, I see. I said in my earlier comment that the three `count` entries should be merged, but on reflection I think it is better to keep `count(expr)` and `count(*)` separate, in line with the PostgreSQL docs: https://www.postgresql.org/docs/9.5/functions-aggregate.html This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
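The distinction maropu raises — count(*) versus count(expr) versus count(DISTINCT expr) — comes down to null handling, which a small pure-Python sketch (an illustration, not Spark code) makes concrete:

```python
values = [1, None, 2, 2, None]

# count(*): every row, nulls included
count_star = len(values)

# count(expr): only rows where the expression is not null
count_expr = sum(1 for v in values if v is not None)

# count(DISTINCT expr): unique non-null values
count_distinct = len({v for v in values if v is not None})

print(count_star, count_expr, count_distinct)  # 5 3 2
```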
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404540923 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## (same diff context as quoted above)
[GitHub] [spark] maropu commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
maropu commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610177594 cc: @srowen @viirya This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610177067 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120896/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610177064 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
SparkQA removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174263 **[Test build #120896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120896/testReport)** for PR 28121 at commit [`8be6c4a`](https://github.com/apache/spark/commit/8be6c4a12ffb783d938d91b69d6ddd1c191af75d). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
AmplabJenkins commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610176890 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
SparkQA commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610176880 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/25586/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
AmplabJenkins removed a comment on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610176890 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
SparkQA commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610176994 **[Test build #120896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120896/testReport)** for PR 28121 at commit [`8be6c4a`](https://github.com/apache/spark/commit/8be6c4a12ffb783d938d91b69d6ddd1c191af75d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
AmplabJenkins removed a comment on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610176900 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25586/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
AmplabJenkins commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610176900 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25586/ Test PASSed.
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404537908 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double + Returns Pearson coefficient of correlation between a set of number pairs. 
+ + + count([DISTINCT] {* | expression1[, expression2]}) + none; any Review comment: for `*`, the data type is `none`, for `expression1` and `expression2` data type, it is `any`. is the following better to understand? `none, any, any`
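The `count` variants debated above differ only in null handling. As a minimal sketch of the documented semantics in plain Python (not Spark; the sample rows are hypothetical, with `None` standing in for SQL NULL):

```python
# Emulate the three Spark SQL count variants over a list of rows.
rows = [(1, "a"), (None, "b"), (1, "a"), (2, None)]

# count(*): every row, including rows that contain NULLs.
count_star = len(rows)

# count(c1, c2): rows where all supplied expressions are non-null.
count_expr = sum(1 for r in rows if all(v is not None for v in r))

# count(DISTINCT c1, c2): unique, fully non-null rows.
count_distinct = len({r for r in rows if all(v is not None for v in r)})

print(count_star, count_expr, count_distinct)  # → 4 2 1
```

This also illustrates the `none; any` notation under discussion: `*` takes no typed argument, while each `expressionN` may be of any type.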
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404537440 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,628 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a column name instead of Column. + +* Table of contents +{:toc} + + +FunctionParametersDescription + + + + {any | some | bool_or}(c: Column) + Column name + Returns true if at least one value is true + + + approx_count_distinct(c: Column[, relativeSD: Double]]) + Column name; relativeSD: the maximum estimation error allowed. + Returns the estimated cardinality by HyperLogLog++ + + + {avg | mean}(c: Column) + Column name + Returns the average of values in the input column. + + + {bool_and | every}(c: Column) + Column name + Returns true if all values are true + + + collect_list(c: Column) + Column name + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle + + + collect_set(c: Column) + Column name + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. 
+ + + corr(c1: Column, c2: Column) + Column name + Returns Pearson coefficient of correlation between a set of number pairs + + + count(*) + None + Returns the total number of retrieved rows, including rows containing null + + + count(c: Column[, c: Column]) + Column name + Returns the number of rows for which the supplied column(s) are all not null + + + count(DISTINCT c: Column[, c: Column]) + Column name + Returns the number of rows for which the supplied column(s) are unique and not null + + + count_if(Predicate) + Expression that will be used for aggregation calculation + Returns the count number from the predicate evaluate to TRUE values + + + count_min_sketch(c: Column, eps: double, confidence: double, seed integer) +Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer +Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.. + + + covar_pop(c1: Column, c2: Column) + Column name + Returns the population covariance of a set of number pairs + + + covar_samp(c1: Column, c2: Column) + Column name + Returns the sample covariance of a set of number pairs + + + {first | first_value}(c: Column[, isIgnoreNull]) Review comment: btw, I think its better to use the same type names here with https://github.com/apache/spark/blob/master/docs/sql-ref-datatypes.md This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404537320 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double + Returns Pearson coefficient of correlation between a set of number pairs. 
+ + + count([DISTINCT] {* | expression1[, expression2]}) + none; any + If specified DISTINCT, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null. + + + count_if(predicate) + expression that will be used for aggregation calculation + Returns the count number from the predicate evaluate to `TRUE` values. + + + count_min_sketch(expression, eps, confidence, seed) + integral or string or binary, double, double, integer + Eps and confidence are the double values between 0.0 and 1.0, seed is a positive integer. Returns a count-min sketch of a expression with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space. + + + covar_pop(expression1, expression2) + double, double + Returns the population covariance of a set of number pairs. + + + covar_samp(expression1, expression2) + double + Returns the sample covariance of a set of number pairs. + + + {first | first_value}(expression[, isIgnoreNull]) + any, boolean + Returns the first value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values, default is false. This function is non-deterministic. + + + kurtosis(expression) + double + Returns the kurtosis value calculated from values of a group. + + + {last | last_value}(expression[, isIgnoreNull]) + any, boolean + Returns the last value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values, default is false. This function is non-deterministic. + + + max(expression) + any numeric, string, date/time or arrays of these types + Returns the maximum value of the expression. 
+ + + max_by(expression1, expression2) + any numeric, string, date/time or arrays of these types + Returns the value of expression1 associated with the maximum value of expression2. + + + min(expression) + any numeric, string, date/time or arrays of these types + Returns the minimum value of the expression. + + + min_by(expression1, expression2) + any numeric, string, date/time or arrays of these types + Returns the value of expression1 associated with the minimum value of expression2. + + + percentile(expression, percentage [, frequency]) + numeric Type, double, integral type
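The `first`/`last` entries quoted above hinge on the optional `isIgnoreNull` flag. A plain-Python sketch of that behavior (hypothetical helper names, not Spark's API; in Spark the result additionally depends on row order, which may be non-deterministic after a shuffle):

```python
def first_value(values, ignore_nulls=False):
    # Return the first element; with ignore_nulls, skip None (SQL NULL) entries.
    for v in values:
        if not ignore_nulls or v is not None:
            return v
    return None

def last_value(values, ignore_nulls=False):
    # Same semantics scanned from the end of the group.
    return first_value(list(reversed(values)), ignore_nulls)

print(first_value([None, 3, 5]))        # → None (default keeps NULLs)
print(first_value([None, 3, 5], True))  # → 3
print(last_value([3, 5, None], True))   # → 5
```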
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536581 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double + Returns Pearson coefficient of correlation between a set of number pairs. 
+ + + count([DISTINCT] {* | expression1[, expression2]}) + none; any + If specified DISTINCT, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null. + + + count_if(predicate) + expression that will be used for aggregation calculation + Returns the count number from the predicate evaluate to `TRUE` values. + + + count_min_sketch(expression, eps, confidence, seed) + integral or string or binary, double, double, integer + Eps and confidence are the double values between 0.0 and 1.0, seed is a positive integer. Returns a count-min sketch of a expression with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space. + + + covar_pop(expression1, expression2) + double, double Review comment: `(double, double)`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
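The `covar_pop`/`covar_samp` rows being reviewed differ only in the divisor. A small plain-Python sketch of the two formulas (illustrative functions, not Spark's implementation):

```python
def covar_pop(xs, ys):
    # Population covariance: sum of cross-deviations divided by n.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def covar_samp(xs, ys):
    # Sample covariance: divide by n - 1 (Bessel's correction).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

print(covar_pop([1, 2, 3], [2, 4, 6]))   # → 1.333...
print(covar_samp([1, 2, 3], [2, 4, 6]))  # → 2.0
```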
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536482 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double + Returns Pearson coefficient of correlation between a set of number pairs. + + + count([DISTINCT] {* | expression1[, expression2]}) + none; any Review comment: What does `none; any` mean? 
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536420 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double Review comment: `(double, double)`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
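For the `corr` entry discussed above ("Pearson coefficient of correlation"), the computation can be sketched in plain Python (an illustrative function, not Spark's code):

```python
from math import sqrt

def corr(xs, ys):
    # Pearson correlation: covariance normalized by the two standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(corr([1, 2, 3], [2, 4, 6]))  # → 1.0 (perfect positive correlation)
print(corr([1, 2, 3], [6, 4, 2]))  # → -1.0 (perfect negative correlation)
```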
[GitHub] [spark] AmplabJenkins removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25588/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25588/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174696 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174696 Merged build finished. Test PASSed.
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536117 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double + Returns Pearson coefficient of correlation between a set of number pairs. 
+ + + count([DISTINCT] {* | expression1[, expression2]}) + none; any + If specified DISTINCT, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null. + + + count_if(predicate) + expression that will be used for aggregation calculation + Returns the count number from the predicate evaluate to `TRUE` values. + + + count_min_sketch(expression, eps, confidence, seed) + integral or string or binary, double, double, integer + Eps and confidence are the double values between 0.0 and 1.0, seed is a positive integer. Returns a count-min sketch of a expression with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space. + + + covar_pop(expression1, expression2) + double, double + Returns the population covariance of a set of number pairs. + + + covar_samp(expression1, expression2) + double + Returns the sample covariance of a set of number pairs. + + + {first | first_value}(expression[, isIgnoreNull]) + any, boolean + Returns the first value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values, default is false. This function is non-deterministic. + + + kurtosis(expression) + double + Returns the kurtosis value calculated from values of a group. + + + {last | last_value}(expression[, isIgnoreNull]) + any, boolean + Returns the last value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values, default is false. This function is non-deterministic. + + + max(expression) + any numeric, string, date/time or arrays of these types + Returns the maximum value of the expression. 
+ + + max_by(expression1, expression2) + any numeric, string, date/time or arrays of these types + Returns the value of expression1 associated with the maximum value of expression2. + + + min(expression) + any numeric, string, date/time or arrays of these types + Returns the minimum value of the expression. + + + min_by(expression1, expression2) + any numeric, string, date/time or arrays of these types + Returns the value of expression1 associated with the minimum value of expression2. + + + percentile(expression, percentage [, frequency]) + numeric Type, double, integral type
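The `count_min_sketch` description quoted above ("a probabilistic data structure used for cardinality estimation using sub-linear space") can be made concrete with a minimal plain-Python sketch. This class is hypothetical, for illustration only, and is not Spark's `CountMinSketch` (which is parameterized by eps/confidence and serializes to bytes):

```python
import hashlib

class CountMinSketch:
    # Minimal count-min sketch: depth hash rows of width counters each.
    # Estimates are upper bounds; hash collisions can only inflate counts.
    def __init__(self, width=1000, depth=5, seed=42):
        self.width, self.depth, self.seed = width, depth, seed
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive a per-row bucket from a seeded hash of the item.
        h = hashlib.sha256(f"{self.seed}:{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Minimum over rows bounds the inflation from collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

With only one distinct item added, the estimate is exact; with many items, the width/depth (derived from eps and confidence in Spark) bound the overestimation.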
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536057

## File path: docs/sql-ref-functions-builtin-aggregate.md

## @@ -19,4 +19,616 @@ license: | limitations under the License. ---

-Aggregate functions \ No newline at end of file

+Spark SQL provides built-in aggregate functions defined in the Dataset API and the SQL interface. Aggregate functions operate on a group of rows and return a single value.

+Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions.

+**Note:** All functions below have another signature which takes a String as the expression.

| Function | Parameter Type(s) | Description |
| --- | --- | --- |
| {any \| some \| bool_or}(expression) | boolean | Returns true if at least one value is true. |
| approx_count_distinct(expression[, relativeSD]) | (long, double) | relativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. |
| {avg \| mean}(expression) | numeric or string | Returns the average of the values in the input expression. |
| {bool_and \| every}(expression) | boolean | Returns true if all values are true. |
| collect_list(expression) | any | Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle. |
| collect_set(expression) | any | Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle. |
| corr(expression1, expression2) | double, double | Returns the Pearson correlation coefficient between a set of number pairs. |
| count([DISTINCT] {* \| expression1[, expression2]}) | none; any | If DISTINCT is specified, returns the number of rows for which the supplied expression(s) are unique and not null; if `*` is specified, returns the total number of retrieved rows, including rows containing null; otherwise, returns the number of rows for which the supplied expression(s) are all not null. |
| count_if(predicate) | boolean expression | Returns the number of rows for which the predicate evaluates to `TRUE`. |
| count_min_sketch(expression, eps, confidence, seed) | integral, string, or binary; double; double; integer | eps and confidence are double values between 0.0 and 1.0, and seed is a positive integer. Returns a count-min sketch of the expression with the given eps, confidence, and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. A count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space. |
| covar_pop(expression1, expression2) | double, double | Returns the population covariance of a set of number pairs. |
| covar_samp(expression1, expression2) | double, double | Returns the sample covariance of a set of number pairs. |
| {first \| first_value}(expression[, isIgnoreNull]) | any, boolean | Returns the first value of the expression for a group of rows. If isIgnoreNull is true, returns only non-null values; the default is false. This function is non-deterministic. |
| kurtosis(expression) | double | Returns the kurtosis value calculated from the values of a group. |
| {last \| last_value}(expression[, isIgnoreNull]) | any, boolean | Returns the last value of the expression for a group of rows. If isIgnoreNull is true, returns only non-null values; the default is false. This function is non-deterministic. |
| max(expression) | any numeric, string, date/time, or arrays of these types | Returns the maximum value of the expression. |
| max_by(expression1, expression2) | any numeric, string, date/time, or arrays of these types | Returns the value of expression1 associated with the maximum value of expression2. |
| min(expression) | any numeric, string, date/time, or arrays of these types | Returns the minimum value of the expression. |
| min_by(expression1, expression2) | any numeric, string, date/time, or arrays of these types | Returns the value of expression1 associated with the minimum value of expression2. |
| percentile(expression, percentage [, frequency]) | numeric type, double, integral type | |
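The covariance and correlation entries quoted above differ only in their normalization, which is easy to miss in prose. The following is a minimal pure-Python sketch of the documented semantics only — the function names mirror the SQL functions but the data is made up, and this is not Spark's actual implementation:

```python
# Illustrative sketch of the covar_pop / covar_samp / corr semantics
# described in the quoted documentation. Not Spark code.
from math import sqrt

def covar_pop(xs, ys):
    """Population covariance: divide by n."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def covar_samp(xs, ys):
    """Sample covariance: divide by n - 1 (Bessel's correction)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def corr(xs, ys):
    """Pearson correlation: population covariance over the product of
    the population standard deviations."""
    sx = sqrt(covar_pop(xs, xs))
    sy = sqrt(covar_pop(ys, ys))
    return covar_pop(xs, ys) / (sx * sy)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(covar_pop(xs, ys))   # 2.5
print(covar_samp(xs, ys))  # 10/3 — larger, because the divisor is n - 1
print(corr(xs, ys))        # 1.0 — the pairs are perfectly linear
```

The only difference between the two covariances is the divisor (n versus n − 1); corr normalizes the population covariance by both standard deviations, so perfectly linear pairs yield exactly 1.0.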
[GitHub] [spark] SparkQA commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
SparkQA commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174263 **[Test build #120896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120896/testReport)** for PR 28121 at commit [`8be6c4a`](https://github.com/apache/spark/commit/8be6c4a12ffb783d938d91b69d6ddd1c191af75d). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536057 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@
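The count_min_sketch function discussed in this review thread is the least self-explanatory entry in the table. As a rough, hypothetical illustration of how such a structure works — using Python's built-in hashing rather than Spark's hash family, with no serialization, and following the standard width = ceil(e/eps), depth = ceil(ln(1/(1 − confidence))) construction; Spark's real implementation is `org.apache.spark.util.sketch.CountMinSketch`:

```python
# Simplified count-min sketch, for illustration only. Spark's actual
# CountMinSketch serializes to bytes and uses its own hash functions.
import math
import random

class CountMinSketch:
    def __init__(self, eps, confidence, seed):
        # eps bounds the overestimation error; confidence bounds the
        # probability that the error bound holds.
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / (1.0 - confidence)))
        rng = random.Random(seed)
        # one independent hash seed per row of the table
        self.seeds = [rng.randrange(1 << 31) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, item):
        return hash((self.seeds[row], item)) % self.width

    def add(self, item):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += 1

    def estimate(self, item):
        # point query: minimum over rows; may overestimate due to hash
        # collisions, but never underestimates the true count
        return min(self.table[r][self._index(r, item)]
                   for r in range(self.depth))
```

Because each row only ever adds to a counter, the minimum over rows is an upper-biased estimate: the structure can overcount on collisions but never undercounts, which is why the documentation calls it probabilistic and sub-linear in space.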
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404535304 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@
[GitHub] [spark] AmplabJenkins removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference
AmplabJenkins removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference URL: https://github.com/apache/spark/pull/28139#issuecomment-610172951 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120895/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference
AmplabJenkins removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference URL: https://github.com/apache/spark/pull/28139#issuecomment-610172945 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference
SparkQA removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference URL: https://github.com/apache/spark/pull/28139#issuecomment-610170265 **[Test build #120895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120895/testReport)** for PR 28139 at commit [`1cce7c8`](https://github.com/apache/spark/commit/1cce7c8e2c2bd184acefdb05d7ffad739dbb571a).
[GitHub] [spark] AmplabJenkins commented on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference
AmplabJenkins commented on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference URL: https://github.com/apache/spark/pull/28139#issuecomment-610172945 Merged build finished. Test PASSed.