[GitHub] [spark] HyukjinKwon commented on a change in pull request #28114: [SPARK-31330] Automatically label PRs based on the paths they touch
HyukjinKwon commented on a change in pull request #28114: [SPARK-31330] Automatically label PRs based on the paths they touch URL: https://github.com/apache/spark/pull/28114#discussion_r404553040
## File path: .github/autolabeler.yml
## @@ -0,0 +1,54 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Bot page: https://github.com/apps/probot-autolabeler
+# The matching patterns follow the .gitignore spec.
+# See: https://git-scm.com/docs/gitignore#_pattern_format
+
+infra:
+  - ".github/"
+  - "appveyor.yml"
+  - "/tools/"
+build:
+  - "/dev/"
+  - "/build/"
+  - "/project/"
+release:
+  - "/dev/create-release/"
+docs:
+  - "docs/"
+  - "examples/"
+  - "/README.md"
+  - "/CONTRIBUTING.md"
+core:
+  - "/core/"
+sql:
+  - "sql/"
+ml:
+  - "ml/"
+  - "mllib/"
+  - "mllib-local/"
+streaming:
+  - "streaming/"
+python:
+  - "python/"
+java:
+  - "/common/"
+  - "java/"
+R:
+  - "r/"

Review comment: I think we should also add `/r/` because of paths like `sql/core/src/main/scala/org/apache/spark/sql/api/r/`.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
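Since the config above states that its patterns follow the .gitignore spec, the difference between `r/` and `/r/` is the anchoring rule: a leading slash anchors the pattern to the repository root, while an unanchored directory pattern matches a directory of that name at any depth. A minimal Python sketch of just this anchoring rule (literal directory names only; the real spec also supports wildcards and negation, and `Foo.scala` below is an illustrative placeholder filename):

```python
def dir_pattern_matches(pattern: str, file_path: str) -> bool:
    """Simplified .gitignore directory-pattern check (no wildcards)."""
    anchored = pattern.startswith("/")
    name = pattern.strip("/")
    dirs = file_path.split("/")[:-1]  # directories containing the file
    if anchored:
        # "/core/" only matches a top-level directory named "core"
        return bool(dirs) and dirs[0] == name
    # "r/" matches a directory named "r" at any depth
    return name in dirs
```

For example, `dir_pattern_matches("r/", "sql/core/src/main/scala/org/apache/spark/sql/api/r/Foo.scala")` is true while the anchored `"/r/"` variant is not, which is the distinction under discussion in the review comment.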
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28114: [SPARK-31330] Automatically label PRs based on the paths they touch
HyukjinKwon commented on a change in pull request #28114: [SPARK-31330] Automatically label PRs based on the paths they touch URL: https://github.com/apache/spark/pull/28114#discussion_r404552167
## File path: .github/autolabeler.yml
## @@ -0,0 +1,54 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Bot page: https://github.com/apps/probot-autolabeler
+# The matching patterns follow the .gitignore spec.
+# See: https://git-scm.com/docs/gitignore#_pattern_format
+
+infra:

Review comment: What about making the tags uppercase so they match the current tagging done by @dongjoon-hyun's script?
[GitHub] [spark] viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404551886
## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
## @@ -66,6 +68,19 @@ object FrequentItems extends Logging {
     }
   }

+  /** Helper function to resolve column to expr (if not yet) */
+  // TODO: it might be helpful to have this helper in Dataset.scala,
+  // e.g. `drop` function uses exactly the same flow to deal with
+  // `Column` arguments
+  private def resolveColumn(df: DataFrame, col: Column): Column = {
+    col match {
+      case Column(u: UnresolvedAttribute) =>
+        Column(df.queryExecution.analyzed.resolveQuoted(
+          u.name, df.sparkSession.sessionState.analyzer.resolver).getOrElse(u))
+      case Column(_expr: Expression) => col
+    }
+  }

Review comment: No, I mean that for now you only handle the `Column(UnresolvedAttribute)` case, but a `Column` can contain any unresolved expression, which may involve many `UnresolvedAttribute`s. For the latter, the added `resolveColumn` cannot resolve it correctly.
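viirya's objection is that pattern-matching only on `Column(u: UnresolvedAttribute)` catches a bare attribute reference but falls through for any compound expression, so attributes nested inside (say) an arithmetic expression never get resolved. The difference between matching only the root node and transforming the whole tree (analogous to Catalyst's `transformUp`) can be sketched language-agnostically in Python; the `UnresolvedAttribute`/`ResolvedAttribute`/`Add` node classes here are illustrative stand-ins, not Spark's actual classes:

```python
from dataclasses import dataclass

@dataclass
class UnresolvedAttribute:
    name: str

@dataclass
class ResolvedAttribute:
    name: str

@dataclass
class Add:  # stand-in for any compound expression wrapping attributes
    left: object
    right: object

def resolve_top_level(expr, schema):
    # Mirrors the reviewed helper: only a bare attribute is resolved;
    # anything else is returned unchanged.
    if isinstance(expr, UnresolvedAttribute) and expr.name in schema:
        return ResolvedAttribute(expr.name)
    return expr

def resolve_tree(expr, schema):
    # Recurse into the expression tree so nested attributes are
    # resolved too, which is what the review comment asks for.
    if isinstance(expr, UnresolvedAttribute) and expr.name in schema:
        return ResolvedAttribute(expr.name)
    if isinstance(expr, Add):
        return Add(resolve_tree(expr.left, schema),
                   resolve_tree(expr.right, schema))
    return expr
```

With `expr = Add(UnresolvedAttribute("a"), UnresolvedAttribute("b"))`, `resolve_top_level` returns the tree untouched (both attributes still unresolved), while `resolve_tree` resolves both leaves.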
[GitHub] [spark] SparkQA commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
SparkQA commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#issuecomment-610189628 **[Test build #120901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120901/testReport)** for PR 28133 at commit [`c580634`](https://github.com/apache/spark/commit/c580634e16b246af621dce1abf0ed26fa8449bb2).
[GitHub] [spark] cloud-fan commented on issue #28026: [SPARK-31257][SQL] Unify create table syntax
cloud-fan commented on issue #28026: [SPARK-31257][SQL] Unify create table syntax URL: https://github.com/apache/spark/pull/28026#issuecomment-610189239

> the conversion to v2 cannot simply ignore them without being a correctness bug

I agree, and that's why I propose "update ResolveCatalogs to fail if Hive-specific clauses are specified in the create statement plan for v2 catalogs". Then at least it's not a correctness bug.

> The option prefix is very small, but an important part of how we pass SERDEPROPERTIES.

Good to know that it's a small change. Can we do it in a separate PR? That would make the reviews more focused.
[GitHub] [spark] viirya commented on a change in pull request #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
viirya commented on a change in pull request #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#discussion_r404544033
## File path: docs/_data/menu-sql.yaml
## @@ -154,7 +154,9 @@
   url: sql-ref-syntax-qry-select-distribute-by.html
 - text: LIMIT Clause
   url: sql-ref-syntax-qry-select-limit.html
-- text: Join Hints
+- text: JOIN
+  url: sql-ref-syntax-qry-select-join.html
+- text: JOIN HINTS

Review comment: Why do we need to upper-case "hints"? We don't really have `HINTS` in a query.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610188316 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120900/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610188316 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120900/ Test PASSed.
[GitHub] [spark] HyukjinKwon closed pull request #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
HyukjinKwon closed pull request #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610188300 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610188216 **[Test build #120899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120899/testReport)** for PR 28130 at commit [`7df973a`](https://github.com/apache/spark/commit/7df973ab9143133320b04207e6d23b980f7d9b77).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
SparkQA commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610188222 **[Test build #120900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120900/testReport)** for PR 28120 at commit [`944afd5`](https://github.com/apache/spark/commit/944afd50f10a9fae8ecec4794c867372dcd62bd2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
SparkQA removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185078 **[Test build #120900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120900/testReport)** for PR 28120 at commit [`944afd5`](https://github.com/apache/spark/commit/944afd50f10a9fae8ecec4794c867372dcd62bd2).
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610188306 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120899/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610188306 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120899/ Test PASSed.
[GitHub] [spark] HyukjinKwon commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
HyukjinKwon commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610188206 Thank you @beliefer. I merged to branch-3.0 accordingly!
[GitHub] [spark] AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610188312 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610188300 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
SparkQA removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185077 **[Test build #120899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120899/testReport)** for PR 28130 at commit [`7df973a`](https://github.com/apache/spark/commit/7df973ab9143133320b04207e6d23b980f7d9b77).
[GitHub] [spark] AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610188312 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
AmplabJenkins commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#issuecomment-610187781 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25593/ Test PASSed.
[GitHub] [spark] HyukjinKwon commented on issue #27863: [SPARK-31109][MESOS][DOC] Add version information to the configuration of Mesos
HyukjinKwon commented on issue #27863: [SPARK-31109][MESOS][DOC] Add version information to the configuration of Mesos URL: https://github.com/apache/spark/pull/27863#issuecomment-610187646 Merged to branch-3.0 too.
[GitHub] [spark] HyukjinKwon commented on issue #27875: [SPARK-31118][K8S][DOC] Add version information to the configuration of K8S
HyukjinKwon commented on issue #27875: [SPARK-31118][K8S][DOC] Add version information to the configuration of K8S URL: https://github.com/apache/spark/pull/27875#issuecomment-610187700 Merged to master and branch-3.0.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
AmplabJenkins removed a comment on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#issuecomment-610187775 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
AmplabJenkins removed a comment on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#issuecomment-610187781 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25593/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
AmplabJenkins commented on issue #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#issuecomment-610187775 Merged build finished. Test PASSed.
[GitHub] [spark] HyukjinKwon commented on issue #27856: [SPARK-31092][YARN][DOC] Add version information to the configuration of Yarn
HyukjinKwon commented on issue #27856: [SPARK-31092][YARN][DOC] Add version information to the configuration of Yarn URL: https://github.com/apache/spark/pull/27856#issuecomment-610187586 Merged to master and branch-3.0.
[GitHub] [spark] cloud-fan commented on a change in pull request #28129: [SPARK-31346][SQL]Add new configuration to make sure temporary directory cleaned
cloud-fan commented on a change in pull request #28129: [SPARK-31346][SQL]Add new configuration to make sure temporary directory cleaned URL: https://github.com/apache/spark/pull/28129#discussion_r404549412
## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala
## @@ -140,7 +141,9 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
     try {
       createdTempDir.foreach { path =>
         val fs = path.getFileSystem(hadoopConf)
-        if (fs.delete(path, true)) {
+        // Sometimes (e.g., when speculative task is enabled), temporary directories may be
+        // left uncleaned, confirmTempDirDeleted can confirm deleteOnExit.

Review comment: Do you mean that even if we delete the temp dir here, some tasks may re-create it later?
[GitHub] [spark] kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404548266
## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
## @@ -66,6 +68,19 @@ object FrequentItems extends Logging {
     }
   }

+  /** Helper function to resolve column to expr (if not yet) */
+  // TODO: it might be helpful to have this helper in Dataset.scala,
+  // e.g. `drop` function uses exactly the same flow to deal with
+  // `Column` arguments
+  private def resolveColumn(df: DataFrame, col: Column): Column = {
+    col match {
+      case Column(u: UnresolvedAttribute) =>
+        Column(df.queryExecution.analyzed.resolveQuoted(
+          u.name, df.sparkSession.sessionState.analyzer.resolver).getOrElse(u))
+      case Column(_expr: Expression) => col
+    }
+  }

Review comment: The code here tries to resolve the column if it has an `UnresolvedAttribute`. If it still does not provide clarity, I think it's fair to throw an exception, similar to how `Dataset.drop` works when the argument given is a column with an unresolved attribute.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185441 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25592/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185388 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25591/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins removed a comment on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185435 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185384 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185435 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
AmplabJenkins commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185441 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25592/
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610184139 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25590/
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#discussion_r404547325

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ##

@@ -243,6 +245,22 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
       checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
     }
   }
+

Review comment: Please note that the following test case is executed twice: once in `OrcSourceSuite` and once in `HiveOrcSourceSuite`.
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185384 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185388 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25591/
[GitHub] [spark] kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404547265

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ##

@@ -66,6 +68,19 @@ object FrequentItems extends Logging {
     }
   }
+
+  /** Helper function to resolve column to expr (if not yet) */
+  // TODO: it might be helpful to have this helper in Dataset.scala,
+  // e.g. `drop` function uses exactly the same flow to deal with
+  // `Column` arguments

Review comment: The hope was to resolve this TODO before merging (either by keeping the code here and removing the TODO, or by moving it to another layer and likewise removing the TODO).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
HyukjinKwon commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404546852

## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ##

@@ -97,14 +97,38 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       cols: Array[String],
       probabilities: Array[Double],
       relativeError: Double): Array[Array[Double]] = {
-    StatFunctions.multipleApproxQuantiles(
-      df.select(cols.map(col): _*),
+    approxQuantile(cols.map(df.col), probabilities, relativeError)
   }
+
+  /**
+   * Calculates the approximate quantiles of numerical columns of a DataFrame.
+   * @see `approxQuantile(col:Str* approxQuantile)` for detailed description.
+   *
+   * @param cols the numerical columns
+   * @param probabilities a list of quantile probabilities
+   *   Each number must belong to [0, 1].
+   *   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
+   * @param relativeError The relative target precision to achieve (greater than or equal to 0).
+   *   If set to zero, the exact quantiles are computed, which could be very expensive.
+   *   Note that values greater than 1 are accepted but give the same result as 1.
+   * @return the approximate quantiles at the given probabilities of each column
+   *
+   * @note null and NaN values will be ignored in numerical columns before calculation. For
+   *   columns only containing null or NaN values, an empty array is returned.
+   *
+   * @since 3.0.0

Review comment: nit: 3.0.0 -> 3.1.0. New features will land in Spark 3.1.0 because `branch-3.0` for Spark 3.0 has already been cut and is code-frozen.
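The contract described in the quoted doc comment (probabilities in [0, 1]; 0 is the minimum, 0.5 the median, 1 the maximum, evaluated per column) can be illustrated with a toy exact nearest-rank quantile. This is only a sketch: Spark's `approxQuantile` uses an approximate algorithm with a relative-error bound, and the names below are illustrative.

```scala
// Toy exact quantile model of the approxQuantile contract: for each
// probability p in [0, 1], pick the nearest-rank element of the sorted data.
object QuantileSketch {
  def quantile(xs: Seq[Double], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, "probability must belong to [0, 1]")
    val sorted = xs.sorted
    // p = 0 selects the minimum, p = 1 the maximum.
    val idx = math.min(sorted.length - 1, (p * sorted.length).toInt)
    sorted(idx)
  }

  // Mirrors the per-probability list shape of the reviewed overload.
  def quantiles(xs: Seq[Double], ps: Seq[Double]): Seq[Double] =
    ps.map(quantile(xs, _))
}
```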
[GitHub] [spark] kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
kachayev commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404546932

## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ##

@@ -132,7 +156,28 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * @since 1.4.0
    */
   def cov(col1: String, col2: String): Double = {
-    StatFunctions.calculateCov(df, Seq(col1, col2))
+    cov(df.col(col1), df.col(col2))
   }
+
+  /**
+   * Calculate the sample covariance of two numerical columns of a DataFrame.
+   * This version of cov accepts [[Column]] rather than names.

Review comment: I mentioned this because the docs for existing functions have the same comment. I will remove it.
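The API shape being reviewed, where the `String` overload just delegates to the `Column`-typed one, can be sketched without Spark. Everything below is a hypothetical model (the `CovSketch` object, the `Map`-based "frame", and the sample-covariance math), not the Spark implementation.

```scala
// Hypothetical model of the overload-delegation pattern in the diff:
// one overload holds the logic, the name-based overload just forwards.
object CovSketch {
  // Sample covariance of two equal-length numeric sequences
  // (divides by n - 1, matching the "sample covariance" wording).
  def cov(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.length > 1)
    val mx = xs.sum / xs.length
    val my = ys.sum / ys.length
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / (xs.length - 1)
  }

  // Mirrors `cov(col1: String, col2: String) = cov(df.col(col1), df.col(col2))`:
  // resolve names against a toy "frame" and forward to the typed overload.
  def cov(frame: Map[String, Seq[Double]], c1: String, c2: String): Double =
    cov(frame(c1), frame(c2))
}
```

The design point under discussion is exactly this: keeping one canonical implementation so the `String` and `Column` entry points cannot drift apart.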
[GitHub] [spark] SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610185077 **[Test build #120899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120899/testReport)** for PR 28130 at commit [`7df973a`](https://github.com/apache/spark/commit/7df973ab9143133320b04207e6d23b980f7d9b77).
[GitHub] [spark] SparkQA commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
SparkQA commented on issue #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#issuecomment-610185078 **[Test build #120900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120900/testReport)** for PR 28120 at commit [`944afd5`](https://github.com/apache/spark/commit/944afd50f10a9fae8ecec4794c867372dcd62bd2).
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610184139 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25590/
[GitHub] [spark] AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610184136 Build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
AmplabJenkins removed a comment on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610184136 Build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
AmplabJenkins commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610183269 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25589/
[GitHub] [spark] AmplabJenkins removed a comment on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
AmplabJenkins removed a comment on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610183269 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25589/
[GitHub] [spark] AmplabJenkins commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
AmplabJenkins commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610183261 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
AmplabJenkins removed a comment on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610183261 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
SparkQA commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610182898 **[Test build #120897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120897/testReport)** for PR 28142 at commit [`18e6932`](https://github.com/apache/spark/commit/18e69325e299f33ad31856513417cf5d61625707).
[GitHub] [spark] dongjoon-hyun commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun commented on issue #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#issuecomment-610182891 cc @cloud-fan and @HyukjinKwon. Also, cc @gatorsmile.
[GitHub] [spark] SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference
SparkQA commented on issue #28130: [SPARK-31355][SQL][DOCS] Document TABLESAMPLE in SQL Reference URL: https://github.com/apache/spark/pull/28130#issuecomment-610182922 **[Test build #120898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120898/testReport)** for PR 28130 at commit [`c2bcf38`](https://github.com/apache/spark/commit/c2bcf3833537575e513664e998b4faf47205f88d).
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#discussion_r404543896

## File path: core/src/test/scala/org/apache/spark/util/VersionUtilsSuite.scala ##

@@ -73,4 +73,29 @@ class VersionUtilsSuite extends SparkFunSuite {
       }
     }
   }
+
+  test("Return short version number") {
+    assert(shortVersion("3.0.0") === "3.0.0")
+    assert(shortVersion("3.0.0-SNAPSHOT") === "3.0.0")

Review comment: I didn't change the version `3.0.x` in order to minimize the diff between `master` and `branch-2.4`.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#discussion_r404543896

## File path: core/src/test/scala/org/apache/spark/util/VersionUtilsSuite.scala ##

@@ -73,4 +73,29 @@ class VersionUtilsSuite extends SparkFunSuite {
       }
     }
   }
+
+  test("Return short version number") {
+    assert(shortVersion("3.0.0") === "3.0.0")
+    assert(shortVersion("3.0.0-SNAPSHOT") === "3.0.0")

Review comment: I didn't change the version example in order to minimize the diff between `master` and `branch-2.4`.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun commented on a change in pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142#discussion_r404543692

## File path: core/src/main/scala/org/apache/spark/util/VersionUtils.scala ##

@@ -36,6 +37,19 @@ private[spark] object VersionUtils {
    */
   def minorVersion(sparkVersion: String): Int = majorMinorVersion(sparkVersion)._2
+
+  /**
+   * Given a Spark version string, return the short version string.
+   * E.g., for 3.0.0-SNAPSHOT, return '3.0.0'.

Review comment: I didn't change this example in order to minimize the diff between `branch-2.4` and `master`.
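The helper under review strips a suffix such as `-SNAPSHOT` from a full version string, as the test case above exercises. A minimal standalone sketch of that behavior follows; the regex and the `VersionSketch` object are assumptions for illustration, not necessarily Spark's exact implementation.

```scala
// Hypothetical sketch of shortVersion: keep the leading
// major.minor.patch digits and drop any trailing suffix.
object VersionSketch {
  // major.minor.patch, optionally followed by a suffix like "-SNAPSHOT".
  private val shortVersionRegex = """^(\d+\.\d+\.\d+)(.*)$""".r

  def shortVersion(sparkVersion: String): String = sparkVersion match {
    case shortVersionRegex(short, _) => short
    case _ =>
      throw new IllegalArgumentException(
        s"Cannot find major/minor/maintenance numbers in '$sparkVersion'")
  }
}
```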
[GitHub] [spark] dongjoon-hyun opened a new pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata
dongjoon-hyun opened a new pull request #28142: [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata URL: https://github.com/apache/spark/pull/28142

### What changes were proposed in this pull request?

Currently, Spark writes the Spark version number into Hive table properties with `spark.sql.create.version`.

```
parameters:{
  spark.sql.sources.schema.part.0={
    "type":"struct",
    "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
  },
  transient_lastDdlTime=1541142761,
  spark.sql.sources.schema.numParts=1,
  spark.sql.create.version=2.4.0
}
```

This PR aims to write Spark versions to ORC/Parquet file metadata with `org.apache.spark.sql.create.version` because we already used the `org.apache.` prefix in Parquet metadata. It's different from the Hive table property key `spark.sql.create.version`, but it seems that we cannot change the Hive table property for backward compatibility. After this PR, ORC and Parquet files generated by Spark will have the following metadata.

**ORC (`native` and `hive` implementation)**
```
$ orc-tools meta /tmp/o
File Version: 0.12 with ...
...
User Metadata:
  org.apache.spark.sql.create.version=3.0.0
```

**PARQUET**
```
$ parquet-tools meta /tmp/p
...
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: org.apache.spark.sql.create.version = 3.0.0
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
```

### Why are the changes needed?

This backport helps us handle these files differently in Apache Spark 3.0.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with newly added test cases.
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404543342

## File path: docs/sql-ref-functions-builtin-aggregate.md ##

@@ -19,4 +19,616 @@ license: | limitations under the License. ---
-Aggregate functions \ No newline at end of file
+Spark SQL provides built-in aggregate functions defined in the Dataset API and SQL interface. Aggregate functions
+operate on a group of rows and return a single value.
+
+Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions.
+
+**Note:** All functions below have another signature which takes String as an expression.
+
+Function | Parameter Type(s) | Description
+{any | some | bool_or}(expression) | boolean | Returns true if at least one value is true.
+approx_count_distinct(expression[, relativeSD]) | (long, double) | relativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.
+{avg | mean}(expression) | numeric or string | Returns the average of values in the input expression.
+{bool_and | every}(expression) | boolean | Returns true if all values are true.
+collect_list(expression) | any | Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle.
+collect_set(expression) | any | Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle.
+corr(expression1, expression2) | double, double | Returns the Pearson coefficient of correlation between a set of number pairs.
+count([DISTINCT] {* | expression1[, expression2]}) | none; any | If DISTINCT is specified, returns the number of rows for which the supplied expression(s) are unique and not null; if `*` is specified, returns the total number of retrieved rows, including rows containing null; otherwise, returns the number of rows for which the supplied expression(s) are all not null.
+count_if(predicate) | expression that will be used for aggregation calculation | Returns the number of rows for which the predicate evaluates to `TRUE`.
+count_min_sketch(expression, eps, confidence, seed) | integral or string or binary, double, double, integer | eps and confidence are double values between 0.0 and 1.0; seed is a positive integer. Returns a count-min sketch of an expression with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
+covar_pop(expression1, expression2) | double, double

Review comment: done
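The boolean and conditional aggregates documented above have simple collection analogues, which can make their semantics concrete. The following is a hypothetical pure-Scala model; the `AggSketch` object and method names are illustrative, and these are not Spark API calls.

```scala
// Toy models of the documented aggregate semantics over an in-memory "group".
object AggSketch {
  // any / some / bool_or: true if at least one value is true.
  def boolOr(xs: Seq[Boolean]): Boolean = xs.exists(identity)

  // bool_and / every: true if all values are true.
  def boolAnd(xs: Seq[Boolean]): Boolean = xs.forall(identity)

  // count_if: the number of rows for which the predicate evaluates to true.
  def countIf[A](xs: Seq[A])(p: A => Boolean): Int = xs.count(p)
}
```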
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404543316

## File path: docs/sql-ref-functions-builtin-aggregate.md ##

@@ -19,4 +19,616 @@ (same hunk as above)
+corr(expression1, expression2) | double, double

Review comment: done
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
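The corr entry quoted above (and the covariance entries later in the same table) can be checked against a plain computation of the same formulas. The sketch below is pure Python, independent of Spark; the function names are illustrative only, mirroring how the sample covariance and Pearson correlation of a group of number pairs are defined:

```python
import math

def covar_samp(xs, ys):
    # sample covariance: sum((x - mx) * (y - my)) / (n - 1)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def corr(xs, ys):
    # Pearson correlation: covariance normalized by the two standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# xs and ys are perfectly linearly related, so corr is approximately 1.0
print(covar_samp([1, 2, 3], [2, 4, 6]))  # 2.0
print(corr([1, 2, 3], [2, 4, 6]))
```

In Spark these run as distributed aggregates over grouped rows; the formulas themselves are the same.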
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404543086 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ (the entries from `any` through `corr` are the same as quoted above; the quote continues:)
+
+ count([DISTINCT] {* | expression1[, expression2]})
+ none; any
+ If DISTINCT is specified, returns the number of rows for which the supplied expression(s) are unique and not null; if `*` is specified, returns the total number of retrieved rows, including rows containing null; otherwise, returns the number of rows for which the supplied expression(s) are all not null.
+
+ count_if(predicate)
+ expression used for the aggregation calculation
+ Returns the number of rows for which the predicate evaluates to `TRUE`.
+
+ count_min_sketch(expression, eps, confidence, seed)
+ integral or string or binary, double, double, integer
+ eps and confidence are double values between 0.0 and 1.0; seed is a positive integer. Returns a count-min sketch of the expression with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. A count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
+
+ covar_pop(expression1, expression2)
+ double, double
+ Returns the population covariance of a set of number pairs.
+
+ covar_samp(expression1, expression2)
+ double, double
+ Returns the sample covariance of a set of number pairs.
+
+ {first | first_value}(expression[, isIgnoreNull])
+ any, boolean
+ Returns the first value of the expression for a group of rows. If isIgnoreNull is true, returns only non-null values; the default is false. This function is non-deterministic.
+
+ kurtosis(expression)
+ double
+ Returns the kurtosis value calculated from the values of a group.
+
+ {last | last_value}(expression[, isIgnoreNull])
+ any, boolean
+ Returns the last value of the expression for a group of rows. If isIgnoreNull is true, returns only non-null values; the default is false. This function is non-deterministic.
+
+ max(expression)
+ any numeric, string, date/time, or arrays of these types
+ Returns the maximum value of the expression.
+
+ max_by(expression1, expression2)
+ any numeric, string, date/time, or arrays of these types
+ Returns the value of expression1 associated with the maximum value of expression2.
+
+ min(expression)
+ any numeric, string, date/time, or arrays of these types
+ Returns the minimum value of the expression.
+
+ min_by(expression1, expression2)
+ any numeric, string, date/time, or arrays of these types
+ Returns the value of expression1 associated with the minimum value of expression2.
+
+ percentile(expression, percentage [, frequency])
+ numeric type, double, integral type
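The isIgnoreNull flag on first/last described in the quoted table determines whether null values are skipped. A minimal Python model of that behavior (the function name is illustrative; in Spark these are aggregates over possibly shuffled rows, which is why they are non-deterministic):

```python
def first(values, is_ignore_null=False):
    # with is_ignore_null=True, skip leading nulls; otherwise return the
    # first value as-is, even when it is null
    for v in values:
        if v is not None or not is_ignore_null:
            return v
    return None  # all values were null (or the group was empty)

rows = [None, None, 7, 8]
print(first(rows))                       # None: default keeps nulls
print(first(rows, is_ignore_null=True))  # 7: first non-null value
```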
[GitHub] [spark] viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404531215 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ## @@ -66,6 +68,19 @@ object FrequentItems extends Logging { } }
+  /** Helper function to resolve column to expr (if not yet) */
+  // TODO: it might be helpful to have this helper in Dataset.scala,
+  // e.g. `drop` function uses exactly the same flow to deal with
+  // `Column` arguments
Review comment: We use either a block comment or several end-of-line comments; we don't mix the two like this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404541728 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ## @@ -66,6 +68,19 @@ object FrequentItems extends Logging { } }
+  /** Helper function to resolve column to expr (if not yet) */
+  // TODO: it might be helpful to have this helper in Dataset.scala,
+  // e.g. `drop` function uses exactly the same flow to deal with
+  // `Column` arguments
+  private def resolveColumn(df: DataFrame, col: Column): Column = {
+    col match {
+      case Column(u: UnresolvedAttribute) =>
+        Column(df.queryExecution.analyzed.resolveQuoted(
+          u.name, df.sparkSession.sessionState.analyzer.resolver).getOrElse(u))
+      case Column(_expr: Expression) => col
+    }
+  }
Review comment: The problem with Column is that it can contain an unresolved expression, for example `UnresolvedAttribute + UnresolvedAttribute ...`. When only column names are allowed, we can rely on `df.resolve(colName)` to resolve them. Once you extend the API to Column, you cannot do the same check as before. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
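The `resolveQuoted(...).getOrElse(u)` pattern in the snippet above uses the resolved attribute when the analyzed plan knows the name and otherwise keeps the unresolved one. A toy Python model of that fallback (the schema dict, attribute strings, and function name here are hypothetical stand-ins for the analyzed plan's resolver):

```python
def resolve_column(schema, name):
    # mirrors resolveQuoted(name, resolver).getOrElse(u): return the resolved
    # attribute if the "analyzer" knows the name, else keep the unresolved form
    resolved = schema.get(name)  # None models resolveQuoted finding nothing
    return resolved if resolved is not None else f"unresolved({name})"

schema = {"age": "age#12", "name": "name#7"}
print(resolve_column(schema, "age"))   # age#12
print(resolve_column(schema, "oops"))  # unresolved(oops)
```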
[GitHub] [spark] viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type
viirya commented on a change in pull request #28133: [SPARK-31156][SQL] DataFrameStatFunctions API to be consistent with respect to Column type URL: https://github.com/apache/spark/pull/28133#discussion_r404532547 ## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ## @@ -132,7 +156,28 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { * @since 1.4.0 */ def cov(col1: String, col2: String): Double = { -StatFunctions.calculateCov(df, Seq(col1, col2)) +cov(df.col(col1), df.col(col2)) + } + + /** + * Calculate the sample covariance of two numerical columns of a DataFrame. + * This version of cov accepts [[Column]] rather than names. Review comment: I think we don't need to explicitly mention it. The function signature already tells it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404542675 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## (same diff context as quoted above)
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404542255 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## (same diff context as quoted above)
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404541183 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## (same diff context as quoted above, at the entry)
+ count([DISTINCT] {* | expression1[, expression2]})
+ none; any
Review comment: Ah, I see. I said in my earlier comment that the three `count` entries should be merged, but on reflection I think it is better to keep `count(expr)` and `count(*)` separate, in line with the PostgreSQL docs: https://www.postgresql.org/docs/9.5/functions-aggregate.html This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
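The distinction maropu raises — count(*) versus count(expr) versus count(DISTINCT expr) — comes down to null handling, which a small pure-Python sketch (an illustration, not Spark code) makes concrete:

```python
values = [1, None, 2, 2, None]

# count(*): every row, nulls included
count_star = len(values)

# count(expr): only rows where the expression is not null
count_expr = sum(1 for v in values if v is not None)

# count(DISTINCT expr): unique non-null values
count_distinct = len({v for v in values if v is not None})

print(count_star, count_expr, count_distinct)  # 5 3 2
```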
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404540923 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## (same diff context as quoted above)
[GitHub] [spark] maropu commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
maropu commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610177594 cc: @srowen @viirya This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610177067 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120896/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610177064 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
SparkQA removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174263 **[Test build #120896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120896/testReport)** for PR 28121 at commit [`8be6c4a`](https://github.com/apache/spark/commit/8be6c4a12ffb783d938d91b69d6ddd1c191af75d). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
AmplabJenkins commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610176890 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
SparkQA commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610176880 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/25586/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
AmplabJenkins removed a comment on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610176890 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
SparkQA commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610176994 **[Test build #120896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120896/testReport)** for PR 28121 at commit [`8be6c4a`](https://github.com/apache/spark/commit/8be6c4a12ffb783d938d91b69d6ddd1c191af75d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
AmplabJenkins removed a comment on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610176900 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25586/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S)
AmplabJenkins commented on issue #28141: [SPARK-31092][SPARK-31109][SPARK-31118][3.0] Backport version for resource managers(Yarn, Mesos, K8S) URL: https://github.com/apache/spark/pull/28141#issuecomment-610176900 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25586/ Test PASSed.
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404537908 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double + Returns Pearson coefficient of correlation between a set of number pairs. 
+ + + count([DISTINCT] {* | expression1[, expression2]}) + none; any Review comment: for `*`, the data type is `none`, for `expression1` and `expression2` data type, it is `any`. is the following better to understand? `none, any, any`
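The `count` variants debated above differ only in null handling. As a minimal sketch of the documented semantics in plain Python (not Spark; the sample rows are hypothetical, with `None` standing in for SQL NULL):

```python
# Emulate the three Spark SQL count variants over a list of rows.
rows = [(1, "a"), (None, "b"), (1, "a"), (2, None)]

# count(*): every row, including rows that contain NULLs.
count_star = len(rows)

# count(c1, c2): rows where all supplied expressions are non-null.
count_expr = sum(1 for r in rows if all(v is not None for v in r))

# count(DISTINCT c1, c2): unique, fully non-null rows.
count_distinct = len({r for r in rows if all(v is not None for v in r)})

print(count_star, count_expr, count_distinct)  # → 4 2 1
```

This also illustrates the `none; any` notation under discussion: `*` takes no typed argument, while each `expressionN` may be of any type.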
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404537440 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,628 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a column name instead of Column. + +* Table of contents +{:toc} + + +FunctionParametersDescription + + + + {any | some | bool_or}(c: Column) + Column name + Returns true if at least one value is true + + + approx_count_distinct(c: Column[, relativeSD: Double]]) + Column name; relativeSD: the maximum estimation error allowed. + Returns the estimated cardinality by HyperLogLog++ + + + {avg | mean}(c: Column) + Column name + Returns the average of values in the input column. + + + {bool_and | every}(c: Column) + Column name + Returns true if all values are true + + + collect_list(c: Column) + Column name + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle + + + collect_set(c: Column) + Column name + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. 
+ + + corr(c1: Column, c2: Column) + Column name + Returns Pearson coefficient of correlation between a set of number pairs + + + count(*) + None + Returns the total number of retrieved rows, including rows containing null + + + count(c: Column[, c: Column]) + Column name + Returns the number of rows for which the supplied column(s) are all not null + + + count(DISTINCT c: Column[, c: Column]) + Column name + Returns the number of rows for which the supplied column(s) are unique and not null + + + count_if(Predicate) + Expression that will be used for aggregation calculation + Returns the count number from the predicate evaluate to TRUE values + + + count_min_sketch(c: Column, eps: double, confidence: double, seed integer) +Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer +Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.. + + + covar_pop(c1: Column, c2: Column) + Column name + Returns the population covariance of a set of number pairs + + + covar_samp(c1: Column, c2: Column) + Column name + Returns the sample covariance of a set of number pairs + + + {first | first_value}(c: Column[, isIgnoreNull]) Review comment: btw, I think its better to use the same type names here with https://github.com/apache/spark/blob/master/docs/sql-ref-datatypes.md This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404537320 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double + Returns Pearson coefficient of correlation between a set of number pairs. 
+ + + count([DISTINCT] {* | expression1[, expression2]}) + none; any + If specified DISTINCT, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null. + + + count_if(predicate) + expression that will be used for aggregation calculation + Returns the count number from the predicate evaluate to `TRUE` values. + + + count_min_sketch(expression, eps, confidence, seed) + integral or string or binary, double, double, integer + Eps and confidence are the double values between 0.0 and 1.0, seed is a positive integer. Returns a count-min sketch of a expression with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space. + + + covar_pop(expression1, expression2) + double, double + Returns the population covariance of a set of number pairs. + + + covar_samp(expression1, expression2) + double + Returns the sample covariance of a set of number pairs. + + + {first | first_value}(expression[, isIgnoreNull]) + any, boolean + Returns the first value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values, default is false. This function is non-deterministic. + + + kurtosis(expression) + double + Returns the kurtosis value calculated from values of a group. + + + {last | last_value}(expression[, isIgnoreNull]) + any, boolean + Returns the last value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values, default is false. This function is non-deterministic. + + + max(expression) + any numeric, string, date/time or arrays of these types + Returns the maximum value of the expression. 
+ + + max_by(expression1, expression2) + any numeric, string, date/time or arrays of these types + Returns the value of expression1 associated with the maximum value of expression2. + + + min(expression) + any numeric, string, date/time or arrays of these types + Returns the minimum value of the expression. + + + min_by(expression1, expression2) + any numeric, string, date/time or arrays of these types + Returns the value of expression1 associated with the minimum value of expression2. + + + percentile(expression, percentage [, frequency]) + numeric Type, double, integral type
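The `first`/`last` entries quoted above hinge on the optional `isIgnoreNull` flag. A plain-Python sketch of that behavior (hypothetical helper names, not Spark's API; in Spark the result additionally depends on row order, which may be non-deterministic after a shuffle):

```python
def first_value(values, ignore_nulls=False):
    # Return the first element; with ignore_nulls, skip None (SQL NULL) entries.
    for v in values:
        if not ignore_nulls or v is not None:
            return v
    return None

def last_value(values, ignore_nulls=False):
    # Same semantics scanned from the end of the group.
    return first_value(list(reversed(values)), ignore_nulls)

print(first_value([None, 3, 5]))        # → None (default keeps NULLs)
print(first_value([None, 3, 5], True))  # → 3
print(last_value([3, 5, None], True))   # → 5
```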
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536581 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double + Returns Pearson coefficient of correlation between a set of number pairs. 
+ + + count([DISTINCT] {* | expression1[, expression2]}) + none; any + If specified DISTINCT, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null. + + + count_if(predicate) + expression that will be used for aggregation calculation + Returns the count number from the predicate evaluate to `TRUE` values. + + + count_min_sketch(expression, eps, confidence, seed) + integral or string or binary, double, double, integer + Eps and confidence are the double values between 0.0 and 1.0, seed is a positive integer. Returns a count-min sketch of a expression with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space. + + + covar_pop(expression1, expression2) + double, double Review comment: `(double, double)`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
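The `covar_pop`/`covar_samp` rows being reviewed differ only in the divisor. A small plain-Python sketch of the two formulas (illustrative functions, not Spark's implementation):

```python
def covar_pop(xs, ys):
    # Population covariance: sum of cross-deviations divided by n.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def covar_samp(xs, ys):
    # Sample covariance: divide by n - 1 (Bessel's correction).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

print(covar_pop([1, 2, 3], [2, 4, 6]))   # → 1.333...
print(covar_samp([1, 2, 3], [2, 4, 6]))  # → 2.0
```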
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536482 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double + Returns Pearson coefficient of correlation between a set of number pairs. + + + count([DISTINCT] {* | expression1[, expression2]}) + none; any Review comment: What does `none; any` mean? 
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536420 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double Review comment: `(double, double)`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
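For the `corr` entry discussed above ("Pearson coefficient of correlation"), the computation can be sketched in plain Python (an illustrative function, not Spark's code):

```python
from math import sqrt

def corr(xs, ys):
    # Pearson correlation: covariance normalized by the two standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(corr([1, 2, 3], [2, 4, 6]))  # → 1.0 (perfect positive correlation)
print(corr([1, 2, 3], [6, 4, 2]))  # → -1.0 (perfect negative correlation)
```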
[GitHub] [spark] AmplabJenkins removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25588/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/25588/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174696 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
AmplabJenkins removed a comment on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174696 Merged build finished. Test PASSed.
[GitHub] [spark] kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
kevinyu98 commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536117 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@ license: | limitations under the License. --- -Aggregate functions \ No newline at end of file +Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions +operate on a group of rows and return a single value. + +Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions. + +**Note:** All functions below have another signature which takes String as a expression. + + + +FunctionParameter Type(s)Description + + + + {any | some | bool_or}(expression) + boolean + Returns true if at least one value is true. + + + approx_count_distinct(expression[, relativeSD]) + (long, double) + RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. + + + {avg | mean}(expression) + numeric or string + Returns the average of values in the input expression. + + + {bool_and | every}(expression) + boolean + Returns true if all values are true. + + + collect_list(expression) + any + Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + collect_set(expression) + any + Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle. + + + corr(expression1, expression2) + double, double + Returns Pearson coefficient of correlation between a set of number pairs. 
+ + + count([DISTINCT] {* | expression1[, expression2]}) + none; any + If specified DISTINCT, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null. + + + count_if(predicate) + expression that will be used for aggregation calculation + Returns the count number from the predicate evaluate to `TRUE` values. + + + count_min_sketch(expression, eps, confidence, seed) + integral or string or binary, double, double, integer + Eps and confidence are the double values between 0.0 and 1.0, seed is a positive integer. Returns a count-min sketch of a expression with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space. + + + covar_pop(expression1, expression2) + double, double + Returns the population covariance of a set of number pairs. + + + covar_samp(expression1, expression2) + double + Returns the sample covariance of a set of number pairs. + + + {first | first_value}(expression[, isIgnoreNull]) + any, boolean + Returns the first value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values, default is false. This function is non-deterministic. + + + kurtosis(expression) + double + Returns the kurtosis value calculated from values of a group. + + + {last | last_value}(expression[, isIgnoreNull]) + any, boolean + Returns the last value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values, default is false. This function is non-deterministic. + + + max(expression) + any numeric, string, date/time or arrays of these types + Returns the maximum value of the expression. 
+ + + max_by(expression1, expression2) + any numeric, string, date/time or arrays of these types + Returns the value of expression1 associated with the maximum value of expression2. + + + min(expression) + any numeric, string, date/time or arrays of these types + Returns the minimum value of the expression. + + + min_by(expression1, expression2) + any numeric, string, date/time or arrays of these types + Returns the value of expression1 associated with the minimum value of expression2. + + + percentile(expression, percentage [, frequency]) + numeric Type, double, integral type
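The `count_min_sketch` description quoted above ("a probabilistic data structure used for cardinality estimation using sub-linear space") can be made concrete with a minimal plain-Python sketch. This class is hypothetical, for illustration only, and is not Spark's `CountMinSketch` (which is parameterized by eps/confidence and serializes to bytes):

```python
import hashlib

class CountMinSketch:
    # Minimal count-min sketch: depth hash rows of width counters each.
    # Estimates are upper bounds; hash collisions can only inflate counts.
    def __init__(self, width=1000, depth=5, seed=42):
        self.width, self.depth, self.seed = width, depth, seed
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive a per-row bucket from a seeded hash of the item.
        h = hashlib.sha256(f"{self.seed}:{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Minimum over rows bounds the inflation from collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

With only one distinct item added, the estimate is exact; with many items, the width/depth (derived from eps and confidence in Spark) bound the overestimation.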
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536057

## File path: docs/sql-ref-functions-builtin-aggregate.md

## @@ -19,4 +19,616 @@ license: | limitations under the License. ---

-Aggregate functions \ No newline at end of file

+Spark SQL provides built-in aggregate functions defined in the Dataset API and the SQL interface. Aggregate functions operate on a group of rows and return a single value.

+Spark SQL aggregate functions are grouped as agg_funcs in Spark SQL. Below is the list of functions.

+**Note:** All functions below have another signature which takes a String as the expression.

| Function | Parameter Type(s) | Description |
| --- | --- | --- |
| {any \| some \| bool_or}(expression) | boolean | Returns true if at least one value is true. |
| approx_count_distinct(expression[, relativeSD]) | (long, double) | relativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++. |
| {avg \| mean}(expression) | numeric or string | Returns the average of the values in the input expression. |
| {bool_and \| every}(expression) | boolean | Returns true if all values are true. |
| collect_list(expression) | any | Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle. |
| collect_set(expression) | any | Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle. |
| corr(expression1, expression2) | double, double | Returns the Pearson correlation coefficient between a set of number pairs. |
| count([DISTINCT] {* \| expression1[, expression2]}) | none; any | If DISTINCT is specified, returns the number of rows for which the supplied expression(s) are unique and not null; if `*` is specified, returns the total number of retrieved rows, including rows containing null; otherwise, returns the number of rows for which the supplied expression(s) are all not null. |
| count_if(predicate) | boolean expression | Returns the number of rows for which the predicate evaluates to `TRUE`. |
| count_min_sketch(expression, eps, confidence, seed) | integral, string, or binary; double; double; integer | eps and confidence are double values between 0.0 and 1.0, and seed is a positive integer. Returns a count-min sketch of the expression with the given eps, confidence, and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. A count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space. |
| covar_pop(expression1, expression2) | double, double | Returns the population covariance of a set of number pairs. |
| covar_samp(expression1, expression2) | double, double | Returns the sample covariance of a set of number pairs. |
| {first \| first_value}(expression[, isIgnoreNull]) | any, boolean | Returns the first value of the expression for a group of rows. If isIgnoreNull is true, returns only non-null values; the default is false. This function is non-deterministic. |
| kurtosis(expression) | double | Returns the kurtosis value calculated from the values of a group. |
| {last \| last_value}(expression[, isIgnoreNull]) | any, boolean | Returns the last value of the expression for a group of rows. If isIgnoreNull is true, returns only non-null values; the default is false. This function is non-deterministic. |
| max(expression) | any numeric, string, date/time, or arrays of these types | Returns the maximum value of the expression. |
| max_by(expression1, expression2) | any numeric, string, date/time, or arrays of these types | Returns the value of expression1 associated with the maximum value of expression2. |
| min(expression) | any numeric, string, date/time, or arrays of these types | Returns the minimum value of the expression. |
| min_by(expression1, expression2) | any numeric, string, date/time, or arrays of these types | Returns the value of expression1 associated with the minimum value of expression2. |
| percentile(expression, percentage [, frequency]) | numeric type, double, integral type | |
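The covariance and correlation entries quoted above differ only in their normalization, which is easy to miss in prose. The following is a minimal pure-Python sketch of the documented semantics only — the function names mirror the SQL functions but the data is made up, and this is not Spark's actual implementation:

```python
# Illustrative sketch of the covar_pop / covar_samp / corr semantics
# described in the quoted documentation. Not Spark code.
from math import sqrt

def covar_pop(xs, ys):
    """Population covariance: divide by n."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def covar_samp(xs, ys):
    """Sample covariance: divide by n - 1 (Bessel's correction)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def corr(xs, ys):
    """Pearson correlation: population covariance over the product of
    the population standard deviations."""
    sx = sqrt(covar_pop(xs, xs))
    sy = sqrt(covar_pop(ys, ys))
    return covar_pop(xs, ys) / (sx * sy)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(covar_pop(xs, ys))   # 2.5
print(covar_samp(xs, ys))  # 10/3 — larger, because the divisor is n - 1
print(corr(xs, ys))        # 1.0 — the pairs are perfectly linear
```

The only difference between the two covariances is the divisor (n versus n − 1); corr normalizes the population covariance by both standard deviations, so perfectly linear pairs yield exactly 1.0.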
[GitHub] [spark] SparkQA commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference
SparkQA commented on issue #28121: [SPARK-31348][SQL][DOCS] Document Join in SQL Reference URL: https://github.com/apache/spark/pull/28121#issuecomment-610174263 **[Test build #120896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120896/testReport)** for PR 28121 at commit [`8be6c4a`](https://github.com/apache/spark/commit/8be6c4a12ffb783d938d91b69d6ddd1c191af75d). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404536057 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@
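The count_min_sketch function discussed in this review thread is the least self-explanatory entry in the table. As a rough, hypothetical illustration of how such a structure works — using Python's built-in hashing rather than Spark's hash family, with no serialization, and following the standard width = ceil(e/eps), depth = ceil(ln(1/(1 − confidence))) construction; Spark's real implementation is `org.apache.spark.util.sketch.CountMinSketch`:

```python
# Simplified count-min sketch, for illustration only. Spark's actual
# CountMinSketch serializes to bytes and uses its own hash functions.
import math
import random

class CountMinSketch:
    def __init__(self, eps, confidence, seed):
        # eps bounds the overestimation error; confidence bounds the
        # probability that the error bound holds.
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / (1.0 - confidence)))
        rng = random.Random(seed)
        # one independent hash seed per row of the table
        self.seeds = [rng.randrange(1 << 31) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, item):
        return hash((self.seeds[row], item)) % self.width

    def add(self, item):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += 1

    def estimate(self, item):
        # point query: minimum over rows; may overestimate due to hash
        # collisions, but never underestimates the true count
        return min(self.table[r][self._index(r, item)]
                   for r in range(self.depth))
```

Because each row only ever adds to a counter, the minimum over rows is an upper-biased estimate: the structure can overcount on collisions but never undercounts, which is why the documentation calls it probabilistic and sub-linear in space.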
[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference URL: https://github.com/apache/spark/pull/28120#discussion_r404535304 ## File path: docs/sql-ref-functions-builtin-aggregate.md ## @@ -19,4 +19,616 @@
[GitHub] [spark] AmplabJenkins removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference
AmplabJenkins removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference URL: https://github.com/apache/spark/pull/28139#issuecomment-610172951 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/120895/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference
AmplabJenkins removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference URL: https://github.com/apache/spark/pull/28139#issuecomment-610172945 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference
SparkQA removed a comment on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference URL: https://github.com/apache/spark/pull/28139#issuecomment-610170265 **[Test build #120895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120895/testReport)** for PR 28139 at commit [`1cce7c8`](https://github.com/apache/spark/commit/1cce7c8e2c2bd184acefdb05d7ffad739dbb571a).
[GitHub] [spark] AmplabJenkins commented on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference
AmplabJenkins commented on issue #28139: [SPARK-31362][SQL][DOCS] Document Set Operators in SQL Reference URL: https://github.com/apache/spark/pull/28139#issuecomment-610172945 Merged build finished. Test PASSed.