[GitHub] [spark] HyukjinKwon commented on pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
HyukjinKwon commented on PR #37005: URL: https://github.com/apache/spark/pull/37005#issuecomment-1178692367 Feel free to make a followup or a separate PR with a separate JIRA 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
HyukjinKwon commented on code in PR #37005: URL: https://github.com/apache/spark/pull/37005#discussion_r916573997 ## .github/workflows/build_and_test.yml: ## @@ -251,13 +251,73 @@ jobs: name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
HyukjinKwon commented on code in PR #37005: URL: https://github.com/apache/spark/pull/37005#discussion_r916573418 ## .github/workflows/build_and_test.yml: ## @@ -251,13 +251,73 @@ jobs: name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
HyukjinKwon commented on code in PR #37005: URL: https://github.com/apache/spark/pull/37005#discussion_r916573072 ## .github/workflows/build_and_test.yml: ## @@ -251,13 +251,73 @@ jobs: name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}

[GitHub] [spark] cloud-fan commented on a diff in pull request #36773: [SPARK-39385][SQL] Translate linear regression aggregate functions for pushdown

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #36773: URL: https://github.com/apache/spark/pull/36773#discussion_r916571989 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -1685,6 +1709,42 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with Exp

[GitHub] [spark] cloud-fan commented on a diff in pull request #36995: [SPARK-39607][SQL][DSV2] Distribution and ordering support V2 function in writing

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #36995: URL: https://github.com/apache/spark/pull/36995#discussion_r916570479 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala: ## @@ -53,16 +63,27 @@ object DistributionAndOrderingUtils

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36773: [SPARK-39385][SQL] Translate linear regression aggregate functions for pushdown

2022-07-08 Thread GitBox
HyukjinKwon commented on code in PR #36773: URL: https://github.com/apache/spark/pull/36773#discussion_r916569488 ## sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala: ## @@ -1685,6 +1709,42 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with E

[GitHub] [spark] cloud-fan commented on a diff in pull request #36995: [SPARK-39607][SQL][DSV2] Distribution and ordering support V2 function in writing

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #36995: URL: https://github.com/apache/spark/pull/36995#discussion_r916568814 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala: ## @@ -143,4 +150,54 @@ object V2ExpressionUtils extends SQLConfHelpe

[GitHub] [spark] cloud-fan commented on a diff in pull request #36995: [SPARK-39607][SQL][DSV2] Distribution and ordering support V2 function in writing

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #36995: URL: https://github.com/apache/spark/pull/36995#discussion_r916568578 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala: ## @@ -143,4 +150,54 @@ object V2ExpressionUtils extends SQLConfHelpe

[GitHub] [spark] ulysses-you opened a new pull request, #37129: [SPARK-39710][SQL] Support push local topK through outer join

2022-07-08 Thread GitBox
ulysses-you opened a new pull request, #37129: URL: https://github.com/apache/spark/pull/37129 ### What changes were proposed in this pull request? - Pull out the pattern of `TakeOrderedAndProjectExec` to `ExtractTopK` - Add a new rule `PushLocalTopKThroughOuterJoin` which m

[GitHub] [spark] Yikun commented on pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
Yikun commented on PR #37005: URL: https://github.com/apache/spark/pull/37005#issuecomment-1178676649 > Let's get this in and see how it goes. Sure, I will monitor it over the next few days and update the results here: Case 1: The developer pushes a commit: Case 2: Merge a commit in apache/spark

[GitHub] [spark] HyukjinKwon commented on pull request #37128: What do fit in BucketedRandomProjectionLSH in spark?

2022-07-08 Thread GitBox
HyukjinKwon commented on PR #37128: URL: https://github.com/apache/spark/pull/37128#issuecomment-1178673181 @MammadTavakoli Let's either file a JIRA in https://issues.apache.org/jira/projects/SPARK/issues or ask u...@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #37128: What do fit in BucketedRandomProjectionLSH in spark?

2022-07-08 Thread GitBox
HyukjinKwon closed pull request #37128: What do fit in BucketedRandomProjectionLSH in spark? URL: https://github.com/apache/spark/pull/37128

[GitHub] [spark] cloud-fan commented on a diff in pull request #36995: [SPARK-39607][SQL][DSV2] Distribution and ordering support V2 function in writing

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #36995: URL: https://github.com/apache/spark/pull/36995#discussion_r916557149 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala: ## @@ -143,4 +150,54 @@ object V2ExpressionUtils extends SQLConfHelpe

[GitHub] [spark] cloud-fan commented on a diff in pull request #36995: [SPARK-39607][SQL][DSV2] Distribution and ordering support V2 function in writing

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #36995: URL: https://github.com/apache/spark/pull/36995#discussion_r916556945 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala: ## @@ -17,22 +17,32 @@ package org.apache.spark.sql.exe

[GitHub] [spark] HyukjinKwon closed pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
HyukjinKwon closed pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job URL: https://github.com/apache/spark/pull/37005

[GitHub] [spark] HyukjinKwon commented on pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
HyukjinKwon commented on PR #37005: URL: https://github.com/apache/spark/pull/37005#issuecomment-1178669259 Merged to master.

[GitHub] [spark] HyukjinKwon commented on pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
HyukjinKwon commented on PR #37005: URL: https://github.com/apache/spark/pull/37005#issuecomment-1178669043 Let's get this in and see how it goes.

[GitHub] [spark] cloud-fan commented on a diff in pull request #37113: [WIP] Supports url encode/decode function

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #37113: URL: https://github.com/apache/spark/pull/37113#discussion_r916553050 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpression.scala: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (

[GitHub] [spark] HyukjinKwon commented on pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
HyukjinKwon commented on PR #37005: URL: https://github.com/apache/spark/pull/37005#issuecomment-1178666725 Yes, all other committers have the same permissions as I do.

[GitHub] [spark] lsm1 commented on a diff in pull request #37014: [SPARK-39624][SQL] Support coalesce partition through CartesianProduct

2022-07-08 Thread GitBox
lsm1 commented on code in PR #37014: URL: https://github.com/apache/spark/pull/37014#discussion_r916551528 ## sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala: ## @@ -2602,6 +2602,59 @@ class AdaptiveQueryExecSuite assert(findTo

[GitHub] [spark] cloud-fan commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #37040: URL: https://github.com/apache/spark/pull/37040#discussion_r916548778 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala: ## @@ -470,6 +470,52 @@ object BooleanSimplification extends Rule[LogicalPlan

[GitHub] [spark] cloud-fan commented on a diff in pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #37040: URL: https://github.com/apache/spark/pull/37040#discussion_r916547438 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala: ## @@ -470,6 +470,52 @@ object BooleanSimplification extends Rule[LogicalPlan

[GitHub] [spark] beliefer commented on pull request #37047: [SPARK-39627][SQL] JDBC V2 pushdown should unify the compile API

2022-07-08 Thread GitBox
beliefer commented on PR #37047: URL: https://github.com/apache/spark/pull/37047#issuecomment-1178661116 @cloud-fan Thank you very much!

[GitHub] [spark] cloud-fan closed pull request #37047: [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

2022-07-08 Thread GitBox
cloud-fan closed pull request #37047: [SPARK-39627][SQL] DS V2 pushdown should unify the compile API URL: https://github.com/apache/spark/pull/37047

[GitHub] [spark] cloud-fan commented on pull request #37047: [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

2022-07-08 Thread GitBox
cloud-fan commented on PR #37047: URL: https://github.com/apache/spark/pull/37047#issuecomment-1178659839 thanks, merging to master!

[GitHub] [spark] Yikf commented on pull request #37113: [WIP] Supports url encode/decode function

2022-07-08 Thread GitBox
Yikf commented on PR #37113: URL: https://github.com/apache/spark/pull/37113#issuecomment-1178657373 > do other databases provide these functions? Both Trino & Presto have URL-related functions, including url_decode/url_encode, reference doc: https://trino.io/docs/current/functio

[GitHub] [spark] Yikf commented on pull request #37113: [WIP] Supports url encode/decode function

2022-07-08 Thread GitBox
Yikf commented on PR #37113: URL: https://github.com/apache/spark/pull/37113#issuecomment-1178657122 > Both Trino & Presto have URL-related functions, including url_decode/url_encode, reference doc: https://trino.io/docs/current/functions/url.html. However, I haven't found similar fu
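
For readers unfamiliar with the functions being discussed in this thread, here is a minimal sketch of the encode/decode round trip that `url_encode`/`url_decode`-style functions typically perform. It uses Python's standard `urllib.parse` rather than Spark or Trino, and the helper names `url_encode` and `url_decode` are hypothetical wrappers chosen to mirror the thread. The exact escaping rules (for example, whether a space becomes `+` or `%20`) are an assumption here and may differ from what PR #37113 implements.

```python
from urllib.parse import quote_plus, unquote_plus


def url_encode(value: str) -> str:
    """Hypothetical helper: escape a string in form-urlencoded style.

    Assumption: spaces become '+' and reserved characters are
    percent-encoded as UTF-8 bytes. This is Python's behaviour,
    not necessarily Spark's or Trino's.
    """
    return quote_plus(value)


def url_decode(value: str) -> str:
    """Hypothetical helper: reverse url_encode."""
    return unquote_plus(value)


original = "https://spark.apache.org/?q=a b&lang=en"
encoded = url_encode(original)
decoded = url_decode(encoded)

print(encoded)
print(decoded == original)  # the round trip should preserve the input
```

The property worth checking in any such pair of functions is that decoding an encoded value returns the original string.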

[GitHub] [spark] cloud-fan commented on pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables

2022-07-08 Thread GitBox
cloud-fan commented on PR #37083: URL: https://github.com/apache/spark/pull/37083#issuecomment-1178656294 Maybe we should name them `BasicStatesPlanVisitor` and `BasicAndColumnStatsPlanVisitor`. We also need to make sure the updated `SizeInBytesOnlyStatsPlanVisitor` can propagate row count

[GitHub] [spark] cloud-fan commented on pull request #37048: [SPARK-39655][SQL] Add a config to limit CartesianProductExec's partition number

2022-07-08 Thread GitBox
cloud-fan commented on PR #37048: URL: https://github.com/apache/spark/pull/37048#issuecomment-1178653885 This reminds me of `CheckCartesianProducts`. In general, it's a bit hard to predict bad queries and fail earlier. Will https://github.com/apache/spark/pull/37014 solve your issue?

[GitHub] [spark] MammadTavakoli opened a new pull request, #37128: What do fit in BucketedRandomProjectionLSH in spark?

2022-07-08 Thread GitBox
MammadTavakoli opened a new pull request, #37128: URL: https://github.com/apache/spark/pull/37128 In Spark there is an `LSH` function used for KNN or similarity search: `BucketedRandomProjectionLSH`. It is used as follows: ``` from pyspark.ml.feature import BucketedRandomP

[GitHub] [spark] cloud-fan commented on a diff in pull request #37080: [SPARK-35208][SQL][DOCS] Add docs for LATERAL subqueries

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #37080: URL: https://github.com/apache/spark/pull/37080#discussion_r916534506 ## docs/sql-ref-syntax-qry-select-lateral-subquery.md: ## @@ -0,0 +1,87 @@ +--- +layout: global +title: LATERAL SUBQUERY +displayTitle: LATERAL SUBQUERY +license: | +

[GitHub] [spark] singhpk234 commented on pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables

2022-07-08 Thread GitBox
singhpk234 commented on PR #37083: URL: https://github.com/apache/spark/pull/37083#issuecomment-1178646200 > After this PR, what's the difference between SizeInBytesOnlyStatsPlanVisitor and BasicStatsPlanVisitor BasicStatsPlanVisitor additionally has columnStats such as (NDV /

[GitHub] [spark] cloud-fan commented on a diff in pull request #37080: [SPARK-35208][SQL][DOCS] Add docs for LATERAL subqueries

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #37080: URL: https://github.com/apache/spark/pull/37080#discussion_r916533538 ## docs/sql-ref-syntax-qry-select-lateral-subquery.md: ## @@ -0,0 +1,87 @@ +--- +layout: global +title: LATERAL SUBQUERY +displayTitle: LATERAL SUBQUERY +license: | +

[GitHub] [spark] beliefer commented on pull request #37040: [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic

2022-07-08 Thread GitBox
beliefer commented on PR #37040: URL: https://github.com/apache/spark/pull/37040#issuecomment-1178644252 > Can we add a new rule `OptimizeRand` for this optimization? Basically it turns rand predicates to true or false literals. OK
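
The idea quoted above, turning predicates on `rand` into true or false literals, can be sketched outside Spark. The following is a minimal illustration and not the PR's code: the function `fold_rand_comparison` is hypothetical, and it assumes `rand()` yields values uniformly in the half-open interval [0, 1), so a comparison against a literal outside that range has a fixed outcome.

```python
from typing import Optional


def fold_rand_comparison(op: str, literal: float) -> Optional[bool]:
    """Fold `rand() <op> literal` to a constant when the result is certain.

    Assumption: rand() is drawn from [0, 1). Returns None when the
    predicate really depends on the random draw and must be kept.
    """
    if op == "<":
        if literal >= 1.0:
            return True   # every draw is below 1
        if literal <= 0.0:
            return False  # no draw is below 0
    elif op == "<=":
        if literal >= 1.0:
            return True
        if literal < 0.0:
            return False  # a draw of exactly 0 is possible, so 0 is excluded
    elif op == ">":
        if literal < 0.0:
            return True
        if literal >= 1.0:
            return False
    elif op == ">=":
        if literal <= 0.0:
            return True
        if literal >= 1.0:
            return False
    else:
        raise ValueError(f"unsupported operator: {op}")
    return None  # outcome depends on the draw; leave the predicate alone


print(fold_rand_comparison("<", 1.5))   # always true
print(fold_rand_comparison(">", 2.0))   # always false
print(fold_rand_comparison("<", 0.5))   # not foldable
```

Returning `None` for the undecidable case keeps the sketch conservative: a filter is only replaced by a literal when no possible draw could change its result.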

[GitHub] [spark] LuciferYang commented on a diff in pull request #37069: [SPARK-39667][SQL] Add another workaround when there is not enough memory to build and broadcast the table

2022-07-08 Thread GitBox
LuciferYang commented on code in PR #37069: URL: https://github.com/apache/spark/pull/37069#discussion_r916527812 ## sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala: ## @@ -179,8 +179,9 @@ case class BroadcastExchangeExec(

[GitHub] [spark] Yikun commented on pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
Yikun commented on PR #37005: URL: https://github.com/apache/spark/pull/37005#issuecomment-1178638016 Do all Spark committers have the right to push the apache ghcr package? According to [cache image](https://github.com/apache/spark/actions/workflows/build_infra_images_cache.yml), I can confirm @

[GitHub] [spark] cloud-fan commented on a diff in pull request #37069: [SPARK-39667][SQL] Add another workaround when there is not enough memory to build and broadcast the table

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #37069: URL: https://github.com/apache/spark/pull/37069#discussion_r916526901 ## sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala: ## @@ -179,8 +179,9 @@ case class BroadcastExchangeExec( /

[GitHub] [spark] Yikun commented on a diff in pull request #37005: [SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job

2022-07-08 Thread GitBox
Yikun commented on code in PR #37005: URL: https://github.com/apache/spark/pull/37005#discussion_r916427024 ## .github/workflows/build_and_test.yml: ## @@ -251,13 +251,73 @@ jobs: name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ m

[GitHub] [spark] cloud-fan commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r916523583 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala: ## @@ -148,7 +148,28 @@ class CSVOptions( // A language tag in IETF BCP 47 format

[GitHub] [spark] cloud-fan commented on a diff in pull request #36871: [SPARK-39469][SQL] Infer date type for CSV schema inference

2022-07-08 Thread GitBox
cloud-fan commented on code in PR #36871: URL: https://github.com/apache/spark/pull/36871#discussion_r916522350 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala: ## @@ -117,7 +123,10 @@ class CSVInferSchema(val options: CSVOptions) extends S

[GitHub] [spark] panbingkun commented on a diff in pull request #37112: [SPARK-39704][SQL] Implement createIndex & dropIndex & indexExists in JDBC (H2 dialect)

2022-07-08 Thread GitBox
panbingkun commented on code in PR #37112: URL: https://github.com/apache/spark/pull/37112#discussion_r916521562 ## sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala: ## @@ -103,6 +106,40 @@ private[sql] object H2Dialect extends JdbcDialect { functionMap.cle
