[jira] [Commented] (SPARK-40895) Upgrade Arrow to 10.0.0

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622960#comment-17622960
 ] 

Apache Spark commented on SPARK-40895:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38369

> Upgrade Arrow to 10.0.0
> ---
>
> Key: SPARK-40895
> URL: https://issues.apache.org/jira/browse/SPARK-40895
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Assigned] (SPARK-40895) Upgrade Arrow to 10.0.0

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40895:


Assignee: (was: Apache Spark)

> Upgrade Arrow to 10.0.0
> ---
>
> Key: SPARK-40895
> URL: https://issues.apache.org/jira/browse/SPARK-40895
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Commented] (SPARK-40895) Upgrade Arrow to 10.0.0

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622959#comment-17622959
 ] 

Apache Spark commented on SPARK-40895:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38369

> Upgrade Arrow to 10.0.0
> ---
>
> Key: SPARK-40895
> URL: https://issues.apache.org/jira/browse/SPARK-40895
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Assigned] (SPARK-40895) Upgrade Arrow to 10.0.0

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40895:


Assignee: Apache Spark

> Upgrade Arrow to 10.0.0
> ---
>
> Key: SPARK-40895
> URL: https://issues.apache.org/jira/browse/SPARK-40895
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-40895) Upgrade Arrow to 10.0.0

2022-10-23 Thread Yang Jie (Jira)
Yang Jie created SPARK-40895:


 Summary: Upgrade Arrow to 10.0.0
 Key: SPARK-40895
 URL: https://issues.apache.org/jira/browse/SPARK-40895
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Yang Jie









[jira] [Created] (SPARK-40894) Introduce error sub-classes of the operation not allowed parse error

2022-10-23 Thread Max Gekk (Jira)
Max Gekk created SPARK-40894:


 Summary: Introduce error sub-classes of the operation not allowed 
parse error
 Key: SPARK-40894
 URL: https://issues.apache.org/jira/browse/SPARK-40894
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk


Add an error class for operationNotAllowed(), with sub-classes, instead of passing 
raw error text. For instance, at
https://github.com/apache/spark/blob/933dc0c42f0caf74aaa077fd4f2c2e7208452b9b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L3042
pass an error sub-class plus message parameters instead of the text "Column 
ordering must be ASC, was ..."
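To make the idea concrete, here is a minimal, self-contained sketch of the 
pattern; the error-class and parameter names below are hypothetical placeholders, 
not the identifiers this ticket would actually introduce:
{code:java}
// Hypothetical stand-in for Spark's error-class machinery; illustration only.
case class ParseErrorInfo(
    errorClass: String,              // placeholder, e.g. "OPERATION_NOT_ALLOWED"
    subClass: String,                // placeholder, e.g. "INVALID_COLUMN_ORDERING"
    messageParameters: Map[String, String])

def operationNotAllowed(info: ParseErrorInfo): Nothing = {
  // A real implementation would render the message from an error-class registry;
  // here the class, sub-class, and parameters are embedded in the exception.
  val params = info.messageParameters.map { case (k, v) => s"$k=$v" }.mkString(", ")
  throw new IllegalArgumentException(s"[${info.errorClass}.${info.subClass}] $params")
}

// Before: operationNotAllowed(s"Column ordering must be ASC, was '...'", ctx)
// After: pass an error sub-class plus message parameters, no pre-rendered text.
operationNotAllowed(ParseErrorInfo(
  "OPERATION_NOT_ALLOWED",
  "INVALID_COLUMN_ORDERING",
  Map("direction" -> "DESC")))
{code}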






[jira] [Commented] (SPARK-40821) Fix late record filtering to support chaining of stateful operators

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622942#comment-17622942
 ] 

Apache Spark commented on SPARK-40821:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/38368

> Fix late record filtering to support chaining of stateful operators
> ---
>
> Key: SPARK-40821
> URL: https://issues.apache.org/jira/browse/SPARK-40821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Alex Balikov
>Assignee: Alex Balikov
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, chaining of stateful operators in Spark Structured Streaming is not 
> supported for various reasons and is blocked by the unsupported operations 
> check (the spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix 
> this, as chaining of stateful operators is a common streaming scenario, e.g.:
> stream-stream join -> windowed aggregation
> window aggregation -> window aggregation
> etc.
> What is broken:
>  # Every stateful operator performs late record filtering against the global 
> watermark. When chaining stateful operators (e.g. window aggregations), the 
> output produced by the first stateful operator is effectively late against 
> the watermark and is thus filtered out by the next operator's late record 
> filtering. (Technically the next operator should not do late record filtering, 
> but it can be changed to assert, for correctness detection, etc.)
>  # When chaining window aggregations, the first window-aggregating operator 
> produces records with schema { window: { start: Timestamp, end: Timestamp }, 
> agg: Long } - there is no explicit event time in the schema to be used by 
> the next stateful operator (the correct event time should be window.end - 1).
>  # A stream-stream time-interval join can produce late records by its semantics, 
> e.g. if the join condition is
> left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND right.eventTime + 
> INTERVAL 1 HOUR
> then the produced records can be delayed by 1 hour relative to the 
> watermark.
> Proposed fixes:
>  1. (1) can be fixed by performing late record filtering against the previous 
> microbatch watermark instead of the current microbatch watermark.
> 2. (2) can be fixed by allowing the window and session_window functions to work 
> on the window column directly and compute the correct event time 
> transparently to the user. Also, introduce a window_time SQL function to 
> compute the correct event time from the window column. (See the sketch after 
> this description.)
> 3. (3) can be fixed by adding support for per-operator watermarks instead of a 
> single global watermark. In the example of a stream-stream time-interval join 
> followed by a stateful operator, the join operator will 'delay' the 
> downstream operator's watermark by a correct value to handle the delayed 
> records. Only stream-stream time-interval joins will delay the 
> watermark; other operators will not delay downstream watermarks.
>  
>  
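For context, a minimal sketch of the kind of chained query this unblocks, 
assuming an active SparkSession `spark` and the window_time function added by 
this ticket (illustrative only, not the patch itself):
{code:java}
import org.apache.spark.sql.functions._

// Assumes an active SparkSession `spark`; the rate source is just a stand-in.
val events = spark.readStream
  .format("rate").load()                       // columns: timestamp, value
  .withWatermark("timestamp", "10 seconds")

// First stateful operator: counts per 1-minute tumbling window.
val perMinute = events
  .groupBy(window(col("timestamp"), "1 minute"))
  .count()

// Second stateful operator: re-aggregate into 5-minute windows, deriving an
// event time from the window column via window_time (~ window.end - 1).
val perFiveMinutes = perMinute
  .groupBy(window(window_time(col("window")), "5 minutes"))
  .sum("count")
{code}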






[jira] [Commented] (SPARK-40821) Fix late record filtering to support chaining of stateful operators

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622941#comment-17622941
 ] 

Apache Spark commented on SPARK-40821:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/38368

> Fix late record filtering to support chaining of stateful operators
> ---
>
> Key: SPARK-40821
> URL: https://issues.apache.org/jira/browse/SPARK-40821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Alex Balikov
>Assignee: Alex Balikov
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, chaining of stateful operators in Spark Structured Streaming is not 
> supported for various reasons and is blocked by the unsupported operations 
> check (the spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix 
> this, as chaining of stateful operators is a common streaming scenario, e.g.:
> stream-stream join -> windowed aggregation
> window aggregation -> window aggregation
> etc.
> What is broken:
>  # Every stateful operator performs late record filtering against the global 
> watermark. When chaining stateful operators (e.g. window aggregations), the 
> output produced by the first stateful operator is effectively late against 
> the watermark and is thus filtered out by the next operator's late record 
> filtering. (Technically the next operator should not do late record filtering, 
> but it can be changed to assert, for correctness detection, etc.)
>  # When chaining window aggregations, the first window-aggregating operator 
> produces records with schema { window: { start: Timestamp, end: Timestamp }, 
> agg: Long } - there is no explicit event time in the schema to be used by 
> the next stateful operator (the correct event time should be window.end - 1).
>  # A stream-stream time-interval join can produce late records by its semantics, 
> e.g. if the join condition is
> left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND right.eventTime + 
> INTERVAL 1 HOUR
> then the produced records can be delayed by 1 hour relative to the 
> watermark.
> Proposed fixes:
>  1. (1) can be fixed by performing late record filtering against the previous 
> microbatch watermark instead of the current microbatch watermark.
> 2. (2) can be fixed by allowing the window and session_window functions to work 
> on the window column directly and compute the correct event time 
> transparently to the user. Also, introduce a window_time SQL function to 
> compute the correct event time from the window column.
> 3. (3) can be fixed by adding support for per-operator watermarks instead of a 
> single global watermark. In the example of a stream-stream time-interval join 
> followed by a stateful operator, the join operator will 'delay' the 
> downstream operator's watermark by a correct value to handle the delayed 
> records. Only stream-stream time-interval joins will delay the 
> watermark; other operators will not delay downstream watermarks.
>  
>  






[jira] [Resolved] (SPARK-40893) Upgrade to use setup-java v3

2022-10-23 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-40893.
--
Resolution: Duplicate

> Upgrade to use setup-java v3
> 
>
> Key: SPARK-40893
> URL: https://issues.apache.org/jira/browse/SPARK-40893
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Commented] (SPARK-40867) Flaky test ProtobufCatalystDataConversionSuite

2022-10-23 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622934#comment-17622934
 ] 

Yang Jie commented on SPARK-40867:
--

Thanks [~sanysand...@gmail.com] 

> Flaky test ProtobufCatalystDataConversionSuite
> --
>
> Key: SPARK-40867
> URL: https://issues.apache.org/jira/browse/SPARK-40867
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> * 
> [https://github.com/LuciferYang/spark/actions/runs/3295309311/jobs/5433733419]
>  * 
> [https://github.com/LuciferYang/spark/actions/runs/3291252601/jobs/5425183034]
> {code:java}
> [info] ProtobufCatalystDataConversionSuite:
> [info] - single StructType(StructField(int32_type,IntegerType,true)) with 
> seed 167 *** FAILED *** (39 milliseconds)
> [info]   Incorrect evaluation (codegen off): from_protobuf(to_protobuf([0], 
> /home/runner/work/spark/spark/connector/protobuf/target/scala-2.12/test-classes/protobuf/catalyst_types.desc,
>  IntegerMsg), 
> /home/runner/work/spark/spark/connector/protobuf/target/scala-2.12/test-classes/protobuf/catalyst_types.desc,
>  IntegerMsg), actual: [null], expected: [0] (ExpressionEvalHelper.scala:209)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.Assertions.fail(Assertions.scala:933)
> [info]   at org.scalatest.Assertions.fail$(Assertions.scala:929)
> [info]   at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1564)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen(ExpressionEvalHelper.scala:209)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen$(ExpressionEvalHelper.scala:199)
> [info]   at 
> org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.checkEvaluationWithoutCodegen(ProtobufCatalystDataConversionSuite.scala:33)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:87)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82)
> [info]   at 
> org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.checkEvaluation(ProtobufCatalystDataConversionSuite.scala:33)
> [info]   at 
> org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.checkResult(ProtobufCatalystDataConversionSuite.scala:43)
> [info]   at 
> org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:122)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:66)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:66)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info]   at 

[jira] [Resolved] (SPARK-40800) Always inline expressions in OptimizeOneRowRelationSubquery

2022-10-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40800.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38260
[https://github.com/apache/spark/pull/38260]

> Always inline expressions in OptimizeOneRowRelationSubquery
> ---
>
> Key: SPARK-40800
> URL: https://issues.apache.org/jira/browse/SPARK-40800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> SPARK-39699 made `CollapseProject` more conservative. This has impacted 
> correlated subqueries that Spark used to be able to support. For example, a 
> correlated one-row scalar subquery that has a higher-order function:
> {code:java}
> CREATE TEMP VIEW t1 AS SELECT ARRAY('a', 'b') a;
> SELECT (
>   SELECT array_sort(a, (i, j) -> rank[i] - rank[j]) AS sorted
>   FROM (SELECT MAP('a', 1, 'b', 2) rank)
> ) FROM t1{code}
> This will throw an exception after SPARK-39699:
> {code:java}
> Unexpected operator Join Inner
> :- Aggregate [[a,b]], [[a,b] AS a#252]
> :  +- OneRowRelation
> +- Project [map(keys: [a,b], values: [1,2]) AS rank#241]
>    +- OneRowRelation
>  in correlated subquery{code}
> because the projects inside the subquery can no longer be collapsed. We 
> should always inline expressions if possible to avoid adding domain joins and 
> support a wider range of correlated subqueries. 
>  
>  






[jira] [Assigned] (SPARK-40800) Always inline expressions in OptimizeOneRowRelationSubquery

2022-10-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40800:
---

Assignee: Allison Wang

> Always inline expressions in OptimizeOneRowRelationSubquery
> ---
>
> Key: SPARK-40800
> URL: https://issues.apache.org/jira/browse/SPARK-40800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> SPARK-39699 made `CollapseProject` more conservative. This has impacted 
> correlated subqueries that Spark used to be able to support. For example, a 
> correlated one-row scalar subquery that has a higher-order function:
> {code:java}
> CREATE TEMP VIEW t1 AS SELECT ARRAY('a', 'b') a;
> SELECT (
>   SELECT array_sort(a, (i, j) -> rank[i] - rank[j]) AS sorted
>   FROM (SELECT MAP('a', 1, 'b', 2) rank)
> ) FROM t1{code}
> This will throw an exception after SPARK-39699:
> {code:java}
> Unexpected operator Join Inner
> :- Aggregate [[a,b]], [[a,b] AS a#252]
> :  +- OneRowRelation
> +- Project [map(keys: [a,b], values: [1,2]) AS rank#241]
>    +- OneRowRelation
>  in correlated subquery{code}
> because the projects inside the subquery can no longer be collapsed. We 
> should always inline expressions if possible to avoid adding domain joins and 
> support a wider range of correlated subqueries. 
>  
>  






[jira] [Assigned] (SPARK-36114) Support subqueries with correlated non-equality predicates

2022-10-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36114:
---

Assignee: Allison Wang

> Support subqueries with correlated non-equality predicates
> --
>
> Key: SPARK-36114
> URL: https://issues.apache.org/jira/browse/SPARK-36114
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> The new decorrelation framework is able to support subqueries with 
> non-equality predicates. For example:
> SELECT * FROM t1 WHERE c1 = (SELECT SUM(c1) FROM t2 WHERE t1.c2 > t2.c2)
> The restrictions in CheckAnalysis can be removed.
>  






[jira] [Resolved] (SPARK-36114) Support subqueries with correlated non-equality predicates

2022-10-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36114.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38135
[https://github.com/apache/spark/pull/38135]

> Support subqueries with correlated non-equality predicates
> --
>
> Key: SPARK-36114
> URL: https://issues.apache.org/jira/browse/SPARK-36114
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> The new decorrelation framework is able to support subqueries with 
> non-equality predicates. For example:
> SELECT * FROM t1 WHERE c1 = (SELECT SUM(c1) FROM t2 WHERE t1.c2 > t2.c2)
> The restrictions in CheckAnalysis can be removed.
>  






[jira] [Commented] (SPARK-40893) Upgrade to use setup-java v3

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622931#comment-17622931
 ] 

Apache Spark commented on SPARK-40893:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38366

> Upgrade to use setup-java v3
> 
>
> Key: SPARK-40893
> URL: https://issues.apache.org/jira/browse/SPARK-40893
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Assigned] (SPARK-40893) Upgrade to use setup-java v3

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40893:


Assignee: Apache Spark

> Upgrade to use setup-java v3
> 
>
> Key: SPARK-40893
> URL: https://issues.apache.org/jira/browse/SPARK-40893
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-40893) Upgrade to use setup-java v3

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622930#comment-17622930
 ] 

Apache Spark commented on SPARK-40893:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38366

> Upgrade to use setup-java v3
> 
>
> Key: SPARK-40893
> URL: https://issues.apache.org/jira/browse/SPARK-40893
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Assigned] (SPARK-40893) Upgrade to use setup-java v3

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40893:


Assignee: (was: Apache Spark)

> Upgrade to use setup-java v3
> 
>
> Key: SPARK-40893
> URL: https://issues.apache.org/jira/browse/SPARK-40893
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>







[jira] [Created] (SPARK-40893) Upgrade to use setup-java v3

2022-10-23 Thread Yang Jie (Jira)
Yang Jie created SPARK-40893:


 Summary: Upgrade to use setup-java v3
 Key: SPARK-40893
 URL: https://issues.apache.org/jira/browse/SPARK-40893
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: Yang Jie









[jira] [Commented] (SPARK-40890) Check error classes in DataSourceV2SQLSuite

2022-10-23 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622928#comment-17622928
 ] 

BingKun Pan commented on SPARK-40890:
-

I'm working on it.

> Check error classes in DataSourceV2SQLSuite
> ---
>
> Key: SPARK-40890
> URL: https://issues.apache.org/jira/browse/SPARK-40890
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in DataSourceV2SQLSuite by using checkError() instead of 
> checking the error message body.
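For reference, a hedged sketch of the target test style inside a suite that 
mixes in Spark's test helpers; the statement, error class, and parameters below 
are placeholders, not values from the actual suite:
{code:java}
// Inside a suite extending SparkFunSuite with SharedSparkSession (sketch only;
// the statement, error class, and parameters are placeholders):
val e = intercept[org.apache.spark.sql.AnalysisException] {
  sql("CREATE TABLE t (c INT) USING parquet")   // placeholder statement under test
}
checkError(
  exception = e,
  errorClass = "SOME_ERROR_CLASS",              // placeholder
  parameters = Map("objectName" -> "`t`"))      // placeholder
{code}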






[jira] (SPARK-40891) Check error classes in TableIdentifierParserSuite

2022-10-23 Thread BingKun Pan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-40891 ]


BingKun Pan deleted comment on SPARK-40891:
-

was (Author: panbingkun):
I'm working on it.

> Check error classes in TableIdentifierParserSuite
> -
>
> Key: SPARK-40891
> URL: https://issues.apache.org/jira/browse/SPARK-40891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in TableIdentifierParserSuite by using checkError().






[jira] [Commented] (SPARK-40851) TimestampFormatter behavior changed when using the latest Java 8/11/17

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622926#comment-17622926
 ] 

Apache Spark commented on SPARK-40851:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38365

> TimestampFormatter behavior changed when using the latest Java 8/11/17
> --
>
> Key: SPARK-40851
> URL: https://issues.apache.org/jira/browse/SPARK-40851
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Blocker
> Fix For: 3.4.0
>
>
> {code:java}
> [info] *** 12 TESTS FAILED ***
> [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5
> [error] Failed tests:
> [error]   org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite
> [error]   org.apache.spark.sql.catalyst.util.TimestampFormatterSuite
> [error]   org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite
> [error]   org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite
> [error]   org.apache.spark.sql.catalyst.expressions.TryCastSuite {code}
> We can reproduce this issue using Java 8u352/11.0.17/17.0.5; the test errors 
> are similar to the following. Run
> {code:java}
> build/sbt clean "catalyst/testOnly *CastWithAnsiOffSuite" {code}
> with 8u352:
> {code:java}
> [info] - SPARK-35711: cast timestamp without time zone to timestamp with 
> local time zone *** FAILED *** (190 milliseconds)
> [info]   Incorrect evaluation (codegen off): cast(0001-01-01 00:00:00 as 
> timestamp), actual: -6213561782000, expected: -621355968 
> (ExpressionEvalHelper.scala:209)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.Assertions.fail(Assertions.scala:933)
> [info]   at org.scalatest.Assertions.fail$(Assertions.scala:929)
> [info]   at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1564)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen(ExpressionEvalHelper.scala:209)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen$(ExpressionEvalHelper.scala:199)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.checkEvaluationWithoutCodegen(CastSuiteBase.scala:49)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:87)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.checkEvaluation(CastSuiteBase.scala:49)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$198(CastSuiteBase.scala:893)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$198$adapted(CastSuiteBase.scala:890)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$197(CastSuiteBase.scala:890)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at 
> org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$196(CastSuiteBase.scala:890)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$196$adapted(CastSuiteBase.scala:888)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$195(CastSuiteBase.scala:888)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
> [info]   at 
> 

[jira] [Assigned] (SPARK-40891) Check error classes in TableIdentifierParserSuite

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40891:


Assignee: (was: Apache Spark)

> Check error classes in TableIdentifierParserSuite
> -
>
> Key: SPARK-40891
> URL: https://issues.apache.org/jira/browse/SPARK-40891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in TableIdentifierParserSuite by using checkError().






[jira] [Assigned] (SPARK-40891) Check error classes in TableIdentifierParserSuite

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40891:


Assignee: Apache Spark

> Check error classes in TableIdentifierParserSuite
> -
>
> Key: SPARK-40891
> URL: https://issues.apache.org/jira/browse/SPARK-40891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in TableIdentifierParserSuite by using checkError().






[jira] [Commented] (SPARK-40891) Check error classes in TableIdentifierParserSuite

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622925#comment-17622925
 ] 

Apache Spark commented on SPARK-40891:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38364

> Check error classes in TableIdentifierParserSuite
> -
>
> Key: SPARK-40891
> URL: https://issues.apache.org/jira/browse/SPARK-40891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in TableIdentifierParserSuite by using checkError().






[jira] [Commented] (SPARK-40851) TimestampFormatter behavior changed when using the latest Java 8/11/17

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622922#comment-17622922
 ] 

Apache Spark commented on SPARK-40851:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38363

> TimestampFormatter behavior changed when using the latest Java 8/11/17
> --
>
> Key: SPARK-40851
> URL: https://issues.apache.org/jira/browse/SPARK-40851
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Blocker
> Fix For: 3.4.0
>
>
> {code:java}
> [info] *** 12 TESTS FAILED ***
> [error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5
> [error] Failed tests:
> [error]   org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite
> [error]   org.apache.spark.sql.catalyst.util.TimestampFormatterSuite
> [error]   org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite
> [error]   org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite
> [error]   org.apache.spark.sql.catalyst.expressions.TryCastSuite {code}
> We can reproduce this issue using Java 8u352/11.0.17/17.0.5; the test errors 
> are similar to the following. Run
> {code:java}
> build/sbt clean "catalyst/testOnly *CastWithAnsiOffSuite" {code}
> with 8u352:
> {code:java}
> [info] - SPARK-35711: cast timestamp without time zone to timestamp with 
> local time zone *** FAILED *** (190 milliseconds)
> [info]   Incorrect evaluation (codegen off): cast(0001-01-01 00:00:00 as 
> timestamp), actual: -6213561782000, expected: -621355968 
> (ExpressionEvalHelper.scala:209)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
> [info]   at org.scalatest.Assertions.fail(Assertions.scala:933)
> [info]   at org.scalatest.Assertions.fail$(Assertions.scala:929)
> [info]   at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1564)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen(ExpressionEvalHelper.scala:209)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen$(ExpressionEvalHelper.scala:199)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.checkEvaluationWithoutCodegen(CastSuiteBase.scala:49)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:87)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.checkEvaluation(CastSuiteBase.scala:49)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$198(CastSuiteBase.scala:893)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$198$adapted(CastSuiteBase.scala:890)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$197(CastSuiteBase.scala:890)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at 
> org.apache.spark.sql.catalyst.util.DateTimeTestUtils$.withDefaultTimeZone(DateTimeTestUtils.scala:61)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$196(CastSuiteBase.scala:890)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$196$adapted(CastSuiteBase.scala:888)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.CastSuiteBase.$anonfun$new$195(CastSuiteBase.scala:888)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
> [info]   at 
> 

[jira] [Commented] (SPARK-40880) Reimplement `summary` with dataframe operations

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622917#comment-17622917
 ] 

Apache Spark commented on SPARK-40880:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38362

> Reimplement `summary` with dataframe operations
> ---
>
> Key: SPARK-40880
> URL: https://issues.apache.org/jira/browse/SPARK-40880
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-40849) Async log purge

2022-10-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-40849:


Assignee: Boyang Jerry Peng

> Async log purge
> ---
>
> Key: SPARK-40849
> URL: https://issues.apache.org/jira/browse/SPARK-40849
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Boyang Jerry Peng
>Assignee: Boyang Jerry Peng
>Priority: Major
>
> Purging old entries in both the offset log and commit log will be done 
> asynchronously.
>  
> For every micro-batch, older entries in both the offset log and commit log are 
> deleted. This is done so that the offset log and commit log do not 
> continually grow. Please see the logic here:
>  
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539]
>  
> The time spent performing these log purges is grouped with the “walCommit” 
> execution time in the StreamingProgressListener metrics. Around two thirds 
> of the “walCommit” execution time is spent performing these purge operations, 
> so making them asynchronous will also reduce latency. Also, we do not 
> necessarily need to perform the purges every micro-batch. When these 
> purges are executed asynchronously, they do not need to block micro-batch 
> execution, and we don’t need to start another purge until the current one is 
> finished. The purges can happen essentially in the background. We will just 
> have to synchronize the purges with the offset WAL commits and completion 
> commits so that we don’t have concurrent modifications of the offset log and 
> commit log.
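A minimal, self-contained sketch of the scheduling idea (stand-in types, not 
Spark's implementation): run purges on a single background thread and skip 
scheduling a new purge while one is still in flight:
{code:java}
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicBoolean

// Stand-in for the offset/commit metadata logs (hypothetical interface).
trait MetadataLog { def purge(thresholdBatchId: Long): Unit }

class AsyncLogPurger(offsetLog: MetadataLog, commitLog: MetadataLog) {
  private val executor = Executors.newSingleThreadExecutor()
  private val purgeRunning = new AtomicBoolean(false)

  // Called once per micro-batch; a no-op if a purge is already in flight.
  def maybePurgeAsync(thresholdBatchId: Long): Unit = {
    if (purgeRunning.compareAndSet(false, true)) {
      executor.submit(new Runnable {
        override def run(): Unit =
          try {
            // A real implementation must synchronize with the offset WAL and
            // completion commits to avoid concurrent log modifications.
            offsetLog.purge(thresholdBatchId)
            commitLog.purge(thresholdBatchId)
          } finally purgeRunning.set(false)
      })
    }
  }
}
{code}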






[jira] [Commented] (SPARK-40891) Check error classes in TableIdentifierParserSuite

2022-10-23 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622910#comment-17622910
 ] 

BingKun Pan commented on SPARK-40891:
-

I'm working on it.

> Check error classes in TableIdentifierParserSuite
> -
>
> Key: SPARK-40891
> URL: https://issues.apache.org/jira/browse/SPARK-40891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in TableIdentifierParserSuite by using checkError().






[jira] [Resolved] (SPARK-40849) Async log purge

2022-10-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-40849.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38313
[https://github.com/apache/spark/pull/38313]

> Async log purge
> ---
>
> Key: SPARK-40849
> URL: https://issues.apache.org/jira/browse/SPARK-40849
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Boyang Jerry Peng
>Assignee: Boyang Jerry Peng
>Priority: Major
> Fix For: 3.4.0
>
>
> Purging old entries in both the offset log and commit log will be done 
> asynchronously.
>  
> For every micro-batch, older entries in both the offset log and commit log are 
> deleted. This is done so that the offset log and commit log do not 
> continually grow. Please see the logic here:
>  
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539]
>  
> The time spent performing these log purges is grouped with the “walCommit” 
> execution time in the StreamingProgressListener metrics. Around two thirds 
> of the “walCommit” execution time is spent performing these purge operations, 
> so making them asynchronous will also reduce latency. Also, we do not 
> necessarily need to perform the purges every micro-batch. When these 
> purges are executed asynchronously, they do not need to block micro-batch 
> execution, and we don’t need to start another purge until the current one is 
> finished. The purges can happen essentially in the background. We will just 
> have to synchronize the purges with the offset WAL commits and completion 
> commits so that we don’t have concurrent modifications of the offset log and 
> commit log.






[jira] [Assigned] (SPARK-40880) Reimplement `summary` with dataframe operations

2022-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40880:


Assignee: Ruifeng Zheng

> Reimplement `summary` with dataframe operations
> ---
>
> Key: SPARK-40880
> URL: https://issues.apache.org/jira/browse/SPARK-40880
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-40880) Reimplement `summary` with dataframe operations

2022-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40880.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38346
[https://github.com/apache/spark/pull/38346]

> Reimplement `summary` with dataframe operations
> ---
>
> Key: SPARK-40880
> URL: https://issues.apache.org/jira/browse/SPARK-40880
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-40877) Reimplement `crosstab` with dataframe operations

2022-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40877.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38340
[https://github.com/apache/spark/pull/38340]

> Reimplement `crosstab` with dataframe operations
> 
>
> Key: SPARK-40877
> URL: https://issues.apache.org/jira/browse/SPARK-40877
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-40877) Reimplement `crosstab` with dataframe operations

2022-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40877:


Assignee: Ruifeng Zheng

> Reimplement `crosstab` with dataframe operations
> 
>
> Key: SPARK-40877
> URL: https://issues.apache.org/jira/browse/SPARK-40877
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-40884) Upgrade fabric8io - kubernetes-client to 6.2.0

2022-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40884:


Assignee: Bjørn Jørgensen

> Upgrade fabric8io - kubernetes-client to 6.2.0
> --
>
> Key: SPARK-40884
> URL: https://issues.apache.org/jira/browse/SPARK-40884
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> [Release 
> notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.2.0]
> [Snakeyaml version should be updated to mitigate 
> CVE-2022-28857|https://github.com/fabric8io/kubernetes-client/issues/4383]






[jira] [Resolved] (SPARK-40884) Upgrade fabric8io - kubernetes-client to 6.2.0

2022-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40884.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38348
[https://github.com/apache/spark/pull/38348]

> Upgrade fabric8io - kubernetes-client to 6.2.0
> --
>
> Key: SPARK-40884
> URL: https://issues.apache.org/jira/browse/SPARK-40884
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
>
> [Release 
> notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.2.0]
> [Snakeyaml version should be updated to mitigate 
> CVE-2022-28857|https://github.com/fabric8io/kubernetes-client/issues/4383]






[jira] [Updated] (SPARK-40874) Fix broadcasts in Python UDFs when encryption is enabled

2022-10-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40874:
-
Fix Version/s: 3.1.4
   3.3.1
   3.2.3

> Fix broadcasts in Python UDFs when encryption is enabled
> 
>
> Key: SPARK-40874
> URL: https://issues.apache.org/jira/browse/SPARK-40874
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.1.4, 3.4.0, 3.3.1, 3.2.3
>
>
> The following PySpark script:
> {noformat}
> bin/pyspark --conf spark.io.encryption.enabled=true
> ...
> bar = {"a": "aa", "b": "bb"}
> foo = spark.sparkContext.broadcast(bar)
> spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "")
> spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect()
> {noformat}
> fails with:
> {noformat}
> 22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 
> 1]
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 811, in main
> func, profiler, deserializer, serializer = read_command(pickleSer, infile)
>   File 
> "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 87, in read_command
> command = serializer._read_with_length(file)
>   File 
> "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 173, in _read_with_length
> return self.loads(obj)
>   File 
> "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 471, in loads
> return cloudpickle.loads(obj, encoding=encoding)
> EOFError: Ran out of input
> {noformat}






[jira] [Commented] (SPARK-40892) Loosen the requirement of window_time rule - allow multiple window_time calls

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622897#comment-17622897
 ] 

Apache Spark commented on SPARK-40892:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/38361

> Loosen the requirement of window_time rule - allow multiple window_time calls
> -
>
> Key: SPARK-40892
> URL: https://issues.apache.org/jira/browse/SPARK-40892
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-40821 introduces a new SQL function, "window_time", to extract the 
> representative time from a window (it also carries over the event time 
> metadata where feasible).
> SPARK-40821 followed the existing rules for time window / session window, which 
> only allow a single function call in the same projection (strictly speaking, 
> multiple calls of the function are treated as one if they use the same 
> parameters).
> For the existing rules, the restriction makes sense, since allowing this would 
> produce a cartesian product of rows (although Spark can handle it). But given 
> that window_time only produces one value, the restriction no longer makes 
> sense.
> It would be better to unlock the functionality. Note that this means the 
> resulting column of "window_time()" will no longer be named "window_time". (This 
> is the practice most function calls follow. The time window and 
> session window rules don't follow it, so arguably they have a bug, but 
> fixing that bug would break backward compatibility...)
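As a hedged illustration of what the loosened rule permits, assuming a 
streaming DataFrame `joined` that carries two window columns `w1` and `w2` 
(hypothetical names, e.g. from joining two windowed aggregates):
{code:java}
import org.apache.spark.sql.functions._

// With the restriction loosened, window_time can be called more than once in
// the same projection, e.g. once per window column (w1/w2 are hypothetical).
val projected = joined.select(
  window_time(col("w1")).as("left_event_time"),
  window_time(col("w2")).as("right_event_time"))
{code}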






[jira] [Assigned] (SPARK-40892) Loosen the requirement of window_time rule - allow multiple window_time calls

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40892:


Assignee: Apache Spark

> Loosen the requirement of window_time rule - allow multiple window_time calls
> -
>
> Key: SPARK-40892
> URL: https://issues.apache.org/jira/browse/SPARK-40892
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-40821 introduces a new SQL function "window_time" to extract the 
> representative time from a window (which also carries over the event time 
> metadata if feasible).
> SPARK-40821 followed the existing rule of time window / session window, which 
> only allows a single function call in the same projection (strictly speaking, it 
> treats multiple calls of the function as one if they are made with the same 
> parameters).
> For existing rules, the restriction makes sense since allowing this would 
> produce cartesian product of rows (although Spark can handle it). But given 
> that window_time only produces one value, the restriction no longer makes 
> sense.
> It would be better to unlock the functionality. Note that this means the 
> resulting column of "window_time()" is no longer be "window_time". (Note that 
> this is the practice most of function calls do. The rules time window and 
> session window don't follow the practice so arguably they have a bug, but 
> fixing the bug would bring backward incompatibility...)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40892) Loosen the requirement of window_time rule - allow multiple window_time calls

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40892:


Assignee: (was: Apache Spark)

> Loosen the requirement of window_time rule - allow multiple window_time calls
> -
>
> Key: SPARK-40892
> URL: https://issues.apache.org/jira/browse/SPARK-40892
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-40821 introduces a new SQL function "window_time" to extract the 
> representative time from a window (which also carries over the event time 
> metadata if feasible).
> SPARK-40821 followed the existing rule of time window / session window, which 
> only allows a single function call in the same projection (strictly speaking, it 
> treats multiple calls of the function as one if they are made with the same 
> parameters).
> For existing rules, the restriction makes sense since allowing this would 
> produce cartesian product of rows (although Spark can handle it). But given 
> that window_time only produces one value, the restriction no longer makes 
> sense.
> It would be better to unlock the functionality. Note that this means the 
> resulting column of "window_time()" is no longer be "window_time". (Note that 
> this is the practice most of function calls do. The rules time window and 
> session window don't follow the practice so arguably they have a bug, but 
> fixing the bug would bring backward incompatibility...)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40892) Loosen the requirement of window_time rule - allow multiple window_time calls

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622896#comment-17622896
 ] 

Apache Spark commented on SPARK-40892:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/38361

> Loosen the requirement of window_time rule - allow multiple window_time calls
> -
>
> Key: SPARK-40892
> URL: https://issues.apache.org/jira/browse/SPARK-40892
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-40821 introduces a new SQL function "window_time" to extract the 
> representative time from a window (which also carries over the event time 
> metadata if feasible).
> SPARK-40821 followed the existing rule of time window / session window, which 
> only allows a single function call in the same projection (strictly speaking, it 
> treats multiple calls of the function as one if they are made with the same 
> parameters).
> For existing rules, the restriction makes sense since allowing this would 
> produce cartesian product of rows (although Spark can handle it). But given 
> that window_time only produces one value, the restriction no longer makes 
> sense.
> It would be better to unlock the functionality. Note that this means the 
> resulting column of "window_time()" is no longer be "window_time". (Note that 
> this is the practice most of function calls do. The rules time window and 
> session window don't follow the practice so arguably they have a bug, but 
> fixing the bug would bring backward incompatibility...)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40892) Loosen the requirement of window_time rule - allow multiple window_time calls

2022-10-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-40892:
-
Description: 
SPARK-40821 introduces a new SQL function "window_time" to extract the 
representative time from a window (which also carries over the event time 
metadata if feasible).

SPARK-40821 followed the existing rule of time window / session window, which 
only allows a single function call in the same projection (strictly speaking, it 
treats multiple calls of the function as one if they are made with the same 
parameters).

For existing rules, the restriction makes sense since allowing this would 
produce cartesian product of rows (although Spark can handle it). But given 
that window_time only produces one value, the restriction no longer makes sense.

It would be better to unlock the functionality. Note that this means the 
resulting column of "window_time()" is no longer be "window_time". (Note that 
this is the practice most of function calls do. The rules time window and 
session window don't follow the practice so arguably they have a bug, but 
fixing the bug would bring backward incompatibility...)

  was:
SPARK-40821 introduces a new SQL function "window_time" to extract the 
representative time from a window (which also carries over the event time 
metadata if feasible).

SPARK-40821 followed the existing rule of time window / session window, which 
only allows a single function call in the same projection (strictly speaking, it 
treats multiple calls of the function as one if they are made with the same 
parameters).

For existing rules, the restriction makes sense since allowing this would 
produce cartesian product of rows (although Spark can handle it). But given 
that window_time only produces one value, the restriction no longer makes sense.

It would be better to unlock the functionality. Note that this means the 
resulting column of "window_time()" is no longer be "window_time".


> Loosen the requirement of window_time rule - allow multiple window_time calls
> -
>
> Key: SPARK-40892
> URL: https://issues.apache.org/jira/browse/SPARK-40892
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-40821 introduces a new SQL function "window_time" to extract the 
> representative time from a window (which also carries over the event time 
> metadata if feasible).
> SPARK-40821 followed the existing rule of time window / session window, which 
> only allows a single function call in the same projection (strictly speaking, it 
> treats multiple calls of the function as one if they are made with the same 
> parameters).
> For existing rules, the restriction makes sense since allowing this would 
> produce cartesian product of rows (although Spark can handle it). But given 
> that window_time only produces one value, the restriction no longer makes 
> sense.
> It would be better to unlock the functionality. Note that this means the 
> resulting column of "window_time()" is no longer be "window_time". (Note that 
> this is the practice most of function calls do. The rules time window and 
> session window don't follow the practice so arguably they have a bug, but 
> fixing the bug would bring backward incompatibility...)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40892) Loosen the requirement of window_time rule - allow multiple window_time calls

2022-10-23 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-40892:


 Summary: Loosen the requirement of window_time rule - allow 
multiple window_time calls
 Key: SPARK-40892
 URL: https://issues.apache.org/jira/browse/SPARK-40892
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


SPARK-40821 introduces a new SQL function "window_time" to extract the 
representative time from a window (which also carries over the event time 
metadata if feasible).

SPARK-40821 followed the existing rule of time window / session window, which 
only allows a single function call in the same projection (strictly speaking, it 
treats multiple calls of the function as one if they are made with the same 
parameters).

For existing rules, the restriction makes sense since allowing this would 
produce cartesian product of rows (although Spark can handle it). But given 
that window_time only produces one value, the restriction no longer makes sense.

It would be better to unlock the functionality. Note that this means the 
resulting column of "window_time()" is no longer be "window_time".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40821) Fix late record filtering to support chaining of stateful operators

2022-10-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-40821:
-
Affects Version/s: 3.4.0
   (was: 3.3.0)

> Fix late record filtering to support chaining of stateful operators
> ---
>
> Key: SPARK-40821
> URL: https://issues.apache.org/jira/browse/SPARK-40821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Alex Balikov
>Assignee: Alex Balikov
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently chaining of stateful operators in Spark Structured Streaming is not 
> supported for various reasons and is blocked by the unsupported operations 
> check (spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix 
> this as chaining of stateful operators is a common streaming scenario - e.g.
> stream-stream join -> windowed aggregation
> window aggregation -> window aggregation
> etc
> What is broken:
>  # every stateful operator performs late record filtering against the global 
> watermark. When chaining stateful operators (e.g. window aggregations) the 
> output produced by the first stateful operator is effectively late against 
> the watermark and thus filtered out by the next operator late record 
> filtering (technically the next operator should not do late record filtering 
> but it can be changed to assert for correctness detection, etc)
>  # when chaining window aggregations, the first window aggregating operator 
> produces records with schema \{ window: { start: Timestamp, end: Timestamp }, 
> agg: Long } - there is no explicit event time in the schema to be used by 
> the next stateful operator (the correct event time should be window.end - 1 )
>  # stream-stream time-interval join can produce late records by semantics, 
> e.g. if the join condition is:
> left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND 
> right.eventTime + INTERVAL 1 HOUR
>           the produced records can be delayed by 1 hr relative to the 
> watermark.
> Proposed fixes:
>  1. Issue 1 can be fixed by performing late record filtering against the previous 
> microbatch watermark instead of the current microbatch watermark.
> 2. Issue 2 can be fixed by allowing the window and session_window functions to work 
> on the window column directly and compute the correct event time 
> transparently to the user. Also, introduce a window_time SQL function to 
> compute the correct event time from the window column.
> 3. Issue 3 can be fixed by adding support for per-operator watermarks instead of a 
> single global watermark. In the example of stream-stream time interval join 
> followed by a stateful operator, the join operator will 'delay' the 
> downstream operator watermarks by a correct value to handle the delayed 
> records. Only stream-stream time-interval joins will be delaying the 
> watermark, any other operators will not delay downstream watermarks.
>  
>  
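
For context, a hedged sketch (not from the ticket) of the "window aggregation -> window aggregation" chaining this work enables, assuming a spark-shell session and the fix (2) behavior where window() accepts an upstream window column:
{code:scala}
import org.apache.spark.sql.functions._

// Rate source produces the columns: timestamp, value.
val perMinute = spark.readStream.format("rate").load()
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "1 minute"))
  .agg(count("*").as("cnt"))

// The second stateful operator aggregates the first operator's windows; per
// fix (2), the event time (window.end - 1) is derived from the window column
// transparently to the user.
val perHour = perMinute
  .groupBy(window($"window", "1 hour"))
  .agg(sum($"cnt").as("total"))
{code}
Before these fixes, such a chain is blocked by the unsupported operations check unless spark.sql.streaming.unsupportedOperationCheck is disabled, and the rows emitted by the first aggregation would be dropped by the second operator's late record filtering.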



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40821) Fix late record filtering to support chaining of stateful operators

2022-10-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-40821:
-
Target Version/s:   (was: 3.3.0)

> Fix late record filtering to support chaining of stateful operators
> ---
>
> Key: SPARK-40821
> URL: https://issues.apache.org/jira/browse/SPARK-40821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Alex Balikov
>Assignee: Alex Balikov
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently chaining of stateful operators in Spark Structured Streaming is not 
> supported for various reasons and is blocked by the unsupported operations 
> check (spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix 
> this as chaining of stateful operators is a common streaming scenario - e.g.
> stream-stream join -> windowed aggregation
> window aggregation -> window aggregation
> etc
> What is broken:
>  # every stateful operator performs late record filtering against the global 
> watermark. When chaining stateful operators (e.g. window aggregations) the 
> output produced by the first stateful operator is effectively late against 
> the watermark and thus filtered out by the next operator late record 
> filtering (technically the next operator should not do late record filtering 
> but it can be changed to assert for correctness detection, etc)
>  # when chaining window aggregations, the first window aggregating operator 
> produces records with schema \{ window: { start: Timestamp, end: Timestamp }, 
> agg: Long } - there is no explicit event time in the schema to be used by 
> the next stateful operator (the correct event time should be window.end - 1 )
>  # stream-stream time-interval join can produce late records by semantics, 
> e.g. if the join condition is:
> left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND 
> right.eventTime + INTERVAL 1 HOUR
>           the produced records can be delayed by 1 hr relative to the 
> watermark.
> Proposed fixes:
>  1. Issue 1 can be fixed by performing late record filtering against the previous 
> microbatch watermark instead of the current microbatch watermark.
> 2. Issue 2 can be fixed by allowing the window and session_window functions to work 
> on the window column directly and compute the correct event time 
> transparently to the user. Also, introduce a window_time SQL function to 
> compute the correct event time from the window column.
> 3. Issue 3 can be fixed by adding support for per-operator watermarks instead of a 
> single global watermark. In the example of stream-stream time interval join 
> followed by a stateful operator, the join operator will 'delay' the 
> downstream operator watermarks by a correct value to handle the delayed 
> records. Only stream-stream time-interval joins will be delaying the 
> watermark, any other operators will not delay downstream watermarks.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40821) Fix late record filtering to support chaining of stateful operators

2022-10-23 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622892#comment-17622892
 ] 

Jungtaek Lim edited comment on SPARK-40821 at 10/23/22 11:16 PM:
-

[~alex-balikov] 

I thought you created multiple JIRA tickets... My bad. Could you please 1) 
clone this ticket to have a separate ticket number and reopen it, and 2) update 
this ticket to only contain the window_time change? Thanks in advance.


was (Author: kabhwan):
[~alex-balikov] 

I thought you created multiple JIRA tickets... My bad. Could you please clone 
this ticket to have a separate ticket number, and update this ticket to only 
contain the window_time change? Thanks in advance.

> Fix late record filtering to support chaining of stateful operators
> ---
>
> Key: SPARK-40821
> URL: https://issues.apache.org/jira/browse/SPARK-40821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Alex Balikov
>Assignee: Alex Balikov
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently chaining of stateful operators in Spark Structured Streaming is not 
> supported for various reasons and is blocked by the unsupported operations 
> check (spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix 
> this as chaining of stateful operators is a common streaming scenario - e.g.
> stream-stream join -> windowed aggregation
> window aggregation -> window aggregation
> etc
> What is broken:
>  # every stateful operator performs late record filtering against the global 
> watermark. When chaining stateful operators (e.g. window aggregations) the 
> output produced by the first stateful operator is effectively late against 
> the watermark and thus filtered out by the next operator late record 
> filtering (technically the next operator should not do late record filtering 
> but it can be changed to assert for correctness detection, etc)
>  # when chaining window aggregations, the first window aggregating operator 
> produces records with schema \{ window: { start: Timestamp, end: Timestamp }, 
> agg: Long } - there is no explicit event time in the schema to be used by 
> the next stateful operator (the correct event time should be window.end - 1 )
>  # stream-stream time-interval join can produce late records by semantics, 
> e.g. if the join condition is:
> left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND 
> right.eventTime + INTERVAL 1 HOUR
>           the produced records can be delayed by 1 hr relative to the 
> watermark.
> Proposed fixes:
>  1. Issue 1 can be fixed by performing late record filtering against the previous 
> microbatch watermark instead of the current microbatch watermark.
> 2. Issue 2 can be fixed by allowing the window and session_window functions to work 
> on the window column directly and compute the correct event time 
> transparently to the user. Also, introduce a window_time SQL function to 
> compute the correct event time from the window column.
> 3. Issue 3 can be fixed by adding support for per-operator watermarks instead of a 
> single global watermark. In the example of stream-stream time interval join 
> followed by a stateful operator, the join operator will 'delay' the 
> downstream operator watermarks by a correct value to handle the delayed 
> records. Only stream-stream time-interval joins will be delaying the 
> watermark, any other operators will not delay downstream watermarks.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40821) Fix late record filtering to support chaining of stateful operators

2022-10-23 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622892#comment-17622892
 ] 

Jungtaek Lim commented on SPARK-40821:
--

[~alex-balikov] 

I thought you created multiple JIRA tickets... My bad. Could you please clone 
this ticket to have a separate ticket number, and update this ticket to only 
contain the window_time change? Thanks in advance.

> Fix late record filtering to support chaining of stateful operators
> ---
>
> Key: SPARK-40821
> URL: https://issues.apache.org/jira/browse/SPARK-40821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Alex Balikov
>Assignee: Alex Balikov
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently chaining of stateful operators in Spark Structured Streaming is not 
> supported for various reasons and is blocked by the unsupported operations 
> check (spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix 
> this as chaining of stateful operators is a common streaming scenario - e.g.
> stream-stream join -> windowed aggregation
> window aggregation -> window aggregation
> etc
> What is broken:
>  # every stateful operator performs late record filtering against the global 
> watermark. When chaining stateful operators (e.g. window aggregations) the 
> output produced by the first stateful operator is effectively late against 
> the watermark and thus filtered out by the next operator late record 
> filtering (technically the next operator should not do late record filtering 
> but it can be changed to assert for correctness detection, etc)
>  # when chaining window aggregations, the first window aggregating operator 
> produces records with schema \{ window: { start: Timestamp, end: Timestamp }, 
> agg: Long } - there is no explicit event time in the schema to be used by 
> the next stateful operator (the correct event time should be window.end - 1 )
>  # stream-stream time-interval join can produce late records by semantics, 
> e.g. if the join condition is:
> left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND 
> right.eventTime + INTERVAL 1 HOUR
>           the produced records can be delayed by 1 hr relative to the 
> watermark.
> Proposed fixes:
>  1. Issue 1 can be fixed by performing late record filtering against the previous 
> microbatch watermark instead of the current microbatch watermark.
> 2. Issue 2 can be fixed by allowing the window and session_window functions to work 
> on the window column directly and compute the correct event time 
> transparently to the user. Also, introduce a window_time SQL function to 
> compute the correct event time from the window column.
> 3. Issue 3 can be fixed by adding support for per-operator watermarks instead of a 
> single global watermark. In the example of stream-stream time interval join 
> followed by a stateful operator, the join operator will 'delay' the 
> downstream operator watermarks by a correct value to handle the delayed 
> records. Only stream-stream time-interval joins will be delaying the 
> watermark, any other operators will not delay downstream watermarks.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40821) Fix late record filtering to support chaining of stateful operators

2022-10-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-40821.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38288
[https://github.com/apache/spark/pull/38288]

> Fix late record filtering to support chaining of stateful operators
> ---
>
> Key: SPARK-40821
> URL: https://issues.apache.org/jira/browse/SPARK-40821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Alex Balikov
>Assignee: Alex Balikov
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently chaining of stateful operators in Spark Structured Streaming is not 
> supported for various reasons and is blocked by the unsupported operations 
> check (spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix 
> this as chaining of stateful operators is a common streaming scenario - e.g.
> stream-stream join -> windowed aggregation
> window aggregation -> window aggregation
> etc
> What is broken:
>  # every stateful operator performs late record filtering against the global 
> watermark. When chaining stateful operators (e.g. window aggregations) the 
> output produced by the first stateful operator is effectively late against 
> the watermark and thus filtered out by the next operator late record 
> filtering (technically the next operator should not do late record filtering 
> but it can be changed to assert for correctness detection, etc)
>  # when chaining window aggregations, the first window aggregating operator 
> produces records with schema \{ window: { start: Timestamp, end: Timestamp }, 
> agg: Long } - there is no explicit event time in the schema to be used by 
> the next stateful operator (the correct event time should be window.end - 1 )
>  # stream-stream time-interval join can produce late records by semantics, 
> e.g. if the join condition is:
> left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND 
> right.eventTime + INTERVAL 1 HOUR
>           the produced records can be delayed by 1 hr relative to the 
> watermark.
> Proposed fixes:
>  1. Issue 1 can be fixed by performing late record filtering against the previous 
> microbatch watermark instead of the current microbatch watermark.
> 2. Issue 2 can be fixed by allowing the window and session_window functions to work 
> on the window column directly and compute the correct event time 
> transparently to the user. Also, introduce a window_time SQL function to 
> compute the correct event time from the window column.
> 3. Issue 3 can be fixed by adding support for per-operator watermarks instead of a 
> single global watermark. In the example of stream-stream time interval join 
> followed by a stateful operator, the join operator will 'delay' the 
> downstream operator watermarks by a correct value to handle the delayed 
> records. Only stream-stream time-interval joins will be delaying the 
> watermark, any other operators will not delay downstream watermarks.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40821) Fix late record filtering to support chaining of stateful operators

2022-10-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-40821:


Assignee: Alex Balikov

> Fix late record filtering to support chaining of stateful operators
> ---
>
> Key: SPARK-40821
> URL: https://issues.apache.org/jira/browse/SPARK-40821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Alex Balikov
>Assignee: Alex Balikov
>Priority: Major
>
> Currently chaining of stateful operators in Spark Structured Streaming is not 
> supported for various reasons and is blocked by the unsupported operations 
> check (spark.sql.streaming.unsupportedOperationCheck flag). We propose to fix 
> this as chaining of stateful operators is a common streaming scenario - e.g.
> stream-stream join -> windowed aggregation
> window aggregation -> window aggregation
> etc
> What is broken:
>  # every stateful operator performs late record filtering against the global 
> watermark. When chaining stateful operators (e.g. window aggregations) the 
> output produced by the first stateful operator is effectively late against 
> the watermark and thus filtered out by the next operator late record 
> filtering (technically the next operator should not do late record filtering 
> but it can be changed to assert for correctness detection, etc)
>  # when chaining window aggregations, the first window aggregating operator 
> produces records with schema \{ window: { start: Timestamp, end: Timestamp }, 
> agg: Long } - there is no explicit event time in the schema to be used by 
> the next stateful operator (the correct event time should be window.end - 1 )
>  # stream-stream time-interval join can produce late records by semantics, 
> e.g. if the join condition is:
> left.eventTime BETWEEN right.eventTime - INTERVAL 1 HOUR AND 
> right.eventTime + INTERVAL 1 HOUR
>           the produced records can be delayed by 1 hr relative to the 
> watermark.
> Proposed fixes:
>  1. Issue 1 can be fixed by performing late record filtering against the previous 
> microbatch watermark instead of the current microbatch watermark.
> 2. Issue 2 can be fixed by allowing the window and session_window functions to work 
> on the window column directly and compute the correct event time 
> transparently to the user. Also, introduce a window_time SQL function to 
> compute the correct event time from the window column.
> 3. Issue 3 can be fixed by adding support for per-operator watermarks instead of a 
> single global watermark. In the example of stream-stream time interval join 
> followed by a stateful operator, the join operator will 'delay' the 
> downstream operator watermarks by a correct value to handle the delayed 
> records. Only stream-stream time-interval joins will be delaying the 
> watermark, any other operators will not delay downstream watermarks.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40811) Use checkError() to intercept ParseException

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622857#comment-17622857
 ] 

Apache Spark commented on SPARK-40811:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/38360

> Use checkError() to intercept ParseException
> 
>
> Key: SPARK-40811
> URL: https://issues.apache.org/jira/browse/SPARK-40811
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Port the following test suites onto checkError():
> - SQLViewSuite
> - JDBCTableCatalogSuite



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40891) Check error classes in TableIdentifierParserSuite

2022-10-23 Thread Max Gekk (Jira)
Max Gekk created SPARK-40891:


 Summary: Check error classes in TableIdentifierParserSuite
 Key: SPARK-40891
 URL: https://issues.apache.org/jira/browse/SPARK-40891
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
 Fix For: 3.4.0


Check error classes in DataSourceV2SQLSuite by using checkError() instead of 
checking the error message body.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40891) Check error classes in TableIdentifierParserSuite

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40891:
-
Description: 
Check error classes in TableIdentifierParserSuite by using checkError().


  was:
Check error classes in DataSourceV2SQLSuite by using checkError() instead of 
checking the error message body.




> Check error classes in TableIdentifierParserSuite
> -
>
> Key: SPARK-40891
> URL: https://issues.apache.org/jira/browse/SPARK-40891
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in TableIdentifierParserSuite by using checkError().
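
For illustration, a hypothetical sketch of such a port, mirroring the pattern given in the SPARK-40888 description further down in this digest; the input, the parse call, and the error class below are placeholders rather than entries taken from the suite:
{code:scala}
import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}

checkError(
  exception = intercept[ParseException] {
    CatalystSqlParser.parseTableIdentifier("table-name")  // hypothetical input
  },
  errorClass = "...",
  parameters = Map.empty)
{code}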



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40890) Check error classes in DataSourceV2SQLSuite

2022-10-23 Thread Max Gekk (Jira)
Max Gekk created SPARK-40890:


 Summary: Check error classes in DataSourceV2SQLSuite
 Key: SPARK-40890
 URL: https://issues.apache.org/jira/browse/SPARK-40890
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
 Fix For: 3.4.0


Check error classes in PlanResolutionSuite by using checkError() instead of 
assertUnsupported.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40890) Check error classes in DataSourceV2SQLSuite

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40890:
-
Description: 
Check error classes in DataSourceV2SQLSuite by using checkError() instead of 
checking the error message body.



  was:
Check error classes in PlanResolutionSuite by using checkError() instead of 
assertUnsupported.




> Check error classes in DataSourceV2SQLSuite
> ---
>
> Key: SPARK-40890
> URL: https://issues.apache.org/jira/browse/SPARK-40890
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in DataSourceV2SQLSuite by using checkError() instead of 
> checking the error message body.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40588) Sorting issue with partitioned-writing and AQE turned on

2022-10-23 Thread Enrico Minack (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621032#comment-17621032
 ] 

Enrico Minack edited comment on SPARK-40588 at 10/23/22 5:01 PM:
-

Here is a more concise and complete example to reproduce this issue.

Run this with 512m memory and one executor, e.g.:
{code:bash}
spark-shell --driver-memory 512m --master "local[1]"
{code}
{code:scala}
import org.apache.spark.sql.SaveMode

spark.conf.set("spark.sql.adaptive.enabled", true)

val ids = 200
val days = 2
val parts = 2

val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", "day")
  .join(spark.range(0, ids, 1, parts))

ds.repartition($"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partitionBy("day")
  .mode(SaveMode.Overwrite)
  .csv("interleaved.csv")
{code}
Check the written files are sorted (states {{OK}} when file is sorted):
{code:bash}
for file in interleaved.csv/day\=*/part-*
do
  echo "$(sort -n "$file" | md5sum | cut -d " " -f 1)  $file"
done | md5sum -c
{code}
Files are not sorted for Spark 3.0.x, 3.1.x, 3.2.x and 3.3.x. Current master 
(3.4.0) seems to be fixed.


was (Author: enricomi):
Here is a more concise and complete example to reproduce this issue.

Run this with 512m memory and one executor, e.g.:
{code:bash}
spark-shell --driver-memory 512m --master "local[1]"
{code}
{code:scala}
import org.apache.spark.sql.SaveMode

spark.conf.set("spark.sql.adaptive.enabled", true)

val ids = 100
val days = 2
val parts = 2

val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", "day")
  .join(spark.range(0, ids, 1, parts))

ds.repartition($"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partitionBy("day")
  .mode(SaveMode.Overwrite)
  .csv("interleaved.csv")
{code}
Check the written files are sorted (states {{OK}} when file is sorted):
{code:bash}
for file in interleaved.csv/day\=*/part-*
do
  echo "$(sort -n "$file" | md5sum | cut -d " " -f 1)  $file"
done | md5sum -c
{code}
Files are not sorted for Spark 3.0.x, 3.1.x, 3.2.x and 3.3.x. Current master 
(3.4.0) seems to be fixed.

> Sorting issue with partitioned-writing and AQE turned on
> 
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Priority: Major
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition, however we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 

[jira] [Comment Edited] (SPARK-40588) Sorting issue with partitioned-writing and AQE turned on

2022-10-23 Thread Enrico Minack (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621032#comment-17621032
 ] 

Enrico Minack edited comment on SPARK-40588 at 10/23/22 4:55 PM:
-

Here is a more concise and complete example to reproduce this issue.

Run this with 512m memory and one executor, e.g.:
{code:bash}
spark-shell --driver-memory 512m --master "local[1]"
{code}
{code:scala}
import org.apache.spark.sql.SaveMode

spark.conf.set("spark.sql.adaptive.enabled", true)

val ids = 100
val days = 2
val parts = 2

val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", "day")
  .join(spark.range(0, ids, 1, parts))

ds.repartition($"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partitionBy("day")
  .mode(SaveMode.Overwrite)
  .csv("interleaved.csv")
{code}
Check the written files are sorted (states {{OK}} when file is sorted):
{code:bash}
for file in interleaved.csv/day\=*/part-*
do
  echo "$(sort -n "$file" | md5sum | cut -d " " -f 1)  $file"
done | md5sum -c
{code}
Files are not sorted for Spark 3.0.x, 3.1.x, 3.2.x and 3.3.x. Current master 
(3.4.0) seems to be fixed.


was (Author: enricomi):
Here is a more concise and complete example to reproduce this issue:

{code:scala}
import org.apache.spark.sql.SaveMode

spark.conf.set("spark.sql.adaptive.enabled", true)

val ids = 100
val days = 2
val parts = 2

val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", "day")
  .join(spark.range(0, ids, 1, parts))

ds.repartition($"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partitionBy("day")
  .mode(SaveMode.Overwrite)
  .csv("interleaved.csv")
{code}

Check the written files are sorted (states {{OK}} when file is sorted):
{code:bash}
for file in interleaved.csv/day\=*/part-*
do
  echo "$(sort -n "$file" | md5sum | cut -d " " -f 1)  $file"
done | md5sum -c
{code}

Files are not sorted for Spark 3.0.x, 3.1.x, 3.2.x and 3.3.x. Current master 
(3.4.0) seems to be fixed.

> Sorting issue with partitioned-writing and AQE turned on
> 
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Priority: Major
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition, however we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 
> Turning off AQE with 

[jira] [Comment Edited] (SPARK-40588) Sorting issue with partitioned-writing and AQE turned on

2022-10-23 Thread Enrico Minack (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621032#comment-17621032
 ] 

Enrico Minack edited comment on SPARK-40588 at 10/23/22 4:55 PM:
-

Here is a more concise and complete example to reproduce this issue.

Run this with 512m memory and one executor, e.g.:
{code:bash}
spark-shell --driver-memory 512m --master "local[1]"
{code}
{code:scala}
import org.apache.spark.sql.SaveMode

spark.conf.set("spark.sql.adaptive.enabled", true)

val ids = 100
val days = 2
val parts = 2

val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", "day")
  .join(spark.range(0, ids, 1, parts))

ds.repartition($"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partitionBy("day")
  .mode(SaveMode.Overwrite)
  .csv("interleaved.csv")
{code}
Check the written files are sorted (states {{OK}} when file is sorted):
{code:bash}
for file in interleaved.csv/day\=*/part-*
do
  echo "$(sort -n "$file" | md5sum | cut -d " " -f 1)  $file"
done | md5sum -c
{code}
Files are not sorted for Spark 3.0.x, 3.1.x, 3.2.x and 3.3.x. Current master 
(3.4.0) seems to be fixed.


was (Author: enricomi):
Here is a more concise and complete example to reproduce this issue.

Run this with 512m memory and one executor, e.g.:
{code:bash}
spark-shell --driver-memory 512m --master "local[1]"
{code}
{code:scala}
import org.apache.spark.sql.SaveMode

spark.conf.set("spark.sql.adaptive.enabled", true)

val ids = 100
val days = 2
val parts = 2

val ds = spark.range(0, days, 1, parts).withColumnRenamed("id", "day")
  .join(spark.range(0, ids, 1, parts))

ds.repartition($"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partitionBy("day")
  .mode(SaveMode.Overwrite)
  .csv("interleaved.csv")
{code}
Check the written files are sorted (states {{OK}} when file is sorted):
{code:bash}
for file in interleaved.csv/day\=*/part-*
do
  echo "$(sort -n "$file" | md5sum | cut -d " " -f 1)  $file"
done | md5sum -c
{code}
Files are not sorted for Spark 3.0.x, 3.1.x, 3.2.x and 3.3.x. Current master 
(3.4.0) seems to be fixed.

> Sorting issue with partitioned-writing and AQE turned on
> 
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Priority: Major
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition, however we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 

[jira] [Resolved] (SPARK-40886) Bump Jackson Databind 2.13.4.2

2022-10-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40886.
--
Fix Version/s: 3.3.2
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 38355
[https://github.com/apache/spark/pull/38355]

> Bump Jackson Databind 2.13.4.2
> --
>
> Key: SPARK-40886
> URL: https://issues.apache.org/jira/browse/SPARK-40886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.3.2, 3.4.0
>
>
> Jackson 2.13.4.1 has a regression affecting Gradle:
> https://github.com/FasterXML/jackson-databind/issues/3627



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40886) Bump Jackson Databind 2.13.4.2

2022-10-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40886:


Assignee: Cheng Pan

> Bump Jackson Databind 2.13.4.2
> --
>
> Key: SPARK-40886
> URL: https://issues.apache.org/jira/browse/SPARK-40886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>
> Jackson 2.13.4.1 has a regression affecting Gradle:
> https://github.com/FasterXML/jackson-databind/issues/3627



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40886) Bump Jackson Databind 2.13.4.2

2022-10-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40886:
-
Priority: Trivial  (was: Major)

> Bump Jackson Databind 2.13.4.2
> --
>
> Key: SPARK-40886
> URL: https://issues.apache.org/jira/browse/SPARK-40886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Trivial
> Fix For: 3.4.0, 3.3.2
>
>
> Jackson 2.13.4.1 has a regression affecting Gradle:
> https://github.com/FasterXML/jackson-databind/issues/3627



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40801) Upgrade Apache Commons Text to 1.10

2022-10-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-40801:
-
Priority: Minor  (was: Major)

> Upgrade Apache Commons Text to 1.10
> ---
>
> Key: SPARK-40801
> URL: https://issues.apache.org/jira/browse/SPARK-40801
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.4.0, 3.3.2
>
>
> [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40889) Check error classes in PlanResolutionSuite

2022-10-23 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622839#comment-17622839
 ] 

Max Gekk commented on SPARK-40889:
--

cc [~panbingkun]

> Check error classes in PlanResolutionSuite
> --
>
> Key: SPARK-40889
> URL: https://issues.apache.org/jira/browse/SPARK-40889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in PlanResolutionSuite by using checkError() instead of 
> assertUnsupported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40889) Check error classes in PlanResolutionSuite

2022-10-23 Thread Max Gekk (Jira)
Max Gekk created SPARK-40889:


 Summary: Check error classes in PlanResolutionSuite
 Key: SPARK-40889
 URL: https://issues.apache.org/jira/browse/SPARK-40889
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
 Fix For: 3.4.0


Check error classes in HiveQuerySuite by using checkError() instead of 
assertUnsupportedFeature





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40889) Check error classes in PlanResolutionSuite

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40889:
-
Description: 
Check error classes in PlanResolutionSuite by using checkError() instead of 
assertUnsupported.



  was:
Check error classes in HiveQuerySuite by using checkError() instead of 
assertUnsupportedFeature




> Check error classes in PlanResolutionSuite
> --
>
> Key: SPARK-40889
> URL: https://issues.apache.org/jira/browse/SPARK-40889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in PlanResolutionSuite by using checkError() instead of 
> assertUnsupported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40888) Check error classes in HiveQuerySuite

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40888:


Assignee: (was: BingKun Pan)

> Check error classes in HiveQuerySuite
> -
>
> Key: SPARK-40888
> URL: https://issues.apache.org/jira/browse/SPARK-40888
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in HiveQuerySuite by using checkError() instead of 
> assertUnsupportedFeature



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40888) Check error classes in HiveQuerySuite

2022-10-23 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622838#comment-17622838
 ] 

Max Gekk commented on SPARK-40888:
--

[~panbingkun] Would you like to take this?

> Check error classes in HiveQuerySuite
> -
>
> Key: SPARK-40888
> URL: https://issues.apache.org/jira/browse/SPARK-40888
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: BingKun Pan
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in HiveQuerySuite by using checkError() instead of 
> assertUnsupportedFeature



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40888) Check error classes in HiveQuerySuite

2022-10-23 Thread Max Gekk (Jira)
Max Gekk created SPARK-40888:


 Summary: Check error classes in HiveQuerySuite
 Key: SPARK-40888
 URL: https://issues.apache.org/jira/browse/SPARK-40888
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: BingKun Pan
 Fix For: 3.4.0


Check error classes in DDLParserSuite by using checkError(). For instance, 
replace

{code:scala}
intercept("CREATE TABLE my_tab (id bigint) SKEWED BY (id) ON (1,2,3)",
  "CREATE TABLE ... SKEWED BY")
{code}
by
{code:scala}
checkError(
  exception = parseException("CREATE TABLE my_tab (id bigint) SKEWED BY (id) ON (1,2,3)"),
  errorClass = "...",
  parameters = Map.empty,
  context = ...)
{code}
at 
https://github.com/apache/spark/blob/1a1341910249d545365ba3d6679c6943896dde22/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala#L546
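
(For a concrete feel, a hypothetical filled-in call: the error class, parameter map, and context values below are invented for illustration and will differ in the actual migration.)

{code:scala}
// hypothetical values, for illustration only
checkError(
  exception = parseException("CREATE TABLE my_tab (id bigint) SKEWED BY (id) ON (1,2,3)"),
  errorClass = "_LEGACY_ERROR_TEMP_0035",  // invented placeholder
  parameters = Map("message" -> "CREATE TABLE ... SKEWED BY"),
  context = ExpectedContext(
    fragment = "CREATE TABLE my_tab (id bigint) SKEWED BY (id) ON (1,2,3)",
    start = 0,
    stop = 56))
{code}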





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40888) Check error classes in HiveQuerySuite

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40888:
-
Description: 
Check error classes in HiveQuerySuite by using checkError() instead of 
assertUnsupportedFeature



  was:
Check error classes in DDLParserSuite by using checkError(). For instance, 
replace

{code:scala}
intercept("CREATE TABLE my_tab (id bigint) SKEWED BY (id) ON (1,2,3)",
  "CREATE TABLE ... SKEWED BY")
{code}
by
{code:scala}
checkError(
  exception = parseException("CREATE TABLE my_tab (id bigint) SKEWED BY (id) ON (1,2,3)"),
  errorClass = "...",
  parameters = Map.empty,
  context = ...)
{code}
at 
https://github.com/apache/spark/blob/1a1341910249d545365ba3d6679c6943896dde22/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala#L546




> Check error classes in HiveQuerySuite
> -
>
> Key: SPARK-40888
> URL: https://issues.apache.org/jira/browse/SPARK-40888
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: BingKun Pan
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in HiveQuerySuite by using checkError() instead of 
> assertUnsupportedFeature



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40751) Migrate type check failures of high order functions onto error classes

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622808#comment-17622808
 ] 

Apache Spark commented on SPARK-40751:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38359

> Migrate type check failures of high order functions onto error classes
> --
>
> Key: SPARK-40751
> URL: https://issues.apache.org/jira/browse/SPARK-40751
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the 
> higher-order function expressions:
> 1. ArraySort (2): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L403-L407
> 2. ArrayAggregate (1): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L807
> 3. MapZipWith (1):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L1028
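
(Sketch of the mechanical shape of such a replacement inside a checkInputDataTypes() override; the sub-class and parameter keys below are illustrative, not necessarily the ones the eventual PR uses, and toSQLType/toSQLExpr stand for the formatting helpers from QueryErrorsBase.)

{code:scala}
// before: a free-form failure message (`left` stands for the first child)
TypeCheckResult.TypeCheckFailure(
  s"argument 1 requires an array type, found ${left.dataType.catalogString}")

// after: a structured result carrying an error sub-class plus parameters
// (sub-class name and parameter keys are illustrative)
DataTypeMismatch(
  errorSubClass = "UNEXPECTED_INPUT_TYPE",
  messageParameters = Map(
    "paramIndex" -> "1",
    "requiredType" -> toSQLType(ArrayType(StringType)),
    "inputSql" -> toSQLExpr(left),
    "inputType" -> toSQLType(left.dataType)))
{code}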



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40751) Migrate type check failures of high order functions onto error classes

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40751:


Assignee: (was: Apache Spark)

> Migrate type check failures of high order functions onto error classes
> --
>
> Key: SPARK-40751
> URL: https://issues.apache.org/jira/browse/SPARK-40751
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the 
> higher-order function expressions:
> 1. ArraySort (2): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L403-L407
> 2. ArrayAggregate (1): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L807
> 3. MapZipWith (1):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L1028



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40751) Migrate type check failures of high order functions onto error classes

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622807#comment-17622807
 ] 

Apache Spark commented on SPARK-40751:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38359

> Migrate type check failures of high order functions onto error classes
> --
>
> Key: SPARK-40751
> URL: https://issues.apache.org/jira/browse/SPARK-40751
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the 
> higher-order function expressions:
> 1. ArraySort (2): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L403-L407
> 2. ArrayAggregate (1): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L807
> 3. MapZipWith (1):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L1028



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40751) Migrate type check failures of high order functions onto error classes

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40751:


Assignee: Apache Spark

> Migrate type check failures of high order functions onto error classes
> --
>
> Key: SPARK-40751
> URL: https://issues.apache.org/jira/browse/SPARK-40751
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the 
> higher-order function expressions:
> 1. ArraySort (2): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L403-L407
> 2. ArrayAggregate (1): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L807
> 3. MapZipWith (1):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L1028



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40588) Sorting issue with partitioned-writing and AQE turned on

2022-10-23 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-40588:
--
Summary: Sorting issue with partitioned-writing and AQE turned on  (was: 
Sorting issue with AQE turned on  )

> Sorting issue with partitioned-writing and AQE turned on
> 
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Priority: Major
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition; however, we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 
> Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) 
> fixes the issue.
> I'm working on identifying why AQE affects the sort order. Any leads or 
> thoughts would be appreciated!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40588) Sorting issue with AQE turned on

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40588:


Assignee: (was: Apache Spark)

> Sorting issue with AQE turned on  
> --
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Priority: Major
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition; however, we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 
> Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) 
> fixes the issue.
> I'm working on identifying why AQE affects the sort order. Any leads or 
> thoughts would be appreciated!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40588) Sorting issue with AQE turned on

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40588:


Assignee: Apache Spark

> Sorting issue with AQE turned on  
> --
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Assignee: Apache Spark
>Priority: Major
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition; however, we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 
> Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) 
> fixes the issue.
> I'm working on identifying why AQE affects the sort order. Any leads or 
> thoughts would be appreciated!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40588) Sorting issue with AQE turned on

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622804#comment-17622804
 ] 

Apache Spark commented on SPARK-40588:
--

User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/38358

> Sorting issue with AQE turned on  
> --
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Priority: Major
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition; however, we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 
> Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) 
> fixes the issue.
> I'm working on identifying why AQE affects the sort order. Any leads or 
> thoughts would be appreciated!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40887) Allow Spark on K8s to integrate w/ Log Service

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40887:


Assignee: (was: Apache Spark)

> Allow Spark on K8s to integrate w/ Log Service
> --
>
> Key: SPARK-40887
> URL: https://issues.apache.org/jira/browse/SPARK-40887
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Priority: Major
>
> https://docs.google.com/document/d/1MfB39LD4B4Rp7MDRxZbMKMbdNSe6V6mBmMQ-gkCnM-0/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40887) Allow Spark on K8s to integrate w/ Log Service

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40887:


Assignee: Apache Spark

> Allow Spark on K8s to integrate w/ Log Service
> --
>
> Key: SPARK-40887
> URL: https://issues.apache.org/jira/browse/SPARK-40887
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Major
>
> https://docs.google.com/document/d/1MfB39LD4B4Rp7MDRxZbMKMbdNSe6V6mBmMQ-gkCnM-0/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40887) Allow Spark on K8s to integrate w/ Log Service

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622802#comment-17622802
 ] 

Apache Spark commented on SPARK-40887:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/38357

> Allow Spark on K8s to integrate w/ Log Service
> --
>
> Key: SPARK-40887
> URL: https://issues.apache.org/jira/browse/SPARK-40887
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Priority: Major
>
> https://docs.google.com/document/d/1MfB39LD4B4Rp7MDRxZbMKMbdNSe6V6mBmMQ-gkCnM-0/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40887) Allow Spark on K8s to integrate w/ Log Service

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40887:


Assignee: Apache Spark

> Allow Spark on K8s to integrate w/ Log Service
> --
>
> Key: SPARK-40887
> URL: https://issues.apache.org/jira/browse/SPARK-40887
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Major
>
> https://docs.google.com/document/d/1MfB39LD4B4Rp7MDRxZbMKMbdNSe6V6mBmMQ-gkCnM-0/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40887) Allow Spark on K8s to integrate w/ Log Service

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622803#comment-17622803
 ] 

Apache Spark commented on SPARK-40887:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/38357

> Allow Spark on K8s to integrate w/ Log Service
> --
>
> Key: SPARK-40887
> URL: https://issues.apache.org/jira/browse/SPARK-40887
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Cheng Pan
>Priority: Major
>
> https://docs.google.com/document/d/1MfB39LD4B4Rp7MDRxZbMKMbdNSe6V6mBmMQ-gkCnM-0/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40887) Allow Spark on K8s to integrate w/ Log Service

2022-10-23 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-40887:
-

 Summary: Allow Spark on K8s to integrate w/ Log Service
 Key: SPARK-40887
 URL: https://issues.apache.org/jira/browse/SPARK-40887
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Cheng Pan


https://docs.google.com/document/d/1MfB39LD4B4Rp7MDRxZbMKMbdNSe6V6mBmMQ-gkCnM-0/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622797#comment-17622797
 ] 

Apache Spark commented on SPARK-40885:
--

User 'ming95' has created a pull request for this issue:
https://github.com/apache/spark/pull/38356

> Spark will filter out data field sorting when dynamic partitions and data 
> fields are sorted at the same time
> 
>
> Key: SPARK-40885
> URL: https://issues.apache.org/jira/browse/SPARK-40885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.3.0, 3.2.2
>Reporter: ming95
>Priority: Major
> Attachments: 1666494504884.jpg
>
>
> When writing with dynamic partitions and sorting by both partition and data 
> fields, Spark drops the sort on the data fields.
>  
> SQL to reproduce:
> {code:java}
> CREATE TABLE `sort_table`(
>   `id` int,
>   `name` string
>   )
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'sort_table';
> CREATE TABLE `test_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION
>   'test_table';
> -- generate test data
> insert into test_table partition(dt=20221011) select 10,"15" union all
> select 1,"10" union all select 5,"50" union all select 20,"2" union all
> select 30,"14";
> set spark.hadoop.hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> -- this SQL sorts by the partition field (`dt`) and the data field (`name`),
> -- but the sort on `name` does not take effect
> insert overwrite table sort_table partition(dt) select id,name,dt from
> test_table order by name,dt;
> {code}
>  
> The Sort operator in the DAG has only one sort field, but there are actually 
> two in the SQL. (See the attached image.)
>  
> It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40885:


Assignee: Apache Spark

> Spark will filter out data field sorting when dynamic partitions and data 
> fields are sorted at the same time
> 
>
> Key: SPARK-40885
> URL: https://issues.apache.org/jira/browse/SPARK-40885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.3.0, 3.2.2
>Reporter: ming95
>Assignee: Apache Spark
>Priority: Major
> Attachments: 1666494504884.jpg
>
>
> When writing with dynamic partitions and sorting by both partition and data 
> fields, Spark drops the sort on the data fields.
>  
> SQL to reproduce:
> {code:java}
> CREATE TABLE `sort_table`(
>   `id` int,
>   `name` string
>   )
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'sort_table';
> CREATE TABLE `test_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION
>   'test_table';
> -- generate test data
> insert into test_table partition(dt=20221011) select 10,"15" union all
> select 1,"10" union all select 5,"50" union all select 20,"2" union all
> select 30,"14";
> set spark.hadoop.hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> -- this SQL sorts by the partition field (`dt`) and the data field (`name`),
> -- but the sort on `name` does not take effect
> insert overwrite table sort_table partition(dt) select id,name,dt from
> test_table order by name,dt;
> {code}
>  
> The Sort operator in the DAG has only one sort field, but there are actually 
> two in the SQL. (See the attached image.)
>  
> It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40885) Spark will filter out data field sorting when dynamic partitions and data fields are sorted at the same time

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40885:


Assignee: (was: Apache Spark)

> Spark will filter out data field sorting when dynamic partitions and data 
> fields are sorted at the same time
> 
>
> Key: SPARK-40885
> URL: https://issues.apache.org/jira/browse/SPARK-40885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.3.0, 3.2.2
>Reporter: ming95
>Priority: Major
> Attachments: 1666494504884.jpg
>
>
> When writing with dynamic partitions and sorting by both partition and data 
> fields, Spark drops the sort on the data fields.
>  
> SQL to reproduce:
> {code:java}
> CREATE TABLE `sort_table`(
>   `id` int,
>   `name` string
>   )
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION 'sort_table';
> CREATE TABLE `test_table`(
>   `id` int,
>   `name` string)
> PARTITIONED BY (
>   `dt` string)
> stored as textfile
> LOCATION
>   'test_table';
> -- generate test data
> insert into test_table partition(dt=20221011) select 10,"15" union all
> select 1,"10" union all select 5,"50" union all select 20,"2" union all
> select 30,"14";
> set spark.hadoop.hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> -- this SQL sorts by the partition field (`dt`) and the data field (`name`),
> -- but the sort on `name` does not take effect
> insert overwrite table sort_table partition(dt) select id,name,dt from
> test_table order by name,dt;
> {code}
>  
> The Sort operator in the DAG has only one sort field, but there are actually 
> two in the SQL. (See the attached image.)
>  
> It relates to this issue: https://issues.apache.org/jira/browse/SPARK-40588



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40886) Bump Jackson Databind 2.13.4.2

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622789#comment-17622789
 ] 

Apache Spark commented on SPARK-40886:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/38355

> Bump Jackson Databind 2.13.4.2
> --
>
> Key: SPARK-40886
> URL: https://issues.apache.org/jira/browse/SPARK-40886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Cheng Pan
>Priority: Major
>
> Jackson 2.13.4.1 has a regression that affects Gradle:
> https://github.com/FasterXML/jackson-databind/issues/3627



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40886) Bump Jackson Databind 2.13.4.2

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40886:


Assignee: Apache Spark

> Bump Jackson Databind 2.13.4.2
> --
>
> Key: SPARK-40886
> URL: https://issues.apache.org/jira/browse/SPARK-40886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Major
>
> Jackson 2.13.4.1 has a regression that affects Gradle:
> https://github.com/FasterXML/jackson-databind/issues/3627



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40886) Bump Jackson Databind 2.13.4.2

2022-10-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40886:


Assignee: (was: Apache Spark)

> Bump Jackson Databind 2.13.4.2
> --
>
> Key: SPARK-40886
> URL: https://issues.apache.org/jira/browse/SPARK-40886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Cheng Pan
>Priority: Major
>
> Jackson 2.13.4.1 has a regression that affects Gradle:
> https://github.com/FasterXML/jackson-databind/issues/3627



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40886) Bump Jackson Databind 2.13.4.2

2022-10-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622788#comment-17622788
 ] 

Apache Spark commented on SPARK-40886:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/38355

> Bump Jackson Databind 2.13.4.2
> --
>
> Key: SPARK-40886
> URL: https://issues.apache.org/jira/browse/SPARK-40886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Cheng Pan
>Priority: Major
>
> Jackson 2.13.4.1 has a regression that affects Gradle:
> https://github.com/FasterXML/jackson-databind/issues/3627



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40886) Bump Jackson Databind 2.13.4.2

2022-10-23 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-40886:
-

 Summary: Bump Jackson Databind 2.13.4.2
 Key: SPARK-40886
 URL: https://issues.apache.org/jira/browse/SPARK-40886
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0, 3.3.1
Reporter: Cheng Pan


Jackson 2.13.4.1 has a regression that affects Gradle:

https://github.com/FasterXML/jackson-databind/issues/3627
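
(For downstream builds that pick the version up transitively, an sbt-style override would look like the sketch below; Spark itself bumps the version in its Maven pom, so this is illustration only.)

{code:scala}
// illustration: force the patched release in an sbt build
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.13.4.2"
{code}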



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40760) Migrate type check failures of interval expressions onto error classes

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40760.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38237
[https://github.com/apache/spark/pull/38237]

> Migrate type check failures of interval expressions onto error classes
> --
>
> Key: SPARK-40760
> URL: https://issues.apache.org/jira/browse/SPARK-40760
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the interval 
> expressions:
> 1. Average (1):
> https://github.com/apache/spark/blob/47d119dfc1a06ee2d520396129b4f09bc22d3fb7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala#L78
> 2. ApproxCountDistinctForIntervals (3):
> https://github.com/apache/spark/blob/08123a3795683238352e5bf55452de381349fdd9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala#L80-L91



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40756) Migrate type check failures of string expressions onto error classes

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40756:


Assignee: BingKun Pan

> Migrate type check failures of string expressions onto error classes
> 
>
> Key: SPARK-40756
> URL: https://issues.apache.org/jira/browse/SPARK-40756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: BingKun Pan
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the string and 
> regexp expressions:
> 1. Elt (3):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L276-L284
> 2. RegExpReplace (2):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L597-L604



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40756) Migrate type check failures of string expressions onto error classes

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40756.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38299
[https://github.com/apache/spark/pull/38299]

> Migrate type check failures of string expressions onto error classes
> 
>
> Key: SPARK-40756
> URL: https://issues.apache.org/jira/browse/SPARK-40756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.4.0
>
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the string and 
> regexp expressions:
> 1. Elt (3):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L276-L284
> 2. RegExpReplace (2):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L597-L604



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-40588) Sorting issue with AQE turned on

2022-10-23 Thread zzzzming95 (Jira)


[ https://issues.apache.org/jira/browse/SPARK-40588 ]


ming95 deleted comment on SPARK-40588:


was (Author: zing):
Yes, I found the same problem. This should be a bug in Spark. When the sorting 
field is the same as the dynamic partitioning field, the sort on the 
non-partitioning fields is filtered out.

> Sorting issue with AQE turned on  
> --
>
> Key: SPARK-40588
> URL: https://issues.apache.org/jira/browse/SPARK-40588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.3
> Environment: Spark v3.1.3
> Scala v2.12.13
>Reporter: Swetha Baskaran
>Priority: Major
> Attachments: image-2022-10-16-22-05-47-159.png
>
>
> We are attempting to partition data by a few columns, sort by a particular 
> _sortCol_ and write out one file per partition. 
> {code:java}
> df
>     .repartition(col("day"), col("month"), col("year"))
>     .withColumn("partitionId",spark_partition_id)
>     .withColumn("monotonicallyIncreasingIdUnsorted",monotonicallyIncreasingId)
>     .sortWithinPartitions("year", "month", "day", "sortCol")
>     .withColumn("monotonicallyIncreasingIdSorted",monotonicallyIncreasingId)
>     .write
>     .partitionBy("year", "month", "day")
>     .parquet(path){code}
> When inspecting the results, we observe one file per partition; however, we 
> see an _alternating_ pattern of unsorted rows in some files.
> {code:java}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832121344,"monotonicallyIncreasingIdSorted":6287832121344}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877022389,"monotonicallyIncreasingIdSorted":6287876860586}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287877567881,"monotonicallyIncreasingIdSorted":6287832121345}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287835105553,"monotonicallyIncreasingIdSorted":6287876860587}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832570127,"monotonicallyIncreasingIdSorted":6287832121346}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287879965760,"monotonicallyIncreasingIdSorted":6287876860588}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287878762347,"monotonicallyIncreasingIdSorted":6287832121347}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287837165012,"monotonicallyIncreasingIdSorted":6287876860589}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287832910545,"monotonicallyIncreasingIdSorted":6287832121348}
> {"sortCol":1303413,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287881244758,"monotonicallyIncreasingIdSorted":6287876860590}
> {"sortCol":10,"partitionId":732,"monotonicallyIncreasingIdUnsorted":6287880041345,"monotonicallyIncreasingIdSorted":6287832121349}{code}
> Here is a 
> [gist|https://gist.github.com/Swebask/543030748a768be92d3c0ae343d2ae89] to 
> reproduce the issue. 
> Turning off AQE with spark.conf.set("spark.sql.adaptive.enabled", false) 
> fixes the issue.
> I'm working on identifying why AQE affects the sort order. Any leads or 
> thoughts would be appreciated!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40391) Test the error class UNSUPPORTED_FEATURE.JDBC_TRANSACTION

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40391:


Assignee: BingKun Pan

> Test the error class UNSUPPORTED_FEATURE.JDBC_TRANSACTION
> -
>
> Key: SPARK-40391
> URL: https://issues.apache.org/jira/browse/SPARK-40391
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: BingKun Pan
>Priority: Minor
>
> Add a test for the error class UNSUPPORTED_FEATURE.JDBC_TRANSACTION and place 
> it in QueryExecutionErrorsSuite.
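
(A hypothetical shape for such a test: how the error is triggered and the intercepted exception type are placeholders here; only the error class name comes from this ticket.)

{code:scala}
// hypothetical: assumes a JDBC v2 catalog `h2` whose dialect reports no
// transaction support; statement and exception type are placeholders
checkError(
  exception = intercept[SparkSQLFeatureNotSupportedException] {
    sql("INSERT INTO h2.test.people VALUES ('alice', 1)")
  },
  errorClass = "UNSUPPORTED_FEATURE.JDBC_TRANSACTION")
{code}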



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40391) Test the error class UNSUPPORTED_FEATURE.JDBC_TRANSACTION

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40391.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38351
[https://github.com/apache/spark/pull/38351]

> Test the error class UNSUPPORTED_FEATURE.JDBC_TRANSACTION
> -
>
> Key: SPARK-40391
> URL: https://issues.apache.org/jira/browse/SPARK-40391
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> Add a test for the error class UNSUPPORTED_FEATURE.JDBC_TRANSACTION and place 
> it in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37945) Use error classes in the execution errors of arithmetic ops

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-37945:


Assignee: Khalid Mammadov

> Use error classes in the execution errors of arithmetic ops
> ---
>
> Key: SPARK-37945
> URL: https://issues.apache.org/jira/browse/SPARK-37945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Khalid Mammadov
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * overflowInSumOfDecimalError
> * overflowInIntegralDivideError
> * arithmeticOverflowError
> * unaryMinusCauseOverflowError
> * binaryArithmeticCauseOverflowError
> * unscaledValueTooLargeForPrecisionError
> * decimalPrecisionExceedsMaxPrecisionError
> * outOfDecimalTypeRangeError
> * integerOverflowError
> onto error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.
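
(A minimal sketch of the idea, not the actual Spark constructors, which differ across versions: the replacement exception implements SparkThrowable so that QueryExecutionErrorsSuite can assert on its error class with checkError().)

{code:scala}
import org.apache.spark.SparkThrowable

// sketch only: a SparkThrowable-carrying replacement for a plain
// ArithmeticException; the real change reuses Spark's exception classes
class OverflowArithmeticException(message: String)
  extends ArithmeticException(message) with SparkThrowable {
  override def getErrorClass: String = "ARITHMETIC_OVERFLOW"
}
{code}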



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37945) Use error classes in the execution errors of arithmetic ops

2022-10-23 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-37945.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38273
[https://github.com/apache/spark/pull/38273]

> Use error classes in the execution errors of arithmetic ops
> ---
>
> Key: SPARK-37945
> URL: https://issues.apache.org/jira/browse/SPARK-37945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Khalid Mammadov
>Priority: Major
> Fix For: 3.4.0
>
>
> Migrate the following errors in QueryExecutionErrors:
> * overflowInSumOfDecimalError
> * overflowInIntegralDivideError
> * arithmeticOverflowError
> * unaryMinusCauseOverflowError
> * binaryArithmeticCauseOverflowError
> * unscaledValueTooLargeForPrecisionError
> * decimalPrecisionExceedsMaxPrecisionError
> * outOfDecimalTypeRangeError
> * integerOverflowError
> onto error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org