[spark] branch master updated: [SPARK-39904][SQL] Rename inferDate to prefersDate and clarify semantics of the option in CSV data source

2022-08-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 31ab8bc4d5d [SPARK-39904][SQL] Rename inferDate to prefersDate and 
clarify semantics of the option in CSV data source
31ab8bc4d5d is described below

commit 31ab8bc4d5df0c84cb5542200b286defea037cc2
Author: Ivan Sadikov 
AuthorDate: Wed Aug 3 13:06:44 2022 +0900

[SPARK-39904][SQL] Rename inferDate to prefersDate and clarify semantics of 
the option in CSV data source

### What changes were proposed in this pull request?

This is a follow-up for https://github.com/apache/spark/pull/36871.

This PR renames `inferDate` to `prefersDate` to avoid confusion when date inference changes the column type even though the user meant to infer timestamps.

The patch also updates the semantics of the option: `prefersDate` can be used during schema inference (`inferSchema`) as well as with a user-provided schema, where it acts as a fallback mechanism when parsing timestamps.
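
For illustration, here is a minimal usage sketch, assuming a `SparkSession` named `spark` and a hypothetical CSV path; the option values are examples only, not taken from the patch:

```scala
// With inferSchema, columns whose values match dateFormat may be inferred as DateType.
// With a user-provided schema, dateFormat is tried as a fallback for timestamp columns,
// and the parsed dates are cast to timestamps afterwards.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("prefersDate", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .csv("/path/to/data.csv")
df.printSchema()
```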

### Why are the changes needed?

Fixes ambiguity when setting `prefersDate` to true and clarifies semantics 
of the option.

### Does this PR introduce _any_ user-facing change?

Although it is an option rename, the original PR was merged a few days ago 
and the config option has not been included in a Spark release.

### How was this patch tested?

I added a unit test for `prefersDate = true` with a user-provided schema.

Closes #37327 from sadikovi/rename_config.

Authored-by: Ivan Sadikov 
Signed-off-by: Hyukjin Kwon 
---
 docs/sql-data-sources-csv.md   |  8 ++---
 docs/sql-data-sources-json.md  |  4 +--
 .../spark/sql/catalyst/csv/CSVInferSchema.scala|  6 ++--
 .../apache/spark/sql/catalyst/csv/CSVOptions.scala | 24 -
 .../spark/sql/catalyst/csv/UnivocityParser.scala   |  4 +--
 .../sql/catalyst/csv/CSVInferSchemaSuite.scala | 10 +++---
 .../sql/catalyst/csv/UnivocityParserSuite.scala|  4 +--
 .../sql/execution/datasources/csv/CSVSuite.scala   | 42 --
 8 files changed, 72 insertions(+), 30 deletions(-)

diff --git a/docs/sql-data-sources-csv.md b/docs/sql-data-sources-csv.md
index 7b538528219..98d31a59ac7 100644
--- a/docs/sql-data-sources-csv.md
+++ b/docs/sql-data-sources-csv.md
@@ -109,9 +109,9 @@ Data source options of CSV can be set via:
 read
   
   
-inferDate 
+prefersDate
 false
-Whether or not to infer columns that satisfy the 
dateFormat option as Date. Requires 
inferSchema to be true. When false, 
columns with dates will be inferred as String (or as 
Timestamp if it fits the timestampFormat).
+During schema inference (inferSchema), attempts to infer 
string columns that contain dates or timestamps as Date if the 
values satisfy the dateFormat option and failed to be parsed by 
the respective formatter. With a user-provided schema, attempts to parse 
timestamp columns as dates using dateFormat if they fail to 
conform to timestampFormat, in this case the parsed values will be 
cast to timestamp type afterwards.
 read
   
   
@@ -176,8 +176,8 @@ Data source options of CSV can be set via:
   
   
 enableDateTimeParsingFallback
-Enabled if the time parser policy is legacy or no custom date or 
timestamp pattern was provided
-Allows to fall back to the backward compatible (Spark 1.x and 2.0) 
behavior of parsing dates and timestamps if values do not match the set 
patterns.
+Enabled if the time parser policy has legacy settings or if no custom 
date or timestamp pattern was provided.
+Allows falling back to the backward compatible (Spark 1.x and 2.0) 
behavior of parsing dates and timestamps if values do not match the set 
patterns.
 read
   
   
diff --git a/docs/sql-data-sources-json.md b/docs/sql-data-sources-json.md
index 500cd65b58b..a0772dd3656 100644
--- a/docs/sql-data-sources-json.md
+++ b/docs/sql-data-sources-json.md
@@ -204,8 +204,8 @@ Data source options of JSON can be set via:
   
   
 enableDateTimeParsingFallback
-Enabled if the time parser policy is legacy or no custom date or 
timestamp pattern was provided
-Allows to fall back to the backward compatible (Spark 1.x and 2.0) 
behavior of parsing dates and timestamps if values do not match the set 
patterns.
+Enabled if the time parser policy has legacy settings or if no custom 
date or timestamp pattern was provided.
+Allows falling back to the backward compatible (Spark 1.x and 2.0) 
behavior of parsing dates and timestamps if values do not match the set 
patterns.
 read
   
   
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 

[spark] branch branch-3.3 updated: [SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode

2022-08-02 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 2254240dba4 [SPARK-39867][SQL] Global limit should not inherit 
OrderPreservingUnaryNode
2254240dba4 is described below

commit 2254240dba4a71d9a68a22ca9a83080351fa3343
Author: ulysses-you 
AuthorDate: Wed Aug 3 11:59:22 2022 +0800

[SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode

Make GlobalLimit inherit UnaryNode rather than OrderPreservingUnaryNode

A global limit cannot promise that its output ordering is the same as its child's; it actually depends on the specific physical plan.

For all physical plans with global limits:
- CollectLimitExec: it does not promise any output ordering
- GlobalLimitExec: it requires all tuples, so it can assume its child is a shuffle or a single partition, and can therefore use the output ordering of the child
- TakeOrderedAndProjectExec: it does the sort inside its implementation

This bug gets worse now that we pull out the required ordering of v1 writes.

Yes, it is a bug fix.

Fixed an existing test and added a new test.

Closes #37284 from ulysses-you/sort.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
(cherry picked from commit e9cc1024df4d587a0f456842d495db91984ed9db)
Signed-off-by: Wenchen Fan 
---
 .../catalyst/plans/logical/basicLogicalOperators.scala|  7 ++-
 .../sql/catalyst/optimizer/EliminateSortsSuite.scala  | 15 ++-
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index 774f6956162..e12a5918ee0 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -1248,8 +1248,13 @@ object Limit {
  * A global (coordinated) limit. This operator can emit at most `limitExpr` 
number in total.
  *
  * See [[Limit]] for more information.
+ *
+ * Note that, we can not make it inherit [[OrderPreservingUnaryNode]] due to 
the different strategy
+ * of physical plan. The output ordering of child will be broken if a shuffle 
exchange comes in
+ * between the child and global limit, due to the fact that shuffle reader 
fetches blocks in random
+ * order.
  */
-case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends 
OrderPreservingUnaryNode {
+case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends 
UnaryNode {
   override def output: Seq[Attribute] = child.output
   override def maxRows: Option[Long] = {
 limitExpr match {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala
index 7ceac3b3000..b97dc455dad 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala
@@ -115,9 +115,9 @@ class EliminateSortsSuite extends AnalysisTest {
 
   test("SPARK-33183: remove redundant sort by") {
 val orderedPlan = testRelation.select('a, 'b).orderBy('a.asc, 
'b.desc_nullsFirst)
-val unnecessaryReordered = orderedPlan.limit(2).select('a).sortBy('a.asc, 
'b.desc_nullsFirst)
+val unnecessaryReordered = LocalLimit(2, 
orderedPlan).select('a).sortBy('a.asc, 'b.desc_nullsFirst)
 val optimized = Optimize.execute(unnecessaryReordered.analyze)
-val correctAnswer = orderedPlan.limit(2).select('a).analyze
+val correctAnswer = LocalLimit(2, orderedPlan).select('a).analyze
 comparePlans(optimized, correctAnswer)
   }
 
@@ -161,11 +161,11 @@ class EliminateSortsSuite extends AnalysisTest {
 comparePlans(optimized, correctAnswer)
   }
 
-  test("SPARK-33183: limits should not affect order for local sort") {
+  test("SPARK-33183: local limits should not affect order for local sort") {
 val orderedPlan = testRelation.select('a, 'b).orderBy('a.asc, 'b.desc)
-val filteredAndReordered = orderedPlan.limit(Literal(10)).sortBy('a.asc, 
'b.desc)
+val filteredAndReordered = LocalLimit(10, orderedPlan).sortBy('a.asc, 
'b.desc)
 val optimized = Optimize.execute(filteredAndReordered.analyze)
-val correctAnswer = orderedPlan.limit(Literal(10)).analyze
+val correctAnswer = LocalLimit(10, orderedPlan).analyze
 comparePlans(optimized, correctAnswer)
   }
 
@@ -442,4 +442,9 @@ class EliminateSortsSuite extends AnalysisTest {
   .sortBy($"c".asc).analyze
 comparePlans(Optimize.execute(plan3), expected3)
   }
+
+  

[spark] branch master updated: [SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode

2022-08-02 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e9cc1024df4 [SPARK-39867][SQL] Global limit should not inherit 
OrderPreservingUnaryNode
e9cc1024df4 is described below

commit e9cc1024df4d587a0f456842d495db91984ed9db
Author: ulysses-you 
AuthorDate: Wed Aug 3 11:59:22 2022 +0800

[SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode

### What changes were proposed in this pull request?

Make GlobalLimit inherit UnaryNode rather than OrderPreservingUnaryNode

### Why are the changes needed?

A global limit cannot promise that its output ordering is the same as its child's; it actually depends on the specific physical plan.

For all physical plans with global limits:
- CollectLimitExec: it does not promise any output ordering
- GlobalLimitExec: it requires all tuples, so it can assume its child is a shuffle or a single partition, and can therefore use the output ordering of the child
- TakeOrderedAndProjectExec: it does the sort inside its implementation

This bug gets worse now that we pull out the required ordering of v1 writes.
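
As a sketch of the hazard (assuming a `SparkSession` named `spark`; this snippet is illustrative and not part of the patch):

```scala
import org.apache.spark.sql.functions.col

val df = spark.range(0, 1000).toDF("id")
df.orderBy(col("id").asc)       // global sort below the limit
  .limit(100)                   // GlobalLimit: output ordering is no longer promised here
  .sortWithinPartitions("id")   // this sort must be kept; eliminating it would be wrong
  .collect()
```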

### Does this PR introduce _any_ user-facing change?

Yes, it is a bug fix.

### How was this patch tested?

Fixed an existing test and added a new test.

Closes #37284 from ulysses-you/sort.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/plans/logical/basicLogicalOperators.scala|  7 ++-
 .../sql/catalyst/optimizer/EliminateSortsSuite.scala  | 15 ++-
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index ef5f87b23ec..5d288dc323f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -1448,8 +1448,13 @@ object Limit {
  * A global (coordinated) limit. This operator can emit at most `limitExpr` 
number in total.
  *
  * See [[Limit]] for more information.
+ *
+ * Note that, we can not make it inherit [[OrderPreservingUnaryNode]] due to 
the different strategy
+ * of physical plan. The output ordering of child will be broken if a shuffle 
exchange comes in
+ * between the child and global limit, due to the fact that shuffle reader 
fetches blocks in random
+ * order.
  */
-case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends 
OrderPreservingUnaryNode {
+case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends 
UnaryNode {
   override def output: Seq[Attribute] = child.output
   override def maxRows: Option[Long] = {
 limitExpr match {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala
index 9dcc9f89790..edd840d63f9 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala
@@ -116,10 +116,10 @@ class EliminateSortsSuite extends AnalysisTest {
 
   test("SPARK-33183: remove redundant sort by") {
 val orderedPlan = testRelation.select($"a", $"b").orderBy($"a".asc, 
$"b".desc_nullsFirst)
-val unnecessaryReordered = orderedPlan.limit(2).select($"a")
+val unnecessaryReordered = LocalLimit(2, orderedPlan).select($"a")
   .sortBy($"a".asc, $"b".desc_nullsFirst)
 val optimized = Optimize.execute(unnecessaryReordered.analyze)
-val correctAnswer = orderedPlan.limit(2).select($"a").analyze
+val correctAnswer = LocalLimit(2, orderedPlan).select($"a").analyze
 comparePlans(optimized, correctAnswer)
   }
 
@@ -163,11 +163,11 @@ class EliminateSortsSuite extends AnalysisTest {
 comparePlans(optimized, correctAnswer)
   }
 
-  test("SPARK-33183: limits should not affect order for local sort") {
+  test("SPARK-33183: local limits should not affect order for local sort") {
 val orderedPlan = testRelation.select($"a", $"b").orderBy($"a".asc, 
$"b".desc)
-val filteredAndReordered = orderedPlan.limit(Literal(10)).sortBy($"a".asc, 
$"b".desc)
+val filteredAndReordered = LocalLimit(10, orderedPlan).sortBy($"a".asc, 
$"b".desc)
 val optimized = Optimize.execute(filteredAndReordered.analyze)
-val correctAnswer = orderedPlan.limit(Literal(10)).analyze
+val correctAnswer = LocalLimit(10, orderedPlan).analyze
 comparePlans(optimized, correctAnswer)
   }
 
@@ -444,4 +444,9 @@ class EliminateSortsSuite extends 

[spark] branch branch-3.2 updated (d4f6541d767 -> 265bd219071)

2022-08-02 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


from d4f6541d767 [SPARK-39932][SQL] WindowExec should clear the final 
partition buffer
 add 265bd219071 [SPARK-39835][SQL][3.2] Fix EliminateSorts remove global 
sort below the local sort

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/optimizer/Optimizer.scala   | 34 +-
 .../catalyst/optimizer/EliminateSortsSuite.scala   | 20 +
 2 files changed, 47 insertions(+), 7 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.1 updated: [SPARK-39835][SQL][3.1] Fix EliminateSorts remove global sort below the local sort

2022-08-02 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 586ccce83c8 [SPARK-39835][SQL][3.1] Fix EliminateSorts remove global 
sort below the local sort
586ccce83c8 is described below

commit 586ccce83c8673b49517b84445a90055eb8098d9
Author: ulysses-you 
AuthorDate: Wed Aug 3 11:14:18 2022 +0800

[SPARK-39835][SQL][3.1] Fix EliminateSorts remove global sort below the 
local sort

backport https://github.com/apache/spark/pull/37250 into branch-3.1

### What changes were proposed in this pull request?

Correct the `EliminateSorts` rule as follows:

- If the upper sort is global then we can remove the global or local sort 
recursively.
- If the upper sort is local then we can only remove the local sort 
recursively.

### Why are the changes needed?

If a global sort is below a local sort, we should not remove the global sort because the output partitioning can be affected.

This issue is going to get worse since we pull out the V1 Write sort to the logical side.
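
As an illustration of the corrected rule (a sketch assuming a `SparkSession` named `spark` and a hypothetical output path, not code from the patch):

```scala
import org.apache.spark.sql.functions.col

val df = spark.range(0, 1000).toDF("id")
df.orderBy(col("id").desc)      // global sort: also fixes the (range) output partitioning
  .sortWithinPartitions("id")   // upper local sort
  .write.mode("overwrite").parquet("/tmp/spark-39835-example") // hypothetical path
// Before the fix, the global sort could be removed under the local sort, changing the
// output partitioning; after the fix it is kept (or replaced by a repartition).
```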

### Does this PR introduce _any_ user-facing change?

Yes, it is a bug fix.

### How was this patch tested?

Added a test.

Closes #37276 from ulysses-you/SPARK-39835-3.1.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/optimizer/Optimizer.scala   | 38 +-
 .../catalyst/optimizer/EliminateSortsSuite.scala   | 20 
 2 files changed, 50 insertions(+), 8 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index 8b777bed706..e5f531ff2f5 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -1105,23 +1105,45 @@ object EliminateSorts extends Rule[LogicalPlan] {
   }
 case Sort(orders, false, child) if 
SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
   applyLocally.lift(child).getOrElse(child)
-case s @ Sort(_, _, child) => s.copy(child = recursiveRemoveSort(child))
+case s @ Sort(_, global, child) => s.copy(child = 
recursiveRemoveSort(child, global))
 case j @ Join(originLeft, originRight, _, cond, _) if 
cond.forall(_.deterministic) =>
-  j.copy(left = recursiveRemoveSort(originLeft), right = 
recursiveRemoveSort(originRight))
+  j.copy(left = recursiveRemoveSort(originLeft, true),
+right = recursiveRemoveSort(originRight, true))
 case g @ Aggregate(_, aggs, originChild) if isOrderIrrelevantAggs(aggs) =>
-  g.copy(child = recursiveRemoveSort(originChild))
+  g.copy(child = recursiveRemoveSort(originChild, true))
   }
 
-  private def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match 
{
-case Sort(_, _, child) => recursiveRemoveSort(child)
-case other if canEliminateSort(other) =>
-  other.withNewChildren(other.children.map(recursiveRemoveSort))
-case _ => plan
+  /**
+   * If the upper sort is global then we can remove the global or local sort 
recursively.
+   * If the upper sort is local then we can only remove the local sort 
recursively.
+   */
+  private def recursiveRemoveSort(
+  plan: LogicalPlan,
+  canRemoveGlobalSort: Boolean): LogicalPlan = {
+plan match {
+  case Sort(_, global, child) if canRemoveGlobalSort || !global =>
+recursiveRemoveSort(child, canRemoveGlobalSort)
+  case Sort(sortOrder, true, child) =>
+// For this case, the upper sort is local so the ordering of present 
sort is unnecessary,
+// so here we only preserve its output partitioning using 
`RepartitionByExpression`.
+// We should use `None` as the optNumPartitions so AQE can coalesce 
shuffle partitions.
+// This behavior is same with original global sort.
+RepartitionByExpression(sortOrder, recursiveRemoveSort(child, true), 
None)
+  case other if canEliminateSort(other) =>
+other.withNewChildren(other.children.map(c => recursiveRemoveSort(c, 
canRemoveGlobalSort)))
+  case other if canEliminateGlobalSort(other) =>
+other.withNewChildren(other.children.map(c => recursiveRemoveSort(c, 
true)))
+  case _ => plan
+}
   }
 
   private def canEliminateSort(plan: LogicalPlan): Boolean = plan match {
 case p: Project => p.projectList.forall(_.deterministic)
 case f: Filter => f.condition.deterministic
+case _ => false
+  }
+
+  private def canEliminateGlobalSort(plan: LogicalPlan): Boolean = plan match {
 case r: RepartitionByExpression => 
r.partitionExpressions.forall(_.deterministic)
 case _: Repartition => true
 case _ => false
diff 

[spark] branch branch-3.3 updated (41779ea2612 -> ea6d57715fd)

2022-08-02 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


from 41779ea2612 [MINOR][DOCS] Remove generated statement about Scala 
version in docs homepage as Spark supports multiple versions
 add ea6d57715fd [SPARK-39911][SQL][3.3] Optimize global Sort to 
RepartitionByExpression

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/catalyst/optimizer/Optimizer.scala|  6 ++
 .../spark/sql/catalyst/optimizer/EliminateSortsSuite.scala | 10 +++---
 2 files changed, 13 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-39954][BUILD] Upgrade ASM to 9.3

2022-08-02 Thread viirya
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 932beda891a [SPARK-39954][BUILD] Upgrade ASM to 9.3
932beda891a is described below

commit 932beda891aadc9fc21f9d673a865a618f345776
Author: Dongjoon Hyun 
AuthorDate: Tue Aug 2 19:23:01 2022 -0700

[SPARK-39954][BUILD] Upgrade ASM to 9.3

### What changes were proposed in this pull request?

This PR aims to upgrade ASM to 9.3.

### Why are the changes needed?

This will help us use Java 19 and be ready for future releases.
- https://openjdk.org/projects/jdk/19/ (RC1: Aug 11th, GA: Sep 20th)
- https://asm.ow2.io/versions.html
- https://issues.apache.org/jira/browse/XBEAN-334 Upgrade to ASM 9.3

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #37382 from dongjoon-hyun/SPARK-39954.

Authored-by: Dongjoon Hyun 
Signed-off-by: Liang-Chi Hsieh 
---
 dev/deps/spark-deps-hadoop-2-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 6 +++---
 project/plugins.sbt   | 4 ++--
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 
b/dev/deps/spark-deps-hadoop-2-hive-2.3
index ec5fbf67388..1f43eb679bc 100644
--- a/dev/deps/spark-deps-hadoop-2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2-hive-2.3
@@ -257,7 +257,7 @@ tink/1.6.1//tink-1.6.1.jar
 transaction-api/1.1//transaction-api-1.1.jar
 univocity-parsers/2.9.1//univocity-parsers-2.9.1.jar
 velocity/1.5//velocity-1.5.jar
-xbean-asm9-shaded/4.20//xbean-asm9-shaded-4.20.jar
+xbean-asm9-shaded/4.21//xbean-asm9-shaded-4.21.jar
 xercesImpl/2.12.2//xercesImpl-2.12.2.jar
 xml-apis/1.4.01//xml-apis-1.4.01.jar
 xmlenc/0.52//xmlenc-0.52.jar
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index db2e022e5ff..6b275d852a7 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -247,7 +247,7 @@ transaction-api/1.1//transaction-api-1.1.jar
 univocity-parsers/2.9.1//univocity-parsers-2.9.1.jar
 velocity/1.5//velocity-1.5.jar
 wildfly-openssl/1.0.7.Final//wildfly-openssl-1.0.7.Final.jar
-xbean-asm9-shaded/4.20//xbean-asm9-shaded-4.20.jar
+xbean-asm9-shaded/4.21//xbean-asm9-shaded-4.21.jar
 xz/1.8//xz-1.8.jar
 zjsonpatch/0.3.0//zjsonpatch-0.3.0.jar
 zookeeper-jute/3.6.2//zookeeper-jute-3.6.2.jar
diff --git a/pom.xml b/pom.xml
index 0cb83b8debd..bf33d0265da 100644
--- a/pom.xml
+++ b/pom.xml
@@ -472,7 +472,7 @@
   
 org.apache.xbean
 xbean-asm9-shaded
-4.20
+4.21
   
 
   

[spark] branch master updated: [SPARK-39776][SQL][FOLLOW] Update UT of PlanStabilitySuite in ANSI mode

2022-08-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dfcf2a8b456 [SPARK-39776][SQL][FOLLOW] Update UT of PlanStabilitySuite 
in ANSI mode
dfcf2a8b456 is described below

commit dfcf2a8b45679ed3e89fd0f3bfdb25c8043bee3e
Author: Angerszh 
AuthorDate: Wed Aug 3 11:19:52 2022 +0900

[SPARK-39776][SQL][FOLLOW] Update UT of PlanStabilitySuite in ANSI mode

### What changes were proposed in this pull request?
The current verbose string of a plan did not contain the join type; we should add it:
```
(10) BroadcastHashJoin [codegen id : 4]
Left keys [1]: [ws_sold_date_sk#7]
Right keys [1]: [d_date_sk#9]
Join type: Inner
Join condition: None
```
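
For reference, a sketch of how such verbose output can be produced, assuming the TPC-DS tables (`web_sales`, `date_dim`) are registered; the query is illustrative only:

```scala
// explain("formatted") prints the per-operator details, now including "Join type".
spark.sql(
  """SELECT *
    |FROM web_sales ws JOIN date_dim d
    |  ON ws.ws_sold_date_sk = d.d_date_sk
    |""".stripMargin)
  .explain("formatted")
```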

### Why are the changes needed?
Adds the missing join type to the verbose plan.

### Does this PR introduce _any_ user-facing change?
Yes, users can now see the join type in the verbose plan string.

### How was this patch tested?
Existing UTs.

Closes #37378 from AngersZh/SPARK-39776-FOLLOW_UP.

Authored-by: Angerszh 
Signed-off-by: Hyukjin Kwon 
---
 .../approved-plans-v1_4/q83.ansi/explain.txt   | 10 ++
 .../approved-plans-v1_4/q83.sf100.ansi/explain.txt | 10 ++
 2 files changed, 20 insertions(+)

diff --git 
a/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.ansi/explain.txt
 
b/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.ansi/explain.txt
index 905d29293a3..d0246aa1c9b 100644
--- 
a/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.ansi/explain.txt
+++ 
b/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.ansi/explain.txt
@@ -83,6 +83,7 @@ Arguments: HashedRelationBroadcastMode(List(cast(input[0, 
int, false] as bigint)
 (8) BroadcastHashJoin [codegen id : 5]
 Left keys [1]: [sr_item_sk#1]
 Right keys [1]: [i_item_sk#5]
+Join type: Inner
 Join condition: None
 
 (9) Project [codegen id : 5]
@@ -95,6 +96,7 @@ Output [1]: [d_date_sk#7]
 (11) BroadcastHashJoin [codegen id : 5]
 Left keys [1]: [sr_returned_date_sk#3]
 Right keys [1]: [d_date_sk#7]
+Join type: Inner
 Join condition: None
 
 (12) Project [codegen id : 5]
@@ -140,6 +142,7 @@ Output [2]: [i_item_sk#16, i_item_id#17]
 (20) BroadcastHashJoin [codegen id : 10]
 Left keys [1]: [cr_item_sk#13]
 Right keys [1]: [i_item_sk#16]
+Join type: Inner
 Join condition: None
 
 (21) Project [codegen id : 10]
@@ -152,6 +155,7 @@ Output [1]: [d_date_sk#18]
 (23) BroadcastHashJoin [codegen id : 10]
 Left keys [1]: [cr_returned_date_sk#15]
 Right keys [1]: [d_date_sk#18]
+Join type: Inner
 Join condition: None
 
 (24) Project [codegen id : 10]
@@ -183,6 +187,7 @@ Arguments: HashedRelationBroadcastMode(List(input[0, 
string, true]),false), [pla
 (29) BroadcastHashJoin [codegen id : 18]
 Left keys [1]: [item_id#11]
 Right keys [1]: [item_id#22]
+Join type: Inner
 Join condition: None
 
 (30) Project [codegen id : 18]
@@ -210,6 +215,7 @@ Output [2]: [i_item_sk#27, i_item_id#28]
 (35) BroadcastHashJoin [codegen id : 16]
 Left keys [1]: [wr_item_sk#24]
 Right keys [1]: [i_item_sk#27]
+Join type: Inner
 Join condition: None
 
 (36) Project [codegen id : 16]
@@ -222,6 +228,7 @@ Output [1]: [d_date_sk#29]
 (38) BroadcastHashJoin [codegen id : 16]
 Left keys [1]: [wr_returned_date_sk#26]
 Right keys [1]: [d_date_sk#29]
+Join type: Inner
 Join condition: None
 
 (39) Project [codegen id : 16]
@@ -253,6 +260,7 @@ Arguments: HashedRelationBroadcastMode(List(input[0, 
string, true]),false), [pla
 (44) BroadcastHashJoin [codegen id : 18]
 Left keys [1]: [item_id#11]
 Right keys [1]: [item_id#33]
+Join type: Inner
 Join condition: None
 
 (45) Project [codegen id : 18]
@@ -332,6 +340,7 @@ Arguments: HashedRelationBroadcastMode(List(cast(input[0, 
int, true] as bigint))
 (57) BroadcastHashJoin [codegen id : 2]
 Left keys [1]: [d_week_seq#41]
 Right keys [1]: [d_week_seq#43]
+Join type: LeftSemi
 Join condition: None
 
 (58) Project [codegen id : 2]
@@ -345,6 +354,7 @@ Arguments: HashedRelationBroadcastMode(List(input[0, date, 
true]),false), [plan_
 (60) BroadcastHashJoin [codegen id : 3]
 Left keys [1]: [d_date#39]
 Right keys [1]: [d_date#40]
+Join type: LeftSemi
 Join condition: None
 
 (61) Project [codegen id : 3]
diff --git 
a/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.sf100.ansi/explain.txt
 
b/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.sf100.ansi/explain.txt
index e6a65be7ec4..746f0afa87a 100644
--- 
a/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.sf100.ansi/explain.txt
+++ 
b/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.sf100.ansi/explain.txt
@@ -68,6 +68,7 @@ Output [1]: [d_date_sk#5]
 (5) BroadcastHashJoin 

[spark] branch branch-3.3 updated: [MINOR][DOCS] Remove generated statement about Scala version in docs homepage as Spark supports multiple versions

2022-08-02 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 41779ea2612 [MINOR][DOCS] Remove generated statement about Scala 
version in docs homepage as Spark supports multiple versions
41779ea2612 is described below

commit 41779ea26122de6a2f0e70a0398f82841a3f909b
Author: Sean Owen 
AuthorDate: Tue Aug 2 21:18:45 2022 -0500

[MINOR][DOCS] Remove generated statement about Scala version in docs 
homepage as Spark supports multiple versions

### What changes were proposed in this pull request?

Remove this statement from the docs homepage:
"For the Scala API, Spark 3.3.0 uses Scala 2.12. You will need to use a 
compatible Scala version (2.12.x)."

### Why are the changes needed?

It's misleading, as Spark supports 2.12 and 2.13.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #37381 from srowen/RemoveScalaStatement.

Authored-by: Sean Owen 
Signed-off-by: Sean Owen 
(cherry picked from commit 73ef5432547e3e8e9b0cce0913200a94402aeb4c)
Signed-off-by: Sean Owen 
---
 docs/index.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index c6caf31d560..0c3c0273757 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -41,9 +41,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, 
Mac OS), and it sh
 
 Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+.
 Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0.
-For the Scala API, Spark {{site.SPARK_VERSION}}
-uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible 
Scala version
-({{site.SCALA_BINARY_VERSION}}.x).
+When using the Scala API, it is necessary for applications to use the same 
version of Scala that Spark was compiled for.
+For example, when using Scala 2.13, use Spark compiled for 2.13, and compile 
code/applications for Scala 2.13 as well.
 
 For Python 3.9, Arrow optimization and pandas UDFs might not work due to the 
supported Python versions in Apache Arrow. Please refer to the latest [Python 
Compatibility](https://arrow.apache.org/docs/python/install.html#python-compatibility)
 page.
 For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required 
additionally for Apache Arrow library. This prevents 
`java.lang.UnsupportedOperationException: sun.misc.Unsafe or 
java.nio.DirectByteBuffer.(long, int) not available` when Apache Arrow uses 
Netty internally.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [MINOR][DOCS] Remove generated statement about Scala version in docs homepage as Spark supports multiple versions

2022-08-02 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 73ef5432547 [MINOR][DOCS] Remove generated statement about Scala 
version in docs homepage as Spark supports multiple versions
73ef5432547 is described below

commit 73ef5432547e3e8e9b0cce0913200a94402aeb4c
Author: Sean Owen 
AuthorDate: Tue Aug 2 21:18:45 2022 -0500

[MINOR][DOCS] Remove generated statement about Scala version in docs 
homepage as Spark supports multiple versions

### What changes were proposed in this pull request?

Remove this statement from the docs homepage:
"For the Scala API, Spark 3.3.0 uses Scala 2.12. You will need to use a 
compatible Scala version (2.12.x)."

### Why are the changes needed?

It's misleading, as Spark supports 2.12 and 2.13.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #37381 from srowen/RemoveScalaStatement.

Authored-by: Sean Owen 
Signed-off-by: Sean Owen 
---
 docs/index.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 9c5b1e09658..462d3bd79cd 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -42,9 +42,8 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, 
Mac OS), and it sh
 Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+.
 Python 3.7 support is deprecated as of Spark 3.4.0.
 Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0.
-For the Scala API, Spark {{site.SPARK_VERSION}}
-uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible 
Scala version
-({{site.SCALA_BINARY_VERSION}}.x).
+When using the Scala API, it is necessary for applications to use the same 
version of Scala that Spark was compiled for.
+For example, when using Scala 2.13, use Spark compiled for 2.13, and compile 
code/applications for Scala 2.13 as well.
 
 For Python 3.9, Arrow optimization and pandas UDFs might not work due to the 
supported Python versions in Apache Arrow. Please refer to the latest [Python 
Compatibility](https://arrow.apache.org/docs/python/install.html#python-compatibility)
 page.
 For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required 
additionally for Apache Arrow library. This prevents 
`java.lang.UnsupportedOperationException: sun.misc.Unsafe or 
java.nio.DirectByteBuffer.(long, int) not available` when Apache Arrow uses 
Netty internally.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated: [SPARK-39951][SQL] Update Parquet V2 columnar check for nested fields

2022-08-02 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new a0242eabaee [SPARK-39951][SQL] Update Parquet V2 columnar check for 
nested fields
a0242eabaee is described below

commit a0242eabaeef39ec4d74d2bdd0bcac78c71a63e6
Author: Adam Binford 
AuthorDate: Tue Aug 2 16:50:05 2022 -0700

[SPARK-39951][SQL] Update Parquet V2 columnar check for nested fields

### What changes were proposed in this pull request?

Update the `supportColumnarReads` check for Parquet V2 to take into account support for nested fields. Also fixed a typo I saw in one of the tests.

### Why are the changes needed?

Match Parquet V1 in returning columnar batches if nested field 
vectorization is enabled.

### Does this PR introduce _any_ user-facing change?

Parquet V2 scans will return columnar batches with nested fields if the 
config is enabled.

### How was this patch tested?

Added new UTs checking both V1 and V2 return columnar batches for nested 
fields when the config is enabled.

Closes #37379 from Kimahriman/parquet-v2-columnar.

Authored-by: Adam Binford 
Signed-off-by: Chao Sun 
---
 .../datasources/parquet/ParquetFileFormat.scala|  5 +--
 .../v2/parquet/ParquetPartitionReaderFactory.scala |  9 +++--
 .../datasources/parquet/ParquetQuerySuite.scala| 43 ++
 .../parquet/ParquetSchemaPruningSuite.scala|  2 +-
 4 files changed, 51 insertions(+), 8 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
index 44dc145d36e..9765e7c7801 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
@@ -173,9 +173,8 @@ class ParquetFileFormat
*/
   override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
 val conf = sparkSession.sessionState.conf
-conf.parquetVectorizedReaderEnabled && conf.wholeStageEnabled &&
-  ParquetUtils.isBatchReadSupportedForSchema(conf, schema) &&
-!WholeStageCodegenExec.isTooManyFields(conf, schema)
+ParquetUtils.isBatchReadSupportedForSchema(conf, schema) && 
conf.wholeStageEnabled &&
+  !WholeStageCodegenExec.isTooManyFields(conf, schema)
   }
 
   override def vectorTypes(
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
index ea4f5e0d287..c16b762510f 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
@@ -37,12 +37,13 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.catalyst.util.RebaseDateTime.RebaseSpec
 import org.apache.spark.sql.connector.expressions.aggregate.Aggregation
 import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader}
+import org.apache.spark.sql.execution.WholeStageCodegenExec
 import org.apache.spark.sql.execution.datasources.{AggregatePushDownUtils, 
DataSourceUtils, PartitionedFile, RecordReaderIterator}
 import org.apache.spark.sql.execution.datasources.parquet._
 import org.apache.spark.sql.execution.datasources.v2._
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources.Filter
-import org.apache.spark.sql.types.{AtomicType, StructType}
+import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.vectorized.ColumnarBatch
 import org.apache.spark.util.SerializableConfiguration
 
@@ -72,6 +73,8 @@ case class ParquetPartitionReaderFactory(
   private val enableOffHeapColumnVector = sqlConf.offHeapColumnVectorEnabled
   private val enableVectorizedReader: Boolean =
 ParquetUtils.isBatchReadSupportedForSchema(sqlConf, resultSchema)
+  private val supportsColumnar = enableVectorizedReader && 
sqlConf.wholeStageEnabled &&
+!WholeStageCodegenExec.isTooManyFields(sqlConf, resultSchema)
   private val enableRecordFilter: Boolean = sqlConf.parquetRecordFilterEnabled
   private val timestampConversion: Boolean = 
sqlConf.isParquetINT96TimestampConversion
   private val capacity = sqlConf.parquetVectorizedReaderBatchSize
@@ -104,9 +107,7 @@ case class ParquetPartitionReaderFactory(
   }
 
   override def supportColumnarReads(partition: InputPartition): Boolean = {
-

[spark] branch master updated: [SPARK-39951][SQL] Update Parquet V2 columnar check for nested fields

2022-08-02 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b9bc3c79190 [SPARK-39951][SQL] Update Parquet V2 columnar check for 
nested fields
b9bc3c79190 is described below

commit b9bc3c79190fd2fbe91001a96c738a176e3e0e10
Author: Adam Binford 
AuthorDate: Tue Aug 2 16:50:05 2022 -0700

[SPARK-39951][SQL] Update Parquet V2 columnar check for nested fields

### What changes were proposed in this pull request?

Update the `supportColumnarReads` check for Parquet V2 to take into account support for nested fields. Also fixed a typo I saw in one of the tests.

### Why are the changes needed?

Match Parquet V1 in returning columnar batches if nested field 
vectorization is enabled.
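
A configuration sketch (assuming a `SparkSession` named `spark` and a hypothetical Parquet path; not code from the patch):

```scala
// Enable the vectorized reader for nested columns; this is the support the check now honours.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")
// Force the DataSource V2 Parquet scan instead of the V1 file format.
spark.conf.set("spark.sql.sources.useV1SourceList", "")
// With this patch, the V2 scan can also return columnar batches for nested schemas.
val df = spark.read.parquet("/tmp/nested-data")   // hypothetical path
df.selectExpr("struct_col.field").collect()       // hypothetical nested column
```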

### Does this PR introduce _any_ user-facing change?

Parquet V2 scans will return columnar batches with nested fields if the 
config is enabled.

### How was this patch tested?

Added new UTs checking both V1 and V2 return columnar batches for nested 
fields when the config is enabled.

Closes #37379 from Kimahriman/parquet-v2-columnar.

Authored-by: Adam Binford 
Signed-off-by: Chao Sun 
---
 .../datasources/parquet/ParquetFileFormat.scala|  5 +--
 .../v2/parquet/ParquetPartitionReaderFactory.scala |  9 +++--
 .../datasources/parquet/ParquetQuerySuite.scala| 43 ++
 .../parquet/ParquetSchemaPruningSuite.scala|  2 +-
 4 files changed, 51 insertions(+), 8 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
index 3349f335841..513379d23d6 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
@@ -167,9 +167,8 @@ class ParquetFileFormat
*/
   override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
 val conf = sparkSession.sessionState.conf
-conf.parquetVectorizedReaderEnabled && conf.wholeStageEnabled &&
-  ParquetUtils.isBatchReadSupportedForSchema(conf, schema) &&
-!WholeStageCodegenExec.isTooManyFields(conf, schema)
+ParquetUtils.isBatchReadSupportedForSchema(conf, schema) && 
conf.wholeStageEnabled &&
+  !WholeStageCodegenExec.isTooManyFields(conf, schema)
   }
 
   override def vectorTypes(
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
index c9572e474c8..0f6e5201df8 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
@@ -37,12 +37,13 @@ import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.catalyst.util.RebaseDateTime.RebaseSpec
 import org.apache.spark.sql.connector.expressions.aggregate.Aggregation
 import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader}
+import org.apache.spark.sql.execution.WholeStageCodegenExec
 import org.apache.spark.sql.execution.datasources.{AggregatePushDownUtils, 
DataSourceUtils, PartitionedFile, RecordReaderIterator}
 import org.apache.spark.sql.execution.datasources.parquet._
 import org.apache.spark.sql.execution.datasources.v2._
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources.Filter
-import org.apache.spark.sql.types.{AtomicType, StructType}
+import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.vectorized.ColumnarBatch
 import org.apache.spark.util.SerializableConfiguration
 
@@ -72,6 +73,8 @@ case class ParquetPartitionReaderFactory(
   private val enableOffHeapColumnVector = sqlConf.offHeapColumnVectorEnabled
   private val enableVectorizedReader: Boolean =
 ParquetUtils.isBatchReadSupportedForSchema(sqlConf, resultSchema)
+  private val supportsColumnar = enableVectorizedReader && 
sqlConf.wholeStageEnabled &&
+!WholeStageCodegenExec.isTooManyFields(sqlConf, resultSchema)
   private val enableRecordFilter: Boolean = sqlConf.parquetRecordFilterEnabled
   private val timestampConversion: Boolean = 
sqlConf.isParquetINT96TimestampConversion
   private val capacity = sqlConf.parquetVectorizedReaderBatchSize
@@ -104,9 +107,7 @@ case class ParquetPartitionReaderFactory(
   }
 
   override def supportColumnarReads(partition: InputPartition): Boolean = {
-

[spark] branch master updated: [SPARK-39917][SQL] Use different error classes for numeric/interval arithmetic overflow

2022-08-02 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 51e4c2cc55a [SPARK-39917][SQL] Use different error classes for 
numeric/interval arithmetic overflow
51e4c2cc55a is described below

commit 51e4c2cc55aa01f07b28b1cd807b553f8729075d
Author: Gengliang Wang 
AuthorDate: Tue Aug 2 12:59:05 2022 -0700

[SPARK-39917][SQL] Use different error classes for numeric/interval 
arithmetic overflow

### What changes were proposed in this pull request?

Similar to https://github.com/apache/spark/pull/37313: currently, when arithmetic overflow errors happen under ANSI mode, the error messages look like
```
[ARITHMETIC_OVERFLOW] long overflow. Use 'try_multiply' to tolerate 
overflow and return NULL instead. If necessary set spark.sql.ansi.enabled to 
"false"
```

The "(except for ANSI interval type)" part is confusing. We should remove 
it for the numeric arithmetic operations and have a new error class for the 
interval division error: `INTERVAL_ARITHMETIC_OVERFLOW`
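
A sketch of the distinction under ANSI mode (assuming a `SparkSession` named `spark`; the literal values are chosen only to force overflow):

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")

// Numeric overflow: reported with the ARITHMETIC_OVERFLOW error class.
spark.sql("SELECT 9223372036854775807L * 2L").show()

// ANSI interval overflow: now reported with INTERVAL_ARITHMETIC_OVERFLOW.
spark.sql("SELECT INTERVAL '178956970' YEAR + INTERVAL '178956970' YEAR").show()
```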

### Why are the changes needed?

For better error messages

### Does this PR introduce _any_ user-facing change?

Yes, it uses different error classes for numeric and interval arithmetic overflows. After the changes, the error messages are simpler and clearer.

### How was this patch tested?

UT

Closes #37374 from gengliangwang/SPARK-39917.

Authored-by: Gengliang Wang 
Signed-off-by: Gengliang Wang 
---
 core/src/main/resources/error/error-classes.json   |  8 +++-
 .../sql/catalyst/expressions/arithmetic.scala  | 20 +-
 .../catalyst/expressions/intervalExpressions.scala |  3 +-
 .../sql/catalyst/util/IntervalMathUtils.scala  | 46 ++
 .../spark/sql/errors/QueryExecutionErrors.scala| 14 +++
 .../sql-tests/results/ansi/interval.sql.out| 14 +++
 .../resources/sql-tests/results/interval.sql.out   | 14 +++
 .../sql-tests/results/postgreSQL/int4.sql.out  | 12 +++---
 .../sql-tests/results/postgreSQL/int8.sql.out  |  8 ++--
 .../results/postgreSQL/window_part2.sql.out|  4 +-
 .../apache/spark/sql/DataFrameAggregateSuite.scala |  8 ++--
 11 files changed, 110 insertions(+), 41 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index c4b59799f88..ed6dd112e9f 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -7,7 +7,7 @@
   },
   "ARITHMETIC_OVERFLOW" : {
 "message" : [
-  ". If necessary set  to \"false\" (except 
for ANSI interval type) to bypass this error."
+  ". If necessary set  to \"false\" to 
bypass this error."
 ],
 "sqlState" : "22003"
   },
@@ -210,6 +210,12 @@
   ""
 ]
   },
+  "INTERVAL_ARITHMETIC_OVERFLOW" : {
+"message" : [
+  "."
+],
+"sqlState" : "22003"
+  },
   "INTERVAL_DIVIDED_BY_ZERO" : {
 "message" : [
   "Division by zero. Use `try_divide` to tolerate divisor being 0 and 
return NULL instead."
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
index 86e6e6d7323..24ac685eace 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
@@ -25,7 +25,7 @@ import org.apache.spark.sql.catalyst.expressions.codegen._
 import org.apache.spark.sql.catalyst.expressions.codegen.Block._
 import org.apache.spark.sql.catalyst.trees.SQLQueryContext
 import org.apache.spark.sql.catalyst.trees.TreePattern.{BINARY_ARITHMETIC, 
TreePattern, UNARY_POSITIVE}
-import org.apache.spark.sql.catalyst.util.{IntervalUtils, MathUtils, TypeUtils}
+import org.apache.spark.sql.catalyst.util.{IntervalMathUtils, IntervalUtils, 
MathUtils, TypeUtils}
 import org.apache.spark.sql.errors.QueryExecutionErrors
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
@@ -89,7 +89,7 @@ case class UnaryMinus(
   defineCodeGen(ctx, ev, c => s"$iu.$method($c)")
 case _: AnsiIntervalType =>
   nullSafeCodeGen(ctx, ev, eval => {
-val mathUtils = MathUtils.getClass.getCanonicalName.stripSuffix("$")
+val mathUtils = 
IntervalMathUtils.getClass.getCanonicalName.stripSuffix("$")
 s"${ev.value} = $mathUtils.negateExact($eval);"
   })
   }
@@ -98,8 +98,8 @@ case class UnaryMinus(
 case CalendarIntervalType if failOnError =>
   IntervalUtils.negateExact(input.asInstanceOf[CalendarInterval])
 case CalendarIntervalType => 

[spark] branch branch-3.3 updated: [SPARK-39932][SQL] WindowExec should clear the final partition buffer

2022-08-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new fb9f85ed3b2 [SPARK-39932][SQL] WindowExec should clear the final 
partition buffer
fb9f85ed3b2 is described below

commit fb9f85ed3b2391fae3349a34cbda951eee224fd1
Author: ulysses-you 
AuthorDate: Tue Aug 2 18:05:48 2022 +0900

[SPARK-39932][SQL] WindowExec should clear the final partition buffer

### What changes were proposed in this pull request?

Explicitly clear the final partition buffer when no next row can be found in `WindowExec`. The same fix is applied in `WindowInPandasExec`.

### Why are the changes needed?

When we do a repartition after a window, a local sort is needed after the window because of the RoundRobinPartitioning shuffle.

The error stack:
```java
ExternalAppendOnlyUnsafeRowArray INFO - Reached spill threshold of 4096 
rows, switching to 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter

org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 
bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:352)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:435)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:355)
```

`WindowExec` only clears the buffer in `fetchNextPartition`, so the final partition buffer is never cleared.

It is not a big problem since we have a task completion listener:
```scala
taskContext.addTaskCompletionListener(context -> {
  cleanupResources();
});
```

This bug only matters if the window is not the last operator of the task and a following operator, such as a sort, needs memory.
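
A repro-shaped sketch (assuming a `SparkSession` named `spark` and a hypothetical output path; sizes are illustrative, not from the report):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val df = spark.range(0, 1000000L).toDF("id").withColumn("grp", col("id") % 100)
val windowed = df.withColumn(
  "rn", row_number().over(Window.partitionBy("grp").orderBy("id")))
// The round-robin repartition needs a local sort in the shuffle writer; if the final
// window partition buffer is not cleared first, that sort may fail to acquire memory.
windowed.repartition(10)
  .write.mode("overwrite").parquet("/tmp/spark-39932-example") // hypothetical path
```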

### Does this PR introduce _any_ user-facing change?

Yes, it is a bug fix.

### How was this patch tested?

N/A

Closes #37358 from ulysses-you/window.

Authored-by: ulysses-you 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 1fac870126c289a7ec75f45b6b61c93b9a4965d4)
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/execution/python/WindowInPandasExec.scala | 10 --
 .../org/apache/spark/sql/execution/window/WindowExec.scala | 10 --
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
index e73da99786c..ccb1ed92525 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
@@ -332,8 +332,14 @@ case class WindowInPandasExec(
 // Iteration
 var rowIndex = 0
 
-override final def hasNext: Boolean =
-  (bufferIterator != null && bufferIterator.hasNext) || 
nextRowAvailable
+override final def hasNext: Boolean = {
+  val found = (bufferIterator != null && bufferIterator.hasNext) || 
nextRowAvailable
+  if (!found) {
+// clear final partition
+buffer.clear()
+  }
+  found
+}
 
 override final def next(): Iterator[UnsafeRow] = {
   // Load the next partition if we need to.
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
index 33c37e871e3..dc85585b13d 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
@@ -158,8 +158,14 @@ case class WindowExec(
 // Iteration
 var rowIndex = 0
 
-override final def hasNext: Boolean =
-  (bufferIterator != null && bufferIterator.hasNext) || 
nextRowAvailable
+override final def hasNext: Boolean = {
+  val found = (bufferIterator != null && bufferIterator.hasNext) || 
nextRowAvailable
+  if 

[spark] branch branch-3.2 updated: [SPARK-39932][SQL] WindowExec should clear the final partition buffer

2022-08-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new d4f6541d767 [SPARK-39932][SQL] WindowExec should clear the final 
partition buffer
d4f6541d767 is described below

commit d4f6541d767724cb5ee643f3d3fde8d67b2dc1bc
Author: ulysses-you 
AuthorDate: Tue Aug 2 18:05:48 2022 +0900

[SPARK-39932][SQL] WindowExec should clear the final partition buffer

### What changes were proposed in this pull request?

Explicitly clear the final partition buffer when no next row can be found in `WindowExec`. The same fix is applied in `WindowInPandasExec`.

### Why are the changes needed?

When we do a repartition after a window, a local sort is needed after the window because of the RoundRobinPartitioning shuffle.

The error stack:
```java
ExternalAppendOnlyUnsafeRowArray INFO - Reached spill threshold of 4096 
rows, switching to 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter

org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 
bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:352)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:435)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:355)
```

`WindowExec` only clears the buffer in `fetchNextPartition`, so the final partition buffer is never cleared.

It is not a big problem since we have a task completion listener:
```scala
taskContext.addTaskCompletionListener(context -> {
  cleanupResources();
});
```

This bug only matters if the window is not the last operator of the task and a following operator, such as a sort, needs memory.

### Does this PR introduce _any_ user-facing change?

Yes, it is a bug fix.

### How was this patch tested?

N/A

Closes #37358 from ulysses-you/window.

Authored-by: ulysses-you 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 1fac870126c289a7ec75f45b6b61c93b9a4965d4)
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/execution/python/WindowInPandasExec.scala | 10 --
 .../org/apache/spark/sql/execution/window/WindowExec.scala | 10 --
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
index 07c0aab1b6b..1fc4f36498b 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
@@ -354,8 +354,14 @@ case class WindowInPandasExec(
 // Iteration
 var rowIndex = 0
 
-override final def hasNext: Boolean =
-  (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+override final def hasNext: Boolean = {
+  val found = (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+  if (!found) {
+// clear final partition
+buffer.clear()
+  }
+  found
+}
 
 override final def next(): Iterator[UnsafeRow] = {
   // Load the next partition if we need to.
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
index 374659e03a3..cd66efa5487 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
@@ -178,8 +178,14 @@ case class WindowExec(
 // Iteration
 var rowIndex = 0
 
-override final def hasNext: Boolean =
-  (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+override final def hasNext: Boolean = {
+  val found = (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+  if (!found) {
+// clear final partition
+buffer.clear()
+  }
+  found
+}

[spark] branch branch-3.0 updated: [SPARK-39932][SQL] WindowExec should clear the final partition buffer

2022-08-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 2f3e4e36017 [SPARK-39932][SQL] WindowExec should clear the final 
partition buffer
2f3e4e36017 is described below

commit 2f3e4e36017d16d67086fd4ecaf39636a2fb4b7c
Author: ulysses-you 
AuthorDate: Tue Aug 2 18:05:48 2022 +0900

[SPARK-39932][SQL] WindowExec should clear the final partition buffer

### What changes were proposed in this pull request?

Explicitly clear the final partition buffer when no next row can be found in `WindowExec`. The same fix is applied in `WindowInPandasExec`.

### Why are the changes needed?

We do a repartition after a window, so a local sort is needed after the window due to the RoundRobinPartitioning shuffle.

The error stack:
```java
ExternalAppendOnlyUnsafeRowArray INFO - Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter

org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
    at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:352)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:435)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:355)
```

`WindowExec` only clears the buffer in `fetchNextPartition`, so the final partition buffer is never cleared.

It is not a big problem since a task completion listener eventually cleans up the resources:
```scala
taskContext.addTaskCompletionListener[Unit] { _ =>
  cleanupResources()
}
```

This bug only has an effect when the window is not the last operator in the task and the following operator needs memory, e.g. a sort.

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

N/A

Closes #37358 from ulysses-you/window.

Authored-by: ulysses-you 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 1fac870126c289a7ec75f45b6b61c93b9a4965d4)
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/execution/python/WindowInPandasExec.scala | 10 --
 .../org/apache/spark/sql/execution/window/WindowExec.scala | 10 --
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
index f54c4b8f220..e71b846eb7a 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
@@ -353,8 +353,14 @@ case class WindowInPandasExec(
 // Iteration
 var rowIndex = 0
 
-override final def hasNext: Boolean =
-  (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+override final def hasNext: Boolean = {
+  val found = (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+  if (!found) {
+// clear final partition
+buffer.clear()
+  }
+  found
+}
 
 override final def next(): Iterator[UnsafeRow] = {
   // Load the next partition if we need to.
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
index 42fa07f4a6a..1bff26de888 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
@@ -182,8 +182,14 @@ case class WindowExec(
 // Iteration
 var rowIndex = 0
 
-override final def hasNext: Boolean =
-  (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+override final def hasNext: Boolean = {
+  val found = (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+  if (!found) {
+// clear final partition
+buffer.clear()
+  }
+  found
+}

[spark] branch master updated (1fac870126c -> 222cb80227f)

2022-08-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 1fac870126c [SPARK-39932][SQL] WindowExec should clear the final 
partition buffer
 add 222cb80227f [SPARK-39945][BUILD] Upgrade `sbt-mima-plugin` to 1.1.0

No new revisions were added by this update.

Summary of changes:
 project/plugins.sbt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
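
For reference, an sbt plugin bump of this kind is a one-line edit in `project/plugins.sbt`. The plugin coordinates below are recalled from the MiMa plugin's documentation, not taken from this diff:

```scala
// project/plugins.sbt (sketch): bump the binary-compatibility checker.
addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "1.1.0")
```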





[spark] branch master updated (37074c5058f -> 1fac870126c)

2022-08-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 37074c5058f [SPARK-39943][BUILD] Upgrade `rocksdbjni` to 7.4.4
 add 1fac870126c [SPARK-39932][SQL] WindowExec should clear the final 
partition buffer

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/execution/python/WindowInPandasExec.scala | 10 --
 .../org/apache/spark/sql/execution/window/WindowExec.scala | 10 --
 2 files changed, 16 insertions(+), 4 deletions(-)





[spark] branch branch-3.1 updated: [SPARK-39932][SQL] WindowExec should clear the final partition buffer

2022-08-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new ed842ba8e82 [SPARK-39932][SQL] WindowExec should clear the final 
partition buffer
ed842ba8e82 is described below

commit ed842ba8e82e845efc38bd115909ce54faef318a
Author: ulysses-you 
AuthorDate: Tue Aug 2 18:05:48 2022 +0900

[SPARK-39932][SQL] WindowExec should clear the final partition buffer

### What changes were proposed in this pull request?

Explicitly clear the final partition buffer when no next row can be found in `WindowExec`. The same fix is applied in `WindowInPandasExec`.

### Why are the changes needed?

We do a repartition after a window, so a local sort is needed after the window due to the RoundRobinPartitioning shuffle.

The error stack:
```java
ExternalAppendOnlyUnsafeRowArray INFO - Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter

org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0
    at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:352)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:435)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:355)
```

`WindowExec` only clears the buffer in `fetchNextPartition`, so the final partition buffer is never cleared.

It is not a big problem since a task completion listener eventually cleans up the resources:
```scala
taskContext.addTaskCompletionListener[Unit] { _ =>
  cleanupResources()
}
```

This bug only has an effect when the window is not the last operator in the task and the following operator needs memory, e.g. a sort.

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

N/A

Closes #37358 from ulysses-you/window.

Authored-by: ulysses-you 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 1fac870126c289a7ec75f45b6b61c93b9a4965d4)
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/execution/python/WindowInPandasExec.scala | 10 --
 .../org/apache/spark/sql/execution/window/WindowExec.scala | 10 --
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
index 983fe9db738..a87024cfcd7 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
@@ -353,8 +353,14 @@ case class WindowInPandasExec(
 // Iteration
 var rowIndex = 0
 
-override final def hasNext: Boolean =
-  (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+override final def hasNext: Boolean = {
+  val found = (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+  if (!found) {
+// clear final partition
+buffer.clear()
+  }
+  found
+}
 
 override final def next(): Iterator[UnsafeRow] = {
   // Load the next partition if we need to.
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
index 6e0e36cbe59..963decb4cf4 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala
@@ -178,8 +178,14 @@ case class WindowExec(
 // Iteration
 var rowIndex = 0
 
-override final def hasNext: Boolean =
-  (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+override final def hasNext: Boolean = {
+  val found = (bufferIterator != null && bufferIterator.hasNext) || nextRowAvailable
+  if (!found) {
+// clear final partition
+buffer.clear()
+  }
+  found
+}

[spark] branch master updated (83966e87348 -> 37074c5058f)

2022-08-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 83966e87348 [SPARK-39940][SS] Refresh catalog table on streaming query 
with DSv1 sink
 add 37074c5058f [SPARK-39943][BUILD] Upgrade `rocksdbjni` to 7.4.4

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)





[spark] branch master updated (99fc389a0cd -> 83966e87348)

2022-08-02 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 99fc389a0cd [SPARK-39935][SQL][TESTS] Switch `validateParsingError()` 
onto `checkError()`
 add 83966e87348 [SPARK-39940][SS] Refresh catalog table on streaming query 
with DSv1 sink

No new revisions were added by this update.

Summary of changes:
 .../execution/streaming/MicroBatchExecution.scala  |  9 -
 .../streaming/test/DataStreamTableAPISuite.scala   | 40 +-
 2 files changed, 47 insertions(+), 2 deletions(-)





[spark] branch master updated: [SPARK-39935][SQL][TESTS] Switch `validateParsingError()` onto `checkError()`

2022-08-02 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 99fc389a0cd [SPARK-39935][SQL][TESTS] Switch `validateParsingError()` 
onto `checkError()`
99fc389a0cd is described below

commit 99fc389a0cd26056c4ad591c0aec70aac108ebe7
Author: Max Gekk 
AuthorDate: Tue Aug 2 11:56:33 2022 +0500

[SPARK-39935][SQL][TESTS] Switch `validateParsingError()` onto 
`checkError()`

### What changes were proposed in this pull request?
1. Re-implemented `validateParsingError()` using `checkError()`.
2. Removed `checkParsingError()` and replaced it with `checkError()`.

### Why are the changes needed?
1. To prepare the test infra for testing query contexts.
2. To check message parameters instead of the entire text message (see the illustrative call site after this list). This PR is a follow-up of https://github.com/apache/spark/pull/36693 and https://github.com/apache/spark/pull/37322.
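
For illustration, a migrated call site from `QueryParsingErrorsSuite` (see the diff further below) now only pins down the error class, sub-class and SQL state; the rendered message with the SQL text and caret markers is no longer asserted. The suite skeleton and test title here are a sketch, not part of the commit:

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.errors.QueryErrorsSuiteBase

// Sketch of a suite using the re-implemented helper. Message parameters, when
// a test needs them, go into the optional `parameters` map of
// validateParsingError()/checkError().
class ParsingErrorsExampleSuite extends QueryTest with QueryErrorsSuiteBase {
  test("UNSUPPORTED_FEATURE: LATERAL natural join") {
    validateParsingError(
      sqlText = "SELECT * FROM t1 NATURAL JOIN LATERAL (SELECT c1 + c2 AS c2)",
      errorClass = "UNSUPPORTED_FEATURE",
      errorSubClass = Some("LATERAL_NATURAL_JOIN"),
      sqlState = "0A000")
  }
}
```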

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *TruncateTableParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsParserSuite"
$ build/sbt "sql/testOnly *QueryParsingErrorsSuite"
```

Closes #37363 from MaxGekk/checkParsingError-to-checkError.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 .../spark/sql/errors/QueryErrorsSuiteBase.scala|  31 +-
 .../spark/sql/errors/QueryParsingErrorsSuite.scala | 416 +++--
 .../command/ShowPartitionsParserSuite.scala|  10 +-
 .../command/TruncateTableParserSuite.scala |  10 +-
 4 files changed, 67 insertions(+), 400 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryErrorsSuiteBase.scala b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryErrorsSuiteBase.scala
index d03c18882a4..525771f3038 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryErrorsSuiteBase.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryErrorsSuiteBase.scala
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql.errors
 
-import org.apache.spark.{QueryContext, SparkThrowable}
+import org.apache.spark.QueryContext
 import org.apache.spark.sql.catalyst.parser.ParseException
 import org.apache.spark.sql.test.SharedSparkSession
 
@@ -28,28 +28,13 @@ trait QueryErrorsSuiteBase extends SharedSparkSession {
   errorClass: String,
   errorSubClass: Option[String] = None,
   sqlState: String,
-  message: String): Unit = {
-val exception = intercept[ParseException] {
-  sql(sqlText)
-}
-checkParsingError(exception, errorClass, errorSubClass, sqlState, message)
-  }
-
-  def checkParsingError(
-  exception: Exception with SparkThrowable,
-  errorClass: String,
-  errorSubClass: Option[String] = None,
-  sqlState: String,
-  message: String): Unit = {
-val fullErrorClass = if (errorSubClass.isDefined) {
-  errorClass + "." + errorSubClass.get
-} else {
-  errorClass
-}
-assert(exception.getErrorClass === errorClass)
-assert(exception.getErrorSubClass === errorSubClass.orNull)
-assert(exception.getSqlState === sqlState)
-assert(exception.getMessage === s"""\n[$fullErrorClass] """ + message)
+  parameters: Map[String, String] = Map.empty): Unit = {
+checkError(
+  exception = intercept[ParseException](sql(sqlText)),
+  errorClass = errorClass,
+  errorSubClass = errorSubClass,
+  sqlState = Some(sqlState),
+  parameters = parameters)
   }
 
   case class ExpectedContext(
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala
index 6dc40c3eadb..e9379b461ec 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala
@@ -28,14 +28,7 @@ class QueryParsingErrorsSuite extends QueryTest with QueryErrorsSuiteBase {
   sqlText = "SELECT * FROM t1 NATURAL JOIN LATERAL (SELECT c1 + c2 AS c2)",
   errorClass = "UNSUPPORTED_FEATURE",
   errorSubClass = Some("LATERAL_NATURAL_JOIN"),
-  sqlState = "0A000",
-  message =
-"""The feature is not supported: NATURAL join with LATERAL 
correlation.(line 1, pos 14)
-  |
-  |== SQL ==
-  |SELECT * FROM t1 NATURAL JOIN LATERAL (SELECT c1 + c2 AS c2)
-  |--^^^
-  |""".stripMargin)
+  sqlState = "0A000")
   }
 
   test("UNSUPPORTED_FEATURE: LATERAL join with USING join not supported") {
@@ -43,14 +36,7 @@ class QueryParsingErrorsSuite extends QueryTest